# Quotient Polynomial Evaluation

## Overview

The quotient polynomial evaluation stage of Halo2, denoted as "Eval_h" in the table below, is the most computationally expensive stage of the Halo2 proof generation process, taking between 1/3 and 2/3 of the total proof runtime, depending on the specific proof. This stage consists of applying a large graph of simple arithmetic operations (addition, subtraction, multiplication, etc) to a set of input polynomials (fixed, advice, instance, etc) in the extended coset domain, combining them into a single output polynomial H.

## GPU Implementation

The individual operations (addition, subtraction, multiplication, etc) that comprise the quotient polynomial evaluation are trivially implemented on the GPU, where each arithmetic operation is applied in parallel to the corresponding coefficients of the polynomials used as inputs to the operation. However, two difficulties arise: a) some method is required for transpiling the evaluation graph into GPU code; the graphs are big enough such that doing so manually is intractable and explicitly writing every operation into a kernel can lead to hours-long NVCC compilation of the kernel, and b) a large proof will have far more input and intermediate polynomials than can be stored in device memory at once. The cuSnark library addresses these difficulties by using a single kernel which processes the evaluation graph as a list of opcodes, taking as an argument a set of input polynomials, storing intermediate computations in a buffer, and writing any output values to a set of output polynomials. A python script is used to break up the evaluation graph into smaller opcode lists that comprise sequential launches of this generic kernel, such that the number of input/output arrays needed by any individual kernel launch do not exceed available device memory.

## Performance

The following table outlines the performance improvements yielded with just the cuSnark polynomial evaluation employed, showing the (CPU baseline) and accelerated results for the different proof stages in a set of proofs of various sizes. Times are in seconds, obtained on a AMD EPYC 7702 64-Core Processor with 4x NVIDIA GeForce RTX 3090 (24 GB) GPUs.

Last updated