# Permutation Generation

## Overview

One component of lookup generation in Halo2 involves constructing permutations of columns of the trace table, which can be computationally demanding. Different versions of Halo2 follow different methods for this construction. One method involves creating permutations A' and S' of columns A and S, such that (A'[i] - A'[i+1])*(S'[i] - A'[i]) = 0.

## GPU Implementation

The GPU algorithm used in the cuSnark library to efficiently construct these permutations is as follows:

1) Apply a hash function H to each element of A to determine an index into a pair of hash tables storing the total count of that element's value, and a corresponding index in A where that value exists.

2) Perform exclusive prefix scan on the count hash table to determine the starting index in A' for the grouping of each unique value in A

3) Insert each element of A into A', using the hash function & tables, and the result of the prefix scan above, to determine the correct index in A'. For the first occurrence of a given value in A', also set the corresponding value in S', and a boolean "set" flag at the same index in a flag array indicating that index of S' has been set.

4) For each element of S, use the hash tables to look up if that element has been inserted into S' already. If so, set the corresponding element of a boolean "used" flag array to true.

5) Perform inclusive prefix scans to sum the elements of the "set" and "used" flag arrays

6) Use the summed set array to create a mapping between [index in a list of leftover elements in S] and [index in S'].

7) For each element of S which has not yet been used in S', use the summed used array to determine the element's index in the leftover list, and map that leftover index to an index in S'. Then, write the leftover element into S'.

## Performance

As mentioned above, not all versions of Halo2 use this method for constructing permutations, so this acceleration is not used in the benchmark proofs referenced by earlier sections. In another Halo2 version which does use this construction, on a proof with 2^24 rows and 1 lookup operation, the CPU lookup generation takes** 39.32 seconds**, while the GPU accelerated version (using a single RTX 4090 card) takes **3.94 seconds**.

Last updated