Chapter 21: Minimizing Commitment Costs
This chapter closes Part VI (Prover Optimization, Chapters 19-21), which is optional on a first read. The rest of the book does not depend on it. The material here is essential for anyone designing or implementing a fast prover.
This chapter lives at the frontier. The techniques here, some from papers published in 2024 and 2025, represent the current edge of what's known about fast proving. We assume comfort with polynomial commitments (Chapter 9), sum-check (Chapter 3), and the memory checking ideas from Chapter 14. First-time readers may find themselves reaching for earlier chapters often; that's expected. The reward for persisting is a view of how the fastest SNARKs actually work.
Profile any modern SNARK prover and the same pattern appears. The proving algorithm touches each constraint once. The information-theoretic protocol is near-optimal. Yet wall-clock time is dominated by something else entirely: polynomial commitments.
For elliptic curve-based systems, the bottleneck is multi-scalar multiplication (MSM): computing $\sum_{i=1}^{n} a_i \cdot G_i$, where each $a_i$ is a scalar and each $G_i$ is a curve point. A single curve exponentiation costs roughly 3,000 field multiplications. An MSM over $n$ points costs about $n / \log n$ exponentiations. For a polynomial of degree $n$, commitment alone requires on the order of $3{,}000 \cdot n / \log n$ field operations, while the proving algorithm itself, after the linear-time sum-check techniques of Chapter 19, runs in only $O(n)$ field operations. The cryptography dwarfs the algebra. The two surrounding chapters develop the rest of the picture: Chapter 19 establishes why sum-check provers are now fast enough that commitments dominate, and Chapter 20 traces the STARK-side optimization story, where the bottleneck instead concentrates in NTT and hashing because FRI absorbs the commitment cost into the prover pipeline.
This chapter focuses on the elliptic curve setting, where sum-check-based minimization techniques apply most directly.
This observation crystallizes into a design principle: commit to as little as possible. Not zero (some commitment is necessary for succinctness) but the absolute minimum required for soundness.
This chapter develops the techniques that make minimization possible. Together with fast sum-check proving, they form the foundation of the fastest modern SNARKs.
The Two-Stage Paradigm
Every modern SNARK decomposes into two phases. First, the prover commits to the witness, to intermediate values, and to auxiliary polynomials that will help later proofs. Second, the prover runs an interactive argument that demonstrates those committed objects satisfy the required constraints.
Both phases cost time. And here's the trap: more commitment means more proving. Every committed object must later be shown well-formed. If you commit to a polynomial, you'll eventually need to prove something about it: its evaluations, its degree, its relationship to other polynomials. Each such proof compounds the cost.
The obvious extremes are both suboptimal. Commit nothing, and proofs cannot be succinct: the verifier must read the entire witness. Commit everything, and you drown in overhead: each intermediate value requires cryptographic operations and well-formedness proofs.
The art lies in the middle: commit to exactly what enables succinct verification. No more.
Untrusted Advice
Sometimes the sweet spot involves enlarging the witness: adding extra values that the prover must compute alongside the original ones. The witness is what gets committed, so adding a few helper values just makes the same witness polynomial slightly longer. The trade-off can be favorable: the extra values often let the constraint system avoid hard operations entirely.
Consider division. Proving "I correctly computed $q = \lfloor a / b \rfloor$" by directly encoding division as a constraint is expensive, since division is not a native operation in polynomial constraint systems. The constraint system speaks the language of multiplication and addition over a finite field, not Euclidean division.
The workaround is to enlarge the witness with the quotient $q$ and remainder $r$, and then verify the multiplicative identity:
- The prover adds $q$ and $r$ to the witness vector. They are committed as part of the same polynomial(s) that already hold $a$ and $b$, with no separate commitment object.
- The constraint system enforces $a = q \cdot b + r$ and $0 \le r < b$.
Every value lives inside the committed witness polynomial; the verifier never sees any of them in the clear. The constraint is checked the same way every other constraint is: as a polynomial identity opened at a random point via the PCS. The win is that this identity uses only multiplication and a range check, both native, instead of requiring the constraint system to implement division. The prover paid for slightly more witness entries to avoid encoding a hard operation, and the verifier never had to learn what $q$ and $r$ actually are.
This pattern is called untrusted advice: the prover volunteers additional witness data that, if the constraints check out, accelerates the overall proof. The verifier does not trust the advice blindly; the constraints guarantee it is consistent with the original claim.
The trade-off is specific: we pay for a slightly longer witness polynomial (more entries to commit, so a slightly larger MSM) to save on constraint degree. The constraints that check the enlarged witness can be lower-degree than the constraints that would have encoded the hard operation directly. Since high-degree constraints are expensive to prove via sum-check, the exchange often favors a longer witness with simpler constraints.
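The division example above can be simulated in a few lines. This is a minimal sketch of the advice pattern, with hypothetical helper names (`prover_extend_witness`, `constraints_hold`) standing in for a real constraint system:

```python
def prover_extend_witness(a: int, b: int) -> dict:
    """Prover computes quotient and remainder off-circuit and adds them
    to the witness as untrusted advice."""
    q, r = divmod(a, b)
    return {"a": a, "b": b, "q": q, "r": r}

def constraints_hold(w: dict) -> bool:
    """Constraint side: only native operations (multiply, add, range check).
    Division itself is never encoded."""
    multiplicative_identity = w["a"] == w["q"] * w["b"] + w["r"]
    range_check = 0 <= w["r"] < w["b"]
    return multiplicative_identity and range_check

w = prover_extend_witness(17, 5)   # q = 3, r = 2
assert constraints_hold(w)

# A cheating prover who volunteers the wrong quotient fails the constraints:
bad = {"a": 17, "b": 5, "q": 4, "r": 2}
assert not constraints_hold(bad)
```

The constraint side never divides; it only multiplies, adds, and range-checks, which is exactly the exchange the advice pattern buys.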
The pattern generalizes. Any computation with an efficient verification shortcut benefits:
Square roots. To prove $y = \lfloor \sqrt{x} \rfloor$, the prover commits to $y$ and proves $y^2 \le x$ and $x < (y+1)^2$. One multiplication plus a range check, rather than implementing the square root algorithm in constraints.
Sorting. To prove a list is sorted, the prover commits to the sorted output and proves: (1) it's a permutation of the input (via permutation argument), and (2) adjacent elements satisfy $a_i \le a_{i+1}$. Linear comparisons rather than sorting constraints.
Inverses. To prove $y = x^{-1}$, commit to $y$ and check $x \cdot y = 1$. Field inversion (expensive to express directly) becomes a single multiplication.
Exponentiation. To prove $y = g^x$, the prover commits to $y$ and all intermediate values $z_0, z_1, \ldots, z_\ell$ from the square-and-multiply algorithm. Each step satisfies $z_{i+1} = z_i^2 \cdot g$ (if bit $x_i = 1$) or $z_{i+1} = z_i^2$ (if $x_i = 0$). Verifying quadratic constraints is far cheaper than expressing the full exponentiation logic.
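The exponentiation case is the most involved of these, so here is a small sketch of it, assuming a toy prime field and most-significant-bit-first square-and-multiply; `advice_trace` and `constraints_hold` are illustrative names, not a real circuit API:

```python
P = 101  # small prime field for illustration

def advice_trace(g: int, x: int) -> list[int]:
    """Prover runs square-and-multiply and records every intermediate value."""
    bits = [int(b) for b in bin(x)[2:]]   # most significant bit first
    z = [1]
    for b in bits:
        step = (z[-1] * z[-1]) % P        # square
        if b:
            step = (step * g) % P         # conditionally multiply
        z.append(step)
    return z

def constraints_hold(g: int, x: int, y: int, z: list[int]) -> bool:
    """Constraint side: each step is a quadratic relation, plus boundary checks."""
    bits = [int(b) for b in bin(x)[2:]]
    if z[0] != 1 or z[-1] != y:
        return False
    for b, prev, cur in zip(bits, z, z[1:]):
        if cur != (prev * prev * (g if b else 1)) % P:
            return False
    return True

g, x = 3, 10
y = pow(g, x, P)                 # y = 65
assert constraints_hold(g, x, y, advice_trace(g, x))
```

The prover does the exponentiation; the constraints only check one quadratic relation per bit.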
Whenever verifying a result costs less than computing it, the prover should compute and commit while the constraint system only checks. The prover bears the computational burden; the constraint system bears only the verification burden. This division of labor is the essence of succinct proofs, now applied within the proof system itself.
Batch Evaluation Arguments
Suppose the prover has committed to addresses $a_1, \ldots, a_m$ and claimed read results $v_1, \ldots, v_m$, the values the prover claims it received from each lookup. A public function $f$ is known to all. The prover wants to demonstrate:

$$v_i = f(a_i) \quad \text{for all } i = 1, \ldots, m$$

One approach: prove each evaluation separately. That's $m$ independent proofs, linear in the number of evaluations. Can we do better?
Think of $f$ as a memory array indexed by $n$-bit addresses. Each pair $(a_i, v_i)$ is a read operation, "I read value $v_i$ from address $a_i$," and the prover claims all reads are consistent with the memory $f$. (Later in this chapter we will see that this read-only setting is the simpler half of a more general memory checking problem, where the table itself can be updated over time.)
One approach uses lookup arguments (Chapter 14), proving that each pair $(a_i, v_i)$ exists in the table $\{(k, f(k))\}$. But sum-check offers a more direct path that exploits the structure of the problem.
Three Flavors of Batching
Before diving into sum-check, let's map the batching landscape. The term "batching" appears throughout this book, but it means different things in different contexts.
Approach 1: Batching verification equations. The simplest form. Suppose you have $m$ equations to check: $L_1 = R_1, \ldots, L_m = R_m$. Sample a random $\rho$ and check the single combined equation $\sum_i \rho^i L_i = \sum_i \rho^i R_i$. By Schwartz-Zippel, if any original equation fails, the combined equation fails with high probability. This reduces $m$ verification checks to one.
Chapter 2 uses this for Schnorr batch verification. Chapter 13 uses it to combine PLONK's constraint polynomials. Chapter 15 uses it to merge STARK quotients. The pattern is ubiquitous: random linear combination collapses many checks into one.
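The random-linear-combination check can be sketched in a few lines (the modulus and function name are illustrative):

```python
import secrets

P = 2**61 - 1  # a prime modulus standing in for the field

def batched_check(lhs, rhs):
    """Collapse the equations lhs[i] == rhs[i] into one check via a random
    linear combination with powers of rho (sound by Schwartz-Zippel)."""
    rho = secrets.randbelow(P)
    left = sum(pow(rho, i, P) * l for i, l in enumerate(lhs)) % P
    right = sum(pow(rho, i, P) * r for i, r in enumerate(rhs)) % P
    return left == right

assert batched_check([1, 2, 3], [1, 2, 3])      # all equations hold
assert not batched_check([2, 2, 3], [1, 2, 3])  # a failing equation is caught
```

The verifier samples one field element and does one comparison, regardless of how many equations went in.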
Approach 2: Batching PCS openings. Polynomial commitment schemes often support proving multiple evaluations cheaper than proving each separately. KZG's batch opening (Chapter 9) proves $f(z_1) = y_1, \ldots, f(z_k) = y_k$ with a single group element, using the quotient $q(X) = (f(X) - I(X)) / Z(X)$, where $I$ is the interpolant of the claimed evaluations and $Z$ is the vanishing polynomial of the query points. This quotient exists as a polynomial iff every claimed evaluation is correct, so its commitment doubles as the batch proof. Proof size stays constant regardless of $k$. This batching is PCS-specific; other schemes have different mechanisms.
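The divisibility fact behind the batch opening can be checked directly with naive polynomial arithmetic over a toy field. This sketch omits the cryptography entirely (a real scheme commits to the quotient rather than revealing it), and all helper names are illustrative:

```python
P = 97  # small prime field for illustration

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % P
    return out

def poly_add(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return [(x + y) % P for x, y in zip(a, b)]

def poly_sub(a, b):
    return poly_add(a, [(-c) % P for c in b])

def poly_divmod(a, b):
    """Long division over F_P; coefficients listed low-to-high."""
    a = a[:]
    q = [0] * max(1, len(a) - len(b) + 1)
    inv_lead = pow(b[-1], P - 2, P)
    for i in range(len(a) - len(b), -1, -1):
        q[i] = a[i + len(b) - 1] * inv_lead % P
        for j, c in enumerate(b):
            a[i + j] = (a[i + j] - q[i] * c) % P
    return q, a  # a now holds the remainder

def interpolate(points):
    """Lagrange interpolant I with I(z_j) = y_j for each claimed opening."""
    result = [0]
    for i, (xi, yi) in enumerate(points):
        term = [yi % P]
        for j, (xj, _) in enumerate(points):
            if i != j:
                inv = pow((xi - xj) % P, P - 2, P)
                term = poly_mul(term, [(-xj) * inv % P, inv])
        result = poly_add(result, term)
    return result

def quotient_exists(f, claims):
    """(f - I) is divisible by the vanishing polynomial Z of the query
    points iff every claimed evaluation is correct."""
    I = interpolate(claims)
    Z = [1]
    for z, _ in claims:
        Z = poly_mul(Z, [(-z) % P, 1])
    _, rem = poly_divmod(poly_sub(f, I), Z)
    return not any(rem)

f = [5, 2, 0, 1]  # f(X) = X^3 + 2X + 5, so f(3) = 38 and f(4) = 77
assert quotient_exists(f, [(3, 38), (4, 77)])       # correct openings
assert not quotient_exists(f, [(3, 39), (4, 77)])   # one wrong claim
```

A non-zero remainder is exactly the signal that some claimed evaluation is wrong; in the real scheme the pairing check plays the role of the remainder test.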
Approach 3: Batching via domain-level sum-check. This is what this section develops. Rather than batch the claims directly, we restructure the problem as a sum over the domain of $f$. The key equation:

$$\widetilde{v}(r) = \sum_{k \in \{0,1\}^n} \widetilde{M}(k, r) \cdot f(k)$$

Here $M$ is the access matrix defined below and $r$ is a random point. This sum nominally has $2^n$ terms (one per address in the domain), but $M$ is sparse: out of $2^n \cdot m$ possible entries, only $m$ are non-zero, since each access touches exactly one address. Sum-check exploits this sparsity in the access matrix, not in $f$ itself ($f$ can be perfectly dense). At the end of the protocol, the verifier needs a single evaluation of $\widetilde{f}$ at a random point: one PCS opening, not $m$.
Comparing the three approaches
The three approaches batch at different levels, and that is what determines what each one saves. Approaches 1 and 2 operate at the claim level: the prover must still open $f$ at all $m$ points $a_1, \ldots, a_m$. Approach 1 saves verifier work (one check instead of $m$) but does not reduce openings; Approach 2 compresses the proof but still requires the prover to compute all $m$ evaluations internally. Approach 3 batches at the domain level: the $m$ point evaluations collapse into a single random evaluation, and the prover opens $\widetilde{f}$ at exactly one point.
Each approach therefore answers a different question.
Approach 1 (batch verification equations) answers "I have many unrelated checks; can the verifier handle them in one shot?" Use it whenever you have multiple equations to verify, even outside the PCS setting. The combiner is just transcript-level randomness, costing nothing beyond sampling one field element. The prover does the same work either way; only verifier work shrinks. This is what PLONK uses to combine constraint polynomials and what STARK quotient batching uses.
Approach 2 (PCS batch opening) answers "I have one committed polynomial; how do I send many opening proofs in one go?" Use it when $f$ is already committed (typically via KZG) and you need to prove evaluations at multiple points. The win is purely in proof size: one group element instead of $k$. The prover still computes all evaluations internally and does the corresponding MSM work; nothing about $f$'s structure or the access pattern matters.
Approach 3 (sum-check over the domain) answers "I have many evaluations of the same polynomial with structured access; can the prover do less work overall?" Use it when (a) you are proving many evaluations of the same $f$, and (b) the access pattern has structure the sum-check can exploit, in particular the one-hot or tensor-decomposable structure of the access matrix $M$. Crucially, this is structure in how the polynomial is queried, not structure in the polynomial itself. The decisive parameters are $m$ (number of accesses) and $2^n$ (domain size): when $m \ll 2^n$, exploiting the access sparsity is what makes $m$ accesses to a $2^n$-sized table feasible. Without that structure, Approach 3 has nothing to exploit and Approach 2 is simpler.
There is a deeper connection across all three. Evaluating an MLE at a random point is a random linear combination, weighted by the Lagrange basis rather than powers of $\rho$. The sum-check formulation in Approach 3 is random linear combination in MLE clothing, but operating at the domain level unlocks optimizations that claim-level batching (Approaches 1 and 2) cannot reach.
The Sum-Check Approach
Now we develop Approach 3 in detail. Let $\widetilde{f}$ be the multilinear extension of $f$. The access matrix $M$ from the previous section is the $2^n \times m$ Boolean matrix with $M_{k,i} = 1$ iff $a_i = k$, so each column $i$ is one-hot at the row corresponding to address $a_i$.
Example. Suppose $f$ is defined on 2-bit addresses $\{00, 01, 10, 11\}$, and we have $m = 3$ accesses to addresses $a_1 = 01$, $a_2 = 11$, $a_3 = 01$. The access matrix is:

$$M = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

with rows indexed by addresses $00, 01, 10, 11$ and columns by accesses $1, 2, 3$.
Each column encodes "which address did access $i$ hit?" as a one-hot vector: column $i$ equals the basis vector $e_{a_i}$. Here column 1 is $e_{01}$ (since $a_1 = 01$), column 2 is $e_{11}$ (since $a_2 = 11$), and column 3 is $e_{01}$ again (since $a_3 = 01$).
For a single evaluation, we can write:

$$v_i = \sum_{k \in \{0,1\}^n} M_{k,i} \cdot f(k)$$

This looks like overkill. The one-hot structure of column $i$ zeroes out every term except the one at address $a_i$, so the sum trivially collapses to $v_i = f(a_i)$. Why bother?
The heuristic that turns this into a single check is the multilinear extension trick used throughout the book: lift a vector of values defined on the Boolean hypercube into a polynomial on the full field, then evaluate that polynomial at one random point off the hypercube. By Schwartz-Zippel, that one evaluation catches any error in the original vector with overwhelming probability.
Define the "error" at index $i$ as the gap between the claimed output and what the lookup should return:

$$e_i = v_i - \sum_{k \in \{0,1\}^n} M_{k,i} \cdot f(k)$$

There are $m$ such errors, one per access. All evaluations are correct iff $e_i = 0$ for every $i$. Checking $m$ separate equalities defeats the purpose of batching, so we apply the trick. The vector $e = (e_1, \ldots, e_m)$ is defined on the hypercube $\{0,1\}^{\log m}$. Its multilinear extension $\widetilde{e}$ is a polynomial on $\mathbb{F}^{\log m}$, and $\widetilde{e}$ is the zero polynomial iff every $e_i = 0$. The verifier picks a random $r \in \mathbb{F}^{\log m}$ and asks: is $\widetilde{e}(r) = 0$? If all $e_i$ vanish, the answer is yes for any $r$; if any $e_i$ is non-zero, Schwartz-Zippel says the answer is no with overwhelming probability. One evaluation, $m$ checks collapsed.
Substituting the definition of $e_i$ and using the linearity of the MLE construction, the check becomes:

$$\widetilde{v}(r) = \sum_{k \in \{0,1\}^n} \widetilde{M}(k, r) \cdot f(k)$$

If this single identity holds at the random $r$, all $m$ original evaluations are correct with high probability. The $m$ separate access claims have collapsed into one identity summed over the entire domain $\{0,1\}^n$, which sum-check is built to prove.
Sum-check proves this identity. The prover commits to $\widetilde{M}$ and $\widetilde{v}$, then runs sum-check to verify consistency with the public $f$.
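The error-vector trick can be simulated concretely. This sketch (toy field, illustrative names, table values invented for the example) builds the padded error vector for three accesses and evaluates its MLE at a point:

```python
import secrets

P = 2**61 - 1  # a prime modulus standing in for the field

def eq_weight(r, bits):
    """Lagrange basis weight eq(r, x) = prod_j (r_j x_j + (1 - r_j)(1 - x_j))."""
    w = 1
    for rj, xj in zip(r, bits):
        w = w * ((rj * xj + (1 - rj) * (1 - xj)) % P) % P
    return w

def mle_eval(values, r):
    """Evaluate the multilinear extension of `values` at the point r."""
    total = 0
    for idx, v in enumerate(values):
        bits = [(idx >> j) & 1 for j in range(len(r))]
        total = (total + v * eq_weight(r, bits)) % P
    return total

# Invented table over 2-bit addresses, plus three accesses with claimed values.
f = {0b00: 7, 0b01: 11, 0b10: 13, 0b11: 17}
accesses = [(0b01, 11), (0b11, 17), (0b01, 11)]        # all claims correct
errors = [(v - f[a]) % P for a, v in accesses] + [0]   # pad m = 3 up to 4

r = [secrets.randbelow(P) for _ in range(2)]
assert mle_eval(errors, r) == 0   # zero vector -> zero polynomial at any r

# One wrong claim (18 instead of 17) makes the MLE non-zero at this point.
bad = [(v - f[a]) % P for a, v in [(0b01, 11), (0b11, 18), (0b01, 11)]] + [0]
assert mle_eval(bad, [2, 3]) != 0
```

The real protocol never materializes the error vector; sum-check proves that its MLE vanishes at $r$ without the verifier touching all $m$ entries.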
The Sparsity Advantage
The sum nominally ranges over all $2^n$ addresses, potentially enormous (imagine $2^{64}$ for CPU word operations). The reason it stays tractable is the structure of the access matrix. A vector or matrix is one-hot if every column contains exactly one non-zero entry, and that entry equals 1. The access matrix is one-hot by construction: each access touches exactly one address $a_i$, so column $i$ has a 1 at row $a_i$ and zeros everywhere else.
The consequence is dramatic. The matrix has dimensions $2^n \times m$ with $2^n \gg m$, so naively it has $2^n \cdot m$ entries, but only $m$ of them are non-zero. Any sum that appears to range over $2^n \cdot m$ positions actually touches only the $m$ non-zero terms. This is why batch evaluation costs $O(m)$ on the access side rather than $O(2^n \cdot m)$: the one-hot structure makes the exponentially large table effectively linear-sized. When the domain is astronomically large (as in Jolt's instruction lookups), this is the difference between tractable and impossible.
One-hotness handles the access side (only $m$ non-zero terms in $M$) but the sum still nominally folds the dense polynomial $f$ over the full $2^n$-element domain. Naive sum-check over this dense factor still costs $O(2^n)$. The prefix-suffix algorithm from Chapter 19 closes the remaining gap: by splitting the variables into halves and running two chained sum-checks, the dense work shrinks from $O(2^n)$ to $O(2^{n/2})$, and iterating the split yields $O(c \cdot 2^{n/c})$ for any constant $c$. Combined with the one-hot access, the prover runs in $O(m + c \cdot 2^{n/c})$ total. Compared to proving each evaluation separately (which costs $\Omega(m)$ just to state the claims), the batch approach matches the lower bound while providing cryptographic guarantees.
Virtual Polynomials
Start with a toy case. Suppose the prover has committed to multilinear polynomials $g$ and $h$, and the protocol later refers to their product $p = g \cdot h$. Should the prover separately commit to $p$?
No, because $p$ contains no information beyond what is already in $g$ and $h$. Whenever the verifier needs $p$ at a random point $r$, the protocol can ask for $g(r)$ and $h(r)$ instead, then compute $p(r) = g(r) \cdot h(r)$ locally. The polynomial $p$ is virtual: it exists implicitly through the formula $p = g \cdot h$, never committed, never stored. The prover saves one MSM; the verifier loses nothing.
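As a sketch, with stand-in functions playing the role of the already-committed polynomials (all names illustrative):

```python
P = 2**61 - 1  # a prime modulus standing in for the field

# Stand-ins for committed multilinear polynomials; in a real protocol these
# evaluations would come from PCS openings, not local function calls.
def g(r): return (3 * r + 5) % P
def h(r): return (r * r + 1) % P

class VirtualProduct:
    """The product exists only through its formula: evaluating it at r asks
    the sources for their values at r and multiplies locally."""
    def __init__(self, *sources):
        self.sources = sources

    def eval_at(self, r):
        out = 1
        for s in self.sources:
            out = out * s(r) % P   # one opening per source, none for the product
        return out

p = VirtualProduct(g, h)           # never committed, never stored
assert p.eval_at(10) == g(10) * h(10) % P   # 35 * 101 = 3535
```

The object `p` holds no data of its own; it is a formula over its sources, which is exactly what "virtual" means here.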
The general principle behind virtualization is that any polynomial algebraically determined by already-committed polynomials does not need its own commitment. Whenever the verifier needs an evaluation of the virtual polynomial at some point $r$, the protocol reduces that demand to evaluations of the source polynomials at $r$, and the verifier reconstructs the result from the formula. The savings cascade: if a virtual polynomial's sources are themselves virtual, the same trick applies recursively, and only the root polynomials in the dependency graph ever get committed.
This principle is what makes the access matrix tractable. In our batch evaluation, $M$ has $2^n$ rows (one per possible address) and $m$ columns (one per access). For a zkVM with 32-bit addresses, $2^n = 2^{32}$, so the matrix has billions of rows. Committing to it directly is impossible. The escape is to not commit $M$ as a single object: instead, decompose it into smaller pieces that can be committed and treat the full $M$ as virtual. The next subsection develops this decomposition.
Tensor Decomposition
The access matrix $M$ is the natural target for virtualization, but virtualization needs source polynomials to factor through. The trick is that addresses themselves are bit strings, and matching an $n$-bit address means matching every bit. We can therefore factor the address-match into $c$ separate per-chunk matches, each over a much smaller space.
Concretely, an $n$-bit address $k$ splits into $c$ chunks of $n/c$ bits each:

$$k = (k^{(1)}, k^{(2)}, \ldots, k^{(c)})$$

For each chunk index $j$, define a smaller access matrix $M^{(j)}$ where $M^{(j)}_{k', i} = 1$ iff the $j$-th chunk of access $a_i$ equals $k'$. Each $M^{(j)}$ has dimensions $2^{n/c} \times m$, exponentially smaller than the original $2^n \times m$.
The full access happens when every chunk matches, which is exactly the product:

$$M_{k, i} = \prod_{j=1}^{c} M^{(j)}_{k^{(j)}, i}$$

The original $M$ never gets committed. The prover commits only to the small matrices $M^{(1)}, \ldots, M^{(c)}$, and the full $M$ exists virtually through this product formula.
Example. Return to our 2-bit addresses with accesses $a_1 = 01$, $a_2 = 11$, $a_3 = 01$. Split each address into $c = 2$ chunks of 1 bit each: $01 \to (0, 1)$, $11 \to (1, 1)$, $01 \to (0, 1)$.
The chunk matrices are (columns: accesses 1, 2, 3):

$$M^{(1)} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad M^{(2)} = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix}$$

In $M^{(1)}$: row 0 has 1s in columns 1 and 3 because accesses $a_1 = 01$ and $a_3 = 01$ have first bit 0. Row 1 has a 1 in column 2 because $a_2 = 11$ has first bit 1.
In $M^{(2)}$: row 1 has 1s in all columns because all three accesses ($01$, $11$, $01$) have second bit 1.
To recover $M_{01, 1}$: check $M^{(1)}_{0,1} \cdot M^{(2)}_{1,1} = 1 \cdot 1 = 1$. Indeed, access 1 hit address 01. For $M_{10, 1}$: $M^{(1)}_{1,1} \cdot M^{(2)}_{0,1} = 0 \cdot 0 = 0$. Access 1 did not hit address 10.
Instead of one $4 \times 3$ matrix (12 entries), we store two $2 \times 3$ matrices (12 entries total, the same here, but the savings grow with $n$).
The commitment savings are dramatic. Instead of a $2^n \times m$ matrix, the prover commits to $c$ matrices of size $2^{n/c} \times m$ each. For $n = 64$ and $c = 8$: from $2^{64} \cdot m$ entries to $8 \cdot 2^{8} \cdot m \approx 2{,}000 \cdot m$.
The exponential has become polynomial.
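The chunk decomposition is easy to simulate. This sketch (illustrative function names) rebuilds the running example's chunk matrices and recovers entries of the virtual full matrix as products:

```python
def chunk_matrices(addresses, n, c):
    """One small one-hot matrix per chunk (rows: chunk values, cols: accesses).
    Chunk 1 holds the most significant n/c bits."""
    w = n // c
    mats = []
    for j in range(c):
        shift = n - (j + 1) * w
        M_j = [[0] * len(addresses) for _ in range(1 << w)]
        for i, a in enumerate(addresses):
            M_j[(a >> shift) & ((1 << w) - 1)][i] = 1
        mats.append(M_j)
    return mats

def access_entry(mats, k, i, n, c):
    """Entry M[k][i] of the virtual full matrix: a product of chunk entries."""
    w = n // c
    out = 1
    for j, M_j in enumerate(mats):
        shift = n - (j + 1) * w
        out *= M_j[(k >> shift) & ((1 << w) - 1)][i]
    return out

# The running example: accesses to addresses 01, 11, 01 with 1-bit chunks.
mats = chunk_matrices([0b01, 0b11, 0b01], n=2, c=2)
assert mats[0] == [[1, 0, 1], [0, 1, 0]]        # M^(1): first bits 0, 1, 0
assert mats[1] == [[0, 0, 0], [1, 1, 1]]        # M^(2): second bits all 1
assert access_entry(mats, 0b01, 0, 2, 2) == 1   # access 1 hit address 01
assert access_entry(mats, 0b10, 0, 2, 2) == 0   # access 1 did not hit 10
```

Only the small chunk matrices would ever be committed; `access_entry` is the product formula the verifier applies instead of ever seeing the full matrix.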
Virtualizing Everything
Once you see virtualization, you see it everywhere. The product example above is the smallest case; in real systems the same principle applies to entire computation traces. A zkVM executing a million instructions touches several polynomials per instruction: opcode, operands, intermediate values, flags. Naive commitment requires millions of polynomials, each with its own MSM. Virtualization reduces this to perhaps a dozen root polynomials, with everything else derived. The difference is a 30-second proof versus a 3-second proof.
The read values need not exist. Recall the batch evaluation setup: $v$ is the vector of read results, with $v_i$ being the value returned when the prover read address $a_i$ in step $i$. These feel like primary data; in a zkVM they are exactly the values an instruction sees coming out of memory, and the rest of the computation depends on them. Surely they need to be committed?
They do not. The read results are completely determined by the access pattern (which addresses were read) and the table (what each address contains). Concretely:

$$v_i = \sum_{k \in \{0,1\}^n} M_{k,i} \cdot f(k)$$

The right side defines $v$ implicitly from $M$ and $f$. The prover never commits to $v$. When the verifier needs $\widetilde{v}(r)$, sum-check reduces this evaluation to evaluations of $\widetilde{M}$ and $\widetilde{f}$, both of which are already committed (the access matrix) or public (the table). The pattern is the same as $p = g \cdot h$ from earlier, just with a sum instead of a product as the defining formula.
GKR as virtualization. The GKR protocol (Chapter 7) builds an entire verification strategy from this idea. A layered arithmetic circuit computes layer by layer from input to output. The naive approach commits to every layer's values. GKR commits to almost nothing:
Let $\widetilde{W}_i$ denote the multilinear extension of gate values at layer $i$. The layer reduction identity:

$$\widetilde{W}_i(z) = \sum_{x, y} \left[ \widetilde{\mathrm{add}}_i(z, x, y) \cdot \left( \widetilde{W}_{i+1}(x) + \widetilde{W}_{i+1}(y) \right) + \widetilde{\mathrm{mult}}_i(z, x, y) \cdot \widetilde{W}_{i+1}(x) \cdot \widetilde{W}_{i+1}(y) \right]$$

Each layer's values are virtual: defined via sum-check in terms of the previous layer. Iterate from output to input: only $\widetilde{W}_d$ (the input layer) is ever committed. A circuit with 100 layers has 99 virtual layers that exist only as claims passed through sum-check reductions.
More examples. The pattern appears throughout modern SNARKs.
- Constraint polynomials. In Spartan (Chapter 19), the constraint-violation polynomial built from the committed witness and the public constraint matrices is never committed. Sum-check verifies it equals zero on the hypercube by evaluating at random points.
- Grand products. Permutation arguments express the final check as a running product. Each partial product is determined by the previous one and the current term. One starting value plus a recurrence defines everything.
- Folding. In Nova (Chapter 23), the accumulated instance is virtual. Each fold updates a claim about what could be verified (not data sitting in memory).
- Write values from read values. In read-write memory checking, the prover commits to read addresses $ra$, write addresses $wa$, and increments $\Delta$. What about write values? They need not be committed: $wv_t = rv_t + \Delta_t$. The write value at cycle $t$ is the previous value at that address plus the change. Three committed objects define four.
The design principle that emerges from these examples is to ask not "what do I need to store?" but "what can I define implicitly?" Every polynomial expressible as a function of others is a candidate for virtualization. Every value recoverable from a sum-check reduction need never be committed. The fastest provers are the ones that commit least, because computation is cheap but cryptography is expensive.
Sum-checks as a DAG
The design principle above applies to individual polynomials, but virtualization at scale creates a structural picture worth seeing in its own right. When a sum-check ends at a random point and the polynomial it was reasoning about is virtual, the resulting evaluation claim has to be discharged by another sum-check. That second sum-check might itself end with a claim about another virtual polynomial, requiring a third, and so on. The dependencies form a directed acyclic graph (DAG): each sum-check is a node, the output claims it produces are outgoing edges, and the input claims it consumes are incoming edges. Committed polynomials are sources (no incoming edges from other sum-checks); the final opening proof is the sink.
The DAG induces a partial order, and that partial order determines the minimum number of stages the protocol must run in. Two sum-checks can share a stage only if neither depends on the other's output. The longest path in the DAG sets a lower bound on the number of stages: protocols with deep chains of virtualization unavoidably have many sequential rounds. Jolt, which proves RISC-V execution, runs roughly 40 sum-checks organized into 8 stages by this dependency structure.
Within each stage, independent sum-checks can be batched via random linear combination. Sample $\rho$ from the verifier's transcript, form the combined claim $\sum_j \rho^j \cdot C_j$ over the stage's individual claims $C_j$, and run one sum-check on the combined claim. This is the horizontal dimension of optimization: batching within a stage. Stages are the vertical dimension: sequential dependencies that cannot be avoided. The design recipe for a fast prover is to map the full DAG, minimize the number of stages (constrained by the longest path), and batch every independent sum-check within each stage.
A small example illustrates the structure:
```mermaid
graph TD
    Claim["Top-level claim"]
    subgraph Stage1["Stage 1"]
        S1["sum-check A"]
    end
    subgraph Stage2["Stage 2"]
        S2a["sum-check B"]
        S2b["sum-check C"]
    end
    subgraph Stage3["Stage 3"]
        O1["open P₁"]
        O2["open P₂"]
        O3["open P₃"]
    end
    Claim --> S1
    S1 --> S2a
    S1 --> S2b
    S2a --> O1
    S2a --> O2
    S2b --> O2
    S2b --> O3
```
Read top-to-bottom for execution order. Stage 1 runs one sum-check that ends with two residual claims, both about virtual polynomials. Stage 2 discharges those residual claims with two independent sum-checks (B and C), which collapse into a single batched sum-check via random linear combination. Stage 3 discharges the resulting claims with PCS openings on the three committed polynomials, which collapse into a single batched opening.
The vertical axis (stages) is bounded by dependencies: stage 2 cannot start until stage 1 has produced its residual claims, and stage 3 cannot start until stage 2 is done. The horizontal axis within each stage is free, so anything independent collapses via batching. A protocol designer cannot shrink the height (stages) without restructuring the protocol's data dependencies, but they can always shrink the width by batching anything independent.
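The stage assignment itself is just a longest-path computation over the dependency DAG. A sketch, using the example's nodes and illustrative names:

```python
def stage_of(dag):
    """Earliest stage for each node: 1 + the latest stage among the nodes
    it depends on. The longest path sets the total number of stages."""
    memo = {}
    def stage(v):
        if v not in memo:
            memo[v] = 1 + max((stage(d) for d in dag[v]), default=0)
        return memo[v]
    for v in dag:
        stage(v)
    return memo

# Edges point from each node to the nodes whose output claims it consumes.
dag = {
    "A": [],                     # stage 1
    "B": ["A"], "C": ["A"],      # stage 2: independent, so batchable
    "open_P1": ["B"], "open_P2": ["B", "C"], "open_P3": ["C"],  # stage 3
}
stages = stage_of(dag)
assert stages["A"] == 1 and stages["B"] == stages["C"] == 2
assert max(stages.values()) == 3   # the longest path sets the stage count
```

Nodes assigned the same stage are exactly the ones with no dependency between them, so they are the candidates for random-linear-combination batching.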
Time-Varying Functions
So far virtualization has applied to static objects: a derived polynomial $p = g \cdot h$, an access matrix that factors into chunks, a vector of read results determined by addresses and a fixed table. The next test for the principle is a moving target: state that changes over time. This is the third instance of the virtualization theme, now applied to the trickier case where the table being read evolves between accesses.
Batch evaluation proves claims of the form $v_i = f(a_i)$ where $f$ is fixed. Real computation does not work that way. Registers change. Memory gets written. The lookup tables from Chapter 14 assume static data, yet a CPU's registers are anything but static. When a zkVM executes ADD R1, R2, R3, it reads R1 and R2, computes the sum, writes to R3. The next instruction might read R3 and get the new value. The value at R3 depends on when you query it.
The general phenomenon is the time-varying function problem. A function $f$ gets updated at certain steps; a query at time $t$ returns the value held at that moment. The claim "I correctly evaluated $f(a)$" depends on the timing of the evaluation.
Setup and the Naive Cost
Formally, over $m$ time steps the computation performs $m$ operations on a table with $K$ entries. Each operation is either a read (query position $a_t$, receive value $v_t$) or a write (set position $a_t$ to value $v_t$). The prover's job is to demonstrate that every read returns the value from the most recent write to that position.
The naive way to verify this is to commit to a $K \times m$ matrix where entry $(k, t)$ records the value at position $k$ after step $t$. For a zkVM with 32 registers and a million instructions, this is $32 \times 10^6 \approx 2^{25}$ entries: expensive but conceivable. For RAM with $2^{32}$ addresses and a million instructions, this is $2^{32} \times 10^6 \approx 2^{52}$ entries, vastly beyond what any prover could commit. Direct commitment is impossible at zkVM scale.
This is exactly the situation virtualization was built for. The state table is enormous, but it is determined by the write history. We do not need to commit it; we need to commit only the data that uniquely determines it.
The Unified Principle
What lets us virtualize the state table is that read-only and time-varying tables turn out to share the same verification structure. Both answer the question "what value should this read return?" the same way: as a sum over positions, weighted by an access indicator, verified via sum-check. The only difference is whether the table itself is fixed or reconstructed from a write history. Throughout this subsection, $K$ is the table size (number of positions), $m$ is the number of operations, and we use the standard memory-checking notation: $ra$ for read addresses, $rv$ for read values (the same object as $v$ in the batch evaluation section), $wa$ for write addresses, $wv$ for write values. The parallel naming makes the read/write symmetry visible.
Recall the read-only case from the batch evaluation section: the value at position $k$ is just a fixed $f(k)$, the verification equation is $rv_i = \sum_{k} M_{k,i} \cdot f(k)$, and the prover commits to the tensor-decomposed chunks of $M$ while leaving the read values virtual. The function $f$ itself is public or preprocessed; nothing about $f$ needs to be committed at all.
The read-write case has the same verification equation but with one critical change: the table now depends on time. Define $\mathrm{Val}(k, t)$ as "what value is stored at position $k$ just before time $t$?" Then:

$$rv_t = \sum_{k} M_{k,t} \cdot \mathrm{Val}(k, t)$$

The challenge is that $\mathrm{Val}$ is now a $K \times m$ table, far too large to commit. The previous trick (tensor decomposition) does not save us: the time-dependence does not factor through chunking the way an address does. We need a different escape, and virtualization provides it. The state table is determined by the write history, so we can reconstruct it from writes rather than store it. Let $wa_s$ denote the address written to at step $s$, and $\Delta_s$ the value added to that address (zero if step $s$ is a read, non-zero if it is a write). Then:

$$\mathrm{Val}(k, t) = \mathrm{Val}(k, 0) + \sum_{s < t} \mathbf{1}[wa_s = k] \cdot \Delta_s$$
Read this as a walk through history. For each past step $s$, the indicator asks "did we write to address $k$ at step $s$?" If yes, include the increment $\Delta_s$; if no, skip it. The sum picks out exactly the prior writes that targeted address $k$ and adds them to the initial value.
The massive state table dissolves into two sparse objects. The first is a length-$m$ vector of write addresses $wa$. Just like the read addresses $ra$, each entry of $wa$ is an $n$-bit position in the same $K$-sized table, so the same tensor decomposition applies: split each $n$-bit address into $c$ chunks of $n/c$ bits, encode $wa$ as $c$ smaller chunk matrices of size $2^{n/c} \times m$ each, and treat the full write-access matrix as virtual through the product formula. Nothing about the read versus write distinction changes how chunking works; it depends only on addresses being bit strings. The second sparse object is a length-$m$ increment vector $\Delta$, which has no address structure to chunk and gets committed directly. The state table itself is virtual.
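The reconstruction formula can be exercised directly. This sketch (illustrative `val` helper, toy trace) recovers state values from an initial table plus the write history:

```python
def val(k, t, init, writes):
    """Virtual state table: the value at position k just before time t,
    reconstructed from the initial table and the write history."""
    return init.get(k, 0) + sum(
        delta for s, (wa_s, delta) in enumerate(writes) if s < t and wa_s == k
    )

# Toy trace: step s contributes (wa_s, delta_s); delta is 0 when step s reads.
writes = [(0, 5), (1, 7), (0, 0), (0, -2)]

assert val(0, 1, {}, writes) == 5    # after the write at step 0
assert val(0, 3, {}, writes) == 5    # the read at step 2 changed nothing
assert val(0, 4, {}, writes) == 3    # after the increment of -2
assert val(1, 4, {}, writes) == 7
```

Nothing resembling the full state table is ever materialized; every query is answered from the two committed sparse objects, which is the point of the virtualization.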
The committed objects in each case:
| Case | Committed | Virtual |
|---|---|---|
| Read-only | $ra$ chunks | $rv$, table (public) |
| Read-write | $ra$ chunks, $wa$ chunks, $\Delta$ | $rv$, state table |
The read-write prover commits to a few extra objects (write addresses and the increment vector) but never commits the state table. This is what makes time-varying memory tractable.
| Data | Changes? | Technique | Committed | Virtual |
|---|---|---|---|---|
| Instruction tables | No | Read-only | $ra$ chunks | $rv$, table |
| Bytecode | No | Read-only | $ra$ chunks | $rv$, table |
| Registers | Yes | Read-write | $ra$ chunks, $wa$ chunks, $\Delta$ | $rv$, state |
| RAM | Yes | Read-write | $ra$ chunks, $wa$ chunks, $\Delta$ | $rv$, state |
Both techniques use the same sum-check structure. The difference is that read-only tables have a fixed $f$ (public or preprocessed), while read-write tables have a $\mathrm{Val}(k, t)$ that must be virtualized from the write history.
Both paths lead to the same destination, where commitment cost is proportional to the number of operations $m$, not the table size $K$. A table with billions of entries costs no more to access than one with a few hundred.
Why This Matters for Real Systems
In a zkVM proving a million CPU cycles, memory operations dominate the execution trace. Every instruction reads registers, many access RAM, all fetch from bytecode. A RISC-V instruction like lw t0, 0(sp) involves: one bytecode fetch (read-only), one register read for sp (read-write), one memory read (read-write), one register write to t0 (read-write). Four memory operations for one instruction.
If each memory operation required commitment proportional to memory size, proving would be impossible. A million instructions × four operations × $2^{32}$ addresses is on the order of $10^{16}$ commitments. The sun would burn out first.
The techniques above make it tractable. Registers, RAM, and bytecode all reduce to the same pattern: commit to addresses and values (or increments), virtualize everything else. The distinction between "read-only" and "read-write" is simply whether the table is fixed or must be reconstructed from writes.
What emerges is a surprising economy. A zkVM with $2^{32}$ bytes of addressable RAM, 32 registers, and a megabyte of bytecode commits roughly the same amount per cycle regardless of these sizes. The commitment cost tracks operations, not capacity. Memory becomes (in a sense) free. You pay for what you use, not what you could use.
There is a deeper connection worth noting. Circuit wiring (the copy-constraint problem from Chapter 13) is itself a memory access pattern. When the output of one gate feeds into another as an input, the circuit is "reading" a value that was "written" by the first gate. Quotienting-based systems handle this through permutation arguments (grand products over accumulated ratios). In the memory-checking framework developed here, the same constraint reduces to a read-write access pattern over a table of wire values, verified via the same read/write machinery. Chapter 22 develops this parallel explicitly, showing that wiring constraints are where the two PIOP paradigms diverge most sharply in abstraction while converging in purpose.
The Padding Problem and Jagged Commitments
We've virtualized polynomials, memory states, and intermediate circuit layers. But a subtler waste remains: the boundaries between different-sized objects.
This problem emerged when zkVM teams tried to build universal recursion circuits. Recursion (Chapter 23) is the technique of proving that a verifier accepted another proof: the verification procedure is expressed as a circuit, and a proof is generated about that circuit. The dream of universal recursion is one such circuit that can verify any program's proof, regardless of what instructions that program used, so the same recursive infrastructure handles every workload. The reality was that different programs have different instruction mixes, and the verifier circuit seemed to depend on those mixes.
The Problem: Tables of Different Sizes
A zkVM's computation trace comprises multiple tables, one per CPU instruction type. The ADD table holds every addition executed; the MULT table every multiplication; the LOAD table every memory read. These tables have wildly different sizes depending on what the program actually does.
Consider two programs:
- Program A: heavy on arithmetic. 1 million ADDs, 500,000 MULTs, 10,000 LOADs.
- Program B: heavy on memory. 100,000 ADDs, 50,000 MULTs, 800,000 LOADs.
Same total operations, but completely different table shapes. If the verifier circuit depends on these shapes, we need a different circuit for every possible program behavior. That's not universal recursion but combinatorial explosion.
Now we need to commit to all this data. What are our options?
Option 1: Commit to each table separately. Each table becomes its own polynomial commitment. The problem is that verifier cost scales linearly with the number of tables. In a real zkVM with dozens of instruction types and multiple columns per table, verification becomes expensive. Worse, in recursive proving, where we prove that a verifier accepted, each separate commitment adds complexity to the circuit we're proving.
Option 2: Pad everything to the same size. Put all tables in one big matrix, padding shorter tables with zeros until they match the longest. Now we commit once. The problem is that if the longest table has a million rows and the shortest has only a few thousand, we're committing to nearly a million zeros for the short table. Across many tables, the wasted commitments dwarf the actual data.
Neither option is satisfactory. We want the efficiency of a single commitment without paying for empty space.
The Intuition: Stacking Books on a Shelf
Think of each table as a stack of books. The ADD table is a tall stack (many additions). The MULT table is shorter (fewer multiplications). The LOAD table is somewhere in between.
If we arrange them side by side, we get a jagged skyline: different heights and lots of empty space above the shorter stacks. Committing to the whole rectangular region wastes the empty space.
But what if we packed the books differently? Take all the books off the shelf and line them up end-to-end in a single row. The first million books come from ADD, the next 50,000 from MULT, then 200,000 from LOAD. No gaps and no wasted space. The total length equals exactly the number of actual books.
This is the jagged commitment idea, which is to pack different-sized tables into one dense array. We commit to the packed array (cheap and without wasted space) and separately tell the verifier where each table's data begins and ends.
A Concrete Example
Suppose we have three tiny tables:
| Table | Data | Height |
|---|---|---|
| A | [a₀, a₁, a₂] | 3 |
| B | [b₀, b₁] | 2 |
| C | [c₀, c₁, c₂, c₃] | 4 |
If we arranged them as columns in a matrix, padding to height 4:
A B C
0: a₀ b₀ c₀
1: a₁ b₁ c₁
2: a₂ 0 c₂
3: 0 0 c₃
We'd commit to 12 entries, but only 9 contain real data. The three zeros are waste.
Instead, pack them consecutively into a single array:
Index: 0 1 2 3 4 5 6 7 8
Value: a₀ a₁ a₂ b₀ b₁ c₀ c₁ c₂ c₃
Now we commit to exactly 9 values: the real data. We also record the cumulative heights: table A ends at index 3, table B ends at index 5, table C ends at index 9. Given these boundaries, we can recover which table any index belongs to, and its position within that table.
From Intuition to Protocol
Now formalize this. We have $n$ tables (columns), each with its own height $h_j$. Arranged as a matrix, this forms a jagged function $f(i, j)$, where $i$ is the row (ranging up to the maximum height) and $j$ identifies the table. The function satisfies $f(i, j) = 0$ whenever row $i \geq h_j$ (beyond that table's height).
The total number of non-zero entries is $M = \sum_j h_j$. This sum is the trace area, the only quantity that actually matters for proving.
The prover packs all non-zero entries into a single dense array $p$ of length $M$, deterministically: table 0's entries first, then table 1's, and so on. The 2D table with variable-height columns becomes a 1D array that skips the padding zeros entirely. We will call this operation flattening, since the variable-height skyline of the original tables is collapsed into a single flat row.
The cumulative heights $s_j = h_0 + \cdots + h_{j-1}$ track where each column starts in the flattened array. Given a dense index $k$, two functions recover the original coordinates:
- $\mathrm{row}(k)$: the row within the padded table (the offset of $k$ from that column's start)
- $\mathrm{col}(k)$: which column $k$ belongs to (found by comparing $k$ against the cumulative heights)
For example, with heights $(3, 2, 4)$, the cumulative heights are $(0, 3, 5)$ (one entry per column, recording where each column starts in the dense array). The total trace area is $M = 9$, the position just past the last entry. Column 2 therefore occupies the range $[5, 9)$. Index $k = 7$ falls in column 2 (since $7 \geq 5$) at row $7 - 5 = 2$.
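The flattening bookkeeping is small enough to sketch in code. A minimal Python illustration (function and variable names are ours, not from any particular codebase), using a binary search over the cumulative heights:

```python
from bisect import bisect_right
from itertools import accumulate

def flatten(tables):
    """Pack variable-height columns end-to-end into one dense array."""
    dense = [v for col in tables for v in col]
    # starts[j] = dense index where column j begins; the extra final
    # entry is the trace area M (just past the last element).
    starts = [0] + list(accumulate(len(col) for col in tables))
    return dense, starts

def col_of(k, starts):
    """col(k): which column dense index k belongs to."""
    return bisect_right(starts, k) - 1

def row_of(k, starts):
    """row(k): the offset of k from its column's start."""
    return k - starts[col_of(k, starts)]

# The running example from the text: heights (3, 2, 4).
tables = [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2", "c3"]]
dense, starts = flatten(tables)
assert starts == [0, 3, 5, 9]   # cumulative heights, plus M = 9
assert col_of(7, starts) == 2   # index 7 lies in column 2...
assert row_of(7, starts) == 2   # ...at row 2
assert dense[7] == "c2"
```

The dense array holds exactly the nine real values; the cumulative heights are the only extra data needed to invert the packing.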
The prover commits to:
- The dense array $p$ of length $M$ (the trace area), containing all actual values
- The cumulative heights $s_0, \ldots, s_{n-1}$, sent in the clear (just $n$ integers)
The jagged polynomial itself is never committed. It exists only as a relationship between the dense array $p$ and the boundary information.
Making It Checkable
The verifier wants to query the original jagged polynomial and ask, "what is $\tilde{f}(r_{\mathrm{row}}, r_{\mathrm{col}})$?" for a random point $(r_{\mathrm{row}}, r_{\mathrm{col}})$. This asks for a weighted combination of all table entries, with the entry at row $i$ of table $j$ weighted by $\widetilde{\mathrm{eq}}(j, r_{\mathrm{col}}) \cdot \widetilde{\mathrm{eq}}(i, r_{\mathrm{row}})$.
The key equation translates this into a sum over the dense array $p$ of length $M$ (the trace area):
$$\tilde{f}(r_{\mathrm{row}}, r_{\mathrm{col}}) = \sum_{k=0}^{M-1} p(k) \cdot \widetilde{\mathrm{eq}}(\mathrm{col}(k), r_{\mathrm{col}}) \cdot \widetilde{\mathrm{eq}}(\mathrm{row}(k), r_{\mathrm{row}})$$
The two factors are selectors. The first, $\widetilde{\mathrm{eq}}(\mathrm{col}(k), r_{\mathrm{col}})$, picks out entries belonging to the requested table; the second, $\widetilde{\mathrm{eq}}(\mathrm{row}(k), r_{\mathrm{row}})$, picks out entries at the requested row. Their product enforces double selection: a term contributes only when dense index $k$ maps to both the correct row and the correct column.
This is a sum over $M$ terms, and exactly the sum-check form we've used throughout the chapter. The prover runs sum-check; at the end, the verifier needs an evaluation of $p$ at a random point (handled by the underlying PCS) and the selector function evaluated at that point.
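To make the double selection concrete, here is a toy Python check of the identity at Boolean points, with 0/1 indicator functions standing in for the multilinear eq selectors (the real protocol evaluates the selectors at random field points; all names here are illustrative):

```python
heights = [3, 2, 4]
tables = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
starts = [0, 3, 5]                            # cumulative heights
dense = [v for col in tables for v in col]    # the committed array p

def col_of(k):
    return max(j for j, s in enumerate(starts) if k >= s)

def row_of(k):
    return k - starts[col_of(k)]

def jagged(i, j):
    """The never-committed jagged function: zero above a column's height."""
    return tables[j][i] if i < heights[j] else 0

# At every query point (i, j), summing dense entries against the two
# indicator "selectors" reproduces the jagged value -- including the
# padding zeros, to which no dense index ever maps.
for j in range(len(tables)):
    for i in range(max(heights)):
        total = sum(p_k * (col_of(k) == j) * (row_of(k) == i)
                    for k, p_k in enumerate(dense))
        assert total == jagged(i, j)
```

Each dense index contributes to exactly one (row, column) pair, which is why the sum over nine terms recovers a function defined on a twelve-cell grid.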
The selector function (despite involving $\mathrm{row}$ and $\mathrm{col}$) is efficiently computable, since it amounts to comparing an index against the cumulative heights. This comparison can be done by a small read-once branching program (essentially a specialized circuit that checks whether an index falls within a specific range using very few operations). This means its multilinear extension evaluates in $O(\log M)$ field operations.
Remark (Batching selector evaluations). During sum-check, the verifier must evaluate the selector function $S$ at each round's challenge point. With $\log M$ rounds, that's $\log M$ evaluations at $O(\log M)$ cost each, totaling $O(\log^2 M)$. A practical optimization: the prover claims all evaluations upfront, and the verifier batches them via random linear combination. Sample random coefficients $\rho_i$ and check $\sum_i \rho_i \, S(z_i) = \sum_i \rho_i \, v_i$, where the $z_i$ are the challenge points and the $v_i$ are the claimed values. The left side collapses to a single evaluation at a combined point. Cost drops from $O(\log^2 M)$ to $O(\log M)$.
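The soundness of random-linear-combination batching rests on the Schwartz-Zippel idea: if any claimed value is wrong, a random challenge exposes the discrepancy with overwhelming probability. A small Python sketch (the field and names are our illustrative choices; we use powers of a single $\rho$ as the coefficients):

```python
import random

P = 2**61 - 1  # a Mersenne prime, standing in for the protocol's field

def batched_check(true_vals, claimed_vals, rho):
    """Accept iff sum_i rho^i * claimed_i == sum_i rho^i * true_i (mod P)."""
    lhs = sum(pow(rho, i, P) * c for i, c in enumerate(claimed_vals)) % P
    rhs = sum(pow(rho, i, P) * t for i, t in enumerate(true_vals)) % P
    return lhs == rhs

true_vals = [random.randrange(P) for _ in range(8)]

# Honest prover: the batched check passes for every rho.
assert batched_check(true_vals, true_vals, random.randrange(P))

# Cheating prover (one wrong value): the difference is a nonzero degree-7
# polynomial in rho, so a random rho catches it except with probability ~7/P.
bad = list(true_vals)
bad[3] = (bad[3] + 1) % P
assert not batched_check(true_vals, bad, random.randrange(P))
```

One random challenge thus replaces eight individual equality checks, which is exactly the economy the remark describes for selector evaluations.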
The Payoff
The prover performs roughly $5M$ field multiplications, where $M$ is the trace area: five per actual trace element, regardless of how elements are distributed across tables. The constant 5 comes from the sum-check structure: the summand is a product of three multilinear factors ($p$ and the two selectors), giving a degree-3 polynomial in each variable. The halving trick from Chapter 19, applied to a degree-$d$ sum-check over $T$ terms, costs roughly $(d + 2) \cdot T$ field multiplications across all rounds (the $d \cdot T$ for folding each multilinear piece each round, the $2T$ for forming the round polynomial's evaluations). With $d = 3$ and $T = M$, that lands at $5M$. No padding, no wasted commitment, and a constant that does not depend on table count or table heights.
For the verifier, something useful happens. The verification circuit depends only on $\log M$ (the log of total trace area), not on the individual table heights $h_j$. Whether the trace has 100 tables of equal size or 100 tables of wildly varying sizes, the verifier does the same work.
This is the solution to the universal recursion problem from the beginning of this section. When proving proofs of proofs, the verifier circuit becomes the statement being proved. A circuit whose size depends on table configuration creates the combinatorial explosion we feared. But a circuit depending only on total trace area yields one universal recursion circuit.
One circuit to verify any program. The jagged boundaries dissolve into a single integer: total trace size.
The Deeper Point
Each virtualization earlier in this chapter replaced a polynomial with a formula: derived polynomials were computed rather than committed; the tensor decomposition avoided committing the full access matrix; the write-history formula avoided committing the state table. In each case, the thing being virtualized was a value at each point.
Jagged commitments extend the same idea to structure. What gets virtualized is not a polynomial's values but its shape: the staircase of boundaries where each table ends. The prover never commits to the full rectangle-sized jagged polynomial with its zeros above each table's height. Instead it commits to the dense array, whose size is the trace area, and sends the cumulative heights in the clear. The boundary information (which index belongs to which table, at which row) exists only through the formula that compares an index against the heights. The zeros that a padded approach would waste space on were never real data; they were artifacts of forcing variable-height tables into a rectangular grid. Flattening eliminates the grid, and the boundaries become metadata rather than committed content.
This is the chapter's recurring theme pushed to its furthest application: ask not what exists but what can be computed. Values, access patterns, state history, and now shape itself dissolve into formulas over committed roots.
Small-Value Preservation
We've focused on what to commit, but how large the committed values are matters too. Real witness values are usually small: 8-bit bytes, 32-bit words, 64-bit addresses. These fit in a single machine word even though the protocol's field has 256-bit elements. The dominant cost in curve-based commitment, computing $g^a$ via double-and-add, scales with the bit length of the scalar $a$. If $a$ is a 64-bit integer rather than a 256-bit field element, exponentiation takes 64 steps instead of 256, a 4× speedup. For an MSM over a million points, this translates into seconds of saved wall-clock time.
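The bit-length dependence is easy to see in code. A toy Python double-and-add over an instrumented stand-in group (integers under addition play the role of curve points; purely illustrative):

```python
def double_and_add(scalar, point, add, double):
    """Left-to-right double-and-add: one double per bit, one add per set bit."""
    acc = None
    for bit in bin(scalar)[2:]:
        if acc is not None:
            acc = double(acc)
        if bit == "1":
            acc = point if acc is None else add(acc, point)
    return acc

def count_ops(bits):
    """Group operations for a worst-case (all-ones) scalar of the given width."""
    ops = {"n": 0}
    def add(a, b):
        ops["n"] += 1
        return a + b
    def double(a):
        ops["n"] += 1
        return 2 * a
    s = (1 << bits) - 1
    # Sanity check: in the toy group, scalar "multiplication" is just s * point.
    assert double_and_add(s, 7, add, double) == 7 * s
    return ops["n"]

print(count_ops(64), count_ops(256))  # 126 510: the ~4x gap from scalar width
```

The operation count is about twice the bit length, so shrinking scalars from 256 to 64 bits cuts the per-point work by roughly a factor of four, matching the speedup quoted above.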
The optimization follows from keeping values small for as long as possible. Random challenges injected by the verifier are the main source of large field elements. Once a small witness value gets multiplied by a 256-bit challenge, the result is 256 bits and the cheapness is gone. A well-designed protocol postpones this inflation, arranging computations so that the bulk of the prover's work touches values that still fit in machine words. Jolt, Lasso, and related systems (Arun et al., 2024) reported 4× prover speedups simply from tracking value sizes through the protocol, maintaining separate "small" and "large" categories for polynomials, and routing each to the appropriate arithmetic.
The impact compounds everywhere:
- MSM with 64-bit scalars: 4× faster than 256-bit
- Hashing small values has fewer field reductions
- FFT with small inputs gives smaller intermediate values and fewer overflows
- Sum-check products where inputs fit in 64 bits yield products that fit in 128 bits, so no modular reduction is needed
Modern sum-check-based systems make this tracking explicit, maintaining separate "small" and "large" polynomial categories. Small polynomials get optimized 64-bit arithmetic; large polynomials get full field operations. The boundary is carried through the protocol.
The difference between a 10-second prover and a 2-second prover often lies in these details.
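The point about skipped reductions can be illustrated directly. A Python sketch of delayed reduction (Python integers are unbounded, so this only models the idea; the field choice is ours):

```python
import random

P = 2**64 - 2**32 + 1  # the 64-bit "Goldilocks" prime, popular for small-value work

small = [random.randrange(2**32) for _ in range(1000)]  # 32-bit witness values
other = [random.randrange(2**32) for _ in range(1000)]

# Eager: reduce modulo P after every multiply-accumulate.
eager = 0
for a, b in zip(small, other):
    eager = (eager + a * b) % P

# Lazy: each product of 32-bit values fits in 64 bits, so a whole batch can
# be accumulated in wide integers and reduced once at the end.
lazy = sum(a * b for a, b in zip(small, other)) % P

assert eager == lazy  # same result, one reduction instead of a thousand
```

A real prover does this with fixed-width machine arithmetic, choosing accumulation widths so the intermediate sums provably never overflow before the single final reduction.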
Key Takeaways
-
Commitment dominates prover cost in curve-based systems. A single elliptic curve exponentiation costs roughly 3,000 field multiplications; an MSM over $n$ points costs about $n / \log n$ exponentiations. After the linear-time sum-check techniques of Chapter 19, the prover spends more time committing than proving. Every optimization in this chapter reduces what the prover must commit.
-
Enlarge the witness to simplify constraints (untrusted advice). Adding helper values (quotients, square roots, intermediate exponentiation steps) to the witness polynomial makes the commitment slightly larger but lets the constraint system avoid expensive non-native operations. The prover computes; the constraints only verify.
-
Batch many lookups into one evaluation via sum-check. Proving many table reads reduces to a single sum-check instance, exploiting the one-hot sparsity of the access matrix. The access matrix factors via tensor decomposition (long addresses split into short chunks), so the prover commits to small chunked-address polynomials rather than anything table-sized. A gigantic table costs no more to access than a tiny one.
-
Virtualization is the chapter's unifying principle. Any polynomial algebraically determined by already-committed polynomials does not need its own commitment. This applies to derived polynomials, read results (defined by access pattern and table), time-varying state (reconstructed from write history via the increment vector), and GKR layer values (each defined via sum-check in terms of the previous layer). The same principle appears on the STARK side as DEEP-ALI (Chapter 20).
-
Virtualization creates a DAG of sum-checks. Each virtual polynomial requires another sum-check to discharge the evaluation claim. The protocol's structure is a directed acyclic graph: committed polynomials are sources, the final opening proof is the sink. The longest path determines the minimum number of sequential stages; everything independent within a stage collapses via batching.
-
Jagged commitments virtualize structure, not values. Flattening variable-height tables into one dense array avoids committing to padding zeros. The verifier circuit depends only on the total trace area, not on individual table heights, enabling one universal recursion circuit for all programs.
-
Keep values small as long as possible. MSM cost scales with scalar bit-width (one group operation per scalar bit in double-and-add). Witness values are typically 8-64 bits; random challenges are 256 bits. Postponing the inflation from multiplying small values by large challenges keeps the bulk of the prover's work in cheap machine-word arithmetic.
-
Commitment cost tracks operations, not capacity. A zkVM with a vast addressable memory, dozens of instruction types, and millions of cycles commits roughly the same amount per cycle regardless of memory size or instruction mix. Memory is free; only actual computation costs.