a16z: How to achieve secure and efficient zkVMs in stages (a must-read for developers)

Original title: The path to secure and efficient zkVMs: How to track progress

Original author: a16z crypto

Compiled and translated by: Golem, Odaily Planet Daily

zkVMs (zero-knowledge virtual machines) promise to "democratize SNARKs," allowing anyone (even those without specialized SNARK expertise) to prove that they correctly executed an arbitrary program on a given input (or witness). Their core strength is developer experience, but they currently face major challenges in both security and performance. To deliver on the zkVM vision, designers must overcome these challenges. In this article, I outline the likely stages of zkVM development, which will take several years to complete.

The challenges

In terms of security, zkVMs are highly complex software projects that remain full of vulnerabilities. In terms of performance, proving that a program ran correctly can be roughly a million times slower than running it natively, making real-world deployment impractical for most applications today.

Despite these real challenges, most of the blockchain industry portrays zkVMs as ready to deploy today. In fact, some projects are already paying substantial compute costs to generate proofs of on-chain activity. But because zkVMs are still imperfect, this is merely an expensive way of pretending that a system is secured by SNARKs, when in reality it is either secured by permissioning or, worse, exposed to attack.

We are still years away from a secure, high-performance zkVM. This article proposes a series of concrete, staged milestones for tracking zkVM progress - milestones that can cut through the hype and help the community focus on real progress.

Security Phases

A SNARK-based zkVM typically consists of two main components:

· Polynomial Interactive Oracle Proof (PIOP): An interactive proof framework for proving statements about polynomials (or constraints derived from them).

· Polynomial commitment scheme (PCS): ensures that the prover cannot lie about polynomial evaluations without being detected.

zkVMs essentially encode correct execution as constraint systems - broadly meaning that they enforce correct usage of registers and memory by the virtual machine - and then use a SNARK to prove that these constraints are satisfied.
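
To make this division of labor concrete, here is an illustrative Rust sketch of the two components. All trait and type names are hypothetical and do not correspond to any real zkVM's API; real systems work over finite field elements rather than the `u64` stand-ins used here.

```rust
// Illustrative only: hypothetical trait names, with u64 standing in for
// finite field elements.

/// The PCS lets the prover commit to a polynomial and later prove claims
/// about its evaluations without being able to lie undetected.
trait PolynomialCommitmentScheme {
    type Commitment;
    type EvaluationProof;

    fn commit(&self, poly_coeffs: &[u64]) -> Self::Commitment;
    fn open(&self, poly_coeffs: &[u64], point: u64) -> (u64, Self::EvaluationProof);
    fn verify(
        &self,
        commitment: &Self::Commitment,
        point: u64,
        claimed_eval: u64,
        proof: &Self::EvaluationProof,
    ) -> bool;
}

/// The PIOP proves that committed polynomials satisfy the constraint system
/// derived from the VM's semantics (correct register and memory usage).
trait Piop {
    type ConstraintSystem;

    fn prove<P: PolynomialCommitmentScheme>(
        &self,
        constraints: &Self::ConstraintSystem,
        execution_trace: &[u64],
        pcs: &P,
    ) -> Vec<u8>; // the resulting SNARK proof, as opaque bytes
}
```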

The only way to ensure that a system as complex as a zkVM is free of errors is formal verification. Below is a breakdown of the security phases: Phase 1 focuses on a correct protocol, while Phases 2 and 3 focus on correct implementations.

Security Phase 1: Correct Protocol

  1. A formal verification proof of the soundness of the PIOP;

  2. A formal verification proof that the PCS is binding under some cryptographic assumption or idealized model;

  3. If Fiat-Shamir is used, a formal verification proof that the succinct argument obtained by combining the PIOP and the PCS is secure in the random oracle model (augmented with other cryptographic assumptions as needed);

  4. A formal verification proof that the constraint system to which the PIOP is applied is equivalent to the semantics of the VM;

  5. "Gluing" all of these pieces together into a single, formally verified proof that the SNARK is secure for running any program specified by VM bytecode. If the protocol intends to be zero-knowledge, this property must also be formally verified, ensuring that no sensitive information about the witness is leaked.
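
As a rough illustration of what item 1 above is asking for, the following Lean 4 sketch shows only the *shape* of a soundness statement, using hypothetical, highly simplified definitions. A real Phase 1 artifact would be a machine-checked proof of a far more detailed, probabilistic statement about the concrete PIOP and its assumptions.

```lean
-- A minimal, hypothetical sketch: real soundness statements are probabilistic
-- and assumption-laden; this only shows the general shape of the claim.
structure Protocol where
  Statement : Type
  Proof     : Type
  valid     : Statement → Prop          -- semantic truth of a statement
  verify    : Statement → Proof → Bool  -- the verifier's decision procedure

/-- Soundness: no proof can convince the verifier of a false statement. -/
def Sound (P : Protocol) : Prop :=
  ∀ (s : P.Statement) (π : P.Proof), P.verify s π = true → P.valid s
```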

A note on recursion: if the zkVM uses recursion, every PIOP, commitment scheme, and constraint system that appears anywhere in that recursion must be verified before this phase can be considered complete.

Security Phase 2: Correct Verifier Implementation

Formal verification that the actual implementation of the zkVM verifier (in Rust, Solidity, etc.) matches the protocol verified in Phase 1. Achieving this ensures that the implemented protocol is sound (rather than only a paper design or an inefficient specification written in, say, Lean).

There are two reasons Phase 2 focuses only on the verifier implementation (rather than the prover). First, a correct verifier is sufficient for soundness (i.e., it guarantees that the verifier cannot be convinced that a false statement is true). Second, a zkVM verifier implementation is an order of magnitude simpler than the prover implementation.

Security Phase 3: Correct Prover Implementation

The actual implementation of the zkVM prover is formally verified to correctly generate proofs for the proof system verified in Phases 1 and 2. This ensures completeness, meaning that a system using the zkVM can never get "stuck" with a true statement it cannot prove. If the prover intends to be zero-knowledge, this property must also be formally verified.

Estimated Schedule

· Phase 1 progress: we can expect incremental achievements over the next year (for example, ZKLib). But no zkVM will fully meet the requirements of Phase 1 for at least two years;

· Phases 2 and 3: these can advance in parallel with some aspects of Phase 1. For example, some teams have already shown that a Plonk verifier implementation matches the protocol in the paper (even though the protocol itself may not be fully verified). Nevertheless, I do not expect any zkVM to reach Phase 3 in less than four years - and it may take even longer.

Key points to note: Fiat-Shamir security and verified bytecode

One major complication is the set of unresolved research questions surrounding the security of the Fiat-Shamir transformation. All three phases treat Fiat-Shamir and random oracles as part of their airtight security, but in reality the entire paradigm may harbor vulnerabilities, owing to the gap between the idealized random oracle and the hash function actually used. In the worst case, a system that has reached Phase 2 could later be found completely insecure because of a Fiat-Shamir issue. This warrants serious concern and ongoing research. We may need to modify the transformation itself to better guard against such vulnerabilities.
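
For readers unfamiliar with the transformation, here is a minimal Rust sketch of the idea: verifier challenges are derived by hashing the transcript rather than coming from a live verifier. It is illustrative only, using SHA-256 (via the `sha2` crate) as the stand-in for the random oracle; real implementations must absorb the full transcript, including the statement itself, since "weak" Fiat-Shamir variants that omit it have led to practical attacks.

```rust
// A minimal sketch of the Fiat-Shamir transform (illustrative only).
use sha2::{Digest, Sha256};

struct FiatShamirTranscript {
    hasher: Sha256,
}

impl FiatShamirTranscript {
    /// Bind the statement being proven into the transcript up front
    /// (omitting it is the classic "weak Fiat-Shamir" mistake).
    fn new(statement: &[u8]) -> Self {
        let mut hasher = Sha256::new();
        hasher.update(statement);
        Self { hasher }
    }

    /// Absorb a prover message, e.g. a polynomial commitment.
    fn absorb(&mut self, prover_message: &[u8]) {
        self.hasher.update(prover_message);
    }

    /// Derive the next "verifier challenge" from everything absorbed so far.
    fn challenge(&mut self) -> [u8; 32] {
        let digest: [u8; 32] = self.hasher.clone().finalize().into();
        // Re-absorb the challenge so that later challenges depend on it.
        self.hasher.update(digest);
        digest
    }
}
```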

A system without recursion is theoretically more robust because certain known attacks involve circuits similar to those used in recursive proofs.

Another caveat: proving that a computer program (specified by bytecode) ran correctly is of limited value if the bytecode itself is flawed. Therefore, the usefulness of zkVMs depends heavily on methods for generating formally verified bytecode - a huge challenge that is beyond the scope of this article.

On security in the post-quantum era

Quantum computers will not pose a serious threat for at least the next five years (probably longer), whereas bugs are an existential risk today. So the current focus should be on meeting the security and performance stages discussed in this article. If non-quantum-secure SNARKs allow us to meet those requirements faster, we should use them until post-quantum SNARKs catch up, or until there is serious concern that cryptographically relevant quantum computers are imminent.

The current performance of zkVMs

Today, zkVM provers impose an overhead factor of close to one million times native execution. If a program takes X cycles to run, proving it ran correctly costs roughly X times one million CPU cycles. That was true a year ago, and it remains true today.

Popular narratives typically describe this expense in a way that sounds acceptable. For example:

· "The annual cost of generating proofs for the Ethereum mainnet is less than one million US dollars for all the year."

·"We can almost use a cluster composed of dozens of GPUs to generate Ethereum block proofs in real time."

· "Our latest zkVM is 1000 times faster than its predecessor."

Although these statements are technically accurate, they may be misleading without the proper context. For example:

· Being 1,000 times faster than the old version still leaves the absolute speed very slow. This says more about how bad things were than about how good they are.

· There are proposals to increase the computational throughput of Ethereum mainnet by 10x, which would make today's zkVM performance look even slower relative to what is needed.

· "Almost real-time" proving of Ethereum blocks is still far slower than what many blockchain applications require (for example, Optimism has a 2-second block time, much faster than Ethereum's 12-second block time).

· Running dozens of GPUs continuously without any of them failing does not provide an acceptable liveness guarantee.

· Spending less than $1 million per year to prove all Ethereum mainnet activity only sounds cheap until you recall that executing those same computations costs a full Ethereum node only about $25 per year.

For applications outside blockchains, this overhead is clearly too high. No amount of parallelization or engineering can compensate for such a huge expense. We should treat a zkVM slowdown of no more than 100,000x relative to native execution as a baseline benchmark - and even that is only a first step. Real mainstream adoption will likely require overheads closer to 10,000x or lower.

How to Measure Performance

SNARK performance has three main components:

· The inherent efficiency of the underlying proof system.

· Application-specific optimizations (e.g., precompiles).

· Engineering and hardware acceleration (such as GPU, FPGA, or multi-core CPU).

While the latter two are crucial for actual deployment, they apply to almost any proof system, so they do not necessarily reflect its intrinsic overhead. For example, adding GPU acceleration and precompiles to a zkEVM can easily yield a 50x speedup over a pure CPU-based approach without precompiles - enough to make an intrinsically less efficient system look better than one that simply hasn't received the same level of optimization.

Therefore, the discussion below focuses on SNARK performance without dedicated hardware and without precompiles. This differs from current benchmarking practice, which typically rolls all three factors into a single "headline number" - the equivalent of judging a diamond by its polishing time rather than its intrinsic clarity. Our goal is to isolate the intrinsic overhead of the general-purpose proof system, helping the community cut through confounding factors and focus on real progress in proof system design.
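
As a sketch of the kind of apples-to-apples measurement this implies - single-threaded, no precompiles, no special hardware - here is how a benchmark harness might compute the overhead factor. `run_native` and `prove_single_threaded` are hypothetical stand-ins for executing a guest program natively and running a zkVM prover on it.

```rust
// Compute the prover overhead factor under the conditions argued for above:
// same program, single thread, no precompiles, no GPUs/FPGAs.
use std::time::Instant;

fn overhead_factor(run_native: impl Fn(), prove_single_threaded: impl Fn()) -> f64 {
    let t0 = Instant::now();
    run_native();
    let native_secs = t0.elapsed().as_secs_f64();

    let t1 = Instant::now();
    prove_single_threaded();
    let proving_secs = t1.elapsed().as_secs_f64();

    // Per the article, this ratio is currently on the order of 1,000,000.
    proving_secs / native_secs
}
```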

Performance Stages

Here are five performance milestones. First, prover overhead on a CPU needs to come down by several orders of magnitude. Only then should the focus shift to further reductions through hardware. Memory usage also needs to come down.

In none of the stages below should developers need to customize their code for the zkVM in order to reach the required performance. Developer experience is zkVMs' main advantage; sacrificing DevEx to hit a performance benchmark defeats the very purpose of zkVMs.

These metrics focus on prover costs. However, if verifier costs are left unbounded (i.e., there is no cap on proof size or verification time), any prover metric can be met trivially. Therefore, for a system to qualify for a given stage, maximum values must be specified for both proof size and verification time.

Performance Requirements

Stage 1 requirement - "reasonable, non-trivial verification costs":

· Proof size: the proof must be smaller than the witness.

· Verification time: verifying the proof must be no slower than running the program natively (i.e., executing the computation without proving correctness).

These are minimal succinctness requirements. They ensure that the proof size and verification time are no worse than simply sending the witness to the verifier and letting it check correctness directly.

Stage 2 and later requirements:

· Maximum proof size: 256 KB.

· Maximum verification time: 16 milliseconds.

These thresholds are deliberately generous to accommodate novel fast-prover techniques that may come with higher verification costs. At the same time, they exclude proofs so expensive that few projects would be willing to include them on a blockchain.
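
As a compact summary of the verifier-side requirements above, here is a hedged Rust sketch of how a benchmark harness might encode them. The struct and function names are illustrative, not taken from any real tool.

```rust
// Verifier-cost thresholds from this section, encoded as simple checks.
struct VerifierCosts {
    proof_size_bytes: u64,
    witness_size_bytes: u64,
    verify_time_ms: f64,
    native_run_time_ms: f64,
}

/// Stage 1: proof smaller than the witness, verification no slower than
/// running the program natively.
fn meets_stage1(c: &VerifierCosts) -> bool {
    c.proof_size_bytes < c.witness_size_bytes && c.verify_time_ms <= c.native_run_time_ms
}

/// Stage 2 and later: hard caps of 256 KB proof size and 16 ms verification time.
fn meets_stage2_caps(c: &VerifierCosts) -> bool {
    c.proof_size_bytes <= 256 * 1024 && c.verify_time_ms <= 16.0
}
```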

Speed Stage 1

Single-threaded proving must be at most 100,000 times slower than native execution, measured across a range of applications (not just proving Ethereum blocks), and without relying on precompiles.

Concretely, imagine a RISC-V program running at roughly 3 billion cycles per second on a modern laptop. Achieving Stage 1 means you can prove roughly 30,000 RISC-V cycles per second on that same laptop (single-threaded), while keeping verification costs "reasonable and non-trivial" as defined earlier.
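
The back-of-the-envelope arithmetic behind that number, as a small Rust snippet (the 3 GHz figure is simply the modern-laptop assumption stated above):

```rust
// Speed Stage 1 target, worked out: ~3 GHz native execution divided by a
// 100,000x proving overhead gives roughly 30,000 proved cycles per second.
fn main() {
    let native_cycles_per_sec: f64 = 3.0e9; // single modern laptop core
    let stage1_max_overhead: f64 = 100_000.0;
    let proved_cycles_per_sec = native_cycles_per_sec / stage1_max_overhead;
    println!("~{proved_cycles_per_sec} RISC-V cycles proved per second"); // ~30000
}
```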

Speed Stage 2

Single-threaded proving must be at most 10,000 times slower than native execution.

Alternatively, because some promising SNARK approaches (especially those based on binary fields) are handicapped by today's CPUs and GPUs, you can qualify for this stage using FPGAs (or even ASICs) by comparing:

· the number of FPGAs needed to emulate RISC-V cores at native speed; and

· the number of FPGAs needed to emulate and prove RISC-V execution in (nearly) real time.

If the latter is at most 10,000 times the former, you qualify for Stage 2. On a standard CPU, the proof must still be at most 256 KB and the verification time at most 16 milliseconds.

Speed Stage 3

In addition to achieving Stage 2 speed, a proving overhead of less than 1,000x must be achieved (across a wide range of applications) using only automatically synthesized and formally verified precompiles. Essentially, a custom instruction set is dynamically tailored to each program to speed up proving - but it must be easy to use and formally verified.

Memory Stage 1

Speed Stage 1 is achieved with the prover using less than 2 GB of memory (while also achieving zero knowledge).

This is crucial for mobile devices and browsers, and it unlocks a wide range of client-side zkVM use cases. Client-side proving matters because our phones are our constant connection to the real world: they track our location, credentials, and so on. If generating a proof requires more than 1-2 GB of memory, it is simply too much for most mobile devices today. Two clarifications:

· The 2 GB space limit applies to large statements (ones that take tens of billions of CPU cycles to run natively). Proof systems that achieve this space bound only for small statements lack broad applicability.

· If the prover is very slow, keeping its memory under 2 GB is easy. So, to make Memory Stage 1 non-trivial, I require that Speed Stage 1 be met within the 2 GB space limit.

Memory Stage 2

Speed Stage 1 is achieved with memory usage below 200 MB (a 10x improvement over Memory Stage 1).

Why push below 2 GB? Consider a non-blockchain example: every time you visit a website over HTTPS, you download a certificate used for authentication and encryption. Instead, the website could send zk-proofs involving such certificates. A large website might issue millions of such proofs per second. If each proof requires 2 GB of prover memory to generate, the aggregate RAM requirement is on the petabyte scale. Further reducing memory usage is crucial for non-blockchain deployments.
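
The rough arithmetic behind that claim, under the hypothetical assumption that each proof takes on the order of a second to generate (so millions of provers run concurrently):

```rust
// Back-of-the-envelope: millions of concurrently running provers, each
// holding 2 GB, add up to petabytes of RAM (1 PB = 1,000,000 GB).
fn main() {
    let concurrent_provers: f64 = 1.0e6; // hypothetical large website's load
    let gb_per_prover: f64 = 2.0;        // the Memory Stage 1 bound
    let total_pb = concurrent_provers * gb_per_prover / 1.0e6;
    println!("~{total_pb} PB of RAM in aggregate"); // ~2 PB
}
```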

Precompiles: last mile or crutch?

In zkVM design, a precompile is a specialized SNARK (or constraint system) tailored to a specific function, such as Keccak/SHA hashing or the elliptic curve group operations underlying digital signatures. In Ethereum (where most of the heavy lifting is Merkle hashing and signature checks), a few hand-crafted precompiles can reduce prover overhead. But relying on them as a crutch does not get SNARKs where they need to be. Here is why:

· Still too slow for most applications (on-chain and off): even with precompiles for hashing and signatures, current zkVMs remain too slow because the core proof system is inefficient.

· Security vulnerabilities: hand-written precompiles that have not been formally verified are almost certainly riddled with bugs, any of which could lead to catastrophic security failures.

· Poor developer experience: in most zkVMs today, adding a new precompile means hand-writing a constraint system for each function - essentially a return to a 1960s-style workflow. Even with existing precompiles, developers must refactor their code to call each one. We should optimize for security and developer experience, rather than sacrificing both for incremental performance gains that only show performance is not yet where it needs to be.

· I/O overhead and no RAM access: while precompiles can speed up cryptography-heavy tasks, they may offer no meaningful speedup for more diverse workloads, because they incur significant overhead passing inputs and outputs and cannot use RAM. Even in a blockchain context, as soon as you go beyond a single L1 like Ethereum (say, to build a series of cross-chain bridges), you face different hash functions and signature schemes. Hand-rolling new precompiles for each of these is neither scalable nor safe.

For all of these reasons, our top priority should be making the underlying zkVM more efficient. The best techniques for zkVMs will also yield the best precompiles. I do believe precompiles will remain crucial in the long run, provided they are automatically synthesized and formally verified. That way we preserve zkVMs' developer experience advantage while avoiding catastrophic security risks. This view is reflected in Speed Stage 3.

Estimated Schedule

I expect a handful of zkVMs to reach Speed Stage 1 and Memory Stage 1 later this year. I think we can also reach Speed Stage 2 within the next two years, though it is currently unclear whether we can do so without some new ideas that have not yet emerged. I expect the remaining stages (Speed Stage 3 and Memory Stage 2) to take several years.

Summary

Although I have laid out zkVM security and performance phases separately in this article, the two are not fully independent. As more vulnerabilities are discovered in zkVMs, I expect that some of them can only be fixed at a significant cost in performance. Ideally, performance comparisons should therefore be deferred until a zkVM reaches Security Phase 2.

zkVMs promise to bring zero-knowledge proofs to the masses, but they are still in their early days - full of security challenges and enormous performance overheads. Hype and marketing make real progress hard to assess. By laying out clear security and performance milestones, I hope to provide a roadmap that cuts through the noise. We will get there, but it will take time and sustained effort.
