About the Leaderboard

The Verus Proof Synthesis Leaderboard tracks progress in using Large Language Models (LLMs) to automatically generate formal correctness proofs for Rust programs verified with Verus.

Our goal is to provide standardized benchmarks and transparent comparisons to accelerate research in LLM-based formal verification.

📊 Benchmarks

Verus-Bench Algorithm-Level

Classic algorithmic verification tasks including sorting, searching, array manipulation, and mathematical proofs.

150

Tasks

Sources

~30

Avg LoC

~10

Avg Proof LoC

Sources: CloverBench, MBPP, Diffy, Misc

VeruSAGE-Bench Repository-Level

Complex verification tasks extracted from real-world verified systems including OS kernels, distributed systems, and storage systems.

849

Tasks

Projects

~950

Avg LoC

~50

Avg Proof LoC

Projects: Anvil, IronKV, NRKernel, ATMO, and more

VeruSAGE-Bench Source Projects

Anvil (AL)

Distributed Systems

Anvil Advanced (AC)

Distributed Systems

IronKV (IR)

Key-Value Store

Memory Allocator (MA)

Systems

Node Replication (NO)

Distributed Systems

NRKernel (NR)

OS Kernel

ATMO (OS)

Microkernel

Storage (ST)

Storage Systems

Vest (VE)

Serialization

🔬 Proof Synthesis Systems

AutoVerus

AutoVerus is a three-phase proof synthesis system for algorithm-level verification:

Inference: Generate initial proof candidates using few-shot examples
Refinement: Refine promising candidates with targeted improvements
Repair: Debug and fix remaining verification errors

📄 Paper: arXiv:2409.13082 (OOPSLA 2025)

VeruSAGE

VeruSAGE is an agentic framework for repository-level verification that uses an observation-reasoning-action loop with specialized tools for navigating large codebases.

Observation: Analyze verification errors and code context
Reasoning: Plan proof strategies using LLM capabilities
Action: Generate and refine proofs iteratively

📄 Paper: arXiv:2512.18436

📚 What is Verus?

Verus is a tool for verifying the correctness of Rust programs. It uses SMT-based verification to prove that code satisfies its specifications, which are written as pre/post-conditions and invariants in Rust syntax.

Unlike testing, which can only show the presence of bugs, verification can mathematically prove their absence. Verus has been used to verify production systems including IronKV and components of Azure infrastructure.

📖 Citation

If you use our benchmarks or systems in your research, please cite:

@article{autoverus, title={AutoVerus: Automated Proof Generation for Rust Code}, author={Chenyuan Yang and Xuheng Li and Md Rakib Hossain Misu and Jianan Yao and Weidong Cui and Yeyun Gong and Chris Hawblitzel and Shuvendu K. Lahiri and Jacob R. Lorch and Shuai Lu and Fan Yang and Ziqiao Zhou and Shan Lu}, journal={Proceedings of the ACM on Programming Languages}, volume={9}, number={OOPSLA2}, year={2025}, publisher={ACM} }

@misc{verusage, title={VeruSAGE: A Study of Agent-Based Verification for Rust Systems}, author={Chenyuan Yang and Natalie Neamtu and Chris Hawblitzel and Jacob R. Lorch and Shan Lu}, year={2025}, eprint={2512.18436}, archivePrefix={arXiv}, primaryClass={cs.OS} }

🔗 Related Links

📧 Contact

For questions, suggestions, or collaborations, please open an issue on our GitHub repository or contact the authors directly.