📤 Submit Your Results
We welcome submissions from the research community! Follow the guidelines below to add your proof synthesis system to the leaderboard.
🎯 Benchmarks
You can submit results for either or both benchmarks:
- Verus-Bench — 150 algorithm-level verification tasks (sorting, searching, etc.)
- VeruSAGE-Bench — 849 repository-level verification tasks from 8 real-world systems
📦 Getting the Benchmarks
Both benchmarks are available in the `benchmarks/` directory of our repository. Each benchmark directory contains a `tasks.jsonl` file for programmatic access.
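Each line of `tasks.jsonl` is a standalone JSON object, so the file can be streamed line by line. Here is a minimal loading sketch in Python (the benchmark sub-directory names below are illustrative assumptions; check the repository for the actual layout):

```python
import json
from pathlib import Path

def load_tasks(path):
    """Yield one task dict per line of a tasks.jsonl file."""
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

# Example: count the tasks in each benchmark.
# NOTE: the sub-directory names are assumptions; adjust them to your checkout.
for bench in ["benchmarks/verus-bench/tasks.jsonl",
              "benchmarks/verusage-bench/tasks.jsonl"]:
    try:
        print(bench, sum(1 for _ in load_tasks(bench)))
    except FileNotFoundError:
        print(bench, "not found")
```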
📋 Submission Process
To submit, open a pull request that adds your results as a `leaderboard/data/*.json` file. Include a brief description of your approach in the pull request.
📄 Submission Schema
Your submission should follow this JSON format:
```json
{
  "submission_id": "your-system-model-version",
  "system_name": "Your System Name",
  "model": "LLM Model Used",
  "date": "YYYY-MM-DD",
  "results": {
    "solved": 135,
    "total": 150,
    "percent_solved": 90.0,
    "avg_time_seconds": 28.5,
    "avg_cost_usd": 0.25
  },
  "breakdown": [
    {"category": "CloverBench", "solved": 11, "total": 11},
    {"category": "MBPP", "solved": 72, "total": 78}
  ],
  "paper_url": "https://arxiv.org/abs/...",
  "code_url": "https://github.com/...",
  "verified": false,
  "notes": "Brief description of your approach"
}
```
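One easy way to avoid formatting mistakes is to generate the file programmatically rather than by hand. A minimal sketch in Python (all values are placeholders, and the `<submission_id>.json` filename convention is an assumption, not a repository requirement):

```python
import json
from pathlib import Path

# Placeholder values; replace with your actual results.
submission = {
    "submission_id": "mysystem-gpt4-v1.0",
    "system_name": "MySystem",
    "model": "GPT-4o",
    "date": "2025-01-15",
    "results": {"solved": 135, "total": 150, "percent_solved": 90.0},
    "notes": "Brief description of your approach",
}

# Assumed filename convention: <submission_id>.json under leaderboard/data/.
out = Path("leaderboard/data") / f"{submission['submission_id']}.json"
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(submission, indent=2) + "\n")
print(f"wrote {out}")
```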
Required Fields
- `submission_id` — Unique identifier (e.g., "mysystem-gpt4-v1.0")
- `system_name` — Name of your proof synthesis system
- `model` — LLM model used (e.g., "GPT-4o", "Claude-3.5-Sonnet")
- `date` — Submission date in YYYY-MM-DD format
- `results.solved` — Number of tasks solved
- `results.total` — Total number of tasks attempted
- `results.percent_solved` — Percentage solved (solved / total × 100)
Optional Fields
- `results.avg_time_seconds` — Average time per task (Verus-Bench)
- `results.avg_time_minutes` — Average time per task (VeruSAGE-Bench)
- `results.avg_cost_usd` — Average cost per task in USD
- `breakdown` — Per-source (Verus-Bench) or per-project (VeruSAGE-Bench) breakdown
- `paper_url` — Link to associated paper
- `code_url` — Link to code repository
- `notes` — Additional information about the submission
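Before opening a pull request, it may help to check that the required fields are present and that `percent_solved` matches solved / total × 100. A hypothetical checker script (not part of the repository):

```python
import json
import sys

REQUIRED = ["submission_id", "system_name", "model", "date"]
REQUIRED_RESULTS = ["solved", "total", "percent_solved"]

def check(path):
    with open(path) as f:
        sub = json.load(f)
    # Collect any missing top-level or results.* required fields.
    missing = [k for k in REQUIRED if k not in sub]
    missing += [f"results.{k}" for k in REQUIRED_RESULTS
                if k not in sub.get("results", {})]
    if missing:
        sys.exit(f"{path}: missing required fields: {missing}")
    # Verify percent_solved is consistent with solved/total.
    r = sub["results"]
    expected = r["solved"] / r["total"] * 100
    if abs(r["percent_solved"] - expected) > 0.05:
        sys.exit(f"{path}: percent_solved is {r['percent_solved']}, "
                 f"expected about {expected:.1f}")
    print(f"{path}: OK")

if __name__ == "__main__":
    check(sys.argv[1])
```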
⚠️ Rules & Guidelines
- No Cheating: Submissions that use trivial workarounds (e.g., `assume(false)`, `#[verifier::external_body]`) to fake verification will be rejected. We may run spot-checks on submitted solutions.
- Use Standard Verus: Use the recommended Verus version specified in the benchmark README
- Report Honestly: Self-reported results should be accurate and reproducible
- Provide Code: Submissions with public code/papers are prioritized for verification
- One Entry per Configuration: Submit separate entries for different model/system combinations
✅ Verification Levels
Submissions are labeled with verification status:
- Verified — Results independently reproduced by maintainers
- Reported — Self-reported, format validated
To expedite verification, please provide detailed reproduction instructions and consider making your evaluation scripts publicly available.
📧 Contact
For questions about submissions, open an issue on our GitHub repository or email the maintainers.