📤 Submit Your Results
We welcome submissions from the research community! Follow the guidelines below to add your proof synthesis system to the leaderboard.
🎯 Benchmarks
You can submit results for either or both benchmarks:
- Verus-Bench — 150 algorithm-level verification tasks (sorting, searching, etc.)
- VeruSAGE-Bench — 849 repository-level verification tasks from 8 real-world systems
📦 Getting the Benchmarks
Both benchmarks are available in the `benchmarks/` directory of our repository. Each benchmark directory contains a `tasks.jsonl` file for programmatic access.
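Each line of `tasks.jsonl` is a standalone JSON object, so the file can be streamed line by line. Here is a minimal loading sketch in Python (the benchmark sub-directory names below are illustrative assumptions; check the repository for the actual layout):

```python
import json
from pathlib import Path

def load_tasks(path):
    """Yield one task dict per line of a tasks.jsonl file."""
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

# Example: count the tasks in each benchmark.
# NOTE: the sub-directory names are assumptions; adjust them to your checkout.
for bench in ["benchmarks/verus-bench/tasks.jsonl",
              "benchmarks/verusage-bench/tasks.jsonl"]:
    try:
        print(bench, sum(1 for _ in load_tasks(bench)))
    except FileNotFoundError:
        print(bench, "not found")
```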
📋 Submission Process
To submit, open a pull request that adds your results as a `leaderboard/data/*.json` file. Include a brief description of your approach in the pull request.
📄 Submission Schema
Your submission should follow this JSON format:
```json
{
  "submission_id": "your-system-model-version",
  "system_name": "Your System Name",
  "model": "LLM Model Used",
  "date": "YYYY-MM-DD",
  "results": {
    "solved": 135,
    "total": 150,
    "percent_solved": 90.0,
    "avg_time_seconds": 28.5,
    "avg_cost_usd": 0.25
  },
  "breakdown": [
    {"category": "CloverBench", "solved": 11, "total": 11},
    {"category": "MBPP", "solved": 72, "total": 78}
  ],
  "paper_url": "https://arxiv.org/abs/...",
  "code_url": "https://github.com/...",
  "verified": false,
  "notes": "Brief description of your approach"
}
```
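One easy way to avoid formatting mistakes is to generate the file programmatically rather than by hand. A minimal sketch in Python (all values are placeholders, and the `<submission_id>.json` filename convention is an assumption, not a repository requirement):

```python
import json
from pathlib import Path

# Placeholder values; replace with your actual results.
submission = {
    "submission_id": "mysystem-gpt4-v1.0",
    "system_name": "MySystem",
    "model": "GPT-4o",
    "date": "2025-01-15",
    "results": {"solved": 135, "total": 150, "percent_solved": 90.0},
    "notes": "Brief description of your approach",
}

# Assumed filename convention: <submission_id>.json under leaderboard/data/.
out = Path("leaderboard/data") / f"{submission['submission_id']}.json"
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(submission, indent=2) + "\n")
print(f"wrote {out}")
```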
Required Fields
- `submission_id` — Unique identifier (e.g., "mysystem-gpt4-v1.0")
- `system_name` — Name of your proof synthesis system
- `model` — LLM model used (e.g., "GPT-4o", "Claude-3.5-Sonnet")
- `date` — Submission date in YYYY-MM-DD format
- `results.solved` — Number of tasks solved
- `results.total` — Total number of tasks attempted
- `results.percent_solved` — Percentage solved (solved / total × 100)
Optional Fields
- `results.avg_time_seconds` — Average time per task (Verus-Bench)
- `results.avg_time_minutes` — Average time per task (VeruSAGE-Bench)
- `results.avg_cost_usd` — Average cost per task in USD
- `breakdown` — Per-source (Verus-Bench) or per-project (VeruSAGE-Bench) breakdown
- `paper_url` — Link to associated paper
- `code_url` — Link to code repository
- `notes` — Additional information about the submission
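Before opening a pull request, it may help to check that the required fields are present and that `percent_solved` matches solved / total × 100. A hypothetical checker script (not part of the repository):

```python
import json
import sys

REQUIRED = ["submission_id", "system_name", "model", "date"]
REQUIRED_RESULTS = ["solved", "total", "percent_solved"]

def check(path):
    with open(path) as f:
        sub = json.load(f)
    # Collect any missing top-level or results.* required fields.
    missing = [k for k in REQUIRED if k not in sub]
    missing += [f"results.{k}" for k in REQUIRED_RESULTS
                if k not in sub.get("results", {})]
    if missing:
        sys.exit(f"{path}: missing required fields: {missing}")
    # Verify percent_solved is consistent with solved/total.
    r = sub["results"]
    expected = r["solved"] / r["total"] * 100
    if abs(r["percent_solved"] - expected) > 0.05:
        sys.exit(f"{path}: percent_solved is {r['percent_solved']}, "
                 f"expected about {expected:.1f}")
    print(f"{path}: OK")

if __name__ == "__main__":
    check(sys.argv[1])
```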
⚠️ Rules & Guidelines
- No Cheating: Submissions that use trivial workarounds (e.g., `assume(false)`, `#[verifier::external_body]`) to fake verification will be rejected. We may run spot-checks on submitted solutions.
- Use Standard Verus: Use the recommended Verus version specified in the benchmark README
- Report Honestly: Self-reported results should be accurate and reproducible
- Provide Code: Submissions with public code/papers are prioritized for verification
- One Entry per Configuration: Submit separate entries for different model/system combinations
✅ Verification Levels
Submissions are labeled with verification status:
- Verified — Results independently reproduced by maintainers
- Reported — Self-reported, format validated
To expedite verification, please provide detailed reproduction instructions and consider making your evaluation scripts publicly available.
📧 Contact
For questions about submissions, open an issue on our GitHub repository or email the maintainers.