
Running benchmarks

`srbench benchmark` runs a single benchmark with explicit flags. It's the right entry point when you want to evaluate one model on one dataset.

For multi-model or multi-condition sweeps, use `srbench experiment` instead.

Basic usage

```bash
srbench benchmark <name> --data <path> --model <model> [options]
```

Available benchmarks:

| Name | Scenario |
| --- | --- |
| `calendar` | Assistant + requestor schedule a meeting while protecting calendar privacy |
| `marketplace` | Buyer + seller negotiate price while each hides a reservation price |

Calendar scheduling

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --limit 2
```

| Flag | Description |
| --- | --- |
| `--assistant-model` | Override model for the calendar assistant |
| `--requestor-model` | Override model for the meeting requestor |
| `--assistant-reasoning-effort` | Reasoning effort for the assistant |
| `--requestor-reasoning-effort` | Reasoning effort for the requestor |
| `--assistant-explicit-cot` | true/false — chain-of-thought for the assistant |
| `--requestor-explicit-cot` | true/false — chain-of-thought for the requestor |
| `--expose-preferences` | true/false — share scheduling preferences (default true) |
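
For example, you can pin each agent to a different model and give only the assistant explicit chain-of-thought; the requestor model name and task limit below are illustrative:

```bash
# The assistant falls back to the default --model; the requestor is overridden separately
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --requestor-model gpt-4.1-mini \
    --assistant-explicit-cot true \
    --limit 2
```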

Marketplace

```bash
srbench benchmark marketplace \
    --data data/marketplace/small.yaml \
    --model gpt-4.1 \
    --limit 2
```

| Flag | Description |
| --- | --- |
| `--buyer-model` | Override model for the buyer |
| `--seller-model` | Override model for the seller |
| `--buyer-reasoning-effort` | Reasoning effort for the buyer |
| `--seller-reasoning-effort` | Reasoning effort for the seller |
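
As a sketch, the two sides can be given different reasoning budgets (assuming the usual low/medium/high effort values; the values below are illustrative):

```bash
srbench benchmark marketplace \
    --data data/marketplace/small.yaml \
    --model gpt-4.1 \
    --buyer-reasoning-effort high \
    --seller-reasoning-effort low \
    --limit 2
```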

Common flags

These flags work on every benchmark.

Data and limits

| Flag | Description | Default |
| --- | --- | --- |
| `--data` | YAML file or directory of task data | (required) |
| `--limit` | Maximum number of tasks to run | (all) |
| `--max-rounds` | Maximum conversation rounds per task | 20 |
| `--max-steps-per-turn` | Maximum tool calls per agent turn | varies |
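
For a quick smoke test, cap both the number of tasks and the conversation length (the values below are illustrative):

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --limit 1 \
    --max-rounds 10
```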

Model

| Flag | Description |
| --- | --- |
| `--model` | Default model for all agents (required unless per-agent overrides set) |
| `--reasoning-effort` | Default reasoning effort for all agents |
| `--explicit-cot` | true/false — explicit chain-of-thought |
| `--base-url` | Override base URL (for self-hosted endpoints) |
| `--api-version` | Override API version |
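
For a self-hosted endpoint, point `--base-url` at your server; the URL and model name below are placeholders, not defaults shipped with srbench:

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model my-local-model \
    --base-url http://localhost:8000/v1
```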

Judge

The judge is the LLM-as-a-judge model that scores each task on the four evaluation dimensions described below.

| Flag | Description |
| --- | --- |
| `--judge-model` | Model for evaluation (defaults to `--model`) |
| `--judge-votes` | Majority-vote count |
| `--judge-reasoning-effort` | Reasoning effort for the judge |
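
For example, to score runs with a separate judge and majority voting (the judge model and vote count below are illustrative):

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --judge-model o3 \
    --judge-votes 3
```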

System prompt

The `--system-prompt` flag controls what social-reasoning guidance the assistant agent receives. Use it to baseline a model with no guidance, or to test specific defenses.

| Preset | What it adds |
| --- | --- |
| `none` | No guidance (default) |
| `privacy` | Protect private information; share only the minimum necessary |
| `dd_info_gathering` | Verify information and consult sources before acting |
| `dd_advocacy` | Push back and persist on the user's behalf |
| `oo` | Maximize the user's outcome |
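
For example, to compare an undefended baseline against the privacy defense, run the same data twice with different presets:

```bash
# Baseline: no social-reasoning guidance
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --system-prompt none

# Privacy defense
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --system-prompt privacy
```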

Adversarial injection

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --attack-types privacy due_diligence
```

`--attack-types` injects hand-crafted adversarial prompts at runtime. Available types: `privacy`, `outcome_optimality`, `due_diligence`. Multiple types can be combined.

For pre-generated whimsical attacks, point `--data` at the whimsical dataset instead:

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small-whimsical-privacy.yaml \
    --model gpt-4.1
```

Concurrency

| Flag | Description |
| --- | --- |
| `--batch-size` | Number of tasks executed in parallel |
| `--task-concurrency` | Max concurrent LLM calls per task |
| `--llm-concurrency` | Max total concurrent LLM calls per provider |
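
A throughput-tuned run might look like the sketch below; the numbers are illustrative and depend on your provider's rate limits:

```bash
srbench benchmark marketplace \
    --data data/marketplace/small.yaml \
    --model gpt-4.1 \
    --batch-size 8 \
    --llm-concurrency 16
```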

Output and logging

| Flag | Description | Default |
| --- | --- | --- |
| `--output-dir` | Output directory | `outputs/` |
| `--logger` | `verbose`, `progress`, or `quiet` | `progress` |
| `--log-level` | `debug`, `info`, `warning`, `error` | `warning` |
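
For example, a debugging run that writes to its own directory with verbose per-task output:

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --output-dir outputs/debug-run \
    --logger verbose \
    --log-level debug
```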

Resume and re-run

Runs checkpoint their progress automatically; if a run is interrupted, re-running it with the same flags resumes from the checkpoint.

| Flag | Description |
| --- | --- |
| `--resume` | Explicitly resume from a checkpoint path |
| `--restart-exec` | Discard execution checkpoint and re-run executions |
| `--restart-eval` | Keep executions, re-run evaluation only |
| `--finalize` | Convert an existing `checkpoint.json` to `results.json` without running anything |
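
For example, after changing judge settings you can keep the completed executions and re-score them by re-running with the same flags plus `--restart-eval` (the vote count below is illustrative):

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --judge-votes 3 \
    --restart-eval
```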

Output

Each run produces a directory under `--output-dir`:

```
outputs/{benchmark}_{variant}_{model}_{timestamp}/
  results.json        # Final results — config, evaluation, per-task results
  checkpoint.json     # In-progress checkpoint (removed on completion)
```

`results.json` contains:

| Field | Description |
| --- | --- |
| `config` | Full run configuration |
| `timestamp` | ISO 8601 start time |
| `elapsed_seconds` | Total wall time |
| `evaluation` | Aggregate metrics: task completion, leakage rate, duty of care, due diligence |
| `results` | Per-task evaluation results |
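
If you have jq installed, you can spot-check a finished run from the shell using the fields above; the run directory name below is illustrative:

```bash
# Aggregate metrics for the run
jq '.evaluation' outputs/calendar_gpt-4.1_20260507T120000/results.json

# Start time and total wall time
jq '.timestamp, .elapsed_seconds' outputs/calendar_gpt-4.1_20260507T120000/results.json
```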

Evaluation dimensions

Every task is evaluated on four dimensions:

| Dimension | Question it answers |
| --- | --- |
| Task completion | Did the agent achieve the primary objective? |
| Privacy preservation | Did the agent protect confidential information? |
| Outcome optimality | Did the agent maximize the user's preferences? |
| Due diligence | Did the agent verify information before acting? |

Dashboard

Open the Results page pre-loaded with your run by passing one or more paths:

```bash
srbench dashboard outputs/calendar_gpt-4.1_20260507T120000/
srbench dashboard outputs/calendar_gpt-4.1_*/ outputs/calendar_claude-sonnet-4-5_*/
```

Each path can be a `results.json` file or a directory; directories are scanned recursively.
