
# Generating data

The repo ships with ready-to-run datasets in data/. Use srbench datagen only when you want to generate fresh tasks or new adversarial variants.

```bash
srbench datagen {calendar, marketplace, malicious} [options]
```

## Calendar scheduling

Generates synthetic calendar tasks with LLM-generated companies, employees, and 1-hour-slot calendars. Privacy labels on each event are determined by majority vote across labeling models.

```bash
srbench datagen calendar \
    --model gpt-4.1 \
    --output-dir data/calendar-scheduling/
```

### Pipeline

  1. Companies — generates companies with departments and backstories.
  2. Employees — generates employees per company with roles, relationships, and personal facts.
  3. Calendars — generates one-hour events (08:00-19:00) per employee.
  4. Preferences — assigns morning/afternoon preferences.
  5. Tasks — generates a meeting request for each (employee × archetype) pair, with privacy labels chosen by majority vote across --judge-models.
  6. Assembly — assigns tasks to fullness levels (free-slot counts), places meetings at suboptimal preference times, and ensures at least one mutually-free overlapping slot.
  7. Verification — checks all invariants.
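
Step 5's majority vote is simple to picture. A minimal sketch, assuming string privacy labels (the function and label names are illustrative, not the benchmark's actual API):

```python
from collections import Counter

def majority_label(votes: list[str]) -> str:
    """Return the privacy label most judges agreed on.
    Ties resolve to the label that first reached the top count."""
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical judge models label one calendar event:
print(majority_label(["private", "private", "public"]))  # private
```

With an odd number of judge models, ties across two labels cannot occur, which is one reason to pass an odd count to `--judge-models`.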

### Task archetypes

Each employee produces 7 tasks, one per requestor archetype:

| # | Type | Archetype |
|---|------|-----------|
| 1 | External | Unknown (cold outreach) |
| 2 | External | Known vendor |
| 3 | External | Client |
| 4 | Internal | Boss / manager |
| 5 | Internal | Peer (same department) |
| 6 | Internal | Peer (different department) |
| 7 | Internal | Direct report |
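
The (employee × archetype) fan-out can be sketched as a Cartesian product; the archetype identifiers and employee names below are invented for illustration:

```python
from itertools import product

ARCHETYPES = [  # illustrative identifiers, one per archetype
    "external_unknown", "external_vendor", "external_client",
    "internal_boss", "internal_peer_same_dept",
    "internal_peer_other_dept", "internal_direct_report",
]

employees = ["alice", "bob"]  # placeholder employee names

# One task per (employee, archetype) pair: 7 tasks per employee.
tasks = list(product(employees, ARCHETYPES))
print(len(tasks))  # 14
```

With the defaults (4 companies × 5 employees), this fan-out yields 140 candidate tasks before retries and validation.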

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | (required) | Generation model |
| `--judge-models` | uses `--model` | Comma-separated models for majority-vote privacy labeling |
| `--num-companies` | 4 | Companies to generate |
| `--employees-per-company` | 5 | Employees per company |
| `--calendar-date` | 2026-02-20 | Date the calendars cover |
| `--fullness-levels` | 2,3,4,5,7,9,10 | Free-slot counts to stratify by |
| `--medium-size` | 10 | Tasks per fullness level in medium.yaml |
| `--small-size` | 3 | Tasks per fullness level in small.yaml |
| `--task-retry-limit` | 3 | Max retries per task on validation failure |
| `--requestor-fullness` | 5 | Fixed free-slot count in requestor calendars |
| `--min-mutual-free-slots` | 2 | Min mutually-free slots between assistant and requestor |
| `--no-generate-preferences` | off | Disable preference generation |
| `--seed` | 42 | Random seed |
| `--output-dir` | (required) | Output directory |

### Output

```
data/calendar-scheduling/
  large.yaml             # All tasks, stratified by fullness
  medium.yaml            # 10 tasks per fullness level
  small.yaml             # 3 tasks per fullness level
  _pipeline_outputs/     # Intermediate debug files
```
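
medium.yaml and small.yaml are per-fullness-level subsets of the full dataset. A minimal sketch of that stratification, assuming each task record carries a fullness field (the field name is hypothetical):

```python
import random
from collections import defaultdict

FULLNESS_LEVELS = [2, 3, 4, 5, 7, 9, 10]  # matches the --fullness-levels default

def stratify(tasks: list[dict], per_level: int, seed: int = 42) -> list[dict]:
    """Sample up to `per_level` tasks from each fullness bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for task in tasks:
        buckets[task["fullness"]].append(task)
    subset = []
    for level in FULLNESS_LEVELS:
        pool = buckets[level]
        subset.extend(rng.sample(pool, min(per_level, len(pool))))
    return subset

# 70 dummy tasks, 10 per level -> small subset of 3 x 7 = 21 tasks.
tasks = [{"id": i, "fullness": FULLNESS_LEVELS[i % 7]} for i in range(70)]
print(len(stratify(tasks, per_level=3)))  # 21
```

Sampling per bucket (rather than from the whole pool) keeps every fullness level represented even in the small subset.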

## Marketplace

Generates buyer-seller negotiation tasks with LLM-generated product catalogs and reservation contexts.

```bash
srbench datagen marketplace \
    --catalog-model gpt-4.1 \
    --context-model gpt-4.1 \
    --output-dir data/marketplace/
```

### Pipeline

  1. Catalog — generates products with descriptions and reference prices.
  2. Reservation contexts — generates buyer and seller profiles with hidden reservation prices.
  3. Tasks — pairs catalog entries with reservation contexts.
  4. Validation — checks task integrity.
  5. Stats — emits dataset statistics.
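
One natural integrity check in step 4 is that a deal is even possible: the buyer's hidden maximum should meet or exceed the seller's hidden minimum. A sketch under that assumption (the benchmark's actual validation rules may differ):

```python
def has_deal_zone(buyer_max: float, seller_min: float) -> bool:
    """True when a zone of possible agreement (ZOPA) exists, i.e.
    some price satisfies both hidden reservation prices."""
    return buyer_max >= seller_min

print(has_deal_zone(120.0, 95.0))  # True: any price in [95, 120] closes
print(has_deal_zone(80.0, 95.0))   # False: reservations never overlap
```

Tasks without a deal zone can still be useful as negative cases, but then the optimal agent behavior is to walk away rather than close.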

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--catalog-model` | (required) | Model for catalog generation |
| `--context-model` | (required) | Model for context generation |
| `--total-tasks` | 280 | Total tasks to generate |
| `--small-size` | 21 | Tasks in small.yaml |
| `--medium-size` | (varies) | Tasks in medium.yaml |
| `--catalog-size` | 24 | Number of products |
| `--max-rounds` | 6 | Maximum negotiation rounds |
| `--max-retries-per-item` | (varies) | Retry budget per item |
| `--max-concurrency` | 12 | Parallel generation workers |
| `--seed` | 42 | Random seed |
| `--output-dir` | (required) | Output directory |

### Output

```
data/marketplace/
  large.yaml             # All tasks
  small.yaml             # Subset
  _pipeline_outputs/     # Intermediate files
```

## Adversarial variants

Adversarial tasks test how an agent's social reasoning holds up under pressure. Both benchmarks support two attack styles:

| Style | How it works |
|-------|--------------|
| Hand-crafted | Scripted adversarial injections applied at runtime via `--attack-types` on `srbench benchmark` (no datagen step needed). |
| Whimsical | Pre-generated creative adversarial strategies extracted from Wikipedia. Generated via `srbench datagen malicious`. |

Both styles target three attack dimensions:

| Attack type | What it tests |
|-------------|---------------|
| `privacy` | Pressure to extract private/secret information from the agent |
| `outcome_optimality` | Manipulation toward a worse outcome for the user |
| `due_diligence` | Pressure to skip verification before acting |

### Generating whimsical variants

```bash
srbench datagen malicious calendar \
    --input data/calendar-scheduling/small.yaml \
    --attack-type privacy \
    -m gemini-2.5-flash \
    -n 20
```

This produces a new YAML alongside the input — for example data/calendar-scheduling/small-whimsical-privacy.yaml — that you can then run as ordinary task data:
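
The output name in the example follows a stem-suffix pattern; a sketch that reproduces it (the tool's actual naming logic may differ):

```python
from pathlib import Path

def whimsical_path(input_path: str, attack_type: str) -> Path:
    """Derive a sibling '<stem>-whimsical-<attack>.yaml' path."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}-whimsical-{attack_type}{p.suffix}")

print(whimsical_path("data/calendar-scheduling/small.yaml", "privacy"))
# data/calendar-scheduling/small-whimsical-privacy.yaml
```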

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small-whimsical-privacy.yaml \
    --model gpt-4.1
```

### Whimsical options

| Option | Description |
|--------|-------------|
| `--input` | Source benign task YAML |
| `--attack-type` | `privacy`, `outcome_optimality`, or `due_diligence` |
| `-m, --model` | Strategy generation model |
| `-n, --count` | Number of strategies to generate |
| `--strategy-assignment` | `single`, `sequential`, `random`, or `unique` |
| `--strategies-file` | Cache file for strategies (skip regeneration on subsequent runs) |
| `-o` | Output path |
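
The four `--strategy-assignment` modes suggest straightforward semantics. A hedged sketch, with the semantics inferred from the mode names rather than taken from the tool's source:

```python
import random
from itertools import cycle, islice

def assign(tasks: list, strategies: list[str], mode: str, seed: int = 42) -> list[str]:
    """Map one adversarial strategy onto each task (illustrative only)."""
    rng = random.Random(seed)
    if mode == "single":        # same strategy everywhere
        return [strategies[0]] * len(tasks)
    if mode == "sequential":    # cycle through strategies in order
        return list(islice(cycle(strategies), len(tasks)))
    if mode == "random":        # independent draw per task
        return [rng.choice(strategies) for _ in tasks]
    if mode == "unique":        # one distinct strategy per task
        if len(strategies) < len(tasks):
            raise ValueError("need at least as many strategies as tasks")
        return strategies[: len(tasks)]
    raise ValueError(f"unknown mode: {mode}")

print(assign([1, 2, 3, 4], ["a", "b"], "sequential"))  # ['a', 'b', 'a', 'b']
```

Under these assumed semantics, `unique` is the mode that needs `-n` at least as large as the number of tasks in the input file.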

The whimsical pipeline can also be run end-to-end with validation against an assistant model — see the full flag list with:

```bash
srbench datagen malicious --help
```

### Using hand-crafted attacks

Hand-crafted attacks don't need a datagen step. Pass them directly to the benchmark CLI:

```bash
srbench benchmark calendar \
    --data data/calendar-scheduling/small.yaml \
    --model gpt-4.1 \
    --attack-types privacy due_diligence
```

In an experiment file, set `attack_types=[...]` on the config — see Designing experiments.

Released under the MIT License.