# Statistical Fields

When you run evaluations with multiple trials (`trials_per_task > 1`), waza performs statistical analysis to quantify the reliability of your results. This guide explains the statistical fields that appear in your results JSON.
## When Statistical Fields Appear

Statistical fields are computed when:
- `trials_per_task > 1` in your eval YAML
- Multiple results are aggregated
- Comparison analysis is performed
Single-trial runs skip statistical analysis.
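As a sketch, enabling multi-trial runs might look like this in an eval YAML. Only `trials_per_task` is taken from this guide; the surrounding keys are illustrative, not waza's documented schema:

```yaml
# Illustrative eval config — only trials_per_task is from this guide
name: deploy-azure-app
trials_per_task: 10   # > 1 enables bootstrap CIs and significance testing
```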
## Confidence Intervals (`bootstrap_ci`)

A confidence interval is a range of scores that likely contains the true score, computed using bootstrap resampling.
### Structure

```json
{
  "task_id": "deploy-azure-app",
  "score": 0.85,
  "bootstrap_ci": {
    "lower_bound": 0.72,
    "upper_bound": 0.95,
    "confidence_level": 0.95,
    "method": "percentile"
  }
}
```

### Fields
Section titled “Fields”| Field | Type | Description |
|---|---|---|
lower_bound | float | Lower 95% confidence bound on the score |
upper_bound | float | Upper 95% confidence bound on the score |
confidence_level | float | Confidence level (typically 0.95 = 95%) |
method | string | Computation method (percentile or bca) |
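The `percentile` method can be approximated in a few lines. This is a sketch of the general technique (function name and defaults are illustrative), not waza's exact implementation:

```python
import random

def bootstrap_ci(scores, confidence_level=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap confidence interval for the mean trial score."""
    rng = random.Random(seed)
    # Resample the trial scores with replacement, recording each resample's mean
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence_level
    # Read the interval off the tails of the sorted resample means
    return {
        "lower_bound": means[int(alpha / 2 * n_resamples)],
        "upper_bound": means[int((1 - alpha / 2) * n_resamples) - 1],
        "confidence_level": confidence_level,
        "method": "percentile",
    }
```

For the five trial scores `[0.95, 1.0, 0.95, 1.0, 0.95]`, the returned bounds necessarily fall between 0.95 and 1.0, since every resampled mean does.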
### Interpretation

- Narrow interval → Consistent, reliable score
- Wide interval → Highly variable results (consider more trials)
- Interval includes 0.5 → Uncertain whether task passes or fails
### Example

```json
{
  "task": "code-review",
  "trials": 10,
  "mean_score": 0.82,
  "bootstrap_ci": {
    "lower_bound": 0.75,
    "upper_bound": 0.88,
    "confidence_level": 0.95
  }
}
```

Interpretation: We can be 95% confident that the true score lies between 0.75 and 0.88. The interval doesn't cross critical boundaries (such as 0.5), so the score is reliable.
## Significance Testing (`is_significant`)

The `is_significant` field indicates whether observed score differences are statistically meaningful or just due to randomness.
### Structure

```json
{
  "comparison": {
    "baseline_model": "gpt-4o",
    "test_model": "claude-sonnet-4.6",
    "baseline_score": 0.78,
    "test_score": 0.85,
    "is_significant": true,
    "p_value": 0.032
  }
}
```

### Fields
Section titled “Fields”| Field | Type | Description |
|---|---|---|
is_significant | bool | Whether difference passes significance threshold (p < 0.05) |
p_value | float | Probability that difference is due to chance (lower = more significant) |
### Interpretation

| `is_significant` | `p_value` | Meaning |
|---|---|---|
| `true` | < 0.05 | Difference is statistically significant — likely real improvement |
| `false` | ≥ 0.05 | Difference is not significant — could be randomness |
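A common way to produce such a p-value is a permutation test. The sketch below illustrates the idea with the p < 0.05 convention from the table above; it is not necessarily waza's exact procedure:

```python
import random

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Returns (p_value, is_significant) using the p < 0.05 threshold.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel trials at random
        pa, pb = pooled[:len(a)], pooled[len(a):]
        # Count permutations at least as extreme as the observed difference
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    p_value = hits / n_permutations
    return p_value, p_value < 0.05
```

With clearly separated score lists the test reports a tiny p-value; with identical lists it reports p = 1.0 (every relabeling is at least as extreme as the observed zero difference).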
### Example

```json
{
  "task": "deploy-app",
  "model_a": { "score": 0.80, "trials": 5 },
  "model_b": { "score": 0.82, "trials": 5 },
  "difference": 0.02,
  "is_significant": false,
  "p_value": 0.18
}
```

Interpretation: Model B's 0.02 improvement is not statistically significant. With only 5 trials each, the difference could be due to randomness. Run more trials (e.g., 10–20) to reduce uncertainty.
## Normalized Gain (`normalized_gain`)

Normalized gain measures improvement as a fraction of the maximum possible gain, which is useful for comparing progress across different score ranges.
### Formula

```
Normalized Gain = (Actual Score - Baseline Score) / (1.0 - Baseline Score)
```

### Structure

```json
{
  "task": "code-quality",
  "baseline_score": 0.60,
  "improved_score": 0.75,
  "normalized_gain": 0.375
}
```

### Interpretation
| Scenario | Baseline | Improved | Gain | Normalized Gain |
|---|---|---|---|---|
| Low starting point | 0.40 | 0.60 | 0.20 | 0.33 (33%) |
| High starting point | 0.80 | 0.90 | 0.10 | 0.50 (50%) |
In the second example, a 0.10 improvement from 0.80 is more impressive (50% of remaining room) than a 0.20 improvement from 0.40 (33% of remaining room).
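The formula and the two rows above can be checked with a small helper. This is a sketch: `normalized_gain` is an illustrative name, and the guard for a perfect baseline is an assumption about edge-case handling:

```python
def normalized_gain(baseline: float, improved: float) -> float:
    """Share of the remaining headroom (1.0 - baseline) actually captured."""
    if baseline >= 1.0:
        return 0.0  # assumption: no headroom left, treat gain as zero
    return (improved - baseline) / (1.0 - baseline)

print(round(normalized_gain(0.40, 0.60), 2))  # → 0.33 (low starting point)
print(round(normalized_gain(0.80, 0.90), 2))  # → 0.5  (high starting point)
```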
### When to Use

- Comparing tasks with different baseline scores
- Measuring effectiveness of an improvement across multiple tasks
- Aggregating progress across heterogeneous benchmarks
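For the aggregation use case, averaging the normalized gains puts tasks with very different baselines on a common scale. A sketch (the helper name is illustrative):

```python
def mean_normalized_gain(pairs):
    """Average normalized gain over (baseline, improved) score pairs."""
    gains = [
        (improved - baseline) / (1.0 - baseline)
        for baseline, improved in pairs
        if baseline < 1.0  # skip tasks with no remaining headroom
    ]
    return sum(gains) / len(gains) if gains else 0.0

# Heterogeneous benchmarks: a 0.20 raw gain and a 0.10 raw gain
# contribute comparably once normalized (0.33 and 0.50).
mean_normalized_gain([(0.40, 0.60), (0.80, 0.90)])
```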
## Using These Fields for CI/CD Gating

Confidence intervals and significance testing enable data-driven quality gates in CI/CD.
### Example: GitHub Actions Gate

```yaml
- name: Run evaluations with 10 trials
  run: waza run --config.trials_per_task 10 -o results.json

- name: Check statistical significance
  run: |
    python3 << 'EOF'
    import json

    with open('results.json') as f:
        results = json.load(f)

    for task in results['tasks']:
        # Reject if the confidence interval is too wide
        ci = task.get('bootstrap_ci', {})
        width = ci.get('upper_bound', 0) - ci.get('lower_bound', 0)
        if width > 0.20:  # Tolerance: ±0.10
            print(f"❌ {task['id']}: Interval too wide ({width:.2f})")
            exit(1)

        # Warn if the improvement is not significant vs baseline
        if not task.get('is_significant', False):
            print(f"⚠️ {task['id']}: Improvement not significant")

    print("✅ All tasks meet statistical quality gates")
    EOF
```

### Gate Strategies
| Strategy | When | Threshold |
|---|---|---|
| Confidence width | High variability concerns | CI width < 0.15 (±0.075) |
| Significance | Avoid false improvements | p_value < 0.05 |
| Normalized gain | Cross-task comparison | Gain > 0.25 (25% of room) |
### Example: Multi-Model CI Gate

```yaml
- name: Compare models with statistical rigor
  run: |
    waza run \
      --model gpt-4o \
      --model claude-sonnet-4.6 \
      --config.trials_per_task 15 \
      -o results.json

- name: Check for significant improvement
  run: |
    python3 << 'EOF'
    import json, sys

    with open('results.json') as f:
        results = json.load(f)

    baseline_pass_rate = results['metrics']['baseline_pass_rate']
    improved_pass_rate = results['metrics']['improved_pass_rate']
    is_sig = results['metrics'].get('is_significant', False)

    if improved_pass_rate > baseline_pass_rate and is_sig:
        print(f"✅ Significant improvement: {baseline_pass_rate:.1%} → {improved_pass_rate:.1%}")
        sys.exit(0)
    else:
        print(f"❌ No significant improvement (p={results['metrics'].get('p_value', 1):.3f})")
        sys.exit(1)
    EOF
```

## Minimal Trials Recommendations
| Complexity | Min Trials | Typical | Rationale |
|---|---|---|---|
| Simple, deterministic | 3 | 5 | Low variability |
| Standard LLM tasks | 5 | 10 | Moderate variability |
| Complex multi-step tasks | 10 | 20 | High variability |
| Critical production gates | 20 | 30 | Stringent confidence |
## Example Results JSON

### Single task with 5 trials

```json
{
  "metadata": {
    "version": "1.0",
    "timestamp": "2025-01-15T14:32:00Z"
  },
  "digest": {
    "total_tasks": 1,
    "passed": 1,
    "failed": 0,
    "pass_rate": 1.0,
    "statistical_summary": {
      "bootstrap_ci": {
        "lower_bound": 0.95,
        "upper_bound": 1.0,
        "confidence_level": 0.95,
        "method": "percentile"
      },
      "is_significant": true
    }
  },
  "tasks": [
    {
      "id": "deploy-to-azure",
      "model": "gpt-4o",
      "trials": 5,
      "scores": [0.95, 1.0, 0.95, 1.0, 0.95],
      "score": 0.97,
      "passed": true,
      "bootstrap_ci": {
        "lower_bound": 0.95,
        "upper_bound": 1.0,
        "confidence_level": 0.95,
        "method": "percentile"
      },
      "grader_results": [
        { "name": "deploy-check", "score": 0.97, "passed": true }
      ]
    }
  ]
}
```

### Multi-task comparison
```json
{
  "digest": {
    "total_tasks": 3,
    "models": ["gpt-4o", "claude-sonnet-4.6"],
    "comparison": {
      "baseline_pass_rate": 0.75,
      "improved_pass_rate": 0.88,
      "normalized_gain": 0.52,
      "is_significant": true,
      "p_value": 0.018
    }
  },
  "tasks": [
    {
      "id": "task-1",
      "baseline_score": 0.80,
      "improved_score": 0.95,
      "normalized_gain": 0.75,
      "is_significant": true,
      "bootstrap_ci": {
        "lower_bound": 0.85,
        "upper_bound": 0.98,
        "confidence_level": 0.95
      }
    },
    {
      "id": "task-2",
      "baseline_score": 0.70,
      "improved_score": 0.80,
      "normalized_gain": 0.33,
      "is_significant": false,
      "bootstrap_ci": {
        "lower_bound": 0.65,
        "upper_bound": 0.90,
        "confidence_level": 0.95
      }
    }
  ]
}
```

## Next steps
- CI/CD Integration — Gate deployments on statistical significance
- Graders Guide — Understanding grader weighting
- CLI Reference — All command-line flags