
# Statistical Fields

When you run evaluations with multiple trials (`trials_per_task > 1`), waza performs statistical analysis to quantify the reliability of your results. This guide explains the statistical fields that appear in your results JSON.


## When Statistical Fields Appear

Statistical fields are computed when:

- `trials_per_task > 1` in your eval YAML
- Multiple results are aggregated
- Comparison analysis is performed

Single-trial runs skip statistical analysis.
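For reference, multiple trials are enabled in the eval YAML; a minimal sketch (only `trials_per_task` is documented in this guide, so the surrounding structure is illustrative and should be adapted to your actual config):

```yaml
# Hypothetical minimal eval config; only trials_per_task comes from this guide
config:
  trials_per_task: 10
```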


## Bootstrap Confidence Intervals (`bootstrap_ci`)

A confidence interval is a range of scores that likely contains the true score, computed using bootstrap resampling.

```json
{
  "task_id": "deploy-azure-app",
  "score": 0.85,
  "bootstrap_ci": {
    "lower_bound": 0.72,
    "upper_bound": 0.95,
    "confidence_level": 0.95,
    "method": "percentile"
  }
}
```

| Field | Type | Description |
| --- | --- | --- |
| `lower_bound` | float | Lower 95% confidence bound on the score |
| `upper_bound` | float | Upper 95% confidence bound on the score |
| `confidence_level` | float | Confidence level (typically 0.95 = 95%) |
| `method` | string | Computation method (`percentile` or `bca`) |

- Narrow interval → consistent, reliable score
- Wide interval → highly variable results (consider more trials)
- Interval includes 0.5 → uncertain whether the task passes or fails
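The percentile method can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not waza's exact implementation, and the trial scores are made up:

```python
import random
import statistics

def percentile_bootstrap_ci(scores, confidence=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for the mean of per-trial scores:
    resample with replacement many times, take the middle `confidence`
    fraction of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lower, upper

# Ten per-trial scores (illustrative values, not from a real run)
scores = [0.70, 0.90, 0.80, 1.00, 0.85, 0.75, 0.90, 0.80, 0.70, 0.95]
lower, upper = percentile_bootstrap_ci(scores)
```

More resamples tighten the estimate of the interval endpoints but not the interval itself; only more trials do that.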
```json
{
  "task": "code-review",
  "trials": 10,
  "mean_score": 0.82,
  "bootstrap_ci": {
    "lower_bound": 0.75,
    "upper_bound": 0.88,
    "confidence_level": 0.95
  }
}
```

Interpretation: We can be 95% confident that the true score lies between 0.75 and 0.88. The interval doesn't cross critical boundaries (such as the 0.5 pass/fail line), so the score is reliable.


## Statistical Significance (`is_significant`)

The `is_significant` field indicates whether observed score differences are statistically meaningful or just due to randomness.

```json
{
  "comparison": {
    "baseline_model": "gpt-4o",
    "test_model": "claude-sonnet-4.6",
    "baseline_score": 0.78,
    "test_score": 0.85,
    "is_significant": true,
    "p_value": 0.032
  }
}
```

| Field | Type | Description |
| --- | --- | --- |
| `is_significant` | bool | Whether the difference passes the significance threshold (p < 0.05) |
| `p_value` | float | Probability of observing a difference at least this large if there were no real difference (lower = stronger evidence) |

| `is_significant` | `p_value` | Meaning |
| --- | --- | --- |
| `true` | < 0.05 | Difference is statistically significant: likely a real improvement |
| `false` | ≥ 0.05 | Difference is not significant: could be randomness |
```json
{
  "task": "deploy-app",
  "model_a": { "score": 0.80, "trials": 5 },
  "model_b": { "score": 0.82, "trials": 5 },
  "difference": 0.02,
  "is_significant": false,
  "p_value": 0.18
}
```

Interpretation: Model B’s 0.02 improvement is not statistically significant. With only 5 trials each, the difference could be due to randomness. Run more trials (e.g., 10–20) to reduce uncertainty.
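The guide doesn't specify which test waza runs, but a permutation test captures the idea: if the two models were equally good, shuffling trial scores between them should often produce a difference as large as the observed one. A minimal sketch with illustrative per-trial scores:

```python
import random
import statistics

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test: fraction of random relabelings whose
    mean difference is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Five trials per model, matching the example above (values are made up)
model_a = [0.80, 0.75, 0.85, 0.80, 0.80]  # mean 0.80
model_b = [0.82, 0.80, 0.85, 0.80, 0.83]  # mean 0.82
p = permutation_p_value(model_a, model_b)
```

With only five trials per side, a 0.02 mean difference is easily produced by shuffling alone, which is exactly why the example above reports no significance.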


## Normalized Gain

Normalized gain measures improvement as a fraction of the maximum possible gain, which is useful for comparing progress across different score ranges.

```
Normalized Gain = (Actual Score - Baseline Score) / (1.0 - Baseline Score)
```
```json
{
  "task": "code-quality",
  "baseline_score": 0.60,
  "improved_score": 0.75,
  "normalized_gain": 0.375
}
```

| Scenario | Baseline | Improved | Gain | Normalized Gain |
| --- | --- | --- | --- | --- |
| Low starting point | 0.40 | 0.60 | 0.20 | 0.33 (33%) |
| High starting point | 0.80 | 0.90 | 0.10 | 0.50 (50%) |

In the second example, a 0.10 improvement from 0.80 is more impressive (50% of remaining room) than a 0.20 improvement from 0.40 (33% of remaining room).

Use normalized gain when:

- Comparing tasks with different baseline scores
- Measuring the effectiveness of an improvement across multiple tasks
- Aggregating progress across heterogeneous benchmarks
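The formula translates directly into code; a tiny helper (the function name is ours, not waza's API):

```python
def normalized_gain(baseline: float, improved: float) -> float:
    """Fraction of the remaining headroom (1.0 - baseline) actually captured."""
    if baseline >= 1.0:
        return 0.0  # no headroom left to gain
    return (improved - baseline) / (1.0 - baseline)

# Matches the example above: 0.60 -> 0.75 captures 37.5% of the headroom
gain = normalized_gain(0.60, 0.75)
```

Note the guard for a perfect baseline: the formula divides by `1.0 - baseline`, which is zero when there is no room to improve.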

## Quality Gates in CI/CD

Confidence intervals and significance testing enable data-driven quality gates in CI/CD.

```yaml
- name: Run evaluations with 10 trials
  run: waza run --config.trials_per_task 10 -o results.json

- name: Check statistical significance
  run: |
    python3 << 'EOF'
    import json
    import sys

    with open('results.json') as f:
        results = json.load(f)

    for task in results['tasks']:
        # Reject if the confidence interval is too wide
        ci = task.get('bootstrap_ci', {})
        width = ci.get('upper_bound', 0) - ci.get('lower_bound', 0)
        if width > 0.20:  # Tolerance: ±0.10
            print(f"❌ {task['id']}: Interval too wide ({width:.2f})")
            sys.exit(1)
        # Warn if the improvement is not significant vs baseline
        if not task.get('is_significant', False):
            print(f"⚠️ {task['id']}: Improvement not significant")

    print("✅ All tasks meet statistical quality gates")
    EOF
```
| Strategy | When | Threshold |
| --- | --- | --- |
| Confidence width | High variability concerns | CI width < 0.15 (±0.075) |
| Significance | Avoid false improvements | p_value < 0.05 |
| Normalized gain | Cross-task comparison | Gain > 0.25 (25% of room) |
```yaml
- name: Compare models with statistical rigor
  run: |
    waza run \
      --model gpt-4o \
      --model claude-sonnet-4.6 \
      --config.trials_per_task 15 \
      -o results.json

- name: Check for significant improvement
  run: |
    python3 << 'EOF'
    import json
    import sys

    with open('results.json') as f:
        results = json.load(f)

    baseline_pass_rate = results['metrics']['baseline_pass_rate']
    improved_pass_rate = results['metrics']['improved_pass_rate']
    is_sig = results['metrics'].get('is_significant', False)

    if improved_pass_rate > baseline_pass_rate and is_sig:
        print(f"✅ Significant improvement: {baseline_pass_rate:.1%} → {improved_pass_rate:.1%}")
        sys.exit(0)
    else:
        print(f"❌ No significant improvement (p={results['metrics'].get('p_value', 1):.3f})")
        sys.exit(1)
    EOF
```

## How Many Trials?

| Complexity | Min Trials | Typical | Rationale |
| --- | --- | --- | --- |
| Simple, deterministic | 3 | 5 | Low variability |
| Standard LLM tasks | 5 | 10 | Moderate variability |
| Complex multi-step tasks | 10 | 20 | High variability |
| Critical production gates | 20 | 30 | Stringent confidence |
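The rationale behind these counts: the uncertainty of a mean score shrinks roughly with the square root of the trial count. A back-of-the-envelope sketch using the normal approximation (the 0.15 per-trial standard deviation is an arbitrary assumption, not a waza default):

```python
def approx_ci_width(score_sd: float, n_trials: int, z: float = 1.96) -> float:
    """Rough width of a 95% normal-approximation CI for a mean score:
    2 * z * sd / sqrt(n)."""
    return 2 * z * score_sd / n_trials ** 0.5

# Assuming a per-trial standard deviation of 0.15, CI width at common counts
widths = {n: approx_ci_width(0.15, n) for n in (3, 5, 10, 20, 30)}
```

Going from 5 to 20 trials halves the interval width; going from 20 to 30 buys comparatively little, which is why the table's recommendations flatten out.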

## Complete Results Example

A single-model run with five trials per task:

```json
{
  "metadata": {
    "version": "1.0",
    "timestamp": "2025-01-15T14:32:00Z"
  },
  "digest": {
    "total_tasks": 1,
    "passed": 1,
    "failed": 0,
    "pass_rate": 1.0,
    "statistical_summary": {
      "bootstrap_ci": {
        "lower_bound": 0.95,
        "upper_bound": 1.0,
        "confidence_level": 0.95,
        "method": "percentile"
      },
      "is_significant": true
    }
  },
  "tasks": [
    {
      "id": "deploy-to-azure",
      "model": "gpt-4o",
      "trials": 5,
      "scores": [0.95, 1.0, 0.95, 1.0, 0.95],
      "score": 0.97,
      "passed": true,
      "bootstrap_ci": {
        "lower_bound": 0.95,
        "upper_bound": 1.0,
        "confidence_level": 0.95,
        "method": "percentile"
      },
      "grader_results": [
        {
          "name": "deploy-check",
          "score": 0.97,
          "passed": true
        }
      ]
    }
  ]
}
```
A model-comparison run adds `comparison` fields to the digest and per-task gain fields:

```json
{
  "digest": {
    "total_tasks": 3,
    "models": ["gpt-4o", "claude-sonnet-4.6"],
    "comparison": {
      "baseline_pass_rate": 0.75,
      "improved_pass_rate": 0.88,
      "normalized_gain": 0.52,
      "is_significant": true,
      "p_value": 0.018
    }
  },
  "tasks": [
    {
      "id": "task-1",
      "baseline_score": 0.80,
      "improved_score": 0.95,
      "normalized_gain": 0.75,
      "is_significant": true,
      "bootstrap_ci": {
        "lower_bound": 0.85,
        "upper_bound": 0.98,
        "confidence_level": 0.95
      }
    },
    {
      "id": "task-2",
      "baseline_score": 0.70,
      "improved_score": 0.80,
      "normalized_gain": 0.33,
      "is_significant": false,
      "bootstrap_ci": {
        "lower_bound": 0.65,
        "upper_bound": 0.90,
        "confidence_level": 0.95
      }
    }
  ]
}
```