# Statistical Fields

When you run evaluations with multiple trials (`trials_per_task > 1`), waza performs statistical analysis to quantify the reliability of your results. This guide explains the statistical fields that appear in your results JSON.
## When Statistical Fields Appear

Statistical fields are computed when:
- `trials_per_task > 1` in your eval YAML
- Multiple results are aggregated
- Comparison analysis is performed
Single-trial runs skip statistical analysis.
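As a sketch, enabling multi-trial runs might look like this in an eval YAML. Only `trials_per_task` is taken from this guide; the surrounding keys are illustrative, not waza's documented schema:

```yaml
# Illustrative eval config — only trials_per_task is from this guide
name: deploy-azure-app
trials_per_task: 10   # > 1 enables bootstrap CIs and significance testing
```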
## Confidence Intervals (`bootstrap_ci`)

A confidence interval is a range of scores that likely contains the true score, computed using bootstrap resampling.
### Structure

```json
{
  "task_id": "deploy-azure-app",
  "score": 0.85,
  "bootstrap_ci": {
    "lower_bound": 0.72,
    "upper_bound": 0.95,
    "confidence_level": 0.95,
    "method": "percentile"
  }
}
```

### Fields
Section titled “Fields”| Field | Type | Description |
|---|---|---|
lower_bound | float | Lower 95% confidence bound on the score |
upper_bound | float | Upper 95% confidence bound on the score |
confidence_level | float | Confidence level (typically 0.95 = 95%) |
method | string | Computation method (percentile or bca) |
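The `percentile` method can be approximated in a few lines. This is a sketch of the general technique (function name and defaults are illustrative), not waza's exact implementation:

```python
import random

def bootstrap_ci(scores, confidence_level=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap confidence interval for the mean trial score."""
    rng = random.Random(seed)
    # Resample the trial scores with replacement, recording each resample's mean
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence_level
    # Read the interval off the tails of the sorted resample means
    return {
        "lower_bound": means[int(alpha / 2 * n_resamples)],
        "upper_bound": means[int((1 - alpha / 2) * n_resamples) - 1],
        "confidence_level": confidence_level,
        "method": "percentile",
    }
```

For the five trial scores `[0.95, 1.0, 0.95, 1.0, 0.95]`, the returned bounds necessarily fall between 0.95 and 1.0, since every resampled mean does.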
### Interpretation

- Narrow interval → Consistent, reliable score
- Wide interval → Highly variable results (consider more trials)
- Interval includes 0.5 → Uncertain whether task passes or fails
### Example

```json
{
  "task": "code-review",
  "trials": 10,
  "mean_score": 0.82,
  "bootstrap_ci": {
    "lower_bound": 0.75,
    "upper_bound": 0.88,
    "confidence_level": 0.95
  }
}
```

Interpretation: We can be 95% confident that the true score lies between 0.75 and 0.88. The interval doesn't cross critical boundaries (such as 0.5), so the score is reliable.
## Significance Testing (`is_significant`)

The `is_significant` field indicates whether observed score differences are statistically meaningful or just due to randomness.
### Structure

```json
{
  "comparison": {
    "baseline_model": "gpt-4o",
    "test_model": "claude-sonnet-4.6",
    "baseline_score": 0.78,
    "test_score": 0.85,
    "is_significant": true,
    "p_value": 0.032
  }
}
```

### Fields
Section titled “Fields”| Field | Type | Description |
|---|---|---|
is_significant | bool | Whether difference passes significance threshold (p < 0.05) |
p_value | float | Probability that difference is due to chance (lower = more significant) |
### Interpretation

| `is_significant` | `p_value` | Meaning |
|---|---|---|
| `true` | < 0.05 | Difference is statistically significant — likely real improvement |
| `false` | ≥ 0.05 | Difference is not significant — could be randomness |
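A common way to produce such a p-value is a permutation test. The sketch below illustrates the idea with the p < 0.05 convention from the table above; it is not necessarily waza's exact procedure:

```python
import random

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Returns (p_value, is_significant) using the p < 0.05 threshold.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel trials at random
        pa, pb = pooled[:len(a)], pooled[len(a):]
        # Count permutations at least as extreme as the observed difference
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    p_value = hits / n_permutations
    return p_value, p_value < 0.05
```

With clearly separated score lists the test reports a tiny p-value; with identical lists it reports p = 1.0 (every relabeling is at least as extreme as the observed zero difference).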
### Example

```json
{
  "task": "deploy-app",
  "model_a": { "score": 0.80, "trials": 5 },
  "model_b": { "score": 0.82, "trials": 5 },
  "difference": 0.02,
  "is_significant": false,
  "p_value": 0.18
}
```

Interpretation: Model B's 0.02 improvement is not statistically significant. With only 5 trials each, the difference could be due to randomness. Run more trials (e.g., 10–20) to reduce uncertainty.
## Normalized Gain (`normalized_gain`)

Normalized gain measures improvement as a fraction of the maximum possible gain, which is useful for comparing progress across different score ranges.
### Formula

```
Normalized Gain = (Actual Score - Baseline Score) / (1.0 - Baseline Score)
```

### Structure

```json
{
  "task": "code-quality",
  "baseline_score": 0.60,
  "improved_score": 0.75,
  "normalized_gain": 0.375
}
```

### Interpretation
| Scenario | Baseline | Improved | Gain | Normalized Gain |
|---|---|---|---|---|
| Low starting point | 0.40 | 0.60 | 0.20 | 0.33 (33%) |
| High starting point | 0.80 | 0.90 | 0.10 | 0.50 (50%) |
In the second example, a 0.10 improvement from 0.80 is more impressive (50% of remaining room) than a 0.20 improvement from 0.40 (33% of remaining room).
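The formula and the two rows above can be checked with a small helper. This is a sketch: `normalized_gain` is an illustrative name, and the guard for a perfect baseline is an assumption about edge-case handling:

```python
def normalized_gain(baseline: float, improved: float) -> float:
    """Share of the remaining headroom (1.0 - baseline) actually captured."""
    if baseline >= 1.0:
        return 0.0  # assumption: no headroom left, treat gain as zero
    return (improved - baseline) / (1.0 - baseline)

print(round(normalized_gain(0.40, 0.60), 2))  # → 0.33 (low starting point)
print(round(normalized_gain(0.80, 0.90), 2))  # → 0.5  (high starting point)
```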
### When to Use

- Comparing tasks with different baseline scores
- Measuring effectiveness of an improvement across multiple tasks
- Aggregating progress across heterogeneous benchmarks
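For the aggregation use case, averaging the normalized gains puts tasks with very different baselines on a common scale. A sketch (the helper name is illustrative):

```python
def mean_normalized_gain(pairs):
    """Average normalized gain over (baseline, improved) score pairs."""
    gains = [
        (improved - baseline) / (1.0 - baseline)
        for baseline, improved in pairs
        if baseline < 1.0  # skip tasks with no remaining headroom
    ]
    return sum(gains) / len(gains) if gains else 0.0

# Heterogeneous benchmarks: a 0.20 raw gain and a 0.10 raw gain
# contribute comparably once normalized (0.33 and 0.50).
mean_normalized_gain([(0.40, 0.60), (0.80, 0.90)])
```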
## Using These Fields for CI/CD Gating

Confidence intervals and significance testing enable data-driven quality gates in CI/CD.
### Example: GitHub Actions Gate

```yaml
- name: Run evaluations with 10 trials
  run: waza run --config.trials_per_task 10 -o results.json

- name: Check statistical significance
  run: |
    python3 << 'EOF'
    import json

    with open('results.json') as f:
        results = json.load(f)

    for task in results['tasks']:
        # Reject if the confidence interval is too wide
        ci = task.get('bootstrap_ci', {})
        width = ci.get('upper_bound', 0) - ci.get('lower_bound', 0)
        if width > 0.20:  # Tolerance: ±0.10
            print(f"❌ {task['id']}: Interval too wide ({width:.2f})")
            exit(1)

        # Warn if the improvement is not significant vs baseline
        if not task.get('is_significant', False):
            print(f"⚠️ {task['id']}: Improvement not significant")

    print("✅ All tasks meet statistical quality gates")
    EOF
```

### Gate Strategies
| Strategy | When | Threshold |
|---|---|---|
| Confidence width | High variability concerns | CI width < 0.15 (±0.075) |
| Significance | Avoid false improvements | p_value < 0.05 |
| Normalized gain | Cross-task comparison | Gain > 0.25 (25% of room) |
### Example: Multi-Model CI Gate

```yaml
- name: Compare models with statistical rigor
  run: |
    waza run \
      --model gpt-4o \
      --model claude-sonnet-4.6 \
      --config.trials_per_task 15 \
      -o results.json

- name: Check for significant improvement
  run: |
    python3 << 'EOF'
    import json, sys

    with open('results.json') as f:
        results = json.load(f)

    baseline_pass_rate = results['metrics']['baseline_pass_rate']
    improved_pass_rate = results['metrics']['improved_pass_rate']
    is_sig = results['metrics'].get('is_significant', False)

    if improved_pass_rate > baseline_pass_rate and is_sig:
        print(f"✅ Significant improvement: {baseline_pass_rate:.1%} → {improved_pass_rate:.1%}")
        sys.exit(0)
    else:
        print(f"❌ No significant improvement (p={results['metrics'].get('p_value', 1):.3f})")
        sys.exit(1)
    EOF
```

## Minimal Trials Recommendations
| Complexity | Min Trials | Typical | Rationale |
|---|---|---|---|
| Simple, deterministic | 3 | 5 | Low variability |
| Standard LLM tasks | 5 | 10 | Moderate variability |
| Complex multi-step tasks | 10 | 20 | High variability |
| Critical production gates | 20 | 30 | Stringent confidence |
## Example Results JSON

### Single task with 5 trials

```json
{
  "metadata": {
    "version": "1.0",
    "timestamp": "2025-01-15T14:32:00Z"
  },
  "digest": {
    "total_tasks": 1,
    "passed": 1,
    "failed": 0,
    "pass_rate": 1.0,
    "statistical_summary": {
      "bootstrap_ci": {
        "lower_bound": 0.95,
        "upper_bound": 1.0,
        "confidence_level": 0.95,
        "method": "percentile"
      },
      "is_significant": true
    }
  },
  "tasks": [
    {
      "id": "deploy-to-azure",
      "model": "gpt-4o",
      "trials": 5,
      "scores": [0.95, 1.0, 0.95, 1.0, 0.95],
      "score": 0.97,
      "passed": true,
      "bootstrap_ci": {
        "lower_bound": 0.95,
        "upper_bound": 1.0,
        "confidence_level": 0.95,
        "method": "percentile"
      },
      "grader_results": [
        { "name": "deploy-check", "score": 0.97, "passed": true }
      ]
    }
  ]
}
```

### Multi-task comparison
```json
{
  "digest": {
    "total_tasks": 3,
    "models": ["gpt-4o", "claude-sonnet-4.6"],
    "comparison": {
      "baseline_pass_rate": 0.75,
      "improved_pass_rate": 0.88,
      "normalized_gain": 0.52,
      "is_significant": true,
      "p_value": 0.018
    }
  },
  "tasks": [
    {
      "id": "task-1",
      "baseline_score": 0.80,
      "improved_score": 0.95,
      "normalized_gain": 0.75,
      "is_significant": true,
      "bootstrap_ci": {
        "lower_bound": 0.85,
        "upper_bound": 0.98,
        "confidence_level": 0.95
      }
    },
    {
      "id": "task-2",
      "baseline_score": 0.70,
      "improved_score": 0.80,
      "normalized_gain": 0.33,
      "is_significant": false,
      "bootstrap_ci": {
        "lower_bound": 0.65,
        "upper_bound": 0.90,
        "confidence_level": 0.95
      }
    }
  ]
}
```

## Next steps
- CI/CD Integration — Gate deployments on statistical significance
- Graders Guide — Understanding grader weighting
- CLI Reference — All command-line flags