Scoring Functions
Functions
passAtK
Estimates the probability of at least one success in k attempts. Answers: “can the agent do this at all?”
Uses the unbiased combinatorial estimator (Chen et al., 2021):

    pass@k = 1 - C(n-c, k) / C(n, k)

| Parameter | Type | Description |
|---|---|---|
| n | number | Total number of trials run |
| c | number | Number of successful trials |
| k | number | Number of attempts to consider |

    function passAtK(n: number, c: number, k: number): number;

Example: 3 successes out of 5 trials → passAtK(5, 3, 5) = 1.0. Only 2 of the 5 trials failed, so any draw of 5 attempts must include at least one success: C(2, 5) = 0.
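
The estimator above can be evaluated without computing large binomial coefficients, because C(n-c, k) / C(n, k) equals the product of (1 - k/i) for i from n-c+1 to n. The following is a minimal sketch of that approach, not the library's actual implementation; the name `passAtKSketch` and the input validation are assumptions made for illustration.

```typescript
// Hypothetical sketch of the unbiased pass@k estimator; the real passAtK may differ.
function passAtKSketch(n: number, c: number, k: number): number {
  // Assumed validation: the estimator is only defined for k <= n and 0 <= c <= n.
  if (n <= 0 || c < 0 || c > n || k <= 0 || k > n) {
    throw new RangeError("expected 0 <= c <= n and 1 <= k <= n");
  }
  // Fewer than k failures: every set of k attempts contains a success.
  if (n - c < k) return 1;
  // C(n-c, k) / C(n, k) = product over i in [n-c+1, n] of (1 - k / i)
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}
```

With 3 successes in 5 trials, this sketch gives 0.6 for k = 1, 0.9 for k = 2, and 1 for k = 5.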
passAtKNaive
The simpler 1 - (1-p)^k formula. Available for comparison but not used in scoring: the unbiased estimator above is more accurate when k is small (which it always is in evals).
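
To make the difference concrete, the sketch below evaluates both formulas on the same data (3 successes in 5 trials, k = 2). It is an illustration only; `passAtKNaiveSketch` and `passAtKUnbiasedSketch` are hypothetical names and are not the library's functions.

```typescript
// Hypothetical helpers for comparing the two estimators on identical inputs.
function passAtKNaiveSketch(perTrialPassRate: number, k: number): number {
  // Plug the observed pass rate into 1 - (1 - p)^k.
  return 1 - Math.pow(1 - perTrialPassRate, k);
}

function passAtKUnbiasedSketch(n: number, c: number, k: number): number {
  // 1 - C(n-c, k) / C(n, k), evaluated as a running product.
  if (n - c < k) return 1;
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// 3 successes out of 5 trials, considering k = 2 attempts.
const naiveEstimate = passAtKNaiveSketch(3 / 5, 2); // 1 - 0.4^2, about 0.84
const unbiasedEstimate = passAtKUnbiasedSketch(5, 3, 2); // 1 - 1/10, about 0.9
```

On this small sample the naive formula reports about 0.84 where the combinatorial estimator reports about 0.9.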

    function passAtKNaive(perTrialPassRate: number, k: number): number;

passToTheK
Estimates the probability that all k trials succeed. Answers: “can I rely on this in production?”

    pass^k = p^k

    function passToTheK(perTrialPassRate: number, k: number): number;

Example: 80% pass rate, 5 trials → passToTheK(0.8, 5) ≈ 0.328 (only about a 33% chance that all 5 succeed)
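
The formula is a single exponentiation. The sketch below reproduces the example above. The name `passToTheKSketch` and the range checks are assumptions added for illustration and may not match the library's behavior on invalid input.

```typescript
// Hypothetical sketch of pass^k = p^k; not the library's implementation.
function passToTheKSketch(perTrialPassRate: number, k: number): number {
  // Assumed validation: p is a probability and k is a positive integer.
  if (perTrialPassRate < 0 || perTrialPassRate > 1) {
    throw new RangeError("perTrialPassRate must be in [0, 1]");
  }
  if (!Number.isInteger(k) || k < 1) {
    throw new RangeError("k must be a positive integer");
  }
  // Probability that k independent trials all pass.
  return Math.pow(perTrialPassRate, k);
}

// 80% per-trial pass rate over 5 trials: 0.8^5 = 0.32768
const allFiveSucceed = passToTheKSketch(0.8, 5);
```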
computeMultiTrialMetrics
Computes all multi-trial metrics from a set of trial pass/fail results.

    function computeMultiTrialMetrics(trialPassed: boolean[]): MultiTrialMetrics;

Example:

    computeMultiTrialMetrics([true, true, false, true, true]);
    // → { perTrialPassRate: 0.8, passAtK: 0.99, passToTheK: 0.33, k: 5 }

computeStimulusScore
Aggregates trial grades into a per-stimulus score, including flakiness detection.
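
The flakiness fields are described in the StimulusScore type: a stimulus is flaky when its pass rate is strictly between 0 and 1, and flakinessPercent is the minority outcome count over the total, times 100. The sketch below shows one way to compute those two values from pass/fail results. `flakinessSketch` and its handling of an empty trial list are assumptions, not the library's code.

```typescript
// Hypothetical sketch of the flakiness rule; the real computeStimulusScore may differ.
interface FlakinessSketch {
  flaky: boolean;
  flakinessPercent: number;
}

function flakinessSketch(trialPassed: boolean[]): FlakinessSketch {
  const total = trialPassed.length;
  // Assumption: with no trials there is nothing to be flaky about.
  if (total === 0) return { flaky: false, flakinessPercent: 0 };
  const passes = trialPassed.filter(Boolean).length;
  // The minority outcome is whichever of pass/fail occurred less often.
  const minority = Math.min(passes, total - passes);
  return {
    flaky: minority > 0, // equivalent to 0 < passRate < 1
    flakinessPercent: (minority * 100) / total,
  };
}

// 4 passes and 1 failure: flaky, with the minority outcome at 20%.
const mixedResult = flakinessSketch([true, true, false, true, true]);
```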

    function computeStimulusScore(
      stimulusName: string,
      trialGrades: Array<{ grade: GraderResult; passed: boolean }>,
      hasGraders: boolean,
    ): StimulusScore;

computeSkillScore
Aggregates stimulus scores into a per-skill score. Skill-level multi-trial metrics are the mean of the per-stimulus metrics. They are not computed by pooling raw trial outcomes across different stimuli.
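
Averaging per-stimulus metrics and pooling raw trials can give very different answers. The sketch below illustrates this with two stimuli of 5 trials each, one that always passes and one that never does. `MetricsSketch` and `meanOfStimulusMetricsSketch` are hypothetical names, and the `k` field is left out because this page does not say how it is aggregated.

```typescript
// Hypothetical sketch of mean-of-stimuli aggregation; not the library's code.
interface MetricsSketch {
  perTrialPassRate: number;
  passAtK: number;
  passToTheK: number;
}

function meanOfStimulusMetricsSketch(perStimulus: MetricsSketch[]): MetricsSketch {
  if (perStimulus.length === 0) {
    throw new RangeError("need at least one stimulus");
  }
  // Average one field across all stimuli.
  const mean = (pick: (m: MetricsSketch) => number): number =>
    perStimulus.reduce((sum, m) => sum + pick(m), 0) / perStimulus.length;
  return {
    perTrialPassRate: mean((m) => m.perTrialPassRate),
    passAtK: mean((m) => m.passAtK),
    passToTheK: mean((m) => m.passToTheK),
  };
}

const alwaysPasses: MetricsSketch = { perTrialPassRate: 1, passAtK: 1, passToTheK: 1 };
const neverPasses: MetricsSketch = { perTrialPassRate: 0, passAtK: 0, passToTheK: 0 };

// Mean of per-stimulus values: passToTheK = (1 + 0) / 2 = 0.5
const skillLevel = meanOfStimulusMetricsSketch([alwaysPasses, neverPasses]);

// Pooling all 10 trials instead gives p = 0.5, so p^5 = 0.03125
const pooledPassToTheK = Math.pow(0.5, 5);
```

The mean reports that half of the stimuli are fully reliable, whereas the pooled figure would blur two very different stimuli into one low number.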

    function computeSkillScore(
      skillName: string,
      stimulusScores: StimulusScore[],
      threshold: number,
    ): SkillScore;

MultiTrialMetrics

    interface MultiTrialMetrics {
      perTrialPassRate: number; // fraction of trials that passed
      passAtK: number; // P(≥1 success in k trials)
      passToTheK: number; // P(all k trials succeed)
      k: number; // number of trials run
    }

StimulusScore

    interface StimulusScore {
      stimulusName: string;
      trialResults: GraderResult[][]; // [trial][grader]
      aggregateScore: number; // weighted score, averaged across trials
      multiTrial: MultiTrialMetrics;
      unscored: boolean; // true when no graders were configured
      flaky: boolean; // true when 0 < passRate < 1
      flakinessPercent: number; // minority outcome / total × 100
    }

SkillScore

    interface SkillScore {
      skillName: string;
      stimulusScores: StimulusScore[];
      overallScore: number; // mean of stimulus aggregate scores
      overallMultiTrial: MultiTrialMetrics;
      passed: boolean; // overallScore ≥ threshold
    }

Quick reference tables
For interpretation guidance and full lookup tables, see Scoring concepts.