Skip to content

Scoring Functions

Estimates the probability of at least one success in k attempts. Answers: “can the agent do this at all?”

Uses the unbiased combinatorial estimator (Chen et al., 2021):

pass@k = 1 - C(n-c, k) / C(n, k)
ParameterTypeDescription
nnumberTotal number of trials run
cnumberNumber of successful trials
knumberNumber of attempts to consider
function passAtK(n: number, c: number, k: number): number;

Example: 3 successes out of 5 trials → passAtK(5, 3, 5) = 1.0 (guaranteed at least one success since we already observed 3)

The simpler 1 - (1-p)^k formula. Available for comparison but not used in scoring — the unbiased estimator above is more accurate when K is small (which it always is in evals).

function passAtKNaive(perTrialPassRate: number, k: number): number;

Estimates the probability that all k trials succeed. Answers: “can I rely on this in production?”

pass^k = p^k
function passToTheK(perTrialPassRate: number, k: number): number;

Example: 80% pass rate, 5 trials → passToTheK(0.8, 5) = 0.328 (only 33% chance all 5 succeed)

Computes all multi-trial metrics from a set of trial pass/fail results.

function computeMultiTrialMetrics(trialPassed: boolean[]): MultiTrialMetrics;

Example:

computeMultiTrialMetrics([true, true, false, true, true]);
// → { perTrialPassRate: 0.8, passAtK: 0.99, passToTheK: 0.33, k: 5 }

Aggregates trial grades into a per-stimulus score, including flakiness detection.

function computeStimulusScore(
stimulusName: string,
trialGrades: Array<{ grade: GraderResult; passed: boolean }>,
hasGraders: boolean,
): StimulusScore;

Aggregates stimulus scores into a per-skill score. Skill-level multi-trial metrics are the mean of per-stimulus metrics — not a pool of raw trial outcomes across different stimuli.

function computeSkillScore(
skillName: string,
stimulusScores: StimulusScore[],
threshold: number,
): SkillScore;
interface MultiTrialMetrics {
perTrialPassRate: number; // fraction of trials that passed
passAtK: number; // P(≥1 success in k trials)
passToTheK: number; // P(all k trials succeed)
k: number; // number of trials run
}
interface StimulusScore {
stimulusName: string;
trialResults: GraderResult[][]; // [trial][grader]
aggregateScore: number; // weighted score, averaged across trials
multiTrial: MultiTrialMetrics;
unscored: boolean; // true when no graders were configured
flaky: boolean; // true when 0 < passRate < 1
flakinessPercent: number; // minority outcome / total × 100
}
interface SkillScore {
skillName: string;
stimulusScores: StimulusScore[];
overallScore: number; // mean of stimulus aggregate scores
overallMultiTrial: MultiTrialMetrics;
passed: boolean; // overallScore ≥ threshold
}

For interpretation guidance and full lookup tables, see Scoring concepts.