Grader & LLM Judge – Internal Design
This document explains how WhyDidItFail scores an agent's performance after each episode. Scoring has two layers: a programmatic grader that runs keyword matching, and an LLM judge that evaluates reasoning quality. The two scores are combined into a single final number.
Overview
Agent submits diagnosis
        │
        ├──► Programmatic Grader (server/graders.py)
        │        five sub-scores → keyword_score [0.0–1.0]
        │
        └──► LLM Judge (server/llm_judge.py)
                 three criteria → judge_score [0.0–1.0]

Final Score = 0.85 × keyword_score + 0.15 × judge_score
The grader is fast and deterministic – it runs synchronously inside the environment's step() call. The judge is a separate LLM call that runs after the episode ends (the WebSocket is already closed by then), so it never adds latency to the agent's action loop.
Part 1 – Programmatic Grader
File: server/graders.py
Entry point: grade(diagnosis, suggested_fix, scenario, steps_taken, inspection_order)
The grader produces five sub-scores and sums them. The result is clamped to [0.0, 1.0].
Total = diagnosis_score + evidence_diagnosis_penalty
+ evidence_score + efficiency_score + fix_score + ordering_bonus
1.1 Diagnosis Score – up to +0.70
This is the biggest sub-score. It checks whether the agent named the correct failure mode.
Every scenario has a correct_diagnosis field (e.g. "exploding_gradients"). There are two keyword maps in the grader:
Exact keywords – strong signal, worth +0.40 each:
"exploding_gradients" → ["exploding gradients", "exploding"]
"overfitting" → ["overfitting", "overfit"]
"dying_relu" → ["dying relu", "dead relu"]
...
Category keywords – weaker signal, worth +0.10 each:
"exploding_gradients" → ["nan", "gradient", "overflow", "diverge"]
"overfitting" → ["generalization", "val loss", "memoriz"]
...
The agent's diagnosis string is lowercased and scanned for both sets. Matches accumulate. The result is capped at 0.70.
Vague answer penalty: If no exact keyword matched and the diagnosis is fewer than 3 words, -0.10 is applied. This discourages one-word guesses.
Practical meaning: A correct exact label ("exploding gradients" or "exploding") earns +0.40 right away. Hitting a few category keywords on top of that can bring it up toward the 0.70 cap. A wrong diagnosis with some related vocabulary still earns partial credit, but cannot exceed 0.70.
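The accumulate-and-cap logic above can be sketched as follows. The keyword maps are abbreviated and the function name is an assumption for illustration, not the actual graders.py code:

```python
# Illustrative sketch of the diagnosis sub-score (maps abbreviated).
EXACT_KEYWORDS = {
    "exploding_gradients": ["exploding gradients", "exploding"],
}
CATEGORY_KEYWORDS = {
    "exploding_gradients": ["nan", "gradient", "overflow", "diverge"],
}

def diagnosis_score(diagnosis: str, correct_diagnosis: str) -> float:
    text = diagnosis.lower()
    score = 0.0
    exact_hit = False
    for kw in EXACT_KEYWORDS.get(correct_diagnosis, []):
        if kw in text:             # strong signal: +0.40 each
            score += 0.40
            exact_hit = True
    for kw in CATEGORY_KEYWORDS.get(correct_diagnosis, []):
        if kw in text:             # weak signal: +0.10 each
            score += 0.10
    if not exact_hit and len(text.split()) < 3:
        score -= 0.10              # vague-answer penalty
    return min(score, 0.70)        # capped at 0.70
```

Note that "exploding gradients" also contains the substring "exploding", so a correct exact label can match both exact keywords and then pick up category matches, which is why the 0.70 cap is easy to reach with a well-phrased diagnosis.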
1.2 Evidence-Diagnosis Penalty – 0.00 to -0.10
This penalty fires when the agent had access to the right data but still got the diagnosis wrong. It distinguishes between an agent that guessed blind versus one that inspected evidence and reasoned poorly.
| Situation | Penalty |
|---|---|
| All required sources inspected, diagnosis wrong | -0.10 |
| Some required sources inspected, diagnosis wrong | -0.05 |
| No required sources inspected, diagnosis wrong | 0.00 |
| Diagnosis correct (any case) | 0.00 |
The evidence score (next section) already penalises skipping required sources. This penalty stacks on top when the agent collected the evidence but drew the wrong conclusion – a clear reasoning failure.
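The table maps directly to a small decision function. A minimal sketch, with illustrative names (the real graders.py may structure this differently):

```python
# Sketch of the evidence-diagnosis penalty.
def evidence_diagnosis_penalty(diagnosis_correct, required_sources, inspected):
    if diagnosis_correct:
        return 0.0                 # no penalty when the diagnosis is right
    seen = [s for s in required_sources if s in inspected]
    if required_sources and len(seen) == len(required_sources):
        return -0.10               # had all the evidence, still concluded wrongly
    if seen:
        return -0.05               # had partial evidence, still concluded wrongly
    return 0.0                     # blind guess: evidence score already punishes this
```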
1.3 Evidence Score – up to +0.25 (floored at -0.15)
This sub-score measures whether the agent inspected the right data sources before submitting.
Each scenario defines which sources are required. Easy scenarios require only logs. Medium requires logs + config. Hard requires logs + config + gradients.
+0.08 per required source inspected
-0.10 per required source NOT inspected
-0.02 per irrelevant source inspected
The raw sum is clamped to [-0.15, +0.25].
Example – hard scenario, agent inspected logs + config but skipped gradients:
- logs required and inspected: +0.08
- config required and inspected: +0.08
- gradients required but missing: -0.10
- Total: +0.06
Example – easy scenario, agent inspected logs + gradients (gradients not required):
- logs required and inspected: +0.08
- gradients not required: -0.02
- Total: +0.06
The -0.02 for irrelevant sources is intentionally mild – some exploration is acceptable. Missing a required source is much more costly.
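The per-source arithmetic and the clamp can be sketched like this (function name assumed for illustration):

```python
# Sketch of the evidence sub-score: reward required sources seen,
# penalise required sources missed and irrelevant detours, then clamp.
def evidence_score(required_sources, inspected):
    raw = 0.0
    for src in required_sources:
        raw += 0.08 if src in inspected else -0.10
    raw -= 0.02 * sum(1 for src in inspected if src not in required_sources)
    return max(-0.15, min(0.25, raw))   # clamped to [-0.15, +0.25]
```

Both worked examples above come out at +0.06 under this sketch, matching the text.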
1.4 Efficiency Score – up to +0.15
This sub-score rewards acting without waste. The minimum number of steps for any episode is: len(required_sources) + 1 (inspect all required sources, then submit).
If steps_taken == min_steps: score = 0.15 (full reward)
If steps_taken > min_steps: score = max(0.0, 0.15 - 0.02 × extra_steps^1.2)
If steps_taken < min_steps: score = max(0.0, 0.15 - 0.05 × missing_steps)
The exponent 1.2 makes the penalty accelerate as the agent wastes more steps β each additional wasted step costs slightly more than the previous one.
Early submission (fewer steps than the minimum) is also penalised because the agent almost certainly skipped a required source – but that case is already penalised more heavily by the evidence score.
If steps_taken > max_steps (which is len(required_sources) × 3 + 2), the grader immediately returns 0.0 and skips all sub-scores. This is the hard ceiling.
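The three branches above translate to a short function. A sketch under the stated rules (the max_steps hard ceiling is handled in grade() itself, before this runs):

```python
# Sketch of the efficiency sub-score.
def efficiency_score(steps_taken, num_required_sources):
    min_steps = num_required_sources + 1   # inspect everything, then submit
    if steps_taken == min_steps:
        return 0.15                        # full reward
    if steps_taken > min_steps:
        extra = steps_taken - min_steps
        # exponent 1.2 makes each wasted step cost slightly more
        return max(0.0, 0.15 - 0.02 * extra ** 1.2)
    missing = min_steps - steps_taken
    return max(0.0, 0.15 - 0.05 * missing)
```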
1.5 Fix Score – up to +0.15 (or -0.05)
This sub-score checks whether the agent's suggested fix is actionable and correct.
Each scenario has a correct_fix string (e.g. "enable gradient clipping (clip_grad_norm=1.0)"). The grader tokenises that string, strips stop words (to, a, the, and, or, use, set, by), and keeps content words longer than 2 characters.
It then counts how many of those content words appear anywhere in the agent's suggested_fix string.
| Match ratio | Score |
|---|---|
| 100% of keywords matched | +0.15 |
| ≥ 60% matched | +0.10 |
| ≥ 30% matched | +0.05 |
| < 30% matched | 0.00 |
| No fix provided | -0.05 |
Omitting the fix always costs points. Even a partially correct fix earns more than silence.
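The tokenise-then-match procedure can be sketched as follows. The regex and exact tokenisation are assumptions; only the stop-word list and thresholds come from the description above:

```python
import re

STOP_WORDS = {"to", "a", "the", "and", "or", "use", "set", "by"}

def fix_score(suggested_fix, correct_fix):
    # Sketch of the fix sub-score.
    if not suggested_fix or not suggested_fix.strip():
        return -0.05                       # no fix provided
    # Content words of the reference fix: no stop words, longer than 2 chars.
    content = [w for w in re.findall(r"[a-z0-9_.]+", correct_fix.lower())
               if w not in STOP_WORDS and len(w) > 2]
    text = suggested_fix.lower()
    ratio = sum(1 for w in content if w in text) / len(content)
    if ratio >= 1.0:
        return 0.15
    if ratio >= 0.6:
        return 0.10
    if ratio >= 0.3:
        return 0.05
    return 0.0
```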
1.6 Ordering Bonus – +0.05
A small bonus for inspecting required sources in the canonical order: logs → config → gradients.
The grader extracts the subsequence of inspected sources that are required (ignoring any irrelevant ones inspected in between), and checks whether that subsequence matches the canonical order for the scenario.
For an easy scenario (only logs required) the bonus is trivially earned. For hard scenarios (all three required) the agent must visit them in order.
This rewards structured investigation β the same order a human engineer would follow.
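The subsequence check can be sketched like this (names are illustrative):

```python
CANONICAL_ORDER = ["logs", "config", "gradients"]

def ordering_bonus(required_sources, inspection_order):
    # Sketch: take first visits to required sources, ignoring irrelevant
    # ones in between, and award the bonus only on an exact match with
    # the canonical order restricted to this scenario's requirements.
    expected = [s for s in CANONICAL_ORDER if s in required_sources]
    visited = []
    for src in inspection_order:
        if src in required_sources and src not in visited:
            visited.append(src)
    return 0.05 if visited == expected else 0.0
```

Because the comparison is against the full required subsequence, skipping a required source also forfeits the bonus.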
1.7 Maximum Achievable Scores
| Sub-score | Max | Min |
|---|---|---|
| Diagnosis | +0.70 | 0.00 |
| Evidence-Diagnosis Penalty | 0.00 | -0.10 |
| Evidence | +0.25 | -0.15 |
| Efficiency | +0.15 | 0.00 |
| Fix | +0.15 | -0.05 |
| Ordering Bonus | +0.05 | 0.00 |
| Total (clamped) | 1.00 | 0.00 |
The theoretical maximum without the fix score is 0.70 + 0.25 + 0.15 + 0.05 = 1.15, which is clamped to 1.00. That surplus is headroom: a near-perfect run can absorb small penalties elsewhere and still land on a clean 1.00.
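The summation and clamp behind the table is a one-liner. A sketch, assuming grade() sums the six sub-scores exactly as described:

```python
# Sketch: the six sub-scores are summed, then clamped to [0.0, 1.0].
def total_score(subscores):
    return max(0.0, min(1.0, sum(subscores)))
```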
Part 2 – Step-Level Rewards
The grader described above runs only at submit_diagnosis. But the environment also emits a small reward on every inspection step, giving the agent an in-episode signal.
These step rewards come from _inspect_reward() in the environment, not from graders.py.
| Situation | Step Reward |
|---|---|
| First required source discovered | +0.10 |
| Second required source discovered | +0.07 |
| Third required source discovered | +0.05 |
| Irrelevant source inspected | -0.03 |
| Re-inspecting a source already seen | -0.05 |
The decaying reward schedule (+0.10 → +0.07 → +0.05) reflects that each additional required source is slightly less surprising than the first. It also gives a larger signal for discovering the first clue, which is usually the most diagnostic piece of evidence.
These step rewards are reported in [STEP] lines during inference but they are not included in the final episode score β grade() computes the terminal score independently.
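The step-reward table can be sketched as a small function. The signature is an assumption; the source refers to this logic only as _inspect_reward() in the environment:

```python
# Sketch of the per-step reward emitted on each inspection action.
def inspect_reward(source, required_sources, already_inspected):
    if source in already_inspected:
        return -0.05                           # re-inspection
    if source not in required_sources:
        return -0.03                           # irrelevant source
    found = sum(1 for s in already_inspected if s in required_sources)
    return [0.10, 0.07, 0.05][min(found, 2)]   # decaying discovery reward
```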
Part 3 – LLM Judge
File: server/llm_judge.py
Entry point: judge(client, model, diagnosis, reasoning, suggested_fix, scenario, inspection_order)
3.1 What It Evaluates
The programmatic grader is blind to how the agent arrived at its answer. An agent could get lucky with the right keyword and still have incoherent reasoning. The LLM judge evaluates the quality of the agent's reasoning across three criteria:
| Criterion | Question it asks | Max |
|---|---|---|
| evidence_grounding | Does the reasoning cite specific values from the data the agent actually saw? | 5 |
| causal_chain | Does it logically connect that evidence to the diagnosed failure mode? | 5 |
| fix_rationale | Is the fix directly justified by the evidence and diagnosis? | 5 |
Raw score range: 0–15. Normalised to 0.0–1.0 by dividing by 15.
3.2 When It Runs
The judge runs after the episode ends – specifically after submit_diagnosis returns and the WebSocket connection is closed. This is intentional: the WebSocket session is single-use, so the judge call can't interfere with the agent's action loop.
Agent calls submit_diagnosis
        │
        ├──► environment returns final_reward, done=True
        │
        ├──► WebSocket closes
        │
        ├──► inference.py calls judge()
        │
        ├──► one LLM call → score
        │
        └──► Final Score = 0.85 × keyword + 0.15 × judge
3.3 What the Judge Sees
The judge prompt is built from:
- The agent's diagnosis, suggested_fix, and reasoning strings
- The data the agent actually inspected, reconstructed from inspection_order and the scenario's source data
The judge is given only what the agent had access to, not the full scenario. This prevents the judge from penalising an agent for not citing data it never saw.
seen = {}
if "logs" in inspection_order:
seen["training_logs"] = scenario["logs"]
if "config" in inspection_order:
seen["config"] = scenario["config"]
if "gradients" in inspection_order:
seen["gradient_norms"] = scenario["gradient_norms"]
3.4 The Judge Prompt
The judge receives a single-turn user message (no system prompt). It is asked to return a JSON object with integer scores:
{"evidence_grounding": <int>, "causal_chain": <int>, "fix_rationale": <int>}
Temperature is set to 0.0 for determinism. max_tokens=64 is enough for the JSON response and prevents runaway output.
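Parsing and normalising the reply might look like the following sketch. The function name is an assumption; the JSON shape and the divide-by-15 normalisation come from the sections above:

```python
import json

def parse_judge_response(text):
    # Sketch: parse the judge's JSON reply and normalise 0-15 to 0.0-1.0.
    # Returns None on malformed output so the caller can fall back.
    try:
        scores = json.loads(text)
        raw = (scores["evidence_grounding"]
               + scores["causal_chain"]
               + scores["fix_rationale"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return raw / 15.0
```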
3.5 Fallback Behaviour
If the agent omits the reasoning field, the judge returns None immediately (no API call made). The caller treats None as "judge unavailable" and uses keyword score at full weight:
judge_score is None → Final Score = 1.0 × keyword_score
judge_score is float → Final Score = 0.85 × keyword_score + 0.15 × judge_score
If the LLM call itself fails (network error, parse error, etc.), the judge also returns None and logs the exception to stderr. The episode score is never blocked waiting on the judge.
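The combination rule, including the fallback, reduces to a few lines. A sketch (the function name is illustrative):

```python
# Sketch of the final combination, including the judge-unavailable fallback.
def final_score(keyword_score, judge_score):
    if judge_score is None:
        return keyword_score                  # keyword score at full weight
    combined = 0.85 * keyword_score + 0.15 * judge_score
    return max(0.0, min(1.0, combined))       # safeguard clamp
```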
3.6 Weight Rationale
The 85/15 split keeps the judge in a supporting role. The programmatic grader is:
- Fast – no extra API call during the episode
- Deterministic – same input always produces same score
- Directly tied to the correct answer – keyword matching is unambiguous
The judge adds 15% for reasoning quality. This is enough to meaningfully separate agents that cite evidence from those that guess correctly without explaining why, but not enough to override a solid keyword score with a harsh reasoning judgment.
Part 4 – Combined Final Score
Final Score = clamp(0.85 × keyword_score + 0.15 × judge_score, 0.0, 1.0)
Both the keyword score and the judge score are already in [0.0, 1.0], so the final score is also in that range without needing clamping in practice. The clamp is a safeguard.
Example – perfect run:
- keyword_score = 1.00 (correct label, all sources, minimum steps, full fix, correct order)
- judge_score = 1.00 (cites specific numbers, clean causal chain, fix matches evidence)
- Final = 0.85 × 1.00 + 0.15 × 1.00 = 1.00
Example – correct label, poor reasoning:
- keyword_score = 0.90 (correct label, good evidence, slightly wasteful)
- judge_score = 0.40 (vague reasoning, no specific numbers cited)
- Final = 0.85 × 0.90 + 0.15 × 0.40 = 0.765 + 0.060 = 0.825
Example – reasoning provided but LLM call fails:
- keyword_score = 0.90
- judge_score = None
- Final = 1.00 × 0.90 = 0.90
Summary
| Component | File | Weight | What it measures |
|---|---|---|---|
| Diagnosis Score | graders.py | Up to 0.70 of keyword_score | Correct failure mode label |
| Evidence-Diagnosis Penalty | graders.py | Up to -0.10 of keyword_score | Reasoning failure despite having evidence |
| Evidence Score | graders.py | Up to 0.25 of keyword_score | Right data sources inspected |
| Efficiency Score | graders.py | Up to 0.15 of keyword_score | Steps taken vs minimum |
| Fix Score | graders.py | Up to 0.15 of keyword_score | Actionable and correct fix |
| Ordering Bonus | graders.py | +0.05 of keyword_score | Canonical inspection order |
| LLM Judge | llm_judge.py | 15% of final score | Reasoning quality (evidence citation, causal logic, fix rationale) |