v0.7.2: Arena CI + Contamination Prior + v0.7 Help/Inventory documentation
Closes the v0.7 anti-bullshit pack with two more browser-only modes, plus full Help and Inventory documentation for all 4 features in EN/ES/FR/ZH.
NEW MODES (research-driven, sourced from HF community pain points)
🎯 Arena CI — Arena-Elo CI reconstructor (anti-bullshit #3)
- js/arena_ci.js: parseVotesCSV + Bradley-Terry MM-MLE + 200-iteration bootstrap + statistical-tie detection (CI overlap)
- Solves: Chatbot Arena strips confidence intervals from its public leaderboard. A 5-Elo gap can be statistically meaningless. Tool reconstructs CIs from raw vote data + flags pairs whose CIs overlap.
- Embedded SAMPLE_VOTES_CSV (6 models × 96 votes). One click to demo.
- Sim: 39/39 passed. Adversarial 3-model fixture confirms B↔C correctly identified as tied while A clearly distinguishable.
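- A minimal usage sketch of the new exports (parseVotesCSV, computeArenaCI and SAMPLE_VOTES_CSV are exactly the names added in js/arena_ci.js below; the comments describe the returned shape, not a captured run):

    import { parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV } from "./js/arena_ci.js";

    const votes = parseVotesCSV(SAMPLE_VOTES_CSV);   // 96 {model_a, model_b, winner} records
    const { ratings, ties } = computeArenaCI(votes, { bootstrapN: 200, ciLevel: 0.95 });
    for (const r of ratings) {
      console.log(`#${r.rank} ${r.model}  Elo ${r.elo}  95% CI [${r.ci_low}, ${r.ci_high}]`);
    }
    // Pairs whose CIs overlap: "A beats B" is not statistically supported for these.
    for (const t of ties) {
      console.log(`tie: ${t.model_a} / ${t.model_b} (Elo gap ${t.elo_diff.toFixed(1)})`);
    }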
🧪 Contamination Prior — anti-bullshit #4
- js/contamination_prior.js: BENCHMARK_DB of 20 entries (MMLU/HellaSwag/ARC/TruthfulQA/GSM8K/HumanEval/MBPP/BBH/IFEval/MuSR/GPQA/MATH-500/AIME-24/Winogrande/BoolQ/DROP/TriviaQA/SQuAD/MMLU-Pro/MATH).
- computeContaminationPrior + rateAllBenchmarks. Time-prior curve + corpus boost + leak_factor → P(contam) → high/medium/low buckets.
- Solves: Open LLM Leaderboard v1 was killed in 2024 because MMLU/HellaSwag scores were contaminated. Tool gives users a calibrated prior so they know which scores to trust.
- Sim: 35/35 passed. Llama-3 (2023-03 cutoff) → 13 high-risk, 1 medium, 6 low (post-2023 benchmarks correctly land in low).
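- A minimal sketch of the batch API added in js/contamination_prior.js; the MMLU numbers in the comments just walk the curve and caps defined in that file:

    import { rateAllBenchmarks, computeContaminationPrior } from "./js/contamination_prior.js";

    // MMLU (released 2020-09) vs a 2023-03 cutoff: gap ≈ 30 months → timePrior ≈ 0.78,
    // +0.10 corpus boost, +0.18 leak_factor → raw 1.06, capped at 0.97 → "high".
    const mmlu = computeContaminationPrior("2023-03", "mmlu");
    console.log(mmlu.prior, mmlu.level, mmlu.advice_code); // 0.97 "high" "treat_unreliable"

    const report = rateAllBenchmarks("2023-03");                 // sorted worst-first by prior
    console.log(report.filter(r => r.level === "high").length);  // 13, per the sim note above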
DOCUMENTATION
Help modal: new "🆕 v0.7 — Anti-bullshit pack" section with detailed write-up per feature (problem → solution → use case) in 4 langs.
Inventory modal: new 5th card "🆕 v0.7 anti-bullshit pack" listing all 4 features tersely.
modes.tip: now lists 11 modes including 🪟 Unmask, 📜 Chat-template, 🎯 Arena CI, 🧪 Contamination.
i18n: 533 keys × 4 langs · 0 missing / 0 extra (parity verified across EN/ES/FR/ZH).
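The parity claim is mechanically checkable. A sketch of the kind of check implied, assuming TRANSLATIONS maps each language code to a flat key → string object (the exact shape is not shown in this diff):

    import { TRANSLATIONS } from "./js/i18n.js";

    const langs = Object.keys(TRANSLATIONS);                  // expected: en, es, fr, zh
    const ref = new Set(Object.keys(TRANSLATIONS[langs[0]])); // 533 keys if the claim holds
    for (const lang of langs.slice(1)) {
      const keys = new Set(Object.keys(TRANSLATIONS[lang]));
      const missing = [...ref].filter(k => !keys.has(k));
      const extra = [...keys].filter(k => !ref.has(k));
      console.log(lang, "missing:", missing.length, "extra:", extra.length); // 0 / 0 expected
    }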
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- index.html +71 -0
- js/arena_ci.js +292 -0
- js/contamination_prior.js +133 -0
- js/i18n.js +344 -4
- js/main.js +243 -2
- style.css +28 -0
index.html

@@ -189,6 +189,21 @@
   <li data-i18n="help.add_models.manual"><strong>Manual</strong>: fill the form fields directly with values from the model card.</li>
   </ul>

+  <h3 style="margin-top: 1.5em;" data-i18n="help.v07.title">🆕 v0.7 — Anti-bullshit pack (4 new modes)</h3>
+  <p style="opacity: 0.85;" data-i18n="help.v07.intro"><em>v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.</em></p>
+
+  <p><strong data-i18n="help.v07.unmask.title">🪟 Context Unmasker</strong></p>
+  <p data-i18n="help.v07.unmask.body">Detects when <code>max_position_embeddings</code> is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. <em>Use case</em>: before paying GPU for 32k context, verify the model actually attends that far.</p>
+
+  <p><strong data-i18n="help.v07.template.title">📜 Chat-template Sniffer</strong></p>
+  <p data-i18n="help.v07.template.body">Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting <code>--apply_chat_template</code> silently halves multi-turn accuracy. <em>Use case</em>: before reporting a benchmark score, confirm you applied the template correctly.</p>
+
+  <p><strong data-i18n="help.v07.arena.title">🎯 Arena-Elo CI Reconstructor</strong></p>
+  <p data-i18n="help.v07.arena.body">Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring "model A beats model B", verify their CIs don't overlap.</p>
+
+  <p><strong data-i18n="help.v07.contam.title">🧪 Contamination Prior</strong></p>
+  <p data-i18n="help.v07.contam.body">Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.</p>
+
   <h3 data-i18n="help.audit.title">The audit chain</h3>
   <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
   output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer

@@ -287,6 +302,15 @@
   <li data-i18n="inv.export.registry">Submit to community registry on GitHub</li>
   </ul>
   </details>
+  <details class="inv-card" open>
+    <summary class="inv-card-title" data-i18n="inv.v07.title">🆕 v0.7 anti-bullshit pack</summary>
+    <ul>
+      <li data-i18n="inv.v07.unmask"><strong>🪟 Unmask</strong> — config.json claims 32k? See if it actually attends that far</li>
+      <li data-i18n="inv.v07.template"><strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy</li>
+      <li data-i18n="inv.v07.arena"><strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides</li>
+      <li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
+    </ul>
+  </details>
   </div>

   <details class="arch-supported" open>

@@ -336,6 +360,8 @@
   <button class="mode-btn" data-mode="phase" role="tab" aria-selected="false" data-i18n="modes.phase">📊 Phase diagram</button>
   <button class="mode-btn" data-mode="unmask" role="tab" aria-selected="false" data-i18n="modes.unmask">🪟 Unmask</button>
   <button class="mode-btn" data-mode="template" role="tab" aria-selected="false" data-i18n="modes.template">📜 Chat-template</button>
+  <button class="mode-btn" data-mode="arena" role="tab" aria-selected="false" data-i18n="modes.arena">🎯 Arena CI</button>
+  <button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
   </div>
   <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
   <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),

@@ -681,6 +707,51 @@
   <div id="template-output" style="margin-top: 1em;"></div>
   </section>

+  <!-- Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3) -->
+  <section id="arena-section" style="display:none;">
+    <h2><span data-i18n="arena.title">🎯 Arena-Elo CI Reconstructor</span>
+      <span class="info"><span class="tooltip" data-i18n="arena.tip">
+        Chatbot Arena strips confidence intervals from the public leaderboard.
+        A 5-Elo gap can be statistically meaningless. Paste raw vote data
+        (model_a, model_b, winner) — the tool computes Bradley-Terry MLE +
+        bootstrap CIs and lists statistical ties (CI overlap).
+      </span></span>
+    </h2>
+    <p class="recipe-desc" data-i18n="arena.desc">
+      <strong>Is GPT-4 actually better than Claude — or are they tied?</strong> Paste pairwise vote CSV (or click <em>Load sample</em>). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.
+    </p>
+    <div class="form-row">
+      <button type="button" id="arena-sample-btn" data-i18n="arena.sample_btn">📊 Load sample data</button>
+      <button type="button" id="arena-run-btn" data-i18n="arena.run_btn">🎯 Compute CIs</button>
+      <button type="button" id="arena-clear-btn" class="secondary" data-i18n="arena.clear_btn">🗑️ Clear</button>
+    </div>
+    <p id="arena-status" class="recipe-desc" style="font-size:0.92em;"></p>
+    <details style="margin: 0.6em 0;" open>
+      <summary style="cursor:pointer; font-size:0.92em;" data-i18n="arena.csv_summary">Vote CSV (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)</summary>
+      <textarea id="arena-csv" rows="10" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder="model_a,model_b,winner GPT-4,Claude,a Llama-3,Mixtral,tie ..."></textarea>
+    </details>
+    <div id="arena-output" style="margin-top: 1em;"></div>
+  </section>
+
+  <!-- Contamination prior (v0.7.3 anti-bullshit pack #4) -->
+  <section id="contam-section" style="display:none;">
+    <h2><span data-i18n="contam.title">🧪 Contamination Prior</span>
+      <span class="info"><span class="tooltip" data-i18n="contam.tip">
+        Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.
+      </span></span>
+    </h2>
+    <p class="recipe-desc" data-i18n="contam.desc">
+      <strong>Should you trust your model's MMLU score?</strong> Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.
+    </p>
+    <div class="form-row">
+      <label for="contam-cutoff" data-i18n="contam.cutoff_label">Training cutoff:</label>
+      <input type="text" id="contam-cutoff" placeholder="2023-12 or 2024-01" style="max-width:14em;" />
+      <button type="button" id="contam-run-btn" data-i18n="contam.run_btn">🧪 Rate all benchmarks</button>
+    </div>
+    <p id="contam-status" class="recipe-desc" style="font-size:0.92em;"></p>
+    <div id="contam-output" style="margin-top: 1em;"></div>
+  </section>
+
   <!-- Recipe selector (mode=recipe) -->
   <section id="recipe-section" style="display:none;">
   <h2 data-i18n="recipe.title">📋 Recipe</h2>
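The js/main.js diff (+243 lines) is not shown in this excerpt. A hypothetical sketch of how #arena-run-btn presumably wires this section to js/arena_ci.js; renderArenaResult and the literal status string are illustrative stand-ins, not the actual main.js code:

    import { parseVotesCSV, computeArenaCI } from "./arena_ci.js";

    document.getElementById("arena-run-btn").addEventListener("click", () => {
      const status = document.getElementById("arena-status");
      try {
        const votes = parseVotesCSV(document.getElementById("arena-csv").value);
        const t0 = performance.now();
        const result = computeArenaCI(votes);
        renderArenaResult(document.getElementById("arena-output"), result); // hypothetical renderer
        // The real main.js presumably formats this via the i18n key arena.status.done:
        status.textContent = `✅ ${result.summary.total_votes} votes · ${result.summary.n_models} models · ` +
                             `${result.summary.n_ties} statistical ties · ${Math.round(performance.now() - t0)} ms`;
      } catch (e) {
        status.textContent = `❌ ${e.message}`;
      }
    });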
js/arena_ci.js (new file)

@@ -0,0 +1,292 @@
// Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3)
// Recovers confidence intervals from raw pairwise vote data using
// Bradley-Terry MLE + bootstrap. Chatbot Arena strips CIs from its public
// leaderboard; this lets a user compute them from any vote CSV.
// Pure logic — no human-readable strings. main.js renders via i18n.

// Parse CSV into vote records. Accepts header row + 3 columns:
// model_a, model_b, winner (winner ∈ {a, b, tie, model_a, model_b})
// Tolerates extra whitespace and case-insensitive header matching.
export function parseVotesCSV(text) {
  const lines = text.split(/\r?\n/).map(l => l.trim()).filter(l => l && !l.startsWith("#"));
  if (lines.length < 2) throw new Error("CSV needs at least a header + 1 data row.");
  const header = lines[0].split(",").map(s => s.trim().toLowerCase());

  const colA = header.findIndex(h => h === "model_a" || h === "a" || h === "model a");
  const colB = header.findIndex(h => h === "model_b" || h === "b" || h === "model b");
  const colW = header.findIndex(h => h === "winner" || h === "result" || h === "outcome");
  if (colA < 0 || colB < 0 || colW < 0) {
    throw new Error("Header must include columns: model_a, model_b, winner.");
  }

  const votes = [];
  for (let i = 1; i < lines.length; i++) {
    const row = lines[i].split(",").map(s => s.trim());
    if (row.length < Math.max(colA, colB, colW) + 1) continue;
    const a = row[colA], b = row[colB];
    const w = row[colW].toLowerCase();
    if (!a || !b) continue;
    let winner;
    if (w === "a" || w === "model_a" || w === a.toLowerCase()) winner = "a";
    else if (w === "b" || w === "model_b" || w === b.toLowerCase()) winner = "b";
    else if (w === "tie" || w === "draw" || w === "both" || w === "neither") winner = "tie";
    else continue; // skip unrecognized
    votes.push({ model_a: a, model_b: b, winner });
  }
  return votes;
}

// Bradley-Terry MLE via Minorization-Maximization (Hunter 2004).
// Each iteration: theta_i ← wins_i / Σ_j (matches_ij / (theta_i + theta_j)).
// Ties count as half-win to each side. Returns a Float64Array of theta values
// (positive scale) aligned with the `models` order.
function fitBradleyTerry(votes, models, opts = {}) {
  const { maxIter = 100, tol = 1e-7 } = opts;
  const n = models.length;
  const idx = Object.fromEntries(models.map((m, i) => [m, i]));
  const wins = new Float64Array(n);
  const matches = Array.from({ length: n }, () => new Float64Array(n));

  for (const v of votes) {
    const a = idx[v.model_a], b = idx[v.model_b];
    if (a === undefined || b === undefined) continue;
    matches[a][b] += 1;
    matches[b][a] += 1;
    if (v.winner === "a") wins[a] += 1;
    else if (v.winner === "b") wins[b] += 1;
    else if (v.winner === "tie") { wins[a] += 0.5; wins[b] += 0.5; }
  }

  let theta = new Float64Array(n).fill(1.0);
  for (let iter = 0; iter < maxIter; iter++) {
    const next = new Float64Array(n);
    for (let i = 0; i < n; i++) {
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (i !== j && matches[i][j] > 0) {
          denom += matches[i][j] / (theta[i] + theta[j]);
        }
      }
      const w = wins[i] || 1e-9; // avoid exact zero: a winless model would collapse theta to 0
      next[i] = w / (denom || 1e-9);
    }
    // normalize so geometric mean = 1 → keeps Elo identifiable
    let logSum = 0;
    for (let i = 0; i < n; i++) logSum += Math.log(next[i] || 1e-12);
    const gm = Math.exp(logSum / n);
    for (let i = 0; i < n; i++) next[i] /= gm;
    // convergence check
    let maxDelta = 0;
    for (let i = 0; i < n; i++) maxDelta = Math.max(maxDelta, Math.abs(next[i] - theta[i]));
    theta = next;
    if (maxDelta < tol) break;
  }
  return theta;
}

// Convert BT theta → Elo (anchor: geometric-mean model = 1500).
function thetaToElo(theta) { return Array.from(theta).map(t => 400 * Math.log10(t) + 1500); }

// Bootstrap percentile CIs. Resamples votes with replacement B times,
// refits BT each time, returns {ci_low, ci_high} per model.
function bootstrapCIs(votes, models, opts = {}) {
  const { B = 200, ci = 0.95 } = opts;
  const samples = Array.from({ length: models.length }, () => []);
  const N = votes.length;
  for (let b = 0; b < B; b++) {
    const resample = new Array(N);
    for (let k = 0; k < N; k++) resample[k] = votes[(Math.random() * N) | 0];
    const eloRow = thetaToElo(fitBradleyTerry(resample, models, { maxIter: 50 }));
    for (let i = 0; i < models.length; i++) samples[i].push(eloRow[i]);
  }
  const loIdx = Math.floor((1 - ci) / 2 * B);
  const hiIdx = Math.floor((1 - (1 - ci) / 2) * B);
  return samples.map(s => {
    s.sort((a, b) => a - b);
    return { ci_low: s[loIdx], ci_high: s[Math.min(hiIdx, B - 1)] };
  });
}

// Detect statistical ties: pairs of models whose bootstrap CIs overlap, i.e. the
// higher-ranked model's lower bound falls below the lower-ranked model's upper bound.
function findTies(ratings) {
  const ties = [];
  const sorted = [...ratings].sort((a, b) => b.elo - a.elo);
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      const a = sorted[i], b = sorted[j];
      // CI overlap: a.ci_low <= b.ci_high (a's lower bound below b's upper bound)
      if (a.ci_low <= b.ci_high) {
        const eloDiff = a.elo - b.elo;
        const totalSpread = (a.ci_high - a.ci_low) + (b.ci_high - b.ci_low);
        const overlap = Math.max(0, b.ci_high - a.ci_low);
        ties.push({
          rank_a: i + 1, rank_b: j + 1,
          model_a: a.model, model_b: b.model,
          elo_diff: eloDiff,
          overlap_elo: overlap,
          combined_spread: totalSpread,
        });
      }
    }
  }
  return ties;
}

// Top-level entry. Input = array of {model_a, model_b, winner}.
// Output = ranked ratings + ties + summary.
export function computeArenaCI(votes, opts = {}) {
  if (!Array.isArray(votes) || votes.length === 0) {
    return { ratings: [], ties: [], summary: { total_votes: 0, n_models: 0, n_ties: 0 } };
  }
  const modelSet = new Set();
  for (const v of votes) { modelSet.add(v.model_a); modelSet.add(v.model_b); }
  const models = [...modelSet].sort();

  // Per-model raw counts
  const stats = Object.fromEntries(models.map(m => [m, { wins: 0, losses: 0, ties: 0, matches: 0 }]));
  for (const v of votes) {
    stats[v.model_a].matches++;
    stats[v.model_b].matches++;
    if (v.winner === "a") { stats[v.model_a].wins++; stats[v.model_b].losses++; }
    else if (v.winner === "b") { stats[v.model_b].wins++; stats[v.model_a].losses++; }
    else { stats[v.model_a].ties++; stats[v.model_b].ties++; }
  }

  // Point-estimate Elo
  const theta = fitBradleyTerry(votes, models, { maxIter: 100 });
  const elos = thetaToElo(theta);
  // Bootstrap CIs
  const cis = bootstrapCIs(votes, models, { B: opts.bootstrapN ?? 200, ci: opts.ciLevel ?? 0.95 });

  const ratings = models.map((m, i) => ({
    model: m,
    elo: Math.round(elos[i] * 10) / 10,
    ci_low: Math.round(cis[i].ci_low * 10) / 10,
    ci_high: Math.round(cis[i].ci_high * 10) / 10,
    ci_width: Math.round((cis[i].ci_high - cis[i].ci_low) * 10) / 10,
    matches: stats[m].matches,
    wins: stats[m].wins,
    losses: stats[m].losses,
    ties_count: stats[m].ties,
  })).sort((a, b) => b.elo - a.elo);

  // Recompute ranks after sort
  ratings.forEach((r, i) => { r.rank = i + 1; });

  const ties = findTies(ratings);

  return {
    ratings,
    ties,
    summary: {
      total_votes: votes.length,
      n_models: models.length,
      n_ties: ties.length,
      bootstrap_iters: opts.bootstrapN ?? 200,
      ci_level: opts.ciLevel ?? 0.95,
    },
  };
}

// Embedded sample data so users can demo the tool without their own CSV.
// 6 models, 96 votes, designed so 2 pairs are statistically tied and the
// top model is clearly distinguishable from the bottom.
export const SAMPLE_VOTES_CSV = `# Synthetic Arena-style sample: 6 models, 96 votes.
# True underlying skill (in arbitrary units): GPT-4=1.6, Claude=1.5, Llama-3=1.0, Mixtral=0.95, Gemma=0.6, Phi=0.5
model_a,model_b,winner
GPT-4,Claude,a
Claude,GPT-4,b
GPT-4,Llama-3,a
GPT-4,Llama-3,a
GPT-4,Llama-3,a
GPT-4,Mixtral,a
GPT-4,Mixtral,a
GPT-4,Mixtral,a
GPT-4,Gemma,a
GPT-4,Gemma,a
GPT-4,Gemma,a
GPT-4,Gemma,a
GPT-4,Phi,a
GPT-4,Phi,a
GPT-4,Phi,a
GPT-4,Phi,a
GPT-4,Phi,a
Claude,Llama-3,a
Claude,Llama-3,a
Claude,Llama-3,a
Claude,Mixtral,a
Claude,Mixtral,a
Claude,Mixtral,a
Claude,Gemma,a
Claude,Gemma,a
Claude,Gemma,a
Claude,Phi,a
Claude,Phi,a
Claude,Phi,a
Claude,Phi,a
GPT-4,Claude,tie
Claude,GPT-4,tie
GPT-4,Claude,a
Claude,GPT-4,a
Llama-3,Mixtral,tie
Llama-3,Mixtral,a
Mixtral,Llama-3,a
Llama-3,Mixtral,b
Mixtral,Llama-3,b
Llama-3,Mixtral,tie
Llama-3,Mixtral,a
Mixtral,Llama-3,a
Llama-3,Gemma,a
Llama-3,Gemma,a
Llama-3,Gemma,a
Llama-3,Phi,a
Llama-3,Phi,a
Mixtral,Gemma,a
Mixtral,Gemma,a
Mixtral,Phi,a
Mixtral,Phi,a
Gemma,Phi,tie
Phi,Gemma,tie
Gemma,Phi,a
Phi,Gemma,a
Gemma,Phi,b
Phi,Gemma,b
Gemma,Phi,a
Phi,Gemma,a
GPT-4,Llama-3,b
Claude,Mixtral,b
Llama-3,Phi,a
Llama-3,Gemma,b
Mixtral,Phi,b
Gemma,Phi,a
GPT-4,Mixtral,a
Claude,Llama-3,a
GPT-4,Phi,a
Claude,Gemma,a
GPT-4,Gemma,a
Claude,Phi,a
Llama-3,Mixtral,a
Mixtral,Llama-3,a
GPT-4,Claude,a
Claude,GPT-4,b
GPT-4,Claude,b
Claude,GPT-4,a
GPT-4,Mixtral,a
Claude,Phi,a
Mixtral,Gemma,a
Llama-3,Gemma,a
GPT-4,Llama-3,a
Claude,Mixtral,a
Mixtral,Phi,a
Llama-3,Phi,a
Gemma,Phi,a
Phi,Gemma,b
GPT-4,Gemma,a
Claude,Gemma,a
GPT-4,Phi,a
Claude,Phi,a
Llama-3,Mixtral,b
Mixtral,Llama-3,b
GPT-4,Claude,tie
Llama-3,Mixtral,tie
Gemma,Phi,tie`;
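The theta → Elo mapping in thetaToElo above is the standard Bradley-Terry change of base; a quick sanity check of the scale:

    const elo = t => 400 * Math.log10(t) + 1500;   // same formula as thetaToElo
    console.log(elo(1));   // 1500 (the geometric-mean anchor)
    console.log(elo(2));   // ≈ 1620.4
    // A 10× theta ratio is exactly 400 Elo apart; the BT win probability is then
    // 10/11 ≈ 0.909, matching the classical Elo expected score 1/(1 + 10^(-400/400)).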
js/contamination_prior.js (new file)

@@ -0,0 +1,133 @@
// Contamination Prior (v0.7.3 anti-bullshit pack #4)
// Bayesian-ish prior on whether a benchmark score is contaminated, based on
// (model training cutoff date) × (benchmark release date) × (known leak status).
// Pure logic — no human strings. Open LLM Leaderboard v1 (MMLU/HellaSwag/etc)
// was killed for contamination; this lets a user calibrate trust per score.

// Benchmark database. Each entry tracks release date, whether it's known to
// be in common pretraining corpora (CommonCrawl etc), and a base-rate adjustment
// (incident-driven: confirmed leaks, paraphrased copies in training data, etc).
//
// Sources: arxiv 2404.00699 (contamination survey), HF dataset cards,
// public reproductions / known leak reports.
export const BENCHMARK_DB = {
  // Format: { id, name, released: "YYYY-MM", in_corpora: bool, leak_factor: 0..1, category, paper }
  "mmlu": { id: "mmlu", name: "MMLU", released: "2020-09", in_corpora: true, leak_factor: 0.18, category: "knowledge", paper: "Hendrycks 2020" },
  "mmlu_pro": { id: "mmlu_pro", name: "MMLU-Pro", released: "2024-06", in_corpora: false, leak_factor: 0.05, category: "knowledge", paper: "Wang 2024" },
  "hellaswag": { id: "hellaswag", name: "HellaSwag", released: "2019-05", in_corpora: true, leak_factor: 0.20, category: "commonsense", paper: "Zellers 2019" },
  "arc_challenge": { id: "arc_challenge", name: "ARC Challenge", released: "2018-04", in_corpora: true, leak_factor: 0.15, category: "knowledge", paper: "Clark 2018" },
  "truthfulqa": { id: "truthfulqa", name: "TruthfulQA", released: "2021-09", in_corpora: true, leak_factor: 0.10, category: "truthfulness", paper: "Lin 2021" },
  "gsm8k": { id: "gsm8k", name: "GSM8K", released: "2021-10", in_corpora: true, leak_factor: 0.12, category: "math", paper: "Cobbe 2021" },
  "math": { id: "math", name: "MATH", released: "2021-03", in_corpora: true, leak_factor: 0.10, category: "math", paper: "Hendrycks 2021" },
  "humaneval": { id: "humaneval", name: "HumanEval", released: "2021-07", in_corpora: true, leak_factor: 0.18, category: "code", paper: "Chen 2021" },
  "mbpp": { id: "mbpp", name: "MBPP", released: "2021-08", in_corpora: true, leak_factor: 0.12, category: "code", paper: "Austin 2021" },
  "bbh": { id: "bbh", name: "BIG-Bench Hard (BBH)", released: "2022-10", in_corpora: true, leak_factor: 0.08, category: "reasoning", paper: "Suzgun 2022" },
  "ifeval": { id: "ifeval", name: "IFEval", released: "2023-11", in_corpora: false, leak_factor: 0.05, category: "instruction", paper: "Zhou 2023" },
  "musr": { id: "musr", name: "MuSR", released: "2023-10", in_corpora: false, leak_factor: 0.04, category: "reasoning", paper: "Sprague 2023" },
  "gpqa": { id: "gpqa", name: "GPQA", released: "2023-11", in_corpora: false, leak_factor: 0.04, category: "graduate-knowledge", paper: "Rein 2023" },
  "math500": { id: "math500", name: "MATH-500", released: "2023-11", in_corpora: false, leak_factor: 0.05, category: "math", paper: "Lightman 2023" },
  "aime24": { id: "aime24", name: "AIME 2024", released: "2024-02", in_corpora: false, leak_factor: 0.02, category: "math", paper: "AIME 2024" },
  "winogrande": { id: "winogrande", name: "Winogrande", released: "2019-07", in_corpora: true, leak_factor: 0.15, category: "commonsense", paper: "Sakaguchi 2019" },
  "boolq": { id: "boolq", name: "BoolQ", released: "2019-05", in_corpora: true, leak_factor: 0.15, category: "reading", paper: "Clark 2019" },
  "drop": { id: "drop", name: "DROP", released: "2019-04", in_corpora: true, leak_factor: 0.12, category: "reading", paper: "Dua 2019" },
  "triviaqa": { id: "triviaqa", name: "TriviaQA", released: "2017-05", in_corpora: true, leak_factor: 0.18, category: "knowledge", paper: "Joshi 2017" },
  "squad": { id: "squad", name: "SQuAD", released: "2016-06", in_corpora: true, leak_factor: 0.20, category: "reading", paper: "Rajpurkar 2016" },
};

// Parse "YYYY-MM" or "YYYY-MM-DD" or "YYYY". Returns Date or null.
function parseLooseDate(s) {
  if (!s) return null;
  const m = String(s).trim().match(/^(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?/);
  if (!m) return null;
  const y = parseInt(m[1], 10);
  const mo = m[2] ? Math.max(1, Math.min(12, parseInt(m[2], 10))) : 6;
  const d = m[3] ? Math.max(1, Math.min(28, parseInt(m[3], 10))) : 15;
  return new Date(Date.UTC(y, mo - 1, d));
}

// Time-based base prior. Returns probability that benchmark text was in the
// model's training data given (cutoff - release) gap.
//
// Heuristic curve:
//   gap < 0 (released after cutoff)       → 0.02 (only via leaks)
//   gap 0-3 months                        → 0.10–0.25
//   gap 3-12 months                       → 0.25–0.55
//   gap 12-24 months                      → 0.55–0.75
//   gap > 24 months (heavily reproduced)  → 0.75–0.92
function timePrior(gapMonths) {
  if (gapMonths < 0) return 0.02;
  if (gapMonths === 0) return 0.10;
  if (gapMonths <= 3) return 0.10 + (gapMonths / 3) * 0.15;
  if (gapMonths <= 12) return 0.25 + ((gapMonths - 3) / 9) * 0.30;
  if (gapMonths <= 24) return 0.55 + ((gapMonths - 12) / 12) * 0.20;
  return Math.min(0.92, 0.75 + ((gapMonths - 24) / 36) * 0.17);
}

// Per-benchmark prior: time-prior × in_corpora boost + leak_factor.
// Caps at 0.97 (always some uncertainty).
export function computeContaminationPrior(modelCutoff, benchmarkId) {
  const bench = BENCHMARK_DB[benchmarkId];
  if (!bench) return null;
  const cutoffDate = parseLooseDate(modelCutoff);
  const releaseDate = parseLooseDate(bench.released);
  if (!cutoffDate || !releaseDate) return null;

  const gapMs = cutoffDate.getTime() - releaseDate.getTime();
  const gapMonths = gapMs / (1000 * 60 * 60 * 24 * 30.44);
  const tp = timePrior(gapMonths);
  const corporaBoost = bench.in_corpora ? 0.10 : 0.0;
  const raw = tp + corporaBoost + bench.leak_factor;
  const prior = Math.max(0.01, Math.min(0.97, raw));

  let level;
  if (prior >= 0.65) level = "high";
  else if (prior >= 0.30) level = "medium";
  else level = "low";

  return {
    benchmark: bench.name,
    benchmark_id: bench.id,
    benchmark_released: bench.released,
    benchmark_category: bench.category,
    benchmark_in_corpora: bench.in_corpora,
    benchmark_paper: bench.paper,
    model_cutoff: modelCutoff,
    gap_months: Math.round(gapMonths * 10) / 10,
    time_prior: Math.round(tp * 100) / 100,
    corpora_boost: corporaBoost,
    leak_factor: bench.leak_factor,
    prior: Math.round(prior * 100) / 100,
    level,
    advice_code: level === "high" ? "treat_unreliable" :
                 level === "medium" ? "verify_alternate" : "score_likely_clean",
  };
}

// Batch helper: rate all benchmarks for a given cutoff. Returns array sorted
// by prior descending so the most-contaminated ones surface first.
export function rateAllBenchmarks(modelCutoff) {
  return Object.values(BENCHMARK_DB)
    .map(b => computeContaminationPrior(modelCutoff, b.id))
    .filter(Boolean)
    .sort((a, b) => b.prior - a.prior);
}

// Aggregate verdict for a list of (benchmark_id, reported_score) pairs.
// User pastes their leaderboard scores → tool flags which are likely
// contaminated and which aren't.
export function aggregateScoreSheet(modelCutoff, scoreSheet) {
  const rows = [];
  for (const { benchmark_id, score } of scoreSheet) {
    const p = computeContaminationPrior(modelCutoff, benchmark_id);
    if (p) rows.push({ ...p, reported_score: score });
  }
  rows.sort((a, b) => b.prior - a.prior);
  const counts = { high: 0, medium: 0, low: 0 };
  for (const r of rows) counts[r.level]++;
  return {
    rows,
    counts,
    total: rows.length,
    high_pct: rows.length ? Math.round(counts.high / rows.length * 100) : 0,
  };
}
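aggregateScoreSheet is not exercised by the UI in this diff; a minimal sketch of its input shape (the scores are made-up placeholders, and the priors in the comments follow the curve above for a 2023-03 cutoff):

    import { aggregateScoreSheet } from "./contamination_prior.js";

    const sheet = [
      { benchmark_id: "mmlu", score: 70.1 },   // placeholder score
      { benchmark_id: "gpqa", score: 32.4 },   // placeholder score
    ];
    const { rows, counts, high_pct } = aggregateScoreSheet("2023-03", sheet);
    // mmlu → prior 0.97 ("high"); gpqa released after the cutoff → prior 0.06 ("low")
    console.log(counts, high_pct);   // { high: 1, medium: 0, low: 1 }  50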
|
@@ -224,6 +224,91 @@ export const TRANSLATIONS = {
|
|
| 224 |
"template.status.invalid_json":"❌ Not valid JSON: {error}",
|
| 225 |
"template.status.success_paste":"✅ Sniffed pasted config (verdict: {verdict})",
|
| 226 |
"template.pasted_label": "(pasted tokenizer_config)",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 227 |
"share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
|
| 228 |
"share.import_btn": "📂 Load shared JSON",
|
| 229 |
"synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
|
|
@@ -316,7 +401,7 @@ export const TRANSLATIONS = {
|
|
| 316 |
"common.no": "No",
|
| 317 |
|
| 318 |
// Mode tooltips
|
| 319 |
-
"modes.tip": "<strong>
|
| 320 |
"profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
|
| 321 |
"compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
|
| 322 |
|
|
@@ -872,6 +957,91 @@ export const TRANSLATIONS = {
|
|
| 872 |
"template.status.invalid_json":"❌ JSON inválido: {error}",
|
| 873 |
"template.status.success_paste":"✅ Config pegado detectado (veredicto: {verdict})",
|
| 874 |
"template.pasted_label": "(tokenizer_config pegado)",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 875 |
"share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
|
| 876 |
"share.import_btn": "📂 Cargar JSON compartido",
|
| 877 |
"synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
|
|
@@ -964,7 +1134,7 @@ export const TRANSLATIONS = {
|
|
| 964 |
"common.no": "No",
|
| 965 |
|
| 966 |
// Tooltips de modos
|
| 967 |
-
"modes.tip": "<strong>
|
| 968 |
"profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
|
| 969 |
"compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
|
| 970 |
|
|
@@ -1384,6 +1554,91 @@ export const TRANSLATIONS = {
|
|
| 1384 |
"template.status.invalid_json":"❌ JSON invalide : {error}",
|
| 1385 |
"template.status.success_paste":"✅ Config collé détecté (verdict : {verdict})",
|
| 1386 |
"template.pasted_label": "(tokenizer_config collé)",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1387 |
"share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
|
| 1388 |
"share.import_btn": "📂 Charger JSON partagé",
|
| 1389 |
"synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
|
|
@@ -1476,7 +1731,7 @@ export const TRANSLATIONS = {
|
|
| 1476 |
"common.no": "Non",
|
| 1477 |
|
| 1478 |
// Tooltips des modes
|
| 1479 |
-
"modes.tip": "<strong>
|
| 1480 |
"profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
|
| 1481 |
"compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
|
| 1482 |
|
|
@@ -1896,6 +2151,91 @@ export const TRANSLATIONS = {
|
|
| 1896 |
"template.status.invalid_json":"❌ JSON 无效:{error}",
|
| 1897 |
"template.status.success_paste":"✅ 已检测粘贴的 config(判定:{verdict})",
|
| 1898 |
"template.pasted_label": "(已粘贴 tokenizer_config)",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1899 |
"share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
|
| 1900 |
"share.import_btn": "📂 加载共享的 JSON",
|
| 1901 |
"synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
|
|
@@ -1988,7 +2328,7 @@ export const TRANSLATIONS = {
|
|
| 1988 |
"common.no": "否",
|
| 1989 |
|
| 1990 |
// 模式提示
|
| 1991 |
-
"modes.tip": "<strong>
|
| 1992 |
"profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
|
| 1993 |
"compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
|
| 1994 |
|
|
|
|
| 224 |
"template.status.invalid_json":"❌ Not valid JSON: {error}",
|
| 225 |
"template.status.success_paste":"✅ Sniffed pasted config (verdict: {verdict})",
|
| 226 |
"template.pasted_label": "(pasted tokenizer_config)",
|
| 227 |
+
|
| 228 |
+
// v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
|
| 229 |
+
"modes.arena": "🎯 Arena CI",
|
| 230 |
+
"mode_desc.arena": "Recovers confidence intervals from raw pairwise vote data (Bradley-Terry MLE + bootstrap). Detects statistically tied pairs that the public Arena leaderboard hides.",
|
| 231 |
+
"arena.title": "🎯 Arena-Elo CI Reconstructor",
|
| 232 |
+
"arena.tip": "Chatbot Arena strips confidence intervals from the public leaderboard. A 5-Elo gap can be statistically meaningless. Paste raw vote data (model_a, model_b, winner) — the tool computes Bradley-Terry MLE + bootstrap CIs and lists statistical ties (CI overlap).",
|
| 233 |
+
"arena.desc": "<strong>Is GPT-4 actually better than Claude — or are they tied?</strong> Paste pairwise vote CSV (or click <em>Load sample</em>). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.",
|
| 234 |
+
"arena.sample_btn": "📊 Load sample data",
|
| 235 |
+
"arena.run_btn": "🎯 Compute CIs",
|
| 236 |
+
"arena.clear_btn": "🗑️ Clear",
|
| 237 |
+
"arena.csv_summary": "Vote CSV (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)",
|
| 238 |
+
"arena.section.ranked": "Ranked Elos with 95% CIs",
|
| 239 |
+
"arena.section.ties": "Statistical ties (CI overlap)",
|
| 240 |
+
"arena.section.summary": "Summary",
|
| 241 |
+
"arena.col.rank": "#",
|
| 242 |
+
"arena.col.model": "Model",
|
| 243 |
+
"arena.col.elo": "Elo",
|
| 244 |
+
"arena.col.ci": "95% CI",
|
| 245 |
+
"arena.col.ci_width": "± half-width",
|
| 246 |
+
"arena.col.matches": "Matches",
|
| 247 |
+
"arena.col.wins": "W / L / T",
|
| 248 |
+
"arena.col.tie_pair": "Pair",
|
| 249 |
+
"arena.col.tie_diff": "Elo gap",
|
| 250 |
+
"arena.col.tie_overlap": "CI overlap",
|
| 251 |
+
"arena.no_ties": "No statistical ties — all pairs distinguishable at 95% CI.",
|
| 252 |
+
"arena.summary.votes": "Total votes",
|
| 253 |
+
"arena.summary.models": "Models",
|
| 254 |
+
"arena.summary.ties": "Statistical ties",
|
| 255 |
+
"arena.summary.bootstrap": "Bootstrap iters",
|
| 256 |
+
"arena.summary.ci_level": "CI level",
|
| 257 |
+
"arena.status.empty": "⚠ Paste vote CSV or click Load sample.",
|
| 258 |
+
"arena.status.too_few": "⚠ Only {n} valid votes — need at least 10 to fit Bradley-Terry reliably.",
|
| 259 |
+
"arena.status.computing": "⏳ Computing Bradley-Terry MLE + bootstrap on {n} votes...",
|
| 260 |
+
"arena.status.done": "✅ {n} votes · {models} models · {ties} statistical ties · {ms} ms",
|
| 261 |
+
"arena.status.sample_loaded": "✅ Sample loaded (synthetic 6-model Arena data). Click Compute CIs.",
|
| 262 |
+
|
| 263 |
+
// v0.7.3 — anti-bullshit pack #4: Contamination Prior
|
| 264 |
+
"modes.contam": "🧪 Contamination",
|
| 265 |
+
"mode_desc.contam": "Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff → rates 20+ popular benchmarks (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
|
| 266 |
+
"contam.title": "🧪 Contamination Prior",
|
| 267 |
+
"contam.tip": "Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.",
|
| 268 |
+
"contam.desc": "<strong>Should you trust your model's MMLU score?</strong> Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.",
|
| 269 |
+
"contam.cutoff_label": "Training cutoff:",
|
| 270 |
+
"contam.run_btn": "🧪 Rate all benchmarks",
|
| 271 |
+
"contam.section.ranked": "Benchmark contamination priors",
|
| 272 |
+
"contam.section.high": "🔴 High-risk benchmarks (treat scores as unreliable)",
|
| 273 |
+
"contam.section.medium": "🟡 Medium-risk (verify with alternates)",
|
| 274 |
+
"contam.section.low": "🟢 Low-risk (likely clean)",
|
| 275 |
+
"contam.col.benchmark": "Benchmark",
|
| 276 |
+
"contam.col.released": "Released",
|
| 277 |
+
"contam.col.gap": "Gap (months)",
|
| 278 |
+
"contam.col.prior": "P(contam)",
|
| 279 |
+
"contam.col.level": "Level",
|
| 280 |
+
"contam.col.corpora": "In corpora",
|
| 281 |
+
"contam.col.category": "Category",
|
| 282 |
+
"contam.label.high": "High risk",
|
| 283 |
+
"contam.label.medium": "Medium",
|
| 284 |
+
"contam.label.low": "Low",
|
| 285 |
+
"contam.no_entries": "(none in this category)",
|
| 286 |
+
"contam.advice.high": "Treat these scores as unreliable. Replace with newer / private-test alternates (MMLU-Pro, GPQA, MUSR, MATH-500).",
|
| 287 |
+
"contam.advice.medium": "Take with caution. Look for replication on a held-out subset or community reproductions.",
|
| 288 |
+
"contam.advice.low": "Score likely uncontaminated, but absence of leak is not proof — still cross-check with alternate test.",
|
| 289 |
+
"contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks rated",
|
| 290 |
+
"contam.status.empty": "⚠ Enter a model training cutoff date (e.g. 2023-12).",
|
| 291 |
+
"contam.status.bad_date": "⚠ Bad date format. Use YYYY-MM or YYYY-MM-DD.",
|
| 292 |
+
"contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks rated · {high} high-risk",
|
| 293 |
+
|
| 294 |
+
// v0.7 — Help modal section
|
| 295 |
+
"help.v07.title": "🆕 v0.7 — Anti-bullshit pack (4 new modes)",
|
| 296 |
+
"help.v07.intro": "<em>v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.</em>",
|
| 297 |
+
"help.v07.unmask.title": "🪟 Context Unmasker",
|
| 298 |
+
"help.v07.unmask.body": "Detects when <code>max_position_embeddings</code> is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. <em>Use case</em>: before paying GPU for 32k context, verify the model actually attends that far.",
|
| 299 |
+
"help.v07.template.title": "📜 Chat-template Sniffer",
|
| 300 |
+
"help.v07.template.body": "Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting <code>--apply_chat_template</code> silently halves multi-turn accuracy. <em>Use case</em>: before reporting a benchmark score, confirm you applied the template correctly.",
|
| 301 |
+
"help.v07.arena.title": "🎯 Arena-Elo CI Reconstructor",
|
| 302 |
+
"help.v07.arena.body": "Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a \"statistical ties\" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring \"model A beats model B\", verify their CIs don't overlap.",
|
| 303 |
+
"help.v07.contam.title": "🧪 Contamination Prior",
|
| 304 |
+
"help.v07.contam.body": "Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.",
|
| 305 |
+
|
| 306 |
+
// v0.7 — Inventory modal 5th card
|
| 307 |
+
"inv.v07.title": "🆕 v0.7 anti-bullshit pack",
|
| 308 |
+
"inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json claims 32k? See if it actually attends that far",
|
| 309 |
+
"inv.v07.template": "<strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy",
|
| 310 |
+
"inv.v07.arena": "<strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides",
|
| 311 |
+
"inv.v07.contam": "<strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability",
|
| 312 |
"share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
|
| 313 |
"share.import_btn": "📂 Load shared JSON",
|
| 314 |
"synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
|
|
|
|
| 401 |
"common.no": "No",
|
| 402 |
|
| 403 |
// Mode tooltips
|
| 404 |
+
"modes.tip": "<strong>Eleven ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.",
|
| 405 |
"profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
|
| 406 |
"compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
|
| 407 |
|
|
|
|
   "template.status.invalid_json":"❌ JSON inválido: {error}",
   "template.status.success_paste":"✅ Config pegado detectado (veredicto: {verdict})",
   "template.pasted_label": "(tokenizer_config pegado)",
+
+  // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
+  "modes.arena": "🎯 Arena CI",
+  "mode_desc.arena": "Recupera intervalos de confianza desde datos crudos de votos pairwise (MLE Bradley-Terry + bootstrap). Detecta pares estadísticamente empatados que el leaderboard público de Arena oculta.",
+  "arena.title": "🎯 Reconstructor Arena-Elo CI",
+  "arena.tip": "Chatbot Arena oculta los intervalos de confianza en el leaderboard público. Una diferencia de 5 Elo puede ser estadísticamente irrelevante. Pega datos crudos de votos (model_a, model_b, winner) — la herramienta calcula MLE Bradley-Terry + bootstrap CIs y lista los empates estadísticos (overlap de CI).",
+  "arena.desc": "<strong>¿GPT-4 es realmente mejor que Claude — o están empatados?</strong> Pega CSV de votos pairwise (o click <em>Cargar sample</em>). MLE Bradley-Terry + 200 iteraciones de bootstrap → Elos ranked con CIs 95% y detección de empates estadísticos. Todo en el navegador.",
+  "arena.sample_btn": "📊 Cargar datos sample",
+  "arena.run_btn": "🎯 Calcular CIs",
+  "arena.clear_btn": "🗑️ Limpiar",
+  "arena.csv_summary": "CSV de votos (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)",
+  "arena.section.ranked": "Elos ranked con CIs 95%",
+  "arena.section.ties": "Empates estadísticos (overlap CI)",
+  "arena.section.summary": "Resumen",
+  "arena.col.rank": "#",
+  "arena.col.model": "Modelo",
+  "arena.col.elo": "Elo",
+  "arena.col.ci": "CI 95%",
+  "arena.col.ci_width": "± semi-anchura",
+  "arena.col.matches": "Partidas",
+  "arena.col.wins": "V / D / E",
+  "arena.col.tie_pair": "Par",
+  "arena.col.tie_diff": "Brecha Elo",
+  "arena.col.tie_overlap": "Overlap CI",
+  "arena.no_ties": "Sin empates estadísticos — todos los pares distinguibles al CI 95%.",
+  "arena.summary.votes": "Votos totales",
+  "arena.summary.models": "Modelos",
+  "arena.summary.ties": "Empates estadísticos",
+  "arena.summary.bootstrap": "Iteraciones bootstrap",
+  "arena.summary.ci_level": "Nivel CI",
+  "arena.status.empty": "⚠ Pega un CSV de votos o click en Cargar sample.",
+  "arena.status.too_few": "⚠ Solo {n} votos válidos — se necesitan al menos 10 para ajustar Bradley-Terry de forma fiable.",
+  "arena.status.computing": "⏳ Calculando MLE Bradley-Terry + bootstrap sobre {n} votos...",
+  "arena.status.done": "✅ {n} votos · {models} modelos · {ties} empates estadísticos · {ms} ms",
+  "arena.status.sample_loaded": "✅ Sample cargado (datos sintéticos Arena de 6 modelos). Click en Calcular CIs.",
+
+  // v0.7.3 — anti-bullshit pack #4: Contamination Prior
+  "modes.contam": "🧪 Contaminación",
+  "mode_desc.contam": "Prior bayesiano-ish sobre si un score de benchmark está contaminado. Introduce la fecha de cutoff de entrenamiento → puntúa 20+ benchmarks populares (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
+  "contam.title": "🧪 Prior de Contaminación",
+  "contam.tip": "Calcula un prior bayesiano-ish sobre si un score de benchmark está contaminado, basado en (fecha de cutoff de entrenamiento) × (fecha de release del benchmark) × (inclusión conocida en corpus + historial de leaks). Open LLM Leaderboard v1 fue cancelado en 2024 tras la contaminación de MMLU/HellaSwag.",
+  "contam.desc": "<strong>¿Deberías confiar en el MMLU de tu modelo?</strong> Introduce la fecha cutoff de entrenamiento — la herramienta puntúa 20+ benchmarks populares (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) y te dice qué scores son probablemente contaminados.",
+  "contam.cutoff_label": "Cutoff entrenamiento:",
+  "contam.run_btn": "🧪 Puntuar todos los benchmarks",
+  "contam.section.ranked": "Priors de contaminación por benchmark",
+  "contam.section.high": "🔴 Benchmarks de alto riesgo (trata los scores como no fiables)",
+  "contam.section.medium": "🟡 Riesgo medio (verifica con alternativas)",
+  "contam.section.low": "🟢 Bajo riesgo (probablemente limpios)",
+  "contam.col.benchmark": "Benchmark",
+  "contam.col.released": "Release",
+  "contam.col.gap": "Gap (meses)",
+  "contam.col.prior": "P(contam)",
+  "contam.col.level": "Nivel",
+  "contam.col.corpora": "En corpus",
+  "contam.col.category": "Categoría",
+  "contam.label.high": "Alto riesgo",
+  "contam.label.medium": "Medio",
+  "contam.label.low": "Bajo",
+  "contam.no_entries": "(ninguno en esta categoría)",
+  "contam.advice.high": "Trata estos scores como no fiables. Sustituye por alternativas más recientes / con test privado (MMLU-Pro, GPQA, MUSR, MATH-500).",
+  "contam.advice.medium": "Toma con cautela. Busca replicación sobre subset held-out o reproducciones comunitarias.",
+  "contam.advice.low": "Score probablemente no contaminado, pero ausencia de leak no es prueba — verifica también con test alternativo.",
+  "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks puntuados",
+  "contam.status.empty": "⚠ Introduce una fecha cutoff de entrenamiento (ej. 2023-12).",
+  "contam.status.bad_date": "⚠ Formato de fecha incorrecto. Usa YYYY-MM o YYYY-MM-DD.",
+  "contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks puntuados · {high} de alto riesgo",
+
+  // v0.7 — Sección Help modal
+  "help.v07.title": "🆕 v0.7 — Pack anti-bullshit (4 modos nuevos)",
+  "help.v07.intro": "<em>v0.7 (2026-05-06): cuatro modos nuevos que resuelven problemas concretos reportados por la comunidad HuggingFace. Cada uno corre en tu navegador sin inferencia — pura metadata + matemáticas.</em>",
+  "help.v07.unmask.title": "🪟 Desenmascarador de Contexto",
+  "help.v07.unmask.body": "Detecta cuándo <code>max_position_embeddings</code> es engañoso. Mistral-7B-v0.1 declara 32k pero atiende dentro de ~4-8k vía SWA. Pega un id HF → veredicto en 1 segundo (HONESTO / INFLADO / GRAVEMENTE INFLADO / YARN-EXTENDIDO). Pilla SWA, RoPE-scaling (YaRN/linear/dynamic NTK), d_head pequeño + GQA. <em>Caso de uso</em>: antes de pagar GPU para 32k de contexto, verifica que el modelo realmente atiende tan lejos.",
+  "help.v07.template.title": "📜 Detector de Chat-template",
+  "help.v07.template.body": "Detecta qué familia de chat-template usa un modelo (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) y te da el flag CLI exacto para lm-evaluation-harness, vLLM, y transformers. Resuelve el issue #1841 de lm-eval-harness: olvidar <code>--apply_chat_template</code> divide la accuracy multi-turn por 2 silenciosamente. <em>Caso de uso</em>: antes de reportar un score, confirma que aplicaste el template correctamente.",
+  "help.v07.arena.title": "🎯 Reconstructor Arena-Elo CI",
+  "help.v07.arena.body": "Chatbot Arena oculta los intervalos de confianza en su leaderboard público — una diferencia de 5 Elo puede ser estadísticamente irrelevante. Pega datos crudos de votos pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap de 200 iteraciones → Elos ranked con CIs 95% y un panel de \"empates estadísticos\" listando pares cuyos CIs se solapan. Prueba el botón Cargar sample. <em>Caso de uso</em>: antes de afirmar \"modelo A vence a modelo B\", verifica que sus CIs no se solapen.",
+  "help.v07.contam.title": "🧪 Prior de Contaminación",
+  "help.v07.contam.body": "Prior bayesiano-ish sobre si un score de benchmark está contaminado. Introduce la fecha cutoff de entrenamiento de tu modelo → la herramienta puntúa 20+ benchmarks populares (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) por P(contaminación) según gap temporal, inclusión en corpus y historial de leaks conocidos. Open LLM Leaderboard v1 fue cancelado en 2024 tras la contaminación de MMLU/HellaSwag. <em>Caso de uso</em>: decide qué scores te puedes creer al comparar dos modelos.",
+
+  // v0.7 — Inventory modal 5ª card
+  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
+  "inv.v07.unmask": "<strong>🪟 Unmask</strong> — ¿config.json declara 32k? Mira si de verdad atiende tan lejos",
+  "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exacto para que lm-eval no divida tu accuracy entre 2 silenciosamente",
+  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — recupera los intervalos de confianza que Chatbot Arena oculta",
+  "inv.v07.contam": "<strong>🧪 Contaminación</strong> — puntúa 20+ benchmarks por probabilidad de contaminación",
   "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
   "share.import_btn": "📂 Cargar JSON compartido",
   "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
…
   "common.no": "No",

   // Tooltips de modos
+  "modes.tip": "<strong>Once formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.",
   "profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
   "compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
…
   "template.status.invalid_json":"❌ JSON invalide : {error}",
   "template.status.success_paste":"✅ Config collé détecté (verdict : {verdict})",
   "template.pasted_label": "(tokenizer_config collé)",
+
+  // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
+  "modes.arena": "🎯 Arena CI",
+  "mode_desc.arena": "Récupère les intervalles de confiance à partir des données brutes de votes pairwise (MLE Bradley-Terry + bootstrap). Détecte les paires statistiquement à égalité que le leaderboard public d'Arena cache.",
+  "arena.title": "🎯 Reconstructeur Arena-Elo CI",
+  "arena.tip": "Chatbot Arena masque les intervalles de confiance dans le leaderboard public. Un écart de 5 Elo peut être statistiquement insignifiant. Collez les données brutes de votes (model_a, model_b, winner) — l'outil calcule le MLE Bradley-Terry + bootstrap CIs et liste les égalités statistiques (overlap CI).",
+  "arena.desc": "<strong>GPT-4 est-il vraiment meilleur que Claude — ou sont-ils à égalité ?</strong> Collez le CSV de votes pairwise (ou cliquez <em>Charger un échantillon</em>). MLE Bradley-Terry + 200 itérations de bootstrap → Elos classés avec CIs 95% et détection d'égalités statistiques. Tout dans le navigateur.",
+  "arena.sample_btn": "📊 Charger échantillon",
+  "arena.run_btn": "🎯 Calculer CIs",
+  "arena.clear_btn": "🗑️ Effacer",
+  "arena.csv_summary": "CSV de votes (header : <code>model_a,model_b,winner</code> ; winner ∈ a/b/tie)",
+  "arena.section.ranked": "Elos classés avec CIs 95%",
+  "arena.section.ties": "Égalités statistiques (overlap CI)",
+  "arena.section.summary": "Résumé",
+  "arena.col.rank": "#",
+  "arena.col.model": "Modèle",
+  "arena.col.elo": "Elo",
+  "arena.col.ci": "CI 95%",
+  "arena.col.ci_width": "± demi-largeur",
+  "arena.col.matches": "Matchs",
+  "arena.col.wins": "V / D / E",
+  "arena.col.tie_pair": "Paire",
+  "arena.col.tie_diff": "Écart Elo",
+  "arena.col.tie_overlap": "Overlap CI",
+  "arena.no_ties": "Aucune égalité statistique — toutes les paires sont distinguables à 95% CI.",
+  "arena.summary.votes": "Total des votes",
+  "arena.summary.models": "Modèles",
+  "arena.summary.ties": "Égalités statistiques",
+  "arena.summary.bootstrap": "Itérations bootstrap",
+  "arena.summary.ci_level": "Niveau CI",
+  "arena.status.empty": "⚠ Collez un CSV de votes ou cliquez sur Charger échantillon.",
+  "arena.status.too_few": "⚠ Seulement {n} votes valides — il en faut au moins 10 pour ajuster Bradley-Terry de manière fiable.",
+  "arena.status.computing": "⏳ Calcul MLE Bradley-Terry + bootstrap sur {n} votes...",
+  "arena.status.done": "✅ {n} votes · {models} modèles · {ties} égalités statistiques · {ms} ms",
+  "arena.status.sample_loaded": "✅ Échantillon chargé (données Arena synthétiques 6 modèles). Cliquez sur Calculer CIs.",
+
+  // v0.7.3 — anti-bullshit pack #4: Contamination Prior
+  "modes.contam": "🧪 Contamination",
+  "mode_desc.contam": "Prior bayésien-ish sur la contamination d'un score de benchmark. Saisissez le cutoff d'entraînement → note 20+ benchmarks populaires (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
+  "contam.title": "🧪 Prior de Contamination",
+  "contam.tip": "Calcule un prior bayésien-ish indiquant si un score de benchmark est contaminé, basé sur (date de cutoff d'entraînement) × (date de sortie du benchmark) × (inclusion connue dans corpus + historique de leaks). Open LLM Leaderboard v1 a été tué en 2024 après la contamination de MMLU/HellaSwag.",
+  "contam.desc": "<strong>Devez-vous faire confiance au score MMLU de votre modèle ?</strong> Saisissez la date de cutoff d'entraînement — l'outil note 20+ benchmarks populaires (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) et vous dit quels scores sont probablement contaminés.",
+  "contam.cutoff_label": "Cutoff entraînement :",
+  "contam.run_btn": "🧪 Noter tous les benchmarks",
+  "contam.section.ranked": "Priors de contamination par benchmark",
+  "contam.section.high": "🔴 Benchmarks à haut risque (traitez les scores comme non fiables)",
+  "contam.section.medium": "🟡 Risque moyen (vérifiez avec des alternatives)",
+  "contam.section.low": "🟢 Faible risque (probablement propres)",
+  "contam.col.benchmark": "Benchmark",
+  "contam.col.released": "Sorti",
+  "contam.col.gap": "Écart (mois)",
+  "contam.col.prior": "P(contam)",
+  "contam.col.level": "Niveau",
+  "contam.col.corpora": "Dans corpus",
+  "contam.col.category": "Catégorie",
+  "contam.label.high": "Haut risque",
+  "contam.label.medium": "Moyen",
+  "contam.label.low": "Faible",
+  "contam.no_entries": "(aucun dans cette catégorie)",
+  "contam.advice.high": "Traitez ces scores comme non fiables. Remplacez par des alternatives plus récentes / à test privé (MMLU-Pro, GPQA, MUSR, MATH-500).",
+  "contam.advice.medium": "À prendre avec précaution. Cherchez une réplication sur un subset held-out ou des reproductions communautaires.",
+  "contam.advice.low": "Score probablement non contaminé, mais absence de leak n'est pas une preuve — vérifiez avec un test alternatif.",
+  "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks notés",
+  "contam.status.empty": "⚠ Saisissez une date de cutoff d'entraînement (ex. 2023-12).",
+  "contam.status.bad_date": "⚠ Format de date incorrect. Utilisez YYYY-MM ou YYYY-MM-DD.",
+  "contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks notés · {high} à haut risque",
+
+  // v0.7 — Section Help modal
+  "help.v07.title": "🆕 v0.7 — Pack anti-bullshit (4 nouveaux modes)",
+  "help.v07.intro": "<em>v0.7 (2026-05-06) : quatre nouveaux modes qui résolvent des problèmes concrets remontés par la communauté HuggingFace. Chacun tourne dans votre navigateur sans inférence — pure métadonnée + maths.</em>",
+  "help.v07.unmask.title": "🪟 Démasqueur de Contexte",
+  "help.v07.unmask.body": "Détecte quand <code>max_position_embeddings</code> est trompeur. Mistral-7B-v0.1 déclare 32k mais ne prête attention que sur ~4-8k via SWA. Collez un id HF → verdict en 1 seconde (HONNÊTE / GONFLÉ / GRAVEMENT GONFLÉ / YARN-ÉTENDU). Détecte SWA, RoPE-scaling (YaRN/linear/dynamic NTK), petit d_head + GQA. <em>Cas d'usage</em> : avant de payer un GPU pour 32k de contexte, vérifiez que le modèle prête vraiment attention aussi loin.",
+  "help.v07.template.title": "📜 Détecteur de Chat-template",
+  "help.v07.template.body": "Détecte la famille de chat-template d'un modèle (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) et donne le flag CLI exact pour lm-evaluation-harness, vLLM, et transformers. Résout l'issue #1841 de lm-eval-harness : oublier <code>--apply_chat_template</code> divise l'accuracy multi-tours par 2 silencieusement. <em>Cas d'usage</em> : avant de reporter un score, confirmez avoir appliqué le template correctement.",
+  "help.v07.arena.title": "🎯 Reconstructeur Arena-Elo CI",
+  "help.v07.arena.body": "Chatbot Arena masque les intervalles de confiance de son leaderboard public — un écart de 5 Elo peut être statistiquement insignifiant. Collez des données brutes de votes pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap 200 itérations → Elos classés avec CIs 95% et un panneau \"égalités statistiques\" listant les paires dont les CIs se chevauchent. Essayez le bouton Charger échantillon. <em>Cas d'usage</em> : avant de déclarer \"modèle A bat modèle B\", vérifiez que leurs CIs ne se chevauchent pas.",
+  "help.v07.contam.title": "🧪 Prior de Contamination",
+  "help.v07.contam.body": "Prior bayésien-ish sur la contamination d'un score de benchmark. Saisissez la date de cutoff d'entraînement de votre modèle → l'outil note 20+ benchmarks populaires (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) par P(contamination) selon l'écart temporel, l'inclusion dans corpus et l'historique de leaks connus. Open LLM Leaderboard v1 a été tué en 2024 après la contamination de MMLU/HellaSwag. <em>Cas d'usage</em> : décidez quels scores croire en comparant deux modèles.",
+
+  // v0.7 — Inventory modal 5ème card
+  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
+  "inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json annonce 32k ? Voyez s'il prête vraiment attention aussi loin",
+  "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exact pour que lm-eval ne divise pas votre accuracy par 2 en silence",
+  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — récupère les intervalles de confiance que Chatbot Arena cache",
+  "inv.v07.contam": "<strong>🧪 Contamination</strong> — note 20+ benchmarks par probabilité de contamination",
   "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
   "share.import_btn": "📂 Charger JSON partagé",
   "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
…
   "common.no": "Non",

   // Tooltips des modes
+  "modes.tip": "<strong>Onze façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.",
   "profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
   "compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
…
   "template.status.invalid_json":"❌ JSON 无效:{error}",
   "template.status.success_paste":"✅ 已检测粘贴的 config(判定:{verdict})",
   "template.pasted_label": "(已粘贴 tokenizer_config)",
+
+  // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
+  "modes.arena": "🎯 Arena CI",
+  "mode_desc.arena": "从原始 pairwise 投票数据中恢复置信区间(Bradley-Terry MLE + bootstrap)。检测公开 Arena 排行榜隐藏的统计上并列对。",
+  "arena.title": "🎯 Arena-Elo CI 重建器",
+  "arena.tip": "Chatbot Arena 在公开排行榜中删除了置信区间。5 Elo 的差距在统计上可能毫无意义。粘贴原始投票数据(model_a, model_b, winner) — 工具计算 Bradley-Terry MLE + bootstrap CI 并列出统计上的并列(CI 重叠)。",
+  "arena.desc": "<strong>GPT-4 真的比 Claude 强吗 — 还是它们打平?</strong> 粘贴 pairwise 投票 CSV(或点击 <em>加载样本</em>)。Bradley-Terry MLE + 200 次 bootstrap → 排序 Elo + 95% CI + 统计并列检测。全部在浏览器中。",
+  "arena.sample_btn": "📊 加载样本数据",
+  "arena.run_btn": "🎯 计算 CIs",
+  "arena.clear_btn": "🗑️ 清空",
+  "arena.csv_summary": "投票 CSV(header:<code>model_a,model_b,winner</code>;winner ∈ a/b/tie)",
+  "arena.section.ranked": "排序 Elo 与 95% CI",
+  "arena.section.ties": "统计并列(CI 重叠)",
+  "arena.section.summary": "摘要",
+  "arena.col.rank": "#",
+  "arena.col.model": "模型",
+  "arena.col.elo": "Elo",
+  "arena.col.ci": "95% CI",
+  "arena.col.ci_width": "± 半宽",
+  "arena.col.matches": "对局",
+  "arena.col.wins": "胜 / 负 / 平",
+  "arena.col.tie_pair": "配对",
+  "arena.col.tie_diff": "Elo 差距",
+  "arena.col.tie_overlap": "CI 重叠",
+  "arena.no_ties": "无统计并列 — 所有配对在 95% CI 下可区分。",
+  "arena.summary.votes": "总投票数",
+  "arena.summary.models": "模型数",
+  "arena.summary.ties": "统计并列",
+  "arena.summary.bootstrap": "Bootstrap 迭代",
+  "arena.summary.ci_level": "CI 水平",
+  "arena.status.empty": "⚠ 粘贴投票 CSV 或点击加载样本。",
+  "arena.status.too_few": "⚠ 仅 {n} 个有效投票 — 需要至少 10 个才能可靠拟合 Bradley-Terry。",
+  "arena.status.computing": "⏳ 在 {n} 个投票上计算 Bradley-Terry MLE + bootstrap...",
+  "arena.status.done": "✅ {n} 投票 · {models} 模型 · {ties} 统计并列 · {ms} ms",
+  "arena.status.sample_loaded": "✅ 样本已加载(合成 6 模型 Arena 数据)。点击计算 CIs。",
+
+  // v0.7.3 — anti-bullshit pack #4: Contamination Prior
+  "modes.contam": "🧪 污染",
+  "mode_desc.contam": "对 benchmark 分数是否被污染做贝叶斯式的先验估计。输入模型训练 cutoff → 评估 20+ 主流 benchmark(MMLU、GSM8K、HumanEval、MMLU-Pro…)。",
+  "contam.title": "🧪 污染先验",
+  "contam.tip": "基于 (模型训练 cutoff 日期) × (benchmark 发布日期) × (已知语料库纳入 + 泄漏历史),对 benchmark 分数是否被污染做贝叶斯式的先验估计。Open LLM Leaderboard v1 在 2024 年因 MMLU/HellaSwag 分数被污染而停用。",
+  "contam.desc": "<strong>你应该相信你模型的 MMLU 分数吗?</strong> 输入模型训练 cutoff 日期 — 工具评估 20+ 主流 benchmark(MMLU、HellaSwag、GSM8K、HumanEval、IFEval、MMLU-Pro、GPQA…)并告诉你哪些分数可能被污染。",
+  "contam.cutoff_label": "训练 cutoff:",
+  "contam.run_btn": "🧪 评估所有 benchmark",
+  "contam.section.ranked": "Benchmark 污染先验",
+  "contam.section.high": "🔴 高风险 benchmark(视分数为不可信)",
+  "contam.section.medium": "🟡 中等风险(用替代品验证)",
+  "contam.section.low": "🟢 低风险(可能干净)",
+  "contam.col.benchmark": "Benchmark",
+  "contam.col.released": "发布",
+  "contam.col.gap": "差距(月)",
+  "contam.col.prior": "P(污染)",
+  "contam.col.level": "等级",
+  "contam.col.corpora": "在语料库",
+  "contam.col.category": "类别",
+  "contam.label.high": "高风险",
+  "contam.label.medium": "中",
+  "contam.label.low": "低",
+  "contam.no_entries": "(此类别中无)",
+  "contam.advice.high": "视这些分数为不可信。用更新 / 私有测试的替代品替换(MMLU-Pro、GPQA、MUSR、MATH-500)。",
+  "contam.advice.medium": "谨慎对待。在 held-out 子集或社区复现上寻找复制。",
+  "contam.advice.low": "分数可能未被污染,但没有泄漏不等于证明 — 仍要用替代测试交叉验证。",
+  "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} 个 benchmark 已评估",
+  "contam.status.empty": "⚠ 输入模型训练 cutoff 日期(例如 2023-12)。",
+  "contam.status.bad_date": "⚠ 日期格式错误。使用 YYYY-MM 或 YYYY-MM-DD。",
+  "contam.status.done": "✅ Cutoff {cutoff} · {n} 个 benchmark 已评估 · {high} 个高风险",
+
+  // v0.7 — Help 模态部分
+  "help.v07.title": "🆕 v0.7 — Anti-bullshit 套件(4 个新模式)",
+  "help.v07.intro": "<em>v0.7(2026-05-06):四个新模式,解决 HuggingFace 社区报告的具体痛点。每个都在浏览器中运行,无推理 — 纯元数据 + 数学。</em>",
+  "help.v07.unmask.title": "🪟 上下文揭示器",
+  "help.v07.unmask.body": "检测 <code>max_position_embeddings</code> 何时具有误导性。Mistral-7B-v0.1 声称 32k 但通过 SWA 实际只在 ~4-8k 内做注意力。粘贴 HF 模型 id → 1 秒判定(诚实 / 夸大 / 严重夸大 / YARN 扩展)。捕获 SWA、RoPE-scaling(YaRN/linear/dynamic NTK)、小 d_head + GQA。<em>用例</em>:在为 32k 上下文付 GPU 钱之前,验证模型是否真的注意那么远。",
+  "help.v07.template.title": "📜 Chat-template 检测器",
+  "help.v07.template.body": "检测模型使用的 chat-template 系列(Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / 自定义 / 无)并给出 lm-evaluation-harness、vLLM、transformers 的精确 CLI flag。解决 lm-eval-harness 的 issue #1841:忘记 <code>--apply_chat_template</code> 会让 multi-turn accuracy 静默减半。<em>用例</em>:报告 benchmark 分数前,确认你正确应用了 template。",
+  "help.v07.arena.title": "🎯 Arena-Elo CI 重建器",
+  "help.v07.arena.body": "Chatbot Arena 在公开排行榜中删除了置信区间 — 5 Elo 的差距在统计上可能毫无意义。粘贴原始 pairwise 投票数据(model_a, model_b, winner)→ Bradley-Terry MLE + 200 次 bootstrap → 排序 Elo + 95% CI + \"统计并列\" 面板,列出 CI 重叠的配对。尝试加载样本按钮。<em>用例</em>:宣称 \"模型 A 胜过模型 B\" 之前,验证它们的 CI 不重叠。",
+  "help.v07.contam.title": "🧪 污染先验",
+  "help.v07.contam.body": "对 benchmark 分数是否被污染做贝叶斯式的先验估计。输入模型训练 cutoff 日期 → 工具按 P(污染) 评估 20+ 主流 benchmark(MMLU、HellaSwag、GSM8K、HumanEval、IFEval、MMLU-Pro、GPQA、AIME、MATH-500、BBH、MUSR…),基于时间差距、语料库纳入和已知泄漏历史。Open LLM Leaderboard v1 在 2024 年因 MMLU/HellaSwag 分数被污染而停用。<em>用例</em>:比较两个模型时决定相信哪些分数。",
+
+  // v0.7 — Inventory 模态第 5 卡
+  "inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
+  "inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json 声称 32k?看它是否真的注意那么远",
+  "inv.v07.template": "<strong>📜 Chat-template</strong> — 精确 CLI flag,让 lm-eval 不会静默把你的 accuracy 减半",
+  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — 恢复 Chatbot Arena 隐藏的置信区间",
+  "inv.v07.contam": "<strong>🧪 污染</strong> — 按污染概率对 20+ benchmark 评级",
   "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
   "share.import_btn": "📂 加载共享的 JSON",
   "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
…
   "common.no": "否",

   // 模式提示
+  "modes.tip": "<strong>十一种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个模型的面板在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间;检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期,对 20+ benchmark 进行污染概率评估。",
   "profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
   "compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
…
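The strings above document the Arena CI input contract: a plain CSV with header model_a,model_b,winner and winner ∈ a/b/tie. As a hedged illustration of driving the mode programmatically — the ten vote rows below are invented for the example, not the embedded six-model SAMPLE_VOTES_CSV — the two exports that main.js consumes (see the diff that follows) would be used like this:

// Illustration only — invented rows; format per arena.csv_summary.
import { parseVotesCSV, computeArenaCI } from "./arena_ci.js";

const demoCsv = `model_a,model_b,winner
gpt-4,claude-3,a
claude-3,gpt-4,b
gpt-4,claude-3,tie
claude-3,llama-3-70b,a
gpt-4,llama-3-70b,a
llama-3-70b,claude-3,tie
gpt-4,claude-3,a
claude-3,llama-3-70b,tie
llama-3-70b,gpt-4,b
claude-3,gpt-4,a`;

// main.js (below) refuses to fit Bradley-Terry on fewer than 10 valid votes,
// so a realistic input needs at least that many rows.
const votes = parseVotesCSV(demoCsv);
const { ratings, ties, summary } = computeArenaCI(votes, { bootstrapN: 200, ciLevel: 0.95 });
// Each ratings entry carries rank / elo / ci_low / ci_high; ties lists pairs whose CIs overlap.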
js/main.js

@@ -13,6 +13,8 @@ import { gammaCheckAll, REGIME_META } from "./gamma_check.js";
 import { loadLeanManifest, badgeHtml, badgesForUiBinding, renderTheoremTable, getManifest } from "./lean_badges.js";
 import { unmaskConfig } from "./swa_unmasker.js";
 import { sniffChatTemplate } from "./chat_template_sniffer.js";
+import { parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV } from "./arena_ci.js";
+import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";

 const TAF_BROWSER_URL = "python/taf_browser.py";
 const ENABLE_WEBLLM = true;

@@ -188,7 +190,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
   ["ask-section", "recipe-section", "form-section",
    "profile-section", "compare-section", "inspector-section",
    "diagnose-section", "phase-section", "unmask-section",
-   "template-section"].forEach(id => {
+   "template-section", "arena-section", "contam-section"].forEach(id => {
     const el = $(id);
     if (el) el.style.display = "none";
   });

@@ -197,7 +199,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
   ask: "ask-section", recipe: "recipe-section", profile: "profile-section",
   compare: "compare-section", inspector: "inspector-section",
   diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
-  template: "template-section",
+  template: "template-section", arena: "arena-section", contam: "contam-section",
 };
 const sectionId = sectionMap[mode];
 if (sectionId) $(sectionId).style.display = "";

@@ -739,6 +741,245 @@ $("template-id")?.addEventListener("keydown", (e) => {
   if (e.key === "Enter") { e.preventDefault(); runTemplateFromId(); }
 });

|
| 745 |
+
// 🎯 Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3)
|
| 746 |
+
// ════════════════════════════════════════════════════════════════════
|
| 747 |
+
|
| 748 |
+
function renderArenaCard(result) {
|
| 749 |
+
const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
|
| 750 |
+
({"&":"&","<":"<",">":">",'"':""","'":"'"}[c]));
|
| 751 |
+
const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
|
| 752 |
+
|
| 753 |
+
const titleRanked = t("arena.section.ranked") || "Ranked Elos with 95% CIs";
|
| 754 |
+
const titleTies = t("arena.section.ties") || "Statistical ties (CI overlap)";
|
| 755 |
+
const titleSummary = t("arena.section.summary") || "Summary";
|
| 756 |
+
const colRank = t("arena.col.rank") || "#";
|
| 757 |
+
const colModel = t("arena.col.model") || "Model";
|
| 758 |
+
const colElo = t("arena.col.elo") || "Elo";
|
| 759 |
+
const colCi = t("arena.col.ci") || "95% CI";
|
| 760 |
+
const colSpread = t("arena.col.ci_width") || "CI width";
|
| 761 |
+
const colMatches = t("arena.col.matches") || "Matches";
|
| 762 |
+
const colWins = t("arena.col.wins") || "W / L / T";
|
| 763 |
+
const noTies = t("arena.no_ties") || "No statistical ties — all pairs distinguishable at 95% CI.";
|
| 764 |
+
|
| 765 |
+
// Ranked table
|
| 766 |
+
let tableRows = "";
|
| 767 |
+
for (const r of result.ratings) {
|
| 768 |
+
tableRows += `<tr>
|
| 769 |
+
<td class="arena-rank">#${r.rank}</td>
|
| 770 |
+
<td class="arena-model"><code>${escapeHtml(r.model)}</code></td>
|
| 771 |
+
<td class="arena-elo"><strong>${fmtN(r.elo)}</strong></td>
|
| 772 |
+
<td class="arena-ci">[${fmtN(r.ci_low)}, ${fmtN(r.ci_high)}]</td>
|
| 773 |
+
<td class="arena-spread">±${fmtN(Math.round(r.ci_width / 2 * 10) / 10)}</td>
|
| 774 |
+
<td class="arena-matches">${fmtN(r.matches)}</td>
|
| 775 |
+
<td class="arena-wlt">${fmtN(r.wins)} / ${fmtN(r.losses)} / ${fmtN(r.ties_count)}</td>
|
| 776 |
+
</tr>`;
|
| 777 |
+
}
|
| 778 |
+
|
| 779 |
+
// Ties section
|
| 780 |
+
let tiesHtml = "";
|
| 781 |
+
if (result.ties.length === 0) {
|
| 782 |
+
tiesHtml = `<p class="unmask-reco">${noTies}</p>`;
|
| 783 |
+
} else {
|
| 784 |
+
tiesHtml = `<table class="arena-ties-table">
|
| 785 |
+
<thead><tr>
|
| 786 |
+
<th>${t("arena.col.tie_pair") || "Pair"}</th>
|
| 787 |
+
<th>${t("arena.col.tie_diff") || "Elo gap"}</th>
|
| 788 |
+
<th>${t("arena.col.tie_overlap") || "CI overlap"}</th>
|
| 789 |
+
</tr></thead><tbody>`;
|
| 790 |
+
for (const tieEntry of result.ties) {
|
| 791 |
+
tiesHtml += `<tr>
|
| 792 |
+
<td>#${tieEntry.rank_a} <code>${escapeHtml(tieEntry.model_a)}</code> vs #${tieEntry.rank_b} <code>${escapeHtml(tieEntry.model_b)}</code></td>
|
| 793 |
+
<td>${fmtN(Math.round(tieEntry.elo_diff * 10) / 10)} Elo</td>
|
| 794 |
+
<td>${fmtN(Math.round(tieEntry.overlap_elo * 10) / 10)} Elo</td>
|
| 795 |
+
</tr>`;
|
| 796 |
+
}
|
| 797 |
+
tiesHtml += `</tbody></table>`;
|
| 798 |
+
}
|
| 799 |
+
|
| 800 |
+
// Summary panel
|
| 801 |
+
const s = result.summary;
|
| 802 |
+
const summaryHtml = `
|
| 803 |
+
<ul>
|
| 804 |
+
<li><strong>${t("arena.summary.votes") || "Total votes"}:</strong> ${fmtN(s.total_votes)}</li>
|
| 805 |
+
<li><strong>${t("arena.summary.models") || "Models"}:</strong> ${fmtN(s.n_models)}</li>
|
| 806 |
+
<li><strong>${t("arena.summary.ties") || "Statistical ties"}:</strong> ${fmtN(s.n_ties)}</li>
|
| 807 |
+
<li><strong>${t("arena.summary.bootstrap") || "Bootstrap iters"}:</strong> ${fmtN(s.bootstrap_iters)}</li>
|
| 808 |
+
<li><strong>${t("arena.summary.ci_level") || "CI level"}:</strong> ${(s.ci_level * 100).toFixed(0)}%</li>
|
| 809 |
+
</ul>
|
| 810 |
+
`;
|
| 811 |
+
|
| 812 |
+
return `
|
| 813 |
+
<div class="arena-result">
|
| 814 |
+
<details class="unmask-panel" open>
|
| 815 |
+
<summary class="unmask-panel-title">${titleRanked}</summary>
|
| 816 |
+
<div style="overflow-x:auto;">
|
| 817 |
+
<table class="arena-table">
|
| 818 |
+
<thead><tr>
|
| 819 |
+
<th>${colRank}</th><th>${colModel}</th><th>${colElo}</th>
|
| 820 |
+
<th>${colCi}</th><th>${colSpread}</th>
|
| 821 |
+
<th>${colMatches}</th><th>${colWins}</th>
|
| 822 |
+
</tr></thead>
|
| 823 |
+
<tbody>${tableRows}</tbody>
|
| 824 |
+
</table>
|
| 825 |
+
</div>
|
| 826 |
+
</details>
|
| 827 |
+
<details class="unmask-panel" open>
|
| 828 |
+
<summary class="unmask-panel-title">${titleTies} <span class="arena-tie-count">(${result.ties.length})</span></summary>
|
| 829 |
+
${tiesHtml}
|
| 830 |
+
</details>
|
| 831 |
+
<details class="unmask-panel">
|
| 832 |
+
<summary class="unmask-panel-title">${titleSummary}</summary>
|
| 833 |
+
${summaryHtml}
|
| 834 |
+
</details>
|
| 835 |
+
</div>
|
| 836 |
+
`;
|
| 837 |
+
}
|
| 838 |
+
|
| 839 |
+
function runArenaCompute() {
|
| 840 |
+
const csv = ($("arena-csv").value || "").trim();
|
| 841 |
+
if (!csv) {
|
| 842 |
+
$("arena-status").textContent = t("arena.status.empty") || "⚠ Paste vote CSV or click Load sample.";
|
| 843 |
+
return;
|
| 844 |
+
}
|
| 845 |
+
let votes;
|
| 846 |
+
try {
|
| 847 |
+
votes = parseVotesCSV(csv);
|
| 848 |
+
} catch (e) {
|
| 849 |
+
$("arena-status").textContent = `❌ ${e.message}`;
|
| 850 |
+
return;
|
| 851 |
+
}
|
| 852 |
+
if (votes.length < 10) {
|
| 853 |
+
$("arena-status").textContent = tFmt("arena.status.too_few", { n: votes.length });
|
| 854 |
+
return;
|
| 855 |
+
}
|
| 856 |
+
$("arena-status").textContent = tFmt("arena.status.computing", { n: votes.length });
|
| 857 |
+
// Defer to next tick so the status text actually paints before the heavy bootstrap.
|
| 858 |
+
setTimeout(() => {
|
| 859 |
+
const t0 = performance.now();
|
| 860 |
+
const result = computeArenaCI(votes, { bootstrapN: 200, ciLevel: 0.95 });
|
| 861 |
+
const ms = Math.round(performance.now() - t0);
|
| 862 |
+
$("arena-output").innerHTML = renderArenaCard(result);
|
| 863 |
+
$("arena-status").textContent = tFmt("arena.status.done", {
|
| 864 |
+
n: votes.length, models: result.summary.n_models,
|
| 865 |
+
ties: result.summary.n_ties, ms,
|
| 866 |
+
});
|
| 867 |
+
}, 30);
|
| 868 |
+
}
|
| 869 |
+
|
| 870 |
+
$("arena-sample-btn")?.addEventListener("click", () => {
|
| 871 |
+
$("arena-csv").value = SAMPLE_VOTES_CSV;
|
| 872 |
+
$("arena-status").textContent = t("arena.status.sample_loaded") || "✅ Sample loaded. Click Compute CIs.";
|
| 873 |
+
});
|
| 874 |
+
$("arena-run-btn")?.addEventListener("click", runArenaCompute);
|
| 875 |
+
$("arena-clear-btn")?.addEventListener("click", () => {
|
| 876 |
+
$("arena-csv").value = "";
|
| 877 |
+
$("arena-output").innerHTML = "";
|
| 878 |
+
$("arena-status").textContent = "";
|
| 879 |
+
});
|
| 880 |
+
|
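js/arena_ci.js itself is not part of this hunk, so the statistics behind computeArenaCI deserve a sketch. Below is one plausible shape of the Bradley-Terry fit (via the standard minorization-maximization update) and the CI-overlap tie rule the UI strings describe. All names, the half-win tie convention, and the 1000-Elo anchor are illustrative assumptions, not the module's actual internals:

// Sketch only — hypothetical names. votes: [{ a, b, winner }], winner ∈ "a" | "b" | "tie".
function fitBradleyTerry(votes, iters = 100) {
  const models = [...new Set(votes.flatMap(v => [v.a, v.b]))];
  const key = (i, j) => `${i}|${j}`;
  const wins = new Map();                                  // wins.get("i|j") = wins of i over j
  const bump = (i, j, w) => wins.set(key(i, j), (wins.get(key(i, j)) || 0) + w);
  for (const v of votes) {
    if (v.winner === "a") bump(v.a, v.b, 1);
    else if (v.winner === "b") bump(v.b, v.a, 1);
    else { bump(v.a, v.b, 0.5); bump(v.b, v.a, 0.5); }     // one common tie convention
  }
  const p = new Map(models.map(m => [m, 1]));              // latent strengths
  for (let it = 0; it < iters; it++) {
    for (const i of models) {
      let W = 0, denom = 0;                                // MM update: p_i ← W_i / Σ_j n_ij/(p_i+p_j)
      for (const j of models) {
        if (j === i) continue;
        const wij = wins.get(key(i, j)) || 0, wji = wins.get(key(j, i)) || 0;
        if (wij + wji === 0) continue;
        W += wij;
        denom += (wij + wji) / (p.get(i) + p.get(j));
      }
      if (denom > 0) p.set(i, W / denom);
    }
    const mean = models.reduce((acc, m) => acc + p.get(m), 0) / models.length;
    for (const m of models) p.set(m, p.get(m) / mean);     // fix the scale each sweep
  }
  // Elo-like scale; the 1000 anchor and 400/log10 slope are conventional guesses here.
  return new Map(models.map(m => [m, 1000 + 400 * Math.log10(p.get(m))]));
}

// The bootstrap refits on resampled vote sets (200 times per the UI) and takes percentile
// bands per model; two models are a "statistical tie" when those bands overlap:
const ciOverlap = (ra, rb) => Math.min(ra.ci_high, rb.ci_high) - Math.max(ra.ci_low, rb.ci_low);
// tie ⇔ ciOverlap(ra, rb) > 0 — the positive overlap (in Elo) is what the ties table reports.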
+// ════════════════════════════════════════════════════════════════════
+// 🧪 Contamination Prior (v0.7.3 anti-bullshit pack #4)
+// ════════════════════════════════════════════════════════════════════
+
+const CONTAM_LEVEL_COLOR = { high: "#f85149", medium: "#f1c40f", low: "#3fb950" };
+
+function renderContamCard(rows, modelCutoff) {
+  const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+    ({ "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" }[c]));
+
+  const titleRanked = t("contam.section.ranked") || "Benchmark contamination priors";
+  const titleHigh = t("contam.section.high") || "🔴 High-risk benchmarks (treat scores as unreliable)";
+  const titleMed  = t("contam.section.medium") || "🟡 Medium-risk (verify with alternates)";
+  const titleLow  = t("contam.section.low") || "🟢 Low-risk (likely clean)";
+  const colBench    = t("contam.col.benchmark") || "Benchmark";
+  const colReleased = t("contam.col.released") || "Released";
+  const colGap      = t("contam.col.gap") || "Gap (months)";
+  const colPrior    = t("contam.col.prior") || "P(contam)";
+  const colLevel    = t("contam.col.level") || "Level";
+  const colCorpora  = t("contam.col.corpora") || "In corpora";
+  const colCategory = t("contam.col.category") || "Category";
+
+  const high   = rows.filter(r => r.level === "high");
+  const medium = rows.filter(r => r.level === "medium");
+  const low    = rows.filter(r => r.level === "low");
+
+  function tableFor(group) {
+    if (group.length === 0) return `<p class="unmask-reco">${t("contam.no_entries") || "(none in this category)"}</p>`;
+    let body = "";
+    for (const r of group) {
+      body += `<tr>
+        <td><strong>${escapeHtml(r.benchmark)}</strong></td>
+        <td>${escapeHtml(r.benchmark_released)}</td>
+        <td class="arena-spread">${r.gap_months > 0 ? "+" : ""}${r.gap_months}</td>
+        <td class="arena-elo" style="color: ${CONTAM_LEVEL_COLOR[r.level]};"><strong>${(r.prior * 100).toFixed(0)}%</strong></td>
+        <td>${r.benchmark_in_corpora ? "✓" : "✗"}</td>
+        <td class="arena-spread">${escapeHtml(r.benchmark_category)}</td>
+      </tr>`;
+    }
+    return `<table class="arena-table">
+      <thead><tr><th>${colBench}</th><th>${colReleased}</th><th>${colGap}</th><th>${colPrior}</th><th>${colCorpora}</th><th>${colCategory}</th></tr></thead>
+      <tbody>${body}</tbody></table>`;
+  }
+
+  const adviceHigh   = t("contam.advice.high") || "Treat these scores as unreliable. Replace with newer / private-test alternates (MMLU-Pro, GPQA, MUSR, MATH-500).";
+  const adviceMedium = t("contam.advice.medium") || "Take with caution. Look for replication on a held-out subset or community reproductions.";
+  const adviceLow    = t("contam.advice.low") || "Score likely uncontaminated, but absence of leak is not proof — still cross-check with alternate test.";
+
+  return `
+    <div class="arena-result">
+      <div class="unmask-hero" style="border-color: #58a6ff;">
+        <div class="unmask-verdict" style="color: #58a6ff;">${tFmt("contam.summary.headline", { cutoff: modelCutoff, n: rows.length })}</div>
+        <div class="unmask-numbers">
+          <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.high}">🔴 ${t("contam.label.high") || "High risk"}</span><span class="unmask-num-val">${high.length}</span></div>
+          <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.medium}">🟡 ${t("contam.label.medium") || "Medium"}</span><span class="unmask-num-val">${medium.length}</span></div>
+          <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.low}">🟢 ${t("contam.label.low") || "Low"}</span><span class="unmask-num-val">${low.length}</span></div>
+        </div>
+      </div>
+      <div class="unmask-details">
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${titleHigh} <span class="arena-tie-count">(${high.length})</span></summary>
+          <p class="unmask-reco">${adviceHigh}</p>
+          ${tableFor(high)}
+        </details>
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${titleMed} <span class="arena-tie-count">(${medium.length})</span></summary>
+          <p class="unmask-reco">${adviceMedium}</p>
+          ${tableFor(medium)}
+        </details>
+        <details class="unmask-panel">
+          <summary class="unmask-panel-title">${titleLow} <span class="arena-tie-count">(${low.length})</span></summary>
+          <p class="unmask-reco">${adviceLow}</p>
+          ${tableFor(low)}
+        </details>
+      </div>
+    </div>
+  `;
+}
+
+function runContamCompute() {
+  const cutoff = ($("contam-cutoff").value || "").trim();
+  if (!cutoff) {
+    $("contam-status").textContent = t("contam.status.empty") || "⚠ Enter a model training cutoff date (e.g. 2023-12).";
+    return;
+  }
+  if (!/^\d{4}(-\d{1,2})?(-\d{1,2})?$/.test(cutoff)) {
+    $("contam-status").textContent = t("contam.status.bad_date") || "⚠ Bad date format. Use YYYY-MM or YYYY-MM-DD.";
+    return;
+  }
+  const rows = rateAllBenchmarks(cutoff);
+  $("contam-output").innerHTML = renderContamCard(rows, cutoff);
+  $("contam-status").textContent = tFmt("contam.status.done", {
+    cutoff, n: rows.length,
+    high: rows.filter(r => r.level === "high").length,
+  });
+}
+
+$("contam-run-btn")?.addEventListener("click", runContamCompute);
+$("contam-cutoff")?.addEventListener("keydown", (e) => {
+  if (e.key === "Enter") { e.preventDefault(); runContamCompute(); }
+});
+
 function configToPreset(cfg, modelId) {
   const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
   const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
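rateAllBenchmarks, called in runContamCompute above, lives in js/contamination_prior.js, which is likewise not shown in this view. Going by the tooltip copy — (training cutoff) × (benchmark release) × (corpus inclusion + leak history) — one plausible per-benchmark scoring is sketched below; the logistic curve, boost factors, bucket thresholds, and field names are invented for illustration, not the shipped constants:

// Hypothetical sketch. entry: { released: "YYYY-MM", inCorpora: boolean, leakHistory: 0..1 }.
function contaminationPrior(cutoff, entry) {
  const toMonths = (d) => { const [y, m = "6"] = d.split("-"); return Number(y) * 12 + Number(m); };
  const gapMonths = toMonths(cutoff) - toMonths(entry.released); // > 0 ⇒ benchmark predates the cutoff
  // Time prior: a benchmark released after the cutoff cannot be in the training set;
  // older ones are increasingly likely to have been scraped. The logistic shape is a guess.
  let prior = gapMonths <= 0 ? 0.02 : 1 / (1 + Math.exp(-(gapMonths - 6) / 6));
  if (entry.inCorpora) prior = Math.min(1, prior * 1.5);         // corpus-inclusion boost (invented factor)
  prior = Math.min(1, prior * (0.5 + entry.leakHistory));        // scale by known leak history (invented)
  const level = prior >= 0.6 ? "high" : prior >= 0.3 ? "medium" : "low"; // bucket thresholds (invented)
  return { prior, level, gap_months: gapMonths };
}
// A benchmark released years before the cutoff lands in "high"; one released after the
// cutoff sits at the 0.02 floor and buckets as "low" — matching the UI's three sections.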
style.css

@@ -33,6 +33,34 @@
   flex: 1;
 }

+/* v0.7.2 — Arena-Elo CI reconstructor */
+.arena-result { margin-top: 0.6em; }
+.arena-table, .arena-ties-table {
+  width: 100%;
+  border-collapse: collapse;
+  font-size: 0.92em;
+}
+.arena-table th, .arena-ties-table th {
+  text-align: left;
+  font-weight: 600;
+  font-size: 0.78em;
+  text-transform: uppercase;
+  letter-spacing: 0.04em;
+  color: #58a6ff;
+  padding: 0.45em 0.6em;
+  border-bottom: 1px solid rgba(255, 255, 255, 0.12);
+}
+.arena-table td, .arena-ties-table td {
+  padding: 0.45em 0.6em;
+  border-bottom: 1px solid rgba(255, 255, 255, 0.04);
+  vertical-align: middle;
+}
+.arena-table tr:hover, .arena-ties-table tr:hover { background: rgba(88, 166, 255, 0.04); }
+.arena-rank { color: #8b949e; font-family: monospace; }
+.arena-elo { font-family: monospace; }
+.arena-ci, .arena-spread, .arena-matches, .arena-wlt { font-family: monospace; font-size: 0.9em; opacity: 0.85; }
+.arena-tie-count { font-size: 0.85em; opacity: 0.7; font-weight: normal; }
+
 /* v0.7.1 — Chat-template Sniffer mode */
 .template-cmd-block {
   display: flex;