karlexmarin Claude Opus 4.7 (1M context) committed
Commit d61ea0e · 1 Parent(s): f09cd1d

v0.7.2: Arena CI + Contamination Prior + v0.7 Help/Inventory documentation


Completes the v0.7 anti-bullshit pack with two more browser-only modes, plus full Help and Inventory documentation covering all four features in EN/ES/FR/ZH.

NEW MODES (research-driven, targeting HF community pain points)

🎯 Arena CI — Arena-Elo CI reconstructor (anti-bullshit #3)
- js/arena_ci.js: parseVotesCSV + Bradley-Terry MM-MLE + 200-iteration bootstrap + statistical-tie detection (CI overlap)
- Solves: Chatbot Arena strips confidence intervals from its public leaderboard. A 5-Elo gap can be statistically meaningless. Tool reconstructs CIs from raw vote data + flags pairs whose CIs overlap.
- Embedded SAMPLE_VOTES_CSV (6 models × 96 votes). One click to demo.
- Sim: 39/39 passed. An adversarial 3-model fixture confirms B↔C is correctly identified as tied while A remains clearly distinguishable.
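
  A minimal usage sketch against the new module (names and output fields match the exports shown in the js/arena_ci.js diff below):

    import { parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV } from "./js/arena_ci.js";
    const votes = parseVotesCSV(SAMPLE_VOTES_CSV);   // [{ model_a, model_b, winner }, ...]
    const { ratings, ties, summary } = computeArenaCI(votes, { bootstrapN: 200, ciLevel: 0.95 });
    // ratings: sorted by Elo, each { rank, model, elo, ci_low, ci_high, ci_width, wins, losses, ... }
    // ties:    pairs whose CIs overlap → report as statistically tied rather than ranked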

🧪 Contamination Prior — anti-bullshit #4
- js/contamination_prior.js: BENCHMARK_DB of 20 entries (MMLU/HellaSwag/ARC/TruthfulQA/GSM8K/HumanEval/MBPP/BBH/IFEval/MuSR/GPQA/MATH-500/AIME-24/Winogrande/BoolQ/DROP/TriviaQA/SQuAD/MMLU-Pro/MATH).
- computeContaminationPrior + rateAllBenchmarks. Time-prior curve + corpus boost + leak_factor → P(contam) → high/medium/low buckets.
- Solves: Open LLM Leaderboard v1 was killed in 2024 because MMLU/HellaSwag scores were contaminated. Tool gives users a calibrated prior so they know which scores to trust.
- Sim: 35/35 passed. Llama-3 (2023-03 cutoff) → 13 high-risk, 1 medium, 6 low (post-2023 benchmarks correctly land in low).
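
  Sketch of the intended call pattern (names match the exports shown in the js/contamination_prior.js diff below; the cutoff value is just an example):

    import { rateAllBenchmarks } from "./js/contamination_prior.js";
    const rows = rateAllBenchmarks("2023-03");              // e.g. a Llama-3-style cutoff
    const high = rows.filter(r => r.level === "high");      // old benchmarks surface here first
    // each row: { benchmark, gap_months, time_prior, corpora_boost, leak_factor, prior, level, advice_code, ... }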

DOCUMENTATION

Help modal: new "🆕 v0.7 — Anti-bullshit pack" section with detailed write-up per feature (problem → solution → use case) in 4 langs.
Inventory modal: new 5th card "🆕 v0.7 anti-bullshit pack" listing all 4 features tersely.
modes.tip: now lists 11 modes including 🪟 Unmask, 📜 Chat-template, 🎯 Arena CI, 🧪 Contamination.

i18n: 533 keys × 4 langs · 0 missing / 0 extra (parity verified across EN/ES/FR/ZH).
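
  The parity check can be reproduced in a few lines; this sketch assumes TRANSLATIONS maps each language code to a flat key→string object (the exact shape is not visible in this diff):

    import { TRANSLATIONS } from "./js/i18n.js";
    const langs = Object.keys(TRANSLATIONS);                // expected: en, es, fr, zh
    const ref = new Set(Object.keys(TRANSLATIONS[langs[0]]));
    for (const lang of langs.slice(1)) {
      const keys = new Set(Object.keys(TRANSLATIONS[lang]));
      const missing = [...ref].filter(k => !keys.has(k));
      const extra   = [...keys].filter(k => !ref.has(k));
      console.log(lang, "missing:", missing.length, "extra:", extra.length);  // 0 / 0 expected
    }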

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (6)
  1. index.html +71 -0
  2. js/arena_ci.js +292 -0
  3. js/contamination_prior.js +133 -0
  4. js/i18n.js +344 -4
  5. js/main.js +243 -2
  6. style.css +28 -0
index.html CHANGED
@@ -189,6 +189,21 @@
189
  <li data-i18n="help.add_models.manual"><strong>Manual</strong>: fill the form fields directly with values from the model card.</li>
190
  </ul>
191
192
  <h3 data-i18n="help.audit.title">The audit chain</h3>
193
  <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
194
  output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
@@ -287,6 +302,15 @@
287
  <li data-i18n="inv.export.registry">Submit to community registry on GitHub</li>
288
  </ul>
289
  </details>
290
  </div>
291
 
292
  <details class="arch-supported" open>
@@ -336,6 +360,8 @@
336
  <button class="mode-btn" data-mode="phase" role="tab" aria-selected="false" data-i18n="modes.phase">📊 Phase diagram</button>
337
  <button class="mode-btn" data-mode="unmask" role="tab" aria-selected="false" data-i18n="modes.unmask">🪟 Unmask</button>
338
  <button class="mode-btn" data-mode="template" role="tab" aria-selected="false" data-i18n="modes.template">📜 Chat-template</button>
339
  </div>
340
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
341
  <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
@@ -681,6 +707,51 @@
681
  <div id="template-output" style="margin-top: 1em;"></div>
682
  </section>
683
684
  <!-- Recipe selector (mode=recipe) -->
685
  <section id="recipe-section" style="display:none;">
686
  <h2 data-i18n="recipe.title">📋 Recipe</h2>
 
189
  <li data-i18n="help.add_models.manual"><strong>Manual</strong>: fill the form fields directly with values from the model card.</li>
190
  </ul>
191
 
192
+ <h3 style="margin-top: 1.5em;" data-i18n="help.v07.title">🆕 v0.7 — Anti-bullshit pack (4 new modes)</h3>
193
+ <p style="opacity: 0.85;" data-i18n="help.v07.intro"><em>v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.</em></p>
194
+
195
+ <p><strong data-i18n="help.v07.unmask.title">🪟 Context Unmasker</strong></p>
196
+ <p data-i18n="help.v07.unmask.body">Detects when <code>max_position_embeddings</code> is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. <em>Use case</em>: before paying GPU for 32k context, verify the model actually attends that far.</p>
197
+
198
+ <p><strong data-i18n="help.v07.template.title">📜 Chat-template Sniffer</strong></p>
199
+ <p data-i18n="help.v07.template.body">Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting <code>--apply_chat_template</code> silently halves multi-turn accuracy. <em>Use case</em>: before reporting a benchmark score, confirm you applied the template correctly.</p>
200
+
201
+ <p><strong data-i18n="help.v07.arena.title">🎯 Arena-Elo CI Reconstructor</strong></p>
202
+ <p data-i18n="help.v07.arena.body">Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring "model A beats model B", verify their CIs don't overlap.</p>
203
+
204
+ <p><strong data-i18n="help.v07.contam.title">🧪 Contamination Prior</strong></p>
205
+ <p data-i18n="help.v07.contam.body">Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.</p>
206
+
207
  <h3 data-i18n="help.audit.title">The audit chain</h3>
208
  <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
209
  output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
 
302
  <li data-i18n="inv.export.registry">Submit to community registry on GitHub</li>
303
  </ul>
304
  </details>
305
+ <details class="inv-card" open>
306
+ <summary class="inv-card-title" data-i18n="inv.v07.title">🆕 v0.7 anti-bullshit pack</summary>
307
+ <ul>
308
+ <li data-i18n="inv.v07.unmask"><strong>🪟 Unmask</strong> — config.json claims 32k? See if it actually attends that far</li>
309
+ <li data-i18n="inv.v07.template"><strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy</li>
310
+ <li data-i18n="inv.v07.arena"><strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides</li>
311
+ <li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
312
+ </ul>
313
+ </details>
314
  </div>
315
 
316
  <details class="arch-supported" open>
 
360
  <button class="mode-btn" data-mode="phase" role="tab" aria-selected="false" data-i18n="modes.phase">📊 Phase diagram</button>
361
  <button class="mode-btn" data-mode="unmask" role="tab" aria-selected="false" data-i18n="modes.unmask">🪟 Unmask</button>
362
  <button class="mode-btn" data-mode="template" role="tab" aria-selected="false" data-i18n="modes.template">📜 Chat-template</button>
363
+ <button class="mode-btn" data-mode="arena" role="tab" aria-selected="false" data-i18n="modes.arena">🎯 Arena CI</button>
364
+ <button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
365
  </div>
366
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
367
  <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
 
707
  <div id="template-output" style="margin-top: 1em;"></div>
708
  </section>
709
 
710
+ <!-- Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3) -->
711
+ <section id="arena-section" style="display:none;">
712
+ <h2><span data-i18n="arena.title">🎯 Arena-Elo CI Reconstructor</span>
713
+ <span class="info"><span class="tooltip" data-i18n="arena.tip">
714
+ Chatbot Arena strips confidence intervals from the public leaderboard.
715
+ A 5-Elo gap can be statistically meaningless. Paste raw vote data
716
+ (model_a, model_b, winner) — the tool computes Bradley-Terry MLE +
717
+ bootstrap CIs and lists statistical ties (CI overlap).
718
+ </span></span>
719
+ </h2>
720
+ <p class="recipe-desc" data-i18n="arena.desc">
721
+ <strong>Is GPT-4 actually better than Claude — or are they tied?</strong> Paste pairwise vote CSV (or click <em>Load sample</em>). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.
722
+ </p>
723
+ <div class="form-row">
724
+ <button type="button" id="arena-sample-btn" data-i18n="arena.sample_btn">📊 Load sample data</button>
725
+ <button type="button" id="arena-run-btn" data-i18n="arena.run_btn">🎯 Compute CIs</button>
726
+ <button type="button" id="arena-clear-btn" class="secondary" data-i18n="arena.clear_btn">🗑️ Clear</button>
727
+ </div>
728
+ <p id="arena-status" class="recipe-desc" style="font-size:0.92em;"></p>
729
+ <details style="margin: 0.6em 0;" open>
730
+ <summary style="cursor:pointer; font-size:0.92em;" data-i18n="arena.csv_summary">Vote CSV (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)</summary>
731
+ <textarea id="arena-csv" rows="10" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder="model_a,model_b,winner&#10;GPT-4,Claude,a&#10;Llama-3,Mixtral,tie&#10;..."></textarea>
732
+ </details>
733
+ <div id="arena-output" style="margin-top: 1em;"></div>
734
+ </section>
735
+
736
+ <!-- Contamination prior (v0.7.3 anti-bullshit pack #4) -->
737
+ <section id="contam-section" style="display:none;">
738
+ <h2><span data-i18n="contam.title">🧪 Contamination Prior</span>
739
+ <span class="info"><span class="tooltip" data-i18n="contam.tip">
740
+ Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.
741
+ </span></span>
742
+ </h2>
743
+ <p class="recipe-desc" data-i18n="contam.desc">
744
+ <strong>Should you trust your model's MMLU score?</strong> Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.
745
+ </p>
746
+ <div class="form-row">
747
+ <label for="contam-cutoff" data-i18n="contam.cutoff_label">Training cutoff:</label>
748
+ <input type="text" id="contam-cutoff" placeholder="2023-12 or 2024-01" style="max-width:14em;" />
749
+ <button type="button" id="contam-run-btn" data-i18n="contam.run_btn">🧪 Rate all benchmarks</button>
750
+ </div>
751
+ <p id="contam-status" class="recipe-desc" style="font-size:0.92em;"></p>
752
+ <div id="contam-output" style="margin-top: 1em;"></div>
753
+ </section>
754
+
755
  <!-- Recipe selector (mode=recipe) -->
756
  <section id="recipe-section" style="display:none;">
757
  <h2 data-i18n="recipe.title">📋 Recipe</h2>
js/arena_ci.js ADDED
@@ -0,0 +1,292 @@
1
+ // Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3)
2
+ // Recovers confidence intervals from raw pairwise vote data using
3
+ // Bradley-Terry MLE + bootstrap. Chatbot Arena strips CIs from its public
4
+ // leaderboard; this lets a user compute them from any vote CSV.
5
+ // Pure logic — no human-readable strings. main.js renders via i18n.
6
+
7
+ // Parse CSV into vote records. Accepts header row + 3 columns:
8
+ // model_a, model_b, winner (winner ∈ {a, b, tie, model_a, model_b})
9
+ // Tolerates extra whitespace and case-insensitive header matching.
10
+ export function parseVotesCSV(text) {
11
+ const lines = text.split(/\r?\n/).map(l => l.trim()).filter(l => l && !l.startsWith("#"));
12
+ if (lines.length < 2) throw new Error("CSV needs at least a header + 1 data row.");
13
+ const header = lines[0].split(",").map(s => s.trim().toLowerCase());
14
+
15
+ const colA = header.findIndex(h => h === "model_a" || h === "a" || h === "model a");
16
+ const colB = header.findIndex(h => h === "model_b" || h === "b" || h === "model b");
17
+ const colW = header.findIndex(h => h === "winner" || h === "result" || h === "outcome");
18
+ if (colA < 0 || colB < 0 || colW < 0) {
19
+ throw new Error("Header must include columns: model_a, model_b, winner.");
20
+ }
21
+
22
+ const votes = [];
23
+ for (let i = 1; i < lines.length; i++) {
24
+ const row = lines[i].split(",").map(s => s.trim());
25
+ if (row.length < Math.max(colA, colB, colW) + 1) continue;
26
+ const a = row[colA], b = row[colB];
27
+ const w = row[colW].toLowerCase();
28
+ if (!a || !b) continue;
29
+ let winner;
30
+ if (w === "a" || w === "model_a" || w === a.toLowerCase()) winner = "a";
31
+ else if (w === "b" || w === "model_b" || w === b.toLowerCase()) winner = "b";
32
+ else if (w === "tie" || w === "draw" || w === "both" || w === "neither") winner = "tie";
33
+ else continue; // skip unrecognized
34
+ votes.push({ model_a: a, model_b: b, winner });
35
+ }
36
+ return votes;
37
+ }
38
+
39
+ // Bradley-Terry MLE via Minorization-Maximization (Hunter 2004).
40
+ // Each iteration: theta_i ← wins_i / Σ_j (matches_ij / (theta_i + theta_j)).
41
+ // Ties count as half-win to each side. Returns map model → theta (positive scale).
42
+ function fitBradleyTerry(votes, models, opts = {}) {
43
+ const { maxIter = 100, tol = 1e-7 } = opts;
44
+ const n = models.length;
45
+ const idx = Object.fromEntries(models.map((m, i) => [m, i]));
46
+ const wins = new Float64Array(n);
47
+ const matches = Array.from({ length: n }, () => new Float64Array(n));
48
+
49
+ for (const v of votes) {
50
+ const a = idx[v.model_a], b = idx[v.model_b];
51
+ if (a === undefined || b === undefined) continue;
52
+ matches[a][b] += 1;
53
+ matches[b][a] += 1;
54
+ if (v.winner === "a") wins[a] += 1;
55
+ else if (v.winner === "b") wins[b] += 1;
56
+ else if (v.winner === "tie") { wins[a] += 0.5; wins[b] += 0.5; }
57
+ }
58
+
59
+ let theta = new Float64Array(n).fill(1.0);
60
+ for (let iter = 0; iter < maxIter; iter++) {
61
+ const next = new Float64Array(n);
62
+ for (let i = 0; i < n; i++) {
63
+ let denom = 0;
64
+ for (let j = 0; j < n; j++) {
65
+ if (i !== j && matches[i][j] > 0) {
66
+ denom += matches[i][j] / (theta[i] + theta[j]);
67
+ }
68
+ }
69
+ const w = wins[i] || 1e-9; // floor at epsilon so a winless model doesn't collapse to zero
70
+ next[i] = w / (denom || 1e-9);
71
+ }
72
+ // normalize so geometric mean = 1 → keeps Elo identifiable
73
+ let logSum = 0;
74
+ for (let i = 0; i < n; i++) logSum += Math.log(next[i] || 1e-12);
75
+ const gm = Math.exp(logSum / n);
76
+ for (let i = 0; i < n; i++) next[i] /= gm;
77
+ // convergence check
78
+ let maxDelta = 0;
79
+ for (let i = 0; i < n; i++) maxDelta = Math.max(maxDelta, Math.abs(next[i] - theta[i]));
80
+ theta = next;
81
+ if (maxDelta < tol) break;
82
+ }
83
+ return theta;
84
+ }
85
+
86
+ // Convert BT theta → Elo (anchor: geometric-mean model = 1500).
87
+ function thetaToElo(theta) { return Array.from(theta).map(t => 400 * Math.log10(t) + 1500); }
88
+
89
+ // Bootstrap percentile CIs. Resamples votes with replacement B times,
90
+ // refits BT each time, returns {ci_low, ci_high} per model.
91
+ function bootstrapCIs(votes, models, opts = {}) {
92
+ const { B = 200, ci = 0.95 } = opts;
93
+ const samples = Array.from({ length: models.length }, () => []);
94
+ const N = votes.length;
95
+ for (let b = 0; b < B; b++) {
96
+ const resample = new Array(N);
97
+ for (let k = 0; k < N; k++) resample[k] = votes[(Math.random() * N) | 0];
98
+ const eloRow = thetaToElo(fitBradleyTerry(resample, models, { maxIter: 50 }));
99
+ for (let i = 0; i < models.length; i++) samples[i].push(eloRow[i]);
100
+ }
101
+ const loIdx = Math.floor((1 - ci) / 2 * B);
102
+ const hiIdx = Math.floor((1 - (1 - ci) / 2) * B);
103
+ return samples.map(s => {
104
+ s.sort((a, b) => a - b);
105
+ return { ci_low: s[loIdx], ci_high: s[Math.min(hiIdx, B - 1)] };
106
+ });
107
+ }
108
+
109
+ // Detect statistical ties: pairs whose bootstrap CIs overlap
110
+ // (a cheap proxy for the bootstrap distributions themselves overlapping).
111
+ function findTies(ratings) {
112
+ const ties = [];
113
+ const sorted = [...ratings].sort((a, b) => b.elo - a.elo);
114
+ for (let i = 0; i < sorted.length; i++) {
115
+ for (let j = i + 1; j < sorted.length; j++) {
116
+ const a = sorted[i], b = sorted[j];
117
+ // CI overlap: a.ci_low <= b.ci_high (a's lower bound below b's upper bound)
118
+ if (a.ci_low <= b.ci_high) {
119
+ const eloDiff = a.elo - b.elo;
120
+ const totalSpread = (a.ci_high - a.ci_low) + (b.ci_high - b.ci_low);
121
+ const overlap = Math.max(0, b.ci_high - a.ci_low);
122
+ ties.push({
123
+ rank_a: i + 1, rank_b: j + 1,
124
+ model_a: a.model, model_b: b.model,
125
+ elo_diff: eloDiff,
126
+ overlap_elo: overlap,
127
+ combined_spread: totalSpread,
128
+ });
129
+ }
130
+ }
131
+ }
132
+ return ties;
133
+ }
134
+
135
+ // Top-level entry. Input = array of {model_a, model_b, winner}.
136
+ // Output = ranked ratings + ties + summary.
137
+ export function computeArenaCI(votes, opts = {}) {
138
+ if (!Array.isArray(votes) || votes.length === 0) {
139
+ return { ratings: [], ties: [], summary: { total_votes: 0, n_models: 0, n_ties: 0 } };
140
+ }
141
+ const modelSet = new Set();
142
+ for (const v of votes) { modelSet.add(v.model_a); modelSet.add(v.model_b); }
143
+ const models = [...modelSet].sort();
144
+
145
+ // Per-model raw counts
146
+ const stats = Object.fromEntries(models.map(m => [m, { wins: 0, losses: 0, ties: 0, matches: 0 }]));
147
+ for (const v of votes) {
148
+ stats[v.model_a].matches++;
149
+ stats[v.model_b].matches++;
150
+ if (v.winner === "a") { stats[v.model_a].wins++; stats[v.model_b].losses++; }
151
+ else if (v.winner === "b") { stats[v.model_b].wins++; stats[v.model_a].losses++; }
152
+ else { stats[v.model_a].ties++; stats[v.model_b].ties++; }
153
+ }
154
+
155
+ // Point-estimate Elo
156
+ const theta = fitBradleyTerry(votes, models, { maxIter: 100 });
157
+ const elos = thetaToElo(theta);
158
+ // Bootstrap CIs
159
+ const cis = bootstrapCIs(votes, models, { B: opts.bootstrapN ?? 200, ci: opts.ciLevel ?? 0.95 });
160
+
161
+ const ratings = models.map((m, i) => ({
162
+ model: m,
163
+ elo: Math.round(elos[i] * 10) / 10,
164
+ ci_low: Math.round(cis[i].ci_low * 10) / 10,
165
+ ci_high: Math.round(cis[i].ci_high * 10) / 10,
166
+ ci_width: Math.round((cis[i].ci_high - cis[i].ci_low) * 10) / 10,
167
+ matches: stats[m].matches,
168
+ wins: stats[m].wins,
169
+ losses: stats[m].losses,
170
+ ties_count: stats[m].ties,
171
+ })).sort((a, b) => b.elo - a.elo);
172
+
173
+ // Recompute ranks after sort
174
+ ratings.forEach((r, i) => { r.rank = i + 1; });
175
+
176
+ const ties = findTies(ratings);
177
+
178
+ return {
179
+ ratings,
180
+ ties,
181
+ summary: {
182
+ total_votes: votes.length,
183
+ n_models: models.length,
184
+ n_ties: ties.length,
185
+ bootstrap_iters: opts.bootstrapN ?? 200,
186
+ ci_level: opts.ciLevel ?? 0.95,
187
+ },
188
+ };
189
+ }
190
+
191
+ // Embedded sample data so users can demo the tool without their own CSV.
192
+ // 6 models, 96 votes, designed so 2 pairs are statistically tied and the
193
+ // top model is clearly distinguishable from the bottom.
194
+ export const SAMPLE_VOTES_CSV = `# Synthetic Arena-style sample: 6 models, 96 votes.
195
+ # True underlying skill (in arbitrary units): GPT-4=1.6, Claude=1.5, Llama-3=1.0, Mixtral=0.95, Gemma=0.6, Phi=0.5
196
+ model_a,model_b,winner
197
+ GPT-4,Claude,a
198
+ Claude,GPT-4,b
199
+ GPT-4,Llama-3,a
200
+ GPT-4,Llama-3,a
201
+ GPT-4,Llama-3,a
202
+ GPT-4,Mixtral,a
203
+ GPT-4,Mixtral,a
204
+ GPT-4,Mixtral,a
205
+ GPT-4,Gemma,a
206
+ GPT-4,Gemma,a
207
+ GPT-4,Gemma,a
208
+ GPT-4,Gemma,a
209
+ GPT-4,Phi,a
210
+ GPT-4,Phi,a
211
+ GPT-4,Phi,a
212
+ GPT-4,Phi,a
213
+ GPT-4,Phi,a
214
+ Claude,Llama-3,a
215
+ Claude,Llama-3,a
216
+ Claude,Llama-3,a
217
+ Claude,Mixtral,a
218
+ Claude,Mixtral,a
219
+ Claude,Mixtral,a
220
+ Claude,Gemma,a
221
+ Claude,Gemma,a
222
+ Claude,Gemma,a
223
+ Claude,Phi,a
224
+ Claude,Phi,a
225
+ Claude,Phi,a
226
+ Claude,Phi,a
227
+ GPT-4,Claude,tie
228
+ Claude,GPT-4,tie
229
+ GPT-4,Claude,a
230
+ Claude,GPT-4,a
231
+ Llama-3,Mixtral,tie
232
+ Llama-3,Mixtral,a
233
+ Mixtral,Llama-3,a
234
+ Llama-3,Mixtral,b
235
+ Mixtral,Llama-3,b
236
+ Llama-3,Mixtral,tie
237
+ Llama-3,Mixtral,a
238
+ Mixtral,Llama-3,a
239
+ Llama-3,Gemma,a
240
+ Llama-3,Gemma,a
241
+ Llama-3,Gemma,a
242
+ Llama-3,Phi,a
243
+ Llama-3,Phi,a
244
+ Mixtral,Gemma,a
245
+ Mixtral,Gemma,a
246
+ Mixtral,Phi,a
247
+ Mixtral,Phi,a
248
+ Gemma,Phi,tie
249
+ Phi,Gemma,tie
250
+ Gemma,Phi,a
251
+ Phi,Gemma,a
252
+ Gemma,Phi,b
253
+ Phi,Gemma,b
254
+ Gemma,Phi,a
255
+ Phi,Gemma,a
256
+ GPT-4,Llama-3,b
257
+ Claude,Mixtral,b
258
+ Llama-3,Phi,a
259
+ Llama-3,Gemma,b
260
+ Mixtral,Phi,b
261
+ Gemma,Phi,a
262
+ GPT-4,Mixtral,a
263
+ Claude,Llama-3,a
264
+ GPT-4,Phi,a
265
+ Claude,Gemma,a
266
+ GPT-4,Gemma,a
267
+ Claude,Phi,a
268
+ Llama-3,Mixtral,a
269
+ Mixtral,Llama-3,a
270
+ GPT-4,Claude,a
271
+ Claude,GPT-4,b
272
+ GPT-4,Claude,b
273
+ Claude,GPT-4,a
274
+ GPT-4,Mixtral,a
275
+ Claude,Phi,a
276
+ Mixtral,Gemma,a
277
+ Llama-3,Gemma,a
278
+ GPT-4,Llama-3,a
279
+ Claude,Mixtral,a
280
+ Mixtral,Phi,a
281
+ Llama-3,Phi,a
282
+ Gemma,Phi,a
283
+ Phi,Gemma,b
284
+ GPT-4,Gemma,a
285
+ Claude,Gemma,a
286
+ GPT-4,Phi,a
287
+ Claude,Phi,a
288
+ Llama-3,Mixtral,b
289
+ Mixtral,Llama-3,b
290
+ GPT-4,Claude,tie
291
+ Llama-3,Mixtral,tie
292
+ Gemma,Phi,tie`;
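
For orientation, a quick numeric check of the theta → Elo mapping used by thetaToElo above (geometric-mean model anchored at 1500); the numbers below are plain arithmetic, not simulation output:

    // theta = 1.0 → 400·log10(1.0) + 1500 = 1500.0
    // theta = 2.0 → 400·log10(2.0) + 1500 ≈ 1620.4
    // theta = 0.5 → 400·log10(0.5) + 1500 ≈ 1379.6
    // Bradley-Terry: P(i beats j) = theta_i / (theta_i + theta_j),
    // so theta 2.0 vs 1.0 → 2/3 expected win rate, i.e. roughly a 120-Elo edge.
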
js/contamination_prior.js ADDED
@@ -0,0 +1,133 @@
1
+ // Contamination Prior (v0.7.3 anti-bullshit pack #4)
2
+ // Bayesian-ish prior on whether a benchmark score is contaminated, based on
3
+ // (model training cutoff date) × (benchmark release date) × (known leak status).
4
+ // Pure logic — no human strings. Open LLM Leaderboard v1 (MMLU/HellaSwag/etc)
5
+ // was killed for contamination; this lets a user calibrate trust per score.
6
+
7
+ // Benchmark database. Each entry tracks release date, whether it's known to
8
+ // be in common pretraining corpora (CommonCrawl etc), and a base-rate adjustment
9
+ // (incident-driven: confirmed leaks, paraphrased copies in training data, etc).
10
+ //
11
+ // Sources: arxiv 2404.00699 (contamination survey), HF dataset cards,
12
+ // public reproductions / known leak reports.
13
+ export const BENCHMARK_DB = {
14
+ // Format: { id, name, released: "YYYY-MM", in_corpora: bool, leak_factor: 0..1, category, paper }
15
+ "mmlu": { id: "mmlu", name: "MMLU", released: "2020-09", in_corpora: true, leak_factor: 0.18, category: "knowledge", paper: "Hendrycks 2020" },
16
+ "mmlu_pro": { id: "mmlu_pro", name: "MMLU-Pro", released: "2024-06", in_corpora: false, leak_factor: 0.05, category: "knowledge", paper: "Wang 2024" },
17
+ "hellaswag": { id: "hellaswag", name: "HellaSwag", released: "2019-05", in_corpora: true, leak_factor: 0.20, category: "commonsense", paper: "Zellers 2019" },
18
+ "arc_challenge": { id: "arc_challenge", name: "ARC Challenge", released: "2018-04", in_corpora: true, leak_factor: 0.15, category: "knowledge", paper: "Clark 2018" },
19
+ "truthfulqa": { id: "truthfulqa", name: "TruthfulQA", released: "2021-09", in_corpora: true, leak_factor: 0.10, category: "truthfulness", paper: "Lin 2021" },
20
+ "gsm8k": { id: "gsm8k", name: "GSM8K", released: "2021-10", in_corpora: true, leak_factor: 0.12, category: "math", paper: "Cobbe 2021" },
21
+ "math": { id: "math", name: "MATH", released: "2021-03", in_corpora: true, leak_factor: 0.10, category: "math", paper: "Hendrycks 2021" },
22
+ "humaneval": { id: "humaneval", name: "HumanEval", released: "2021-07", in_corpora: true, leak_factor: 0.18, category: "code", paper: "Chen 2021" },
23
+ "mbpp": { id: "mbpp", name: "MBPP", released: "2021-08", in_corpora: true, leak_factor: 0.12, category: "code", paper: "Austin 2021" },
24
+ "bbh": { id: "bbh", name: "BIG-Bench Hard (BBH)", released: "2022-10", in_corpora: true, leak_factor: 0.08, category: "reasoning", paper: "Suzgun 2022" },
25
+ "ifeval": { id: "ifeval", name: "IFEval", released: "2023-11", in_corpora: false, leak_factor: 0.05, category: "instruction", paper: "Zhou 2023" },
26
+ "musr": { id: "musr", name: "MuSR", released: "2023-10", in_corpora: false, leak_factor: 0.04, category: "reasoning", paper: "Sprague 2023" },
27
+ "gpqa": { id: "gpqa", name: "GPQA", released: "2023-11", in_corpora: false, leak_factor: 0.04, category: "graduate-knowledge", paper: "Rein 2023" },
28
+ "math500": { id: "math500", name: "MATH-500", released: "2023-11", in_corpora: false, leak_factor: 0.05, category: "math", paper: "Lightman 2023" },
29
+ "aime24": { id: "aime24", name: "AIME 2024", released: "2024-02", in_corpora: false, leak_factor: 0.02, category: "math", paper: "AIME 2024" },
30
+ "winogrande": { id: "winogrande", name: "Winogrande", released: "2019-07", in_corpora: true, leak_factor: 0.15, category: "commonsense", paper: "Sakaguchi 2019" },
31
+ "boolq": { id: "boolq", name: "BoolQ", released: "2019-05", in_corpora: true, leak_factor: 0.15, category: "reading", paper: "Clark 2019" },
32
+ "drop": { id: "drop", name: "DROP", released: "2019-04", in_corpora: true, leak_factor: 0.12, category: "reading", paper: "Dua 2019" },
33
+ "triviaqa": { id: "triviaqa", name: "TriviaQA", released: "2017-05", in_corpora: true, leak_factor: 0.18, category: "knowledge", paper: "Joshi 2017" },
34
+ "squad": { id: "squad", name: "SQuAD", released: "2016-06", in_corpora: true, leak_factor: 0.20, category: "reading", paper: "Rajpurkar 2016" },
35
+ };
36
+
37
+ // Parse "YYYY-MM" or "YYYY-MM-DD" or "YYYY". Returns Date or null.
38
+ function parseLooseDate(s) {
39
+ if (!s) return null;
40
+ const m = String(s).trim().match(/^(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?/);
41
+ if (!m) return null;
42
+ const y = parseInt(m[1], 10);
43
+ const mo = m[2] ? Math.max(1, Math.min(12, parseInt(m[2], 10))) : 6;
44
+ const d = m[3] ? Math.max(1, Math.min(28, parseInt(m[3], 10))) : 15;
45
+ return new Date(Date.UTC(y, mo - 1, d));
46
+ }
47
+
48
+ // Time-based base prior. Returns probability that benchmark text was in the
49
+ // model's training data given (cutoff - release) gap.
50
+ //
51
+ // Heuristic curve:
52
+ // gap < 0 (released after cutoff) → 0.02 (only via leaks)
53
+ // gap 0-3 months → 0.10–0.25
54
+ // gap 3-12 months → 0.25–0.55
55
+ // gap 12-24 months → 0.55–0.75
56
+ // gap > 24 months (heavily reproduced) → 0.75–0.92
57
+ function timePrior(gapMonths) {
58
+ if (gapMonths < 0) return 0.02;
59
+ if (gapMonths === 0) return 0.10;
60
+ if (gapMonths <= 3) return 0.10 + (gapMonths / 3) * 0.15;
61
+ if (gapMonths <= 12) return 0.25 + ((gapMonths - 3) / 9) * 0.30;
62
+ if (gapMonths <= 24) return 0.55 + ((gapMonths - 12) / 12) * 0.20;
63
+ return Math.min(0.92, 0.75 + ((gapMonths - 24) / 36) * 0.17);
64
+ }
65
+
66
+ // Per-benchmark prior: time-prior + in_corpora boost + leak_factor (additive).
67
+ // Caps at 0.97 (always some uncertainty).
68
+ export function computeContaminationPrior(modelCutoff, benchmarkId) {
69
+ const bench = BENCHMARK_DB[benchmarkId];
70
+ if (!bench) return null;
71
+ const cutoffDate = parseLooseDate(modelCutoff);
72
+ const releaseDate = parseLooseDate(bench.released);
73
+ if (!cutoffDate || !releaseDate) return null;
74
+
75
+ const gapMs = cutoffDate.getTime() - releaseDate.getTime();
76
+ const gapMonths = gapMs / (1000 * 60 * 60 * 24 * 30.44);
77
+ const tp = timePrior(gapMonths);
78
+ const corporaBoost = bench.in_corpora ? 0.10 : 0.0;
79
+ const raw = tp + corporaBoost + bench.leak_factor;
80
+ const prior = Math.max(0.01, Math.min(0.97, raw));
81
+
82
+ let level;
83
+ if (prior >= 0.65) level = "high";
84
+ else if (prior >= 0.30) level = "medium";
85
+ else level = "low";
86
+
87
+ return {
88
+ benchmark: bench.name,
89
+ benchmark_id: bench.id,
90
+ benchmark_released: bench.released,
91
+ benchmark_category: bench.category,
92
+ benchmark_in_corpora: bench.in_corpora,
93
+ benchmark_paper: bench.paper,
94
+ model_cutoff: modelCutoff,
95
+ gap_months: Math.round(gapMonths * 10) / 10,
96
+ time_prior: Math.round(tp * 100) / 100,
97
+ corpora_boost: corporaBoost,
98
+ leak_factor: bench.leak_factor,
99
+ prior: Math.round(prior * 100) / 100,
100
+ level,
101
+ advice_code: level === "high" ? "treat_unreliable" :
102
+ level === "medium" ? "verify_alternate" : "score_likely_clean",
103
+ };
104
+ }
105
+
106
+ // Batch helper: rate all benchmarks for a given cutoff. Returns array sorted
107
+ // by prior descending so the most-contaminated ones surface first.
108
+ export function rateAllBenchmarks(modelCutoff) {
109
+ return Object.values(BENCHMARK_DB)
110
+ .map(b => computeContaminationPrior(modelCutoff, b.id))
111
+ .filter(Boolean)
112
+ .sort((a, b) => b.prior - a.prior);
113
+ }
114
+
115
+ // Aggregate verdict for a list of (benchmark_id, reported_score) pairs.
116
+ // User pastes their leaderboard scores → tool flags which are likely
117
+ // contaminated and which aren't.
118
+ export function aggregateScoreSheet(modelCutoff, scoreSheet) {
119
+ const rows = [];
120
+ for (const { benchmark_id, score } of scoreSheet) {
121
+ const p = computeContaminationPrior(modelCutoff, benchmark_id);
122
+ if (p) rows.push({ ...p, reported_score: score });
123
+ }
124
+ rows.sort((a, b) => b.prior - a.prior);
125
+ const counts = { high: 0, medium: 0, low: 0 };
126
+ for (const r of rows) counts[r.level]++;
127
+ return {
128
+ rows,
129
+ counts,
130
+ total: rows.length,
131
+ high_pct: rows.length ? Math.round(counts.high / rows.length * 100) : 0,
132
+ };
133
+ }
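
Sketch of how a score sheet flows through aggregateScoreSheet above (the reported scores here are placeholders, not real results):

    import { aggregateScoreSheet } from "./js/contamination_prior.js";
    const sheet = [
      { benchmark_id: "mmlu",      score: 68.2 },
      { benchmark_id: "gpqa",      score: 31.5 },
      { benchmark_id: "humaneval", score: 55.0 },
    ];
    const { rows, counts, high_pct } = aggregateScoreSheet("2023-12", sheet);
    // rows: contamination prior per benchmark + reported_score, sorted by prior desc;
    // counts = { high, medium, low }; high_pct = share of scores rated high-risk.
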
js/i18n.js CHANGED
@@ -224,6 +224,91 @@ export const TRANSLATIONS = {
224
  "template.status.invalid_json":"❌ Not valid JSON: {error}",
225
  "template.status.success_paste":"✅ Sniffed pasted config (verdict: {verdict})",
226
  "template.pasted_label": "(pasted tokenizer_config)",
227
  "share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
228
  "share.import_btn": "📂 Load shared JSON",
229
  "synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
@@ -316,7 +401,7 @@ export const TRANSLATIONS = {
316
  "common.no": "No",
317
 
318
  // Mode tooltips
319
- "modes.tip": "<strong>Nine ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.",
320
  "profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
321
  "compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
322
 
@@ -872,6 +957,91 @@ export const TRANSLATIONS = {
872
  "template.status.invalid_json":"❌ JSON inválido: {error}",
873
  "template.status.success_paste":"✅ Config pegado detectado (veredicto: {verdict})",
874
  "template.pasted_label": "(tokenizer_config pegado)",
875
  "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
876
  "share.import_btn": "📂 Cargar JSON compartido",
877
  "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
@@ -964,7 +1134,7 @@ export const TRANSLATIONS = {
964
  "common.no": "No",
965
 
966
  // Tooltips de modos
967
- "modes.tip": "<strong>Nueve formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.",
968
  "profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
969
  "compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
970
 
@@ -1384,6 +1554,91 @@ export const TRANSLATIONS = {
1384
  "template.status.invalid_json":"❌ JSON invalide : {error}",
1385
  "template.status.success_paste":"✅ Config collé détecté (verdict : {verdict})",
1386
  "template.pasted_label": "(tokenizer_config collé)",
1387
  "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
1388
  "share.import_btn": "📂 Charger JSON partagé",
1389
  "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
@@ -1476,7 +1731,7 @@ export const TRANSLATIONS = {
1476
  "common.no": "Non",
1477
 
1478
  // Tooltips des modes
1479
- "modes.tip": "<strong>Neuf façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.",
1480
  "profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
1481
  "compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
1482
 
@@ -1896,6 +2151,91 @@ export const TRANSLATIONS = {
1896
  "template.status.invalid_json":"❌ JSON 无效:{error}",
1897
  "template.status.success_paste":"✅ 已检测粘贴的 config(判定:{verdict})",
1898
  "template.pasted_label": "(已粘贴 tokenizer_config)",
1899
  "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
1900
  "share.import_btn": "📂 加载共享的 JSON",
1901
  "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
@@ -1988,7 +2328,7 @@ export const TRANSLATIONS = {
1988
  "common.no": "否",
1989
 
1990
  // 模式提示
1991
- "modes.tip": "<strong>种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。",
1992
  "profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
1993
  "compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
1994
 
 
224
  "template.status.invalid_json":"❌ Not valid JSON: {error}",
225
  "template.status.success_paste":"✅ Sniffed pasted config (verdict: {verdict})",
226
  "template.pasted_label": "(pasted tokenizer_config)",
227
+
228
+ // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
229
+ "modes.arena": "🎯 Arena CI",
230
+ "mode_desc.arena": "Recovers confidence intervals from raw pairwise vote data (Bradley-Terry MLE + bootstrap). Detects statistically tied pairs that the public Arena leaderboard hides.",
231
+ "arena.title": "🎯 Arena-Elo CI Reconstructor",
232
+ "arena.tip": "Chatbot Arena strips confidence intervals from the public leaderboard. A 5-Elo gap can be statistically meaningless. Paste raw vote data (model_a, model_b, winner) — the tool computes Bradley-Terry MLE + bootstrap CIs and lists statistical ties (CI overlap).",
233
+ "arena.desc": "<strong>Is GPT-4 actually better than Claude — or are they tied?</strong> Paste pairwise vote CSV (or click <em>Load sample</em>). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.",
234
+ "arena.sample_btn": "📊 Load sample data",
235
+ "arena.run_btn": "🎯 Compute CIs",
236
+ "arena.clear_btn": "🗑️ Clear",
237
+ "arena.csv_summary": "Vote CSV (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)",
238
+ "arena.section.ranked": "Ranked Elos with 95% CIs",
239
+ "arena.section.ties": "Statistical ties (CI overlap)",
240
+ "arena.section.summary": "Summary",
241
+ "arena.col.rank": "#",
242
+ "arena.col.model": "Model",
243
+ "arena.col.elo": "Elo",
244
+ "arena.col.ci": "95% CI",
245
+ "arena.col.ci_width": "± half-width",
246
+ "arena.col.matches": "Matches",
247
+ "arena.col.wins": "W / L / T",
248
+ "arena.col.tie_pair": "Pair",
249
+ "arena.col.tie_diff": "Elo gap",
250
+ "arena.col.tie_overlap": "CI overlap",
251
+ "arena.no_ties": "No statistical ties — all pairs distinguishable at 95% CI.",
252
+ "arena.summary.votes": "Total votes",
253
+ "arena.summary.models": "Models",
254
+ "arena.summary.ties": "Statistical ties",
255
+ "arena.summary.bootstrap": "Bootstrap iters",
256
+ "arena.summary.ci_level": "CI level",
257
+ "arena.status.empty": "⚠ Paste vote CSV or click Load sample.",
258
+ "arena.status.too_few": "⚠ Only {n} valid votes — need at least 10 to fit Bradley-Terry reliably.",
259
+ "arena.status.computing": "⏳ Computing Bradley-Terry MLE + bootstrap on {n} votes...",
260
+ "arena.status.done": "✅ {n} votes · {models} models · {ties} statistical ties · {ms} ms",
261
+ "arena.status.sample_loaded": "✅ Sample loaded (synthetic 6-model Arena data). Click Compute CIs.",
262
+
263
+ // v0.7.3 — anti-bullshit pack #4: Contamination Prior
264
+ "modes.contam": "🧪 Contamination",
265
+ "mode_desc.contam": "Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff → rates 20+ popular benchmarks (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
266
+ "contam.title": "🧪 Contamination Prior",
267
+ "contam.tip": "Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.",
268
+ "contam.desc": "<strong>Should you trust your model's MMLU score?</strong> Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.",
269
+ "contam.cutoff_label": "Training cutoff:",
270
+ "contam.run_btn": "🧪 Rate all benchmarks",
271
+ "contam.section.ranked": "Benchmark contamination priors",
272
+ "contam.section.high": "🔴 High-risk benchmarks (treat scores as unreliable)",
273
+ "contam.section.medium": "🟡 Medium-risk (verify with alternates)",
274
+ "contam.section.low": "🟢 Low-risk (likely clean)",
275
+ "contam.col.benchmark": "Benchmark",
276
+ "contam.col.released": "Released",
277
+ "contam.col.gap": "Gap (months)",
278
+ "contam.col.prior": "P(contam)",
279
+ "contam.col.level": "Level",
280
+ "contam.col.corpora": "In corpora",
281
+ "contam.col.category": "Category",
282
+ "contam.label.high": "High risk",
283
+ "contam.label.medium": "Medium",
284
+ "contam.label.low": "Low",
285
+ "contam.no_entries": "(none in this category)",
286
+ "contam.advice.high": "Treat these scores as unreliable. Replace with newer / private-test alternates (MMLU-Pro, GPQA, MUSR, MATH-500).",
287
+ "contam.advice.medium": "Take with caution. Look for replication on a held-out subset or community reproductions.",
288
+ "contam.advice.low": "Score likely uncontaminated, but absence of leak is not proof — still cross-check with alternate test.",
289
+ "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks rated",
290
+ "contam.status.empty": "⚠ Enter a model training cutoff date (e.g. 2023-12).",
291
+ "contam.status.bad_date": "⚠ Bad date format. Use YYYY-MM or YYYY-MM-DD.",
292
+ "contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks rated · {high} high-risk",
293
+
294
+ // v0.7 — Help modal section
295
+ "help.v07.title": "🆕 v0.7 — Anti-bullshit pack (4 new modes)",
296
+ "help.v07.intro": "<em>v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.</em>",
297
+ "help.v07.unmask.title": "🪟 Context Unmasker",
298
+ "help.v07.unmask.body": "Detects when <code>max_position_embeddings</code> is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. <em>Use case</em>: before paying GPU for 32k context, verify the model actually attends that far.",
299
+ "help.v07.template.title": "📜 Chat-template Sniffer",
300
+ "help.v07.template.body": "Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting <code>--apply_chat_template</code> silently halves multi-turn accuracy. <em>Use case</em>: before reporting a benchmark score, confirm you applied the template correctly.",
301
+ "help.v07.arena.title": "🎯 Arena-Elo CI Reconstructor",
302
+ "help.v07.arena.body": "Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a \"statistical ties\" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring \"model A beats model B\", verify their CIs don't overlap.",
303
+ "help.v07.contam.title": "🧪 Contamination Prior",
304
+ "help.v07.contam.body": "Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.",
305
+
306
+ // v0.7 — Inventory modal 5th card
307
+ "inv.v07.title": "🆕 v0.7 anti-bullshit pack",
308
+ "inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json claims 32k? See if it actually attends that far",
309
+ "inv.v07.template": "<strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy",
310
+ "inv.v07.arena": "<strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides",
311
+ "inv.v07.contam": "<strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability",
312
  "share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
313
  "share.import_btn": "📂 Load shared JSON",
314
  "synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
 
401
  "common.no": "No",
402
 
403
  // Mode tooltips
404
+ "modes.tip": "<strong>Eleven ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.",
405
  "profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
406
  "compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
407
 
 
957
  "template.status.invalid_json":"❌ JSON inválido: {error}",
958
  "template.status.success_paste":"✅ Config pegado detectado (veredicto: {verdict})",
959
  "template.pasted_label": "(tokenizer_config pegado)",
960
+
961
+ // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
962
+ "modes.arena": "🎯 Arena CI",
963
+ "mode_desc.arena": "Recupera intervalos de confianza desde datos crudos de votos pairwise (MLE Bradley-Terry + bootstrap). Detecta pares estadísticamente empatados que el leaderboard público de Arena oculta.",
964
+ "arena.title": "🎯 Reconstructor Arena-Elo CI",
965
+ "arena.tip": "Chatbot Arena oculta los intervalos de confianza en el leaderboard público. Una diferencia de 5 Elo puede ser estadísticamente irrelevante. Pega datos crudos de votos (model_a, model_b, winner) — la herramienta calcula MLE Bradley-Terry + bootstrap CIs y lista los empates estadísticos (overlap de CI).",
966
+ "arena.desc": "<strong>¿GPT-4 es realmente mejor que Claude — o están empatados?</strong> Pega CSV de votos pairwise (o click <em>Cargar sample</em>). MLE Bradley-Terry + 200 iteraciones de bootstrap → Elos ranked con CIs 95% y detección de empates estadísticos. Todo en el navegador.",
967
+ "arena.sample_btn": "📊 Cargar datos sample",
968
+ "arena.run_btn": "🎯 Calcular CIs",
969
+ "arena.clear_btn": "🗑️ Limpiar",
970
+ "arena.csv_summary": "CSV de votos (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)",
971
+ "arena.section.ranked": "Elos ranked con CIs 95%",
972
+ "arena.section.ties": "Empates estadísticos (overlap CI)",
973
+ "arena.section.summary": "Resumen",
974
+ "arena.col.rank": "#",
975
+ "arena.col.model": "Modelo",
976
+ "arena.col.elo": "Elo",
977
+ "arena.col.ci": "CI 95%",
978
+ "arena.col.ci_width": "± semi-anchura",
979
+ "arena.col.matches": "Partidas",
980
+ "arena.col.wins": "V / D / E",
981
+ "arena.col.tie_pair": "Par",
982
+ "arena.col.tie_diff": "Brecha Elo",
983
+ "arena.col.tie_overlap": "Overlap CI",
984
+ "arena.no_ties": "Sin empates estadísticos — todos los pares distinguibles al CI 95%.",
985
+ "arena.summary.votes": "Votos totales",
986
+ "arena.summary.models": "Modelos",
987
+ "arena.summary.ties": "Empates estadísticos",
988
+ "arena.summary.bootstrap": "Iteraciones bootstrap",
989
+ "arena.summary.ci_level": "Nivel CI",
990
+ "arena.status.empty": "⚠ Pega un CSV de votos o click en Cargar sample.",
991
+ "arena.status.too_few": "⚠ Solo {n} votos válidos — se necesitan al menos 10 para ajustar Bradley-Terry de forma fiable.",
992
+ "arena.status.computing": "⏳ Calculando MLE Bradley-Terry + bootstrap sobre {n} votos...",
993
+ "arena.status.done": "✅ {n} votos · {models} modelos · {ties} empates estadísticos · {ms} ms",
994
+ "arena.status.sample_loaded": "✅ Sample cargado (datos sintéticos Arena de 6 modelos). Click en Calcular CIs.",
995
+
996
+ // v0.7.3 — anti-bullshit pack #4: Contamination Prior
997
+ "modes.contam": "🧪 Contaminación",
998
+ "mode_desc.contam": "Prior bayesiano-ish sobre si un score de benchmark está contaminado. Introduce la fecha de cutoff de entrenamiento → puntúa 20+ benchmarks populares (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
999
+ "contam.title": "🧪 Prior de Contaminación",
1000
+ "contam.tip": "Calcula un prior bayesiano-ish sobre si un score de benchmark está contaminado, basado en (fecha de cutoff de entrenamiento) × (fecha de release del benchmark) × (inclusión conocida en corpus + historial de leaks). Open LLM Leaderboard v1 fue cancelado en 2024 tras la contaminación de MMLU/HellaSwag.",
1001
+ "contam.desc": "<strong>¿Deberías confiar en el MMLU de tu modelo?</strong> Introduce la fecha cutoff de entrenamiento — la herramienta puntúa 20+ benchmarks populares (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) y te dice qué scores son probablemente contaminados.",
1002
+ "contam.cutoff_label": "Cutoff entrenamiento:",
1003
+ "contam.run_btn": "🧪 Puntuar todos los benchmarks",
1004
+ "contam.section.ranked": "Priors de contaminación por benchmark",
1005
+ "contam.section.high": "🔴 Benchmarks de alto riesgo (trata los scores como no fiables)",
1006
+ "contam.section.medium": "🟡 Riesgo medio (verifica con alternativas)",
1007
+ "contam.section.low": "🟢 Bajo riesgo (probablemente limpios)",
1008
+ "contam.col.benchmark": "Benchmark",
1009
+ "contam.col.released": "Release",
1010
+ "contam.col.gap": "Gap (meses)",
1011
+ "contam.col.prior": "P(contam)",
1012
+ "contam.col.level": "Nivel",
1013
+ "contam.col.corpora": "En corpus",
1014
+ "contam.col.category": "Categoría",
1015
+ "contam.label.high": "Alto riesgo",
1016
+ "contam.label.medium": "Medio",
1017
+ "contam.label.low": "Bajo",
1018
+ "contam.no_entries": "(ninguno en esta categoría)",
1019
+ "contam.advice.high": "Trata estos scores como no fiables. Sustituye por alternativas más recientes / con test privado (MMLU-Pro, GPQA, MUSR, MATH-500).",
1020
+ "contam.advice.medium": "Toma con cautela. Busca replicación sobre subset held-out o reproducciones comunitarias.",
1021
+ "contam.advice.low": "Score probablemente no contaminado, pero ausencia de leak no es prueba — verifica también con test alternativo.",
1022
+ "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks puntuados",
1023
+ "contam.status.empty": "⚠ Introduce una fecha cutoff de entrenamiento (ej. 2023-12).",
1024
+ "contam.status.bad_date": "⚠ Formato de fecha incorrecto. Usa YYYY-MM o YYYY-MM-DD.",
1025
+ "contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks puntuados · {high} de alto riesgo",
1026
+
1027
+ // v0.7 — Sección Help modal
1028
+ "help.v07.title": "🆕 v0.7 — Pack anti-bullshit (4 modos nuevos)",
1029
+ "help.v07.intro": "<em>v0.7 (2026-05-06): cuatro modos nuevos que resuelven problemas concretos reportados por la comunidad HuggingFace. Cada uno corre en tu navegador sin inferencia — pura metadata + matemáticas.</em>",
1030
+ "help.v07.unmask.title": "🪟 Desenmascarador de Contexto",
1031
+ "help.v07.unmask.body": "Detecta cuándo <code>max_position_embeddings</code> es engañoso. Mistral-7B-v0.1 declara 32k pero atiende dentro de ~4-8k vía SWA. Pega un id HF → veredicto en 1 segundo (HONESTO / INFLADO / GRAVEMENTE INFLADO / YARN-EXTENDIDO). Pilla SWA, RoPE-scaling (YaRN/linear/dynamic NTK), d_head pequeño + GQA. <em>Caso de uso</em>: antes de pagar GPU para 32k de contexto, verifica que el modelo realmente atiende tan lejos.",
1032
+ "help.v07.template.title": "📜 Detector de Chat-template",
1033
+ "help.v07.template.body": "Detecta qué familia de chat-template usa un modelo (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) y te da el flag CLI exacto para lm-evaluation-harness, vLLM, y transformers. Resuelve el issue #1841 de lm-eval-harness: olvidar <code>--apply_chat_template</code> divide la accuracy multi-turn por 2 silenciosamente. <em>Caso de uso</em>: antes de reportar un score, confirma que aplicaste el template correctamente.",
1034
+ "help.v07.arena.title": "🎯 Reconstructor Arena-Elo CI",
1035
+ "help.v07.arena.body": "Chatbot Arena oculta los intervalos de confianza en su leaderboard público — una diferencia de 5 Elo puede ser estadísticamente irrelevante. Pega datos crudos de votos pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap de 200 iteraciones → Elos ranked con CIs 95% y un panel de \"empates estadísticos\" listando pares cuyos CIs se solapan. Prueba el botón Cargar sample. <em>Caso de uso</em>: antes de afirmar \"modelo A vence a modelo B\", verifica que sus CIs no se solapen.",
1036
+ "help.v07.contam.title": "🧪 Prior de Contaminación",
1037
+ "help.v07.contam.body": "Prior bayesiano-ish sobre si un score de benchmark está contaminado. Introduce la fecha cutoff de entrenamiento de tu modelo → la herramienta puntúa 20+ benchmarks populares (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) por P(contaminación) según gap temporal, inclusión en corpus y historial de leaks conocidos. Open LLM Leaderboard v1 fue cancelado en 2024 tras la contaminación de MMLU/HellaSwag. <em>Caso de uso</em>: decide qué scores te puedes creer al comparar dos modelos.",
1038
+
1039
+ // v0.7 — Inventory modal 5ª card
1040
+ "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
1041
+ "inv.v07.unmask": "<strong>🪟 Unmask</strong> — ¿config.json declara 32k? Mira si de verdad atiende tan lejos",
1042
+ "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exacto para que lm-eval no divida tu accuracy entre 2 silenciosamente",
1043
+ "inv.v07.arena": "<strong>🎯 Arena CI</strong> — recupera los intervalos de confianza que Chatbot Arena oculta",
1044
+ "inv.v07.contam": "<strong>🧪 Contaminación</strong> — puntúa 20+ benchmarks por probabilidad de contaminación",
1045
  "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
1046
  "share.import_btn": "📂 Cargar JSON compartido",
1047
  "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
 
1134
  "common.no": "No",
1135
 
1136
  // Tooltips de modos
1137
+ "modes.tip": "<strong>Once formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.",
1138
  "profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
1139
  "compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
1140
 
 
1554
  "template.status.invalid_json":"❌ JSON invalide : {error}",
1555
  "template.status.success_paste":"✅ Config collé détecté (verdict : {verdict})",
1556
  "template.pasted_label": "(tokenizer_config collé)",
1557
+
1558
+ // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
1559
+ "modes.arena": "🎯 Arena CI",
1560
+ "mode_desc.arena": "Récupère les intervalles de confiance à partir des données brutes de votes pairwise (MLE Bradley-Terry + bootstrap). Détecte les paires statistiquement à égalité que le leaderboard public d'Arena cache.",
1561
+ "arena.title": "🎯 Reconstructeur Arena-Elo CI",
1562
+ "arena.tip": "Chatbot Arena masque les intervalles de confiance dans le leaderboard public. Un écart de 5 Elo peut être statistiquement insignifiant. Collez les données brutes de votes (model_a, model_b, winner) — l'outil calcule le MLE Bradley-Terry + bootstrap CIs et liste les égalités statistiques (overlap CI).",
1563
+ "arena.desc": "<strong>GPT-4 est-il vraiment meilleur que Claude — ou sont-ils à égalité ?</strong> Collez le CSV de votes pairwise (ou cliquez <em>Charger un échantillon</em>). MLE Bradley-Terry + 200 itérations de bootstrap → Elos classés avec CIs 95% et détection d'égalités statistiques. Tout dans le navigateur.",
1564
+ "arena.sample_btn": "📊 Charger échantillon",
1565
+ "arena.run_btn": "🎯 Calculer CIs",
1566
+ "arena.clear_btn": "🗑️ Effacer",
1567
+ "arena.csv_summary": "CSV de votes (header : <code>model_a,model_b,winner</code> ; winner ∈ a/b/tie)",
1568
+ "arena.section.ranked": "Elos classés avec CIs 95%",
1569
+ "arena.section.ties": "Égalités statistiques (overlap CI)",
1570
+ "arena.section.summary": "Résumé",
1571
+ "arena.col.rank": "#",
1572
+ "arena.col.model": "Modèle",
1573
+ "arena.col.elo": "Elo",
1574
+ "arena.col.ci": "CI 95%",
1575
+ "arena.col.ci_width": "± demi-largeur",
1576
+ "arena.col.matches": "Matchs",
1577
+ "arena.col.wins": "V / D / E",
1578
+ "arena.col.tie_pair": "Paire",
1579
+ "arena.col.tie_diff": "Écart Elo",
1580
+ "arena.col.tie_overlap": "Overlap CI",
1581
+ "arena.no_ties": "Aucune égalité statistique — toutes les paires sont distinguables à 95% CI.",
1582
+ "arena.summary.votes": "Total des votes",
1583
+ "arena.summary.models": "Modèles",
1584
+ "arena.summary.ties": "Égalités statistiques",
1585
+ "arena.summary.bootstrap": "Itérations bootstrap",
1586
+ "arena.summary.ci_level": "Niveau CI",
1587
+ "arena.status.empty": "⚠ Collez un CSV de votes ou cliquez sur Charger échantillon.",
1588
+ "arena.status.too_few": "⚠ Seulement {n} votes valides — il en faut au moins 10 pour ajuster Bradley-Terry de manière fiable.",
1589
+ "arena.status.computing": "⏳ Calcul MLE Bradley-Terry + bootstrap sur {n} votes...",
1590
+ "arena.status.done": "✅ {n} votes · {models} modèles · {ties} égalités statistiques · {ms} ms",
1591
+ "arena.status.sample_loaded": "✅ Échantillon chargé (données Arena synthétiques 6 modèles). Cliquez sur Calculer CIs.",
1592
+
1593
+ // v0.7.3 — anti-bullshit pack #4: Contamination Prior
1594
+ "modes.contam": "🧪 Contamination",
1595
+ "mode_desc.contam": "Prior bayésien-ish sur la contamination d'un score de benchmark. Saisissez le cutoff d'entraînement → note 20+ benchmarks populaires (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
1596
+ "contam.title": "🧪 Prior de Contamination",
1597
+ "contam.tip": "Calcule un prior bayésien-ish indiquant si un score de benchmark est contaminé, basé sur (date de cutoff d'entraînement) × (date de sortie du benchmark) × (inclusion connue dans corpus + historique de leaks). Open LLM Leaderboard v1 a été tué en 2024 après la contamination de MMLU/HellaSwag.",
1598
+ "contam.desc": "<strong>Devez-vous faire confiance au score MMLU de votre modèle ?</strong> Saisissez la date de cutoff d'entraînement — l'outil note 20+ benchmarks populaires (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) et vous dit quels scores sont probablement contaminés.",
1599
+ "contam.cutoff_label": "Cutoff entraînement :",
1600
+ "contam.run_btn": "🧪 Noter tous les benchmarks",
1601
+ "contam.section.ranked": "Priors de contamination par benchmark",
1602
+ "contam.section.high": "🔴 Benchmarks à haut risque (traitez les scores comme non fiables)",
1603
+ "contam.section.medium": "🟡 Risque moyen (vérifiez avec des alternatives)",
1604
+ "contam.section.low": "🟢 Faible risque (probablement propres)",
1605
+ "contam.col.benchmark": "Benchmark",
1606
+ "contam.col.released": "Sorti",
1607
+ "contam.col.gap": "Écart (mois)",
1608
+ "contam.col.prior": "P(contam)",
1609
+ "contam.col.level": "Niveau",
1610
+ "contam.col.corpora": "Dans corpus",
1611
+ "contam.col.category": "Catégorie",
1612
+ "contam.label.high": "Haut risque",
1613
+ "contam.label.medium": "Moyen",
1614
+ "contam.label.low": "Faible",
1615
+ "contam.no_entries": "(aucun dans cette catégorie)",
1616
+ "contam.advice.high": "Traitez ces scores comme non fiables. Remplacez par des alternatives plus récentes / à test privé (MMLU-Pro, GPQA, MUSR, MATH-500).",
1617
+ "contam.advice.medium": "À prendre avec précaution. Cherchez une réplication sur un subset held-out ou des reproductions communautaires.",
1618
+ "contam.advice.low": "Score probablement non contaminé, mais absence de leak n'est pas une preuve — vérifiez avec un test alternatif.",
1619
+ "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks notés",
1620
+ "contam.status.empty": "⚠ Saisissez une date de cutoff d'entraînement (ex. 2023-12).",
1621
+ "contam.status.bad_date": "⚠ Format de date incorrect. Utilisez YYYY-MM ou YYYY-MM-DD.",
1622
+ "contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks notés · {high} à haut risque",
1623
+
1624
+ // v0.7 — Section Help modal
1625
+ "help.v07.title": "🆕 v0.7 — Pack anti-bullshit (4 nouveaux modes)",
1626
+ "help.v07.intro": "<em>v0.7 (2026-05-06) : quatre nouveaux modes qui résolvent des problèmes concrets remontés par la communauté HuggingFace. Chacun tourne dans votre navigateur sans inférence — pure métadonnée + maths.</em>",
1627
+ "help.v07.unmask.title": "🪟 Démasqueur de Contexte",
1628
+ "help.v07.unmask.body": "Détecte quand <code>max_position_embeddings</code> est trompeur. Mistral-7B-v0.1 déclare 32k mais attend dans ~4-8k via SWA. Collez un id HF → verdict en 1 seconde (HONNÊTE / GONFLÉ / GRAVEMENT GONFLÉ / YARN-ÉTENDU). Détecte SWA, RoPE-scaling (YaRN/linear/dynamic NTK), petit d_head + GQA. <em>Cas d'usage</em> : avant de payer un GPU pour 32k de contexte, vérifiez que le modèle attend vraiment aussi loin.",
1629
+ "help.v07.template.title": "📜 Détecteur de Chat-template",
1630
+ "help.v07.template.body": "Détecte la famille de chat-template d'un modèle (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) et donne le flag CLI exact pour lm-evaluation-harness, vLLM, et transformers. Résout l'issue #1841 de lm-eval-harness : oublier <code>--apply_chat_template</code> divise l'accuracy multi-tours par 2 silencieusement. <em>Cas d'usage</em> : avant de reporter un score, confirmez avoir appliqué le template correctement.",
1631
+ "help.v07.arena.title": "🎯 Reconstructeur Arena-Elo CI",
1632
+ "help.v07.arena.body": "Chatbot Arena masque les intervalles de confiance de son leaderboard public — un écart de 5 Elo peut être statistiquement insignifiant. Collez des données brutes de votes pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap 200 itérations → Elos classés avec CIs 95% et un panneau \"égalités statistiques\" listant les paires dont les CIs se chevauchent. Essayez le bouton Charger échantillon. <em>Cas d'usage</em> : avant de déclarer \"modèle A bat modèle B\", vérifiez que leurs CIs ne se chevauchent pas.",
1633
+ "help.v07.contam.title": "🧪 Prior de Contamination",
1634
+ "help.v07.contam.body": "Prior bayésien-ish sur la contamination d'un score de benchmark. Saisissez la date de cutoff d'entraînement de votre modèle → l'outil note 20+ benchmarks populaires (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) par P(contamination) selon l'écart temporel, l'inclusion dans corpus et l'historique de leaks connus. Open LLM Leaderboard v1 a été tué en 2024 après la contamination de MMLU/HellaSwag. <em>Cas d'usage</em> : décidez quels scores croire en comparant deux modèles.",
1635
+
1636
+ // v0.7 — Inventory modal 5ème card
1637
+ "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
1638
+ "inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json annonce 32k ? Voyez s'il attend vraiment aussi loin",
1639
+ "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exact pour que lm-eval ne divise pas votre accuracy par 2 en silence",
1640
+ "inv.v07.arena": "<strong>🎯 Arena CI</strong> — récupère les intervalles de confiance que Chatbot Arena cache",
1641
+ "inv.v07.contam": "<strong>🧪 Contamination</strong> — note 20+ benchmarks par probabilité de contamination",
1642
  "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
1643
  "share.import_btn": "📂 Charger JSON partagé",
1644
  "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
 
1731
  "common.no": "Non",
1732
 
1733
  // Tooltips des modes
1734
+ "modes.tip": "<strong>Onze façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.",
1735
  "profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
1736
  "compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
1737
 
 
2151
  "template.status.invalid_json":"❌ JSON 无效:{error}",
2152
  "template.status.success_paste":"✅ 已检测粘贴的 config(判定:{verdict})",
2153
  "template.pasted_label": "(已粘贴 tokenizer_config)",
2154
+
2155
+ // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
2156
+ "modes.arena": "🎯 Arena CI",
2157
+ "mode_desc.arena": "从原始 pairwise 投票数据中恢复置信区间(Bradley-Terry MLE + bootstrap)。检测公开 Arena 排行榜隐藏的统计上并列对。",
2158
+ "arena.title": "🎯 Arena-Elo CI 重建器",
2159
+ "arena.tip": "Chatbot Arena 在公开排行榜中删除了置信区间。5 Elo 的差距在统计上可能毫无意义。粘贴原始投票数据(model_a, model_b, winner) — 工具计算 Bradley-Terry MLE + bootstrap CI 并列出统计上的并列(CI 重叠)。",
2160
+ "arena.desc": "<strong>GPT-4 真的比 Claude 强吗 — 还是它们打平?</strong> 粘贴 pairwise 投票 CSV(或点击 <em>加载样本</em>)。Bradley-Terry MLE + 200 次 bootstrap → 排序 Elo + 95% CI + 统计并列检测。全部在浏览器中。",
2161
+ "arena.sample_btn": "📊 加载样本数据",
2162
+ "arena.run_btn": "🎯 计算 CIs",
2163
+ "arena.clear_btn": "🗑️ 清空",
2164
+ "arena.csv_summary": "投票 CSV(header:<code>model_a,model_b,winner</code>;winner ∈ a/b/tie)",
2165
+ "arena.section.ranked": "排序 Elo 与 95% CI",
2166
+ "arena.section.ties": "统计并列(CI 重叠)",
2167
+ "arena.section.summary": "摘要",
2168
+ "arena.col.rank": "#",
2169
+ "arena.col.model": "模型",
2170
+ "arena.col.elo": "Elo",
2171
+ "arena.col.ci": "95% CI",
2172
+ "arena.col.ci_width": "± 半宽",
2173
+ "arena.col.matches": "对局",
2174
+ "arena.col.wins": "胜 / 负 / 平",
2175
+ "arena.col.tie_pair": "配对",
2176
+ "arena.col.tie_diff": "Elo 差距",
2177
+ "arena.col.tie_overlap": "CI 重叠",
2178
+ "arena.no_ties": "无统计并列 — 所有配对在 95% CI 下可区分。",
2179
+ "arena.summary.votes": "总投票数",
2180
+ "arena.summary.models": "模型数",
2181
+ "arena.summary.ties": "统计并列",
2182
+ "arena.summary.bootstrap": "Bootstrap 迭代",
2183
+ "arena.summary.ci_level": "CI 水平",
2184
+ "arena.status.empty": "⚠ 粘贴投票 CSV 或点击加载样本。",
2185
+ "arena.status.too_few": "⚠ 仅 {n} 个有效投票 — 需要至少 10 个才能可靠拟合 Bradley-Terry。",
2186
+ "arena.status.computing": "⏳ 在 {n} 个投票上计算 Bradley-Terry MLE + bootstrap...",
2187
+ "arena.status.done": "✅ {n} 投票 · {models} 模型 · {ties} 统计并列 · {ms} ms",
2188
+ "arena.status.sample_loaded": "✅ 样本已加载(合成 6 模型 Arena 数据)。点击计算 CIs。",
2189
+
2190
+ // v0.7.3 — anti-bullshit pack #4: Contamination Prior
2191
+ "modes.contam": "🧪 污染",
2192
+ "mode_desc.contam": "对 benchmark 分数是否被污染做贝叶斯式的先验估计。输入模型训练 cutoff → 评估 20+ 主流 benchmark(MMLU、GSM8K、HumanEval、MMLU-Pro…)。",
2193
+ "contam.title": "🧪 污染先验",
2194
+ "contam.tip": "基于 (模型训练 cutoff 日期) × (benchmark 发布日期) × (已知语料库纳入 + 泄漏历史),对 benchmark 分数是否被污染做贝叶斯式的先验估计。Open LLM Leaderboard v1 在 2024 年因 MMLU/HellaSwag 分数被污染而停用。",
2195
+ "contam.desc": "<strong>你应该相信你模型的 MMLU 分数吗?</strong> 输入模型训练 cutoff 日期 — 工具评估 20+ 主流 benchmark(MMLU、HellaSwag、GSM8K、HumanEval、IFEval、MMLU-Pro、GPQA…)并告诉你哪些分数可能被污染。",
2196
+ "contam.cutoff_label": "训练 cutoff:",
2197
+ "contam.run_btn": "🧪 评估所有 benchmark",
2198
+ "contam.section.ranked": "Benchmark 污染先验",
2199
+ "contam.section.high": "🔴 高风险 benchmark(视分数为不可信)",
2200
+ "contam.section.medium": "🟡 中等风险(用替代品验证)",
2201
+ "contam.section.low": "🟢 低风险(可能干净)",
2202
+ "contam.col.benchmark": "Benchmark",
2203
+ "contam.col.released": "发布",
2204
+ "contam.col.gap": "差距(月)",
2205
+ "contam.col.prior": "P(污染)",
2206
+ "contam.col.level": "等级",
2207
+ "contam.col.corpora": "在语料库",
2208
+ "contam.col.category": "类别",
2209
+ "contam.label.high": "高风险",
2210
+ "contam.label.medium": "中",
2211
+ "contam.label.low": "低",
2212
+ "contam.no_entries": "(此类别中无)",
2213
+ "contam.advice.high": "视这些分数为不可信。用更新 / 私有测试的替代品替换(MMLU-Pro、GPQA、MUSR、MATH-500)。",
2214
+ "contam.advice.medium": "谨慎对待。在 held-out 子集或社区复现上寻找复制。",
2215
+ "contam.advice.low": "分数可能未被污染,但没有泄漏不等于证明 — 仍要用替代测试交叉验证。",
2216
+ "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} 个 benchmark 已评估",
2217
+ "contam.status.empty": "⚠ 输入模型训练 cutoff 日期(例如 2023-12)。",
2218
+ "contam.status.bad_date": "⚠ 日期格式错误。使用 YYYY-MM 或 YYYY-MM-DD。",
2219
+ "contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks 已评估 · {high} 个高风险",
2220
+
2221
+ // v0.7 — Help 模态部分
2222
+ "help.v07.title": "🆕 v0.7 — Anti-bullshit 套件(4 个新模式)",
2223
+ "help.v07.intro": "<em>v0.7(2026-05-06):四个新模式,解决 HuggingFace 社区报告的具体痛点。每个都在浏览器中运行,无推理 — 纯元数据 + 数学。</em>",
2224
+ "help.v07.unmask.title": "🪟 上下文揭示器",
2225
+ "help.v07.unmask.body": "检测 <code>max_position_embeddings</code> 何时具有误导性。Mistral-7B-v0.1 声称 32k 但通过 SWA 实际只在 ~4-8k 内做注意力。粘贴 HF 模型 id → 1 秒判定(诚实 / 夸大 / 严重夸大 / YARN 扩展)。捕获 SWA、RoPE-scaling(YaRN/linear/dynamic NTK)、小 d_head + GQA。<em>用例</em>:在为 32k 上下文付 GPU 钱之前,验证模型是否真的注意那么远。",
2226
+ "help.v07.template.title": "📜 Chat-template 检测器",
2227
+ "help.v07.template.body": "检测模型使用的 chat-template 系列(Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / 自定义 / 无)并给出 lm-evaluation-harness、vLLM、transformers 的精确 CLI flag。解决 lm-eval-harness 的 issue #1841:忘记 <code>--apply_chat_template</code> 会让 multi-turn accuracy 静默对半。<em>用例</em>:报告 benchmark 分数前,确认你正确应用了 template。",
2228
+ "help.v07.arena.title": "🎯 Arena-Elo CI 重建器",
2229
+ "help.v07.arena.body": "Chatbot Arena 在公开排行榜中删除了置信区间 — 5 Elo 的差距在统计上可能毫无意义。粘贴原始 pairwise 投票数据(model_a, model_b, winner)→ Bradley-Terry MLE + 200 次 bootstrap → 排序 Elo + 95% CI + \"统计并列\" 面板,列出 CI 重叠的配对。尝试加载样本按钮。<em>用例</em>:宣称 \"模型 A 胜过模型 B\" 之前,验证它们的 CI 不重叠。",
2230
+ "help.v07.contam.title": "🧪 污染先验",
2231
+ "help.v07.contam.body": "对 benchmark 分数是否被污染做贝叶斯式的先验估计。输入模型训练 cutoff 日期 → 工具按 P(污染) 评估 20+ 主流 benchmark(MMLU、HellaSwag、GSM8K、HumanEval、IFEval、MMLU-Pro、GPQA、AIME、MATH-500、BBH、MUSR…),基于时间差距、语料库纳入和已知泄漏历史。Open LLM Leaderboard v1 在 2024 年因 MMLU/HellaSwag 分数被污染而停用。<em>用例</em>:比较两个模型时决定相信哪些分数。",
2232
+
2233
+ // v0.7 — Inventory 模态第 5 卡
2234
+ "inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
2235
+ "inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json 声称 32k?看它是否真的注意那么远",
2236
+ "inv.v07.template": "<strong>📜 Chat-template</strong> — 精确 CLI flag,让 lm-eval 不会静默对半你的 accuracy",
2237
+ "inv.v07.arena": "<strong>🎯 Arena CI</strong> — 恢复 Chatbot Arena 隐藏的置信区间",
2238
+ "inv.v07.contam": "<strong>🧪 污染</strong> — 按污染概率对 20+ benchmark 评级",
2239
  "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
2240
  "share.import_btn": "📂 加载共享的 JSON",
2241
  "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
 
2328
  "common.no": "否",
2329
 
2330
  // 模式提示
2331
+ "modes.tip": "<strong>十一种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间;检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期,对 20+ benchmark 进行污染概率评估。",
2332
  "profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
2333
  "compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
2334
 
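Note: js/arena_ci.js (+292 lines) is part of this commit but is not reproduced in this view; the strings above and the main.js wiring below only rely on its exported API (parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV) and on the row fields consumed by renderArenaCard(). As a reading aid, a minimal sketch of the Bradley-Terry MM fit, percentile bootstrap, and CI-overlap tie test that those strings describe could look like this (internals and constants are illustrative, not the committed implementation):

```js
// Sketch only: Bradley-Terry strengths via Hunter's MM update, with a tie counted
// as half a win for each side, then mapped onto an Elo-like scale.
// Vote objects follow the { model_a, model_b, winner } shape used by parseVotesCSV.
function fitBradleyTerry(votes, iters = 100) {
  const models = [...new Set(votes.flatMap(v => [v.model_a, v.model_b]))];
  const idx = new Map(models.map((m, i) => [m, i]));
  const n = models.length;
  const wins = Array.from({ length: n }, () => new Float64Array(n)); // wins[i][j] = wins of i over j
  for (const v of votes) {
    const a = idx.get(v.model_a), b = idx.get(v.model_b);
    if (v.winner === "a") wins[a][b] += 1;
    else if (v.winner === "b") wins[b][a] += 1;
    else { wins[a][b] += 0.5; wins[b][a] += 0.5; }            // tie
  }
  let p = new Float64Array(n).fill(1);                         // BT strengths
  for (let it = 0; it < iters; it++) {
    const next = new Float64Array(n);
    for (let i = 0; i < n; i++) {
      let W = 0, denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        const games = wins[i][j] + wins[j][i];
        if (!games) continue;
        W += wins[i][j];
        denom += games / (p[i] + p[j]);                        // MM update: p_i <- W_i / sum_j n_ij/(p_i+p_j)
      }
      next[i] = denom > 0 ? W / denom : p[i];
    }
    const mean = next.reduce((s, x) => s + x, 0) / n;          // strengths are only defined up to a scale
    p = next.map(x => x / mean);
  }
  // Elo-like mapping; the 1000 / 400 constants are a convention, not the committed choice.
  return models.map((m, i) => ({ model: m, elo: 1000 + 400 * Math.log10(Math.max(p[i], 1e-6)) }));
}

// Sketch only: percentile bootstrap. Resample votes with replacement, refit, take quantiles.
function bootstrapCI(votes, { bootstrapN = 200, ciLevel = 0.95 } = {}) {
  const point = fitBradleyTerry(votes);
  const samples = new Map(point.map(r => [r.model, []]));
  for (let b = 0; b < bootstrapN; b++) {
    const resampled = Array.from({ length: votes.length },
      () => votes[Math.floor(Math.random() * votes.length)]);
    for (const r of fitBradleyTerry(resampled)) samples.get(r.model)?.push(r.elo);
  }
  const lo = (1 - ciLevel) / 2;
  return point.map(r => {
    const xs = samples.get(r.model).sort((a, b) => a - b);
    if (!xs.length) return { ...r, ci_low: r.elo, ci_high: r.elo, ci_width: 0 };
    const ci_low = xs[Math.floor(lo * (xs.length - 1))];
    const ci_high = xs[Math.ceil((1 - lo) * (xs.length - 1))];
    return { ...r, ci_low, ci_high, ci_width: ci_high - ci_low };
  }).sort((a, b) => b.elo - a.elo);
}

// Two models whose CIs overlap are reported as a "statistical tie", whatever their Elo gap.
const ciOverlap = (a, b) => Math.min(a.ci_high, b.ci_high) - Math.max(a.ci_low, b.ci_low);
```

The point estimate plus the per-model percentile interval is what renderArenaCard() below consumes as elo / ci_low / ci_high; adjacent ranks with ciOverlap(a, b) > 0 would land in the "statistical ties" panel.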
js/main.js CHANGED
@@ -13,6 +13,8 @@ import { gammaCheckAll, REGIME_META } from "./gamma_check.js";
13
  import { loadLeanManifest, badgeHtml, badgesForUiBinding, renderTheoremTable, getManifest } from "./lean_badges.js";
14
  import { unmaskConfig } from "./swa_unmasker.js";
15
  import { sniffChatTemplate } from "./chat_template_sniffer.js";
 
 
16
 
17
  const TAF_BROWSER_URL = "python/taf_browser.py";
18
  const ENABLE_WEBLLM = true;
@@ -188,7 +190,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
188
  ["ask-section", "recipe-section", "form-section",
189
  "profile-section", "compare-section", "inspector-section",
190
  "diagnose-section", "phase-section", "unmask-section",
191
- "template-section"].forEach(id => {
192
  const el = $(id);
193
  if (el) el.style.display = "none";
194
  });
@@ -197,7 +199,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
197
  ask: "ask-section", recipe: "recipe-section", profile: "profile-section",
198
  compare: "compare-section", inspector: "inspector-section",
199
  diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
200
- template: "template-section",
201
  };
202
  const sectionId = sectionMap[mode];
203
  if (sectionId) $(sectionId).style.display = "";
@@ -739,6 +741,245 @@ $("template-id")?.addEventListener("keydown", (e) => {
739
  if (e.key === "Enter") { e.preventDefault(); runTemplateFromId(); }
740
  });
741
 
742
  function configToPreset(cfg, modelId) {
743
  const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
744
  const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
 
13
  import { loadLeanManifest, badgeHtml, badgesForUiBinding, renderTheoremTable, getManifest } from "./lean_badges.js";
14
  import { unmaskConfig } from "./swa_unmasker.js";
15
  import { sniffChatTemplate } from "./chat_template_sniffer.js";
16
+ import { parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV } from "./arena_ci.js";
17
+ import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";
18
 
19
  const TAF_BROWSER_URL = "python/taf_browser.py";
20
  const ENABLE_WEBLLM = true;
 
190
  ["ask-section", "recipe-section", "form-section",
191
  "profile-section", "compare-section", "inspector-section",
192
  "diagnose-section", "phase-section", "unmask-section",
193
+ "template-section", "arena-section", "contam-section"].forEach(id => {
194
  const el = $(id);
195
  if (el) el.style.display = "none";
196
  });
 
199
  ask: "ask-section", recipe: "recipe-section", profile: "profile-section",
200
  compare: "compare-section", inspector: "inspector-section",
201
  diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
202
+ template: "template-section", arena: "arena-section", contam: "contam-section",
203
  };
204
  const sectionId = sectionMap[mode];
205
  if (sectionId) $(sectionId).style.display = "";
 
741
  if (e.key === "Enter") { e.preventDefault(); runTemplateFromId(); }
742
  });
743
 
744
+ // ════════════════════════════════════════════════════════════════════
745
+ // 🎯 Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3)
746
+ // ════════════════════════════════════════════════════════════════════
747
+
748
+ function renderArenaCard(result) {
749
+ const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
750
+ ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
751
+ const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
752
+
753
+ const titleRanked = t("arena.section.ranked") || "Ranked Elos with 95% CIs";
754
+ const titleTies = t("arena.section.ties") || "Statistical ties (CI overlap)";
755
+ const titleSummary = t("arena.section.summary") || "Summary";
756
+ const colRank = t("arena.col.rank") || "#";
757
+ const colModel = t("arena.col.model") || "Model";
758
+ const colElo = t("arena.col.elo") || "Elo";
759
+ const colCi = t("arena.col.ci") || "95% CI";
760
+ const colSpread = t("arena.col.ci_width") || "CI width";
761
+ const colMatches = t("arena.col.matches") || "Matches";
762
+ const colWins = t("arena.col.wins") || "W / L / T";
763
+ const noTies = t("arena.no_ties") || "No statistical ties — all pairs distinguishable at 95% CI.";
764
+
765
+ // Ranked table
766
+ let tableRows = "";
767
+ for (const r of result.ratings) {
768
+ tableRows += `<tr>
769
+ <td class="arena-rank">#${r.rank}</td>
770
+ <td class="arena-model"><code>${escapeHtml(r.model)}</code></td>
771
+ <td class="arena-elo"><strong>${fmtN(r.elo)}</strong></td>
772
+ <td class="arena-ci">[${fmtN(r.ci_low)}, ${fmtN(r.ci_high)}]</td>
773
+ <td class="arena-spread">±${fmtN(Math.round(r.ci_width / 2 * 10) / 10)}</td>
774
+ <td class="arena-matches">${fmtN(r.matches)}</td>
775
+ <td class="arena-wlt">${fmtN(r.wins)} / ${fmtN(r.losses)} / ${fmtN(r.ties_count)}</td>
776
+ </tr>`;
777
+ }
778
+
779
+ // Ties section
780
+ let tiesHtml = "";
781
+ if (result.ties.length === 0) {
782
+ tiesHtml = `<p class="unmask-reco">${noTies}</p>`;
783
+ } else {
784
+ tiesHtml = `<table class="arena-ties-table">
785
+ <thead><tr>
786
+ <th>${t("arena.col.tie_pair") || "Pair"}</th>
787
+ <th>${t("arena.col.tie_diff") || "Elo gap"}</th>
788
+ <th>${t("arena.col.tie_overlap") || "CI overlap"}</th>
789
+ </tr></thead><tbody>`;
790
+ for (const tieEntry of result.ties) {
791
+ tiesHtml += `<tr>
792
+ <td>#${tieEntry.rank_a} <code>${escapeHtml(tieEntry.model_a)}</code> vs #${tieEntry.rank_b} <code>${escapeHtml(tieEntry.model_b)}</code></td>
793
+ <td>${fmtN(Math.round(tieEntry.elo_diff * 10) / 10)} Elo</td>
794
+ <td>${fmtN(Math.round(tieEntry.overlap_elo * 10) / 10)} Elo</td>
795
+ </tr>`;
796
+ }
797
+ tiesHtml += `</tbody></table>`;
798
+ }
799
+
800
+ // Summary panel
801
+ const s = result.summary;
802
+ const summaryHtml = `
803
+ <ul>
804
+ <li><strong>${t("arena.summary.votes") || "Total votes"}:</strong> ${fmtN(s.total_votes)}</li>
805
+ <li><strong>${t("arena.summary.models") || "Models"}:</strong> ${fmtN(s.n_models)}</li>
806
+ <li><strong>${t("arena.summary.ties") || "Statistical ties"}:</strong> ${fmtN(s.n_ties)}</li>
807
+ <li><strong>${t("arena.summary.bootstrap") || "Bootstrap iters"}:</strong> ${fmtN(s.bootstrap_iters)}</li>
808
+ <li><strong>${t("arena.summary.ci_level") || "CI level"}:</strong> ${(s.ci_level * 100).toFixed(0)}%</li>
809
+ </ul>
810
+ `;
811
+
812
+ return `
813
+ <div class="arena-result">
814
+ <details class="unmask-panel" open>
815
+ <summary class="unmask-panel-title">${titleRanked}</summary>
816
+ <div style="overflow-x:auto;">
817
+ <table class="arena-table">
818
+ <thead><tr>
819
+ <th>${colRank}</th><th>${colModel}</th><th>${colElo}</th>
820
+ <th>${colCi}</th><th>${colSpread}</th>
821
+ <th>${colMatches}</th><th>${colWins}</th>
822
+ </tr></thead>
823
+ <tbody>${tableRows}</tbody>
824
+ </table>
825
+ </div>
826
+ </details>
827
+ <details class="unmask-panel" open>
828
+ <summary class="unmask-panel-title">${titleTies} <span class="arena-tie-count">(${result.ties.length})</span></summary>
829
+ ${tiesHtml}
830
+ </details>
831
+ <details class="unmask-panel">
832
+ <summary class="unmask-panel-title">${titleSummary}</summary>
833
+ ${summaryHtml}
834
+ </details>
835
+ </div>
836
+ `;
837
+ }
838
+
839
+ function runArenaCompute() {
840
+ const csv = ($("arena-csv").value || "").trim();
841
+ if (!csv) {
842
+ $("arena-status").textContent = t("arena.status.empty") || "⚠ Paste vote CSV or click Load sample.";
843
+ return;
844
+ }
845
+ let votes;
846
+ try {
847
+ votes = parseVotesCSV(csv);
848
+ } catch (e) {
849
+ $("arena-status").textContent = `❌ ${e.message}`;
850
+ return;
851
+ }
852
+ if (votes.length < 10) {
853
+ $("arena-status").textContent = tFmt("arena.status.too_few", { n: votes.length });
854
+ return;
855
+ }
856
+ $("arena-status").textContent = tFmt("arena.status.computing", { n: votes.length });
857
+ // Defer to next tick so the status text actually paints before the heavy bootstrap.
858
+ setTimeout(() => {
859
+ const t0 = performance.now();
860
+ const result = computeArenaCI(votes, { bootstrapN: 200, ciLevel: 0.95 });
861
+ const ms = Math.round(performance.now() - t0);
862
+ $("arena-output").innerHTML = renderArenaCard(result);
863
+ $("arena-status").textContent = tFmt("arena.status.done", {
864
+ n: votes.length, models: result.summary.n_models,
865
+ ties: result.summary.n_ties, ms,
866
+ });
867
+ }, 30);
868
+ }
869
+
870
+ $("arena-sample-btn")?.addEventListener("click", () => {
871
+ $("arena-csv").value = SAMPLE_VOTES_CSV;
872
+ $("arena-status").textContent = t("arena.status.sample_loaded") || "✅ Sample loaded. Click Compute CIs.";
873
+ });
874
+ $("arena-run-btn")?.addEventListener("click", runArenaCompute);
875
+ $("arena-clear-btn")?.addEventListener("click", () => {
876
+ $("arena-csv").value = "";
877
+ $("arena-output").innerHTML = "";
878
+ $("arena-status").textContent = "";
879
+ });
880
+
881
+ // ════════════════════════════════════════════════════════════════════
882
+ // 🧪 Contamination Prior (v0.7.3 anti-bullshit pack #4)
883
+ // ════════════════════════════════════════════════════════════════════
884
+
885
+ const CONTAM_LEVEL_COLOR = { high: "#f85149", medium: "#f1c40f", low: "#3fb950" };
886
+
887
+ function renderContamCard(rows, modelCutoff) {
888
+ const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
889
+ ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
890
+
891
+ const titleRanked = t("contam.section.ranked") || "Benchmark contamination priors";
892
+ const titleHigh = t("contam.section.high") || "🔴 High-risk benchmarks (treat scores as unreliable)";
893
+ const titleMed = t("contam.section.medium") || "🟡 Medium-risk (verify with alternates)";
894
+ const titleLow = t("contam.section.low") || "🟢 Low-risk (likely clean)";
895
+ const colBench = t("contam.col.benchmark") || "Benchmark";
896
+ const colReleased = t("contam.col.released") || "Released";
897
+ const colGap = t("contam.col.gap") || "Gap (months)";
898
+ const colPrior = t("contam.col.prior") || "P(contam)";
899
+ const colLevel = t("contam.col.level") || "Level";
900
+ const colCorpora = t("contam.col.corpora") || "In corpora";
901
+ const colCategory = t("contam.col.category") || "Category";
902
+
903
+ const high = rows.filter(r => r.level === "high");
904
+ const medium = rows.filter(r => r.level === "medium");
905
+ const low = rows.filter(r => r.level === "low");
906
+
907
+ function tableFor(group) {
908
+ if (group.length === 0) return `<p class="unmask-reco">${t("contam.no_entries") || "(none in this category)"}</p>`;
909
+ let body = "";
910
+ for (const r of group) {
911
+ body += `<tr>
912
+ <td><strong>${escapeHtml(r.benchmark)}</strong></td>
913
+ <td>${escapeHtml(r.benchmark_released)}</td>
914
+ <td class="arena-spread">${r.gap_months > 0 ? "+" : ""}${r.gap_months}</td>
915
+ <td class="arena-elo" style="color: ${CONTAM_LEVEL_COLOR[r.level]};"><strong>${(r.prior * 100).toFixed(0)}%</strong></td>
916
+ <td>${r.benchmark_in_corpora ? "✓" : "✗"}</td>
917
+ <td class="arena-spread">${escapeHtml(r.benchmark_category)}</td>
918
+ </tr>`;
919
+ }
920
+ return `<table class="arena-table">
921
+ <thead><tr><th>${colBench}</th><th>${colReleased}</th><th>${colGap}</th><th>${colPrior}</th><th>${colCorpora}</th><th>${colCategory}</th></tr></thead>
922
+ <tbody>${body}</tbody></table>`;
923
+ }
924
+
925
+ const adviceHigh = t("contam.advice.high") || "Treat these scores as unreliable. Replace with newer / private-test alternates (MMLU-Pro, GPQA, MUSR, MATH-500).";
926
+ const adviceMedium = t("contam.advice.medium") || "Take with caution. Look for replication on a held-out subset or community reproductions.";
927
+ const adviceLow = t("contam.advice.low") || "Score likely uncontaminated, but absence of leak is not proof — still cross-check with alternate test.";
928
+
929
+ return `
930
+ <div class="arena-result">
931
+ <div class="unmask-hero" style="border-color: #58a6ff;">
932
+ <div class="unmask-verdict" style="color: #58a6ff;">${tFmt("contam.summary.headline", { cutoff: modelCutoff, n: rows.length })}</div>
933
+ <div class="unmask-numbers">
934
+ <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.high}">🔴 ${t("contam.label.high") || "High risk"}</span><span class="unmask-num-val">${high.length}</span></div>
935
+ <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.medium}">🟡 ${t("contam.label.medium") || "Medium"}</span><span class="unmask-num-val">${medium.length}</span></div>
936
+ <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.low}">🟢 ${t("contam.label.low") || "Low"}</span><span class="unmask-num-val">${low.length}</span></div>
937
+ </div>
938
+ </div>
939
+ <div class="unmask-details">
940
+ <details class="unmask-panel" open>
941
+ <summary class="unmask-panel-title">${titleHigh} <span class="arena-tie-count">(${high.length})</span></summary>
942
+ <p class="unmask-reco">${adviceHigh}</p>
943
+ ${tableFor(high)}
944
+ </details>
945
+ <details class="unmask-panel" open>
946
+ <summary class="unmask-panel-title">${titleMed} <span class="arena-tie-count">(${medium.length})</span></summary>
947
+ <p class="unmask-reco">${adviceMedium}</p>
948
+ ${tableFor(medium)}
949
+ </details>
950
+ <details class="unmask-panel">
951
+ <summary class="unmask-panel-title">${titleLow} <span class="arena-tie-count">(${low.length})</span></summary>
952
+ <p class="unmask-reco">${adviceLow}</p>
953
+ ${tableFor(low)}
954
+ </details>
955
+ </div>
956
+ </div>
957
+ `;
958
+ }
959
+
960
+ function runContamCompute() {
961
+ const cutoff = ($("contam-cutoff").value || "").trim();
962
+ if (!cutoff) {
963
+ $("contam-status").textContent = t("contam.status.empty") || "⚠ Enter a model training cutoff date (e.g. 2023-12).";
964
+ return;
965
+ }
966
+ if (!/^\d{4}(-\d{1,2})?(-\d{1,2})?$/.test(cutoff)) {
967
+ $("contam-status").textContent = t("contam.status.bad_date") || "⚠ Bad date format. Use YYYY-MM or YYYY-MM-DD.";
968
+ return;
969
+ }
970
+ const rows = rateAllBenchmarks(cutoff);
971
+ $("contam-output").innerHTML = renderContamCard(rows, cutoff);
972
+ $("contam-status").textContent = tFmt("contam.status.done", {
973
+ cutoff, n: rows.length,
974
+ high: rows.filter(r => r.level === "high").length,
975
+ });
976
+ }
977
+
978
+ $("contam-run-btn")?.addEventListener("click", runContamCompute);
979
+ $("contam-cutoff")?.addEventListener("keydown", (e) => {
980
+ if (e.key === "Enter") { e.preventDefault(); runContamCompute(); }
981
+ });
982
+
983
  function configToPreset(cfg, modelId) {
984
  const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
985
  const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
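Likewise, js/contamination_prior.js (+133 lines) is not reproduced in this view; main.js only imports rateAllBenchmarks and BENCHMARK_DB, and renderContamCard() above fixes the expected row shape (benchmark, benchmark_released, gap_months, prior, level, benchmark_in_corpora, benchmark_category). A minimal sketch of the time-prior × corpus-boost × leak-factor scoring described by the commit, with an illustrative three-entry DB and made-up constants, might look like:

```js
// Sketch only: entry shape, curve shape, boosts and thresholds below are assumptions,
// not the committed BENCHMARK_DB / computeContaminationPrior.
const EXAMPLE_DB = [
  { name: "MMLU",     released: "2020-09", in_corpora: true,  leak_factor: 0.9, category: "knowledge" },
  { name: "GPQA",     released: "2023-11", in_corpora: false, leak_factor: 0.2, category: "reasoning" },
  { name: "MATH-500", released: "2024-11", in_corpora: false, leak_factor: 0.1, category: "math" },
];

// Months of exposure: positive when the benchmark was released before the training cutoff.
const monthsBetween = (cutoff, released) => {
  const [yc, mc = 1] = cutoff.split("-").map(Number);
  const [yr, mr = 1] = released.split("-").map(Number);
  return (yc - yr) * 12 + (mc - mr);
};

function contaminationPriorSketch(cutoff, bench) {
  const gap = monthsBetween(cutoff, bench.released);
  // Time prior: benchmarks released after the cutoff get a floor prior; older ones saturate.
  let prior = gap <= 0 ? 0.05 : Math.min(0.6, 0.1 + 0.02 * gap);
  if (bench.in_corpora) prior = Math.min(0.95, prior + 0.25);   // known inclusion in open pretraining corpora
  prior = Math.min(0.97, prior * (0.5 + bench.leak_factor));    // documented leak history scales the prior
  const level = prior >= 0.6 ? "high" : prior >= 0.3 ? "medium" : "low";
  return {
    benchmark: bench.name, benchmark_released: bench.released,
    gap_months: gap, prior, level,
    benchmark_in_corpora: bench.in_corpora, benchmark_category: bench.category,
  };
}

// Usage: with a 2023-03 cutoff this toy version puts MMLU in the high bucket and the
// post-cutoff benchmarks (GPQA, MATH-500) in low, mirroring the behaviour the status
// strings describe for the committed rateAllBenchmarks.
const rows = EXAMPLE_DB
  .map(b => contaminationPriorSketch("2023-03", b))
  .sort((a, b) => b.prior - a.prior);
```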
style.css CHANGED
@@ -33,6 +33,34 @@
33
  flex: 1;
34
  }
35
 
36
  /* v0.7.1 — Chat-template Sniffer mode */
37
  .template-cmd-block {
38
  display: flex;
 
33
  flex: 1;
34
  }
35
 
36
+ /* v0.7.2 — Arena-Elo CI reconstructor */
37
+ .arena-result { margin-top: 0.6em; }
38
+ .arena-table, .arena-ties-table {
39
+ width: 100%;
40
+ border-collapse: collapse;
41
+ font-size: 0.92em;
42
+ }
43
+ .arena-table th, .arena-ties-table th {
44
+ text-align: left;
45
+ font-weight: 600;
46
+ font-size: 0.78em;
47
+ text-transform: uppercase;
48
+ letter-spacing: 0.04em;
49
+ color: #58a6ff;
50
+ padding: 0.45em 0.6em;
51
+ border-bottom: 1px solid rgba(255, 255, 255, 0.12);
52
+ }
53
+ .arena-table td, .arena-ties-table td {
54
+ padding: 0.45em 0.6em;
55
+ border-bottom: 1px solid rgba(255, 255, 255, 0.04);
56
+ vertical-align: middle;
57
+ }
58
+ .arena-table tr:hover, .arena-ties-table tr:hover { background: rgba(88, 166, 255, 0.04); }
59
+ .arena-rank { color: #8b949e; font-family: monospace; }
60
+ .arena-elo { font-family: monospace; }
61
+ .arena-ci, .arena-spread, .arena-matches, .arena-wlt { font-family: monospace; font-size: 0.9em; opacity: 0.85; }
62
+ .arena-tie-count { font-size: 0.85em; opacity: 0.7; font-weight: normal; }
63
+
64
  /* v0.7.1 — Chat-template Sniffer mode */
65
  .template-cmd-block {
66
  display: flex;