v0.7.2: Arena CI + Contamination Prior + v0.7 Help/Inventory documentation
Closes the v0.7 anti-bullshit pack with two more browser-only modes, plus full Help and Inventory documentation for all 4 features in EN/ES/FR/ZH.
NEW MODES (research-driven, sourced from HF community pain points)
🎯 Arena CI — Arena-Elo CI reconstructor (anti-bullshit #3)
- js/arena_ci.js: parseVotesCSV + Bradley-Terry MM-MLE + 200-iteration bootstrap + statistical-tie detection (CI overlap)
- Solves: Chatbot Arena strips confidence intervals from its public leaderboard. A 5-Elo gap can be statistically meaningless. Tool reconstructs CIs from raw vote data + flags pairs whose CIs overlap.
- Embedded SAMPLE_VOTES_CSV (6 models × 96 votes). One click to demo.
- Sim: 39/39 passed. Adversarial 3-model fixture confirms B↔C correctly identified as tied while A clearly distinguishable.
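- A minimal usage sketch of the new exports (parseVotesCSV, computeArenaCI and SAMPLE_VOTES_CSV are exactly the names added in js/arena_ci.js below; the comments describe the returned shape, not a captured run):

    import { parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV } from "./js/arena_ci.js";

    const votes = parseVotesCSV(SAMPLE_VOTES_CSV);   // 96 {model_a, model_b, winner} records
    const { ratings, ties } = computeArenaCI(votes, { bootstrapN: 200, ciLevel: 0.95 });
    for (const r of ratings) {
      console.log(`#${r.rank} ${r.model}  Elo ${r.elo}  95% CI [${r.ci_low}, ${r.ci_high}]`);
    }
    // Pairs whose CIs overlap: "A beats B" is not statistically supported for these.
    for (const t of ties) {
      console.log(`tie: ${t.model_a} / ${t.model_b} (Elo gap ${t.elo_diff.toFixed(1)})`);
    }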
🧪 Contamination Prior — anti-bullshit #4
- js/contamination_prior.js: BENCHMARK_DB of 20 entries (MMLU/HellaSwag/ARC/TruthfulQA/GSM8K/HumanEval/MBPP/BBH/IFEval/MuSR/GPQA/MATH-500/AIME-24/Winogrande/BoolQ/DROP/TriviaQA/SQuAD/MMLU-Pro/MATH).
- computeContaminationPrior + rateAllBenchmarks. Time-prior curve + corpus boost + leak_factor → P(contam) → high/medium/low buckets.
- Solves: Open LLM Leaderboard v1 was killed in 2024 because MMLU/HellaSwag scores were contaminated. Tool gives users a calibrated prior so they know which scores to trust.
- Sim: 35/35 passed. Llama-3 (2023-03 cutoff) → 13 high-risk, 1 medium, 6 low (post-2023 benchmarks correctly land in low).
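- A minimal sketch of the batch API added in js/contamination_prior.js; the MMLU numbers in the comments just walk the curve and caps defined in that file:

    import { rateAllBenchmarks, computeContaminationPrior } from "./js/contamination_prior.js";

    // MMLU (released 2020-09) vs a 2023-03 cutoff: gap ≈ 30 months → timePrior ≈ 0.78,
    // +0.10 corpus boost, +0.18 leak_factor → raw 1.06, capped at 0.97 → "high".
    const mmlu = computeContaminationPrior("2023-03", "mmlu");
    console.log(mmlu.prior, mmlu.level, mmlu.advice_code); // 0.97 "high" "treat_unreliable"

    const report = rateAllBenchmarks("2023-03");                 // sorted worst-first by prior
    console.log(report.filter(r => r.level === "high").length);  // 13, per the sim note above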
DOCUMENTATION
Help modal: new "🆕 v0.7 — Anti-bullshit pack" section with detailed write-up per feature (problem → solution → use case) in 4 langs.
Inventory modal: new 5th card "🆕 v0.7 anti-bullshit pack" listing all 4 features tersely.
modes.tip: now lists 11 modes including 🪟 Unmask, 📜 Chat-template, 🎯 Arena CI, 🧪 Contamination.
i18n: 533 keys × 4 langs · 0 missing / 0 extra (parity verified across EN/ES/FR/ZH).
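The parity claim is mechanically checkable. A sketch of the kind of check implied, assuming TRANSLATIONS maps each language code to a flat key → string object (the exact shape is not shown in this diff):

    import { TRANSLATIONS } from "./js/i18n.js";

    const langs = Object.keys(TRANSLATIONS);                  // expected: en, es, fr, zh
    const ref = new Set(Object.keys(TRANSLATIONS[langs[0]])); // 533 keys if the claim holds
    for (const lang of langs.slice(1)) {
      const keys = new Set(Object.keys(TRANSLATIONS[lang]));
      const missing = [...ref].filter(k => !keys.has(k));
      const extra = [...keys].filter(k => !ref.has(k));
      console.log(lang, "missing:", missing.length, "extra:", extra.length); // 0 / 0 expected
    }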
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- index.html +71 -0
- js/arena_ci.js +292 -0
- js/contamination_prior.js +133 -0
- js/i18n.js +344 -4
- js/main.js +243 -2
- style.css +28 -0
index.html

@@ -189,6 +189,21 @@
   <li data-i18n="help.add_models.manual"><strong>Manual</strong>: fill the form fields directly with values from the model card.</li>
   </ul>

+  <h3 style="margin-top: 1.5em;" data-i18n="help.v07.title">🆕 v0.7 — Anti-bullshit pack (4 new modes)</h3>
+  <p style="opacity: 0.85;" data-i18n="help.v07.intro"><em>v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.</em></p>
+
+  <p><strong data-i18n="help.v07.unmask.title">🪟 Context Unmasker</strong></p>
+  <p data-i18n="help.v07.unmask.body">Detects when <code>max_position_embeddings</code> is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. <em>Use case</em>: before paying GPU for 32k context, verify the model actually attends that far.</p>
+
+  <p><strong data-i18n="help.v07.template.title">📜 Chat-template Sniffer</strong></p>
+  <p data-i18n="help.v07.template.body">Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting <code>--apply_chat_template</code> silently halves multi-turn accuracy. <em>Use case</em>: before reporting a benchmark score, confirm you applied the template correctly.</p>
+
+  <p><strong data-i18n="help.v07.arena.title">🎯 Arena-Elo CI Reconstructor</strong></p>
+  <p data-i18n="help.v07.arena.body">Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring "model A beats model B", verify their CIs don't overlap.</p>
+
+  <p><strong data-i18n="help.v07.contam.title">🧪 Contamination Prior</strong></p>
+  <p data-i18n="help.v07.contam.body">Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.</p>
+
   <h3 data-i18n="help.audit.title">The audit chain</h3>
   <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
   output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer

@@ -287,6 +302,15 @@
   <li data-i18n="inv.export.registry">Submit to community registry on GitHub</li>
   </ul>
   </details>
+  <details class="inv-card" open>
+    <summary class="inv-card-title" data-i18n="inv.v07.title">🆕 v0.7 anti-bullshit pack</summary>
+    <ul>
+      <li data-i18n="inv.v07.unmask"><strong>🪟 Unmask</strong> — config.json claims 32k? See if it actually attends that far</li>
+      <li data-i18n="inv.v07.template"><strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy</li>
+      <li data-i18n="inv.v07.arena"><strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides</li>
+      <li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
+    </ul>
+  </details>
   </div>

   <details class="arch-supported" open>

@@ -336,6 +360,8 @@
   <button class="mode-btn" data-mode="phase" role="tab" aria-selected="false" data-i18n="modes.phase">📊 Phase diagram</button>
   <button class="mode-btn" data-mode="unmask" role="tab" aria-selected="false" data-i18n="modes.unmask">🪟 Unmask</button>
   <button class="mode-btn" data-mode="template" role="tab" aria-selected="false" data-i18n="modes.template">📜 Chat-template</button>
+  <button class="mode-btn" data-mode="arena" role="tab" aria-selected="false" data-i18n="modes.arena">🎯 Arena CI</button>
+  <button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
   </div>
   <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
   <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),

@@ -681,6 +707,51 @@
   <div id="template-output" style="margin-top: 1em;"></div>
   </section>

+  <!-- Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3) -->
+  <section id="arena-section" style="display:none;">
+    <h2><span data-i18n="arena.title">🎯 Arena-Elo CI Reconstructor</span>
+      <span class="info"><span class="tooltip" data-i18n="arena.tip">
+        Chatbot Arena strips confidence intervals from the public leaderboard.
+        A 5-Elo gap can be statistically meaningless. Paste raw vote data
+        (model_a, model_b, winner) — the tool computes Bradley-Terry MLE +
+        bootstrap CIs and lists statistical ties (CI overlap).
+      </span></span>
+    </h2>
+    <p class="recipe-desc" data-i18n="arena.desc">
+      <strong>Is GPT-4 actually better than Claude — or are they tied?</strong> Paste pairwise vote CSV (or click <em>Load sample</em>). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.
+    </p>
+    <div class="form-row">
+      <button type="button" id="arena-sample-btn" data-i18n="arena.sample_btn">📊 Load sample data</button>
+      <button type="button" id="arena-run-btn" data-i18n="arena.run_btn">🎯 Compute CIs</button>
+      <button type="button" id="arena-clear-btn" class="secondary" data-i18n="arena.clear_btn">🗑️ Clear</button>
+    </div>
+    <p id="arena-status" class="recipe-desc" style="font-size:0.92em;"></p>
+    <details style="margin: 0.6em 0;" open>
+      <summary style="cursor:pointer; font-size:0.92em;" data-i18n="arena.csv_summary">Vote CSV (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)</summary>
+      <textarea id="arena-csv" rows="10" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder="model_a,model_b,winner GPT-4,Claude,a Llama-3,Mixtral,tie ..."></textarea>
+    </details>
+    <div id="arena-output" style="margin-top: 1em;"></div>
+  </section>
+
+  <!-- Contamination prior (v0.7.3 anti-bullshit pack #4) -->
+  <section id="contam-section" style="display:none;">
+    <h2><span data-i18n="contam.title">🧪 Contamination Prior</span>
+      <span class="info"><span class="tooltip" data-i18n="contam.tip">
+        Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.
+      </span></span>
+    </h2>
+    <p class="recipe-desc" data-i18n="contam.desc">
+      <strong>Should you trust your model's MMLU score?</strong> Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.
+    </p>
+    <div class="form-row">
+      <label for="contam-cutoff" data-i18n="contam.cutoff_label">Training cutoff:</label>
+      <input type="text" id="contam-cutoff" placeholder="2023-12 or 2024-01" style="max-width:14em;" />
+      <button type="button" id="contam-run-btn" data-i18n="contam.run_btn">🧪 Rate all benchmarks</button>
+    </div>
+    <p id="contam-status" class="recipe-desc" style="font-size:0.92em;"></p>
+    <div id="contam-output" style="margin-top: 1em;"></div>
+  </section>
+
   <!-- Recipe selector (mode=recipe) -->
   <section id="recipe-section" style="display:none;">
   <h2 data-i18n="recipe.title">📋 Recipe</h2>
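The js/main.js diff (+243 lines) is not shown in this excerpt. A hypothetical sketch of how #arena-run-btn presumably wires this section to js/arena_ci.js; renderArenaResult and the literal status string are illustrative stand-ins, not the actual main.js code:

    import { parseVotesCSV, computeArenaCI } from "./arena_ci.js";

    document.getElementById("arena-run-btn").addEventListener("click", () => {
      const status = document.getElementById("arena-status");
      try {
        const votes = parseVotesCSV(document.getElementById("arena-csv").value);
        const t0 = performance.now();
        const result = computeArenaCI(votes);
        renderArenaResult(document.getElementById("arena-output"), result); // hypothetical renderer
        // The real main.js presumably formats this via the i18n key arena.status.done:
        status.textContent = `✅ ${result.summary.total_votes} votes · ${result.summary.n_models} models · ` +
                             `${result.summary.n_ties} statistical ties · ${Math.round(performance.now() - t0)} ms`;
      } catch (e) {
        status.textContent = `❌ ${e.message}`;
      }
    });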
js/arena_ci.js (new file)

@@ -0,0 +1,292 @@
// Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3)
// Recovers confidence intervals from raw pairwise vote data using
// Bradley-Terry MLE + bootstrap. Chatbot Arena strips CIs from its public
// leaderboard; this lets a user compute them from any vote CSV.
// Pure logic — no human-readable strings. main.js renders via i18n.

// Parse CSV into vote records. Accepts header row + 3 columns:
// model_a, model_b, winner (winner ∈ {a, b, tie, model_a, model_b})
// Tolerates extra whitespace and case-insensitive header matching.
export function parseVotesCSV(text) {
  const lines = text.split(/\r?\n/).map(l => l.trim()).filter(l => l && !l.startsWith("#"));
  if (lines.length < 2) throw new Error("CSV needs at least a header + 1 data row.");
  const header = lines[0].split(",").map(s => s.trim().toLowerCase());

  const colA = header.findIndex(h => h === "model_a" || h === "a" || h === "model a");
  const colB = header.findIndex(h => h === "model_b" || h === "b" || h === "model b");
  const colW = header.findIndex(h => h === "winner" || h === "result" || h === "outcome");
  if (colA < 0 || colB < 0 || colW < 0) {
    throw new Error("Header must include columns: model_a, model_b, winner.");
  }

  const votes = [];
  for (let i = 1; i < lines.length; i++) {
    const row = lines[i].split(",").map(s => s.trim());
    if (row.length < Math.max(colA, colB, colW) + 1) continue;
    const a = row[colA], b = row[colB];
    const w = row[colW].toLowerCase();
    if (!a || !b) continue;
    let winner;
    if (w === "a" || w === "model_a" || w === a.toLowerCase()) winner = "a";
    else if (w === "b" || w === "model_b" || w === b.toLowerCase()) winner = "b";
    else if (w === "tie" || w === "draw" || w === "both" || w === "neither") winner = "tie";
    else continue; // skip unrecognized
    votes.push({ model_a: a, model_b: b, winner });
  }
  return votes;
}

// Bradley-Terry MLE via Minorization-Maximization (Hunter 2004).
// Each iteration: theta_i ← wins_i / Σ_j (matches_ij / (theta_i + theta_j)).
// Ties count as half-win to each side. Returns a Float64Array of theta values
// (positive scale) aligned with the `models` order.
function fitBradleyTerry(votes, models, opts = {}) {
  const { maxIter = 100, tol = 1e-7 } = opts;
  const n = models.length;
  const idx = Object.fromEntries(models.map((m, i) => [m, i]));
  const wins = new Float64Array(n);
  const matches = Array.from({ length: n }, () => new Float64Array(n));

  for (const v of votes) {
    const a = idx[v.model_a], b = idx[v.model_b];
    if (a === undefined || b === undefined) continue;
    matches[a][b] += 1;
    matches[b][a] += 1;
    if (v.winner === "a") wins[a] += 1;
    else if (v.winner === "b") wins[b] += 1;
    else if (v.winner === "tie") { wins[a] += 0.5; wins[b] += 0.5; }
  }

  let theta = new Float64Array(n).fill(1.0);
  for (let iter = 0; iter < maxIter; iter++) {
    const next = new Float64Array(n);
    for (let i = 0; i < n; i++) {
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (i !== j && matches[i][j] > 0) {
          denom += matches[i][j] / (theta[i] + theta[j]);
        }
      }
      const w = wins[i] || 1e-9; // avoid exact zero: a winless model would collapse theta to 0
      next[i] = w / (denom || 1e-9);
    }
    // normalize so geometric mean = 1 → keeps Elo identifiable
    let logSum = 0;
    for (let i = 0; i < n; i++) logSum += Math.log(next[i] || 1e-12);
    const gm = Math.exp(logSum / n);
    for (let i = 0; i < n; i++) next[i] /= gm;
    // convergence check
    let maxDelta = 0;
    for (let i = 0; i < n; i++) maxDelta = Math.max(maxDelta, Math.abs(next[i] - theta[i]));
    theta = next;
    if (maxDelta < tol) break;
  }
  return theta;
}

// Convert BT theta → Elo (anchor: geometric-mean model = 1500).
function thetaToElo(theta) { return Array.from(theta).map(t => 400 * Math.log10(t) + 1500); }

// Bootstrap percentile CIs. Resamples votes with replacement B times,
// refits BT each time, returns {ci_low, ci_high} per model.
function bootstrapCIs(votes, models, opts = {}) {
  const { B = 200, ci = 0.95 } = opts;
  const samples = Array.from({ length: models.length }, () => []);
  const N = votes.length;
  for (let b = 0; b < B; b++) {
    const resample = new Array(N);
    for (let k = 0; k < N; k++) resample[k] = votes[(Math.random() * N) | 0];
    const eloRow = thetaToElo(fitBradleyTerry(resample, models, { maxIter: 50 }));
    for (let i = 0; i < models.length; i++) samples[i].push(eloRow[i]);
  }
  const loIdx = Math.floor((1 - ci) / 2 * B);
  const hiIdx = Math.floor((1 - (1 - ci) / 2) * B);
  return samples.map(s => {
    s.sort((a, b) => a - b);
    return { ci_low: s[loIdx], ci_high: s[Math.min(hiIdx, B - 1)] };
  });
}

// Detect statistical ties: pairs of models whose bootstrap CIs overlap, i.e. the
// higher-ranked model's lower bound falls below the lower-ranked model's upper bound.
function findTies(ratings) {
  const ties = [];
  const sorted = [...ratings].sort((a, b) => b.elo - a.elo);
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      const a = sorted[i], b = sorted[j];
      // CI overlap: a.ci_low <= b.ci_high (a's lower bound below b's upper bound)
      if (a.ci_low <= b.ci_high) {
        const eloDiff = a.elo - b.elo;
        const totalSpread = (a.ci_high - a.ci_low) + (b.ci_high - b.ci_low);
        const overlap = Math.max(0, b.ci_high - a.ci_low);
        ties.push({
          rank_a: i + 1, rank_b: j + 1,
          model_a: a.model, model_b: b.model,
          elo_diff: eloDiff,
          overlap_elo: overlap,
          combined_spread: totalSpread,
        });
      }
    }
  }
  return ties;
}

// Top-level entry. Input = array of {model_a, model_b, winner}.
// Output = ranked ratings + ties + summary.
export function computeArenaCI(votes, opts = {}) {
  if (!Array.isArray(votes) || votes.length === 0) {
    return { ratings: [], ties: [], summary: { total_votes: 0, n_models: 0, n_ties: 0 } };
  }
  const modelSet = new Set();
  for (const v of votes) { modelSet.add(v.model_a); modelSet.add(v.model_b); }
  const models = [...modelSet].sort();

  // Per-model raw counts
  const stats = Object.fromEntries(models.map(m => [m, { wins: 0, losses: 0, ties: 0, matches: 0 }]));
  for (const v of votes) {
    stats[v.model_a].matches++;
    stats[v.model_b].matches++;
    if (v.winner === "a") { stats[v.model_a].wins++; stats[v.model_b].losses++; }
    else if (v.winner === "b") { stats[v.model_b].wins++; stats[v.model_a].losses++; }
    else { stats[v.model_a].ties++; stats[v.model_b].ties++; }
  }

  // Point-estimate Elo
  const theta = fitBradleyTerry(votes, models, { maxIter: 100 });
  const elos = thetaToElo(theta);
  // Bootstrap CIs
  const cis = bootstrapCIs(votes, models, { B: opts.bootstrapN ?? 200, ci: opts.ciLevel ?? 0.95 });

  const ratings = models.map((m, i) => ({
    model: m,
    elo: Math.round(elos[i] * 10) / 10,
    ci_low: Math.round(cis[i].ci_low * 10) / 10,
    ci_high: Math.round(cis[i].ci_high * 10) / 10,
    ci_width: Math.round((cis[i].ci_high - cis[i].ci_low) * 10) / 10,
    matches: stats[m].matches,
    wins: stats[m].wins,
    losses: stats[m].losses,
    ties_count: stats[m].ties,
  })).sort((a, b) => b.elo - a.elo);

  // Recompute ranks after sort
  ratings.forEach((r, i) => { r.rank = i + 1; });

  const ties = findTies(ratings);

  return {
    ratings,
    ties,
    summary: {
      total_votes: votes.length,
      n_models: models.length,
      n_ties: ties.length,
      bootstrap_iters: opts.bootstrapN ?? 200,
      ci_level: opts.ciLevel ?? 0.95,
    },
  };
}

// Embedded sample data so users can demo the tool without their own CSV.
// 6 models, 96 votes, designed so 2 pairs are statistically tied and the
// top model is clearly distinguishable from the bottom.
export const SAMPLE_VOTES_CSV = `# Synthetic Arena-style sample: 6 models, 96 votes.
# True underlying skill (in arbitrary units): GPT-4=1.6, Claude=1.5, Llama-3=1.0, Mixtral=0.95, Gemma=0.6, Phi=0.5
model_a,model_b,winner
GPT-4,Claude,a
Claude,GPT-4,b
GPT-4,Llama-3,a
GPT-4,Llama-3,a
GPT-4,Llama-3,a
GPT-4,Mixtral,a
GPT-4,Mixtral,a
GPT-4,Mixtral,a
GPT-4,Gemma,a
GPT-4,Gemma,a
GPT-4,Gemma,a
GPT-4,Gemma,a
GPT-4,Phi,a
GPT-4,Phi,a
GPT-4,Phi,a
GPT-4,Phi,a
GPT-4,Phi,a
Claude,Llama-3,a
Claude,Llama-3,a
Claude,Llama-3,a
Claude,Mixtral,a
Claude,Mixtral,a
Claude,Mixtral,a
Claude,Gemma,a
Claude,Gemma,a
Claude,Gemma,a
Claude,Phi,a
Claude,Phi,a
Claude,Phi,a
Claude,Phi,a
GPT-4,Claude,tie
Claude,GPT-4,tie
GPT-4,Claude,a
Claude,GPT-4,a
Llama-3,Mixtral,tie
Llama-3,Mixtral,a
Mixtral,Llama-3,a
Llama-3,Mixtral,b
Mixtral,Llama-3,b
Llama-3,Mixtral,tie
Llama-3,Mixtral,a
Mixtral,Llama-3,a
Llama-3,Gemma,a
Llama-3,Gemma,a
Llama-3,Gemma,a
Llama-3,Phi,a
Llama-3,Phi,a
Mixtral,Gemma,a
Mixtral,Gemma,a
Mixtral,Phi,a
Mixtral,Phi,a
Gemma,Phi,tie
Phi,Gemma,tie
Gemma,Phi,a
Phi,Gemma,a
Gemma,Phi,b
Phi,Gemma,b
Gemma,Phi,a
Phi,Gemma,a
GPT-4,Llama-3,b
Claude,Mixtral,b
Llama-3,Phi,a
Llama-3,Gemma,b
Mixtral,Phi,b
Gemma,Phi,a
GPT-4,Mixtral,a
Claude,Llama-3,a
GPT-4,Phi,a
Claude,Gemma,a
GPT-4,Gemma,a
Claude,Phi,a
Llama-3,Mixtral,a
Mixtral,Llama-3,a
GPT-4,Claude,a
Claude,GPT-4,b
GPT-4,Claude,b
Claude,GPT-4,a
GPT-4,Mixtral,a
Claude,Phi,a
Mixtral,Gemma,a
Llama-3,Gemma,a
GPT-4,Llama-3,a
Claude,Mixtral,a
Mixtral,Phi,a
Llama-3,Phi,a
Gemma,Phi,a
Phi,Gemma,b
GPT-4,Gemma,a
Claude,Gemma,a
GPT-4,Phi,a
Claude,Phi,a
Llama-3,Mixtral,b
Mixtral,Llama-3,b
GPT-4,Claude,tie
Llama-3,Mixtral,tie
Gemma,Phi,tie`;
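The theta → Elo mapping in thetaToElo above is the standard Bradley-Terry change of base; a quick sanity check of the scale:

    const elo = t => 400 * Math.log10(t) + 1500;   // same formula as thetaToElo
    console.log(elo(1));   // 1500 (the geometric-mean anchor)
    console.log(elo(2));   // ≈ 1620.4
    // A 10× theta ratio is exactly 400 Elo apart; the BT win probability is then
    // 10/11 ≈ 0.909, matching the classical Elo expected score 1/(1 + 10^(-400/400)).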
js/contamination_prior.js (new file)

@@ -0,0 +1,133 @@
// Contamination Prior (v0.7.3 anti-bullshit pack #4)
// Bayesian-ish prior on whether a benchmark score is contaminated, based on
// (model training cutoff date) × (benchmark release date) × (known leak status).
// Pure logic — no human strings. Open LLM Leaderboard v1 (MMLU/HellaSwag/etc)
// was killed for contamination; this lets a user calibrate trust per score.

// Benchmark database. Each entry tracks release date, whether it's known to
// be in common pretraining corpora (CommonCrawl etc), and a base-rate adjustment
// (incident-driven: confirmed leaks, paraphrased copies in training data, etc).
//
// Sources: arxiv 2404.00699 (contamination survey), HF dataset cards,
// public reproductions / known leak reports.
export const BENCHMARK_DB = {
  // Format: { id, name, released: "YYYY-MM", in_corpora: bool, leak_factor: 0..1, category, paper }
  "mmlu": { id: "mmlu", name: "MMLU", released: "2020-09", in_corpora: true, leak_factor: 0.18, category: "knowledge", paper: "Hendrycks 2020" },
  "mmlu_pro": { id: "mmlu_pro", name: "MMLU-Pro", released: "2024-06", in_corpora: false, leak_factor: 0.05, category: "knowledge", paper: "Wang 2024" },
  "hellaswag": { id: "hellaswag", name: "HellaSwag", released: "2019-05", in_corpora: true, leak_factor: 0.20, category: "commonsense", paper: "Zellers 2019" },
  "arc_challenge": { id: "arc_challenge", name: "ARC Challenge", released: "2018-04", in_corpora: true, leak_factor: 0.15, category: "knowledge", paper: "Clark 2018" },
  "truthfulqa": { id: "truthfulqa", name: "TruthfulQA", released: "2021-09", in_corpora: true, leak_factor: 0.10, category: "truthfulness", paper: "Lin 2021" },
  "gsm8k": { id: "gsm8k", name: "GSM8K", released: "2021-10", in_corpora: true, leak_factor: 0.12, category: "math", paper: "Cobbe 2021" },
  "math": { id: "math", name: "MATH", released: "2021-03", in_corpora: true, leak_factor: 0.10, category: "math", paper: "Hendrycks 2021" },
  "humaneval": { id: "humaneval", name: "HumanEval", released: "2021-07", in_corpora: true, leak_factor: 0.18, category: "code", paper: "Chen 2021" },
  "mbpp": { id: "mbpp", name: "MBPP", released: "2021-08", in_corpora: true, leak_factor: 0.12, category: "code", paper: "Austin 2021" },
  "bbh": { id: "bbh", name: "BIG-Bench Hard (BBH)", released: "2022-10", in_corpora: true, leak_factor: 0.08, category: "reasoning", paper: "Suzgun 2022" },
  "ifeval": { id: "ifeval", name: "IFEval", released: "2023-11", in_corpora: false, leak_factor: 0.05, category: "instruction", paper: "Zhou 2023" },
  "musr": { id: "musr", name: "MuSR", released: "2023-10", in_corpora: false, leak_factor: 0.04, category: "reasoning", paper: "Sprague 2023" },
  "gpqa": { id: "gpqa", name: "GPQA", released: "2023-11", in_corpora: false, leak_factor: 0.04, category: "graduate-knowledge", paper: "Rein 2023" },
  "math500": { id: "math500", name: "MATH-500", released: "2023-11", in_corpora: false, leak_factor: 0.05, category: "math", paper: "Lightman 2023" },
  "aime24": { id: "aime24", name: "AIME 2024", released: "2024-02", in_corpora: false, leak_factor: 0.02, category: "math", paper: "AIME 2024" },
  "winogrande": { id: "winogrande", name: "Winogrande", released: "2019-07", in_corpora: true, leak_factor: 0.15, category: "commonsense", paper: "Sakaguchi 2019" },
  "boolq": { id: "boolq", name: "BoolQ", released: "2019-05", in_corpora: true, leak_factor: 0.15, category: "reading", paper: "Clark 2019" },
  "drop": { id: "drop", name: "DROP", released: "2019-04", in_corpora: true, leak_factor: 0.12, category: "reading", paper: "Dua 2019" },
  "triviaqa": { id: "triviaqa", name: "TriviaQA", released: "2017-05", in_corpora: true, leak_factor: 0.18, category: "knowledge", paper: "Joshi 2017" },
  "squad": { id: "squad", name: "SQuAD", released: "2016-06", in_corpora: true, leak_factor: 0.20, category: "reading", paper: "Rajpurkar 2016" },
};

// Parse "YYYY-MM" or "YYYY-MM-DD" or "YYYY". Returns Date or null.
function parseLooseDate(s) {
  if (!s) return null;
  const m = String(s).trim().match(/^(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?/);
  if (!m) return null;
  const y = parseInt(m[1], 10);
  const mo = m[2] ? Math.max(1, Math.min(12, parseInt(m[2], 10))) : 6;
  const d = m[3] ? Math.max(1, Math.min(28, parseInt(m[3], 10))) : 15;
  return new Date(Date.UTC(y, mo - 1, d));
}

// Time-based base prior. Returns probability that benchmark text was in the
// model's training data given (cutoff - release) gap.
//
// Heuristic curve:
//   gap < 0 (released after cutoff)       → 0.02 (only via leaks)
//   gap 0-3 months                        → 0.10–0.25
//   gap 3-12 months                       → 0.25–0.55
//   gap 12-24 months                      → 0.55–0.75
//   gap > 24 months (heavily reproduced)  → 0.75–0.92
function timePrior(gapMonths) {
  if (gapMonths < 0) return 0.02;
  if (gapMonths === 0) return 0.10;
  if (gapMonths <= 3) return 0.10 + (gapMonths / 3) * 0.15;
  if (gapMonths <= 12) return 0.25 + ((gapMonths - 3) / 9) * 0.30;
  if (gapMonths <= 24) return 0.55 + ((gapMonths - 12) / 12) * 0.20;
  return Math.min(0.92, 0.75 + ((gapMonths - 24) / 36) * 0.17);
}

// Per-benchmark prior: time-prior × in_corpora boost + leak_factor.
// Caps at 0.97 (always some uncertainty).
export function computeContaminationPrior(modelCutoff, benchmarkId) {
  const bench = BENCHMARK_DB[benchmarkId];
  if (!bench) return null;
  const cutoffDate = parseLooseDate(modelCutoff);
  const releaseDate = parseLooseDate(bench.released);
  if (!cutoffDate || !releaseDate) return null;

  const gapMs = cutoffDate.getTime() - releaseDate.getTime();
  const gapMonths = gapMs / (1000 * 60 * 60 * 24 * 30.44);
  const tp = timePrior(gapMonths);
  const corporaBoost = bench.in_corpora ? 0.10 : 0.0;
  const raw = tp + corporaBoost + bench.leak_factor;
  const prior = Math.max(0.01, Math.min(0.97, raw));

  let level;
  if (prior >= 0.65) level = "high";
  else if (prior >= 0.30) level = "medium";
  else level = "low";

  return {
    benchmark: bench.name,
    benchmark_id: bench.id,
    benchmark_released: bench.released,
    benchmark_category: bench.category,
    benchmark_in_corpora: bench.in_corpora,
    benchmark_paper: bench.paper,
    model_cutoff: modelCutoff,
    gap_months: Math.round(gapMonths * 10) / 10,
    time_prior: Math.round(tp * 100) / 100,
    corpora_boost: corporaBoost,
    leak_factor: bench.leak_factor,
    prior: Math.round(prior * 100) / 100,
    level,
    advice_code: level === "high" ? "treat_unreliable" :
                 level === "medium" ? "verify_alternate" : "score_likely_clean",
  };
}

// Batch helper: rate all benchmarks for a given cutoff. Returns array sorted
// by prior descending so the most-contaminated ones surface first.
export function rateAllBenchmarks(modelCutoff) {
  return Object.values(BENCHMARK_DB)
    .map(b => computeContaminationPrior(modelCutoff, b.id))
    .filter(Boolean)
    .sort((a, b) => b.prior - a.prior);
}

// Aggregate verdict for a list of (benchmark_id, reported_score) pairs.
// User pastes their leaderboard scores → tool flags which are likely
// contaminated and which aren't.
export function aggregateScoreSheet(modelCutoff, scoreSheet) {
  const rows = [];
  for (const { benchmark_id, score } of scoreSheet) {
    const p = computeContaminationPrior(modelCutoff, benchmark_id);
    if (p) rows.push({ ...p, reported_score: score });
  }
  rows.sort((a, b) => b.prior - a.prior);
  const counts = { high: 0, medium: 0, low: 0 };
  for (const r of rows) counts[r.level]++;
  return {
    rows,
    counts,
    total: rows.length,
    high_pct: rows.length ? Math.round(counts.high / rows.length * 100) : 0,
  };
}
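aggregateScoreSheet is not exercised by the UI in this diff; a minimal sketch of its input shape (the scores are made-up placeholders, and the priors in the comments follow the curve above for a 2023-03 cutoff):

    import { aggregateScoreSheet } from "./contamination_prior.js";

    const sheet = [
      { benchmark_id: "mmlu", score: 70.1 },   // placeholder score
      { benchmark_id: "gpqa", score: 32.4 },   // placeholder score
    ];
    const { rows, counts, high_pct } = aggregateScoreSheet("2023-03", sheet);
    // mmlu → prior 0.97 ("high"); gpqa released after the cutoff → prior 0.06 ("low")
    console.log(counts, high_pct);   // { high: 1, medium: 0, low: 1 }  50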
|
@@ -224,6 +224,91 @@ export const TRANSLATIONS = {
|
|
| 224 |
"template.status.invalid_json":"❌ Not valid JSON: {error}",
|
| 225 |
"template.status.success_paste":"✅ Sniffed pasted config (verdict: {verdict})",
|
| 226 |
"template.pasted_label": "(pasted tokenizer_config)",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 227 |
"share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
|
| 228 |
"share.import_btn": "📂 Load shared JSON",
|
| 229 |
"synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
|
|
@@ -316,7 +401,7 @@ export const TRANSLATIONS = {
|
|
| 316 |
"common.no": "No",
|
| 317 |
|
| 318 |
// Mode tooltips
|
| 319 |
-
"modes.tip": "<strong>
|
| 320 |
"profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
|
| 321 |
"compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
|
| 322 |
|
|
@@ -872,6 +957,91 @@ export const TRANSLATIONS = {
|
|
| 872 |
"template.status.invalid_json":"❌ JSON inválido: {error}",
|
| 873 |
"template.status.success_paste":"✅ Config pegado detectado (veredicto: {verdict})",
|
| 874 |
"template.pasted_label": "(tokenizer_config pegado)",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 875 |
"share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
|
| 876 |
"share.import_btn": "📂 Cargar JSON compartido",
|
| 877 |
"synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
|
|
@@ -964,7 +1134,7 @@ export const TRANSLATIONS = {
|
|
| 964 |
"common.no": "No",
|
| 965 |
|
| 966 |
// Tooltips de modos
|
| 967 |
-
"modes.tip": "<strong>
|
| 968 |
"profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
|
| 969 |
"compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
|
| 970 |
|
|
@@ -1384,6 +1554,91 @@ export const TRANSLATIONS = {
|
|
| 1384 |
"template.status.invalid_json":"❌ JSON invalide : {error}",
|
| 1385 |
"template.status.success_paste":"✅ Config collé détecté (verdict : {verdict})",
|
| 1386 |
"template.pasted_label": "(tokenizer_config collé)",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1387 |
"share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
|
| 1388 |
"share.import_btn": "📂 Charger JSON partagé",
|
| 1389 |
"synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
|
|
@@ -1476,7 +1731,7 @@ export const TRANSLATIONS = {
|
|
| 1476 |
"common.no": "Non",
|
| 1477 |
|
| 1478 |
// Tooltips des modes
|
| 1479 |
-
"modes.tip": "<strong>
|
| 1480 |
"profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
|
| 1481 |
"compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
|
| 1482 |
|
|
@@ -1896,6 +2151,91 @@ export const TRANSLATIONS = {
|
|
| 1896 |
"template.status.invalid_json":"❌ JSON 无效:{error}",
|
| 1897 |
"template.status.success_paste":"✅ 已检测粘贴的 config(判定:{verdict})",
|
| 1898 |
"template.pasted_label": "(已粘贴 tokenizer_config)",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1899 |
"share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
|
| 1900 |
"share.import_btn": "📂 加载共享的 JSON",
|
| 1901 |
"synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
|
|
@@ -1988,7 +2328,7 @@ export const TRANSLATIONS = {
|
|
| 1988 |
"common.no": "否",
|
| 1989 |
|
| 1990 |
// 模式提示
|
| 1991 |
-
"modes.tip": "<strong>
|
| 1992 |
"profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
|
| 1993 |
"compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
|
| 1994 |
|
|
|
|
| 224 |
"template.status.invalid_json":"❌ Not valid JSON: {error}",
|
| 225 |
"template.status.success_paste":"✅ Sniffed pasted config (verdict: {verdict})",
|
| 226 |
"template.pasted_label": "(pasted tokenizer_config)",
|
| 227 |
+
|
| 228 |
+
// v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
|
| 229 |
+
"modes.arena": "🎯 Arena CI",
|
| 230 |
+
"mode_desc.arena": "Recovers confidence intervals from raw pairwise vote data (Bradley-Terry MLE + bootstrap). Detects statistically tied pairs that the public Arena leaderboard hides.",
|
| 231 |
+
"arena.title": "🎯 Arena-Elo CI Reconstructor",
|
| 232 |
+
"arena.tip": "Chatbot Arena strips confidence intervals from the public leaderboard. A 5-Elo gap can be statistically meaningless. Paste raw vote data (model_a, model_b, winner) — the tool computes Bradley-Terry MLE + bootstrap CIs and lists statistical ties (CI overlap).",
|
| 233 |
+
"arena.desc": "<strong>Is GPT-4 actually better than Claude — or are they tied?</strong> Paste pairwise vote CSV (or click <em>Load sample</em>). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.",
|
| 234 |
+
"arena.sample_btn": "📊 Load sample data",
|
| 235 |
+
"arena.run_btn": "🎯 Compute CIs",
|
| 236 |
+
"arena.clear_btn": "🗑️ Clear",
|
| 237 |
+
"arena.csv_summary": "Vote CSV (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)",
|
| 238 |
+
"arena.section.ranked": "Ranked Elos with 95% CIs",
|
| 239 |
+
"arena.section.ties": "Statistical ties (CI overlap)",
|
| 240 |
+
"arena.section.summary": "Summary",
|
| 241 |
+
"arena.col.rank": "#",
|
| 242 |
+
"arena.col.model": "Model",
|
| 243 |
+
"arena.col.elo": "Elo",
|
| 244 |
+
"arena.col.ci": "95% CI",
|
| 245 |
+
"arena.col.ci_width": "± half-width",
|
| 246 |
+
"arena.col.matches": "Matches",
|
| 247 |
+
"arena.col.wins": "W / L / T",
|
| 248 |
+
"arena.col.tie_pair": "Pair",
|
| 249 |
+
"arena.col.tie_diff": "Elo gap",
|
| 250 |
+
"arena.col.tie_overlap": "CI overlap",
|
| 251 |
+
"arena.no_ties": "No statistical ties — all pairs distinguishable at 95% CI.",
|
| 252 |
+
"arena.summary.votes": "Total votes",
|
| 253 |
+
"arena.summary.models": "Models",
|
| 254 |
+
"arena.summary.ties": "Statistical ties",
|
| 255 |
+
"arena.summary.bootstrap": "Bootstrap iters",
|
| 256 |
+
"arena.summary.ci_level": "CI level",
|
| 257 |
+
"arena.status.empty": "⚠ Paste vote CSV or click Load sample.",
|
| 258 |
+
"arena.status.too_few": "⚠ Only {n} valid votes — need at least 10 to fit Bradley-Terry reliably.",
|
| 259 |
+
"arena.status.computing": "⏳ Computing Bradley-Terry MLE + bootstrap on {n} votes...",
|
| 260 |
+
"arena.status.done": "✅ {n} votes · {models} models · {ties} statistical ties · {ms} ms",
|
| 261 |
+
"arena.status.sample_loaded": "✅ Sample loaded (synthetic 6-model Arena data). Click Compute CIs.",
|
| 262 |
+
|
| 263 |
+
// v0.7.3 — anti-bullshit pack #4: Contamination Prior
|
| 264 |
+
"modes.contam": "🧪 Contamination",
|
| 265 |
+
"mode_desc.contam": "Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff → rates 20+ popular benchmarks (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
|
| 266 |
+
"contam.title": "🧪 Contamination Prior",
|
| 267 |
+
"contam.tip": "Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.",
|
| 268 |
+
"contam.desc": "<strong>Should you trust your model's MMLU score?</strong> Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.",
|
| 269 |
+
"contam.cutoff_label": "Training cutoff:",
|
| 270 |
+
"contam.run_btn": "🧪 Rate all benchmarks",
|
| 271 |
+
"contam.section.ranked": "Benchmark contamination priors",
|
| 272 |
+
"contam.section.high": "🔴 High-risk benchmarks (treat scores as unreliable)",
|
| 273 |
+
"contam.section.medium": "🟡 Medium-risk (verify with alternates)",
|
| 274 |
+
"contam.section.low": "🟢 Low-risk (likely clean)",
|
| 275 |
+
"contam.col.benchmark": "Benchmark",
|
| 276 |
+
"contam.col.released": "Released",
|
| 277 |
+
"contam.col.gap": "Gap (months)",
|
| 278 |
+
"contam.col.prior": "P(contam)",
|
| 279 |
+
"contam.col.level": "Level",
|
| 280 |
+
"contam.col.corpora": "In corpora",
|
| 281 |
+
"contam.col.category": "Category",
|
| 282 |
+
"contam.label.high": "High risk",
|
| 283 |
+
"contam.label.medium": "Medium",
|
| 284 |
+
"contam.label.low": "Low",
|
| 285 |
+
"contam.no_entries": "(none in this category)",
|
| 286 |
+
"contam.advice.high": "Treat these scores as unreliable. Replace with newer / private-test alternates (MMLU-Pro, GPQA, MUSR, MATH-500).",
|
| 287 |
+
"contam.advice.medium": "Take with caution. Look for replication on a held-out subset or community reproductions.",
|
| 288 |
+
"contam.advice.low": "Score likely uncontaminated, but absence of leak is not proof — still cross-check with alternate test.",
|
| 289 |
+
"contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks rated",
|
| 290 |
+
"contam.status.empty": "⚠ Enter a model training cutoff date (e.g. 2023-12).",
|
| 291 |
+
"contam.status.bad_date": "⚠ Bad date format. Use YYYY-MM or YYYY-MM-DD.",
|
| 292 |
+
"contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks rated · {high} high-risk",
|
| 293 |
+
|
| 294 |
+
// v0.7 — Help modal section
|
| 295 |
+
"help.v07.title": "🆕 v0.7 — Anti-bullshit pack (4 new modes)",
|
| 296 |
+
"help.v07.intro": "<em>v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.</em>",
|
| 297 |
+
"help.v07.unmask.title": "🪟 Context Unmasker",
|
| 298 |
+
"help.v07.unmask.body": "Detects when <code>max_position_embeddings</code> is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. <em>Use case</em>: before paying GPU for 32k context, verify the model actually attends that far.",
|
| 299 |
+
"help.v07.template.title": "📜 Chat-template Sniffer",
|
| 300 |
+
"help.v07.template.body": "Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting <code>--apply_chat_template</code> silently halves multi-turn accuracy. <em>Use case</em>: before reporting a benchmark score, confirm you applied the template correctly.",
|
| 301 |
+
"help.v07.arena.title": "🎯 Arena-Elo CI Reconstructor",
|
| 302 |
+
"help.v07.arena.body": "Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a \"statistical ties\" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring \"model A beats model B\", verify their CIs don't overlap.",
|
| 303 |
+
"help.v07.contam.title": "🧪 Contamination Prior",
|
| 304 |
+
"help.v07.contam.body": "Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.",
|
| 305 |
+
|
| 306 |
+
// v0.7 — Inventory modal 5th card
|
| 307 |
+
"inv.v07.title": "🆕 v0.7 anti-bullshit pack",
|
| 308 |
+
"inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json claims 32k? See if it actually attends that far",
|
| 309 |
+
"inv.v07.template": "<strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy",
|
| 310 |
+
"inv.v07.arena": "<strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides",
|
| 311 |
+
"inv.v07.contam": "<strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability",
|
| 312 |
"share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
|
| 313 |
"share.import_btn": "📂 Load shared JSON",
|
| 314 |
"synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
|
|
|
|
| 401 |
"common.no": "No",
|
| 402 |
|
| 403 |
// Mode tooltips
|
| 404 |
+
"modes.tip": "<strong>Eleven ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.",
|
| 405 |
"profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
|
| 406 |
"compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
|
| 407 |
|
|
|
|
   "template.status.invalid_json":"❌ JSON inválido: {error}",
   "template.status.success_paste":"✅ Config pegado detectado (veredicto: {verdict})",
   "template.pasted_label": "(tokenizer_config pegado)",
+
+  // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
+  "modes.arena": "🎯 Arena CI",
+  "mode_desc.arena": "Recupera intervalos de confianza desde datos crudos de votos pairwise (MLE Bradley-Terry + bootstrap). Detecta pares estadísticamente empatados que el leaderboard público de Arena oculta.",
+  "arena.title": "🎯 Reconstructor Arena-Elo CI",
+  "arena.tip": "Chatbot Arena oculta los intervalos de confianza en el leaderboard público. Una diferencia de 5 Elo puede ser estadísticamente irrelevante. Pega datos crudos de votos (model_a, model_b, winner) — la herramienta calcula MLE Bradley-Terry + bootstrap CIs y lista los empates estadísticos (overlap de CI).",
+  "arena.desc": "<strong>¿GPT-4 es realmente mejor que Claude — o están empatados?</strong> Pega CSV de votos pairwise (o click <em>Cargar sample</em>). MLE Bradley-Terry + 200 iteraciones de bootstrap → Elos ranked con CIs 95% y detección de empates estadísticos. Todo en el navegador.",
+  "arena.sample_btn": "📊 Cargar datos sample",
+  "arena.run_btn": "🎯 Calcular CIs",
+  "arena.clear_btn": "🗑️ Limpiar",
+  "arena.csv_summary": "CSV de votos (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)",
+  "arena.section.ranked": "Elos ranked con CIs 95%",
+  "arena.section.ties": "Empates estadísticos (overlap CI)",
+  "arena.section.summary": "Resumen",
+  "arena.col.rank": "#",
+  "arena.col.model": "Modelo",
+  "arena.col.elo": "Elo",
+  "arena.col.ci": "CI 95%",
+  "arena.col.ci_width": "± semi-anchura",
+  "arena.col.matches": "Partidas",
+  "arena.col.wins": "V / D / E",
+  "arena.col.tie_pair": "Par",
+  "arena.col.tie_diff": "Brecha Elo",
+  "arena.col.tie_overlap": "Overlap CI",
+  "arena.no_ties": "Sin empates estadísticos — todos los pares distinguibles al CI 95%.",
+  "arena.summary.votes": "Votos totales",
+  "arena.summary.models": "Modelos",
+  "arena.summary.ties": "Empates estadísticos",
+  "arena.summary.bootstrap": "Iteraciones bootstrap",
+  "arena.summary.ci_level": "Nivel CI",
+  "arena.status.empty": "⚠ Pega un CSV de votos o click en Cargar sample.",
+  "arena.status.too_few": "⚠ Solo {n} votos válidos — se necesitan al menos 10 para ajustar Bradley-Terry de forma fiable.",
+  "arena.status.computing": "⏳ Calculando MLE Bradley-Terry + bootstrap sobre {n} votos...",
+  "arena.status.done": "✅ {n} votos · {models} modelos · {ties} empates estadísticos · {ms} ms",
+  "arena.status.sample_loaded": "✅ Sample cargado (datos sintéticos Arena de 6 modelos). Click en Calcular CIs.",
+
+  // v0.7.3 — anti-bullshit pack #4: Contamination Prior
+  "modes.contam": "🧪 Contaminación",
+  "mode_desc.contam": "Prior bayesiano-ish sobre si un score de benchmark está contaminado. Introduce la fecha de cutoff de entrenamiento → puntúa 20+ benchmarks populares (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
+  "contam.title": "🧪 Prior de Contaminación",
+  "contam.tip": "Calcula un prior bayesiano-ish sobre si un score de benchmark está contaminado, basado en (fecha de cutoff de entrenamiento) × (fecha de release del benchmark) × (inclusión conocida en corpus + historial de leaks). Open LLM Leaderboard v1 fue cancelado en 2024 tras la contaminación de MMLU/HellaSwag.",
+  "contam.desc": "<strong>¿Deberías confiar en el MMLU de tu modelo?</strong> Introduce la fecha cutoff de entrenamiento — la herramienta puntúa 20+ benchmarks populares (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) y te dice qué scores son probablemente contaminados.",
+  "contam.cutoff_label": "Cutoff entrenamiento:",
+  "contam.run_btn": "🧪 Puntuar todos los benchmarks",
+  "contam.section.ranked": "Priors de contaminación por benchmark",
+  "contam.section.high": "🔴 Benchmarks de alto riesgo (trata los scores como no fiables)",
+  "contam.section.medium": "🟡 Riesgo medio (verifica con alternativas)",
+  "contam.section.low": "🟢 Bajo riesgo (probablemente limpios)",
+  "contam.col.benchmark": "Benchmark",
+  "contam.col.released": "Release",
+  "contam.col.gap": "Gap (meses)",
+  "contam.col.prior": "P(contam)",
+  "contam.col.level": "Nivel",
+  "contam.col.corpora": "En corpus",
+  "contam.col.category": "Categoría",
+  "contam.label.high": "Alto riesgo",
+  "contam.label.medium": "Medio",
+  "contam.label.low": "Bajo",
+  "contam.no_entries": "(ninguno en esta categoría)",
+  "contam.advice.high": "Trata estos scores como no fiables. Sustituye por alternativas más recientes / con test privado (MMLU-Pro, GPQA, MUSR, MATH-500).",
+  "contam.advice.medium": "Toma con cautela. Busca replicación sobre subset held-out o reproducciones comunitarias.",
+  "contam.advice.low": "Score probablemente no contaminado, pero ausencia de leak no es prueba — verifica también con test alternativo.",
+  "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks puntuados",
+  "contam.status.empty": "⚠ Introduce una fecha cutoff de entrenamiento (ej. 2023-12).",
+  "contam.status.bad_date": "⚠ Formato de fecha incorrecto. Usa YYYY-MM o YYYY-MM-DD.",
+  "contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks puntuados · {high} de alto riesgo",
+
+  // v0.7 — Sección Help modal
+  "help.v07.title": "🆕 v0.7 — Pack anti-bullshit (4 modos nuevos)",
+  "help.v07.intro": "<em>v0.7 (2026-05-06): cuatro modos nuevos que resuelven problemas concretos reportados por la comunidad HuggingFace. Cada uno corre en tu navegador sin inferencia — pura metadata + matemáticas.</em>",
+  "help.v07.unmask.title": "🪟 Desenmascarador de Contexto",
+  "help.v07.unmask.body": "Detecta cuándo <code>max_position_embeddings</code> es engañoso. Mistral-7B-v0.1 declara 32k pero atiende dentro de ~4-8k vía SWA. Pega un id HF → veredicto en 1 segundo (HONESTO / INFLADO / GRAVEMENTE INFLADO / YARN-EXTENDIDO). Pilla SWA, RoPE-scaling (YaRN/linear/dynamic NTK), d_head pequeño + GQA. <em>Caso de uso</em>: antes de pagar GPU para 32k de contexto, verifica que el modelo realmente atiende tan lejos.",
+  "help.v07.template.title": "📜 Detector de Chat-template",
+  "help.v07.template.body": "Detecta qué familia de chat-template usa un modelo (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) y te da el flag CLI exacto para lm-evaluation-harness, vLLM, y transformers. Resuelve el issue #1841 de lm-eval-harness: olvidar <code>--apply_chat_template</code> divide la accuracy multi-turn por 2 silenciosamente. <em>Caso de uso</em>: antes de reportar un score, confirma que aplicaste el template correctamente.",
+  "help.v07.arena.title": "🎯 Reconstructor Arena-Elo CI",
+  "help.v07.arena.body": "Chatbot Arena oculta los intervalos de confianza en su leaderboard público — una diferencia de 5 Elo puede ser estadísticamente irrelevante. Pega datos crudos de votos pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap de 200 iteraciones → Elos ranked con CIs 95% y un panel de \"empates estadísticos\" listando pares cuyos CIs se solapan. Prueba el botón Cargar sample. <em>Caso de uso</em>: antes de afirmar \"modelo A vence a modelo B\", verifica que sus CIs no se solapen.",
+  "help.v07.contam.title": "🧪 Prior de Contaminación",
+  "help.v07.contam.body": "Prior bayesiano-ish sobre si un score de benchmark está contaminado. Introduce la fecha cutoff de entrenamiento de tu modelo → la herramienta puntúa 20+ benchmarks populares (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) por P(contaminación) según gap temporal, inclusión en corpus y historial de leaks conocidos. Open LLM Leaderboard v1 fue cancelado en 2024 tras la contaminación de MMLU/HellaSwag. <em>Caso de uso</em>: decide qué scores te puedes creer al comparar dos modelos.",
+
+  // v0.7 — Inventory modal 5ª card
+  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
+  "inv.v07.unmask": "<strong>🪟 Unmask</strong> — ¿config.json declara 32k? Mira si de verdad atiende tan lejos",
+  "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exacto para que lm-eval no divida tu accuracy entre 2 silenciosamente",
+  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — recupera los intervalos de confianza que Chatbot Arena oculta",
+  "inv.v07.contam": "<strong>🧪 Contaminación</strong> — puntúa 20+ benchmarks por probabilidad de contaminación",
   "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
   "share.import_btn": "📂 Cargar JSON compartido",
   "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
…
   "common.no": "No",

   // Tooltips de modos
+  "modes.tip": "<strong>Once formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.",
   "profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
   "compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
…
   "template.status.invalid_json":"❌ JSON invalide : {error}",
   "template.status.success_paste":"✅ Config collé détecté (verdict : {verdict})",
   "template.pasted_label": "(tokenizer_config collé)",
+
+  // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
+  "modes.arena": "🎯 Arena CI",
+  "mode_desc.arena": "Récupère les intervalles de confiance à partir des données brutes de votes pairwise (MLE Bradley-Terry + bootstrap). Détecte les paires statistiquement à égalité que le leaderboard public d'Arena cache.",
+  "arena.title": "🎯 Reconstructeur Arena-Elo CI",
+  "arena.tip": "Chatbot Arena masque les intervalles de confiance dans le leaderboard public. Un écart de 5 Elo peut être statistiquement insignifiant. Collez les données brutes de votes (model_a, model_b, winner) — l'outil calcule le MLE Bradley-Terry + bootstrap CIs et liste les égalités statistiques (overlap CI).",
+  "arena.desc": "<strong>GPT-4 est-il vraiment meilleur que Claude — ou sont-ils à égalité ?</strong> Collez le CSV de votes pairwise (ou cliquez <em>Charger un échantillon</em>). MLE Bradley-Terry + 200 itérations de bootstrap → Elos classés avec CIs 95% et détection d'égalités statistiques. Tout dans le navigateur.",
+  "arena.sample_btn": "📊 Charger échantillon",
+  "arena.run_btn": "🎯 Calculer CIs",
+  "arena.clear_btn": "🗑️ Effacer",
+  "arena.csv_summary": "CSV de votes (header : <code>model_a,model_b,winner</code> ; winner ∈ a/b/tie)",
+  "arena.section.ranked": "Elos classés avec CIs 95%",
+  "arena.section.ties": "Égalités statistiques (overlap CI)",
+  "arena.section.summary": "Résumé",
+  "arena.col.rank": "#",
+  "arena.col.model": "Modèle",
+  "arena.col.elo": "Elo",
+  "arena.col.ci": "CI 95%",
+  "arena.col.ci_width": "± demi-largeur",
+  "arena.col.matches": "Matchs",
+  "arena.col.wins": "V / D / E",
+  "arena.col.tie_pair": "Paire",
+  "arena.col.tie_diff": "Écart Elo",
+  "arena.col.tie_overlap": "Overlap CI",
+  "arena.no_ties": "Aucune égalité statistique — toutes les paires sont distinguables à 95% CI.",
+  "arena.summary.votes": "Total des votes",
+  "arena.summary.models": "Modèles",
+  "arena.summary.ties": "Égalités statistiques",
+  "arena.summary.bootstrap": "Itérations bootstrap",
+  "arena.summary.ci_level": "Niveau CI",
+  "arena.status.empty": "⚠ Collez un CSV de votes ou cliquez sur Charger échantillon.",
+  "arena.status.too_few": "⚠ Seulement {n} votes valides — il en faut au moins 10 pour ajuster Bradley-Terry de manière fiable.",
+  "arena.status.computing": "⏳ Calcul MLE Bradley-Terry + bootstrap sur {n} votes...",
+  "arena.status.done": "✅ {n} votes · {models} modèles · {ties} égalités statistiques · {ms} ms",
+  "arena.status.sample_loaded": "✅ Échantillon chargé (données Arena synthétiques 6 modèles). Cliquez sur Calculer CIs.",
+
+  // v0.7.3 — anti-bullshit pack #4: Contamination Prior
+  "modes.contam": "🧪 Contamination",
+  "mode_desc.contam": "Prior bayésien-ish sur la contamination d'un score de benchmark. Saisissez le cutoff d'entraînement → note 20+ benchmarks populaires (MMLU, GSM8K, HumanEval, MMLU-Pro…).",
+  "contam.title": "🧪 Prior de Contamination",
+  "contam.tip": "Calcule un prior bayésien-ish indiquant si un score de benchmark est contaminé, basé sur (date de cutoff d'entraînement) × (date de sortie du benchmark) × (inclusion connue dans corpus + historique de leaks). Open LLM Leaderboard v1 a été tué en 2024 après la contamination de MMLU/HellaSwag.",
+  "contam.desc": "<strong>Devez-vous faire confiance au score MMLU de votre modèle ?</strong> Saisissez la date de cutoff d'entraînement — l'outil note 20+ benchmarks populaires (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) et vous dit quels scores sont probablement contaminés.",
+  "contam.cutoff_label": "Cutoff entraînement :",
+  "contam.run_btn": "🧪 Noter tous les benchmarks",
+  "contam.section.ranked": "Priors de contamination par benchmark",
+  "contam.section.high": "🔴 Benchmarks à haut risque (traitez les scores comme non fiables)",
+  "contam.section.medium": "🟡 Risque moyen (vérifiez avec des alternatives)",
+  "contam.section.low": "🟢 Faible risque (probablement propres)",
+  "contam.col.benchmark": "Benchmark",
+  "contam.col.released": "Sorti",
+  "contam.col.gap": "Écart (mois)",
+  "contam.col.prior": "P(contam)",
+  "contam.col.level": "Niveau",
+  "contam.col.corpora": "Dans corpus",
+  "contam.col.category": "Catégorie",
+  "contam.label.high": "Haut risque",
+  "contam.label.medium": "Moyen",
+  "contam.label.low": "Faible",
+  "contam.no_entries": "(aucun dans cette catégorie)",
+  "contam.advice.high": "Traitez ces scores comme non fiables. Remplacez par des alternatives plus récentes / à test privé (MMLU-Pro, GPQA, MUSR, MATH-500).",
+  "contam.advice.medium": "À prendre avec précaution. Cherchez une réplication sur un subset held-out ou des reproductions communautaires.",
+  "contam.advice.low": "Score probablement non contaminé, mais absence de leak n'est pas une preuve — vérifiez avec un test alternatif.",
+  "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} benchmarks notés",
+  "contam.status.empty": "⚠ Saisissez une date de cutoff d'entraînement (ex. 2023-12).",
+  "contam.status.bad_date": "⚠ Format de date incorrect. Utilisez YYYY-MM ou YYYY-MM-DD.",
+  "contam.status.done": "✅ Cutoff {cutoff} · {n} benchmarks notés · {high} à haut risque",
+
+  // v0.7 — Section Help modal
+  "help.v07.title": "🆕 v0.7 — Pack anti-bullshit (4 nouveaux modes)",
+  "help.v07.intro": "<em>v0.7 (2026-05-06) : quatre nouveaux modes qui résolvent des problèmes concrets remontés par la communauté HuggingFace. Chacun tourne dans votre navigateur sans inférence — pure métadonnée + maths.</em>",
+  "help.v07.unmask.title": "🪟 Démasqueur de Contexte",
+  "help.v07.unmask.body": "Détecte quand <code>max_position_embeddings</code> est trompeur. Mistral-7B-v0.1 déclare 32k mais ne prête attention que sur ~4-8k via SWA. Collez un id HF → verdict en 1 seconde (HONNÊTE / GONFLÉ / GRAVEMENT GONFLÉ / YARN-ÉTENDU). Détecte SWA, RoPE-scaling (YaRN/linear/dynamic NTK), petit d_head + GQA. <em>Cas d'usage</em> : avant de payer un GPU pour 32k de contexte, vérifiez que le modèle prête vraiment attention aussi loin.",
+  "help.v07.template.title": "📜 Détecteur de Chat-template",
+  "help.v07.template.body": "Détecte la famille de chat-template d'un modèle (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) et donne le flag CLI exact pour lm-evaluation-harness, vLLM, et transformers. Résout l'issue #1841 de lm-eval-harness : oublier <code>--apply_chat_template</code> divise l'accuracy multi-tours par 2 silencieusement. <em>Cas d'usage</em> : avant de reporter un score, confirmez avoir appliqué le template correctement.",
+  "help.v07.arena.title": "🎯 Reconstructeur Arena-Elo CI",
+  "help.v07.arena.body": "Chatbot Arena masque les intervalles de confiance de son leaderboard public — un écart de 5 Elo peut être statistiquement insignifiant. Collez des données brutes de votes pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap 200 itérations → Elos classés avec CIs 95% et un panneau \"égalités statistiques\" listant les paires dont les CIs se chevauchent. Essayez le bouton Charger échantillon. <em>Cas d'usage</em> : avant de déclarer \"modèle A bat modèle B\", vérifiez que leurs CIs ne se chevauchent pas.",
+  "help.v07.contam.title": "🧪 Prior de Contamination",
+  "help.v07.contam.body": "Prior bayésien-ish sur la contamination d'un score de benchmark. Saisissez la date de cutoff d'entraînement de votre modèle → l'outil note 20+ benchmarks populaires (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) par P(contamination) selon l'écart temporel, l'inclusion dans corpus et l'historique de leaks connus. Open LLM Leaderboard v1 a été tué en 2024 après la contamination de MMLU/HellaSwag. <em>Cas d'usage</em> : décidez quels scores croire en comparant deux modèles.",
+
+  // v0.7 — Inventory modal 5ème card
+  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
+  "inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json annonce 32k ? Voyez s'il prête vraiment attention aussi loin",
+  "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exact pour que lm-eval ne divise pas votre accuracy par 2 en silence",
+  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — récupère les intervalles de confiance que Chatbot Arena cache",
+  "inv.v07.contam": "<strong>🧪 Contamination</strong> — note 20+ benchmarks par probabilité de contamination",
   "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
   "share.import_btn": "📂 Charger JSON partagé",
   "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
…
   "common.no": "Non",

   // Tooltips des modes
+  "modes.tip": "<strong>Onze façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.",
   "profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
   "compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
…
   "template.status.invalid_json":"❌ JSON 无效:{error}",
   "template.status.success_paste":"✅ 已检测粘贴的 config(判定:{verdict})",
   "template.pasted_label": "(已粘贴 tokenizer_config)",
+
+  // v0.7.2 — anti-bullshit pack #3: Arena-Elo CI reconstructor
+  "modes.arena": "🎯 Arena CI",
+  "mode_desc.arena": "从原始 pairwise 投票数据中恢复置信区间(Bradley-Terry MLE + bootstrap)。检测公开 Arena 排行榜隐藏的统计上并列对。",
+  "arena.title": "🎯 Arena-Elo CI 重建器",
+  "arena.tip": "Chatbot Arena 在公开排行榜中删除了置信区间。5 Elo 的差距在统计上可能毫无意义。粘贴原始投票数据(model_a, model_b, winner) — 工具计算 Bradley-Terry MLE + bootstrap CI 并列出统计上的并列(CI 重叠)。",
+  "arena.desc": "<strong>GPT-4 真的比 Claude 强吗 — 还是它们打平?</strong> 粘贴 pairwise 投票 CSV(或点击 <em>加载样本</em>)。Bradley-Terry MLE + 200 次 bootstrap → 排序 Elo + 95% CI + 统计并列检测。全部在浏览器中。",
+  "arena.sample_btn": "📊 加载样本数据",
+  "arena.run_btn": "🎯 计算 CIs",
+  "arena.clear_btn": "🗑️ 清空",
+  "arena.csv_summary": "投票 CSV(header:<code>model_a,model_b,winner</code>;winner ∈ a/b/tie)",
+  "arena.section.ranked": "排序 Elo 与 95% CI",
+  "arena.section.ties": "统计并列(CI 重叠)",
+  "arena.section.summary": "摘要",
+  "arena.col.rank": "#",
+  "arena.col.model": "模型",
+  "arena.col.elo": "Elo",
+  "arena.col.ci": "95% CI",
+  "arena.col.ci_width": "± 半宽",
+  "arena.col.matches": "对局",
+  "arena.col.wins": "胜 / 负 / 平",
+  "arena.col.tie_pair": "配对",
+  "arena.col.tie_diff": "Elo 差距",
+  "arena.col.tie_overlap": "CI 重叠",
+  "arena.no_ties": "无统计并列 — 所有配对在 95% CI 下可区分。",
+  "arena.summary.votes": "总投票数",
+  "arena.summary.models": "模型数",
+  "arena.summary.ties": "统计并列",
+  "arena.summary.bootstrap": "Bootstrap 迭代",
+  "arena.summary.ci_level": "CI 水平",
+  "arena.status.empty": "⚠ 粘贴投票 CSV 或点击加载样本。",
+  "arena.status.too_few": "⚠ 仅 {n} 个有效投票 — 需要至少 10 个才能可靠拟合 Bradley-Terry。",
+  "arena.status.computing": "⏳ 在 {n} 个投票上计算 Bradley-Terry MLE + bootstrap...",
+  "arena.status.done": "✅ {n} 投票 · {models} 模型 · {ties} 统计并列 · {ms} ms",
+  "arena.status.sample_loaded": "✅ 样本已加载(合成 6 模型 Arena 数据)。点击计算 CIs。",
+
+  // v0.7.3 — anti-bullshit pack #4: Contamination Prior
+  "modes.contam": "🧪 污染",
+  "mode_desc.contam": "对 benchmark 分数是否被污染做贝叶斯式的先验估计。输入模型训练 cutoff → 评估 20+ 主流 benchmark(MMLU、GSM8K、HumanEval、MMLU-Pro…)。",
+  "contam.title": "🧪 污染先验",
+  "contam.tip": "基于 (模型训练 cutoff 日期) × (benchmark 发布日期) × (已知语料库纳入 + 泄漏历史),对 benchmark 分数是否被污染做贝叶斯式的先验估计。Open LLM Leaderboard v1 在 2024 年因 MMLU/HellaSwag 分数被污染而停用。",
+  "contam.desc": "<strong>你应该相信你模型的 MMLU 分数吗?</strong> 输入模型训练 cutoff 日期 — 工具评估 20+ 主流 benchmark(MMLU、HellaSwag、GSM8K、HumanEval、IFEval、MMLU-Pro、GPQA…)并告诉你哪些分数可能被污染。",
+  "contam.cutoff_label": "训练 cutoff:",
+  "contam.run_btn": "🧪 评估所有 benchmark",
+  "contam.section.ranked": "Benchmark 污染先验",
+  "contam.section.high": "🔴 高风险 benchmark(视分数为不可信)",
+  "contam.section.medium": "🟡 中等风险(用替代品验证)",
+  "contam.section.low": "🟢 低风险(可能干净)",
+  "contam.col.benchmark": "Benchmark",
+  "contam.col.released": "发布",
+  "contam.col.gap": "差距(月)",
+  "contam.col.prior": "P(污染)",
+  "contam.col.level": "等级",
+  "contam.col.corpora": "在语料库",
+  "contam.col.category": "类别",
+  "contam.label.high": "高风险",
+  "contam.label.medium": "中",
+  "contam.label.low": "低",
+  "contam.no_entries": "(此类别中无)",
+  "contam.advice.high": "视这些分数为不可信。用更新 / 私有测试的替代品替换(MMLU-Pro、GPQA、MUSR、MATH-500)。",
+  "contam.advice.medium": "谨慎对待。在 held-out 子集或社区复现上寻找复制。",
+  "contam.advice.low": "分数可能未被污染,但没有泄漏不等于证明 — 仍要用替代测试交叉验证。",
+  "contam.summary.headline": "Cutoff <code>{cutoff}</code> · {n} 个 benchmark 已评估",
+  "contam.status.empty": "⚠ 输入模型训练 cutoff 日期(例如 2023-12)。",
+  "contam.status.bad_date": "⚠ 日期格式错误。使用 YYYY-MM 或 YYYY-MM-DD。",
+  "contam.status.done": "✅ Cutoff {cutoff} · {n} 个 benchmark 已评估 · {high} 个高风险",
+
+  // v0.7 — Help 模态部分
+  "help.v07.title": "🆕 v0.7 — Anti-bullshit 套件(4 个新模式)",
+  "help.v07.intro": "<em>v0.7(2026-05-06):四个新模式,解决 HuggingFace 社区报告的具体痛点。每个都在浏览器中运行,无推理 — 纯元数据 + 数学。</em>",
+  "help.v07.unmask.title": "🪟 上下文揭示器",
+  "help.v07.unmask.body": "检测 <code>max_position_embeddings</code> 何时具有误导性。Mistral-7B-v0.1 声称 32k 但通过 SWA 实际只在 ~4-8k 内做注意力。粘贴 HF 模型 id → 1 秒判定(诚实 / 夸大 / 严重夸大 / YARN 扩展)。捕获 SWA、RoPE-scaling(YaRN/linear/dynamic NTK)、小 d_head + GQA。<em>用例</em>:在为 32k 上下文付 GPU 钱之前,验证模型是否真的注意那么远。",
+  "help.v07.template.title": "📜 Chat-template 检测器",
+  "help.v07.template.body": "检测模型使用的 chat-template 系列(Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / 自定义 / 无)并给出 lm-evaluation-harness、vLLM、transformers 的精确 CLI flag。解决 lm-eval-harness 的 issue #1841:忘记 <code>--apply_chat_template</code> 会让 multi-turn accuracy 静默减半。<em>用例</em>:报告 benchmark 分数前,确认你正确应用了 template。",
+  "help.v07.arena.title": "🎯 Arena-Elo CI 重建器",
+  "help.v07.arena.body": "Chatbot Arena 在公开排行榜中删除了置信区间 — 5 Elo 的差距在统计上可能毫无意义。粘贴原始 pairwise 投票数据(model_a, model_b, winner)→ Bradley-Terry MLE + 200 次 bootstrap → 排序 Elo + 95% CI + \"统计并列\" 面板,列出 CI 重叠的配对。尝试加载样本按钮。<em>用例</em>:宣称 \"模型 A 胜过模型 B\" 之前,验证它们的 CI 不重叠。",
+  "help.v07.contam.title": "🧪 污染先验",
+  "help.v07.contam.body": "对 benchmark 分数是否被污染做贝叶斯式的先验估计。输入模型训练 cutoff 日期 → 工具按 P(污染) 评估 20+ 主流 benchmark(MMLU、HellaSwag、GSM8K、HumanEval、IFEval、MMLU-Pro、GPQA、AIME、MATH-500、BBH、MUSR…),基于时间差距、语料库纳入和已知泄漏历史。Open LLM Leaderboard v1 在 2024 年因 MMLU/HellaSwag 分数被污染而停用。<em>用例</em>:比较两个模型时决定相信哪些分数。",
+
+  // v0.7 — Inventory 模态第 5 卡
+  "inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
+  "inv.v07.unmask": "<strong>🪟 Unmask</strong> — config.json 声称 32k?看它是否真的注意那么远",
+  "inv.v07.template": "<strong>📜 Chat-template</strong> — 精确 CLI flag,让 lm-eval 不会静默把你的 accuracy 减半",
+  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — 恢复 Chatbot Arena 隐藏的置信区间",
+  "inv.v07.contam": "<strong>🧪 污染</strong> — 按污染概率对 20+ benchmark 评级",
   "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
   "share.import_btn": "📂 加载共享的 JSON",
   "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
…
   "common.no": "否",

   // 模式提示
+  "modes.tip": "<strong>十一种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个模型的面板在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间;检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期,对 20+ benchmark 进行污染概率评估。",
   "profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
   "compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
…
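The strings above document the Arena CI input contract: a plain CSV with header model_a,model_b,winner and winner ∈ a/b/tie. As a hedged illustration of driving the mode programmatically — the ten vote rows below are invented for the example, not the embedded six-model SAMPLE_VOTES_CSV — the two exports that main.js consumes (see the diff that follows) would be used like this:

// Illustration only — invented rows; format per arena.csv_summary.
import { parseVotesCSV, computeArenaCI } from "./arena_ci.js";

const demoCsv = `model_a,model_b,winner
gpt-4,claude-3,a
claude-3,gpt-4,b
gpt-4,claude-3,tie
claude-3,llama-3-70b,a
gpt-4,llama-3-70b,a
llama-3-70b,claude-3,tie
gpt-4,claude-3,a
claude-3,llama-3-70b,tie
llama-3-70b,gpt-4,b
claude-3,gpt-4,a`;

// main.js (below) refuses to fit Bradley-Terry on fewer than 10 valid votes,
// so a realistic input needs at least that many rows.
const votes = parseVotesCSV(demoCsv);
const { ratings, ties, summary } = computeArenaCI(votes, { bootstrapN: 200, ciLevel: 0.95 });
// Each ratings entry carries rank / elo / ci_low / ci_high; ties lists pairs whose CIs overlap.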
js/main.js

@@ -13,6 +13,8 @@ import { gammaCheckAll, REGIME_META } from "./gamma_check.js";
 import { loadLeanManifest, badgeHtml, badgesForUiBinding, renderTheoremTable, getManifest } from "./lean_badges.js";
 import { unmaskConfig } from "./swa_unmasker.js";
 import { sniffChatTemplate } from "./chat_template_sniffer.js";
+import { parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV } from "./arena_ci.js";
+import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";

 const TAF_BROWSER_URL = "python/taf_browser.py";
 const ENABLE_WEBLLM = true;

@@ -188,7 +190,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
   ["ask-section", "recipe-section", "form-section",
    "profile-section", "compare-section", "inspector-section",
    "diagnose-section", "phase-section", "unmask-section",
-   "template-section"].forEach(id => {
+   "template-section", "arena-section", "contam-section"].forEach(id => {
     const el = $(id);
     if (el) el.style.display = "none";
   });

@@ -197,7 +199,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
   ask: "ask-section", recipe: "recipe-section", profile: "profile-section",
   compare: "compare-section", inspector: "inspector-section",
   diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
-  template: "template-section",
+  template: "template-section", arena: "arena-section", contam: "contam-section",
 };
 const sectionId = sectionMap[mode];
 if (sectionId) $(sectionId).style.display = "";

@@ -739,6 +741,245 @@ $("template-id")?.addEventListener("keydown", (e) => {
   if (e.key === "Enter") { e.preventDefault(); runTemplateFromId(); }
 });

|
| 745 |
+
// 🎯 Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3)
|
| 746 |
+
// ════════════════════════════════════════════════════════════════════
|
| 747 |
+
|
| 748 |
+
function renderArenaCard(result) {
|
| 749 |
+
const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
|
| 750 |
+
({"&":"&","<":"<",">":">",'"':""","'":"'"}[c]));
|
| 751 |
+
const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
|
| 752 |
+
|
| 753 |
+
const titleRanked = t("arena.section.ranked") || "Ranked Elos with 95% CIs";
|
| 754 |
+
const titleTies = t("arena.section.ties") || "Statistical ties (CI overlap)";
|
| 755 |
+
const titleSummary = t("arena.section.summary") || "Summary";
|
| 756 |
+
const colRank = t("arena.col.rank") || "#";
|
| 757 |
+
const colModel = t("arena.col.model") || "Model";
|
| 758 |
+
const colElo = t("arena.col.elo") || "Elo";
|
| 759 |
+
const colCi = t("arena.col.ci") || "95% CI";
|
| 760 |
+
const colSpread = t("arena.col.ci_width") || "CI width";
|
| 761 |
+
const colMatches = t("arena.col.matches") || "Matches";
|
| 762 |
+
const colWins = t("arena.col.wins") || "W / L / T";
|
| 763 |
+
const noTies = t("arena.no_ties") || "No statistical ties — all pairs distinguishable at 95% CI.";
|
| 764 |
+
|
| 765 |
+
// Ranked table
|
| 766 |
+
let tableRows = "";
|
| 767 |
+
for (const r of result.ratings) {
|
| 768 |
+
tableRows += `<tr>
|
| 769 |
+
<td class="arena-rank">#${r.rank}</td>
|
| 770 |
+
<td class="arena-model"><code>${escapeHtml(r.model)}</code></td>
|
| 771 |
+
<td class="arena-elo"><strong>${fmtN(r.elo)}</strong></td>
|
| 772 |
+
<td class="arena-ci">[${fmtN(r.ci_low)}, ${fmtN(r.ci_high)}]</td>
|
| 773 |
+
<td class="arena-spread">±${fmtN(Math.round(r.ci_width / 2 * 10) / 10)}</td>
|
| 774 |
+
<td class="arena-matches">${fmtN(r.matches)}</td>
|
| 775 |
+
<td class="arena-wlt">${fmtN(r.wins)} / ${fmtN(r.losses)} / ${fmtN(r.ties_count)}</td>
|
| 776 |
+
</tr>`;
|
| 777 |
+
}
|
| 778 |
+
|
| 779 |
+
// Ties section
|
| 780 |
+
let tiesHtml = "";
|
| 781 |
+
if (result.ties.length === 0) {
|
| 782 |
+
tiesHtml = `<p class="unmask-reco">${noTies}</p>`;
|
| 783 |
+
} else {
|
| 784 |
+
tiesHtml = `<table class="arena-ties-table">
|
| 785 |
+
<thead><tr>
|
| 786 |
+
<th>${t("arena.col.tie_pair") || "Pair"}</th>
|
| 787 |
+
<th>${t("arena.col.tie_diff") || "Elo gap"}</th>
|
| 788 |
+
<th>${t("arena.col.tie_overlap") || "CI overlap"}</th>
|
| 789 |
+
</tr></thead><tbody>`;
|
| 790 |
+
for (const tieEntry of result.ties) {
|
| 791 |
+
tiesHtml += `<tr>
|
| 792 |
+
<td>#${tieEntry.rank_a} <code>${escapeHtml(tieEntry.model_a)}</code> vs #${tieEntry.rank_b} <code>${escapeHtml(tieEntry.model_b)}</code></td>
|
| 793 |
+
<td>${fmtN(Math.round(tieEntry.elo_diff * 10) / 10)} Elo</td>
|
| 794 |
+
<td>${fmtN(Math.round(tieEntry.overlap_elo * 10) / 10)} Elo</td>
|
| 795 |
+
</tr>`;
|
| 796 |
+
}
|
| 797 |
+
tiesHtml += `</tbody></table>`;
|
| 798 |
+
}
|
| 799 |
+
|
| 800 |
+
// Summary panel
|
| 801 |
+
const s = result.summary;
|
| 802 |
+
const summaryHtml = `
|
| 803 |
+
<ul>
|
| 804 |
+
<li><strong>${t("arena.summary.votes") || "Total votes"}:</strong> ${fmtN(s.total_votes)}</li>
|
| 805 |
+
<li><strong>${t("arena.summary.models") || "Models"}:</strong> ${fmtN(s.n_models)}</li>
|
| 806 |
+
<li><strong>${t("arena.summary.ties") || "Statistical ties"}:</strong> ${fmtN(s.n_ties)}</li>
|
| 807 |
+
<li><strong>${t("arena.summary.bootstrap") || "Bootstrap iters"}:</strong> ${fmtN(s.bootstrap_iters)}</li>
|
| 808 |
+
<li><strong>${t("arena.summary.ci_level") || "CI level"}:</strong> ${(s.ci_level * 100).toFixed(0)}%</li>
|
| 809 |
+
</ul>
|
| 810 |
+
`;
|
| 811 |
+
|
| 812 |
+
return `
|
| 813 |
+
<div class="arena-result">
|
| 814 |
+
<details class="unmask-panel" open>
|
| 815 |
+
<summary class="unmask-panel-title">${titleRanked}</summary>
|
| 816 |
+
<div style="overflow-x:auto;">
|
| 817 |
+
<table class="arena-table">
|
| 818 |
+
<thead><tr>
|
| 819 |
+
<th>${colRank}</th><th>${colModel}</th><th>${colElo}</th>
|
| 820 |
+
<th>${colCi}</th><th>${colSpread}</th>
|
| 821 |
+
<th>${colMatches}</th><th>${colWins}</th>
|
| 822 |
+
</tr></thead>
|
| 823 |
+
<tbody>${tableRows}</tbody>
|
| 824 |
+
</table>
|
| 825 |
+
</div>
|
| 826 |
+
</details>
|
| 827 |
+
<details class="unmask-panel" open>
|
| 828 |
+
<summary class="unmask-panel-title">${titleTies} <span class="arena-tie-count">(${result.ties.length})</span></summary>
|
| 829 |
+
${tiesHtml}
|
| 830 |
+
</details>
|
| 831 |
+
<details class="unmask-panel">
|
| 832 |
+
<summary class="unmask-panel-title">${titleSummary}</summary>
|
| 833 |
+
${summaryHtml}
|
| 834 |
+
</details>
|
| 835 |
+
</div>
|
| 836 |
+
`;
|
| 837 |
+
}
|
| 838 |
+
|
| 839 |
+
function runArenaCompute() {
|
| 840 |
+
const csv = ($("arena-csv").value || "").trim();
|
| 841 |
+
if (!csv) {
|
| 842 |
+
$("arena-status").textContent = t("arena.status.empty") || "⚠ Paste vote CSV or click Load sample.";
|
| 843 |
+
return;
|
| 844 |
+
}
|
| 845 |
+
let votes;
|
| 846 |
+
try {
|
| 847 |
+
votes = parseVotesCSV(csv);
|
| 848 |
+
} catch (e) {
|
| 849 |
+
$("arena-status").textContent = `❌ ${e.message}`;
|
| 850 |
+
return;
|
| 851 |
+
}
|
| 852 |
+
if (votes.length < 10) {
|
| 853 |
+
$("arena-status").textContent = tFmt("arena.status.too_few", { n: votes.length });
|
| 854 |
+
return;
|
| 855 |
+
}
|
| 856 |
+
$("arena-status").textContent = tFmt("arena.status.computing", { n: votes.length });
|
| 857 |
+
// Defer to next tick so the status text actually paints before the heavy bootstrap.
|
| 858 |
+
setTimeout(() => {
|
| 859 |
+
const t0 = performance.now();
|
| 860 |
+
const result = computeArenaCI(votes, { bootstrapN: 200, ciLevel: 0.95 });
|
| 861 |
+
const ms = Math.round(performance.now() - t0);
|
| 862 |
+
$("arena-output").innerHTML = renderArenaCard(result);
|
| 863 |
+
$("arena-status").textContent = tFmt("arena.status.done", {
|
| 864 |
+
n: votes.length, models: result.summary.n_models,
|
| 865 |
+
ties: result.summary.n_ties, ms,
|
| 866 |
+
});
|
| 867 |
+
}, 30);
|
| 868 |
+
}
|
| 869 |
+
|
| 870 |
+
$("arena-sample-btn")?.addEventListener("click", () => {
|
| 871 |
+
$("arena-csv").value = SAMPLE_VOTES_CSV;
|
| 872 |
+
$("arena-status").textContent = t("arena.status.sample_loaded") || "✅ Sample loaded. Click Compute CIs.";
|
| 873 |
+
});
|
| 874 |
+
$("arena-run-btn")?.addEventListener("click", runArenaCompute);
|
| 875 |
+
$("arena-clear-btn")?.addEventListener("click", () => {
|
| 876 |
+
$("arena-csv").value = "";
|
| 877 |
+
$("arena-output").innerHTML = "";
|
| 878 |
+
$("arena-status").textContent = "";
|
| 879 |
+
});
|
| 880 |
+
|
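js/arena_ci.js itself is not part of this hunk, so the statistics behind computeArenaCI deserve a sketch. Below is one plausible shape of the Bradley-Terry fit (via the standard minorization-maximization update) and the CI-overlap tie rule the UI strings describe. All names, the half-win tie convention, and the 1000-Elo anchor are illustrative assumptions, not the module's actual internals:

// Sketch only — hypothetical names. votes: [{ a, b, winner }], winner ∈ "a" | "b" | "tie".
function fitBradleyTerry(votes, iters = 100) {
  const models = [...new Set(votes.flatMap(v => [v.a, v.b]))];
  const key = (i, j) => `${i}|${j}`;
  const wins = new Map();                                  // wins.get("i|j") = wins of i over j
  const bump = (i, j, w) => wins.set(key(i, j), (wins.get(key(i, j)) || 0) + w);
  for (const v of votes) {
    if (v.winner === "a") bump(v.a, v.b, 1);
    else if (v.winner === "b") bump(v.b, v.a, 1);
    else { bump(v.a, v.b, 0.5); bump(v.b, v.a, 0.5); }     // one common tie convention
  }
  const p = new Map(models.map(m => [m, 1]));              // latent strengths
  for (let it = 0; it < iters; it++) {
    for (const i of models) {
      let W = 0, denom = 0;                                // MM update: p_i ← W_i / Σ_j n_ij/(p_i+p_j)
      for (const j of models) {
        if (j === i) continue;
        const wij = wins.get(key(i, j)) || 0, wji = wins.get(key(j, i)) || 0;
        if (wij + wji === 0) continue;
        W += wij;
        denom += (wij + wji) / (p.get(i) + p.get(j));
      }
      if (denom > 0) p.set(i, W / denom);
    }
    const mean = models.reduce((acc, m) => acc + p.get(m), 0) / models.length;
    for (const m of models) p.set(m, p.get(m) / mean);     // fix the scale each sweep
  }
  // Elo-like scale; the 1000 anchor and 400/log10 slope are conventional guesses here.
  return new Map(models.map(m => [m, 1000 + 400 * Math.log10(p.get(m))]));
}

// The bootstrap refits on resampled vote sets (200 times per the UI) and takes percentile
// bands per model; two models are a "statistical tie" when those bands overlap:
const ciOverlap = (ra, rb) => Math.min(ra.ci_high, rb.ci_high) - Math.max(ra.ci_low, rb.ci_low);
// tie ⇔ ciOverlap(ra, rb) > 0 — the positive overlap (in Elo) is what the ties table reports.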
+// ════════════════════════════════════════════════════════════════════
+// 🧪 Contamination Prior (v0.7.3 anti-bullshit pack #4)
+// ════════════════════════════════════════════════════════════════════
+
+const CONTAM_LEVEL_COLOR = { high: "#f85149", medium: "#f1c40f", low: "#3fb950" };
+
+function renderContamCard(rows, modelCutoff) {
+  const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+    ({ "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" }[c]));
+
+  const titleRanked = t("contam.section.ranked") || "Benchmark contamination priors";
+  const titleHigh = t("contam.section.high") || "🔴 High-risk benchmarks (treat scores as unreliable)";
+  const titleMed  = t("contam.section.medium") || "🟡 Medium-risk (verify with alternates)";
+  const titleLow  = t("contam.section.low") || "🟢 Low-risk (likely clean)";
+  const colBench    = t("contam.col.benchmark") || "Benchmark";
+  const colReleased = t("contam.col.released") || "Released";
+  const colGap      = t("contam.col.gap") || "Gap (months)";
+  const colPrior    = t("contam.col.prior") || "P(contam)";
+  const colLevel    = t("contam.col.level") || "Level";
+  const colCorpora  = t("contam.col.corpora") || "In corpora";
+  const colCategory = t("contam.col.category") || "Category";
+
+  const high   = rows.filter(r => r.level === "high");
+  const medium = rows.filter(r => r.level === "medium");
+  const low    = rows.filter(r => r.level === "low");
+
+  function tableFor(group) {
+    if (group.length === 0) return `<p class="unmask-reco">${t("contam.no_entries") || "(none in this category)"}</p>`;
+    let body = "";
+    for (const r of group) {
+      body += `<tr>
+        <td><strong>${escapeHtml(r.benchmark)}</strong></td>
+        <td>${escapeHtml(r.benchmark_released)}</td>
+        <td class="arena-spread">${r.gap_months > 0 ? "+" : ""}${r.gap_months}</td>
+        <td class="arena-elo" style="color: ${CONTAM_LEVEL_COLOR[r.level]};"><strong>${(r.prior * 100).toFixed(0)}%</strong></td>
+        <td>${r.benchmark_in_corpora ? "✓" : "✗"}</td>
+        <td class="arena-spread">${escapeHtml(r.benchmark_category)}</td>
+      </tr>`;
+    }
+    return `<table class="arena-table">
+      <thead><tr><th>${colBench}</th><th>${colReleased}</th><th>${colGap}</th><th>${colPrior}</th><th>${colCorpora}</th><th>${colCategory}</th></tr></thead>
+      <tbody>${body}</tbody></table>`;
+  }
+
+  const adviceHigh   = t("contam.advice.high") || "Treat these scores as unreliable. Replace with newer / private-test alternates (MMLU-Pro, GPQA, MUSR, MATH-500).";
+  const adviceMedium = t("contam.advice.medium") || "Take with caution. Look for replication on a held-out subset or community reproductions.";
+  const adviceLow    = t("contam.advice.low") || "Score likely uncontaminated, but absence of leak is not proof — still cross-check with alternate test.";
+
+  return `
+    <div class="arena-result">
+      <div class="unmask-hero" style="border-color: #58a6ff;">
+        <div class="unmask-verdict" style="color: #58a6ff;">${tFmt("contam.summary.headline", { cutoff: modelCutoff, n: rows.length })}</div>
+        <div class="unmask-numbers">
+          <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.high}">🔴 ${t("contam.label.high") || "High risk"}</span><span class="unmask-num-val">${high.length}</span></div>
+          <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.medium}">🟡 ${t("contam.label.medium") || "Medium"}</span><span class="unmask-num-val">${medium.length}</span></div>
+          <div><span class="unmask-num-label" style="color:${CONTAM_LEVEL_COLOR.low}">🟢 ${t("contam.label.low") || "Low"}</span><span class="unmask-num-val">${low.length}</span></div>
+        </div>
+      </div>
+      <div class="unmask-details">
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${titleHigh} <span class="arena-tie-count">(${high.length})</span></summary>
+          <p class="unmask-reco">${adviceHigh}</p>
+          ${tableFor(high)}
+        </details>
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${titleMed} <span class="arena-tie-count">(${medium.length})</span></summary>
+          <p class="unmask-reco">${adviceMedium}</p>
+          ${tableFor(medium)}
+        </details>
+        <details class="unmask-panel">
+          <summary class="unmask-panel-title">${titleLow} <span class="arena-tie-count">(${low.length})</span></summary>
+          <p class="unmask-reco">${adviceLow}</p>
+          ${tableFor(low)}
+        </details>
+      </div>
+    </div>
+  `;
+}
+
+function runContamCompute() {
+  const cutoff = ($("contam-cutoff").value || "").trim();
+  if (!cutoff) {
+    $("contam-status").textContent = t("contam.status.empty") || "⚠ Enter a model training cutoff date (e.g. 2023-12).";
+    return;
+  }
+  if (!/^\d{4}(-\d{1,2})?(-\d{1,2})?$/.test(cutoff)) {
+    $("contam-status").textContent = t("contam.status.bad_date") || "⚠ Bad date format. Use YYYY-MM or YYYY-MM-DD.";
+    return;
+  }
+  const rows = rateAllBenchmarks(cutoff);
+  $("contam-output").innerHTML = renderContamCard(rows, cutoff);
+  $("contam-status").textContent = tFmt("contam.status.done", {
+    cutoff, n: rows.length,
+    high: rows.filter(r => r.level === "high").length,
+  });
+}
+
+$("contam-run-btn")?.addEventListener("click", runContamCompute);
+$("contam-cutoff")?.addEventListener("keydown", (e) => {
+  if (e.key === "Enter") { e.preventDefault(); runContamCompute(); }
+});
+
 function configToPreset(cfg, modelId) {
   const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
   const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
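rateAllBenchmarks, called in runContamCompute above, lives in js/contamination_prior.js, which is likewise not shown in this view. Going by the tooltip copy — (training cutoff) × (benchmark release) × (corpus inclusion + leak history) — one plausible per-benchmark scoring is sketched below; the logistic curve, boost factors, bucket thresholds, and field names are invented for illustration, not the shipped constants:

// Hypothetical sketch. entry: { released: "YYYY-MM", inCorpora: boolean, leakHistory: 0..1 }.
function contaminationPrior(cutoff, entry) {
  const toMonths = (d) => { const [y, m = "6"] = d.split("-"); return Number(y) * 12 + Number(m); };
  const gapMonths = toMonths(cutoff) - toMonths(entry.released); // > 0 ⇒ benchmark predates the cutoff
  // Time prior: a benchmark released after the cutoff cannot be in the training set;
  // older ones are increasingly likely to have been scraped. The logistic shape is a guess.
  let prior = gapMonths <= 0 ? 0.02 : 1 / (1 + Math.exp(-(gapMonths - 6) / 6));
  if (entry.inCorpora) prior = Math.min(1, prior * 1.5);         // corpus-inclusion boost (invented factor)
  prior = Math.min(1, prior * (0.5 + entry.leakHistory));        // scale by known leak history (invented)
  const level = prior >= 0.6 ? "high" : prior >= 0.3 ? "medium" : "low"; // bucket thresholds (invented)
  return { prior, level, gap_months: gapMonths };
}
// A benchmark released years before the cutoff lands in "high"; one released after the
// cutoff sits at the 0.02 floor and buckets as "low" — matching the UI's three sections.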
style.css

@@ -33,6 +33,34 @@
   flex: 1;
 }

+/* v0.7.2 — Arena-Elo CI reconstructor */
+.arena-result { margin-top: 0.6em; }
+.arena-table, .arena-ties-table {
+  width: 100%;
+  border-collapse: collapse;
+  font-size: 0.92em;
+}
+.arena-table th, .arena-ties-table th {
+  text-align: left;
+  font-weight: 600;
+  font-size: 0.78em;
+  text-transform: uppercase;
+  letter-spacing: 0.04em;
+  color: #58a6ff;
+  padding: 0.45em 0.6em;
+  border-bottom: 1px solid rgba(255, 255, 255, 0.12);
+}
+.arena-table td, .arena-ties-table td {
+  padding: 0.45em 0.6em;
+  border-bottom: 1px solid rgba(255, 255, 255, 0.04);
+  vertical-align: middle;
+}
+.arena-table tr:hover, .arena-ties-table tr:hover { background: rgba(88, 166, 255, 0.04); }
+.arena-rank { color: #8b949e; font-family: monospace; }
+.arena-elo { font-family: monospace; }
+.arena-ci, .arena-spread, .arena-matches, .arena-wlt { font-family: monospace; font-size: 0.9em; opacity: 0.85; }
+.arena-tie-count { font-size: 0.85em; opacity: 0.7; font-weight: normal; }
+
 /* v0.7.1 — Chat-template Sniffer mode */
 .template-cmd-block {
   display: flex;