karlexmarin and Claude Opus 4.7 (1M context) committed
Commit 7bc5d7c · 1 parent: 3dbfebb

v0.7.3: Quant-regime classifier (anti-bullshit pack #5)

NEW MODE: ⚖️ Quant — predicts γ-shift and ΔPPL for any (model × quant scheme) combination, architecture-aware.

Solves: the HF community widely reports unpredictable quantization cliffs. NF4 might lose 2 PPL on Phi-3 but be fine on Llama-3-8B. Generic claims like "AWQ ~95% retention" are too vague — TAF gives an architecture-specific verdict using d_head, GQA ratio, the SWA flag, and model size.

NEW
- js/quant_regime.js: pure logic. QUANT_SCHEMES table covering 10 schemes (FP8 / int8 / Q8_0 / Q5_K_M / AWQ / GPTQ / Q4_K_M / NF4 / Q3_K_M / Q2_K). predictQuantShift() returns the γ-shift (base penalty × architecture multiplier), a ΔPPL band, a regime band (safe / mild / significant / cliff), and a concrete recommendation code.
- predictAllSchemes() ranks all 10 schemes for a given architecture so user sees the full trade-off table.
- HF Hub auto-fetch + paste-config fallback. Two output modes: single (one scheme + breakdown) and compare-all (sorted table).
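The shape of this classifier can be sketched roughly as follows. Scheme names mirror the commit, but the base penalties, multiplier weights, and band thresholds below are illustrative placeholders, not the values shipped in js/quant_regime.js:

```javascript
// Rough sketch of the quant-regime logic — NOT the shipped table.
// Base penalties, weights, and thresholds here are illustrative guesses.
const QUANT_SCHEMES = {
  "Q8_0":   { bits: 8, basePenalty: 0.008, calibrated: false },
  "AWQ":    { bits: 4, basePenalty: 0.020, calibrated: true  },
  "NF4":    { bits: 4, basePenalty: 0.060, calibrated: false },
  "Q3_K_M": { bits: 3, basePenalty: 0.120, calibrated: false },
};

// Small d_head, aggressive GQA, and tiny models amplify the base penalty.
function archMultiplier({ dHead, gqaRatio, paramsB }) {
  let m = 1.0;
  if (dHead < 96)    m *= 1.3;  // fewer dims per head → less quantization slack
  if (gqaRatio >= 4) m *= 1.15; // shared KV heads → less redundancy
  if (paramsB < 1)   m *= 1.5;  // small models sit closer to the cliff
  return m;
}

function predictQuantShift(arch, schemeName) {
  const s = QUANT_SCHEMES[schemeName];
  const gammaShift = s.basePenalty * archMultiplier(arch);
  const regime =
    gammaShift < 0.012 ? "safe" :
    gammaShift < 0.040 ? "mild" :
    gammaShift < 0.080 ? "significant" : "cliff";
  return { scheme: schemeName, gammaShift, regime };
}

// Rank every scheme for one architecture, safest first.
function predictAllSchemes(arch) {
  return Object.keys(QUANT_SCHEMES)
    .map((name) => predictQuantShift(arch, name))
    .sort((a, b) => a.gammaShift - b.gammaShift);
}
```

With an 8B-class config (dHead 128, gqaRatio 4) these placeholder numbers happen to put AWQ in the mild band and Q3_K_M in the cliff band, qualitatively matching the simulation below; the real table carries ten schemes and the ΔPPL bands as well.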

VIRTUAL SIMULATION
- Llama-3-8B + AWQ → mild (γ=0.022); + NF4 → significant (γ=0.076); + Q8_0 → safe (γ=0.009).
- Phi-3-mini (small d_head) + NF4 → cliff (γ=0.085) with reco "switch to AWQ".
- Pythia-160m + Q3_K_M → cliff (γ=0.185) with reco "switch to Q4_K_M".
- Mistral-7B trade-off table: FP8/Q8_0/int8 → safe; Q5_K_M/AWQ/GPTQ → mild; Q4_K_M/NF4 → significant; Q3_K_M/Q2_K → cliff.

DOCUMENTATION
- Help modal: new v0.7 quant section (4 langs) with problem/solution/use case.
- Inventory modal v0.7 card: new "⚖️ Quant" entry (4 langs).
- modes.tip: now lists 12 modes in 4 langs.
- 583 i18n keys × 4 langs · 0 missing / 0 extra (50 new quant.* keys per lang).
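The 0-missing / 0-extra invariant is cheap to enforce mechanically. A check along these lines would do it, assuming (hypothetically — the actual export shape in js/i18n.js may differ) that TRANSLATIONS is keyed by language code:

```javascript
// Verify every language table exposes exactly the same i18n key set.
// Assumed shape: { en: {...}, es: {...}, fr: {...}, zh: {...} }.
function checkKeyParity(translations, reference = "en") {
  const refKeys = new Set(Object.keys(translations[reference]));
  const report = {};
  for (const [lang, table] of Object.entries(translations)) {
    if (lang === reference) continue;
    const keys = new Set(Object.keys(table));
    report[lang] = {
      missing: [...refKeys].filter((k) => !keys.has(k)),  // in en, absent here
      extra:   [...keys].filter((k) => !refKeys.has(k)),  // here, absent in en
    };
  }
  return report;
}
```

Running this against the real TRANSLATIONS object and asserting every `missing`/`extra` array is empty reproduces the "0 missing / 0 extra" claim as a smoke test.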

38/38 smoke tests passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (4)
  1. index.html +36 -0
  2. js/i18n.js +212 -4
  3. js/main.js +176 -1
  4. js/quant_regime.js +147 -0
index.html CHANGED
@@ -204,6 +204,9 @@
   <p><strong data-i18n="help.v07.contam.title">🧪 Contamination Prior</strong></p>
   <p data-i18n="help.v07.contam.body">Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.</p>

+  <p><strong data-i18n="help.v07.quant.title">⚖️ Quant-regime Classifier</strong></p>
+  <p data-i18n="help.v07.quant.body">Predicts γ-shift and ΔPPL for any (model × quant scheme: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8, …). Architecture-aware: small d_head + aggressive GQA → more sensitive; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4). Recommends safer alternatives if a cliff is detected. <em>Use case</em>: before quantizing, predict whether your specific architecture × scheme combo will keep PPL acceptable, with a concrete switch-to suggestion otherwise.</p>
+
   <h3 data-i18n="help.audit.title">The audit chain</h3>
   <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
   output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
@@ -309,6 +312,7 @@
   <li data-i18n="inv.v07.template"><strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy</li>
   <li data-i18n="inv.v07.arena"><strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides</li>
   <li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
+  <li data-i18n="inv.v07.quant"><strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo</li>
   </ul>
   </details>
   </div>
@@ -362,6 +366,7 @@
   <button class="mode-btn" data-mode="template" role="tab" aria-selected="false" data-i18n="modes.template">📜 Chat-template</button>
   <button class="mode-btn" data-mode="arena" role="tab" aria-selected="false" data-i18n="modes.arena">🎯 Arena CI</button>
   <button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
+  <button class="mode-btn" data-mode="quant" role="tab" aria-selected="false" data-i18n="modes.quant">⚖️ Quant</button>
   </div>
   <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
   <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
@@ -752,6 +757,37 @@
   <div id="contam-output" style="margin-top: 1em;"></div>
   </section>

+  <!-- Quant-regime classifier (v0.7.3 anti-bullshit pack #5) -->
+  <section id="quant-section" style="display:none;">
+  <h2><span data-i18n="quant.title">⚖️ Quant-regime Classifier</span>
+  <span class="info"><span class="tooltip" data-i18n="quant.tip">
+  Predicts γ-shift (and downstream ΔPPL) for a given (model × quant scheme).
+  Generic claims like "AWQ ~95% retention" are too vague — TAF uses
+  d_head, GQA ratio, SWA flag, and model size to give an architecture-specific
+  verdict. Solves: HF community widely reports unpredictable quant cliffs
+  (NF4 -2 PPL on Phi-3 but fine on Llama-3-8B).
+  </span></span>
+  </h2>
+  <p class="recipe-desc" data-i18n="quant.desc">
+  <strong>Will quantizing your model break it?</strong> Paste an HF model id, pick a quant scheme — get predicted γ-shift, expected ΔPPL band, and a recommended alternative if it's a cliff. Browser-only, no GPU, no calibration set required.
+  </p>
+  <div class="form-row">
+  <label for="quant-id" data-i18n="quant.id_label">HF model id:</label>
+  <input type="text" id="quant-id" placeholder="e.g. meta-llama/Llama-3.2-1B" />
+  <button type="button" id="quant-fetch-btn" data-i18n="quant.fetch_btn">📥 Fetch config</button>
+  </div>
+  <div class="form-row">
+  <label for="quant-scheme" data-i18n="quant.scheme_label">Quant scheme:</label>
+  <select id="quant-scheme">
+  <option value="">— select scheme —</option>
+  </select>
+  <button type="button" id="quant-run-btn" data-i18n="quant.run_btn">⚖️ Predict</button>
+  <button type="button" id="quant-all-btn" class="secondary" data-i18n="quant.all_btn">📊 Compare all schemes</button>
+  </div>
+  <p id="quant-status" class="recipe-desc" style="font-size:0.92em;"></p>
+  <div id="quant-output" style="margin-top: 1em;"></div>
+  </section>
+
   <!-- Recipe selector (mode=recipe) -->
   <section id="recipe-section" style="display:none;">
   <h2 data-i18n="recipe.title">📋 Recipe</h2>
js/i18n.js CHANGED
@@ -302,6 +302,8 @@ export const TRANSLATIONS = {
   "help.v07.arena.body": "Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a \"statistical ties\" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring \"model A beats model B\", verify their CIs don't overlap.",
   "help.v07.contam.title": "🧪 Contamination Prior",
   "help.v07.contam.body": "Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.",
+ "help.v07.quant.title": "⚖️ Quant-regime Classifier",
+ "help.v07.quant.body": "Predicts γ-shift and ΔPPL for any (model × quant scheme: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8, …). Architecture-aware: small d_head + aggressive GQA → more sensitive; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4). Recommends safer alternatives if a cliff is detected. <em>Use case</em>: before quantizing, predict whether your specific architecture × scheme combo will keep PPL acceptable, with a concrete switch-to suggestion otherwise.",

   // v0.7 — Inventory modal 5th card
   "inv.v07.title": "🆕 v0.7 anti-bullshit pack",
@@ -309,6 +311,56 @@ export const TRANSLATIONS = {
   "inv.v07.template": "<strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy",
   "inv.v07.arena": "<strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides",
   "inv.v07.contam": "<strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability",
+ "inv.v07.quant": "<strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo",
+
+ // v0.7.3 — anti-bullshit pack #5: Quant-regime classifier
+ "modes.quant": "⚖️ Quant",
+ "mode_desc.quant": "Predicts γ-shift and ΔPPL for any (model × quant scheme). Architecture-aware: small d_head + GQA → more sensitive. Recommends safer alternatives if a cliff is detected.",
+ "quant.title": "⚖️ Quant-regime Classifier",
+ "quant.tip": "Predicts γ-shift (and downstream ΔPPL) for a given (model × quant scheme). Generic claims like 'AWQ ~95% retention' are too vague — TAF uses d_head, GQA ratio, SWA flag, and model size to give an architecture-specific verdict. Solves: HF community widely reports unpredictable quant cliffs (NF4 -2 PPL on Phi-3 but fine on Llama-3-8B).",
+ "quant.desc": "<strong>Will quantizing your model break it?</strong> Paste an HF model id, pick a quant scheme — get predicted γ-shift, expected ΔPPL band, and a recommended alternative if it's a cliff. Browser-only, no GPU, no calibration set required.",
+ "quant.id_label": "HF model id:",
+ "quant.fetch_btn": "📥 Fetch config",
+ "quant.scheme_label": "Quant scheme:",
+ "quant.run_btn": "⚖️ Predict",
+ "quant.all_btn": "📊 Compare all schemes",
+ "quant.regime.safe": "✅ SAFE",
+ "quant.regime.mild": "✅ MILD COMPRESSION",
+ "quant.regime.significant": "⚠ SIGNIFICANT DEGRADATION",
+ "quant.regime.cliff": "❌ HEAVY CLIFF",
+ "quant.label.gamma_shift": "γ shift",
+ "quant.label.delta_ppl": "ΔPPL (est.)",
+ "quant.label.arch_mult": "Arch multiplier",
+ "quant.section.breakdown": "Breakdown",
+ "quant.section.reco": "Recommendation",
+ "quant.section.compare": "All schemes (sorted by safety)",
+ "quant.field.scheme": "Scheme",
+ "quant.field.calibrated": "calibrated",
+ "quant.field.uncalibrated": "uncalibrated",
+ "quant.field.base_penalty": "Base penalty",
+ "quant.field.arch_mult_full": "Architecture multiplier",
+ "quant.field.gamma_shift": "Predicted γ shift",
+ "quant.field.ppl_band": "ΔPPL band (est.)",
+ "quant.field.params": "Parameters",
+ "quant.col.scheme": "Scheme",
+ "quant.col.bits": "Bits",
+ "quant.col.gamma_shift": "γ shift",
+ "quant.col.ppl_band": "ΔPPL band",
+ "quant.col.regime": "Regime",
+ "quant.reco.switch_to_awq": "<strong>Switch to {scheme}</strong> — calibrated 4-bit handles small d_head + GQA much better than NF4. Expected ΔPPL drops ~2-3×.",
+ "quant.reco.switch_to_q5_km": "<strong>Switch to {scheme}</strong> — Q5 keeps more head dimensions intact at low cost (only ~25% bigger file).",
+ "quant.reco.switch_to_q4_km": "<strong>Switch to {scheme}</strong> — Q3/Q2 are too aggressive for this architecture.",
+ "quant.reco.consider_awq": "<strong>Consider {scheme}</strong> — calibration meaningfully reduces γ-shift on this architecture.",
+ "quant.reco.use_higher_bits": "<strong>Use higher-bit alternative</strong> — this architecture cannot absorb 4-bit cleanly. Try 5- or 8-bit.",
+ "quant.reco.verify_with_eval": "<strong>Verify with a real eval</strong> — predicted shift is borderline. Run NIAH at your target context before deploying.",
+ "quant.reco.no_action": "No action needed — quantization is safe for this architecture.",
+ "quant.summary.headline_all": "All schemes for <code>{modelId}</code>",
+ "quant.status.empty_id": "⚠ Enter a model id (e.g. meta-llama/Llama-3.2-1B).",
+ "quant.status.fetching": "⏳ Fetching config.json for {modelId}...",
+ "quant.status.fetched": "✅ Config fetched for {modelId}. Pick a scheme and click Predict (or Compare all schemes).",
+ "quant.status.no_scheme": "⚠ Pick a quant scheme from the dropdown.",
+ "quant.status.done": "✅ Predicted regime: {regime}",
+ "quant.status.done_all": "✅ Compared {n} schemes — sorted by safety.",
   "share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
   "share.import_btn": "📂 Load shared JSON",
   "synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
@@ -401,7 +453,7 @@ export const TRANSLATIONS = {
   "common.no": "No",

   // Mode tooltips
- "modes.tip": "<strong>Eleven ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.",
+ "modes.tip": "<strong>Twelve ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.<br><strong>⚖️ Quant</strong>: predict γ-shift and ΔPPL for any (model × quant scheme); recommend safer alternative on cliff.",
   "profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
   "compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
@@ -1035,6 +1087,8 @@ export const TRANSLATIONS = {
   "help.v07.arena.body": "Chatbot Arena oculta los intervalos de confianza en su leaderboard público — una diferencia de 5 Elo puede ser estadísticamente irrelevante. Pega datos crudos de votos pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap de 200 iteraciones → Elos ranked con CIs 95% y un panel de \"empates estadísticos\" listando pares cuyos CIs se solapan. Prueba el botón Cargar sample. <em>Caso de uso</em>: antes de afirmar \"modelo A vence a modelo B\", verifica que sus CIs no se solapen.",
   "help.v07.contam.title": "🧪 Prior de Contaminación",
   "help.v07.contam.body": "Prior bayesiano-ish sobre si un score de benchmark está contaminado. Introduce la fecha cutoff de entrenamiento de tu modelo → la herramienta puntúa 20+ benchmarks populares (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) por P(contaminación) según gap temporal, inclusión en corpus y historial de leaks conocidos. Open LLM Leaderboard v1 fue cancelado en 2024 tras la contaminación de MMLU/HellaSwag. <em>Caso de uso</em>: decide qué scores te puedes creer al comparar dos modelos.",
+ "help.v07.quant.title": "⚖️ Clasificador de régimen de cuantización",

   // v0.7 — Inventory modal 5ª card
   "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
@@ -1042,6 +1096,56 @@ export const TRANSLATIONS = {
   "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exacto para que lm-eval no divida tu accuracy entre 2 silenciosamente",
   "inv.v07.arena": "<strong>🎯 Arena CI</strong> — recupera los intervalos de confianza que Chatbot Arena oculta",
   "inv.v07.contam": "<strong>🧪 Contaminación</strong> — puntúa 20+ benchmarks por probabilidad de contaminación",
   "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
   "share.import_btn": "📂 Cargar JSON compartido",
   "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
@@ -1134,7 +1238,7 @@ export const TRANSLATIONS = {
   "common.no": "No",

   // Tooltips de modos
- "modes.tip": "<strong>Once formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.",
   "profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
   "compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
@@ -1632,6 +1736,8 @@ export const TRANSLATIONS = {
   "help.v07.arena.body": "Chatbot Arena masque les intervalles de confiance de son leaderboard public — un écart de 5 Elo peut être statistiquement insignifiant. Collez des données brutes de votes pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap 200 itérations → Elos classés avec CIs 95% et un panneau \"égalités statistiques\" listant les paires dont les CIs se chevauchent. Essayez le bouton Charger échantillon. <em>Cas d'usage</em> : avant de déclarer \"modèle A bat modèle B\", vérifiez que leurs CIs ne se chevauchent pas.",
   "help.v07.contam.title": "🧪 Prior de Contamination",
   "help.v07.contam.body": "Prior bayésien-ish sur la contamination d'un score de benchmark. Saisissez la date de cutoff d'entraînement de votre modèle → l'outil note 20+ benchmarks populaires (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) par P(contamination) selon l'écart temporel, l'inclusion dans corpus et l'historique de leaks connus. Open LLM Leaderboard v1 a été tué en 2024 après la contamination de MMLU/HellaSwag. <em>Cas d'usage</em> : décidez quels scores croire en comparant deux modèles.",

   // v0.7 — Inventory modal 5ème card
   "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
@@ -1639,6 +1745,56 @@ export const TRANSLATIONS = {
   "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exact pour que lm-eval ne divise pas votre accuracy par 2 en silence",
   "inv.v07.arena": "<strong>🎯 Arena CI</strong> — récupère les intervalles de confiance que Chatbot Arena cache",
   "inv.v07.contam": "<strong>🧪 Contamination</strong> — note 20+ benchmarks par probabilité de contamination",
   "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
   "share.import_btn": "📂 Charger JSON partagé",
   "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
@@ -1731,7 +1887,7 @@ export const TRANSLATIONS = {
   "common.no": "Non",

   // Tooltips des modes
- "modes.tip": "<strong>Onze façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.",
   "profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
   "compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
@@ -2229,6 +2385,8 @@ export const TRANSLATIONS = {
   "help.v07.arena.body": "Chatbot Arena 在公开排行榜中删除了置信区间 — 5 Elo 的差距在统计上可能毫无意义。粘贴原始 pairwise 投票数据(model_a, model_b, winner)→ Bradley-Terry MLE + 200 次 bootstrap → 排序 Elo + 95% CI + \"统计并列\" 面板,列出 CI 重叠的配对。尝试加载样本按钮。<em>用例</em>:宣称 \"模型 A 胜过模型 B\" 之前,验证它们的 CI 不重叠。",
   "help.v07.contam.title": "🧪 污染先验",
   "help.v07.contam.body": "对 benchmark 分数是否被污染做贝叶斯式的先验估计。输入模型训练 cutoff 日期 → 工具按 P(污染) 评估 20+ 主流 benchmark(MMLU、HellaSwag、GSM4K、HumanEval、IFEval、MMLU-Pro、GPQA、AIME、MATH-500、BBH、MUSR…),基于时间差距、语料库纳入和已知泄漏历史。Open LLM Leaderboard v1 在 2024 年因 MMLU/HellaSwag 分数被污染而停用。<em>用例</em>:比较两个模型时决定相信哪些分数。",

   // v0.7 — Inventory 模态第 5 卡
   "inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
@@ -2236,6 +2394,56 @@ export const TRANSLATIONS = {
   "inv.v07.template": "<strong>📜 Chat-template</strong> — 精确 CLI flag,让 lm-eval 不会静默对半你的 accuracy",
   "inv.v07.arena": "<strong>🎯 Arena CI</strong> — 恢复 Chatbot Arena 隐藏的置信区间",
   "inv.v07.contam": "<strong>🧪 污染</strong> — 按污染概率对 20+ benchmark 评级",
   "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
   "share.import_btn": "📂 加载共享的 JSON",
   "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
@@ -2328,7 +2536,7 @@ export const TRANSLATIONS = {
   "common.no": "否",

   // 模式提示
- "modes.tip": "<strong>十种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间;检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期,对 20+ benchmark 进行污染概率评估。",
   "profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: "我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?" → 粘贴 id → 画像 → 完成。",
   "compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: "我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?" → 选择 3 个 + X-2 + 16K → 看赢家。",
1091
+ "help.v07.quant.body": "Predice γ-shift y ΔPPL para cualquier (modelo × esquema de cuantización: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8…). Arch-aware: d_head pequeño + GQA agresivo → más sensible; los esquemas calibrados (AWQ) absorben mejor el shift que los no calibrados (NF4). Recomienda alternativas más seguras si detecta cliff. <em>Caso de uso</em>: antes de cuantizar, predice si tu combo arquitectura × esquema mantendrá la PPL aceptable, con sugerencia concreta de switch si no.",
1092
 
1093
  // v0.7 — Inventory modal 5ª card
1094
  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
 
1096
  "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exacto para que lm-eval no divida tu accuracy entre 2 silenciosamente",
1097
  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — recupera los intervalos de confianza que Chatbot Arena oculta",
1098
  "inv.v07.contam": "<strong>🧪 Contaminación</strong> — puntúa 20+ benchmarks por probabilidad de contaminación",
1099
+ "inv.v07.quant": "<strong>⚖️ Quant</strong> — predice γ-shift + ΔPPL para cualquier combo (modelo × esquema de cuantización)",
1100
+
1101
+ // v0.7.3 — anti-bullshit pack #5: Quant-regime classifier
1102
+ "modes.quant": "⚖️ Quant",
1103
+ "mode_desc.quant": "Predice γ-shift y ΔPPL para cualquier (modelo × esquema de cuantización). Arch-aware: d_head pequeño + GQA → más sensible. Recomienda alternativas más seguras si detecta cliff.",
1104
+ "quant.title": "⚖️ Clasificador de régimen de cuantización",
1105
+ "quant.tip": "Predice γ-shift (y la ΔPPL resultante) para un par (modelo × esquema). Claims genéricos como 'AWQ ~95% retención' son demasiado vagos — TAF usa d_head, ratio GQA, flag SWA y tamaño del modelo para dar veredicto arquitectura-específico. Resuelve: la comunidad HF reporta cliffs de cuantización impredecibles (NF4 -2 PPL en Phi-3 pero bien en Llama-3-8B).",
1106
+ "quant.desc": "<strong>¿Cuantizar romperá tu modelo?</strong> Pega un id HF, elige esquema de cuantización — obtén γ-shift predicho, banda ΔPPL esperada y alternativa recomendada si es un cliff. Solo navegador, sin GPU, sin set de calibración.",
1107
+ "quant.id_label": "ID modelo HF:",
1108
+ "quant.fetch_btn": "📥 Fetch config",
1109
+ "quant.scheme_label": "Esquema cuant:",
1110
+ "quant.run_btn": "⚖️ Predecir",
1111
+ "quant.all_btn": "📊 Comparar todos los esquemas",
1112
+ "quant.regime.safe": "✅ SEGURO",
1113
+ "quant.regime.mild": "✅ COMPRESIÓN LEVE",
1114
+ "quant.regime.significant": "⚠ DEGRADACIÓN SIGNIFICATIVA",
1115
+ "quant.regime.cliff": "❌ CLIFF FUERTE",
1116
+ "quant.label.gamma_shift": "γ shift",
1117
+ "quant.label.delta_ppl": "ΔPPL (est.)",
1118
+ "quant.label.arch_mult": "Multiplicador arch",
1119
+ "quant.section.breakdown": "Desglose",
1120
+ "quant.section.reco": "Recomendación",
1121
+ "quant.section.compare": "Todos los esquemas (ordenados por seguridad)",
1122
+ "quant.field.scheme": "Esquema",
1123
+ "quant.field.calibrated": "calibrado",
1124
+ "quant.field.uncalibrated": "no calibrado",
1125
+ "quant.field.base_penalty": "Penalización base",
1126
+ "quant.field.arch_mult_full": "Multiplicador arquitectónico",
1127
+ "quant.field.gamma_shift": "γ shift predicho",
1128
+ "quant.field.ppl_band": "Banda ΔPPL (est.)",
1129
+ "quant.field.params": "Parámetros",
1130
+ "quant.col.scheme": "Esquema",
1131
+ "quant.col.bits": "Bits",
1132
+ "quant.col.gamma_shift": "γ shift",
1133
+ "quant.col.ppl_band": "Banda ΔPPL",
1134
+ "quant.col.regime": "Régimen",
1135
+ "quant.reco.switch_to_awq": "<strong>Cambia a {scheme}</strong> — el 4-bit calibrado maneja d_head pequeño + GQA mucho mejor que NF4. ΔPPL esperada cae ~2-3×.",
1136
+ "quant.reco.switch_to_q5_km": "<strong>Cambia a {scheme}</strong> — Q5 mantiene más dimensiones de head intactas a bajo coste (solo ~25% más grande).",
1137
+ "quant.reco.switch_to_q4_km": "<strong>Cambia a {scheme}</strong> — Q3/Q2 son demasiado agresivos para esta arquitectura.",
1138
+ "quant.reco.consider_awq": "<strong>Considera {scheme}</strong> — la calibración reduce γ-shift significativamente en esta arquitectura.",
1139
+ "quant.reco.use_higher_bits": "<strong>Usa alternativa de mayor bit</strong> — esta arquitectura no absorbe 4-bit limpiamente. Prueba 5 u 8-bit.",
1140
+ "quant.reco.verify_with_eval": "<strong>Verifica con eval real</strong> — el shift predicho está en el límite. Corre NIAH a tu contexto objetivo antes de desplegar.",
1141
+ "quant.reco.no_action": "No requiere acción — la cuantización es segura para esta arquitectura.",
1142
+ "quant.summary.headline_all": "Todos los esquemas para <code>{modelId}</code>",
1143
+ "quant.status.empty_id": "⚠ Introduce un model id (ej. meta-llama/Llama-3.2-1B).",
1144
+ "quant.status.fetching": "⏳ Obteniendo config.json para {modelId}...",
1145
+ "quant.status.fetched": "✅ Config obtenido para {modelId}. Elige un esquema y click Predecir (o Comparar todos).",
1146
+ "quant.status.no_scheme": "⚠ Elige un esquema de cuantización del dropdown.",
1147
+ "quant.status.done": "✅ Régimen predicho: {regime}",
1148
+ "quant.status.done_all": "✅ Comparados {n} esquemas — ordenados por seguridad.",
1149
  "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
1150
  "share.import_btn": "📂 Cargar JSON compartido",
1151
  "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
 
1238
  "common.no": "No",
1239
 
1240
  // Tooltips de modos
1241
+ "modes.tip": "<strong>Doce formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.<br><strong>⚖️ Quant</strong>: predice γ-shift y ΔPPL para cualquier (modelo × esquema de cuantización); recomienda alternativa segura si hay cliff.",
1242
  "profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
1243
  "compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
1244
 
 
1736
  "help.v07.arena.body": "Chatbot Arena masque les intervalles de confiance de son leaderboard public — un écart de 5 Elo peut être statistiquement insignifiant. Collez des données brutes de votes pairwise (model_a, model_b, winner) → MLE Bradley-Terry + bootstrap 200 itérations → Elos classés avec CIs 95% et un panneau \"égalités statistiques\" listant les paires dont les CIs se chevauchent. Essayez le bouton Charger échantillon. <em>Cas d'usage</em> : avant de déclarer \"modèle A bat modèle B\", vérifiez que leurs CIs ne se chevauchent pas.",
1737
  "help.v07.contam.title": "🧪 Prior de Contamination",
1738
  "help.v07.contam.body": "Prior bayésien-ish sur la contamination d'un score de benchmark. Saisissez la date de cutoff d'entraînement de votre modèle → l'outil note 20+ benchmarks populaires (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) par P(contamination) selon l'écart temporel, l'inclusion dans corpus et l'historique de leaks connus. Open LLM Leaderboard v1 a été tué en 2024 après la contamination de MMLU/HellaSwag. <em>Cas d'usage</em> : décidez quels scores croire en comparant deux modèles.",
1739
+ "help.v07.quant.title": "⚖️ Classificateur de régime de quantification",
1740
+ "help.v07.quant.body": "Prédit le γ-shift et ΔPPL pour tout (modèle × schéma de quantification : NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8…). Arch-aware : petit d_head + GQA agressif → plus sensible ; les schémas calibrés (AWQ) absorbent mieux le shift que les non calibrés (NF4). Recommande des alternatives plus sûres si un cliff est détecté. <em>Cas d'usage</em> : avant de quantifier, prédisez si votre combo architecture × schéma maintiendra la PPL acceptable, avec une suggestion concrète de switch sinon.",
1741
 
1742
  // v0.7 — Inventory modal 5ème card
1743
  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
 
1745
  "inv.v07.template": "<strong>📜 Chat-template</strong> — flag CLI exact pour que lm-eval ne divise pas votre accuracy par 2 en silence",
1746
  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — récupère les intervalles de confiance que Chatbot Arena cache",
1747
  "inv.v07.contam": "<strong>🧪 Contamination</strong> — note 20+ benchmarks par probabilité de contamination",
1748
+ "inv.v07.quant": "<strong>⚖️ Quant</strong> — prédit le γ-shift + ΔPPL pour tout combo (modèle × schéma de quantification)",
1749
+
1750
+ // v0.7.3 — anti-bullshit pack #5: Quant-regime classifier
1751
+ "modes.quant": "⚖️ Quant",
1752
+ "mode_desc.quant": "Prédit le γ-shift et ΔPPL pour tout (modèle × schéma de quantification). Arch-aware : petit d_head + GQA → plus sensible. Recommande des alternatives plus sûres si un cliff est détecté.",
1753
+ "quant.title": "⚖️ Classificateur de régime de quantification",
1754
+ "quant.tip": "Prédit le γ-shift (et la ΔPPL résultante) pour une paire (modèle × schéma). Les claims génériques comme 'AWQ ~95% retention' sont trop vagues — TAF utilise d_head, ratio GQA, flag SWA et taille du modèle pour donner un verdict arch-spécifique. Résout : la communauté HF rapporte des cliffs de quantification imprédictibles (NF4 -2 PPL sur Phi-3 mais OK sur Llama-3-8B).",
1755
+ "quant.desc": "<strong>La quantification cassera-t-elle votre modèle ?</strong> Collez un id HF, choisissez un schéma — obtenez le γ-shift prédit, la bande ΔPPL attendue et une alternative recommandée si c'est un cliff. Navigateur uniquement, sans GPU, sans set de calibration.",
1756
+ "quant.id_label": "ID modèle HF :",
1757
+ "quant.fetch_btn": "📥 Récupérer config",
1758
+ "quant.scheme_label": "Schéma quant :",
1759
+ "quant.run_btn": "⚖️ Prédire",
1760
+ "quant.all_btn": "📊 Comparer tous les schémas",
1761
+ "quant.regime.safe": "✅ SÛR",
1762
+ "quant.regime.mild": "✅ COMPRESSION LÉGÈRE",
1763
+ "quant.regime.significant": "⚠ DÉGRADATION SIGNIFICATIVE",
1764
+ "quant.regime.cliff": "❌ CLIFF SÉVÈRE",
1765
+ "quant.label.gamma_shift": "γ shift",
1766
+ "quant.label.delta_ppl": "ΔPPL (est.)",
1767
+ "quant.label.arch_mult": "Multiplicateur arch",
1768
+ "quant.section.breakdown": "Détail",
1769
+ "quant.section.reco": "Recommandation",
1770
+ "quant.section.compare": "Tous les schémas (triés par sécurité)",
1771
+ "quant.field.scheme": "Schéma",
1772
+ "quant.field.calibrated": "calibré",
1773
+ "quant.field.uncalibrated": "non calibré",
1774
+ "quant.field.base_penalty": "Pénalité de base",
1775
+ "quant.field.arch_mult_full": "Multiplicateur architectural",
1776
+ "quant.field.gamma_shift": "γ shift prédit",
1777
+ "quant.field.ppl_band": "Bande ΔPPL (est.)",
1778
+ "quant.field.params": "Paramètres",
1779
+ "quant.col.scheme": "Schéma",
1780
+ "quant.col.bits": "Bits",
1781
+ "quant.col.gamma_shift": "γ shift",
1782
+ "quant.col.ppl_band": "Bande ΔPPL",
1783
+ "quant.col.regime": "Régime",
1784
+ "quant.reco.switch_to_awq": "<strong>Passez à {scheme}</strong> — le 4-bit calibré gère bien mieux les petits d_head + GQA que NF4. ΔPPL attendue chute ~2-3×.",
1785
+ "quant.reco.switch_to_q5_km": "<strong>Passez à {scheme}</strong> — Q5 garde plus de dimensions de head intactes à faible coût (~25% plus grand seulement).",
1786
+ "quant.reco.switch_to_q4_km": "<strong>Passez à {scheme}</strong> — Q3/Q2 sont trop agressifs pour cette architecture.",
1787
+ "quant.reco.consider_awq": "<strong>Considérez {scheme}</strong> — la calibration réduit significativement le γ-shift sur cette architecture.",
1788
+ "quant.reco.use_higher_bits": "<strong>Utilisez une alternative à plus de bits</strong> — cette architecture n'absorbe pas le 4-bit proprement. Essayez 5 ou 8-bit.",
1789
+ "quant.reco.verify_with_eval": "<strong>Vérifiez avec une vraie éval</strong> — le shift prédit est borderline. Lancez NIAH à votre contexte cible avant de déployer.",
1790
+ "quant.reco.no_action": "Pas d'action requise — la quantification est sûre pour cette architecture.",
1791
+ "quant.summary.headline_all": "Tous les schémas pour <code>{modelId}</code>",
1792
+ "quant.status.empty_id": "⚠ Saisissez un model id (ex. meta-llama/Llama-3.2-1B).",
1793
+ "quant.status.fetching": "⏳ Récupération config.json pour {modelId}...",
1794
+ "quant.status.fetched": "✅ Config récupéré pour {modelId}. Choisissez un schéma et cliquez Prédire (ou Comparer tous).",
1795
+ "quant.status.no_scheme": "⚠ Choisissez un schéma de quantification dans le dropdown.",
1796
+ "quant.status.done": "✅ Régime prédit : {regime}",
1797
+ "quant.status.done_all": "✅ Comparé {n} schémas — triés par sécurité.",
1798
  "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
1799
  "share.import_btn": "📂 Charger JSON partagé",
1800
  "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
 
1887
  "common.no": "Non",
1888
 
1889
  // Tooltips des modes
1890
+ "modes.tip": "<strong>Douze façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.<br><strong>⚖️ Quant</strong>: prédit γ-shift et ΔPPL pour tout (modèle × schéma de quantification) ; recommande une alternative sûre en cas de cliff.",
1891
  "profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
1892
  "compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
1893
 
 
2385
  "help.v07.arena.body": "Chatbot Arena 在公开排行榜中删除了置信区间 — 5 Elo 的差距在统计上可能毫无意义。粘贴原始 pairwise 投票数据(model_a, model_b, winner)→ Bradley-Terry MLE + 200 次 bootstrap → 排序 Elo + 95% CI + \"统计并列\" 面板,列出 CI 重叠的配对。尝试加载样本按钮。<em>用例</em>:宣称 \"模型 A 胜过模型 B\" 之前,验证它们的 CI 不重叠。",
2386
  "help.v07.contam.title": "🧪 污染先验",
2387
  "help.v07.contam.body": "对 benchmark 分数是否被污染做贝叶斯式的先验估计。输入模型训练 cutoff 日期 → 工具按 P(污染) 评估 20+ 主流 benchmark(MMLU、HellaSwag、GSM8K、HumanEval、IFEval、MMLU-Pro、GPQA、AIME、MATH-500、BBH、MUSR…),基于时间差距、语料库纳入和已知泄漏历史。Open LLM Leaderboard v1 在 2024 年因 MMLU/HellaSwag 分数被污染而停用。<em>用例</em>:比较两个模型时决定相信哪些分数。",
2388
+ "help.v07.quant.title": "⚖️ 量化机制分类器",
2389
+ "help.v07.quant.body": "预测任意(模型 × 量化方案:NF4、AWQ、GPTQ、GGUF Q4_K_M / Q5_K_M / Q8_0、int8、FP8…)的 γ-shift 与 ΔPPL。架构感知:小 d_head + 激进 GQA → 更敏感;校准方案(AWQ)比未校准方案(NF4)更好地吸收偏移。检测到 cliff 时推荐更安全的替代方案。<em>用例</em>:量化之前,预测你的特定架构 × 方案组合是否能保持 PPL 可接受,否则给出具体的切换建议。",
2390
 
2391
  // v0.7 — Inventory 模态第 5 卡
2392
  "inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
 
2394
  "inv.v07.template": "<strong>📜 Chat-template</strong> — 精确 CLI flag,让 lm-eval 不会静默对半你的 accuracy",
2395
  "inv.v07.arena": "<strong>🎯 Arena CI</strong> — 恢复 Chatbot Arena 隐藏的置信区间",
2396
  "inv.v07.contam": "<strong>🧪 污染</strong> — 按污染概率对 20+ benchmark 评级",
2397
+ "inv.v07.quant": "<strong>⚖️ Quant</strong> — 预测任意(模型 × 量化方案)组合的 γ-shift + ΔPPL",
2398
+
2399
+ // v0.7.3 — anti-bullshit pack #5: Quant-regime classifier
2400
+ "modes.quant": "⚖️ Quant",
2401
+ "mode_desc.quant": "预测任意(模型 × 量化方案)的 γ-shift 与 ΔPPL。架构感知:小 d_head + GQA → 更敏感。检测到 cliff 时推荐更安全的替代方案。",
2402
+ "quant.title": "⚖️ 量化机制分类器",
2403
+ "quant.tip": "预测给定(模型 × ���化方案)的 γ-shift(及由此产生的 ΔPPL)。\"AWQ 保留 ~95%\" 这类通用说法太模糊 — TAF 利用 d_head、GQA 比、SWA 标志和模型大小给出特定于架构的判定。解决:HF 社区普遍报告不可预测的量化 cliff(NF4 在 Phi-3 上 -2 PPL,但在 Llama-3-8B 上没问题)。",
2404
+ "quant.desc": "<strong>量化会破坏你的模型吗?</strong>粘贴 HF 模型 id,选择量化方案 — 获取预测的 γ-shift、预期 ΔPPL 区间,以及在 cliff 情况下的推荐替代方案。仅浏览器,无 GPU,无需校准集。",
2405
+ "quant.id_label": "HF 模型 id:",
2406
+ "quant.fetch_btn": "📥 获取 config",
2407
+ "quant.scheme_label": "量化方案:",
2408
+ "quant.run_btn": "⚖️ 预测",
2409
+ "quant.all_btn": "📊 比较所有方案",
2410
+ "quant.regime.safe": "✅ 安全",
2411
+ "quant.regime.mild": "✅ 轻度压缩",
2412
+ "quant.regime.significant": "⚠ 显著退化",
2413
+ "quant.regime.cliff": "❌ 重大 CLIFF",
2414
+ "quant.label.gamma_shift": "γ 偏移",
2415
+ "quant.label.delta_ppl": "ΔPPL(估)",
2416
+ "quant.label.arch_mult": "架构乘数",
2417
+ "quant.section.breakdown": "细节分解",
2418
+ "quant.section.reco": "建议",
2419
+ "quant.section.compare": "所有方案(按安全性排序)",
2420
+ "quant.field.scheme": "方案",
2421
+ "quant.field.calibrated": "已校准",
2422
+ "quant.field.uncalibrated": "未校准",
2423
+ "quant.field.base_penalty": "基础惩罚",
2424
+ "quant.field.arch_mult_full": "架构乘数",
2425
+ "quant.field.gamma_shift": "预测 γ 偏移",
2426
+ "quant.field.ppl_band": "ΔPPL 区间(估)",
2427
+ "quant.field.params": "参数量",
2428
+ "quant.col.scheme": "方案",
2429
+ "quant.col.bits": "比特",
2430
+ "quant.col.gamma_shift": "γ 偏移",
2431
+ "quant.col.ppl_band": "ΔPPL 区间",
2432
+ "quant.col.regime": "机制",
2433
+ "quant.reco.switch_to_awq": "<strong>切换到 {scheme}</strong> — 校准的 4-bit 处理小 d_head + GQA 比 NF4 好得多。预期 ΔPPL 下降 ~2-3 倍。",
2434
+ "quant.reco.switch_to_q5_km": "<strong>切换到 {scheme}</strong> — Q5 以低成本保留更多 head 维度(仅大约 25% 文件更大)。",
2435
+ "quant.reco.switch_to_q4_km": "<strong>切换到 {scheme}</strong> — Q3/Q2 对此架构过于激进。",
2436
+ "quant.reco.consider_awq": "<strong>考虑 {scheme}</strong> — 在此架构上校准能显著降低 γ-shift。",
2437
+ "quant.reco.use_higher_bits": "<strong>使用更高比特的替代</strong> — 此架构无法干净吸收 4-bit。尝试 5 或 8-bit。",
2438
+ "quant.reco.verify_with_eval": "<strong>用真实 eval 验证</strong> — 预测偏移在边缘。部署前在目标上下文运行 NIAH。",
2439
+ "quant.reco.no_action": "无需操作 — 此架构下量化是安全的。",
2440
+ "quant.summary.headline_all": "<code>{modelId}</code> 的所有方案",
2441
+ "quant.status.empty_id": "⚠ 输入 model id(例如 meta-llama/Llama-3.2-1B)。",
2442
+ "quant.status.fetching": "⏳ 正在获取 {modelId} 的 config.json...",
2443
+ "quant.status.fetched": "✅ 已获取 {modelId} 的 config。选择方案并点击预测(或比较所有)。",
2444
+ "quant.status.no_scheme": "⚠ 从下拉中选择一个量化方案。",
2445
+ "quant.status.done": "✅ 预测机制:{regime}",
2446
+ "quant.status.done_all": "✅ 已比较 {n} 个方案 — 按安全性排序。",
2447
  "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
2448
  "share.import_btn": "📂 加载共享的 JSON",
2449
  "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
 
2536
  "common.no": "否",
2537
 
2538
  // 模式提示
2539
+ "modes.tip": "<strong>十种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间;检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期,对 20+ benchmark 进行污染概率评估。<br><strong>⚖️ Quant</strong>: 预测任意(模型 × 量化方案)的 γ-shift 与 ΔPPL;cliff 时推荐更安全替代方案。",
2540
  "profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
2541
  "compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
2542
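The i18n keys above surface fields (`base_penalty`, `arch_mult`, `gamma_shift`, the safe/mild/significant/cliff regime bands) computed by the new js/quant_regime.js, whose body is truncated out of this diff. A rough sketch of how such an architecture-aware predictor could be shaped — the penalties, multipliers, and thresholds here are illustrative guesses calibrated against the commit's simulated numbers (Llama-3-8B + AWQ → mild, γ=0.022), not the shipped implementation:

```javascript
// Illustrative-only sketch: predicted γ-shift = scheme base penalty × an
// architecture multiplier driven by d_head, GQA ratio, and the SWA flag.
// All constants are assumptions, not values from js/quant_regime.js.
function predictQuantShift(cfg, scheme) {
  const nHeads = cfg.num_attention_heads;
  const nKv = cfg.num_key_value_heads || nHeads;
  const dHead = cfg.hidden_size / nHeads;

  // Small head dims and aggressive GQA leave less redundancy to absorb
  // quantization noise; sliding-window attention adds a little more risk.
  let archMult = 1.0;
  if (dHead < 96) archMult *= 1.3;
  if (nHeads / nKv >= 6) archMult *= 1.2;
  if (cfg.sliding_window) archMult *= 1.1;

  const gammaShift = scheme.basePenalty * archMult;
  const regime =
    gammaShift <= 0.015 ? "safe" :
    gammaShift <= 0.05  ? "mild" :
    gammaShift <= 0.08  ? "significant" : "cliff";

  return { gamma_shift: gammaShift, arch_multiplier: archMult, regime };
}

// Llama-3-8B-like config (d_head 128, GQA 4:1) with an AWQ-like base penalty:
const r = predictQuantShift(
  { hidden_size: 4096, num_attention_heads: 32, num_key_value_heads: 8 },
  { basePenalty: 0.022 }
);
// r.regime is "mild" here, matching the commit's simulated Llama-3-8B + AWQ case.
```

The multiplicative decomposition is what makes the "compare all schemes" table cheap: the architecture multiplier is computed once per config, then applied across every scheme's base penalty.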
 
js/main.js CHANGED
@@ -15,6 +15,7 @@ import { unmaskConfig } from "./swa_unmasker.js";
15
  import { sniffChatTemplate } from "./chat_template_sniffer.js";
16
  import { parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV } from "./arena_ci.js";
17
  import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";
 
18
 
19
  const TAF_BROWSER_URL = "python/taf_browser.py";
20
  const ENABLE_WEBLLM = true;
@@ -190,7 +191,8 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
190
  ["ask-section", "recipe-section", "form-section",
191
  "profile-section", "compare-section", "inspector-section",
192
  "diagnose-section", "phase-section", "unmask-section",
193
- "template-section", "arena-section", "contam-section"].forEach(id => {
 
194
  const el = $(id);
195
  if (el) el.style.display = "none";
196
  });
@@ -200,6 +202,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
200
  compare: "compare-section", inspector: "inspector-section",
201
  diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
202
  template: "template-section", arena: "arena-section", contam: "contam-section",
 
203
  };
204
  const sectionId = sectionMap[mode];
205
  if (sectionId) $(sectionId).style.display = "";
@@ -980,6 +983,178 @@ $("contam-cutoff")?.addEventListener("keydown", (e) => {
980
  if (e.key === "Enter") { e.preventDefault(); runContamCompute(); }
981
  });
  function configToPreset(cfg, modelId) {
984
  const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
985
  const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
 
15
  import { sniffChatTemplate } from "./chat_template_sniffer.js";
16
  import { parseVotesCSV, computeArenaCI, SAMPLE_VOTES_CSV } from "./arena_ci.js";
17
  import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";
18
+ import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
19
 
20
  const TAF_BROWSER_URL = "python/taf_browser.py";
21
  const ENABLE_WEBLLM = true;
 
191
  ["ask-section", "recipe-section", "form-section",
192
  "profile-section", "compare-section", "inspector-section",
193
  "diagnose-section", "phase-section", "unmask-section",
194
+ "template-section", "arena-section", "contam-section",
195
+ "quant-section"].forEach(id => {
196
  const el = $(id);
197
  if (el) el.style.display = "none";
198
  });
 
202
  compare: "compare-section", inspector: "inspector-section",
203
  diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
204
  template: "template-section", arena: "arena-section", contam: "contam-section",
205
+ quant: "quant-section",
206
  };
207
  const sectionId = sectionMap[mode];
208
  if (sectionId) $(sectionId).style.display = "";
 
983
  if (e.key === "Enter") { e.preventDefault(); runContamCompute(); }
984
  });
985
 
986
+ // ════════════════════════════════════════════════════════════════════
987
+ // ⚖️ Quant-regime classifier (v0.7.3 anti-bullshit pack #5)
988
+ // ════════════════════════════════════════════════════════════════════
989
+
990
+ const QUANT_REGIME_COLOR = {
991
+ safe: "#3fb950",
992
+ mild: "#3fb950",
993
+ significant: "#f1c40f",
994
+ cliff: "#f85149",
995
+ };
996
+
997
+ // Populate scheme dropdown from QUANT_SCHEMES on first render. Idempotent.
998
+ function populateQuantSchemes() {
999
+ const sel = $("quant-scheme");
1000
+ if (!sel || sel.options.length > 1) return;
1001
+ for (const s of QUANT_SCHEMES) {
1002
+ const opt = document.createElement("option");
1003
+ opt.value = s.id;
1004
+ opt.textContent = s.label;
1005
+ sel.appendChild(opt);
1006
+ }
1007
+ }
1008
+
1009
+ // Cache config across "Fetch" + "Predict" / "Compare" actions on the same id.
1010
+ let __quantLastConfig = null;
1011
+ let __quantLastModelId = null;
1012
+
1013
+ async function quantFetchConfig() {
1014
+ const modelId = ($("quant-id").value || "").trim();
1015
+ if (!modelId) {
1016
+ $("quant-status").textContent = t("quant.status.empty_id") || "⚠ Enter a model id.";
1017
+ return null;
1018
+ }
1019
+ $("quant-status").textContent = tFmt("quant.status.fetching", { modelId });
1020
+ $("quant-fetch-btn").disabled = true;
1021
+ try {
1022
+ const cfg = await fetchHfConfig(modelId);
1023
+ __quantLastConfig = cfg;
1024
+ __quantLastModelId = modelId;
1025
+ $("quant-status").textContent = tFmt("quant.status.fetched", { modelId });
1026
+ return cfg;
1027
+ } catch (err) {
1028
+ $("quant-status").textContent = `❌ ${err.message}`;
1029
+ return null;
1030
+ } finally {
1031
+ $("quant-fetch-btn").disabled = false;
1032
+ }
1033
+ }
1034
+
1035
+ function renderQuantSingle(result, modelId) {
+   const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+     ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
+   const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
+   const color = QUANT_REGIME_COLOR[result.regime] || "#8b949e";
+   const regimeLabel = t(`quant.regime.${result.regime}`) || result.regime;
+
+   let recoHtml = "";
+   if (result.recommend_code) {
+     const recoText = result.recommend_scheme
+       ? tFmt("quant.reco." + result.recommend_code, {
+           scheme: QUANT_SCHEMES.find(s => s.id === result.recommend_scheme)?.label || result.recommend_scheme,
+         })
+       : (t("quant.reco." + result.recommend_code) || result.recommend_code);
+     recoHtml = `<p class="unmask-reco">${recoText}</p>`;
+   } else {
+     recoHtml = `<p class="unmask-reco">${t("quant.reco.no_action") || "No action needed — quantization is safe for this architecture."}</p>`;
+   }
+
+   return `
+     <div class="unmask-result">
+       <div class="unmask-hero" style="border-color: ${color};">
+         <div class="unmask-verdict" style="color: ${color};">${regimeLabel}</div>
+         <div class="unmask-model"><code>${escapeHtml(modelId)}</code> + <code>${escapeHtml(result.scheme_label)}</code></div>
+         <div class="unmask-numbers">
+           <div><span class="unmask-num-label">${t("quant.label.gamma_shift") || "γ shift"}</span><span class="unmask-num-val">+${result.gamma_shift.toFixed(3)}</span></div>
+           <div><span class="unmask-num-label">${t("quant.label.delta_ppl") || "ΔPPL (est.)"}</span><span class="unmask-num-val">+${result.delta_ppl.mid.toFixed(2)}</span></div>
+           <div><span class="unmask-num-label">${t("quant.label.arch_mult") || "Arch multiplier"}</span><span class="unmask-num-val">×${result.arch_multiplier}</span></div>
+         </div>
+       </div>
+       <div class="unmask-details">
+         <details class="unmask-panel" open>
+           <summary class="unmask-panel-title">${t("quant.section.breakdown") || "Breakdown"}</summary>
+           <ul>
+             <li><strong>${t("quant.field.scheme") || "Scheme"}:</strong> ${escapeHtml(result.scheme_label)} (${result.scheme_bits}-bit, ${result.scheme_calibrated ? (t("quant.field.calibrated") || "calibrated") : (t("quant.field.uncalibrated") || "uncalibrated")})</li>
+             <li><strong>${t("quant.field.base_penalty") || "Base penalty"}:</strong> ${result.base_penalty.toFixed(3)}</li>
+             <li><strong>${t("quant.field.arch_mult_full") || "Architecture multiplier"}:</strong> ×${result.arch_multiplier} (d_head, GQA, SWA, params)</li>
+             <li><strong>${t("quant.field.gamma_shift") || "Predicted γ shift"}:</strong> +${result.gamma_shift.toFixed(3)}</li>
+             <li><strong>${t("quant.field.ppl_band") || "ΔPPL band (est.)"}:</strong> ${result.delta_ppl.low.toFixed(2)} – ${result.delta_ppl.high.toFixed(2)}</li>
+             <li><strong>${t("quant.field.params") || "Parameters"}:</strong> ${fmtN(result.n_params)}</li>
+           </ul>
+         </details>
+         <details class="unmask-panel" open>
+           <summary class="unmask-panel-title">${t("quant.section.reco") || "Recommendation"}</summary>
+           ${recoHtml}
+         </details>
+       </div>
+     </div>
+   `;
+ }
+
+ function renderQuantAll(rows, modelId) {
+   const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+     ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
+   let body = "";
+   for (const r of rows) {
+     const color = QUANT_REGIME_COLOR[r.regime] || "#8b949e";
+     const regimeLabel = t(`quant.regime.${r.regime}`) || r.regime;
+     body += `<tr>
+       <td><strong>${escapeHtml(r.scheme_label)}</strong></td>
+       <td class="arena-spread">${r.scheme_bits}-bit ${r.scheme_calibrated ? "✓" : ""}</td>
+       <td class="arena-elo">+${r.gamma_shift.toFixed(3)}</td>
+       <td class="arena-spread">${r.delta_ppl.low.toFixed(2)}–${r.delta_ppl.high.toFixed(2)}</td>
+       <td style="color: ${color};"><strong>${regimeLabel}</strong></td>
+     </tr>`;
+   }
+   return `
+     <div class="arena-result">
+       <div class="unmask-hero" style="border-color: #58a6ff;">
+         <div class="unmask-verdict" style="color: #58a6ff;">${tFmt("quant.summary.headline_all", { modelId: escapeHtml(modelId) })}</div>
+       </div>
+       <div class="unmask-details">
+         <details class="unmask-panel" open>
+           <summary class="unmask-panel-title">${t("quant.section.compare") || "All schemes (sorted by safety)"}</summary>
+           <table class="arena-table">
+             <thead><tr>
+               <th>${t("quant.col.scheme") || "Scheme"}</th>
+               <th>${t("quant.col.bits") || "Bits"}</th>
+               <th>${t("quant.col.gamma_shift") || "γ shift"}</th>
+               <th>${t("quant.col.ppl_band") || "ΔPPL band"}</th>
+               <th>${t("quant.col.regime") || "Regime"}</th>
+             </tr></thead>
+             <tbody>${body}</tbody>
+           </table>
+         </details>
+       </div>
+     </div>
+   `;
+ }
+
+ async function runQuantPredict() {
+   const cfg = __quantLastConfig || await quantFetchConfig();
+   if (!cfg) return;
+   const schemeId = $("quant-scheme").value;
+   if (!schemeId) {
+     $("quant-status").textContent = t("quant.status.no_scheme") || "⚠ Pick a quant scheme.";
+     return;
+   }
+   const result = predictQuantShift(cfg, schemeId);
+   if (!result) {
+     $("quant-status").textContent = "❌ Unknown scheme.";
+     return;
+   }
+   $("quant-output").innerHTML = renderQuantSingle(result, __quantLastModelId);
+   $("quant-status").textContent = tFmt("quant.status.done", { regime: t(`quant.regime.${result.regime}`) || result.regime });
+ }
+
+ async function runQuantAll() {
+   const cfg = __quantLastConfig || await quantFetchConfig();
+   if (!cfg) return;
+   const rows = predictAllSchemes(cfg);
+   $("quant-output").innerHTML = renderQuantAll(rows, __quantLastModelId);
+   $("quant-status").textContent = tFmt("quant.status.done_all", { n: rows.length });
+ }
+
+ populateQuantSchemes();
+ $("quant-fetch-btn")?.addEventListener("click", quantFetchConfig);
+ $("quant-run-btn")?.addEventListener("click", runQuantPredict);
+ $("quant-all-btn")?.addEventListener("click", runQuantAll);
+ $("quant-id")?.addEventListener("keydown", (e) => {
+   if (e.key === "Enter") { e.preventDefault(); quantFetchConfig(); }
+ });
+
  function configToPreset(cfg, modelId) {
    const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
    const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
js/quant_regime.js ADDED
@@ -0,0 +1,147 @@
+ // Quant-regime classifier (v0.7.3 anti-bullshit pack #5)
+ // Predicts γ shift under quantization given (architecture × quant scheme).
+ // Pure logic — no human strings. Solves: the HF community widely reports that
+ // quantization "cliffs" are unpredictable per model. Generic "AWQ ~95% retention"
+ // claims are too vague — TAF gives an architecture-specific verdict.
+ //
+ // Calibration sources: Maarten Grootendorst's quant comparison newsletter,
+ // llama.cpp PPL benchmarks, GPTQ/AWQ papers.
+
+ export const QUANT_SCHEMES = [
+   { id: "fp8",        label: "FP8 (Hopper)",                     bits: 8, base_penalty: 0.007, calibrated: false, hardware: "h100+" },
+   { id: "int8",       label: "int8 (LLM.int8())",                bits: 8, base_penalty: 0.010, calibrated: false, hardware: "any" },
+   { id: "gguf_q8_0",  label: "GGUF Q8_0",                        bits: 8, base_penalty: 0.008, calibrated: false, hardware: "cpu/any" },
+   { id: "gguf_q5_km", label: "GGUF Q5_K_M",                      bits: 5, base_penalty: 0.020, calibrated: false, hardware: "cpu/any" },
+   { id: "awq",        label: "AWQ (4-bit, calibrated)",          bits: 4, base_penalty: 0.020, calibrated: true,  hardware: "any" },
+   { id: "gptq",       label: "GPTQ (4-bit, calibrated)",         bits: 4, base_penalty: 0.035, calibrated: true,  hardware: "any" },
+   { id: "gguf_q4_km", label: "GGUF Q4_K_M",                      bits: 4, base_penalty: 0.050, calibrated: false, hardware: "cpu/any" },
+   { id: "nf4",        label: "NF4 (bitsandbytes, uncalibrated)", bits: 4, base_penalty: 0.070, calibrated: false, hardware: "any" },
+   { id: "gguf_q3_km", label: "GGUF Q3_K_M (aggressive)",         bits: 3, base_penalty: 0.110, calibrated: false, hardware: "cpu/any" },
+   { id: "gguf_q2_k",  label: "GGUF Q2_K (extreme)",              bits: 2, base_penalty: 0.180, calibrated: false, hardware: "cpu/any" },
+ ];
+
+ const REGIME_BANDS = [
+   { id: "safe",        max_gamma_shift: 0.015, label_code: "safe" },
+   { id: "mild",        max_gamma_shift: 0.04,  label_code: "mild" },
+   { id: "significant", max_gamma_shift: 0.08,  label_code: "significant" },
+   { id: "cliff",       max_gamma_shift: 1.0,   label_code: "cliff" },
+ ];
+
+ function bandFor(gammaShift) {
+   for (const b of REGIME_BANDS) if (gammaShift <= b.max_gamma_shift) return b.id;
+   return "cliff";
+ }
+
+ // Architecture-specific multiplier on the base quant penalty.
+ // More sensitive: small d_head, aggressive GQA ratio, very small models (pre-IH).
+ // Less sensitive: large d_head, post-IH, MHA (no GQA pressure).
+ function archMultiplier(config) {
+   let mult = 1.0;
+   const n_attn = config.num_attention_heads ?? null;
+   const n_kv = config.num_key_value_heads ?? n_attn;
+   const hidden = config.hidden_size ?? null;
+   const d_head = config.head_dim ?? (n_attn && hidden ? hidden / n_attn : null);
+   const n_params = inferNParams(config);
+   const hasSWA = typeof config.sliding_window === "number" && config.sliding_window > 0;
+   const hasGQA = n_attn && n_kv && n_kv < n_attn;
+   const gqaRatio = hasGQA ? n_attn / n_kv : 1;
+
+   // d_head sensitivity (small head = more compression damage)
+   if (d_head !== null) {
+     if (d_head < 64) mult *= 1.5;
+     else if (d_head < 96) mult *= 1.2;
+     else if (d_head < 128) mult *= 1.05;
+     // d_head >= 128: no penalty
+   }
+   // GQA pressure (heavily-shared kv heads = more interference under quant)
+   if (gqaRatio >= 8) mult *= 1.3;
+   else if (gqaRatio >= 4) mult *= 1.15;
+   // SWA: localized attention is somewhat more robust to head-level noise
+   if (hasSWA) mult *= 0.92;
+   // Post-IH (large) models more robust; pre-IH (small) less robust
+   if (n_params !== null) {
+     if (n_params < 1.5e9) mult *= 1.4;        // <1.5B = pre-IH
+     else if (n_params < 4e9) mult *= 1.15;    // borderline
+     else if (n_params >= 30e9) mult *= 0.85;  // very large = robust
+     else if (n_params >= 7e9) mult *= 0.95;
+   }
+   return mult;
+ }
+
+ function inferNParams(config) {
+   if (typeof config.num_parameters === "number") return config.num_parameters;
+   if (typeof config.n_params === "number") return config.n_params;
+   // Estimate 12·L·h² for the transformer blocks plus input/output embeddings
+   // (classic rule-of-thumb; assumes untied embeddings).
+   const h = config.hidden_size ?? null;
+   const L = config.num_hidden_layers ?? null;
+   const v = config.vocab_size ?? null;
+   if (h && L) {
+     const transformer = 12 * L * h * h;
+     const embed = v ? v * h : 0;
+     return transformer + 2 * embed;
+   }
+   return null;
+ }
+
+ // Predict ΔPPL band from γ shift, scaled by model size.
+ // Empirical fit (rough): ΔPPL ≈ 8 × γ_shift² × (1 + log10(N/1e9)/4).
+ // Returns a {low, mid, high} band (≈±50% uncertainty).
+ function predictDeltaPPL(gammaShift, nParams) {
+   if (gammaShift <= 0) return { low: 0, mid: 0, high: 0 };
+   const sizeBoost = nParams ? 1 + Math.log10(nParams / 1e9) / 4 : 1;
+   const mid = 8 * gammaShift * gammaShift * sizeBoost;
+   return {
+     low: Math.max(0, Math.round((mid * 0.6) * 100) / 100),
+     mid: Math.round(mid * 100) / 100,
+     high: Math.round((mid * 1.5) * 100) / 100,
+   };
+ }
+
+ export function predictQuantShift(config, schemeId) {
+   const scheme = QUANT_SCHEMES.find(s => s.id === schemeId);
+   if (!scheme) return null;
+
+   const mult = archMultiplier(config);
+   const gammaShift = scheme.base_penalty * mult;
+   const regime = bandFor(gammaShift);
+   const nParams = inferNParams(config);
+   const deltaPPL = predictDeltaPPL(gammaShift, nParams);
+
+   // Recommendation logic (which scheme to switch to if the regime is bad).
+   let recommendCode = null;
+   let recommendScheme = null;
+   if (regime === "cliff") {
+     // Step up to the next-safer scheme: nf4 → awq, q4_km → q5_km, q3/q2 → q4_km, gptq → awq.
+     if (scheme.id === "nf4") { recommendCode = "switch_to_awq"; recommendScheme = "awq"; }
+     else if (scheme.id === "gguf_q4_km") { recommendCode = "switch_to_q5_km"; recommendScheme = "gguf_q5_km"; }
+     else if (scheme.id === "gguf_q3_km") { recommendCode = "switch_to_q4_km"; recommendScheme = "gguf_q4_km"; }
+     else if (scheme.id === "gguf_q2_k") { recommendCode = "switch_to_q4_km"; recommendScheme = "gguf_q4_km"; }
+     else if (scheme.id === "gptq") { recommendCode = "switch_to_awq"; recommendScheme = "awq"; }
+     else recommendCode = "use_higher_bits";
+   } else if (regime === "significant") {
+     if (scheme.id === "nf4") { recommendCode = "consider_awq"; recommendScheme = "awq"; }
+     else recommendCode = "verify_with_eval";
+   }
+
+   return {
+     scheme: scheme.id,
+     scheme_label: scheme.label,
+     scheme_bits: scheme.bits,
+     scheme_calibrated: scheme.calibrated,
+     arch_multiplier: Math.round(mult * 100) / 100,
+     base_penalty: scheme.base_penalty,
+     gamma_shift: Math.round(gammaShift * 1000) / 1000,
+     regime,
+     delta_ppl: deltaPPL,
+     n_params: nParams,
+     recommend_code: recommendCode,
+     recommend_scheme: recommendScheme,
+   };
+ }
+
+ // Batch: predict all schemes for one config. Useful for "show me the trade-offs".
+ export function predictAllSchemes(config) {
+   return QUANT_SCHEMES.map(s => predictQuantShift(config, s.id))
+     .filter(Boolean)
+     .sort((a, b) => a.gamma_shift - b.gamma_shift);
+ }
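As a sanity check, the γ-shift arithmetic above can be reproduced standalone. The config below is a hypothetical Llama-3-8B-style `config.json` (field values assumed for illustration, not fetched from the Hub); the sketch inlines the same thresholds and `base_penalty` constants as `quant_regime.js`, and the resulting values match the virtual-simulation numbers in the commit message (Q8_0 safe, AWQ mild, NF4 significant).

```javascript
// Standalone sketch of the γ-shift computation, inlining the same constants
// as quant_regime.js. The config is a hypothetical Llama-3-8B-style config —
// assumed values for illustration only.
const cfg = {
  hidden_size: 4096,
  num_attention_heads: 32,
  num_key_value_heads: 8,
  num_hidden_layers: 32,
  vocab_size: 128256,
};

function gammaShift(config, basePenalty) {
  let mult = 1.0;
  const nAttn = config.num_attention_heads;
  const nKv = config.num_key_value_heads ?? nAttn;
  const dHead = config.hidden_size / nAttn;            // 128 → no d_head penalty
  const gqaRatio = nKv < nAttn ? nAttn / nKv : 1;      // 4 → ×1.15
  const nParams = 12 * config.num_hidden_layers * config.hidden_size ** 2
    + 2 * config.vocab_size * config.hidden_size;      // ≈7.5e9 → ×0.95
  if (dHead < 64) mult *= 1.5;
  else if (dHead < 96) mult *= 1.2;
  else if (dHead < 128) mult *= 1.05;
  if (gqaRatio >= 8) mult *= 1.3;
  else if (gqaRatio >= 4) mult *= 1.15;
  if (nParams < 1.5e9) mult *= 1.4;
  else if (nParams < 4e9) mult *= 1.15;
  else if (nParams >= 30e9) mult *= 0.85;
  else if (nParams >= 7e9) mult *= 0.95;
  return Math.round(basePenalty * mult * 1000) / 1000; // same rounding as predictQuantShift
}

// base_penalty values from QUANT_SCHEMES: Q8_0 = 0.008, AWQ = 0.020, NF4 = 0.070
console.log(gammaShift(cfg, 0.008)); // → 0.009 (safe)
console.log(gammaShift(cfg, 0.020)); // → 0.022 (mild)
console.log(gammaShift(cfg, 0.070)); // → 0.076 (significant)
```

The net multiplier here is 1.15 × 0.95 = 1.0925: the GQA penalty for a 4:1 head-sharing ratio is almost cancelled by the ≥7B robustness discount, which is why 4-bit calibrated schemes stay in the mild band for this class of model.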