karlexmarin Claude Opus 4.7 (1M context) committed on
Commit ab15c91 · 1 Parent(s): edb4038

v0.7.6: NIAH→reasoning predictor (anti-bullshit #7) + Drift CSS fix

NEW MODE 🔍 NIAH→Reason — pass rate predictor for retrieval vs reasoning at any context
- js/niah_reasoning.js: pure logic. Reuses gammaPade/d_horizon machinery.
- Inputs: HF model id (auto-fetch config) + target T_eval.
- Outputs: NIAH pass rate, multi-hop reasoning rate, gap, "safe context" where reasoning ≥ 65%, verdict (robust / marginal / degraded / retrieval_only / broken).
- Logistic on log(T_eval / d_horizon) for NIAH; arch-pressure-modulated penalty for reasoning. Extrapolation penalty when T_eval > T_train (no positional embeddings learned past training).
- sweepContextLengths(): scans 1k / 4k / 16k / 64k / T_train so user sees the curve.
- Addresses the RULER paper (NVIDIA 2024) finding that long-context models often pass NIAH but fail reasoning at the same context. P6 in HF community pain analysis.
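
The prediction steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in js/niah_reasoning.js: the steepness value, the linear arch-pressure penalty, the `tTrain / tEval` extrapolation penalty, and the names `niahPassRate`, `reasoningPassRate`, `sweepSketch` and `safeContext` are all assumptions made for the example. Only the logistic-on-log-ratio shape, the sweep grid and the 65% threshold come from the commit message.

```javascript
// Illustrative sketch only; constants and penalty forms are assumptions,
// not values taken from js/niah_reasoning.js.
const SAFE_REASONING = 0.65; // "reasoning >= 65%" threshold from the commit message

const logistic = (x) => 1 / (1 + Math.exp(-x));

// NIAH pass rate: logistic on log(T_eval / d_horizon).
// `steepness` is a hypothetical slope; the rate is 0.5 when T_eval equals d_horizon.
function niahPassRate(tEval, dHorizon, steepness = 2.0) {
  return logistic(-steepness * Math.log(tEval / dHorizon));
}

// Reasoning pass rate: the NIAH rate scaled down by two assumed penalties.
// archPressure is taken to lie in [0, 1] (small d_head, GQA, SWA combined).
function reasoningPassRate(tEval, dHorizon, tTrain, archPressure) {
  const archPenalty = 1 - 0.5 * archPressure;                // assumed linear form
  const extrapPenalty = tEval > tTrain ? tTrain / tEval : 1; // applies only past T_train
  return niahPassRate(tEval, dHorizon) * archPenalty * extrapPenalty;
}

// Stand-in for a sweep over 1k / 4k / 16k / 64k / T_train (deduplicated, ascending).
function sweepSketch(dHorizon, tTrain, archPressure) {
  const lengths = [...new Set([1024, 4096, 16384, 65536, tTrain])].sort((a, b) => a - b);
  return lengths.map((tEval) => {
    const niah = niahPassRate(tEval, dHorizon);
    const reasoning = reasoningPassRate(tEval, dHorizon, tTrain, archPressure);
    return { tEval, niah, reasoning, gap: niah - reasoning };
  });
}

// Largest swept context whose reasoning rate stays at or above the threshold, else null.
function safeContext(rows) {
  const ok = rows.filter((r) => r.reasoning >= SAFE_REASONING);
  return ok.length ? ok[ok.length - 1].tEval : null;
}
```

Calling `safeContext(sweepSketch(dHorizon, tTrain, archPressure))` returns the largest grid point that still clears the threshold. A real implementation would presumably search between grid points rather than only on them.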

VIRTUAL SIMULATION
- Llama-3.1-8B → robust at 8k / 64k / 128k (matches RULER findings).
- Mistral-7B (SWA 4096) → robust at 4k, broken at 16k+ (SWA limit kicks in).
- Pythia-160m at 4× T_train → broken (extrapolation penalty caps gain past trained positions).

DRIFT CSS FIX
- #drift-section max-width 1100px (default 980px caused panel overflow).
- grid-template-columns: 1fr 1fr explicit (replaces auto-fit which overlapped).
- min-width: 0 + box-sizing on .drift-setup → fixes CSS Grid implicit auto min-width that caused overflow.
- Mobile <800px → single column.
- form-row inside drift-setup: flex-wrap, label width 110px, inputs flex 1 1 120px so values don't squish.

DOCUMENTATION
- Help modal v0.7 niah section in 4 langs.
- Inventory modal v0.7 card: 7th entry "🔍 NIAH→Reason".
- modes.tip → 14 modes in 4 langs.
- 681 i18n keys × 4 langs · 0 missing / 0 extra (49 new niah.* keys per lang).
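
A key-parity check like the "0 missing / 0 extra" count above could look roughly like this. It is a sketch that assumes `TRANSLATIONS` is an object of flat key-to-string tables, one per language; the language codes and the function name `checkKeyParity` are hypothetical and not taken from js/i18n.js.

```javascript
// Hypothetical parity check: compares every language table against a reference language.
// Assumes the shape { en: { "key": "text", ... }, es: { ... }, ... }.
function checkKeyParity(translations, referenceLang = "en") {
  const refKeys = new Set(Object.keys(translations[referenceLang]));
  const report = {};
  for (const [lang, table] of Object.entries(translations)) {
    const keys = new Set(Object.keys(table));
    report[lang] = {
      total: keys.size,
      missing: [...refKeys].filter((k) => !keys.has(k)), // in reference, absent here
      extra: [...keys].filter((k) => !refKeys.has(k)),   // present here, not in reference
    };
  }
  return report;
}

// Tiny made-up tables, only to show the shape of the report.
const demo = {
  en: { "niah.title": "NIAH → Reasoning Gap", "niah.run_btn": "Predict" },
  es: { "niah.title": "Gap NIAH → Reasoning" },
  fr: { "niah.title": "Écart NIAH → Reasoning", "niah.run_btn": "Prédire", "niah.old": "x" },
};
const parity = checkKeyParity(demo);
```

Running such a check in CI against the reference language would catch a forgotten `niah.*` key before it ships as an untranslated label.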

33/33 sim passes — including extrapolation penalty calibration on Pythia.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (5)
  1. index.html +30 -0
  2. js/i18n.js +208 -4
  3. js/main.js +160 -2
  4. js/niah_reasoning.js +138 -0
  5. style.css +27 -1
index.html CHANGED
@@ -210,6 +210,9 @@
210
  <p><strong data-i18n="help.v07.drift.title">🔀 Cross-framework Drift Bound</strong></p>
211
  <p data-i18n="help.v07.drift.body">Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the &quot;Load sample&quot; button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.</p>
212
 
 
 
 
213
  <h3 data-i18n="help.audit.title">The audit chain</h3>
214
  <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
215
  output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
@@ -317,6 +320,7 @@
317
  <li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
318
  <li data-i18n="inv.v07.quant"><strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo</li>
319
  <li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li>
 
320
  </ul>
321
  </details>
322
  </div>
@@ -372,6 +376,7 @@
372
  <button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
373
  <button class="mode-btn" data-mode="quant" role="tab" aria-selected="false" data-i18n="modes.quant">⚖️ Quant</button>
374
  <button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button>
 
375
  </div>
376
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
377
  <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
@@ -871,6 +876,31 @@
871
  <div id="drift-output" style="margin-top: 1em;"></div>
872
  </section>
873
 
874
  <!-- Recipe selector (mode=recipe) -->
875
  <section id="recipe-section" style="display:none;">
876
  <h2 data-i18n="recipe.title">📋 Recipe</h2>
 
210
  <p><strong data-i18n="help.v07.drift.title">🔀 Cross-framework Drift Bound</strong></p>
211
  <p data-i18n="help.v07.drift.body">Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the &quot;Load sample&quot; button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.</p>
212
 
213
+ <p><strong data-i18n="help.v07.niah.title">🔍 NIAH → Reasoning Gap</strong></p>
214
+ <p data-i18n="help.v07.niah.body">RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's "safe reasoning context" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. <em>Use case</em>: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.</p>
215
+
216
  <h3 data-i18n="help.audit.title">The audit chain</h3>
217
  <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
218
  output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
 
320
  <li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
321
  <li data-i18n="inv.v07.quant"><strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo</li>
322
  <li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li>
323
+ <li data-i18n="inv.v07.niah"><strong>🔍 NIAH→Reason</strong> — does your "128k context" actually reason there, or just retrieve?</li>
324
  </ul>
325
  </details>
326
  </div>
 
376
  <button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
377
  <button class="mode-btn" data-mode="quant" role="tab" aria-selected="false" data-i18n="modes.quant">⚖️ Quant</button>
378
  <button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button>
379
+ <button class="mode-btn" data-mode="niah" role="tab" aria-selected="false" data-i18n="modes.niah">🔍 NIAH→Reason</button>
380
  </div>
381
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
382
  <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
 
876
  <div id="drift-output" style="margin-top: 1em;"></div>
877
  </section>
878
 
879
+ <!-- NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7) -->
880
+ <section id="niah-section" style="display:none;">
881
+ <h2><span data-i18n="niah.title">🔍 NIAH → Reasoning Gap</span>
882
+ <span class="info"><span class="tooltip" data-i18n="niah.tip">
883
+ NIAH (Needle in a Haystack) tests retrieval: "find this fact in long text". Multi-hop reasoning tests inference: "combine facts X+Y at the start with fact Z at the end". RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.
884
+ </span></span>
885
+ </h2>
886
+ <p class="recipe-desc" data-i18n="niah.desc">
887
+ <strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a "safe context" where reasoning stays ≥65%.
888
+ </p>
889
+ <div class="form-row">
890
+ <label for="niah-id" data-i18n="niah.id_label">HF model id:</label>
891
+ <input type="text" id="niah-id" placeholder="e.g. meta-llama/Llama-3.1-8B-Instruct" />
892
+ <button type="button" id="niah-fetch-btn" data-i18n="niah.fetch_btn">📥 Fetch config</button>
893
+ </div>
894
+ <div class="form-row">
895
+ <label for="niah-teval" data-i18n="niah.teval_label">Target context (T_eval):</label>
896
+ <input type="number" id="niah-teval" min="512" step="1024" value="32768" />
897
+ <button type="button" id="niah-run-btn" data-i18n="niah.run_btn">🔍 Predict</button>
898
+ <button type="button" id="niah-sweep-btn" class="secondary" data-i18n="niah.sweep_btn">📊 Sweep contexts</button>
899
+ </div>
900
+ <p id="niah-status" class="recipe-desc" style="font-size:0.92em;"></p>
901
+ <div id="niah-output" style="margin-top: 1em;"></div>
902
+ </section>
903
+
904
  <!-- Recipe selector (mode=recipe) -->
905
  <section id="recipe-section" style="display:none;">
906
  <h2 data-i18n="recipe.title">📋 Recipe</h2>
js/i18n.js CHANGED
@@ -307,6 +307,9 @@ export const TRANSLATIONS = {
307
  "help.v07.drift.title": "🔀 Cross-framework Drift Bound",
308
  "help.v07.drift.body": "Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the &quot;Load sample&quot; button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.",
309
  "inv.v07.drift": "<strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals",
 
 
 
310
 
311
  // v0.7 — Inventory modal 5th card
312
  "inv.v07.title": "🆕 v0.7 anti-bullshit pack",
@@ -414,6 +417,54 @@ export const TRANSLATIONS = {
414
  "drift.status.empty_scores": "⚠ Enter both scores.",
415
  "drift.status.done": "✅ Verdict: {verdict}",
416
  "drift.status.sample_loaded": "✅ Sample loaded (canonical chat-template bug). Click Compute drift bound.",
417
  "share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
418
  "share.import_btn": "📂 Load shared JSON",
419
  "synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
@@ -506,7 +557,7 @@ export const TRANSLATIONS = {
506
  "common.no": "No",
507
 
508
  // Mode tooltips
509
- "modes.tip": "<strong>Thirteen ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.<br><strong>⚖️ Quant</strong>: predict γ-shift and ΔPPL for any (model × quant scheme); recommend safer alternative on cliff.<br><strong>🔀 Drift</strong>: same model, different scores on two setups — bug or noise? Predict numerical-noise band and flag real bugs.",
510
  "profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
511
  "compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
512
 
@@ -1145,6 +1196,9 @@ export const TRANSLATIONS = {
1145
  "help.v07.drift.title": "🔀 Cota de drift entre frameworks",
1146
  "help.v07.drift.body": "Mismo modelo, scores distintos en setups distintos. La herramienta predice el drift máximo admisible solo por ruido numérico (dtype, framework, batch). Si el gap observado lo excede → bug real, normalmente chat-template mismatch (issue #1841 de lm-eval-harness) o layout de KV-cache. Prueba el botón &quot;Cargar sample&quot; para el bug canónico de chat-template. <em>Caso de uso</em>: antes de reportar una regresión o reclamar reproducibilidad, verifica si el gap entre dos evals es mayor de lo que el ruido numérico puede explicar.",
1147
  "inv.v07.drift": "<strong>🔀 Drift</strong> — ¿bug o ruido? Predice el gap máximo admisible entre dos evals",
 
 
 
1148
 
1149
  // v0.7 — Inventory modal 5ª card
1150
  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
@@ -1252,6 +1306,54 @@ export const TRANSLATIONS = {
1252
  "drift.status.empty_scores": "⚠ Introduce ambos scores.",
1253
  "drift.status.done": "✅ Veredicto: {verdict}",
1254
  "drift.status.sample_loaded": "✅ Sample cargado (bug canónico de chat-template). Click en Calcular cota de drift.",
1255
  "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
1256
  "share.import_btn": "📂 Cargar JSON compartido",
1257
  "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
@@ -1344,7 +1446,7 @@ export const TRANSLATIONS = {
1344
  "common.no": "No",
1345
 
1346
  // Tooltips de modos
1347
- "modes.tip": "<strong>Trece formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.<br><strong>⚖️ Quant</strong>: predice γ-shift y ΔPPL para cualquier (modelo × esquema de cuantización); recomienda alternativa segura si hay cliff.<br><strong>🔀 Drift</strong>: mismo modelo, scores distintos en dos setups — ¿bug o ruido? Predice banda de ruido numérico y flagea bugs reales.",
1348
  "profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
1349
  "compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
1350
 
@@ -1847,6 +1949,9 @@ export const TRANSLATIONS = {
1847
  "help.v07.drift.title": "🔀 Borne de drift inter-frameworks",
1848
  "help.v07.drift.body": "Même modèle, scores différents sur setups différents. L'outil prédit le drift max admissible dû au seul bruit numérique (dtype, framework, batch). Si l'écart observé le dépasse → vrai bug, généralement chat-template mismatch (issue #1841 lm-eval-harness) ou layout KV-cache. Essayez le bouton &quot;Charger échantillon&quot; pour le bug chat-template canonique. <em>Cas d'usage</em> : avant de reporter une régression ou de revendiquer la reproductibilité, vérifiez si l'écart entre deux évals est plus grand que ce que le bruit numérique peut expliquer.",
1849
  "inv.v07.drift": "<strong>🔀 Drift</strong> — bug ou bruit ? Prédit l'écart max admissible entre deux évals",
 
 
 
1850
 
1851
  // v0.7 — Inventory modal 5ème card
1852
  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
@@ -1954,6 +2059,54 @@ export const TRANSLATIONS = {
1954
  "drift.status.empty_scores": "⚠ Saisissez les deux scores.",
1955
  "drift.status.done": "✅ Verdict : {verdict}",
1956
  "drift.status.sample_loaded": "✅ Échantillon chargé (bug chat-template canonique). Cliquez sur Calculer la borne de drift.",
1957
  "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
1958
  "share.import_btn": "📂 Charger JSON partagé",
1959
  "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
@@ -2046,7 +2199,7 @@ export const TRANSLATIONS = {
2046
  "common.no": "Non",
2047
 
2048
  // Tooltips des modes
2049
- "modes.tip": "<strong>Treize façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.<br><strong>⚖️ Quant</strong>: prédit γ-shift et ΔPPL pour tout (modèle × schéma de quantification) ; recommande une alternative sûre en cas de cliff.<br><strong>🔀 Drift</strong>: même modèle, scores différents sur deux setups — bug ou bruit ? Prédit la bande de bruit numérique et signale les vrais bugs.",
2050
  "profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
2051
  "compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
2052
 
@@ -2549,6 +2702,9 @@ export const TRANSLATIONS = {
2549
  "help.v07.drift.title": "🔀 跨框架 Drift 界",
2550
  "help.v07.drift.body": "同一模型,不同 setup 下分数不同。工具预测仅由数值噪声(dtype、framework、batch)允许的最大 drift。若观测差距超过它 → 真实 bug,通常是 chat-template mismatch(lm-eval-harness issue #1841)或 KV-cache 布局。试试 &quot;加载样本&quot; 按钮看典型的 chat-template bug。<em>用例</em>:在报告回归或声称可复现性之前,验证两个评估之间的差距是否大于数值噪声能解释的范围。",
2551
  "inv.v07.drift": "<strong>🔀 Drift</strong> — bug 还是噪声?预测两个评估间的最大可允许差距",
 
 
 
2552
 
2553
  // v0.7 — Inventory 模态第 5 卡
2554
  "inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
@@ -2656,6 +2812,54 @@ export const TRANSLATIONS = {
2656
  "drift.status.empty_scores": "⚠ 输入两个分数。",
2657
  "drift.status.done": "✅ 判定:{verdict}",
2658
  "drift.status.sample_loaded": "✅ 样本已加载(典型 chat-template bug)。点击计算 drift 界。",
2659
  "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
2660
  "share.import_btn": "📂 加载共享的 JSON",
2661
  "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
@@ -2748,7 +2952,7 @@ export const TRANSLATIONS = {
2748
  "common.no": "否",
2749
 
2750
  // 模式提示
2751
- "modes.tip": "<strong>十种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间;检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期,对 20+ benchmark 进行污染概率评估。<br><strong>⚖️ Quant</strong>: 预测任意(模型 × 量化方案)的 γ-shift 与 ΔPPL;cliff 时推荐更安全替代方案。<br><strong>🔀 Drift</strong>: 同一模型,两 setup 下分数不同 — bug 还是噪声?预测数值噪声区间并标记真实 bug。",
2752
  "profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
2753
  "compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
2754
 
 
307
  "help.v07.drift.title": "🔀 Cross-framework Drift Bound",
308
  "help.v07.drift.body": "Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the &quot;Load sample&quot; button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.",
309
  "inv.v07.drift": "<strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals",
310
+ "help.v07.niah.title": "🔍 NIAH → Reasoning Gap",
311
+ "help.v07.niah.body": "RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's \"safe reasoning context\" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. <em>Use case</em>: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.",
312
+ "inv.v07.niah": "<strong>🔍 NIAH→Reason</strong> — does your \"128k context\" actually reason there, or just retrieve?",
313
 
314
  // v0.7 — Inventory modal 5th card
315
  "inv.v07.title": "🆕 v0.7 anti-bullshit pack",
 
417
  "drift.status.empty_scores": "⚠ Enter both scores.",
418
  "drift.status.done": "✅ Verdict: {verdict}",
419
  "drift.status.sample_loaded": "✅ Sample loaded (canonical chat-template bug). Click Compute drift bound.",
420
+
421
+ // v0.7.6 — anti-bullshit pack #7: NIAH → reasoning gap predictor
422
+ "modes.niah": "🔍 NIAH→Reason",
423
+ "mode_desc.niah": "Predicts NIAH (retrieval) and multi-hop reasoning pass rates at any context. Solves: long-context models often pass NIAH but fail reasoning at the same context (RULER paper).",
424
+ "niah.title": "🔍 NIAH → Reasoning Gap",
425
+ "niah.tip": "NIAH (Needle in a Haystack) tests retrieval: 'find this fact in long text'. Multi-hop reasoning tests inference: 'combine facts X+Y at the start with fact Z at the end'. RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.",
426
+ "niah.desc": "<strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a 'safe context' where reasoning stays ≥65%.",
427
+ "niah.id_label": "HF model id:",
428
+ "niah.fetch_btn": "📥 Fetch config",
429
+ "niah.teval_label": "Target context (T_eval):",
430
+ "niah.run_btn": "🔍 Predict",
431
+ "niah.sweep_btn": "📊 Sweep contexts",
432
+ "niah.label.niah": "NIAH pass rate",
433
+ "niah.label.reasoning": "Reasoning pass rate",
434
+ "niah.label.gap": "Gap",
435
+ "niah.label.safe_ctx": "Safe reasoning context",
436
+ "niah.section.breakdown": "Architecture breakdown",
437
+ "niah.section.reco": "Recommendation",
438
+ "niah.section.sweep": "Pass rate sweep across context lengths",
439
+ "niah.field.dhorizon": "d_horizon (effective)",
440
+ "niah.field.ratio": "T_eval / d_horizon",
441
+ "niah.field.arch_pressure": "Arch pressure (small d_head + GQA + SWA)",
442
+ "niah.field.theta": "RoPE θ",
443
+ "niah.field.t_train": "T_train (claimed)",
444
+ "niah.col.context": "T_eval",
445
+ "niah.col.niah": "NIAH",
446
+ "niah.col.reasoning": "Reasoning",
447
+ "niah.col.gap": "Gap",
448
+ "niah.col.verdict": "Verdict",
449
+ "niah.verdict.robust": "✅ ROBUST",
450
+ "niah.verdict.marginal": "⚠ MARGINAL",
451
+ "niah.verdict.degraded": "⚠ DEGRADED",
452
+ "niah.verdict.retrieval_only": "❌ RETRIEVAL-ONLY",
453
+ "niah.verdict.broken": "❌ BROKEN",
454
+ "niah.reco.robust": "Both retrieval and reasoning hold up at this context. Safe to deploy for both lookup and inference tasks.",
455
+ "niah.reco.marginal": "Borderline. Retrieval works but reasoning is shaky. Use for fact-lookup, not multi-step inference.",
456
+ "niah.reco.degraded": "Significant reasoning drop. The model can find facts but struggles to combine them. Avoid multi-hop tasks at this length.",
457
+ "niah.reco.retrieval_only": "Canonical RULER finding: model passes NIAH but fails reasoning. Useful for retrieval-augmented setups (where the LLM only locates facts) but NOT for chained inference. Cut your context to the 'safe' value below.",
458
+ "niah.reco.broken": "Model fails even basic retrieval at this context. Treat as out-of-distribution — re-test at a shorter context.",
459
+ "niah.safe_context": "≤ {ctx} tokens (reasoning ≥ 65%)",
460
+ "niah.safe_context_none": "No safe context found below your target — model fails reasoning even at small contexts.",
461
+ "niah.summary.sweep": "<code>{modelId}</code> — pass rates by context",
462
+ "niah.status.empty_id": "⚠ Enter a model id (e.g. meta-llama/Llama-3.1-8B-Instruct).",
463
+ "niah.status.bad_teval": "⚠ Enter a target context (≥ 512 tokens).",
464
+ "niah.status.fetching": "⏳ Fetching config.json for {modelId}...",
465
+ "niah.status.fetched": "✅ Config fetched for {modelId}. Set T_eval and click Predict (or Sweep contexts).",
466
+ "niah.status.done": "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
467
+ "niah.status.sweep_done": "✅ Swept {n} context lengths.",
468
  "share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
469
  "share.import_btn": "📂 Load shared JSON",
470
  "synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
 
557
  "common.no": "No",
558
 
559
  // Mode tooltips
560
+ "modes.tip": "<strong>Fourteen ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.<br><strong>⚖️ Quant</strong>: predict γ-shift and ΔPPL for any (model × quant scheme); recommend safer alternative on cliff.<br><strong>🔀 Drift</strong>: same model, different scores on two setups — bug or noise? Predict numerical-noise band and flag real bugs.<br><strong>🔍 NIAH→Reason</strong>: predict NIAH and multi-hop reasoning pass rates from architecture; find your model's safe reasoning context.",
561
  "profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
562
  "compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
563
 
 
1196
  "help.v07.drift.title": "🔀 Cota de drift entre frameworks",
1197
  "help.v07.drift.body": "Mismo modelo, scores distintos en setups distintos. La herramienta predice el drift máximo admisible solo por ruido numérico (dtype, framework, batch). Si el gap observado lo excede → bug real, normalmente chat-template mismatch (issue #1841 de lm-eval-harness) o layout de KV-cache. Prueba el botón &quot;Cargar sample&quot; para el bug canónico de chat-template. <em>Caso de uso</em>: antes de reportar una regresión o reclamar reproducibilidad, verifica si el gap entre dos evals es mayor de lo que el ruido numérico puede explicar.",
1198
  "inv.v07.drift": "<strong>🔀 Drift</strong> — ¿bug o ruido? Predice el gap máximo admisible entre dos evals",
1199
+ "help.v07.niah.title": "🔍 Gap NIAH → Reasoning",
1200
+ "help.v07.niah.body": "El paper RULER (NVIDIA 2024) muestra que modelos long-context a menudo pasan NIAH (retrieval de needle) pero fallan reasoning multi-hop al mismo contexto. La herramienta predice ambas tasas de pass desde la arquitectura (γ_Padé + d_horizon + presión arq: d_head pequeño, GQA, SWA), reporta el gap, y encuentra el \"contexto seguro de reasoning\" donde reasoning se mantiene ≥65%. Modo barrido muestra la curva a 1k/4k/16k/64k/T_train. <em>Caso de uso</em>: antes de desplegar al contexto declarado, descubre si el modelo realmente razonará ahí o solo encontrará.",
1201
+ "inv.v07.niah": "<strong>🔍 NIAH→Reason</strong> — ¿tu \"128k\" realmente razona ahí, o solo encuentra?",
1202
 
1203
  // v0.7 — Inventory modal 5ª card
1204
  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
 
1306
  "drift.status.empty_scores": "⚠ Introduce ambos scores.",
1307
  "drift.status.done": "✅ Veredicto: {verdict}",
1308
  "drift.status.sample_loaded": "✅ Sample cargado (bug canónico de chat-template). Click en Calcular cota de drift.",
1309
+
1310
+ // v0.7.6 — anti-bullshit pack #7: NIAH → predictor de gap de reasoning
1311
+ "modes.niah": "🔍 NIAH→Reason",
1312
+ "mode_desc.niah": "Predice tasas de pass de NIAH (retrieval) y reasoning multi-hop a cualquier contexto. Resuelve: modelos long-context pasan NIAH pero fallan reasoning al mismo contexto (paper RULER).",
1313
+ "niah.title": "🔍 Gap NIAH → Reasoning",
1314
+ "niah.tip": "NIAH (Needle in a Haystack) testea retrieval: 'encuentra este hecho en texto largo'. Reasoning multi-hop testea inferencia: 'combina hechos X+Y del principio con hecho Z del final'. El paper RULER (NVIDIA 2024) muestra que modelos long-context a menudo pasan NIAH pero fallan reasoning al mismo contexto. Esta herramienta predice ambas tasas desde la arquitectura sola.",
1315
+ "niah.desc": "<strong>Tu modelo dice 128k de contexto. ¿Razonará realmente a 64k, o solo encontrará?</strong> Pega un model id HF y un contexto objetivo — la herramienta predice tasas de pass NIAH y reasoning multi-hop, el gap, y un 'contexto seguro' donde reasoning se mantiene ≥65%.",
1316
+ "niah.id_label": "ID modelo HF:",
1317
+ "niah.fetch_btn": "📥 Fetch config",
1318
+ "niah.teval_label": "Contexto objetivo (T_eval):",
1319
+ "niah.run_btn": "🔍 Predecir",
1320
+ "niah.sweep_btn": "📊 Barrer contextos",
1321
+ "niah.label.niah": "Tasa pass NIAH",
1322
+ "niah.label.reasoning": "Tasa pass Reasoning",
1323
+ "niah.label.gap": "Gap",
1324
+ "niah.label.safe_ctx": "Contexto seguro de reasoning",
1325
+ "niah.section.breakdown": "Desglose arquitectónico",
1326
+ "niah.section.reco": "Recomendación",
1327
+ "niah.section.sweep": "Barrido de tasas pass por longitud de contexto",
1328
+ "niah.field.dhorizon": "d_horizon (efectivo)",
1329
+ "niah.field.ratio": "T_eval / d_horizon",
1330
+ "niah.field.arch_pressure": "Presión arq (d_head pequeño + GQA + SWA)",
1331
+ "niah.field.theta": "RoPE θ",
1332
+ "niah.field.t_train": "T_train (declarado)",
1333
+ "niah.col.context": "T_eval",
1334
+ "niah.col.niah": "NIAH",
1335
+ "niah.col.reasoning": "Reasoning",
1336
+ "niah.col.gap": "Gap",
1337
+ "niah.col.verdict": "Veredicto",
1338
+ "niah.verdict.robust": "✅ ROBUSTO",
1339
+ "niah.verdict.marginal": "⚠ MARGINAL",
1340
+ "niah.verdict.degraded": "⚠ DEGRADADO",
1341
+ "niah.verdict.retrieval_only": "❌ SOLO RETRIEVAL",
1342
+ "niah.verdict.broken": "❌ ROTO",
1343
+ "niah.reco.robust": "Tanto retrieval como reasoning aguantan a este contexto. Seguro para desplegar tareas de lookup e inferencia.",
1344
+ "niah.reco.marginal": "Borderline. Retrieval funciona pero reasoning está flojo. Úsalo para lookup, no para inferencia multi-paso.",
1345
+ "niah.reco.degraded": "Caída significativa de reasoning. El modelo encuentra hechos pero le cuesta combinarlos. Evita tareas multi-hop a esta longitud.",
1346
+ "niah.reco.retrieval_only": "Hallazgo canónico de RULER: el modelo pasa NIAH pero falla reasoning. Útil para setups RAG (donde el LLM solo localiza hechos) pero NO para inferencia encadenada. Reduce tu contexto al valor 'seguro' de abajo.",
1347
+ "niah.reco.broken": "El modelo falla incluso retrieval básico a este contexto. Trátalo como out-of-distribution — re-testea a contexto más corto.",
1348
+ "niah.safe_context": "≤ {ctx} tokens (reasoning ≥ 65%)",
1349
+ "niah.safe_context_none": "No se encontró contexto seguro bajo tu objetivo — el modelo falla reasoning incluso a contextos pequeños.",
1350
+ "niah.summary.sweep": "<code>{modelId}</code> — tasas pass por contexto",
1351
+ "niah.status.empty_id": "⚠ Introduce un model id (ej. meta-llama/Llama-3.1-8B-Instruct).",
1352
+ "niah.status.bad_teval": "⚠ Introduce un contexto objetivo (≥ 512 tokens).",
1353
+ "niah.status.fetching": "⏳ Obteniendo config.json para {modelId}...",
1354
+ "niah.status.fetched": "✅ Config obtenido para {modelId}. Pon T_eval y click Predecir (o Barrer contextos).",
1355
+ "niah.status.done": "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
1356
+ "niah.status.sweep_done": "✅ Barridos {n} largos de contexto.",
1357
  "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
1358
  "share.import_btn": "📂 Cargar JSON compartido",
1359
  "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
 
1446
  "common.no": "No",
1447
 
1448
  // Tooltips de modos
1449
+ "modes.tip": "<strong>Catorce formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.<br><strong>⚖️ Quant</strong>: predice γ-shift y ΔPPL para cualquier (modelo × esquema de cuantización); recomienda alternativa segura si hay cliff.<br><strong>🔀 Drift</strong>: mismo modelo, scores distintos en dos setups — ¿bug o ruido? Predice banda de ruido numérico y flagea bugs reales.<br><strong>🔍 NIAH→Reason</strong>: predice tasas pass NIAH y reasoning multi-hop desde arquitectura; encuentra el contexto seguro de reasoning.",
1450
  "profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
1451
  "compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
1452
 
 
1949
  "help.v07.drift.title": "🔀 Borne de drift inter-frameworks",
1950
  "help.v07.drift.body": "Même modèle, scores différents sur setups différents. L'outil prédit le drift max admissible dû au seul bruit numérique (dtype, framework, batch). Si l'écart observé le dépasse → vrai bug, généralement chat-template mismatch (issue #1841 lm-eval-harness) ou layout KV-cache. Essayez le bouton &quot;Charger échantillon&quot; pour le bug chat-template canonique. <em>Cas d'usage</em> : avant de reporter une régression ou de revendiquer la reproductibilité, vérifiez si l'écart entre deux évals est plus grand que ce que le bruit numérique peut expliquer.",
1951
  "inv.v07.drift": "<strong>🔀 Drift</strong> — bug ou bruit ? Prédit l'écart max admissible entre deux évals",
1952
+ "help.v07.niah.title": "🔍 Gap NIAH → Reasoning",
1953
+ "help.v07.niah.body": "Le paper RULER (NVIDIA 2024) montre que les modèles long-context passent souvent NIAH (retrieval de needle) mais échouent au reasoning multi-hop au même contexte. L'outil prédit les deux taux de réussite à partir de l'architecture (γ_Padé + d_horizon + pression arch : petit d_head, GQA, SWA), reporte le gap, et trouve le \"contexte sûr pour reasoning\" où le reasoning reste ≥65%. Mode balayage montre la courbe à 1k/4k/16k/64k/T_train. <em>Cas d'usage</em> : avant de déployer au contexte revendiqué, découvrez si le modèle va vraiment raisonner là ou seulement retrouver.",
1954
+ "inv.v07.niah": "<strong>🔍 NIAH→Reason</strong> — votre \"128k\" raisonne-t-il vraiment là, ou seulement retrouve ?",
1955
 
1956
  // v0.7 — Inventory modal 5ème card
1957
  "inv.v07.title": "🆕 Pack anti-bullshit v0.7",
 
2059
  "drift.status.empty_scores": "⚠ Saisissez les deux scores.",
2060
  "drift.status.done": "✅ Verdict : {verdict}",
2061
  "drift.status.sample_loaded": "✅ Échantillon chargé (bug chat-template canonique). Cliquez sur Calculer la borne de drift.",
2062
+
2063
+ // v0.7.6 — anti-bullshit pack #7: prédicteur de gap NIAH → reasoning
2064
+ "modes.niah": "🔍 NIAH→Reason",
2065
+ "mode_desc.niah": "Prédit les taux de réussite NIAH (retrieval) et reasoning multi-hop à n'importe quel contexte. Résout : les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte (paper RULER).",
2066
+ "niah.title": "🔍 Gap NIAH → Reasoning",
2067
+ "niah.tip": "NIAH (Needle in a Haystack) teste le retrieval : 'trouve ce fait dans un long texte'. Le reasoning multi-hop teste l'inférence : 'combine les faits X+Y au début avec le fait Z à la fin'. Le paper RULER (NVIDIA 2024) montre que les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte. Cet outil prédit les deux taux à partir de la seule architecture.",
2068
+ "niah.desc": "<strong>Votre modèle revendique 128k de contexte. Va-t-il vraiment raisonner à 64k, ou seulement retrouver ?</strong> Collez un model id HF et un contexte cible — l'outil prédit les taux de réussite NIAH et reasoning multi-hop, le gap, et un 'contexte sûr' où le reasoning reste ≥65%.",
2069
+ "niah.id_label": "ID modèle HF :",
2070
+ "niah.fetch_btn": "📥 Récupérer config",
2071
+ "niah.teval_label": "Contexte cible (T_eval) :",
2072
+ "niah.run_btn": "🔍 Prédire",
2073
+ "niah.sweep_btn": "📊 Balayer les contextes",
2074
+ "niah.label.niah": "Taux NIAH",
2075
+ "niah.label.reasoning": "Taux Reasoning",
2076
+ "niah.label.gap": "Gap",
2077
+ "niah.label.safe_ctx": "Contexte sûr pour reasoning",
2078
+ "niah.section.breakdown": "Détail architectural",
2079
+ "niah.section.reco": "Recommandation",
2080
+ "niah.section.sweep": "Balayage des taux par longueur de contexte",
2081
+ "niah.field.dhorizon": "d_horizon (effectif)",
2082
+ "niah.field.ratio": "T_eval / d_horizon",
2083
+ "niah.field.arch_pressure": "Pression arch (petit d_head + GQA + SWA)",
2084
+ "niah.field.theta": "RoPE θ",
2085
+ "niah.field.t_train": "T_train (revendiqué)",
2086
+ "niah.col.context": "T_eval",
2087
+ "niah.col.niah": "NIAH",
2088
+ "niah.col.reasoning": "Reasoning",
2089
+ "niah.col.gap": "Gap",
2090
+ "niah.col.verdict": "Verdict",
2091
+ "niah.verdict.robust": "✅ ROBUSTE",
2092
+ "niah.verdict.marginal": "⚠ MARGINAL",
2093
+ "niah.verdict.degraded": "⚠ DÉGRADÉ",
2094
+ "niah.verdict.retrieval_only": "❌ RETRIEVAL UNIQUEMENT",
2095
+ "niah.verdict.broken": "❌ CASSÉ",
2096
+ "niah.reco.robust": "Retrieval et reasoning tiennent tous deux à ce contexte. Sûr de déployer pour les tâches de lookup et d'inférence.",
2097
+ "niah.reco.marginal": "Borderline. Le retrieval fonctionne mais le reasoning est fragile. À utiliser pour le lookup, pas pour l'inférence multi-étapes.",
2098
+ "niah.reco.degraded": "Chute significative du reasoning. Le modèle trouve des faits mais peine à les combiner. Évitez les tâches multi-hop à cette longueur.",
2099
+ "niah.reco.retrieval_only": "Constat canonique de RULER : le modèle passe NIAH mais échoue au reasoning. Utile pour les setups RAG (où le LLM ne fait que localiser les faits) mais PAS pour l'inférence chaînée. Réduisez votre contexte à la valeur 'sûre' ci-dessous.",
2100
+ "niah.reco.broken": "Le modèle échoue même au retrieval basique à ce contexte. Traitez-le comme hors-distribution — re-testez à un contexte plus court.",
2101
+ "niah.safe_context": "≤ {ctx} tokens (reasoning ≥ 65%)",
2102
+ "niah.safe_context_none": "Aucun contexte sûr trouvé sous votre cible — le modèle échoue au reasoning même à de petits contextes.",
2103
+ "niah.summary.sweep": "<code>{modelId}</code> — taux par contexte",
2104
+ "niah.status.empty_id": "⚠ Saisissez un model id (ex. meta-llama/Llama-3.1-8B-Instruct).",
2105
+ "niah.status.bad_teval": "⚠ Saisissez un contexte cible (≥ 512 tokens).",
2106
+ "niah.status.fetching": "⏳ Récupération config.json pour {modelId}...",
2107
+ "niah.status.fetched": "✅ Config récupéré pour {modelId}. Réglez T_eval et cliquez Prédire (ou Balayer les contextes).",
2108
+ "niah.status.done": "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
2109
+ "niah.status.sweep_done": "✅ Balayé {n} longueurs de contexte.",
2110
  "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
2111
  "share.import_btn": "📂 Charger JSON partagé",
2112
  "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
 
2199
  "common.no": "Non",
2200
 
2201
  // Tooltips des modes
2202
+ "modes.tip": "<strong>Quatorze façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.<br><strong>⚖️ Quant</strong>: prédit γ-shift et ΔPPL pour tout (modèle × schéma de quantification) ; recommande une alternative sûre en cas de cliff.<br><strong>🔀 Drift</strong>: même modèle, scores différents sur deux setups — bug ou bruit ? Prédit la bande de bruit numérique et signale les vrais bugs.<br><strong>🔍 NIAH→Reason</strong>: prédit les taux NIAH et reasoning multi-hop depuis l'architecture ; trouve le contexte sûr pour reasoning.",
2203
  "profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
2204
  "compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
2205
 
 
2702
  "help.v07.drift.title": "🔀 跨框架 Drift 界",
2703
  "help.v07.drift.body": "同一模型,不同 setup 下分数不同。工具预测仅由数值噪声(dtype、framework、batch)允许的最大 drift。若观测差距超过它 → 真实 bug,通常是 chat-template mismatch(lm-eval-harness issue #1841)或 KV-cache 布局。试试 &quot;加载样本&quot; 按钮看典型的 chat-template bug。<em>用例</em>:在报告回归或声称可复现性之前,验证两个评估之间的差距是否大于数值噪声能解释的范围。",
2704
  "inv.v07.drift": "<strong>🔀 Drift</strong> — bug 还是噪声?预测两个评估间的最大可允许差距",
2705
+ "help.v07.niah.title": "🔍 NIAH → Reasoning Gap",
2706
+ "help.v07.niah.body": "RULER 论文(NVIDIA 2024)显示长上下文模型经常通过 NIAH(needle 检索)但在相同上下文上多跳 reasoning 失败。工具仅根据架构(γ_Padé + d_horizon + 架构压力:小 d_head、GQA、SWA)预测两种通过率,报告 gap,并找到模型 reasoning 保持 ≥65% 的\"安全 reasoning 上下文\"。扫描模式显示在 1k/4k/16k/64k/T_train 的曲线。<em>用例</em>:在声称的上下文部署之前,搞清楚模型是真的能在那里 reasoning 还是只能检索。",
2707
+ "inv.v07.niah": "<strong>🔍 NIAH→Reason</strong> — 你的\"128k 上下文\"真的能在那里 reasoning,还是只能检索?",
2708
 
2709
  // v0.7 — Inventory 模态第 5 卡
2710
  "inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
 
2812
  "drift.status.empty_scores": "⚠ 输入两个分数。",
2813
  "drift.status.done": "✅ 判定:{verdict}",
2814
  "drift.status.sample_loaded": "✅ 样本已加载(典型 chat-template bug)。点击计算 drift 界。",
2815
+
2816
+ // v0.7.6 — anti-bullshit pack #7: NIAH → reasoning gap 预测器
2817
+ "modes.niah": "🔍 NIAH→Reason",
2818
+ "mode_desc.niah": "在任意上下文下预测 NIAH(检索)与多跳 reasoning 通过率。解决:长上下文模型常常通过 NIAH 但在同一上下文上 reasoning 失败(RULER 论文)。",
2819
+ "niah.title": "🔍 NIAH → Reasoning Gap",
2820
+ "niah.tip": "NIAH(Needle in a Haystack)测试检索:\"在长文本中找到这个事实\"。多跳 reasoning 测试推理:\"把开头的事实 X+Y 与结尾的事实 Z 结合\"。RULER 论文(NVIDIA 2024)显示长上下文模型经常通过 NIAH 但在相同上下文上 reasoning 失败。本工具仅根据架构预测两种通过率。",
2821
+ "niah.desc": "<strong>你的模型声称 128k 上下文。它在 64k 是真的能 reasoning,还是只能检索?</strong>粘贴 HF 模型 id 和目标 eval 上下文 — 工具预测 NIAH 与多跳 reasoning 通过率、gap,以及 reasoning 保持 ≥65% 的 \"安全上下文\"。",
2822
+ "niah.id_label": "HF 模型 id:",
2823
+ "niah.fetch_btn": "📥 获取 config",
2824
+ "niah.teval_label": "目标上下文 (T_eval):",
2825
+ "niah.run_btn": "🔍 预测",
2826
+ "niah.sweep_btn": "📊 扫描上下文",
2827
+ "niah.label.niah": "NIAH 通过率",
2828
+ "niah.label.reasoning": "Reasoning 通过率",
2829
+ "niah.label.gap": "Gap",
2830
+ "niah.label.safe_ctx": "Reasoning 安全上下文",
2831
+ "niah.section.breakdown": "架构细节",
2832
+ "niah.section.reco": "建议",
2833
+ "niah.section.sweep": "按上下文长度扫描通过率",
2834
+ "niah.field.dhorizon": "d_horizon(有效)",
2835
+ "niah.field.ratio": "T_eval / d_horizon",
2836
+ "niah.field.arch_pressure": "架构压力(小 d_head + GQA + SWA)",
2837
+ "niah.field.theta": "RoPE θ",
2838
+ "niah.field.t_train": "T_train(声称)",
2839
+ "niah.col.context": "T_eval",
2840
+ "niah.col.niah": "NIAH",
2841
+ "niah.col.reasoning": "Reasoning",
2842
+ "niah.col.gap": "Gap",
2843
+ "niah.col.verdict": "判定",
2844
+ "niah.verdict.robust": "✅ 稳健",
2845
+ "niah.verdict.marginal": "⚠ 边缘",
2846
+ "niah.verdict.degraded": "⚠ 退化",
2847
+ "niah.verdict.retrieval_only": "❌ 仅检索",
2848
+ "niah.verdict.broken": "❌ 失效",
2849
+ "niah.reco.robust": "在此上下文下检索与 reasoning 都稳定。可安全部署用于查询和推理任务。",
2850
+ "niah.reco.marginal": "边缘。检索可用但 reasoning 不稳。用于事实查询,不要用于多步推理。",
2851
+ "niah.reco.degraded": "Reasoning 显著下降。模型能找到事实但难以组合它们。在此长度下避免多跳任务。",
2852
+ "niah.reco.retrieval_only": "RULER 的典型发现:模型通过 NIAH 但 reasoning 失败。适用于 RAG 设置(LLM 仅定位事实),不适用于链式推理。把上下文降到下方的 \"安全\" 值。",
2853
+ "niah.reco.broken": "在此上下文下模型连基本检索都失败。视为 out-of-distribution — 在更短上下文重测。",
2854
+ "niah.safe_context": "≤ {ctx} tokens(reasoning ≥ 65%)",
2855
+ "niah.safe_context_none": "在你的目标以下没找到安全上下文 — 模型即使在小上下文也 reasoning 失败。",
2856
+ "niah.summary.sweep": "<code>{modelId}</code> — 按上下文的通过率",
2857
+ "niah.status.empty_id": "⚠ 输入 model id(例如 meta-llama/Llama-3.1-8B-Instruct)。",
2858
+ "niah.status.bad_teval": "⚠ 输入目标上下文(≥ 512 tokens)。",
2859
+ "niah.status.fetching": "⏳ 正在获取 {modelId} 的 config.json...",
2860
+ "niah.status.fetched": "✅ 已获取 {modelId} 的 config。设置 T_eval 并点击预测(或扫描上下文)。",
2861
+ "niah.status.done": "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
2862
+ "niah.status.sweep_done": "✅ 已扫描 {n} 个上下文长度。",
2863
  "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
2864
  "share.import_btn": "📂 加载共享的 JSON",
2865
  "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
 
2952
  "common.no": "否",
2953
 
2954
  // 模式提示
2955
+ "modes.tip": "<strong>十四种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间;检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期,对 20+ benchmark 进行污染概率评估。<br><strong>⚖️ Quant</strong>: 预测任意(模型 × 量化方案)的 γ-shift 与 ΔPPL;cliff 时推荐更安全替代方案。<br><strong>🔀 Drift</strong>: 同一模型,两 setup 下分数不同 — bug 还是噪声?预测数值噪声区间并标记真实 bug。<br><strong>🔍 NIAH→Reason</strong>: 从架构预测 NIAH 与多跳 reasoning 通过率;找到模型的安全 reasoning 上下文。",
2956
  "profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
2957
  "compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
2958
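The added `niah.*` strings carry `{name}` placeholders (`{modelId}`, `{verdict}`, `{niah}`, `{reasoning}`, `{ctx}`, `{n}`) that `tFmt` fills in at render time. The `tFmt` implementation is not part of this diff, so the following is only a minimal sketch of how such a substitution might work. `formatTemplate` is a hypothetical name, and leaving unknown placeholders untouched is an assumption, not confirmed behavior of the tool.

```javascript
// Hypothetical stand-in for the tool's tFmt helper (not shown in this diff).
// Replaces each {key} in the template with params[key]; placeholders with no
// matching key are left as-is so a missing parameter stays visible.
function formatTemplate(template, params = {}) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(params, key) ? String(params[key]) : match
  );
}

// Usage with the "niah.safe_context" string from this commit.
const safeContextText = formatTemplate("≤ {ctx} tokens (reasoning ≥ 65%)", { ctx: 16384 });
console.log(safeContextText); // "≤ 16384 tokens (reasoning ≥ 65%)"
```

Because several of these strings are later assigned through `innerHTML` (for example `niah.summary.sweep` wraps `{modelId}` in `<code>`), any user-supplied value would presumably need HTML-escaping before substitution.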
 
js/main.js CHANGED
@@ -18,6 +18,7 @@ import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";
18
  import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
19
  import { attachAllHfAutocompletes } from "./hf_autocomplete.js";
20
  import { computeDriftBound, FRAMEWORKS as DRIFT_FRAMEWORKS, DTYPES as DRIFT_DTYPES } from "./cross_drift.js";
 
21
 
22
  // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
23
  // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -198,7 +199,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
198
  "profile-section", "compare-section", "inspector-section",
199
  "diagnose-section", "phase-section", "unmask-section",
200
  "template-section", "arena-section", "contam-section",
201
- "quant-section", "drift-section"].forEach(id => {
202
  const el = $(id);
203
  if (el) el.style.display = "none";
204
  });
@@ -208,7 +209,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
208
  compare: "compare-section", inspector: "inspector-section",
209
  diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
210
  template: "template-section", arena: "arena-section", contam: "contam-section",
211
- quant: "quant-section", drift: "drift-section",
212
  };
213
  const sectionId = sectionMap[mode];
214
  if (sectionId) $(sectionId).style.display = "";
@@ -1315,6 +1316,163 @@ populateDriftDropdowns();
1315
  $("drift-run-btn")?.addEventListener("click", runDriftCompute);
1316
  $("drift-sample-btn")?.addEventListener("click", loadDriftSample);
1317
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1318
  function configToPreset(cfg, modelId) {
1319
  const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
1320
  const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
 
18
  import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
19
  import { attachAllHfAutocompletes } from "./hf_autocomplete.js";
20
  import { computeDriftBound, FRAMEWORKS as DRIFT_FRAMEWORKS, DTYPES as DRIFT_DTYPES } from "./cross_drift.js";
21
+ import { predictNIAHReasoning, sweepContextLengths } from "./niah_reasoning.js";
22
 
23
  // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
24
  // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
 
199
  "profile-section", "compare-section", "inspector-section",
200
  "diagnose-section", "phase-section", "unmask-section",
201
  "template-section", "arena-section", "contam-section",
202
+ "quant-section", "drift-section", "niah-section"].forEach(id => {
203
  const el = $(id);
204
  if (el) el.style.display = "none";
205
  });
 
209
  compare: "compare-section", inspector: "inspector-section",
210
  diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
211
  template: "template-section", arena: "arena-section", contam: "contam-section",
212
+ quant: "quant-section", drift: "drift-section", niah: "niah-section",
213
  };
214
  const sectionId = sectionMap[mode];
215
  if (sectionId) $(sectionId).style.display = "";
 
1316
  $("drift-run-btn")?.addEventListener("click", runDriftCompute);
1317
  $("drift-sample-btn")?.addEventListener("click", loadDriftSample);
1318
 
+ // ════════════════════════════════════════════════════════════════════
+ // 🔍 NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7)
+ // ════════════════════════════════════════════════════════════════════
+
+ const NIAH_VERDICT_COLOR = {
+   robust: "#3fb950",
+   marginal: "#f1c40f",
+   degraded: "#f1c40f",
+   retrieval_only: "#f85149",
+   broken: "#f85149",
+ };
+
+ let __niahLastConfig = null;
+ let __niahLastModelId = null;
+
+ async function niahFetchConfig() {
+   const modelId = ($("niah-id").value || "").trim();
+   if (!modelId) {
+     $("niah-status").textContent = t("niah.status.empty_id") || "⚠ Enter a model id.";
+     return null;
+   }
+   $("niah-status").textContent = tFmt("niah.status.fetching", { modelId });
+   $("niah-fetch-btn").disabled = true;
+   try {
+     const cfg = await fetchHfConfig(modelId);
+     __niahLastConfig = cfg;
+     __niahLastModelId = modelId;
+     $("niah-status").textContent = tFmt("niah.status.fetched", { modelId });
+     return cfg;
+   } catch (err) {
+     if (err.code === "gated") {
+       $("niah-status").innerHTML = `🔒 <strong>${err.modelId}</strong> ${t("hf_auto.gated_msg") || "is gated. Accept the license here:"} <a href="https://huggingface.co/${err.modelId}" target="_blank" rel="noopener">huggingface.co/${err.modelId}</a>`;
+     } else {
+       $("niah-status").textContent = `❌ ${err.message}`;
+     }
+     return null;
+   } finally {
+     $("niah-fetch-btn").disabled = false;
+   }
+ }
+
+ function renderNIAHCard(result, modelId) {
+   const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+     ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
+   const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
+   const color = NIAH_VERDICT_COLOR[result.verdict] || "#8b949e";
+   const verdictLabel = t(`niah.verdict.${result.verdict}`) || result.verdict;
+   const reco = t(`niah.reco.${result.verdict}`) || "";
+   const safeText = result.safe_context
+     ? tFmt("niah.safe_context", { ctx: result.safe_context })
+     : (t("niah.safe_context_none") || "No safe context found below your target — model fails reasoning even at small contexts.");
+
+   return `
+     <div class="unmask-result">
+       <div class="unmask-hero" style="border-color: ${color};">
+         <div class="unmask-verdict" style="color: ${color};">${verdictLabel}</div>
+         <div class="unmask-model"><code>${escapeHtml(modelId)}</code> @ <code>${fmtN(result.T_eval)}</code> tokens</div>
+         <div class="unmask-numbers">
+           <div><span class="unmask-num-label">${t("niah.label.niah") || "NIAH pass rate"}</span><span class="unmask-num-val">${(result.niah_rate * 100).toFixed(0)}%</span></div>
+           <div><span class="unmask-num-label">${t("niah.label.reasoning") || "Reasoning pass rate"}</span><span class="unmask-num-val">${(result.reasoning_rate * 100).toFixed(0)}%</span></div>
+           <div><span class="unmask-num-label">${t("niah.label.gap") || "Gap"}</span><span class="unmask-num-val">${(result.gap * 100).toFixed(0)} pts</span></div>
+         </div>
+       </div>
+       <div class="unmask-details">
+         <details class="unmask-panel" open>
+           <summary class="unmask-panel-title">${t("niah.section.breakdown") || "Architecture breakdown"}</summary>
+           <ul>
+             <li><strong>γ_Padé @ T_eval:</strong> ${result.gamma_pade}</li>
+             <li><strong>${t("niah.field.dhorizon") || "d_horizon (effective)"}:</strong> ${fmtN(result.d_horizon)} tokens</li>
+             <li><strong>${t("niah.field.ratio") || "T_eval / d_horizon"}:</strong> ${result.horizon_ratio}×</li>
+             <li><strong>${t("niah.field.arch_pressure") || "Arch pressure (small d_head + GQA + SWA)"}:</strong> ×${result.arch_pressure}</li>
+             <li><strong>${t("niah.field.theta") || "RoPE θ"}:</strong> ${fmtN(result.theta)}</li>
+             <li><strong>${t("niah.field.t_train") || "T_train (claimed)"}:</strong> ${fmtN(result.T_train)}</li>
+           </ul>
+         </details>
+         <details class="unmask-panel" open>
+           <summary class="unmask-panel-title">${t("niah.section.reco") || "Recommendation"}</summary>
+           <p class="unmask-reco">${reco}</p>
+           <p class="unmask-reco"><strong>${t("niah.label.safe_ctx") || "Safe reasoning context"}:</strong> ${safeText}</p>
+         </details>
+       </div>
+     </div>
+   `;
+ }
1403
+
+ function renderNIAHSweep(rows, modelId) {
+   const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+     ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
+   const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
+   let body = "";
+   for (const r of rows) {
+     const color = NIAH_VERDICT_COLOR[r.verdict] || "#8b949e";
+     const label = t(`niah.verdict.${r.verdict}`) || r.verdict;
+     body += `<tr>
+       <td><strong>${fmtN(r.T_eval)}</strong></td>
+       <td class="arena-elo">${(r.niah_rate * 100).toFixed(0)}%</td>
+       <td class="arena-elo">${(r.reasoning_rate * 100).toFixed(0)}%</td>
+       <td class="arena-spread">${(r.gap * 100).toFixed(0)} pts</td>
+       <td style="color: ${color};"><strong>${label}</strong></td>
+     </tr>`;
+   }
+   return `
+     <div class="arena-result">
+       <div class="unmask-hero" style="border-color: #58a6ff;">
+         <div class="unmask-verdict" style="color: #58a6ff;">${tFmt("niah.summary.sweep", { modelId: escapeHtml(modelId) })}</div>
+       </div>
+       <div class="unmask-details">
+         <details class="unmask-panel" open>
+           <summary class="unmask-panel-title">${t("niah.section.sweep") || "Pass rate sweep across context lengths"}</summary>
+           <table class="arena-table">
+             <thead><tr>
+               <th>${t("niah.col.context") || "T_eval"}</th>
+               <th>${t("niah.col.niah") || "NIAH"}</th>
+               <th>${t("niah.col.reasoning") || "Reasoning"}</th>
+               <th>${t("niah.col.gap") || "Gap"}</th>
+               <th>${t("niah.col.verdict") || "Verdict"}</th>
+             </tr></thead>
+             <tbody>${body}</tbody>
+           </table>
+         </details>
+       </div>
+     </div>
+   `;
+ }
+
+ async function runNIAHPredict() {
+   const cfg = __niahLastConfig || await niahFetchConfig();
+   if (!cfg) return;
+   const T_eval = parseInt($("niah-teval").value, 10);
+   if (Number.isNaN(T_eval) || T_eval < 512) {
+     $("niah-status").textContent = t("niah.status.bad_teval") || "⚠ Enter a target context (≥512).";
+     return;
+   }
+   const result = predictNIAHReasoning(cfg, T_eval);
+   $("niah-output").innerHTML = renderNIAHCard(result, __niahLastModelId);
+   $("niah-status").textContent = tFmt("niah.status.done", {
+     verdict: t(`niah.verdict.${result.verdict}`) || result.verdict,
+     niah: (result.niah_rate * 100).toFixed(0),
+     reasoning: (result.reasoning_rate * 100).toFixed(0),
+   });
+ }
+
+ async function runNIAHSweep() {
+   const cfg = __niahLastConfig || await niahFetchConfig();
+   if (!cfg) return;
+   const rows = sweepContextLengths(cfg);
+   $("niah-output").innerHTML = renderNIAHSweep(rows, __niahLastModelId);
+   $("niah-status").textContent = tFmt("niah.status.sweep_done", { n: rows.length });
+ }
+
+ $("niah-fetch-btn")?.addEventListener("click", niahFetchConfig);
+ $("niah-run-btn")?.addEventListener("click", runNIAHPredict);
+ $("niah-sweep-btn")?.addEventListener("click", runNIAHSweep);
+ $("niah-id")?.addEventListener("keydown", (e) => {
+   if (e.key === "Enter") { e.preventDefault(); niahFetchConfig(); }
+ });
+
  function configToPreset(cfg, modelId) {
    const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
    const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
js/niah_reasoning.js ADDED
@@ -0,0 +1,138 @@
+ // NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7)
+ // Predicts pass rate at a given evaluation context for two tasks:
+ //   - NIAH (Needle in a Haystack): single-fact retrieval, lenient
+ //   - Multi-hop reasoning: chained inference, strict
+ // And the GAP — the dominant failure mode for "long context" claims.
+ //
+ // Calibration: rough empirical fit to RULER paper bands (NVIDIA 2024) +
+ // observed degradation curves on Llama-3.1, Mistral, Qwen2.5 at 8k/16k/32k/64k.
+ // Uses TAF's existing γ_Padé / d_horizon machinery for the architectural input.
+ //
+ // Pure logic — no human strings. Render via i18n in main.js.
+
+ import { gammaPade, thetaEffPade } from "./gamma_check.js";
+
+ // d_horizon ≈ effective attention horizon. Reproduces formula from
+ // taf_browser.py / paper §sec:gamma_decomposition. For browser-only v1 use.
+ function dHorizon(theta, gammaPredicted) {
+   if (gammaPredicted >= 1) return Infinity;
+   if (gammaPredicted <= 0) return theta;
+   // d_horizon ≈ θ × (1 + γ_predicted) / (1 - γ_predicted)
+   // Padé-canonical form (paper §sec:gamma_decomposition).
+   return theta * (1 + gammaPredicted) / (1 - gammaPredicted);
+ }
+
+ // Sigmoid-like pass rate vs. ratio = T_eval / d_horizon.
+ // With k = 1.4 and the midpoint (P = 0.5) at ratio = 0.7, the logistic below gives:
+ //   ratio = 0.25 → ≈ 0.81 (well within horizon)
+ //   ratio = 0.50 → ≈ 0.62
+ //   ratio = 1.00 → ≈ 0.38
+ //   ratio = 2.00 → ≈ 0.19
+ //   ratio = 4.00 → ≈ 0.08
+ function niahRate(ratio) {
+   // Logistic on log-ratio: P = 1/(1+exp(k*(log(ratio)-log(0.7))))
+   const k = 1.4;
+   const center = Math.log(0.7);
+   const x = Math.log(Math.max(0.01, ratio));
+   return 1 / (1 + Math.exp(k * (x - center)));
+ }
+
+ // Multi-hop reasoning is strictly harder than NIAH. RULER paper shows ~30-50%
+ // drop from NIAH-Single to multi-hop at long context. The gap grows with
+ // architecture pressure (small d_head, aggressive GQA, SWA boundary).
+ function reasoningPenalty(ratio, archPressure) {
+   // Base penalty grows with context ratio (more multi-hop steps required).
+   // archPressure ∈ [1.0, 1.6] from architecture (small d_head + GQA → higher).
+   const base = ratio < 0.5 ? 0.05 :
+                ratio < 1.0 ? 0.15 :
+                ratio < 2.0 ? 0.30 :
+                ratio < 4.0 ? 0.45 : 0.55;
+   return Math.min(0.7, base * archPressure);
+ }
+
+ function archPressureFromConfig(config) {
+   let p = 1.0;
+   const n_attn = config.num_attention_heads ?? null;
+   const n_kv = config.num_key_value_heads ?? n_attn;
+   const hidden = config.hidden_size ?? null;
+   const d_head = config.head_dim ?? (n_attn && hidden ? hidden / n_attn : null);
+   if (d_head !== null) {
+     if (d_head < 64) p *= 1.25;
+     else if (d_head < 96) p *= 1.10;
+     else if (d_head < 128) p *= 1.03;
+   }
+   if (n_attn && n_kv && n_kv < n_attn) {
+     const ratio = n_attn / n_kv;
+     if (ratio >= 8) p *= 1.15;
+     else if (ratio >= 4) p *= 1.08;
+   }
+   if (typeof config.sliding_window === "number" && config.sliding_window > 0) {
+     p *= 1.10; // SWA: cross-window reasoning costs extra
+   }
+   return Math.min(1.6, p);
+ }
+
+ export function predictNIAHReasoning(config, T_eval) {
+   const theta = config.rope_theta ?? 10000;
+   const T_train = config.max_position_embeddings ?? T_eval;
+   const gPade = gammaPade(theta, T_eval);
+   const dh = dHorizon(theta, gPade);
+   const ratio = dh === Infinity ? 0 : T_eval / dh;
+
+   const archPressure = archPressureFromConfig(config);
+   // Extrapolation penalty: models tested far beyond their training context
+   // degrade regardless of architecture (positions beyond T_train were never
+   // seen during training). Capped at 0.7 so the rate is never zeroed out.
+   const extrapolation_ratio = T_train > 0 ? T_eval / T_train : 1;
+   const extrapolation_penalty = extrapolation_ratio > 1
+     ? Math.min(0.7, (extrapolation_ratio - 1) * 0.3)
+     : 0;
+   const niah = Math.max(0.02, niahRate(ratio) * (1 - extrapolation_penalty));
+   const penalty = reasoningPenalty(ratio, archPressure);
+   const reasoning = Math.max(0.02, niah * (1 - penalty));
+   const gap = niah - reasoning;
+
+   // Verdict bands
+   let verdict;
+   if (niah < 0.35) verdict = "broken"; // model can't even retrieve
+   else if (gap >= 0.30) verdict = "retrieval_only"; // canonical RULER finding
+   else if (gap >= 0.15) verdict = "degraded";
+   else if (niah >= 0.70 && reasoning >= 0.55) verdict = "robust";
+   else verdict = "marginal";
+
+   // Find a "safe" context where reasoning >= 0.65 (doubling sweep from 1024, stops at first failure; no extrapolation penalty applied here)
+   let safeT = null;
+   for (let t = 1024; t <= T_eval; t *= 2) {
+     const gP = gammaPade(theta, t);
+     const dh2 = dHorizon(theta, gP);
+     const r = dh2 === Infinity ? 0 : t / dh2;
+     const niah2 = niahRate(r);
+     const reas2 = niah2 * (1 - reasoningPenalty(r, archPressure));
+     if (reas2 >= 0.65) safeT = t;
+     else break;
+   }
+
+   return {
+     T_eval,
+     T_train,
+     theta,
+     arch_pressure: Math.round(archPressure * 100) / 100,
+     gamma_pade: Math.round(gPade * 1000) / 1000,
+     d_horizon: dh === Infinity ? null : Math.round(dh),
+     horizon_ratio: Math.round(ratio * 100) / 100,
+     niah_rate: Math.round(niah * 100) / 100,
+     reasoning_rate: Math.round(reasoning * 100) / 100,
+     gap: Math.round(gap * 100) / 100,
+     verdict,
+     safe_context: safeT,
+   };
+ }
+
+ // Sweep across context lengths (1k, 4k, 16k, 64k, T_max; defaults above T_max are dropped) so user sees the curve.
+ export function sweepContextLengths(config, lengths = null) {
+   const T_max = config.max_position_embeddings ?? 131072;
+   const defaults = lengths || [1024, 4096, 16384, 65536, T_max].filter((v, i, arr) =>
+     v <= T_max && arr.indexOf(v) === i
+   );
+   return defaults.map(T => predictNIAHReasoning(config, T));
+ }
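As a quick sanity check on the two closed-form pieces of the predictor, the sketch below re-implements the logistic from niahRate and the extrapolation penalty from predictNIAHReasoning as standalone functions and prints their values at a few ratios. The names niahRateSketch and extrapolationPenaltySketch are hypothetical and not part of the commit; the constants (k = 1.4, midpoint 0.7, slope 0.3, cap 0.7) are copied from the diff above. gammaPade is not reproduced, so the ratios are supplied directly rather than derived from a model config.

```javascript
// Hypothetical standalone copy of the logistic in niahRate (not the commit's export).
function niahRateSketch(ratio, k = 1.4, midpoint = 0.7) {
  // Clamp the ratio as the original does, then apply a logistic on log-ratio.
  const x = Math.log(Math.max(0.01, ratio));
  return 1 / (1 + Math.exp(k * (x - Math.log(midpoint))));
}

// Hypothetical standalone copy of the extrapolation penalty in predictNIAHReasoning.
function extrapolationPenaltySketch(T_eval, T_train) {
  const r = T_train > 0 ? T_eval / T_train : 1;
  // Linear in the overshoot past T_train, capped at 0.7.
  return r > 1 ? Math.min(0.7, (r - 1) * 0.3) : 0;
}

// Pass rate at ratio = T_eval / d_horizon, before any penalty.
for (const ratio of [0.25, 0.5, 0.7, 1, 2, 4]) {
  console.log(`ratio ${ratio}: NIAH ≈ ${niahRateSketch(ratio).toFixed(2)}`);
}

// Penalty when evaluating at 1x, 2x and 4x a 2048-token training context.
for (const T_eval of [2048, 4096, 8192]) {
  console.log(`T_eval ${T_eval}: penalty ${extrapolationPenaltySketch(T_eval, 2048).toFixed(2)}`);
}
```

Under these constants the logistic crosses 0.5 at ratio 0.7, and the penalty reaches its 0.7 cap once T_eval exceeds roughly 3.3 times T_train.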
style.css CHANGED
@@ -34,22 +34,48 @@
  }

  /* v0.7.5 — Cross-framework drift bound */
+ /* Drift section overrides the default 980px main width so the two-column form
+    has room without overlapping. Mobile (<800px) collapses to single column. */
+ #drift-section {
+   max-width: 1100px;
+ }
  .drift-grid {
    display: grid;
-   grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
+   grid-template-columns: 1fr 1fr;
    gap: 1em;
    margin: 0.6em 0 0.8em;
+   align-items: start;
+ }
+ @media (max-width: 800px) {
+   .drift-grid { grid-template-columns: 1fr; }
  }
  .drift-setup {
    border: 1px solid rgba(88, 166, 255, 0.3);
    border-radius: 8px;
    padding: 0.6em 1em 0.4em;
+   min-width: 0; /* fix: lets grid item shrink instead of overflowing */
+   box-sizing: border-box;
+   overflow: hidden; /* clip any overlong content cleanly */
  }
  .drift-setup legend {
    padding: 0 0.5em;
    font-weight: 600;
    color: #58a6ff;
  }
+ .drift-setup .form-row {
+   flex-wrap: wrap;
+   gap: 0.4em 0.6em;
+   margin: 0.35em 0;
+ }
+ .drift-setup .form-row label {
+   flex: 0 0 110px;
+   font-size: 0.92em;
+ }
+ .drift-setup .form-row input,
+ .drift-setup .form-row select {
+   flex: 1 1 120px;
+   min-width: 0;
+ }

  /* v0.7.4 — HF Hub autocomplete dropdown (attached to body) */
  .hf-autocomplete-dropdown {