Spaces: karlexmarin/taf-agent
karlexmarin Claude Opus 4.7 (1M context) committed 18 days ago

Commit ab15c91 (1 parent: edb4038)

v0.7.6: NIAH→reasoning predictor (anti-bullshit #7) + Drift CSS fix

NEW MODE 🔍 NIAH→Reason — pass rate predictor for retrieval vs reasoning at any context
- js/niah_reasoning.js: pure logic. Reuses gammaPade/d_horizon machinery.
- Inputs: HF model id (auto-fetch config) + target T_eval.
- Outputs: NIAH pass rate, multi-hop reasoning rate, gap, "safe context" where reasoning ≥ 65%, verdict (robust / marginal / degraded / retrieval_only / broken).
- Logistic on log(T_eval / d_horizon) for NIAH; arch-pressure-modulated penalty for reasoning. Extrapolation penalty when T_eval > T_train (positions beyond T_train were never seen during training).
- sweepContextLengths(): scans 1k / 4k / 16k / 64k / T_train so user sees the curve.
- Addresses: the RULER paper (NVIDIA 2024) finding that long-context models pass NIAH but fail reasoning. Listed as P6 in the HF community pain analysis.
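
The prediction steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in js/niah_reasoning.js: the constants SLOPE and PRESSURE_WEIGHT, the form of the extrapolation penalty, and the helper names predictPassRates, sweepSketch and safeContext are all assumptions made for the example. Only the logistic on log(T_eval / d_horizon), the 65% cutoff and the sweep lengths come from the commit message.

```javascript
// Illustrative sketch only: SLOPE, PRESSURE_WEIGHT and the penalty form are
// assumptions, not the constants used in js/niah_reasoning.js.
const SLOPE = 2.0;           // assumed steepness of the logistic
const PRESSURE_WEIGHT = 0.5; // assumed strength of the arch-pressure penalty
const SAFE_THRESHOLD = 0.65; // "safe context" cutoff named in the commit

function logistic(x) {
  return 1 / (1 + Math.exp(-x));
}

// archPressure is assumed to be pre-normalized to [0, 1].
function predictPassRates({ tEval, dHorizon, tTrain, archPressure }) {
  // Negative when T_eval sits inside the horizon, positive beyond it.
  const x = Math.log(tEval / dHorizon);
  let niah = logistic(-SLOPE * x);
  let reasoning = niah * (1 - PRESSURE_WEIGHT * archPressure);
  if (tEval > tTrain) {
    // Extrapolation penalty: shrink both rates past the trained length.
    const penalty = tTrain / tEval;
    niah *= penalty;
    reasoning *= penalty;
  }
  return { tEval, niah, reasoning, gap: niah - reasoning };
}

// Contexts from the commit message: 1k / 4k / 16k / 64k / T_train.
function sweepSketch(model) {
  const lengths = [...new Set([1024, 4096, 16384, 65536, model.tTrain])];
  lengths.sort((a, b) => a - b);
  return lengths.map((tEval) => predictPassRates({ ...model, tEval }));
}

// Largest swept context whose reasoning rate stays at or above the cutoff.
function safeContext(rows) {
  const ok = rows.filter((r) => r.reasoning >= SAFE_THRESHOLD);
  return ok.length ? Math.max(...ok.map((r) => r.tEval)) : null;
}
```

With a hypothetical model such as `{ dHorizon: 32768, tTrain: 131072, archPressure: 0.2 }`, `safeContext(sweepSketch(model))` picks the longest swept context that still clears the 65% reasoning cutoff. The real module also maps the rates to a verdict; that mapping is not reproduced here because the commit does not state its thresholds.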

VIRTUAL SIMULATION
- Llama-3.1-8B → robust at 8k / 64k / 128k (matches RULER findings).
- Mistral-7B (SWA 4096) → robust at 4k, broken at 16k+ (SWA limit kicks in).
- Pythia-160m at 4× T_train → broken (extrapolation penalty caps gain past trained positions).

DRIFT CSS FIX
- #drift-section max-width 1100px (default 980px caused panel overflow).
- grid-template-columns: 1fr 1fr explicit (replaces auto-fit which overlapped).
- min-width: 0 + box-sizing on .drift-setup → overrides the default min-width: auto that CSS Grid applies to grid items, which caused the overflow.
- Mobile <800px → single column.
- form-row inside drift-setup: flex-wrap, label width 110px, inputs flex 1 1 120px so values don't squish.

DOCUMENTATION
- Help modal v0.7 niah section in 4 langs.
- Inventory modal v0.7 card: 7th entry "🔍 NIAH→Reason".
- modes.tip → 14 modes in 4 langs.
- 681 i18n keys × 4 langs · 0 missing / 0 extra (49 new niah.* keys per lang).
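
A "0 missing / 0 extra" check like the one reported above could be implemented along these lines. This is a hedged sketch: the function name checkI18nParity, the choice of "en" as the reference language, and the assumption that TRANSLATIONS is a plain object keyed by language code are illustrative and not confirmed by the commit.

```javascript
// Hypothetical parity check: compares every language table against a
// reference language and lists keys that are missing or extra.
function checkI18nParity(translations, reference = "en") {
  const refKeys = new Set(Object.keys(translations[reference]));
  const report = {};
  for (const [lang, table] of Object.entries(translations)) {
    if (lang === reference) continue;
    const keys = new Set(Object.keys(table));
    report[lang] = {
      missing: [...refKeys].filter((k) => !keys.has(k)),
      extra: [...keys].filter((k) => !refKeys.has(k)),
    };
  }
  return report;
}

// Toy data standing in for the real TRANSLATIONS object.
const sample = {
  en: { "niah.title": "NIAH → Reasoning Gap", "niah.run_btn": "Predict" },
  es: { "niah.title": "Gap NIAH → Reasoning" },
};
const parity = checkI18nParity(sample);
console.log(parity.es.missing); // keys present in "en" but absent in "es"
```

Run against the full table, an empty `missing` and `extra` list for every language corresponds to the parity claim in the commit message.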

33/33 sim passes — including extrapolation penalty calibration on Pythia.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (5)

index.html +30 -0
js/i18n.js +208 -4
js/main.js +160 -2
js/niah_reasoning.js +138 -0
style.css +27 -1

index.html CHANGED

@@ -210,6 +210,9 @@
       <p><strong data-i18n="help.v07.drift.title">🔀 Cross-framework Drift Bound</strong></p>
       <p data-i18n="help.v07.drift.body">Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the &quot;Load sample&quot; button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.</p>
+      <p><strong data-i18n="help.v07.niah.title">🔍 NIAH → Reasoning Gap</strong></p>
+      <p data-i18n="help.v07.niah.body">RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's "safe reasoning context" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. <em>Use case</em>: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.</p>
       <h3 data-i18n="help.audit.title">The audit chain</h3>
       <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
       output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
@@ -317,6 +320,7 @@
             <li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
             <li data-i18n="inv.v07.quant"><strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo</li>
             <li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li>
+            <li data-i18n="inv.v07.niah"><strong>🔍 NIAH→Reason</strong> — does your "128k context" actually reason there, or just retrieve?</li>
           </ul>
         </details>
       </div>
@@ -372,6 +376,7 @@
         <button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
         <button class="mode-btn" data-mode="quant" role="tab" aria-selected="false" data-i18n="modes.quant">⚖️ Quant</button>
         <button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button>
+        <button class="mode-btn" data-mode="niah" role="tab" aria-selected="false" data-i18n="modes.niah">🔍 NIAH→Reason</button>
       </div>
       <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
         <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
@@ -871,6 +876,31 @@
       <div id="drift-output" style="margin-top: 1em;"></div>
     </section>
+    <!-- NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7) -->
+    <section id="niah-section" style="display:none;">
+      <h2><span data-i18n="niah.title">🔍 NIAH → Reasoning Gap</span>
+        <span class="info"><span class="tooltip" data-i18n="niah.tip">
+          NIAH (Needle in a Haystack) tests retrieval: "find this fact in long text". Multi-hop reasoning tests inference: "combine facts X+Y at the start with fact Z at the end". RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.
+        </span></span>
+      </h2>
+      <p class="recipe-desc" data-i18n="niah.desc">
+        <strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a "safe context" where reasoning stays ≥65%.
+      </p>
+      <div class="form-row">
+        <label for="niah-id" data-i18n="niah.id_label">HF model id:</label>
+        <input type="text" id="niah-id" placeholder="e.g. meta-llama/Llama-3.1-8B-Instruct" />
+        <button type="button" id="niah-fetch-btn" data-i18n="niah.fetch_btn">📥 Fetch config</button>
+      </div>
+      <div class="form-row">
+        <label for="niah-teval" data-i18n="niah.teval_label">Target context (T_eval):</label>
+        <input type="number" id="niah-teval" min="512" step="1024" value="32768" />
+        <button type="button" id="niah-run-btn" data-i18n="niah.run_btn">🔍 Predict</button>
+        <button type="button" id="niah-sweep-btn" class="secondary" data-i18n="niah.sweep_btn">📊 Sweep contexts</button>
+      </div>
+      <p id="niah-status" class="recipe-desc" style="font-size:0.92em;"></p>
+      <div id="niah-output" style="margin-top: 1em;"></div>
+    </section>
     <!-- Recipe selector (mode=recipe) -->
     <section id="recipe-section" style="display:none;">
       <h2 data-i18n="recipe.title">📋 Recipe</h2>

js/i18n.js CHANGED

@@ -307,6 +307,9 @@ export const TRANSLATIONS = {
     "help.v07.drift.title":        "🔀 Cross-framework Drift Bound",
     "help.v07.drift.body":         "Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the &quot;Load sample&quot; button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.",
     "inv.v07.drift":               "<strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals",
     // v0.7 — Inventory modal 5th card
     "inv.v07.title":               "🆕 v0.7 anti-bullshit pack",
@@ -414,6 +417,54 @@ export const TRANSLATIONS = {
     "drift.status.empty_scores":   "⚠ Enter both scores.",
     "drift.status.done":           "✅ Verdict: {verdict}",
     "drift.status.sample_loaded":  "✅ Sample loaded (canonical chat-template bug). Click Compute drift bound.",
     "share.import_desc":       "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
     "share.import_btn":        "📂 Load shared JSON",
     "synthesis.system":        "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
@@ -506,7 +557,7 @@ export const TRANSLATIONS = {
     "common.no":           "No",
     // Mode tooltips
-    "modes.tip":           "<strong>Thirteen ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.<br><strong>⚖️ Quant</strong>: predict γ-shift and ΔPPL for any (model × quant scheme); recommend safer alternative on cliff.<br><strong>🔀 Drift</strong>: same model, different scores on two setups — bug or noise? Predict numerical-noise band and flag real bugs.",
     "profile.tip":         "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
     "compare.tip":         "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
@@ -1145,6 +1196,9 @@ export const TRANSLATIONS = {
     "help.v07.drift.title":        "🔀 Cota de drift entre frameworks",
     "help.v07.drift.body":         "Mismo modelo, scores distintos en setups distintos. La herramienta predice el drift máximo admisible solo por ruido numérico (dtype, framework, batch). Si el gap observado lo excede → bug real, normalmente chat-template mismatch (issue #1841 de lm-eval-harness) o layout de KV-cache. Prueba el botón &quot;Cargar sample&quot; para el bug canónico de chat-template. <em>Caso de uso</em>: antes de reportar una regresión o reclamar reproducibilidad, verifica si el gap entre dos evals es mayor de lo que el ruido numérico puede explicar.",
     "inv.v07.drift":               "<strong>🔀 Drift</strong> — ¿bug o ruido? Predice el gap máximo admisible entre dos evals",
     // v0.7 — Inventory modal 5ª card
     "inv.v07.title":               "🆕 Pack anti-bullshit v0.7",
@@ -1252,6 +1306,54 @@ export const TRANSLATIONS = {
     "drift.status.empty_scores":   "⚠ Introduce ambos scores.",
     "drift.status.done":           "✅ Veredicto: {verdict}",
     "drift.status.sample_loaded":  "✅ Sample cargado (bug canónico de chat-template). Click en Calcular cota de drift.",
     "share.import_desc":       "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
     "share.import_btn":        "📂 Cargar JSON compartido",
     "synthesis.system":        "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
@@ -1344,7 +1446,7 @@ export const TRANSLATIONS = {
     "common.no":           "No",
     // Tooltips de modos
-    "modes.tip":           "<strong>Trece formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.<br><strong>⚖️ Quant</strong>: predice γ-shift y ΔPPL para cualquier (modelo × esquema de cuantización); recomienda alternativa segura si hay cliff.<br><strong>🔀 Drift</strong>: mismo modelo, scores distintos en dos setups — ¿bug o ruido? Predice banda de ruido numérico y flagea bugs reales.",
     "profile.tip":         "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
     "compare.tip":         "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
@@ -1847,6 +1949,9 @@ export const TRANSLATIONS = {
     "help.v07.drift.title":        "🔀 Borne de drift inter-frameworks",
     "help.v07.drift.body":         "Même modèle, scores différents sur setups différents. L'outil prédit le drift max admissible dû au seul bruit numérique (dtype, framework, batch). Si l'écart observé le dépasse → vrai bug, généralement chat-template mismatch (issue #1841 lm-eval-harness) ou layout KV-cache. Essayez le bouton &quot;Charger échantillon&quot; pour le bug chat-template canonique. <em>Cas d'usage</em> : avant de reporter une régression ou de revendiquer la reproductibilité, vérifiez si l'écart entre deux évals est plus grand que ce que le bruit numérique peut expliquer.",
     "inv.v07.drift":               "<strong>🔀 Drift</strong> — bug ou bruit ? Prédit l'écart max admissible entre deux évals",
     // v0.7 — Inventory modal 5ème card
     "inv.v07.title":               "🆕 Pack anti-bullshit v0.7",
@@ -1954,6 +2059,54 @@ export const TRANSLATIONS = {
     "drift.status.empty_scores":   "⚠ Saisissez les deux scores.",
     "drift.status.done":           "✅ Verdict : {verdict}",
     "drift.status.sample_loaded":  "✅ Échantillon chargé (bug chat-template canonique). Cliquez sur Calculer la borne de drift.",
     "share.import_desc":       "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
     "share.import_btn":        "📂 Charger JSON partagé",
     "synthesis.system":        "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
@@ -2046,7 +2199,7 @@ export const TRANSLATIONS = {
     "common.no":           "Non",
     // Tooltips des modes
-    "modes.tip":           "<strong>Treize façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.<br><strong>⚖️ Quant</strong>: prédit γ-shift et ΔPPL pour tout (modèle × schéma de quantification) ; recommande une alternative sûre en cas de cliff.<br><strong>🔀 Drift</strong>: même modèle, scores différents sur deux setups — bug ou bruit ? Prédit la bande de bruit numérique et signale les vrais bugs.",
     "profile.tip":         "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
     "compare.tip":         "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
@@ -2549,6 +2702,9 @@ export const TRANSLATIONS = {
     "help.v07.drift.title":        "🔀 跨框架 Drift 界",
     "help.v07.drift.body":         "同一模型，不同 setup 下分数不同。工具预测仅由数值噪声（dtype、framework、batch）允许的最大 drift。若观测差距超过它 → 真实 bug，通常是 chat-template mismatch（lm-eval-harness issue #1841）或 KV-cache 布局。试试 &quot;加载样本&quot; 按钮看典型的 chat-template bug。<em>用例</em>：在报告回归或声称可复现性之前，验证两个评估之间的差距是否大于数值噪声能解释的范围。",
     "inv.v07.drift":               "<strong>🔀 Drift</strong> — bug 还是噪声？预测两个评估间的最大可允许差距",
     // v0.7 — Inventory 模态第 5 卡
     "inv.v07.title":               "🆕 v0.7 anti-bullshit 套件",
@@ -2656,6 +2812,54 @@ export const TRANSLATIONS = {
     "drift.status.empty_scores":   "⚠ 输入两个分数。",
     "drift.status.done":           "✅ 判定：{verdict}",
     "drift.status.sample_loaded":  "✅ 样本已加载（典型 chat-template bug）。点击计算 drift 界。",
     "share.import_desc":       "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
     "share.import_btn":        "📂 加载共享的 JSON",
     "synthesis.system":        "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
@@ -2748,7 +2952,7 @@ export const TRANSLATIONS = {
     "common.no":           "否",
     // 模式提示
-    "modes.tip":           "<strong>十三种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings（SWA / YaRN / RoPE 缩放）。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间；检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期，对 20+ benchmark 进行污染概率评估。<br><strong>⚖️ Quant</strong>: 预测任意（模型 × 量化方案）的 γ-shift 与 ΔPPL；cliff 时推荐更安全替代方案。<br><strong>🔀 Drift</strong>: 同一模型，两 setup 下分数不同 — bug 还是噪声？预测数值噪声区间并标记真实 bug。",
     "profile.tip":         "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
     "compare.tip":         "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",

     "help.v07.drift.title":        "🔀 Cross-framework Drift Bound",
     "help.v07.drift.body":         "Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the &quot;Load sample&quot; button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.",
     "inv.v07.drift":               "<strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals",
+    "help.v07.niah.title":         "🔍 NIAH → Reasoning Gap",
+    "help.v07.niah.body":          "RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's \"safe reasoning context\" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. <em>Use case</em>: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.",
+    "inv.v07.niah":                "<strong>🔍 NIAH→Reason</strong> — does your \"128k context\" actually reason there, or just retrieve?",
     // v0.7 — Inventory modal 5th card
     "inv.v07.title":               "🆕 v0.7 anti-bullshit pack",
     "drift.status.empty_scores":   "⚠ Enter both scores.",
     "drift.status.done":           "✅ Verdict: {verdict}",
     "drift.status.sample_loaded":  "✅ Sample loaded (canonical chat-template bug). Click Compute drift bound.",
+    // v0.7.6 — anti-bullshit pack #7: NIAH → reasoning gap predictor
+    "modes.niah":                  "🔍 NIAH→Reason",
+    "mode_desc.niah":              "Predicts NIAH (retrieval) and multi-hop reasoning pass rates at any context. Solves: long-context models often pass NIAH but fail reasoning at the same context (RULER paper).",
+    "niah.title":                  "🔍 NIAH → Reasoning Gap",
+    "niah.tip":                    "NIAH (Needle in a Haystack) tests retrieval: 'find this fact in long text'. Multi-hop reasoning tests inference: 'combine facts X+Y at the start with fact Z at the end'. RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.",
+    "niah.desc":                   "<strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a 'safe context' where reasoning stays ≥65%.",
+    "niah.id_label":               "HF model id:",
+    "niah.fetch_btn":              "📥 Fetch config",
+    "niah.teval_label":            "Target context (T_eval):",
+    "niah.run_btn":                "🔍 Predict",
+    "niah.sweep_btn":              "📊 Sweep contexts",
+    "niah.label.niah":             "NIAH pass rate",
+    "niah.label.reasoning":        "Reasoning pass rate",
+    "niah.label.gap":              "Gap",
+    "niah.label.safe_ctx":         "Safe reasoning context",
+    "niah.section.breakdown":      "Architecture breakdown",
+    "niah.section.reco":           "Recommendation",
+    "niah.section.sweep":          "Pass rate sweep across context lengths",
+    "niah.field.dhorizon":         "d_horizon (effective)",
+    "niah.field.ratio":            "T_eval / d_horizon",
+    "niah.field.arch_pressure":    "Arch pressure (small d_head + GQA + SWA)",
+    "niah.field.theta":            "RoPE θ",
+    "niah.field.t_train":          "T_train (claimed)",
+    "niah.col.context":            "T_eval",
+    "niah.col.niah":               "NIAH",
+    "niah.col.reasoning":          "Reasoning",
+    "niah.col.gap":                "Gap",
+    "niah.col.verdict":            "Verdict",
+    "niah.verdict.robust":         "✅ ROBUST",
+    "niah.verdict.marginal":       "⚠ MARGINAL",
+    "niah.verdict.degraded":       "⚠ DEGRADED",
+    "niah.verdict.retrieval_only": "❌ RETRIEVAL-ONLY",
+    "niah.verdict.broken":         "❌ BROKEN",
+    "niah.reco.robust":            "Both retrieval and reasoning hold up at this context. Safe to deploy for both lookup and inference tasks.",
+    "niah.reco.marginal":          "Borderline. Retrieval works but reasoning is shaky. Use for fact-lookup, not multi-step inference.",
+    "niah.reco.degraded":          "Significant reasoning drop. The model can find facts but struggles to combine them. Avoid multi-hop tasks at this length.",
+    "niah.reco.retrieval_only":    "Canonical RULER finding: model passes NIAH but fails reasoning. Useful for retrieval-augmented setups (where the LLM only locates facts) but NOT for chained inference. Cut your context to the 'safe' value below.",
+    "niah.reco.broken":            "Model fails even basic retrieval at this context. Treat as out-of-distribution — re-test at a shorter context.",
+    "niah.safe_context":           "≤ {ctx} tokens (reasoning ≥ 65%)",
+    "niah.safe_context_none":      "No safe context found below your target — model fails reasoning even at small contexts.",
+    "niah.summary.sweep":          "<code>{modelId}</code> — pass rates by context",
+    "niah.status.empty_id":        "⚠ Enter a model id (e.g. meta-llama/Llama-3.1-8B-Instruct).",
+    "niah.status.bad_teval":       "⚠ Enter a target context (≥ 512 tokens).",
+    "niah.status.fetching":        "⏳ Fetching config.json for {modelId}...",
+    "niah.status.fetched":        "✅ Config fetched for {modelId}. Set T_eval and click Predict (or Sweep contexts).",
+    "niah.status.done":            "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
+    "niah.status.sweep_done":      "✅ Swept {n} context lengths.",
     "share.import_desc":       "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
     "share.import_btn":        "📂 Load shared JSON",
     "synthesis.system":        "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
     "common.no":           "No",
     // Mode tooltips
+    "modes.tip":           "<strong>Fourteen ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.<br><strong>⚖️ Quant</strong>: predict γ-shift and ΔPPL for any (model × quant scheme); recommend safer alternative on cliff.<br><strong>🔀 Drift</strong>: same model, different scores on two setups — bug or noise? Predict numerical-noise band and flag real bugs.<br><strong>🔍 NIAH→Reason</strong>: predict NIAH and multi-hop reasoning pass rates from architecture; find your model's safe reasoning context.",
     "profile.tip":         "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
     "compare.tip":         "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
     "help.v07.drift.title":        "🔀 Cota de drift entre frameworks",
     "help.v07.drift.body":         "Mismo modelo, scores distintos en setups distintos. La herramienta predice el drift máximo admisible solo por ruido numérico (dtype, framework, batch). Si el gap observado lo excede → bug real, normalmente chat-template mismatch (issue #1841 de lm-eval-harness) o layout de KV-cache. Prueba el botón &quot;Cargar sample&quot; para el bug canónico de chat-template. <em>Caso de uso</em>: antes de reportar una regresión o reclamar reproducibilidad, verifica si el gap entre dos evals es mayor de lo que el ruido numérico puede explicar.",
     "inv.v07.drift":               "<strong>🔀 Drift</strong> — ¿bug o ruido? Predice el gap máximo admisible entre dos evals",
+    "help.v07.niah.title":         "🔍 Gap NIAH → Reasoning",
+    "help.v07.niah.body":          "El paper RULER (NVIDIA 2024) muestra que modelos long-context a menudo pasan NIAH (retrieval de needle) pero fallan reasoning multi-hop al mismo contexto. La herramienta predice ambas tasas de pass desde la arquitectura (γ_Padé + d_horizon + presión arq: d_head pequeño, GQA, SWA), reporta el gap, y encuentra el \"contexto seguro de reasoning\" donde reasoning se mantiene ≥65%. Modo barrido muestra la curva a 1k/4k/16k/64k/T_train. <em>Caso de uso</em>: antes de desplegar al contexto declarado, descubre si el modelo realmente razonará ahí o solo encontrará.",
+    "inv.v07.niah":                "<strong>🔍 NIAH→Reason</strong> — ¿tu \"128k\" realmente razona ahí, o solo encuentra?",
     // v0.7 — Inventory modal 5ª card
     "inv.v07.title":               "🆕 Pack anti-bullshit v0.7",
     "drift.status.empty_scores":   "⚠ Introduce ambos scores.",
     "drift.status.done":           "✅ Veredicto: {verdict}",
     "drift.status.sample_loaded":  "✅ Sample cargado (bug canónico de chat-template). Click en Calcular cota de drift.",
+    // v0.7.6 — anti-bullshit pack #7: NIAH → predictor de gap de reasoning
+    "modes.niah":                  "🔍 NIAH→Reason",
+    "mode_desc.niah":              "Predice tasas de pass de NIAH (retrieval) y reasoning multi-hop a cualquier contexto. Resuelve: modelos long-context pasan NIAH pero fallan reasoning al mismo contexto (paper RULER).",
+    "niah.title":                  "🔍 Gap NIAH → Reasoning",
+    "niah.tip":                    "NIAH (Needle in a Haystack) testea retrieval: 'encuentra este hecho en texto largo'. Reasoning multi-hop testea inferencia: 'combina hechos X+Y del principio con hecho Z del final'. El paper RULER (NVIDIA 2024) muestra que modelos long-context a menudo pasan NIAH pero fallan reasoning al mismo contexto. Esta herramienta predice ambas tasas desde la arquitectura sola.",
+    "niah.desc":                   "<strong>Tu modelo dice 128k de contexto. ¿Razonará realmente a 64k, o solo encontrará?</strong> Pega un model id HF y un contexto objetivo — la herramienta predice tasas de pass NIAH y reasoning multi-hop, el gap, y un 'contexto seguro' donde reasoning se mantiene ≥65%.",
+    "niah.id_label":               "ID modelo HF:",
+    "niah.fetch_btn":              "📥 Fetch config",
+    "niah.teval_label":            "Contexto objetivo (T_eval):",
+    "niah.run_btn":                "🔍 Predecir",
+    "niah.sweep_btn":              "📊 Barrer contextos",
+    "niah.label.niah":             "Tasa pass NIAH",
+    "niah.label.reasoning":        "Tasa pass Reasoning",
+    "niah.label.gap":              "Gap",
+    "niah.label.safe_ctx":         "Contexto seguro de reasoning",
+    "niah.section.breakdown":      "Desglose arquitectónico",
+    "niah.section.reco":           "Recomendación",
+    "niah.section.sweep":          "Barrido de tasas pass por longitud de contexto",
+    "niah.field.dhorizon":         "d_horizon (efectivo)",
+    "niah.field.ratio":            "T_eval / d_horizon",
+    "niah.field.arch_pressure":    "Presión arq (d_head pequeño + GQA + SWA)",
+    "niah.field.theta":            "RoPE θ",
+    "niah.field.t_train":          "T_train (declarado)",
+    "niah.col.context":            "T_eval",
+    "niah.col.niah":               "NIAH",
+    "niah.col.reasoning":          "Reasoning",
+    "niah.col.gap":                "Gap",
+    "niah.col.verdict":            "Veredicto",
+    "niah.verdict.robust":         "✅ ROBUSTO",
+    "niah.verdict.marginal":       "⚠ MARGINAL",
+    "niah.verdict.degraded":       "⚠ DEGRADADO",
+    "niah.verdict.retrieval_only": "❌ SOLO RETRIEVAL",
+    "niah.verdict.broken":         "❌ ROTO",
+    "niah.reco.robust":            "Tanto retrieval como reasoning aguantan a este contexto. Seguro para desplegar tareas de lookup e inferencia.",
+    "niah.reco.marginal":          "Borderline. Retrieval funciona pero reasoning está flojo. Úsalo para lookup, no para inferencia multi-paso.",
+    "niah.reco.degraded":          "Caída significativa de reasoning. El modelo encuentra hechos pero le cuesta combinarlos. Evita tareas multi-hop a esta longitud.",
+    "niah.reco.retrieval_only":    "Hallazgo canónico de RULER: el modelo pasa NIAH pero falla reasoning. Útil para setups RAG (donde el LLM solo localiza hechos) pero NO para inferencia encadenada. Reduce tu contexto al valor 'seguro' de abajo.",
+    "niah.reco.broken":            "El modelo falla incluso retrieval básico a este contexto. Trátalo como out-of-distribution — re-testea a contexto más corto.",
+    "niah.safe_context":           "≤ {ctx} tokens (reasoning ≥ 65%)",
+    "niah.safe_context_none":      "No se encontró contexto seguro bajo tu objetivo — el modelo falla reasoning incluso a contextos pequeños.",
+    "niah.summary.sweep":          "<code>{modelId}</code> — tasas pass por contexto",
+    "niah.status.empty_id":        "⚠ Introduce un model id (ej. meta-llama/Llama-3.1-8B-Instruct).",
+    "niah.status.bad_teval":       "⚠ Introduce un contexto objetivo (≥ 512 tokens).",
+    "niah.status.fetching":        "⏳ Obteniendo config.json para {modelId}...",
+    "niah.status.fetched":         "✅ Config obtenido para {modelId}. Pon T_eval y click Predecir (o Barrer contextos).",
+    "niah.status.done":            "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
+    "niah.status.sweep_done":      "✅ Barridas {n} longitudes de contexto.",
     "share.import_desc":       "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
     "share.import_btn":        "📂 Cargar JSON compartido",
     "synthesis.system":        "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
     "common.no":           "No",
     // Tooltips de modos
+    "modes.tip":           "<strong>Catorce formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.<br><strong>⚖️ Quant</strong>: predice γ-shift y ΔPPL para cualquier (modelo × esquema de cuantización); recomienda alternativa segura si hay cliff.<br><strong>🔀 Drift</strong>: mismo modelo, scores distintos en dos setups — ¿bug o ruido? Predice banda de ruido numérico y flagea bugs reales.<br><strong>🔍 NIAH→Reason</strong>: predice tasas pass NIAH y reasoning multi-hop desde arquitectura; encuentra el contexto seguro de reasoning.",
     "profile.tip":         "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
     "compare.tip":         "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
     "help.v07.drift.title":        "🔀 Borne de drift inter-frameworks",
     "help.v07.drift.body":         "Même modèle, scores différents sur setups différents. L'outil prédit le drift max admissible dû au seul bruit numérique (dtype, framework, batch). Si l'écart observé le dépasse → vrai bug, généralement chat-template mismatch (issue #1841 lm-eval-harness) ou layout KV-cache. Essayez le bouton &quot;Charger échantillon&quot; pour le bug chat-template canonique. <em>Cas d'usage</em> : avant de reporter une régression ou de revendiquer la reproductibilité, vérifiez si l'écart entre deux évals est plus grand que ce que le bruit numérique peut expliquer.",
     "inv.v07.drift":               "<strong>🔀 Drift</strong> — bug ou bruit ? Prédit l'écart max admissible entre deux évals",
+    "help.v07.niah.title":         "🔍 Gap NIAH → Reasoning",
+    "help.v07.niah.body":          "Le paper RULER (NVIDIA 2024) montre que les modèles long-context passent souvent NIAH (retrieval de needle) mais échouent au reasoning multi-hop au même contexte. L'outil prédit les deux taux de réussite à partir de l'architecture (γ_Padé + d_horizon + pression arch : petit d_head, GQA, SWA), reporte le gap, et trouve le \"contexte sûr pour reasoning\" où le reasoning reste ≥65%. Mode balayage montre la courbe à 1k/4k/16k/64k/T_train. <em>Cas d'usage</em> : avant de déployer au contexte revendiqué, découvrez si le modèle va vraiment raisonner là ou seulement retrouver.",
+    "inv.v07.niah":                "<strong>🔍 NIAH→Reason</strong> — votre \"128k\" raisonne-t-il vraiment là, ou seulement retrouve ?",
     // v0.7 — Inventory modal 5ème card
     "inv.v07.title":               "🆕 Pack anti-bullshit v0.7",
     "drift.status.empty_scores":   "⚠ Saisissez les deux scores.",
     "drift.status.done":           "✅ Verdict : {verdict}",
     "drift.status.sample_loaded":  "✅ Échantillon chargé (bug chat-template canonique). Cliquez sur Calculer la borne de drift.",
+    // v0.7.6 — anti-bullshit pack #7: prédicteur de gap NIAH → reasoning
+    "modes.niah":                  "🔍 NIAH→Reason",
+    "mode_desc.niah":              "Prédit les taux de réussite NIAH (retrieval) et reasoning multi-hop à n'importe quel contexte. Résout : les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte (paper RULER).",
+    "niah.title":                  "🔍 Gap NIAH → Reasoning",
+    "niah.tip":                    "NIAH (Needle in a Haystack) teste le retrieval : 'trouve ce fait dans un long texte'. Le reasoning multi-hop teste l'inférence : 'combine les faits X+Y au début avec le fait Z à la fin'. Le paper RULER (NVIDIA 2024) montre que les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte. Cet outil prédit les deux taux à partir de la seule architecture.",
+    "niah.desc":                   "<strong>Votre modèle revendique 128k de contexte. Va-t-il vraiment raisonner à 64k, ou seulement retrouver ?</strong> Collez un model id HF et un contexte cible — l'outil prédit les taux de réussite NIAH et reasoning multi-hop, le gap, et un 'contexte sûr' où le reasoning reste ≥65%.",
+    "niah.id_label":               "ID modèle HF :",
+    "niah.fetch_btn":              "📥 Récupérer config",
+    "niah.teval_label":            "Contexte cible (T_eval) :",
+    "niah.run_btn":                "🔍 Prédire",
+    "niah.sweep_btn":              "📊 Balayer les contextes",
+    "niah.label.niah":             "Taux NIAH",
+    "niah.label.reasoning":        "Taux Reasoning",
+    "niah.label.gap":              "Gap",
+    "niah.label.safe_ctx":         "Contexte sûr pour reasoning",
+    "niah.section.breakdown":      "Détail architectural",
+    "niah.section.reco":           "Recommandation",
+    "niah.section.sweep":          "Balayage des taux par longueur de contexte",
+    "niah.field.dhorizon":         "d_horizon (effectif)",
+    "niah.field.ratio":            "T_eval / d_horizon",
+    "niah.field.arch_pressure":    "Pression arch (petit d_head + GQA + SWA)",
+    "niah.field.theta":            "RoPE θ",
+    "niah.field.t_train":          "T_train (revendiqué)",
+    "niah.col.context":            "T_eval",
+    "niah.col.niah":               "NIAH",
+    "niah.col.reasoning":          "Reasoning",
+    "niah.col.gap":                "Gap",
+    "niah.col.verdict":            "Verdict",
+    "niah.verdict.robust":         "✅ ROBUSTE",
+    "niah.verdict.marginal":       "⚠ MARGINAL",
+    "niah.verdict.degraded":       "⚠ DÉGRADÉ",
+    "niah.verdict.retrieval_only": "❌ RETRIEVAL UNIQUEMENT",
+    "niah.verdict.broken":         "❌ CASSÉ",
+    "niah.reco.robust":            "Retrieval et reasoning tiennent tous deux à ce contexte. Sûr de déployer pour les tâches de lookup et d'inférence.",
+    "niah.reco.marginal":          "Borderline. Le retrieval fonctionne mais le reasoning est fragile. À utiliser pour le lookup, pas pour l'inférence multi-étapes.",
+    "niah.reco.degraded":          "Chute significative du reasoning. Le modèle trouve des faits mais peine à les combiner. Évitez les tâches multi-hop à cette longueur.",
+    "niah.reco.retrieval_only":    "Constat canonique de RULER : le modèle passe NIAH mais échoue au reasoning. Utile pour les setups RAG (où le LLM ne fait que localiser les faits) mais PAS pour l'inférence chaînée. Réduisez votre contexte à la valeur 'sûre' ci-dessous.",
+    "niah.reco.broken":            "Le modèle échoue même au retrieval basique à ce contexte. Traitez-le comme hors-distribution — re-testez à un contexte plus court.",
+    "niah.safe_context":           "≤ {ctx} tokens (reasoning ≥ 65%)",
+    "niah.safe_context_none":      "Aucun contexte sûr trouvé sous votre cible — le modèle échoue au reasoning même à de petits contextes.",
+    "niah.summary.sweep":          "<code>{modelId}</code> — taux par contexte",
+    "niah.status.empty_id":        "⚠ Saisissez un model id (ex. meta-llama/Llama-3.1-8B-Instruct).",
+    "niah.status.bad_teval":       "⚠ Saisissez un contexte cible (≥ 512 tokens).",
+    "niah.status.fetching":        "⏳ Récupération config.json pour {modelId}...",
+    "niah.status.fetched":         "✅ Config récupéré pour {modelId}. Réglez T_eval et cliquez Prédire (ou Balayer les contextes).",
+    "niah.status.done":            "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
+    "niah.status.sweep_done":      "✅ {n} longueurs de contexte balayées.",
     "share.import_desc":       "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
     "share.import_btn":        "📂 Charger JSON partagé",
     "synthesis.system":        "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
     "common.no":           "Non",
     // Tooltips des modes
+    "modes.tip":           "<strong>Quatorze façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.<br><strong>⚖️ Quant</strong>: prédit γ-shift et ΔPPL pour tout (modèle × schéma de quantification) ; recommande une alternative sûre en cas de cliff.<br><strong>🔀 Drift</strong>: même modèle, scores différents sur deux setups — bug ou bruit ? Prédit la bande de bruit numérique et signale les vrais bugs.<br><strong>🔍 NIAH→Reason</strong>: prédit les taux NIAH et reasoning multi-hop depuis l'architecture ; trouve le contexte sûr pour reasoning.",
     "profile.tip":         "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
     "compare.tip":         "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
     "help.v07.drift.title":        "🔀 跨框架 Drift 界",
     "help.v07.drift.body":         "同一模型，不同 setup 下分数不同。工具预测仅由数值噪声（dtype、framework、batch）允许的最大 drift。若观测差距超过它 → 真实 bug，通常是 chat-template mismatch（lm-eval-harness issue #1841）或 KV-cache 布局。试试 &quot;加载样本&quot; 按钮看典型的 chat-template bug。<em>用例</em>：在报告回归或声称可复现性之前，验证两个评估之间的差距是否大于数值噪声能解释的范围。",
     "inv.v07.drift":               "<strong>🔀 Drift</strong> — bug 还是噪声？预测两个评估间的最大可允许差距",
+    "help.v07.niah.title":         "🔍 NIAH → Reasoning Gap",
+    "help.v07.niah.body":          "RULER 论文（NVIDIA 2024）显示长上下文模型经常通过 NIAH（needle 检索）但在相同上下文上多跳 reasoning 失败。工具仅根据架构（γ_Padé + d_horizon + 架构压力：小 d_head、GQA、SWA）预测两种通过率，报告 gap，并找到模型 reasoning 保持 ≥65% 的\"安全 reasoning 上下文\"。扫描模式显示在 1k/4k/16k/64k/T_train 的曲线。<em>用例</em>：在声称的上下文部署之前，搞清楚模型是真的能在那里 reasoning 还是只能检索。",
+    "inv.v07.niah":                "<strong>🔍 NIAH→Reason</strong> — 你的\"128k 上下文\"真的能在那里 reasoning，还是只能检索？",
     // v0.7 — Inventory 模态第 5 卡
     "inv.v07.title":               "🆕 v0.7 anti-bullshit 套件",
     "drift.status.empty_scores":   "⚠ 输入两个分数。",
     "drift.status.done":           "✅ 判定：{verdict}",
     "drift.status.sample_loaded":  "✅ 样本已加载（典型 chat-template bug）。点击计算 drift 界。",
+    // v0.7.6 — anti-bullshit pack #7: NIAH → reasoning gap 预测器
+    "modes.niah":                  "🔍 NIAH→Reason",
+    "mode_desc.niah":              "在任意上下文下预测 NIAH（检索）与多跳 reasoning 通过率。解决：长上下文模型常常通过 NIAH 但在同一上下文上 reasoning 失败（RULER 论文）。",
+    "niah.title":                  "🔍 NIAH → Reasoning Gap",
+    "niah.tip":                    "NIAH（Needle in a Haystack）测试检索：\"在长文本中找到这个事实\"。多跳 reasoning 测试推理：\"把开头的事实 X+Y 与结尾的事实 Z 结合\"。RULER 论文（NVIDIA 2024）显示长上下文模型经常通过 NIAH 但在相同上下文上 reasoning 失败。本工具仅根据架构预测两种通过率。",
+    "niah.desc":                   "<strong>你的模型声称 128k 上下文。它在 64k 是真的能 reasoning，还是只能检索？</strong>粘贴 HF 模型 id 和目标 eval 上下文 — 工具预测 NIAH 与多跳 reasoning 通过率、gap，以及 reasoning 保持 ≥65% 的 \"安全上下文\"。",
+    "niah.id_label":               "HF 模型 id：",
+    "niah.fetch_btn":              "📥 获取 config",
+    "niah.teval_label":            "目标上下文 (T_eval)：",
+    "niah.run_btn":                "🔍 预测",
+    "niah.sweep_btn":              "📊 扫描上下文",
+    "niah.label.niah":             "NIAH 通过率",
+    "niah.label.reasoning":        "Reasoning 通过率",
+    "niah.label.gap":              "Gap",
+    "niah.label.safe_ctx":         "Reasoning 安全上下文",
+    "niah.section.breakdown":      "架构细节",
+    "niah.section.reco":           "建议",
+    "niah.section.sweep":          "按上下文长度扫描通过率",
+    "niah.field.dhorizon":         "d_horizon（有效）",
+    "niah.field.ratio":            "T_eval / d_horizon",
+    "niah.field.arch_pressure":    "架构压力（小 d_head + GQA + SWA）",
+    "niah.field.theta":            "RoPE θ",
+    "niah.field.t_train":          "T_train（声称）",
+    "niah.col.context":            "T_eval",
+    "niah.col.niah":               "NIAH",
+    "niah.col.reasoning":          "Reasoning",
+    "niah.col.gap":                "Gap",
+    "niah.col.verdict":            "判定",
+    "niah.verdict.robust":         "✅ 稳健",
+    "niah.verdict.marginal":       "⚠ 边缘",
+    "niah.verdict.degraded":       "⚠ 退化",
+    "niah.verdict.retrieval_only": "❌ 仅检索",
+    "niah.verdict.broken":         "❌ 失效",
+    "niah.reco.robust":            "在此上下文下检索与 reasoning 都稳定。可安全部署用于查询和推理任务。",
+    "niah.reco.marginal":          "边缘。检索可用但 reasoning 不稳。用于事实查询，不要用于多步推理。",
+    "niah.reco.degraded":          "Reasoning 显著下降。模型能找到事实但难以组合它们。在此长度下避免多跳任务。",
+    "niah.reco.retrieval_only":    "RULER 的典型发现：模型通过 NIAH 但 reasoning 失败。适用于 RAG 设置（LLM 仅定位事实），不适用于链式推理。把上下文降到下方的 \"安全\" 值。",
+    "niah.reco.broken":            "在此上下文下模型连基本检索都失败。视为 out-of-distribution — 在更短上下文重测。",
+    "niah.safe_context":           "≤ {ctx} tokens（reasoning ≥ 65%）",
+    "niah.safe_context_none":      "在你的目标以下没找到安全上下文 — 模型即使在小上下文也 reasoning 失败。",
+    "niah.summary.sweep":          "<code>{modelId}</code> — 按上下文的通过率",
+    "niah.status.empty_id":        "⚠ 输入 model id（例如 meta-llama/Llama-3.1-8B-Instruct）。",
+    "niah.status.bad_teval":       "⚠ 输入目标上下文（≥ 512 tokens）。",
+    "niah.status.fetching":        "⏳ 正在获取 {modelId} 的 config.json...",
+    "niah.status.fetched":         "✅ 已获取 {modelId} 的 config。设置 T_eval 并点击预测（或扫描上下文）。",
+    "niah.status.done":            "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
+    "niah.status.sweep_done":      "✅ 已扫描 {n} 个上下文长度。",
     "share.import_desc":       "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
     "share.import_btn":        "📂 加载共享的 JSON",
     "synthesis.system":        "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
     "common.no":           "否",
     // 模式提示
+    "modes.tip":           "<strong>十四种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings（SWA / YaRN / RoPE 缩放）。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间；检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期，对 20+ benchmark 进行污染概率评估。<br><strong>⚖️ Quant</strong>: 预测任意（模型 × 量化方案）的 γ-shift 与 ΔPPL；cliff 时推荐更安全替代方案。<br><strong>🔀 Drift</strong>: 同一模型，两 setup 下分数不同 — bug 还是噪声？预测数值噪声区间并标记真实 bug。<br><strong>🔍 NIAH→Reason</strong>: 从架构预测 NIAH 与多跳 reasoning 通过率；找到模型的安全 reasoning 上下文。",
     "profile.tip":         "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
     "compare.tip":         "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
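
Several of the new niah.* strings carry named placeholders ({modelId}, {ctx}, {n}, {verdict}) that main.js fills through tFmt. The implementation of tFmt is not part of this diff, so the following is only a minimal sketch of how such interpolation could work. The names formatTemplate, lookupAndFormat and STRINGS are hypothetical and do not come from the repository.

```javascript
// Hypothetical sketch, not the repository's tFmt: fills {name} placeholders
// in a translated string and leaves unknown placeholders untouched.
function formatTemplate(template, params = {}) {
  return String(template).replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(params, key) ? String(params[key]) : match
  );
}

// Hypothetical lookup: falls back to the supplied text (or the key itself)
// when a translation is missing, then interpolates the parameters.
function lookupAndFormat(strings, key, params = {}, fallback = key) {
  const template = Object.prototype.hasOwnProperty.call(strings, key)
    ? strings[key]
    : fallback;
  return formatTemplate(template, params);
}

// Sample table using one string that appears in the diff above.
const STRINGS = {
  "niah.safe_context": "≤ {ctx} tokens (reasoning ≥ 65%)",
};

console.log(lookupAndFormat(STRINGS, "niah.safe_context", { ctx: 8192 }));
```

Leaving unmatched placeholders visible, rather than replacing them with an empty string, makes a missing parameter easy to spot in the UI during testing.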

js/main.js CHANGED

@@ -18,6 +18,7 @@ import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";
 import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
 import { attachAllHfAutocompletes } from "./hf_autocomplete.js";
 import { computeDriftBound, FRAMEWORKS as DRIFT_FRAMEWORKS, DTYPES as DRIFT_DTYPES } from "./cross_drift.js";
 // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
 // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -198,7 +199,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
      "profile-section", "compare-section", "inspector-section",
      "diagnose-section", "phase-section", "unmask-section",
      "template-section", "arena-section", "contam-section",
-     "quant-section", "drift-section"].forEach(id => {
       const el = $(id);
       if (el) el.style.display = "none";
     });
@@ -208,7 +209,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
       compare: "compare-section", inspector: "inspector-section",
       diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
       template: "template-section", arena: "arena-section", contam: "contam-section",
-      quant: "quant-section", drift: "drift-section",
     };
     const sectionId = sectionMap[mode];
     if (sectionId) $(sectionId).style.display = "";
@@ -1315,6 +1316,163 @@ populateDriftDropdowns();
 $("drift-run-btn")?.addEventListener("click", runDriftCompute);
 $("drift-sample-btn")?.addEventListener("click", loadDriftSample);
 function configToPreset(cfg, modelId) {
   const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
   const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;

 import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
 import { attachAllHfAutocompletes } from "./hf_autocomplete.js";
 import { computeDriftBound, FRAMEWORKS as DRIFT_FRAMEWORKS, DTYPES as DRIFT_DTYPES } from "./cross_drift.js";
+import { predictNIAHReasoning, sweepContextLengths } from "./niah_reasoning.js";
 // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
 // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
      "profile-section", "compare-section", "inspector-section",
      "diagnose-section", "phase-section", "unmask-section",
      "template-section", "arena-section", "contam-section",
+     "quant-section", "drift-section", "niah-section"].forEach(id => {
       const el = $(id);
       if (el) el.style.display = "none";
     });
       compare: "compare-section", inspector: "inspector-section",
       diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
       template: "template-section", arena: "arena-section", contam: "contam-section",
+      quant: "quant-section", drift: "drift-section", niah: "niah-section",
     };
     const sectionId = sectionMap[mode];
     if (sectionId) $(sectionId).style.display = "";
 $("drift-run-btn")?.addEventListener("click", runDriftCompute);
 $("drift-sample-btn")?.addEventListener("click", loadDriftSample);
+// ════════════════════════════════════════════════════════════════════
+// 🔍 NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7)
+// ════════════════════════════════════════════════════════════════════
+const NIAH_VERDICT_COLOR = {
+  robust:         "#3fb950",
+  marginal:       "#f1c40f",
+  degraded:       "#f1c40f",
+  retrieval_only: "#f85149",
+  broken:         "#f85149",
+};
+let __niahLastConfig = null;
+let __niahLastModelId = null;
+async function niahFetchConfig() {
+  const modelId = ($("niah-id").value || "").trim();
+  if (!modelId) {
+    $("niah-status").textContent = t("niah.status.empty_id") || "⚠ Enter a model id.";
+    return null;
+  }
+  $("niah-status").textContent = tFmt("niah.status.fetching", { modelId });
+  $("niah-fetch-btn").disabled = true;
+  try {
+    const cfg = await fetchHfConfig(modelId);
+    __niahLastConfig = cfg;
+    __niahLastModelId = modelId;
+    $("niah-status").textContent = tFmt("niah.status.fetched", { modelId });
+    return cfg;
+  } catch (err) {
+    if (err.code === "gated") {
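+      // Escape the id before interpolating it into innerHTML: it originates from user input.
+      const safeId = String(err.modelId).replace(/[&<>"']/g, c =>
+        ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
+      $("niah-status").innerHTML = `🔒 <strong>${safeId}</strong> ${t("hf_auto.gated_msg") || "is gated. Accept the license here:"} <a href="https://huggingface.co/${safeId}" target="_blank" rel="noopener">huggingface.co/${safeId}</a>`;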
+    } else {
+      $("niah-status").textContent = `❌ ${err.message}`;
+    }
+    return null;
+  } finally {
+    $("niah-fetch-btn").disabled = false;
+  }
+}
+function renderNIAHCard(result, modelId) {
+  const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+    ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
+  const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
+  const color = NIAH_VERDICT_COLOR[result.verdict] || "#8b949e";
+  const verdictLabel = t(`niah.verdict.${result.verdict}`) || result.verdict;
+  const reco = t(`niah.reco.${result.verdict}`) || "";
+  const safeText = result.safe_context
+    ? tFmt("niah.safe_context", { ctx: result.safe_context })
+    : (t("niah.safe_context_none") || "No safe context found below your target — model fails reasoning even at small contexts.");
+  return `
+    <div class="unmask-result">
+      <div class="unmask-hero" style="border-color: ${color};">
+        <div class="unmask-verdict" style="color: ${color};">${verdictLabel}</div>
+        <div class="unmask-model"><code>${escapeHtml(modelId)}</code> @ <code>${fmtN(result.T_eval)}</code> tokens</div>
+        <div class="unmask-numbers">
+          <div><span class="unmask-num-label">${t("niah.label.niah") || "NIAH pass rate"}</span><span class="unmask-num-val">${(result.niah_rate * 100).toFixed(0)}%</span></div>
+          <div><span class="unmask-num-label">${t("niah.label.reasoning") || "Reasoning pass rate"}</span><span class="unmask-num-val">${(result.reasoning_rate * 100).toFixed(0)}%</span></div>
+          <div><span class="unmask-num-label">${t("niah.label.gap") || "Gap"}</span><span class="unmask-num-val">${(result.gap * 100).toFixed(0)} pts</span></div>
+        </div>
+      </div>
+      <div class="unmask-details">
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${t("niah.section.breakdown") || "Architecture breakdown"}</summary>
+          <ul>
+            <li><strong>γ_Padé @ T_eval:</strong> ${result.gamma_pade}</li>
+            <li><strong>${t("niah.field.dhorizon") || "d_horizon (effective)"}:</strong> ${fmtN(result.d_horizon)} tokens</li>
+            <li><strong>${t("niah.field.ratio") || "T_eval / d_horizon"}:</strong> ${result.horizon_ratio}×</li>
+            <li><strong>${t("niah.field.arch_pressure") || "Arch pressure (small d_head + GQA + SWA)"}:</strong> ×${result.arch_pressure}</li>
+            <li><strong>${t("niah.field.theta") || "RoPE θ"}:</strong> ${fmtN(result.theta)}</li>
+            <li><strong>${t("niah.field.t_train") || "T_train (claimed)"}:</strong> ${fmtN(result.T_train)}</li>
+          </ul>
+        </details>
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${t("niah.section.reco") || "Recommendation"}</summary>
+          <p class="unmask-reco">${reco}</p>
+          <p class="unmask-reco"><strong>${t("niah.label.safe_ctx") || "Safe reasoning context"}:</strong> ${safeText}</p>
+        </details>
+      </div>
+    </div>
+  `;
+}
+function renderNIAHSweep(rows, modelId) {
+  const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+    ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
+  const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
+  let body = "";
+  for (const r of rows) {
+    const color = NIAH_VERDICT_COLOR[r.verdict] || "#8b949e";
+    const label = t(`niah.verdict.${r.verdict}`) || r.verdict;
+    body += `<tr>
+      <td><strong>${fmtN(r.T_eval)}</strong></td>
+      <td class="arena-elo">${(r.niah_rate * 100).toFixed(0)}%</td>
+      <td class="arena-elo">${(r.reasoning_rate * 100).toFixed(0)}%</td>
+      <td class="arena-spread">${(r.gap * 100).toFixed(0)} pts</td>
+      <td style="color: ${color};"><strong>${label}</strong></td>
+    </tr>`;
+  }
+  return `
+    <div class="arena-result">
+      <div class="unmask-hero" style="border-color: #58a6ff;">
+        <div class="unmask-verdict" style="color: #58a6ff;">${tFmt("niah.summary.sweep", { modelId: escapeHtml(modelId) })}</div>
+      </div>
+      <div class="unmask-details">
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${t("niah.section.sweep") || "Pass rate sweep across context lengths"}</summary>
+          <table class="arena-table">
+            <thead><tr>
+              <th>${t("niah.col.context") || "T_eval"}</th>
+              <th>${t("niah.col.niah") || "NIAH"}</th>
+              <th>${t("niah.col.reasoning") || "Reasoning"}</th>
+              <th>${t("niah.col.gap") || "Gap"}</th>
+              <th>${t("niah.col.verdict") || "Verdict"}</th>
+            </tr></thead>
+            <tbody>${body}</tbody>
+          </table>
+        </details>
+      </div>
+    </div>
+  `;
+}
+async function runNIAHPredict() {
+  const cfg = __niahLastConfig || await niahFetchConfig();
+  if (!cfg) return;
+  const T_eval = parseInt($("niah-teval").value, 10);
+  if (Number.isNaN(T_eval) || T_eval < 512) {
+    $("niah-status").textContent = t("niah.status.bad_teval") || "⚠ Enter a target context (≥512).";
+    return;
+  }
+  const result = predictNIAHReasoning(cfg, T_eval);
+  $("niah-output").innerHTML = renderNIAHCard(result, __niahLastModelId);
+  $("niah-status").textContent = tFmt("niah.status.done", {
+    verdict: t(`niah.verdict.${result.verdict}`) || result.verdict,
+    niah: (result.niah_rate * 100).toFixed(0),
+    reasoning: (result.reasoning_rate * 100).toFixed(0),
+  });
+}
+async function runNIAHSweep() {
+  const cfg = __niahLastConfig || await niahFetchConfig();
+  if (!cfg) return;
+  const rows = sweepContextLengths(cfg);
+  $("niah-output").innerHTML = renderNIAHSweep(rows, __niahLastModelId);
+  $("niah-status").textContent = tFmt("niah.status.sweep_done", { n: rows.length });
+}
+$("niah-fetch-btn")?.addEventListener("click", niahFetchConfig);
+$("niah-run-btn")?.addEventListener("click", runNIAHPredict);
+$("niah-sweep-btn")?.addEventListener("click", runNIAHSweep);
+$("niah-id")?.addEventListener("keydown", (e) => {
+  if (e.key === "Enter") { e.preventDefault(); niahFetchConfig(); }
+});
 function configToPreset(cfg, modelId) {
   const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
   const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
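
Note on the render helpers in this hunk: both build HTML strings that are assigned to innerHTML, so any user-supplied model id should pass through the escaping helper first. A minimal standalone sketch of that helper follows. The hostile id is a made-up input for illustration, and the named `map` constant is an assumption of this sketch (the diff inlines the object literal).

```javascript
// Standalone copy of the escaping helper defined inside renderNIAHCard /
// renderNIAHSweep. The `map` name is an assumption of this sketch.
function escapeHtml(s) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  // String(s) coerces numbers and other values before replacing.
  return String(s).replace(/[&<>"']/g, (c) => map[c]);
}

// Hypothetical hostile model id, for illustration only.
const hostileId = 'org/model"><img src=x onerror=alert(1)>';

// Escaped before interpolation, the id stays inert text inside <code>.
const safeCell = `<code>${escapeHtml(hostileId)}</code>`;
console.log(safeCell);
```

Escaping at the point of interpolation keeps each template safe regardless of where the id came from (text input, cached config, or an error object).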

js/niah_reasoning.js ADDED

	@@ -0,0 +1,138 @@

+// NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7)
+// Predicts pass rate at a given evaluation context for two tasks:
+//   - NIAH (Needle in a Haystack): single-fact retrieval, lenient
+//   - Multi-hop reasoning: chained inference, strict
+// The gap between the two is the dominant failure mode for "long context" claims.
+//
+// Calibration: rough empirical fit to RULER paper bands (NVIDIA 2024) +
+// observed degradation curves on Llama-3.1, Mistral, Qwen2.5 at 8k/16k/32k/64k.
+// Uses TAF's existing γ_Padé / d_horizon machinery for the architectural input.
+//
+// Pure logic — no human strings. Render via i18n in main.js.
+import { gammaPade, thetaEffPade } from "./gamma_check.js";
+// d_horizon ≈ effective attention horizon. Reproduces formula from
+// taf_browser.py / paper §sec:gamma_decomposition. For browser-only v1 use.
+function dHorizon(theta, gammaPredicted) {
+  if (gammaPredicted >= 1) return Infinity;
+  if (gammaPredicted <= 0) return theta;
+  // d_horizon ≈ θ × (1 + γ_predicted) / (1 - γ_predicted)
+  // Padé-canonical form (paper §sec:gamma_decomposition).
+  return theta * (1 + gammaPredicted) / (1 - gammaPredicted);
+}
+// Sigmoid-like pass rate vs. ratio = T_eval / d_horizon.
+// With k = 1.4 and the midpoint at ratio = 0.7, the curve evaluates to:
+//   ratio = 0.25 → ≈ 0.81 (well within horizon)
+//   ratio = 0.50 → ≈ 0.62
+//   ratio = 0.70 → 0.50 (midpoint)
+//   ratio = 1.00 → ≈ 0.38
+//   ratio = 2.00 → ≈ 0.19
+//   ratio = 4.00 → ≈ 0.08
+function niahRate(ratio) {
+  // Logistic on log-ratio: P = 1/(1+exp(k*(log(ratio)-log(0.7))))
+  const k = 1.4;
+  const center = Math.log(0.7);
+  const x = Math.log(Math.max(0.01, ratio));
+  return 1 / (1 + Math.exp(k * (x - center)));
+}
+// Multi-hop reasoning is strictly harder than NIAH. RULER paper shows ~30-50%
+// drop from NIAH-Single to multi-hop at long context. The gap grows with
+// architecture pressure (small d_head, aggressive GQA, SWA boundary).
+function reasoningPenalty(ratio, archPressure) {
+  // Base penalty grows with context ratio (more multi-hop steps required).
+  // archPressure ∈ [1.0, 1.6] from architecture (small d_head + GQA → higher).
+  const base = ratio < 0.5 ? 0.05 :
+               ratio < 1.0 ? 0.15 :
+               ratio < 2.0 ? 0.30 :
+               ratio < 4.0 ? 0.45 : 0.55;
+  return Math.min(0.7, base * archPressure);
+}
+function archPressureFromConfig(config) {
+  let p = 1.0;
+  const n_attn = config.num_attention_heads ?? null;
+  const n_kv   = config.num_key_value_heads ?? n_attn;
+  const hidden = config.hidden_size ?? null;
+  const d_head = config.head_dim ?? (n_attn && hidden ? hidden / n_attn : null);
+  if (d_head !== null) {
+    if (d_head < 64)  p *= 1.25;
+    else if (d_head < 96)  p *= 1.10;
+    else if (d_head < 128) p *= 1.03;
+  }
+  if (n_attn && n_kv && n_kv < n_attn) {
+    const ratio = n_attn / n_kv;
+    if (ratio >= 8)      p *= 1.15;
+    else if (ratio >= 4) p *= 1.08;
+  }
+  if (typeof config.sliding_window === "number" && config.sliding_window > 0) {
+    p *= 1.10; // SWA: cross-window reasoning costs extra
+  }
+  return Math.min(1.6, p);
+}
+export function predictNIAHReasoning(config, T_eval) {
+  const theta = config.rope_theta ?? 10000;
+  const T_train = config.max_position_embeddings ?? T_eval;
+  const gPade = gammaPade(theta, T_eval);
+  const dh = dHorizon(theta, gPade);
+  const ratio = dh === Infinity ? 0 : T_eval / dh;
+  const archPressure = archPressureFromConfig(config);
+  // Extrapolation penalty: models evaluated far beyond their training context
+  // degrade regardless of architecture (positions past T_train were never seen
+  // in training). Capped at 0.7 so the NIAH estimate is never zeroed out.
+  const extrapolation_ratio = T_train > 0 ? T_eval / T_train : 1;
+  const extrapolation_penalty = extrapolation_ratio > 1
+    ? Math.min(0.7, (extrapolation_ratio - 1) * 0.3)
+    : 0;
+  const niah = Math.max(0.02, niahRate(ratio) * (1 - extrapolation_penalty));
+  const penalty = reasoningPenalty(ratio, archPressure);
+  const reasoning = Math.max(0.02, niah * (1 - penalty));
+  const gap = niah - reasoning;
+  // Verdict bands
+  let verdict;
+  if (niah < 0.35)                           verdict = "broken";        // model can't even retrieve
+  else if (gap >= 0.30)                       verdict = "retrieval_only"; // canonical RULER finding
+  else if (gap >= 0.15)                       verdict = "degraded";
+  else if (niah >= 0.70 && reasoning >= 0.55) verdict = "robust";
+  else                                        verdict = "marginal";
+  // Find a "safe" context where reasoning >= 0.65: doubling sweep from 1k up to
+  // T_eval, keeping the largest length that passes and stopping at the first failure.
+  let safeT = null;
+  for (let t = 1024; t <= T_eval; t *= 2) {
+    const gP = gammaPade(theta, t);
+    const dh2 = dHorizon(theta, gP);
+    const r = dh2 === Infinity ? 0 : t / dh2;
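+    // Mirror the extrapolation penalty from the main estimate when t > T_train.
+    const ext2 = T_train > 0 && t > T_train ? Math.min(0.7, (t / T_train - 1) * 0.3) : 0;
+    const niah2 = niahRate(r) * (1 - ext2);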
+    const reas2 = niah2 * (1 - reasoningPenalty(r, archPressure));
+    if (reas2 >= 0.65) safeT = t;
+    else break;
+  }
+  return {
+    T_eval,
+    T_train,
+    theta,
+    arch_pressure: Math.round(archPressure * 100) / 100,
+    gamma_pade: Math.round(gPade * 1000) / 1000,
+    d_horizon: dh === Infinity ? null : Math.round(dh),
+    horizon_ratio: Math.round(ratio * 100) / 100,
+    niah_rate: Math.round(niah * 100) / 100,
+    reasoning_rate: Math.round(reasoning * 100) / 100,
+    gap: Math.round(gap * 100) / 100,
+    verdict,
+    safe_context: safeT,
+  };
+}
+// Sweep across context lengths (1k, 4k, 16k, 64k, T_train; 128k if the config has none) so the user sees the curve.
+export function sweepContextLengths(config, lengths = null) {
+  const T_max = config.max_position_embeddings ?? 131072;
+  const defaults = lengths || [1024, 4096, 16384, 65536, T_max].filter((v, i, arr) =>
+    v <= T_max && arr.indexOf(v) === i
+  );
+  return defaults.map(T => predictNIAHReasoning(config, T));
+}
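
Worked example for the scoring path in this file: the sketch below copies the logistic, the reasoning penalty, and the extrapolation penalty from the diff, and evaluates them for a given horizon ratio. It takes the ratio directly because gammaPade and the d_horizon inputs live in gamma_check.js, which is not shown here. The helper names `extrapolationPenalty` and `predictFromRatio` are hypothetical and exist only for this illustration.

```javascript
// Constants and branches copied from js/niah_reasoning.js in this commit.
function niahRate(ratio) {
  const k = 1.4;
  const center = Math.log(0.7);
  const x = Math.log(Math.max(0.01, ratio));
  return 1 / (1 + Math.exp(k * (x - center)));
}

function reasoningPenalty(ratio, archPressure) {
  const base = ratio < 0.5 ? 0.05 :
               ratio < 1.0 ? 0.15 :
               ratio < 2.0 ? 0.30 :
               ratio < 4.0 ? 0.45 : 0.55;
  return Math.min(0.7, base * archPressure);
}

// Hypothetical helper: the diff computes this inline in predictNIAHReasoning.
function extrapolationPenalty(T_eval, T_train) {
  const r = T_train > 0 ? T_eval / T_train : 1;
  return r > 1 ? Math.min(0.7, (r - 1) * 0.3) : 0;
}

// Hypothetical helper: same scoring and verdict bands as predictNIAHReasoning,
// but with ratio = T_eval / d_horizon supplied by the caller.
function predictFromRatio(ratio, archPressure, T_eval, T_train) {
  const niah = Math.max(0.02, niahRate(ratio) * (1 - extrapolationPenalty(T_eval, T_train)));
  const reasoning = Math.max(0.02, niah * (1 - reasoningPenalty(ratio, archPressure)));
  const gap = niah - reasoning;
  let verdict;
  if (niah < 0.35) verdict = "broken";
  else if (gap >= 0.30) verdict = "retrieval_only";
  else if (gap >= 0.15) verdict = "degraded";
  else if (niah >= 0.70 && reasoning >= 0.55) verdict = "robust";
  else verdict = "marginal";
  return { niah, reasoning, gap, verdict };
}

// Print the NIAH curve at a few ratios, then one full prediction.
for (const ratio of [0.25, 0.5, 0.7, 1, 2, 4]) {
  console.log(`ratio ${ratio}: NIAH ${niahRate(ratio).toFixed(2)}`);
}
console.log(predictFromRatio(0.25, 1.0, 4096, 8192));
```

Because the logistic is centered at log(0.7), the NIAH estimate is exactly 0.5 when T_eval is 70% of d_horizon, and it falls below the 0.35 "broken" threshold a little past ratio = 1.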

style.css CHANGED

@@ -34,22 +34,48 @@
 }
 /* v0.7.5 — Cross-framework drift bound */
 .drift-grid {
   display: grid;
-  grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
   gap: 1em;
   margin: 0.6em 0 0.8em;
 }
 .drift-setup {
   border: 1px solid rgba(88, 166, 255, 0.3);
   border-radius: 8px;
   padding: 0.6em 1em 0.4em;
 }
 .drift-setup legend {
   padding: 0 0.5em;
   font-weight: 600;
   color: #58a6ff;
 }
 /* v0.7.4 — HF Hub autocomplete dropdown (attached to body) */
 .hf-autocomplete-dropdown {

 }
 /* v0.7.5 — Cross-framework drift bound */
+/* Drift section overrides the default 980px main width so the two-column form
+   has room without overlapping. Mobile (<800px) collapses to single column. */
+#drift-section {
+  max-width: 1100px;
+}
 .drift-grid {
   display: grid;
+  grid-template-columns: 1fr 1fr;
   gap: 1em;
   margin: 0.6em 0 0.8em;
+  align-items: start;
+}
+@media (max-width: 800px) {
+  .drift-grid { grid-template-columns: 1fr; }
 }
 .drift-setup {
   border: 1px solid rgba(88, 166, 255, 0.3);
   border-radius: 8px;
   padding: 0.6em 1em 0.4em;
+  min-width: 0;          /* fix: lets grid item shrink instead of overflowing */
+  box-sizing: border-box;
+  overflow: hidden;      /* clip any overlong content cleanly */
 }
 .drift-setup legend {
   padding: 0 0.5em;
   font-weight: 600;
   color: #58a6ff;
 }
+.drift-setup .form-row {
+  flex-wrap: wrap;
+  gap: 0.4em 0.6em;
+  margin: 0.35em 0;
+}
+.drift-setup .form-row label {
+  flex: 0 0 110px;
+  font-size: 0.92em;
+}
+.drift-setup .form-row input,
+.drift-setup .form-row select {
+  flex: 1 1 120px;
+  min-width: 0;
+}
 /* v0.7.4 — HF Hub autocomplete dropdown (attached to body) */
 .hf-autocomplete-dropdown {