v0.7.6: NIAH→reasoning predictor (anti-bullshit #7) + Drift CSS fix
NEW MODE 🔍 NIAH→Reason: pass rate predictor for retrieval vs reasoning at any context
- js/niah_reasoning.js: pure logic. Reuses gammaPade/d_horizon machinery.
- Inputs: HF model id (auto-fetch config) + target T_eval.
- Outputs: NIAH pass rate, multi-hop reasoning rate, gap, "safe context" where reasoning ≥ 65%, verdict (robust / marginal / degraded / retrieval_only / broken).
- Logistic on log(T_eval / d_horizon) for NIAH; arch-pressure-modulated penalty for reasoning. Extrapolation penalty when T_eval > T_train (positions beyond T_train were never seen in training).
- sweepContextLengths(): scans 1k / 4k / 16k / 64k / T_train so user sees the curve.
- Addresses the RULER paper (NVIDIA 2024) finding that long-context models often pass NIAH but fail reasoning at the same context. P6 in HF community pain analysis.
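A minimal sketch of the scoring described in the list above, not the code in js/niah_reasoning.js. It assumes d_horizon and an arch-pressure value in [0, 1] are already computed. The function names, the steepness, the penalty weight, and the T_train / T_eval extrapolation factor are all illustrative assumptions.

```javascript
// Hypothetical sketch: every constant and name below is an assumption,
// not a value taken from js/niah_reasoning.js.
function logistic(x) {
  return 1 / (1 + Math.exp(-x));
}

// NIAH pass rate: logistic on log(T_eval / d_horizon).
// Equals 0.5 when tEval === dHorizon and falls as tEval grows past it.
function niahPassRate(tEval, dHorizon, steepness = 2.0) {
  return logistic(-steepness * Math.log(tEval / dHorizon));
}

// Reasoning pass rate: the NIAH rate scaled down by an arch-pressure penalty.
// archPressure is clamped to [0, 1]; 0 means no penalty.
function reasoningPassRate(niah, archPressure, maxPenalty = 0.5) {
  const p = Math.min(Math.max(archPressure, 0), 1);
  return niah * (1 - maxPenalty * p);
}

// Extrapolation penalty: 1 inside the trained range,
// shrinking as T_eval exceeds T_train.
function extrapolationFactor(tEval, tTrain) {
  return tEval <= tTrain ? 1 : tTrain / tEval;
}

function predict({ tEval, dHorizon, tTrain, archPressure }) {
  const niah = niahPassRate(tEval, dHorizon) * extrapolationFactor(tEval, tTrain);
  const reasoning = reasoningPassRate(niah, archPressure);
  return { niah, reasoning, gap: niah - reasoning };
}
```

Working in log space makes the curve symmetric in ratios: halving or doubling T_eval relative to d_horizon moves the score by the same amount in opposite directions.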
VIRTUAL SIMULATION
- Llama-3.1-8B → robust at 8k / 64k / 128k (matches RULER findings).
- Mistral-7B (SWA 4096) → robust at 4k, broken at 16k+ (SWA limit kicks in).
- Pythia-160m at 4× T_train → broken (extrapolation penalty caps gain past trained positions).
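Per-context verdicts like the ones above could come from a sweep along these lines. This is a sketch under stated assumptions: only the verdict names, the 65% cutoff, and the 1k / 4k / 16k / 64k / T_train scan points come from the commit message. The other thresholds, `classify`, `sweepContexts`, `safeContext`, and the `predictFn` callback are hypothetical.

```javascript
// Only the 0.65 "safe reasoning" cutoff comes from the commit message;
// the other thresholds are illustrative assumptions.
const SAFE_REASONING = 0.65;

function classify(niah, reasoning) {
  if (niah < 0.5) return "broken";            // fails even basic retrieval
  if (reasoning >= 0.8) return "robust";      // retrieval and reasoning both hold
  if (reasoning >= SAFE_REASONING) return "marginal";
  if (reasoning >= 0.4) return "degraded";
  return "retrieval_only";                    // retrieves but does not reason
}

// Scan 1k / 4k / 16k / 64k / T_train.
// predictFn is a caller-supplied function: (tEval) => ({ niah, reasoning }).
function sweepContexts(tTrain, predictFn) {
  const points = [1024, 4096, 16384, 65536, tTrain];
  const unique = [...new Set(points)].sort((a, b) => a - b);
  return unique.map((tEval) => {
    const { niah, reasoning } = predictFn(tEval);
    return { tEval, niah, reasoning, gap: niah - reasoning, verdict: classify(niah, reasoning) };
  });
}

// Largest swept context whose reasoning rate stays at or above the cutoff,
// or null when no swept point qualifies.
function safeContext(rows) {
  const ok = rows.filter((r) => r.reasoning >= SAFE_REASONING);
  return ok.length ? ok[ok.length - 1].tEval : null;
}
```

Passing the predictor in as a callback keeps the sweep independent of how the pass rates are actually modelled.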
DRIFT CSS FIX
- #drift-section max-width 1100px (default 980px caused panel overflow).
- grid-template-columns: 1fr 1fr explicit (replaces auto-fit which overlapped).
- min-width: 0 + box-sizing on .drift-setup → fixes CSS Grid implicit auto min-width that caused overflow.
- Mobile <800px → single column.
- form-row inside drift-setup: flex-wrap, label width 110px, inputs flex 1 1 120px so values don't squish.
DOCUMENTATION
- Help modal v0.7 niah section in 4 langs.
- Inventory modal v0.7 card: 7th entry "🔍 NIAH→Reason".
- modes.tip → 14 modes in 4 langs.
- 681 i18n keys × 4 langs · 0 missing / 0 extra (49 new NIAH keys per lang: 44 niah.* plus modes.niah, mode_desc.niah, help.v07.niah.title/body, inv.v07.niah).
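A "0 missing / 0 extra" check could look like the sketch below. It assumes the translations object is shaped as `{ en: {...}, es: {...}, ... }`, with one flat key table per language. That shape and the helper name `checkI18nParity` are assumptions, not taken from js/i18n.js.

```javascript
// Hypothetical helper: compares each language's key set against a reference
// language and reports keys that are missing or extra.
function checkI18nParity(translations, referenceLang = "en") {
  const refKeys = new Set(Object.keys(translations[referenceLang]));
  const report = {};
  for (const [lang, table] of Object.entries(translations)) {
    const keys = new Set(Object.keys(table));
    report[lang] = {
      missing: [...refKeys].filter((k) => !keys.has(k)),
      extra: [...keys].filter((k) => !refKeys.has(k)),
    };
  }
  return report;
}
```

A run is clean when every language reports empty `missing` and `extra` arrays.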
33/33 sim passes — including extrapolation penalty calibration on Pythia.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- index.html +30 -0
- js/i18n.js +208 -4
- js/main.js +160 -2
- js/niah_reasoning.js +138 -0
- style.css +27 -1
|
@@ -210,6 +210,9 @@
|
|
| 210 |
<p><strong data-i18n="help.v07.drift.title">🔀 Cross-framework Drift Bound</strong></p>
|
| 211 |
<p data-i18n="help.v07.drift.body">Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.</p>
|
| 212 |
|
| 213 |
<h3 data-i18n="help.audit.title">The audit chain</h3>
|
| 214 |
<p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
|
| 215 |
output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
|
|
@@ -317,6 +320,7 @@
|
|
| 317 |
<li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
|
| 318 |
<li data-i18n="inv.v07.quant"><strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo</li>
|
| 319 |
<li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li>
|
| 320 |
</ul>
|
| 321 |
</details>
|
| 322 |
</div>
|
|
@@ -372,6 +376,7 @@
|
|
| 372 |
<button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
|
| 373 |
<button class="mode-btn" data-mode="quant" role="tab" aria-selected="false" data-i18n="modes.quant">⚖️ Quant</button>
|
| 374 |
<button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button>
|
| 375 |
</div>
|
| 376 |
<p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
|
| 377 |
<strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
|
|
@@ -871,6 +876,31 @@
|
|
| 871 |
<div id="drift-output" style="margin-top: 1em;"></div>
|
| 872 |
</section>
|
| 873 |
|
| 874 |
<!-- Recipe selector (mode=recipe) -->
|
| 875 |
<section id="recipe-section" style="display:none;">
|
| 876 |
<h2 data-i18n="recipe.title">📋 Recipe</h2>
|
|
|
|
| 210 |
<p><strong data-i18n="help.v07.drift.title">🔀 Cross-framework Drift Bound</strong></p>
|
| 211 |
<p data-i18n="help.v07.drift.body">Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.</p>
|
| 212 |
|
| 213 |
+
<p><strong data-i18n="help.v07.niah.title">🔍 NIAH → Reasoning Gap</strong></p>
|
| 214 |
+
<p data-i18n="help.v07.niah.body">RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's "safe reasoning context" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. <em>Use case</em>: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.</p>
|
| 215 |
+
|
| 216 |
<h3 data-i18n="help.audit.title">The audit chain</h3>
|
| 217 |
<p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
|
| 218 |
output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
|
|
|
|
| 320 |
<li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
|
| 321 |
<li data-i18n="inv.v07.quant"><strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo</li>
|
| 322 |
<li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li>
|
| 323 |
+
<li data-i18n="inv.v07.niah"><strong>🔍 NIAH→Reason</strong> — does your "128k context" actually reason there, or just retrieve?</li>
|
| 324 |
</ul>
|
| 325 |
</details>
|
| 326 |
</div>
|
|
|
|
| 376 |
<button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
|
| 377 |
<button class="mode-btn" data-mode="quant" role="tab" aria-selected="false" data-i18n="modes.quant">⚖️ Quant</button>
|
| 378 |
<button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button>
|
| 379 |
+
<button class="mode-btn" data-mode="niah" role="tab" aria-selected="false" data-i18n="modes.niah">🔍 NIAH→Reason</button>
|
| 380 |
</div>
|
| 381 |
<p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
|
| 382 |
<strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
|
|
|
|
| 876 |
<div id="drift-output" style="margin-top: 1em;"></div>
|
| 877 |
</section>
|
| 878 |
|
| 879 |
+
<!-- NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7) -->
|
| 880 |
+
<section id="niah-section" style="display:none;">
|
| 881 |
+
<h2><span data-i18n="niah.title">🔍 NIAH → Reasoning Gap</span>
|
| 882 |
+
<span class="info"><span class="tooltip" data-i18n="niah.tip">
|
| 883 |
+
NIAH (Needle in a Haystack) tests retrieval: "find this fact in long text". Multi-hop reasoning tests inference: "combine facts X+Y at the start with fact Z at the end". RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.
|
| 884 |
+
</span></span>
|
| 885 |
+
</h2>
|
| 886 |
+
<p class="recipe-desc" data-i18n="niah.desc">
|
| 887 |
+
<strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a "safe context" where reasoning stays ≥65%.
|
| 888 |
+
</p>
|
| 889 |
+
<div class="form-row">
|
| 890 |
+
<label for="niah-id" data-i18n="niah.id_label">HF model id:</label>
|
| 891 |
+
<input type="text" id="niah-id" placeholder="e.g. meta-llama/Llama-3.1-8B-Instruct" />
|
| 892 |
+
<button type="button" id="niah-fetch-btn" data-i18n="niah.fetch_btn">📥 Fetch config</button>
|
| 893 |
+
</div>
|
| 894 |
+
<div class="form-row">
|
| 895 |
+
<label for="niah-teval" data-i18n="niah.teval_label">Target context (T_eval):</label>
|
| 896 |
+
<input type="number" id="niah-teval" min="512" step="1024" value="32768" />
|
| 897 |
+
<button type="button" id="niah-run-btn" data-i18n="niah.run_btn">🔍 Predict</button>
|
| 898 |
+
<button type="button" id="niah-sweep-btn" class="secondary" data-i18n="niah.sweep_btn">📊 Sweep contexts</button>
|
| 899 |
+
</div>
|
| 900 |
+
<p id="niah-status" class="recipe-desc" style="font-size:0.92em;"></p>
|
| 901 |
+
<div id="niah-output" style="margin-top: 1em;"></div>
|
| 902 |
+
</section>
|
| 903 |
+
|
| 904 |
<!-- Recipe selector (mode=recipe) -->
|
| 905 |
<section id="recipe-section" style="display:none;">
|
| 906 |
<h2 data-i18n="recipe.title">📋 Recipe</h2>
|
|
@@ -307,6 +307,9 @@ export const TRANSLATIONS = {
|
|
| 307 |
"help.v07.drift.title": "🔀 Cross-framework Drift Bound",
|
| 308 |
"help.v07.drift.body": "Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.",
|
| 309 |
"inv.v07.drift": "<strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals",
|
| 310 |
|
| 311 |
// v0.7 — Inventory modal 5th card
|
| 312 |
"inv.v07.title": "🆕 v0.7 anti-bullshit pack",
|
|
@@ -414,6 +417,54 @@ export const TRANSLATIONS = {
|
|
| 414 |
"drift.status.empty_scores": "⚠ Enter both scores.",
|
| 415 |
"drift.status.done": "✅ Verdict: {verdict}",
|
| 416 |
"drift.status.sample_loaded": "✅ Sample loaded (canonical chat-template bug). Click Compute drift bound.",
|
| 417 |
"share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
|
| 418 |
"share.import_btn": "📂 Load shared JSON",
|
| 419 |
"synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
|
|
@@ -506,7 +557,7 @@ export const TRANSLATIONS = {
|
|
| 506 |
"common.no": "No",
|
| 507 |
|
| 508 |
// Mode tooltips
|
| 509 |
-
"modes.tip": "<strong>
|
| 510 |
"profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
|
| 511 |
"compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
|
| 512 |
|
|
@@ -1145,6 +1196,9 @@ export const TRANSLATIONS = {
|
|
| 1145 |
"help.v07.drift.title": "🔀 Cota de drift entre frameworks",
|
| 1146 |
"help.v07.drift.body": "Mismo modelo, scores distintos en setups distintos. La herramienta predice el drift máximo admisible solo por ruido numérico (dtype, framework, batch). Si el gap observado lo excede → bug real, normalmente chat-template mismatch (issue #1841 de lm-eval-harness) o layout de KV-cache. Prueba el botón "Cargar sample" para el bug canónico de chat-template. <em>Caso de uso</em>: antes de reportar una regresión o reclamar reproducibilidad, verifica si el gap entre dos evals es mayor de lo que el ruido numérico puede explicar.",
|
| 1147 |
"inv.v07.drift": "<strong>🔀 Drift</strong> — ¿bug o ruido? Predice el gap máximo admisible entre dos evals",
|
| 1148 |
|
| 1149 |
// v0.7 — Inventory modal 5ª card
|
| 1150 |
"inv.v07.title": "🆕 Pack anti-bullshit v0.7",
|
|
@@ -1252,6 +1306,54 @@ export const TRANSLATIONS = {
|
|
| 1252 |
"drift.status.empty_scores": "⚠ Introduce ambos scores.",
|
| 1253 |
"drift.status.done": "✅ Veredicto: {verdict}",
|
| 1254 |
"drift.status.sample_loaded": "✅ Sample cargado (bug canónico de chat-template). Click en Calcular cota de drift.",
|
| 1255 |
"share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
|
| 1256 |
"share.import_btn": "📂 Cargar JSON compartido",
|
| 1257 |
"synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
|
|
@@ -1344,7 +1446,7 @@ export const TRANSLATIONS = {
|
|
| 1344 |
"common.no": "No",
|
| 1345 |
|
| 1346 |
// Tooltips de modos
|
| 1347 |
-
"modes.tip": "<strong>
|
| 1348 |
"profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
|
| 1349 |
"compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
|
| 1350 |
|
|
@@ -1847,6 +1949,9 @@ export const TRANSLATIONS = {
|
|
| 1847 |
"help.v07.drift.title": "🔀 Borne de drift inter-frameworks",
|
| 1848 |
"help.v07.drift.body": "Même modèle, scores différents sur setups différents. L'outil prédit le drift max admissible dû au seul bruit numérique (dtype, framework, batch). Si l'écart observé le dépasse → vrai bug, généralement chat-template mismatch (issue #1841 lm-eval-harness) ou layout KV-cache. Essayez le bouton "Charger échantillon" pour le bug chat-template canonique. <em>Cas d'usage</em> : avant de reporter une régression ou de revendiquer la reproductibilité, vérifiez si l'écart entre deux évals est plus grand que ce que le bruit numérique peut expliquer.",
|
| 1849 |
"inv.v07.drift": "<strong>🔀 Drift</strong> — bug ou bruit ? Prédit l'écart max admissible entre deux évals",
|
| 1850 |
|
| 1851 |
// v0.7 — Inventory modal 5ème card
|
| 1852 |
"inv.v07.title": "🆕 Pack anti-bullshit v0.7",
|
|
@@ -1954,6 +2059,54 @@ export const TRANSLATIONS = {
|
|
| 1954 |
"drift.status.empty_scores": "⚠ Saisissez les deux scores.",
|
| 1955 |
"drift.status.done": "✅ Verdict : {verdict}",
|
| 1956 |
"drift.status.sample_loaded": "✅ Échantillon chargé (bug chat-template canonique). Cliquez sur Calculer la borne de drift.",
|
| 1957 |
"share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
|
| 1958 |
"share.import_btn": "📂 Charger JSON partagé",
|
| 1959 |
"synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
|
|
@@ -2046,7 +2199,7 @@ export const TRANSLATIONS = {
|
|
| 2046 |
"common.no": "Non",
|
| 2047 |
|
| 2048 |
// Tooltips des modes
|
| 2049 |
-
"modes.tip": "<strong>
|
| 2050 |
"profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
|
| 2051 |
"compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
|
| 2052 |
|
|
@@ -2549,6 +2702,9 @@ export const TRANSLATIONS = {
|
|
| 2549 |
"help.v07.drift.title": "🔀 跨框架 Drift 界",
|
| 2550 |
"help.v07.drift.body": "同一模型,不同 setup 下分数不同。工具预测仅由数值噪声(dtype、framework、batch)允许的最大 drift。若观测差距超过它 → 真实 bug,通常是 chat-template mismatch(lm-eval-harness issue #1841)或 KV-cache 布局。试试 "加载样本" 按钮看典型的 chat-template bug。<em>用例</em>:在报告回归或声称可复现性之前,验证两个评估之间的差距是否大于数值噪声能解释的范围。",
|
| 2551 |
"inv.v07.drift": "<strong>🔀 Drift</strong> — bug 还是噪声?预测两个评估间的最大可允许差距",
|
| 2552 |
|
| 2553 |
// v0.7 — Inventory 模态第 5 卡
|
| 2554 |
"inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
|
|
@@ -2656,6 +2812,54 @@ export const TRANSLATIONS = {
|
|
| 2656 |
"drift.status.empty_scores": "⚠ 输入两个分数。",
|
| 2657 |
"drift.status.done": "✅ 判定:{verdict}",
|
| 2658 |
"drift.status.sample_loaded": "✅ 样本已加载(典型 chat-template bug)。点击计算 drift 界。",
|
| 2659 |
"share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
|
| 2660 |
"share.import_btn": "📂 加载共享的 JSON",
|
| 2661 |
"synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
|
|
@@ -2748,7 +2952,7 @@ export const TRANSLATIONS = {
|
|
| 2748 |
"common.no": "否",
|
| 2749 |
|
| 2750 |
// 模式提示
|
| 2751 |
-
"modes.tip": "<strong>十
|
| 2752 |
"profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
|
| 2753 |
"compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
|
| 2754 |
|
|
|
|
| 307 |
"help.v07.drift.title": "🔀 Cross-framework Drift Bound",
|
| 308 |
"help.v07.drift.body": "Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.",
|
| 309 |
"inv.v07.drift": "<strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals",
|
| 310 |
+
"help.v07.niah.title": "🔍 NIAH → Reasoning Gap",
|
| 311 |
+
"help.v07.niah.body": "RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's \"safe reasoning context\" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. <em>Use case</em>: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.",
|
| 312 |
+
"inv.v07.niah": "<strong>🔍 NIAH→Reason</strong> — does your \"128k context\" actually reason there, or just retrieve?",
|
| 313 |
|
| 314 |
// v0.7 — Inventory modal 5th card
|
| 315 |
"inv.v07.title": "🆕 v0.7 anti-bullshit pack",
|
|
|
|
| 417 |
"drift.status.empty_scores": "⚠ Enter both scores.",
|
| 418 |
"drift.status.done": "✅ Verdict: {verdict}",
|
| 419 |
"drift.status.sample_loaded": "✅ Sample loaded (canonical chat-template bug). Click Compute drift bound.",
|
| 420 |
+
|
| 421 |
+
// v0.7.6 — anti-bullshit pack #7: NIAH → reasoning gap predictor
|
| 422 |
+
"modes.niah": "🔍 NIAH→Reason",
|
| 423 |
+
"mode_desc.niah": "Predicts NIAH (retrieval) and multi-hop reasoning pass rates at any context. Solves: long-context models often pass NIAH but fail reasoning at the same context (RULER paper).",
|
| 424 |
+
"niah.title": "🔍 NIAH → Reasoning Gap",
|
| 425 |
+
"niah.tip": "NIAH (Needle in a Haystack) tests retrieval: 'find this fact in long text'. Multi-hop reasoning tests inference: 'combine facts X+Y at the start with fact Z at the end'. RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.",
|
| 426 |
+
"niah.desc": "<strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a 'safe context' where reasoning stays ≥65%.",
|
| 427 |
+
"niah.id_label": "HF model id:",
|
| 428 |
+
"niah.fetch_btn": "📥 Fetch config",
|
| 429 |
+
"niah.teval_label": "Target context (T_eval):",
|
| 430 |
+
"niah.run_btn": "🔍 Predict",
|
| 431 |
+
"niah.sweep_btn": "📊 Sweep contexts",
|
| 432 |
+
"niah.label.niah": "NIAH pass rate",
|
| 433 |
+
"niah.label.reasoning": "Reasoning pass rate",
|
| 434 |
+
"niah.label.gap": "Gap",
|
| 435 |
+
"niah.label.safe_ctx": "Safe reasoning context",
|
| 436 |
+
"niah.section.breakdown": "Architecture breakdown",
|
| 437 |
+
"niah.section.reco": "Recommendation",
|
| 438 |
+
"niah.section.sweep": "Pass rate sweep across context lengths",
|
| 439 |
+
"niah.field.dhorizon": "d_horizon (effective)",
|
| 440 |
+
"niah.field.ratio": "T_eval / d_horizon",
|
| 441 |
+
"niah.field.arch_pressure": "Arch pressure (small d_head + GQA + SWA)",
|
| 442 |
+
"niah.field.theta": "RoPE θ",
|
| 443 |
+
"niah.field.t_train": "T_train (claimed)",
|
| 444 |
+
"niah.col.context": "T_eval",
|
| 445 |
+
"niah.col.niah": "NIAH",
|
| 446 |
+
"niah.col.reasoning": "Reasoning",
|
| 447 |
+
"niah.col.gap": "Gap",
|
| 448 |
+
"niah.col.verdict": "Verdict",
|
| 449 |
+
"niah.verdict.robust": "✅ ROBUST",
|
| 450 |
+
"niah.verdict.marginal": "⚠ MARGINAL",
|
| 451 |
+
"niah.verdict.degraded": "⚠ DEGRADED",
|
| 452 |
+
"niah.verdict.retrieval_only": "❌ RETRIEVAL-ONLY",
|
| 453 |
+
"niah.verdict.broken": "❌ BROKEN",
|
| 454 |
+
"niah.reco.robust": "Both retrieval and reasoning hold up at this context. Safe to deploy for both lookup and inference tasks.",
|
| 455 |
+
"niah.reco.marginal": "Borderline. Retrieval works but reasoning is shaky. Use for fact-lookup, not multi-step inference.",
|
| 456 |
+
"niah.reco.degraded": "Significant reasoning drop. The model can find facts but struggles to combine them. Avoid multi-hop tasks at this length.",
|
| 457 |
+
"niah.reco.retrieval_only": "Canonical RULER finding: model passes NIAH but fails reasoning. Useful for retrieval-augmented setups (where the LLM only locates facts) but NOT for chained inference. Cut your context to the 'safe' value below.",
|
| 458 |
+
"niah.reco.broken": "Model fails even basic retrieval at this context. Treat as out-of-distribution — re-test at a shorter context.",
|
| 459 |
+
"niah.safe_context": "≤ {ctx} tokens (reasoning ≥ 65%)",
|
| 460 |
+
"niah.safe_context_none": "No safe context found below your target — model fails reasoning even at small contexts.",
|
| 461 |
+
"niah.summary.sweep": "<code>{modelId}</code> — pass rates by context",
|
| 462 |
+
"niah.status.empty_id": "⚠ Enter a model id (e.g. meta-llama/Llama-3.1-8B-Instruct).",
|
| 463 |
+
"niah.status.bad_teval": "⚠ Enter a target context (≥ 512 tokens).",
|
| 464 |
+
"niah.status.fetching": "⏳ Fetching config.json for {modelId}...",
|
| 465 |
+
"niah.status.fetched": "✅ Config fetched for {modelId}. Set T_eval and click Predict (or Sweep contexts).",
|
| 466 |
+
"niah.status.done": "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
|
| 467 |
+
"niah.status.sweep_done": "✅ Swept {n} context lengths.",
|
| 468 |
"share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
|
| 469 |
"share.import_btn": "📂 Load shared JSON",
|
| 470 |
"synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
|
|
|
|
| 557 |
"common.no": "No",
|
| 558 |
|
| 559 |
// Mode tooltips
|
| 560 |
+
"modes.tip": "<strong>Fourteen ways to use the tool</strong>.<br><strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br><strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br><strong>🔍 Inspect config</strong>: paste raw config.json → full Profile.<br><strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br><strong>📋 Recipe</strong>: manual selection with full form control.<br><strong>🩺 Diagnose CLI</strong>: generate Python command for local γ measurement.<br><strong>📊 Phase diagram</strong>: 23-model panel on (log θ, γ) plane.<br><strong>🪟 Unmask</strong>: detect misleading max_position_embeddings (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detect family + give exact CLI flag for lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruct confidence intervals from raw pairwise vote data; detect statistical ties Arena hides.<br><strong>🧪 Contamination</strong>: rate 20+ benchmarks for contamination probability based on training cutoff vs release date.<br><strong>⚖️ Quant</strong>: predict γ-shift and ΔPPL for any (model × quant scheme); recommend safer alternative on cliff.<br><strong>🔀 Drift</strong>: same model, different scores on two setups — bug or noise? Predict numerical-noise band and flag real bugs.<br><strong>🔍 NIAH→Reason</strong>: predict NIAH and multi-hop reasoning pass rates from architecture; find your model's safe reasoning context.",
|
| 561 |
"profile.tip": "<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset). Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single <strong>TAF Card</strong> with verdict per dimension + key numbers + architecture classification.<br><br><strong>Use case</strong>: \"I'm evaluating Qwen2.5-32B for production — what's its full viability profile?\" → paste id → Profile → done.",
|
| 562 |
"compare.tip": "<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and one recipe. See verdicts in a single comparison table.<br><br><strong>Use case</strong>: \"I need long-context retrieval at 16K — which is best: Llama-3-8B, Mistral-7B, or Qwen-7B?\" → pick 3 + X-2 + 16K → see winner.",
|
| 563 |
|
|
|
|
| 1196 |
"help.v07.drift.title": "🔀 Cota de drift entre frameworks",
|
| 1197 |
"help.v07.drift.body": "Mismo modelo, scores distintos en setups distintos. La herramienta predice el drift máximo admisible solo por ruido numérico (dtype, framework, batch). Si el gap observado lo excede → bug real, normalmente chat-template mismatch (issue #1841 de lm-eval-harness) o layout de KV-cache. Prueba el botón "Cargar sample" para el bug canónico de chat-template. <em>Caso de uso</em>: antes de reportar una regresión o reclamar reproducibilidad, verifica si el gap entre dos evals es mayor de lo que el ruido numérico puede explicar.",
|
| 1198 |
"inv.v07.drift": "<strong>🔀 Drift</strong> — ¿bug o ruido? Predice el gap máximo admisible entre dos evals",
|
| 1199 |
+
"help.v07.niah.title": "🔍 Gap NIAH → Reasoning",
|
| 1200 |
+
"help.v07.niah.body": "El paper RULER (NVIDIA 2024) muestra que modelos long-context a menudo pasan NIAH (retrieval de needle) pero fallan reasoning multi-hop al mismo contexto. La herramienta predice ambas tasas de pass desde la arquitectura (γ_Padé + d_horizon + presión arq: d_head pequeño, GQA, SWA), reporta el gap, y encuentra el \"contexto seguro de reasoning\" donde reasoning se mantiene ≥65%. Modo barrido muestra la curva a 1k/4k/16k/64k/T_train. <em>Caso de uso</em>: antes de desplegar al contexto declarado, descubre si el modelo realmente razonará ahí o solo encontrará.",
|
| 1201 |
+
"inv.v07.niah": "<strong>🔍 NIAH→Reason</strong> — ¿tu \"128k\" realmente razona ahí, o solo encuentra?",
|
| 1202 |
|
| 1203 |
// v0.7 — Inventory modal 5ª card
|
| 1204 |
"inv.v07.title": "🆕 Pack anti-bullshit v0.7",
|
|
|
|
| 1306 |
"drift.status.empty_scores": "⚠ Introduce ambos scores.",
|
| 1307 |
"drift.status.done": "✅ Veredicto: {verdict}",
|
| 1308 |
"drift.status.sample_loaded": "✅ Sample cargado (bug canónico de chat-template). Click en Calcular cota de drift.",
|
| 1309 |
+
|
| 1310 |
+
// v0.7.6 — anti-bullshit pack #7: NIAH → predictor de gap de reasoning
|
| 1311 |
+
"modes.niah": "🔍 NIAH→Reason",
|
| 1312 |
+
"mode_desc.niah": "Predice tasas de pass de NIAH (retrieval) y reasoning multi-hop a cualquier contexto. Resuelve: modelos long-context pasan NIAH pero fallan reasoning al mismo contexto (paper RULER).",
|
| 1313 |
+
"niah.title": "🔍 Gap NIAH → Reasoning",
|
| 1314 |
+
"niah.tip": "NIAH (Needle in a Haystack) testea retrieval: 'encuentra este hecho en texto largo'. Reasoning multi-hop testea inferencia: 'combina hechos X+Y del principio con hecho Z del final'. El paper RULER (NVIDIA 2024) muestra que modelos long-context a menudo pasan NIAH pero fallan reasoning al mismo contexto. Esta herramienta predice ambas tasas desde la arquitectura sola.",
|
| 1315 |
+
"niah.desc": "<strong>Tu modelo dice 128k de contexto. ¿Razonará realmente a 64k, o solo encontrará?</strong> Pega un model id HF y un contexto objetivo — la herramienta predice tasas de pass NIAH y reasoning multi-hop, el gap, y un 'contexto seguro' donde reasoning se mantiene ≥65%.",
|
| 1316 |
+
"niah.id_label": "ID modelo HF:",
|
| 1317 |
+
"niah.fetch_btn": "📥 Fetch config",
|
| 1318 |
+
"niah.teval_label": "Contexto objetivo (T_eval):",
|
| 1319 |
+
"niah.run_btn": "🔍 Predecir",
|
| 1320 |
+
"niah.sweep_btn": "📊 Barrer contextos",
|
| 1321 |
+
"niah.label.niah": "Tasa pass NIAH",
|
| 1322 |
+
"niah.label.reasoning": "Tasa pass Reasoning",
|
| 1323 |
+
"niah.label.gap": "Gap",
|
| 1324 |
+
"niah.label.safe_ctx": "Contexto seguro de reasoning",
|
| 1325 |
+
"niah.section.breakdown": "Desglose arquitectónico",
|
| 1326 |
+
"niah.section.reco": "Recomendación",
|
| 1327 |
+
"niah.section.sweep": "Barrido de tasas pass por longitud de contexto",
|
| 1328 |
+
"niah.field.dhorizon": "d_horizon (efectivo)",
|
| 1329 |
+
"niah.field.ratio": "T_eval / d_horizon",
|
| 1330 |
+
"niah.field.arch_pressure": "Presión arq (d_head pequeño + GQA + SWA)",
|
| 1331 |
+
"niah.field.theta": "RoPE θ",
|
| 1332 |
+
"niah.field.t_train": "T_train (declarado)",
|
| 1333 |
+
"niah.col.context": "T_eval",
|
| 1334 |
+
"niah.col.niah": "NIAH",
|
| 1335 |
+
"niah.col.reasoning": "Reasoning",
|
| 1336 |
+
"niah.col.gap": "Gap",
|
| 1337 |
+
"niah.col.verdict": "Veredicto",
|
| 1338 |
+
"niah.verdict.robust": "✅ ROBUSTO",
|
| 1339 |
+
"niah.verdict.marginal": "⚠ MARGINAL",
|
| 1340 |
+
"niah.verdict.degraded": "⚠ DEGRADADO",
|
| 1341 |
+
"niah.verdict.retrieval_only": "❌ SOLO RETRIEVAL",
|
| 1342 |
+
"niah.verdict.broken": "❌ ROTO",
|
| 1343 |
+
"niah.reco.robust": "Tanto retrieval como reasoning aguantan a este contexto. Seguro para desplegar tareas de lookup e inferencia.",
|
| 1344 |
+
"niah.reco.marginal": "Borderline. Retrieval funciona pero reasoning está flojo. Úsalo para lookup, no para inferencia multi-paso.",
|
| 1345 |
+
"niah.reco.degraded": "Caída significativa de reasoning. El modelo encuentra hechos pero le cuesta combinarlos. Evita tareas multi-hop a esta longitud.",
|
| 1346 |
+
"niah.reco.retrieval_only": "Hallazgo canónico de RULER: el modelo pasa NIAH pero falla reasoning. Útil para setups RAG (donde el LLM solo localiza hechos) pero NO para inferencia encadenada. Reduce tu contexto al valor 'seguro' de abajo.",
|
| 1347 |
+
"niah.reco.broken": "El modelo falla incluso retrieval básico a este contexto. Trátalo como out-of-distribution — re-testea a contexto más corto.",
|
| 1348 |
+
"niah.safe_context": "≤ {ctx} tokens (reasoning ≥ 65%)",
|
| 1349 |
+
"niah.safe_context_none": "No se encontró contexto seguro bajo tu objetivo — el modelo falla reasoning incluso a contextos pequeños.",
|
| 1350 |
+
"niah.summary.sweep": "<code>{modelId}</code> — tasas pass por contexto",
|
| 1351 |
+
"niah.status.empty_id": "⚠ Introduce un model id (ej. meta-llama/Llama-3.1-8B-Instruct).",
|
| 1352 |
+
"niah.status.bad_teval": "⚠ Introduce un contexto objetivo (≥ 512 tokens).",
|
| 1353 |
+
"niah.status.fetching": "⏳ Obteniendo config.json para {modelId}...",
|
| 1354 |
+
"niah.status.fetched": "✅ Config obtenido para {modelId}. Pon T_eval y click Predecir (o Barrer contextos).",
|
| 1355 |
+
"niah.status.done": "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
|
| 1356 |
+
"niah.status.sweep_done": "✅ Barridos {n} largos de contexto.",
|
| 1357 |
"share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
|
| 1358 |
"share.import_btn": "📂 Cargar JSON compartido",
|
| 1359 |
"synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
|
|
|
|
| 1446 |
"common.no": "No",
|
| 1447 |
|
| 1448 |
// Tooltips de modos
|
| 1449 |
+
"modes.tip": "<strong>Catorce formas de usar la herramienta</strong>.<br><strong>📇 Perfil</strong>: pega un id → TAF Card de 5 recetas.<br><strong>🆚 Comparar</strong>: 2-3 modelos lado a lado en una receta.<br><strong>🔍 Inspeccionar config</strong>: pega config.json crudo → Perfil completo.<br><strong>💬 Pregunta</strong>: pregunta libre, el LLM del navegador elige la receta.<br><strong>📋 Receta</strong>: selección manual con control total del formulario.<br><strong>🩺 Diagnóstico CLI</strong>: genera comando Python para medir γ localmente.<br><strong>📊 Diagrama de fase</strong>: panel de 23 modelos en plano (log θ, γ).<br><strong>🪟 Desenmascarar</strong>: detecta max_position_embeddings engañoso (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: detecta familia + da el flag CLI exacto para lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruye intervalos de confianza desde votos pairwise crudos; detecta empates estadísticos que Arena oculta.<br><strong>🧪 Contaminación</strong>: puntúa 20+ benchmarks por probabilidad de contaminación según cutoff de entrenamiento vs fecha de release.<br><strong>⚖️ Quant</strong>: predice γ-shift y ΔPPL para cualquier (modelo × esquema de cuantización); recomienda alternativa segura si hay cliff.<br><strong>🔀 Drift</strong>: mismo modelo, scores distintos en dos setups — ¿bug o ruido? Predice banda de ruido numérico y flagea bugs reales.<br><strong>🔍 NIAH→Reason</strong>: predice tasas pass NIAH y reasoning multi-hop desde arquitectura; encuentra el contexto seguro de reasoning.",
|
| 1450 |
"profile.tip": "<strong>Diagnóstico completo en un click</strong>. Pega cualquier id de modelo HF (o elige preset). La herramienta ejecuta las 5 recetas (contexto largo, compresión KV, custom vs API, presupuesto, hardware) y produce una única <strong>TAF Card</strong> con veredicto por dimensión + números clave + clasificación arquitectónica.<br><br><strong>Caso de uso</strong>: \"Estoy evaluando Qwen2.5-32B para producción — ¿cuál es su perfil completo de viabilidad?\" → pega id → Perfilar → listo.",
|
| 1451 |
"compare.tip": "<strong>Misma receta, múltiples modelos</strong>. Elige 2-3 modelos candidatos y una receta. Ve los veredictos en una única tabla comparativa.<br><br><strong>Caso de uso</strong>: \"Necesito recuperación de contexto largo a 16K — ¿cuál es mejor: Llama-3-8B, Mistral-7B o Qwen-7B?\" → elige 3 + X-2 + 16K → ve el ganador.",
|
| 1452 |
|
|
|
|
| 1949 |
"help.v07.drift.title": "🔀 Borne de drift inter-frameworks",
|
| 1950 |
"help.v07.drift.body": "Même modèle, scores différents sur setups différents. L'outil prédit le drift max admissible dû au seul bruit numérique (dtype, framework, batch). Si l'écart observé le dépasse → vrai bug, généralement chat-template mismatch (issue #1841 lm-eval-harness) ou layout KV-cache. Essayez le bouton \"Charger échantillon\" pour le bug chat-template canonique. <em>Cas d'usage</em> : avant de reporter une régression ou de revendiquer la reproductibilité, vérifiez si l'écart entre deux évals est plus grand que ce que le bruit numérique peut expliquer.",
|
| 1951 |
"inv.v07.drift": "<strong>🔀 Drift</strong> — bug ou bruit ? Prédit l'écart max admissible entre deux évals",
|
| 1952 |
+
"help.v07.niah.title": "🔍 Gap NIAH → Reasoning",
|
| 1953 |
+
"help.v07.niah.body": "Le paper RULER (NVIDIA 2024) montre que les modèles long-context passent souvent NIAH (retrieval de needle) mais échouent au reasoning multi-hop au même contexte. L'outil prédit les deux taux de réussite à partir de l'architecture (γ_Padé + d_horizon + pression arch : petit d_head, GQA, SWA), reporte le gap, et trouve le \"contexte sûr pour reasoning\" où le reasoning reste ≥65%. Mode balayage montre la courbe à 1k/4k/16k/64k/T_train. <em>Cas d'usage</em> : avant de déployer au contexte revendiqué, découvrez si le modèle va vraiment raisonner là ou seulement retrouver.",
|
| 1954 |
+
"inv.v07.niah": "<strong>🔍 NIAH→Reason</strong> — votre \"128k\" raisonne-t-il vraiment là, ou seulement retrouve ?",
|
| 1955 |
|
| 1956 |
// v0.7 — Inventory modal 5ème card
|
| 1957 |
"inv.v07.title": "🆕 Pack anti-bullshit v0.7",
|
|
|
|
| 2059 |
"drift.status.empty_scores": "⚠ Saisissez les deux scores.",
|
| 2060 |
"drift.status.done": "✅ Verdict : {verdict}",
|
| 2061 |
"drift.status.sample_loaded": "✅ Échantillon chargé (bug chat-template canonique). Cliquez sur Calculer la borne de drift.",
|
| 2062 |
+
|
| 2063 |
+
// v0.7.6 — anti-bullshit pack #7: prédicteur de gap NIAH → reasoning
|
| 2064 |
+
"modes.niah": "🔍 NIAH→Reason",
|
| 2065 |
+
"mode_desc.niah": "Prédit les taux de réussite NIAH (retrieval) et reasoning multi-hop à n'importe quel contexte. Résout : les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte (paper RULER).",
|
| 2066 |
+
"niah.title": "🔍 Gap NIAH → Reasoning",
|
| 2067 |
+
"niah.tip": "NIAH (Needle in a Haystack) teste le retrieval : 'trouve ce fait dans un long texte'. Le reasoning multi-hop teste l'inférence : 'combine les faits X+Y au début avec le fait Z à la fin'. Le paper RULER (NVIDIA 2024) montre que les modèles long-context passent souvent NIAH mais échouent au reasoning au même contexte. Cet outil prédit les deux taux à partir de la seule architecture.",
|
| 2068 |
+
"niah.desc": "<strong>Votre modèle revendique 128k de contexte. Va-t-il vraiment raisonner à 64k, ou seulement retrouver ?</strong> Collez un model id HF et un contexte cible — l'outil prédit les taux de réussite NIAH et reasoning multi-hop, le gap, et un 'contexte sûr' où le reasoning reste ≥65%.",
|
| 2069 |
+
"niah.id_label": "ID modèle HF :",
|
| 2070 |
+
"niah.fetch_btn": "📥 Récupérer config",
|
| 2071 |
+
"niah.teval_label": "Contexte cible (T_eval) :",
|
| 2072 |
+
"niah.run_btn": "🔍 Prédire",
|
| 2073 |
+
"niah.sweep_btn": "📊 Balayer les contextes",
|
| 2074 |
+
"niah.label.niah": "Taux NIAH",
|
| 2075 |
+
"niah.label.reasoning": "Taux Reasoning",
|
| 2076 |
+
"niah.label.gap": "Gap",
|
| 2077 |
+
"niah.label.safe_ctx": "Contexte sûr pour reasoning",
|
| 2078 |
+
"niah.section.breakdown": "Détail architectural",
|
| 2079 |
+
"niah.section.reco": "Recommandation",
|
| 2080 |
+
"niah.section.sweep": "Balayage des taux par longueur de contexte",
|
| 2081 |
+
"niah.field.dhorizon": "d_horizon (effectif)",
|
| 2082 |
+
"niah.field.ratio": "T_eval / d_horizon",
|
| 2083 |
+
"niah.field.arch_pressure": "Pression arch (petit d_head + GQA + SWA)",
|
| 2084 |
+
"niah.field.theta": "RoPE θ",
|
| 2085 |
+
"niah.field.t_train": "T_train (revendiqué)",
|
| 2086 |
+
"niah.col.context": "T_eval",
|
| 2087 |
+
"niah.col.niah": "NIAH",
|
| 2088 |
+
"niah.col.reasoning": "Reasoning",
|
| 2089 |
+
"niah.col.gap": "Gap",
|
| 2090 |
+
"niah.col.verdict": "Verdict",
|
| 2091 |
+
"niah.verdict.robust": "✅ ROBUSTE",
|
| 2092 |
+
"niah.verdict.marginal": "⚠ MARGINAL",
|
| 2093 |
+
"niah.verdict.degraded": "⚠ DÉGRADÉ",
|
| 2094 |
+
"niah.verdict.retrieval_only": "❌ RETRIEVAL UNIQUEMENT",
|
| 2095 |
+
"niah.verdict.broken": "❌ CASSÉ",
|
| 2096 |
+
"niah.reco.robust": "Retrieval et reasoning tiennent tous deux à ce contexte. Sûr de déployer pour les tâches de lookup et d'inférence.",
|
| 2097 |
+
"niah.reco.marginal": "Borderline. Le retrieval fonctionne mais le reasoning est fragile. À utiliser pour le lookup, pas pour l'inférence multi-étapes.",
|
| 2098 |
+
"niah.reco.degraded": "Chute significative du reasoning. Le modèle trouve des faits mais peine à les combiner. Évitez les tâches multi-hop à cette longueur.",
|
| 2099 |
+
"niah.reco.retrieval_only": "Constat canonique de RULER : le modèle passe NIAH mais échoue au reasoning. Utile pour les setups RAG (où le LLM ne fait que localiser les faits) mais PAS pour l'inférence chaînée. Réduisez votre contexte à la valeur 'sûre' ci-dessous.",
|
| 2100 |
+
"niah.reco.broken": "Le modèle échoue même au retrieval basique à ce contexte. Traitez-le comme hors-distribution — re-testez à un contexte plus court.",
|
| 2101 |
+
"niah.safe_context": "≤ {ctx} tokens (reasoning ≥ 65%)",
|
| 2102 |
+
"niah.safe_context_none": "Aucun contexte sûr trouvé sous votre cible — le modèle échoue au reasoning même à de petits contextes.",
|
| 2103 |
+
"niah.summary.sweep": "<code>{modelId}</code> — taux par contexte",
|
| 2104 |
+
"niah.status.empty_id": "⚠ Saisissez un model id (ex. meta-llama/Llama-3.1-8B-Instruct).",
|
| 2105 |
+
"niah.status.bad_teval": "⚠ Saisissez un contexte cible (≥ 512 tokens).",
|
| 2106 |
+
"niah.status.fetching": "⏳ Récupération config.json pour {modelId}...",
|
| 2107 |
+
"niah.status.fetched": "✅ Config récupéré pour {modelId}. Réglez T_eval et cliquez Prédire (ou Balayer les contextes).",
|
| 2108 |
+
"niah.status.done": "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
|
| 2109 |
+
"niah.status.sweep_done": "✅ Balayé {n} longueurs de contexte.",
|
| 2110 |
"share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
|
| 2111 |
"share.import_btn": "📂 Charger JSON partagé",
|
| 2112 |
"synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
|
|
|
|
| 2199 |
"common.no": "Non",
|
| 2200 |
|
| 2201 |
// Tooltips des modes
|
| 2202 |
+
"modes.tip": "<strong>Quatorze façons d'utiliser l'outil</strong>.<br><strong>📇 Profil</strong>: collez un id → TAF Card avec 5 recettes.<br><strong>🆚 Comparer</strong>: 2-3 modèles côte à côte sur une recette.<br><strong>🔍 Inspecter config</strong>: collez config.json brut → Profil complet.<br><strong>💬 Question</strong>: question libre, le LLM du navigateur choisit la recette.<br><strong>📋 Recette</strong>: sélection manuelle avec contrôle total du formulaire.<br><strong>🩺 Diagnostic CLI</strong>: génère commande Python pour mesurer γ localement.<br><strong>📊 Diagramme de phase</strong>: panel de 23 modèles dans le plan (log θ, γ).<br><strong>🪟 Démasquer</strong>: détecte un max_position_embeddings trompeur (SWA / YaRN / RoPE-scaling).<br><strong>📜 Chat-template</strong>: détecte la famille + donne le flag CLI exact pour lm-eval / vLLM / transformers.<br><strong>🎯 Arena CI</strong>: reconstruit les intervalles de confiance depuis les votes pairwise bruts ; détecte les égalités statistiques qu'Arena cache.<br><strong>🧪 Contamination</strong>: note 20+ benchmarks pour leur probabilité de contamination selon le cutoff d'entraînement vs la date de sortie.<br><strong>⚖️ Quant</strong>: prédit γ-shift et ΔPPL pour tout (modèle × schéma de quantification) ; recommande une alternative sûre en cas de cliff.<br><strong>🔀 Drift</strong>: même modèle, scores différents sur deux setups — bug ou bruit ? Prédit la bande de bruit numérique et signale les vrais bugs.<br><strong>🔍 NIAH→Reason</strong>: prédit les taux NIAH et reasoning multi-hop depuis l'architecture ; trouve le contexte sûr pour reasoning.",
|
| 2203 |
"profile.tip": "<strong>Diagnostic complet en un clic</strong>. Collez n'importe quel id de modèle HF (ou choisissez préréglage). L'outil exécute les 5 recettes (contexte long, compression KV, custom vs API, budget, hardware) et produit une <strong>TAF Card</strong> unique avec verdict par dimension + nombres clés + classification architecturale.<br><br><strong>Cas d'usage</strong>: « J'évalue Qwen2.5-32B pour la production — quel est son profil complet de viabilité ? » → collez id → Profiler → fait.",
|
| 2204 |
"compare.tip": "<strong>Même recette, plusieurs modèles</strong>. Choisissez 2-3 modèles candidats et une recette. Voyez les verdicts dans un seul tableau comparatif.<br><br><strong>Cas d'usage</strong>: « J'ai besoin de récupération longue contexte à 16K — quel est le meilleur : Llama-3-8B, Mistral-7B ou Qwen-7B ? » → choisissez 3 + X-2 + 16K → voyez le gagnant.",
|
| 2205 |
|
|
|
|
| 2702 |
"help.v07.drift.title": "🔀 跨框架 Drift 界",
|
| 2703 |
"help.v07.drift.body": "同一模型,不同 setup 下分数不同。工具预测仅由数值噪声(dtype、framework、batch)允许的最大 drift。若观测差距超过它 → 真实 bug,通常是 chat-template mismatch(lm-eval-harness issue #1841)或 KV-cache 布局。试试 \"加载样本\" 按钮看典型的 chat-template bug。<em>用例</em>:在报告回归或声称可复现性之前,验证两个评估之间的差距是否大于数值噪声能解释的范围。",
|
| 2704 |
"inv.v07.drift": "<strong>🔀 Drift</strong> — bug 还是噪声?预测两个评估间的最大可允许差距",
|
| 2705 |
+
"help.v07.niah.title": "🔍 NIAH → Reasoning Gap",
|
| 2706 |
+
"help.v07.niah.body": "RULER 论文(NVIDIA 2024)显示长上下文模型经常通过 NIAH(needle 检索)但在相同上下文上多跳 reasoning 失败。工具仅根据架构(γ_Padé + d_horizon + 架构压力:小 d_head、GQA、SWA)预测两种通过率,报告 gap,并找到模型 reasoning 保持 ≥65% 的\"安全 reasoning 上下文\"。扫描模式显示在 1k/4k/16k/64k/T_train 的曲线。<em>用例</em>:在声称的上下文部署之前,搞清楚模型是真的能在那里 reasoning 还是只能检索。",
|
| 2707 |
+
"inv.v07.niah": "<strong>🔍 NIAH→Reason</strong> — 你的\"128k 上下文\"真的能在那里 reasoning,还是只能检索?",
|
| 2708 |
|
| 2709 |
// v0.7 — Inventory 模态第 5 卡
|
| 2710 |
"inv.v07.title": "🆕 v0.7 anti-bullshit 套件",
|
|
|
|
| 2812 |
"drift.status.empty_scores": "⚠ 输入两个分数。",
|
| 2813 |
"drift.status.done": "✅ 判定:{verdict}",
|
| 2814 |
"drift.status.sample_loaded": "✅ 样本已加载(典型 chat-template bug)。点击计算 drift 界。",
|
| 2815 |
+
|
| 2816 |
+
// v0.7.6 — anti-bullshit pack #7: NIAH → reasoning gap 预测器
|
| 2817 |
+
"modes.niah": "🔍 NIAH→Reason",
|
| 2818 |
+
"mode_desc.niah": "在任意上下文下预测 NIAH(检索)与多跳 reasoning 通过率。解决:长上下文模型常常通过 NIAH 但在同一上下文上 reasoning 失败(RULER 论文)。",
|
| 2819 |
+
"niah.title": "🔍 NIAH → Reasoning Gap",
|
| 2820 |
+
"niah.tip": "NIAH(Needle in a Haystack)测试检索:\"在长文本中找到这个事实\"。多跳 reasoning 测试推理:\"把开头的事实 X+Y 与结尾的事实 Z 结合\"。RULER 论文(NVIDIA 2024)显示长上下文模型经常通过 NIAH 但在相同上下文上 reasoning 失败。本工具仅根据架构预测两种通过率。",
|
| 2821 |
+
"niah.desc": "<strong>你的模型声称 128k 上下文。它在 64k 是真的能 reasoning,还是只能检索?</strong>粘贴 HF 模型 id 和目标 eval 上下文 — 工具预测 NIAH 与多跳 reasoning 通过率、gap,以及 reasoning 保持 ≥65% 的 \"安全上下文\"。",
|
| 2822 |
+
"niah.id_label": "HF 模型 id:",
|
| 2823 |
+
"niah.fetch_btn": "📥 获取 config",
|
| 2824 |
+
"niah.teval_label": "目标上下文 (T_eval):",
|
| 2825 |
+
"niah.run_btn": "🔍 预测",
|
| 2826 |
+
"niah.sweep_btn": "📊 扫描上下文",
|
| 2827 |
+
"niah.label.niah": "NIAH 通过率",
|
| 2828 |
+
"niah.label.reasoning": "Reasoning 通过率",
|
| 2829 |
+
"niah.label.gap": "Gap",
|
| 2830 |
+
"niah.label.safe_ctx": "Reasoning 安全上下文",
|
| 2831 |
+
"niah.section.breakdown": "架构细节",
|
| 2832 |
+
"niah.section.reco": "建议",
|
| 2833 |
+
"niah.section.sweep": "按上下文长度扫描通过率",
|
| 2834 |
+
"niah.field.dhorizon": "d_horizon(有效)",
|
| 2835 |
+
"niah.field.ratio": "T_eval / d_horizon",
|
| 2836 |
+
"niah.field.arch_pressure": "架构压力(小 d_head + GQA + SWA)",
|
| 2837 |
+
"niah.field.theta": "RoPE θ",
|
| 2838 |
+
"niah.field.t_train": "T_train(声称)",
|
| 2839 |
+
"niah.col.context": "T_eval",
|
| 2840 |
+
"niah.col.niah": "NIAH",
|
| 2841 |
+
"niah.col.reasoning": "Reasoning",
|
| 2842 |
+
"niah.col.gap": "Gap",
|
| 2843 |
+
"niah.col.verdict": "判定",
|
| 2844 |
+
"niah.verdict.robust": "✅ 稳健",
|
| 2845 |
+
"niah.verdict.marginal": "⚠ 边缘",
|
| 2846 |
+
"niah.verdict.degraded": "⚠ 退化",
|
| 2847 |
+
"niah.verdict.retrieval_only": "❌ 仅检索",
|
| 2848 |
+
"niah.verdict.broken": "❌ 失效",
|
| 2849 |
+
"niah.reco.robust": "在此上下文下检索与 reasoning 都稳定。可安全部署用于查询和推理任务。",
|
| 2850 |
+
"niah.reco.marginal": "边缘。检索可用但 reasoning 不稳。用于事实查询,不要用于多步推理。",
|
| 2851 |
+
"niah.reco.degraded": "Reasoning 显著下降。模型能找到事实但难以组合它们。在此长度下避免多跳任务。",
|
| 2852 |
+
"niah.reco.retrieval_only": "RULER 的典型发现:模型通过 NIAH 但 reasoning 失败。适用于 RAG 设置(LLM 仅定位事实),不适用于链式推理。把上下文降到下方的 \"安全\" 值。",
|
| 2853 |
+
"niah.reco.broken": "在此上下文下模型连基本检索都失败。视为 out-of-distribution — 在更短上下文重测。",
|
| 2854 |
+
"niah.safe_context": "≤ {ctx} tokens(reasoning ≥ 65%)",
|
| 2855 |
+
"niah.safe_context_none": "在你的目标以下没找到安全上下文 — 模型即使在小上下文也 reasoning 失败。",
|
| 2856 |
+
"niah.summary.sweep": "<code>{modelId}</code> — 按上下文的通过率",
|
| 2857 |
+
"niah.status.empty_id": "⚠ 输入 model id(例如 meta-llama/Llama-3.1-8B-Instruct)。",
|
| 2858 |
+
"niah.status.bad_teval": "⚠ 输入目标上下文(≥ 512 tokens)。",
|
| 2859 |
+
"niah.status.fetching": "⏳ 正在获取 {modelId} 的 config.json...",
|
| 2860 |
+
"niah.status.fetched": "✅ 已获取 {modelId} 的 config。设置 T_eval 并点击预测(或扫描上下文)。",
|
| 2861 |
+
"niah.status.done": "✅ {verdict} — NIAH {niah}% · reasoning {reasoning}%",
|
| 2862 |
+
"niah.status.sweep_done": "✅ 已扫描 {n} 个上下文长度。",
|
| 2863 |
"share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
|
| 2864 |
"share.import_btn": "📂 加载共享的 JSON",
|
| 2865 |
"synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",
|
|
|
|
| 2952 |
"common.no": "否",
|
| 2953 |
|
| 2954 |
// 模式提示
|
| 2955 |
+
"modes.tip": "<strong>十四种使用方式</strong>。<br><strong>📇 画像</strong>: 粘贴模型 id → 5 个配方的 TAF 卡。<br><strong>🆚 比较</strong>: 2-3 个模型在一个配方上并排比较。<br><strong>🔍 检查 config</strong>: 粘贴原始 config.json → 完整画像。<br><strong>💬 提问</strong>: 自由形式问题,浏览器 LLM 选择配方。<br><strong>📋 配方</strong>: 手动选择,完全控制表单。<br><strong>🩺 CLI 诊断</strong>: 生成 Python 命令在本地测量 γ。<br><strong>📊 相图</strong>: 23 个面板模型在 (log θ, γ) 平面上。<br><strong>🪟 揭示</strong>: 检测误导的 max_position_embeddings(SWA / YaRN / RoPE 缩放)。<br><strong>📜 Chat-template</strong>: 检测系列 + 给出 lm-eval / vLLM / transformers 的精确 CLI flag。<br><strong>🎯 Arena CI</strong>: 从原始 pairwise 投票数据重建置信区间;检测 Arena 隐藏的统计并列。<br><strong>🧪 污染</strong>: 根据训练 cutoff 与发布日期,对 20+ benchmark 进行污染概率评估。<br><strong>⚖️ Quant</strong>: 预测任意(模型 × 量化方案)的 γ-shift 与 ΔPPL;cliff 时推荐更安全替代方案。<br><strong>🔀 Drift</strong>: 同一模型,两 setup 下分数不同 — bug 还是噪声?预测数值噪声区间并标记真实 bug。<br><strong>🔍 NIAH→Reason</strong>: 从架构预测 NIAH 与多跳 reasoning 通过率;找到模型的安全 reasoning 上下文。",
|
| 2956 |
"profile.tip": "<strong>一键完整诊断</strong>。粘贴任意 HF 模型 id (或选择预设)。工具运行所有 5 个配方 (长上下文、KV 压缩、自定义 vs API、预算、硬件),生成单个 <strong>TAF 卡</strong>,显示每个维度的判定 + 关键数字 + 架构分类。<br><br><strong>用例</strong>: \"我正在为生产评估 Qwen2.5-32B — 它的完整可行性概况是什么?\" → 粘贴 id → 画像 → 完成。",
|
| 2957 |
"compare.tip": "<strong>同一配方,多个模型</strong>。选择 2-3 个候选模型和一个配方。在单个比较表中查看判定。<br><br><strong>用例</strong>: \"我需要在 16K 进行长上下文检索 — 哪个最好: Llama-3-8B、Mistral-7B 或 Qwen-7B?\" → 选择 3 个 + X-2 + 16K → 看赢家。",
|
| 2958 |
|
|
@@ -18,6 +18,7 @@ import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";
|
|
| 18 |
import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
|
| 19 |
import { attachAllHfAutocompletes } from "./hf_autocomplete.js";
|
| 20 |
import { computeDriftBound, FRAMEWORKS as DRIFT_FRAMEWORKS, DTYPES as DRIFT_DTYPES } from "./cross_drift.js";
|
|
|
|
| 21 |
|
| 22 |
// Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
|
| 23 |
// Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
|
|
@@ -198,7 +199,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
|
|
| 198 |
"profile-section", "compare-section", "inspector-section",
|
| 199 |
"diagnose-section", "phase-section", "unmask-section",
|
| 200 |
"template-section", "arena-section", "contam-section",
|
| 201 |
-
"quant-section", "drift-section"].forEach(id => {
|
| 202 |
const el = $(id);
|
| 203 |
if (el) el.style.display = "none";
|
| 204 |
});
|
|
@@ -208,7 +209,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
|
|
| 208 |
compare: "compare-section", inspector: "inspector-section",
|
| 209 |
diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
|
| 210 |
template: "template-section", arena: "arena-section", contam: "contam-section",
|
| 211 |
-
quant: "quant-section", drift: "drift-section",
|
| 212 |
};
|
| 213 |
const sectionId = sectionMap[mode];
|
| 214 |
if (sectionId) $(sectionId).style.display = "";
|
|
@@ -1315,6 +1316,163 @@ populateDriftDropdowns();
|
|
| 1315 |
$("drift-run-btn")?.addEventListener("click", runDriftCompute);
|
| 1316 |
$("drift-sample-btn")?.addEventListener("click", loadDriftSample);
|
| 1317 |
|
| 1318 |
function configToPreset(cfg, modelId) {
|
| 1319 |
const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
|
| 1320 |
const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
|
|
|
|
| 18 |
import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
|
| 19 |
import { attachAllHfAutocompletes } from "./hf_autocomplete.js";
|
| 20 |
import { computeDriftBound, FRAMEWORKS as DRIFT_FRAMEWORKS, DTYPES as DRIFT_DTYPES } from "./cross_drift.js";
|
| 21 |
+
import { predictNIAHReasoning, sweepContextLengths } from "./niah_reasoning.js";
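The body of `niah_reasoning.js` is not shown in this part of the diff. As a rough illustration of what a context sweep and a "safe context" scan could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the names `toyPredict`, `toySweep` and `toySafeContext`, the steepness constant, and the flat reasoning penalty are hypothetical and are not the real `predictNIAHReasoning` / `sweepContextLengths` logic. Only the result field names (`T_eval`, `niah_rate`, `reasoning_rate`, `gap`), the 1k/4k/16k/64k/T_train grid, and the 65% threshold are taken from the surrounding code and strings.

```javascript
// Toy stand-in for a per-context predictor. NOT the real niah_reasoning.js
// logic: the steepness and the flat reasoning penalty are made-up constants.
function toyPredict(tEval, dHorizon, steepness = 2, reasoningPenalty = 0.8) {
  const x = Math.log(tEval / dHorizon);               // log distance relative to the horizon
  const niahRate = 1 / (1 + Math.exp(steepness * x)); // logistic: near 1 well below d_horizon
  const reasoningRate = niahRate * reasoningPenalty;  // reasoning never beats retrieval in this toy
  return {
    T_eval: tEval,
    niah_rate: niahRate,
    reasoning_rate: reasoningRate,
    gap: niahRate - reasoningRate,
  };
}

// Evaluate the fixed grid below T_train, then T_train itself.
function toySweep(dHorizon, tTrain, grid = [1024, 4096, 16384, 65536]) {
  const lengths = [...grid.filter((t) => t < tTrain), tTrain];
  return lengths.map((t) => toyPredict(t, dHorizon));
}

// Largest swept context whose reasoning rate stays at or above the threshold,
// or null when no swept context qualifies.
function toySafeContext(rows, threshold = 0.65) {
  const ok = rows.filter((r) => r.reasoning_rate >= threshold);
  return ok.length ? Math.max(...ok.map((r) => r.T_eval)) : null;
}

const rows = toySweep(32768, 131072);
console.log(rows.map((r) => r.T_eval), toySafeContext(rows));
```

Returning `null` when nothing qualifies mirrors how the rendering code falls back to the `niah.safe_context_none` message when `result.safe_context` is falsy.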
|
| 22 |
|
| 23 |
// Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
|
| 24 |
// Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
|
|
|
|
| 199 |
"profile-section", "compare-section", "inspector-section",
|
| 200 |
"diagnose-section", "phase-section", "unmask-section",
|
| 201 |
"template-section", "arena-section", "contam-section",
|
| 202 |
+
"quant-section", "drift-section", "niah-section"].forEach(id => {
|
| 203 |
const el = $(id);
|
| 204 |
if (el) el.style.display = "none";
|
| 205 |
});
|
|
|
|
| 209 |
compare: "compare-section", inspector: "inspector-section",
|
| 210 |
diagnose: "diagnose-section", phase: "phase-section", unmask: "unmask-section",
|
| 211 |
template: "template-section", arena: "arena-section", contam: "contam-section",
|
| 212 |
+
quant: "quant-section", drift: "drift-section", niah: "niah-section",
|
| 213 |
};
|
| 214 |
const sectionId = sectionMap[mode];
|
| 215 |
if (sectionId) $(sectionId).style.display = "";
|
|
|
|
| 1316 |
$("drift-run-btn")?.addEventListener("click", runDriftCompute);
|
| 1317 |
$("drift-sample-btn")?.addEventListener("click", loadDriftSample);
|
| 1318 |
|
| 1319 |
+
// ════════════════════════════════════════════════════════════════════
|
| 1320 |
+
// 🔍 NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7)
|
| 1321 |
+
// ════════════════════════════════════════════════════════════════════
|
| 1322 |
+
|
| 1323 |
+
const NIAH_VERDICT_COLOR = {
|
| 1324 |
+
robust: "#3fb950",
|
| 1325 |
+
marginal: "#f1c40f",
|
| 1326 |
+
degraded: "#f1c40f",
|
| 1327 |
+
retrieval_only: "#f85149",
|
| 1328 |
+
broken: "#f85149",
|
| 1329 |
+
};
|
| 1330 |
+
|
| 1331 |
+
let __niahLastConfig = null;
|
| 1332 |
+
let __niahLastModelId = null;
|
| 1333 |
+
|
| 1334 |
+
async function niahFetchConfig() {
|
| 1335 |
+
const modelId = ($("niah-id").value || "").trim();
|
| 1336 |
+
if (!modelId) {
|
| 1337 |
+
$("niah-status").textContent = t("niah.status.empty_id") || "⚠ Enter a model id.";
|
| 1338 |
+
return null;
|
| 1339 |
+
}
|
| 1340 |
+
$("niah-status").textContent = tFmt("niah.status.fetching", { modelId });
|
| 1341 |
+
$("niah-fetch-btn").disabled = true;
|
| 1342 |
+
try {
|
| 1343 |
+
const cfg = await fetchHfConfig(modelId);
|
| 1344 |
+
__niahLastConfig = cfg;
|
| 1345 |
+
__niahLastModelId = modelId;
|
| 1346 |
+
$("niah-status").textContent = tFmt("niah.status.fetched", { modelId });
|
| 1347 |
+
return cfg;
|
| 1348 |
+
} catch (err) {
|
| 1349 |
+
if (err.code === "gated") {
|
| 1350 |
+
$("niah-status").innerHTML = `🔒 <strong>${err.modelId}</strong> ${t("hf_auto.gated_msg") || "is gated. Accept the license here:"} <a href="https://huggingface.co/${err.modelId}" target="_blank" rel="noopener">huggingface.co/${err.modelId}</a>`;
|
| 1351 |
+
} else {
|
| 1352 |
+
$("niah-status").textContent = `❌ ${err.message}`;
|
| 1353 |
+
}
|
| 1354 |
+
return null;
|
| 1355 |
+
} finally {
|
| 1356 |
+
$("niah-fetch-btn").disabled = false;
|
| 1357 |
+
}
|
| 1358 |
+
}
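The status messages above go through `tFmt(key, params)`, whose implementation is outside this diff. The i18n strings use `{name}` placeholders (for example `{modelId}` and `{ctx}`), so a helper of roughly the following shape would be enough. This is a hedged sketch: `formatTemplate` is a hypothetical name, and leaving unknown placeholders untouched is a design assumption, not something the source confirms.

```javascript
// Replace {name} placeholders with values from params.
// Placeholders with no matching key are left as-is so that missing
// parameters stay visible instead of silently becoming "undefined".
function formatTemplate(template, params = {}) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(params, key) ? String(params[key]) : match
  );
}

console.log(formatTemplate("≤ {ctx} tokens (reasoning ≥ 65%)", { ctx: 4096 }));
```

Note that a helper like this does no HTML escaping, so any value that ends up in `innerHTML` still has to be escaped by the caller.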
|
| 1359 |
+
|
| 1360 |
+
function renderNIAHCard(result, modelId) {
|
| 1361 |
+
const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
|
| 1362 |
+
({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
|
| 1363 |
+
const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
|
| 1364 |
+
const color = NIAH_VERDICT_COLOR[result.verdict] || "#8b949e";
|
| 1365 |
+
const verdictLabel = t(`niah.verdict.${result.verdict}`) || result.verdict;
|
| 1366 |
+
const reco = t(`niah.reco.${result.verdict}`) || "";
|
| 1367 |
+
const safeText = result.safe_context
|
| 1368 |
+
? tFmt("niah.safe_context", { ctx: result.safe_context })
|
| 1369 |
+
: (t("niah.safe_context_none") || "No safe context found below your target — model fails reasoning even at small contexts.");
+
+  return `
+    <div class="unmask-result">
+      <div class="unmask-hero" style="border-color: ${color};">
+        <div class="unmask-verdict" style="color: ${color};">${verdictLabel}</div>
+        <div class="unmask-model"><code>${escapeHtml(modelId)}</code> @ <code>${fmtN(result.T_eval)}</code> tokens</div>
+        <div class="unmask-numbers">
+          <div><span class="unmask-num-label">${t("niah.label.niah") || "NIAH pass rate"}</span><span class="unmask-num-val">${(result.niah_rate * 100).toFixed(0)}%</span></div>
+          <div><span class="unmask-num-label">${t("niah.label.reasoning") || "Reasoning pass rate"}</span><span class="unmask-num-val">${(result.reasoning_rate * 100).toFixed(0)}%</span></div>
+          <div><span class="unmask-num-label">${t("niah.label.gap") || "Gap"}</span><span class="unmask-num-val">${(result.gap * 100).toFixed(0)} pts</span></div>
+        </div>
+      </div>
+      <div class="unmask-details">
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${t("niah.section.breakdown") || "Architecture breakdown"}</summary>
+          <ul>
+            <li><strong>γ_Padé @ T_eval:</strong> ${result.gamma_pade}</li>
+            <li><strong>${t("niah.field.dhorizon") || "d_horizon (effective)"}:</strong> ${fmtN(result.d_horizon)} tokens</li>
+            <li><strong>${t("niah.field.ratio") || "T_eval / d_horizon"}:</strong> ${result.horizon_ratio}×</li>
+            <li><strong>${t("niah.field.arch_pressure") || "Arch pressure (small d_head + GQA + SWA)"}:</strong> ×${result.arch_pressure}</li>
+            <li><strong>${t("niah.field.theta") || "RoPE θ"}:</strong> ${fmtN(result.theta)}</li>
+            <li><strong>${t("niah.field.t_train") || "T_train (claimed)"}:</strong> ${fmtN(result.T_train)}</li>
+          </ul>
+        </details>
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${t("niah.section.reco") || "Recommendation"}</summary>
+          <p class="unmask-reco">${reco}</p>
+          <p class="unmask-reco"><strong>${t("niah.label.safe_ctx") || "Safe reasoning context"}:</strong> ${safeText}</p>
+        </details>
+      </div>
+    </div>
+  `;
+}
+
+function renderNIAHSweep(rows, modelId) {
+  // Escape HTML-significant characters before anything reaches innerHTML.
+  const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
+    ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
+  const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
+  let body = "";
+  for (const r of rows) {
+    const color = NIAH_VERDICT_COLOR[r.verdict] || "#8b949e";
+    const label = t(`niah.verdict.${r.verdict}`) || r.verdict;
+    body += `<tr>
+      <td><strong>${fmtN(r.T_eval)}</strong></td>
+      <td class="arena-elo">${(r.niah_rate * 100).toFixed(0)}%</td>
+      <td class="arena-elo">${(r.reasoning_rate * 100).toFixed(0)}%</td>
+      <td class="arena-spread">${(r.gap * 100).toFixed(0)} pts</td>
+      <td style="color: ${color};"><strong>${label}</strong></td>
+    </tr>`;
+  }
+  return `
+    <div class="arena-result">
+      <div class="unmask-hero" style="border-color: #58a6ff;">
+        <div class="unmask-verdict" style="color: #58a6ff;">${tFmt("niah.summary.sweep", { modelId: escapeHtml(modelId) })}</div>
+      </div>
+      <div class="unmask-details">
+        <details class="unmask-panel" open>
+          <summary class="unmask-panel-title">${t("niah.section.sweep") || "Pass rate sweep across context lengths"}</summary>
+          <table class="arena-table">
+            <thead><tr>
+              <th>${t("niah.col.context") || "T_eval"}</th>
+              <th>${t("niah.col.niah") || "NIAH"}</th>
+              <th>${t("niah.col.reasoning") || "Reasoning"}</th>
+              <th>${t("niah.col.gap") || "Gap"}</th>
+              <th>${t("niah.col.verdict") || "Verdict"}</th>
+            </tr></thead>
+            <tbody>${body}</tbody>
+          </table>
+        </details>
+      </div>
+    </div>
+  `;
+}
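The render helpers write into `innerHTML`, so the model id typed by the user has to be escaped first. A minimal standalone sketch of the same replace-by-lookup idea follows; the `ENTITIES` name and the sample input string are made up for illustration and are not part of the commit.

```javascript
// Map each HTML-significant character to its entity; other characters pass through.
const ENTITIES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
const escapeHtml = (s) => String(s).replace(/[&<>"']/g, (c) => ENTITIES[c]);

// Hypothetical hostile "model id" typed into the input box.
const hostile = `org/model"><img src=x onerror=alert(1)>`;

// After escaping, the string contains no raw angle brackets or quotes,
// so the browser renders it as text instead of parsing it as markup.
console.log(escapeHtml(hostile));
```

`String(s)` means non-string inputs such as `null` are rendered as text rather than throwing.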
+
+async function runNIAHPredict() {
+  const cfg = __niahLastConfig || await niahFetchConfig();
+  if (!cfg) return;
+  const T_eval = parseInt($("niah-teval").value, 10);
+  if (Number.isNaN(T_eval) || T_eval < 512) {
+    $("niah-status").textContent = t("niah.status.bad_teval") || "⚠ Enter a target context (≥512).";
+    return;
+  }
+  const result = predictNIAHReasoning(cfg, T_eval);
+  $("niah-output").innerHTML = renderNIAHCard(result, __niahLastModelId);
+  $("niah-status").textContent = tFmt("niah.status.done", {
+    verdict: t(`niah.verdict.${result.verdict}`) || result.verdict,
+    niah: (result.niah_rate * 100).toFixed(0),
+    reasoning: (result.reasoning_rate * 100).toFixed(0),
+  });
+}
+
+async function runNIAHSweep() {
+  const cfg = __niahLastConfig || await niahFetchConfig();
+  if (!cfg) return;
+  const rows = sweepContextLengths(cfg);
+  $("niah-output").innerHTML = renderNIAHSweep(rows, __niahLastModelId);
+  $("niah-status").textContent = tFmt("niah.status.sweep_done", { n: rows.length });
+}
+
+$("niah-fetch-btn")?.addEventListener("click", niahFetchConfig);
+$("niah-run-btn")?.addEventListener("click", runNIAHPredict);
+$("niah-sweep-btn")?.addEventListener("click", runNIAHSweep);
+$("niah-id")?.addEventListener("keydown", (e) => {
+  if (e.key === "Enter") { e.preventDefault(); niahFetchConfig(); }
+});
+
 function configToPreset(cfg, modelId) {
   const n_attn = cfg.num_attention_heads || cfg.n_head || 0;
   const n_kv = cfg.num_key_value_heads || cfg.num_attention_heads || cfg.n_head || 0;
@@ -0,0 +1,138 @@
+// NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7)
+// Predicts pass rate at a given evaluation context for two tasks:
+//   - NIAH (Needle in a Haystack): single-fact retrieval, lenient
+//   - Multi-hop reasoning: chained inference, strict
+// And the GAP — the dominant failure mode for "long context" claims.
+//
+// Calibration: rough empirical fit to RULER paper bands (NVIDIA 2024) +
+// observed degradation curves on Llama-3.1, Mistral, Qwen2.5 at 8k/16k/32k/64k.
+// Uses TAF's existing γ_Padé / d_horizon machinery for the architectural input.
+//
+// Pure logic — no human strings. Render via i18n in main.js.
+
+import { gammaPade, thetaEffPade } from "./gamma_check.js";
+
+// d_horizon ≈ effective attention horizon. Reproduces formula from
+// taf_browser.py / paper §sec:gamma_decomposition. For browser-only v1 use.
+function dHorizon(theta, gammaPredicted) {
+  if (gammaPredicted >= 1) return Infinity;
+  if (gammaPredicted <= 0) return theta;
+  // d_horizon ≈ θ × (1 + γ_predicted) / (1 - γ_predicted)
+  // Padé-canonical form (paper §sec:gamma_decomposition).
+  return theta * (1 + gammaPredicted) / (1 - gammaPredicted);
+}
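To make the two guard branches of `dHorizon` concrete, here is a small standalone sketch that copies the formula and evaluates it for a few γ values. The γ inputs are illustrative numbers, not outputs of `gammaPade`, whose implementation is not shown in this diff.

```javascript
// Standalone copy of the horizon formula so the snippet runs on its own.
function dHorizon(theta, gamma) {
  if (gamma >= 1) return Infinity; // denominator (1 - gamma) vanishes: unbounded horizon
  if (gamma <= 0) return theta;    // non-positive gamma: fall back to theta itself
  return theta * (1 + gamma) / (1 - gamma);
}

// Illustrative gamma values (assumed, not measured from any model).
for (const gamma of [-0.2, 0, 0.25, 0.5, 0.9, 1]) {
  console.log(`gamma=${gamma} -> d_horizon=${dHorizon(10000, gamma)}`);
}
```

With θ = 10000, γ = 0.5 gives 10000 × 1.5 / 0.5 = 30000 tokens, and the horizon grows without bound as γ approaches 1.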
+
+// Sigmoid-like pass rate vs. ratio = T_eval / d_horizon.
+// With k = 1.4 and center = log(0.7), the curve below evaluates to:
+//   ratio = 0.25 → ≈ 0.81   (well within horizon)
+//   ratio = 0.50 → ≈ 0.62
+//   ratio = 1.00 → ≈ 0.38
+//   ratio = 2.00 → ≈ 0.19
+//   ratio = 4.00 → ≈ 0.08
+function niahRate(ratio) {
+  // Logistic on log-ratio: P = 1/(1+exp(k*(log(ratio)-log(0.7))))
+  const k = 1.4;
+  const center = Math.log(0.7);
+  const x = Math.log(Math.max(0.01, ratio));
+  return 1 / (1 + Math.exp(k * (x - center)));
+}
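The logistic in `niahRate` is easy to sanity-check by evaluating it directly. The sketch below copies the function verbatim and prints the pass rate at a few ratios; the ratios themselves are arbitrary probe points chosen for illustration.

```javascript
// Verbatim copy of the logistic-on-log-ratio curve from the diff.
function niahRate(ratio) {
  const k = 1.4;
  const center = Math.log(0.7);
  const x = Math.log(Math.max(0.01, ratio)); // clamp avoids log(0)
  return 1 / (1 + Math.exp(k * (x - center)));
}

// Probe points: well inside the horizon, at the midpoint, and beyond it.
for (const ratio of [0.25, 0.5, 0.7, 1, 2, 4]) {
  console.log(`ratio ${ratio} -> ${niahRate(ratio).toFixed(2)}`);
}
// Prints 0.81, 0.62, 0.50, 0.38, 0.19, 0.08 for the six ratios above.
```

Because the formula is equivalent to P = 1 / (1 + (ratio / 0.7)^1.4), the curve crosses 50% exactly at ratio = 0.7, and `k` controls how sharply it falls on either side.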
+
+// Multi-hop reasoning is strictly harder than NIAH. RULER paper shows ~30-50%
+// drop from NIAH-Single to multi-hop at long context. The gap grows with
+// architecture pressure (small d_head, aggressive GQA, SWA boundary).
+function reasoningPenalty(ratio, archPressure) {
+  // Base penalty grows with context ratio (more multi-hop steps required).
+  // archPressure ∈ [1.0, 1.6] from architecture (small d_head + GQA → higher).
+  const base = ratio < 0.5 ? 0.05 :
+               ratio < 1.0 ? 0.15 :
+               ratio < 2.0 ? 0.30 :
+               ratio < 4.0 ? 0.45 : 0.55;
+  return Math.min(0.7, base * archPressure);
+}
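As a rough illustration of how the step function and the 0.7 clamp interact, the sketch below copies `reasoningPenalty` and applies it to an assumed NIAH rate of 0.8. The (ratio, pressure) pairs are hypothetical inputs, not values taken from any model.

```javascript
// Copy of the stepwise penalty: a base value per ratio band, scaled by
// architecture pressure and clamped at 0.7.
function reasoningPenalty(ratio, archPressure) {
  const base = ratio < 0.5 ? 0.05 :
               ratio < 1.0 ? 0.15 :
               ratio < 2.0 ? 0.30 :
               ratio < 4.0 ? 0.45 : 0.55;
  return Math.min(0.7, base * archPressure);
}

const niah = 0.8; // hypothetical NIAH pass rate
for (const [ratio, pressure] of [[0.4, 1.0], [1.5, 1.0], [1.5, 1.6], [5, 1.6]]) {
  const penalty = reasoningPenalty(ratio, pressure);
  console.log({ ratio, pressure, penalty, reasoning: niah * (1 - penalty) });
}
```

At the top band with maximum pressure, 0.55 × 1.6 = 0.88 is clamped to 0.7, so this term alone never pushes reasoning below 30% of the NIAH rate.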
+
+function archPressureFromConfig(config) {
+  let p = 1.0;
+  const n_attn = config.num_attention_heads ?? null;
+  const n_kv = config.num_key_value_heads ?? n_attn;
+  const hidden = config.hidden_size ?? null;
+  const d_head = config.head_dim ?? (n_attn && hidden ? hidden / n_attn : null);
+  if (d_head !== null) {
+    if (d_head < 64) p *= 1.25;
+    else if (d_head < 96) p *= 1.10;
+    else if (d_head < 128) p *= 1.03;
+  }
+  if (n_attn && n_kv && n_kv < n_attn) {
+    const ratio = n_attn / n_kv;
+    if (ratio >= 8) p *= 1.15;
+    else if (ratio >= 4) p *= 1.08;
+  }
+  if (typeof config.sliding_window === "number" && config.sliding_window > 0) {
+    p *= 1.10;  // SWA: cross-window reasoning costs extra
+  }
+  return Math.min(1.6, p);
+}
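The three multiplicative factors in `archPressureFromConfig` can be traced by hand on small config objects. The sketch below copies the function and feeds it three hypothetical configs; the field values for `llamaLike` and `mistralLike` mirror commonly published shapes for those model families but should be treated as illustrative assumptions, and `tinyMha` is invented.

```javascript
// Copy of the pressure heuristic: head size, GQA ratio and sliding window
// each contribute a multiplicative factor.
function archPressureFromConfig(config) {
  let p = 1.0;
  const n_attn = config.num_attention_heads ?? null;
  const n_kv = config.num_key_value_heads ?? n_attn;
  const hidden = config.hidden_size ?? null;
  const d_head = config.head_dim ?? (n_attn && hidden ? hidden / n_attn : null);
  if (d_head !== null) {
    if (d_head < 64) p *= 1.25;
    else if (d_head < 96) p *= 1.10;
    else if (d_head < 128) p *= 1.03;
  }
  if (n_attn && n_kv && n_kv < n_attn) {
    const ratio = n_attn / n_kv;
    if (ratio >= 8) p *= 1.15;
    else if (ratio >= 4) p *= 1.08;
  }
  if (typeof config.sliding_window === "number" && config.sliding_window > 0) p *= 1.10;
  return Math.min(1.6, p);
}

// Hypothetical configs (assumed values, for illustration only).
const llamaLike = { hidden_size: 4096, num_attention_heads: 32, num_key_value_heads: 8 };
const mistralLike = { ...llamaLike, sliding_window: 4096 };
const tinyMha = { hidden_size: 512, num_attention_heads: 16 }; // d_head = 32, no GQA

console.log(archPressureFromConfig(llamaLike).toFixed(3));   // 1.080 (GQA 4:1 only)
console.log(archPressureFromConfig(mistralLike).toFixed(3)); // 1.188 (GQA 4:1 and SWA)
console.log(archPressureFromConfig(tinyMha).toFixed(3));     // 1.250 (small heads only)
```

The largest product the three factors can reach is 1.25 × 1.15 × 1.10 ≈ 1.58, so the 1.6 cap acts as a guard rather than an active limit with the current constants.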
+
+export function predictNIAHReasoning(config, T_eval) {
+  const theta = config.rope_theta ?? 10000;
+  const T_train = config.max_position_embeddings ?? T_eval;
+  const gPade = gammaPade(theta, T_eval);
+  const dh = dHorizon(theta, gPade);
+  const ratio = dh === Infinity ? 0 : T_eval / dh;
+
+  const archPressure = archPressureFromConfig(config);
+  // Extrapolation penalty: models tested far beyond their training context
+  // degrade regardless of architecture (no positional embeddings learned for
+  // unseen positions). Capped at 0.7 so we never zero out completely.
+  const extrapolation_ratio = T_train > 0 ? T_eval / T_train : 1;
+  const extrapolation_penalty = extrapolation_ratio > 1
+    ? Math.min(0.7, (extrapolation_ratio - 1) * 0.3)
+    : 0;
+  const niah = Math.max(0.02, niahRate(ratio) * (1 - extrapolation_penalty));
+  const penalty = reasoningPenalty(ratio, archPressure);
+  const reasoning = Math.max(0.02, niah * (1 - penalty));
+  const gap = niah - reasoning;
+
+  // Verdict bands
+  let verdict;
+  if (niah < 0.35) verdict = "broken";                       // model can't even retrieve
+  else if (gap >= 0.30) verdict = "retrieval_only";          // canonical RULER finding
+  else if (gap >= 0.15) verdict = "degraded";
+  else if (niah >= 0.70 && reasoning >= 0.55) verdict = "robust";
+  else verdict = "marginal";
+
+  // Find a "safe" context where reasoning >= 0.65: a doubling sweep from 1k up
+  // to T_eval that stops at the first length below the threshold. The sweep
+  // does not apply the extrapolation penalty.
+  let safeT = null;
+  for (let t = 1024; t <= T_eval; t *= 2) {
+    const gP = gammaPade(theta, t);
+    const dh2 = dHorizon(theta, gP);
+    const r = dh2 === Infinity ? 0 : t / dh2;
+    const niah2 = niahRate(r);
+    const reas2 = niah2 * (1 - reasoningPenalty(r, archPressure));
+    if (reas2 >= 0.65) safeT = t;
+    else break;
+  }
+
+  return {
+    T_eval,
+    T_train,
+    theta,
+    arch_pressure: Math.round(archPressure * 100) / 100,
+    gamma_pade: Math.round(gPade * 1000) / 1000,
+    d_horizon: dh === Infinity ? null : Math.round(dh),
+    horizon_ratio: Math.round(ratio * 100) / 100,
+    niah_rate: Math.round(niah * 100) / 100,
+    reasoning_rate: Math.round(reasoning * 100) / 100,
+    gap: Math.round(gap * 100) / 100,
+    verdict,
+    safe_context: safeT,
+  };
+}
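Two pieces of `predictNIAHReasoning` are pure arithmetic and can be exercised without `gammaPade`: the extrapolation penalty and the verdict bands. The sketch below pulls them out into helpers named `extrapolationPenalty` and `verdictFor`; both names are hypothetical (the commit inlines this logic), and the sample rates are invented inputs.

```javascript
// Hypothetical helper: penalty applied when evaluating past the training length.
function extrapolationPenalty(T_eval, T_train) {
  const r = T_train > 0 ? T_eval / T_train : 1;
  return r > 1 ? Math.min(0.7, (r - 1) * 0.3) : 0;
}

// Hypothetical helper: same band order as the diff (first match wins).
function verdictFor(niah, reasoning) {
  const gap = niah - reasoning;
  if (niah < 0.35) return "broken";
  if (gap >= 0.30) return "retrieval_only";
  if (gap >= 0.15) return "degraded";
  if (niah >= 0.70 && reasoning >= 0.55) return "robust";
  return "marginal";
}

console.log(extrapolationPenalty(4096, 4096)); // 0 (at the training length)
console.log(extrapolationPenalty(8192, 2048)); // 0.7 (4x training length hits the cap)

console.log(verdictFor(0.9, 0.8));   // "robust"
console.log(verdictFor(0.8, 0.62));  // "degraded"
console.log(verdictFor(0.8, 0.45));  // "retrieval_only"
console.log(verdictFor(0.6, 0.5));   // "marginal"
console.log(verdictFor(0.3, 0.28));  // "broken"
```

Once the penalty reaches its 0.7 cap (from about 3.3× the training length), the NIAH rate cannot exceed 0.3, which is below the 0.35 threshold, so the verdict is always "broken". That is consistent with the Pythia case at 4× T_train described in the commit message.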
+
+// Sweep across context lengths (1k, 4k, 16k, 64k, then max_position_embeddings,
+// which defaults to 128k when the config omits it) so user sees the curve.
+// Default lengths above max_position_embeddings and duplicates are dropped.
+export function sweepContextLengths(config, lengths = null) {
+  const T_max = config.max_position_embeddings ?? 131072;
+  const defaults = lengths || [1024, 4096, 16384, 65536, T_max].filter((v, i, arr) =>
+    v <= T_max && arr.indexOf(v) === i
+  );
+  return defaults.map(T => predictNIAHReasoning(config, T));
+}
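The default-length filter is the only non-obvious part of `sweepContextLengths`. The sketch below isolates it in a helper called `defaultSweepLengths`, a hypothetical name used here for illustration, to show which lengths survive for a few values of `max_position_embeddings`.

```javascript
// Hypothetical helper isolating the default-length logic: keep entries that
// fit within T_max and drop the duplicate that appears when T_max is one of
// the fixed steps.
function defaultSweepLengths(T_max = 131072) {
  return [1024, 4096, 16384, 65536, T_max].filter(
    (v, i, arr) => v <= T_max && arr.indexOf(v) === i
  );
}

console.log(defaultSweepLengths(4096));   // [ 1024, 4096 ]
console.log(defaultSweepLengths(32768));  // [ 1024, 4096, 16384, 32768 ]
console.log(defaultSweepLengths());       // [ 1024, 4096, 16384, 65536, 131072 ]
```

Because `T_max` is appended last and every surviving fixed step is at most `T_max`, the result stays in ascending order without an explicit sort. In the commit's version the filter only applies to the built-in list; a caller-supplied `lengths` array is used as given.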
@@ -34,22 +34,48 @@
 }

 /* v0.7.5 — Cross-framework drift bound */
+/* Drift section overrides the default 980px main width so the two-column form
+   has room without overlapping. Mobile (<800px) collapses to single column. */
+#drift-section {
+  max-width: 1100px;
+}
 .drift-grid {
   display: grid;
-  grid-template-columns:
+  grid-template-columns: 1fr 1fr;
   gap: 1em;
   margin: 0.6em 0 0.8em;
+  align-items: start;
+}
+@media (max-width: 800px) {
+  .drift-grid { grid-template-columns: 1fr; }
 }
 .drift-setup {
   border: 1px solid rgba(88, 166, 255, 0.3);
   border-radius: 8px;
   padding: 0.6em 1em 0.4em;
+  min-width: 0;             /* fix: lets grid item shrink instead of overflowing */
+  box-sizing: border-box;
+  overflow: hidden;         /* clip any overlong content cleanly */
 }
 .drift-setup legend {
   padding: 0 0.5em;
   font-weight: 600;
   color: #58a6ff;
 }
+.drift-setup .form-row {
+  flex-wrap: wrap;
+  gap: 0.4em 0.6em;
+  margin: 0.35em 0;
+}
+.drift-setup .form-row label {
+  flex: 0 0 110px;
+  font-size: 0.92em;
+}
+.drift-setup .form-row input,
+.drift-setup .form-row select {
+  flex: 1 1 120px;
+  min-width: 0;
+}

 /* v0.7.4 — HF Hub autocomplete dropdown (attached to body) */
 .hf-autocomplete-dropdown {