karlexmarin Claude Opus 4.7 (1M context) committed on
Commit 5c94f4b · Parent: 2288c3d

v0.8.6 NIAH RULER calibration — anti-bullshit pack #12


The 🔍 NIAH→Reason mode predicted pass rates from architectural
inputs (γ_Padé, d_horizon, GQA pressure, SWA boundary) using a
heuristic logistic. That heuristic was calibrated against rough
RULER bands but never validated against per-model-per-context
ground truth. Anti-bullshit principle: if measured data exists,
USE the measured data.

Layered RULER calibration on top of the existing predictor:

- NEW data/ruler_kb.json — 12 models from RULER paper Table 3 +
DeepWiki leaderboard. Each row carries 4K/8K/16K/32K/64K/128K
aggregate scores, claimed-vs-effective context, params, and
multi-name aliases (org/name + bare-name + unsloth mirrors, so
autocomplete-pasted ids resolve). Models: GPT-4-1106, Command-R-35B,
Yi-34B-200K, Mixtral-8x7B/-8x22B, Mistral-7B-v0.2, ChatGLM3-6B,
LWM-7B, Llama-3.1-70B-Instruct, Gemini-1.5-Pro, Jamba-1.5-Large,
Qwen2.5-14B-1M.

- `loadRulerKB()` + `lookupRulerModel()` + `calibrateNIAH()` in
niah_reasoning.js. Lookup is case-insensitive on the alias_index
with org/name + bare-name + lowercase fallback. Linear-interpolate
in log-context between bracketing samples; clamp + flag
extrapolation outside the 4K-128K measured range.
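The clamp-and-interpolate step can be sketched standalone (the `ruler_avg` values below are the Llama-3.1-70B-Instruct row from the KB; the shipped implementation is `interpolateRulerAvg()` in niah_reasoning.js):

```javascript
// Log-context clamp + interpolation, standalone sketch.
// scores[] = Llama-3.1-70B-Instruct ruler_avg from data/ruler_kb.json.
const levels = [4096, 8192, 16384, 32768, 65536, 131072];
const scores = [96.5, 95.8, 95.4, 94.8, 88.4, 66.6];

function interpLogCtx(T) {
  if (T <= levels[0]) return scores[0];                // clamp below 4K
  if (T >= levels[levels.length - 1]) {
    return scores[scores.length - 1];                  // clamp above 128K (the real module also flags extrapolation)
  }
  for (let i = 0; i < levels.length - 1; i++) {
    if (T <= levels[i + 1]) {
      // linear in log2(context) between the bracketing samples
      const t = (Math.log2(T) - Math.log2(levels[i])) /
                (Math.log2(levels[i + 1]) - Math.log2(levels[i]));
      return scores[i] + (scores[i + 1] - scores[i]) * t;
    }
  }
}

console.log(interpLogCtx(32768)); // ≈ 94.8 (a sampled point)
console.log(interpLogCtx(49152)); // between 94.8 and 88.4
```

Note the log2 grid: 49,152 tokens (the arithmetic midpoint of 32K-64K) lands ~0.58 of the way through the interval, not halfway.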

- Per-task back-out: RULER aggregate × retrieval_factor (1.04) for
NIAH, × reasoning_factor (0.78) for multi-hop QA. Factors derived
from RULER paper Appendix Tables 13-16 (top-tier models score
retrieval 95-100%, QA ~70%, the canonical ~25pp gap). Honest
range surfaced inline (retrieval 0.95-1.10×, reasoning 0.60-0.85×).
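The back-out arithmetic, as a standalone sketch (factors mirror `task_breakdown_priors`; the shipped logic is `calibrateNIAH()`):

```javascript
// Per-task back-out from the RULER aggregate, standalone sketch.
const RETRIEVAL_FACTOR = 1.04; // honest range 0.95-1.10
const REASONING_FACTOR = 0.78; // honest range 0.60-0.85

function backOut(aggregatePct) {
  return {
    niah: Math.min(1.0, (aggregatePct * RETRIEVAL_FACTOR) / 100),
    reasoning: Math.min(1.0, (aggregatePct * REASONING_FACTOR) / 100),
  };
}

// Llama-3.1-70B-Instruct: RULER aggregate 94.8 @ 32K, 66.6 @ 128K
const a = backOut(94.8);
const b = backOut(66.6);
console.log(Math.round(a.niah * 100), Math.round(a.reasoning * 100)); // 99 74
console.log(Math.round(b.niah * 100), Math.round(b.reasoning * 100)); // 69 52
```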

- UI: green-bordered "📊 RULER-calibrated" panel above the existing
architecture breakdown when the model id matches the KB. Shows
measured RULER aggregate at T_eval, derived NIAH/reasoning rates,
and a side-by-side delta vs the heuristic prediction (color-coded
+/- pp). Extrapolation warning for T_eval > 128K. Source citation
links to the paper. KB-miss path shows an explicit "calibration
unavailable, heuristic only" hint.

- 15 i18n keys × 4 langs (EN/ES/FR/ZH) = 60 keys.

Verified locally: 12 models load, alias lookup hits HF org/name +
bare names + lowercase variants, calibration arithmetic produces
sensible numbers (Llama-3.1-70B @ 32K → RULER 94.8% → NIAH 99% /
reasoning 74%; @ 128K → 66.6% → 69% / 52% — the canonical RULER
"reasoning collapses at long context" finding).

Refs:
- https://arxiv.org/abs/2404.06654 (Hsieh et al., COLM 2024)
- https://github.com/NVIDIA/RULER

Closes the v0.8 roadmap #5 commitment ("RULER-backed NIAH calibrator
to upgrade the predictor from heuristic to calibrated").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (4)
  1. data/ruler_kb.json +116 -0
  2. js/i18n.js +60 -0
  3. js/main.js +82 -3
  4. js/niah_reasoning.js +145 -0
data/ruler_kb.json ADDED
@@ -0,0 +1,116 @@
+{
+  "version": "1.0",
+  "compiled": "2026-05-07",
+  "source": {
+    "primary": "RULER paper Table 3 (Hsieh et al., COLM 2024) — arxiv.org/abs/2404.06654",
+    "secondary": "DeepWiki/NVIDIA/RULER aggregated leaderboard (~34 models)",
+    "license": "Numbers reproduced for review/calibration; see paper for evaluation methodology."
+  },
+  "task_breakdown_priors": {
+    "comment": "RULER aggregate score is the mean of 13 tasks across 4 categories. Per the paper's Appendix Tables 13-16, top-tier models (GPT-4) score: retrieval 95-100%, variable-tracking 99%, aggregation 93%, QA 70%. The ~25pp gap between retrieval and QA is the canonical 'retrieval-vs-reasoning' finding. We use these ratios to back out per-task estimates from the published aggregate.",
+    "retrieval_factor": 1.04,
+    "reasoning_factor": 0.78,
+    "retrieval_factor_caveat": "NIAH-single typically scores 95-99% for top models even when aggregate drops; this multiplier underestimates NIAH on small-model regimes. Honest range: 0.95×–1.10× aggregate.",
+    "reasoning_factor_caveat": "Multi-hop QA degrades faster than aggregate. At long context (>64K) the ratio can drop below 0.6×. Honest range: 0.60×–0.85× aggregate."
+  },
+  "models": {
+    "gpt-4-1106-preview": {
+      "ruler_avg": {"4k": 96.6, "8k": 96.3, "16k": 95.2, "32k": 93.2, "64k": 87.0, "128k": 81.2},
+      "claimed_context": 128000,
+      "effective_context": 64000,
+      "params_b": null,
+      "id_aliases": ["openai/gpt-4-1106-preview", "gpt-4-1106-preview", "gpt-4-turbo"],
+      "category": "frontier_api"
+    },
+    "command-r-35b": {
+      "ruler_avg": {"4k": 93.8, "8k": 93.3, "16k": 92.4, "32k": 89.5, "64k": 84.9, "128k": 76.0},
+      "claimed_context": 128000,
+      "effective_context": 32000,
+      "params_b": 35,
+      "id_aliases": ["CohereForAI/c4ai-command-r-v01", "command-r-35b", "c4ai-command-r-v01"],
+      "category": "open"
+    },
+    "yi-34b-200k": {
+      "ruler_avg": {"4k": 93.3, "8k": 92.2, "16k": 91.3, "32k": 87.5, "64k": 83.2, "128k": 77.3},
+      "claimed_context": 200000,
+      "effective_context": 32000,
+      "params_b": 34,
+      "id_aliases": ["01-ai/Yi-34B-200K", "yi-34b-200k", "Yi-34B-200K"],
+      "category": "open"
+    },
+    "mixtral-8x7b": {
+      "ruler_avg": {"4k": 94.9, "8k": 92.1, "16k": 92.5, "32k": 85.9, "64k": 72.4, "128k": 44.5},
+      "claimed_context": 32000,
+      "effective_context": 32000,
+      "params_b": 47,
+      "id_aliases": ["mistralai/Mixtral-8x7B-Instruct-v0.1", "mixtral-8x7b-instruct", "Mixtral-8x7B-Instruct-v0.1"],
+      "category": "open_moe"
+    },
+    "mistral-7b-v0.2": {
+      "ruler_avg": {"4k": 93.6, "8k": 91.2, "16k": 87.2, "32k": 75.4, "64k": 49.0, "128k": 13.8},
+      "claimed_context": 32000,
+      "effective_context": 16000,
+      "params_b": 7,
+      "id_aliases": ["mistralai/Mistral-7B-Instruct-v0.2", "mistral-7b-v0.2", "Mistral-7B-Instruct-v0.2"],
+      "category": "open"
+    },
+    "chatglm3-6b-128k": {
+      "ruler_avg": {"4k": 87.8, "8k": 83.4, "16k": 78.6, "32k": 69.9, "64k": 56.0, "128k": 42.0},
+      "claimed_context": 128000,
+      "effective_context": 4000,
+      "params_b": 6,
+      "id_aliases": ["THUDM/chatglm3-6b-128k", "chatglm3-6b-128k"],
+      "category": "open"
+    },
+    "lwm-7b": {
+      "ruler_avg": {"4k": 82.3, "8k": 78.4, "16k": 73.7, "32k": 69.1, "64k": 68.1, "128k": 65.0},
+      "claimed_context": 1000000,
+      "effective_context": 4000,
+      "params_b": 7,
+      "id_aliases": ["LargeWorldModel/LWM-Text-Chat-1M", "lwm-7b", "LWM-Text-Chat-1M"],
+      "category": "open"
+    },
+    "llama3.1-70b-instruct": {
+      "ruler_avg": {"4k": 96.5, "8k": 95.8, "16k": 95.4, "32k": 94.8, "64k": 88.4, "128k": 66.6},
+      "claimed_context": 128000,
+      "effective_context": 64000,
+      "params_b": 70,
+      "id_aliases": ["meta-llama/Llama-3.1-70B-Instruct", "llama-3.1-70b", "llama3.1-70b-instruct", "Meta-Llama-3.1-70B-Instruct", "Llama-3.1-70B-Instruct", "unsloth/Meta-Llama-3.1-70B-Instruct", "unsloth/Llama-3.1-70B-Instruct"],
+      "category": "open_frontier"
+    },
+    "mixtral-8x22b-instruct": {
+      "ruler_avg": {"4k": 95.6, "8k": 94.9, "16k": 93.4, "32k": 90.9, "64k": 84.7, "128k": 31.7},
+      "claimed_context": 64000,
+      "effective_context": 32000,
+      "params_b": 141,
+      "id_aliases": ["mistralai/Mixtral-8x22B-Instruct-v0.1", "mixtral-8x22b-instruct", "Mixtral-8x22B-Instruct-v0.1"],
+      "category": "open_moe"
+    },
+    "gemini-1.5-pro": {
+      "ruler_avg": {"4k": 96.7, "8k": 95.8, "16k": 96.0, "32k": 95.9, "64k": 95.9, "128k": 94.4},
+      "claimed_context": 1000000,
+      "effective_context": 128000,
+      "params_b": null,
+      "id_aliases": ["google/gemini-1.5-pro", "gemini-1.5-pro"],
+      "category": "frontier_api"
+    },
+    "jamba-1.5-large": {
+      "ruler_avg": {"4k": 96.3, "8k": 96.2, "16k": 96.0, "32k": 96.0, "64k": 96.0, "128k": 96.0},
+      "claimed_context": 256000,
+      "effective_context": 128000,
+      "params_b": 94,
+      "id_aliases": ["ai21labs/AI21-Jamba-1.5-Large", "jamba-1.5-large", "AI21-Jamba-1.5-Large"],
+      "category": "open_hybrid_ssm"
+    },
+    "qwen2.5-14b-instruct-1m": {
+      "ruler_avg": {"4k": 97.5, "8k": 96.8, "16k": 95.5, "32k": 94.0, "64k": 92.5, "128k": 92.2},
+      "claimed_context": 1000000,
+      "effective_context": 128000,
+      "params_b": 14,
+      "id_aliases": ["Qwen/Qwen2.5-14B-Instruct-1M", "qwen2.5-14b-1m", "Qwen2.5-14B-Instruct-1M", "qwen2.5-14b-instruct-1m"],
+      "category": "open_frontier"
+    }
+  },
+  "context_levels": [4096, 8192, 16384, 32768, 65536, 131072],
+  "context_level_keys": ["4k", "8k", "16k", "32k", "64k", "128k"]
+}
js/i18n.js CHANGED
@@ -439,6 +439,21 @@ export const TRANSLATIONS = {
   "niah.label.safe_ctx": "Safe reasoning context",
   "niah.section.breakdown": "Architecture breakdown",
   "niah.section.reco": "Recommendation",
+  "niah.calib.heading": "RULER-calibrated (NVIDIA published data)",
+  "niah.calib.matched": "Matched <code>{alias}</code> → KB row <code>{canonical}</code>.",
+  "niah.calib.aggregate": "RULER aggregate",
+  "niah.calib.interp": "interpolated between",
+  "niah.calib.extrapolated": "extrapolated outside RULER's measured range",
+  "niah.calib.col.heuristic": "Heuristic",
+  "niah.calib.col.calibrated": "RULER-calibrated",
+  "niah.calib.col.delta": "Δ",
+  "niah.calib.factors": "Per-task factors from RULER paper Appendix Tables 13-16:",
+  "niah.calib.factors_caveat": "honest range: retrieval 0.95-1.10×, reasoning 0.60-0.85×",
+  "niah.calib.claimed_vs_effective": "Paper-reported",
+  "niah.calib.claimed": "claimed",
+  "niah.calib.effective": "effective",
+  "niah.calib.source": "Source",
+  "niah.calib.miss": "RULER calibration unavailable for this model — using architectural heuristic only. Add to data/ruler_kb.json if you have measured numbers.",
   "niah.section.sweep": "Pass rate sweep across context lengths",
   "niah.field.dhorizon": "d_horizon (effective)",
   "niah.field.ratio": "T_eval / d_horizon",
@@ -1597,6 +1612,21 @@ export const TRANSLATIONS = {
   "niah.label.safe_ctx": "Contexto seguro de reasoning",
   "niah.section.breakdown": "Desglose arquitectónico",
   "niah.section.reco": "Recomendación",
+  "niah.calib.heading": "Calibrado con RULER (datos publicados por NVIDIA)",
+  "niah.calib.matched": "Coincide <code>{alias}</code> → fila KB <code>{canonical}</code>.",
+  "niah.calib.aggregate": "Agregado RULER",
+  "niah.calib.interp": "interpolado entre",
+  "niah.calib.extrapolated": "extrapolado fuera del rango medido por RULER",
+  "niah.calib.col.heuristic": "Heurística",
+  "niah.calib.col.calibrated": "Calibrado RULER",
+  "niah.calib.col.delta": "Δ",
+  "niah.calib.factors": "Factores por tarea del paper RULER, Apéndice Tablas 13-16:",
+  "niah.calib.factors_caveat": "rango honesto: retrieval 0.95-1.10×, reasoning 0.60-0.85×",
+  "niah.calib.claimed_vs_effective": "Reportado en paper",
+  "niah.calib.claimed": "claimed",
+  "niah.calib.effective": "effective",
+  "niah.calib.source": "Fuente",
+  "niah.calib.miss": "Calibración RULER no disponible para este modelo — usando solo heurística arquitectónica. Añade a data/ruler_kb.json si tienes números medidos.",
   "niah.section.sweep": "Barrido de tasas pass por longitud de contexto",
   "niah.field.dhorizon": "d_horizon (efectivo)",
   "niah.field.ratio": "T_eval / d_horizon",
@@ -2619,6 +2649,21 @@ export const TRANSLATIONS = {
   "niah.label.safe_ctx": "Contexte sûr pour reasoning",
   "niah.section.breakdown": "Détail architectural",
   "niah.section.reco": "Recommandation",
+  "niah.calib.heading": "Calibré avec RULER (données publiées par NVIDIA)",
+  "niah.calib.matched": "Correspond <code>{alias}</code> → ligne KB <code>{canonical}</code>.",
+  "niah.calib.aggregate": "Agrégat RULER",
+  "niah.calib.interp": "interpolé entre",
+  "niah.calib.extrapolated": "extrapolé hors de la plage mesurée par RULER",
+  "niah.calib.col.heuristic": "Heuristique",
+  "niah.calib.col.calibrated": "Calibré RULER",
+  "niah.calib.col.delta": "Δ",
+  "niah.calib.factors": "Facteurs par tâche du paper RULER, Appendice Tables 13-16 :",
+  "niah.calib.factors_caveat": "plage honnête : retrieval 0.95-1.10×, reasoning 0.60-0.85×",
+  "niah.calib.claimed_vs_effective": "Rapporté dans le paper",
+  "niah.calib.claimed": "claimed",
+  "niah.calib.effective": "effective",
+  "niah.calib.source": "Source",
+  "niah.calib.miss": "Calibration RULER indisponible pour ce modèle — utilisation de l'heuristique architecturale seule. Ajoutez à data/ruler_kb.json si vous avez des chiffres mesurés.",
   "niah.section.sweep": "Balayage des taux par longueur de contexte",
   "niah.field.dhorizon": "d_horizon (effectif)",
   "niah.field.ratio": "T_eval / d_horizon",
@@ -3641,6 +3686,21 @@ export const TRANSLATIONS = {
   "niah.label.safe_ctx": "Reasoning 安全上下文",
   "niah.section.breakdown": "架构细节",
   "niah.section.reco": "建议",
+  "niah.calib.heading": "RULER 校准(NVIDIA 已发布数据)",
+  "niah.calib.matched": "匹配 <code>{alias}</code> → KB 行 <code>{canonical}</code>。",
+  "niah.calib.aggregate": "RULER 聚合分",
+  "niah.calib.interp": "在以下之间插值",
+  "niah.calib.extrapolated": "外推到 RULER 已测范围之外",
+  "niah.calib.col.heuristic": "启发式",
+  "niah.calib.col.calibrated": "RULER 校准",
+  "niah.calib.col.delta": "Δ",
+  "niah.calib.factors": "来自 RULER 论文附录表 13-16 的每任务因子:",
+  "niah.calib.factors_caveat": "诚实范围:retrieval 0.95-1.10×,reasoning 0.60-0.85×",
+  "niah.calib.claimed_vs_effective": "论文报告",
+  "niah.calib.claimed": "claimed",
+  "niah.calib.effective": "effective",
+  "niah.calib.source": "来源",
+  "niah.calib.miss": "此模型暂无 RULER 校准——仅使用架构启发式。如有实测数字,请添加到 data/ruler_kb.json。",
   "niah.section.sweep": "按上下文长度扫描通过率",
   "niah.field.dhorizon": "d_horizon(有效)",
   "niah.field.ratio": "T_eval / d_horizon",
js/main.js CHANGED
@@ -18,7 +18,7 @@ import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";
 import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
 import { attachAllHfAutocompletes } from "./hf_autocomplete.js";
 import { computeDriftBound, FRAMEWORKS as DRIFT_FRAMEWORKS, DTYPES as DRIFT_DTYPES } from "./cross_drift.js";
-import { predictNIAHReasoning, sweepContextLengths } from "./niah_reasoning.js";
+import { predictNIAHReasoning, sweepContextLengths, loadRulerKB, calibrateNIAH, listRulerModels } from "./niah_reasoning.js";
 import {
   loadSaturationKB, classifyAll, classifyBenchmark,
   listBenchmarks, attribution as saturationAttribution, tryFetchLive,
@@ -1419,7 +1419,7 @@ async function niahFetchConfig() {
   }
 }
 
-function renderNIAHCard(result, modelId) {
+function renderNIAHCard(result, modelId, calib = null) {
   const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
     ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
   const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
@@ -1430,6 +1430,80 @@ function renderNIAHCard(result, modelId) {
     ? tFmt("niah.safe_context", { ctx: result.safe_context })
     : (t("niah.safe_context_none") || "No safe context found below your target — model fails reasoning even at small contexts.");
 
+  // RULER calibration block — appears only when KB lookup hits.
+  // Shows measured RULER aggregate, derived NIAH/reasoning, and the
+  // delta vs the heuristic so users see when the predictor was off.
+  let calibBlock = "";
+  if (calib) {
+    const fmtPct = (v) => `${(v * 100).toFixed(0)}%`;
+    const fmtDelta = (d) => {
+      if (d == null) return "—";
+      const pp = Math.round(d * 100);
+      const sign = pp > 0 ? "+" : "";
+      const col = Math.abs(pp) >= 10 ? "#f0883e" : Math.abs(pp) >= 5 ? "#d29922" : "#8b949e";
+      return `<span style="color:${col};">${sign}${pp} pp</span>`;
+    };
+    const extrapNote = calib.extrapolated
+      ? `<span class="subtle" style="color:#d29922;font-size:0.85em;"> ⚠ ${t("niah.calib.extrapolated") || "extrapolated outside RULER's measured range"}</span>`
+      : "";
+    calibBlock = `
+      <details class="unmask-panel" open style="border-left:3px solid #3fb950;">
+        <summary class="unmask-panel-title">📊 ${t("niah.calib.heading") || "RULER-calibrated (NVIDIA published data)"}</summary>
+        <p>${tFmt("niah.calib.matched", {
+          alias: escapeHtml(calib.matched_alias),
+          canonical: escapeHtml(calib.canonical_id),
+        }) || `Matched <code>${escapeHtml(calib.matched_alias)}</code> → KB row <code>${escapeHtml(calib.canonical_id)}</code>.`}</p>
+        <p>
+          <strong>${t("niah.calib.aggregate") || "RULER aggregate"} @ ${fmtN(result.T_eval)}:</strong>
+          <code>${calib.ruler_avg_pct}%</code>
+          <span class="subtle">(${t("niah.calib.interp") || "interpolated between"} ${calib.interp_anchor})</span>${extrapNote}
+        </p>
+        <table class="arena-table" style="margin-top:0.5em;">
+          <thead><tr>
+            <th></th>
+            <th>${t("niah.calib.col.heuristic") || "Heuristic"}</th>
+            <th>${t("niah.calib.col.calibrated") || "RULER-calibrated"}</th>
+            <th>${t("niah.calib.col.delta") || "Δ"}</th>
+          </tr></thead>
+          <tbody>
+            <tr>
+              <td><strong>NIAH</strong></td>
+              <td>${fmtPct(result.niah_rate)}</td>
+              <td><strong>${fmtPct(calib.niah_calibrated)}</strong></td>
+              <td>${fmtDelta(calib.delta_niah)}</td>
+            </tr>
+            <tr>
+              <td><strong>${t("niah.label.reasoning") || "Reasoning"}</strong></td>
+              <td>${fmtPct(result.reasoning_rate)}</td>
+              <td><strong>${fmtPct(calib.reasoning_calibrated)}</strong></td>
+              <td>${fmtDelta(calib.delta_reasoning)}</td>
+            </tr>
+          </tbody>
+        </table>
+        <p class="recipe-desc subtle" style="font-size:0.82em;">
+          ${t("niah.calib.factors") || "Per-task factors from RULER paper Appendix Tables 13-16:"}
+          retrieval = ${calib.retrieval_factor}× aggregate,
+          reasoning = ${calib.reasoning_factor}× aggregate
+          (${t("niah.calib.factors_caveat") || "honest range: retrieval 0.95-1.10×, reasoning 0.60-0.85×"}).
+        </p>
+        <p class="recipe-desc subtle" style="font-size:0.82em;">
+          ${t("niah.calib.claimed_vs_effective") || "Paper-reported"}:
+          ${t("niah.calib.claimed") || "claimed"} ${fmtN(calib.claimed_context)} /
+          ${t("niah.calib.effective") || "effective"} ${fmtN(calib.effective_context)}.
+          ${t("niah.calib.source") || "Source"}:
+          <a href="${calib.source_url}" target="_blank" rel="noopener noreferrer">RULER paper (Hsieh et al., COLM 2024)</a>
+        </p>
+      </details>
+    `;
+  } else if (modelId) {
+    // KB miss — explicitly state we're heuristic-only.
+    calibBlock = `
+      <p class="recipe-desc subtle" style="font-size:0.85em;margin-top:0.5em;">
+        💡 ${t("niah.calib.miss") || "RULER calibration unavailable for this model — using architectural heuristic only. Add to data/ruler_kb.json if you have measured numbers."}
+      </p>
+    `;
+  }
+
   return `
     <div class="unmask-result">
       <div class="unmask-hero" style="border-color: ${color};">
@@ -1442,6 +1516,7 @@ function renderNIAHCard(result, modelId) {
       </div>
     </div>
     <div class="unmask-details">
+      ${calibBlock}
       <details class="unmask-panel" open>
         <summary class="unmask-panel-title">${t("niah.section.breakdown") || "Architecture breakdown"}</summary>
         <ul>
@@ -1512,7 +1587,11 @@ async function runNIAHPredict() {
     return;
   }
   const result = predictNIAHReasoning(cfg, T_eval);
-  $("niah-output").innerHTML = renderNIAHCard(result, __niahLastModelId);
+  // Ensure RULER KB is loaded once; idempotent. No-op if already loaded.
+  await loadRulerKB();
+  // Calibrate against published RULER measurements if available.
+  const calib = calibrateNIAH(__niahLastModelId, T_eval, result);
+  $("niah-output").innerHTML = renderNIAHCard(result, __niahLastModelId, calib);
   $("niah-status").textContent = tFmt("niah.status.done", {
     verdict: t(`niah.verdict.${result.verdict}`) || result.verdict,
     niah: (result.niah_rate * 100).toFixed(0),
js/niah_reasoning.js CHANGED
@@ -141,3 +141,148 @@ export function sweepContextLengths(config, lengths = null) {
   );
   return defaults.map(T => predictNIAHReasoning(config, T));
 }
+
+
+// =============================================================================
+// RULER calibration (v0.8.6 anti-bullshit pack #12)
+// =============================================================================
+//
+// The heuristic predictor above is a Padé-canonical extrapolation from
+// architectural inputs. It's calibrated against ROUGH RULER bands, but
+// for any specific (model, context) pair where NVIDIA published a
+// measurement, the published number is GROUND TRUTH. This block layers
+// calibration on top: when the user's model id matches a row in
+// data/ruler_kb.json, we interpolate the published RULER aggregate at
+// the requested T_eval and back out per-task estimates via the paper's
+// retrieval-vs-reasoning factor band.
+//
+// Anti-bullshit principle: if measured data exists, USE the measured
+// data; don't ship a heuristic guess that contradicts it. Surface the
+// heuristic-vs-calibrated delta so users see when our predictor was
+// over- or under-confident vs the published ground truth.
+
+let _rulerKb = null;
+
+export async function loadRulerKB(url = "./data/ruler_kb.json") {
+  if (_rulerKb) return _rulerKb;
+  try {
+    const res = await fetch(url);
+    if (!res.ok) throw new Error(`RULER KB fetch failed: ${res.status}`);
+    _rulerKb = await res.json();
+    // Build alias→canonical reverse index for fast lookup. Lowercase
+    // for case-insensitive matching of user-pasted ids.
+    _rulerKb._aliasIndex = {};
+    for (const [canon, m] of Object.entries(_rulerKb.models)) {
+      _rulerKb._aliasIndex[canon.toLowerCase()] = canon;
+      for (const a of m.id_aliases || []) {
+        _rulerKb._aliasIndex[a.toLowerCase()] = canon;
+      }
+    }
+    return _rulerKb;
+  } catch (e) {
+    return null;
+  }
+}
+
+export function getRulerKB() { return _rulerKb; }
+
+// Lookup a model in the KB. Tolerates: bare canonical key, any listed
+// alias, or HF "{org}/{name}" form. Returns the model entry or null.
+export function lookupRulerModel(modelId) {
+  if (!_rulerKb || !modelId) return null;
+  const k = String(modelId).trim().toLowerCase();
+  const canon = _rulerKb._aliasIndex[k];
+  if (canon) return { canonical: canon, ..._rulerKb.models[canon] };
+  // Try the post-`/` segment too (e.g. "meta-llama/Llama-3.1-70B-Instruct"
+  // → "Llama-3.1-70B-Instruct")
+  const tail = k.includes("/") ? k.split("/").pop() : null;
+  if (tail) {
+    const c2 = _rulerKb._aliasIndex[tail];
+    if (c2) return { canonical: c2, ..._rulerKb.models[c2] };
+  }
+  return null;
+}
+
+// Linear-interpolate RULER aggregate score between bracketing context
+// samples. Outside the measured 4K-128K range we clamp at the nearest
+// endpoint and set the `extrapolated` flag; returns null only when the
+// entry carries no numeric samples at all.
+function interpolateRulerAvg(rulerEntry, T_eval) {
+  const levels = [4096, 8192, 16384, 32768, 65536, 131072];
+  const keys = ["4k", "8k", "16k", "32k", "64k", "128k"];
+  const vals = keys.map(k => rulerEntry.ruler_avg[k]).filter(v => typeof v === "number");
+  if (vals.length === 0) return null;
+  // Below smallest sample → clamp at first
+  if (T_eval <= levels[0]) {
+    return { value: rulerEntry.ruler_avg[keys[0]], extrapolated: T_eval < levels[0], anchor: keys[0] };
+  }
+  // Above largest sample → clamp at last (extrapolation flag set)
+  if (T_eval >= levels[levels.length - 1]) {
+    return { value: rulerEntry.ruler_avg[keys[keys.length - 1]], extrapolated: T_eval > levels[levels.length - 1], anchor: keys[keys.length - 1] };
+  }
+  // Find bracketing pair
+  for (let i = 0; i < levels.length - 1; i++) {
+    if (T_eval >= levels[i] && T_eval <= levels[i + 1]) {
+      const a = rulerEntry.ruler_avg[keys[i]];
+      const b = rulerEntry.ruler_avg[keys[i + 1]];
+      // Linear in log-context (RULER scores degrade roughly linearly
+      // in log T near the effective-length boundary)
+      const t = (Math.log2(T_eval) - Math.log2(levels[i])) /
+                (Math.log2(levels[i + 1]) - Math.log2(levels[i]));
+      return { value: a + (b - a) * t, extrapolated: false, anchor: `${keys[i]}↔${keys[i + 1]}` };
+    }
+  }
+  return null;
+}
+
+// Calibrate a heuristic prediction against the published RULER
+// aggregate. Returns null if the model isn't in the KB. Returns a
+// calibration object otherwise: measured aggregate, derived NIAH and
+// reasoning rates, and the delta vs heuristic.
+export function calibrateNIAH(modelId, T_eval, heuristicResult) {
+  const entry = lookupRulerModel(modelId);
+  if (!entry || !_rulerKb) return null;
+
+  const interp = interpolateRulerAvg(entry, T_eval);
+  if (!interp) return null;
+
+  const aggregate = interp.value; // 0-100 scale per RULER convention
+  const priors = _rulerKb.task_breakdown_priors || {
+    retrieval_factor: 1.04,
+    reasoning_factor: 0.78,
+  };
+  const niahCalibrated = Math.min(1.0, (aggregate * priors.retrieval_factor) / 100);
+  const reasoningCalibrated = Math.min(1.0, (aggregate * priors.reasoning_factor) / 100);
+
+  return {
+    canonical_id: entry.canonical,
+    matched_alias: modelId,
+    ruler_avg_pct: Math.round(aggregate * 10) / 10,
+    interp_anchor: interp.anchor,
+    extrapolated: interp.extrapolated,
+    claimed_context: entry.claimed_context,
+    effective_context: entry.effective_context,
+    niah_calibrated: Math.round(niahCalibrated * 100) / 100,
+    reasoning_calibrated: Math.round(reasoningCalibrated * 100) / 100,
+    delta_niah: heuristicResult
+      ? Math.round((niahCalibrated - heuristicResult.niah_rate) * 100) / 100
+      : null,
+    delta_reasoning: heuristicResult
+      ? Math.round((reasoningCalibrated - heuristicResult.reasoning_rate) * 100) / 100
+      : null,
+    retrieval_factor: priors.retrieval_factor,
+    reasoning_factor: priors.reasoning_factor,
+    source_url: _rulerKb.source?.primary || "",
+  };
+}
+
+// List all models in the KB (for UI dropdown / "did you mean" hint).
+export function listRulerModels() {
+  if (!_rulerKb) return [];
+  return Object.entries(_rulerKb.models).map(([k, v]) => ({
+    canonical: k,
+    aliases: v.id_aliases || [],
+    claimed_context: v.claimed_context,
+    effective_context: v.effective_context,
+    category: v.category,
+  }));
+}