karlexmarin Claude Opus 4.7 (1M context) committed on
Commit 5c94f4b · Parent: 2288c3d

v0.8.6 NIAH RULER calibration — anti-bullshit pack #12


The 🔍 NIAH→Reason mode predicted pass rates from architectural
inputs (γ_Padé, d_horizon, GQA pressure, SWA boundary) using a
heuristic logistic. That heuristic was calibrated against rough
RULER bands but never validated against per-model-per-context
ground truth. Anti-bullshit principle: if measured data exists,
USE the measured data.

Layered RULER calibration on top of the existing predictor:

- NEW data/ruler_kb.json — 12 models from RULER paper Table 3 +
DeepWiki leaderboard. Each row carries 4K/8K/16K/32K/64K/128K
aggregate scores, claimed-vs-effective context, params, and
multi-name aliases (org/name + bare-name + unsloth mirrors, so
autocomplete-pasted ids resolve). Models: GPT-4-1106, Command-R-35B,
Yi-34B-200K, Mixtral-8x7B/-8x22B, Mistral-7B-v0.2, ChatGLM3-6B,
LWM-7B, Llama-3.1-70B-Instruct, Gemini-1.5-Pro, Jamba-1.5-Large,
Qwen2.5-14B-1M.

- `loadRulerKB()` + `lookupRulerModel()` + `calibrateNIAH()` in
niah_reasoning.js. Lookup is case-insensitive on the alias_index
with org/name + bare-name + lowercase fallback. Linear-interpolate
in log-context between bracketing samples; clamp + flag
extrapolation outside the 4K-128K measured range.
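The clamp-and-interpolate step can be sketched standalone (the `ruler_avg` values below are the Llama-3.1-70B-Instruct row from the KB; the shipped implementation is `interpolateRulerAvg()` in niah_reasoning.js):

```javascript
// Log-context clamp + interpolation, standalone sketch.
// scores[] = Llama-3.1-70B-Instruct ruler_avg from data/ruler_kb.json.
const levels = [4096, 8192, 16384, 32768, 65536, 131072];
const scores = [96.5, 95.8, 95.4, 94.8, 88.4, 66.6];

function interpLogCtx(T) {
  if (T <= levels[0]) return scores[0];                // clamp below 4K
  if (T >= levels[levels.length - 1]) {
    return scores[scores.length - 1];                  // clamp above 128K (the real module also flags extrapolation)
  }
  for (let i = 0; i < levels.length - 1; i++) {
    if (T <= levels[i + 1]) {
      // linear in log2(context) between the bracketing samples
      const t = (Math.log2(T) - Math.log2(levels[i])) /
                (Math.log2(levels[i + 1]) - Math.log2(levels[i]));
      return scores[i] + (scores[i + 1] - scores[i]) * t;
    }
  }
}

console.log(interpLogCtx(32768)); // ≈ 94.8 (a sampled point)
console.log(interpLogCtx(49152)); // between 94.8 and 88.4
```

Note the log2 grid: 49,152 tokens (the arithmetic midpoint of 32K-64K) lands ~0.58 of the way through the interval, not halfway.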

- Per-task back-out: RULER aggregate × retrieval_factor (1.04) for
NIAH, × reasoning_factor (0.78) for multi-hop QA. Factors derived
from RULER paper Appendix Tables 13-16 (top-tier models score
retrieval 95-100%, QA ~70%, the canonical ~25pp gap). Honest
range surfaced inline (retrieval 0.95-1.10×, reasoning 0.60-0.85×).
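The back-out arithmetic, as a standalone sketch (factors mirror `task_breakdown_priors`; the shipped logic is `calibrateNIAH()`):

```javascript
// Per-task back-out from the RULER aggregate, standalone sketch.
const RETRIEVAL_FACTOR = 1.04; // honest range 0.95-1.10
const REASONING_FACTOR = 0.78; // honest range 0.60-0.85

function backOut(aggregatePct) {
  return {
    niah: Math.min(1.0, (aggregatePct * RETRIEVAL_FACTOR) / 100),
    reasoning: Math.min(1.0, (aggregatePct * REASONING_FACTOR) / 100),
  };
}

// Llama-3.1-70B-Instruct: RULER aggregate 94.8 @ 32K, 66.6 @ 128K
const a = backOut(94.8);
const b = backOut(66.6);
console.log(Math.round(a.niah * 100), Math.round(a.reasoning * 100)); // 99 74
console.log(Math.round(b.niah * 100), Math.round(b.reasoning * 100)); // 69 52
```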

- UI: green-bordered "📊 RULER-calibrated" panel above the existing
architecture breakdown when the model id matches the KB. Shows
measured RULER aggregate at T_eval, derived NIAH/reasoning rates,
and a side-by-side delta vs the heuristic prediction (color-coded
+/- pp). Extrapolation warning for T_eval > 128K. Source citation
links to the paper. KB-miss path shows an explicit "calibration
unavailable, heuristic only" hint.

- 15 i18n keys × 4 langs (EN/ES/FR/ZH) = 60 keys.

Verified locally: 12 models load, alias lookup hits HF org/name +
bare names + lowercase variants, calibration arithmetic produces
sensible numbers (Llama-3.1-70B @ 32K → RULER 94.8% → NIAH 99% /
reasoning 74%; @ 128K → 66.6% → 69% / 52% — the canonical RULER
"reasoning collapses at long context" finding).

Refs:
- https://arxiv.org/abs/2404.06654 (Hsieh et al., COLM 2024)
- https://github.com/NVIDIA/RULER

Closes the v0.8 roadmap #5 commitment ("RULER-backed NIAH calibrator
to upgrade the predictor from heuristic to calibrated").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (4)
  1. data/ruler_kb.json +116 -0
  2. js/i18n.js +60 -0
  3. js/main.js +82 -3
  4. js/niah_reasoning.js +145 -0
data/ruler_kb.json ADDED
@@ -0,0 +1,116 @@
+{
+  "version": "1.0",
+  "compiled": "2026-05-07",
+  "source": {
+    "primary": "RULER paper Table 3 (Hsieh et al., COLM 2024) — arxiv.org/abs/2404.06654",
+    "secondary": "DeepWiki/NVIDIA/RULER aggregated leaderboard (~34 models)",
+    "license": "Numbers reproduced for review/calibration; see paper for evaluation methodology."
+  },
+  "task_breakdown_priors": {
+    "comment": "RULER aggregate score is the mean of 13 tasks across 4 categories. Per the paper's Appendix Tables 13-16, top-tier models (GPT-4) score: retrieval 95-100%, variable-tracking 99%, aggregation 93%, QA 70%. The ~25pp gap between retrieval and QA is the canonical 'retrieval-vs-reasoning' finding. We use these ratios to back out per-task estimates from the published aggregate.",
+    "retrieval_factor": 1.04,
+    "reasoning_factor": 0.78,
+    "retrieval_factor_caveat": "NIAH-single typically scores 95-99% for top models even when aggregate drops; this multiplier underestimates NIAH on small-model regimes. Honest range: 0.95×–1.10× aggregate.",
+    "reasoning_factor_caveat": "Multi-hop QA degrades faster than aggregate. At long context (>64K) the ratio can drop below 0.6×. Honest range: 0.60×–0.85× aggregate."
+  },
+  "models": {
+    "gpt-4-1106-preview": {
+      "ruler_avg": {"4k": 96.6, "8k": 96.3, "16k": 95.2, "32k": 93.2, "64k": 87.0, "128k": 81.2},
+      "claimed_context": 128000,
+      "effective_context": 64000,
+      "params_b": null,
+      "id_aliases": ["openai/gpt-4-1106-preview", "gpt-4-1106-preview", "gpt-4-turbo"],
+      "category": "frontier_api"
+    },
+    "command-r-35b": {
+      "ruler_avg": {"4k": 93.8, "8k": 93.3, "16k": 92.4, "32k": 89.5, "64k": 84.9, "128k": 76.0},
+      "claimed_context": 128000,
+      "effective_context": 32000,
+      "params_b": 35,
+      "id_aliases": ["CohereForAI/c4ai-command-r-v01", "command-r-35b", "c4ai-command-r-v01"],
+      "category": "open"
+    },
+    "yi-34b-200k": {
+      "ruler_avg": {"4k": 93.3, "8k": 92.2, "16k": 91.3, "32k": 87.5, "64k": 83.2, "128k": 77.3},
+      "claimed_context": 200000,
+      "effective_context": 32000,
+      "params_b": 34,
+      "id_aliases": ["01-ai/Yi-34B-200K", "yi-34b-200k", "Yi-34B-200K"],
+      "category": "open"
+    },
+    "mixtral-8x7b": {
+      "ruler_avg": {"4k": 94.9, "8k": 92.1, "16k": 92.5, "32k": 85.9, "64k": 72.4, "128k": 44.5},
+      "claimed_context": 32000,
+      "effective_context": 32000,
+      "params_b": 47,
+      "id_aliases": ["mistralai/Mixtral-8x7B-Instruct-v0.1", "mixtral-8x7b-instruct", "Mixtral-8x7B-Instruct-v0.1"],
+      "category": "open_moe"
+    },
+    "mistral-7b-v0.2": {
+      "ruler_avg": {"4k": 93.6, "8k": 91.2, "16k": 87.2, "32k": 75.4, "64k": 49.0, "128k": 13.8},
+      "claimed_context": 32000,
+      "effective_context": 16000,
+      "params_b": 7,
+      "id_aliases": ["mistralai/Mistral-7B-Instruct-v0.2", "mistral-7b-v0.2", "Mistral-7B-Instruct-v0.2"],
+      "category": "open"
+    },
+    "chatglm3-6b-128k": {
+      "ruler_avg": {"4k": 87.8, "8k": 83.4, "16k": 78.6, "32k": 69.9, "64k": 56.0, "128k": 42.0},
+      "claimed_context": 128000,
+      "effective_context": 4000,
+      "params_b": 6,
+      "id_aliases": ["THUDM/chatglm3-6b-128k", "chatglm3-6b-128k"],
+      "category": "open"
+    },
+    "lwm-7b": {
+      "ruler_avg": {"4k": 82.3, "8k": 78.4, "16k": 73.7, "32k": 69.1, "64k": 68.1, "128k": 65.0},
+      "claimed_context": 1000000,
+      "effective_context": 4000,
+      "params_b": 7,
+      "id_aliases": ["LargeWorldModel/LWM-Text-Chat-1M", "lwm-7b", "LWM-Text-Chat-1M"],
+      "category": "open"
+    },
+    "llama3.1-70b-instruct": {
+      "ruler_avg": {"4k": 96.5, "8k": 95.8, "16k": 95.4, "32k": 94.8, "64k": 88.4, "128k": 66.6},
+      "claimed_context": 128000,
+      "effective_context": 64000,
+      "params_b": 70,
+      "id_aliases": ["meta-llama/Llama-3.1-70B-Instruct", "llama-3.1-70b", "llama3.1-70b-instruct", "Meta-Llama-3.1-70B-Instruct", "Llama-3.1-70B-Instruct", "unsloth/Meta-Llama-3.1-70B-Instruct", "unsloth/Llama-3.1-70B-Instruct"],
+      "category": "open_frontier"
+    },
+    "mixtral-8x22b-instruct": {
+      "ruler_avg": {"4k": 95.6, "8k": 94.9, "16k": 93.4, "32k": 90.9, "64k": 84.7, "128k": 31.7},
+      "claimed_context": 64000,
+      "effective_context": 32000,
+      "params_b": 141,
+      "id_aliases": ["mistralai/Mixtral-8x22B-Instruct-v0.1", "mixtral-8x22b-instruct", "Mixtral-8x22B-Instruct-v0.1"],
+      "category": "open_moe"
+    },
+    "gemini-1.5-pro": {
+      "ruler_avg": {"4k": 96.7, "8k": 95.8, "16k": 96.0, "32k": 95.9, "64k": 95.9, "128k": 94.4},
+      "claimed_context": 1000000,
+      "effective_context": 128000,
+      "params_b": null,
+      "id_aliases": ["google/gemini-1.5-pro", "gemini-1.5-pro"],
+      "category": "frontier_api"
+    },
+    "jamba-1.5-large": {
+      "ruler_avg": {"4k": 96.3, "8k": 96.2, "16k": 96.0, "32k": 96.0, "64k": 96.0, "128k": 96.0},
+      "claimed_context": 256000,
+      "effective_context": 128000,
+      "params_b": 94,
+      "id_aliases": ["ai21labs/AI21-Jamba-1.5-Large", "jamba-1.5-large", "AI21-Jamba-1.5-Large"],
+      "category": "open_hybrid_ssm"
+    },
+    "qwen2.5-14b-instruct-1m": {
+      "ruler_avg": {"4k": 97.5, "8k": 96.8, "16k": 95.5, "32k": 94.0, "64k": 92.5, "128k": 92.2},
+      "claimed_context": 1000000,
+      "effective_context": 128000,
+      "params_b": 14,
+      "id_aliases": ["Qwen/Qwen2.5-14B-Instruct-1M", "qwen2.5-14b-1m", "Qwen2.5-14B-Instruct-1M", "qwen2.5-14b-instruct-1m"],
+      "category": "open_frontier"
+    }
+  },
+  "context_levels": [4096, 8192, 16384, 32768, 65536, 131072],
+  "context_level_keys": ["4k", "8k", "16k", "32k", "64k", "128k"]
+}
js/i18n.js CHANGED
@@ -439,6 +439,21 @@ export const TRANSLATIONS = {
   "niah.label.safe_ctx": "Safe reasoning context",
   "niah.section.breakdown": "Architecture breakdown",
   "niah.section.reco": "Recommendation",
+  "niah.calib.heading": "RULER-calibrated (NVIDIA published data)",
+  "niah.calib.matched": "Matched <code>{alias}</code> → KB row <code>{canonical}</code>.",
+  "niah.calib.aggregate": "RULER aggregate",
+  "niah.calib.interp": "interpolated between",
+  "niah.calib.extrapolated": "extrapolated outside RULER's measured range",
+  "niah.calib.col.heuristic": "Heuristic",
+  "niah.calib.col.calibrated": "RULER-calibrated",
+  "niah.calib.col.delta": "Δ",
+  "niah.calib.factors": "Per-task factors from RULER paper Appendix Tables 13-16:",
+  "niah.calib.factors_caveat": "honest range: retrieval 0.95-1.10×, reasoning 0.60-0.85×",
+  "niah.calib.claimed_vs_effective": "Paper-reported",
+  "niah.calib.claimed": "claimed",
+  "niah.calib.effective": "effective",
+  "niah.calib.source": "Source",
+  "niah.calib.miss": "RULER calibration unavailable for this model — using architectural heuristic only. Add to data/ruler_kb.json if you have measured numbers.",
   "niah.section.sweep": "Pass rate sweep across context lengths",
   "niah.field.dhorizon": "d_horizon (effective)",
   "niah.field.ratio": "T_eval / d_horizon",
@@ -1597,6 +1612,21 @@ export const TRANSLATIONS = {
   "niah.label.safe_ctx": "Contexto seguro de reasoning",
   "niah.section.breakdown": "Desglose arquitectónico",
   "niah.section.reco": "Recomendación",
+  "niah.calib.heading": "Calibrado con RULER (datos publicados por NVIDIA)",
+  "niah.calib.matched": "Coincide <code>{alias}</code> → fila KB <code>{canonical}</code>.",
+  "niah.calib.aggregate": "Agregado RULER",
+  "niah.calib.interp": "interpolado entre",
+  "niah.calib.extrapolated": "extrapolado fuera del rango medido por RULER",
+  "niah.calib.col.heuristic": "Heurística",
+  "niah.calib.col.calibrated": "Calibrado RULER",
+  "niah.calib.col.delta": "Δ",
+  "niah.calib.factors": "Factores por tarea del paper RULER, Apéndice Tablas 13-16:",
+  "niah.calib.factors_caveat": "rango honesto: retrieval 0.95-1.10×, reasoning 0.60-0.85×",
+  "niah.calib.claimed_vs_effective": "Reportado en paper",
+  "niah.calib.claimed": "claimed",
+  "niah.calib.effective": "effective",
+  "niah.calib.source": "Fuente",
+  "niah.calib.miss": "Calibración RULER no disponible para este modelo — usando solo heurística arquitectónica. Añade a data/ruler_kb.json si tienes números medidos.",
   "niah.section.sweep": "Barrido de tasas pass por longitud de contexto",
   "niah.field.dhorizon": "d_horizon (efectivo)",
   "niah.field.ratio": "T_eval / d_horizon",
@@ -2619,6 +2649,21 @@ export const TRANSLATIONS = {
   "niah.label.safe_ctx": "Contexte sûr pour reasoning",
   "niah.section.breakdown": "Détail architectural",
   "niah.section.reco": "Recommandation",
+  "niah.calib.heading": "Calibré avec RULER (données publiées par NVIDIA)",
+  "niah.calib.matched": "Correspond <code>{alias}</code> → ligne KB <code>{canonical}</code>.",
+  "niah.calib.aggregate": "Agrégat RULER",
+  "niah.calib.interp": "interpolé entre",
+  "niah.calib.extrapolated": "extrapolé hors de la plage mesurée par RULER",
+  "niah.calib.col.heuristic": "Heuristique",
+  "niah.calib.col.calibrated": "Calibré RULER",
+  "niah.calib.col.delta": "Δ",
+  "niah.calib.factors": "Facteurs par tâche du paper RULER, Appendice Tables 13-16 :",
+  "niah.calib.factors_caveat": "plage honnête : retrieval 0.95-1.10×, reasoning 0.60-0.85×",
+  "niah.calib.claimed_vs_effective": "Rapporté dans le paper",
+  "niah.calib.claimed": "claimed",
+  "niah.calib.effective": "effective",
+  "niah.calib.source": "Source",
+  "niah.calib.miss": "Calibration RULER indisponible pour ce modèle — utilisation de l'heuristique architecturale seule. Ajoutez à data/ruler_kb.json si vous avez des chiffres mesurés.",
   "niah.section.sweep": "Balayage des taux par longueur de contexte",
   "niah.field.dhorizon": "d_horizon (effectif)",
   "niah.field.ratio": "T_eval / d_horizon",
@@ -3641,6 +3686,21 @@ export const TRANSLATIONS = {
   "niah.label.safe_ctx": "Reasoning 安全上下文",
   "niah.section.breakdown": "架构细节",
   "niah.section.reco": "建议",
+  "niah.calib.heading": "RULER 校准(NVIDIA 已发布数据)",
+  "niah.calib.matched": "匹配 <code>{alias}</code> → KB 行 <code>{canonical}</code>。",
+  "niah.calib.aggregate": "RULER 聚合分",
+  "niah.calib.interp": "在以下之间插值",
+  "niah.calib.extrapolated": "外推到 RULER 已测范围之外",
+  "niah.calib.col.heuristic": "启发式",
+  "niah.calib.col.calibrated": "RULER 校准",
+  "niah.calib.col.delta": "Δ",
+  "niah.calib.factors": "来自 RULER 论文附录表 13-16 的每任务因子:",
+  "niah.calib.factors_caveat": "诚实范围:retrieval 0.95-1.10×,reasoning 0.60-0.85×",
+  "niah.calib.claimed_vs_effective": "论文报告",
+  "niah.calib.claimed": "claimed",
+  "niah.calib.effective": "effective",
+  "niah.calib.source": "来源",
+  "niah.calib.miss": "此模型暂无 RULER 校准——仅使用架构启发式。如有实测数字,请添加到 data/ruler_kb.json。",
   "niah.section.sweep": "按上下文长度扫描通过率",
   "niah.field.dhorizon": "d_horizon(有效)",
   "niah.field.ratio": "T_eval / d_horizon",
js/main.js CHANGED
@@ -18,7 +18,7 @@ import { rateAllBenchmarks, BENCHMARK_DB } from "./contamination_prior.js";
 import { predictQuantShift, predictAllSchemes, QUANT_SCHEMES } from "./quant_regime.js";
 import { attachAllHfAutocompletes } from "./hf_autocomplete.js";
 import { computeDriftBound, FRAMEWORKS as DRIFT_FRAMEWORKS, DTYPES as DRIFT_DTYPES } from "./cross_drift.js";
-import { predictNIAHReasoning, sweepContextLengths } from "./niah_reasoning.js";
+import { predictNIAHReasoning, sweepContextLengths, loadRulerKB, calibrateNIAH, listRulerModels } from "./niah_reasoning.js";
 import {
   loadSaturationKB, classifyAll, classifyBenchmark,
   listBenchmarks, attribution as saturationAttribution, tryFetchLive,
@@ -1419,7 +1419,7 @@ async function niahFetchConfig() {
   }
 }
 
-function renderNIAHCard(result, modelId) {
+function renderNIAHCard(result, modelId, calib = null) {
   const escapeHtml = (s) => String(s).replace(/[&<>"']/g, c =>
     ({"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;"}[c]));
   const fmtN = (x) => x === null || x === undefined ? "—" : Number(x).toLocaleString();
@@ -1430,6 +1430,80 @@ function renderNIAHCard(result, modelId) {
     ? tFmt("niah.safe_context", { ctx: result.safe_context })
     : (t("niah.safe_context_none") || "No safe context found below your target — model fails reasoning even at small contexts.");
 
+  // RULER calibration block — appears only when KB lookup hits.
+  // Shows measured RULER aggregate, derived NIAH/reasoning, and the
+  // delta vs the heuristic so users see when the predictor was off.
+  let calibBlock = "";
+  if (calib) {
+    const fmtPct = (v) => `${(v * 100).toFixed(0)}%`;
+    const fmtDelta = (d) => {
+      if (d == null) return "—";
+      const pp = Math.round(d * 100);
+      const sign = pp > 0 ? "+" : "";
+      const col = Math.abs(pp) >= 10 ? "#f0883e" : Math.abs(pp) >= 5 ? "#d29922" : "#8b949e";
+      return `<span style="color:${col};">${sign}${pp} pp</span>`;
+    };
+    const extrapNote = calib.extrapolated
+      ? `<span class="subtle" style="color:#d29922;font-size:0.85em;"> ⚠ ${t("niah.calib.extrapolated") || "extrapolated outside RULER's measured range"}</span>`
+      : "";
+    calibBlock = `
+      <details class="unmask-panel" open style="border-left:3px solid #3fb950;">
+        <summary class="unmask-panel-title">📊 ${t("niah.calib.heading") || "RULER-calibrated (NVIDIA published data)"}</summary>
+        <p>${tFmt("niah.calib.matched", {
+          alias: escapeHtml(calib.matched_alias),
+          canonical: escapeHtml(calib.canonical_id),
+        }) || `Matched <code>${escapeHtml(calib.matched_alias)}</code> → KB row <code>${escapeHtml(calib.canonical_id)}</code>.`}</p>
+        <p>
+          <strong>${t("niah.calib.aggregate") || "RULER aggregate"} @ ${fmtN(result.T_eval)}:</strong>
+          <code>${calib.ruler_avg_pct}%</code>
+          <span class="subtle">(${t("niah.calib.interp") || "interpolated between"} ${calib.interp_anchor})</span>${extrapNote}
+        </p>
+        <table class="arena-table" style="margin-top:0.5em;">
+          <thead><tr>
+            <th></th>
+            <th>${t("niah.calib.col.heuristic") || "Heuristic"}</th>
+            <th>${t("niah.calib.col.calibrated") || "RULER-calibrated"}</th>
+            <th>${t("niah.calib.col.delta") || "Δ"}</th>
+          </tr></thead>
+          <tbody>
+            <tr>
+              <td><strong>NIAH</strong></td>
+              <td>${fmtPct(result.niah_rate)}</td>
+              <td><strong>${fmtPct(calib.niah_calibrated)}</strong></td>
+              <td>${fmtDelta(calib.delta_niah)}</td>
+            </tr>
+            <tr>
+              <td><strong>${t("niah.label.reasoning") || "Reasoning"}</strong></td>
+              <td>${fmtPct(result.reasoning_rate)}</td>
+              <td><strong>${fmtPct(calib.reasoning_calibrated)}</strong></td>
+              <td>${fmtDelta(calib.delta_reasoning)}</td>
+            </tr>
+          </tbody>
+        </table>
+        <p class="recipe-desc subtle" style="font-size:0.82em;">
+          ${t("niah.calib.factors") || "Per-task factors from RULER paper Appendix Tables 13-16:"}
+          retrieval = ${calib.retrieval_factor}× aggregate,
+          reasoning = ${calib.reasoning_factor}× aggregate
+          (${t("niah.calib.factors_caveat") || "honest range: retrieval 0.95-1.10×, reasoning 0.60-0.85×"}).
+        </p>
+        <p class="recipe-desc subtle" style="font-size:0.82em;">
+          ${t("niah.calib.claimed_vs_effective") || "Paper-reported"}:
+          ${t("niah.calib.claimed") || "claimed"} ${fmtN(calib.claimed_context)} /
+          ${t("niah.calib.effective") || "effective"} ${fmtN(calib.effective_context)}.
+          ${t("niah.calib.source") || "Source"}:
+          <a href="${calib.source_url}" target="_blank" rel="noopener noreferrer">RULER paper (Hsieh et al., COLM 2024)</a>
+        </p>
+      </details>
+    `;
+  } else if (modelId) {
+    // KB miss — explicitly state we're heuristic-only.
+    calibBlock = `
+      <p class="recipe-desc subtle" style="font-size:0.85em;margin-top:0.5em;">
+        💡 ${t("niah.calib.miss") || "RULER calibration unavailable for this model — using architectural heuristic only. Add to data/ruler_kb.json if you have measured numbers."}
+      </p>
+    `;
+  }
+
   return `
     <div class="unmask-result">
       <div class="unmask-hero" style="border-color: ${color};">
@@ -1442,6 +1516,7 @@ function renderNIAHCard(result, modelId) {
       </div>
     </div>
     <div class="unmask-details">
+      ${calibBlock}
       <details class="unmask-panel" open>
         <summary class="unmask-panel-title">${t("niah.section.breakdown") || "Architecture breakdown"}</summary>
         <ul>
@@ -1512,7 +1587,11 @@ async function runNIAHPredict() {
     return;
   }
   const result = predictNIAHReasoning(cfg, T_eval);
-  $("niah-output").innerHTML = renderNIAHCard(result, __niahLastModelId);
+  // Ensure RULER KB is loaded once; idempotent. No-op if already loaded.
+  await loadRulerKB();
+  // Calibrate against published RULER measurements if available.
+  const calib = calibrateNIAH(__niahLastModelId, T_eval, result);
+  $("niah-output").innerHTML = renderNIAHCard(result, __niahLastModelId, calib);
   $("niah-status").textContent = tFmt("niah.status.done", {
     verdict: t(`niah.verdict.${result.verdict}`) || result.verdict,
     niah: (result.niah_rate * 100).toFixed(0),
js/niah_reasoning.js CHANGED
@@ -141,3 +141,148 @@ export function sweepContextLengths(config, lengths = null) {
   );
   return defaults.map(T => predictNIAHReasoning(config, T));
 }
+
+
+// =============================================================================
+// RULER calibration (v0.8.6 anti-bullshit pack #12)
+// =============================================================================
+//
+// The heuristic predictor above is a Padé-canonical extrapolation from
+// architectural inputs. It's calibrated against ROUGH RULER bands, but
+// for any specific (model, context) pair where NVIDIA published a
+// measurement, the published number is GROUND TRUTH. This block layers
+// calibration on top: when the user's model id matches a row in
+// data/ruler_kb.json, we interpolate the published RULER aggregate at
+// the requested T_eval and back out per-task estimates via the paper's
+// retrieval-vs-reasoning factor band.
+//
+// Anti-bullshit principle: if measured data exists, USE the measured
+// data; don't ship a heuristic guess that contradicts it. Surface the
+// heuristic-vs-calibrated delta so users see when our predictor was
+// over- or under-confident vs the published ground truth.
+
+let _rulerKb = null;
+
+export async function loadRulerKB(url = "./data/ruler_kb.json") {
+  if (_rulerKb) return _rulerKb;
+  try {
+    const res = await fetch(url);
+    if (!res.ok) throw new Error(`RULER KB fetch failed: ${res.status}`);
+    _rulerKb = await res.json();
+    // Build alias→canonical reverse index for fast lookup. Lowercase
+    // for case-insensitive matching of user-pasted ids.
+    _rulerKb._aliasIndex = {};
+    for (const [canon, m] of Object.entries(_rulerKb.models)) {
+      _rulerKb._aliasIndex[canon.toLowerCase()] = canon;
+      for (const a of m.id_aliases || []) {
+        _rulerKb._aliasIndex[a.toLowerCase()] = canon;
+      }
+    }
+    return _rulerKb;
+  } catch (e) {
+    return null;
+  }
+}
+
+export function getRulerKB() { return _rulerKb; }
+
+// Lookup a model in the KB. Tolerates: bare canonical key, any listed
+// alias, or HF "{org}/{name}" form. Returns the model entry or null.
+export function lookupRulerModel(modelId) {
+  if (!_rulerKb || !modelId) return null;
+  const k = String(modelId).trim().toLowerCase();
+  const canon = _rulerKb._aliasIndex[k];
+  if (canon) return { canonical: canon, ..._rulerKb.models[canon] };
+  // Try the post-`/` segment too (e.g. "meta-llama/Llama-3.1-70B-Instruct"
+  // → "Llama-3.1-70B-Instruct")
+  const tail = k.includes("/") ? k.split("/").pop() : null;
+  if (tail) {
+    const c2 = _rulerKb._aliasIndex[tail];
+    if (c2) return { canonical: c2, ..._rulerKb.models[c2] };
+  }
+  return null;
+}
+
+// Linear-interpolate RULER aggregate score between bracketing context
+// samples. Outside the measured 4K-128K range we clamp at the nearest
+// endpoint and set the `extrapolated` flag; returns null only when the
+// entry carries no numeric samples at all.
+function interpolateRulerAvg(rulerEntry, T_eval) {
+  const levels = [4096, 8192, 16384, 32768, 65536, 131072];
+  const keys = ["4k", "8k", "16k", "32k", "64k", "128k"];
+  const vals = keys.map(k => rulerEntry.ruler_avg[k]).filter(v => typeof v === "number");
+  if (vals.length === 0) return null;
+  // Below smallest sample → clamp at first
+  if (T_eval <= levels[0]) {
+    return { value: rulerEntry.ruler_avg[keys[0]], extrapolated: T_eval < levels[0], anchor: keys[0] };
+  }
+  // Above largest sample → clamp at last (extrapolation flag set)
+  if (T_eval >= levels[levels.length - 1]) {
+    return { value: rulerEntry.ruler_avg[keys[keys.length - 1]], extrapolated: T_eval > levels[levels.length - 1], anchor: keys[keys.length - 1] };
+  }
+  // Find bracketing pair
+  for (let i = 0; i < levels.length - 1; i++) {
+    if (T_eval >= levels[i] && T_eval <= levels[i + 1]) {
+      const a = rulerEntry.ruler_avg[keys[i]];
+      const b = rulerEntry.ruler_avg[keys[i + 1]];
+      // Linear in log-context (RULER scores degrade roughly linearly
+      // in log T near the effective-length boundary)
+      const t = (Math.log2(T_eval) - Math.log2(levels[i])) /
+                (Math.log2(levels[i + 1]) - Math.log2(levels[i]));
+      return { value: a + (b - a) * t, extrapolated: false, anchor: `${keys[i]}↔${keys[i + 1]}` };
+    }
+  }
+  return null;
+}
+
+// Calibrate a heuristic prediction against the published RULER
+// aggregate. Returns null if the model isn't in the KB. Returns a
+// calibration object otherwise: measured aggregate, derived NIAH and
+// reasoning rates, and the delta vs heuristic.
+export function calibrateNIAH(modelId, T_eval, heuristicResult) {
+  const entry = lookupRulerModel(modelId);
+  if (!entry || !_rulerKb) return null;
+
+  const interp = interpolateRulerAvg(entry, T_eval);
+  if (!interp) return null;
+
+  const aggregate = interp.value; // 0-100 scale per RULER convention
+  const priors = _rulerKb.task_breakdown_priors || {
+    retrieval_factor: 1.04,
+    reasoning_factor: 0.78,
+  };
+  const niahCalibrated = Math.min(1.0, (aggregate * priors.retrieval_factor) / 100);
+  const reasoningCalibrated = Math.min(1.0, (aggregate * priors.reasoning_factor) / 100);
+
+  return {
+    canonical_id: entry.canonical,
+    matched_alias: modelId,
+    ruler_avg_pct: Math.round(aggregate * 10) / 10,
+    interp_anchor: interp.anchor,
+    extrapolated: interp.extrapolated,
+    claimed_context: entry.claimed_context,
+    effective_context: entry.effective_context,
+    niah_calibrated: Math.round(niahCalibrated * 100) / 100,
+    reasoning_calibrated: Math.round(reasoningCalibrated * 100) / 100,
+    delta_niah: heuristicResult
+      ? Math.round((niahCalibrated - heuristicResult.niah_rate) * 100) / 100
+      : null,
+    delta_reasoning: heuristicResult
+      ? Math.round((reasoningCalibrated - heuristicResult.reasoning_rate) * 100) / 100
+      : null,
+    retrieval_factor: priors.retrieval_factor,
+    reasoning_factor: priors.reasoning_factor,
+    source_url: _rulerKb.source?.primary || "",
+  };
+}
+
+// List all models in the KB (for UI dropdown / "did you mean" hint).
+export function listRulerModels() {
+  if (!_rulerKb) return [];
+  return Object.entries(_rulerKb.models).map(([k, v]) => ({
+    canonical: k,
+    aliases: v.id_aliases || [],
+    claimed_context: v.claimed_context,
+    effective_context: v.effective_context,
+    category: v.category,
+  }));
+}