karlexmarin Claude Opus 4.7 (1M context) committed
Commit ebabb49 · 1 Parent(s): 6f608c8

v0.8.8 LongScore mode — anti-bullshit pack #14 + Hub badge readability fix


22nd mode: 🎯 LongScore. Look up any HF model id → see its relative
degradation past short context, sourced from RULER per-length scores plus
the HELMET aggregate.

Why: every model claims a 128K context window, but accuracy degrades long
before that. Raw long-ctx scores are dominated by base ability: a smarter
model with a worse long-ctx recipe still outscores a less-smart model with
a better one, hiding the actual degradation. The 100-LongBench paper
(ACL 2025, arXiv:2505.19293) proposed LongScore to disentangle base
ability from true long-ctx capability:
    Base = mean(S_4K, S_8K)
    LC_l = (S_l − Base) / Base
    LongScore = mean(LC_l for l in {16K, 32K, 64K, 128K})
0 = no degradation; -0.30 = severe.
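
A minimal JS sketch of that computation (mirrors the KB field names in the
file below; the function name and return shape are illustrative, not the
shipped API):

    // perCtx: RULER scores keyed by context length, e.g. { "4k": 96.5, ... }
    export function longScore(perCtx) {
      const base = (perCtx["4k"] + perCtx["8k"]) / 2;        // Base = mean(S_4K, S_8K)
      const lens = ["16k", "32k", "64k", "128k"];
      const lcs = lens.map(l => (perCtx[l] - base) / base);  // LC_l = (S_l - Base) / Base
      const avgLc = lcs.reduce((a, b) => a + b, 0) / lens.length;
      return { base, per_length_lc: Object.fromEntries(lens.map((l, i) => [l, lcs[i]])), avg_lc: avgLc };
    }

Fed the Llama-3.1-70B-Inst RULER row from the KB, avg_lc comes out ≈ -0.1024,
matching the entry below.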

Stage 0.5 verified (per PROTOCOL.md): no existing browser tool surfaces
LongScore for a given model id. tiktokenizer.vercel.app, llm-stats.com,
BenchLM.ai use raw RULER/needle scores. HELM Long Context aggregates 5
benchmarks but doesn't compute LongScore. Genuinely novel.

Files:
- data/longscore_kb.json (NEW, ~70 KB) — 93 unique models keyed by
canonical HF id. 35 with full LongScore (RULER per-length present).
Source: ruler_kb_day5.json + helmet_kb.json from large_model_validation/
Day-5 + Day-8 work. Built by build_longscore_kb_for_tafagent.py.
- js/longscore.js (NEW) — pure logic: loadKB, normalize, classify,
  lookup, listAllIds, rank. No UI strings. ES module. (Sketch of
  normalize + classify below, after this list.)
- index.html — new tab 🎯 LongScore between Token Tax and Solutions Hub.
New <section id="longscore-section"> with input + 3 example buttons +
output panels. Help modal v0.8.8 entry. Inventory v0.8.8 entry.
- js/main.js — initLongscore, renderLongscoreResult (per-length bars +
HELMET 7-task collapsible + verdict color band), runLongscoreLookup,
button wiring + Enter-key handler.
- js/i18n.js — 35 new keys × 4 langs (EN/ES/FR/ZH) = 140 keys total.
Includes mode label, tooltip, formula note, miss/hit/helmet_only state
text, verdict ladder (no_degradation → mild → moderate → severe →
extreme), help modal body.
- data/solutions_hub.json — new pain entry "long_ctx_degradation"
covered by 🎯 LongScore. Curated 6 external tools: 100-LongBench
paper, HELMET (repo + sheet), RULER, LongBench v2, Chroma context-rot.
- scripts/test_longscore.mjs (NEW) — 25 smoke tests for normalize,
classify, lookup. All pass.
- scripts/test_longscore_e2e.mjs (NEW) — 5 E2E lookup cases for the
3 example buttons + HELMET-only model + miss. All pass.
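
Sketch of the two pure helpers named above (bodies illustrative, inferred
from the KB key scheme and thresholds; the shipped code is js/longscore.js):

    // KB keys are the HF id lowercased, with non-alphanumerics and
    // letter/digit boundaries collapsed to hyphens.
    export function normalize(hfId) {
      return hfId.split("/").pop().toLowerCase()
        .replace(/[^a-z0-9]+/g, "-")
        .replace(/([a-z])(\d)/g, "$1-$2")
        .replace(/(\d)([a-z])/g, "$1-$2")
        .replace(/-+/g, "-").replace(/^-|-$/g, "");
    }
    // normalize("Qwen/Qwen2.5-7B-Instruct") -> "qwen-2-5-7-b-instruct"

    // Verdict ladder against the KB thresholds; >= boundary handling is an assumption.
    export function classify(avgLc) {
      if (avgLc == null) return "helmet_only";
      if (avgLc >= -0.02) return "no_degradation";
      if (avgLc >= -0.1) return "mild";
      if (avgLc >= -0.2) return "moderate";
      if (avgLc >= -0.3) return "severe";
      return "extreme";
    }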

Hub badge fix (separate but bundled, 1-line):
- The "covered by mode" badges in Solutions Hub had `color: var(--success)`
inherited from .badge default and were also given inline
`background:#3fb950` (also success green). Result: green text on green
background → invisible. Fixed by adding inline `color:#fff` (white) and
`border-color` to all 3 badge variants (covered, planned, external).
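
Hypothetical before/after for the covered variant (markup reconstructed for
illustration; only the color/border-color additions are from this commit):

    <!-- before: green text inherited from .badge, on green background -->
    <span class="badge" style="background:#3fb950">covered by 🎯 LongScore</span>
    <!-- after -->
    <span class="badge" style="background:#3fb950; color:#fff; border-color:#3fb950">covered by 🎯 LongScore</span>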

Verification:
- node scripts/test_longscore.mjs → 25/25 pass
- node scripts/test_longscore_e2e.mjs → 5/5 pass
- python -m http.server + curl all assets → HTTP 200
- KB sanity: Llama-3.1-70B-Inst LongScore = -0.1024 matches Day-8
manual computation exactly.
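
That check, restated from the KB's RULER row for that model:
    Base = (96.5 + 95.8) / 2 = 96.15
    LC_16K  = (95.4 − 96.15) / 96.15 = -0.0078
    LC_32K  = (94.8 − 96.15) / 96.15 = -0.0140
    LC_64K  = (88.4 − 96.15) / 96.15 = -0.0806
    LC_128K = (66.6 − 96.15) / 96.15 = -0.3073
    LongScore = mean of the four = -0.1024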

Live HF Space verification PENDING (requires push to origin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

data/longscore_kb.json ADDED
@@ -0,0 +1,2941 @@
+ {
+   "version": "v0.8.8-longscore-2026-05-08",
+   "metric": "LongScore (100-LongBench, ACL 2025, arXiv:2505.19293, §3.2)",
+   "metric_formula": "Base = mean(S_4K, S_8K); LC_l = (S_l - Base) / Base; LongScore = mean(LC_l for l in {16K, 32K, 64K, 128K})",
+   "metric_interpretation": {
+     "0.0": "no degradation past short context",
+     "-0.05": "mild degradation (~5% relative drop)",
+     "-0.15": "moderate degradation (~15% relative drop)",
+     "-0.30": "severe degradation (~30% relative drop)",
+     "negative": "more negative = worse long-ctx retention"
+   },
+   "thresholds": { "no_degradation": -0.02, "mild": -0.1, "moderate": -0.2, "severe": -0.3 },
+   "sources": {
+     "ruler": "NVIDIA RULER leaderboard + Qwen2.5 Tech Report Table 16 (n=35)",
+     "helmet": "HELMET Google Sheet (princeton-nlp; arXiv:2410.02694; n=63)"
+   },
+   "stats": { "n_total": 93, "n_ruler_only": 30, "n_helmet_only": 58, "n_both": 5, "n_with_longscore": 35 },
+   "models": {
+     "qwen-2-5-7-b-instruct": {
+       "display_name": "qwen2.5-7b-instruct",
+       "ruler_per_ctx": { "4k": 96.7, "8k": 95.1, "16k": 93.7, "32k": 89.4, "64k": 74.5, "128k": 31.4 },
+       "ruler_long_score": { "base": 95.9, "per_length_lc": { "16k": -0.0229, "32k": -0.0678, "64k": -0.2231, "128k": -0.6726 }, "avg_lc": -0.2466 },
+       "helmet": { "overall": 22.8, "categories": { "Recall": 11.4, "RAG": 30.6, "Cite": 3.1, "Re-rank": 1.6, "ICL": 72.0, "LongQA": 21.9, "Summ": 18.8 }, "context_window_k": 128.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": "native", "params_b": 7, "native_context_k": 32, "source": "ruler+helmet"
+     },
+     "qwen-2-5-14-b-instruct": {
+       "display_name": "qwen2.5-14b-instruct",
+       "ruler_per_ctx": { "4k": 96.9, "8k": 97.1, "16k": 95.5, "32k": 95.5, "64k": 90.3, "128k": 82.0 },
+       "ruler_long_score": { "base": 97.0, "per_length_lc": { "16k": -0.0155, "32k": -0.0155, "64k": -0.0691, "128k": -0.1546 }, "avg_lc": -0.0637 },
+       "helmet": null, "recipe_class": "native", "params_b": 14, "native_context_k": 32, "source": "ruler"
+     },
+     "qwen-2-5-32-b-instruct": {
+       "display_name": "qwen2.5-32b-instruct",
+       "ruler_per_ctx": { "4k": 97.7, "8k": 97.2, "16k": 97.7, "32k": 96.5, "64k": 88.5, "128k": 67.0 },
+       "ruler_long_score": { "base": 97.45, "per_length_lc": { "16k": 0.0026, "32k": -0.0097, "64k": -0.0918, "128k": -0.3125 }, "avg_lc": -0.1028 },
+       "helmet": null, "recipe_class": "native", "params_b": 32, "native_context_k": 32, "source": "ruler"
+     },
+     "mistral-7-b-v-0-2": {
+       "display_name": "mistral-7b-v0.2",
+       "ruler_per_ctx": { "4k": 93.6, "8k": 91.2, "16k": 87.2, "32k": 75.4, "64k": 49.0, "128k": 13.8 },
+       "ruler_long_score": { "base": 92.4, "per_length_lc": { "16k": -0.0563, "32k": -0.184, "64k": -0.4697, "128k": -0.8506 }, "avg_lc": -0.3901 },
+       "helmet": null, "recipe_class": "native", "params_b": 7, "native_context_k": 32, "source": "ruler"
+     },
+     "mixtral-8-x-7-b": {
+       "display_name": "mixtral-8x7b",
+       "ruler_per_ctx": { "4k": 94.9, "8k": 92.1, "16k": 92.5, "32k": 85.9, "64k": 72.4, "128k": 44.5 },
+       "ruler_long_score": { "base": 93.5, "per_length_lc": { "16k": -0.0107, "32k": -0.0813, "64k": -0.2257, "128k": -0.5241 }, "avg_lc": -0.2104 },
+       "helmet": null, "recipe_class": "native", "params_b": 47, "native_context_k": 32, "source": "ruler"
+     },
+     "mixtral-8-x-22-b-instruct": {
+       "display_name": "mixtral-8x22b-instruct",
+       "ruler_per_ctx": { "4k": 95.6, "8k": 94.9, "16k": 93.4, "32k": 90.9, "64k": 84.7, "128k": 31.7 },
+       "ruler_long_score": { "base": 95.25, "per_length_lc": { "16k": -0.0194, "32k": -0.0457, "64k": -0.1108, "128k": -0.6672 }, "avg_lc": -0.2108 },
+       "helmet": null, "recipe_class": "native", "params_b": 141, "native_context_k": 64, "source": "ruler"
+     },
+     "qwen-2-72-b": {
+       "display_name": "qwen2-72b",
+       "ruler_per_ctx": { "4k": 96.9, "8k": 96.1, "16k": 94.9, "32k": 94.1, "64k": 79.8, "128k": 53.7 },
+       "ruler_long_score": { "base": 96.5, "per_length_lc": { "16k": -0.0166, "32k": -0.0249, "64k": -0.1731, "128k": -0.4435 }, "avg_lc": -0.1645 },
+       "helmet": null, "recipe_class": "native", "params_b": 72, "native_context_k": 128, "source": "ruler"
+     },
+     "dbrx": {
+       "display_name": "dbrx",
+       "ruler_per_ctx": { "4k": 95.1, "8k": 93.8, "16k": 83.6, "32k": 63.1, "64k": 2.4, "128k": 0.0 },
+       "ruler_long_score": { "base": 94.45, "per_length_lc": { "16k": -0.1149, "32k": -0.3319, "64k": -0.9746, "128k": -1.0 }, "avg_lc": -0.6054 },
+       "helmet": null, "recipe_class": "native", "params_b": 132, "native_context_k": 32, "source": "ruler"
+     },
+     "qwen-1-5-72-b": {
+       "display_name": "qwen1.5-72b",
+       "ruler_per_ctx": { "4k": 94.9, "8k": 93.8, "16k": 78.0, "32k": 67.8, "64k": 0.0, "128k": 0.0 },
+       "ruler_long_score": { "base": 94.35, "per_length_lc": { "16k": -0.1733, "32k": -0.2814, "64k": -1.0, "128k": -1.0 }, "avg_lc": -0.6137 },
+       "helmet": null, "recipe_class": "native", "params_b": 72, "native_context_k": 32, "source": "ruler"
+     },
+     "mistral-large-2407": {
+       "display_name": "mistral-large-2407",
+       "ruler_per_ctx": { "4k": 96.2, "8k": 96.1, "16k": 95.1, "32k": 93.0, "64k": 78.8, "128k": 23.7 },
+       "ruler_long_score": { "base": 96.15, "per_length_lc": { "16k": -0.0109, "32k": -0.0328, "64k": -0.1804, "128k": -0.7535 }, "avg_lc": -0.2444 },
+       "helmet": null, "recipe_class": "native", "params_b": 123, "native_context_k": 32, "source": "ruler"
+     },
+     "yi-34-b-200-k": {
+       "display_name": "yi-34b-200k",
+       "ruler_per_ctx": { "4k": 93.3, "8k": 92.2, "16k": 91.3, "32k": 87.5, "64k": 83.2, "128k": 77.3 },
+       "ruler_long_score": { "base": 92.75, "per_length_lc": { "16k": -0.0156, "32k": -0.0566, "64k": -0.103, "128k": -0.1666 }, "avg_lc": -0.0854 },
+       "helmet": { "overall": 40.3, "categories": { "Recall": 74.1, "RAG": 59.5, "Cite": 1.7, "Re-rank": 20.5, "ICL": 85.0, "LongQA": 26.6, "Summ": 14.3 }, "context_window_k": 200.0, "params_b": "34.0", "type": "♭" },
+       "recipe_class": "engineered_long_ctx", "params_b": 34, "native_context_k": 4, "source": "ruler+helmet"
+     },
+     "command-r-35-b": {
+       "display_name": "command-r-35b",
+       "ruler_per_ctx": { "4k": 93.8, "8k": 93.3, "16k": 92.4, "32k": 89.5, "64k": 84.9, "128k": 76.0 },
+       "ruler_long_score": { "base": 93.55, "per_length_lc": { "16k": -0.0123, "32k": -0.0433, "64k": -0.0925, "128k": -0.1876 }, "avg_lc": -0.0839 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 35, "native_context_k": 8, "source": "ruler"
+     },
+     "command-r-0824-32-b": {
+       "display_name": "command-r-0824-32b",
+       "ruler_per_ctx": { "4k": 94.7, "8k": 93.7, "16k": 93.1, "32k": 90.8, "64k": 86.6, "128k": 74.7 },
+       "ruler_long_score": { "base": 94.2, "per_length_lc": { "16k": -0.0117, "32k": -0.0361, "64k": -0.0807, "128k": -0.207 }, "avg_lc": -0.0839 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 32, "native_context_k": 8, "source": "ruler"
+     },
+     "command-r-plus-104-b": {
+       "display_name": "command-r-plus-104b",
+       "ruler_per_ctx": { "4k": 95.6, "8k": 95.2, "16k": 94.2, "32k": 92.0, "64k": 84.3, "128k": 63.1 },
+       "ruler_long_score": { "base": 95.4, "per_length_lc": { "16k": -0.0126, "32k": -0.0356, "64k": -0.1164, "128k": -0.3386 }, "avg_lc": -0.1258 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 104, "native_context_k": 8, "source": "ruler"
+     },
+     "command-r-plus-0824-104-b": {
+       "display_name": "command-r-plus-0824-104b",
+       "ruler_per_ctx": { "4k": 96.0, "8k": 95.1, "16k": 94.0, "32k": 92.4, "64k": 85.4, "128k": 64.6 },
+       "ruler_long_score": { "base": 95.55, "per_length_lc": { "16k": -0.0162, "32k": -0.033, "64k": -0.1062, "128k": -0.3239 }, "avg_lc": -0.1198 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 104, "native_context_k": 8, "source": "ruler"
+     },
+     "mistral-large-2411-123-b": {
+       "display_name": "mistral-large-2411-123b",
+       "ruler_per_ctx": { "4k": 96.4, "8k": 96.3, "16k": 95.3, "32k": 94.0, "64k": 85.9, "128k": 48.1 },
+       "ruler_long_score": { "base": 96.35, "per_length_lc": { "16k": -0.0109, "32k": -0.0244, "64k": -0.1085, "128k": -0.5008 }, "avg_lc": -0.1612 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 123, "native_context_k": 8, "source": "ruler"
+     },
+     "chatglm-3-6-b-128-k": {
+       "display_name": "chatglm3-6b-128k",
+       "ruler_per_ctx": { "4k": 87.8, "8k": 83.4, "16k": 78.6, "32k": 69.9, "64k": 56.0, "128k": 42.0 },
+       "ruler_long_score": { "base": 85.6, "per_length_lc": { "16k": -0.0818, "32k": -0.1834, "64k": -0.3458, "128k": -0.5093 }, "avg_lc": -0.2801 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 6, "native_context_k": 2, "source": "ruler"
+     },
+     "lwm-7-b": {
+       "display_name": "lwm-7b",
+       "ruler_per_ctx": { "4k": 82.3, "8k": 78.4, "16k": 73.7, "32k": 69.1, "64k": 68.1, "128k": 65.0 },
+       "ruler_long_score": { "base": 80.35, "per_length_lc": { "16k": -0.0828, "32k": -0.14, "64k": -0.1525, "128k": -0.191 }, "avg_lc": -0.1416 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 7, "native_context_k": 4, "source": "ruler"
+     },
+     "glm-4-9-b-1-m": {
+       "display_name": "glm4-9b-1m",
+       "ruler_per_ctx": { "4k": 94.7, "8k": 92.8, "16k": 92.1, "32k": 89.9, "64k": 86.7, "128k": 83.1 },
+       "ruler_long_score": { "base": 93.75, "per_length_lc": { "16k": -0.0176, "32k": -0.0411, "64k": -0.0752, "128k": -0.1136 }, "avg_lc": -0.0619 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 9, "native_context_k": 8, "source": "ruler"
+     },
+     "prolong-8-b-512-k": {
+       "display_name": "prolong-8b-512k",
+       "ruler_per_ctx": { "4k": 94.5, "8k": 92.5, "16k": 92.3, "32k": 89.3, "64k": 83.2, "128k": 81.6 },
+       "ruler_long_score": { "base": 93.5, "per_length_lc": { "16k": -0.0128, "32k": -0.0449, "64k": -0.1102, "128k": -0.1273 }, "avg_lc": -0.0738 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 8, "native_context_k": 8, "source": "ruler"
+     },
+     "megabeam-mistral-7-b-512-k": {
+       "display_name": "megabeam-mistral-7b-512k",
+       "ruler_per_ctx": { "4k": 93.8, "8k": 92.5, "16k": 92.0, "32k": 89.2, "64k": 83.7, "128k": 83.7 },
+       "ruler_long_score": { "base": 93.15, "per_length_lc": { "16k": -0.0123, "32k": -0.0424, "64k": -0.1014, "128k": -0.1014 }, "avg_lc": -0.0644 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 7, "native_context_k": 32, "source": "ruler"
+     },
+     "internlm-2-5-7-b-1-m": {
+       "display_name": "internlm2.5-7b-1m",
+       "ruler_per_ctx": { "4k": 88.1, "8k": 85.5, "16k": 84.5, "32k": 82.7, "64k": 75.5, "128k": 68.9 },
+       "ruler_long_score": { "base": 86.8, "per_length_lc": { "16k": -0.0265, "32k": -0.0472, "64k": -0.1302, "128k": -0.2062 }, "avg_lc": -0.1025 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 7, "native_context_k": 4, "source": "ruler"
+     },
+     "gradientai-llama-3-70-b-1-m": {
+       "display_name": "gradientai-llama3-70b-1m",
+       "ruler_per_ctx": { "4k": 95.1, "8k": 94.4, "16k": 90.8, "32k": 85.4, "64k": 80.9, "128k": 72.1 },
+       "ruler_long_score": { "base": 94.75, "per_length_lc": { "16k": -0.0417, "32k": -0.0987, "64k": -0.1462, "128k": -0.2391 }, "avg_lc": -0.1314 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 70, "native_context_k": 8, "source": "ruler"
+     },
+     "gradientai-llama-3-8-b-1-m": {
+       "display_name": "gradientai-llama3-8b-1m",
+       "ruler_per_ctx": { "4k": 92.8, "8k": 90.3, "16k": 85.7, "32k": 79.9, "64k": 76.3, "128k": 69.5 },
+       "ruler_long_score": { "base": 91.55, "per_length_lc": { "16k": -0.0639, "32k": -0.1273, "64k": -0.1666, "128k": -0.2409 }, "avg_lc": -0.1497 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 8, "native_context_k": 8, "source": "ruler"
+     },
+     "longchat-7-b-32-k": {
+       "display_name": "longchat-7b-32k",
+       "ruler_per_ctx": { "4k": 84.7, "8k": 79.9, "16k": 70.8, "32k": 59.3, "64k": 0.0, "128k": 0.0 },
+       "ruler_long_score": { "base": 82.3, "per_length_lc": { "16k": -0.1397, "32k": -0.2795, "64k": -1.0, "128k": -1.0 }, "avg_lc": -0.6048 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 7, "native_context_k": 4, "source": "ruler"
+     },
+     "longalpaca-13-b-32-k": {
+       "display_name": "longalpaca-13b-32k",
+       "ruler_per_ctx": { "4k": 60.6, "8k": 57.0, "16k": 56.6, "32k": 43.6, "64k": 0.0, "128k": 0.0 },
+       "ruler_long_score": { "base": 58.8, "per_length_lc": { "16k": -0.0374, "32k": -0.2585, "64k": -1.0, "128k": -1.0 }, "avg_lc": -0.574 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 13, "native_context_k": 4, "source": "ruler"
+     },
+     "llama-3-1-70-b-instruct": {
+       "display_name": "llama3.1-70b-instruct",
+       "ruler_per_ctx": { "4k": 96.5, "8k": 95.8, "16k": 95.4, "32k": 94.8, "64k": 88.4, "128k": 66.6 },
+       "ruler_long_score": { "base": 96.15, "per_length_lc": { "16k": -0.0078, "32k": -0.014, "64k": -0.0806, "128k": -0.3073 }, "avg_lc": -0.1024 },
+       "helmet": { "overall": 49.7, "categories": { "Recall": 90.7, "RAG": 56.2, "Cite": 7.5, "Re-rank": 24.5, "ICL": 81.4, "LongQA": 56.3, "Summ": 31.6 }, "context_window_k": 128.0, "params_b": "70.0", "type": "♯" },
+       "recipe_class": "continued_pretrain_cliff", "params_b": 70, "native_context_k": 8, "source": "ruler+helmet"
+     },
+     "llama-3-1-8-b-instruct": {
+       "display_name": "llama3.1-8b-instruct",
+       "ruler_per_ctx": { "4k": 95.5, "8k": 93.8, "16k": 91.6, "32k": 87.4, "64k": 84.7, "128k": 77.0 },
+       "ruler_long_score": { "base": 94.65, "per_length_lc": { "16k": -0.0322, "32k": -0.0766, "64k": -0.1051, "128k": -0.1865 }, "avg_lc": -0.1001 },
+       "helmet": { "overall": 46.5, "categories": { "Recall": 95.2, "RAG": 59.5, "Cite": 2.9, "Re-rank": 14.0, "ICL": 83.9, "LongQA": 43.2, "Summ": 27.0 }, "context_window_k": 128.0, "params_b": "8.0", "type": "♯" },
+       "recipe_class": "continued_pretrain_cliff", "params_b": 8, "native_context_k": 8, "source": "ruler+helmet"
+     },
+     "mistral-nemo-12-b": {
+       "display_name": "mistral-nemo-12b",
+       "ruler_per_ctx": { "4k": 87.8, "8k": 87.2, "16k": 87.7, "32k": 69.0, "64k": 46.8, "128k": 19.0 },
+       "ruler_long_score": { "base": 87.5, "per_length_lc": { "16k": 0.0023, "32k": -0.2114, "64k": -0.4651, "128k": -0.7829 }, "avg_lc": -0.3643 },
+       "helmet": null, "recipe_class": "continued_pretrain_cliff", "params_b": 12, "native_context_k": 8, "source": "ruler"
+     },
+     "qwen-2-5-7-b-instruct-1-m": {
+       "display_name": "qwen2.5-7b-instruct-1m",
+       "ruler_per_ctx": { "4k": 96.8, "8k": 95.3, "16k": 93.0, "32k": 91.1, "64k": 90.4, "128k": 84.4 },
+       "ruler_long_score": { "base": 96.05, "per_length_lc": { "16k": -0.0318, "32k": -0.0515, "64k": -0.0588, "128k": -0.1213 }, "avg_lc": -0.0659 },
+       "helmet": null, "recipe_class": "dca_ssm_flat", "params_b": 7, "native_context_k": 32, "source": "ruler"
+     },
+     "qwen-2-5-14-b-instruct-1-m": {
+       "display_name": "qwen2.5-14b-instruct-1m",
+       "ruler_per_ctx": { "4k": 97.5, "8k": 97.1, "16k": 94.6, "32k": 94.9, "64k": 94.9, "128k": 92.2 },
+       "ruler_long_score": { "base": 97.3, "per_length_lc": { "16k": -0.0277, "32k": -0.0247, "64k": -0.0247, "128k": -0.0524 }, "avg_lc": -0.0324 },
+       "helmet": null, "recipe_class": "dca_ssm_flat", "params_b": 14, "native_context_k": 32, "source": "ruler"
+     },
+     "jamba-1-5-large": {
+       "display_name": "jamba-1.5-large",
+       "ruler_per_ctx": { "4k": 96.7, "8k": 96.6, "16k": 96.4, "32k": 96.0, "64k": 95.4, "128k": 95.1 },
+       "ruler_long_score": { "base": 96.65, "per_length_lc": { "16k": -0.0026, "32k": -0.0067, "64k": -0.0129, "128k": -0.016 }, "avg_lc": -0.0095 },
+       "helmet": null, "recipe_class": "dca_ssm_flat", "params_b": 398, "native_context_k": 256, "source": "ruler"
+     },
+     "jamba-1-5-mini": {
+       "display_name": "jamba-1.5-mini",
+       "ruler_per_ctx": { "4k": 95.6, "8k": 95.6, "16k": 94.8, "32k": 94.6, "64k": 92.8, "128k": 90.0 },
+       "ruler_long_score": { "base": 95.6, "per_length_lc": { "16k": -0.0084, "32k": -0.0105, "64k": -0.0293, "128k": -0.0586 }, "avg_lc": -0.0267 },
+       "helmet": { "overall": 46.9, "categories": { "Recall": 90.0, "RAG": 57.3, "Cite": 3.1, "Re-rank": 14.6, "ICL": 91.0, "LongQA": 54.2, "Summ": 18.1 }, "context_window_k": 256.0, "params_b": "12.0", "type": "♯" },
+       "recipe_class": "dca_ssm_flat", "params_b": 52, "native_context_k": 256, "source": "ruler+helmet"
+     },
+     "phi-3-mini-3-8-b": {
+       "display_name": "phi3-mini-3.8b",
+       "ruler_per_ctx": { "4k": 92.2, "8k": 91.5, "16k": 90.7, "32k": 87.5, "64k": 80.6, "128k": 66.7 },
+       "ruler_long_score": { "base": 91.85, "per_length_lc": { "16k": -0.0125, "32k": -0.0474, "64k": -0.1225, "128k": -0.2738 }, "avg_lc": -0.114 },
+       "helmet": null, "recipe_class": "longrope", "params_b": 3.8, "native_context_k": 4, "source": "ruler"
+     },
+     "phi-3-medium-14-b": {
+       "display_name": "phi3-medium-14b",
+       "ruler_per_ctx": { "4k": 93.3, "8k": 93.2, "16k": 91.1, "32k": 86.8, "64k": 78.6, "128k": 46.1 },
+       "ruler_long_score": { "base": 93.25, "per_length_lc": { "16k": -0.0231, "32k": -0.0692, "64k": -0.1571, "128k": -0.5056 }, "avg_lc": -0.1888 },
+       "helmet": null, "recipe_class": "longrope", "params_b": 14, "native_context_k": 4, "source": "ruler"
+     },
+     "gpt-4": {
+       "display_name": "GPT-4", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 39.2, "categories": { "Recall": 73.5, "RAG": 64.7, "Cite": 1.2, "Re-rank": 9.6, "ICL": 46.0, "LongQA": 46.7, "Summ": 33.0 }, "context_window_k": 128.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gpt-4-o-mini": {
+       "display_name": "GPT-4o-mini", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 55.7, "categories": { "Recall": 89.6, "RAG": 68.0, "Cite": 24.3, "Re-rank": 31.2, "ICL": 81.2, "LongQA": 54.8, "Summ": 40.8 }, "context_window_k": 128.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gpt-4-o-05": {
+       "display_name": "GPT-4o-05", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 56.4, "categories": { "Recall": 82.4, "RAG": 70.4, "Cite": 40.4, "Re-rank": 47.7, "ICL": 48.4, "LongQA": 62.0, "Summ": 43.6 }, "context_window_k": 128.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gpt-4-o-08": {
+       "display_name": "GPT-4o-08", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 64.8, "categories": { "Recall": 99.9, "RAG": 70.2, "Cite": 44.3, "Re-rank": 50.0, "ICL": 86.3, "LongQA": 59.3, "Summ": 43.2 }, "context_window_k": 128.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "claude-3-5-sonnet": {
+       "display_name": "Claude-3.5-Sonnet", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 38.4, "categories": { "Recall": 94.7, "RAG": 38.1, "Cite": 18.7, "Re-rank": 7.2, "ICL": 61.0, "LongQA": 12.6, "Summ": 36.6 }, "context_window_k": 200.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gemini-1-5-flash": {
+       "display_name": "Gemini-1.5-Flash", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 52.1, "categories": { "Recall": 91.2, "RAG": 66.5, "Cite": 31.7, "Re-rank": 50.8, "ICL": 28.1, "LongQA": 57.3, "Summ": 39.4 }, "context_window_k": 1024.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gemini-1-5-pro": {
+       "display_name": "Gemini-1.5-Pro", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 64.4, "categories": { "Recall": 91.0, "RAG": 71.1, "Cite": 43.6, "Re-rank": 59.7, "ICL": 79.4, "LongQA": 59.6, "Summ": 46.4 }, "context_window_k": 2048.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "llama-2-7-b-32-k": {
+       "display_name": "Llama-2-7B-32k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.3, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 9.4, "Summ": 0.0 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-2-7-b-32-k-instruct": {
+       "display_name": "Llama-2-7B-32k-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.3, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 9.1, "Summ": 0.0 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-7-b-80-k": {
+       "display_name": "Llama-7B-80k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 22.4, "categories": { "Recall": 17.0, "RAG": 40.8, "Cite": 1.0, "Re-rank": 0.2, "ICL": 76.8, "LongQA": 19.8, "Summ": 0.9 }, "context_window_k": 80.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "yarn-llama-2-7-b-64-k": {
+       "display_name": "Yarn-Llama-2-7B-64k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 0.6, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.2, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 3.8, "Summ": 0.2 }, "context_window_k": 64.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "yarn-llama-2-7-b-128-k": {
+       "display_name": "Yarn-Llama-2-7B-128k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 17.9, "categories": { "Recall": 4.3, "RAG": 26.7, "Cite": 0.7, "Re-rank": 0.4, "ICL": 75.4, "LongQA": 16.1, "Summ": 1.9 }, "context_window_k": 128.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-8-b": {
+       "display_name": "Llama-3-8B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.5, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 10.5, "Summ": 0.0 }, "context_window_k": 8.0, "params_b": "8.0", "type": "♭" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-8-b-instruct": {
+       "display_name": "Llama-3-8B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.2, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 8.7, "Summ": 0.0 }, "context_window_k": 8.0, "params_b": "8.0", "type": "♯" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-8-b-θ": {
+       "display_name": "Llama-3-8B-θ", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 0.9, "categories": { "Recall": 0.0, "RAG": 2.4, "Cite": 0.1, "Re-rank": 0.0, "ICL": 2.8, "LongQA": 0.8, "Summ": 0.0 }, "context_window_k": 8.0, "params_b": "8.0", "type": "♭" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-8-b-instruct-θ": {
+       "display_name": "Llama-3-8B-Inst-θ", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.6, "categories": { "Recall": 0.0, "RAG": 1.3, "Cite": 0.8, "Re-rank": 0.1, "ICL": 4.3, "LongQA": 2.5, "Summ": 1.9 }, "context_window_k": 8.0, "params_b": "8.0", "type": "♯" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-70-b-θ": {
+       "display_name": "Llama-3-70B-θ", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.2, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 2.9, "Re-rank": 0.0, "ICL": 0.2, "LongQA": 5.4, "Summ": 0.1 }, "context_window_k": 8.0, "params_b": "70.0", "type": "♭" },
+       "recipe_class": null, "params_b": "70.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-70-b-instruct-θ": {
+       "display_name": "Llama-3-70B-Inst-θ", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 2.1, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 12.0, "Summ": 2.7 }, "context_window_k": 8.0, "params_b": "70.0", "type": "♯" },
+       "recipe_class": null, "params_b": "70.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-1-8-b": {
+       "display_name": "Llama-3.1-8B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 37.1, "categories": { "Recall": 76.6, "RAG": 54.5, "Cite": 1.6, "Re-rank": 9.3, "ICL": 79.5, "LongQA": 36.2, "Summ": 1.7 }, "context_window_k": 128.0, "params_b": "8.0", "type": "♭" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-1-70-b": {
+       "display_name": "Llama-3.1-70B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 41.7, "categories": { "Recall": 76.3, "RAG": 56.6, "Cite": 3.9, "Re-rank": 22.6, "ICL": 80.4, "LongQA": 46.0, "Summ": 6.3 }, "context_window_k": 128.0, "params_b": "70.0", "type": "♭" },
+       "recipe_class": null, "params_b": "70.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-3-70-b-instruct": {
+       "display_name": "Llama-3.3-70B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 48.2, "categories": { "Recall": 81.8, "RAG": 55.1, "Cite": 10.9, "Re-rank": 26.2, "ICL": 77.8, "LongQA": 52.0, "Summ": 33.3 }, "context_window_k": 128.0, "params_b": "70.0", "type": "♯" },
+       "recipe_class": null, "params_b": "70.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-2-1-b": {
+       "display_name": "Llama-3.2-1B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 22.6, "categories": { "Recall": 25.2, "RAG": 32.1, "Cite": 1.5, "Re-rank": 5.6, "ICL": 76.9, "LongQA": 15.9, "Summ": 0.9 }, "context_window_k": 128.0, "params_b": "1.0", "type": "♭" },
+       "recipe_class": null, "params_b": "1.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-2-1-b-instruct": {
+       "display_name": "Llama-3.2-1B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 25.0, "categories": { "Recall": 32.4, "RAG": 33.8, "Cite": 2.4, "Re-rank": 1.7, "ICL": 77.3, "LongQA": 18.1, "Summ": 9.1 }, "context_window_k": 128.0, "params_b": "1.0", "type": "♯" },
+       "recipe_class": null, "params_b": "1.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-2-3-b": {
+       "display_name": "Llama-3.2-3B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 33.5, "categories": { "Recall": 58.8, "RAG": 51.4, "Cite": 2.1, "Re-rank": 5.4, "ICL": 89.5, "LongQA": 27.6, "Summ": 0.0 }, "context_window_k": 128.0, "params_b": "3.0", "type": "♭" },
+       "recipe_class": null, "params_b": "3.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-2-3-b-instruct": {
+       "display_name": "Llama-3.2-3B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 37.0, "categories": { "Recall": 53.1, "RAG": 56.4, "Cite": 7.4, "Re-rank": 0.7, "ICL": 86.4, "LongQA": 28.9, "Summ": 25.9 }, "context_window_k": 128.0, "params_b": "3.0", "type": "♯" },
+       "recipe_class": null, "params_b": "3.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-4-17-b": {
+       "display_name": "Llama-4-17B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 31.4, "categories": { "Recall": 42.8, "RAG": 50.4, "Cite": 3.7, "Re-rank": 6.1, "ICL": 88.0, "LongQA": 28.1, "Summ": 0.6 }, "context_window_k": 128.0, "params_b": "17.0", "type": "♭" },
+       "recipe_class": null, "params_b": "17.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-4-17-b-instruct": {
+       "display_name": "Llama-4-17B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 45.0, "categories": { "Recall": 76.6, "RAG": 51.2, "Cite": 4.7, "Re-rank": 17.9, "ICL": 81.0, "LongQA": 50.5, "Summ": 33.2 }, "context_window_k": 128.0, "params_b": "17.0", "type": "♯" },
+       "recipe_class": null, "params_b": "17.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-7-b-instruct-v-0-1": {
+       "display_name": "Mistral-7B-Inst-v0.1", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 15.8, "categories": { "Recall": 3.8, "RAG": 36.6, "Cite": 1.3, "Re-rank": 0.0, "ICL": 44.8, "LongQA": 15.2, "Summ": 9.2 }, "context_window_k": 8.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-7-b-instruct-v-0-2": {
+       "display_name": "Mistral-7B-Inst-v0.2", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 8.7, "categories": { "Recall": 1.5, "RAG": 11.4, "Cite": 0.6, "Re-rank": 1.2, "ICL": 38.0, "LongQA": 4.1, "Summ": 4.3 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-7-b-v-0-3": {
+       "display_name": "Mistral-7B-v0.3", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 10.8, "categories": { "Recall": 3.6, "RAG": 4.3, "Cite": 0.4, "Re-rank": 1.2, "ICL": 54.8, "LongQA": 10.7, "Summ": 0.4 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-7-b-instruct-v-0-3": {
+       "display_name": "Mistral-7B-Inst-v0.3", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 17.3, "categories": { "Recall": 8.0, "RAG": 21.3, "Cite": 0.9, "Re-rank": 0.5, "ICL": 73.6, "LongQA": 12.0, "Summ": 5.1 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "ministral-8-b-instruct": {
+       "display_name": "Ministral-8B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 24.2, "categories": { "Recall": 12.8, "RAG": 35.3, "Cite": 0.4, "Re-rank": 0.0, "ICL": 84.4, "LongQA": 25.0, "Summ": 11.5 }, "context_window_k": 128.0, "params_b": "8.0", "type": "♯" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-nemo": {
+       "display_name": "Mistral-Nemo", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 25.0, "categories": { "Recall": 17.7, "RAG": 45.9, "Cite": 1.0, "Re-rank": 0.0, "ICL": 82.4, "LongQA": 26.2, "Summ": 1.7 }, "context_window_k": 128.0, "params_b": "12.0", "type": "♭" },
+       "recipe_class": null, "params_b": "12.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-nemo-instruct": {
+       "display_name": "Mistral-Nemo-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 25.7, "categories": { "Recall": 14.6, "RAG": 40.0, "Cite": 0.5, "Re-rank": 0.0, "ICL": 84.0, "LongQA": 22.5, "Summ": 18.5 }, "context_window_k": 128.0, "params_b": "12.0", "type": "♯" },
+       "recipe_class": null, "params_b": "12.0", "native_context_k": null, "source": "helmet"
+     },
+     "megabeam-mistral": {
+       "display_name": "MegaBeam-Mistral", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 45.4, "categories": { "Recall": 89.6, "RAG": 57.0, "Cite": 4.0, "Re-rank": 14.7, "ICL": 86.2, "LongQA": 37.3, "Summ": 28.9 }, "context_window_k": 512.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "yi-6-b-200-k": {
+       "display_name": "Yi-6B-200k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 27.8, "categories": { "Recall": 37.4, "RAG": 36.9, "Cite": 1.1, "Re-rank": 3.4, "ICL": 82.3, "LongQA": 30.0, "Summ": 3.5 }, "context_window_k": 200.0, "params_b": "6.0", "type": "♭" },
+       "recipe_class": null, "params_b": "6.0", "native_context_k": null, "source": "helmet"
+     },
+     "yi-9-b-200-k": {
+       "display_name": "Yi-9B-200k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 33.9, "categories": { "Recall": 56.2, "RAG": 50.5, "Cite": 3.3, "Re-rank": 8.3, "ICL": 73.4, "LongQA": 39.3, "Summ": 6.2 }, "context_window_k": 200.0, "params_b": "9.0", "type": "♭" },
+       "recipe_class": null, "params_b": "9.0", "native_context_k": null, "source": "helmet"
+     },
+     "yi-1-5-9-b-32-k": {
+       "display_name": "Yi-1.5-9B-32k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 2.1, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 4.2, "LongQA": 7.1, "Summ": 3.2 }, "context_window_k": 32.0, "params_b": "9.0", "type": "♭" },
+       "recipe_class": null, "params_b": "9.0", "native_context_k": null, "source": "helmet"
+     },
+     "phi-3-mini-128-k-instruct": {
+       "display_name": "Phi-3-mini-128k-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 33.6, "categories": { "Recall": 50.1, "RAG": 46.7, "Cite": 0.6, "Re-rank": 5.8, "ICL": 78.7, "LongQA": 29.9, "Summ": 23.7 }, "context_window_k": 128.0, "params_b": "4.0", "type": "♯" },
+       "recipe_class": null, "params_b": "4.0", "native_context_k": null, "source": "helmet"
+     },
+     "phi-3-small-128-k-instruct": {
+       "display_name": "Phi-3-small-128k-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 24.9, "categories": { "Recall": 22.3, "RAG": 33.8, "Cite": 3.0, "Re-rank": 1.9, "ICL": 79.6, "LongQA": 27.5, "Summ": 6.6 }, "context_window_k": 128.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "phi-3-med-128-k-instruct": {
+       "display_name": "Phi-3-med-128k-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 29.1, "categories": { "Recall": 24.5, "RAG": 44.8, "Cite": 3.3, "Re-rank": 6.6, "ICL": 73.2, "LongQA": 26.9, "Summ": 24.8 }, "context_window_k": 128.0, "params_b": "14.0", "type": "♯" },
+       "recipe_class": null, "params_b": "14.0", "native_context_k": null, "source": "helmet"
+     },
+     "phi-3-5-mini-instruct": {
+       "display_name": "Phi-3.5-mini-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 33.6, "categories": { "Recall": 48.8, "RAG": 43.1, "Cite": 1.6, "Re-rank": 7.8, "ICL": 79.5, "LongQA": 28.5, "Summ": 26.3 }, "context_window_k": 128.0, "params_b": "4.0", "type": "♯" },
+       "recipe_class": null, "params_b": "4.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-7-b": {
+       "display_name": "Qwen2-7B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 30.0, "categories": { "Recall": 38.2, "RAG": 45.0, "Cite": 2.3, "Re-rank": 3.6, "ICL": 77.5, "LongQA": 36.8, "Summ": 6.8 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-7-b-instruct": {
+       "display_name": "Qwen2-7B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 31.2, "categories": { "Recall": 36.8, "RAG": 47.4, "Cite": 2.4, "Re-rank": 5.9, "ICL": 66.2, "LongQA": 31.9, "Summ": 27.6 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-57-b": {
+       "display_name": "Qwen2-57B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 11.0, "categories": { "Recall": 1.8, "RAG": 10.6, "Cite": 1.1, "Re-rank": 0.0, "ICL": 43.3, "LongQA": 15.7, "Summ": 4.9 }, "context_window_k": 32.0, "params_b": "14.0", "type": "♭" },
+       "recipe_class": null, "params_b": "14.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-57-b-instruct": {
+       "display_name": "Qwen2-57B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 13.4, "categories": { "Recall": 5.9, "RAG": 12.7, "Cite": 1.1, "Re-rank": 0.0, "ICL": 40.6, "LongQA": 22.9, "Summ": 10.5 }, "context_window_k": 32.0, "params_b": "14.0", "type": "♯" },
+       "recipe_class": null, "params_b": "14.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-1-5-b": {
+       "display_name": "Qwen2.5-1.5B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 12.9, "categories": { "Recall": 3.5, "RAG": 22.3, "Cite": 0.6, "Re-rank": 0.0, "ICL": 43.7, "LongQA": 18.2, "Summ": 2.1 }, "context_window_k": 32.0, "params_b": "1.5", "type": "♭" },
+       "recipe_class": null, "params_b": "1.5", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-1-5-b-instruct": {
+       "display_name": "Qwen2.5-1.5B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 13.2, "categories": { "Recall": 7.8, "RAG": 25.0, "Cite": 1.7, "Re-rank": 0.3, "ICL": 38.4, "LongQA": 7.8, "Summ": 11.3 }, "context_window_k": 32.0, "params_b": "1.5", "type": "♯" },
+       "recipe_class": null, "params_b": "1.5", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-3-b": {
+       "display_name": "Qwen2.5-3B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 12.3, "categories": { "Recall": 7.5, "RAG": 21.1, "Cite": 1.0, "Re-rank": 2.2, "ICL": 32.0, "LongQA": 20.5, "Summ": 1.8 }, "context_window_k": 32.0, "params_b": "3.0", "type": "♭" },
+       "recipe_class": null, "params_b": "3.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-3-b-instruct": {
+       "display_name": "Qwen2.5-3B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 18.1, "categories": { "Recall": 10.4, "RAG": 22.4, "Cite": 1.9, "Re-rank": 3.7, "ICL": 49.4, "LongQA": 21.3, "Summ": 17.5 }, "context_window_k": 32.0, "params_b": "3.0", "type": "♯" },
+       "recipe_class": null, "params_b": "3.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-7-b": {
+       "display_name": "Qwen2.5-7B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 20.9, "categories": { "Recall": 13.9, "RAG": 27.0, "Cite": 1.3, "Re-rank": 0.1,
2221
+ "ICL": 71.5,
2222
+ "LongQA": 25.9,
2223
+ "Summ": 6.8
2224
+ },
2225
+ "context_window_k": 128.0,
2226
+ "params_b": "7.0",
2227
+ "type": "♭"
2228
+ },
2229
+ "recipe_class": null,
2230
+ "params_b": "7.0",
2231
+ "native_context_k": null,
2232
+ "source": "helmet"
2233
+ },
2234
+ "qwen-2-5-72-b-instruct": {
2235
+ "display_name": "Qwen2.5-72B-Inst",
2236
+ "ruler_per_ctx": null,
2237
+ "ruler_long_score": null,
2238
+ "helmet": {
2239
+ "overall": 38.2,
2240
+ "categories": {
2241
+ "Recall": 38.4,
2242
+ "RAG": 43.0,
2243
+ "Cite": 8.0,
2244
+ "Re-rank": 24.5,
2245
+ "ICL": 83.2,
2246
+ "LongQA": 38.9,
2247
+ "Summ": 31.4
2248
+ },
2249
+ "context_window_k": 128.0,
2250
+ "params_b": "72.0",
2251
+ "type": "♯"
2252
+ },
2253
+ "recipe_class": null,
2254
+ "params_b": "72.0",
2255
+ "native_context_k": null,
2256
+ "source": "helmet"
2257
+ },
2258
+ "prolong": {
2259
+ "display_name": "ProLong",
2260
+ "ruler_per_ctx": null,
2261
+ "ruler_long_score": null,
2262
+ "helmet": {
2263
+ "overall": 49.4,
2264
+ "categories": {
2265
+ "Recall": 98.8,
2266
+ "RAG": 63.2,
2267
+ "Cite": 1.4,
2268
+ "Re-rank": 22.5,
2269
+ "ICL": 86.5,
2270
+ "LongQA": 43.9,
2271
+ "Summ": 29.2
2272
+ },
2273
+ "context_window_k": 512.0,
2274
+ "params_b": "8.0",
2275
+ "type": "♯"
2276
+ },
2277
+ "recipe_class": null,
2278
+ "params_b": "8.0",
2279
+ "native_context_k": null,
2280
+ "source": "helmet"
2281
+ },
2282
+ "gemma-3-12-b": {
2283
+ "display_name": "Gemma-3-12B",
2284
+ "ruler_per_ctx": null,
2285
+ "ruler_long_score": null,
2286
+ "helmet": {
2287
+ "overall": 40.6,
2288
+ "categories": {
2289
+ "Recall": 76.6,
2290
+ "RAG": 59.3,
2291
+ "Cite": 2.0,
2292
+ "Re-rank": 16.0,
2293
+ "ICL": 88.7,
2294
+ "LongQA": 41.4,
2295
+ "Summ": 0.2
2296
+ },
2297
+ "context_window_k": 128.0,
2298
+ "params_b": "12.0",
2299
+ "type": "♭"
2300
+ },
2301
+ "recipe_class": null,
2302
+ "params_b": "12.0",
2303
+ "native_context_k": null,
2304
+ "source": "helmet"
2305
+ },
2306
+ "gemma-3-12-b-instruct": {
2307
+ "display_name": "Gemma-3-12B-Inst",
2308
+ "ruler_per_ctx": null,
2309
+ "ruler_long_score": null,
2310
+ "helmet": {
2311
+ "overall": 36.0,
2312
+ "categories": {
2313
+ "Recall": 25.9,
2314
+ "RAG": 52.1,
2315
+ "Cite": 1.3,
2316
+ "Re-rank": 21.3,
2317
+ "ICL": 75.4,
2318
+ "LongQA": 43.0,
2319
+ "Summ": 33.1
2320
+ },
2321
+ "context_window_k": 128.0,
2322
+ "params_b": "12.0",
2323
+ "type": "♯"
2324
+ },
2325
+ "recipe_class": null,
2326
+ "params_b": "12.0",
2327
+ "native_context_k": null,
2328
+ "source": "helmet"
2329
+ },
2330
+ "gemma-3-27-b": {
2331
+ "display_name": "Gemma-3-27B",
2332
+ "ruler_per_ctx": null,
2333
+ "ruler_long_score": null,
2334
+ "helmet": {
2335
+ "overall": 44.5,
2336
+ "categories": {
2337
+ "Recall": 82.6,
2338
+ "RAG": 64.1,
2339
+ "Cite": 2.8,
2340
+ "Re-rank": 30.4,
2341
+ "ICL": 86.3,
2342
+ "LongQA": 44.6,
2343
+ "Summ": 0.7
2344
+ },
2345
+ "context_window_k": 128.0,
2346
+ "params_b": "27.0",
2347
+ "type": "♭"
2348
+ },
2349
+ "recipe_class": null,
2350
+ "params_b": "27.0",
2351
+ "native_context_k": null,
2352
+ "source": "helmet"
2353
+ },
2354
+ "gemma-3-27-b-instruct": {
2355
+ "display_name": "Gemma-3-27B-Inst",
2356
+ "ruler_per_ctx": null,
2357
+ "ruler_long_score": null,
2358
+ "helmet": {
2359
+ "overall": 41.9,
2360
+ "categories": {
2361
+ "Recall": 45.1,
2362
+ "RAG": 56.6,
2363
+ "Cite": 1.6,
2364
+ "Re-rank": 30.0,
2365
+ "ICL": 77.4,
2366
+ "LongQA": 46.8,
2367
+ "Summ": 35.9
2368
+ },
2369
+ "context_window_k": 128.0,
2370
+ "params_b": "27.0",
2371
+ "type": "♯"
2372
+ },
2373
+ "recipe_class": null,
2374
+ "params_b": "27.0",
2375
+ "native_context_k": null,
2376
+ "source": "helmet"
2377
+ },
2378
+ "jamba-v-0-1": {
2379
+ "display_name": "Jamba-v0.1",
2380
+ "ruler_per_ctx": null,
2381
+ "ruler_long_score": null,
2382
+ "helmet": {
2383
+ "overall": 41.5,
2384
+ "categories": {
2385
+ "Recall": 80.3,
2386
+ "RAG": 60.3,
2387
+ "Cite": 6.0,
2388
+ "Re-rank": 9.3,
2389
+ "ICL": 88.7,
2390
+ "LongQA": 41.7,
2391
+ "Summ": 4.0
2392
+ },
2393
+ "context_window_k": 256.0,
2394
+ "params_b": "12.0",
2395
+ "type": "♭"
2396
+ },
2397
+ "recipe_class": null,
2398
+ "params_b": "12.0",
2399
+ "native_context_k": null,
2400
+ "source": "helmet"
2401
+ }
2402
+ },
2403
+ "aliases": {
2404
+ "qwen-2-5-7-b-instruct": [
2405
+ "qwen2.5-7b-instruct",
2406
+ "Qwen2.5-7B-Inst",
2407
+ "Qwen2.5-7B-Inst",
2408
+ "Qwen2.5-7B-Inst",
2409
+ "Qwen2.5-7B-Inst",
2410
+ "Qwen2.5-7B-Inst"
2411
+ ],
2412
+ "qwen-2-5-14-b-instruct": [
2413
+ "qwen2.5-14b-instruct"
2414
+ ],
2415
+ "qwen-2-5-32-b-instruct": [
2416
+ "qwen2.5-32b-instruct"
2417
+ ],
2418
+ "mistral-7-b-v-0-2": [
2419
+ "mistral-7b-v0.2"
2420
+ ],
2421
+ "mixtral-8-x-7-b": [
2422
+ "mixtral-8x7b"
2423
+ ],
2424
+ "mixtral-8-x-22-b-instruct": [
2425
+ "mixtral-8x22b-instruct"
2426
+ ],
2427
+ "qwen-2-72-b": [
2428
+ "qwen2-72b"
2429
+ ],
2430
+ "dbrx": [
2431
+ "dbrx"
2432
+ ],
2433
+ "qwen-1-5-72-b": [
2434
+ "qwen1.5-72b"
2435
+ ],
2436
+ "mistral-large-2407": [
2437
+ "mistral-large-2407"
2438
+ ],
2439
+ "yi-34-b-200-k": [
2440
+ "yi-34b-200k",
2441
+ "Yi-34B-200k",
2442
+ "Yi-34B-200k",
2443
+ "Yi-34B-200k",
2444
+ "Yi-34B-200k",
2445
+ "Yi-34B-200k"
2446
+ ],
2447
+ "command-r-35-b": [
2448
+ "command-r-35b"
2449
+ ],
2450
+ "command-r-0824-32-b": [
2451
+ "command-r-0824-32b"
2452
+ ],
2453
+ "command-r-plus-104-b": [
2454
+ "command-r-plus-104b"
2455
+ ],
2456
+ "command-r-plus-0824-104-b": [
2457
+ "command-r-plus-0824-104b"
2458
+ ],
2459
+ "mistral-large-2411-123-b": [
2460
+ "mistral-large-2411-123b"
2461
+ ],
2462
+ "chatglm-3-6-b-128-k": [
2463
+ "chatglm3-6b-128k"
2464
+ ],
2465
+ "lwm-7-b": [
2466
+ "lwm-7b"
2467
+ ],
2468
+ "glm-4-9-b-1-m": [
2469
+ "glm4-9b-1m"
2470
+ ],
2471
+ "prolong-8-b-512-k": [
2472
+ "prolong-8b-512k"
2473
+ ],
2474
+ "megabeam-mistral-7-b-512-k": [
2475
+ "megabeam-mistral-7b-512k"
2476
+ ],
2477
+ "internlm-2-5-7-b-1-m": [
2478
+ "internlm2.5-7b-1m"
2479
+ ],
2480
+ "gradientai-llama-3-70-b-1-m": [
2481
+ "gradientai-llama3-70b-1m"
2482
+ ],
2483
+ "gradientai-llama-3-8-b-1-m": [
2484
+ "gradientai-llama3-8b-1m"
2485
+ ],
2486
+ "longchat-7-b-32-k": [
2487
+ "longchat-7b-32k"
2488
+ ],
2489
+ "longalpaca-13-b-32-k": [
2490
+ "longalpaca-13b-32k"
2491
+ ],
2492
+ "llama-3-1-70-b-instruct": [
2493
+ "llama3.1-70b-instruct",
2494
+ "Llama-3.1-70B-Inst",
2495
+ "Llama-3.1-70B-Inst",
2496
+ "Llama-3.1-70B-Inst",
2497
+ "Llama-3.1-70B-Inst",
2498
+ "Llama-3.1-70B-Inst"
2499
+ ],
2500
+ "llama-3-1-8-b-instruct": [
2501
+ "llama3.1-8b-instruct",
2502
+ "Llama-3.1-8B-Inst",
2503
+ "Llama-3.1-8B-Inst",
2504
+ "Llama-3.1-8B-Inst",
2505
+ "Llama-3.1-8B-Inst",
2506
+ "Llama-3.1-8B-Inst"
2507
+ ],
2508
+ "mistral-nemo-12-b": [
2509
+ "mistral-nemo-12b"
2510
+ ],
2511
+ "qwen-2-5-7-b-instruct-1-m": [
2512
+ "qwen2.5-7b-instruct-1m"
2513
+ ],
2514
+ "qwen-2-5-14-b-instruct-1-m": [
2515
+ "qwen2.5-14b-instruct-1m"
2516
+ ],
2517
+ "jamba-1-5-large": [
2518
+ "jamba-1.5-large"
2519
+ ],
2520
+ "jamba-1-5-mini": [
2521
+ "jamba-1.5-mini",
2522
+ "Jamba-1.5-Mini",
2523
+ "Jamba-1.5-Mini",
2524
+ "Jamba-1.5-Mini",
2525
+ "Jamba-1.5-Mini",
2526
+ "Jamba-1.5-Mini"
2527
+ ],
2528
+ "phi-3-mini-3-8-b": [
2529
+ "phi3-mini-3.8b"
2530
+ ],
2531
+ "phi-3-medium-14-b": [
2532
+ "phi3-medium-14b"
2533
+ ],
2534
+ "gpt-4": [
2535
+ "GPT-4",
2536
+ "GPT-4",
2537
+ "GPT-4",
2538
+ "GPT-4",
2539
+ "GPT-4"
2540
+ ],
2541
+ "gpt-4-o-mini": [
2542
+ "GPT-4o-mini",
2543
+ "GPT-4o-mini",
2544
+ "GPT-4o-mini",
2545
+ "GPT-4o-mini",
2546
+ "GPT-4o-mini"
2547
+ ],
2548
+ "gpt-4-o-05": [
2549
+ "GPT-4o-05",
2550
+ "GPT-4o-05",
2551
+ "GPT-4o-05",
2552
+ "GPT-4o-05",
2553
+ "GPT-4o-05"
2554
+ ],
2555
+ "gpt-4-o-08": [
2556
+ "GPT-4o-08",
2557
+ "GPT-4o-08",
2558
+ "GPT-4o-08",
2559
+ "GPT-4o-08",
2560
+ "GPT-4o-08"
2561
+ ],
2562
+ "claude-3-5-sonnet": [
2563
+ "Claude-3.5-Sonnet",
2564
+ "Claude-3.5-Sonnet",
2565
+ "Claude-3.5-Sonnet",
2566
+ "Claude-3.5-Sonnet",
2567
+ "Claude-3.5-Sonnet"
2568
+ ],
2569
+ "gemini-1-5-flash": [
2570
+ "Gemini-1.5-Flash",
2571
+ "Gemini-1.5-Flash",
2572
+ "Gemini-1.5-Flash",
2573
+ "Gemini-1.5-Flash",
2574
+ "Gemini-1.5-Flash"
2575
+ ],
2576
+ "gemini-1-5-pro": [
2577
+ "Gemini-1.5-Pro",
2578
+ "Gemini-1.5-Pro",
2579
+ "Gemini-1.5-Pro",
2580
+ "Gemini-1.5-Pro",
2581
+ "Gemini-1.5-Pro"
2582
+ ],
2583
+ "llama-2-7-b-32-k": [
2584
+ "Llama-2-7B-32k",
2585
+ "Llama-2-7B-32k",
2586
+ "Llama-2-7B-32k",
2587
+ "Llama-2-7B-32k",
2588
+ "Llama-2-7B-32k"
2589
+ ],
2590
+ "llama-2-7-b-32-k-instruct": [
2591
+ "Llama-2-7B-32k-Inst",
2592
+ "Llama-2-7B-32k-Inst",
2593
+ "Llama-2-7B-32k-Inst",
2594
+ "Llama-2-7B-32k-Inst",
2595
+ "Llama-2-7B-32k-Inst"
2596
+ ],
2597
+ "llama-7-b-80-k": [
2598
+ "Llama-7B-80k",
2599
+ "Llama-7B-80k",
2600
+ "Llama-7B-80k",
2601
+ "Llama-7B-80k",
2602
+ "Llama-7B-80k"
2603
+ ],
2604
+ "yarn-llama-2-7-b-64-k": [
2605
+ "Yarn-Llama-2-7B-64k",
2606
+ "Yarn-Llama-2-7B-64k",
2607
+ "Yarn-Llama-2-7B-64k",
2608
+ "Yarn-Llama-2-7B-64k",
2609
+ "Yarn-Llama-2-7B-64k"
2610
+ ],
2611
+ "yarn-llama-2-7-b-128-k": [
2612
+ "Yarn-Llama-2-7B-128k",
2613
+ "Yarn-Llama-2-7B-128k",
2614
+ "Yarn-Llama-2-7B-128k",
2615
+ "Yarn-Llama-2-7B-128k",
2616
+ "Yarn-Llama-2-7B-128k"
2617
+ ],
2618
+ "llama-3-8-b": [
2619
+ "Llama-3-8B",
2620
+ "Llama-3-8B",
2621
+ "Llama-3-8B",
2622
+ "Llama-3-8B",
2623
+ "Llama-3-8B"
2624
+ ],
2625
+ "llama-3-8-b-instruct": [
2626
+ "Llama-3-8B-Inst",
2627
+ "Llama-3-8B-Inst",
2628
+ "Llama-3-8B-Inst",
2629
+ "Llama-3-8B-Inst",
2630
+ "Llama-3-8B-Inst"
2631
+ ],
2632
+ "llama-3-8-b-θ": [
2633
+ "Llama-3-8B-θ",
2634
+ "Llama-3-8B-θ",
2635
+ "Llama-3-8B-θ",
2636
+ "Llama-3-8B-θ",
2637
+ "Llama-3-8B-θ"
2638
+ ],
2639
+ "llama-3-8-b-instruct-θ": [
2640
+ "Llama-3-8B-Inst-θ",
2641
+ "Llama-3-8B-Inst-θ",
2642
+ "Llama-3-8B-Inst-θ",
2643
+ "Llama-3-8B-Inst-θ",
2644
+ "Llama-3-8B-Inst-θ"
2645
+ ],
2646
+ "llama-3-70-b-θ": [
2647
+ "Llama-3-70B-θ",
2648
+ "Llama-3-70B-θ",
2649
+ "Llama-3-70B-θ",
2650
+ "Llama-3-70B-θ",
2651
+ "Llama-3-70B-θ"
2652
+ ],
2653
+ "llama-3-70-b-instruct-θ": [
2654
+ "Llama-3-70B-Inst-θ",
2655
+ "Llama-3-70B-Inst-θ",
2656
+ "Llama-3-70B-Inst-θ",
2657
+ "Llama-3-70B-Inst-θ",
2658
+ "Llama-3-70B-Inst-θ"
2659
+ ],
2660
+ "llama-3-1-8-b": [
2661
+ "Llama-3.1-8B",
2662
+ "Llama-3.1-8B",
2663
+ "Llama-3.1-8B",
2664
+ "Llama-3.1-8B",
2665
+ "Llama-3.1-8B"
2666
+ ],
2667
+ "llama-3-1-70-b": [
2668
+ "Llama-3.1-70B",
2669
+ "Llama-3.1-70B",
2670
+ "Llama-3.1-70B",
2671
+ "Llama-3.1-70B",
2672
+ "Llama-3.1-70B"
2673
+ ],
2674
+ "llama-3-3-70-b-instruct": [
2675
+ "Llama-3.3-70B-Inst",
2676
+ "Llama-3.3-70B-Inst",
2677
+ "Llama-3.3-70B-Inst",
2678
+ "Llama-3.3-70B-Inst",
2679
+ "Llama-3.3-70B-Inst"
2680
+ ],
2681
+ "llama-3-2-1-b": [
2682
+ "Llama-3.2-1B",
2683
+ "Llama-3.2-1B",
2684
+ "Llama-3.2-1B",
2685
+ "Llama-3.2-1B",
2686
+ "Llama-3.2-1B"
2687
+ ],
2688
+ "llama-3-2-1-b-instruct": [
2689
+ "Llama-3.2-1B-Inst",
2690
+ "Llama-3.2-1B-Inst",
2691
+ "Llama-3.2-1B-Inst",
2692
+ "Llama-3.2-1B-Inst",
2693
+ "Llama-3.2-1B-Inst"
2694
+ ],
2695
+ "llama-3-2-3-b": [
2696
+ "Llama-3.2-3B",
2697
+ "Llama-3.2-3B",
2698
+ "Llama-3.2-3B",
2699
+ "Llama-3.2-3B",
2700
+ "Llama-3.2-3B"
2701
+ ],
2702
+ "llama-3-2-3-b-instruct": [
2703
+ "Llama-3.2-3B-Inst",
2704
+ "Llama-3.2-3B-Inst",
2705
+ "Llama-3.2-3B-Inst",
2706
+ "Llama-3.2-3B-Inst",
2707
+ "Llama-3.2-3B-Inst"
2708
+ ],
2709
+ "llama-4-17-b": [
2710
+ "Llama-4-17B",
2711
+ "Llama-4-17B",
2712
+ "Llama-4-17B",
2713
+ "Llama-4-17B",
2714
+ "Llama-4-17B"
2715
+ ],
2716
+ "llama-4-17-b-instruct": [
2717
+ "Llama-4-17B-Inst",
2718
+ "Llama-4-17B-Inst",
2719
+ "Llama-4-17B-Inst",
2720
+ "Llama-4-17B-Inst",
2721
+ "Llama-4-17B-Inst"
2722
+ ],
2723
+ "mistral-7-b-instruct-v-0-1": [
2724
+ "Mistral-7B-Inst-v0.1",
2725
+ "Mistral-7B-Inst-v0.1",
2726
+ "Mistral-7B-Inst-v0.1",
2727
+ "Mistral-7B-Inst-v0.1",
2728
+ "Mistral-7B-Inst-v0.1"
2729
+ ],
2730
+ "mistral-7-b-instruct-v-0-2": [
2731
+ "Mistral-7B-Inst-v0.2",
2732
+ "Mistral-7B-Inst-v0.2",
2733
+ "Mistral-7B-Inst-v0.2",
2734
+ "Mistral-7B-Inst-v0.2",
2735
+ "Mistral-7B-Inst-v0.2"
2736
+ ],
2737
+ "mistral-7-b-v-0-3": [
2738
+ "Mistral-7B-v0.3",
2739
+ "Mistral-7B-v0.3",
2740
+ "Mistral-7B-v0.3",
2741
+ "Mistral-7B-v0.3",
2742
+ "Mistral-7B-v0.3"
2743
+ ],
2744
+ "mistral-7-b-instruct-v-0-3": [
2745
+ "Mistral-7B-Inst-v0.3",
2746
+ "Mistral-7B-Inst-v0.3",
2747
+ "Mistral-7B-Inst-v0.3",
2748
+ "Mistral-7B-Inst-v0.3",
2749
+ "Mistral-7B-Inst-v0.3"
2750
+ ],
2751
+ "ministral-8-b-instruct": [
2752
+ "Ministral-8B-Inst",
2753
+ "Ministral-8B-Inst",
2754
+ "Ministral-8B-Inst",
2755
+ "Ministral-8B-Inst",
2756
+ "Ministral-8B-Inst"
2757
+ ],
2758
+ "mistral-nemo": [
2759
+ "Mistral-Nemo",
2760
+ "Mistral-Nemo",
2761
+ "Mistral-Nemo",
2762
+ "Mistral-Nemo",
2763
+ "Mistral-Nemo"
2764
+ ],
2765
+ "mistral-nemo-instruct": [
2766
+ "Mistral-Nemo-Inst",
2767
+ "Mistral-Nemo-Inst",
2768
+ "Mistral-Nemo-Inst",
2769
+ "Mistral-Nemo-Inst",
2770
+ "Mistral-Nemo-Inst"
2771
+ ],
2772
+ "megabeam-mistral": [
2773
+ "MegaBeam-Mistral",
2774
+ "MegaBeam-Mistral",
2775
+ "MegaBeam-Mistral",
2776
+ "MegaBeam-Mistral",
2777
+ "MegaBeam-Mistral"
2778
+ ],
2779
+ "yi-6-b-200-k": [
2780
+ "Yi-6B-200k",
2781
+ "Yi-6B-200k",
2782
+ "Yi-6B-200k",
2783
+ "Yi-6B-200k",
2784
+ "Yi-6B-200k"
2785
+ ],
2786
+ "yi-9-b-200-k": [
2787
+ "Yi-9B-200k",
2788
+ "Yi-9B-200k",
2789
+ "Yi-9B-200k",
2790
+ "Yi-9B-200k",
2791
+ "Yi-9B-200k"
2792
+ ],
2793
+ "yi-1-5-9-b-32-k": [
2794
+ "Yi-1.5-9B-32k",
2795
+ "Yi-1.5-9B-32k",
2796
+ "Yi-1.5-9B-32k",
2797
+ "Yi-1.5-9B-32k",
2798
+ "Yi-1.5-9B-32k"
2799
+ ],
2800
+ "phi-3-mini-128-k-instruct": [
2801
+ "Phi-3-mini-128k-Inst",
2802
+ "Phi-3-mini-128k-Inst",
2803
+ "Phi-3-mini-128k-Inst",
2804
+ "Phi-3-mini-128k-Inst",
2805
+ "Phi-3-mini-128k-Inst"
2806
+ ],
2807
+ "phi-3-small-128-k-instruct": [
2808
+ "Phi-3-small-128k-Inst",
2809
+ "Phi-3-small-128k-Inst",
2810
+ "Phi-3-small-128k-Inst",
2811
+ "Phi-3-small-128k-Inst",
2812
+ "Phi-3-small-128k-Inst"
2813
+ ],
2814
+ "phi-3-med-128-k-instruct": [
2815
+ "Phi-3-med-128k-Inst",
2816
+ "Phi-3-med-128k-Inst",
2817
+ "Phi-3-med-128k-Inst",
2818
+ "Phi-3-med-128k-Inst",
2819
+ "Phi-3-med-128k-Inst"
2820
+ ],
2821
+ "phi-3-5-mini-instruct": [
2822
+ "Phi-3.5-mini-Inst",
2823
+ "Phi-3.5-mini-Inst",
2824
+ "Phi-3.5-mini-Inst",
2825
+ "Phi-3.5-mini-Inst",
2826
+ "Phi-3.5-mini-Inst"
2827
+ ],
2828
+ "qwen-2-7-b": [
2829
+ "Qwen2-7B",
2830
+ "Qwen2-7B",
2831
+ "Qwen2-7B",
2832
+ "Qwen2-7B",
2833
+ "Qwen2-7B"
2834
+ ],
2835
+ "qwen-2-7-b-instruct": [
2836
+ "Qwen2-7B-Inst",
2837
+ "Qwen2-7B-Inst",
2838
+ "Qwen2-7B-Inst",
2839
+ "Qwen2-7B-Inst",
2840
+ "Qwen2-7B-Inst"
2841
+ ],
2842
+ "qwen-2-57-b": [
2843
+ "Qwen2-57B",
2844
+ "Qwen2-57B",
2845
+ "Qwen2-57B",
2846
+ "Qwen2-57B",
2847
+ "Qwen2-57B"
2848
+ ],
2849
+ "qwen-2-57-b-instruct": [
2850
+ "Qwen2-57B-Inst",
2851
+ "Qwen2-57B-Inst",
2852
+ "Qwen2-57B-Inst",
2853
+ "Qwen2-57B-Inst",
2854
+ "Qwen2-57B-Inst"
2855
+ ],
2856
+ "qwen-2-5-1-5-b": [
2857
+ "Qwen2.5-1.5B",
2858
+ "Qwen2.5-1.5B",
2859
+ "Qwen2.5-1.5B",
2860
+ "Qwen2.5-1.5B",
2861
+ "Qwen2.5-1.5B"
2862
+ ],
2863
+ "qwen-2-5-1-5-b-instruct": [
2864
+ "Qwen2.5-1.5B-Inst",
2865
+ "Qwen2.5-1.5B-Inst",
2866
+ "Qwen2.5-1.5B-Inst",
2867
+ "Qwen2.5-1.5B-Inst",
2868
+ "Qwen2.5-1.5B-Inst"
2869
+ ],
2870
+ "qwen-2-5-3-b": [
2871
+ "Qwen2.5-3B",
2872
+ "Qwen2.5-3B",
2873
+ "Qwen2.5-3B",
2874
+ "Qwen2.5-3B",
2875
+ "Qwen2.5-3B"
2876
+ ],
2877
+ "qwen-2-5-3-b-instruct": [
2878
+ "Qwen2.5-3B-Inst",
2879
+ "Qwen2.5-3B-Inst",
2880
+ "Qwen2.5-3B-Inst",
2881
+ "Qwen2.5-3B-Inst",
2882
+ "Qwen2.5-3B-Inst"
2883
+ ],
2884
+ "qwen-2-5-7-b": [
2885
+ "Qwen2.5-7B",
2886
+ "Qwen2.5-7B",
2887
+ "Qwen2.5-7B",
2888
+ "Qwen2.5-7B",
2889
+ "Qwen2.5-7B"
2890
+ ],
2891
+ "qwen-2-5-72-b-instruct": [
2892
+ "Qwen2.5-72B-Inst",
2893
+ "Qwen2.5-72B-Inst",
2894
+ "Qwen2.5-72B-Inst",
2895
+ "Qwen2.5-72B-Inst",
2896
+ "Qwen2.5-72B-Inst"
2897
+ ],
2898
+ "prolong": [
2899
+ "ProLong",
2900
+ "ProLong",
2901
+ "ProLong",
2902
+ "ProLong",
2903
+ "ProLong"
2904
+ ],
2905
+ "gemma-3-12-b": [
2906
+ "Gemma-3-12B",
2907
+ "Gemma-3-12B",
2908
+ "Gemma-3-12B",
2909
+ "Gemma-3-12B",
2910
+ "Gemma-3-12B"
2911
+ ],
2912
+ "gemma-3-12-b-instruct": [
2913
+ "Gemma-3-12B-Inst",
2914
+ "Gemma-3-12B-Inst",
2915
+ "Gemma-3-12B-Inst",
2916
+ "Gemma-3-12B-Inst",
2917
+ "Gemma-3-12B-Inst"
2918
+ ],
2919
+ "gemma-3-27-b": [
2920
+ "Gemma-3-27B",
2921
+ "Gemma-3-27B",
2922
+ "Gemma-3-27B",
2923
+ "Gemma-3-27B",
2924
+ "Gemma-3-27B"
2925
+ ],
2926
+ "gemma-3-27-b-instruct": [
2927
+ "Gemma-3-27B-Inst",
2928
+ "Gemma-3-27B-Inst",
2929
+ "Gemma-3-27B-Inst",
2930
+ "Gemma-3-27B-Inst",
2931
+ "Gemma-3-27B-Inst"
2932
+ ],
2933
+ "jamba-v-0-1": [
2934
+ "Jamba-v0.1",
2935
+ "Jamba-v0.1",
2936
+ "Jamba-v0.1",
2937
+ "Jamba-v0.1",
2938
+ "Jamba-v0.1"
2939
+ ]
2940
+ }
2941
+ }
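
Reviewer note on consuming the KB: js/longscore.js itself is not shown in this view. Below is a minimal sketch of the lookup logic implied by the schema above and by the verdict ladder wired up in the i18n keys further down — the shape of a populated ruler_per_ctx ({"4k": …, "8k": …, …, "128k": …}) is an assumption here, since every entry in this excerpt is HELMET-only (ruler_per_ctx: null):

    // Sketch only, not the shipped js/longscore.js. Assumes ruler_per_ctx maps
    // a context length ("4k".."128k") to that model's RULER score.
    function longScore(rulerPerCtx) {
      if (!rulerPerCtx) return null;                            // HELMET-only entry: fall back to the 128K aggregate
      const base = (rulerPerCtx["4k"] + rulerPerCtx["8k"]) / 2; // Base = mean(S_short)
      const lc = ["16k", "32k", "64k", "128k"]
        .map((l) => (rulerPerCtx[l] - base) / base);            // LC_l = (S_l - Base) / Base
      return lc.reduce((a, b) => a + b, 0) / lc.length;         // mean over the long lengths
    }

    // Verdict ladder matching the longscore.verdict.* keys: 0 = no degradation.
    function classify(score) {
      const drop = -score;
      if (drop <= 0)   return "no_degradation";
      if (drop < 0.10) return "mild";
      if (drop < 0.20) return "moderate";  // e.g. a -10% LongScore, the Llama-3.1-70B example
      if (drop < 0.30) return "severe";
      return "extreme";
    }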
data/solutions_hub.json CHANGED
@@ -137,6 +137,22 @@
    "best_for": "Predicting NIAH and reasoning pass rates from architecture alone — no inference needed.",
    "not_for": "Final go/no-go decision — re-test on your domain after architectural screening passes."
  },
+ {
+   "id": "long_ctx_degradation",
+   "category": "diagnostic",
+   "pain": "Long-ctx accuracy drops well before the claimed 128K window — but raw scores are dominated by base ability and hide which model truly retains capability past short context.",
+   "tafagent_mode": "🎯 LongScore",
+   "external_tools": [
+     {"name": "100-LongBench paper (LongScore metric)", "url": "https://arxiv.org/abs/2505.19293", "type": "paper"},
+     {"name": "HELMET (Princeton, 7-task long-ctx benchmark)", "url": "https://github.com/princeton-nlp/HELMET", "type": "tool"},
+     {"name": "HELMET full results sheet (n=315)", "url": "https://docs.google.com/spreadsheets/d/1LBt6dP4UwZwU_CjoYhyAd_rjKhQLvo0Gq4cYUnpi_CA", "type": "leaderboard"},
+     {"name": "NVIDIA RULER per-length scores", "url": "https://github.com/NVIDIA/RULER", "type": "tool"},
+     {"name": "LongBench v2 (THUDM)", "url": "https://longbench2.github.io/", "type": "leaderboard"},
+     {"name": "Chroma context rot study", "url": "https://research.trychroma.com/context-rot", "type": "paper"}
+   ],
+   "best_for": "Picking a long-ctx model by relative degradation rather than raw score. Uses peer-reviewed LongScore metric to disentangle base ability from long-ctx capability. Browser-only, no GPU.",
+   "not_for": "Models not yet covered by RULER or HELMET (KB has n=93). Use the external tools above for breadth."
+ },
  {
    "id": "tokenizer_glitch",
    "category": "diagnostic",
index.html CHANGED
@@ -231,6 +231,9 @@
  <p><strong data-i18n="help.v087.tax.title">🌍 Multilingual Tokenizer Tax</strong></p>
  <p data-i18n="help.v087.tax.body">Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.</p>

+ <p><strong data-i18n="help.v088.longscore.title">🎯 LongScore</strong></p>
+ <p data-i18n="help.v088.longscore.body">Every long-ctx LLM claims 128K but degrades long before that. The 100-LongBench paper (ACL 2025, arXiv:2505.19293) noticed that raw long-ctx scores are dominated by base ability. They propose <strong>LongScore</strong>: <code>LC_l = (S_l − Base) / Base</code> with <code>Base = mean(S_short)</code>, then average over long lengths. This tafagent mode embeds LongScore-ready data: RULER aggregate per-context (n=33 models, 4K-128K) + HELMET aggregate at 128K (n=60 models, 7 task categories). Lookup is exact-match by HF model id. <em>Use case</em>: 'I want to use Llama-3.1-70B-Instruct for 100K-token doc summarization — how much accuracy do I actually lose?' → paste id, see -10% LongScore (moderate degradation, mostly the 128K cliff).</p>
+
  <p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p>
  <p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p>

@@ -348,6 +351,7 @@
  <li data-i18n="inv.v084.cache"><strong>🔁 Cache Diff</strong> — predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.</li>
  <li data-i18n="inv.v085.speculative"><strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).</li>
  <li data-i18n="inv.v087.tax"><strong>🌍 Token Tax</strong> — real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).</li>
+ <li data-i18n="inv.v088.longscore"><strong>🎯 LongScore</strong> — peer-reviewed degradation metric (100-LongBench, ACL 2025). Look up any model in RULER + HELMET KBs (n=93). See how much your model actually drops past short context.</li>
  <li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li>
  </ul>
  </details>

@@ -485,6 +489,7 @@
  <button class="mode-btn" data-mode="cache" role="tab" aria-selected="false" data-i18n="modes.cache">🔁 Cache Diff</button>
  <button class="mode-btn" data-mode="speculative" role="tab" aria-selected="false" data-i18n="modes.speculative">🔬 Spec-Decode</button>
  <button class="mode-btn" data-mode="tax" role="tab" aria-selected="false" data-i18n="modes.tax">🌍 Token Tax</button>
+ <button class="mode-btn" data-mode="longscore" role="tab" aria-selected="false" data-i18n="modes.longscore">🎯 LongScore</button>
  <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
  </div>
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">

@@ -1178,6 +1183,34 @@
  </p>
  </section>

+ <!-- LongScore (mode=longscore, v0.8.8 anti-bullshit pack #14) -->
+ <section id="longscore-section" style="display:none;">
+   <h2><span data-i18n="longscore.title">🎯 LongScore</span>
+     <span class="info"><span class="tooltip" data-i18n="longscore.tip">
+     <strong>Why this matters</strong>: every model claims a 128K context window, but accuracy degrades long before that. LongScore (peer-reviewed metric from 100-LongBench, ACL 2025) measures relative degradation past short context. Disentangles base ability from true long-ctx capability — so you compare degradation, not raw scores. Lookup against RULER + HELMET KBs (n=93 models).
+     </span></span>
+   </h2>
+   <p class="recipe-desc" data-i18n="longscore.desc">
+     <strong>How much does your model degrade past short context?</strong> Paste an HF model id → see LongScore (relative degradation) + per-length breakdown + HELMET 7-task scores when available. No GPU. No inference. Pure lookup against published benchmarks.
+   </p>
+   <div class="form-row">
+     <label for="longscore-input" data-i18n="longscore.input_label">Model id:</label>
+     <input type="text" id="longscore-input" data-i18n-placeholder="longscore.input.placeholder"
+            placeholder="e.g. Qwen2.5-72B-Instruct or meta-llama/Llama-3.1-70B-Instruct" style="flex:1;" />
+     <button type="button" id="longscore-lookup-btn" data-i18n="longscore.lookup_btn">🔎 Lookup</button>
+   </div>
+   <div class="form-row">
+     <button type="button" id="longscore-example-good-btn" class="secondary" data-i18n="longscore.example_good_btn">↳ Example: Jamba-1.5-Large (no degradation)</button>
+     <button type="button" id="longscore-example-mid-btn" class="secondary" data-i18n="longscore.example_mid_btn">↳ Example: Llama-3.1-70B (moderate)</button>
+     <button type="button" id="longscore-example-bad-btn" class="secondary" data-i18n="longscore.example_bad_btn">↳ Example: dbrx (severe)</button>
+   </div>
+   <p id="longscore-status" class="recipe-desc" style="font-size:0.92em;"></p>
+   <div id="longscore-output" style="margin-top: 1em;"></div>
+   <p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;" data-i18n="longscore.formula_note">
+     💡 <strong>LongScore</strong> = mean over l ∈ {16K, 32K, 64K, 128K} of (S_l − Base) / Base, where Base = mean(S_4K, S_8K). Source: <a href="https://arxiv.org/abs/2505.19293" target="_blank">100-LongBench, ACL 2025</a>. Data: <a href="https://github.com/NVIDIA/RULER" target="_blank">NVIDIA RULER</a> (per-length, n=33) + <a href="https://github.com/princeton-nlp/HELMET" target="_blank">HELMET</a> (aggregate at 128K, n=60). 0 = no degradation; -0.30 = severe.
+   </p>
+ </section>
+
  <section id="hub-section" style="display:none;">
  <h2><span data-i18n="hub.title">🧭 Solutions Hub</span>
  <span class="info"><span class="tooltip" data-i18n="hub.tip">
js/i18n.js CHANGED
@@ -753,6 +753,44 @@ export const TRANSLATIONS = {
  "help.v087.tax.title": "🌍 Multilingual Tokenizer Tax",
  "help.v087.tax.body": "Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Output: per-tokenizer token count + chars-per-token + ratio vs baseline + cost-asymmetry interpretation. Auto-detects script blocks (Latin / CJK / Arabic / Cyrillic / Devanagari / Thai / Greek / Hebrew / Korean) so users see why one tokenizer is 3× another. <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.",

+ // v0.8.8 — anti-bullshit pack #14: LongScore (RULER + HELMET lookup)
+ "modes.longscore": "🎯 LongScore",
+ "mode_desc.longscore": "Look up your model's relative degradation past short context. RULER + HELMET KBs (n=93 models). LongScore metric from 100-LongBench (ACL 2025).",
+ "longscore.title": "🎯 LongScore",
+ "longscore.tip": "Every model claims a 128K context window, but accuracy degrades long before that. LongScore (peer-reviewed metric from 100-LongBench, ACL 2025) measures relative degradation past short context. Disentangles base ability from true long-ctx capability — so you compare degradation, not raw scores. Lookup against RULER + HELMET KBs (n=93 models).",
+ "longscore.desc": "<strong>How much does your model degrade past short context?</strong> Paste an HF model id → see LongScore (relative degradation) + per-length breakdown + HELMET 7-task scores when available. No GPU. No inference. Pure lookup against published benchmarks.",
+ "longscore.input_label": "Model id:",
+ "longscore.input.placeholder": "e.g. Qwen2.5-72B-Instruct or meta-llama/Llama-3.1-70B-Instruct",
+ "longscore.lookup_btn": "🔎 Lookup",
+ "longscore.example_good_btn": "↳ Example: Jamba-1.5-Large (no degradation)",
+ "longscore.example_mid_btn": "↳ Example: Llama-3.1-70B (moderate)",
+ "longscore.example_bad_btn": "↳ Example: dbrx (severe)",
+ "longscore.formula_note": "💡 <strong>LongScore</strong> = mean over l ∈ {16K, 32K, 64K, 128K} of (S_l − Base) / Base, where Base = mean(S_4K, S_8K). Source: <a href=\"https://arxiv.org/abs/2505.19293\" target=\"_blank\">100-LongBench, ACL 2025</a>. Data: <a href=\"https://github.com/NVIDIA/RULER\" target=\"_blank\">NVIDIA RULER</a> (per-length, n=33) + <a href=\"https://github.com/princeton-nlp/HELMET\" target=\"_blank\">HELMET</a> (aggregate at 128K, n=60). 0 = no degradation; -0.30 = severe.",
+ "longscore.miss.title": "Model not found in KB",
+ "longscore.miss.body": "Looked up <code>{id}</code>. KB has {n} models. Try a canonical HF id (e.g. <code>Qwen2.5-72B-Instruct</code>, <code>Llama-3.1-70B-Instruct</code>, <code>Jamba-1.5-Mini</code>).",
+ "longscore.miss.suggest": "Check coverage at",
+ "longscore.no_ruler": "⚠ No per-length data — LongScore not computable. Showing HELMET aggregate at 128K instead.",
+ "longscore.score_label": "LongScore",
+ "longscore.helmet_label": "HELMET 7-task breakdown",
+ "longscore.col.ctx": "Context",
+ "longscore.col.score": "Score",
+ "longscore.col.lc": "LC",
+ "longscore.col.task": "Task",
+ "longscore.source_note": "Data source",
+ "longscore.hint.empty": "⚠ Paste a model id first.",
+ "longscore.status.lookup": "⏳ Looking up…",
+ "longscore.status.miss": "ℹ Model not in KB",
+ "longscore.status.ruler_hit": "✅ RULER per-length data found",
+ "longscore.status.helmet_only": "ℹ HELMET aggregate only (no per-length data)",
+ "longscore.verdict.no_degradation": "✅ No degradation past short context",
+ "longscore.verdict.mild": "🟢 Mild degradation (<10%)",
+ "longscore.verdict.moderate": "🟠 Moderate degradation (10-20%)",
+ "longscore.verdict.severe": "🔴 Severe degradation (20-30%)",
+ "longscore.verdict.extreme": "🚨 Extreme degradation (>30%)",
+ "inv.v088.longscore": "<strong>🎯 LongScore</strong> — peer-reviewed degradation metric (100-LongBench, ACL 2025). Look up any model in RULER + HELMET KBs (n=93). See how much your model actually drops past short context.",
+ "help.v088.longscore.title": "🎯 LongScore",
+ "help.v088.longscore.body": "Every long-ctx LLM claims 128K but degrades long before that. The 100-LongBench paper (ACL 2025, arXiv:2505.19293) noticed that raw long-ctx scores are dominated by base ability — a smarter model with a worse long-ctx recipe still scores higher than a less-smart model with a better recipe, masking the actual long-ctx degradation. They propose <strong>LongScore</strong>: <code>LC_l = (S_l − Base) / Base</code> with <code>Base = mean(S_short)</code>, then average over long lengths. Result: a relative-degradation number per model that compares apples to apples. This tafagent mode embeds LongScore-ready data: RULER aggregate per-context (n=33 models, 4K-128K) + HELMET aggregate at 128K (n=60 models, 7 task categories). Lookup is exact-match by HF model id (lowercase, dashes, dots normalized). For models with RULER data, you get the full LongScore + per-length breakdown + verdict (no/mild/moderate/severe/extreme degradation). For HELMET-only models, you get the 7-category aggregate at 128K. <em>Use case</em>: 'I want to use Llama-3.1-70B-Instruct for 100K-token doc summarization — how much accuracy do I actually lose?' → paste id, see -10% LongScore (moderate degradation, mostly the 128K cliff). Decide whether to use it, switch to a model with engineered long-ctx, or chunk your input.",
+
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
  "help.v081.hub.title": "🧭 Solutions Hub",
  "help.v081.hub.body": "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
 
@@ -1965,6 +2003,44 @@ export const TRANSLATIONS = {
  "help.v087.tax.title": "🌍 Impuesto de Tokenizer Multilingüe",
  "help.v087.tax.body": "Los tokenizers gravan el texto no-inglés de forma asimétrica. El mismo párrafo puede ser 100 tokens en inglés pero 250+ en chino en un tokenizer entrenado en Latin (Llama, Phi). Tanto coste-por-request COMO contexto efectivo degradan silenciosamente. Esta tool carga HuggingFace transformers.js en tu navegador (~750 KB CDN) y tokeniza el texto pegado contra 6 tokenizers preset de vendor (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude aprox). Output: token count por tokenizer + chars-per-token + ratio vs baseline + interpretación de asimetría. Auto-detecta bloques de script (Latin / CJK / árabe / cirílico / devanagari / tailandés / griego / hebreo / coreano) para que veas por qué un tokenizer es 3× otro. <em>Caso de uso</em>: 'Mi soporte multilingüe añadió 30% a la factura — ¿qué idioma cuesta más?' → pega texto real de producción, ve breakdown exacto por tokenizer.",

+ // v0.8.8 — anti-bullshit pack #14: LongScore (RULER + HELMET lookup)
+ "modes.longscore": "🎯 LongScore",
+ "mode_desc.longscore": "Consulta la degradación relativa de tu modelo más allá del contexto corto. KBs RULER + HELMET (n=93 modelos). Métrica LongScore de 100-LongBench (ACL 2025).",
+ "longscore.title": "🎯 LongScore",
+ "longscore.tip": "Cada modelo dice tener ventana de 128K, pero la accuracy degrada mucho antes. LongScore (métrica peer-reviewed de 100-LongBench, ACL 2025) mide la degradación relativa más allá del contexto corto. Separa la base ability de la capacidad real long-ctx — comparas degradación, no scores brutos. Lookup contra KBs RULER + HELMET (n=93 modelos).",
+ "longscore.desc": "<strong>¿Cuánto degrada tu modelo más allá del contexto corto?</strong> Pega un id de modelo HF → ve LongScore (degradación relativa) + breakdown por longitud + scores HELMET 7-task cuando estén disponibles. Sin GPU. Sin inferencia. Lookup puro contra benchmarks publicados.",
+ "longscore.input_label": "Id del modelo:",
+ "longscore.input.placeholder": "ej. Qwen2.5-72B-Instruct o meta-llama/Llama-3.1-70B-Instruct",
+ "longscore.lookup_btn": "🔎 Buscar",
+ "longscore.example_good_btn": "↳ Ejemplo: Jamba-1.5-Large (sin degradación)",
+ "longscore.example_mid_btn": "↳ Ejemplo: Llama-3.1-70B (moderado)",
+ "longscore.example_bad_btn": "↳ Ejemplo: dbrx (severo)",
+ "longscore.formula_note": "💡 <strong>LongScore</strong> = media sobre l ∈ {16K, 32K, 64K, 128K} de (S_l − Base) / Base, donde Base = media(S_4K, S_8K). Fuente: <a href=\"https://arxiv.org/abs/2505.19293\" target=\"_blank\">100-LongBench, ACL 2025</a>. Datos: <a href=\"https://github.com/NVIDIA/RULER\" target=\"_blank\">NVIDIA RULER</a> (per-length, n=33) + <a href=\"https://github.com/princeton-nlp/HELMET\" target=\"_blank\">HELMET</a> (agregado a 128K, n=60). 0 = sin degradación; -0.30 = severo.",
+ "longscore.miss.title": "Modelo no encontrado en KB",
+ "longscore.miss.body": "Buscado <code>{id}</code>. KB tiene {n} modelos. Prueba un id HF canónico (ej. <code>Qwen2.5-72B-Instruct</code>, <code>Llama-3.1-70B-Instruct</code>, <code>Jamba-1.5-Mini</code>).",
+ "longscore.miss.suggest": "Comprueba cobertura en",
+ "longscore.no_ruler": "⚠ Sin datos per-length — LongScore no computable. Mostrando agregado HELMET a 128K.",
+ "longscore.score_label": "LongScore",
+ "longscore.helmet_label": "Breakdown HELMET 7-task",
+ "longscore.col.ctx": "Contexto",
+ "longscore.col.score": "Score",
+ "longscore.col.lc": "LC",
+ "longscore.col.task": "Tarea",
+ "longscore.source_note": "Fuente",
+ "longscore.hint.empty": "⚠ Pega un id de modelo primero.",
+ "longscore.status.lookup": "⏳ Buscando…",
+ "longscore.status.miss": "ℹ Modelo no en KB",
+ "longscore.status.ruler_hit": "✅ Datos RULER per-length encontrados",
+ "longscore.status.helmet_only": "ℹ Solo agregado HELMET (sin datos per-length)",
+ "longscore.verdict.no_degradation": "✅ Sin degradación más allá del contexto corto",
+ "longscore.verdict.mild": "🟢 Degradación leve (<10%)",
+ "longscore.verdict.moderate": "🟠 Degradación moderada (10-20%)",
+ "longscore.verdict.severe": "🔴 Degradación severa (20-30%)",
+ "longscore.verdict.extreme": "🚨 Degradación extrema (>30%)",
+ "inv.v088.longscore": "<strong>🎯 LongScore</strong> — métrica de degradación peer-reviewed (100-LongBench, ACL 2025). Lookup de cualquier modelo en KBs RULER + HELMET (n=93). Ve cuánto cae tu modelo en realidad más allá del contexto corto.",
+ "help.v088.longscore.title": "🎯 LongScore",
+ "help.v088.longscore.body": "Cada LLM long-ctx dice 128K pero degrada mucho antes. El paper 100-LongBench (ACL 2025, arXiv:2505.19293) notó que los scores brutos long-ctx están dominados por base ability — un modelo más smart con peor receta long-ctx puntúa más que uno menos smart con mejor receta, ocultando la degradación real. Proponen <strong>LongScore</strong>: <code>LC_l = (S_l − Base) / Base</code> con <code>Base = media(S_short)</code>, luego promedio sobre longitudes largas. Resultado: número de degradación relativa por modelo que compara apples to apples. Este mode tafagent embebe datos LongScore-ready: agregado RULER per-context (n=33 modelos, 4K-128K) + agregado HELMET a 128K (n=60 modelos, 7 categorías). Lookup es match exacto por id HF (lowercase, dashes, dots normalizados). Para modelos con datos RULER, obtienes el LongScore completo + breakdown per-length + verdict (no/leve/moderado/severo/extremo). Para modelos solo-HELMET, obtienes el agregado 7-categorías a 128K. <em>Caso de uso</em>: 'quiero usar Llama-3.1-70B-Instruct para resumen de docs 100K-token — ¿cuánta accuracy pierdo realmente?' → pega id, ve -10% LongScore (degradación moderada, sobre todo el cliff a 128K). Decide si usarlo, cambiar a un modelo con long-ctx engineered, o chunkear tu input.",
+
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
  "help.v081.hub.title": "🧭 Solutions Hub",
  "help.v081.hub.body": "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
 
@@ -3041,6 +3117,44 @@ export const TRANSLATIONS = {
  "help.v087.tax.title": "🌍 Taxe Tokenizer Multilingue",
  "help.v087.tax.body": "Les tokenizers taxent le texte non-anglais de façon asymétrique. Le même paragraphe peut faire 100 tokens en anglais mais 250+ en chinois sur un tokenizer entraîné en Latin (Llama, Phi). Coût-par-requête ET contexte effectif dégradent silencieusement. Cet outil charge HuggingFace transformers.js dans votre navigateur (~750 KB CDN) et tokenize le texte collé contre 6 tokenizers preset de fournisseurs (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Sortie : token count par tokenizer + chars-per-token + ratio vs baseline + interprétation d'asymétrie. Auto-détecte les blocs de script (Latin / CJK / arabe / cyrillique / devanagari / thaï / grec / hébreu / coréen) pour voir pourquoi un tokenizer est 3× un autre. <em>Cas d'usage</em> : 'Mon support multilingue a ajouté 30% à la facture — quelle langue coûte le plus ?' → collez du texte de production réel, voyez le breakdown exact par tokenizer.",

+ // v0.8.8 — anti-bullshit pack #14 : LongScore (RULER + HELMET lookup)
+ "modes.longscore": "🎯 LongScore",
+ "mode_desc.longscore": "Recherchez la dégradation relative de votre modèle au-delà du contexte court. KBs RULER + HELMET (n=93 modèles). Métrique LongScore de 100-LongBench (ACL 2025).",
+ "longscore.title": "🎯 LongScore",
+ "longscore.tip": "Chaque modèle prétend une fenêtre 128K, mais la précision dégrade bien avant. LongScore (métrique peer-reviewed de 100-LongBench, ACL 2025) mesure la dégradation relative au-delà du contexte court. Sépare la base ability de la vraie capacité long-ctx — vous comparez la dégradation, pas les scores bruts. Lookup contre KBs RULER + HELMET (n=93 modèles).",
+ "longscore.desc": "<strong>Combien votre modèle dégrade-t-il au-delà du contexte court ?</strong> Collez un id modèle HF → voyez LongScore (dégradation relative) + breakdown par longueur + scores HELMET 7-task quand disponibles. Pas de GPU. Pas d'inférence. Lookup pur contre des benchmarks publiés.",
+ "longscore.input_label": "Id du modèle :",
+ "longscore.input.placeholder": "ex. Qwen2.5-72B-Instruct ou meta-llama/Llama-3.1-70B-Instruct",
+ "longscore.lookup_btn": "🔎 Rechercher",
+ "longscore.example_good_btn": "↳ Exemple : Jamba-1.5-Large (sans dégradation)",
+ "longscore.example_mid_btn": "↳ Exemple : Llama-3.1-70B (modéré)",
+ "longscore.example_bad_btn": "↳ Exemple : dbrx (sévère)",
+ "longscore.formula_note": "💡 <strong>LongScore</strong> = moyenne sur l ∈ {16K, 32K, 64K, 128K} de (S_l − Base) / Base, où Base = moyenne(S_4K, S_8K). Source : <a href=\"https://arxiv.org/abs/2505.19293\" target=\"_blank\">100-LongBench, ACL 2025</a>. Données : <a href=\"https://github.com/NVIDIA/RULER\" target=\"_blank\">NVIDIA RULER</a> (per-length, n=33) + <a href=\"https://github.com/princeton-nlp/HELMET\" target=\"_blank\">HELMET</a> (agrégat à 128K, n=60). 0 = pas de dégradation ; -0.30 = sévère.",
+ "longscore.miss.title": "Modèle non trouvé en KB",
+ "longscore.miss.body": "Recherché <code>{id}</code>. KB contient {n} modèles. Essayez un id HF canonique (ex. <code>Qwen2.5-72B-Instruct</code>, <code>Llama-3.1-70B-Instruct</code>, <code>Jamba-1.5-Mini</code>).",
+ "longscore.miss.suggest": "Vérifiez la couverture sur",
+ "longscore.no_ruler": "⚠ Pas de données per-length — LongScore non calculable. Affichage agrégat HELMET à 128K.",
+ "longscore.score_label": "LongScore",
+ "longscore.helmet_label": "Breakdown HELMET 7-task",
+ "longscore.col.ctx": "Contexte",
+ "longscore.col.score": "Score",
+ "longscore.col.lc": "LC",
+ "longscore.col.task": "Tâche",
+ "longscore.source_note": "Source",
+ "longscore.hint.empty": "⚠ Collez un id modèle d'abord.",
+ "longscore.status.lookup": "⏳ Recherche…",
+ "longscore.status.miss": "ℹ Modèle pas en KB",
+ "longscore.status.ruler_hit": "✅ Données RULER per-length trouvées",
+ "longscore.status.helmet_only": "ℹ Agrégat HELMET seulement (pas de données per-length)",
+ "longscore.verdict.no_degradation": "✅ Pas de dégradation au-delà du contexte court",
+ "longscore.verdict.mild": "🟢 Dégradation légère (<10%)",
+ "longscore.verdict.moderate": "🟠 Dégradation modérée (10-20%)",
+ "longscore.verdict.severe": "🔴 Dégradation sévère (20-30%)",
+ "longscore.verdict.extreme": "🚨 Dégradation extrême (>30%)",
+ "inv.v088.longscore": "<strong>🎯 LongScore</strong> — métrique de dégradation peer-reviewed (100-LongBench, ACL 2025). Lookup de tout modèle dans KBs RULER + HELMET (n=93). Voyez combien votre modèle chute réellement au-delà du contexte court.",
+ "help.v088.longscore.title": "🎯 LongScore",
+ "help.v088.longscore.body": "Chaque LLM long-ctx prétend 128K mais dégrade bien avant. Le paper 100-LongBench (ACL 2025, arXiv:2505.19293) a remarqué que les scores long-ctx bruts sont dominés par la base ability — un modèle plus smart avec une moins bonne recette long-ctx score plus qu'un moins smart avec une meilleure recette, masquant la vraie dégradation. Ils proposent <strong>LongScore</strong> : <code>LC_l = (S_l − Base) / Base</code> avec <code>Base = moyenne(S_short)</code>, puis moyenne sur les longueurs longues. Résultat : un nombre de dégradation relative par modèle qui compare apples to apples. Ce mode tafagent embarque les données LongScore-ready : agrégat RULER per-context (n=33 modèles, 4K-128K) + agrégat HELMET à 128K (n=60 modèles, 7 catégories). Lookup est match exact par id HF (lowercase, dashes, dots normalisés). Pour les modèles avec données RULER, vous obtenez le LongScore complet + breakdown per-length + verdict (pas/légère/modérée/sévère/extrême). Pour les modèles HELMET-only, vous obtenez l'agrégat 7-catégories à 128K. <em>Cas d'usage</em> : 'je veux utiliser Llama-3.1-70B-Instruct pour résumé de docs 100K-token — combien de précision je perds vraiment ?' → collez l'id, voyez -10% LongScore (dégradation modérée, surtout le cliff à 128K). Décidez de l'utiliser, passer à un modèle avec long-ctx engineered, ou chunker votre input.",
+
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
  "help.v081.hub.title": "🧭 Solutions Hub",
  "help.v081.hub.body": "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
 
4231
  "help.v087.tax.title": "🌍 多语言 Tokenizer 税",
4232
  "help.v087.tax.body": "Tokenizer 对非英语文本的征税不对称。同一段落在英语中可能是 100 个 token,但在拉丁字母训练的 tokenizer(Llama、Phi)上的中文可能是 250+ 个 token。每次请求成本和有效上下文都会静默降级。这个工具在你的浏览器中加载 HuggingFace transformers.js(~750 KB CDN),并对粘贴的文本运行 6 个预设供应商 tokenizer(Qwen2.5、Phi-3.5、Llama-3.1、Gemma-2、GPT-4 cl100k、Claude 近似)的 tokenize。输出:每个 tokenizer 的 token 数 + 字符/token + 相对于 baseline 的比率 + 成本不对称解读。自动检测脚本块(拉丁/CJK/阿拉伯/西里尔/天城/泰/希腊/希伯来/韩文)让你看到为什么一个 tokenizer 是另一个的 3 倍。<em>用例</em>:『我的多语言支持给账单加了 30%——哪种语言成本最高?』→ 粘贴真实生产文本,查看每个 tokenizer 的精确分解。",
4233
 
4234
+ // v0.8.8 — anti-bullshit pack #14:LongScore (RULER + HELMET 查询)
4235
+ "modes.longscore": "🎯 LongScore",
4236
+ "mode_desc.longscore": "查询你的模型在短上下文之外的相对降级。RULER + HELMET KB(n=93 模型)。LongScore 指标来自 100-LongBench (ACL 2025)。",
4237
+ "longscore.title": "🎯 LongScore",
4238
+ "longscore.tip": "每个模型都声称 128K 上下文窗口,但准确率早就开始降级。LongScore(来自 100-LongBench, ACL 2025 的 peer-reviewed 指标)测量相对于短上下文的降级。将基础能力与真正的长上下文能力解耦——你比较的是降级,而不是原始分数。在 RULER + HELMET KB 中查询(n=93 模型)。",
4239
+ "longscore.desc": "<strong>你的模型在短上下文之外降级多少?</strong> 粘贴 HF 模型 id → 查看 LongScore(相对降级)+ 每长度分解 + HELMET 7-task 分数(如有)。无 GPU。无推理。纯查询已发布的 benchmark。",
4240
+ "longscore.input_label": "模型 id:",
4241
+ "longscore.input.placeholder": "例如 Qwen2.5-72B-Instruct 或 meta-llama/Llama-3.1-70B-Instruct",
4242
+ "longscore.lookup_btn": "🔎 查询",
4243
+ "longscore.example_good_btn": "↳ 示例:Jamba-1.5-Large(无降级)",
4244
+ "longscore.example_mid_btn": "↳ 示例:Llama-3.1-70B(中等)",
4245
+ "longscore.example_bad_btn": "↳ 示例:dbrx(严重)",
4246
+ "longscore.formula_note": "💡 <strong>LongScore</strong> = 在 l ∈ {16K, 32K, 64K, 128K} 上的 (S_l − Base) / Base 平均值,其中 Base = mean(S_4K, S_8K)。来源:<a href=\"https://arxiv.org/abs/2505.19293\" target=\"_blank\">100-LongBench, ACL 2025</a>。数据:<a href=\"https://github.com/NVIDIA/RULER\" target=\"_blank\">NVIDIA RULER</a>(每长度,n=33)+ <a href=\"https://github.com/princeton-nlp/HELMET\" target=\"_blank\">HELMET</a>(128K 聚合,n=60)。0 = 无降级;-0.30 = 严重。",
4247
+ "longscore.miss.title": "KB 中未找到模型",
4248
+ "longscore.miss.body": "查询了 <code>{id}</code>。KB 包含 {n} 个模型。请尝试规范 HF id(例如 <code>Qwen2.5-72B-Instruct</code>、<code>Llama-3.1-70B-Instruct</code>、<code>Jamba-1.5-Mini</code>)。",
4249
+ "longscore.miss.suggest": "在以下位置检查覆盖范围",
4250
+ "longscore.no_ruler": "⚠ 无每长度数据 — LongScore 无法计算。改为显示 128K 处的 HELMET 聚合。",
4251
+ "longscore.score_label": "LongScore",
4252
+ "longscore.helmet_label": "HELMET 7-task 分解",
4253
+ "longscore.col.ctx": "上下文",
4254
+ "longscore.col.score": "分数",
4255
+ "longscore.col.lc": "LC",
4256
+ "longscore.col.task": "任务",
4257
+ "longscore.source_note": "数据源",
4258
+ "longscore.hint.empty": "⚠ 请先粘贴模型 id。",
4259
+ "longscore.status.lookup": "⏳ 查询中…",
4260
+ "longscore.status.miss": "ℹ 模型不在 KB 中",
4261
+ "longscore.status.ruler_hit": "✅ 找到 RULER 每长度数据",
4262
+ "longscore.status.helmet_only":"ℹ 仅 HELMET 聚合(无每长度数据)",
4263
+ "longscore.verdict.no_degradation": "✅ 短上下文之外无降级",
4264
+ "longscore.verdict.mild": "🟢 轻度降级(<10%)",
4265
+ "longscore.verdict.moderate": "🟠 中度降级(10-20%)",
4266
+ "longscore.verdict.severe": "🔴 严重降级(20-30%)",
4267
+ "longscore.verdict.extreme": "🚨 极端降级(>30%)",
4268
+ "inv.v088.longscore": "<strong>🎯 LongScore</strong> — peer-reviewed 降级指标(100-LongBench, ACL 2025)。在 RULER + HELMET KB(n=93)中查询任意模型。看你的模型在短上下文之外实际下降多少。",
4269
+ "help.v088.longscore.title": "🎯 LongScore",
4270
+ "help.v088.longscore.body": "每个长上下文 LLM 都声称 128K,但早就开始降级。100-LongBench 论文 (ACL 2025, arXiv:2505.19293) 注意到原始长上下文分数被基础能力主导——一个更聪明但长上下文配方更差的模型,仍然得分高于一个不那么聪明但配方更好的模型,掩盖了真正的长上下文降级。他们提出 <strong>LongScore</strong>:<code>LC_l = (S_l − Base) / Base</code>,其中 <code>Base = mean(S_short)</code>,然后对长长度取平均。结果:每个模型一个相对降级数字,可以同等比较。这个 tafagent 模式嵌入了 LongScore-ready 数据:RULER 每上下文聚合(n=33 模型,4K-128K)+ HELMET 128K 聚合(n=60 模型,7 类别)。查询是按 HF 模型 id 精确匹配(小写、连字符、点号已规范化)。对于有 RULER 数据的模型,你得到完整的 LongScore + 每长度分解 + 判定(无/轻/中/严重/极端降级)。对于仅 HELMET 模型,你得到 128K 处的 7-类别聚合。<em>用例</em>:『我想用 Llama-3.1-70B-Instruct 做 100K-token 文档摘要——实际上我损失多少准确率?』→ 粘贴 id,看到 -10% LongScore(中度降级,主要是 128K 处的 cliff)。决定是否使用、改用 long-ctx engineered 的模型,或者分块输入。",
4271
+
4272
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
4273
  "help.v081.hub.title": "🧭 Solutions Hub",
4274
  "help.v081.hub.body": "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别(评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性),每个映射到(a)解决它的 tafagent 模式(若存在),以及(b)社区已信任的最佳外部工具(RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等)。搜索框匹配 pain、场景和工具名称。<em>用例</em>:'我有问题 X — tafagent 解决它吗,如果不,谁解决?'",
js/longscore.js ADDED
@@ -0,0 +1,105 @@
+ // longscore.js — pure logic for the 🎯 LongScore mode.
+ //
+ // Looks up an HF-style model id in data/longscore_kb.json and returns:
+ //   - exact match: ruler_per_ctx (if available) + ruler_long_score (computed) + helmet aggregate
+ //   - HELMET-only: aggregate scores at 128K, no LongScore (no per-length data)
+ //   - miss: fallback for unknown models
+ //
+ // No UI strings — emits codes + params; main.js translates via i18n.
+ //
+ // LongScore formula (100-LongBench, ACL 2025, arXiv:2505.19293, §3.2):
+ //   Base      = mean(S_4K, S_8K)
+ //   LC_l      = (S_l - Base) / Base
+ //   LongScore = mean(LC_l for l in {16K, 32K, 64K, 128K})
+ //
+ // More negative = worse long-ctx retention.
+
+ let KB = null;
+
+ export async function loadKB() {
+   if (KB) return KB;
+   const res = await fetch("data/longscore_kb.json");
+   if (!res.ok) throw new Error("longscore_kb fetch failed: " + res.status);
+   KB = await res.json();
+   return KB;
+ }
+
+ export function normalize(name) {
+   if (!name) return "";
+   let s = String(name).toLowerCase().trim();
+   s = s.replace(/^(meta-llama\/|01-ai\/|ai21labs\/|nvidia\/|princeton-nlp\/|unsloth\/)/, "");
+   s = s.replace(/_/g, "-").replace(/\./g, "-");
+   s = s.replace(/([a-z])(\d)/g, "$1-$2");
+   s = s.replace(/(\d)([a-z])/g, "$1-$2");
+   s = s.replace(/-+/g, "-");
+   // -inst → -instruct (both at end and in middle, before next -segment)
+   s = s.replace(/-inst(?=-|$)/g, "-instruct");
+   return s;
+ }
+
+ /** Classify LongScore avg into verdict code. */
+ export function classify(longscore_avg, thresholds) {
+   if (longscore_avg === null || longscore_avg === undefined) return "no_data";
+   if (longscore_avg >= thresholds.no_degradation) return "no_degradation";
+   if (longscore_avg >= thresholds.mild) return "mild";
+   if (longscore_avg >= thresholds.moderate) return "moderate";
+   if (longscore_avg >= thresholds.severe) return "severe";
+   return "extreme";
+ }
+
+ /** Look up a model and return a structured result. */
+ export async function lookup(rawId) {
+   const kb = await loadKB();
+   const id = normalize(rawId);
+   const entry = kb.models[id];
+   if (!entry) {
+     return {
+       code: "miss",
+       normalized_id: id,
+       n_kb_total: kb.stats.n_total,
+     };
+   }
+
+   const longscore = entry.ruler_long_score;
+   const verdict = longscore
+     ? classify(longscore.avg_lc, kb.thresholds)
+     : null;
+
+   return {
+     code: longscore ? "ruler_hit" : (entry.helmet ? "helmet_only" : "partial"),
+     display_name: entry.display_name,
+     normalized_id: id,
+     ruler_per_ctx: entry.ruler_per_ctx,
+     ruler_long_score: longscore,
+     helmet: entry.helmet,
+     recipe_class: entry.recipe_class,
+     params_b: entry.params_b,
+     native_context_k: entry.native_context_k,
+     source: entry.source,
+     verdict,
+     thresholds: kb.thresholds,
+   };
+ }
+
+ /** Get sorted list of all model ids — for autocomplete. */
+ export async function listAllIds() {
+   const kb = await loadKB();
+   return Object.keys(kb.models).sort();
+ }
+
+ /** Top-N best/worst by LongScore (for sanity inspection). Optional helper. */
+ export async function rank(direction) {
+   const kb = await loadKB();
+   const items = Object.entries(kb.models)
+     .filter(([, m]) => m.ruler_long_score)
+     .map(([id, m]) => ({
+       id,
+       display_name: m.display_name,
+       recipe_class: m.recipe_class,
+       avg_lc: m.ruler_long_score.avg_lc,
+     }));
+   items.sort((a, b) =>
+     direction === "best" ? b.avg_lc - a.avg_lc : a.avg_lc - b.avg_lc
+   );
+   return items;
+ }
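
Consumer-side sketch of this module's API (how a caller like main.js below is expected to drive it; the model ids are just examples and all DOM wiring is omitted):

  import { lookup, rank } from "./longscore.js";

  const res = await lookup("meta-llama/Llama-3.1-70B-Instruct");
  if (res.code === "ruler_hit") {
    // full LongScore: avg_lc plus per-length LC values and a verdict code
    console.log(res.display_name, res.ruler_long_score.avg_lc, res.verdict);
  } else if (res.code === "helmet_only") {
    // no per-length data — only the 128K HELMET aggregate is available
    console.log(res.display_name, "HELMET overall:", res.helmet.overall);
  } else if (res.code === "miss") {
    console.log(`not in KB (${res.n_kb_total} models):`, res.normalized_id);
  }

  const worst = await rank("worst");  // most negative avg_lc first
  console.log(worst.slice(0, 3));

Note the id normalization: the `meta-llama/` prefix, dots, and case are all stripped before the KB lookup, so a raw Hub id works as-is.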
js/main.js CHANGED
@@ -35,6 +35,9 @@ import {
    tokenizeAll, detectLanguageBlocks,
    PRESET_TOKENIZERS as TAX_PRESETS, SAMPLE_TEXTS as TAX_SAMPLES,
  } from "./tokenizer_tax.js";
+ import {
+   loadKB as loadLongscoreKB, lookup as longscoreLookup, rank as longscoreRank,
+ } from "./longscore.js";

  // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
  // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -229,6 +232,7 @@ document.addEventListener("click", (e) => {
    cache: "cache-section",
    speculative: "speculative-section",
    tax: "tax-section",
+   longscore: "longscore-section",
    hub: "hub-section",
  }[targetMode];
  if (sectionId) {
@@ -254,7 +258,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
   "diagnose-section", "phase-section", "unmask-section",
   "template-section", "arena-section", "contam-section",
   "quant-section", "drift-section", "niah-section",
-  "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "hub-section"].forEach(id => {
+  "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "longscore-section", "hub-section"].forEach(id => {
    const el = $(id);
    if (el) el.style.display = "none";
  });
@@ -271,6 +275,7 @@
    cache: "cache-section",
    speculative: "speculative-section",
    tax: "tax-section",
+   longscore: "longscore-section",
    hub: "hub-section",
  };
  const sectionId = sectionMap[mode];
@@ -283,6 +288,7 @@
  if (mode === "cache") initCacheDiff();
  if (mode === "speculative") initSpeculative();
  if (mode === "tax") initTax();
+ if (mode === "longscore") initLongscore();
  if (mode === "hub") initHub();
  });
  });
@@ -3469,10 +3475,10 @@ async function initHub() {

  function renderEntry(e) {
    const modeBadge = e.tafagent_mode
-     ? `<span class="badge" style="background:#3fb950;">${e.tafagent_mode}</span>`
+     ? `<span class="badge" style="background:#3fb950;color:#fff;border-color:#3fb950;">${e.tafagent_mode}</span>`
      : (e.tafagent_planned_mode
-       ? `<span class="badge" style="background:#d29922;">${t("hub.planned") || "planned:"} ${e.tafagent_planned_mode}</span>`
-       : `<span class="badge" style="background:#6e7781;">${t("hub.no_mode") || "external"}</span>`);
+       ? `<span class="badge" style="background:#d29922;color:#1a1a1a;border-color:#d29922;">${t("hub.planned") || "planned:"} ${e.tafagent_planned_mode}</span>`
+       : `<span class="badge" style="background:#6e7781;color:#fff;border-color:#6e7781;">${t("hub.no_mode") || "external"}</span>`);
    const tools = (e.external_tools || [])
      .map(tl => {
        const icon = HUB_TYPE_BADGE[tl.type] || "🔗";
@@ -4410,6 +4416,167 @@ $("tax-sample-code-btn")?.addEventListener("click", () => {
  runTaxTokenize();
  });

+ // ════════════════════════════════════════════════════════════════════
+ // LongScore mode (v0.8.8 anti-bullshit pack #14)
+ // ════════════════════════════════════════════════════════════════════
+ let __longscoreInited = false;
+
+ function initLongscore() {
+   if (__longscoreInited) return;
+   __longscoreInited = true;
+   // Eager-load KB so the first lookup is instant (KB is ~70KB, no real cost)
+   loadLongscoreKB().catch(e => {
+     console.warn("longscore_kb preload failed", e);
+   });
+ }
+
+ function fmtPct(x, sign) {
+   if (x == null) return "—";
+   const v = (x * 100);
+   return `${sign && v >= 0 ? "+" : ""}${v.toFixed(1)}%`;
+ }
+
+ function lcColor(avg) {
+   if (avg == null) return "#8b949e";
+   if (avg >= -0.02) return "#3fb950"; // green: no degradation
+   if (avg >= -0.10) return "#a5d36a"; // light green
+   if (avg >= -0.20) return "#f0883e"; // orange
+   if (avg >= -0.30) return "#f85149"; // red
+   return "#a01b1b"; // dark red: extreme
+ }
+
+ function renderLongscoreResult(res) {
+   if (res.code === "miss") {
+     return `<div class="arena-result">
+       <p style="color:#f0883e;"><strong>${t("longscore.miss.title") || "Model not found in KB"}</strong></p>
+       <p>${tFmt("longscore.miss.body", { id: res.normalized_id, n: res.n_kb_total }) || `Looked up <code>${res.normalized_id}</code>. KB has ${res.n_kb_total} models. Try a canonical HF id (e.g. <code>Qwen2.5-72B-Instruct</code>, <code>Llama-3.1-70B-Instruct</code>, <code>Jamba-1.5-Mini</code>).`}</p>
+       <p class="subtle" style="font-size:0.85em;">${t("longscore.miss.suggest") || "Check coverage at"} <a href="https://github.com/NVIDIA/RULER" target="_blank">RULER</a> · <a href="https://github.com/princeton-nlp/HELMET" target="_blank">HELMET</a>.</p>
+     </div>`;
+   }
+
+   const verdictMap = {
+     no_degradation: { color: "#3fb950", label: t("longscore.verdict.no_degradation") || "✅ No degradation past short context" },
+     mild: { color: "#a5d36a", label: t("longscore.verdict.mild") || "🟢 Mild degradation (<10%)" },
+     moderate: { color: "#f0883e", label: t("longscore.verdict.moderate") || "🟠 Moderate degradation (10-20%)" },
+     severe: { color: "#f85149", label: t("longscore.verdict.severe") || "🔴 Severe degradation (20-30%)" },
+     extreme: { color: "#a01b1b", label: t("longscore.verdict.extreme") || "🚨 Extreme degradation (>30%)" },
+   };
+
+   let html = `<div class="arena-result">`;
+   html += `<p><strong>${escapeHtml(res.display_name)}</strong>`;
+   if (res.params_b) html += ` <span class="subtle">· ${res.params_b}B params</span>`;
+   if (res.recipe_class) html += ` <span class="subtle">· ${escapeHtml(res.recipe_class)}</span>`;
+   if (res.native_context_k) html += ` <span class="subtle">· native ctx ${res.native_context_k}K</span>`;
+   html += `</p>`;
+
+   // RULER per-length + LongScore
+   if (res.ruler_long_score) {
+     const ls = res.ruler_long_score;
+     const v = verdictMap[res.verdict] || { color: "#8b949e", label: res.verdict };
+     html += `<p style="margin-top:0.8em;font-size:1.1em;">
+       <strong>${t("longscore.score_label") || "LongScore"}:</strong>
+       <span style="color:${lcColor(ls.avg_lc)};font-family:monospace;font-size:1.2em;font-weight:bold;">${fmtPct(ls.avg_lc, true)}</span>
+       <span class="subtle">· Base = ${ls.base.toFixed(1)}% (mean of 4K, 8K)</span>
+     </p>`;
+     html += `<p style="color:${v.color};font-weight:bold;">${v.label}</p>`;
+
+     // Per-length bars
+     html += `<table class="lean-table" style="margin-top:0.8em;width:100%;">
+       <thead><tr>
+         <th style="text-align:left;">${t("longscore.col.ctx") || "Context"}</th>
+         <th style="text-align:right;">${t("longscore.col.score") || "Score"}</th>
+         <th style="text-align:right;">${t("longscore.col.lc") || "LC"}</th>
+       </tr></thead><tbody>`;
+     const ctxKeys = ["4k", "8k", "16k", "32k", "64k", "128k"];
+     for (const k of ctxKeys) {
+       const score = res.ruler_per_ctx?.[k];
+       if (score == null) continue;
+       const isShort = k === "4k" || k === "8k";
+       const lc = ls.per_length_lc?.[k];
+       html += `<tr ${isShort ? 'style="opacity:0.7;"' : ""}>
+         <td><strong>${k.toUpperCase()}</strong>${isShort ? ` <span class="subtle" style="font-size:0.8em;">(base)</span>` : ""}</td>
+         <td style="text-align:right;font-family:monospace;">${score.toFixed(1)}%</td>
+         <td style="text-align:right;font-family:monospace;color:${lcColor(lc)};">${lc != null ? fmtPct(lc, true) : "—"}</td>
+       </tr>`;
+     }
+     html += `</tbody></table>`;
+   } else {
+     // Helmet-only or partial
+     html += `<p style="margin-top:0.8em;color:#f0883e;">${t("longscore.no_ruler") || "⚠ No per-length data — LongScore not computable. Showing HELMET aggregate at 128K instead."}</p>`;
+   }
+
+   // HELMET breakdown if available
+   if (res.helmet) {
+     html += `<details style="margin-top:1em;" open>
+       <summary><strong>${t("longscore.helmet_label") || "HELMET 7-task breakdown"} (at 128K)</strong></summary>
+       <table class="lean-table" style="margin-top:0.5em;width:100%;">
+       <thead><tr>
+         <th style="text-align:left;">${t("longscore.col.task") || "Task"}</th>
+         <th style="text-align:right;">${t("longscore.col.score") || "Score"}</th>
+       </tr></thead><tbody>`;
+     if (res.helmet.overall != null) {
+       html += `<tr style="background:#1f2933;"><td><strong>Overall</strong></td><td style="text-align:right;font-family:monospace;"><strong>${res.helmet.overall.toFixed(1)}</strong></td></tr>`;
+     }
+     if (res.helmet.categories) {
+       for (const [task, score] of Object.entries(res.helmet.categories)) {
+         html += `<tr><td>${escapeHtml(task)}</td><td style="text-align:right;font-family:monospace;">${score != null ? score.toFixed(1) : "—"}</td></tr>`;
+       }
+     }
+     html += `</tbody></table></details>`;
+   }
+
+   html += `<p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;">
+     ${t("longscore.source_note") || "Data source"}: ${escapeHtml(res.source)} ·
+     <a href="https://arxiv.org/abs/2505.19293" target="_blank">LongScore metric</a>
+   </p>`;
+   html += `</div>`;
+   return html;
+ }
+
+ async function runLongscoreLookup() {
+   const id = $("longscore-input")?.value?.trim();
+   if (!id) {
+     $("longscore-status").textContent = t("longscore.hint.empty") || "⚠ Paste a model id first.";
+     return;
+   }
+   $("longscore-status").textContent = t("longscore.status.lookup") || "⏳ Looking up…";
+   $("longscore-output").innerHTML = "";
+   try {
+     const res = await longscoreLookup(id);
+     $("longscore-output").innerHTML = renderLongscoreResult(res);
+     if (res.code === "miss") {
+       $("longscore-status").textContent = t("longscore.status.miss") || "ℹ Model not in KB";
+     } else if (res.code === "ruler_hit") {
+       $("longscore-status").textContent = t("longscore.status.ruler_hit") || "✅ RULER per-length data found";
+     } else {
+       $("longscore-status").textContent = t("longscore.status.helmet_only") || "ℹ HELMET aggregate only (no per-length data)";
+     }
+   } catch (e) {
+     $("longscore-status").textContent = `❌ ${e.message || e}`;
+     console.error(e);
+   }
+ }
+
+ $("longscore-lookup-btn")?.addEventListener("click", runLongscoreLookup);
+ $("longscore-input")?.addEventListener("keydown", e => {
+   if (e.key === "Enter") {
+     e.preventDefault();
+     runLongscoreLookup();
+   }
+ });
+ $("longscore-example-good-btn")?.addEventListener("click", () => {
+   $("longscore-input").value = "Jamba-1.5-Large";
+   runLongscoreLookup();
+ });
+ $("longscore-example-mid-btn")?.addEventListener("click", () => {
+   $("longscore-input").value = "Llama-3.1-70B-Instruct";
+   runLongscoreLookup();
+ });
+ $("longscore-example-bad-btn")?.addEventListener("click", () => {
+   $("longscore-input").value = "dbrx";
+   runLongscoreLookup();
+ });
+
  // ════════════════════════════════════════════════════════════════════
  // Bootstrap
  // ════════════════════════════════════════════════════════════════════
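
One contract worth flagging before the tests: renderLongscoreResult above calls `tFmt(...)` (alongside the existing `t`, `escapeHtml`, and `$` helpers) to fill the `{id}` / `{n}` placeholders in `longscore.miss.body`. That helper is assumed to already exist elsewhere in main.js and is not part of this diff; a minimal interpolator compatible with how it is called would look like this hypothetical sketch:

  // Hypothetical — not in this commit; illustrates the contract renderLongscoreResult relies on.
  function tFmt(key, params) {
    const template = t(key);      // same i18n lookup used by the rest of main.js
    if (!template) return null;   // falsy return lets the `|| fallback` chains kick in
    return template.replace(/\{(\w+)\}/g, (m, name) =>
      params && params[name] != null ? String(params[name]) : m);
  }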
scripts/test_longscore.mjs ADDED
@@ -0,0 +1,72 @@
+ // Smoke test for js/longscore.js — verifies normalize, lookup, classify codes.
+ // Run: node scripts/test_longscore.mjs
+ import { readFileSync } from "fs";
+
+ // Mock fetch for Node ESM
+ globalThis.fetch = async (url) => {
+   const path = url.startsWith("data/") ? `./${url}` : url;
+   const txt = readFileSync(path, "utf-8");
+   return {
+     ok: true,
+     json: async () => JSON.parse(txt),
+   };
+ };
+
+ const { normalize, lookup, classify, rank } = await import("../js/longscore.js");
+
+ let pass = 0, fail = 0;
+ function check(name, cond, detail) {
+   if (cond) { pass++; console.log(`  ✓ ${name}`); }
+   else { fail++; console.log(`  ✗ ${name}${detail ? ": " + detail : ""}`); }
+ }
+
+ console.log("--- normalize ---");
+ check("trims + lowercases", normalize(" Qwen2.5 ") === "qwen-2-5");
+ check("strips meta-llama/", normalize("meta-llama/Llama-3.1-70B-Instruct") === "llama-3-1-70-b-instruct");
+ check("strips 01-ai/", normalize("01-ai/Yi-34B-200K") === "yi-34-b-200-k");
+ check("inst → instruct", normalize("Mistral-7B-Inst-v0.2") === "mistral-7-b-instruct-v-0-2");
+ check("dot → dash", normalize("Phi-3.5-mini-instruct") === "phi-3-5-mini-instruct");
+ check("empty", normalize("") === "");
+
+ console.log("\n--- classify ---");
+ const t = { no_degradation: -0.02, mild: -0.10, moderate: -0.20, severe: -0.30 };
+ check("no_data", classify(null, t) === "no_data");
+ check("no_degradation", classify(0.0, t) === "no_degradation");
+ check("mild", classify(-0.05, t) === "mild");
+ check("moderate", classify(-0.15, t) === "moderate");
+ check("severe", classify(-0.25, t) === "severe");
+ check("extreme", classify(-0.50, t) === "extreme");
+
+ console.log("\n--- lookup (RULER hit) ---");
+ const r1 = await lookup("Llama-3.1-70B-Instruct");
+ check("ruler_hit code", r1.code === "ruler_hit");
+ check("longscore present", typeof r1.ruler_long_score?.avg_lc === "number");
+ check("verdict assigned", r1.verdict !== null);
+ check("base ~96", r1.ruler_long_score?.base > 95 && r1.ruler_long_score?.base < 97,
+   `got base=${r1.ruler_long_score?.base}`);
+ check("Llama-3.1-70B avg_lc ~-0.10", Math.abs(r1.ruler_long_score?.avg_lc - (-0.1024)) < 0.001,
+   `got ${r1.ruler_long_score?.avg_lc}`);
+
+ console.log("\n--- lookup (Jamba — best LongScore) ---");
+ const r2 = await lookup("Jamba-1.5-Large");
+ check("ruler_hit", r2.code === "ruler_hit");
+ check("Jamba near-zero degradation", r2.ruler_long_score?.avg_lc > -0.02);
+
+ console.log("\n--- lookup (dbrx — severe) ---");
+ const r3 = await lookup("dbrx");
+ check("ruler_hit", r3.code === "ruler_hit");
+ check("dbrx severe verdict", r3.verdict === "severe" || r3.verdict === "extreme",
+   `got verdict=${r3.verdict} for avg_lc=${r3.ruler_long_score?.avg_lc}`);
+
+ console.log("\n--- lookup (miss) ---");
+ const r4 = await lookup("nonexistent-model-123");
+ check("miss code", r4.code === "miss");
+ check("normalized id present", r4.normalized_id === "nonexistent-model-123");
+
+ console.log("\n--- rank ---");
+ const ranking = await rank("worst");
+ check("ranking returned", Array.isArray(ranking) && ranking.length > 0);
+ check("worst is most negative", ranking[0].avg_lc < ranking[ranking.length - 1].avg_lc);
+
+ console.log(`\n${pass} passed, ${fail} failed`);
+ process.exit(fail > 0 ? 1 : 0);
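
A possible extension, not in this commit: the same harness could also assert KB invariants, since `stats.n_total` is what the miss message reports and `classify()` dereferences four threshold keys. Sketch, reusing the mocked fetch and `check()` above:

  // Hypothetical extra checks — field names match what lookup()/classify() already read.
  const kb = await (await fetch("data/longscore_kb.json")).json();
  check("n_total matches entry count", kb.stats.n_total === Object.keys(kb.models).length);
  check("thresholds complete", ["no_degradation", "mild", "moderate", "severe"]
    .every(k => typeof kb.thresholds[k] === "number"));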
scripts/test_longscore_e2e.mjs ADDED
@@ -0,0 +1,34 @@
+ // E2E lookup smoke for the 3 example buttons (Jamba/Llama/dbrx) + a HELMET-only model.
+ import { readFileSync } from "fs";
+ globalThis.fetch = async (url) => {
+   const path = url.startsWith("data/") ? `./${url}` : url;
+   return { ok: true, json: async () => JSON.parse(readFileSync(path, "utf-8")) };
+ };
+
+ const { lookup } = await import("../js/longscore.js");
+
+ const cases = [
+   { input: "Jamba-1.5-Large", expect: { code: "ruler_hit", verdict: "no_degradation" } },
+   { input: "Llama-3.1-70B-Instruct", expect: { code: "ruler_hit", verdict: "moderate" } },
+   { input: "dbrx", expect: { code: "ruler_hit", verdict: "extreme" } },
+   { input: "GPT-4", expect: { code: "helmet_only" } }, // HELMET-only
+   { input: "totally-fake-model-xyz", expect: { code: "miss" } },
+ ];
+
+ let pass = 0, fail = 0;
+ for (const c of cases) {
+   const r = await lookup(c.input);
+   const ok = r.code === c.expect.code &&
+     (!c.expect.verdict || r.verdict === c.expect.verdict);
+   if (ok) {
+     pass++;
+     const score = r.ruler_long_score ? `LongScore=${(r.ruler_long_score.avg_lc * 100).toFixed(1)}%` :
+       r.helmet ? `HELMET overall=${r.helmet.overall}` : "";
+     // parenthesize (r.verdict || "n/a") so padEnd applies to whichever value is used,
+     // not just the fallback string
+     console.log(`  ✓ ${c.input.padEnd(30)} → ${r.code.padEnd(12)} ${(r.verdict || "n/a").padEnd(15)} ${score}`);
+   } else {
+     fail++;
+     console.log(`  ✗ ${c.input.padEnd(30)} → got code=${r.code} verdict=${r.verdict}, expected=${JSON.stringify(c.expect)}`);
+   }
+ }
+ console.log(`\n${pass}/${pass + fail} cases pass`);
+ process.exit(fail > 0 ? 1 : 0);