karlexmarin Claude Opus 4.7 (1M context) committed
Commit ebabb49 · 1 Parent(s): 6f608c8

v0.8.8 LongScore mode — anti-bullshit pack #14 + Hub badge readability fix


22nd mode: 🎯 LongScore. Look up any HF model id → see its relative
degradation past short context, sourced from RULER per-length scores plus
the HELMET aggregate.

Why: every model claims a 128K context window, but accuracy degrades long
before that. Raw long-ctx scores are dominated by base ability: a smarter
model with a worse long-ctx recipe still outscores a less-smart model with
a better one, hiding the actual degradation. The 100-LongBench paper
(ACL 2025, arXiv:2505.19293) proposed LongScore to disentangle base
ability from true long-ctx capability:
    Base = mean(S_4K, S_8K)
    LC_l = (S_l − Base) / Base
    LongScore = mean(LC_l for l in {16K, 32K, 64K, 128K})
0 = no degradation; -0.30 = severe.
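
A minimal JS sketch of that computation (mirrors the KB field names in the
file below; the function name and return shape are illustrative, not the
shipped API):

    // perCtx: RULER scores keyed by context length, e.g. { "4k": 96.5, ... }
    export function longScore(perCtx) {
      const base = (perCtx["4k"] + perCtx["8k"]) / 2;        // Base = mean(S_4K, S_8K)
      const lens = ["16k", "32k", "64k", "128k"];
      const lcs = lens.map(l => (perCtx[l] - base) / base);  // LC_l = (S_l - Base) / Base
      const avgLc = lcs.reduce((a, b) => a + b, 0) / lens.length;
      return { base, per_length_lc: Object.fromEntries(lens.map((l, i) => [l, lcs[i]])), avg_lc: avgLc };
    }

Fed the Llama-3.1-70B-Inst RULER row from the KB, avg_lc comes out ≈ -0.1024,
matching the entry below.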

Stage 0.5 verified (per PROTOCOL.md): no existing browser tool surfaces
LongScore for a given model id. tiktokenizer.vercel.app, llm-stats.com,
BenchLM.ai use raw RULER/needle scores. HELM Long Context aggregates 5
benchmarks but doesn't compute LongScore. Genuinely novel.

Files:
- data/longscore_kb.json (NEW, ~70 KB) — 93 unique models keyed by
canonical HF id. 35 with full LongScore (RULER per-length present).
Source: ruler_kb_day5.json + helmet_kb.json from large_model_validation/
Day-5 + Day-8 work. Built by build_longscore_kb_for_tafagent.py.
- js/longscore.js (NEW) — pure logic: loadKB, normalize, classify,
  lookup, listAllIds, rank. No UI strings. ES module. (Sketch of
  normalize + classify below, after this list.)
- index.html — new tab 🎯 LongScore between Token Tax and Solutions Hub.
New <section id="longscore-section"> with input + 3 example buttons +
output panels. Help modal v0.8.8 entry. Inventory v0.8.8 entry.
- js/main.js — initLongscore, renderLongscoreResult (per-length bars +
HELMET 7-task collapsible + verdict color band), runLongscoreLookup,
button wiring + Enter-key handler.
- js/i18n.js — 35 new keys × 4 langs (EN/ES/FR/ZH) = 140 keys total.
Includes mode label, tooltip, formula note, miss/hit/helmet_only state
text, verdict ladder (no_degradation → mild → moderate → severe →
extreme), help modal body.
- data/solutions_hub.json — new pain entry "long_ctx_degradation"
covered by 🎯 LongScore. Curated 6 external tools: 100-LongBench
paper, HELMET (repo + sheet), RULER, LongBench v2, Chroma context-rot.
- scripts/test_longscore.mjs (NEW) — 25 smoke tests for normalize,
classify, lookup. All pass.
- scripts/test_longscore_e2e.mjs (NEW) — 5 E2E lookup cases for the
3 example buttons + HELMET-only model + miss. All pass.
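
Sketch of the two pure helpers named above (bodies illustrative, inferred
from the KB key scheme and thresholds; the shipped code is js/longscore.js):

    // KB keys are the HF id lowercased, with non-alphanumerics and
    // letter/digit boundaries collapsed to hyphens.
    export function normalize(hfId) {
      return hfId.split("/").pop().toLowerCase()
        .replace(/[^a-z0-9]+/g, "-")
        .replace(/([a-z])(\d)/g, "$1-$2")
        .replace(/(\d)([a-z])/g, "$1-$2")
        .replace(/-+/g, "-").replace(/^-|-$/g, "");
    }
    // normalize("Qwen/Qwen2.5-7B-Instruct") -> "qwen-2-5-7-b-instruct"

    // Verdict ladder against the KB thresholds; >= boundary handling is an assumption.
    export function classify(avgLc) {
      if (avgLc == null) return "helmet_only";
      if (avgLc >= -0.02) return "no_degradation";
      if (avgLc >= -0.1) return "mild";
      if (avgLc >= -0.2) return "moderate";
      if (avgLc >= -0.3) return "severe";
      return "extreme";
    }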

Hub badge fix (separate but bundled, 1-line):
- The "covered by mode" badges in Solutions Hub had `color: var(--success)`
inherited from .badge default and were also given inline
`background:#3fb950` (also success green). Result: green text on green
background → invisible. Fixed by adding inline `color:#fff` (white) and
`border-color` to all 3 badge variants (covered, planned, external).
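
Hypothetical before/after for the covered variant (markup reconstructed for
illustration; only the color/border-color additions are from this commit):

    <!-- before: green text inherited from .badge, on green background -->
    <span class="badge" style="background:#3fb950">covered by 🎯 LongScore</span>
    <!-- after -->
    <span class="badge" style="background:#3fb950; color:#fff; border-color:#3fb950">covered by 🎯 LongScore</span>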

Verification:
- node scripts/test_longscore.mjs → 25/25 pass
- node scripts/test_longscore_e2e.mjs → 5/5 pass
- python -m http.server + curl all assets → HTTP 200
- KB sanity: Llama-3.1-70B-Inst LongScore = -0.1024 matches Day-8
manual computation exactly.
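
That check, restated from the KB's RULER row for that model:
    Base = (96.5 + 95.8) / 2 = 96.15
    LC_16K  = (95.4 − 96.15) / 96.15 = -0.0078
    LC_32K  = (94.8 − 96.15) / 96.15 = -0.0140
    LC_64K  = (88.4 − 96.15) / 96.15 = -0.0806
    LC_128K = (66.6 − 96.15) / 96.15 = -0.3073
    LongScore = mean of the four = -0.1024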

Live HF Space verification PENDING (requires push to origin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

data/longscore_kb.json ADDED
@@ -0,0 +1,2941 @@
+ {
+   "version": "v0.8.8-longscore-2026-05-08",
+   "metric": "LongScore (100-LongBench, ACL 2025, arXiv:2505.19293, §3.2)",
+   "metric_formula": "Base = mean(S_4K, S_8K); LC_l = (S_l - Base) / Base; LongScore = mean(LC_l for l in {16K, 32K, 64K, 128K})",
+   "metric_interpretation": {
+     "0.0": "no degradation past short context",
+     "-0.05": "mild degradation (~5% relative drop)",
+     "-0.15": "moderate degradation (~15% relative drop)",
+     "-0.30": "severe degradation (~30% relative drop)",
+     "negative": "more negative = worse long-ctx retention"
+   },
+   "thresholds": { "no_degradation": -0.02, "mild": -0.1, "moderate": -0.2, "severe": -0.3 },
+   "sources": {
+     "ruler": "NVIDIA RULER leaderboard + Qwen2.5 Tech Report Table 16 (n=35)",
+     "helmet": "HELMET Google Sheet (princeton-nlp; arXiv:2410.02694; n=63)"
+   },
+   "stats": { "n_total": 93, "n_ruler_only": 30, "n_helmet_only": 58, "n_both": 5, "n_with_longscore": 35 },
+   "models": {
+     "qwen-2-5-7-b-instruct": {
+       "display_name": "qwen2.5-7b-instruct",
+       "ruler_per_ctx": { "4k": 96.7, "8k": 95.1, "16k": 93.7, "32k": 89.4, "64k": 74.5, "128k": 31.4 },
+       "ruler_long_score": { "base": 95.9, "per_length_lc": { "16k": -0.0229, "32k": -0.0678, "64k": -0.2231, "128k": -0.6726 }, "avg_lc": -0.2466 },
+       "helmet": { "overall": 22.8, "categories": { "Recall": 11.4, "RAG": 30.6, "Cite": 3.1, "Re-rank": 1.6, "ICL": 72.0, "LongQA": 21.9, "Summ": 18.8 }, "context_window_k": 128.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": "native", "params_b": 7, "native_context_k": 32, "source": "ruler+helmet"
+     },
+     "qwen-2-5-14-b-instruct": {
+       "display_name": "qwen2.5-14b-instruct",
+       "ruler_per_ctx": { "4k": 96.9, "8k": 97.1, "16k": 95.5, "32k": 95.5, "64k": 90.3, "128k": 82.0 },
+       "ruler_long_score": { "base": 97.0, "per_length_lc": { "16k": -0.0155, "32k": -0.0155, "64k": -0.0691, "128k": -0.1546 }, "avg_lc": -0.0637 },
+       "helmet": null, "recipe_class": "native", "params_b": 14, "native_context_k": 32, "source": "ruler"
+     },
+     "qwen-2-5-32-b-instruct": {
+       "display_name": "qwen2.5-32b-instruct",
+       "ruler_per_ctx": { "4k": 97.7, "8k": 97.2, "16k": 97.7, "32k": 96.5, "64k": 88.5, "128k": 67.0 },
+       "ruler_long_score": { "base": 97.45, "per_length_lc": { "16k": 0.0026, "32k": -0.0097, "64k": -0.0918, "128k": -0.3125 }, "avg_lc": -0.1028 },
+       "helmet": null, "recipe_class": "native", "params_b": 32, "native_context_k": 32, "source": "ruler"
+     },
+     "mistral-7-b-v-0-2": {
+       "display_name": "mistral-7b-v0.2",
+       "ruler_per_ctx": { "4k": 93.6, "8k": 91.2, "16k": 87.2, "32k": 75.4, "64k": 49.0, "128k": 13.8 },
+       "ruler_long_score": { "base": 92.4, "per_length_lc": { "16k": -0.0563, "32k": -0.184, "64k": -0.4697, "128k": -0.8506 }, "avg_lc": -0.3901 },
+       "helmet": null, "recipe_class": "native", "params_b": 7, "native_context_k": 32, "source": "ruler"
+     },
+     "mixtral-8-x-7-b": {
+       "display_name": "mixtral-8x7b",
+       "ruler_per_ctx": { "4k": 94.9, "8k": 92.1, "16k": 92.5, "32k": 85.9, "64k": 72.4, "128k": 44.5 },
+       "ruler_long_score": { "base": 93.5, "per_length_lc": { "16k": -0.0107, "32k": -0.0813, "64k": -0.2257, "128k": -0.5241 }, "avg_lc": -0.2104 },
+       "helmet": null, "recipe_class": "native", "params_b": 47, "native_context_k": 32, "source": "ruler"
+     },
+     "mixtral-8-x-22-b-instruct": {
+       "display_name": "mixtral-8x22b-instruct",
+       "ruler_per_ctx": { "4k": 95.6, "8k": 94.9, "16k": 93.4, "32k": 90.9, "64k": 84.7, "128k": 31.7 },
+       "ruler_long_score": { "base": 95.25, "per_length_lc": { "16k": -0.0194, "32k": -0.0457, "64k": -0.1108, "128k": -0.6672 }, "avg_lc": -0.2108 },
+       "helmet": null, "recipe_class": "native", "params_b": 141, "native_context_k": 64, "source": "ruler"
+     },
+     "qwen-2-72-b": {
+       "display_name": "qwen2-72b",
+       "ruler_per_ctx": { "4k": 96.9, "8k": 96.1, "16k": 94.9, "32k": 94.1, "64k": 79.8, "128k": 53.7 },
+       "ruler_long_score": { "base": 96.5, "per_length_lc": { "16k": -0.0166, "32k": -0.0249, "64k": -0.1731, "128k": -0.4435 }, "avg_lc": -0.1645 },
+       "helmet": null, "recipe_class": "native", "params_b": 72, "native_context_k": 128, "source": "ruler"
+     },
+     "dbrx": {
+       "display_name": "dbrx",
+       "ruler_per_ctx": { "4k": 95.1, "8k": 93.8, "16k": 83.6, "32k": 63.1, "64k": 2.4, "128k": 0.0 },
+       "ruler_long_score": { "base": 94.45, "per_length_lc": { "16k": -0.1149, "32k": -0.3319, "64k": -0.9746, "128k": -1.0 }, "avg_lc": -0.6054 },
+       "helmet": null, "recipe_class": "native", "params_b": 132, "native_context_k": 32, "source": "ruler"
+     },
+     "qwen-1-5-72-b": {
+       "display_name": "qwen1.5-72b",
+       "ruler_per_ctx": { "4k": 94.9, "8k": 93.8, "16k": 78.0, "32k": 67.8, "64k": 0.0, "128k": 0.0 },
+       "ruler_long_score": { "base": 94.35, "per_length_lc": { "16k": -0.1733, "32k": -0.2814, "64k": -1.0, "128k": -1.0 }, "avg_lc": -0.6137 },
+       "helmet": null, "recipe_class": "native", "params_b": 72, "native_context_k": 32, "source": "ruler"
+     },
+     "mistral-large-2407": {
+       "display_name": "mistral-large-2407",
+       "ruler_per_ctx": { "4k": 96.2, "8k": 96.1, "16k": 95.1, "32k": 93.0, "64k": 78.8, "128k": 23.7 },
+       "ruler_long_score": { "base": 96.15, "per_length_lc": { "16k": -0.0109, "32k": -0.0328, "64k": -0.1804, "128k": -0.7535 }, "avg_lc": -0.2444 },
+       "helmet": null, "recipe_class": "native", "params_b": 123, "native_context_k": 32, "source": "ruler"
+     },
+     "yi-34-b-200-k": {
+       "display_name": "yi-34b-200k",
+       "ruler_per_ctx": { "4k": 93.3, "8k": 92.2, "16k": 91.3, "32k": 87.5, "64k": 83.2, "128k": 77.3 },
+       "ruler_long_score": { "base": 92.75, "per_length_lc": { "16k": -0.0156, "32k": -0.0566, "64k": -0.103, "128k": -0.1666 }, "avg_lc": -0.0854 },
+       "helmet": { "overall": 40.3, "categories": { "Recall": 74.1, "RAG": 59.5, "Cite": 1.7, "Re-rank": 20.5, "ICL": 85.0, "LongQA": 26.6, "Summ": 14.3 }, "context_window_k": 200.0, "params_b": "34.0", "type": "♭" },
+       "recipe_class": "engineered_long_ctx", "params_b": 34, "native_context_k": 4, "source": "ruler+helmet"
+     },
+     "command-r-35-b": {
+       "display_name": "command-r-35b",
+       "ruler_per_ctx": { "4k": 93.8, "8k": 93.3, "16k": 92.4, "32k": 89.5, "64k": 84.9, "128k": 76.0 },
+       "ruler_long_score": { "base": 93.55, "per_length_lc": { "16k": -0.0123, "32k": -0.0433, "64k": -0.0925, "128k": -0.1876 }, "avg_lc": -0.0839 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 35, "native_context_k": 8, "source": "ruler"
+     },
+     "command-r-0824-32-b": {
+       "display_name": "command-r-0824-32b",
+       "ruler_per_ctx": { "4k": 94.7, "8k": 93.7, "16k": 93.1, "32k": 90.8, "64k": 86.6, "128k": 74.7 },
+       "ruler_long_score": { "base": 94.2, "per_length_lc": { "16k": -0.0117, "32k": -0.0361, "64k": -0.0807, "128k": -0.207 }, "avg_lc": -0.0839 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 32, "native_context_k": 8, "source": "ruler"
+     },
+     "command-r-plus-104-b": {
+       "display_name": "command-r-plus-104b",
+       "ruler_per_ctx": { "4k": 95.6, "8k": 95.2, "16k": 94.2, "32k": 92.0, "64k": 84.3, "128k": 63.1 },
+       "ruler_long_score": { "base": 95.4, "per_length_lc": { "16k": -0.0126, "32k": -0.0356, "64k": -0.1164, "128k": -0.3386 }, "avg_lc": -0.1258 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 104, "native_context_k": 8, "source": "ruler"
+     },
+     "command-r-plus-0824-104-b": {
+       "display_name": "command-r-plus-0824-104b",
+       "ruler_per_ctx": { "4k": 96.0, "8k": 95.1, "16k": 94.0, "32k": 92.4, "64k": 85.4, "128k": 64.6 },
+       "ruler_long_score": { "base": 95.55, "per_length_lc": { "16k": -0.0162, "32k": -0.033, "64k": -0.1062, "128k": -0.3239 }, "avg_lc": -0.1198 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 104, "native_context_k": 8, "source": "ruler"
+     },
+     "mistral-large-2411-123-b": {
+       "display_name": "mistral-large-2411-123b",
+       "ruler_per_ctx": { "4k": 96.4, "8k": 96.3, "16k": 95.3, "32k": 94.0, "64k": 85.9, "128k": 48.1 },
+       "ruler_long_score": { "base": 96.35, "per_length_lc": { "16k": -0.0109, "32k": -0.0244, "64k": -0.1085, "128k": -0.5008 }, "avg_lc": -0.1612 },
+       "helmet": null, "recipe_class": "engineered_long_ctx", "params_b": 123, "native_context_k": 8, "source": "ruler"
+     },
+     "chatglm-3-6-b-128-k": {
+       "display_name": "chatglm3-6b-128k",
+       "ruler_per_ctx": { "4k": 87.8, "8k": 83.4, "16k": 78.6, "32k": 69.9, "64k": 56.0, "128k": 42.0 },
+       "ruler_long_score": { "base": 85.6, "per_length_lc": { "16k": -0.0818, "32k": -0.1834, "64k": -0.3458, "128k": -0.5093 }, "avg_lc": -0.2801 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 6, "native_context_k": 2, "source": "ruler"
+     },
+     "lwm-7-b": {
+       "display_name": "lwm-7b",
+       "ruler_per_ctx": { "4k": 82.3, "8k": 78.4, "16k": 73.7, "32k": 69.1, "64k": 68.1, "128k": 65.0 },
+       "ruler_long_score": { "base": 80.35, "per_length_lc": { "16k": -0.0828, "32k": -0.14, "64k": -0.1525, "128k": -0.191 }, "avg_lc": -0.1416 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 7, "native_context_k": 4, "source": "ruler"
+     },
+     "glm-4-9-b-1-m": {
+       "display_name": "glm4-9b-1m",
+       "ruler_per_ctx": { "4k": 94.7, "8k": 92.8, "16k": 92.1, "32k": 89.9, "64k": 86.7, "128k": 83.1 },
+       "ruler_long_score": { "base": 93.75, "per_length_lc": { "16k": -0.0176, "32k": -0.0411, "64k": -0.0752, "128k": -0.1136 }, "avg_lc": -0.0619 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 9, "native_context_k": 8, "source": "ruler"
+     },
+     "prolong-8-b-512-k": {
+       "display_name": "prolong-8b-512k",
+       "ruler_per_ctx": { "4k": 94.5, "8k": 92.5, "16k": 92.3, "32k": 89.3, "64k": 83.2, "128k": 81.6 },
+       "ruler_long_score": { "base": 93.5, "per_length_lc": { "16k": -0.0128, "32k": -0.0449, "64k": -0.1102, "128k": -0.1273 }, "avg_lc": -0.0738 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 8, "native_context_k": 8, "source": "ruler"
+     },
+     "megabeam-mistral-7-b-512-k": {
+       "display_name": "megabeam-mistral-7b-512k",
+       "ruler_per_ctx": { "4k": 93.8, "8k": 92.5, "16k": 92.0, "32k": 89.2, "64k": 83.7, "128k": 83.7 },
+       "ruler_long_score": { "base": 93.15, "per_length_lc": { "16k": -0.0123, "32k": -0.0424, "64k": -0.1014, "128k": -0.1014 }, "avg_lc": -0.0644 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 7, "native_context_k": 32, "source": "ruler"
+     },
+     "internlm-2-5-7-b-1-m": {
+       "display_name": "internlm2.5-7b-1m",
+       "ruler_per_ctx": { "4k": 88.1, "8k": 85.5, "16k": 84.5, "32k": 82.7, "64k": 75.5, "128k": 68.9 },
+       "ruler_long_score": { "base": 86.8, "per_length_lc": { "16k": -0.0265, "32k": -0.0472, "64k": -0.1302, "128k": -0.2062 }, "avg_lc": -0.1025 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 7, "native_context_k": 4, "source": "ruler"
+     },
+     "gradientai-llama-3-70-b-1-m": {
+       "display_name": "gradientai-llama3-70b-1m",
+       "ruler_per_ctx": { "4k": 95.1, "8k": 94.4, "16k": 90.8, "32k": 85.4, "64k": 80.9, "128k": 72.1 },
+       "ruler_long_score": { "base": 94.75, "per_length_lc": { "16k": -0.0417, "32k": -0.0987, "64k": -0.1462, "128k": -0.2391 }, "avg_lc": -0.1314 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 70, "native_context_k": 8, "source": "ruler"
+     },
+     "gradientai-llama-3-8-b-1-m": {
+       "display_name": "gradientai-llama3-8b-1m",
+       "ruler_per_ctx": { "4k": 92.8, "8k": 90.3, "16k": 85.7, "32k": 79.9, "64k": 76.3, "128k": 69.5 },
+       "ruler_long_score": { "base": 91.55, "per_length_lc": { "16k": -0.0639, "32k": -0.1273, "64k": -0.1666, "128k": -0.2409 }, "avg_lc": -0.1497 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 8, "native_context_k": 8, "source": "ruler"
+     },
+     "longchat-7-b-32-k": {
+       "display_name": "longchat-7b-32k",
+       "ruler_per_ctx": { "4k": 84.7, "8k": 79.9, "16k": 70.8, "32k": 59.3, "64k": 0.0, "128k": 0.0 },
+       "ruler_long_score": { "base": 82.3, "per_length_lc": { "16k": -0.1397, "32k": -0.2795, "64k": -1.0, "128k": -1.0 }, "avg_lc": -0.6048 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 7, "native_context_k": 4, "source": "ruler"
+     },
+     "longalpaca-13-b-32-k": {
+       "display_name": "longalpaca-13b-32k",
+       "ruler_per_ctx": { "4k": 60.6, "8k": 57.0, "16k": 56.6, "32k": 43.6, "64k": 0.0, "128k": 0.0 },
+       "ruler_long_score": { "base": 58.8, "per_length_lc": { "16k": -0.0374, "32k": -0.2585, "64k": -1.0, "128k": -1.0 }, "avg_lc": -0.574 },
+       "helmet": null, "recipe_class": "aggressive_ext_small_base", "params_b": 13, "native_context_k": 4, "source": "ruler"
+     },
+     "llama-3-1-70-b-instruct": {
+       "display_name": "llama3.1-70b-instruct",
+       "ruler_per_ctx": { "4k": 96.5, "8k": 95.8, "16k": 95.4, "32k": 94.8, "64k": 88.4, "128k": 66.6 },
+       "ruler_long_score": { "base": 96.15, "per_length_lc": { "16k": -0.0078, "32k": -0.014, "64k": -0.0806, "128k": -0.3073 }, "avg_lc": -0.1024 },
+       "helmet": { "overall": 49.7, "categories": { "Recall": 90.7, "RAG": 56.2, "Cite": 7.5, "Re-rank": 24.5, "ICL": 81.4, "LongQA": 56.3, "Summ": 31.6 }, "context_window_k": 128.0, "params_b": "70.0", "type": "♯" },
+       "recipe_class": "continued_pretrain_cliff", "params_b": 70, "native_context_k": 8, "source": "ruler+helmet"
+     },
+     "llama-3-1-8-b-instruct": {
+       "display_name": "llama3.1-8b-instruct",
+       "ruler_per_ctx": { "4k": 95.5, "8k": 93.8, "16k": 91.6, "32k": 87.4, "64k": 84.7, "128k": 77.0 },
+       "ruler_long_score": { "base": 94.65, "per_length_lc": { "16k": -0.0322, "32k": -0.0766, "64k": -0.1051, "128k": -0.1865 }, "avg_lc": -0.1001 },
+       "helmet": { "overall": 46.5, "categories": { "Recall": 95.2, "RAG": 59.5, "Cite": 2.9, "Re-rank": 14.0, "ICL": 83.9, "LongQA": 43.2, "Summ": 27.0 }, "context_window_k": 128.0, "params_b": "8.0", "type": "♯" },
+       "recipe_class": "continued_pretrain_cliff", "params_b": 8, "native_context_k": 8, "source": "ruler+helmet"
+     },
+     "mistral-nemo-12-b": {
+       "display_name": "mistral-nemo-12b",
+       "ruler_per_ctx": { "4k": 87.8, "8k": 87.2, "16k": 87.7, "32k": 69.0, "64k": 46.8, "128k": 19.0 },
+       "ruler_long_score": { "base": 87.5, "per_length_lc": { "16k": 0.0023, "32k": -0.2114, "64k": -0.4651, "128k": -0.7829 }, "avg_lc": -0.3643 },
+       "helmet": null, "recipe_class": "continued_pretrain_cliff", "params_b": 12, "native_context_k": 8, "source": "ruler"
+     },
+     "qwen-2-5-7-b-instruct-1-m": {
+       "display_name": "qwen2.5-7b-instruct-1m",
+       "ruler_per_ctx": { "4k": 96.8, "8k": 95.3, "16k": 93.0, "32k": 91.1, "64k": 90.4, "128k": 84.4 },
+       "ruler_long_score": { "base": 96.05, "per_length_lc": { "16k": -0.0318, "32k": -0.0515, "64k": -0.0588, "128k": -0.1213 }, "avg_lc": -0.0659 },
+       "helmet": null, "recipe_class": "dca_ssm_flat", "params_b": 7, "native_context_k": 32, "source": "ruler"
+     },
+     "qwen-2-5-14-b-instruct-1-m": {
+       "display_name": "qwen2.5-14b-instruct-1m",
+       "ruler_per_ctx": { "4k": 97.5, "8k": 97.1, "16k": 94.6, "32k": 94.9, "64k": 94.9, "128k": 92.2 },
+       "ruler_long_score": { "base": 97.3, "per_length_lc": { "16k": -0.0277, "32k": -0.0247, "64k": -0.0247, "128k": -0.0524 }, "avg_lc": -0.0324 },
+       "helmet": null, "recipe_class": "dca_ssm_flat", "params_b": 14, "native_context_k": 32, "source": "ruler"
+     },
+     "jamba-1-5-large": {
+       "display_name": "jamba-1.5-large",
+       "ruler_per_ctx": { "4k": 96.7, "8k": 96.6, "16k": 96.4, "32k": 96.0, "64k": 95.4, "128k": 95.1 },
+       "ruler_long_score": { "base": 96.65, "per_length_lc": { "16k": -0.0026, "32k": -0.0067, "64k": -0.0129, "128k": -0.016 }, "avg_lc": -0.0095 },
+       "helmet": null, "recipe_class": "dca_ssm_flat", "params_b": 398, "native_context_k": 256, "source": "ruler"
+     },
+     "jamba-1-5-mini": {
+       "display_name": "jamba-1.5-mini",
+       "ruler_per_ctx": { "4k": 95.6, "8k": 95.6, "16k": 94.8, "32k": 94.6, "64k": 92.8, "128k": 90.0 },
+       "ruler_long_score": { "base": 95.6, "per_length_lc": { "16k": -0.0084, "32k": -0.0105, "64k": -0.0293, "128k": -0.0586 }, "avg_lc": -0.0267 },
+       "helmet": { "overall": 46.9, "categories": { "Recall": 90.0, "RAG": 57.3, "Cite": 3.1, "Re-rank": 14.6, "ICL": 91.0, "LongQA": 54.2, "Summ": 18.1 }, "context_window_k": 256.0, "params_b": "12.0", "type": "♯" },
+       "recipe_class": "dca_ssm_flat", "params_b": 52, "native_context_k": 256, "source": "ruler+helmet"
+     },
+     "phi-3-mini-3-8-b": {
+       "display_name": "phi3-mini-3.8b",
+       "ruler_per_ctx": { "4k": 92.2, "8k": 91.5, "16k": 90.7, "32k": 87.5, "64k": 80.6, "128k": 66.7 },
+       "ruler_long_score": { "base": 91.85, "per_length_lc": { "16k": -0.0125, "32k": -0.0474, "64k": -0.1225, "128k": -0.2738 }, "avg_lc": -0.114 },
+       "helmet": null, "recipe_class": "longrope", "params_b": 3.8, "native_context_k": 4, "source": "ruler"
+     },
+     "phi-3-medium-14-b": {
+       "display_name": "phi3-medium-14b",
+       "ruler_per_ctx": { "4k": 93.3, "8k": 93.2, "16k": 91.1, "32k": 86.8, "64k": 78.6, "128k": 46.1 },
+       "ruler_long_score": { "base": 93.25, "per_length_lc": { "16k": -0.0231, "32k": -0.0692, "64k": -0.1571, "128k": -0.5056 }, "avg_lc": -0.1888 },
+       "helmet": null, "recipe_class": "longrope", "params_b": 14, "native_context_k": 4, "source": "ruler"
+     },
+     "gpt-4": {
+       "display_name": "GPT-4", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 39.2, "categories": { "Recall": 73.5, "RAG": 64.7, "Cite": 1.2, "Re-rank": 9.6, "ICL": 46.0, "LongQA": 46.7, "Summ": 33.0 }, "context_window_k": 128.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gpt-4-o-mini": {
+       "display_name": "GPT-4o-mini", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 55.7, "categories": { "Recall": 89.6, "RAG": 68.0, "Cite": 24.3, "Re-rank": 31.2, "ICL": 81.2, "LongQA": 54.8, "Summ": 40.8 }, "context_window_k": 128.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gpt-4-o-05": {
+       "display_name": "GPT-4o-05", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 56.4, "categories": { "Recall": 82.4, "RAG": 70.4, "Cite": 40.4, "Re-rank": 47.7, "ICL": 48.4, "LongQA": 62.0, "Summ": 43.6 }, "context_window_k": 128.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gpt-4-o-08": {
+       "display_name": "GPT-4o-08", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 64.8, "categories": { "Recall": 99.9, "RAG": 70.2, "Cite": 44.3, "Re-rank": 50.0, "ICL": 86.3, "LongQA": 59.3, "Summ": 43.2 }, "context_window_k": 128.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "claude-3-5-sonnet": {
+       "display_name": "Claude-3.5-Sonnet", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 38.4, "categories": { "Recall": 94.7, "RAG": 38.1, "Cite": 18.7, "Re-rank": 7.2, "ICL": 61.0, "LongQA": 12.6, "Summ": 36.6 }, "context_window_k": 200.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gemini-1-5-flash": {
+       "display_name": "Gemini-1.5-Flash", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 52.1, "categories": { "Recall": 91.2, "RAG": 66.5, "Cite": 31.7, "Re-rank": 50.8, "ICL": 28.1, "LongQA": 57.3, "Summ": 39.4 }, "context_window_k": 1024.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "gemini-1-5-pro": {
+       "display_name": "Gemini-1.5-Pro", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 64.4, "categories": { "Recall": 91.0, "RAG": 71.1, "Cite": 43.6, "Re-rank": 59.7, "ICL": 79.4, "LongQA": 59.6, "Summ": 46.4 }, "context_window_k": 2048.0, "params_b": "?", "type": "♯" },
+       "recipe_class": null, "params_b": "?", "native_context_k": null, "source": "helmet"
+     },
+     "llama-2-7-b-32-k": {
+       "display_name": "Llama-2-7B-32k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.3, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 9.4, "Summ": 0.0 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-2-7-b-32-k-instruct": {
+       "display_name": "Llama-2-7B-32k-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.3, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 9.1, "Summ": 0.0 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-7-b-80-k": {
+       "display_name": "Llama-7B-80k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 22.4, "categories": { "Recall": 17.0, "RAG": 40.8, "Cite": 1.0, "Re-rank": 0.2, "ICL": 76.8, "LongQA": 19.8, "Summ": 0.9 }, "context_window_k": 80.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "yarn-llama-2-7-b-64-k": {
+       "display_name": "Yarn-Llama-2-7B-64k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 0.6, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.2, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 3.8, "Summ": 0.2 }, "context_window_k": 64.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "yarn-llama-2-7-b-128-k": {
+       "display_name": "Yarn-Llama-2-7B-128k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 17.9, "categories": { "Recall": 4.3, "RAG": 26.7, "Cite": 0.7, "Re-rank": 0.4, "ICL": 75.4, "LongQA": 16.1, "Summ": 1.9 }, "context_window_k": 128.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-8-b": {
+       "display_name": "Llama-3-8B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.5, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 10.5, "Summ": 0.0 }, "context_window_k": 8.0, "params_b": "8.0", "type": "♭" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-8-b-instruct": {
+       "display_name": "Llama-3-8B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.2, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 8.7, "Summ": 0.0 }, "context_window_k": 8.0, "params_b": "8.0", "type": "♯" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-8-b-θ": {
+       "display_name": "Llama-3-8B-θ", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 0.9, "categories": { "Recall": 0.0, "RAG": 2.4, "Cite": 0.1, "Re-rank": 0.0, "ICL": 2.8, "LongQA": 0.8, "Summ": 0.0 }, "context_window_k": 8.0, "params_b": "8.0", "type": "♭" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-8-b-instruct-θ": {
+       "display_name": "Llama-3-8B-Inst-θ", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.6, "categories": { "Recall": 0.0, "RAG": 1.3, "Cite": 0.8, "Re-rank": 0.1, "ICL": 4.3, "LongQA": 2.5, "Summ": 1.9 }, "context_window_k": 8.0, "params_b": "8.0", "type": "♯" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-70-b-θ": {
+       "display_name": "Llama-3-70B-θ", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 1.2, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 2.9, "Re-rank": 0.0, "ICL": 0.2, "LongQA": 5.4, "Summ": 0.1 }, "context_window_k": 8.0, "params_b": "70.0", "type": "♭" },
+       "recipe_class": null, "params_b": "70.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-70-b-instruct-θ": {
+       "display_name": "Llama-3-70B-Inst-θ", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 2.1, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 0.0, "LongQA": 12.0, "Summ": 2.7 }, "context_window_k": 8.0, "params_b": "70.0", "type": "♯" },
+       "recipe_class": null, "params_b": "70.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-1-8-b": {
+       "display_name": "Llama-3.1-8B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 37.1, "categories": { "Recall": 76.6, "RAG": 54.5, "Cite": 1.6, "Re-rank": 9.3, "ICL": 79.5, "LongQA": 36.2, "Summ": 1.7 }, "context_window_k": 128.0, "params_b": "8.0", "type": "♭" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-1-70-b": {
+       "display_name": "Llama-3.1-70B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 41.7, "categories": { "Recall": 76.3, "RAG": 56.6, "Cite": 3.9, "Re-rank": 22.6, "ICL": 80.4, "LongQA": 46.0, "Summ": 6.3 }, "context_window_k": 128.0, "params_b": "70.0", "type": "♭" },
+       "recipe_class": null, "params_b": "70.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-3-70-b-instruct": {
+       "display_name": "Llama-3.3-70B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 48.2, "categories": { "Recall": 81.8, "RAG": 55.1, "Cite": 10.9, "Re-rank": 26.2, "ICL": 77.8, "LongQA": 52.0, "Summ": 33.3 }, "context_window_k": 128.0, "params_b": "70.0", "type": "♯" },
+       "recipe_class": null, "params_b": "70.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-2-1-b": {
+       "display_name": "Llama-3.2-1B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 22.6, "categories": { "Recall": 25.2, "RAG": 32.1, "Cite": 1.5, "Re-rank": 5.6, "ICL": 76.9, "LongQA": 15.9, "Summ": 0.9 }, "context_window_k": 128.0, "params_b": "1.0", "type": "♭" },
+       "recipe_class": null, "params_b": "1.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-2-1-b-instruct": {
+       "display_name": "Llama-3.2-1B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 25.0, "categories": { "Recall": 32.4, "RAG": 33.8, "Cite": 2.4, "Re-rank": 1.7, "ICL": 77.3, "LongQA": 18.1, "Summ": 9.1 }, "context_window_k": 128.0, "params_b": "1.0", "type": "♯" },
+       "recipe_class": null, "params_b": "1.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-2-3-b": {
+       "display_name": "Llama-3.2-3B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 33.5, "categories": { "Recall": 58.8, "RAG": 51.4, "Cite": 2.1, "Re-rank": 5.4, "ICL": 89.5, "LongQA": 27.6, "Summ": 0.0 }, "context_window_k": 128.0, "params_b": "3.0", "type": "♭" },
+       "recipe_class": null, "params_b": "3.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-3-2-3-b-instruct": {
+       "display_name": "Llama-3.2-3B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 37.0, "categories": { "Recall": 53.1, "RAG": 56.4, "Cite": 7.4, "Re-rank": 0.7, "ICL": 86.4, "LongQA": 28.9, "Summ": 25.9 }, "context_window_k": 128.0, "params_b": "3.0", "type": "♯" },
+       "recipe_class": null, "params_b": "3.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-4-17-b": {
+       "display_name": "Llama-4-17B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 31.4, "categories": { "Recall": 42.8, "RAG": 50.4, "Cite": 3.7, "Re-rank": 6.1, "ICL": 88.0, "LongQA": 28.1, "Summ": 0.6 }, "context_window_k": 128.0, "params_b": "17.0", "type": "♭" },
+       "recipe_class": null, "params_b": "17.0", "native_context_k": null, "source": "helmet"
+     },
+     "llama-4-17-b-instruct": {
+       "display_name": "Llama-4-17B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 45.0, "categories": { "Recall": 76.6, "RAG": 51.2, "Cite": 4.7, "Re-rank": 17.9, "ICL": 81.0, "LongQA": 50.5, "Summ": 33.2 }, "context_window_k": 128.0, "params_b": "17.0", "type": "♯" },
+       "recipe_class": null, "params_b": "17.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-7-b-instruct-v-0-1": {
+       "display_name": "Mistral-7B-Inst-v0.1", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 15.8, "categories": { "Recall": 3.8, "RAG": 36.6, "Cite": 1.3, "Re-rank": 0.0, "ICL": 44.8, "LongQA": 15.2, "Summ": 9.2 }, "context_window_k": 8.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-7-b-instruct-v-0-2": {
+       "display_name": "Mistral-7B-Inst-v0.2", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 8.7, "categories": { "Recall": 1.5, "RAG": 11.4, "Cite": 0.6, "Re-rank": 1.2, "ICL": 38.0, "LongQA": 4.1, "Summ": 4.3 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-7-b-v-0-3": {
+       "display_name": "Mistral-7B-v0.3", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 10.8, "categories": { "Recall": 3.6, "RAG": 4.3, "Cite": 0.4, "Re-rank": 1.2, "ICL": 54.8, "LongQA": 10.7, "Summ": 0.4 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-7-b-instruct-v-0-3": {
+       "display_name": "Mistral-7B-Inst-v0.3", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 17.3, "categories": { "Recall": 8.0, "RAG": 21.3, "Cite": 0.9, "Re-rank": 0.5, "ICL": 73.6, "LongQA": 12.0, "Summ": 5.1 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "ministral-8-b-instruct": {
+       "display_name": "Ministral-8B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 24.2, "categories": { "Recall": 12.8, "RAG": 35.3, "Cite": 0.4, "Re-rank": 0.0, "ICL": 84.4, "LongQA": 25.0, "Summ": 11.5 }, "context_window_k": 128.0, "params_b": "8.0", "type": "♯" },
+       "recipe_class": null, "params_b": "8.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-nemo": {
+       "display_name": "Mistral-Nemo", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 25.0, "categories": { "Recall": 17.7, "RAG": 45.9, "Cite": 1.0, "Re-rank": 0.0, "ICL": 82.4, "LongQA": 26.2, "Summ": 1.7 }, "context_window_k": 128.0, "params_b": "12.0", "type": "♭" },
+       "recipe_class": null, "params_b": "12.0", "native_context_k": null, "source": "helmet"
+     },
+     "mistral-nemo-instruct": {
+       "display_name": "Mistral-Nemo-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 25.7, "categories": { "Recall": 14.6, "RAG": 40.0, "Cite": 0.5, "Re-rank": 0.0, "ICL": 84.0, "LongQA": 22.5, "Summ": 18.5 }, "context_window_k": 128.0, "params_b": "12.0", "type": "♯" },
+       "recipe_class": null, "params_b": "12.0", "native_context_k": null, "source": "helmet"
+     },
+     "megabeam-mistral": {
+       "display_name": "MegaBeam-Mistral", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 45.4, "categories": { "Recall": 89.6, "RAG": 57.0, "Cite": 4.0, "Re-rank": 14.7, "ICL": 86.2, "LongQA": 37.3, "Summ": 28.9 }, "context_window_k": 512.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "yi-6-b-200-k": {
+       "display_name": "Yi-6B-200k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 27.8, "categories": { "Recall": 37.4, "RAG": 36.9, "Cite": 1.1, "Re-rank": 3.4, "ICL": 82.3, "LongQA": 30.0, "Summ": 3.5 }, "context_window_k": 200.0, "params_b": "6.0", "type": "♭" },
+       "recipe_class": null, "params_b": "6.0", "native_context_k": null, "source": "helmet"
+     },
+     "yi-9-b-200-k": {
+       "display_name": "Yi-9B-200k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 33.9, "categories": { "Recall": 56.2, "RAG": 50.5, "Cite": 3.3, "Re-rank": 8.3, "ICL": 73.4, "LongQA": 39.3, "Summ": 6.2 }, "context_window_k": 200.0, "params_b": "9.0", "type": "♭" },
+       "recipe_class": null, "params_b": "9.0", "native_context_k": null, "source": "helmet"
+     },
+     "yi-1-5-9-b-32-k": {
+       "display_name": "Yi-1.5-9B-32k", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 2.1, "categories": { "Recall": 0.0, "RAG": 0.0, "Cite": 0.0, "Re-rank": 0.0, "ICL": 4.2, "LongQA": 7.1, "Summ": 3.2 }, "context_window_k": 32.0, "params_b": "9.0", "type": "♭" },
+       "recipe_class": null, "params_b": "9.0", "native_context_k": null, "source": "helmet"
+     },
+     "phi-3-mini-128-k-instruct": {
+       "display_name": "Phi-3-mini-128k-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 33.6, "categories": { "Recall": 50.1, "RAG": 46.7, "Cite": 0.6, "Re-rank": 5.8, "ICL": 78.7, "LongQA": 29.9, "Summ": 23.7 }, "context_window_k": 128.0, "params_b": "4.0", "type": "♯" },
+       "recipe_class": null, "params_b": "4.0", "native_context_k": null, "source": "helmet"
+     },
+     "phi-3-small-128-k-instruct": {
+       "display_name": "Phi-3-small-128k-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 24.9, "categories": { "Recall": 22.3, "RAG": 33.8, "Cite": 3.0, "Re-rank": 1.9, "ICL": 79.6, "LongQA": 27.5, "Summ": 6.6 }, "context_window_k": 128.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "phi-3-med-128-k-instruct": {
+       "display_name": "Phi-3-med-128k-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 29.1, "categories": { "Recall": 24.5, "RAG": 44.8, "Cite": 3.3, "Re-rank": 6.6, "ICL": 73.2, "LongQA": 26.9, "Summ": 24.8 }, "context_window_k": 128.0, "params_b": "14.0", "type": "♯" },
+       "recipe_class": null, "params_b": "14.0", "native_context_k": null, "source": "helmet"
+     },
+     "phi-3-5-mini-instruct": {
+       "display_name": "Phi-3.5-mini-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 33.6, "categories": { "Recall": 48.8, "RAG": 43.1, "Cite": 1.6, "Re-rank": 7.8, "ICL": 79.5, "LongQA": 28.5, "Summ": 26.3 }, "context_window_k": 128.0, "params_b": "4.0", "type": "♯" },
+       "recipe_class": null, "params_b": "4.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-7-b": {
+       "display_name": "Qwen2-7B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 30.0, "categories": { "Recall": 38.2, "RAG": 45.0, "Cite": 2.3, "Re-rank": 3.6, "ICL": 77.5, "LongQA": 36.8, "Summ": 6.8 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♭" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-7-b-instruct": {
+       "display_name": "Qwen2-7B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 31.2, "categories": { "Recall": 36.8, "RAG": 47.4, "Cite": 2.4, "Re-rank": 5.9, "ICL": 66.2, "LongQA": 31.9, "Summ": 27.6 }, "context_window_k": 32.0, "params_b": "7.0", "type": "♯" },
+       "recipe_class": null, "params_b": "7.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-57-b": {
+       "display_name": "Qwen2-57B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 11.0, "categories": { "Recall": 1.8, "RAG": 10.6, "Cite": 1.1, "Re-rank": 0.0, "ICL": 43.3, "LongQA": 15.7, "Summ": 4.9 }, "context_window_k": 32.0, "params_b": "14.0", "type": "♭" },
+       "recipe_class": null, "params_b": "14.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-57-b-instruct": {
+       "display_name": "Qwen2-57B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 13.4, "categories": { "Recall": 5.9, "RAG": 12.7, "Cite": 1.1, "Re-rank": 0.0, "ICL": 40.6, "LongQA": 22.9, "Summ": 10.5 }, "context_window_k": 32.0, "params_b": "14.0", "type": "♯" },
+       "recipe_class": null, "params_b": "14.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-1-5-b": {
+       "display_name": "Qwen2.5-1.5B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 12.9, "categories": { "Recall": 3.5, "RAG": 22.3, "Cite": 0.6, "Re-rank": 0.0, "ICL": 43.7, "LongQA": 18.2, "Summ": 2.1 }, "context_window_k": 32.0, "params_b": "1.5", "type": "♭" },
+       "recipe_class": null, "params_b": "1.5", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-1-5-b-instruct": {
+       "display_name": "Qwen2.5-1.5B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 13.2, "categories": { "Recall": 7.8, "RAG": 25.0, "Cite": 1.7, "Re-rank": 0.3, "ICL": 38.4, "LongQA": 7.8, "Summ": 11.3 }, "context_window_k": 32.0, "params_b": "1.5", "type": "♯" },
+       "recipe_class": null, "params_b": "1.5", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-3-b": {
+       "display_name": "Qwen2.5-3B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 12.3, "categories": { "Recall": 7.5, "RAG": 21.1, "Cite": 1.0, "Re-rank": 2.2, "ICL": 32.0, "LongQA": 20.5, "Summ": 1.8 }, "context_window_k": 32.0, "params_b": "3.0", "type": "♭" },
+       "recipe_class": null, "params_b": "3.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-3-b-instruct": {
+       "display_name": "Qwen2.5-3B-Inst", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 18.1, "categories": { "Recall": 10.4, "RAG": 22.4, "Cite": 1.9, "Re-rank": 3.7, "ICL": 49.4, "LongQA": 21.3, "Summ": 17.5 }, "context_window_k": 32.0, "params_b": "3.0", "type": "♯" },
+       "recipe_class": null, "params_b": "3.0", "native_context_k": null, "source": "helmet"
+     },
+     "qwen-2-5-7-b": {
+       "display_name": "Qwen2.5-7B", "ruler_per_ctx": null, "ruler_long_score": null,
+       "helmet": { "overall": 20.9, "categories": { "Recall": 13.9, "RAG": 27.0, "Cite": 1.3, "Re-rank": 0.1,
2221
+ "ICL": 71.5,
2222
+ "LongQA": 25.9,
2223
+ "Summ": 6.8
2224
+ },
2225
+ "context_window_k": 128.0,
2226
+ "params_b": "7.0",
2227
+ "type": "♭"
2228
+ },
2229
+ "recipe_class": null,
2230
+ "params_b": "7.0",
2231
+ "native_context_k": null,
2232
+ "source": "helmet"
2233
+ },
2234
+ "qwen-2-5-72-b-instruct": {
2235
+ "display_name": "Qwen2.5-72B-Inst",
2236
+ "ruler_per_ctx": null,
2237
+ "ruler_long_score": null,
2238
+ "helmet": {
2239
+ "overall": 38.2,
2240
+ "categories": {
2241
+ "Recall": 38.4,
2242
+ "RAG": 43.0,
2243
+ "Cite": 8.0,
2244
+ "Re-rank": 24.5,
2245
+ "ICL": 83.2,
2246
+ "LongQA": 38.9,
2247
+ "Summ": 31.4
2248
+ },
2249
+ "context_window_k": 128.0,
2250
+ "params_b": "72.0",
2251
+ "type": "♯"
2252
+ },
2253
+ "recipe_class": null,
2254
+ "params_b": "72.0",
2255
+ "native_context_k": null,
2256
+ "source": "helmet"
2257
+ },
2258
+ "prolong": {
2259
+ "display_name": "ProLong",
2260
+ "ruler_per_ctx": null,
2261
+ "ruler_long_score": null,
2262
+ "helmet": {
2263
+ "overall": 49.4,
2264
+ "categories": {
2265
+ "Recall": 98.8,
2266
+ "RAG": 63.2,
2267
+ "Cite": 1.4,
2268
+ "Re-rank": 22.5,
2269
+ "ICL": 86.5,
2270
+ "LongQA": 43.9,
2271
+ "Summ": 29.2
2272
+ },
2273
+ "context_window_k": 512.0,
2274
+ "params_b": "8.0",
2275
+ "type": "♯"
2276
+ },
2277
+ "recipe_class": null,
2278
+ "params_b": "8.0",
2279
+ "native_context_k": null,
2280
+ "source": "helmet"
2281
+ },
2282
+ "gemma-3-12-b": {
2283
+ "display_name": "Gemma-3-12B",
2284
+ "ruler_per_ctx": null,
2285
+ "ruler_long_score": null,
2286
+ "helmet": {
2287
+ "overall": 40.6,
2288
+ "categories": {
2289
+ "Recall": 76.6,
2290
+ "RAG": 59.3,
2291
+ "Cite": 2.0,
2292
+ "Re-rank": 16.0,
2293
+ "ICL": 88.7,
2294
+ "LongQA": 41.4,
2295
+ "Summ": 0.2
2296
+ },
2297
+ "context_window_k": 128.0,
2298
+ "params_b": "12.0",
2299
+ "type": "♭"
2300
+ },
2301
+ "recipe_class": null,
2302
+ "params_b": "12.0",
2303
+ "native_context_k": null,
2304
+ "source": "helmet"
2305
+ },
2306
+ "gemma-3-12-b-instruct": {
2307
+ "display_name": "Gemma-3-12B-Inst",
2308
+ "ruler_per_ctx": null,
2309
+ "ruler_long_score": null,
2310
+ "helmet": {
2311
+ "overall": 36.0,
2312
+ "categories": {
2313
+ "Recall": 25.9,
2314
+ "RAG": 52.1,
2315
+ "Cite": 1.3,
2316
+ "Re-rank": 21.3,
2317
+ "ICL": 75.4,
2318
+ "LongQA": 43.0,
2319
+ "Summ": 33.1
2320
+ },
2321
+ "context_window_k": 128.0,
2322
+ "params_b": "12.0",
2323
+ "type": "♯"
2324
+ },
2325
+ "recipe_class": null,
2326
+ "params_b": "12.0",
2327
+ "native_context_k": null,
2328
+ "source": "helmet"
2329
+ },
2330
+ "gemma-3-27-b": {
2331
+ "display_name": "Gemma-3-27B",
2332
+ "ruler_per_ctx": null,
2333
+ "ruler_long_score": null,
2334
+ "helmet": {
2335
+ "overall": 44.5,
2336
+ "categories": {
2337
+ "Recall": 82.6,
2338
+ "RAG": 64.1,
2339
+ "Cite": 2.8,
2340
+ "Re-rank": 30.4,
2341
+ "ICL": 86.3,
2342
+ "LongQA": 44.6,
2343
+ "Summ": 0.7
2344
+ },
2345
+ "context_window_k": 128.0,
2346
+ "params_b": "27.0",
2347
+ "type": "♭"
2348
+ },
2349
+ "recipe_class": null,
2350
+ "params_b": "27.0",
2351
+ "native_context_k": null,
2352
+ "source": "helmet"
2353
+ },
2354
+ "gemma-3-27-b-instruct": {
2355
+ "display_name": "Gemma-3-27B-Inst",
2356
+ "ruler_per_ctx": null,
2357
+ "ruler_long_score": null,
2358
+ "helmet": {
2359
+ "overall": 41.9,
2360
+ "categories": {
2361
+ "Recall": 45.1,
2362
+ "RAG": 56.6,
2363
+ "Cite": 1.6,
2364
+ "Re-rank": 30.0,
2365
+ "ICL": 77.4,
2366
+ "LongQA": 46.8,
2367
+ "Summ": 35.9
2368
+ },
2369
+ "context_window_k": 128.0,
2370
+ "params_b": "27.0",
2371
+ "type": "♯"
2372
+ },
2373
+ "recipe_class": null,
2374
+ "params_b": "27.0",
2375
+ "native_context_k": null,
2376
+ "source": "helmet"
2377
+ },
2378
+ "jamba-v-0-1": {
2379
+ "display_name": "Jamba-v0.1",
2380
+ "ruler_per_ctx": null,
2381
+ "ruler_long_score": null,
2382
+ "helmet": {
2383
+ "overall": 41.5,
2384
+ "categories": {
2385
+ "Recall": 80.3,
2386
+ "RAG": 60.3,
2387
+ "Cite": 6.0,
2388
+ "Re-rank": 9.3,
2389
+ "ICL": 88.7,
2390
+ "LongQA": 41.7,
2391
+ "Summ": 4.0
2392
+ },
2393
+ "context_window_k": 256.0,
2394
+ "params_b": "12.0",
2395
+ "type": "♭"
2396
+ },
2397
+ "recipe_class": null,
2398
+ "params_b": "12.0",
2399
+ "native_context_k": null,
2400
+ "source": "helmet"
2401
+ }
2402
+ },
2403
+ "aliases": {
2404
+ "qwen-2-5-7-b-instruct": [
2405
+ "qwen2.5-7b-instruct",
2406
+ "Qwen2.5-7B-Inst",
2407
+ "Qwen2.5-7B-Inst",
2408
+ "Qwen2.5-7B-Inst",
2409
+ "Qwen2.5-7B-Inst",
2410
+ "Qwen2.5-7B-Inst"
2411
+ ],
2412
+ "qwen-2-5-14-b-instruct": [
2413
+ "qwen2.5-14b-instruct"
2414
+ ],
2415
+ "qwen-2-5-32-b-instruct": [
2416
+ "qwen2.5-32b-instruct"
2417
+ ],
2418
+ "mistral-7-b-v-0-2": [
2419
+ "mistral-7b-v0.2"
2420
+ ],
2421
+ "mixtral-8-x-7-b": [
2422
+ "mixtral-8x7b"
2423
+ ],
2424
+ "mixtral-8-x-22-b-instruct": [
2425
+ "mixtral-8x22b-instruct"
2426
+ ],
2427
+ "qwen-2-72-b": [
2428
+ "qwen2-72b"
2429
+ ],
2430
+ "dbrx": [
2431
+ "dbrx"
2432
+ ],
2433
+ "qwen-1-5-72-b": [
2434
+ "qwen1.5-72b"
2435
+ ],
2436
+ "mistral-large-2407": [
2437
+ "mistral-large-2407"
2438
+ ],
2439
+ "yi-34-b-200-k": [
2440
+ "yi-34b-200k",
2441
+ "Yi-34B-200k",
2442
+ "Yi-34B-200k",
2443
+ "Yi-34B-200k",
2444
+ "Yi-34B-200k",
2445
+ "Yi-34B-200k"
2446
+ ],
2447
+ "command-r-35-b": [
2448
+ "command-r-35b"
2449
+ ],
2450
+ "command-r-0824-32-b": [
2451
+ "command-r-0824-32b"
2452
+ ],
2453
+ "command-r-plus-104-b": [
2454
+ "command-r-plus-104b"
2455
+ ],
2456
+ "command-r-plus-0824-104-b": [
2457
+ "command-r-plus-0824-104b"
2458
+ ],
2459
+ "mistral-large-2411-123-b": [
2460
+ "mistral-large-2411-123b"
2461
+ ],
2462
+ "chatglm-3-6-b-128-k": [
2463
+ "chatglm3-6b-128k"
2464
+ ],
2465
+ "lwm-7-b": [
2466
+ "lwm-7b"
2467
+ ],
2468
+ "glm-4-9-b-1-m": [
2469
+ "glm4-9b-1m"
2470
+ ],
2471
+ "prolong-8-b-512-k": [
2472
+ "prolong-8b-512k"
2473
+ ],
2474
+ "megabeam-mistral-7-b-512-k": [
2475
+ "megabeam-mistral-7b-512k"
2476
+ ],
2477
+ "internlm-2-5-7-b-1-m": [
2478
+ "internlm2.5-7b-1m"
2479
+ ],
2480
+ "gradientai-llama-3-70-b-1-m": [
2481
+ "gradientai-llama3-70b-1m"
2482
+ ],
2483
+ "gradientai-llama-3-8-b-1-m": [
2484
+ "gradientai-llama3-8b-1m"
2485
+ ],
2486
+ "longchat-7-b-32-k": [
2487
+ "longchat-7b-32k"
2488
+ ],
2489
+ "longalpaca-13-b-32-k": [
2490
+ "longalpaca-13b-32k"
2491
+ ],
2492
+ "llama-3-1-70-b-instruct": [
2493
+ "llama3.1-70b-instruct",
2494
+ "Llama-3.1-70B-Inst",
2495
+ "Llama-3.1-70B-Inst",
2496
+ "Llama-3.1-70B-Inst",
2497
+ "Llama-3.1-70B-Inst",
2498
+ "Llama-3.1-70B-Inst"
2499
+ ],
2500
+ "llama-3-1-8-b-instruct": [
2501
+ "llama3.1-8b-instruct",
2502
+ "Llama-3.1-8B-Inst",
2503
+ "Llama-3.1-8B-Inst",
2504
+ "Llama-3.1-8B-Inst",
2505
+ "Llama-3.1-8B-Inst",
2506
+ "Llama-3.1-8B-Inst"
2507
+ ],
2508
+ "mistral-nemo-12-b": [
2509
+ "mistral-nemo-12b"
2510
+ ],
2511
+ "qwen-2-5-7-b-instruct-1-m": [
2512
+ "qwen2.5-7b-instruct-1m"
2513
+ ],
2514
+ "qwen-2-5-14-b-instruct-1-m": [
2515
+ "qwen2.5-14b-instruct-1m"
2516
+ ],
2517
+ "jamba-1-5-large": [
2518
+ "jamba-1.5-large"
2519
+ ],
2520
+ "jamba-1-5-mini": [
2521
+ "jamba-1.5-mini",
2522
+ "Jamba-1.5-Mini",
2523
+ "Jamba-1.5-Mini",
2524
+ "Jamba-1.5-Mini",
2525
+ "Jamba-1.5-Mini",
2526
+ "Jamba-1.5-Mini"
2527
+ ],
2528
+ "phi-3-mini-3-8-b": [
2529
+ "phi3-mini-3.8b"
2530
+ ],
2531
+ "phi-3-medium-14-b": [
2532
+ "phi3-medium-14b"
2533
+ ],
2534
+ "gpt-4": [
2535
+ "GPT-4",
2536
+ "GPT-4",
2537
+ "GPT-4",
2538
+ "GPT-4",
2539
+ "GPT-4"
2540
+ ],
2541
+ "gpt-4-o-mini": [
2542
+ "GPT-4o-mini",
2543
+ "GPT-4o-mini",
2544
+ "GPT-4o-mini",
2545
+ "GPT-4o-mini",
2546
+ "GPT-4o-mini"
2547
+ ],
2548
+ "gpt-4-o-05": [
2549
+ "GPT-4o-05",
2550
+ "GPT-4o-05",
2551
+ "GPT-4o-05",
2552
+ "GPT-4o-05",
2553
+ "GPT-4o-05"
2554
+ ],
2555
+ "gpt-4-o-08": [
2556
+ "GPT-4o-08",
2557
+ "GPT-4o-08",
2558
+ "GPT-4o-08",
2559
+ "GPT-4o-08",
2560
+ "GPT-4o-08"
2561
+ ],
2562
+ "claude-3-5-sonnet": [
2563
+ "Claude-3.5-Sonnet",
2564
+ "Claude-3.5-Sonnet",
2565
+ "Claude-3.5-Sonnet",
2566
+ "Claude-3.5-Sonnet",
2567
+ "Claude-3.5-Sonnet"
2568
+ ],
2569
+ "gemini-1-5-flash": [
2570
+ "Gemini-1.5-Flash",
2571
+ "Gemini-1.5-Flash",
2572
+ "Gemini-1.5-Flash",
2573
+ "Gemini-1.5-Flash",
2574
+ "Gemini-1.5-Flash"
2575
+ ],
2576
+ "gemini-1-5-pro": [
2577
+ "Gemini-1.5-Pro",
2578
+ "Gemini-1.5-Pro",
2579
+ "Gemini-1.5-Pro",
2580
+ "Gemini-1.5-Pro",
2581
+ "Gemini-1.5-Pro"
2582
+ ],
2583
+ "llama-2-7-b-32-k": [
2584
+ "Llama-2-7B-32k",
2585
+ "Llama-2-7B-32k",
2586
+ "Llama-2-7B-32k",
2587
+ "Llama-2-7B-32k",
2588
+ "Llama-2-7B-32k"
2589
+ ],
2590
+ "llama-2-7-b-32-k-instruct": [
2591
+ "Llama-2-7B-32k-Inst",
2592
+ "Llama-2-7B-32k-Inst",
2593
+ "Llama-2-7B-32k-Inst",
2594
+ "Llama-2-7B-32k-Inst",
2595
+ "Llama-2-7B-32k-Inst"
2596
+ ],
2597
+ "llama-7-b-80-k": [
2598
+ "Llama-7B-80k",
2599
+ "Llama-7B-80k",
2600
+ "Llama-7B-80k",
2601
+ "Llama-7B-80k",
2602
+ "Llama-7B-80k"
2603
+ ],
2604
+ "yarn-llama-2-7-b-64-k": [
2605
+ "Yarn-Llama-2-7B-64k",
2606
+ "Yarn-Llama-2-7B-64k",
2607
+ "Yarn-Llama-2-7B-64k",
2608
+ "Yarn-Llama-2-7B-64k",
2609
+ "Yarn-Llama-2-7B-64k"
2610
+ ],
2611
+ "yarn-llama-2-7-b-128-k": [
2612
+ "Yarn-Llama-2-7B-128k",
2613
+ "Yarn-Llama-2-7B-128k",
2614
+ "Yarn-Llama-2-7B-128k",
2615
+ "Yarn-Llama-2-7B-128k",
2616
+ "Yarn-Llama-2-7B-128k"
2617
+ ],
2618
+ "llama-3-8-b": [
2619
+ "Llama-3-8B",
2620
+ "Llama-3-8B",
2621
+ "Llama-3-8B",
2622
+ "Llama-3-8B",
2623
+ "Llama-3-8B"
2624
+ ],
2625
+ "llama-3-8-b-instruct": [
2626
+ "Llama-3-8B-Inst",
2627
+ "Llama-3-8B-Inst",
2628
+ "Llama-3-8B-Inst",
2629
+ "Llama-3-8B-Inst",
2630
+ "Llama-3-8B-Inst"
2631
+ ],
2632
+ "llama-3-8-b-θ": [
2633
+ "Llama-3-8B-θ",
2634
+ "Llama-3-8B-θ",
2635
+ "Llama-3-8B-θ",
2636
+ "Llama-3-8B-θ",
2637
+ "Llama-3-8B-θ"
2638
+ ],
2639
+ "llama-3-8-b-instruct-θ": [
2640
+ "Llama-3-8B-Inst-θ",
2641
+ "Llama-3-8B-Inst-θ",
2642
+ "Llama-3-8B-Inst-θ",
2643
+ "Llama-3-8B-Inst-θ",
2644
+ "Llama-3-8B-Inst-θ"
2645
+ ],
2646
+ "llama-3-70-b-θ": [
2647
+ "Llama-3-70B-θ",
2648
+ "Llama-3-70B-θ",
2649
+ "Llama-3-70B-θ",
2650
+ "Llama-3-70B-θ",
2651
+ "Llama-3-70B-θ"
2652
+ ],
2653
+ "llama-3-70-b-instruct-θ": [
2654
+ "Llama-3-70B-Inst-θ",
2655
+ "Llama-3-70B-Inst-θ",
2656
+ "Llama-3-70B-Inst-θ",
2657
+ "Llama-3-70B-Inst-θ",
2658
+ "Llama-3-70B-Inst-θ"
2659
+ ],
2660
+ "llama-3-1-8-b": [
2661
+ "Llama-3.1-8B",
2662
+ "Llama-3.1-8B",
2663
+ "Llama-3.1-8B",
2664
+ "Llama-3.1-8B",
2665
+ "Llama-3.1-8B"
2666
+ ],
2667
+ "llama-3-1-70-b": [
2668
+ "Llama-3.1-70B",
2669
+ "Llama-3.1-70B",
2670
+ "Llama-3.1-70B",
2671
+ "Llama-3.1-70B",
2672
+ "Llama-3.1-70B"
2673
+ ],
2674
+ "llama-3-3-70-b-instruct": [
2675
+ "Llama-3.3-70B-Inst",
2676
+ "Llama-3.3-70B-Inst",
2677
+ "Llama-3.3-70B-Inst",
2678
+ "Llama-3.3-70B-Inst",
2679
+ "Llama-3.3-70B-Inst"
2680
+ ],
2681
+ "llama-3-2-1-b": [
2682
+ "Llama-3.2-1B",
2683
+ "Llama-3.2-1B",
2684
+ "Llama-3.2-1B",
2685
+ "Llama-3.2-1B",
2686
+ "Llama-3.2-1B"
2687
+ ],
2688
+ "llama-3-2-1-b-instruct": [
2689
+ "Llama-3.2-1B-Inst",
2690
+ "Llama-3.2-1B-Inst",
2691
+ "Llama-3.2-1B-Inst",
2692
+ "Llama-3.2-1B-Inst",
2693
+ "Llama-3.2-1B-Inst"
2694
+ ],
2695
+ "llama-3-2-3-b": [
2696
+ "Llama-3.2-3B",
2697
+ "Llama-3.2-3B",
2698
+ "Llama-3.2-3B",
2699
+ "Llama-3.2-3B",
2700
+ "Llama-3.2-3B"
2701
+ ],
2702
+ "llama-3-2-3-b-instruct": [
2703
+ "Llama-3.2-3B-Inst",
2704
+ "Llama-3.2-3B-Inst",
2705
+ "Llama-3.2-3B-Inst",
2706
+ "Llama-3.2-3B-Inst",
2707
+ "Llama-3.2-3B-Inst"
2708
+ ],
2709
+ "llama-4-17-b": [
2710
+ "Llama-4-17B",
2711
+ "Llama-4-17B",
2712
+ "Llama-4-17B",
2713
+ "Llama-4-17B",
2714
+ "Llama-4-17B"
2715
+ ],
2716
+ "llama-4-17-b-instruct": [
2717
+ "Llama-4-17B-Inst",
2718
+ "Llama-4-17B-Inst",
2719
+ "Llama-4-17B-Inst",
2720
+ "Llama-4-17B-Inst",
2721
+ "Llama-4-17B-Inst"
2722
+ ],
2723
+ "mistral-7-b-instruct-v-0-1": [
2724
+ "Mistral-7B-Inst-v0.1",
2725
+ "Mistral-7B-Inst-v0.1",
2726
+ "Mistral-7B-Inst-v0.1",
2727
+ "Mistral-7B-Inst-v0.1",
2728
+ "Mistral-7B-Inst-v0.1"
2729
+ ],
2730
+ "mistral-7-b-instruct-v-0-2": [
2731
+ "Mistral-7B-Inst-v0.2",
2732
+ "Mistral-7B-Inst-v0.2",
2733
+ "Mistral-7B-Inst-v0.2",
2734
+ "Mistral-7B-Inst-v0.2",
2735
+ "Mistral-7B-Inst-v0.2"
2736
+ ],
2737
+ "mistral-7-b-v-0-3": [
2738
+ "Mistral-7B-v0.3",
2739
+ "Mistral-7B-v0.3",
2740
+ "Mistral-7B-v0.3",
2741
+ "Mistral-7B-v0.3",
2742
+ "Mistral-7B-v0.3"
2743
+ ],
2744
+ "mistral-7-b-instruct-v-0-3": [
2745
+ "Mistral-7B-Inst-v0.3",
2746
+ "Mistral-7B-Inst-v0.3",
2747
+ "Mistral-7B-Inst-v0.3",
2748
+ "Mistral-7B-Inst-v0.3",
2749
+ "Mistral-7B-Inst-v0.3"
2750
+ ],
2751
+ "ministral-8-b-instruct": [
2752
+ "Ministral-8B-Inst",
2753
+ "Ministral-8B-Inst",
2754
+ "Ministral-8B-Inst",
2755
+ "Ministral-8B-Inst",
2756
+ "Ministral-8B-Inst"
2757
+ ],
2758
+ "mistral-nemo": [
2759
+ "Mistral-Nemo",
2760
+ "Mistral-Nemo",
2761
+ "Mistral-Nemo",
2762
+ "Mistral-Nemo",
2763
+ "Mistral-Nemo"
2764
+ ],
2765
+ "mistral-nemo-instruct": [
2766
+ "Mistral-Nemo-Inst",
2767
+ "Mistral-Nemo-Inst",
2768
+ "Mistral-Nemo-Inst",
2769
+ "Mistral-Nemo-Inst",
2770
+ "Mistral-Nemo-Inst"
2771
+ ],
2772
+ "megabeam-mistral": [
2773
+ "MegaBeam-Mistral",
2774
+ "MegaBeam-Mistral",
2775
+ "MegaBeam-Mistral",
2776
+ "MegaBeam-Mistral",
2777
+ "MegaBeam-Mistral"
2778
+ ],
2779
+ "yi-6-b-200-k": [
2780
+ "Yi-6B-200k",
2781
+ "Yi-6B-200k",
2782
+ "Yi-6B-200k",
2783
+ "Yi-6B-200k",
2784
+ "Yi-6B-200k"
2785
+ ],
2786
+ "yi-9-b-200-k": [
2787
+ "Yi-9B-200k",
2788
+ "Yi-9B-200k",
2789
+ "Yi-9B-200k",
2790
+ "Yi-9B-200k",
2791
+ "Yi-9B-200k"
2792
+ ],
2793
+ "yi-1-5-9-b-32-k": [
2794
+ "Yi-1.5-9B-32k",
2795
+ "Yi-1.5-9B-32k",
2796
+ "Yi-1.5-9B-32k",
2797
+ "Yi-1.5-9B-32k",
2798
+ "Yi-1.5-9B-32k"
2799
+ ],
2800
+ "phi-3-mini-128-k-instruct": [
2801
+ "Phi-3-mini-128k-Inst",
2802
+ "Phi-3-mini-128k-Inst",
2803
+ "Phi-3-mini-128k-Inst",
2804
+ "Phi-3-mini-128k-Inst",
2805
+ "Phi-3-mini-128k-Inst"
2806
+ ],
2807
+ "phi-3-small-128-k-instruct": [
2808
+ "Phi-3-small-128k-Inst",
2809
+ "Phi-3-small-128k-Inst",
2810
+ "Phi-3-small-128k-Inst",
2811
+ "Phi-3-small-128k-Inst",
2812
+ "Phi-3-small-128k-Inst"
2813
+ ],
2814
+ "phi-3-med-128-k-instruct": [
2815
+ "Phi-3-med-128k-Inst",
2816
+ "Phi-3-med-128k-Inst",
2817
+ "Phi-3-med-128k-Inst",
2818
+ "Phi-3-med-128k-Inst",
2819
+ "Phi-3-med-128k-Inst"
2820
+ ],
2821
+ "phi-3-5-mini-instruct": [
2822
+ "Phi-3.5-mini-Inst",
2823
+ "Phi-3.5-mini-Inst",
2824
+ "Phi-3.5-mini-Inst",
2825
+ "Phi-3.5-mini-Inst",
2826
+ "Phi-3.5-mini-Inst"
2827
+ ],
2828
+ "qwen-2-7-b": [
2829
+ "Qwen2-7B",
2830
+ "Qwen2-7B",
2831
+ "Qwen2-7B",
2832
+ "Qwen2-7B",
2833
+ "Qwen2-7B"
2834
+ ],
2835
+ "qwen-2-7-b-instruct": [
2836
+ "Qwen2-7B-Inst",
2837
+ "Qwen2-7B-Inst",
2838
+ "Qwen2-7B-Inst",
2839
+ "Qwen2-7B-Inst",
2840
+ "Qwen2-7B-Inst"
2841
+ ],
2842
+ "qwen-2-57-b": [
2843
+ "Qwen2-57B",
2844
+ "Qwen2-57B",
2845
+ "Qwen2-57B",
2846
+ "Qwen2-57B",
2847
+ "Qwen2-57B"
2848
+ ],
2849
+ "qwen-2-57-b-instruct": [
2850
+ "Qwen2-57B-Inst",
2851
+ "Qwen2-57B-Inst",
2852
+ "Qwen2-57B-Inst",
2853
+ "Qwen2-57B-Inst",
2854
+ "Qwen2-57B-Inst"
2855
+ ],
2856
+ "qwen-2-5-1-5-b": [
2857
+ "Qwen2.5-1.5B",
2858
+ "Qwen2.5-1.5B",
2859
+ "Qwen2.5-1.5B",
2860
+ "Qwen2.5-1.5B",
2861
+ "Qwen2.5-1.5B"
2862
+ ],
2863
+ "qwen-2-5-1-5-b-instruct": [
2864
+ "Qwen2.5-1.5B-Inst",
2865
+ "Qwen2.5-1.5B-Inst",
2866
+ "Qwen2.5-1.5B-Inst",
2867
+ "Qwen2.5-1.5B-Inst",
2868
+ "Qwen2.5-1.5B-Inst"
2869
+ ],
2870
+ "qwen-2-5-3-b": [
2871
+ "Qwen2.5-3B",
2872
+ "Qwen2.5-3B",
2873
+ "Qwen2.5-3B",
2874
+ "Qwen2.5-3B",
2875
+ "Qwen2.5-3B"
2876
+ ],
2877
+ "qwen-2-5-3-b-instruct": [
2878
+ "Qwen2.5-3B-Inst",
2879
+ "Qwen2.5-3B-Inst",
2880
+ "Qwen2.5-3B-Inst",
2881
+ "Qwen2.5-3B-Inst",
2882
+ "Qwen2.5-3B-Inst"
2883
+ ],
2884
+ "qwen-2-5-7-b": [
2885
+ "Qwen2.5-7B",
2886
+ "Qwen2.5-7B",
2887
+ "Qwen2.5-7B",
2888
+ "Qwen2.5-7B",
2889
+ "Qwen2.5-7B"
2890
+ ],
2891
+ "qwen-2-5-72-b-instruct": [
2892
+ "Qwen2.5-72B-Inst",
2893
+ "Qwen2.5-72B-Inst",
2894
+ "Qwen2.5-72B-Inst",
2895
+ "Qwen2.5-72B-Inst",
2896
+ "Qwen2.5-72B-Inst"
2897
+ ],
2898
+ "prolong": [
2899
+ "ProLong",
2900
+ "ProLong",
2901
+ "ProLong",
2902
+ "ProLong",
2903
+ "ProLong"
2904
+ ],
2905
+ "gemma-3-12-b": [
2906
+ "Gemma-3-12B",
2907
+ "Gemma-3-12B",
2908
+ "Gemma-3-12B",
2909
+ "Gemma-3-12B",
2910
+ "Gemma-3-12B"
2911
+ ],
2912
+ "gemma-3-12-b-instruct": [
2913
+ "Gemma-3-12B-Inst",
2914
+ "Gemma-3-12B-Inst",
2915
+ "Gemma-3-12B-Inst",
2916
+ "Gemma-3-12B-Inst",
2917
+ "Gemma-3-12B-Inst"
2918
+ ],
2919
+ "gemma-3-27-b": [
2920
+ "Gemma-3-27B",
2921
+ "Gemma-3-27B",
2922
+ "Gemma-3-27B",
2923
+ "Gemma-3-27B",
2924
+ "Gemma-3-27B"
2925
+ ],
2926
+ "gemma-3-27-b-instruct": [
2927
+ "Gemma-3-27B-Inst",
2928
+ "Gemma-3-27B-Inst",
2929
+ "Gemma-3-27B-Inst",
2930
+ "Gemma-3-27B-Inst",
2931
+ "Gemma-3-27B-Inst"
2932
+ ],
2933
+ "jamba-v-0-1": [
2934
+ "Jamba-v0.1",
2935
+ "Jamba-v0.1",
2936
+ "Jamba-v0.1",
2937
+ "Jamba-v0.1",
2938
+ "Jamba-v0.1"
2939
+ ]
2940
+ }
2941
+ }
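
Reviewer note on consuming the KB: js/longscore.js itself is not shown in this view. Below is a minimal sketch of the lookup logic implied by the schema above and by the verdict ladder wired up in the i18n keys further down — the shape of a populated ruler_per_ctx ({"4k": …, "8k": …, …, "128k": …}) is an assumption here, since every entry in this excerpt is HELMET-only (ruler_per_ctx: null):

    // Sketch only, not the shipped js/longscore.js. Assumes ruler_per_ctx maps
    // a context length ("4k".."128k") to that model's RULER score.
    function longScore(rulerPerCtx) {
      if (!rulerPerCtx) return null;                            // HELMET-only entry: fall back to the 128K aggregate
      const base = (rulerPerCtx["4k"] + rulerPerCtx["8k"]) / 2; // Base = mean(S_short)
      const lc = ["16k", "32k", "64k", "128k"]
        .map((l) => (rulerPerCtx[l] - base) / base);            // LC_l = (S_l - Base) / Base
      return lc.reduce((a, b) => a + b, 0) / lc.length;         // mean over the long lengths
    }

    // Verdict ladder matching the longscore.verdict.* keys: 0 = no degradation.
    function classify(score) {
      const drop = -score;
      if (drop <= 0)   return "no_degradation";
      if (drop < 0.10) return "mild";
      if (drop < 0.20) return "moderate";  // e.g. a -10% LongScore, the Llama-3.1-70B example
      if (drop < 0.30) return "severe";
      return "extreme";
    }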
data/solutions_hub.json CHANGED
@@ -137,6 +137,22 @@
    "best_for": "Predicting NIAH and reasoning pass rates from architecture alone — no inference needed.",
    "not_for": "Final go/no-go decision — re-test on your domain after architectural screening passes."
  },
+ {
+   "id": "long_ctx_degradation",
+   "category": "diagnostic",
+   "pain": "Long-ctx accuracy drops well before the claimed 128K window — but raw scores are dominated by base ability and hide which model truly retains capability past short context.",
+   "tafagent_mode": "🎯 LongScore",
+   "external_tools": [
+     {"name": "100-LongBench paper (LongScore metric)", "url": "https://arxiv.org/abs/2505.19293", "type": "paper"},
+     {"name": "HELMET (Princeton, 7-task long-ctx benchmark)", "url": "https://github.com/princeton-nlp/HELMET", "type": "tool"},
+     {"name": "HELMET full results sheet (n=315)", "url": "https://docs.google.com/spreadsheets/d/1LBt6dP4UwZwU_CjoYhyAd_rjKhQLvo0Gq4cYUnpi_CA", "type": "leaderboard"},
+     {"name": "NVIDIA RULER per-length scores", "url": "https://github.com/NVIDIA/RULER", "type": "tool"},
+     {"name": "LongBench v2 (THUDM)", "url": "https://longbench2.github.io/", "type": "leaderboard"},
+     {"name": "Chroma context rot study", "url": "https://research.trychroma.com/context-rot", "type": "paper"}
+   ],
+   "best_for": "Picking a long-ctx model by relative degradation rather than raw score. Uses peer-reviewed LongScore metric to disentangle base ability from long-ctx capability. Browser-only, no GPU.",
+   "not_for": "Models not yet covered by RULER or HELMET (KB has n=93). Use the external tools above for breadth."
+ },
  {
    "id": "tokenizer_glitch",
    "category": "diagnostic",
index.html CHANGED
@@ -231,6 +231,9 @@
  <p><strong data-i18n="help.v087.tax.title">🌍 Multilingual Tokenizer Tax</strong></p>
  <p data-i18n="help.v087.tax.body">Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.</p>

+ <p><strong data-i18n="help.v088.longscore.title">🎯 LongScore</strong></p>
+ <p data-i18n="help.v088.longscore.body">Every long-ctx LLM claims 128K but degrades long before that. The 100-LongBench paper (ACL 2025, arXiv:2505.19293) noticed that raw long-ctx scores are dominated by base ability. They propose <strong>LongScore</strong>: <code>LC_l = (S_l − Base) / Base</code> with <code>Base = mean(S_short)</code>, then average over long lengths. This tafagent mode embeds LongScore-ready data: RULER aggregate per-context (n=33 models, 4K-128K) + HELMET aggregate at 128K (n=60 models, 7 task categories). Lookup is exact-match by HF model id. <em>Use case</em>: 'I want to use Llama-3.1-70B-Instruct for 100K-token doc summarization — how much accuracy do I actually lose?' → paste id, see -10% LongScore (moderate degradation, mostly the 128K cliff).</p>
+
  <p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p>
  <p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p>

@@ -348,6 +351,7 @@
  <li data-i18n="inv.v084.cache"><strong>🔁 Cache Diff</strong> — predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.</li>
  <li data-i18n="inv.v085.speculative"><strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).</li>
  <li data-i18n="inv.v087.tax"><strong>🌍 Token Tax</strong> — real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).</li>
+ <li data-i18n="inv.v088.longscore"><strong>🎯 LongScore</strong> — peer-reviewed degradation metric (100-LongBench, ACL 2025). Look up any model in RULER + HELMET KBs (n=93). See how much your model actually drops past short context.</li>
  <li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li>
  </ul>
  </details>

@@ -485,6 +489,7 @@
  <button class="mode-btn" data-mode="cache" role="tab" aria-selected="false" data-i18n="modes.cache">🔁 Cache Diff</button>
  <button class="mode-btn" data-mode="speculative" role="tab" aria-selected="false" data-i18n="modes.speculative">🔬 Spec-Decode</button>
  <button class="mode-btn" data-mode="tax" role="tab" aria-selected="false" data-i18n="modes.tax">🌍 Token Tax</button>
+ <button class="mode-btn" data-mode="longscore" role="tab" aria-selected="false" data-i18n="modes.longscore">🎯 LongScore</button>
  <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
  </div>
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">

@@ -1178,6 +1183,34 @@
  </p>
  </section>

+ <!-- LongScore (mode=longscore, v0.8.8 anti-bullshit pack #14) -->
+ <section id="longscore-section" style="display:none;">
+   <h2><span data-i18n="longscore.title">🎯 LongScore</span>
+     <span class="info"><span class="tooltip" data-i18n="longscore.tip">
+     <strong>Why this matters</strong>: every model claims a 128K context window, but accuracy degrades long before that. LongScore (peer-reviewed metric from 100-LongBench, ACL 2025) measures relative degradation past short context. Disentangles base ability from true long-ctx capability — so you compare degradation, not raw scores. Lookup against RULER + HELMET KBs (n=93 models).
+     </span></span>
+   </h2>
+   <p class="recipe-desc" data-i18n="longscore.desc">
+     <strong>How much does your model degrade past short context?</strong> Paste an HF model id → see LongScore (relative degradation) + per-length breakdown + HELMET 7-task scores when available. No GPU. No inference. Pure lookup against published benchmarks.
+   </p>
+   <div class="form-row">
+     <label for="longscore-input" data-i18n="longscore.input_label">Model id:</label>
+     <input type="text" id="longscore-input" data-i18n-placeholder="longscore.input.placeholder"
+            placeholder="e.g. Qwen2.5-72B-Instruct or meta-llama/Llama-3.1-70B-Instruct" style="flex:1;" />
+     <button type="button" id="longscore-lookup-btn" data-i18n="longscore.lookup_btn">🔎 Lookup</button>
+   </div>
+   <div class="form-row">
+     <button type="button" id="longscore-example-good-btn" class="secondary" data-i18n="longscore.example_good_btn">↳ Example: Jamba-1.5-Large (no degradation)</button>
+     <button type="button" id="longscore-example-mid-btn" class="secondary" data-i18n="longscore.example_mid_btn">↳ Example: Llama-3.1-70B (moderate)</button>
+     <button type="button" id="longscore-example-bad-btn" class="secondary" data-i18n="longscore.example_bad_btn">↳ Example: dbrx (severe)</button>
+   </div>
+   <p id="longscore-status" class="recipe-desc" style="font-size:0.92em;"></p>
+   <div id="longscore-output" style="margin-top: 1em;"></div>
+   <p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;" data-i18n="longscore.formula_note">
+     💡 <strong>LongScore</strong> = mean over l ∈ {16K, 32K, 64K, 128K} of (S_l − Base) / Base, where Base = mean(S_4K, S_8K). Source: <a href="https://arxiv.org/abs/2505.19293" target="_blank">100-LongBench, ACL 2025</a>. Data: <a href="https://github.com/NVIDIA/RULER" target="_blank">NVIDIA RULER</a> (per-length, n=33) + <a href="https://github.com/princeton-nlp/HELMET" target="_blank">HELMET</a> (aggregate at 128K, n=60). 0 = no degradation; -0.30 = severe.
+   </p>
+ </section>
+
  <section id="hub-section" style="display:none;">
  <h2><span data-i18n="hub.title">🧭 Solutions Hub</span>
  <span class="info"><span class="tooltip" data-i18n="hub.tip">
js/i18n.js CHANGED
@@ -753,6 +753,44 @@ export const TRANSLATIONS = {
  "help.v087.tax.title": "🌍 Multilingual Tokenizer Tax",
  "help.v087.tax.body": "Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Output: per-tokenizer token count + chars-per-token + ratio vs baseline + cost-asymmetry interpretation. Auto-detects script blocks (Latin / CJK / Arabic / Cyrillic / Devanagari / Thai / Greek / Hebrew / Korean) so users see why one tokenizer is 3× another. <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.",

+ // v0.8.8 — anti-bullshit pack #14: LongScore (RULER + HELMET lookup)
+ "modes.longscore": "🎯 LongScore",
+ "mode_desc.longscore": "Look up your model's relative degradation past short context. RULER + HELMET KBs (n=93 models). LongScore metric from 100-LongBench (ACL 2025).",
+ "longscore.title": "🎯 LongScore",
+ "longscore.tip": "Every model claims a 128K context window, but accuracy degrades long before that. LongScore (peer-reviewed metric from 100-LongBench, ACL 2025) measures relative degradation past short context. Disentangles base ability from true long-ctx capability — so you compare degradation, not raw scores. Lookup against RULER + HELMET KBs (n=93 models).",
+ "longscore.desc": "<strong>How much does your model degrade past short context?</strong> Paste an HF model id → see LongScore (relative degradation) + per-length breakdown + HELMET 7-task scores when available. No GPU. No inference. Pure lookup against published benchmarks.",
+ "longscore.input_label": "Model id:",
+ "longscore.input.placeholder": "e.g. Qwen2.5-72B-Instruct or meta-llama/Llama-3.1-70B-Instruct",
+ "longscore.lookup_btn": "🔎 Lookup",
+ "longscore.example_good_btn": "↳ Example: Jamba-1.5-Large (no degradation)",
+ "longscore.example_mid_btn": "↳ Example: Llama-3.1-70B (moderate)",
+ "longscore.example_bad_btn": "↳ Example: dbrx (severe)",
+ "longscore.formula_note": "💡 <strong>LongScore</strong> = mean over l ∈ {16K, 32K, 64K, 128K} of (S_l − Base) / Base, where Base = mean(S_4K, S_8K). Source: <a href=\"https://arxiv.org/abs/2505.19293\" target=\"_blank\">100-LongBench, ACL 2025</a>. Data: <a href=\"https://github.com/NVIDIA/RULER\" target=\"_blank\">NVIDIA RULER</a> (per-length, n=33) + <a href=\"https://github.com/princeton-nlp/HELMET\" target=\"_blank\">HELMET</a> (aggregate at 128K, n=60). 0 = no degradation; -0.30 = severe.",
+ "longscore.miss.title": "Model not found in KB",
+ "longscore.miss.body": "Looked up <code>{id}</code>. KB has {n} models. Try a canonical HF id (e.g. <code>Qwen2.5-72B-Instruct</code>, <code>Llama-3.1-70B-Instruct</code>, <code>Jamba-1.5-Mini</code>).",
+ "longscore.miss.suggest": "Check coverage at",
+ "longscore.no_ruler": "⚠ No per-length data — LongScore not computable. Showing HELMET aggregate at 128K instead.",
+ "longscore.score_label": "LongScore",
+ "longscore.helmet_label": "HELMET 7-task breakdown",
+ "longscore.col.ctx": "Context",
+ "longscore.col.score": "Score",
+ "longscore.col.lc": "LC",
+ "longscore.col.task": "Task",
+ "longscore.source_note": "Data source",
+ "longscore.hint.empty": "⚠ Paste a model id first.",
+ "longscore.status.lookup": "⏳ Looking up…",
+ "longscore.status.miss": "ℹ Model not in KB",
+ "longscore.status.ruler_hit": "✅ RULER per-length data found",
+ "longscore.status.helmet_only": "ℹ HELMET aggregate only (no per-length data)",
+ "longscore.verdict.no_degradation": "✅ No degradation past short context",
+ "longscore.verdict.mild": "🟢 Mild degradation (<10%)",
+ "longscore.verdict.moderate": "🟠 Moderate degradation (10-20%)",
+ "longscore.verdict.severe": "🔴 Severe degradation (20-30%)",
+ "longscore.verdict.extreme": "🚨 Extreme degradation (>30%)",
+ "inv.v088.longscore": "<strong>🎯 LongScore</strong> — peer-reviewed degradation metric (100-LongBench, ACL 2025). Look up any model in RULER + HELMET KBs (n=93). See how much your model actually drops past short context.",
+ "help.v088.longscore.title": "🎯 LongScore",
+ "help.v088.longscore.body": "Every long-ctx LLM claims 128K but degrades long before that. The 100-LongBench paper (ACL 2025, arXiv:2505.19293) noticed that raw long-ctx scores are dominated by base ability — a smarter model with a worse long-ctx recipe still scores higher than a less-smart model with a better recipe, masking the actual long-ctx degradation. They propose <strong>LongScore</strong>: <code>LC_l = (S_l − Base) / Base</code> with <code>Base = mean(S_short)</code>, then average over long lengths. Result: a relative-degradation number per model that compares apples to apples. This tafagent mode embeds LongScore-ready data: RULER aggregate per-context (n=33 models, 4K-128K) + HELMET aggregate at 128K (n=60 models, 7 task categories). Lookup is exact-match by HF model id (lowercase, dashes, dots normalized). For models with RULER data, you get the full LongScore + per-length breakdown + verdict (no/mild/moderate/severe/extreme degradation). For HELMET-only models, you get the 7-category aggregate at 128K. <em>Use case</em>: 'I want to use Llama-3.1-70B-Instruct for 100K-token doc summarization — how much accuracy do I actually lose?' → paste id, see -10% LongScore (moderate degradation, mostly the 128K cliff). Decide whether to use it, switch to a model with engineered long-ctx, or chunk your input.",
+
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
  "help.v081.hub.title": "🧭 Solutions Hub",
  "help.v081.hub.body": "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
 
@@ -1965,6 +2003,44 @@ export const TRANSLATIONS = {
  "help.v087.tax.title": "🌍 Impuesto de Tokenizer Multilingüe",
  "help.v087.tax.body": "Los tokenizers gravan el texto no-inglés de forma asimétrica. El mismo párrafo puede ser 100 tokens en inglés pero 250+ en chino en un tokenizer entrenado en Latin (Llama, Phi). Tanto coste-por-request COMO contexto efectivo degradan silenciosamente. Esta tool carga HuggingFace transformers.js en tu navegador (~750 KB CDN) y tokeniza el texto pegado contra 6 tokenizers preset de vendor (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude aprox). Output: token count por tokenizer + chars-per-token + ratio vs baseline + interpretación de asimetría. Auto-detecta bloques de script (Latin / CJK / árabe / cirílico / devanagari / tailandés / griego / hebreo / coreano) para que veas por qué un tokenizer es 3× otro. <em>Caso de uso</em>: 'Mi soporte multilingüe añadió 30% a la factura — ¿qué idioma cuesta más?' → pega texto real de producción, ve breakdown exacto por tokenizer.",

+ // v0.8.8 — anti-bullshit pack #14: LongScore (RULER + HELMET lookup)
+ "modes.longscore": "🎯 LongScore",
+ "mode_desc.longscore": "Consulta la degradación relativa de tu modelo más allá del contexto corto. KBs RULER + HELMET (n=93 modelos). Métrica LongScore de 100-LongBench (ACL 2025).",
+ "longscore.title": "🎯 LongScore",
+ "longscore.tip": "Cada modelo dice tener ventana de 128K, pero la accuracy degrada mucho antes. LongScore (métrica peer-reviewed de 100-LongBench, ACL 2025) mide la degradación relativa más allá del contexto corto. Separa la base ability de la capacidad real long-ctx — comparas degradación, no scores brutos. Lookup contra KBs RULER + HELMET (n=93 modelos).",
+ "longscore.desc": "<strong>¿Cuánto degrada tu modelo más allá del contexto corto?</strong> Pega un id de modelo HF → ve LongScore (degradación relativa) + breakdown por longitud + scores HELMET 7-task cuando estén disponibles. Sin GPU. Sin inferencia. Lookup puro contra benchmarks publicados.",
+ "longscore.input_label": "Id del modelo:",
+ "longscore.input.placeholder": "ej. Qwen2.5-72B-Instruct o meta-llama/Llama-3.1-70B-Instruct",
+ "longscore.lookup_btn": "🔎 Buscar",
+ "longscore.example_good_btn": "↳ Ejemplo: Jamba-1.5-Large (sin degradación)",
+ "longscore.example_mid_btn": "↳ Ejemplo: Llama-3.1-70B (moderado)",
+ "longscore.example_bad_btn": "↳ Ejemplo: dbrx (severo)",
+ "longscore.formula_note": "💡 <strong>LongScore</strong> = media sobre l ∈ {16K, 32K, 64K, 128K} de (S_l − Base) / Base, donde Base = media(S_4K, S_8K). Fuente: <a href=\"https://arxiv.org/abs/2505.19293\" target=\"_blank\">100-LongBench, ACL 2025</a>. Datos: <a href=\"https://github.com/NVIDIA/RULER\" target=\"_blank\">NVIDIA RULER</a> (per-length, n=33) + <a href=\"https://github.com/princeton-nlp/HELMET\" target=\"_blank\">HELMET</a> (agregado a 128K, n=60). 0 = sin degradación; -0.30 = severo.",
+ "longscore.miss.title": "Modelo no encontrado en KB",
+ "longscore.miss.body": "Buscado <code>{id}</code>. KB tiene {n} modelos. Prueba un id HF canónico (ej. <code>Qwen2.5-72B-Instruct</code>, <code>Llama-3.1-70B-Instruct</code>, <code>Jamba-1.5-Mini</code>).",
+ "longscore.miss.suggest": "Comprueba cobertura en",
+ "longscore.no_ruler": "⚠ Sin datos per-length — LongScore no computable. Mostrando agregado HELMET a 128K.",
+ "longscore.score_label": "LongScore",
+ "longscore.helmet_label": "Breakdown HELMET 7-task",
+ "longscore.col.ctx": "Contexto",
+ "longscore.col.score": "Score",
+ "longscore.col.lc": "LC",
+ "longscore.col.task": "Tarea",
+ "longscore.source_note": "Fuente",
+ "longscore.hint.empty": "⚠ Pega un id de modelo primero.",
+ "longscore.status.lookup": "⏳ Buscando…",
+ "longscore.status.miss": "ℹ Modelo no en KB",
+ "longscore.status.ruler_hit": "✅ Datos RULER per-length encontrados",
+ "longscore.status.helmet_only": "ℹ Solo agregado HELMET (sin datos per-length)",
+ "longscore.verdict.no_degradation": "✅ Sin degradación más allá del contexto corto",
+ "longscore.verdict.mild": "🟢 Degradación leve (<10%)",
+ "longscore.verdict.moderate": "🟠 Degradación moderada (10-20%)",
+ "longscore.verdict.severe": "🔴 Degradación severa (20-30%)",
+ "longscore.verdict.extreme": "🚨 Degradación extrema (>30%)",
+ "inv.v088.longscore": "<strong>🎯 LongScore</strong> — métrica de degradación peer-reviewed (100-LongBench, ACL 2025). Lookup de cualquier modelo en KBs RULER + HELMET (n=93). Ve cuánto cae tu modelo en realidad más allá del contexto corto.",
+ "help.v088.longscore.title": "🎯 LongScore",
+ "help.v088.longscore.body": "Cada LLM long-ctx dice 128K pero degrada mucho antes. El paper 100-LongBench (ACL 2025, arXiv:2505.19293) notó que los scores brutos long-ctx están dominados por base ability — un modelo más smart con peor receta long-ctx puntúa más que uno menos smart con mejor receta, ocultando la degradación real. Proponen <strong>LongScore</strong>: <code>LC_l = (S_l − Base) / Base</code> con <code>Base = media(S_short)</code>, luego promedio sobre longitudes largas. Resultado: número de degradación relativa por modelo que compara apples to apples. Este mode tafagent embebe datos LongScore-ready: agregado RULER per-context (n=33 modelos, 4K-128K) + agregado HELMET a 128K (n=60 modelos, 7 categorías). Lookup es match exacto por id HF (lowercase, dashes, dots normalizados). Para modelos con datos RULER, obtienes el LongScore completo + breakdown per-length + verdict (no/leve/moderado/severo/extremo). Para modelos solo-HELMET, obtienes el agregado 7-categorías a 128K. <em>Caso de uso</em>: 'quiero usar Llama-3.1-70B-Instruct para resumen de docs 100K-token — ¿cuánta accuracy pierdo realmente?' → pega id, ve -10% LongScore (degradación moderada, sobre todo el cliff a 128K). Decide si usarlo, cambiar a un modelo con long-ctx engineered, o chunkear tu input.",
+
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
  "help.v081.hub.title": "🧭 Solutions Hub",
  "help.v081.hub.body": "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
 
@@ -3041,6 +3117,44 @@ export const TRANSLATIONS = {
  "help.v087.tax.title": "🌍 Taxe Tokenizer Multilingue",
  "help.v087.tax.body": "Les tokenizers taxent le texte non-anglais de façon asymétrique. Le même paragraphe peut faire 100 tokens en anglais mais 250+ en chinois sur un tokenizer entraîné en Latin (Llama, Phi). Coût-par-requête ET contexte effectif dégradent silencieusement. Cet outil charge HuggingFace transformers.js dans votre navigateur (~750 KB CDN) et tokenize le texte collé contre 6 tokenizers preset de fournisseurs (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Sortie : token count par tokenizer + chars-per-token + ratio vs baseline + interprétation d'asymétrie. Auto-détecte les blocs de script (Latin / CJK / arabe / cyrillique / devanagari / thaï / grec / hébreu / coréen) pour voir pourquoi un tokenizer est 3× un autre. <em>Cas d'usage</em> : 'Mon support multilingue a ajouté 30% à la facture — quelle langue coûte le plus ?' → collez du texte de production réel, voyez le breakdown exact par tokenizer.",

+ // v0.8.8 — anti-bullshit pack #14 : LongScore (RULER + HELMET lookup)
+ "modes.longscore": "🎯 LongScore",
+ "mode_desc.longscore": "Recherchez la dégradation relative de votre modèle au-delà du contexte court. KBs RULER + HELMET (n=93 modèles). Métrique LongScore de 100-LongBench (ACL 2025).",
+ "longscore.title": "🎯 LongScore",
+ "longscore.tip": "Chaque modèle prétend une fenêtre 128K, mais la précision dégrade bien avant. LongScore (métrique peer-reviewed de 100-LongBench, ACL 2025) mesure la dégradation relative au-delà du contexte court. Sépare la base ability de la vraie capacité long-ctx — vous comparez la dégradation, pas les scores bruts. Lookup contre KBs RULER + HELMET (n=93 modèles).",
+ "longscore.desc": "<strong>Combien votre modèle dégrade-t-il au-delà du contexte court ?</strong> Collez un id modèle HF → voyez LongScore (dégradation relative) + breakdown par longueur + scores HELMET 7-task quand disponibles. Pas de GPU. Pas d'inférence. Lookup pur contre des benchmarks publiés.",
+ "longscore.input_label": "Id du modèle :",
+ "longscore.input.placeholder": "ex. Qwen2.5-72B-Instruct ou meta-llama/Llama-3.1-70B-Instruct",
+ "longscore.lookup_btn": "🔎 Rechercher",
+ "longscore.example_good_btn": "↳ Exemple : Jamba-1.5-Large (sans dégradation)",
+ "longscore.example_mid_btn": "↳ Exemple : Llama-3.1-70B (modéré)",
+ "longscore.example_bad_btn": "↳ Exemple : dbrx (sévère)",
+ "longscore.formula_note": "💡 <strong>LongScore</strong> = moyenne sur l ∈ {16K, 32K, 64K, 128K} de (S_l − Base) / Base, où Base = moyenne(S_4K, S_8K). Source : <a href=\"https://arxiv.org/abs/2505.19293\" target=\"_blank\">100-LongBench, ACL 2025</a>. Données : <a href=\"https://github.com/NVIDIA/RULER\" target=\"_blank\">NVIDIA RULER</a> (per-length, n=33) + <a href=\"https://github.com/princeton-nlp/HELMET\" target=\"_blank\">HELMET</a> (agrégat à 128K, n=60). 0 = pas de dégradation ; -0.30 = sévère.",
+ "longscore.miss.title": "Modèle non trouvé en KB",
+ "longscore.miss.body": "Recherché <code>{id}</code>. KB contient {n} modèles. Essayez un id HF canonique (ex. <code>Qwen2.5-72B-Instruct</code>, <code>Llama-3.1-70B-Instruct</code>, <code>Jamba-1.5-Mini</code>).",
+ "longscore.miss.suggest": "Vérifiez la couverture sur",
+ "longscore.no_ruler": "⚠ Pas de données per-length — LongScore non calculable. Affichage agrégat HELMET à 128K.",
+ "longscore.score_label": "LongScore",
+ "longscore.helmet_label": "Breakdown HELMET 7-task",
+ "longscore.col.ctx": "Contexte",
+ "longscore.col.score": "Score",
+ "longscore.col.lc": "LC",
+ "longscore.col.task": "Tâche",
+ "longscore.source_note": "Source",
+ "longscore.hint.empty": "⚠ Collez un id modèle d'abord.",
+ "longscore.status.lookup": "⏳ Recherche…",
+ "longscore.status.miss": "ℹ Modèle pas en KB",
+ "longscore.status.ruler_hit": "✅ Données RULER per-length trouvées",
+ "longscore.status.helmet_only": "ℹ Agrégat HELMET seulement (pas de données per-length)",
+ "longscore.verdict.no_degradation": "✅ Pas de dégradation au-delà du contexte court",
+ "longscore.verdict.mild": "🟢 Dégradation légère (<10%)",
+ "longscore.verdict.moderate": "🟠 Dégradation modérée (10-20%)",
+ "longscore.verdict.severe": "🔴 Dégradation sévère (20-30%)",
+ "longscore.verdict.extreme": "🚨 Dégradation extrême (>30%)",
+ "inv.v088.longscore": "<strong>🎯 LongScore</strong> — métrique de dégradation peer-reviewed (100-LongBench, ACL 2025). Lookup de tout modèle dans KBs RULER + HELMET (n=93). Voyez combien votre modèle chute réellement au-delà du contexte court.",
+ "help.v088.longscore.title": "🎯 LongScore",
+ "help.v088.longscore.body": "Chaque LLM long-ctx prétend 128K mais dégrade bien avant. Le paper 100-LongBench (ACL 2025, arXiv:2505.19293) a remarqué que les scores long-ctx bruts sont dominés par la base ability — un modèle plus smart avec une moins bonne recette long-ctx score plus qu'un moins smart avec une meilleure recette, masquant la vraie dégradation. Ils proposent <strong>LongScore</strong> : <code>LC_l = (S_l − Base) / Base</code> avec <code>Base = moyenne(S_short)</code>, puis moyenne sur les longueurs longues. Résultat : un nombre de dégradation relative par modèle qui compare apples to apples. Ce mode tafagent embarque les données LongScore-ready : agrégat RULER per-context (n=33 modèles, 4K-128K) + agrégat HELMET à 128K (n=60 modèles, 7 catégories). Lookup est match exact par id HF (lowercase, dashes, dots normalisés). Pour les modèles avec données RULER, vous obtenez le LongScore complet + breakdown per-length + verdict (pas/légère/modérée/sévère/extrême). Pour les modèles HELMET-only, vous obtenez l'agrégat 7-catégories à 128K. <em>Cas d'usage</em> : 'je veux utiliser Llama-3.1-70B-Instruct pour résumé de docs 100K-token — combien de précision je perds vraiment ?' → collez l'id, voyez -10% LongScore (dégradation modérée, surtout le cliff à 128K). Décidez de l'utiliser, passer à un modèle avec long-ctx engineered, ou chunker votre input.",
+
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
  "help.v081.hub.title": "🧭 Solutions Hub",
  "help.v081.hub.body": "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
 
4231
  "help.v087.tax.title": "🌍 多语言 Tokenizer 税",
4232
  "help.v087.tax.body": "Tokenizer 对非英语文本的征税不对称。同一段落在英语中可能是 100 个 token,但在拉丁字母训练的 tokenizer(Llama、Phi)上的中文可能是 250+ 个 token。每次请求成本和有效上下文都会静默降级。这个工具在你的浏览器中加载 HuggingFace transformers.js(~750 KB CDN),并对粘贴的文本运行 6 个预设供应商 tokenizer(Qwen2.5、Phi-3.5、Llama-3.1、Gemma-2、GPT-4 cl100k、Claude 近似)的 tokenize。输出:每个 tokenizer 的 token 数 + 字符/token + 相对于 baseline 的比率 + 成本不对称解读。自动检测脚本块(拉丁/CJK/阿拉伯/西里尔/天城/泰/希腊/希伯来/韩文)让你看到为什么一个 tokenizer 是另一个的 3 倍。<em>用例</em>:『我的多语言支持给账单加了 30%——哪种语言成本最高?』→ 粘贴真实生产文本,查看每个 tokenizer 的精确分解。",
4233
 
4234
+ // v0.8.8 — anti-bullshit pack #14:LongScore (RULER + HELMET 查询)
4235
+ "modes.longscore": "🎯 LongScore",
4236
+ "mode_desc.longscore": "查询你的模型在短上下文之外的相对降级。RULER + HELMET KB(n=93 模型)。LongScore 指标来自 100-LongBench (ACL 2025)。",
4237
+ "longscore.title": "🎯 LongScore",
4238
+ "longscore.tip": "每个模型都声称 128K 上下文窗口,但准确率早就开始降级。LongScore(来自 100-LongBench, ACL 2025 的 peer-reviewed 指标)测量相对于短上下文的降级。将基础能力与真正的长上下文能力解耦——你比较的是降级,而不是原始分数。在 RULER + HELMET KB 中查询(n=93 模型)。",
4239
+ "longscore.desc": "<strong>你的模型在短上下文之外降级多少?</strong> 粘贴 HF 模型 id → 查看 LongScore(相对降级)+ 每长度分解 + HELMET 7-task 分数(如有)。无 GPU。无推理。纯查询已发布的 benchmark。",
4240
+ "longscore.input_label": "模型 id:",
4241
+ "longscore.input.placeholder": "例如 Qwen2.5-72B-Instruct 或 meta-llama/Llama-3.1-70B-Instruct",
4242
+ "longscore.lookup_btn": "🔎 查询",
4243
+ "longscore.example_good_btn": "↳ 示例:Jamba-1.5-Large(无降级)",
4244
+ "longscore.example_mid_btn": "↳ 示例:Llama-3.1-70B(中等)",
4245
+ "longscore.example_bad_btn": "↳ 示例:dbrx(严重)",
4246
+ "longscore.formula_note": "💡 <strong>LongScore</strong> = 在 l ∈ {16K, 32K, 64K, 128K} 上的 (S_l − Base) / Base 平均值,其中 Base = mean(S_4K, S_8K)。来源:<a href=\"https://arxiv.org/abs/2505.19293\" target=\"_blank\">100-LongBench, ACL 2025</a>。数据:<a href=\"https://github.com/NVIDIA/RULER\" target=\"_blank\">NVIDIA RULER</a>(每长度,n=33)+ <a href=\"https://github.com/princeton-nlp/HELMET\" target=\"_blank\">HELMET</a>(128K 聚合,n=60)。0 = 无降级;-0.30 = 严重。",
4247
+ "longscore.miss.title": "KB 中未找到模型",
4248
+ "longscore.miss.body": "查询了 <code>{id}</code>。KB 包含 {n} 个模型。请尝试规范 HF id(例如 <code>Qwen2.5-72B-Instruct</code>、<code>Llama-3.1-70B-Instruct</code>、<code>Jamba-1.5-Mini</code>)。",
4249
+ "longscore.miss.suggest": "在以下位置检查覆盖范围",
4250
+ "longscore.no_ruler": "⚠ 无每长度数据 — LongScore 无法计算。改为显示 128K 处的 HELMET 聚合。",
4251
+ "longscore.score_label": "LongScore",
4252
+ "longscore.helmet_label": "HELMET 7-task 分解",
4253
+ "longscore.col.ctx": "上下文",
4254
+ "longscore.col.score": "分数",
4255
+ "longscore.col.lc": "LC",
4256
+ "longscore.col.task": "任务",
4257
+ "longscore.source_note": "数据源",
4258
+ "longscore.hint.empty": "⚠ 请先粘贴模型 id。",
4259
+ "longscore.status.lookup": "⏳ 查询中…",
4260
+ "longscore.status.miss": "ℹ 模型不在 KB 中",
4261
+ "longscore.status.ruler_hit": "✅ 找到 RULER 每长度数据",
4262
+ "longscore.status.helmet_only":"ℹ 仅 HELMET 聚合(无每长度数据)",
4263
+ "longscore.verdict.no_degradation": "✅ 短上下文之外无降级",
4264
+ "longscore.verdict.mild": "🟢 轻度降级(<10%)",
4265
+ "longscore.verdict.moderate": "🟠 中度降级(10-20%)",
4266
+ "longscore.verdict.severe": "🔴 严重降级(20-30%)",
4267
+ "longscore.verdict.extreme": "🚨 极端降级(>30%)",
4268
+ "inv.v088.longscore": "<strong>🎯 LongScore</strong> — peer-reviewed 降级指标(100-LongBench, ACL 2025)。在 RULER + HELMET KB(n=93)中查询任意模型。看你的模型在短上下文之外实际下降多少。",
4269
+ "help.v088.longscore.title": "🎯 LongScore",
4270
+ "help.v088.longscore.body": "每个长上下文 LLM 都声称 128K,但早就开始降级。100-LongBench 论文 (ACL 2025, arXiv:2505.19293) 注意到原始长上下文分数被基础能力主导——一个更聪明但长上下文配方更差的模型,仍然得分高于一个不那么聪明但配方更好的模型,掩盖了真正的长上下文降级。他们提出 <strong>LongScore</strong>:<code>LC_l = (S_l − Base) / Base</code>,其中 <code>Base = mean(S_short)</code>,然后对长长度取平均。结果:每个模型一个相对降级数字,可以同等比较。这个 tafagent 模式嵌入了 LongScore-ready 数据:RULER 每上下文聚合(n=33 模型,4K-128K)+ HELMET 128K 聚合(n=60 模型,7 类别)。查询是按 HF 模型 id 精确匹配(小写、连字符、点号已规范化)。对于有 RULER 数据的模型,你得到完整的 LongScore + 每长度分解 + 判定(无/轻/中/严重/极端降级)。对于仅 HELMET 模型,你得到 128K 处的 7-类别聚合。<em>用例</em>:『我想用 Llama-3.1-70B-Instruct 做 100K-token 文档摘要——实际上我损失多少准确率?』→ 粘贴 id,看到 -10% LongScore(中度降级,主要是 128K 处的 cliff)。决定是否使用、改用 long-ctx engineered 的模型,或者分块输入。",
4271
+
4272
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
4273
  "help.v081.hub.title": "🧭 Solutions Hub",
4274
  "help.v081.hub.body": "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别(评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性),每个映射到(a)解决它的 tafagent 模式(若存在),以及(b)社区已信任的最佳外部工具(RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等)。搜索框匹配 pain、场景和工具名称。<em>用例</em>:'我有问题 X — tafagent 解决它吗,如果不,谁解决?'",
js/longscore.js ADDED
@@ -0,0 +1,105 @@
+ // longscore.js — pure logic for the 🎯 LongScore mode.
+ //
+ // Looks up an HF-style model id in data/longscore_kb.json and returns:
+ //   - exact match: ruler_per_ctx (if available) + ruler_long_score (computed) + helmet aggregate
+ //   - HELMET-only: aggregate scores at 128K, no LongScore (no per-length data)
+ //   - miss: fallback for unknown models
+ //
+ // No UI strings — emits codes + params; main.js translates via i18n.
+ //
+ // LongScore formula (100-LongBench, ACL 2025, arXiv:2505.19293, §3.2):
+ //   Base      = mean(S_4K, S_8K)
+ //   LC_l      = (S_l - Base) / Base
+ //   LongScore = mean(LC_l for l in {16K, 32K, 64K, 128K})
+ //
+ // More negative = worse long-ctx retention.
+
+ let KB = null;
+
+ export async function loadKB() {
+   if (KB) return KB;
+   const res = await fetch("data/longscore_kb.json");
+   if (!res.ok) throw new Error("longscore_kb fetch failed: " + res.status);
+   KB = await res.json();
+   return KB;
+ }
+
+ export function normalize(name) {
+   if (!name) return "";
+   let s = String(name).toLowerCase().trim();
+   s = s.replace(/^(meta-llama\/|01-ai\/|ai21labs\/|nvidia\/|princeton-nlp\/|unsloth\/)/, "");
+   s = s.replace(/_/g, "-").replace(/\./g, "-");
+   s = s.replace(/([a-z])(\d)/g, "$1-$2");
+   s = s.replace(/(\d)([a-z])/g, "$1-$2");
+   s = s.replace(/-+/g, "-");
+   // -inst → -instruct (both at end and in middle, before next -segment)
+   s = s.replace(/-inst(?=-|$)/g, "-instruct");
+   return s;
+ }
+
+ /** Classify LongScore avg into verdict code. */
+ export function classify(longscore_avg, thresholds) {
+   if (longscore_avg === null || longscore_avg === undefined) return "no_data";
+   if (longscore_avg >= thresholds.no_degradation) return "no_degradation";
+   if (longscore_avg >= thresholds.mild) return "mild";
+   if (longscore_avg >= thresholds.moderate) return "moderate";
+   if (longscore_avg >= thresholds.severe) return "severe";
+   return "extreme";
+ }
+
+ /** Look up a model and return a structured result. */
+ export async function lookup(rawId) {
+   const kb = await loadKB();
+   const id = normalize(rawId);
+   const entry = kb.models[id];
+   if (!entry) {
+     return {
+       code: "miss",
+       normalized_id: id,
+       n_kb_total: kb.stats.n_total,
+     };
+   }
+
+   const longscore = entry.ruler_long_score;
+   const verdict = longscore
+     ? classify(longscore.avg_lc, kb.thresholds)
+     : null;
+
+   return {
+     code: longscore ? "ruler_hit" : (entry.helmet ? "helmet_only" : "partial"),
+     display_name: entry.display_name,
+     normalized_id: id,
+     ruler_per_ctx: entry.ruler_per_ctx,
+     ruler_long_score: longscore,
+     helmet: entry.helmet,
+     recipe_class: entry.recipe_class,
+     params_b: entry.params_b,
+     native_context_k: entry.native_context_k,
+     source: entry.source,
+     verdict,
+     thresholds: kb.thresholds,
+   };
+ }
+
+ /** Get sorted list of all model ids — for autocomplete. */
+ export async function listAllIds() {
+   const kb = await loadKB();
+   return Object.keys(kb.models).sort();
+ }
+
+ /** Top-N best/worst by LongScore (for sanity inspection). Optional helper. */
+ export async function rank(direction) {
+   const kb = await loadKB();
+   const items = Object.entries(kb.models)
+     .filter(([, m]) => m.ruler_long_score)
+     .map(([id, m]) => ({
+       id,
+       display_name: m.display_name,
+       recipe_class: m.recipe_class,
+       avg_lc: m.ruler_long_score.avg_lc,
+     }));
+   items.sort((a, b) =>
+     direction === "best" ? b.avg_lc - a.avg_lc : a.avg_lc - b.avg_lc
+   );
+   return items;
+ }
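
Consumer-side sketch of this module's API (how a caller like main.js below is expected to drive it; the model ids are just examples and all DOM wiring is omitted):

  import { lookup, rank } from "./longscore.js";

  const res = await lookup("meta-llama/Llama-3.1-70B-Instruct");
  if (res.code === "ruler_hit") {
    // full LongScore: avg_lc plus per-length LC values and a verdict code
    console.log(res.display_name, res.ruler_long_score.avg_lc, res.verdict);
  } else if (res.code === "helmet_only") {
    // no per-length data — only the 128K HELMET aggregate is available
    console.log(res.display_name, "HELMET overall:", res.helmet.overall);
  } else if (res.code === "miss") {
    console.log(`not in KB (${res.n_kb_total} models):`, res.normalized_id);
  }

  const worst = await rank("worst");  // most negative avg_lc first
  console.log(worst.slice(0, 3));

Note the id normalization: the `meta-llama/` prefix, dots, and case are all stripped before the KB lookup, so a raw Hub id works as-is.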
js/main.js CHANGED
@@ -35,6 +35,9 @@ import {
    tokenizeAll, detectLanguageBlocks,
    PRESET_TOKENIZERS as TAX_PRESETS, SAMPLE_TEXTS as TAX_SAMPLES,
  } from "./tokenizer_tax.js";
+ import {
+   loadKB as loadLongscoreKB, lookup as longscoreLookup, rank as longscoreRank,
+ } from "./longscore.js";

  // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
  // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -229,6 +232,7 @@ document.addEventListener("click", (e) => {
    cache: "cache-section",
    speculative: "speculative-section",
    tax: "tax-section",
+   longscore: "longscore-section",
    hub: "hub-section",
  }[targetMode];
  if (sectionId) {
@@ -254,7 +258,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
   "diagnose-section", "phase-section", "unmask-section",
   "template-section", "arena-section", "contam-section",
   "quant-section", "drift-section", "niah-section",
-  "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "hub-section"].forEach(id => {
+  "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "longscore-section", "hub-section"].forEach(id => {
    const el = $(id);
    if (el) el.style.display = "none";
  });
@@ -271,6 +275,7 @@
    cache: "cache-section",
    speculative: "speculative-section",
    tax: "tax-section",
+   longscore: "longscore-section",
    hub: "hub-section",
  };
  const sectionId = sectionMap[mode];
@@ -283,6 +288,7 @@
  if (mode === "cache") initCacheDiff();
  if (mode === "speculative") initSpeculative();
  if (mode === "tax") initTax();
+ if (mode === "longscore") initLongscore();
  if (mode === "hub") initHub();
  });
  });
@@ -3469,10 +3475,10 @@ async function initHub() {

  function renderEntry(e) {
    const modeBadge = e.tafagent_mode
-     ? `<span class="badge" style="background:#3fb950;">${e.tafagent_mode}</span>`
+     ? `<span class="badge" style="background:#3fb950;color:#fff;border-color:#3fb950;">${e.tafagent_mode}</span>`
      : (e.tafagent_planned_mode
-       ? `<span class="badge" style="background:#d29922;">${t("hub.planned") || "planned:"} ${e.tafagent_planned_mode}</span>`
-       : `<span class="badge" style="background:#6e7781;">${t("hub.no_mode") || "external"}</span>`);
+       ? `<span class="badge" style="background:#d29922;color:#1a1a1a;border-color:#d29922;">${t("hub.planned") || "planned:"} ${e.tafagent_planned_mode}</span>`
+       : `<span class="badge" style="background:#6e7781;color:#fff;border-color:#6e7781;">${t("hub.no_mode") || "external"}</span>`);
    const tools = (e.external_tools || [])
      .map(tl => {
        const icon = HUB_TYPE_BADGE[tl.type] || "🔗";
@@ -4410,6 +4416,167 @@ $("tax-sample-code-btn")?.addEventListener("click", () => {
  runTaxTokenize();
  });

+ // ════════════════════════════════════════════════════════════════════
+ // LongScore mode (v0.8.8 anti-bullshit pack #14)
+ // ════════════════════════════════════════════════════════════════════
+ let __longscoreInited = false;
+
+ function initLongscore() {
+   if (__longscoreInited) return;
+   __longscoreInited = true;
+   // Eager-load KB so the first lookup is instant (KB is ~70KB, no real cost)
+   loadLongscoreKB().catch(e => {
+     console.warn("longscore_kb preload failed", e);
+   });
+ }
+
+ function fmtPct(x, sign) {
+   if (x == null) return "—";
+   const v = (x * 100);
+   return `${sign && v >= 0 ? "+" : ""}${v.toFixed(1)}%`;
+ }
+
+ function lcColor(avg) {
+   if (avg == null) return "#8b949e";
+   if (avg >= -0.02) return "#3fb950"; // green: no degradation
+   if (avg >= -0.10) return "#a5d36a"; // light green
+   if (avg >= -0.20) return "#f0883e"; // orange
+   if (avg >= -0.30) return "#f85149"; // red
+   return "#a01b1b"; // dark red: extreme
+ }
+
+ function renderLongscoreResult(res) {
+   if (res.code === "miss") {
+     return `<div class="arena-result">
+       <p style="color:#f0883e;"><strong>${t("longscore.miss.title") || "Model not found in KB"}</strong></p>
+       <p>${tFmt("longscore.miss.body", { id: res.normalized_id, n: res.n_kb_total }) || `Looked up <code>${res.normalized_id}</code>. KB has ${res.n_kb_total} models. Try a canonical HF id (e.g. <code>Qwen2.5-72B-Instruct</code>, <code>Llama-3.1-70B-Instruct</code>, <code>Jamba-1.5-Mini</code>).`}</p>
+       <p class="subtle" style="font-size:0.85em;">${t("longscore.miss.suggest") || "Check coverage at"} <a href="https://github.com/NVIDIA/RULER" target="_blank">RULER</a> · <a href="https://github.com/princeton-nlp/HELMET" target="_blank">HELMET</a>.</p>
+     </div>`;
+   }
+
+   const verdictMap = {
+     no_degradation: { color: "#3fb950", label: t("longscore.verdict.no_degradation") || "✅ No degradation past short context" },
+     mild: { color: "#a5d36a", label: t("longscore.verdict.mild") || "🟢 Mild degradation (<10%)" },
+     moderate: { color: "#f0883e", label: t("longscore.verdict.moderate") || "🟠 Moderate degradation (10-20%)" },
+     severe: { color: "#f85149", label: t("longscore.verdict.severe") || "🔴 Severe degradation (20-30%)" },
+     extreme: { color: "#a01b1b", label: t("longscore.verdict.extreme") || "🚨 Extreme degradation (>30%)" },
+   };
+
+   let html = `<div class="arena-result">`;
+   html += `<p><strong>${escapeHtml(res.display_name)}</strong>`;
+   if (res.params_b) html += ` <span class="subtle">· ${res.params_b}B params</span>`;
+   if (res.recipe_class) html += ` <span class="subtle">· ${escapeHtml(res.recipe_class)}</span>`;
+   if (res.native_context_k) html += ` <span class="subtle">· native ctx ${res.native_context_k}K</span>`;
+   html += `</p>`;
+
+   // RULER per-length + LongScore
+   if (res.ruler_long_score) {
+     const ls = res.ruler_long_score;
+     const v = verdictMap[res.verdict] || { color: "#8b949e", label: res.verdict };
+     html += `<p style="margin-top:0.8em;font-size:1.1em;">
+       <strong>${t("longscore.score_label") || "LongScore"}:</strong>
+       <span style="color:${lcColor(ls.avg_lc)};font-family:monospace;font-size:1.2em;font-weight:bold;">${fmtPct(ls.avg_lc, true)}</span>
+       <span class="subtle">· Base = ${ls.base.toFixed(1)}% (mean of 4K, 8K)</span>
+     </p>`;
+     html += `<p style="color:${v.color};font-weight:bold;">${v.label}</p>`;
+
+     // Per-length bars
+     html += `<table class="lean-table" style="margin-top:0.8em;width:100%;">
+       <thead><tr>
+         <th style="text-align:left;">${t("longscore.col.ctx") || "Context"}</th>
+         <th style="text-align:right;">${t("longscore.col.score") || "Score"}</th>
+         <th style="text-align:right;">${t("longscore.col.lc") || "LC"}</th>
+       </tr></thead><tbody>`;
+     const ctxKeys = ["4k", "8k", "16k", "32k", "64k", "128k"];
+     for (const k of ctxKeys) {
+       const score = res.ruler_per_ctx?.[k];
+       if (score == null) continue;
+       const isShort = k === "4k" || k === "8k";
+       const lc = ls.per_length_lc?.[k];
+       html += `<tr ${isShort ? 'style="opacity:0.7;"' : ""}>
+         <td><strong>${k.toUpperCase()}</strong>${isShort ? ` <span class="subtle" style="font-size:0.8em;">(base)</span>` : ""}</td>
+         <td style="text-align:right;font-family:monospace;">${score.toFixed(1)}%</td>
+         <td style="text-align:right;font-family:monospace;color:${lcColor(lc)};">${lc != null ? fmtPct(lc, true) : "—"}</td>
+       </tr>`;
+     }
+     html += `</tbody></table>`;
+   } else {
+     // Helmet-only or partial
+     html += `<p style="margin-top:0.8em;color:#f0883e;">${t("longscore.no_ruler") || "⚠ No per-length data — LongScore not computable. Showing HELMET aggregate at 128K instead."}</p>`;
+   }
+
+   // HELMET breakdown if available
+   if (res.helmet) {
+     html += `<details style="margin-top:1em;" open>
+       <summary><strong>${t("longscore.helmet_label") || "HELMET 7-task breakdown"} (at 128K)</strong></summary>
+       <table class="lean-table" style="margin-top:0.5em;width:100%;">
+       <thead><tr>
+         <th style="text-align:left;">${t("longscore.col.task") || "Task"}</th>
+         <th style="text-align:right;">${t("longscore.col.score") || "Score"}</th>
+       </tr></thead><tbody>`;
+     if (res.helmet.overall != null) {
+       html += `<tr style="background:#1f2933;"><td><strong>Overall</strong></td><td style="text-align:right;font-family:monospace;"><strong>${res.helmet.overall.toFixed(1)}</strong></td></tr>`;
+     }
+     if (res.helmet.categories) {
+       for (const [task, score] of Object.entries(res.helmet.categories)) {
+         html += `<tr><td>${escapeHtml(task)}</td><td style="text-align:right;font-family:monospace;">${score != null ? score.toFixed(1) : "—"}</td></tr>`;
+       }
+     }
+     html += `</tbody></table></details>`;
+   }
+
+   html += `<p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;">
+     ${t("longscore.source_note") || "Data source"}: ${escapeHtml(res.source)} ·
+     <a href="https://arxiv.org/abs/2505.19293" target="_blank">LongScore metric</a>
+   </p>`;
+   html += `</div>`;
+   return html;
+ }
+
+ async function runLongscoreLookup() {
+   const id = $("longscore-input")?.value?.trim();
+   if (!id) {
+     $("longscore-status").textContent = t("longscore.hint.empty") || "⚠ Paste a model id first.";
+     return;
+   }
+   $("longscore-status").textContent = t("longscore.status.lookup") || "⏳ Looking up…";
+   $("longscore-output").innerHTML = "";
+   try {
+     const res = await longscoreLookup(id);
+     $("longscore-output").innerHTML = renderLongscoreResult(res);
+     if (res.code === "miss") {
+       $("longscore-status").textContent = t("longscore.status.miss") || "ℹ Model not in KB";
+     } else if (res.code === "ruler_hit") {
+       $("longscore-status").textContent = t("longscore.status.ruler_hit") || "✅ RULER per-length data found";
+     } else {
+       $("longscore-status").textContent = t("longscore.status.helmet_only") || "ℹ HELMET aggregate only (no per-length data)";
+     }
+   } catch (e) {
+     $("longscore-status").textContent = `❌ ${e.message || e}`;
+     console.error(e);
+   }
+ }
+
+ $("longscore-lookup-btn")?.addEventListener("click", runLongscoreLookup);
+ $("longscore-input")?.addEventListener("keydown", e => {
+   if (e.key === "Enter") {
+     e.preventDefault();
+     runLongscoreLookup();
+   }
+ });
+ $("longscore-example-good-btn")?.addEventListener("click", () => {
+   $("longscore-input").value = "Jamba-1.5-Large";
+   runLongscoreLookup();
+ });
+ $("longscore-example-mid-btn")?.addEventListener("click", () => {
+   $("longscore-input").value = "Llama-3.1-70B-Instruct";
+   runLongscoreLookup();
+ });
+ $("longscore-example-bad-btn")?.addEventListener("click", () => {
+   $("longscore-input").value = "dbrx";
+   runLongscoreLookup();
+ });
+
  // ════════════════════════════════════════════════════════════════════
  // Bootstrap
  // ════════════════════════════════════════════════════════════════════
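
One contract worth flagging before the tests: renderLongscoreResult above calls `tFmt(...)` (alongside the existing `t`, `escapeHtml`, and `$` helpers) to fill the `{id}` / `{n}` placeholders in `longscore.miss.body`. That helper is assumed to already exist elsewhere in main.js and is not part of this diff; a minimal interpolator compatible with how it is called would look like this hypothetical sketch:

  // Hypothetical — not in this commit; illustrates the contract renderLongscoreResult relies on.
  function tFmt(key, params) {
    const template = t(key);      // same i18n lookup used by the rest of main.js
    if (!template) return null;   // falsy return lets the `|| fallback` chains kick in
    return template.replace(/\{(\w+)\}/g, (m, name) =>
      params && params[name] != null ? String(params[name]) : m);
  }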
scripts/test_longscore.mjs ADDED
@@ -0,0 +1,72 @@
+ // Smoke test for js/longscore.js — verifies normalize, lookup, classify codes.
+ // Run: node scripts/test_longscore.mjs
+ import { readFileSync } from "fs";
+
+ // Mock fetch for Node ESM
+ globalThis.fetch = async (url) => {
+   const path = url.startsWith("data/") ? `./${url}` : url;
+   const txt = readFileSync(path, "utf-8");
+   return {
+     ok: true,
+     json: async () => JSON.parse(txt),
+   };
+ };
+
+ const { normalize, lookup, classify, rank } = await import("../js/longscore.js");
+
+ let pass = 0, fail = 0;
+ function check(name, cond, detail) {
+   if (cond) { pass++; console.log(`  ✓ ${name}`); }
+   else { fail++; console.log(`  ✗ ${name}${detail ? ": " + detail : ""}`); }
+ }
+
+ console.log("--- normalize ---");
+ check("trims + lowercases", normalize(" Qwen2.5 ") === "qwen-2-5");
+ check("strips meta-llama/", normalize("meta-llama/Llama-3.1-70B-Instruct") === "llama-3-1-70-b-instruct");
+ check("strips 01-ai/", normalize("01-ai/Yi-34B-200K") === "yi-34-b-200-k");
+ check("inst → instruct", normalize("Mistral-7B-Inst-v0.2") === "mistral-7-b-instruct-v-0-2");
+ check("dot → dash", normalize("Phi-3.5-mini-instruct") === "phi-3-5-mini-instruct");
+ check("empty", normalize("") === "");
+
+ console.log("\n--- classify ---");
+ const t = { no_degradation: -0.02, mild: -0.10, moderate: -0.20, severe: -0.30 };
+ check("no_data", classify(null, t) === "no_data");
+ check("no_degradation", classify(0.0, t) === "no_degradation");
+ check("mild", classify(-0.05, t) === "mild");
+ check("moderate", classify(-0.15, t) === "moderate");
+ check("severe", classify(-0.25, t) === "severe");
+ check("extreme", classify(-0.50, t) === "extreme");
+
+ console.log("\n--- lookup (RULER hit) ---");
+ const r1 = await lookup("Llama-3.1-70B-Instruct");
+ check("ruler_hit code", r1.code === "ruler_hit");
+ check("longscore present", typeof r1.ruler_long_score?.avg_lc === "number");
+ check("verdict assigned", r1.verdict !== null);
+ check("base ~96", r1.ruler_long_score?.base > 95 && r1.ruler_long_score?.base < 97,
+   `got base=${r1.ruler_long_score?.base}`);
+ check("Llama-3.1-70B avg_lc ~-0.10", Math.abs(r1.ruler_long_score?.avg_lc - (-0.1024)) < 0.001,
+   `got ${r1.ruler_long_score?.avg_lc}`);
+
+ console.log("\n--- lookup (Jamba — best LongScore) ---");
+ const r2 = await lookup("Jamba-1.5-Large");
+ check("ruler_hit", r2.code === "ruler_hit");
+ check("Jamba near-zero degradation", r2.ruler_long_score?.avg_lc > -0.02);
+
+ console.log("\n--- lookup (dbrx — severe) ---");
+ const r3 = await lookup("dbrx");
+ check("ruler_hit", r3.code === "ruler_hit");
+ check("dbrx severe verdict", r3.verdict === "severe" || r3.verdict === "extreme",
+   `got verdict=${r3.verdict} for avg_lc=${r3.ruler_long_score?.avg_lc}`);
+
+ console.log("\n--- lookup (miss) ---");
+ const r4 = await lookup("nonexistent-model-123");
+ check("miss code", r4.code === "miss");
+ check("normalized id present", r4.normalized_id === "nonexistent-model-123");
+
+ console.log("\n--- rank ---");
+ const ranking = await rank("worst");
+ check("ranking returned", Array.isArray(ranking) && ranking.length > 0);
+ check("worst is most negative", ranking[0].avg_lc < ranking[ranking.length - 1].avg_lc);
+
+ console.log(`\n${pass} passed, ${fail} failed`);
+ process.exit(fail > 0 ? 1 : 0);
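
A possible extension, not in this commit: the same harness could also assert KB invariants, since `stats.n_total` is what the miss message reports and `classify()` dereferences four threshold keys. Sketch, reusing the mocked fetch and `check()` above:

  // Hypothetical extra checks — field names match what lookup()/classify() already read.
  const kb = await (await fetch("data/longscore_kb.json")).json();
  check("n_total matches entry count", kb.stats.n_total === Object.keys(kb.models).length);
  check("thresholds complete", ["no_degradation", "mild", "moderate", "severe"]
    .every(k => typeof kb.thresholds[k] === "number"));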
scripts/test_longscore_e2e.mjs ADDED
@@ -0,0 +1,34 @@
+ // E2E lookup smoke for the 3 example buttons (Jamba/Llama/dbrx) + a HELMET-only model.
+ import { readFileSync } from "fs";
+ globalThis.fetch = async (url) => {
+   const path = url.startsWith("data/") ? `./${url}` : url;
+   return { ok: true, json: async () => JSON.parse(readFileSync(path, "utf-8")) };
+ };
+
+ const { lookup } = await import("../js/longscore.js");
+
+ const cases = [
+   { input: "Jamba-1.5-Large", expect: { code: "ruler_hit", verdict: "no_degradation" } },
+   { input: "Llama-3.1-70B-Instruct", expect: { code: "ruler_hit", verdict: "moderate" } },
+   { input: "dbrx", expect: { code: "ruler_hit", verdict: "extreme" } },
+   { input: "GPT-4", expect: { code: "helmet_only" } }, // HELMET-only
+   { input: "totally-fake-model-xyz", expect: { code: "miss" } },
+ ];
+
+ let pass = 0, fail = 0;
+ for (const c of cases) {
+   const r = await lookup(c.input);
+   const ok = r.code === c.expect.code &&
+     (!c.expect.verdict || r.verdict === c.expect.verdict);
+   if (ok) {
+     pass++;
+     const score = r.ruler_long_score ? `LongScore=${(r.ruler_long_score.avg_lc * 100).toFixed(1)}%` :
+       r.helmet ? `HELMET overall=${r.helmet.overall}` : "";
+     // parenthesize (r.verdict || "n/a") so padEnd applies to whichever value is used,
+     // not just the fallback string
+     console.log(`  ✓ ${c.input.padEnd(30)} → ${r.code.padEnd(12)} ${(r.verdict || "n/a").padEnd(15)} ${score}`);
+   } else {
+     fail++;
+     console.log(`  ✗ ${c.input.padEnd(30)} → got code=${r.code} verdict=${r.verdict}, expected=${JSON.stringify(c.expect)}`);
+   }
+ }
+ console.log(`\n${pass}/${pass + fail} cases pass`);
+ process.exit(fail > 0 ? 1 : 0);