<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>TAF Agent — Test ANY Transformer LLM in Your Browser</title>
<meta name="description" content="Free, auditable diagnostic for transformer LLMs. Predict viability (long-context, KV compression, training budget, hardware) from config alone. Runs entirely in your browser. No server, no auth, no cost." />
<meta name="keywords" content="transformer, LLM, diagnostic, RoPE, NIAH, KV cache, viability, free, browser, GPU, NeurIPS, TAF" />
<meta name="author" content="Carles Marin" />
<!-- OpenGraph for social sharing (Twitter, LinkedIn, WhatsApp, Discord, etc.) -->
<meta property="og:type" content="website" />
<meta property="og:url" content="https://karlesmarin.github.io/tafagent/" />
<meta property="og:title" content="TAF Agent — Test ANY Transformer LLM in Your Browser" />
<meta property="og:description" content="Free, auditable transformer LLM diagnostic. 8 recipes, 5 modes, 4 languages. Runs in your browser. No server, no auth, $0/month forever." />
<meta property="og:site_name" content="TAF Agent" />
<!-- Twitter Card -->
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:title" content="TAF Agent — Test ANY Transformer LLM in Your Browser" />
<meta name="twitter:description" content="Free, auditable transformer LLM diagnostic. 8 recipes, 5 modes, 4 languages. Runs in your browser. $0 forever." />
<!-- Theme color for browser UI -->
<meta name="theme-color" content="#0a0e14" />
<link rel="stylesheet" href="style.css" />
<script src="https://cdn.jsdelivr.net/pyodide/v0.26.4/full/pyodide.js"></script>
</head>
<body>
<a href="#mode-section" class="skip-link" data-i18n="a11y.skip">Skip to main content</a>
<header>
<!-- Language switcher (top-right, round flags) -->
<div class="lang-switcher">
<button class="lang-btn" data-lang="en" data-label="English" title="English" aria-label="Switch language to English">🇬🇧</button>
<button class="lang-btn" data-lang="es" data-label="Español" title="Español" aria-label="Cambiar idioma a Español">🇪🇸</button>
<button class="lang-btn" data-lang="fr" data-label="Français" title="Français" aria-label="Changer la langue en Français">🇫🇷</button>
<button class="lang-btn" data-lang="zh" data-label="中文" title="中文" aria-label="切换语言至中文">🇨🇳</button>
</div>
<h1 data-i18n="hero.title">🔬 TAF Agent</h1>
<p class="tagline" data-i18n="hero.tagline">
Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup.
</p>
<p class="subtle" data-i18n="hero.subtitle">
Predicts whether a model will work for your use case <em>before</em> you spend money or time. Everything runs in your browser &mdash; your inputs never leave this tab.
</p>
<p class="subtle" style="margin-top:0.25rem; font-size:0.85rem;" data-i18n="hero.about">
Built by an independent researcher. Open source. Not affiliated with any model vendor.
</p>
<p class="hero-buttons">
<button id="quickstart-btn" type="button" data-i18n="hero.quickstart_btn">⚡ Quick start</button>
<button id="inventory-btn" type="button" data-i18n="hero.inventory_btn">🧰 What it gives you</button>
<button id="help-btn" type="button" data-i18n="hero.help">📘 Manual &amp; examples</button>
</p>
</header>
<!-- Help modal -->
<div id="help-modal" role="dialog" aria-modal="true" aria-labelledby="help-modal-title" aria-hidden="true">
<div class="help-content">
<button class="help-close" id="help-close" aria-label="Close help">×</button>
<h2 id="help-modal-title" data-i18n="help.title">📘 TAF Agent — User Manual</h2>
<h3 data-i18n="help.what.title">What does it do?</h3>
<p data-i18n="help.what.body">Predicts <strong>practical viability</strong> of any transformer LLM
<em>before you spend GPU/$</em>. Answers questions like "will this model work at L=32K?" or
"should I train custom or use API?" using deterministic Python formulas (TAF — Thermodynamic Attention Framework).</p>
<h3 data-i18n="help.modes.title">How to use — 7 modes</h3>
<p data-i18n="help.modes.profile"><strong>📇 Profile</strong>: paste model id → all recipes at once = TAF Card. <strong>Best starting point</strong>.</p>
<p data-i18n="help.modes.compare"><strong>🆚 Compare</strong>: 2-3 models side-by-side on same recipe. Best when choosing between candidates.</p>
<p data-i18n="help.modes.inspector"><strong>🔍 Inspect config</strong>: paste raw <code>config.json</code> → tool parses + runs full Profile. For private models, in-development configs, or models not yet on HF Hub.</p>
<p data-i18n="help.modes.ask"><strong>💬 Ask plain English</strong>: free-form question, in-browser LLM picks the recipe. Best for casual exploration.</p>
<p data-i18n="help.modes.recipe"><strong>📋 Recipe + form</strong>: manual selection, full parameter control. Best when you want exact control.</p>
<p data-i18n="help.modes.diagnose"><strong>🩺 Diagnose CLI</strong>: generate Python command to measure γ on your local machine (transformers + numpy). Fast ≈5 min CPU; full ≈20–60 min GPU. Output JSON re-uploadable via Inspect.</p>
<p data-i18n="help.modes.phase"><strong>📊 Phase diagram</strong>: scatter plot of 23 panel models on (log θ, γ) plane. Hagedorn line γ=1 separates Phase A from Phase B. Click a dot to load that model into Recipe form.</p>
<h3 data-i18n="help.recipes.title">The 8 recipes available</h3>
<p data-i18n="help.recipe.x1.title"><strong>X-1 Custom training vs API</strong> — compares cost of training your own model vs paying for API access.</p>
<div class="help-example" data-i18n="help.recipe.x1.example">
Try: <em>"Should I train an 8B custom model or use GPT-4o for 50M tokens/month?"</em><br>
Answer types: YES (custom) / NO (API) with break-even months.
</div>
<p data-i18n="help.recipe.x2.title"><strong>X-2 Long Context Viability</strong> — predicts if a model serves a target context length reliably.</p>
<div class="help-example" data-i18n="help.recipe.x2.example">
Try: <em>"Will Meta-Llama-3-8B handle 32000 tokens for retrieval?"</em><br>
Chains: γ_Padé → decomposition → d_horizon → NIAH ceiling → hallucination → KV memory.<br>
Verdict: YES / DEGRADED / NO with mitigation if needed.
</div>
<p data-i18n="help.recipe.x3.title"><strong>X-3 Budget pre-flight</strong> — given $ budget, what model is feasible to train?</p>
<div class="help-example" data-i18n="help.recipe.x3.example">
Try: <em>"I have $5000, what model can I train?"</em><br>
Answer: GO / TINY-MODEL / MEMORY-LIMITED with concrete N (params) and D (tokens).
</div>
<p data-i18n="help.recipe.x5.title"><strong>X-5 Hardware selection</strong> — which GPU should I use to serve at target throughput?</p>
<div class="help-example" data-i18n="help.recipe.x5.example">
Try: <em>"Cheapest hardware to serve Llama-3-8B at 10M tokens/day"</em><br>
Answer: best GPU + $/Mtok + capacity vs target.
</div>
<p data-i18n="help.recipe.x19.title"><strong>X-19 KV Compression decision</strong> — should I use soft decay, hard cutoff, or literature methods?</p>
<div class="help-example" data-i18n="help.recipe.x19.example">
Try: <em>"How to compress KV cache for Qwen2.5-7B at 32K?"</em><br>
Answer: USE SOFT DECAY / USE D_f CUTOFF / USE LITERATURE METHODS / USE HARD T_train.
</div>
<h3 style="margin-top: 1.5em;" data-i18n="help.divider.v04_s29">— v0.4 (sesión 29 findings) —</h3>
<p data-i18n="help.section.v04"><strong>What's new in v0.4</strong> (sesión 29 findings 2026-04-28): three diagnostic recipes derived from cross-model panel analysis (n=22 LLMs).</p>
<p data-i18n="help.recipe.x21.title"><strong>X-21 Imprint Purity Diagnostic</strong> — predicts γ on RANDOM tokens via ν=−1/(2π); how clean is the model's RoPE prediction?</p>
<div class="help-example" data-i18n="help.recipe.x21.example">
Try: <em>"How clean is the RoPE prediction on Llama-3-8B?"</em><br>
Answer: predicted γ_random + purity diagnostic (CLEAN / OVER-IMPRINTED / UNDER-IMPRINTED).
</div>
<p data-i18n="help.v04.imprint" style="font-size: 0.9em; opacity: 0.85;"><strong>Learned-imprint slope ν = −1/(2π)</strong>: RoPE rotation period 2π drives a positional bias on weights, proportional to log(N_params). Even random tokens show this scaling. ν is DERIVED — not fitted (empirical err 0.3%).</p>
<p data-i18n="help.recipe.x22.title"><strong>X-22 Compute-Context Invariant</strong> — does γ × log(N²·D) lie in panel band 51.2 ± 16.8? Detects scaling/training anomalies.</p>
<div class="help-example" data-i18n="help.recipe.x22.example">
Try: <em>"Does Mistral-7B fit the compute-context invariant?"</em><br>
Answer: K = γ·log(N²·D), z-score, IN-BAND or OUTLIER.
</div>
<p data-i18n="help.v04.invariant" style="font-size: 0.9em; opacity: 0.85;"><strong>Chinchilla-attention invariant K</strong>: γ × log(N²·D) ≈ 51.2 ± 16.8 (CV=0.329). Connects compute scaling and attention exponent into a single dimensionless number.</p>
<p data-i18n="help.recipe.x23.title"><strong>X-23 IH-Phase Detector</strong> — pre- or post-induction-head? Cheap probe via sign(γ_text − γ_random).</p>
<div class="help-example" data-i18n="help.recipe.x23.example">
Try: <em>"Is Qwen2.5-7B post-induction-head?"</em><br>
Answer: CONFIRMED PRE-IH / CONFIRMED POST-IH / ANOMALY (with size-vs-Δγ consistency check).
</div>
<p data-i18n="help.v04.ih_probe" style="font-size: 0.9em; opacity: 0.85;"><strong>Δγ as IH probe</strong>: sign(γ_text − γ_random) > 0 ⟺ post-induction-head. Cheaper than running an in-context-learning benchmark.</p>
<p data-i18n="help.v04.constants" style="font-size: 0.9em; opacity: 0.85;"><strong>γ-cluster on famous constants</strong> (intriguing, n=4): CodeLlama-13b γ=0.382 ≈ 1−1/φ (golden conjugate, err 0.0003); pythia-1.4b γ=0.705 ≈ 1/√2; Llama-2-7b γ=0.287 ≈ 1−1/√2; Mistral-Nemo γ=0.428 ≈ log_10(e). Caveat: could be coincidence.</p>
<h3 style="margin-top: 1.5em;" data-i18n="v04.title">🆕 v0.4 — New diagnostics (sesion 31)</h3>
<p style="opacity: 0.85;"><em data-i18n="v04.section.intro">Four new diagnostic functions derived sesion 31 (2026-04-30) from cross-of-crosses formula games + Sócratic interrogation. Available in <code>taf_browser.py</code> §33.</em></p>
<p><strong data-i18n="v04.arch.label">Architectural Concentration</strong><span data-i18n="v04.arch.desc">γ_text ≈ γ_Padé − 0.012·n_kv. Cross-panel correlational law (R²=0.30). Caveat: not per-model predictor.</span></p>
<p><strong data-i18n="v04.pdi.label">PDI — Padé Deviation Index</strong><span data-i18n="v04.pdi.desc">PDI = d_horizon_obs/T_eval. Traffic light: green (≈1), orange (>>1), yellow (<<1), red (Phase B negative).</span></p>
<p><strong data-i18n="v04.4bit.label">4-bit Shift Predictor</strong><span data-i18n="v04.4bit.desc">MHA: R²(bf16)<0.9 → γ rises; R²>0.99 → γ drops. GQA: precision-robust regardless.</span></p>
<p><strong data-i18n="v04.crit.label">Critical Exponents Bundle</strong><span data-i18n="v04.crit.desc">ν_c, β_c, η_c (=γ−1, CORRECTED), α_C, γ_susc with AM-GM minimum at γ=1−1/√2≈0.293.</span></p>
<h3 style="margin-top: 1.5em;" data-i18n="v05.title">🔬 v0.5 — Machine-verified consistency (sesion 32)</h3>
<p style="opacity: 0.85;"><em data-i18n="v05.section.intro">Sage Groebner basis + Lean Mathlib4 dual-tool verification of <strong>15 algebraic identities</strong> of TAF critical exponents. First transformer-attention framework with formal machine-proof backing.</em></p>
<p><strong data-i18n="v05.verify.label">Algebraic Consistency Check</strong><span data-i18n="v05.verify.desc">Given measured γ, verifies 12 D-SAGE identities (D-SAGE-1: 2η²+η·γ_χ+1=0, β·χ=−1, α+χ=2, etc.). All passing = framework intact. Failures indicate bf16 outliers / quantization artifacts.</span></p>
<p><strong data-i18n="v05.dsage1.label">D-SAGE-1 (★★ core)</strong><span data-i18n="v05.dsage1.desc">Quadratic identity 2η² + η·γ_χ + 1 = 0 (Sage Groebner-discovered, Lean-verified). Replaces incorrect 'triple closure' claim. Refutes paper 1's η=2γ algebraically.</span></p>
<p><strong data-i18n="v05.erratum.label">Paper 1 erratum — η correction</strong><span data-i18n="v05.erratum.desc">Paper 1 originally claimed η = 2γ. Sage Groebner + Lean Mathlib4 proved this fails (residual (-4γ³+5γ+1)/(1-γ) > 0 ∀γ ∈ Phase A). Correct value: η = γ−1, satisfying D-SAGE-1.</span></p>
<p><strong data-i18n="v05.repro.label">Reproducibility</strong><span data-i18n="v05.repro.desc">All 15 theorems machine-proof in Lean Mathlib4 (1973 jobs build success). Sage script: <code>analysis/sage_recursive_sweep_2026-04-30.sage</code>. Lean code: <code>lean_taf/taf/Taf/Identities.lean</code>.</span></p>
<h3 style="margin-top: 1.5em;" data-i18n="help.v06.title">🆕 v0.6 — γ predicted-vs-observed + Cardy ΔH + Lean badges</h3>
<p style="opacity: 0.85;" data-i18n="help.v06.intro"><em>v0.6 (2026-05-06): three new diagnostics live in the TAF Card under <strong>🔬 Diagnostics</strong>. All run in your browser; γ_observed comes from the Diagnose CLI on real weights.</em></p>
<p><strong data-i18n="help.v06.layout.title">TAF Card layout (new in v0.6)</strong></p>
<p data-i18n="help.v06.layout.body">After clicking <strong>🚀 Generate full profile</strong> the card shows: a <strong>hero strip</strong> on top (architecture class + meta + 3 pills: aggregate verdict ✅/⚠/❌, γ headline, 🧲 Anti-Ising if Phase A) and four <strong>expandable sections</strong>: <strong>📋 Recipes</strong> (open by default — verdict per dimension), <strong>🔬 Diagnostics</strong> (key numbers, γ predicted vs observed, what-if explorer), <strong>✓ Verification</strong> (Sage+Lean algebraic consistency, falsification F1-F23), <strong>📂 Provenance &amp; share</strong> (calibration audit + JSON download / share link / registry submit). Click any header to expand. Every variable has an inline <strong></strong> tooltip.</p>
<p><strong data-i18n="help.v06.gamma_check.title">γ predicted vs observed</strong></p>
<p data-i18n="help.v06.gamma_check.body">Enter the empirically-measured γ from your model and the tool computes <strong>η = θ_eff_obs / θ_eff_Padé</strong> and classifies into one of 5 regimes:</p>
<ul style="font-size: 0.92em;">
<li data-i18n="help.v06.case.normal"><strong>Normal</strong> (η ∈ [0.85, 1.15]) — model uses its full nominal context. <em>Use case</em>: validate a new release before adopting it.</li>
<li data-i18n="help.v06.case.fraud"><strong>Fraud</strong>&lt; 0.01) — nominal θ inflated; model behaves as if θ ≪ advertised. <em>Use case</em>: detect YaRN/marketing inflation (CodeLlama / Mistral-Nemo pattern).</li>
<li data-i18n="help.v06.case.compressed"><strong>Compressed</strong>&lt; 0.5) — context compressed; model attends shorter than nominal θ. <em>Use case</em>: spot RLHF/instruction-tuning compression (LLaMA-2 pattern).</li>
<li data-i18n="help.v06.case.overpade"><strong>Over-Padé</strong>&gt; 1.5) — model attends farther than Padé predicts. <em>Use case</em>: identify Lerch-corrected regime or undertrained early checkpoints (pythia-1b pattern).</li>
<li data-i18n="help.v06.case.swa"><strong>SWA random-corpus</strong> (γ_obs &gt; 1.05 with random_corpus=Yes) — sliding-window attention signature. <em>Use case</em>: confirm Mistral / Gemma SWA on random tokens.</li>
</ul>
<p><strong data-i18n="help.v06.cardy.title">Cardy ΔH diagnostic</strong></p>
<p data-i18n="help.v06.cardy.body"><strong>ΔH_Cardy = log(θ_eff_obs / θ_nominal)</strong>. Entropy shift between observed effective θ and nominal θ. Strong negative = compression entropy; near zero = nominal match. Complements η for borderline cases.</p>
<p><strong data-i18n="help.v06.lean.title">Lean + Mathlib verification badges</strong></p>
<p data-i18n="help.v06.lean.body">TAF identities (Anti-Ising, D-SAGE-1 quadratic, Padé z-substitution, etc.) are formally machine-proven in Lean Mathlib4. Source: <a href="https://github.com/karlesmarin/lean-taf" target="_blank">github.com/karlesmarin/lean-taf</a>. Anyone can clone + <code>lake build</code> to re-verify. The 🧲 Anti-Ising pill in the hero strip is one such badge.</p>
<p><strong data-i18n="help.v06.glossary.title">Variable glossary (also embedded in TAF Card)</strong></p>
<p data-i18n="help.v06.glossary.body">Every variable in the TAF Card has an inline ⓘ tooltip. The complete list: γ, γ_Padé, γ_decomposed, γ_observed, θ, θ_eff_obs, θ_eff_Padé, η, ΔH_Cardy, χ, d_horizon, L_NIAH, KV memory, regime. Hover any ⓘ for the definition + paper section.</p>
<h3 data-i18n="help.add_models.title">Adding new models (3 ways)</h3>
<ul>
<li data-i18n="help.add_models.preset"><strong>Preset list</strong>: 11 popular models curated. Just select from dropdown.</li>
<li data-i18n="help.add_models.hf"><strong>HF Hub fetch</strong>: paste any model id (e.g. <code>Qwen/Qwen2.5-32B-Instruct</code>),
click 📥 Fetch. Browser downloads <code>config.json</code> directly from HuggingFace, fills the form. Works for any public model.</li>
<li data-i18n="help.add_models.manual"><strong>Manual</strong>: fill the form fields directly with values from the model card.</li>
</ul>
<h3 style="margin-top: 1.5em;" data-i18n="help.v07.title">🆕 v0.7 — Anti-bullshit pack (4 new modes)</h3>
<p style="opacity: 0.85;" data-i18n="help.v07.intro"><em>v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.</em></p>
<p><strong data-i18n="help.v07.unmask.title">🪟 Context Unmasker</strong></p>
<p data-i18n="help.v07.unmask.body">Detects when <code>max_position_embeddings</code> is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. <em>Use case</em>: before paying GPU for 32k context, verify the model actually attends that far.</p>
<p><strong data-i18n="help.v07.template.title">📜 Chat-template Sniffer</strong></p>
<p data-i18n="help.v07.template.body">Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting <code>--apply_chat_template</code> silently halves multi-turn accuracy. <em>Use case</em>: before reporting a benchmark score, confirm you applied the template correctly.</p>
<p><strong data-i18n="help.v07.arena.title">🎯 Arena-Elo CI Reconstructor</strong></p>
<p data-i18n="help.v07.arena.body">Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring "model A beats model B", verify their CIs don't overlap.</p>
<p><strong data-i18n="help.v07.contam.title">🧪 Contamination Prior</strong></p>
<p data-i18n="help.v07.contam.body">Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.</p>
<p><strong data-i18n="help.v07.quant.title">⚖️ Quant-regime Classifier</strong></p>
<p data-i18n="help.v07.quant.body">Predicts γ-shift and ΔPPL for any (model × quant scheme: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8, …). Architecture-aware: small d_head + aggressive GQA → more sensitive; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4). Recommends safer alternatives if a cliff is detected. <em>Use case</em>: before quantizing, predict whether your specific architecture × scheme combo will keep PPL acceptable, with a concrete switch-to suggestion otherwise.</p>
<p><strong data-i18n="help.v07.drift.title">🔀 Cross-framework Drift Bound</strong></p>
<p data-i18n="help.v07.drift.body">Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the &quot;Load sample&quot; button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.</p>
<p><strong data-i18n="help.v07.niah.title">🔍 NIAH → Reasoning Gap</strong></p>
<p data-i18n="help.v07.niah.body">RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's "safe reasoning context" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. <em>Use case</em>: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.</p>
<p><strong data-i18n="help.v08.saturation.title">📈 Benchmark Saturation Detector</strong></p>
<p data-i18n="help.v08.saturation.body">MMLU is saturated (top 88-94%), AIME 2025 saturated within months of release, HumanEval near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict — saturated / near-saturated / discriminative — plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. <em>Use case</em>: before you cite '92% on MMLU' or design an eval, check whether the benchmark still discriminates anything.</p>
<p><strong data-i18n="help.v082.cot.title">📋 JSON CoT-aware Linter</strong></p>
<p data-i18n="help.v082.cot.body">Constrained-decoding engines (llguidance, Outlines, SGLang grammars) emit JSON properties in the order your schema declares them. If you write <code>{ answer, reasoning }</code> the model commits to <code>answer</code> first and CoT collapses into post-hoc justification. Paste any schema (or example response) — the linter classifies each field as <em>reasoning</em>, <em>answer</em>, or <em>other</em>, flags the ordering, and emits a reordered fix you can copy back. <em>Use case</em>: 'My CoT prompt works in plaintext but degrades under JSON mode' → run linter, find the inverted order, fix.</p>
<p><strong data-i18n="help.v083.peft.title">🔧 PEFT Anti-Pattern Checker</strong></p>
<p data-i18n="help.v083.peft.body">PEFT's <code>get_peft_model(base, config)</code> creates a FRESH adapter — it does not load saved weights from a path. Users who paste tutorial code and try to resume from a checkpoint silently throw away their training. peft #2115 has the canonical bug report. This linter scans your training script for the pattern + 3 related issues (QLoRA ordering, target_modules/arch mismatch, lora_alpha ratio) and reports findings with line numbers and suggested fixes. <em>Use case</em>: before you launch a 10-hour LoRA fine-tune, paste your script — catch the silent bugs in 200ms.</p>
<p><strong data-i18n="help.v084.cache.title">🔁 Prompt-Cache Diff Predictor</strong></p>
<p data-i18n="help.v084.cache.body">Provider prompt caches each have different rules: Anthropic's <code>cache_control</code> breaks at the first token diff in the marked prefix; OpenAI auto-caches prefixes ≥1024 tokens; Gemini context caches require ≥32K tokens. A misplaced edit silently 10x's your bill — the API never warns you, and the cost only shows up on the next invoice. Paste old + new prompt, the predictor finds the longest common prefix, estimates tokens with three tokenizer profiles (English / code / CJK), and shows per-provider hit ratio + $ delta vs no-cache for Claude Opus/Sonnet/Haiku, GPT-5/mini, and Gemini 2.5 Pro. <em>Use case</em>: 'I tweaked the system prompt and the bill jumped — what broke?' → paste both prompts, see exactly which provider stopped caching.</p>
<p><strong data-i18n="help.v085.speculative.title">🔬 Speculative-Decode Compatibility</strong></p>
<p data-i18n="help.v085.speculative.body">Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.</p>
<p><strong data-i18n="help.v087.tax.title">🌍 Multilingual Tokenizer Tax</strong></p>
<p data-i18n="help.v087.tax.body">Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.</p>
<p><strong data-i18n="help.v088.longscore.title">🎯 LongScore</strong></p>
<p data-i18n="help.v088.longscore.body">Every long-ctx LLM claims 128K but degrades long before that. The 100-LongBench paper (ACL 2025, arXiv:2505.19293) noticed that raw long-ctx scores are dominated by base ability. They propose <strong>LongScore</strong>: <code>LC_l = (S_l − Base) / Base</code> with <code>Base = mean(S_short)</code>, then average over long lengths. This tafagent mode embeds LongScore-ready data: RULER aggregate per-context (n=33 models, 4K-128K) + HELMET aggregate at 128K (n=60 models, 7 task categories). Lookup is exact-match by HF model id. <em>Use case</em>: 'I want to use Llama-3.1-70B-Instruct for 100K-token doc summarization — how much accuracy do I actually lose?' → paste id, see -10% LongScore (moderate degradation, mostly the 128K cliff).</p>
<p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p>
<p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p>
<h3 data-i18n="help.audit.title">The audit chain</h3>
<p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs,
output, and interpretation. Click any step to expand. Cite section numbers (§26.1, §19.1, etc.) refer
to the underlying paper for derivation.</p>
<h3 data-i18n="help.synthesis.title">The plain-English answer</h3>
<p data-i18n="help.synthesis.body">After the deterministic chain runs, an in-browser LLM (Qwen2.5-0.5B, ~350MB cached after first load)
synthesizes a plain-English summary. The numbers above are <em>always correct</em> (deterministic Python);
the synthesis is LLM-generated — verify against the chain if in doubt.</p>
<h3 data-i18n="help.params.title">Common parameters explained</h3>
<ul>
<li data-i18n="help.param.theta"><strong>θ (rope_theta)</strong>: RoPE base frequency. Higher = more long-range capacity. Typical: 10000 (early), 500000 (Llama-3), 1000000 (Qwen2.5).</li>
<li data-i18n="help.param.T_train"><strong>T_train</strong>: max context the model was trained on. From <code>max_position_embeddings</code>.</li>
<li data-i18n="help.param.T_eval"><strong>T_eval</strong>: <em>your target</em> inference context length. The key knob.</li>
<li data-i18n="help.param.gqa"><strong>n_kv_heads &lt; n_attention_heads</strong>: model uses GQA (Grouped Query Attention). Reduces KV memory but pushes γ toward Hagedorn.</li>
<li data-i18n="help.param.swa"><strong>has_SWA</strong>: model uses Sliding Window Attention (Mistral, gemma-2).</li>
<li data-i18n="help.param.nparams"><strong>n_params</strong>: total parameter count. Threshold ~400M for induction-head emergence.</li>
</ul>
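<p>The same parameters, read straight out of a <code>config.json</code> dict — a sketch; the fallbacks follow the notes above:</p>
<pre><code>def taf_params(cfg):
    n_attn = cfg["num_attention_heads"]
    return {
        "theta":    cfg.get("rope_theta", 10000),   # 10000 = early-model default
        "T_train":  cfg["max_position_embeddings"],
        "n_attn":   n_attn,
        "n_kv":     cfg.get("num_key_value_heads", n_attn),  # fewer than n_attn = GQA
        "d_head":   cfg.get("head_dim", cfg["hidden_size"] // n_attn),
        "n_layers": cfg["num_hidden_layers"],
        "has_swa":  cfg.get("sliding_window") is not None,
    }
</code></pre>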
<h3 data-i18n="help.verdicts.title">What to look for in verdicts</h3>
<ul>
<li data-i18n="help.verdict.yes"><strong style="color:#3fb950;">YES / GO</strong> — proceed with confidence; numbers support the choice.</li>
<li data-i18n="help.verdict.deg"><strong style="color:#d29922;">DEGRADED / TINY-MODEL</strong> — works but with caveats; read the action.</li>
<li data-i18n="help.verdict.no"><strong style="color:#f85149;">NO / MEMORY-LIMITED</strong> — don't proceed as-is; mitigation provided.</li>
</ul>
<h3 data-i18n="help.privacy.title">Privacy</h3>
<p data-i18n="help.privacy.body">Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model
runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.</p>
<h3 data-i18n="help.source.title">Source &amp; paper</h3>
<p data-i18n="help.source.body">Source code: <a href="https://github.com/karlesmarin/tafagent" target="_blank">github.com/karlesmarin/tafagent</a><br>
Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href="https://zenodo.org/records/19826343" target="_blank">Zenodo</a>; arXiv forthcoming)<br>
Dataset: <a href="https://huggingface.co/datasets/karlexmarin/taf-attention-decay" target="_blank">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)</p>
</div>
</div>
<!-- Quick start modal (3 steps) -->
<div id="quickstart-modal" role="dialog" aria-modal="true" aria-labelledby="qs-modal-title" aria-hidden="true">
<div class="help-content">
<button class="help-close" id="quickstart-close" aria-label="Close quick start">×</button>
<h2 id="qs-modal-title" data-i18n="qs.title">⚡ Quick start</h2>
<ol class="qs-steps">
<li data-i18n="qs.step1">Paste a HuggingFace model ID (e.g. <code>meta-llama/Meta-Llama-3-8B</code>)</li>
<li data-i18n="qs.step2">Click <strong>📇 Profile a model</strong></li>
<li data-i18n="qs.step3">Read your TAF Card &mdash; verdict per use case + key numbers + math verified by Lean+Mathlib</li>
</ol>
<p class="qs-cta">
<a href="#mode-section" class="btn-primary" id="qs-start-link" data-i18n="qs.cta">↓ Start now</a>
</p>
</div>
</div>
<!-- Inventory modal: what you get -->
<div id="inventory-modal" role="dialog" aria-modal="true" aria-labelledby="inv-modal-title" aria-hidden="true">
<div class="help-content">
<button class="help-close" id="inventory-close" aria-label="Close inventory">×</button>
<h2 id="inv-modal-title" data-i18n="inv.title">🧰 What this tool gives you</h2>
<div class="inventory-grid">
<details class="inv-card" open>
<summary class="inv-card-title" data-i18n="inv.recipes.title">🎯 8 recipes &mdash; does this model fit your use case?</summary>
<ul>
<li><strong data-i18n="inv.recipes.x1.title">Custom train vs API</strong>: <span data-i18n="inv.recipes.x1.body">which is cheaper for your traffic?</span></li>
<li><strong data-i18n="inv.recipes.x2.title">Long context</strong>: <span data-i18n="inv.recipes.x2.body">will it handle 32k / 128k tokens reliably?</span></li>
<li><strong data-i18n="inv.recipes.x3.title">Budget</strong>: <span data-i18n="inv.recipes.x3.body">with $X, what model can you train from scratch?</span></li>
<li><strong data-i18n="inv.recipes.x5.title">Hardware</strong>: <span data-i18n="inv.recipes.x5.body">which GPU to serve N tokens/day?</span></li>
<li><strong data-i18n="inv.recipes.x19.title">KV cache</strong>: <span data-i18n="inv.recipes.x19.body">how to compress without breaking quality?</span></li>
<li><strong data-i18n="inv.recipes.x21.title">Imprint purity</strong>: <span data-i18n="inv.recipes.x21.body">how clean is the model's positional encoding?</span></li>
<li><strong data-i18n="inv.recipes.x22.title">Compute-context</strong>: <span data-i18n="inv.recipes.x22.body">does the model fit the empirical band?</span></li>
<li><strong data-i18n="inv.recipes.x23.title">IH-phase</strong>: <span data-i18n="inv.recipes.x23.body">pre- or post-induction-head?</span></li>
</ul>
</details>
<details class="inv-card" open>
<summary class="inv-card-title" data-i18n="inv.diag.title">🔬 Diagnostics</summary>
<ul>
<li data-i18n="inv.diag.gamma"><strong>γ predicted vs observed</strong> &mdash; auto-classifies the model into 5 regimes (normal · fraud / inflated context · compressed · over-Padé · sliding-window)</li>
<li data-i18n="inv.diag.cardy"><strong>Cardy ΔH</strong> &mdash; entropy shift between observed and nominal context</li>
<li data-i18n="inv.diag.fals"><strong>Falsification dashboard</strong> &mdash; checks 23 specific predictions (F1–F23)</li>
<li data-i18n="inv.diag.alg"><strong>Algebraic consistency</strong> &mdash; 8 mathematical identities the model must satisfy</li>
</ul>
</details>
<details class="inv-card" open>
<summary class="inv-card-title" data-i18n="inv.verify.title">✓ Formally verified math</summary>
<ul>
<li data-i18n="inv.verify.count"><strong>37 theorems</strong> machine-proven in Lean 4 + Mathlib4</li>
<li data-i18n="inv.verify.click">Click any badge → opens the source line on GitHub</li>
<li data-i18n="inv.verify.reverify">Verify yourself: <code>lake build</code> (≈5 s after cache fetch)</li>
</ul>
</details>
<details class="inv-card" open>
<summary class="inv-card-title" data-i18n="inv.export.title">📤 Export &amp; share</summary>
<ul>
<li data-i18n="inv.export.formats"><strong>JSON · Markdown · LaTeX</strong> (paper-ready)</li>
<li data-i18n="inv.export.share">Reproducible share link (state encoded in URL)</li>
<li data-i18n="inv.export.registry">Submit to community registry on GitHub</li>
</ul>
</details>
<details class="inv-card" open>
<summary class="inv-card-title" data-i18n="inv.v07.title">🆕 v0.7 anti-bullshit pack</summary>
<ul>
<li data-i18n="inv.v07.unmask"><strong>🪟 Unmask</strong> — config.json claims 32k? See if it actually attends that far</li>
<li data-i18n="inv.v07.template"><strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy</li>
<li data-i18n="inv.v07.arena"><strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides</li>
<li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li>
<li data-i18n="inv.v07.quant"><strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo</li>
<li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li>
<li data-i18n="inv.v07.niah"><strong>🔍 NIAH→Reason</strong> — does your "128k context" actually reason there, or just retrieve?</li>
<li data-i18n="inv.v08.saturation"><strong>📈 Saturation</strong> — is your benchmark still useful, or are all frontier models tied at the top?</li>
<li data-i18n="inv.v082.cot"><strong>📋 JSON CoT</strong> — lints structured-output schemas for the answer-before-reasoning anti-pattern that silently breaks Chain-of-Thought.</li>
<li data-i18n="inv.v083.peft"><strong>🔧 PEFT Lint</strong> — catches the silent <code>get_peft_model</code> base-load (peft #2115) + QLoRA order + target_modules / arch mismatch.</li>
<li data-i18n="inv.v084.cache"><strong>🔁 Cache Diff</strong> — predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.</li>
<li data-i18n="inv.v085.speculative"><strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).</li>
<li data-i18n="inv.v087.tax"><strong>🌍 Token Tax</strong> — real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).</li>
<li data-i18n="inv.v088.longscore"><strong>🎯 LongScore</strong> — peer-reviewed degradation metric (100-LongBench, ACL 2025). Lookup any model in RULER + HELMET KBs (n=93). See how much your model actually drops past short context.</li>
<li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li>
</ul>
</details>
</div>
<details class="arch-supported" open>
<summary data-i18n="arch.summary">Architectures supported (click to expand)</summary>
<div class="arch-badges">
<span class="badge">✓ RoPE-MHA <span class="info"><span class="tooltip" data-i18n="tooltip.mha">Multi-Head Attention: each token position attends through several parallel heads at once.</span></span></span>
<span class="badge">✓ RoPE-GQA <span class="info"><span class="tooltip" data-i18n="tooltip.gqa">Grouped Query Attention: queries share fewer keys/values than heads (saves memory but pushes γ toward Hagedorn).</span></span></span>
<span class="badge">✓ ALiBi <span class="info"><span class="tooltip" data-i18n="tooltip.alibi">Attention with Linear Biases: position info is a learned slope added to attention scores, no rotation.</span></span></span>
<span class="badge">✓ AbsPE <span class="info"><span class="tooltip" data-i18n="tooltip.abspe">Absolute Position Embeddings: each position has a fixed learned vector added to the token embedding.</span></span></span>
<span class="badge">✓ SWA <span class="info"><span class="tooltip" data-i18n="tooltip.swa">Sliding Window Attention: each token only attends within a fixed local window (Mistral, gemma-2 use this).</span></span></span>
<span class="badge">✓ SSM (Mamba) <span class="info"><span class="tooltip" data-i18n="tooltip.ssm">State Space Model: a sequence layer that maintains internal state instead of attention (Mamba, Jamba use this).</span></span></span>
<span class="badge" data-i18n="arch.anyhf">✓ Any HuggingFace public model</span>
</div>
</details>
</div>
</div>
<main>
<!-- Status with loading bar -->
<section id="status-bar">
<div id="status" data-i18n="status.loading_pyodide">⏳ Loading Python runtime...</div>
<div id="loading-bar-wrap" style="display:none;">
<div id="loading-bar"></div>
</div>
</section>
<!-- v0.7.7 — Task tiles: friendlier entry point, groups all 22 modes by user intent. -->
<section id="task-tiles" aria-labelledby="tiles-title">
<h2 id="tiles-title" data-i18n="tiles.title">🎯 What do you want to do?</h2>
<p class="recipe-desc" data-i18n="tiles.subtitle">Pick a task. Each one opens the right tool below. Or scroll down for the full list of 22 modes.</p>
<div class="tiles-grid">
<div class="task-tile">
<h3>
<span data-i18n="tile.diagnose.title">🔬 Diagnose a model</span>
<span class="info"><span class="tooltip" data-i18n="tile.diagnose.tip">Start here when you have a specific model id and want a full diagnostic: <strong>Profile</strong> runs all 5 recipes at once. <strong>Unmask</strong> checks if max_position_embeddings is honest. <strong>NIAH→Reason</strong> predicts retrieval-vs-reasoning gap. <strong>LongScore</strong> looks up published RULER + HELMET data for the model and shows real degradation past short context (peer-reviewed metric). <strong>Quant</strong> predicts whether quantizing will break it. <strong>Inspect</strong> lets you paste raw config.json for private/in-dev models.</span></span>
</h3>
<p class="tile-desc" data-i18n="tile.diagnose.desc">Will this specific model work for my use case?</p>
<div class="tile-modes">
<button data-mode-link="profile" data-i18n="modes.profile">📇 Profile a model</button>
<button data-mode-link="unmask" data-i18n="modes.unmask">🪟 Unmask</button>
<button data-mode-link="niah" data-i18n="modes.niah">🔍 NIAH→Reason</button>
<button data-mode-link="longscore" data-i18n="modes.longscore">🎯 LongScore</button>
<button data-mode-link="quant" data-i18n="modes.quant">⚖️ Quant</button>
<button data-mode-link="inspector" data-i18n="modes.inspector">🔍 Inspect config</button>
</div>
</div>
<div class="task-tile">
<h3>
<span data-i18n="tile.trust.title">✓ Trust a benchmark score</span>
<span class="info"><span class="tooltip" data-i18n="tile.trust.tip">When you see a score and want to know if it's real. <strong>Contamination</strong> rates 20+ benchmarks for likelihood the model saw them during training. <strong>Drift</strong> tells you if a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). <strong>Arena CI</strong> reconstructs the confidence intervals Chatbot Arena hides — many top-Elo "wins" are statistically tied.</span></span>
</h3>
<p class="tile-desc" data-i18n="tile.trust.desc">Should I believe this number? Bug or noise?</p>
<div class="tile-modes">
<button data-mode-link="contam" data-i18n="modes.contam">🧪 Contamination</button>
<button data-mode-link="drift" data-i18n="modes.drift">🔀 Drift</button>
<button data-mode-link="arena" data-i18n="modes.arena">🎯 Arena CI</button>
<button data-mode-link="saturation" data-i18n="modes.saturation">📈 Saturation</button>
<button data-mode-link="hub" data-i18n="modes.hub">🧭 Solutions</button>
</div>
</div>
<div class="task-tile">
<h3>
<span data-i18n="tile.eval.title">⚙️ Set up an eval correctly</span>
<span class="info"><span class="tooltip" data-i18n="tile.eval.tip">Before you run lm-eval-harness or vLLM serve, get the right CLI flag. <strong>Chat-template Sniffer</strong> detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact <code>--apply_chat_template</code> / <code>--chat-template</code> invocation. Solves issue #1841 in lm-eval-harness (silent ÷2 accuracy). <strong>Diagnose CLI</strong> generates the Python command to measure γ_obs on your local GPU.</span></span>
</h3>
<p class="tile-desc" data-i18n="tile.eval.desc">Get the exact CLI flag for lm-eval / vLLM / transformers.</p>
<div class="tile-modes">
<button data-mode-link="template" data-i18n="modes.template">📜 Chat-template</button>
<button data-mode-link="diagnose" data-i18n="modes.diagnose">🩺 Diagnose CLI</button>
<button data-mode-link="cot" data-i18n="modes.cot">📋 JSON CoT</button>
<button data-mode-link="peft" data-i18n="modes.peft">🔧 PEFT Lint</button>
<button data-mode-link="cache" data-i18n="modes.cache">🔁 Cache Diff</button>
<button data-mode-link="speculative" data-i18n="modes.speculative">🔬 Spec-Decode</button>
<button data-mode-link="tax" data-i18n="modes.tax">🌍 Token Tax</button>
</div>
</div>
<div class="task-tile">
<h3>
<span data-i18n="tile.compare.title">🆚 Compare models</span>
<span class="info"><span class="tooltip" data-i18n="tile.compare.tip"><strong>Compare</strong>: pick 2-3 candidate models + one recipe, see verdicts in a side-by-side table (e.g. Llama-3-8B vs Mistral-7B at 32k context). <strong>Phase diagram</strong>: scatter of 23 empirical models on the (log θ, γ) plane, with the Padé curve overlaid. Hover dots for details, click to load that model into the Recipe form.</span></span>
</h3>
<p class="tile-desc" data-i18n="tile.compare.desc">Side-by-side, or browse the empirical model landscape.</p>
<div class="tile-modes">
<button data-mode-link="compare" data-i18n="modes.compare">🆚 Compare models</button>
<button data-mode-link="phase" data-i18n="modes.phase">📊 Phase diagram</button>
</div>
</div>
<div class="task-tile">
<h3>
<span data-i18n="tile.manual.title">📋 Manual / free-form</span>
<span class="info"><span class="tooltip" data-i18n="tile.manual.tip"><strong>Recipe</strong>: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. <strong>Ask</strong>: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for "what would happen if..." exploration.</span></span>
</h3>
<p class="tile-desc" data-i18n="tile.manual.desc">Pick a specific recipe by hand, or ask in plain English.</p>
<div class="tile-modes">
<button data-mode-link="recipe" data-i18n="modes.recipe">📋 Pick recipe</button>
<button data-mode-link="ask" data-i18n="modes.ask">💬 Ask plain English</button>
</div>
</div>
</div>
</section>
<!-- Mode toggle -->
<section id="mode-section">
<h2><span data-i18n="modes.title">🎯 Mode</span>
<span class="info"><span class="tooltip" data-i18n="modes.tip"><strong>7 modes available.</strong> Most users want 📇 Profile (one-click full diagnosis).<br>
<strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br>
<strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br>
<strong>🔍 Inspect</strong>: paste raw config.json to debug parameters.<br>
<strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br>
<strong>📋 Recipe</strong>: manual selection with full form control.<br>
<strong>🩺 Diagnose CLI</strong>: generate Python command to measure γ on real weights.<br>
<strong>📊 Phase diagram</strong>: explore 23 panel models on (log θ, γ) plane.
</span></span>
</h2>
<div class="mode-tabs" role="tablist" aria-label="View modes">
<button class="mode-btn active" data-mode="profile" role="tab" aria-selected="true" data-i18n="modes.profile">📇 Profile a model</button>
<button class="mode-btn" data-mode="compare" role="tab" aria-selected="false" data-i18n="modes.compare">🆚 Compare models</button>
<button class="mode-btn" data-mode="inspector" role="tab" aria-selected="false" data-i18n="modes.inspector">🔍 Inspect config</button>
<button class="mode-btn" data-mode="ask" role="tab" aria-selected="false" data-i18n="modes.ask">💬 Ask plain English</button>
<button class="mode-btn" data-mode="recipe" role="tab" aria-selected="false" data-i18n="modes.recipe">📋 Pick recipe</button>
<button class="mode-btn" data-mode="diagnose" role="tab" aria-selected="false" data-i18n="modes.diagnose">🩺 Diagnose CLI</button>
<button class="mode-btn" data-mode="phase" role="tab" aria-selected="false" data-i18n="modes.phase">📊 Phase diagram</button>
<button class="mode-btn" data-mode="unmask" role="tab" aria-selected="false" data-i18n="modes.unmask">🪟 Unmask</button>
<button class="mode-btn" data-mode="template" role="tab" aria-selected="false" data-i18n="modes.template">📜 Chat-template</button>
<button class="mode-btn" data-mode="arena" role="tab" aria-selected="false" data-i18n="modes.arena">🎯 Arena CI</button>
<button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button>
<button class="mode-btn" data-mode="quant" role="tab" aria-selected="false" data-i18n="modes.quant">⚖️ Quant</button>
<button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button>
<button class="mode-btn" data-mode="niah" role="tab" aria-selected="false" data-i18n="modes.niah">🔍 NIAH→Reason</button>
<button class="mode-btn" data-mode="saturation" role="tab" aria-selected="false" data-i18n="modes.saturation">📈 Saturation</button>
<button class="mode-btn" data-mode="cot" role="tab" aria-selected="false" data-i18n="modes.cot">📋 JSON CoT</button>
<button class="mode-btn" data-mode="peft" role="tab" aria-selected="false" data-i18n="modes.peft">🔧 PEFT Lint</button>
<button class="mode-btn" data-mode="cache" role="tab" aria-selected="false" data-i18n="modes.cache">🔁 Cache Diff</button>
<button class="mode-btn" data-mode="speculative" role="tab" aria-selected="false" data-i18n="modes.speculative">🔬 Spec-Decode</button>
<button class="mode-btn" data-mode="tax" role="tab" aria-selected="false" data-i18n="modes.tax">🌍 Token Tax</button>
<button class="mode-btn" data-mode="longscore" role="tab" aria-selected="false" data-i18n="modes.longscore">🎯 LongScore</button>
<button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
</div>
<p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
<strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>),
click Profile. See all 5 recipes scored in seconds.
</p>
</section>
<!-- PROFILE mode -->
<section id="profile-section">
<div class="quickstart-banner" data-i18n="profile.quickstart">
💡 Quick start: pick any preset → click Generate. Or paste a model id from <a href='https://huggingface.co/models?library=transformers&sort=trending' target='_blank'>HF Hub trending</a> → 📥 Fetch → Generate.
</div>
<h2><span data-i18n="profile.title">📇 Profile a model</span>
<span class="info"><span class="tooltip" data-i18n="profile.tip">
<strong>One-click full diagnosis</strong>. Paste any HF model id (or pick preset).
Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget,
hardware) and produces a single <strong>TAF Card</strong> showing verdict per
dimension + key numbers + architecture classification.<br><br>
<strong>Use case</strong>: "I'm evaluating Qwen2.5-32B for production —
what's its full viability profile?" → paste id → Profile → done.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="profile.desc">
<strong>For technicians</strong>: when you need a complete viability snapshot
of a candidate model. Outputs follow the format of the paper's γ-decomposition section.
</p>
<div class="form-row">
<label for="profile-preset" data-i18n="profile.preset_label">Preset:</label>
<select id="profile-preset" disabled>
<option value="" data-i18n="profile.preset_default">— or pick from list —</option>
</select>
</div>
<div class="form-row">
<label for="profile-hf-id" data-i18n="profile.hf_label">HF model id:</label>
<input type="text" id="profile-hf-id"
data-i18n-placeholder="profile.hf_placeholder"
placeholder="e.g. meta-llama/Meta-Llama-3-8B or Qwen/Qwen2.5-7B" style="flex:1;" />
<button id="profile-fetch-btn" type="button" class="secondary" data-i18n="profile.fetch_btn">📥 Fetch</button>
</div>
<div id="profile-hf-status" class="subtle" style="margin: -0.5rem 0 1rem; min-height:1.2em;"></div>
<div class="form-grid" id="profile-form">
<div class="form-field">
<label><span data-i18n="param.theta">θ (rope_theta)</span> <span class="info"><span class="tooltip" data-i18n="param.theta.tip">RoPE base frequency from <code>config.rope_theta</code>.</span></span></label>
<input type="number" id="profile-theta" value="500000" />
</div>
<div class="form-field">
<label><span data-i18n="param.T_train">T_train</span> <span class="info"><span class="tooltip" data-i18n="param.T_train.tip">Max training context. From <code>max_position_embeddings</code>.</span></span></label>
<input type="number" id="profile-T_train" value="8192" />
</div>
<div class="form-field">
<label><span data-i18n="param.T_eval">T_eval (your target)</span> <span class="info"><span class="tooltip" data-i18n="param.T_eval.tip">Inference context length you'll actually serve. The key knob.</span></span></label>
<input type="number" id="profile-T_eval" value="32000" />
</div>
<div class="form-field">
<label><span data-i18n="param.n_attn">n_attention_heads</span> <span class="info"><span class="tooltip" data-i18n="param.n_attn.tip">Number of attention heads per layer. From <code>num_attention_heads</code>.</span></span></label>
<input type="number" id="profile-n_attn" value="32" />
</div>
<div class="form-field">
<label><span data-i18n="param.n_kv">n_kv_heads</span> <span class="info"><span class="tooltip" data-i18n="param.n_kv.tip">KV heads. If &lt; n_attention_heads → GQA (Grouped Query Attention). Reduces KV memory but pushes γ toward Hagedorn.</span></span></label>
<input type="number" id="profile-n_kv" value="8" />
</div>
<div class="form-field">
<label><span data-i18n="param.d_head">head_dim</span> <span class="info"><span class="tooltip" data-i18n="param.d_head.tip">Per-head dimension. Typical 64, 96, 128. From <code>head_dim</code> or <code>hidden_size / num_attention_heads</code>.</span></span></label>
<input type="number" id="profile-d_head" value="128" />
</div>
<div class="form-field">
<label><span data-i18n="param.n_layers">n_layers</span> <span class="info"><span class="tooltip" data-i18n="param.n_layers.tip">Number of transformer blocks. From <code>num_hidden_layers</code>.</span></span></label>
<input type="number" id="profile-n_layers" value="32" />
</div>
<div class="form-field">
<label><span data-i18n="param.n_params">n_params (e.g. 8e9)</span> <span class="info"><span class="tooltip" data-i18n="param.n_params.tip">Total parameter count. Threshold ~400M for induction-head emergence. Affects KV memory and budget recipes.</span></span></label>
<input type="text" id="profile-n_params" value="8e9" />
</div>
<div class="form-field">
<label><span data-i18n="param.has_swa">Has SWA?</span> <span class="info"><span class="tooltip" data-i18n="param.has_swa.tip">Sliding Window Attention. <code>true</code> for Mistral, gemma-2, phi-3. Calibration audit (v0.5.3) disabled the historical δ_SWA correction (n=1 fit).</span></span></label>
<select id="profile-has_swa">
<option value="false" selected data-i18n="common.no">No</option>
<option value="true" data-i18n="common.yes">Yes</option>
</select>
</div>
</div>
<button id="profile-btn" disabled data-i18n="profile.btn">🚀 Generate full profile</button>
</section>
<!-- INSPECTOR mode (paste config.json directly) -->
<section id="inspector-section" style="display:none;">
<div class="quickstart-banner" data-i18n="inspector.quickstart">
💡 Use case: you have a private model not on HF Hub, or a config you're designing. Paste the raw JSON below and get a full TAF profile.
</div>
<h2><span data-i18n="inspector.title">🔍 Architecture Inspector</span>
<span class="info"><span class="tooltip" data-i18n="inspector.tip">
<strong>Paste any config.json directly</strong>. Tool parses it and runs the full Profile.
Useful for: private models, in-development configs, models not yet on HuggingFace,
or comparing what your custom architecture would do.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="inspector.desc">
Paste the raw <code>config.json</code> contents. The tool extracts the architectural
parameters and runs the full 5-recipe Profile.
</p>
<textarea id="inspector-json" rows="12"
data-i18n-placeholder="inspector.placeholder"
placeholder='{
"model_type": "llama",
"rope_theta": 500000,
"max_position_embeddings": 8192,
"num_attention_heads": 32,
"num_key_value_heads": 8,
"hidden_size": 4096,
"num_hidden_layers": 32,
"vocab_size": 128256
}'></textarea>
<div class="form-row" style="margin-top:0.5rem;">
<label for="inspector-T_eval" data-i18n="inspector.T_eval">T_eval (your target context):</label>
<input type="number" id="inspector-T_eval" value="32000" />
</div>
<button id="inspector-btn" disabled data-i18n="inspector.btn">🚀 Inspect & profile</button>
<span id="inspector-status" class="subtle" style="margin-left:0.75rem;"></span>
</section>
<!-- COMPARE mode -->
<section id="compare-section" style="display:none;">
<div class="quickstart-banner" data-i18n="compare.example">
💡 Try: paste 3 popular 7-8B models (Meta-Llama-3-8B, Mistral-7B-v0.1, Qwen/Qwen2.5-7B), pick recipe X-2, T_eval=16000. See which best handles long context.
</div>
<h2><span data-i18n="compare.title">🆚 Compare models side-by-side</span>
<span class="info"><span class="tooltip" data-i18n="compare.tip">
<strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and
one recipe. See verdicts in a single comparison table.<br><br>
<strong>Use case</strong>: "I need long-context retrieval at 16K — which is
best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="compare.desc">
<strong>For technicians</strong>: when choosing between 2-3 candidate models for
a specific deployment scenario. Compare their verdicts on the same recipe.
</p>
<div class="form-row">
<label for="compare-recipe" data-i18n="compare.recipe_label">Recipe:</label>
<select id="compare-recipe" disabled>
<option value="" data-i18n="recipe.default">— pick a recipe —</option>
</select>
</div>
<div class="form-row">
<label for="compare-T_eval" data-i18n="compare.T_eval_label">T_eval (target context):</label>
<input type="number" id="compare-T_eval" value="16000" style="flex:1;" />
<span class="info" style="margin-top:0.5rem;"><span class="tooltip">
Applies to X-2 / X-19 only: the context length at which all compared models
are evaluated. Other recipes use their own parameters.
</span></span>
</div>
<div id="compare-models">
<h3 style="margin-top:1rem;" data-i18n="compare.models_title">Models to compare (add up to 3)</h3>
<div class="compare-slot" data-slot="1">
<input type="text" class="compare-hf-id"
data-i18n-placeholder="compare.slot1_placeholder"
placeholder="HF model id (e.g. meta-llama/Meta-Llama-3-8B)" />
<select class="compare-preset">
<option value="" data-i18n="compare.preset_default">— or preset —</option>
</select>
</div>
<div class="compare-slot" data-slot="2">
<input type="text" class="compare-hf-id"
data-i18n-placeholder="compare.slot2_placeholder"
placeholder="HF model id #2" />
<select class="compare-preset">
<option value="" data-i18n="compare.preset_default">— or preset —</option>
</select>
</div>
<div class="compare-slot" data-slot="3">
<input type="text" class="compare-hf-id"
data-i18n-placeholder="compare.slot3_placeholder"
placeholder="HF model id #3 (optional)" />
<select class="compare-preset">
<option value="" data-i18n="compare.preset_default">— or preset —</option>
</select>
</div>
</div>
<button id="compare-btn" disabled style="margin-top:1rem;" data-i18n="compare.btn">🚀 Compare</button>
</section>
<!-- ASK mode (free-form question) -->
<section id="ask-section" style="display:none;">
<h2 data-i18n="ask.title">❓ Your question</h2>
<textarea id="question" rows="3"
data-i18n-placeholder="ask.placeholder"
placeholder="e.g. Will Mistral-7B handle 16K NIAH retrieval? Or: I have $5,000, what model can I train? Or: Cheapest GPU to serve Llama-70B at 100M tokens/day?"></textarea>
<div style="display:flex; gap:0.5rem; margin-top:0.5rem; flex-wrap:wrap;">
<button id="ask-btn" disabled data-i18n="ask.btn">🚀 Analyze</button>
<button id="example-btn" type="button" class="secondary" data-i18n="ask.example_btn">💡 Try an example</button>
</div>
</section>
<!-- Diagnose mode: build the CLI command for diagnose_model.py -->
<section id="diagnose-section" style="display:none;">
<h2><span data-i18n="diagnose.title">🩺 Diagnose CLI Command Builder</span>
<span class="info"><span class="tooltip" data-i18n="diagnose.tip">
<strong>Measure γ_obs (not predict)</strong>. The browser tool predicts γ from
config alone (Padé). To <em>measure</em> the actual decay on a real model
you need GPU + Python. This builder produces the exact CLI command you
run locally; the script is shipped in this repository at
<code>cli/diagnose_model.py</code>.<br><br>
<strong>Output</strong>: γ_obs, R², phase, KV cache budget D_90, KL anomaly,
full thermodynamic profile (Z, U, S, F, C_V, χ). Saved as JSON.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="diagnose.desc">
Pick options below, then copy the generated command and run it on your local
machine (Python + transformers + numpy). Total wall time ≈ 5 min in
<code>--fast</code> mode on CPU; full mode takes 20–60 min on GPU.
</p>
<div class="form-row">
<label for="diag-model" data-i18n="diagnose.model_label">HF model id:</label>
<input type="text" id="diag-model" placeholder="EleutherAI/pythia-70m" value="EleutherAI/pythia-70m">
</div>
<div class="form-row">
<label for="diag-theta" data-i18n="diagnose.theta_label">θ (auto if blank):</label>
<input type="number" id="diag-theta" placeholder="auto-detect">
</div>
<div class="form-row">
<label for="diag-N" data-i18n="diagnose.n_label">Context N:</label>
<input type="number" id="diag-N" value="2000" min="100" max="32000">
</div>
<div class="form-row">
<label data-i18n="diagnose.options_label">Options:</label>
<span>
<label><input type="checkbox" id="diag-fast" checked>
<span data-i18n="diagnose.opt_fast">--fast (CPU, ~5 min)</span></label><br>
<label><input type="checkbox" id="diag-cpu">
<span data-i18n="diagnose.opt_cpu">--cpu (force CPU)</span></label><br>
<label><input type="checkbox" id="diag-4bit">
<span data-i18n="diagnose.opt_4bit">--load_in_4bit (≥7B models)</span></label>
</span>
</div>
<div class="form-row">
<label for="diag-local" data-i18n="diagnose.local_label">--local path (optional):</label>
<input type="text" id="diag-local" placeholder="/path/to/local/weights">
</div>
<button id="diag-build-btn" data-i18n="diagnose.build_btn">📋 Build command</button>
<div id="diag-output" style="display:none; margin-top:1em;">
<h3 data-i18n="diagnose.cmd_title">Generated command:</h3>
<pre id="diag-cmd" class="diag-cmd-box"></pre>
<button id="diag-copy-btn" data-i18n="diagnose.copy_btn">📋 Copy to clipboard</button>
<p class="recipe-desc" data-i18n="diagnose.next_steps">
<strong>Next steps</strong>:
(1) <code>git clone https://github.com/karlesmarin/tafagent</code>
(2) <code>cd tafagent &amp;&amp; pip install torch transformers numpy</code>
(3) Run the command above.
(4) Result JSON lands in <code>./diagnose_results/</code> — upload it
to the <strong>📋 Pick recipe</strong> mode (or paste in <strong>🔍 Inspect config</strong>) for full TAF analysis.
</p>
</div>
</section>
<!-- Phase diagram mode: live scatter of measured γ vs θ -->
<section id="phase-section" style="display:none;">
<h2><span data-i18n="phase.title">📊 Phase diagram (γ × θ)</span>
<span class="info"><span class="tooltip" data-i18n="phase.tip">
Each dot is one model from the paper's empirical panel
(data/master_gamma_results.json). The x-axis is RoPE base θ
on a log scale; the y-axis is measured γ.
The Hagedorn line γ=1 separates Phase A (γ&lt;1, global) from
Phase B (γ&gt;1, local-collapsed).
Hover dots for details; click to populate the recipe form.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="phase.desc">
The panel contains 23 models; the Padé curve (solid line) is
γ_pred(θ) = (2θ − T√2) / (2θ + T√2) at T = 2000.
</p>
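<!--
Illustrative sketch (plain Python, not wired into the page): the Padé curve quoted
above, evaluated directly. Only the formula comes from this page; the function name
and the example θ are ours.

import math

def gamma_pred(theta: float, T: float = 2000.0) -> float:
    """Pade prediction: gamma_pred(theta) = (2θ − T√2) / (2θ + T√2)."""
    s = T * math.sqrt(2)
    return (2 * theta - s) / (2 * theta + s)

# Llama-3 ships rope_theta = 500000, which lands just below the Hagedorn
# line γ = 1 at T = 2000 (Phase A):
print(round(gamma_pred(500_000), 5))   # ≈ 0.99436
-->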
<canvas id="phase-canvas" width="900" height="500" style="max-width:100%; background: var(--card-bg); border-radius: 6px;"></canvas>
<div id="phase-info" class="recipe-desc" style="margin-top:0.6em;"></div>
</section>
<!-- Unmask mode: detect misleading max_position_embeddings via SWA / RoPE-scaling -->
<section id="unmask-section" style="display:none;">
<h2><span data-i18n="unmask.title">🪟 Context Unmasker</span>
<span class="info"><span class="tooltip" data-i18n="unmask.tip">
Paste a HuggingFace model id (or raw config.json). The tool checks for
sliding-window attention, RoPE scaling (YaRN/linear/dynamic NTK), and
GQA — anything that makes <code>max_position_embeddings</code> larger
than the practical effective context. Mistral-7B-v0.1 is the canonical
example: declared 32k, attends within ~4-8k.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="unmask.desc">
<strong>Are you about to spend money on a model that won't actually attend that far?</strong> Paste an id and find out in 1 second. No GPU, no inference — just config.json arithmetic.
</p>
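<!--
Illustrative sketch (plain Python) of the config.json arithmetic this mode performs.
The config fields (max_position_embeddings, sliding_window, rope_scaling) are real
HF config keys; the function name and output format are ours. Gated repos may
require authentication.

import json, urllib.request

def unmask(hf_id: str):
    url = f"https://huggingface.co/{hf_id}/resolve/main/config.json"
    cfg = json.load(urllib.request.urlopen(url))
    declared = cfg.get("max_position_embeddings")
    window = cfg.get("sliding_window")       # SWA: attention is local to this span
    scaling = cfg.get("rope_scaling")        # YaRN / linear / dynamic-NTK metadata
    notes = []
    if window and declared and window < declared:
        notes.append(f"SWA window {window} < declared context {declared}")
    if scaling:
        notes.append(f"RoPE scaling in use: {scaling}")
    return notes or ["no masking signals found in config"]

print(unmask("mistralai/Mistral-7B-v0.1"))  # expects the SWA note (window 4096)
-->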
<div class="form-row">
<label for="unmask-id" data-i18n="unmask.id_label">HF model id:</label>
<input type="text" id="unmask-id" placeholder="e.g. mistralai/Mistral-7B-v0.1" />
<button type="button" id="unmask-fetch-btn" data-i18n="unmask.fetch_btn">🔍 Unmask</button>
</div>
<p id="unmask-status" class="recipe-desc" style="font-size:0.92em;"></p>
<details style="margin: 0.6em 0;">
<summary style="cursor:pointer; font-size:0.92em;" data-i18n="unmask.paste_summary">Or paste raw config.json (private / in-dev models)</summary>
<textarea id="unmask-paste" rows="6" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder='{"max_position_embeddings": 32768, "sliding_window": 4096, ...}'></textarea>
<button type="button" id="unmask-paste-btn" data-i18n="unmask.paste_btn" style="margin-top:0.4em;">🔍 Unmask pasted config</button>
</details>
<div id="unmask-output" style="margin-top: 1em;"></div>
</section>
<!-- Chat-template sniffer mode (v0.7.1 anti-bullshit pack #2) -->
<section id="template-section" style="display:none;">
<h2><span data-i18n="template.title">📜 Chat-template Sniffer</span>
<span class="info"><span class="tooltip" data-i18n="template.tip">
Paste an HF model id (or raw tokenizer_config.json). Detects the
chat-template family (Llama-3, ChatML, Mistral, Gemma, Phi-3,
Alpaca, DeepSeek, custom) and gives you the exact framework command
to use it correctly. Forgetting to apply it can silently halve accuracy
in lm-eval-harness (issue #1841).
</span></span>
</h2>
<p class="recipe-desc" data-i18n="template.desc">
<strong>Did you forget <code>--apply_chat_template</code>?</strong> Multi-turn evals can lose roughly half their accuracy when the chat template isn't applied. Paste a model id, get the exact CLI flag for your stack.
</p>
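<!--
Illustrative sketch (plain Python) of the family detection: look for each family's
signature control tokens inside chat_template. The marker strings are real for these
families; the (shortened) family table and function name are ours.

import json, urllib.request

FAMILIES = {"<|start_header_id|>": "Llama-3", "<|im_start|>": "ChatML",
            "[INST]": "Mistral", "<start_of_turn>": "Gemma", "<|user|>": "Phi-3"}

def sniff_template(hf_id: str) -> str:
    url = f"https://huggingface.co/{hf_id}/resolve/main/tokenizer_config.json"
    tpl = json.load(urllib.request.urlopen(url)).get("chat_template") or ""
    for marker, family in FAMILIES.items():
        if marker in tpl:
            return family
    return "custom / none"

print(sniff_template("Qwen/Qwen2.5-7B-Instruct"))  # Qwen uses ChatML-style markers
-->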
<div class="form-row">
<label for="template-id" data-i18n="template.id_label">HF model id:</label>
<input type="text" id="template-id" placeholder="e.g. mistralai/Mistral-7B-Instruct-v0.3" />
<button type="button" id="template-fetch-btn" data-i18n="template.fetch_btn">📜 Sniff</button>
</div>
<p id="template-status" class="recipe-desc" style="font-size:0.92em;"></p>
<details style="margin: 0.6em 0;">
<summary style="cursor:pointer; font-size:0.92em;" data-i18n="template.paste_summary">Or paste raw tokenizer_config.json (private models)</summary>
<textarea id="template-paste" rows="6" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder='{"chat_template": "...", ...}'></textarea>
<button type="button" id="template-paste-btn" data-i18n="template.paste_btn" style="margin-top:0.4em;">📜 Sniff pasted config</button>
</details>
<div id="template-output" style="margin-top: 1em;"></div>
</section>
<!-- Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3) -->
<section id="arena-section" style="display:none;">
<h2><span data-i18n="arena.title">🎯 Arena-Elo CI Reconstructor</span>
<span class="info"><span class="tooltip" data-i18n="arena.tip">
Chatbot Arena strips confidence intervals from the public leaderboard.
A 5-Elo gap can be statistically meaningless. Paste raw vote data
(model_a, model_b, winner) — the tool computes Bradley-Terry MLE +
bootstrap CIs and lists statistical ties (CI overlap).
</span></span>
</h2>
<p class="recipe-desc" data-i18n="arena.desc">
<strong>Is GPT-4 actually better than Claude — or are they tied?</strong> Paste pairwise vote CSV (or click <em>Load sample</em>). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in the browser.
</p>
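<!--
Illustrative sketch (plain Python, stdlib only) of the statistics this mode runs:
Zermelo/MM iteration for the Bradley-Terry MLE, Elo rescaling, percentile bootstrap.
The conventions here (ties as half-wins, Elo = 1000 + 400·log10(strength)) are common
choices, not necessarily the page's exact ones.

import math, random
from collections import defaultdict

def bt_strengths(votes, iters=100):
    wins, pairs, models = defaultdict(float), defaultdict(float), set()
    for a, b, w in votes:
        models |= {a, b}
        pairs[frozenset((a, b))] += 1.0
        if w == "a":   wins[a] += 1.0
        elif w == "b": wins[b] += 1.0
        else:          wins[a] += 0.5; wins[b] += 0.5
    p = {m: 1.0 for m in models}
    for _ in range(iters):              # MM update: p_i = W_i / Σ_j n_ij/(p_i+p_j)
        new = {}
        for i in models:
            denom = sum(n / (p[i] + p[j]) for pair, n in pairs.items()
                        if i in pair for j in pair if j != i)
            new[i] = wins[i] / denom if denom else p[i]
        mean = sum(new.values()) / len(new)   # strengths are only defined up to scale
        p = {m: v / mean for m, v in new.items()}
    return p

def arena_elos(votes, boots=200, anchor=1000.0):
    def elo(vs):
        return {m: anchor + 400.0 * math.log10(max(s, 1e-9))
                for m, s in bt_strengths(vs).items()}
    point, boot = elo(votes), defaultdict(list)
    for _ in range(boots):              # resample votes with replacement
        for m, e in elo([random.choice(votes) for _ in votes]).items():
            boot[m].append(e)
    ci = {m: (sorted(es)[int(0.025 * len(es))], sorted(es)[int(0.975 * len(es))])
          for m, es in boot.items()}
    return point, ci

# Two models, 130 votes: overlapping CIs signal a statistical tie.
votes = ([("GPT-4", "Claude", "a")] * 60 + [("GPT-4", "Claude", "b")] * 50
         + [("GPT-4", "Claude", "tie")] * 20)
print(arena_elos(votes))
-->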
<div class="form-row">
<button type="button" id="arena-sample-btn" data-i18n="arena.sample_btn">📊 Load sample data</button>
<button type="button" id="arena-run-btn" data-i18n="arena.run_btn">🎯 Compute CIs</button>
<button type="button" id="arena-clear-btn" class="secondary" data-i18n="arena.clear_btn">🗑️ Clear</button>
</div>
<p id="arena-status" class="recipe-desc" style="font-size:0.92em;"></p>
<details style="margin: 0.6em 0;" open>
<summary style="cursor:pointer; font-size:0.92em;" data-i18n="arena.csv_summary">Vote CSV (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)</summary>
<textarea id="arena-csv" rows="10" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder="model_a,model_b,winner&#10;GPT-4,Claude,a&#10;Llama-3,Mixtral,tie&#10;..."></textarea>
</details>
<div id="arena-output" style="margin-top: 1em;"></div>
</section>
<!-- Contamination prior (v0.7.3 anti-bullshit pack #4) -->
<section id="contam-section" style="display:none;">
<h2><span data-i18n="contam.title">🧪 Contamination Prior</span>
<span class="info"><span class="tooltip" data-i18n="contam.tip">
Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). The Open LLM Leaderboard v1 was retired in 2024 after MMLU/HellaSwag scores became contaminated.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="contam.desc">
<strong>Should you trust your model's MMLU score?</strong> Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.
</p>
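<!--
Illustrative sketch (plain Python) of the core date comparison. Release dates below
are approximate; the weighting by corpus inclusion and leak history that the page
adds is omitted here.

from datetime import date

BENCH_RELEASE = {"MMLU": date(2020, 9, 1), "HumanEval": date(2021, 7, 1),
                 "GSM8K": date(2021, 10, 1), "GPQA": date(2023, 11, 1),
                 "MMLU-Pro": date(2024, 6, 1)}

def contamination_flags(cutoff: date) -> None:
    """A benchmark released before the training cutoff may sit in the corpus."""
    for name, released in sorted(BENCH_RELEASE.items(), key=lambda kv: kv[1]):
        risk = "HIGH (predates cutoff)" if released < cutoff else "low (post-cutoff)"
        print(f"{name:10s} released {released}: {risk}")

contamination_flags(date(2023, 12, 1))   # e.g. a model with a 2023-12 cutoff
-->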
<div class="form-row">
<label for="contam-cutoff" data-i18n="contam.cutoff_label">Training cutoff:</label>
<input type="text" id="contam-cutoff" placeholder="2023-12 or 2024-01" style="max-width:14em;" />
<button type="button" id="contam-run-btn" data-i18n="contam.run_btn">🧪 Rate all benchmarks</button>
</div>
<p id="contam-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="contam-output" style="margin-top: 1em;"></div>
</section>
<!-- Quant-regime classifier (v0.7.3 anti-bullshit pack #5) -->
<section id="quant-section" style="display:none;">
<h2><span data-i18n="quant.title">⚖️ Quant-regime Classifier</span>
<span class="info"><span class="tooltip" data-i18n="quant.tip">
Predicts γ-shift (and downstream ΔPPL) for a given (model × quant scheme).
Generic claims like "AWQ ~95% retention" are too vague — TAF uses
d_head, GQA ratio, SWA flag, and model size to give an architecture-specific
verdict. The pain it addresses: the HF community widely reports unpredictable
quant cliffs (NF4 costing ~2 PPL on Phi-3 but nearly free on Llama-3-8B).
</span></span>
</h2>
<p class="recipe-desc" data-i18n="quant.desc">
<strong>Will quantizing your model break it?</strong> Paste an HF model id, pick a quant scheme — get predicted γ-shift, expected ΔPPL band, and a recommended alternative if it's a cliff. Browser-only, no GPU, no calibration set required.
</p>
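<!--
Illustrative sketch (plain Python): the architecture features the classifier reads
from config.json. The calibrated feature-to-γ-shift mapping itself is not reproduced
here. The config keys are real HF fields; the function name is ours.

import json, urllib.request

def quant_features(hf_id: str) -> dict:
    url = f"https://huggingface.co/{hf_id}/resolve/main/config.json"
    cfg = json.load(urllib.request.urlopen(url))
    heads = cfg["num_attention_heads"]
    return {"d_head": cfg["hidden_size"] // heads,
            "gqa_ratio": heads / cfg.get("num_key_value_heads", heads),
            "swa": cfg.get("sliding_window") is not None}

print(quant_features("Qwen/Qwen2.5-7B"))  # ungated; gated ids (Llama, Gemma) return 401
-->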
<div class="form-row">
<label for="quant-id" data-i18n="quant.id_label">HF model id:</label>
<input type="text" id="quant-id" placeholder="e.g. meta-llama/Llama-3.2-1B" />
<button type="button" id="quant-fetch-btn" data-i18n="quant.fetch_btn">📥 Fetch config</button>
</div>
<div class="form-row">
<label for="quant-scheme" data-i18n="quant.scheme_label">Quant scheme:</label>
<select id="quant-scheme">
<option value="">— select scheme —</option>
</select>
<button type="button" id="quant-run-btn" data-i18n="quant.run_btn">⚖️ Predict</button>
<button type="button" id="quant-all-btn" class="secondary" data-i18n="quant.all_btn">📊 Compare all schemes</button>
</div>
<p id="quant-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="quant-output" style="margin-top: 1em;"></div>
</section>
<!-- Cross-framework drift bound (v0.7.5 anti-bullshit pack #6) -->
<section id="drift-section" style="display:none;">
<h2><span data-i18n="drift.title">🔀 Cross-framework Drift Bound</span>
<span class="info"><span class="tooltip" data-i18n="drift.tip">
Same model, different scores on different setups. Is the gap noise or
a real bug? Enter two scores with their (framework, dtype, batch,
chat-template) — tool predicts the maximum allowable drift from
numerical noise alone. If observed gap exceeds it → real bug, usually
chat-template mismatch (lm-eval issue #1841) or KV-cache layout.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="drift.desc">
<strong>Your model gives 67.2 on lm-eval-hf and 65.1 on vLLM-served. Bug or noise?</strong> Enter both scores with (framework, dtype, batch, chat-template applied?). The tool predicts the noise band and flags real bugs. arXiv:2506.09501 documents this as a major eval-reproducibility problem.
</p>
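<!--
Illustrative sketch (plain Python) of one way to build such a bound: add independent
noise sources in quadrature. Every tolerance value below is an assumption for
illustration, not the page's calibrated number.

import math

DTYPE_TOL = {"fp32": 0.05, "bf16": 0.30, "fp16": 0.30, "int8": 0.80}
FRAMEWORK_TOL = 0.40   # kernel and batching differences across serving stacks
BATCH_TOL = 0.20       # nondeterministic reduction order at batch > 1

def drift_verdict(score_a, score_b, setup_a, setup_b):
    """A gap beyond the quadrature bound points at a real bug, not numerics."""
    if setup_a["template"] != setup_b["template"]:
        return "chat-template settings differ: that, not noise, is the likely cause"
    terms = [DTYPE_TOL[setup_a["dtype"]], DTYPE_TOL[setup_b["dtype"]], FRAMEWORK_TOL]
    if setup_a["batch"] > 1 or setup_b["batch"] > 1:
        terms.append(BATCH_TOL)
    bound = math.sqrt(sum(t * t for t in terms))
    gap = abs(score_a - score_b)
    return f"gap {gap:.2f} vs noise bound {bound:.2f}: " + (
        "within noise" if gap <= bound else "exceeds noise, investigate")

print(drift_verdict(67.2, 65.1,
                    {"dtype": "bf16", "batch": 1, "template": "applied"},
                    {"dtype": "bf16", "batch": 1, "template": "not_applied"}))
-->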
<div class="drift-grid">
<fieldset class="drift-setup">
<legend data-i18n="drift.setup_a">Setup A</legend>
<div class="form-row">
<label for="drift-a-score" data-i18n="drift.score">Score:</label>
<input type="number" id="drift-a-score" step="0.01" placeholder="67.2" />
</div>
<div class="form-row">
<label for="drift-a-framework" data-i18n="drift.framework">Framework:</label>
<select id="drift-a-framework"></select>
</div>
<div class="form-row">
<label for="drift-a-dtype" data-i18n="drift.dtype">Dtype:</label>
<select id="drift-a-dtype"></select>
</div>
<div class="form-row">
<label for="drift-a-batch" data-i18n="drift.batch">Batch:</label>
<input type="number" id="drift-a-batch" min="1" value="1" />
</div>
<div class="form-row">
<label for="drift-a-template" data-i18n="drift.template">Chat-template:</label>
<select id="drift-a-template">
<option value="applied" data-i18n="drift.template.applied">applied</option>
<option value="not_applied" data-i18n="drift.template.not_applied">not applied</option>
<option value="unknown" data-i18n="drift.template.unknown">unknown</option>
</select>
</div>
</fieldset>
<fieldset class="drift-setup">
<legend data-i18n="drift.setup_b">Setup B</legend>
<div class="form-row">
<label for="drift-b-score" data-i18n="drift.score">Score:</label>
<input type="number" id="drift-b-score" step="0.01" placeholder="65.1" />
</div>
<div class="form-row">
<label for="drift-b-framework" data-i18n="drift.framework">Framework:</label>
<select id="drift-b-framework"></select>
</div>
<div class="form-row">
<label for="drift-b-dtype" data-i18n="drift.dtype">Dtype:</label>
<select id="drift-b-dtype"></select>
</div>
<div class="form-row">
<label for="drift-b-batch" data-i18n="drift.batch">Batch:</label>
<input type="number" id="drift-b-batch" min="1" value="1" />
</div>
<div class="form-row">
<label for="drift-b-template" data-i18n="drift.template">Chat-template:</label>
<select id="drift-b-template">
<option value="applied" data-i18n="drift.template.applied">applied</option>
<option value="not_applied" data-i18n="drift.template.not_applied">not applied</option>
<option value="unknown" data-i18n="drift.template.unknown">unknown</option>
</select>
</div>
</fieldset>
</div>
<div class="form-row">
<button type="button" id="drift-run-btn" data-i18n="drift.run_btn">🔀 Compute drift bound</button>
<button type="button" id="drift-sample-btn" class="secondary" data-i18n="drift.sample_btn">📊 Load sample (chat-template bug)</button>
</div>
<p id="drift-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="drift-output" style="margin-top: 1em;"></div>
</section>
<!-- NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7) -->
<section id="niah-section" style="display:none;">
<h2><span data-i18n="niah.title">🔍 NIAH → Reasoning Gap</span>
<span class="info"><span class="tooltip" data-i18n="niah.tip">
NIAH (Needle in a Haystack) tests retrieval: "find this fact in long text". Multi-hop reasoning tests inference: "combine facts X+Y at the start with fact Z at the end". RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="niah.desc">
<strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — the tool predicts NIAH and multi-hop reasoning pass rates, the gap between them, and a "safe context" where reasoning stays ≥65%.
</p>
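<!--
Illustrative sketch (plain Python) of the "safe context" search only. The page's
actual pass-rate predictor is architecture-based; the toy decay curve below is a
stand-in so the search is runnable.

def safe_context(reasoning_pass, floor=0.65,
                 lengths=(4096, 8192, 16384, 32768, 65536, 131072)):
    """Largest context where predicted multi-hop reasoning stays at or above floor."""
    ok = [L for L in lengths if reasoning_pass(L) >= floor]
    return max(ok) if ok else None

# Toy monotone decay, NOT a real model's curve:
print(safe_context(lambda L: max(0.0, 1.0 - L / 200_000)))   # 65536
-->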
<div class="form-row">
<label for="niah-id" data-i18n="niah.id_label">HF model id:</label>
<input type="text" id="niah-id" placeholder="e.g. meta-llama/Llama-3.1-8B-Instruct" />
<button type="button" id="niah-fetch-btn" data-i18n="niah.fetch_btn">📥 Fetch config</button>
</div>
<div class="form-row">
<label for="niah-teval" data-i18n="niah.teval_label">Target context (T_eval):</label>
<input type="number" id="niah-teval" min="512" step="1024" value="32768" />
<button type="button" id="niah-run-btn" data-i18n="niah.run_btn">🔍 Predict</button>
<button type="button" id="niah-sweep-btn" class="secondary" data-i18n="niah.sweep_btn">📊 Sweep contexts</button>
</div>
<p id="niah-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="niah-output" style="margin-top: 1em;"></div>
</section>
<!-- Benchmark Saturation Detector (v0.8.0 anti-bullshit pack #6) -->
<section id="saturation-section" style="display:none;">
<h2><span data-i18n="saturation.title">📈 Benchmark Saturation Detector</span>
<span class="info"><span class="tooltip" data-i18n="saturation.tip">
MMLU is saturated (88-94% across all frontier models). Reporting "92% on MMLU" is now meaningless. This tool tells you which benchmarks still discriminate between frontier models, which are saturated, and what to use instead. Data: DemandSphere AI Frontier Tracker (CC BY-NC 4.0), refreshed 2026-05.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="saturation.desc">
<strong>Is your benchmark still useful?</strong> Pick a benchmark to see top-3 frontier scores, spread, and a verdict (saturated / near-saturated / discriminative) plus recommended replacements.
</p>
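<!--
Illustrative sketch (plain Python) of the spread-based verdict. The thresholds are
assumptions for illustration; the page uses its own cutoffs and live tracker data.

def classify(top3):
    """top3: the three best frontier scores (percent) on one benchmark."""
    spread = max(top3) - min(top3)
    if max(top3) >= 90 and spread < 2:
        return "saturated"
    if spread < 4:
        return "near-saturated"
    return "discriminative"

print(classify([91.2, 90.4, 89.9]))   # toy numbers, not real leaderboard data
-->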
<div class="form-row">
<label for="saturation-select" data-i18n="saturation.select_label">Benchmark:</label>
<select id="saturation-select"></select>
<button type="button" id="saturation-run-btn" data-i18n="saturation.run_btn">📈 Classify</button>
<button type="button" id="saturation-all-btn" class="secondary" data-i18n="saturation.all_btn">📊 Show all</button>
</div>
<p id="saturation-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="saturation-output" style="margin-top: 1em;"></div>
<p class="subtle" style="font-size:0.82em; margin-top:1em;" data-i18n="saturation.attribution">
Data: DemandSphere AI Frontier Model Tracker (CC BY-NC 4.0) · HF Open LLM Leaderboard v3 (open-weight historical) · last fetch 2026-05-05.
</p>
</section>
<!-- JSON CoT-aware Linter (mode=cot, v0.8.2 anti-bullshit pack #8) -->
<section id="cot-section" style="display:none;">
<h2><span data-i18n="cot.title">📋 JSON CoT-aware Linter</span>
<span class="info"><span class="tooltip" data-i18n="cot.tip">
<strong>Why this matters</strong>: constrained-decoding engines (llguidance, Outlines, SGLang grammars) emit JSON properties in schema order. If your schema places <code>answer</code> before <code>reasoning</code>, the model commits to a final answer first and only then writes the rationale to justify it — defeating Chain-of-Thought entirely. Paste a JSON Schema (or example object) and the linter flags the ordering.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="cot.desc">
<strong>Reasoning before answer, always.</strong> Paste a JSON Schema or example response object — the linter reports whether reasoning fields come before answer fields and suggests a fix.
</p>
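<!--
Illustrative sketch (plain Python) of the ordering check. The keyword lists are a
heuristic of ours; json.loads preserves key order (Python 3.7+), which is exactly
the order a constrained decoder will emit.

import json

REASONING = {"reasoning", "rationale", "thought", "analysis"}
ANSWER = {"answer", "final_answer", "result", "verdict"}

def lint_cot_order(schema_text: str) -> str:
    props = list(json.loads(schema_text).get("properties", {}))
    r = next((i for i, k in enumerate(props) if k.lower() in REASONING), None)
    a = next((i for i, k in enumerate(props) if k.lower() in ANSWER), None)
    if r is None or a is None:
        return "cannot tell: no recognizable reasoning/answer fields"
    return "OK" if r < a else "ANTI-PATTERN: answer precedes reasoning"

print(lint_cot_order('{"type":"object","properties":{"answer":{},"reasoning":{}}}'))
-->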
<div class="form-row">
<textarea id="cot-input" rows="10" style="width:100%;font-family:monospace;font-size:0.9em;" data-i18n-placeholder="cot.input.placeholder" placeholder='{ "type": "object", "properties": { "answer": {"type": "string"}, "reasoning": {"type": "string"} } }'></textarea>
</div>
<div class="form-row">
<button type="button" id="cot-lint-btn" data-i18n="cot.lint_btn">🔍 Lint</button>
<button type="button" id="cot-example-good-btn" class="secondary" data-i18n="cot.example_good_btn">↳ Example: good order</button>
<button type="button" id="cot-example-bad-btn" class="secondary" data-i18n="cot.example_bad_btn">↳ Example: anti-pattern</button>
</div>
<p id="cot-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="cot-output" style="margin-top: 1em;"></div>
</section>
<!-- PEFT Anti-Pattern Checker (mode=peft, v0.8.3 anti-bullshit pack #9) -->
<section id="peft-section" style="display:none;">
<h2><span data-i18n="peft.title">🔧 PEFT Anti-Pattern Checker</span>
<span class="info"><span class="tooltip" data-i18n="peft.tip">
<strong>Why this matters</strong>: <code>get_peft_model(base, config)</code> creates a FRESH adapter — it does NOT load saved weights. Users who want to resume from a checkpoint must call <code>PeftModel.from_pretrained(base, path)</code>. peft #2115 documents the silent base-model bug. This linter scans your training script for that pattern (and 3 others: QLoRA ordering, target_modules/arch mismatch, lora_alpha ratio).
</span></span>
</h2>
<p class="recipe-desc" data-i18n="peft.desc">
<strong>Don't burn 10 hours of training on a base model.</strong> Paste your PEFT setup code — the linter flags silent base-model loads, QLoRA ordering bugs, target_modules/arch mismatches, and lora_alpha conventions.
</p>
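<!--
The resume bug described above, side by side (paths and model ids are placeholders;
the peft APIs are real).

from peft import LoraConfig, get_peft_model, PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")

# WRONG for resuming: get_peft_model always attaches a FRESH, randomly
# initialized adapter. The saved checkpoint is silently ignored.
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32,
                                        target_modules=["q_proj", "v_proj"]))

# RIGHT: PeftModel.from_pretrained loads the saved adapter weights onto the base.
model = PeftModel.from_pretrained(base, "./outputs/checkpoint-1000",
                                  is_trainable=True)
-->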
<div class="form-row">
<textarea id="peft-input" rows="14" style="width:100%;font-family:monospace;font-size:0.9em;" data-i18n-placeholder="peft.input.placeholder" placeholder="from peft import LoraConfig, get_peft_model&#10;base = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3-8B')&#10;config = LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','v_proj'])&#10;model = get_peft_model(base, config)&#10;# resume from saved adapter:&#10;model.load_state_dict('./outputs/checkpoint-1000/adapter_model.bin')"></textarea>
</div>
<div class="form-row">
<button type="button" id="peft-lint-btn" data-i18n="peft.lint_btn">🔍 Lint</button>
<button type="button" id="peft-example-bug-btn" class="secondary" data-i18n="peft.example_bug_btn">↳ Example: silent base-load</button>
<button type="button" id="peft-example-qlora-btn" class="secondary" data-i18n="peft.example_qlora_btn">↳ Example: QLoRA order bug</button>
<button type="button" id="peft-example-clean-btn" class="secondary" data-i18n="peft.example_clean_btn">↳ Example: clean</button>
</div>
<p id="peft-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="peft-output" style="margin-top: 1em;"></div>
</section>
<!-- Prompt-Cache Diff Predictor (mode=cache, v0.8.4 anti-bullshit pack #10) -->
<section id="cache-section" style="display:none;">
<h2><span data-i18n="cache.title">🔁 Prompt-Cache Diff Predictor</span>
<span class="info"><span class="tooltip" data-i18n="cache.tip">
<strong>Why this matters</strong>: Anthropic's <code>cache_control</code> cache breaks at the first token diff in the marked prefix. OpenAI auto-caches prefixes ≥1024 tokens but invalidates on any change. Gemini context caching requires ≥32K tokens. A misplaced edit silently multiplies your bill by 10× — and the API never warns you. Paste old + new prompt, see per-provider hit ratio + cost delta.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="cache.desc">
<strong>Don't 10x your bill on a one-character edit.</strong> Paste your previous and current prompt — the predictor finds the longest common prefix, estimates tokens, and shows per-provider cache hit ratio + $ delta vs no-cache.
</p>
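<!--
Illustrative sketch (plain Python) of the prediction: longest common prefix,
chars-per-token estimate, provider minimum. chars_per_token=4 matches the page's
"English" profile; min_prefix_tokens=1024 mirrors OpenAI's documented auto-cache
minimum. Dollar math is omitted since per-token prices change.

import os

def cache_hit_estimate(old: str, new: str, chars_per_token: float = 4.0,
                       min_prefix_tokens: int = 1024) -> dict:
    lcp_tokens = int(len(os.path.commonprefix([old, new])) / chars_per_token)
    total_tokens = max(int(len(new) / chars_per_token), 1)
    if lcp_tokens < min_prefix_tokens:
        return {"hit_ratio": 0.0, "reason": "prefix below provider minimum"}
    return {"hit_ratio": lcp_tokens / total_tokens,
            "cached_tokens": lcp_tokens,
            "uncached_tokens": total_tokens - lcp_tokens}

# One character edited at the very top of a 40k-char prompt busts the whole cache:
p = "You are a helpful assistant. " + "x" * 40_000
print(cache_hit_estimate(p, "You are a helpful Assistant. " + "x" * 40_000))
-->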
<div class="form-row" style="display:flex; gap:1em; flex-wrap:wrap;">
<div style="flex:1; min-width:300px;">
<label for="cache-old" data-i18n="cache.old_label">Old prompt:</label>
<textarea id="cache-old" rows="10" style="width:100%;font-family:monospace;font-size:0.85em;" data-i18n-placeholder="cache.old.placeholder" placeholder="You are a helpful assistant. …"></textarea>
</div>
<div style="flex:1; min-width:300px;">
<label for="cache-new" data-i18n="cache.new_label">New prompt:</label>
<textarea id="cache-new" rows="10" style="width:100%;font-family:monospace;font-size:0.85em;" data-i18n-placeholder="cache.new.placeholder" placeholder="You are a helpful assistant. …"></textarea>
</div>
</div>
<div class="form-row">
<label for="cache-profile" data-i18n="cache.profile_label">Tokenizer profile:</label>
<select id="cache-profile">
<option value="english" data-i18n="cache.profile.english">English (chars/4)</option>
<option value="code" data-i18n="cache.profile.code">Code (chars/3.5)</option>
<option value="mixed" data-i18n="cache.profile.mixed">CJK / Cyrillic (chars/2)</option>
</select>
<label for="cache-output-tokens" data-i18n="cache.output_label">Estimated output tokens:</label>
<input type="number" id="cache-output-tokens" value="500" min="0" max="100000" style="width:8em;" />
</div>
<div class="form-row">
<button type="button" id="cache-diff-btn" data-i18n="cache.diff_btn">🔍 Predict</button>
<button type="button" id="cache-example-good-btn" class="secondary" data-i18n="cache.example_good_btn">↳ Example: 99% hit</button>
<button type="button" id="cache-example-broken-btn" class="secondary" data-i18n="cache.example_broken_btn">↳ Example: cache busted</button>
<button type="button" id="cache-example-belowmin-btn" class="secondary" data-i18n="cache.example_belowmin_btn">↳ Example: below OpenAI min</button>
</div>
<p id="cache-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="cache-output" style="margin-top: 1em;"></div>
</section>
<!-- Speculative-Decode Compatibility Checker (mode=speculative, v0.8.5 #11) -->
<section id="speculative-section" style="display:none;">
<h2><span data-i18n="speculative.title">🔬 Speculative-Decode Compatibility</span>
<span class="info"><span class="tooltip" data-i18n="speculative.tip">
<strong>Why this matters</strong>: speculative decoding (vLLM, SGLang, llama.cpp, transformers) requires the draft and target model to share an EXACT vocabulary. Any token-id disagreement means the target rejects every draft token — you pay BOTH compute costs and get WORSE throughput than baseline. The system reports nominal output (just slower), so the bug is invisible in unit tests. This tool fetches <code>tokenizer.json</code> from the HF Hub for both ids and compares them.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="speculative.desc">
<strong>Don't ship spec-dec with mismatched vocabs.</strong> Paste target + draft HF model ids → the tool fetches both tokenizers, compares vocab type, size, sampled token-ids, special tokens, and added tokens → verdict + speedup estimate.
</p>
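<!--
Offline analog (plain Python) of this mode's check, using the transformers library.
The strict size check is conservative: some engines tolerate a target vocab that is
a padded superset of the draft's.

from transformers import AutoTokenizer   # pip install transformers

def vocab_compatible(target_id: str, draft_id: str, sample: int = 2000) -> bool:
    t = AutoTokenizer.from_pretrained(target_id).get_vocab()
    d = AutoTokenizer.from_pretrained(draft_id).get_vocab()
    if len(t) != len(d):
        return False
    # any token-id disagreement means the target rejects every draft token
    return all(d.get(tok) == tid for tok, tid in list(t.items())[:sample])

print(vocab_compatible("Qwen/Qwen2.5-72B-Instruct", "Qwen/Qwen2.5-7B-Instruct"))
-->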
<div class="form-row">
<label for="spec-target-id" data-i18n="speculative.target_label">Target (large) model id:</label>
<input type="text" id="spec-target-id" placeholder="Qwen/Qwen2.5-72B-Instruct" style="flex:1;" />
</div>
<div class="form-row">
<label for="spec-draft-id" data-i18n="speculative.draft_label">Draft (small) model id:</label>
<input type="text" id="spec-draft-id" placeholder="Qwen/Qwen2.5-7B-Instruct" style="flex:1;" />
</div>
<div class="form-row">
<button type="button" id="spec-check-btn" data-i18n="speculative.check_btn">🔍 Check compatibility</button>
<button type="button" id="spec-example-good-btn" class="secondary" data-i18n="speculative.example_good_btn">↳ Example: Qwen2.5 7B/72B (good)</button>
<button type="button" id="spec-example-bad-btn" class="secondary" data-i18n="speculative.example_bad_btn">↳ Example: cross-family (bad)</button>
</div>
<p class="recipe-desc subtle" style="font-size:0.82em;" data-i18n="speculative.gated_note">
💡 <strong>Gated models</strong> (Llama, Mistral, Gemma) require HF login + license acceptance — this tool can't authenticate, so those requests return 401. Use open-weight pairs (Qwen, Phi, DeepSeek, Yi, StarCoder, Falcon) for demos.
</p>
<p id="spec-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="spec-output" style="margin-top: 1em;"></div>
</section>
<!-- Multilingual Tokenizer Tax (mode=tax, v0.8.7 anti-bullshit pack #13) -->
<section id="tax-section" style="display:none;">
<h2><span data-i18n="tax.title">🌍 Multilingual Tokenizer Tax</span>
<span class="info"><span class="tooltip" data-i18n="tax.tip">
<strong>Why this matters</strong>: tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ tokens in Chinese on a Latin-trained tokenizer (Llama, Phi). Cost per request and effective context BOTH degrade silently. Paste your text, see actual token counts across vendor tokenizers — no estimation, real BPE encoding via transformers.js in your browser.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="tax.desc">
<strong>Don't 3× your bill on Chinese support.</strong> Paste any text → real per-tokenizer BPE encoding across Qwen / Phi / Llama / Gemma / GPT-4 / Claude → see the cost asymmetry vs your baseline.
</p>
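<!--
Server-side analog (plain Python) of what the page does with transformers.js:
real BPE encoding, then a token-count ratio. Qwen/Qwen2.5-0.5B is just one
convenient open tokenizer; swap in any HF id.

from transformers import AutoTokenizer   # pip install transformers

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
en = "The quick brown fox jumps over the lazy dog."
zh = "敏捷的棕色狐狸跳过了懒惰的狗。"
n_en, n_zh = len(tok.encode(en)), len(tok.encode(zh))
print(f"en={n_en} tokens, zh={n_zh} tokens, tax={n_zh / n_en:.2f}x")
-->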
<div class="form-row">
<label for="tax-input" data-i18n="tax.input_label">Text to tokenize:</label>
<textarea id="tax-input" rows="8" style="width:100%;font-family:monospace;font-size:0.9em;" data-i18n-placeholder="tax.input.placeholder" placeholder="Paste any text — English, Chinese, Arabic, code, …"></textarea>
</div>
<div class="form-row">
<button type="button" id="tax-tokenize-btn" data-i18n="tax.tokenize_btn">🔬 Tokenize all</button>
<button type="button" id="tax-sample-en-btn" class="secondary" data-i18n="tax.sample_en_btn">↳ Sample: English</button>
<button type="button" id="tax-sample-zh-btn" class="secondary" data-i18n="tax.sample_zh_btn">↳ Sample: 中文</button>
<button type="button" id="tax-sample-ar-btn" class="secondary" data-i18n="tax.sample_ar_btn">↳ Sample: عربى</button>
<button type="button" id="tax-sample-mixed-btn" class="secondary" data-i18n="tax.sample_mixed_btn">↳ Sample: mixed</button>
<button type="button" id="tax-sample-code-btn" class="secondary" data-i18n="tax.sample_code_btn">↳ Sample: code</button>
</div>
<p id="tax-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="tax-output" style="margin-top: 1em;"></div>
<p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;" data-i18n="tax.firstload_note">
💡 <strong>First-time load:</strong> the tool fetches transformers.js (~750 KB) plus each tokenizer's vocab on demand (~5-15 MB per tokenizer, cached after the first load). Subsequent runs are instant. All processing is local — your text never leaves the browser.
</p>
</section>
<!-- LongScore (mode=longscore, v0.8.8 anti-bullshit pack #14) -->
<section id="longscore-section" style="display:none;">
<h2><span data-i18n="longscore.title">🎯 LongScore</span>
<span class="info"><span class="tooltip" data-i18n="longscore.tip">
<strong>Why this matters</strong>: models routinely claim a 128K context window, but accuracy degrades long before that. LongScore (a peer-reviewed metric from 100-LongBench, ACL 2025) measures relative degradation past short context. It disentangles base ability from true long-context capability — so you compare degradation, not raw scores. Lookup against the RULER + HELMET knowledge bases (n=93 models).
</span></span>
</h2>
<p class="recipe-desc" data-i18n="longscore.desc">
<strong>How much does your model degrade past short context?</strong> Paste an HF model id → see LongScore (relative degradation) + per-length breakdown + HELMET 7-task scores when available. No GPU. No inference. Pure lookup against published benchmarks.
</p>
<div class="form-row">
<label for="longscore-input" data-i18n="longscore.input_label">Model id:</label>
<input type="text" id="longscore-input" data-i18n-placeholder="longscore.input.placeholder"
placeholder="e.g. Qwen2.5-72B-Instruct or meta-llama/Llama-3.1-70B-Instruct" style="flex:1;" />
<button type="button" id="longscore-lookup-btn" data-i18n="longscore.lookup_btn">🔎 Lookup</button>
</div>
<div class="form-row">
<button type="button" id="longscore-example-good-btn" class="secondary" data-i18n="longscore.example_good_btn">↳ Example: Jamba-1.5-Large (no degradation)</button>
<button type="button" id="longscore-example-mid-btn" class="secondary" data-i18n="longscore.example_mid_btn">↳ Example: Llama-3.1-70B (moderate)</button>
<button type="button" id="longscore-example-bad-btn" class="secondary" data-i18n="longscore.example_bad_btn">↳ Example: dbrx (severe)</button>
</div>
<p id="longscore-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="longscore-output" style="margin-top: 1em;"></div>
<p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;" data-i18n="longscore.formula_note">
💡 <strong>LongScore</strong> = mean over l ∈ {16K, 32K, 64K, 128K} of (S_l − Base) / Base, where Base = mean(S_4K, S_8K). Source: <a href="https://arxiv.org/abs/2505.19293" target="_blank">100-LongBench, ACL 2025</a>. Data: <a href="https://github.com/NVIDIA/RULER" target="_blank">NVIDIA RULER</a> (per-length, n=33) + <a href="https://github.com/princeton-nlp/HELMET" target="_blank">HELMET</a> (aggregate at 128K, n=60). 0 = no degradation; -0.30 = severe.
</p>
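<!--
The formula above as runnable Python (toy scores, not real benchmark data):

def longscore(s):
    """Mean over long lengths of (S_l − Base)/Base, Base = mean(S_4k, S_8k).
    0 = no degradation; -0.30 = severe."""
    base = (s["4k"] + s["8k"]) / 2
    longs = ("16k", "32k", "64k", "128k")
    return sum((s[l] - base) / base for l in longs) / len(longs)

# A model that holds up to 32k, then drops: LongScore ≈ -0.195
print(longscore({"4k": 92, "8k": 90, "16k": 88, "32k": 85, "64k": 70, "128k": 50}))
-->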
</section>
<section id="hub-section" style="display:none;">
<h2><span data-i18n="hub.title">🧭 Solutions Hub</span>
<span class="info"><span class="tooltip" data-i18n="hub.tip">
Map of every documented LLM-eval pain we know about: which tafagent mode addresses it (if any), and the best-of-breed external tools the community already trusts. Goal: full coverage. If a canonical tool exists elsewhere, we link rather than rebuild.
</span></span>
</h2>
<p class="recipe-desc" data-i18n="hub.desc">
<strong>Don't reinvent — find.</strong> 30+ pains mapped to tafagent modes + curated external tools. Browse by category, search by keyword, or see the gaps where new modes would help most.
</p>
<div class="form-row">
<input type="text" id="hub-search" placeholder="search: e.g. 'forgetting' or 'vendor' or 'RAG'…" style="flex:1;" />
<button type="button" id="hub-clear-btn" class="secondary" data-i18n="hub.clear_btn">✕ Clear</button>
</div>
<p id="hub-status" class="recipe-desc" style="font-size:0.92em;"></p>
<div id="hub-output" style="margin-top: 1em;"></div>
</section>
<!-- Recipe selector (mode=recipe) -->
<section id="recipe-section" style="display:none;">
<h2 data-i18n="recipe.title">📋 Recipe</h2>
<select id="recipe-select" disabled>
<option value="" data-i18n="recipe.default">— select a recipe —</option>
</select>
<p id="recipe-desc-display" class="recipe-desc"></p>
</section>
<!-- Form (mode=recipe) -->
<section id="form-section" style="display:none;">
<h2 data-i18n="recipe.input_title">🎯 Inputs</h2>
<div class="form-row">
<label for="preset" data-i18n="profile.preset_label">Preset model:</label>
<select id="preset" disabled>
<option value="" data-i18n="profile.preset_default">— select to autofill —</option>
</select>
</div>
<div class="form-row">
<label for="hf-id" data-i18n="profile.hf_label">Or any HF model:</label>
<input type="text" id="hf-id"
data-i18n-placeholder="profile.hf_placeholder"
placeholder="e.g. Qwen/Qwen2.5-32B-Instruct" style="flex:1;" />
<button id="hf-fetch-btn" type="button" class="secondary" data-i18n="profile.fetch_btn">📥 Fetch</button>
</div>
<div id="hf-status" class="subtle" style="margin: -0.5rem 0 1rem; min-height:1.2em;"></div>
<div id="dynamic-form" class="form-grid"></div>
<button id="run-btn" disabled data-i18n="ask.btn">🚀 Analyze</button>
</section>
<!-- Output (single-recipe verdict + chain) -->
<section id="output-section" style="display:none;">
<h2 data-i18n="verdict.title">📊 Verdict</h2>
<div id="verdict-box"></div>
<div class="share-bar">
<button id="share-btn" class="secondary" type="button" data-i18n="share.btn">🔗 Copy share link</button>
<button id="recipe-download-btn" class="secondary" type="button" data-i18n="share.download">💾 Download JSON</button>
<button id="recipe-download-md-btn" class="secondary" type="button" data-i18n="share.download_md">📝 Markdown</button>
<button id="recipe-download-tex-btn" class="secondary" type="button" data-i18n="share.download_tex">📜 LaTeX</button>
<button id="recipe-submit-btn" class="secondary" type="button" data-i18n="share.submit">📤 Submit to registry</button>
<span id="share-status" class="subtle"></span>
</div>
<h2 data-i18n="chain.title">🔍 Computation Chain</h2>
<p class="subtle" data-i18n="chain.desc">Every number below is deterministic Python. Click a step to expand.</p>
<div id="chain-box"></div>
<h2 id="answer-header" style="display:none;" data-i18n="answer.title">💬 Plain-English Answer</h2>
<div id="answer-box" style="display:none;"></div>
</section>
<!-- Profile output -->
<section id="profile-output" style="display:none;">
<h2 data-i18n="tafcard.title">📇 TAF Card — full model profile</h2>
<div id="profile-box"></div>
</section>
<!-- Compare output -->
<section id="compare-output" style="display:none;">
<h2 data-i18n="compare.title_out">🆚 Comparison Table</h2>
<div id="compare-box"></div>
</section>
<!-- Hidden file input for JSON upload (shared by all import buttons) -->
<input type="file" id="import-file" accept=".json,application/json" style="display:none;" />
<!-- Floating import bar (always visible) -->
<section id="import-section">
<h2 data-i18n="share.import_title">📂 Import a shared TAF result</h2>
<p class="recipe-desc" data-i18n="share.import_desc">
Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally.
Same view as if you'd run it yourself.
</p>
<button id="import-btn" class="secondary" type="button" data-i18n="share.import_btn">📂 Load shared JSON</button>
<span id="import-status" class="subtle" style="margin-left:0.75rem;"></span>
</section>
<!-- Browse community submissions (live from GitHub Issues) -->
<section id="community-section">
<h2 data-i18n="community.title">🌐 Recent community submissions</h2>
<p class="recipe-desc" data-i18n="community.desc">
Live feed from the public registry. Click any submission to view full analysis.
<a href="https://github.com/karlesmarin/tafagent-registry/issues" target="_blank" data-i18n="community.browse_all">Browse all →</a>
</p>
<div id="community-feed" class="subtle"><span data-i18n="community.loading">Loading...</span></div>
</section>
<!-- FALSIFICATION dashboard (paper predictions status) -->
<section id="falsification-section">
<h2 data-i18n="falsification.title">🔬 Paper predictions — falsification status</h2>
<p class="recipe-desc" data-i18n="falsification.desc">
The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested.
Here's the live status of every prediction in the paper.
</p>
<div id="falsification-table"></div>
</section>
</main>
<footer>
<p data-i18n="footer.text">
© 2026 Carles Marin · Apache-2.0 · independent research · the tool that closes the paper's loop.
</p>
<p>
<a href="https://github.com/karlesmarin/tafagent" target="_blank">Source on GitHub</a>
·
<a href="https://github.com/karlesmarin/NeurIPS" target="_blank">Paper repo</a>
</p>
<p class="subtle" data-i18n="footer.tech_stack">
Computation: Pyodide · Synthesis: WebLLM (Qwen2.5-0.5B local) · Hosting: GitHub Pages · Cost: $0
</p>
</footer>
<script type="module" src="js/main.js"></script>
</body>
</html>