| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8" /> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0" /> | |
| <title>TAF Agent — Test ANY Transformer LLM in Your Browser</title> | |
| <meta name="description" content="Free, auditable diagnostic for transformer LLMs. Predict viability (long-context, KV compression, training budget, hardware) from config alone. Runs entirely in your browser. No server, no auth, no cost." /> | |
| <meta name="keywords" content="transformer, LLM, diagnostic, RoPE, NIAH, KV cache, viability, free, browser, GPU, NeurIPS, TAF" /> | |
| <meta name="author" content="Carles Marin" /> | |
| <!-- OpenGraph for social sharing (Twitter, LinkedIn, WhatsApp, Discord, etc.) --> | |
| <meta property="og:type" content="website" /> | |
| <meta property="og:url" content="https://karlesmarin.github.io/tafagent/" /> | |
| <meta property="og:title" content="TAF Agent — Test ANY Transformer LLM in Your Browser" /> | |
| <meta property="og:description" content="Free, auditable transformer LLM diagnostic. 8 recipes, 5 modes, 4 languages. Runs in your browser. No server, no auth, $0/month forever." /> | |
| <meta property="og:site_name" content="TAF Agent" /> | |
| <!-- Twitter Card --> | |
| <meta name="twitter:card" content="summary_large_image" /> | |
| <meta name="twitter:title" content="TAF Agent — Test ANY Transformer LLM in Your Browser" /> | |
| <meta name="twitter:description" content="Free, auditable transformer LLM diagnostic. 8 recipes, 5 modes, 4 languages. Runs in your browser. $0 forever." /> | |
| <!-- Theme color for browser UI --> | |
| <meta name="theme-color" content="#0a0e14" /> | |
| <link rel="stylesheet" href="style.css" /> | |
| <script src="https://cdn.jsdelivr.net/pyodide/v0.26.4/full/pyodide.js"></script> | |
| </head> | |
| <body> | |
| <a href="#mode-section" class="skip-link" data-i18n="a11y.skip">Skip to main content</a> | |
| <header> | |
| <!-- Language switcher (top-right, round flags) --> | |
| <div class="lang-switcher"> | |
| <button class="lang-btn" data-lang="en" data-label="English" title="English" aria-label="Switch language to English">🇬🇧</button> | |
| <button class="lang-btn" data-lang="es" data-label="Español" title="Español" aria-label="Cambiar idioma a Español">🇪🇸</button> | |
| <button class="lang-btn" data-lang="fr" data-label="Français" title="Français" aria-label="Changer la langue en Français">🇫🇷</button> | |
| <button class="lang-btn" data-lang="zh" data-label="中文" title="中文" aria-label="切换语言至中文">🇨🇳</button> | |
| </div> | |
| <h1 data-i18n="hero.title">🔬 TAF Agent</h1> | |
| <p class="tagline" data-i18n="hero.tagline"> | |
| Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup. | |
| </p> | |
| <p class="subtle" data-i18n="hero.subtitle"> | |
| Predicts whether a model will work for your use case <em>before</em> you spend money or time. Everything runs in your browser — your inputs never leave this tab. | |
| </p> | |
| <p class="subtle" style="margin-top:0.25rem; font-size:0.85rem;" data-i18n="hero.about"> | |
| Built by an independent researcher. Open source. Not affiliated with any model vendor. | |
| </p> | |
| <p class="hero-buttons"> | |
| <button id="quickstart-btn" type="button" data-i18n="hero.quickstart_btn">⚡ Quick start</button> | |
| <button id="inventory-btn" type="button" data-i18n="hero.inventory_btn">🧰 What it gives you</button> | |
| <button id="help-btn" type="button" data-i18n="hero.help">📘 Manual & examples</button> | |
| </p> | |
| </header> | |
| <!-- Help modal --> | |
| <div id="help-modal" role="dialog" aria-modal="true" aria-labelledby="help-modal-title" aria-hidden="true"> | |
| <div class="help-content"> | |
| <button class="help-close" id="help-close" aria-label="Close help">×</button> | |
| <h2 id="help-modal-title" data-i18n="help.title">📘 TAF Agent — User Manual</h2> | |
| <h3 data-i18n="help.what.title">What does it do?</h3> | |
| <p data-i18n="help.what.body">Predicts <strong>practical viability</strong> of any transformer LLM | |
| <em>before you spend GPU/$</em>. Answers questions like "will this model work at L=32K?" or | |
| "should I train custom or use API?" using deterministic Python formulas (TAF — Thermodynamic Attention Framework).</p> | |
| <h3 data-i18n="help.modes.title">How to use — 7 modes</h3> | |
| <p data-i18n="help.modes.profile"><strong>📇 Profile</strong>: paste model id → all recipes at once = TAF Card. <strong>Best starting point</strong>.</p> | |
| <p data-i18n="help.modes.compare"><strong>🆚 Compare</strong>: 2-3 models side-by-side on same recipe. Best when choosing between candidates.</p> | |
| <p data-i18n="help.modes.inspector"><strong>🔍 Inspect config</strong>: paste raw <code>config.json</code> → tool parses + runs full Profile. For private models, in-development configs, or models not yet on HF Hub.</p> | |
| <p data-i18n="help.modes.ask"><strong>💬 Ask plain English</strong>: free-form question, in-browser LLM picks the recipe. Best for casual exploration.</p> | |
| <p data-i18n="help.modes.recipe"><strong>📋 Recipe + form</strong>: manual selection, full parameter control. Best when you want exact control.</p> | |
| <p data-i18n="help.modes.diagnose"><strong>🩺 Diagnose CLI</strong>: generate Python command to measure γ on your local machine (transformers + numpy). Fast ≈5 min CPU; full ≈20–60 min GPU. Output JSON re-uploadable via Inspect.</p> | |
| <p data-i18n="help.modes.phase"><strong>📊 Phase diagram</strong>: scatter plot of 23 panel models on (log θ, γ) plane. Hagedorn line γ=1 separates Phase A from Phase B. Click a dot to load that model into Recipe form.</p> | |
| <h3 data-i18n="help.recipes.title">The 8 recipes available</h3> | |
| <p data-i18n="help.recipe.x1.title"><strong>X-1 Custom training vs API</strong> — compares cost of training your own model vs paying for API access.</p> | |
| <div class="help-example" data-i18n="help.recipe.x1.example"> | |
| Try: <em>"Should I train an 8B custom model or use GPT-4o for 50M tokens/month?"</em><br> | |
| Answer types: YES (custom) / NO (API) with break-even months. | |
| </div> | |
| <p data-i18n="help.recipe.x2.title"><strong>X-2 Long Context Viability</strong> — predicts if a model serves a target context length reliably.</p> | |
| <div class="help-example" data-i18n="help.recipe.x2.example"> | |
| Try: <em>"Will Meta-Llama-3-8B handle 32000 tokens for retrieval?"</em><br> | |
| Chains: γ_Padé → decomposition → d_horizon → NIAH ceiling → hallucination → KV memory.<br> | |
| Verdict: YES / DEGRADED / NO with mitigation if needed. | |
| </div> | |
| <p data-i18n="help.recipe.x3.title"><strong>X-3 Budget pre-flight</strong> — given $ budget, what model is feasible to train?</p> | |
| <div class="help-example" data-i18n="help.recipe.x3.example"> | |
| Try: <em>"I have $5000, what model can I train?"</em><br> | |
| Answer: GO / TINY-MODEL / MEMORY-LIMITED with concrete N (params) and D (tokens). | |
| </div> | |
| <p data-i18n="help.recipe.x5.title"><strong>X-5 Hardware selection</strong> — which GPU should I use to serve at target throughput?</p> | |
| <div class="help-example" data-i18n="help.recipe.x5.example"> | |
| Try: <em>"Cheapest hardware to serve Llama-3-8B at 10M tokens/day"</em><br> | |
| Answer: best GPU + $/Mtok + capacity vs target. | |
| </div> | |
| <p data-i18n="help.recipe.x19.title"><strong>X-19 KV Compression decision</strong> — should I use soft decay, hard cutoff, or literature methods?</p> | |
| <div class="help-example" data-i18n="help.recipe.x19.example"> | |
| Try: <em>"How to compress KV cache for Qwen2.5-7B at 32K?"</em><br> | |
| Answer: USE SOFT DECAY / USE D_f CUTOFF / USE LITERATURE METHODS / USE HARD T_train. | |
| </div> | |
| <h3 style="margin-top: 1.5em;" data-i18n="help.divider.v04_s29">— v0.4 (session 29 findings) —</h3> | |
| <p data-i18n="help.section.v04"><strong>What's new in v0.4</strong> (session 29 findings, 2026-04-28): three diagnostic recipes derived from cross-model panel analysis (n=22 LLMs).</p> | |
| <p data-i18n="help.recipe.x21.title"><strong>X-21 Imprint Purity Diagnostic</strong> — predicts γ on RANDOM tokens via ν=−1/(2π); how clean is the model's RoPE prediction?</p> | |
| <div class="help-example" data-i18n="help.recipe.x21.example"> | |
| Try: <em>"How clean is the RoPE prediction on Llama-3-8B?"</em><br> | |
| Answer: predicted γ_random + purity diagnostic (CLEAN / OVER-IMPRINTED / UNDER-IMPRINTED). | |
| </div> | |
| <p data-i18n="help.v04.imprint" style="font-size: 0.9em; opacity: 0.85;"><strong>Learned-imprint slope ν = −1/(2π)</strong>: RoPE rotation period 2π drives a positional bias on weights, proportional to log(N_params). Even random tokens show this scaling. ν is DERIVED — not fitted (empirical err 0.3%).</p> | |
| <p data-i18n="help.recipe.x22.title"><strong>X-22 Compute-Context Invariant</strong> — does γ × log(N²·D) lie in panel band 51.2 ± 16.8? Detects scaling/training anomalies.</p> | |
| <div class="help-example" data-i18n="help.recipe.x22.example"> | |
| Try: <em>"Does Mistral-7B fit the compute-context invariant?"</em><br> | |
| Answer: K = γ·log(N²·D), z-score, IN-BAND or OUTLIER. | |
| </div> | |
| <p data-i18n="help.v04.invariant" style="font-size: 0.9em; opacity: 0.85;"><strong>Chinchilla-attention invariant K</strong>: γ × log(N²·D) ≈ 51.2 ± 16.8 (CV=0.329). Connects compute scaling and attention exponent into a single dimensionless number.</p> | |
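A minimal Python sketch of the invariant check described above. The mean 51.2 and spread 16.8 come from the text; the natural logarithm, the one-sigma band threshold, the function names, and the example inputs are assumptions made for illustration, not values taken from the tool.

```python
import math

PANEL_MEAN = 51.2  # panel mean of K, as quoted in the text
PANEL_STD = 16.8   # panel spread of K, as quoted in the text


def compute_context_invariant(gamma, n_params, n_tokens):
    """Return (K, z) with K = gamma * log(N^2 * D)."""
    # Assumption: natural logarithm. Written as 2*ln(N) + ln(D) to avoid huge products.
    k = gamma * (2.0 * math.log(n_params) + math.log(n_tokens))
    z = (k - PANEL_MEAN) / PANEL_STD
    return k, z


def band_verdict(z, z_max=1.0):
    # Assumption: "in band" means within one panel standard deviation of the mean.
    return "IN-BAND" if abs(z) <= z_max else "OUTLIER"


# Hypothetical inputs: gamma=0.7, 7e9 parameters, 2e12 training tokens.
k, z = compute_context_invariant(gamma=0.7, n_params=7e9, n_tokens=2e12)
print(f"K={k:.1f}  z={z:+.2f}  {band_verdict(z)}")
```

Passing `z_max` explicitly makes it easy to try a stricter or looser band than the one assumed here.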
| <p data-i18n="help.recipe.x23.title"><strong>X-23 IH-Phase Detector</strong> — pre- or post-induction-head? Cheap probe via sign(γ_text − γ_random).</p> | |
| <div class="help-example" data-i18n="help.recipe.x23.example"> | |
| Try: <em>"Is Qwen2.5-7B post-induction-head?"</em><br> | |
| Answer: CONFIRMED PRE-IH / CONFIRMED POST-IH / ANOMALY (with size-vs-Δγ consistency check). | |
| </div> | |
| <p data-i18n="help.v04.ih_probe" style="font-size: 0.9em; opacity: 0.85;"><strong>Δγ as IH probe</strong>: sign(γ_text − γ_random) > 0 ⟺ post-induction-head. Cheaper than running an in-context-learning benchmark.</p> | |
| <p data-i18n="help.v04.constants" style="font-size: 0.9em; opacity: 0.85;"><strong>γ-cluster on famous constants</strong> (intriguing, n=4): CodeLlama-13b γ=0.382 ≈ 1−1/φ (golden conjugate, err 0.0003); pythia-1.4b γ=0.705 ≈ 1/√2; Llama-2-7b γ=0.287 ≈ 1−1/√2; Mistral-Nemo γ=0.428 ≈ log_10(e). Caveat: could be coincidence.</p> | |
| <h3 style="margin-top: 1.5em;" data-i18n="v04.title">🆕 v0.4 — New diagnostics (session 31)</h3> | |
| <p style="opacity: 0.85;"><em data-i18n="v04.section.intro">Four new diagnostic functions derived in session 31 (2026-04-30) from cross-of-crosses formula games and Socratic interrogation. Available in <code>taf_browser.py</code> §33.</em></p> | |
| <p><strong data-i18n="v04.arch.label">Architectural Concentration</strong> — <span data-i18n="v04.arch.desc">γ_text ≈ γ_Padé − 0.012·n_kv. Cross-panel correlational law (R²=0.30). Caveat: not per-model predictor.</span></p> | |
| <p><strong data-i18n="v04.pdi.label">PDI — Padé Deviation Index</strong> — <span data-i18n="v04.pdi.desc">PDI = d_horizon_obs/T_eval. Traffic light: green (≈1), orange (>>1), yellow (<<1), red (Phase B negative).</span></p> | |
| <p><strong data-i18n="v04.4bit.label">4-bit Shift Predictor</strong> — <span data-i18n="v04.4bit.desc">MHA: R²(bf16)<0.9 → γ rises; R²>0.99 → γ drops. GQA: precision-robust regardless.</span></p> | |
| <p><strong data-i18n="v04.crit.label">Critical Exponents Bundle</strong> — <span data-i18n="v04.crit.desc">ν_c, β_c, η_c (=γ−1, CORRECTED), α_C, γ_susc with AM-GM minimum at γ=1−1/√2≈0.293.</span></p> | |
| <h3 style="margin-top: 1.5em;" data-i18n="v05.title">🔬 v0.5 — Machine-verified consistency (session 32)</h3> | |
| <p style="opacity: 0.85;"><em data-i18n="v05.section.intro">Sage Groebner basis + Lean Mathlib4 dual-tool verification of <strong>15 algebraic identities</strong> of TAF critical exponents. First transformer-attention framework with formal machine-proof backing.</em></p> | |
| <p><strong data-i18n="v05.verify.label">Algebraic Consistency Check</strong> — <span data-i18n="v05.verify.desc">Given measured γ, verifies 12 D-SAGE identities (D-SAGE-1: 2η²+η·γ_χ+1=0, β·χ=−1, α+χ=2, etc.). All passing = framework intact. Failures indicate bf16 outliers / quantization artifacts.</span></p> | |
| <p><strong data-i18n="v05.dsage1.label">D-SAGE-1 (★★ core)</strong> — <span data-i18n="v05.dsage1.desc">Quadratic identity 2η² + η·γ_χ + 1 = 0 (Sage Groebner-discovered, Lean-verified). Replaces incorrect 'triple closure' claim. Refutes paper 1's η=2γ algebraically.</span></p> | |
| <p><strong data-i18n="v05.erratum.label">Paper 1 erratum — η correction</strong> — <span data-i18n="v05.erratum.desc">Paper 1 originally claimed η = 2γ. Sage Groebner + Lean Mathlib4 proved this fails (residual (-4γ³+5γ+1)/(1-γ) > 0 ∀γ ∈ Phase A). Correct value: η = γ−1, satisfying D-SAGE-1.</span></p> | |
| <p><strong data-i18n="v05.repro.label">Reproducibility</strong> — <span data-i18n="v05.repro.desc">All 15 theorems are machine-proved in Lean Mathlib4 (the 1973-job build succeeds). Sage script: <code>analysis/sage_recursive_sweep_2026-04-30.sage</code>. Lean code: <code>lean_taf/taf/Taf/Identities.lean</code>.</span></p> | |
| <h3 style="margin-top: 1.5em;" data-i18n="help.v06.title">🆕 v0.6 — γ predicted-vs-observed + Cardy ΔH + Lean badges</h3> | |
| <p style="opacity: 0.85;" data-i18n="help.v06.intro"><em>v0.6 (2026-05-06): three new diagnostics live in the TAF Card under <strong>🔬 Diagnostics</strong>. All run in your browser; γ_observed comes from the Diagnose CLI on real weights.</em></p> | |
| <p><strong data-i18n="help.v06.layout.title">TAF Card layout (new in v0.6)</strong></p> | |
| <p data-i18n="help.v06.layout.body">After clicking <strong>🚀 Generate full profile</strong> the card shows: a <strong>hero strip</strong> on top (architecture class + meta + 3 pills: aggregate verdict ✅/⚠/❌, γ headline, 🧲 Anti-Ising if Phase A) and four <strong>expandable sections</strong>: <strong>📋 Recipes</strong> (open by default — verdict per dimension), <strong>🔬 Diagnostics</strong> (key numbers, γ predicted vs observed, what-if explorer), <strong>✓ Verification</strong> (Sage+Lean algebraic consistency, falsification F1-F23), <strong>📂 Provenance & share</strong> (calibration audit + JSON download / share link / registry submit). Click any header to expand. Every variable has an inline <strong>ⓘ</strong> tooltip.</p> | |
| <p><strong data-i18n="help.v06.gamma_check.title">γ predicted vs observed</strong></p> | |
| <p data-i18n="help.v06.gamma_check.body">Enter the empirically-measured γ from your model and the tool computes <strong>η = θ_eff_obs / θ_eff_Padé</strong> and classifies into one of 5 regimes:</p> | |
| <ul style="font-size: 0.92em;"> | |
| <li data-i18n="help.v06.case.normal"><strong>Normal</strong> (η ∈ [0.85, 1.15]) — model uses its full nominal context. <em>Use case</em>: validate a new release before adopting it.</li> | |
| <li data-i18n="help.v06.case.fraud"><strong>Fraud</strong> (η < 0.01) — nominal θ inflated; model behaves as if θ ≪ advertised. <em>Use case</em>: detect YaRN/marketing inflation (CodeLlama / Mistral-Nemo pattern).</li> | |
| <li data-i18n="help.v06.case.compressed"><strong>Compressed</strong> (η < 0.5) — context compressed; model attends shorter than nominal θ. <em>Use case</em>: spot RLHF/instruction-tuning compression (LLaMA-2 pattern).</li> | |
| <li data-i18n="help.v06.case.overpade"><strong>Over-Padé</strong> (η > 1.5) — model attends farther than Padé predicts. <em>Use case</em>: identify Lerch-corrected regime or undertrained early checkpoints (pythia-1b pattern).</li> | |
| <li data-i18n="help.v06.case.swa"><strong>SWA random-corpus</strong> (γ_obs > 1.05 with random_corpus=Yes) — sliding-window attention signature. <em>Use case</em>: confirm Mistral / Gemma SWA on random tokens.</li> | |
| </ul> | |
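The five regimes listed above can be sketched as a small Python classifier. The thresholds are the ones quoted in the list; the order in which the tests are applied, the `Unclassified` fallback for the gaps the list leaves open (0.5 to 0.85 and 1.15 to 1.5), and the function name are assumptions for illustration.

```python
def classify_regime(eta, gamma_obs=None, random_corpus=False):
    """Map eta = theta_eff_obs / theta_eff_Pade to one of the listed regimes."""
    # Assumption: the SWA signature is checked first and overrides the eta tests.
    if random_corpus and gamma_obs is not None and gamma_obs > 1.05:
        return "SWA random-corpus"
    # Assumption: Fraud is tested before Compressed, since eta < 0.01 also satisfies eta < 0.5.
    if eta < 0.01:
        return "Fraud"
    if eta < 0.5:
        return "Compressed"
    if 0.85 <= eta <= 1.15:
        return "Normal"
    if eta > 1.5:
        return "Over-Padé"
    # The list does not cover 0.5 <= eta < 0.85 or 1.15 < eta <= 1.5.
    return "Unclassified"


for eta in (0.005, 0.3, 1.0, 2.0, 0.7):
    print(eta, "->", classify_regime(eta))
```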
| <p><strong data-i18n="help.v06.cardy.title">Cardy ΔH diagnostic</strong></p> | |
| <p data-i18n="help.v06.cardy.body"><strong>ΔH_Cardy = log(θ_eff_obs / θ_nominal)</strong>. Entropy shift between observed effective θ and nominal θ. Strong negative = compression entropy; near zero = nominal match. Complements η for borderline cases.</p> | |
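As a worked illustration of the ΔH_Cardy formula above, here is a short Python sketch. The natural logarithm and the example θ values are assumptions; the text does not state the log base.

```python
import math


def delta_h_cardy(theta_eff_obs, theta_nominal):
    """Entropy shift log(theta_eff_obs / theta_nominal); natural log is assumed."""
    return math.log(theta_eff_obs / theta_nominal)


# Hypothetical case: nominal theta of 1e6, but the model behaves like theta = 1e4.
print(delta_h_cardy(1e4, 1e6))  # strongly negative: compression
# Nominal match gives zero shift.
print(delta_h_cardy(5e5, 5e5))
```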
| <p><strong data-i18n="help.v06.lean.title">Lean + Mathlib verification badges</strong></p> | |
| <p data-i18n="help.v06.lean.body">TAF identities (Anti-Ising, D-SAGE-1 quadratic, Padé z-substitution, etc.) are formally machine-proven in Lean Mathlib4. Source: <a href="https://github.com/karlesmarin/lean-taf" target="_blank">github.com/karlesmarin/lean-taf</a>. Anyone can clone + <code>lake build</code> to re-verify. The 🧲 Anti-Ising pill in the hero strip is one such badge.</p> | |
| <p><strong data-i18n="help.v06.glossary.title">Variable glossary (also embedded in TAF Card)</strong></p> | |
| <p data-i18n="help.v06.glossary.body">Every variable in the TAF Card has an inline ⓘ tooltip. The complete list: γ, γ_Padé, γ_decomposed, γ_observed, θ, θ_eff_obs, θ_eff_Padé, η, ΔH_Cardy, χ, d_horizon, L_NIAH, KV memory, regime. Hover any ⓘ for the definition + paper section.</p> | |
| <h3 data-i18n="help.add_models.title">Adding new models (3 ways)</h3> | |
| <ul> | |
| <li data-i18n="help.add_models.preset"><strong>Preset list</strong>: 11 popular models curated. Just select from dropdown.</li> | |
| <li data-i18n="help.add_models.hf"><strong>HF Hub fetch</strong>: paste any model id (e.g. <code>Qwen/Qwen2.5-32B-Instruct</code>), | |
| click 📥 Fetch. Browser downloads <code>config.json</code> directly from HuggingFace, fills the form. Works for any public model.</li> | |
| <li data-i18n="help.add_models.manual"><strong>Manual</strong>: fill the form fields directly with values from the model card.</li> | |
| </ul> | |
| <h3 style="margin-top: 1.5em;" data-i18n="help.v07.title">🆕 v0.7 — Anti-bullshit pack (4 new modes)</h3> | |
| <p style="opacity: 0.85;" data-i18n="help.v07.intro"><em>v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference — pure metadata + math.</em></p> | |
| <p><strong data-i18n="help.v07.unmask.title">🪟 Context Unmasker</strong></p> | |
| <p data-i18n="help.v07.unmask.body">Detects when <code>max_position_embeddings</code> is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. <em>Use case</em>: before paying GPU for 32k context, verify the model actually attends that far.</p> | |
| <p><strong data-i18n="help.v07.template.title">📜 Chat-template Sniffer</strong></p> | |
| <p data-i18n="help.v07.template.body">Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting <code>--apply_chat_template</code> silently halves multi-turn accuracy. <em>Use case</em>: before reporting a benchmark score, confirm you applied the template correctly.</p> | |
| <p><strong data-i18n="help.v07.arena.title">🎯 Arena-Elo CI Reconstructor</strong></p> | |
| <p data-i18n="help.v07.arena.body">Chatbot Arena strips confidence intervals from its public leaderboard — a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. <em>Use case</em>: before declaring "model A beats model B", verify their CIs don't overlap.</p> | |
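The pipeline described above (Bradley-Terry fit, bootstrap, overlap test) can be sketched in plain Python as follows. This is an illustrative reconstruction, not the tool's code: the MM update rule, the 1500 + 400·log10 Elo mapping, the percentile intervals, the floor for winless models, and all function names are assumptions, and ties in the vote data are not handled.

```python
import math
import random
from collections import defaultdict


def bradley_terry(votes, iters=100):
    """Fit Bradley-Terry strengths with the MM iteration; votes are (a, b, winner)."""
    models = sorted({m for a, b, _ in votes for m in (a, b)})
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # games played per unordered pair
    for a, b, winner in votes:
        wins[winner] += 1.0
        games[frozenset((a, b))] += 1.0
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(games.get(frozenset((i, j)), 0.0) / (p[i] + p[j])
                        for j in models if j != i)
            # Small floor keeps a winless model finite (pragmatic assumption).
            new[i] = max(wins.get(i, 0.0), 0.01) / denom
        # Normalise so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        p = {m: v / math.exp(log_mean) for m, v in new.items()}
    return p


def to_elo(strengths):
    # Conventional Elo scale; offset and factor are assumptions.
    return {m: 1500.0 + 400.0 * math.log10(s) for m, s in strengths.items()}


def elo_with_ci(votes, n_boot=200, seed=0):
    """Return {model: (elo, lo, hi)} with a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    point = to_elo(bradley_terry(votes))
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = [rng.choice(votes) for _ in votes]
        for m, e in to_elo(bradley_terry(resample)).items():
            samples[m].append(e)
    result = {}
    for m, e in point.items():
        s = sorted(samples[m]) or [e]
        result[m] = (e, s[int(0.025 * (len(s) - 1))], s[int(0.975 * (len(s) - 1))])
    return result


def statistical_ties(ci):
    """List pairs of models whose intervals overlap."""
    names = sorted(ci, key=lambda m: -ci[m][0])
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if ci[a][1] <= ci[b][2] and ci[b][1] <= ci[a][2]]


# Hypothetical vote data for three models.
votes = ([("A", "B", "A")] * 30 + [("A", "B", "B")] * 10 +
         [("B", "C", "B")] * 25 + [("B", "C", "C")] * 15 +
         [("A", "C", "A")] * 32 + [("A", "C", "C")] * 8)
ci = elo_with_ci(votes)
for name, (elo, lo, hi) in sorted(ci.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: {elo:.0f} [{lo:.0f}, {hi:.0f}]")
print("ties:", statistical_ties(ci))
```

A fixed `seed` keeps the bootstrap reproducible, which matters if the intervals are going to be quoted.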
| <p><strong data-i18n="help.v07.contam.title">🧪 Contamination Prior</strong></p> | |
| <p data-i18n="help.v07.contam.body">Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. <em>Use case</em>: decide which scores to trust when comparing two models.</p> | |
| <p><strong data-i18n="help.v07.quant.title">⚖️ Quant-regime Classifier</strong></p> | |
| <p data-i18n="help.v07.quant.body">Predicts γ-shift and ΔPPL for any (model × quant scheme: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8, …). Architecture-aware: small d_head + aggressive GQA → more sensitive; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4). Recommends safer alternatives if a cliff is detected. <em>Use case</em>: before quantizing, predict whether your specific architecture × scheme combo will keep PPL acceptable, with a concrete switch-to suggestion otherwise.</p> | |
| <p><strong data-i18n="help.v07.drift.title">🔀 Cross-framework Drift Bound</strong></p> | |
| <p data-i18n="help.v07.drift.body">Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. <em>Use case</em>: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.</p> | |
| <p><strong data-i18n="help.v07.niah.title">🔍 NIAH → Reasoning Gap</strong></p> | |
| <p data-i18n="help.v07.niah.body">RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's "safe reasoning context" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. <em>Use case</em>: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.</p> | |
| <p><strong data-i18n="help.v08.saturation.title">📈 Benchmark Saturation Detector</strong></p> | |
| <p data-i18n="help.v08.saturation.body">MMLU is saturated (top 88-94%), AIME 2025 saturated within months of release, HumanEval near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict — saturated / near-saturated / discriminative — plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. <em>Use case</em>: before you cite '92% on MMLU' or design an eval, check whether the benchmark still discriminates anything.</p> | |
| <p><strong data-i18n="help.v082.cot.title">📋 JSON CoT-aware Linter</strong></p> | |
| <p data-i18n="help.v082.cot.body">Constrained-decoding engines (llguidance, Outlines, SGLang grammars) emit JSON properties in the order your schema declares them. If you write <code>{ answer, reasoning }</code> the model commits to <code>answer</code> first and CoT collapses into post-hoc justification. Paste any schema (or example response) — the linter classifies each field as <em>reasoning</em>, <em>answer</em>, or <em>other</em>, flags the ordering, and emits a reordered fix you can copy back. <em>Use case</em>: 'My CoT prompt works in plaintext but degrades under JSON mode' → run linter, find the inverted order, fix.</p> | |
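A rough Python sketch of the ordering check described above. The keyword lists used to classify fields, the reordering rule (reasoning fields first, everything else in original order), and the function names are hypothetical; the real linter may classify and reorder differently.

```python
import json

# Hypothetical keyword lists; substring match on the lower-cased field name.
REASONING_KEYS = ("reason", "thought", "analysis", "explanation", "steps")
ANSWER_KEYS = ("answer", "result", "verdict", "final", "label")


def classify_field(name):
    lowered = name.lower()
    if any(key in lowered for key in REASONING_KEYS):
        return "reasoning"
    if any(key in lowered for key in ANSWER_KEYS):
        return "answer"
    return "other"


def lint_property_order(properties):
    """Return (reasoning fields declared after an answer field, suggested order)."""
    kinds = [classify_field(p) for p in properties]
    first_answer = next((i for i, k in enumerate(kinds) if k == "answer"), None)
    late = [p for i, (p, k) in enumerate(zip(properties, kinds))
            if k == "reasoning" and first_answer is not None and i > first_answer]
    # Assumed fix: stable sort that moves reasoning fields to the front.
    fixed = sorted(properties, key=lambda p: classify_field(p) != "reasoning")
    return late, fixed


# json.loads keeps the declaration order of object keys.
schema = json.loads('{"properties": {"answer": {"type": "string"}, '
                    '"reasoning": {"type": "string"}}}')
late, fixed = lint_property_order(list(schema["properties"]))
print("declared too late:", late)
print("suggested order:", fixed)
```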
| <p><strong data-i18n="help.v083.peft.title">🔧 PEFT Anti-Pattern Checker</strong></p> | |
| <p data-i18n="help.v083.peft.body">PEFT's <code>get_peft_model(base, config)</code> creates a FRESH adapter — it does not load saved weights from a path. Users who paste tutorial code and try to resume from a checkpoint silently throw away their training. peft #2115 has the canonical bug report. This linter scans your training script for the pattern + 3 related issues (QLoRA ordering, target_modules/arch mismatch, lora_alpha ratio) and reports findings with line numbers and suggested fixes. <em>Use case</em>: before you launch a 10-hour LoRA fine-tune, paste your script — catch the silent bugs in 200ms.</p> | |
| <p><strong data-i18n="help.v084.cache.title">🔁 Prompt-Cache Diff Predictor</strong></p> | |
| <p data-i18n="help.v084.cache.body">Provider prompt caches each have different rules: Anthropic's <code>cache_control</code> breaks at the first token diff in the marked prefix; OpenAI auto-caches prefixes ≥1024 tokens; Gemini context caches require ≥32K tokens. A misplaced edit silently 10x's your bill — the API never warns you, and the cost only shows up on the next invoice. Paste old + new prompt, the predictor finds the longest common prefix, estimates tokens with three tokenizer profiles (English / code / CJK), and shows per-provider hit ratio + $ delta vs no-cache for Claude Opus/Sonnet/Haiku, GPT-5/mini, and Gemini 2.5 Pro. <em>Use case</em>: 'I tweaked the system prompt and the bill jumped — what broke?' → paste both prompts, see exactly which provider stopped caching.</p> | |
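The longest-common-prefix step above can be sketched in Python as follows. The 1024-token minimum comes from the OpenAI rule quoted in the text; the flat four-characters-per-token estimate and the function names are assumptions and stand in for the three tokenizer profiles the tool uses.

```python
def common_prefix_len(old, new):
    """Number of leading characters shared by both prompts."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


def cache_hit_estimate(old, new, chars_per_token=4.0, min_tokens=1024):
    """Estimated fraction of the new prompt served from a prefix cache."""
    # Assumption: a flat characters-per-token ratio, a rough heuristic for English.
    shared_tokens = common_prefix_len(old, new) / chars_per_token
    total_tokens = len(new) / chars_per_token
    # Prefixes shorter than the provider minimum are not cached at all.
    cacheable = shared_tokens if shared_tokens >= min_tokens else 0.0
    return cacheable / total_tokens if total_tokens else 0.0


system = "A" * 8000  # stand-in for a long, stable system prompt
print(cache_hit_estimate(system + "tail one", system + "tail two"))  # edit at the end
print(cache_hit_estimate(system, "X" + system))                      # edit at the start
```

An edit at the very start of the prompt leaves no shared prefix, so the estimate drops to zero even though almost all of the text is unchanged.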
| <p><strong data-i18n="help.v085.speculative.title">🔬 Speculative-Decode Compatibility</strong></p> | |
| <p data-i18n="help.v085.speculative.body">Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches <code>tokenizer.json</code> from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.</p> | |
| <p><strong data-i18n="help.v087.tax.title">🌍 Multilingual Tokenizer Tax</strong></p> | |
| <p data-i18n="help.v087.tax.body">Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.</p> | |
| <p><strong data-i18n="help.v088.longscore.title">🎯 LongScore</strong></p> | |
| <p data-i18n="help.v088.longscore.body">Every long-ctx LLM claims 128K but degrades long before that. The 100-LongBench paper (ACL 2025, arXiv:2505.19293) noticed that raw long-ctx scores are dominated by base ability. They propose <strong>LongScore</strong>: <code>LC_l = (S_l − Base) / Base</code> with <code>Base = mean(S_short)</code>, then average over long lengths. This tafagent mode embeds LongScore-ready data: RULER aggregate per-context (n=33 models, 4K-128K) + HELMET aggregate at 128K (n=60 models, 7 task categories). Lookup is exact-match by HF model id. <em>Use case</em>: 'I want to use Llama-3.1-70B-Instruct for 100K-token doc summarization — how much accuracy do I actually lose?' → paste id, see -10% LongScore (moderate degradation, mostly the 128K cliff).</p> | |
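The LongScore formula above can be written out as a short Python sketch. The split into short and long context lengths and the example scores are hypothetical; they are chosen only to show how a figure like -10% arises.

```python
from statistics import mean


def long_score(scores, short_lengths, long_lengths):
    """Return (LongScore, per-length LC) with LC_l = (S_l - Base) / Base."""
    base = mean(scores[length] for length in short_lengths)  # Base = mean(S_short)
    lc = {length: (scores[length] - base) / base for length in long_lengths}
    return mean(lc.values()), lc


# Hypothetical per-context accuracy (percent), keyed by context length in tokens.
scores = {4096: 90.0, 8192: 90.0, 65536: 85.5, 131072: 76.5}
total, per_length = long_score(scores,
                               short_lengths=[4096, 8192],
                               long_lengths=[65536, 131072])
print(f"LongScore = {total:+.1%}")
for length, value in per_length.items():
    print(f"  {length}: {value:+.1%}")
```

Dividing by the short-context base removes the model's base ability from the comparison, so two models with different raw scores can still be compared on how much they lose at long context.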
| <p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p> | |
| <p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p> | |
| <h3 data-i18n="help.audit.title">The audit chain</h3> | |
| <p data-i18n="help.audit.body">Every result shows the full <strong>Computation Chain</strong> — each formula step with its inputs, | |
| output, and interpretation. Click any step to expand. Cited section numbers (§26.1, §19.1, etc.) refer | |
| to the underlying paper, where each formula is derived.</p> | |
| <h3 data-i18n="help.synthesis.title">The plain-English answer</h3> | |
| <p data-i18n="help.synthesis.body">After the deterministic chain runs, an in-browser LLM (Qwen2.5-0.5B, ~350MB cached after first load) | |
| synthesizes a plain-English summary. The numbers above are <em>always correct</em> (deterministic Python); | |
| the synthesis is LLM-generated — verify against the chain if in doubt.</p> | |
| <h3 data-i18n="help.params.title">Common parameters explained</h3> | |
| <ul> | |
| <li data-i18n="help.param.theta"><strong>θ (rope_theta)</strong>: RoPE base frequency. Higher = more long-range capacity. Typical: 10000 (early), 500000 (Llama-3), 1000000 (Qwen2.5).</li> | |
| <li data-i18n="help.param.T_train"><strong>T_train</strong>: max context the model was trained on. From <code>max_position_embeddings</code>.</li> | |
| <li data-i18n="help.param.T_eval"><strong>T_eval</strong>: <em>your target</em> inference context length. The key knob.</li> | |
| <li data-i18n="help.param.gqa"><strong>n_kv_heads < n_attention_heads</strong>: model uses GQA (Grouped Query Attention). Reduces KV memory but pushes γ toward Hagedorn.</li> | |
| <li data-i18n="help.param.swa"><strong>has_SWA</strong>: model uses Sliding Window Attention (Mistral, gemma-2).</li> | |
| <li data-i18n="help.param.nparams"><strong>n_params</strong>: total parameter count. Threshold ~400M for induction-head emergence.</li> | |
| </ul> | |
| <h3 data-i18n="help.verdicts.title">What to look for in verdicts</h3> | |
| <ul> | |
| <li data-i18n="help.verdict.yes"><strong style="color:#3fb950;">YES / GO</strong> — proceed with confidence; numbers support the choice.</li> | |
| <li data-i18n="help.verdict.deg"><strong style="color:#d29922;">DEGRADED / TINY-MODEL</strong> — works but with caveats; read the action.</li> | |
| <li data-i18n="help.verdict.no"><strong style="color:#f85149;">NO / MEMORY-LIMITED</strong> — don't proceed as-is; mitigation provided.</li> | |
| </ul> | |
| <h3 data-i18n="help.privacy.title">Privacy</h3> | |
| <p data-i18n="help.privacy.body">Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model | |
| runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.</p> | |
| <h3 data-i18n="help.source.title">Source & paper</h3> | |
| <p data-i18n="help.source.body">Source code: <a href="https://github.com/karlesmarin/tafagent" target="_blank">github.com/karlesmarin/tafagent</a><br> | |
| Paper: <em>Marin 2026 — Predicting How Transformers Attend</em> (<a href="https://zenodo.org/records/19826343" target="_blank">Zenodo</a>; arXiv forthcoming)<br> | |
| Dataset: <a href="https://huggingface.co/datasets/karlexmarin/taf-attention-decay" target="_blank">taf-attention-decay</a> — 58 γ-measurements across 32 models (CC-BY-4.0)</p> | |
| </div> | |
| </div> | |
| <!-- Quick start modal (3 steps) --> | |
| <div id="quickstart-modal" role="dialog" aria-modal="true" aria-labelledby="qs-modal-title" aria-hidden="true"> | |
| <div class="help-content"> | |
| <button class="help-close" id="quickstart-close" aria-label="Close quick start">×</button> | |
| <h2 id="qs-modal-title" data-i18n="qs.title">⚡ Quick start</h2> | |
| <ol class="qs-steps"> | |
| <li data-i18n="qs.step1">Paste a HuggingFace model ID (e.g. <code>meta-llama/Meta-Llama-3-8B</code>)</li> | |
| <li data-i18n="qs.step2">Click <strong>📇 Profile a model</strong></li> | |
| <li data-i18n="qs.step3">Read your TAF Card — verdict per use case + key numbers + math verified by Lean+Mathlib</li> | |
| </ol> | |
| <p class="qs-cta"> | |
| <a href="#mode-section" class="btn-primary" id="qs-start-link" data-i18n="qs.cta">↓ Start now</a> | |
| </p> | |
| </div> | |
| </div> | |
| <!-- Inventory modal: what you get --> | |
| <div id="inventory-modal" role="dialog" aria-modal="true" aria-labelledby="inv-modal-title" aria-hidden="true"> | |
| <div class="help-content"> | |
| <button class="help-close" id="inventory-close" aria-label="Close inventory">×</button> | |
| <h2 id="inv-modal-title" data-i18n="inv.title">🧰 What this tool gives you</h2> | |
| <div class="inventory-grid"> | |
| <details class="inv-card" open> | |
| <summary class="inv-card-title" data-i18n="inv.recipes.title">🎯 8 recipes — does this model fit your use case?</summary> | |
| <ul> | |
| <li><strong data-i18n="inv.recipes.x1.title">Custom train vs API</strong>: <span data-i18n="inv.recipes.x1.body">which is cheaper for your traffic?</span></li> | |
| <li><strong data-i18n="inv.recipes.x2.title">Long context</strong>: <span data-i18n="inv.recipes.x2.body">will it handle 32k / 128k tokens reliably?</span></li> | |
| <li><strong data-i18n="inv.recipes.x3.title">Budget</strong>: <span data-i18n="inv.recipes.x3.body">with $X, what model can you train from scratch?</span></li> | |
| <li><strong data-i18n="inv.recipes.x5.title">Hardware</strong>: <span data-i18n="inv.recipes.x5.body">which GPU to serve N tokens/day?</span></li> | |
| <li><strong data-i18n="inv.recipes.x19.title">KV cache</strong>: <span data-i18n="inv.recipes.x19.body">how to compress without breaking quality?</span></li> | |
| <li><strong data-i18n="inv.recipes.x21.title">Imprint purity</strong>: <span data-i18n="inv.recipes.x21.body">how clean is the model's positional encoding?</span></li> | |
| <li><strong data-i18n="inv.recipes.x22.title">Compute-context</strong>: <span data-i18n="inv.recipes.x22.body">does the model fit the empirical band?</span></li> | |
| <li><strong data-i18n="inv.recipes.x23.title">IH-phase</strong>: <span data-i18n="inv.recipes.x23.body">pre- or post-induction-head?</span></li> | |
| </ul> | |
| </details> | |
| <details class="inv-card" open> | |
| <summary class="inv-card-title" data-i18n="inv.diag.title">🔬 Diagnostics</summary> | |
| <ul> | |
| <li data-i18n="inv.diag.gamma"><strong>γ predicted vs observed</strong> — auto-classifies the model into 5 regimes (normal · fraud / inflated context · compressed · over-Padé · sliding-window)</li> | |
| <li data-i18n="inv.diag.cardy"><strong>Cardy ΔH</strong> — entropy shift between observed and nominal context</li> | |
| <li data-i18n="inv.diag.fals"><strong>Falsification dashboard</strong> — checks 23 specific predictions (F1–F23)</li> | |
| <li data-i18n="inv.diag.alg"><strong>Algebraic consistency</strong> — 8 mathematical identities the model must satisfy</li> | |
| </ul> | |
| </details> | |
| <details class="inv-card" open> | |
| <summary class="inv-card-title" data-i18n="inv.verify.title">✓ Formally verified math</summary> | |
| <ul> | |
| <li data-i18n="inv.verify.count"><strong>37 theorems</strong> machine-proven in Lean 4 + Mathlib4</li> | |
| <li data-i18n="inv.verify.click">Click any badge → opens the source line on GitHub</li> | |
| <li data-i18n="inv.verify.reverify">Verify yourself: <code>lake build</code> (≈5 s after cache fetch)</li> | |
| </ul> | |
| </details> | |
| <details class="inv-card" open> | |
| <summary class="inv-card-title" data-i18n="inv.export.title">📤 Export & share</summary> | |
| <ul> | |
| <li data-i18n="inv.export.formats"><strong>JSON · Markdown · LaTeX</strong> (paper-ready)</li> | |
| <li data-i18n="inv.export.share">Reproducible share link (state encoded in URL)</li> | |
| <li data-i18n="inv.export.registry">Submit to community registry on GitHub</li> | |
| </ul> | |
| </details> | |
| <details class="inv-card" open> | |
| <summary class="inv-card-title" data-i18n="inv.v07.title">🆕 v0.7 anti-bullshit pack</summary> | |
| <ul> | |
| <li data-i18n="inv.v07.unmask"><strong>🪟 Unmask</strong> — config.json claims 32k? See if it actually attends that far</li> | |
| <li data-i18n="inv.v07.template"><strong>📜 Chat-template</strong> — exact CLI flag so lm-eval doesn't silently halve your accuracy</li> | |
| <li data-i18n="inv.v07.arena"><strong>🎯 Arena CI</strong> — recover the confidence intervals Chatbot Arena hides</li> | |
| <li data-i18n="inv.v07.contam"><strong>🧪 Contamination</strong> — rate 20+ benchmarks for contamination probability</li> | |
| <li data-i18n="inv.v07.quant"><strong>⚖️ Quant</strong> — predict γ shift + ΔPPL for any (model × quant scheme) combo</li> | |
| <li data-i18n="inv.v07.drift"><strong>🔀 Drift</strong> — bug or noise? Predict max admissible gap between two evals</li> | |
| <li data-i18n="inv.v07.niah"><strong>🔍 NIAH→Reason</strong> — does your "128k context" actually reason there, or just retrieve?</li> | |
| <li data-i18n="inv.v08.saturation"><strong>📈 Saturation</strong> — is your benchmark still useful, or are all frontier models tied at the top?</li> | |
| <li data-i18n="inv.v082.cot"><strong>📋 JSON CoT</strong> — lints structured-output schemas for the answer-before-reasoning anti-pattern that silently breaks Chain-of-Thought.</li> | |
| <li data-i18n="inv.v083.peft"><strong>🔧 PEFT Lint</strong> — catches the silent <code>get_peft_model</code> base-load (peft #2115) + QLoRA order + target_modules / arch mismatch.</li> | |
| <li data-i18n="inv.v084.cache"><strong>🔁 Cache Diff</strong> — predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.</li> | |
| <li data-i18n="inv.v085.speculative"><strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).</li> | |
| <li data-i18n="inv.v087.tax"><strong>🌍 Token Tax</strong> — real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).</li> | |
| <li data-i18n="inv.v088.longscore"><strong>🎯 LongScore</strong> — peer-reviewed degradation metric (100-LongBench, ACL 2025). Look up any model in the RULER + HELMET KBs (n=93). See how much your model actually drops past short context.</li> | |
| <li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li> | |
| </ul> | |
| </details> | |
| </div> | |
| <details class="arch-supported" open> | |
| <summary data-i18n="arch.summary">Architectures supported (click to expand)</summary> | |
| <div class="arch-badges"> | |
| <span class="badge">✓ RoPE-MHA <span class="info"><span class="tooltip" data-i18n="tooltip.mha">Multi-Head Attention: each token position attends through several parallel heads at once.</span></span></span> | |
| <span class="badge">✓ RoPE-GQA <span class="info"><span class="tooltip" data-i18n="tooltip.gqa">Grouped Query Attention: groups of query heads share one key/value head, so there are fewer KV heads than query heads (saves KV memory but pushes γ toward Hagedorn).</span></span></span> | |
| <span class="badge">✓ ALiBi <span class="info"><span class="tooltip" data-i18n="tooltip.alibi">Attention with Linear Biases: position info is a fixed, head-specific linear penalty added to attention scores in proportion to query-key distance, no rotation.</span></span></span> | |
| <span class="badge">✓ AbsPE <span class="info"><span class="tooltip" data-i18n="tooltip.abspe">Absolute Position Embeddings: each position has a fixed learned vector added to the token embedding.</span></span></span> | |
| <span class="badge">✓ SWA <span class="info"><span class="tooltip" data-i18n="tooltip.swa">Sliding Window Attention: each token only attends within a fixed local window (Mistral, gemma-2 use this).</span></span></span> | |
| <span class="badge">✓ SSM (Mamba) <span class="info"><span class="tooltip" data-i18n="tooltip.ssm">State Space Model: a sequence layer that maintains internal state instead of attention (Mamba, Jamba use this).</span></span></span> | |
| <span class="badge" data-i18n="arch.anyhf">✓ Any HuggingFace public model</span> | |
| </div> | |
| </details> | |
| </div> | |
| </div> | |
| <main> | |
| <!-- Status with loading bar --> | |
| <section id="status-bar"> | |
| <div id="status" data-i18n="status.loading_pyodide">⏳ Loading Python runtime...</div> | |
| <div id="loading-bar-wrap" style="display:none;"> | |
| <div id="loading-bar"></div> | |
| </div> | |
| </section> | |
| <!-- v0.7.7 — Task tiles: friendlier entry point, groups 14 modes by user intent. --> | |
| <section id="task-tiles" aria-labelledby="tiles-title"> | |
| <h2 id="tiles-title" data-i18n="tiles.title">🎯 What do you want to do?</h2> | |
| <p class="recipe-desc" data-i18n="tiles.subtitle">Pick a task. Each one opens the right tool below. Or scroll down for the full list of 22 modes.</p> | |
| <div class="tiles-grid"> | |
| <div class="task-tile"> | |
| <h3> | |
| <span data-i18n="tile.diagnose.title">🔬 Diagnose a model</span> | |
| <span class="info"><span class="tooltip" data-i18n="tile.diagnose.tip">Start here when you have a specific model id and want a full diagnostic: <strong>Profile</strong> runs all 5 recipes at once. <strong>Unmask</strong> checks if max_position_embeddings is honest. <strong>NIAH→Reason</strong> predicts retrieval-vs-reasoning gap. <strong>LongScore</strong> looks up published RULER + HELMET data for the model and shows real degradation past short context (peer-reviewed metric). <strong>Quant</strong> predicts whether quantizing will break it. <strong>Inspect</strong> lets you paste raw config.json for private/in-dev models.</span></span> | |
| </h3> | |
| <p class="tile-desc" data-i18n="tile.diagnose.desc">Will this specific model work for my use case?</p> | |
| <div class="tile-modes"> | |
| <button data-mode-link="profile" data-i18n="modes.profile">📇 Profile a model</button> | |
| <button data-mode-link="unmask" data-i18n="modes.unmask">🪟 Unmask</button> | |
| <button data-mode-link="niah" data-i18n="modes.niah">🔍 NIAH→Reason</button> | |
| <button data-mode-link="longscore" data-i18n="modes.longscore">🎯 LongScore</button> | |
| <button data-mode-link="quant" data-i18n="modes.quant">⚖️ Quant</button> | |
| <button data-mode-link="inspector" data-i18n="modes.inspector">🔍 Inspect config</button> | |
| </div> | |
| </div> | |
| <div class="task-tile"> | |
| <h3> | |
| <span data-i18n="tile.trust.title">✓ Trust a benchmark score</span> | |
| <span class="info"><span class="tooltip" data-i18n="tile.trust.tip">When you see a score and want to know if it's real. <strong>Contamination</strong> rates 20+ benchmarks for likelihood the model saw them during training. <strong>Drift</strong> tells you if a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). <strong>Arena CI</strong> reconstructs the confidence intervals Chatbot Arena hides — many top-Elo "wins" are statistically tied.</span></span> | |
| </h3> | |
| <p class="tile-desc" data-i18n="tile.trust.desc">Should I believe this number? Bug or noise?</p> | |
| <div class="tile-modes"> | |
| <button data-mode-link="contam" data-i18n="modes.contam">🧪 Contamination</button> | |
| <button data-mode-link="drift" data-i18n="modes.drift">🔀 Drift</button> | |
| <button data-mode-link="arena" data-i18n="modes.arena">🎯 Arena CI</button> | |
| <button data-mode-link="saturation" data-i18n="modes.saturation">📈 Saturation</button> | |
| <button data-mode-link="hub" data-i18n="modes.hub">🧭 Solutions</button> | |
| </div> | |
| </div> | |
| <div class="task-tile"> | |
| <h3> | |
| <span data-i18n="tile.eval.title">⚙️ Set up an eval correctly</span> | |
| <span class="info"><span class="tooltip" data-i18n="tile.eval.tip">Before you run lm-eval-harness or vLLM serve, get the right CLI flag. <strong>Chat-template Sniffer</strong> detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact <code>--apply_chat_template</code> / <code>--chat-template</code> invocation. Solves issue #1841 in lm-eval-harness (silent ÷2 accuracy). <strong>Diagnose CLI</strong> generates the Python command to measure γ_obs on your local GPU.</span></span> | |
| </h3> | |
| <p class="tile-desc" data-i18n="tile.eval.desc">Get the exact CLI flag for lm-eval / vLLM / transformers.</p> | |
| <div class="tile-modes"> | |
| <button data-mode-link="template" data-i18n="modes.template">📜 Chat-template</button> | |
| <button data-mode-link="diagnose" data-i18n="modes.diagnose">🩺 Diagnose CLI</button> | |
| <button data-mode-link="cot" data-i18n="modes.cot">📋 JSON CoT</button> | |
| <button data-mode-link="peft" data-i18n="modes.peft">🔧 PEFT Lint</button> | |
| <button data-mode-link="cache" data-i18n="modes.cache">🔁 Cache Diff</button> | |
| <button data-mode-link="speculative" data-i18n="modes.speculative">🔬 Spec-Decode</button> | |
| <button data-mode-link="tax" data-i18n="modes.tax">🌍 Token Tax</button> | |
| </div> | |
| </div> | |
| <div class="task-tile"> | |
| <h3> | |
| <span data-i18n="tile.compare.title">🆚 Compare models</span> | |
| <span class="info"><span class="tooltip" data-i18n="tile.compare.tip"><strong>Compare</strong>: pick 2-3 candidate models + one recipe, see verdicts in a side-by-side table (e.g. Llama-3-8B vs Mistral-7B at 32k context). <strong>Phase diagram</strong>: scatter of 23 empirical models on the (log θ, γ) plane, with the Padé curve overlaid. Hover dots for details, click to load that model into the Recipe form.</span></span> | |
| </h3> | |
| <p class="tile-desc" data-i18n="tile.compare.desc">Side-by-side, or browse the empirical model landscape.</p> | |
| <div class="tile-modes"> | |
| <button data-mode-link="compare" data-i18n="modes.compare">🆚 Compare models</button> | |
| <button data-mode-link="phase" data-i18n="modes.phase">📊 Phase diagram</button> | |
| </div> | |
| </div> | |
| <div class="task-tile"> | |
| <h3> | |
| <span data-i18n="tile.manual.title">📋 Manual / free-form</span> | |
| <span class="info"><span class="tooltip" data-i18n="tile.manual.tip"><strong>Recipe</strong>: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. <strong>Ask</strong>: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for "what would happen if..." exploration.</span></span> | |
| </h3> | |
| <p class="tile-desc" data-i18n="tile.manual.desc">Pick a specific recipe by hand, or ask in plain English.</p> | |
| <div class="tile-modes"> | |
| <button data-mode-link="recipe" data-i18n="modes.recipe">📋 Pick recipe</button> | |
| <button data-mode-link="ask" data-i18n="modes.ask">💬 Ask plain English</button> | |
| </div> | |
| </div> | |
| </div> | |
| </section> | |
| <!-- Mode toggle --> | |
| <section id="mode-section"> | |
| <h2><span data-i18n="modes.title">🎯 Mode</span> | |
| <span class="info"><span class="tooltip" data-i18n="modes.tip"><strong>22 modes available; the first 7 tabs are described here.</strong> Most users want 📇 Profile (one-click full diagnosis).<br> | |
| <strong>📇 Profile</strong>: paste a model id → 5-recipe TAF Card.<br> | |
| <strong>🆚 Compare</strong>: 2-3 models side-by-side on one recipe.<br> | |
| <strong>🔍 Inspect</strong>: paste raw config.json to debug parameters.<br> | |
| <strong>💬 Ask</strong>: free-form question, browser LLM picks the recipe.<br> | |
| <strong>📋 Recipe</strong>: manual selection with full form control.<br> | |
| <strong>🩺 Diagnose CLI</strong>: generate Python command to measure γ on real weights.<br> | |
| <strong>📊 Phase diagram</strong>: explore 23 panel models on the (log θ, γ) plane. | |
| </span></span> | |
| </h2> | |
| <div class="mode-tabs" role="tablist" aria-label="View modes"> | |
| <button class="mode-btn active" data-mode="profile" role="tab" aria-selected="true" data-i18n="modes.profile">📇 Profile a model</button> | |
| <button class="mode-btn" data-mode="compare" role="tab" aria-selected="false" data-i18n="modes.compare">🆚 Compare models</button> | |
| <button class="mode-btn" data-mode="inspector" role="tab" aria-selected="false" data-i18n="modes.inspector">🔍 Inspect config</button> | |
| <button class="mode-btn" data-mode="ask" role="tab" aria-selected="false" data-i18n="modes.ask">💬 Ask plain English</button> | |
| <button class="mode-btn" data-mode="recipe" role="tab" aria-selected="false" data-i18n="modes.recipe">📋 Pick recipe</button> | |
| <button class="mode-btn" data-mode="diagnose" role="tab" aria-selected="false" data-i18n="modes.diagnose">🩺 Diagnose CLI</button> | |
| <button class="mode-btn" data-mode="phase" role="tab" aria-selected="false" data-i18n="modes.phase">📊 Phase diagram</button> | |
| <button class="mode-btn" data-mode="unmask" role="tab" aria-selected="false" data-i18n="modes.unmask">🪟 Unmask</button> | |
| <button class="mode-btn" data-mode="template" role="tab" aria-selected="false" data-i18n="modes.template">📜 Chat-template</button> | |
| <button class="mode-btn" data-mode="arena" role="tab" aria-selected="false" data-i18n="modes.arena">🎯 Arena CI</button> | |
| <button class="mode-btn" data-mode="contam" role="tab" aria-selected="false" data-i18n="modes.contam">🧪 Contamination</button> | |
| <button class="mode-btn" data-mode="quant" role="tab" aria-selected="false" data-i18n="modes.quant">⚖️ Quant</button> | |
| <button class="mode-btn" data-mode="drift" role="tab" aria-selected="false" data-i18n="modes.drift">🔀 Drift</button> | |
| <button class="mode-btn" data-mode="niah" role="tab" aria-selected="false" data-i18n="modes.niah">🔍 NIAH→Reason</button> | |
| <button class="mode-btn" data-mode="saturation" role="tab" aria-selected="false" data-i18n="modes.saturation">📈 Saturation</button> | |
| <button class="mode-btn" data-mode="cot" role="tab" aria-selected="false" data-i18n="modes.cot">📋 JSON CoT</button> | |
| <button class="mode-btn" data-mode="peft" role="tab" aria-selected="false" data-i18n="modes.peft">🔧 PEFT Lint</button> | |
| <button class="mode-btn" data-mode="cache" role="tab" aria-selected="false" data-i18n="modes.cache">🔁 Cache Diff</button> | |
| <button class="mode-btn" data-mode="speculative" role="tab" aria-selected="false" data-i18n="modes.speculative">🔬 Spec-Decode</button> | |
| <button class="mode-btn" data-mode="tax" role="tab" aria-selected="false" data-i18n="modes.tax">🌍 Token Tax</button> | |
| <button class="mode-btn" data-mode="longscore" role="tab" aria-selected="false" data-i18n="modes.longscore">🎯 LongScore</button> | |
| <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button> | |
| </div> | |
| <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc"> | |
| <strong>Quickest start</strong>: paste any HuggingFace model id (e.g. <code>meta-llama/Meta-Llama-3-8B</code>), | |
| click Profile. See all 5 recipes scored in seconds. | |
| </p> | |
| </section> | |
| <!-- PROFILE mode --> | |
| <section id="profile-section"> | |
| <div class="quickstart-banner" data-i18n="profile.quickstart"> | |
| 💡 Quick start: pick any preset → click Generate. Or paste a model id from <a href='https://huggingface.co/models?library=transformers&sort=trending' target='_blank'>HF Hub trending</a> → 📥 Fetch → Generate. | |
| </div> | |
| <h2><span data-i18n="profile.title">📇 Profile a model</span> | |
| <span class="info"><span class="tooltip" data-i18n="profile.tip"> | |
| <strong>One-click full diagnosis</strong>. Paste any HF model id (or pick a preset). | |
| The tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, | |
| hardware) and produces a single <strong>TAF Card</strong> showing verdict per | |
| dimension + key numbers + architecture classification.<br><br> | |
| <strong>Use case</strong>: "I'm evaluating Qwen2.5-32B for production — | |
| what's its full viability profile?" → paste id → Profile → done. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="profile.desc"> | |
| <strong>For technicians</strong>: when you need a complete viability snapshot | |
| of a candidate model. Outputs follow the format of the paper's γ-decomposition section. | |
| </p> | |
| <div class="form-row"> | |
| <label for="profile-preset" data-i18n="profile.preset_label">Preset:</label> | |
| <select id="profile-preset" disabled> | |
| <option value="" data-i18n="profile.preset_default">— or pick from list —</option> | |
| </select> | |
| </div> | |
| <div class="form-row"> | |
| <label for="profile-hf-id" data-i18n="profile.hf_label">HF model id:</label> | |
| <input type="text" id="profile-hf-id" | |
| data-i18n-placeholder="profile.hf_placeholder" | |
| placeholder="e.g. meta-llama/Meta-Llama-3-8B or Qwen/Qwen2.5-7B" style="flex:1;" /> | |
| <button id="profile-fetch-btn" type="button" class="secondary" data-i18n="profile.fetch_btn">📥 Fetch</button> | |
| </div> | |
| <div id="profile-hf-status" class="subtle" style="margin: -0.5rem 0 1rem; min-height:1.2em;"></div> | |
| <div class="form-grid" id="profile-form"> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.theta">θ (rope_theta)</span> <span class="info"><span class="tooltip" data-i18n="param.theta.tip">RoPE base frequency from <code>config.rope_theta</code>.</span></span></label> | |
| <input type="number" id="profile-theta" value="500000" /> | |
| </div> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.T_train">T_train</span> <span class="info"><span class="tooltip" data-i18n="param.T_train.tip">Max training context. From <code>max_position_embeddings</code>.</span></span></label> | |
| <input type="number" id="profile-T_train" value="8192" /> | |
| </div> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.T_eval">T_eval (your target)</span> <span class="info"><span class="tooltip" data-i18n="param.T_eval.tip">Inference context length you'll actually serve. The key knob.</span></span></label> | |
| <input type="number" id="profile-T_eval" value="32000" /> | |
| </div> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.n_attn">n_attention_heads</span> <span class="info"><span class="tooltip" data-i18n="param.n_attn.tip">Number of attention heads per layer. From <code>num_attention_heads</code>.</span></span></label> | |
| <input type="number" id="profile-n_attn" value="32" /> | |
| </div> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.n_kv">n_kv_heads</span> <span class="info"><span class="tooltip" data-i18n="param.n_kv.tip">Number of key/value heads. From <code>num_key_value_heads</code>. If fewer than n_attention_heads, the model uses GQA (Grouped Query Attention), which reduces KV memory but pushes γ toward Hagedorn.</span></span></label> | |
| <input type="number" id="profile-n_kv" value="8" /> | |
| </div> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.d_head">head_dim</span> <span class="info"><span class="tooltip" data-i18n="param.d_head.tip">Per-head dimension. Typical values: 64, 96, 128. From <code>head_dim</code> or <code>hidden_size / num_attention_heads</code>.</span></span></label> | |
| <input type="number" id="profile-d_head" value="128" /> | |
| </div> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.n_layers">n_layers</span> <span class="info"><span class="tooltip" data-i18n="param.n_layers.tip">Number of transformer blocks. From <code>num_hidden_layers</code>.</span></span></label> | |
| <input type="number" id="profile-n_layers" value="32" /> | |
| </div> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.n_params">n_params (e.g. 8e9)</span> <span class="info"><span class="tooltip" data-i18n="param.n_params.tip">Total parameter count. Threshold ~400M for induction-head emergence. Affects KV memory and budget recipes.</span></span></label> | |
| <input type="text" id="profile-n_params" value="8e9" /> | |
| </div> | |
| <div class="form-field"> | |
| <label><span data-i18n="param.has_swa">Has SWA?</span> <span class="info"><span class="tooltip" data-i18n="param.has_swa.tip">Sliding Window Attention. <code>true</code> for Mistral, gemma-2, phi-3. Calibration audit (v0.5.3) disabled the historical δ_SWA correction (n=1 fit).</span></span></label> | |
| <select id="profile-has_swa"> | |
| <option value="false" selected data-i18n="common.no">No</option> | |
| <option value="true" data-i18n="common.yes">Yes</option> | |
| </select> | |
| </div> | |
| </div> | |
| <button id="profile-btn" disabled data-i18n="profile.btn">🚀 Generate full profile</button> | |
| </section> | |
| <!-- INSPECTOR mode (paste config.json directly) --> | |
| <section id="inspector-section" style="display:none;"> | |
| <div class="quickstart-banner" data-i18n="inspector.quickstart"> | |
| 💡 Use case: you have a private model not on HF Hub, or a config you're designing. Paste the raw JSON below and get a full TAF profile. | |
| </div> | |
| <h2><span data-i18n="inspector.title">🔍 Architecture Inspector</span> | |
| <span class="info"><span class="tooltip" data-i18n="inspector.tip"> | |
| <strong>Paste any config.json directly</strong>. The tool parses it and runs the full Profile. | |
| Useful for: private models, in-development configs, models not yet on HuggingFace, | |
| or exploring how a custom architecture would behave. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="inspector.desc"> | |
| Paste the raw <code>config.json</code> contents. The tool extracts the architectural | |
| parameters and runs the full 5-recipe Profile. | |
| </p> | |
| <textarea id="inspector-json" rows="12" | |
| data-i18n-placeholder="inspector.placeholder" | |
| placeholder='{ | |
| "model_type": "llama", | |
| "rope_theta": 500000, | |
| "max_position_embeddings": 8192, | |
| "num_attention_heads": 32, | |
| "num_key_value_heads": 8, | |
| "hidden_size": 4096, | |
| "num_hidden_layers": 32, | |
| "vocab_size": 128256 | |
| }'></textarea> | |
| <div class="form-row" style="margin-top:0.5rem;"> | |
| <label for="inspector-T_eval" data-i18n="inspector.T_eval">T_eval (your target context):</label> | |
| <input type="number" id="inspector-T_eval" value="32000" /> | |
| </div> | |
| <button id="inspector-btn" disabled data-i18n="inspector.btn">🚀 Inspect & profile</button> | |
| <span id="inspector-status" class="subtle" style="margin-left:0.75rem;"></span> | |
| </section> | |
| <!-- COMPARE mode --> | |
| <section id="compare-section" style="display:none;"> | |
| <div class="quickstart-banner" data-i18n="compare.example"> | |
| 💡 Try: paste 3 popular 7-8B models (meta-llama/Meta-Llama-3-8B, mistralai/Mistral-7B-v0.1, Qwen/Qwen2.5-7B), pick recipe X-2, T_eval=16000. See which best handles long context. | |
| </div> | |
| <h2><span data-i18n="compare.title">🆚 Compare models side-by-side</span> | |
| <span class="info"><span class="tooltip" data-i18n="compare.tip"> | |
| <strong>Same recipe, multiple models</strong>. Pick 2-3 candidate models and | |
| one recipe. See verdicts in a single comparison table.<br><br> | |
| <strong>Use case</strong>: "I need long-context retrieval at 16K — which is | |
| best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see winner. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="compare.desc"> | |
| <strong>For technicians</strong>: when choosing between 2-3 candidate models for | |
| a specific deployment scenario. Compare their verdicts on the same recipe. | |
| </p> | |
| <div class="form-row"> | |
| <label for="compare-recipe" data-i18n="compare.recipe_label">Recipe:</label> | |
| <select id="compare-recipe" disabled> | |
| <option value="" data-i18n="recipe.default">— pick a recipe —</option> | |
| </select> | |
| </div> | |
| <div class="form-row"> | |
| <label for="compare-T_eval" data-i18n="compare.T_eval_label">T_eval (target context):</label> | |
| <input type="number" id="compare-T_eval" value="16000" style="flex:1;" /> | |
| <span class="info" style="margin-top:0.5rem;"><span class="tooltip"> | |
| For X-2 / X-19 only. The context length all compared models will be | |
| evaluated at. Other recipes use their own params. | |
| </span></span> | |
| </div> | |
| <div id="compare-models"> | |
| <h3 style="margin-top:1rem;" data-i18n="compare.models_title">Models to compare (add up to 3)</h3> | |
| <div class="compare-slot" data-slot="1"> | |
| <input type="text" class="compare-hf-id" | |
| data-i18n-placeholder="compare.slot1_placeholder" | |
| placeholder="HF model id (e.g. meta-llama/Meta-Llama-3-8B)" /> | |
| <select class="compare-preset"> | |
| <option value="" data-i18n="compare.preset_default">— or preset —</option> | |
| </select> | |
| </div> | |
| <div class="compare-slot" data-slot="2"> | |
| <input type="text" class="compare-hf-id" | |
| data-i18n-placeholder="compare.slot2_placeholder" | |
| placeholder="HF model id #2" /> | |
| <select class="compare-preset"> | |
| <option value="" data-i18n="compare.preset_default">— or preset —</option> | |
| </select> | |
| </div> | |
| <div class="compare-slot" data-slot="3"> | |
| <input type="text" class="compare-hf-id" | |
| data-i18n-placeholder="compare.slot3_placeholder" | |
| placeholder="HF model id #3 (optional)" /> | |
| <select class="compare-preset"> | |
| <option value="" data-i18n="compare.preset_default">— or preset —</option> | |
| </select> | |
| </div> | |
| </div> | |
| <button id="compare-btn" disabled style="margin-top:1rem;" data-i18n="compare.btn">🚀 Compare</button> | |
| </section> | |
| <!-- ASK mode (free-form question) --> | |
| <section id="ask-section" style="display:none;"> | |
| <h2 data-i18n="ask.title">❓ Your question</h2> | |
| <textarea id="question" rows="3" | |
| data-i18n-placeholder="ask.placeholder" | |
| placeholder="e.g. Will Mistral-7B handle 16K NIAH retrieval? Or: I have $5,000, what model can I train? Or: Cheapest GPU to serve Llama-70B at 100M tokens/day?"></textarea> | |
| <div style="display:flex; gap:0.5rem; margin-top:0.5rem; flex-wrap:wrap;"> | |
| <button id="ask-btn" disabled data-i18n="ask.btn">🚀 Analyze</button> | |
| <button id="example-btn" type="button" class="secondary" data-i18n="ask.example_btn">💡 Try an example</button> | |
| </div> | |
| </section> | |
| <!-- Diagnose mode: build the CLI command for diagnose_model.py --> | |
| <section id="diagnose-section" style="display:none;"> | |
| <h2><span data-i18n="diagnose.title">🩺 Diagnose CLI Command Builder</span> | |
| <span class="info"><span class="tooltip" data-i18n="diagnose.tip"> | |
| <strong>Measure γ_obs (not predict)</strong>. The browser tool predicts γ from | |
| config alone (Padé). To <em>measure</em> the actual decay on a real model | |
| you need Python and the model weights on your own machine (a GPU for full mode; <code>--fast</code> runs on CPU). This builder produces the exact CLI command you | |
| run locally; the script is shipped in this repository at | |
| <code>cli/diagnose_model.py</code>.<br><br> | |
| <strong>Output</strong>: γ_obs, R², phase, KV cache budget D_90, KL anomaly, | |
| full thermodynamic profile (Z, U, S, F, C_V, χ). Saved as JSON. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="diagnose.desc"> | |
| Pick options below, then run the generated command on your local | |
| machine (Python + torch + transformers + numpy). Total wall time ≈ 5 min in | |
| <code>--fast</code> mode on CPU; full mode 20–60 min on GPU. | |
| </p> | |
| <div class="form-row"> | |
| <label for="diag-model" data-i18n="diagnose.model_label">HF model id:</label> | |
| <input type="text" id="diag-model" placeholder="EleutherAI/pythia-70m" value="EleutherAI/pythia-70m"> | |
| </div> | |
| <div class="form-row"> | |
| <label for="diag-theta" data-i18n="diagnose.theta_label">θ (auto if blank):</label> | |
| <input type="number" id="diag-theta" placeholder="auto-detect"> | |
| </div> | |
| <div class="form-row"> | |
| <label for="diag-N" data-i18n="diagnose.n_label">Context N:</label> | |
| <input type="number" id="diag-N" value="2000" min="100" max="32000"> | |
| </div> | |
| <div class="form-row"> | |
| <label data-i18n="diagnose.options_label">Options:</label> | |
| <span> | |
| <label><input type="checkbox" id="diag-fast" checked> | |
| <span data-i18n="diagnose.opt_fast">--fast (CPU, ~5 min)</span></label><br> | |
| <label><input type="checkbox" id="diag-cpu"> | |
| <span data-i18n="diagnose.opt_cpu">--cpu (force CPU)</span></label><br> | |
| <label><input type="checkbox" id="diag-4bit"> | |
| <span data-i18n="diagnose.opt_4bit">--load_in_4bit (≥7B models)</span></label> | |
| </span> | |
| </div> | |
| <div class="form-row"> | |
| <label for="diag-local" data-i18n="diagnose.local_label">--local path (optional):</label> | |
| <input type="text" id="diag-local" placeholder="/path/to/local/weights"> | |
| </div> | |
| <button id="diag-build-btn" data-i18n="diagnose.build_btn">📋 Build command</button> | |
| <div id="diag-output" style="display:none; margin-top:1em;"> | |
| <h3 data-i18n="diagnose.cmd_title">Generated command:</h3> | |
| <pre id="diag-cmd" class="diag-cmd-box"></pre> | |
| <button id="diag-copy-btn" data-i18n="diagnose.copy_btn">📋 Copy to clipboard</button> | |
| <p class="recipe-desc" data-i18n="diagnose.next_steps"> | |
| <strong>Next steps</strong>: | |
| (1) <code>git clone https://github.com/karlesmarin/tafagent</code> | |
| (2) <code>cd tafagent && pip install torch transformers numpy</code> | |
| (3) Run the command above. | |
| (4) Result JSON lands in <code>./diagnose_results/</code> — upload it | |
| to the <strong>📋 Pick recipe</strong> mode (or paste in <strong>🔍 Inspect config</strong>) for full TAF analysis. | |
| </p> | |
| </div> | |
| </section> | |
| <!-- Phase diagram mode: live scatter of measured γ vs θ --> | |
| <section id="phase-section" style="display:none;"> | |
| <h2><span data-i18n="phase.title">📊 Phase diagram (γ × θ)</span> | |
| <span class="info"><span class="tooltip" data-i18n="phase.tip"> | |
| Each dot is one model from the paper's empirical panel | |
| (data/master_gamma_results.json). The x-axis is RoPE base θ | |
| on log scale; y-axis is measured γ. | |
| The Hagedorn line γ=1 separates Phase A (γ&lt;1, global) from | |
| Phase B (γ&gt;1, local-collapsed). | |
| Hover dots for details; click to populate the recipe form. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="phase.desc"> | |
| 23 models in the panel; the Padé curve (line) is | |
| γ_pred(θ) = (2θ−T√2)/(2θ+T√2) at T=2000. | |
| </p> | |
| <canvas id="phase-canvas" width="900" height="500" style="max-width:100%; background: var(--card-bg); border-radius: 6px;"></canvas> | |
| <div id="phase-info" class="recipe-desc" style="margin-top:0.6em;"></div> | |
| </section> | |
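<!-- Developer sketch (illustrative only, not the shipped implementation).
The Padé curve quoted in the phase-diagram description can be evaluated directly.
The helper names gammaPred and phaseOf are hypothetical; only the formula and the
default T=2000 come from the text above.

```javascript
// gamma_pred(theta) = (2*theta - T*sqrt(2)) / (2*theta + T*sqrt(2))
// theta: RoPE base, T: context length used for the curve (2000 in the description).
function gammaPred(theta, T = 2000) {
  const s = T * Math.SQRT2;
  return (2 * theta - s) / (2 * theta + s);
}

// The Hagedorn line gamma = 1 separates Phase A (below) from Phase B (above).
function phaseOf(gamma) {
  if (gamma < 1) return "A";
  if (gamma > 1) return "B";
  return "boundary";
}
```

For positive theta and T the predicted value stays below 1, so points above the
Hagedorn line can only come from measured gamma, not from this curve.
-->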
| <!-- Unmask mode: detect misleading max_position_embeddings via SWA / RoPE-scaling --> | |
| <section id="unmask-section" style="display:none;"> | |
| <h2><span data-i18n="unmask.title">🪟 Context Unmasker</span> | |
| <span class="info"><span class="tooltip" data-i18n="unmask.tip"> | |
| Paste a HuggingFace model id (or raw config.json). The tool checks for | |
| sliding-window attention, RoPE scaling (YaRN/linear/dynamic NTK), and | |
| GQA — anything that makes <code>max_position_embeddings</code> larger | |
| than the practical effective context. Mistral-7B-v0.1 is the canonical | |
| example: declared 32k, attends within ~4-8k. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="unmask.desc"> | |
| <strong>Are you about to spend money on a model that won't actually attend that far?</strong> Paste an id and find out in 1 second. No GPU, no inference — just config.json arithmetic. | |
| </p> | |
| <div class="form-row"> | |
| <label for="unmask-id" data-i18n="unmask.id_label">HF model id:</label> | |
| <input type="text" id="unmask-id" placeholder="e.g. mistralai/Mistral-7B-v0.1" /> | |
| <button type="button" id="unmask-fetch-btn" data-i18n="unmask.fetch_btn">🔍 Unmask</button> | |
| </div> | |
| <p id="unmask-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <details style="margin: 0.6em 0;"> | |
| <summary style="cursor:pointer; font-size:0.92em;" data-i18n="unmask.paste_summary">Or paste raw config.json (private / in-dev models)</summary> | |
| <textarea id="unmask-paste" rows="6" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder='{"max_position_embeddings": 32768, "sliding_window": 4096, ...}'></textarea> | |
| <button type="button" id="unmask-paste-btn" data-i18n="unmask.paste_btn" style="margin-top:0.4em;">🔍 Unmask pasted config</button> | |
| </details> | |
| <div id="unmask-output" style="margin-top: 1em;"></div> | |
| </section> | |
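<!-- Developer sketch (illustrative only, not the shipped implementation).
A minimal version of the config.json arithmetic described above, assuming the
standard Hugging Face keys max_position_embeddings, sliding_window and rope_scaling.
Treating the sliding window as the effective context is a simplifying assumption;
unmaskContext is a hypothetical name.

```javascript
// Flags configs whose declared context is larger than the attention window.
function unmaskContext(config) {
  const declared = config.max_position_embeddings ?? null;
  const swa = config.sliding_window ?? null;
  const ropeScaling = config.rope_scaling ?? null;
  const flags = [];
  let effective = declared;
  // Assumption: a sliding window smaller than the declared length caps attention.
  if (declared !== null && swa !== null && swa < declared) {
    flags.push("sliding_window");
    effective = swa;
  }
  // RoPE scaling is only flagged here, not converted into a number.
  if (ropeScaling !== null) flags.push("rope_scaling");
  return { declared, effective, flags };
}
```

Called with the values from the paste-box placeholder (32768 declared, 4096 window),
this reports an effective context of 4096 and the sliding_window flag.
-->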
| <!-- Chat-template sniffer mode (v0.7.1 anti-bullshit pack #2) --> | |
| <section id="template-section" style="display:none;"> | |
| <h2><span data-i18n="template.title">📜 Chat-template Sniffer</span> | |
| <span class="info"><span class="tooltip" data-i18n="template.tip"> | |
| Paste an HF model id (or raw tokenizer_config.json). Detects the | |
| chat-template family (Llama-3, ChatML, Mistral, Gemma, Phi-3, | |
| Alpaca, DeepSeek, custom) and gives you the exact framework command | |
| to use it correctly. lm-eval-harness silently halves accuracy if you | |
| forget to apply it (issue #1841). | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="template.desc"> | |
| <strong>Did you forget <code>--apply_chat_template</code>?</strong> Multi-turn eval accuracy can drop by roughly half when the chat template isn't applied. Paste a model id, get the exact CLI flag for your stack. | |
| </p> | |
| <div class="form-row"> | |
| <label for="template-id" data-i18n="template.id_label">HF model id:</label> | |
| <input type="text" id="template-id" placeholder="e.g. mistralai/Mistral-7B-Instruct-v0.3" /> | |
| <button type="button" id="template-fetch-btn" data-i18n="template.fetch_btn">📜 Sniff</button> | |
| </div> | |
| <p id="template-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <details style="margin: 0.6em 0;"> | |
| <summary style="cursor:pointer; font-size:0.92em;" data-i18n="template.paste_summary">Or paste raw tokenizer_config.json (private models)</summary> | |
| <textarea id="template-paste" rows="6" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder='{"chat_template": "...", ...}'></textarea> | |
| <button type="button" id="template-paste-btn" data-i18n="template.paste_btn" style="margin-top:0.4em;">📜 Sniff pasted config</button> | |
| </details> | |
| <div id="template-output" style="margin-top: 1em;"></div> | |
| </section> | |
| <!-- Arena-Elo CI reconstructor (v0.7.2 anti-bullshit pack #3) --> | |
| <section id="arena-section" style="display:none;"> | |
| <h2><span data-i18n="arena.title">🎯 Arena-Elo CI Reconstructor</span> | |
| <span class="info"><span class="tooltip" data-i18n="arena.tip"> | |
| Chatbot Arena strips confidence intervals from the public leaderboard. | |
| A 5-Elo gap can be statistically meaningless. Paste raw vote data | |
| (model_a, model_b, winner) — the tool computes Bradley-Terry MLE + | |
| bootstrap CIs and lists statistical ties (CI overlap). | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="arena.desc"> | |
| <strong>Is GPT-4 actually better than Claude — or are they tied?</strong> Paste pairwise vote CSV (or click <em>Load sample</em>). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser. | |
| </p> | |
| <div class="form-row"> | |
| <button type="button" id="arena-sample-btn" data-i18n="arena.sample_btn">📊 Load sample data</button> | |
| <button type="button" id="arena-run-btn" data-i18n="arena.run_btn">🎯 Compute CIs</button> | |
| <button type="button" id="arena-clear-btn" class="secondary" data-i18n="arena.clear_btn">🗑️ Clear</button> | |
| </div> | |
| <p id="arena-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <details style="margin: 0.6em 0;" open> | |
| <summary style="cursor:pointer; font-size:0.92em;" data-i18n="arena.csv_summary">Vote CSV (header: <code>model_a,model_b,winner</code>; winner ∈ a/b/tie)</summary> | |
| <textarea id="arena-csv" rows="10" style="width:100%; font-family:monospace; font-size:0.85em; margin-top:0.4em;" placeholder="model_a,model_b,winner&#10;GPT-4,Claude,a&#10;Llama-3,Mixtral,tie&#10;..."></textarea> | |
| </details> | |
| <div id="arena-output" style="margin-top: 1em;"></div> | |
| </section> | |
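<!-- Developer sketch (illustrative only, not the shipped implementation).
One common way to fit Bradley-Terry strengths is the minorization-maximization (MM)
update shown below. Ties are counted as half a win for each side, which is an
assumption, as is the function name bradleyTerryElo. The sketch covers only the MLE
step; bootstrap CIs would resample the votes with replacement and refit.

```javascript
// votes: array of { a, b, winner } with winner in "a" | "b" | "tie".
// Returns a Map model -> relative Elo (only differences are meaningful).
// Assumes every model has at least one win or tie; otherwise its Elo is -Infinity.
function bradleyTerryElo(votes, iters = 200) {
  const models = [...new Set(votes.flatMap(v => [v.a, v.b]))];
  const wins = new Map(models.map(m => [m, 0]));
  for (const v of votes) {
    if (v.winner === "a") wins.set(v.a, wins.get(v.a) + 1);
    else if (v.winner === "b") wins.set(v.b, wins.get(v.b) + 1);
    else {
      wins.set(v.a, wins.get(v.a) + 0.5);
      wins.set(v.b, wins.get(v.b) + 0.5);
    }
  }
  let p = new Map(models.map(m => [m, 1]));
  for (let it = 0; it < iters; it++) {
    // MM update: p_i = W_i / sum over games of 1 / (p_i + p_opponent)
    const denom = new Map(models.map(m => [m, 0]));
    for (const v of votes) {
      const d = 1 / (p.get(v.a) + p.get(v.b));
      denom.set(v.a, denom.get(v.a) + d);
      denom.set(v.b, denom.get(v.b) + d);
    }
    const next = new Map(models.map(m => [m, wins.get(m) / denom.get(m)]));
    // Normalize so the strengths keep a fixed scale between iterations.
    const total = [...next.values()].reduce((s, x) => s + x, 0);
    for (const m of models) next.set(m, (next.get(m) * models.length) / total);
    p = next;
  }
  return new Map(models.map(m => [m, 400 * Math.log10(p.get(m))]));
}
```

With two models where one wins three of four votes, the fitted strength ratio is 3,
so the Elo gap is 400 * log10(3).
-->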
| <!-- Contamination prior (v0.7.3 anti-bullshit pack #4) --> | |
| <section id="contam-section" style="display:none;"> | |
| <h2><span data-i18n="contam.title">🧪 Contamination Prior</span> | |
| <span class="info"><span class="tooltip" data-i18n="contam.tip"> | |
| Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="contam.desc"> | |
| <strong>Should you trust your model's MMLU score?</strong> Enter the model's training cutoff date — the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated. | |
| </p> | |
| <div class="form-row"> | |
| <label for="contam-cutoff" data-i18n="contam.cutoff_label">Training cutoff:</label> | |
| <input type="text" id="contam-cutoff" placeholder="2023-12 or 2024-01" style="max-width:14em;" /> | |
| <button type="button" id="contam-run-btn" data-i18n="contam.run_btn">🧪 Rate all benchmarks</button> | |
| </div> | |
| <p id="contam-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="contam-output" style="margin-top: 1em;"></div> | |
| </section> | |
| <!-- Quant-regime classifier (v0.7.3 anti-bullshit pack #5) --> | |
| <section id="quant-section" style="display:none;"> | |
| <h2><span data-i18n="quant.title">⚖️ Quant-regime Classifier</span> | |
| <span class="info"><span class="tooltip" data-i18n="quant.tip"> | |
| Predicts γ-shift (and downstream ΔPPL) for a given (model × quant scheme). | |
| Generic claims like "AWQ ~95% retention" are too vague — TAF uses | |
| d_head, GQA ratio, SWA flag, and model size to give an architecture-specific | |
| verdict. Addresses a problem the HF community reports widely: unpredictable quant cliffs | |
| (NF4 -2 PPL on Phi-3 but fine on Llama-3-8B). | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="quant.desc"> | |
| <strong>Will quantizing your model break it?</strong> Paste an HF model id, pick a quant scheme — get predicted γ-shift, expected ΔPPL band, and a recommended alternative if it's a cliff. Browser-only, no GPU, no calibration set required. | |
| </p> | |
| <div class="form-row"> | |
| <label for="quant-id" data-i18n="quant.id_label">HF model id:</label> | |
| <input type="text" id="quant-id" placeholder="e.g. meta-llama/Llama-3.2-1B" /> | |
| <button type="button" id="quant-fetch-btn" data-i18n="quant.fetch_btn">📥 Fetch config</button> | |
| </div> | |
| <div class="form-row"> | |
| <label for="quant-scheme" data-i18n="quant.scheme_label">Quant scheme:</label> | |
| <select id="quant-scheme"> | |
| <option value="">— select scheme —</option> | |
| </select> | |
| <button type="button" id="quant-run-btn" data-i18n="quant.run_btn">⚖️ Predict</button> | |
| <button type="button" id="quant-all-btn" class="secondary" data-i18n="quant.all_btn">📊 Compare all schemes</button> | |
| </div> | |
| <p id="quant-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="quant-output" style="margin-top: 1em;"></div> | |
| </section> | |
| <!-- Cross-framework drift bound (v0.7.5 anti-bullshit pack #6) --> | |
| <section id="drift-section" style="display:none;"> | |
| <h2><span data-i18n="drift.title">🔀 Cross-framework Drift Bound</span> | |
| <span class="info"><span class="tooltip" data-i18n="drift.tip"> | |
| Same model, different scores on different setups. Is the gap noise or | |
| a real bug? Enter two scores with their (framework, dtype, batch, | |
| chat-template) — tool predicts the maximum allowable drift from | |
| numerical noise alone. If observed gap exceeds it → real bug, usually | |
| chat-template mismatch (lm-eval issue #1841) or KV-cache layout. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="drift.desc"> | |
| <strong>Your model gives 67.2 on lm-eval-hf and 65.1 on vLLM-served. Bug or noise?</strong> Enter both scores with (framework, dtype, batch, chat-template applied?). The tool predicts the noise band and flags real bugs. arXiv 2506.09501 documents this as a major eval reproducibility problem. | |
| </p> | |
| <div class="drift-grid"> | |
| <fieldset class="drift-setup"> | |
| <legend data-i18n="drift.setup_a">Setup A</legend> | |
| <div class="form-row"> | |
| <label for="drift-a-score" data-i18n="drift.score">Score:</label> | |
| <input type="number" id="drift-a-score" step="0.01" placeholder="67.2" /> | |
| </div> | |
| <div class="form-row"> | |
| <label for="drift-a-framework" data-i18n="drift.framework">Framework:</label> | |
| <select id="drift-a-framework"></select> | |
| </div> | |
| <div class="form-row"> | |
| <label for="drift-a-dtype" data-i18n="drift.dtype">Dtype:</label> | |
| <select id="drift-a-dtype"></select> | |
| </div> | |
| <div class="form-row"> | |
| <label for="drift-a-batch" data-i18n="drift.batch">Batch:</label> | |
| <input type="number" id="drift-a-batch" min="1" value="1" /> | |
| </div> | |
| <div class="form-row"> | |
| <label for="drift-a-template" data-i18n="drift.template">Chat-template:</label> | |
| <select id="drift-a-template"> | |
| <option value="applied" data-i18n="drift.template.applied">applied</option> | |
| <option value="not_applied" data-i18n="drift.template.not_applied">not applied</option> | |
| <option value="unknown" data-i18n="drift.template.unknown">unknown</option> | |
| </select> | |
| </div> | |
| </fieldset> | |
| <fieldset class="drift-setup"> | |
| <legend data-i18n="drift.setup_b">Setup B</legend> | |
| <div class="form-row"> | |
| <label for="drift-b-score" data-i18n="drift.score">Score:</label> | |
| <input type="number" id="drift-b-score" step="0.01" placeholder="65.1" /> | |
| </div> | |
| <div class="form-row"> | |
| <label for="drift-b-framework" data-i18n="drift.framework">Framework:</label> | |
| <select id="drift-b-framework"></select> | |
| </div> | |
| <div class="form-row"> | |
| <label for="drift-b-dtype" data-i18n="drift.dtype">Dtype:</label> | |
| <select id="drift-b-dtype"></select> | |
| </div> | |
| <div class="form-row"> | |
| <label for="drift-b-batch" data-i18n="drift.batch">Batch:</label> | |
| <input type="number" id="drift-b-batch" min="1" value="1" /> | |
| </div> | |
| <div class="form-row"> | |
| <label for="drift-b-template" data-i18n="drift.template">Chat-template:</label> | |
| <select id="drift-b-template"> | |
| <option value="applied" data-i18n="drift.template.applied">applied</option> | |
| <option value="not_applied" data-i18n="drift.template.not_applied">not applied</option> | |
| <option value="unknown" data-i18n="drift.template.unknown">unknown</option> | |
| </select> | |
| </div> | |
| </fieldset> | |
| </div> | |
| <div class="form-row"> | |
| <button type="button" id="drift-run-btn" data-i18n="drift.run_btn">🔀 Compute drift bound</button> | |
| <button type="button" id="drift-sample-btn" class="secondary" data-i18n="drift.sample_btn">📊 Load sample (chat-template bug)</button> | |
| </div> | |
| <p id="drift-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="drift-output" style="margin-top: 1em;"></div> | |
| </section> | |
| <!-- NIAH → reasoning gap predictor (v0.7.6 anti-bullshit pack #7) --> | |
| <section id="niah-section" style="display:none;"> | |
| <h2><span data-i18n="niah.title">🔍 NIAH → Reasoning Gap</span> | |
| <span class="info"><span class="tooltip" data-i18n="niah.tip"> | |
| NIAH (Needle in a Haystack) tests retrieval: "find this fact in long text". Multi-hop reasoning tests inference: "combine facts X+Y at the start with fact Z at the end". RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="niah.desc"> | |
| <strong>Your model claims 128k context. Will it actually reason at 64k, or just retrieve?</strong> Paste an HF model id and a target eval context — tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a "safe context" where reasoning stays ≥65%. | |
| </p> | |
| <div class="form-row"> | |
| <label for="niah-id" data-i18n="niah.id_label">HF model id:</label> | |
| <input type="text" id="niah-id" placeholder="e.g. meta-llama/Llama-3.1-8B-Instruct" /> | |
| <button type="button" id="niah-fetch-btn" data-i18n="niah.fetch_btn">📥 Fetch config</button> | |
| </div> | |
| <div class="form-row"> | |
| <label for="niah-teval" data-i18n="niah.teval_label">Target context (T_eval):</label> | |
| <input type="number" id="niah-teval" min="512" step="1024" value="32768" /> | |
| <button type="button" id="niah-run-btn" data-i18n="niah.run_btn">🔍 Predict</button> | |
| <button type="button" id="niah-sweep-btn" class="secondary" data-i18n="niah.sweep_btn">📊 Sweep contexts</button> | |
| </div> | |
| <p id="niah-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="niah-output" style="margin-top: 1em;"></div> | |
| </section> | |
| <!-- Benchmark Saturation Detector (v0.8.0 anti-bullshit pack #6) --> | |
| <section id="saturation-section" style="display:none;"> | |
| <h2><span data-i18n="saturation.title">📈 Benchmark Saturation Detector</span> | |
| <span class="info"><span class="tooltip" data-i18n="saturation.tip"> | |
| MMLU is saturated (88-94% across all frontier models). Reporting "92% on MMLU" is now meaningless. This tool tells you which benchmarks still discriminate frontier models, which are saturated, and what to use instead. Data: DemandSphere AI Frontier Tracker (CC BY-NC 4.0) refreshed 2026-05. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="saturation.desc"> | |
| <strong>Is your benchmark still useful?</strong> Pick a benchmark to see top-3 frontier scores, spread, and a verdict (saturated / near-saturated / discriminative) plus recommended replacements. | |
| </p> | |
| <div class="form-row"> | |
| <label for="saturation-select" data-i18n="saturation.select_label">Benchmark:</label> | |
| <select id="saturation-select"></select> | |
| <button type="button" id="saturation-run-btn" data-i18n="saturation.run_btn">📈 Classify</button> | |
| <button type="button" id="saturation-all-btn" class="secondary" data-i18n="saturation.all_btn">📊 Show all</button> | |
| </div> | |
| <p id="saturation-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="saturation-output" style="margin-top: 1em;"></div> | |
| <p class="subtle" style="font-size:0.82em; margin-top:1em;" data-i18n="saturation.attribution"> | |
| Data: DemandSphere AI Frontier Model Tracker (CC BY-NC 4.0) · HF Open LLM Leaderboard v3 (open-weight historical) · last fetch 2026-05-05. | |
| </p> | |
| </section> | |
| <!-- Solutions Hub — integrator portal (v0.8.1) --> | |
| <!-- JSON CoT-aware Linter (mode=cot, v0.8.2 anti-bullshit pack #8) --> | |
| <section id="cot-section" style="display:none;"> | |
| <h2><span data-i18n="cot.title">📋 JSON CoT-aware Linter</span> | |
| <span class="info"><span class="tooltip" data-i18n="cot.tip"> | |
| <strong>Why this matters</strong>: constrained-decoding engines (llguidance, Outlines, SGLang grammars) emit JSON properties in schema order. If your schema places <code>answer</code> before <code>reasoning</code>, the model commits to a final answer first and only then writes the rationale to justify it — defeating Chain-of-Thought entirely. Paste a JSON Schema (or example object) and the linter flags the ordering. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="cot.desc"> | |
| <strong>Reasoning before answer, always.</strong> Paste a JSON Schema or example response object — the linter reports whether reasoning fields come before answer fields and suggests a fix. | |
| </p> | |
| <div class="form-row"> | |
| <textarea id="cot-input" rows="10" style="width:100%;font-family:monospace;font-size:0.9em;" data-i18n-placeholder="cot.input.placeholder" placeholder='{ "type": "object", "properties": { "answer": {"type": "string"}, "reasoning": {"type": "string"} } }'></textarea> | |
| </div> | |
| <div class="form-row"> | |
| <button type="button" id="cot-lint-btn" data-i18n="cot.lint_btn">🔍 Lint</button> | |
| <button type="button" id="cot-example-good-btn" class="secondary" data-i18n="cot.example_good_btn">↳ Example: good order</button> | |
| <button type="button" id="cot-example-bad-btn" class="secondary" data-i18n="cot.example_bad_btn">↳ Example: anti-pattern</button> | |
| </div> | |
| <p id="cot-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="cot-output" style="margin-top: 1em;"></div> | |
| </section> | |
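<!-- Developer sketch (illustrative only, not the shipped implementation).
The ordering check described above can be approximated by comparing key positions
after JSON.parse, which keeps string keys in insertion order. The field-name lists
and the function name lintCotOrder are hypothetical; a real linter would likely
recognize more names.

```javascript
// Hypothetical name lists used to classify properties.
const REASONING_KEYS = ["reasoning", "rationale", "thoughts", "explanation"];
const ANSWER_KEYS = ["answer", "final_answer", "result"];

function lintCotOrder(jsonText) {
  const obj = JSON.parse(jsonText);
  // Accept either a JSON Schema (use its "properties") or an example object.
  const hasProps = obj !== null && typeof obj.properties === "object" && obj.properties !== null;
  const keys = Object.keys(hasProps ? obj.properties : obj);
  const firstReasoning = keys.findIndex(k => REASONING_KEYS.includes(k));
  const firstAnswer = keys.findIndex(k => ANSWER_KEYS.includes(k));
  if (firstReasoning === -1 || firstAnswer === -1) {
    return { ok: true, note: "no reasoning/answer pair found" };
  }
  return firstAnswer < firstReasoning
    ? { ok: false, note: "answer precedes reasoning" }
    : { ok: true, note: "reasoning precedes answer" };
}
```

Run on the placeholder schema in the textarea above (answer listed first), this
returns ok: false.
-->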
| <!-- PEFT Anti-Pattern Checker (mode=peft, v0.8.3 anti-bullshit pack #9) --> | |
| <section id="peft-section" style="display:none;"> | |
| <h2><span data-i18n="peft.title">🔧 PEFT Anti-Pattern Checker</span> | |
| <span class="info"><span class="tooltip" data-i18n="peft.tip"> | |
| <strong>Why this matters</strong>: <code>get_peft_model(base, config)</code> creates a FRESH adapter — it does NOT load saved weights. Users who want to resume from a checkpoint must call <code>PeftModel.from_pretrained(base, path)</code>. peft #2115 documents the silent base-model bug. This linter scans your training script for that pattern (and 3 others: QLoRA ordering, target_modules/arch mismatch, lora_alpha ratio). | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="peft.desc"> | |
| <strong>Don't burn 10 hours of training on a base model.</strong> Paste your PEFT setup code — the linter flags silent base-model loads, QLoRA ordering bugs, target_modules/arch mismatches, and lora_alpha conventions. | |
| </p> | |
| <div class="form-row"> | |
| <textarea id="peft-input" rows="14" style="width:100%;font-family:monospace;font-size:0.9em;" data-i18n-placeholder="peft.input.placeholder" placeholder="from peft import LoraConfig, get_peft_model&#10;base = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3-8B')&#10;config = LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','v_proj'])&#10;model = get_peft_model(base, config)&#10;# resume from saved adapter:&#10;model.load_state_dict('./outputs/checkpoint-1000/adapter_model.bin')"></textarea> | |
| </div> | |
| <div class="form-row"> | |
| <button type="button" id="peft-lint-btn" data-i18n="peft.lint_btn">🔍 Lint</button> | |
| <button type="button" id="peft-example-bug-btn" class="secondary" data-i18n="peft.example_bug_btn">↳ Example: silent base-load</button> | |
| <button type="button" id="peft-example-qlora-btn" class="secondary" data-i18n="peft.example_qlora_btn">↳ Example: QLoRA order bug</button> | |
| <button type="button" id="peft-example-clean-btn" class="secondary" data-i18n="peft.example_clean_btn">↳ Example: clean</button> | |
| </div> | |
| <p id="peft-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="peft-output" style="margin-top: 1em;"></div> | |
| </section> | |
| <!-- Prompt-Cache Diff Predictor (mode=cache, v0.8.4 anti-bullshit pack #10) --> | |
| <section id="cache-section" style="display:none;"> | |
| <h2><span data-i18n="cache.title">🔁 Prompt-Cache Diff Predictor</span> | |
| <span class="info"><span class="tooltip" data-i18n="cache.tip"> | |
| <strong>Why this matters</strong>: Anthropic's <code>cache_control</code> cache breaks at the first token diff in the marked prefix. OpenAI auto-caches prefixes ≥1024 tokens but invalidates on any change. Gemini context cache requires ≥32K tokens. A misplaced edit silently 10x's your bill — and the API never warns you. Paste old + new prompt, see per-provider hit ratio + cost delta. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="cache.desc"> | |
| <strong>Don't 10x your bill on a one-character edit.</strong> Paste your previous and current prompt — the predictor finds the longest common prefix, estimates tokens, and shows per-provider cache hit ratio + $ delta vs no-cache. | |
| </p> | |
| <div class="form-row" style="display:flex; gap:1em; flex-wrap:wrap;"> | |
| <div style="flex:1; min-width:300px;"> | |
| <label for="cache-old" data-i18n="cache.old_label">Old prompt:</label> | |
| <textarea id="cache-old" rows="10" style="width:100%;font-family:monospace;font-size:0.85em;" data-i18n-placeholder="cache.old.placeholder" placeholder="You are a helpful assistant. …"></textarea> | |
| </div> | |
| <div style="flex:1; min-width:300px;"> | |
| <label for="cache-new" data-i18n="cache.new_label">New prompt:</label> | |
| <textarea id="cache-new" rows="10" style="width:100%;font-family:monospace;font-size:0.85em;" data-i18n-placeholder="cache.new.placeholder" placeholder="You are a helpful assistant. …"></textarea> | |
| </div> | |
| </div> | |
| <div class="form-row"> | |
| <label for="cache-profile" data-i18n="cache.profile_label">Tokenizer profile:</label> | |
| <select id="cache-profile"> | |
| <option value="english" data-i18n="cache.profile.english">English (chars/4)</option> | |
| <option value="code" data-i18n="cache.profile.code">Code (chars/3.5)</option> | |
| <option value="mixed" data-i18n="cache.profile.mixed">CJK / Cyrillic (chars/2)</option> | |
| </select> | |
| <label for="cache-output-tokens" data-i18n="cache.output_label">Estimated output tokens:</label> | |
| <input type="number" id="cache-output-tokens" value="500" min="0" max="100000" style="width:8em;" /> | |
| </div> | |
| <div class="form-row"> | |
| <button type="button" id="cache-diff-btn" data-i18n="cache.diff_btn">🔍 Predict</button> | |
| <button type="button" id="cache-example-good-btn" class="secondary" data-i18n="cache.example_good_btn">↳ Example: 99% hit</button> | |
| <button type="button" id="cache-example-broken-btn" class="secondary" data-i18n="cache.example_broken_btn">↳ Example: cache busted</button> | |
| <button type="button" id="cache-example-belowmin-btn" class="secondary" data-i18n="cache.example_belowmin_btn">↳ Example: below OpenAI min</button> | |
| </div> | |
| <p id="cache-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="cache-output" style="margin-top: 1em;"></div> | |
| </section> | |
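<!-- Developer sketch (illustrative only, not the shipped implementation).
The prefix-based estimate described above can be approximated with a character-level
longest common prefix and the chars-per-token profiles from the dropdown. The hit-ratio
definition (cached prefix tokens over total new-prompt tokens), the minTokens default
of 1024 (the OpenAI threshold quoted in the tooltip) and the function names are
assumptions.

```javascript
// Characters per token for each tokenizer profile in the form.
const CHARS_PER_TOKEN = { english: 4, code: 3.5, mixed: 2 };

// Length (in characters) of the longest common prefix of two strings.
function commonPrefixLength(a, b) {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

// Estimates how many tokens of the new prompt could be served from cache.
function predictCacheHit(oldPrompt, newPrompt, profile = "english", minTokens = 1024) {
  const cpt = CHARS_PER_TOKEN[profile];
  const prefixTokens = Math.floor(commonPrefixLength(oldPrompt, newPrompt) / cpt);
  const totalTokens = Math.max(1, Math.ceil(newPrompt.length / cpt));
  // Below the provider minimum nothing is cached, so the ratio is 0.
  const eligible = prefixTokens >= minTokens;
  return { prefixTokens, totalTokens, hitRatio: eligible ? prefixTokens / totalTokens : 0 };
}
```

An edit at the very start of the prompt gives a prefix of zero characters and a hit
ratio of 0, while an edit at the end of a long prompt keeps the ratio close to 1.
-->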
| <!-- Speculative-Decode Compatibility Checker (mode=speculative, v0.8.5 #11) --> | |
| <section id="speculative-section" style="display:none;"> | |
| <h2><span data-i18n="speculative.title">🔬 Speculative-Decode Compatibility</span> | |
| <span class="info"><span class="tooltip" data-i18n="speculative.tip"> | |
| <strong>Why this matters</strong>: speculative decoding (vLLM, SGLang, llama.cpp, transformers) requires the draft and target model to share an EXACT vocabulary. Any token-id disagreement means the target rejects every draft token — you pay BOTH compute costs and get WORSE throughput than baseline. The system reports nominal output (just slower), so the bug is invisible in unit tests. This tool fetches <code>tokenizer.json</code> from HF Hub for both ids and compares. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="speculative.desc"> | |
| <strong>Don't ship spec-dec with mismatched vocabs.</strong> Paste target + draft HF model ids → tool fetches tokenizers, compares vocab type, size, sampled token-ids, special tokens, added tokens → verdict + speedup estimate. | |
| </p> | |
| <div class="form-row"> | |
| <label for="spec-target-id" data-i18n="speculative.target_label">Target (large) model id:</label> | |
| <input type="text" id="spec-target-id" placeholder="Qwen/Qwen2.5-72B-Instruct" style="flex:1;" /> | |
| </div> | |
| <div class="form-row"> | |
| <label for="spec-draft-id" data-i18n="speculative.draft_label">Draft (small) model id:</label> | |
| <input type="text" id="spec-draft-id" placeholder="Qwen/Qwen2.5-7B-Instruct" style="flex:1;" /> | |
| </div> | |
| <div class="form-row"> | |
| <button type="button" id="spec-check-btn" data-i18n="speculative.check_btn">🔍 Check compatibility</button> | |
| <button type="button" id="spec-example-good-btn" class="secondary" data-i18n="speculative.example_good_btn">↳ Example: Qwen2.5 7B/72B (good)</button> | |
| <button type="button" id="spec-example-bad-btn" class="secondary" data-i18n="speculative.example_bad_btn">↳ Example: cross-family (bad)</button> | |
| </div> | |
| <p class="recipe-desc subtle" style="font-size:0.82em;" data-i18n="speculative.gated_note"> | |
| 💡 <strong>Gated models</strong> (Llama, Mistral, Gemma) require HF login + license acceptance — this tool can't auth, so they return 401. Use open-weight pairs (Qwen, Phi, DeepSeek, Yi, StarCoder, Falcon) for demos. | |
| </p> | |
| <p id="spec-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="spec-output" style="margin-top: 1em;"></div> | |
| </section> | |
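<!-- Developer sketch (illustrative only, not the shipped implementation).
A minimal vocabulary comparison, assuming each vocab has already been extracted as a
plain object mapping token string to id (for BPE tokenizers this is typically the
model.vocab field of tokenizer.json). It checks every token rather than a sample and
ignores special and added tokens; compareVocabs is a hypothetical name.

```javascript
// Compares two token-to-id maps and reports whether they agree exactly.
function compareVocabs(targetVocab, draftVocab) {
  const targetTokens = Object.keys(targetVocab);
  let mismatched = 0;
  for (const tok of targetTokens) {
    // A missing token or a different id both count as a mismatch.
    if (draftVocab[tok] !== targetVocab[tok]) mismatched++;
  }
  const sizeEqual = targetTokens.length === Object.keys(draftVocab).length;
  return { sizeEqual, mismatched, compatible: sizeEqual && mismatched === 0 };
}
```

Any nonzero mismatched count, or a size difference, marks the pair as incompatible
under the exact-vocabulary requirement stated above.
-->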
| <!-- Multilingual Tokenizer Tax (mode=tax, v0.8.7 anti-bullshit pack #13) --> | |
| <section id="tax-section" style="display:none;"> | |
| <h2><span data-i18n="tax.title">🌍 Multilingual Tokenizer Tax</span> | |
| <span class="info"><span class="tooltip" data-i18n="tax.tip"> | |
| <strong>Why this matters</strong>: tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ tokens in Chinese on a Latin-trained tokenizer (Llama, Phi). Cost per request and effective context BOTH degrade silently. Paste your text, see actual token counts across vendor tokenizers — no estimation, real BPE encoding via transformers.js in your browser. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="tax.desc"> | |
| <strong>Don't 3× your bill on Chinese support.</strong> Paste any text → real per-tokenizer BPE encoding across Qwen / Phi / Llama / Gemma / GPT-4 / Claude → see the cost asymmetry vs your baseline. | |
| </p> | |
| <div class="form-row"> | |
| <label for="tax-input" data-i18n="tax.input_label">Text to tokenize:</label> | |
| <textarea id="tax-input" rows="8" style="width:100%;font-family:monospace;font-size:0.9em;" data-i18n-placeholder="tax.input.placeholder" placeholder="Paste any text — English, Chinese, Arabic, code, …"></textarea> | |
| </div> | |
| <div class="form-row"> | |
| <button type="button" id="tax-tokenize-btn" data-i18n="tax.tokenize_btn">🔬 Tokenize all</button> | |
| <button type="button" id="tax-sample-en-btn" class="secondary" data-i18n="tax.sample_en_btn">↳ Sample: English</button> | |
| <button type="button" id="tax-sample-zh-btn" class="secondary" data-i18n="tax.sample_zh_btn">↳ Sample: 中文</button> | |
| <button type="button" id="tax-sample-ar-btn" class="secondary" data-i18n="tax.sample_ar_btn">↳ Sample: عربى</button> | |
| <button type="button" id="tax-sample-mixed-btn" class="secondary" data-i18n="tax.sample_mixed_btn">↳ Sample: mixed</button> | |
| <button type="button" id="tax-sample-code-btn" class="secondary" data-i18n="tax.sample_code_btn">↳ Sample: code</button> | |
| </div> | |
| <p id="tax-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="tax-output" style="margin-top: 1em;"></div> | |
| <p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;" data-i18n="tax.firstload_note"> | |
| 💡 <strong>First-time load:</strong> the tool fetches transformers.js (~750 KB) + each tokenizer's vocab on demand (~5-15 MB per tokenizer, cached after). Subsequent runs are instant. All processing is local — your text never leaves the browser. | |
| </p> | |
| </section> | |
| <!-- LongScore (mode=longscore, v0.8.8 anti-bullshit pack #14) --> | |
| <section id="longscore-section" style="display:none;"> | |
| <h2><span data-i18n="longscore.title">🎯 LongScore</span> | |
| <span class="info"><span class="tooltip" data-i18n="longscore.tip"> | |
| <strong>Why this matters</strong>: many models advertise a 128K context window, but accuracy often degrades well before that length. LongScore (peer-reviewed metric from 100-LongBench, ACL 2025) measures relative degradation past short context. It separates base ability from true long-context capability, so you compare degradation rather than raw scores. Results are looked up in the RULER and HELMET knowledge bases (n=93 models). | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="longscore.desc"> | |
| <strong>How much does your model degrade past short context?</strong> Paste an HF model id → see LongScore (relative degradation) + per-length breakdown + HELMET 7-task scores when available. No GPU. No inference. Pure lookup against published benchmarks. | |
| </p> | |
| <div class="form-row"> | |
| <label for="longscore-input" data-i18n="longscore.input_label">Model id:</label> | |
| <input type="text" id="longscore-input" data-i18n-placeholder="longscore.input.placeholder" | |
| placeholder="e.g. Qwen2.5-72B-Instruct or meta-llama/Llama-3.1-70B-Instruct" style="flex:1;" /> | |
| <button type="button" id="longscore-lookup-btn" data-i18n="longscore.lookup_btn">🔎 Lookup</button> | |
| </div> | |
| <div class="form-row"> | |
| <button type="button" id="longscore-example-good-btn" class="secondary" data-i18n="longscore.example_good_btn">↳ Example: Jamba-1.5-Large (no degradation)</button> | |
| <button type="button" id="longscore-example-mid-btn" class="secondary" data-i18n="longscore.example_mid_btn">↳ Example: Llama-3.1-70B (moderate)</button> | |
| <button type="button" id="longscore-example-bad-btn" class="secondary" data-i18n="longscore.example_bad_btn">↳ Example: dbrx (severe)</button> | |
| </div> | |
| <p id="longscore-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="longscore-output" style="margin-top: 1em;"></div> | |
| <p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;" data-i18n="longscore.formula_note"> | |
| 💡 <strong>LongScore</strong> = mean over l ∈ {16K, 32K, 64K, 128K} of (S_l − Base) / Base, where S_l is the score at context length l and Base = mean(S_4K, S_8K). Source: <a href="https://arxiv.org/abs/2505.19293" target="_blank">100-LongBench, ACL 2025</a>. Data: <a href="https://github.com/NVIDIA/RULER" target="_blank">NVIDIA RULER</a> (per-length, n=33) + <a href="https://github.com/princeton-nlp/HELMET" target="_blank">HELMET</a> (aggregate at 128K, n=60). Reading the value: 0 = no degradation; −0.30 = severe. | |
| </p> | |
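<!--
Maintainer note (a sketch, not the code in js/main.js): the LongScore formula in the note above
can be written out directly. The function names, the shape of the scores object, and the demo
scores are assumptions made for illustration; only the formula itself comes from the note.

```javascript
// Sketch of the LongScore formula quoted in the note above.
// scores: object keyed by context-length label, e.g. { "4K": 90, "8K": 90, ... }.
const BASE_LENGTHS = ["4K", "8K"];
const LONG_LENGTHS = ["16K", "32K", "64K", "128K"];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function longScore(scores) {
  // Base = mean of the short-context scores.
  const base = mean(BASE_LENGTHS.map((len) => scores[len]));
  // Relative change at each long length, then averaged.
  return mean(LONG_LENGTHS.map((len) => (scores[len] - base) / base));
}

// Made-up scores for illustration only (not benchmark data).
const demo = longScore({ "4K": 90, "8K": 90, "16K": 90, "32K": 81, "64K": 72, "128K": 63 });
console.log(demo.toFixed(2)); // "-0.15"
```

This sketch expects all six lengths to be present. A real lookup would need its own rule for
models where some per-length scores are missing.
-->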
| </section> | |
| <section id="hub-section" style="display:none;"> | |
| <h2><span data-i18n="hub.title">🧭 Solutions Hub</span> | |
| <span class="info"><span class="tooltip" data-i18n="hub.tip"> | |
| A map of every documented LLM-evaluation pain point we know of: which tafagent mode addresses it (if any), and the best-of-breed external tools the community already trusts. The goal is full coverage. If a canonical tool exists elsewhere, we link to it rather than rebuild it. | |
| </span></span> | |
| </h2> | |
| <p class="recipe-desc" data-i18n="hub.desc"> | |
| <strong>Don't reinvent — find.</strong> 30+ pain points mapped to tafagent modes and curated external tools. Browse by category, search by keyword, or see the gaps where new modes would help most. | |
| </p> | |
| <div class="form-row"> | |
| <input type="text" id="hub-search" placeholder="search: e.g. 'forgetting' or 'vendor' or 'RAG'…" style="flex:1;" /> | |
| <button type="button" id="hub-clear-btn" class="secondary" data-i18n="hub.clear_btn">✕ Clear</button> | |
| </div> | |
| <p id="hub-status" class="recipe-desc" style="font-size:0.92em;"></p> | |
| <div id="hub-output" style="margin-top: 1em;"></div> | |
| </section> | |
| <!-- Recipe selector (mode=recipe) --> | |
| <section id="recipe-section" style="display:none;"> | |
| <h2 data-i18n="recipe.title">📋 Recipe</h2> | |
| <select id="recipe-select" disabled> | |
| <option value="" data-i18n="recipe.default">— select a recipe —</option> | |
| </select> | |
| <p id="recipe-desc-display" class="recipe-desc"></p> | |
| </section> | |
| <!-- Form (mode=recipe) --> | |
| <section id="form-section" style="display:none;"> | |
| <h2 data-i18n="recipe.input_title">🎯 Inputs</h2> | |
| <div class="form-row"> | |
| <label for="preset" data-i18n="profile.preset_label">Preset model:</label> | |
| <select id="preset" disabled> | |
| <option value="" data-i18n="profile.preset_default">— select to autofill —</option> | |
| </select> | |
| </div> | |
| <div class="form-row"> | |
| <label for="hf-id" data-i18n="profile.hf_label">Or any HF model:</label> | |
| <input type="text" id="hf-id" | |
| data-i18n-placeholder="profile.hf_placeholder" | |
| placeholder="e.g. Qwen/Qwen2.5-32B-Instruct" style="flex:1;" /> | |
| <button id="hf-fetch-btn" type="button" class="secondary" data-i18n="profile.fetch_btn">📥 Fetch</button> | |
| </div> | |
| <div id="hf-status" class="subtle" style="margin: -0.5rem 0 1rem; min-height:1.2em;"></div> | |
| <div id="dynamic-form" class="form-grid"></div> | |
| <button id="run-btn" disabled data-i18n="ask.btn">🚀 Analyze</button> | |
| </section> | |
| <!-- Output (single-recipe verdict + chain) --> | |
| <section id="output-section" style="display:none;"> | |
| <h2 data-i18n="verdict.title">📊 Verdict</h2> | |
| <div id="verdict-box"></div> | |
| <div class="share-bar"> | |
| <button id="share-btn" class="secondary" type="button" data-i18n="share.btn">🔗 Copy share link</button> | |
| <button id="recipe-download-btn" class="secondary" type="button" data-i18n="share.download">💾 Download JSON</button> | |
| <button id="recipe-download-md-btn" class="secondary" type="button" data-i18n="share.download_md">📝 Markdown</button> | |
| <button id="recipe-download-tex-btn" class="secondary" type="button" data-i18n="share.download_tex">📜 LaTeX</button> | |
| <button id="recipe-submit-btn" class="secondary" type="button" data-i18n="share.submit">📤 Submit to registry</button> | |
| <span id="share-status" class="subtle"></span> | |
| </div> | |
| <h2 data-i18n="chain.title">🔍 Computation Chain</h2> | |
| <p class="subtle" data-i18n="chain.desc">Every number below is computed by deterministic Python code. Click a step to expand it.</p> | |
| <div id="chain-box"></div> | |
| <h2 id="answer-header" style="display:none;" data-i18n="answer.title">💬 Plain-English Answer</h2> | |
| <div id="answer-box" style="display:none;"></div> | |
| </section> | |
| <!-- Profile output --> | |
| <section id="profile-output" style="display:none;"> | |
| <h2 data-i18n="tafcard.title">📇 TAF Card — full model profile</h2> | |
| <div id="profile-box"></div> | |
| </section> | |
| <!-- Compare output --> | |
| <section id="compare-output" style="display:none;"> | |
| <h2 data-i18n="compare.title_out">🆚 Comparison Table</h2> | |
| <div id="compare-box"></div> | |
| </section> | |
| <!-- Hidden file input for JSON upload (shared by all import buttons) --> | |
| <input type="file" id="import-file" accept=".json,application/json" style="display:none;" /> | |
| <!-- Floating import bar (always visible) --> | |
| <section id="import-section"> | |
| <h2 data-i18n="share.import_title">📂 Import a shared TAF result</h2> | |
| <p class="recipe-desc" data-i18n="share.import_desc"> | |
| Got a JSON file from someone else's TAF analysis? Load it here to see the verdict and computation chain locally. | |
| You get the same view as if you had run it yourself. | |
| </p> | |
| <button id="import-btn" class="secondary" type="button" data-i18n="share.import_btn">📂 Load shared JSON</button> | |
| <span id="import-status" class="subtle" style="margin-left:0.75rem;"></span> | |
| </section> | |
| <!-- Browse community submissions (live from GitHub Issues) --> | |
| <section id="community-section"> | |
| <h2 data-i18n="community.title">🌐 Recent community submissions</h2> | |
| <p class="recipe-desc" data-i18n="community.desc"> | |
| Live feed from the public registry. Click any submission to view its full analysis. | |
| <a href="https://github.com/karlesmarin/tafagent-registry/issues" target="_blank" data-i18n="community.browse_all">Browse all →</a> | |
| </p> | |
| <div id="community-feed" class="subtle"><span data-i18n="community.loading">Loading...</span></div> | |
| </section> | |
| <!-- FALSIFICATION dashboard (paper predictions status) --> | |
| <section id="falsification-section"> | |
| <h2 data-i18n="falsification.title">🔬 Paper predictions — falsification status</h2> | |
| <p class="recipe-desc" data-i18n="falsification.desc"> | |
| The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested. | |
| Here's the live status of every prediction in the paper. | |
| </p> | |
| <div id="falsification-table"></div> | |
| </section> | |
| </main> | |
| <footer> | |
| <p data-i18n="footer.text"> | |
| © 2026 Carles Marin · Apache-2.0 · independent research · the tool that closes the loop of the paper. | |
| </p> | |
| <p> | |
| <a href="https://github.com/karlesmarin/tafagent" target="_blank">Source on GitHub</a> | |
| · | |
| <a href="https://github.com/karlesmarin/NeurIPS" target="_blank">Paper repo</a> | |
| </p> | |
| <p class="subtle" data-i18n="footer.tech_stack"> | |
| Computation: Pyodide · Synthesis: WebLLM (Qwen2.5-0.5B local) · Hosting: GitHub Pages · Cost: $0 | |
| </p> | |
| </footer> | |
| <script type="module" src="js/main.js"></script> | |
| </body> | |
| </html> | |