---
title: TAF Agent
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: static
pinned: true
license: apache-2.0
short_description: 14 anti-bullshit LLM diagnostics in browser. 4 langs.
tags:
  - transformer
  - llm
  - evaluation
  - diagnostic
  - anti-bullshit
  - long-context
  - sliding-window
  - yarn
  - chat-template
  - arena-elo
  - contamination
  - quantization
  - drift
  - ruler
  - niah
  - lean-mathlib
  - formal-verification
  - rope
  - kv-cache
  - viability
  - thermodynamics
  - free
  - browser
  - webgpu
language:
  - en
  - es
  - fr
  - zh
---

# 🔬 TAF Agent

Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup. 14 browser-only modes · 4 languages · 37 Lean+Mathlib-verified theorems · 0 telemetry.

- 🌐 Live: https://karlesmarin.github.io/tafagent · HF Space: https://huggingface.co/spaces/karlexmarin/taf-agent
- 📦 Source: https://github.com/karlesmarin/tafagent · Lean repo: https://github.com/karlesmarin/lean-taf
- 📄 Paper: *Predicting How Transformers Attend* — Marin 2026
- 🗂️ Dataset: taf-attention-decay (58 measurements, 32 models)


## 🆕 v0.7 — Anti-bullshit pack (7 new modes)

Each mode addresses a documented HuggingFace community pain point. All run in your browser with zero inference — pure metadata + math. Three of them have no comparable publicly available tool (validated against the existing literature):

| Mode | What it answers | Pain it kills |
|---|---|---|
| 🪟 Unmask | "Is max_position_embeddings honest?" | Mistral-7B-v0.1 claims 32k but attends ~8k via SWA |
| 📜 Chat-template | "Which CLI flag for lm-eval / vLLM / transformers?" | lm-eval-harness #1841 silently halves multi-turn accuracy |
| 🎯 Arena CI | "Is GPT-4 actually better than Claude — or tied?" | Chatbot Arena strips CIs from its public leaderboard |
| 🧪 Contamination | "Should I trust this MMLU score?" | Open LLM Leaderboard v1 was killed by MMLU/HellaSwag contamination |
| ⚖️ Quant | "Will NF4 break my model? Is AWQ better?" | Generic "AWQ ~95%" claims hide model-specific cliffs |
| 🔀 Drift | "lm-eval gives 67.2, vLLM gives 65.1. Bug or noise?" | arXiv 2506.09501 — eval irreproducibility |
| 🔍 NIAH→Reason | "Does my 128k-context model actually reason there?" | RULER paper finding — needle pass ≠ multi-hop reasoning |
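As a flavour of what the 🪟 Unmask mode checks, here is a minimal, hypothetical sketch that flags sliding-window attention from a model's public config.json. The `max_position_embeddings` and `sliding_window` fields are standard HF config keys; the flagging logic below is an illustration, not the tool's exact formula:

```python
def unmask(config: dict) -> dict:
    """Flag configs whose advertised context exceeds the per-layer SWA span."""
    advertised = config.get("max_position_embeddings")
    window = config.get("sliding_window")       # absent/None => full attention
    if window is None or window >= advertised:
        return {"advertised": advertised, "window": advertised, "flag": "ok"}
    return {
        "advertised": advertised,
        "window": window,
        "flag": f"SWA: each layer attends {window} of {advertised} advertised positions",
    }

# Mistral-7B-v0.1-style values (from its public config.json):
print(unmask({"max_position_embeddings": 32768, "sliding_window": 4096}))
```

The real mode goes further (it estimates the effective attention span from the TAF decay fit), but the config-level flag alone already catches the 32k-vs-SWA mismatch in the table above.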

Plus search-as-you-type HF Hub autocomplete on every model-id input (5-minute cache, privacy note, gated-license link).

▶ Try it now · v0.7 release notes


## A note before you read on

This tool was built by one independent researcher, with no funding, no team, no GPUs beyond a single consumer card, and the full collaborative help of large language models as research instruments. It exists because the paper it complements (Predicting How Transformers Attend — Marin 2026) needed a way for any reader to check the framework's predictions on their own model in seconds, without installing anything, without paying anyone, and without trusting a server they don't control.

If it is useful to you — even once — that is enough. If it is wrong about your model, please tell us so we can fix the framework. The point is the common ground, not the artefact.


## What it does

Drop in a model id (or paste any public HuggingFace model id) and get a falsifiable answer to "will this work?" — backed by the Thermodynamic Attention Framework (TAF) formulas:

### Decision recipes

- Will Llama-3-8B serve 32K context with NIAH retrieval? (X-2)
- Should I train a custom 7B model or pay for API access? (X-1)
- I have $5,000 — what model can I afford to train? (X-3)
- Cheapest GPU to serve Llama-70B at 100M tokens/day? (X-5)
- Soft KV decay or hard cutoff for compression? (X-19)

### Diagnostic recipes (NEW in v0.4 — session 29 findings, 2026-04-28)

- How much positional bias did training imprint on this model? (X-21)
- Does this model fit the empirical compute-context invariant band? (X-22)
- Is this checkpoint pre- or post-induction-head? (X-23)

Each recipe runs as a chain of TAF formulas (paper §17, §19, §20, §24, §26, §28–§30), rendered with a full audit trail. Every number is deterministic Python; nothing is hallucinated.

## Four ways to use it

- 📇 Profile a model — paste an id, get all 5 recipes scored as a unified TAF Card (best starting point)
- 🆚 Compare models — 2–3 candidates side-by-side on the same recipe
- 💬 Ask plain English — free-form question; an in-browser LLM picks the right recipe
- 📋 Pick recipe — manual selection with full form control

## How it stays free + unlimited

- Static HTML/JS hosted on GitHub Pages (truly unlimited bandwidth)
- Python TAF computation runs in your browser via Pyodide (no server-side compute)
- Plain-English synthesis runs Qwen2.5-0.5B-Instruct in your browser via WebLLM (your GPU/CPU, your electricity, ~350 MB cached after first load)
- Model config.json files are fetched directly from the HuggingFace Hub (free, public, no auth for non-gated models)
- Your data never leaves your browser

Whether 1 user or 1 million users hit it, our cost stays the same: $0.

## Why static HTML+JS+Pyodide instead of Gradio/Streamlit?

A reasonable question. Three of TAF Agent's USPs are only possible with a browser-only architecture:

1. Your inputs never leave the tab. No server = no privacy compromise. The "anti-bullshit" framing depends on this.
2. $0 forever, even at infinite scale. Static Spaces have unlimited HF bandwidth; there is no cold start, no queue, no rate limit. Going viral can't bankrupt the project.
3. Lean+Mathlib formal verification ships as a static manifest. The 37 theorem badges link to source lines that anyone can `lake build` themselves — no hidden server logic.

Bonus: the in-browser LLM (WebLLM running Qwen2.5-0.5B on your GPU/CPU) behind the 💬 Ask mode is only viable in a static Space, and Pyodide running deterministic Python in your browser means you can audit every number — no opaque server.

The cost: HuggingFace's "Trending Spaces" algorithm favours Gradio/Streamlit Spaces. We compensate with detailed tags, forum presence, and this README. If you'd prefer a Python API client, a gradio_client companion is planned (v0.9).


## Architecture coverage

Supports any model whose config.json is parseable:

| Family | Examples | Status |
|---|---|---|
| RoPE-MHA | pythia, gpt-j, original LLaMA | ✓ supported |
| RoPE-GQA | Llama-3, Mistral, Qwen2.5, gemma-2 | ✓ supported |
| ALiBi | BLOOM, Falcon | ✓ supported |
| AbsPE | gpt2 family | ✓ supported |
| SWA (sliding window) | Mistral, gemma-2, phi-3 | ✓ supported |
| SSM | Mamba, Mamba-2 | ✓ partial (γ doesn't apply, KV does) |
| Any HF Hub public model | (any) | ✓ via the 📥 Fetch button |

## Languages

Interface available in:

- 🇬🇧 English
- 🇪🇸 Español
- 🇫🇷 Français
- 🇨🇳 中文

Click the flags top-right to switch.

## Local development

### Browser application

```shell
git clone https://github.com/karlesmarin/tafagent
cd tafagent
python -m http.server 8000
# open http://localhost:8000
```

### CLI diagnostic (for the paper)

The script cli/diagnose_model.py is the command-line companion described in the paper *Predicting How Transformers Attend* (Marin 2026). It characterises any causal language model from HuggingFace in minutes on CPU and produces the raw gamma_obs values and thermodynamic profile used in the manuscript.

```shell
pip install torch transformers numpy
python cli/diagnose_model.py --model EleutherAI/pythia-2.8b --fast --cpu
```

### Reproducibility data

The directory data/ ships every measurement referenced in the paper (343 JSON files, ~5.5 MB). See data/README.md for the layout.

## Browser requirements

- Chrome / Edge / Firefox 113+ for WebGPU acceleration (recommended)
- Older browsers fall back to CPU inference (slower, but it works)
- ~2 GB free RAM for the synthesis LLM
- ~350 MB disk for the model cache (one-time)

## What's new in v0.4 (2026-04-28)

Three new diagnostic recipes derived from cross-model panel analysis (n=22 LLMs):

### X-21 — Imprint Purity Diagnostic

Predicts γ on random-token input via the learned-imprint formula:

```
γ_random = γ_pade(θ, T) + ν · log10(P / 14M)
ν = −1/(2π) ≈ −0.1592   (derived from the RoPE rotation period)
```

Even on random tokens, weights apply a learned positional bias proportional to log(N_params). The slope ν is fixed (not fitted) — derivable from RoPE's 2π rotation period. Empirical validation: n=22 LLMs, p=0.022, |err|=0.3%.

Use case: detect anomalous training, format conversion (e.g. OLMo native vs HF Δγ=0.30), or fine-tuning drift by comparing predicted vs measured γ_random.
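The imprint shift itself is one line of arithmetic. A sketch of just that term, taken directly from the formula above (`gamma_pade` is the paper's Padé fit, which we do not reproduce here):

```python
import math

NU = -1 / (2 * math.pi)          # ≈ -0.1592, derived, not fitted
P_REF = 14e6                     # 14M-parameter reference point

def imprint_shift(p_params: float) -> float:
    """Positional-bias shift predicted on random-token input,
    to be added on top of gamma_pade(theta, T)."""
    return NU * math.log10(p_params / P_REF)

# A 1.4B model sits two decades above the 14M reference:
print(round(imprint_shift(1.4e9), 4))   # -0.3183
```

Comparing this predicted shift against a measured γ_random is exactly the anomaly check described above: a large discrepancy points at format conversion or fine-tuning drift.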

### X-22 — Compute-Context Invariant

Computes the empirical Chinchilla×attention invariant:

```
K = γ × log(N² · D)   where D = 20·N (Chinchilla compute-optimal)
Empirical band: K ∈ [34, 68]   (51.2 ± 16.8, CV = 0.329, n = 22)
```

K-outliers indicate scaling/training anomalies. Llama-3-8B with γ=1.045 gives K=74.6 (z=1.39, high-K OUTLIER) — flags supra-Padé attention.
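The Llama-3-8B numbers above can be recomputed in a few lines. This assumes the `log` in the invariant is natural log, which reproduces the quoted K = 74.6:

```python
import math

K_MEAN, K_STD = 51.2, 16.8       # empirical band statistics (n = 22)

def compute_K(gamma: float, n_params: float) -> float:
    """K = γ · log(N² · D) with D = 20·N (Chinchilla compute-optimal)."""
    d_tokens = 20 * n_params
    return gamma * math.log(n_params**2 * d_tokens)

k = compute_K(1.045, 8e9)        # Llama-3-8B: γ = 1.045, N = 8B
z = (k - K_MEAN) / K_STD
print(f"K = {k:.1f}, z = {z:.2f}")   # K = 74.6, z = 1.39
```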

### X-23 — IH-Phase Detector

Uses the Δγ probe (cheaper than an ICL benchmark):

```
sign(γ_text − γ_random) > 0   ⟺   post-induction-head formation
```

- Pre-IH (P < 400M, n = 7): ⟨Δγ⟩ = −0.19 ± 0.26
- Post-IH (P ≥ 400M, n = 15): ⟨Δγ⟩ = +0.03 ± 0.26

Use case: monitor training trajectories without running ICL benchmarks; detect anomalous checkpoints.
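The probe reduces to a sign test on the two γ measurements (which themselves come from the tool's decay fits). A minimal sketch, thresholding at exactly zero as the criterion above does:

```python
def ih_phase(gamma_text: float, gamma_random: float) -> str:
    """Classify a checkpoint by the sign of the Δγ probe."""
    delta = gamma_text - gamma_random
    return "post-IH" if delta > 0 else "pre-IH"

# Illustrative values consistent with the panel group means above:
print(ih_phase(1.02, 0.99))   # post-IH
print(ih_phase(0.60, 0.79))   # pre-IH
```

Given the ±0.26 spreads reported above, a production check would presumably want an uncertainty band around zero rather than a hard threshold.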

### Other v0.4 additions

- `gamma_decompose_v2(...)` — 6-axis decomposition with the new imprint axis
- `famous_constant_proximity(...)` — detects γ clustering on famous constants (e.g. CodeLlama-13b γ = 0.382 ≈ 1 − 1/φ, the golden-ratio conjugate)

## What's new in v0.5.3 (2026-05-02) — 🔧 Audit-driven bug fixes

The TAF Agent was applied to its own author's paper (recursive Sócrates audit) and to the agent's own formula implementations. Several real bugs were detected and corrected. All v0.5.0–v0.5.2 users running diagnostics on Phase B models (γ > 1: LLaMA-2/3, Mistral, Gemma, Qwen2.5-7B near-Hagedorn) received incorrect KV-compression recommendations. This release fixes all known issues.

### Critical fixes

- `D_f_closed` (KV-compression window): replaced the asymptotic / Hagedorn-buffer branches with a discrete cumulative sum. The old code clamped Phase B (γ > 1) to N when the true window is far smaller (LLaMA-3-8B at γ = 1.046 with N = 2000 should compress to ~750 tokens; the old code returned 2000), and the boundary region γ ∈ [0.99, 1.01] was off by a factor of ~2×. Now exact for any γ.

- `partition_Z(γ=1, N)`: was log(N + 0.5), missing the Euler–Mascheroni constant γ_E ≈ 0.577 (a ~7 % underestimate of H_N). Now log(N) + γ_E.

- `free_energy_F`: returned −log(Z) (the β·F convention). Now −log(Z)/γ, consistent with the Helmholtz definition F = −T·log(Z) and the thermodynamic identity S = γ·(U − F).

- `γ_pred`: replaced the obsolete C/ln θ heuristic with γ_Padé(θ, T_eval) (paper §3.3).
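Both the partition_Z and D_f fixes can be sanity-checked numerically from the formulas quoted above. The function names below are illustrative, not the tool's exact API, and the 0.9 mass threshold in the window sketch is our assumption for illustration:

```python
import math

GAMMA_E = 0.5772156649

def partition_Z(gamma: float, n: int) -> float:
    """Z = Σ_{k=1}^{N} k^-γ; at γ = 1 this is the harmonic number H_N."""
    return sum(k ** -gamma for k in range(1, n + 1))

# Fixed closed form log(N) + γ_E vs the exact harmonic sum at N = 2000:
approx = math.log(2000) + GAMMA_E
exact = partition_Z(1.0, 2000)
print(abs(exact - approx))            # ~2.5e-4: the γ_E term matters

def d_f_window(gamma: float, n: int, mass: float = 0.9) -> int:
    """Discrete cumulative sum: smallest prefix holding `mass` of Σ k^-γ."""
    weights = [k ** -gamma for k in range(1, n + 1)]
    total = sum(weights)
    acc = 0.0
    for i, w in enumerate(weights, start=1):
        acc += w
        if acc >= mass * total:
            return i
    return n

# Phase B example from above: LLaMA-3-8B-like γ = 1.046, N = 2000.
# The old asymptotic branch clamped this window to N = 2000.
print(d_f_window(1.046, 2000))
```

The heavier the tail decay (larger γ), the smaller the retained window, which is exactly why the Phase B clamp-to-N bug mattered.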

### Calibration audit (cross-panel re-check, n = 22)

Re-running the empirical δ corrections of `gamma_decompose` against the panel revealed:

| Constant | Hardcoded | Panel re-audit | Verdict |
|---|---|---|---|
| δ_GQA | +0.11 | +0.115 | ✓ replicates |
| δ_SWA | −0.21 | originally fit on an n=1 model | ✗ disabled (insufficient data) |
| δ_post_IH | −0.15 | group mean ≈ 0 (n = 16 yes / 6 no) | ⚠ flagged exploratory |
| δ_instruct (v2) | −0.10 | n = 3, p = 0.06 (already noted) | ⚠ flagged exploratory |

`gamma_decompose` and `gamma_decompose_v2` now return per-axis status fields (`delta_SWA_status`, `delta_post_IH_status`, etc.) and a top-level `calibration_warning`, so consumers can detect which corrections are reliable.

The TAF Card UI now displays a collapsible "v0.5.3 — Calibration audit" banner in all four supported languages (EN/ES/FR/ZH) explaining this.

### Paper §5.2 erratum

The framework's own self-audit found that paper §5.2 Theorem 5.2 claims C_V(γ=1, N) = (log N)²/4. Sócrates triangulation (numerical Python + Sage exact rational + SymPy symbolic integral) confirms the correct asymptotic is (log N)²/12 — a factor-3 error in the paper's truncated Z-expansion proof. The agent's heat_capacity_Cv already computes the correct value via numerical derivative of U; only the paper's analytic formula is wrong, not the tool. A formal erratum will be published as a separate document.

### Tests

22/22 unit tests pass (`tests/test_taf_formulas.py`), including regression tests for the D_f Phase B fix, the partition_Z γ_E term, the free_energy_F convention, and the disabled δ_SWA.

### Why this happened

These bugs survived prior reviews because the affected code paths were exercised mainly on Phase A models (γ < 0.95) where the asymptotic approximation is close enough. Phase B (γ > 1) and the boundary near Hagedorn (|γ−1| < 0.05) were under-tested. The agent now uses direct discrete computation, so accuracy is uniform across all γ.


## What's new in v0.5 (2026-05-01) — 🔬 Machine-verified consistency

First transformer-attention framework with formal machine-proof backing.

Dual-tool verification — a Sage Gröbner-basis decision procedure (computer algebra) plus Lean Mathlib4 (dependent type theory) — of 15 algebraic identities among the TAF critical exponents.

### `verify_algebraic_consistency(γ)` — new function

Given a measured γ ∈ Phase A (0, 1), checks 12 D-SAGE identities derived from the TAF exponents (β = γ−1, ν = 1/(1−γ), η = γ−1, etc.):

- D-SAGE-1 (★★ core): 2η² + η·γ_χ + 1 = 0 (quadratic identity)
- D-SAGE-2: β·χ = −1 (Phase A)
- D-SAGE-4: α + χ = 2
- D-SAGE-5: α + γ_χ = 2(2 − γ)
- D-SAGE-6: β·γ_χ = −2γ² + 4γ − 3 (factored)
- Rushbrooke + Josephson tautologies (d = 1)
- Fisher residual = γ(2γ−3)/(1−γ) (NOT zero in general; corrects the "triple closure" claim)
- η = 2γ refutation (Phase A residual > 0; paper 1's claim was wrong)
- D-SAGE-7: c · |ν_imprint| · 2π = 3 (dimensional closure)

Pass = framework intact. Fail = bf16 outlier, quantization artifact, or γ measurement noise.
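Symbolic verification is Sage and Lean's job, but the core identities can be spot-checked with plain floats. A sketch using only the exponent definitions quoted above (β = γ−1, ν = 1/(1−γ), η = γ−1), deriving χ, α, γ_χ from D-SAGE-2, -4, -5 and checking that the remaining identities close:

```python
def dsage_residuals(g: float) -> dict:
    """Residuals of selected D-SAGE identities at a Phase A value γ = g."""
    beta = eta = g - 1
    nu = 1 / (1 - g)
    chi = -1 / beta                    # D-SAGE-2: β·χ = -1
    alpha = 2 - chi                    # D-SAGE-4: α + χ = 2
    gamma_chi = 2 * (2 - g) - alpha    # D-SAGE-5: α + γ_χ = 2(2-γ)
    return {
        "D-SAGE-1": 2 * eta**2 + eta * gamma_chi + 1,
        "D-SAGE-6": beta * gamma_chi - (-2 * g**2 + 4 * g - 3),
        # Fisher: γ_χ - ν(2-η) should equal γ(2γ-3)/(1-γ), not zero
        "Fisher": (gamma_chi - nu * (2 - eta)) - g * (2 * g - 3) / (1 - g),
    }

for g in (0.5, 0.8, 0.95):
    r = dsage_residuals(g)
    assert all(abs(v) < 1e-9 for v in r.values()), (g, r)
print("all checked D-SAGE residuals ≈ 0")
```

This is only a consistency check of the exponent algebra, not a substitute for the Gröbner/Lean proofs; it is, however, the same kind of computation the browser function performs on a measured γ.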

### Paper 1 erratum

Paper 1 originally claimed η = 2γ. Sage Groebner + Lean Mathlib4 detected this is algebraically wrong (residual (−4γ³+5γ+1)/(1−γ) > 0 ∀γ ∈ Phase A). Correct value: η = γ − 1, satisfying D-SAGE-1.

### Reproducibility

```shell
# Sage verification
docker run --rm -v "$(pwd)/analysis:/work" sagemath/sagemath:latest \
    sage /work/sage_recursive_sweep_2026-04-30.sage

# Lean verification
docker run --rm -v "$(pwd)/lean_taf:/work" \
    leanprovercommunity/lean:latest \
    -c "cd /work/taf && lake build"
```

Build success: 1973/1973 jobs (Mathlib4 + 15 TAF theorems), DONE_EXIT=0.

- Lean code: `lean_taf/taf/Taf/Identities.lean`
- Sage script: `analysis/sage_recursive_sweep_2026-04-30.sage`


## How you can help

This tool is still early. There's a long way to go.

- 🐛 Report bugs: https://github.com/karlesmarin/tafagent/issues
- 🌐 Translate: add a language to `js/i18n.js` and send a PR
- 🧪 Falsify a prediction: run the tool on a model where you have ground-truth measurements; if our verdict disagrees with reality, open an issue. We take refutations as seriously as confirmations.
- ➕ New recipe: implement an X-N recipe in `python/taf_browser.py` following the pattern of X-1…X-19
- ➕ New preset: add a popular model to the `PRESETS` dict
- 📝 Improve docs / examples: anything that helps the next person

## Citation

If this tool helps you, cite the paper or the code:

```bibtex
@article{marin2026predicting,
  author  = {Marin, Carles},
  title   = {Predicting How Transformers Attend: Analytic Power-Law Theory,
             Phase Transitions, and Practical Compression Tools},
  year    = {2026},
  url     = {https://zenodo.org/records/19826343},
}

@misc{marin2026tafagent,
  author = {Marin, Carles},
  title  = {{TAF Agent}: Browser-Based Transformer Diagnostic Tool},
  year   = {2026},
  url    = {https://karlesmarin.github.io/tafagent},
}
```

## License

Apache-2.0 (this code).

Synthesis model: Qwen2.5-0.5B-Instruct distributed under Apache-2.0.


## Acknowledgements

This tool would not exist without:

- The model commons: EleutherAI, Meta AI, the Alibaba Qwen team, Mistral AI, Google DeepMind, Microsoft Research, AI2, BigScience, TII, DeepSeek-AI, the HuggingFace SmolLM team, the Mamba authors, the RWKV community, and OpenAI for releasing weights and configs publicly.
- The infrastructure commons: Pyodide, WebLLM, HuggingFace Hub, GitHub Pages, the jsDelivr CDN.
- The maintainers of transformers, numpy, scipy, sympy, tokenizers, accelerate, and the dozens of small libraries that make modern ML possible.
- The wider ML community — reproducibility checkers, Discord moderators, Stack Overflow answerers, and technical bloggers (Lilian Weng, Andrej Karpathy, Sebastian Raschka, Jay Alammar, Sasha Rush, Phil Wang, the EleutherAI team, and many more) whose explanations carried the author through every concept this tool uses.
- Large language models as research instruments — Claude (Anthropic), GPT (OpenAI), Gemini (Google DeepMind), Mistral, Llama, DeepSeek, Grok, Qwen-Chat, and Microsoft phi — for the symbolic derivations, Sage cross-checks, prose revision, audit work, and long-form co-writing that underlie both this tool and the paper.

The author was the hand that typed; the work itself belongs to the commons that made it possible.