karlexmarin Claude Opus 4.7 (1M context) committed on
Commit cb542c8 · 1 Parent(s): 77f164d

docs(v0.5.3): README changelog + HF Post communicating audit-driven fixes


- README.md: new "What's new in v0.5.3" section listing all fixes,
calibration audit table (panel n=22), and paper §5.2 erratum reference.
- docs/hf-post-v053-fix.md: public-facing post for HF Space community,
with model-by-model impact table, what was fixed / not affected,
and verification instructions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)
  1. README.md +72 -0
  2. docs/hf-post-v053-fix.md +160 -0
README.md CHANGED
@@ -211,6 +211,78 @@ detect anomalous checkpoints.
 
  ---
 
+ ## What's new in v0.5.3 (2026-05-02) — 🔧 Audit-driven bug fixes
+
+ The TAF Agent was **applied to its own author's paper** (recursive Sócrates audit)
+ and to the agent's own formula implementations. Several real bugs were detected
+ and corrected. **All v0.5.0–v0.5.2 users running diagnostics on Phase B models
+ (γ > 1: LLaMA-2/3, Mistral, Gemma) or on models near the Hagedorn boundary
+ (e.g. Qwen2.5-7B) received incorrect KV-compression recommendations.** This
+ release fixes all known issues.
+
+ ### Critical fixes
+
+ - **`D_f_closed` (KV compression window)**: replaced asymptotic / Hagedorn-buffer
+ branches with a **discrete cumulative sum**. Old code clamped Phase B (γ>1) to
+ N when the true value is far smaller (LLaMA-3-8B at γ=1.046 with N=2000 should
+ compress to ~750 tokens, about 38 % of N; old code returned 2000). Boundary
+ γ ∈ [0.99, 1.01] was off by ~2×. Now exact for any γ.
+
+ - **`partition_Z(γ=1, N)`**: was `log(N + 0.5)`, missing the Euler-Mascheroni
+ constant γ_E ≈ 0.577 (~7 % underestimate of H_N). Now `log(N) + γ_E`.
+
+ - **`free_energy_F`**: returned `−log(Z)` (β·F convention). Now `−log(Z)/γ`,
+ consistent with the Helmholtz definition F = −T·log(Z) and the
+ thermodynamic identity S = γ·(U − F).
+
+ - **`γ_pred`**: replaced the obsolete `C/lnθ` heuristic with `γ_Padé(θ, T_eval)`
+ (paper §3.3).
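
As a quick sanity check of the `partition_Z(γ=1, N)` item above, the sketch below compares both closed forms quoted in the changelog against the exact harmonic number H_N. The function names are illustrative assumptions and are not taken from the TAF Agent source.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant γ_E


def harmonic_exact(n: int) -> float:
    """Exact H_N = sum_{d=1..N} 1/d, i.e. Z(γ=1, N)."""
    return math.fsum(1.0 / d for d in range(1, n + 1))


def z_old(n: int) -> float:
    """Pre-v0.5.3 form quoted in the changelog: log(N + 0.5)."""
    return math.log(n + 0.5)


def z_new(n: int) -> float:
    """v0.5.3 form quoted in the changelog: log(N) + γ_E."""
    return math.log(n) + EULER_GAMMA


def relative_error(approx: float, exact: float) -> float:
    """Relative deviation of an approximation from the exact value."""
    return abs(approx - exact) / exact


if __name__ == "__main__":
    n = 2000
    exact = harmonic_exact(n)
    print(f"H_N exact : {exact:.5f}")
    print(f"old error : {relative_error(z_old(n), exact):.2%}")
    print(f"new error : {relative_error(z_new(n), exact):.2%}")
```

At N = 2000 the old form should land near the ~7 % underestimate stated above, while the new form differs from H_N only by the next-order term of about 1/(2N).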
+
+ ### Calibration audit (cross-panel re-check, n=22)
+
+ Re-running the empirical δ corrections of `gamma_decompose` against the
+ panel revealed:
+
+ | Constant | Hardcoded | Panel re-audit | Verdict |
+ |---|---|---|---|
+ | δ_GQA | +0.11 | +0.115 | ✓ replicates |
+ | δ_SWA | −0.21 | originally fit on **n=1 model** | ✗ disabled (insufficient data) |
+ | δ_post_IH | −0.15 | group-mean ≈ 0 (n=16 yes / 6 no) | ⚠ flagged exploratory |
+ | δ_instruct (v2) | −0.10 | n=3, p=0.06 (already noted) | ⚠ flagged exploratory |
+
+ `gamma_decompose` and `gamma_decompose_v2` now return per-axis status fields
+ (`delta_SWA_status`, `delta_post_IH_status`, etc.) and a top-level
+ `calibration_warning` so consumers can detect which corrections are reliable.
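
A consumer of these fields might filter on them roughly as follows. This is a minimal sketch: the result dictionary is a hypothetical stand-in for the real return value, and the key `delta_GQA_status` and the status strings `"replicates"` and `"exploratory"` are assumptions made for illustration. Only `delta_SWA_status`, `delta_post_IH_status`, `calibration_warning` and the value `'exploratory_n1_disabled'` appear in this commit.

```python
def reliable_axes(result: dict) -> list[str]:
    """Return the axis names whose status field marks them as replicated.

    Assumes every per-axis status key ends in "_status" and that a
    replicated axis carries the (hypothetical) value "replicates".
    """
    axes = []
    for key, value in result.items():
        if key.endswith("_status") and value == "replicates":
            axes.append(key[: -len("_status")])
    return sorted(axes)


# Hypothetical return value, shaped after the fields named in the changelog.
example_result = {
    "delta_GQA_status": "replicates",               # assumed key and value
    "delta_SWA_status": "exploratory_n1_disabled",  # value quoted in the HF post
    "delta_post_IH_status": "exploratory",          # assumed value
    "calibration_warning": True,                    # assumed to be a boolean
}

if __name__ == "__main__":
    if example_result.get("calibration_warning"):
        print("Some corrections are exploratory; trusted axes:",
              reliable_axes(example_result))
```

The real field values should be checked against the tool's output before relying on string comparisons like these.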
+
+ The TAF Card UI now displays a collapsible **"v0.5.3 — Calibration audit"
+ banner** in all four supported languages (EN/ES/FR/ZH) explaining this.
+
+ ### Paper §5.2 erratum
+
+ The framework's **self-audit** found that paper §5.2 Theorem 5.2 claims
+ `C_V(γ=1, N) = (log N)²/4`. Sócrates triangulation (numerical Python +
+ Sage exact rational + SymPy symbolic integral) confirms the correct
+ asymptotic is `(log N)²/12`, a factor-3 error in the paper's truncated
+ Z-expansion proof. The agent's `heat_capacity_Cv` already computes the
+ correct value via numerical derivative of U; **only the paper's analytic
+ formula is wrong, not the tool**. A formal erratum will be published as a
+ separate document.
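
The factor-3 claim can be checked numerically with a small sketch like the one below. It assumes the energy is log d, the weights are p(d) ∝ d^(−γ) for d = 1..N, and T = 1/γ, which is how the changelog describes U and F. The implementation and the name `heat_capacity_fd` are illustrative and not the agent's own code.

```python
import numpy as np


def mean_log_d(gamma: float, n: int) -> float:
    """U(γ, N) = <log d> under p(d) ∝ d^(-γ), d = 1..N (assumed model)."""
    d = np.arange(1, n + 1, dtype=np.float64)
    log_d = np.log(d)
    w = np.exp(-gamma * log_d)
    return float(np.sum(w * log_d) / np.sum(w))


def heat_capacity_fd(gamma: float, n: int, h: float = 1e-4) -> float:
    """C_V = dU/dT via central finite difference, with T = 1/γ."""
    t = 1.0 / gamma
    u_hi = mean_log_d(1.0 / (t + h), n)
    u_lo = mean_log_d(1.0 / (t - h), n)
    return (u_hi - u_lo) / (2.0 * h)


if __name__ == "__main__":
    n = 10**5
    log_n = np.log(n)
    cv = heat_capacity_fd(1.0, n)
    print(f"C_V (finite difference): {cv:.3f}")
    print(f"(log N)^2 / 12         : {log_n**2 / 12:.3f}")
    print(f"(log N)^2 / 4          : {log_n**2 / 4:.3f}")
```

The agreement with `(log N)²/12` is only asymptotic, so at finite N the finite-difference value sits somewhat above it, but it stays far from `(log N)²/4`.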
+
+ ### Tests
+
+ 22/22 unit tests pass (`tests/test_taf_formulas.py`), including regression
+ tests for D_f Phase B, partition_Z γ_E, free_energy_F convention, and
+ δ_SWA disabled.
+
+ ### Why this happened
+
+ These bugs survived prior reviews because the affected code paths were
+ exercised mainly on Phase A models (γ < 0.95) where the asymptotic
+ approximation is close enough. Phase B (γ > 1) and the boundary near
+ Hagedorn (|γ−1| < 0.05) were under-tested. The agent now uses direct
+ discrete computation, so accuracy is uniform across all γ.
+
+ ---
+
  ## What's new in v0.5 (2026-05-01) — 🔬 Machine-verified consistency
 
  **First transformer-attention framework with formal machine-proof backing.**
docs/hf-post-v053-fix.md ADDED
@@ -0,0 +1,160 @@
+ # 🔧 TAF Agent v0.5.3 — Audit-driven bug fixes
+
+ **TL;DR**: If you ran the TAF Agent on a model with γ > 1 (LLaMA-2/3,
+ Mistral, Gemma) or near the Hagedorn boundary (e.g. Qwen2.5-7B) before
+ v0.5.3 (2026-05-02), the KV-compression recommendation (`D_f`) was
+ probably wrong. The agent has been corrected end-to-end. Re-run your
+ diagnostics.
+
+ ---
+
+ ## What was wrong
+
+ The agent applies a self-audit to its own paper. Yesterday I turned the
+ audit on the agent itself. It found six issues. Three are critical
+ enough that I'm posting publicly.
+
+ ### 1. `D_f_closed` Phase B (γ>1) — wrong by 30–95 %
+
+ For models with γ > 1 (Phase B, where attention is locally concentrated)
+ the old asymptotic formula clamped to the full context length N, when
+ the true compression target is well below N: roughly 30–40 % of N for
+ the models in the table, and ~2 % in the γ = 1.5 stress test.
+
+ | Model | γ | Old `D_f` (N=2000, f=0.9) | Correct `D_f` |
+ |---|---|---|---|
+ | LLaMA-2-7B | 1.026 | 2000 (clamped) | ~830 |
+ | LLaMA-3-8B | 1.046 | 2000 (clamped) | ~750 |
+ | Gemma-2-9B random | 1.135 | 2000 (clamped) | ~610 |
+ | (γ = 1.5 stress test) | 1.500 | 2000 (clamped) | ~44 |
+
+ If you used the agent's compression suggestion for any of these, you
+ were leaving real memory savings on the table.
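
The discrete cumulative-sum idea behind the corrected `D_f` can be sketched as follows. This assumes `D_f` is the smallest window D whose cumulative weight Σ d^(−γ) over d = 1..D reaches a fraction f of the total over d = 1..N. That definition is inferred from the table and is consistent with its ~750 and ~44 entries, but the function name and implementation are illustrative, not the agent's code.

```python
import numpy as np


def d_f_discrete(gamma: float, n: int, f: float = 0.9) -> int:
    """Smallest D with sum_{d<=D} d^-γ >= f * sum_{d<=N} d^-γ (assumed definition)."""
    d = np.arange(1, n + 1, dtype=np.float64)
    cumulative = np.cumsum(d ** (-gamma))
    target = f * cumulative[-1]
    # searchsorted returns the first index whose cumulative weight is >= target.
    index = int(np.searchsorted(cumulative, target, side="left"))
    return min(index + 1, n)  # convert 0-based index to a window length


if __name__ == "__main__":
    for label, gamma in [("LLaMA-3-8B", 1.046), ("stress test", 1.5)]:
        print(f"{label:12s} γ={gamma:<6} D_f={d_f_discrete(gamma, 2000)}")
```

Because the sum is evaluated directly, the same code path serves γ < 1, γ ≈ 1 and γ > 1 without any special-case branch.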
+
+ ### 2. Hagedorn buffer (|γ − 1| < 0.01) — factor 2× off
+
+ Models living right at the phase boundary (Qwen2.5-7B at γ=0.997, etc.)
+ hit a hardcoded special case `N · f^(1/log N)` instead of the correct
+ `N^f`. Off by ~2×.
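
To see where the ~2× comes from, the two closed forms can be evaluated side by side at the N = 2000, f = 0.9 setting used in the table above. The function names are hypothetical; only the two formulas come from the post.

```python
import math


def hagedorn_old(n: int, f: float) -> float:
    """Old hardcoded special case quoted in the post: N * f^(1 / log N)."""
    return n * f ** (1.0 / math.log(n))


def hagedorn_new(n: int, f: float) -> float:
    """Leading-order form at γ = 1 quoted in the post: N^f."""
    return float(n) ** f


if __name__ == "__main__":
    n, f = 2000, 0.9
    old, new = hagedorn_old(n, f), hagedorn_new(n, f)
    print(f"old: {old:.0f}  new: {new:.0f}  ratio: {old / new:.2f}")
```

The old expression barely moves away from N for any reasonable f, which is why it overshoots by about a factor of two here.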
+
+ ### 3. `δ_SWA = −0.21` calibration — fit on n = 1 model
+
+ The architectural decomposition `gamma_decompose` carried a SWA
+ correction of −0.21 derived from a single Sliding-Window-Attention
+ model in the panel. With n = 1 you cannot estimate a coefficient; the
+ constant was effectively arbitrary. **Now disabled** with explicit
+ status flag `delta_SWA_status: 'exploratory_n1_disabled'`.
+
+ `δ_post_IH = −0.15` and `δ_instruct = −0.10` did not replicate cleanly
+ on the panel re-audit either; both now carry `exploratory` flags.
+ Only `δ_GQA = +0.11` (panel-mean +0.115) replicates. **The most
+ reliable axes are now `δ_GQA` and the `ν_imprint` slope.**
+
+ ---
+
+ ## What was fixed
+
+ - `D_f_closed` rewritten to use a **direct discrete cumulative sum**:
+ exact for any γ, no asymptotics, no buffers. ~10 ms per call for
+ N ≤ 10⁶.
+
+ - `partition_Z(γ=1, N)` now adds the Euler-Mascheroni constant
+ (~7 % accuracy fix on H_N).
+
+ - `free_energy_F` switched to physics convention `F = −log(Z)/γ`,
+ consistent with `S = γ·(U − F)`.
+
+ - `γ_pred` now uses `γ_Padé(θ, T_eval)` instead of the obsolete
+ `C/lnθ` heuristic.
+
+ - `gamma_decompose` and `gamma_decompose_v2` return per-axis
+ reliability flags + a top-level `calibration_warning`.
+
+ - TAF Card UI shows a **collapsible "v0.5.3 — Calibration audit"
+ banner in all four supported languages** (EN / ES / FR / ZH).
+
+ - 22 unit tests added (`tests/test_taf_formulas.py`), all passing.
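
The `free_energy_F` convention change can be checked against the identity `S = γ·(U − F)` with a short sketch. It assumes weights p(d) = d^(−γ)/Z over d = 1..N, energy log d, and Shannon entropy S = −Σ p log p. The function name is illustrative and not taken from the agent.

```python
import numpy as np


def thermo_quantities(gamma: float, n: int):
    """Return (U, S, F_new, F_old) for p(d) ∝ d^(-γ), d = 1..N (assumed model)."""
    d = np.arange(1, n + 1, dtype=np.float64)
    w = d ** (-gamma)
    z = w.sum()                              # partition function Z(γ, N)
    p = w / z
    u = float(np.sum(p * np.log(d)))         # U = <log d>
    s = float(-np.sum(p * np.log(p)))        # Shannon entropy
    f_new = float(-np.log(z) / gamma)        # v0.5.3 convention: F = -log(Z)/γ
    f_old = float(-np.log(z))                # previous convention (β·F)
    return u, s, f_new, f_old


if __name__ == "__main__":
    gamma = 1.046
    u, s, f_new, f_old = thermo_quantities(gamma, 2000)
    print(f"S                 : {s:.6f}")
    print(f"γ·(U − F), new F  : {gamma * (u - f_new):.6f}")
    print(f"γ·(U − F), old F  : {gamma * (u - f_old):.6f}")
```

With the new convention the identity holds to floating-point precision; with the old one the two sides differ by (γ − 1)·log Z, which vanishes only at γ = 1.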
+
+ ---
+
+ ## What was *not* affected
+
+ These formulas were verified independently and remain correct:
+
+ - `gamma_pade`, `theta_design`, `alpha_opt`, `theta_eff_pade`
+ - `mean_log_d`, `entropy_S` (the `F` convention changed, but the
+ identity `S = γ·(U − F)` still holds)
+ - `heat_capacity_Cv`: numerical derivative of `mean_log_d`, so it
+ computes the correct value automatically. The **paper §5.2 analytic
+ formula `(log N)²/4` is wrong**, but the agent never used it; the
+ finite difference yields the correct asymptotic `(log N)²/12`.
+ - `d_horizon`, `L_NIAH^c`, `χ`, `T_attn`
+ - `gamma_random_predict`, `compute_invariant_K`, `ih_phase_check`
+ - All the verified algebraic identities (D-SAGE-1 through 7)
+
+ ---
+
+ ## Paper §5.2 erratum (separate)
+
+ While auditing the agent, the framework also caught an algebraic error
+ in the companion paper. Paper §5.2 Theorem 5.2 claims:
+
+ ```
+ C_V(γ = 1, N) = (log N)² / 4
+ ```
+
+ Triple triangulation (Sócrates numerical + Sage exact rational + SymPy
+ symbolic integration) shows the correct asymptotic is:
+
+ ```
+ C_V(γ = 1, N) → (log N)² / 12 (large N)
+ ```
+
+ The proof in the paper truncated Z(γ, N) at first order in (1−γ),
+ missing a (1−γ)²·(log N)²/6 term. A formal erratum is in preparation
+ and will be published as a separate document.
+
+ This does not affect any of the agent's numerical outputs: the agent
+ computes `C_V` via numerical derivative, not the buggy analytic form.
+ It only affects the analytic claim in the paper.
+
+ ---
+
+ ## How to verify
+
+ ```bash
+ git clone https://github.com/karlesmarin/tafagent
+ cd tafagent
+ pytest tests/test_taf_formulas.py # 22/22 should pass
+ ```
+
+ Or just open the live Space: the calibration banner shows up
+ immediately at the top of any TAF Card output.
+
+ ---
+
+ ## Why I'm telling you this
+
+ If you used a tool's recommendation to change a real production setup
+ (KV cache size, RoPE scaling, model selection) and the tool was wrong,
+ you deserve to know. That's the point of "auditable, deterministic,
+ in-browser": not just that it's transparent in the abstract, but that
+ when a bug is found it gets reported. Today there's a bug to report.
+
+ The audit framework that found these is itself in early development
+ (Sócrates v0.1, internal use). The fact that it caught real issues in
+ its own author's published paper and shipped tool is, honestly, the
+ strongest validation it has so far.
+
+ If you spot anything else wrong, please open an issue.
+
+ — Carles Marín
+ *Independent researcher*
+ *2026-05-02*
+
+ ---
+
+ **Links**:
+ - Live: https://huggingface.co/spaces/karlexmarin/taf-agent
+ - Source: https://github.com/karlesmarin/tafagent
+ - Paper: https://zenodo.org/records/19826343
+ - Dataset: https://huggingface.co/datasets/karlexmarin/taf-attention-decay