galimova committed on
Commit 72fa8f0 · verified · 1 Parent(s): 7a73d66

Add diagnostic report and analysis

Files changed (1)
  1. README.md +186 -0
README.md ADDED
@@ -0,0 +1,186 @@
+ # Dynamic mHC µP Diagnostic Report
+ ## Empirical Verification of Theorem Conditions for V4-style Dynamic Residual Routing
+
+ **Date:** 2026-01-05
+ **Config:** DeepSeek V4 (n_s=4, K=20, hidden_size=7168, 61 layers)
+
+ ---
+
+ ## Executive Summary
+
+ We ran all six diagnostics prescribed by the updated plan against a pure-PyTorch reimplementation of DeepSeek V4's dynamic mHC mechanism, sweeping width d ∈ {64, 128, 256, 512, 1024} with fixed n_s = 4 and K = 20.
+
+ **The headline finding:** Under standard initialization (σ_W = 0.02, s_C = 0.1), the dynamic generator sensitivity ||DC(x)|| · ||x|| scales as **d^{0.98} ≈ d^1**, violating the O(1) condition required by Theorem D (condition 4). This is the central technical obstacle for µP transfer in dynamic mHC.
+
+ **The fix:** Multiple corrected parameterizations restore O(1) scaling. The cleanest are:
+ - **Correction A** (s_C = 1/(n_s·d)): yields d^{-0.035} ≈ O(1) for comb
+ - **Correction C** (σ_W = 1/(n_s·d)): yields d^{-0.037} ≈ O(1) for **all three** (comb, pre, post)
+
+ Correction C is the µP-natural choice: it sets the standard deviation of the generator weight W_C to 1/fan-in (variance 1/fan-in²), which is the scaling µP uses for readout (output) weights in its standard formulation. The hidden-layer rule σ_W = 1/√fan-in (Correction E) is not enough and still leaves d^{0.47} growth.
+
+ ---
+
+ ## Diagnostic Results
+
+ ### Diagnostic 1: Finite-K Sinkhorn Error ε_K
+
+ | Width | ε_K (mean) | ε_K (max) |
+ |-------|-----------|-----------|
+ | 64 | 1.0e-6 | 1.0e-6 |
+ | 128 | 1.0e-6 | 1.0e-6 |
+ | 256 | 1.0e-6 | 1.0e-6 |
+ | 512 | 1.0e-6 | 1.0e-6 |
+ | 1024 | 1.0e-6 | 1.0e-6 |
+
+ **Conclusion:** ε_K ≈ eps = 1e-6 (the additive epsilon in the Sinkhorn loop dominates). With K=20 and n_s=4, Sinkhorn convergence is essentially exact. The finite-K error is **not** a practical concern for V4's configuration.
+
+ **Impact on Theorem A:** The frozen realized comb stability theorem holds essentially exactly: ||C||_2 = 1 + O(ε_K) with ε_K ≈ 10^{-6}.
+
+ ### Diagnostic 2: Frozen Mixer Spectral Norm ||C||_2
+
+ | Width | ||C||_2 (mean) | κ(C) (mean) |
+ |-------|----------------|-------------|
+ | 64 | 0.999999 | 1167 |
+ | 128 | 0.999999 | 652 |
+ | 256 | 0.999999 | 409 |
+ | 512 | 0.999999 | 462 |
+ | 1024 | 0.999999 | 264 |
+
+ **Conclusion:** ||C||_2 ≈ 1 to 6 decimal places. The Birkhoff constraint works exactly as Theorem A predicts. The realized comb is nonexpansive.
+
+ **Note:** The high condition number κ(C) means that while the maximum singular value is 1, the minimum singular value is small (roughly 1/κ, about 0.001 to 0.004). C is therefore close to rank-deficient. A near-permutation matrix would give κ ≈ 1, since permutation matrices are orthogonal, so this is more consistent with C having nearly collinear rows, as happens when the Sinkhorn output approaches an averaging matrix such as the uniform (1/n_s)·11ᵀ.
+
+ ### Diagnostic 3: Sinkhorn Quotient-Jacobian Spectrum
+
+ This diagnostic is **width-independent** (operates on fixed n_s=4 matrices).
+
+ | K | σ_max | σ_min | κ | Gauge Leakage |
+ |----|-------|-------|------|---------------|
+ | 1 | 0.358 | 0.065 | 6.86 | 0.278 |
+ | 2 | 0.353 | 0.064 | 6.93 | 0.091 |
+ | 5 | 0.354 | 0.063 | 8.13 | 0.006 |
+ | 10 | 0.355 | 0.058 | 9.92 | 0.0001 |
+ | 20 | 0.354 | 0.067 | 7.40 | 0.000002 |
+ | 50 | 0.355 | 0.063 | 7.91 | 0.000000 |
+
+ **Key Findings:**
+
+ 1. **The quotient Jacobian is well-conditioned.** κ ≈ 7-10 across all K values. This validates Theorem C's assumption.
+
+ 2. **σ_max ≈ 0.35, σ_min ≈ 0.06.** The Sinkhorn projection is a **contraction** on G^⊥ (σ_max < 1). Perturbations are damped, not amplified.
+
+ 3. **Gauge leakage drops exponentially with K.** At K=20, leakage is 2e-6. The Sinkhorn Jacobian maps G^⊥ almost perfectly into G^⊥.
+
+ 4. **The spectrum is K-independent for K ≥ 5.** Convergence is fast.
+
+ 5. **The (n_s-1)² = 9 singular values** have a smooth distribution: all gauge-perpendicular directions are treated comparably.
+
+ ### Diagnostic 4: Dynamic Sensitivity (THE KEY RESULT)
+
+ | Width | ||DC(x)||·||x|| | ||Dp(x)||·||x|| | ||Dq(x)||·||x|| |
+ |-------|-----------------|-----------------|-----------------|
+ | 64 | 0.147 | 0.139 | 0.274 |
+ | 128 | 0.284 | 0.271 | 0.543 |
+ | 256 | 0.559 | 0.532 | 1.066 |
+ | 512 | 1.112 | 1.052 | 2.110 |
+ | 1024 | 2.235 | 2.086 | 4.158 |
+
+ **Scaling exponents:**
+ - ||DC(x)||·||x|| ~ **d^{0.98}**
+ - ||Dp(x)||·||x|| ~ **d^{0.98}**
+ - ||Dq(x)||·||x|| ~ **d^{0.98}**
+
+ **This is the smoking gun.** All three dynamic sensitivities scale linearly with width.
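The exponents above can be re-derived from the table with an ordinary least-squares fit in log-log space. The sketch below is not taken from `mhc_analysis.py`; the helper name `fit_exponent` and the dictionary layout are illustrative assumptions, and only the numbers come from the table.

```python
import math


def fit_exponent(widths, values):
    """Least-squares slope of log(value) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Values copied from the Diagnostic 4 table.
WIDTHS = [64, 128, 256, 512, 1024]
SENSITIVITY = {
    "DC": [0.147, 0.284, 0.559, 1.112, 2.235],
    "Dp": [0.139, 0.271, 0.532, 1.052, 2.086],
    "Dq": [0.274, 0.543, 1.066, 2.110, 4.158],
}
EXPONENTS = {name: fit_exponent(WIDTHS, vals) for name, vals in SENSITIVITY.items()}

if __name__ == "__main__":
    for name, alpha in EXPONENTS.items():
        print(f"{name}: fitted exponent {alpha:.3f}")
```

A slope near 1 means the quantity doubles whenever the width doubles, which is what "scales linearly with width" refers to.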
+
+ #### Jacobian Chain Decomposition
+
+ | Component | Scaling | Value at d=1024 |
+ |-----------|---------|-----------------|
+ | ||D(RMSNorm)||₂ | d^0 ≈ O(1) | 1.000 |
+ | ||W_comb||₂ | d^{0.45} ≈ √d | 1.349 |
+ | |s_C| | O(1) | 0.100 |
+ | ||DS_K||₂ | d^0 ≈ O(1) | 0.262 |
+ | ||x|| | d^{0.50} = √d | 64.01 |
+
+ **The two √d factors:**
+ 1. **||W_comb||₂ ~ √d**: Generator weight spectral norm (fan-in = n_s·d)
+ 2. **||x|| ~ √d**: Multi-stream state norm (n_s·d entries)
+
+ Product (in table order): O(1) · √d · O(1) · O(1) · √d = **Θ(d)**
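As a sanity check on the decomposition, the five factors at d = 1024 can be multiplied and compared with the measured sensitivity from Diagnostic 4. This is a minimal arithmetic sketch; the variable names are illustrative, and it assumes the product of operator norms is meant as a submultiplicative upper bound on ||DC(x)||·||x||.

```python
import math

# Values at d = 1024, copied from the chain decomposition table.
FACTORS = {
    "D(RMSNorm)": 1.000,
    "W_comb": 1.349,
    "s_C": 0.100,
    "DS_K": 0.262,
    "x": 64.01,
}

# Submultiplicativity: the norm of the composed Jacobian is at most the
# product of the individual norms.
CHAIN_BOUND = math.prod(FACTORS.values())

# Measured ||DC(x)||·||x|| at d = 1024 (Diagnostic 4 table).
MEASURED = 2.235

if __name__ == "__main__":
    print(f"chain bound = {CHAIN_BOUND:.3f}, measured = {MEASURED:.3f}")
```

The product comes to about 2.26, just above the measured 2.235, so the chain bound is nearly tight at this width.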
+
+ ### Diagnostic 5: Generated-Logit Update Scale
+
+ | Width | ||Π_{G^⊥} ΔZ||₂ | Perp/Total Ratio |
+ |-------|------------------|------------------|
+ | 64 | 0.00317 | 0.998 |
+ | 128 | 0.00553 | 0.997 |
+ | 256 | 0.01006 | 0.997 |
+ | 512 | 0.01950 | 0.994 |
+ | 1024 | 0.03826 | 0.996 |
+
+ **Conclusions:**
+ 1. Almost all of ΔZ lies in G^⊥ (perp/total ratio ≥ 0.994 at every width). Gradient updates naturally avoid gauge directions.
+ 2. ||Π_{G^⊥} ΔZ||₂ grows with width. A log-log fit over the table gives about d^{0.90}, and the per-doubling ratio rises from 1.74 to 1.96, so the growth is close to linear rather than √d. This requires LR compensation.
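The growth rate of the update norm can be read off the table by computing a local exponent between consecutive widths: a value near 0.5 would indicate √d growth, a value near 1 linear growth. The sketch below uses only the table values; the helper name `local_exponents` is an illustrative assumption.

```python
import math

# Values copied from the Diagnostic 5 table.
WIDTHS = [64, 128, 256, 512, 1024]
PERP_UPDATE = [0.00317, 0.00553, 0.01006, 0.01950, 0.03826]


def local_exponents(widths, values):
    """Exponent alpha such that value ~ width**alpha between neighbouring rows."""
    pairs = list(zip(widths, values))
    return [
        math.log(v1 / v0) / math.log(w1 / w0)
        for (w0, v0), (w1, v1) in zip(pairs, pairs[1:])
    ]


LOCAL = local_exponents(WIDTHS, PERP_UPDATE)

if __name__ == "__main__":
    for (w0, w1), alpha in zip(zip(WIDTHS, WIDTHS[1:]), LOCAL):
        print(f"{w0:>4} -> {w1:<4}: local exponent {alpha:.2f}")
```

For these five rows the local exponents lie between roughly 0.80 and 0.97 and increase with width.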
+
+ ### Diagnostic 6: Pre/Post Gate Statistics
+
+ | Width | p̄ (mean) | ||p||₁ | q̄ (mean) | ||q||_∞ |
+ |-------|----------|--------|----------|---------|
+ | 64 | 0.498 | 1.993 | 0.998 | 1.014 |
+ | 128 | 0.501 | 2.003 | 1.000 | 1.024 |
+ | 256 | 0.499 | 1.998 | 1.002 | 1.033 |
+ | 512 | 0.497 | 1.989 | 1.005 | 1.053 |
+ | 1024 | 0.502 | 2.009 | 1.008 | 1.072 |
+
+ Gates are stable across widths. Pre-weights center at 0.5 (sigmoid midpoint), post-weights at 1.0 (2·sigmoid(0)).
+
+ ---
+
+ ## Corrected Parameterizations
+
+ | Correction | Description | ||DC||·||x|| scaling |
+ |------------|-------------|---------------------|
+ | Baseline | σ_W=0.02, s_C=0.1 | d^{0.98} ❌ |
+ | A | s_C = 1/(n_s·d) | d^{-0.04} ✅ (comb only) |
+ | B | σ_W = s_C = 1/√(n_s·d) | d^{-0.04} ✅ |
+ | **C** | **σ_W = 1/(n_s·d), s_C=0.1** | **d^{-0.04} ✅ (all gates)** |
+ | D | s_C = 1/√d, σ_W=0.02 | d^{0.46} ❌ |
+ | E | σ_W = 1/√(n_s·d), s_C=0.1 | d^{0.47} ❌ |
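The pattern in this table can be anticipated with a back-of-the-envelope model, sketched below under stated assumptions. It estimates the spectral norm of an m×n Gaussian matrix with entry standard deviation σ as σ(√m + √n), takes ||x|| = √(n_s·d), treats the RMSNorm and Sinkhorn Jacobian factors as width-independent constants, and assumes the comb block of the generator has n_s² output rows. None of this is taken from `mhc_analysis.py`, and where a table row leaves a parameter unstated (σ_W in Correction A) the baseline value is assumed.

```python
import math

N_S = 4
WIDTHS = [64, 128, 256, 512, 1024]


def fit_exponent(widths, values):
    """Least-squares slope of log(value) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return num / sum((x - x_mean) ** 2 for x in xs)


def predicted_sensitivity(d, sigma_w, s_c, n_s=N_S):
    """Width-dependent part of s_C * ||W_comb|| * ||x|| (constants omitted)."""
    fan_in = n_s * d       # total multi-stream input dimension
    fan_out = n_s * n_s    # assumed: comb logits only
    w_norm = sigma_w * (math.sqrt(fan_in) + math.sqrt(fan_out))  # Gaussian estimate
    x_norm = math.sqrt(fan_in)
    return s_c * w_norm * x_norm


# Each rule maps a width d to (sigma_W, s_C).
RULES = {
    "Baseline": lambda d: (0.02, 0.1),
    "A": lambda d: (0.02, 1 / (N_S * d)),
    "B": lambda d: (1 / math.sqrt(N_S * d), 1 / math.sqrt(N_S * d)),
    "C": lambda d: (1 / (N_S * d), 0.1),
    "D": lambda d: (0.02, 1 / math.sqrt(d)),
    "E": lambda d: (1 / math.sqrt(N_S * d), 0.1),
}

PREDICTED = {
    name: fit_exponent(WIDTHS, [predicted_sensitivity(d, *rule(d)) for d in WIDTHS])
    for name, rule in RULES.items()
}

if __name__ == "__main__":
    for name, alpha in PREDICTED.items():
        print(f"{name:>8}: predicted exponent {alpha:+.2f}")
```

Under these assumptions the model gives roughly +0.94 for the baseline, about -0.06 for A, B and C, and about +0.44 for D and E, in the same neighbourhood as the measured exponents. The small offsets come from the finite-width √(n_s²) term in the norm estimate, which the asymptotic √d argument ignores.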
+
+ ### The µP Rule for Dynamic mHC Generator Weights
+
+ | Parameter | Init | LR scaling |
+ |-----------|------|------------|
+ | W_ℓ^a ∈ ℝ^{(2+n_s)n_s × n_s·d} | σ² = 1/(n_s·d)² | η/d |
+ | s_ℓ^a ∈ ℝ³ | O(1) | η |
+ | b_ℓ^a ∈ ℝ^{(2+n_s)n_s} | 0 | η |
+
+ **Key insight:** The generator weight's effective fan-in is **n_s·d** (the total multi-stream dimension), not d.
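The table can be turned into a small helper that returns the generator's shape, init scale and learning rates for a given width. This is a hypothetical sketch: `GeneratorHParams` and `generator_hparams` are illustrative names, and η/d is taken literally because the report does not say whether it is meant relative to a base width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorHParams:
    weight_shape: tuple   # ((2 + n_s) * n_s, n_s * d)
    weight_std: float     # sigma, so that sigma^2 = 1 / (n_s * d)^2
    weight_lr: float      # eta / d
    scale_lr: float       # eta (scales s are O(1) at init)
    bias_lr: float        # eta (biases start at 0)


def generator_hparams(n_s, d, eta):
    """Apply the table's rule; the fan-in is n_s * d, not d."""
    fan_in = n_s * d
    rows = (2 + n_s) * n_s  # n_s^2 comb logits + n_s pre + n_s post
    return GeneratorHParams(
        weight_shape=(rows, fan_in),
        weight_std=1.0 / fan_in,
        weight_lr=eta / d,
        scale_lr=eta,
        bias_lr=eta,
    )


# Config from the report header: n_s = 4, hidden_size = 7168.
V4 = generator_hparams(n_s=4, d=7168, eta=1.0)

if __name__ == "__main__":
    print(V4)
```

For the V4 config this gives a 24 × 28672 generator weight with standard deviation 1/28672.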
+
+ ---
+
+ ## Implications for Theorem D
+
+ The five conditions of Theorem D and their status:
+ 1. ✅ Branch f_ℓ^a satisfies standard spectral µP (assumed)
+ 2. ✅ ε_K = O(1), actually ε_K ≈ 10^{-6} (Diagnostic 1)
+ 3. ✅ Quotient Jacobian well-conditioned, κ ≈ 7-10 (Diagnostic 3)
+ 4. ✅ after correction. **Requires Correction C:** σ_W = 1/(n_s·d) gives O(1) (Diagnostic 4)
+ 5. ✅ after correction. **Requires LR scaling:** η_W = Θ(1/d) (Diagnostic 5)
+
+ With both corrections applied, all five conditions of Theorem D are satisfied.
+
+ ---
+
+ ## V4 Sinkhorn Implementation Notes
+
+ From `kernel.py`:
+ 1. **Init:** Row-softmax + eps, then col-normalize
+ 2. **Iterations:** K-1 repetitions of (row-normalize, col-normalize)
+ 3. **Convention:** comb[j,k] with j=output stream, k=input stream
+ 4. **hc_post:** y_o = q_o · f(y) + Σ_i C[i,o] · residual_i
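Steps 1 and 2 can be sketched in plain Python on a single n_s × n_s logit matrix. This is not the code from `kernel.py`: where exactly eps enters (added to every softmax entry and to each normalisation denominator) is an assumption, as are the function names and the example logits.

```python
import math


def row_normalize(mat, eps):
    """Divide each row by its sum (eps in the denominator is an assumption)."""
    return [[v / (sum(row) + eps) for v in row] for row in mat]


def col_normalize(mat, eps):
    """Divide each column by its sum (eps in the denominator is an assumption)."""
    col_sums = [sum(row[k] for row in mat) for k in range(len(mat[0]))]
    return [[v / (col_sums[k] + eps) for k, v in enumerate(row)] for row in mat]


def sinkhorn(logits, num_iters=20, eps=1e-6):
    # Step 1: row-wise softmax plus eps, then one column normalisation.
    comb = []
    for row in logits:
        row_max = max(row)
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        comb.append([e / total + eps for e in exps])
    comb = col_normalize(comb, eps)
    # Step 2: K-1 repetitions of (row-normalise, column-normalise).
    for _ in range(num_iters - 1):
        comb = row_normalize(comb, eps)
        comb = col_normalize(comb, eps)
    return comb


# Arbitrary example logits for n_s = 4 (not from the report).
LOGITS = [
    [0.3, -0.1, 0.2, 0.0],
    [-0.2, 0.4, 0.1, -0.3],
    [0.0, 0.1, -0.4, 0.5],
    [0.2, -0.3, 0.0, 0.1],
]
COMB = sinkhorn(LOGITS)
ROW_ERR = max(abs(sum(row) - 1.0) for row in COMB)
COL_ERR = max(abs(sum(row[k] for row in COMB) - 1.0) for k in range(4))

if __name__ == "__main__":
    print(f"max row-sum error {ROW_ERR:.2e}, max col-sum error {COL_ERR:.2e}")
```

Because the loop ends on a column normalisation, the column sums are the more accurate of the two, and the eps in the denominators keeps both slightly below 1.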
+
+ ## Files
+
+ - `mhc_diagnostics.py`: Complete diagnostic implementation (all 6 diagnostics)
+ - `mhc_analysis.py`: Chain decomposition, corrections, figures