# Dynamic mHC µP Diagnostic Report
## Empirical Verification of Theorem Conditions for V4-style Dynamic Residual Routing

**Date:** 2026-01-05  
**Config:** DeepSeek V4 (n_s=4, K=20, hidden_size=7168, 61 layers)

---

## Executive Summary

We ran all six diagnostics prescribed by the updated plan against a pure-PyTorch reimplementation of DeepSeek V4's dynamic mHC mechanism, sweeping width d ∈ {64, 128, 256, 512, 1024} with fixed n_s = 4 and K = 20.

**The headline finding:** Under standard initialization (σ_W = 0.02, s_C = 0.1), the dynamic generator sensitivity ||DC(x)|| · ||x|| scales as **d^0.98 ≈ d^1**, violating the O(1) condition required by Theorem D (condition 4). This is the central technical obstacle for µP transfer in dynamic mHC.

**The fix:** Multiple corrected parameterizations restore O(1) scaling. The cleanest are:
- **Correction A** (s_C = 1/(n_s·d)): yields d^{-0.035} ≈ O(1) for comb
- **Correction C** (σ_W = 1/(n_s·d)): yields d^{-0.037} ≈ O(1) for **all three** (comb, pre, post)

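Correction C is the µP-natural choice: it initializes the generator weight W_C with standard deviation 1/fan-in, where fan-in = n_s·d (equivalently σ_W² = 1/(n_s·d)²). This is a factor of √(n_s·d) smaller than the usual 1/√fan-in initialization, which is Correction E and still scales as d^{0.47}.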

---

## Diagnostic Results

### Diagnostic 1: Finite-K Sinkhorn Error ε_K

| Width | ε_K (mean) | ε_K (max) |
|-------|-----------|-----------|
| 64    | 1.0e-6    | 1.0e-6    |
| 128   | 1.0e-6    | 1.0e-6    |
| 256   | 1.0e-6    | 1.0e-6    |
| 512   | 1.0e-6    | 1.0e-6    |
| 1024  | 1.0e-6    | 1.0e-6    |

**Conclusion:** ε_K ≈ eps = 1e-6 (the additive epsilon in the Sinkhorn loop dominates). With K=20 and n_s=4, Sinkhorn convergence is essentially exact. The finite-K error is **not** a practical concern for V4's configuration.

**Impact on Theorem A:** The frozen realized comb stability theorem holds essentially exactly: ||C||_2 = 1 + O(ε_K) with ε_K ≈ 10^{-6}.

### Diagnostic 2: Frozen Mixer Spectral Norm ||C||_2

| Width | \|\|C\|\|_2 (mean) | κ(C) (mean) |
|-------|----------------|-------------|
| 64    | 0.999999       | 1167        |
| 128   | 0.999999       | 652         |
| 256   | 0.999999       | 409         |
| 512   | 0.999999       | 462         |
| 1024  | 0.999999       | 264         |

**Conclusion:** ||C||_2 ≈ 1 to 6 decimal places. The Birkhoff constraint works exactly as Theorem A predicts. The realized comb is nonexpansive.

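**Note:** The high condition number κ(C) means that while the maximum singular value is 1, the minimum singular value is small (σ_min = 1/κ, on the order of 10^{-3}), so C is close to rank-deficient. A near-permutation matrix would not produce this, since permutation matrices are orthogonal and have κ = 1. The values are more consistent with Sinkhorn outputs that sit close to the uniform matrix (1/n_s)·11ᵀ, which has rank 1, as expected when the generated logits have only small-to-moderate spread.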
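To see how ||C||_2 = 1 can coexist with a very large κ(C), the sketch below evaluates an illustrative family of doubly stochastic matrices, (1 − t)·I + t·U with U the uniform matrix. This family and the helper names are assumptions chosen for illustration, not the matrices measured in the report. Singular values are obtained by plain power iteration on CᵀC, so no external library is needed.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_eigenvalue(M, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix by power iteration."""
    v = [float(i + 1) for i in range(len(M))]  # generic start vector
    lam = 0.0
    for _ in range(iters):
        w = matvec(M, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(M, v)))  # Rayleigh quotient
    return lam

def extreme_singular_values(C):
    """Return (sigma_max, sigma_min) of a square matrix C via its Gram matrix C^T C."""
    n = len(C)
    gram = [[sum(C[k][i] * C[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    top = top_eigenvalue(gram)
    # Eigenvalues of (top*I - gram) are top - lambda_i, so its largest one locates lambda_min.
    shifted = [[(top if i == j else 0.0) - gram[i][j] for j in range(n)] for i in range(n)]
    bottom = top - top_eigenvalue(shifted)
    return math.sqrt(top), math.sqrt(max(bottom, 0.0))

def blend_with_uniform(t, n=4):
    """(1 - t) * identity + t * uniform matrix; doubly stochastic for 0 <= t <= 1."""
    return [[(1.0 - t) * (1.0 if i == j else 0.0) + t / n for j in range(n)] for i in range(n)]

for label, t in [("near-permutation", 0.05), ("near-uniform", 0.999)]:
    s_max, s_min = extreme_singular_values(blend_with_uniform(t))
    print(f"{label}: sigma_max={s_max:.6f}, kappa={s_max / s_min:.1f}")
```

For this family the singular values are 1 and 1 − t, so κ = 1/(1 − t): about 1.05 at t = 0.05 and about 1000 at t = 0.999, with σ_max = 1 in both cases.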

### Diagnostic 3: Sinkhorn Quotient-Jacobian Spectrum

This diagnostic is **width-independent** (operates on fixed n_s=4 matrices).

| K  | σ_max | σ_min | κ    | Gauge Leakage |
|----|-------|-------|------|---------------|
| 1  | 0.358 | 0.065 | 6.86 | 0.278         |
| 2  | 0.353 | 0.064 | 6.93 | 0.091         |
| 5  | 0.354 | 0.063 | 8.13 | 0.006         |
| 10 | 0.355 | 0.058 | 9.92 | 0.0001        |
| 20 | 0.354 | 0.067 | 7.40 | 0.000002      |
| 50 | 0.355 | 0.063 | 7.91 | 0.000000      |

**Key Findings:**

1. **The quotient Jacobian is well-conditioned.** κ ≈ 7-10 across all K values. This validates Theorem C's assumption.

2. **σ_max ≈ 0.35, σ_min ≈ 0.06.** The Sinkhorn projection is a **contraction** on G^⊥ (σ_max < 1). Perturbations are damped, not amplified.

3. **Gauge leakage drops exponentially with K.** At K=20, leakage is 2e-6. The Sinkhorn Jacobian maps G^⊥ almost perfectly into G^⊥.

4. **The spectrum is K-independent for K ≥ 5.** Convergence is fast.

5. **The (n_s-1)² = 9 singular values** have a smooth distribution — all gauge-perpendicular directions are treated comparably.
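The role of the gauge subspace can be illustrated with a small sketch. It assumes G is the span of row shifts and column shifts of the logits (Z → Z + a·1ᵀ + 1·bᵀ), which has dimension 2·n_s − 1 = 7 and leaves (n_s − 1)² = 9 perpendicular directions, matching the count above. The Sinkhorn routine here is a plain textbook version, not the V4 kernel, and the quantity printed is a simple output difference, related to but not the same as the report's gauge-leakage metric.

```python
import math

def row_normalize(mat):
    out = []
    for row in mat:
        total = sum(row)
        out.append([x / total for x in row])
    return out

def col_normalize(mat):
    totals = [sum(col) for col in zip(*mat)]
    return [[x / t for x, t in zip(row, totals)] for row in mat]

def sinkhorn(logits, num_iters):
    """Plain Sinkhorn: exponentiate, then alternate row and column normalization."""
    mat = [[math.exp(z) for z in row] for row in logits]
    for _ in range(num_iters):
        mat = col_normalize(row_normalize(mat))
    return mat

def add_gauge(logits, row_shift, col_shift):
    """Return Z + a 1^T + 1 b^T: shift row i by a_i and column j by b_j."""
    return [[z + a + b for z, b in zip(row, col_shift)]
            for row, a in zip(logits, row_shift)]

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

n_s = 4
gauge_dim = 2 * n_s - 1            # row shifts + column shifts, minus the shared constant
perp_dim = n_s * n_s - gauge_dim   # equals (n_s - 1) ** 2

# Hypothetical logits and gauge shifts, chosen only for illustration.
Z = [[0.3, -0.2, 0.1, 0.0],
     [-0.4, 0.5, 0.2, -0.1],
     [0.0, 0.1, -0.3, 0.4],
     [0.2, -0.5, 0.3, -0.2]]
Z_shifted = add_gauge(Z, [0.3, -0.1, 0.2, 0.0], [-0.2, 0.4, 0.1, -0.3])

diffs = {K: max_abs_diff(sinkhorn(Z, K), sinkhorn(Z_shifted, K)) for K in (1, 5, 20)}
for K, diff in diffs.items():
    print(f"K={K:2d}: max |S(Z) - S(Z + gauge)| = {diff:.2e}")
```

The difference between the two outputs shrinks as K grows, which is the same qualitative behavior as the gauge-leakage column: a converged Sinkhorn map ignores gauge directions in its input.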

### Diagnostic 4: Dynamic Sensitivity (THE KEY RESULT)

| Width | \|\|DC(x)\|\|·\|\|x\|\| | \|\|Dp(x)\|\|·\|\|x\|\| | \|\|Dq(x)\|\|·\|\|x\|\| |
|-------|-----------------|-----------------|-----------------|
| 64    | 0.147           | 0.139           | 0.274           |
| 128   | 0.284           | 0.271           | 0.543           |
| 256   | 0.559           | 0.532           | 1.066           |
| 512   | 1.112           | 1.052           | 2.110           |
| 1024  | 2.235           | 2.086           | 4.158           |

**Scaling exponents:**
- ||DC(x)||·||x|| ~ **d^{0.98}**
- ||Dp(x)||·||x|| ~ **d^{0.98}**
- ||Dq(x)||·||x|| ~ **d^{0.98}**

**This is the smoking gun.** All three dynamic sensitivities scale linearly with width.
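The exponents can be rechecked directly from the table with a log-log least-squares fit. The sketch below is one way to do that; the function name is an illustrative choice and the numbers are copied from the Diagnostic 4 table, with comb, pre and post standing for the DC, Dp and Dq columns.

```python
import math

def fit_exponent(widths, values):
    """Least-squares slope of log(value) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var

widths = [64, 128, 256, 512, 1024]
sensitivity = {
    "comb": [0.147, 0.284, 0.559, 1.112, 2.235],
    "pre": [0.139, 0.271, 0.532, 1.052, 2.086],
    "post": [0.274, 0.543, 1.066, 2.110, 4.158],
}

exponents = {name: fit_exponent(widths, vals) for name, vals in sensitivity.items()}
for name, alpha in exponents.items():
    print(f"{name}: d^{alpha:.2f}")
```

All three fitted slopes round to 0.98, in agreement with the exponents listed above.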

#### Jacobian Chain Decomposition

| Component | Scaling | Value at d=1024 |
|-----------|---------|-----------------|
| \|\|D(RMSNorm)\|\|₂ | d^0 ≈ O(1) | 1.000 |
| \|\|W_comb\|\|₂ | d^{0.45} ≈ √d | 1.349 |
| \|s_C\| | O(1) | 0.100 |
| \|\|DS_K\|\|₂ | d^0 ≈ O(1) | 0.262 |
| \|\|x\|\| | d^{0.50} = √d | 64.01 |

**The two √d factors:**
1. **||W_comb||₂ ~ √d**: Generator weight spectral norm (fan-in = n_s·d)
2. **||x|| ~ √d**: Multi-stream state norm (n_s·d entries)

Product (in table order): O(1) · √d · O(1) · O(1) · √d = **Θ(d)**
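As a quick arithmetic check, the factors at d = 1024 can be multiplied and compared with the measured sensitivity. This assumes DC(x) is the composition of the four listed maps, so that the product of operator norms upper-bounds ||DC(x)||; the variable names are illustrative.

```python
import math

# Values at d = 1024, copied from the chain-decomposition table.
factors = {
    "D(RMSNorm)": 1.000,
    "W_comb": 1.349,
    "s_C": 0.100,
    "DS_K": 0.262,
    "x": 64.01,
}

# Submultiplicativity: the norm of a composition is at most the product of the norms.
upper_bound = math.prod(factors.values())
measured = 2.235  # ||DC(x)||·||x|| at d = 1024 from the Diagnostic 4 table

print(f"product of norms = {upper_bound:.3f}, measured = {measured:.3f}")
```

The product comes out at about 2.26 against a measured 2.235, so the bound is nearly tight and the two √d factors account for essentially all of the growth.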

### Diagnostic 5: Generated-Logit Update Scale

| Width | \|\|Π_{G^⊥} ΔZ\|\|₂ | Perp/Total Ratio |
|-------|------------------|------------------|
| 64    | 0.00317          | 0.998            |
| 128   | 0.00553          | 0.997            |
| 256   | 0.01006          | 0.997            |
| 512   | 0.01950          | 0.994            |
| 1024  | 0.03826          | 0.996            |

**Conclusions:**
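1. Almost all of ΔZ lies in G^⊥ (perp/total ratio ≥ 0.994 at every width). Gradient updates naturally avoid gauge directions.
2. ||Π_{G^⊥} ΔZ||₂ grows close to linearly with width (a log-log fit to the table gives ≈ d^{0.90}, with the per-doubling ratio rising from 1.74 to 1.96), requiring LR compensation.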
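The projection Π_{G^⊥} can be sketched as a double centering, under the assumption that G consists of row and column shifts of the n_s × n_s logit matrix and that norms are Frobenius norms. Both are assumptions of this sketch; the report does not spell out either choice.

```python
import math

def project_gauge_perp(M):
    """Double centering: subtract row means and column means, add back the grand mean."""
    n_rows, n_cols = len(M), len(M[0])
    row_means = [sum(row) / n_cols for row in M]
    col_means = [sum(col) / n_rows for col in zip(*M)]
    grand = sum(row_means) / n_rows
    return [[M[i][j] - row_means[i] - col_means[j] + grand for j in range(n_cols)]
            for i in range(n_rows)]

def fro_norm(M):
    return math.sqrt(sum(x * x for row in M for x in row))

def perp_ratio(delta_z):
    """Fraction of the update norm that lies in G^perp."""
    return fro_norm(project_gauge_perp(delta_z)) / fro_norm(delta_z)

# Hypothetical update: a zero-row-sum, zero-column-sum part plus a pure gauge part.
perp_part = [[1.0, -1.0, 0.0, 0.0],
             [-1.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
gauge_part = [[a + b for b in (-0.2, 0.4, 0.1, -0.3)] for a in (0.3, -0.1, 0.2, 0.0)]
delta_z = [[p + g for p, g in zip(rp, rg)] for rp, rg in zip(perp_part, gauge_part)]

print(f"perp/total ratio = {perp_ratio(delta_z):.3f}")
```

A pure gauge update projects to zero and a matrix with zero row and column sums is left unchanged, so the ratio always lies between 0 and 1.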

### Diagnostic 6: Pre/Post Gate Statistics

| Width | p̄ (mean) | \|\|p\|\|₁ | q̄ (mean) | \|\|q\|\|_∞ |
|-------|----------|--------|----------|---------|
| 64    | 0.498    | 1.993  | 0.998    | 1.014   |
| 128   | 0.501    | 2.003  | 1.000    | 1.024   |
| 256   | 0.499    | 1.998  | 1.002    | 1.033   |
| 512   | 0.497    | 1.989  | 1.005    | 1.053   |
| 1024  | 0.502    | 2.009  | 1.008    | 1.072   |

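Mean gate values are stable across widths: pre-weights center at 0.5 (the sigmoid midpoint) and post-weights at 1.0 (2·sigmoid(0)). The only visible drift is in ||q||_∞, which rises from 1.014 to 1.072 as d goes from 64 to 1024.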
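The reference values in the table follow from the gate nonlinearities at zero logits. The sketch below assumes the per-stream forms p = sigmoid(z) and q = 2·sigmoid(z) stated in the text; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gates(pre_logits, post_logits):
    """Pre gate p = sigmoid(z) and post gate q = 2 * sigmoid(z), one value per stream."""
    p = [sigmoid(z) for z in pre_logits]
    q = [2.0 * sigmoid(z) for z in post_logits]
    return p, q

n_s = 4
p, q = gates([0.0] * n_s, [0.0] * n_s)

p_mean = sum(p) / n_s
p_l1 = sum(abs(v) for v in p)
q_mean = sum(q) / n_s
q_linf = max(abs(v) for v in q)
print(p_mean, p_l1, q_mean, q_linf)
```

With zero logits this gives p̄ = 0.5, ||p||₁ = n_s · 0.5 = 2, q̄ = 1 and ||q||_∞ = 1, which are the values the measured statistics cluster around.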

---

## Corrected Parameterizations

| Correction | Description | \|\|DC\|\|·\|\|x\|\| scaling |
|------------|-------------|---------------------|
| Baseline   | σ_W=0.02, s_C=0.1 | d^{0.98} ❌ |
| A          | s_C = 1/(n_s·d) | d^{-0.04} ✅ (comb only) |
| B          | σ_W = s_C = 1/√(n_s·d) | d^{-0.04} ✅ |
| **C**      | **σ_W = 1/(n_s·d), s_C=0.1** | **d^{-0.04} ✅ (all gates)** |
| D          | s_C = 1/√d, σ_W=0.02 | d^{0.46} ❌ |
| E          | σ_W = 1/√(n_s·d), s_C=0.1 | d^{0.47} ❌ |
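The measured exponents in this table line up with a simple power count. The sketch below assumes the heuristics ||W||₂ ~ σ_W·√(n_s·d) (reasonable when fan-in is much larger than fan-out) and ||x|| ~ √(n_s·d), and treats every other factor in the chain as O(1). The function and dictionary names are illustrative.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def predicted_exponent(sigma_w_exp, s_c_exp):
    """Predicted alpha in ||DC||·||x|| ~ d^alpha.

    sigma_w_exp and s_c_exp are the powers of d in sigma_W and s_C
    (0 for a width-independent constant).
    """
    w_norm_exp = sigma_w_exp + HALF   # ||W||_2 ~ sigma_W * sqrt(n_s * d)
    x_norm_exp = HALF                 # ||x|| ~ sqrt(n_s * d)
    return w_norm_exp + s_c_exp + x_norm_exp

# (power of d in sigma_W, power of d in s_C) for each row of the table.
corrections = {
    "Baseline": (Fraction(0), Fraction(0)),
    "A": (Fraction(0), Fraction(-1)),
    "B": (-HALF, -HALF),
    "C": (Fraction(-1), Fraction(0)),
    "D": (Fraction(0), -HALF),
    "E": (-HALF, Fraction(0)),
}
measured = {"Baseline": 0.98, "A": -0.04, "B": -0.04, "C": -0.04, "D": 0.46, "E": 0.47}

for name, (sw, sc) in corrections.items():
    alpha = predicted_exponent(sw, sc)
    print(f"{name}: predicted d^{float(alpha):+.1f}, measured d^{measured[name]:+.2f}")
```

The count gives 1 for the baseline, 0 for A, B and C, and 1/2 for D and E, each within about 0.04 of the measured exponent. Only the corrections that remove a full power of d reach O(1).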

### The µP Rule for Dynamic mHC Generator Weights

| Parameter | Init | LR scaling |
|-----------|------|------------|
| W_ℓ^a ∈ ℝ^{(2+n_s)n_s × n_s·d} | σ² = 1/(n_s·d)² | η/d |
| s_ℓ^a ∈ ℝ³ | O(1) | η |
| b_ℓ^a ∈ ℝ^{(2+n_s)n_s} | 0 | η |

**Key insight:** The generator weight's effective fan-in is **n_s·d** (the total multi-stream dimension), not d.
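A minimal sketch of how the rule above might be turned into per-parameter settings follows. The helper name, the `base_lr` default and the dictionary layout are hypothetical; the initial scale 0.1 for s is borrowed from the baseline s_C and is only one possible O(1) choice. The LR for W is taken literally as η/d, as in the table.

```python
def generator_hparams(d, n_s=4, base_lr=1e-3):
    """Init and LR settings for one layer's generator parameters (illustrative helper)."""
    fan_in = n_s * d             # total multi-stream dimension
    fan_out = (2 + n_s) * n_s    # number of generated logits, as in the table
    return {
        # sigma = 1 / fan_in, i.e. sigma^2 = 1 / (n_s * d)^2; LR scaled by 1 / d.
        "W": {"shape": (fan_out, fan_in), "init_std": 1.0 / fan_in, "lr": base_lr / d},
        # Three scalar scales, O(1) init (0.1 assumed here), unscaled LR.
        "s": {"shape": (3,), "init_value": 0.1, "lr": base_lr},
        # Bias starts at zero, unscaled LR.
        "b": {"shape": (fan_out,), "init_value": 0.0, "lr": base_lr},
    }

for d in (64, 1024):
    hp = generator_hparams(d)
    print(d, hp["W"]["shape"], hp["W"]["init_std"], hp["W"]["lr"])
```

In a training script these entries would typically map onto one optimizer parameter group per row, so that only W receives the width-dependent learning rate.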

---

## Implications for Theorem D

The conditions are:
1. ✅ Branch f_ℓ^a satisfies standard spectral µP (assumed)
2. ✅ ε_K = O(1), actually ε_K ≈ 10^{-6} (Diagnostic 1)
3. ✅ Quotient Jacobian well-conditioned, κ ≈ 7-10 (Diagnostic 3)
4. ✅→ **Requires Correction C:** σ_W = 1/(n_s·d) gives O(1) (Diagnostic 4)
5. ✅→ **Requires LR scaling:** η_W = Θ(1/d) (Diagnostic 5)

With both corrections applied, all five conditions of Theorem D are satisfied.

---

## V4 Sinkhorn Implementation Notes

From `kernel.py`:
1. **Init:** Row-softmax + eps, then col-normalize
2. **Iterations:** K-1 repetitions of (row-normalize, col-normalize)
3. **Convention:** comb[j,k] with j=output stream, k=input stream
4. **hc_post:** y_o = q_o · f(y) + Σ_i C[i,o] · residual_i
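Steps 1 and 2 can be sketched in pure Python as follows. This is not the code from `kernel.py`: in particular, adding eps to every normalization denominator is an assumption, inferred from the Diagnostic 1 observation that the marginal error sits at eps. Function names and the example logits are illustrative.

```python
import math

def normalize_rows(mat, eps):
    out = []
    for row in mat:
        total = sum(row) + eps  # assumed: eps guards the denominator
        out.append([x / total for x in row])
    return out

def normalize_cols(mat, eps):
    totals = [sum(col) + eps for col in zip(*mat)]
    return [[x / t for x, t in zip(row, totals)] for row in mat]

def sinkhorn_v4_sketch(logits, num_iters=20, eps=1e-6):
    # Step 1 (Init): row-softmax + eps, then column-normalize.
    mat = []
    for row in logits:
        peak = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(z - peak) for z in row]
        total = sum(exps)
        mat.append([e / total + eps for e in exps])
    mat = normalize_cols(mat, eps)
    # Step 2 (Iterations): K-1 repetitions of (row-normalize, col-normalize).
    for _ in range(num_iters - 1):
        mat = normalize_cols(normalize_rows(mat, eps), eps)
    return mat

def marginal_error(mat):
    """Largest deviation of any row or column sum from 1."""
    row_err = max(abs(sum(row) - 1.0) for row in mat)
    col_err = max(abs(sum(col) - 1.0) for col in zip(*mat))
    return max(row_err, col_err)

# Hypothetical 4 x 4 logits with small spread.
logits = [[0.3, -0.2, 0.1, 0.0],
          [-0.4, 0.5, 0.2, -0.1],
          [0.0, 0.1, -0.3, 0.4],
          [0.2, -0.5, 0.3, -0.2]]
comb = sinkhorn_v4_sketch(logits)
print(f"max marginal error after K=20: {marginal_error(comb):.1e}")
```

Under these assumptions the residual marginal error is about 1e-6, set by eps rather than by the iteration count, which is the behavior reported in Diagnostic 1.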

## Files

- `mhc_diagnostics.py` — Complete diagnostic implementation (all 6 diagnostics)
- `mhc_analysis.py` — Chain decomposition, corrections, figures