# Dynamic mHC µP Diagnostic Report
## Empirical Verification of Theorem Conditions for V4-style Dynamic Residual Routing
**Date:** 2026-01-05
**Config:** DeepSeek V4 (n_s=4, K=20, hidden_size=7168, 61 layers)
---
## Executive Summary
We ran all six diagnostics prescribed by the updated plan against a pure-PyTorch reimplementation of DeepSeek V4's dynamic mHC mechanism, sweeping width d ∈ {64, 128, 256, 512, 1024} with fixed n_s = 4 and K = 20.
**The headline finding:** Under standard initialization (σ_W = 0.02, s_C = 0.1), the dynamic generator sensitivity ||DC(x)|| · ||x|| scales as **d^0.98 ≈ d^1**, violating the O(1) condition required by Theorem D (condition 4). This is the central technical obstacle for µP transfer in dynamic mHC.
**The fix:** Multiple corrected parameterizations restore O(1) scaling. The cleanest are:
- **Correction A** (s_C = 1/(n_s·d)): yields d^{-0.035} ≈ O(1) for comb
- **Correction C** (σ_W = 1/(n_s·d)): yields d^{-0.037} ≈ O(1) for **all three** (comb, pre, post)
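Correction C is the µP-natural choice: it sets the standard deviation of the generator weight W_C to 1/fan-in, with fan-in n_s·d, which is the µP scaling for a weight that maps a width-dependent input to a fixed-size output.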
---
## Diagnostic Results
### Diagnostic 1: Finite-K Sinkhorn Error ε_K
| Width | ε_K (mean) | ε_K (max) |
|-------|-----------|-----------|
| 64 | 1.0e-6 | 1.0e-6 |
| 128 | 1.0e-6 | 1.0e-6 |
| 256 | 1.0e-6 | 1.0e-6 |
| 512 | 1.0e-6 | 1.0e-6 |
| 1024 | 1.0e-6 | 1.0e-6 |
**Conclusion:** ε_K ≈ eps = 1e-6 (the additive epsilon in the Sinkhorn loop dominates). With K=20 and n_s=4, Sinkhorn convergence is essentially exact. The finite-K error is **not** a practical concern for V4's configuration.
**Impact on Theorem A:** The frozen realized comb stability theorem holds essentially exactly: ||C||_2 = 1 + O(ε_K) with ε_K ≈ 10^{-6}.
### Diagnostic 2: Frozen Mixer Spectral Norm ||C||_2
| Width | ‖C‖_2 (mean) | κ(C) (mean) |
|-------|----------------|-------------|
| 64 | 0.999999 | 1167 |
| 128 | 0.999999 | 652 |
| 256 | 0.999999 | 409 |
| 512 | 0.999999 | 462 |
| 1024 | 0.999999 | 264 |
**Conclusion:** ||C||_2 ≈ 1 to 6 decimal places. The Birkhoff constraint works exactly as Theorem A predicts. The realized comb is nonexpansive.
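**Note:** The high condition number κ(C) means that while the maximum singular value is 1, the minimum singular value is small (roughly 0.001 to 0.004, from σ_min = σ_max/κ). C is therefore close to rank-deficient. This is not the signature of a near-permutation matrix, which is orthogonal and has κ ≈ 1; it more plausibly reflects C having nearly repeated rows, as happens when the generated logits are small and Sinkhorn returns a matrix close to the uniform one.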
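The relationship between a unit top singular value and a large κ can be reproduced with a minimal NumPy sketch. The matrices below are illustrative assumptions (a cyclic permutation and a blend of it with the uniform matrix), not samples from the V4 generator. They only show that a doubly stochastic matrix has σ_max = 1, that a permutation has κ = 1, and that κ grows as the matrix approaches the rank-1 uniform matrix.

```python
import numpy as np

n_s = 4
perm = np.eye(n_s)[[1, 2, 3, 0]]           # cyclic permutation (orthogonal)
uniform = np.full((n_s, n_s), 1.0 / n_s)   # rank-1 doubly stochastic matrix
t = 0.999                                  # hypothetical mixing weight
blend = t * uniform + (1.0 - t) * perm     # convex combination, still doubly stochastic

def top_and_cond(C):
    """Return (sigma_max, kappa) from the singular values of C."""
    s = np.linalg.svd(C, compute_uv=False)
    return s[0], s[0] / s[-1]

perm_top, perm_cond = top_and_cond(perm)
blend_top, blend_cond = top_and_cond(blend)
print(f"permutation: sigma_max={perm_top:.6f}, kappa={perm_cond:.1f}")
print(f"near-uniform: sigma_max={blend_top:.6f}, kappa={blend_cond:.1f}")
```

With this particular blend the three non-trivial singular values all equal 1 - t, so κ = 1/(1 - t), which is the same order of magnitude as the κ values in the table.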
### Diagnostic 3: Sinkhorn Quotient-Jacobian Spectrum
This diagnostic is **width-independent** (operates on fixed n_s=4 matrices).
| K | σ_max | σ_min | κ | Gauge Leakage |
|----|-------|-------|------|---------------|
| 1 | 0.358 | 0.065 | 6.86 | 0.278 |
| 2 | 0.353 | 0.064 | 6.93 | 0.091 |
| 5 | 0.354 | 0.063 | 8.13 | 0.006 |
| 10 | 0.355 | 0.058 | 9.92 | 0.0001 |
| 20 | 0.354 | 0.067 | 7.40 | 0.000002 |
| 50 | 0.355 | 0.063 | 7.91 | 0.000000 |
**Key Findings:**
1. **The quotient Jacobian is well-conditioned.** κ ≈ 7-10 across all K values. This validates Theorem C's assumption.
2. **σ_max ≈ 0.35, σ_min ≈ 0.06.** The Sinkhorn projection is a **contraction** on G^⊥ (σ_max < 1). Perturbations are damped, not amplified.
3. **Gauge leakage drops exponentially with K.** At K=20, leakage is 2e-6. The Sinkhorn Jacobian maps G^⊥ almost perfectly into G^⊥.
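4. **The spectrum is essentially K-independent.** σ_max varies by under 2% across all K tested and σ_min stays within 0.058–0.067, so the spectrum has settled well before K = 20.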
5. **The (n_s-1)² = 9 singular values** have a smooth distribution — all gauge-perpendicular directions are treated comparably.
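A diagnostic of this kind can be sketched with finite differences. Several things below are assumptions rather than details from the report: the gauge space G is taken to be additive row and column shifts of the logits (so G^⊥ is the space of double-centered matrices, of dimension (n_s-1)²), leakage is defined as ‖(I-Π)JΠ‖₂ / ‖ΠJΠ‖₂, and the logits are a single random draw. The numbers it prints are therefore not expected to match the table.

```python
import numpy as np

def sinkhorn(Z, K, eps=1e-6):
    """Row-softmax + eps, column-normalize, then K-1 (row, column) sweeps."""
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P = P / P.sum(axis=1, keepdims=True) + eps
    P = P / P.sum(axis=0, keepdims=True)
    for _ in range(K - 1):
        P = P / P.sum(axis=1, keepdims=True)
        P = P / P.sum(axis=0, keepdims=True)
    return P

def jacobian_fd(Z, K, h=1e-5):
    """Central finite-difference Jacobian d vec(S_K(Z)) / d vec(Z)."""
    n = Z.shape[0]
    J = np.zeros((n * n, n * n))
    for k in range(n * n):
        dZ = np.zeros(n * n)
        dZ[k] = h
        dZ = dZ.reshape(n, n)
        J[:, k] = (sinkhorn(Z + dZ, K) - sinkhorn(Z - dZ, K)).ravel() / (2 * h)
    return J

def quotient_stats(Z, K):
    """Singular values of the Jacobian restricted to G^perp, plus gauge leakage."""
    n = Z.shape[0]
    center = np.eye(n) - np.ones((n, n)) / n
    P_perp = np.kron(center, center)          # double-centering projector, rank (n-1)^2
    J = jacobian_fd(Z, K)
    Jq = P_perp @ J @ P_perp
    s = np.linalg.svd(Jq, compute_uv=False)[: (n - 1) ** 2]
    leak = np.linalg.norm((np.eye(n * n) - P_perp) @ J @ P_perp, 2) / s[0]
    return s, leak

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 4))                   # hypothetical logits
for K in (1, 5, 20, 50):
    s, leak = quotient_stats(Z, K)
    print(f"K={K:2d}  sigma_max={s[0]:.3f}  sigma_min={s[-1]:.3f}  leakage={leak:.2e}")
```

Central differences in float64 keep the differentiation error far below the leakage values being measured; an autograd Jacobian would serve equally well.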
### Diagnostic 4: Dynamic Sensitivity (THE KEY RESULT)
| Width | ‖DC(x)‖·‖x‖ | ‖Dp(x)‖·‖x‖ | ‖Dq(x)‖·‖x‖ |
|-------|-----------------|-----------------|-----------------|
| 64 | 0.147 | 0.139 | 0.274 |
| 128 | 0.284 | 0.271 | 0.543 |
| 256 | 0.559 | 0.532 | 1.066 |
| 512 | 1.112 | 1.052 | 2.110 |
| 1024 | 2.235 | 2.086 | 4.158 |
**Scaling exponents:**
- ||DC(x)||·||x|| ~ **d^{0.98}**
- ||Dp(x)||·||x|| ~ **d^{0.98}**
- ||Dq(x)||·||x|| ~ **d^{0.98}**
**This is the smoking gun.** All three dynamic sensitivities scale linearly with width.
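The exponents can be recovered from the table with a log-log least-squares fit. The report does not state how the exponents were computed, so treating them as ordinary least-squares slopes is an assumption; `fit_exponent` is a hypothetical helper name.

```python
import math

def fit_exponent(widths, values):
    """Least-squares slope of log(value) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

widths = [64, 128, 256, 512, 1024]
sensitivities = {  # values copied from the Diagnostic 4 table
    "DC": [0.147, 0.284, 0.559, 1.112, 2.235],
    "Dp": [0.139, 0.271, 0.532, 1.052, 2.086],
    "Dq": [0.274, 0.543, 1.066, 2.110, 4.158],
}
exponents = {name: fit_exponent(widths, vals) for name, vals in sensitivities.items()}
for name, alpha in exponents.items():
    print(f"{name}: d^{alpha:.2f}")
```

All three fitted slopes round to 0.98, in agreement with the exponents quoted above.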
#### Jacobian Chain Decomposition
| Component | Scaling | Value at d=1024 |
|-----------|---------|-----------------|
| ‖D(RMSNorm)‖₂ | d^0 ≈ O(1) | 1.000 |
| ‖W_comb‖₂ | d^{0.45} ≈ √d | 1.349 |
| \|s_C\| | O(1) | 0.100 |
| ‖DS_K‖₂ | d^0 ≈ O(1) | 0.262 |
| ‖x‖ | d^{0.50} = √d | 64.01 |
**The two √d factors:**
1. **||W_comb||₂ ~ √d**: Generator weight spectral norm (fan-in = n_s·d)
2. **||x|| ~ √d**: Multi-stream state norm (n_s·d entries)
Product: O(1) · O(1) · √d · O(1) · √d = **Θ(d)**
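As a quick arithmetic check, the product of the five factors at d = 1024 can be compared against the measured sensitivity from Diagnostic 4. The product of norms is only a submultiplicative upper bound on the norm of the composed Jacobian, so close agreement is a sign that the bound is nearly tight here, not a guarantee.

```python
import math

# Values and fitted exponents copied from the chain decomposition table.
factors_at_1024 = {"D(RMSNorm)": 1.000, "W_comb": 1.349, "s_C": 0.100, "DS_K": 0.262, "x": 64.01}
fitted_exponents = {"D(RMSNorm)": 0.0, "W_comb": 0.45, "s_C": 0.0, "DS_K": 0.0, "x": 0.50}

bound = math.prod(factors_at_1024.values())   # product-of-norms upper bound
exponent_sum = sum(fitted_exponents.values()) # predicted scaling exponent of the bound
measured = 2.235                              # ||DC(x)||·||x|| at d=1024 (Diagnostic 4)

print(f"bound = {bound:.3f}, measured = {measured:.3f}, ratio = {bound / measured:.3f}")
print(f"sum of component exponents = {exponent_sum:.2f}")
```

The bound exceeds the measured value by roughly 1%, and the component exponents sum to 0.95, close to the directly fitted 0.98.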
### Diagnostic 5: Generated-Logit Update Scale
| Width | ‖Π_{G^⊥} ΔZ‖₂ | Perp/Total Ratio |
|-------|------------------|------------------|
| 64 | 0.00317 | 0.998 |
| 128 | 0.00553 | 0.997 |
| 256 | 0.01006 | 0.997 |
| 512 | 0.01950 | 0.994 |
| 1024 | 0.03826 | 0.996 |
**Conclusions:**
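1. Almost all of ΔZ lies in G^⊥: the perp/total ratio is 0.994–0.998 at every width. Gradient updates naturally avoid gauge directions.
2. ||Π_{G^⊥} ΔZ||₂ grows roughly as d^{0.9} (a 12× increase over a 16× width range, with the per-doubling factor rising from 1.74 to 1.96), i.e. close to linearly in width rather than as √d, requiring LR compensation.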
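The projection Π_{G^⊥} can be sketched as double-centering, under the assumption that the gauge space G consists of additive row and column shifts of the logit matrix. The report does not specify which norm enters the perp/total ratio; the spectral norm is used here as a guess, and `project_perp` and `perp_ratio` are hypothetical helper names.

```python
import numpy as np

def project_perp(M):
    """Remove row-shift and column-shift components (double-centering)."""
    row = M.mean(axis=1, keepdims=True)
    col = M.mean(axis=0, keepdims=True)
    return M - row - col + M.mean()

def perp_ratio(dZ):
    """Share of the update that lies in G^perp, measured in spectral norm."""
    return np.linalg.norm(project_perp(dZ), 2) / np.linalg.norm(dZ, 2)

rng = np.random.default_rng(0)
a, b = rng.normal(size=4), rng.normal(size=4)
gauge = a[:, None] + b[None, :]        # pure gauge direction: row shift + column shift
dZ = rng.normal(size=(4, 4))           # hypothetical logit update

print("gauge residual:", np.abs(project_perp(gauge)).max())
print("perp/total ratio:", perp_ratio(dZ))
```

A pure gauge direction projects to zero, and the ratio is at most 1 because double-centering is an orthogonal projection.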
### Diagnostic 6: Pre/Post Gate Statistics
| Width | p̄ (mean) | ‖p‖₁ | q̄ (mean) | ‖q‖_∞ |
|-------|----------|--------|----------|---------|
| 64 | 0.498 | 1.993 | 0.998 | 1.014 |
| 128 | 0.501 | 2.003 | 1.000 | 1.024 |
| 256 | 0.499 | 1.998 | 1.002 | 1.033 |
| 512 | 0.497 | 1.989 | 1.005 | 1.053 |
| 1024 | 0.502 | 2.009 | 1.008 | 1.072 |
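Gates are stable across widths. Pre-weights center at 0.5 (the sigmoid midpoint, so ||p||₁ ≈ n_s · 0.5 = 2) and post-weights at 1.0 (2·sigmoid(0)); the only visible trend is a mild upward drift in ||q||_∞ from 1.014 to 1.072.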
---
## Corrected Parameterizations
| Correction | Description | ‖DC‖·‖x‖ scaling |
|------------|-------------|---------------------|
| Baseline | σ_W=0.02, s_C=0.1 | d^{0.98} ❌ |
| A | s_C = 1/(n_s·d) | d^{-0.04} ✅ (comb only) |
| B | σ_W = s_C = 1/√(n_s·d) | d^{-0.04} ✅ |
| **C** | **σ_W = 1/(n_s·d), s_C=0.1** | **d^{-0.04} ✅ (all gates)** |
| D | s_C = 1/√d, σ_W=0.02 | d^{0.46} ❌ |
| E | σ_W = 1/√(n_s·d), s_C=0.1 | d^{0.47} ❌ |
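The measured exponents are close to what simple power counting predicts. The sketch below assumes ‖W_comb‖₂ ~ σ_W·√(n_s·d) and ‖x‖ ~ √d, with the RMSNorm and Sinkhorn Jacobians treated as O(1); it covers the comb sensitivity only, and `predicted_exponent` is a hypothetical helper name.

```python
# (exponent of d in sigma_W, exponent of d in s_C, measured exponent from the table)
corrections = {
    "Baseline": (0.0, 0.0, 0.98),
    "A": (0.0, -1.0, -0.04),
    "B": (-0.5, -0.5, -0.04),
    "C": (-1.0, 0.0, -0.04),
    "D": (0.0, -0.5, 0.46),
    "E": (-0.5, 0.0, 0.47),
}

def predicted_exponent(sigma_exp, s_exp):
    w_norm_exp = sigma_exp + 0.5   # ||W||_2 ~ sigma_W * sqrt(n_s * d)
    x_norm_exp = 0.5               # ||x|| ~ sqrt(d)
    return w_norm_exp + s_exp + x_norm_exp

predictions = {
    name: predicted_exponent(sig, s) for name, (sig, s, _) in corrections.items()
}
for name, (_, _, measured) in corrections.items():
    print(f"{name:8s} predicted d^{predictions[name]:+.2f}   measured d^{measured:+.2f}")
```

Under these assumptions, only parameterizations that remove a full power of d from the product σ_W·s_C reach O(1); D and E remove half a power and stay at √d.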
### The µP Rule for Dynamic mHC Generator Weights
| Parameter | Init | LR scaling |
|-----------|------|------------|
| W_ℓ^a ∈ ℝ^{(2+n_s)n_s × n_s·d} | σ² = 1/(n_s·d)² | η/d |
| s_ℓ^a ∈ ℝ³ | O(1) | η |
| b_ℓ^a ∈ ℝ^{(2+n_s)n_s} | 0 | η |
**Key insight:** The generator weight's effective fan-in is **n_s·d** (the total multi-stream dimension), not d.
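The effect of the init rule can be illustrated with random matrices. This is a proxy, not the report's diagnostic: it measures only s_C·‖W‖₂·‖x‖ for the comb block (n_s² output rows) with Gaussian W and x, and ignores the RMSNorm and Sinkhorn Jacobians, which the chain decomposition shows to be O(1). The function name and seed are assumptions.

```python
import numpy as np

def proxy_sensitivity(d, sigma_w, s_c, n_s=4, seed=0):
    """s_C * ||W||_2 * ||x|| for a Gaussian comb generator weight and state."""
    rng = np.random.default_rng(seed)
    fan_in = n_s * d
    W = rng.normal(0.0, sigma_w, size=(n_s * n_s, fan_in))  # comb block only
    x = rng.normal(size=fan_in)                             # unit-variance state
    return s_c * np.linalg.norm(W, 2) * np.linalg.norm(x)

def slope(widths, values):
    """Log-log least-squares exponent."""
    return np.polyfit(np.log(widths), np.log(values), 1)[0]

widths = [64, 128, 256, 512, 1024]
baseline = [proxy_sensitivity(d, 0.02, 0.1) for d in widths]
corrected = [proxy_sensitivity(d, 1.0 / (4 * d), 0.1) for d in widths]  # sigma_W = 1/(n_s*d)

baseline_slope = slope(widths, baseline)
corrected_slope = slope(widths, corrected)
print(f"baseline exponent:     {baseline_slope:+.2f}")
print(f"Correction C exponent: {corrected_slope:+.2f}")
```

The baseline proxy grows nearly linearly in d while the fan-in-scaled init keeps it roughly flat, which mirrors the behavior in the corrections table.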
---
## Implications for Theorem D
The conditions are:
1. ✅ Branch f_ℓ^a satisfies standard spectral µP (assumed)
2. ✅ ε_K = O(1), actually ε_K ≈ 10^{-6} (Diagnostic 1)
3. ✅ Quotient Jacobian well-conditioned, κ ≈ 7-10 (Diagnostic 3)
4. ✅→ **Requires Correction C:** σ_W = 1/(n_s·d) gives O(1) (Diagnostic 4)
5. ✅→ **Requires LR scaling:** η_W = Θ(1/d) (Diagnostic 5)
With both corrections applied, conditions 2–5 of Theorem D are verified empirically and condition 1 holds by assumption, so all five conditions are satisfied.
---
## V4 Sinkhorn Implementation Notes
From `kernel.py`:
1. **Init:** Row-softmax + eps, then col-normalize
2. **Iterations:** K-1 repetitions of (row-normalize, col-normalize)
3. **Convention:** comb[j,k] with j=output stream, k=input stream
4. **hc_post:** y_o = q_o · f(y) + Σ_i C[i,o] · residual_i
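The four notes above can be transcribed into a short NumPy sketch. This is a reading of the notes, not the code in `kernel.py`: the function names, the shapes (residual of shape (n_s, d), a single branch output of shape (d,)), and the random logits are assumptions.

```python
import numpy as np

def sinkhorn_v4_style(logits, K=20, eps=1e-6):
    """Notes 1-2: row-softmax + eps, column-normalize, then K-1 (row, column) sweeps."""
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P = P / P.sum(axis=1, keepdims=True) + eps
    P = P / P.sum(axis=0, keepdims=True)
    for _ in range(K - 1):
        P = P / P.sum(axis=1, keepdims=True)
        P = P / P.sum(axis=0, keepdims=True)
    return P

def hc_post(C, q, branch_out, residual):
    """Note 4: y_o = q_o * f(y) + sum_i C[i, o] * residual_i."""
    return q[:, None] * branch_out[None, :] + C.T @ residual

rng = np.random.default_rng(0)
C = sinkhorn_v4_style(0.5 * rng.normal(size=(4, 4)))   # hypothetical logits
row_err = np.abs(C.sum(axis=1) - 1.0).max()
col_err = np.abs(C.sum(axis=0) - 1.0).max()
spec = np.linalg.norm(C, 2)
print(f"row error {row_err:.2e}, column error {col_err:.2e}, ||C||_2 = {spec:.6f}")

residual = rng.normal(size=(4, 8))
out = hc_post(np.eye(4), np.zeros(4), np.ones(8), residual)  # identity mixer, closed gate
```

Because the last step is a column normalization, column sums are exact and any residual finite-K error shows up in the row sums.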
## Files
- `mhc_diagnostics.py` — Complete diagnostic implementation (all 6 diagnostics)
- `mhc_analysis.py` — Chain decomposition, corrections, figures