Add diagnostic report and analysis
README.md
ADDED
# Dynamic mHC µP Diagnostic Report
## Empirical Verification of Theorem Conditions for V4-style Dynamic Residual Routing

**Date:** 2026-01-05
**Config:** DeepSeek V4 (n_s=4, K=20, hidden_size=7168, 61 layers)

---

## Executive Summary

We ran all six diagnostics prescribed by the updated plan against a pure-PyTorch reimplementation of DeepSeek V4's dynamic mHC mechanism, sweeping width d ∈ {64, 128, 256, 512, 1024} with fixed n_s = 4 and K = 20.

**The headline finding:** Under standard initialization (σ_W = 0.02, s_C = 0.1), the dynamic generator sensitivity ||DC(x)|| · ||x|| scales as **d^0.98 ≈ d^1**, violating the O(1) condition required by Theorem D (condition 4). This is the central technical obstacle for µP transfer in dynamic mHC.

**The fix:** Several corrected parameterizations restore O(1) scaling. The cleanest are:
- **Correction A** (s_C = 1/(n_s·d)): yields d^{-0.035} ≈ O(1) for comb only
- **Correction C** (σ_W = 1/(n_s·d)): yields d^{-0.037} ≈ O(1) for **all three** (comb, pre, post)

Correction C is the µP-natural choice. The generator weight W_C maps an input of width n_s·d to a fixed-size output, so it behaves like a µP readout layer, and σ_W = 1/fan-in (variance 1/fan-in²) matches the µP prescription for output-like layers. The hidden-layer rule σ_W = 1/√fan-in (Correction E) removes only one of the two √d factors and leaves d^{0.47} growth.

---

## Diagnostic Results

### Diagnostic 1: Finite-K Sinkhorn Error ε_K

| Width | ε_K (mean) | ε_K (max) |
|-------|-----------|-----------|
| 64 | 1.0e-6 | 1.0e-6 |
| 128 | 1.0e-6 | 1.0e-6 |
| 256 | 1.0e-6 | 1.0e-6 |
| 512 | 1.0e-6 | 1.0e-6 |
| 1024 | 1.0e-6 | 1.0e-6 |

**Conclusion:** ε_K ≈ eps = 1e-6 (the additive epsilon in the Sinkhorn loop dominates). With K=20 and n_s=4, Sinkhorn convergence is essentially exact. The finite-K error is **not** a practical concern for V4's configuration.

**Impact on Theorem A:** The frozen realized comb stability theorem holds essentially exactly: ||C||_2 = 1 + O(ε_K) with ε_K ≈ 10^{-6}.

### Diagnostic 2: Frozen Mixer Spectral Norm ||C||_2

| Width | \|\|C\|\|_2 (mean) | κ(C) (mean) |
|-------|----------------|-------------|
| 64 | 0.999999 | 1167 |
| 128 | 0.999999 | 652 |
| 256 | 0.999999 | 409 |
| 512 | 0.999999 | 462 |
| 1024 | 0.999999 | 264 |

**Conclusion:** ||C||_2 is within 10^{-6} of 1 at every width. The Birkhoff constraint works exactly as Theorem A predicts. The realized comb is nonexpansive.

**Note:** The high condition number κ(C) means that while the maximum singular value is 1, the minimum singular value is small (roughly 10^{-3}, from σ_min ≈ 1/κ), so C is close to rank-deficient. A near-permutation matrix would not look like this, since every singular value of a permutation matrix is exactly 1. The pattern is instead consistent with Sinkhorn producing a matrix close to the uniform matrix (1/n_s)·11ᵀ, which has rank one, as expected when the scaled logits have small spread.
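
To see how a doubly stochastic matrix can have spectral norm 1 and still be nearly rank-deficient, the sketch below builds the illustrative family C_t = (1 − t)·(1/n)·11ᵀ + t·I. This family is an assumption chosen for clarity, not the matrices measured in the table. Because C_t is symmetric, its singular values are the absolute values of its eigenvalues: 1 on the all-ones direction and t on every direction orthogonal to it, so κ = 1/t.

```python
def near_uniform(n, t):
    """Return (1 - t) * uniform + t * identity, a symmetric doubly stochastic matrix."""
    return [[(1 - t) / n + (t if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


n, t = 4, 1e-3  # illustrative values; t plays the role of sigma_min
C = near_uniform(n, t)

row_sums = [sum(row) for row in C]
col_sums = [sum(C[i][j] for i in range(n)) for j in range(n)]

ones = [1.0] * n
v = [1.0, -1.0, 0.0, 0.0]   # orthogonal to the all-ones vector

top = matvec(C, ones)       # C @ 1 = 1      -> singular value 1
low = matvec(C, v)          # C @ v = t * v  -> singular value t
kappa = 1.0 / t             # condition number of this family

print("row sums:", row_sums)
print("C @ ones:", top)
print("C @ v   :", low)
print("kappa   :", kappa)
```

With t = 10^{-3} this toy matrix has κ = 1000, the same order as the κ(C) column, while a permutation matrix (t = 1 up to row reordering) would have κ = 1.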

### Diagnostic 3: Sinkhorn Quotient-Jacobian Spectrum

This diagnostic is **width-independent** (it operates on fixed n_s=4 matrices).

| K | σ_max | σ_min | κ | Gauge Leakage |
|----|-------|-------|------|---------------|
| 1 | 0.358 | 0.065 | 6.86 | 0.278 |
| 2 | 0.353 | 0.064 | 6.93 | 0.091 |
| 5 | 0.354 | 0.063 | 8.13 | 0.006 |
| 10 | 0.355 | 0.058 | 9.92 | 0.0001 |
| 20 | 0.354 | 0.067 | 7.40 | 0.000002 |
| 50 | 0.355 | 0.063 | 7.91 | 0.000000 |

**Key Findings:**

1. **The quotient Jacobian is well-conditioned.** κ ≈ 7-10 across all K values. This supports Theorem C's assumption.

2. **σ_max ≈ 0.35, σ_min ≈ 0.06.** The Sinkhorn projection is a **contraction** on G^⊥ (σ_max < 1). Perturbations are damped, not amplified.

3. **Gauge leakage drops roughly exponentially with K.** At K=20, leakage is 2e-6. The Sinkhorn Jacobian maps G^⊥ almost perfectly into G^⊥.

4. **The spectrum is K-independent for K ≥ 5.** Convergence is fast.

5. **The (n_s-1)² = 9 singular values** have a smooth distribution: all gauge-perpendicular directions are treated comparably.
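
The projection onto G^⊥ used in these diagnostics can be sketched as follows. This is a minimal sketch under a stated assumption: the report does not define the gauge subspace G explicitly, so the code assumes G is spanned by row shifts and column shifts of the logits (Z → Z + a·1ᵀ + 1·bᵀ). Under that assumption G has dimension 2n_s − 1 and G^⊥ has dimension (n_s − 1)² = 9 for n_s = 4, which matches the count in finding 5, and the orthogonal projection onto G^⊥ is double centering. The function names are hypothetical.

```python
def project_perp(z):
    """Double-center z: subtract row means and column means, add back the grand mean."""
    n, m = len(z), len(z[0])
    row_mean = [sum(row) / m for row in z]
    col_mean = [sum(z[i][j] for i in range(n)) / n for j in range(m)]
    grand = sum(row_mean) / n
    return [[z[i][j] - row_mean[i] - col_mean[j] + grand for j in range(m)]
            for i in range(n)]


def frob(z):
    """Frobenius norm of a nested-list matrix."""
    return sum(v * v for row in z for v in row) ** 0.5


# A pure gauge perturbation (row shift a_i plus column shift b_j) projects to zero.
a = [0.3, -1.2, 0.5, 2.0]
b = [1.0, 0.0, -0.7, 0.4]
gauge = [[a[i] + b[j] for j in range(4)] for i in range(4)]
gauge_residual = frob(project_perp(gauge))

# A generic perturbation keeps only its gauge-perpendicular part.
dz = [[0.2, -0.1, 0.4, 0.0],
      [-0.3, 0.5, 0.1, -0.2],
      [0.0, 0.3, -0.4, 0.2],
      [0.1, -0.5, 0.2, 0.3]]
perp = project_perp(dz)
perp_total_ratio = frob(perp) / frob(dz)

# Projecting twice changes nothing (idempotence).
again = project_perp(perp)
idempotence_error = frob([[again[i][j] - perp[i][j] for j in range(4)]
                          for i in range(4)])

print(gauge_residual, perp_total_ratio, idempotence_error)
```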

### Diagnostic 4: Dynamic Sensitivity (THE KEY RESULT)

| Width | \|\|DC(x)\|\|·\|\|x\|\| | \|\|Dp(x)\|\|·\|\|x\|\| | \|\|Dq(x)\|\|·\|\|x\|\| |
|-------|-----------------|-----------------|-----------------|
| 64 | 0.147 | 0.139 | 0.274 |
| 128 | 0.284 | 0.271 | 0.543 |
| 256 | 0.559 | 0.532 | 1.066 |
| 512 | 1.112 | 1.052 | 2.110 |
| 1024 | 2.235 | 2.086 | 4.158 |

**Scaling exponents:**
- ||DC(x)||·||x|| ~ **d^{0.98}**
- ||Dp(x)||·||x|| ~ **d^{0.98}**
- ||Dq(x)||·||x|| ~ **d^{0.98}**

**This is the smoking gun.** All three dynamic sensitivities scale linearly with width.
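
The exponents can be recovered from the table with an ordinary least-squares fit of log(value) against log(width). The sketch below uses only the tabulated numbers; the helper name `loglog_slope` is hypothetical and this is not necessarily the fitting procedure used in `mhc_analysis.py`.

```python
import math


def loglog_slope(widths, values):
    """Least-squares slope of log(values) versus log(widths)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


widths = [64, 128, 256, 512, 1024]
sensitivity = {
    "DC": [0.147, 0.284, 0.559, 1.112, 2.235],
    "Dp": [0.139, 0.271, 0.532, 1.052, 2.086],
    "Dq": [0.274, 0.543, 1.066, 2.110, 4.158],
}

slopes = {name: loglog_slope(widths, vals) for name, vals in sensitivity.items()}
for name, slope in slopes.items():
    print(f"||{name}(x)||*||x|| ~ d^{slope:.2f}")
```

All three fitted slopes come out at roughly 0.98, in line with the exponents listed above.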

#### Jacobian Chain Decomposition

| Component | Scaling | Value at d=1024 |
|-----------|---------|-----------------|
| \|\|D(RMSNorm)\|\|₂ | d^0 ≈ O(1) | 1.000 |
| \|\|W_comb\|\|₂ | d^{0.45} ≈ √d | 1.349 |
| \|s_C\| | O(1) | 0.100 |
| \|\|DS_K\|\|₂ | d^0 ≈ O(1) | 0.262 |
| \|\|x\|\| | d^{0.50} = √d | 64.01 |

**The two √d factors:**
1. **||W_comb||₂ ~ √d**: Generator weight spectral norm (fan-in = n_s·d)
2. **||x|| ~ √d**: Multi-stream state norm (n_s·d entries)

Product, in table order: ||D(RMSNorm)||₂ · ||W_comb||₂ · |s_C| · ||DS_K||₂ · ||x|| = O(1) · √d · O(1) · O(1) · √d = **Θ(d)**
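
As a sanity check on the decomposition, the d = 1024 column can be multiplied out and compared with the measured ||DC(x)||·||x|| from the Diagnostic 4 table. By submultiplicativity of operator norms, the chain product is an upper bound on the measured value, so the two should be close but need not be equal. The sketch below uses only numbers already tabulated in this report.

```python
import math

# Per-component values at d = 1024, copied from the chain decomposition table.
chain_at_1024 = {
    "D(RMSNorm)": 1.000,
    "W_comb": 1.349,
    "s_C": 0.100,
    "DS_K": 0.262,
    "x": 64.01,
}

# Measured ||DC(x)|| * ||x|| at d = 1024, copied from the Diagnostic 4 table.
measured = 2.235

bound = math.prod(chain_at_1024.values())
tightness = measured / bound

print(f"chain product: {bound:.3f}")
print(f"measured     : {measured:.3f}")
print(f"measured / chain product: {tightness:.3f}")
```

The chain product is about 2.262, so the measured sensitivity sits within roughly 1.2% of the bound: the factors in the table account for essentially all of the observed growth.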

### Diagnostic 5: Generated-Logit Update Scale

| Width | \|\|Π_{G^⊥} ΔZ\|\|₂ | Perp/Total Ratio |
|-------|------------------|------------------|
| 64 | 0.00317 | 0.998 |
| 128 | 0.00553 | 0.997 |
| 256 | 0.01006 | 0.997 |
| 512 | 0.01950 | 0.994 |
| 1024 | 0.03826 | 0.996 |

**Conclusions:**
1. Almost all of ΔZ is in G^⊥ (perp/total ratio ≥ 0.994 at every width). Gradient updates naturally avoid gauge directions.
2. ||Π_{G^⊥} ΔZ||₂ grows close to linearly with width, not as √d: a log-log fit of the table gives ≈ d^{0.90}, and the step-to-step exponent rises from 0.80 (64→128) to 0.97 (512→1024). This growth requires LR compensation.
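
The growth rate of the update norm can be read directly off the table by computing local exponents between neighbouring widths, log(y₂/y₁)/log(d₂/d₁). The sketch below does this with the tabulated values and then rescales each entry by 64/d to mimic a 1/d learning-rate rule. The rescaling assumes the update norm is proportional to the learning rate, which is an assumption of this sketch and not something the table measures.

```python
import math

widths = [64, 128, 256, 512, 1024]
perp_norm = [0.00317, 0.00553, 0.01006, 0.01950, 0.03826]  # from the table

# Local exponent between each pair of neighbouring widths.
local_exponents = [
    math.log(perp_norm[i + 1] / perp_norm[i]) / math.log(widths[i + 1] / widths[i])
    for i in range(len(widths) - 1)
]

# Exponent between the smallest and largest width.
overall_exponent = (math.log(perp_norm[-1] / perp_norm[0])
                    / math.log(widths[-1] / widths[0]))

# Hypothetical 1/d learning-rate compensation, normalised at d = 64.
compensated = [y * widths[0] / d for y, d in zip(perp_norm, widths)]

print("local exponents :", [round(e, 2) for e in local_exponents])
print("overall exponent:", round(overall_exponent, 2))
print("after 1/d rescale:", [round(c, 5) for c in compensated])
```

The local exponents rise from about 0.80 to about 0.97 and the overall exponent is about 0.90. After the 1/d rescaling the values vary by less than a factor of 1.4 across a 16× change in width.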

### Diagnostic 6: Pre/Post Gate Statistics

| Width | p̄ (mean) | \|\|p\|\|₁ | q̄ (mean) | \|\|q\|\|_∞ |
|-------|----------|--------|----------|---------|
| 64 | 0.498 | 1.993 | 0.998 | 1.014 |
| 128 | 0.501 | 2.003 | 1.000 | 1.024 |
| 256 | 0.499 | 1.998 | 1.002 | 1.033 |
| 512 | 0.497 | 1.989 | 1.005 | 1.053 |
| 1024 | 0.502 | 2.009 | 1.008 | 1.072 |

Gates are stable across widths. Pre-weights center at 0.5 (sigmoid midpoint), so ||p||₁ ≈ n_s · 0.5 = 2, and post-weights center at 1.0 (2·sigmoid(0)). The only visible drift is in ||q||_∞, which rises mildly from 1.014 to 1.072.

---

## Corrected Parameterizations

| Correction | Description | \|\|DC\|\|·\|\|x\|\| scaling |
|------------|-------------|---------------------|
| Baseline | σ_W=0.02, s_C=0.1 | d^{0.98} ❌ |
| A | s_C = 1/(n_s·d) | d^{-0.04} ✅ (comb only) |
| B | σ_W = s_C = 1/√(n_s·d) | d^{-0.04} ✅ |
| **C** | **σ_W = 1/(n_s·d), s_C=0.1** | **d^{-0.04} ✅ (all gates)** |
| D | s_C = 1/√d, σ_W=0.02 | d^{0.46} ❌ |
| E | σ_W = 1/√(n_s·d), s_C=0.1 | d^{0.47} ❌ |

### The µP Rule for Dynamic mHC Generator Weights

| Parameter | Init | LR scaling |
|-----------|------|------------|
| W_ℓ^a ∈ ℝ^{(2+n_s)n_s × n_s·d} | σ² = 1/(n_s·d)² | η/d |
| s_ℓ^a ∈ ℝ³ | O(1) | η |
| b_ℓ^a ∈ ℝ^{(2+n_s)n_s} | 0 | η |

**Key insight:** The generator weight's effective fan-in is **n_s·d** (the total multi-stream dimension), not d.
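
The rule in the table can be written down as a small helper that returns the init scale and learning rates for a given width. This is a minimal sketch: the names `GeneratorRule` and `generator_rule` are hypothetical, and normalising the η/d rule so that it equals the base learning rate at a reference width of 64 is an assumption of the sketch, since the table fixes only the 1/d dependence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorRule:
    weight_shape: tuple  # (fan_out, fan_in) of the generator weight W
    weight_std: float    # init standard deviation for W
    weight_lr: float     # learning rate for W
    scale_lr: float      # learning rate for the scales s
    bias_lr: float       # learning rate for the bias b (initialised to 0)


def generator_rule(d, n_s=4, base_lr=1e-3, base_width=64):
    """Init and LR rule from the table; base_lr and base_width are assumed values."""
    fan_in = n_s * d              # total multi-stream dimension, not d
    fan_out = (2 + n_s) * n_s     # pre, post and comb logits
    return GeneratorRule(
        weight_shape=(fan_out, fan_in),
        weight_std=1.0 / fan_in,              # sigma^2 = 1 / (n_s * d)^2
        weight_lr=base_lr * base_width / d,   # eta / d, normalised at base_width
        scale_lr=base_lr,                     # eta
        bias_lr=base_lr,                      # eta
    )


for d in (64, 1024):
    rule = generator_rule(d)
    print(d, rule.weight_shape, rule.weight_std, rule.weight_lr)
```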

---

## Implications for Theorem D

The conditions are:
1. ✅ Branch f_ℓ^a satisfies standard spectral µP (assumed)
2. ✅ ε_K = O(1), actually ε_K ≈ 10^{-6} (Diagnostic 1)
3. ✅ Quotient Jacobian well-conditioned, κ ≈ 7-10 (Diagnostic 3)
4. ✅→ **Requires Correction C:** the baseline violates this condition (d^{0.98}, Diagnostic 4); σ_W = 1/(n_s·d) restores O(1) (Corrected Parameterizations)
5. ✅→ **Requires LR scaling:** η_W = Θ(1/d), to offset the growth of the generated-logit update with width (Diagnostic 5)

With both corrections applied, all five conditions of Theorem D are satisfied, condition 1 by assumption rather than by measurement.

---

## V4 Sinkhorn Implementation Notes

From `kernel.py`:
1. **Init:** Row-softmax + eps, then col-normalize
2. **Iterations:** K-1 repetitions of (row-normalize, col-normalize)
3. **Convention:** comb[j,k] with j=output stream, k=input stream
4. **hc_post:** y_o = q_o · f(y) + Σ_i C[i,o] · residual_i
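
Steps 1 and 2 above can be sketched in plain Python as follows. This is not the code from `kernel.py`: the function names are hypothetical, and adding eps to each normalisation denominator is an assumption of this sketch (the notes only say that eps is added after the row-softmax).

```python
import math


def row_normalize(mat, eps):
    """Divide each row by its sum (eps in the denominator is an assumption)."""
    return [[v / (sum(row) + eps) for v in row] for row in mat]


def col_normalize(mat, eps):
    """Divide each column by its sum (eps in the denominator is an assumption)."""
    n = len(mat)
    col_sums = [sum(mat[i][j] for i in range(n)) for j in range(len(mat[0]))]
    return [[v / (col_sums[j] + eps) for j, v in enumerate(row)] for row in mat]


def sinkhorn(logits, num_iters=20, eps=1e-6):
    # Step 1: row-softmax + eps, then one column normalisation.
    mat = []
    for row in logits:
        shift = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        mat.append([e / total + eps for e in exps])
    mat = col_normalize(mat, eps)
    # Step 2: K - 1 repetitions of (row-normalise, column-normalise).
    for _ in range(num_iters - 1):
        mat = row_normalize(mat, eps)
        mat = col_normalize(mat, eps)
    return mat


logits = [[0.2, -0.1, 0.4, 0.0],
          [-0.3, 0.5, 0.1, -0.2],
          [0.0, 0.3, -0.4, 0.2],
          [0.1, -0.5, 0.2, 0.3]]
comb = sinkhorn(logits)

row_err = max(abs(sum(row) - 1.0) for row in comb)
col_err = max(abs(sum(comb[i][j] for i in range(4)) - 1.0) for j in range(4))
print(f"max row-sum error: {row_err:.2e}, max col-sum error: {col_err:.2e}")
```

In this sketch the residual row and column errors are set by eps rather than by the iteration count, which is the behaviour Diagnostic 1 reports for K = 20 and n_s = 4.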

## Files

- `mhc_diagnostics.py`: Complete diagnostic implementation (all 6 diagnostics)
- `mhc_analysis.py`: Chain decomposition, corrections, figures