galimova
/

mhc-mup-diagnostics

galimova committed 6 days ago

Commit

72fa8f0

1 Parent(s): 7a73d66

Add diagnostic report and analysis

Files changed (1)

README.md +186 -0

	@@ -0,0 +1,186 @@

+# Dynamic mHC µP Diagnostic Report
+## Empirical Verification of Theorem Conditions for V4-style Dynamic Residual Routing
+**Date:** 2026-01-05
+**Config:** DeepSeek V4 (n_s=4, K=20, hidden_size=7168, 61 layers)
+---
+## Executive Summary
+We ran all six diagnostics prescribed by the updated plan against a pure-PyTorch reimplementation of DeepSeek V4's dynamic mHC mechanism, sweeping width d ∈ {64, 128, 256, 512, 1024} with fixed n_s = 4 and K = 20.
+**The headline finding:** Under standard initialization (σ_W = 0.02, s_C = 0.1), the dynamic generator sensitivity ||DC(x)|| · ||x|| scales as **d^0.98 ≈ d^1**, violating the O(1) condition required by Theorem D (condition 4). This is the central technical obstacle for µP transfer in dynamic mHC.
+**The fix:** Multiple corrected parameterizations restore O(1) scaling. The cleanest are:
+- **Correction A** (s_C = 1/(n_s·d)): yields d^{-0.035} ≈ O(1) for comb
+- **Correction C** (σ_W = 1/(n_s·d)): yields d^{-0.037} ≈ O(1) for **all three** (comb, pre, post)
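+Correction C is the µP-natural choice: it sets the standard deviation of the generator weight W_C to 1/fan-in (variance 1/(n_s·d)²), which matches the scaling µP uses for readout-style layers whose output dimension stays fixed as width grows. The 1/√fan-in initialization of Correction E is not enough (d^{0.47}).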
+---
+## Diagnostic Results
+### Diagnostic 1: Finite-K Sinkhorn Error ε_K
+| Width | ε_K (mean) | ε_K (max) |
+|-------|-----------|-----------|
+| 64    | 1.0e-6    | 1.0e-6    |
+| 128   | 1.0e-6    | 1.0e-6    |
+| 256   | 1.0e-6    | 1.0e-6    |
+| 512   | 1.0e-6    | 1.0e-6    |
+| 1024  | 1.0e-6    | 1.0e-6    |
+**Conclusion:** ε_K ≈ eps = 1e-6 (the additive epsilon in the Sinkhorn loop dominates). With K=20 and n_s=4, Sinkhorn convergence is essentially exact. The finite-K error is **not** a practical concern for V4's configuration.
+**Impact on Theorem A:** The frozen realized comb stability theorem holds essentially exactly: ||C||_2 = 1 + O(ε_K) with ε_K ≈ 10^{-6}.
+### Diagnostic 2: Frozen Mixer Spectral Norm ||C||_2
+| Width | ||C||_2 (mean) | κ(C) (mean) |
+|-------|----------------|-------------|
+| 64    | 0.999999       | 1167        |
+| 128   | 0.999999       | 652         |
+| 256   | 0.999999       | 409         |
+| 512   | 0.999999       | 462         |
+| 1024  | 0.999999       | 264         |
+**Conclusion:** ||C||_2 ≈ 1 to 6 decimal places. The Birkhoff constraint works exactly as Theorem A predicts. The realized comb is nonexpansive.
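+**Note:** The high condition number κ(C) means that while the maximum singular value is 1, the minimum singular value is small (roughly 0.001 to 0.004, from σ_min = ||C||_2 / κ). C is therefore close to rank-deficient. This is what one would expect if Sinkhorn is producing near-uniform matrices (entries close to 1/n_s, rank one in the limit) because the logits have small spread, and it is consistent with κ falling as width grows. A near-permutation matrix would instead have all singular values close to 1 and κ ≈ 1.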
+### Diagnostic 3: Sinkhorn Quotient-Jacobian Spectrum
+This diagnostic is **width-independent** (operates on fixed n_s=4 matrices).
+| K  | σ_max | σ_min | κ    | Gauge Leakage |
+|----|-------|-------|------|---------------|
+| 1  | 0.358 | 0.065 | 6.86 | 0.278         |
+| 2  | 0.353 | 0.064 | 6.93 | 0.091         |
+| 5  | 0.354 | 0.063 | 8.13 | 0.006         |
+| 10 | 0.355 | 0.058 | 9.92 | 0.0001        |
+| 20 | 0.354 | 0.067 | 7.40 | 0.000002      |
+| 50 | 0.355 | 0.063 | 7.91 | 0.000000      |
+**Key Findings:**
+1. **The quotient Jacobian is well-conditioned.** κ ≈ 7-10 across all K values. This validates Theorem C's assumption.
+2. **σ_max ≈ 0.35, σ_min ≈ 0.06.** The Sinkhorn projection is a **contraction** on G^⊥ (σ_max < 1). Perturbations are damped, not amplified.
+3. **Gauge leakage drops exponentially with K.** At K=20, leakage is 2e-6. The Sinkhorn Jacobian maps G^⊥ almost perfectly into G^⊥.
+4. **The spectrum is K-independent for K ≥ 5.** Convergence is fast.
+5. **The (n_s-1)² = 9 singular values** have a smooth distribution — all gauge-perpendicular directions are treated comparably.
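The gauge bookkeeping in these findings can be made concrete with a small sketch. It assumes (the report does not spell this out) that G is the span of row-constant and column-constant logit shifts Z_ij → Z_ij + a_i + b_j, which has dimension 2·n_s − 1 = 7 and leaves (n_s − 1)² = 9 perpendicular directions for n_s = 4. Under that assumption the projector onto G^⊥ is double centering. The function names are illustrative and are not taken from `mhc_diagnostics.py`.

```python
import random


def project_perp(z):
    """Double-center a square matrix (remove row means and column means).

    Assumption: the gauge subspace G is spanned by row-constant and
    column-constant shifts, so this is the orthogonal projector onto G-perp.
    """
    n = len(z)
    row_mean = [sum(row) / n for row in z]
    col_mean = [sum(z[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [z[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def frob(z):
    """Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for row in z for v in row) ** 0.5


n_s = 4
gauge_dim = 2 * n_s - 1            # row shifts + column shifts, minus one overlap
perp_dim = n_s * n_s - gauge_dim   # equals (n_s - 1) ** 2

# A random logit update and its gauge-perpendicular part.
rng = random.Random(0)
delta_z = [[rng.gauss(0.0, 1.0) for _ in range(n_s)] for _ in range(n_s)]
perp = project_perp(delta_z)
perp_ratio = frob(perp) / frob(delta_z)

# A pure column-constant shift lies entirely in G, so its projection vanishes.
gauge_shift = [[1.5 + 0.1 * j for j in range(n_s)] for _ in range(n_s)]
print(perp_dim, perp_ratio, frob(project_perp(gauge_shift)))
```

For an isotropic random 4×4 matrix the squared ratio averages 9/16, so a ratio close to 1 indicates that an update is structured to avoid gauge directions.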
+### Diagnostic 4: Dynamic Sensitivity (THE KEY RESULT)
+| Width | ||DC(x)||·||x|| | ||Dp(x)||·||x|| | ||Dq(x)||·||x|| |
+|-------|-----------------|-----------------|-----------------|
+| 64    | 0.147           | 0.139           | 0.274           |
+| 128   | 0.284           | 0.271           | 0.543           |
+| 256   | 0.559           | 0.532           | 1.066           |
+| 512   | 1.112           | 1.052           | 2.110           |
+| 1024  | 2.235           | 2.086           | 4.158           |
+**Scaling exponents:**
+- ||DC(x)||·||x|| ~ **d^{0.98}**
+- ||Dp(x)||·||x|| ~ **d^{0.98}**
+- ||Dq(x)||·||x|| ~ **d^{0.98}**
+**This is the smoking gun.** All three dynamic sensitivities scale linearly with width.
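The exponents can be re-derived from the table with a log-log least-squares fit. The sketch below uses only the numbers printed above; `fit_exponent` is an illustrative helper, not necessarily how `mhc_analysis.py` computes the fit.

```python
import math


def fit_exponent(widths, values):
    """Least-squares slope of log(value) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


widths = [64, 128, 256, 512, 1024]
# Values copied from the Diagnostic 4 table.
sensitivity = {
    "DC": [0.147, 0.284, 0.559, 1.112, 2.235],
    "Dp": [0.139, 0.271, 0.532, 1.052, 2.086],
    "Dq": [0.274, 0.543, 1.066, 2.110, 4.158],
}
exponents = {name: fit_exponent(widths, vals) for name, vals in sensitivity.items()}
for name, alpha in exponents.items():
    print(f"||{name}(x)||·||x|| ~ d^{alpha:.2f}")
```

Each fitted slope comes out within about 0.01 of 0.98, in line with the exponents listed above.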
+#### Jacobian Chain Decomposition
+| Component | Scaling | Value at d=1024 |
+|-----------|---------|-----------------|
+| ||D(RMSNorm)||₂ | d^0 ≈ O(1) | 1.000 |
+| ||W_comb||₂ | d^{0.45} ≈ √d | 1.349 |
+| |s_C| | O(1) | 0.100 |
+| ||DS_K||₂ | d^0 ≈ O(1) | 0.262 |
+| ||x|| | d^{0.50} = √d | 64.01 |
+**The two √d factors:**
+1. **||W_comb||₂ ~ √d**: Generator weight spectral norm (fan-in = n_s·d)
+2. **||x|| ~ √d**: Multi-stream state norm (n_s·d entries)
+Product (in table order): O(1) · √d · O(1) · O(1) · √d = **Θ(d)**
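As a sanity check on the decomposition, the d = 1024 column can be multiplied out and compared with the measured sensitivity. The sketch assumes DC factors through exactly these five maps, so that the product of operator norms is an upper bound by submultiplicativity; the dictionary layout is illustrative.

```python
import math

# (value at d = 1024, fitted width exponent), copied from the chain table.
factors = {
    "D(RMSNorm)": (1.000, 0.00),
    "W_comb": (1.349, 0.45),
    "s_C": (0.100, 0.00),
    "DS_K": (0.262, 0.00),
    "x": (64.01, 0.50),
}

# Product of the per-factor norms: an upper bound on ||DC(x)||·||x||.
product_bound = math.prod(value for value, _ in factors.values())
# Sum of per-factor exponents: the predicted scaling of the product.
exponent_sum = sum(alpha for _, alpha in factors.values())

measured = 2.235  # ||DC(x)||·||x|| at d = 1024 from the Diagnostic 4 table
print(f"bound = {product_bound:.3f}, measured = {measured}, exponent = {exponent_sum:.2f}")
```

The product is about 2.26, just above the measured 2.235, and the exponents sum to 0.95, close to the fitted 0.98. The chain is therefore nearly tight: almost none of the two √d factors is lost to misalignment between the maps.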
+### Diagnostic 5: Generated-Logit Update Scale
+| Width | ||Π_{G^⊥} ΔZ||₂ | Perp/Total Ratio |
+|-------|------------------|------------------|
+| 64    | 0.00317          | 0.998            |
+| 128   | 0.00553          | 0.997            |
+| 256   | 0.01006          | 0.997            |
+| 512   | 0.01950          | 0.994            |
+| 1024  | 0.03826          | 0.996            |
+**Conclusions:**
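+1. Almost all of ΔZ is in G^⊥ (Perp/Total ≥ 0.994 at every width). Gradient updates naturally avoid gauge directions.
+2. ||Π_{G^⊥} ΔZ||₂ grows with width: a 16× increase in d raises it about 12×, an exponent of roughly 0.9 that approaches 1 between the larger widths. This growth requires LR compensation.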
+### Diagnostic 6: Pre/Post Gate Statistics
+| Width | p̄ (mean) | ||p||₁ | q̄ (mean) | ||q||_∞ |
+|-------|----------|--------|----------|---------|
+| 64    | 0.498    | 1.993  | 0.998    | 1.014   |
+| 128   | 0.501    | 2.003  | 1.000    | 1.024   |
+| 256   | 0.499    | 1.998  | 1.002    | 1.033   |
+| 512   | 0.497    | 1.989  | 1.005    | 1.053   |
+| 1024  | 0.502    | 2.009  | 1.008    | 1.072   |
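+Gate means are stable across widths. Pre-weights center at 0.5 (sigmoid midpoint), so ||p||₁ ≈ n_s · 0.5 = 2; post-weights center at 1.0 (2·sigmoid(0)). The one visible drift is ||q||_∞, which rises from 1.014 to 1.072.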
+---
+## Corrected Parameterizations
+| Correction | Description | ||DC||·||x|| scaling |
+|------------|-------------|---------------------|
+| Baseline   | σ_W=0.02, s_C=0.1 | d^{0.98} ❌ |
+| A          | s_C = 1/(n_s·d) | d^{-0.04} ✅ (comb only) |
+| B          | σ_W = s_C = 1/√(n_s·d) | d^{-0.04} ✅ |
+| **C**      | **σ_W = 1/(n_s·d), s_C=0.1** | **d^{-0.04} ✅ (all gates)** |
+| D          | s_C = 1/√d, σ_W=0.02 | d^{0.46} ❌ |
+| E          | σ_W = 1/√(n_s·d), s_C=0.1 | d^{0.47} ❌ |
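The pass/fail pattern in this table can be anticipated with a back-of-the-envelope model. The sketch below is a heuristic, not the measurement code: it assumes the sensitivity is proportional to ||W_comb||₂ · s_C · ||x||, uses the standard Gaussian random-matrix estimate ||W||₂ ≈ σ_W(√fan_out + √fan_in) with fan_out = n_s² (comb logits only, an assumption), takes ||x|| = √(n_s·d), and treats the RMSNorm and Sinkhorn Jacobians as width-independent constants. All names are illustrative.

```python
import math


def fit_exponent(widths, values):
    """Least-squares slope of log(value) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return num / sum((x - x_mean) ** 2 for x in xs)


def predicted_sensitivity(d, sigma_w, s_c, n_s=4):
    """Heuristic ||W||_2 * s_C * ||x||, up to width-independent constants."""
    fan_in = n_s * d
    fan_out = n_s * n_s  # assumption: rows of W that generate the comb logits
    w_norm = sigma_w * (math.sqrt(fan_out) + math.sqrt(fan_in))
    x_norm = math.sqrt(fan_in)  # unit-RMS state entries
    return w_norm * s_c * x_norm


N_S = 4
# Each rule maps width d to (sigma_W, s_C), following the table rows.
corrections = {
    "Baseline": lambda d: (0.02, 0.1),
    "A": lambda d: (0.02, 1.0 / (N_S * d)),
    "B": lambda d: (1.0 / math.sqrt(N_S * d), 1.0 / math.sqrt(N_S * d)),
    "C": lambda d: (1.0 / (N_S * d), 0.1),
    "D": lambda d: (0.02, 1.0 / math.sqrt(d)),
    "E": lambda d: (1.0 / math.sqrt(N_S * d), 0.1),
}

widths = [64, 128, 256, 512, 1024]
predicted = {}
for name, rule in corrections.items():
    values = [predicted_sensitivity(d, *rule(d), n_s=N_S) for d in widths]
    predicted[name] = fit_exponent(widths, values)
    print(f"{name}: d^{predicted[name]:+.2f}")
```

Under these assumptions the model gives roughly d^{0.94} for the baseline, d^{-0.06} for A, B and C, and d^{0.44} for D and E. These sit slightly below the measured exponents but reproduce which corrections reach O(1) and which fall short by a factor of √d.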
+### The µP Rule for Dynamic mHC Generator Weights
+| Parameter | Init | LR scaling |
+|-----------|------|------------|
+| W_ℓ^a ∈ ℝ^{(2+n_s)n_s × n_s·d} | σ² = 1/(n_s·d)² | η/d |
+| s_ℓ^a ∈ ℝ³ | O(1) | η |
+| b_ℓ^a ∈ ℝ^{(2+n_s)n_s} | 0 | η |
+**Key insight:** The generator weight's effective fan-in is **n_s·d** (the total multi-stream dimension), not d.
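A minimal sketch of how the rule table could be turned into concrete hyperparameters. The helper, its field names, and the `base_lr` / `base_d` arguments are hypothetical; in particular, reading "η/d" as `base_lr * base_d / d` (so that the weight LR equals η at a reference width) is an assumption, since the table does not state the normalization.

```python
from dataclasses import dataclass


@dataclass
class GeneratorHParams:
    weight_shape: tuple      # ((2 + n_s) * n_s, n_s * d)
    weight_init_std: float   # sigma = 1 / (n_s * d), i.e. sigma^2 = 1 / (n_s * d)^2
    weight_lr: float         # scales as 1 / d
    scale_lr: float          # width-independent
    bias_lr: float           # width-independent; bias itself is initialized to 0


def generator_mup_hparams(d, n_s=4, base_lr=1e-3, base_d=64):
    """Hyperparameters for the generator following the rule table above.

    Assumption: 'eta / d' is normalized so weight_lr == base_lr at d == base_d.
    """
    fan_in = n_s * d             # effective fan-in is the full multi-stream width
    fan_out = (2 + n_s) * n_s    # pre, post and comb logits
    return GeneratorHParams(
        weight_shape=(fan_out, fan_in),
        weight_init_std=1.0 / fan_in,
        weight_lr=base_lr * base_d / d,
        scale_lr=base_lr,
        bias_lr=base_lr,
    )


hp = generator_mup_hparams(d=1024)
print(hp)
```

Using `n_s * d` rather than `d` as the fan-in is the point of the key insight: with n_s = 4 it changes the initial standard deviation by a factor of 4.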
+---
+## Implications for Theorem D
+The conditions are:
+1. ✅ Branch f_ℓ^a satisfies standard spectral µP (assumed)
+2. ✅ ε_K = O(1), actually ε_K ≈ 10^{-6} (Diagnostic 1)
+3. ✅ Quotient Jacobian well-conditioned, κ ≈ 7-10 (Diagnostic 3)
+4. ✅ **Only with Correction C:** the baseline violates this condition (d^{0.98}); σ_W = 1/(n_s·d) restores O(1) (Diagnostic 4)
+5. ✅ **Only with LR scaling:** η_W = Θ(1/d) (Diagnostic 5)
+With both corrections applied, all five conditions of Theorem D are satisfied.
+---
+## V4 Sinkhorn Implementation Notes
+From `kernel.py`:
+1. **Init:** Row-softmax + eps, then col-normalize
+2. **Iterations:** K-1 repetitions of (row-normalize, col-normalize)
+3. **Convention:** comb[j,k] with j=output stream, k=input stream
+4. **hc_post:** y_o = q_o · f(y) + Σ_i C[i,o] · residual_i
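Steps 1 and 2 above can be sketched in plain Python for a single n_s × n_s logit matrix. This is an illustrative reading of the notes, not the code in `kernel.py`: where eps enters (added after the softmax and to each normalization denominator) is an assumption, and the function names are made up for this example.

```python
import math
import random


def _row_normalize(m, eps):
    """Divide each row by its sum (eps added to the denominator by assumption)."""
    return [[v / (sum(row) + eps) for v in row] for row in m]


def _col_normalize(m, eps):
    """Divide each column by its sum (eps added to the denominator by assumption)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / (col_sums[j] + eps) for j in range(n)] for i in range(n)]


def sinkhorn(logits, num_iters=20, eps=1e-6):
    """Row-softmax + eps, col-normalize, then K-1 (row, col) normalization rounds."""
    # Step 1: numerically stable row-softmax, then add eps to every entry.
    m = []
    for row in logits:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        m.append([e / total + eps for e in exps])
    m = _col_normalize(m, eps)
    # Step 2: K-1 repetitions of (row-normalize, col-normalize).
    for _ in range(num_iters - 1):
        m = _row_normalize(m, eps)
        m = _col_normalize(m, eps)
    return m


def marginal_error(m):
    """Largest deviation of any row or column sum from 1."""
    n = len(m)
    row_err = max(abs(sum(row) - 1.0) for row in m)
    col_err = max(abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n))
    return max(row_err, col_err)


rng = random.Random(0)
logits = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(4)]
comb = sinkhorn(logits, num_iters=20)
print(marginal_error(comb))
```

With eps in the denominators, the row and column sums settle slightly below 1 rather than exactly at 1, so the residual marginal error is set by eps rather than by the iteration count once K is large enough.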
+## Files
+- `mhc_diagnostics.py` — Complete diagnostic implementation (all 6 diagnostics)
+- `mhc_analysis.py` — Chain decomposition, corrections, figures