mikeumus-divincian committed on
Commit 30e5dd2 · verified · 1 Parent(s): 1332648

Upload README.md with huggingface_hub

Files changed (1): README.md +61 -8
README.md CHANGED
@@ -18,6 +18,10 @@ A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct
 
 Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).
 
+ > **Status (2026-04-23):** Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8×H100) pending. The core finding — flat SVD spectrum consistent with 1-bit models — is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
+ >
+ > *This card updates in-place as phases land. See [changelog](#changelog) at the bottom.*
+
 ## What this is
 
 This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices — specifically the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd and the singular value statistics are aggregated into a queryable feature database.
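For concreteness, the per-layer statistic can be sketched as follows. This is a minimal illustration, not the builder itself (that lives in `moe_vindex_builder.py`); the stand-in shapes and the reading of var@64 as the top-64 share of squared singular values are assumptions from context:

```python
import torch

def var_at_k(weights: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Share of spectral variance captured by the top-k singular
    directions of each matrix in a batch (assumed var@64 definition)."""
    s = torch.linalg.svdvals(weights)        # (n_experts, min(d_out, d_in))
    s2 = s ** 2
    return s2[:, :k].sum(dim=1) / s2.sum(dim=1)

# Stand-in for one MoE layer: the real stack would be the 384 expert
# down_proj matrices, roughly (384, 7168, 2048) for Kimi-K2.
experts = torch.randn(8, 512, 128)

v = var_at_k(experts)
print(f"median var@64 = {v.median():.3f}, "
      f"IQR = [{v.quantile(0.25):.3f}, {v.quantile(0.75):.3f}]")
# Power-law spectra (fp16-style) land near 0.80-0.90;
# flat spectra (the fp8-native result below) land under 0.15.
```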
 
@@ -28,6 +32,32 @@ The vindex enables:
 - **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
 - **Knowledge editing** via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact; see the sketch below)
 
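The rank-1 editing primitive can be pictured as follows. A minimal sketch assuming SVD-based editing of a single `down_proj`; the function names are illustrative, not the LarQL API:

```python
import torch

def delete_feature(W: torch.Tensor, idx: int) -> torch.Tensor:
    """Rank-1 DELETE: subtract one singular direction,
    W' = W - s_idx * u_idx @ v_idx^T."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return W - S[idx] * torch.outer(U[:, idx], Vh[idx, :])

def insert_feature(W: torch.Tensor, u: torch.Tensor, v: torch.Tensor,
                   scale: float) -> torch.Tensor:
    """Rank-1 INSERT: add a new outer-product direction."""
    return W + scale * torch.outer(u / u.norm(), v / v.norm())

W = torch.randn(512, 128)
W_edited = delete_feature(W, idx=3)        # remove the 4th feature direction
print(torch.linalg.svdvals(W_edited)[:5])  # s_3 is gone; others unchanged
```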
+ ## Key finding: fp8-native training causes spectral dissolution
+
+ **Training precision, not storage precision, determines spectral structure.**
+
+ | Model | Training precision | var@64 (median) | Spectral class |
+ |-------|-------------------|-----------------|----------------|
+ | Gemma 4 E2B-it | fp16/bf16 | 0.80–0.90 | non-dissolved |
+ | Ministral-3B | fp16 → *post-quant* fp8 | ~0.85 | non-dissolved |
+ | **Kimi-K2-Instruct** | **fp8 native** | **0.088** (MoE, 6/61 layers) | **dissolved** |
+ | Bonsai 8B | fp16 → *post-quant* 1-bit | 0.093 | dissolved |
+ | BitNet b1.58-2B-4T | **1-bit native** | 0.111 | dissolved |
+
+ *Dissolved: var@64 < 0.15 (bimodal gap — no model yet observed between 0.15 and 0.50).*
+
+ The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is *also* stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used *at training time* (gradient noise and limited numerical range) or applied *after* fp16 training (post-quantization, which preserves the fp16-shaped spectrum).
+
+ **Hypothesis:** Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well-represented — so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); the mechanism is untested.
+
+ **What this does not claim:**
+ - fp8-native training is not *equivalent* to 1-bit in capability or behavior
+ - Post-quantized fp8 is not "worse" than native fp8 — the spectra differ; behavioral quality is a separate question
+ - We have no data on fp4, bf16, or other low-precision training regimes
+ - n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof
+
+ **Testable prediction:** Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.
+
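Stated as code, the classification above reduces to a single threshold in the bimodal gap (medians copied from the table; the 0.15 cutoff is the card's own):

```python
# var@64 medians from the table above. One threshold suffices because
# no model has yet been observed between 0.15 and 0.50.
DISSOLVED_CUTOFF = 0.15

observed = {
    "Ministral-3B (post-quant fp8)":     0.85,
    "Kimi-K2-Instruct (fp8 native)":     0.088,
    "Bonsai 8B (post-quant 1-bit)":      0.093,
    "BitNet b1.58-2B-4T (1-bit native)": 0.111,
}

for name, var64 in observed.items():
    label = "dissolved" if var64 < DISSOLVED_CUTOFF else "non-dissolved"
    print(f"{name}: var@64 = {var64} -> {label}")
```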
 ## What this is not
 
 - This is **not** an inference endpoint. You cannot run generation with this artifact.
 
@@ -61,21 +91,36 @@
 | `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
 | `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |
 
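The Phase 3 statistics behind `phase3_routing_stats.json` can be collected roughly as below. A hedged sketch: it assumes the checkpoint's remote code exposes Mixtral-style `router_logits` under `output_router_logits=True`, and two stand-in prompts replace the 256 used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-K2-Instruct"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

prompts = ["Explain SVD in one sentence.", "Translate 'hello' into French."]
# One row per routed layer (the spot-check shows L00 is dense), 384 experts.
expert_counts = torch.zeros(60, 384, dtype=torch.long)

for p in prompts:
    inputs = tok(p, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_router_logits=True)
    # out.router_logits: one (seq_len, num_experts) tensor per routed layer.
    for layer, logits in enumerate(out.router_logits):
        top8 = logits.topk(8, dim=-1).indices  # num_experts_per_tok = 8
        expert_counts[layer].scatter_add_(
            0, top8.flatten(), torch.ones_like(top8.flatten())
        )

print(expert_counts.float().mean(dim=1))  # mean selections per expert, per layer
```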
- ## Universal constants (C1–C5) — preliminary results
+ ## Universal constants (C1–C5) — spot-check results (6 layers)
 
- *Phase 1+2 running on Modal H100. Results will be posted here and in the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) once complete.*
+ *Phase 1 confirmed on Modal L4 (all 61 layers running). Phases 2 and 3 pending.*
 
 | Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
 |----------|-------------|----------------------|----------------|
- | **C1** | FFN activation sparsity | 0.06–0.39 | pending |
+ | **C1** | FFN activation sparsity | 0.06–0.39 | pending (Phase 3) |
 | **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7%† |
- | **C3** | Gate coherence | 0.53–0.81 | pending |
- | **C4** | Layer temperature | 0.012–0.042 | pending |
- | **C5** | Circuit stage count | 4 (fp16) | pending |
+ | **C3** | Gate coherence | 0.53–0.81 | pending (Phase 1 full) |
+ | **C4** | Layer temperature (var@64) | 0.80–0.90 (fp16) | **0.037–0.10**‡ |
+ | **C5** | Circuit stage count | 4 (fp16) | pending (Phase 3) |
 
- †*`num_experts_per_tok: 8` confirmed from `config.json`. Four organizations (Google, OpenAI, Alibaba, Moonshot AI) have independently converged on top-8 routing for frontier MoE.*
+ †*`num_experts_per_tok: 8` confirmed from `config.json`.*
 
- **Prediction**: var@64 ≈ 0.80–0.90 (power-law, same as all other fp16/bf16 models) and C5 = 4 (circuit intact). The MoE architecture shouldn't change this — per-expert weight matrices are bf16-equivalent.
+ ‡ **Key finding: fp8-native training produces a flat weight spectrum.** Spot-check results:
+
+ | Layer | Type | var@64 | IQR / note |
+ |-------|------|--------|------------|
+ | L00 | Dense MLP | 0.037 | S[:3] = [10.7, 8.6, 8.1] — flat spectrum |
+ | L01 | MoE (384 experts) | 0.082 | IQR = [0.076, 0.087] |
+ | L15 | MoE | 0.100 | IQR = [0.090, 0.110] |
+ | L30 | MoE | pending (full run) | — |
+ | L45 | MoE | pending | — |
+ | L60 | MoE | pending | — |
+
+ **This result is surprising and genuine.** The fp8 dequantization is verified correct (`scale_inv` stats: mean = 0.00028, cv = 0.41; dequantized weight std = 0.023, abs max = 0.93 — a realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.
+
+ For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 ≈ 0.80–0.90 with a clear power-law singular value spectrum (S[0]/S[1] ≈ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 ≈ 1.24 — nearly uniform.
+
+ **Hypothesis**: fp8-native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or to the extreme training scale (14.7T tokens at 1T params).
+
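The dequantization step being verified can be sketched as follows, assuming DeepSeek-V3-style fp8 storage with one `weight_scale_inv` entry per 128×128 weight block (layout, names, and shapes here are assumptions, not confirmed details of this repo):

```python
import torch

def dequant_fp8_block(w_fp8: torch.Tensor, scale_inv: torch.Tensor,
                      block: int = 128) -> torch.Tensor:
    """Expand one inverse scale per (block x block) tile of the fp8
    weight, then multiply back to an fp16-equivalent float tensor."""
    scales = scale_inv.repeat_interleave(block, dim=0)
    scales = scales.repeat_interleave(block, dim=1)
    scales = scales[: w_fp8.shape[0], : w_fp8.shape[1]]  # trim ragged edges
    return w_fp8.to(torch.float32) * scales

# Stand-in tensors; the verification quoted above checks that the
# dequantized stats (std ~0.023, abs max ~0.93) look fp16-realistic.
w_fp8 = torch.randn(256, 384).to(torch.float8_e4m3fn)
scale_inv = torch.full((2, 3), 2.8e-4)   # roughly the quoted scale_inv mean
w = dequant_fp8_block(w_fp8, scale_inv)
print(f"std = {w.std():.3g}, abs_max = {w.abs().max():.3g}")
```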
  ## MoE-specific methodology
 
@@ -146,4 +191,12 @@ CC-BY-NC 4.0 — free for non-commercial research use. For commercial use, conta
 ---
 
+ ## Changelog
+
+ | Date | Update |
+ |------|--------|
+ | 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution confirmed. Full Phase 1 running. |
+
+ ---
+
  *Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*