Upload README.md with huggingface_hub
Browse files
README.md
CHANGED
|
@@ -18,6 +18,10 @@ A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct
|
|
| 18 |
|
| 19 |
Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).
|
| 20 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
## What this is
|
| 22 |
|
| 23 |
This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices β specifically the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd and the singular value statistics are aggregated into a queryable feature database.
|
|
@@ -28,6 +32,32 @@ The vindex enables:
|
|
| 28 |
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
|
| 29 |
- **Knowledge editing** via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)
|
| 30 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
## What this is not
|
| 32 |
|
| 33 |
- This is **not** an inference endpoint. You cannot run generation with this artifact.
|
|
@@ -61,21 +91,36 @@ The vindex enables:
|
|
| 61 |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
|
| 62 |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |
|
| 63 |
|
| 64 |
-
## Universal constants (C1βC5) β
|
| 65 |
|
| 66 |
-
*Phase 1
|
| 67 |
|
| 68 |
| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|
| 69 |
|----------|-------------|----------------------|----------------|
|
| 70 |
-
| **C1** | FFN activation sparsity | 0.06β0.39 | pending |
|
| 71 |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7%β |
|
| 72 |
-
| **C3** | Gate coherence | 0.53β0.81 | pending |
|
| 73 |
-
| **C4** | Layer temperature | 0.
|
| 74 |
-
| **C5** | Circuit stage count | 4 (fp16) | pending |
|
| 75 |
|
| 76 |
-
β *`num_experts_per_tok: 8` confirmed from `config.json`.
|
| 77 |
|
| 78 |
-
**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 79 |
|
| 80 |
## MoE-specific methodology
|
| 81 |
|
|
@@ -146,4 +191,12 @@ CC-BY-NC 4.0 β free for non-commercial research use. For commercial use, conta
|
|
| 146 |
|
| 147 |
---
|
| 148 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 149 |
*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*
|
|
|
|
| 18 |
|
| 19 |
Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).
|
| 20 |
|
| 21 |
+
> **Status (2026-04-23):** Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8ΓH100) pending. The core finding β flat SVD spectrum consistent with 1-bit models β is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
|
| 22 |
+
>
|
| 23 |
+
> *This card updates in-place as phases land. See [changelog](#changelog) at the bottom.*
|
| 24 |
+
|
| 25 |
## What this is
|
| 26 |
|
| 27 |
This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices β specifically the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd and the singular value statistics are aggregated into a queryable feature database.
|
|
|
|
| 32 |
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
|
| 33 |
- **Knowledge editing** via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)
|
| 34 |
|
| 35 |
+
## Key finding: fp8-native training causes spectral dissolution
|
| 36 |
+
|
| 37 |
+
**Training precision, not storage precision, determines spectral structure.**
|
| 38 |
+
|
| 39 |
+
| Model | Training precision | var@64 (median) | Spectral class |
|
| 40 |
+
|-------|-------------------|-----------------|----------------|
|
| 41 |
+
| Gemma 4 E2B-it | fp16/bf16 | 0.041 | non-dissolved |
|
| 42 |
+
| Ministral-3B | fp16 β *post-quant* fp8 | ~0.85 | non-dissolved |
|
| 43 |
+
| **Kimi-K2-Instruct** | **fp8 native** | **0.088** (MoE, 6/61 layers) | **dissolved** |
|
| 44 |
+
| Bonsai 8B | fp16 β *post-quant* 1-bit | 0.093 | dissolved |
|
| 45 |
+
| BitNet b1.58-2B-4T | **1-bit native** | 0.111 | dissolved |
|
| 46 |
+
|
| 47 |
+
*Dissolved: var@64 < 0.15 (bimodal gap β no model yet observed between 0.15 and 0.50).*
|
| 48 |
+
|
| 49 |
+
The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is *also* stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used *at training time* (gradient noise and limited numerical range) or applied *after* fp16 training (post-quantization, which preserves the fp16-shaped spectrum).
|
| 50 |
+
|
| 51 |
+
**Hypothesis:** Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well-represented β so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); mechanism is untested.
|
| 52 |
+
|
| 53 |
+
**What this does not claim:**
|
| 54 |
+
- fp8-native training is not *equivalent* to 1-bit in capability or behavior
|
| 55 |
+
- Post-quantized fp8 is not "worse" than native fp8 β the spectra differ, behavioral quality is a separate question
|
| 56 |
+
- We have no data on fp4, bf16, or other low-precision training regimes
|
| 57 |
+
- n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof
|
| 58 |
+
|
| 59 |
+
**Testable prediction:** Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.
|
| 60 |
+
|
| 61 |
## What this is not
|
| 62 |
|
| 63 |
- This is **not** an inference endpoint. You cannot run generation with this artifact.
|
|
|
|
| 91 |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
|
| 92 |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |
|
| 93 |
|
| 94 |
+
## Universal constants (C1βC5) β spot-check results (6 layers)
|
| 95 |
|
| 96 |
+
*Phase 1 confirmed on Modal L4 (all 61 layers running). Phase 2+3 pending.*
|
| 97 |
|
| 98 |
| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|
| 99 |
|----------|-------------|----------------------|----------------|
|
| 100 |
+
| **C1** | FFN activation sparsity | 0.06β0.39 | pending (Phase 3) |
|
| 101 |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7%β |
|
| 102 |
+
| **C3** | Gate coherence | 0.53β0.81 | pending (Phase 1 full) |
|
| 103 |
+
| **C4** | Layer temperature (var@64) | 0.80β0.90 (fp16) | **0.037β0.10** β‘ |
|
| 104 |
+
| **C5** | Circuit stage count | 4 (fp16) | pending (Phase 3) |
|
| 105 |
|
| 106 |
+
β *`num_experts_per_tok: 8` confirmed from `config.json`.*
|
| 107 |
|
| 108 |
+
β‘**Key finding: fp8-native training produces a flat weight spectrum.** Spot-check results:
|
| 109 |
+
|
| 110 |
+
| Layer | Type | var@64 | IQR / note |
|
| 111 |
+
|-------|------|--------|------------|
|
| 112 |
+
| L00 | Dense MLP | 0.037 | S[:3]=[10.7, 8.6, 8.1] β flat spectrum |
|
| 113 |
+
| L01 | MoE (384 experts) | 0.082 | IQR=[0.076, 0.087] |
|
| 114 |
+
| L15 | MoE | 0.100 | IQR=[0.090, 0.110] |
|
| 115 |
+
| L30 | MoE | pending (full run) | β |
|
| 116 |
+
| L45 | MoE | pending | β |
|
| 117 |
+
| L60 | MoE | pending | β |
|
| 118 |
+
|
| 119 |
+
**This result is surprising and genuine.** The fp8 dequantization is verified correct (scale_inv stats: mean=0.00028, cv=0.41, dequantized weight std=0.023, abs_max=0.93 β realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.
|
| 120 |
+
|
| 121 |
+
For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 β 0.80β0.90 with a clear power-law singular value spectrum (S[0]/S[1] β 10β100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 β 1.24 β nearly uniform.
|
| 122 |
+
|
| 123 |
+
**Hypothesis**: fp8 native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or due to the extreme training scale (14.7T tokens at 1T params).
|
| 124 |
|
| 125 |
## MoE-specific methodology
|
| 126 |
|
|
|
|
| 191 |
|
| 192 |
---
|
| 193 |
|
| 194 |
+
## Changelog
|
| 195 |
+
|
| 196 |
+
| Date | Update |
|
| 197 |
+
|------|--------|
|
| 198 |
+
| 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution confirmed. Full Phase 1 running. |
|
| 199 |
+
|
| 200 |
+
---
|
| 201 |
+
|
| 202 |
*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*
|