---
license: cc-by-nc-4.0
tags:
- interpretability
- mechanistic-interpretability
- moe
- kimi-k2
- larql
- vindex
- svd
- weight-analysis
base_model: moonshotai/Kimi-K2-Instruct
---

# Kimi-K2-Instruct Vindex


A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct), Moonshot AI's 1T-parameter Mixture-of-Experts transformer.


Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).


> **Status (2026-04-23):** Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8×H100) pending. The core finding (a flat SVD spectrum consistent with 1-bit models) is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
>
> *This card updates in-place as phases land. See [changelog](#changelog) at the bottom.*


## What this is


This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices: specifically, the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd, and the singular-value statistics are aggregated into a queryable feature database.


The vindex enables:
- **C1–C5 universal constant measurement** across the model
- **Cross-architecture CKA** (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
- **Knowledge editing** via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)


## Key finding: fp8-native training causes spectral dissolution


**Training precision, not storage precision, determines spectral structure.**


| Model | Training precision | var@64 (median) | Spectral class |
|-------|-------------------|-----------------|----------------|
| Gemma 4 E2B-it | fp16/bf16 | 0.841 | non-dissolved |
| Ministral-3B | fp16 → *post-quant* fp8 | ~0.85 | non-dissolved |
| **Kimi-K2-Instruct** | **fp8 native** | **0.088** (MoE, 6/61 layers) | **dissolved** |
| Bonsai 8B | fp16 → *post-quant* 1-bit | 0.093 | dissolved |
| BitNet b1.58-2B-4T | **1-bit native** | 0.111 | dissolved |


*Dissolved: var@64 < 0.15 (bimodal gap; no model yet observed between 0.15 and 0.50).*
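
The var@64 statistic can be computed directly; here it is read as the fraction of total squared singular-value mass (variance) captured by the top 64 singular values. A minimal NumPy sketch of the metric and of the dissolved/non-dissolved contrast on synthetic matrices (toy stand-ins, not the measured weights):

```python
import numpy as np

def var_at_k(W, k=64):
    """Fraction of squared singular-value mass in the top-k directions."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)

# Near-low-rank matrix: mass concentrates in a few directions (non-dissolved).
low_rank = rng.standard_normal((1024, 8)) @ rng.standard_normal((8, 512))
peaked = var_at_k(low_rank + 0.01 * rng.standard_normal((1024, 512)))

# Pure Gaussian matrix: flat spectrum, dissolved by the var@64 < 0.15 criterion.
flat = var_at_k(rng.standard_normal((7168, 2048)))

print(f"peaked={peaked:.3f}  flat={flat:.3f}")
```

The statistic is scale-free, so it can be compared across layers and models regardless of weight magnitude.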


The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is *also* stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used *at training time* (gradient noise and limited numerical range) or applied *after* fp16 training (post-quantization, which preserves the fp16-shaped spectrum).


**Hypothesis:** Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well-represented, so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); the mechanism is untested.


**What this does not claim:**
- fp8-native training is not *equivalent* to 1-bit in capability or behavior
- Post-quantized fp8 is not "worse" than native fp8: the spectra differ; behavioral quality is a separate question
- We have no data on fp4, bf16, or other low-precision training regimes
- n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof


**Testable prediction:** Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.


## What this is not


- This is **not** an inference endpoint. You cannot run generation with this artifact.
- This is **not** the full model weights; only SVD statistics and feature metadata are included.
- Phase 3 routing stats (live inference with `output_router_logits=True`) are stored separately.


## Model architecture (Kimi-K2 specifics)


| Parameter | Value |
|-----------|-------|
| Architecture | DeepSeek-V3-style MoE |
| Total parameters | ~1T |
| Active parameters per forward pass | ~32B |
| Layers | 61 |
| Hidden size | 7168 |
| Routed experts per layer | 384 |
| Top-K routing | 8 (`num_experts_per_tok=8`) |
| Shared experts per layer | 1 |
| MoE intermediate size | 2048 |
| First K dense layers | 1 (layer 0 is dense MLP) |
| Weight precision | fp8 block-quantized (`weight_block_size=[128,128]`) |
| Scoring function | sigmoid |

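
As a sanity check on the table, the active-parameter count can be approximated from the MoE dimensions alone. This sketch assumes the DeepSeek-V3-style gate/up/down MLP per expert and omits attention and embedding parameters, so it is a lower bound:

```python
# Back-of-envelope active parameters per forward pass (MoE MLPs only).
hidden, inter = 7168, 2048
per_expert = 3 * hidden * inter   # gate_proj + up_proj + down_proj (assumed layout)
active_experts = 8 + 1            # top-8 routed + 1 shared expert
moe_layers = 61 - 1               # layer 0 is a dense MLP
active_moe = per_expert * active_experts * moe_layers
print(f"~{active_moe / 1e9:.1f}B active MoE parameters "
      f"(attention and embeddings account for the rest of ~32B)")
```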

## Vindex files


| File | Description |
|------|-------------|
| `phase1_moe_svd.json` | Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios) |
| `phase1_moe_svd_agg.json` | Aggregated C1–C4 constants across all layers |
| `phase2_router_svd.json` | Router gate weight SVD per layer (`router.weight`, shape [384, 7168]) |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |

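
The files are plain JSON and can be queried directly. A sketch against a stand-in file (field names such as `median_var64` follow the descriptions above but are an assumption; verify against the actual schema before relying on them):

```python
import json, os, tempfile

# Stand-in for phase1_moe_svd.json with two layers (values from the spot-check);
# the real schema may differ -- inspect the file before relying on field names.
sample = {
    "1":  {"median_var64": 0.082, "q25": 0.076, "q75": 0.087},
    "15": {"median_var64": 0.100, "q25": 0.090, "q75": 0.110},
}
path = os.path.join(tempfile.gettempdir(), "phase1_moe_svd.json")
with open(path, "w") as f:
    json.dump(sample, f)

with open(path) as f:
    layers = json.load(f)

# Layers below the dissolution threshold (var@64 < 0.15)
dissolved = sorted(int(k) for k, v in layers.items() if v["median_var64"] < 0.15)
print(dissolved)
```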

## Universal constants (C1–C5): spot-check results (6 layers)


*Phase 1 spot-check confirmed on a Modal L4 GPU (full 61-layer run in progress). Phases 2 and 3 pending.*


| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|----------|-------------|----------------------|----------------|
| **C1** | FFN activation sparsity | 0.06–0.39 | pending (Phase 3) |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7% † |
| **C3** | Gate coherence | 0.53–0.81 | pending (Phase 1 full) |
| **C4** | Layer temperature (var@64) | 0.80–0.90 (fp16) | **0.037–0.10** ‡ |
| **C5** | Circuit stage count | 4 (fp16) | pending (Phase 3) |


† *`num_experts_per_tok: 8` confirmed from `config.json`.*


‡ **Key finding: fp8-native training produces a flat weight spectrum.** Spot-check results:


| Layer | Type | var@64 | IQR / note |
|-------|------|--------|------------|
| L00 | Dense MLP | 0.037 | S[:3]=[10.7, 8.6, 8.1]: flat spectrum |
| L01 | MoE (384 experts) | 0.082 | IQR=[0.076, 0.087] |
| L15 | MoE | 0.100 | IQR=[0.090, 0.110] |
| L30 | MoE | pending (full run) | – |
| L45 | MoE | pending | – |
| L60 | MoE | pending | – |


**This result is surprising and genuine.** The fp8 dequantization is verified correct (scale_inv stats: mean=0.00028, cv=0.41, dequantized weight std=0.023, abs_max=0.93, a realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.


For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 ≈ 0.80–0.90 with a clear power-law singular-value spectrum (S[0]/S[1] ≈ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 ≈ 1.24, i.e. nearly uniform.


**Hypothesis**: fp8-native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or to the extreme training scale (14.7T tokens at 1T params).


## MoE-specific methodology


Standard vindex builders (for dense models) SVD a single `down_proj` matrix per layer. For Kimi-K2's MoE:


1. All 384 expert `down_proj` matrices per layer are loaded: shape `[hidden_size=7168, moe_intermediate_size=2048]`
2. Stacked into a batch tensor: `[384, 7168, 2048]`
3. Batch SVD computed on GPU (chunked into groups of 64 to fit H100 VRAM)
4. Per-expert `var@64` values are aggregated: median, Q25, Q75 reported as the layer statistic
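
The four steps above can be sketched end-to-end. This is a NumPy stand-in for the torch GPU batch SVD, with toy shapes so it runs anywhere (the real run uses `[384, 7168, 2048]` chunked by 64):

```python
import numpy as np

def layer_var_at_k(experts, k=64, chunk=64):
    """experts: [n_experts, hidden, inter] -> (median, q25, q75) of per-expert var@k."""
    fracs = []
    for i in range(0, len(experts), chunk):                        # step 3: chunked batch SVD
        s = np.linalg.svd(experts[i:i + chunk], compute_uv=False)  # [chunk, min(hidden, inter)]
        fracs.append((s[:, :k] ** 2).sum(axis=1) / (s ** 2).sum(axis=1))
    fracs = np.concatenate(fracs)
    # step 4: aggregate per-expert values into the layer statistic
    return np.median(fracs), np.quantile(fracs, 0.25), np.quantile(fracs, 0.75)

# Toy stand-in: 8 "experts" at reduced shape.
rng = np.random.default_rng(0)
experts = rng.standard_normal((8, 256, 128))
med, q25, q75 = layer_var_at_k(experts, k=16, chunk=4)
print(f"median={med:.3f}  IQR=[{q25:.3f}, {q75:.3f}]")
```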


**fp8 handling**: block-wise fp8 weights are cast to bf16 before SVD. The block-wise scale factors affect absolute singular values but not the relative structure (which singular-value directions dominate). The singular-value *ratios* and variance-fraction statistics are scale-invariant.
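
A minimal sketch of the block-wise handling described above. The `[128, 128]` block layout matches `weight_block_size`, but the exact `scale_inv` indexing convention is an assumption; the second half illustrates why a variance-fraction statistic is invariant to a global rescale:

```python
import numpy as np

def dequant_blockwise(w_q, scale_inv, block=128):
    """Expand per-block scales to element granularity, then rescale (assumed layout)."""
    s = np.repeat(np.repeat(scale_inv, block, axis=0), block, axis=1)
    return w_q * s[: w_q.shape[0], : w_q.shape[1]]

def var_at_k(W, k=64):
    s = np.linalg.svd(W, compute_uv=False)
    return (s[:k] ** 2).sum() / (s ** 2).sum()

rng = np.random.default_rng(0)
w_q = rng.standard_normal((256, 384))              # toy "quantized" weight
scale_inv = rng.uniform(1e-4, 5e-4, size=(2, 3))   # one scale per 128x128 block
w = dequant_blockwise(w_q, scale_inv)

# A global rescale shifts absolute singular values but not the variance fraction.
diff = abs(var_at_k(w) - var_at_k(1000.0 * w))
print(diff)
```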


## Cross-architecture CKA prediction


Given that Gemma4-E2B ↔ Qwen3-8B achieves 99.2% CKA at matched normalized depth (entity layer, ~55%), we predict:


- Kimi-K2 ↔ Gemma4: ~0.97–0.99 at the entity layer (if the four-stage circuit is intact)
- Kimi-K2 ↔ Qwen3: similar range


If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.
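
For reference, linear CKA can be computed from two activation matrices over the same tokens. A minimal sketch, illustrating the rotation invariance that makes CKA suitable for cross-architecture comparison (not necessarily the project's exact pipeline):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X: [n, d1] and Y: [n, d2] over the same n tokens."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 64))
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))      # random rotation

same = linear_cka(X, X @ Q)                             # rotation of X: CKA stays at 1
other = linear_cka(X, rng.standard_normal((512, 64)))   # unrelated activations: low
print(f"rotated={same:.3f}  unrelated={other:.3f}")
```

Because CKA ignores rotations and rescalings of the feature basis, models with different hidden sizes can be compared layer-by-layer at matched normalized depth.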


## Reproduce it


```bash
# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server

# Install Modal
pip install modal

# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct \
  --layers 0,1,15,30,45,60

# Full Phase 1: all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct

# Phase 2: router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 2

# Phase 3: routing statistics (requires 8×H100, live inference)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 3

# Pull results from Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/
```


## License


CC-BY-NC 4.0: free for non-commercial research use. For commercial use, contact [mike@divinci.ai](mailto:mike@divinci.ai).


## Citation


```bibtex
@misc{mooring2026kimi-k2-vindex,
  author = {Mooring, Mike},
  title  = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
  year   = {2026},
  url    = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
  note   = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}
```


---


## Changelog


| Date | Update |
|------|--------|
| 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution observed in the spot-check. Full Phase 1 running. |


---


*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*