---
license: cc-by-nc-4.0
tags:
- interpretability
- mechanistic-interpretability
- moe
- kimi-k2
- larql
- vindex
- svd
- weight-analysis
base_model: moonshotai/Kimi-K2-Instruct
---
# Kimi-K2-Instruct Vindex
A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct), Moonshot AI's 1T-parameter Mixture-of-Experts transformer.
Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).
> **Status (2026-04-23):** Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8×H100) pending. The core finding, a flat SVD spectrum consistent with 1-bit models, is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
>
> *This card updates in-place as phases land. See [changelog](#changelog) at the bottom.*
## What this is
This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices, specifically the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd, and the singular-value statistics are aggregated into a queryable feature database.
The vindex enables:
- **C1–C5 universal constant measurement** across the model
- **Cross-architecture CKA** (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
- **Knowledge editing** via rank-1 DELETE/INSERT patches where the four-stage circuit is intact (a minimal sketch follows this list)
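On the last point, a rank-1 patch adds or subtracts a single outer product from a weight matrix. The sketch below is illustrative only, not the LarQL editing API: the shapes follow the `down_proj` layout described in the architecture table further down, and `rank1_patch` is a name introduced here.

```python
import torch

def rank1_patch(down_proj: torch.Tensor,
                u: torch.Tensor,       # output-side direction, shape [7168]
                v: torch.Tensor,       # input-side direction, shape [2048]
                alpha: float) -> torch.Tensor:
    """Return W + alpha * u v^T. alpha < 0 implements DELETE (suppress a
    feature direction); alpha > 0 implements INSERT (strengthen one)."""
    u = u / u.norm()
    v = v / v.norm()
    return down_proj + alpha * torch.outer(u, v)

# Hypothetical usage: delete the top singular direction of one expert.
W = torch.randn(7168, 2048)            # stand-in for one expert's down_proj
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
W_deleted = rank1_patch(W, U[:, 0], Vh[0], alpha=-S[0].item())
```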
## Key finding: fp8-native training causes spectral dissolution
**Training precision, not storage precision, determines spectral structure.**
| Model | Training precision | var@64 (median) | Spectral class |
|-------|-------------------|-----------------|----------------|
| Gemma 4 E2B-it | fp16/bf16 | 0.041 | non-dissolved |
| Ministral-3B | fp16 → *post-quant* fp8 | ~0.85 | non-dissolved |
| **Kimi-K2-Instruct** | **fp8 native** | **0.088** (MoE, 6/61 layers) | **dissolved** |
| Bonsai 8B | fp16 → *post-quant* 1-bit | 0.093 | dissolved |
| BitNet b1.58-2B-4T | **1-bit native** | 0.111 | dissolved |
*Dissolved: var@64 < 0.15 (a bimodal gap: no model yet observed between 0.15 and 0.50).*
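Throughout this card we read var@64 as the fraction of squared singular-value mass captured by the top 64 singular values; that reading matches both the dissolved values (a perfectly flat 2048-value spectrum gives 64/2048 ≈ 0.031) and the fp16 range of 0.80–0.90 quoted further down. A minimal sketch of the metric and the dissolution check; `var_at_k` and `is_dissolved` are names introduced here, not LarQL API:

```python
import torch

def var_at_k(weight: torch.Tensor, k: int = 64) -> float:
    """Fraction of squared singular-value mass in the top-k directions.
    A sharp (fp16-like) spectrum concentrates mass early (var@64 near
    0.8-0.9); a dissolved spectrum spreads it out (var@64 below 0.15)."""
    s = torch.linalg.svdvals(weight.float())
    return (s[:k].square().sum() / s.square().sum()).item()

def is_dissolved(weight: torch.Tensor, threshold: float = 0.15) -> bool:
    """Dissolution criterion from the table note above."""
    return var_at_k(weight) < threshold
```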
The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is *also* stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used *at training time* (gradient noise and limited numerical range) or applied *after* fp16 training (post-quantization, which preserves the fp16-shaped spectrum).
**Hypothesis:** Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well-represented, so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); the mechanism is untested.
**What this does not claim:**
- fp8-native training is not *equivalent* to 1-bit in capability or behavior
- Post-quantized fp8 is not "worse" than native fp8: the spectra differ; behavioral quality is a separate question
- We have no data on fp4, bf16, or other low-precision training regimes
- n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof
**Testable prediction:** Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.
## What this is not
- This is **not** an inference endpoint. You cannot run generation with this artifact.
- This is **not** the full model weights β only SVD statistics and feature metadata.
- Phase 3 routing stats (live inference with `output_router_logits=True`) are stored separately.
## Model architecture (Kimi-K2 specifics)
| Parameter | Value |
|-----------|-------|
| Architecture | DeepSeek-V3-style MoE |
| Total parameters | ~1T |
| Active parameters per forward pass | ~32B |
| Layers | 61 |
| Hidden size | 7168 |
| Routed experts per layer | 384 |
| Top-K routing | 8 (`num_experts_per_tok=8`) |
| Shared experts per layer | 1 |
| MoE intermediate size | 2048 |
| First K dense layers | 1 (layer 0 is dense MLP) |
| Weight precision | fp8 block-quantized (`weight_block_size=[128,128]`) |
| Scoring function | sigmoid |
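The table values can be checked directly against the model's `config.json`. A hedged sketch, assuming the DeepSeek-V3 configuration attribute names that the architecture is based on (Kimi-K2's remote config may name some fields differently):

```python
from transformers import AutoConfig

# trust_remote_code is needed because Kimi-K2 ships custom config code.
cfg = AutoConfig.from_pretrained(
    "moonshotai/Kimi-K2-Instruct", trust_remote_code=True
)

# Attribute names assume the DeepSeek-V3 config schema.
print(cfg.num_hidden_layers)      # expect 61
print(cfg.hidden_size)            # expect 7168
print(cfg.n_routed_experts)       # expect 384
print(cfg.num_experts_per_tok)    # expect 8
print(cfg.n_shared_experts)       # expect 1
print(cfg.moe_intermediate_size)  # expect 2048
print(cfg.first_k_dense_replace)  # expect 1 (layer 0 is dense)
print(cfg.scoring_func)           # expect "sigmoid"
```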
## Vindex files
| File | Description |
|------|-------------|
| `phase1_moe_svd.json` | Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios) |
| `phase1_moe_svd_agg.json` | Aggregated C1–C4 constants across all layers |
| `phase2_router_svd.json` | Router gate weight SVD per layer (`router.weight` [384, 7168]) |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |
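A minimal sketch of querying the Phase 1 file once pulled from the Modal volume. The JSON layout is inferred from the field names in the table above (`median_var64`, quartiles); the exact keys are assumptions:

```python
import json

with open("kimi-vindex/phase1_moe_svd.json") as f:
    phase1 = json.load(f)

# Assumed layout: one record per layer carrying per-expert aggregates.
for layer in phase1["layers"]:
    print(f'L{layer["layer_idx"]:02d}  '
          f'median var@64 = {layer["median_var64"]:.3f}  '
          f'(IQR {layer["q25"]:.3f}-{layer["q75"]:.3f})')
```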
## Universal constants (C1–C5): spot-check results (6 layers)
*Phase 1 spot-check confirmed on Modal L4; the full 61-layer run is in progress. Phases 2 and 3 pending.*
| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|----------|-------------|----------------------|----------------|
| **C1** | FFN activation sparsity | 0.06–0.39 | pending (Phase 3) |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7% † |
| **C3** | Gate coherence | 0.53–0.81 | pending (Phase 1 full) |
| **C4** | Layer temperature (var@64) | 0.80–0.90 (fp16) | **0.037–0.10** ‡ |
| **C5** | Circuit stage count | 4 (fp16) | pending (Phase 3) |
† *`num_experts_per_tok: 8` confirmed from `config.json`.*
‡ **Key finding: fp8-native training produces a flat weight spectrum.** Spot-check results:
| Layer | Type | var@64 | IQR / note |
|-------|------|--------|------------|
| L00 | Dense MLP | 0.037 | S[:3]=[10.7, 8.6, 8.1] (flat spectrum) |
| L01 | MoE (384 experts) | 0.082 | IQR=[0.076, 0.087] |
| L15 | MoE | 0.100 | IQR=[0.090, 0.110] |
| L30 | MoE | pending (full run) | – |
| L45 | MoE | pending | – |
| L60 | MoE | pending | – |
**This result is surprising and genuine.** The fp8 dequantization is verified correct (scale_inv stats: mean=0.00028, cv=0.41; dequantized weight std=0.023, abs_max=0.93, a realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.
For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 ≈ 0.80–0.90 with a clear power-law singular-value spectrum (S[0]/S[1] ≈ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 ≈ 1.24, nearly uniform.
**Hypothesis:** fp8-native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or to the extreme training scale (14.7T tokens at 1T params).
## MoE-specific methodology
Standard vindex builders (for dense models) SVD a single `down_proj` matrix per layer. For Kimi-K2's MoE:
1. All 384 expert `down_proj` matrices per layer are loaded: shape `[hidden_size=7168, moe_intermediate_size=2048]`
2. Stacked into a batch tensor: `[384, 7168, 2048]`
3. Batch SVD computed on GPU (chunked into groups of 64 to fit H100 VRAM)
4. Per-expert `var@64` values are aggregated; the median, Q25, and Q75 are reported as the layer statistic
**fp8 handling**: block-wise fp8 weights are cast to bf16 before SVD. The block-wise scale factors affect absolute singular values but not the relative structure (which singular value directions dominate). The singular value *ratios* and variance-fraction statistics are scale-invariant.
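A condensed sketch of steps 1–4 together with the fp8 handling. This is not the builder itself: the block dequantization assumes the DeepSeek-V3 `weight_scale_inv` convention used by fp8 block-quantized checkpoints, and the function names are introduced here:

```python
import torch

def dequant_fp8_block(w_fp8: torch.Tensor, scale_inv: torch.Tensor,
                      block: int = 128) -> torch.Tensor:
    """Dequantize block-wise fp8: each 128x128 tile of the weight is
    scaled by its entry in scale_inv (`weight_block_size=[128,128]`)."""
    w = w_fp8.to(torch.bfloat16)
    scale = scale_inv.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return w * scale[: w.shape[0], : w.shape[1]].to(torch.bfloat16)

def layer_var64(experts: torch.Tensor, chunk: int = 64) -> dict:
    """experts: [384, 7168, 2048] stacked down_proj weights (bf16).
    Batch-SVD in chunks of 64 experts to bound GPU memory, then
    aggregate per-expert var@64 into median / Q25 / Q75."""
    var64 = []
    for i in range(0, experts.shape[0], chunk):
        # svdvals broadcasts over the batch dim: output is [chunk, 2048].
        s = torch.linalg.svdvals(experts[i : i + chunk].float().cuda())
        sq = s.square()
        var64.append(sq[:, :64].sum(-1) / sq.sum(-1))
    v = torch.cat(var64).cpu()
    q = torch.quantile(v, torch.tensor([0.25, 0.50, 0.75]))
    return {"q25": q[0].item(), "median_var64": q[1].item(), "q75": q[2].item()}
```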
## Cross-architecture CKA prediction
Given that Gemma4-E2B ↔ Qwen3-8B achieves 99.2% CKA at matched normalized depth (entity layer, ~55%), we predict:
- Kimi-K2 ↔ Gemma4: ~0.97–0.99 at the entity layer (if the four-stage circuit is intact)
- Kimi-K2 ↔ Qwen3: similar range
If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.
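For reference, the comparison metric is linear CKA between layer activations at matched normalized depth. A minimal sketch of standard linear CKA (Kornblith et al., 2019); extracting and depth-matching the activations is out of scope here:

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between activation matrices X [n, d1] and Y [n, d2]
    collected over the same n tokens. 1.0 means identical geometry up
    to rotation/scaling; the prediction above is ~0.97-0.99."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).square().sum()                      # ||X^T Y||_F^2
    norm = (X.T @ X).square().sum().sqrt() * (Y.T @ Y).square().sum().sqrt()
    return (hsic / norm).item()
```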
## Reproduce it
```bash
# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server
# Install Modal
pip install modal
# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
--model moonshotai/Kimi-K2-Instruct \
--layers 0,1,15,30,45,60
# Full Phase 1: all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
--model moonshotai/Kimi-K2-Instruct
# Phase 2: router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
--model moonshotai/Kimi-K2-Instruct --phase 2
# Phase 3: routing statistics (requires 8×H100, live inference)
modal run notebooks/moe_vindex_builder.py \
--model moonshotai/Kimi-K2-Instruct --phase 3
# Pull results from Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/
```
## License
CC-BY-NC 4.0: free for non-commercial research use. For commercial use, contact [mike@divinci.ai](mailto:mike@divinci.ai).
## Citation
```bibtex
@misc{mooring2026kimi-k2-vindex,
author = {Mooring, Mike},
title = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
year = {2026},
url = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
note = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}
```
---
## Changelog
| Date | Update |
|------|--------|
| 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution confirmed. Full Phase 1 running. |
---
*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*