---
license: cc-by-nc-4.0
tags:
  - interpretability
  - mechanistic-interpretability
  - moe
  - kimi-k2
  - larql
  - vindex
  - svd
  - weight-analysis
base_model: moonshotai/Kimi-K2-Instruct
---

# Kimi-K2-Instruct Vindex

A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct), Moonshot AI's 1T-parameter Mixture-of-Experts transformer.

Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).

> **Status (2026-04-23):** Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8×H100) pending. The core finding (a flat SVD spectrum consistent with 1-bit models) is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
>
> *This card updates in-place as phases land. See [changelog](#changelog) at the bottom.*

## What this is

This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices, specifically the `down_proj` weights of all 384 routed experts in each of the 60 MoE layers (layer 0 is a dense MLP and is decomposed as a single matrix). Each layer's 384 experts are batch-SVD'd and the singular-value statistics are aggregated into a queryable feature database.

The vindex enables:
- **C1–C5 universal constant measurement** across the model
- **Cross-architecture CKA** (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
- **Knowledge editing** via rank-1 DELETE/INSERT patches, where the four-stage circuit is intact (see the sketch below)
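
For concreteness, here is a minimal sketch of what a rank-1 INSERT patch on one expert's `down_proj` could look like. The function and its arguments are illustrative, not the LarQL editing API; a DELETE is the special case where the target output is zeroed.

```python
import torch

def rank1_insert(W: torch.Tensor, u: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hypothetical rank-1 edit on an expert down_proj W [hidden, intermediate].

    u:      [intermediate] read direction to rewrite (normalized below).
    target: [hidden] desired output for inputs along u; pass zeros to DELETE.
    Returns W' with W' @ u == target, leaving the orthogonal complement
    of u untouched (a rank-1 update to W).
    """
    u = u / u.norm()
    return W + torch.outer(target - W @ u, u)
```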

## Key finding: fp8-native training causes spectral dissolution

**Training precision, not storage precision, determines spectral structure.**

| Model | Training precision | var@64 (median) | Spectral class |
|-------|-------------------|-----------------|----------------|
| Gemma 4 E2B-it | fp16/bf16 | 0.80–0.90 | non-dissolved |
| Ministral-3B | fp16 → *post-quant* fp8 | ~0.85 | non-dissolved |
| **Kimi-K2-Instruct** | **fp8 native** | **0.088** (MoE, 6/61 layers) | **dissolved** |
| Bonsai 8B | fp16 → *post-quant* 1-bit | 0.093 | dissolved |
| BitNet b1.58-2B-4T | **1-bit native** | 0.111 | dissolved |

*Dissolved: var@64 < 0.15 (bimodal gap: no model yet observed between 0.15 and 0.50).*
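
For reference, a minimal sketch of the metric itself, assuming var@64 denotes the fraction of total variance (squared singular values) captured by the top 64 singular directions, consistent with the variance-fraction phrasing in the methodology section below:

```python
import torch

def var_at_k(W: torch.Tensor, k: int = 64) -> float:
    """Fraction of sum(S^2) captured by the top-k singular values of W.
    Values below ~0.15 are what this card calls a dissolved spectrum."""
    s = torch.linalg.svdvals(W.float())
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())
```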

The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is *also* stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used *at training time* (gradient noise and limited numerical range) or applied *after* fp16 training (post-quantization, which preserves the fp16-shaped spectrum).

**Hypothesis:** Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well-represented, so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); mechanism is untested.

**What this does not claim:**
- fp8-native training is not *equivalent* to 1-bit in capability or behavior
- Post-quantized fp8 is not "worse" than native fp8: the spectra differ; behavioral quality is a separate question
- We have no data on fp4, bf16, or other low-precision training regimes
- n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof

**Testable prediction:** Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.

## What this is not

- This is **not** an inference endpoint. You cannot run generation with this artifact.
- This is **not** the full model weights, only SVD statistics and feature metadata.
- Phase 3 routing stats (live inference with `output_router_logits=True`) are stored separately.

## Model architecture (Kimi-K2 specifics)

| Parameter | Value |
|-----------|-------|
| Architecture | DeepSeek-V3 style MoE |
| Total parameters | ~1T |
| Active parameters per forward pass | ~32B |
| Layers | 61 |
| Hidden size | 7168 |
| Routed experts per layer | 384 |
| Top-K routing | 8 (`num_experts_per_tok=8`) |
| Shared experts per layer | 1 |
| MoE intermediate size | 2048 |
| First K dense layers | 1 (layer 0 is dense MLP) |
| Weight precision | fp8 block-quantized (`weight_block_size=[128,128]`) |
| Scoring function | sigmoid |

## Vindex files

| File | Description |
|------|-------------|
| `phase1_moe_svd.json` | Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios) |
| `phase1_moe_svd_agg.json` | Aggregated C1–C4 constants across all layers |
| `phase2_router_svd.json` | Router gate weight SVD per layer (`router.weight` [384, 7168]) |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |
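
A sketch of how the Phase 1 file might be consumed once published. The JSON schema shown (`layers`, `index`, `median_var64`, `q25`, `q75`) is an assumption extrapolated from the field names above, not a documented format:

```python
import json

with open("phase1_moe_svd.json") as f:
    phase1 = json.load(f)

# Flag layers below this card's dissolution threshold (var@64 < 0.15).
for layer in phase1["layers"]:  # assumed: one record per layer
    if layer["median_var64"] < 0.15:
        print(f"L{layer['index']:02d}: median var@64 = {layer['median_var64']:.3f}, "
              f"IQR = [{layer['q25']:.3f}, {layer['q75']:.3f}] -> dissolved")
```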

## Universal constants (C1–C5): spot-check results (6 layers)

*Phase 1 spot-check run on Modal (L4 GPU); the full 61-layer Phase 1 is in progress. Phases 2 and 3 pending.*

| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|----------|-------------|----------------------|----------------|
| **C1** | FFN activation sparsity | 0.06–0.39 | pending (Phase 3) |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7%† |
| **C3** | Gate coherence | 0.53–0.81 | pending (Phase 1 full) |
| **C4** | Layer temperature (var@64) | 0.80–0.90 (fp16) | **0.037–0.10** ‡ |
| **C5** | Circuit stage count | 4 (fp16) | pending (Phase 3) |

† *`num_experts_per_tok: 8` confirmed from `config.json`.*

‡ **Key finding: fp8-native training produces a flat weight spectrum.** Spot-check results:

| Layer | Type | var@64 | IQR / note |
|-------|------|--------|------------|
| L00 | Dense MLP | 0.037 | S[:3] = [10.7, 8.6, 8.1], flat spectrum |
| L01 | MoE (384 experts) | 0.082 | IQR = [0.076, 0.087] |
| L15 | MoE | 0.100 | IQR = [0.090, 0.110] |
| L30 | MoE | pending (full run) | – |
| L45 | MoE | pending | – |
| L60 | MoE | pending | – |

**This result is surprising but genuine.** The fp8 dequantization is verified correct (scale_inv stats: mean=0.00028, cv=0.41, dequantized weight std=0.023, abs_max=0.93, a realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.
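
Those sanity numbers can be recomputed from the checkpoint. A sketch, assuming a DeepSeek-V3-style layout where each fp8 weight ships with a per-[128,128]-block `scale_inv` tensor (the exact tensor names and layout in the Kimi-K2 checkpoint are an assumption):

```python
import torch

def dequant_fp8_block(w_fp8: torch.Tensor, scale_inv: torch.Tensor,
                      block: int = 128) -> torch.Tensor:
    """Apply per-block scales (weight_block_size=[128,128] from config.json)."""
    scales = scale_inv.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return w_fp8.to(torch.float32) * scales[: w_fp8.shape[0], : w_fp8.shape[1]]

def sanity_stats(w_fp8: torch.Tensor, scale_inv: torch.Tensor) -> dict:
    """Recompute the verification stats quoted above for one weight."""
    s, w = scale_inv.float(), dequant_fp8_block(w_fp8, scale_inv)
    return {
        "scale_inv_mean": s.mean().item(),            # quoted: ~0.00028
        "scale_inv_cv": (s.std() / s.mean()).item(),  # quoted: ~0.41
        "weight_std": w.std().item(),                 # quoted: ~0.023
        "weight_abs_max": w.abs().max().item(),       # quoted: ~0.93
    }
```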

For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 ≈ 0.80–0.90 with a clear power-law singular-value spectrum (S[0]/S[1] ≈ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 ≈ 1.24, nearly uniform.

**Hypothesis**: fp8-native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or to the extreme training scale (14.7T tokens at 1T params).

## MoE-specific methodology

Standard vindex builders (for dense models) SVD a single `down_proj` matrix per layer. For Kimi-K2's MoE:

1. All 384 expert `down_proj` matrices per layer are loaded: shape `[hidden_size=7168, moe_intermediate_size=2048]`
2. Stacked into a batch tensor: `[384, 7168, 2048]`
3. Batch SVD computed on GPU (chunked into groups of 64 to fit H100 VRAM)
4. Per-expert `var@64` values are aggregated: median, Q25, Q75 reported as the layer statistic

**fp8 handling**: block-wise fp8 weights are dequantized (per-block `scale_inv` applied) and cast to bf16 before SVD. The scale factors affect absolute singular values but not the relative structure (which singular-value directions dominate). The singular-value *ratios* and variance-fraction statistics are scale-invariant.
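
A condensed sketch of steps 1–4, reusing `dequant_fp8_block` from the verification sketch above. The real builder lives in `notebooks/moe_vindex_builder.py`; the function below is illustrative:

```python
import torch

def layer_var64_stats(experts: list, chunk: int = 64, k: int = 64):
    """experts: the 384 dequantized down_proj weights of one MoE layer,
    each [7168, 2048]. Returns (median, q25, q75) of per-expert var@64."""
    var64 = []
    for i in range(0, len(experts), chunk):                       # chunk to fit VRAM
        batch = torch.stack(experts[i:i + chunk]).cuda().float()  # [c, 7168, 2048]
        s = torch.linalg.svdvals(batch)                           # batched SVD: [c, 2048]
        energy = s ** 2
        var64.append((energy[:, :k].sum(-1) / energy.sum(-1)).cpu())  # scale-invariant
    v = torch.cat(var64)
    q25, med, q75 = torch.quantile(v, torch.tensor([0.25, 0.5, 0.75])).tolist()
    return med, q25, q75
```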

## Cross-architecture CKA prediction

Given that Gemma4-E2B ↔ Qwen3-8B achieves 99.2% CKA at matched normalized depth (entity layer, ~55%), we predict:

- Kimi-K2 ↔ Gemma4: ~0.97–0.99 at entity layer (if the four-stage circuit is intact)
- Kimi-K2 ↔ Qwen3: similar range

If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.
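
CKA itself is standard; a minimal linear-CKA sketch over activation matrices collected on the same inputs at matched normalized depth (which layers to pair is determined by the depth-matching described above):

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between activations X [n, d1] and Y [n, d2] for the
    same n inputs. 1.0 means identical geometry up to rotation/scale."""
    X = X - X.mean(0, keepdim=True)  # center features
    Y = Y - Y.mean(0, keepdim=True)
    return float((X.T @ Y).norm() ** 2 / ((X.T @ X).norm() * (Y.T @ Y).norm()))
```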

## Reproduce it

```bash
# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server

# Install Modal
pip install modal

# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct \
  --layers 0,1,15,30,45,60

# Full Phase 1: all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct

# Phase 2: router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 2

# Phase 3: routing statistics (requires 8×H100, live inference)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 3

# Pull results from Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/
```
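
For orientation, a sketch of the kind of collection Phase 3 performs via `output_router_logits=True`. Whether the Hugging Face implementation of Kimi-K2 exposes router logits exactly this way, and the shapes involved, are assumptions; loading the full model also requires multi-GPU hardware (this card cites 8×H100):

```python
import torch
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-K2-Instruct"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True)

expert_load = Counter()  # (layer, expert) -> times selected in the top-8
for prompt in ["The capital of France is"]:  # Phase 3 uses 256 diverse prompts
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_router_logits=True)
    for li, logits in enumerate(out.router_logits):  # assumed: [tokens, 384] per MoE layer
        for e in logits.topk(8, dim=-1).indices.flatten().tolist():  # num_experts_per_tok=8
            expert_load[(li, e)] += 1
```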

## License

CC-BY-NC 4.0: free for non-commercial research use. For commercial use, contact [mike@divinci.ai](mailto:mike@divinci.ai).

## Citation

```bibtex
@misc{mooring2026kimi-k2-vindex,
  author = {Mooring, Mike},
  title  = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
  year   = {2026},
  url    = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
  note   = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}
```

---

## Changelog

| Date | Update |
|------|--------|
| 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution observed. Full Phase 1 running. |

---

*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*