---
license: cc-by-nc-4.0
tags:
- interpretability
- mechanistic-interpretability
- moe
- kimi-k2
- larql
- vindex
- svd
- weight-analysis
base_model: moonshotai/Kimi-K2-Instruct
---

# Kimi-K2-Instruct Vindex

A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) — Moonshot AI's 1T-parameter Mixture-of-Experts transformer. Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).

> **Status (2026-04-23):** Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8×H100) pending. The core finding — a flat SVD spectrum consistent with 1-bit models — is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
>
> *This card updates in-place as phases land. See the [changelog](#changelog) at the bottom.*

## What this is

This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices — specifically the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd, and the singular-value statistics are aggregated into a queryable feature database.

The vindex enables:

- **C1–C5 universal constant measurement** across the model
- **Cross-architecture CKA** (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
- **Knowledge editing** via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)

## Key finding: fp8-native training causes spectral dissolution

**Training precision, not storage precision, determines spectral structure.**

| Model | Training precision | var@64 (median) | Spectral class |
|-------|--------------------|-----------------|----------------|
| Gemma 4 E2B-it | fp16/bf16 | 0.841 | non-dissolved |
| Ministral-3B | fp16 → *post-quant* fp8 | ~0.85 | non-dissolved |
| **Kimi-K2-Instruct** | **fp8 native** | **0.088** (MoE, 6/61 layers) | **dissolved** |
| Bonsai 8B | fp16 → *post-quant* 1-bit | 0.093 | dissolved |
| BitNet b1.58-2B-4T | **1-bit native** | 0.111 | dissolved |

*Dissolved: var@64 < 0.15 (a bimodal gap — no model yet observed between 0.15 and 0.50).*

The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is *also* stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used *at training time* (gradient noise and limited numerical range) or applied *after* fp16 training (post-quantization, which preserves the fp16-shaped spectrum).

**Hypothesis:** Low-precision training (fp8 native, 1-bit native) prevents the accumulation of the high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well-represented, so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); the mechanism is untested.
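For concreteness, here is a minimal sketch of the var@64 statistic used throughout this card, assuming var@64 is the fraction of squared singular-value mass captured by the top 64 singular values (the definition consistent with the numbers above; the builder's actual implementation lives in `moe_vindex_builder.py`, and the helper name here is illustrative):

```python
import torch

def var_at_k(weight: torch.Tensor, k: int = 64) -> float:
    """Fraction of squared singular-value mass in the top-k directions.

    Values near 1.0 indicate a sharply decaying spectrum (fp16-trained
    models measure ~0.80-0.90); values near k / min(weight.shape)
    indicate a flat, "dissolved" spectrum.
    """
    s = torch.linalg.svdvals(weight.float())  # singular values, descending
    energy = s ** 2
    return (energy[:k].sum() / energy.sum()).item()

# A random Gaussian matrix of down_proj shape has a near-flat spectrum
# and lands well below the 0.15 dissolution threshold (roughly 0.06-0.07
# for this shape), in the same regime as Kimi-K2's experts (0.08-0.10).
print(var_at_k(torch.randn(7168, 2048)))
```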
**What this does not claim:**

- fp8-native training is not *equivalent* to 1-bit in capability or behavior
- Post-quantized fp8 is not "worse" than native fp8 — the spectra differ; behavioral quality is a separate question
- We have no data on fp4 or other low-precision training regimes between fp8 and 1-bit
- n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof

**Testable prediction:** Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.

## What this is not

- This is **not** an inference endpoint. You cannot run generation with this artifact.
- This is **not** the full model weights — only SVD statistics and feature metadata.
- Phase 3 routing stats (live inference with `output_router_logits=True`) are stored separately.

## Model architecture (Kimi-K2 specifics)

| Parameter | Value |
|-----------|-------|
| Architecture | DeepSeek-V3-style MoE |
| Total parameters | ~1T |
| Active parameters per forward pass | ~32B |
| Layers | 61 |
| Hidden size | 7168 |
| Routed experts per layer | 384 |
| Top-K routing | 8 (`num_experts_per_tok=8`) |
| Shared experts per layer | 1 |
| MoE intermediate size | 2048 |
| First K dense layers | 1 (layer 0 is a dense MLP) |
| Weight precision | fp8 block-quantized (`weight_block_size=[128,128]`) |
| Scoring function | sigmoid |

## Vindex files

| File | Description |
|------|-------------|
| `phase1_moe_svd.json` | Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios) |
| `phase1_moe_svd_agg.json` | Aggregated C1–C4 constants across all layers |
| `phase2_router_svd.json` | Router gate weight SVD per layer (`router.weight`, shape [384, 7168]) |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |

## Universal constants (C1–C5) — spot-check results (6 layers)

*Phase 1 confirmed on Modal L4 (all 61 layers running). Phase 2+3 pending.*

| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|----------|-------------|-----------------------|---------------|
| **C1** | FFN activation sparsity | 0.06–0.39 | pending (Phase 3) |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7%† |
| **C3** | Gate coherence | 0.53–0.81 | pending (Phase 1 full) |
| **C4** | Layer temperature (var@64) | 0.80–0.90 (fp16) | **0.037–0.10**‡ |
| **C5** | Circuit stage count | 4 (fp16) | pending (Phase 3) |

† *`num_experts_per_tok: 8` confirmed from `config.json`.*

‡ **Key finding: fp8-native training produces a flat weight spectrum.** Spot-check results:

| Layer | Type | var@64 | IQR / note |
|-------|------|--------|------------|
| L00 | Dense MLP | 0.037 | S[:3] = [10.7, 8.6, 8.1] — flat spectrum |
| L01 | MoE (384 experts) | 0.082 | IQR = [0.076, 0.087] |
| L15 | MoE | 0.100 | IQR = [0.090, 0.110] |
| L30 | MoE | pending (full run) | — |
| L45 | MoE | pending | — |
| L60 | MoE | pending | — |

**This result is surprising and genuine.** The fp8 dequantization is verified correct (scale_inv stats: mean = 0.00028, cv = 0.41; dequantized weight std = 0.023, abs_max = 0.93 — a realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.
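The dequantization step is simple enough to sketch. This is a minimal version, assuming the DeepSeek-V3 block-scale convention that Kimi-K2's format suggests (one `weight_scale_inv` factor per 128×128 tile of the fp8 weight, consistent with `weight_block_size=[128,128]` above); the helper name is illustrative and the builder's actual code lives in `moe_vindex_builder.py`:

```python
import torch

def dequant_fp8_block(w_fp8: torch.Tensor, scale_inv: torch.Tensor,
                      block: int = 128) -> torch.Tensor:
    """Dequantize block-wise fp8 weights to bf16 before SVD.

    Assumes one scale_inv entry per [block, block] tile of w_fp8,
    multiplied element-wise over the tile. For a down_proj of shape
    [7168, 2048], scale_inv would be [56, 16].
    """
    w = w_fp8.to(torch.float32)
    # Expand each per-tile scale across its 128x128 tile, then trim to
    # the weight's shape in case the last tiles are ragged.
    scales = scale_inv.to(torch.float32).repeat_interleave(block, dim=0)
    scales = scales.repeat_interleave(block, dim=1)
    return (w * scales[: w.shape[0], : w.shape[1]]).to(torch.bfloat16)
```

Because var@64 is a ratio of squared singular values, any global rescaling of the weights cancels out of the statistic entirely.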
For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 ≈ 0.80–0.90 with a clear power-law singular-value spectrum (S[0]/S[1] ≈ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 ≈ 1.24 — nearly uniform.

**Hypothesis:** fp8-native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or to the extreme training scale (14.7T tokens at 1T parameters).

## MoE-specific methodology

Standard vindex builders (for dense models) SVD a single `down_proj` matrix per layer. For Kimi-K2's MoE (a runnable sketch of steps 2–4 appears in the appendix at the bottom of this card):

1. All 384 expert `down_proj` matrices per layer are loaded: shape `[hidden_size=7168, moe_intermediate_size=2048]`
2. Stacked into a batch tensor: `[384, 7168, 2048]`
3. Batch SVD computed on GPU (chunked into groups of 64 to fit H100 VRAM)
4. Per-expert `var@64` values are aggregated: median, Q25, and Q75 are reported as the layer statistic

**fp8 handling:** block-wise fp8 weights are cast to bf16 before SVD. The block-wise scale factors affect absolute singular values but not the relative structure (which singular-value directions dominate). The singular-value *ratios* and variance-fraction statistics are scale-invariant.

## Cross-architecture CKA prediction

Given that Gemma4-E2B ↔ Qwen3-8B achieves 99.2% CKA at matched normalized depth (entity layer, ~55%), we predict:

- Kimi-K2 ↔ Gemma4: ~0.97–0.99 at the entity layer (if the four-stage circuit is intact)
- Kimi-K2 ↔ Qwen3: a similar range

If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.

## Reproduce it

```bash
# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server

# Install Modal
pip install modal

# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct \
  --layers 0,1,15,30,45,60

# Full Phase 1 — all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct

# Phase 2 — router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 2

# Phase 3 — routing statistics (requires 8×H100, live inference)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 3

# Pull results from the Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/
```

## License

CC-BY-NC 4.0 — free for non-commercial research use. For commercial use, contact [mike@divinci.ai](mailto:mike@divinci.ai).

## Citation

```bibtex
@misc{mooring2026kimi-k2-vindex,
  author = {Mooring, Mike},
  title  = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
  year   = {2026},
  url    = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
  note   = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}
```

---

## Changelog

| Date | Update |
|------|--------|
| 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution confirmed. Full Phase 1 running. |

---

*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*
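## Appendix: batched expert SVD sketch

A minimal, self-contained sketch of steps 2–4 of the MoE methodology above. The chunk size and the median/Q25/Q75 aggregation follow that section's description, and the output keys mirror the documented `phase1_moe_svd.json` fields (`median_var64`, `q25`, `q75`); treat it as an illustration under those assumptions, not the builder's exact code. Assumes a CUDA device is available.

```python
import torch

def layer_var64_stats(experts: torch.Tensor, k: int = 64,
                      chunk: int = 64) -> dict:
    """Chunked batch SVD over stacked expert weights [384, 7168, 2048].

    Each chunk of 64 fp32 matrices is ~3.8 GB, which fits H100 VRAM.
    Returns the median / Q25 / Q75 of per-expert var@64 as the layer
    statistic.
    """
    var64 = []
    for i in range(0, experts.shape[0], chunk):
        batch = experts[i : i + chunk].float().cuda()
        s = torch.linalg.svdvals(batch)   # [chunk, 2048], descending
        energy = s ** 2
        var64.append((energy[:, :k].sum(dim=1) / energy.sum(dim=1)).cpu())
    v = torch.cat(var64)
    q25, med, q75 = torch.quantile(v, torch.tensor([0.25, 0.50, 0.75]))
    return {"median_var64": med.item(), "q25": q25.item(), "q75": q75.item()}
```

By the card's criterion a layer is dissolved when its median var@64 falls below 0.15; the spot-checked MoE layers above report 0.082–0.100.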