---
license: cc-by-nc-4.0
tags:
- interpretability
- mechanistic-interpretability
- moe
- kimi-k2
- larql
- vindex
- svd
- weight-analysis
base_model: moonshotai/Kimi-K2-Instruct
---
# Kimi-K2-Instruct Vindex
A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) β€” Moonshot AI's 1T-parameter Mixture-of-Experts transformer.
Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).
> **Status (2026-04-23):** Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8Γ—H100) pending. The core finding β€” flat SVD spectrum consistent with 1-bit models β€” is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
>
> *This card updates in-place as phases land. See [changelog](#changelog) at the bottom.*
## What this is
This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices β€” specifically the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd and the singular value statistics are aggregated into a queryable feature database.
The vindex enables:
- **C1–C5 universal constant measurement** across the model
- **Cross-architecture CKA** (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
- **Knowledge editing** via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)
## Key finding: fp8-native training causes spectral dissolution
**Training precision, not storage precision, determines spectral structure.**
| Model | Training precision | var@64 (median) | Spectral class |
|-------|-------------------|-----------------|----------------|
| Gemma 4 E2B-it | fp16/bf16 | 0.041 | non-dissolved |
| Ministral-3B | fp16 β†’ *post-quant* fp8 | ~0.85 | non-dissolved |
| **Kimi-K2-Instruct** | **fp8 native** | **0.088** (MoE, 6/61 layers) | **dissolved** |
| Bonsai 8B | fp16 β†’ *post-quant* 1-bit | 0.093 | dissolved |
| BitNet b1.58-2B-4T | **1-bit native** | 0.111 | dissolved |
*Dissolved: var@64 < 0.15 (bimodal gap β€” no model yet observed between 0.15 and 0.50).*
The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is *also* stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used *at training time* (gradient noise and limited numerical range) or applied *after* fp16 training (post-quantization, which preserves the fp16-shaped spectrum).
**Hypothesis:** Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well represented, so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); the mechanism itself is untested.
**What this does not claim:**
- fp8-native training is not *equivalent* to 1-bit in capability or behavior
- Post-quantized fp8 is not "worse" than native fp8 β€” the spectra differ, behavioral quality is a separate question
- We have no data on fp4, bf16, or other low-precision training regimes
- n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof
**Testable prediction:** Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.
## What this is not
- This is **not** an inference endpoint. You cannot run generation with this artifact.
- This is **not** the full model weights β€” only SVD statistics and feature metadata.
- Phase 3 routing stats (live inference with `output_router_logits=True`) are stored separately.
## Model architecture (Kimi-K2 specifics)
| Parameter | Value |
|-----------|-------|
| Architecture | DeepSeek-V3 style MoE |
| Total parameters | ~1T |
| Active parameters per forward pass | ~32B |
| Layers | 61 |
| Hidden size | 7168 |
| Routed experts per layer | 384 |
| Top-K routing | 8 (`num_experts_per_tok=8`) |
| Shared experts per layer | 1 |
| MoE intermediate size | 2048 |
| First K dense layers | 1 (layer 0 is dense MLP) |
| Weight precision | fp8 block-quantized (`weight_block_size=[128,128]`) |
| Scoring function | sigmoid |
## Vindex files
| File | Description |
|------|-------------|
| `phase1_moe_svd.json` | Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios) |
| `phase1_moe_svd_agg.json` | Aggregated C1–C4 constants across all layers |
| `phase2_router_svd.json` | Router gate weight SVD per layer (`router.weight` [384, 7168]) |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |
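The JSON files are meant to be queried directly. Below is a hypothetical sketch using field names from the table above (`median_var64`, `q25`, `q75`) and the spot-check values reported later on this card; the real per-layer layout of `phase1_moe_svd.json` may differ, so treat the schema as illustrative:

```python
import json

# Illustrative record: field names come from the files table on this card,
# values from the spot-check table; the actual file layout may differ.
doc = json.loads("""
{"layers": [
  {"layer": 0,  "type": "dense", "median_var64": 0.037},
  {"layer": 1,  "type": "moe",   "median_var64": 0.082, "q25": 0.076, "q75": 0.087},
  {"layer": 15, "type": "moe",   "median_var64": 0.100, "q25": 0.090, "q75": 0.110}
]}
""")

# Flag dissolved layers with the var@64 < 0.15 threshold used on this card.
dissolved = [rec["layer"] for rec in doc["layers"] if rec["median_var64"] < 0.15]
print(dissolved)  # -> [0, 1, 15]: all three spot-checked layers fall below the threshold
```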
## Universal constants (C1–C5) β€” spot-check results (6 layers)
*Phase 1 spot-check confirmed on Modal L4 (full 61-layer run in progress). Phases 2 and 3 pending.*
| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|----------|-------------|----------------------|----------------|
| **C1** | FFN activation sparsity | 0.06–0.39 | pending (Phase 3) |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7%† |
| **C3** | Gate coherence | 0.53–0.81 | pending (Phase 1 full) |
| **C4** | Layer temperature (var@64) | 0.80–0.90 (fp16) | **0.037–0.10** ‡ |
| **C5** | Circuit stage count | 4 (fp16) | pending (Phase 3) |
†*`num_experts_per_tok: 8` confirmed from `config.json`.*
‡ **Key finding: fp8-native training produces a flat weight spectrum.** Spot-check results:
| Layer | Type | var@64 | IQR / note |
|-------|------|--------|------------|
| L00 | Dense MLP | 0.037 | S[:3]=[10.7, 8.6, 8.1] β€” flat spectrum |
| L01 | MoE (384 experts) | 0.082 | IQR=[0.076, 0.087] |
| L15 | MoE | 0.100 | IQR=[0.090, 0.110] |
| L30 | MoE | pending (full run) | β€” |
| L45 | MoE | pending | β€” |
| L60 | MoE | pending | β€” |
**This result is surprising and genuine.** The fp8 dequantization is verified correct (scale_inv stats: mean=0.00028, cv=0.41, dequantized weight std=0.023, abs_max=0.93 β€” realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.
For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 β‰ˆ 0.80–0.90 with a clear power-law singular value spectrum (S[0]/S[1] β‰ˆ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 β‰ˆ 1.24 β€” nearly uniform.
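The difference between the two regimes is easy to see on synthetic spectra. A minimal sketch of the var@64 statistic on a power-law versus a flat singular-value spectrum (the exponent is illustrative, not fitted to any model):

```python
import numpy as np

def var_at_k(s: np.ndarray, k: int = 64) -> float:
    """Fraction of spectral variance (sum of squared singular values)
    captured by the top-k singular values."""
    s2 = np.sort(s)[::-1] ** 2
    return float(s2[:k].sum() / s2.sum())

n = 2048  # matches Kimi-K2's moe_intermediate_size
power_law = np.arange(1, n + 1, dtype=np.float64) ** -0.65  # sharp decay, fp16-like
flat = np.ones(n)                                           # uniform, "dissolved"-like

print(var_at_k(power_law))  # well above the 0.15 dissolution threshold
print(var_at_k(flat))       # exactly 64/2048 = 0.03125, far below it
```

A flat spectrum also gives S[0]/S[1] ≈ 1, consistent with the near-uniform 10.7/8.6 ratio observed for Kimi-K2 above.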
**Hypothesis**: fp8 native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or due to the extreme training scale (14.7T tokens at 1T params).
## MoE-specific methodology
Standard vindex builders (for dense models) SVD a single `down_proj` matrix per layer. For Kimi-K2's MoE:
1. All 384 expert `down_proj` matrices per layer are loaded: shape `[hidden_size=7168, moe_intermediate_size=2048]`
2. Stacked into a batch tensor: `[384, 7168, 2048]`
3. Batch SVD computed on GPU (chunked into groups of 64 to fit H100 VRAM)
4. Per-expert `var@64` values are aggregated: the median, Q25, and Q75 are reported as the layer statistics
**fp8 handling**: block-wise fp8 weights are cast to bf16 before SVD. The block-wise scale factors affect absolute singular values but not the relative structure (which singular value directions dominate). The singular value *ratios* and variance-fraction statistics are scale-invariant.
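The per-expert statistic from steps 1–4, and the scale invariance argued for the fp8 handling, can be sketched on toy shapes. This is a CPU numpy stand-in for the chunked GPU batch SVD, and the invariance demo uses a single global rescale (the real block-wise 128×128 scale factors are finer-grained than this):

```python
import numpy as np

def expert_var64(stacked: np.ndarray, k: int = 64) -> np.ndarray:
    """stacked: [num_experts, hidden, intermediate] down_proj weights.
    Returns per-expert var@k: fraction of squared-singular-value mass in top k."""
    s = np.linalg.svd(stacked, compute_uv=False)  # batched SVD -> [num_experts, min(h, i)]
    s2 = s ** 2
    return s2[:, :k].sum(axis=1) / s2.sum(axis=1)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64, 32))  # toy stand-in for [384, 7168, 2048]
v = expert_var64(w, k=8)

# Step 4: aggregate per-expert values into the layer statistic.
layer_stat = (np.median(v), np.quantile(v, 0.25), np.quantile(v, 0.75))

# A uniform rescale (stand-in for fp8 scale factors) shifts absolute singular
# values but leaves the variance-fraction statistic unchanged.
assert np.allclose(expert_var64(w * 7.5, k=8), v)
```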
## Cross-architecture CKA prediction
Given that Gemma4-E2B ↔ Qwen3-8B achieves 0.992 CKA at matched normalized depth (the entity layer, ~55% depth), we predict:
- Kimi-K2 ↔ Gemma4: ~0.97–0.99 at entity layer (if the four-stage circuit is intact)
- Kimi-K2 ↔ Qwen3: similar range
If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.
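For reference, the similarity numbers above can be reproduced with the standard linear CKA estimator. A minimal numpy sketch (not necessarily the exact pipeline used for the Gemma/Qwen comparison):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representation matrices of shape [n_samples, dim].
    Returns 1.0 for identical geometry up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)  # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))                   # stand-in for one model's activations
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal rotation
Y = rng.normal(size=(128, 16))                   # unrelated representation

assert abs(linear_cka(X, X @ Q) - 1.0) < 1e-6    # invariant to rotation: CKA = 1
print(linear_cka(X, Y))                          # small for unrelated features
```

The rotation invariance is what makes CKA suitable for comparing models with different bases: a drop below 0.90 at some depth tier would indicate a genuine geometric difference, not just a change of coordinates.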
## Reproduce it
```bash
# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server
# Install Modal
pip install modal
# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct \
  --layers 0,1,15,30,45,60
# Full Phase 1 — all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct
# Phase 2 — router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 2
# Phase 3 — routing statistics (requires 8×H100, live inference)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 3
# Pull results from Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/
```
## License
CC-BY-NC 4.0 β€” free for non-commercial research use. For commercial use, contact [mike@divinci.ai](mailto:mike@divinci.ai).
## Citation
```bibtex
@misc{mooring2026kimi-k2-vindex,
  author = {Mooring, Mike},
  title  = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
  year   = {2026},
  url    = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
  note   = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}
```
---
## Changelog
| Date | Update |
|------|--------|
| 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution confirmed. Full Phase 1 running. |
---
*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*