Add model card with MoE vindex methodology and C1-C5 predictions
README.md
ADDED
@@ -0,0 +1,149 @@
---
license: cc-by-nc-4.0
tags:
- interpretability
- mechanistic-interpretability
- moe
- kimi-k2
- larql
- vindex
- svd
- weight-analysis
base_model: moonshotai/Kimi-K2-Instruct
---

# Kimi-K2-Instruct Vindex

A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct), Moonshot AI's 1T-parameter Mixture-of-Experts transformer.

Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).

## What this is

This vindex is built from the singular value decomposition (SVD) of Kimi-K2's MLP weight matrices, specifically the `down_proj` weights of all 384 routed experts across the model's 61 transformer layers (layer 0 is a dense MLP, so it has no routed experts). Each layer's 384 experts are decomposed with a batched SVD, and the singular value statistics are aggregated into a queryable feature database.

The vindex enables:
- **C1–C5 universal constant measurement** across the model
- **Cross-architecture CKA** (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
- **Knowledge editing** via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)

## What this is not

- This is **not** an inference endpoint. You cannot run generation with this artifact.
- This is **not** the full model weights, only SVD statistics and feature metadata.
- Phase 3 routing stats (live inference with `output_router_logits=True`) are stored separately.

## Model architecture (Kimi-K2 specifics)

| Parameter | Value |
|-----------|-------|
| Architecture | DeepSeek-V3 style MoE |
| Total parameters | ~1T |
| Active parameters per forward pass | ~32B |
| Layers | 61 |
| Hidden size | 7168 |
| Routed experts per layer | 384 |
| Top-K routing | 8 (`num_experts_per_tok=8`) |
| Shared experts per layer | 1 |
| MoE intermediate size | 2048 |
| First K dense layers | 1 (layer 0 is dense MLP) |
| Weight precision | fp8 block-quantized (`weight_block_size=[128,128]`) |
| Scoring function | sigmoid |
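
To make the routing rows of the table concrete, here is a minimal sketch of sigmoid-scored top-8 selection over 384 routed experts. It is an illustration under stated assumptions, not Kimi-K2's actual router: the function name `route_token`, the renormalization of the selected scores, and the random toy logits are all hypothetical, and real DeepSeek-V3-style routers may add terms (such as a selection bias or a scaling factor) that are omitted here.

```python
import math
import random


def route_token(router_logits, top_k=8):
    """Pick the top_k experts by sigmoid score and renormalize their weights.

    Hypothetical helper: the renormalization step is an assumption,
    not something confirmed by the model card.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
    # Indices of the top_k highest-scoring experts, best first.
    chosen = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]


# Toy logits for one token over 384 routed experts (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(384)]
routes = route_token(logits)

for expert_index, weight in routes:
    print(f"expert {expert_index:3d}  weight {weight:.3f}")
```

With these settings each token touches 8 of the 384 routed experts in a layer (about 2.1%), plus the shared expert.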

## Vindex files

| File | Description |
|------|-------------|
| `phase1_moe_svd.json` | Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios) |
| `phase1_moe_svd_agg.json` | Aggregated C1–C4 constants across all layers |
| `phase2_router_svd.json` | Router gate weight SVD per layer (`router.weight` [384, 7168]) |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |

## Universal constants (C1–C5): preliminary results

*Phase 1+2 running on Modal H100. Results will be posted here and in the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) once complete.*

| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|----------|-------------|----------------------|----------------|
| **C1** | FFN activation sparsity | 0.06–0.39 | pending |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7%† |
| **C3** | Gate coherence | 0.53–0.81 | pending |
| **C4** | Layer temperature | 0.012–0.042 | pending |
| **C5** | Circuit stage count | 4 (fp16) | pending |

† *`num_experts_per_tok: 8` confirmed from `config.json`. Four organizations (Google, OpenAI, Alibaba, Moonshot AI) have independently converged on top-8 routing for frontier MoE.*

**Prediction**: var@64 ≈ 0.80–0.90 (power-law, same as all other fp16/bf16 models) and C5 = 4 (circuit intact). The MoE architecture shouldn't change this: per-expert weight matrices are bf16-equivalent.
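
For reference, a variance-fraction statistic of this kind is commonly defined from the singular values of a weight matrix, sorted in descending order. The definition below is an assumption about what `var@64` denotes, since the card does not spell it out. For a `down_proj` matrix of shape `[7168, 2048]` there are at most 2048 nonzero singular values, so the top 64 cover 3.125% of the available directions.

```latex
\mathrm{var@}k \;=\; \frac{\sum_{i=1}^{k} \sigma_i^{2}}{\sum_{i=1}^{r} \sigma_i^{2}},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0,
\qquad r = \min(7168,\, 2048) = 2048,\quad k = 64
```

Under this reading, the prediction says that 64 of 2048 directions carry roughly 80–90% of the squared singular-value mass of a typical expert.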

## MoE-specific methodology

Standard vindex builders (for dense models) decompose a single `down_proj` matrix per layer. For Kimi-K2's MoE:

1. All 384 expert `down_proj` matrices per layer are loaded, each of shape `[hidden_size=7168, moe_intermediate_size=2048]`
2. They are stacked into a batch tensor: `[384, 7168, 2048]`
3. A batch SVD is computed on GPU (chunked into groups of 64 to fit H100 VRAM)
4. Per-expert `var@64` values are aggregated: the median, Q25, and Q75 are reported as the layer statistic
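
The four steps above can be sketched in a few lines of NumPy. This is a CPU-only illustration on toy shapes, not the code in `moe_vindex_builder.py`: the function names, the output keys, and the definition of `var@k` as the share of squared singular values in the top `k` directions are assumptions made for the example.

```python
import numpy as np


def var_at_k(singular_values, k):
    """Share of squared singular-value mass in the top-k directions (assumed definition)."""
    energy = singular_values ** 2
    return energy[..., :k].sum(axis=-1) / energy.sum(axis=-1)


def layer_var_stats(experts, k=64, chunk=64):
    """Aggregate per-expert var@k for one layer.

    experts: array of shape [num_experts, hidden_size, intermediate_size].
    Experts are processed in chunks to bound peak memory, mirroring step 3.
    """
    per_expert = []
    for start in range(0, experts.shape[0], chunk):
        batch = experts[start:start + chunk].astype(np.float32)
        # Batched SVD; compute_uv=False returns only singular values, sorted descending.
        s = np.linalg.svd(batch, compute_uv=False)
        per_expert.append(var_at_k(s, k))
    values = np.concatenate(per_expert)
    q25, median, q75 = np.percentile(values, [25, 50, 75])
    return {
        "median": float(median),
        "q25": float(q25),
        "q75": float(q75),
        "num_experts": int(values.size),
    }


# Toy stand-in for one layer: 12 experts of shape [32, 16] instead of 384 x [7168, 2048].
rng = np.random.default_rng(0)
toy_layer = rng.standard_normal((12, 32, 16))
stats = layer_var_stats(toy_layer, k=4, chunk=5)
print(stats)
```

Random Gaussian matrices have a flat spectrum, so the toy values will sit far below the 0.80–0.90 range predicted for trained weights; the example only demonstrates the mechanics.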
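
**fp8 handling**: block-wise fp8 weights are cast to bf16 before SVD. A scale factor changes the absolute singular values but not the relative structure (which singular directions dominate), so the singular value *ratios* and variance-fraction statistics are invariant to a uniform rescaling of the matrix. This holds exactly only when the scale is the same for the whole matrix: scale factors that differ between `[128,128]` blocks reweight those blocks relative to one another, so they must be applied (the weights dequantized) before the SVD for the statistics to describe the effective weights.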
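
The scale-invariance argument can be checked numerically. The sketch below compares a uniform rescaling with a per-block rescaling on a random matrix; the helper `var_at_k`, the toy matrix, and the block scale values are hypothetical and chosen only to make the effect visible. It assumes `var@k` is the share of squared singular values in the top `k` directions.

```python
import numpy as np


def var_at_k(weight, k):
    """Share of squared singular-value mass in the top-k directions (assumed definition)."""
    s = np.linalg.svd(weight, compute_uv=False)
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())


rng = np.random.default_rng(0)
block = 128
weight = rng.standard_normal((2 * block, 2 * block))

# Case 1: one scale for the whole matrix.
uniform = 3.7 * weight

# Case 2: a different (made-up) scale for each 128x128 block.
block_scales = np.array([[1.0, 10.0], [0.1, 5.0]])
# np.kron expands each per-block scale into a full 128x128 tile.
per_block = weight * np.kron(block_scales, np.ones((block, block)))

base = var_at_k(weight, 16)
same = var_at_k(uniform, 16)
shifted = var_at_k(per_block, 16)

print(f"original        var@16 = {base:.4f}")
print(f"uniform scale   var@16 = {same:.4f}")
print(f"per-block scale var@16 = {shifted:.4f}")
```

The uniform case reproduces the original statistic up to floating-point error, while the per-block case does not, which is why the scales need to be folded into the weights before decomposition.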

## Cross-architecture CKA prediction

Given that Gemma4-E2B ↔ Qwen3-8B achieves a CKA of 0.992 at matched normalized depth (entity layer, ~55%), we predict:

- Kimi-K2 ↔ Gemma4: ~0.97–0.99 at entity layer (if the four-stage circuit is intact)
- Kimi-K2 ↔ Qwen3: similar range

If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.
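
For readers unfamiliar with the metric, here is a minimal sketch of linear CKA between two activation matrices collected on the same inputs, together with one possible way to pick a layer at a given normalized depth. Both helpers are illustrative assumptions: the card does not say which CKA variant is used or how depth is mapped to a layer index, and the activations below are random toy data.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between x [n, d1] and y [n, d2], rows aligned to the same n inputs."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layer_at_depth(num_layers, depth_fraction):
    """Map a normalized depth in [0, 1] to a layer index (hypothetical convention)."""
    return round(depth_fraction * (num_layers - 1))


rng = np.random.default_rng(0)
acts_a = rng.standard_normal((200, 8))

# Same geometry up to a rotation and a global scale: CKA should be ~1.
rotation, _ = np.linalg.qr(rng.standard_normal((8, 8)))
acts_b = 2.5 * acts_a @ rotation

# Unrelated activations: CKA should be close to 0.
acts_c = rng.standard_normal((200, 8))

same_geometry = linear_cka(acts_a, acts_b)
unrelated = linear_cka(acts_a, acts_c)
entity_layer = layer_at_depth(61, 0.55)

print(f"rotated + rescaled copy: {same_geometry:.4f}")
print(f"unrelated activations:   {unrelated:.4f}")
print(f"layer at 55% depth of a 61-layer model: {entity_layer}")
```

Because linear CKA ignores rotations and global scale, it can compare models with different hidden sizes, which is what makes a cross-architecture comparison at matched depth possible.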

## Reproduce it

```bash
# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server

# Install Modal
pip install modal

# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct \
  --layers 0,1,15,30,45,60

# Full Phase 1: all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct

# Phase 2: router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 2

# Phase 3: routing statistics (requires 8×H100, live inference)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 3

# Pull results from Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/
```

## License

CC-BY-NC 4.0: free for non-commercial research use. For commercial use, contact [mike@divinci.ai](mailto:mike@divinci.ai).

## Citation

```bibtex
@misc{mooring2026kimi-k2-vindex,
  author = {Mooring, Mike},
  title = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
  year = {2026},
  url = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
  note = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}
```

---

*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*