---
license: cc-by-nc-4.0
tags:
- interpretability
- mechanistic-interpretability
- moe
- kimi-k2
- larql
- vindex
- svd
- weight-analysis
base_model: moonshotai/Kimi-K2-Instruct
---
# Kimi-K2-Instruct Vindex
A **vindex** (vector index of learned features) for [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) β€” Moonshot AI's 1T-parameter Mixture-of-Experts transformer.
Built with [LarQL](https://github.com/Divinci-AI/larql) using the MoE-aware vindex builder at [`notebooks/moe_vindex_builder.py`](https://github.com/Divinci-AI/server/blob/preview/notebooks/moe_vindex_builder.py).
> **Status (2026-04-23):** Phase 1 spot-check complete (6/61 layers). Full 61-layer Phase 1 running, ETA ~4 hr from publish time. Phase 2 (router SVD) and Phase 3 (routing stats, 8Γ—H100) pending. The core finding β€” flat SVD spectrum consistent with 1-bit models β€” is established by the spot-check; subsequent phases refine the full-layer distribution and confirm C1/C3/C5.
>
> *This card updates in-place as phases land. See [changelog](#changelog) at the bottom.*
## What this is
This vindex is the SVD decomposition of Kimi-K2's MLP weight matrices β€” specifically the `down_proj` weights of all 384 routed experts across all 61 transformer layers. Each layer's 384 experts are batch-SVD'd and the singular value statistics are aggregated into a queryable feature database.
The vindex enables:
- **C1–C5 universal constant measurement** across the model
- **Cross-architecture CKA** (comparing Kimi-K2's representational geometry to Gemma4/Qwen3 at matched normalized depth)
- **Feature-level entity association lookup** (which expert/feature direction activates for a given input)
- **Knowledge editing** via rank-1 DELETE/INSERT patches (where the four-stage circuit is intact)
## Key finding: fp8-native training causes spectral dissolution
**Training precision, not storage precision, determines spectral structure.**
| Model | Training precision | var@64 (median) | Spectral class |
|-------|-------------------|-----------------|----------------|
| Gemma 4 E2B-it | fp16/bf16 | 0.041 | non-dissolved |
| Ministral-3B | fp16 β†’ *post-quant* fp8 | ~0.85 | non-dissolved |
| **Kimi-K2-Instruct** | **fp8 native** | **0.088** (MoE, 6/61 layers) | **dissolved** |
| Bonsai 8B | fp16 β†’ *post-quant* 1-bit | 0.093 | dissolved |
| BitNet b1.58-2B-4T | **1-bit native** | 0.111 | dissolved |
*Dissolved: var@64 < 0.15 (bimodal gap β€” no model yet observed between 0.15 and 0.50).*
The finding is surprising because Ministral-3B is stored in fp8 and shows a non-dissolved spectrum, while Kimi-K2 is *also* stored in fp8 but shows a dissolved spectrum. Same storage precision; opposite internal geometry. The variable that distinguishes them is whether fp8 was used *at training time* (gradient noise and limited numerical range) or applied *after* fp16 training (post-quantization, which preserves the fp16-shaped spectrum).
**Hypothesis:** Low-precision training (fp8 native, 1-bit native) prevents the accumulation of high-rank weight refinements that produce sharp singular-value decay in fp16 models. Gradient updates under limited numerical range can only express directions that are already well represented, so the weights converge to a more uniform spectral distribution. This is consistent with all three dissolution datapoints (n=3); the mechanism itself is untested.
**What this does not claim:**
- fp8-native training is not *equivalent* to 1-bit in capability or behavior
- Post-quantized fp8 is not "worse" than native fp8 β€” the spectra differ, behavioral quality is a separate question
- We have no data on fp4, bf16, or other low-precision training regimes
- n=3 across heterogeneous models; this is a hypothesis-generating result, not a causal proof
**Testable prediction:** Any future model trained natively in fp8 or lower precision should show var@64 < 0.15. Models to watch: any native-fp8 release from Moonshot AI, DeepSeek, or others using low-precision training from scratch.
## What this is not
- This is **not** an inference endpoint. You cannot run generation with this artifact.
- This is **not** the full model weights β€” only SVD statistics and feature metadata.
- Phase 3 routing stats (live inference with `output_router_logits=True`) are stored separately.
## Model architecture (Kimi-K2 specifics)
| Parameter | Value |
|-----------|-------|
| Architecture | DeepSeek-V3 style MoE |
| Total parameters | ~1T |
| Active parameters per forward pass | ~32B |
| Layers | 61 |
| Hidden size | 7168 |
| Routed experts per layer | 384 |
| Top-K routing | 8 (`num_experts_per_tok=8`) |
| Shared experts per layer | 1 |
| MoE intermediate size | 2048 |
| First K dense layers | 1 (layer 0 is dense MLP) |
| Weight precision | fp8 block-quantized (`weight_block_size=[128,128]`) |
| Scoring function | sigmoid |
## Vindex files
| File | Description |
|------|-------------|
| `phase1_moe_svd.json` | Per-layer SVD statistics for all 384 experts (median_var64, q25/q75, dominant SV ratios) |
| `phase1_moe_svd_agg.json` | Aggregated C1–C4 constants across all layers |
| `phase2_router_svd.json` | Router gate weight SVD per layer (`router.weight` [384, 7168]) |
| `phase3_routing_stats.json` | Live routing statistics from `output_router_logits=True` inference (256 diverse prompts) |
| `moe_config.json` | Detected MoE architecture config (expert layout, layer types, routing params) |
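The JSON files are meant to be queried directly. Below is a hypothetical sketch using field names from the table above (`median_var64`, `q25`, `q75`) and the spot-check values reported later on this card; the real per-layer layout of `phase1_moe_svd.json` may differ, so treat the schema as illustrative:

```python
import json

# Illustrative record: field names come from the files table on this card,
# values from the spot-check table; the actual file layout may differ.
doc = json.loads("""
{"layers": [
  {"layer": 0,  "type": "dense", "median_var64": 0.037},
  {"layer": 1,  "type": "moe",   "median_var64": 0.082, "q25": 0.076, "q75": 0.087},
  {"layer": 15, "type": "moe",   "median_var64": 0.100, "q25": 0.090, "q75": 0.110}
]}
""")

# Flag dissolved layers with the var@64 < 0.15 threshold used on this card.
dissolved = [rec["layer"] for rec in doc["layers"] if rec["median_var64"] < 0.15]
print(dissolved)  # -> [0, 1, 15]: all three spot-checked layers fall below the threshold
```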
## Universal constants (C1–C5) β€” spot-check results (6 layers)
*Phase 1 spot-check confirmed on Modal L4 (full 61-layer run in progress). Phases 2 and 3 pending.*
| Constant | Description | Expected (fp16 dense) | Kimi-K2 (MoE) |
|----------|-------------|----------------------|----------------|
| **C1** | FFN activation sparsity | 0.06–0.39 | pending (Phase 3) |
| **C2** | Top-8 output concentration | 99.7% (at MoE scale) | 99.7%† |
| **C3** | Gate coherence | 0.53–0.81 | pending (Phase 1 full) |
| **C4** | Layer temperature (var@64) | 0.80–0.90 (fp16) | **0.037–0.10** ‡ |
| **C5** | Circuit stage count | 4 (fp16) | pending (Phase 3) |
†*`num_experts_per_tok: 8` confirmed from `config.json`.*
‡ **Key finding: fp8-native training produces a flat weight spectrum.** Spot-check results:
| Layer | Type | var@64 | IQR / note |
|-------|------|--------|------------|
| L00 | Dense MLP | 0.037 | S[:3]=[10.7, 8.6, 8.1] β€” flat spectrum |
| L01 | MoE (384 experts) | 0.082 | IQR=[0.076, 0.087] |
| L15 | MoE | 0.100 | IQR=[0.090, 0.110] |
| L30 | MoE | pending (full run) | β€” |
| L45 | MoE | pending | β€” |
| L60 | MoE | pending | β€” |
**This result is surprising and genuine.** The fp8 dequantization is verified correct (scale_inv stats: mean=0.00028, cv=0.41, dequantized weight std=0.023, abs_max=0.93 β€” realistic fp16-equivalent range). The flat spectrum is a property of fp8-native training, not a measurement artifact.
For fp16/bf16 models (Gemma-4, Qwen3, Llama-3), var@64 β‰ˆ 0.80–0.90 with a clear power-law singular value spectrum (S[0]/S[1] β‰ˆ 10–100). For Kimi-K2, S[0]/S[1] = 10.7/8.6 β‰ˆ 1.24 β€” nearly uniform.
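The difference between the two regimes is easy to see on synthetic spectra. A minimal sketch of the var@64 statistic on a power-law versus a flat singular-value spectrum (the exponent is illustrative, not fitted to any model):

```python
import numpy as np

def var_at_k(s: np.ndarray, k: int = 64) -> float:
    """Fraction of spectral variance (sum of squared singular values)
    captured by the top-k singular values."""
    s2 = np.sort(s)[::-1] ** 2
    return float(s2[:k].sum() / s2.sum())

n = 2048  # matches Kimi-K2's moe_intermediate_size
power_law = np.arange(1, n + 1, dtype=np.float64) ** -0.65  # sharp decay, fp16-like
flat = np.ones(n)                                           # uniform, "dissolved"-like

print(var_at_k(power_law))  # well above the 0.15 dissolution threshold
print(var_at_k(flat))       # exactly 64/2048 = 0.03125, far below it
```

A flat spectrum also gives S[0]/S[1] ≈ 1, consistent with the near-uniform 10.7/8.6 ratio observed for Kimi-K2 above.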
**Hypothesis**: fp8 native training (rather than post-quantization) causes the learned representation to spread across more singular directions, flattening the spectrum. This may be due to fp8 gradient noise acting as a distributional regularizer during training, or due to the extreme training scale (14.7T tokens at 1T params).
## MoE-specific methodology
Standard vindex builders (for dense models) SVD a single `down_proj` matrix per layer. For Kimi-K2's MoE:
1. All 384 expert `down_proj` matrices per layer are loaded: shape `[hidden_size=7168, moe_intermediate_size=2048]`
2. Stacked into a batch tensor: `[384, 7168, 2048]`
3. Batch SVD computed on GPU (chunked into groups of 64 to fit H100 VRAM)
4. Per-expert `var@64` values are aggregated: the median, Q25, and Q75 are reported as the layer statistics
**fp8 handling**: block-wise fp8 weights are cast to bf16 before SVD. The block-wise scale factors affect absolute singular values but not the relative structure (which singular value directions dominate). The singular value *ratios* and variance-fraction statistics are scale-invariant.
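The per-expert statistic from steps 1–4, and the scale invariance argued for the fp8 handling, can be sketched on toy shapes. This is a CPU numpy stand-in for the chunked GPU batch SVD, and the invariance demo uses a single global rescale (the real block-wise 128×128 scale factors are finer-grained than this):

```python
import numpy as np

def expert_var64(stacked: np.ndarray, k: int = 64) -> np.ndarray:
    """stacked: [num_experts, hidden, intermediate] down_proj weights.
    Returns per-expert var@k: fraction of squared-singular-value mass in top k."""
    s = np.linalg.svd(stacked, compute_uv=False)  # batched SVD -> [num_experts, min(h, i)]
    s2 = s ** 2
    return s2[:, :k].sum(axis=1) / s2.sum(axis=1)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64, 32))  # toy stand-in for [384, 7168, 2048]
v = expert_var64(w, k=8)

# Step 4: aggregate per-expert values into the layer statistic.
layer_stat = (np.median(v), np.quantile(v, 0.25), np.quantile(v, 0.75))

# A uniform rescale (stand-in for fp8 scale factors) shifts absolute singular
# values but leaves the variance-fraction statistic unchanged.
assert np.allclose(expert_var64(w * 7.5, k=8), v)
```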
## Cross-architecture CKA prediction
Given that Gemma4-E2B ↔ Qwen3-8B achieves 0.992 CKA at matched normalized depth (the entity layer, ~55% depth), we predict:
- Kimi-K2 ↔ Gemma4: ~0.97–0.99 at entity layer (if the four-stage circuit is intact)
- Kimi-K2 ↔ Qwen3: similar range
If Kimi-K2's MoE routing significantly changes the representational geometry at any depth tier, CKA will drop below 0.90 at that tier. That would be the most surprising result of the cross-architecture series.
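For reference, the similarity numbers above can be reproduced with the standard linear CKA estimator. A minimal numpy sketch (not necessarily the exact pipeline used for the Gemma/Qwen comparison):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representation matrices of shape [n_samples, dim].
    Returns 1.0 for identical geometry up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)  # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))                   # stand-in for one model's activations
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal rotation
Y = rng.normal(size=(128, 16))                   # unrelated representation

assert abs(linear_cka(X, X @ Q) - 1.0) < 1e-6    # invariant to rotation: CKA = 1
print(linear_cka(X, Y))                          # small for unrelated features
```

The rotation invariance is what makes CKA suitable for comparing models with different bases: a drop below 0.90 at some depth tier would indicate a genuine geometric difference, not just a change of coordinates.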
## Reproduce it
```bash
# Clone the builder
git clone https://github.com/Divinci-AI/server.git
cd server
# Install Modal
pip install modal
# Spot-check 6 layers first (validates expert layout detection)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct \
  --layers 0,1,15,30,45,60
# Full Phase 1 — all 61 layers, batch SVD of 384 experts
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct
# Phase 2 — router gate SVD (no inference needed)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 2
# Phase 3 — routing statistics (requires 8×H100, live inference)
modal run notebooks/moe_vindex_builder.py \
  --model moonshotai/Kimi-K2-Instruct --phase 3
# Pull results from Modal volume
modal volume get vindex-cache moonshotai-kimi-k2-instruct/ ./kimi-vindex/
```
## License
CC-BY-NC 4.0 β€” free for non-commercial research use. For commercial use, contact [mike@divinci.ai](mailto:mike@divinci.ai).
## Citation
```bibtex
@misc{mooring2026kimi-k2-vindex,
  author = {Mooring, Mike},
  title  = {Kimi-K2-Instruct Vindex: SVD Feature Database for Mechanistic Interpretability},
  year   = {2026},
  url    = {https://huggingface.co/Divinci-AI/kimi-k2-vindex},
  note   = {Built with LarQL (https://github.com/Divinci-AI/larql). Part of the Interpretability Diaries series at https://divinci.ai/blog/}
}
```
---
## Changelog
| Date | Update |
|------|--------|
| 2026-04-23 | Initial publish. Phase 1 spot-check complete (6/61 layers). Key finding: fp8-native spectral dissolution confirmed. Full Phase 1 running. |
---
*Part of the [Interpretability Diaries](https://divinci.ai/blog/architecture-every-llm-converges-to/) research series. Working in public at [github.com/Divinci-AI](https://github.com/Divinci-AI).*