	---
	license: cc-by-nc-4.0
	library_name: larql
	tags:
	- vindex
	- larql
	- gemma4
	- gguf
	- mechanistic-interpretability
	- knowledge-editing
	- constellation-edits
	base_model: google/gemma-4-e2b-it
	---

	# Gemma 4 e2b — LarQL Vindex v0.2

	First-ever published [LarQL](https://github.com/chrishayuk/larql) vindex for Google's Gemma 4.

	A vindex is a transformer's weights decompiled into a queryable feature database — entity associations, circuit structure, and knowledge-editing surfaces exposed as APIs. No GPU required for most operations.

	## What this is / What this is not

	| ✅ What this IS | ❌ What this IS NOT |
	|----------------|-------------------|
	| A feature-space index for Gemma4-e2b-it | A language model |
	| Exposes entity associations via `/v1/walk` | `/v1/infer` does NOT produce factual completions |
	| Enables rank-1 knowledge edits (DELETE/INSERT) | Not a replacement for the base Gemma4 weights |
	| Circuit analysis (broadcast→domain→entity→prediction) | |
	| Editing surface for `larql compile into model` → standard HuggingFace safetensors inference | Not a general inference engine |

	Critical note on `/v1/infer`: This endpoint returns a feature-modulated projection of the host model's activations — not a coherent text-generation distribution. Output is incoherent subword tokens by design (the vindex is a feature graph, not a full transformer forward pass). For factual text generation from the base model, use `google/gemma-4-e2b-it` directly. To run inference on an edited model (after DELETE/INSERT patches), use `larql compile into model` — this exports MEMIT-edited weights to HuggingFace safetensors that load like any standard `transformers` model. Use `/v1/walk` and `/v1/patch` for the validated vindex operations.

	Validated surfaces: `/v1/walk` (entity-association retrieval), `/v1/describe` (feature neighborhood), `/v1/patch` DELETE/INSERT (rank-1 weight editing, Gate 3 confirmed).
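
To make "rank-1 weight editing" concrete, here is a minimal sketch of why removing a single FFN feature is a rank-1 change to the down projection. It is a general linear-algebra illustration on toy numbers, not LarQL's patch implementation; the function names and the `[features × hidden]` list layout are assumptions made for the example.

```python
def ffn_output(activations, down):
    """Sum of activation-weighted down rows: out[h] = sum_f a[f] * down[f][h]."""
    hidden = len(down[0])
    return [sum(a * row[h] for a, row in zip(activations, down)) for h in range(hidden)]


def delete_feature(down, feature):
    """Apply W' = W - e_f d_f^T, a rank-1 update that zeroes row `feature`."""
    d_f = down[feature]
    return [
        [w - (1.0 if i == feature else 0.0) * d for w, d in zip(row, d_f)]
        for i, row in enumerate(down)
    ]


down = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]   # toy matrix: 3 features x 2 hidden dims
acts = [0.5, 1.0, 2.0]                        # toy gate activations, one per feature

edited = delete_feature(down, feature=2)
print(ffn_output(acts, down))     # [6.5, 8.0]
print(ffn_output(acts, edited))   # [0.5, 2.0]
```

Only the deleted feature's contribution disappears from the output; every other row of the matrix is left untouched, which is what makes such an edit easy to reverse.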

	Compile an edited vindex to a runnable model:
	```bash
	# After applying patches, export to safetensors for standard inference
	larql compile into model \
	--vindex Divinci-AI/gemma-4-e2b-vindex \
	--output ./edited-gemma4 \
	--format safetensors
	```

	Then load the exported weights with standard Transformers:
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained('./edited-gemma4')
	```

	## Quick start

	```bash
	# Install LarQL (requires our fork with Gemma 4 support until upstreamed)
	git clone https://github.com/Divinci-AI/larql.git
	cd larql && cargo build --release

	# Set environment variables (replace the placeholders; the quotes stop the shell from treating < and > as redirections)
	export LARQL_SERVICE_URL="<your_larql_cloud_run_url>"
	export INTERNAL_LARQL_S2S_TOKEN="<your_s2s_token>"

	# Query entity associations
	curl "$LARQL_SERVICE_URL/v1/walk?prompt=Paris&layers=14-27&top=10" \
	-H "Authorization: Bearer $INTERNAL_LARQL_S2S_TOKEN"

	# Gate 3 repro: DELETE the Paris→capital feature then verify suppression
	curl -X POST "$LARQL_SERVICE_URL/v1/patches/apply" \
	-H "Authorization: Bearer $INTERNAL_LARQL_S2S_TOKEN" \
	-H "Content-Type: application/json" \
	-d '{"name":"delete-paris-capital","patch":{"version":1,"base_model":"gemma4-e2b","created_at":"2026-04-20T00:00:00Z","operations":[{"op":"delete","entity":"Paris","relation":"capital","target":"서울","weight":1.0,"layer":27,"feature":11179}]}}'

	# Before: feature 11179 (gate_score=18.1) present in walk
	# After: feature 11179 absent from walk (complete suppression confirmed)
	```
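
The same two calls can be prepared from Python. The sketch below only builds the requests, using the endpoint paths, query parameters, and patch fields from the curl commands above; the helper names and the placeholder host are assumptions, and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request


def walk_url(base_url: str, prompt: str, layers: str = "14-27", top: int = 10) -> str:
    """Build the /v1/walk query URL used in the curl example."""
    query = urllib.parse.urlencode({"prompt": prompt, "layers": layers, "top": top})
    return f"{base_url.rstrip('/')}/v1/walk?{query}"


def delete_patch_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the Gate 3 DELETE patch request."""
    payload = {
        "name": "delete-paris-capital",
        "patch": {
            "version": 1,
            "base_model": "gemma4-e2b",
            "created_at": "2026-04-20T00:00:00Z",
            "operations": [{
                "op": "delete",
                "entity": "Paris",
                "relation": "capital",
                "target": "서울",
                "weight": 1.0,
                "layer": 27,
                "feature": 11179,
            }],
        },
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/patches/apply",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token, for illustration only.
req = delete_patch_request("https://larql.example.invalid", "dummy-token")
print(req.get_method(), req.full_url)
print(walk_url("https://larql.example.invalid", "Paris"))
```

Building the payload as a dictionary keeps the non-ASCII target (`서울`) and the numeric fields out of hand-escaped shell quoting.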

	## Contents

	| File | Size | Description |
	|------|------|-------------|
	| `gate_vectors.bin` | 1.0 GB | FFN gate matrices, per-layer variable (f16) |
	| `down_features.bin` | ~1.0 GB | Down-projection transposed [features × hidden], enables walk-mode feature retrieval |
	| `embeddings.bin` | 768 MB | Token embeddings, 262,144 × 1,536 (f16) |
	| `down_meta.bin` | 29 MB | Feature labels via vocab projection |
	| `feature_clusters.jsonl` | 4 MB | K-means clusters over gate features |
	| `relation_clusters.json` | 15 MB | Wikidata relation matching |
	| `norms.bin` | 423 KB | Per-layer normalization weights |
	| `tokenizer.json` | 11 MB | Substitute tokenizer (Qwen 2.5; the real Gemma 4 tokenizer was gated during extraction) |
	| `index.json` | 5 KB | Metadata: 35 layers, hidden=1536, variable FFN (6144 → 12288) |
	| `manifest.json` | 1.1 KB | Vindex version manifest |

	Total: ~2.8 GB (without full weight files)

	> Note on `down_features.bin`: Generated from `down_weights.bin` via a Python transposition step that handles Gemma 4's variable intermediate sizes per layer (L0-14: 6144, L15-34: 12288). The Rust `build_down_features` binary segfaults on variable intermediate sizes; our fix is the Python Cloud Build step in `build-larql-service.sh`. Required for walk-mode feature retrieval.
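
The transposition described in the note can be sketched as follows. This is not the code from `build-larql-service.sh`; it assumes that `down_weights.bin` stores one little-endian f16 matrix per layer in `[hidden × intermediate]` row-major order with no header or padding, and the function name is hypothetical. The point is that each layer is sliced with its own intermediate size before being transposed.

```python
import struct


def transpose_down_weights(raw: bytes, hidden: int, intermediates: list[int]) -> bytes:
    """Turn per-layer [hidden x intermediate] f16 blocks into [intermediate x hidden]."""
    out = bytearray()
    offset = 0
    for inter in intermediates:              # one entry per layer, sizes may differ
        count = hidden * inter
        values = struct.unpack_from(f"<{count}e", raw, offset)
        offset += 2 * count                  # f16 = 2 bytes per value
        for feature in range(inter):
            # Gather column `feature` of the input so it becomes one output row.
            row = [values[h * inter + feature] for h in range(hidden)]
            out += struct.pack(f"<{hidden}e", *row)
    if offset != len(raw):
        raise ValueError("input size does not match the declared layer shapes")
    return bytes(out)


# Toy input: hidden=2, layer 0 has 2 features, layer 1 has 3 features.
toy = struct.pack("<10e", 1, 2, 3, 4, 1, 2, 3, 4, 5, 6)
print(struct.unpack("<10e", transpose_down_weights(toy, 2, [2, 3])))
```

For this model the per-layer sizes would be `[6144] * 15 + [12288] * 20` with `hidden=1536`. A pure-Python loop like this is only practical for small inputs; an array library would be the natural choice at the ~1 GB scale.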

	## Gate 3 Validation (DELETE patch confirmed)

	Gate 3 test: a DELETE patch on the Paris → 서울 (Seoul/capital) feature, located at layer 27, feature index 11179.

	| Metric | Before DELETE | After DELETE |
	|--------|--------------|-------------|
	| Feature 11179 gate_score | 18.10 | ABSENT |
	| Paris capital rank | #2 overall | Absent from top-25 |
	| Walk hits | Feature 11179 present (score 18.1) | Feature 11179 completely absent |

	Walk and dense results diverge after the fix, which confirms that `down_features.bin` is loaded and active.

	```
	Before: feature=11179 score=18.10 target='서울' ← rank #1
	After: feature=7327 score=9.40 target='PMA' ← 서울 COMPLETELY ABSENT
	```

	Gate 3 result: PASS ✓
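
The pass criterion can be checked mechanically. The sketch below parses lines in the `feature=… score=… target='…'` form shown in the log excerpt above and tests whether a feature was present before the patch and absent afterwards. The helper names are hypothetical, and the parser targets only that log text, not the `/v1/walk` response schema.

```python
import re

WALK_LINE = re.compile(r"feature=(\d+)\s+score=([\d.]+)\s+target='([^']*)'")


def parse_walk_line(line: str) -> dict:
    """Extract feature id, score, and target token from one log line."""
    match = WALK_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognised walk line: {line!r}")
    return {
        "feature": int(match.group(1)),
        "score": float(match.group(2)),
        "target": match.group(3),
    }


def feature_suppressed(before: list[dict], after: list[dict], feature: int) -> bool:
    """True when the feature appears in the 'before' hits and not in the 'after' hits."""
    was_present = any(hit["feature"] == feature for hit in before)
    still_present = any(hit["feature"] == feature for hit in after)
    return was_present and not still_present


before = [parse_walk_line("Before: feature=11179 score=18.10 target='서울'")]
after = [parse_walk_line("After: feature=7327 score=9.40 target='PMA'")]
print(feature_suppressed(before, after, 11179))  # True
```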

	## Architecture details

	- Architecture: Gemma 4 dense (e2b variant)
	- Layers: 35 (L0-14: FFN=6144, L15-34: FFN=12288 — per-layer variable)
	- Hidden size: 1536
	- Head dim: 256
	- Attention: 8 Q heads, 1 KV head (GQA 8:1)
	- Quantization source: Q4_K GGUF
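
As a sanity check, the architecture numbers above reproduce two of the file sizes listed under Contents. This is a back-of-the-envelope sketch that assumes each file is a plain f16 dump with no header or padding, and that the gate matrix of each layer has shape `[FFN × hidden]`.

```python
HIDDEN = 1536
VOCAB = 262_144
BYTES_F16 = 2
FFN_SIZES = [6144] * 15 + [12288] * 20   # L0-14, then L15-34

total_features = sum(FFN_SIZES)
gate_bytes = total_features * HIDDEN * BYTES_F16
embedding_bytes = VOCAB * HIDDEN * BYTES_F16

print(len(FFN_SIZES), total_features)    # 35 337920
print(gate_bytes / 1e9)                  # 1.03809024  (about 1.0 GB)
print(embedding_bytes / 2**20)           # 768.0       (MiB)
```

Under these assumptions the gate vectors come to roughly 1.04 GB and the embeddings to exactly 768 MiB, in line with the table.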

	## Research findings

	This vindex enabled the following findings (see `notebooks/PAPER_universal_constants.md` in [Divinci-AI/server](https://github.com/Divinci-AI/server)):

	Five universal constants across transformer architectures:
	1. ~12% dominant FFN sparsity (scale-invariant)
	2. Top-8 output concentration (~99.7% at each position)
	3. ~0.97 gate coherence across all layers
	4. ~0.042 layer temperature (log-activation variance)
	5. Broadcast → Domain → Entity → Prediction circuit (4-stage)

	Predictive formula: `active_experts ≈ 1/dominant_sparsity` predicts Gemma 4's top-8 MoE routing within 4% error from structural analysis alone.
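
Plugging the rounded figures quoted above into the formula gives a quick worked check. The inputs are the approximate values from this section (12% sparsity, top-8 routing), so the result is only as precise as that rounding.

```python
dominant_sparsity = 0.12   # "~12% dominant FFN sparsity" from the list above
observed_top_k = 8         # top-8 routing quoted in the text

predicted = 1 / dominant_sparsity
relative_error = abs(predicted - observed_top_k) / observed_top_k

print(f"predicted active experts: {predicted:.2f}")        # 8.33
print(f"relative error vs top-8: {relative_error:.1%}")    # 4.2%
```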

	Constellation Edits (knowledge editing): Rank-1 DELETE at the TRACE-identified crown layer (L25 for geography facts) achieves FQ=1.00 in 80ms with full reversibility. Gradient ascent fails due to softmax saturation (gradient=0 at P=1.0 float32). Cross-architecture validation: Mistral-7B FQ=1.00/MU=0.88 (structural rank-1), Qwen2.5-1.5B FQ=1.00 (ROME-style k*). See `notebooks/PAPER_CONSTELLATION_EDITS_DRAFT.md`.
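
The saturation effect mentioned above can be reproduced on toy numbers. The sketch emulates float32 rounding with the standard library (the `f32` helper is an assumption made for this example, not part of any library) and shows that once the target probability rounds to exactly 1.0, the gradient of `-log p[target]` with respect to the target logit is exactly zero. It illustrates the general phenomenon and does not reproduce the experiments in the paper draft.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def softmax_f32(logits: list[float]) -> list[float]:
    """Softmax with every intermediate result rounded to float32."""
    peak = max(logits)
    exps = [f32(math.exp(f32(z - peak))) for z in logits]
    total = 0.0
    for e in exps:
        total = f32(total + e)
    return [f32(e / total) for e in exps]


logits = [30.0, 0.0, 0.0]   # toy case: the target token leads by 30 logits
probs = softmax_f32(logits)
target = 0

# Gradient of the loss -log p[target] with respect to each logit is p - onehot.
grad = [f32(p - (1.0 if i == target else 0.0)) for i, p in enumerate(probs)]

print(probs[target])   # 1.0
print(grad)            # target component is 0.0, the rest are around 1e-13
```

With no usable gradient signal on the target logit, an optimizer has nothing to push against, which is consistent with the motivation given above for a structural rank-1 edit.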

	## Important notes

	1. Substitute tokenizer: Feature labels show Qwen 2.5 tokens (151,643-vocab), not Gemma 4 tokens. Gate vectors are correct Gemma 4 weights; only the label mapping is approximate.

	2. Built with patched LarQL: 7 bug fixes required for Gemma 4 (column-major loading, Q4_K block size, variable FFN size support, etc.). See https://github.com/Divinci-AI/larql and upstream PR https://github.com/chrishayuk/larql/pull/24.

	3. License: CC-BY-NC 4.0. Academic and research use. Contact [mike@divinci.ai](mailto:mike@divinci.ai) for commercial licensing.

	## Citation

	```bibtex
	@misc{mooring2026universalconstants,
	title={Universal Constants of Transformer Intelligence},
	author={Mooring, Mike},
	year={2026},
	note={Preprint. arXiv forthcoming.}
	}

	@misc{mooring2026constellation,
	title={Constellation Edits: Training-Free Knowledge Injection and Auditable Unlearning via Multi-Layer Feature Patches},
	author={Mooring, Mike},
	year={2026},
	note={Preprint. arXiv forthcoming.}
	}
	```

	## Acknowledgments

	Chris Hayuk for creating LarQL. Google DeepMind for Gemma 4. Cloudflare for frontier model hosting.