ARBS / docs /true-ternary /TRUE-TERNARY-REFACTOR14.md

Upload folder using huggingface_hub

d8bc908 verified 1 day ago

5.94 kB

	# TRUE TERNARY REFACTOR 14

	Date: 2026-05-20

	## Goal

	Bring the reworked ARB platform path back into a production-ready shape after the MLA KV Cache, MemGram, MoEGraph ACT loop, LTI injection, VideoHead, and compressed temporal additions. The main contract for this pass:

	- KV ledger must condition MoEGraph, Router, and Outputs.
	- MoEGraph must consume VQ motifs and build/update KG edges without full-codebook materialization.
	- Shared VQ scales to 10M entries.
	- KG/composite VQ scales to 5M entries.
	- The model remains ternary-first for trainable internal state.
	- Training scripts that still call `_ternary_update_memory(loss_signal=...)` must keep working.

	## Changes

	### 1. 10M Shared VQ and 5M KG VQ

	Updated `arbitor/config.py`:

	- `SHARED_VQ_SIZE = 10_000_000`
	- `KGVQ_CODEBOOK_SIZE = 5_000_000`

	The estimated logical ternary total is now about 3.012B weights:

	\| Area \| Logical ternary weights \|
	\| --- \| ---: \|
	\| Embedding + text sequencer \| 33.5M \|
	\| Shared VQ + VQ bridge \| 640.9M \|
	\| MoEGraph \| 429.0M \|
	\| KG VQ + composite proposal \| 329.3M \|
	\| Router \| 12.9M \|
	\| ByteHead \| 626.9M \|
	\| Video/Talker heads \| 112.2M \|
	\| MLA attention \| 826.8M \|

	### 2. Large Ternary VQ Lookup

	`TernaryVQCodebook` now avoids full `x @ codebook.T` lookup for million-entry codebooks.

	- Small codebooks still use exact lookup.
	- Large codebooks use deterministic candidate IDs from input signs and compare only candidate vectors.
	- Candidate vectors are fetched through sparse ternary row decode from packed trits and int8 log scales.
	- Cluster usage updates only touched indices instead of running a full `bincount(minlength=codebook_size)`.

	This keeps 10M/5M codebooks from allocating a dequantized full codebook or a giant similarity matrix during normal forward.

	### 3. Sparse Ternary Embedding Updates

	`TernaryEmbeddingTable` now has a sparse path for large tables:

	- decodes only requested packed rows
	- expands only selected int8 `E` rows
	- captures sparse gradient signs per selected row
	- updates `T_accum`, `E_accum`, `E`, and packed trits only for touched rows

	This is required for the 10M shared VQ and 5M KG VQ to train without dense float hidden weights or dense codebook gradients.

	### 4. VQ to MoEGraph Shape Contract

	`SharedVQ` returns `CODEBOOK_DIM=64`, but MoEGraph, router, attention, and output heads operate at `TRIGRAM_DIM=7168`.

	`ARBModel` now adds a ternary `vq_to_trigram` projection plus ternary RMS norm after VQ so the full path is:

	```text
	Sequencer -> SharedVQ(64d motifs) -> ternary VQ-to-trigram expansion -> MLA/KV + MoEGraph -> Router/Outputs
	```

	This fixes the previous shape mismatch where VQ output could not correctly feed MoEGraph/ByteHead.

	### 5. KV Ledger Reaches Outputs Through MoEGraph

	Generation now seeds the KV ledger with prompt tokens when empty, and forward appends predicted byte IDs during both training and text generation.

	KV is consumed by `ContextAttentionScheduler`, injected into MoEGraph, and MoEGraph output is what the router and output heads receive. That satisfies the intended path:

	```text
	KV Ledger -> ContextAttentionScheduler -> MoEGraph -> OutputRouter -> ByteHead/VideoHead/TalkerHead
	```

	### 6. MoEGraph Active-Codebook Path

	MoEGraph no longer requires full `bridge.vq.embed` materialization. It receives the shared VQ table and fetches active codebook rows by motif index.

	Large codebooks use active-node traversal by default, so graph traversal stays tied to VQ motifs without trying to aggregate all 10M nodes.

	### 7. Composite/KG VQ Is Now Ternary

	`CompositeProposalHead` now uses ternary projection, ternary halt gate, and `TernaryVQCodebook` for KG/composite motifs. The old float `nn.Linear` proposal path is no longer the composite default.

	### 8. Training Update Compatibility

	`ARBModel._ternary_update_memory()` accepts both:

	- `loss_signal=...` for existing training scripts
	- `loss_components=...` for component-routed ternary backward

	The update path now:

	- skips non-finite losses before mutating ternary state
	- preserves regular hook-based updates after `loss.backward()`
	- supports component-specific dense and sparse ternary hooks
	- clears stale hooks after update

	### 9. TileLang Training Safety

	TileLang training is disabled by default:

	```text
	ARB_TILELANG_TRAINING=0
	```

	The TileLang autograd forward also now saves the flattened 2D input correctly. This keeps TileLang available for inference/debug speed work while avoiding the fp16 training path that was causing NaN losses.

	## Validation

	Passed:

	```bash
	python -m compileall -q arbitor training tests testing
	python -m pytest -q testing/test_gradient_capture.py testing/test_tilelang_training.py tests/test_lti.py testing/kg/test_kv_integration.py testing/attention/test_lstm_removal_clean.py tests/test_cross_modal.py tests/test_vae2d.py tests/test_vae2d_sequencer.py
	```

	Result:

	```text
	24 passed, 23 skipped, 1 warning
	```

	Additional targeted checks passed:

	- large sparse VQ forward/backward/update with `codebook_size=1_000_001`
	- composite proposal head with sparse 1M KG VQ
	- text sequencer returns `[B, T-2, TRIGRAM_DIM]`
	- exact small VQ path remains finite

	The heavy cross-modal/VAE tests are now guarded by:

	```bash
	ARB_RUN_SLOW_TESTS=1
	```

	This prevents normal test runs from constructing the full 3B target model and sidecar VAE encoders by accident. The initial ungated run entered the heavy full-model / VAE sidecar path and did not produce useful signal quickly in the local run.

	## Remaining Work

	1. Add a first-class small-config ARB smoke model so CI can test full `ARBModel.forward()` without constructing the full 3B target.
	2. Add CUDA validation for sparse VQ row updates on the actual 10M/5M codebooks.
	3. Add a dedicated slow-test fixture that asserts local sidecar cache availability before running full cross-modal VAE tests.
	4. Profile the candidate VQ lookup quality and tune `candidate_count` for speed/accuracy.