| # TRUE TERNARY REFACTOR 14 |
|
|
| Date: 2026-05-20 |
|
|
| ## Goal |
|
|
| Bring the reworked ARB platform path back into a production-ready shape after the MLA KV Cache, MemGram, MoEGraph ACT loop, LTI injection, VideoHead, and compressed temporal additions. The main contract for this pass: |
|
|
| - KV ledger must condition MoEGraph, Router, and Outputs. |
| - MoEGraph must consume VQ motifs and build/update KG edges without full-codebook materialization. |
| - Shared VQ scales to 10M entries. |
| - KG/composite VQ scales to 5M entries. |
| - The model remains ternary-first for trainable internal state. |
| - Training scripts that still call `_ternary_update_memory(loss_signal=...)` must keep working. |
|
|
| ## Changes |
|
|
| ### 1. 10M Shared VQ and 5M KG VQ |
|
|
| Updated `arbitor/config.py`: |
|
|
| - `SHARED_VQ_SIZE = 10_000_000` |
| - `KGVQ_CODEBOOK_SIZE = 5_000_000` |
|
|
| The estimated logical ternary total is now about **3.012B** weights: |
|
|
| | Area | Logical ternary weights | |
| | --- | ---: | |
| | Embedding + text sequencer | 33.5M | |
| | Shared VQ + VQ bridge | 640.9M | |
| | MoEGraph | 429.0M | |
| | KG VQ + composite proposal | 329.3M | |
| | Router | 12.9M | |
| | ByteHead | 626.9M | |
| | Video/Talker heads | 112.2M | |
| | MLA attention | 826.8M | |
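
As a quick arithmetic check, the per-area figures in the table can be summed to confirm the quoted total. The dictionary below just restates the table; the name `AREA_WEIGHTS_M` is illustrative and not part of `arbitor/config.py`.

```python
# Per-area logical ternary weights in millions, copied from the table above.
AREA_WEIGHTS_M = {
    "Embedding + text sequencer": 33.5,
    "Shared VQ + VQ bridge": 640.9,
    "MoEGraph": 429.0,
    "KG VQ + composite proposal": 329.3,
    "Router": 12.9,
    "ByteHead": 626.9,
    "Video/Talker heads": 112.2,
    "MLA attention": 826.8,
}

# Sum in millions; the table rows add up to roughly 3011.5M weights.
total_m = sum(AREA_WEIGHTS_M.values())
print(f"{total_m:.1f}M logical ternary weights")
```

The sum is consistent with the "about 3.012B" estimate once rounded to three decimals in billions.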
|
|
| ### 2. Large Ternary VQ Lookup |
|
|
| `TernaryVQCodebook` now avoids full `x @ codebook.T` lookup for million-entry codebooks. |
|
|
| - Small codebooks still use exact lookup. |
| - Large codebooks use deterministic candidate IDs from input signs and compare only candidate vectors. |
| - Candidate vectors are fetched through sparse ternary row decode from packed trits and int8 log scales. |
| - Cluster usage updates only touched indices instead of running a full `bincount(minlength=codebook_size)`. |
|
|
| This keeps the 10M/5M codebooks from allocating a dequantized full codebook or a giant similarity matrix during a normal forward pass. |
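
The candidate-lookup idea can be sketched in plain Python. This is a minimal illustration, not the `TernaryVQCodebook` code: the hashing scheme (packing input signs into an integer and scanning a contiguous window of IDs), the names `sign_bucket`, `candidate_ids`, `candidate_lookup`, and the `toy_row` decoder are all assumptions made for the example. A `Counter` stands in for touched-index usage tracking.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def sign_bucket(x: Sequence[float]) -> int:
    """Pack the signs of x into an integer: bit i is set when x[i] > 0."""
    bucket = 0
    for i, value in enumerate(x):
        if value > 0:
            bucket |= 1 << i
    return bucket


def candidate_ids(x: Sequence[float], codebook_size: int, candidate_count: int) -> List[int]:
    """Derive a deterministic window of row IDs from the sign pattern of x."""
    base = sign_bucket(x) % codebook_size
    count = min(candidate_count, codebook_size)
    return [(base + k) % codebook_size for k in range(count)]


def candidate_lookup(
    x: Sequence[float],
    fetch_row: Callable[[int], Sequence[float]],
    codebook_size: int,
    candidate_count: int,
    usage: Counter,
) -> Tuple[int, float]:
    """Score only the candidate rows and record usage for the winner alone."""
    best_id, best_score = -1, float("-inf")
    for idx in candidate_ids(x, codebook_size, candidate_count):
        row = fetch_row(idx)  # stands in for sparse decode of one row
        score = sum(a * b for a, b in zip(x, row))
        if score > best_score:
            best_id, best_score = idx, score
    usage[best_id] += 1  # touched-index update, no full bincount
    return best_id, best_score


DIM = 8


def toy_row(idx: int) -> List[int]:
    """Toy row decoder (hypothetical): row signs mirror the bits of idx."""
    return [1 if (idx >> i) & 1 else -1 for i in range(DIM)]


usage = Counter()
x = [0.5, -1.0, 2.0, -0.25, 1.5, -3.0, 0.1, -0.7]
best_id, best_score = candidate_lookup(x, toy_row, 10_000_000, 8, usage)
print(best_id, dict(usage))
```

Only eight rows are decoded and scored here even though the nominal codebook has 10M entries, and the usage counter holds a single key afterwards.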
|
|
| ### 3. Sparse Ternary Embedding Updates |
|
|
| `TernaryEmbeddingTable` now has a sparse path for large tables: |
|
|
| - decodes only requested packed rows |
| - expands only selected int8 `E` rows |
| - captures sparse gradient signs per selected row |
| - updates `T_accum`, `E_accum`, `E`, and packed trits only for touched rows |
|
|
| This is required for the 10M shared VQ and 5M KG VQ to train without dense float hidden weights or dense codebook gradients. |
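
A toy version of the touched-rows-only update might look like the following. The update rule here (accumulate negated gradient signs, step a trit when the accumulator reaches a threshold, then reset) is an assumption for illustration; the real `T_accum`/`E_accum`/`E` logic and the packed storage are not shown in this document, and `SparseTernaryTable` is a hypothetical name.

```python
from typing import Dict, List, Sequence, Set


class SparseTernaryTable:
    """Toy table that stores state only for rows that were actually touched."""

    def __init__(self, num_rows: int, dim: int, threshold: int = 2) -> None:
        self.num_rows = num_rows
        self.dim = dim
        self.threshold = threshold
        self.trits: Dict[int, List[int]] = {}    # row id -> trits in {-1, 0, 1}
        self.t_accum: Dict[int, List[int]] = {}  # row id -> sign accumulators

    def gather(self, ids: Sequence[int]) -> List[List[int]]:
        """Return copies of the requested rows; untouched rows read as zeros."""
        return [list(self.trits.get(i, [0] * self.dim)) for i in ids]

    def sparse_update(self, ids: Sequence[int], grad_signs: Sequence[Sequence[int]]) -> Set[int]:
        """Apply per-row gradient signs; only rows listed in ids are modified."""
        for idx, signs in zip(ids, grad_signs):
            if not 0 <= idx < self.num_rows:
                raise IndexError(f"row {idx} out of range")
            row = self.trits.setdefault(idx, [0] * self.dim)
            acc = self.t_accum.setdefault(idx, [0] * self.dim)
            for j, sign in enumerate(signs):
                acc[j] -= sign  # move against the gradient sign
                if abs(acc[j]) >= self.threshold:
                    step = 1 if acc[j] > 0 else -1
                    row[j] = max(-1, min(1, row[j] + step))  # stay ternary
                    acc[j] = 0
        return set(ids)


table = SparseTernaryTable(num_rows=10_000_000, dim=4)
ids = [3, 7]
grad_signs = [[1, -1, 0, 1], [0, 0, -1, 1]]
for _ in range(2):
    touched = table.sparse_update(ids, grad_signs)
print(sorted(touched), table.gather([3, 7]))
```

The point of the sketch is the memory profile: after two updates on a nominal 10M-row table, state exists for exactly two rows.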
|
|
| ### 4. VQ to MoEGraph Shape Contract |
|
|
| `SharedVQ` returns motifs of width `CODEBOOK_DIM=64`, but MoEGraph, router, attention, and output heads operate at `TRIGRAM_DIM=7168`. |
|
|
| `ARBModel` now adds a ternary `vq_to_trigram` projection plus ternary RMS norm after VQ so the full path is: |
|
|
| ```text |
| Sequencer -> SharedVQ(64d motifs) -> ternary VQ-to-trigram expansion -> MLA/KV + MoEGraph -> Router/Outputs |
| ``` |
|
|
| This fixes the previous shape mismatch where VQ output could not correctly feed MoEGraph/ByteHead. |
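
The shape contract can be illustrated with a dependency-free sketch: a ternary matrix expands a 64-wide motif to 7168 values, followed by an RMS norm. The random weight initialization, the gain-free norm, and the function bodies are assumptions for the example; only the two dimensions and the `vq_to_trigram` name come from the text above.

```python
import math
import random
from typing import List, Sequence

CODEBOOK_DIM = 64    # width of SharedVQ motifs
TRIGRAM_DIM = 7168   # width expected by MoEGraph, router, attention, heads


def make_ternary_weights(out_dim: int, in_dim: int, seed: int = 0) -> List[List[int]]:
    """Build a toy weight matrix with entries in {-1, 0, 1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 0, 1)) for _ in range(in_dim)] for _ in range(out_dim)]


def rms_norm(values: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Scale values so their root-mean-square is approximately 1."""
    mean_sq = sum(v * v for v in values) / len(values)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv for v in values]


def vq_to_trigram(motif: Sequence[float], weights: Sequence[Sequence[int]]) -> List[float]:
    """Expand one motif to the wide dimension, then normalize."""
    if len(motif) != len(weights[0]):
        raise ValueError("motif width does not match projection input width")
    expanded = [sum(w * m for w, m in zip(row, motif)) for row in weights]
    return rms_norm(expanded)


weights = make_ternary_weights(TRIGRAM_DIM, CODEBOOK_DIM)
motif = [((i % 5) - 2) / 2 for i in range(CODEBOOK_DIM)]
trigram = vq_to_trigram(motif, weights)
print(len(motif), "->", len(trigram))
```

Raising on a width mismatch mirrors the kind of failure the projection is meant to remove: a 64-wide tensor reaching a 7168-wide consumer.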
|
|
| ### 5. KV Ledger Reaches Outputs Through MoEGraph |
|
|
| Generation now seeds the KV ledger with the prompt tokens when the ledger is empty, and forward appends predicted byte IDs during both training and text generation. |
|
|
| The KV ledger is consumed by `ContextAttentionScheduler` and injected into MoEGraph; the router and output heads then receive MoEGraph's output. That satisfies the intended path: |
|
|
| ```text |
| KV Ledger -> ContextAttentionScheduler -> MoEGraph -> OutputRouter -> ByteHead/VideoHead/TalkerHead |
| ``` |
|
|
| ### 6. MoEGraph Active-Codebook Path |
|
|
| MoEGraph no longer requires full `bridge.vq.embed` materialization. It receives the shared VQ table and fetches active codebook rows by motif index. |
|
|
| Large codebooks use active-node traversal by default, so graph traversal stays tied to VQ motifs without trying to aggregate all 10M nodes. |
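
Fetching only the active rows can be sketched as follows. The packing layout (four 2-bit trit codes per byte) and the scale rule (`2 ** log_scale`) are assumptions chosen for the example; this document does not specify the actual packed format, and the helper names are hypothetical.

```python
from typing import Dict, List, Sequence


def pack_row(trits: Sequence[int]) -> bytes:
    """Pack trits in {-1, 0, 1} as 2-bit codes, four per byte (assumed layout)."""
    out = bytearray()
    for start in range(0, len(trits), 4):
        byte = 0
        for offset, trit in enumerate(trits[start:start + 4]):
            byte |= (trit + 1) << (2 * offset)
        out.append(byte)
    return bytes(out)


def decode_row(packed: bytes, dim: int, log_scale: int) -> List[float]:
    """Decode one packed row and apply its power-of-two scale."""
    scale = 2.0 ** log_scale
    row = []
    for j in range(dim):
        code = (packed[j // 4] >> (2 * (j % 4))) & 0b11
        row.append((code - 1) * scale)
    return row


def fetch_active_rows(
    packed_rows: Dict[int, bytes],
    log_scales: Dict[int, int],
    motif_ids: Sequence[int],
    dim: int,
) -> Dict[int, List[float]]:
    """Decode each distinct motif row once; untouched rows are never decoded."""
    return {
        idx: decode_row(packed_rows[idx], dim, log_scales[idx])
        for idx in sorted(set(motif_ids))
    }


packed_rows = {5: pack_row([1, -1, 0, 1, 0]), 9: pack_row([0, 0, 1, -1, 1])}
log_scales = {5: -1, 9: 0}
rows = fetch_active_rows(packed_rows, log_scales, motif_ids=[9, 5, 9], dim=5)
print(rows)
```

Deduplicating motif IDs before decoding keeps the work proportional to the number of distinct active nodes rather than to sequence length or codebook size.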
|
|
| ### 7. Composite/KG VQ Is Now Ternary |
|
|
| `CompositeProposalHead` now uses ternary projection, ternary halt gate, and `TernaryVQCodebook` for KG/composite motifs. The old float `nn.Linear` proposal path is no longer the composite default. |
|
|
| ### 8. Training Update Compatibility |
|
|
| `ARBModel._ternary_update_memory()` accepts both: |
|
|
| - `loss_signal=...` for existing training scripts |
| - `loss_components=...` for component-routed ternary backward |
|
|
| The update path now: |
|
|
| - skips non-finite losses before mutating ternary state |
| - preserves regular hook-based updates after `loss.backward()` |
| - supports component-specific dense and sparse ternary hooks |
| - clears stale hooks after update |
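
The dual-keyword contract and the non-finite guard can be sketched as a small normalization step. This is not the body of `ARBModel._ternary_update_memory()`; the function name `route_losses`, the `"total"` component key, and the choice to raise when both keywords are passed are assumptions for illustration.

```python
import math
from typing import Dict, Optional


def route_losses(
    loss_signal: Optional[float] = None,
    loss_components: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Normalize both calling styles into {component: loss}, dropping non-finite values."""
    if loss_signal is not None and loss_components is not None:
        # Assumed policy: refuse ambiguous calls rather than pick one silently.
        raise ValueError("pass either loss_signal or loss_components, not both")
    if loss_components is None:
        if loss_signal is None:
            return {}
        loss_components = {"total": loss_signal}
    # Non-finite losses are skipped before any ternary state would be mutated.
    return {
        name: float(value)
        for name, value in loss_components.items()
        if math.isfinite(value)
    }


legacy = route_losses(loss_signal=0.75)
routed = route_losses(loss_components={"byte": 0.5, "video": float("nan")})
skipped = route_losses(loss_signal=float("inf"))
print(legacy, routed, skipped)
```

An empty result means there is nothing safe to apply, which corresponds to skipping the update entirely.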
|
|
| ### 9. TileLang Training Safety |
|
|
| TileLang training is disabled by default: |
|
|
| ```text |
| ARB_TILELANG_TRAINING=0 |
| ``` |
|
|
| The TileLang autograd forward now also saves the flattened 2D input correctly. This keeps TileLang available for inference/debug speed work while avoiding the fp16 training path that was causing NaN losses. |
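
A flag like this is typically read once and defaulted to off. The sketch below assumes that the string `"1"` enables the path and anything else disables it; the helper name `tilelang_training_enabled` is hypothetical and the real parsing in the codebase may differ.

```python
import os
from typing import Mapping, Optional


def tilelang_training_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when ARB_TILELANG_TRAINING is explicitly set to "1"."""
    env = os.environ if env is None else env
    # Missing or any other value keeps TileLang training off (the default).
    return env.get("ARB_TILELANG_TRAINING", "0") == "1"


print(tilelang_training_enabled({}))
print(tilelang_training_enabled({"ARB_TILELANG_TRAINING": "1"}))
```

Passing the environment as a mapping keeps the check easy to test without mutating `os.environ`.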
|
|
| ## Validation |
|
|
| Passed: |
|
|
| ```bash |
| python -m compileall -q arbitor training tests testing |
| python -m pytest -q testing/test_gradient_capture.py testing/test_tilelang_training.py tests/test_lti.py testing/kg/test_kv_integration.py testing/attention/test_lstm_removal_clean.py tests/test_cross_modal.py tests/test_vae2d.py tests/test_vae2d_sequencer.py |
| ``` |
|
|
| Result: |
|
|
| ```text |
| 24 passed, 23 skipped, 1 warning |
| ``` |
|
|
| Additional targeted checks passed: |
|
|
| - large sparse VQ forward/backward/update with `codebook_size=1_000_001` |
| - composite proposal head with sparse 1M KG VQ |
| - text sequencer returns `[B, T-2, TRIGRAM_DIM]` |
| - exact small VQ path remains finite |
|
|
| The heavy cross-modal/VAE tests are now guarded by: |
|
|
| ```bash |
| ARB_RUN_SLOW_TESTS=1 |
| ``` |
|
|
| This prevents normal test runs from constructing the full 3B target model and sidecar VAE encoders by accident. The initial ungated local run entered the heavy full-model / VAE sidecar path and was too slow to produce useful signal. |
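
One way to express such a guard is a skip decorator evaluated at import time. The sketch uses the standard library's `unittest.skipUnless`, which pytest also honors; the actual tests may use a pytest marker instead, and the names `slow_tests_enabled`, `requires_slow`, and `CrossModalSmoke` are hypothetical.

```python
import os
import unittest
from typing import Mapping, Optional


def slow_tests_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when ARB_RUN_SLOW_TESTS is explicitly set to "1"."""
    env = os.environ if env is None else env
    return env.get("ARB_RUN_SLOW_TESTS", "0") == "1"


# Decorator that skips a test unless the slow-test flag is set.
requires_slow = unittest.skipUnless(
    slow_tests_enabled(), "set ARB_RUN_SLOW_TESTS=1 to run heavy tests"
)


class CrossModalSmoke(unittest.TestCase):
    @requires_slow
    def test_full_model_path(self) -> None:
        # Placeholder body: a real test would build the full model here.
        self.assertTrue(slow_tests_enabled())
```

With the variable unset, a normal run reports the test as skipped instead of constructing the heavy model.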
|
|
| ## Remaining Work |
|
|
| 1. Add a first-class small-config ARB smoke model so CI can test full `ARBModel.forward()` without constructing the full 3B target. |
| 2. Add CUDA validation for sparse VQ row updates on the actual 10M/5M codebooks. |
| 3. Add a dedicated slow-test fixture that asserts local sidecar cache availability before running full cross-modal VAE tests. |
| 4. Profile the candidate VQ lookup quality and tune `candidate_count` for speed/accuracy. |
|
|