| # TRUE TERNARY REFACTOR 15 |
|
|
| Date: 2026-05-20 |
|
|
| ## Goal |
|
|
| Fix the two regressions reported after the platform restructure: |
|
|
| - model capacity had fallen back to about 1.9B instead of the 3B target |
| - training was hitting OOM from fp16/fp32 state leaking into ternary paths |
|
|
| ## Changes |
|
|
| ### 1. Restored the 3B target shape |
|
|
| `arbitor/config.py` now restores the large VQ targets while keeping the motif width small enough to stay near 3B: |
|
|
| - `CODEBOOK_DIM = 64` |
| - `SHARED_VQ_SIZE = 10_000_000` |
| - `KGVQ_CODEBOOK_SIZE = 5_000_000` |
| - `KGVQ_CODEBOOK_DIM = 64` |
|
|
| A no-allocation constructor trace of the assembled default model reports: |
|
|
| ```text |
| dummy logical ternary total: 3,011,944,672 |
| ``` |
|
|
| This keeps the requested 10M shared VQ and 5M KG VQ without the accidental 1024-wide VQ explosion. |
|
|
| ### 2. Removed MoEGraph fp16 edge EMA |
|
|
| `MoEGraph` no longer allocates dense `codebook_size * 10` graph edges for large VQ graphs and no longer registers `edge_ema` as `float16`. |
|
|
| Large graphs now use bounded active edge state: |
|
|
| - `active_edge_src`: int32 |
| - `active_edge_dst`: int32 |
| - `active_edge_attr`: int8 ternary edge sign |
| - `active_edge_score`: int8 residual score |
| - `edge_index`: empty compatibility buffer for large active mode |
|
|
| Small graph tests still use dense edges, but the score path is now int8 `edge_score`, not fp16 EMA. |
|
|
| ### 3. Removed float KG VQ buffers |
|
|
| The old `KGVQCodebook` kept float32 `embed` and `embed_avg` buffers. It is now a compatibility wrapper around `TernaryVQCodebook`, so the KG/composite VQ uses packed ternary rows, int8 scales, int8 accumulators, and int16 usage counts. |
|
|
| ### 4. Large VQ initialization is now packed-first |
|
|
| `TernaryEmbeddingTable` now detects million-entry tables and initializes directly into: |
|
|
| - packed `uint8` trits |
| - int8 `E` |
| - int8 `E_accum` |
| - int8 `T_accum` |
|
|
| This avoids building temporary multi-GB float tensors for the 10M shared VQ and 5M KG VQ. |
|
|
| ### 5. Removed persistent fp32 Triton training hooks |
|
|
| The Triton ternary backward path now stores `_hook_grad_T_sign` as int8 instead of keeping `_hook_grad_2d` and `_hook_x_2d` fp32 activation/gradient views on each ternary module after backward. |
|
|
| The direct fp32 hook fallback remains only for non-Triton compatibility paths, and the tests now assert that the CUDA Triton path does not retain fp32 grad/x hooks. |
|
|
| ### 6. Ternary MoE centroids |
|
|
| MoEGraph routing centroids are now a `TernaryEmbeddingTable` instead of a float `nn.Parameter`. |
|
|
| ## Validation |
|
|
| Passed: |
|
|
| ```bash |
| python -m compileall -q arbitor training testing |
| python -m pytest -q testing/kg/test_kg_edges.py testing/kg/test_composite_head.py testing/test_gradient_capture.py testing/test_tilelang_training.py |
| python -m pytest -q testing/test_tscale.py::test_cuda_triton_tscale_path |
| python -m pytest -q --import-mode=importlib testing/model/test_tscale.py::test_cuda_triton_tscale_path |
| ``` |
|
|
| Additional targeted checks passed: |
|
|
| - large active MoEGraph with `codebook_size=10_000_000` has `edge_index.shape == (2, 0)` and no float edge buffers |
| - 1M-entry `TernaryVQCodebook` has no float buffers and trains through sparse forward/backward/update |
| - small active MoEGraph forward/backward remains finite with ternary centroids |
|
|
| ## Remaining Risk |
|
|
| The kernels still use fp32 accumulators internally for numeric accumulation and losses still produce floating scalar loss values. This pass removes persistent fp16/fp32 ternary state and retained fp32 training hooks, which were the memory leak/OOM concern. A fully integer activation/loss path would be a separate kernel-level redesign. |
|
|