| # True Ternary Refactor 6 — Architecture Ternarization And Accumulator Hardening |
|
|
| ## Scope |
|
|
| This pass moves the non-imported MORPH architecture toward persistent ternary state everywhere. ViT and Whisper remain imported frozen encoders, as requested. Internal trainable and storage components are now ternary or integer buffers rather than floating-point parameters. |
|
|
| ## Architecture Ternarization |
|
|
| Converted internal float trainable components: |
|
|
| - `ImageSequencer.patch_proj`: `nn.Linear` -> `TernaryScaleTensor` |
| - `AudioSequencer.frame_proj`: `nn.Linear` -> `TernaryScaleTensor` |
| - `ModalityGate.weights`: float parameter -> int8 buffer |
| - `GNNLoRAAdapter.B`: float parameter -> `TernaryScaleTensor` up projection |
| - `GNNLoRAAdapter.scale`: `nn.Embedding` -> `TernaryEmbeddingTable` |
| - `MemGram.struct_emb` / `conv_emb`: float `ParameterList` -> `TernaryEmbeddingTable` modules |
| - `MemGram` strength/decay logits: float parameters -> int8 buffers |
| - `FocusGate.boundary_embed`: `nn.Embedding` -> `TernaryEmbeddingTable` |
| - `FocusGate.reset_fc` / `dampen_fc`: `nn.Linear` -> `TernaryScaleTensor` |
| - `ConversationLSTM.focus_cell` / `topic_cell`: `nn.LSTMCell` -> `TernaryLSTMCell` |
| - `ConversationLSTM.topic_gate_fc`: `nn.Linear` -> `TernaryScaleTensor` |
| - `GraphMoEGate.query`: float parameter -> `TernaryScaleTensor` query projection |
| - `TernaryGraph.edge_attr`: float parameter -> int8 ternary edge buffer |
| - `VQAdapter`: `FlashVQCodebook` float buffers -> `TernaryVQCodebook` with `TernaryEmbeddingTable` |
| - `ConvVQCodebook.embed`: float buffer -> `TernaryEmbeddingTable` |
| - `ConvVQCodebook` strength/decay logits: float parameters -> int8 buffers |
|
|
| New reusable modules: |
|
|
| - `TernaryEmbeddingTable`: packed ternary lookup table with int8 `E`, int8 `E_accum`, and int8 `T_accum`. |
| - `TernaryLSTMCell`: LSTM-style gate cell using one ternary projection over `[x, h]`. |
| - `TernaryVQCodebook`: VQ lookup against a ternary embedding table with integer cluster counters. |
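
As a rough illustration of the `TernaryVQCodebook` idea (nearest-code lookup against ternary rows plus integer usage counters), here is a minimal pure-Python sketch. `TinyTernaryCodebook`, `decode_row`, the per-row exponent granularity, and the squared-Euclidean distance are all assumptions made for this example; the real module's layout, packing, and distance metric may differ.

```python
def decode_row(trits, exponent):
    """Effective row values: each trit in {-1, 0, +1} times the scale 2**exponent."""
    scale = 2.0 ** exponent
    return [t * scale for t in trits]


class TinyTernaryCodebook:
    """Illustrative stand-in for a ternary VQ codebook (hypothetical, not the real module)."""

    def __init__(self, trit_rows, exponents):
        assert len(trit_rows) == len(exponents)
        self.trit_rows = trit_rows          # ternary codes, values in {-1, 0, +1}
        self.exponents = exponents          # one integer exponent per row (assumed granularity)
        self.counts = [0] * len(trit_rows)  # integer cluster-usage counters

    def lookup(self, vector):
        """Return the index of the nearest code and bump its integer counter."""
        best_index, best_dist = -1, float("inf")
        for index, (trits, exponent) in enumerate(zip(self.trit_rows, self.exponents)):
            code = decode_row(trits, exponent)
            # Squared Euclidean distance is an assumption for this sketch.
            dist = sum((v - c) ** 2 for v, c in zip(vector, code))
            if dist < best_dist:
                best_index, best_dist = index, dist
        self.counts[best_index] += 1
        return best_index


codebook = TinyTernaryCodebook(
    trit_rows=[[1, 0, -1], [0, 1, 1], [-1, -1, 0]],
    exponents=[0, 1, -1],
)
first = codebook.lookup([0.9, 0.1, -1.2])   # closest to row 0: [1, 0, -1]
second = codebook.lookup([0.1, 2.2, 1.7])   # closest to row 1: [0, 2, 2]
```

The point of the sketch is that both the codes and the counters stay integer-valued; only the decoded comparison values are floats, and they are transient.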
|
|
| ## Audit Results |
|
|
| Text/internal full architecture without ViT/Whisper: |
|
|
| ```text |
| logical ternary weights: 23,887,936 |
| ternary training state: 31.22 MB |
| trainable float params: 0 tensors, 0.00 MB |
| frozen float params: 0 tensors, 0.00 MB |
| float buffers: 0 tensors, 0.00 MB |
| ``` |
|
|
| Full model with ViT and Whisper enabled: |
|
|
| ```text |
| logical ternary weights: 25,560,128 |
| ternary training state: 33.40 MB |
| trainable float params: 0 tensors, 0.00 MB |
| frozen float params: ViT/Whisper only |
| non-imported trainable float params: 0 |
| non-imported float buffers: 0 |
| ``` |
|
|
| ## Loss / Accumulator Hardening |
|
|
| The previous strict update could fail to flip `T` because `T_accum` moved by only `1` per update, and many gradients changed sign before the accumulator reached the threshold of `3`. |
|
|
| Added loss-strength integer accumulator stepping: |
|
|
| ```text |
| loss_signal -> t_step in {1, 2, 3, 4} |
| T_accum += sign(grad) * t_step |
| ``` |
|
|
| This keeps `T_accum` int8, but lets high-loss updates reach threshold faster without adding float optimizer state. The Triton ternary-step kernels now accept `T_ACCUM_STEP`, and `_ternary_update_memory()` sets `_t_accum_step` per update from the current loss. |
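
The effect of the step size can be seen in a small pure-Python sketch. Only the rule `T_accum += sign(grad) * t_step` and the threshold `3` come from the notes above; the loss bucket edges in `loss_to_t_step`, the int8 saturation, and the helper names are hypothetical choices for illustration, not the project's actual mapping or kernel code.

```python
INT8_MIN, INT8_MAX = -128, 127


def loss_to_t_step(loss, edges=(2.0, 4.0, 8.0)):
    """Bucket a scalar loss into t_step in {1, 2, 3, 4} (bucket edges are hypothetical)."""
    return 1 + sum(loss >= edge for edge in edges)


def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def accumulate(t_accum, grad, t_step):
    """T_accum += sign(grad) * t_step, saturated to the int8 range (saturation assumed)."""
    return max(INT8_MIN, min(INT8_MAX, t_accum + sign(grad) * t_step))


def updates_to_threshold(grads, t_step, threshold=3):
    """Count updates until |T_accum| >= threshold, or None if it is never reached."""
    acc = 0
    for count, grad in enumerate(grads, start=1):
        acc = accumulate(acc, grad, t_step)
        if abs(acc) >= threshold:
            return count
    return None


noisy = [0.5, 0.2, -0.1, 0.3, 0.4, -0.2, 0.6, 0.1]  # mostly positive, some sign flips
alternating = [1.0, -1.0] * 8                        # sign flips on every update

slow = updates_to_threshold(noisy, t_step=1)
fast = updates_to_threshold(noisy, t_step=2)
stuck = updates_to_threshold(alternating, t_step=1)
unstuck = updates_to_threshold(alternating, t_step=3)
```

With a step of `1`, the alternating sequence never reaches the threshold, which matches the failure mode described above. A larger step crosses it immediately, at the cost of reacting to a single high-loss update.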
|
|
| ## Scale Semantics |
|
|
| `E` remains an int8 logarithmic exponent, and the scale `S = 2^E` remains derived rather than stored: |
|
|
| ```text |
| W = T * 2^E |
| ``` |
|
|
| The effective weight values are not limited to `{-1, 0, +1}`. They are: |
|
|
| ```text |
| {-S, 0, +S} |
| ``` |
|
|
| So if the scale path represented `S = 99.9`, the effective group values would be `{-99.9, 0, +99.9}`. The current implementation uses base-2 integer exponent scales, so representing non-power-of-two values like `99.9` exactly would require either a mantissa/residual scale field or a different logarithmic base/lattice. The current approach keeps persistent state integer-only and low overhead. |
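
To make the power-of-two limitation concrete, the following sketch derives the scale from an integer exponent and finds the closest representable scale to `99.9`. The function names and the choice to round in log space are assumptions for this example; the source only fixes `W = T * 2^E` with an int8 `E`.

```python
import math

INT8_MIN, INT8_MAX = -128, 127


def scale_from_exponent(exponent):
    """Derived scale S = 2**E; nothing besides the integer exponent is stored."""
    return 2.0 ** exponent


def effective_values(exponent):
    """The three effective weight values {-S, 0, +S} for one scale group."""
    s = scale_from_exponent(exponent)
    return (-s, 0.0, s)


def nearest_exponent(target_scale):
    """Integer exponent closest to target_scale in log2 space, clamped to int8."""
    exponent = round(math.log2(target_scale))
    return max(INT8_MIN, min(INT8_MAX, exponent))


target = 99.9
e = nearest_exponent(target)               # log2(99.9) is about 6.64, so this rounds to 7
s = scale_from_exponent(e)                 # 128.0
relative_error = abs(s - target) / target  # roughly 28% off the requested scale
```

Under these assumptions the best available group values for a requested `99.9` are `{-128, 0, +128}`, which is the gap a mantissa or residual field would close.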
|
|
| ## Kernel Status |
|
|
| The packed ternary linear, embedding, RMSNorm, `E` update, and `T_accum` update paths are Triton-backed. Graph edge weighting plus target aggregation is now also Triton-backed on CUDA, with a custom backward for projected message gradients. |
|
|
| MoE and Graph still contain Python-level control flow around multiple ternary kernels: |
|
|
| - MoE loops over top-k/expert routing and calls ternary projections per expert. |
| - Graph still loops over hops and calls GNN/update projections per hop, but each hop no longer materializes `messages` or calls `scatter_add_`; ternary edge weighting and aggregation now run in one Triton launch. |
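
For reference, the per-hop computation that the fused edge-weighting and aggregation launch replaces can be written as a plain Python loop. This is a sketch of the assumed semantics only: the name `aggregate`, the argument layout, and the treatment of `messages` as already-projected per-node features are illustrative assumptions, not the Triton kernel's interface.

```python
def aggregate(messages, edge_src, edge_dst, edge_trits, num_nodes):
    """Compute out[dst] += trit * messages[src] per edge, with no per-edge message buffer."""
    dim = len(messages[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for src, dst, trit in zip(edge_src, edge_dst, edge_trits):
        if trit == 0:
            continue  # a zero ternary edge contributes nothing
        row = out[dst]
        for k, value in enumerate(messages[src]):
            row[k] += trit * value  # trit is -1 or +1 here
    return out


messages = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one feature row per source node
edge_src = [0, 1, 2, 0]
edge_dst = [1, 2, 1, 2]
edge_trits = [1, -1, 0, 1]                        # ternary edge attributes

out = aggregate(messages, edge_src, edge_dst, edge_trits, num_nodes=3)
```

Writing weighted contributions straight into the target rows is what avoids the intermediate `messages` tensor and the separate scatter step; a GPU version would additionally need atomic adds or a sorted-by-target layout, which this sketch does not model.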
|
|
| This pass did not collapse the full MoE or Graph computation into one monolithic Triton kernel. Doing that correctly requires a dedicated packed-ternary fused expert dispatch kernel and a fused graph message-passing kernel, each of which decodes packed weights, routes or scatters tokens, and updates outputs inside one launch. The architecture is now ternary enough for that kernel work to be the next isolated performance phase. |
|
|
| ## Verification |
|
|
| - `python -m py_compile trigram.py tscale.py benchmark_true_ternary.py train.py ternary_audit.py testing/test_tscale.py` |
| - `PASS test_cuda_triton_correctness_update_E` |
| - `PASS test_cuda_triton_tscale_path` |
| - `PASS graph_aggregate_cuda_ok` |
| - Full text/internal audit: zero float params and zero float buffers. |
| - Strict train construction now passes `enable_audio=False` as well as `enable_image=False`, so strict mode no longer instantiates Whisper. |
| - Strict train-style audit with image/audio/VQ/graph/memory disabled and MoE enabled: |
|
|
| ```text |
| logical ternary weights: 14,011,904 |
| ternary training state: 18.27 MB |
| trainable float params: 0 tensors, 0.00 MB |
| frozen float params: 0 tensors, 0.00 MB |
| float buffers: 0 tensors, 0.00 MB |
| ``` |
|
|
| - The current strict train smoke test, run after disabling audio in strict mode, completed 3 steps with zero float params/buffers; loss moved `8.2048 -> 9.7809 -> 7.7685`, with a final eval loss of `6.4239`. |
| - CUDA full-path smoke with VQ, graph, memory, and MoE enabled passed forward, backward, and `_ternary_update_memory()`. |
|
|
| ## Remaining Work |
|
|
| 1. Build a fused MoE Triton dispatch kernel for top-k expert routing and expert projection scheduling. |
| 2. Extend the Graph Triton aggregation kernel into a full fused message-passing/hop-update kernel. |
| 3. Add component-specific ternary backward routing so LossComponents can update selected ternary module groups separately, not only through weighted total loss. |
| 4. Consider a low-overhead mantissa/residual scale lattice if exact non-power-of-two scale values such as `99.9` become required. |
|
|