True Ternary Refactor 6 — Architecture Ternarization And Accumulator Hardening
Scope
This pass moves the non-imported MORPH architecture toward persistent ternary state everywhere. ViT and Whisper remain imported frozen encoders as requested. The internal trainable/storage components are now ternary buffers or integer buffers rather than FP parameters.
Architecture Ternarization
Converted internal float trainable components:
- ImageSequencer.patch_proj: nn.Linear -> TernaryScaleTensor
- AudioSequencer.frame_proj: nn.Linear -> TernaryScaleTensor
- ModalityGate.weights: float parameter -> int8 buffer
- GNNLoRAAdapter.B: float parameter -> TernaryScaleTensor up projection
- GNNLoRAAdapter.scale: nn.Embedding -> TernaryEmbeddingTable
- MemGram.struct_emb/conv_emb: float ParameterList -> TernaryEmbeddingTable modules
- MemGram strength/decay logits: float parameters -> int8 buffers
- FocusGate.boundary_embed: nn.Embedding -> TernaryEmbeddingTable
- FocusGate.reset_fc/dampen_fc: nn.Linear -> TernaryScaleTensor
- ConversationLSTM.focus_cell/topic_cell: nn.LSTMCell -> TernaryLSTMCell
- ConversationLSTM.topic_gate_fc: nn.Linear -> TernaryScaleTensor
- GraphMoEGate.query: float parameter -> TernaryScaleTensor query projection
- TernaryGraph.edge_attr: float parameter -> int8 ternary edge buffer
- VQAdapter: FlashVQCodebook float buffers -> TernaryVQCodebook with TernaryEmbeddingTable
- ConvVQCodebook.embed: float buffer -> TernaryEmbeddingTable
- ConvVQCodebook strength/decay logits: float parameters -> int8 buffers
New reusable modules:
- TernaryEmbeddingTable: packed ternary lookup table with int8 E, int8 E_accum, and int8 T_accum.
- TernaryLSTMCell: LSTM-style gate cell using one ternary projection over [x, h].
- TernaryVQCodebook: VQ lookup against a ternary embedding table with integer cluster counters.
Audit Results
Text/internal full architecture without ViT/Whisper:
logical ternary weights: 23,887,936
ternary training state: 31.22 MB
trainable float params: 0 tensors, 0.00 MB
frozen float params: 0 tensors, 0.00 MB
float buffers: 0 tensors, 0.00 MB
Full model with ViT and Whisper enabled:
logical ternary weights: 25,560,128
ternary training state: 33.40 MB
trainable float params: 0 tensors, 0.00 MB
frozen float params: ViT/Whisper only
non-imported trainable float params: 0
non-imported float buffers: 0
Loss / Accumulator Hardening
The previous strict update could fail to flip T because T_accum only moved by 1 per update and many gradients changed direction before reaching threshold 3.
Added loss-strength integer accumulator stepping:
loss_signal -> t_step in {1, 2, 3, 4}
T_accum += sign(grad) * t_step
This keeps T_accum int8, but lets high-loss updates reach threshold faster without adding float optimizer state. The Triton ternary-step kernels now accept T_ACCUM_STEP, and _ternary_update_memory() sets _t_accum_step per update from the current loss.
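The stepping rule above can be illustrated with a small pure-Python sketch. It is not the Triton kernel: the loss bucket edges, the helper names (loss_to_t_step, ternary_accum_step), the saturating add, and the flip rule (move T one step against the accumulated sign, then reset the accumulator) are all assumptions made for illustration. Only the update T_accum += sign(grad) * t_step, the step range {1, 2, 3, 4}, and the threshold of 3 come from these notes.

```python
INT8_MIN, INT8_MAX = -128, 127


def loss_to_t_step(loss_signal, edges=(1.0, 2.0, 4.0)):
    """Map a scalar loss to an integer step in {1, 2, 3, 4}.

    The bucket edges are hypothetical; the notes only fix the output range.
    """
    return 1 + sum(1 for edge in edges if loss_signal >= edge)


def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def ternary_accum_step(T, T_accum, grads, t_step, threshold=3):
    """Apply T_accum += sign(grad) * t_step, then flip T past the threshold.

    Assumed flip rule: move T one step against the accumulated gradient sign,
    clamp to {-1, 0, +1}, and reset the accumulator to zero.
    """
    new_T, new_accum = [], []
    for t, acc, g in zip(T, T_accum, grads):
        # Saturating add keeps the accumulator inside the int8 range.
        acc = max(INT8_MIN, min(INT8_MAX, acc + sign(g) * t_step))
        if acc >= threshold:
            t, acc = max(-1, t - 1), 0
        elif acc <= -threshold:
            t, acc = min(1, t + 1), 0
        new_T.append(t)
        new_accum.append(acc)
    return new_T, new_accum


# Oscillating gradient signs with t_step = 1 never reach the threshold ...
T, acc = [0], [0]
for g in (+0.5, +0.5, -0.5, +0.5):
    T, acc = ternary_accum_step(T, acc, [g], t_step=1)
stalled = (T, acc)

# ... while a single high-loss update with t_step = 4 crosses it at once.
flipped = ternary_accum_step([0], [0], [+0.5], t_step=loss_to_t_step(10.0))
```

In this sketch the oscillating sequence leaves T unchanged with a partially filled accumulator, which is the failure mode described above, while the high-loss update flips T in one step. No float optimizer state is carried between updates in either case.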
Scale Semantics
E remains an int8 logarithmic exponent and S remains derived, not stored:
W = T * 2^E
The effective weight values are not limited to {-1, 0, +1}. They are:
{-S, 0, +S}
For example, if the scale path represented S = 99.9, the effective group values would be {-99.9, 0, +99.9}. The current implementation uses base-2 integer exponent scales, so representing a non-power-of-two value such as 99.9 exactly would require either a mantissa/residual scale field or a different logarithmic base/lattice. The current approach keeps persistent state integer-only and low-overhead.
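A minimal sketch of the decode W = T * 2^E for one group, assuming S = 2^E is recomputed from the stored int8 exponent on each use. The function names are hypothetical, and nearest_exponent is only an illustration of what a power-of-two lattice does to a target scale such as 99.9; it is not taken from the implementation.

```python
import math


def scale_from_exponent(E):
    """Derive S = 2**E from an integer exponent; ldexp is exact for powers of two."""
    return math.ldexp(1.0, E)


def effective_weights(T, E):
    """Decode one group: W = T * 2**E with T in {-1, 0, +1}."""
    S = scale_from_exponent(E)
    return [t * S for t in T]


def nearest_exponent(target_scale):
    """Hypothetical helper: integer exponent closest to the target in log2 space."""
    return round(math.log2(target_scale))


E = nearest_exponent(99.9)  # log2(99.9) is about 6.64, so this rounds to 7
weights = effective_weights([-1, 0, 1], E)  # values are {-128, 0, +128}
```

Under these assumptions a requested scale of 99.9 lands on 128, its nearest power of two in log space, which is the gap that a mantissa/residual field or a finer logarithmic lattice would close.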
Kernel Status
The packed ternary linear, embedding, RMSNorm, E update, and T_accum update paths are Triton-backed. Graph edge weighting plus target aggregation is now also Triton-backed on CUDA, with a custom backward for projected message gradients.
MoE and Graph still contain Python-level control flow around multiple ternary kernels:
- MoE loops over top-k/expert routing and calls ternary projections per expert.
- Graph still loops over hops and calls GNN/update projections per hop, but each hop no longer materializes messages and calls scatter_add_; ternary edge weighting and aggregation are one Triton launch.
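The difference between the old and new per-hop aggregation can be sketched in plain Python. This is a semantic reference only, not the Triton kernel: the edge-list layout (src, dst), the per-edge ternary weights edge_t, and both function names are assumptions made for illustration.

```python
def aggregate_materialized(x, src, dst, edge_t, num_nodes):
    """Two-pass reference: build per-edge messages, then scatter-add into targets."""
    dim = len(x[0])
    messages = [[w * v for v in x[s]] for s, w in zip(src, edge_t)]
    out = [[0.0] * dim for _ in range(num_nodes)]
    for d, msg in zip(dst, messages):
        for k in range(dim):
            out[d][k] += msg[k]
    return out


def aggregate_fused(x, src, dst, edge_t, num_nodes):
    """Single pass: weight and accumulate per edge without a message buffer."""
    dim = len(x[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for s, d, w in zip(src, dst, edge_t):
        if w == 0:
            continue  # a zero ternary edge contributes nothing
        for k in range(dim):
            out[d][k] += w * x[s][k]
    return out


# Three nodes with 2-d features and four edges carrying ternary weights.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
src = [0, 1, 2, 2]
dst = [1, 2, 0, 1]
edge_t = [1, -1, 0, 1]

two_pass = aggregate_materialized(x, src, dst, edge_t, num_nodes=3)
one_pass = aggregate_fused(x, src, dst, edge_t, num_nodes=3)
```

Both functions produce the same aggregated node features; the fused form simply never allocates the per-edge message buffer, which is the property the single-launch kernel is described as having.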
To be clear, this pass did not collapse the full MoE or Graph computation into one monolithic Triton kernel. Doing that correctly requires a dedicated packed-ternary fused expert dispatch kernel and a fused graph message-passing kernel that decode packed weights, route/scatter tokens, and update outputs inside one launch. The architecture is now ternary enough for that kernel work to be the next isolated performance phase.
Verification
- python -m py_compile trigram.py tscale.py benchmark_true_ternary.py train.py ternary_audit.py testing/test_tscale.py
- PASS test_cuda_triton_correctness_update_E
- PASS test_cuda_triton_tscale_path
- PASS graph_aggregate_cuda_ok
- Full text/internal audit: zero float params and zero float buffers.
- Strict train construction now passes enable_audio=False as well as enable_image=False, so strict mode no longer instantiates Whisper.
- Strict train-style audit with image/audio/VQ/graph/memory disabled and MoE enabled:
logical ternary weights: 14,011,904
ternary training state: 18.27 MB
trainable float params: 0 tensors, 0.00 MB
frozen float params: 0 tensors, 0.00 MB
float buffers: 0 tensors, 0.00 MB
- Current strict train smoke after disabling audio in strict mode ran 3 steps with zero float params/buffers and loss moved 8.2048 -> 9.7809 -> 7.7685; final eval loss 6.4239.
- CUDA full-path smoke with VQ, graph, memory, and MoE enabled passed forward, backward, and _ternary_update_memory().
Remaining Work
- Build fused MoE Triton dispatch kernel for top-k expert routing and expert projection scheduling.
- Extend the Graph Triton aggregation kernel into a full fused message-passing/hop-update kernel.
- Add component-specific ternary backward routing so LossComponents can update selected ternary module groups separately, not only through weighted total loss.
- Consider a low-overhead mantissa/residual scale lattice if exact non-power-of-two scale values such as 99.9 become required.