# TRUE-TERNARY-REFACTOR11

Date: 2026-05-19

## Scope

Final readiness pass for the platform-style ARB tree:

- Hit the exact 1.5B logical ternary weight target.
- Keep imported DINOv2/Moonshine sidecars frozen and int8-quantized.
- Make graph/VQ/MoE wiring consistent with the current config.
- Repair strict ternary training, standalone modality training, and LoRA finetuning entrypoints.
- Verify CUDA kernel paths and smoke-test training without restoring hidden FP master weights.

## Architecture Changes

- `CODEBOOK_SIZE` is now `34108`.
- `MOE_SHARED_INTER` is now `21216`.
- Full multimodal ARB audit now lands at exactly `1,500,000,000` logical ternary weights.
- The exact count includes MoE shared-width RMS ternary state; the final VQ codebook size compensates for that extra per-width ternary state.
- `ARBModel` now wires graph vocab size from `MultimodalVQBridge.total_codebook_size` instead of the old hardcoded `16384`.
- MemGram and ConvVQ now use config constants instead of duplicated literals.
- Supervised calls with `targets` always use `ByteHead`, even under `model.eval()`, so eval loss cannot accidentally route into video/audio heads.

## Kernel And Runtime Changes

- Large `TernaryGraph` instances now use an active-code path when `total_vocab_size > 4096`.
  - This avoids projecting every VQ codebook node on each forward. The full graph still keeps the existing Triton aggregate/gather kernels for small/full graph tests.
- The 32-expert/top-4 MoE now disables dense all-expert dispatch (`dense_dispatch_max_tokens=0`) because the dense path is too expensive at 1.5B scale.
- `TernaryScaleTensor.forward()` now forces a detached CUDA input to require grad while grad mode is active. This lets ternary weights after frozen sidecars train correctly.
- `VideoHead` initial latent now requires grad in training so `cross_attn_q` can receive ternary gradient signals.
- `TalkerHead.token_logits()` was added so audio training can use cross entropy on logits instead of non-differentiable argmax tokens.

## Training Fixes

- Replaced stale `arbitor/train.py` with a current ARB trainer:
  - strict ternary updates by default,
  - no bitsandbytes dependency,
  - correct trigram target alignment (`x[:, 3:]`),
  - optional `--no-save` for smoke tests,
  - explicit sidecar/modal flags,
  - no optimizer when there are no trainable float params.
- `training/text.py` now always runs ternary state updates after backward.
- `training/audio.py` now trains `AudioSequencer -> TalkerHead.token_logits()` against `AudioVQEncoder` targets.
- `training/vision.py` now uses VQ commitment to train image-side ternary projection state and avoids building the full MoE when it is not used.
- `training/diffusion.py` now feeds relational tokens directly into `VideoHead` and avoids the unused full MoE.
- LoRA finetuners now create checkpoint directories, use trigram-aligned targets, expose `--max-moe-iters`, and default to `1` iteration for local 8GB runs.
- Audio input normalization now accepts `[T]`, `[B, T]`, and `[B, 1, T]`.

## Verification

Passed:

- `python -m compileall -q arbitor training testing/model/test_arb.py testing/test_tscale.py`
- `python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"`
- Custom CUDA kernel smoke for graph aggregate, graph gather/add, MoE dense combine, and video denoise.
- Full multimodal audit:
  - logical ternary weights: `1,500,000,000`
  - ternary training state: `1956.05 MB`
  - trainable float params: `0`
  - frozen sidecar params: `318.80 MB`
  - graph vocab: `58684`
  - DINOv2 int8: `True`
  - Moonshine int8: `True`
- Active graph CUDA train smoke passed with VQ+graph and no MoE.
- `arbitor.train` strict ternary CUDA smoke passed with 1.5B MoE path.
- Standalone modality smokes passed:
  - `training/audio.py --steps 1 --batch 1`
  - `training/vision.py --steps 1 --batch 1`
  - `training/diffusion.py --steps 1 --batch 1`
- LoRA text finetune smoke passed:
  - `training/finetuning/text.py --steps 1 --batch 1 --accum 1 --ctx 4 --lora-rank 1 --max-moe-iters 1`
- pig-vae load now passes after installing `diffusers` into the user site.
  - The local `.safetensors` checkpoint must load through `AutoencoderKLWan.from_single_file`; direct `load_state_dict(strict=False)` had 194 missing and 194 unexpected keys and was silently leaving random VAE weights.
- pig-vae int8 load smoke:
  - inner module: `AutoencoderKLWan`
  - quantized int8: `True`
  - trainable float params: `0`
- pig-vae encode/decode smoke:
  - input video: `[1, 3, 4, 64, 64]`
  - latents: `[1, 16, 1, 8, 8]`
  - reconstruction: `[1, 3, 1, 64, 64]`

## Remaining Constraints

- Full 1.5B MoE training is functionally correct but still slow on the RTX 4060 class GPU. One strict smoke step with MoE took about 77 seconds after the sparse-dispatch change.
- The active graph path is the practical path for the 58k VQ vocabulary. A future native CUDA graph kernel should fuse active-node projection, neighbor selection, hop update, and pooling.
- LoRA finetuning uses float adapter parameters by design; strict base ternary state remains frozen in that path.
- Diffusers emits a non-fatal warning about missing `torchao` tensor support in this environment. The local safetensors pig-vae checkpoint loaded and ran without torchao.
- `pyproject.toml` now requires `diffusers>=0.38.0` for the `diffusers` and `video` extras because older versions may not expose `AutoencoderKLWan`.
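
The pig-vae finding under Verification (194 missing and 194 unexpected keys from a non-strict load, leaving random VAE weights in place) is the kind of failure a small key-set check can surface early. The following is a minimal sketch, not code from the ARB tree: `KeyDiff`, `diff_state_dict_keys`, and `assert_checkpoint_matches` are hypothetical helper names, and the key strings in the demo are invented purely to illustrate a naming-scheme mismatch.

```python
from typing import Iterable, List, NamedTuple


class KeyDiff(NamedTuple):
    missing: List[str]     # keys the model expects but the checkpoint lacks
    unexpected: List[str]  # keys the checkpoint has but the model does not use


def diff_state_dict_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> KeyDiff:
    """Compare two key sets the way a non-strict load would report them."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    return KeyDiff(sorted(model_set - ckpt_set), sorted(ckpt_set - model_set))


def assert_checkpoint_matches(
    model_keys: Iterable[str],
    checkpoint_keys: Iterable[str],
    max_report: int = 5,
) -> None:
    """Raise instead of silently keeping randomly initialised weights."""
    diff = diff_state_dict_keys(model_keys, checkpoint_keys)
    if diff.missing or diff.unexpected:
        raise RuntimeError(
            f"{len(diff.missing)} missing and {len(diff.unexpected)} unexpected keys; "
            f"first missing: {diff.missing[:max_report]}, "
            f"first unexpected: {diff.unexpected[:max_report]}"
        )


if __name__ == "__main__":
    # Hypothetical key names (assumption): the checkpoint uses a different prefix.
    model_keys = ["encoder.conv_in.weight", "decoder.conv_out.weight"]
    ckpt_keys = ["enc.conv_in.weight", "decoder.conv_out.weight"]
    print(diff_state_dict_keys(model_keys, ckpt_keys))
```

In PyTorch the same information is available without a separate helper: `module.load_state_dict(state, strict=False)` returns an object exposing `missing_keys` and `unexpected_keys`, so a loader can check that both are empty before a smoke test proceeds.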