| # TRUE-TERNARY-REFACTOR11 |
|
|
| Date: 2026-05-19 |
|
|
| ## Scope |
|
|
| Final readiness pass for the platform-style ARB tree: |
|
|
| - Hit the exact 1.5B logical ternary weight target. |
| - Keep imported DINOv2/Moonshine sidecars frozen and int8-quantized. |
| - Make graph/VQ/MoE wiring consistent with the current config. |
| - Repair strict ternary training, standalone modality training, and LoRA finetuning entrypoints. |
| - Verify CUDA kernel paths and smoke-test training without restoring hidden FP master weights. |
|
|
| ## Architecture Changes |
|
|
| - `CODEBOOK_SIZE` is now `34108`. |
| - `MOE_SHARED_INTER` is now `21216`. |
| - Full multimodal ARB audit now lands at exactly `1,500,000,000` logical ternary weights. |
| - The exact count includes MoE shared-width RMS ternary state; the final VQ codebook size compensates for that extra per-width ternary state. |
| - `ARBModel` now wires graph vocab size from `MultimodalVQBridge.total_codebook_size` instead of the old hardcoded `16384`. |
| - MemGram and ConvVQ now use config constants instead of duplicated literals. |
| - Supervised calls with `targets` always use `ByteHead`, even under `model.eval()`, so eval loss cannot accidentally route into video/audio heads. |
|
|
| ## Kernel And Runtime Changes |
|
|
| - Large `TernaryGraph` instances now use an active-code path when `total_vocab_size > 4096`. |
|   - This avoids projecting every VQ codebook node on each forward pass. The full-graph path keeps the existing Triton aggregate/gather kernels, which are still used for small and full-graph tests. |
| - The 32-expert/top-4 MoE now disables dense all-expert dispatch (`dense_dispatch_max_tokens=0`) because the dense path is too expensive at 1.5B scale. |
| - `TernaryScaleTensor.forward()` now forces a detached CUDA input to require grad while grad mode is active. This lets ternary weights that sit downstream of frozen sidecars train correctly, even though the sidecar outputs arrive detached. |
| - `VideoHead` initial latent now requires grad in training so `cross_attn_q` can receive ternary gradient signals. |
| - `TalkerHead.token_logits()` was added so audio training can use cross entropy on logits instead of non-differentiable argmax tokens. |
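
The `TalkerHead.token_logits()` change matters because argmax is piecewise constant: a small change to the logits usually leaves the selected token unchanged, so no useful gradient flows back. The following is a minimal pure-Python sketch of that difference, not the repository's implementation; `cross_entropy_from_logits` and `argmax_token` are illustrative names.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def argmax_token(logits):
    """Index of the largest logit (the hard token choice)."""
    return max(range(len(logits)), key=logits.__getitem__)

logits = [2.0, 1.0, 0.1]
nudged = [2.0, 1.2, 0.1]  # raise the logit of target token 1 slightly

# The hard token choice does not move, so it carries no training signal...
same_token = argmax_token(logits) == argmax_token(nudged)

# ...while the cross-entropy loss responds smoothly to the same nudge.
loss_before = cross_entropy_from_logits(logits, target=1)
loss_after = cross_entropy_from_logits(nudged, target=1)
```

Here `same_token` is true while `loss_after` is lower than `loss_before`, which is the property that makes logits usable as a training target interface.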
|
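
The active-code idea described above can be sketched as follows: instead of projecting all graph nodes, collect the codes that actually occur in the batch and remap them to a compact index range. This is a hedged illustration only. The threshold `4096` comes from the notes; `use_active_code_path` and `compact_active_codes` are hypothetical helpers, and the real `TernaryGraph` path runs on CUDA tensors rather than Python lists.

```python
ACTIVE_CODE_THRESHOLD = 4096  # threshold quoted in the notes

def use_active_code_path(total_vocab_size):
    """Large graphs take the active-code path; small ones keep the full path."""
    return total_vocab_size > ACTIVE_CODE_THRESHOLD

def compact_active_codes(code_ids):
    """Return (active, local_ids).

    active    : sorted unique codes present in the batch
    local_ids : each input code remapped to its index within `active`
    """
    active = sorted(set(code_ids))
    index_of = {code: i for i, code in enumerate(active)}
    local_ids = [index_of[code] for code in code_ids]
    return active, local_ids

batch_codes = [58001, 17, 17, 34107, 58001]
active, local_ids = compact_active_codes(batch_codes)
```

With this batch only three nodes need projecting rather than the full 58684-entry graph vocabulary reported in the audit.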
|
| ## Training Fixes |
|
|
| - Replaced stale `arbitor/train.py` with a current ARB trainer: |
|   - strict ternary updates by default, |
|   - no bitsandbytes dependency, |
|   - correct trigram target alignment (`x[:, 3:]`), |
|   - optional `--no-save` for smoke tests, |
|   - explicit sidecar/modal flags, |
|   - no optimizer when there are no trainable float params. |
| - `training/text.py` now always runs ternary state updates after backward. |
| - `training/audio.py` now trains `AudioSequencer -> TalkerHead.token_logits()` against `AudioVQEncoder` targets. |
| - `training/vision.py` now uses VQ commitment to train image-side ternary projection state and avoids building the full MoE when it is not used. |
| - `training/diffusion.py` now feeds relational tokens directly into `VideoHead` and avoids the unused full MoE. |
| - LoRA finetuners now create checkpoint directories, use trigram-aligned targets, and expose `--max-moe-iters`, which defaults to `1` iteration for local 8 GB runs. |
| - Audio input normalization now accepts `[T]`, `[B, T]`, and `[B, 1, T]`. |
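
The trigram target alignment mentioned above (`x[:, 3:]`) can be illustrated with plain lists standing in for tensors. This sketch assumes that the prediction at position `t` is scored against the byte at position `t + 3`, and that the last three positions are dropped so both sides have equal length. Only the `x[:, 3:]` slice is taken from the notes; the `x[:, :-3]` counterpart and the helper name are assumptions.

```python
TRIGRAM_OFFSET = 3

def align_trigram_targets(batch, offset=TRIGRAM_OFFSET):
    """Split each row into (prediction positions, targets shifted by `offset`)."""
    if any(len(row) <= offset for row in batch):
        raise ValueError("every sequence must be longer than the offset")
    inputs = [row[:-offset] for row in batch]   # analogue of x[:, :-3] (assumed)
    targets = [row[offset:] for row in batch]   # analogue of x[:, 3:] (from the notes)
    return inputs, targets

batch = [list(b"ternary!")]
inputs, targets = align_trigram_targets(batch)
```

For the eight-byte example, both `inputs[0]` and `targets[0]` have five entries, and `targets[0]` starts three bytes into the original sequence.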
|
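
The three accepted audio layouts can be reduced to one canonical shape. The sketch below works on shape tuples only and assumes the canonical layout is `[B, T]`, which the notes do not state; `normalize_audio_shape` is a hypothetical helper, not a function from the repository.

```python
def normalize_audio_shape(shape):
    """Map an accepted waveform shape to (batch, samples)."""
    shape = tuple(shape)
    if len(shape) == 1:                      # [T]: a single unbatched clip
        return (1, shape[0])
    if len(shape) == 2:                      # [B, T]: already canonical
        return shape
    if len(shape) == 3 and shape[1] == 1:    # [B, 1, T]: drop the mono channel axis
        return (shape[0], shape[2])
    raise ValueError(f"unsupported audio shape: {shape}")
```

Rejecting anything else (for example a stereo `[B, 2, T]` input) keeps a shape mistake from turning into a silent reshape.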
|
| ## Verification |
|
|
| Passed: |
|
|
| - `python -m compileall -q arbitor training testing/model/test_arb.py testing/test_tscale.py` |
| - `python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"` |
| - Custom CUDA kernel smoke for graph aggregate, graph gather/add, MoE dense combine, and video denoise. |
| - Full multimodal audit: |
|   - logical ternary weights: `1,500,000,000` |
|   - ternary training state: `1956.05 MB` |
|   - trainable float params: `0` |
|   - frozen sidecar params: `318.80 MB` |
|   - graph vocab: `58684` |
|   - DINOv2 int8: `True` |
|   - Moonshine int8: `True` |
| - Active graph CUDA train smoke passed with VQ+graph and no MoE. |
| - `arbitor.train` strict ternary CUDA smoke passed with 1.5B MoE path. |
| - Standalone modality smokes passed: |
|   - `training/audio.py --steps 1 --batch 1` |
|   - `training/vision.py --steps 1 --batch 1` |
|   - `training/diffusion.py --steps 1 --batch 1` |
| - LoRA text finetune smoke passed: |
|   - `training/finetuning/text.py --steps 1 --batch 1 --accum 1 --ctx 4 --lora-rank 1 --max-moe-iters 1` |
| - pig-vae load now passes after installing `diffusers` into the user site. |
|   - The local `.safetensors` checkpoint must load through `AutoencoderKLWan.from_single_file`; a direct `load_state_dict(strict=False)` reported 194 missing and 194 unexpected keys and silently left the VAE weights randomly initialized. |
| - pig-vae int8 load smoke: |
|   - inner module: `AutoencoderKLWan` |
|   - quantized int8: `True` |
|   - trainable float params: `0` |
| - pig-vae encode/decode smoke: |
|   - input video: `[1, 3, 4, 64, 64]` |
|   - latents: `[1, 16, 1, 8, 8]` |
|   - reconstruction: `[1, 3, 1, 64, 64]` |
|
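
The pig-vae smoke shapes are consistent with 8x spatial compression and 4x causal temporal compression in which the first frame is encoded on its own, which would explain why a 4-frame input decodes to a single frame. The formulas below are inferred from the reported shapes, not read from the code, so treat the constants and helper names as assumptions.

```python
SPATIAL_FACTOR = 8     # 64 -> 8 in the reported latents
TEMPORAL_FACTOR = 4    # assumed causal temporal stride
LATENT_CHANNELS = 16   # channel count of the reported latents

def latent_shape(video_shape):
    """[B, C, T, H, W] video -> latent shape under the assumed compression."""
    b, _c, t, h, w = video_shape
    return [b, LATENT_CHANNELS,
            1 + (t - 1) // TEMPORAL_FACTOR,   # first frame stands alone
            h // SPATIAL_FACTOR, w // SPATIAL_FACTOR]

def reconstruction_shape(latent, out_channels=3):
    """Latent shape -> decoded video shape under the same assumption."""
    b, _c, t, h, w = latent
    return [b, out_channels,
            1 + (t - 1) * TEMPORAL_FACTOR,
            h * SPATIAL_FACTOR, w * SPATIAL_FACTOR]

latents = latent_shape([1, 3, 4, 64, 64])
recon = reconstruction_shape(latents)
```

Under this assumption, a 5-frame input would be the shortest clip that yields two latent frames.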
|
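
The silent pig-vae failure noted above is the usual hazard of a non-strict state-dict load: mismatched key names are skipped rather than reported as an error. A guard along the following lines makes the mismatch loud. It is a pure-Python sketch over key names; the helper names and the example keys are made up for illustration.

```python
def key_mismatch(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, mirroring a non-strict load report."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # model params the checkpoint lacks
    unexpected = sorted(checkpoint_keys - model_keys)   # checkpoint entries the model ignores
    return missing, unexpected

def require_full_match(model_keys, checkpoint_keys):
    """Raise instead of silently keeping randomly initialized weights."""
    missing, unexpected = key_mismatch(model_keys, checkpoint_keys)
    if missing or unexpected:
        raise RuntimeError(
            f"{len(missing)} missing and {len(unexpected)} unexpected keys; "
            "the checkpoint layout probably differs from the module layout"
        )

# Hypothetical key names showing a naming-scheme mismatch.
model_keys = ["encoder.conv_in.weight", "decoder.conv_out.weight"]
ckpt_keys = ["encoder.conv1.weight", "decoder.head.weight"]
missing, unexpected = key_mismatch(model_keys, ckpt_keys)
```

In PyTorch, `load_state_dict(strict=False)` returns an object with `missing_keys` and `unexpected_keys`, so the same check can be applied to its return value. Equal, non-zero counts on both sides, as in the 194/194 case, usually point to a renamed layout rather than a partial checkpoint.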
| ## Remaining Constraints |
|
|
| - Full 1.5B MoE training is functionally correct but still slow on an RTX 4060-class GPU. One strict smoke step with MoE took about 77 seconds after the sparse-dispatch change. |
| - The active graph path is the practical path for the 58k VQ vocabulary. A future native CUDA graph kernel should fuse active-node projection, neighbor selection, hop update, and pooling. |
| - LoRA finetuning uses float adapter parameters by design; strict base ternary state remains frozen in that path. |
| - Diffusers emits a non-fatal warning about missing `torchao` tensor support in this environment. The local safetensors pig-vae checkpoint loaded and ran without torchao. |
| - `pyproject.toml` now requires `diffusers>=0.38.0` for the `diffusers` and `video` extras because older versions may not expose `AutoencoderKLWan`. |
|
|