TRUE-TERNARY-REFACTOR11

Date: 2026-05-19

Scope

Final readiness pass for the platform-style ARB tree:

Hit the exact 1.5B logical ternary weight target.
Keep imported DINOv2/Moonshine sidecars frozen and int8-quantized.
Make graph/VQ/MoE wiring consistent with the current config.
Repair strict ternary training, standalone modality training, and LoRA finetuning entrypoints.
Verify CUDA kernel paths and smoke-test training without restoring hidden FP master weights.

Architecture Changes

CODEBOOK_SIZE is now 34108.
MOE_SHARED_INTER is now 21216.
Full multimodal ARB audit now lands at exactly 1,500,000,000 logical ternary weights.
The exact count includes MoE shared-width RMS ternary state; the final VQ codebook size compensates for that extra per-width ternary state.
ARBModel now wires graph vocab size from MultimodalVQBridge.total_codebook_size instead of the old hardcoded 16384.
MemGram and ConvVQ now use config constants instead of duplicated literals.
Supervised calls with targets always use ByteHead, even under model.eval(), so eval loss cannot accidentally route into video/audio heads.

Kernel And Runtime Changes

Large TernaryGraph instances now use an active-code path when total_vocab_size > 4096.
This avoids projecting every VQ codebook node on each forward pass. The full-graph path keeps the existing Triton aggregate/gather kernels and is still used for small-graph and full-graph tests.
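A rough sketch of the active-code idea, not the actual TernaryGraph code: only code ids that occur in the current batch are selected for projection once the vocabulary exceeds the threshold. The function name and the constant name are hypothetical; only the 4096 threshold comes from the note above.

```python
ACTIVE_CODE_THRESHOLD = 4096  # threshold from the note above; constant name is an assumption


def select_nodes_to_project(batch_codes, total_vocab_size):
    """Return the sorted node ids that need projecting for this forward pass."""
    if total_vocab_size <= ACTIVE_CODE_THRESHOLD:
        # Small graph: keep the full path and project every node.
        return list(range(total_vocab_size))
    # Large graph: only codes that actually occur in the batch are projected.
    active = {code for row in batch_codes for code in row}
    return sorted(active)


batch = [[5, 17, 5, 40000], [17, 9, 58683, 9]]
large = select_nodes_to_project(batch, total_vocab_size=58684)
small = select_nodes_to_project([[0, 1]], total_vocab_size=16)
print(large)       # [5, 9, 17, 40000, 58683]
print(len(small))  # 16
```

With a 58684-entry vocabulary, the per-step projection cost in this sketch scales with the number of distinct codes in the batch rather than with the vocabulary size.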
The 32-expert/top-4 MoE now disables dense all-expert dispatch (dense_dispatch_max_tokens=0) because the dense path is too expensive at 1.5B scale.
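For illustration only, a minimal top-4-of-32 routing sketch. It assumes the gate weights are a softmax over the selected experts and that dense dispatch is used only when the token count is at most dense_dispatch_max_tokens; neither detail is confirmed by this note, and both function names are hypothetical.

```python
import math


def top_k_route(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    weights = [math.exp(router_logits[e] - m) for e in top]
    total = sum(weights)
    return [(e, w / total) for e, w in zip(top, weights)]


def use_dense_dispatch(num_tokens, dense_dispatch_max_tokens=0):
    """Assumed semantics: dense all-expert dispatch only for small token counts."""
    return 0 < num_tokens <= dense_dispatch_max_tokens


router_logits = [0.1 * i for i in range(32)]  # toy scores for 32 experts
routes = top_k_route(router_logits, k=4)
print([e for e, _ in routes])  # [31, 30, 29, 28]
print(use_dense_dispatch(1))   # False: a limit of 0 always takes the sparse path
```

Under these assumptions each token touches 4 of the 32 experts, so expert compute per token is one eighth of the dense path.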
TernaryScaleTensor.forward() now forces a detached CUDA input to require grad while grad mode is active, so ternary weights that sit downstream of frozen sidecars still receive gradients.
The VideoHead initial latent now requires grad during training so that cross_attn_q receives ternary gradient signals.
TalkerHead.token_logits() was added so audio training can use cross entropy on logits instead of non-differentiable argmax tokens.
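The reason logits are needed can be shown with a small stdlib sketch (this is not the TalkerHead code; the logits and target are toy values): nudging a logit leaves the argmax token unchanged, so any loss built on it has zero slope, while cross entropy on the logits responds smoothly.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def argmax(logits):
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 0.5, -1.0]
target = 1
eps = 1e-3
nudged = [2.0, 0.5 + eps, -1.0]

# The argmax token does not move under a small nudge: no gradient signal.
argmax_changed = argmax(logits) != argmax(nudged)

# Cross entropy has a finite-difference slope close to softmax(logits)[target] - 1.
slope = (cross_entropy(nudged, target) - cross_entropy(logits, target)) / eps
print(argmax_changed, round(slope, 2))  # False -0.82
```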

Training Fixes

Replaced stale arbitor/train.py with a current ARB trainer:
- strict ternary updates by default,
- no bitsandbytes dependency,
- correct trigram target alignment (x[:, 3:]),
- optional --no-save for smoke tests,
- explicit sidecar/modal flags,
- no optimizer when there are no trainable float params.
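A list-based sketch of the trigram alignment, under the assumption that each position consumes a 3-token window and predicts the token that follows it (which is what makes x[:, 3:] the target slice). The helper name is hypothetical, and the real trainer operates on tensors rather than Python lists.

```python
def trigram_pairs(row):
    """Pair each 3-token window with the token that follows it."""
    targets = row[3:]  # list analogue of x[:, 3:] for one batch row
    windows = [tuple(row[i:i + 3]) for i in range(len(targets))]
    return list(zip(windows, targets))


row = [10, 11, 12, 13, 14, 15]
pairs = trigram_pairs(row)
print(pairs)
# [((10, 11, 12), 13), ((11, 12, 13), 14), ((12, 13, 14), 15)]
```

A sequence of length n yields n - 3 supervised positions in this reading; --ctx 4 would therefore leave a single target per row.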
training/text.py now always runs ternary state updates after backward.
training/audio.py now trains AudioSequencer -> TalkerHead.token_logits() against AudioVQEncoder targets.
training/vision.py now uses VQ commitment to train image-side ternary projection state and avoids building the full MoE when it is not used.
training/diffusion.py now feeds relational tokens directly into VideoHead and avoids the unused full MoE.
LoRA finetuners now create checkpoint directories, use trigram-aligned targets, expose --max-moe-iters, and default to 1 iteration for local 8GB runs.
Audio input normalization now accepts [T], [B, T], and [B, 1, T].
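The accepted audio layouts can be summarized with a shape-only sketch. The canonical (B, T) output and the function name are assumptions made for illustration; the note above only states which input shapes are accepted.

```python
def normalized_audio_shape(shape):
    """Map an accepted waveform shape to (B, T); reject anything else."""
    if len(shape) == 1:                    # [T] -> single-item batch
        return (1, shape[0])
    if len(shape) == 2:                    # [B, T] is already canonical here
        return (shape[0], shape[1])
    if len(shape) == 3 and shape[1] == 1:  # [B, 1, T] -> drop the mono channel axis
        return (shape[0], shape[2])
    raise ValueError(f"unsupported audio shape: {shape}")


print(normalized_audio_shape((16000,)))       # (1, 16000)
print(normalized_audio_shape((2, 16000)))     # (2, 16000)
print(normalized_audio_shape((2, 1, 16000)))  # (2, 16000)
```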

Verification

Passed:

python -m compileall -q arbitor training testing/model/test_arb.py testing/test_tscale.py
python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"
Custom CUDA kernel smoke for graph aggregate, graph gather/add, MoE dense combine, and video denoise.
Full multimodal audit:
- logical ternary weights: 1,500,000,000
- ternary training state: 1956.05 MB
- trainable float params: 0
- frozen sidecar params: 318.80 MB
- graph vocab: 58684
- DINOv2 int8: True
- Moonshine int8: True
Active graph CUDA train smoke passed with VQ+graph and no MoE.
arbitor.train strict ternary CUDA smoke passed with 1.5B MoE path.
Standalone modality smokes passed:
- training/audio.py --steps 1 --batch 1
- training/vision.py --steps 1 --batch 1
- training/diffusion.py --steps 1 --batch 1
LoRA text finetune smoke passed:
- training/finetuning/text.py --steps 1 --batch 1 --accum 1 --ctx 4 --lora-rank 1 --max-moe-iters 1
pig-vae load now passes after installing diffusers into the user site-packages.
The local .safetensors checkpoint must load through AutoencoderKLWan.from_single_file. A direct load_state_dict(strict=False) reported 194 missing and 194 unexpected keys and silently left the VAE with random weights.
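A guard along these lines would have surfaced the mismatch instead of letting it pass silently. This is a sketch, not code from the repository: LoadResult is a stand-in for the (missing_keys, unexpected_keys) result that torch.nn.Module.load_state_dict returns, used so the example runs without torch; the helper name and the placeholder key names are hypothetical.

```python
from collections import namedtuple

# Stand-in for the load_state_dict return value (same two field names).
LoadResult = namedtuple("LoadResult", ["missing_keys", "unexpected_keys"])


def assert_clean_load(result, what="checkpoint"):
    """Raise instead of silently continuing with partially loaded weights."""
    if result.missing_keys or result.unexpected_keys:
        raise RuntimeError(
            f"{what}: {len(result.missing_keys)} missing and "
            f"{len(result.unexpected_keys)} unexpected keys"
        )


# Placeholder key names; only the 194/194 counts come from the note above.
bad = LoadResult([f"a.{i}" for i in range(194)], [f"b.{i}" for i in range(194)])
try:
    assert_clean_load(bad, what="pig-vae")
    message = ""
except RuntimeError as err:
    message = str(err)
print(message)  # pig-vae: 194 missing and 194 unexpected keys
```

Equal missing and unexpected counts usually point to a key-naming mismatch between checkpoint formats rather than absent weights, which is consistent with the single-file loader succeeding on the same file.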
pig-vae int8 load smoke:
- inner module: AutoencoderKLWan
- quantized int8: True
- trainable float params: 0
pig-vae encode/decode smoke:
- input video: [1, 3, 4, 64, 64]
- latents: [1, 16, 1, 8, 8]
- reconstruction: [1, 3, 1, 64, 64]
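The shapes above are consistent with 8x spatial compression and a causal 4x temporal scheme in which T input frames map to 1 + (T - 1) // 4 latent frames. The sketch below encodes that reading as plain arithmetic; the factors and formulas are assumptions inferred from the smoke shapes, not values read from the checkpoint config.

```python
SPATIAL_FACTOR = 8    # assumed: 64 -> 8 in the smoke above
TEMPORAL_FACTOR = 4   # assumed causal temporal compression
LATENT_CHANNELS = 16  # matches the latent shape above


def latent_shape(video_shape):
    """[B, C, T, H, W] video shape -> assumed latent shape."""
    b, _, t, h, w = video_shape
    t_lat = 1 + (t - 1) // TEMPORAL_FACTOR
    return (b, LATENT_CHANNELS, t_lat, h // SPATIAL_FACTOR, w // SPATIAL_FACTOR)


def decoded_shape(lat_shape, out_channels=3):
    """Assumed latent shape -> reconstructed [B, C, T, H, W] shape."""
    b, _, t_lat, h, w = lat_shape
    t = 1 + (t_lat - 1) * TEMPORAL_FACTOR
    return (b, out_channels, t, h * SPATIAL_FACTOR, w * SPATIAL_FACTOR)


lat = latent_shape((1, 3, 4, 64, 64))
rec = decoded_shape(lat)
print(lat)  # (1, 16, 1, 8, 8)
print(rec)  # (1, 3, 1, 64, 64)
```

Under this reading a 4-frame input collapses to one latent frame and decodes to one frame, which matches the smoke; a 5-frame input would give two latent frames and decode back to five.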

Remaining Constraints

Full 1.5B MoE training is functionally correct but still slow on an RTX 4060-class GPU. One strict smoke step with MoE took about 77 seconds after the sparse-dispatch change.
The active graph path is the practical path for the 58k VQ vocabulary. A future native CUDA graph kernel should fuse active-node projection, neighbor selection, hop update, and pooling.
LoRA finetuning uses float adapter parameters by design; strict base ternary state remains frozen in that path.
Diffusers emits a non-fatal warning about missing torchao tensor support in this environment. The local safetensors pig-vae checkpoint loaded and ran without torchao.
pyproject.toml now requires diffusers>=0.38.0 for the diffusers and video extras because older versions may not expose AutoencoderKLWan.