TRUE-TERNARY-REFACTOR11
Date: 2026-05-19
Scope
Final readiness pass for the platform-style ARB tree:
- Hit the exact 1.5B logical ternary weight target.
- Keep imported DINOv2/Moonshine sidecars frozen and int8-quantized.
- Make graph/VQ/MoE wiring consistent with the current config.
- Repair strict ternary training, standalone modality training, and LoRA finetuning entrypoints.
- Verify CUDA kernel paths and smoke-test training without restoring hidden FP master weights.
Architecture Changes
- CODEBOOK_SIZE is now 34108.
- MOE_SHARED_INTER is now 21216.
- Full multimodal ARB audit now lands at exactly 1,500,000,000 logical ternary weights.
- The exact count includes MoE shared-width RMS ternary state; the final VQ codebook size compensates for that extra per-width ternary state.
- ARBModel now wires graph vocab size from MultimodalVQBridge.total_codebook_size instead of the old hardcoded 16384.
- MemGram and ConvVQ now use config constants instead of duplicated literals.
- Supervised calls with targets always use ByteHead, even under model.eval(), so eval loss cannot accidentally route into video/audio heads.
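The graph vocab wiring change can be illustrated with a small sketch. BridgeSketch, GraphSketch, and build_graph are hypothetical stand-ins, not the ARB classes, and the per-modality sizes are toy values. The sketch also assumes the total is a plain sum of per-modality codebooks, which these notes do not confirm.

```python
from dataclasses import dataclass, field


@dataclass
class BridgeSketch:
    """Hypothetical stand-in for MultimodalVQBridge; sizes below are toy values."""
    codebook_sizes: dict = field(
        default_factory=lambda: {"image": 8, "audio": 4, "video": 4}
    )

    @property
    def total_codebook_size(self) -> int:
        # Assumption: the total is the sum of the per-modality codebooks.
        return sum(self.codebook_sizes.values())


@dataclass
class GraphSketch:
    """Hypothetical stand-in for the graph module; only the vocab size matters here."""
    vocab_size: int


def build_graph(bridge: BridgeSketch) -> GraphSketch:
    # Read the size from the bridge instead of repeating a literal such as 16384.
    return GraphSketch(vocab_size=bridge.total_codebook_size)


bridge = BridgeSketch()
graph = build_graph(bridge)
print(graph.vocab_size)  # 16 with the toy sizes above
```

Deriving the size in one place means a later codebook resize cannot leave the graph with a stale vocabulary.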
Kernel And Runtime Changes
- Large TernaryGraph instances now use an active-code path when total_vocab_size > 4096.
- This avoids projecting every VQ codebook node on each forward. The full graph still keeps the existing Triton aggregate/gather kernels for small/full graph tests.
- The 32-expert/top-4 MoE now disables dense all-expert dispatch (dense_dispatch_max_tokens=0) because the dense path is too expensive at 1.5B scale.
- TernaryScaleTensor.forward() now forces a detached CUDA input to require grad while grad mode is active. This lets ternary weights after frozen sidecars train correctly.
- VideoHead initial latent now requires grad in training so cross_attn_q can receive ternary gradient signals.
- TalkerHead.token_logits() was added so audio training can use cross entropy on logits instead of non-differentiable argmax tokens.
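The idea behind the active-code path can be sketched in plain Python. This is not the TernaryGraph implementation, which these notes do not show. The helper names and the toy projection are hypothetical; only the 4096 threshold comes from the notes.

```python
ACTIVE_CODE_THRESHOLD = 4096  # threshold quoted in the notes


def use_active_path(total_vocab_size: int) -> bool:
    """Pick the active-code path only for vocabularies above the threshold."""
    return total_vocab_size > ACTIVE_CODE_THRESHOLD


def project_active(codes, project):
    """Project each distinct code once, then map results back to batch order."""
    active = sorted(set(codes))  # codes that actually occur in this batch
    table = {code: project(code) for code in active}
    return [table[code] for code in codes], len(active)


calls = []


def toy_project(code):
    # Hypothetical projection; records each call so the saving is visible.
    calls.append(code)
    return code * 10


outputs, n_active = project_active([5, 2, 5, 9], toy_project)
print(outputs, n_active, len(calls))
```

The projection cost then scales with the number of distinct codes in the batch rather than with the full vocabulary, which is the motivation the notes give for a 58k-entry graph.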
Training Fixes
- Replaced stale arbitor/train.py with a current ARB trainer:
  - strict ternary updates by default,
  - no bitsandbytes dependency,
  - correct trigram target alignment (x[:, 3:]),
  - optional --no-save for smoke tests,
  - explicit sidecar/modal flags,
  - no optimizer when there are no trainable float params.
- training/text.py now always runs ternary state updates after backward.
- training/audio.py now trains AudioSequencer -> TalkerHead.token_logits() against AudioVQEncoder targets.
- training/vision.py now uses VQ commitment to train image-side ternary projection state and avoids building the full MoE when it is not used.
- training/diffusion.py now feeds relational tokens directly into VideoHead and avoids the unused full MoE.
- LoRA finetuners now create checkpoint directories, use trigram-aligned targets, expose --max-moe-iters, and default to 1 iteration for local 8GB runs.
- Audio input normalization now accepts [T], [B, T], and [B, 1, T].
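The trigram target alignment can be shown on a single row. The sketch assumes that each window of three consecutive tokens predicts the token that follows it, which is one reading of the x[:, 3:] slice. The trigram_pairs helper is hypothetical and not part of the trainer.

```python
def trigram_pairs(seq):
    """Pair each 3-token window with the token that follows it."""
    # Window i covers seq[i], seq[i + 1], seq[i + 2].
    windows = [tuple(seq[i:i + 3]) for i in range(len(seq) - 3)]
    # Same slice as x[:, 3:], applied to one row of the batch.
    targets = seq[3:]
    return list(zip(windows, targets))


row = [10, 11, 12, 13, 14]
for window, target in trigram_pairs(row):
    print(window, "->", target)
```

Under this assumption a row of length T gives T - 3 training pairs, so the targets must be sliced from index 3 to match the number of windows.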
Verification
Passed:
- python -m compileall -q arbitor training testing/model/test_arb.py testing/test_tscale.py
- python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"
- Custom CUDA kernel smoke for graph aggregate, graph gather/add, MoE dense combine, and video denoise.
- Full multimodal audit:
  - logical ternary weights: 1,500,000,000
  - ternary training state: 1956.05 MB
  - trainable float params: 0
  - frozen sidecar params: 318.80 MB
  - graph vocab: 58684
  - DINOv2 int8: True
  - Moonshine int8: True
- Active graph CUDA train smoke passed with VQ+graph and no MoE.
- arbitor.train strict ternary CUDA smoke passed with 1.5B MoE path.
- Standalone modality smokes passed:
  - training/audio.py --steps 1 --batch 1
  - training/vision.py --steps 1 --batch 1
  - training/diffusion.py --steps 1 --batch 1
- LoRA text finetune smoke passed:
  - training/finetuning/text.py --steps 1 --batch 1 --accum 1 --ctx 4 --lora-rank 1 --max-moe-iters 1
- pig-vae load now passes after installing diffusers into the user site.
- The local .safetensors checkpoint must load through AutoencoderKLWan.from_single_file; direct load_state_dict(strict=False) had 194 missing and 194 unexpected keys and was silently leaving random VAE weights.
- pig-vae int8 load smoke:
  - inner module: AutoencoderKLWan
  - quantized int8: True
  - trainable float params: 0
- pig-vae encode/decode smoke:
  - input video: [1, 3, 4, 64, 64]
  - latents: [1, 16, 1, 8, 8]
  - reconstruction: [1, 3, 1, 64, 64]
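The silent pig-vae failure described above suggests a guard around non-strict loads. The sketch below is hypothetical and not part of the repository. It relies only on the fact that PyTorch's load_state_dict returns an object with missing_keys and unexpected_keys fields; a namedtuple stands in for that result so the example runs without torch.

```python
from collections import namedtuple

# Mirrors the two fields on the result of torch's load_state_dict(strict=False).
LoadResult = namedtuple("LoadResult", ["missing_keys", "unexpected_keys"])


def assert_clean_load(result, name="model"):
    """Raise if a non-strict load skipped or ignored any keys."""
    missing = list(result.missing_keys)
    unexpected = list(result.unexpected_keys)
    if missing or unexpected:
        raise RuntimeError(
            f"{name}: {len(missing)} missing and {len(unexpected)} unexpected keys; "
            "some weights may still be at their random initialization"
        )


# A clean load passes silently.
assert_clean_load(LoadResult([], []), name="toy")

# A mismatched load fails loudly instead of leaving random weights in place.
try:
    assert_clean_load(LoadResult(["encoder.w"], ["enc.w"]), name="toy")
except RuntimeError as err:
    message = str(err)
    print(message)
```

With PyTorch available, the same helper could wrap a real call, for example assert_clean_load(module.load_state_dict(state, strict=False), "pig-vae"), so that a key-layout mismatch like the 194/194 case stops the run.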
Remaining Constraints
- Full 1.5B MoE training is functionally correct but still slow on an RTX 4060-class GPU. One strict smoke step with MoE took about 77 seconds after the sparse-dispatch change.
- The active graph path is the practical path for the 58k VQ vocabulary. A future native CUDA graph kernel should fuse active-node projection, neighbor selection, hop update, and pooling.
- LoRA finetuning uses float adapter parameters by design; strict base ternary state remains frozen in that path.
- Diffusers emits a non-fatal warning about missing torchao tensor support in this environment. The local safetensors pig-vae checkpoint loaded and ran without torchao.
- pyproject.toml now requires diffusers>=0.38.0 for the diffusers and video extras because older versions may not expose AutoencoderKLWan.
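Because the version floor exists only to guarantee that AutoencoderKLWan is exposed, a runtime feature check is one way to fail early in an environment with an older install. The exposes helper below is a hypothetical sketch, not code from the repository. It is demonstrated on the standard library so that it runs without diffusers.

```python
import importlib


def exposes(module_name: str, attr: str) -> bool:
    """Return True if the module imports and provides the named attribute."""
    try:
        module = importlib.import_module(module_name)
        getattr(module, attr)
    except (ImportError, AttributeError):
        # Missing package or missing name; other errors propagate on purpose.
        return False
    return True


# Standard-library examples so the sketch runs anywhere.
print(exposes("math", "sqrt"))                     # True
print(exposes("math", "not_a_real_name"))          # False
print(exposes("module_that_does_not_exist", "x"))  # False
```

In the target environment the equivalent check would be exposes("diffusers", "AutoencoderKLWan"), run before any pig-vae load is attempted.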