TRUE-TERNARY-REFACTOR11

Date: 2026-05-19

Scope

Final readiness pass for the platform-style ARB tree:

  • Hit the exact 1.5B logical ternary weight target.
  • Keep imported DINOv2/Moonshine sidecars frozen and int8-quantized.
  • Make graph/VQ/MoE wiring consistent with the current config.
  • Repair strict ternary training, standalone modality training, and LoRA finetuning entrypoints.
  • Verify CUDA kernel paths and smoke-test training without restoring hidden FP master weights.

Architecture Changes

  • CODEBOOK_SIZE is now 34108.
  • MOE_SHARED_INTER is now 21216.
  • Full multimodal ARB audit now lands at exactly 1,500,000,000 logical ternary weights.
  • The count includes the MoE shared-width RMS ternary state; the final CODEBOOK_SIZE was chosen to offset that extra per-width state so the total lands exactly on target.
  • ARBModel now wires graph vocab size from MultimodalVQBridge.total_codebook_size instead of the old hardcoded 16384.
  • MemGram and ConvVQ now use config constants instead of duplicated literals.
  • Supervised calls with targets always use ByteHead, even under model.eval(), so eval loss cannot accidentally route into video/audio heads.
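
The head-routing rule in the last bullet can be sketched as a small decision function. This is a minimal illustration, not the ARBModel code: `select_head`, its parameters, and the modality-to-head mapping are hypothetical; only the head names and the "targets always mean ByteHead" rule come from this note.

```python
def select_head(has_targets: bool, training: bool, modality: str = "text") -> str:
    """Pick an output head name for one forward call (illustrative only)."""
    # Rule from the note: a supervised call (targets present) always uses
    # ByteHead. `training` is deliberately ignored here, so model.eval()
    # cannot change where the loss is computed.
    if has_targets:
        return "ByteHead"
    # Assumed fallback: without targets, route by the requested modality.
    heads = {"text": "ByteHead", "video": "VideoHead", "audio": "TalkerHead"}
    return heads[modality]
```

With this shape, an eval-loss call on a video batch still returns `"ByteHead"`, while an unsupervised video call returns `"VideoHead"`.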

Kernel And Runtime Changes

  • Large TernaryGraph instances now use an active-code path when total_vocab_size > 4096.
  • This avoids projecting every VQ codebook node on each forward. The full graph still keeps the existing Triton aggregate/gather kernels for small/full graph tests.
  • The 32-expert/top-4 MoE now disables dense all-expert dispatch (dense_dispatch_max_tokens=0) because the dense path is too expensive at 1.5B scale.
  • TernaryScaleTensor.forward() now forces a detached CUDA input to require grad while grad mode is active. This lets ternary weights after frozen sidecars train correctly.
  • VideoHead initial latent now requires grad in training so cross_attn_q can receive ternary gradient signals.
  • TalkerHead.token_logits() was added so audio training can use cross entropy on logits instead of non-differentiable argmax tokens.
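
The idea behind the active-code path can be sketched in plain Python: project only the codes that occur in the current batch instead of every codebook node. This is a simplified model under stated assumptions, not the TernaryGraph implementation; `project_nodes`, `table`, and `project` are hypothetical names, and only the 4096 threshold comes from this note.

```python
ACTIVE_CODE_THRESHOLD = 4096  # threshold stated in the note


def project_nodes(table, ids, project):
    """Return project(table[i]) for each code i in ids.

    table   -- list of per-code feature rows (one row per VQ code)
    ids     -- codes that actually occur in this batch
    project -- hypothetical per-row projection callable
    """
    if len(table) <= ACTIVE_CODE_THRESHOLD:
        # Small graph: project every node, then gather the requested rows.
        projected = [project(row) for row in table]
        return [projected[i] for i in ids]
    # Large graph: project each distinct active code once, then gather.
    active = {code: project(table[code]) for code in set(ids)}
    return [active[i] for i in ids]


calls = []  # counts how many rows were actually projected


def double(row):
    calls.append(1)
    return [2 * v for v in row]


table = [[float(i)] for i in range(58684)]  # graph vocab size from the audit
out = project_nodes(table, [7, 7, 42], double)
```

In this example the projection runs once per distinct active code rather than once per codebook entry, which is the saving the active path is after.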

Training Fixes

  • Replaced stale arbitor/train.py with a current ARB trainer:
    • strict ternary updates by default,
    • no bitsandbytes dependency,
    • correct trigram target alignment (x[:, 3:]),
    • optional --no-save for smoke tests,
    • explicit sidecar/modal flags,
    • no optimizer when there are no trainable float params.
  • training/text.py now always runs ternary state updates after backward.
  • training/audio.py now trains AudioSequencer -> TalkerHead.token_logits() against AudioVQEncoder targets.
  • training/vision.py now uses VQ commitment to train image-side ternary projection state and avoids building the full MoE when it is not used.
  • training/diffusion.py now feeds relational tokens directly into VideoHead and avoids the unused full MoE.
  • LoRA finetuners now create checkpoint directories, use trigram-aligned targets, expose --max-moe-iters, and default to 1 iteration for local 8GB runs.
  • Audio input normalization now accepts [T], [B, T], and [B, 1, T].
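
The trigram target alignment mentioned above can be illustrated on a single sequence. The sketch assumes that each window of three consecutive tokens predicts the token immediately after it, which is one reading of `x[:, 3:]`; `trigram_pairs` is a hypothetical helper, not part of the trainer.

```python
def trigram_pairs(x):
    """Pair each 3-token context with the token that follows it."""
    # Context t covers x[t], x[t+1], x[t+2]; its target is x[t+3].
    contexts = [tuple(x[t:t + 3]) for t in range(len(x) - 3)]
    targets = list(x[3:])  # 1-D analogue of x[:, 3:] on a batch
    return contexts, targets


contexts, targets = trigram_pairs([10, 20, 30, 40, 50])
```

Contexts and targets have the same length by construction, so a loss computed over them never compares a trigram against a token inside its own window.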

Verification

Passed:

  • python -m compileall -q arbitor training testing/model/test_arb.py testing/test_tscale.py
  • python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"
  • Custom CUDA kernel smoke for graph aggregate, graph gather/add, MoE dense combine, and video denoise.
  • Full multimodal audit:
    • logical ternary weights: 1,500,000,000
    • ternary training state: 1956.05 MB
    • trainable float params: 0
    • frozen sidecar params: 318.80 MB
    • graph vocab: 58684
    • DINOv2 int8: True
    • Moonshine int8: True
  • Active graph CUDA train smoke passed with VQ+graph and no MoE.
  • arbitor.train strict ternary CUDA smoke passed with 1.5B MoE path.
  • Standalone modality smokes passed:
    • training/audio.py --steps 1 --batch 1
    • training/vision.py --steps 1 --batch 1
    • training/diffusion.py --steps 1 --batch 1
  • LoRA text finetune smoke passed:
    • training/finetuning/text.py --steps 1 --batch 1 --accum 1 --ctx 4 --lora-rank 1 --max-moe-iters 1
  • pig-vae load now passes after installing diffusers into the user site.
  • The local .safetensors checkpoint must load through AutoencoderKLWan.from_single_file; direct load_state_dict(strict=False) reported 194 missing and 194 unexpected keys and silently left the VAE weights randomly initialized.
  • pig-vae int8 load smoke:
    • inner module: AutoencoderKLWan
    • quantized int8: True
    • trainable float params: 0
  • pig-vae encode/decode smoke:
    • input video: [1, 3, 4, 64, 64]
    • latents: [1, 16, 1, 8, 8]
    • reconstruction: [1, 3, 1, 64, 64]
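
The silent-failure mode described for `load_state_dict(strict=False)` can be guarded against by comparing key sets before trusting a load. In PyTorch, `load_state_dict` returns an object with `missing_keys` and `unexpected_keys`; the sketch below applies the same check to plain key lists so it runs without any dependencies. `check_state_dict_keys` is a hypothetical helper.

```python
def check_state_dict_keys(model_keys, checkpoint_keys):
    """Raise if a checkpoint's keys do not exactly match the module's keys."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # left at random init
    unexpected = sorted(checkpoint_keys - model_keys)   # silently ignored
    if missing or unexpected:
        raise RuntimeError(
            f"checkpoint does not match module: "
            f"{len(missing)} missing, {len(unexpected)} unexpected keys"
        )
    return True
```

An equal count of missing and unexpected keys, as in the 194/194 case, is typically a sign that the checkpoint uses a different key-naming layout than the module rather than being incomplete.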

Remaining Constraints

  • Full 1.5B MoE training is functionally correct but still slow on an RTX 4060-class GPU. One strict smoke step with MoE took about 77 seconds after the sparse-dispatch change.
  • The active graph path is the practical path for the 58k VQ vocabulary. A future native CUDA graph kernel should fuse active-node projection, neighbor selection, hop update, and pooling.
  • LoRA finetuning uses float adapter parameters by design; strict base ternary state remains frozen in that path.
  • Diffusers emits a non-fatal warning about missing torchao tensor support in this environment. The local safetensors pig-vae checkpoint loaded and ran without torchao.
  • pyproject.toml now requires diffusers>=0.38.0 for the diffusers and video extras because older versions may not expose AutoencoderKLWan.
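
Since older diffusers releases may not expose AutoencoderKLWan, a loader could fail early with a clear message instead of an opaque attribute error. The following is a minimal sketch using only `importlib`; `require_attr` and its `hint` parameter are hypothetical and not part of this repository.

```python
import importlib


def require_attr(module_name: str, attr: str, hint: str = ""):
    """Import a module and return one attribute, or raise a readable ImportError."""
    module = importlib.import_module(module_name)
    try:
        return getattr(module, attr)
    except AttributeError as exc:
        message = f"{module_name} does not expose {attr}. {hint}".strip()
        raise ImportError(message) from exc


# Intended use (requires diffusers to be installed, so it is left commented out):
# AutoencoderKLWan = require_attr(
#     "diffusers", "AutoencoderKLWan", "Install diffusers>=0.38.0."
# )
```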