# TRUE-TERNARY-REFACTOR11
Date: 2026-05-19
## Scope
Final readiness pass for the platform-style ARB tree:
- Hit the exact 1.5B logical ternary weight target.
- Keep imported DINOv2/Moonshine sidecars frozen and int8-quantized.
- Make graph/VQ/MoE wiring consistent with the current config.
- Repair strict ternary training, standalone modality training, and LoRA finetuning entrypoints.
- Verify CUDA kernel paths and smoke-test training without restoring hidden FP master weights.
## Architecture Changes
- `CODEBOOK_SIZE` is now `34108`.
- `MOE_SHARED_INTER` is now `21216`.
- Full multimodal ARB audit now lands at exactly `1,500,000,000` logical ternary weights.
- The exact count includes MoE shared-width RMS ternary state; the final VQ codebook size compensates for that extra per-width ternary state.
- `ARBModel` now wires graph vocab size from `MultimodalVQBridge.total_codebook_size` instead of the old hardcoded `16384`.
- MemGram and ConvVQ now use config constants instead of duplicated literals.
- Supervised calls with `targets` always use `ByteHead`, even under `model.eval()`, so eval loss cannot accidentally route into video/audio heads.
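
The head-routing rule in the last item can be sketched as a small selection function. This is a minimal illustration only: `select_head`, its parameters, and the string head names are hypothetical and are not taken from `ARBModel`.

```python
from typing import Optional, Sequence


def select_head(
    targets: Optional[Sequence[int]],
    training: bool,
    requested: str = "byte",
) -> str:
    """Pick an output head name (hypothetical helper, not the ARB API).

    `training` is accepted but deliberately ignored when targets are given:
    a supervised call routes to the byte head in both train and eval mode.
    """
    if targets is not None:
        return "byte"
    # Without targets, fall back to whichever head the caller asked for.
    return requested


# An eval-mode call with targets still resolves to the byte head.
eval_supervised = select_head([1, 2, 3], training=False, requested="video")
# A call without targets keeps the requested head.
eval_generation = select_head(None, training=False, requested="video")
```

Keying the decision on the presence of `targets` rather than on the train/eval flag is what keeps an eval loss from reaching the video or audio heads.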
## Kernel And Runtime Changes
- Large `TernaryGraph` instances now use an active-code path when `total_vocab_size > 4096`.
  - This avoids projecting every VQ codebook node on each forward pass. The full graph still keeps the existing Triton aggregate/gather kernels for small/full graph tests.
- The 32-expert/top-4 MoE now disables dense all-expert dispatch (`dense_dispatch_max_tokens=0`) because the dense path is too expensive at 1.5B scale.
- `TernaryScaleTensor.forward()` now forces a detached CUDA input to require grad while grad mode is active. This lets ternary weights after frozen sidecars train correctly.
- `VideoHead` initial latent now requires grad in training so `cross_attn_q` can receive ternary gradient signals.
- `TalkerHead.token_logits()` was added so audio training can use cross entropy on logits instead of non-differentiable argmax tokens.
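
The detached-input change to `TernaryScaleTensor.forward()` can be illustrated with plain PyTorch. The sketch below is an assumption about the general technique, not the project's code: `ensure_grad_input` is a hypothetical helper, and it runs on CPU where the note refers to CUDA inputs.

```python
import torch


def ensure_grad_input(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: turn a detached float input into a grad-requiring leaf.

    Does nothing when grad mode is off or the tensor already requires grad.
    """
    if torch.is_grad_enabled() and x.is_floating_point() and not x.requires_grad:
        # detach() shares storage; requires_grad_() marks the result as a leaf.
        x = x.detach().requires_grad_(True)
    return x


# Stand-in for a frozen sidecar: its output carries no autograd history.
frozen = torch.nn.Linear(4, 4)
with torch.no_grad():
    features = frozen(torch.randn(2, 4))

x = ensure_grad_input(features)
loss = (x * 2.0).sum()
loss.backward()  # backward now has a path that reaches x
```

Autograd skips the backward pass of an operation when none of its inputs require grad, so a layer that derives its updates inside a custom backward never sees a signal if it is fed a frozen, detached tensor. Marking the input as requiring grad restores that path.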
## Training Fixes
- Replaced stale `arbitor/train.py` with a current ARB trainer:
  - strict ternary updates by default,
  - no bitsandbytes dependency,
  - correct trigram target alignment (`x[:, 3:]`),
  - optional `--no-save` for smoke tests,
  - explicit sidecar/modal flags,
  - no optimizer when there are no trainable float params.
- `training/text.py` now always runs ternary state updates after backward.
- `training/audio.py` now trains `AudioSequencer -> TalkerHead.token_logits()` against `AudioVQEncoder` targets.
- `training/vision.py` now uses VQ commitment to train image-side ternary projection state and avoids building the full MoE when it is not used.
- `training/diffusion.py` now feeds relational tokens directly into `VideoHead` and avoids the unused full MoE.
- LoRA finetuners now create checkpoint directories, use trigram-aligned targets, expose `--max-moe-iters`, and default to `1` iteration for local 8GB runs.
- Audio input normalization now accepts `[T]`, `[B, T]`, and `[B, 1, T]`.
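
The audio fix above trains on logits rather than on argmax tokens. A minimal PyTorch sketch of why that matters follows; the tensor sizes, the random projection, and the random targets are illustrative stand-ins, not the shapes or modules used by `TalkerHead` or `AudioVQEncoder`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, steps, dim, codebook = 2, 5, 8, 16  # illustrative sizes only

hidden = torch.randn(batch, steps, dim, requires_grad=True)
proj = torch.randn(dim, codebook)  # stand-in for a token-logit projection

logits = hidden @ proj  # [B, T, K], differentiable
targets = torch.randint(0, codebook, (batch, steps))  # stand-in for VQ indices

# cross_entropy takes [N, K] logits and [N] integer class targets.
loss = F.cross_entropy(logits.reshape(-1, codebook), targets.reshape(-1))
loss.backward()

# argmax yields integer token ids with no autograd history, so a loss
# computed from them could not propagate back to `hidden`.
tokens = logits.argmax(dim=-1)
```

After `backward()`, `hidden.grad` is populated, while `tokens` is an integer tensor that does not require grad.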
## Verification
Passed:
- `python -m compileall -q arbitor training testing/model/test_arb.py testing/test_tscale.py`
- `python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"`
- Custom CUDA kernel smoke for graph aggregate, graph gather/add, MoE dense combine, and video denoise.
- Full multimodal audit:
  - logical ternary weights: `1,500,000,000`
  - ternary training state: `1956.05 MB`
  - trainable float params: `0`
  - frozen sidecar params: `318.80 MB`
  - graph vocab: `58684`
  - DINOv2 int8: `True`
  - Moonshine int8: `True`
- Active graph CUDA train smoke passed with VQ+graph and no MoE.
- `arbitor.train` strict ternary CUDA smoke passed with 1.5B MoE path.
- Standalone modality smokes passed:
  - `training/audio.py --steps 1 --batch 1`
  - `training/vision.py --steps 1 --batch 1`
  - `training/diffusion.py --steps 1 --batch 1`
- LoRA text finetune smoke passed:
  - `training/finetuning/text.py --steps 1 --batch 1 --accum 1 --ctx 4 --lora-rank 1 --max-moe-iters 1`
- pig-vae load now passes after installing `diffusers` into the user site.
  - The local `.safetensors` checkpoint must load through `AutoencoderKLWan.from_single_file`; direct `load_state_dict(strict=False)` reported 194 missing and 194 unexpected keys and silently left the VAE weights randomly initialized.
- pig-vae int8 load smoke:
  - inner module: `AutoencoderKLWan`
  - quantized int8: `True`
  - trainable float params: `0`
- pig-vae encode/decode smoke:
  - input video: `[1, 3, 4, 64, 64]`
  - latents: `[1, 16, 1, 8, 8]`
  - reconstruction: `[1, 3, 1, 64, 64]`
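
The pig-vae loading failure above is a general hazard of `load_state_dict(strict=False)`: key mismatches are returned, not raised. A minimal guard, assuming only standard PyTorch behavior, might look like the following; `load_checked` is a hypothetical helper and the `vae.` prefix is an invented mismatch for demonstration.

```python
from torch import nn


def load_checked(module: nn.Module, state: dict) -> None:
    """Load non-strictly, then fail loudly if any key did not line up."""
    result = module.load_state_dict(state, strict=False)
    if result.missing_keys or result.unexpected_keys:
        raise RuntimeError(
            f"{len(result.missing_keys)} missing, "
            f"{len(result.unexpected_keys)} unexpected keys"
        )


layer = nn.Linear(2, 2)
# Simulate a checkpoint whose key names do not match the module's names.
renamed = {f"vae.{name}": value for name, value in layer.state_dict().items()}

try:
    load_checked(layer, renamed)
    mismatch = None
except RuntimeError as err:
    mismatch = str(err)  # the layer kept its original weights
```

Equal, non-zero missing and unexpected counts, as in the 194/194 case, usually point to a naming-scheme mismatch rather than a truncated checkpoint.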
## Remaining Constraints
- Full 1.5B MoE training is functionally correct but still slow on an RTX 4060-class GPU. One strict smoke step with MoE took about 77 seconds after the sparse-dispatch change.
- The active graph path is the practical path for the 58k VQ vocabulary. A future native CUDA graph kernel should fuse active-node projection, neighbor selection, hop update, and pooling.
- LoRA finetuning uses float adapter parameters by design; strict base ternary state remains frozen in that path.
- Diffusers emits a non-fatal warning about missing `torchao` tensor support in this environment. The local safetensors pig-vae checkpoint loaded and ran without torchao.
- `pyproject.toml` now requires `diffusers>=0.38.0` for the `diffusers` and `video` extras because older versions may not expose `AutoencoderKLWan`.
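
A quick way to confirm that an environment satisfies the `diffusers` constraint is to probe for the class directly. This is a sketch only: `has_wan_vae` is a hypothetical helper and is not part of the repository.

```python
import importlib


def has_wan_vae() -> bool:
    """Return True if the installed diffusers package exposes AutoencoderKLWan."""
    try:
        diffusers = importlib.import_module("diffusers")
        return hasattr(diffusers, "AutoencoderKLWan")
    except Exception:
        # A missing package or a failed lazy import both count as unavailable.
        return False


wan_available = has_wan_vae()
```

Probing for the attribute instead of parsing a version string checks the exact capability the pig-vae loader depends on.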