	# TRUE-TERNARY-REFACTOR11

	Date: 2026-05-19

	## Scope

	Final readiness pass for the platform-style ARB tree:

	- Hit the exact 1.5B logical ternary weight target.
	- Keep imported DINOv2/Moonshine sidecars frozen and int8-quantized.
	- Make graph/VQ/MoE wiring consistent with the current config.
	- Repair strict ternary training, standalone modality training, and LoRA finetuning entrypoints.
	- Verify CUDA kernel paths and smoke-test training without restoring hidden FP master weights.

	## Architecture Changes

	- `CODEBOOK_SIZE` is now `34108`.
	- `MOE_SHARED_INTER` is now `21216`.
	- Full multimodal ARB audit now lands at exactly `1,500,000,000` logical ternary weights.
	- The exact count includes MoE shared-width RMS ternary state; the final VQ codebook size compensates for that extra per-width ternary state.
	- `ARBModel` now wires graph vocab size from `MultimodalVQBridge.total_codebook_size` instead of the old hardcoded `16384`.
	- MemGram and ConvVQ now use config constants instead of duplicated literals.
	- Supervised calls with `targets` always use `ByteHead`, even under `model.eval()`, so eval loss cannot accidentally route into video/audio heads.
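
The `targets` routing rule in the last item can be sketched as below. This is a minimal illustration and not the ARB implementation: `select_head`, its `requested_output` parameter, and the modality-to-head mapping are assumptions made for the example; only the head names `ByteHead`, `VideoHead`, and `TalkerHead` come from these notes.

```python
from typing import Optional, Sequence


def select_head(targets: Optional[Sequence[int]], requested_output: str = "text") -> str:
    """Return the name of the head a forward call would use (hypothetical helper)."""
    # Supervised call: targets are byte ids, so the loss has to come from ByteHead,
    # whatever the train/eval mode or the requested output modality.
    if targets is not None:
        return "ByteHead"
    # Unsupervised call: route by requested output (assumed mapping).
    heads = {"text": "ByteHead", "video": "VideoHead", "audio": "TalkerHead"}
    return heads[requested_output]
```

Note that the function takes no train/eval flag at all: the presence of `targets` alone decides the head, which is what keeps eval loss out of the video and audio heads.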

	## Kernel And Runtime Changes

	- Large `TernaryGraph` instances now use an active-code path when `total_vocab_size > 4096`.
	- This avoids projecting every VQ codebook node on each forward. The full graph still keeps the existing Triton aggregate/gather kernels for small/full graph tests.
	- The 32-expert/top-4 MoE now disables dense all-expert dispatch (`dense_dispatch_max_tokens=0`) because the dense path is too expensive at 1.5B scale.
	- `TernaryScaleTensor.forward()` now forces a detached CUDA input to require grad while grad mode is active. This lets ternary weights after frozen sidecars train correctly.
	- `VideoHead` initial latent now requires grad in training so `cross_attn_q` can receive ternary gradient signals.
	- `TalkerHead.token_logits()` was added so audio training can use cross entropy on logits instead of non-differentiable argmax tokens.
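
The idea behind the active-code path can be shown in plain Python. This is a sketch under assumptions, not the `TernaryGraph` code: `compact_active_codes`, `project_active`, and the `project` callback are hypothetical names, and the real path runs on tensors. Only the `4096` threshold is taken from the notes above.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")

ACTIVE_PATH_THRESHOLD = 4096  # from the notes: active path when total_vocab_size > 4096


def use_active_path(total_vocab_size: int) -> bool:
    """Decide whether the graph should take the active-code path."""
    return total_vocab_size > ACTIVE_PATH_THRESHOLD


def compact_active_codes(codes: List[int]) -> Tuple[List[int], List[int]]:
    """Map a batch of code ids onto a dense range covering only the ids in use.

    Returns (active, remapped): `active` holds the distinct ids in sorted order,
    and `remapped[i]` is the position of `codes[i]` inside `active`.
    """
    active = sorted(set(codes))
    position: Dict[int, int] = {code: i for i, code in enumerate(active)}
    remapped = [position[code] for code in codes]
    return active, remapped


def project_active(codes: List[int], project: Callable[[int], T]) -> List[T]:
    """Apply `project` once per distinct code, then scatter back to batch order."""
    active, remapped = compact_active_codes(codes)
    projected = [project(code) for code in active]  # work scales with len(active)
    return [projected[i] for i in remapped]
```

For a batch such as `[58000, 7, 7, 31000, 58000]`, only three nodes are projected instead of every node in the codebook, so the cost follows the number of distinct codes in the batch and not the vocabulary size.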

	## Training Fixes

	- Replaced stale `arbitor/train.py` with a current ARB trainer:
	  - strict ternary updates by default,
	  - no bitsandbytes dependency,
	  - correct trigram target alignment (`x[:, 3:]`),
	  - optional `--no-save` for smoke tests,
	  - explicit sidecar/modal flags,
	  - no optimizer when there are no trainable float params.
	- `training/text.py` now always runs ternary state updates after backward.
	- `training/audio.py` now trains `AudioSequencer -> TalkerHead.token_logits()` against `AudioVQEncoder` targets.
	- `training/vision.py` now uses VQ commitment to train image-side ternary projection state and avoids building the full MoE when it is not used.
	- `training/diffusion.py` now feeds relational tokens directly into `VideoHead` and avoids the unused full MoE.
	- LoRA finetuners now create checkpoint directories, use trigram-aligned targets, expose `--max-moe-iters`, and default to `1` iteration for local 8GB runs.
	- Audio input normalization now accepts `[T]`, `[B, T]`, and `[B, 1, T]`.
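
The trigram target alignment (`x[:, 3:]`) used by the trainer and the LoRA finetuners can be illustrated with plain lists. The sketch assumes that each position summarizes a three-byte window and is scored against the byte immediately after it; how the real trainer trims its logits is not stated in these notes, and `trigram_pairs` is a hypothetical helper.

```python
from typing import List, Tuple

TRIGRAM = 3  # window width implied by the `x[:, 3:]` target slice


def trigram_pairs(x: List[List[int]]) -> Tuple[List[List[Tuple[int, ...]]], List[List[int]]]:
    """Pair each 3-byte window with the byte that follows it.

    For a row of length T this yields T - 3 windows and T - 3 targets,
    so position i sees row[i:i+3] and is scored against row[i+3].
    """
    windows: List[List[Tuple[int, ...]]] = []
    targets: List[List[int]] = []
    for row in x:
        n = max(len(row) - TRIGRAM, 0)
        windows.append([tuple(row[i:i + TRIGRAM]) for i in range(n)])
        targets.append(row[TRIGRAM:])  # per-row equivalent of x[:, 3:]
    return windows, targets
```

With the row `[10, 11, 12, 13, 14]`, the windows are `(10, 11, 12)` and `(11, 12, 13)` and the targets are `13` and `14`. Slicing targets at any other offset would score a window against a byte it has already seen or skip one it should predict.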

	## Verification

	Passed:

	- `python -m compileall -q arbitor training testing/model/test_arb.py testing/test_tscale.py`
	- `python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"`
	- Custom CUDA kernel smoke for graph aggregate, graph gather/add, MoE dense combine, and video denoise.
	- Full multimodal audit:
	  - logical ternary weights: `1,500,000,000`
	  - ternary training state: `1956.05 MB`
	  - trainable float params: `0`
	  - frozen sidecar params: `318.80 MB`
	  - graph vocab: `58684`
	  - DINOv2 int8: `True`
	  - Moonshine int8: `True`
	- Active graph CUDA train smoke passed with VQ+graph and no MoE.
	- `arbitor.train` strict ternary CUDA smoke passed with 1.5B MoE path.
	- Standalone modality smokes passed:
	  - `training/audio.py --steps 1 --batch 1`
	  - `training/vision.py --steps 1 --batch 1`
	  - `training/diffusion.py --steps 1 --batch 1`
	- LoRA text finetune smoke passed:
	  - `training/finetuning/text.py --steps 1 --batch 1 --accum 1 --ctx 4 --lora-rank 1 --max-moe-iters 1`
	- pig-vae load now passes after installing `diffusers` into the user site.
	- The local `.safetensors` checkpoint must load through `AutoencoderKLWan.from_single_file`; direct `load_state_dict(strict=False)` had 194 missing and 194 unexpected keys and was silently leaving random VAE weights.
	- pig-vae int8 load smoke:
	  - inner module: `AutoencoderKLWan`
	  - quantized int8: `True`
	  - trainable float params: `0`
	- pig-vae encode/decode smoke:
	  - input video: `[1, 3, 4, 64, 64]`
	  - latents: `[1, 16, 1, 8, 8]`
	  - reconstruction: `[1, 3, 1, 64, 64]`
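
The silent pig-vae failure noted above (194 missing and 194 unexpected keys under `strict=False`) is the kind of mismatch a key-set check catches before any weights are used. PyTorch's `load_state_dict` reports `missing_keys` and `unexpected_keys` in its return value, but with `strict=False` it does not raise on them. The sketch below reproduces that bookkeeping on plain key names; the helper names and the example keys in the usage note are hypothetical.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) in the same sense PyTorch reports them."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_set - ckpt_set)     # model parameters the checkpoint never fills
    unexpected = sorted(ckpt_set - model_set)  # checkpoint entries with no slot in the model
    return missing, unexpected


def assert_fully_loaded(model_keys: Iterable[str], checkpoint_keys: Iterable[str]) -> None:
    """Fail loudly instead of leaving unmatched parameters at their random init."""
    missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
    if missing or unexpected:
        raise RuntimeError(
            f"{len(missing)} missing and {len(unexpected)} unexpected keys; "
            "the checkpoint layout does not match the module"
        )
```

For example, a model key `encoder.conv.weight` paired with a checkpoint key `enc.conv.weight` yields one missing and one unexpected entry. Equal counts on both sides, as in the 194/194 case, usually point to a naming-convention difference that a dedicated loader has to translate.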

	## Remaining Constraints

	- Full 1.5B MoE training is functionally correct but still slow on an RTX 4060-class GPU. One strict smoke step with MoE took about 77 seconds after the sparse-dispatch change.
	- The active graph path is the practical path for the 58k VQ vocabulary. A future native CUDA graph kernel should fuse active-node projection, neighbor selection, hop update, and pooling.
	- LoRA finetuning uses float adapter parameters by design; strict base ternary state remains frozen in that path.
	- Diffusers emits a non-fatal warning about missing `torchao` tensor support in this environment. The local safetensors pig-vae checkpoint loaded and ran without torchao.
	- `pyproject.toml` now requires `diffusers>=0.38.0` for the `diffusers` and `video` extras because older versions may not expose `AutoencoderKLWan`.
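
A runtime guard for the `diffusers>=0.38.0` floor could look like the sketch below, which uses only the standard library. The helper names are hypothetical and the version parsing is deliberately simplified: it compares only the leading numeric components, so a pre-release such as `0.38.0.dev0` is treated as `0.38.0`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

MIN_DIFFUSERS = "0.38.0"  # floor stated in the notes above


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, padded to three, e.g. '0.38' -> (0, 38, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_DIFFUSERS) -> bool:
    """True when the installed release is at or above the required floor."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_diffusers_version() -> Optional[str]:
    """Return the installed diffusers version, or None when it is not installed."""
    try:
        return metadata.version("diffusers")
    except metadata.PackageNotFoundError:
        return None
```

A caller would check `installed_diffusers_version()` for `None` first and then pass the result to `meets_minimum` before attempting to import `AutoencoderKLWan`.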