# TRUE-TERNARY-REFACTOR11

Date: 2026-05-19

## Scope

Final readiness pass for the platform-style ARB tree:

- Hit the exact 1.5B logical ternary weight target.
- Keep imported DINOv2/Moonshine sidecars frozen and int8-quantized.
- Make graph/VQ/MoE wiring consistent with the current config.
- Repair strict ternary training, standalone modality training, and LoRA finetuning entrypoints.
- Verify CUDA kernel paths and smoke-test training without restoring hidden FP master weights.

## Architecture Changes

- `CODEBOOK_SIZE` is now `34108`.
- `MOE_SHARED_INTER` is now `21216`.
- Full multimodal ARB audit now lands at exactly `1,500,000,000` logical ternary weights.
- The exact count includes the MoE shared-width RMS ternary state; the final VQ codebook size was chosen to compensate for that extra per-width state.
- `ARBModel` now wires graph vocab size from `MultimodalVQBridge.total_codebook_size` instead of the old hardcoded `16384`.
- MemGram and ConvVQ now use config constants instead of duplicated literals.
- Supervised calls with `targets` always use `ByteHead`, even under `model.eval()`, so eval loss cannot accidentally route into video/audio heads.
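
As a rough illustration of the graph vocab wiring above, the sketch below derives the graph vocabulary from a bridge object instead of a literal. `BridgeSketch`, `GraphSketch`, and `build_graph` are hypothetical stand-ins, the per-modality sizes are toy numbers, and treating the total as a sum of per-modality codebooks is an assumption; only the property name `total_codebook_size` comes from these notes.

```python
from dataclasses import dataclass, field


@dataclass
class BridgeSketch:
    """Hypothetical stand-in for MultimodalVQBridge."""
    # Per-modality codebook sizes; the values used below are toy numbers.
    codebook_sizes: dict = field(default_factory=dict)

    @property
    def total_codebook_size(self) -> int:
        # Assumption: the total is the sum of the per-modality codebooks.
        return sum(self.codebook_sizes.values())


@dataclass
class GraphSketch:
    """Hypothetical stand-in for the graph module; it only tracks vocab size."""
    total_vocab_size: int


def build_graph(bridge: BridgeSketch) -> GraphSketch:
    # Read the size from the bridge so that resizing a codebook cannot leave
    # the graph with a stale hardcoded vocabulary.
    return GraphSketch(total_vocab_size=bridge.total_codebook_size)


bridge = BridgeSketch({"text": 8, "image": 4, "audio": 4})
graph = build_graph(bridge)
print(graph.total_vocab_size)  # 16
```

With a single source of truth, changing one codebook size changes the graph vocabulary on the next build without touching the graph code.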

## Kernel And Runtime Changes

- Large `TernaryGraph` instances now use an active-code path when `total_vocab_size > 4096`, which avoids projecting every VQ codebook node on each forward pass.
- The full graph path still keeps the existing Triton aggregate/gather kernels for small/full graph tests.
- The 32-expert/top-4 MoE now disables dense all-expert dispatch (`dense_dispatch_max_tokens=0`) because the dense path is too expensive at 1.5B scale.
- `TernaryScaleTensor.forward()` now forces a detached CUDA input to require grad while grad mode is active, so ternary weights downstream of the frozen sidecars train correctly.
- `VideoHead` initial latent now requires grad in training so `cross_attn_q` can receive ternary gradient signals.
- `TalkerHead.token_logits()` was added so audio training can use cross entropy on logits instead of non-differentiable argmax tokens.
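
The active-code idea for large graphs can be sketched in plain Python. This is a minimal illustration under assumptions, not the `TernaryGraph` implementation: `project_active`, the list-based node table, and the doubling "projection" are hypothetical; only the `4096` threshold and the `58684` vocabulary size come from these notes.

```python
ACTIVE_PATH_THRESHOLD = 4096  # threshold quoted in the notes


def use_active_path(total_vocab_size: int) -> bool:
    """Large vocabularies take the active-code path."""
    return total_vocab_size > ACTIVE_PATH_THRESHOLD


def project_active(code_ids, node_table, project):
    """Project only the graph nodes that code_ids actually references.

    Returns (projected, local_ids): projected[i] is the projection of the
    i-th distinct active node, and local_ids remaps every input id to its
    row in projected.
    """
    active = sorted(set(code_ids))
    local_index = {code: i for i, code in enumerate(active)}
    projected = [project(node_table[code]) for code in active]
    local_ids = [local_index[code] for code in code_ids]
    return projected, local_ids


# Toy usage: a 58684-node table, but only three distinct codes are active.
node_table = list(range(58684))
code_ids = [7, 42, 7, 58683]
projected, local_ids = project_active(code_ids, node_table, lambda v: v * 2)
print(len(projected), local_ids)  # 3 [0, 1, 0, 2]
```

In this toy run the projection is evaluated 3 times instead of 58684 times, which is the saving the active path is after.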

## Training Fixes

- Replaced stale `arbitor/train.py` with a current ARB trainer:
  - strict ternary updates by default,
  - no bitsandbytes dependency,
  - correct trigram target alignment (`x[:, 3:]`),
  - optional `--no-save` for smoke tests,
  - explicit sidecar/modal flags,
  - no optimizer when there are no trainable float params.
- `training/text.py` now always runs ternary state updates after backward.
- `training/audio.py` now trains `AudioSequencer -> TalkerHead.token_logits()` against `AudioVQEncoder` targets.
- `training/vision.py` now uses VQ commitment to train image-side ternary projection state and avoids building the full MoE when it is not used.
- `training/diffusion.py` now feeds relational tokens directly into `VideoHead` and avoids the unused full MoE.
- LoRA finetuners now create checkpoint directories, use trigram-aligned targets, expose `--max-moe-iters`, and default to `1` iteration for local 8GB runs.
- Audio input normalization now accepts `[T]`, `[B, T]`, and `[B, 1, T]`.
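
The trigram target alignment used by the trainer and the LoRA finetuners can be illustrated without tensors. The sketch assumes that each position consumes three consecutive bytes and predicts the byte that follows them; that reading is an assumption, and `trigram_windows` and `trigram_targets` are hypothetical helpers. Only the `x[:, 3:]` slice comes from these notes.

```python
def trigram_windows(seq):
    """All overlapping trigrams that still have a next byte to predict."""
    return [tuple(seq[i:i + 3]) for i in range(len(seq) - 3)]


def trigram_targets(batch):
    """Pure-Python equivalent of x[:, 3:] for a list-of-lists batch."""
    return [seq[3:] for seq in batch]


batch = [list(b"ternary")]
windows = trigram_windows(batch[0])
targets = trigram_targets(batch)[0]

for window, target in zip(windows, targets):
    print(bytes(window), "->", chr(target))
```

For the 7-byte input this pairs `ter` with `n`, `ern` with `a`, `rna` with `r`, and `nar` with `y`, so windows and targets have the same length. A smaller offset would ask the model to predict a byte it has already seen inside the window.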

## Verification

Passed:

- `python -m compileall -q arbitor training testing/model/test_arb.py testing/test_tscale.py`
- `python -m pytest -q testing/test_tscale.py -k "cuda_triton_correctness_update_E or cuda_triton_tscale_path"`
- Custom CUDA kernel smoke for graph aggregate, graph gather/add, MoE dense combine, and video denoise.
- Full multimodal audit:
  - logical ternary weights: `1,500,000,000`
  - ternary training state: `1956.05 MB`
  - trainable float params: `0`
  - frozen sidecar params: `318.80 MB`
  - graph vocab: `58684`
  - DINOv2 int8: `True`
  - Moonshine int8: `True`
- Active graph CUDA train smoke passed with VQ+graph and no MoE.
- `arbitor.train` strict ternary CUDA smoke passed with 1.5B MoE path.
- Standalone modality smokes passed:
  - `training/audio.py --steps 1 --batch 1`
  - `training/vision.py --steps 1 --batch 1`
  - `training/diffusion.py --steps 1 --batch 1`
- LoRA text finetune smoke passed:
  - `training/finetuning/text.py --steps 1 --batch 1 --accum 1 --ctx 4 --lora-rank 1 --max-moe-iters 1`
- pig-vae load now passes after installing `diffusers` into the user site.
  - The local `.safetensors` checkpoint must be loaded through `AutoencoderKLWan.from_single_file`; a direct `load_state_dict(strict=False)` reported 194 missing and 194 unexpected keys and silently left the VAE with random weights.
- pig-vae int8 load smoke:
  - inner module: `AutoencoderKLWan`
  - quantized int8: `True`
  - trainable float params: `0`
- pig-vae encode/decode smoke:
  - input video: `[1, 3, 4, 64, 64]`
  - latents: `[1, 16, 1, 8, 8]`
  - reconstruction: `[1, 3, 1, 64, 64]`
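
The pig-vae loading failure noted above is easy to miss because a non-strict load does not raise. PyTorch's `load_state_dict` returns `missing_keys` and `unexpected_keys`; the sketch below reproduces that bookkeeping with plain sets and turns any mismatch into an error. `diff_keys`, `require_clean_load`, and the key names are hypothetical and only illustrate the check.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) the way a non-strict load reports them."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # in model, not in file
    unexpected = sorted(checkpoint_keys - model_keys)   # in file, not in model
    return missing, unexpected


def require_clean_load(model_keys, checkpoint_keys):
    """Fail loudly instead of keeping randomly initialised weights."""
    missing, unexpected = diff_keys(model_keys, checkpoint_keys)
    if missing or unexpected:
        raise RuntimeError(
            f"{len(missing)} missing and {len(unexpected)} unexpected keys; "
            "the checkpoint layout does not match the module"
        )


# Hypothetical key names: same weights, different naming scheme.
model_keys = ["encoder.conv_in.weight", "decoder.conv_out.weight"]
checkpoint_keys = ["enc.conv1.weight", "dec.head.weight"]

missing, unexpected = diff_keys(model_keys, checkpoint_keys)
print(len(missing), len(unexpected))  # 2 2
```

Equal missing and unexpected counts, as in the 194/194 case, usually point to a naming-scheme mismatch rather than absent weights, which is the kind of difference a format-aware loader is meant to resolve.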

## Remaining Constraints

- Full 1.5B MoE training is functionally correct but still slow on an RTX 4060-class GPU. One strict smoke step with MoE took about 77 seconds after the sparse-dispatch change.
- The active graph path is the practical path for the 58k VQ vocabulary. A future native CUDA graph kernel should fuse active-node projection, neighbor selection, hop update, and pooling.
- LoRA finetuning uses float adapter parameters by design; strict base ternary state remains frozen in that path.
- Diffusers emits a non-fatal warning about missing `torchao` tensor support in this environment. The local safetensors pig-vae checkpoint loaded and ran without torchao.
- `pyproject.toml` now requires `diffusers>=0.38.0` for the `diffusers` and `video` extras because older versions may not expose `AutoencoderKLWan`.