# True Ternary Refactor 9 — Platform Components And Output Bridge
## Scope
The codebase has moved into the `arbitor/` package. This pass focuses only on the newly added platform components:
- output bridge heads: `OutputRouter`, `VideoHead`, `TalkerHead`
- custom audio training encoder: `AudioVQEncoder`
- imported sidecars: `pig_vae`, Moonshine audio encoder, ViT/DINO vision encoder
- new loop-heavy output paths
## AudioVQEncoder Ternarization
`arbitor/encoders/audio.py` was still a custom trainable float module:
```text
nn.Conv1d -> nn.Conv1d -> nn.Linear -> nn.Embedding -> nn.Linear
```
Converted it to persistent ternary state:
- Added `TernaryConv1d`, implemented as `unfold + TernaryScaleTensor`.
- Replaced all conv blocks with `TernaryConv1d`.
- Replaced `proj` and `out_proj` with `TernaryScaleTensor`.
- Replaced the VQ codebook `nn.Embedding` with `TernaryEmbeddingTable`.
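The `unfold + TernaryScaleTensor` idea can be illustrated without any framework. The sketch below is not the `arbitor` implementation: the helper names, the per-output-channel scale, and the absence of padding are all assumptions made for illustration. It shows only that a 1D convolution reduces to "extract sliding patches, then multiply by a ternary weight matrix".

```python
from typing import List

Signal = List[List[float]]  # layout: [channels][length]


def unfold_1d(x: Signal, kernel_size: int, stride: int = 1) -> List[List[float]]:
    """Flatten each sliding window into one row: [L_out][channels * kernel_size]."""
    length = len(x[0])
    n_out = (length - kernel_size) // stride + 1
    patches = []
    for t in range(n_out):
        start = t * stride
        patches.append(
            [x[c][start + k] for c in range(len(x)) for k in range(kernel_size)]
        )
    return patches


def ternary_conv1d(x: Signal, signs: List[List[int]], scale: List[float],
                   kernel_size: int, stride: int = 1) -> Signal:
    """Conv1d as unfold + ternary matmul.

    signs: [out_channels][channels * kernel_size], entries in {-1, 0, +1}.
    scale: one float per output channel (an assumed scaling scheme).
    """
    patches = unfold_1d(x, kernel_size, stride)
    return [
        [scale[o] * sum(s * p for s, p in zip(row, patch)) for patch in patches]
        for o, row in enumerate(signs)
    ]


def reference_conv1d(x: Signal, signs: List[List[int]], scale: List[float],
                     kernel_size: int, stride: int = 1) -> Signal:
    """Direct sliding-window convolution, used only to check the unfold path."""
    n_out = (len(x[0]) - kernel_size) // stride + 1
    out = []
    for o, row in enumerate(signs):
        channel_out = []
        for t in range(n_out):
            acc = 0.0
            for c in range(len(x)):
                for k in range(kernel_size):
                    acc += row[c * kernel_size + k] * x[c][t * stride + k]
            channel_out.append(scale[o] * acc)
        out.append(channel_out)
    return out


x = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 0.0, 1.0, 0.0]]
signs = [[1, 0, -1, 0, 1, 0], [-1, -1, -1, 1, 1, 1]]
scale = [0.5, 2.0]
result = ternary_conv1d(x, signs, scale, kernel_size=3, stride=2)
```

Because the weights are restricted to {-1, 0, +1}, the inner product is only additions and subtractions; the float scale is applied once per output value.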
Focused audit:
```text
AudioVQEncoder logical ternary weights: 404,864
trainable float params: 0
frozen float params: 0
float buffers: 0
```
Focused smoke:
```text
audio_vq_encoder_ok logits=(1, 4, 289), indices=(1, 4)
```
## Output Bridge Ternarization
`VideoHead.noise_embed` was a hidden float `nn.Embedding`.
Changed:
```text
nn.Embedding(max_steps, TRIGRAM_DIM)
```
to:
```text
TernaryEmbeddingTable(max_steps, TRIGRAM_DIM)
```
Focused audit for `VideoHead`:
```text
logical ternary weights: 17,040,896
trainable float params: 0
frozen float params: 0
float buffers: 0
```
`TalkerHead.forward()` had a nested Python loop:
```text
for token:
    for stride:
        logits = head(state)
        append argmax token
```
Replaced it with one ternary head call over all conditioning tokens plus `repeat_interleave`, keeping the same stride/pad/truncate behavior.
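A framework-free sketch of why the two forms agree is given here. It assumes the inner loop never changes the state fed to the head, that padding appends a pad id at the end, and that truncation drops trailing tokens; `fit_length`, `pad_id`, and the precomputed logit rows are hypothetical stand-ins, not the `TalkerHead` API.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def fit_length(tokens: List[int], target_len: int, pad_id: int) -> List[int]:
    """Truncate or right-pad to exactly target_len (assumed pad/truncate rule)."""
    return tokens[:target_len] + [pad_id] * max(0, target_len - len(tokens))


def tokens_nested(logit_rows: Sequence[Sequence[float]], stride: int,
                  target_len: int, pad_id: int = 0) -> List[int]:
    """Original shape: the head result is recomputed stride times per token."""
    out: List[int] = []
    for row in logit_rows:          # for token
        for _ in range(stride):     # for stride
            out.append(argmax(row)) # same state, so same argmax every time
    return fit_length(out, target_len, pad_id)


def tokens_batched(logit_rows: Sequence[Sequence[float]], stride: int,
                   target_len: int, pad_id: int = 0) -> List[int]:
    """Replacement shape: one pass over all tokens, then repeat each pick."""
    picks = [argmax(row) for row in logit_rows]            # single head call
    repeated = [p for p in picks for _ in range(stride)]   # repeat_interleave
    return fit_length(repeated, target_len, pad_id)


logit_rows = [[0.1, 0.9, 0.0], [2.0, 1.0, 0.5]]
padded = tokens_batched(logit_rows, stride=2, target_len=5)
truncated = tokens_batched(logit_rows, stride=2, target_len=3)
```

The batched form does `len(logit_rows)` head evaluations instead of `len(logit_rows) * stride`, which is where the saving comes from.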
Focused smoke:
```text
video_head_ok latents=(1, 16, 1, 32, 32)
talker_head_ok tokens=(1, 10)
```
## Imported Sidecars
`pig_vae` now explicitly freezes all parameters after optional int8 quantization:
```text
quantize(vae, weights=qint8)
freeze(vae)
for p in vae.parameters(): p.requires_grad = False
```
Moonshine audio and ViT/DINO vision already default to `quantize_weights='int8'` through `optimum.quanto`, then freeze parameters. If `optimum.quanto` is unavailable, they fall back to frozen BF16; that fallback is not strict ternary, but it is frozen imported sidecar state rather than trainable model state.
## New Kernel Support
Added a Triton denoise-step kernel for `VideoHead`:
```text
latent = (latent - (1 - alpha) * pred_noise) / sqrt(alpha)
```
Forward and backward are Triton-backed on CUDA. The ACT-style diffusion loop remains because it controls halting and repeated shared-weight denoising, but the per-step latent update is now one custom kernel.
Correctness against PyTorch:
```text
video_denoise_fwd_maxdiff: 7.15e-07
video_denoise_grad_latent_maxdiff: 4.77e-07
video_denoise_grad_pred_maxdiff: 1.79e-07
```
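For reference, the update and the two gradients that such a check compares can be written out in scalar form. This is a plain-Python sketch of the formula above, not the Triton kernel; it assumes `alpha` is treated as a constant with no gradient of its own, and the function names are illustrative.

```python
import math


def denoise_step(latent: float, pred_noise: float, alpha: float) -> float:
    """latent' = (latent - (1 - alpha) * pred_noise) / sqrt(alpha)."""
    return (latent - (1.0 - alpha) * pred_noise) / math.sqrt(alpha)


def denoise_step_grads(grad_out: float, alpha: float) -> tuple:
    """Analytic backward pass for the step above.

    d latent' / d latent     =  1 / sqrt(alpha)
    d latent' / d pred_noise = -(1 - alpha) / sqrt(alpha)
    """
    inv_sqrt = 1.0 / math.sqrt(alpha)
    grad_latent = grad_out * inv_sqrt
    grad_pred = -grad_out * (1.0 - alpha) * inv_sqrt
    return grad_latent, grad_pred


def finite_difference(f, x: float, eps: float = 1e-6) -> float:
    """Central difference, used only to sanity-check the analytic gradients."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


alpha = 0.25
new_latent = denoise_step(1.0, 2.0, alpha)
grad_latent, grad_pred = denoise_step_grads(1.0, alpha)
```

Both gradients are independent of the input values, so the backward kernel only needs `alpha` and the incoming gradient.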
## Model-Level Verification
Package compile:
```text
python -m py_compile arbitor/components.py arbitor/sequencers.py arbitor/encoders/audio.py arbitor/encoders/pig_vae.py arbitor/main.py arbitor/vq.py arbitor/kernel/ternary_scale.py arbitor/kernel/ternary_audit.py
```
`ARBModel` with image/audio imports disabled, VQ/Graph/Memory/MoE/output heads enabled:
```text
logical ternary weights: 41,087,552
ternary training state: 53.65 MB
trainable float params: 0
frozen float params: 0
float buffers: 0
```
Smokes:
```text
arb_model_cpu_forward_ok logits=(2, 8, 297), indices=(2, 8)
arb_model_cuda_train_smoke_ok logits=(2, 8, 297), targets=(2, 7), loss=12.1709
```
The CUDA smoke completed forward, backward, and `_ternary_update_memory()`.
## Remaining Work
1. Add a strict sidecar audit mode that reports imported quantized sidecars separately from core ternary state.
2. Add tests that instantiate Moonshine/ViT only when cached locally, to avoid network-dependent CI.
3. Consider a true ternary transposed-conv replacement if `TinyNeuralCodec` is promoted from lazy frozen sidecar to trainable core model component.
4. The VideoHead diffusion control loop is still Python-level. Full fusion would require a fixed-step, no-break kernel variant or a persistent CUDA kernel, which is a larger design change.
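To make item 4 concrete, one possible shape of a fixed-step, no-break variant is sketched below on a scalar latent. It is only an illustration: `step_fn` and `halt_fn` are hypothetical callables, and the real halting rule of the ACT-style loop is not specified here. The idea is to latch a halted flag and select between the old and new latent instead of leaving the loop early.

```python
from typing import Callable

StepFn = Callable[[float, int], float]
HaltFn = Callable[[float, int], bool]


def denoise_with_break(latent: float, step_fn: StepFn, halt_fn: HaltFn,
                       max_steps: int) -> float:
    """Data-dependent control flow: stops as soon as halt_fn fires."""
    for t in range(max_steps):
        if halt_fn(latent, t):
            break
        latent = step_fn(latent, t)
    return latent


def denoise_fixed_steps(latent: float, step_fn: StepFn, halt_fn: HaltFn,
                        max_steps: int) -> float:
    """Always runs max_steps iterations; halting becomes a latched mask."""
    halted = False
    for t in range(max_steps):
        halted = halted or halt_fn(latent, t)
        candidate = step_fn(latent, t)            # computed even when halted
        # In a fused kernel this would be a branch-free select.
        latent = latent if halted else candidate
    return latent


def halve(x: float, t: int) -> float:
    return x * 0.5


def small_enough(x: float, t: int) -> bool:
    return abs(x) < 1.0


early = denoise_fixed_steps(8.0, halve, small_enough, max_steps=10)
capped = denoise_fixed_steps(8.0, halve, small_enough, max_steps=2)
```

The fixed-step form gives the same result as the early-exit loop but spends compute on steps after halting, which is the trade-off a fused design would have to accept.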