# True Ternary Refactor 9 — Platform Components And Output Bridge

## Scope

The codebase has moved into the `arbitor/` package. This pass focuses only on the newly added platform components:

- output bridge heads: `OutputRouter`, `VideoHead`, `TalkerHead`
- custom audio training encoder: `AudioVQEncoder`
- imported sidecars: `pig_vae`, Moonshine audio encoder, ViT/DINO vision encoder
- new loop-heavy output paths

## AudioVQEncoder Ternarization

`arbitor/encoders/audio.py` was still a custom trainable float module:

```text
nn.Conv1d -> nn.Conv1d -> nn.Linear -> nn.Embedding -> nn.Linear
```

Converted it to persistent ternary state:

- Added `TernaryConv1d`, implemented as `unfold + TernaryScaleTensor`.
- Replaced all conv blocks with `TernaryConv1d`.
- Replaced `proj` and `out_proj` with `TernaryScaleTensor`.
- Replaced the VQ codebook `nn.Embedding` with `TernaryEmbeddingTable`.
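The `unfold + matmul` formulation behind `TernaryConv1d` can be sketched as follows. This is only an illustration, not the package's implementation: `ternarize` and `unfold_conv1d` are hypothetical helpers, the thresholding rule is an assumption, and a plain scaled ternary matrix stands in for `TernaryScaleTensor`, whose real interface is not shown here.

```python
import torch
import torch.nn.functional as F


def ternarize(w: torch.Tensor):
    """Hypothetical ternarizer: codes in {-1, 0, +1} plus one float scale."""
    threshold = 0.7 * w.abs().mean()           # assumed threshold rule
    codes = torch.sign(w) * (w.abs() > threshold)
    scale = w.abs().mean()
    return codes, scale


def unfold_conv1d(x, w_flat, kernel_size, stride=1, padding=0):
    """Conv1d as sliding windows plus a matmul.

    x:      (B, C_in, T)
    w_flat: (C_out, C_in * kernel_size)
    """
    x = F.pad(x, (padding, padding))
    patches = x.unfold(2, kernel_size, stride)            # (B, C_in, L, K)
    b, c, l, k = patches.shape
    patches = patches.permute(0, 2, 1, 3).reshape(b, l, c * k)
    out = patches @ w_flat.t()                            # (B, L, C_out)
    return out.transpose(1, 2)                            # (B, C_out, L)


torch.manual_seed(0)
x = torch.randn(2, 3, 16)
codes, scale = ternarize(torch.randn(5, 3 * 4))
w = codes * scale
y = unfold_conv1d(x, w, kernel_size=4, stride=2, padding=1)
# Reference: the same weights reshaped to Conv1d layout (C_out, C_in, K).
ref = F.conv1d(x, w.view(5, 3, 4), stride=2, padding=1)
```

Because the convolution reduces to a single matrix product, the only weight-bearing operation is the linear map, which is the part a ternary linear primitive can replace.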

Focused audit:

```text
AudioVQEncoder logical ternary weights: 404,864
trainable float params: 0
frozen float params: 0
float buffers: 0
```
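For readers unfamiliar with this kind of report, the float-side counters could be computed roughly as below. This is a guess at the classification, assuming the audit separates state by dtype and `requires_grad`; `float_state_audit` and `Toy` are hypothetical names, and the real audit in `arbitor/kernel/ternary_audit.py` may work differently.

```python
import torch
from torch import nn


def float_state_audit(module: nn.Module) -> dict:
    """Count floating-point state by category (assumed classification)."""
    counts = {
        "trainable_float_params": 0,
        "frozen_float_params": 0,
        "float_buffers": 0,
    }
    for p in module.parameters():
        if p.is_floating_point():
            key = "trainable_float_params" if p.requires_grad else "frozen_float_params"
            counts[key] += p.numel()
    for b in module.buffers():
        if b.is_floating_point():
            counts["float_buffers"] += b.numel()
    return counts


class Toy(nn.Module):
    """Toy module with one entry in each category, for illustration only."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)                                   # 8 + 2 trainable floats
        self.frozen = nn.Parameter(torch.zeros(3), requires_grad=False)
        self.register_buffer("codes", torch.zeros(6, dtype=torch.int8))  # not float
        self.register_buffer("scale", torch.ones(1))                # float buffer


report = float_state_audit(Toy())
```

A module passes this kind of check only when all three counters are zero, which is what the report above shows for `AudioVQEncoder`.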

Focused smoke:

```text
audio_vq_encoder_ok logits=(1, 4, 289), indices=(1, 4)
```

## Output Bridge Ternarization

`VideoHead.noise_embed` was a hidden float `nn.Embedding`.

Changed:

```text
nn.Embedding(max_steps, TRIGRAM_DIM)
```

to:

```text
TernaryEmbeddingTable(max_steps, TRIGRAM_DIM)
```

Focused audit for `VideoHead`:

```text
logical ternary weights: 17,040,896
trainable float params: 0
frozen float params: 0
float buffers: 0
```

`TalkerHead.forward()` had a nested Python loop:

```text
for token:
    for stride:
        logits = head(state)
        append argmax token
```

Replaced it with a single ternary head call over all conditioning tokens, followed by `repeat_interleave` to expand each predicted token by the stride. The stride/pad/truncate behavior is unchanged.
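A minimal sketch of why the two forms agree, assuming the state does not change inside the stride loop (as the pseudocode suggests). `nn.Linear` stands in for the ternary head, and `length` and `pad_id` are hypothetical parameters for the pad/truncate step:

```python
import torch
import torch.nn.functional as F
from torch import nn


def loop_decode(head, states, stride):
    """Old shape of the computation: one head call per (token, stride) pair."""
    out = []
    for t in range(states.shape[1]):
        for _ in range(stride):
            logits = head(states[:, t])                   # (B, V)
            out.append(logits.argmax(dim=-1))
    return torch.stack(out, dim=1)                        # (B, T * stride)


def batched_decode(head, states, stride, length, pad_id=0):
    """One head call over all tokens, then expand, truncate, and pad."""
    tokens = head(states).argmax(dim=-1)                  # (B, T)
    tokens = tokens.repeat_interleave(stride, dim=1)      # (B, T * stride)
    tokens = tokens[:, :length]                           # truncate
    missing = length - tokens.shape[1]
    if missing > 0:
        tokens = F.pad(tokens, (0, missing), value=pad_id)  # right-pad
    return tokens


torch.manual_seed(0)
head = nn.Linear(8, 16)            # stand-in for the ternary head
states = torch.randn(2, 4, 8)      # (B, T, D)
with torch.no_grad():
    slow = loop_decode(head, states, stride=3)
    fast = batched_decode(head, states, stride=3, length=12)
    short = batched_decode(head, states, stride=3, length=10)
    padded = batched_decode(head, states, stride=3, length=14)
```

The batched form issues one head call instead of `T * stride`, which matters most on GPU, where each small call carries launch overhead.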

Focused smoke:

```text
video_head_ok latents=(1, 16, 1, 32, 32)
talker_head_ok tokens=(1, 10)
```

## Imported Sidecars

`pig_vae` now explicitly freezes all parameters after optional int8 quantization:

```text
quantize(vae, weights=qint8)
freeze(vae)
for p in vae.parameters(): p.requires_grad = False
```

Moonshine audio and ViT/DINO vision already default to `quantize_weights='int8'` through `optimum.quanto` and then freeze their parameters. If `optimum.quanto` is unavailable, they fall back to frozen BF16. That fallback is not strictly ternary, but it is frozen imported sidecar state, not trainable model state.
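The quantize-then-freeze pattern with a BF16 fallback might look like the sketch below. `freeze_sidecar` is a hypothetical helper, not a function from the package, and the `model.eval()` call is an added assumption. The `quantize`, `freeze`, and `qint8` names are the ones `optimum.quanto` exports.

```python
import torch
from torch import nn


def freeze_sidecar(model: nn.Module) -> str:
    """Quantize weights to int8 when optimum.quanto is importable, else cast to BF16.

    In both cases every parameter ends up with requires_grad=False.
    Returns the mode that was applied.
    """
    try:
        from optimum.quanto import freeze, qint8, quantize

        quantize(model, weights=qint8)   # mark weights for int8 quantization
        freeze(model)                    # materialize the quantized weights
        mode = "int8"
    except ImportError:
        model.to(torch.bfloat16)         # fallback: frozen BF16, not ternary
        mode = "bf16"
    for p in model.parameters():
        p.requires_grad = False
    model.eval()                         # assumption: sidecars run in eval mode
    return mode


sidecar = nn.Sequential(nn.Linear(4, 4))  # stand-in for an imported encoder
mode = freeze_sidecar(sidecar)
```

Returning the mode makes it easy for a later audit to report which sidecars fell back to BF16.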

## New Kernel Support

Added a Triton denoise-step kernel for `VideoHead`:

```text
latent = (latent - (1 - alpha) * pred_noise) / sqrt(alpha)
```

Forward and backward are Triton-backed on CUDA. The ACT-style diffusion loop remains because it controls halting and repeated shared-weight denoising, but the per-step latent update is now one custom kernel.

Correctness against PyTorch:

```text
video_denoise_fwd_maxdiff:         7.15e-07
video_denoise_grad_latent_maxdiff: 4.77e-07
video_denoise_grad_pred_maxdiff:   1.79e-07
```
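A comparison of this kind can be reproduced in plain PyTorch. The sketch below writes the forward and backward rules explicitly, as a fused kernel would have to, and checks them against autograd. It assumes a scalar `alpha` per step; `DenoiseStep` and `max_diffs` are hypothetical names and no Triton code is involved.

```python
import math

import torch


class DenoiseStep(torch.autograd.Function):
    """latent' = (latent - (1 - alpha) * pred_noise) / sqrt(alpha), explicit backward."""

    @staticmethod
    def forward(ctx, latent, pred_noise, alpha):
        ctx.alpha = alpha
        return (latent - (1.0 - alpha) * pred_noise) / math.sqrt(alpha)

    @staticmethod
    def backward(ctx, grad_out):
        inv = 1.0 / math.sqrt(ctx.alpha)
        grad_latent = grad_out * inv                         # d out / d latent
        grad_pred = grad_out * (-(1.0 - ctx.alpha) * inv)    # d out / d pred_noise
        return grad_latent, grad_pred, None                  # no grad for alpha


def max_diffs(alpha=0.9, shape=(1, 16, 1, 32, 32)):
    """Max absolute differences between the explicit rules and autograd."""
    torch.manual_seed(0)
    a = torch.randn(shape, requires_grad=True)
    b = torch.randn(shape, requires_grad=True)
    a_ref = a.detach().clone().requires_grad_()
    b_ref = b.detach().clone().requires_grad_()

    out = DenoiseStep.apply(a, b, alpha)
    ref = (a_ref - (1.0 - alpha) * b_ref) / math.sqrt(alpha)

    g = torch.randn(shape)
    out.backward(g)
    ref.backward(g)
    return {
        "fwd": (out - ref).abs().max().item(),
        "grad_latent": (a.grad - a_ref.grad).abs().max().item(),
        "grad_pred": (b.grad - b_ref.grad).abs().max().item(),
    }


diffs = max_diffs()
```

Both gradients are constant multiples of the upstream gradient, so the backward pass needs no saved tensors, only `alpha`.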

## Model-Level Verification

Package compile:

```text
python -m py_compile arbitor/components.py arbitor/sequencers.py arbitor/encoders/audio.py arbitor/encoders/pig_vae.py arbitor/main.py arbitor/vq.py arbitor/kernel/ternary_scale.py arbitor/kernel/ternary_audit.py
```

ARBModel with image/audio imports disabled, VQ/Graph/Memory/MoE/output heads enabled:

```text
logical ternary weights: 41,087,552
ternary training state: 53.65 MB
trainable float params: 0
frozen float params: 0
float buffers: 0
```

Smokes:

```text
arb_model_cpu_forward_ok logits=(2, 8, 297), indices=(2, 8)
arb_model_cuda_train_smoke_ok logits=(2, 8, 297), targets=(2, 7), loss=12.1709
```

The CUDA smoke completed forward, backward, and `_ternary_update_memory()`.

## Remaining Work

1. Add a strict sidecar audit mode that reports imported quantized sidecars separately from core ternary state.
2. Add tests that instantiate Moonshine/ViT only when cached locally, to avoid network-dependent CI.
3. Consider a true ternary transposed-conv replacement if `TinyNeuralCodec` is promoted from a lazy frozen sidecar to a trainable core model component.
4. The VideoHead diffusion control loop is still Python-level. Full fusion would require a fixed-step, no-break kernel variant or a persistent CUDA kernel, which is a larger design change.
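For item 2, one possible cache check is sketched below, assuming the sidecars are fetched through the Hugging Face hub and stored in its standard `models--<org>--<name>/snapshots/` layout. `hub_cache_dir` and `is_cached` are hypothetical helpers, and the lookup ignores `XDG_CACHE_HOME` for brevity.

```python
import os
from pathlib import Path


def hub_cache_dir() -> Path:
    """Resolve the hub cache directory from HF_HUB_CACHE or HF_HOME (simplified)."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def is_cached(repo_id: str, cache_dir=None) -> bool:
    """True if at least one snapshot of repo_id exists locally. No network access."""
    root = Path(cache_dir) if cache_dir is not None else hub_cache_dir()
    snapshots = root / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())
```

A test could then be guarded with `pytest.mark.skipif(not is_cached(repo_id), reason="model not cached locally")`, so CI without a warm cache skips the sidecar tests instead of downloading weights.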