# True Ternary Refactor 9 — Platform Components And Output Bridge

## Scope

The codebase has moved into the `arbitor/` package. This pass focuses only on the newly added platform components:

- output bridge heads: `OutputRouter`, `VideoHead`, `TalkerHead`
- custom audio training encoder: `AudioVQEncoder`
- imported sidecars: `pig_vae`, Moonshine audio encoder, ViT/DINO vision encoder
- new loop-heavy output paths

## AudioVQEncoder Ternarization

`AudioVQEncoder` in `arbitor/encoders/audio.py` was still a custom trainable float module:

```text
nn.Conv1d -> nn.Conv1d -> nn.Linear -> nn.Embedding -> nn.Linear
```

Converted it to persistent ternary state:

- Added `TernaryConv1d`, implemented as `unfold + TernaryScaleTensor`.
- Replaced all conv blocks with `TernaryConv1d`.
- Replaced `proj` and `out_proj` with `TernaryScaleTensor`.
- Replaced the VQ codebook `nn.Embedding` with `TernaryEmbeddingTable`.
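
The `unfold + matmul` formulation of a 1-D convolution can be sketched as follows. This is a minimal illustration, not the repository's `TernaryConv1d`: the helper names `ternarize` and `ternary_conv1d`, the absmean quantization rule, and the single per-tensor float scale are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def ternarize(w: torch.Tensor):
    """Absmean ternarization (assumed scheme, not taken from the repo).

    Returns codes in {-1, 0, +1} and one float scale for the whole tensor.
    """
    scale = w.abs().mean().clamp(min=1e-8)
    codes = torch.round(w / scale).clamp(-1, 1)
    return codes, scale


def ternary_conv1d(x, codes, scale, stride=1, padding=0):
    """Conv1d expressed as unfold + one matmul against ternary codes.

    x:     (batch, in_ch, length)
    codes: (out_ch, in_ch, kernel) with entries in {-1, 0, +1}
    """
    out_ch, in_ch, k = codes.shape
    x = F.pad(x, (padding, padding))
    # (batch, in_ch, n_windows, k): every length-k window along time.
    windows = x.unfold(dimension=2, size=k, step=stride)
    # Flatten each window to a row: (batch, n_windows, in_ch * k).
    cols = windows.permute(0, 2, 1, 3).reshape(x.shape[0], -1, in_ch * k)
    # One matmul replaces the sliding dot products.
    out = cols @ (codes.reshape(out_ch, in_ch * k).t() * scale)
    return out.transpose(1, 2)  # (batch, out_ch, n_windows)


torch.manual_seed(0)
x = torch.randn(2, 3, 16)
codes, scale = ternarize(torch.randn(5, 3, 4))
y = ternary_conv1d(x, codes, scale, stride=2, padding=1)
# Reference: the built-in convolution with the same dequantized weights.
ref = F.conv1d(x, codes * scale, stride=2, padding=1)
```

Because the windows are flattened in `(in_ch, kernel)` order, the same order as the reshaped weight, the result should agree with `F.conv1d` up to float rounding.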

Focused audit:

```text
AudioVQEncoder logical ternary weights: 404,864
trainable float params: 0
frozen float params: 0
float buffers: 0
```
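
An audit with these three float categories can be sketched in a few lines of PyTorch. The function name `audit_float_state` is hypothetical, and how the repository stores ternary state is not shown in this note; the `int8` buffer in `TinyTernary` is only an assumption used to show a module that reports zero float state.

```python
import torch
import torch.nn as nn


def audit_float_state(model: nn.Module) -> dict:
    """Count floating-point elements, split into the audit's three categories."""
    report = {
        "trainable float params": 0,
        "frozen float params": 0,
        "float buffers": 0,
    }
    for p in model.parameters():
        if p.is_floating_point():
            key = "trainable float params" if p.requires_grad else "frozen float params"
            report[key] += p.numel()
    for b in model.buffers():
        if b.is_floating_point():
            report["float buffers"] += b.numel()
    return report


class TinyTernary(nn.Module):
    """Hypothetical module that keeps its weights as int8 codes in a buffer."""

    def __init__(self):
        super().__init__()
        self.register_buffer("codes", torch.zeros(4, 8, dtype=torch.int8))


float_report = audit_float_state(nn.Linear(8, 4))    # weight 32 + bias 4
ternary_report = audit_float_state(TinyTernary())    # no float state at all
```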

Focused smoke:

```text
audio_vq_encoder_ok logits=(1, 4, 289), indices=(1, 4)
```

## Output Bridge Ternarization

`VideoHead.noise_embed` was a hidden float `nn.Embedding`.

Changed:

```text
nn.Embedding(max_steps, TRIGRAM_DIM)
```

to:

```text
TernaryEmbeddingTable(max_steps, TRIGRAM_DIM)
```

Focused audit for `VideoHead`:

```text
logical ternary weights: 17,040,896
trainable float params: 0
frozen float params: 0
float buffers: 0
```

`TalkerHead.forward()` had a nested Python loop:

```text
for token:
    for stride:
        logits = head(state)
        append argmax token
```

Replaced it with one ternary head call over all conditioning tokens plus `repeat_interleave`, keeping the same stride/pad/truncate behavior.
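
The equivalence can be sketched with a stand-in head. The sketch assumes the per-token state does not change inside the stride loop, so every stride step yields the same argmax; `talker_loop`, `talker_vectorized`, `fit_length`, and the `nn.Linear` head are hypothetical names for illustration, not the `TalkerHead` code.

```python
import torch
import torch.nn as nn


def fit_length(tokens, max_len, pad_id):
    """Truncate or right-pad a (batch, n) token tensor to (batch, max_len)."""
    out = tokens.new_full((tokens.shape[0], max_len), pad_id)
    n = min(max_len, tokens.shape[1])
    out[:, :n] = tokens[:, :n]
    return out


def talker_loop(head, states, stride, max_len, pad_id=0):
    """Reference: nested Python loop, one head call per emitted token."""
    emitted = []
    for t in range(states.shape[1]):
        for _ in range(stride):
            logits = head(states[:, t])            # (batch, vocab)
            emitted.append(logits.argmax(dim=-1))  # (batch,)
    return fit_length(torch.stack(emitted, dim=1), max_len, pad_id)


def talker_vectorized(head, states, stride, max_len, pad_id=0):
    """One head call over all tokens, then repeat each argmax `stride` times."""
    ids = head(states).argmax(dim=-1)              # (batch, n_tokens)
    return fit_length(ids.repeat_interleave(stride, dim=1), max_len, pad_id)


torch.manual_seed(0)
head = nn.Linear(8, 11).double()                   # stand-in for the ternary head
states = torch.randn(1, 4, 8, dtype=torch.float64)
slow = talker_loop(head, states, stride=3, max_len=10)        # 12 tokens, truncated
fast = talker_vectorized(head, states, stride=3, max_len=10)
padded = talker_vectorized(head, states, stride=3, max_len=15)  # 12 tokens, padded
```

The vectorized path makes one head call instead of `n_tokens * stride`, which is where the savings come from.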

Focused smoke:

```text
video_head_ok latents=(1, 16, 1, 32, 32)
talker_head_ok tokens=(1, 10)
```

## Imported Sidecars

`pig_vae` now explicitly freezes all parameters after optional int8 quantization:

```text
quantize(vae, weights=qint8)
freeze(vae)
for p in vae.parameters(): p.requires_grad = False
```

Moonshine audio and ViT/DINO vision already default to `quantize_weights='int8'` through `optimum.quanto`, then freeze parameters. If `optimum.quanto` is unavailable, they fall back to frozen BF16; that fallback is not strict ternary, but it is frozen imported sidecar state rather than trainable model state.
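
The quantize-then-freeze path with its BF16 fallback might look roughly like this. The helper `freeze_sidecar` and its return value are hypothetical; only `quantize`, `freeze`, and `qint8` come from `optimum.quanto`, and casting the module to BF16 in the fallback is an assumption about how the frozen BF16 state is reached.

```python
import torch
import torch.nn as nn


def freeze_sidecar(module: nn.Module, quantize_weights: str = "int8") -> str:
    """Quantize with optimum.quanto when available, else cast to BF16.

    Either way, every parameter ends up with requires_grad=False.
    Returns the mode that was actually applied.
    """
    mode = "bf16"
    if quantize_weights == "int8":
        try:
            from optimum.quanto import freeze, qint8, quantize

            quantize(module, weights=qint8)
            freeze(module)
            mode = "int8"
        except ImportError:
            pass  # fall through to the BF16 fallback below
    if mode == "bf16":
        module.to(torch.bfloat16)
    for p in module.parameters():
        p.requires_grad = False
    module.eval()
    return mode


sidecar = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
applied = freeze_sidecar(sidecar)
```

Returning the applied mode gives a caller, such as an audit, a way to tell the quantized path apart from the fallback.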

## New Kernel Support

Added a Triton denoise-step kernel for `VideoHead`:

```text
latent = (latent - (1 - alpha) * pred_noise) / sqrt(alpha)
```

Forward and backward are Triton-backed on CUDA. The ACT-style diffusion loop remains because it controls halting and repeated shared-weight denoising, but the per-step latent update is now one custom kernel.

Correctness against PyTorch:

```text
video_denoise_fwd_maxdiff:         7.15e-07
video_denoise_grad_latent_maxdiff: 4.77e-07
video_denoise_grad_pred_maxdiff:   1.79e-07
```
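
A check of this kind can be reproduced on CPU with a hand-written backward in place of the Triton kernel. The sketch below assumes `alpha` is a plain Python float for the step; `DenoiseStep` is a hypothetical name, and the numbers it produces are not the ones reported above.

```python
import math

import torch


class DenoiseStep(torch.autograd.Function):
    """out = (latent - (1 - alpha) * pred_noise) / sqrt(alpha), alpha a float."""

    @staticmethod
    def forward(ctx, latent, pred_noise, alpha):
        ctx.alpha = alpha
        return (latent - (1.0 - alpha) * pred_noise) / math.sqrt(alpha)

    @staticmethod
    def backward(ctx, grad_out):
        inv = 1.0 / math.sqrt(ctx.alpha)
        # d out / d latent     =  1 / sqrt(alpha)
        # d out / d pred_noise = -(1 - alpha) / sqrt(alpha)
        grad_latent = grad_out * inv
        grad_pred = grad_out * (-(1.0 - ctx.alpha) * inv)
        return grad_latent, grad_pred, None  # no gradient for the float alpha


torch.manual_seed(0)
alpha = 0.9
latent = torch.randn(1, 16, 1, 32, 32, requires_grad=True)
pred = torch.randn_like(latent, requires_grad=True)
upstream = torch.randn_like(latent)  # random upstream gradient

out = DenoiseStep.apply(latent, pred, alpha)
ref = (latent - (1.0 - alpha) * pred) / math.sqrt(alpha)  # plain autograd path

g_latent, g_pred = torch.autograd.grad(out, (latent, pred), grad_outputs=upstream)
r_latent, r_pred = torch.autograd.grad(ref, (latent, pred), grad_outputs=upstream)

fwd_maxdiff = (out - ref).abs().max().item()
grad_latent_maxdiff = (g_latent - r_latent).abs().max().item()
grad_pred_maxdiff = (g_pred - r_pred).abs().max().item()
```

Both gradients are elementwise scalings of the upstream gradient, which is why the update suits a single fused kernel.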

## Model-Level Verification

Package compile:

```text
python -m py_compile arbitor/components.py arbitor/sequencers.py arbitor/encoders/audio.py arbitor/encoders/pig_vae.py arbitor/main.py arbitor/vq.py arbitor/kernel/ternary_scale.py arbitor/kernel/ternary_audit.py
```

`ARBModel` with image/audio imports disabled and VQ/Graph/Memory/MoE/output heads enabled:

```text
logical ternary weights: 41,087,552
ternary training state: 53.65 MB
trainable float params: 0
frozen float params: 0
float buffers: 0
```

Smokes:

```text
arb_model_cpu_forward_ok logits=(2, 8, 297), indices=(2, 8)
arb_model_cuda_train_smoke_ok logits=(2, 8, 297), targets=(2, 7), loss=12.1709
```

The CUDA smoke completed forward, backward, and `_ternary_update_memory()`.

## Remaining Work

1. Add a strict sidecar audit mode that reports imported quantized sidecars separately from core ternary state.
2. Add tests that instantiate Moonshine/ViT only when cached locally, to avoid network-dependent CI.
3. Consider a true ternary transposed-conv replacement if `TinyNeuralCodec` is promoted from lazy frozen sidecar to trainable core model component.
4. Fuse the `VideoHead` diffusion control loop, which is still Python-level. Full fusion would require either a fixed-step, no-break kernel variant or a persistent CUDA kernel, and is a larger design change.