# TRUE TERNARY REFACTOR 15

Date: 2026-05-20

## Goal

Fix the two regressions reported after the platform restructure:

- model capacity had fallen back to about 1.9B instead of the 3B target
- training was hitting OOM from fp16/fp32 state leaking into ternary paths

## Changes

### 1. Restored the 3B target shape

`arbitor/config.py` now restores the large VQ targets while keeping the motif width small enough to stay near 3B:

- `CODEBOOK_DIM = 64`
- `SHARED_VQ_SIZE = 10_000_000`
- `KGVQ_CODEBOOK_SIZE = 5_000_000`
- `KGVQ_CODEBOOK_DIM = 64`

A no-allocation constructor trace of the assembled default model reports:

```text
dummy logical ternary total: 3,011,944,672
```

This keeps the requested 10M shared VQ and 5M KG VQ without the accidental 1024-wide VQ explosion.
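As a rough cross-check of how much of that total the codebooks could account for, the sketch below multiplies entries by width for both VQ tables. It assumes each codebook contributes `size * dim` logical ternary weights and that the shared VQ uses `CODEBOOK_DIM` as its width; both are illustrative simplifications, not details taken from the constructor trace. The `codebook_weights` helper is a hypothetical name.

```python
# Config values from arbitor/config.py as listed above.
CODEBOOK_DIM = 64
SHARED_VQ_SIZE = 10_000_000
KGVQ_CODEBOOK_SIZE = 5_000_000
KGVQ_CODEBOOK_DIM = 64

REPORTED_TOTAL = 3_011_944_672  # from the constructor trace above


def codebook_weights(size: int, dim: int) -> int:
    """Assumption: one logical ternary weight per (entry, dimension) pair."""
    return size * dim


shared = codebook_weights(SHARED_VQ_SIZE, CODEBOOK_DIM)
kg = codebook_weights(KGVQ_CODEBOOK_SIZE, KGVQ_CODEBOOK_DIM)
wide_shared = codebook_weights(SHARED_VQ_SIZE, 1024)  # the accidental width

print(f"shared VQ at width 64:   {shared:,}")
print(f"KG VQ at width 64:       {kg:,}")
print(f"both codebooks:          {shared + kg:,}")
print(f"shared VQ at width 1024: {wide_shared:,}")
```

Under these assumptions the two codebooks come to 960M weights at width 64, while a 1024-wide shared VQ alone would be 10.24B, more than three times the whole 3B target.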

### 2. Removed MoEGraph fp16 edge EMA

`MoEGraph` no longer allocates dense `codebook_size * 10` graph edges for large VQ graphs and no longer registers `edge_ema` as `float16`.

Large graphs now use bounded active edge state:

- `active_edge_src`: int32
- `active_edge_dst`: int32
- `active_edge_attr`: int8 ternary edge sign
- `active_edge_score`: int8 residual score
- `edge_index`: empty compatibility buffer for large active mode

Small graph tests still use dense edges, but the score path is now int8 `edge_score`, not fp16 EMA.
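To see why the dense layout could not scale, the following back-of-the-envelope sketch compares edge storage for the two modes. The int64 `edge_index` dtype for the old dense path and the `MAX_ACTIVE_EDGES` cap are assumptions made for illustration; only the active buffer dtypes and the `codebook_size * 10` edge count come from the description above.

```python
CODEBOOK_SIZE = 10_000_000
DENSE_EDGES = CODEBOOK_SIZE * 10  # old dense edge count for large graphs

# Bytes per edge in the old dense layout.
# Assumption: edge_index was int64 (two endpoints), plus the fp16 edge_ema.
DENSE_BYTES_PER_EDGE = 2 * 8 + 2

# Bytes per edge in the active layout, from the buffer dtypes listed above.
ACTIVE_BYTES_PER_EDGE = (
    4    # active_edge_src: int32
    + 4  # active_edge_dst: int32
    + 1  # active_edge_attr: int8
    + 1  # active_edge_score: int8
)

MAX_ACTIVE_EDGES = 1_000_000  # hypothetical cap, not from the source

dense_bytes = DENSE_EDGES * DENSE_BYTES_PER_EDGE
active_bytes = MAX_ACTIVE_EDGES * ACTIVE_BYTES_PER_EDGE

print(f"dense:  {dense_bytes / 1e9:.2f} GB")
print(f"active: {active_bytes / 1e6:.2f} MB")
```

With these numbers the dense path needs about 1.8 GB for a 10M-entry codebook, while the active state costs 10 bytes per edge and grows only with the chosen cap.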

### 3. Removed float KG VQ buffers

The old `KGVQCodebook` kept float32 `embed` and `embed_avg` buffers. It is now a compatibility wrapper around `TernaryVQCodebook`, so the KG/composite VQ uses packed ternary rows, int8 scales, int8 accumulators, and int16 usage counts.

### 4. Large VQ initialization is now packed-first

`TernaryEmbeddingTable` now detects million-entry tables and initializes directly into:

- packed `uint8` trits
- int8 `E`
- int8 `E_accum`
- int8 `T_accum`

This avoids building temporary multi-GB float tensors for the 10M shared VQ and 5M KG VQ.
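The sketch below illustrates the idea of packed trit storage and the size difference against an fp32 table of the same shape. The five-trits-per-byte base-3 scheme and the `pack_trits` / `unpack_trits` helpers are assumptions for illustration; the actual bit layout used by `TernaryEmbeddingTable` is not described in this note.

```python
TRITS_PER_BYTE = 5  # 3**5 = 243 fits in one uint8


def pack_trits(trits: list[int]) -> bytes:
    """Pack values in {-1, 0, 1} into bytes, five base-3 digits per byte."""
    out = bytearray()
    for start in range(0, len(trits), TRITS_PER_BYTE):
        byte = 0
        # Iterate in reverse so the first trit ends up as the lowest digit.
        for t in reversed(trits[start:start + TRITS_PER_BYTE]):
            byte = byte * 3 + (t + 1)  # map -1/0/1 to digits 0/1/2
        out.append(byte)
    return bytes(out)


def unpack_trits(packed: bytes, count: int) -> list[int]:
    """Inverse of pack_trits; count drops the padding in the last byte."""
    trits = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            byte, digit = divmod(byte, 3)
            trits.append(digit - 1)
    return trits[:count]


# Round trip on a small sample.
sample = [-1, 0, 1, 1, 0, -1, 1]
assert unpack_trits(pack_trits(sample), len(sample)) == sample

# Size comparison for the 10M x 64 shared VQ.
rows, dim = 10_000_000, 64
packed_bytes = (rows * dim + TRITS_PER_BYTE - 1) // TRITS_PER_BYTE
fp32_bytes = rows * dim * 4
print(f"packed trits: {packed_bytes / 1e6:.0f} MB")
print(f"fp32 table:   {fp32_bytes / 1e9:.2f} GB")
```

Under this packing the shared VQ trits take 128 MB, against 2.56 GB for a temporary fp32 tensor of the same shape.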

### 5. Removed persistent fp32 Triton training hooks

The Triton ternary backward path now stores `_hook_grad_T_sign` as int8. It no longer keeps the fp32 `_hook_grad_2d` and `_hook_x_2d` activation/gradient views on each ternary module after backward.

The direct fp32 hook fallback remains only for non-Triton compatibility paths, and the tests now assert that the CUDA Triton path does not retain fp32 grad/x hooks.

### 6. Ternary MoE centroids

MoEGraph routing centroids are now a `TernaryEmbeddingTable` instead of a float `nn.Parameter`.

## Validation

Passed:

```bash
python -m compileall -q arbitor training testing
python -m pytest -q testing/kg/test_kg_edges.py testing/kg/test_composite_head.py testing/test_gradient_capture.py testing/test_tilelang_training.py
python -m pytest -q testing/test_tscale.py::test_cuda_triton_tscale_path
python -m pytest -q --import-mode=importlib testing/model/test_tscale.py::test_cuda_triton_tscale_path
```

Additional targeted checks passed:

- large active MoEGraph with `codebook_size=10_000_000` has `edge_index.shape == (2, 0)` and no float edge buffers
- 1M-entry `TernaryVQCodebook` has no float buffers and trains through sparse forward/backward/update
- small active MoEGraph forward/backward remains finite with ternary centroids
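A "no float buffers" check of this kind can be written generically against any `nn.Module`. The sketch below is an assumption about how such an assertion might look, not the project's test code: `floating_state` and the `TinyActiveGraph` stand-in module are hypothetical names, and the int64 dtype chosen for `edge_index` is also an assumption.

```python
import torch
from torch import nn


def floating_state(module: nn.Module) -> list[str]:
    """Return names of parameters and buffers that hold floating-point data."""
    named = list(module.named_parameters()) + list(module.named_buffers())
    return [name for name, tensor in named if tensor.is_floating_point()]


class TinyActiveGraph(nn.Module):
    """Hypothetical stand-in that mirrors the active edge buffer dtypes."""

    def __init__(self, max_edges: int = 8) -> None:
        super().__init__()
        self.register_buffer("active_edge_src", torch.zeros(max_edges, dtype=torch.int32))
        self.register_buffer("active_edge_dst", torch.zeros(max_edges, dtype=torch.int32))
        self.register_buffer("active_edge_attr", torch.zeros(max_edges, dtype=torch.int8))
        self.register_buffer("active_edge_score", torch.zeros(max_edges, dtype=torch.int8))
        # Empty compatibility buffer, shaped like an edge list with no edges.
        self.register_buffer("edge_index", torch.zeros((2, 0), dtype=torch.int64))


graph = TinyActiveGraph()
assert floating_state(graph) == []
assert graph.edge_index.shape == (2, 0)
```

The same helper applied to an ordinary float layer such as `nn.Linear` would list its `weight` and `bias`, which makes it usable as a regression guard against float state creeping back into ternary modules.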

## Remaining Risk

The kernels still use fp32 accumulators internally, and the losses still produce floating-point scalar values. This pass removes persistent fp16/fp32 ternary state and retained fp32 training hooks, which were the source of the memory leak/OOM concern. A fully integer activation/loss path would require a separate kernel-level redesign.