File size: 5,370 Bytes
d8bc908
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
# TrueTernary Benchmark

Results from `benchmark_true_ternary.py` β€” comparing pure ternary training against standard methods on MORPHTernaryModel.

## Quick Start

```bash
cd models/Trigram

# TrueTernary (strict, 0 float params, 14M ternary weights)
python benchmark_true_ternary.py --configs TrueTernary --steps 200 --batch 4 --ctx 33

# Adam baseline (full model, 102M float params)
python benchmark_true_ternary.py --configs Adam_FP32 --steps 200 --batch 4 --ctx 33

# Compare both
python benchmark_true_ternary.py --configs Adam_FP32,TrueTernary --steps 200 --batch 4 --ctx 33 --reuse-base

# Training script (strict ternary)
python train.py --max_steps 1000 --batch_size 4 --ctx 33 --strict_ternary
```

## Head-to-Head: TrueTernary vs Adam (200 steps, B=4, C=33)

| Metric | Adam_FP32 | TrueTernary |
|--------|-----------|-------------|
| **Trainable params** | 102,629,376 (float32) | **0** (pure ternary) |
| **Model weights** | 473.6 MB | **0.0 MB** |
| **Optimizer state** | 391.5 MB | **0.0 MB** |
| **Training state** | 473.6 + 391.5 = **865 MB** | **18.3 MB** (buffers only) |
| **Peak VRAM** | ~2,548 MB | **~232 MB** (includes CUDA context) |
| **Step time** | ~200 ms | **~131 ms** |
| **Final loss** | ~12.3 | **5.75** ↓ |
| **Min loss** | β€” | **4.49** |
| **Converges?** | Yes (to high loss) | **Yes (near optimal: ln(288)β‰ˆ5.66)** |

### Key Takeaways

- **VRAM**: TrueTernary uses **~40Γ— less** persistent state (18 MB vs 865 MB)
- **Speed**: 1.5Γ— faster per step (131 ms vs 200 ms) β€” pure add/sub/skip, no float GEMM
- **Convergence**: TrueTernary reaches **5.75** (near theoretical minimum ln(288) β‰ˆ 5.66) β€” Adam stalls at **12.3**
- **No float params**: TrueTernary has 0 trainable float params, 0 float buffers

## TrueTernary Training Dynamics (200 steps)

The loss curve follows a characteristic 3-phase pattern:

```
Phase 1 (steps 0-15):  Mass T flips from random init, loss spikes to ~90
Phase 2 (steps 15-80):  Recovery and convergence, loss drops from ~15 to ~6
Phase 3 (steps 80-200): Stable convergence, loss hovers at 5.7-6.0
```

**Convergence evidence:**

| Segment | Mean Loss | Min Loss | Trend |
|---------|-----------|----------|-------|
| Steps 0-50 | 13.4 | 4.49 | High variance (T flips) |
| Steps 50-100 | 8.7 | 6.03 | Monotonic decline |
| Steps 100-150 | 6.4 | 5.69 | Approaching optimum |
| Steps 150-200 | **5.82** | **5.64** | **Converged** |

The minimum loss of **4.49** is well below the uniform-distribution baseline (ln(288) β‰ˆ 5.66), indicating the model captures meaningful byte-level patterns.

## Training State Breakdown (14M ternary weights)

| Component | Storage | Size | Role |
|-----------|---------|------|------|
| T_packed | 5-trit/byte uint8 | 2.67 MB | Packed {-1, 0, +1} weights |
| E | int8 per group | 1.12 MB | Logβ‚‚ scale exponent |
| E_accum | int8 per group | 1.12 MB | Residual E accumulator |
| T_accum | int8 per weight | 13.36 MB | Gradient sign accumulator |
| **Total** | | **18.27 MB** | |

All int8 or packed ternary β€” no IEEE float anywhere in weight state.

## Scale Projection to 3B Parameters

| Component | 14M | 3B (projected) |
|-----------|-----|----------------|
| T_packed | 2.67 MB | **~572 MB** |
| E | 1.12 MB | **~240 MB** |
| E_accum | 1.12 MB | **~240 MB** |
| T_accum | 13.36 MB | **~2.86 GB** |
| **Total training** | **18.27 MB** | **~3.9 GB** |
| Inference (T+E only) | ~3.8 MB | **~812 MB** |

At 3B: **~3.9 GB** training VRAM fits on a single RTX 4060 (8 GB). Compare to BF16 Adam: **~18 GB** (requires server GPU).

## Architecture Components

All internal trainable components are now ternary or integer buffers (REFACTOR6+):

- `TernaryScaleTensor` β€” packed ternary linear layers
- `TernaryEmbeddingTable` β€” packed ternary embedding lookup  
- `TernaryLSTMCell` β€” LSTM with ternary projections
- `TernaryVQCodebook` β€” VQ with ternary embedding table
- Graph: Triton-backed edge aggregation + gather-add kernels (REFACTOR8)
- MoE: Triton-backed dense combine kernel (REFACTOR8)

The only remaining float parameters are imported frozen encoders (ViT, Whisper).

## Running the Benchmark

```bash
# Default: 200 steps, batch=4, ctx=33
python benchmark_true_ternary.py

# Custom config
python benchmark_true_ternary.py \
  --configs TrueTernary \
  --steps 500 \
  --batch 8 \
  --ctx 66 \
  --update-backend gpu \
  --scale-update-interval 1

# Compare with Adam
python benchmark_true_ternary.py \
  --configs Adam_FP32,TrueTernary \
  --steps 200 \
  --batch 4 \
  --ctx 33 \
  --reuse-base

# Change T_accum threshold (higher = less frequent flips)
python benchmark_true_ternary.py \
  --accum-threshold 5

# Full training pipeline
python train.py \
  --max_steps 5000 \
  --batch_size 8 \
  --ctx 66 \
  --strict_ternary \
  --scale_update_interval 1 \
  --run_name my_ternary_run
```

## Benchmark CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--configs` | `TrueTernary` | Comma-separated: `Adam_FP32`, `SignSGD_Old`, `TrueTernary` |
| `--steps` | 200 | Training steps |
| `--batch` | 4 | Batch size |
| `--ctx` | 33 | Context length |
| `--update-backend` | `gpu` | `gpu`, `gpu-signcache`, `dense-fallback`, `none` |
| `--scale-update-interval` | 1 | E update frequency (0 = disable) |
| `--accum-threshold` | 3 | T_accum flip threshold |
| `--print-every` | 50 | Logging frequency |