File size: 5,370 Bytes

d8bc908

# TrueTernary Benchmark

Results from `benchmark_true_ternary.py` — comparing pure ternary training against standard methods on MORPHTernaryModel.

## Quick Start

```bash
cd models/Trigram

# TrueTernary (strict, 0 float params, 14M ternary weights)
python benchmark_true_ternary.py --configs TrueTernary --steps 200 --batch 4 --ctx 33

# Adam baseline (full model, 102M float params)
python benchmark_true_ternary.py --configs Adam_FP32 --steps 200 --batch 4 --ctx 33

# Compare both
python benchmark_true_ternary.py --configs Adam_FP32,TrueTernary --steps 200 --batch 4 --ctx 33 --reuse-base

# Training script (strict ternary)
python train.py --max_steps 1000 --batch_size 4 --ctx 33 --strict_ternary
```

## Head-to-Head: TrueTernary vs Adam (200 steps, B=4, C=33)

| Metric | Adam_FP32 | TrueTernary |
|--------|-----------|-------------|
| **Trainable params** | 102,629,376 (float32) | **0** (pure ternary) |
| **Model weights** | 473.6 MB | **0.0 MB** |
| **Optimizer state** | 391.5 MB | **0.0 MB** |
| **Training state** | 473.6 + 391.5 = **865 MB** | **18.3 MB** (buffers only) |
| **Peak VRAM** | ~2,548 MB | **~232 MB** (includes CUDA context) |
| **Step time** | ~200 ms | **~131 ms** |
| **Final loss** | ~12.3 | **5.75** ↓ |
| **Min loss** | — | **4.49** |
| **Converges?** | Yes (to high loss) | **Yes (near optimal: ln(288)≈5.66)** |

### Key Takeaways

- **VRAM**: TrueTernary uses **~40× less** persistent state (18 MB vs 865 MB)
- **Speed**: 1.5× faster per step (131 ms vs 200 ms) — pure add/sub/skip, no float GEMM
- **Convergence**: TrueTernary reaches **5.75** (near theoretical minimum ln(288) ≈ 5.66) — Adam stalls at **12.3**
- **No float params**: TrueTernary has 0 trainable float params, 0 float buffers

## TrueTernary Training Dynamics (200 steps)

The loss curve follows a characteristic 3-phase pattern:

```
Phase 1 (steps 0-15):  Mass T flips from random init, loss spikes to ~90
Phase 2 (steps 15-80):  Recovery and convergence, loss drops from ~15 to ~6
Phase 3 (steps 80-200): Stable convergence, loss hovers at 5.7-6.0
```

**Convergence evidence:**

| Segment | Mean Loss | Min Loss | Trend |
|---------|-----------|----------|-------|
| Steps 0-50 | 13.4 | 4.49 | High variance (T flips) |
| Steps 50-100 | 8.7 | 6.03 | Monotonic decline |
| Steps 100-150 | 6.4 | 5.69 | Approaching optimum |
| Steps 150-200 | **5.82** | **5.64** | **Converged** |

The minimum loss of **4.49** is well below the uniform-distribution baseline (ln(288) ≈ 5.66), indicating the model captures meaningful byte-level patterns.

## Training State Breakdown (14M ternary weights)

| Component | Storage | Size | Role |
|-----------|---------|------|------|
| T_packed | 5-trit/byte uint8 | 2.67 MB | Packed {-1, 0, +1} weights |
| E | int8 per group | 1.12 MB | Log₂ scale exponent |
| E_accum | int8 per group | 1.12 MB | Residual E accumulator |
| T_accum | int8 per weight | 13.36 MB | Gradient sign accumulator |
| **Total** | | **18.27 MB** | |

All int8 or packed ternary — no IEEE float anywhere in weight state.

## Scale Projection to 3B Parameters

| Component | 14M | 3B (projected) |
|-----------|-----|----------------|
| T_packed | 2.67 MB | **~572 MB** |
| E | 1.12 MB | **~240 MB** |
| E_accum | 1.12 MB | **~240 MB** |
| T_accum | 13.36 MB | **~2.86 GB** |
| **Total training** | **18.27 MB** | **~3.9 GB** |
| Inference (T+E only) | ~3.8 MB | **~812 MB** |

At 3B: **~3.9 GB** training VRAM fits on a single RTX 4060 (8 GB). Compare to BF16 Adam: **~18 GB** (requires server GPU).

## Architecture Components

All internal trainable components are now ternary or integer buffers (REFACTOR6+):

- `TernaryScaleTensor` — packed ternary linear layers
- `TernaryEmbeddingTable` — packed ternary embedding lookup  
- `TernaryLSTMCell` — LSTM with ternary projections
- `TernaryVQCodebook` — VQ with ternary embedding table
- Graph: Triton-backed edge aggregation + gather-add kernels (REFACTOR8)
- MoE: Triton-backed dense combine kernel (REFACTOR8)

The only remaining float parameters are imported frozen encoders (ViT, Whisper).

## Running the Benchmark

```bash
# Default: 200 steps, batch=4, ctx=33
python benchmark_true_ternary.py

# Custom config
python benchmark_true_ternary.py \
  --configs TrueTernary \
  --steps 500 \
  --batch 8 \
  --ctx 66 \
  --update-backend gpu \
  --scale-update-interval 1

# Compare with Adam
python benchmark_true_ternary.py \
  --configs Adam_FP32,TrueTernary \
  --steps 200 \
  --batch 4 \
  --ctx 33 \
  --reuse-base

# Change T_accum threshold (higher = less frequent flips)
python benchmark_true_ternary.py \
  --accum-threshold 5

# Full training pipeline
python train.py \
  --max_steps 5000 \
  --batch_size 8 \
  --ctx 66 \
  --strict_ternary \
  --scale_update_interval 1 \
  --run_name my_ternary_run
```

## Benchmark CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--configs` | `TrueTernary` | Comma-separated: `Adam_FP32`, `SignSGD_Old`, `TrueTernary` |
| `--steps` | 200 | Training steps |
| `--batch` | 4 | Batch size |
| `--ctx` | 33 | Context length |
| `--update-backend` | `gpu` | `gpu`, `gpu-signcache`, `dense-fallback`, `none` |
| `--scale-update-interval` | 1 | E update frequency (0 = disable) |
| `--accum-threshold` | 3 | T_accum flip threshold |
| `--print-every` | 50 | Logging frequency |