# Preliminary TernaryScale Benchmark Results

**Date:** 2025-05-15  
**Hardware:** RTX 4060 8GB  
**Steps:** 2500 | **Warmup:** 250 | **Batch:** 64 | **CTX:** 66  
**Params:** 1,668,128 | **Ternary forward:** T32  
**Data:** enwik8 (train 1,003,854 bytes / val 111,540 bytes)  
**All 6 configs run in parallel per step**

## Loss Over Time

| Step | SignSGD-C | SignSGD-E | Lion bf16 | Lion FP32 | Adam bf16 | Adam FP32 |
|------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0    | 6.9770    | 8.0108    | 7.0011    | 6.5737    | 6.7317    | 6.8118    |
| 250  | 2.1486    | 1.9869    | 3.0933    | 3.1620    | 2.8921    | 2.7365    |
| 500  | 1.9801    | 1.7453    | 2.8211    | 2.5914    | 2.5739    | 2.4437    |
| 750  | 1.8682    | 1.6757    | 2.6109    | 2.5917    | 2.3861    | 2.2972    |
| 1000 | 1.8499    | 1.6601    | 2.6047    | 3.0808    | 2.3597    | 2.2618    |
| 1250 | 1.7943    | 1.6378    | 2.5462    | 3.2841    | 2.3527    | 2.3462    |
| 1500 | 1.7533    | 1.6506    | 2.5942    | 3.3279    | 2.4036    | 2.4838    |
| 1750 | 1.7601    | 1.6370    | 2.5652    | 3.3534    | 2.3876    | 2.5985    |
| 2000 | 1.7102    | 1.6108    | 2.5563    | 3.3116    | 2.3305    | 2.6721    |
| 2250 | 1.7354    | 1.5892    | 2.6064    | 3.4192    | 2.3509    | 2.7199    |
| 2500 | 1.6944    | 1.5861    | 2.5486    | 3.4403    | 2.3237    | 2.6991    |

## Summary

| Config | Final Loss | Min Loss | Loss@1k | Loss@2.5k | Step(ms) | Opt Mem | Total Mem |
|--------|-----------|----------|---------|------------|----------|---------|-----------|
| **SignSGD ConfigC** | 1.7113 | 1.5921 | 1.8099 | 1.6944 | 11.2 | 6.4 MB | 12.7 MB |
| **SignSGD ConfigE** | **1.6149** | **1.5157** | **1.7146** | **1.5861** | 11.2 | 6.4 MB | 12.7 MB |
| Lion bf16 | 2.5521 | 2.4668 | 2.5683 | 2.5486 | 11.2 | 9.6 MB | 12.7 MB |
| Lion FP32 | 3.4280 | 2.4911 | 3.0947 | 3.4403 | 11.2 | 12.7 MB | 19.1 MB |
| Adam bf16 | 2.3487 | 2.2480 | 2.3016 | 2.3237 | 11.2 | 9.6 MB | 12.7 MB |
| Adam FP32 | 2.7597 | 2.2049 | 2.3037 | 2.6991 | 11.2 | 19.1 MB | 25.4 MB |
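
The memory columns sit close to whole multiples of one FP32 copy of the 1,668,128 parameters. The sketch below checks that arithmetic. It assumes "MB" in the table means MiB (1024² bytes) and 4 bytes per FP32 value; it is an observation about the reported numbers, not a description of how the benchmark measured memory.

```python
# Arithmetic check on the memory columns (assumes MB == MiB, FP32 == 4 bytes).
PARAMS = 1_668_128      # parameter count from the header
BYTES_FP32 = 4
MIB = 1024 ** 2


def fp32_copies_mib(n_copies: int) -> float:
    """Size in MiB of n_copies full FP32 tensors with PARAMS elements each."""
    return n_copies * PARAMS * BYTES_FP32 / MIB


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"{n} FP32 copies: {fp32_copies_mib(n):.2f} MiB")
```

Under these assumptions, one copy is about 6.36 MiB, in line with the 6.4 MB reported for SignSGD, and three copies are about 19.09 MiB, in line with the 19.1 MB reported for Adam FP32.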

## Loss Ratio vs SignSGD ConfigC

| Config | Ratio | Verdict |
|--------|-------|---------|
| SignSGD ConfigE | **0.944x** | better |
| Lion bf16 | 1.491x | worse |
| Lion FP32 | 2.003x | worse |
| Adam bf16 | 1.373x | worse |
| Adam FP32 | 1.613x | worse |
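
The ratios above can be reproduced from the Final Loss column of the Summary table. The snippet below is a minimal check that divides each final loss by the SignSGD ConfigC value; because it starts from the rounded four-decimal losses, the last digit can differ slightly from the table.

```python
# Final losses copied from the Summary table.
final_loss = {
    "SignSGD ConfigC": 1.7113,
    "SignSGD ConfigE": 1.6149,
    "Lion bf16": 2.5521,
    "Lion FP32": 3.4280,
    "Adam bf16": 2.3487,
    "Adam FP32": 2.7597,
}

baseline = final_loss["SignSGD ConfigC"]

# Ratio < 1 means a lower (better) final loss than the ConfigC baseline.
ratios = {
    name: loss / baseline
    for name, loss in final_loss.items()
    if name != "SignSGD ConfigC"
}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        verdict = "better" if ratio < 1 else "worse"
        print(f"{name}: {ratio:.3f}x ({verdict})")
```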

## Peak VRAM: 388 MB (all 6 models combined)

## Key Findings

1. **SignSGD ConfigE wins**: lowest final loss (1.6149) at 12.7 MB total memory, tied for lowest with SignSGD ConfigC, Lion bf16, and Adam bf16, and with no optimizer state buffers
2. **SignSGD ConfigC second**: group-averaged S blurs magnitudes compared with ConfigE's per-element S
3. **FP32 momentum optimizers diverge on ternary**: Lion FP32 loss rises after step 750 and Adam FP32 loss rises after step 1000; neither recovers by step 2500
4. **bf16 helps momentum optimizers**: Lion bf16 (final loss 2.55) and Adam bf16 (2.35) are more stable than their FP32 variants (3.43 and 2.76) but still far worse than SignSGD
5. **Proposed root cause:** Momentum amplifies gradients that are then blocked by the ternary STE threshold mask, causing oscillation. SignSGD's pure sign update has no accumulated magnitude to amplify, so it does not conflict with ternary training in this way.
6. **SignSGD memory advantage:** 6.4 MB optimizer memory vs 19.1 MB for Adam FP32, about 3x less, with zero state buffers
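
To illustrate the amplification argument in finding 5, here is a minimal sketch that contrasts a stateless sign update with a plain heavy-ball momentum update under a constant gradient. It is not the benchmark's code: the function names, learning rate, and `beta` are hypothetical, heavy-ball momentum stands in for Lion and Adam (which use different update rules), and the ternary quantizer and STE mask are left out.

```python
def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signsgd_step(weights, grads, lr):
    """Stateless sign update: each element with a nonzero gradient moves by exactly lr."""
    return [w - lr * sign(g) for w, g in zip(weights, grads)]


def momentum_step(weights, grads, velocity, lr, beta=0.9):
    """Heavy-ball momentum (illustrative stand-in); needs one state buffer per weight."""
    new_velocity = [beta * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def run_demo(steps=5, lr=0.01):
    """Apply the same constant gradient to both optimizers and return final weights."""
    w_sign, w_mom, vel = [0.0], [0.0], [0.0]
    grads = [1.0]  # constant gradient, chosen only to show accumulation
    for _ in range(steps):
        w_sign = signsgd_step(w_sign, grads, lr)
        w_mom, vel = momentum_step(w_mom, grads, vel, lr)
    return w_sign[0], w_mom[0]


if __name__ == "__main__":
    w_sign, w_mom = run_demo()
    print(f"signSGD total movement:  {abs(w_sign):.6f}")
    print(f"momentum total movement: {abs(w_mom):.6f}")
```

In this toy setting the sign update moves the weight by `lr` per step regardless of history, while the momentum update grows with the accumulated velocity. Whether that growth is what drives the FP32 divergence reported above would need a separate ablation.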