Preliminary TernaryScale Benchmark Results

Date: 2025-05-15
Hardware: RTX 4060 8GB
Steps: 2500 | Warmup: 250 | Batch: 64 | CTX: 66
Params: 1,668,128 | Ternary forward: T32
Data: enwik8 (train 1,003,854 bytes / val 111,540 bytes)
All 6 configs run in parallel per step

Loss Over Time

Step	SignSGD-C	SignSGD-E	Lion bf16	Lion FP32	Adam bf16	Adam FP32
0	6.9770	8.0108	7.0011	6.5737	6.7317	6.8118
250	2.1486	1.9869	3.0933	3.1620	2.8921	2.7365
500	1.9801	1.7453	2.8211	2.5914	2.5739	2.4437
750	1.8682	1.6757	2.6109	2.5917	2.3861	2.2972
1000	1.8499	1.6601	2.6047	3.0808	2.3597	2.2618
1250	1.7943	1.6378	2.5462	3.2841	2.3527	2.3462
1500	1.7533	1.6506	2.5942	3.3279	2.4036	2.4838
1750	1.7601	1.6370	2.5652	3.3534	2.3876	2.5985
2000	1.7102	1.6108	2.5563	3.3116	2.3305	2.6721
2250	1.7354	1.5892	2.6064	3.4192	2.3509	2.7199
2500	1.6944	1.5861	2.5486	3.4403	2.3237	2.6991
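
The trends in the table above can be checked with a short script. The following is a minimal sketch that copies the logged checkpoints verbatim and reports, for each config, the best logged loss and how far the final checkpoint has regressed from it. The helper names (best_checkpoint, regression) are illustrative and not part of the benchmark code. It only sees the 250-step checkpoints, so it cannot recover the between-checkpoint minima reported in the Summary.

```python
# Logged checkpoints copied from the Loss Over Time table.
STEPS = [0, 250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500]
LOSSES = {
    "SignSGD-C": [6.9770, 2.1486, 1.9801, 1.8682, 1.8499, 1.7943,
                  1.7533, 1.7601, 1.7102, 1.7354, 1.6944],
    "SignSGD-E": [8.0108, 1.9869, 1.7453, 1.6757, 1.6601, 1.6378,
                  1.6506, 1.6370, 1.6108, 1.5892, 1.5861],
    "Lion bf16": [7.0011, 3.0933, 2.8211, 2.6109, 2.6047, 2.5462,
                  2.5942, 2.5652, 2.5563, 2.6064, 2.5486],
    "Lion FP32": [6.5737, 3.1620, 2.5914, 2.5917, 3.0808, 3.2841,
                  3.3279, 3.3534, 3.3116, 3.4192, 3.4403],
    "Adam bf16": [6.7317, 2.8921, 2.5739, 2.3861, 2.3597, 2.3527,
                  2.4036, 2.3876, 2.3305, 2.3509, 2.3237],
    "Adam FP32": [6.8118, 2.7365, 2.4437, 2.2972, 2.2618, 2.3462,
                  2.4838, 2.5985, 2.6721, 2.7199, 2.6991],
}


def best_checkpoint(losses):
    """Return (step, loss) for the lowest logged loss in a series."""
    i = min(range(len(losses)), key=losses.__getitem__)
    return STEPS[i], losses[i]


def regression(losses):
    """Final logged loss minus best logged loss (0.0 means still improving)."""
    return round(losses[-1] - min(losses), 4)


if __name__ == "__main__":
    for name, series in LOSSES.items():
        step, loss = best_checkpoint(series)
        print(f"{name:10s} best {loss:.4f} at step {step:4d}, "
              f"regression {regression(series):.4f}")
```

Both SignSGD configs reach their best logged loss at the final checkpoint. Lion FP32 bottoms out at step 500 and Adam FP32 at step 1000.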

Summary

Config	Final Loss	Min Loss	Loss@1k	Loss@2.5k	Step(ms)	Opt Mem	Total Mem
SignSGD ConfigC	1.7113	1.5921	1.8099	1.6944	11.2	6.4 MB	12.7 MB
SignSGD ConfigE	1.6149	1.5157	1.7146	1.5861	11.2	6.4 MB	12.7 MB
Lion bf16	2.5521	2.4668	2.5683	2.5486	11.2	9.6 MB	12.7 MB
Lion FP32	3.4280	2.4911	3.0947	3.4403	11.2	12.7 MB	19.1 MB
Adam bf16	2.3487	2.2480	2.3016	2.3237	11.2	9.6 MB	12.7 MB
Adam FP32	2.7597	2.2049	2.3037	2.6991	11.2	19.1 MB	25.4 MB

Loss Ratio vs SignSGD ConfigC

Config	Ratio	Verdict
SignSGD ConfigE	0.944x	better
Lion bf16	1.491x	worse
Lion FP32	2.003x	worse
Adam bf16	1.373x	worse
Adam FP32	1.613x	worse
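
The ratios above appear to be each config's Final Loss divided by the SignSGD ConfigC Final Loss from the Summary table. A minimal sketch of that arithmetic follows. The function name loss_ratios is illustrative, and the inputs are the four-decimal values from the table, so results can differ from the listed ratios in the last digit.

```python
# Final Loss column copied from the Summary table.
FINAL_LOSS = {
    "SignSGD ConfigC": 1.7113,
    "SignSGD ConfigE": 1.6149,
    "Lion bf16": 2.5521,
    "Lion FP32": 3.4280,
    "Adam bf16": 2.3487,
    "Adam FP32": 2.7597,
}


def loss_ratios(final_loss, baseline="SignSGD ConfigC"):
    """Return {config: (ratio, verdict)} relative to the baseline config."""
    base = final_loss[baseline]
    out = {}
    for name, loss in final_loss.items():
        if name == baseline:
            continue
        ratio = loss / base
        out[name] = (ratio, "better" if ratio < 1.0 else "worse")
    return out


if __name__ == "__main__":
    for name, (ratio, verdict) in loss_ratios(FINAL_LOSS).items():
        print(f"{name}\t{ratio:.3f}x\t{verdict}")
```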

Peak VRAM: 388 MB (all 6 models combined)

Key Findings

SignSGD ConfigE wins: lowest final loss (1.6149) and no optimizer state. Its 12.7 MB total memory ties SignSGD ConfigC and the two bf16 configs for the lowest.
SignSGD ConfigC is second (1.7113): its group-averaged S blurs magnitudes relative to ConfigE's per-element S.
FP32 momentum optimizers regress on ternary weights: Lion FP32 loss climbs from 2.5917 at step 750 to 3.4403 at step 2500, and Adam FP32 climbs from 2.2618 at step 1000 to 2.6991 at step 2500.
bf16 helps momentum optimizers: Lion bf16 (2.55) and Adam bf16 (2.35) are more stable than their FP32 variants (3.43, 2.76), but both plateau well above SignSGD (1.49x and 1.37x the ConfigC final loss).
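Proposed root cause (not isolated by this benchmark): momentum amplifies gradients that are then blocked by the ternary STE threshold mask, causing oscillation. SignSGD's pure sign update complements ternary training because every update has the same magnitude, so there is nothing to amplify and nothing for the mask to conflict with.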
SignSGD memory advantage: 6.4 MB optimizer memory vs 19.1 MB for Adam FP32, about 3x less, with zero state buffers.
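
To make the stateless sign update concrete, here is a minimal pure-Python sketch of a ternary forward pass, an STE mask, and one SignSGD step. It is not the benchmark's implementation. The threshold of 0.5, the clip of 1.0, the learning rate, and the function names are all assumptions chosen for illustration, and the scale S used by ConfigC and ConfigE is omitted because its definition is not given here.

```python
def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def ternarize(w, threshold=0.5):
    """Forward pass: map latent weights to {-1, 0, +1}.

    The threshold value is an assumption for this sketch.
    """
    return [sign(x) if abs(x) > threshold else 0 for x in w]


def ste_mask(w, clip=1.0):
    """Assumed STE mask: gradients pass only where |w| <= clip."""
    return [1.0 if abs(x) <= clip else 0.0 for x in w]


def signsgd_step(w, grad, lr=0.01):
    """One stateless update: each unmasked weight moves by exactly lr.

    No momentum or variance buffers are kept between calls.
    """
    mask = ste_mask(w)
    return [x - lr * m * sign(g) for x, g, m in zip(w, grad, mask)]


if __name__ == "__main__":
    w = [0.2, -0.7, 1.5]
    grad = [3.0, -0.001, 2.0]  # very different magnitudes
    print(ternarize(w))        # ternary weights used in the forward pass
    print(signsgd_step(w, grad))
```

In this sketch the first two weights each move by lr even though their gradients differ by more than three orders of magnitude, and the third weight is left unchanged because the assumed mask blocks it.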