# Preliminary TernaryScale Benchmark Results

**Date:** 2025-05-15
**Hardware:** RTX 4060 8GB
**Steps:** 2500 | **Warmup:** 250 | **Batch:** 64 | **CTX:** 66
**Params:** 1,668,128 | **Ternary forward:** T32
**Data:** enwik8 (train 1,003,854 bytes / val 111,540 bytes)
**All 6 configs run in parallel per step**

## Loss Over Time

| Step | SignSGD-C | SignSGD-E | Lion bf16 | Lion FP32 | Adam bf16 | Adam FP32 |
|------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0 | 6.9770 | 8.0108 | 7.0011 | 6.5737 | 6.7317 | 6.8118 |
| 250 | 2.1486 | 1.9869 | 3.0933 | 3.1620 | 2.8921 | 2.7365 |
| 500 | 1.9801 | 1.7453 | 2.8211 | 2.5914 | 2.5739 | 2.4437 |
| 750 | 1.8682 | 1.6757 | 2.6109 | 2.5917 | 2.3861 | 2.2972 |
| 1000 | 1.8499 | 1.6601 | 2.6047 | 3.0808 | 2.3597 | 2.2618 |
| 1250 | 1.7943 | 1.6378 | 2.5462 | 3.2841 | 2.3527 | 2.3462 |
| 1500 | 1.7533 | 1.6506 | 2.5942 | 3.3279 | 2.4036 | 2.4838 |
| 1750 | 1.7601 | 1.6370 | 2.5652 | 3.3534 | 2.3876 | 2.5985 |
| 2000 | 1.7102 | 1.6108 | 2.5563 | 3.3116 | 2.3305 | 2.6721 |
| 2250 | 1.7354 | 1.5892 | 2.6064 | 3.4192 | 2.3509 | 2.7199 |
| 2500 | 1.6944 | 1.5861 | 2.5486 | 3.4403 | 2.3237 | 2.6991 |

## Summary

| Config | Final Loss | Min Loss | Loss@1k | Loss@2.5k | Step (ms) | Opt Mem | Total Mem |
|--------|-----------|----------|---------|------------|----------|---------|-----------|
| **SignSGD ConfigC** | 1.7113 | 1.5921 | 1.8099 | 1.6944 | 11.2 | 6.4 MB | 12.7 MB |
| **SignSGD ConfigE** | **1.6149** | **1.5157** | **1.7146** | **1.5861** | 11.2 | 6.4 MB | 12.7 MB |
| Lion bf16 | 2.5521 | 2.4668 | 2.5683 | 2.5486 | 11.2 | 9.6 MB | 12.7 MB |
| Lion FP32 | 3.4280 | 2.4911 | 3.0947 | 3.4403 | 11.2 | 12.7 MB | 19.1 MB |
| Adam bf16 | 2.3487 | 2.2480 | 2.3016 | 2.3237 | 11.2 | 9.6 MB | 12.7 MB |
| Adam FP32 | 2.7597 | 2.2049 | 2.3037 | 2.6991 | 11.2 | 19.1 MB | 25.4 MB |

## Loss Ratio vs SignSGD ConfigC

Ratios are Final Loss divided by the SignSGD ConfigC Final Loss (1.7113); values below 1 are better.

| Config | Ratio | Verdict |
|--------|-------|---------|
| SignSGD ConfigE | **0.944x** | better |
| Lion bf16 | 1.491x | worse |
| Lion FP32 | 2.003x | worse |
| Adam bf16 | 1.373x | worse |
| Adam FP32 | 1.613x | worse |

## Peak VRAM: 388 MB (all 6 models combined)

## Key Findings

1. **SignSGD ConfigE wins:** lowest final loss (1.6149), lowest total memory (12.7 MB, tied with ConfigC and the bf16 configs), no optimizer state.
2. **SignSGD ConfigC second:** group-averaged S blurs magnitudes relative to ConfigE's per-element S.
3. **Momentum optimizers diverge under ternary training:** Lion FP32 loss rises after step 750 and Adam FP32 loss after step 1000, and neither recovers by step 2500.
4. **bf16 helps momentum optimizers:** Lion bf16 (2.55) and Adam bf16 (2.35) are more stable than their FP32 variants (3.43, 2.76) but still far worse than SignSGD.
5. **Root cause:** Momentum amplifies gradients that get blocked by the ternary STE threshold mask, causing oscillation. SignSGD's pure sign update naturally complements ternary training: no amplification, no conflict.
6. **SignSGD memory advantage:** 6.4 MB optimizer memory vs 19.1 MB (Adam FP32), roughly 3x savings, zero state buffers.
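
The Loss Ratio table can be rechecked from the Final Loss column of the Summary table. The snippet below is a minimal sketch of that arithmetic, not the benchmark's own code; `final_loss` and `loss_ratios` are hypothetical names chosen for illustration. It works from the rounded losses printed above, so a recomputed ratio may differ from the table in the last digit (Adam bf16 comes out as 1.372 rather than 1.373), presumably because the table was computed from unrounded losses.

```python
# Final Loss values copied from the Summary table (already rounded to 4 decimals).
final_loss = {
    "SignSGD ConfigC": 1.7113,
    "SignSGD ConfigE": 1.6149,
    "Lion bf16": 2.5521,
    "Lion FP32": 3.4280,
    "Adam bf16": 2.3487,
    "Adam FP32": 2.7597,
}


def loss_ratios(losses, baseline):
    """Return loss / baseline loss for every config except the baseline itself."""
    base = losses[baseline]
    return {
        name: round(loss / base, 3)
        for name, loss in losses.items()
        if name != baseline
    }


ratios = loss_ratios(final_loss, "SignSGD ConfigC")
for name, ratio in ratios.items():
    # A ratio below 1 means a lower final loss than the ConfigC baseline.
    verdict = "better" if ratio < 1.0 else "worse"
    print(f"{name}: {ratio:.3f}x ({verdict})")
```

Changing the `baseline` argument (for example to `"Adam FP32"`) gives the same comparison against a different reference config.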