# Preliminary TernaryScale Benchmark Results

**Date:** 2025-05-15
**Hardware:** RTX 4060 8GB
**Steps:** 2500 | **Warmup:** 250 | **Batch:** 64 | **CTX:** 66
**Params:** 1,668,128 | **Ternary forward:** T32
**Data:** enwik8 (train 1,003,854 bytes / val 111,540 bytes)
**All 6 configs run in parallel per step**

## Loss Over Time

| Step | SignSGD-C | SignSGD-E | Lion bf16 | Lion FP32 | Adam bf16 | Adam FP32 |
|------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0 | 6.9770 | 8.0108 | 7.0011 | 6.5737 | 6.7317 | 6.8118 |
| 250 | 2.1486 | 1.9869 | 3.0933 | 3.1620 | 2.8921 | 2.7365 |
| 500 | 1.9801 | 1.7453 | 2.8211 | 2.5914 | 2.5739 | 2.4437 |
| 750 | 1.8682 | 1.6757 | 2.6109 | 2.5917 | 2.3861 | 2.2972 |
| 1000 | 1.8499 | 1.6601 | 2.6047 | 3.0808 | 2.3597 | 2.2618 |
| 1250 | 1.7943 | 1.6378 | 2.5462 | 3.2841 | 2.3527 | 2.3462 |
| 1500 | 1.7533 | 1.6506 | 2.5942 | 3.3279 | 2.4036 | 2.4838 |
| 1750 | 1.7601 | 1.6370 | 2.5652 | 3.3534 | 2.3876 | 2.5985 |
| 2000 | 1.7102 | 1.6108 | 2.5563 | 3.3116 | 2.3305 | 2.6721 |
| 2250 | 1.7354 | 1.5892 | 2.6064 | 3.4192 | 2.3509 | 2.7199 |
| 2500 | 1.6944 | 1.5861 | 2.5486 | 3.4403 | 2.3237 | 2.6991 |

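To read the trend in each column, the logged values can be summarized by the step at which each config reaches its lowest logged loss and by how far the loss climbs from that point to step 2500. The sketch below does only that, using the values from the Loss Over Time table; `curve_summary` is a hypothetical helper name, and because the table is sampled every 250 steps its minima are coarser than the Min Loss column in the Summary, which presumably tracks every step.

```python
# Logged losses copied from the "Loss Over Time" table (one entry per 250 steps).
STEPS = [0, 250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500]
LOSSES = {
    "SignSGD-C": [6.9770, 2.1486, 1.9801, 1.8682, 1.8499, 1.7943,
                  1.7533, 1.7601, 1.7102, 1.7354, 1.6944],
    "SignSGD-E": [8.0108, 1.9869, 1.7453, 1.6757, 1.6601, 1.6378,
                  1.6506, 1.6370, 1.6108, 1.5892, 1.5861],
    "Lion bf16": [7.0011, 3.0933, 2.8211, 2.6109, 2.6047, 2.5462,
                  2.5942, 2.5652, 2.5563, 2.6064, 2.5486],
    "Lion FP32": [6.5737, 3.1620, 2.5914, 2.5917, 3.0808, 3.2841,
                  3.3279, 3.3534, 3.3116, 3.4192, 3.4403],
    "Adam bf16": [6.7317, 2.8921, 2.5739, 2.3861, 2.3597, 2.3527,
                  2.4036, 2.3876, 2.3305, 2.3509, 2.3237],
    "Adam FP32": [6.8118, 2.7365, 2.4437, 2.2972, 2.2618, 2.3462,
                  2.4838, 2.5985, 2.6721, 2.7199, 2.6991],
}


def curve_summary(steps, losses):
    """Return (step of lowest logged loss, lowest loss, rise from lowest to last)."""
    best = min(range(len(losses)), key=losses.__getitem__)
    return steps[best], losses[best], losses[-1] - losses[best]


for name, losses in LOSSES.items():
    step, low, rise = curve_summary(STEPS, losses)
    print(f"{name:10s} min {low:.4f} at step {step:4d}, rise to final {rise:+.4f}")
```

On these logged values, both SignSGD configs and Adam bf16 reach their minimum at the last step, while Lion FP32 and Adam FP32 bottom out at steps 500 and 1000 and then climb.
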
## Summary

| Config | Final Loss | Min Loss | Loss@1k | Loss@2.5k | Step(ms) | Opt Mem | Total Mem |
|--------|-----------|----------|---------|------------|----------|---------|-----------|
| **SignSGD ConfigC** | 1.7113 | 1.5921 | 1.8099 | 1.6944 | 11.2 | 6.4 MB | 12.7 MB |
| **SignSGD ConfigE** | **1.6149** | **1.5157** | **1.7146** | **1.5861** | 11.2 | 6.4 MB | 12.7 MB |
| Lion bf16 | 2.5521 | 2.4668 | 2.5683 | 2.5486 | 11.2 | 9.6 MB | 12.7 MB |
| Lion FP32 | 3.4280 | 2.4911 | 3.0947 | 3.4403 | 11.2 | 12.7 MB | 19.1 MB |
| Adam bf16 | 2.3487 | 2.2480 | 2.3016 | 2.3237 | 11.2 | 9.6 MB | 12.7 MB |
| Adam FP32 | 2.7597 | 2.2049 | 2.3037 | 2.6991 | 11.2 | 19.1 MB | 25.4 MB |

## Loss Ratio vs SignSGD ConfigC

| Config | Ratio | Verdict |
|--------|-------|---------|
| SignSGD ConfigE | **0.944x** | better |
| Lion bf16 | 1.491x | worse |
| Lion FP32 | 2.003x | worse |
| Adam bf16 | 1.373x | worse |
| Adam FP32 | 1.613x | worse |

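The ratios appear to be each config's Final Loss from the Summary divided by ConfigC's Final Loss. That reading is an inference from the numbers rather than something the report states, but it reproduces every row of the ratio table to within 0.001. A minimal sketch of the calculation, with `loss_ratios` as a hypothetical helper name:

```python
# Final Loss values copied from the Summary table.
FINAL_LOSS = {
    "SignSGD ConfigC": 1.7113,
    "SignSGD ConfigE": 1.6149,
    "Lion bf16": 2.5521,
    "Lion FP32": 3.4280,
    "Adam bf16": 2.3487,
    "Adam FP32": 2.7597,
}


def loss_ratios(final_loss, baseline="SignSGD ConfigC"):
    """Divide every config's final loss by the baseline's final loss."""
    base = final_loss[baseline]
    return {
        name: loss / base
        for name, loss in final_loss.items()
        if name != baseline
    }


for name, ratio in loss_ratios(FINAL_LOSS).items():
    # A ratio below 1.0 means the config ended with a lower loss than ConfigC.
    verdict = "better" if ratio < 1.0 else "worse"
    print(f"{name:16s} {ratio:.3f}x {verdict}")
```
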
## Peak VRAM: 388 MB (all 6 models combined)

## Key Findings

1. **SignSGD ConfigE wins:** lowest final loss (1.6149) at the lowest total memory (12.7 MB, tied with ConfigC and the bf16 configs), with no optimizer state buffers.
2. **SignSGD ConfigC is second** (1.7113): its group-averaged S blurs magnitudes compared with ConfigE's per-element S.
3. **Momentum optimizers diverge on ternary:** Lion FP32 loss rises after step 750 (2.59 to 3.44 by step 2500) and Adam FP32 loss rises after step 1000 (2.26 to 2.70).
4. **bf16 helps momentum optimizers:** Lion bf16 (2.55) and Adam bf16 (2.35) are more stable than their FP32 variants (3.43, 2.76) but still far worse than SignSGD.
5. **Root cause:** momentum amplifies gradients that the ternary STE threshold mask then blocks, causing oscillation. SignSGD's pure sign update complements ternary training because it involves no amplification and therefore no conflict with the mask.
6. **SignSGD memory advantage:** 6.4 MB optimizer memory vs 19.1 MB for Adam FP32, a 3x saving with zero state buffers.

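One way to picture the momentum effect described in finding 5 is to compare a stateless sign update with a Lion-style update on a scalar weight whose gradient changes sign. The sketch below is a toy under stated assumptions, not the benchmark code: it does not model the ternary STE threshold mask or the ConfigC/ConfigE scale handling (neither is specified here), the function names are hypothetical, and the `beta1`/`beta2` values are commonly used Lion defaults rather than the settings of these runs.

```python
def sign(x):
    """Return -1.0, 0.0, or +1.0."""
    return float((x > 0) - (x < 0))


def signsgd_step(w, grad, lr):
    """Stateless sign update: the step depends only on the current gradient."""
    return w - lr * sign(grad)


def lion_step(w, grad, m, lr, beta1=0.9, beta2=0.99):
    """Lion-style update: the sign is taken of a momentum/gradient blend."""
    update = sign(beta1 * m + (1.0 - beta1) * grad)
    m = beta2 * m + (1.0 - beta2) * grad  # momentum buffer carried between steps
    return w - lr * update, m


def run(grads, lr=0.01):
    """Apply both updates to the same scalar gradient sequence."""
    w_sign, w_lion, m = 0.0, 0.0, 0.0
    for g in grads:
        w_sign = signsgd_step(w_sign, g, lr)
        w_lion, m = lion_step(w_lion, g, m, lr)
    return w_sign, w_lion


# Toy gradient: 20 steps of +1, then the sign flips for 5 steps.
grads = [1.0] * 20 + [-1.0] * 5
w_sign, w_lion = run(grads)
print(f"SignSGD weight: {w_sign:+.2f}")
print(f"Lion weight:    {w_lion:+.2f}")
```

With this toy sequence the sign update reverses as soon as the gradient flips and ends at -0.15, while the Lion-style update keeps moving in the stale direction for all five flipped steps and ends at -0.25. It also needs the extra buffer `m`, which the sign update does not.
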