Preliminary TernaryScale Benchmark Results

Date: 2025-05-15
Hardware: RTX 4060 8GB
Steps: 2500 | Warmup: 250 | Batch: 64 | CTX: 66
Params: 1,668,128 | Ternary forward: T32
Data: enwik8 (train 1,003,854 bytes / val 111,540 bytes)
All 6 configs run in parallel per step

Loss Over Time

| Step | SignSGD-C | SignSGD-E | Lion bf16 | Lion FP32 | Adam bf16 | Adam FP32 |
|------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0    | 6.9770 | 8.0108 | 7.0011 | 6.5737 | 6.7317 | 6.8118 |
| 250  | 2.1486 | 1.9869 | 3.0933 | 3.1620 | 2.8921 | 2.7365 |
| 500  | 1.9801 | 1.7453 | 2.8211 | 2.5914 | 2.5739 | 2.4437 |
| 750  | 1.8682 | 1.6757 | 2.6109 | 2.5917 | 2.3861 | 2.2972 |
| 1000 | 1.8499 | 1.6601 | 2.6047 | 3.0808 | 2.3597 | 2.2618 |
| 1250 | 1.7943 | 1.6378 | 2.5462 | 3.2841 | 2.3527 | 2.3462 |
| 1500 | 1.7533 | 1.6506 | 2.5942 | 3.3279 | 2.4036 | 2.4838 |
| 1750 | 1.7601 | 1.6370 | 2.5652 | 3.3534 | 2.3876 | 2.5985 |
| 2000 | 1.7102 | 1.6108 | 2.5563 | 3.3116 | 2.3305 | 2.6721 |
| 2250 | 1.7354 | 1.5892 | 2.6064 | 3.4192 | 2.3509 | 2.7199 |
| 2500 | 1.6944 | 1.5861 | 2.5486 | 3.4403 | 2.3237 | 2.6991 |
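
One way to read the table for divergence is to find the logged step with the lowest loss in each column and compare it with the final logged value. The sketch below does this for three columns copied from the table; the helper name `best_logged` is hypothetical, and only the 250-step checkpoints are visible, so the true minimum may fall between them.

```python
# Logged checkpoints and three loss columns copied from the table above.
steps = [0, 250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500]
losses = {
    "SignSGD-E": [8.0108, 1.9869, 1.7453, 1.6757, 1.6601, 1.6378,
                  1.6506, 1.6370, 1.6108, 1.5892, 1.5861],
    "Lion FP32": [6.5737, 3.1620, 2.5914, 2.5917, 3.0808, 3.2841,
                  3.3279, 3.3534, 3.3116, 3.4192, 3.4403],
    "Adam FP32": [6.8118, 2.7365, 2.4437, 2.2972, 2.2618, 2.3462,
                  2.4838, 2.5985, 2.6721, 2.7199, 2.6991],
}

def best_logged(series):
    """Return (step, loss) of the lowest logged loss in one column."""
    i = min(range(len(series)), key=series.__getitem__)
    return steps[i], series[i]

for name, series in losses.items():
    step, best = best_logged(series)
    rebound = series[-1] - best  # how far the final value sits above the best
    print(f"{name}: best {best:.4f} at step {step}, final is +{rebound:.4f} above it")
```

A column whose best checkpoint is the last one is still improving; a large positive rebound marks a run that got worse after its best point.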

Summary

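| Config | Final Loss | Min Loss | Loss@1k | Loss@2.5k | Step (ms) | Opt Mem | Total Mem |
|--------|------------|----------|---------|-----------|-----------|---------|-----------|
| SignSGD ConfigC | 1.7113 | 1.5921 | 1.8099 | 1.6944 | 11.2 | 6.4 MB | 12.7 MB |
| SignSGD ConfigE | 1.6149 | 1.5157 | 1.7146 | 1.5861 | 11.2 | 6.4 MB | 12.7 MB |
| Lion bf16 | 2.5521 | 2.4668 | 2.5683 | 2.5486 | 11.2 | 9.6 MB | 12.7 MB |
| Lion FP32 | 3.4280 | 2.4911 | 3.0947 | 3.4403 | 11.2 | 12.7 MB | 19.1 MB |
| Adam bf16 | 2.3487 | 2.2480 | 2.3016 | 2.3237 | 11.2 | 9.6 MB | 12.7 MB |
| Adam FP32 | 2.7597 | 2.2049 | 2.3037 | 2.6991 | 11.2 | 19.1 MB | 25.4 MB |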
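
Several of the memory figures line up with whole FP32 copies of the 1,668,128 parameters. The arithmetic below is a sketch under two assumptions that the source does not state: that "MB" here means MiB (2^20 bytes), and that each buffer stores 4 bytes per parameter. Which tensors each column actually counts is also not stated, so the "copies" reading is an interpretation.

```python
PARAMS = 1_668_128   # parameter count from the run header
BYTES_FP32 = 4       # assumption: 4 bytes per FP32 value
MIB = 1024 ** 2      # assumption: "MB" in the table means MiB

def copies_mib(n_copies):
    """Size in MiB of n parameter-sized FP32 buffers."""
    return PARAMS * BYTES_FP32 * n_copies / MIB

for n in (1, 2, 3):
    print(f"{n} x FP32 copy: {copies_mib(n):.1f} MiB")
# 1 x FP32 copy: 6.4 MiB
# 2 x FP32 copy: 12.7 MiB
# 3 x FP32 copy: 19.1 MiB
```

By the same arithmetic, 1.5 and 4 copies come to 9.5 and 25.5 MiB, slightly off the table's 9.6 MB and 25.4 MB, so those entries were probably computed or rounded differently.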

Loss Ratio vs SignSGD ConfigC

| Config | Ratio | Verdict |
|--------|-------|---------|
| SignSGD ConfigE | 0.944x | better |
| Lion bf16 | 1.491x | worse |
| Lion FP32 | 2.003x | worse |
| Adam bf16 | 1.373x | worse |
| Adam FP32 | 1.613x | worse |
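
The ratios match each config's Final Loss from the Summary divided by SignSGD ConfigC's Final Loss (1.7113). A short sketch of that calculation follows; the variable names are illustrative, and because the inputs are rounded to four decimals the last digit can differ from the table (Adam bf16 comes out as 1.372 here).

```python
# Final Loss values copied from the Summary table.
final_loss = {
    "SignSGD ConfigC": 1.7113,
    "SignSGD ConfigE": 1.6149,
    "Lion bf16": 2.5521,
    "Lion FP32": 3.4280,
    "Adam bf16": 2.3487,
    "Adam FP32": 2.7597,
}

baseline = final_loss["SignSGD ConfigC"]
ratios = {name: loss / baseline for name, loss in final_loss.items()}

for name, ratio in ratios.items():
    if name == "SignSGD ConfigC":
        continue  # the baseline is 1.000x by definition
    verdict = "better" if ratio < 1.0 else "worse"
    print(f"{name}: {ratio:.3f}x {verdict}")
```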

Peak VRAM: 388 MB (all 6 models combined)

Key Findings

  1. SignSGD ConfigE wins: lowest final loss (1.6149) at the lowest total memory (12.7 MB, tied with ConfigC and the two bf16 configs), with no optimizer state.
  2. SignSGD ConfigC is second (1.7113): its group-averaged S blurs magnitudes compared with ConfigE's per-element S.
  3. Momentum optimizers diverge on ternary in FP32: Lion FP32 loss rises after step 750 and Adam FP32 loss rises after step 1000.
  4. bf16 helps momentum optimizers: Lion bf16 (2.55) and Adam bf16 (2.35) finish lower than their FP32 variants (3.43, 2.76) but remain far worse than SignSGD.
  5. Root cause: momentum amplifies gradients that are then blocked by the ternary STE threshold mask, which causes oscillation. SignSGD's pure sign update complements ternary training: no amplification, no conflict.
  6. SignSGD memory advantage: 6.4 MB optimizer memory vs 19.1 MB for Adam FP32, a 3x saving, with zero state buffers.
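
To illustrate the amplification argument in finding 5, the sketch below contrasts a stateless sign update with a plain heavy-ball momentum update on one weight. It is not the ConfigC/ConfigE, Lion, or Adam code used in the benchmark: the function names, learning rate, and `beta` are assumptions chosen for illustration, and the ternary STE mask is left out.

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def signsgd_step(weights, grads, lr):
    """Stateless update: each weight moves by exactly lr (or not at all if g == 0)."""
    return [w - lr * sign(g) for w, g in zip(weights, grads)]

def momentum_step(weights, grads, velocity, lr, beta=0.9):
    """Heavy-ball update: needs one extra buffer (velocity) per weight."""
    velocity = [beta * v + g for v, g in zip(velocity, grads)]
    weights = [w - lr * v for w, v in zip(weights, velocity)]
    return weights, velocity

lr = 0.01
grads = [1.0]                      # the same gradient on three consecutive steps
w_sign, w_mom, vel = [0.0], [0.0], [0.0]
sign_moves, mom_moves = [], []
for _ in range(3):
    new_sign = signsgd_step(w_sign, grads, lr)
    new_mom, vel = momentum_step(w_mom, grads, vel, lr)
    sign_moves.append(abs(new_sign[0] - w_sign[0]))
    mom_moves.append(abs(new_mom[0] - w_mom[0]))
    w_sign, w_mom = new_sign, new_mom

print("signSGD step sizes: ", [round(m, 4) for m in sign_moves])
print("momentum step sizes:", [round(m, 4) for m in mom_moves])
```

With a repeated gradient the sign update keeps a constant step of `lr`, while the momentum step grows as the velocity buffer accumulates. The momentum version also carries a `velocity` list as large as the weights, which is the kind of state buffer finding 6 refers to.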