
MORPH Spike Testing Notes, Phase 0: Scaled Ternary Experiment

Date: 2026-05-12
File: test-stp.py (renamed from the original spike.py plan; lives in repo root)


Test Results Summary

All 3 Configs Trained Successfully (5000 steps each)

| Config | Full Name | Val Loss | BPW | Params | vs FP32 |
|---|---|---|---|---|---|
| A | BitNet (α · T) | 2.3728 | 16.0 | 115,072 | 1.16× worse |
| B | RMS-Scaled (S⊙T, S=1/rms(x)) | 4.4917 | 1.58 | 115,072 | 2.19× worse |
| C | Learned-S (S⊙T, S=nn.Parameter) | 2.8817 | 1.58 | 115,074 | 1.41× worse |
| - | FP32 baseline (nn.Linear) | 2.0492 | 32.0 | 115,072 | baseline |

Success Criteria (D-13: C_loss ≤ 1.25 × A_loss)

  • C/A = 1.214× → ✅ PASS (within 25% margin)
  • B/A = 1.893× → ❌ FAIL (RMS-S alone insufficient)
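The D-13 check is plain arithmetic on the validation losses in the results table. A minimal sketch that reproduces both ratios (the dictionary and function names are illustrative, not taken from test-stp.py):

```python
# Validation losses copied from the results table.
val_loss = {"A": 2.3728, "B": 4.4917, "C": 2.8817, "FP32": 2.0492}
MAX_RATIO = 1.25  # D-13: candidate loss must be <= 1.25 x Config A loss


def ratio_to(baseline: str, candidate: str) -> float:
    """Return candidate loss divided by baseline loss."""
    return val_loss[candidate] / val_loss[baseline]


for name in ("C", "B"):
    r = ratio_to("A", name)
    verdict = "PASS" if r <= MAX_RATIO else "FAIL"
    print(f"{name}/A = {r:.3f}x -> {verdict}")
```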

Extended Comparison (vs FP32)

  • Config A (BitNet): 1.16× FP32 loss at 16 bpw → good ternary baseline
  • Config C (Learned-S): 1.41× FP32 loss at 1.58 bpw → ~10× less weight memory than 16 bpw storage (~20× less than FP32) for 41% higher val loss
  • Config B (RMS-S): 2.19× FP32 loss at 1.58 bpw → input-derived scaling not enough

Key Findings

1. Config A Weight Initialization Bug

Original: std=0.01 left essentially all weights below the 0.05 threshold (the threshold sits 5 standard deviations from zero) → all-zero ternary → zero gradients → training relied only on bias + embedding.

Fix: Changed to std=0.1 (same as B/C). For a zero-mean Gaussian init this places ~62% of weights above the threshold, with ~38% remaining inside the dead zone → active STE gradient flow from step 1.
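The effect of the init scale on the dead zone can be checked in closed form. The sketch below assumes the weights are drawn from a zero-mean Gaussian (as `torch.randn(...) * std` would produce) and uses the 0.05 threshold from these notes; the helper name is illustrative.

```python
import math


def frac_above(threshold: float, std: float) -> float:
    """P(|w| > threshold) for w ~ N(0, std^2), via the complementary error function."""
    return math.erfc(threshold / (std * math.sqrt(2.0)))


for std in (0.01, 0.1):
    above = frac_above(0.05, std)
    print(f"std={std}: {above:.6f} above threshold, {1.0 - above:.6f} in dead zone")
```

Under this assumption, std=0.01 leaves almost nothing above the threshold, while std=0.1 leaves roughly three fifths of the weights active at initialization.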

2. Config C Learned S Convergence

  • S values initialized at 1.0, converged to ~0.29–0.31 by step 5000
  • S gradients remained healthy (0.01–0.7 range): no collapse or explosion
  • Learning rate 3e-4 with AdamW worked well for S optimization

3. TernarizeSTE Gradient Behavior

  • With std=0.1 init: ~57% nonzero gradients at step 1, in the same range as the ~62% of weights that a zero-mean Gaussian init with std=0.1 places above the 0.05 threshold
  • Dead zone (|w| < 0.05) is the main gradient bottleneck, which confirms the need for a sticky zone or threshold warmup in Phase 3
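To make the dead-zone behavior concrete, here is a minimal sketch of a ternarizing straight-through estimator written as a `torch.autograd.Function`. It is not the code from test-stp.py: the backward rule (pass gradients where |w| ≥ threshold, zero them inside the dead zone) is an assumption inferred from the findings above, and the tensor shape is only an example.

```python
import torch


class TernarizeSTE(torch.autograd.Function):
    """Map weights to {-1, 0, +1}; gradient masking in the dead zone is an assumption."""

    @staticmethod
    def forward(ctx, w, threshold):
        active = w.abs() >= threshold          # True outside the dead zone
        ctx.save_for_backward(active)
        return torch.sign(w) * active.to(w.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        (active,) = ctx.saved_tensors
        # Straight-through outside the dead zone, zero inside it.
        # The second return value is the (non-existent) gradient for `threshold`.
        return grad_out * active.to(grad_out.dtype), None


torch.manual_seed(0)
w = (torch.randn(128, 512) * 0.1).requires_grad_()   # std=0.1 init, example shape
t = TernarizeSTE.apply(w, 0.05)
t.sum().backward()
nonzero = (w.grad != 0).float().mean().item()
print(f"nonzero gradient fraction at init: {nonzero:.2f}")
```

With this masking rule, the nonzero-gradient fraction at initialization simply equals the fraction of weights outside the dead zone, which is why the init scale matters so much.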

4. Config B Limitations

  • RMS-derived S depends on input magnitude, not weight distribution
  • Cannot learn task-specific scaling → 1.89× the BitNet (Config A) val loss
  • Useful insight: in this spike, learned S (Config C) clearly outperformed input-derived S (Config B), with val loss 2.88 vs 4.49
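The difference between the two scaling schemes can be sketched in a few lines. The functions below are assumed forms based on the formulas in the results table (S = 1/rms(x) for Config B, a trainable scalar S for Config C), not the layers from test-stp.py; names and shapes are illustrative.

```python
import torch


def rms_scaled_linear(x, t, eps=1e-8):
    """Config B style (assumed form): S = 1 / rms(x), derived from the input only."""
    rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + eps)
    return (x @ t.T) / rms


def learned_scaled_linear(x, t, s):
    """Config C style (assumed form): S is a trainable scalar."""
    return s * (x @ t.T)


torch.manual_seed(0)
t = torch.randint(-1, 2, (128, 512)).float()   # fixed ternary weights in {-1, 0, +1}
x = torch.randn(4, 512)

# B: rescaling the input leaves the output unchanged, and there is no scale to train.
out_b = rms_scaled_linear(x, t)
out_b_rescaled = rms_scaled_linear(3.0 * x, t)

# C: the scalar receives a gradient like any other parameter.
s = torch.nn.Parameter(torch.tensor(1.0))
learned_scaled_linear(x, t, s).sum().backward()

print(torch.allclose(out_b, out_b_rescaled, atol=1e-3), s.grad is not None)
```

In this form, Config B's scale is fully determined by the input batch, so the optimizer has nothing to adjust, whereas Config C's S can settle wherever the task requires.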

5. All Configs Still Learning at Step 5000

All three configs showed decreasing val_loss between steps 4500-5000. Longer training (10K+ steps) would likely:

  • Narrow the A vs C gap
  • Potentially push Config B closer to the D-13 threshold (1.25× Config A loss)

Bug Fixes Applied During Testing

| Bug | Location | Fix |
|---|---|---|
| `TRAIN_PARAMS` keys not strings | Lines 9-19 | Changed `batch:` → `"batch_size":` etc. |
| `TernarizeSTE` was `nn.Module` | Lines 25-32 | Changed to `torch.autograd.Function` with `@staticmethod` |
| All `__init__` missing `self` | Lines 26, 35, 44, 56, 68 | Added `self` parameter |
| `torch.randn("out","in")` | Lines 46, 58, 70 | Changed to `torch.randn(out_dim, in_dim)` |
| `x^2` (XOR, not power) | Line 61 | Changed to `x**2` |
| `math.sqrt` with tensor | Line 61 | Changed to `torch.sqrt(torch.mean(...))` |
| Missing `return` in forwards | Lines 52, 64, 75 | Added `return` statements |
| BitNet std=0.01 | Line 71 | Changed to std=0.1 for active gradient flow |

Architecture Validation

ByteMLP (base class)
├── Embedding(256, 64) → [B, 8, 64] → flatten → [B, 512]
├── fc1: Config-dependent linear layer → [B, 128]
├── ReLU
├── fc2: Config-dependent linear layer → [B, 256]
└── CrossEntropyLoss against y[:, -1]
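The parameter counts in the results table follow from these shapes. A short sketch of the arithmetic, assuming both linear layers carry a bias and that Config C adds one scalar S per layer (the constant names are illustrative):

```python
# Shapes taken from the architecture tree above.
VOCAB, EMBED_DIM, CONTEXT = 256, 64, 8
HIDDEN, OUT = 128, 256

embedding = VOCAB * EMBED_DIM                      # 256 x 64 table
fc1 = (CONTEXT * EMBED_DIM) * HIDDEN + HIDDEN      # 512 -> 128, plus bias (assumed)
fc2 = HIDDEN * OUT + OUT                           # 128 -> 256, plus bias (assumed)

base_params = embedding + fc1 + fc2
learned_s_params = base_params + 2                 # one scalar S per layer (assumed)

print(f"A/B/FP32: {base_params:,}  C: {learned_s_params:,}")
```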

Effective BPW calculations confirmed:

  • Config A: 16.0 bpw (FP16 shadow weights)
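  • Config B: 1.58 bpw (ternary {-1,0,+1} takes 2 bits in a naive encoding; the information-theoretic cost of three states is log2(3) ≈ 1.58 bits)
  • Config C: 1.583 bpw (ternary + 2 scalar S params in total, one per layer, consistent with the 115,074 parameter count)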
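The 1.58 figure is the entropy bound for a three-state symbol, and the compression ratios follow directly from it. A small sketch of the arithmetic; it assumes that only the fc1 and fc2 weight matrices are ternary and that each S scalar would be stored in 32 bits, neither of which is stated in these notes:

```python
import math

bits_per_ternary = math.log2(3)                 # lower bound for 3 equiprobable states

# Storage ratio relative to 16 bpw and 32 bpw weights.
vs_fp16 = 16.0 / bits_per_ternary
vs_fp32 = 32.0 / bits_per_ternary

# Extra cost of two scale scalars spread over the ternary weights.
ternary_weights = 512 * 128 + 128 * 256         # fc1 + fc2 weight matrices (assumed scope)
scale_overhead_bpw = (2 * 32) / ternary_weights  # two S scalars at 32 bits each (assumed)

print(f"log2(3) = {bits_per_ternary:.3f} bpw")
print(f"ratio vs 16 bpw: {vs_fp16:.1f}x, vs 32 bpw: {vs_fp32:.1f}x")
print(f"scalar overhead: {scale_overhead_bpw:.5f} bpw")
```

Under these assumptions the scalar overhead is negligible next to the ternary weights themselves.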

Recommendations for Phase 3

  1. Use Config C (Learned-S) architecture: validated as viable
  2. Initialize steering weights at std=0.1: critical for gradient flow
  3. Add threshold warmup (0.01→0.05 over first 10% steps) to reduce dead-zone effects
  4. Consider sticky zone STE for graph edges specifically (deferred from spike)
  5. Extend training to 10K+ steps for fair comparison at convergence
  6. Phase 8 idea: Hybrid ternary + FP8 selective precision could close the gap to BF16
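The threshold warmup in recommendation 3 could be scheduled as follows. This is a sketch only: the notes give the endpoints (0.01 to 0.05) and the duration (first 10% of steps) but not the shape of the ramp, so linear interpolation is an assumption, and the function name is hypothetical.

```python
def warmup_threshold(step: int, total_steps: int,
                     start: float = 0.01, end: float = 0.05,
                     warmup_frac: float = 0.10) -> float:
    """Ternarization threshold at `step`, ramping linearly (assumed) during warmup."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps


# Example: a 10K-step run ramps over the first 1000 steps.
for step in (0, 250, 500, 1000, 5000):
    print(step, round(warmup_threshold(step, 10_000), 4))
```

Starting with a narrow dead zone lets more weights receive gradients early in training, before the threshold settles at its final value.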