MORPH Spike Testing Notes - Phase 0: Scaled Ternary Experiment
Date: 2026-05-12
File: test-stp.py (renamed from the original spike.py plan; lives in repo root)
Test Results Summary
All 3 Configs Trained Successfully (5000 steps each)
| Config | Full Name | Val Loss | BPW | Params | vs FP32 |
|---|---|---|---|---|---|
| A | BitNet (α · T) | 2.3728 | 16.0 | 115,072 | 1.16× worse |
| B | RMS-Scaled (S⊙T, S=1/rms(x)) | 4.4917 | 1.58 | 115,072 | 2.19× worse |
| C | Learned-S (S⊙T, S=nn.Parameter) | 2.8817 | 1.58 | 115,074 | 1.41× worse |
| - | FP32 baseline (nn.Linear) | 2.0492 | 32.0 | 115,072 | baseline |
Success Criteria (D-13: C_loss ≤ 1.25 × A_loss)
- C/A = 1.214× → ✅ PASS (within the 25% margin)
- B/A = 1.893× → ❌ FAIL (RMS-S alone is insufficient)
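The D-13 check can be reproduced from the table values alone. The snippet below is a small sketch; the helper names `loss_ratio` and `passes_d13` are illustrative and do not come from test-stp.py.

```python
# Validation losses copied from the results table.
VAL_LOSS = {"A": 2.3728, "B": 4.4917, "C": 2.8817, "FP32": 2.0492}
MAX_RATIO = 1.25  # D-13 threshold: loss must be <= 1.25 x Config A loss


def loss_ratio(config: str, reference: str = "A") -> float:
    """Validation loss of `config` divided by that of `reference`."""
    return VAL_LOSS[config] / VAL_LOSS[reference]


def passes_d13(config: str) -> bool:
    """True if the config is within the D-13 margin of Config A."""
    return loss_ratio(config) <= MAX_RATIO


for name in ("B", "C"):
    verdict = "PASS" if passes_d13(name) else "FAIL"
    print(f"{name}/A = {loss_ratio(name):.3f}x -> {verdict}")
```

Passing `reference="FP32"` gives the ratios used in the extended comparison.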
Extended Comparison (vs FP32)
- Config A (BitNet): 1.16× FP32 loss at 16 bpw, a good ternary baseline
- Config C (Learned-S): 1.41× FP32 loss at 1.58 bpw → ~10× fewer bits per weight than Config A (~20× fewer than FP32) for a 41% higher loss
- Config B (RMS-S): 2.19× FP32 loss at 1.58 bpw; input-derived scaling is not enough
Key Findings
1. Config A Weight Initialization Bug
Original: std=0.01 left ~99% of weights below the 0.05 threshold → all-zero ternary weights → zero gradients → training relied only on bias + embedding.
Fix: Changed to std=0.1 (same as B/C). This put ~38% of weights above threshold → active STE gradient flow from step 1.
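The effect of the init scale on the dead zone can be estimated in closed form, assuming the weights are i.i.d. zero-mean Gaussian (an assumption; the notes only give the std). The function name below is illustrative. Under this assumption, std=0.01 leaves essentially no weights above the threshold, while std=0.1 leaves roughly 38% inside the |w| < 0.05 band and roughly 62% outside it. The ~38% quoted above may therefore refer to the dead-zone fraction or to a different init or threshold convention.

```python
import math


def fraction_above_threshold(std: float, threshold: float = 0.05) -> float:
    """P(|w| > threshold) for w ~ N(0, std^2), via the complementary error function."""
    return math.erfc(threshold / (std * math.sqrt(2.0)))


for std in (0.01, 0.1):
    above = fraction_above_threshold(std)
    print(f"std={std}: {above:.2%} above threshold, {1 - above:.2%} in dead zone")
```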
2. Config C Learned S Convergence
- S values initialized at 1.0, converged to ~0.29–0.31 by step 5000
- S gradients remained healthy (0.01–0.7 range), with no collapse or explosion
- Learning rate 3e-4 with AdamW worked well for S optimization
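A minimal PyTorch sketch of what a Config C style layer could look like. It assumes a single scalar S per layer (consistent with the 2 extra parameters in the results table) and an STE whose backward pass blocks gradients inside the dead zone. The class name `LearnedSLinear`, the bias term, and the masking rule are assumptions for illustration, not code taken from test-stp.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TernarizeSTE(torch.autograd.Function):
    """Ternarize to {-1, 0, +1}; assumed backward: pass gradients only outside the dead zone."""

    @staticmethod
    def forward(ctx, w, threshold):
        mask = w.abs() >= threshold           # True where the weight is "active"
        ctx.save_for_backward(mask)
        return torch.sign(w) * mask.to(w.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        # No gradient for the (non-tensor) threshold argument.
        return grad_out * mask.to(grad_out.dtype), None


class LearnedSLinear(nn.Module):
    """Linear layer computing x @ (S * T)^T + b with a learned scalar S."""

    def __init__(self, in_dim, out_dim, threshold=0.05, std=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * std)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.S = nn.Parameter(torch.tensor(1.0))  # initialized at 1.0 as in the notes
        self.threshold = threshold

    def forward(self, x):
        t = TernarizeSTE.apply(self.weight, self.threshold)
        return F.linear(x, self.S * t, self.bias)


layer = LearnedSLinear(512, 128)
optimizer = torch.optim.AdamW(layer.parameters(), lr=3e-4)
```

Because S multiplies the ternary matrix directly, it receives a gradient from every active weight, which is one plausible reason its gradients stayed in a healthy range.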
3. TernarizeSTE Gradient Behavior
- With std=0.1 init: ~57% nonzero gradients at step 1 (higher than theoretical 38% due to weight evolution)
- Dead zone (|w| < 0.05) is the main gradient bottleneck, which confirms the need for a sticky zone or warmup in Phase 3
4. Config B Limitations
- RMS-derived S depends on input magnitude, not weight distribution
- Cannot learn task-specific scaling → 1.89× worse than BitNet
- Useful insight: learned S (Config C) is strictly better than input-derived S (Config B)
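To illustrate why an input-derived scale cannot adapt to the task, here is a framework-free sketch of S = 1/rms(x). The `eps` term and the function name are assumptions added for numerical safety and readability.

```python
import math


def rms_scale(x, eps=1e-8):
    """Return S = 1 / rms(x), where rms(x) = sqrt(mean(x_i ** 2))."""
    mean_sq = sum(v ** 2 for v in x) / len(x)  # ** is power; ^ would be XOR
    return 1.0 / math.sqrt(mean_sq + eps)


small = [1.0, -1.0, 1.0, -1.0]
large = [10.0 * v for v in small]
print(rms_scale(small), rms_scale(large))
```

Scaling the input by 10 shrinks S by 10, whatever the weights or the loss look like; a learned S has no such constraint.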
5. All Configs Still Learning at Step 5000
All three configs showed decreasing val_loss between steps 4500-5000. Longer training (10K+ steps) would likely:
- Narrow the A vs C gap
- Potentially push Config B closer to threshold
Bug Fixes Applied During Testing
| Bug | Location | Fix |
|---|---|---|
| TRAIN_PARAMS keys not strings | Lines 9-19 | Changed `batch:` → `"batch_size":` etc. |
| TernarizeSTE was nn.Module | Lines 25-32 | Changed to `torch.autograd.Function` with `@staticmethod` |
| All `__init__` methods missing `self` | Lines 26,35,44,56,68 | Added `self` parameter |
| `torch.randn("out","in")` | Lines 46,58,70 | Changed to `torch.randn(out_dim, in_dim)` |
| `x^2` (XOR, not power) | Line 61 | Changed to `x**2` |
| `math.sqrt` with tensor | Line 61 | Changed to `torch.sqrt(torch.mean(...))` |
| Missing return in forwards | Lines 52,64,75 | Added return statements |
| BitNet std=0.01 | Line 71 | Changed to std=0.1 for active gradient flow |
Architecture Validation
ByteMLP (base class)
├── Embedding(256, 64) → [B, 8, 64] → flatten → [B, 512]
├── fc1: Config-dependent linear layer → [B, 128]
├── ReLU
├── fc2: Config-dependent linear layer → [B, 256]
└── CrossEntropyLoss against y[:, -1]
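The parameter counts in the results table can be cross-checked against this architecture. The sketch below assumes both linear layers carry a bias, which is inferred from the fact that it reproduces the reported 115,072 and is not stated in the notes.

```python
VOCAB, EMBED_DIM, CONTEXT, HIDDEN = 256, 64, 8, 128


def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Number of parameters in a dense layer of shape [out_dim, in_dim]."""
    return in_dim * out_dim + (out_dim if bias else 0)


embedding = VOCAB * EMBED_DIM                      # 256 x 64 lookup table
fc1 = linear_params(CONTEXT * EMBED_DIM, HIDDEN)   # 512 -> 128
fc2 = linear_params(HIDDEN, VOCAB)                 # 128 -> 256
total = embedding + fc1 + fc2

print(total)        # Configs A, B and FP32
print(total + 2)    # Config C: one extra scalar S per ternary layer
```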
Effective BPW calculations confirmed:
- Config A: 16.0 bpw (FP16 shadow weights)
- Config B: 1.58 bpw (ternary {-1,0,+1} takes 2 bits when stored naively, but log2(3) ≈ 1.58 bits with ideal packing)
- Config C: 1.583 bpw (ternary + 2 scalar S params, one per layer)
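The 1.58 figure is the information content of one ternary symbol, log2(3). As a sketch of how close a practical encoding can get, the example below packs 5 trits into one byte (3^5 = 243 ≤ 256), giving 1.6 bpw. This packing scheme is an illustration only; the notes do not say how weights would be stored.

```python
import math

IDEAL_BPW = math.log2(3)   # about 1.585 bits per ternary weight
PACKED_BPW = 8 / 5         # 5 trits per byte


def pack_trits(trits):
    """Encode 5 values from {-1, 0, +1} as one integer in [0, 242] (base 3)."""
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)   # shift {-1, 0, 1} to digits {0, 1, 2}
    return value


def unpack_trits(value, count=5):
    """Inverse of pack_trits."""
    trits = []
    for _ in range(count):
        trits.append(value % 3 - 1)
        value //= 3
    return trits


print(f"ideal {IDEAL_BPW:.3f} bpw, packed {PACKED_BPW:.1f} bpw")
```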
Recommendations for Phase 3
- Use Config C (Learned-S) architecture: validated as viable
- Initialize steering weights at std=0.1: critical for gradient flow
- Add threshold warmup (0.01→0.05 over the first 10% of steps) to reduce dead-zone effects
- Consider sticky zone STE for graph edges specifically (deferred from spike)
- Extend training to 10K+ steps for fair comparison at convergence
- Phase 8 idea: Hybrid ternary + FP8 selective precision could close the gap to BF16
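The threshold warmup recommended above could be sketched as follows. The linear ramp and the function name `warmup_threshold` are assumptions; the notes only give the endpoints (0.01 and 0.05) and the 10% window.

```python
def warmup_threshold(step: int, total_steps: int,
                     start: float = 0.01, end: float = 0.05,
                     warmup_frac: float = 0.10) -> float:
    """Ternarization threshold at `step`, ramped linearly over the warmup window."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps


for step in (0, 500, 1000, 5000):
    print(step, warmup_threshold(step, total_steps=10_000))
```

Starting with a narrow dead zone lets more weights receive gradients early, before the threshold reaches its final value.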