MORPH Spike Testing Notes — Phase 0: Scaled Ternary Experiment

Date: 2026-05-12
File: test-stp.py (renamed from the originally planned spike.py; lives in repo root)

Test Results Summary

All 3 Configs Trained Successfully (5000 steps each)

Config	Full Name	Val Loss	BPW	Params	vs FP32
A	BitNet (α · T)	2.3728	16.0	115,072	1.16× worse
B	RMS-Scaled (S⊙T, S=1/rms(x))	4.4917	1.58	115,072	2.19× worse
C	Learned-S (S⊙T, S=nn.Parameter)	2.8817	1.58	115,074	1.41× worse
—	FP32 baseline (nn.Linear)	2.0492	32.0	115,072	baseline

Success Criteria (D-13: C_loss ≤ 1.25 × A_loss)

C/A = 1.214× → ✅ PASS (within 25% margin)
B/A = 1.893× → ❌ FAIL (RMS-S alone insufficient)
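
The D-13 check above is plain arithmetic on the table values, and can be reproduced with a short script. This is a minimal sketch: the names `val_loss`, `ratio_to`, and `passes_d13` are illustrative and are not taken from test-stp.py.

```python
# Validation losses copied from the results table in these notes.
val_loss = {"A": 2.3728, "B": 4.4917, "C": 2.8817, "FP32": 2.0492}

D13_MARGIN = 1.25  # D-13: candidate loss must be <= 1.25 x Config A loss


def ratio_to(candidate: str, reference: str) -> float:
    """Loss of `candidate` expressed as a multiple of the loss of `reference`."""
    return val_loss[candidate] / val_loss[reference]


def passes_d13(candidate: str) -> bool:
    """True if the candidate's loss is within the D-13 margin of Config A."""
    return ratio_to(candidate, "A") <= D13_MARGIN


for name in ("C", "B"):
    verdict = "PASS" if passes_d13(name) else "FAIL"
    print(f"{name}/A = {ratio_to(name, 'A'):.3f}x -> {verdict}")
```

The same `ratio_to` helper with `"FP32"` as the reference gives the "vs FP32" column of the table.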

Extended Comparison (vs FP32)

Config A (BitNet): 1.16× FP32 loss at 16 bpw → good ternary baseline
Config C (Learned-S): 1.41× FP32 loss at 1.58 bpw → ~10× fewer bits per weight than Config A (~20× fewer than FP32) for 41% higher val loss than FP32
Config B (RMS-S): 2.19× FP32 loss at 1.58 bpw → input-derived scaling not enough

Key Findings

1. Config A Weight Initialization Bug

Original: std=0.01 put the 0.05 threshold at 5σ, so essentially all weights fell below it → all-zero ternary → zero gradients → training relied only on bias + embedding. Fix: Changed to std=0.1 (same as B/C). The threshold then sits at 0.5σ, so ~62% of weights start above it (~38% start inside the dead zone) → active STE gradient flow from step 1.
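
The fraction of weights that start above the ternarization threshold can be checked from the normal distribution. This sketch assumes the weights are i.i.d. zero-mean Gaussian at initialization and that the threshold is the fixed 0.05 mentioned in these notes. The helper name `fraction_above` is illustrative.

```python
import math


def fraction_above(threshold: float, std: float) -> float:
    """P(|w| > threshold) for w ~ N(0, std^2), via the complementary error function."""
    return math.erfc(threshold / (std * math.sqrt(2.0)))


for std in (0.01, 0.1):
    above = fraction_above(0.05, std)
    print(f"std={std}: {above:.2%} above threshold, {1 - above:.2%} in dead zone")
```

For std=0.1 the threshold is half a standard deviation, so roughly 62% of weights land outside the dead zone. For std=0.01 it is five standard deviations, so almost none do.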

2. Config C Learned S Convergence

S values initialized at 1.0, converged to ~0.29–0.31 by step 5000
S gradients remained healthy (0.01–0.7 range) — no collapse or explosion
Learning rate 3e-4 with AdamW worked well for S optimization

3. TernarizeSTE Gradient Behavior

With std=0.1 init: ~57% nonzero gradients at step 1. For N(0, 0.1²) weights, ~62% are expected to start outside the dead zone and ~38% inside it, so the measured figure is close to the theoretical one
Dead zone (|w| < 0.05) is the main gradient bottleneck — confirms need for sticky zone or warmup in Phase 3

4. Config B Limitations

RMS-derived S depends on input magnitude, not weight distribution
Cannot learn task-specific scaling → 1.89× worse than BitNet
Useful insight: in this run, learned S (Config C, val loss 2.88) clearly outperformed input-derived S (Config B, val loss 4.49)
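
To illustrate why an input-derived scale cannot adapt to the task, here is a plain-Python sketch of S = 1/rms(x). The `eps` term is an assumption added for numerical safety and the function name `rms_scale` is illustrative. The actual Config B layer in test-stp.py operates on tensors and may differ in detail.

```python
import math


def rms_scale(x, eps=1e-8):
    """Return S = 1 / rms(x) for a sequence of floats (eps is an assumed safeguard)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return 1.0 / math.sqrt(mean_sq + eps)


x = [3.0, 4.0]
doubled = [2.0 * v for v in x]

# S is a pure function of the input magnitude: doubling x halves S,
# regardless of what the weights or the loss look like.
print(rms_scale(x), rms_scale(doubled))
```

Because S here is fully determined by the input, no gradient signal from the loss can change it, in contrast to the `nn.Parameter` scale in Config C.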

5. All Configs Still Learning at Step 5000

All three configs showed decreasing val_loss between steps 4500-5000. Longer training (10K+ steps) would likely:

Narrow the A vs C gap
Potentially push Config B closer to the D-13 threshold (1.25× Config A loss)

Bug Fixes Applied During Testing

Bug	Location	Fix
TRAIN_PARAMS keys not strings	Lines 9-19	Changed `batch:` → `"batch_size":` etc.
TernarizeSTE was nn.Module	Line 25-32	Changed to `torch.autograd.Function` with `@staticmethod`
All `__init__` missing `self`	Lines 26,35,44,56,68	Added `self` parameter
`torch.randn("out","in")`	Lines 46,58,70	Changed to `torch.randn(out_dim, in_dim)`
`x^2` (XOR not power)	Line 61	Changed to `x**2`
`math.sqrt` with tensor	Line 61	Changed to `torch.sqrt(torch.mean(...))`
Missing `return` in forwards	Lines 52,64,75	Added return statements
BitNet `std=0.01`	Line 71	Changed to `std=0.1` for active gradient flow

Architecture Validation

ByteMLP (base class)
├── Embedding(256, 64) → [B, 8, 64] → flatten → [B, 512]
├── fc1: Config-dependent linear layer → [B, 128]
├── ReLU
├── fc2: Config-dependent linear layer → [B, 256]
└── CrossEntropyLoss against y[:, -1]
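
The parameter counts in the results table can be cross-checked against this layout. The sketch assumes both linear layers carry a bias term and that the embedding, fc1, and fc2 are the only parameterized modules. The constant names are illustrative.

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Number of parameters in a dense layer: weight matrix plus optional bias."""
    return in_dim * out_dim + (out_dim if bias else 0)


VOCAB, EMBED_DIM, CONTEXT, HIDDEN = 256, 64, 8, 128

embedding = VOCAB * EMBED_DIM                     # 256 * 64
fc1 = linear_params(CONTEXT * EMBED_DIM, HIDDEN)  # [B, 512] -> [B, 128]
fc2 = linear_params(HIDDEN, VOCAB)                # [B, 128] -> [B, 256]

total = embedding + fc1 + fc2
print(f"{total:,}")  # 115,072
```

Under these assumptions the total matches the 115,072 reported for Configs A and B and for the FP32 baseline. Config C's 115,074 then corresponds to two additional scalars.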

Effective BPW calculations confirmed:

Config A: 16.0 bpw (FP16 shadow weights)
Config B: 1.58 bpw (ternary {-1,0,+1}: log2(3) ≈ 1.585 bits per weight with ideal packing; a naive encoding uses 2 bits)
Config C: 1.583 bpw (ternary + one scalar S per layer, 2 extra params in total, consistent with the 115,074 param count)
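
The ternary figures follow from the information content of a three-valued symbol. The exact accounting used in test-stp.py is not shown in these notes, so this sketch will not necessarily reproduce the reported values to the third decimal. It assumes ideal log2(3) packing, FP32 scale scalars, and that only the fc1 and fc2 weight matrices are ternarized.

```python
import math

TERNARY_BITS = math.log2(3)  # ideal cost of one {-1, 0, +1} symbol, about 1.585 bits


def effective_bpw(n_ternary: int, n_scalars: int = 0, scalar_bits: int = 32) -> float:
    """Total storage bits divided by the number of ternary weights."""
    return (n_ternary * TERNARY_BITS + n_scalars * scalar_bits) / n_ternary


n_ternary = 512 * 128 + 128 * 256  # fc1 + fc2 weight matrices (assumed ternarized)

print(f"no scale scalars:  {effective_bpw(n_ternary):.4f} bpw")
print(f"two scale scalars: {effective_bpw(n_ternary, n_scalars=2):.4f} bpw")
```

At this model size the overhead of the two learned scalars is well under 0.001 bpw, so Configs B and C are effectively identical in storage cost.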

Recommendations for Phase 3

Use Config C (Learned-S) architecture — validated as viable
Initialize steering weights at std=0.1 — critical for gradient flow
Add threshold warmup (0.01→0.05 over first 10% steps) to reduce dead-zone effects
Consider sticky zone STE for graph edges specifically (deferred from spike)
Extend training to 10K+ steps for fair comparison at convergence
Phase 8 idea: Hybrid ternary + FP8 selective precision could close the gap to BF16
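
The threshold warmup recommended above could be implemented as a simple schedule. These notes only give the endpoints (0.01→0.05) and the duration (first 10% of steps). The linear ramp, the function name `warmup_threshold`, and its parameters are assumptions made for illustration.

```python
def warmup_threshold(step: int, total_steps: int,
                     start: float = 0.01, end: float = 0.05,
                     warmup_frac: float = 0.10) -> float:
    """Linearly ramp the ternarization threshold from `start` to `end`
    over the first `warmup_frac` of training, then hold it at `end`."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps


# Example for a 5000-step run: the ramp covers steps 0..499.
for step in (0, 250, 500, 4999):
    print(step, round(warmup_threshold(step, 5000), 4))
```

Starting with a smaller threshold means fewer weights fall into the dead zone early in training. Other ramp shapes (cosine, stepwise) would fit the same recommendation.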