| # MORPH Spike Testing Notes - Phase 0: Scaled Ternary Experiment |
|
|
| **Date:** 2026-05-12 |
| **File:** `test-stp.py` (renamed from the originally planned `spike.py`; lives in the repo root) |
|
|
| --- |
|
|
| ## Test Results Summary |
|
|
| ### All 3 Configs Trained Successfully (5000 steps each) |
|
|
| | Config | Full Name | Val Loss | BPW | Params | vs FP32 | |
| |--------|-----------|----------|-----|--------|---------| |
| | A | BitNet (α · T) | 2.3728 | 16.0 | 115,072 | 1.16× worse | |
| | B | RMS-Scaled (S⊙T, S=1/rms(x)) | 4.4917 | 1.58 | 115,072 | 2.19× worse | |
| | C | Learned-S (S⊙T, S=nn.Parameter) | 2.8817 | 1.58 | 115,074 | 1.41× worse | |
| | - | FP32 baseline (nn.Linear) | 2.0492 | 32.0 | 115,072 | baseline | |
|
|
| ### Success Criteria (D-13: C_loss ≤ 1.25 × A_loss) |
|
|
| - **C/A = 1.214×** → ✅ PASS (within the 25% margin) |
| - **B/A = 1.893×** → ❌ FAIL (RMS-S alone is insufficient) |
|
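The pass/fail check above is plain arithmetic on the validation losses in the results table. A minimal sketch, where `VAL_LOSS` and `passes_d13` are illustrative names and not identifiers taken from `test-stp.py`:

```python
# Validation losses copied from the results table above.
VAL_LOSS = {"A": 2.3728, "B": 4.4917, "C": 2.8817, "FP32": 2.0492}

MARGIN = 1.25  # D-13: candidate loss must be <= 1.25 x Config A loss


def passes_d13(candidate: str, reference: str = "A") -> bool:
    """Return True if the candidate's loss is within MARGIN of the reference."""
    return VAL_LOSS[candidate] <= MARGIN * VAL_LOSS[reference]


for name in ("B", "C"):
    ratio = VAL_LOSS[name] / VAL_LOSS["A"]
    verdict = "PASS" if passes_d13(name) else "FAIL"
    print(f"{name}/A = {ratio:.3f}x -> {verdict}")
```

Keeping the margin as a named constant makes it easy to re-run the check if the criterion is tightened in a later phase.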
|
| ### Extended Comparison (vs FP32) |
|
|
| - Config A (BitNet): 1.16× FP32 loss at 16 bpw, a good ternary baseline |
| - Config C (Learned-S): 1.41× FP32 loss at 1.58 bpw, i.e. **~10× less weight memory** than 16 bpw (~20× less than FP32 at 32 bpw) for 41% higher validation loss |
| - Config B (RMS-S): 2.19× FP32 loss at 1.58 bpw; input-derived scaling alone is not enough |
|
|
| --- |
|
|
| ## Key Findings |
|
|
| ### 1. Config A Weight Initialization Bug |
| **Original:** `std=0.01` put essentially all weights (>99.99% for a zero-mean normal init) below the 0.05 threshold → all-zero ternary weights → zero gradients → training relied only on bias + embedding. |
| **Fix:** Changed to `std=0.1` (same as B/C). The threshold then sits at 0.5σ, so ~62% of weights start above it and ~38% start inside the dead zone → active STE gradient flow from step 1. |
|
|
| ### 2. Config C Learned S Convergence |
| - S values initialized at 1.0, converged to ~0.29–0.31 by step 5000 |
| - S gradients remained healthy (0.01–0.7 range), with no collapse or explosion |
| - Learning rate 3e-4 with AdamW worked well for S optimization |
|
|
| ### 3. TernarizeSTE Gradient Behavior |
| - With `std=0.1` init: ~57% nonzero gradients at step 1, close to the ~62% of weights expected above the 0.05 threshold at initialization |
| - Dead zone (|w| < 0.05) is the main gradient bottleneck, which confirms the need for a sticky zone or threshold warmup in Phase 3 |
|
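The dead-zone behavior described above can be cross-checked analytically and sketched without any framework. This is an illustration under two assumptions that the notes do not state explicitly: weights are drawn from a zero-mean normal distribution, and the STE backward pass zeroes gradients for weights inside the dead zone. `fraction_above`, `ternarize`, and `ste_grad` are hypothetical helper names; the real `TernarizeSTE` is a `torch.autograd.Function`.

```python
import math

THRESHOLD = 0.05  # dead-zone half-width from the notes


def fraction_above(std: float, threshold: float = THRESHOLD) -> float:
    """P(|w| > threshold) for w ~ N(0, std^2)."""
    return 1.0 - math.erf(threshold / (std * math.sqrt(2.0)))


def ternarize(w: float, threshold: float = THRESHOLD) -> int:
    """Forward rule: map a shadow weight to -1, 0, or +1."""
    if w > threshold:
        return 1
    if w < -threshold:
        return -1
    return 0


def ste_grad(w: float, upstream: float, threshold: float = THRESHOLD) -> float:
    """Assumed backward rule: pass the gradient through only outside the dead zone."""
    return upstream if abs(w) > threshold else 0.0


print(f"std=0.01: {fraction_above(0.01):.2e} of weights above threshold")
print(f"std=0.1:  {fraction_above(0.1):.3f} of weights above threshold")
print([ternarize(w) for w in (-0.2, -0.03, 0.0, 0.04, 0.3)])
```

Under the normal-init assumption this gives roughly 0.62 for `std=0.1`, which means about 38% of weights start inside the dead zone, and a vanishingly small share above threshold for `std=0.01`.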
|
| ### 4. Config B Limitations |
| - RMS-derived S depends on input magnitude, not weight distribution |
| - Cannot learn task-specific scaling, which leaves it 1.89× worse than BitNet |
| - Useful insight: in this experiment, learned S (Config C) clearly outperforms input-derived S (Config B) |
|
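To see why an input-derived scale carries no task-specific information, consider S = 1/rms(x) computed directly from the activations. A minimal sketch in plain Python (the actual layer operates on tensors; `rms_scale` is an illustrative name, and the small `eps` is an added assumption to avoid division by zero):

```python
import math


def rms_scale(x: list[float], eps: float = 1e-8) -> float:
    """Return S = 1 / rms(x), where rms(x) = sqrt(mean(x_i ** 2))."""
    mean_sq = sum(v ** 2 for v in x) / len(x)
    return 1.0 / (math.sqrt(mean_sq) + eps)


x = [3.0, -4.0, 0.0, 5.0]
s = rms_scale(x)
s_scaled = rms_scale([10.0 * v for v in x])

# Multiplying the input by 10 shrinks S by 10: S tracks input magnitude only,
# and nothing in it is trained against the loss.
print(s, s_scaled)
```

A learned scalar, by contrast, is updated by the optimizer and can settle wherever the loss prefers, independent of the input scale.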
|
| ### 5. All Configs Still Learning at Step 5000 |
| All three configs showed decreasing val_loss between steps 4500-5000. Longer training (10K+ steps) would likely: |
| - Narrow the A vs C gap |
| - Potentially push Config B closer to the 1.25× pass threshold |
| |
| --- |
| |
| ## Bug Fixes Applied During Testing |
| |
| | Bug | Location | Fix | |
| |-----|----------|-----| |
| | TRAIN_PARAMS keys not strings | Lines 9-19 | Changed `batch:` → `"batch_size":` etc. | |
| | TernarizeSTE was nn.Module | Lines 25-32 | Changed to `torch.autograd.Function` with `@staticmethod` | |
| | All `__init__` missing `self` | Lines 26,35,44,56,68 | Added `self` parameter | |
| | `torch.randn("out","in")` | Lines 46,58,70 | Changed to `torch.randn(out_dim, in_dim)` | |
| | `x^2` (`^` is XOR, not power) | Line 61 | Changed to `x**2` | |
| | `math.sqrt` with tensor | Line 61 | Changed to `torch.sqrt(torch.mean(...))` | |
| | Missing `return` in forwards | Lines 52,64,75 | Added return statements | |
| | BitNet `std=0.01` | Line 71 | Changed to `std=0.1` for active gradient flow | |
|
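The `x^2` entry in the table is worth reproducing in isolation, because on integers the mistake does not raise an error. In Python, `^` is bitwise XOR, so the expression silently yields the wrong number. The snippet below uses plain Python numbers to stay dependency-free; it is an illustration and not code from `test-stp.py`.

```python
x = 3

xor_result = x ^ 2     # bitwise XOR: 0b11 ^ 0b10 == 0b01
power_result = x ** 2  # exponentiation

print(xor_result, power_result)

# On Python floats the mistake is at least loud: XOR is undefined for them.
try:
    3.0 ^ 2
    float_xor_raises = False
except TypeError:
    float_xor_raises = True

print(float_xor_raises)
```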
|
| --- |
|
|
| ## Architecture Validation |
|
|
| ``` |
| ByteMLP (base class) |
| ├── Embedding(256, 64) → [B, 8, 64] → flatten → [B, 512] |
| ├── fc1: Config-dependent linear layer → [B, 128] |
| ├── ReLU |
| ├── fc2: Config-dependent linear layer → [B, 256] |
| └── CrossEntropyLoss against y[:, -1] |
| ``` |
|
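The shapes in the diagram are enough to reproduce the parameter counts in the results table. The sketch below assumes both linear layers carry a bias term, which is consistent with the reported total; the constant and function names are illustrative.

```python
VOCAB = 256      # byte vocabulary, also the output dimension
EMBED_DIM = 64   # embedding width
CONTEXT = 8      # bytes of context per example
HIDDEN = 128     # fc1 output width


def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Parameter count of a dense layer: weight matrix plus optional bias."""
    return in_dim * out_dim + (out_dim if bias else 0)


embedding = VOCAB * EMBED_DIM                    # 256 x 64
fc1 = linear_params(CONTEXT * EMBED_DIM, HIDDEN)  # 512 -> 128
fc2 = linear_params(HIDDEN, VOCAB)                # 128 -> 256

total = embedding + fc1 + fc2
total_learned_s = total + 2  # Config C: one extra scalar S per linear layer

print(total, total_learned_s)
```

The two extra parameters in Config C correspond to one scalar S for each of `fc1` and `fc2`.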
|
| Effective BPW calculations confirmed: |
| - Config A: 16.0 bpw (FP16 shadow weights) |
| - Config B: 1.58 bpw (ternary {-1, 0, +1}: 2 bits if stored naively, log2(3) ≈ 1.58 bits with ideal packing) |
| - Config C: 1.583 bpw (ternary + one scalar S per layer, i.e. 2 extra parameters in total, matching the 115,074 parameter count) |
|
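The 1.58 figure is the information content of a three-state symbol. As a rough cross-check, the sketch below computes that value, the rate of one simple practical packing, and the overhead of the two learned scalars. It assumes the scalars are stored as FP32 and that only the `fc1` and `fc2` weight matrices are ternarized; it does not attempt to reproduce the exact rounding behind the reported numbers.

```python
import math

# Ideal cost of one ternary weight in bits.
bits_per_ternary = math.log2(3)

# One practical packing: 5 ternary values fit in a byte because 3**5 = 243 <= 256.
packed_bits_per_weight = 8 / 5

# Ternarized weights: fc1 (512 x 128) plus fc2 (128 x 256).
ternary_weights = 512 * 128 + 128 * 256

# Assumed storage for Config C's two scalars: 32 bits each.
scalar_overhead_bpw = (2 * 32) / ternary_weights

print(f"ideal:    {bits_per_ternary:.3f} bpw")
print(f"packed:   {packed_bits_per_weight:.1f} bpw")
print(f"overhead: {scalar_overhead_bpw:.5f} bpw from the learned scalars")
```

Either way, the per-layer scalars add well under a hundredth of a bit per weight, so Config C's memory footprint is effectively the same as Config B's.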
|
| --- |
|
|
| ## Recommendations for Phase 3 |
|
|
| 1. **Use Config C (Learned-S) architecture**: validated as viable |
| 2. **Initialize steering weights at std=0.1**: critical for gradient flow |
| 3. **Add threshold warmup** (0.01→0.05 over the first 10% of steps) to reduce dead-zone effects |
| 4. **Consider sticky zone STE** for graph edges specifically (deferred from spike) |
| 5. **Extend training to 10K+ steps** for fair comparison at convergence |
| 6. **Phase 8 idea:** Hybrid ternary + FP8 selective precision could close the gap to BF16 |
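
The threshold warmup in recommendation 3 could be sketched as a simple schedule. The notes give only the endpoints (0.01 and 0.05) and the warmup length (first 10% of steps); the linear ramp and the name `warmup_threshold` are assumptions made for illustration.

```python
def warmup_threshold(
    step: int,
    total_steps: int,
    start: float = 0.01,
    end: float = 0.05,
    warmup_frac: float = 0.10,
) -> float:
    """Ternarization threshold at a given step, ramped linearly during warmup."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps


for step in (0, 250, 500, 5000):
    print(step, warmup_threshold(step, total_steps=5000))
```

Starting with a narrower dead zone lets more shadow weights receive gradient early on, before the threshold reaches its final value.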