	# MORPH Spike Testing Notes — Phase 0: Scaled Ternary Experiment

	Date: 2026-05-12
	File: `test-stp.py` (renamed from the originally planned `spike.py`; lives in the repo root)

	---

	## Test Results Summary

	### All 3 Configs Trained Successfully (5000 steps each)

	| Config | Full Name | Val Loss | BPW | Params | vs FP32 |
	|--------|-----------|----------|-----|--------|---------|
	| A | BitNet (α · T) | 2.3728 | 16.0 | 115,072 | 1.16× worse |
	| B | RMS-Scaled (S⊙T, S=1/rms(x)) | 4.4917 | 1.58 | 115,072 | 2.19× worse |
	| C | Learned-S (S⊙T, S=nn.Parameter) | 2.8817 | 1.58 | 115,074 | 1.41× worse |
	| - | FP32 baseline (nn.Linear) | 2.0492 | 32.0 | 115,072 | baseline |

	### Success Criteria (D-13: C_loss ≤ 1.25 × A_loss)

	- C/A = 1.214× → ✅ PASS (within 25% margin)
	- B/A = 1.893× → ❌ FAIL (RMS-S alone insufficient)
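
The ratios above can be reproduced directly from the validation losses in the results table. The snippet below is a minimal sketch; `VAL_LOSS`, `ratio`, and `passes_d13` are illustrative names, not part of `test-stp.py`, and applying the D-13 margin to Config B follows the comparison made in these notes.

```python
# Validation losses copied from the results table in these notes.
VAL_LOSS = {"A": 2.3728, "B": 4.4917, "C": 2.8817, "FP32": 2.0492}
D13_MARGIN = 1.25  # D-13: candidate loss must be <= 1.25 x Config A loss


def ratio(config: str, reference: str) -> float:
    """Return val_loss(config) / val_loss(reference)."""
    return VAL_LOSS[config] / VAL_LOSS[reference]


def passes_d13(config: str) -> bool:
    """Apply the D-13 margin relative to Config A (hypothetical helper)."""
    return ratio(config, "A") <= D13_MARGIN


for name in ("C", "B"):
    verdict = "PASS" if passes_d13(name) else "FAIL"
    print(f"{name}/A = {ratio(name, 'A'):.3f}x -> {verdict}")

for name in ("A", "B", "C"):
    print(f"{name}/FP32 = {ratio(name, 'FP32'):.2f}x")
```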

	### Extended Comparison (vs FP32)

	- Config A (BitNet): 1.16× FP32 loss at 16 bpw → good ternary baseline
	- Config C (Learned-S): 1.41× FP32 loss at 1.58 bpw → ~10× less weight memory than 16 bpw (~20× less than FP32) for a 41% higher validation loss
	- Config B (RMS-S): 2.19× FP32 loss at 1.58 bpw → input-derived scaling not enough

	---

	## Key Findings

	### 1. Config A Weight Initialization Bug
	Original: `std=0.01` put ~99% of weights below the 0.05 threshold, so the ternary weights were all zero, gradients were zero, and training relied only on the biases and the embedding.
	Fix: changed to `std=0.1` (same as B/C). For a zero-mean Gaussian this leaves only ~38% of weights inside the dead zone and ~62% above the threshold, so STE gradients flow from step 1.
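
As a sanity check on the initialization numbers, the fraction of weights that start outside the dead zone can be computed in closed form. This sketch assumes zero-mean Gaussian initialization (the notes only state the `std` values) and the 0.05 threshold; `fraction_above_threshold` is a hypothetical helper, not a function from `test-stp.py`.

```python
import math


def fraction_above_threshold(std: float, threshold: float = 0.05) -> float:
    """Return P(|w| >= threshold) for w ~ N(0, std^2)."""
    # For a zero-mean Gaussian, P(|w| < t) = erf(t / (std * sqrt(2))).
    return 1.0 - math.erf(threshold / (std * math.sqrt(2.0)))


for std in (0.01, 0.1):
    above = fraction_above_threshold(std)
    print(f"std={std}: {above:.4%} above threshold, {1 - above:.4%} in dead zone")
```

Under this assumption, `std=0.1` places about 62% of weights above the threshold and about 38% inside the dead zone, while `std=0.01` puts the threshold at five standard deviations, so essentially no weight clears it.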

	### 2. Config C Learned S Convergence
	- S values initialized at 1.0, converged to ~0.29–0.31 by step 5000
	- S gradients remained healthy (0.01–0.7 range) — no collapse or explosion
	- Learning rate 3e-4 with AdamW worked well for S optimization

	### 3. TernarizeSTE Gradient Behavior
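	- With `std=0.1` init: ~57% nonzero gradients at step 1. For a zero-mean Gaussian, ~62% of weights start above the 0.05 threshold (the 38% figure is the fraction inside the dead zone), so the observed rate is close to the theoretical one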
	- Dead zone (|w| < 0.05) is the main gradient bottleneck, which confirms the need for a sticky zone or threshold warmup in Phase 3

	### 4. Config B Limitations
	- RMS-derived S depends on input magnitude, not weight distribution
	- Cannot learn task-specific scaling → 1.89× worse than BitNet
	- Useful insight: in this experiment, learned S (Config C) clearly outperformed input-derived S (Config B)
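
To illustrate why an input-derived scale cannot adapt to the task, the sketch below computes S = 1/rms(x) in plain Python. It is an illustration only: `rms_scale` and the `eps` term are assumptions, and the actual Config B layer in `test-stp.py` operates on tensors.

```python
import math


def rms_scale(x: list[float], eps: float = 1e-8) -> float:
    """Config B style scale: S = 1 / rms(x), computed from the input only."""
    mean_sq = sum(v * v for v in x) / len(x)
    # eps guards against division by zero for an all-zero input (assumption).
    return 1.0 / math.sqrt(mean_sq + eps)


x = [0.5, -1.0, 2.0, -0.5]
s_small = rms_scale(x)
s_large = rms_scale([2.0 * v for v in x])  # same pattern, twice the magnitude
print(f"S(x) = {s_small:.4f}, S(2x) = {s_large:.4f}")
```

Doubling the input halves S regardless of the weights or the loss, so the scale carries no trainable, task-specific information. A learned `nn.Parameter` (Config C) does.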

	### 5. All Configs Still Learning at Step 5000
	All three configs showed decreasing val_loss between steps 4500-5000. Longer training (10K+ steps) would likely:
	- Narrow the A vs C gap
	- Potentially push Config B closer to the D-13 threshold

	---

	## Bug Fixes Applied During Testing

	| Bug | Location | Fix |
	|-----|----------|-----|
	| TRAIN_PARAMS keys not strings | Lines 9-19 | Changed `batch:` → `"batch_size":` etc. |
	| TernarizeSTE was nn.Module | Lines 25-32 | Changed to `torch.autograd.Function` with `@staticmethod` |
	| All `__init__` missing `self` | Lines 26,35,44,56,68 | Added `self` parameter |
	| `torch.randn("out","in")` | Lines 46,58,70 | Changed to `torch.randn(out_dim, in_dim)` |
	| `x^2` (XOR, not power) | Line 61 | Changed to `x**2` |
	| `math.sqrt` with tensor | Line 61 | Changed to `torch.sqrt(torch.mean(...))` |
	| Missing `return` in forwards | Lines 52,64,75 | Added return statements |
	| BitNet `std=0.01` | Line 71 | Changed to `std=0.1` for active gradient flow |

	---

	## Architecture Validation

	```
	ByteMLP (base class)
	├── Embedding(256, 64) → [B, 8, 64] → flatten → [B, 512]
	├── fc1: Config-dependent linear layer → [B, 128]
	├── ReLU
	├── fc2: Config-dependent linear layer → [B, 256]
	└── CrossEntropyLoss against y[:, -1]
	```
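
The parameter counts in the results table can be cross-checked against this architecture. The sketch below assumes both linear layers carry a bias term (the notes mention biases) and that Config C adds one scalar S per linear layer; `linear_params` and the constant names are illustrative.

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Parameter count of a dense layer with an optional bias vector."""
    return in_dim * out_dim + (out_dim if bias else 0)


VOCAB, EMBED_DIM, CONTEXT, HIDDEN = 256, 64, 8, 128

embedding = VOCAB * EMBED_DIM                     # Embedding(256, 64)
fc1 = linear_params(CONTEXT * EMBED_DIM, HIDDEN)  # [B, 512] -> [B, 128]
fc2 = linear_params(HIDDEN, VOCAB)                # [B, 128] -> [B, 256]

base_total = embedding + fc1 + fc2
learned_s_total = base_total + 2  # assumption: one scalar S per linear layer

print(f"Configs A/B and FP32: {base_total:,} params")
print(f"Config C:             {learned_s_total:,} params")
```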

	Effective BPW calculations confirmed:
	- Config A: 16.0 bpw (FP16 shadow weights)
	- Config B: 1.58 bpw (a ternary weight in {-1,0,+1} carries log2(3) ≈ 1.58 bits with ideal packing; a naive encoding uses 2 bits)
	- Config C: 1.583 bpw (ternary + one scalar S per linear layer, 2 in total, matching the 115,074 parameter count)
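
The 1.58 bpw figure corresponds to the information content of a three-valued weight, log2(3). The sketch below works through that number and the resulting memory ratios. The 5-trits-per-byte packing is a common scheme shown for comparison and is not claimed to be what `test-stp.py` uses; `compression_vs` is a hypothetical helper.

```python
import math

TERNARY_BITS = math.log2(3)  # information content of one {-1, 0, +1} weight
NAIVE_BITS = 2.0             # one 2-bit code per weight
PACKED_BITS = 8 / 5          # 5 trits per byte, since 3**5 = 243 <= 256


def compression_vs(reference_bpw: float, ternary_bpw: float = TERNARY_BITS) -> float:
    """Weight-memory ratio of a reference precision to ternary storage."""
    return reference_bpw / ternary_bpw


print(f"ideal ternary: {TERNARY_BITS:.3f} bpw")
print(f"packed (5 trits/byte): {PACKED_BITS:.1f} bpw, naive: {NAIVE_BITS:.1f} bpw")
print(f"vs 16 bpw: {compression_vs(16.0):.1f}x smaller")
print(f"vs FP32:   {compression_vs(32.0):.1f}x smaller")
```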

	---

	## Recommendations for Phase 3

	1. Use Config C (Learned-S) architecture — validated as viable
	2. Initialize steering weights at std=0.1 — critical for gradient flow
	3. Add threshold warmup (0.01→0.05 over first 10% steps) to reduce dead-zone effects
	4. Consider sticky zone STE for graph edges specifically (deferred from spike)
	5. Extend training to 10K+ steps for fair comparison at convergence
	6. Phase 8 idea: Hybrid ternary + FP8 selective precision could close the gap to BF16
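
Recommendation 3 could be implemented with a simple schedule. The sketch below assumes a linear ramp from 0.01 to 0.05 over the first 10% of steps. The notes do not specify the shape of the warmup, and `threshold_at` is a hypothetical helper.

```python
def threshold_at(step: int, total_steps: int,
                 start: float = 0.01, end: float = 0.05,
                 warmup_frac: float = 0.10) -> float:
    """Ternarization threshold at a given step under a linear warmup."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step >= warmup_steps:
        return end  # warmup finished: hold the final threshold
    # Linear interpolation between start and end (assumed schedule shape).
    return start + (end - start) * step / warmup_steps


for step in (0, 250, 500, 5000):
    print(f"step {step:>4}: threshold = {threshold_at(step, 5000):.3f}")
```

A lower threshold early on keeps more weights outside the dead zone while they are still close to their initial values, which is the effect the recommendation is aiming for.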