# MORPH Spike Testing Notes: Phase 0, Scaled Ternary Experiment

**Date:** 2026-05-12
**File:** `test-stp.py` (renamed from the originally planned `spike.py`; lives in repo root)

---

## Test Results Summary

### All 3 Configs Trained Successfully (5000 steps each)

| Config | Full Name | Val Loss | BPW | Params | vs FP32 |
|--------|-----------|----------|-----|--------|---------|
| A | BitNet (α · T) | 2.3728 | 16.0 | 115,072 | 1.16× worse |
| B | RMS-Scaled (S⊙T, S=1/rms(x)) | 4.4917 | 1.58 | 115,072 | 2.19× worse |
| C | Learned-S (S⊙T, S=nn.Parameter) | 2.8817 | 1.58 | 115,074 | 1.41× worse |
| - | FP32 baseline (nn.Linear) | 2.0492 | 32.0 | 115,072 | baseline |

### Success Criteria (D-13: C_loss ≤ 1.25 × A_loss)

- **C/A = 1.214×** → ✅ PASS (within 25% margin)
- **B/A = 1.893×** → ❌ FAIL (RMS-S alone insufficient)
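
The pass/fail verdicts follow directly from the validation losses in the table. A minimal sketch of that arithmetic; the `VAL_LOSS` dictionary and the `ratio_to` and `passes_d13` helpers are illustrative names, not taken from `test-stp.py`:

```python
# Validation losses copied from the results table above.
VAL_LOSS = {"A": 2.3728, "B": 4.4917, "C": 2.8817, "FP32": 2.0492}

MARGIN = 1.25  # criterion D-13: candidate loss <= 1.25 x Config A loss


def ratio_to(config: str, reference: str = "A") -> float:
    """Return val_loss(config) / val_loss(reference)."""
    return VAL_LOSS[config] / VAL_LOSS[reference]


def passes_d13(config: str) -> bool:
    """True if the config's loss is within the D-13 margin of Config A."""
    return ratio_to(config) <= MARGIN


for name in ("B", "C"):
    verdict = "PASS" if passes_d13(name) else "FAIL"
    print(f"{name}/A = {ratio_to(name):.3f}x -> {verdict}")
```

Calling `ratio_to(name, "FP32")` gives the values in the "vs FP32" column.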

### Extended Comparison (vs FP32)

- Config A (BitNet): 1.16× FP32 loss at 16 bpw → good ternary baseline
- Config C (Learned-S): 1.41× FP32 loss at 1.58 bpw → **~10× memory savings** relative to Config A's 16 bpw, at the cost of a 41% higher validation loss than FP32
- Config B (RMS-S): 2.19× FP32 loss at 1.58 bpw → input-derived scaling not enough

---

## Key Findings

### 1. Config A Weight Initialization Bug
**Original:** `std=0.01` left more than 99% of weights below the 0.05 threshold → all-zero ternary → zero gradients → training relied only on bias + embedding.
**Fix:** Changed to `std=0.1` (same as B/C). For a zero-mean Gaussian this puts ~62% of weights above the threshold (only ~38% fall in the dead zone) → active STE gradient flow from step 1.

### 2. Config C Learned S Convergence
- S values initialized at 1.0, converged to ~0.29–0.31 by step 5000
- S gradients remained healthy (0.01–0.7 range), with no collapse or explosion
- Learning rate 3e-4 with AdamW worked well for S optimization

### 3. TernarizeSTE Gradient Behavior
- With `std=0.1` init: ~57% nonzero gradients at step 1, close to the ~62% of weights that a zero-mean Gaussian init places above the threshold
- Dead zone (|w| < 0.05) is the main gradient bottleneck, which confirms the need for a sticky zone or warmup in Phase 3
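
The fraction of Gaussian-initialized weights that land in the dead zone can be checked in closed form. A minimal sketch, assuming zero-mean Gaussian shadow weights and a hard 0.05 threshold; `ternarize` and `dead_zone_fraction` are illustrative helpers, not the code in `test-stp.py`:

```python
import math


def ternarize(w: float, threshold: float = 0.05) -> int:
    """Map a shadow weight to {-1, 0, +1}; |w| < threshold falls in the dead zone."""
    if abs(w) < threshold:
        return 0
    return 1 if w > 0 else -1


def dead_zone_fraction(std: float, threshold: float = 0.05) -> float:
    """P(|w| < threshold) for w ~ N(0, std^2), via the Gaussian CDF."""
    return math.erf(threshold / (std * math.sqrt(2.0)))


for std in (0.01, 0.1):
    dead = dead_zone_fraction(std)
    print(f"std={std}: {dead:.1%} in dead zone, {1 - dead:.1%} above threshold")
```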

### 4. Config B Limitations
- RMS-derived S depends on input magnitude, not weight distribution
- Cannot learn task-specific scaling → 1.89× worse than BitNet
- Useful insight: in this experiment, learned S (Config C) clearly outperformed input-derived S (Config B)
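
One way to see the limitation is that a scale derived from the input tracks the input's magnitude and nothing else. A minimal sketch, assuming Config B computes a scalar S = 1/rms(x) per input vector as the results table describes; `rms_scale` and the sample vectors are illustrative, not taken from `test-stp.py`:

```python
import math


def rms(x: list[float]) -> float:
    """Root-mean-square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_scale(x: list[float], eps: float = 1e-8) -> float:
    """Input-derived scale S = 1 / rms(x); eps guards against division by zero."""
    return 1.0 / (rms(x) + eps)


x = [0.5, -1.0, 2.0, 0.25]
x_big = [10.0 * v for v in x]  # same direction, 10x the magnitude

# S shrinks by the same factor the input grew by, so S * x is unchanged:
# the scale normalizes inputs but has no free parameter the optimizer can tune.
print(rms_scale(x), rms_scale(x_big))
print([round(rms_scale(x) * v, 4) for v in x])
print([round(rms_scale(x_big) * v, 4) for v in x_big])
```

A learned S, by contrast, is a free parameter that gradient descent can move to whatever value lowers the loss.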

### 5. All Configs Still Learning at Step 5000
All three configs showed decreasing val_loss between steps 4500-5000. Longer training (10K+ steps) would likely:
- Narrow the A vs C gap
- Potentially push Config B closer to threshold

---

## Bug Fixes Applied During Testing

| Bug | Location | Fix |
|-----|----------|-----|
| TRAIN_PARAMS keys not strings | Lines 9-19 | Changed `batch:` → `"batch_size":` etc. |
| TernarizeSTE was nn.Module | Lines 25-32 | Changed to `torch.autograd.Function` with `@staticmethod` |
| All `__init__` missing `self` | Lines 26,35,44,56,68 | Added `self` parameter |
| `torch.randn("out","in")` | Lines 46,58,70 | Changed to `torch.randn(out_dim, in_dim)` |
| `x^2` (XOR not power) | Line 61 | Changed to `x**2` |
| `math.sqrt` with tensor | Line 61 | Changed to `torch.sqrt(torch.mean(...))` |
| Missing `return` in forwards | Lines 52,64,75 | Added return statements |
| BitNet `std=0.01` | Line 71 | Changed to `std=0.1` for active gradient flow |
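
The `x^2` entry is worth a concrete look: in Python `^` is bitwise XOR, so on integers it silently returns the wrong value, and on plain floats it raises `TypeError`. A minimal illustration using plain Python numbers rather than tensors:

```python
import math

x = 3
print(x ^ 2)   # bitwise XOR: 0b11 ^ 0b10 == 0b01
print(x ** 2)  # exponentiation

# RMS written with ** and math.sqrt works for plain floats; for tensors the
# fix in the table uses torch.sqrt(torch.mean(...)) instead.
values = [3.0, 4.0]
rms = math.sqrt(sum(v ** 2 for v in values) / len(values))
print(rms)
```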

---

## Architecture Validation

```
ByteMLP (base class)
├── Embedding(256, 64) → [B, 8, 64] → flatten → [B, 512]
├── fc1: Config-dependent linear layer → [B, 128]
├── ReLU
├── fc2: Config-dependent linear layer → [B, 256]
└── CrossEntropyLoss against y[:, -1]
```
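
The layer shapes in the diagram are enough to reproduce the parameter count from the results table. A minimal sketch, assuming each linear layer carries a bias vector (an assumption, though it matches the reported total); `linear_params` is an illustrative helper:

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Number of parameters in a dense layer with optional bias."""
    return in_dim * out_dim + (out_dim if bias else 0)


VOCAB, EMBED_DIM, CONTEXT = 256, 64, 8  # byte vocabulary, embedding width, context length
HIDDEN, OUT = 128, 256

embedding = VOCAB * EMBED_DIM                     # 16,384
fc1 = linear_params(CONTEXT * EMBED_DIM, HIDDEN)  # 512 -> 128
fc2 = linear_params(HIDDEN, OUT)                  # 128 -> 256

total = embedding + fc1 + fc2
print(f"{total:,}")  # 115,072
```

Config C reports 115,074, which is this total plus its two learned scalars.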

Effective BPW calculations confirmed:
- Config A: 16.0 bpw (FP16 shadow weights)
- Config B: 1.58 bpw (ternary {-1,0,+1} needs 2 bits if stored naively, but carries log2(3) ≈ 1.58 bits of information per weight)
- Config C: 1.583 bpw (ternary + 2 scalar S params in total, consistent with the 115,074 parameter count)
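
The 1.58 figure is the information content of a three-valued symbol, log2(3). A short check of that value and of the resulting size ratios; the ratios assume storage at exactly log2(3) bits per weight, which practical packing schemes only approximate:

```python
import math

TERNARY_BPW = math.log2(3)  # bits needed per {-1, 0, +1} weight

print(f"ternary: {TERNARY_BPW:.3f} bpw")
for name, bits in (("FP16 shadow weights", 16.0), ("FP32", 32.0)):
    print(f"{name}: {bits / TERNARY_BPW:.1f}x larger than ternary")
```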

---

## Recommendations for Phase 3

1. **Use Config C (Learned-S) architecture**: validated as viable
2. **Initialize steering weights at std=0.1**: critical for gradient flow
3. **Add threshold warmup** (0.01→0.05 over first 10% steps) to reduce dead-zone effects
4. **Consider sticky zone STE** for graph edges specifically (deferred from spike)
5. **Extend training to 10K+ steps** for fair comparison at convergence
6. **Phase 8 idea:** Hybrid ternary + FP8 selective precision could close the gap to BF16
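
The threshold warmup in recommendation 3 could look like the following. A minimal sketch, assuming linear interpolation (the notes give only the endpoints and the 10% window); `threshold_at` is a hypothetical helper, not existing code:

```python
def threshold_at(step: int, total_steps: int,
                 start: float = 0.01, end: float = 0.05,
                 warmup_frac: float = 0.10) -> float:
    """Ternarization threshold at a given step, ramped linearly during warmup."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps


for step in (0, 250, 500, 5000):
    print(step, round(threshold_at(step, total_steps=5000), 4))
```

A smaller early threshold keeps more weights outside the dead zone while the shadow weights are still close to their initial values.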