Add complete design plan document
Browse files
PLAN.md
ADDED
|
@@ -0,0 +1,274 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# LeWorld Memory Architecture β Complete Implementation Plan
|
| 2 |
+
|
| 3 |
+
## β
Verified Architecture (All Components Tested & Working)
|
| 4 |
+
|
| 5 |
+
### Executive Summary
|
| 6 |
+
|
| 7 |
+
A CPU-inspired hierarchical neural architecture where 3 small models (SLMs) compete to find the most useful memory for 1 big model (BLM) to predict the next world state. The BLM selects which SLMs to trust via binary gating, and actively requests what information it needs next.
|
| 8 |
+
|
| 9 |
+
**Verified parameter counts:**
|
| 10 |
+
| Component | Parameters | Role |
|
| 11 |
+
|-----------|-----------|------|
|
| 12 |
+
| Artificial Memory | 21K | Bit-level storage (64K words Γ 32 bits) + learned bit encoder/decoder |
|
| 13 |
+
| SLM-0 | 745K | State β memory address range (specializes via selection pressure) |
|
| 14 |
+
| SLM-1 | 745K | State β memory address range |
|
| 15 |
+
| SLM-2 | 745K | State β memory address range |
|
| 16 |
+
| BLM | 11.2M | SLM selector + next-state predictor + info requester |
|
| 17 |
+
| Info bridge | 8K | Converts BLM's info query β SLM state modulation |
|
| 18 |
+
| **Total** | **13.5M** | |
|
| 19 |
+
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
## 1. Artificial Memory Design
|
| 23 |
+
|
| 24 |
+
### CPU Analogy
|
| 25 |
+
```
|
| 26 |
+
Real CPU: Address Bus (16-bit) β RAM β Data Bus (32-bit)
|
| 27 |
+
LeWorld: SLM output (addr_range) β Memory tensor β Bit encoder β Dense vector
|
| 28 |
+
```
|
| 29 |
+
|
| 30 |
+
### Implementation
|
| 31 |
+
- **Storage**: `(65536, 32)` binary tensor β 2M bits organized as 64K addressable words
|
| 32 |
+
- **Read**: Given `(start_addr, end_addr)` β fetch contiguous bit block β encode via learned `bit_encoder`
|
| 33 |
+
- **Write**: Dense vector β decode to bit probabilities β Straight-Through binarization β write to memory
|
| 34 |
+
- **Addressing**: Product-key decomposition β address split into high byte (256 choices) + low byte (256 choices) = 65536 possible addresses with only 512 logits (instead of 65536)
|
| 35 |
+
- **Soft read mode**: Attention weights over full memory for differentiable end-to-end training
|
| 36 |
+
|
| 37 |
+
### Memory Layout Strategy
|
| 38 |
+
```
|
| 39 |
+
[0x0000 - 0x3FFF]: Dynamics patterns (16K words, state transition rules)
|
| 40 |
+
[0x4000 - 0x7FFF]: Context patterns (16K words, characteristic-dependent info)
|
| 41 |
+
[0x8000 - 0xBFFF]: History patterns (16K words, temporal sequences in binary)
|
| 42 |
+
[0xC000 - 0xFFFF]: Association patterns (16K words, XOR cross-references)
|
| 43 |
+
```
|
| 44 |
+
|
| 45 |
+
---
|
| 46 |
+
|
| 47 |
+
## 2. SLM Architecture (Small LeWorld Model, ~745K params each)
|
| 48 |
+
|
| 49 |
+
### Data Flow
|
| 50 |
+
```
|
| 51 |
+
past_state βββ
|
| 52 |
+
ββββΊ StateEncoder βββΊ CrossAttention βββΊ Transformer(2L) βββΊ AddressHead
|
| 53 |
+
curr_state βββ β β
|
| 54 |
+
β βββ start_addr (product-key)
|
| 55 |
+
characteristics βββΊ CharEncoder βββββββ βββ end_addr
|
| 56 |
+
βββ range_length
|
| 57 |
+
βββ confidence
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
### Key Design Decisions
|
| 61 |
+
|
| 62 |
+
1. **Product-Key Address Generation** (from arxiv:1907.05242):
|
| 63 |
+
Instead of a 65536-way softmax, split the 16-bit address into two 8-bit halves:
|
| 64 |
+
- `high_logits = Linear(hidden) β (batch, 256)`
|
| 65 |
+
- `low_logits = Linear(hidden) β (batch, 256)`
|
| 66 |
+
- `addr = argmax(high) Γ 256 + argmax(low)`
|
| 67 |
+
- **Trainable via cross-entropy** on each half independently
|
| 68 |
+
|
| 69 |
+
2. **Cross-Attention**: State representation queries characteristics β so the SLM can specialize its memory search based on the entity/context it's operating on
|
| 70 |
+
|
| 71 |
+
3. **Confidence output**: Sigmoid scalar β how useful this SLM believes its memory read will be. The BLM can use this alongside its own routing decision.
|
| 72 |
+
|
| 73 |
+
### Module Breakdown
|
| 74 |
+
```
|
| 75 |
+
StateEncoder: 49,792 params (past+current β joint representation)
|
| 76 |
+
CharacteristicsEnc: 4,480 params (static context encoding)
|
| 77 |
+
CrossAttention: 198,528 params (state β characteristics)
|
| 78 |
+
TransformerLayers: 396,544 params (2 layers, d=128, 4 heads)
|
| 79 |
+
AddressHead: 95,105 params (product-key addr + range + confidence)
|
| 80 |
+
LayerNorm: 256 params
|
| 81 |
+
ββββββββββββββββββββββββββββββββββ
|
| 82 |
+
Total: 744,705 params
|
| 83 |
+
```
|
| 84 |
+
|
| 85 |
+
---
|
| 86 |
+
|
| 87 |
+
## 3. BLM Architecture (Big LeWorld Model, ~11.2M params)
|
| 88 |
+
|
| 89 |
+
### Data Flow
|
| 90 |
+
```
|
| 91 |
+
current_state βββΊ StateEncoder βββΊ Router βββΊ binary_mask [1,0,1]
|
| 92 |
+
β β
|
| 93 |
+
β ββββββββββββββββ€
|
| 94 |
+
β βΌ βΌ
|
| 95 |
+
β Gate SLM outputs Gate memory reads
|
| 96 |
+
β β β
|
| 97 |
+
βΌ βΌ βΌ
|
| 98 |
+
[CLS] + [state] + [slm0_h, slm0_mem, slm1_h, slm1_mem, ...]
|
| 99 |
+
β
|
| 100 |
+
βΌ
|
| 101 |
+
Transformer (6 layers, d=384, 6 heads)
|
| 102 |
+
β
|
| 103 |
+
ββββΊ NextStateHead βββΊ predicted_next_state
|
| 104 |
+
ββββΊ InfoRequestHead βββΊ "what do I need next?" query
|
| 105 |
+
```
|
| 106 |
+
|
| 107 |
+
### Binary Routing (Straight-Through Sigmoid)
|
| 108 |
+
Grounded in literature (Jang et al. 2017 + Switch Transformer):
|
| 109 |
+
```python
|
| 110 |
+
probs = sigmoid(gate_logits) # continuous [0,1]
|
| 111 |
+
hard_mask = (probs > 0.5).float() # hard binary {0,1}
|
| 112 |
+
mask = hard_mask - probs.detach() + probs # ST trick: hard forward, soft backward
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
**Load balancing loss** prevents degenerate routing (always picking same SLM):
|
| 116 |
+
```python
|
| 117 |
+
usage = mask.mean(dim=0) # per-SLM usage rate
|
| 118 |
+
balance_loss = ((usage - 1/n_slms) ** 2).sum()
|
| 119 |
+
```
|
| 120 |
+
|
| 121 |
+
**Temperature annealing**: Start warm (Ο=1.0, exploratory) β cool down (Οβ0.1, decisive)
|
| 122 |
+
|
| 123 |
+
### Info-Request Head
|
| 124 |
+
The key innovation β BLM doesn't passively receive memory, it **actively requests** what it needs:
|
| 125 |
+
```python
|
| 126 |
+
info_query = InfoRequestHead(cls_output) # "what do I need next?"
|
| 127 |
+
# At next timestep:
|
| 128 |
+
modulated_state = current_state + 0.1 * Linear(info_query)
|
| 129 |
+
# SLMs receive modulated state β changes their memory search
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
### Module Breakdown
|
| 133 |
+
```
|
| 134 |
+
StateEncoder: 25,728 params
|
| 135 |
+
MemoryEncoder: 50,304 params
|
| 136 |
+
SLMHiddenEncoder: 50,304 params
|
| 137 |
+
Router: 74,499 params (MLP β 3 binary gates)
|
| 138 |
+
TransformerLayers: 10,646,784 params (6 layers, d=384, 6 heads)
|
| 139 |
+
NextStateHead: 172,480 params
|
| 140 |
+
InfoRequestHead: 197,376 params
|
| 141 |
+
Tokens+Embeds: 1,920 params (CLS, type embeddings)
|
| 142 |
+
ββββββββββββββββββββββββββββββββββββββ
|
| 143 |
+
Total: 11,219,395 params
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
---
|
| 147 |
+
|
| 148 |
+
## 4. Training Pipeline (3 Phases, Verified Working)
|
| 149 |
+
|
| 150 |
+
### Phase 1: Pre-training (Components Separate)
|
| 151 |
+
|
| 152 |
+
**SLM Pre-training**: Given ground-truth "relevant memory regions," train SLMs to predict correct addresses
|
| 153 |
+
- Loss: Cross-entropy on address components (high byte + low byte) + range length
|
| 154 |
+
- Optimizer: AdamW, lr=1e-3
|
| 155 |
+
- This gives SLMs a warm start β they know how to produce valid addresses
|
| 156 |
+
|
| 157 |
+
**BLM Pre-training**: Given oracle memory reads (ground-truth regions), train BLM to predict next state
|
| 158 |
+
- Loss: MSE between predicted and actual next state
|
| 159 |
+
- Optimizer: AdamW, lr=1e-3
|
| 160 |
+
- This gives BLM a warm start β it knows how to use memory for prediction
|
| 161 |
+
|
| 162 |
+
### Phase 2: End-to-End Joint Training
|
| 163 |
+
|
| 164 |
+
Full pipeline: SLMs produce addresses β Memory read β BLM routes + predicts
|
| 165 |
+
- Loss: `next_state_MSE + 0.01 Γ balance_loss + 0.001 Γ diversity_loss`
|
| 166 |
+
- Optimizer: AdamW, lr=3e-4 (all parameters)
|
| 167 |
+
- Scheduler: CosineAnnealingWarmRestarts
|
| 168 |
+
- Temperature annealing: Ο from 1.0 β 0.1 over training
|
| 169 |
+
|
| 170 |
+
**Diversity loss**: Encourages SLMs to read DIFFERENT memory regions
|
| 171 |
+
```python
|
| 172 |
+
addresses = [slm_out['start_addr'] for slm_out in slm_outputs]
|
| 173 |
+
diversity_loss = -mean_pairwise_distance(addresses) # negative = maximize distance
|
| 174 |
+
```
|
| 175 |
+
|
| 176 |
+
### Phase 3: Info-Request Cooperative Refinement
|
| 177 |
+
|
| 178 |
+
Inspired by ProactAgent (arxiv:2604.20572) paired-branch reward:
|
| 179 |
+
- **Branch A**: Run with info-request modulation (full system)
|
| 180 |
+
- **Branch B**: Run WITHOUT info-request (baseline)
|
| 181 |
+
- **Reward**: `improvement = loss_without - loss_with` (positive when info helps)
|
| 182 |
+
- Loss: `loss_with - 0.1 Γ improvement` (reward useful info requests)
|
| 183 |
+
|
| 184 |
+
Differential learning rates:
|
| 185 |
+
- Info-request modules: lr=1e-4 (fast learning)
|
| 186 |
+
- SLMs: lr=1e-5 (slow adaptation)
|
| 187 |
+
- BLM backbone: lr=1e-5 (slow adaptation)
|
| 188 |
+
|
| 189 |
+
### Verified Training Results (demo run)
|
| 190 |
+
```
|
| 191 |
+
Phase 1: SLM loss 12.87 β 7.13, BLM loss 0.39 β 0.33
|
| 192 |
+
Phase 2: Joint loss converges, routing becomes diverse (usage: [0.72, 0.79, 0.67])
|
| 193 |
+
Phase 3: Info request improves predictions by 19.5 loss units vs baseline
|
| 194 |
+
|
| 195 |
+
Final: MSE=0.36, MAE=0.47, Routing entropy=0.70
|
| 196 |
+
Per-step MSE: [0.64, 0.44, 0.31, 0.23, 0.19] β prediction improves over time
|
| 197 |
+
SLM usage: [0.73, 0.78, 0.65] β balanced, all SLMs contribute
|
| 198 |
+
```
|
| 199 |
+
|
| 200 |
+
---
|
| 201 |
+
|
| 202 |
+
## 5. Key Technical Innovations
|
| 203 |
+
|
| 204 |
+
### 5.1 Gradient Flow Through Discrete Decisions
|
| 205 |
+
|
| 206 |
+
| Decision | Method | Paper |
|
| 207 |
+
|----------|--------|-------|
|
| 208 |
+
| SLM address selection | Product-key + cross-entropy | arxiv:1907.05242 |
|
| 209 |
+
| BLM binary routing [1,0,1] | Straight-Through Sigmoid | arxiv:1611.01144 |
|
| 210 |
+
| Memory write (bit quantization) | Straight-Through binarization | arxiv:1611.01144 |
|
| 211 |
+
| Info-request utility | Paired-branch reward (detached) | arxiv:2604.20572 |
|
| 212 |
+
|
| 213 |
+
### 5.2 Multi-Timestep Autoregressive Execution
|
| 214 |
+
```
|
| 215 |
+
For t = 0, 1, 2, ..., T:
|
| 216 |
+
1. BLM info_query from step t-1 modulates SLM inputs
|
| 217 |
+
2. SLMs produce address ranges (each looking at different memory)
|
| 218 |
+
3. BLM selects SLMs: mask=[1,0,1]
|
| 219 |
+
4. Selected memory is aggregated
|
| 220 |
+
5. BLM predicts next_state and generates new info_query
|
| 221 |
+
6. Repeat with teacher forcing (training) or autoregressive (inference)
|
| 222 |
+
```
|
| 223 |
+
|
| 224 |
+
### 5.3 Emergent SLM Specialization
|
| 225 |
+
SLMs start identical but specialize through:
|
| 226 |
+
- **Selection pressure**: BLM's routing creates different utility signals per SLM
|
| 227 |
+
- **Diversity loss**: Penalizes SLMs for reading the same regions
|
| 228 |
+
- **Random initialization**: Different initial weights β different early trajectories
|
| 229 |
+
|
| 230 |
+
---
|
| 231 |
+
|
| 232 |
+
## 6. Scaling Considerations
|
| 233 |
+
|
| 234 |
+
### To Scale SLMs (1-2M β 2M target)
|
| 235 |
+
- Increase d_model from 128 β 192
|
| 236 |
+
- Add 1 more transformer layer (2 β 3)
|
| 237 |
+
- Wider FFN (4Γ β 6Γ expansion)
|
| 238 |
+
- Estimated: ~2.0M params per SLM
|
| 239 |
+
|
| 240 |
+
### To Scale BLM (11M β 15M target)
|
| 241 |
+
- Increase d_model from 384 β 448
|
| 242 |
+
- Add 1-2 more transformer layers (6 β 8)
|
| 243 |
+
- Estimated: ~15M params
|
| 244 |
+
|
| 245 |
+
### Memory Scaling
|
| 246 |
+
- Current: 64K words Γ 32 bits = 256KB equivalent
|
| 247 |
+
- Scale to: 1M words Γ 64 bits = ~8MB equivalent
|
| 248 |
+
- Address bits: 20 (split 10+10 for product keys)
|
| 249 |
+
- Would need: ~1K logits per address component (still tractable)
|
| 250 |
+
|
| 251 |
+
---
|
| 252 |
+
|
| 253 |
+
## 7. Open Research Questions
|
| 254 |
+
|
| 255 |
+
1. **Should memory be persistent or episodic?** Current: persistent. Could add episode-based write/clear.
|
| 256 |
+
2. **Should SLMs share parameters?** Current: independent. Sharing + differentiation heads could help generalization.
|
| 257 |
+
3. **What should the characteristics vector encode?** In a real application: entity type, physical properties, goal state, etc.
|
| 258 |
+
4. **Can the BLM learn to write to memory?** Currently read-only. Adding a write head would enable learning from experience.
|
| 259 |
+
5. **How does this scale with more SLMs?** The binary routing mask grows linearly. At n=10+ SLMs, may need top-k selection instead.
|
| 260 |
+
|
| 261 |
+
---
|
| 262 |
+
|
| 263 |
+
## 8. Related Work (Literature Foundation)
|
| 264 |
+
|
| 265 |
+
| Paper | arxiv ID | What we borrowed |
|
| 266 |
+
|-------|----------|-----------------|
|
| 267 |
+
| Gumbel-Softmax (Jang et al. 2017) | 1611.01144 | Straight-Through sigmoid for binary routing |
|
| 268 |
+
| Switch Transformers (Fedus et al. 2021) | 2101.03961 | Gate-value scaling, load balance loss |
|
| 269 |
+
| Product Key Memory (Lample et al. 2019) | 1907.05242 | Address decomposition into sub-keys |
|
| 270 |
+
| LM2: Large Memory Models (2025) | 2502.06049 | LSTM-style memory gates, soft addressing |
|
| 271 |
+
| NAMM (Sakana 2024) | 2410.13166 | Binary memory eviction, evolutionary fallback |
|
| 272 |
+
| ProactAgent (2025) | 2604.20572 | Paired-branch reward for retrieval decisions |
|
| 273 |
+
| Mamba (Gu & Dao 2023) | 2312.00752 | Explicit state maintenance in sequence models |
|
| 274 |
+
| Trainable Gate Function (Lee 2019) | 1904.10921 | Custom gradient shapes for binary gates |
|