# Chimera 5.3 — HYPER CPU Training (10,000+ tok/s target)

100% faithful implementation of the Chimera 5.x config. All 15 architectural components implemented in pure PyTorch, with **true 1.58-bit ternary computation** on CPU.

**v5.3 NEW**: 7 stacked training paradigms designed to push CPU training from ~50-200 tok/s to **10,000+ tok/s** on a single CPU — targeting AGI-class LLM training without GPUs.

**Tokenizer**: splintr-rs (Rust) — o200k_base vocab (200,073 tokens, OpenAI o1/o3).

## Repo Structure

The repo is now organized around the `chimera/` package as the source of truth:

- `chimera/` — model code, config helpers, package CLI wrappers, shared path helpers
- `train.py` — standard training entrypoint
- `train_fast.py` — cached-dataset training entrypoint
- `train_hyper.py` — hyper training entrypoint
- `inference.py` — generation entrypoint
- `gguf_import.py` — GGUF import entrypoint
- `tests/` — smoke and config tests

You can still run the root scripts directly, or use packaged commands after install:

```bash
chimera-train --help
chimera-train-fast --help
chimera-train-hyper --help
chimera-infer --help
chimera-import-gguf --help
```

---

## v5.3 — HYPER Training Paradigms

Seven orthogonal paradigms that stack **multiplicatively** for extreme CPU training speed:

| # | Paradigm | Speedup | Paper | Mechanism |
|---|----------|---------|-------|-----------|
| P1 | **GrowLength Curriculum** | 4-8× | [arxiv:2310.00576](https://arxiv.org/abs/2310.00576) | Start seq=16, grow to target. Short seqs → huge batch → way more tok/s |
| P2 | **Reservoir Freezing** | 1.5-2× | [arxiv:2512.23145](https://arxiv.org/abs/2512.23145) | Freeze 50% of recurrent gates as random ternary. No grad = fewer FLOPs |
| P3 | **Sparse MeZO** | 3-5× | [arxiv:2406.02913](https://arxiv.org/abs/2406.02913) | Perturb only top-1% sensitive params. ZO signal quality ∝ sparsity |
| P4 | **Blockwise Pipeline** | 1.3-2× | — | Pin layer-groups to core-groups; overlap forward passes |
| P5 | **Fused Ternary Cache** | 1.3× | — | Pre-materialise dense weights once; reuse for both MeZO forwards |
| P6 | **Aggressive Token Packing** | 1.1-1.3× | — | Zero padding waste; documents packed back-to-back with EOS |
| P7 | **Progressive Layer Unfreeze** | 1.5-2× | — | Train only top 25% of layers first; unfreeze downward |

**Combined theoretical multiplier**: P1(6×) × P2(1.7×) × P3(4×) × P5(1.3×) × P7(1.7×) ≈ **90×** at mid-range estimates, or roughly **35-210×** across the full per-paradigm ranges

**Realistic target**: 50-200 tok/s baseline → **3,000-15,000+ tok/s**
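
As a sanity check, the same numbers multiplied out in plain Python (P4 and P6 left out of the estimate, as above):

```python
from math import prod

# Per-paradigm factors from the table above.
low  = prod([4, 1.5, 3, 1.3, 1.5])    # lower bounds  -> ~35x
mid  = prod([6, 1.7, 4, 1.3, 1.7])    # mid-range     -> ~90x
high = prod([8, 2.0, 5, 1.3, 2.0])    # upper bounds  -> ~208x
print(f"theoretical combined speedup: {low:.0f}x / {mid:.0f}x / {high:.0f}x")
```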

### Quick Start — HYPER Training

```bash
# All 7 paradigms ON — maximum speed
python train_hyper.py --scale tiny --max_steps 5000 --all

# Cherry-pick specific paradigms
python train_hyper.py --scale tiny --max_steps 5000 \
    --growlength --sparse-mezo --reservoir --fused-cache

# Benchmark: baseline vs hyper (side-by-side comparison)
python train_hyper.py --scale tiny --max_steps 100 --benchmark

# Full training run with all paradigms
OMP_NUM_THREADS=$(nproc) python train_hyper.py \
    --scale small --seq_len 256 --max_steps 50000 \
    --all --bf16 --compile \
    --save_every 5000 --log_every 10
```

### Paradigm Details

#### P1 — GrowLength Curriculum ([arxiv:2310.00576](https://arxiv.org/abs/2310.00576))

Trains with progressively longer sequences. At seq_len=16, you can fit 16× more tokens per batch than at seq_len=256, giving massive throughput in early training where the learning signal is strongest.

Default schedule:
- 20% of training at seq_len = target/8
- 25% at target/4
- 25% at target/2
- 30% at full target

```bash
python train_hyper.py --growlength --seq_len 256
```
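
A minimal sketch of that schedule (the function name and exact phase boundaries are illustrative and follow the default split above, not necessarily the repo's implementation):

```python
def growlength_seq_len(step: int, max_steps: int, target_len: int = 256) -> int:
    """Illustrative GrowLength curriculum: 20% of steps at target/8, 25% at
    target/4, 25% at target/2, and the final 30% at the full target length."""
    frac = step / max_steps
    if frac < 0.20:
        return max(target_len // 8, 16)
    if frac < 0.45:
        return target_len // 4
    if frac < 0.70:
        return target_len // 2
    return target_len
```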

#### P2 — Reservoir Freezing ([arxiv:2512.23145](https://arxiv.org/abs/2512.23145))

Inspired by GRC (Reservoir Computing for Language Models): freezes gate/forget projections in recurrent layers as random ternary matrices with unit spectral radius. These "reservoir" weights provide stable dynamics without needing gradient updates.

Targets:
- GatedDeltaNet: `a_proj`, `b_proj` (alpha/beta gates)
- mLSTM: `fgate` (forget gate)
- TitansMAC: `alpha_proj` (forgetting gate)

```bash
python train_hyper.py --reservoir --reservoir-ratio 0.5
```
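
A hedged sketch of the reservoir init (helper name and exact normalisation are assumptions): draw a random ternary matrix, scale it to unit spectral norm, and register it with gradients disabled.

```python
import torch

def reservoir_ternary(out_features: int, in_features: int, seed: int = 0) -> torch.nn.Parameter:
    """Random ternary {-1, 0, +1} weights scaled to unit spectral norm, then frozen.
    Illustrative only; the repo applies this to ~50% of the gate projections."""
    gen = torch.Generator().manual_seed(seed)
    w = torch.randint(-1, 2, (out_features, in_features), generator=gen).float()
    # Largest singular value ~1 keeps the frozen gate dynamics stable (no blow-up).
    w = w / torch.linalg.matrix_norm(w, ord=2).clamp_min(1e-6)
    return torch.nn.Parameter(w, requires_grad=False)  # no grads, no optimizer state
```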

#### P3 — Sparse MeZO ([arxiv:2406.02913](https://arxiv.org/abs/2406.02913))

Standard MeZO perturbs all ~35M parameters — most contribute near-zero gradient signal. Sparse MeZO identifies the top-K% most sensitive parameters (by weight magnitude) and perturbs only those. This dramatically reduces the variance of the ZO gradient estimate.

At 1% sparsity on a 35M model: only 350K params perturbed per step → **100× better signal-to-noise per forward pass**.

```bash
python train_hyper.py --sparse-mezo --mezo-sparsity 0.01
```
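
A sketch of how the sensitivity mask could be built from weight magnitude (helper name is illustrative; the update itself is the usual MeZO dual forward, restricted to the masked entries):

```python
import torch

def sparse_mezo_mask(param: torch.Tensor, sparsity: float = 0.01) -> torch.Tensor:
    """Boolean mask over the top `sparsity` fraction of entries by |weight|.
    Only these entries receive the +/- eps perturbation in the two MeZO forwards."""
    k = max(1, int(param.numel() * sparsity))
    threshold = param.abs().flatten().topk(k).values.min()
    return param.abs() >= threshold
```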

#### P5 — Fused Ternary Cache

Before each MeZO dual-forward, pre-materialises all BitLinear packed+dense weight caches. Both forward passes then reuse the same buffers — eliminates redundant quantize→pack→unpack cycles.

```bash
python train_hyper.py --fused-cache
```
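
A rough sketch of the pattern (the `unpack_weight` method and the cache attribute are hypothetical stand-ins for whatever BitLinear actually exposes):

```python
import contextlib

@contextlib.contextmanager
def fused_ternary_cache(bitlinear_modules):
    """Materialise each ternary layer's dense weight once, keep it alive for both
    MeZO forward passes, then drop it again."""
    for m in bitlinear_modules:
        m._dense_cache = m.unpack_weight()  # hypothetical: unpack 2-bit codes -> dense tensor
    try:
        yield
    finally:
        for m in bitlinear_modules:
            del m._dense_cache
```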

#### P7 — Progressive Layer Unfreezing

Starts with only the top ~25% of layers trainable. Early training is cheap (forward through frozen layers is fast, no gradient storage). Gradually unfreezes deeper layers as training progresses.

```bash
python train_hyper.py --progressive-unfreeze --unfreeze-stages 4
```
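
One way the stage schedule could look (names and boundaries are illustrative; `--unfreeze-stages 4` above suggests four equal stages over 28 layers):

```python
def trainable_layers(step: int, max_steps: int, n_layers: int = 28, stages: int = 4) -> range:
    """Illustrative progressive unfreeze: stage 0 trains only the top block of
    layers; each later stage unfreezes one more block, working downward."""
    stage = min(int(step / max_steps * stages), stages - 1)
    block = n_layers // stages
    first = max(n_layers - block * (stage + 1), 0)
    return range(first, n_layers)
```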

---

## Files

```
chimera/
  __init__.py          — Package exports (v5.3)
  config.py            — Config loading / scaling
  hyper.py             — ★ NEW: 7 HYPER paradigm engine
  quantization.py      — BitLinear (2-bit packed, C++ kernel, STE, N:M 2:4)
  layers.py            — GatedDeltaNet, mLSTM, TitansMAC, TSPSpanKnot
  moe.py               — MoELayer (sort-based dispatch)
  looping.py           — ParcaeLoopController
  inference.py         — SpanBank, STree, Grammar, EntropyValve, DebtLedger
  evolution.py         — TTT, SemanticMemory, EpisodicCases, MetaGuidelines
  multimodal.py        — VisionEncoder, AudioEncoder
  tokenizer.py         — ChimeraTokenizer (splintr, o200k_base)
  model.py             — Chimera51ForCausalLM
config.json            — Full model config
train.py               — Standard training (MeZO + AdamW)
train_fast.py          — Fast training with pre-tokenized cache
train_hyper.py         — ★ NEW: HYPER training (7 paradigms, 10k+ tok/s)
inference.py           — Inference / generation
```

---

## Previous Versions

### v5.1.4 β€” CPU Fast Path Audit
- Fixed package/runtime mismatch
- Added sparse MoELayer with expert-grouped dispatch
- Made C++ ternary extensions lazy-loaded
- Vectorized BitLinear AbsMean scaling
- Cached causal/triangular masks
- Reduced GatedDeltaNet clone churn

### v5.1.3 β€” Fix Illegal Instruction Crash
- Removed `-march=native` from C++ JIT flags
- Runtime CPUID detection for AVX-512/AVX2

### v5.1.2 β€” True Ternary Compute
- 2-bit packed uint8 weight storage (16× compression; sketched after this list)
- C++ unpack + MKL BLAS forward path
- MeZO sparse perturbation (skip ~33% zeros)
- STE backward with deep-zero masking
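
A sketch of that 2-bit packing (the bit layout is an assumption; the repo's C++ kernel may order codes differently). Four ternary values per uint8 byte gives the quoted 16× compression versus fp32.

```python
import torch

def pack_ternary(w_q: torch.Tensor) -> torch.Tensor:
    """Pack ternary weights {-1, 0, +1} into 2-bit codes, four per uint8 byte."""
    codes = (w_q.flatten() + 1).to(torch.uint8)            # {-1, 0, +1} -> {0, 1, 2}
    pad = (-codes.numel()) % 4
    if pad:
        codes = torch.cat([codes, codes.new_zeros(pad)])   # pad to a multiple of 4
    c = codes.view(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
```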

---

## Architecture (28 layers, 4 types)

```
Layer pattern: GD XM GD TM GD XM GD SK × 3.5
  GD = Gated DeltaNet (14 layers) — arxiv:2412.06464
  XM = xLSTM mLSTM (7 layers) — arxiv:2405.04517
  TM = Titans MAC (4 layers) — arxiv:2501.00663
  SK = TSP Span Knot (3 layers)
```
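
Taken literally, the pattern is an 8-layer block repeated 3.5 times; a plain left-to-right expansion (an assumption about the exact ordering) reproduces the per-type counts above:

```python
PATTERN = ["GD", "XM", "GD", "TM", "GD", "XM", "GD", "SK"]

def layer_types(n_layers: int = 28) -> list[str]:
    """Expand the 8-layer block pattern to full depth (3.5 repeats -> 28 layers)."""
    return [PATTERN[i % len(PATTERN)] for i in range(n_layers)]

# Counts for 28 layers: GD=14, XM=7, TM=4, SK=3, matching the breakdown above.
```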

All linear layers use **BitLinear** (ternary 1.58-bit) with per-group AbsMean scaling.
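
For reference, BitNet b1.58-style AbsMean quantization looks like this (shown per-tensor for brevity; the repo applies it per group):

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """AbsMean quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = w.abs().mean().clamp_min(eps)
    w_q = (w / scale).round().clamp_(-1, 1)
    return w_q, scale  # dequantize as w_q * scale
```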

---

## Training Modes

### HYPER (v5.3 — Recommended)
- **7 stacked paradigms** for maximum CPU throughput
- Target: **10,000+ tok/s** on 8-core CPU (tiny scale)
- Forward-only training (Sparse MeZO): no backward pass
- Memory = 2× model size (no activations, no gradients, no optimizer states)
- Each paradigm independently toggleable via CLI flags

### MeZO (v5.1 — Standard)
- Standard zeroth-order optimization
- 2 forward passes per step, no backward
- Good for fine-tuning; ~50-200 tok/s on CPU
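
For concreteness, a minimal sketch of the standard MeZO step (following the MeZO paper; the repo's HYPER version adds sparse perturbation, the fused ternary cache, etc.):

```python
import torch

@torch.no_grad()
def mezo_step(model, compute_loss, eps: float = 1e-3, lr: float = 1e-6, seed: int = 0) -> float:
    """Two perturbed forward passes, one scalar finite difference, no backward pass."""
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale: float) -> None:
        gen = torch.Generator().manual_seed(seed)  # regenerate the same z each call
        for p in params:
            p.add_(scale * eps * torch.randn(p.shape, generator=gen, dtype=p.dtype))

    perturb(+1)
    loss_plus = float(compute_loss(model))     # f(theta + eps*z)
    perturb(-2)
    loss_minus = float(compute_loss(model))    # f(theta - eps*z)
    perturb(+1)                                # restore theta
    g = (loss_plus - loss_minus) / (2 * eps)   # projected gradient estimate
    gen = torch.Generator().manual_seed(seed)
    for p in params:
        p.add_(-lr * g * torch.randn(p.shape, generator=gen, dtype=p.dtype))
    return loss_plus
```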

### AdamW (v5.1 — Full backprop)
- Standard gradient descent with checkpointing
- Best convergence quality for pretraining from scratch
- ~10-50 tok/s on CPU

---

## References

37 papers indexed in `config.json` under `§`. Key additions for v5.3:
- [GrowLength](https://arxiv.org/abs/2310.00576) — Progressive sequence length training
- [GRC MatMul-free LM](https://arxiv.org/abs/2512.23145) — Reservoir computing for LMs
- [Sparse MeZO](https://arxiv.org/abs/2406.02913) — Sparse zeroth-order fine-tuning
- [GaLore](https://arxiv.org/abs/2403.03507) — Gradient low-rank projection
- [QuZO](https://arxiv.org/abs/2502.12346) — Quantized zeroth-order training
- [SparAMX](https://arxiv.org/abs/2502.12444) — AMX-accelerated sparse CPU kernels

Plus all previous references:
- [Gated DeltaNet](https://arxiv.org/abs/2412.06464) — NVIDIA
- [xLSTM](https://arxiv.org/abs/2405.04517) — NXAI/JKU
- [Titans](https://arxiv.org/abs/2501.00663) — Google
- [Parcae](https://arxiv.org/abs/2604.12946) — Stanford/Together
- [BitNet b1.58](https://arxiv.org/abs/2402.17764) — Microsoft
- [MeZO](https://arxiv.org/abs/2305.17333) — Princeton