# MuseMorphic: A Lightweight, Consumer-Grade MIDI Generation Architecture

## Novel Architecture for Infinite-Length, Controllable Symbolic Music Generation

---

## 1. Problem Statement

Current MIDI generation models suffer from fundamental limitations:

| Problem | Cause | Examples |
|---------|-------|----------|
| **Quadratic memory scaling** | Full self-attention O(L²) | Music Transformer, MusicGPT |
| **Loss of coherence over long sequences** | Attention dilution, no structural memory | All AR transformer models |
| **Uncontrollable generation** | No explicit control interface | Most open-source models |
| **Consumer-unfriendly** | >8GB VRAM, slow inference | MuseNet, MusicGen |
| **Training instability** | Post-LN, FP16 overflow, no gradient control | Common in research code |
| **No music theory awareness** | Absolute pitch encoding, no harmonic inductive bias | Most models |

## 2. Architecture Overview: MuseMorphic

MuseMorphic is a **two-stage hierarchical architecture** combining:

1. **Stage 1 — PhraseVAE**: Compresses REMI+ token sequences into compact latent phrase vectors
2. **Stage 2 — LatentMamba**: Generates sequences of phrase vectors using Selective State Space Models

### Key Design Principles

- **O(n) complexity** in sequence length (Mamba SSM backbone, no quadratic attention)
- **Hierarchical latent space** (100x compression: thousands of REMI tokens → tens of 64-dim vectors)
- **Music-native embeddings** (FME: translational invariance, transposability, separability)
- **Controllable generation** via control embeddings prepended to latent sequences
- **Infinite generation** via sliding-window state propagation (Mamba recurrent mode)
- **Training stability by design** (Pre-LN, σReparam, ZClip, BF16, label smoothing)

### Parameter Budget

| Component | Parameters | VRAM (BF16 training) | VRAM (Inference) |
|-----------|-----------|---------------------|------------------|
| PhraseVAE Encoder | ~8M | ~400MB | ~200MB |
| PhraseVAE Decoder | ~10M | ~500MB | ~300MB |
| LatentMamba | ~12M | ~600MB | ~300MB |
| Embeddings + Heads | ~3M | ~150MB | ~100MB |
| **Total** | **~33M** | **~1.7GB** | **~0.9GB** |

✅ Trains on a free Colab T4 (16GB) with large batches
✅ Runs inference comfortably under 2GB of VRAM

## 3. Mathematical Foundations

### 3.1 REMI+ Tokenization with BPE

Following MIDI-RWKV (2025), we use REMI+ encoding with BPE compression:

```
Raw MIDI → REMI+ tokens → BPE (vocab=8192) → Integer sequence
```

REMI+ vocabulary structure:
- `[BAR]` — Bar boundary markers
- `[POS_0..POS_15]` — 16th-note grid positions within bar
- `[PITCH_0..PITCH_127]` — MIDI pitch values
- `[VEL_1..VEL_32]` — Velocity bins (32 levels)
- `[DUR_1..DUR_16]` — Duration bins (16th note to whole note)
- `[TEMPO_30..TEMPO_210]` — Tempo bins (4 BPM resolution)
- `[TIMESIG_*]` — Time signature tokens
- `[TRACK_START]`, `[TRACK_END]` — Track delimiters
- `[PROGRAM_0..PROGRAM_127]` — GM instrument programs
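
As a concrete illustration, the sketch below maps one note event onto this vocabulary. It is a minimal sketch: the binning constants and helper name are assumptions for illustration, not the exact training configuration. BPE then merges frequent token n-grams into single units to reach the 8192-entry vocabulary.

```python
# Illustrative sketch: encode one note within a bar as REMI+-style tokens.
# Bin sizes mirror the vocabulary above; the exact binning is an assumption.

def note_to_remi_tokens(pitch: int, velocity: int, start_16th: int,
                        dur_16ths: int) -> list[str]:
    vel_bin = max(1, min(32, round(velocity / 4)))  # 128 velocities -> 32 bins
    dur_bin = max(1, min(16, dur_16ths))            # 16th note .. whole note
    return [
        f"[POS_{start_16th}]",   # position on the 16th-note grid
        f"[PITCH_{pitch}]",
        f"[VEL_{vel_bin}]",
        f"[DUR_{dur_bin}]",
    ]

# Middle C, mf, on beat 2, lasting a quarter note:
print(note_to_remi_tokens(pitch=60, velocity=80, start_16th=4, dur_16ths=4))
# ['[POS_4]', '[PITCH_60]', '[VEL_20]', '[DUR_4]']
```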

### 3.2 Fundamental Music Embedding (FME)

From Liang et al. (2022), we use physics-aware embeddings that respect musical intervals:

$$\text{FME}(f) = \bigoplus_{k=0}^{d/2-1} \left[\sin(w_k f) + b_{\sin,k},\ \cos(w_k f) + b_{\cos,k}\right]$$

where $w_k = B^{-2k/d}$, $B$ is the base (different for pitch/duration/onset).

**Key mathematical properties:**
1. **Translational invariance**: Equal intervals → equal embedding distances
   $$|f_a - f_b| = |f_c - f_d| \Rightarrow \|\text{FME}(f_a) - \text{FME}(f_b)\|_2 = \|\text{FME}(f_c) - \text{FME}(f_d)\|_2$$
2. **Transposability**: Key transposition is a linear operation in embedding space
3. **Separability**: Pitch, duration, onset embeddings are orthogonal (different B values)

We extend FME by encoding pitch as **log-frequency** instead of MIDI integer:
$$f_{hz} = 440 \cdot 2^{(p - 69)/12}$$

This makes the embedding space respect the physical harmonic series.
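
A minimal PyTorch sketch of FME with the log-frequency extension follows; the learnable biases and per-attribute base $B$ come from the definition above, while the default dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FME(nn.Module):
    """Sinusoidal features of a scalar musical quantity with learnable
    per-frequency biases; a different base B per attribute keeps the
    pitch/duration/onset spaces separable."""

    def __init__(self, d: int = 64, base: float = 10000.0):
        super().__init__()
        k = torch.arange(d // 2, dtype=torch.float32)
        self.register_buffer("w", base ** (-2 * k / d))  # w_k = B^(-2k/d)
        self.b_sin = nn.Parameter(torch.zeros(d // 2))   # learnable biases
        self.b_cos = nn.Parameter(torch.zeros(d // 2))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (...,) scalar values -> (..., d) embeddings
        ang = f.unsqueeze(-1) * self.w
        return torch.cat([torch.sin(ang) + self.b_sin,
                          torch.cos(ang) + self.b_cos], dim=-1)

def midi_to_log_freq(p: torch.Tensor) -> torch.Tensor:
    """Pitch as log2 frequency (f_hz = 440 * 2^((p-69)/12)), so octaves
    are equidistant and the space respects the harmonic series."""
    return torch.log2(440.0 * 2.0 ** ((p - 69.0) / 12.0))

pitch_emb = FME(d=64)
z = pitch_emb(midi_to_log_freq(torch.tensor([60.0, 64.0, 67.0])))  # C major triad
```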

### 3.3 PhraseVAE (Stage 1)

Compresses one bar of one track (a "phrase") from REMI+ tokens into a 64-dim latent vector.

**Encoder** (3-layer Pre-LN Transformer with σReparam):
$$h = \text{TransformerEncoder}(\text{FME}(x) + \text{PosEnc}(x))$$

**Multi-Query Bottleneck** (from PhraseVAE, 2024):
$$q_1, \ldots, q_m = \text{LearnedQueries}, \quad m = 4$$
$$z_{queries} = \text{MultiHeadCrossAttention}(Q=q, K=h, V=h)$$
$$z_{flat} = \text{Flatten}(z_{queries})$$
$$\mu, \log\sigma^2 = \text{Linear}(z_{flat})$$
$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

**Decoder** (3-layer Pre-LN Transformer, autoregressive):
$$p(x|z) = \prod_t p(x_t | x_{<t}, z)$$

**Training Loss:**
$$\mathcal{L}_{VAE} = \mathcal{L}_{recon} + \beta \cdot D_{KL}(q(z|x) \| p(z))$$

where $\beta = 0.01$ (following PhraseVAE's finding that low β prevents KL domination).

**Three-stage training (curriculum):**
1. Span-infilling pretraining (learn REMI grammar)
2. Autoencoder training (minimize reconstruction)
3. VAE fine-tuning (add KL with β=0.01)
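
The bottleneck and loss translate directly into code. Below is a minimal sketch, assuming `d_model=256` for the encoder states; module and variable names are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class MultiQueryBottleneck(nn.Module):
    """m learned queries cross-attend to encoder states; the flattened
    result parameterizes a diagonal Gaussian over the 64-dim phrase latent."""

    def __init__(self, d_model: int = 256, m: int = 4, z_dim: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(m, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.to_stats = nn.Linear(m * d_model, 2 * z_dim)

    def forward(self, h: torch.Tensor):
        # h: (B, L, d_model) encoder states for one phrase
        q = self.queries.expand(h.size(0), -1, -1)   # (B, m, d_model)
        z_q, _ = self.attn(q, h, h)                  # cross-attention
        mu, logvar = self.to_stats(z_q.flatten(1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return z, mu, logvar

def vae_loss(recon_nll, mu, logvar, beta: float = 0.01):
    # Closed-form KL(q(z|x) || N(0, I)), weighted by the low beta
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return recon_nll + beta * kl.mean()
```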

### 3.4 LatentMamba (Stage 2)

Generates sequences of phrase latent vectors using Selective State Space Model.

**Selective SSM (Mamba) Core:**

Given input $x \in \mathbb{R}^{B \times L \times D}$:

$$B(x) = \text{Linear}_N(x), \quad C(x) = \text{Linear}_N(x)$$
$$\Delta(x) = \text{softplus}(\text{Linear}_1(x) + \text{Parameter})$$

Discretization (Zero-Order Hold):
$$\bar{A} = \exp(\Delta \cdot A), \quad \bar{B} = (\Delta \cdot A)^{-1}(\bar{A} - I) \cdot \Delta \cdot B(x)$$

Recurrence (parallel scan during training):
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t$$
$$y_t = C(x_t) h_t$$

**Complexity:**
- Training: O(BLD·N) with parallel scan — **linear in L**
- Inference: O(BD·N) per step — **constant per token**
- State size: O(D·N) per layer — **fixed, doesn't grow**
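
For reference, the recurrence can be written as a naive sequential loop (a minimal sketch; training uses a fused parallel scan instead). Following the Mamba paper, $\bar{B}$ is approximated by the Euler step $\Delta \cdot B$ rather than the full ZOH expression.

```python
import torch

def selective_scan(x, A, B, C, delta):
    """Naive reference for h_t = A_bar h_{t-1} + B_bar x_t, y_t = C_t h_t.
    Shapes: x (Bt, L, D), A (D, N), B and C (Bt, L, N), delta (Bt, L, D)."""
    Bt, L, D = x.shape
    h = x.new_zeros(Bt, D, A.shape[-1])              # fixed-size state
    ys = []
    for t in range(L):
        dA = torch.exp(delta[:, t, :, None] * A)           # A_bar via ZOH
        dB = delta[:, t, :, None] * B[:, t, None, :]       # B_bar ~= delta * B
        h = dA * h + dB * x[:, t, :, None]                 # state update
        ys.append((h * C[:, t, None, :]).sum(-1))          # y_t = C_t h_t
    return torch.stack(ys, dim=1)                    # (Bt, L, D)
```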

**Architecture:**
```
Input: [control_embed, z_phrase_1, z_phrase_2, ..., z_phrase_T]

Linear projection (64 → d_model=256)

MambaBlock × 8 (d_model=256, d_state=16, d_conv=4, expand=2)

Linear projection (256 → 64)

Output: predicted z_phrase_{2..T+1}
```

Each MambaBlock:
```
x → Pre-LN → [Linear(expand*D), SiLU, Conv1d] → SSM → × gate → Linear(D) → + residual
```

### 3.5 Control Mechanism

Control tokens are projected to d_model and prepended to the latent phrase sequence:

```
Controls: {tempo_class, key_signature, time_signature, density_level, style_tag}

Control Embeddings: Embed(control_i) for each control

Prepend: [ctrl_1, ctrl_2, ..., ctrl_K, z_1, z_2, ...]
```

During training, controls are extracted from the training data itself; at inference time, the user specifies them.
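
A minimal sketch of the conditioning path; the control vocabulary sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model = 256
ctrl_vocab = {"tempo_class": 8, "key_signature": 24, "time_signature": 6,
              "density_level": 5, "style_tag": 16}   # sizes are assumptions
ctrl_embeds = nn.ModuleDict({k: nn.Embedding(v, d_model)
                             for k, v in ctrl_vocab.items()})
z_proj = nn.Linear(64, d_model)  # 64-dim phrase latents -> d_model

def build_input(controls: dict, z_seq: torch.Tensor) -> torch.Tensor:
    # controls: name -> (B,) index tensor; z_seq: (B, T, 64)
    ctrl = torch.stack([ctrl_embeds[k](v) for k, v in controls.items()], dim=1)
    return torch.cat([ctrl, z_proj(z_seq)], dim=1)   # (B, K+T, d_model)
```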

### 3.6 Infinite Generation

Mamba operates in recurrent mode during inference:
1. Initialize state h₀ = 0 (or style-tuned state à la MIDI-RWKV)
2. Generate phrase latent z_t from state h_{t-1}
3. Update state: h_t = f(h_{t-1}, z_t)
4. Decode z_t through PhraseVAE decoder → REMI+ tokens → MIDI
5. **State is fixed-size** — no memory growth regardless of length
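
A sketch of this loop, with hypothetical interfaces (`init_state`, `start_latent`, `step`, `decode`) standing in for the recurrent-mode API; only the fixed-size state is carried between bars.

```python
def generate(latent_mamba, phrase_vae, controls, n_bars: int):
    state = latent_mamba.init_state(controls)   # h_0, optionally style-tuned
    z = latent_mamba.start_latent(controls)
    midi_events = []
    for _ in range(n_bars):
        z, state = latent_mamba.step(z, state)  # O(1) per phrase, fixed state
        midi_events += phrase_vae.decode(z)     # z -> REMI+ tokens -> MIDI
    return midi_events
```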

## 4. Training Stability Guarantees

### 4.1 σReparam (Spectral Reparameterization)

Applied to ALL linear layers:
$$\hat{W} = \frac{\gamma}{\sigma(W)} W$$

where σ(W) is the spectral norm (largest singular value), γ is learnable scalar.

**Prevents attention entropy collapse**, a leading cause of training instability in music transformers.
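
A sketch of a σReparam linear layer, estimating σ(W) with one power-iteration step per forward pass (the standard trick from spectral normalization); class and buffer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    """Linear layer whose weight is rescaled by its spectral norm times a
    learnable scalar gamma: W_hat = (gamma / sigma(W)) * W."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in) * d_in ** -0.5)
        self.gamma = nn.Parameter(torch.ones(()))
        self.register_buffer("u", F.normalize(torch.randn(d_out), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                    # one power-iteration step
            v = F.normalize(self.W.t() @ self.u, dim=0)
            self.u = F.normalize(self.W @ v, dim=0)
        sigma = torch.dot(self.u, self.W @ v)    # largest singular value
        return x @ (self.gamma / sigma * self.W).t()
```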

### 4.2 ZClip (Adaptive Gradient Clipping)

```python
# EMA of the gradient-norm mean/variance; clip only statistical outliers.
mu = alpha * mu + (1 - alpha) * grad_norm
var = alpha * var + (1 - alpha) * (grad_norm - mu) ** 2
threshold = mu + z_thresh * var ** 0.5  # z_thresh = 2.5
if grad_norm > threshold:
    torch.nn.utils.clip_grad_norm_(model.parameters(), threshold)
```

Clips only genuine gradient spikes, not normal gradients.

### 4.3 Pre-LayerNorm

All transformer/SSM blocks use Pre-LN:
```
x → LayerNorm → Sublayer → + residual
```

Gradient norms are analytically bounded, which eliminates the need for learning-rate warmup.

### 4.4 BFloat16

BF16 has the same 8-bit exponent range as FP32, so no loss scaling is needed and the FP16 overflow/underflow failure modes are avoided.

### 4.5 Label Smoothing

$$\mathcal{L} = (1-\epsilon) \cdot \text{CE}(p, y) + \epsilon \cdot H(p, u)$$

with ε=0.1. Prevents overconfident pitch predictions that cause mode collapse.
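
In PyTorch this needs no custom loss; the built-in cross entropy supports label smoothing directly, applied here to the PhraseVAE decoder's token logits.

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # epsilon = 0.1
# loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
```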

## 5. Comparison with Existing Approaches

| Feature | Music Transformer | MIDI-RWKV | MusicMamba | **MuseMorphic** |
|---------|------------------|-----------|------------|-----------------|
| Complexity | O(L²) | O(L) | O(L²) + O(L) | **O(L) everywhere** |
| VRAM Training | >8GB | ~4GB | ~6GB | **~1.7GB** |
| VRAM Inference | >4GB | ~1GB | ~2GB | **~0.9GB** |
| Controllable | ✗ | ✓ (attribute) | ✗ | **✓ (multi-attribute)** |
| Infinite Gen | ✗ (KV cache grows) | ✓ (recurrent) | ✗ | **✓ (recurrent)** |
| Music Theory Aware | Relative attention | REMI+ | Mode-aware | **FME + REMI+ + harmonic** |
| Training Stable | ✗ (needs tuning) | ✓ | ✓ | **✓ (by design)** |
| Model Size | 50-100M+ | 20-50M | ~30M | **~33M** |
| Hierarchical | ✗ | ✗ | ✗ | **✓ (phrase latent)** |

## 6. Novel Contributions

1. **First SSM-based latent music generator**: Mamba operating on compressed phrase latents, not raw tokens
2. **FME with log-frequency encoding**: Physics-aware embeddings respecting harmonic series
3. **Multi-attribute control via latent conditioning**: Tempo, key, density, style as prepended control embeddings
4. **Guaranteed training stability stack**: σReparam + ZClip + Pre-LN + BF16 + label smoothing
5. **Three-stage curriculum for PhraseVAE**: Span-infilling → AE → VAE (prevents posterior collapse)
6. **Sub-1GB inference**: Phrase-level Mamba recurrence with fixed-size state

## 7. References

- Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752
- MIDI-RWKV (2025). Personalizable Long-Context Symbolic Music Infilling. arXiv:2506.13001
- PhraseVAE (2024). Phrase-level latent diffusion for music. arXiv:2512.11348
- FME (2022). Domain-Knowledge-Inspired Music Embedding. arXiv:2212.00973
- σReparam (2023). Stabilizing Transformer Training. arXiv:2303.06296
- ZClip (2025). Adaptive Spike Mitigation for LLM Pre-Training. arXiv:2504.02507
- REMI (2020). Pop Music Transformer. arXiv:2002.00212
- MusicMamba (2024). Dual-Feature Modeling for Chinese Music. arXiv:2409.02421
- MIDI-GPT (2025). Infilling + Multi-Attribute Control. arXiv:2501.17011
- NotaGen (2025). AR Foundation Model with CLaMP-DPO. arXiv:2502.18008
- GETMusic (2023). Non-AR Discrete Diffusion for Multi-Track. arXiv:2305.10841
- Wan (2025). 3D Causal VAE + Flow Matching. arXiv:2503.20314

## License

Apache 2.0