# MuseMorphic: A Lightweight, Consumer-Grade MIDI Generation Architecture

## A Novel Architecture for Infinite-Length, Controllable Symbolic Music Generation

---

## 1. Problem Statement

Current MIDI generation models suffer from fundamental limitations:

| Problem | Cause | Examples |
|---------|-------|----------|
| **Quadratic memory scaling** | Full self-attention, O(L²) | Music Transformer, MusicGPT |
| **Loss of coherence over long sequences** | Attention dilution, no structural memory | All autoregressive transformer models |
| **Uncontrollable generation** | No explicit control interface | Most open-source models |
| **Consumer-unfriendly requirements** | >8 GB VRAM, slow inference | MuseNet, MusicGen |
| **Training instability** | Post-LN, FP16 overflow, no gradient control | Common in research code |
| **No music-theory awareness** | Absolute pitch encoding, no harmonic inductive bias | Most models |

## 2. Architecture Overview: MuseMorphic

MuseMorphic is a **two-stage hierarchical architecture** combining:

1. **Stage 1 — PhraseVAE**: compresses REMI+ token sequences into compact latent phrase vectors
2. **Stage 2 — LatentMamba**: generates sequences of phrase vectors using Selective State Space Models

### Key Design Principles

- **O(n) complexity** in sequence length (Mamba SSM backbone, no quadratic attention)
- **Hierarchical latent space** (~100× compression: thousands of REMI tokens → tens of 64-dim vectors)
- **Music-native embeddings** (FME: translational invariance, transposability, separability)
- **Controllable generation** via control embeddings prepended to latent sequences
- **Infinite generation** via sliding-window state propagation (Mamba recurrent mode)
- **Training stability by design** (Pre-LN, σReparam, ZClip, BF16, label smoothing)

### Parameter Budget

| Component | Parameters | VRAM (BF16 training) | VRAM (inference) |
|-----------|-----------|---------------------|------------------|
| PhraseVAE Encoder | ~8M | ~400 MB | ~200 MB |
| PhraseVAE Decoder | ~10M | ~500 MB | ~300 MB |
| LatentMamba | ~12M | ~600 MB | ~300 MB |
| Embeddings + Heads | ~3M | ~150 MB | ~100 MB |
| **Total** | **~33M** | **~1.7 GB** | **~0.9 GB** |

✅ Trains on a free Colab T4 (16 GB) with large batches
✅ Inference fits comfortably under 2 GB of VRAM

## 3. Mathematical Foundations

### 3.1 REMI+ Tokenization with BPE

Following MIDI-RWKV (2025), we use REMI+ encoding with BPE compression:

```
Raw MIDI → REMI+ tokens → BPE (vocab=8192) → Integer sequence
```

REMI+ vocabulary structure:

- `[BAR]` — bar boundary markers
- `[POS_0..POS_15]` — 16th-note grid positions within a bar
- `[PITCH_0..PITCH_127]` — MIDI pitch values
- `[VEL_1..VEL_32]` — velocity bins (32 levels)
- `[DUR_1..DUR_16]` — duration bins (16th note to whole note)
- `[TEMPO_30..TEMPO_210]` — tempo bins (4 BPM resolution)
- `[TIMESIG_*]` — time-signature tokens
- `[TRACK_START]`, `[TRACK_END]` — track delimiters
- `[PROGRAM_0..PROGRAM_127]` — General MIDI instrument programs

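For illustration, one bar of notes might serialize as below. This is a minimal sketch: the helper names and exact event ordering are assumptions for illustration, not the project's actual tokenizer, and the BPE step is omitted.

```python
# Minimal sketch of REMI+-style event tokenization (illustrative only;
# token names follow the vocabulary above, helper logic is assumed).

def tokenize_note(position, pitch, velocity_bin, duration_bin):
    """Serialize one note as position, pitch, velocity, duration tokens."""
    return [f"POS_{position}", f"PITCH_{pitch}",
            f"VEL_{velocity_bin}", f"DUR_{duration_bin}"]

def tokenize_bar(notes):
    """A bar is a BAR marker followed by its notes in onset order."""
    tokens = ["BAR"]
    for note in sorted(notes, key=lambda n: n[0]):  # sort by grid position
        tokens.extend(tokenize_note(*note))
    return tokens

# One bar with two notes: C4 on beat 1, E4 on beat 2 (positions 0 and 4).
bar = tokenize_bar([(0, 60, 20, 4), (4, 64, 18, 4)])
print(bar)
```

The resulting token list would then be compressed by BPE into the final integer sequence.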
### 3.2 Fundamental Music Embedding (FME)

From Liang et al. (2022), we use physics-aware embeddings that respect musical intervals:

$$\text{FME}(f) = \bigoplus_{k=0}^{d/2-1} \left[\sin(w_k f) + b_{\sin,k},\ \cos(w_k f) + b_{\cos,k}\right]$$

where $w_k = B^{-2k/d}$ and $B$ is the base (different for pitch, duration, and onset).

**Key mathematical properties:**

1. **Translational invariance**: equal intervals map to equal embedding distances
   $$|f_a - f_b| = |f_c - f_d| \Rightarrow \|\text{FME}(f_a) - \text{FME}(f_b)\|_2 = \|\text{FME}(f_c) - \text{FME}(f_d)\|_2$$
2. **Transposability**: key transposition is a linear operation in embedding space
3. **Separability**: pitch, duration, and onset embeddings are near-orthogonal (different $B$ values)

We extend FME by encoding pitch as **log-frequency** rather than as a MIDI integer:

$$f_{Hz} = 440 \cdot 2^{(p - 69)/12}$$

This makes the embedding space respect the physical harmonic series.

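The translational-invariance property can be checked numerically. In this minimal sketch the bias terms are omitted (they cancel in pairwise distances anyway), and the dimension and base are illustrative choices, not the model's hyperparameters:

```python
import math

def fme(f, d=8, base=10000.0):
    """Sinusoidal FME-style embedding of a scalar f (biases omitted)."""
    emb = []
    for k in range(d // 2):
        w = base ** (-2 * k / d)          # w_k = B^(-2k/d)
        emb += [math.sin(w * f), math.cos(w * f)]
    return emb

def dist(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Equal pitch intervals give equal embedding distances:
d1 = dist(fme(60), fme(67))  # C4 → G4, a perfect fifth
d2 = dist(fme(48), fme(55))  # C3 → G3, the same interval an octave lower
print(math.isclose(d1, d2))  # prints True
```

The equality holds because each sin/cos pair contributes $2 - 2\cos(w_k(f_a - f_b))$ to the squared distance, which depends only on the interval $f_a - f_b$.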
### 3.3 PhraseVAE (Stage 1)

Compresses one bar of one track (a "phrase") from REMI+ tokens into a 64-dim latent vector.

**Encoder** (3-layer Pre-LN Transformer with σReparam):

$$h = \text{TransformerEncoder}(\text{FME}(x) + \text{PosEnc}(x))$$

**Multi-Query Bottleneck** (from PhraseVAE, 2024):

$$q_1, \ldots, q_m = \text{LearnedQueries}(m=4)$$
$$z_{\text{queries}} = \text{MultiHeadCrossAttention}(Q=q, K=h, V=h)$$
$$z_{\text{flat}} = \text{Flatten}(z_{\text{queries}})$$
$$\mu, \log\sigma^2 = \text{Linear}(z_{\text{flat}})$$
$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

**Decoder** (3-layer Pre-LN Transformer, autoregressive):

$$p(x|z) = \prod_t p(x_t \mid x_{<t}, z)$$

**Training loss:**

$$\mathcal{L}_{VAE} = \mathcal{L}_{recon} + \beta \cdot D_{KL}(q(z|x) \,\|\, p(z))$$

where $\beta = 0.01$ (following PhraseVAE's finding that a low $\beta$ prevents the KL term from dominating).

**Three-stage training curriculum:**

1. Span-infilling pretraining (learn REMI grammar)
2. Autoencoder training (minimize reconstruction)
3. VAE fine-tuning (add KL with $\beta = 0.01$)

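The bottleneck's sampling step and KL term can be sketched in plain Python. This is illustrative only: the dimensions and values are made up, and the closed-form KL shown is the standard expression for a diagonal Gaussian against $\mathcal{N}(0, I)$:

```python
import math, random

def reparameterize(mu, log_var, rng=random.Random(0)):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu, sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

mu, log_var = [0.5, -0.2], [0.0, 0.0]     # sigma = 1 in both dims (toy values)
z = reparameterize(mu, log_var)
beta = 0.01
kl = kl_to_standard_normal(mu, log_var)   # 0.5*(0.25 + 0.04) = 0.145
print(len(z), round(beta * kl, 5))        # prints: 2 0.00145
```

With $\beta = 0.01$ the weighted KL term stays small relative to reconstruction, matching the low-β choice above.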
### 3.4 LatentMamba (Stage 2)

Generates sequences of phrase latent vectors using a Selective State Space Model.

**Selective SSM (Mamba) core:**

Given input $x \in \mathbb{R}^{B \times L \times D}$:

$$B(x) = \text{Linear}_N(x), \quad C(x) = \text{Linear}_N(x)$$
$$\Delta(x) = \text{softplus}(\text{Linear}_1(x) + \text{Parameter})$$

Discretization (zero-order hold):

$$\bar{A} = \exp(\Delta \cdot A), \quad \bar{B} = (\Delta \cdot A)^{-1}(\bar{A} - I) \cdot \Delta \cdot B(x)$$

Recurrence (computed with a parallel scan during training):

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t$$
$$y_t = C(x_t)\, h_t$$

**Complexity:**

- Training: O(B·L·D·N) with parallel scan — **linear in L**
- Inference: O(B·D·N) per step — **constant per token**
- State size: O(D·N) per layer — **fixed, does not grow**
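The discretization and recurrence above can be verified on a scalar toy case (D = N = 1, fixed Δ, no input-dependent selection; all values illustrative). The state stays a single number no matter how long the sequence runs:

```python
import math

def ssm_scan(xs, A=-1.0, B=1.0, C=1.0, delta=0.1):
    """Sequential scan of the ZOH-discretized SSM dh/dt = A h + B x."""
    A_bar = math.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B      # scalar form of (ΔA)^-1 (Ā - I) ΔB
    h, ys = 0.0, []
    for x in xs:                       # h_t = Ā h_{t-1} + B̄ x_t ; y_t = C h_t
        h = A_bar * h + B_bar * x
        ys.append(C * h)
    return ys, h                       # fixed-size state regardless of len(xs)

ys, h = ssm_scan([1.0] * 1000)
print(len(ys), round(h, 4))  # prints: 1000 1.0 (state converges to -B/A)
```

During training the same recurrence is evaluated with a parallel scan; the sequential form shown here is what runs at inference time, which is why per-token cost is constant.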

**Architecture:**

```
Input: [control_embed, z_phrase_1, z_phrase_2, ..., z_phrase_T]
  ↓
Linear projection (64 → d_model=256)
  ↓
MambaBlock × 8 (d_model=256, d_state=16, d_conv=4, expand=2)
  ↓
Linear projection (256 → 64)
  ↓
Output: predicted z_phrase_{2..T+1}
```

Each MambaBlock:

```
x → Pre-LN → [Linear(expand·D), SiLU, Conv1d] → SSM → × gate → Linear(D) → + residual
```

### 3.5 Control Mechanism

Control tokens are projected to d_model and prepended to the latent phrase sequence:

```
Controls: {tempo_class, key_signature, time_signature, density_level, style_tag}
  ↓
Control embeddings: Embed(control_i) for each control
  ↓
Prepend: [ctrl_1, ctrl_2, ..., ctrl_K, z_1, z_2, ...]
```

During training, controls are extracted from the actual music; during inference, the user specifies them.

### 3.6 Infinite Generation

Mamba operates in recurrent mode during inference:

1. Initialize state h₀ = 0 (or a style-tuned state, à la MIDI-RWKV)
2. Generate phrase latent z_t from state h_{t-1}
3. Update the state: h_t = f(h_{t-1}, z_t)
4. Decode z_t through the PhraseVAE decoder → REMI+ tokens → MIDI
5. **The state is fixed-size** — no memory growth regardless of length
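The loop above can be sketched with stub modules. All three callables here are hypothetical stand-ins for the trained networks, not real APIs:

```python
def generate(num_phrases, step, decode, h0=None):
    """step(h) -> (z, h_next); decode(z) -> REMI+ tokens for one phrase."""
    h, tokens = h0, []
    for _ in range(num_phrases):     # state h stays fixed-size throughout
        z, h = step(h)
        tokens.extend(decode(z))
    return tokens

# Toy stand-ins: the "state" is a counter, each phrase decodes to 2 tokens.
toy_step = lambda h: ((h or 0) + 1, (h or 0) + 1)
toy_decode = lambda z: ["BAR", f"PITCH_{60 + z}"]
out = generate(3, toy_step, toy_decode)
print(out)
```

Because `h` is passed forward rather than accumulated, memory use is identical whether `num_phrases` is 3 or 3 million.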
## 4. Training Stability Guarantees

### 4.1 σReparam (Spectral Reparameterization)

Applied to all linear layers:

$$\hat{W} = \frac{\gamma}{\sigma(W)} W$$

where $\sigma(W)$ is the spectral norm (largest singular value) of $W$ and $\gamma$ is a learnable scalar.

This prevents attention entropy collapse, a leading cause of training instability in music transformers.
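To make the scaling concrete, here is a minimal sketch of estimating σ(W) by power iteration and rescaling W as in the formula above (pure Python on a toy matrix; function names are illustrative, and real implementations cache the singular vectors across training steps):

```python
def spectral_norm(W, iters=50):
    """Estimate the largest singular value of W by power iteration."""
    n = len(W[0])
    v = [1.0] * n
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(len(W))]  # u = W v
        v = [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(n)]  # v = Wᵀ u
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(len(W))]
    return sum(x * x for x in u) ** 0.5

W = [[3.0, 0.0], [0.0, 1.0]]           # toy weight with singular values 3 and 1
sigma = spectral_norm(W)
gamma = 1.0
W_hat = [[gamma / sigma * w for w in row] for row in W]  # (γ/σ(W)) · W
print(round(sigma, 6))  # prints: 3.0
```

After rescaling, `W_hat` has spectral norm γ, which bounds how sharply any linear layer can amplify activations.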
### 4.2 ZClip (Adaptive Gradient Clipping)

ZClip maintains exponential moving averages of the gradient-norm mean and variance, and clips whenever the current norm exceeds a z-score threshold:

```python
# EMA statistics of the current gradient norm grad_norm
mu = alpha * mu + (1 - alpha) * grad_norm
var = alpha * var + (1 - alpha) * (grad_norm - mu) ** 2
threshold = mu + z_thresh * var ** 0.5  # z_thresh = 2.5
```

This clips only genuine gradient spikes, leaving normal gradients untouched.

### 4.3 Pre-LayerNorm

All transformer/SSM blocks use Pre-LN:

```
x → LayerNorm → Sublayer → + residual
```

This eliminates the need for learning-rate warmup, and gradient norms are analytically bounded.

### 4.4 BFloat16

BF16 has the same 8-bit exponent range as FP32, so no loss scaling is needed and overflow/underflow is effectively eliminated.

### 4.5 Label Smoothing

$$\mathcal{L} = (1-\epsilon) \cdot \text{CE}(p, y) + \epsilon \cdot H(p, u)$$

with $\epsilon = 0.1$. This prevents overconfident pitch predictions that cause mode collapse.
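A quick sanity check of the smoothed loss in plain Python (vocabulary size and probabilities are illustrative):

```python
import math

def smoothed_ce(log_probs, target, eps=0.1):
    """(1-eps)·CE against the one-hot target + eps·CE against uniform u."""
    V = len(log_probs)
    ce = -log_probs[target]            # standard cross-entropy
    h_u = -sum(log_probs) / V          # cross-entropy against uniform
    return (1 - eps) * ce + eps * h_u

# A perfectly uniform prediction over 4 classes: log p = log(1/4) everywhere.
lp = [math.log(0.25)] * 4
loss = smoothed_ce(lp, target=2)
print(math.isclose(loss, math.log(4)))  # prints True: smoothing is neutral here
```

Only confident (peaked) predictions pay a penalty under the uniform term, which is what discourages the overconfident pitch distributions described above.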
## 5. Comparison with Existing Approaches

| Feature | Music Transformer | MIDI-RWKV | MusicMamba | **MuseMorphic** |
|---------|------------------|-----------|------------|-----------------|
| Complexity | O(L²) | O(L) | O(L²) + O(L) | **O(L) everywhere** |
| VRAM (training) | >8 GB | ~4 GB | ~6 GB | **~1.7 GB** |
| VRAM (inference) | >4 GB | ~1 GB | ~2 GB | **~0.9 GB** |
| Controllable | ✗ | ✓ (attribute) | ✗ | **✓ (multi-attribute)** |
| Infinite generation | ✗ (KV cache grows) | ✓ (recurrent) | ✗ | **✓ (recurrent)** |
| Music-theory aware | Relative attention | REMI+ | Mode-aware | **FME + REMI+ + harmonic** |
| Training stability | ✗ (needs tuning) | ✓ | ✓ | **✓ (by design)** |
| Model size | 50–100M+ | 20–50M | ~30M | **~33M** |
| Hierarchical | ✗ | ✗ | ✗ | **✓ (phrase latents)** |

## 6. Novel Contributions

1. **First SSM-based latent music generator**: Mamba operating on compressed phrase latents rather than raw tokens
2. **FME with log-frequency encoding**: physics-aware embeddings respecting the harmonic series
3. **Multi-attribute control via latent conditioning**: tempo, key, density, and style as prepended control embeddings
4. **Training stability stack**: σReparam + ZClip + Pre-LN + BF16 + label smoothing
5. **Three-stage curriculum for PhraseVAE**: span-infilling → AE → VAE (prevents posterior collapse)
6. **Sub-1 GB inference**: phrase-level Mamba recurrence with a fixed-size state

## 7. References

- Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752
- MIDI-RWKV (2025). Personalizable Long-Context Symbolic Music Infilling. arXiv:2506.13001
- PhraseVAE (2024). Phrase-Level Latent Diffusion for Music. arXiv:2512.11348
- FME (2022). Domain-Knowledge-Inspired Music Embedding. arXiv:2212.00973
- σReparam (2023). Stabilizing Transformer Training. arXiv:2303.06296
- ZClip (2025). Adaptive Spike Mitigation for LLM Pre-Training. arXiv:2504.02507
- REMI (2020). Pop Music Transformer. arXiv:2002.00212
- MusicMamba (2024). Dual-Feature Modeling for Chinese Music. arXiv:2409.02421
- MIDI-GPT (2025). Infilling + Multi-Attribute Control. arXiv:2501.17011
- NotaGen (2025). AR Foundation Model with CLaMP-DPO. arXiv:2502.18008
- GETMusic (2023). Non-AR Discrete Diffusion for Multi-Track. arXiv:2305.10841
- Wan (2025). 3D Causal VAE + Flow Matching. arXiv:2503.20314

## License
|
| 259 |
+
|
| 260 |
+
Apache 2.0
|