# MuseMorphic: A Lightweight, Consumer-Grade MIDI Generation Architecture ## Novel Architecture for Infinite-Length, Controllable Symbolic Music Generation --- ## 1. Problem Statement Current MIDI generation models suffer from fundamental limitations: | Problem | Cause | Examples | |---------|-------|----------| | **Quadratic memory scaling** | Full self-attention O(L²) | Music Transformer, MusicGPT | | **Loss of coherence over long sequences** | Attention dilution, no structural memory | All AR transformer models | | **Uncontrollable generation** | No explicit control interface | Most open-source models | | **Consumer-unfriendly** | >8GB VRAM, slow inference | MuseNet, MusicGen | | **Training instability** | Post-LN, FP16 overflow, no gradient control | Common in research code | | **No music theory awareness** | Absolute pitch encoding, no harmonic inductive bias | Most models | ## 2. Architecture Overview: MuseMorphic MuseMorphic is a **two-stage hierarchical architecture** combining: 1. **Stage 1 — PhraseVAE**: Compresses REMI+ token sequences into compact latent phrase vectors 2. **Stage 2 — LatentMamba**: Generates sequences of phrase vectors using Selective State Space Models ### Key Design Principles - **O(n) complexity** in sequence length (Mamba SSM backbone, no quadratic attention) - **Hierarchical latent space** (100x compression: thousands of REMI tokens → tens of 64-dim vectors) - **Music-native embeddings** (FME: translational invariance, transposability, separability) - **Controllable generation** via control embeddings prepended to latent sequences - **Infinite generation** via sliding-window state propagation (Mamba recurrent mode) - **Training stability by design** (Pre-LN, σReparam, ZClip, BF16, label smoothing) ### Parameter Budget | Component | Parameters | VRAM (BF16 training) | VRAM (Inference) | |-----------|-----------|---------------------|------------------| | PhraseVAE Encoder | ~8M | ~400MB | ~200MB | | PhraseVAE Decoder | ~10M | ~500MB | ~300MB | | LatentMamba | ~12M | ~600MB | ~300MB | | Embeddings + Heads | ~3M | ~150MB | ~100MB | | **Total** | **~33M** | **~1.7GB** | **~0.9GB** | ✅ Trains on free Colab T4 (16GB) with large batches ✅ Inference under 2GB VRAM easily ## 3. Mathematical Foundations ### 3.1 REMI+ Tokenization with BPE Following MIDI-RWKV (2025), we use REMI+ encoding with BPE compression: ``` Raw MIDI → REMI+ tokens → BPE (vocab=8192) → Integer sequence ``` REMI+ vocabulary structure: - `[BAR]` — Bar boundary markers - `[POS_0..POS_15]` — 16th-note grid positions within bar - `[PITCH_0..PITCH_127]` — MIDI pitch values - `[VEL_1..VEL_32]` — Velocity bins (32 levels) - `[DUR_1..DUR_16]` — Duration bins (16th note to whole note) - `[TEMPO_30..TEMPO_210]` — Tempo bins (4 BPM resolution) - `[TIMESIG_*]` — Time signature tokens - `[TRACK_START]`, `[TRACK_END]` — Track delimiters - `[PROGRAM_0..PROGRAM_127]` — GM instrument programs ### 3.2 Fundamental Music Embedding (FME) From Liang et al. (2022), we use physics-aware embeddings that respect musical intervals: $$\text{FME}(f) = \bigoplus_{k=0}^{d/2-1} \left[\sin(w_k f) + b_{\sin,k},\ \cos(w_k f) + b_{\cos,k}\right]$$ where $w_k = B^{-2k/d}$, $B$ is the base (different for pitch/duration/onset). **Key mathematical properties:** 1. **Translational invariance**: Equal intervals → equal embedding distances $$|f_a - f_b| = |f_c - f_d| \Rightarrow \|\text{FME}(f_a) - \text{FME}(f_b)\|_2 = \|\text{FME}(f_c) - \text{FME}(f_d)\|_2$$ 2. **Transposability**: Key transposition is a linear operation in embedding space 3. **Separability**: Pitch, duration, onset embeddings are orthogonal (different B values) We extend FME by encoding pitch as **log-frequency** instead of MIDI integer: $$f_{hz} = 440 \cdot 2^{(p - 69)/12}$$ This makes the embedding space respect the physical harmonic series. ### 3.3 PhraseVAE (Stage 1) Compresses one bar of one track (a "phrase") from REMI+ tokens into a 64-dim latent vector. **Encoder** (3-layer Pre-LN Transformer with σReparam): $$h = \text{TransformerEncoder}(\text{FME}(x) + \text{PosEnc}(x))$$ **Multi-Query Bottleneck** (from PhraseVAE, 2024): $$q_1..q_m = \text{LearnedQueries}(m=4)$$ $$z_{queries} = \text{MultiHeadCrossAttention}(Q=q, K=h, V=h)$$ $$z_{flat} = \text{Flatten}(z_{queries})$$ $$\mu, \log\sigma^2 = \text{Linear}(z_{flat})$$ $$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$ **Decoder** (3-layer Pre-LN Transformer, autoregressive): $$p(x|z) = \prod_t p(x_t | x_{8GB | ~4GB | ~6GB | **~1.7GB** | | VRAM Inference | >4GB | ~1GB | ~2GB | **~0.9GB** | | Controllable | ✗ | ✓ (attribute) | ✗ | **✓ (multi-attribute)** | | Infinite Gen | ✗ (KV cache grows) | ✓ (recurrent) | ✗ | **✓ (recurrent)** | | Music Theory Aware | Relative attention | REMI+ | Mode-aware | **FME + REMI+ + harmonic** | | Training Stable | ✗ (needs tuning) | ✓ | ✓ | **✓ (by design)** | | Model Size | 50-100M+ | 20-50M | ~30M | **~33M** | | Hierarchical | ✗ | ✗ | ✗ | **✓ (phrase latent)** | ## 6. Novel Contributions 1. **First SSM-based latent music generator**: Mamba operating on compressed phrase latents, not raw tokens 2. **FME with log-frequency encoding**: Physics-aware embeddings respecting harmonic series 3. **Multi-attribute control via latent conditioning**: Tempo, key, density, style as prepended control embeddings 4. **Guaranteed training stability stack**: σReparam + ZClip + Pre-LN + BF16 + label smoothing 5. **Three-stage curriculum for PhraseVAE**: Span-infilling → AE → VAE (prevents posterior collapse) 6. **Sub-1GB inference**: Phrase-level Mamba recurrence with fixed-size state ## 7. References - Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752 - MIDI-RWKV (2025). Personalizable Long-Context Symbolic Music Infilling. arXiv:2506.13001 - PhraseVAE (2024). Phrase-level latent diffusion for music. arXiv:2512.11348 - FME (2022). Domain-Knowledge-Inspired Music Embedding. arXiv:2212.00973 - σReparam (2023). Stabilizing Transformer Training. arXiv:2303.06296 - ZClip (2025). Adaptive Spike Mitigation for LLM Pre-Training. arXiv:2504.02507 - REMI (2020). Pop Music Transformer. arXiv:2002.00212 - MusicMamba (2024). Dual-Feature Modeling for Chinese Music. arXiv:2409.02421 - MIDI-GPT (2025). Infilling + Multi-Attribute Control. arXiv:2501.17011 - NotaGen (2025). AR Foundation Model with CLaMP-DPO. arXiv:2502.18008 - GETMusic (2023). Non-AR Discrete Diffusion for Multi-Track. arXiv:2305.10841 - Wan (2025). 3D Causal VAE + Flow Matching. arXiv:2503.20314 ## License Apache 2.0