# CSM-1B — Combined Naturalness Fine-tune
Fine-tuned Sesame CSM-1B with a combined naturalness objective targeting both decoder and audio components.
## Approach
Unlike the decoder-only fine-tune, which freezes the backbone, this variant jointly optimizes the decoder, audio head, and projection layers with a naturalness-weighted loss that balances:
- Acoustic quality (spectral fidelity to the target speaker)
- Prosodic naturalness (pitch contour and rhythm preservation)
- Conversational coherence (context-appropriate intonation)
Trained for 13,500 steps on Expresso conversational speech data (female speaker ex04).
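The weighted combination described above can be sketched as a simple scalar objective. This is an illustrative sketch, not the actual training code: the function name, the individual loss terms, and the default weights are all assumptions.

```python
# Hypothetical sketch of the combined naturalness objective.
# The three terms correspond to the bullets above; the weights
# (w_acoustic, w_prosody, w_coherence) are illustrative defaults,
# not the values used in training.
def naturalness_loss(acoustic_loss, prosody_loss, coherence_loss,
                     w_acoustic=1.0, w_prosody=0.5, w_coherence=0.5):
    """Combine acoustic, prosodic, and coherence terms into one scalar."""
    return (w_acoustic * acoustic_loss
            + w_prosody * prosody_loss
            + w_coherence * coherence_loss)
```

In practice each term would be computed per batch (e.g. a spectral reconstruction loss, a pitch/rhythm loss, and a context-conditioned loss) and the combined scalar backpropagated through all three trainable components.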
## Files
| File | Size | Description |
|---|---|---|
| `decoder.pt` | ~212 MB | Fine-tuned speech decoder |
| `audio_head.pt` | ~124 MB | Fine-tuned audio prediction head |
| `projection.pt` | ~4 MB | Fine-tuned projection layer |
| `model_merged.pt` | ~2.9 GB | Full merged model |
## Usage
```python
import torch

# Load the full merged checkpoint (~2.9 GB) directly onto the GPU
model_state = torch.load("model_merged.pt", map_location="cuda")
```