# CSM-1B Conversational: Fine-tuned on Expresso
Fine-tuned version of Sesame's CSM-1B (Conversational Speech Model) for improved naturalness in conversational TTS.
## What is CSM?
CSM (Conversational Speech Model) is the model behind Sesame's Maya, the first voice AI widely considered to have crossed the uncanny valley of conversational speech. CSM generates speech tokens from text and audio context, which are decoded to waveform by the Mimi neural codec.
## Training
Architecture: Decoder-only fine-tuning of CSM-1B's speech decoder and audio head.
- Base model: sesame/csm-1b (1B params, trained on ~1M hours of conversational audio)
- Dataset: Expresso, a high-quality expressive speech corpus (female speaker `ex04`)
- Method: Decoder-only fine-tuning (`decoder.pt` + `audio_head.pt` + `projection.pt`); the backbone stays frozen to preserve conversational priors
- Training: 5000 steps, best checkpoint selected by validation loss
- Hardware: NVIDIA A10G 24GB
## Why decoder-only?
CSM's backbone was trained on ~1M hours of conversation, the largest conversational speech dataset in existence. Full fine-tuning risks catastrophic forgetting of these priors. By fine-tuning only the decoder components, we preserve conversational naturalness while adapting the voice characteristics.
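The freezing step described above can be sketched as follows. This is a minimal illustration, assuming the model exposes parameters under the prefixes `backbone`, `decoder`, `audio_head`, and `projection` (names mirror the checkpoint files; the real CSM class layout may differ):

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module) -> list:
    """Freeze the backbone; leave decoder, audio head, and projection trainable.

    Returns the names of the parameters left trainable, for logging.
    """
    trainable = []
    for name, param in model.named_parameters():
        if name.startswith("backbone"):
            param.requires_grad = False  # preserve conversational priors
        else:
            param.requires_grad = True   # decoder.*, audio_head.*, projection.*
            trainable.append(name)
    return trainable
```

The optimizer is then built only over `filter(lambda p: p.requires_grad, model.parameters())`, so the frozen backbone receives no gradient updates.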
## Files
| File | Size | Description |
|---|---|---|
| `decoder.pt` | 212 MB | Fine-tuned speech decoder weights |
| `audio_head.pt` | 124 MB | Fine-tuned audio prediction head |
| `projection.pt` | 4 MB | Fine-tuned projection layer |
| `model_merged.pt` | 2.9 GB | Full merged model (ready to load) |
## Usage
```python
import torch

# Load the merged model directly; it is a drop-in replacement for CSM-1B
model_state = torch.load("model_merged.pt", map_location="cuda")

# Or load individual components for surgical replacement
decoder_state = torch.load("decoder.pt", map_location="cuda")
audio_head_state = torch.load("audio_head.pt", map_location="cuda")
projection_state = torch.load("projection.pt", map_location="cuda")
```
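For the surgical path, the component state dicts must be merged into a base CSM-1B state dict. A hedged sketch, assuming the fine-tuned components map onto the base model under the key prefixes `decoder.`, `audio_head.`, and `projection.` (verify these against your checkpoint's actual key names):

```python
import torch

def patch_state_dict(base_state: dict, components: dict) -> dict:
    """Return a copy of base_state with fine-tuned component weights swapped in.

    components maps a key prefix (e.g. "decoder") to that component's state dict.
    Raises KeyError if a fine-tuned key has no counterpart in the base model,
    which usually signals a prefix mismatch.
    """
    patched = dict(base_state)
    for prefix, comp_state in components.items():
        for key, tensor in comp_state.items():
            full_key = f"{prefix}.{key}"
            if full_key not in patched:
                raise KeyError(f"{full_key} not found in base state dict")
            patched[full_key] = tensor
    return patched
```

The merged dict can then be passed to the base model's `load_state_dict`, which additionally validates shapes.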
For full pipeline integration, see Project Maya.
## Part of Project Maya
This model was fine-tuned as part of Project Maya, a real-time conversational voice AI system. The full pipeline includes:
- STT: faster-whisper (large-v3-turbo)
- LLM: Llama 3.2 3B via llama.cpp
- TTS: This model (CSM-1B fine-tuned)
- Pipeline: Multi-GPU streaming with <2s end-to-end latency
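The pipeline shape above can be sketched as a streaming generator chain. The function names below are stand-ins, not Project Maya's actual API; the point is that audio synthesis starts as soon as LLM tokens arrive, rather than after the full response:

```python
from typing import Callable, Iterable, Iterator

def run_pipeline(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], Iterable[str]],
    tts: Callable[[str], Iterable[bytes]],
) -> Iterator[bytes]:
    """Stream synthesized audio incrementally to cut end-to-end latency."""
    text = stt(audio)            # transcribe the user's utterance
    for token in llm(text):      # LLM tokens stream out one at a time
        yield from tts(token)    # synthesize each piece immediately
```

In the real system each stage runs on its own GPU and stages overlap in time; this sequential sketch only shows the data flow.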
Repos:
## Limitations
- Fine-tuned on a single speaker (`ex04`); voice characteristics are speaker-specific
- Requires the CSM-1B architecture code from Sesame for inference
- Decoder-only fine-tuning means the backbone's tokenizer behavior is unchanged
## Citation
```bibtex
@misc{csm2025sesame,
  title={CSM: Conversational Speech Model},
  author={Sesame},
  year={2025},
  url={https://github.com/SesameAILabs/csm}
}
```