# CSM-1B Conversational: Fine-tuned on Expresso
Fine-tuned version of Sesame's CSM-1B (Conversational Speech Model) for improved naturalness in conversational TTS.
## What is CSM?
CSM (Conversational Speech Model) is the model behind Sesame's Maya, the first voice AI widely considered to have crossed the uncanny valley of conversational speech. CSM generates speech tokens from text and audio context, which are decoded to waveform by the Mimi neural codec.
## Training
Architecture: Decoder-only fine-tuning of CSM-1B's speech decoder and audio head.
- Base model: sesame/csm-1b (1B params, trained on ~1M hours of conversational audio)
- Dataset: Expresso, a high-quality expressive speech corpus (female speaker `ex04`)
- Method: Decoder-only fine-tuning (`decoder.pt` + `audio_head.pt` + `projection.pt`); the backbone stays frozen to preserve conversational priors
- Training: 5000 steps, best checkpoint selected by validation loss
- Hardware: NVIDIA A10G 24GB
## Why decoder-only?
CSM's backbone was trained on ~1M hours of conversation, the largest conversational speech dataset in existence. Full fine-tuning risks catastrophic forgetting of these priors. By fine-tuning only the decoder components, we preserve conversational naturalness while adapting the voice characteristics.
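The freezing step described above can be sketched as follows. This is a minimal illustration, assuming the model exposes parameters under the prefixes `backbone`, `decoder`, `audio_head`, and `projection` (names mirror the checkpoint files; the real CSM class layout may differ):

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module) -> list:
    """Freeze the backbone; leave decoder, audio head, and projection trainable.

    Returns the names of the parameters left trainable, for logging.
    """
    trainable = []
    for name, param in model.named_parameters():
        if name.startswith("backbone"):
            param.requires_grad = False  # preserve conversational priors
        else:
            param.requires_grad = True   # decoder.*, audio_head.*, projection.*
            trainable.append(name)
    return trainable
```

The optimizer is then built only over `filter(lambda p: p.requires_grad, model.parameters())`, so the frozen backbone receives no gradient updates.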
## Files
| File | Size | Description |
|---|---|---|
| `decoder.pt` | 212 MB | Fine-tuned speech decoder weights |
| `audio_head.pt` | 124 MB | Fine-tuned audio prediction head |
| `projection.pt` | 4 MB | Fine-tuned projection layer |
| `model_merged.pt` | 2.9 GB | Full merged model (ready to load) |
## Usage
```python
import torch

# Load the merged model directly; it is a drop-in replacement for CSM-1B
model_state = torch.load("model_merged.pt", map_location="cuda")

# Or load individual components for surgical replacement
decoder_state = torch.load("decoder.pt", map_location="cuda")
audio_head_state = torch.load("audio_head.pt", map_location="cuda")
projection_state = torch.load("projection.pt", map_location="cuda")
```
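For the surgical path, the component state dicts must be merged into a base CSM-1B state dict. A hedged sketch, assuming the fine-tuned components map onto the base model under the key prefixes `decoder.`, `audio_head.`, and `projection.` (verify these against your checkpoint's actual key names):

```python
import torch

def patch_state_dict(base_state: dict, components: dict) -> dict:
    """Return a copy of base_state with fine-tuned component weights swapped in.

    components maps a key prefix (e.g. "decoder") to that component's state dict.
    Raises KeyError if a fine-tuned key has no counterpart in the base model,
    which usually signals a prefix mismatch.
    """
    patched = dict(base_state)
    for prefix, comp_state in components.items():
        for key, tensor in comp_state.items():
            full_key = f"{prefix}.{key}"
            if full_key not in patched:
                raise KeyError(f"{full_key} not found in base state dict")
            patched[full_key] = tensor
    return patched
```

The merged dict can then be passed to the base model's `load_state_dict`, which additionally validates shapes.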
For full pipeline integration, see Project Maya.
## Part of Project Maya
This model was fine-tuned as part of Project Maya, a real-time conversational voice AI system. The full pipeline includes:
- STT: faster-whisper (large-v3-turbo)
- LLM: Llama 3.2 3B via llama.cpp
- TTS: This model (CSM-1B fine-tuned)
- Pipeline: Multi-GPU streaming with <2s end-to-end latency
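The pipeline shape above can be sketched as a streaming generator chain. The function names below are stand-ins, not Project Maya's actual API; the point is that audio synthesis starts as soon as LLM tokens arrive, rather than after the full response:

```python
from typing import Callable, Iterable, Iterator

def run_pipeline(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], Iterable[str]],
    tts: Callable[[str], Iterable[bytes]],
) -> Iterator[bytes]:
    """Stream synthesized audio incrementally to cut end-to-end latency."""
    text = stt(audio)            # transcribe the user's utterance
    for token in llm(text):      # LLM tokens stream out one at a time
        yield from tts(token)    # synthesize each piece immediately
```

In the real system each stage runs on its own GPU and stages overlap in time; this sequential sketch only shows the data flow.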
Repos:
## Limitations
- Fine-tuned on a single speaker (`ex04`); voice characteristics are speaker-specific
- Requires the CSM-1B architecture code from Sesame for inference
- Decoder-only fine-tuning means the backbone's tokenizer behavior is unchanged
## Citation
```bibtex
@misc{csm2025sesame,
  title={CSM: Conversational Speech Model},
  author={Sesame},
  year={2025},
  url={https://github.com/SesameAILabs/csm}
}
```