CSM-1B Conversational β€” Fine-tuned on Expresso

Fine-tuned version of Sesame's CSM-1B (Conversational Speech Model) for improved naturalness in conversational TTS.

What is CSM?

CSM (Conversational Speech Model) is the model behind Sesame's Maya β€” the first voice AI widely considered to have crossed the uncanny valley of conversational speech. CSM generates speech tokens from text + audio context, decoded through a Mimi neural codec.

Training

Architecture: Decoder-only fine-tuning of CSM-1B's speech decoder and audio head.

  • Base model: sesame/csm-1b (1B params, trained on ~1M hours of conversational audio)
  • Dataset: Expresso β€” high-quality expressive speech corpus (female speaker ex04)
  • Method: Decoder-only fine-tuning (decoder.pt + audio_head.pt + projection.pt) β€” keeps the backbone frozen to preserve conversational priors
  • Training: 5000 steps, best checkpoint selected by validation loss
  • Hardware: NVIDIA A10G 24GB

Why decoder-only?

CSM's backbone was trained on roughly 1M hours of conversational audio β€” one of the largest conversational speech corpora assembled. Full fine-tuning risks catastrophic forgetting of these priors. By fine-tuning only the decoder components (decoder, audio head, projection) and keeping the backbone frozen, the voice characteristics adapt while conversational naturalness is preserved.
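In PyTorch terms, the split looks roughly like this. The modules below are toy stand-ins β€” the real classes and shapes come from Sesame's csm repo β€” but the freeze/optimize pattern is the point:

```python
import torch
from torch import nn

# Toy stand-ins for CSM-1B's components; names and shapes are
# illustrative only, not Sesame's actual module definitions.
model = nn.ModuleDict({
    "backbone":   nn.Linear(8, 8),   # conversational prior -- stays frozen
    "decoder":    nn.Linear(8, 8),   # fine-tuned -> decoder.pt
    "audio_head": nn.Linear(8, 4),   # fine-tuned -> audio_head.pt
    "projection": nn.Linear(8, 8),   # fine-tuned -> projection.pt
})

# Freeze the backbone so its conversational priors cannot drift.
for p in model["backbone"].parameters():
    p.requires_grad = False

# Only the unfrozen components reach the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```

Gradients then flow only through the decoder, audio head, and projection; the backbone's weights are untouched by every optimizer step.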

Files

File             Size    Description
decoder.pt       212MB   Fine-tuned speech decoder weights
audio_head.pt    124MB   Fine-tuned audio prediction head
projection.pt    4MB     Fine-tuned projection layer
model_merged.pt  2.9GB   Full merged model (ready to load)

Usage

# Load the merged model directly
import torch

# The merged model can be loaded as a drop-in replacement for CSM-1B
model_state = torch.load("model_merged.pt", map_location="cuda")

# Or load individual components for surgical replacement
decoder_state = torch.load("decoder.pt", map_location="cuda")
audio_head_state = torch.load("audio_head.pt", map_location="cuda")
projection_state = torch.load("projection.pt", map_location="cuda")
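For the surgical route, the three component state dicts can be overlaid onto a base CSM-1B state dict. A minimal sketch β€” the key prefixes ("decoder.", "audio_head.", "projection.") are assumptions about how the base checkpoint names its modules, so check them against Sesame's actual state dict keys:

```python
def merge_components(base_state, decoder_state, audio_head_state, projection_state):
    """Overlay fine-tuned component weights onto a base CSM-1B state dict.

    The prefixes below are assumed module names; verify them against the
    keys of the base checkpoint before loading.
    """
    merged = dict(base_state)
    for prefix, component in [
        ("decoder.", decoder_state),
        ("audio_head.", audio_head_state),
        ("projection.", projection_state),
    ]:
        for key, tensor in component.items():
            merged[prefix + key] = tensor
    return merged

# Usage with the files above (base checkpoint path is hypothetical):
# base = torch.load("csm_1b_base.pt", map_location="cpu")
# merged = merge_components(base, decoder_state, audio_head_state, projection_state)
# model.load_state_dict(merged)
```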

For full pipeline integration, see Project Maya.

Part of Project Maya

This model was fine-tuned as part of Project Maya β€” a real-time conversational voice AI system. The full pipeline includes:

  • STT: faster-whisper (large-v3-turbo)
  • LLM: Llama 3.2 3B via llama.cpp
  • TTS: This model (CSM-1B fine-tuned)
  • Pipeline: Multi-GPU streaming with <2s end-to-end latency
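One way such a pipeline keeps latency under 2s is to synthesize speech sentence-by-sentence as LLM tokens stream in, rather than waiting for the full reply. A hedged sketch of that idea (function names here are illustrative, not Project Maya's actual API):

```python
def stream_reply(transcript, llm_stream, tts):
    """Yield TTS audio per sentence as LLM tokens arrive.

    llm_stream(transcript) yields text tokens; tts(sentence) returns audio.
    Both are placeholders for the real STT/LLM/TTS stages.
    """
    buf = ""
    for token in llm_stream(transcript):
        buf += token
        # Flush at sentence boundaries so playback starts early.
        if buf.rstrip().endswith((".", "!", "?")):
            yield tts(buf.strip())
            buf = ""
    if buf.strip():          # flush any trailing partial sentence
        yield tts(buf.strip())
```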

Limitations

  • Fine-tuned on a single speaker (ex04) β€” voice characteristics are speaker-specific
  • Requires the CSM-1B architecture code from Sesame for inference
  • Decoder-only fine-tune means the backbone tokenizer behavior is unchanged

Citation

@misc{csm2025sesame,
  title={CSM: Conversational Speech Model},
  author={Sesame},
  year={2025},
  url={https://github.com/SesameAILabs/csm}
}