CSM-1B — Combined Naturalness Fine-tune

Fine-tuned Sesame CSM-1B with a combined naturalness objective targeting both decoder and audio components.

Approach

Unlike the decoder-only fine-tune, which freezes the backbone, this variant jointly optimizes the decoder, audio head, and projection layers with a naturalness-weighted loss that balances:

  • Acoustic quality (spectral fidelity to the target speaker)
  • Prosodic naturalness (pitch contour and rhythm preservation)
  • Conversational coherence (context-appropriate intonation)

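The card does not publish the actual weighting scheme, so as an illustrative sketch only (the weight values and loss-term names below are assumptions, not the real training code), a combined objective of this shape could be expressed as:

```python
import torch

def combined_naturalness_loss(acoustic_loss, prosody_loss, coherence_loss,
                              w_acoustic=1.0, w_prosody=0.5, w_coherence=0.25):
    """Weighted sum of the three naturalness terms.

    The weights here are illustrative placeholders, not the values used
    to train this checkpoint.
    """
    return (w_acoustic * acoustic_loss
            + w_prosody * prosody_loss
            + w_coherence * coherence_loss)

# Example with scalar stand-ins for each per-batch loss term
loss = combined_naturalness_loss(torch.tensor(0.8),
                                 torch.tensor(0.4),
                                 torch.tensor(0.2))
```

In a real training loop each term would be computed from model outputs (e.g. a spectral loss for acoustic quality, an F0/duration loss for prosody), and the scalar weights tuned on validation naturalness scores.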
Trained for 13,500 steps on the Expresso conversational speech dataset (female speaker ex04).

Files

File             Size    Description
decoder.pt       ~212MB  Fine-tuned speech decoder
audio_head.pt    ~124MB  Fine-tuned audio prediction head
projection.pt    ~4MB    Fine-tuned projection layer
model_merged.pt  ~2.9GB  Full merged model

Usage

import torch

# Load the full merged checkpoint directly onto the GPU
model_state = torch.load("model_merged.pt", map_location="cuda")
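The component checkpoints from the Files table can be loaded the same way, each as its own state dict. A minimal sketch (the key names inside each checkpoint depend on the CSM-1B implementation; the demo below uses an in-memory dummy checkpoint so it runs without the actual files):

```python
import io
import torch

def load_component(path_or_buffer):
    """Load one component checkpoint (e.g. decoder.pt) as a state dict."""
    return torch.load(path_or_buffer, map_location="cpu")

# In practice: decoder_state = load_component("decoder.pt"), and likewise
# for audio_head.pt and projection.pt.
# Self-contained demo with an in-memory stand-in checkpoint:
buf = io.BytesIO()
torch.save({"weight": torch.zeros(4, 8)}, buf)
buf.seek(0)

state = load_component(buf)
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
```

Loading onto CPU first is useful for inspecting parameter names and shapes before merging the components into a full model.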

Part of Project Maya
