# CSM-1B — Combined Naturalness Fine-tune
Fine-tuned Sesame CSM-1B with a combined naturalness objective targeting both decoder and audio components.
## Approach
Unlike the decoder-only fine-tune, which freezes the backbone, this variant jointly optimizes the decoder, audio head, and projection layers with a naturalness-weighted loss that balances:
- Acoustic quality (spectral fidelity to the target speaker)
- Prosodic naturalness (pitch contour and rhythm preservation)
- Conversational coherence (context-appropriate intonation)
Trained for 13,500 steps on Expresso conversational speech data (female speaker ex04).
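The weighted combination described above can be sketched as a simple scalar objective. This is an illustrative sketch, not the actual training code: the function name, the individual loss terms, and the default weights are all assumptions.

```python
# Hypothetical sketch of the combined naturalness objective.
# The three terms correspond to the bullets above; the weights
# (w_acoustic, w_prosody, w_coherence) are illustrative defaults,
# not the values used in training.
def naturalness_loss(acoustic_loss, prosody_loss, coherence_loss,
                     w_acoustic=1.0, w_prosody=0.5, w_coherence=0.5):
    """Combine acoustic, prosodic, and coherence terms into one scalar."""
    return (w_acoustic * acoustic_loss
            + w_prosody * prosody_loss
            + w_coherence * coherence_loss)
```

In practice each term would be computed per batch (e.g. a spectral reconstruction loss, a pitch/rhythm loss, and a context-conditioned loss) and the combined scalar backpropagated through all three trainable components.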
## Files
| File | Size | Description |
|---|---|---|
| `decoder.pt` | ~212 MB | Fine-tuned speech decoder |
| `audio_head.pt` | ~124 MB | Fine-tuned audio prediction head |
| `projection.pt` | ~4 MB | Fine-tuned projection layer |
| `model_merged.pt` | ~2.9 GB | Full merged model |
## Usage
```python
import torch

# Load the full merged checkpoint (~2.9 GB) directly onto the GPU
model_state = torch.load("model_merged.pt", map_location="cuda")
```