CSM-1B — GRPO Naturalness Optimization

Sesame's CSM-1B text-to-speech model, optimized with Group Relative Policy Optimization (GRPO) using UTMOS as a naturalness reward signal.

Why GRPO for TTS?

Standard supervised fine-tuning minimizes reconstruction loss, which doesn't always correlate with perceived naturalness. GRPO instead:

  1. Generates multiple speech candidates per text input
  2. Scores each with UTMOS (a trained naturalness predictor)
  3. Uses relative rankings to update the policy toward higher-quality outputs
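The core of step 3 is GRPO's group-relative baseline: each candidate's reward is compared against the other candidates generated for the same text. A minimal sketch of that normalization (the function name and the example scores are illustrative, not the actual training code; a real UTMOS predictor would supply the rewards):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Compute group-relative advantages from per-candidate rewards.

    rewards: (batch, group_size) naturalness scores, one per speech candidate.
    Returns the reward minus the group mean, scaled by the group standard
    deviation, so candidates are ranked only against their own group.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: four candidates for one text input, scored 1-5 by a
# naturalness model such as UTMOS.
scores = torch.tensor([[3.1, 4.2, 2.8, 3.9]])
adv = grpo_advantages(scores)
# Candidates above the group mean get a positive advantage (pushing the
# policy toward them); candidates below the mean get a negative one.
```

These advantages then weight the policy-gradient update, so the model is pushed toward generations that UTMOS rates as more natural than their group peers.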

This approach is inspired by GLM-4-Voice, which showed that RL-based optimization can improve TTS quality metrics beyond what supervised fine-tuning achieves.

Training

  • Base: sesame/csm-1b (1B params)
  • Method: GRPO with UTMOS reward (v3, the third iteration with refined hyperparameters)
  • Steps: 1,500 (best checkpoint by reward)
  • Hardware: NVIDIA A10G 24GB

Files

File           Description
decoder.pt     GRPO-optimized decoder
audio_head.pt  GRPO-optimized audio head
projection.pt  GRPO-optimized projection
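A minimal loading sketch, under two assumptions not stated in this card: each `.pt` file holds a `torch.save`d state dict for the correspondingly named component, and `model` is an already-constructed CSM-1B instance exposing `decoder`, `audio_head`, and `projection` submodule attributes (attribute names inferred from the file names; adjust to your CSM implementation):

```python
import torch

def apply_grpo_weights(model: torch.nn.Module, ckpt_dir: str = ".") -> None:
    """Overwrite the base model's components with the GRPO-optimized ones.

    Assumes each .pt file is the state dict of the matching submodule;
    the attribute names here are guesses based on the checkpoint names.
    """
    for name in ("decoder", "audio_head", "projection"):
        state = torch.load(f"{ckpt_dir}/{name}.pt", map_location="cpu")
        getattr(model, name).load_state_dict(state)
```

The remaining weights stay at their `sesame/csm-1b` base values, since only these three components were updated by GRPO.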

Part of Project Maya

