CSM-1B — GRPO Naturalness Optimization

Sesame's CSM-1B text-to-speech model, optimized with Group Relative Policy Optimization (GRPO) using UTMOS as a naturalness reward signal.

Why GRPO for TTS?

Standard supervised fine-tuning minimizes reconstruction loss, which doesn't always correlate with perceived naturalness. GRPO instead:

  1. Generates multiple speech candidates per text input
  2. Scores each with UTMOS (a trained naturalness predictor)
  3. Uses relative rankings to update the policy toward higher-quality outputs
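The core of step 3 is GRPO's group-relative baseline: each candidate's reward is compared against the other candidates generated for the same text. A minimal sketch of that normalization (the function name and the example scores are illustrative, not the actual training code; a real UTMOS predictor would supply the rewards):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Compute group-relative advantages from per-candidate rewards.

    rewards: (batch, group_size) naturalness scores, one per speech candidate.
    Returns the reward minus the group mean, scaled by the group standard
    deviation, so candidates are ranked only against their own group.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: four candidates for one text input, scored 1-5 by a
# naturalness model such as UTMOS.
scores = torch.tensor([[3.1, 4.2, 2.8, 3.9]])
adv = grpo_advantages(scores)
# Candidates above the group mean get a positive advantage (pushing the
# policy toward them); candidates below the mean get a negative one.
```

These advantages then weight the policy-gradient update, so the model is pushed toward generations that UTMOS rates as more natural than their group peers.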

This approach is inspired by GLM-4-Voice, which showed that RL-based optimization can improve TTS quality metrics beyond what supervised fine-tuning achieves.

Training

  • Base: sesame/csm-1b (1B params)
  • Method: GRPO with UTMOS reward (v3, the third iteration with refined hyperparameters)
  • Steps: 1,500 (best checkpoint by reward)
  • Hardware: NVIDIA A10G 24GB

Files

File           Description
decoder.pt     GRPO-optimized decoder
audio_head.pt  GRPO-optimized audio head
projection.pt  GRPO-optimized projection
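A minimal loading sketch, under two assumptions not stated in this card: each `.pt` file holds a `torch.save`d state dict for the correspondingly named component, and `model` is an already-constructed CSM-1B instance exposing `decoder`, `audio_head`, and `projection` submodule attributes (attribute names inferred from the file names; adjust to your CSM implementation):

```python
import torch

def apply_grpo_weights(model: torch.nn.Module, ckpt_dir: str = ".") -> None:
    """Overwrite the base model's components with the GRPO-optimized ones.

    Assumes each .pt file is the state dict of the matching submodule;
    the attribute names here are guesses based on the checkpoint names.
    """
    for name in ("decoder", "audio_head", "projection"):
        state = torch.load(f"{ckpt_dir}/{name}.pt", map_location="cpu")
        getattr(model, name).load_state_dict(state)
```

The remaining weights stay at their `sesame/csm-1b` base values, since only these three components were updated by GRPO.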

Part of Project Maya

