Paper: GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot (arXiv:2412.02612)
Sesame CSM-1B optimized with Group Relative Policy Optimization (GRPO) using UTMOS as a naturalness reward signal.
Standard supervised fine-tuning minimizes reconstruction loss, which doesn't always correlate with perceived naturalness. GRPO instead optimizes the model directly against a reward signal, here the UTMOS naturalness score.
This is inspired by GLM-4-Voice, which showed that RL-based optimization improves TTS quality metrics beyond supervised fine-tuning.
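As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes the rewards of several samples generated for the same prompt against their own group mean and standard deviation. The function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the example scores are all assumptions made for illustration; they are not taken from this model's training code, and the scores are not measured results.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    rewards: one scalar reward per sampled output for the same prompt
             (in this setting, a UTMOS score per generated utterance).
    Returns one advantage per sample: positive if the sample scored above
    the group mean, negative if below.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical UTMOS scores for four utterances sampled from one text prompt
scores = [3.1, 3.8, 4.2, 3.5]
advantages = group_relative_advantages(scores)

for score, adv in zip(scores, advantages):
    print(f"UTMOS {score:.1f} -> advantage {adv:+.2f}")
```

In a full training loop these advantages would weight the policy-gradient update, so utterances that sound more natural than their siblings are reinforced and less natural ones are suppressed, without needing a separate value model.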
| File | Description |
|---|---|
| `decoder.pt` | GRPO-optimized decoder |
| `audio_head.pt` | GRPO-optimized audio head |
| `projection.pt` | GRPO-optimized projection |