| --- |
| license: apache-2.0 |
| language: |
| - zh |
| - en |
| tags: |
| - tts |
| - speech-evaluation |
| - continuation-score |
| - role-play |
| - reward-model |
| pipeline_tag: text-to-speech |
| --- |
| |
| <h1 align="center"> |
| MCLP-Score: Continuation Score Model for MCLP Metric |
| </h1> |
|
|
| <p align="center"> |
| Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup> |
| </p> |
|
|
| <p align="center"> |
| <sup>1</sup>StepFun <sup>2</sup>University of Chinese Academy of Sciences <sup>3</sup>Beihang University |
| </p> |
|
|
| <p align="center"> |
| <sup>*</sup>Equal contribution |
| </p> |
| |
| <p align="center"> |
| <a href="https://arxiv.org/abs/2601.22661">Paper</a> | |
| <a href="https://github.com/y-ren16/MCLP">Code</a> | |
| <a href="https://huggingface.co/datasets/y-ren16/WenetSpeech-RP">Dataset</a> | |
| <a href="https://huggingface.co/y-ren16/MCLP-RPTTS">MCLP-RPTTS Model</a> |
| </p> |
| |
| ## Model Description |
| |
| **MCLP-Score** is the Continuation Score model used to compute the **MCLP (Mean Continuation Log-Probability)** metric. Given a ground-truth audio prefix, the model scores how well a generated audio segment continues the speaking style of that prefix, producing a log-probability score that measures expressive consistency. |
| |
| The MCLP metric serves as both: |
| 1. **An evaluation metric** for role-play TTS quality (correlation with human MOS: Spearman ρ = 0.94) |
| 2. **A reward signal** for GRPO-based reinforcement learning to improve TTS expressiveness |
| |
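The reward-signal use can be illustrated with a minimal sketch. GRPO-style training typically samples several candidates for the same prompt and normalizes their rewards within that group. The function name, the choice of population standard deviation, and the toy scores below are assumptions for illustration, not the training code from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (GRPO-style sketch)."""
    mean = statistics.fmean(rewards)
    # Population standard deviation; an assumption, some setups use the sample std.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical MCLP scores for four candidates generated from the same prompt.
mclp_rewards = [-4.9, -4.6, -5.2, -4.7]
advantages = group_relative_advantages(mclp_rewards)

# The candidate with the highest MCLP gets the largest positive advantage.
best = max(range(len(advantages)), key=advantages.__getitem__)
print(advantages, best)
```

Candidates whose MCLP is above the group mean receive a positive advantage and are reinforced; those below the mean are penalized.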
| This model is presented in: |
| |
| > **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability** |
| > *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang* |
| > ICML 2026 |
|
|
| ## How MCLP Works |
|
|
| The MCLP metric computes the mean log-probability of audio tokens in the **generated segment**, conditioned on a **ground-truth audio prefix**: |
|
|
| ``` |
| MCLP = (1/N) * Σ_{i=1..N} log P(token_i | gt_prefix, token_1, ..., token_{i-1}) |
| ``` |
|
|
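| Here N is the number of audio tokens in the generated segment. Because each term is a log-probability, scores are at most zero; higher MCLP scores (closer to zero) indicate better stylistic consistency with the ground-truth speaking style. |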
|
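The formula above can be sketched in plain Python. This assumes the scoring model has already produced one vector of logits per generated token, each conditioned on the ground-truth prefix and the preceding generated tokens. The function names and toy values are hypothetical and do not come from the repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_continuation_log_prob(step_logits, token_ids):
    """Average log-probability of the generated tokens.

    step_logits[i] holds the (assumed) model logits for position i, already
    conditioned on the ground-truth prefix and tokens 1..i-1.
    """
    if not token_ids or len(step_logits) != len(token_ids):
        raise ValueError("need one logit vector per generated token")
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        total += log_softmax(logits)[tok]
    return total / len(token_ids)


# Toy example: vocabulary of 3 tokens, generated segment of 2 tokens.
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
token_ids = [0, 2]
score = mean_continuation_log_prob(step_logits, token_ids)
print(score)
```

In practice the logits would come from a single teacher-forced forward pass of the scoring model over the prefix followed by the generated tokens.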
|
| ## Usage |
|
|
| ```bash |
| # Clone the inference code |
| git clone https://github.com/y-ren16/MCLP.git |
| cd MCLP |
| |
| # Compute MCLP scores |
| python compute_contination_score.py \ |
| --model-path /path/to/MCLP-Score \ |
| --audio-dir ./outputs/roleplay_tts \ |
| --gt-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \ |
| --gt-dir /path/to/WenetSpeech-RP/eval/audio \ |
| --save-json mclp_results.json |
| ``` |
|
|
| **Output:** |
| ``` |
| MCLP (Mean avg_log_prob): -4.636xxx |
| Mean avg_prob: 0.xxxxx |
| Mean avg_rank: xx.xx |
| ``` |
|
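One plausible reading of the three reported values, sketched below, is that they are averages over generated tokens of the log-probability, the probability, and the rank of each token in the model's predicted distribution. This interpretation, the function names, and the toy logits are assumptions; the repository's script is the authority on the exact definitions.

```python
import math


def token_stats(logits, token_id):
    """Return (log_prob, prob, rank) of one token; rank 1 is the most likely."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_prob = logits[token_id] - log_z
    rank = 1 + sum(1 for x in logits if x > logits[token_id])
    return log_prob, math.exp(log_prob), rank


def utterance_stats(step_logits, token_ids):
    """Average the per-token statistics over one generated segment."""
    stats = [token_stats(l, t) for l, t in zip(step_logits, token_ids)]
    n = len(stats)
    return {
        "avg_log_prob": sum(s[0] for s in stats) / n,
        "avg_prob": sum(s[1] for s in stats) / n,
        "avg_rank": sum(s[2] for s in stats) / n,
    }


# Toy example: vocabulary of 3 tokens, generated segment of 2 tokens.
result = utterance_stats([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]], [0, 2])
print(result)
```

Under this reading, a lower average rank and a higher average probability both point the same way as a higher MCLP.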
|
| For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP). |
|
|
| ## Requirements |
|
|
| - Python >= 3.10 |
| - PyTorch >= 2.3 with CUDA |
| - GPU: at least 1x A100/H100 (80GB) for inference |
|
|
| ```bash |
| pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy |
| ``` |
|
|
| ## Related Resources |
|
|
| | Resource | Link | |
| |----------|------| |
| | Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) | |
| | Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) | |
| | WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) | |
| | MCLP-RPTTS Model | [huggingface.co/y-ren16/MCLP-RPTTS](https://huggingface.co/y-ren16/MCLP-RPTTS) | |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{ren2026mclp, |
| title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability}, |
| author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui}, |
| booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)}, |
| year={2026} |
| } |
| ``` |
|
|
| ## License |
|
|
| This model is released under the [Apache 2.0 License](LICENSE). |
|
|
| ## Acknowledgements |
|
|
| This project builds upon: |
| - [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2) |
| - [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) |
| - [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice) |
|
|