---
license: apache-2.0
language:
- zh
- en
tags:
- tts
- speech-evaluation
- continuation-score
- role-play
- reward-model
pipeline_tag: text-to-speech
---
<h1 align="center">
MCLP-Score: Continuation Score Model for MCLP Metric
</h1>
<p align="center">
Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>
</p>
<p align="center">
<sup>1</sup>StepFun <sup>2</sup>University of Chinese Academy of Sciences <sup>3</sup>Beihang University
</p>
<p align="center">
<sup>*</sup>Equal contribution
</p>
<p align="center">
<a href="https://arxiv.org/abs/2601.22661">Paper</a> |
<a href="https://github.com/y-ren16/MCLP">Code</a> |
<a href="https://huggingface.co/datasets/y-ren16/WenetSpeech-RP">Dataset</a> |
<a href="https://huggingface.co/y-ren16/MCLP-RPTTS">MCLP-RPTTS Model</a>
</p>
## Model Description
**MCLP-Score** is the Continuation Score model used to compute the **MCLP (Mean Continuation Log-Probability)** metric. Given a ground-truth audio prefix, the model scores how well a generated audio segment continues the speaking style of the ground truth. The result is a log-probability score that measures expressive consistency.
The MCLP metric serves as both:
1. **An evaluation metric** for role-play TTS quality (correlation with human MOS: Spearman ρ = 0.94)
2. **A reward signal** for GRPO-based reinforcement learning to improve TTS expressiveness
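
To illustrate the second use, here is a minimal sketch of how per-sample MCLP scores could be turned into GRPO-style group-relative advantages. It assumes the standard GRPO normalization (subtract the group mean, divide by the group standard deviation); whether the paper uses exactly this form, or combines MCLP with other rewards, is not stated in this card. The function name and the scores are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Assumption: standard GRPO normalization, (r - mean) / (std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up MCLP scores for four TTS samples generated from one prompt.
group_mclp = [-4.9, -4.6, -5.2, -4.4]
advantages = group_relative_advantages(group_mclp)

for score, adv in zip(group_mclp, advantages):
    print(f"MCLP={score:+.2f} -> advantage={adv:+.3f}")
```

Samples whose MCLP is above the group mean receive a positive advantage, so the policy is pushed toward continuations that the score model finds more consistent with the ground-truth style.
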
This model is presented in:
> **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability**
> *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang*
> ICML 2026
## How MCLP Works
The MCLP metric computes the mean log-probability of audio tokens in the **generated segment**, conditioned on a **ground-truth audio prefix**:
```
MCLP = (1/N) * Σ log P(token_i | gt_prefix, token_1, ..., token_{i-1})
```
Higher MCLP scores indicate better stylistic consistency with the ground-truth speaking style.
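
The formula can be sketched in plain Python as follows. This is a toy illustration, not the repository's implementation: it assumes the per-step logits have already been produced by a model conditioned on the ground-truth prefix and the preceding generated tokens, and the function names and numbers are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mean_continuation_log_prob(step_logits, target_tokens):
    """Average log P(token_i | gt_prefix, token_1..token_{i-1}).

    step_logits[i]: logits over the audio-token vocabulary at step i
    (assumed to be already conditioned on the prefix and earlier tokens).
    target_tokens[i]: id of the i-th token of the generated segment.
    """
    if len(step_logits) != len(target_tokens) or not target_tokens:
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, tok in zip(step_logits, target_tokens):
        total += log_softmax(logits)[tok]
    return total / len(target_tokens)

# Toy example: 3 steps over a 4-token vocabulary (made-up numbers).
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-1.0, 3.0, 0.2, 0.2],
]
toy_targets = [0, 2, 1]
score = mean_continuation_log_prob(toy_logits, toy_targets)
print(f"MCLP (toy): {score:.4f}")
```

Because each term is a log-probability, the score is never positive; values closer to zero mean the model assigned higher probability to the generated tokens.
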
## Usage
```bash
# Clone the inference code
git clone https://github.com/y-ren16/MCLP.git
cd MCLP
# Compute MCLP scores
python compute_contination_score.py \
--model-path /path/to/MCLP-Score \
--audio-dir ./outputs/roleplay_tts \
--gt-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
--gt-dir /path/to/WenetSpeech-RP/eval/audio \
--save-json mclp_results.json
```
**Output:**
```
MCLP (Mean avg_log_prob): -4.636xxx
Mean avg_prob: 0.xxxxx
Mean avg_rank: xx.xx
```
For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).
## Requirements
- Python >= 3.10
- PyTorch >= 2.3 with CUDA
- GPU: at least 1x A100/H100 (80GB) for inference
```bash
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
```
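
Before running the scorer, it may help to confirm that the listed packages are installed. The following is a hypothetical helper, not part of the MCLP repository. It uses only the standard library, and the package list simply mirrors the `pip install` line above.

```python
import sys
from importlib import metadata

# Package names copied from the pip command above.
REQUIRED = ["transformers", "torchaudio", "librosa", "onnxruntime",
            "s3tokenizer", "diffusers", "hyperpyyaml", "numpy"]
PINNED = {"transformers": "4.49.0"}  # the only version pinned in this card

def check_environment(packages=REQUIRED):
    """Return {package: installed version string, or None if missing}."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report

if sys.version_info < (3, 10):
    print("Warning: Python >= 3.10 is required")

report = check_environment()
for name, version in report.items():
    if version is None:
        print(f"{name}: MISSING")
    elif name in PINNED and version != PINNED[name]:
        print(f"{name}: {version} (card pins {PINNED[name]})")
    else:
        print(f"{name}: {version}")
```

This check does not cover PyTorch's CUDA build or GPU memory; those still need to be verified separately.
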
## Related Resources
| Resource | Link |
|----------|------|
| Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
| Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
| WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
| MCLP-RPTTS Model | [huggingface.co/y-ren16/MCLP-RPTTS](https://huggingface.co/y-ren16/MCLP-RPTTS) |
## Citation
```bibtex
@inproceedings{ren2026mclp,
title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year={2026}
}
```
## License
This model is released under the [Apache 2.0 License](LICENSE).
## Acknowledgements
This project builds upon:
- [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)