 ---
 license: apache-2.0
+language:
+  - zh
+  - en
+tags:
+  - tts
+  - speech-evaluation
+  - continuation-score
+  - role-play
+  - reward-model
+pipeline_tag: text-to-speech
 ---
+<h1 align="center">
+  MCLP-Score: Continuation Score Model for MCLP Metric
+</h1>
+<p align="center">
+  Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>
+</p>
+<p align="center">
+  <sup>1</sup>StepFun &nbsp; <sup>2</sup>University of Chinese Academy of Sciences &nbsp; <sup>3</sup>Beihang University
+</p>
+<p align="center">
+  <sup>*</sup>Equal contribution
+</p>
+<p align="center">
+📑 <a href="https://arxiv.org/abs/2601.22661">Paper</a> &nbsp;|&nbsp;
+💻 <a href="https://github.com/y-ren16/MCLP">Code</a> &nbsp;|&nbsp;
+📊 <a href="https://huggingface.co/datasets/y-ren16/WenetSpeech-RP">Dataset</a> &nbsp;|&nbsp;
+🗣️ <a href="https://huggingface.co/y-ren16/MCLP-RPTTS">MCLP-RPTTS Model</a>
+</p>
+## Model Description
+**MCLP-Score** is the Continuation Score model used to compute the **MCLP (Mean Continuation Log-Probability)** metric. Given a ground-truth audio prefix, the model scores how well a generated audio segment continues the speaking style of that prefix. The score is a log-probability that measures expressive consistency.
+The MCLP metric serves as both:
+1. **An evaluation metric** for role-play TTS quality (correlation with human MOS: Spearman ρ = 0.94)
+2. **A reward signal** for GRPO-based reinforcement learning to improve TTS expressiveness
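As a rough illustration of the second use, the sketch below turns per-sample MCLP scores into group-relative advantages, the normalization step GRPO is known for. It is a minimal sketch under stated assumptions: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example scores are all hypothetical, and the reward shaping used in the paper may differ.

```python
import statistics


def group_relative_advantages(mclp_scores, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper: each advantage is (score - group mean) / (group std + eps).
    """
    mean = statistics.fmean(mclp_scores)
    std = statistics.pstdev(mclp_scores)  # population std; an assumption
    return [(score - mean) / (std + eps) for score in mclp_scores]


# Made-up MCLP scores for four candidate utterances of one prompt.
group_scores = [-4.9, -4.6, -5.2, -4.4]
advantages = group_relative_advantages(group_scores)
for score, advantage in zip(group_scores, advantages):
    print(f"MCLP {score:+.2f} -> advantage {advantage:+.3f}")
```

Candidates whose MCLP is above the group mean receive a positive advantage, so the policy is pushed toward continuations the score model finds more consistent with the ground-truth style.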
+This model is presented in:
+> **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability**
+> *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang*
+> ICML 2026
+## How MCLP Works
+The MCLP metric computes the mean log-probability of audio tokens in the **generated segment**, conditioned on a **ground-truth audio prefix**:
+```
+MCLP = (1/N) * Σ log P(token_i | gt_prefix, token_1, ..., token_{i-1})
+```
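+Here `N` is the number of audio tokens in the generated segment. Scores are log-probabilities, so they are at most 0; a higher score (closer to 0) indicates better consistency with the ground-truth speaking style.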
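The formula can be illustrated with a small, dependency-free sketch. It assumes the scoring model has already produced one logit vector per generated token, each conditioned on the ground-truth prefix and the earlier generated tokens. The function names, the toy logits, and the 4-entry vocabulary are hypothetical; the actual scoring logic is in the code repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def mean_continuation_log_prob(step_logits, target_tokens):
    """Average log P(token_i | gt_prefix, token_1..token_{i-1}).

    step_logits[i] is the logit vector predicted for position i;
    target_tokens[i] is the audio token actually present at position i.
    """
    if len(step_logits) != len(target_tokens) or not target_tokens:
        raise ValueError("need exactly one logit vector per generated token")
    total = 0.0
    for logits, token in zip(step_logits, target_tokens):
        total += log_softmax(logits)[token]
    return total / len(target_tokens)


# Toy example: 3 generated tokens over a 4-entry vocabulary.
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-0.5, 1.5, 0.3, 0.2],
]
toy_tokens = [0, 2, 1]
score = mean_continuation_log_prob(toy_logits, toy_tokens)
print(f"MCLP (toy): {score:.4f}")
```

Averaging over `N` rather than summing keeps scores comparable across generated segments of different lengths.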
+## Usage
+```bash
+# Clone the inference code
+git clone https://github.com/y-ren16/MCLP.git
+cd MCLP
+# Compute MCLP scores
+python compute_contination_score.py \
+    --model-path /path/to/MCLP-Score \
+    --audio-dir ./outputs/roleplay_tts \
+    --gt-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
+    --gt-dir /path/to/WenetSpeech-RP/eval/audio \
+    --save-json mclp_results.json
+```
+**Output:**
+```
+MCLP (Mean avg_log_prob): -4.636xxx
+Mean avg_prob: 0.xxxxx
+Mean avg_rank: xx.xx
+```
+For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).
+## Requirements
+- Python >= 3.10
+- PyTorch >= 2.3 with CUDA
+- GPU: at least one A100 or H100 (80 GB) for inference
+```bash
+pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
+```
+## Related Resources
+| Resource | Link |
+|----------|------|
+| 📑 Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
+| 💻 Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
+| 📊 WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
+| 🗣️ MCLP-RPTTS Model | [huggingface.co/y-ren16/MCLP-RPTTS](https://huggingface.co/y-ren16/MCLP-RPTTS) |
+## Citation
+```bibtex
+@inproceedings{ren2026mclp,
+  title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
+  author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
+  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
+  year={2026}
+}
+```
+## License
+This model is released under the [Apache 2.0 License](LICENSE).
+## Acknowledgements
+This project builds upon:
+- [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
+- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
+- [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)