y-ren16 committed on
Commit c3881f2 · verified · 1 Parent(s): c9f304a

Add Model Card

  1. README.md +123 -0
README.md CHANGED
@@ -1,3 +1,126 @@
  ---
  license: apache-2.0
+ language:
+ - zh
+ - en
+ tags:
+ - tts
+ - speech-evaluation
+ - continuation-score
+ - role-play
+ - reward-model
+ pipeline_tag: text-to-speech
  ---
+
+ <h1 align="center">
+ MCLP-Score: Continuation Score Model for MCLP Metric
+ </h1>
+
+ <p align="center">
+ Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>
+ </p>
+
+ <p align="center">
+ <sup>1</sup>StepFun &nbsp; <sup>2</sup>University of Chinese Academy of Sciences &nbsp; <sup>3</sup>Beihang University
+ </p>
+
+ <p align="center">
+ <sup>*</sup>Equal contribution
+ </p>
+
+ <p align="center">
+ 📑 <a href="https://arxiv.org/abs/2601.22661">Paper</a> &nbsp;|&nbsp;
+ 💻 <a href="https://github.com/y-ren16/MCLP">Code</a> &nbsp;|&nbsp;
+ 📊 <a href="https://huggingface.co/datasets/y-ren16/WenetSpeech-RP">Dataset</a> &nbsp;|&nbsp;
+ 🗣️ <a href="https://huggingface.co/y-ren16/MCLP-RPTTS">MCLP-RPTTS Model</a>
+ </p>
+
+ ## Model Description
+
+ **MCLP-Score** is the Continuation Score model used to compute the **MCLP (Mean Continuation Log-Probability)** metric. Given a ground-truth audio prefix, the model evaluates how well a generated audio segment continues the speaking style of the ground truth and outputs a log-probability score that measures expressive consistency.
+
+ The MCLP metric serves as both:
+ 1. **An evaluation metric** for role-play TTS quality (correlation with human MOS: Spearman ρ = 0.94)
+ 2. **A reward signal** for GRPO-based reinforcement learning to improve TTS expressiveness
+
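As a rough illustration of the reward-signal role, GRPO-style training typically samples several candidates per prompt and scores each one relative to its group. The sketch below assumes that MCLP scores are used directly as rewards and that advantages are standardized within the group; the function name, the toy scores, and the `eps` constant are illustrative and are not taken from the MCLP code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style).

    Assumption: `rewards` holds one MCLP score per candidate utterance
    generated for the same prompt. `eps` avoids division by zero when
    all candidates receive the same score.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical MCLP scores for three candidates of one prompt.
mclp_rewards = [-4.9, -4.6, -4.3]
advantages = group_relative_advantages(mclp_rewards)
```

Candidates with a higher (less negative) MCLP score than the group mean receive a positive advantage, and those below the mean receive a negative one.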
+ This model is presented in:
+
+ > **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability**
+ > *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang*
+ > ICML 2026
+
+ ## How MCLP Works
+
+ The MCLP metric computes the mean log-probability of the N audio tokens in the **generated segment**, each conditioned on a **ground-truth audio prefix** and on the generated tokens that precede it:
+
+ ```
+ MCLP = (1/N) * Σ_{i=1..N} log P(token_i | gt_prefix, token_1, ..., token_{i-1})
+ ```
+
+ Higher (less negative) MCLP scores indicate better stylistic consistency with the ground-truth speaking style.
+
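The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: it assumes the scoring model has already produced one vector of logits per generated token, each conditioned on the ground-truth prefix and the earlier generated tokens. The function names and the toy numbers are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_continuation_log_prob(step_logits, target_tokens):
    """Average log P(token_i | gt_prefix, token_1..token_{i-1}).

    Assumption: step_logits[i] are the scorer's logits at generated
    position i, and target_tokens[i] is the audio token id actually
    present at that position in the generated segment.
    """
    if len(step_logits) != len(target_tokens) or not target_tokens:
        raise ValueError("need exactly one logit vector per generated token")
    total = 0.0
    for logits, token in zip(step_logits, target_tokens):
        total += log_softmax(logits)[token]
    return total / len(target_tokens)


# Toy example: two generated tokens over a three-entry vocabulary.
toy_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
toy_tokens = [1, 0]
score = mean_continuation_log_prob(toy_logits, toy_tokens)
```

Because every term is a log-probability, the score is never positive; a segment whose tokens the scorer finds more likely given the prefix yields a value closer to zero.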
+ ## Usage
+
+ ```bash
+ # Clone the inference code
+ git clone https://github.com/y-ren16/MCLP.git
+ cd MCLP
+
+ # Compute MCLP scores
+ python compute_contination_score.py \
+     --model-path /path/to/MCLP-Score \
+     --audio-dir ./outputs/roleplay_tts \
+     --gt-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
+     --gt-dir /path/to/WenetSpeech-RP/eval/audio \
+     --save-json mclp_results.json
+ ```
+
+ **Output:**
+ ```
+ MCLP (Mean avg_log_prob): -4.636xxx
+ Mean avg_prob: 0.xxxxx
+ Mean avg_rank: xx.xx
+ ```
+
+ For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).
+
+ ## Requirements
+
+ - Python >= 3.10
+ - PyTorch >= 2.3 with CUDA
+ - GPU: at least 1x A100/H100 (80GB) for inference
+
+ ```bash
+ pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
+ ```
+
+ ## Related Resources
+
+ | Resource | Link |
+ |----------|------|
+ | 📑 Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
+ | 💻 Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
+ | 📊 WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
+ | 🗣️ MCLP-RPTTS Model | [huggingface.co/y-ren16/MCLP-RPTTS](https://huggingface.co/y-ren16/MCLP-RPTTS) |
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{ren2026mclp,
+   title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
+   author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
+   booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
+   year={2026}
+ }
+ ```
+
+ ## License
+
+ This model is released under the [Apache 2.0 License](LICENSE).
+
+ ## Acknowledgements
+
+ This project builds upon:
+ - [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
+ - [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
+ - [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)