---
license: other
license_name: license-term-of-stabletoken
language:
- en
- zh
tags:
- speech tokenizer
---
# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)

**StableToken** is a noise-robust semantic speech tokenizer: it converts speech into discrete tokens and achieves state-of-the-art token stability in noisy environments.

[Paper](https://arxiv.org/abs/2509.22220) | [GitHub](https://github.com/Tencent/StableToken)

For code and further details, see the [GitHub repository](https://github.com/Tencent/StableToken).

## Model Details

| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |
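
The BPS figure follows from the other two attributes. As a minimal sketch, assuming one token per frame drawn from a single codebook (the helper name `bits_per_second` is hypothetical and not part of the StableToken code), the bitrate is the frame rate times the number of bits needed to index the codebook:

```python
import math


def bits_per_second(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer emitting one token per frame.

    Assumption: each token costs log2(codebook_size) bits.
    """
    return frame_rate_hz * math.log2(codebook_size)


# Values from the Model Details table: 25 Hz, 8,192 codes (13 bits per token).
stabletoken_bps = bits_per_second(25, 8192)
print(stabletoken_bps)  # 325.0
```
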

## Quick Start

To use StableToken, clone the official repository and install its dependencies.

### Installation

```bash
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

### Inference

```python
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download and load models
model_dir = snapshot_download("tencent/StableToken")

# Load the tokenizer and its feature extractor
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load the decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize: speech to discrete tokens
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct: tokens back to a waveform
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
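
To listen to the reconstruction, the waveform has to be written to disk. The sketch below is one way to do that with only the Python standard library. It is not part of the StableToken repository: `save_wav_16bit` is a hypothetical helper, and it assumes the samples are mono floats in the range [-1.0, 1.0].

```python
import struct
import wave


def save_wav_16bit(samples, sampling_rate, path):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; out-of-range samples are clipped.
    """
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)  # mono
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(int(sampling_rate))
        f.writeframes(bytes(frames))
```

If `tts_speech` turns out to be a PyTorch tensor (an assumption; the source does not state its type), a call might look like `save_wav_16bit(tts_speech.squeeze().cpu().tolist(), sampling_rate, "reconstructed.wav")`.
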

## Performance

StableToken achieves **60% lower UED** (Unit Edit Distance) than the best existing supervised semantic tokenizer listed below (10.17% vs. 26.17%).

### Noise Robustness (UED ↓)

| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|:---|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 16,384 | 31.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 4,096 | 26.17 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 6,561 | 38.66 |
| **StableToken** | 25Hz | 8,192 | **10.17** |
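
To make the metric concrete, here is a minimal sketch of how a UED-style score can be computed. It assumes UED is the Levenshtein distance between the token sequence of a clean utterance and that of its noisy version, normalized by the clean sequence length and given as a percentage. The paper's exact definition and evaluation protocol may differ, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def unit_edit_distance(clean_tokens, noisy_tokens):
    """UED in percent, normalized by the clean sequence length (assumed definition)."""
    if not clean_tokens:
        raise ValueError("clean_tokens must be non-empty")
    return 100.0 * edit_distance(clean_tokens, noisy_tokens) / len(clean_tokens)


# Toy example: noise changes one of five tokens.
clean = [5, 17, 17, 42, 8]
noisy = [5, 17, 42, 42, 8]
print(unit_edit_distance(clean, noisy))  # 20.0
```

A lower score means the tokenizer emits nearly the same tokens with or without noise, which is the behavior the table above measures.
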

### Reconstruction Quality

WER and MOS of reconstructed audio, measured on the LibriSpeech (LS) and SEED benchmarks.

| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | **3.99** | **4.16** | 4.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| **StableToken** | 25Hz | 325 | **3.84** | **7.99** | **3.44** | **2.62** | **4.09** | 3.83 | 4.01 | **4.18** |

## Citation

```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

## License

This project is licensed under the [License Term of StableToken](LICENSE).