| --- |
| license: apache-2.0 |
| language: |
| - en |
| - zh |
| tags: |
| - automatic-speech-recognition |
| - speech-recognition |
| - audio |
| - robust-asr |
| - qwen3-asr |
| pipeline_tag: automatic-speech-recognition |
| --- |
| |
| # Mega-ASR |
|
|
| Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text. |
|
|
| The release contains the Qwen3-ASR-1.7B foundation model files, Mega-ASR adaptation weights, and an audio quality router. The router decides whether to use the robust Mega-ASR path or the base recognition path for each input, which helps preserve clean-speech recognition quality while improving robustness on degraded speech. |
|
|
| ## Model Details |
|
|
| - **Model name:** Mega-ASR |
| - **Task:** Automatic speech recognition |
| - **Backbone:** Qwen3-ASR-1.7B |
| - **Primary use case:** In-the-wild ASR under challenging acoustic conditions |
| - **Default decoding:** Greedy decoding |
| - **Default max new tokens:** 256 in the Mega-ASR inference wrapper |
| - **Router:** Audio quality classifier with a default threshold of 0.5 |
| - **License:** Apache-2.0 |
|
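The router entry in the list above can be read as a simple threshold rule. The sketch below is illustrative only: `choose_path` is a hypothetical helper, and it assumes the router emits a score in [0, 1] where higher values mean more degraded audio. The actual score convention and tie handling in the Mega-ASR router may differ.

```python
def choose_path(degradation_score: float, threshold: float = 0.5) -> str:
    """Pick a recognition path for one utterance.

    Assumption: a higher score means more degraded audio. The real
    router may define its score the other way around.
    """
    if not 0.0 <= degradation_score <= 1.0:
        raise ValueError("degradation_score must lie in [0, 1]")
    # Scores at or above the threshold go to the robust Mega-ASR path;
    # everything else stays on the base recognition path.
    return "robust" if degradation_score >= threshold else "base"


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, "->", choose_path(score))
```

Raising the threshold in this sketch sends more audio to the base path, which favors clean-speech quality; lowering it favors robustness on degraded speech.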
|
| ## Repository Contents |
|
|
| ```text |
| Mega-ASR/ |
| ├── Qwen3-ASR-1.7B/        # Backbone model, tokenizer, processor, and generation config |
| ├── mega-asr-merged/       # Mega-ASR adaptation weights used by the inference wrapper |
| ├── audio_quality_router/  # Audio quality router checkpoint |
| └── README.md              # Model card |
| ``` |
|
|
| ## Intended Use |
|
|
| Mega-ASR is intended for speech-to-text transcription of real-world audio, especially audio affected by compound acoustic distortions. Example scenarios include far-field recording, environmental noise, reverberation, low-quality microphones, compression artifacts, partial signal corruption, and mixed acoustic conditions. |
|
|
| ## Quick Start |
|
|
| Install the Mega-ASR codebase and dependencies: |
|
|
| ```bash |
| git clone https://github.com/xzf-thu/Mega-ASR.git |
| cd Mega-ASR |
| |
| conda create -n mega-asr python=3.10 -y |
| conda activate mega-asr |
| pip install -r requirements.txt |
| ``` |
|
|
| Place this checkpoint directory at: |
|
|
| ```text |
| ckpt/Mega-ASR |
| ``` |
|
|
| Run inference: |
|
|
| ```bash |
| python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR |
| ``` |
|
|
| To always use the robust recognition path, disable routing: |
|
|
| ```bash |
| python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false |
| ``` |
|
|
| Python usage: |
|
|
| ```python |
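| from MegaASR.model.megaASR import MegaASR |
| |
| # Load the backbone together with the audio quality router. |
| model = MegaASR( |
|     model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B", |
|     router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt", |
|     routing_enabled=True,  # let the router choose the robust or base path per input |
| ) |
| |
| # Transcribe a single file; return_route=True includes the routing decision in the result. |
| result = model.infer("/path/to/audio.wav", return_route=True) |
| print(result) |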
| ``` |
|
|
| ## Decoding Defaults |
|
|
| The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256. |
|
|
| The default generation configuration is deterministic: |
|
|
| ```text |
| do_sample: false |
| num_beams: 1 |
| repetition_penalty: 1.0 |
| top_p: 1.0 |
| top_k: 50 |
| ``` |
|
|
| Because `do_sample` is false, decoding is greedy by default and sampling controls such as temperature, top-p, and top-k do not affect normal inference. |
|
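Why the sampling controls are inert under greedy decoding can be seen in a toy next-token picker. This is a simplified sketch in plain Python, not the Qwen3-ASR or Transformers implementation: `next_token` and its arguments are hypothetical names, and top-p filtering is left out for brevity.

```python
import math
import random


def next_token(logits, do_sample=False, top_k=50, temperature=1.0, rng=None):
    """Pick the next token id from a list of logits (toy illustration)."""
    if not do_sample:
        # Greedy decoding: take the argmax. top_k and temperature are
        # never consulted on this branch.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Sampling: keep the top_k highest-scoring tokens ...
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # ... and draw from a temperature-scaled softmax over them.
    best = logits[kept[0]]
    weights = [math.exp((logits[i] - best) / temperature) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 2.5, -1.0, 2.4]
    # Same result whatever the sampling controls are set to.
    print(next_token(logits, top_k=1, temperature=0.1))
    print(next_token(logits, top_k=3, temperature=5.0))
```

With `do_sample=False` the function returns the same token id for any `top_k` or `temperature`, which mirrors the behavior described above.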
|
| ## Training Summary |
|
|
| Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation. |
|
|
| The system is designed to improve recognition robustness on difficult audio, while the routing mechanism sends clean audio to the base recognition path so that clean-speech quality is preserved. |
|
|
| ## Evaluation |
|
|
| Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, and in-the-wild compound acoustic scenarios. The recommended evaluation metrics are: |
|
|
| - **WER** (word error rate) for English and other whitespace-tokenized languages |
| - **CER** (character error rate) for Chinese and other languages evaluated at the character level |
|
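Both metrics are an edit distance divided by the reference length, computed over words for WER and over characters for CER. The following is a minimal sketch in plain Python, with hypothetical helper names. It applies no text normalization (casing, punctuation), so its numbers may differ from those produced by the repository's evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(wer("the main street gets crowded", "the main street get crowded"))
    print(cer("今天天气很好", "今天天汽很好"))
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference, for example when a system hallucinates text.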
|
| The Mega-ASR repository includes an evaluation script: |
|
|
| ```bash |
| python src/MegaASR/eval/evaluate_wer.py \ |
|     --ckpt_dir ckpt/Mega-ASR \ |
|     --input_jsonl examples/test.jsonl \ |
|     --output_jsonl outputs/pred_with_wer.jsonl |
| ``` |
|
|
| Input JSONL format: |
|
|
| ```json |
| {"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."} |
| ``` |
|
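A manifest in this format can be produced and sanity-checked with the standard library. The sketch below assumes only the two fields shown above, `audio` and `answer`. The helper names are hypothetical and are not part of the Mega-ASR codebase.

```python
import json


def write_manifest(pairs, path):
    """Write (audio_path, reference_text) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for audio, answer in pairs:
            record = {"audio": audio, "answer": answer}
            # ensure_ascii=False keeps Chinese references human-readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Load a manifest and check that every record has both fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"audio", "answer"} - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            records.append(record)
    return records
```

Validating the manifest before a long evaluation run catches malformed lines early instead of partway through decoding.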
|
| ## Citation |
|
|
| If you use Mega-ASR, please cite the project: |
|
|
| ```bibtex |
| @misc{xie2026megaasrinthewild2speechrecognition, |
| title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation}, |
| author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao}, |
| year={2026}, |
| eprint={2605.19833}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.SD}, |
| url={https://arxiv.org/abs/2605.19833}, |
| } |
| ``` |
|
|
| ## Acknowledgements |
|
|
| Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of the public speech and audio datasets used in the project. |
|
|