---
license: apache-2.0
language:
- en
- zh
tags:
- automatic-speech-recognition
- speech-recognition
- audio
- robust-asr
- qwen3-asr
pipeline_tag: automatic-speech-recognition
---
# Mega-ASR
Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
The release contains the Qwen3-ASR-1.7B foundation model files, Mega-ASR adaptation weights, and an audio quality router. The router decides whether to use the robust Mega-ASR path or the base recognition path for each input, which helps preserve clean-speech recognition quality while improving robustness on degraded speech.
## Model Details
- **Model name:** Mega-ASR
- **Task:** Automatic speech recognition
- **Backbone:** Qwen3-ASR-1.7B
- **Primary use case:** In-the-wild ASR under challenging acoustic conditions
- **Default decoding:** Greedy decoding
- **Default max new tokens:** 256 in the Mega-ASR inference wrapper
- **Router:** Audio quality classifier with a default threshold of 0.5
- **License:** Apache-2.0
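
The router's thresholding step can be pictured with a minimal sketch. The names `choose_route` and `RouteDecision` are hypothetical, and the sketch assumes the router emits a score in [0, 1] where higher means more degraded audio; the actual score direction and interface in the Mega-ASR wrapper may differ.

```python
from dataclasses import dataclass


@dataclass
class RouteDecision:
    route: str    # "robust" (Mega-ASR path) or "base" (backbone path)
    score: float  # router score that produced the decision


def choose_route(degradation_score: float, threshold: float = 0.5) -> RouteDecision:
    """Pick a recognition path from a router score.

    Assumption: the score is the estimated probability that the audio is
    degraded, so scores at or above the threshold go to the robust path.
    """
    if not 0.0 <= degradation_score <= 1.0:
        raise ValueError("router score must lie in [0, 1]")
    route = "robust" if degradation_score >= threshold else "base"
    return RouteDecision(route=route, score=degradation_score)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, choose_route(score).route)
```

Under this assumption, raising the threshold would send more inputs to the base path, and lowering it would favor the robust path.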
## Repository Contents
```text
Mega-ASR/
├── Qwen3-ASR-1.7B/          # Backbone model, tokenizer, processor, and generation config
├── mega-asr-merged/         # Mega-ASR adaptation weights used by the inference wrapper
├── audio_quality_router/    # Audio quality router checkpoint
└── README.md                # Model card
```
## Intended Use
Mega-ASR is intended for speech-to-text transcription of real-world audio, especially audio affected by compound acoustic distortions. Example scenarios include far-field recording, environmental noise, reverberation, low-quality microphones, compression artifacts, partial signal corruption, and mixed acoustic conditions.
## Quick Start
Install the Mega-ASR codebase and dependencies:
```bash
git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR
conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt
```
Place this checkpoint directory at:
```text
ckpt/Mega-ASR
```
Run inference:
```bash
python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR
```
Disable routing if you want to always use the robust recognition path:
```bash
python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false
```
Python usage:
```python
from MegaASR.model.megaASR import MegaASR

model = MegaASR(
    model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
    router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
    routing_enabled=True,
)
result = model.infer("/path/to/audio.wav", return_route=True)
print(result)
```
## Decoding Defaults
The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.
The default generation configuration is deterministic:
```text
do_sample: false
num_beams: 1
repetition_penalty: 1.0
top_p: 1.0
top_k: 50
```
Because `do_sample` is false and `num_beams` is 1, decoding is greedy by default, and sampling controls such as temperature, top-p, and top-k have no effect on normal inference.
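
A small standalone sketch shows why temperature is irrelevant under greedy decoding: dividing the logits by any positive temperature preserves their order, so the argmax token never changes. The logits below are made-up values for illustration, not output from Mega-ASR.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(scores):
    """Index of the highest-scoring token, as greedy decoding would choose."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical next-token logits over a four-token vocabulary.
logits = [1.2, 3.4, 0.7, 2.9]

if __name__ == "__main__":
    for temperature in (0.5, 1.0, 2.0):
        probs = softmax(logits, temperature)
        print(temperature, greedy_pick(probs))
```

The selected index is the same at every temperature; only sampling-based decoding would be sensitive to these settings.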
## Training Summary
Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
The system is designed to improve recognition robustness on difficult audio while using a routing mechanism to reduce unnecessary changes on clean audio.
## Evaluation
Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, and in-the-wild compound acoustic scenarios. The recommended evaluation metrics are:
- **WER** for English and whitespace-tokenized languages
- **CER** for Chinese and character-based evaluation
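
For reference, both metrics reduce to a Levenshtein edit distance divided by the reference length, computed over words for WER and over characters for CER. The sketch below is a plain-Python illustration with no text normalization; the repository's evaluation script may apply its own normalization (casing, punctuation, tokenization), so its numbers can differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    print(cer("你好世界", "你好世届"))
```

Note that both rates can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.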
The Mega-ASR repository includes an evaluation script:
```bash
python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl
```
Input JSONL format:
```json
{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
```
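
A manifest in this format can be assembled with the standard library. The helper names `write_manifest` and `read_manifest` are hypothetical, and the sketch assumes one JSON object per line with exactly the `audio` and `answer` fields shown above; the evaluation script may accept or require additional fields.

```python
import json
from pathlib import Path


def write_manifest(pairs, path):
    """Write (audio_path, reference_text) pairs as one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for audio, answer in pairs:
            record = {"audio": audio, "answer": answer}
            # ensure_ascii=False keeps Chinese references human-readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Load a manifest, checking that each record has both expected fields."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"audio", "answer"} - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            records.append(record)
    return records
```

Validating the manifest before a long evaluation run catches missing fields early, before any audio is decoded.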
## Citation
If you use Mega-ASR, please cite the project:
```bibtex
@misc{xie2026megaasrinthewild2speechrecognition,
title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
year={2026},
eprint={2605.19833},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2605.19833},
}
```
## Acknowledgements
Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.