---
license: apache-2.0
language:
- en
- zh
tags:
- automatic-speech-recognition
- speech-recognition
- audio
- robust-asr
- qwen3-asr
pipeline_tag: automatic-speech-recognition
---

# Mega-ASR

Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.

The release contains the Qwen3-ASR-1.7B foundation model files, Mega-ASR adaptation weights, and an audio quality router. The router decides whether to use the robust Mega-ASR path or the base recognition path for each input, which helps preserve clean-speech recognition quality while improving robustness on degraded speech.
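
As a rough illustration of the per-input routing decision described above, the sketch below picks a path from a single router score. It is not the project's implementation: the function name `choose_route`, the `"robust"`/`"base"` labels, and the assumption that the router outputs a probability that the audio is degraded (rather than a clean-audio score) are all hypothetical. The 0.5 default mirrors the threshold listed under Model Details.

```python
def choose_route(degraded_prob: float,
                 threshold: float = 0.5,
                 routing_enabled: bool = True) -> str:
    """Pick a recognition path for one utterance (illustrative only).

    degraded_prob is ASSUMED to be the router's estimated probability
    that the input audio is acoustically degraded.
    """
    if not routing_enabled:
        # With routing disabled, every input takes the robust path.
        return "robust"
    # Degraded audio goes to the robust path, clean audio to the base path.
    return "robust" if degraded_prob >= threshold else "base"


if __name__ == "__main__":
    for score in (0.05, 0.5, 0.93):
        print(score, choose_route(score))
```

Raising the threshold in a sketch like this would send more inputs to the base path, while lowering it would favor the robust path.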

## Model Details

- **Model name:** Mega-ASR
- **Task:** Automatic speech recognition
- **Backbone:** Qwen3-ASR-1.7B
- **Primary use case:** In-the-wild ASR under challenging acoustic conditions
- **Default decoding:** Greedy decoding
- **Default max new tokens:** 256 in the Mega-ASR inference wrapper
- **Router:** Audio quality classifier with a default threshold of 0.5
- **License:** Apache-2.0

## Repository Contents

```text
Mega-ASR/
├── Qwen3-ASR-1.7B/              # Backbone model, tokenizer, processor, and generation config
├── mega-asr-merged/             # Mega-ASR adaptation weights used by the inference wrapper
├── audio_quality_router/        # Audio quality router checkpoint
└── README.md                    # Model card
```

## Intended Use

Mega-ASR is intended for speech-to-text transcription of real-world audio, especially audio affected by compound acoustic distortions. Example scenarios include far-field recording, environmental noise, reverberation, low-quality microphones, compression artifacts, partial signal corruption, and mixed acoustic conditions.

## Quick Start

Install the Mega-ASR codebase and dependencies:

```bash
git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR

conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt
```

Place this checkpoint directory at:

```text
ckpt/Mega-ASR
```

Run inference:

```bash
python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR
```

Disable routing if you want to always use the robust recognition path:

```bash
python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false
```

Python usage:

```python
from MegaASR.model.megaASR import MegaASR

model = MegaASR(
    model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
    router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
    routing_enabled=True,
)

result = model.infer("/path/to/audio.wav", return_route=True)
print(result)
```

## Decoding Defaults

The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.

The default generation configuration is deterministic:

```text
do_sample: false
num_beams: 1
repetition_penalty: 1.0
top_p: 1.0
top_k: 50
```

Because `do_sample` is false, decoding is greedy by default and sampling controls such as temperature, top-p, and top-k do not affect normal inference.
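
To make the greedy-decoding point concrete, here is a minimal sketch of a single decoding step in plain Python. The helper `greedy_pick` is a hypothetical name for illustration and is not part of the Mega-ASR or Qwen3-ASR code. It shows that the chosen token is simply the argmax of the logits, so dividing the logits by any positive temperature leaves the choice unchanged.

```python
def greedy_pick(logits: list[float]) -> int:
    """Return the index of the highest-scoring token (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    step_logits = [0.1, 2.5, -1.0, 2.4]
    chosen = greedy_pick(step_logits)

    # Scaling by a positive temperature preserves the ordering of the
    # logits, so the greedy choice is the same for every temperature.
    for temperature in (0.2, 1.0, 5.0):
        scaled = [value / temperature for value in step_logits]
        assert greedy_pick(scaled) == chosen

    print("greedy token index:", chosen)
```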

## Training Summary

Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
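
One generic way to expose a model to progressively harder examples is to order the training data by a difficulty score and present it in stages. The sketch below illustrates that idea only. It is not the Mega-ASR training pipeline, and the `difficulty` field, the `curriculum_stages` helper, and the three-stage split are all assumptions made for the example.

```python
def curriculum_stages(examples: list[dict], num_stages: int = 3) -> list[list[dict]]:
    """Split examples into stages ordered from easiest to hardest.

    Each example is ASSUMED to carry a numeric "difficulty" value,
    e.g. derived from how severe the simulated degradation is.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(examples, key=lambda example: example["difficulty"])
    if not ordered:
        return []
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


if __name__ == "__main__":
    data = [
        {"audio": "heavy_noise.wav", "difficulty": 0.9},
        {"audio": "clean.wav", "difficulty": 0.1},
        {"audio": "mild_reverb.wav", "difficulty": 0.5},
    ]
    for index, stage in enumerate(curriculum_stages(data), start=1):
        print(index, [example["audio"] for example in stage])
```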

The system is designed to improve recognition robustness on difficult audio, while the routing mechanism sends clean audio to the base recognition path so that clean-speech accuracy is preserved.

## Evaluation

Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, and in-the-wild compound acoustic scenarios. The recommended evaluation metrics are:

- **WER** for English and whitespace-tokenized languages
- **CER** for Chinese and character-based evaluation
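
Both metrics are edit distances normalized by the reference length: WER counts word-level substitutions, deletions, and insertions, and CER does the same over characters. The sketch below is a minimal reference implementation for intuition only. It is not the repository's evaluation script, which may apply its own text normalization (casing, punctuation, number formatting) before scoring. The function names are chosen for this example.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(cer("你好世界", "你好世"))
```

Because the numerator can include insertions, both rates can exceed 1.0 when the hypothesis is much longer than the reference, for example when the model hallucinates or repeats text.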

The Mega-ASR repository includes an evaluation script:

```bash
python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl
```

Input JSONL format:

```json
{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
```
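
Each line of the input file is a standalone JSON object with an `audio` path and a reference transcript in `answer`. The sketch below shows one way to check a manifest before running evaluation. The helper name `parse_manifest_lines` and the strictness of the checks are assumptions for illustration, and the repository's script may handle malformed lines differently.

```python
import json


def parse_manifest_lines(lines) -> list[dict]:
    """Parse JSONL lines and verify each record has the expected keys."""
    required = {"audio", "answer"}
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = required - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        records.append(record)
    return records


if __name__ == "__main__":
    sample = ['{"audio": "examples/audio/noise.wav", "answer": "hello world"}', ""]
    print(parse_manifest_lines(sample))
```

To validate a file on disk, pass an open file handle, since iterating over it yields one line at a time: `parse_manifest_lines(open("examples/test.jsonl", encoding="utf-8"))`.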

## Citation

If you use Mega-ASR, please cite the project:

```bibtex
@misc{xie2026megaasrinthewild2speechrecognition,
      title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
      author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
      year={2026},
      eprint={2605.19833},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.19833},
}
```

## Acknowledgements

Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.