zhifeixie committed
Commit 952fade · verified · 1 Parent(s): 6185df0

Update model card

Files changed (1)
  1. README.md +151 -0
README.md CHANGED
@@ -1,3 +1,154 @@
---
license: apache-2.0
language:
- en
- zh
tags:
- automatic-speech-recognition
- speech-recognition
- audio
- robust-asr
- qwen3-asr
pipeline_tag: automatic-speech-recognition
---

# Mega-ASR

Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.

The release contains the Qwen3-ASR-1.7B foundation model files, Mega-ASR adaptation weights, and an audio quality router. For each input, the router decides whether to use the robust Mega-ASR path or the base recognition path, which helps preserve clean-speech recognition quality while improving robustness on degraded speech.

## Model Details

- **Model name:** Mega-ASR
- **Task:** Automatic speech recognition
- **Backbone:** Qwen3-ASR-1.7B
- **Primary use case:** In-the-wild ASR under challenging acoustic conditions
- **Default decoding:** Greedy decoding
- **Default max new tokens:** 256 in the Mega-ASR inference wrapper
- **Router:** Audio quality classifier with a default threshold of 0.5
- **License:** Apache-2.0
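
The routing rule described above can be pictured with a minimal sketch. It assumes, hypothetically, that the router emits a score in [0, 1] where a higher value means more degraded audio, and that scores at or above the threshold are sent to the robust path. The actual score semantics, comparison direction, and function names in the Mega-ASR code may differ.

```python
# Illustrative sketch only: the function name and score semantics are
# assumptions, not the Mega-ASR router's actual API.

def choose_path(degradation_score: float, threshold: float = 0.5) -> str:
    """Pick a recognition path from a router score in [0, 1]."""
    if not 0.0 <= degradation_score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Assumed convention: a higher score means more degraded audio.
    return "robust" if degradation_score >= threshold else "base"


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, choose_path(score))
```

Raising the threshold in this sketch would keep more inputs on the base path; lowering it would send more inputs to the robust path.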

## Repository Contents

```text
Mega-ASR/
├── Qwen3-ASR-1.7B/          # Backbone model, tokenizer, processor, and generation config
├── mega-asr-merged/         # Mega-ASR adaptation weights used by the inference wrapper
├── audio_quality_router/    # Audio quality router checkpoint
└── README.md                # Model card
```

## Intended Use

Mega-ASR is intended for speech-to-text transcription of real-world audio, especially audio affected by compound acoustic distortions. Example scenarios include far-field recording, environmental noise, reverberation, low-quality microphones, compression artifacts, partial signal corruption, and mixed acoustic conditions.

## Quick Start

Install the Mega-ASR codebase and dependencies:

```bash
git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR

conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt
```

Place this checkpoint directory at:

```text
ckpt/Mega-ASR
```

Run inference:

```bash
python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR
```

To always use the robust recognition path, disable routing:

```bash
python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false
```

Python usage:

```python
from MegaASR.model.megaASR import MegaASR

# Load the backbone and the audio quality router from the checkpoint directory.
model = MegaASR(
    model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
    router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
    routing_enabled=True,
)

# Transcribe a single file.
result = model.infer("/path/to/audio.wav", return_route=True)
print(result)
```
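
For more than one file, the same call can be wrapped in a small loop. The helper below is a hypothetical convenience, not part of the Mega-ASR codebase: it only assumes an object with an `infer` method that is called as in the usage above, and it keeps each result opaque because the structure of the return value is not documented here.

```python
# Sketch: `transcribe_many` and `SupportsInfer` are hypothetical names,
# not part of Mega-ASR.
from typing import Any, Iterable, Protocol


class SupportsInfer(Protocol):
    """Anything exposing an `infer` method like the wrapper shown above."""

    def infer(self, audio_path: str, return_route: bool = ...) -> Any: ...


def transcribe_many(model: SupportsInfer, paths: Iterable[str]) -> dict[str, Any]:
    """Run `model.infer` on each path and collect the raw results by path."""
    results: dict[str, Any] = {}
    for path in paths:
        # Results are stored as returned; no assumption is made about their shape.
        results[path] = model.infer(path, return_route=True)
    return results
```

With a loaded model, `transcribe_many(model, ["a.wav", "b.wav"])` would return a dictionary keyed by file path.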

## Decoding Defaults

The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.

The default generation configuration is deterministic:

```text
do_sample: false
num_beams: 1
repetition_penalty: 1.0
top_p: 1.0
top_k: 50
```

Because `do_sample` is false and `num_beams` is 1, decoding is greedy by default, so sampling controls such as temperature, `top_p`, and `top_k` have no effect unless sampling is enabled.
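
To see why sampling controls are irrelevant under greedy decoding, the following toy sketch picks the next token from a hypothetical logit vector. It is plain Python for illustration, not the Qwen3-ASR generation code. Temperature divides every logit by the same positive factor, so it cannot change which index is largest.

```python
# Toy illustration; the logits and the helper name are hypothetical.

def greedy_pick(logits: list[float], temperature: float = 1.0) -> int:
    """Return the index of the largest logit after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Greedy decoding takes the arg max; no random draw is involved.
    return max(range(len(scaled)), key=scaled.__getitem__)


if __name__ == "__main__":
    logits = [1.2, 3.4, 0.7, 3.1]
    # The same index is chosen at every temperature.
    print([greedy_pick(logits, t) for t in (0.5, 1.0, 2.0)])
```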

## Training Summary

Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.

The system is designed to improve recognition robustness on difficult audio, while the routing mechanism keeps clean audio on the base recognition path to preserve its recognition quality.

## Evaluation

Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, and in-the-wild compound acoustic scenarios. The recommended evaluation metrics are:

- **WER** for English and whitespace-tokenized languages
- **CER** for Chinese and character-based evaluation
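
Both metrics are edit distances normalized by the reference length; they differ only in the unit being compared (words versus characters). The sketch below is a minimal reference implementation for illustration. It is not the repository's `evaluate_wer.py`, which may apply text normalization that this sketch omits.

```python
# Minimal sketch; not the repository's evaluation code.
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref = list("".join(reference.split()))
    if not ref:
        raise ValueError("reference must contain at least one character")
    hyp = list("".join(hypothesis.split()))
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(wer("the main street gets crowded", "the main street got crowded"))
    # One substituted character out of six reference characters.
    print(cer("今天天气很好", "今天天汽很好"))
```

Note that both rates can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.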

The Mega-ASR repository includes an evaluation script:

```bash
python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl
```

Input JSONL format:

```json
{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
```
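
A manifest in this format can be written and read back with the standard library. The sketch below assumes only the two fields shown above (`audio` and `answer`); the helper names and the temporary file path are illustrative and do not come from the repository.

```python
# Illustrative helpers; only the `audio` and `answer` fields come from the
# format shown above.
import json
import tempfile
from pathlib import Path


def write_manifest(path: Path, items: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for item in items:
            # ensure_ascii=False keeps Chinese transcripts human-readable.
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_manifest(path: Path) -> list[dict]:
    """Read a JSONL manifest, skipping blank lines and checking required keys."""
    items = []
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            item = json.loads(line)
            missing = {"audio", "answer"} - item.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            items.append(item)
    return items


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        manifest = Path(tmp) / "test.jsonl"
        write_manifest(manifest, [
            {"audio": "examples/audio/noise.wav", "answer": "hello world"},
        ])
        print(read_manifest(manifest))
```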

## Citation

If you use Mega-ASR, please cite the project:

```bibtex
@misc{xie2026megaasrinthewild2speechrecognition,
  title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
  author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
  year={2026},
  eprint={2605.19833},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2605.19833},
}
```

## Acknowledgements

Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of the public speech and audio datasets used in the project.