Can this model perform ASR?
As mentioned in the model card, this model seems to be an audio analysis model.
Can it perform the ASR task the way Qwen3-ASR-Flash does? Are there any benchmarks for it?
Doesn't look like it:
Note: Qwen3-Omni-30B-A3B-Captioner is a single-turn model that accepts only one audio input per inference. It does not accept any text prompts and supports audio input only, with text output only. As Qwen3-Omni-30B-A3B-Captioner is designed for generating fine‑grained descriptions of audio, excessively long audio clips may diminish detail perception. We recommend, as a best practice, limiting audio length to no more than 30 seconds.
https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/omni_captioner.ipynb
Which is a shame, because it's really good at detecting emotions.
Not really a "captioner" then is it? :)
I think this should be relatively easy to achieve with further training — for example, training it to output only a JSON array, with each item containing the subtitle text, start time, duration, and so on.
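As a sketch of what that training target might look like (this format is just a suggestion, not anything the model currently emits), each array item could hold the segment text plus timing fields:

```python
import json

# Hypothetical target output for a fine-tuned variant: a JSON array of
# subtitle segments with text, start time (s), and duration (s).
example_output = """
[
  {"text": "Hello, how are you?", "start": 0.0, "duration": 1.8},
  {"text": "I'm fine, thanks.",   "start": 2.1, "duration": 1.2}
]
"""

segments = json.loads(example_output)
for seg in segments:
    # every segment carries the three expected fields
    assert {"text", "start", "duration"} <= seg.keys()
print(len(segments))
```

A structured target like this is easy to validate automatically during fine-tuning, since any malformed generation fails `json.loads`.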
Well yeah, you can train it to do anything (though I've already trained Voxtral to do this, and I haven't had good results fine-tuning MoE models).
Don't get me wrong, it's a great audio analysis model (the best I've used). It's just confusing that it's called a "Captioner" when it... doesn't produce captions lol
I've had some success just sending the output from this model to another LLM and prompting it to pick out the information I want.
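A minimal sketch of that two-stage approach: pass the Captioner's free-text description to a second LLM with an extraction prompt. Here `call_llm` is a placeholder for whatever chat-completion client you use — the only real logic is the prompt construction.

```python
# Hypothetical two-stage extraction: Captioner output -> second LLM.
EXTRACTION_PROMPT = (
    "From the audio description below, extract only the speaker's "
    "emotion as a single lowercase word.\n\nDescription:\n{caption}"
)

def build_extraction_prompt(caption: str) -> str:
    """Wrap the Captioner's description in the extraction instruction."""
    return EXTRACTION_PROMPT.format(caption=caption)

# Example Captioner-style description (illustrative, not real model output)
caption = (
    "A middle-aged male voice, sounding frustrated and tense, says "
    "'I told you this would happen' over faint traffic noise."
)
prompt = build_extraction_prompt(caption)
# emotion = call_llm(prompt)  # placeholder client call
```

Because the second model only sees text, any general-purpose LLM works for this step, and you can swap the prompt to pull out different fields without retraining anything.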