Can this model perform ASR?
As mentioned in the model card, this model seems to be an audio analysis model.
Can it perform the ASR task the way Qwen3-ASR-Flash does? Are there any benchmarks for it?
Doesn't look like it:
Note: Qwen3-Omni-30B-A3B-Captioner is a single-turn model that accepts only one audio input per inference. It does not accept any text prompts and supports audio input only, with text output only. As Qwen3-Omni-30B-A3B-Captioner is designed for generating fine‑grained descriptions of audio, excessively long audio clips may diminish detail perception. We recommend, as a best practice, limiting audio length to no more than 30 seconds.
https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/omni_captioner.ipynb
Which is a shame, because it's really good at detecting emotions.
Not really a "captioner" then is it? :)
I think this should be relatively easy to achieve with further training — for example, training it to output only a JSON array, with each item containing the subtitle text, start time, duration, and so on.
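As a sketch of what that training target might look like (this format is just a suggestion, not anything the model currently emits), each array item could hold the segment text plus timing fields:

```python
import json

# Hypothetical target output for a fine-tuned variant: a JSON array of
# subtitle segments with text, start time (s), and duration (s).
example_output = """
[
  {"text": "Hello, how are you?", "start": 0.0, "duration": 1.8},
  {"text": "I'm fine, thanks.",   "start": 2.1, "duration": 1.2}
]
"""

segments = json.loads(example_output)
for seg in segments:
    # every segment carries the three expected fields
    assert {"text", "start", "duration"} <= seg.keys()
print(len(segments))
```

A structured target like this is easy to validate automatically during fine-tuning, since any malformed generation fails `json.loads`.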
Well yeah, you can train it to do anything (though I've already trained Voxtral to do this, and I haven't had good results fine-tuning MoE models).
Don't get me wrong, it's a great audio analysis model (the best I've used). It's just confusing that it's called a "Captioner" when it... doesn't produce captions lol
I've had some success just sending the output from this model to another LLM and prompting it to pick out the information I want.
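A minimal sketch of that two-stage approach: pass the Captioner's free-text description to a second LLM with an extraction prompt. Here `call_llm` is a placeholder for whatever chat-completion client you use — the only real logic is the prompt construction.

```python
# Hypothetical two-stage extraction: Captioner output -> second LLM.
EXTRACTION_PROMPT = (
    "From the audio description below, extract only the speaker's "
    "emotion as a single lowercase word.\n\nDescription:\n{caption}"
)

def build_extraction_prompt(caption: str) -> str:
    """Wrap the Captioner's description in the extraction instruction."""
    return EXTRACTION_PROMPT.format(caption=caption)

# Example Captioner-style description (illustrative, not real model output)
caption = (
    "A middle-aged male voice, sounding frustrated and tense, says "
    "'I told you this would happen' over faint traffic noise."
)
prompt = build_extraction_prompt(caption)
# emotion = call_llm(prompt)  # placeholder client call
```

Because the second model only sees text, any general-purpose LLM works for this step, and you can swap the prompt to pull out different fields without retraining anything.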