# HydraQwen2.5-Omni-3B
Paper: https://arxiv.org/abs/2603.28554
One model, many heads.
Omni-modal extension of Hydra, applying the dual-head LoRA-toggle architecture to Qwen2.5-Omni-3B. No Hydra-specific training -- the adapter from vidore/colqwen-omni-v0.1 is used as-is.
Three inference modes from a single 4.4B-parameter model:
- Retrieval (LoRA on, bidirectional): ColBERT multi-vector embeddings over images, audio, or video
- Text generation (LoRA off, causal): Autoregressive text conditioned on any input modality
- Speech generation (LoRA off, causal, talker enabled): Spoken answers via thinker-talker-vocoder pipeline
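The toggle behavior can be sketched with a minimal single-layer LoRA in NumPy (hypothetical shapes and weights, not the actual model code): the adapter is a purely additive low-rank branch, so with it disabled the output reduces bit-exactly to the base projection, which is what lets the adapter-off path recover base-model generation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, scale = 8, 8, 2, 1.0  # toy dimensions, not the model's

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(rank, d_in))    # LoRA down-projection
B = rng.normal(size=(d_out, rank))   # LoRA up-projection

def forward(x, lora_on):
    """Base linear layer with an optional additive LoRA branch (W + s*BA)."""
    y = W @ x
    if lora_on:
        y = y + scale * (B @ (A @ x))
    return y

x = rng.normal(size=d_in)
# Adapter off: exactly the base projection, no residual contamination.
assert np.array_equal(forward(x, lora_on=False), W @ x)
```

In practice the same idea applies per attention/MLP projection across all layers; the retrieval mode additionally switches the attention mask from causal to bidirectional.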
## Files
- `lm_head.pt` -- Preserved lm_head weights from the Qwen2.5-Omni-3B thinker
- `results/` -- Raw evaluation JSONs
- `scripts/` -- Training + eval scripts
## Results
Proof-of-concept results -- zero-shot, single run, no Hydra-specific training.
### Retrieval
| Benchmark | Metric | Score | # tasks |
|---|---|---|---|
| ViDoRe V1 | avg nDCG@5 | 0.8865 | 10 |
| ViDoRe V2 | avg nDCG@5 | 0.5353 | 4 |
| ViDoRe V3 | avg nDCG@5 | 0.4907 | 8 |
| AudioCaps (zero-shot) | R@1 | 26.2% | — |
V1 is the full 10-task set (InfoVQA retrieval added 2026-04-18 — see results/VidoreInfoVQARetrieval_predictions.json), directly comparable to the HydraQwen3.5-4B V1 column.
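For reference, the per-query metric behind the averages above can be computed as follows (a minimal sketch; `ranked_rels` is a hypothetical list of graded relevance labels in the model's ranked order):

```python
import math

def dcg(rels):
    """Discounted cumulative gain over a ranked list of relevance labels."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=5):
    """nDCG@k: DCG of the model ranking over DCG of the ideal ranking."""
    ideal = sorted(ranked_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom > 0 else 0.0

# Hypothetical query: relevant docs ranked 1st and 3rd of 5 retrieved.
score = ndcg_at_k([1, 0, 1, 0, 0], k=5)
```

The benchmark numbers are the mean of this per-query score over each task, then averaged across tasks.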
### Generation equivalence (InfoVQA)
Base Qwen2.5-Omni-3B thinker vs. HydraQwen2.5-Omni-3B with LoRA disabled, under the same ViDoRe-matched protocol as the 4B model: greedy decoding (T=0), 128 new tokens, the full InfoVQA validation split (n=2,801), with a short-answer prompt suffix applied identically to both paths (needed because Qwen2.5-Omni's default outputs are sentence-form).
| n | Base ANLS | Hydra ANLS | Δ (95% CI) | Exact match |
|---|---|---|---|---|
| 2,801 | 0.7257 | 0.7257 | +0.0000 [+0.0000, +0.0000] | 2,801 / 2,801 (100.00%) |
Byte-identical outputs on every sample. The adapter-off path recovers base-model generation exactly at the output-token level. Report: results/infovqa_report.json.
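ANLS here is the standard normalized-Levenshtein similarity used for InfoVQA; a minimal sketch (the τ = 0.5 threshold is the common default and an assumption about this eval's exact config):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed below the threshold tau."""
    if not pred and not gold:
        return 1.0
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= tau else 0.0
```

Byte-identical outputs imply zero edit distance on every sample, so the base and Hydra ANLS columns match term for term.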
## Related
- HydraQwen3.5-4B
- Code (training + eval scripts)
## Citation
```bibtex
@article{georgiou2026hydra,
  title={Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model},
  author={Georgiou, Athos},
  year={2026}
}
```
Base model: Qwen/Qwen2.5-Omni-3B