# HydraQwen2.5-Omni-3B
Paper: https://arxiv.org/abs/2603.28554
One model, many heads.
Omni-modal extension of Hydra, applying the dual-head LoRA-toggle architecture to Qwen2.5-Omni-3B. No Hydra-specific training -- the adapter from vidore/colqwen-omni-v0.1 is used as-is.
Three inference modes from a single 4.4B-parameter model:
- Retrieval (LoRA on, bidirectional): ColBERT multi-vector embeddings over images, audio, or video
- Text generation (LoRA off, causal): Autoregressive text conditioned on any input modality
- Speech generation (LoRA off, causal, talker enabled): Spoken answers via thinker-talker-vocoder pipeline
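The toggle behavior can be sketched with a minimal single-layer LoRA in NumPy (hypothetical shapes and weights, not the actual model code): the adapter is a purely additive low-rank branch, so with it disabled the output reduces bit-exactly to the base projection, which is what lets the adapter-off path recover base-model generation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, scale = 8, 8, 2, 1.0  # toy dimensions, not the model's

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(rank, d_in))    # LoRA down-projection
B = rng.normal(size=(d_out, rank))   # LoRA up-projection

def forward(x, lora_on):
    """Base linear layer with an optional additive LoRA branch (W + s*BA)."""
    y = W @ x
    if lora_on:
        y = y + scale * (B @ (A @ x))
    return y

x = rng.normal(size=d_in)
# Adapter off: exactly the base projection, no residual contamination.
assert np.array_equal(forward(x, lora_on=False), W @ x)
```

In practice the same idea applies per attention/MLP projection across all layers; the retrieval mode additionally switches the attention mask from causal to bidirectional.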
## Files
- `lm_head.pt` -- Preserved lm_head weights from the Qwen2.5-Omni-3B thinker
- `results/` -- Raw evaluation JSONs
- `scripts/` -- Training + eval scripts
## Results
Proof-of-concept results -- zero-shot, single run, no Hydra-specific training.
### Retrieval
| Benchmark | Metric | Score | # tasks |
|---|---|---|---|
| ViDoRe V1 | avg nDCG@5 | 0.8865 | 10 |
| ViDoRe V2 | avg nDCG@5 | 0.5353 | 4 |
| ViDoRe V3 | avg nDCG@5 | 0.4907 | 8 |
| AudioCaps (zero-shot) | R@1 | 26.2% | — |
V1 is the full 10-task set (InfoVQA retrieval added 2026-04-18 — see results/VidoreInfoVQARetrieval_predictions.json), directly comparable to the HydraQwen3.5-4B V1 column.
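For reference, the per-query metric behind the averages above can be computed as follows (a minimal sketch; `ranked_rels` is a hypothetical list of graded relevance labels in the model's ranked order):

```python
import math

def dcg(rels):
    """Discounted cumulative gain over a ranked list of relevance labels."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=5):
    """nDCG@k: DCG of the model ranking over DCG of the ideal ranking."""
    ideal = sorted(ranked_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom > 0 else 0.0

# Hypothetical query: relevant docs ranked 1st and 3rd of 5 retrieved.
score = ndcg_at_k([1, 0, 1, 0, 0], k=5)
```

The benchmark numbers are the mean of this per-query score over each task, then averaged across tasks.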
### Generation equivalence (InfoVQA)
Base Qwen2.5-Omni-3B thinker vs. HydraQwen2.5-Omni-3B with LoRA disabled, under the same ViDoRe-matched protocol as the 4B model: greedy decoding (T=0), 128 new tokens, the full InfoVQA validation split (n=2,801), with a short-answer prompt suffix applied identically to both paths (needed because Qwen2.5-Omni's default outputs are sentence-form).
| n | Base ANLS | Hydra ANLS | Δ (95% CI) | Exact match |
|---|---|---|---|---|
| 2,801 | 0.7257 | 0.7257 | +0.0000 [+0.0000, +0.0000] | 2,801 / 2,801 (100.00%) |
Byte-identical outputs on every sample. The adapter-off path recovers base-model generation exactly at the output-token level. Report: results/infovqa_report.json.
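ANLS here is the standard normalized-Levenshtein similarity used for InfoVQA; a minimal sketch (the τ = 0.5 threshold is the common default and an assumption about this eval's exact config):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed below the threshold tau."""
    if not pred and not gold:
        return 1.0
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= tau else 0.0
```

Byte-identical outputs imply zero edit distance on every sample, so the base and Hydra ANLS columns match term for term.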
## Related
- HydraQwen3.5-4B
- Code (training + eval scripts)
## Citation
```bibtex
@article{georgiou2026hydra,
  title={Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model},
  author={Georgiou, Athos},
  year={2026}
}
```
Base model: Qwen/Qwen2.5-Omni-3B