BidirLM: Turning Generative LLMs into the Best Open-Source Omnimodal Encoders

Community Article Published April 7, 2026

There are millions of GPU hours sitting in open-source causal language models. Code specialists, math specialists, vision models, audio models: all of them sit unused when it comes to representation tasks.

With BidirLM, we introduce a complete, open-source recipe to transform any causal decoder LLM into a powerful bidirectional encoder. Through systematic ablations on Gemma3 and Qwen3, we identified the adaptation strategy that actually works, then scaled it without access to the original pre-training data. But we didn't stop there. By composing specialized causal models through weight merging, we turned our text encoder into BidirLM-Omni-2.5B: a single compact model that handles text, images, and audio, and beats both omnimodal and unimodal specialists on standard benchmarks.

We're releasing the full model family, training data, and all checkpoints on the Hub. Let's walk through how we got there.

📦 Models, datasets & checkpoints on the Hub 📄 Paper

Table of Contents

  1. The Recipe: What Actually Matters
  2. Scaling Without the Original Data
  3. Text Encoders Hit the Pareto Frontier
  4. Composing Specialists Through Merging
  5. Building BidirLM-Omni
  6. Results: Beating Omnimodal and Unimodal Models
  7. Why This Matters

The Recipe: What Actually Matters When Adapting Causal Models

The field has been surprisingly confused about this. Some work skips the masking phase entirely and jumps straight to contrastive training. Some enables bidirectional attention; other work sticks with causal masking. What is clear is that nobody agrees, and nobody has run the controlled experiments to settle the question.

So we did. We tested five distinct adaptation variants on two model families (Gemma3 and Qwen3), carefully isolating every design choice:

  1. Base: the original causal model.
  2. Bi+Base: the Base model with bidirectional attention enabled.
  3. Bi+MNTP: the Bi+Base model with an MNTP adaptation phase.
  4. Bi+Contrastive: the Bi+Base model with a contrastive adaptation phase.
  5. Bi+MNTP+Contrastive: the Bi+Base model adapted sequentially with MNTP followed by contrastive training.

Adaptation variants overview

The answer is clear. Simply flipping the attention mask from causal to bidirectional gives inconsistent results. The model needs to learn how to use bidirectional context, and that's exactly what the MNTP (Masked Next-Token Prediction) phase provides. Once that's in place, contrastive training builds strong generic embedding quality on top.

The key finding? Recent contrastive-only approaches sacrifice fine-tuning quality for embedding gains. Restoring a prior MNTP phase lets you have both. This two-phase pipeline (MNTP then contrastive) is the backbone of every BidirLM model.
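To make the MNTP phase concrete, here is a minimal sketch of the masking step. It is not the authors' implementation: the mask-token id, the masking ratio, and the function name `mntp_mask` are all illustrative. The distinctive detail of MNTP versus BERT-style MLM is that the prediction for a masked token at position i is read from the decoder's logits at position i-1, matching the model's next-token head.

```python
import random

MASK_ID = 0          # illustrative mask-token id, not a real tokenizer's
MASK_RATIO = 0.2     # illustrative masking probability

def mntp_mask(token_ids, ratio=MASK_RATIO, seed=None):
    """Mask a random subset of tokens. Position 0 is never masked,
    because the prediction for a masked token at position i is read
    from the logits at position i - 1."""
    rng = random.Random(seed)
    masked = list(token_ids)
    targets = {}  # position -> original token to predict
    for i in range(1, len(token_ids)):
        if rng.random() < ratio:
            targets[i] = token_ids[i]
            masked[i] = MASK_ID
    return masked, targets

# Training then computes cross-entropy at position i - 1 for each
# masked position i, so the causal next-token head is reused as-is.
```

The point of the shifted target is that the model keeps using the prediction head it already has; only the attention mask changes.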

Scaling Without the Original Data

Here's where it gets tricky. Most recent adapted encoders come from the same organizations that trained the base models. They have access to the original pre-training data, which quietly prevents catastrophic forgetting. For everyone else, scaling adaptation on a different data distribution causes the model to forget what it knew: languages, code, math.

We saw this firsthand. When we extended MNTP training from 10B to 30B tokens on English-only data, Gemma dropped 7 points on Arabic retrieval and Qwen lost ground on math and code.

Catastrophic forgetting

Our solution combines two lightweight strategies:

Linear weight merging. We average the adapted model's weights with the original base checkpoint at a 50/50 ratio. This works because both models sit close in weight space (cosine similarity of 0.97 for Qwen), and the interpolation recovers the base model's distributional coverage while preserving the new bidirectional attention patterns.
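The merge itself is just linear interpolation of parameters. A minimal sketch, with weights represented as flat lists of floats for readability (real checkpoints are tensors, and `merge_state_dicts` is an illustrative name, not our actual code):

```python
def merge_state_dicts(adapted, base, alpha=0.5):
    """Linearly interpolate two state dicts with identical keys and
    shapes: alpha * adapted + (1 - alpha) * base.
    alpha=0.5 reproduces the 50/50 merge described above."""
    assert adapted.keys() == base.keys(), "merging requires identical architectures"
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(adapted[name], base[name])]
        for name in adapted
    }
```

Because the adapted and base models sit close in weight space, this interpolation lands in the same loss basin instead of producing noise.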

A multi-domain data mixture. We replace just 20% of the English training data with multilingual, math, and code samples. That small fraction is enough to maintain cross-domain knowledge.
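The mixture construction can be sketched as follows. The 20% replacement fraction comes from the text; the equal three-way split of the replaced portion and the name `build_mixture` are assumptions for illustration:

```python
def build_mixture(english, multilingual, math_data, code, replace_frac=0.20):
    """Keep (1 - replace_frac) of the English samples and fill the
    remainder with equal parts multilingual, math, and code samples."""
    n = len(english)
    n_keep = round(n * (1 - replace_frac))
    n_each = (n - n_keep) // 3  # split the replaced slice three ways
    return (english[:n_keep]
            + multilingual[:n_each]
            + math_data[:n_each]
            + code[:n_each])
```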

Combined, these two strategies gave us our best results: +2 points on multilingual benchmarks, up to +11 on code for Gemma. No proprietary data needed. No expensive replay buffers.

Text Encoders Hit the Open-Source Pareto Frontier

We scaled this recipe to larger architectures (Gemma3 up to 1B, Qwen3 up to 1.7B) and added contrastive training on 10M multi-domain samples to produce the BidirLM text encoder family.

Open-Source Pareto Frontier Performance

On the augmented XTREME benchmark (covering multilingual NLU, retrieval, code, and math), every BidirLM variant sets a new performance frontier. BidirLM-270M matches mmBERT-base with 10% fewer parameters. BidirLM-0.6B beats EuroBERT-610m by over a point.

On MTEB Multilingual v2, the same models advance the open-source Pareto frontier across three of four size configurations. No knowledge distillation from proprietary models. No multi-run averaging tricks. Just classical contrastive training on open data.

But here's where it gets really interesting.

The Big Idea: Composing Specialists Through Merging

Weight merging worked beautifully for mitigating forgetting. So we asked a bolder question: can we use the same technique to absorb capabilities from entirely different specialized models?

The causal LLM ecosystem is enormous. For Qwen3 alone, there are safety models (Qwen3Guard), vision-language models (Qwen3-VL), audio models (Qwen3-ASR). Each represents thousands of GPU hours of specialized training. Each shares the same underlying backbone architecture as our adapted encoder.

What if we could just... merge them in?

Domain transfer: safety as a test case

We started with safety moderation. We merged our Bi+MNTP Qwen3-0.6B encoder with Qwen3Guard-Gen-0.6B at a 50/50 ratio, then fine-tuned for just 500 steps (two minutes on a single GPU).

Domain transfer

The merged model outperformed every baseline by over 1 point on average across three safety benchmarks, including two it had never seen during training. Even more striking: it reached 93% of its peak performance in just 20 steps (80 training samples). At that point, it was already 5 points ahead of every other variant.

Merging doesn't just transfer knowledge. It makes adaptation dramatically more sample-efficient.

Modality transfer: vision and audio

We pushed further. We merged our Qwen3-1.7B encoder with Qwen3-VL-2B-Instruct for vision, and our Qwen3-0.6B encoder with Qwen3-ASR-0.6B for audio.

Modality transfer

The results were even more dramatic. On visual-textual entailment, the merged model exceeded the unmerged baseline by over 30 F1 points. On audio comprehension, the gap was 19 points. In both cases, the merged variant also outperformed the specialist causal model fine-tuned with bidirectional attention.

Perhaps the most surprising result: merging succeeded even when the models shared no prior overlapping modalities. The audio specialist was trained exclusively on speech recognition, with no text understanding objective. Yet combining it with our text encoder produced a model that understood both.

Building BidirLM-Omni: One Model, Three Modalities

This is where everything comes together. The key insight is architectural: Qwen3-VL-2B-Instruct (vision-language), Qwen3-ASR-1.7B (audio), and our Bi+MNTP Qwen3-1.7B text encoder all share the exact same transformer backbone. They were all derived from the same pre-trained Qwen3 weights and then specialized in different directions, so their weight spaces haven't drifted far apart (cosine similarities of 0.97 with the vision model, 0.93 with audio). Prior work on linear mode connectivity tells us that models fine-tuned from a shared checkpoint remain in the same basin of the loss landscape, which is why averaging their weights produces something coherent rather than noise. Each specialist learned its own capabilities along roughly independent directions. Merging combines their strengths without destructive interference.

The construction

The recipe has two steps. First, we isolate the textual backbone of each model, stripping away modality-specific projection heads, and perform a linear weight merge in equal proportions (one third each). This produces a unified backbone at the intersection of all three specializations. Second, we take the frozen projection heads (visual from Qwen3-VL, audio from Qwen3-ASR) and attach them directly to this merged backbone. These heads were already trained to project their modality into their original backbone's representation space, so they plug in seamlessly with no retraining.
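The two-step construction can be sketched by generalizing the linear merge to N parents and reattaching the frozen heads. This is a sketch under the assumptions stated above (shared backbone keys, heads kept verbatim); `merge_many` and `assemble_omni` are illustrative names:

```python
def merge_many(state_dicts, weights=None):
    """Linear merge of N backbones sharing identical keys and shapes.
    Equal weights (1/N each) reproduce the one-third merge above."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    keys = state_dicts[0].keys()
    assert all(sd.keys() == keys for sd in state_dicts)
    return {
        k: [sum(w * sd[k][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][k]))]
        for k in keys
    }

def assemble_omni(text_sd, vision_sd, audio_sd, vision_head, audio_head):
    """Step 1: merge the three textual backbones in equal proportions.
    Step 2: attach the frozen modality projection heads unchanged."""
    backbone = merge_many([text_sd, vision_sd, audio_sd])
    return {"backbone": backbone,
            "vision_head": vision_head,  # frozen, reused as-is
            "audio_head": audio_head}    # frozen, reused as-is
```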

Aligning all modalities with contrastive training

A merged backbone with attached heads isn't enough on its own: the model still needs to learn a shared representation space where text, images, and audio can be meaningfully compared. For this, we run a final contrastive training phase using Sentence Transformers on Omni-Contrastive, a 1.8M-pair corpus balanced across modalities: 65% text-text pairs, 17.5% audio-text pairs (Laion-Audio-300M and LibriSpeech), and 17.5% image-text pairs (ColPali, NatCap, and MSCOCO). Using InfoNCE with in-batch negatives, this phase pulls matching cross-modal pairs together while pushing non-matching ones apart. The entire process, merging plus contrastive training, required only 250 GPU hours on MI250X, a fraction of the thousands spent training each specialist from scratch.
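For intuition, here is InfoNCE with in-batch negatives written out in plain Python (in practice this runs batched on GPU via Sentence Transformers; the temperature value and the function names are illustrative). For query i, `positives[i]` is the matching item and every other `positives[j]` in the batch serves as a negative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives: softmax cross-entropy
    over similarities, where the correct class for query i is i."""
    n = len(queries)
    total = 0.0
    for i in range(n):
        logits = [cosine(queries[i], positives[j]) / temperature
                  for j in range(n)]
        m = max(logits)  # stabilized log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_z)
    return total / n
```

Matching cross-modal pairs (text with its image or audio) get pulled together because their similarity sits on the diagonal; everything off-diagonal is pushed apart.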

The Results: Beating Both Omnimodal and Unimodal Models

Omnimodal Performance

BidirLM-Omni outperforms Nemotron-Omni-3B, the previous best omnimodal model, across all three modalities with gains of +17 points on text (MTEB) and +5 on images (MIEB), while being nearly half the size (2.5B vs. 4.8B). But BidirLM-Omni doesn't just beat other omnimodal models. It establishes new Pareto frontiers even against dedicated unimodal architectures.

On MIEB (image embeddings), it ranks first among all baselines, ahead of SigLIP-400M, CLIP variants, and E5-V, many of which are several times larger. On MAEB (audio embeddings), it ranks third overall, surpassing bimodal architectures with nearly 5B parameters. On MTEB Multilingual v2 (text), it scores 63.1, on par with our best text-only encoder (BidirLM-1.7B at 62.9) despite handling two additional modalities. A single 2.5B model, 250 GPU hours of extra compute, competing with or beating specialists designed to do one thing well.

Why This Matters

The traditional approach to building multimodal encoders is to train everything from scratch, for every combination of modalities you care about. That's expensive, inflexible, and wasteful.

BidirLM-Omni demonstrates a fundamentally different path. Start with a strong adapted text encoder. As new specialized causal models appear on the Hub, merge them in. Attach the modality head. Run lightweight contrastive training. Done.

New audio model released? Merge it. Better vision backbone? Merge it. Domain-specific variant for biomedical text? Merge it. The pipeline is modular, incremental, and cheap.

Get Started

We're releasing everything openly:

Models: BidirLM-270M, BidirLM-0.6B, BidirLM-1B, BidirLM-1.7B, and BidirLM-Omni-2.5B Data: Our full contrastive training corpus and the Omni-Contrastive multimodal dataset Checkpoints: All intermediate experimental variants

👉 Explore the full collection at: https://huggingface.co/BidirLM

📝 The paper: https://arxiv.org/abs/2604.02045

✖️ X accounts of the first authors: @N1colAIs, @TheoDescha33800

@misc{boizard2026bidirlmtextomnimodalbidirectional,
      title={BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs},
      author={Nicolas Boizard and Théo Deschamps-Berger and Hippolyte Gisserot-Boukhlef and Céline Hudelot and Pierre Colombo},
      year={2026},
      eprint={2604.02045},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.02045},
}
