ru-Promptriever-4B



Overview

Standard dense retrieval models score query–passage pairs using a single semantic similarity signal, giving users no control over what "relevant" means beyond keyword choice. Promptriever (Weller et al., 2024) introduced per-instance natural language instructions that dynamically redefine relevance — a capability previously limited to generative LLMs.

ru-Promptriever-4B extends this paradigm to Russian:

  • Base model: Qwen3-4B fine-tuned as a bi-encoder with QLoRA + GradCache
  • Training data: ~500k instruction-augmented Russian retrieval triples (ru-promptriever-dataset)
  • Pooling: last-token (EOS) pooling, same as the original Promptriever

The key training signal is instruction negatives: passages that are topically relevant to the query but violate the instruction constraint. Without them, models learn to ignore instructions entirely.
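To make this concrete, here is an illustrative instruction-negative triple (a hypothetical example in the spirit of the dataset, not an actual row; the query is reused from the usage snippet below):

```python
query = "Когда была основана Москва?"  # "When was Moscow founded?"
instruction = "Ответ должен называть конкретный год."  # "The answer must name a specific year."

# Satisfies both the query and the instruction (names the year 1147).
positive = "Москва была основана в 1147 году князем Юрием Долгоруким."

# Topically on-target (Moscow's founding) but violates the instruction: no year given.
instruction_negative = "Москва была основана Юрием Долгоруким, но точная дата остаётся спорной."

# The instructed query is the plain query with the instruction appended, as in training.
instructed_query = f"{query} {instruction}"
```

Without such negatives, appending the instruction would never change which passage wins, so the model could safely ignore it.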


Evaluation Results

Results on three benchmarks. p-MRR (Pairwise Mean Reciprocal Rank, ×100) is the primary instruction-following metric — higher means the model correctly adjusts rankings when instructions change. nDCG measures standard retrieval quality.
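As a rough sketch of how p-MRR rewards rank changes, the per-document term below compares a relevant document's rank under the original query with its rank under the instructed query (a FollowIR-style formulation; treat the exact normalization and aggregation as assumptions and consult the FollowIR paper for the precise definition):

```python
def p_mrr_term(rank_og: int, rank_new: int) -> float:
    """Per-document p-MRR contribution (sketch).

    rank_og: rank of a relevant document under the original query.
    rank_new: its rank under the instruction-modified query.
    Positive when the instruction improves the rank, negative when it
    hurts, zero when the rank is unchanged.
    """
    if rank_og > rank_new:
        return rank_og / rank_new - 1
    return 1 - rank_new / rank_og

# Terms are then averaged over relevant documents and queries.
```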

mFollowIR-RU

Russian split of mFollowIR — multilingual instruction-following retrieval using TREC NeuCLIR narratives as instructions.

| Model | nDCG@20 | p-MRR |
|---|---|---|
| BM25 | 0.452 | +0.67 |
| mE5-large | 0.428 | -2.03 |
| BGE-M3 | 0.479 | -4.15 |
| GigaEmbeddings-instruct | 0.236 | +3.44 |
| Promptriever-8B | 0.532 | +12.21 |
| Qwen3-Embedding-4B | 0.549 | +8.10 |
| ru-Promptriever-4B (ours) | 0.453 | +13.54 |

Synthetic Test (ru-promptriever-dataset)

Held-out test split of our own dataset. Paired standard + instructed queries; p-MRR measures instruction sensitivity.

| Model | nDCG@20 | p-MRR |
|---|---|---|
| BM25 | 0.652 | -11.63 |
| mE5-large | 0.806 | -2.55 |
| BGE-M3 | 0.757 | +0.75 |
| GigaEmbeddings-instruct | 0.457 | -21.93 |
| Promptriever-8B | 0.770 | -31.00 |
| Qwen3-Embedding-4B | 0.852 | -4.55 |
| ru-Promptriever-4B (ours) | 0.888 | +1.78 |

RuBQ Retrieval (ruMTEB)

Standard Russian retrieval benchmark from MTEB — no instructions provided.

| Model | nDCG@10 |
|---|---|
| BM25 | 0.363 |
| mE5-large | 0.721 |
| BGE-M3 | 0.712 |
| ru-Promptriever-4B (ours) | 0.640 |

Key takeaway: ru-Promptriever achieves the highest p-MRR on both instruction-following benchmarks (+13.54 on mFollowIR-RU, +1.78 on Synthetic Test) and the best nDCG@20 on the Synthetic Test (0.888), while maintaining competitive standard retrieval performance on RuBQ (0.640 vs 0.363 BM25 baseline).


Usage

Basic Retrieval (no instruction)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch.nn.functional as F

model_name = "Vladimirlv/ru-promptriever-qwen3-4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

def encode(texts: list[str], max_length: int = 512) -> torch.Tensor:
    """Encode texts using last-token (EOS) pooling."""
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        # Bypass lm_head to get post-norm hidden states
        original_lm_head = model.lm_head
        model.lm_head = torch.nn.Identity()
        outputs = model(**inputs, use_cache=False, return_dict=True)
        model.lm_head = original_lm_head

    # EOS pooling: take the hidden state at the last non-padding token
    # (assumes right padding, the tokenizer default)
    last_token_idx = inputs["attention_mask"].sum(dim=1) - 1
    embeddings = outputs.logits[torch.arange(len(texts)), last_token_idx]
    return F.normalize(embeddings, p=2, dim=1)


query = "Когда была основана Москва?"
passages = [
    "Москва была основана в 1147 году князем Юрием Долгоруким.",
    "Санкт-Петербург был основан Петром I в 1703 году.",
]

q_emb = encode([query])
p_emb = encode(passages)
scores = (q_emb @ p_emb.T).squeeze()
print(scores)  # e.g. tensor([0.82, 0.61]); the first (relevant) passage scores higher

Instruction-Following Retrieval

# Append the instruction directly to the query (same format as training)
instruction = "Найди документ, в котором упоминается конкретная дата основания города."
instructed_query = f"{query} {instruction}"

q_emb = encode([instructed_query])
p_emb = encode(passages)
scores = (q_emb @ p_emb.T).squeeze()
# The model adjusts rankings based on the instruction

Using with sentence-transformers

This model is not compatible with sentence-transformers out of the box due to the custom EOS pooling. Use the snippet above directly with transformers.
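If you prefer a sentence-transformers-style interface anyway, a thin wrapper over any encoding callable is easy to sketch. `PromptrieverEncoder` is our own convenience name, not part of the released model; `encode_fn` is whatever callable you have (for example, the `encode` function defined above):

```python
from typing import Callable, List, Optional


class PromptrieverEncoder:
    """Minimal sentence-transformers-like wrapper (hypothetical convenience class)."""

    def __init__(self, encode_fn: Callable[[List[str]], object]):
        # encode_fn maps a list of strings to embeddings.
        self.encode_fn = encode_fn

    def encode(self, texts: List[str], instruction: Optional[str] = None):
        # Same format as training: append the instruction to each query.
        if instruction is not None:
            texts = [f"{t} {instruction}" for t in texts]
        return self.encode_fn(texts)
```

Passages are encoded without an instruction; only queries carry one.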


Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-4B |
| Architecture | CausalLM bi-encoder (EOS pooling) |
| Fine-tuning method | QLoRA (rank-32, α=64, attention projections) |
| Training data | Vladimirlv/ru-promptriever-dataset (~500k instruction rows) |
| Effective batch size | 128 (8 per device × 8 accum × 2 GPUs) |
| Negatives per query | 7 (3 instruction negatives + 4 BM25 hard negatives) |
| Loss | InfoNCE contrastive (temperature=0.01) |
| Hardware | 2× NVIDIA RTX 5090 (Vast.ai) |
| Max query length | 304 tokens |
| Max passage length | 256 tokens |
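The loss can be sketched as standard InfoNCE over one positive and the 7 explicit negatives (a minimal sketch with our own tensor layout; the actual training additionally uses GradCache and may include in-batch negatives):

```python
import torch
import torch.nn.functional as F


def infonce_loss(q: torch.Tensor, pos: torch.Tensor, negs: torch.Tensor,
                 temperature: float = 0.01) -> torch.Tensor:
    """InfoNCE for a single query.

    q: (d,) query embedding; pos: (d,) positive passage embedding;
    negs: (n, d) negative passage embeddings. All L2-normalized, so
    dot products are cosine similarities.
    """
    # Similarity of the query to the positive and to each negative.
    logits = torch.cat([(q * pos).sum().unsqueeze(0), negs @ q]) / temperature
    # The positive sits at index 0; cross-entropy pushes it above the negatives.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```

The low temperature (0.01) sharpens the distribution, so small similarity gaps between the positive and the hardest negative dominate the gradient.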

Training Data

The model was trained on Vladimirlv/ru-promptriever-dataset — a Russian-language instruction-following retrieval dataset built on top of mMARCO-ru (~8.8M passages). Key properties:

  • ~1.2M total rows (500k instruction-augmented + 500k standard pairs + repeated-query variants)
  • Instruction negatives: synthetic passages that are topically relevant but violate the instruction (3 per instructed query, across 3 failure modes: different_interpretation, omission, mention_non_relevant_flag)
  • BM25 hard negatives: top-k retrieved passages from the full mMARCO-ru corpus
  • Paired rows: each source query has both a standard row and an instructed row to prevent catastrophic forgetting
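The BM25 hard-negative mining step can be sketched in pure Python (a simplified, whitespace-tokenized sketch with our own helper names; the dataset itself was mined over the full mMARCO-ru corpus with a production BM25 implementation):

```python
import math
from collections import Counter


def bm25_scores(query: list[str], corpus: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of every corpus document against one tokenized query."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    df = Counter()  # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores


def mine_hard_negatives(query: list[str], corpus: list[list[str]],
                        positive_idx: int, k: int = 4) -> list[int]:
    """Indices of the k highest-scoring documents, excluding the known positive."""
    scores = bm25_scores(query, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != positive_idx][:k]
```

Such negatives are lexically close to the query but (usually) not relevant, which is what makes them hard for a dense model.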

Intended Use

  • Instruction-following dense retrieval in Russian: RAG pipelines, search systems, and scenarios requiring fine-grained query control via natural language
  • Research on multilingual instruction-following retrieval and bi-encoder training
  • Benchmarking alongside mE5, BGE-M3, and Promptriever-style models

Out-of-Scope

  • General-purpose text embedding (use mE5-large or BGE-M3 if no instruction-following is needed)
  • Commercial applications (see License below)
  • Languages other than Russian

Limitations

  • MS MARCO origin: The training corpus derives from English web passages machine-translated to Russian. A portion of passages retain translation artifacts despite LLM-based rewriting.
  • Standard retrieval trade-off: Instruction-following training slightly reduces standard retrieval quality compared to encoder-only models (mE5-large, BGE-M3).
  • Noisy synthetic data: Instructions and negatives were generated and validated automatically by an LLM; a small fraction of imperfect examples may remain.
  • Russian only: The model was trained and evaluated exclusively on Russian data.

License

This model is released under CC BY-NC 4.0 (Creative Commons Attribution–NonCommercial 4.0 International).

The non-commercial restriction is inherited from the upstream MS MARCO license (Microsoft Research License — non-commercial use only), which governs the training corpus.


Citation

If you use this model, please cite the original Promptriever paper:

@article{weller2024promptriever,
  title   = {Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
  author  = {Weller, Orion and Van Durme, Benjamin and Lawrie, Dawn and
             Paranjape, Ashwin and Zhang, Yuhao and Hessel, Jack},
  journal = {arXiv preprint arXiv:2409.11136},
  year    = {2024}
}