ru-Promptriever-4B
Overview
Standard dense retrieval models score query–passage pairs using a single semantic similarity signal, giving users no control over what "relevant" means beyond keyword choice. Promptriever (Weller et al., 2024) introduced per-instance natural language instructions that dynamically redefine relevance — a capability previously limited to generative LLMs.
ru-Promptriever-4B extends this paradigm to Russian:
- Base model: Qwen3-4B fine-tuned as a bi-encoder with QLoRA + GradCache
- Training data: ~500k instruction-augmented Russian retrieval triples (ru-promptriever-dataset)
- Pooling: last-token (EOS) pooling, same as the original Promptriever
The key training signal is instruction negatives: passages that are topically relevant to the query but violate the instruction constraint. Without them, models learn to ignore instructions entirely.
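For illustration, a single training example with an instruction negative might look like the following (field names and texts are hypothetical and shown in English for readability; the actual dataset is in Russian):

```python
# Hypothetical structure of one instruction-augmented training row.
# The instruction negative answers the topical query but violates the instruction.
example = {
    "query": "When was Moscow founded?",
    "instruction": "Find a document that mentions the specific founding date of the city.",
    "positive": "Moscow was founded in 1147 by Prince Yuri Dolgoruky.",
    "instruction_negative": (
        # topically on-query, but gives no founding date -> violates the instruction
        "Moscow is the capital of Russia and its largest city."
    ),
}
```

Contrasting the positive against such negatives forces the encoder to read the instruction rather than relying on topical overlap alone.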
Evaluation Results
Results on three benchmarks. p-MRR (Pairwise Mean Reciprocal Rank, ×100) is the primary instruction-following metric — higher means the model correctly adjusts rankings when instructions change. nDCG measures standard retrieval quality.
mFollowIR-RU
Russian split of mFollowIR — multilingual instruction-following retrieval using TREC NeuCLIR narratives as instructions.
| Model | nDCG@20 | p-MRR |
|---|---|---|
| BM25 | 0.452 | +0.67 |
| mE5-large | 0.428 | -2.03 |
| BGE-M3 | 0.479 | -4.15 |
| GigaEmbeddings-instruct | 0.236 | +3.44 |
| Promptriever-8B | 0.532 | +12.21 |
| Qwen3-Embedding-4B | 0.549 | +8.10 |
| ru-Promptriever-4B (ours) | 0.453 | +13.54 |
Synthetic Test (ru-promptriever-dataset)
Held-out test split of our own dataset. Paired standard + instructed queries; p-MRR measures instruction sensitivity.
| Model | nDCG@20 | p-MRR |
|---|---|---|
| BM25 | 0.652 | -11.63 |
| mE5-large | 0.806 | -2.55 |
| BGE-M3 | 0.757 | +0.75 |
| GigaEmbeddings-instruct | 0.457 | -21.93 |
| Promptriever-8B | 0.770 | -31.00 |
| Qwen3-Embedding-4B | 0.852 | -4.55 |
| ru-Promptriever-4B (ours) | 0.888 | +1.78 |
RuBQ Retrieval (ruMTEB)
Standard Russian retrieval benchmark from MTEB — no instructions provided.
| Model | nDCG@10 |
|---|---|
| BM25 | 0.363 |
| mE5-large | 0.721 |
| BGE-M3 | 0.712 |
| ru-Promptriever-4B (ours) | 0.640 |
Key takeaway: ru-Promptriever achieves the highest p-MRR on both instruction-following benchmarks (+13.54 on mFollowIR-RU, +1.78 on the Synthetic Test) and the best nDCG@20 on the Synthetic Test (0.888), while retaining usable standard retrieval quality on RuBQ (0.640 — well above the 0.363 BM25 baseline, though below encoder-only models such as mE5-large at 0.721).
Usage
Basic Retrieval (no instruction)
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Vladimirlv/ru-promptriever-qwen3-4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"  # EOS pooling below assumes right padding
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

def encode(texts: list[str], max_length: int = 512) -> torch.Tensor:
    """Encode texts using last-token (EOS) pooling."""
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        # Bypass lm_head to get post-norm hidden states
        original_lm_head = model.lm_head
        model.lm_head = torch.nn.Identity()
        outputs = model(**inputs, use_cache=False, return_dict=True)
        model.lm_head = original_lm_head
    # EOS pooling: take the embedding at the last non-padding token
    last_idx = inputs["attention_mask"].sum(dim=1) - 1
    embeddings = outputs.logits[torch.arange(len(texts)), last_idx]
    return F.normalize(embeddings, p=2, dim=1)

query = "Когда была основана Москва?"
passages = [
    "Москва была основана в 1147 году князем Юрием Долгоруким.",
    "Санкт-Петербург был основан Петром I в 1703 году.",
]

q_emb = encode([query])
p_emb = encode(passages)
scores = (q_emb @ p_emb.T).squeeze()
print(scores)  # tensor([0.82, 0.61])
```
Instruction-Following Retrieval
```python
# Append the instruction directly to the query (same format as training)
instruction = "Найди документ, в котором упоминается конкретная дата основания города."
instructed_query = f"{query} {instruction}"

q_emb = encode([instructed_query])
p_emb = encode(passages)
scores = (q_emb @ p_emb.T).squeeze()
# The model adjusts rankings based on the instruction
```
Using with sentence-transformers
This model is not compatible with sentence-transformers out of the box due to the custom EOS pooling. Use the snippet above directly with transformers.
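For corpus-scale search, plain torch is enough once passages are encoded. A minimal top-k sketch (random unit vectors stand in for `encode()` outputs; 2560 is the assumed Qwen3-4B hidden size — check `model.config.hidden_size` on the loaded model):

```python
import torch

torch.manual_seed(0)

# Stand-ins for encode(...) outputs: L2-normalized embeddings.
corpus_emb = torch.nn.functional.normalize(torch.randn(1000, 2560), dim=1)
query_emb = torch.nn.functional.normalize(torch.randn(1, 2560), dim=1)

# On normalized vectors, cosine similarity reduces to a dot product.
scores = query_emb @ corpus_emb.T          # shape (1, 1000)
top = torch.topk(scores.squeeze(0), k=5)   # top-5 passage indices and scores
for idx, score in zip(top.indices.tolist(), top.values.tolist()):
    print(f"passage {idx}: {score:.3f}")
```

For large corpora, precompute and cache `corpus_emb` once; only the query is encoded at search time.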
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-4B |
| Architecture | CausalLM bi-encoder (EOS pooling) |
| Fine-tuning method | QLoRA (rank-32, α=64, attention projections) |
| Training data | Vladimirlv/ru-promptriever-dataset (~500k instruction rows) |
| Effective batch size | 128 (8 per device × 8 accum × 2 GPUs) |
| Negatives per query | 7 (3 instruction negatives + 4 BM25 hard negatives) |
| Loss | InfoNCE contrastive (temperature=0.01) |
| Hardware | 2× NVIDIA RTX 5090 (Vast.ai) |
| Max query length | 304 tokens |
| Max passage length | 256 tokens |
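A minimal sketch of the InfoNCE objective under the hyperparameters listed above (7 negatives per query, temperature 0.01). This is illustrative only, not the exact training code — the actual run used GradCache to scale contrastive training to the full effective batch:

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, pos: torch.Tensor, negs: torch.Tensor,
             temperature: float = 0.01) -> torch.Tensor:
    """q: (B, D) queries; pos: (B, D) positives; negs: (B, N, D) negatives.
    All inputs are L2-normalized; the positive sits at logit index 0."""
    pos_logit = (q * pos).sum(dim=1, keepdim=True)    # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", q, negs)  # (B, N)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive = class 0
    return F.cross_entropy(logits, labels)

B, N, D = 4, 7, 32  # 7 negatives per query, as in training
q = F.normalize(torch.randn(B, D), dim=1)
pos = F.normalize(torch.randn(B, D), dim=1)
negs = F.normalize(torch.randn(B, N, D), dim=2)
loss = info_nce(q, pos, negs)
```

The low temperature (0.01) sharpens the softmax, so small similarity gaps between the positive and the instruction negatives produce large gradients — exactly the signal needed to make the model attend to instructions.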
Training Data
The model was trained on Vladimirlv/ru-promptriever-dataset — a Russian-language instruction-following retrieval dataset built on top of mMARCO-ru (~8.8M passages). Key properties:
- ~1.2M total rows (500k instruction-augmented + 500k standard pairs + repeated-query variants)
- Instruction negatives: synthetic passages that are topically relevant but violate the instruction (3 per instructed query, spanning 3 failure modes: `different_interpretation`, `omission`, `mention_non_relevant_flag`)
- BM25 hard negatives: top-k retrieved passages from the full mMARCO-ru corpus
- Paired rows: each source query has both a standard row and an instructed row to prevent catastrophic forgetting
Intended Use
- Instruction-following dense retrieval in Russian: RAG pipelines, search systems, and scenarios requiring fine-grained query control via natural language
- Research on multilingual instruction-following retrieval and bi-encoder training
- Benchmarking alongside mE5, BGE-M3, and Promptriever-style models
Out-of-Scope
- General-purpose text embedding (use mE5-large or BGE-M3 if no instruction-following is needed)
- Commercial applications (see License below)
- Languages other than Russian
Limitations
- MS MARCO origin: The training corpus derives from English web passages machine-translated to Russian. A portion of passages retain translation artifacts despite LLM-based rewriting.
- Standard retrieval trade-off: Instruction-following training slightly reduces standard retrieval quality compared to encoder-only models (mE5-large, BGE-M3).
- Noisy synthetic data: Instructions and negatives were generated and validated automatically by an LLM; a small fraction of imperfect examples may remain.
- Russian only: The model was trained and evaluated exclusively on Russian data.
License
This model is released under CC BY-NC 4.0 (Creative Commons Attribution–NonCommercial 4.0 International).
The non-commercial restriction is inherited from the upstream MS MARCO license (Microsoft Research License — non-commercial use only), which governs the training corpus.
Citation
If you use this model, please cite the original Promptriever paper:
```bibtex
@article{weller2024promptriever,
  title   = {Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
  author  = {Weller, Orion and Van Durme, Benjamin and Lawrie, Dawn and
             Paranjape, Ashwin and Zhang, Yuhao and Hessel, Jack},
  journal = {arXiv preprint arXiv:2409.11136},
  year    = {2024}
}
```