ru-Promptriever-4B
Overview
Standard dense retrieval models score query–passage pairs using a single semantic similarity signal, giving users no control over what "relevant" means beyond keyword choice. Promptriever (Weller et al., 2024) introduced per-instance natural language instructions that dynamically redefine relevance — a capability previously limited to generative LLMs.
ru-Promptriever-4B extends this paradigm to Russian:
- Base model: Qwen3-4B fine-tuned as a bi-encoder with QLoRA + GradCache
- Training data: ~500k instruction-augmented Russian retrieval triples (ru-promptriever-dataset)
- Pooling: last-token (EOS) pooling, same as the original Promptriever
The key training signal is instruction negatives: passages that are topically relevant to the query but violate the instruction constraint. Without them, models learn to ignore instructions entirely.
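For illustration, a single training example with an instruction negative might look like the following (field names and texts are hypothetical and shown in English for readability; the actual dataset is in Russian):

```python
# Hypothetical structure of one instruction-augmented training row.
# The instruction negative answers the topical query but violates the instruction.
example = {
    "query": "When was Moscow founded?",
    "instruction": "Find a document that mentions the specific founding date of the city.",
    "positive": "Moscow was founded in 1147 by Prince Yuri Dolgoruky.",
    "instruction_negative": (
        # topically on-query, but gives no founding date -> violates the instruction
        "Moscow is the capital of Russia and its largest city."
    ),
}
```

Contrasting the positive against such negatives forces the encoder to read the instruction rather than relying on topical overlap alone.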
Evaluation Results
Results on three benchmarks. p-MRR (Pairwise Mean Reciprocal Rank, ×100) is the primary instruction-following metric — higher means the model correctly adjusts rankings when instructions change. nDCG measures standard retrieval quality.
mFollowIR-RU
Russian split of mFollowIR — multilingual instruction-following retrieval using TREC NeuCLIR narratives as instructions.
| Model | nDCG@20 | p-MRR |
|---|---|---|
| BM25 | 0.452 | +0.67 |
| mE5-large | 0.428 | -2.03 |
| BGE-M3 | 0.479 | -4.15 |
| GigaEmbeddings-instruct | 0.236 | +3.44 |
| Promptriever-8B | 0.532 | +12.21 |
| Qwen3-Embedding-4B | 0.549 | +8.10 |
| ru-Promptriever-4B (ours) | 0.453 | +13.54 |
Synthetic Test (ru-promptriever-dataset)
Held-out test split of our own dataset. Paired standard + instructed queries; p-MRR measures instruction sensitivity.
| Model | nDCG@20 | p-MRR |
|---|---|---|
| BM25 | 0.652 | -11.63 |
| mE5-large | 0.806 | -2.55 |
| BGE-M3 | 0.757 | +0.75 |
| GigaEmbeddings-instruct | 0.457 | -21.93 |
| Promptriever-8B | 0.770 | -31.00 |
| Qwen3-Embedding-4B | 0.852 | -4.55 |
| ru-Promptriever-4B (ours) | 0.888 | +1.78 |
RuBQ Retrieval (ruMTEB)
Standard Russian retrieval benchmark from MTEB — no instructions provided.
| Model | nDCG@10 |
|---|---|
| BM25 | 0.363 |
| mE5-large | 0.721 |
| BGE-M3 | 0.712 |
| ru-Promptriever-4B (ours) | 0.640 |
Key takeaway: ru-Promptriever achieves the highest p-MRR on both instruction-following benchmarks (+13.54 on mFollowIR-RU, +1.78 on the Synthetic Test) and the best nDCG@20 on the Synthetic Test (0.888), while retaining usable standard retrieval quality on RuBQ (0.640 — well above the 0.363 BM25 baseline, though below encoder-only models such as mE5-large at 0.721).
Usage
Basic Retrieval (no instruction)
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Vladimirlv/ru-promptriever-qwen3-4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"  # EOS pooling below assumes right padding
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

def encode(texts: list[str], max_length: int = 512) -> torch.Tensor:
    """Encode texts using last-token (EOS) pooling."""
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        # Bypass lm_head to get post-norm hidden states
        original_lm_head = model.lm_head
        model.lm_head = torch.nn.Identity()
        outputs = model(**inputs, use_cache=False, return_dict=True)
        model.lm_head = original_lm_head
    # EOS pooling: take the embedding at the last non-padding token
    last_idx = inputs["attention_mask"].sum(dim=1) - 1
    embeddings = outputs.logits[torch.arange(len(texts)), last_idx]
    return F.normalize(embeddings, p=2, dim=1)

query = "Когда была основана Москва?"
passages = [
    "Москва была основана в 1147 году князем Юрием Долгоруким.",
    "Санкт-Петербург был основан Петром I в 1703 году.",
]

q_emb = encode([query])
p_emb = encode(passages)
scores = (q_emb @ p_emb.T).squeeze()
print(scores)  # tensor([0.82, 0.61])
```
Instruction-Following Retrieval
```python
# Append the instruction directly to the query (same format as training)
instruction = "Найди документ, в котором упоминается конкретная дата основания города."
instructed_query = f"{query} {instruction}"

q_emb = encode([instructed_query])
p_emb = encode(passages)
scores = (q_emb @ p_emb.T).squeeze()
# The model adjusts rankings based on the instruction
```
Using with sentence-transformers
This model is not compatible with sentence-transformers out of the box due to the custom EOS pooling. Use the snippet above directly with transformers.
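For corpus-scale search, plain torch is enough once passages are encoded. A minimal top-k sketch (random unit vectors stand in for `encode()` outputs; 2560 is the assumed Qwen3-4B hidden size — check `model.config.hidden_size` on the loaded model):

```python
import torch

torch.manual_seed(0)

# Stand-ins for encode(...) outputs: L2-normalized embeddings.
corpus_emb = torch.nn.functional.normalize(torch.randn(1000, 2560), dim=1)
query_emb = torch.nn.functional.normalize(torch.randn(1, 2560), dim=1)

# On normalized vectors, cosine similarity reduces to a dot product.
scores = query_emb @ corpus_emb.T          # shape (1, 1000)
top = torch.topk(scores.squeeze(0), k=5)   # top-5 passage indices and scores
for idx, score in zip(top.indices.tolist(), top.values.tolist()):
    print(f"passage {idx}: {score:.3f}")
```

For large corpora, precompute and cache `corpus_emb` once; only the query is encoded at search time.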
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-4B |
| Architecture | CausalLM bi-encoder (EOS pooling) |
| Fine-tuning method | QLoRA (rank-32, α=64, attention projections) |
| Training data | Vladimirlv/ru-promptriever-dataset (~500k instruction rows) |
| Effective batch size | 128 (8 per device × 8 accum × 2 GPUs) |
| Negatives per query | 7 (3 instruction negatives + 4 BM25 hard negatives) |
| Loss | InfoNCE contrastive (temperature=0.01) |
| Hardware | 2× NVIDIA RTX 5090 (Vast.ai) |
| Max query length | 304 tokens |
| Max passage length | 256 tokens |
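A minimal sketch of the InfoNCE objective under the hyperparameters listed above (7 negatives per query, temperature 0.01). This is illustrative only, not the exact training code — the actual run used GradCache to scale contrastive training to the full effective batch:

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, pos: torch.Tensor, negs: torch.Tensor,
             temperature: float = 0.01) -> torch.Tensor:
    """q: (B, D) queries; pos: (B, D) positives; negs: (B, N, D) negatives.
    All inputs are L2-normalized; the positive sits at logit index 0."""
    pos_logit = (q * pos).sum(dim=1, keepdim=True)    # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", q, negs)  # (B, N)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive = class 0
    return F.cross_entropy(logits, labels)

B, N, D = 4, 7, 32  # 7 negatives per query, as in training
q = F.normalize(torch.randn(B, D), dim=1)
pos = F.normalize(torch.randn(B, D), dim=1)
negs = F.normalize(torch.randn(B, N, D), dim=2)
loss = info_nce(q, pos, negs)
```

The low temperature (0.01) sharpens the softmax, so small similarity gaps between the positive and the instruction negatives produce large gradients — exactly the signal needed to make the model attend to instructions.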
Training Data
The model was trained on Vladimirlv/ru-promptriever-dataset — a Russian-language instruction-following retrieval dataset built on top of mMARCO-ru (~8.8M passages). Key properties:
- ~1.2M total rows (500k instruction-augmented + 500k standard pairs + repeated-query variants)
- Instruction negatives: synthetic passages that are topically relevant but violate the instruction (3 per instructed query, spanning 3 failure modes: `different_interpretation`, `omission`, `mention_non_relevant_flag`)
- BM25 hard negatives: top-k retrieved passages from the full mMARCO-ru corpus
- Paired rows: each source query has both a standard row and an instructed row to prevent catastrophic forgetting
Intended Use
- Instruction-following dense retrieval in Russian: RAG pipelines, search systems, and scenarios requiring fine-grained query control via natural language
- Research on multilingual instruction-following retrieval and bi-encoder training
- Benchmarking alongside mE5, BGE-M3, and Promptriever-style models
Out-of-Scope
- General-purpose text embedding (use mE5-large or BGE-M3 if no instruction-following is needed)
- Commercial applications (see License below)
- Languages other than Russian
Limitations
- MS MARCO origin: The training corpus derives from English web passages machine-translated to Russian. A portion of passages retain translation artifacts despite LLM-based rewriting.
- Standard retrieval trade-off: Instruction-following training slightly reduces standard retrieval quality compared to encoder-only models (mE5-large, BGE-M3).
- Noisy synthetic data: Instructions and negatives were generated and validated automatically by an LLM; a small fraction of imperfect examples may remain.
- Russian only: The model was trained and evaluated exclusively on Russian data.
License
This model is released under CC BY-NC 4.0 (Creative Commons Attribution–NonCommercial 4.0 International).
The non-commercial restriction is inherited from the upstream MS MARCO license (Microsoft Research License — non-commercial use only), which governs the training corpus.
Citation
If you use this model, please cite the original Promptriever paper:
```bibtex
@article{weller2024promptriever,
  title   = {Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
  author  = {Weller, Orion and Van Durme, Benjamin and Lawrie, Dawn and
             Paranjape, Ashwin and Zhang, Yuhao and Hessel, Jack},
  journal = {arXiv preprint arXiv:2409.11136},
  year    = {2024}
}
```