---
license: mit
language:
- en
- zh
- multilingual
pipeline_tag: text-ranking
library_name: transformers
tags:
- reranker
- retrieval
- rag
- agentic-search
- qwen3.5
- sentence-transformers
---

# Prism-Reranker

**Beyond Relevance Scoring — Jointly Producing Contributions and Evidence for Agentic Retrieval.**

Unlike standard rerankers, which emit only a relevance score, Prism-Reranker is a family of rerankers that returns three things in a single forward pass: a calibrated relevance score, a one-sentence *contribution*, and a self-contained *evidence* passage extracted from the document.

## Released models

Five checkpoints are released on the Hugging Face Hub. Four are fine-tuned from the **Qwen3.5** backbone; one (`-4B-exp`) is an experimental extension built on top of **Qwen3-Reranker-4B**, demonstrating that the same recipe transfers to an existing LLM-based reranker without losing ranking quality.

| Model | Backbone | Parameters | Hugging Face |
|---|---|---|---|
| Prism-Qwen3.5-Reranker-0.8B | Qwen3.5 | 0.8B | [infgrad/Prism-Qwen3.5-Reranker-0.8B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-0.8B) |
| Prism-Qwen3.5-Reranker-2B | Qwen3.5 | 2B | [infgrad/Prism-Qwen3.5-Reranker-2B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-2B) |
| Prism-Qwen3.5-Reranker-4B | Qwen3.5 | 4B | [infgrad/Prism-Qwen3.5-Reranker-4B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-4B) |
| Prism-Qwen3.5-Reranker-9B | Qwen3.5 | 9B | [infgrad/Prism-Qwen3.5-Reranker-9B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-9B) |
| Prism-Qwen3-Reranker-4B-exp | Qwen3-Reranker-4B | 4B | [infgrad/Prism-Qwen3-Reranker-4B-exp](https://huggingface.co/infgrad/Prism-Qwen3-Reranker-4B-exp) |

## Why this model?

In agentic / RAG pipelines, a relevance score is rarely the end goal. After deciding a document is relevant, the agent still has to read it, denoise it, and decide what to do next. Prism-Reranker folds that work into the reranker itself:

- **Relevance score** — `s(q, d) = σ(ℓ_yes − ℓ_no) ∈ (0, 1)`. Calibrated, ranking-ready.
- **`<contribution>`** — one sentence stating *every* core point the document contributes to the query. Useful for the agent to plan its next step without re-reading the document.
- **`<evidence>`** — a self-contained, faithfully rephrased rewrite of the query-relevant content. It drops irrelevant background and preserves proper nouns / numbers / dates / code / URLs verbatim. You can feed `<evidence>` directly to a downstream LLM and skip the raw document — saving context tokens and removing web noise.

If the document is not relevant, the model outputs `no` and stops; no contribution or evidence is generated.

## Highlights

- **Backbones**: Qwen3.5 series for the four main sizes, with no architectural changes; one extension variant on top of Qwen3-Reranker-4B.
- **Context length**: training data capped at **10K tokens** per example, covering most real-world documents.
- **Multilingual**: Chinese / English primary; other languages are supported but with less coverage.
- **Keyword-query robust**: agents often emit keyword-style queries instead of well-formed questions. ~30% of training queries were rewritten by an LLM into keyword form, so the model handles both natural-language and keyword queries.
- **Real-world data distribution**: in addition to open reranker datasets (MS MARCO, T2Ranking, MIRACL, …), training includes synthetic queries paired with real Tavily / Exa web-search results, matching what an actual agent sees at inference time.
- **Length × score balanced**: training data was rebalanced so that document length is not a relevance shortcut.
- **Training recipe**: distillation (point-wise MSE on a strong commercial reranker's scores) + SFT on `yes/no` + `<contribution>` + `<evidence>`, supervised by a 5-LLM-as-judge ensemble (an illustrative sketch of the distillation objective follows below).
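
The distillation term in that last bullet can be pictured as a simple point-wise regression. This is only an illustrative sketch of the objective described above, not the released training code; `teacher_score` stands for the commercial reranker's score in `[0, 1]`:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logit_yes: torch.Tensor,
                      student_logit_no: torch.Tensor,
                      teacher_score: torch.Tensor) -> torch.Tensor:
    """Point-wise MSE between the student's calibrated score and the teacher's score."""
    student_score = torch.sigmoid(student_logit_yes - student_logit_no)
    return F.mse_loss(student_score, teacher_score)
```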

## Quickstart

Two ways to call the model. Both produce the **same** relevance score `s(q, d) = σ(ℓ_yes − ℓ_no)`. Use **A** when you also want `<contribution>` / `<evidence>`. Use **B** when you only need a score and want a drop-in replacement for any other CrossEncoder reranker.

We use one shared example throughout so you can compare the outputs side by side:

```python
QUERY = "What is the boiling point of water at sea level?"
DOCUMENTS = [
    "Water boils at 100 °C (212 °F) at standard atmospheric pressure (1 atm), "
    "which corresponds to sea-level conditions.",
    "Mount Everest is the highest mountain on Earth, with a peak elevation "
    "of 8,848 meters above sea level.",
]
```
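
As noted above, both calling paths report `s(q, d) = σ(ℓ_yes − ℓ_no)`: restricting a softmax to just the `yes` and `no` tokens is algebraically the same as that sigmoid. A quick, model-free check with made-up logit values:

```python
import torch

# Arbitrary illustrative logits for the "yes" and "no" tokens.
l_yes, l_no = torch.tensor(3.2), torch.tensor(-1.1)

p_yes_softmax = torch.softmax(torch.stack([l_yes, l_no]), dim=0)[0]
p_yes_sigmoid = torch.sigmoid(l_yes - l_no)
assert torch.allclose(p_yes_softmax, p_yes_sigmoid)
```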

### A. Transformers (full output: score + contribution + evidence)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above

SYSTEM_PROMPT = (
    "Judge whether the Document meets the requirements based on "
    "the Query and the Instruct provided. "
)

INSTRUCTION = (
    'Judge if the document is relevant to the query. Reply "yes" or "no".\n'
    'On "yes", also emit:\n'
    "<contribution>One sentence covering every core point the document "
    "contributes to the query, without elaboration.</contribution>\n"
    "<evidence>Self-contained rewrite of the query-relevant content. Rules:\n"
    "- Faithful: rephrase only; add or infer nothing.\n"
    "- Self-contained: evidence alone must fully answer the query.\n"
    "- Concise: drop query-irrelevant background.\n"
    "- Verbatim (no translation): proper nouns, terms, abbreviations, "
    "numbers, dates, code, URLs.\n"
    "- Output language: multilingual doc → query's language; else doc's language."
    "</evidence>"
)

PROMPT_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n"
    "<Instruct>: {instruction}\n"
    "<Query>: {query}\n"
    "<Document>: {doc}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
)


def build_prompt(query: str, doc: str) -> str:
    return PROMPT_TEMPLATE.format(
        system=SYSTEM_PROMPT, instruction=INSTRUCTION, query=query, doc=doc
    )


tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
).eval()

yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("no", add_special_tokens=False)[0]


@torch.no_grad()
def rerank(query: str, doc: str, max_new_tokens: int = 512):
    prompt = build_prompt(query, doc)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    out = model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

    # Relevance score = softmax over {yes, no} at the first generated token.
    first_logprobs = torch.log_softmax(out.scores[0][0].float(), dim=-1)
    yes_p = first_logprobs[yes_id].exp()
    no_p = first_logprobs[no_id].exp()
    score = (yes_p / (yes_p + no_p)).item()

    # Decoded text holds yes/no plus <contribution>...</contribution><evidence>...</evidence>.
    gen_ids = out.sequences[0, input_ids.shape[1]:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
    return {"score": score, "text": text}


for doc in DOCUMENTS:
    print(rerank(QUERY, doc))
```

Expected output (one dict per document):

```text
{"score": 0.98, "text": "yes\n<contribution>...</contribution>\n<evidence>...</evidence>"}
{"score": 0.01, "text": "no"}
```

For irrelevant pairs the score is close to 0 and `text` is just `"no"`.
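
If you prefer structured fields over the raw tagged text, a small parser over this output format is enough. A minimal sketch (the tag layout follows the expected output above; `rerank` is the helper from path A):

```python
import re

def parse_rerank_output(text: str) -> dict:
    """Split the generated text into verdict / contribution / evidence fields."""
    relevant = text.lstrip().lower().startswith("yes")
    contribution = re.search(r"<contribution>(.*?)</contribution>", text, re.S)
    evidence = re.search(r"<evidence>(.*?)</evidence>", text, re.S)
    return {
        "relevant": relevant,
        "contribution": contribution.group(1).strip() if contribution else None,
        "evidence": evidence.group(1).strip() if evidence else None,
    }

result = rerank(QUERY, DOCUMENTS[0])
print(result["score"], parse_rerank_output(result["text"]))
```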

### B. Sentence Transformers CrossEncoder (score only)

If you only need the score and want a drop-in CrossEncoder, the same model works directly with `sentence-transformers >= 5.4.0`. **Note:** in this mode `<contribution>` and `<evidence>` are not produced — only the calibrated relevance score.

The system prompt and instruction are baked into the model's `chat_template.jinja` and are **not configurable** — the model was trained with one fixed prompt, and only that prompt produces calibrated scores. You only pass `(query, document)`; the rest is hardcoded.
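
If you want to double-check the exact prompt the model sees, the baked-in template can be printed from the tokenizer (a quick inspection; `chat_template` is how `transformers` exposes a repo's `chat_template.jinja`):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("infgrad/Prism-Qwen3.5-Reranker-4B")
print(tok.chat_template)  # the fixed system prompt and instruction live here
```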

```python
import torch
from sentence_transformers import CrossEncoder

MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above

ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.bfloat16})

# 1) Score (q, d) pairs. The default activation is Sigmoid, so scores are in (0, 1)
#    and equal to s(q, d) = sigmoid(logit_yes - logit_no) — identical to path A above.
pairs = [(QUERY, doc) for doc in DOCUMENTS]
scores = ce.predict(pairs)
print(scores)
# array([0.98, 0.01], dtype=float32)

# 2) Rank documents directly.
ranked = ce.rank(QUERY, DOCUMENTS, return_documents=True)
for r in ranked:
    print(f"{r['score']:.3f}\t{r['corpus_id']}\t{r['text'][:80]}")
```

To get raw logit differences instead of (0, 1) probabilities, pass `activation_fn=torch.nn.Identity()` to `ce.predict(...)`, as in the sketch below.
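
For example, reusing `pairs` from above (applying a sigmoid to the raw differences recovers the calibrated scores):

```python
import torch

raw = ce.predict(pairs, activation_fn=torch.nn.Identity())  # logit_yes - logit_no per pair
print(torch.sigmoid(torch.tensor(raw)))                      # back to the (0, 1) scores
```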

#### A note on numerical parity with path A

In **fp32**, paths A and B produce the same score to within ~1e-6 (verified across all five checkpoints).

In **bf16** with the default batched call (`batch_size > 1`), CE scores can drift from path A by **~1–3%** for individual pairs. The cause is bf16 SDPA: when CrossEncoder pads shorter sequences to the longest in the batch, the bf16 attention numerics differ by a few ULPs vs running each pair alone, and the difference accumulates across layers before the final sigmoid. **Ranking order is unaffected.** If you need bit-for-bit parity with path A:

```python
# Option 1: keep bf16, disable batching
ce.predict(pairs, batch_size=1)

# Option 2: use fp32 (slower, larger memory)
ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.float32})
```

## Notes on usage

- The first generated token is always `yes` or `no` — the score is well-defined even if you stop generation immediately (cheap mode: `max_new_tokens=1`; see the sketch below). Generate further only when you also want contribution/evidence.
- Inputs longer than 10K tokens may degrade — truncate the document side first (also shown below).
- Greedy decoding is fine for ranking. For more diverse evidence rephrasings, sample with `temperature` in the 0.3–0.5 range.
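
A minimal sketch of the first two points, reusing `tokenizer` and `rerank` from path A; the `10_000` cap mirrors the training limit mentioned above:

```python
MAX_DOC_TOKENS = 10_000  # matches the training-time cap per example

def truncate_doc(doc: str, max_tokens: int = MAX_DOC_TOKENS) -> str:
    """Keep only the first max_tokens tokens of the document side."""
    ids = tokenizer.encode(doc, add_special_tokens=False)
    return doc if len(ids) <= max_tokens else tokenizer.decode(ids[:max_tokens])

# Cheap mode: stop after the first token, so only the score is computed.
score_only = rerank(QUERY, truncate_doc(DOCUMENTS[0]), max_new_tokens=1)["score"]
print(score_only)
```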

## Citation

If you use Prism-Reranker in your research, please cite:

```bibtex
@misc{zhang2025prismreranker,
  title         = {Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval},
  author        = {Dun Zhang},
  year          = {2025},
  eprint        = {2604.23734},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2604.23734},
}
```

## Contact

Dun Zhang — `dunnzhang0@gmail.com` (independent researcher).