---
license: mit
language:
- en
- zh
- multilingual
pipeline_tag: text-ranking
library_name: transformers
tags:
- reranker
- retrieval
- rag
- agentic-search
- qwen3.5
- sentence-transformers
---
# Prism-Reranker
**Beyond Relevance Scoring — Jointly Producing Contributions and Evidence for Agentic Retrieval.**
Unlike standard rerankers, which emit only a relevance score, this reranker family returns three things in a single forward pass: a calibrated score, a one-sentence *contribution*, and a self-contained *evidence* passage extracted from the document.

## Released models
Five checkpoints are released on the Hugging Face Hub. Four are fine-tuned from the **Qwen3.5** backbone; one (`-4B-exp`) is an experimental extension built on top of **Qwen3-Reranker-4B**, demonstrating that the same recipe transfers to an existing LLM-based reranker without losing ranking quality.
| Model | Backbone | Parameters | Hugging Face |
|---|---|---|---|
| Prism-Qwen3.5-Reranker-0.8B | Qwen3.5 | 0.8B | [infgrad/Prism-Qwen3.5-Reranker-0.8B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-0.8B) |
| Prism-Qwen3.5-Reranker-2B | Qwen3.5 | 2B | [infgrad/Prism-Qwen3.5-Reranker-2B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-2B) |
| Prism-Qwen3.5-Reranker-4B | Qwen3.5 | 4B | [infgrad/Prism-Qwen3.5-Reranker-4B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-4B) |
| Prism-Qwen3.5-Reranker-9B | Qwen3.5 | 9B | [infgrad/Prism-Qwen3.5-Reranker-9B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-9B) |
| Prism-Qwen3-Reranker-4B-exp | Qwen3-Reranker-4B | 4B | [infgrad/Prism-Qwen3-Reranker-4B-exp](https://huggingface.co/infgrad/Prism-Qwen3-Reranker-4B-exp) |
## Why this model?
In agentic / RAG pipelines, a relevance score is rarely the end goal. After deciding a document is relevant, the agent still has to read it, denoise it, and decide what to do next. Prism-Reranker folds that work into the reranker itself:
- **Relevance score** — `s(q, d) = σ(ℓ_yes − ℓ_no) ∈ (0, 1)`. Calibrated, ranking-ready.
- **`<contribution>`** — one sentence stating *every* core point the document contributes to the query. Useful for the agent to plan its next step without re-reading the doc.
- **`<evidence>`** — a self-contained, faithfully-rephrased rewrite of the query-relevant content. Drops irrelevant background, preserves verbatim proper nouns / numbers / dates / code / URLs. You can feed `<evidence>` directly to a downstream LLM and skip the raw document — saving context tokens and removing web-noise.
If the document is not relevant, the model outputs `no` and stops. No contribution/evidence is generated.
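The quickstart below computes this score from softmax probabilities over `{yes, no}`, which is the same quantity: `σ(ℓ_yes − ℓ_no) = e^ℓ_yes / (e^ℓ_yes + e^ℓ_no)`. A quick numerical check of the identity in plain PyTorch (no model needed):
```python
import torch

# sigmoid of the logit gap equals the two-way softmax probability of "yes"
l_yes, l_no = torch.tensor(2.3), torch.tensor(-1.1)
lhs = torch.sigmoid(l_yes - l_no)
rhs = l_yes.exp() / (l_yes.exp() + l_no.exp())
assert torch.allclose(lhs, rhs)  # both ≈ 0.9677
```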
## Highlights
- **Backbones**: Qwen3.5 series for the four main sizes, no architectural changes; one extension variant on top of Qwen3-Reranker-4B.
- **Context length**: training data capped at **10K tokens** per example, covering most real-world documents.
- **Multilingual**: Chinese / English primary; other languages supported but with less coverage.
- **Keyword-query robust**: agents often emit keyword-style queries instead of well-formed questions. ~30% of training queries were rewritten by an LLM into keyword form, so the model handles both natural and keyword queries.
- **Real-world data distribution**: in addition to open reranker datasets (MS MARCO, T2Ranking, MIRACL, …), training includes synthetic queries paired with real Tavily / Exa web-search results, matching what an actual agent sees at inference time.
- **Length × score balanced**: training data was rebalanced so that document length is not a relevance shortcut.
- **Training recipe**: distillation (point-wise MSE on a strong commercial reranker's scores) + SFT on `yes/no` + `<contribution>` + `<evidence>`, supervised by a 5-LLM-as-judge ensemble.
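As a rough illustration of the distillation term only (the released training code is not public; the batch layout, names, and teacher interface here are assumptions), a point-wise MSE on teacher scores might look like:
```python
import torch
import torch.nn.functional as F

def distill_loss(yes_logits: torch.Tensor,   # (batch,) logit of "yes" per (q, d) pair
                 no_logits: torch.Tensor,    # (batch,) logit of "no"
                 teacher_scores: torch.Tensor) -> torch.Tensor:  # (batch,) in (0, 1)
    # Student score uses the same formula as inference: sigmoid(l_yes - l_no).
    student_scores = torch.sigmoid(yes_logits - no_logits)
    # Point-wise MSE against the teacher reranker's calibrated scores.
    return F.mse_loss(student_scores, teacher_scores)
```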
## Quickstart
Two ways to call the model. Both produce the **same** relevance score `s(q, d) = σ(ℓ_yes − ℓ_no)`. Use **A** when you also want `<contribution>` / `<evidence>`. Use **B** when you only need a score and want a drop-in replacement for any other CrossEncoder reranker.
We use one shared example throughout so you can compare the outputs side by side:
```python
QUERY = "What is the boiling point of water at sea level?"
DOCUMENTS = [
    "Water boils at 100 °C (212 °F) at standard atmospheric pressure (1 atm), "
    "which corresponds to sea-level conditions.",
    "Mount Everest is the highest mountain on Earth, with a peak elevation "
    "of 8,848 meters above sea level.",
]
```
### A. Transformers (full output: score + contribution + evidence)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B" # or any sibling repo above
SYSTEM_PROMPT = (
    "Judge whether the Document meets the requirements based on "
    "the Query and the Instruct provided. "
)
INSTRUCTION = (
    'Judge if the document is relevant to the query. Reply "yes" or "no".\n'
    'On "yes", also emit:\n'
    "<contribution>One sentence covering every core point the document "
    "contributes to the query, without elaboration.</contribution>\n"
    "<evidence>Self-contained rewrite of the query-relevant content. Rules:\n"
    "- Faithful: rephrase only; add or infer nothing.\n"
    "- Self-contained: evidence alone must fully answer the query.\n"
    "- Concise: drop query-irrelevant background.\n"
    "- Verbatim (no translation): proper nouns, terms, abbreviations, "
    "numbers, dates, code, URLs.\n"
    "- Output language: multilingual doc → query's language; else doc's language."
    "</evidence>"
)
PROMPT_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n"
    "<Instruct>: {instruction}\n"
    "<Query>: {query}\n"
    "<Document>: {doc}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
)

def build_prompt(query: str, doc: str) -> str:
    return PROMPT_TEMPLATE.format(
        system=SYSTEM_PROMPT, instruction=INSTRUCTION, query=query, doc=doc
    )

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
).eval()

yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("no", add_special_tokens=False)[0]

@torch.no_grad()
def rerank(query: str, doc: str, max_new_tokens: int = 512):
    prompt = build_prompt(query, doc)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )
    # Relevance score = softmax over {yes, no} at the first generated token.
    first_logprobs = torch.log_softmax(out.scores[0][0].float(), dim=-1)
    yes_p = first_logprobs[yes_id].exp()
    no_p = first_logprobs[no_id].exp()
    score = (yes_p / (yes_p + no_p)).item()
    # Decoded text holds yes/no plus <contribution>...</contribution><evidence>...</evidence>.
    gen_ids = out.sequences[0, input_ids.shape[1]:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
    return {"score": score, "text": text}

for doc in DOCUMENTS:
    print(rerank(QUERY, doc))
```
Expected output (one dict per document):
```text
{"score": 0.98, "text": "yes\n<contribution>...</contribution>\n<evidence>...</evidence>"}
{"score": 0.01, "text": "no"}
```
For irrelevant pairs the score is close to 0 and `text` is just `"no"`.
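On the `yes` path, the tagged fields can be pulled out of `text` with two regexes. A small helper (`parse_output` is our name, not part of the model API):
```python
import re

def parse_output(text: str) -> dict:
    # Returns {"relevant": bool, "contribution": str | None, "evidence": str | None}.
    relevant = text.lstrip().lower().startswith("yes")
    contribution = evidence = None
    if relevant:
        m = re.search(r"<contribution>(.*?)</contribution>", text, re.DOTALL)
        contribution = m.group(1).strip() if m else None
        m = re.search(r"<evidence>(.*?)</evidence>", text, re.DOTALL)
        evidence = m.group(1).strip() if m else None
    return {"relevant": relevant, "contribution": contribution, "evidence": evidence}
```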
### B. Sentence Transformers CrossEncoder (score only)
If you only need the score and want a drop-in CrossEncoder, the same model works directly with `sentence-transformers >= 5.4.0`. **Note:** in this mode `<contribution>` and `<evidence>` are not produced — only the calibrated relevance score.
The system prompt and instruction are baked into the model's `chat_template.jinja` and are **not configurable** — the model was trained with one fixed prompt and only that prompt produces calibrated scores. You only pass `(query, document)`; the rest is hardcoded.
```python
import torch
from sentence_transformers import CrossEncoder
MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B" # or any sibling repo above
ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.bfloat16})
# 1) Score (q, d) pairs. The default activation is Sigmoid, so scores are in (0, 1)
# and equal to s(q, d) = sigmoid(logit_yes - logit_no) — identical to path A above.
pairs = [(QUERY, doc) for doc in DOCUMENTS]
scores = ce.predict(pairs)
print(scores)
# array([0.98, 0.01], dtype=float32)
# 2) Rank documents directly.
ranked = ce.rank(QUERY, DOCUMENTS, return_documents=True)
for r in ranked:
    print(f"{r['score']:.3f}\t{r['corpus_id']}\t{r['text'][:80]}")
```
To get raw logit differences instead of [0, 1] probabilities, pass `activation_fn=torch.nn.Identity()` to `ce.predict(...)`.
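For example, with the `pairs` from above:
```python
raw = ce.predict(pairs, activation_fn=torch.nn.Identity())
print(torch.sigmoid(torch.tensor(raw)))  # recovers the (0, 1) scores above
```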
#### A note on numerical parity with path A
In **fp32**, paths A and B produce the same score to within ~1e-6 (verified across all five checkpoints).
In **bf16** with the default batched call (`batch_size > 1`), CE scores can drift from path A by **~1–3%** for individual pairs. The cause is bf16 SDPA: when CrossEncoder pads shorter sequences to the longest in the batch, the bf16 attention numerics differ by a few ULPs vs running each pair alone, and the difference accumulates across layers before the final sigmoid. **Ranking order is unaffected.** If you need bit-for-bit parity with path A:
```python
# Option 1: keep bf16, disable batching
ce.predict(pairs, batch_size=1)
# Option 2: use fp32 (slower, larger memory)
ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.float32})
```
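To verify parity on your own hardware, you can score the same pairs through both paths (this assumes both the path-A `rerank` function and the CrossEncoder are loaded in the same session):
```python
ce_scores = ce.predict(pairs, batch_size=1)
for (q, d), s_b in zip(pairs, ce_scores):
    s_a = rerank(q, d, max_new_tokens=1)["score"]  # one token is enough for a score
    print(f"A={s_a:.6f}  B={s_b:.6f}  |diff|={abs(s_a - s_b):.2e}")
```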
## Notes on usage
- The first generated token is always `yes` or `no` — the score is well-defined even if you stop generation immediately (cheap mode: `max_new_tokens=1`). Generate further only when you also want contribution/evidence.
- Inputs longer than 10K tokens may degrade — truncate the document side first (see the sketch after this list).
- Greedy decoding is fine for ranking. For more diverse evidence rephrasings, sample with `temperature` in the 0.3–0.5 range.
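A minimal truncation sketch for the document side (the 9K budget is our choice, leaving headroom for the prompt template and query under the 10K training cap):
```python
MAX_DOC_TOKENS = 9_000  # assumed budget; tune for your prompt + query lengths

def truncate_doc(doc: str) -> str:
    ids = tokenizer.encode(doc, add_special_tokens=False)
    if len(ids) <= MAX_DOC_TOKENS:
        return doc
    return tokenizer.decode(ids[:MAX_DOC_TOKENS])
```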
## Citation
If you use Prism-Reranker in your research, please cite:
```bibtex
@misc{zhang2025prismreranker,
  title         = {Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval},
  author        = {Dun Zhang},
  year          = {2025},
  eprint        = {2604.23734},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2604.23734},
}
```
## Contact
Dun Zhang — `dunnzhang0@gmail.com` (independent researcher).