---
license: mit
language:
- en
- zh
- multilingual
pipeline_tag: text-ranking
library_name: transformers
tags:
- reranker
- retrieval
- rag
- agentic-search
- qwen3.5
- sentence-transformers
---

# Prism-Reranker

**Beyond Relevance Scoring — Jointly Producing Contributions and Evidence for Agentic Retrieval.**

A reranker family that, unlike standard rerankers that emit only a relevance score, returns three things in a single forward pass: a calibrated score, a one-sentence *contribution*, and a self-contained *evidence* passage extracted from the document.

![Model Architecture](./model_architecture.png)

## Released models

Five checkpoints are released on the Hugging Face Hub. Four are fine-tuned from the **Qwen3.5** backbone; one (`-4B-exp`) is an experimental extension built on top of **Qwen3-Reranker-4B**, demonstrating that the same recipe transfers to an existing LLM-based reranker without losing ranking quality.

| Model | Backbone | Parameters | Hugging Face |
|---|---|---|---|
| Prism-Qwen3.5-Reranker-0.8B | Qwen3.5 | 0.8B | [infgrad/Prism-Qwen3.5-Reranker-0.8B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-0.8B) |
| Prism-Qwen3.5-Reranker-2B   | Qwen3.5 | 2B   | [infgrad/Prism-Qwen3.5-Reranker-2B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-2B) |
| Prism-Qwen3.5-Reranker-4B   | Qwen3.5 | 4B   | [infgrad/Prism-Qwen3.5-Reranker-4B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-4B) |
| Prism-Qwen3.5-Reranker-9B   | Qwen3.5 | 9B   | [infgrad/Prism-Qwen3.5-Reranker-9B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-9B) |
| Prism-Qwen3-Reranker-4B-exp | Qwen3-Reranker-4B | 4B | [infgrad/Prism-Qwen3-Reranker-4B-exp](https://huggingface.co/infgrad/Prism-Qwen3-Reranker-4B-exp) |



## Why this model?

In agentic / RAG pipelines, a relevance score is rarely the end goal. After deciding a document is relevant, the agent still has to read it, denoise it, and decide what to do next. Prism-Reranker folds that work into the reranker itself:

- **Relevance score** — `s(q, d) = σ(ℓ_yes − ℓ_no) ∈ (0, 1)`. Calibrated, ranking-ready.
- **`<contribution>`** — one sentence stating *every* core point the document contributes to the query. Useful for the agent to plan its next step without re-reading the doc.
- **`<evidence>`** — a self-contained, faithfully-rephrased rewrite of the query-relevant content. Drops irrelevant background, preserves verbatim proper nouns / numbers / dates / code / URLs. You can feed `<evidence>` directly to a downstream LLM and skip the raw document — saving context tokens and removing web-noise.

If the document is not relevant, the model outputs `no` and stops. No contribution/evidence is generated.
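
The score definition above can be sanity-checked with plain arithmetic: `σ(ℓ_yes − ℓ_no)` is exactly the probability mass on `yes` renormalized over `{yes, no}`, which is why the Quickstart code below computes it from two softmax probabilities. A quick illustrative check (these helpers are not part of the released code):

```python
import math


def score_from_logits(logit_yes: float, logit_no: float) -> float:
    """s(q, d) = sigmoid(logit_yes - logit_no)."""
    return 1.0 / (1.0 + math.exp(logit_no - logit_yes))


def score_from_probs(logit_yes: float, logit_no: float) -> float:
    """Equivalent formulation: renormalize the two token probabilities."""
    p_yes, p_no = math.exp(logit_yes), math.exp(logit_no)
    return p_yes / (p_yes + p_no)


# The two formulations agree for any pair of logits.
assert abs(score_from_logits(3.2, -1.7) - score_from_probs(3.2, -1.7)) < 1e-12
```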

## Highlights

- **Backbones**: Qwen3.5 series for the four main sizes, no architectural changes; one extension variant on top of Qwen3-Reranker-4B.
- **Context length**: training data capped at **10K tokens** per example, covering most real-world documents.
- **Multilingual**: Chinese / English primary; other languages supported but with less coverage.
- **Keyword-query robust**: agents often emit keyword-style queries instead of well-formed questions. ~30% of training queries were rewritten by an LLM into keyword form, so the model handles both natural and keyword queries.
- **Real-world data distribution**: in addition to open reranker datasets (MS MARCO, T2Ranking, MIRACL, …), training includes synthetic queries paired with real Tavily / Exa web-search results, matching what an actual agent sees at inference time.
- **Length × score balanced**: training data was rebalanced so that document length is not a relevance shortcut.
- **Training recipe**: distillation (point-wise MSE on a strong commercial reranker's scores) + SFT on `yes/no` + `<contribution>` + `<evidence>`, supervised by a 5-LLM-as-judge ensemble.
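
The distillation half of that recipe can be sketched as a point-wise MSE between the student's calibrated score and the teacher's score. This is a minimal illustration only (the training code is not released, and the helper name here is hypothetical):

```python
import math


def pointwise_distill_loss(student_logit_diffs: list[float],
                           teacher_scores: list[float]) -> float:
    """MSE between sigmoid(l_yes - l_no) and the teacher's score in (0, 1)."""
    preds = [1.0 / (1.0 + math.exp(-d)) for d in student_logit_diffs]
    return sum((p - t) ** 2 for p, t in zip(preds, teacher_scores)) / len(preds)


# Two (query, doc) pairs: one the teacher scored high, one low.
loss = pointwise_distill_loss([4.0, -3.0], [0.95, 0.05])
```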

## Quickstart

Two ways to call the model. Both produce the **same** relevance score `s(q, d) = σ(ℓ_yes − ℓ_no)`. Use **A** when you also want `<contribution>` / `<evidence>`. Use **B** when you only need a score and want a drop-in replacement for any other CrossEncoder reranker.

We use one shared example throughout so you can compare the outputs side by side:

```python
QUERY = "What is the boiling point of water at sea level?"
DOCUMENTS = [
    "Water boils at 100 °C (212 °F) at standard atmospheric pressure (1 atm), "
    "which corresponds to sea-level conditions.",
    "Mount Everest is the highest mountain on Earth, with a peak elevation "
    "of 8,848 meters above sea level.",
]
```

### A. Transformers (full output: score + contribution + evidence)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above

SYSTEM_PROMPT = (
    "Judge whether the Document meets the requirements based on "
    "the Query and the Instruct provided. "
)

INSTRUCTION = (
    'Judge if the document is relevant to the query. Reply "yes" or "no".\n'
    'On "yes", also emit:\n'
    "<contribution>One sentence covering every core point the document "
    "contributes to the query, without elaboration.</contribution>\n"
    "<evidence>Self-contained rewrite of the query-relevant content. Rules:\n"
    "- Faithful: rephrase only; add or infer nothing.\n"
    "- Self-contained: evidence alone must fully answer the query.\n"
    "- Concise: drop query-irrelevant background.\n"
    "- Verbatim (no translation): proper nouns, terms, abbreviations, "
    "numbers, dates, code, URLs.\n"
    "- Output language: multilingual doc → query's language; else doc's language."
    "</evidence>"
)

PROMPT_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n"
    "<Instruct>: {instruction}\n"
    "<Query>: {query}\n"
    "<Document>: {doc}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
)


def build_prompt(query: str, doc: str) -> str:
    return PROMPT_TEMPLATE.format(
        system=SYSTEM_PROMPT, instruction=INSTRUCTION, query=query, doc=doc
    )


tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
).eval()

yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("no", add_special_tokens=False)[0]


@torch.no_grad()
def rerank(query: str, doc: str, max_new_tokens: int = 512):
    prompt = build_prompt(query, doc)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    out = model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

    # Relevance score = softmax over {yes, no} at the first generated token.
    first_logprobs = torch.log_softmax(out.scores[0][0].float(), dim=-1)
    yes_p = first_logprobs[yes_id].exp()
    no_p = first_logprobs[no_id].exp()
    score = (yes_p / (yes_p + no_p)).item()

    # Decoded text holds yes/no plus <contribution>...</contribution><evidence>...</evidence>
    gen_ids = out.sequences[0, input_ids.shape[1]:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
    return {"score": score, "text": text}


for doc in DOCUMENTS:
    print(rerank(QUERY, doc))
```

Expected output (one dict per document):

```text
{"score": 0.98, "text": "yes\n<contribution>...</contribution>\n<evidence>...</evidence>"}
{"score": 0.01, "text": "no"}
```

For irrelevant pairs the score is close to 0 and `text` is just `"no"`.
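
The returned `text` can be split into its parts with a small parser. This is a sketch (the tag names come from the output format above; the function name is illustrative):

```python
import re


def parse_output(text: str) -> dict:
    """Split the reranker's generation into verdict, contribution, evidence."""
    verdict = text.strip().split("\n", 1)[0].strip()
    contribution = re.search(r"<contribution>(.*?)</contribution>", text, re.S)
    evidence = re.search(r"<evidence>(.*?)</evidence>", text, re.S)
    return {
        "relevant": verdict.startswith("yes"),
        "contribution": contribution.group(1).strip() if contribution else None,
        "evidence": evidence.group(1).strip() if evidence else None,
    }


out = parse_output(
    "yes\n<contribution>Water boils at 100 °C at sea level.</contribution>\n"
    "<evidence>At 1 atm (sea-level pressure), water boils at 100 °C (212 °F)."
    "</evidence>"
)
```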

### B. Sentence Transformers CrossEncoder (score only)

If you only need the score and want a drop-in CrossEncoder, the same model works directly with `sentence-transformers >= 5.4.0`. **Note:** in this mode `<contribution>` and `<evidence>` are not produced — only the calibrated relevance score.

The system prompt and instruction are baked into the model's `chat_template.jinja` and are **not configurable** — the model was trained with one fixed prompt and only that prompt produces calibrated scores. You only pass `(query, document)`; the rest is hardcoded.

```python
import torch
from sentence_transformers import CrossEncoder

MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above

ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.bfloat16})

# 1) Score (q, d) pairs. The default activation is Sigmoid, so scores are in (0, 1)
# and equal to s(q, d) = sigmoid(logit_yes - logit_no) — identical to path A above.
pairs = [(QUERY, doc) for doc in DOCUMENTS]
scores = ce.predict(pairs)
print(scores)
# array([0.98, 0.01], dtype=float32)

# 2) Rank documents directly.
ranked = ce.rank(QUERY, DOCUMENTS, return_documents=True)
for r in ranked:
    print(f"{r['score']:.3f}\t{r['corpus_id']}\t{r['text'][:80]}")
```

To get raw logit differences instead of [0, 1] probabilities, pass `activation_fn=torch.nn.Identity()` to `ce.predict(...)`.
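
Conversely, if you already have sigmoid scores, the raw logit difference can be recovered with the inverse transform (plain math, independent of the library):

```python
import math


def score_to_logit_diff(score: float) -> float:
    """Inverse sigmoid: recovers logit_yes - logit_no from s in (0, 1)."""
    return math.log(score / (1.0 - score))


# Round-trips with the sigmoid used for scoring.
x = 2.5
s = 1.0 / (1.0 + math.exp(-x))
assert abs(score_to_logit_diff(s) - x) < 1e-9
```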

#### A note on numerical parity with path A

In **fp32**, paths A and B produce the same score to within ~1e-6 (verified across all five checkpoints).

In **bf16** with the default batched call (`batch_size > 1`), CE scores can drift from path A by **~1–3%** for individual pairs. The cause is bf16 SDPA: when CrossEncoder pads shorter sequences to the longest in the batch, the bf16 attention numerics differ by a few ULPs vs running each pair alone, and the difference accumulates across layers before the final sigmoid. **Ranking order is unaffected.** If you need bit-for-bit parity with path A:

```python
# Option 1: keep bf16, disable batching
ce.predict(pairs, batch_size=1)

# Option 2: use fp32 (slower, larger memory)
ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.float32})
```


## Notes on usage

- The first generated token is always `yes` or `no` — the score is well-defined even if you stop generation immediately (cheap mode: `max_new_tokens=1`). Generate further only when you also want contribution/evidence.
- Inputs longer than 10K tokens may degrade — truncate the document side first.
- Greedy decoding is fine for ranking. For more diverse evidence rephrasings, sample with `temperature` in the 0.3–0.5 range.



## Citation

If you use Prism-Reranker in your research, please cite:

```bibtex
@misc{zhang2025prismreranker,
  title  = {Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval},
  author = {Dun Zhang},
  year   = {2025},
  eprint = {2604.23734},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url    = {https://arxiv.org/abs/2604.23734},
}
```

## Contact

Dun Zhang — `dunnzhang0@gmail.com` (independent researcher).