---
license: mit
language:
- en
- zh
- multilingual
pipeline_tag: text-ranking
library_name: transformers
tags:
- reranker
- retrieval
- rag
- agentic-search
- qwen3.5
- sentence-transformers
---

# Prism-Reranker

**Beyond Relevance Scoring — Jointly Producing Contributions and Evidence for Agentic Retrieval.**

A reranker family that, unlike standard rerankers that emit only a relevance score, returns three things in a single forward pass: a calibrated score, a one-sentence *contribution*, and a self-contained *evidence* passage extracted from the document.

![Model Architecture](./model_architecture.png)

## Released models

Five checkpoints are released on the Hugging Face Hub. Four are fine-tuned from the **Qwen3.5** backbone; one (`-4B-exp`) is an experimental extension built on top of **Qwen3-Reranker-4B**, demonstrating that the same recipe transfers to an existing LLM-based reranker without losing ranking quality.

| Model | Backbone | Parameters | Hugging Face |
|---|---|---|---|
| Prism-Qwen3.5-Reranker-0.8B | Qwen3.5 | 0.8B | [infgrad/Prism-Qwen3.5-Reranker-0.8B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-0.8B) |
| Prism-Qwen3.5-Reranker-2B | Qwen3.5 | 2B | [infgrad/Prism-Qwen3.5-Reranker-2B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-2B) |
| Prism-Qwen3.5-Reranker-4B | Qwen3.5 | 4B | [infgrad/Prism-Qwen3.5-Reranker-4B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-4B) |
| Prism-Qwen3.5-Reranker-9B | Qwen3.5 | 9B | [infgrad/Prism-Qwen3.5-Reranker-9B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-9B) |
| Prism-Qwen3-Reranker-4B-exp | Qwen3-Reranker-4B | 4B | [infgrad/Prism-Qwen3-Reranker-4B-exp](https://huggingface.co/infgrad/Prism-Qwen3-Reranker-4B-exp) |

## Why this model?

In agentic / RAG pipelines, a relevance score is rarely the end goal. After deciding a document is relevant, the agent still has to read it, denoise it, and decide what to do next. Prism-Reranker folds that work into the reranker itself:

- **Relevance score** — `s(q, d) = σ(ℓ_yes − ℓ_no) ∈ (0, 1)`. Calibrated, ranking-ready.
- **`<contribution>`** — one sentence stating *every* core point the document contributes to the query. Useful for the agent to plan its next step without re-reading the doc.
- **`<evidence>`** — a self-contained, faithfully rephrased rewrite of the query-relevant content. Drops irrelevant background; preserves proper nouns / numbers / dates / code / URLs verbatim. You can feed `<evidence>` directly to a downstream LLM and skip the raw document — saving context tokens and removing web noise.

If the document is not relevant, the model outputs `no` and stops; no contribution or evidence is generated.
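To make the scoring rule concrete, here is a tiny numeric sketch of `s(q, d) = σ(ℓ_yes − ℓ_no)`; the two logits are made-up illustrative values, not real model outputs:

```python
import math

# Hypothetical first-token logits for "yes" and "no" (illustration only).
logit_yes, logit_no = 4.2, -1.3

# s(q, d) = sigmoid(logit_yes - logit_no)
score = 1.0 / (1.0 + math.exp(-(logit_yes - logit_no)))
print(f"{score:.4f}")  # ~0.9959 -> strongly relevant
```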
## Highlights

- **Backbones**: Qwen3.5 series for the four main sizes, with no architectural changes; one extension variant on top of Qwen3-Reranker-4B.
- **Context length**: training data capped at **10K tokens** per example, covering most real-world documents.
- **Multilingual**: Chinese / English primary; other languages supported but with less coverage.
- **Keyword-query robust**: agents often emit keyword-style queries instead of well-formed questions. ~30% of training queries were rewritten by an LLM into keyword form, so the model handles both natural and keyword queries.
- **Real-world data distribution**: in addition to open reranker datasets (MS MARCO, T2Ranking, MIRACL, …), training includes synthetic queries paired with real Tavily / Exa web-search results, matching what an actual agent sees at inference time.
- **Length × score balanced**: training data was rebalanced so that document length is not a relevance shortcut.
- **Training recipe**: distillation (point-wise MSE on a strong commercial reranker's scores) + SFT on `yes/no` + `<contribution>` + `<evidence>`, supervised by a 5-LLM-as-judge ensemble.

## Quickstart

Two ways to call the model. Both produce the **same** relevance score `s(q, d) = σ(ℓ_yes − ℓ_no)`. Use **A** when you also want `<contribution>` / `<evidence>`. Use **B** when you only need a score and want a drop-in replacement for any other CrossEncoder reranker.

We use one shared example throughout so you can compare the outputs side by side:

```python
QUERY = "What is the boiling point of water at sea level?"

DOCUMENTS = [
    "Water boils at 100 C (212 F) at standard atmospheric pressure (1 atm), "
    "which corresponds to sea-level conditions.",
    "Mount Everest is the highest mountain on Earth, with a peak elevation "
    "of 8,848 meters above sea level.",
]
```

### A. Transformers (full output: score + contribution + evidence)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above

SYSTEM_PROMPT = (
    "Judge whether the Document meets the requirements based on "
    "the Query and the Instruct provided. "
)

INSTRUCTION = (
    'Judge if the document is relevant to the query. Reply "yes" or "no".\n'
    'On "yes", also emit:\n'
    "<contribution>: One sentence covering every core point the document "
    "contributes to the query, without elaboration.\n"
    "<evidence>: Self-contained rewrite of the query-relevant content. Rules:\n"
    "- Faithful: rephrase only; add or infer nothing.\n"
    "- Self-contained: evidence alone must fully answer the query.\n"
    "- Concise: drop query-irrelevant background.\n"
    "- Verbatim (no translation): proper nouns, terms, abbreviations, "
    "numbers, dates, code, URLs.\n"
    "- Output language: multilingual doc → query's language; else doc's language."
)

PROMPT_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n"
    "<Instruct>: {instruction}\n"
    "<Query>: {query}\n"
    "<Document>: {doc}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
)


def build_prompt(query: str, doc: str) -> str:
    return PROMPT_TEMPLATE.format(
        system=SYSTEM_PROMPT, instruction=INSTRUCTION, query=query, doc=doc
    )


tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
).eval()

yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("no", add_special_tokens=False)[0]


@torch.no_grad()
def rerank(query: str, doc: str, max_new_tokens: int = 512):
    prompt = build_prompt(query, doc)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

    # Relevance score = softmax over {yes, no} at the first generated token.
    first_logprobs = torch.log_softmax(out.scores[0][0].float(), dim=-1)
    yes_p = first_logprobs[yes_id].exp()
    no_p = first_logprobs[no_id].exp()
    score = (yes_p / (yes_p + no_p)).item()

    # Decoded text holds yes/no plus <contribution>...</contribution><evidence>...</evidence>.
    gen_ids = out.sequences[0, input_ids.shape[1]:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
    return {"score": score, "text": text}


for doc in DOCUMENTS:
    print(rerank(QUERY, doc))
```

Expected output (one dict per document):

```text
{"score": 0.98, "text": "yes\n<contribution>...</contribution>\n<evidence>...</evidence>"}
{"score": 0.01, "text": "no"}
```

For irrelevant pairs the score is close to 0 and `text` is just `"no"`.
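If you want the contribution and evidence as separate fields rather than one text blob, a small helper along these lines works. This is a minimal sketch: `parse_prism_output` is a hypothetical name, and the regexes assume the tagged layout shown in the expected output above, so adjust them if your checkpoint emits a different format.

```python
import re


def parse_prism_output(text: str) -> dict:
    # Assumes a "yes"/"no" verdict on the first line followed by
    # <contribution>...</contribution> and <evidence>...</evidence> blocks;
    # fields that are absent (e.g. on "no") come back as None.
    contribution = re.search(r"<contribution>(.*?)</contribution>", text, re.S)
    evidence = re.search(r"<evidence>(.*?)</evidence>", text, re.S)
    return {
        "relevant": text.lstrip().lower().startswith("yes"),
        "contribution": contribution.group(1).strip() if contribution else None,
        "evidence": evidence.group(1).strip() if evidence else None,
    }


result = rerank(QUERY, DOCUMENTS[0])
print(parse_prism_output(result["text"]))
```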
### B. Sentence Transformers CrossEncoder (score only)

If you only need the score and want a drop-in CrossEncoder, the same model works directly with `sentence-transformers >= 5.4.0`.

**Note:** in this mode `<contribution>` and `<evidence>` are not produced — only the calibrated relevance score.

The system prompt and instruction are baked into the model's `chat_template.jinja` and are **not configurable** — the model was trained with one fixed prompt, and only that prompt produces calibrated scores. You only pass `(query, document)`; the rest is hardcoded.

```python
import torch
from sentence_transformers import CrossEncoder

MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above

ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.bfloat16})

# 1) Score (q, d) pairs. The default activation is Sigmoid, so scores are in (0, 1)
#    and equal to s(q, d) = sigmoid(logit_yes - logit_no) — identical to path A above.
pairs = [(QUERY, doc) for doc in DOCUMENTS]
scores = ce.predict(pairs)
print(scores)
# array([0.98, 0.01], dtype=float32)

# 2) Rank documents directly.
ranked = ce.rank(QUERY, DOCUMENTS, return_documents=True)
for r in ranked:
    print(f"{r['score']:.3f}\t{r['corpus_id']}\t{r['text'][:80]}")
```

To get raw logit differences instead of [0, 1] probabilities, pass `activation_fn=torch.nn.Identity()` to `ce.predict(...)`.

#### A note on numerical parity with path A

In **fp32**, paths A and B produce the same score to within ~1e-6 (verified across all five checkpoints). In **bf16** with the default batched call (`batch_size > 1`), CE scores can drift from path A by **~1–3%** for individual pairs. The cause is bf16 SDPA: when CrossEncoder pads shorter sequences to the longest in the batch, the bf16 attention numerics differ by a few ULPs compared with running each pair alone, and the difference accumulates across layers before the final sigmoid. **Ranking order is unaffected.**

If you need bit-for-bit parity with path A:

```python
# Option 1: keep bf16, disable batching
ce.predict(pairs, batch_size=1)

# Option 2: use fp32 (slower, larger memory)
ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.float32})
```

## Notes on usage

- The first generated token is always `yes` or `no` — the score is well-defined even if you stop generation immediately (cheap mode: `max_new_tokens=1`). Generate further only when you also want contribution/evidence.
- Inputs longer than 10K tokens may degrade — truncate the document side first.
- Greedy decoding is fine for ranking. For diverse evidence rephrasings, sample with `temperature` in the 0.3–0.5 range.

## Citation

If you use Prism-Reranker in your research, please cite:

```bibtex
@misc{zhang2025prismreranker,
  title         = {Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval},
  author        = {Dun Zhang},
  year          = {2025},
  eprint        = {2604.23734},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2604.23734},
}
```

## Contact

Dun Zhang — `dunnzhang0@gmail.com` (independent researcher).