| license: mit | |
| language: | |
| - en | |
| - zh | |
| - multilingual | |
| pipeline_tag: text-ranking | |
| library_name: transformers | |
| tags: | |
| - reranker | |
| - retrieval | |
| - rag | |
| - agentic-search | |
| - qwen3.5 | |
| - sentence-transformers | |
| # Prism-Reranker | |
| **Beyond Relevance Scoring — Jointly Producing Contributions and Evidence for Agentic Retrieval.** | |
| Prism-Reranker is a reranker family. Where standard rerankers emit only a relevance score, it returns three things in a single generation call: a calibrated score, a one-sentence *contribution*, and a self-contained *evidence* passage extracted from the document. | |
| ## Released models | |
| Five checkpoints are released on the Hugging Face Hub. Four are fine-tuned from the **Qwen3.5** backbone; one (`-4B-exp`) is an experimental extension built on top of **Qwen3-Reranker-4B**, demonstrating that the same recipe transfers to an existing LLM-based reranker without losing ranking quality. | |
| | Model | Backbone | Parameters | Hugging Face | | |
| |---|---|---|---| | |
| | Prism-Qwen3.5-Reranker-0.8B | Qwen3.5 | 0.8B | [infgrad/Prism-Qwen3.5-Reranker-0.8B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-0.8B) | | |
| | Prism-Qwen3.5-Reranker-2B | Qwen3.5 | 2B | [infgrad/Prism-Qwen3.5-Reranker-2B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-2B) | | |
| | Prism-Qwen3.5-Reranker-4B | Qwen3.5 | 4B | [infgrad/Prism-Qwen3.5-Reranker-4B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-4B) | | |
| | Prism-Qwen3.5-Reranker-9B | Qwen3.5 | 9B | [infgrad/Prism-Qwen3.5-Reranker-9B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-9B) | | |
| | Prism-Qwen3-Reranker-4B-exp | Qwen3-Reranker-4B | 4B | [infgrad/Prism-Qwen3-Reranker-4B-exp](https://huggingface.co/infgrad/Prism-Qwen3-Reranker-4B-exp) | | |
| ## Why this model? | |
| In agentic / RAG pipelines, a relevance score is rarely the end goal. After deciding a document is relevant, the agent still has to read it, denoise it, and decide what to do next. Prism-Reranker folds that work into the reranker itself: | |
| - **Relevance score** — `s(q, d) = σ(ℓ_yes − ℓ_no) ∈ (0, 1)`. Calibrated, ranking-ready. | |
| - **`<contribution>`** — one sentence stating *every* core point the document contributes to the query. Useful for the agent to plan its next step without re-reading the doc. | |
| - **`<evidence>`** — a self-contained, faithfully-rephrased rewrite of the query-relevant content. Drops irrelevant background, preserves verbatim proper nouns / numbers / dates / code / URLs. You can feed `<evidence>` directly to a downstream LLM and skip the raw document — saving context tokens and removing web-noise. | |
| If the document is not relevant, the model outputs `no` and stops. No contribution/evidence is generated. | |
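The score formula can be checked numerically. The sketch below uses made-up logits (not real model outputs) to show that the sigmoid of the logit difference equals the `yes` probability renormalized over `{yes, no}` after a full-vocabulary softmax. `relevance_score` is an illustrative helper name, not part of any library.

```python
import math


def relevance_score(logit_yes: float, logit_no: float) -> float:
    """s(q, d) = sigmoid(logit_yes - logit_no)."""
    return 1.0 / (1.0 + math.exp(-(logit_yes - logit_no)))


# Hypothetical first-token logits over a toy 4-token vocabulary.
logits = {"yes": 3.2, "no": -0.7, "maybe": 0.1, "the": -1.5}

# Full-vocabulary softmax, then renormalize over {yes, no}.
z = sum(math.exp(v) for v in logits.values())
p_yes = math.exp(logits["yes"]) / z
p_no = math.exp(logits["no"]) / z
renormalized = p_yes / (p_yes + p_no)

score = relevance_score(logits["yes"], logits["no"])
print(round(score, 4), round(renormalized, 4))  # the two values agree
```

The other vocabulary entries cancel in the ratio, so the score depends only on the two logits.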
| ## Highlights | |
| - **Backbones**: Qwen3.5 series for the four main sizes, no architectural changes; one extension variant on top of Qwen3-Reranker-4B. | |
| - **Context length**: training data capped at **10K tokens** per example, covering most real-world documents. | |
| - **Multilingual**: Chinese / English primary; other languages supported but with less coverage. | |
| - **Keyword-query robust**: agents often emit keyword-style queries instead of well-formed questions. ~30% of training queries were rewritten by an LLM into keyword form, so the model handles both natural and keyword queries. | |
| - **Real-world data distribution**: in addition to open reranker datasets (MS MARCO, T2Ranking, MIRACL, …), training includes synthetic queries paired with real Tavily / Exa web-search results, matching what an actual agent sees at inference time. | |
| - **Length × score balanced**: training data was rebalanced so that document length is not a relevance shortcut. | |
| - **Training recipe**: distillation (point-wise MSE on a strong commercial reranker's scores) + SFT on `yes/no` + `<contribution>` + `<evidence>`, supervised by a 5-LLM-as-judge ensemble. | |
| ## Quickstart | |
| Two ways to call the model. Both produce the **same** relevance score `s(q, d) = σ(ℓ_yes − ℓ_no)`. Use **A** when you also want `<contribution>` / `<evidence>`. Use **B** when you only need a score and want a drop-in replacement for any other CrossEncoder reranker. | |
| We use one shared example throughout so you can compare the outputs side by side: | |
| ```python | |
| QUERY = "What is the boiling point of water at sea level?" | |
| DOCUMENTS = [ | |
|     "Water boils at 100 C (212 F) at standard atmospheric pressure (1 atm), " | |
|     "which corresponds to sea-level conditions.", | |
|     "Mount Everest is the highest mountain on Earth, with a peak elevation " | |
|     "of 8,848 meters above sea level.", | |
| ] | |
| ``` | |
| ### A. Transformers (full output: score + contribution + evidence) | |
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B" # or any sibling repo above | |
| SYSTEM_PROMPT = ( | |
|     "Judge whether the Document meets the requirements based on " | |
|     "the Query and the Instruct provided. " | |
| ) | |
| INSTRUCTION = ( | |
|     'Judge if the document is relevant to the query. Reply "yes" or "no".\n' | |
|     'On "yes", also emit:\n' | |
|     "<contribution>One sentence covering every core point the document " | |
|     "contributes to the query, without elaboration.</contribution>\n" | |
|     "<evidence>Self-contained rewrite of the query-relevant content. Rules:\n" | |
|     "- Faithful: rephrase only; add or infer nothing.\n" | |
|     "- Self-contained: evidence alone must fully answer the query.\n" | |
|     "- Concise: drop query-irrelevant background.\n" | |
|     "- Verbatim (no translation): proper nouns, terms, abbreviations, " | |
|     "numbers, dates, code, URLs.\n" | |
|     "- Output language: multilingual doc → query's language; else doc's language." | |
|     "</evidence>" | |
| ) | |
| PROMPT_TEMPLATE = ( | |
|     "<|im_start|>system\n{system}<|im_end|>\n" | |
|     "<|im_start|>user\n" | |
|     "<Instruct>: {instruction}\n" | |
|     "<Query>: {query}\n" | |
|     "<Document>: {doc}<|im_end|>\n" | |
|     "<|im_start|>assistant\n<think>\n\n</think>\n\n" | |
| ) | |
| def build_prompt(query: str, doc: str) -> str: | |
|     return PROMPT_TEMPLATE.format( | |
|         system=SYSTEM_PROMPT, instruction=INSTRUCTION, query=query, doc=doc | |
|     ) | |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     MODEL_PATH, | |
|     torch_dtype=torch.bfloat16, | |
|     device_map="cuda", | |
|     attn_implementation="sdpa", | |
| ).eval() | |
| yes_id = tokenizer.encode("yes", add_special_tokens=False)[0] | |
| no_id = tokenizer.encode("no", add_special_tokens=False)[0] | |
| @torch.no_grad() | |
| def rerank(query: str, doc: str, max_new_tokens: int = 512): | |
|     prompt = build_prompt(query, doc) | |
|     input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device) | |
|     # Fall back to EOS only when no pad token is defined (a pad id of 0 is valid). | |
|     pad_id = tokenizer.pad_token_id | |
|     if pad_id is None: | |
|         pad_id = tokenizer.eos_token_id | |
|     out = model.generate( | |
|         input_ids=input_ids, | |
|         max_new_tokens=max_new_tokens, | |
|         do_sample=False, | |
|         return_dict_in_generate=True, | |
|         output_scores=True, | |
|         pad_token_id=pad_id, | |
|     ) | |
|     # Relevance score = P(yes) renormalized over {yes, no} at the first generated | |
|     # token, which equals sigmoid(logit_yes - logit_no). | |
|     first_logprobs = torch.log_softmax(out.scores[0][0].float(), dim=-1) | |
|     yes_p = first_logprobs[yes_id].exp() | |
|     no_p = first_logprobs[no_id].exp() | |
|     score = (yes_p / (yes_p + no_p)).item() | |
|     # Decoded text holds yes/no plus <contribution>...</contribution><evidence>...</evidence> | |
|     gen_ids = out.sequences[0, input_ids.shape[1]:] | |
|     text = tokenizer.decode(gen_ids, skip_special_tokens=True) | |
|     return {"score": score, "text": text} | |
| for doc in DOCUMENTS: | |
|     print(rerank(QUERY, doc)) | |
| ``` | |
| Expected output, one dict per document (`...` stands for the generated text): | |
| ```text | |
| {"score": 0.98, "text": "yes\n<contribution>...</contribution>\n<evidence>...</evidence>"} | |
| {"score": 0.01, "text": "no"} | |
| ``` | |
| For irrelevant pairs the score is close to 0 and `text` is just `"no"`. | |
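The `text` field from path A is a plain string, so downstream code usually needs to split it into fields. Below is a minimal parsing sketch, assuming the output follows the `yes` / `<contribution>` / `<evidence>` layout described above. `parse_rerank_text` and `ParsedOutput` are hypothetical names introduced here, and the sample string is hand-written for illustration, not real model output.

```python
import re
from typing import Optional, TypedDict


class ParsedOutput(TypedDict):
    relevant: bool
    contribution: Optional[str]
    evidence: Optional[str]


# One non-greedy pattern per tag; DOTALL lets evidence span several lines.
_TAG_PATTERNS = {
    name: re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    for name in ("contribution", "evidence")
}


def parse_rerank_text(text: str) -> ParsedOutput:
    """Split generated text into a yes/no verdict plus the two tagged fields."""
    stripped = text.strip()
    relevant = stripped.lower().startswith("yes")
    fields = {}
    for name, pattern in _TAG_PATTERNS.items():
        match = pattern.search(stripped)
        # A missing closing tag (e.g. generation cut off) yields None.
        fields[name] = match.group(1).strip() if (relevant and match) else None
    return {
        "relevant": relevant,
        "contribution": fields["contribution"],
        "evidence": fields["evidence"],
    }


# Hand-written sample in the documented layout (not real model output).
sample = (
    "yes\n"
    "<contribution>States the boiling point of water at sea level.</contribution>\n"
    "<evidence>Water boils at 100 C (212 F) at 1 atm.</evidence>"
)
parsed = parse_rerank_text(sample)
print(parsed["relevant"], parsed["evidence"])
```

If `max_new_tokens` is too small, the closing `</evidence>` tag may be absent. This sketch then returns `None` for that field instead of a partial string.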
| ### B. Sentence Transformers CrossEncoder (score only) | |
| If you only need the score and want a drop-in CrossEncoder, the same model works directly with `sentence-transformers >= 5.4.0`. **Note:** in this mode `<contribution>` and `<evidence>` are not produced — only the calibrated relevance score. | |
| The system prompt and instruction are baked into the model's `chat_template.jinja` and are **not configurable** — the model was trained with one fixed prompt and only that prompt produces calibrated scores. You only pass `(query, document)`; the rest is hardcoded. | |
| ```python | |
| import torch | |
| from sentence_transformers import CrossEncoder | |
| MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B" # or any sibling repo above | |
| ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.bfloat16}) | |
| # 1) Score (q, d) pairs. The default activation is Sigmoid, so scores are in (0, 1) | |
| # and equal to s(q, d) = sigmoid(logit_yes - logit_no) — identical to path A above. | |
| pairs = [(QUERY, doc) for doc in DOCUMENTS] | |
| scores = ce.predict(pairs) | |
| print(scores) | |
| # array([0.98, 0.01], dtype=float32) | |
| # 2) Rank documents directly. | |
| ranked = ce.rank(QUERY, DOCUMENTS, return_documents=True) | |
| for r in ranked: | |
|     print(f"{r['score']:.3f}\t{r['corpus_id']}\t{r['text'][:80]}") | |
| ``` | |
| To get raw logit differences instead of probabilities in (0, 1), pass `activation_fn=torch.nn.Identity()` to `ce.predict(...)`. | |
| #### A note on numerical parity with path A | |
| In **fp32**, paths A and B produce the same score to within ~1e-6 (verified across all five checkpoints). | |
| In **bf16** with the default batched call (`batch_size > 1`), CrossEncoder scores can drift from path A by **~1–3%** for individual pairs. The cause is bf16 SDPA: when CrossEncoder pads shorter sequences to the longest in the batch, the bf16 attention numerics differ by a few ULPs from running each pair alone, and the difference accumulates across layers before the final sigmoid. **Ranking order is unaffected.** If you need scores that match path A: | |
| ```python | |
| # Option 1: keep bf16, disable batching | |
| ce.predict(pairs, batch_size=1) | |
| # Option 2: use fp32 (slower, larger memory) | |
| ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.float32}) | |
| ``` | |
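To see how far two scoring paths diverge on your own data, you can compare the score lists directly. The helper below is a small sketch: `compare_scores` is a hypothetical name, and the numbers are made up to illustrate the size of drift described above. They are not measured results.

```python
from typing import Dict, Sequence, Union


def compare_scores(
    a: Sequence[float], b: Sequence[float]
) -> Dict[str, Union[float, bool]]:
    """Report the largest per-pair gap and whether both lists rank identically."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    max_abs_diff = max(abs(x - y) for x, y in zip(a, b))
    # Indices sorted by descending score give the ranking implied by each list.
    order_a = sorted(range(len(a)), key=lambda i: a[i], reverse=True)
    order_b = sorted(range(len(b)), key=lambda i: b[i], reverse=True)
    return {"max_abs_diff": max_abs_diff, "same_ranking": order_a == order_b}


# Made-up scores standing in for path A and a batched bf16 CrossEncoder call.
path_a_scores = [0.98, 0.01, 0.55]
path_b_scores = [0.97, 0.012, 0.56]
report = compare_scores(path_a_scores, path_b_scores)
print(report)
```

In practice you would pass the `score` values from path A and the output of `ce.predict(pairs).tolist()` from path B.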
| ## Notes on usage | |
| - The first generated token is always `yes` or `no` — the score is well-defined even if you stop generation immediately (cheap mode: `max_new_tokens=1`). Generate further only when you also want contribution/evidence. | |
| - Inputs longer than 10K tokens may degrade — truncate the document side first. | |
| - Greedy decoding is fine for ranking. For more diverse evidence rephrasings, sample with `do_sample=True` and a `temperature` between 0.3 and 0.5. | |
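One way to apply the truncation advice above is to budget tokens before building the prompt. This sketch works on plain lists of token ids, so it runs without a model. `truncate_document_ids` is a hypothetical helper, and `reserved=256` is an assumed allowance for the system prompt, instruction, and template tokens. Measure the real overhead with your tokenizer.

```python
from typing import List


def truncate_document_ids(
    query_ids: List[int],
    doc_ids: List[int],
    max_total: int = 10_000,  # training-time cap mentioned in this card
    reserved: int = 256,      # assumed prompt overhead; not a measured value
) -> List[int]:
    """Keep the query intact and cut only the document to fit the budget."""
    budget = max_total - reserved - len(query_ids)
    if budget <= 0:
        raise ValueError("query and reserved tokens already exceed the limit")
    return doc_ids[:budget]


# Stand-in token ids; real ids would come from the model's tokenizer.
query_ids = list(range(20))
long_doc_ids = list(range(12_000))
short_doc_ids = list(range(300))

kept = truncate_document_ids(query_ids, long_doc_ids)
print(len(kept))  # 9724
print(len(truncate_document_ids(query_ids, short_doc_ids)))  # 300
```

With a real tokenizer, the ids would come from `tokenizer(text, add_special_tokens=False).input_ids`, and the truncated ids can be turned back into text with `tokenizer.decode(...)` before calling `build_prompt`.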
| ## Citation | |
| If you use Prism-Reranker in your research, please cite: | |
| ```bibtex | |
| @misc{zhang2025prismreranker, | |
| title = {Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval}, | |
| author = {Dun Zhang}, | |
| year = {2025}, | |
| eprint = {2604.23734}, | |
| archivePrefix = {arXiv}, | |
| primaryClass = {cs.IR}, | |
| url = {https://arxiv.org/abs/2604.23734}, | |
| } | |
| ``` | |
| ## Contact | |
| Dun Zhang — `dunnzhang0@gmail.com` (independent researcher). |