infgrad committed
Commit 2e31a63 · verified · 1 Parent(s): 096d13f

Upload 2 files
Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +173 -3
  3. model_architecture.png +3 -0
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ model_architecture.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,173 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - en
+ - zh
+ - multilingual
+ pipeline_tag: text-ranking
+ library_name: transformers
+ tags:
+ - reranker
+ - retrieval
+ - rag
+ - agentic-search
+ - qwen3.5
+ ---
+
+ # Prism-Reranker
+
+ **Beyond Relevance Scoring — Jointly Producing Contributions and Evidence for Agentic Retrieval.**
+
+ Unlike standard rerankers, which emit only a relevance score, the Prism-Reranker family returns three things from a single generation pass: a calibrated relevance score, a one-sentence *contribution*, and a self-contained *evidence* passage extracted from the document.
+
+ ![Model Architecture](./model_architecture.png)
+
+ ## Released models
+
+ Five checkpoints are released on the Hugging Face Hub. Four are fine-tuned from the **Qwen3.5** backbone; one (`-4B-exp`) is an experimental extension built on top of **Qwen3-Reranker-4B**, demonstrating that the same recipe transfers to an existing LLM-based reranker without losing ranking quality.
+
+ | Model | Backbone | Parameters | Hugging Face |
+ |---|---|---|---|
+ | Prism-Qwen3.5-Reranker-0.8B | Qwen3.5 | 0.8B | [infgrad/Prism-Qwen3.5-Reranker-0.8B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-0.8B) |
+ | Prism-Qwen3.5-Reranker-2B | Qwen3.5 | 2B | [infgrad/Prism-Qwen3.5-Reranker-2B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-2B) |
+ | Prism-Qwen3.5-Reranker-4B | Qwen3.5 | 4B | [infgrad/Prism-Qwen3.5-Reranker-4B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-4B) |
+ | Prism-Qwen3.5-Reranker-9B | Qwen3.5 | 9B | [infgrad/Prism-Qwen3.5-Reranker-9B](https://huggingface.co/infgrad/Prism-Qwen3.5-Reranker-9B) |
+ | Prism-Qwen3-Reranker-4B-exp | Qwen3-Reranker-4B | 4B | [infgrad/Prism-Qwen3-Reranker-4B-exp](https://huggingface.co/infgrad/Prism-Qwen3-Reranker-4B-exp) |
+
+ ## Why this model?
+
+ In agentic / RAG pipelines, a relevance score is rarely the end goal. After deciding a document is relevant, the agent still has to read it, denoise it, and decide what to do next. Prism-Reranker folds that work into the reranker itself:
+
+ - **Relevance score** — `s(q, d) = σ(ℓ_yes − ℓ_no) ∈ (0, 1)`. Calibrated, ranking-ready (see the sketch after this list).
+ - **`<contribution>`** — one sentence stating *every* core point the document contributes to the query. Useful for the agent to plan its next step without re-reading the document.
+ - **`<evidence>`** — a self-contained, faithfully rephrased rewrite of the query-relevant content. It drops irrelevant background and preserves proper nouns, numbers, dates, code, and URLs verbatim. You can feed `<evidence>` directly to a downstream LLM and skip the raw document, saving context tokens and removing web noise.
+
+ If the document is not relevant, the model outputs `no` and stops; no contribution or evidence is generated.
+
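+ As a sanity check on the formula above, `σ(ℓ_yes − ℓ_no)` is exactly the two-way softmax over the `yes`/`no` logits that the Quickstart below computes. A minimal sketch with made-up logit values:
+
+ ```python
+ import torch
+
+ # Hypothetical first-token logits for "yes" and "no" (illustrative values only).
+ l_yes, l_no = torch.tensor(3.1), torch.tensor(-0.7)
+
+ # sigma(l_yes - l_no) equals the probability softmax over {yes, no} assigns to "yes".
+ score_sigmoid = torch.sigmoid(l_yes - l_no)
+ score_softmax = torch.softmax(torch.stack([l_yes, l_no]), dim=-1)[0]
+ assert torch.allclose(score_sigmoid, score_softmax)
+ print(score_sigmoid.item())  # ~0.978, a calibrated relevance score in (0, 1)
+ ```
+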
+ ## Highlights
+
+ - **Backbones**: Qwen3.5 series for the four main sizes, with no architectural changes; one extension variant on top of Qwen3-Reranker-4B.
+ - **Context length**: training data was capped at **10K tokens** per example, covering most real-world documents.
+ - **Multilingual**: Chinese / English primary; other languages are supported but with less coverage.
+ - **Keyword-query robust**: agents often emit keyword-style queries instead of well-formed questions. ~30% of training queries were rewritten by an LLM into keyword form, so the model handles both natural and keyword queries.
+ - **Real-world data distribution**: in addition to open reranker datasets (MS MARCO, T2Ranking, MIRACL, …), training includes synthetic queries paired with real Tavily / Exa web-search results, matching what an actual agent sees at inference time.
+ - **Length × score balanced**: training data was rebalanced so that document length is not a relevance shortcut.
+ - **Training recipe**: distillation (point-wise MSE on a strong commercial reranker's scores) + SFT on `yes/no` + `<contribution>` + `<evidence>`, supervised by a 5-LLM-as-judge ensemble (see the sketch after this list).
+
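+ The recipe in the last bullet combines two standard objectives. The following is a hedged sketch with placeholder tensors, not the released training code; the equal loss weighting, batch shapes, and vocabulary size are all assumptions:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Point-wise MSE distillation: match student scores s(q, d) to teacher scores.
+ student_scores = torch.sigmoid(torch.randn(8))  # placeholder student outputs in (0, 1)
+ teacher_scores = torch.rand(8)                  # placeholder commercial-reranker scores
+ distill_loss = F.mse_loss(student_scores, teacher_scores)
+
+ # SFT: next-token loss on the yes/no + <contribution> + <evidence> target text.
+ vocab = 32_000                                  # hypothetical vocabulary size
+ logits = torch.randn(8, 16, vocab)              # placeholder LM logits
+ labels = torch.randint(0, vocab, (8, 16))       # placeholder target token ids
+ sft_loss = F.cross_entropy(logits.reshape(-1, vocab), labels.reshape(-1))
+
+ loss = distill_loss + sft_loss                  # equal weighting is an assumption
+ ```
+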
+ ## Quickstart
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above
+
+ SYSTEM_PROMPT = (
+     "Judge whether the Document meets the requirements based on "
+     "the Query and the Instruct provided. "
+ )
+
+ INSTRUCTION = (
+     'Judge if the document is relevant to the query. Reply "yes" or "no".\n'
+     'On "yes", also emit:\n'
+     "<contribution>One sentence covering every core point the document "
+     "contributes to the query, without elaboration.</contribution>\n"
+     "<evidence>Self-contained rewrite of the query-relevant content. Rules:\n"
+     "- Faithful: rephrase only; add or infer nothing.\n"
+     "- Self-contained: evidence alone must fully answer the query.\n"
+     "- Concise: drop query-irrelevant background.\n"
+     "- Verbatim (no translation): proper nouns, terms, abbreviations, "
+     "numbers, dates, code, URLs.\n"
+     "- Output language: multilingual doc -> query's language; else doc's language.\n"
+     "</evidence>"
+ )
+
+ PROMPT_TEMPLATE = (
+     "<|im_start|>system\n{system}<|im_end|>\n"
+     "<|im_start|>user\n"
+     "<Instruct>: {instruction}\n"
+     "<Query>: {query}\n"
+     "<Document>: {doc}<|im_end|>\n"
+     "<|im_start|>assistant\n<think>\n\n</think>\n\n"
+ )
+
+
+ def build_prompt(query: str, doc: str) -> str:
+     return PROMPT_TEMPLATE.format(
+         system=SYSTEM_PROMPT, instruction=INSTRUCTION, query=query, doc=doc
+     )
+
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_PATH,
+     torch_dtype=torch.bfloat16,
+     device_map="cuda",
+     attn_implementation="sdpa",
+ ).eval()
+
+ # Token ids used to read the relevance probability off the first generation step.
+ yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
+ no_id = tokenizer.encode("no", add_special_tokens=False)[0]
+
+
+ @torch.no_grad()
+ def rerank(query: str, doc: str, max_new_tokens: int = 512):
+     prompt = build_prompt(query, doc)
+     input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
+
+     out = model.generate(
+         input_ids=input_ids,
+         max_new_tokens=max_new_tokens,
+         do_sample=False,
+         return_dict_in_generate=True,
+         output_scores=True,
+         pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
+     )
+
+     # Relevance score = softmax over {yes, no} at the first generated token.
+     first_logprobs = torch.log_softmax(out.scores[0][0].float(), dim=-1)
+     yes_p = first_logprobs[yes_id].exp()
+     no_p = first_logprobs[no_id].exp()
+     score = (yes_p / (yes_p + no_p)).item()
+
+     # Decoded text holds yes/no plus <contribution>...</contribution><evidence>...</evidence>.
+     gen_ids = out.sequences[0, input_ids.shape[1]:]
+     text = tokenizer.decode(gen_ids, skip_special_tokens=True)
+     return {"score": score, "text": text}
+
+
+ example = rerank(
+     query="What is the boiling point of water at sea level?",
+     doc=(
+         "Water boils at 100 C (212 F) at standard atmospheric pressure (1 atm), "
+         "which corresponds to sea-level conditions."
+     ),
+ )
+ print(example)
+ ```
+
+ Expected shape of the output:
+
+ ```text
+ {
+     "score": 0.98,
+     "text": "yes\n<contribution>...</contribution>\n<evidence>...</evidence>"
+ }
+ ```
+
+ For irrelevant pairs the score is close to 0 and `text` is just `"no"`.
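+
+ The tagged fields can be pulled out of `text` with a small parser. This helper is hypothetical (not part of this repo), written against the output shape shown above:
+
+ ```python
+ import re
+
+ def parse_output(text: str) -> dict:
+     # The first token is "yes" or "no"; tags only follow a "yes".
+     relevant = text.lstrip().lower().startswith("yes")
+
+     def tag(name: str):
+         m = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
+         return m.group(1).strip() if m else None
+
+     return {
+         "relevant": relevant,
+         "contribution": tag("contribution"),
+         "evidence": tag("evidence"),
+     }
+
+ print(parse_output(example["text"]))  # `example` comes from the Quickstart above
+ ```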
+
+ ## Notes on usage
+
+ - The first generated token is always `yes` or `no`, so the score is well-defined even if you stop generation immediately (cheap mode: `max_new_tokens=1`; see the sketch after this list). Generate further only when you also want the contribution and evidence.
+ - Inputs longer than 10K tokens may degrade quality; truncate the document side first.
+ - Greedy decoding is fine for ranking. For more diverse evidence rephrasings, set `do_sample=True` with a temperature of 0.3–0.5.
+
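+ A minimal sketch of the cheap mode from the first note, reusing `rerank()` from the Quickstart (the query and documents here are made up):
+
+ ```python
+ query = "What is the boiling point of water at sea level?"
+ docs = [
+     "Water boils at 100 C (212 F) at standard atmospheric pressure (1 atm).",
+     "The Eiffel Tower was completed in 1889 and is 330 m tall.",
+ ]
+
+ # Score-only pass: one generated token per document, then sort by score.
+ ranked = sorted(
+     ({"doc": d, **rerank(query, d, max_new_tokens=1)} for d in docs),
+     key=lambda r: r["score"],
+     reverse=True,
+ )
+ for r in ranked:
+     print(f"{r['score']:.3f}  {r['doc'][:60]}")
+ ```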
+
+ ## Contact
+
+ Dun Zhang — `dunnzhang0@gmail.com` (independent researcher).
model_architecture.png ADDED

Git LFS Details

  • SHA256: a16d09b28559b3b891fd85360156b83a5b495d8c1101269a6bee93f7eba7ac14
  • Pointer size: 132 Bytes
  • Size of remote file: 1.07 MB