bsisduck committed
Commit 15f45dd · verified · 1 Parent(s): f550822

Add lm_head.safetensors for logit scoring + update model card with verified results

Files changed (2)
  1. README.md +77 -18
  2. lm_head.safetensors +3 -0
README.md CHANGED
@@ -1,5 +1,6 @@
---
base_model: Qwen/Qwen3-Reranker-8B
+ base_model_relation: quantized
library_name: mlx-embeddings
tags:
- mlx
@@ -13,13 +14,11 @@ language:
- multilingual
license: apache-2.0
pipeline_tag: text-classification
- datasets:
- - Qwen/Reranker-Multilingual-General-Instruct
---

# Qwen3-Reranker-8B — MLX fp16

- [Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) converted to MLX format in **float16** precision for Apple Silicon.
+ [Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) converted to MLX format in **float16** precision for native Apple Silicon inference.

## Model Details

@@ -29,11 +28,16 @@ datasets:
| Parameters | 8B |
| Architecture | Qwen3 (decoder-based, cross-encoder) |
| Precision | float16 |
+ | Model size | ~14 GB (+1.2 GB lm_head) |
| Max context length | 32,768 tokens |
| Languages | 100+ |
- | Scoring | "yes"/"no" logit comparison |
+ | Scoring | "yes"/"no" logit comparison (sigmoid-normalized) |
| Converted with | [mlx-embeddings](https://github.com/Blaizzy/mlx-embeddings) v0.1.0 |

+ ## Important: lm_head Required for Reranking
+
+ `mlx-embeddings` v0.1.0 does not load the `lm_head` layer needed for logit-based scoring. This repo includes a separate `lm_head.safetensors` file. Use the manual scoring approach below for correct reranker behavior.
+
## Usage

```bash
@@ -41,29 +45,84 @@ pip install mlx-embeddings
```

```python
- from mlx_embeddings import load
import mlx.core as mx
+ from mlx_embeddings import load
+ from transformers import AutoTokenizer
+ from huggingface_hub import hf_hub_download
+
+ repo = "bsisduck/Qwen3-Reranker-8B-fp16-mlx"
+
+ # Load model and tokenizer
+ model, _ = load(repo)
+ tokenizer = AutoTokenizer.from_pretrained(repo, padding_side="left")
+
+ # Load lm_head for logit scoring
+ lm_head_path = hf_hub_download(repo, "lm_head.safetensors")
+ lm_head = mx.load(lm_head_path)["lm_head.weight"]
+
+ YES_ID = 9693
+ NO_ID = 2152
+
+ def rerank(query, document, instruction="Given a web search query, retrieve relevant passages that answer the query"):
+     text = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {document}"
+     inputs = tokenizer(text, return_tensors="np", padding=False)
+     input_ids = mx.array(inputs["input_ids"])
+
+     hidden = model.model(input_ids)
+     last_hidden = hidden[0, -1, :]
+
+     logits = last_hidden @ lm_head.T
+     score = float(mx.sigmoid(logits[YES_ID] - logits[NO_ID]).item())
+     return score
- model, tokenizer = load("bsisduck/Qwen3-Reranker-8B-fp16-mlx")
-
- scores = model.process({
-     "instruction": "Given a web search query, retrieve relevant passages that answer the query",
-     "query": {"text": "What is MLX?"},
-     "documents": [
-         {"text": "MLX is Apple's array framework for machine learning on Apple Silicon."},
-         {"text": "Python is a programming language."},
-     ],
- }, processor=tokenizer)
-
- # Higher score = more relevant
- print(scores)
+
+ # Example
+ query = "What is Apple MLX framework?"
+ docs = [
+     "MLX is an array framework for ML on Apple silicon.",
+     "The capital of France is Paris.",
+ ]
+
+ for doc in docs:
+     score = rerank(query, doc)
+     print(f"  {score:.4f} | {doc}")
```

+ ## Verified Results
+
+ Tested on Apple M2 Max (32 GB):
+
+ **Query**: "What is Apple MLX framework?"
+
+ | Score | Label | Document |
+ |---|---|---|
+ | 0.9297 | Relevant | "MLX is an array framework for machine learning on Apple silicon..." |
+ | 0.2905 | Partial | "Apple Silicon uses ARM architecture and unified memory." |
+ | 0.1851 | Partial | "TensorFlow is Google's open-source ML framework..." |
+ | 0.1075 | Irrelevant | "Banana bread is made with ripe bananas and flour." |
+ | 0.0395 | Irrelevant | "The capital of France is Paris..." |
+
+ **Performance**: Load time ~11s, ~40s per document for scoring (fp16, no batching).
+
## Hardware Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- ~16 GB unified memory

- ## Original Model
+ ## Limitations

- See [Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) for benchmarks, training details, and full documentation.
+ - This is a format conversion (bf16 to fp16 MLX), not a fine-tune. Accuracy differences vs. the original are due to fp16 precision only.
+ - `mlx-embeddings` v0.1.0 does not natively support the LogitScore cross-encoder pipeline; the manual `lm_head` scoring approach above is required.
+ - See the [original model card](https://huggingface.co/Qwen/Qwen3-Reranker-8B) for full limitations, biases, and ethical considerations.
+
+ ## References
+
+ * [Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models](https://arxiv.org/abs/2506.05176)
+
+ ```bibtex
+ @article{qwen3embedding,
+   title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
+   author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
+   journal={arXiv preprint arXiv:2506.05176},
+   year={2025}
+ }
+ ```
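The score in the updated usage example is just the sigmoid of the gap between the "yes" and "no" logits. A minimal pure-Python sketch of that arithmetic (no model or MLX required; `yes_no_score` is an illustrative name, not part of the repo):

```python
import math

def yes_no_score(yes_logit: float, no_logit: float) -> float:
    """Sigmoid of the "yes"-minus-"no" logit gap, mirroring the rerank() scoring rule."""
    return 1.0 / (1.0 + math.exp(-(yes_logit - no_logit)))

# Equal logits give a neutral 0.5; a large positive gap approaches 1.0.
print(yes_no_score(2.0, 2.0))   # 0.5
print(yes_no_score(6.0, -2.0))  # ≈ 0.99966
```

Because only the difference matters, the score is invariant to any constant shift applied to both logits.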
lm_head.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abce700e488e577f364082a15c2cd453f18c50e3cd5ba8fd95707216ed237b84
+ size 1242472562
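The Git LFS pointer above records the blob's SHA-256 and byte size, so a downloaded copy of the weights can be checked locally. A small sketch using only the standard library (`sha256_of` is an illustrative helper; the expected digest is the `oid` from the pointer):

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks so large weight files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

EXPECTED = "abce700e488e577f364082a15c2cd453f18c50e3cd5ba8fd95707216ed237b84"
# After downloading: assert sha256_of("lm_head.safetensors") == EXPECTED
```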