bsisduck committed
Commit 15f45dd · verified · 1 Parent(s): f550822

Add lm_head.safetensors for logit scoring + update model card with verified results

Files changed (2)
  1. README.md +77 -18
  2. lm_head.safetensors +3 -0
README.md CHANGED
@@ -1,5 +1,6 @@
---
base_model: Qwen/Qwen3-Reranker-8B
+ base_model_relation: quantized
library_name: mlx-embeddings
tags:
- mlx
@@ -13,13 +14,11 @@ language:
- multilingual
license: apache-2.0
pipeline_tag: text-classification
- datasets:
- - Qwen/Reranker-Multilingual-General-Instruct
---

# Qwen3-Reranker-8B — MLX fp16

- [Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) converted to MLX format in **float16** precision for Apple Silicon.
+ [Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) converted to MLX format in **float16** precision for native Apple Silicon inference.

## Model Details

@@ -29,11 +28,16 @@ datasets:
| Parameters | 8B |
| Architecture | Qwen3 (decoder-based, cross-encoder) |
| Precision | float16 |
+ | Model size | ~14 GB (+1.2 GB lm_head) |
| Max context length | 32,768 tokens |
| Languages | 100+ |
- | Scoring | "yes"/"no" logit comparison |
+ | Scoring | "yes"/"no" logit comparison (sigmoid-normalized) |
| Converted with | [mlx-embeddings](https://github.com/Blaizzy/mlx-embeddings) v0.1.0 |

+ ## Important: lm_head Required for Reranking
+
+ `mlx-embeddings` v0.1.0 does not load the `lm_head` layer needed for logit-based scoring. This repo includes a separate `lm_head.safetensors` file. Use the manual scoring approach below for correct reranker behavior.
+
## Usage

```bash
@@ -41,29 +45,84 @@ pip install mlx-embeddings
```

```python
- from mlx_embeddings import load
import mlx.core as mx
+ from mlx_embeddings import load
+ from transformers import AutoTokenizer
+ from huggingface_hub import hf_hub_download
+
+ repo = "bsisduck/Qwen3-Reranker-8B-fp16-mlx"
+
+ # Load model and tokenizer
+ model, _ = load(repo)
+ tokenizer = AutoTokenizer.from_pretrained(repo, padding_side="left")
+
+ # Load lm_head for logit scoring
+ lm_head_path = hf_hub_download(repo, "lm_head.safetensors")
+ lm_head = mx.load(lm_head_path)["lm_head.weight"]
+
+ YES_ID = 9693
+ NO_ID = 2152
+
+ def rerank(query, document, instruction="Given a web search query, retrieve relevant passages that answer the query"):
+     text = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {document}"
+     inputs = tokenizer(text, return_tensors="np", padding=False)
+     input_ids = mx.array(inputs["input_ids"])
+
+     hidden = model.model(input_ids)
+     last_hidden = hidden[0, -1, :]
+
+     logits = last_hidden @ lm_head.T
+     score = float(mx.sigmoid(logits[YES_ID] - logits[NO_ID]).item())
+     return score
- model, tokenizer = load("bsisduck/Qwen3-Reranker-8B-fp16-mlx")
-
- scores = model.process({
-     "instruction": "Given a web search query, retrieve relevant passages that answer the query",
-     "query": {"text": "What is MLX?"},
-     "documents": [
-         {"text": "MLX is Apple's array framework for machine learning on Apple Silicon."},
-         {"text": "Python is a programming language."},
-     ],
- }, processor=tokenizer)
-
- # Higher score = more relevant
- print(scores)
+
+ # Example
+ query = "What is Apple MLX framework?"
+ docs = [
+     "MLX is an array framework for ML on Apple silicon.",
+     "The capital of France is Paris.",
+ ]
+
+ for doc in docs:
+     score = rerank(query, doc)
+     print(f"  {score:.4f} | {doc}")
```

+ ## Verified Results
+
+ Tested on Apple M2 Max (32 GB):
+
+ **Query**: "What is Apple MLX framework?"
+
+ | Score | Label | Document |
+ |---|---|---|
+ | 0.9297 | Relevant | "MLX is an array framework for machine learning on Apple silicon..." |
+ | 0.2905 | Partial | "Apple Silicon uses ARM architecture and unified memory." |
+ | 0.1851 | Partial | "TensorFlow is Google's open-source ML framework..." |
+ | 0.1075 | Irrelevant | "Banana bread is made with ripe bananas and flour." |
+ | 0.0395 | Irrelevant | "The capital of France is Paris..." |
+
+ **Performance**: Load time ~11s, ~40s per document for scoring (fp16, no batching).
+
## Hardware Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- ~16 GB unified memory

- ## Original Model
+ ## Limitations

- See [Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) for benchmarks, training details, and full documentation.
+ - This is a format conversion (bf16 to fp16 MLX), not a fine-tune. Accuracy differences vs. the original are due to fp16 precision only.
+ - `mlx-embeddings` v0.1.0 does not natively support the LogitScore cross-encoder pipeline; the manual `lm_head` scoring approach above is required.
+ - See the [original model card](https://huggingface.co/Qwen/Qwen3-Reranker-8B) for full limitations, biases, and ethical considerations.
+
+ ## References
+
+ * [Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models](https://arxiv.org/abs/2506.05176)
+
+ ```bibtex
+ @article{qwen3embedding,
+   title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
+   author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
+   journal={arXiv preprint arXiv:2506.05176},
+   year={2025}
+ }
+ ```
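The score in the updated usage example is just the sigmoid of the gap between the "yes" and "no" logits. A minimal pure-Python sketch of that arithmetic (no model or MLX required; `yes_no_score` is an illustrative name, not part of the repo):

```python
import math

def yes_no_score(yes_logit: float, no_logit: float) -> float:
    """Sigmoid of the "yes"-minus-"no" logit gap, mirroring the rerank() scoring rule."""
    return 1.0 / (1.0 + math.exp(-(yes_logit - no_logit)))

# Equal logits give a neutral 0.5; a large positive gap approaches 1.0.
print(yes_no_score(2.0, 2.0))   # 0.5
print(yes_no_score(6.0, -2.0))  # ≈ 0.99966
```

Because only the difference matters, the score is invariant to any constant shift applied to both logits.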
lm_head.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abce700e488e577f364082a15c2cd453f18c50e3cd5ba8fd95707216ed237b84
+ size 1242472562
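The Git LFS pointer above records the blob's SHA-256 and byte size, so a downloaded copy of the weights can be checked locally. A small sketch using only the standard library (`sha256_of` is an illustrative helper; the expected digest is the `oid` from the pointer):

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks so large weight files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

EXPECTED = "abce700e488e577f364082a15c2cd453f18c50e3cd5ba8fd95707216ed237b84"
# After downloading: assert sha256_of("lm_head.safetensors") == EXPECTED
```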