Paper: [Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models](https://arxiv.org/abs/2506.05176)
Qwen/Qwen3-Reranker-8B converted to MLX format in float16 precision for native Apple Silicon inference.
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Reranker-8B |
| Parameters | 8B |
| Architecture | Qwen3 (decoder-based, cross-encoder) |
| Precision | float16 |
| Model size | ~14 GB (+1.2 GB lm_head) |
| Max context length | 32,768 tokens |
| Languages | 100+ |
| Scoring | "yes"/"no" logit comparison (sigmoid-normalized) |
| Converted with | mlx-embeddings v0.1.0 |
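Here "sigmoid-normalized yes/no logit comparison" means the relevance score is computed from the final-position logits of the "yes" and "no" tokens:

$$\text{score} = \sigma\left(z_{\text{yes}} - z_{\text{no}}\right)$$

where $z_{\text{yes}}$ and $z_{\text{no}}$ are those two logits and $\sigma$ is the sigmoid. This is exactly what the manual scoring code below computes.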
mlx-embeddings v0.1.0 does not load the lm_head layer needed for logit-based scoring. This repo includes a separate lm_head.safetensors file. Use the manual scoring approach below for correct reranker behavior.
```bash
pip install mlx-embeddings
```
```python
import mlx.core as mx
from mlx_embeddings import load
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

repo = "bsisduck/Qwen3-Reranker-8B-fp16-mlx"

# Load model and tokenizer
model, _ = load(repo)
tokenizer = AutoTokenizer.from_pretrained(repo, padding_side="left")

# Load the lm_head weights (shipped separately, see note above) for logit scoring
lm_head_path = hf_hub_download(repo, "lm_head.safetensors")
lm_head = mx.load(lm_head_path)["lm_head.weight"]

# Token IDs of "yes" and "no" in the Qwen3 tokenizer
# (can be verified with tokenizer.convert_tokens_to_ids("yes") / ("no"))
YES_ID = 9693
NO_ID = 2152

def rerank(query, document, instruction="Given a web search query, retrieve relevant passages that answer the query"):
    # Build the instruction/query/document prompt and tokenize it
    text = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {document}"
    inputs = tokenizer(text, return_tensors="np", padding=False)
    input_ids = mx.array(inputs["input_ids"])

    # Run the transformer body and take the hidden state of the last token
    hidden = model.model(input_ids)
    last_hidden = hidden[0, -1, :]

    # Project onto the vocabulary and compare the "yes"/"no" logits
    logits = last_hidden @ lm_head.T
    score = float(mx.sigmoid(logits[YES_ID] - logits[NO_ID]).item())
    return score

# Example
query = "What is Apple MLX framework?"
docs = [
    "MLX is an array framework for ML on Apple silicon.",
    "The capital of France is Paris.",
]
for doc in docs:
    score = rerank(query, doc)
    print(f" {score:.4f} | {doc}")
```
Tested on Apple M2 Max (32 GB):
Query: "What is Apple MLX framework?"
| Score | Label | Document |
|---|---|---|
| 0.9297 | Relevant | "MLX is an array framework for machine learning on Apple silicon..." |
| 0.2905 | Partial | "Apple Silicon uses ARM architecture and unified memory." |
| 0.1851 | Partial | "TensorFlow is Google's open-source ML framework..." |
| 0.1075 | Irrelevant | "Banana bread is made with ripe bananas and flour." |
| 0.0395 | Irrelevant | "The capital of France is Paris..." |
Performance: load time ~11 s, scoring ~40 s per document (fp16, no batching).
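To check these numbers on your own hardware, a minimal timing sketch (reusing `repo`, `load`, `AutoTokenizer`, and `rerank` from the usage code above; results will vary by machine):

```python
import time

# Time model + tokenizer load
t0 = time.perf_counter()
model, _ = load(repo)
tokenizer = AutoTokenizer.from_pretrained(repo, padding_side="left")
print(f"load: {time.perf_counter() - t0:.1f} s")

# Time a single rerank call
t0 = time.perf_counter()
rerank("What is Apple MLX framework?", "MLX is an array framework for ML on Apple silicon.")
print(f"one document: {time.perf_counter() - t0:.1f} s")
```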
mlx-embeddings v0.1.0 does not natively support the LogitScore cross-encoder pipeline; the manual lm_head scoring approach above is required.

```bibtex
@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}
```