# Qwen3-Reranker-8B — MLX fp16

Qwen/Qwen3-Reranker-8B converted to MLX format in float16 precision for native Apple Silicon inference.

## Model Details

| Property | Value |
|---|---|
| Base model | [Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) |
| Parameters | 8B |
| Architecture | Qwen3 (decoder-only, used as a cross-encoder) |
| Precision | float16 |
| Model size | ~14 GB (+1.2 GB `lm_head`) |
| Max context length | 32,768 tokens |
| Languages | 100+ |
| Scoring | "yes"/"no" logit comparison, sigmoid-normalized (see below) |
| Converted with | mlx-embeddings v0.1.0 |
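
The scoring scheme is worth a quick illustration: taking the sigmoid of the difference between the "yes" and "no" logits is exactly the two-way softmax probability of "yes" over those two tokens. A minimal sketch with toy logit values (not real model outputs):

```python
import math

# Toy logits, for illustration only; real values come from the model's lm_head
l_yes, l_no = 2.3, -1.1

# Score used by this card: sigmoid of the logit difference
score = 1 / (1 + math.exp(-(l_yes - l_no)))

# Equivalent view: softmax over the two candidate tokens, probability of "yes"
p_yes = math.exp(l_yes) / (math.exp(l_yes) + math.exp(l_no))

print(f"{score:.6f} == {p_yes:.6f}")  # both ≈ 0.967705
```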

## Important: `lm_head` Required for Reranking

mlx-embeddings v0.1.0 does not load the `lm_head` layer that logit-based scoring requires, so this repo ships it as a separate `lm_head.safetensors` file. Use the manual scoring approach below for correct reranker behavior.
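
To sanity-check the extra file before wiring it into scoring, you can download and inspect it on its own (the exact shape follows the Qwen3 vocabulary and hidden sizes):

```python
import mlx.core as mx
from huggingface_hub import hf_hub_download

# Fetch only the detached lm_head tensor and inspect it
path = hf_hub_download("bsisduck/Qwen3-Reranker-8B-fp16-mlx", "lm_head.safetensors")
head = mx.load(path)["lm_head.weight"]
print(head.shape, head.dtype)  # (vocab_size, hidden_size), float16
```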

## Usage

```bash
pip install mlx-embeddings
```

```python
import mlx.core as mx
from mlx_embeddings import load
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

repo = "bsisduck/Qwen3-Reranker-8B-fp16-mlx"

# Load model and tokenizer
model, _ = load(repo)
tokenizer = AutoTokenizer.from_pretrained(repo, padding_side="left")

# Load lm_head for logit scoring
lm_head_path = hf_hub_download(repo, "lm_head.safetensors")
lm_head = mx.load(lm_head_path)["lm_head.weight"]

YES_ID = 9693
NO_ID = 2152
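# These IDs are hardcoded for the Qwen3 tokenizer; to derive them instead:
#   YES_ID = tokenizer.convert_tokens_to_ids("yes")
#   NO_ID = tokenizer.convert_tokens_to_ids("no")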

def rerank(query, document, instruction="Given a web search query, retrieve relevant passages that answer the query"):
    # Qwen3-Reranker prompt format: instruction + query + candidate document
    text = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {document}"
    inputs = tokenizer(text, return_tensors="np", padding=False)
    input_ids = mx.array(inputs["input_ids"])

    # Run the base transformer and take the last token's final hidden state
    hidden = model.model(input_ids)
    last_hidden = hidden[0, -1, :]

    # Project onto the vocabulary; relevance = sigmoid("yes" logit - "no" logit)
    logits = last_hidden @ lm_head.T
    score = float(mx.sigmoid(logits[YES_ID] - logits[NO_ID]).item())
    return score

# Example
query = "What is Apple MLX framework?"
docs = [
    "MLX is an array framework for ML on Apple silicon.",
    "The capital of France is Paris.",
]

for doc in docs:
    score = rerank(query, doc)
    print(f"  {score:.4f} | {doc}")
```

## Verified Results

Tested on Apple M2 Max (32 GB):

Query: "What is Apple MLX framework?"

| Score | Label | Document |
|---|---|---|
| 0.9297 | Relevant | "MLX is an array framework for machine learning on Apple silicon..." |
| 0.2905 | Partial | "Apple Silicon uses ARM architecture and unified memory." |
| 0.1851 | Partial | "TensorFlow is Google's open-source ML framework..." |
| 0.1075 | Irrelevant | "Banana bread is made with ripe bananas and flour." |
| 0.0395 | Irrelevant | "The capital of France is Paris..." |

Performance: model load ~11 s; scoring ~40 s per document (fp16, no batching).
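
Timings vary with hardware and document length; to reproduce them on your machine, a simple wall-clock measurement around one call is enough:

```python
import time

# Time a single scoring call (the first call may include one-time warm-up cost)
start = time.perf_counter()
score = rerank(query, docs[0])
print(f"{time.perf_counter() - start:.1f}s per document, score={score:.4f}")
```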

## Hardware Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- ~16 GB unified memory minimum (the fp16 weights plus `lm_head` total ~15 GB; a quick check is sketched below)
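
A quick way to check the memory bar before downloading ~15 GB of weights, assuming a recent MLX release (the `mx.metal.device_info()` call and its `memory_size` key are the assumptions here):

```python
import mlx.core as mx

# Reports device properties, including total unified memory, on recent MLX builds
info = mx.metal.device_info()
print(f"{info['memory_size'] / 1e9:.0f} GB unified memory")
```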

## Limitations

- This is a format conversion (bf16 → fp16 MLX), not a fine-tune; any accuracy differences vs. the original come from fp16 precision alone.
- mlx-embeddings v0.1.0 does not natively support logit-scored cross-encoder pipelines, so the manual `lm_head` scoring approach above is required.
- See the [original model card](https://huggingface.co/Qwen/Qwen3-Reranker-8B) for full limitations, biases, and ethical considerations.

## References

```bibtex
@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}
```