bge-reranker-v2-m3-onnx (fp32)
ONNX (fp32) export of BAAI/bge-reranker-v2-m3,
a multilingual cross-encoder reranker. One forward pass over a (query, document) pair
produces a single relevance logit. Built for CPU inference with ONNX Runtime.
It powers the CPU flavor of bge-m3-service
(sophiacloud/bge-m3-service:cpu), but works standalone.
Why this exists (and where it actually helps)
On CPU, ONNX Runtime beats PyTorch eager via operator fusion and lower overhead. Measured against the original through FlagEmbedding, same threads, fp32, real documents (x86):
| op | ONNX-CPU | FlagEmbedding-CPU | speedup |
|---|---|---|---|
| rerank (1 query × 10 docs) | 8.9 s | 11.8 s | 1.33× |
Honest framing:
- ~1.3× is the real number, not 2-4×. The larger multiplier applies to int8, which we deliberately do not ship (see below).
- On GPU, use the original model instead; fp16 there is as fast or faster.
- Absolute latency is high here only because the inputs are long documents truncated to 1024 tokens on CPU; short pairs are much faster.
Numerically identical to the original
Validated end-to-end (through the full HTTP service) against FlagEmbedding fp32:
| signal | result |
|---|---|
| ranking Spearman vs reference | 1.0000 |
| overlap@1 | 1.000 |
| max abs score diff (sigmoid) | 0.00000 |
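As a rough illustration of what the three signals measure, the sketch below computes them for a single query with NumPy only. It is not the validation harness used for the table: the helper names (`ranks`, `spearman`, `compare`) and the logit values are hypothetical, and ties are assumed not to occur. Overlap@1 across a test set would then be the mean of `top1_match` over all queries.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=np.float64)))

def ranks(x):
    """Ordinal ranks (0 = smallest). Assumes there are no ties."""
    x = np.asarray(x)
    r = np.empty(len(x), dtype=np.float64)
    r[np.argsort(x)] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def compare(ref_logits, test_logits):
    """Compare two sets of logits for the same query and candidate documents."""
    ref, test = sigmoid(ref_logits), sigmoid(test_logits)
    return {
        "spearman": spearman(ref, test),
        "top1_match": bool(np.argmax(ref) == np.argmax(test)),
        "max_abs_diff": float(np.max(np.abs(ref - test))),
    }

# Hypothetical logits for one query with four candidate documents.
ref_logits = np.array([2.31, -4.10, 0.25, -1.70])
same = compare(ref_logits, ref_logits.copy())                        # identical outputs
swapped = compare(ref_logits, np.array([0.25, -4.10, 2.31, -1.70]))  # top two swapped
```

With identical outputs every signal sits at its ideal value; swapping the two best documents flips `top1_match` to `False` even though the rank correlation stays fairly high, which is why top-1 overlap is reported separately.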
Why no int8
Dynamic int8 dropped top-1 overlap to 0.625, i.e. the best document changed in ~37% of queries. For a reranker the top result is the whole point, so we ship fp32 only.
Files
- model_fp32.onnx + model.onnx_data: graph + external weights (~2.2 GB)
- config.json
- tokenizer_hf/: the HuggingFace tokenizer (use this, not a native ONNX tokenizer)
Usage
import numpy as np, onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download
d = snapshot_download("Sophia-AI/bge-reranker-v2-m3-onnx")
tok = AutoTokenizer.from_pretrained(f"{d}/tokenizer_hf")
sess = ort.InferenceSession(f"{d}/model_fp32.onnx", providers=["CPUExecutionProvider"])
query, docs = "come funziona il rimborso?", ["Rimborso entro 14 giorni.", "Ricetta carbonara."]
enc = tok([query] * len(docs), docs, padding=True, truncation=True,
          max_length=1024, return_tensors="np")
logits = sess.run(None, {k: enc[k].astype(np.int64) for k in enc})[0].reshape(-1)
scores = 1 / (1 + np.exp(-logits))  # sigmoid -> [0,1] (optional; ranking is invariant)
Output is a single logit per pair. Apply a sigmoid for a [0,1] score; the ranking is unchanged either way.
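To turn the per-pair logits into an ordered result list, something like the following minimal sketch could be used. The helper `rank_documents` and the example logit values are hypothetical (they are not part of this repository and not real model outputs); the sort is done on the raw logits, which gives the same order as sorting on the sigmoid scores because the sigmoid is monotonic.

```python
import numpy as np

def rank_documents(docs, logits):
    """Return (document, score) pairs, most relevant first."""
    logits = np.asarray(logits, dtype=np.float64)
    scores = 1.0 / (1.0 + np.exp(-logits))       # optional [0,1] score
    order = np.argsort(-logits, kind="stable")   # descending by logit
    return [(docs[i], float(scores[i])) for i in order]

docs = ["Ricetta carbonara.", "Rimborso entro 14 giorni."]
example_logits = [-6.5, 3.2]  # hypothetical values, not real model outputs
ranked = rank_documents(docs, example_logits)
best_doc, best_score = ranked[0]
```

In practice `example_logits` would be replaced by the `logits` array produced by `sess.run` above.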
License
Apache-2.0, inherited from the base model BAAI/bge-reranker-v2-m3.