bge-reranker-v2-m3-onnx (fp32)

ONNX (fp32) export of BAAI/bge-reranker-v2-m3, a multilingual cross-encoder reranker. One forward pass over a (query, document) pair produces a single relevance logit. Built for CPU inference with ONNX Runtime.

It powers the CPU flavor of bge-m3-service (sophiacloud/bge-m3-service:cpu), but works standalone.

Why this exists (and where it actually helps)

On CPU, ONNX Runtime beats PyTorch eager mode through operator fusion and lower per-op overhead. Measured against the original model run through FlagEmbedding, with the same thread count, fp32, and real documents (x86):

op | ONNX-CPU | FlagEmbedding-CPU | speedup
rerank (1 query × 10 docs) | 8.9 s | 11.8 s | 1.33×

Honest framing:

  • ~1.3× is the real number, not 2-4×. The larger multiplier applies to int8, which we deliberately do not ship (see below).
  • On GPU, use the original model instead; fp16 there is as fast or faster.
  • Absolute latency is high here only because the inputs are long documents truncated to 1024 tokens on CPU; short pairs are much faster.
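
A comparison like the one in the table can be reproduced with a small timing harness. The sketch below is only an illustration, not the script used to produce the numbers above; `median_latency` and `speedup` are hypothetical helper names, and the callable you pass in would wrap your own ONNX Runtime or FlagEmbedding rerank call.

```python
import statistics
import time


def median_latency(fn, warmup=1, repeats=5):
    """Return the median wall-clock time, in seconds, of calling fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Figures from the table: FlagEmbedding 11.8 s, ONNX 8.9 s.
print(round(speedup(11.8, 8.9), 2))  # 1.33
```

Using the median rather than the mean keeps a single slow run (for example, a cold cache) from skewing the result.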

Numerically identical to the original

Validated end-to-end (through the full HTTP service) against FlagEmbedding fp32:

signal | result
ranking Spearman vs reference | 1.0000
overlap@1 | 1.000
max abs score diff (sigmoid) | 0.00000
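
For readers who want to run a similar check on their own data, the three signals can be computed roughly as follows. This is a minimal sketch under stated assumptions: scores are grouped per query, Spearman is computed per query and then averaged (the aggregation actually used for the table is not specified), and ties are broken by position rather than averaged. The function names and the toy scores are made up for illustration.

```python
import numpy as np


def ranks(x):
    """Rank positions (0 = smallest). Ties are broken by order, not averaged."""
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=np.float64)
    r[order] = np.arange(len(x))
    return r


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


def compare(reference, candidate):
    """reference, candidate: lists of per-query score arrays with equal shapes."""
    pairs = list(zip(reference, candidate))
    rho = np.mean([spearman(r, c) for r, c in pairs])
    # overlap@1: fraction of queries where both systems pick the same top document
    top1 = np.mean([np.argmax(r) == np.argmax(c) for r, c in pairs])
    max_diff = max(float(np.max(np.abs(np.asarray(r) - np.asarray(c)))) for r, c in pairs)
    return {"spearman": float(rho), "overlap@1": float(top1), "max_abs_diff": max_diff}


# Toy sigmoid scores for two queries with three documents each (made-up values).
ref = [np.array([0.9, 0.1, 0.5]), np.array([0.2, 0.8, 0.3])]
cand = [np.array([0.9, 0.1, 0.5]), np.array([0.2, 0.8, 0.3])]
print(compare(ref, cand))
```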

Why no int8

Dynamic int8 quantization dropped top-1 overlap to 0.625, i.e. the best document changed in 37.5% of queries. For a reranker the top result is the whole point, so we ship fp32 only.

Files

  • model_fp32.onnx + model.onnx_data: graph + external weights (~2.2 GB)
  • config.json
  • tokenizer_hf/: the HuggingFace tokenizer (use this, not a native ONNX tokenizer)

Usage

import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the repository, then load the bundled HuggingFace tokenizer and the ONNX graph.
d = snapshot_download("Sophia-AI/bge-reranker-v2-m3-onnx")
tok = AutoTokenizer.from_pretrained(f"{d}/tokenizer_hf")
sess = ort.InferenceSession(f"{d}/model_fp32.onnx", providers=["CPUExecutionProvider"])

query, docs = "come funziona il rimborso?", ["Rimborso entro 14 giorni.", "Ricetta carbonara."]
# Pair the query with every document; each row is one (query, document) input.
enc = tok([query] * len(docs), docs, padding=True, truncation=True,
          max_length=1024, return_tensors="np")
# Feed only the inputs the graph declares, cast to int64.
names = {i.name for i in sess.get_inputs()}
logits = sess.run(None, {k: v.astype(np.int64) for k, v in enc.items() if k in names})[0].reshape(-1)
scores = 1 / (1 + np.exp(-logits))   # sigmoid -> [0,1] (optional; ranking is invariant)

Output is a single logit per pair. Apply a sigmoid for a [0,1] score; the ranking is unchanged either way.
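
Turning the per-pair logits into an ordered result list could look like the sketch below. `rank_documents` is a hypothetical helper, not part of this repository, and the logits are made-up values rather than real model output. Because the sigmoid is strictly increasing, sorting by logit and sorting by score give the same order.

```python
import numpy as np


def sigmoid(x):
    """Map raw logits to scores in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def rank_documents(docs, logits):
    """Return (doc, score) pairs sorted from most to least relevant."""
    logits = np.asarray(logits, dtype=np.float64)
    order = np.argsort(-logits)  # descending by raw logit
    scores = sigmoid(logits)
    return [(docs[i], float(scores[i])) for i in order]


docs = ["Rimborso entro 14 giorni.", "Ricetta carbonara."]
logits = [2.3, -4.1]  # made-up logits, not real model output
ranked = rank_documents(docs, logits)

# Sorting by sigmoid score yields the same order as sorting by logit.
by_score = [docs[i] for i in np.argsort(-sigmoid(np.asarray(logits)))]
assert [d for d, _ in ranked] == by_score
print(ranked[0][0])  # Rimborso entro 14 giorni.
```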

License

Apache-2.0, inherited from the base model BAAI/bge-reranker-v2-m3.
