# MS MARCO MiniLM-L12-v2 ONNX QInt8 for CPU
This repository contains a QInt8 ONNX Runtime derivative of cross-encoder/ms-marco-MiniLM-L-12-v2.
It is intended for CPU inference with ONNX Runtime and keeps the tokenizer/config payload needed to score query-document pairs without the original full-precision model weights.
## Base model

- Upstream model: cross-encoder/ms-marco-MiniLM-L-12-v2
- Upstream revision: 7b0235231ca2674cb8ca8f022859a6eba2b1c968
- Upstream license: Apache-2.0
## Quantization

- Runtime: ONNX Runtime 1.22.1
- Method: dynamic quantization
- Weight type: QInt8
- Intended provider: CPUExecutionProvider
## Files

- `onnx/model.onnx`
- `config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
- `vocab.txt`
## Quality benchmark
Benchmarked on Intel Core i7-9750H (AVX2) with CPUExecutionProvider.
The main signal is that the quantized model stayed effectively quality-neutral on judged reranking data while running about 1.27x faster on CPU, averaged across the evaluated slices.
### English reranking sets

| dataset | variant | precision@1 | mrr@10 | ndcg@10 | recall@10 | avg latency (ms) |
|---|---|---|---|---|---|---|
| mteb/scidocs-reranking | upstream ONNX | 0.8500 | 0.9103 | 0.7437 | 0.7425 | 245.31 |
| mteb/scidocs-reranking | this QInt8 repo | 0.8700 | 0.9228 | 0.7460 | 0.7415 | 195.22 |
| mteb/stackoverflowdupquestions-reranking | upstream ONNX | 0.3800 | 0.5015 | 0.5667 | 0.8000 | 218.96 |
| mteb/stackoverflowdupquestions-reranking | this QInt8 repo | 0.3800 | 0.5019 | 0.5662 | 0.8000 | 170.60 |
### Irish proxy evaluation
A dedicated public Irish reranking benchmark was not available at packaging time, so the model was also checked on translated Irish triplet retrieval data from ReliableAI/irish_retrieval_data, used as a pairwise reranking proxy.
| dataset | variant | precision@1 | mrr@10 | ndcg@10 | pairwise accuracy | avg latency (ms) |
|---|---|---|---|---|---|---|
| msmacro_passage_irish | upstream ONNX | 0.7300 | 0.8650 | 0.9004 | 0.7300 | 51.34 |
| msmacro_passage_irish | this QInt8 repo | 0.7200 | 0.8600 | 0.8967 | 0.7200 | 38.34 |
| nq_irish | upstream ONNX | 0.8100 | 0.9050 | 0.9299 | 0.8100 | 50.14 |
| nq_irish | this QInt8 repo | 0.8000 | 0.9000 | 0.9262 | 0.8000 | 42.36 |
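The ~1.27x average speedup quoted above can be reproduced directly from the latency columns of the two tables:

```python
# Per-slice speedup = upstream latency / quantized latency, taken from the tables above.
fp32 = [245.31, 218.96, 51.34, 50.14]
qint8 = [195.22, 170.60, 38.34, 42.36]

speedups = [f / q for f, q in zip(fp32, qint8)]
print([round(s, 2) for s in speedups])          # [1.26, 1.28, 1.34, 1.18]
print(round(sum(speedups) / len(speedups), 2))  # 1.27
```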
## Interpretation
- On the English reranking sets used here, the quantized model was effectively quality-neutral and sometimes slightly better than the full-precision ONNX baseline.
- On the Irish proxy slices, the quantized model tracked the full-precision baseline closely, with about a one-percentage-point drop in pairwise accuracy in this sample.
- CPU latency improved materially versus the full-precision ONNX artifact.
As always for rerankers, validate on your own corpus and ranking objective before replacing the upstream model in production.
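For that kind of validation, the reported metrics are straightforward to compute on your own judged data. A minimal sketch of MRR@10 and pairwise accuracy (function names and inputs are illustrative, not part of this repo):

```python
def mrr_at_10(ranked_relevance):
    """ranked_relevance: per-query lists of 0/1 labels, ordered by model score."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels[:10], start=1):
            if rel:  # reciprocal rank of the first relevant hit in the top 10
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def pairwise_accuracy(pairs):
    """pairs: (positive_score, negative_score) tuples; a pair is correct when pos > neg."""
    return sum(p > n for p, n in pairs) / len(pairs)

print(mrr_at_10([[0, 1, 0], [1, 0, 0]]))          # (0.5 + 1.0) / 2 = 0.75
print(pairwise_accuracy([(2.3, 1.1), (0.2, 0.9)]))  # 0.5
```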
## Usage

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "temsa/ms-marco-MiniLM-L-12-v2-onnx-cpu-qint8"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model_path = hf_hub_download(repo_id=repo_id, filename="onnx/model.onnx")
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

query = "how do I renew my driving licence in ireland"
document = "You can renew your driving licence online if you meet the identity requirements."

# Cross-encoders score the (query, document) pair as a single input sequence.
encoded = tokenizer(
    [[query, document]],
    return_tensors="np",
    truncation=True,
    padding=True,
    max_length=128,
)

# Keep only the tensors the ONNX graph declares, cast to int64.
input_names = {inp.name for inp in session.get_inputs()}
inputs = {name: value.astype(np.int64) for name, value in encoded.items() if name in input_names}

# One score per pair; higher means more relevant.
scores = session.run(None, inputs)[0]
print(scores)
```
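The model emits one raw relevance logit per pair, which is all you need to sort candidates for a query. If a bounded score in (0, 1) is wanted (e.g. for thresholding), a sigmoid can be applied; the helper below is a small illustrative addition, not part of this repo:

```python
import numpy as np

def sigmoid(x):
    # Map raw cross-encoder logits to (0, 1); raw logits already suffice for ranking.
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([4.2, -1.3, 0.0])
print(sigmoid(logits))
```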
## Model tree

- Base model: microsoft/MiniLM-L12-H384-uncased