MS MARCO MiniLM-L12-v2 ONNX QInt8 for CPU

This repository contains a QInt8 ONNX Runtime derivative of cross-encoder/ms-marco-MiniLM-L-12-v2.

It is intended for CPU inference with ONNX Runtime and ships the tokenizer and config files needed to score query-document pairs, without the original full-precision model weights.

Base model

  • cross-encoder/ms-marco-MiniLM-L-12-v2

Quantization

  • Runtime: ONNX Runtime 1.22.1
  • Method: dynamic quantization
  • Weight type: QInt8
  • Intended provider: CPUExecutionProvider

Files

  • onnx/model.onnx
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • vocab.txt

Quality benchmark

Benchmarked on Intel Core i7-9750H (AVX2) with CPUExecutionProvider.

The main signal is that the quantized model stayed effectively quality-neutral on judged reranking data while delivering an average CPU speedup of about 1.27x across the evaluated slices.

English reranking sets

| dataset | variant | precision@1 | mrr@10 | ndcg@10 | recall@10 | avg latency (ms) |
|---|---|---|---|---|---|---|
| mteb/scidocs-reranking | upstream ONNX | 0.8500 | 0.9103 | 0.7437 | 0.7425 | 245.31 |
| mteb/scidocs-reranking | this QInt8 repo | 0.8700 | 0.9228 | 0.7460 | 0.7415 | 195.22 |
| mteb/stackoverflowdupquestions-reranking | upstream ONNX | 0.3800 | 0.5015 | 0.5667 | 0.8000 | 218.96 |
| mteb/stackoverflowdupquestions-reranking | this QInt8 repo | 0.3800 | 0.5019 | 0.5662 | 0.8000 | 170.60 |

Irish proxy evaluation

A dedicated public Irish reranking benchmark was not available at packaging time, so the model was also checked on translated Irish triplet retrieval data from ReliableAI/irish_retrieval_data, used as a pairwise reranking proxy.

| dataset | variant | precision@1 | mrr@10 | ndcg@10 | pairwise accuracy | avg latency (ms) |
|---|---|---|---|---|---|---|
| msmacro_passage_irish | upstream ONNX | 0.7300 | 0.8650 | 0.9004 | 0.7300 | 51.34 |
| msmacro_passage_irish | this QInt8 repo | 0.7200 | 0.8600 | 0.8967 | 0.7200 | 38.34 |
| nq_irish | upstream ONNX | 0.8100 | 0.9050 | 0.9299 | 0.8100 | 50.14 |
| nq_irish | this QInt8 repo | 0.8000 | 0.9000 | 0.9262 | 0.8000 | 42.36 |

Interpretation

  • On the English reranking sets used here, the quantized model was effectively quality-neutral and sometimes slightly better than the full-precision ONNX baseline.
  • On the Irish proxy slices, the quantized model was close to the full-precision baseline, with about a 1-point drop on pairwise accuracy in this sample.
  • CPU latency improved materially versus the full-precision ONNX artifact.
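
The ~1.27x figure follows directly from the latency columns reported above; a quick check with the numbers copied from the tables:

```python
# Reported average latencies (ms), as (upstream ONNX, this QInt8 repo).
latency_pairs = [
    (245.31, 195.22),  # mteb/scidocs-reranking
    (218.96, 170.60),  # mteb/stackoverflowdupquestions-reranking
    (51.34, 38.34),    # msmacro_passage_irish
    (50.14, 42.36),    # nq_irish
]

speedups = [upstream / qint8 for upstream, qint8 in latency_pairs]
avg_speedup = sum(speedups) / len(speedups)
print(round(avg_speedup, 2))  # → 1.27
```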

As always for rerankers, validate on your own corpus and ranking objective before replacing the upstream model in production.

Usage

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "temsa/ms-marco-MiniLM-L-12-v2-onnx-cpu-qint8"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model_path = hf_hub_download(repo_id=repo_id, filename="onnx/model.onnx")
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Score a single query-document pair.
query = "how do I renew my driving licence in ireland"
document = "You can renew your driving licence online if you meet the identity requirements."
encoded = tokenizer(
    [[query, document]], return_tensors="np", truncation=True, padding=True, max_length=128
)

# Keep only the tensors the ONNX graph actually declares as inputs.
session_inputs = {inp.name for inp in session.get_inputs()}
inputs = {name: value.astype(np.int64) for name, value in encoded.items() if name in session_inputs}
scores = session.run(None, inputs)[0]
print(scores)
```
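
The raw output is one unnormalized relevance logit per pair. When reranking several candidates, a common pattern is to map the logits through a sigmoid and sort. A small sketch using hypothetical logits (the values and passage names below are illustrative, not model output):

```python
import numpy as np


def rank_documents(scores, documents):
    # Map raw cross-encoder logits into [0, 1] with a sigmoid,
    # then sort documents from most to least relevant.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=np.float64)))
    order = np.argsort(-probs)
    return [(documents[i], float(probs[i])) for i in order]


# Hypothetical logits for three candidate passages.
docs = ["passage A", "passage B", "passage C"]
ranked = rank_documents([2.1, -0.4, 0.7], docs)
print(ranked[0][0])  # → passage A
```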