MS MARCO MiniLM-L12-v2 ONNX QInt8 for CPU

This repository contains a QInt8 ONNX Runtime derivative of cross-encoder/ms-marco-MiniLM-L-12-v2.

It is intended for CPU inference with ONNX Runtime and ships the tokenizer and config files needed to score query-document pairs, without the original full-precision model weights.

Base model

  • cross-encoder/ms-marco-MiniLM-L-12-v2

Quantization

  • Runtime: ONNX Runtime 1.22.1
  • Method: dynamic quantization
  • Weight type: QInt8
  • Intended provider: CPUExecutionProvider

Files

  • onnx/model.onnx
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • vocab.txt

Quality benchmark

Benchmarked on Intel Core i7-9750H (AVX2) with CPUExecutionProvider.

The main signal is that the quantized model stayed effectively quality-neutral on judged reranking data while delivering an average CPU speedup of about 1.27x across the evaluated slices.

English reranking sets

| dataset | variant | precision@1 | mrr@10 | ndcg@10 | recall@10 | avg latency (ms) |
|---|---|---|---|---|---|---|
| mteb/scidocs-reranking | upstream ONNX | 0.8500 | 0.9103 | 0.7437 | 0.7425 | 245.31 |
| mteb/scidocs-reranking | this QInt8 repo | 0.8700 | 0.9228 | 0.7460 | 0.7415 | 195.22 |
| mteb/stackoverflowdupquestions-reranking | upstream ONNX | 0.3800 | 0.5015 | 0.5667 | 0.8000 | 218.96 |
| mteb/stackoverflowdupquestions-reranking | this QInt8 repo | 0.3800 | 0.5019 | 0.5662 | 0.8000 | 170.60 |

Irish proxy evaluation

A dedicated public Irish reranking benchmark was not available at packaging time, so the model was also checked on translated Irish triplet retrieval data from ReliableAI/irish_retrieval_data, used as a pairwise reranking proxy.

| dataset | variant | precision@1 | mrr@10 | ndcg@10 | pairwise accuracy | avg latency (ms) |
|---|---|---|---|---|---|---|
| msmacro_passage_irish | upstream ONNX | 0.7300 | 0.8650 | 0.9004 | 0.7300 | 51.34 |
| msmacro_passage_irish | this QInt8 repo | 0.7200 | 0.8600 | 0.8967 | 0.7200 | 38.34 |
| nq_irish | upstream ONNX | 0.8100 | 0.9050 | 0.9299 | 0.8100 | 50.14 |
| nq_irish | this QInt8 repo | 0.8000 | 0.9000 | 0.9262 | 0.8000 | 42.36 |

Interpretation

  • On the English reranking sets used here, the quantized model was effectively quality-neutral and sometimes slightly better than the full-precision ONNX baseline.
  • On the Irish proxy slices, the quantized model was close to the full-precision baseline, with about a 1-point drop on pairwise accuracy in this sample.
  • CPU latency improved materially versus the full-precision ONNX artifact.
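
The ~1.27x figure follows directly from the latency columns reported above; a quick check with the numbers copied from the tables:

```python
# Reported average latencies (ms), as (upstream ONNX, this QInt8 repo).
latency_pairs = [
    (245.31, 195.22),  # mteb/scidocs-reranking
    (218.96, 170.60),  # mteb/stackoverflowdupquestions-reranking
    (51.34, 38.34),    # msmacro_passage_irish
    (50.14, 42.36),    # nq_irish
]

speedups = [upstream / qint8 for upstream, qint8 in latency_pairs]
avg_speedup = sum(speedups) / len(speedups)
print(round(avg_speedup, 2))  # → 1.27
```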

As always for rerankers, validate on your own corpus and ranking objective before replacing the upstream model in production.

Usage

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "temsa/ms-marco-MiniLM-L-12-v2-onnx-cpu-qint8"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model_path = hf_hub_download(repo_id=repo_id, filename="onnx/model.onnx")
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Score a single query-document pair.
query = "how do I renew my driving licence in ireland"
document = "You can renew your driving licence online if you meet the identity requirements."
encoded = tokenizer(
    [[query, document]], return_tensors="np", truncation=True, padding=True, max_length=128
)

# Keep only the tensors the ONNX graph actually declares as inputs.
session_inputs = {inp.name for inp in session.get_inputs()}
inputs = {name: value.astype(np.int64) for name, value in encoded.items() if name in session_inputs}
scores = session.run(None, inputs)[0]
print(scores)
```
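
The raw output is one unnormalized relevance logit per pair. When reranking several candidates, a common pattern is to map the logits through a sigmoid and sort. A small sketch using hypothetical logits (the values and passage names below are illustrative, not model output):

```python
import numpy as np


def rank_documents(scores, documents):
    # Map raw cross-encoder logits into [0, 1] with a sigmoid,
    # then sort documents from most to least relevant.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=np.float64)))
    order = np.argsort(-probs)
    return [(documents[i], float(probs[i])) for i in order]


# Hypothetical logits for three candidate passages.
docs = ["passage A", "passage B", "passage C"]
ranked = rank_documents([2.1, -0.4, 0.7], docs)
print(ranked[0][0])  # → passage A
```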