# MiniLM-L6-v2 Cross-Encoder (4-bit NF4 Quantized)

A 4-bit NF4-quantized version of `cross-encoder/ms-marco-MiniLM-L-6-v2` for passage reranking, produced with bitsandbytes.

## Quantization Details

| Setting | Value |
|---|---|
| Method | bitsandbytes NF4 |
| Bits | 4 |
| Double quantization | Yes |
| Compute dtype | float16 |
| Skipped modules | classifier (kept in fp16) |
| Base model params | 22.7M |
| Quantized weight size | ~17M effective params |
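The settings above correspond to a `BitsAndBytesConfig` along these lines (a sketch of the likely quantization setup, not the exact script used to produce this checkpoint):

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# NF4 with double quantization and fp16 compute, matching the table above.
# llm_int8_skip_modules keeps the classifier head unquantized.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["classifier"],
)

model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    quantization_config=bnb_config,
    device_map="auto",
)
```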

## Evaluation

Evaluated on three IR benchmarks using a BM25 (top-100) + neural reranking pipeline.
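The retrieve-then-rerank pipeline works in two stages: a cheap lexical retriever shortlists candidates, then the cross-encoder rescores only that shortlist. A minimal pure-Python sketch (the function names are hypothetical, and a toy term-overlap scorer stands in for both BM25 and the neural model):

```python
def lexical_score(query, passage):
    # Toy stand-in for BM25: count passage tokens that appear in the query.
    q_terms = set(query.lower().split())
    return sum(1 for t in passage.lower().split() if t in q_terms)

def rerank_pipeline(query, corpus, cross_encoder_score, k_retrieve=100, k_final=10):
    # Stage 1: lexical retrieval over the whole corpus (BM25 top-100 in this card's setup).
    candidates = sorted(corpus, key=lambda p: lexical_score(query, p), reverse=True)[:k_retrieve]
    # Stage 2: expensive neural scoring on the shortlist only.
    scored = [(p, cross_encoder_score(query, p)) for p in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [p for p, _ in scored[:k_final]]

corpus = [
    "coral bleaching is driven by rising ocean temperatures",
    "ancient greek marine biology",
    "reefs and climate change impacts on coral ecosystems",
]
top = rerank_pipeline(
    "climate change coral reefs",
    corpus,
    cross_encoder_score=lambda q, p: lexical_score(q, p),
    k_retrieve=2,
    k_final=1,
)
print(top)
```

In the real pipeline, `cross_encoder_score` would be a call to the quantized model shown under Usage.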

### LitSearch (Academic Literature Search)

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | – | 0.2951 | 0.3607 | 0.1970 | 0.2287 |
| MiniLM-L6-v2 (fp32) | 23M | 0.4426 | 0.6066 | 0.3445 | 0.3796 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.4426 | 0.6066 | 0.3435 | 0.3822 |
| BGE-reranker-base | 278M | 0.4262 | 0.5574 | 0.3243 | 0.3754 |
| BGE-reranker-v2-m3 | 568M | 0.4426 | 0.6066 | 0.3801 | 0.4070 |

### SciFact (Scientific Fact Verification)

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | – | 0.5893 | 0.7020 | 0.5416 | 0.5609 |
| MiniLM-L6-v2 (fp32) | 23M | 0.7155 | 0.7628 | 0.6463 | 0.6605 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.7065 | 0.7628 | 0.6396 | 0.6556 |
| BGE-reranker-base | 278M | 0.6952 | 0.7793 | 0.6297 | 0.6481 |
| BGE-reranker-v2-m3 | 568M | 0.7230 | 0.8063 | 0.6460 | 0.6682 |

### NFCorpus (Biomedical IR)

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | – | 0.1048 | 0.1512 | 0.4470 | 0.2688 |
| MiniLM-L6-v2 (fp32) | 23M | 0.1194 | 0.1649 | 0.5181 | 0.3045 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.1192 | 0.1655 | 0.5155 | 0.3050 |
| BGE-reranker-base | 278M | 0.1119 | 0.1493 | 0.4676 | 0.2717 |
| BGE-reranker-v2-m3 | 568M | 0.1067 | 0.1555 | 0.4808 | 0.2726 |

## Summary

4-bit NF4 quantization preserves near-identical quality across all three benchmarks:

| Dataset | fp32 NDCG@10 | 4-bit NDCG@10 | Δ (absolute) |
|---|---|---|---|
| LitSearch | 0.3796 | 0.3822 | +0.0026 |
| SciFact | 0.6605 | 0.6556 | −0.0049 |
| NFCorpus | 0.3045 | 0.3050 | +0.0005 |
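The deltas are straightforward to recompute from the per-benchmark tables, in absolute NDCG points and as a relative change:

```python
# fp32 vs. 4-bit NDCG@10, copied from the tables above.
scores = {
    "LitSearch": (0.3796, 0.3822),
    "SciFact": (0.6605, 0.6556),
    "NFCorpus": (0.3045, 0.3050),
}

for name, (fp32, nf4) in scores.items():
    delta = nf4 - fp32
    rel = 100 * delta / fp32  # relative change in percent
    print(f"{name}: {delta:+.4f} absolute ({rel:+.2f}%)")
```

Even the largest gap (SciFact, −0.74% relative) is well within typical run-to-run noise for IR benchmarks of this size.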

## Usage

### With transformers

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
)

query = "What is the impact of climate change on coral reefs?"
passage = "Rising ocean temperatures cause widespread coral bleaching events..."

inputs = tokenizer(
    query, passage,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True,
).to(model.device)

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"Relevance score: {score:.4f}")
```

### With sentence-transformers CrossEncoder

```python
from sentence_transformers.cross_encoder import CrossEncoder

model = CrossEncoder(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
    max_length=512,
)

query = "What is the impact of climate change on coral reefs?"
passages = [
    "Rising ocean temperatures cause widespread coral bleaching events...",
    "The history of marine biology dates back to ancient Greece...",
]

pairs = [[query, p] for p in passages]
scores = model.predict(pairs)
print(scores)
```
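`predict` returns one raw relevance logit per pair; to rerank, pair each passage with its score and sort descending. A pure-Python sketch (the scores here are dummy stand-ins for real model output):

```python
passages = [
    "Rising ocean temperatures cause widespread coral bleaching events...",
    "The history of marine biology dates back to ancient Greece...",
]
scores = [8.2, -4.1]  # dummy values standing in for model.predict(pairs)

# Highest score first = most relevant passage first.
ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
for passage, score in ranked:
    print(f"{score:+.2f}  {passage}")
```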

## Technical Notes

- The classifier head is kept in fp16 (not quantized) to maintain output precision.
- Requires bitsandbytes and a CUDA-capable GPU at inference time.
- Model size on disk: ~17 MB (vs ~88 MB for fp32).

## Citation

Base model:

```bibtex
@misc{ms-marco-MiniLM-L-6-v2,
  title={MS MARCO Cross-Encoder MiniLM-L-6-v2},
  author={Reimers, Nils},
  url={https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2},
}
```