# MiniLM-L6-v2 Cross-Encoder (4-bit NF4 Quantized)
A 4-bit NF4 version of `cross-encoder/ms-marco-MiniLM-L-6-v2` for passage reranking, quantized with bitsandbytes.
## Quantization Details
| Setting | Value |
|---|---|
| Method | bitsandbytes NF4 |
| Bits | 4 |
| Double quantization | Yes |
| Compute dtype | float16 |
| Skipped modules | classifier (kept in fp16) |
| Base model params | 22.7M |
| Quantized weight size | ~17M effective params |
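The settings in the table map onto a `transformers` `BitsAndBytesConfig` along the following lines. This is a sketch of how such a checkpoint could be produced, not the exact script used for this one:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# Quantization settings matching the table above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights
    bnb_4bit_quant_type="nf4",             # NF4 data type
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute dtype
    llm_int8_skip_modules=["classifier"],  # keep the classifier head in fp16
)

# Loading with this config requires bitsandbytes and a CUDA-capable GPU.
model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    quantization_config=bnb_config,
    device_map="auto",
)
```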
## Evaluation
Evaluated on three IR benchmarks using a BM25 (top-100) + neural reranking pipeline.
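The retrieve-then-rerank structure of that pipeline can be sketched in pure Python; `rerank` and `toy_score` below are illustrative stand-ins, not the evaluation code (in the real pipeline, the score function is the cross-encoder and the candidates come from BM25 top-100):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 10) -> list[str]:
    """Stage 2: re-order first-stage candidates by relevance score, descending."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]

# Toy stand-in scorer: counts shared terms (a real pipeline calls the model).
def toy_score(query: str, passage: str) -> float:
    return len(set(query.lower().split()) & set(passage.lower().split()))

docs = ["coral reefs are bleaching", "ancient greek history", "reefs and climate"]
print(rerank("climate impact on coral reefs", docs, toy_score, top_k=2))
# → ['coral reefs are bleaching', 'reefs and climate']
```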
### LitSearch (Academic Literature Search)
| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | — | 0.2951 | 0.3607 | 0.1970 | 0.2287 |
| MiniLM-L6-v2 (fp32) | 23M | 0.4426 | 0.6066 | 0.3445 | 0.3796 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.4426 | 0.6066 | 0.3435 | 0.3822 |
| BGE-reranker-base | 278M | 0.4262 | 0.5574 | 0.3243 | 0.3754 |
| BGE-reranker-v2-m3 | 568M | 0.4426 | 0.6066 | 0.3801 | 0.4070 |
### SciFact (Scientific Fact Verification)
| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | — | 0.5893 | 0.7020 | 0.5416 | 0.5609 |
| MiniLM-L6-v2 (fp32) | 23M | 0.7155 | 0.7628 | 0.6463 | 0.6605 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.7065 | 0.7628 | 0.6396 | 0.6556 |
| BGE-reranker-base | 278M | 0.6952 | 0.7793 | 0.6297 | 0.6481 |
| BGE-reranker-v2-m3 | 568M | 0.7230 | 0.8063 | 0.6460 | 0.6682 |
### NFCorpus (Biomedical IR)
| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | — | 0.1048 | 0.1512 | 0.4470 | 0.2688 |
| MiniLM-L6-v2 (fp32) | 23M | 0.1194 | 0.1649 | 0.5181 | 0.3045 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.1192 | 0.1655 | 0.5155 | 0.3050 |
| BGE-reranker-base | 278M | 0.1119 | 0.1493 | 0.4676 | 0.2717 |
| BGE-reranker-v2-m3 | 568M | 0.1067 | 0.1555 | 0.4808 | 0.2726 |
## Summary
4-bit NF4 quantization preserves near-identical quality across all three benchmarks:
| Dataset | fp32 NDCG@10 | 4-bit NDCG@10 | Δ NDCG@10 (absolute) |
|---|---|---|---|
| LitSearch | 0.3796 | 0.3822 | +0.0026 |
| SciFact | 0.6605 | 0.6556 | −0.0049 |
| NFCorpus | 0.3045 | 0.3050 | +0.0005 |
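The deltas are absolute NDCG@10 differences (4-bit minus fp32); the relative change is small in every case:

```python
# NDCG@10 pairs from the summary table: (fp32, 4-bit NF4).
ndcg = {
    "LitSearch": (0.3796, 0.3822),
    "SciFact":   (0.6605, 0.6556),
    "NFCorpus":  (0.3045, 0.3050),
}
for name, (fp32, nf4) in ndcg.items():
    delta = nf4 - fp32
    print(f"{name}: {delta:+.4f} absolute ({delta / fp32:+.2%} relative)")
```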
## Usage
### With transformers
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
)

query = "What is the impact of climate change on coral reefs?"
passage = "Rising ocean temperatures cause widespread coral bleaching events..."

inputs = tokenizer(
    query, passage,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True,
).to(model.device)

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(f"Relevance score: {score:.4f}")
```
### With sentence-transformers CrossEncoder
```python
from sentence_transformers.cross_encoder import CrossEncoder

model = CrossEncoder(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
    max_length=512,
)

query = "What is the impact of climate change on coral reefs?"
passages = [
    "Rising ocean temperatures cause widespread coral bleaching events...",
    "The history of marine biology dates back to ancient Greece...",
]

pairs = [[query, p] for p in passages]
scores = model.predict(pairs)
print(scores)
```
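The returned scores can then be used to order the candidates. A pure-Python sketch, with dummy scores standing in for the output of `model.predict(pairs)`:

```python
# Pair each passage with its score and sort descending by score.
# Dummy values for illustration; real scores come from the model.
scores = [8.3, -4.1]
passages = [
    "Rising ocean temperatures cause widespread coral bleaching events...",
    "The history of marine biology dates back to ancient Greece...",
]

ranked = sorted(zip(scores, passages), key=lambda sp: sp[0], reverse=True)
for score, passage in ranked:
    print(f"{score:+.2f}  {passage}")
```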
## Technical Notes

- The `classifier` head is kept in fp16 (not quantized) to maintain output precision.
- Requires `bitsandbytes` and a CUDA-capable GPU at inference time.
- Model size on disk: ~17 MB (vs ~88 MB for fp32).
## Citation

Base model:

```bibtex
@misc{ms-marco-MiniLM-L-6-v2,
  title={MS MARCO Cross-Encoder MiniLM-L-6-v2},
  author={Nils Reimers},
  url={https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2},
}
```