# MiniLM-L6-v2 Cross-Encoder (4-bit NF4 Quantized)

A 4-bit NF4-quantized version of `cross-encoder/ms-marco-MiniLM-L-6-v2` for passage reranking, produced with bitsandbytes.

## Quantization Details

| Setting | Value |
|---|---|
| Method | bitsandbytes NF4 |
| Bits | 4 |
| Double quantization | Yes |
| Compute dtype | float16 |
| Skipped modules | classifier (kept in fp16) |
| Base model params | 22.7M |
| Quantized weight size | ~17M effective params |
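The settings above correspond to a `BitsAndBytesConfig` along these lines (a sketch of the likely quantization setup, not the exact script used to produce this checkpoint):

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# NF4 with double quantization and fp16 compute, matching the table above.
# llm_int8_skip_modules keeps the classifier head unquantized.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["classifier"],
)

model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    quantization_config=bnb_config,
    device_map="auto",
)
```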

## Evaluation

Evaluated on three IR benchmarks using a BM25 (top-100) + neural reranking pipeline.
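The retrieve-then-rerank pipeline works in two stages: a cheap lexical retriever shortlists candidates, then the cross-encoder rescores only that shortlist. A minimal pure-Python sketch (the function names are hypothetical, and a toy term-overlap scorer stands in for both BM25 and the neural model):

```python
def lexical_score(query, passage):
    # Toy stand-in for BM25: count passage tokens that appear in the query.
    q_terms = set(query.lower().split())
    return sum(1 for t in passage.lower().split() if t in q_terms)

def rerank_pipeline(query, corpus, cross_encoder_score, k_retrieve=100, k_final=10):
    # Stage 1: lexical retrieval over the whole corpus (BM25 top-100 in this card's setup).
    candidates = sorted(corpus, key=lambda p: lexical_score(query, p), reverse=True)[:k_retrieve]
    # Stage 2: expensive neural scoring on the shortlist only.
    scored = [(p, cross_encoder_score(query, p)) for p in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [p for p, _ in scored[:k_final]]

corpus = [
    "coral bleaching is driven by rising ocean temperatures",
    "ancient greek marine biology",
    "reefs and climate change impacts on coral ecosystems",
]
top = rerank_pipeline(
    "climate change coral reefs",
    corpus,
    cross_encoder_score=lambda q, p: lexical_score(q, p),
    k_retrieve=2,
    k_final=1,
)
print(top)
```

In the real pipeline, `cross_encoder_score` would be a call to the quantized model shown under Usage.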

### LitSearch (Academic Literature Search)

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | – | 0.2951 | 0.3607 | 0.1970 | 0.2287 |
| MiniLM-L6-v2 (fp32) | 23M | 0.4426 | 0.6066 | 0.3445 | 0.3796 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.4426 | 0.6066 | 0.3435 | 0.3822 |
| BGE-reranker-base | 278M | 0.4262 | 0.5574 | 0.3243 | 0.3754 |
| BGE-reranker-v2-m3 | 568M | 0.4426 | 0.6066 | 0.3801 | 0.4070 |

### SciFact (Scientific Fact Verification)

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | – | 0.5893 | 0.7020 | 0.5416 | 0.5609 |
| MiniLM-L6-v2 (fp32) | 23M | 0.7155 | 0.7628 | 0.6463 | 0.6605 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.7065 | 0.7628 | 0.6396 | 0.6556 |
| BGE-reranker-base | 278M | 0.6952 | 0.7793 | 0.6297 | 0.6481 |
| BGE-reranker-v2-m3 | 568M | 0.7230 | 0.8063 | 0.6460 | 0.6682 |

### NFCorpus (Biomedical IR)

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | – | 0.1048 | 0.1512 | 0.4470 | 0.2688 |
| MiniLM-L6-v2 (fp32) | 23M | 0.1194 | 0.1649 | 0.5181 | 0.3045 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.1192 | 0.1655 | 0.5155 | 0.3050 |
| BGE-reranker-base | 278M | 0.1119 | 0.1493 | 0.4676 | 0.2717 |
| BGE-reranker-v2-m3 | 568M | 0.1067 | 0.1555 | 0.4808 | 0.2726 |

## Summary

4-bit NF4 quantization preserves near-identical quality across all three benchmarks:

| Dataset | fp32 NDCG@10 | 4-bit NDCG@10 | Δ (absolute) |
|---|---|---|---|
| LitSearch | 0.3796 | 0.3822 | +0.0026 |
| SciFact | 0.6605 | 0.6556 | −0.0049 |
| NFCorpus | 0.3045 | 0.3050 | +0.0005 |
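The deltas are straightforward to recompute from the per-benchmark tables, in absolute NDCG points and as a relative change:

```python
# fp32 vs. 4-bit NDCG@10, copied from the tables above.
scores = {
    "LitSearch": (0.3796, 0.3822),
    "SciFact": (0.6605, 0.6556),
    "NFCorpus": (0.3045, 0.3050),
}

for name, (fp32, nf4) in scores.items():
    delta = nf4 - fp32
    rel = 100 * delta / fp32  # relative change in percent
    print(f"{name}: {delta:+.4f} absolute ({rel:+.2f}%)")
```

Even the largest gap (SciFact, −0.74% relative) is well within typical run-to-run noise for IR benchmarks of this size.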

## Usage

### With transformers

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
)

query = "What is the impact of climate change on coral reefs?"
passage = "Rising ocean temperatures cause widespread coral bleaching events..."

inputs = tokenizer(
    query, passage,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True,
).to(model.device)

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"Relevance score: {score:.4f}")
```

### With sentence-transformers CrossEncoder

```python
from sentence_transformers.cross_encoder import CrossEncoder

model = CrossEncoder(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
    max_length=512,
)

query = "What is the impact of climate change on coral reefs?"
passages = [
    "Rising ocean temperatures cause widespread coral bleaching events...",
    "The history of marine biology dates back to ancient Greece...",
]

pairs = [[query, p] for p in passages]
scores = model.predict(pairs)
print(scores)
```
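`predict` returns one raw relevance logit per pair; to rerank, pair each passage with its score and sort descending. A pure-Python sketch (the scores here are dummy stand-ins for real model output):

```python
passages = [
    "Rising ocean temperatures cause widespread coral bleaching events...",
    "The history of marine biology dates back to ancient Greece...",
]
scores = [8.2, -4.1]  # dummy values standing in for model.predict(pairs)

# Highest score first = most relevant passage first.
ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
for passage, score in ranked:
    print(f"{score:+.2f}  {passage}")
```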

## Technical Notes

- The classifier head is kept in fp16 (not quantized) to maintain output precision.
- Requires bitsandbytes and a CUDA-capable GPU at inference time.
- Model size on disk: ~17 MB (vs ~88 MB for fp32).

## Citation

Base model:

```bibtex
@misc{ms-marco-MiniLM-L-6-v2,
  title={MS MARCO Cross-Encoder MiniLM-L-6-v2},
  author={Reimers, Nils},
  url={https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2},
}
```