mxbai-edge-colbert-v0-32m SciFact KMeans PF1-6

This is a research checkpoint based on mixedbread-ai/mxbai-edge-colbert-v0-32m, fine-tuned on SciFact training data with pooling-aware distillation. During training, document token embeddings were pooled with k-means, with the pool factor sampled uniformly from 1 to 6 for each batch.
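As a rough illustration of this pooling step (a sketch, not the actual training code), k-means can compress a document's token embeddings by a given pool factor. The function name and shapes below are assumptions chosen to match the model details (300 document tokens, dimension 64):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_pool(token_embeddings: np.ndarray, pool_factor: int) -> np.ndarray:
    """Reduce (num_tokens, dim) token embeddings to roughly
    num_tokens / pool_factor centroids via k-means."""
    num_tokens = token_embeddings.shape[0]
    k = max(1, num_tokens // pool_factor)
    if k >= num_tokens:  # pool factor 1: no compression
        return token_embeddings
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(token_embeddings)
    return km.cluster_centers_

rng = np.random.default_rng(0)
doc = rng.normal(size=(300, 64)).astype(np.float32)  # 300 tokens, dim 64
pf = int(rng.integers(1, 7))  # pool factor sampled uniformly from 1..6
pooled = kmeans_pool(doc, pf)
print(pooled.shape)
```

At pool factor 6 this leaves about 50 embeddings per 300-token document, a roughly 6x reduction in index size.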

This released checkpoint corresponds to the selected final SciFact run used for the public artifact release. It is intended primarily for research, reproduction, and further experimentation with pooling-aware late interaction retrieval.

Model Details

  • Base model: mixedbread-ai/mxbai-edge-colbert-v0-32m
  • Architecture: ColBERT / late interaction retrieval
  • Query length: 48
  • Document length: 300
  • Embedding dimension: 64
  • Training setup: k-means pooling, multi-factor PF1-6
  • Training dataset: stefan-jo/scifact-train-mined-reranker-scores

Intended Use

This model is intended for:

  • reproducing the paper's SciFact pooling-aware fine-tuning results
  • experimenting with pooling-aware late interaction retrieval
  • studying how multi-factor training affects retrieval under document compression

It is not presented as a general-purpose production retriever.

Usage

from pylate import models

# Load the released checkpoint as a PyLate ColBERT model.
model = models.ColBERT(
    model_name_or_path="stefan-jo/mxbai-edge-colbert-v0-32m-scifact-kmeans-pf1-6"
)

# Queries are encoded without pooling.
queries_embeddings = model.encode(
    ["What evidence supports the claim?"],
    is_query=True,
)

# Documents are pooled at encoding time: pool_factor=4 keeps roughly
# one embedding per four tokens, clustered with k-means.
documents_embeddings = model.encode(
    ["The abstract provides evidence relevant to the claim."],
    is_query=False,
    pool_factor=4,
    pool_method="kmeans",
    use_sklearn=True,
)
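Downstream scoring is standard ColBERT late interaction (MaxSim) between query token embeddings and the (pooled) document embeddings. A minimal NumPy sketch, independent of PyLate, with illustrative shapes (48 query tokens, a pooled document of 75 vectors, dimension 64):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT MaxSim: for each query token, take the similarity of its
    best-matching document vector, then sum over query tokens.
    Assumes rows are L2-normalized."""
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_vectors)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(48, 64))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(75, 64))
d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)
```

Pooling only changes the number of document vectors entering this computation; the scoring rule itself is unchanged.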

Evaluation Snapshot

The table below is adapted from the paper's NanoBEIR cross-dataset effects table. All runs use k-means pooling at inference and report NDCG@10.

Dataset     PF   Baseline   FT SciFact KMeans PF1-6   FT FiQA KMeans PF1-6
SciFact     1    0.808      0.817                     0.802
SciFact     2    0.765      0.813                     0.774
SciFact     4    0.649      0.810                     0.808
SciFact     6    0.609      0.795                     0.758
FiQA        1    0.526      0.523                     0.528
FiQA        2    0.488      0.519                     0.505
FiQA        4    0.470      0.490                     0.513
FiQA        6    0.431      0.459                     0.467
NFCorpus    1    0.375      0.372                     0.369
NFCorpus    2    0.370      0.370                     0.378
NFCorpus    4    0.342      0.363                     0.381
NFCorpus    6    0.307      0.363                     0.369
SCIDOCS     1    0.396      0.393                     0.382
SCIDOCS     2    0.371      0.385                     0.375
SCIDOCS     4    0.347      0.374                     0.387
SCIDOCS     6    0.332      0.380                     0.371
Touché2020  1    0.596      0.595                     0.592
Touché2020  2    0.597      0.602                     0.601
Touché2020  4    0.565      0.594                     0.572
Touché2020  6    0.545      0.573                     0.571
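To make the in-domain effect concrete: on SciFact, this checkpoint retains nearly all of its PF1 quality even at pool factor 6, while the baseline degrades sharply. The arithmetic below uses only the SciFact rows of the table above:

```python
# NDCG@10 on SciFact, copied from the table above.
baseline = {1: 0.808, 2: 0.765, 4: 0.649, 6: 0.609}
finetuned = {1: 0.817, 2: 0.813, 4: 0.810, 6: 0.795}  # FT SciFact KMeans PF1-6

for pf in (1, 2, 4, 6):
    ft_retention = finetuned[pf] / finetuned[1]
    base_retention = baseline[pf] / baseline[1]
    print(f"PF{pf}: fine-tuned retains {ft_retention:.1%} of PF1 quality, "
          f"baseline retains {base_retention:.1%}")
```

At pool factor 6 the fine-tuned model keeps about 97% of its PF1 score, versus about 75% for the baseline.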

For full experiments and additional tables, see the accompanying paper and repository.

Training Data Provenance

The training dataset was built from SciFact train data using a mining-and-reranking pipeline:

  • hard negative mining with BAAI/bge-small-en-v1.5
  • teacher scores from BAAI/bge-reranker-v2-gemma
  • distillation training on mined candidate sets with reranker scores

The released training dataset is available separately as stefan-jo/scifact-train-mined-reranker-scores.
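The pipeline's structure can be sketched as follows. This is a toy illustration only: the scoring functions stand in for BAAI/bge-small-en-v1.5 (bi-encoder mining) and BAAI/bge-reranker-v2-gemma (teacher scoring), and all names, shapes, and data here are hypothetical:

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_vecs, positive_idx, k=4):
    """Top-k most similar non-positive documents under the bi-encoder
    (toy stand-in for bge-small-en-v1.5)."""
    sims = doc_vecs @ query_vec
    order = np.argsort(-sims)
    return [int(i) for i in order if i != positive_idx][:k]

def teacher_score(query_vec, doc_vec):
    """Toy stand-in for the cross-encoder reranker's relevance score."""
    return float(doc_vec @ query_vec)

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[7] + 0.05 * rng.normal(size=64)  # query near document 7
query /= np.linalg.norm(query)

# Step 1: mine hard negatives; step 2: score all candidates with the teacher.
negatives = mine_hard_negatives(query, docs, positive_idx=7)
candidates = [7] + negatives
scores = {i: teacher_score(query, docs[i]) for i in candidates}
# `scores` plays the role of the distillation targets used in training.
```

In the real pipeline, the student is then trained to match the teacher's score distribution over each mined candidate set.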

License and Provenance

This model is released under a custom mixed-source notice, reflecting the differing licenses of its upstream components. The underlying SciFact-based training data combines multiple upstream sources, including SciFact annotations and S2ORC-derived corpus content.

Relevant upstream components:

  • base model: mixedbread-ai/mxbai-edge-colbert-v0-32m (Apache-2.0)
  • mining model: BAAI/bge-small-en-v1.5 (MIT)
  • reranker: BAAI/bge-reranker-v2-gemma (Apache-2.0)
  • training data source: SciFact / BEIR-style preprocessing
  • SciFact annotations: CC BY 4.0
  • SciFact corpus source: S2ORC / ODC-By 1.0

Users should review and comply with the upstream attribution and source terms in addition to this repository notice.
