mxbai-edge-colbert-v0-32m SciFact KMeans PF1-6

This is a research checkpoint based on mixedbread-ai/mxbai-edge-colbert-v0-32m, fine-tuned on SciFact training data with pooling-aware distillation. During training, document token embeddings were pooled with k-means, with the pool factor sampled uniformly from 1 to 6 for each batch.
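As a rough illustration of this pooling step (a sketch, not the actual training code), k-means can compress a document's token embeddings by a given pool factor. The function name and shapes below are assumptions chosen to match the model details (300 document tokens, dimension 64):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_pool(token_embeddings: np.ndarray, pool_factor: int) -> np.ndarray:
    """Reduce (num_tokens, dim) token embeddings to roughly
    num_tokens / pool_factor centroids via k-means."""
    num_tokens = token_embeddings.shape[0]
    k = max(1, num_tokens // pool_factor)
    if k >= num_tokens:  # pool factor 1: no compression
        return token_embeddings
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(token_embeddings)
    return km.cluster_centers_

rng = np.random.default_rng(0)
doc = rng.normal(size=(300, 64)).astype(np.float32)  # 300 tokens, dim 64
pf = int(rng.integers(1, 7))  # pool factor sampled uniformly from 1..6
pooled = kmeans_pool(doc, pf)
print(pooled.shape)
```

At pool factor 6 this leaves about 50 embeddings per 300-token document, a roughly 6x reduction in index size.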

This released checkpoint corresponds to the selected final SciFact run used for the public artifact release. It is intended primarily for research, reproduction, and further experimentation with pooling-aware late interaction retrieval.

Model Details

  • Base model: mixedbread-ai/mxbai-edge-colbert-v0-32m
  • Architecture: ColBERT / late interaction retrieval
  • Query length: 48
  • Document length: 300
  • Embedding dimension: 64
  • Training setup: k-means pooling, multi-factor PF1-6
  • Training dataset: stefan-jo/scifact-train-mined-reranker-scores

Intended Use

This model is intended for:

  • reproducing the paper's SciFact pooling-aware fine-tuning results
  • experimenting with pooling-aware late interaction retrieval
  • studying how multi-factor training affects retrieval under document compression

It is not presented as a general-purpose production retriever.

Usage

from pylate import models

# Load the released checkpoint as a PyLate ColBERT model.
model = models.ColBERT(
    model_name_or_path="stefan-jo/mxbai-edge-colbert-v0-32m-scifact-kmeans-pf1-6"
)

# Queries are encoded without pooling.
queries_embeddings = model.encode(
    ["What evidence supports the claim?"],
    is_query=True,
)

# Documents are pooled at encoding time: pool_factor=4 keeps roughly
# one embedding per four tokens, clustered with k-means.
documents_embeddings = model.encode(
    ["The abstract provides evidence relevant to the claim."],
    is_query=False,
    pool_factor=4,
    pool_method="kmeans",
    use_sklearn=True,
)
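Downstream scoring is standard ColBERT late interaction (MaxSim) between query token embeddings and the (pooled) document embeddings. A minimal NumPy sketch, independent of PyLate, with illustrative shapes (48 query tokens, a pooled document of 75 vectors, dimension 64):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT MaxSim: for each query token, take the similarity of its
    best-matching document vector, then sum over query tokens.
    Assumes rows are L2-normalized."""
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_vectors)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(48, 64))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(75, 64))
d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)
```

Pooling only changes the number of document vectors entering this computation; the scoring rule itself is unchanged.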

Evaluation Snapshot

The table below is adapted from the paper's NanoBEIR cross-dataset effects table. All runs use k-means pooling at inference and report NDCG@10.

Dataset     PF   Baseline   FT SciFact KMeans PF1-6   FT FiQA KMeans PF1-6
SciFact     1    0.808      0.817                     0.802
SciFact     2    0.765      0.813                     0.774
SciFact     4    0.649      0.810                     0.808
SciFact     6    0.609      0.795                     0.758
FiQA        1    0.526      0.523                     0.528
FiQA        2    0.488      0.519                     0.505
FiQA        4    0.470      0.490                     0.513
FiQA        6    0.431      0.459                     0.467
NFCorpus    1    0.375      0.372                     0.369
NFCorpus    2    0.370      0.370                     0.378
NFCorpus    4    0.342      0.363                     0.381
NFCorpus    6    0.307      0.363                     0.369
SCIDOCS     1    0.396      0.393                     0.382
SCIDOCS     2    0.371      0.385                     0.375
SCIDOCS     4    0.347      0.374                     0.387
SCIDOCS     6    0.332      0.380                     0.371
Touché2020  1    0.596      0.595                     0.592
Touché2020  2    0.597      0.602                     0.601
Touché2020  4    0.565      0.594                     0.572
Touché2020  6    0.545      0.573                     0.571
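To make the in-domain effect concrete: on SciFact, this checkpoint retains nearly all of its PF1 quality even at pool factor 6, while the baseline degrades sharply. The arithmetic below uses only the SciFact rows of the table above:

```python
# NDCG@10 on SciFact, copied from the table above.
baseline = {1: 0.808, 2: 0.765, 4: 0.649, 6: 0.609}
finetuned = {1: 0.817, 2: 0.813, 4: 0.810, 6: 0.795}  # FT SciFact KMeans PF1-6

for pf in (1, 2, 4, 6):
    ft_retention = finetuned[pf] / finetuned[1]
    base_retention = baseline[pf] / baseline[1]
    print(f"PF{pf}: fine-tuned retains {ft_retention:.1%} of PF1 quality, "
          f"baseline retains {base_retention:.1%}")
```

At pool factor 6 the fine-tuned model keeps about 97% of its PF1 score, versus about 75% for the baseline.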

For full experiments and additional tables, see the accompanying paper and repository.

Training Data Provenance

The training dataset was built from SciFact train data using a mining-and-reranking pipeline:

  • hard negative mining with BAAI/bge-small-en-v1.5
  • teacher scores from BAAI/bge-reranker-v2-gemma
  • distillation training on mined candidate sets with reranker scores

The released training dataset is available separately as stefan-jo/scifact-train-mined-reranker-scores.
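The pipeline's structure can be sketched as follows. This is a toy illustration only: the scoring functions stand in for BAAI/bge-small-en-v1.5 (bi-encoder mining) and BAAI/bge-reranker-v2-gemma (teacher scoring), and all names, shapes, and data here are hypothetical:

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_vecs, positive_idx, k=4):
    """Top-k most similar non-positive documents under the bi-encoder
    (toy stand-in for bge-small-en-v1.5)."""
    sims = doc_vecs @ query_vec
    order = np.argsort(-sims)
    return [int(i) for i in order if i != positive_idx][:k]

def teacher_score(query_vec, doc_vec):
    """Toy stand-in for the cross-encoder reranker's relevance score."""
    return float(doc_vec @ query_vec)

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[7] + 0.05 * rng.normal(size=64)  # query near document 7
query /= np.linalg.norm(query)

# Step 1: mine hard negatives; step 2: score all candidates with the teacher.
negatives = mine_hard_negatives(query, docs, positive_idx=7)
candidates = [7] + negatives
scores = {i: teacher_score(query, docs[i]) for i in candidates}
# `scores` plays the role of the distillation targets used in training.
```

In the real pipeline, the student is then trained to match the teacher's score distribution over each mined candidate set.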

License and Provenance

This model is released under a custom mixed-source notice, reflecting the differing licenses of its upstream components. The underlying SciFact-based training data combines multiple upstream sources, including SciFact annotations and S2ORC-derived corpus content.

Relevant upstream components:

  • base model: mixedbread-ai/mxbai-edge-colbert-v0-32m (Apache-2.0)
  • mining model: BAAI/bge-small-en-v1.5 (MIT)
  • reranker: BAAI/bge-reranker-v2-gemma (Apache-2.0)
  • training data source: SciFact / BEIR-style preprocessing
  • SciFact annotations: CC BY 4.0
  • SciFact corpus source: S2ORC / ODC-By 1.0

Users should review and comply with the upstream attribution and source terms in addition to this repository notice.
