mxbai-edge-colbert-v0-32m FiQA KMeans PF1-6
This is a research checkpoint based on mixedbread-ai/mxbai-edge-colbert-v0-32m, fine-tuned on FiQA training data with pooling-aware distillation. During training, document embeddings were pooled with k-means and the pool factor was sampled uniformly from 1 to 6 on each batch.
The model is released as a companion artifact for the paper Learn to Pool: Lightweight Fine-Tuning for Flexible Multi-Vector Compression. It is intended primarily for research, reproduction, and further experimentation with pooling-aware late interaction retrieval.
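To make the training setup concrete: k-means pooling clusters a document's token embeddings and keeps one centroid per cluster, shrinking the representation by roughly the pool factor. The sketch below illustrates the idea with scikit-learn; `pool_embeddings` is an illustrative helper, not the paper's training code.

```python
import numpy as np
from sklearn.cluster import KMeans

def pool_embeddings(token_embeddings: np.ndarray, pool_factor: int) -> np.ndarray:
    """Cluster token embeddings with k-means and keep one centroid per
    cluster, reducing the vector count by roughly `pool_factor`."""
    num_tokens, _ = token_embeddings.shape
    num_clusters = max(1, num_tokens // pool_factor)
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    kmeans.fit(token_embeddings)
    return kmeans.cluster_centers_  # shape: (num_clusters, dim)

# 300 tokens x 64 dims, matching this model's document settings
rng = np.random.default_rng(0)
doc = rng.random((300, 64)).astype(np.float32)
pooled = pool_embeddings(doc, pool_factor=4)
print(pooled.shape)  # (75, 64)
```

During training, `pool_factor` is resampled uniformly from 1 to 6 per batch, so the model sees every compression level.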
Model Details
- Base model: mixedbread-ai/mxbai-edge-colbert-v0-32m
- Architecture: ColBERT / late interaction retrieval
- Query length: 32 tokens
- Document length: 300 tokens
- Embedding dimension: 64
- Training setup: k-means pooling, multi-factor PF1-6
- Training dataset: stefan-jo/fiqa-train-mined-reranker-scores
Intended Use
This model is intended for:
- reproducing the paper's FiQA pooling-aware fine-tuning results
- experimenting with pooling-aware late interaction retrieval
- studying how multi-factor training affects retrieval under document compression
It is not presented as a general-purpose production retriever.
Usage
```python
from pylate import models

# Load the fine-tuned checkpoint
model = models.ColBERT(
    model_name_or_path="stefan-jo/mxbai-edge-colbert-v0-32m-fiqa-kmeans-pf1-6"
)

# Queries are encoded without pooling
queries_embeddings = model.encode(
    ["What is the Sharpe ratio?"],
    is_query=True,
)

# Documents are pooled with k-means at the chosen pool factor,
# matching the training-time pooling configuration
documents_embeddings = model.encode(
    ["The Sharpe ratio measures return relative to risk."],
    is_query=False,
    pool_factor=4,
    pool_method="kmeans",
    use_sklearn=True,
)
```
Evaluation Snapshot
The table below is adapted from the paper's NanoBEIR cross-dataset effects table. All runs use k-means pooling at inference and report NDCG@10; PF denotes the pool factor.
| Dataset | PF | Baseline | FT SciFact KMeans PF1-6 | FT FiQA KMeans PF1-6 |
|---|---|---|---|---|
| SciFact | 1 | 0.808 | 0.817 | 0.802 |
| SciFact | 2 | 0.765 | 0.813 | 0.774 |
| SciFact | 4 | 0.649 | 0.810 | 0.808 |
| SciFact | 6 | 0.609 | 0.795 | 0.758 |
| FiQA | 1 | 0.526 | 0.523 | 0.528 |
| FiQA | 2 | 0.488 | 0.519 | 0.505 |
| FiQA | 4 | 0.470 | 0.490 | 0.513 |
| FiQA | 6 | 0.431 | 0.459 | 0.467 |
| NFCorpus | 1 | 0.375 | 0.372 | 0.369 |
| NFCorpus | 2 | 0.370 | 0.370 | 0.378 |
| NFCorpus | 4 | 0.342 | 0.363 | 0.381 |
| NFCorpus | 6 | 0.307 | 0.363 | 0.369 |
| SCIDOCS | 1 | 0.396 | 0.393 | 0.382 |
| SCIDOCS | 2 | 0.371 | 0.385 | 0.375 |
| SCIDOCS | 4 | 0.347 | 0.374 | 0.387 |
| SCIDOCS | 6 | 0.332 | 0.380 | 0.371 |
| Touché2020 | 1 | 0.596 | 0.595 | 0.592 |
| Touché2020 | 2 | 0.597 | 0.602 | 0.601 |
| Touché2020 | 4 | 0.565 | 0.594 | 0.572 |
| Touché2020 | 6 | 0.545 | 0.573 | 0.571 |
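For reference, NDCG@10 rewards placing relevant documents near the top of the ranking; a minimal sketch of the metric (simplified, assuming graded relevance labels and a linear gain):

```python
import numpy as np

def ndcg_at_10(relevances) -> float:
    """NDCG@10 for one ranked result list: discounted cumulative gain over
    the top 10 positions, normalized by the ideal ordering's DCG."""
    top = np.asarray(relevances[:10], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, top.size + 2))
    dcg = float((top * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:10]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0
```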
For full experiments and additional tables, see the accompanying paper and repository:
- Paper: stefan-jo.github.io/learn-to-pool/downloads/paper.pdf
- Code and aggregate metrics: stefan-jo/pylate
Training Data Provenance
The training dataset was built from the FiQA training split using a mining-and-reranking pipeline:
- hard negative mining with BAAI/bge-small-en-v1.5
- teacher scores from BAAI/bge-reranker-v2-gemma
- distillation training on mined candidate sets with reranker scores
The released training dataset is available separately as stefan-jo/fiqa-train-mined-reranker-scores.
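The distillation step trains the student to match the teacher reranker's score distribution over each query's candidate set. A minimal sketch of one common objective for this setup (KL divergence between softmax-normalized scores; the paper's exact loss may differ):

```python
import numpy as np

def kd_loss(student_scores: np.ndarray, teacher_scores: np.ndarray) -> float:
    """KL divergence between softmax-normalized teacher and student score
    distributions over one query's candidate documents."""
    def softmax(x: np.ndarray) -> np.ndarray:
        z = np.exp(x - x.max())  # shift for numerical stability
        return z / z.sum()
    p = softmax(teacher_scores)  # target distribution (teacher)
    q = softmax(student_scores)  # model distribution (student)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when the student reproduces the teacher's ranking distribution exactly and grows as the two diverge.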
License and Provenance
This model is released under a custom non-commercial notice because it was fine-tuned on FiQA-2018-derived training data. The official FiQA-2018 source states that the relevant Opinion-based QA data are available only for non-commercial use.
Relevant upstream components:
- base model: mixedbread-ai/mxbai-edge-colbert-v0-32m (Apache-2.0)
- mining model: BAAI/bge-small-en-v1.5 (MIT)
- reranker: BAAI/bge-reranker-v2-gemma (Apache-2.0)
- training data source: FiQA-2018 / BEIR-style preprocessing
Users should review and comply with the upstream dataset terms in addition to this repository notice.