wmt22-comet-da-pruned-k4

A pruned version of Unbabel/wmt22-comet-da for faster, smaller machine-translation quality estimation with minimal accuracy loss.

| Variant | Disk size | Params | Pearson vs full | MAE vs full |
|---|---|---|---|---|
| Original wmt22-comet-da | ~2200 MB | 580.9 M | 1.0000 | 0.0000 |
| This model | 2122 MB | 530.5 M | 0.9827 | 0.0467 |
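Using the approximate sizes from the table above, the savings work out as follows (plain arithmetic, no model code involved):

```python
# Size/parameter reduction implied by the table above.
orig_params, pruned_params = 580.9e6, 530.5e6
orig_mb, pruned_mb = 2200, 2122  # original size is approximate (~2200 MB)

print(f"params removed: {(orig_params - pruned_params) / 1e6:.1f} M "
      f"({(1 - pruned_params / orig_params) * 100:.1f}%)")
print(f"disk saved:     {orig_mb - pruned_mb} MB "
      f"({(1 - pruned_mb / orig_mb) * 100:.1f}%)")
```

So the pruning removes roughly 8.7% of the parameters while the checkpoint on disk shrinks by only about 3.5%, consistent with the embeddings and regressor head being kept intact.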

What was changed

An internal compression pipeline was applied to the base model to reduce size and inference cost while preserving evaluation quality. The pipeline preserves the regressor head and embeddings; only the encoder backbone is modified.
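The exact pruning schedule is not documented in this card. Purely as an illustration, dropping encoder blocks amounts to selecting which transformer-layer indices survive; a minimal sketch under assumed parameters (the keep-every-n rule, the drop interval, and the 24-layer XLM-R-large backbone are assumptions, not the actual recipe):

```python
def select_layers(num_layers: int, drop_every: int) -> list[int]:
    """Return indices of encoder layers to keep, dropping every
    `drop_every`-th block (1-based). Hypothetical rule for illustration;
    the actual pipeline behind this model is not documented here."""
    return [i for i in range(num_layers) if (i + 1) % drop_every != 0]

# XLM-R-large (the wmt22-comet-da backbone) has 24 encoder layers.
kept = select_layers(24, 6)   # drops layers at 1-based positions 6, 12, 18, 24
print(len(kept))              # 20
```

After the surviving blocks are reindexed, the layerwise-attention weights must also be reshaped to match the reduced layer count, which the bundled loader handles.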

Usage

```shell
pip install "unbabel-comet" "setuptools<81" huggingface_hub
```

```python
import sys

from huggingface_hub import snapshot_download

# Download the repo and make its bundled load.py importable.
folder = snapshot_download(repo_id="solailabs/wmt22-comet-da-pruned-k4")
sys.path.insert(0, folder)
from load import load_model

model = load_model()
scores = model.predict(
    [{"src": "Hello world.", "mt": "Bonjour le monde.", "ref": "Bonjour le monde."}],
    batch_size=8, gpus=0, num_workers=2,
)
print(scores["scores"])
```

The bundled load.py handles base-model download, layer pruning, layerwise-attention reshaping, and (for the int8 variant) quantization automatically. The first call downloads the original wmt22-comet-da (~2.2 GB, cached locally); subsequent calls load from the cache.

Real-world benchmark: correlation with human DA scores

Evaluated on 1,200 segments from the WMT17 DA human-evaluation set (RicardoRei/wmt-da-human-evaluation), stratified across 12 language pairs (en↔cs, de, fi, ru, tr, zh). This is the standard way to measure COMET quality: how well do model scores correlate with human judgments?
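The correlation statistics reported below can be reproduced with any statistics library; as a dependency-free sketch over paired (model, human) scores, with illustrative helper names and hypothetical data:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(v):
    """Map each value to its rank position (ties not handled)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    # Spearman correlation is Pearson computed over ranks.
    return pearson(ranks(xs), ranks(ys))

model_scores = [0.83, 0.35, 0.79, 0.41]   # hypothetical COMET scores
human_scores = [78.0, 22.0, 70.0, 30.0]   # hypothetical DA judgments
print(pearson(model_scores, human_scores))
print(spearman(model_scores, human_scores))
```

In practice `scipy.stats.pearsonr` / `spearmanr` do the same job with proper tie handling.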

| Model | Disk | Pearson (human) | Spearman (human) | Agreement with full |
|---|---|---|---|---|
| Original wmt22-comet-da | 2200 MB | 0.6415 | 0.6724 | 1.000 |
| This model | 2122 MB | 0.6181 | 0.6356 | 0.8904 |

Interpretation: this variant's Pearson correlation with human judgments is 0.0234 lower than the original's (0.6181 vs. 0.6415). Good-vs-bad translation separation is preserved; fine-grained segment-level ranking is slightly degraded.

Synthetic quality check (22 curated multilingual pairs)

Tested on 22 src/mt/ref triples across 11 languages (fr, de, es, it, pt, ja, zh, ko, ar, ru, hi), each language contributing one good and one bad translation.

| Metric | Value |
|---|---|
| Pearson vs full model | 0.9827 |
| Spearman vs full model | 0.9827 |
| Mean absolute score error | 0.0467 |
| Good > Bad discrimination | 11/11 languages correct |
| Good-MT mean score (full → this model) | 0.951 → 0.831 |
| Bad-MT mean score (full → this model) | 0.404 → 0.354 |

The model retains the full model's ability to separate good from bad translations across every language tested. See per_case_results.json for per-pair scores.
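The "Good > Bad discrimination" row is a simple per-language pairwise check: for each language, does the good translation outscore the bad one? A sketch with hypothetical score pairs (the real values live in per_case_results.json and are not reproduced here):

```python
# Hypothetical per-language (good_mt_score, bad_mt_score) pairs.
pairs = {
    "fr": (0.84, 0.31),
    "de": (0.82, 0.39),
    "ja": (0.80, 0.35),
}

# Count languages where the good translation outscores the bad one.
correct = sum(good > bad for good, bad in pairs.values())
print(f"Good > Bad discrimination: {correct}/{len(pairs)} languages correct")
```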

Notes

  • No fine-tuning was performed; weights are derived directly from the base model.
  • Tested on Apple M-series and x86 Linux.

Limitations

  • The internal evaluation set is small (22 multilingual pairs); on larger WMT test sets, expect correlations broadly in line with, but not identical to, the numbers reported above.
  • Behavior outside the languages listed in the metadata is not guaranteed.

License & attribution

Apache-2.0 (inherited from base model).

Base model: Unbabel/wmt22-comet-da by Unbabel; please cite their paper:

```bibtex
@inproceedings{rei-etal-2022-comet,
  title={{COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task}},
  author={Rei, Ricardo and others},
  booktitle={Proceedings of the Seventh Conference on Machine Translation},
  year={2022}
}
```