wmt22-comet-da-pruned-k4

A pruned version of Unbabel/wmt22-comet-da for faster, smaller machine-translation quality estimation with minimal accuracy loss.

| Variant | Disk size | Params | Pearson vs full | MAE vs full |
|---|---|---|---|---|
| Original wmt22-comet-da | ~2200 MB | 580.9 M | 1.0000 | 0.0000 |
| This model | 2122 MB | 530.5 M | 0.9827 | 0.0467 |
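Using the approximate sizes from the table above, the savings work out as follows (plain arithmetic, no model code involved):

```python
# Size/parameter reduction implied by the table above.
orig_params, pruned_params = 580.9e6, 530.5e6
orig_mb, pruned_mb = 2200, 2122  # original size is approximate (~2200 MB)

print(f"params removed: {(orig_params - pruned_params) / 1e6:.1f} M "
      f"({(1 - pruned_params / orig_params) * 100:.1f}%)")
print(f"disk saved:     {orig_mb - pruned_mb} MB "
      f"({(1 - pruned_mb / orig_mb) * 100:.1f}%)")
```

So the pruning removes roughly 8.7% of the parameters while the checkpoint on disk shrinks by only about 3.5%, consistent with the embeddings and regressor head being kept intact.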

What was changed

An internal compression pipeline was applied to the base model to reduce size and inference cost while preserving evaluation quality. The pipeline preserves the regressor head and embeddings; only the encoder backbone is modified.
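The exact pruning schedule is not documented in this card. Purely as an illustration, dropping encoder blocks amounts to selecting which transformer-layer indices survive; a minimal sketch under assumed parameters (the keep-every-n rule, the drop interval, and the 24-layer XLM-R-large backbone are assumptions, not the actual recipe):

```python
def select_layers(num_layers: int, drop_every: int) -> list[int]:
    """Return indices of encoder layers to keep, dropping every
    `drop_every`-th block (1-based). Hypothetical rule for illustration;
    the actual pipeline behind this model is not documented here."""
    return [i for i in range(num_layers) if (i + 1) % drop_every != 0]

# XLM-R-large (the wmt22-comet-da backbone) has 24 encoder layers.
kept = select_layers(24, 6)   # drops layers at 1-based positions 6, 12, 18, 24
print(len(kept))              # 20
```

After the surviving blocks are reindexed, the layerwise-attention weights must also be reshaped to match the reduced layer count, which the bundled loader handles.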

Usage

```shell
pip install "unbabel-comet" "setuptools<81" huggingface_hub
```

```python
import sys

from huggingface_hub import snapshot_download

# Download the repo and make its bundled load.py importable.
folder = snapshot_download(repo_id="solailabs/wmt22-comet-da-pruned-k4")
sys.path.insert(0, folder)
from load import load_model

model = load_model()
scores = model.predict(
    [{"src": "Hello world.", "mt": "Bonjour le monde.", "ref": "Bonjour le monde."}],
    batch_size=8, gpus=0, num_workers=2,
)
print(scores["scores"])
```

The bundled load.py handles base-model download, layer pruning, layerwise-attention reshaping, and (for the int8 variant) quantization automatically. The first call downloads the original wmt22-comet-da (~2.2 GB, cached locally); subsequent calls load from the cache.

Real-world benchmark: correlation with human DA scores

Evaluated on 1,200 segments from the WMT17 DA human-evaluation set (RicardoRei/wmt-da-human-evaluation), stratified across 12 language pairs (en↔cs, de, fi, ru, tr, zh). This is the standard way to measure COMET quality: how well do model scores correlate with human judgments?
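The correlation statistics reported below can be reproduced with any statistics library; as a dependency-free sketch over paired (model, human) scores, with illustrative helper names and hypothetical data:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(v):
    """Map each value to its rank position (ties not handled)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    # Spearman correlation is Pearson computed over ranks.
    return pearson(ranks(xs), ranks(ys))

model_scores = [0.83, 0.35, 0.79, 0.41]   # hypothetical COMET scores
human_scores = [78.0, 22.0, 70.0, 30.0]   # hypothetical DA judgments
print(pearson(model_scores, human_scores))
print(spearman(model_scores, human_scores))
```

In practice `scipy.stats.pearsonr` / `spearmanr` do the same job with proper tie handling.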

| Model | Disk | Pearson (human) | Spearman (human) | Agreement with full |
|---|---|---|---|---|
| Original wmt22-comet-da | 2200 MB | 0.6415 | 0.6724 | 1.000 |
| This model | 2122 MB | 0.6181 | 0.6356 | 0.8904 |

Interpretation: this variant's Pearson correlation with human judgments is 0.0234 lower than the original's (0.6181 vs. 0.6415). Good-vs-bad translation separation is preserved; fine-grained segment-level ranking is slightly degraded.

Synthetic quality check (22 curated multilingual pairs)

Tested on 22 src/mt/ref triples across 11 languages (fr, de, es, it, pt, ja, zh, ko, ar, ru, hi), each language contributing one good and one bad translation.

| Metric | Value |
|---|---|
| Pearson vs full model | 0.9827 |
| Spearman vs full model | 0.9827 |
| Mean absolute score error | 0.0467 |
| Good > Bad discrimination | 11/11 languages correct |
| Good-MT mean score (full → this model) | 0.951 → 0.831 |
| Bad-MT mean score (full → this model) | 0.404 → 0.354 |

The model retains the full model's ability to separate good from bad translations across every language tested. See per_case_results.json for per-pair scores.
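The "Good > Bad discrimination" row is a simple per-language pairwise check: for each language, does the good translation outscore the bad one? A sketch with hypothetical score pairs (the real values live in per_case_results.json and are not reproduced here):

```python
# Hypothetical per-language (good_mt_score, bad_mt_score) pairs.
pairs = {
    "fr": (0.84, 0.31),
    "de": (0.82, 0.39),
    "ja": (0.80, 0.35),
}

# Count languages where the good translation outscores the bad one.
correct = sum(good > bad for good, bad in pairs.values())
print(f"Good > Bad discrimination: {correct}/{len(pairs)} languages correct")
```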

Notes

  • No fine-tuning was performed; weights are derived directly from the base model.
  • Tested on Apple M-series and x86 Linux.

Limitations

  • The internal evaluation set is small (22 multilingual pairs); on larger WMT test sets, expect correlations broadly in line with, but not identical to, the numbers reported above.
  • Behavior outside the languages listed in the metadata is not guaranteed.

License & attribution

Apache-2.0 (inherited from base model).

Base model: Unbabel/wmt22-comet-da by Unbabel; please cite their paper:

```bibtex
@inproceedings{rei-etal-2022-comet,
  title={{COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task}},
  author={Rei, Ricardo and others},
  booktitle={Proceedings of the Seventh Conference on Machine Translation},
  year={2022}
}
```