TartarusXXX/ckb-en-marian

Central Kurdish


ckb-en Marian model

This repository contains a raw Marian NMT model for Sorani Kurdish (ckb) -> English (en) trained from the files in work/ckb-en of the local OPUS-MT training workspace.

Model summary

Direction: ckb -> en
Architecture: Marian transformer
Subword setup: SentencePiece spm4k-spm4k
Primary uploaded checkpoint: best-bleu
Training dataset selection: InterdialectCorpus, Tatoeba, wikimedia, tico-19, navinaananthan_kurdish_sorani_parallel_corpus
Validation set: openlanguagedata_flores_plus
Test set recipe: openlanguagedata_flores_plus

Best validation metrics seen in training logs

BLEU: 30.2383 at epoch 55 / update 70000
chrF: 56.6711 at epoch 51 / update 64000
Perplexity: 7.0681 at epoch 34 / update 43000

Files

translate_with_marian.py: standalone inference helper for downloaded snapshots
curated-floresdev.spm4k-spm4k.vocab.yml: Marian vocabulary
opus.src.spm4k-model: source SentencePiece model
opus.trg.spm4k-model: target SentencePiece model
Decoder config(s) and checkpoint(s):
best-bleu: curated-floresdev.spm4k-spm4k.transformer.model1.npz.best-bleu.npz
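Before running inference, it can help to confirm that a downloaded snapshot contains the files listed above. The following is a minimal sketch, not part of this repository: it assumes the files sit at the top level of the snapshot, that the repo ID is TartarusXXX/ckb-en-marian, and that the huggingface_hub package is installed. The helper names missing_files and download_snapshot are hypothetical.

```python
from pathlib import Path

# File names copied from the "Files" list in this model card.
# Assumption: they are stored at the top level of the snapshot.
EXPECTED_FILES = [
    "translate_with_marian.py",
    "curated-floresdev.spm4k-spm4k.vocab.yml",
    "opus.src.spm4k-model",
    "opus.trg.spm4k-model",
    "curated-floresdev.spm4k-spm4k.transformer.model1.npz.best-bleu.npz",
]


def missing_files(snapshot_dir):
    """Return the expected file names that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def download_snapshot(repo_id="TartarusXXX/ckb-en-marian"):
    """Download the repository and return the local snapshot path.

    The repo ID is an assumption taken from the page header.
    """
    # Imported here so the file check above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling missing_files(download_snapshot()) should return an empty list if the layout assumption holds; any names it returns point to files that live elsewhere in the snapshot or were not downloaded.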

Usage

This is a raw Marian model, not a Transformers conversion. To run it, you need the marian-decoder and spm_encode binaries available locally.
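A quick way to check whether both binaries are reachable is to look them up on PATH. This is a small sketch using only the Python standard library; the function names find_tools and missing_tools are hypothetical and not part of translate_with_marian.py.

```python
import shutil

REQUIRED_TOOLS = ("marian-decoder", "spm_encode")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the tool names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

If missing_tools() returns a non-empty list, either install the missing binaries or pass explicit paths to the helper script.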

Example:

python translate_with_marian.py input.txt -o output.txt --checkpoint best-bleu

You can also point to custom binaries:

python translate_with_marian.py input.txt -o output.txt \
  --marian-decoder /path/to/marian-decoder \
  --spm-encode /path/to/spm_encode
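For orientation, the sketch below shows the usual shape of a raw Marian inference pipeline: segment the input with spm_encode, decode with marian-decoder, then join the SentencePiece pieces back into plain text. It is an illustration under assumptions, not a description of what translate_with_marian.py actually does. It passes the checkpoint and vocabulary directly with -m and -v rather than using the decoder configs shipped in the repo (whose file names are not listed here), and build_commands, desegment, and translate are hypothetical names.

```python
import subprocess
from pathlib import Path

# File names taken from the "Files" list in this model card.
SRC_SPM = "opus.src.spm4k-model"
VOCAB = "curated-floresdev.spm4k-spm4k.vocab.yml"
CHECKPOINT = "curated-floresdev.spm4k-spm4k.transformer.model1.npz.best-bleu.npz"


def build_commands(snapshot_dir, spm_encode="spm_encode",
                   marian_decoder="marian-decoder"):
    """Build the two command lines; both tools read stdin and write stdout."""
    root = Path(snapshot_dir)
    vocab = str(root / VOCAB)
    encode_cmd = [spm_encode, f"--model={root / SRC_SPM}"]
    # The shared vocabulary is passed once for the source, once for the target.
    decode_cmd = [marian_decoder, "-m", str(root / CHECKPOINT), "-v", vocab, vocab]
    return encode_cmd, decode_cmd


def desegment(line):
    """Undo SentencePiece segmentation: drop spaces, turn U+2581 into spaces."""
    return line.replace(" ", "").replace("\u2581", " ").strip()


def translate(lines, snapshot_dir, **tools):
    """Translate a list of source sentences; requires both binaries locally."""
    encode_cmd, decode_cmd = build_commands(snapshot_dir, **tools)
    text = "\n".join(lines) + "\n"
    pieces = subprocess.run(encode_cmd, input=text, capture_output=True,
                            encoding="utf-8", check=True).stdout
    decoded = subprocess.run(decode_cmd, input=pieces, capture_output=True,
                             encoding="utf-8", check=True).stdout
    return [desegment(line) for line in decoded.splitlines()]
```

The spm_encode and marian_decoder keyword arguments play the same role as the --spm-encode and --marian-decoder options above: they let you point the pipeline at binaries that are not on PATH.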

Notes

The decoder configs in this repo were rewritten to use relative paths so they work from a downloaded Hub snapshot.
Review the licenses of the training datasets and their compatibility before redistributing the model publicly.
