ckb-en Marian model
This repository contains a raw Marian NMT model for Sorani Kurdish (ckb) -> English (en) trained from the files in work/ckb-en of the local OPUS-MT training workspace.
Model summary
- Direction: ckb -> en
- Architecture: Marian transformer
- Subword setup: SentencePiece spm4k-spm4k
- Primary uploaded checkpoint: best-bleu
- Training dataset selection: InterdialectCorpus Tatoeba wikimedia tico-19 navinaananthan_kurdish_sorani_parallel_corpus
- Validation set: openlanguagedata_flores_plus
- Test set recipe: openlanguagedata_flores_plus
Best validation metrics seen in training logs
- BLEU: 30.2383 at epoch 55 / update 70000
- chrF: 56.6711 at epoch 51 / update 64000
- Perplexity: 7.0681 at epoch 34 / update 43000
Files
- translate_with_marian.py: standalone inference helper for downloaded snapshots
- curated-floresdev.spm4k-spm4k.vocab.yml: Marian vocabulary
- opus.src.spm4k-model: source SentencePiece model
- opus.trg.spm4k-model: target SentencePiece model
- Decoder config(s) and checkpoint(s):
  - best-bleu: curated-floresdev.spm4k-spm4k.transformer.model1.npz.best-bleu.npz
Usage
This is a raw Marian model, not a Transformers conversion. To run it, you need the marian-decoder and spm_encode binaries available locally.
Example:
python translate_with_marian.py input.txt -o output.txt --checkpoint best-bleu
You can also point to custom binaries:
python translate_with_marian.py input.txt -o output.txt \
--marian-decoder /path/to/marian-decoder \
--spm-encode /path/to/spm_encode
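For readers who want to see what such a helper has to do, here is a minimal sketch of the usual pipeline for a raw Marian model with SentencePiece: segment the input with spm_encode, pipe the pieces through marian-decoder, then undo the segmentation. It is not the code of translate_with_marian.py. The function names and the `decoder_config` argument are hypothetical, and it assumes the decoder config already lists the model and vocabulary files.

```python
import subprocess


def desegment(line: str) -> str:
    """Undo SentencePiece segmentation on one output line.

    Pieces are separated by spaces and word starts carry the U+2581 marker,
    so drop the spaces first, then turn the markers back into spaces.
    """
    return line.replace(" ", "").replace("\u2581", " ").strip()


def build_commands(spm_model, decoder_config,
                   spm_encode="spm_encode", marian_decoder="marian-decoder"):
    """Build the two command lines (hypothetical helper, not from the repo)."""
    encode_cmd = [spm_encode, f"--model={spm_model}", "--output_format=piece"]
    # Assumption: decoder_config is a Marian YAML config naming model and vocab.
    decode_cmd = [marian_decoder, "-c", decoder_config]
    return encode_cmd, decode_cmd


def translate(lines, spm_model, decoder_config, **binaries):
    """Translate a list of source sentences; requires both binaries installed."""
    encode_cmd, decode_cmd = build_commands(spm_model, decoder_config, **binaries)
    text = "\n".join(lines) + "\n"
    pieces = subprocess.run(encode_cmd, input=text, capture_output=True,
                            encoding="utf-8", check=True).stdout
    raw = subprocess.run(decode_cmd, input=pieces, capture_output=True,
                         encoding="utf-8", check=True).stdout
    return [desegment(line) for line in raw.splitlines()]
```

Only the source-side model (opus.src.spm4k-model) is needed for encoding. The target side is restored by plain string replacement, which is why the helper requires spm_encode but not spm_decode.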
Notes
- The decoder configs in this repo were rewritten to use relative paths so they work from a downloaded Hub snapshot.
- Review dataset and license compatibility before redistributing the model publicly.
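The relative-path rewrite mentioned in the first note can be illustrated with a short sketch. It is an assumption about how such a rewrite might be done, not the script used for this repo: it takes an already-parsed decoder config as a dict, reduces the `models` and `vocabs` entries to bare file names, and sets Marian's `relative-paths` option so that paths resolve against the config file's location. The function name and the example workspace path are hypothetical.

```python
import posixpath


def relativize_decoder_config(config: dict) -> dict:
    """Return a copy of a parsed Marian decoder config with bare file names.

    Assumes the model and vocabulary files sit next to the config file,
    as they would in a flat downloaded snapshot.
    """
    out = dict(config)
    for key in ("models", "vocabs"):
        if key in out:
            out[key] = [posixpath.basename(path) for path in out[key]]
    # Marian resolves paths relative to the config file when this is set.
    out["relative-paths"] = True
    return out


# Hypothetical absolute path from a training workspace.
example = {
    "models": ["/hypothetical/work/ckb-en/"
               "curated-floresdev.spm4k-spm4k.transformer.model1.npz.best-bleu.npz"],
    "vocabs": ["/hypothetical/work/ckb-en/curated-floresdev.spm4k-spm4k.vocab.yml"] * 2,
}
portable = relativize_decoder_config(example)
```

Other keys in the config are copied through unchanged, and the input dict is left untouched.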