ckb-en Marian model

This repository contains a raw Marian NMT model for Sorani Kurdish (ckb) -> English (en), trained from the files in work/ckb-en of the local OPUS-MT training workspace.

Model summary

  • Direction: ckb -> en
  • Architecture: Marian transformer
  • Subword setup: SentencePiece spm4k-spm4k
  • Primary uploaded checkpoint: best-bleu
  • Training dataset selection: InterdialectCorpus, Tatoeba, wikimedia, tico-19, navinaananthan_kurdish_sorani_parallel_corpus
  • Validation set: openlanguagedata_flores_plus
  • Test set recipe: openlanguagedata_flores_plus

Best validation metrics observed in the training logs

  • BLEU: 30.2383 at epoch 55 / update 70000
  • chrF: 56.6711 at epoch 51 / update 64000
  • Perplexity: 7.0681 at epoch 34 / update 43000

Files

  • translate_with_marian.py: standalone inference helper for downloaded snapshots
  • curated-floresdev.spm4k-spm4k.vocab.yml: Marian vocabulary
  • opus.src.spm4k-model: source SentencePiece model
  • opus.trg.spm4k-model: target SentencePiece model
  • Decoder config(s) and checkpoint(s):
  • best-bleu: curated-floresdev.spm4k-spm4k.transformer.model1.npz.best-bleu.npz

Usage

This is a raw Marian model, not a Transformers conversion. To run it, you need the marian-decoder and spm_encode binaries available locally.

Example:

python translate_with_marian.py input.txt -o output.txt --checkpoint best-bleu

You can also point to custom binaries:

python translate_with_marian.py input.txt -o output.txt \
  --marian-decoder /path/to/marian-decoder \
  --spm-encode /path/to/spm_encode
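Conceptually, the helper chains SentencePiece encoding, Marian decoding, and SentencePiece decoding. A rough sketch of the command line it wraps (flag names follow the standard spm_encode/spm_decode and marian-decoder CLIs; the exact invocation inside translate_with_marian.py may differ):

```python
import shlex

def build_pipeline(model, vocab, src_spm, trg_spm, beam_size=6):
    """Assemble the shell pipeline a raw-Marian translation run uses:
    encode source text into pieces, decode with Marian, then strip pieces."""
    encode = ["spm_encode", f"--model={src_spm}", "--output_format=piece"]
    decode = [
        "marian-decoder",
        "-m", model,         # checkpoint, e.g. the best-bleu .npz
        "-v", vocab, vocab,  # Marian expects source and target vocabs
        "-b", str(beam_size),
        "--quiet",
    ]
    detok = ["spm_decode", f"--model={trg_spm}"]
    return " | ".join(shlex.join(cmd) for cmd in (encode, decode, detok))
```

Feeding input.txt into the resulting pipeline and redirecting stdout to output.txt is equivalent, at this level of detail, to what the helper script automates.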

Notes

  • The decoder configs in this repo were rewritten to use relative paths so they work from a downloaded Hub snapshot.
  • Review dataset and license compatibility before redistributing the model publicly.