en-ckb Marian model

This repository contains a Marian NMT model for English (en) -> Sorani Kurdish (ckb) trained from the local en-ckb directory of the OPUS-MT training workspace.

Model summary

  • Direction: en -> ckb
  • Architecture: Marian transformer
  • Subword setup: SentencePiece spm4k-spm4k
  • Primary uploaded checkpoint: best-chrf
  • Training dataset selection: InterdialectCorpus, Tatoeba, wikimedia, tico-19, navinaananthan_kurdish_sorani_parallel_corpus
  • Validation set: openlanguagedata_flores_plus
  • Test set recipe: openlanguagedata_flores_plus

Best validation metrics seen in training logs

  • BLEU: 14.2475 at epoch 42 / update 55000
  • chrF: 45.1146 at epoch 44 / update 58000
  • Perplexity: 8.6557 at epoch 31 / update 40000
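
chrF, the metric used to pick the uploaded best-chrf checkpoint, is an F-score over character n-grams (typically orders 1-6, with recall weighted by beta=2). A simplified sentence-level sketch of the metric for illustration only; real evaluations should use a standard implementation such as sacrebleu:

```python
from collections import Counter


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF over character n-grams, on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0.0 and r == 0.0:
        return 0.0
    # F-beta combination; beta=2 weights recall twice as heavily as precision.
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it operates on characters rather than whole tokens, chrF tends to be more forgiving of morphological variation than BLEU, which is one reason it is a common selection criterion for morphologically rich target languages.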

Files

  • config.json: Hugging Face Transformers model config
  • generation_config.json: default generation settings
  • model.safetensors: converted Marian weights
  • source.spm: source SentencePiece model
  • target.spm: target SentencePiece model
  • vocab.json: shared Marian vocabulary
  • tokenizer_config.json: tokenizer metadata
  • special_tokens_map.json: tokenizer special token mapping
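
vocab.json holds a single token-to-id mapping shared by the source and target sides. A minimal illustration of the layout, using a made-up toy vocabulary rather than the model's real one:

```python
import json

# Toy example of the vocab.json layout: token -> integer id (not the real vocabulary).
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "▁Hello": 3, "▁world": 4}

serialized = json.dumps(vocab, ensure_ascii=False)
loaded = json.loads(serialized)

# Decoding uses the inverse mapping, id -> token.
id_to_token = {i: tok for tok, i in loaded.items()}
print(id_to_token[3])  # → ▁Hello
```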

Usage

This repository uses the standard Transformers Marian layout, so it can be loaded directly (replace "your-user/en-ckb-marian" with this repository's actual id):

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("your-user/en-ckb-marian")
model = MarianMTModel.from_pretrained("your-user/en-ckb-marian")

# Tokenize the English source, translate, and decode the Sorani Kurdish output.
inputs = tokenizer("Hello world", return_tensors="pt")
generated = model.generate(**inputs)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```
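
The decode call above joins SentencePiece pieces back into text, where the ▁ marker denotes a word boundary. A simplified sketch of that convention (illustrative only, not MarianTokenizer's actual implementation):

```python
def join_sentencepiece_pieces(pieces: list[str]) -> str:
    """Join SentencePiece subword pieces, treating '▁' as a leading space."""
    return "".join(pieces).replace("▁", " ").strip()


print(join_sentencepiece_pieces(["▁Hello", "▁world", "!"]))  # → Hello world!
```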

Notes

  • The weights were converted from the local Marian checkpoint into the Hugging Face MarianMTModel format.
  • Review dataset and license compatibility before redistributing the model publicly.
  • Model size: 51.9M parameters (F32, safetensors).