en-ckb Marian model

This repository contains a Marian NMT model for English (en) -> Sorani Kurdish (ckb) trained from the local en-ckb directory of the OPUS-MT training workspace.

Model summary

  • Direction: en -> ckb
  • Architecture: Marian transformer
  • Subword setup: SentencePiece spm4k-spm4k
  • Primary uploaded checkpoint: best-chrf
  • Training dataset selection: InterdialectCorpus, Tatoeba, wikimedia, tico-19, navinaananthan_kurdish_sorani_parallel_corpus
  • Validation set: openlanguagedata_flores_plus
  • Test set recipe: openlanguagedata_flores_plus

Best validation metrics seen in training logs

  • BLEU: 14.2475 at epoch 42 / update 55000
  • chrF: 45.1146 at epoch 44 / update 58000
  • Perplexity: 8.6557 at epoch 31 / update 40000
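
chrF, the metric used to pick the uploaded best-chrf checkpoint, is an F-score over character n-grams (typically orders 1-6, with recall weighted by beta=2). A simplified sentence-level sketch of the metric for illustration only; real evaluations should use a standard implementation such as sacrebleu:

```python
from collections import Counter


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF over character n-grams, on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0.0 and r == 0.0:
        return 0.0
    # F-beta combination; beta=2 weights recall twice as heavily as precision.
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it operates on characters rather than whole tokens, chrF tends to be more forgiving of morphological variation than BLEU, which is one reason it is a common selection criterion for morphologically rich target languages.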

Files

  • config.json: Hugging Face Transformers model config
  • generation_config.json: default generation settings
  • model.safetensors: converted Marian weights
  • source.spm: source SentencePiece model
  • target.spm: target SentencePiece model
  • vocab.json: shared Marian vocabulary
  • tokenizer_config.json: tokenizer metadata
  • special_tokens_map.json: tokenizer special token mapping
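
vocab.json holds a single token-to-id mapping shared by the source and target sides. A minimal illustration of the layout, using a made-up toy vocabulary rather than the model's real one:

```python
import json

# Toy example of the vocab.json layout: token -> integer id (not the real vocabulary).
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "▁Hello": 3, "▁world": 4}

serialized = json.dumps(vocab, ensure_ascii=False)
loaded = json.loads(serialized)

# Decoding uses the inverse mapping, id -> token.
id_to_token = {i: tok for tok, i in loaded.items()}
print(id_to_token[3])  # → ▁Hello
```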

Usage

This repository uses the standard Transformers Marian layout, so it can be loaded directly (replace "your-user/en-ckb-marian" with this repository's actual id):

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("your-user/en-ckb-marian")
model = MarianMTModel.from_pretrained("your-user/en-ckb-marian")

# Tokenize the English source, translate, and decode the Sorani Kurdish output.
inputs = tokenizer("Hello world", return_tensors="pt")
generated = model.generate(**inputs)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```
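
The decode call above joins SentencePiece pieces back into text, where the ▁ marker denotes a word boundary. A simplified sketch of that convention (illustrative only, not MarianTokenizer's actual implementation):

```python
def join_sentencepiece_pieces(pieces: list[str]) -> str:
    """Join SentencePiece subword pieces, treating '▁' as a leading space."""
    return "".join(pieces).replace("▁", " ").strip()


print(join_sentencepiece_pieces(["▁Hello", "▁world", "!"]))  # → Hello world!
```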

Notes

  • The weights were converted from the local Marian checkpoint into the Hugging Face MarianMTModel format.
  • Review dataset and license compatibility before redistributing the model publicly.
  • Model size: 51.9M parameters (F32, safetensors).