ckb-en Marian model

This repository contains a Marian NMT model for Sorani Kurdish (ckb) -> English (en) translation, trained in the local ckb-en directory of an OPUS-MT training workspace.

Model summary

  • Direction: ckb -> en
  • Architecture: Marian transformer
  • Subword setup: SentencePiece spm4k-spm4k
  • Primary uploaded checkpoint: best-chrf
  • Training datasets: InterdialectCorpus, Tatoeba, wikimedia, tico-19, navinaananthan_kurdish_sorani_parallel_corpus
  • Validation set: openlanguagedata_flores_plus
  • Test set: openlanguagedata_flores_plus
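Combining several corpora into one training set while holding out the flores_plus validation data can be sketched as follows. This is an illustrative, simplified sketch, not the actual OPUS-MT tooling; the function name, data layout, and sample sentences are hypothetical.

```python
# Illustrative sketch: merge parallel corpora into a training set while
# dropping duplicate source sentences and any pair whose source side
# also appears in the validation set (to avoid leakage).
def build_train_set(corpora, valid_sources):
    seen = set()
    train = []
    for name, pairs in corpora.items():
        for src, tgt in pairs:
            key = src.strip()
            if key in valid_sources or key in seen:
                continue  # skip validation overlap and duplicates
            seen.add(key)
            train.append((src, tgt))
    return train

corpora = {
    "Tatoeba": [("سڵاو", "Hello"), ("سوپاس", "Thanks")],
    "tico-19": [("سڵاو", "Hi")],  # duplicate source, dropped
}
valid = {"سوپاس"}  # pretend this sentence is in the flores_plus dev set
print(build_train_set(corpora, valid))
```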

Best validation metrics seen in training logs

  • BLEU: 30.2383 at epoch 55 / update 70000
  • chrF: 56.6711 at epoch 51 / update 64000
  • Perplexity: 7.0681 at epoch 34 / update 43000
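The uploaded best-chrf checkpoint corresponds to the chrF peak above; the BLEU and perplexity bests come from other checkpoints. For intuition about what the chrF score measures, here is a simplified sketch of character n-gram F-score computation (the real metric, as implemented in sacrebleu, handles n-gram order averaging and whitespace slightly differently, so use sacrebleu for actual evaluation):

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF operates on character n-grams with spaces removed
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    # Simplified chrF: average char n-gram precision/recall for n=1..6,
    # combined into an F-score with beta=2 (recall weighted twice).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(chrf("hello world", "hello world"))  # identical strings score 100.0
```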

Files

  • config.json: Hugging Face Transformers model config
  • generation_config.json: default generation settings
  • model.safetensors: converted Marian weights
  • source.spm: source SentencePiece model
  • target.spm: target SentencePiece model
  • vocab.json: shared Marian vocabulary
  • tokenizer_config.json: tokenizer metadata
  • special_tokens_map.json: tokenizer special token mapping
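Marian's vocab.json is a flat token-to-id mapping shared between source and target. The sketch below writes a tiny stand-in file and inverts it; the tokens and ids are made up for illustration, not taken from this model's actual vocabulary.

```python
import json
from pathlib import Path

# Write a tiny stand-in vocab.json (real files map thousands of
# SentencePiece tokens to integer ids; these entries are invented).
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "▁Hello": 3}
path = Path("vocab_example.json")
path.write_text(json.dumps(vocab), encoding="utf-8")

# Load it back and build the inverse id -> token mapping.
loaded = json.loads(path.read_text(encoding="utf-8"))
id_to_token = {i: t for t, i in loaded.items()}
print(id_to_token[3])  # ▁Hello
```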

Usage

This repository uses the standard Transformers Marian layout, so you can load it directly:

from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("your-user/ckb-en-marian")
model = MarianMTModel.from_pretrained("your-user/ckb-en-marian")

# The source side is Sorani Kurdish, so pass ckb text to the tokenizer.
inputs = tokenizer("سڵاو، دنیا", return_tensors="pt")
generated = model.generate(**inputs)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
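Decoding defaults (beam size, maximum length, special token ids) come from generation_config.json. The fragment below shows the typical shape of such a file for a Marian model; every value here is illustrative and assumed, so check the actual file in this repository rather than relying on these numbers:

```json
{
  "bad_words_ids": [[60715]],
  "decoder_start_token_id": 60715,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "max_length": 512,
  "num_beams": 4,
  "pad_token_id": 60715
}
```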

Notes

  • The weights were converted from the local Marian checkpoint into the Hugging Face MarianMTModel format.
  • Review dataset and license compatibility before redistributing the model publicly.
  • Model size: 51.9M parameters (F32, safetensors)