ckb-en Marian model

This repository contains a Marian NMT model for Sorani Kurdish (ckb) -> English (en) translation, trained in the local ckb-en directory of an OPUS-MT training workspace.

Model summary

  • Direction: ckb -> en
  • Architecture: Marian transformer
  • Subword setup: SentencePiece spm4k-spm4k
  • Primary uploaded checkpoint: best-chrf
  • Training datasets: InterdialectCorpus, Tatoeba, wikimedia, tico-19, navinaananthan_kurdish_sorani_parallel_corpus
  • Validation set: openlanguagedata_flores_plus
  • Test set: openlanguagedata_flores_plus
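Combining several corpora into one training set while holding out the flores_plus validation data can be sketched as follows. This is an illustrative, simplified sketch, not the actual OPUS-MT tooling; the function name, data layout, and sample sentences are hypothetical.

```python
# Illustrative sketch: merge parallel corpora into a training set while
# dropping duplicate source sentences and any pair whose source side
# also appears in the validation set (to avoid leakage).
def build_train_set(corpora, valid_sources):
    seen = set()
    train = []
    for name, pairs in corpora.items():
        for src, tgt in pairs:
            key = src.strip()
            if key in valid_sources or key in seen:
                continue  # skip validation overlap and duplicates
            seen.add(key)
            train.append((src, tgt))
    return train

corpora = {
    "Tatoeba": [("سڵاو", "Hello"), ("سوپاس", "Thanks")],
    "tico-19": [("سڵاو", "Hi")],  # duplicate source, dropped
}
valid = {"سوپاس"}  # pretend this sentence is in the flores_plus dev set
print(build_train_set(corpora, valid))
```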

Best validation metrics seen in training logs

  • BLEU: 30.2383 at epoch 55 / update 70000
  • chrF: 56.6711 at epoch 51 / update 64000
  • Perplexity: 7.0681 at epoch 34 / update 43000
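The uploaded best-chrf checkpoint corresponds to the chrF peak above; the BLEU and perplexity bests come from other checkpoints. For intuition about what the chrF score measures, here is a simplified sketch of character n-gram F-score computation (the real metric, as implemented in sacrebleu, handles n-gram order averaging and whitespace slightly differently, so use sacrebleu for actual evaluation):

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF operates on character n-grams with spaces removed
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    # Simplified chrF: average char n-gram precision/recall for n=1..6,
    # combined into an F-score with beta=2 (recall weighted twice).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(chrf("hello world", "hello world"))  # identical strings score 100.0
```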

Files

  • config.json: Hugging Face Transformers model config
  • generation_config.json: default generation settings
  • model.safetensors: converted Marian weights
  • source.spm: source SentencePiece model
  • target.spm: target SentencePiece model
  • vocab.json: shared Marian vocabulary
  • tokenizer_config.json: tokenizer metadata
  • special_tokens_map.json: tokenizer special token mapping
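Marian's vocab.json is a flat token-to-id mapping shared between source and target. The sketch below writes a tiny stand-in file and inverts it; the tokens and ids are made up for illustration, not taken from this model's actual vocabulary.

```python
import json
from pathlib import Path

# Write a tiny stand-in vocab.json (real files map thousands of
# SentencePiece tokens to integer ids; these entries are invented).
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "▁Hello": 3}
path = Path("vocab_example.json")
path.write_text(json.dumps(vocab), encoding="utf-8")

# Load it back and build the inverse id -> token mapping.
loaded = json.loads(path.read_text(encoding="utf-8"))
id_to_token = {i: t for t, i in loaded.items()}
print(id_to_token[3])  # ▁Hello
```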

Usage

This repository uses the standard Transformers Marian layout, so you can load it directly:

from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("your-user/ckb-en-marian")
model = MarianMTModel.from_pretrained("your-user/ckb-en-marian")

# The source side is Sorani Kurdish, so pass ckb text to the tokenizer.
inputs = tokenizer("سڵاو، دنیا", return_tensors="pt")
generated = model.generate(**inputs)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
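Decoding defaults (beam size, maximum length, special token ids) come from generation_config.json. The fragment below shows the typical shape of such a file for a Marian model; every value here is illustrative and assumed, so check the actual file in this repository rather than relying on these numbers:

```json
{
  "bad_words_ids": [[60715]],
  "decoder_start_token_id": 60715,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "max_length": 512,
  "num_beams": 4,
  "pad_token_id": 60715
}
```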

Notes

  • The weights were converted from the local Marian checkpoint into the Hugging Face MarianMTModel format.
  • Review dataset and license compatibility before redistributing the model publicly.
  • Model size: 51.9M parameters (F32, safetensors)