ckb-en Marian model
This repository contains a Marian NMT model for Sorani Kurdish (ckb) -> English (en) trained from the local ckb-en directory of the OPUS-MT training workspace.
Model summary
- Direction: ckb -> en
- Architecture: Marian transformer
- Subword setup: SentencePiece spm4k-spm4k
- Primary uploaded checkpoint: best-chrf
- Training dataset selection: InterdialectCorpus, Tatoeba, wikimedia, tico-19, navinaananthan_kurdish_sorani_parallel_corpus
- Validation set: openlanguagedata_flores_plus
- Test set recipe: openlanguagedata_flores_plus
Best validation metrics seen in training logs
- BLEU: 30.2383 at epoch 55 / update 70000
- chrF: 56.6711 at epoch 51 / update 64000
- Perplexity: 7.0681 at epoch 34 / update 43000
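The three best values above were reached at different points in training. The short sketch below only restates those reported numbers and derives a rough updates-per-epoch figure from them. The variable names are illustrative, and because the epoch counts in the logs are whole numbers, the ratio is an approximation.

```python
# Best validation values as reported above: metric -> (value, epoch, update).
best = {
    "BLEU": (30.2383, 55, 70000),
    "chrF": (56.6711, 51, 64000),
    "perplexity": (7.0681, 34, 43000),
}

for name, (value, epoch, update) in best.items():
    # Epochs are logged as whole numbers, so this ratio is only approximate.
    approx_updates_per_epoch = update / epoch
    print(f"{name}: {value} at update {update} "
          f"(~{approx_updates_per_epoch:.0f} updates per epoch)")

# The optima fall on different updates, so the "best" checkpoint
# depends on which metric is used for selection.
distinct_updates = {update for _, _, update in best.values()}
print(f"{len(distinct_updates)} distinct best-checkpoint updates")
```

Since the optima do not coincide, a checkpoint selected on chrF is not necessarily the one with the best BLEU or perplexity.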
Files
- config.json: Hugging Face Transformers model config
- generation_config.json: default generation settings
- model.safetensors: converted Marian weights
- source.spm: source SentencePiece model
- target.spm: target SentencePiece model
- vocab.json: shared Marian vocabulary
- tokenizer_config.json: tokenizer metadata
- special_tokens_map.json: tokenizer special token mapping
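If you work from a local copy of the repository, a quick check that every file listed above is present can catch an incomplete download before loading. This is a minimal sketch: the helper name `missing_files` and the directory path in the usage note are hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "source.spm",
    "target.spm",
    "vocab.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_files(model_dir: str) -> List[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, `missing_files("./ckb-en-marian")` returns an empty list when the local directory is complete.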
Usage
This repository uses the standard Transformers Marian layout, so you can load it directly:
from transformers import MarianMTModel, MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("your-user/ckb-en-marian")
model = MarianMTModel.from_pretrained("your-user/ckb-en-marian")
# The model translates ckb -> en, so the input should be Sorani Kurdish text.
inputs = tokenizer("سڵاو", return_tensors="pt")
generated = model.generate(**inputs)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
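To translate several sentences at once, the same calls can be wrapped in a small batching helper. The sketch below is one possible way to do it, not part of this repository: `load_model`, `translate_batch`, and the default `batch_size` of 8 are illustrative choices, and `your-user/ckb-en-marian` is the same placeholder repository id used above.

```python
from typing import List


def load_model(repo_id: str):
    """Load the tokenizer and model from a Hub repository or local path."""
    # Imported here so translate_batch stays importable without Transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(repo_id)
    model = MarianMTModel.from_pretrained(repo_id)
    return tokenizer, model


def translate_batch(texts: List[str], tokenizer, model, batch_size: int = 8) -> List[str]:
    """Translate Sorani Kurdish sentences to English in fixed-size batches."""
    translations: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # padding=True pads to the longest sentence in the batch;
        # truncation=True clips inputs longer than the model's maximum length.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs)
        translations.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translations


# Example usage (downloads the model, so it is left commented out):
# tokenizer, model = load_model("your-user/ckb-en-marian")
# print(translate_batch(["سڵاو"], tokenizer, model))
```

Batching keeps memory use bounded on long input lists; the batch size is a tuning knob and should be adjusted to the available hardware.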
Notes
- The weights were converted from the local Marian checkpoint into the Hugging Face MarianMTModel format.
- Review dataset and license compatibility before redistributing the model publicly.