
EkiTil-600M Translate: Kazakh-Russian Translation Model

Bilingual Kazakh-Russian translation model fine-tuned from EkiTil-600M base on 5.1M parallel sentence pairs.

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3ForCausalLM (decoder-only) |
| Parameters | 673.8M |
| Base model | ekitil-core-qwen3-600m-kkru-base-v1 |
| Training data | ekitil-parallel-kkru-v2 (5.1M kk-ru pairs) |
| Training examples | 10.2M (both directions: kk->ru + ru->kk) |
| Training steps | 19,921 (~0.5 epochs) |
| Final loss | 2.62 |
| Learning rate | 2e-5 (cosine decay, 500 warmup steps) |
| Effective batch size | 256 (32 x 8 grad accum) |
| Hardware | 1x NVIDIA H100 80GB |
| Training time | ~4.5h |
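The step count and epoch estimate in the table above can be cross-checked with a quick back-of-the-envelope calculation (assumed arithmetic, derived only from the table's own figures, not from the training logs):

```python
# Relate the table's step count to its "~0.5 epochs" estimate.
examples = 10_200_000      # 5.1M pairs x 2 directions
effective_batch = 256      # 32 per-device x 8 gradient accumulation
steps_per_epoch = examples / effective_batch
steps_taken = 19_921
epochs = steps_taken / steps_per_epoch
print(f"{steps_per_epoch:.0f} steps/epoch -> {epochs:.2f} epochs")
# 39844 steps/epoch -> 0.50 epochs
```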

Translation Format

```
<|kk|> Kazakh text <|translate|> <|ru|>   -> generates the Russian translation
<|ru|> Russian text <|translate|> <|kk|>  -> generates the Kazakh translation
```
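Assembling these prompts by hand is error-prone, so a small helper can enforce the tag layout. This is an illustrative sketch (the `build_prompt` name is not part of the model's API):

```python
def build_prompt(text: str, src: str, tgt: str) -> str:
    """Assemble a translation prompt in the model's tag format.

    `src` and `tgt` are 'kk' or 'ru', one of each.
    Helper name is illustrative, not part of the model's API.
    """
    if {src, tgt} != {"kk", "ru"}:
        raise ValueError("expected one 'kk' and one 'ru' language code")
    return f"<|{src}|> {text} <|translate|> <|{tgt}|>"

print(build_prompt("Мен университетте оқимын.", "kk", "ru"))
# <|kk|> Мен университетте оқимын. <|translate|> <|ru|>
```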

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "stukenov/ekitil-core-qwen3-600m-kkru-translate-v1",
    dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("stukenov/ekitil-core-qwen3-600m-kkru-translate-v1")

# Kazakh -> Russian
prompt = "<|kk|> Қазақстан — Орталық Азиядағы ең үлкен мемлекет. <|translate|> <|ru|>"
ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
out = model.generate(ids, max_new_tokens=100, repetition_penalty=1.3, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

# Russian -> Kazakh
prompt = "<|ru|> Образование является ключом к успеху. <|translate|> <|kk|>"
ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
out = model.generate(ids, max_new_tokens=100, repetition_penalty=1.3, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```
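The two snippets above differ only in the prompt tags, so they can be folded into one convenience wrapper. This is a sketch under the usage shown above; the `translate` name is hypothetical, not part of the model's API:

```python
def translate(model, tokenizer, text: str, src: str, tgt: str,
              max_new_tokens: int = 100) -> str:
    """Translate `text` between 'kk' and 'ru' (illustrative helper)."""
    prompt = f"<|{src}|> {text} <|translate|> <|{tgt}|>"
    ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         repetition_penalty=1.3,
                         eos_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens, dropping the prompt and tags.
    return tokenizer.decode(out[0][ids.shape[1]:],
                            skip_special_tokens=True).strip()

# Example call (requires the loaded model/tokenizer from above):
# ru = translate(model, tokenizer, "Бүгін ауа райы өте жақсы.", "kk", "ru")
```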

Training Data

5.1M deduplicated kk-ru parallel pairs from 13 sources:

| Source | Pairs |
|---|---:|
| WMT19 crawl | 4,512,841 |
| KazParC (human-translated) | 362,208 |
| OPUS (12 corpora) | 225,924 |
| **Total** | **5,100,973** |

See ekitil-parallel-kkru-v2 for full details.
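For intuition, deduplication of parallel pairs typically keys on a lightly normalized form of both sides. The sketch below is illustrative only; the actual ekitil-parallel-kkru-v2 pipeline is not described in this card:

```python
import unicodedata

def dedup_pairs(pairs):
    """Drop exact duplicate (kk, ru) pairs after light normalization.

    Illustrative sketch: NFC-normalize, collapse whitespace, lowercase,
    then keep the first occurrence of each key.
    """
    def norm(s):
        return unicodedata.normalize("NFC", " ".join(s.split())).lower()

    seen, out = set(), []
    for kk, ru in pairs:
        key = (norm(kk), norm(ru))
        if key not in seen:
            seen.add(key)
            out.append((kk, ru))
    return out

pairs = [("Сәлем!", "Привет!"), ("сәлем!", "привет!"), ("Рақмет.", "Спасибо.")]
print(len(dedup_pairs(pairs)))  # 2
```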

Generation Examples

Generated with repetition_penalty=1.3, num_beams=4:

Kazakh -> Russian:

SRC: Қазақстан — Орталық Азиядағы ең үлкен мемлекет.

OUT: Казахстан является одним из крупнейших государств мира.

SRC: Бүгін ауа райы өте жақсы.

OUT: У нас очень хорошая погода.

SRC: Мен университетте оқимын.

OUT: У нас в университете.

Russian -> Kazakh:

SRC: Казахстан — красивая страна с богатой историей.

OUT: Қазақстан — Қазақстанның ең бай тарихы. Қазақстан — өте бай ел.

SRC: Здравствуйте, как у вас дела?

OUT: Сіздер туралы айтып беріңізші.

SRC: Образование является ключом к успеху.

OUT: Өнеркәсiптiк iс-әрекеттiң ерекшелiктерi бiр-бiрiмен байланысты.

Note: Translation captures meaning but quality varies. The model was trained for only 0.5 epochs — more training would improve fluency. Use repetition_penalty >= 1.3 to avoid repetition.

Limitations

  • Decoder-only architecture may produce repetitive output; use repetition_penalty >= 1.3
  • Trained for only 0.5 epochs; more training could improve quality
  • Best for sentence-level translation, not full documents
  • Translation quality varies by domain (strongest on legal/government text from WMT19)
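Since the model is strongest at sentence level, longer documents are best split into sentences, translated one at a time, and rejoined. A naive splitter is sketched below (illustrative; a proper segmenter would handle abbreviations and Kazakh punctuation more robustly):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on terminal punctuation (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "Қазақстан — үлкен мемлекет. Астанасы — Астана. Халқы көп."
for sent in split_sentences(doc):
    print(sent)  # translate each sentence independently, then rejoin
```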

EkiTil Model Family

| Model | Type | Params | HF |
|---|---|---|---|
| EkiTil-123M | Base LM | 124.7M | base-v1 |
| EkiTil-300M | Base LM | 245.9M | base-v1 |
| EkiTil-600M | Base LM | 673.8M | base-v1 |
| EkiTil-600M Translate | Translation | 673.8M | this model |

License

MIT
