SozKZ Core: Kazakh Language Models
Collection: Base, instruct, and balanced Kazakh language models trained from scratch — Llama (50M–600M), GPT2, Pythia architectures (22 items).
A 50M-parameter Kazakh language model trained from scratch with a Llama architecture on ~1B tokens of curated Kazakh text. The token budget follows the Chinchilla-optimal rule of thumb (~20 tokens per parameter).

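The quoted corpus size can be sanity-checked against that rule of thumb; this is just arithmetic on the figures from this card:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
params = 50_000_000        # ~50M parameters (from this card)
tokens_per_param = 20
budget = params * tokens_per_param
print(f"{budget / 1e9:.1f}B tokens")  # 1.0B tokens, matching the ~1B-token corpus
```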
| Property | Value |
|---|---|
| Architecture | Llama (decoder-only transformer) |
| Parameters | ~50M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,257 (ByteLevel BPE) |
| Context length | 1,024 tokens |
| Activation | SwiGLU |
| Tokenizer | stukenov/kazakh-gpt2-50k |
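The "~50M" figure can be cross-checked from the table with a back-of-the-envelope count. Note the assumptions: `intermediate_size` is not listed on the card, so the 1376 below is a guess (a typical SwiGLU sizing near 8/3 × hidden, rounded to a multiple of 32), as is tying the output head to the input embeddings:

```python
# Back-of-the-envelope Llama parameter count from the architecture table above.
# Assumptions (not stated on the card): intermediate_size = 1376 and tied
# input/output embeddings; Llama linear layers are bias-free.
def llama_params(vocab=50_257, hidden=512, layers=8, intermediate=1376):
    embed = vocab * hidden                # token embeddings (tied with lm_head)
    attn = 4 * hidden * hidden            # q, k, v, o projections
    mlp = 3 * hidden * intermediate       # SwiGLU: gate, up, down matrices
    norms = 2 * hidden                    # two RMSNorms per layer
    return embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm

print(f"{llama_params() / 1e6:.1f}M")  # ~51.0M, consistent with "~50M"
```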

| Training Setting | Value |
|---|---|
| Dataset | stukenov/kazakh-clean-pretrain-v2 |
| Total tokens | ~1.04B |
| Epochs | 1 (Chinchilla-optimal) |
| Batch size | 128 (16 × 2 grad_accum × 4 GPUs) |
| Learning rate | 6e-4 (cosine decay) |
| Warmup | 500 steps |
| Hardware | 4× RTX 4090 |
| Training time | 44 minutes |
| Final eval loss | 3.675 |
| Perplexity | ~39.4 |
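The schedule in the table (peak 6e-4, 500 warmup steps, cosine decay) can be sketched as below. The total step count is derived from the card's own numbers (~1.04B tokens / (128 sequences × 1024 tokens) ≈ 7,900 optimizer steps); the decay floor `min_lr` is an assumption:

```python
import math

# Linear warmup followed by cosine decay, per the training table above.
# max_lr and warmup come from the card; total_steps is derived; min_lr is assumed.
def lr_at(step, max_lr=6e-4, warmup=500, total_steps=7_900, min_lr=6e-5):
    if step < warmup:
        return max_lr * step / warmup                      # linear warmup
    progress = (step - warmup) / (total_steps - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (max_lr - min_lr) * cosine             # decay to min_lr

print(lr_at(250), lr_at(500), lr_at(7_900))  # mid-warmup, peak, floor
```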

| Epoch | Eval Loss | Perplexity |
|---|---|---|
| 0.25 | 4.145 | 63.1 |
| 0.38 | 3.955 | 52.4 |
| 0.51 | 3.838 | 46.4 |
| 0.64 | 3.757 | 42.8 |
| 0.76 | 3.703 | 40.6 |
| 0.89 | 3.675 | 39.4 |
| 1.00 | 3.675 | 39.4 |
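The perplexity column is just the exponential of the eval loss (up to logging-time rounding):

```python
import math

# Perplexity is exp(cross-entropy loss); recompute a few rows of the table.
for loss in (4.145, 3.955, 3.675):
    print(f"loss={loss:.3f}  ppl={math.exp(loss):.1f}")
```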

Example usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-llama-50m-v2")
model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-llama-50m-v2")

# Sample a continuation of a Kazakh prompt ("Kazakhstan — ")
text = "Қазақстан — "
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100, do_sample=True,
                        temperature=0.8, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Trained on kazakh-clean-pretrain-v2 — a curated Kazakh corpus (18.9M texts, ~1.04B tokens) from 6 sources: CC-100, OSCAR, Wikipedia, Leipzig, Kazakh News, Kazakh Books.
Cleaning pipeline: NFC normalization → Kazakh char check → script profiling → fastText LID → junk removal → repetition filter → exact + near dedup (MinHash LSH) → domain balancing.
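The near-dedup step can be illustrated with a minimal, dependency-free MinHash sketch. Shingle size and signature length here are illustrative choices, not the pipeline's actual settings, and a real pipeline would add LSH banding to avoid all-pairs comparison:

```python
import hashlib

def shingles(text, n=3):
    """Overlapping word n-grams used as the document's feature set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text, num_perm=64):
    """Signature: for each seeded hash function, the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_perm)
    ]

def est_jaccard(a, b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

# Two sentences differing in one word score high; unrelated text scores near zero.
sim = est_jaccard(
    "Қазақстан Орталық Азиядағы ең үлкен мемлекет",
    "Қазақстан Орталық Азиядағы ең үлкен ел",
)
print(f"estimated Jaccard ≈ {sim:.2f}")
```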
```bibtex
@misc{kazakh-llama-50m-v2,
  author = {Saken Tukenov},
  title  = {Kazakh Llama 50M v2},
  year   = {2026},
  url    = {https://huggingface.co/stukenov/kazakh-llama-50m-v2}
}
```