
# Kazakh Llama 50M v2

A 50M-parameter Kazakh language model trained from scratch with a Llama architecture on ~1B tokens of curated Kazakh text, matching the Chinchilla-optimal compute budget of ~20 training tokens per parameter.
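As a quick sanity check, the stated budget follows directly from the Chinchilla ratio (the parameter count is approximate, so the figures below are illustrative):

```python
# Chinchilla-style compute budget: ~20 training tokens per model parameter
params = 50e6            # ~50M parameters (approximate)
tokens_per_param = 20
budget_tokens = params * tokens_per_param
print(f"{budget_tokens:.2e}")  # 1.00e+09 -> ~1B tokens, matching the dataset size
```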

## Model Details

| Property | Value |
|---|---|
| Architecture | Llama (decoder-only transformer) |
| Parameters | ~50M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,257 (ByteLevel BPE) |
| Context length | 1,024 tokens |
| Activation | SwiGLU |
| Tokenizer | stukenov/kazakh-gpt2-50k |
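The table pins down everything needed to reproduce the configuration except the MLP intermediate size, which the card does not state. Assuming Llama's usual ~8/3 × hidden size (rounded here to 1365) and tied input/output embeddings, a back-of-the-envelope parameter count lands near the reported ~50M:

```python
# Rough Llama parameter count from the table above.
# ASSUMPTIONS (not stated on the card): intermediate_size=1365, tied embeddings.
hidden, layers, vocab, d_ff = 512, 8, 50_257, 1365

embed = vocab * hidden        # token embeddings (tied with the LM head)
attn = 4 * hidden * hidden    # q, k, v, o projections per layer
mlp = 3 * hidden * d_ff       # SwiGLU: gate, up, and down projections
norms = 2 * hidden            # two RMSNorms per layer
total = embed + layers * (attn + mlp + norms) + hidden  # + final norm

print(f"{total / 1e6:.1f}M")  # ~50.9M, close to the reported size
```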

## Training

| Property | Value |
|---|---|
| Dataset | stukenov/kazakh-clean-pretrain-v2 |
| Total tokens | ~1.04B |
| Epochs | 1 (Chinchilla-optimal) |
| Effective batch size | 128 (16 per GPU × 2 gradient accumulation steps × 4 GPUs) |
| Learning rate | 6e-4 (cosine decay) |
| Warmup | 500 steps |
| Hardware | 4× RTX 4090 |
| Training time | 44 minutes |
| Final eval loss | 3.675 |
| Perplexity | ~39.4 |

### Loss curve

| Epoch | Eval Loss | Perplexity |
|---|---|---|
| 0.25 | 4.145 | 63.1 |
| 0.38 | 3.955 | 52.4 |
| 0.51 | 3.838 | 46.4 |
| 0.64 | 3.757 | 42.8 |
| 0.76 | 3.703 | 40.6 |
| 0.89 | 3.675 | 39.4 |
| 1.00 | 3.675 | 39.4 |
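The perplexity column is simply the exponential of the eval loss; a one-liner confirms the final row:

```python
import math

# Perplexity is exp(cross-entropy loss); check the final eval loss from the table
final_loss = 3.675
print(round(math.exp(final_loss), 1))  # 39.4
```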

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-llama-50m-v2")
model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-llama-50m-v2")

text = "Қазақстан — "
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training Data

Trained on kazakh-clean-pretrain-v2 — a curated Kazakh corpus (18.9M texts, ~1.04B tokens) from 6 sources: CC-100, OSCAR, Wikipedia, Leipzig, Kazakh News, Kazakh Books.

Cleaning pipeline: NFC normalization → Kazakh char check → script profiling → fastText LID → junk removal → repetition filter → exact + near dedup (MinHash LSH) → domain balancing.
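The near-dedup step can be illustrated with a toy MinHash over character shingles. This is a from-scratch sketch for intuition only, not the pipeline's actual implementation (which would use banded LSH over a full corpus); the example texts are invented:

```python
import hashlib

def shingles(text, k=3):
    """Character k-grams; production pipelines often shingle over words."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_perm=64):
    """One minimum per seeded hash function; md5 keeps it deterministic."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

near_dup_a = "қазақ тілінде жазылған мәтін"
near_dup_b = "қазақ тілінде жазылған мәтіндер"
unrelated = "ауа райы бүгін өте суық болды"

sig_a, sig_b, sig_c = (minhash(shingles(t)) for t in (near_dup_a, near_dup_b, unrelated))
print(est_jaccard(sig_a, sig_b))  # high: near-duplicates share most shingles
print(est_jaccard(sig_a, sig_c))  # low: little shingle overlap
```

A deduper would then drop one document of any pair whose estimated similarity exceeds a threshold (e.g. 0.8).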

## Limitations

- Small model (50M parameters), so generation quality is limited
- Kazakh-only; no multilingual capability
- 1,024-token context window
- No instruction tuning or RLHF

## Citation

```bibtex
@misc{kazakh-llama-50m-v2,
  author = {Saken Tukenov},
  title = {Kazakh Llama 50M v2},
  year = {2026},
  url = {https://huggingface.co/stukenov/kazakh-llama-50m-v2}
}
```