SozKZ Core: Kazakh Language Models
Collection: Base, instruct, and balanced Kazakh language models trained from scratch — Llama (50M–600M), GPT2, Pythia architectures (22 items).
A 50M-parameter Kazakh language model trained from scratch with a Llama architecture on ~1B tokens of curated Kazakh text. The token budget follows the Chinchilla-optimal rule of thumb (~20 tokens per parameter).

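The quoted corpus size can be sanity-checked against that rule of thumb; this is just arithmetic on the figures from this card:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
params = 50_000_000        # ~50M parameters (from this card)
tokens_per_param = 20
budget = params * tokens_per_param
print(f"{budget / 1e9:.1f}B tokens")  # 1.0B tokens, matching the ~1B-token corpus
```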
| Property | Value |
|---|---|
| Architecture | Llama (decoder-only transformer) |
| Parameters | ~50M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,257 (ByteLevel BPE) |
| Context length | 1,024 tokens |
| Activation | SwiGLU |
| Tokenizer | stukenov/kazakh-gpt2-50k |
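The "~50M" figure can be cross-checked from the table with a back-of-the-envelope count. Note the assumptions: `intermediate_size` is not listed on the card, so the 1376 below is a guess (a typical SwiGLU sizing near 8/3 × hidden, rounded to a multiple of 32), as is tying the output head to the input embeddings:

```python
# Back-of-the-envelope Llama parameter count from the architecture table above.
# Assumptions (not stated on the card): intermediate_size = 1376 and tied
# input/output embeddings; Llama linear layers are bias-free.
def llama_params(vocab=50_257, hidden=512, layers=8, intermediate=1376):
    embed = vocab * hidden                # token embeddings (tied with lm_head)
    attn = 4 * hidden * hidden            # q, k, v, o projections
    mlp = 3 * hidden * intermediate       # SwiGLU: gate, up, down matrices
    norms = 2 * hidden                    # two RMSNorms per layer
    return embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm

print(f"{llama_params() / 1e6:.1f}M")  # ~51.0M, consistent with "~50M"
```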

| Training Setting | Value |
|---|---|
| Dataset | stukenov/kazakh-clean-pretrain-v2 |
| Total tokens | ~1.04B |
| Epochs | 1 (Chinchilla-optimal) |
| Batch size | 128 (16 × 2 grad_accum × 4 GPUs) |
| Learning rate | 6e-4 (cosine decay) |
| Warmup | 500 steps |
| Hardware | 4× RTX 4090 |
| Training time | 44 minutes |
| Final eval loss | 3.675 |
| Perplexity | ~39.4 |
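The schedule in the table (peak 6e-4, 500 warmup steps, cosine decay) can be sketched as below. The total step count is derived from the card's own numbers (~1.04B tokens / (128 sequences × 1024 tokens) ≈ 7,900 optimizer steps); the decay floor `min_lr` is an assumption:

```python
import math

# Linear warmup followed by cosine decay, per the training table above.
# max_lr and warmup come from the card; total_steps is derived; min_lr is assumed.
def lr_at(step, max_lr=6e-4, warmup=500, total_steps=7_900, min_lr=6e-5):
    if step < warmup:
        return max_lr * step / warmup                      # linear warmup
    progress = (step - warmup) / (total_steps - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (max_lr - min_lr) * cosine             # decay to min_lr

print(lr_at(250), lr_at(500), lr_at(7_900))  # mid-warmup, peak, floor
```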

| Epoch | Eval Loss | Perplexity |
|---|---|---|
| 0.25 | 4.145 | 63.1 |
| 0.38 | 3.955 | 52.4 |
| 0.51 | 3.838 | 46.4 |
| 0.64 | 3.757 | 42.8 |
| 0.76 | 3.703 | 40.6 |
| 0.89 | 3.675 | 39.4 |
| 1.00 | 3.675 | 39.4 |
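The perplexity column is just the exponential of the eval loss (up to logging-time rounding):

```python
import math

# Perplexity is exp(cross-entropy loss); recompute a few rows of the table.
for loss in (4.145, 3.955, 3.675):
    print(f"loss={loss:.3f}  ppl={math.exp(loss):.1f}")
```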

Example usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-llama-50m-v2")
model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-llama-50m-v2")

# Sample a continuation of a Kazakh prompt ("Kazakhstan — ")
text = "Қазақстан — "
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100, do_sample=True,
                        temperature=0.8, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Trained on kazakh-clean-pretrain-v2 — a curated Kazakh corpus (18.9M texts, ~1.04B tokens) from 6 sources: CC-100, OSCAR, Wikipedia, Leipzig, Kazakh News, Kazakh Books.
Cleaning pipeline: NFC normalization → Kazakh char check → script profiling → fastText LID → junk removal → repetition filter → exact + near dedup (MinHash LSH) → domain balancing.
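The near-dedup step can be illustrated with a minimal, dependency-free MinHash sketch. Shingle size and signature length here are illustrative choices, not the pipeline's actual settings, and a real pipeline would add LSH banding to avoid all-pairs comparison:

```python
import hashlib

def shingles(text, n=3):
    """Overlapping word n-grams used as the document's feature set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text, num_perm=64):
    """Signature: for each seeded hash function, the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_perm)
    ]

def est_jaccard(a, b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

# Two sentences differing in one word score high; unrelated text scores near zero.
sim = est_jaccard(
    "Қазақстан Орталық Азиядағы ең үлкен мемлекет",
    "Қазақстан Орталық Азиядағы ең үлкен ел",
)
print(f"estimated Jaccard ≈ {sim:.2f}")
```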
```bibtex
@misc{kazakh-llama-50m-v2,
  author = {Saken Tukenov},
  title  = {Kazakh Llama 50M v2},
  year   = {2026},
  url    = {https://huggingface.co/stukenov/kazakh-llama-50m-v2}
}
```