SozKZ GEC: Kazakh Grammar Error Correction
Part of the SozKZ collection: grammar error correction models and datasets for Kazakh (Llama GEC 300M/600M, mT5 GEC, morphology models).
A grammatical error correction (GEC) model for Kazakh.
v4 introduces a thinking format: the model first identifies the error (💭), then applies the correction (→).
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM (decoder-only) |
| Parameters | 325M |
| Base model | stukenov/sozkz-core-llama-300m-kk-base-v1 |
| Training data | sozkz-corpus-synthetic-kk-gec-v1 |
| Training | 5 epochs, LR=2e-5, BS=128, cosine schedule |
| Clean ratio | 80% (error pairs + identity pairs) |
| Data filter | word edit distance ≤ 2 only |
| Hardware | 4× RTX 4090 (vast.ai), ~2.8h |
| License | MIT (gated access) |
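The "word edit distance ≤ 2" filter in the table can be sketched as a Levenshtein distance computed over word tokens instead of characters. This is a hypothetical reimplementation for illustration; the actual filtering script used during training is not published.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over whitespace-separated words."""
    src, tgt = a.split(), b.split()
    # prev[j] holds the distance between src[:i-1] and tgt[:j]
    prev = list(range(len(tgt) + 1))
    for i, sw in enumerate(src, 1):
        curr = [i]
        for j, tw in enumerate(tgt, 1):
            cost = 0 if sw == tw else 1
            curr.append(min(prev[j] + 1,          # word deletion
                            curr[j - 1] + 1,      # word insertion
                            prev[j - 1] + cost))  # word substitution
        prev = curr
    return prev[-1]

def keep_pair(source: str, target: str, max_dist: int = 2) -> bool:
    """Keep only source/target pairs within the edit-distance threshold."""
    return word_edit_distance(source, target) <= max_dist
```

A pair like "Студенттар университетте оқиды." → "Студенттер университетте оқиды." has a word edit distance of 1 and would pass the filter.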
New thinking format: the model first identifies the error, then corrects it.

Error example:

```
<tag> erroneous text
💭 ларде→ларда (дауысты дыбыс үндесімі)
→ corrected text
```

The 💭 line is the model's reasoning in Kazakh (here: "ларде→ларда", a vowel harmony fix).

Clean example:

```
<tag> correct text
💭 қате жоқ
→ correct text
```

("қате жоқ" means "no error".)
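The format above can be assembled and parsed with plain string handling. A minimal sketch — only the `💭`/`→` markers and the `<tag> text` prompt shape come from the format description; the function names are illustrative:

```python
def build_prompt(tag, text):
    """Wrap the input sentence in the v4 prompt format: <tag> text, newline-terminated."""
    return f"<{tag}> {text}\n"

def parse_output(generated):
    """Split raw model output into (thinking, correction); either may be None."""
    thinking = correction = None
    for line in generated.splitlines():
        if line.startswith("💭"):
            thinking = line[1:].strip()
        elif line.startswith("→"):
            correction = line[1:].strip()
    return thinking, correction
```

For example, `parse_output("<көптік> Студенттар оқиды.\n💭 тар→тер\n→ Студенттер оқиды.")` returns the thinking string and the corrected sentence as a pair.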
| Tag | Error Type |
|---|---|
| `<грамматика>` | General grammar (catch-all) |
| `<сингармонизм>` | Vowel harmony |
| `<септік>` | Case suffixes |
| `<тәуелдік>` | Possessive |
| `<жіктік>` | Personal endings |
| `<шылау>` | Postpositions |
| `<көптік>` | Plural |
| `<болымсыз>` | Negation |
| `<шақ>` | Tense |
| `<жалғау>` | General suffixes |
| `<қате>` | Typos/noise |
| `<таза>` | Clean (no error) |
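For programmatic use, the tag set above can be kept as a simple mapping. The tag names and descriptions are copied from the table; the dict and fallback helper are just a convenience, not part of the model's API:

```python
# Error-type tags accepted by the model, per the table above.
GEC_TAGS = {
    "грамматика": "general grammar (catch-all)",
    "сингармонизм": "vowel harmony",
    "септік": "case suffixes",
    "тәуелдік": "possessive",
    "жіктік": "personal endings",
    "шылау": "postpositions",
    "көптік": "plural",
    "болымсыз": "negation",
    "шақ": "tense",
    "жалғау": "general suffixes",
    "қате": "typos/noise",
    "таза": "clean (no error)",
}

def validate_tag(tag):
    """Fall back to the general catch-all tag when an unknown tag is passed."""
    return tag if tag in GEC_TAGS else "грамматика"
```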
Usage:

```python
from transformers import AutoModelForCausalLM, GPT2TokenizerFast
from huggingface_hub import hf_hub_download
import torch

model_id = "stukenov/sozkz-core-llama-300m-kk-gec-v4"

# The tokenizer ships as a raw tokenizer.json, so load it directly
tok_file = hf_hub_download(repo_id=model_id, filename="tokenizer.json")
tokenizer = GPT2TokenizerFast(tokenizer_file=tok_file)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def correct(tag, text):
    prompt = f"<{tag}> {text}\n"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=480)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            # the model echoes the sentence after its thinking line,
            # so budget roughly the input length plus some slack
            max_new_tokens=len(inputs["input_ids"][0]) + 60,
            temperature=0.3, top_p=0.9, do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    result = tokenizer.decode(out[0], skip_special_tokens=True)
    # Keep only the text after the correction marker,
    # dropping the echoed prompt and the thinking line
    if "→ " in result:
        result = result.split("→ ", 1)[1]
    # Stop at the next tag or a blank line
    for stop in ["\n<", "\n\n"]:
        if stop in result:
            result = result[:result.index(stop)]
    return result.strip() or text

print(correct("көптік", "Студенттар университетте оқиды."))
# → Студенттер университетте оқиды.
```
| Version | Epochs | Clean% | Accuracy | FP Rate |
|---|---|---|---|---|
| v1 | 1 | 0% | 9.0% | 84% |
| v2a | 1 | 15% | 6.6% | 82% |
| v3 | 5 | 50% | 15.8% | 76% |
| v3+filters | — | — | 21.4% | 12% |
| v4 | 5 | 80% | TBD | TBD |
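The Accuracy and FP Rate columns can be reproduced with two counts: accuracy is exact-match correction of inputs that contain an error, and the false-positive rate is the fraction of already-clean inputs the model changes. This is a hypothetical evaluation sketch; the actual eval script is not published.

```python
def evaluate(pairs, correct_fn):
    """pairs: list of (source, gold) tuples; correct_fn maps source -> prediction."""
    hits = errors = fps = cleans = 0
    for source, gold in pairs:
        pred = correct_fn(source)
        if source == gold:
            # clean input: any change is a false positive
            cleans += 1
            fps += pred != source
        else:
            # erroneous input: only an exact match counts
            errors += 1
            hits += pred == gold
    accuracy = hits / errors if errors else 0.0
    fp_rate = fps / cleans if cleans else 0.0
    return accuracy, fp_rate
```

Under this definition a model that never edits anything scores 0% accuracy but also a 0% FP rate, which is why the clean ratio in training has to balance the two.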
This model is part of the SozKZ project for Kazakh language AI.