
# SozKZ OmniAudio v2 — Kazakh ASR (70M)

A native Kazakh automatic speech recognition model trained from scratch. No pretrained components — both encoder and decoder are trained entirely on Kazakh speech data.

## Model Details

| Parameter | Value |
|---|---|
| Architecture | Custom encoder-decoder with cross-attention |
| Total params | 69.58M (266 MB) |
| Encoder | 5.1M params: 256d, 4 heads, 6 layers, 2 conv (4x downsample), bidirectional |
| Decoder | 51.4M params: 512d, 8 heads, 8 layers, causal, tied embeddings |
| Components | RoPE, RMSNorm, SwiGLU (Llama-style) |
| Tokenizer | kazakh-gpt2-50k (50,257 vocab, BPE) |
| Training data | kzcalm-tts-kk-v1 (232K samples, 439 hours) |
| Input | 16 kHz mono audio, 80-bin log-mel spectrogram |
| Output | Kazakh text |

## Results

| Metric | Value |
|---|---|
| WER | 21.28% |
| CER | 12.80% |
| Val loss | 2.3547 |

Assessed on 50 test samples from kzcalm-tts-kk-v1.
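For reference, WER and CER are edit-distance metrics: word-level (or character-level) substitutions, insertions, and deletions divided by the reference length. A minimal pure-Python sketch (illustrative only, not the project's evaluation script):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / number of reference words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

Note that WER can exceed 100% when the hypothesis is much longer than the reference (insertions count as errors), which is how runaway generation produces WER figures in the hundreds of percent.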

## Example Outputs

| Reference | Prediction |
|---|---|
| Тоқ етер түйіні – адам өзін–өзі зерттеуі керек! | Тоқ етер түйіні – адам өзін–өзі зерттеуі керек! |
| Тамыр емес, бейтаныс қонжықтың жүніне қолы тигенін ол енді ғана сезгендей болды. | Тамыр емес, бейтаныс қонжықтың жүніне қолы тигенін ол енді ғана сезгендей болды. |
| Ол сонда да есін жоғалтпай тұрды. | Ол сонда да есін жоғалтпай тұрды. |
| Сондай–ақ барша Қазақстан халқына шексіз алғысын жолдап, бақ–береке тіледі. | Сондай–ақ, барша Қазақстан халқына шексіз алғысын жолдап, бақ-береке тіледі. |

## Training

### 3-Stage Pipeline

1. **CTC Pretraining** (5 epochs): Encoder + CTC head only. Teaches the encoder to produce meaningful audio representations.
2. **End-to-End with CTC** (4+ epochs): Full model with hybrid loss (0.7 CE + 0.3 CTC). Trains cross-attention and decoder.
3. **Pure CE Fine-tuning** (1 epoch): CE loss only, LR=5e-6, label_smoothing=0.05. Final refinement.
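The stage-dependent loss mixing above can be sketched as follows. The weights come from the pipeline description; the function and stage names are illustrative, not the training code's API, and the losses are plain floats here where the real pipeline would use tensors:

```python
# (w_ce, w_ctc) per training stage, per the 3-stage pipeline above.
STAGE_WEIGHTS = {
    "ctc_pretrain": (0.0, 1.0),  # stage 1: CTC only (encoder + CTC head)
    "hybrid":       (0.7, 0.3),  # stage 2: 0.7 * CE + 0.3 * CTC
    "ce_finetune":  (1.0, 0.0),  # stage 3: pure CE
}

def total_loss(ce, ctc, stage):
    """Combine cross-entropy and CTC losses with stage-specific weights."""
    w_ce, w_ctc = STAGE_WEIGHTS[stage]
    return w_ce * ce + w_ctc * ctc
```

The hybrid stage keeps a CTC signal on the encoder while the decoder and cross-attention learn; the final stage drops it, which the Key Findings below report worked better for last-mile refinement.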

### Training Details

- Hardware: 1x NVIDIA RTX 4090 (24 GB)
- Augmentation: SpecAugment (freq=27, time=100, 2 masks each) + speed perturbation (0.9x, 1.0x, 1.1x)
- Optimizer: AdamW, weight_decay=0.01
- Schedule: Linear warmup + linear decay
- Precision: bfloat16
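The SpecAugment setting above (freq=27, time=100, 2 masks each) zeroes random frequency bands and time spans of the log-mel spectrogram. A minimal pure-Python sketch on a nested-list spectrogram, for illustration only (the real pipeline presumably works on tensors):

```python
import random

def spec_augment(mel, freq_param=27, time_param=100, n_masks=2, rng=random):
    """Apply SpecAugment-style frequency and time masks.

    `mel` is a [n_mels][n_frames] nested list of floats; a masked copy is
    returned. Defaults mirror the freq=27 / time=100 / 2-masks setting
    above; this is a sketch, not the project's implementation.
    """
    n_mels, n_frames = len(mel), len(mel[0])
    out = [row[:] for row in mel]
    for _ in range(n_masks):
        # Frequency mask: zero a band of up to freq_param mel bins.
        f = min(rng.randrange(0, freq_param + 1), n_mels)
        f0 = rng.randrange(0, n_mels - f + 1)
        for m in range(f0, f0 + f):
            out[m] = [0.0] * n_frames
        # Time mask: zero a span of up to time_param frames.
        t = min(rng.randrange(0, time_param + 1), n_frames)
        t0 = rng.randrange(0, n_frames - t + 1)
        for row in out:
            for j in range(t0, t0 + t):
                row[j] = 0.0
    return out
```

Speed perturbation is simpler: each utterance is resampled to 0.9x, 1.0x, or 1.1x of its original rate before feature extraction, effectively tripling tempo diversity.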

## Key Findings

- EOS token is critical: Without EOS in training targets, the model generates correct text followed by infinite garbage. Adding EOS reduced WER from 855% to ~21%.
- DDP hurts this model: DistributedDataParallel consistently degraded WER (tested twice). Single-GPU training works best at this model size.
- Lower LR for fine-tuning: 5e-6 clearly beat 1e-5, which in turn beat 1e-4, when continuing from a good checkpoint.
- Pure CE beat hybrid CTC+CE for the final fine-tuning stage.
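The EOS finding can be made concrete with a sketch: append EOS to every training target so the decoder learns to stop, and halt generation when it appears. The helper names and `eos_id` value are hypothetical, not the project's code:

```python
def build_targets(token_ids, eos_id):
    """Append EOS so the decoder learns when to stop.

    Without this trailing EOS, greedy decoding never halts and the
    appended garbage inflates WER far past 100% (`eos_id` is whatever
    id the tokenizer reserves for end-of-sequence; hypothetical here).
    """
    return token_ids + [eos_id]

def should_stop(generated, eos_id, max_len=256):
    """Generation stop condition: EOS emitted or length cap reached."""
    return len(generated) >= max_len or (bool(generated) and generated[-1] == eos_id)
```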

## Architecture

```
Audio (16kHz)
  -> Mel Spectrogram (80 bins)
  -> Conv1 (1->256, k=3, s=2) -> Conv2 (256->256, k=3, s=2)  [4x downsample]
  -> 6x Encoder Blocks (bidirectional self-attention + SwiGLU FFN)
  -> Projector (256->512)
  -> 8x Decoder Blocks (causal self-attention + cross-attention + SwiGLU FFN)
  -> LM Head (512->50257, tied with embeddings)
  -> Text tokens -> Tokenizer decode -> Kazakh text
```
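The 4x downsample comes from the two stride-2 convolutions. Assuming padding=1 (the card does not state the padding), the sequence-length arithmetic works out as:

```python
def conv_out_len(n, kernel=3, stride=2, padding=1):
    """Output length of a 1-D convolution (standard formula;
    padding=1 is an assumption, not stated in the card)."""
    return (n + 2 * padding - kernel) // stride + 1

def encoder_frames(n_mel_frames):
    """Frames reaching the transformer after both conv layers (~4x fewer)."""
    return conv_out_len(conv_out_len(n_mel_frames))
```

With hop_length=160 at 16 kHz (100 mel frames per second), a 10-second clip yields ~1000 mel frames and therefore ~250 encoder positions for cross-attention.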

## Usage

```python
import torch
import torchaudio
from omniaudio.model_v2 import OmniAudioScratchModel
from transformers import PreTrainedTokenizerFast

# Load model
model = OmniAudioScratchModel(
    encoder_config={"n_mels": 80, "d_model": 256, "n_heads": 4, "n_layers": 6, "n_conv": 2},
    decoder_config={"d_model": 512, "n_heads": 8, "n_layers": 8},
    vocab_size=50257,
)
state = torch.load("model.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state, strict=False)
model.cuda()
model.eval()  # disable dropout for inference

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("tokenizer/")

# Transcribe: the model expects 16 kHz mono audio
waveform, sr = torchaudio.load("audio.wav")
if waveform.shape[0] > 1:  # downmix multi-channel to mono
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_mels=80, n_fft=400, hop_length=160
)
mel = torch.log(torch.clamp(mel_transform(waveform), min=1e-10)).cuda()

with torch.no_grad():
    tokens = model.generate(mel, max_len=256)
text = tokenizer.decode(tokens, skip_special_tokens=True)
print(text)
```

## Limitations

- Trained on studio-quality TTS data (kzcalm), so it may perform worse on noisy real-world audio
- A 70M model has limited capacity; larger models with more diverse data would improve results
- Only the Kazakh language is supported
- Max audio length: 10 seconds (during training)

## Citation

```bibtex
@misc{omniaudio-v2-2026,
  title={OmniAudio v2: Native Kazakh ASR from Scratch},
  author={Saken Tukenov},
  year={2026},
  url={https://huggingface.co/stukenov/sozkz-core-omniaudio-70m-kk-asr-v1}
}
```