Qyra-350M

An Urdu language model produced by continued pretraining of LiquidAI/LFM2-350M.

Model Details

  • Base: LiquidAI/LFM2-350M
  • Architecture: Decoder-only causal LM
  • Parameters: 350M
  • Context length: 512 tokens
  • Tokenizer: LFM-2 tokenizer + Urdu vocabulary expansion (up to 10k new tokens, enforced as a hard cap)
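The vocabulary-expansion step can be sketched as follows. `select_new_tokens` is a hypothetical helper (not from the repository); the commented `add_tokens`/`resize_token_embeddings` calls are the standard Hugging Face pattern for growing a tokenizer and its embedding matrix together.

```python
def select_new_tokens(candidates, existing_vocab, hard_cap=10_000):
    """Dedupe candidate Urdu tokens, drop any already in the base
    vocabulary, and enforce the +10k hard cap."""
    seen = set(existing_vocab)
    selected = []
    for tok in candidates:
        if tok not in seen:
            seen.add(tok)
            selected.append(tok)
    return selected[:hard_cap]

# Applying the expansion (standard Hugging Face pattern):
# new_tokens = select_new_tokens(urdu_candidates, tokenizer.get_vocab())
# tokenizer.add_tokens(new_tokens)
# model.resize_token_embeddings(len(tokenizer))
```

Resizing the embeddings after `add_tokens` is required; the new rows are randomly initialized and learned during continued pretraining.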

Training

Data

  • Dataset: ReySajju742/shaistagi_clean (train split)
  • Samples: 500,000 (streamed, no full materialization)
  • Text field: text

Hyperparameters

  • num_train_epochs = 3
  • per_device_train_batch_size = 1
  • gradient_accumulation_steps = 16 (effective batch size = 16)
  • learning_rate = 2.0e-4
  • weight_decay = 0.01
  • warmup_ratio = 0.03
  • max_grad_norm = 1.0
  • max_seq_length = 512
  • Mixed precision: bf16 (fallback to fp16)
  • gradient_checkpointing = True
  • save_strategy = "epoch", save_total_limit = 3
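Assuming a standard Hugging Face `Trainer` setup, the hyperparameters above map onto a config like the following. This is a sketch, not the repository's actual config.yaml.

```python
# Hyperparameters from the table above, as keyword arguments for
# transformers.TrainingArguments.
train_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2.0e-4,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "max_grad_norm": 1.0,
    "bf16": True,                  # fall back to fp16 where bf16 is unsupported
    "gradient_checkpointing": True,
    "save_strategy": "epoch",
    "save_total_limit": 3,
}

# Effective batch size = per-device batch size x accumulation steps.
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])

# The dict can then be splatted into TrainingArguments:
# args = TrainingArguments(output_dir="qyra-350m", **train_config)
```

Gradient accumulation is what makes a per-device batch of 1 viable on modest hardware: gradients from 16 forward/backward passes are summed before each optimizer step.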

Data Pipeline

  • Streaming mode: load_dataset(..., streaming=True)
  • Sample cap enforced before tokenization
  • All .map() calls: load_from_cache_file=False, keep_in_memory=False, writer_batch_size=1000
  • Column pruning: only text column retained

Computational Cost

  • Training time: ~48 hours (3 epochs)
  • Local cost: ~$1.15 USD
  • Cloud equivalent: ~$50-80 USD

Evaluation

Metrics

  • Perplexity: 27.82
  • Fragmentation ratio: 4.78 tokens/word
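Both metrics can be computed with simple helpers; these are generic sketches with hypothetical function names, not the repository's evaluation code.

```python
import math

def perplexity(total_nll, total_tokens):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(total_nll / total_tokens)

def fragmentation_ratio(texts, tokenize):
    """Average subword tokens per whitespace-separated word.
    Lower is better: heavy fragmentation inflates sequence length
    and wastes the 512-token context window."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words
```

For example, if the tokenizer splits every word in a two-word sentence into two subwords, the fragmentation ratio is 2.0; the reported 4.78 tokens/word indicates substantial subword splitting on Urdu text.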

Evaluation Prompts

  1. اردو ایک خوبصورت زبان ہے۔ ("Urdu is a beautiful language.")
  2. مجھے پاکستان کی تاریخ کے بارے میں بتائیں۔ ("Tell me about the history of Pakistan.")
  3. آج موسم کیسا ہے؟ ("How is the weather today?")
  4. آپ کا پسندیدہ شاعر کون ہے؟ ("Who is your favorite poet?")

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mahwizzzz/qyra-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Urdu prompt: "Urdu is a beautiful language."
prompt = "اردو ایک خوبصورت زبان ہے۔"
inputs = tokenizer(prompt, return_tensors="pt")
gen = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(gen[0], skip_special_tokens=True))

Training Code

Training code available at: GitHub Repository

Key files:

  • src/pretrain.py - Continued pretraining script
  • src/data_utils.py - Memory-safe data pipeline
  • src/tokenizer_utils.py - Tokenizer expansion
  • config.yaml - Configuration

Limitations

  • Trained on 500k-sample subset of full corpus
  • Context length limited to 512 tokens
  • No explicit alignment or safety filtering

Intended Use

This model is intended for:

  • Research into Urdu language modeling
  • Downstream continued pretraining on specialized Urdu corpora
  • Supervised fine-tuning for Urdu dialogue, QA, and educational content
  • Evaluation of Urdu tokenization and vocabulary design

Not intended for:

  • Production deployment without further safety evaluation
  • Safety-critical applications
  • High-risk use cases without additional alignment

Risks and Limitations

  • May generate biased, offensive, or inaccurate content
  • No explicit safety filtering or alignment
  • Limited context window (512 tokens)
  • Trained on subset of available Urdu corpus
  • May exhibit Urdu-English code-mixing behavior

Citation

@misc{qyra-350m,
  title={mahwizzzz/qyra-350m: Continued Pretraining of LFM-2 for Urdu},
  author={Mahwiz Khalil},
  year={2025},
  url={https://huggingface.co/mahwizzzz/qyra-350m}
}