# Qyra-350M
Continued-pretrained Urdu language model based on LiquidAI/LFM2-350M.
## Model Details

- Base: LiquidAI/LFM2-350M
- Architecture: Decoder-only causal LM
- Parameters: 350M
- Context length: 512 tokens
- Tokenizer: LFM-2 tokenizer with Urdu vocabulary expansion (+10k tokens, hard cap); see the sketch below
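The vocabulary expansion can be sketched as follows, assuming the standard `transformers` tokenizer and embedding-resizing APIs; the candidate token list and cap logic here are illustrative stand-ins for `src/tokenizer_utils.py`, not its actual code.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

base_id = "LiquidAI/LFM2-350M"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Illustrative candidates only; the real list would be mined from the Urdu corpus.
urdu_candidates = ["پاکستان", "خوبصورت", "تاریخ", "زبان"]

# Keep tokens the base vocabulary lacks, under the 10k hard cap.
vocab = tokenizer.get_vocab()
new_tokens = [t for t in urdu_candidates if t not in vocab][:10_000]

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # grow embeddings to match the new vocab
```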
## Training

### Data

- Dataset: ReySajju742/shaistagi_clean (train split)
- Samples: 500,000 (streamed, no full materialization)
- Text field: `text`
### Hyperparameters

- num_train_epochs = 3
- per_device_train_batch_size = 1
- gradient_accumulation_steps = 16 (effective batch size = 16)
- learning_rate = 2.0e-4
- weight_decay = 0.01
- warmup_ratio = 0.03
- max_grad_norm = 1.0
- max_seq_length = 512
- Mixed precision: bf16 (fallback to fp16)
- gradient_checkpointing = True
- save_strategy = "epoch", save_total_limit = 3
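As a hedged sketch, these settings map onto Hugging Face `TrainingArguments` roughly as below, assuming the HF `Trainer` API was used; `output_dir` is illustrative, and `max_seq_length` is applied at tokenization time rather than here. The repository's `config.yaml` is authoritative.

```python
import torch
from transformers import TrainingArguments

# Assumes a CUDA GPU; falls back to fp16 where bf16 is unsupported.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = TrainingArguments(
    output_dir="qyra-350m-cpt",        # illustrative
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,    # effective batch size = 16
    learning_rate=2.0e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    max_grad_norm=1.0,
    bf16=use_bf16,
    fp16=not use_bf16,
    gradient_checkpointing=True,
    save_strategy="epoch",
    save_total_limit=3,
)
```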
### Data Pipeline

- Streaming mode: `load_dataset(..., streaming=True)` (sketched below)
- Sample cap enforced before tokenization
- All `.map()` calls: `load_from_cache_file=False`, `keep_in_memory=False`, `writer_batch_size=1000`
- Column pruning: only the `text` column is retained
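A minimal sketch of this pipeline, assuming the Hugging Face `datasets` streaming API; function and variable names are illustrative, not the actual `src/data_utils.py` implementation.

```python
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mahwizzzz/qyra-350m")

# Stream the corpus so it is never fully materialized, and enforce
# the 500k sample cap before any tokenization happens.
stream = load_dataset("ReySajju742/shaistagi_clean", split="train", streaming=True)
capped = stream.take(500_000)

# Materialize only the capped subset, keeping just the `text` column.
ds = Dataset.from_generator(lambda: ({"text": ex["text"]} for ex in capped))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = ds.map(
    tokenize,
    batched=True,
    remove_columns=["text"],      # column pruning
    load_from_cache_file=False,   # no stale cache reuse
    keep_in_memory=False,         # write to disk rather than RAM
    writer_batch_size=1000,       # bound the Arrow writer buffer
)
```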
### Computational Cost
- Training time: ~48 hours (3 epochs)
- Local cost: ~$1.15 USD
- Cloud equivalent: ~$50-80 USD
## Evaluation

### Metrics
- Perplexity: 27.82
- Fragmentation ratio: 4.78 tokens/word (see the sketches below)
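Hedged sketches of how these two metrics are commonly computed; the helpers below are illustrative, not the project's actual evaluation script.

```python
import math
import torch

def perplexity(text, model, tokenizer):
    # exp of the mean next-token cross-entropy over the text
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def fragmentation_ratio(texts, tokenizer):
    # average number of subword tokens per whitespace-separated word
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words
```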
### Evaluation Prompts

- اردو ایک خوبصورت زبان ہے۔ ("Urdu is a beautiful language.")
- مجھے پاکستان کی تاریخ کے بارے میں بتائیں۔ ("Tell me about the history of Pakistan.")
- آج موسم کیسا ہے؟ ("How is the weather today?")
- آپ کا پسندیدہ شاعر کون ہے؟ ("Who is your favorite poet?")
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mahwizzzz/qyra-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "اردو ایک خوبصورت زبان ہے۔"  # "Urdu is a beautiful language."
inputs = tokenizer(prompt, return_tensors="pt")
gen = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(gen[0], skip_special_tokens=True))
```
## Training Code

Training code available at: GitHub Repository

Key files:

- `src/pretrain.py` - Continued pretraining script
- `src/data_utils.py` - Memory-safe data pipeline
- `src/tokenizer_utils.py` - Tokenizer expansion
- `config.yaml` - Configuration
## Limitations

- Trained on a 500k-sample subset of the full corpus
- Context length limited to 512 tokens
- No explicit alignment or safety filtering
## Intended Use
This model is intended for:
- Research into Urdu language modeling
- Downstream continued pretraining on specialized Urdu corpora
- Supervised fine-tuning for Urdu dialogue, QA, and educational content
- Evaluation of Urdu tokenization and vocabulary design
Not intended for:
- Production deployment without further safety evaluation
- Safety-critical applications
- High-risk use cases without additional alignment
## Risks and Limitations
- May generate biased, offensive, or inaccurate content
- No explicit safety filtering or alignment
- Limited context window (512 tokens)
- Trained on a subset of the available Urdu corpus
- May exhibit code-mixing behavior (Urdu-English)
## Citation

```bibtex
@misc{qyra-350m,
  title={mahwizzzz/qyra-350m: Continued Pretraining of LFM-2 for Urdu},
  author={Mahwiz Khalil},
  year={2025},
  url={https://huggingface.co/mahwizzzz/qyra-350m}
}
```