Konkani Qwen2-1.5B
The first dedicated Konkani language model, fine-tuned from Qwen2-1.5B using Continued Pre-Training (CPT) followed by Supervised Fine-Tuning (SFT).
Model Description
This model is specifically trained for the Konkani language (Goan dialect, Devanagari script). It can:
- Answer questions about Goa, Konkani culture, history, and food
- Hold conversations in pure Konkani
- Translate English/Hindi to Konkani
- Write Konkani poetry and short stories
- Describe Goan places, festivals, and traditions
Training Pipeline
Qwen2-1.5B (base)
↓ CPT — 300,000 Konkani text chunks
↓ SFT — 13,417 instruction-response pairs
= konkani-qwen2-1.5b
CPT Dataset: 300K chunks of pure Konkani text (Devanagari script, Goan dialect)
SFT Dataset: 13,417 curated Konkani instruction-response pairs across 12 categories:
- Simple chitchat, Goa history, Goan culture & festivals
- Goan food, Poetry, Short stories, Descriptive writing
- Translation, Factual Q&A, Language help
- Konkani identity, Multi-turn conversation
Usage
# Install (notebook syntax; drop the leading "!" in a regular shell)
!pip install -q transformers accelerate

# Load from Hugging Face
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nischay185/konkani-qwen2-1.5b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto"
)

# System prompt instructing the model to answer only in pure Konkani
SYSTEM = "तूं एक कोंकणी भाशेचो सहाय्यक. फकत शुद्ध कोंकणींत जाप दिवप."

def chat(question):
    # Build a ChatML prompt and leave the assistant turn open
    prompt = f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
    prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=400,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
        )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(chat("पणजी शाराचें वर्णन कर"))
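The `chat` helper above handles a single question. Since multi-turn conversation is one of the SFT categories, a prompt carrying earlier turns may also be useful. The sketch below extends the same ChatML layout to a message history. The `build_prompt` name and the list-of-dicts message format are illustrative assumptions, not part of this model's released code, and the card does not state how multi-turn examples were formatted during training.

```python
# Sketch only: build_prompt and the message layout are hypothetical helpers.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"

def build_prompt(messages, system=None):
    """Render a list of {"role", "content"} dicts as one ChatML prompt string."""
    parts = []
    if system:
        parts.append(f"{IM_START}system\n{system}{IM_END}\n")
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    # Leave the assistant turn open so the model writes the next reply.
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

# Example history reusing a question and answer shown elsewhere in this card.
history = [
    {"role": "user", "content": "गोंयची मुक्ती केन्ना जाली?"},
    {"role": "assistant", "content": "गोंय 19 डिसेंबर 1961 दिसा पोर्तुगेजांखाला मुक्त जाले."},
    {"role": "user", "content": "पणजी शाराचें वर्णन कर"},
]
prompt = build_prompt(history, system="तूं एक कोंकणी भाशेचो सहाय्यक. फकत शुद्ध कोंकणींत जाप दिवप.")
```

The resulting string can be tokenized and passed to `model.generate` the same way as the prompt inside `chat`.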
Example Outputs
Q: गोंयची मुक्ती केन्ना जाली?
गोंय 19 डिसेंबर 1961 दिसा पोर्तुगेजांखाला मुक्त जाले. भारतीय सैन्यान 'ऑपरेशन विजय' राबवन गोंयची स्वतंत्रताय मेळयली...
Q: शिगमो सण कसो मनयतात?
शिगमो हो गोंयांतलो एक व्हड लोकउत्सव, जो फाल्गुन म्हयन्यांत मनयतात. रथांचे बांधकाम पळोवपाखातीर लोक येतात...
Q: Translate to Konkani: I love Goa.
हांव गोंयाक मोग करता.
Limitations
- Responses may occasionally truncate; increase `max_new_tokens` if needed
- Identity responses may hallucinate names; v2 training planned
- Best results with Goa/Konkani specific topics
- Not suitable for tasks requiring real-time information
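One way to notice the truncation case is to check whether generation used its whole token budget without producing the end-of-turn token. The helper below is a minimal sketch under that assumption; `hit_token_limit` is a hypothetical name, and it expects the list of newly generated token ids (prompt tokens excluded).

```python
def hit_token_limit(new_token_ids, eos_token_id, max_new_tokens):
    """Return True if the budget was exhausted and no EOS token was emitted."""
    used_full_budget = len(new_token_ids) >= max_new_tokens
    return used_full_budget and eos_token_id not in new_token_ids

# Illustrative use with the objects from the Usage section (not executed here):
#   new_ids = out[0][inputs["input_ids"].shape[1]:].tolist()
#   eos_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
#   if hit_token_limit(new_ids, eos_id, 400):
#       ...  # retry with a larger max_new_tokens
```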
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2-1.5B |
| CPT data | 300,000 chunks |
| SFT data | 13,417 pairs |
| LoRA r | 16 |
| LoRA alpha | 32 |
| Learning rate | 2e-4 |
| Epochs (SFT) | 3 |
| Hardware | Kaggle T4 GPU |
| Training time | ~12 hours total |
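To give a sense of what the LoRA values in the table imply, the sketch below works through the standard LoRA arithmetic: the adapter update is scaled by alpha / r, and a rank-r adapter on a d_out × d_in weight adds r × (d_in + d_out) parameters. The card does not say which modules were adapted or whether a variant scaling was used, so the 1536-wide square projection is an illustrative assumption only.

```python
def lora_scaling(alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r

def lora_added_params(d_in, d_out, r):
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

scaling = lora_scaling(alpha=32, r=16)                    # 2.0
# Hypothetical 1536 x 1536 projection; the dimension is an assumption.
added = lora_added_params(d_in=1536, d_out=1536, r=16)    # 49152
full = 1536 * 1536
print(f"scaling={scaling}, added={added}, fraction={added / full:.4f}")
```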
Language
Konkani (Goan dialect, Devanagari script) — the official language of Goa, India. This is believed to be the first open-source LLM specifically trained for Konkani.
License
Apache 2.0 — free to use, modify, and distribute.
Author
Developed by Nischay Mandrekar