Konkani Qwen2-1.5B

The first dedicated Konkani language model: Qwen2-1.5B fine-tuned with Continued Pre-Training (CPT) followed by Supervised Fine-Tuning (SFT).

Model Description

This model is specifically trained for the Konkani language (Goan dialect, Devanagari script). It can:

  • Answer questions about Goa, Konkani culture, history, and food
  • Hold conversations in pure Konkani
  • Translate English/Hindi to Konkani
  • Write Konkani poetry and short stories
  • Describe Goan places, festivals, and traditions

Training Pipeline

Qwen2-1.5B (base)
    ↓ CPT — 300,000 Konkani text chunks
    ↓ SFT — 13,417 instruction-response pairs
    = konkani-qwen2-1.5b

CPT Dataset: 300K chunks of pure Konkani text (Devanagari script, Goan dialect)

SFT Dataset: 13,417 curated Konkani instruction-response pairs across 12 categories:

  • Simple chitchat, Goa history, Goan culture & festivals
  • Goan food, Poetry, Short stories, Descriptive writing
  • Translation, Factual Q&A, Language help
  • Konkani identity, Multi-turn conversation
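Each SFT pair is ultimately rendered into Qwen2's ChatML template, the same layout the chat() helper in the Usage section rebuilds at inference time. A minimal sketch of that formatting step (the to_chatml helper and field names are illustrative, not the released training code):

```python
# Sketch: render one instruction-response pair into Qwen2's ChatML
# layout. Helper and field names are illustrative, not the actual
# training pipeline.
def to_chatml(system: str, instruction: str, response: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>\n"
    )

pair = {
    "instruction": "Translate to Konkani: I love Goa.",
    "response": "हांव गोंयाक मोग करता.",
}
text = to_chatml(
    "तूं एक कोंकणी भाशेचो सहाय्यक.",  # "You are a Konkani language assistant."
    pair["instruction"],
    pair["response"],
)
```

Concatenating many such strings (plus the raw Konkani chunks for CPT) gives the two training corpora described above.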

Usage

# Install
!pip install -q transformers accelerate

# Load the model and tokenizer from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nischay185/konkani-qwen2-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto"
)

SYSTEM = "तूं एक कोंकणी भाशेचो सहाय्यक. फकत शुद्ध कोंकणींत जाप दिवप."  # "You are a Konkani language assistant. Reply only in pure Konkani."

def chat(question):
    prompt = f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
    prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=400,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
        )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(chat("पणजी शाराचें वर्णन कर"))
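The chat() helper above handles a single turn. Since the SFT data includes a multi-turn conversation category, a history-aware prompt builder can be sketched by replaying earlier turns; build_prompt here is a hypothetical extension, not part of the released code:

```python
# Sketch: build a multi-turn ChatML prompt by replaying conversation
# history. `history` is a list of (user, assistant) turns; this simply
# extends the single-turn prompt that chat() constructs.
SYSTEM = "तूं एक कोंकणी भाशेचो सहाय्यक. फकत शुद्ध कोंकणींत जाप दिवप."

def build_prompt(history, question):
    prompt = f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
    for user, assistant in history:
        prompt += f"<|im_start|>user\n{user}<|im_end|>\n"
        prompt += f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    return prompt
```

Feed the returned string to the tokenizer exactly as chat() does, and append each (question, reply) pair to the history after every call.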

Example Outputs

Q: गोंयची मुक्ती केन्ना जाली? (When was Goa liberated?)

गोंय 19 डिसेंबर 1961 दिसा पोर्तुगेजांखाला मुक्त जाले. भारतीय सैन्यान 'ऑपरेशन विजय' राबवन गोंयची स्वतंत्रताय मेळयली...
(Goa was freed from the Portuguese on 19 December 1961. The Indian Army carried out 'Operation Vijay' and won Goa's freedom...)

Q: शिगमो सण कसो मनयतात? (How is the Shigmo festival celebrated?)

शिगमो हो गोंयांतलो एक व्हड लोकउत्सव, जो फाल्गुन म्हयन्यांत मनयतात. रथांचे बांधकाम पळोवपाखातीर लोक येतात...
(Shigmo is a major folk festival of Goa, celebrated in the month of Phalgun. People come to watch the building of the floats...)

Q: Translate to Konkani: I love Goa.

हांव गोंयाक मोग करता.

Limitations

  • Responses may occasionally be truncated; increase max_new_tokens if needed
  • Identity responses may hallucinate names; improved training is planned for v2
  • Best results with Goa/Konkani specific topics
  • Not suitable for tasks requiring real-time information
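The truncation limitation above can be worked around mechanically: if generation consumed its entire token budget, it almost certainly stopped mid-sentence rather than at <|im_end|>, so retrying with a larger max_new_tokens usually recovers a complete reply. A sketch, where `generate` is a hypothetical wrapper around the model.generate call from the Usage section that returns the decoded reply and the number of new tokens produced:

```python
# Sketch: retry with a larger token budget when a reply looks
# truncated. `generate(question, max_new_tokens)` is a hypothetical
# wrapper around model.generate that returns (decoded_text, n_new_tokens).
def chat_with_retry(question, generate, budgets=(400, 800, 1600)):
    text = ""
    for budget in budgets:
        text, used = generate(question, budget)
        if used < budget:  # generation stopped at <|im_end|> on its own
            return text
    return text  # best effort at the largest budget
```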

Training Details

Parameter       Value
Base model      Qwen/Qwen2-1.5B
CPT data        300,000 chunks
SFT data        13,417 pairs
LoRA r          16
LoRA alpha      32
Learning rate   2e-4
Epochs (SFT)    3
Hardware        Kaggle T4 GPU
Training time   ~12 hours total
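The LoRA hyperparameters in the table map directly onto a peft LoraConfig. A sketch of that configuration follows; lora_dropout and target_modules are assumptions, not confirmed details of this training run:

```python
from peft import LoraConfig

# LoRA setup matching the table (r=16, alpha=32). The dropout value
# and the targeted projection layers are assumptions, not confirmed
# settings from the actual run.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,                                        # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```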

Language

Konkani (Goan dialect, Devanagari script) — the official language of Goa, India. This is believed to be the first open-source LLM specifically trained for Konkani.

License

Apache 2.0 — free to use, modify, and distribute.

Author

Developed by Nischay Mandrekar
