Konkani Qwen2-1.5B

The first dedicated Konkani language model: Qwen2-1.5B fine-tuned with Continued Pre-Training (CPT) followed by Supervised Fine-Tuning (SFT).

Model Description

This model is specifically trained for the Konkani language (Goan dialect, Devanagari script). It can:

  • Answer questions about Goa, Konkani culture, history, and food
  • Hold conversations in pure Konkani
  • Translate English/Hindi to Konkani
  • Write Konkani poetry and short stories
  • Describe Goan places, festivals, and traditions

Training Pipeline

Qwen2-1.5B (base)
    ↓ CPT — 300,000 Konkani text chunks
    ↓ SFT — 13,417 instruction-response pairs
    = konkani-qwen2-1.5b

CPT Dataset: 300K chunks of pure Konkani text (Devanagari script, Goan dialect)

SFT Dataset: 13,417 curated Konkani instruction-response pairs across 12 categories:

  • Simple chitchat, Goa history, Goan culture & festivals
  • Goan food, Poetry, Short stories, Descriptive writing
  • Translation, Factual Q&A, Language help
  • Konkani identity, Multi-turn conversation
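Each SFT pair is ultimately rendered into Qwen2's ChatML template, the same layout the chat() helper in the Usage section rebuilds at inference time. A minimal sketch of that formatting step (the to_chatml helper and field names are illustrative, not the released training code):

```python
# Sketch: render one instruction-response pair into Qwen2's ChatML
# layout. Helper and field names are illustrative, not the actual
# training pipeline.
def to_chatml(system: str, instruction: str, response: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>\n"
    )

pair = {
    "instruction": "Translate to Konkani: I love Goa.",
    "response": "हांव गोंयाक मोग करता.",
}
text = to_chatml(
    "तूं एक कोंकणी भाशेचो सहाय्यक.",  # "You are a Konkani language assistant."
    pair["instruction"],
    pair["response"],
)
```

Concatenating many such strings (plus the raw Konkani chunks for CPT) gives the two training corpora described above.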

Usage

# Install
!pip install -q transformers accelerate

# Load the model and tokenizer from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nischay185/konkani-qwen2-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto"
)

SYSTEM = "तूं एक कोंकणी भाशेचो सहाय्यक. फकत शुद्ध कोंकणींत जाप दिवप."  # "You are a Konkani language assistant. Reply only in pure Konkani."

def chat(question):
    prompt = f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
    prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=400,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
        )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(chat("पणजी शाराचें वर्णन कर"))
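The chat() helper above handles a single turn. Since the SFT data includes a multi-turn conversation category, a history-aware prompt builder can be sketched by replaying earlier turns; build_prompt here is a hypothetical extension, not part of the released code:

```python
# Sketch: build a multi-turn ChatML prompt by replaying conversation
# history. `history` is a list of (user, assistant) turns; this simply
# extends the single-turn prompt that chat() constructs.
SYSTEM = "तूं एक कोंकणी भाशेचो सहाय्यक. फकत शुद्ध कोंकणींत जाप दिवप."

def build_prompt(history, question):
    prompt = f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
    for user, assistant in history:
        prompt += f"<|im_start|>user\n{user}<|im_end|>\n"
        prompt += f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    return prompt
```

Feed the returned string to the tokenizer exactly as chat() does, and append each (question, reply) pair to the history after every call.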

Example Outputs

Q: गोंयची मुक्ती केन्ना जाली? (When was Goa liberated?)

गोंय 19 डिसेंबर 1961 दिसा पोर्तुगेजांखाला मुक्त जाले. भारतीय सैन्यान 'ऑपरेशन विजय' राबवन गोंयची स्वतंत्रताय मेळयली...
(Goa was freed from the Portuguese on 19 December 1961. The Indian Army carried out 'Operation Vijay' and won Goa's freedom...)

Q: शिगमो सण कसो मनयतात? (How is the Shigmo festival celebrated?)

शिगमो हो गोंयांतलो एक व्हड लोकउत्सव, जो फाल्गुन म्हयन्यांत मनयतात. रथांचे बांधकाम पळोवपाखातीर लोक येतात...
(Shigmo is a major folk festival of Goa, celebrated in the month of Phalgun. People come to watch the building of the floats...)

Q: Translate to Konkani: I love Goa.

हांव गोंयाक मोग करता.

Limitations

  • Responses may occasionally be truncated; increase max_new_tokens if needed
  • Identity responses may hallucinate names; improved training is planned for v2
  • Best results with Goa/Konkani specific topics
  • Not suitable for tasks requiring real-time information
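The truncation limitation above can be worked around mechanically: if generation consumed its entire token budget, it almost certainly stopped mid-sentence rather than at <|im_end|>, so retrying with a larger max_new_tokens usually recovers a complete reply. A sketch, where `generate` is a hypothetical wrapper around the model.generate call from the Usage section that returns the decoded reply and the number of new tokens produced:

```python
# Sketch: retry with a larger token budget when a reply looks
# truncated. `generate(question, max_new_tokens)` is a hypothetical
# wrapper around model.generate that returns (decoded_text, n_new_tokens).
def chat_with_retry(question, generate, budgets=(400, 800, 1600)):
    text = ""
    for budget in budgets:
        text, used = generate(question, budget)
        if used < budget:  # generation stopped at <|im_end|> on its own
            return text
    return text  # best effort at the largest budget
```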

Training Details

Parameter       Value
Base model      Qwen/Qwen2-1.5B
CPT data        300,000 chunks
SFT data        13,417 pairs
LoRA r          16
LoRA alpha      32
Learning rate   2e-4
Epochs (SFT)    3
Hardware        Kaggle T4 GPU
Training time   ~12 hours total
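The LoRA hyperparameters in the table map directly onto a peft LoraConfig. A sketch of that configuration follows; lora_dropout and target_modules are assumptions, not confirmed details of this training run:

```python
from peft import LoraConfig

# LoRA setup matching the table (r=16, alpha=32). The dropout value
# and the targeted projection layers are assumptions, not confirmed
# settings from the actual run.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,                                        # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```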

Language

Konkani (Goan dialect, Devanagari script) — the official language of Goa, India. This is believed to be the first open-source LLM specifically trained for Konkani.

License

Apache 2.0 — free to use, modify, and distribute.

Author

Developed by Nischay Mandrekar
