Konkani LLM: Bringing a Multi-Script Low-Resource Language to the AI Era
Large Language Models (LLMs) have taken the world by storm, but their linguistic prowess is heavily skewed toward high-resource languages. When it comes to regional and low-resource Indian languages with complex multi-script orthographies, even state-of-the-art models stumble.
Today we are thrilled to introduce the Konkani LLM Project: an initiative dedicated to bringing Konkani into the modern AI ecosystem. Konkani is a scheduled language of India, native to Goa and the Konkan coast.
In this article, we will walk you through our methodology.
Explore the models and datasets and try the demo yourself:
- Hugging Face Org: huggingface.co/konkani
- Live Demo: konkani.app
The Challenge: Data Scarcity and Script Fragmentation
Konkani presents a unique challenge for Natural Language Processing (NLP). High-quality parallel data is scarce, and the language is written in several diverging scripts, most prominently Devanagari, Romi (Latin), and Kannada. Existing Indic LLM initiatives provide strong baselines for Hindi and other regional languages, but often lack explicit support for Konkani; the gap is especially evident in code-mixed and multi-script settings. Even top-tier models like GPT-5 and DeepSeek V3 continue to lack meaningful support for the language.
The Solution: Konkani-Instruct-100k
To solve the data bottleneck, we generated Konkani-Instruct-100k: the first large-scale multi-script instruction-tuning dataset for Konkani.
Since standard transliteration tools introduced phonological errors, we built a tightly controlled synthetic generation pipeline powered by Gemini 3 Flash. We did not just generate Q&A pairs: we designed a "Tutor-Style" pedagogical framework to teach models the core linguistic mechanics of Konkani.
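To make the framework concrete, here is a minimal sketch of what a tutor-style generation prompt could look like. The wording, field names, and the `build_tutor_prompt` helper are all illustrative reconstructions, not the project's actual template.

```python
# Hypothetical sketch of a "Tutor-Style" prompt sent to the generator
# model. Every field below is illustrative, not the real template.

def build_tutor_prompt(topic: str, script: str) -> str:
    """Assemble a pedagogical generation prompt for one dataset sample."""
    return (
        f"You are a Konkani language tutor. Write in the {script} script.\n"
        f"Topic: {topic}\n\n"
        "Produce:\n"
        "1. A question a learner might ask about this topic.\n"
        "2. A culturally accurate answer in Konkani.\n"
        "3. A grammar decomposition table: for each word, give the\n"
        "   part-of-speech tag and base-modifier morphology\n"
        "   (gender, politeness, oblique case).\n"
    )

print(build_tutor_prompt("food", "Devanagari"))
```

The point of the structured instruction is that the generator must justify every word it emits, which is what lets the downstream fine-tune learn grammar rather than surface patterns.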
Every sample in our dataset includes the following components:
- The Answer: A culturally accurate response.
- A Grammar Decomposition Table: An explicit structural breakdown, including part-of-speech tags and base-modifier morphology (e.g., gender, politeness, and oblique cases).
- Domain Diversity: Balanced representation across 18 foundational topics (food, family, tenses, etc.) and varied domains.
The final dataset comprises over 105,000 samples, strictly balanced across the Devanagari, Romi, and Kannada scripts.
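Because the three scripts live in distinct Unicode blocks, the per-script balance is easy to audit. Below is a minimal sketch of such a check; the sample strings are illustrative, not drawn from the dataset.

```python
# Sketch: classify a sample's script by counting characters in each
# Unicode block, then tally the distribution. Sample texts are made up.
from collections import Counter

def detect_script(text: str) -> str:
    """Classify text as Devanagari, Kannada, or Romi (Latin)."""
    counts = {"Devanagari": 0, "Kannada": 0, "Romi": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:      # Devanagari block
            counts["Devanagari"] += 1
        elif 0x0C80 <= cp <= 0x0CFF:    # Kannada block
            counts["Kannada"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["Romi"] += 1
    return max(counts, key=counts.get)

samples = ["कोंकणी भास", "Konknni bhas", "ಕೊಂಕಣಿ ಭಾಸ"]
print(Counter(detect_script(s) for s in samples))
```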
Building Konkani LLMs
Equipped with Konkani-Instruct-100k, we fine-tuned leading open-weight architectures using Parameter-Efficient Fine-Tuning (LoRA). Our lineup includes:
- Gemma 3 (4B, 12B, 27B)
- Llama 3.1 (8B)
- Qwen 2.5 (1.5B, 14B)
By targeting the attention and MLP projections with a LoRA rank of 64, we kept the adapter weights lightweight. This makes the models highly accessible for low-cost serverless LoRA deployment.
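A back-of-the-envelope calculation shows why rank-64 adapters stay small. The layer shapes below assume Llama 3.1 8B's published architecture (hidden size 4096, 8 KV heads, MLP dimension 14336, 32 layers); treat them as an illustration of the arithmetic, not a statement of the project's exact training config.

```python
# Each LoRA adapter on a (d_out x d_in) linear layer adds two matrices:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters.
RANK = 64

# (d_out, d_in) per targeted projection in one Llama 3.1 8B decoder layer
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),   # grouped-query attention: 8 KV heads
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (14336, 4096),
    "up_proj": (14336, 4096),
    "down_proj": (4096, 14336),
}

def lora_params(shapes: dict, rank: int) -> int:
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes.values())

total = lora_params(LAYER_SHAPES, RANK) * 32  # 32 decoder layers
print(f"{total / 1e6:.0f}M trainable params (~{total / 8e9:.1%} of 8B)")
```

Roughly 168M trainable parameters, about 2% of the 8B base, which is why the adapters ship as small files suitable for serverless LoRA hot-swapping.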
Evaluation: The Konkani Multi-Script Benchmark
To rigorously test our models, we built Konkani-Bench: a 200-pair, human-annotated benchmark designed to stress-test translation (Konkani-to-English) and transliteration (cross-script robustness) across Devanagari, Romi, and Kannada.
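For readers unfamiliar with chrF++, here is a pocket-sized, chrF-style character n-gram F-score. This toy version only illustrates the idea; the actual reported numbers come from standard implementations, and this sketch omits chrF++'s word n-gram component.

```python
# Toy chrF-style metric: average F-beta over character n-gram overlaps.
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hypothesis: str, reference: str,
              max_n: int = 6, beta: float = 2.0) -> float:
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp and not ref:
            continue  # both strings shorter than n
        overlap = sum((hyp & ref).values())
        precision = overlap / max(sum(hyp.values()), 1)
        recall = overlap / max(sum(ref.values()), 1)
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * precision * recall
                          / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0

print(chrf_like("konknni bhas", "konknni bhas"))  # identical strings → 1.0
```

Character-level metrics like this are a natural fit for transliteration evaluation, since a single wrong grapheme should cost little while a script mix-up should cost a lot.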
Results: Translation
Our fine-tuned models delivered consistent, substantial gains over their base counterparts, in many cases surpassing proprietary models such as GPT-4 and Claude 3.5 Sonnet on Konkani-to-English translation.
As seen in the chart below, models like konkani-Qwen2.5-14B-Instruct and konkani-gemma-3-27b-it dominate the open-weights category across both BLEU and chrF++ metrics.
Note: Our konkani-Qwen2.5-14B model outperforms base models that are 5x its size.
Results: Transliteration
Transliterating gracefully between Devanagari, Romi, and Kannada is a complex task on which base models struggle heavily. On automatic metrics, our fine-tuned models demonstrated massive improvements in cross-script robustness.
The heatmaps below illustrate the dramatic performance leap. While base models fail to accurately map between scripts (often hallucinating or mixing languages), our fine-tuned variants achieve top-tier BLEU and chrF++ scores across all directions, often doubling or tripling the accuracy of their base counterparts.
BLEU Score Heatmap:
chrF++ Score Heatmap:
LLM-as-a-Judge
Beyond raw translation and transliteration, we wanted to measure conversational quality. We scored model outputs with an LLM-as-a-judge for helpfulness, script fidelity, and the absence of Hindi/Marathi contamination, on a scale of 1 to 5:
- Llama 8B proved to be the strongest overall performer for the Romi and Devanagari scripts (scoring 4.40 on both).
- Qwen2.5 14B emerged as the champion for the Kannada script (scoring 4.05).
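The scoring side of such a pipeline is simple once the judge is instructed to end each verdict with a fixed marker. The sketch below assumes a "Score: N" convention and made-up judge replies; it is not the project's actual harness.

```python
# Sketch: extract 1-5 scores from an LLM judge's free-text verdicts
# and average them. The "Score: N" convention and the replies below
# are illustrative assumptions.
import re

SCORE_RE = re.compile(r"Score:\s*([1-5])\b")

def parse_score(judgement: str):
    """Pull the 1-5 score out of a judge's verdict, or None if absent."""
    match = SCORE_RE.search(judgement)
    return int(match.group(1)) if match else None

judgements = [
    "Fluent Romi, no Hindi bleed-through. Score: 5",
    "Helpful, but slips into Devanagari mid-answer. Score: 3",
]
scores = [s for j in judgements if (s := parse_score(j)) is not None]
print(sum(scores) / len(scores))  # → 4.0
```

Forcing a constrained output format and discarding unparseable verdicts keeps the aggregate scores robust to the judge's occasional rambling.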
Why This Matters
Konkani is a low-resource language that suffers from a severe lack of digital data and fragmented scripts. Through this project, we aim to ensure Konkani's survival in the digital age. Our ultimate goal is to lift Konkani out of its low-resource status by providing robust AI tools that empower people to seamlessly learn, preserve, and translate Konkani text.
Try It Out!
We are open-sourcing our work to the community. You can find our models, datasets and adapters on our Hugging Face organization page.
- 🔗 Hugging Face Organization: huggingface.co/konkani
- 💬 Interact with the Models: Head over to konkani.app to chat with the models in Romi, Devanagari or Kannada!
Acknowledgements
We extend our deep gratitude to Cloud Riff, the Hugging Face community, the Cohere Labs community (especially @alexrs) and Modal for providing the H200 GPU compute that made this training possible.
If you are working on Indic LLMs or low-resource NLP, we would love to connect. Feel free to open a PR on our datasets or drop a comment in the community tab of our models.




