
English to Khasi Machine Translation Model: mBART-Large-50 (LoRA Fine-Tuned)

This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt for English to Khasi translation using LoRA adapters, trained on a custom English–Khasi parallel corpus for low-resource language translation.

This is a PEFT adapter model: it must be loaded on top of the base model using the `peft` library.

Summary

| Property | Value |
|---|---|
| Base Model | facebook/mbart-large-50-many-to-many-mmt |
| Type | Seq2Seq MT (LoRA Adapter) |
| Languages | English → Khasi (`en_XX` → `kh_IN`) |
| Technique | LoRA fine-tuning + continuation training |
| Training Data | 58,176 train / 6,465 validation English–Khasi sentence pairs (90/10 split) |
| Max Sequence Length | 128 tokens for source and target (truncation enabled) |
| Custom Token | `kh_IN` |

Demo

You can try the model directly in your browser using the interactive Hugging Face Space:

👉 [mBART English-Khasi Translator](https://huggingface.co/spaces/toiar/mbart-english-khasi-translator)


Training Overview

The model was fine-tuned using LoRA (Low-Rank Adaptation) with continuation training, showing consistent convergence and strong performance on the validation set.
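As a quick intuition for what the adapter stores: LoRA freezes each targeted base weight matrix W and learns a low-rank update, W' = W + (alpha / r) · B·A, where A (r × k) and B (d × r) are small trainable matrices with rank r ≪ min(d, k). The sketch below illustrates this with toy matrices in plain Python; the rank and scaling values here are illustrative only, not the configuration used to train this model.

```python
# Illustrative sketch of the LoRA update rule (toy values, not this
# model's actual rank/alpha): W' = W + (alpha / r) * (B @ A).

def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_update(W, A, B, alpha, r):
    """Return the merged weight W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    BA = matmul(B, A)  # (d x r) @ (r x k) -> full-size, but low-rank
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: d = k = 2, rank r = 1, alpha = 2
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 2.0]]               # trainable, r x k
B = [[0.5], [0.25]]            # trainable, d x r
print(lora_update(W, A, B, alpha=2.0, r=1))  # [[2.0, 2.0], [0.5, 2.0]]
```

Because only A and B are trained, the adapter checkpoint is a small fraction of the size of the full mBART weights.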

Final checkpoint performance:

  • Validation Loss: 0.7366
  • BLEU Score: 37.59 (higher is better)
  • chrF++ Score: 55.56 (higher is better)
  • TER Score: 52.56 (lower is better)
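For reference, TER (Translation Edit Rate) is roughly the number of word-level edits needed to turn the hypothesis into the reference, divided by the reference length, so lower is better. The sketch below is a simplified version using plain Levenshtein distance over words; real TER implementations (e.g. in sacrebleu) also count block shifts, which this sketch omits.

```python
# Simplified TER sketch: word-level edit distance / reference length.
# Real TER additionally allows block shifts; this is the intuition only.

def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists."""
    h, r = hyp.split(), ref.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(h)][len(r)]

def ter(hyp, ref):
    """Edit distance normalised by reference length (lower is better)."""
    return word_edit_distance(hyp, ref) / max(len(ref.split()), 1)

# 1 substitution + 2 insertions = 3 edits over a 6-word reference
print(ter("the basket is full", "the basket was full of fruit"))  # 0.5
```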

How to Use

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration
from peft import PeftModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model
base_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

# Load tokenizer (includes the custom kh_IN language token)
tokenizer = MBart50TokenizerFast.from_pretrained("toiar/mbart-finetuned-english-khasi")

# Resize embeddings to account for the custom kh_IN token
base_model.resize_token_embeddings(len(tokenizer))

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "toiar/mbart-finetuned-english-khasi")
model.eval()

# Translate
def translate(text):
    tokenizer.src_lang = "en_XX"
    inputs = tokenizer(text, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("kh_IN"),
            max_length=128,
            num_beams=5
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
text = "The fruit basket is full of apples and oranges."
translation = translate(text)

print("English:", text)
print("Khasi:", translation)
```