
English to Khasi Machine Translation Model: mBART-Large-50 (LoRA Fine-Tuned)

This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt for English to Khasi translation using LoRA adapters, trained on a custom English–Khasi parallel corpus for low-resource language translation.

This is a PEFT adapter model: it must be loaded on top of the base model using the `peft` library.

Summary

| Property | Value |
|---|---|
| Base Model | facebook/mbart-large-50-many-to-many-mmt |
| Type | Seq2Seq MT (LoRA Adapter) |
| Languages | English → Khasi (`en_XX` → `kh_IN`) |
| Technique | LoRA fine-tuning + continuation training |
| Training Data | 58,176 train / 6,465 validation English–Khasi sentence pairs (90/10 split) |
| Max Sequence Length | 128 tokens for source and target (truncation enabled) |
| Custom Token | `kh_IN` |

Demo

You can try the model directly in your browser using the interactive Hugging Face Space:

👉 [mBART English-Khasi Translator](https://huggingface.co/spaces/toiar/mbart-english-khasi-translator)


Training Overview

The model was fine-tuned using LoRA (Low-Rank Adaptation) with continuation training, showing consistent convergence and strong performance on the validation set.
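As a quick intuition for what the adapter stores: LoRA freezes each targeted base weight matrix W and learns a low-rank update, W' = W + (alpha / r) · B·A, where A (r × k) and B (d × r) are small trainable matrices with rank r ≪ min(d, k). The sketch below illustrates this with toy matrices in plain Python; the rank and scaling values here are illustrative only, not the configuration used to train this model.

```python
# Illustrative sketch of the LoRA update rule (toy values, not this
# model's actual rank/alpha): W' = W + (alpha / r) * (B @ A).

def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_update(W, A, B, alpha, r):
    """Return the merged weight W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    BA = matmul(B, A)  # (d x r) @ (r x k) -> full-size, but low-rank
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: d = k = 2, rank r = 1, alpha = 2
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 2.0]]               # trainable, r x k
B = [[0.5], [0.25]]            # trainable, d x r
print(lora_update(W, A, B, alpha=2.0, r=1))  # [[2.0, 2.0], [0.5, 2.0]]
```

Because only A and B are trained, the adapter checkpoint is a small fraction of the size of the full mBART weights.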

Final checkpoint performance:

  • Validation Loss: 0.7366
  • BLEU Score: 37.59 (higher is better)
  • chrF++ Score: 55.56 (higher is better)
  • TER Score: 52.56 (lower is better)
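For reference, TER (Translation Edit Rate) is roughly the number of word-level edits needed to turn the hypothesis into the reference, divided by the reference length, so lower is better. The sketch below is a simplified version using plain Levenshtein distance over words; real TER implementations (e.g. in sacrebleu) also count block shifts, which this sketch omits.

```python
# Simplified TER sketch: word-level edit distance / reference length.
# Real TER additionally allows block shifts; this is the intuition only.

def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists."""
    h, r = hyp.split(), ref.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(h)][len(r)]

def ter(hyp, ref):
    """Edit distance normalised by reference length (lower is better)."""
    return word_edit_distance(hyp, ref) / max(len(ref.split()), 1)

# 1 substitution + 2 insertions = 3 edits over a 6-word reference
print(ter("the basket is full", "the basket was full of fruit"))  # 0.5
```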

How to Use

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration
from peft import PeftModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model
base_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

# Load tokenizer (includes the custom kh_IN language token)
tokenizer = MBart50TokenizerFast.from_pretrained("toiar/mbart-finetuned-english-khasi")

# Resize embeddings to account for the custom kh_IN token
base_model.resize_token_embeddings(len(tokenizer))

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "toiar/mbart-finetuned-english-khasi")
model.eval()

# Translate
def translate(text):
    tokenizer.src_lang = "en_XX"
    inputs = tokenizer(text, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("kh_IN"),
            max_length=128,
            num_beams=5
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
text = "The fruit basket is full of apples and oranges."
translation = translate(text)

print("English:", text)
print("Khasi:", translation)
```