English to Khasi Machine Translation Model: mBART-Large-50 (LoRA Fine-Tuned)
This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt for English to Khasi translation using LoRA adapters, trained on a custom English–Khasi parallel corpus for low-resource language translation.
This is a PEFT adapter model: it must be loaded on top of the base model using the `peft` library.
Summary
| Property | Value |
|---|---|
| Base Model | facebook/mbart-large-50-many-to-many-mmt |
| Type | Seq2Seq MT (LoRA Adapter) |
| Languages | English → Khasi (en_XX → kh_IN) |
| Technique | LoRA fine-tuning + continuation training |
| Training Data | 58,176 train / 6,465 validation English–Khasi sentence pairs (90/10 split) |
| Max Sequence Length | 128 tokens for source and 128 tokens for target (truncation enabled) |
| Custom Token | kh_IN |
Demo
You can try the model directly in your browser using the interactive Hugging Face Space:
👉 [mBART English-Khasi Translator](https://huggingface.co/spaces/toiar/mbart-english-khasi-translator)
Training Overview
The model was fine-tuned using LoRA (Low-Rank Adaptation) with continuation training, showing consistent convergence and strong performance on the validation set.
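LoRA keeps the base model frozen and trains only a pair of small low-rank matrices per targeted weight, which is why fine-tuning a 600M-parameter model on a low-resource corpus is tractable. The arithmetic below is a rough illustration: the hidden size 1024 is mBART-large's, but the rank `r = 16` is an assumed example value, not necessarily the rank used for this model.

```python
# Parameter-count comparison for one attention projection in mBART-large.
# d_model is the real hidden size; r = 16 is an illustrative assumption.
d_model = 1024
r = 16

full_params = d_model * d_model          # dense weight W: d x d, frozen under LoRA
lora_params = r * d_model + d_model * r  # adapters A (r x d) and B (d x r), trained

print(full_params)                                    # 1048576
print(lora_params)                                    # 32768
print(f"LoRA trains {lora_params / full_params:.1%} of this layer")  # 3.1%
```

Summed over all targeted projections, the adapter checkpoint stays a small fraction of the base model's size, which is what makes the continuation-training workflow cheap to iterate on.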
Final checkpoint performance:
- Validation Loss: 0.7366
- BLEU Score: 37.59
- chrF++ Score: 55.56
- TER Score: 52.56 (lower is better)
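For intuition about what these metrics measure, here is a minimal pure-Python sketch of the two core quantities: modified n-gram precision (the heart of BLEU) and word-level edit rate (a simplified TER without the shift operation). Real evaluation should use a standard toolkit such as sacrebleu; the example sentences below are made up for illustration.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: the core quantity behind BLEU."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

def ter(candidate, reference):
    """Word-level edit distance / reference length (simplified TER: no shifts)."""
    c, r = candidate.split(), reference.split()
    dist = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        dist[i][0] = i
    for j in range(len(r) + 1):
        dist[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,       # deletion
                             dist[i][j - 1] + 1,       # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(c)][len(r)] / len(r)

cand = "the basket is full of apples"
ref = "the basket is full of fresh apples"
print(ngram_precision(cand, ref, 1))  # 1.0: every candidate unigram appears in the reference
print(round(ter(cand, ref), 3))       # 0.143: one missing word over 7 reference words
```

BLEU combines precisions for n = 1..4 with a brevity penalty, and chrF++ applies the same idea at the character level, which is why it is often more forgiving for morphologically rich, low-resource languages like Khasi.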
How to Use
```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration
from peft import PeftModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model
base_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

# Load tokenizer (includes the custom kh_IN language token)
tokenizer = MBart50TokenizerFast.from_pretrained("toiar/mbart-finetuned-english-khasi")

# Resize embeddings to account for the custom kh_IN token
base_model.resize_token_embeddings(len(tokenizer))

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "toiar/mbart-finetuned-english-khasi")
model.eval()

# Translate
def translate(text):
    tokenizer.src_lang = "en_XX"
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("kh_IN"),
            max_length=128,
            num_beams=5,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
text = "The fruit basket is full of apples and oranges."
translation = translate(text)
print("English:", text)
print("Khasi:", translation)
```
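The `generate` call above uses `num_beams=5`, i.e. beam search: at each decoding step the five highest-scoring partial translations are kept and extended, rather than greedily committing to the single best token. The toy sketch below shows the mechanism over a hand-written next-token scorer; the vocabulary and probabilities are invented purely for illustration and have nothing to do with mBART's actual decoder.

```python
import math

# Toy next-token log-probabilities: a made-up stand-in for the decoder softmax.
def toy_scores(prefix):
    table = {
        (): {"the": math.log(0.6), "a": math.log(0.4)},
        ("the",): {"cat": math.log(0.7), "<eos>": math.log(0.3)},
        ("a",): {"<eos>": math.log(1.0)},
        ("the", "cat"): {"<eos>": math.log(1.0)},
    }
    return table.get(prefix, {"<eos>": 0.0})

def beam_search(num_beams, max_len=4):
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished hypotheses carry over
                continue
            for tok, logp in toy_scores(seq).items():
                candidates.append((seq + (tok,), score + logp))
        # Keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:num_beams]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams[0][0]

print(beam_search(num_beams=5))  # ('the', 'cat', '<eos>')
```

Wider beams explore more alternatives at the cost of slower decoding; `num_beams=5` is a common middle ground for MT inference.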
Evaluation results (self-reported, English–Khasi Parallel Corpus)
- BLEU: 37.590
- chrF++: 55.560
- TER: 52.560