
English to Pnar Machine Translation Model: NLLB-200-Distilled-600M (Fine-Tuned)

This model is a fine-tuned version of facebook/nllb-200-distilled-600M for English to Pnar (Jaintia) translation. It has been trained on a custom dataset specifically curated for this low-resource language spoken in Meghalaya, India.

Summary

| Property | Value |
|---|---|
| Base Model | facebook/nllb-200-distilled-600M |
| Type | Seq2Seq MT |
| Languages | English → Pnar (eng_Latn → pbv_Latn) |
| Technique | LoRA fine-tuning + Continuation Training |
| License | CC-BY-NC-4.0 (inherits from Meta) |
| Training Data | Custom English–Pnar parallel corpus |
| Max Sequence Length | 128 tokens (truncation enabled) |

Training Overview

Training used LoRA (Low-Rank Adaptation), followed by a continuation-training phase, on a custom English–Pnar parallel corpus.
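The card does not list the exact LoRA hyperparameters (rank, alpha, target modules), but the mechanism itself is simple: the frozen pretrained weight is augmented with a trainable low-rank product. A minimal NumPy sketch, with illustrative (not actual) dimensions:

```python
import numpy as np

# LoRA sketch: instead of updating the full weight W (d_out x d_in),
# train two small matrices A (r x d_in) and B (d_out x r), with r << d.
d_out, d_in, r = 8, 8, 2          # toy sizes, not the model's real ones
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection (init to 0)
alpha = 16                              # LoRA scaling factor

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialised to zero, the adapted layer matches the base layer,
# so fine-tuning starts from the pretrained behaviour.
assert np.allclose(lora_forward(x), W @ x)
```

In practice this is handled by a library such as `peft`, which injects these adapters into the attention projections of the base NLLB model; only A and B are trained, which is what makes fine-tuning a 600M-parameter model affordable.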

Final Test Metrics:

  • Validation Loss: 0.520427
  • BLEU Score: 40.32 (higher is better)
  • chrF++ Score: 58.58 (higher is better)
  • TER Score: 52.07 (lower is better)

Limitations

While this model achieves strong automatic-metric scores, users should be aware of the following limitations:

  • Directionality: Only supports English to Pnar. The reverse direction (Pnar to English) is not optimized in this specific checkpoint.
  • Context Window: Sentences longer than 128 tokens may be truncated.
  • Domain Specificity: The model was trained on a general-purpose corpus and may not perform well on highly technical medical, legal, or archaic religious texts not covered during training.
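Because of the 128-token context window, long passages should be split into sentence-sized chunks before translation. A rough sketch of such a guard (using whitespace word count as a stand-in for the real subword count; in practice you would measure length with the model's own tokenizer):

```python
import re

MAX_TOKENS = 128  # the model's context limit

def chunk_text(text, max_tokens=MAX_TOKENS):
    """Split text into sentence chunks that each stay under max_tokens.

    Word count is only a proxy for subword-token count; the model's
    tokenizer typically produces more tokens than words.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sent.split()) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed through the translation function individually and the outputs rejoined, avoiding silent truncation of long inputs.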

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "toiar/nllb-finetuned-english-pnar"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)

def translate(text):
    inputs = tokenizer(text, return_tensors="pt").to(device)

    output = model.generate(
        **inputs,
        # Force the decoder to start with the Pnar language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("pbv_Latn"),
        max_length=128,
        num_beams=5,
        early_stopping=True
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
text = "They are learning new skills to improve their future."
translation = translate(text)

print(f"English: {text}")
print(f"Pnar:    {translation}")