
English to Pnar Machine Translation Model: NLLB-200-Distilled-600M (Fine-Tuned)

This model is a fine-tuned version of facebook/nllb-200-distilled-600M for English to Pnar (Jaintia) translation. It has been trained on a custom dataset specifically curated for this low-resource language spoken in Meghalaya, India.

Summary

| Property | Value |
|---|---|
| Base Model | facebook/nllb-200-distilled-600M |
| Type | Seq2Seq MT |
| Languages | English → Pnar (eng_Latn → pbv_Latn) |
| Technique | LoRA fine-tuning + Continuation Training |
| License | CC-BY-NC-4.0 (inherits from Meta) |
| Training Data | Custom English–Pnar parallel corpus |
| Max Sequence Length | 128 tokens (truncation enabled) |

Training Overview

Training used LoRA (Low-Rank Adaptation), followed by a continuation-training phase, on a custom English–Pnar parallel corpus.
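The card does not list the exact LoRA hyperparameters (rank, alpha, target modules), but the mechanism itself is simple: the frozen pretrained weight is augmented with a trainable low-rank product. A minimal NumPy sketch, with illustrative (not actual) dimensions:

```python
import numpy as np

# LoRA sketch: instead of updating the full weight W (d_out x d_in),
# train two small matrices A (r x d_in) and B (d_out x r), with r << d.
d_out, d_in, r = 8, 8, 2          # toy sizes, not the model's real ones
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection (init to 0)
alpha = 16                              # LoRA scaling factor

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialised to zero, the adapted layer matches the base layer,
# so fine-tuning starts from the pretrained behaviour.
assert np.allclose(lora_forward(x), W @ x)
```

In practice this is handled by a library such as `peft`, which injects these adapters into the attention projections of the base NLLB model; only A and B are trained, which is what makes fine-tuning a 600M-parameter model affordable.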

Final Test Metrics:

  • Validation Loss: 0.520427
  • BLEU Score: 40.32 (higher is better)
  • chrF++ Score: 58.58 (higher is better)
  • TER Score: 52.07 (lower is better)

Limitations

While this model achieves strong automatic-metric scores, users should be aware of the following limitations:

  • Directionality: Only supports English to Pnar. The reverse direction (Pnar to English) is not optimized in this specific checkpoint.
  • Context Window: Sentences longer than 128 tokens may be truncated.
  • Domain Specificity: The model was trained on a general-purpose corpus and may not perform well on highly technical medical, legal, or archaic religious texts not covered during training.
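Because of the 128-token context window, long passages should be split into sentence-sized chunks before translation. A rough sketch of such a guard (using whitespace word count as a stand-in for the real subword count; in practice you would measure length with the model's own tokenizer):

```python
import re

MAX_TOKENS = 128  # the model's context limit

def chunk_text(text, max_tokens=MAX_TOKENS):
    """Split text into sentence chunks that each stay under max_tokens.

    Word count is only a proxy for subword-token count; the model's
    tokenizer typically produces more tokens than words.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sent.split()) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed through the translation function individually and the outputs rejoined, avoiding silent truncation of long inputs.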

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "toiar/nllb-finetuned-english-pnar"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)

def translate(text):
    inputs = tokenizer(text, return_tensors="pt").to(device)

    output = model.generate(
        **inputs,
        # Force the decoder to start with the Pnar language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("pbv_Latn"),
        max_length=128,
        num_beams=5,
        early_stopping=True
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
text = "They are learning new skills to improve their future."
translation = translate(text)

print(f"English: {text}")
print(f"Pnar:    {translation}")