Model Card for Tajik-Persian LaBSE fine-tuned for Lexical Induction

This model is a fine-tuned version of sentence-transformers/LaBSE on the TajPersParallelLexicalCorpus dataset. It produces aligned embeddings for Tajik (Cyrillic) and Persian (Arabic script) words, enabling cross-lingual word retrieval and semantic similarity tasks.

Model Details

Model Description

  • Developed by: Mullosharaf K. Arabov (TajikNLPWorld)
  • Funded by: [More Information Needed]
  • Shared by: TajikNLPWorld
  • Model type: Sentence Transformer (contrastive learning)
  • Language(s) (NLP): Tajik (tg), Persian (fa)
  • License: Apache 2.0
  • Finetuned from model: sentence-transformers/LaBSE

Model Sources

Uses

Direct Use

The model can be used directly to obtain cross-lingual embeddings for Tajik and Persian words. It is optimised for single‑word inputs. Example use cases:

  • Finding translations of a Tajik word from a Persian candidate list.
  • Computing semantic similarity between words in the two languages.
  • Building bilingual lexical resources or improving machine translation pre-processing.

Downstream Use

The model can be integrated into larger systems that require cross-lingual alignment, such as:

  • Bilingual lexicon induction
  • Cross-lingual information retrieval
  • Unsupervised or semi-supervised machine translation (as a pre-processing step for word alignment)

Out-of-Scope Use

  • The model is not designed for multi‑word phrases or full sentences; performance on such inputs may degrade.
  • It does not handle out-of-vocabulary words beyond its subword tokenization; for rare words, subword-based FastText models may be more appropriate.
  • It should not be used for languages other than Tajik and Persian.

Bias, Risks, and Limitations

  • The model was trained on a parallel lexical corpus which may contain biases present in the source data (e.g., under‑representation of certain domains or dialects).
  • Performance varies by part‑of‑speech; proper nouns show slightly lower retrieval accuracy.
  • As a neural model, it may exhibit unpredictable behaviour on adversarial or nonsensical inputs.

Recommendations

Users should evaluate the model on their specific task and consider combining it with other resources (e.g., rule‑based checks) for critical applications. When using the model, be aware of the limitations described above.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "TajikNLPWorld/tajik-persian-labse-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def get_embedding(text):
    """Return L2-normalized embeddings for a word or a list of words."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over real tokens only (the attention mask excludes padding,
    # which matters when encoding a batch of words of different lengths)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(embeddings, p=2, dim=1)

# Example similarity
tajik_word = "модар"    # "mother" (Tajik, Cyrillic)
persian_word = "مادر"   # "mother" (Persian, Arabic script)
emb_tg = get_embedding(tajik_word)
emb_fa = get_embedding(persian_word)
similarity = torch.cosine_similarity(emb_tg, emb_fa)
print(f"Similarity: {similarity.item():.4f}")  # ~0.77

# Find translations from a candidate list
persian_words = ["مادر", "پدر", "برادر", "خواهر", "دختر", "پسر"]
persian_embeddings = get_embedding(persian_words)  # batch-encode candidates
query_emb = get_embedding("модар")
similarities = torch.mm(query_emb, persian_embeddings.T).squeeze(0)
top_indices = similarities.argsort(descending=True)[:5]
for i in top_indices:
    print(f"{persian_words[i]}: {similarities[i].item():.4f}")

Training Details

Training Data

  • Dataset: TajPersParallelLexicalCorpus
  • Size: 33,222 parallel word pairs (Tajik–Persian)
  • Split: Training set of 29,069 pairs (the remainder after holding out 4,153 pairs for evaluation)

Training Procedure

The model was fine‑tuned using contrastive learning (NT‑Xent loss) to align embeddings of translation pairs.
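For reference, a minimal sketch of a symmetric NT‑Xent objective with in‑batch negatives is shown below. The function name and exact formulation are illustrative, since the card does not include the training code; only the use of temperature‑scaled cosine similarities over translation pairs is taken from this card.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(emb_tg, emb_fa, temperature=0.05):
    """Symmetric NT-Xent: each Tajik embedding should match its Persian
    translation against all other in-batch candidates, and vice versa."""
    emb_tg = F.normalize(emb_tg, dim=1)
    emb_fa = F.normalize(emb_fa, dim=1)
    logits = emb_tg @ emb_fa.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))     # correct pair sits on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```

With a batch of translation pairs, this loss pulls each pair's embeddings together while pushing apart all other in‑batch combinations.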

Preprocessing

  • Inputs were tokenized with the LaBSE tokenizer (max length 64).
  • No additional filtering was applied.

Training Hyperparameters

  • Batch size: 16
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Epochs: 3
  • Loss: NT‑Xent with temperature 0.05
  • Max sequence length: 64 tokens
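Put together, these hyperparameters correspond to a training loop of roughly the following shape. This is a sketch with a stand-in linear encoder and random features in place of LaBSE and the real corpus; only the hyperparameter values (batch size 16, lr 2e-5, AdamW, 3 epochs, temperature 0.05) are taken from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in encoder and data; the actual run fine-tuned LaBSE on
# tokenized Tajik-Persian word pairs.
encoder = nn.Linear(32, 32)
src = torch.randn(64, 32)               # stand-in "Tajik" features
tgt = src + 0.1 * torch.randn(64, 32)   # paired "Persian" features

# Hyperparameters from the card
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
temperature, epochs, batch_size = 0.05, 3, 16

for epoch in range(epochs):
    for i in range(0, src.size(0), batch_size):
        a = F.normalize(encoder(src[i:i + batch_size]), dim=1)
        b = F.normalize(encoder(tgt[i:i + batch_size]), dim=1)
        logits = a @ b.T / temperature
        targets = torch.arange(logits.size(0))
        # symmetric NT-Xent over the in-batch similarity matrix
        loss = (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```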

Speeds, Sizes, Times

  • Training was performed on a single GPU (exact hardware unknown). Each epoch took approximately 1 hour.

Evaluation

Testing Data, Factors & Metrics

Testing Data

A held‑out test set of 4,153 Tajik–Persian word pairs from the same corpus, not seen during training.

Factors

  • Part‑of‑speech: The test set includes nouns, adjectives, verbs, adverbs, proper nouns, interjections, conjunctions, and numerals (tagged for Tajik).
  • Domain: General vocabulary.

Metrics

  • Precision@k (P@1, P@5, P@10): proportion of queries where the correct translation is in the top‑k retrieved candidates.
  • Mean Reciprocal Rank (MRR): average of reciprocal ranks of the correct translation.
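As an illustration, both metrics can be computed from a single query-candidate similarity matrix. The function below is a sketch (its name and interface are not from the model's evaluation code) and assumes pre-normalized embeddings, as produced by the example code above.

```python
import torch

def retrieval_metrics(query_embs, cand_embs, gold_indices, ks=(1, 5, 10)):
    """Compute P@k and MRR for embedding-based translation retrieval.

    query_embs:   (N, D) L2-normalized query embeddings
    cand_embs:    (M, D) L2-normalized candidate embeddings
    gold_indices: correct candidate index for each query
    """
    sims = query_embs @ cand_embs.T                 # cosine similarities
    ranking = sims.argsort(dim=1, descending=True)  # candidates, best first
    gold = torch.tensor(gold_indices).unsqueeze(1)
    # 0-based rank of the correct candidate for each query
    ranks = (ranking == gold).nonzero()[:, 1]
    metrics = {f"P@{k}": (ranks < k).float().mean().item() for k in ks}
    metrics["MRR"] = (1.0 / (ranks.float() + 1)).mean().item()
    return metrics
```

For a perfect retriever (every correct translation ranked first), all of P@1, P@5, P@10, and MRR evaluate to 1.0.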

Results

Overall Performance

Metric   Value
P@1      0.684
P@5      0.878
P@10     0.913
MRR      0.771

Performance by Part of Speech (P@1)

POS (Tajik)   English        P@1
исм           Noun           0.656
сифат         Adjective      0.731
феъл          Verb           0.652
зарф          Adverb         0.711
исми хос      Proper Noun    0.619
нидо          Interjection   0.677
пайвандак     Conjunction    0.625
шумора        Numeral        0.688

Comparison with Other Models

Model                           P@1    P@5    P@10   MRR
LaBSE fine-tuned (this model)   0.684  0.878  0.913  0.771
multilingual-e5 + contrastive   0.559  0.787  0.841  0.661
XLM‑RoBERTa + LoRA              0.000  0.001  0.001  0.001
FastText + VecMap               0.000  0.000  0.000  0.000

This model achieves the best performance among all compared models for Tajik–Persian lexical induction.

Environmental Impact

  • Hardware Type: Single GPU (NVIDIA unspecified)
  • Hours used: ~3 hours
  • Cloud Provider: Not applicable (local machine)
  • Compute Region: Not applicable
  • Carbon Emitted: Not estimated

Technical Specifications

Model Architecture and Objective

  • Base architecture: sentence-transformers/LaBSE (Transformer‑based, 12 layers, 768 hidden size, dual-encoder)
  • Training objective: Contrastive (NT‑Xent) loss to maximise similarity between translation pairs and minimise similarity between non‑pairs.

Compute Infrastructure

Hardware

  • GPU with at least 8 GB VRAM (e.g., NVIDIA Tesla T4 or similar)

Software

  • Python 3.9
  • PyTorch 1.13
  • Transformers 4.25
  • Datasets 2.7

Citation

@misc{tajik_persian_labse_2026,
    title = {Tajik-Persian LaBSE Fine-tuned for Lexical Induction},
    author = {Arabov, Mullosharaf Kurbonovich},
    year = {2026},
    publisher = {Hugging Face},
    url = {https://huggingface.co/TajikNLPWorld/tajik-persian-labse-finetuned}
}

Model Card Authors

Mullosharaf K. Arabov (TajikNLPWorld)

Model Card Contact
