Model Card for Tajik-Persian LaBSE fine-tuned for Lexical Induction

This model is a fine-tuned version of sentence-transformers/LaBSE on the TajPersParallelLexicalCorpus dataset. It produces aligned embeddings for Tajik (Cyrillic) and Persian (Arabic script) words, enabling cross-lingual word retrieval and semantic similarity tasks.

Model Details

Model Description

  • Developed by: Mullosharaf K. Arabov (TajikNLPWorld)
  • Funded by: [More Information Needed]
  • Shared by: TajikNLPWorld
  • Model type: Sentence Transformer (contrastive learning)
  • Language(s) (NLP): Tajik (tg), Persian (fa)
  • License: Apache 2.0
  • Finetuned from model: sentence-transformers/LaBSE

Model Sources

Uses

Direct Use

The model can be used directly to obtain cross-lingual embeddings for Tajik and Persian words. It is optimised for single‑word inputs. Example use cases:

  • Finding translations of a Tajik word from a Persian candidate list.
  • Computing semantic similarity between words in the two languages.
  • Building bilingual lexical resources or improving machine translation pre-processing.

Downstream Use

The model can be integrated into larger systems that require cross-lingual alignment, such as:

  • Bilingual lexicon induction
  • Cross-lingual information retrieval
  • Unsupervised or semi-supervised machine translation (as a pre-processing step for word alignment)

Out-of-Scope Use

  • The model is not designed for multi‑word phrases or full sentences; performance on such inputs may degrade.
  • It does not handle out-of-vocabulary words beyond its subword tokenization; for rare words, subword-based FastText models may be more appropriate.
  • It should not be used for languages other than Tajik and Persian.

Bias, Risks, and Limitations

  • The model was trained on a parallel lexical corpus which may contain biases present in the source data (e.g., under‑representation of certain domains or dialects).
  • Performance varies by part‑of‑speech; proper nouns show slightly lower retrieval accuracy.
  • As a neural model, it may exhibit unpredictable behaviour on adversarial or nonsensical inputs.

Recommendations

Users should evaluate the model on their specific task and consider combining it with other resources (e.g., rule‑based checks) for critical applications. When using the model, be aware of the limitations described above.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "TajikNLPWorld/tajik-persian-labse-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def get_embedding(text):
    """Return L2-normalized embeddings for a word or a list of words."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over real tokens only (the attention mask excludes padding,
    # which matters when encoding a batch of words of different lengths)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(embeddings, p=2, dim=1)

# Example similarity
tajik_word = "модар"    # "mother" (Tajik, Cyrillic)
persian_word = "مادر"   # "mother" (Persian, Arabic script)
emb_tg = get_embedding(tajik_word)
emb_fa = get_embedding(persian_word)
similarity = torch.cosine_similarity(emb_tg, emb_fa)
print(f"Similarity: {similarity.item():.4f}")  # ~0.77

# Find translations from a candidate list
persian_words = ["مادر", "پدر", "برادر", "خواهر", "دختر", "پسر"]
persian_embeddings = get_embedding(persian_words)  # batch-encode candidates
query_emb = get_embedding("модар")
similarities = torch.mm(query_emb, persian_embeddings.T).squeeze(0)
top_indices = similarities.argsort(descending=True)[:5]
for i in top_indices:
    print(f"{persian_words[i]}: {similarities[i].item():.4f}")

Training Details

Training Data

  • Dataset: TajPersParallelLexicalCorpus
  • Size: 33,222 parallel word pairs (Tajik–Persian)
  • Split: Training set of 29,069 pairs (the remainder after holding out 4,153 pairs for evaluation)

Training Procedure

The model was fine‑tuned using contrastive learning (NT‑Xent loss) to align embeddings of translation pairs.
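For reference, a minimal sketch of a symmetric NT‑Xent objective with in‑batch negatives is shown below. The function name and exact formulation are illustrative, since the card does not include the training code; only the use of temperature‑scaled cosine similarities over translation pairs is taken from this card.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(emb_tg, emb_fa, temperature=0.05):
    """Symmetric NT-Xent: each Tajik embedding should match its Persian
    translation against all other in-batch candidates, and vice versa."""
    emb_tg = F.normalize(emb_tg, dim=1)
    emb_fa = F.normalize(emb_fa, dim=1)
    logits = emb_tg @ emb_fa.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))     # correct pair sits on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```

With a batch of translation pairs, this loss pulls each pair's embeddings together while pushing apart all other in‑batch combinations.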

Preprocessing

  • Inputs were tokenized with the LaBSE tokenizer (max length 64).
  • No additional filtering was applied.

Training Hyperparameters

  • Batch size: 16
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Epochs: 3
  • Loss: NT‑Xent with temperature 0.05
  • Max sequence length: 64 tokens
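Put together, these hyperparameters correspond to a training loop of roughly the following shape. This is a sketch with a stand-in linear encoder and random features in place of LaBSE and the real corpus; only the hyperparameter values (batch size 16, lr 2e-5, AdamW, 3 epochs, temperature 0.05) are taken from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in encoder and data; the actual run fine-tuned LaBSE on
# tokenized Tajik-Persian word pairs.
encoder = nn.Linear(32, 32)
src = torch.randn(64, 32)               # stand-in "Tajik" features
tgt = src + 0.1 * torch.randn(64, 32)   # paired "Persian" features

# Hyperparameters from the card
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
temperature, epochs, batch_size = 0.05, 3, 16

for epoch in range(epochs):
    for i in range(0, src.size(0), batch_size):
        a = F.normalize(encoder(src[i:i + batch_size]), dim=1)
        b = F.normalize(encoder(tgt[i:i + batch_size]), dim=1)
        logits = a @ b.T / temperature
        targets = torch.arange(logits.size(0))
        # symmetric NT-Xent over the in-batch similarity matrix
        loss = (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```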

Speeds, Sizes, Times

  • Training was performed on a single GPU (exact hardware unknown). Each epoch took approximately 1 hour.

Evaluation

Testing Data, Factors & Metrics

Testing Data

A held‑out test set of 4,153 Tajik–Persian word pairs from the same corpus, not seen during training.

Factors

  • Part‑of‑speech: The test set includes nouns, adjectives, verbs, adverbs, proper nouns, interjections, conjunctions, and numerals (tagged for Tajik).
  • Domain: General vocabulary.

Metrics

  • Precision@k (P@1, P@5, P@10): proportion of queries where the correct translation is in the top‑k retrieved candidates.
  • Mean Reciprocal Rank (MRR): average of reciprocal ranks of the correct translation.
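As an illustration, both metrics can be computed from a single query-candidate similarity matrix. The function below is a sketch (its name and interface are not from the model's evaluation code) and assumes pre-normalized embeddings, as produced by the example code above.

```python
import torch

def retrieval_metrics(query_embs, cand_embs, gold_indices, ks=(1, 5, 10)):
    """Compute P@k and MRR for embedding-based translation retrieval.

    query_embs:   (N, D) L2-normalized query embeddings
    cand_embs:    (M, D) L2-normalized candidate embeddings
    gold_indices: correct candidate index for each query
    """
    sims = query_embs @ cand_embs.T                 # cosine similarities
    ranking = sims.argsort(dim=1, descending=True)  # candidates, best first
    gold = torch.tensor(gold_indices).unsqueeze(1)
    # 0-based rank of the correct candidate for each query
    ranks = (ranking == gold).nonzero()[:, 1]
    metrics = {f"P@{k}": (ranks < k).float().mean().item() for k in ks}
    metrics["MRR"] = (1.0 / (ranks.float() + 1)).mean().item()
    return metrics
```

For a perfect retriever (every correct translation ranked first), all of P@1, P@5, P@10, and MRR evaluate to 1.0.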

Results

Overall Performance

Metric   Value
P@1      0.684
P@5      0.878
P@10     0.913
MRR      0.771

Performance by Part of Speech (P@1)

POS (Tajik)   English        P@1
исм           Noun           0.656
сифат         Adjective      0.731
феъл          Verb           0.652
зарф          Adverb         0.711
исми хос      Proper Noun    0.619
нидо          Interjection   0.677
пайвандак     Conjunction    0.625
шумора        Numeral        0.688

Comparison with Other Models

Model                           P@1    P@5    P@10   MRR
LaBSE fine-tuned (this model)   0.684  0.878  0.913  0.771
multilingual-e5 + contrastive   0.559  0.787  0.841  0.661
XLM‑RoBERTa + LoRA              0.000  0.001  0.001  0.001
FastText + VecMap               0.000  0.000  0.000  0.000

This model achieves the best performance among all compared models for Tajik–Persian lexical induction.

Environmental Impact

  • Hardware Type: Single GPU (NVIDIA unspecified)
  • Hours used: ~3 hours
  • Cloud Provider: Not applicable (local machine)
  • Compute Region: Not applicable
  • Carbon Emitted: Not estimated

Technical Specifications

Model Architecture and Objective

  • Base architecture: sentence-transformers/LaBSE (Transformer‑based, 12 layers, 768 hidden size, dual-encoder)
  • Training objective: Contrastive (NT‑Xent) loss to maximise similarity between translation pairs and minimise similarity between non‑pairs.

Compute Infrastructure

Hardware

  • GPU with at least 8 GB VRAM (e.g., NVIDIA Tesla T4 or similar)

Software

  • Python 3.9
  • PyTorch 1.13
  • Transformers 4.25
  • Datasets 2.7

Citation

@misc{tajik_persian_labse_2026,
    title = {Tajik-Persian LaBSE Fine-tuned for Lexical Induction},
    author = {Arabov, Mullosharaf Kurbonovich},
    year = {2026},
    publisher = {Hugging Face},
    url = {https://huggingface.co/TajikNLPWorld/tajik-persian-labse-finetuned}
}

Model Card Authors

Mullosharaf K. Arabov (TajikNLPWorld)

Model Card Contact
