az-legal-retrieval-xlm

A sentence embedding model fine-tuned on Azerbaijani legislation for legal information retrieval. Built on xlm-roberta-base (278M params) and trained on 780K+ query-passage pairs with BM25-mined hard negatives scored by a cross-encoder reranker.

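Matches BAAI/bge-m3 (568M params) on MRR@10 (0.381 for both) with about half the parameters and a reported 4x speedup. It leads on R@1 (0.291 vs. 0.282) and trails slightly on R@10 and NDCG@10.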

Benchmark Results

Evaluated on 939 independent LLM-generated and validated queries over 262K legislative passages:

| Model | Params | R@1 | R@10 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| az-legal-retrieval-xlm | 278M | 0.291 | 0.573 | 0.381 | 0.427 |
| BAAI/bge-m3 | 568M | 0.282 | 0.595 | 0.381 | 0.432 |
| intfloat/multilingual-e5-large | 560M | 0.232 | 0.521 | 0.321 | 0.369 |
| intfloat/multilingual-e5-base | 278M | 0.189 | 0.456 | 0.271 | 0.315 |
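
For readers unfamiliar with the metric columns, here is a minimal sketch of how R@k, MRR@10 and NDCG@10 are commonly computed from ranked results. It assumes binary relevance and uses made-up passage IDs. The function names are illustrative, and this is not the evaluation script behind the numbers above.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant passages that appear in the top k."""
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked[:k], start=1)
        if pid in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0

def evaluate(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    n = len(runs)
    return {
        "R@1": sum(recall_at_k(r, rel, 1) for r, rel in runs) / n,
        "R@10": sum(recall_at_k(r, rel, 10) for r, rel in runs) / n,
        "MRR@10": sum(mrr_at_k(r, rel, 10) for r, rel in runs) / n,
        "NDCG@10": sum(ndcg_at_k(r, rel, 10) for r, rel in runs) / n,
    }

# Toy data (hypothetical IDs): one relevant passage per query.
runs = [
    (["p3", "p1", "p7"], {"p1"}),  # relevant passage at rank 2
    (["p2", "p9", "p4"], {"p2"}),  # relevant passage at rank 1
    (["p5", "p6", "p8"], {"p0"}),  # relevant passage not retrieved
]
metrics = evaluate(runs)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With one relevant passage per query, R@k is simply the share of queries whose relevant passage lands in the top k.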

Usage

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("vrashad/az-legal-retrieval-xlm")

query = "Şərab istehsalı növləri hansılardır?"
passages = [
    "Hazırlanma texnologiyalarına görə şərablar təbii və xüsusi olur.",
    "Torpaq vergisi torpaq sahəsinin ölçüsünə görə müəyyən edilir.",
]

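# L2-normalize the embeddings so cosine similarity reduces to a dot product
q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)
# cos_sim returns a 1 x len(passages) matrix; take the row for the single query
scores = util.cos_sim(q_emb, p_embs)[0]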

for passage, score in zip(passages, scores):
    print(f"{score:.4f} | {passage}")
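
To rank a larger passage collection, the same scoring extends to a top-k lookup. The sketch below uses NumPy and random stand-in unit vectors so that it runs without downloading the model. The `top_k` helper is hypothetical; in practice the arrays would come from `model.encode(..., normalize_embeddings=True)`.

```python
import numpy as np

def top_k(query_emb, passage_embs, k=3):
    """Return (indices, scores) of the k highest-scoring passages.

    Assumes both inputs are L2-normalized, so the dot product is the cosine.
    """
    scores = passage_embs @ query_emb
    k = min(k, len(scores))
    idx = np.argsort(-scores)[:k]  # sort descending by score
    return idx, scores[idx]

# Stand-in embeddings (assumption: 5 passages, 8 dimensions, unit length).
rng = np.random.default_rng(0)
passage_embs = rng.normal(size=(5, 8))
passage_embs /= np.linalg.norm(passage_embs, axis=1, keepdims=True)

# Pretend the query embeds to exactly the same vector as passage 2.
query_emb = passage_embs[2]

idx, scores = top_k(query_emb, passage_embs, k=3)
for i, s in zip(idx, scores):
    print(f"{s:.4f} | passage {i}")
```

A full `argsort` is fine for a small candidate set. For a corpus of hundreds of thousands of passages, an approximate nearest-neighbor index is the usual choice.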

Training

| Parameter | Value |
|---|---|
| Base model | xlm-roberta-base |
| Training data | 780K query-passage pairs (3 query types × 262K chunks) |
| Hard negatives | 7 per query (BM25 + cross-encoder scored) |
| Loss | MultipleNegativesRankingLoss |
| Epochs | 3 |
| Batch size | 8 |
| Max seq length | 512 |
| Hardware | RTX 4090 (24GB) |
| Training time | ~24.5 hours |
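
As a rough illustration of what MultipleNegativesRankingLoss optimizes, the following NumPy sketch computes a softmax cross-entropy over scaled cosine similarities. For each query, the candidates are its own positive, the other in-batch positives, and any hard negatives. The `mnr_loss` function is illustrative only, and `scale=20.0` is assumed from the sentence-transformers default. This is not the training code used for this model.

```python
import numpy as np

def mnr_loss(q, pos, hard_neg=None, scale=20.0):
    """Mean cross-entropy where query i should rank pos[i] first.

    q, pos: (B, D) unit-normalized arrays; hard_neg: (M, D) or None.
    """
    candidates = pos if hard_neg is None else np.vstack([pos, hard_neg])
    logits = scale * (q @ candidates.T)          # (B, B + M) scaled cosines
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    diag = np.arange(len(q))                     # correct column for row i is i
    return -log_probs[diag, diag].mean()

q = np.eye(3)                                    # three orthogonal toy queries

loss_good = mnr_loss(q, np.eye(3))               # each positive matches its query
loss_bad = mnr_loss(q, np.roll(np.eye(3), 1, axis=0))   # positives misaligned
loss_hard = mnr_loss(q, np.eye(3), hard_neg=np.eye(3)[[0]])  # confusable negative

print(f"aligned: {loss_good:.6f}")
print(f"misaligned: {loss_bad:.3f}")
print(f"with hard negative: {loss_hard:.3f}")
```

Under this layout, and assuming every hard negative is shared across the batch, a batch of 8 with 7 hard negatives per query would present 8 + 56 = 64 candidates to each query.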

Dataset

Training data: LocalDoc/azerbaijan_legislation_queries_passages

Source corpus: LocalDoc/azerbaijan_legislation

Contact

vrashad — v.resad.89@gmail.com
