# az-legal-retrieval-xlm
A sentence embedding model fine-tuned on Azerbaijani legislation for legal information retrieval. Built on xlm-roberta-base (278M params) and trained on 780K+ query-passage pairs with BM25-mined hard negatives scored by a cross-encoder reranker.
Matches BGE-m3 (568M) on MRR@10 while being 2x smaller and 4x faster.
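The mining recipe named above can be sketched as follows. This is a minimal reconstruction, not the published pipeline: the card does not name the cross-encoder checkpoint or its filtering threshold, so `BAAI/bge-reranker-v2-m3` and the 0.8 cutoff below are illustrative assumptions.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Toy corpus; in practice this is the full set of 262K legislative passages.
corpus = [
    "Hazırlanma texnologiyalarına görə şərablar təbii və xüsusi olur.",
    "Torpaq vergisi torpaq sahəsinin ölçüsünə görə müəyyən edilir.",
]
bm25 = BM25Okapi([p.lower().split() for p in corpus])
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # assumed stand-in reranker

def mine_hard_negatives(query: str, positive_idx: int, k: int = 7, pool: int = 50):
    # 1) BM25 ranks the corpus; the top lexical matches (minus the gold
    #    passage) become candidate negatives.
    scores = bm25.get_scores(query.lower().split())
    candidates = [int(i) for i in scores.argsort()[::-1][:pool] if i != positive_idx]
    # 2) The cross-encoder scores each (query, candidate) pair; candidates
    #    that score like true positives are dropped as likely false negatives.
    ce_scores = reranker.predict([(query, corpus[i]) for i in candidates])
    kept = [i for i, s in zip(candidates, ce_scores) if s < 0.8]
    return [corpus[i] for i in kept[:k]]
```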
## Benchmark Results
Evaluated on 939 independent, LLM-generated and validated queries over a corpus of 262K legislative passages:
| Model | Params | R@1 | R@10 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| az-legal-retrieval-xlm | 278M | 0.291 | 0.573 | 0.381 | 0.427 |
| BAAI/bge-m3 | 568M | 0.282 | 0.595 | 0.381 | 0.432 |
| intfloat/multilingual-e5-large | 560M | 0.232 | 0.521 | 0.321 | 0.369 |
| intfloat/multilingual-e5-base | 278M | 0.189 | 0.456 | 0.271 | 0.315 |
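For reference, the cutoff metrics above are the standard ones; a minimal sketch of their definitions, assuming exactly one relevant passage per query (the card's evaluation script is not published):

```python
import math

def ranking_metrics(ranks, k=10):
    """ranks: 1-based rank of the gold passage per query; 0 if not in top k."""
    n = len(ranks)
    r_at_1 = sum(r == 1 for r in ranks) / n
    r_at_k = sum(0 < r <= k for r in ranks) / n
    mrr_at_k = sum(1 / r for r in ranks if 0 < r <= k) / n
    # With a single relevant passage, ideal DCG is 1, so NDCG@k = 1/log2(rank+1).
    ndcg_at_k = sum(1 / math.log2(r + 1) for r in ranks if 0 < r <= k) / n
    return r_at_1, r_at_k, mrr_at_k, ndcg_at_k

print(ranking_metrics([1, 3, 0]))  # (0.333..., 0.666..., 0.444..., 0.5)
```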
## Usage
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("vrashad/az-legal-retrieval-xlm")

query = "Şərab istehsalı növləri hansılardır?"
passages = [
    "Hazırlanma texnologiyalarına görə şərablar təbii və xüsusi olur.",
    "Torpaq vergisi torpaq sahəsinin ölçüsünə görə müəyyən edilir.",
]

# L2-normalized embeddings, so cosine similarity equals the dot product.
q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

scores = util.cos_sim(q_emb, p_embs)[0]
for passage, score in zip(passages, scores):
    print(f"{score:.4f} | {passage}")
```
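For retrieval over more than a handful of passages, the same normalized embeddings plug into sentence-transformers' built-in `util.semantic_search`, which returns the top-k corpus hits per query. Continuing from the snippet above:

```python
# Top-k retrieval over the pre-encoded passages; each hit is a dict
# with "corpus_id" (index into `passages`) and "score" (cosine similarity).
hits = util.semantic_search(q_emb, p_embs, top_k=5)[0]
for hit in hits:
    print(f"{hit['score']:.4f} | {passages[hit['corpus_id']]}")
```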
## Training
| Parameter | Value |
|---|---|
| Base model | xlm-roberta-base |
| Training data | 780K query-passage pairs (3 query types × 262K chunks) |
| Hard negatives | 7 per query (BM25 + cross-encoder scored) |
| Loss | MultipleNegativesRankingLoss |
| Epochs | 3 |
| Batch size | 8 |
| Max seq length | 512 |
| Hardware | RTX 4090 (24GB) |
| Training time | ~24.5 hours |
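A training loop matching the table can be reconstructed with the sentence-transformers fit API. This is a hedged sketch, not the published script, and the toy `training_rows` below stands in for the 780K mined rows:

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("xlm-roberta-base")  # mean pooling added automatically
model.max_seq_length = 512

# Toy stand-in for the mined dataset: (query, positive, [7 hard negatives]).
training_rows = [
    ("Şərab istehsalı növləri hansılardır?",
     "Hazırlanma texnologiyalarına görə şərablar təbii və xüsusi olur.",
     ["Torpaq vergisi torpaq sahəsinin ölçüsünə görə müəyyən edilir."] * 7),
]
# Each InputExample packs the query, its gold passage, and the 7 mined negatives;
# with MultipleNegativesRankingLoss, other in-batch passages also act as negatives.
train_examples = [
    InputExample(texts=[q, pos, *negs]) for q, pos, negs in training_rows
]
loader = DataLoader(train_examples, shuffle=True, batch_size=8)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=3)
```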
## Dataset
- Training data: LocalDoc/azerbaijan_legislation_queries_passages
- Source corpus: LocalDoc/azerbaijan_legislation
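Both are on the Hugging Face Hub and load with the `datasets` library (default configs assumed; inspect the splits for the exact column names):

```python
from datasets import load_dataset

pairs = load_dataset("LocalDoc/azerbaijan_legislation_queries_passages")
corpus = load_dataset("LocalDoc/azerbaijan_legislation")
print(pairs)   # prints splits and column names
```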
## Contact
vrashad — v.resad.89@gmail.com