# VietLegal-E5
A Vietnamese legal domain embedding model fine-tuned from intfloat/multilingual-e5-large (560M params, 1024-dim).
Achieves NDCG@10 = 0.7229 on the Zalo AI Legal Text Retrieval benchmark, outperforming all baselines including microsoft/harrier-oss-v1-0.6b.
## Benchmark Results
Evaluated on MTEB ZacLegalTextRetrieval (61.4K corpus documents, 818 test queries).
| Model | Params | Dim | NDCG@10 |
|---|---|---|---|
| mainguyen9/vietlegal-e5 | 560M | 1024 | 0.7229 |
| mainguyen9/vietlegal-e5 | 560M | 512 | 0.7208 |
| mainguyen9/vietlegal-e5 | 560M | 256 | 0.7058 |
| mainguyen9/vietlegal-e5 | 560M | 128 | 0.7073 |
| microsoft/harrier-oss-v1-0.6b | 600M | 1024 | 0.7210 |
| intfloat/multilingual-e5-large | 560M | 1024 | 0.6660 |
| bkai-foundation-models/vietnamese-bi-encoder | 135M | 768 | 0.6160 |
| intfloat/multilingual-e5-base | 278M | 768 | 0.6030 |
| contextboxai/halong_embedding | 278M | 768 | 0.6009 |
Key highlights:
- +5.69 points over the mE5-large baseline (0.7229 vs 0.6660)
- Outperforms Harrier (600M params) with fewer parameters
- Matryoshka support: 128-dim (0.7073) still beats mE5-large at 1024-dim, enabling 8x compression
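As a rough sketch of what Matryoshka truncation means in practice: keep the first k dimensions of an L2-normalized embedding, then re-normalize so cosine similarity stays a plain dot product. The vector below is a toy stand-in, not actual model output.

```python
import numpy as np

# Hypothetical full-dimension embedding (stand-in for a real model output).
rng = np.random.default_rng(0)
emb_1024 = rng.normal(size=1024)
emb_1024 /= np.linalg.norm(emb_1024)  # model outputs are L2-normalized


def truncate(emb: np.ndarray, k: int) -> np.ndarray:
    """Matryoshka truncation: keep the first k dims, then re-normalize."""
    t = emb[:k]
    return t / np.linalg.norm(t)


emb_128 = truncate(emb_1024, 128)
print(emb_128.shape)   # (128,)
print(1024 // 128)     # 8x storage compression
```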
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mainguyen9/vietlegal-e5")

# Important: E5 models require "query: " and "passage: " prefixes
queries = ["query: Thủ tục đăng ký kinh doanh gồm những bước nào?"]
# ("What are the steps in the business registration procedure?")
passages = ["passage: Điều 27. Trình tự, thủ tục đăng ký doanh nghiệp..."]
# ("Article 27. Order and procedures for enterprise registration...")

q_emb = model.encode(queries)
p_emb = model.encode(passages)

# Matryoshka: truncate to a smaller dimension
model.truncate_dim = 256
q_emb_256 = model.encode(queries)
```
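Because the model L2-normalizes its outputs, retrieval scoring reduces to a dot product between query and passage embeddings. A self-contained sketch with random stand-in matrices (in practice you would use the `q_emb` / `p_emb` from above):

```python
import numpy as np

# Stand-in embeddings with the model's shapes: 2 queries, 5 passages.
rng = np.random.default_rng(42)
q_emb = rng.normal(size=(2, 1024))
p_emb = rng.normal(size=(5, 1024))
# The model's Normalize module does this step for real embeddings.
q_emb /= np.linalg.norm(q_emb, axis=1, keepdims=True)
p_emb /= np.linalg.norm(p_emb, axis=1, keepdims=True)

scores = q_emb @ p_emb.T               # (n_queries, n_passages) cosine scores
ranking = np.argsort(-scores, axis=1)  # best passage first, per query
print(ranking[:, 0])                   # top-1 passage index for each query
```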
## Model Details
- Model Type: Sentence Transformer
- Base Model: intfloat/multilingual-e5-large
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions (Matryoshka: 1024, 512, 256, 128)
- Similarity Function: Cosine Similarity
- Language: Vietnamese
- License: Apache 2.0
## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_mean_tokens': True})
  (2): Normalize()
)
```
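A minimal numpy sketch of modules (1) and (2): masked mean pooling over token embeddings followed by L2 normalization. The token matrix and attention mask here are toy data standing in for real XLMRobertaModel outputs.

```python
import numpy as np

# Toy encoder output: batch 1, seq_len 4, hidden dim 1024,
# with the last token padded (attention mask 0).
hidden = np.random.default_rng(1).normal(size=(1, 4, 1024))
mask = np.array([[1, 1, 1, 0]], dtype=float)

# (1) Pooling: mean over non-padded tokens only
summed = (hidden * mask[..., None]).sum(axis=1)
pooled = summed / mask.sum(axis=1, keepdims=True)

# (2) Normalize: L2-normalize so cosine similarity = dot product
sentence_emb = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(sentence_emb.shape)  # (1, 1024)
```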
## Training
### Training Data
- Legal documents: 518K Vietnamese legal documents from th1nhng0/vietnamese-legal-documents
- Query-passage pairs: 507K pairs from phamson02/large-vi-legal-queries
### Training Pipeline

```
Stage 1: Data Preparation
│ 518K docs → ~500K chunks (article-aware segmentation)
│
Stage 2: Contrastive Fine-tuning
│ MatryoshkaLoss(MultipleNegativesRankingLoss)
│
Stage 3: Hard Negative Mining
│ FAISS retrieval → mine ranks 50-100 as hard negatives
│
Stage 4: Multi-task Blending
│ 70% retrieval + 20% classification + 10% STS
│ → Final model (NDCG@10 = 0.7229)
```
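Stage 3 can be illustrated as follows. This is a hedged sketch: it substitutes exact brute-force search in numpy where the real pipeline uses a FAISS index, and the corpus, dimensionality, and query are synthetic. The idea is that passages ranked 50-100 for a query are similar enough to be confusing yet (assumed) not relevant, which makes them useful hard negatives.

```python
import numpy as np

# Synthetic normalized corpus; at 500K-chunk scale you would build a
# FAISS index instead of scoring the whole matrix.
rng = np.random.default_rng(7)
corpus = rng.normal(size=(1000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Query constructed near document 0, so doc 0 ranks at the top.
query = corpus[0] + 0.1 * rng.normal(size=64)
query /= np.linalg.norm(query)

scores = corpus @ query            # cosine scores vs. the whole corpus
order = np.argsort(-scores)        # document ids, best-first
hard_negatives = order[50:100]     # the rank 50-100 window
print(len(hard_negatives))         # 50
```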
### Training Hyperparameters
- Learning rate: 5e-6
- Batch size: 48 per device × 4 GPUs × 2 gradient accumulation steps = 384 effective
- Epochs: 1 (multitask stage)
- Warmup: 10%
- Scheduler: Cosine
- Precision: bf16
- Hardware: 4x NVIDIA H100 80GB
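For reference, the effective batch size works out to 48 × 4 × 2 = 384, and the warmup-plus-cosine schedule can be sketched in plain Python. This is an illustration only: `total_steps` is hypothetical, and the real run uses the trainer's built-in scheduler.

```python
import math

assert 48 * 4 * 2 == 384  # effective batch size


def lr_at(step: int, total_steps: int, peak_lr: float = 5e-6,
          warmup_frac: float = 0.10) -> float:
    """Linear warmup for the first 10% of steps, then cosine decay to 0."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))


total = 1000  # hypothetical step count
print(lr_at(100, total))   # peak LR, reached at the end of warmup
print(lr_at(total, total)) # decayed to ~0 at the final step
```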
## Citation

```bibtex
@misc{vietlegal-e5,
  title={VietLegal-E5: Vietnamese Legal Domain Embedding Model},
  author={Nguyen, Mai},
  year={2026},
  url={https://huggingface.co/mainguyen9/vietlegal-e5}
}
```
## Acknowledgements
- Base model: intfloat/multilingual-e5-large
- Training framework: sentence-transformers
- Benchmark: Zalo AI Legal Text Retrieval