VietLegal-E5

A Vietnamese legal domain embedding model fine-tuned from intfloat/multilingual-e5-large (560M params, 1024-dim).

Achieves NDCG@10 = 0.7229 on the Zalo AI Legal Text Retrieval benchmark, outperforming all baselines including microsoft/harrier-oss-v1-0.6b.

Benchmark Results

Evaluated on MTEB ZacLegalTextRetrieval (61.4K corpus documents, 818 test queries).

Model Params Dim NDCG@10
mainguyen9/vietlegal-e5 560M 1024 0.7229
mainguyen9/vietlegal-e5 560M 512 0.7208
mainguyen9/vietlegal-e5 560M 256 0.7058
mainguyen9/vietlegal-e5 560M 128 0.7073
microsoft/harrier-oss-v1-0.6b 600M 1024 0.7210
intfloat/multilingual-e5-large 560M 1024 0.6660
bkai-foundation-models/vietnamese-bi-encoder 135M 768 0.6160
intfloat/multilingual-e5-base 278M 768 0.6030
contextboxai/halong_embedding 278M 768 0.6009

Key highlights:

  • +5.69 points over the mE5-large baseline (0.7229 vs 0.6660)
  • Narrowly outperforms Harrier (0.7229 vs 0.7210) with fewer parameters (560M vs 600M)
  • Matryoshka support: 128-dim (0.7073) still beats mE5-large at 1024-dim, enabling 8x compression

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mainguyen9/vietlegal-e5")

# Important: E5 models require "query: " and "passage: " prefixes
queries = ["query: Thủ tục đăng ký kinh doanh gồm những bước nào?"]
passages = ["passage: Điều 27. Trình tự, thủ tục đăng ký doanh nghiệp..."]

q_emb = model.encode(queries)
p_emb = model.encode(passages)

# Matryoshka: load with a smaller output dimension
model_256 = SentenceTransformer("mainguyen9/vietlegal-e5", truncate_dim=256)
# Truncation happens after the Normalize() module, so re-normalize for cosine similarity
q_emb_256 = model_256.encode(queries, normalize_embeddings=True)
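A note on the truncation step: the model's Normalize() module produces unit-norm 1024-dim vectors, but keeping only the first 256 components leaves a vector whose norm is below 1, so renormalize before computing cosine similarity. A minimal NumPy illustration with a random stand-in for a model embedding (not actual model output):

```python
import numpy as np

# Random stand-in for a unit-norm 1024-dim embedding.
rng = np.random.default_rng(0)
q_emb = rng.normal(size=(1, 1024))
q_emb /= np.linalg.norm(q_emb, axis=1, keepdims=True)

# Matryoshka truncation: keep the first 256 components...
q_256 = q_emb[:, :256]
# ...which breaks the unit norm, so renormalize before cosine similarity.
q_256 = q_256 / np.linalg.norm(q_256, axis=1, keepdims=True)

assert np.allclose(np.linalg.norm(q_256, axis=1), 1.0)
```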

Model Details

  • Model Type: Sentence Transformer
  • Base model: intfloat/multilingual-e5-large
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions (Matryoshka: 1024, 512, 256, 128)
  • Similarity Function: Cosine Similarity
  • Language: Vietnamese
  • License: Apache 2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_mean_tokens': True})
  (2): Normalize()
)
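The Pooling and Normalize modules above can be mirrored in plain NumPy. This is an illustrative sketch (not the model's actual forward pass), assuming token embeddings and an attention mask as inputs:

```python
import numpy as np

def mean_pool_and_normalize(token_embs, attention_mask):
    """Mean-pool over non-padding tokens (Pooling module), then L2-normalize
    (Normalize module). token_embs: (batch, seq, dim); attention_mask: (batch, seq)."""
    mask = attention_mask[..., None].astype(token_embs.dtype)  # (batch, seq, 1)
    summed = (token_embs * mask).sum(axis=1)                   # sum over real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)             # avoid divide-by-zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy check: 2 sequences, 4 tokens, 8 dims; the second sequence has padding.
embs = np.random.default_rng(1).normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])
out = mean_pool_and_normalize(embs, mask)
```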

Training

Training Pipeline

Stage 1: Data Preparation
│  518K docs → ~500K chunks (article-aware segmentation)
│
Stage 2: Contrastive Fine-tuning
│  MatryoshkaLoss(MultipleNegativesRankingLoss)
│
Stage 3: Hard Negative Mining
│  FAISS retrieval → mine rank 50-100 as hard negatives
│
Stage 4: Multi-task Blending
│  70% retrieval + 20% classification + 10% STS
│  → Final model (NDCG@10 = 0.7229)
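Stage 3's mining rule can be sketched without FAISS: rank the corpus by similarity to the query and keep candidates at ranks 50-100, which are hard (similar to the query) but less likely to be unlabeled positives than the top ranks. A NumPy sketch with toy data; mine_hard_negatives and all sizes here are illustrative, not the actual training code:

```python
import numpy as np

def mine_hard_negatives(q_emb, corpus_emb, gold_idx, lo=50, hi=100, k=5):
    """Rank the corpus by cosine similarity to the query and take negatives
    from ranks lo..hi, skipping the gold passage. Assumes all embeddings
    are L2-normalized, so the dot product equals cosine similarity."""
    scores = corpus_emb @ q_emb
    ranked = np.argsort(-scores)  # best-first document indices
    window = [int(i) for i in ranked[lo:hi] if i != gold_idx]
    return window[:k]

# Toy demo: 200 random unit-norm "documents" and one query.
rng = np.random.default_rng(2)
corpus = rng.normal(size=(200, 32))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[0] + 0.1 * rng.normal(size=32)  # doc 0 plays the gold passage
query /= np.linalg.norm(query)
negatives = mine_hard_negatives(query, corpus, gold_idx=0)
```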

Training Hyperparameters

  • Learning rate: 5e-6
  • Batch size: 48 per device × 4 GPUs × 2 gradient accumulation = 384 effective
  • Epochs: 1 (multitask stage)
  • Warmup: 10%
  • Scheduler: Cosine
  • Precision: bf16
  • Hardware: 4x NVIDIA H100 80GB

Citation

@misc{vietlegal-e5,
  title={VietLegal-E5: Vietnamese Legal Domain Embedding Model},
  author={Nguyen, Mai},
  year={2026},
  url={https://huggingface.co/mainguyen9/vietlegal-e5}
}
