E5-small mix50 v2 — Vietnamese archive embedder

A fine-tune of intfloat/multilingual-e5-small for retrieval over archived Vietnamese administrative documents. It was trained on a 50/50 mix of (a) an in-domain Vietnamese corpus and (b) general retrieval pairs, then exported to ONNX fp32.

Used as the dense passage encoder in the ScanIndex hybrid search (Tantivy BM25 + FAISS HNSW + RRF fusion).
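For readers unfamiliar with the fusion step, here is a minimal sketch of Reciprocal Rank Fusion (RRF) over a BM25 ranking and a dense ranking. It is not ScanIndex's actual code: the function name rrf_fuse, the document IDs, and the constant k=60 (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the sparse and dense retrievers.
bm25_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc9", "doc3"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

RRF uses only ranks, so BM25 scores and embedding similarities never need to be put on a common scale before they are combined.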

Files

archive_models/e5-small-mix50-v2-onnx-fp32/model.onnx (+ model.onnx_data)
Tokenizer + sentence-transformers metadata (config.json, tokenizer.json, sentencepiece.bpe.model, 1_Pooling/, modules.json, …)

Asymmetric input

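E5 expects a role prefix on every input; omitting it typically degrades retrieval quality. Use "query: " for search queries and "passage: " for documents being indexed: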

queries  = [f"query: {q}" for q in raw_queries]
passages = [f"passage: {p}" for p in raw_passages]

Loading (ONNX)

import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

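# Download the repository snapshot into ./models and get its local path
local = snapshot_download("welcomyou/e5-small-vn-archive-mix50", local_dir="models")
# The ONNX export, tokenizer, and metadata live in this subfolder
sub = f"{local}/archive_models/e5-small-mix50-v2-onnx-fp32"
tok = AutoTokenizer.from_pretrained(sub)
# model.onnx_data must stay next to model.onnx so the external weights load
sess = ort.InferenceSession(f"{sub}/model.onnx")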
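Once the session and tokenizer are loaded, texts still have to be turned into sentence vectors. The sketch below shows one plausible way to do that, following the mean pooling and L2 normalization used by the base E5 models. It assumes the first ONNX output holds per-token hidden states of shape (batch, seq_len, hidden); if this export already returns pooled vectors, skip mean_pool. The helper names mean_pool, l2_normalize, and encode are hypothetical, not part of this repository.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def encode(sess, tok, texts, max_length=512):
    """Embed already-prefixed texts ("query: ..." or "passage: ...").

    Assumption: sess.run(...)[0] is the last hidden state, not a pooled vector.
    """
    enc = tok(texts, padding=True, truncation=True,
              max_length=max_length, return_tensors="np")
    feed = {}
    # Feed only the inputs this particular export declares.
    for inp in sess.get_inputs():
        if inp.name in enc:
            feed[inp.name] = enc[inp.name].astype(np.int64)
        elif inp.name == "token_type_ids":
            # Some BERT-style exports require segment IDs; all zeros is assumed.
            feed[inp.name] = np.zeros_like(enc["input_ids"], dtype=np.int64)
    hidden = sess.run(None, feed)[0]
    return l2_normalize(mean_pool(hidden, enc["attention_mask"]))
```

With the queries and passages lists from the prefix example, encode(sess, tok, queries) @ encode(sess, tok, passages).T would give a matrix of cosine similarities, one row per query.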

Training

See train-convert/archive-embedder/train/mix50_v2/.

License

MIT, inherited from intfloat/multilingual-e5-small.


Collection

Part of the ScanIndex collection: models loaded by https://github.com/welcomyou/scanindex (OCR, KIE, layout, tables, and embedder for Vietnamese administrative documents).