welcomyou committed on
Commit 0478a4a · verified · 1 Parent(s): 27ad772

docs: model card

Files changed (1)
  1. README.md +59 -0
README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ license: mit
+ base_model: intfloat/multilingual-e5-small
+ language:
+ - vi
+ - en
+ tags:
+ - sentence-similarity
+ - sentence-transformers
+ - e5
+ - vietnamese
+ - onnx
+ - fp32
+ - retrieval
+ - document-search
+ ---
+
+ # E5-small mix50 v2: Vietnamese archive embedder
+
+ A fine-tune of [`intfloat/multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small) for retrieval over archived Vietnamese administrative documents. It was trained on a 50/50 mix of (a) an in-domain Vietnamese corpus and (b) general retrieval pairs, then exported to ONNX (fp32).
+
+ It serves as the dense passage encoder in the [ScanIndex](https://github.com/welcomyou/scanindex) hybrid search pipeline (Tantivy BM25 + FAISS HNSW + RRF fusion).
+
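For readers unfamiliar with RRF (reciprocal rank fusion), the idea of merging the BM25 and dense rankings can be sketched as follows. This is only an illustration of the general technique, not ScanIndex's actual code: the function name `rrf_fuse`, the constant `k=60` (a commonly used default) and the document ids are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids: each doc scores sum(1 / (k + rank)), ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc3", "doc1", "doc7"]   # hypothetical lexical (BM25) ranking
dense_hits = ["doc1", "doc9", "doc3"]  # hypothetical dense (embedding) ranking
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because RRF uses only ranks, the BM25 scores and the embedding similarities never need to be put on a common scale.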
+ ## Files
+
+ - `archive_models/e5-small-mix50-v2-onnx-fp32/model.onnx` (+ `model.onnx_data`)
+ - Tokenizer + sentence-transformers metadata (`config.json`, `tokenizer.json`, `sentencepiece.bpe.model`, `1_Pooling/`, `modules.json`, …)
+
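+ ## Asymmetric input
+
+ E5 models are asymmetric: prefix search queries with `query: ` and indexed documents with `passage: ` before encoding:
+
+ ```python
+ raw_queries = ["example query"]      # placeholder input
+ raw_passages = ["example passage"]   # placeholder input
+ queries = [f"query: {q}" for q in raw_queries]
+ passages = [f"passage: {p}" for p in raw_passages]
+ ```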
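+
+ ## Loading (ONNX)
+
+ ```python
+ import onnxruntime as ort
+ from huggingface_hub import snapshot_download
+ from transformers import AutoTokenizer
+
+ # Download the repository snapshot into ./models and get its local path.
+ local = snapshot_download("welcomyou/e5-small-vn-archive-mix50", local_dir="models")
+ sub = f"{local}/archive_models/e5-small-mix50-v2-onnx-fp32"
+ # Load the tokenizer from the same subfolder as the ONNX graph.
+ tok = AutoTokenizer.from_pretrained(sub)
+ # model.onnx_data must stay alongside model.onnx so the external weights resolve.
+ sess = ort.InferenceSession(f"{sub}/model.onnx", providers=["CPUExecutionProvider"])
+ ```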
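Once the tokenizer and session are loaded, embeddings could be computed roughly as below. This is a minimal sketch, not the project's confirmed inference code. It assumes that the first output of the exported graph holds token-level hidden states of shape `(batch, seq_len, dim)`, that the model keeps the mean pooling and L2 normalization of the base E5 model, and that 512 tokens is a suitable truncation length. The helper names `mean_pool_normalize` and `embed` are hypothetical.

```python
import numpy as np


def mean_pool_normalize(last_hidden, attention_mask):
    """Average token vectors over non-padding positions, then L2-normalize each row."""
    mask = attention_mask[..., None].astype(last_hidden.dtype)  # (batch, seq, 1)
    summed = (last_hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


def embed(texts, tok, sess, max_length=512):
    """Encode texts that already carry their "query: " or "passage: " prefix."""
    enc = tok(texts, padding=True, truncation=True,
              max_length=max_length, return_tensors="np")
    # Feed only the inputs the exported graph declares
    # (assumption: their names match the tokenizer's output keys).
    wanted = {i.name for i in sess.get_inputs()}
    feeds = {k: v.astype(np.int64) for k, v in enc.items() if k in wanted}
    # Assumption: the first graph output is the token-level hidden states.
    last_hidden = sess.run(None, feeds)[0]
    return mean_pool_normalize(last_hidden, enc["attention_mask"])
```

With unit-length vectors, the dot product of a query embedding and a passage embedding equals their cosine similarity, e.g. `embed(queries, tok, sess) @ embed(passages, tok, sess).T`.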
+
+ ## Training
+
+ See [train-convert/archive-embedder/train/mix50_v2/](https://github.com/welcomyou/scanindex/tree/main/train-convert/archive-embedder/train/mix50_v2).
+
+ ## License
+
+ MIT, inherited from `intfloat/multilingual-e5-small`.