docs: model card
README.md
ADDED
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
license: mit
base_model: intfloat/multilingual-e5-small
language:
- vi
- en
tags:
- sentence-similarity
- sentence-transformers
- e5
- vietnamese
- onnx
- fp32
- retrieval
- document-search
---

# E5-small mix50 v2: Vietnamese archive embedder

A fine-tune of [`intfloat/multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small) for retrieval over archived Vietnamese administrative documents. It was trained on a 50/50 mix of (a) an in-domain Vietnamese corpus and (b) general retrieval pairs, then exported to ONNX in fp32.

It serves as the dense passage encoder in [ScanIndex](https://github.com/welcomyou/scanindex) hybrid search (Tantivy BM25 + FAISS HNSW, combined with RRF fusion).

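For readers unfamiliar with reciprocal rank fusion (RRF), the fusion step can be sketched roughly as below. This is a minimal illustration, not ScanIndex's actual implementation: the function name `rrf_fuse`, the document ids, and the constant `k=60` (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. `k` is an assumed default, not a ScanIndex setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_n] if top_n is not None else fused


# Hypothetical result lists from a lexical index and a vector index.
bm25_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc9", "doc3"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because RRF uses only ranks, BM25 scores and embedding similarities never need to be put on a common scale before they are combined.
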
## Files

- `archive_models/e5-small-mix50-v2-onnx-fp32/model.onnx` (+ `model.onnx_data`)
- Tokenizer + sentence-transformers metadata (`config.json`, `tokenizer.json`, `sentencepiece.bpe.model`, `1_Pooling/`, `modules.json`, …)

## Asymmetric input

E5 models expect a role prefix on every input: prepend `query: ` to search queries and `passage: ` to documents being indexed.

```python
# raw_queries and raw_passages are lists of unprefixed strings.
queries = [f"query: {q}" for q in raw_queries]
passages = [f"passage: {p}" for p in raw_passages]
```

## Loading (ONNX)

```python
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Download the repository snapshot into ./models and locate the ONNX export.
local = snapshot_download("welcomyou/e5-small-vn-archive-mix50", local_dir="models")
sub = f"{local}/archive_models/e5-small-mix50-v2-onnx-fp32"
# Load the tokenizer and the ONNX session from that subfolder.
tok = AutoTokenizer.from_pretrained(sub)
sess = ort.InferenceSession(f"{sub}/model.onnx")
```

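Turning the loaded `sess` and `tok` into sentence embeddings might look like the sketch below. It rests on several assumptions that are not confirmed by this card: that the first ONNX output is the token-level hidden state of shape `(batch, seq, hidden)`, that embeddings are obtained by mean pooling followed by L2 normalization (as the base E5 models do; the `1_Pooling/` config is the authority), and that `max_length=512` is appropriate. The helper names are hypothetical. If the export already emits pooled vectors, the pooling step should be skipped.

```python
import numpy as np


def mean_pool(last_hidden, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(last_hidden.dtype)  # (batch, seq, 1)
    summed = (last_hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def encode(sess, tok, texts, max_length=512):
    """Embed already-prefixed texts with an onnxruntime session and a tokenizer."""
    enc = tok(texts, padding=True, truncation=True,
              max_length=max_length, return_tensors="np")
    # Feed only the inputs that the exported graph actually declares.
    wanted = {inp.name for inp in sess.get_inputs()}
    feed = {name: arr.astype(np.int64) for name, arr in enc.items() if name in wanted}
    # Assumption: output 0 is the token-level hidden state (batch, seq, hidden).
    last_hidden = sess.run(None, feed)[0]
    return l2_normalize(mean_pool(last_hidden, enc["attention_mask"]))
```

With prefixed inputs, `q = encode(sess, tok, queries)` and `p = encode(sess, tok, passages)` would give unit vectors, so `q @ p.T` is the matrix of cosine similarities.
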
## Training

See [train-convert/archive-embedder/train/mix50_v2/](https://github.com/welcomyou/scanindex/tree/main/train-convert/archive-embedder/train/mix50_v2).

## License

MIT, inherited from `intfloat/multilingual-e5-small`.