# Embedl paraphrase-multilingual-MiniLM-L12-v2 (Quantized for TensorRT)
Deployable INT8-quantized version of `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs. Produces the same L2-normalised sentence embedding as the upstream encoder.
**Upstream model:** [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
## Highlights
- Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
- Drop-in replacement for `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` in TensorRT pipelines: same input pair (`input_ids`, `attention_mask`) at `seq_len=128`, same output embedding semantics (mean-pooled, L2-normalised); see the sketch after this list.
- Validated accuracy within 0.0122 of the FP32 Spearman ρ on STS17 (see the Accuracy table below).
- Faster than `trtexec --best` on supported NVIDIA hardware (see the Performance table below).
- Includes both ONNX (for TensorRT) and PT2 (`torch.export`-loadable) artifacts, plus runnable inference scripts.
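To make the drop-in contract concrete, the sketch below prepares the fixed-shape `(input_ids, attention_mask)` pair at `seq_len=128` and computes the upstream reference embedding that the quantized output can be compared against. It is illustrative only and not taken from the repo scripts.

```python
import numpy as np
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
sentence = "A man is eating food."

# Fixed shape (1, 128), matching the sequence length the artifacts were exported with.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
enc = tokenizer(
    sentence,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="np",
)
input_ids = enc["input_ids"].astype(np.int64)          # feed to the ONNX / PT2 model
attention_mask = enc["attention_mask"].astype(np.int64)

# Upstream reference: already mean-pooled and L2-normalised, so the quantized
# model's output can be compared to it directly (e.g. via cosine similarity).
reference = SentenceTransformer(MODEL).encode([sentence], normalize_embeddings=True)[0]
print(reference.shape)  # (384,)
```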
## Quick Start

```bash
pip install huggingface_hub transformers numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/paraphrase-multilingual-MiniLM-L12-v2-quantized-trt', local_dir='.')"
python infer_pt2.py --sentence "A man is eating food."  # pure PyTorch via torch.export
# or
python infer_trt.py --sentence "A man is eating food."  # TensorRT (requires pycuda + tensorrt)
```
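If you want to consume the PT2 artifact from your own code rather than through the script, loading it looks roughly like this; `infer_pt2.py` is the authoritative version, and the positional-argument order passed to the module is an assumption here:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
enc = tokenizer(
    "A man is eating food.",
    padding="max_length", truncation=True, max_length=128,
    return_tensors="pt",
)

# torch.export.load returns an ExportedProgram; .module() yields a callable module.
ep = torch.export.load("embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.pt2")
model = ep.module()
with torch.no_grad():
    embedding = model(enc["input_ids"], enc["attention_mask"])
print(embedding.shape)  # expected (1, 384), already L2-normalised
```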
## Files

| File | Purpose |
|---|---|
| `embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx` | INT8-quantized ONNX with Q/DQ nodes; feed to TensorRT (see the build example below). |
| `embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.pt2` | INT8-quantized `torch.export` ExportedProgram. |
| `infer_trt.py` | Builds a TRT engine from the ONNX and runs sample inference. |
| `infer_pt2.py` | Loads the `.pt2` with `torch.export.load` and runs sample inference. |
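If you would rather build and cache the engine yourself than go through `infer_trt.py`, a `trtexec` invocation along these lines should work (the engine filename is illustrative):

```bash
# The ONNX carries explicit Q/DQ nodes, so enable INT8 (plus FP16 for the
# layers left in higher precision) and serialize the engine for reuse.
trtexec \
  --onnx=embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx \
  --int8 --fp16 \
  --saveEngine=minilm_int8.plan
```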
## Performance

Latency measured with TensorRT's `trtexec`, GPU compute time only (`--noDataTransfers`), CUDA Graph and Spin Wait enabled, and clocks locked (`nvpmodel -m 0 && jetson_clocks` on Jetson). A representative invocation is sketched below.
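The exact flag set behind the published numbers is not included in this repo, so treat the following as an approximation of the methodology described above:

```bash
# Lock clocks first (Jetson):
sudo nvpmodel -m 0 && sudo jetson_clocks

# GPU-compute-only latency with CUDA Graph capture and spin-wait polling:
trtexec \
  --onnx=embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx \
  --int8 --fp16 \
  --noDataTransfers --useCudaGraph --useSpinWait
```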
### NVIDIA Jetson AGX Orin
| Configuration | Mean Latency | Speedup vs FP16 |
|---|---|---|
| TensorRT FP16 | 0.78 ms | 1.00x |
| TensorRT `--best` (unconstrained) | 0.77 ms | 1.00x |
| Embedl Deploy INT8 | 0.73 ms | 1.06x |
## Accuracy
Evaluated on the STS17 validation split. The quantized model retains nearly all of the FP32 accuracy: the mean Spearman ρ drops by 0.0122, and the largest per-pair drop is 0.0228 (es-es). A sketch of the metric computation follows the table.
| Metric | FP32 (ours) | Embedl INT8 | Δ |
|---|---|---|---|
| Spearman ρ (mean) | 0.8130 | 0.8008 | -0.0122 |
| ρ (ar-ar) | 0.7915 | 0.7906 | -0.0010 |
| ρ (default) | 0.7970 | 0.7868 | -0.0102 |
| ρ (en-ar) | 0.8122 | 0.7914 | -0.0208 |
| ρ (en-de) | 0.8422 | 0.8215 | -0.0207 |
| ρ (en-en) | 0.8687 | 0.8638 | -0.0049 |
| ρ (en-tr) | 0.7674 | 0.7555 | -0.0119 |
| ρ (es-en) | 0.8444 | 0.8300 | -0.0143 |
| ρ (es-es) | 0.8556 | 0.8328 | -0.0228 |
| ρ (fr-en) | 0.7659 | 0.7536 | -0.0123 |
| ρ (it-en) | 0.8235 | 0.8148 | -0.0087 |
| ρ (ko-ko) | 0.7703 | 0.7628 | -0.0075 |
| ρ (nl-en) | 0.8171 | 0.8059 | -0.0112 |
FP32 baseline: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
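The metric is the Spearman correlation between cosine similarities of the sentence-pair embeddings and the gold similarity ratings. A minimal sketch, assuming arrays of L2-normalised embeddings `emb_a`, `emb_b` and gold scores `gold` (all names illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb_a: np.ndarray, emb_b: np.ndarray, gold: np.ndarray) -> float:
    """Spearman rho between cosine similarities and gold STS scores.

    emb_a, emb_b: (n, d) L2-normalised embeddings of the two sides of each pair.
    gold: (n,) human similarity ratings.
    """
    # For unit-norm vectors, cosine similarity reduces to the dot product.
    cos = np.sum(emb_a * emb_b, axis=1)
    return float(spearmanr(cos, gold).statistic)

# Self-contained demo with random stand-in data (replace with real embeddings):
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 384)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(8, 384)); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(sts_spearman(a, b, rng.uniform(0.0, 5.0, size=8)))
```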
## Creating Your Own Optimized Models
This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.
## License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | paraphrase-multilingual-MiniLM-L12-v2 license |
## Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.