license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
quantized_from:
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
tags:
- sentence-similarity
- quantization
- onnx
- tensorrt
- edge
- embedl
gated: true
extra_gated_heading: Access Embedl Paraphrase Multilingual Minilm L12 V2
extra_gated_description: >-
To access this model, please review and accept the terms below. Your contact
information is collected solely to manage access and, with your explicit
consent, to notify you about updated or new optimized models from Embedl.
extra_gated_button_content: Agree and request access
extra_gated_prompt: >-
By requesting access you agree to the Embedl Models Community Licence and the
upstream Paraphrase Multilingual Minilm L12 V2 License
extra_gated_fields:
Company: text
I agree to the Embedl Models Community Licence and upstream Paraphrase Multilingual Minilm L12 V2 License: checkbox
I consent to being contacted by Embedl about products and services (optional): checkbox
Embedl Paraphrase Multilingual Minilm L12 V2 (Quantized for TensorRT)
Deployable INT8-quantized version of sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2,
optimized with embedl-deploy
for low-latency NVIDIA TensorRT inference on edge GPUs. Produces
the same L2-normalised sentence embedding as the upstream encoder.
Upstream Model
Highlights
- Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
- Drop-in replacement for sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 in TensorRT pipelines: same input pair (input_ids, attention_mask) at seq_len=128, same output embedding semantics (mean-pooled, L2-normalised).
- Validated accuracy within 0.0122 of the FP32 Spearman ρ on sts17 (see Accuracy table below).
- Faster than trtexec --best on supported NVIDIA hardware (see Performance table below).
- Includes both ONNX (for TensorRT) and PT2 (torch.export-loadable) artifacts plus runnable inference scripts.
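For readers unfamiliar with the "mean-pooled, L2-normalised" output semantics mentioned above, the following is a minimal NumPy sketch of that operation. It is illustrative only: the artifacts in this repo are described as already emitting the final sentence embedding, so this step should not be needed on their output. The function name `mean_pool_l2` and the toy inputs are hypothetical and not part of the repo.

```python
import numpy as np

def mean_pool_l2(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean over the token axis, followed by L2 normalisation.

    token_embeddings: (batch, seq_len, hidden) float array
    attention_mask:   (batch, seq_len) array of 0/1 (0 marks padding)
    """
    # Broadcast the mask over the hidden dimension so padded tokens contribute nothing.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against an all-padding row to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# Toy input (hypothetical values): two real tokens and one padded token.
tokens = np.array([[[1.0, 0.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
embedding = mean_pool_l2(tokens, mask)
print(embedding)  # both components equal 1/sqrt(2); the padded token is ignored
```

Because the embeddings have unit length, the cosine similarity between two sentences reduces to a plain dot product.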
Quick Start
pip install huggingface_hub transformers numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/paraphrase-multilingual-MiniLM-L12-v2-quantized-trt', local_dir='.')"
python infer_pt2.py --sentence "A man is eating food." # pure PyTorch via torch.export
# or
python infer_trt.py --sentence "A man is eating food." # TensorRT (requires pycuda + tensorrt)
Files
| File | Purpose |
|---|---|
| embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx | INT8-quantized ONNX with Q/DQ nodes; feed to TensorRT. |
| embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.pt2 | INT8-quantized torch.export ExportedProgram. |
| infer_trt.py | Build a TRT engine from the ONNX and run sample inference. |
| infer_pt2.py | Load the .pt2 with torch.export.load and run sample inference. |
Performance
Latency measured with TensorRT + trtexec, GPU compute time only
(--noDataTransfers), CUDA Graph + Spin Wait enabled, clocks locked
(nvpmodel -m 0 && jetson_clocks on Jetson).
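As a rough guide to reproducing this setup, the sketch below assembles a trtexec command line with the measurement flags named above. The exact invocation used for the published numbers is not given here, so treat the following as assumptions: the helper name `trtexec_command`, batch size 1, the tensor names input_ids and attention_mask, and the choice of `--int8 --fp16` for the Q/DQ ONNX. The `--shapes` argument is only added when the ONNX has dynamic input shapes, which is also an assumption to verify against the file.

```python
import shlex

def trtexec_command(onnx_path, precision_flags, dynamic_shapes=False, batch=1, seq_len=128):
    """Assemble a trtexec benchmark command (hypothetical helper, not part of the repo)."""
    cmd = ["trtexec", f"--onnx={onnx_path}"]
    if dynamic_shapes:
        # Only meaningful for ONNX files exported with dynamic input dimensions.
        shape = f"{batch}x{seq_len}"
        cmd.append(f"--shapes=input_ids:{shape},attention_mask:{shape}")
    cmd.extend(precision_flags)
    # Measurement settings described above: GPU compute time only,
    # CUDA Graph capture, and spin-wait synchronisation.
    cmd.extend(["--noDataTransfers", "--useCudaGraph", "--useSpinWait"])
    return cmd

# Assumed precision flags for a Q/DQ INT8 model with FP16 fallback.
cmd = trtexec_command(
    "embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx",
    ["--int8", "--fp16"],
)
print(shlex.join(cmd))
```

Swapping the precision flags for `["--fp16"]` or `["--best"]` would give commands for the two baseline configurations; clock locking (nvpmodel, jetson_clocks) is done separately on the device before running.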
NVIDIA Jetson AGX Orin
| Configuration | Mean Latency | Speedup vs FP16 |
|---|---|---|
| TensorRT FP16 | 0.78 ms | 1.00x |
| TensorRT --best (unconstrained) | 0.77 ms | 1.00x |
| Embedl Deploy INT8 | 0.73 ms | 1.06x |
Accuracy
Evaluated on the sts17 validation split. Overall Spearman ρ drops by 0.0122 relative to FP32; the largest per-subset drop is 0.0228 (es-es).
| Metric | FP32 (ours) | Embedl INT8 | Δ |
|---|---|---|---|
| Spearman ρ | 0.8130 | 0.8008 | -0.0122 |
| ρ (ar-ar) | 0.7915 | 0.7906 | -0.0010 |
| ρ (default) | 0.7970 | 0.7868 | -0.0102 |
| ρ (en-ar) | 0.8122 | 0.7914 | -0.0208 |
| ρ (en-de) | 0.8422 | 0.8215 | -0.0207 |
| ρ (en-en) | 0.8687 | 0.8638 | -0.0049 |
| ρ (en-tr) | 0.7674 | 0.7555 | -0.0119 |
| ρ (es-en) | 0.8444 | 0.8300 | -0.0143 |
| ρ (es-es) | 0.8556 | 0.8328 | -0.0228 |
| ρ (fr-en) | 0.7659 | 0.7536 | -0.0123 |
| ρ (it-en) | 0.8235 | 0.8148 | -0.0087 |
| ρ (ko-ko) | 0.7703 | 0.7628 | -0.0075 |
| ρ (nl-en) | 0.8171 | 0.8059 | -0.0112 |
FP32 baseline: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.
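To make the metric concrete, here is a minimal sketch of how a Spearman ρ of this kind is typically computed: cosine similarity between the two sentence embeddings of each pair, rank-correlated against the human similarity scores. The evaluation script behind the table is not shown in this card, so this assumes the usual STS protocol; the helper names and toy data are hypothetical.

```python
import numpy as np

def rank_average(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(rank_average(x), rank_average(y))[0, 1])

def cosine_scores(a, b):
    """Row-wise cosine similarity for L2-normalised embeddings of shape (n, dim)."""
    return np.sum(a * b, axis=1)

# Toy data (hypothetical): three sentence pairs with unit-length embeddings.
left = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
right = np.array([[1.0, 0.0], [np.sqrt(0.5), np.sqrt(0.5)], [0.0, 1.0]])
gold = [5.0, 3.0, 0.5]  # human similarity scores

rho = spearman_rho(cosine_scores(left, right), gold)
print(rho)
```

The Δ column would then be the INT8 ρ minus the FP32 ρ, each computed this way on the same sentence pairs.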
Creating Your Own Optimized Models
This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models — see the documentation for installation and usage.
License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 — no redistribution as a hosted service |
| Upstream architecture and weights | Paraphrase Multilingual Minilm L12 V2 License |
Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.