# Embedl paraphrase-multilingual-MiniLM-L12-v2 (Quantized for TensorRT)
Deployable INT8-quantized version of `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs. Produces the same L2-normalised sentence embedding as the upstream encoder.
**Upstream model:** [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
## Highlights
- Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
- Drop-in replacement for `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` in TensorRT pipelines: same input pair (`input_ids`, `attention_mask`) at `seq_len=128`, same output embedding semantics (mean-pooled, L2-normalised); see the sketch after this list.
- Validated accuracy within 0.0122 of the FP32 Spearman ρ on STS17 (see the Accuracy table below).
- Faster than `trtexec --best` on supported NVIDIA hardware (see the Performance table below).
- Includes both ONNX (for TensorRT) and PT2 (`torch.export`-loadable) artifacts, plus runnable inference scripts.
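To make the drop-in contract concrete, the sketch below prepares the fixed-shape `(input_ids, attention_mask)` pair at `seq_len=128` and computes the upstream reference embedding that the quantized output can be compared against. It is illustrative only and not taken from the repo scripts.

```python
import numpy as np
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
sentence = "A man is eating food."

# Fixed shape (1, 128), matching the sequence length the artifacts were exported with.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
enc = tokenizer(
    sentence,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="np",
)
input_ids = enc["input_ids"].astype(np.int64)          # feed to the ONNX / PT2 model
attention_mask = enc["attention_mask"].astype(np.int64)

# Upstream reference: already mean-pooled and L2-normalised, so the quantized
# model's output can be compared to it directly (e.g. via cosine similarity).
reference = SentenceTransformer(MODEL).encode([sentence], normalize_embeddings=True)[0]
print(reference.shape)  # (384,)
```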
## Quick Start

```bash
pip install huggingface_hub transformers numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/paraphrase-multilingual-MiniLM-L12-v2-quantized-trt', local_dir='.')"
python infer_pt2.py --sentence "A man is eating food."  # pure PyTorch via torch.export
# or
python infer_trt.py --sentence "A man is eating food."  # TensorRT (requires pycuda + tensorrt)
```
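If you want to consume the PT2 artifact from your own code rather than through the script, loading it looks roughly like this; `infer_pt2.py` is the authoritative version, and the positional-argument order passed to the module is an assumption here:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
enc = tokenizer(
    "A man is eating food.",
    padding="max_length", truncation=True, max_length=128,
    return_tensors="pt",
)

# torch.export.load returns an ExportedProgram; .module() yields a callable module.
ep = torch.export.load("embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.pt2")
model = ep.module()
with torch.no_grad():
    embedding = model(enc["input_ids"], enc["attention_mask"])
print(embedding.shape)  # expected (1, 384), already L2-normalised
```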
## Files

| File | Purpose |
|---|---|
| `embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx` | INT8-quantized ONNX with Q/DQ nodes; feed to TensorRT (see the build example below). |
| `embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.pt2` | INT8-quantized `torch.export` ExportedProgram. |
| `infer_trt.py` | Builds a TRT engine from the ONNX and runs sample inference. |
| `infer_pt2.py` | Loads the `.pt2` with `torch.export.load` and runs sample inference. |
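If you would rather build and cache the engine yourself than go through `infer_trt.py`, a `trtexec` invocation along these lines should work (the engine filename is illustrative):

```bash
# The ONNX carries explicit Q/DQ nodes, so enable INT8 (plus FP16 for the
# layers left in higher precision) and serialize the engine for reuse.
trtexec \
  --onnx=embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx \
  --int8 --fp16 \
  --saveEngine=minilm_int8.plan
```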
## Performance

Latency measured with TensorRT's `trtexec`, GPU compute time only (`--noDataTransfers`), CUDA Graph and Spin Wait enabled, and clocks locked (`nvpmodel -m 0 && jetson_clocks` on Jetson). A representative invocation is sketched below.
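The exact flag set behind the published numbers is not included in this repo, so treat the following as an approximation of the methodology described above:

```bash
# Lock clocks first (Jetson):
sudo nvpmodel -m 0 && sudo jetson_clocks

# GPU-compute-only latency with CUDA Graph capture and spin-wait polling:
trtexec \
  --onnx=embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx \
  --int8 --fp16 \
  --noDataTransfers --useCudaGraph --useSpinWait
```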
### NVIDIA Jetson AGX Orin
| Configuration | Mean Latency | Speedup vs FP16 |
|---|---|---|
| TensorRT FP16 | 0.78 ms | 1.00x |
| TensorRT `--best` (unconstrained) | 0.77 ms | 1.00x |
| Embedl Deploy INT8 | 0.73 ms | 1.06x |
## Accuracy
Evaluated on the STS17 validation split. The quantized model retains nearly all of the FP32 accuracy: the mean Spearman ρ drops by 0.0122, and the largest per-pair drop is 0.0228 (es-es). A sketch of the metric computation follows the table.
| Metric | FP32 (ours) | Embedl INT8 | Δ |
|---|---|---|---|
| Spearman ρ (mean) | 0.8130 | 0.8008 | -0.0122 |
| ρ (ar-ar) | 0.7915 | 0.7906 | -0.0010 |
| ρ (default) | 0.7970 | 0.7868 | -0.0102 |
| ρ (en-ar) | 0.8122 | 0.7914 | -0.0208 |
| ρ (en-de) | 0.8422 | 0.8215 | -0.0207 |
| ρ (en-en) | 0.8687 | 0.8638 | -0.0049 |
| ρ (en-tr) | 0.7674 | 0.7555 | -0.0119 |
| ρ (es-en) | 0.8444 | 0.8300 | -0.0143 |
| ρ (es-es) | 0.8556 | 0.8328 | -0.0228 |
| ρ (fr-en) | 0.7659 | 0.7536 | -0.0123 |
| ρ (it-en) | 0.8235 | 0.8148 | -0.0087 |
| ρ (ko-ko) | 0.7703 | 0.7628 | -0.0075 |
| ρ (nl-en) | 0.8171 | 0.8059 | -0.0112 |
FP32 baseline: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
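The metric is the Spearman correlation between cosine similarities of the sentence-pair embeddings and the gold similarity ratings. A minimal sketch, assuming arrays of L2-normalised embeddings `emb_a`, `emb_b` and gold scores `gold` (all names illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb_a: np.ndarray, emb_b: np.ndarray, gold: np.ndarray) -> float:
    """Spearman rho between cosine similarities and gold STS scores.

    emb_a, emb_b: (n, d) L2-normalised embeddings of the two sides of each pair.
    gold: (n,) human similarity ratings.
    """
    # For unit-norm vectors, cosine similarity reduces to the dot product.
    cos = np.sum(emb_a * emb_b, axis=1)
    return float(spearmanr(cos, gold).statistic)

# Self-contained demo with random stand-in data (replace with real embeddings):
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 384)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(8, 384)); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(sts_spearman(a, b, rng.uniform(0.0, 5.0, size=8)))
```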
## Creating Your Own Optimized Models
This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.
## License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | paraphrase-multilingual-MiniLM-L12-v2 license |
## Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.