
Embedl all-MiniLM-L6-v2 (Quantized for TensorRT)

A deployable INT8-quantized version of sentence-transformers/all-MiniLM-L6-v2, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs. It produces the same L2-normalised sentence embeddings as the upstream encoder.
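
To pin down those output semantics: the upstream encoder mean-pools token embeddings over non-padding positions and L2-normalises the result. This post-processing is already baked into the exported graphs here; the NumPy sketch below is only illustrative.

```python
import numpy as np

def mean_pool_and_normalise(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # guard against empty masks
    mean = summed / counts
    return mean / np.linalg.norm(mean, axis=1, keepdims=True)        # unit-length embeddings
```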

Upstream Model

sentence-transformers/all-MiniLM-L6-v2

Highlights

  • Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
  • Drop-in replacement for sentence-transformers/all-MiniLM-L6-v2 in TensorRT pipelines: same input pair (input_ids, attention_mask) at seq_len=128 and the same output embedding semantics (mean-pooled, L2-normalised); see the tokenization sketch after this list.
  • Validated accuracy within 0.0026 of the FP32 Spearman ρ on stsb (see Accuracy table below).
  • Faster than trtexec --best on supported NVIDIA hardware (see Performance table below).
  • Includes both ONNX (for TensorRT) and PT2 (torch.export-loadable) artifacts plus runnable inference scripts.
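
As a concrete example of the input contract above, inputs can be produced with the upstream tokenizer padded to the fixed sequence length. The integer dtype a given engine expects is an assumption here; check the inference scripts for the authoritative version.

```python
import numpy as np
from transformers import AutoTokenizer

# Same tokenizer as the upstream model; max_length=128 matches the exported graph.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = tokenizer(
    ["A man is eating food."],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="np",
)
input_ids = enc["input_ids"].astype(np.int64)            # (1, 128)
attention_mask = enc["attention_mask"].astype(np.int64)  # (1, 128)
```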

Quick Start

```
pip install huggingface_hub transformers numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/all-MiniLM-L6-v2-quantized-trt', local_dir='.')"
python infer_pt2.py --sentence "A man is eating food."   # pure PyTorch via torch.export
# or
python infer_trt.py --sentence "A man is eating food."   # TensorRT (requires pycuda + tensorrt)
```
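
If you would rather embed the .pt2 path in your own code than call the script, the core of it looks roughly like the sketch below. The exported program's exact input signature is an assumption; infer_pt2.py is the authoritative version.

```python
import torch
from transformers import AutoTokenizer

ep = torch.export.load("embedl_all-MiniLM-L6-v2_int8.pt2")  # ExportedProgram
model = ep.module()                                         # callable GraphModule

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = tok("A man is eating food.", padding="max_length", truncation=True,
          max_length=128, return_tensors="pt")
with torch.no_grad():
    emb = model(enc["input_ids"], enc["attention_mask"])    # assumed argument order
print(emb.shape)  # expected (1, 384), already L2-normalised
```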

Files

| File | Purpose |
| --- | --- |
| embedl_all-MiniLM-L6-v2_int8.onnx | INT8-quantized ONNX with Q/DQ nodes; feed this to TensorRT. |
| embedl_all-MiniLM-L6-v2_int8.pt2 | INT8-quantized torch.export ExportedProgram. |
| infer_trt.py | Builds a TRT engine from the ONNX and runs sample inference. |
| infer_pt2.py | Loads the .pt2 with torch.export.load and runs sample inference. |
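
For orientation, building an engine from the Q/DQ ONNX with the TensorRT Python API looks roughly like the following. This is a hedged sketch of what infer_trt.py does; the actual script may set additional builder options.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is required for ONNX parsing on TensorRT 8.x/9.x.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("embedl_all-MiniLM-L6-v2_int8.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honour the Q/DQ scales baked into the ONNX
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 for layers left unquantized

with open("model.plan", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```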

Performance

Latency measured with TensorRT's trtexec, GPU compute time only (--noDataTransfers), CUDA Graphs and spin-wait enabled, clocks locked (nvpmodel -m 0 && jetson_clocks on Jetson).

[Figures: all-MiniLM-L6-v2 latency and peak memory on NVIDIA Jetson AGX Orin]

NVIDIA Jetson AGX Orin

| Configuration | Mean Latency | Speedup vs FP16 |
| --- | --- | --- |
| TensorRT FP16 | 0.41 ms | 1.00x |
| TensorRT --best (unconstrained) | 0.41 ms | 1.01x |
| Embedl Deploy INT8 | 0.38 ms | 1.07x |
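
The INT8 row can be reproduced approximately with a trtexec invocation along these lines (the exact flag set behind the numbers above is not published here, so treat this as an approximation):

```
trtexec --onnx=embedl_all-MiniLM-L6-v2_int8.onnx --int8 --fp16 \
        --noDataTransfers --useCudaGraph --useSpinWait
```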

Accuracy

Evaluated on the stsb validation split. The quantized model stays within 0.0026 Spearman ρ of the FP32 baseline.

| Model | Spearman ρ |
| --- | --- |
| sentence-transformers/all-MiniLM-L6-v2 FP32 (ours) | 0.8672 |
| Embedl all-MiniLM-L6-v2 INT8 | 0.8646 |
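
The metric itself is straightforward to reproduce: embed both sentences of each pair, take the cosine similarity, and correlate with the gold scores. In the sketch below, embed() is a hypothetical helper standing in for whichever model (FP32 or INT8) you are evaluating.

```python
from datasets import load_dataset
from scipy.stats import spearmanr

ds = load_dataset("glue", "stsb", split="validation")

emb1 = embed(ds["sentence1"])        # hypothetical helper -> (N, 384) L2-normalised embeddings
emb2 = embed(ds["sentence2"])
cos = (emb1 * emb2).sum(axis=1)      # dot product == cosine for unit vectors
rho, _ = spearmanr(cos, ds["label"])
print(f"Spearman rho: {rho:.4f}")
```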

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.

License

| Component | License |
| --- | --- |
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | all-MiniLM-L6-v2 license |

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support
Need help with this model? Chat with the Embedl team and other engineers on Discord.
Quantization gotchas, hardware questions, fine-tuning tips: bring them all.
Join our Discord →