
Embedl all-MiniLM-L6-v2 (Quantized for TensorRT)

A deployable INT8-quantized version of sentence-transformers/all-MiniLM-L6-v2, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs. It produces the same L2-normalised sentence embeddings as the upstream encoder.
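
To pin down those output semantics: the upstream encoder mean-pools token embeddings over non-padding positions and L2-normalises the result. This post-processing is already baked into the exported graphs here; the NumPy sketch below is only illustrative.

```python
import numpy as np

def mean_pool_and_normalise(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # guard against empty masks
    mean = summed / counts
    return mean / np.linalg.norm(mean, axis=1, keepdims=True)        # unit-length embeddings
```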

Upstream Model

sentence-transformers/all-MiniLM-L6-v2

Highlights

  • Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
  • Drop-in replacement for sentence-transformers/all-MiniLM-L6-v2 in TensorRT pipelines: same input pair (input_ids, attention_mask) at seq_len=128 and the same output embedding semantics (mean-pooled, L2-normalised); see the tokenization sketch after this list.
  • Validated accuracy within 0.0026 of the FP32 Spearman ρ on stsb (see Accuracy table below).
  • Faster than trtexec --best on supported NVIDIA hardware (see Performance table below).
  • Includes both ONNX (for TensorRT) and PT2 (torch.export-loadable) artifacts plus runnable inference scripts.
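
As a concrete example of the input contract above, inputs can be produced with the upstream tokenizer padded to the fixed sequence length. The integer dtype a given engine expects is an assumption here; check the inference scripts for the authoritative version.

```python
import numpy as np
from transformers import AutoTokenizer

# Same tokenizer as the upstream model; max_length=128 matches the exported graph.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = tokenizer(
    ["A man is eating food."],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="np",
)
input_ids = enc["input_ids"].astype(np.int64)            # (1, 128)
attention_mask = enc["attention_mask"].astype(np.int64)  # (1, 128)
```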

Quick Start

```
pip install huggingface_hub transformers numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/all-MiniLM-L6-v2-quantized-trt', local_dir='.')"
python infer_pt2.py --sentence "A man is eating food."   # pure PyTorch via torch.export
# or
python infer_trt.py --sentence "A man is eating food."   # TensorRT (requires pycuda + tensorrt)
```
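
If you would rather embed the .pt2 path in your own code than call the script, the core of it looks roughly like the sketch below. The exported program's exact input signature is an assumption; infer_pt2.py is the authoritative version.

```python
import torch
from transformers import AutoTokenizer

ep = torch.export.load("embedl_all-MiniLM-L6-v2_int8.pt2")  # ExportedProgram
model = ep.module()                                         # callable GraphModule

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = tok("A man is eating food.", padding="max_length", truncation=True,
          max_length=128, return_tensors="pt")
with torch.no_grad():
    emb = model(enc["input_ids"], enc["attention_mask"])    # assumed argument order
print(emb.shape)  # expected (1, 384), already L2-normalised
```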

Files

| File | Purpose |
| --- | --- |
| embedl_all-MiniLM-L6-v2_int8.onnx | INT8-quantized ONNX with Q/DQ nodes; feed this to TensorRT. |
| embedl_all-MiniLM-L6-v2_int8.pt2 | INT8-quantized torch.export ExportedProgram. |
| infer_trt.py | Builds a TRT engine from the ONNX and runs sample inference. |
| infer_pt2.py | Loads the .pt2 with torch.export.load and runs sample inference. |
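
For orientation, building an engine from the Q/DQ ONNX with the TensorRT Python API looks roughly like the following. This is a hedged sketch of what infer_trt.py does; the actual script may set additional builder options.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is required for ONNX parsing on TensorRT 8.x/9.x.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("embedl_all-MiniLM-L6-v2_int8.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honour the Q/DQ scales baked into the ONNX
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 for layers left unquantized

with open("model.plan", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```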

Performance

Latency measured with TensorRT's trtexec, GPU compute time only (--noDataTransfers), CUDA Graphs and spin-wait enabled, clocks locked (nvpmodel -m 0 && jetson_clocks on Jetson).

[Figures: all-MiniLM-L6-v2 latency and peak memory on NVIDIA Jetson AGX Orin]

NVIDIA Jetson AGX Orin

| Configuration | Mean Latency | Speedup vs FP16 |
| --- | --- | --- |
| TensorRT FP16 | 0.41 ms | 1.00x |
| TensorRT --best (unconstrained) | 0.41 ms | 1.01x |
| Embedl Deploy INT8 | 0.38 ms | 1.07x |
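
The INT8 row can be reproduced approximately with a trtexec invocation along these lines (the exact flag set behind the numbers above is not published here, so treat this as an approximation):

```
trtexec --onnx=embedl_all-MiniLM-L6-v2_int8.onnx --int8 --fp16 \
        --noDataTransfers --useCudaGraph --useSpinWait
```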

Accuracy

Evaluated on the stsb validation split. The quantized model stays within 0.0026 Spearman ρ of the FP32 baseline.

| Model | Spearman ρ |
| --- | --- |
| sentence-transformers/all-MiniLM-L6-v2 FP32 (ours) | 0.8672 |
| Embedl all-MiniLM-L6-v2 INT8 | 0.8646 |
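
The metric itself is straightforward to reproduce: embed both sentences of each pair, take the cosine similarity, and correlate with the gold scores. In the sketch below, embed() is a hypothetical helper standing in for whichever model (FP32 or INT8) you are evaluating.

```python
from datasets import load_dataset
from scipy.stats import spearmanr

ds = load_dataset("glue", "stsb", split="validation")

emb1 = embed(ds["sentence1"])        # hypothetical helper -> (N, 384) L2-normalised embeddings
emb2 = embed(ds["sentence2"])
cos = (emb1 * emb2).sum(axis=1)      # dot product == cosine for unit vectors
rho, _ = spearmanr(cos, ds["label"])
print(f"Spearman rho: {rho:.4f}")
```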

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.

License

| Component | License |
| --- | --- |
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | all-MiniLM-L6-v2 license |

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support
Need help with this model? Chat with the Embedl team and other engineers on Discord.
Quantization gotchas, hardware questions, fine-tuning tips: bring them all.
Join our Discord →