---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
quantized_from:
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
tags:
- sentence-similarity
- quantization
- onnx
- tensorrt
- edge
- embedl
gated: true
extra_gated_heading: "Access Embedl Paraphrase Multilingual MiniLM-L12-v2"
extra_gated_description: "To access this model, please review and accept the terms below. Your contact information is collected solely to manage access and, with your explicit consent, to notify you about updated or new optimized models from Embedl."
extra_gated_button_content: "Agree and request access"
extra_gated_prompt: "By requesting access you agree to the Embedl Models Community Licence and the upstream Paraphrase Multilingual MiniLM-L12-v2 License."
extra_gated_fields:
  Company: text
  I agree to the Embedl Models Community Licence and upstream Paraphrase Multilingual MiniLM-L12-v2 License: checkbox
  I consent to being contacted by Embedl about products and services (optional): checkbox
---
<!-- embedl-banner:start -->
<style>
.embedl-btn-primary { transition: background 160ms ease, box-shadow 160ms ease; }
.embedl-btn-primary:hover { background: #4FDCE4 !important; box-shadow: 0 8px 22px rgba(45,212,221,0.45) !important; }
.embedl-btn-secondary { transition: background 160ms ease; }
.embedl-btn-secondary:hover { background: rgba(45,212,221,0.15) !important; }
.embedl-headline { font-size: clamp(11px, 2.15vw, 15px) !important; }
.embedl-btn-primary, .embedl-btn-secondary {
  font-size: clamp(11px, 1.65vw, 13px) !important;
  padding: clamp(6px, 1.1vw, 9px) clamp(10px, 1.6vw, 14px) !important;
}
</style>
<div style="background:radial-gradient(600px 220px at 0% 50%,rgba(45,212,221,0.22) 0%,rgba(45,212,221,0) 60%),radial-gradient(400px 180px at 100% 100%,rgba(45,212,221,0.10) 0%,rgba(45,212,221,0) 55%),linear-gradient(135deg,#0B1626 0%,#142338 100%);border:1px solid rgba(45,212,221,0.28);border-radius:12px;padding:22px 24px;margin:0 0 24px 0;color:#F2F6FA;box-shadow:0 4px 16px rgba(11,22,38,0.18);overflow:hidden;box-sizing:border-box;max-width:100%;">
<table style="width:100%;border-collapse:collapse;border:0;background:transparent;">
<tr style="background:transparent;">
<td style="vertical-align:middle;border:0;padding:0;background:transparent;">
<div style="display:inline-block;font-size:10px;letter-spacing:0.08em;text-transform:uppercase;font-weight:700;color:#2DD4DD;background:rgba(45,212,221,0.15);border:1px solid rgba(45,212,221,0.35);padding:4px 10px;border-radius:999px;margin-bottom:10px;white-space:nowrap;">Optimized by Embedl</div>
<div class="embedl-headline" style="font-size:15px;font-weight:700;line-height:1.35;color:#F2F6FA;margin-bottom:4px;">Need to <span style="color:#2DD4DD;white-space:nowrap;">fine-tune</span>, hit <span style="color:#2DD4DD;white-space:nowrap;">performance targets</span>, or deploy on <span style="color:#2DD4DD;white-space:nowrap;">specific hardware</span>?</div>
<div style="font-size:13px;color:#9BA7B5;">We've got you covered.</div>
</td>
<td width="1%" style="vertical-align:middle;border:0;padding:0 0 0 18px;white-space:nowrap;text-align:right;background:transparent;">
<a href="https://www.embedl.com/models" class="embedl-btn-secondary" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #2DD4DD;color:#2DD4DD;text-decoration:none;margin-right:8px;">Learn more</a>
<a href="https://www.embedl.com/contact" class="embedl-btn-primary" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #2DD4DD;background:#2DD4DD;color:#0B1626;text-decoration:none;box-shadow:0 6px 18px rgba(45,212,221,0.28);">Get in touch →</a>
</td>
</tr>
</table>
</div>
<!-- embedl-banner:end -->

# Embedl Paraphrase Multilingual MiniLM-L12-v2 (Quantized for TensorRT)

Deployable INT8-quantized version of [`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), optimized with [embedl-deploy](https://github.com/embedl/embedl-deploy) for low-latency NVIDIA TensorRT inference on edge GPUs. It produces the same L2-normalised sentence embeddings as the upstream encoder.
|
|
## Upstream Model

<a href="https://hfviewer.com/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2?utm_source=huggingface&utm_medium=embedded_model_card&utm_campaign=sentence-transformers__paraphrase-multilingual-MiniLM-L12-v2_card" target="_blank" rel="noopener">
  <img
    src="https://hfviewer.com/api/card.svg?source=sentence-transformers%2Fparaphrase-multilingual-MiniLM-L12-v2&v=20260501clipcard"
    alt="Open sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 in hfviewer"
    width="100%"
  />
</a>
|
|
## Highlights

- **Mixed-precision INT8/FP16 quantization** with hardware-aware optimizations from [embedl-deploy](https://github.com/embedl/embedl-deploy).
- **Drop-in replacement** for `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` in TensorRT pipelines — same `(input_ids, attention_mask)` input pair at `seq_len=128`, same output embedding semantics (mean-pooled, L2-normalised).
- **Validated accuracy** within 0.0122 of the FP32 Spearman ρ on STS17 (see the Accuracy table below).
- **Faster than `trtexec --best`** on supported NVIDIA hardware (see the Performance table below).
- Includes both **ONNX** (for TensorRT) and **PT2** (`torch.export`-loadable) artifacts, plus runnable inference scripts.
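The output semantics above (mean pooling over non-padded tokens, then L2 normalisation) can be sketched in plain NumPy. This is an illustrative stand-in, not the shipped code: toy arrays play the role of the encoder's last hidden state and the tokenizer's attention mask.

```python
import numpy as np

def pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling followed by L2 normalisation.

    token_embeddings: (seq_len, hidden) last hidden state of the encoder.
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # sum over real tokens only
    count = max(float(mask.sum()), 1e-9)                           # avoid division by zero
    mean = summed / count
    return mean / np.linalg.norm(mean)                             # unit length

# Toy check: two real tokens, one padding token that must be ignored.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])
vec = pool_and_normalize(emb, mask)
print(np.round(vec, 4))  # → [0.7071 0.7071]
```

The padding row contributes nothing: the mean of `[1, 0]` and `[0, 1]` is `[0.5, 0.5]`, which normalises to `[0.7071, 0.7071]`.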

## Quick Start

```bash
pip install huggingface_hub transformers numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/paraphrase-multilingual-MiniLM-L12-v2-quantized-trt', local_dir='.')"
python infer_pt2.py --sentence "A man is eating food."  # pure PyTorch via torch.export
# or
python infer_trt.py --sentence "A man is eating food."  # TensorRT (requires pycuda + tensorrt)
```
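Because the embeddings come out L2-normalised, cosine similarity between two sentences reduces to a plain dot product. A minimal sketch, with dummy unit vectors standing in for the embeddings the scripts above would produce:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """For unit-length embeddings, cosine similarity is just the dot product."""
    return float(np.dot(a, b))

# Dummy unit vectors standing in for two sentence embeddings.
a = np.array([0.6, 0.8])
b = np.array([0.8, 0.6])
print(cosine_similarity(a, b))  # → 0.96
```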

## Files

| File | Purpose |
|---|---|
| `embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.onnx` | INT8-quantized ONNX with Q/DQ nodes — feed to TensorRT. |
| `embedl_paraphrase-multilingual-MiniLM-L12-v2_int8.pt2` | INT8-quantized `torch.export` ExportedProgram. |
| `infer_trt.py` | Build a TensorRT engine from the ONNX and run sample inference. |
| `infer_pt2.py` | Load the `.pt2` with `torch.export.load` and run sample inference. |

## Performance

Latency measured with TensorRT via `trtexec`, GPU compute time only (`--noDataTransfers`), CUDA Graphs and spin-wait enabled, and clocks locked (`nvpmodel -m 0 && jetson_clocks` on Jetson).

<img src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/paraphrase-multilingual-MiniLM-L12-v2-quantized-trt/paraphrase-multilingual-MiniLM-L12-v2-quantized-trt__orin-mountain-view__latency.svg" alt="Paraphrase Multilingual MiniLM-L12-v2 latency on NVIDIA Jetson AGX Orin">

<img src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/paraphrase-multilingual-MiniLM-L12-v2-quantized-trt/paraphrase-multilingual-MiniLM-L12-v2-quantized-trt__orin-mountain-view__memory.svg" alt="Paraphrase Multilingual MiniLM-L12-v2 peak memory on NVIDIA Jetson AGX Orin">

### NVIDIA Jetson AGX Orin

| Configuration | Mean Latency | Speedup vs FP16 |
|---|---|---|
| TensorRT FP16 | 0.78 ms | 1.00x |
| TensorRT `--best` (unconstrained) | 0.77 ms | 1.00x |
| **Embedl Deploy INT8** | **0.73 ms** | **1.06x** |

## Accuracy

Evaluated on the STS17 validation split. The quantized model retains nearly all of the FP32 accuracy; the largest per-pair drop is 0.0228 Spearman ρ.

| Metric | FP32 (ours) | **Embedl INT8** | Δ |
|---|---|---|---|
| Spearman ρ | 0.8130 | **0.8008** | -0.0122 |
| ρ (ar-ar) | 0.7915 | **0.7906** | -0.0010 |
| ρ (default) | 0.7970 | **0.7868** | -0.0102 |
| ρ (en-ar) | 0.8122 | **0.7914** | -0.0208 |
| ρ (en-de) | 0.8422 | **0.8215** | -0.0207 |
| ρ (en-en) | 0.8687 | **0.8638** | -0.0049 |
| ρ (en-tr) | 0.7674 | **0.7555** | -0.0119 |
| ρ (es-en) | 0.8444 | **0.8300** | -0.0143 |
| ρ (es-es) | 0.8556 | **0.8328** | -0.0228 |
| ρ (fr-en) | 0.7659 | **0.7536** | -0.0123 |
| ρ (it-en) | 0.8235 | **0.8148** | -0.0087 |
| ρ (ko-ko) | 0.7703 | **0.7628** | -0.0075 |
| ρ (nl-en) | 0.8171 | **0.8059** | -0.0112 |

FP32 baseline: [`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2).
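Spearman ρ correlates the model's cosine similarities with human similarity judgements by rank rather than by value. As an illustration (not the evaluation harness used above), it can be computed without SciPy via the closed form, which assumes no tied values:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

    Valid only when x and y each contain no tied values.
    """
    x, y = np.asarray(x), np.asarray(y)
    rx = np.argsort(np.argsort(x))  # ranks 0..n-1 of x
    ry = np.argsort(np.argsort(y))  # ranks 0..n-1 of y
    d = rx - ry                     # per-item rank differences
    n = len(x)
    return 1.0 - 6.0 * float((d * d).sum()) / (n * (n * n - 1))

# Toy STS-style check: predicted similarities vs. gold scores.
pred = [0.1, 0.4, 0.3, 0.9]
gold = [1.0, 3.0, 2.0, 5.0]
print(spearman_rho(pred, gold))  # → 1.0 (identical rankings)
```

A perfectly inverted ranking would give -1.0; the INT8 scores in the table stay close to the FP32 baseline on this metric.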

## Creating Your Own Optimized Models

This artifact was produced with [embedl-deploy](https://github.com/embedl/embedl-deploy), Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models — see [the documentation](https://github.com/embedl/embedl-deploy#readme) for installation and usage.

## License

| Component | License |
|---|---|
| Optimized model artifacts (this repo) | [Embedl Models Community Licence v1.0](https://github.com/embedl/embedl-models/blob/main/LICENSE) — no redistribution as a hosted service |
| Upstream architecture and weights | [Paraphrase Multilingual MiniLM-L12-v2 License](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) |

## Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at [contact@embedl.com](mailto:contact@embedl.com), or open an issue on [GitHub](https://github.com/embedl/embedl-deploy).

<!-- embedl-discord-banner:start -->
<style>
.embedl-discord-btn { transition: background 160ms ease, box-shadow 160ms ease; }
.embedl-discord-btn:hover { background: #6C77F5 !important; box-shadow: 0 8px 22px rgba(88,101,242,0.55) !important; }
</style>
<div style="background:radial-gradient(600px 220px at 0% 50%,rgba(88,101,242,0.22) 0%,rgba(88,101,242,0) 60%),radial-gradient(400px 180px at 100% 100%,rgba(88,101,242,0.10) 0%,rgba(88,101,242,0) 55%),linear-gradient(135deg,#0B1626 0%,#142338 100%);border:1px solid rgba(88,101,242,0.35);border-radius:12px;padding:22px 24px;margin:24px 0 0 0;color:#F2F6FA;box-shadow:0 4px 16px rgba(11,22,38,0.18);overflow:hidden;box-sizing:border-box;max-width:100%;">
<table style="width:100%;border-collapse:collapse;border:0;background:transparent;">
<tr style="background:transparent;">
<td style="vertical-align:middle;border:0;padding:0;background:transparent;">
<div style="display:inline-block;font-size:10px;letter-spacing:0.08em;text-transform:uppercase;font-weight:700;color:#A5B4FC;background:rgba(88,101,242,0.18);border:1px solid rgba(88,101,242,0.45);padding:4px 10px;border-radius:999px;margin-bottom:10px;white-space:nowrap;">Community & support</div>
<div style="font-size:15px;font-weight:700;line-height:1.35;color:#F2F6FA;margin-bottom:4px;">Need help with this model? Chat with the Embedl team and other engineers on <span style="color:#A5B4FC;white-space:nowrap;">Discord</span>.</div>
<div style="font-size:13px;color:#9BA7B5;">Quantization gotchas, hardware questions, fine-tuning tips — bring them all.</div>
</td>
<td width="1%" style="vertical-align:middle;border:0;padding:0 0 0 18px;white-space:nowrap;text-align:right;background:transparent;">
<a href="https://discord.gg/MTbMWdKqE" class="embedl-discord-btn" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #5865F2;background:#5865F2;color:#FFFFFF;text-decoration:none;box-shadow:0 6px 18px rgba(88,101,242,0.35);">Join our Discord →</a>
</td>
</tr>
</table>
</div>
<!-- embedl-discord-banner:end -->