Updated styling and HFViewer embedding
README.md CHANGED
```diff
@@ -30,6 +30,16 @@ Deployable INT8-quantized version of [`apple/mobilevit-small`](https://huggingfa
 optimized with [embedl-deploy](https://github.com/embedl/embedl-deploy)
 for low-latency NVIDIA TensorRT inference on edge GPUs.
 
+## Upstream Model
+
+<a href="https://hfviewer.com/apple/mobilevit-small?utm_source=huggingface&utm_medium=embedded_model_card&utm_campaign=apple__mobilevit-small_card" target="_blank" rel="noopener">
+  <img
+    src="https://hfviewer.com/api/card.svg?source=apple%2Fmobilevit-small&v=20260501clipcard"
+    alt="Open apple/mobilevit-small in hfviewer"
+    width="100%"
+  />
+</a>
+
 ## Highlights
 
 - **Mixed-precision INT8/FP16 quantization** with hardware-aware
@@ -39,8 +49,9 @@ for low-latency NVIDIA TensorRT inference on edge GPUs.
   semantics.
 - **Validated accuracy** within 3.30 pp of the FP32
   baseline on ImageNet (see Accuracy table below).
-- **
-  (see Performance table
+- **Matches the latency of `trtexec --best`** on supported NVIDIA
+  hardware while preserving INT8 accuracy (see Performance table
+  below).
 - Includes both **ONNX** (for TensorRT) and **PT2**
   (`torch.export`-loadable) artifacts plus runnable inference scripts.
 
@@ -62,16 +73,14 @@ python infer_pt2.py --image path/to/image.jpg # pure PyTorch via torch.export
 | `embedl_mobilevit_small_int8.pt2` | INT8-quantized `torch.export` ExportedProgram. |
 | `infer_trt.py` | Build a TRT engine from the ONNX and run sample inference. |
 | `infer_pt2.py` | Load the `.pt2` with `torch.export.load` and run sample inference. |
-| `latency_comparison.png` | Latency comparison across precisions and devices. |
 
 ## Performance
 
 Latency measured with TensorRT + `trtexec`, GPU compute time only
 (`--noDataTransfers`), CUDA Graph + Spin Wait enabled, clocks locked
-(`nvpmodel -m 0 && jetson_clocks` on Jetson).
-`latency_comparison.png` for a visual summary.
+(`nvpmodel -m 0 && jetson_clocks` on Jetson).
 
-
+<img src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/mobilevit-small-quantized/mobilevit-small-quantized__orin-mountain-view.svg" alt="MobileViT-Small benchmark on NVIDIA Jetson AGX Orin">
 
 ### NVIDIA Jetson AGX Orin
 
```
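
For the pure-PyTorch path, the file table in the last hunk describes `infer_pt2.py` as loading the `.pt2` with `torch.export.load`. Below is a minimal sketch of that flow, not the repository's script: the 1×3×256×256 input shape and the 1000-class output are assumptions carried over from the upstream `apple/mobilevit-small` configuration, and the real script additionally decodes and normalizes an image.

```python
# Minimal sketch, not the shipped infer_pt2.py: load the INT8
# ExportedProgram and run one forward pass on a dummy input.
# Assumed: 1x3x256x256 float input and an ImageNet-1k head,
# per the upstream apple/mobilevit-small configuration.
import torch

ep = torch.export.load("embedl_mobilevit_small_int8.pt2")
model = ep.module()  # callable module recovered from the ExportedProgram

dummy = torch.randn(1, 3, 256, 256)  # stand-in for a preprocessed image
with torch.no_grad():
    logits = model(dummy)

print(logits.shape)           # e.g. torch.Size([1, 1000])
print(logits.argmax(dim=-1))  # predicted class index
```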
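On the TensorRT side, `infer_trt.py` is described as building an engine from the ONNX. A hedged sketch of that step with the standard TensorRT (8.x-style) Python API follows; the ONNX file name is an assumption inferred from the `.pt2` naming. Because the exported graph already carries Q/DQ quantization nodes, enabling the INT8 and FP16 builder flags is enough and no calibrator is needed. The methodology quoted under Performance corresponds to benchmarking the same ONNX with `trtexec` plus `--noDataTransfers --useCudaGraph --useSpinWait`.

```python
# Minimal sketch, not the shipped infer_trt.py: parse the quantized ONNX
# and build a serialized TensorRT engine. The file name
# "embedl_mobilevit_small_int8.onnx" is an assumption based on the .pt2 naming.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)  # explicit-batch network
)
parser = trt.OnnxParser(network, logger)

with open("embedl_mobilevit_small_int8.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honor the Q/DQ scales baked into the graph
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 for layers left unquantized

engine_bytes = builder.build_serialized_network(network, config)
assert engine_bytes is not None, "engine build failed"
with open("mobilevit_small_int8.engine", "wb") as f:
    f.write(engine_bytes)
```

The resulting `.engine` file can then be deserialized with `trt.Runtime` for inference, which is presumably what the script's "run sample inference" half covers.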