dann-od committed (verified)
Commit d31c7d4 · 1 Parent(s): f8fe280

MobileVit-Small First commit
.gitattributes CHANGED
@@ -20,6 +20,7 @@
  *.pickle filter=lfs diff=lfs merge=lfs -text
  *.pkl filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pt2 filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
  *.rar filter=lfs diff=lfs merge=lfs -text
  *.safetensors filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -3,30 +3,114 @@ license: other
  license_name: embedl-models-community-licence-1.0
  license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
  base_model:
- - facebook/sam3
+ - apple/mobilevit-small
  quantized_from:
- - facebook/sam3
+ - apple/mobilevit-small
  tags:
- - segmentation
- - sam
- - sam3
+ - image-classification
  - quantization
  - onnx
  - tensorrt
  - edge
  - embedl
  gated: true
- extra_gated_heading: "Access Embedl SAM3 (Quantized)"
+ extra_gated_heading: "Access Embedl MobileViT-Small"
- extra_gated_description: >-
-   To access this model, please review and accept the terms below.
-   Your contact information is collected solely to manage access and,
-   with your explicit consent, to notify you about updated or new
-   optimized models from Embedl. You can withdraw consent at any time
-   by contacting us (see Contact section below). See our license for full terms.
+ extra_gated_description: "To access this model, please review and accept the terms below. Your contact information is collected solely to manage access and, with your explicit consent, to notify you about updated or new optimized models from Embedl."
  extra_gated_button_content: "Agree and request access"
- extra_gated_prompt: "By requesting access you agree to the Embedl Models Community Licence and the upstream SAM License"
+ extra_gated_prompt: "By requesting access you agree to the Embedl Models Community Licence and the upstream MobileViT-Small License"
  extra_gated_fields:
    Company: text
-   I agree to the Embedl Models Community Licence and upstream SAM License: checkbox
+   I agree to the Embedl Models Community Licence and upstream MobileViT-Small License: checkbox
    I consent to being contacted by Embedl about products and services (optional): checkbox
  ---
+
+ # Embedl MobileViT-Small (Quantized for TensorRT)
+
+ Deployable INT8-quantized version of [`apple/mobilevit-small`](https://huggingface.co/apple/mobilevit-small),
+ optimized with [embedl-deploy](https://github.com/embedl/embedl-deploy)
+ for low-latency NVIDIA TensorRT inference on edge GPUs.
+
+ ## Highlights
+
+ - **Mixed-precision INT8/FP16 quantization** with hardware-aware
+   optimizations from [embedl-deploy](https://github.com/embedl/embedl-deploy).
+ - **Drop-in replacement** for `apple/mobilevit-small` in TensorRT pipelines —
+   same input shape (256×256), same output semantics (see the parity sketch below).
+ - **Validated accuracy** within 3.31 pp of the FP32
+   baseline on ImageNet (see Accuracy table below).
+ - **Matches or beats `trtexec --best`** on supported NVIDIA hardware
+   (see Performance table below).
+ - Includes both **ONNX** (for TensorRT) and **PT2**
+   (`torch.export`-loadable) artifacts plus runnable inference scripts.
+
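+ As a quick drop-in check, the PT2 artifact can be compared against the
+ FP32 baseline. A minimal sketch, assuming `transformers` and `torch`
+ are installed (a random tensor only exercises shapes, not accuracy):
+
+ ```python
+ import torch
+ from transformers import MobileViTForImageClassification
+
+ baseline = MobileViTForImageClassification.from_pretrained(
+     "apple/mobilevit-small"
+ ).eval()
+ quantized = torch.export.load("embedl_mobilevit_small_int8.pt2").module()
+
+ x = torch.rand(1, 3, 256, 256)  # real images need the BGR/[0,1] preprocessing
+ with torch.no_grad():
+     ref = baseline(x).logits
+ out = quantized(x)
+ print(torch.topk(ref, 5).indices)
+ print(torch.topk(out, 5).indices)
+ ```
+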
+ ## Quick Start
+
+ ```bash
+ pip install huggingface_hub onnxruntime-gpu pillow numpy
+ # infer_trt.py additionally needs tensorrt + pycuda; infer_pt2.py needs torch
+ python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/mobilevit-small-quantized', local_dir='.')"
+ python infer_trt.py --image path/to/image.jpg   # TensorRT
+ # or
+ python infer_pt2.py --image path/to/image.jpg   # pure PyTorch via torch.export
+ ```
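+
+ The ONNX artifact also runs under ONNX Runtime for a quick functional
+ check without TensorRT. A minimal sketch, assuming the
+ `onnxruntime-gpu` from the Quick Start and input preprocessed as in
+ the bundled scripts:
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+
+ sess = ort.InferenceSession(
+     "embedl_mobilevit_small_int8.onnx",
+     providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
+ )
+ x = np.random.rand(1, 3, 256, 256).astype(np.float32)  # stand-in input
+ (logits,) = sess.run(None, {sess.get_inputs()[0].name: x})
+ print(logits.shape)
+ ```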
+
+ ## Files
+
+ | File | Purpose |
+ |---|---|
+ | `embedl_mobilevit_small_int8.onnx` | INT8-quantized ONNX with Q/DQ nodes — feed to TensorRT. |
+ | `embedl_mobilevit_small_int8.pt2` | INT8-quantized `torch.export` ExportedProgram. |
+ | `infer_trt.py` | Build a TRT engine from the ONNX and run sample inference. |
+ | `infer_pt2.py` | Load the `.pt2` with `torch.export.load` and run sample inference. |
+ | `latency_comparison.png` | Latency comparison across precisions and devices. |
+
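+ To confirm the Q/DQ structure of the ONNX artifact yourself, a quick
+ sketch using the `onnx` package (an extra dependency, `pip install onnx`):
+
+ ```python
+ from collections import Counter
+
+ import onnx
+
+ model = onnx.load("embedl_mobilevit_small_int8.onnx")
+ ops = Counter(node.op_type for node in model.graph.node)
+ print("QuantizeLinear nodes:  ", ops["QuantizeLinear"])
+ print("DequantizeLinear nodes:", ops["DequantizeLinear"])
+ ```
+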
+ ## Performance
+
+ Latency measured with TensorRT + `trtexec`, GPU compute time only
+ (`--noDataTransfers`), CUDA Graph + Spin Wait enabled, clocks locked
+ (`nvpmodel -m 0 && jetson_clocks` on Jetson). See
+ `latency_comparison.png` for a visual summary.
+
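+ The methodology above corresponds to a `trtexec` invocation along
+ these lines (a sketch of the presumed setup, not the exact command
+ used for the numbers below):
+
+ ```python
+ import subprocess
+
+ # Presumed trtexec flags matching the description above.
+ subprocess.run(
+     [
+         "trtexec",
+         "--onnx=embedl_mobilevit_small_int8.onnx",
+         "--fp16", "--int8",    # mixed INT8/FP16 path
+         "--useCudaGraph",      # CUDA Graph enabled
+         "--useSpinWait",       # Spin Wait enabled
+         "--noDataTransfers",   # GPU compute time only
+     ],
+     check=True,
+ )
+ ```
+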
+ ![Latency comparison across precisions](latency_comparison.png)
+
+ ### NVIDIA Jetson AGX Orin
+
+ | Configuration | Mean Latency | Speedup vs FP16 |
+ |---|---|---|
+ | TensorRT FP16 | 1.28 ms | 1.00x |
+ | TensorRT `--best` (unconstrained) | 1.09 ms | 1.17x |
+ | **Embedl Deploy INT8** | **1.09 ms** | **1.17x** |
+
+ ## Accuracy
+
+ Evaluated on the ImageNet validation split. The quantized model
+ retains most of the FP32 accuracy: top-1 drops by 3.31 pp and top-5
+ by 1.80 pp.
+
+ | Model | Top-1 | Top-5 |
+ |---|---|---|
+ | `apple/mobilevit-small` FP32 (ours) | 78.14% | 94.08% |
+ | **Embedl MobileViT-Small INT8** | **74.83%** | **92.28%** |
+
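+ For reference, top-1/top-5 accuracy can be computed from logits along
+ these lines (a minimal sketch; `val_loader` is a hypothetical ImageNet
+ DataLoader yielding preprocessed batches and integer labels):
+
+ ```python
+ import torch
+
+ @torch.no_grad()
+ def topk_accuracy(model, val_loader, ks=(1, 5)):
+     correct = {k: 0 for k in ks}
+     total = 0
+     for images, labels in val_loader:  # hypothetical loader
+         top = model(images).topk(max(ks), dim=-1).indices
+         for k in ks:
+             correct[k] += (top[:, :k] == labels[:, None]).any(dim=1).sum().item()
+         total += labels.numel()
+     return {k: correct[k] / total for k in ks}
+ ```
+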
+ ## Creating Your Own Optimized Models
+
+ This artifact was produced with
+ [embedl-deploy](https://github.com/embedl/embedl-deploy),
+ Embedl's open-source PyTorch → TensorRT deployment library. You can
+ apply the same workflow to your own models — see
+ [the documentation](https://github.com/embedl/embedl-deploy#readme)
+ for installation and usage.
+
+ ## License
+
+ | Component | License |
+ |---|---|
+ | Optimized model artifacts (this repo) | [Embedl Models Community Licence v1.0](https://github.com/embedl/embedl-models/blob/main/LICENSE) — no redistribution as a hosted service |
+ | Upstream architecture and weights | [MobileViT-Small License](https://huggingface.co/apple/mobilevit-small) |
+
+ ## Contact
+
+ We offer engineering support for on-prem/edge deployments and partner
+ co-marketing opportunities. Reach out at
+ [contact@embedl.com](mailto:contact@embedl.com), or open an issue on
+ [GitHub](https://github.com/embedl/embedl-deploy).
embedl_mobilevit_small_int8.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:320b83fa490c5ec19a549ff4bfcb4d7413fb9100b3c2e485cd3e016a81389bf1
+ size 22518832
embedl_mobilevit_small_int8.pt2 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f674b91e7aac2c7ad0a84516108285dd079d50f670aec553b9887ad573e4fd04
+ size 46826712
infer_pt2.py ADDED
@@ -0,0 +1,65 @@
+ # Copyright (C) 2026 Embedl AB
+ """Run inference on the Embedl MobileViT-Small INT8 model via torch.export.
+
+ This script loads the shipped ``embedl_mobilevit_small_int8.pt2``
+ artifact with ``torch.export.load`` and runs a single image through
+ it. No TensorRT or ONNX runtime is required — just PyTorch.
+
+ Usage::
+
+     python infer_pt2.py --image path/to/image.jpg
+ """
+
+ import argparse
+ from pathlib import Path
+
+ import numpy as np
+ import torch
+ from PIL import Image
+
+ PT2_PATH = Path(__file__).with_name("embedl_mobilevit_small_int8.pt2")
+ INPUT_SIZE = (256, 256)
+ MEAN = np.array([0.0, 0.0, 0.0], dtype=np.float32)
+ STD = np.array([1.0, 1.0, 1.0], dtype=np.float32)
+
+
+ def preprocess(image_path: Path) -> torch.Tensor:
+     # MobileViT-Small uses BGR channel order, [0, 1] range, NO mean/std
+     # normalization (matches the upstream HF processor: do_normalize=None).
+     image = Image.open(image_path).convert("RGB").resize(INPUT_SIZE)
+     arr = np.asarray(image, dtype=np.float32) / 255.0
+     arr = (arr - MEAN) / STD
+     arr = arr[..., ::-1].copy()  # RGB -> BGR
+     arr = arr.transpose(2, 0, 1)[None]  # NCHW
+     return torch.from_numpy(arr)
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument("--image", required=True, type=Path)
+     parser.add_argument("--topk", type=int, default=5)
+     args = parser.parse_args()
+
+     if not PT2_PATH.exists():
+         raise SystemExit(
+             f"Expected {PT2_PATH.name} next to this script. "
+             "Did you `huggingface-cli download` the repo?"
+         )
+
+     # The ExportedProgram captured the model in eval mode at export
+     # time, so no further .eval() / no_grad toggling is needed (and
+     # neither is supported on the .module() wrapper).
+     model = torch.export.load(str(PT2_PATH)).module()
+
+     x = preprocess(args.image)
+     logits = model(x)
+     probs = torch.softmax(logits, dim=-1).squeeze(0)
+
+     topk_vals, topk_idx = probs.topk(args.topk)
+     print(f"Top-{args.topk} predictions for {args.image}:")
+     for rank, (idx, val) in enumerate(zip(topk_idx.tolist(), topk_vals.tolist()), 1):
+         print(f"  {rank}. class {idx:>5d} ({val * 100:5.2f}%)")
+
+
+ if __name__ == "__main__":
+     main()
infer_trt.py ADDED
@@ -0,0 +1,139 @@
+ # Copyright (C) 2026 Embedl AB
+ """Run inference on the Embedl MobileViT-Small INT8 model via TensorRT.
+
+ This script builds a TensorRT engine from the shipped
+ ``embedl_mobilevit_small_int8.onnx`` artifact (Q/DQ nodes baked in by
+ embedl-deploy) and runs a single image through it. The first run
+ caches the engine to ``embedl_mobilevit_small_int8.engine`` so reuse is
+ fast.
+
+ Requires TensorRT >= 10.1 and pycuda (or cuda-python). Tested on
+ NVIDIA Jetson AGX Orin (JetPack 6) and discrete GPUs with CUDA 12.
+
+ Usage::
+
+     python infer_trt.py --image path/to/image.jpg
+ """
+
+ import argparse
+ import time
+ from pathlib import Path
+
+ import numpy as np
+ import tensorrt as trt
+ from PIL import Image
+
+ try:
+     import pycuda.autoinit  # noqa: F401 (initializes CUDA context)
+     import pycuda.driver as cuda
+ except ImportError as exc:  # pragma: no cover
+     raise SystemExit(
+         "pycuda is required. Install with: pip install pycuda"
+     ) from exc
+
+ ONNX_PATH = Path(__file__).with_name("embedl_mobilevit_small_int8.onnx")
+ ENGINE_PATH = Path(__file__).with_name("embedl_mobilevit_small_int8.engine")
+ INPUT_SIZE = (256, 256)
+ MEAN = np.array([0.0, 0.0, 0.0], dtype=np.float32)
+ STD = np.array([1.0, 1.0, 1.0], dtype=np.float32)
+
+ TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
+
+
+ def build_engine() -> bytes:
+     builder = trt.Builder(TRT_LOGGER)
+     network = builder.create_network(
+         1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+     )
+     parser = trt.OnnxParser(network, TRT_LOGGER)
+     with open(ONNX_PATH, "rb") as f:
+         if not parser.parse(f.read()):
+             for i in range(parser.num_errors):
+                 print(parser.get_error(i))
+             raise RuntimeError("ONNX parse failed.")
+     config = builder.create_builder_config()
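+     # FP16 + INT8 together enable the mixed-precision path: the Q/DQ
+     # nodes baked into the ONNX determine where INT8 runs, and FP16 is
+     # allowed for the remaining layers.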
+     config.set_flag(trt.BuilderFlag.FP16)
+     config.set_flag(trt.BuilderFlag.INT8)
+     config.builder_optimization_level = 5
+     serialized = builder.build_serialized_network(network, config)
+     if serialized is None:
+         raise RuntimeError("Engine build failed.")
+     return bytes(serialized)
+
+
+ def load_or_build_engine() -> trt.ICudaEngine:
+     if ENGINE_PATH.exists():
+         data = ENGINE_PATH.read_bytes()
+     else:
+         print(f"Building engine (first run) → {ENGINE_PATH.name} …")
+         data = build_engine()
+         ENGINE_PATH.write_bytes(data)
+     runtime = trt.Runtime(TRT_LOGGER)
+     return runtime.deserialize_cuda_engine(data)
+
+
+ def preprocess(image_path: Path) -> np.ndarray:
+     # MobileViT-Small uses BGR channel order, [0, 1] range, NO mean/std
+     # normalization (matches the upstream HF processor: do_normalize=None).
+     image = Image.open(image_path).convert("RGB").resize(INPUT_SIZE)
+     arr = np.asarray(image, dtype=np.float32) / 255.0
+     arr = (arr - MEAN) / STD
+     arr = arr[..., ::-1]  # RGB -> BGR
+     return np.ascontiguousarray(arr.transpose(2, 0, 1)[None])
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument("--image", required=True, type=Path)
+     parser.add_argument("--topk", type=int, default=5)
+     args = parser.parse_args()
+
+     if not ONNX_PATH.exists():
+         raise SystemExit(
+             f"Expected {ONNX_PATH.name} next to this script. "
+             "Did you download the HF repo?"
+         )
+
+     engine = load_or_build_engine()
+     context = engine.create_execution_context()
+
+     input_name = engine.get_tensor_name(0)
+     output_name = engine.get_tensor_name(1)
+     out_shape = tuple(engine.get_tensor_shape(output_name))
+
+     x = preprocess(args.image)
+     h_out = np.empty(out_shape, dtype=np.float32)
+
+     d_in = cuda.mem_alloc(x.nbytes)
+     d_out = cuda.mem_alloc(h_out.nbytes)
+     stream = cuda.Stream()
+
+     cuda.memcpy_htod_async(d_in, x, stream)
+     context.set_tensor_address(input_name, int(d_in))
+     context.set_tensor_address(output_name, int(d_out))
+
+     # Warm-up + timed run.
+     for _ in range(5):
+         context.execute_async_v3(stream.handle)
+     stream.synchronize()
+     t0 = time.perf_counter()
+     context.execute_async_v3(stream.handle)
+     stream.synchronize()
+     latency_ms = (time.perf_counter() - t0) * 1000.0
+
+     cuda.memcpy_dtoh_async(h_out, d_out, stream)
+     stream.synchronize()
+
+     logits = h_out.reshape(-1)
+     probs = np.exp(logits - logits.max())
+     probs /= probs.sum()
+     top = probs.argsort()[::-1][: args.topk]
+
+     print(f"Latency (single-run, GPU compute): {latency_ms:.2f} ms")
+     print(f"Top-{args.topk} predictions for {args.image}:")
+     for rank, idx in enumerate(top, 1):
+         print(f"  {rank}. class {idx:>5d} ({probs[idx] * 100:5.2f}%)")
+
+
+ if __name__ == "__main__":
+     main()