# OpenPi VLA Models (ONNX Export)

This repository contains ONNX-exported versions of the OpenPi VLA (Vision-Language-Action) models, specifically tuned for NVIDIA Jetson Thor (Blackwell) and GeForce RTX Ada/Blackwell GPUs.

These models are derived from the Pi0.5 Libero baseline (OpenVLA architecture).

## 📊 Performance Benchmarks (Jetson Thor)

| Variant | Latency (ms) | Throughput (QPS) | GPU Mem (est.) | Status | Notes |
|---|---|---|---|---|---|
| W8A16 (QDQ) | 181.81 | 6.37 | ~6.3 GiB | ✅ Recommended | Production ready. Verified accuracy parity (MSE 0.0061). |
| FP16 | 184.54 | 6.26 | ~13.0 GiB | ✅ Verified | Baseline. Native FP16 execution. |
| W8A8 (Sim) | 128.37 | 8.38 | ~3.7 GiB | 🚧 Benchmark | Fastest (1.4x speedup); requires calibration. |
| INT4 (Sim) | 183.15 | 6.33 | ~6.3 GiB | 🚧 Experimental | Parity speed with W8A16. |

**Memory note:** INT8 quantization reduces runtime VRAM usage by roughly 3.2x compared to FP32 (~4 GB vs ~13 GB measured at runtime).
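The arithmetic behind that reduction is plain weight-size accounting. A minimal sketch (the 3.3B parameter count below is an assumption for illustration, not the model's actual size):

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB at a given precision."""
    return num_params * bits_per_weight / 8 / (1024 ** 3)

# Hypothetical parameter count, for illustration only
N = 3_300_000_000

fp32 = weight_memory_gib(N, 32)
int8 = weight_memory_gib(N, 8)
print(f"FP32: {fp32:.1f} GiB, INT8: {int8:.1f} GiB, ratio: {fp32 / int8:.1f}x")
```

The pure weight ratio is 4x; the observed ~3.2x is lower because activations and runtime workspace remain at higher precision.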


## 📂 Repository Structure & "Why is this 231GB?"

We preserve all precision variants and intermediate export states for research reproducibility.

| Folder | Precision | Description | Size | Contents |
|---|---|---|---|---|
| `final_w8a16/` | W8A16 | **[USE THIS]** Clean, collapsed ONNX model. | ~12 GB | `model.w8a16.onnx`, `.data` |
| `final_fp16/` | FP16 | Baseline high-precision export. | ~24 GB | `model.fp32.onnx` |
| `final_w4a4/` | W4A4 | Experimental INT4/INT4 export. | ~12 GB | Granular (many files) |
| `final_w8a16_new/` | W8A16 | Uncollapsed dev export (duplicate state). | ~12 GB | Granular (many files) |
| `checkpoints/` | PyTorch | Raw archive: original JAX→PyTorch weights. | ~100 GB | `.pt` files, config, raw layers |

**Recommendation:** for deployment, download only `final_w8a16/`.
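One way to fetch just that folder is the `huggingface-cli` `--include` filter, which avoids pulling the full repository. `<org>/<repo>` below is a placeholder for this repository's actual ID:

```shell
# Download only the production W8A16 export (repo ID is a placeholder)
huggingface-cli download <org>/<repo> \
    --include "final_w8a16/*" \
    --local-dir ./openpi-vla
```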


## ⚠️ Environment & Compatibility Notes

### 1. NVIDIA Jetson Thor (SM 11.0)

- **Driver requirement:** CUDA 13.0+ / driver 560+
- **Issue:** standard `onnxruntime` and PyTorch binaries may not fully support SM 11.0 (Blackwell) instructions yet.
- **Fix:** use the NVIDIA NGC containers or compile from source.
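For example, launching an NGC PyTorch container (the tag below is an assumption for illustration; check the NGC catalog for the current Jetson-compatible tag):

```shell
# Run inside an NGC container with up-to-date CUDA/Blackwell support
# (container tag is illustrative; pick the latest from the NGC catalog)
docker run --rm -it --runtime nvidia --network host \
    -v "$(pwd)":/workspace \
    nvcr.io/nvidia/pytorch:25.01-py3
```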

### 2. Quantization Support

- **W4A4:** requires a custom TensorRT backend to execute `trt.DequantizeLinear` ops efficiently. CPU execution is not supported.
- **W8A16:** uses standard ONNX QDQ nodes (`QuantizeLinear`/`DequantizeLinear`). Compatible with standard ONNX Runtime (GPU) and TensorRT.
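A QDQ pair implements simple affine quantization: `QuantizeLinear` maps `x` to `round(x / scale) + zero_point` clamped to the INT8 range, and `DequantizeLinear` inverts it. A pure-Python sketch of the round trip (the scale value is illustrative, not taken from the model):

```python
def quantize_linear(x: float, scale: float, zero_point: int = 0) -> int:
    """QuantizeLinear: affine-quantize a real value to signed INT8."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the INT8 range

def dequantize_linear(q: int, scale: float, zero_point: int = 0) -> float:
    """DequantizeLinear: map the INT8 value back to real scale."""
    return (q - zero_point) * scale

scale = 0.05  # illustrative per-tensor scale
x = 1.337
q = quantize_linear(x, scale)
x_hat = dequantize_linear(q, scale)
print(q, x_hat)  # quantization error is bounded by scale / 2
```

TensorRT fuses these node pairs into native INT8 kernels rather than executing them literally, which is why the QDQ form stays portable across backends.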

πŸ› οΈ Deployment Code

Python (ONNX Runtime)

```python
import os
import onnxruntime as ort
import tensorrt_libs

# 1. Make the TensorRT libraries discoverable (needed for QDQ models).
# Note: on Linux the dynamic loader reads LD_LIBRARY_PATH at process start,
# so this mainly helps child processes and late dlopen() calls. If loading
# still fails, export the variable before launching Python instead.
os.environ["LD_LIBRARY_PATH"] = (
    os.environ.get("LD_LIBRARY_PATH", "")
    + ":"
    + os.path.dirname(tensorrt_libs.__file__)
)

# 2. Load the model with TensorRT first, CUDA as fallback
model_path = "final_w8a16/model.w8a16.onnx"
providers = [
    ("TensorrtExecutionProvider", {
        "device_id": 0,
        "trt_fp16_enable": True,
        "trt_int8_enable": True,          # Enable INT8 Tensor Cores
        "trt_engine_cache_enable": True,  # Cache compiled engines across runs
    }),
    "CUDAExecutionProvider",
]

session = ort.InferenceSession(model_path, providers=providers)
print("Model loaded successfully!")
```

### TensorRT CLI (trtexec)

```shell
# Compile the W8A16 model to a serialized TensorRT engine
trtexec --onnx=final_w8a16/model.w8a16.onnx \
        --fp16 --int8 \
        --saveEngine=model.engine \
        --verbose
```