ZoeDepth: Metric Monocular Depth (ONNX)
ONNX export of Intel/zoedepth-nyu-kitti, Intel ISL's metric-depth follow-up to DPT-Large. It pairs a DPT-style relative-depth backbone with calibrated metric-bin heads (one trained on NYU indoor depths, one on KITTI outdoor depths), combined via a domain-routing classifier. It outputs depth in real-world meters, not just relative ordering.
Re-exported from upstream PyTorch weights: Intel publishes only safetensors at the source repo. Provenance trail: Bhat et al. → Intel/zoedepth-nyu-kitti → transformers.ZoeDepthForDepthEstimation + thin wrapper → torch.onnx.export → these files. The fp16 sibling was produced from the fp32 trace via onnxconverter-common.
Toolchain: torch 2.4.x (CUDA 12.4), torchvision 0.19 (matched ABI), transformers 4.45.2, optimum[onnxruntime] 1.24.0, onnxconverter-common>=1.14, opset 17, do_constant_folding=True. Full conversion script: scripts/export-zoedepth.ps1 in the DatumIngest repo (run once for fp32, again with -Fp16 for the half-precision sibling).
Credit: Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller (Intel Intelligent Systems Lab). Paper: "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth", 2023.
What this repo contains
| File | Variant | Size | Use |
|---|---|---|---|
| model.onnx | fp32 | ~1.37 GB | Default: full precision, identical numerics to the PyTorch upstream. |
| model_fp16.onnx | fp16 | ~693 MB | Half precision: same architecture, about half the disk footprint. Essentially identical output on CPU runtimes that upcast fp16→fp32; modest speedup on GPU/NPU with native fp16. |
| config.json | - | <3 KB | HuggingFace model config (preserved for re-instantiation if needed). |
| preprocessor_config.json | - | <1 KB | ZoeDepthImageProcessor settings: image-size targets, normalization stats. |
The fp32 model.onnx is self-contained: at 1.37 GB it is under the 2 GB protobuf limit, so there is no external-data .onnx.data sidecar. The same holds for the fp16 variant at 693 MB.
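The size check behind that statement is plain arithmetic. The following is a minimal sketch, assuming the protobuf ceiling is 2^31 - 1 bytes per serialized message; the helper name is hypothetical and the byte counts are the rounded figures quoted in this README, not measured file sizes.

```python
# Assumption: protobuf cannot serialize a single message larger than
# 2**31 - 1 bytes, so a bigger ONNX model must store weights externally.
PROTOBUF_LIMIT_BYTES = 2**31 - 1


def needs_external_data(model_bytes: int) -> bool:
    """Return True if a model of this size cannot live in a single .onnx file."""
    return model_bytes > PROTOBUF_LIMIT_BYTES


# Rounded sizes quoted in this README (decimal units: 1 GB = 10**9 bytes).
FP32_BYTES = 1_370_000_000
FP16_BYTES = 693_000_000

for name, size in [("model.onnx", FP32_BYTES), ("model_fp16.onnx", FP16_BYTES)]:
    gib = size / 2**30  # the same size expressed in binary units
    print(f"{name}: {gib:.2f} GiB, sidecar needed: {needs_external_data(size)}")
```

For a file on disk, `os.path.getsize(path)` would supply the byte count in place of the rounded constants.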
What "metric depth" means (vs the other depth models on Heliosoph)
| Repo | Output | When to use |
|---|---|---|
| Heliosoph/zoedepth-nyu-kitti-onnx (this repo) | Metric depth (meters) | 3D reconstruction at real-world scale, distance measurement, AR overlays, multi-image cloud fusion |
| Heliosoph/dpt-large-onnx | Relative depth | Visualization, single-image effects, "what's closer than what" without needing real units |
| Heliosoph/midas-small-onnx | Relative depth | Edge / CPU / mobile β fast relative depth |
| onnx-community/depth-anything-v2-small | Relative depth (SOTA) | Modern default for relative depth |
Pick this repo specifically when you need meters. Most monocular depth models give you a number per pixel that's only meaningful relative to other pixels in the same image; this one gives you a number that's calibrated against real-world distance.
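One thing meters make possible is back-projecting a pixel into a 3D point at real-world scale. The sketch below uses the standard pinhole camera model and rests on two assumptions that this README does not confirm: that the depth value is measured along the optical axis (z-depth), and that you know your camera's intrinsics. The function name and the intrinsics values are hypothetical.

```python
from typing import Tuple


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Map pixel (u, v) with depth in meters to a camera-frame point (X, Y, Z).

    Assumes a pinhole camera and that depth_m is distance along the optical axis.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 image; replace with your own calibration.
FX, FY, CX, CY = 525.0, 525.0, 320.0, 240.0

# A pixel 105 px to the right of the principal point, with a depth of 2.5 m.
point = backproject(425.0, 240.0, 2.5, FX, FY, CX, CY)
print(point)
```

With relative-depth models the same formula only yields a shape up to an unknown scale, which is why the other repos in the table are not suited to measurement.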
Input / output
| Spec | Value |
|---|---|
| Input name | pixel_values |
| Input shape | [batch, 3, H, W] (NCHW) |
| Input dtype | float32 (fp32 variant) or float16 (fp16 variant) |
| Constraint | H and W must each be divisible by 32 |
| Preprocessing | Use ZoeDepthImageProcessor with the included preprocessor_config.json: resize + normalize with the ZoeDepth-specific image stats |
| Output name | predicted_depth |
| Output shape | [batch, H, W] (single-channel; no extra channel dim) |
| Output unit | meters (real-world distance from camera to surface) |
| Dynamic axes | batch, height, width |
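If you build inputs by hand rather than through ZoeDepthImageProcessor, the divisible-by-32 constraint has to be met before the tensor reaches the model. The helper below is a hypothetical sketch of one way to pick target dimensions (round each side to the nearest multiple of 32); it does not reproduce the processor's own resizing or aspect-ratio logic.

```python
from typing import Tuple


def align_to_multiple(size: int, multiple: int = 32) -> int:
    """Round size to the nearest multiple (ties round up), never below one multiple."""
    nearest = (size + multiple // 2) // multiple * multiple
    return max(multiple, nearest)


def aligned_shape(height: int, width: int, multiple: int = 32) -> Tuple[int, int]:
    """Return (H, W) target dimensions that satisfy the divisibility constraint."""
    return (align_to_multiple(height, multiple), align_to_multiple(width, multiple))


for h, w in [(480, 640), (1080, 1920), (500, 333)]:
    print((h, w), "->", aligned_shape(h, w))
```

Resizing to the aligned shape changes the aspect ratio slightly when a side is not already a multiple of 32; padding instead of resizing is another option, at the cost of predicting depth for the padded border.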
How to use
import onnxruntime as ort
import numpy as np
from PIL import Image
from transformers import ZoeDepthImageProcessor
# Use the included preprocessor: it handles the 32-alignment + normalize.
proc = ZoeDepthImageProcessor.from_pretrained(".")
sess = ort.InferenceSession("model.onnx")  # or "model_fp16.onnx"
img = Image.open("photo.jpg").convert("RGB")
inputs = proc(images=img, return_tensors="np")  # NCHW float32, 32-aligned
depth_meters = sess.run(
    None,
    {"pixel_values": inputs["pixel_values"]},
)[0][0]  # [H, W], meters
# depth_meters[y, x] = distance from camera to that surface point, in meters
For the fp16 model, the input must also be float16: cast inputs["pixel_values"] to np.float16 before feeding it in.
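To keep one code path for both variants, the cast can be driven by the element type the session reports for its input. This is a minimal sketch: the mapping and helper name are hypothetical, while the type strings are the ones ONNX Runtime reports for float32 and float16 tensors.

```python
# Map the element type string reported by ONNX Runtime to a NumPy dtype name.
ORT_TO_NUMPY = {
    "tensor(float)": "float32",
    "tensor(float16)": "float16",
}


def numpy_dtype_for(ort_type: str) -> str:
    """Return the NumPy dtype name matching an ONNX Runtime input type string."""
    try:
        return ORT_TO_NUMPY[ort_type]
    except KeyError:
        raise ValueError(f"unsupported input type: {ort_type}") from None


# With a live session from the snippet above, this would be used as
# (shown as comments because it needs the model file to run):
#   in_type = sess.get_inputs()[0].type
#   pixels = inputs["pixel_values"].astype(numpy_dtype_for(in_type))
print(numpy_dtype_for("tensor(float16)"))
```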
Why two variants
- fp32 is the safe default: identical numerics to the upstream PyTorch reference, no surprises.
- fp16 halves disk footprint and model-load memory. On GPU / NPU with native fp16 you also get a modest speedup; on CPU runtimes that upcast fp16→fp32 internally, the speed matches fp32 but you still save the memory. Depth output is essentially identical (the fp16 rounding noise is below the model's own per-pixel error).
If you're not sure: pick fp32 for accuracy-sensitive scientific work, fp16 for shipping / deployment / edge.
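For a feel of how coarse fp16 is at metric-depth magnitudes, the sketch below rounds a few depth values to half precision using only the standard library. It assumes the fp16 variant emits float16 output (suggested by its float16 input, but not stated in this README) and it only bounds the rounding of the final depth value; it says nothing about error accumulated inside the network, so it illustrates the claim above rather than verifying it. The helper names are hypothetical.

```python
import math
import struct


def to_fp16_and_back(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fp16_spacing(value: float) -> float:
    """Gap between adjacent fp16 values near a positive, normal-range value."""
    _, exponent = math.frexp(value)  # value = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 11)    # fp16 stores 10 explicit mantissa bits


# Example depths in meters: close indoor range up to far outdoor range.
for depth in [0.5, 2.5, 7.3, 80.0]:
    rounded = to_fp16_and_back(depth)
    print(f"{depth} m -> {rounded} m, "
          f"error {abs(rounded - depth):.6f} m, "
          f"spacing {fp16_spacing(depth)} m")
```

The spacing doubles with every power of two in depth, so representation error matters more at long outdoor ranges than at indoor ranges.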
License
MIT, same as upstream Intel/zoedepth-nyu-kitti. A LICENSE file is included. The ONNX export step (and the fp16 numerical conversion) doesn't change licensing: same model, different serialization.