ZoeDepth: Metric Monocular Depth (ONNX)

ONNX export of Intel/zoedepth-nyu-kitti, Intel ISL's metric-depth follow-up to DPT-Large. It keeps the same DPT-Large backbone and grafts calibrated metric-bin heads onto it: one trained on NYU indoor depths, one on KITTI outdoor depths, combined via a domain-routing classifier. It outputs depth in real-world meters, not just relative ordering.

Re-exported from the upstream PyTorch weights, since Intel publishes only safetensors at the source repo. Provenance trail: Bhat et al. → Intel/zoedepth-nyu-kitti → transformers.ZoeDepthForDepthEstimation + thin wrapper → torch.onnx.export → these files. The fp16 sibling was produced from the fp32 trace via onnxconverter-common.

Toolchain: torch 2.4.x (CUDA 12.4), torchvision 0.19 (matched ABI), transformers 4.45.2, optimum[onnxruntime] 1.24.0, onnxconverter-common>=1.14, opset 17, do_constant_folding=True. Full conversion script: scripts/export-zoedepth.ps1 in the DatumIngest repo (run once for fp32, again with -Fp16 for the half-precision sibling).

Credit: Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller (Intel Intelligent Systems Lab). Paper: "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth", 2023.

What this repo contains

| File | Variant | Size | Use |
| model.onnx | fp32 | ~1.3 GB | Default: full precision, identical numerics to the PyTorch upstream. |
| model_fp16.onnx | fp16 | ~660 MB | Half precision: same architecture, about half the disk footprint. Essentially identical output; no speedup on CPU runtimes that upcast fp16→fp32, modest speedup on GPU/NPU with native fp16. |
| config.json | - | <3 KB | HuggingFace model config (preserved for re-instantiation if needed). |
| preprocessor_config.json | - | <1 KB | ZoeDepthImageProcessor settings: image-size targets, normalization stats. |

The fp32 model.onnx is self-contained: it came in at 1.37 GB, under the 2 GB protobuf limit, so there is no external-data .onnx.data sidecar. The same holds for the fp16 variant at 693 MB.

What "metric depth" means (vs the other depth models on Heliosoph)

| Repo | Output | When to use |
| Heliosoph/zoedepth-nyu-kitti-onnx (this repo) | Metric depth (meters) | 3D reconstruction at real-world scale, distance measurement, AR overlays, multi-image cloud fusion |
| Heliosoph/dpt-large-onnx | Relative depth | Visualization, single-image effects, "what's closer than what" without needing real units |
| Heliosoph/midas-small-onnx | Relative depth | Edge / CPU / mobile: fast relative depth |
| onnx-community/depth-anything-v2-small | Relative depth (SOTA) | Modern default for relative depth |

Pick this repo specifically when you need meters. Most monocular depth models give you a number per pixel that's only meaningful relative to other pixels in the same image; this one gives you a number that's calibrated against real-world distance.
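As an illustration of why metric output matters for 3D reconstruction, here is a minimal sketch that back-projects a depth map into a point cloud in meters using a pinhole camera model. The intrinsics (fx, fy, cx, cy) and the helper name backproject are assumptions for illustration and are not part of this repo; the sketch also assumes each depth value is measured along the optical axis (z-depth), which you should verify for your own use.

```python
import numpy as np

def backproject(depth_m, fx, fy, cx, cy):
    """Turn an [H, W] metric depth map into an [H*W, 3] point cloud in meters.

    fx, fy, cx, cy are pinhole intrinsics in pixels (hypothetical values here;
    use your camera's calibration). Assumes depth is z along the optical axis.
    """
    h, w = depth_m.shape
    # Pixel coordinate grids: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Synthetic stand-in for model output: a flat wall 2 m from a tiny 4x4 camera.
depth = np.full((4, 4), 2.0, dtype=np.float32)
points = backproject(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

With a relative-depth model the same code would run, but the resulting cloud would have an unknown scale, so distances between points could not be read off in meters.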

Input / output

| Field | Spec |
| Input name | pixel_values |
| Input shape | [batch, 3, H, W] (NCHW) |
| Input dtype | float32 (fp32 variant) or float16 (fp16 variant) |
| Constraint | H and W must each be divisible by 32 |
| Preprocessing | Use ZoeDepthImageProcessor from the included preprocessor_config.json: resize + normalize with the ZoeDepth-specific image stats |
| Output name | predicted_depth |
| Output shape | [batch, H, W] (single-channel; no extra channel dim) |
| Output unit | meters (real-world distance from camera to surface) |
| Dynamic axes | batch, height, width |
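If you build the input tensor yourself rather than going through ZoeDepthImageProcessor, the divisible-by-32 constraint has to be enforced by hand. The sketch below shows one way to do that, zero-padding on the bottom and right edges; the helper name pad_to_multiple is hypothetical, and zero padding is an assumption that may affect predictions near the padded border, so the included preprocessor remains the safer path.

```python
import numpy as np

def pad_to_multiple(pixel_values, multiple=32):
    """Zero-pad an NCHW array on the bottom/right so H and W divide `multiple`.

    Returns the padded array and the original (H, W) so the depth map can be
    cropped back afterwards.
    """
    _, _, h, w = pixel_values.shape
    pad_h = (-h) % multiple   # rows to add so H becomes a multiple
    pad_w = (-w) % multiple   # columns to add so W becomes a multiple
    padded = np.pad(pixel_values, ((0, 0), (0, 0), (0, pad_h), (0, pad_w)))
    return padded, (h, w)

# Stand-in for an already-normalized 370x500 image batch.
x = np.zeros((1, 3, 370, 500), dtype=np.float32)
padded, (h, w) = pad_to_multiple(x)
# After inference, crop the [batch, H, W] output back: depth[:, :h, :w]
```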

How to use

import onnxruntime as ort
import numpy as np
from PIL import Image
from transformers import ZoeDepthImageProcessor

# Use the included preprocessor: it handles the 32-alignment + normalize.
proc = ZoeDepthImageProcessor.from_pretrained(".")
sess = ort.InferenceSession("model.onnx")          # or "model_fp16.onnx"

img = Image.open("photo.jpg").convert("RGB")
inputs = proc(images=img, return_tensors="np")     # NCHW float32, 32-aligned

depth_meters = sess.run(
    None,
    {"pixel_values": inputs["pixel_values"]},
)[0][0]                                            # [H, W], meters

# depth_meters[y, x] = distance from camera to that surface point, in meters

For the fp16 model, the input also needs to be float16: cast inputs["pixel_values"] to np.float16 before feeding it in.
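The cast described above can be sketched as follows. The helper names are hypothetical, and the array here is a synthetic stand-in for inputs["pixel_values"]; the second helper upcasts the depth output to float32, which is optional but is assumed here to be convenient for downstream math.

```python
import numpy as np

def prepare_fp16_input(pixel_values):
    """Cast a float32 NCHW array to contiguous float16 for model_fp16.onnx."""
    return np.ascontiguousarray(pixel_values, dtype=np.float16)

def depth_to_float32(depth):
    """Upcast the depth output to float32 before further arithmetic."""
    return depth.astype(np.float32)

# Synthetic stand-in for the preprocessor output (normalized pixel values).
pixel_values = np.linspace(-2.0, 2.0, 3 * 32 * 32, dtype=np.float32)
pixel_values = pixel_values.reshape(1, 3, 32, 32)

fp16_input = prepare_fp16_input(pixel_values)
# Rounding error introduced by the cast itself, for reference.
cast_error = float(np.abs(fp16_input.astype(np.float32) - pixel_values).max())
```

The result would then be passed as {"pixel_values": fp16_input} to a session created from model_fp16.onnx.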

Why two variants

  • fp32 is the safe default: identical numerics to the upstream PyTorch reference, no surprises.
  • fp16 halves disk footprint and model-load memory. On GPU / NPU with native fp16 you also get a modest speedup; on CPU runtimes that upcast fp16→fp32 internally the speed is the same as fp32 but you still save the memory. Depth output is essentially identical (the fp16 rounding noise is below the model's own per-pixel error).

If you're not sure: pick fp32 for accuracy-sensitive scientific work, fp16 for shipping / deployment / edge.
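If you want to check the "essentially identical" claim on your own images, one approach is to run both variants and compare the two depth maps. The sketch below shows only the comparison step; the function name depth_agreement and the chosen metrics are assumptions, and the data is a synthetic stand-in that simulates fp16 rounding of the output alone, not a real fp16 inference run.

```python
import numpy as np

def depth_agreement(ref_m, test_m):
    """Compare two [H, W] depth maps in meters.

    Returns the worst-case absolute error (meters) and the mean relative error.
    """
    ref = ref_m.astype(np.float64)
    test = test_m.astype(np.float64)
    abs_err = np.abs(ref - test)
    rel_err = abs_err / np.maximum(ref, 1e-6)  # guard against zero depth
    return {"max_abs_m": float(abs_err.max()), "mean_rel": float(rel_err.mean())}

# Stand-in data: depths from 0.5 m to 10 m, and the same map rounded to fp16.
ref = np.linspace(0.5, 10.0, 64, dtype=np.float32).reshape(8, 8)
test = ref.astype(np.float16)
stats = depth_agreement(ref, test)
```

In practice, ref would come from model.onnx and test from model_fp16.onnx on the same preprocessed image.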

License

MIT, same as upstream Intel/zoedepth-nyu-kitti. LICENSE file included. The ONNX-export step (and the fp16 numerical conversion) doesn't change licensing: same model, different serialization.
