---
license: apache-2.0
library_name: onnx
tags:
  - depth-estimation
  - dpt
  - midas
  - onnx
base_model: Intel/dpt-large
pipeline_tag: depth-estimation
---

# DPT-Large: Monocular Depth Estimation (ONNX)

ONNX export of [Intel/dpt-large](https://huggingface.co/Intel/dpt-large), the Dense Prediction Transformer (DPT) for monocular depth estimation. ~330M params, originally published as part of the [MiDaS](https://github.com/isl-org/MiDaS) project at Intel Intelligent Systems Lab.

Re-hosted under Heliosoph for distribution stability; Intel's published checkpoint remains the authoritative source.

Credit: Intel ISL (DPT / MiDaS team, Ranftl et al.).

## What this repo contains

```
dpt_large_384.onnx     # ~1.3 GB
```

A single ONNX file. No tokenizer, no preprocessor config: preprocessing is fixed by convention.

## Input/output shape

| | Spec |
|---|---|
| Input name | `pixel_values` (or `image`; verify in Netron) |
| Input shape | `[1, 3, 384, 384]` |
| Input dtype | float32 |
| Preprocessing | RGB, divide by 255, normalize by `mean=[0.5, 0.5, 0.5]` / `std=[0.5, 0.5, 0.5]` |
| Output shape | `[1, 384, 384]` |
| Output meaning | Relative depth, **not** metric. Lower values = farther; higher values = closer. Linearly map to your visualization range. |

## How to use

```python
import onnxruntime as ort
import numpy as np
from PIL import Image

sess = ort.InferenceSession("dpt_large_384.onnx")

# Resize input image to 384×384, normalize, NCHW
img = Image.open("photo.jpg").convert("RGB").resize((384, 384))
arr = (np.asarray(img, dtype=np.float32) / 255.0 - 0.5) / 0.5  # HWC, [-1,1]
arr = arr.transpose(2, 0, 1)[None, ...]                         # 1x3x384x384

depth = sess.run(None, {sess.get_inputs()[0].name: arr})[0][0]  # 384x384
```
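
To view the result, the relative depth values need to be mapped to an image range. The sketch below is one minimal way to do the linear mapping mentioned in the table above; `depth_to_uint8` is a hypothetical helper name, not part of any library, and the min/max stretch is an assumption (a percentile clip may look better on noisy maps).

```python
import numpy as np

def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Linearly rescale a relative depth map to 0..255 (brighter = closer)."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-8:
        # Constant map: nothing to stretch, return all zeros instead of dividing by ~0.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)          # now in [0, 1]
    return np.round(scaled * 255.0).astype(np.uint8)
```

With the `depth` array from the example above, `Image.fromarray(depth_to_uint8(depth)).save("depth.png")` writes a grayscale preview.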

For metric depth, pair with a calibration scheme: DPT-Large is trained for relative depth and will not give you "this object is 1.7m away" without further work.
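
One possible calibration scheme is a least-squares scale-and-shift fit against a handful of pixels whose metric depth is known. The sketch below assumes the model output is approximately affine in inverse depth (consistent with "higher values = closer", but not verified for this export) and that reference depths are available from some external source. `fit_scale_shift` and `to_metric` are hypothetical helper names.

```python
import numpy as np

def fit_scale_shift(pred, metric_depth, mask):
    """Fit s, t so that s * pred + t ~= 1 / metric_depth on the masked pixels."""
    x = np.asarray(pred, dtype=np.float64)[mask]
    y = 1.0 / np.asarray(metric_depth, dtype=np.float64)[mask]  # inverse depth targets
    A = np.stack([x, np.ones_like(x)], axis=1)                  # columns: pred, 1
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)

def to_metric(pred, s, t, eps=1e-6):
    """Apply the fitted scale/shift and invert to get depth in the reference units."""
    inv = s * np.asarray(pred, dtype=np.float64) + t
    return 1.0 / np.maximum(inv, eps)  # clamp to avoid division by zero or negatives
```

Here `mask` is a boolean array marking the pixels with a trusted reference depth. The fit is per image: scale and shift generally differ from one image to the next, so the values should not be reused across scenes.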

## When to pick DPT-Large

- **Quality matters more than speed**: ~330M params, slowest variant in the MiDaS family.
- **Single static image, not video**: no temporal smoothing built in.
- **GPU available**: CPU inference is workable but slow (~1–2 sec on a consumer CPU).

For real-time or edge use, prefer `dpt-hybrid` or `midas-small`, which are not in this repo but are available as separate uploads upstream.
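
If a GPU may or may not be present at run time, the execution provider can be chosen when the session is created. The sketch below is a small, hypothetical helper (`pick_providers` is not an ONNX Runtime function) that orders providers GPU-first with a CPU fallback; it assumes a CUDA-enabled ONNX Runtime build for the GPU path.

```python
def pick_providers(available):
    """Return ONNX Runtime execution providers in preference order, GPU first."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    # Fall back to CPU if the list is empty or contains only unknown providers.
    return chosen or ["CPUExecutionProvider"]
```

Typical use is `ort.InferenceSession("dpt_large_384.onnx", providers=pick_providers(ort.get_available_providers()))`.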

## License

**Apache-2.0**, same as [Intel's published checkpoint on HuggingFace](https://huggingface.co/Intel/dpt-large). `LICENSE` file included.

Note: the original [isl-org/MiDaS](https://github.com/isl-org/MiDaS) GitHub repo (where the DPT architecture was first released) is **MIT**. Intel re-released the trained DPT-Large weights on HuggingFace under **Apache-2.0**, which is what this repo mirrors. Same model family, different distribution channel, different licenses. The `midas-small` Heliosoph repo (sourced from the GitHub release) inherits MIT.