---
license: other
library_name: pytorch
tags:
- multi-task-learning
- dense-prediction
- monocular-depth-estimation
- semantic-segmentation
- surface-normal-estimation
- edge-detection
- geometric-perception
- robotics
- scene-graph
- dinov3
---

# M2H-MX Multi-Task Weights

This repository hosts model-only weights for **M2H-MX: Multi-Task Semantic and Geometric Perception for Real-Time Monocular 3D Scene Graph Construction**.

M2H-MX is a **multi-task dense visual perception model**, not a semantic-segmentation-only model. Given a monocular RGB image, the network can predict:

- metric depth or disparity, depending on the dataset configuration;
- semantic segmentation logits;
- surface normals;
- edge maps.

Depth and semantics are the primary deployment outputs used by Mono-Hydra++ or a compatible mapping backend for metric-semantic mapping and downstream 3D scene graph construction. Surface normals and edges are auxiliary training heads used to improve geometric and semantic consistency. The network improves the dense evidence used by the mapping backend; it does not directly predict the 3D scene graph.

Code and instructions: https://github.com/BavanthaU/m2h_mx

## Artifacts

| Dataset | Variant | File | Paper result |
| --- | --- | --- | --- |
| NYUDv2 | M2H-MX-L | `weights/nyudv2/m2h_mx_l_nyudv2_weights.pt` | mIoU 65.60, depth RMSE 0.3800 |
| NYUDv2 | M2H-MX-B | `weights/nyudv2/m2h_mx_b_nyudv2_weights.pt` | mIoU 61.80, depth RMSE 0.4170 |
| ScanNet | M2H-MX-L | `weights/scannet/m2h_mx_l_scannet_weights.pt` | ScanNet25k mIoU 76.10, depth RMSE 0.2210; Mono-Hydra++ ATE 6.91 cm |
| ScanNet | M2H-MX-B | `weights/scannet/m2h_mx_b_scannet_weights.pt` | Base variant artifact; no paper result listed here |
| Cityscapes | M2H-MX-L | `weights/cityscapes/m2h_mx_l_cityscapes_weights.pt` | mIoU 82.28, disparity RMSE 3.89 |

These are model-only state dictionaries. They do not include optimizer, scheduler, gradient scaler, or EMA state.
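
A minimal sketch of loading one of these files follows. It assumes the file holds a flat mapping from parameter names to tensors. The top-level key names used to detect training state are illustrative assumptions, not taken from the M2H-MX code, and `is_model_only` and `load_weights` are hypothetical helpers. Building the `model` itself is left to the code repository.

```python
from typing import Any, Mapping

# Hypothetical top-level keys that a full training checkpoint might carry.
# These names are illustrative assumptions, not taken from the M2H-MX code.
TRAINING_ONLY_KEYS = {"optimizer", "scheduler", "scaler", "ema"}


def is_model_only(state: Mapping[str, Any]) -> bool:
    """Return True if no top-level key looks like training state."""
    return not (TRAINING_ONLY_KEYS & set(state.keys()))


def load_weights(model: Any, path: str) -> None:
    """Load a model-only state dict into an already constructed model."""
    import torch  # imported lazily so the check above works without PyTorch

    # weights_only=True restricts unpickling to tensors and plain containers.
    state = torch.load(path, map_location="cpu", weights_only=True)
    if not is_model_only(state):
        raise ValueError("file looks like a full training checkpoint")
    # Default strict loading raises if parameter names do not match the model.
    model.load_state_dict(state)
```

Because no optimizer or scheduler state is shipped, these files suit inference or fine-tuning from scratch-initialized training state, not resuming an interrupted run.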

## Download

From the code repository:

```bash
python3 scripts/download_weights.py --repo-id Bavantha11/m2h-mx --verify
```
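
As an alternative to the script, a single file could be fetched with the `huggingface_hub` package. The sketch below assumes the repository is a standard model repository and that file paths follow the pattern shown in the Artifacts table. `artifact_path` and `fetch` are hypothetical helpers, not part of the M2H-MX code.

```python
REPO_ID = "Bavantha11/m2h-mx"  # repo id from the download command above

# (dataset, variant) pairs listed in the Artifacts table.
AVAILABLE = {
    ("nyudv2", "l"),
    ("nyudv2", "b"),
    ("scannet", "l"),
    ("scannet", "b"),
    ("cityscapes", "l"),
}


def artifact_path(dataset: str, variant: str) -> str:
    """Build the in-repo path for a dataset/variant pair from the table."""
    dataset, variant = dataset.lower(), variant.lower()
    if (dataset, variant) not in AVAILABLE:
        raise KeyError(f"no artifact listed for {dataset!r} / {variant!r}")
    return f"weights/{dataset}/m2h_mx_{variant}_{dataset}_weights.pt"


def fetch(dataset: str, variant: str) -> str:
    """Download one weight file and return its local cache path."""
    from huggingface_hub import hf_hub_download  # needs network access

    return hf_hub_download(repo_id=REPO_ID, filename=artifact_path(dataset, variant))
```

For example, `fetch("scannet", "l")` would request `weights/scannet/m2h_mx_l_scannet_weights.pt`. Unlike the script's `--verify` option, this sketch performs no integrity check of its own.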

## Citation

```bibtex
@misc{udugama2026m2hmxmultitaskdensevisual,
  title={M2H-MX: Multi-Task Semantic and Geometric Perception for Real-Time Monocular 3D Scene Graph Construction},
  author={U. V. B. L. Udugama and George Vosselman and Francesco Nex},
  year={2026},
  eprint={2603.29236},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.29236},
}

@misc{udugama2026monohydrarealtimemonocularscene,
  title={Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping},
  author={U. V. B. L. Udugama and George Vosselman and Francesco Nex},
  year={2026},
  eprint={2605.17661},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.17661},
}
```