---
base_model:
- Qwen/Qwen3-VL-4B-Instruct
license: apache-2.0
pipeline_tag: depth-estimation
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
- qwen3-vl
---
Update 2026-05-18 (v1.0): Initial release
# DepthVLM-4B
DepthVLM is a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.
By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability.
## Highlights
- **Native dense metric depth estimation in VLMs**: Directly predicts geometry within the VLM framework.
- **Unified multimodal understanding and geometry prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
- **Efficient inference**: More efficient than approaches that rely on per-pixel queries or coarse token-level outputs.
- **Versatile application**: Supports both indoor and outdoor metric depth estimation.
- **Improved 3D spatial reasoning**: Moving toward a truly unified foundation model.
## Resources
- **Paper:** [Unlocking Dense Metric Depth Estimation in VLMs](https://arxiv.org/abs/2605.15876)
- **Project Page:** [https://depthvlm.github.io/](https://depthvlm.github.io/)
- **Repository:** [https://github.com/hanxunyu/DepthVLM](https://github.com/hanxunyu/DepthVLM)
## Usage
Please refer to the official repository for detailed instructions on:
- Data preprocessing
- Training
- Evaluation
- Inference and visualization
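
For a quick sanity check of predicted metric depth against ground truth, the sketch below computes three metrics that are widely used in depth estimation: absolute relative error (AbsRel), RMSE, and the δ1 threshold accuracy. It is an illustrative, dependency-free example and not the official evaluation code. The function name `depth_metrics`, the flat-list input format, and the `min_depth`/`max_depth` validity range are assumptions made here; the official repository may use different protocols, depth caps, or metrics.

```python
import math


def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compute AbsRel, RMSE, and delta_1 over pixels with valid ground truth.

    pred, gt: flat sequences of metric depths (same length, same unit).
    min_depth, max_depth: assumed validity range for ground truth
    (hypothetical defaults, not taken from the DepthVLM repository).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    # Keep only pixels whose ground truth lies in the valid range;
    # clamp predictions to min_depth so the ratio below is well defined.
    pairs = [
        (max(p, min_depth), g)
        for p, g in zip(pred, gt)
        if min_depth < g < max_depth
    ]
    if not pairs:
        raise ValueError("no pixels with valid ground truth")

    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # delta_1: fraction of pixels where max(pred/gt, gt/pred) < 1.25
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


if __name__ == "__main__":
    # Toy values; the last pixel has no ground truth (0.0) and is skipped.
    predicted = [1.0, 2.2, 4.0, 5.0]
    ground_truth = [1.0, 2.0, 2.0, 0.0]
    print(depth_metrics(predicted, ground_truth))
```

A 2D depth map can be flattened row by row before being passed in. Lower AbsRel and RMSE are better, and δ1 closer to 1 is better.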
## Citation
If you find this work useful, please cite:
```bibtex
@article{yu2026unlocking,
title={Unlocking Dense Metric Depth Estimation in VLMs},
author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
journal={arXiv preprint arXiv:2605.15876},
year={2026}
}
```