---
base_model:
- Qwen/Qwen3-VL-4B-Instruct
license: apache-2.0
pipeline_tag: depth-estimation
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
- qwen3-vl
---

Update 2026-05-18 (v1.0): Initial release

# DepthVLM-4B

DepthVLM serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL.

By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM turns a single VLM into a native dense geometry predictor while preserving its multimodal capability.

## Highlights

- **Native dense metric depth estimation in VLMs**: Directly predicts geometry within the VLM framework.
- **Unified multimodal understanding and geometry prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
- **Efficient inference**: More efficient than approaches that rely on per-pixel queries or coarse token-level outputs.
- **Versatile application**: Supports both indoor and outdoor metric depth estimation.
- **Improved 3D spatial reasoning**: A step toward a truly unified foundation model.

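
Because the predicted depth is metric, a depth map can in principle be lifted to a 3D point cloud once camera intrinsics are known. The sketch below is not taken from the DepthVLM repository. It assumes a pinhole camera, a depth map supplied as nested lists in meters, and a hypothetical helper `backproject` with illustrative intrinsics `fx`, `fy`, `cx`, `cy`.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift a metric depth map (meters) to camera-frame 3D points.

    Pinhole model (an assumption, not the repository's code):
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 2x2 depth map; the values and intrinsics are illustrative only.
depth_m = [[2.0, 2.0], [0.0, 4.0]]
cloud = backproject(depth_m, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In practice the depth map would come from the model's output and the intrinsics from the dataset or camera calibration. For real image sizes, an array library would replace the nested loops.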
## Resources

- **Paper:** [Unlocking Dense Metric Depth Estimation in VLMs](https://arxiv.org/abs/2605.15876)
- **Project Page:** [https://depthvlm.github.io/](https://depthvlm.github.io/)
- **Repository:** [https://github.com/hanxunyu/DepthVLM](https://github.com/hanxunyu/DepthVLM)

## Usage

Please refer to the official repository for detailed instructions on:

- Data preprocessing
- Training
- Evaluation
- Inference and visualization

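
For orientation on the evaluation step, metric depth predictions are commonly scored with absolute relative error (AbsRel) and the δ1 threshold accuracy. The following is a minimal sketch of those two standard metrics, not the repository's evaluation script. The function name `depth_metrics` and the flat-list inputs are assumptions made for illustration, and the official protocol may use different masking or additional metrics.

```python
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Compute AbsRel and delta_1 over valid pixels.

    A pixel is treated as valid when both ground truth and prediction
    are positive (an assumed masking rule for this sketch).
        AbsRel  = mean(|pred - gt| / gt)
        delta_1 = fraction of pixels with max(pred/gt, gt/pred) < 1.25
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / len(pairs)
    return {"abs_rel": abs_rel, "delta1": delta1}


# Toy values in meters; the last ground-truth pixel is invalid (0.0).
pred_m = [1.0, 2.2, 3.0, 5.0]
gt_m = [1.0, 2.0, 4.0, 0.0]
scores = depth_metrics(pred_m, gt_m)
```

Lower AbsRel and higher δ1 indicate better agreement with ground truth. Flattened depth maps from any source can be passed in the same way.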
## Citation

If you find this work useful, please cite:

```bibtex
@article{yu2026unlocking,
  title={Unlocking Dense Metric Depth Estimation in VLMs},
  author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
  journal={arXiv preprint arXiv:2605.15876},
  year={2026}
}
```