JonnyYu828/DepthVLM-4B

	---
	base_model:
	- Qwen/Qwen3-VL-4B-Instruct
	license: apache-2.0
	pipeline_tag: depth-estimation
	library_name: transformers
	tags:
	- vision-language-model
	- depth-estimation
	- 3d-vision
	- multimodal
	- qwen3-vl
	---

	Update 2026-05-18 (v1.0): Initial release

	# DepthVLM-4B

	DepthVLM is a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding. It also runs substantially faster at inference than existing VLM-based approaches such as DepthLM and Youtu-VL.

	By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability.
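To make the idea of a depth head on top of token features concrete, here is a toy NumPy sketch. It is not the DepthVLM head: the function name `toy_depth_head`, the single linear projection, the softplus activation, and the nearest-neighbour upsampling are all assumptions chosen for illustration. The actual architecture is described in the paper and the official repository.

```python
import numpy as np


def toy_depth_head(hidden, grid_hw, weight, bias, out_hw):
    """Map per-patch hidden states to a dense, positive depth map.

    Hypothetical illustration only, not the DepthVLM implementation.

    hidden:  (num_patches, hidden_dim) features for the image tokens
    grid_hw: (rows, cols) layout of the patches, rows * cols == num_patches
    weight:  (hidden_dim,) projection vector; bias: scalar
    out_hw:  (height, width) of the output depth map
    """
    rows, cols = grid_hw
    assert hidden.shape[0] == rows * cols, "patch count must match the grid"

    # One scalar per patch; softplus, log(1 + exp(x)), keeps depth positive.
    logits = hidden @ weight + bias
    coarse = np.logaddexp(0.0, logits).reshape(rows, cols)

    # Nearest-neighbour upsampling from the patch grid to the output size.
    out_h, out_w = out_hw
    row_idx = np.arange(out_h) * rows // out_h
    col_idx = np.arange(out_w) * cols // out_w
    return coarse[np.ix_(row_idx, col_idx)]


# Usage with random stand-in features: 6 patches in a 2x3 grid, 8-dim states.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8))
depth = toy_depth_head(hidden, (2, 3), rng.normal(size=8), 0.0, (4, 6))
print(depth.shape)  # (4, 6)
```

The point of the sketch is only that a small module can read the same hidden states the language model uses and emit one depth value per pixel, so the depth map and the text output can come from one forward pass.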

	## Highlights

	- Native dense metric depth estimation in VLMs: Directly predicts geometry within the VLM framework.
	- Unified multimodal understanding and geometry prediction: Generates full-resolution depth maps alongside language outputs in a single forward pass.
	- Efficient inference: Runs more efficiently than approaches that query depth per pixel or emit coarse token-level outputs.
	- Versatile application: Supports metric depth estimation in both indoor and outdoor scenes.
	- Improved 3D spatial reasoning: A step toward a truly unified foundation model.

	## Resources

	- Paper: [Unlocking Dense Metric Depth Estimation in VLMs](https://arxiv.org/abs/2605.15876)
	- Project Page: [https://depthvlm.github.io/](https://depthvlm.github.io/)
	- Repository: [https://github.com/hanxunyu/DepthVLM](https://github.com/hanxunyu/DepthVLM)

	## Usage

	Please refer to the official repository for detailed instructions on:

	- Data preprocessing
	- Training
	- Evaluation
	- Inference and visualization
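Before following the repository instructions, the checkpoint files can be fetched locally. The sketch below uses `snapshot_download` from `huggingface_hub` and assumes the repository id `JonnyYu828/DepthVLM-4B`, taken from this model card's location on the Hub. The helper name `download_checkpoint` is hypothetical. The sketch stops at the download step, because this card does not state which class loads the depth head; use the official repository for loading and inference.

```python
REPO_ID = "JonnyYu828/DepthVLM-4B"  # assumed from this model card's Hub location


def download_checkpoint(local_dir=None):
    """Download every file in the model repo and return the local directory.

    Hypothetical helper: with local_dir=None the files go to the default
    Hugging Face cache.
    """
    # Imported lazily so that defining this helper needs no extra packages.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_checkpoint()` requires network access and the `huggingface_hub` package; the returned path can then be passed to the inference scripts in the official repository.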

	## Citation

	If you find this work useful, please cite:

	```bibtex
	@article{yu2026unlocking,
	title={Unlocking Dense Metric Depth Estimation in VLMs},
	author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
	journal={arXiv preprint arXiv:2605.15876},
	year={2026}
	}
	```