---
base_model:
- Qwen/Qwen3-VL-4B-Instruct
license: apache-2.0
pipeline_tag: depth-estimation
library_name: transformers
tags:
- vision-language-model
- depth-estimation
- 3d-vision
- multimodal
- qwen3-vl
---

Update 2026-05-18 (v1.0): Initial release

# DepthVLM-4B

DepthVLM is a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, with substantially faster inference than existing VLM-based approaches such as DepthLM and Youtu-VL. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM turns a single VLM into a native dense geometry predictor while preserving its multimodal capability.

## Highlights

- **Native dense metric depth estimation in VLMs**: Directly predicts geometry within the VLM framework.
- **Unified multimodal understanding and geometry prediction**: Generates full-resolution depth maps alongside language outputs in a single forward pass.
- **Efficient inference**: More efficient than approaches that rely on per-pixel queries or coarse token-level outputs.
- **Versatile application**: Supports both indoor and outdoor metric depth estimation.
- **Improved 3D spatial reasoning**: A step toward a truly unified foundation model.
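To put the efficiency highlight in rough numbers, the sketch below counts what each output strategy needs in order to cover one image. It is an illustration under stated assumptions, not a measurement of DepthVLM: the function `dense_depth_cost`, the 480x640 image size, and the `patch_size` of 16 are hypothetical, and the per-pixel case assumes exactly one query per pixel.

```python
import math


def dense_depth_cost(height: int, width: int, patch_size: int = 16) -> dict:
    """Count the work needed to cover a height x width image with depth values.

    patch_size is a hypothetical visual-token patch size chosen for
    illustration; it is not taken from the DepthVLM configuration.
    """
    pixels = height * width
    # One coarse depth value per visual token (assumed square patches).
    tokens = math.ceil(height / patch_size) * math.ceil(width / patch_size)
    return {
        # Assumption: a per-pixel query method issues one query per pixel.
        "per_pixel_queries": pixels,
        # A token-level output yields one value per patch, not per pixel.
        "coarse_token_depth_values": tokens,
        # A dense head emits the full-resolution map in one forward pass.
        "dense_head_passes": 1,
        "dense_head_depth_values": pixels,
    }


cost = dense_depth_cost(480, 640)
for name, value in cost.items():
    print(f"{name}: {value}")
```

These are counts, not latencies. Actual speed depends on batching, hardware, and the model implementation, so refer to the paper for measured comparisons.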
## Resources

- **Paper:** [Unlocking Dense Metric Depth Estimation in VLMs](https://arxiv.org/abs/2605.15876)
- **Project Page:** [https://depthvlm.github.io/](https://depthvlm.github.io/)
- **Repository:** [https://github.com/hanxunyu/DepthVLM](https://github.com/hanxunyu/DepthVLM)

## Usage

Please refer to the official repository for detailed instructions on:

- Data preprocessing
- Training
- Evaluation
- Inference and visualization

## Citation

If you find this work useful, please cite:

```bibtex
@article{yu2026unlocking,
  title={Unlocking Dense Metric Depth Estimation in VLMs},
  author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},
  journal={arXiv preprint arXiv:2605.15876},
  year={2026}
}
```