Multimodal Perception, Understanding and Reasoning
VCLab's research on MLLM-based visual perception, understanding and reasoning, aimed at enhancing the multi-tasking capabilities of MLLMs.
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
Paper • 2507.13353 • Published • 2 • Note: [CVPR 2026 Highlight] Instructed temporal grounding for video MLLMs. | Code: https://github.com/NVlabs/VideoITG
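As a rough illustration of the idea (not the paper's actual pipeline), instructed temporal grounding can be reduced to scoring frames against an instruction embedding in a shared vision-language space and keeping the top-k; `select_frames` and the random features below are hypothetical stand-ins.

```python
# Hypothetical sketch: instruction-conditioned frame selection for a video MLLM.
import torch
import torch.nn.functional as F

def select_frames(frame_emb: torch.Tensor, instr_emb: torch.Tensor, k: int = 8):
    """frame_emb: (T, D) per-frame features; instr_emb: (D,) instruction feature.
    Both are assumed to live in a shared vision-language embedding space."""
    sim = F.cosine_similarity(frame_emb, instr_emb.unsqueeze(0), dim=-1)  # (T,)
    idx = sim.topk(k).indices.sort().values  # top-k frames, kept in temporal order
    return idx, sim

# Toy usage: random unit vectors stand in for real encoder outputs.
frames = F.normalize(torch.randn(64, 512), dim=-1)
instr = F.normalize(torch.randn(512), dim=-1)
idx, _ = select_frames(frames, instr, k=8)
print(idx.tolist())  # indices of the frames handed to the MLLM
```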
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
Paper • 2509.03951 • Published • 1 • Note: [CVPR 2026 Oral] MLLM-shaped negative textual space for OOD detection. | Code: https://github.com/ZhuWenjie98/ANTS
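A minimal sketch of the general recipe, assuming a CLIP-style shared embedding space: OOD-ness is read off as the probability mass a test image assigns to negative prompts versus in-distribution label prompts. The function name and random features are illustrative; the paper's actual test-time shaping of the negative space is more involved.

```python
# Hypothetical sketch: OOD score as probability mass on a negative textual space.
import torch
import torch.nn.functional as F

def ood_score(img, pos_txt, neg_txt, tau: float = 0.01):
    """img: (D,) image feature; pos_txt: (C, D) ID-class prompts;
    neg_txt: (M, D) negative prompts shaped at test time."""
    logits = torch.cat([pos_txt, neg_txt]) @ img / tau  # (C + M,)
    probs = logits.softmax(dim=-1)
    return probs[len(pos_txt):].sum()  # mass on negatives; higher = more OOD

D, C, M = 512, 10, 32
img = F.normalize(torch.randn(D), dim=-1)
pos = F.normalize(torch.randn(C, D), dim=-1)  # e.g. "a photo of a {class}"
neg = F.normalize(torch.randn(M, D), dim=-1)  # e.g. MLLM-generated negatives
print(float(ood_score(img, pos, neg)))
```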
AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models
Paper • 2410.20149 • Published • 1 • Note: [NeurIPS 2024] Adaptive negative proxies for OOD detection with VLMs. | Code: https://github.com/YBZh/OpenOOD-VLM
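The sketch below gestures at the core idea, under the assumption that negative proxies are image features accumulated online: a test feature that looks OOD is folded into the proxy of its nearest ID class via a momentum update. All names, thresholds, and the update rule itself are assumptions, not the paper's method.

```python
# Hypothetical sketch: momentum-updated, image-based negative proxies.
import torch
import torch.nn.functional as F

def adaneg_step(img, id_txt, proxies, thresh=0.2, momentum=0.9, tau=0.01):
    """img: (D,) test feature; id_txt: (C, D) ID prompts; proxies: (C, D)."""
    id_prob = (id_txt @ img / tau).softmax(dim=-1)          # ID affinity per class
    c = int(id_prob.argmax())                               # nearest ID class
    score = float(proxies[c] @ img) - float(id_prob.max())  # higher = more OOD
    if score > thresh:  # feature looks OOD: fold it into that class's proxy
        proxies[c] = F.normalize(momentum * proxies[c] + (1 - momentum) * img, dim=-1)
    return score

D, C = 512, 10
img = F.normalize(torch.randn(D), dim=-1)
id_txt = F.normalize(torch.randn(C, D), dim=-1)
proxies = F.normalize(torch.randn(C, D), dim=-1)
print(adaneg_step(img, id_txt, proxies))
```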
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
Paper • 2407.08966 • Published • Note: [ECCV 2024] Label-driven automated prompt tuning for OOD detection. | Code: https://github.com/YBZh/LAPT
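For context, label-driven prompt tuning builds on the CoOp-style recipe of optimizing learnable context vectors against frozen features; the toy below shows that base recipe only (LAPT's automation of prompt generation from labels is not reproduced), with all shapes and hyperparameters illustrative.

```python
# Hypothetical sketch: CoOp-style learnable prompt contexts (LAPT automates this).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int, dim: int, n_cls: int):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # shared context
        self.cls = nn.Parameter(torch.randn(n_cls, dim) * 0.02)  # per-class token

    def forward(self):  # (n_cls, n_ctx + 1, dim): context then class token
        ctx = self.ctx.unsqueeze(0).expand(self.cls.size(0), -1, -1)
        return torch.cat([ctx, self.cls.unsqueeze(1)], dim=1)

prompts = LearnablePrompt(n_ctx=4, dim=512, n_cls=10)
opt = torch.optim.Adam(prompts.parameters(), lr=1e-3)
img = F.normalize(torch.randn(8, 512), dim=-1)        # fake image-feature batch
labels = torch.randint(0, 10, (8,))
txt = F.normalize(prompts().mean(dim=1), dim=-1)      # crude "text encoder": mean-pool
loss = F.cross_entropy(img @ txt.t() / 0.01, labels)  # tune prompts on labeled ID data
loss.backward(); opt.step()
print(float(loss))
```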
Osprey: Pixel Understanding with Visual Instruction Tuning
Paper • 2312.10032 • Published • 3 • Note: [CVPR 2024] Pixel-level understanding via visual instruction tuning. | Code: https://github.com/CircleRadon/Osprey | HF: https://huggingface.co/sunshine-lwt/Osprey-7b
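One building block such pixel-level MLLMs commonly rely on is mask pooling: averaging vision-encoder features inside a segmentation mask to get a single region token for the LLM. The sketch below shows that generic operation, not Osprey's exact mask-aware extractor.

```python
# Hypothetical sketch: mask pooling a feature map into one region token.
import torch

def mask_pool(feat_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """feat_map: (D, H, W) vision-encoder features; mask: (H, W) binary mask."""
    m = mask.flatten().float()                        # (H*W,)
    f = feat_map.flatten(1)                           # (D, H*W)
    return (f * m).sum(dim=1) / m.sum().clamp(min=1)  # (D,) region token

feat = torch.randn(256, 24, 24)  # stand-in for real encoder output
mask = torch.zeros(24, 24)
mask[4:12, 6:16] = 1             # some region of interest
token = mask_pool(feat, mask)
print(token.shape)               # torch.Size([256]); spliced into the LLM input
```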
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
Paper • 2403.17589 • Published • 1 • Note: [CVPR 2024] Dual memory networks for VLM adaptation. | Code: https://github.com/YBZh/DMN
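Schematically (and only schematically), a dual-memory read can be a softmax-attention lookup over two key-value stores: a static memory of training features and a dynamic memory filled at test time, with their class votes summed. The sketch assumes unit-normalized features; the names and fusion rule are illustrative, not the paper's.

```python
# Hypothetical sketch: summing class votes from a static and a dynamic memory.
import torch
import torch.nn.functional as F

def dual_memory_logits(q, static_mem, static_lab, dyn_mem, dyn_lab, n_cls, tau=0.07):
    """q: (D,) query feature; *_mem: (N, D) cached keys; *_lab: (N,) int labels."""
    def read(mem, lab):
        if mem.numel() == 0:                   # dynamic memory may start empty
            return torch.zeros(n_cls)
        attn = (mem @ q / tau).softmax(dim=0)  # (N,) attention over memory slots
        return attn @ F.one_hot(lab, n_cls).float()  # (n_cls,) soft class vote
    return read(static_mem, static_lab) + read(dyn_mem, dyn_lab)

D, n_cls = 512, 10
q = F.normalize(torch.randn(D), dim=-1)
smem = F.normalize(torch.randn(100, D), dim=-1)  # training-set features
slab = torch.randint(0, n_cls, (100,))
dmem = F.normalize(torch.randn(16, D), dim=-1)   # features cached at test time
dlab = torch.randint(0, n_cls, (16,))
print(int(dual_memory_logits(q, smem, slab, dmem, dlab, n_cls).argmax()))
```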
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper • 2407.02392 • Published • 23 • Note: [IJCV 2025] Efficient visual projector for MLLMs. | Code: https://github.com/CircleRadon/TokenPacker
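A common shape for an efficient projector, sketched below under assumptions of our own: downsampled (coarse) tokens act as queries that cross-attend to the full-resolution patch tokens, cutting the token count fed to the LLM by stride². This is a generic coarse-to-fine compressor, not TokenPacker's exact design.

```python
# Hypothetical sketch: coarse queries cross-attending to fine patch tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineProjector(nn.Module):
    def __init__(self, dim: int, stride: int = 2, heads: int = 8):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, H*W, D) ViT patch tokens -> (B, (H/s)*(W/s), D)."""
        B, N, D = tokens.shape
        H = W = int(N ** 0.5)
        grid = tokens.transpose(1, 2).reshape(B, D, H, W)
        coarse = F.avg_pool2d(grid, self.stride)  # downsampled summary
        q = coarse.flatten(2).transpose(1, 2)     # (B, N/s^2, D) queries
        out, _ = self.attn(q, tokens, tokens)     # queries read fine detail
        return out + q                            # residual keeps the summary

proj = CoarseToFineProjector(dim=512)
x = torch.randn(1, 576, 512)  # 24x24 patch grid from a ViT
print(proj(x).shape)          # torch.Size([1, 144, 512]): 4x fewer tokens
```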