Multimodal Perception, Understanding and Reasoning
VCLab's research on MLLM-based visual perception, understanding and reasoning, aimed at enhancing the multi-tasking capabilities of MLLMs.
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
Paper • 2507.13353 • Published • 2 • Note: [CVPR 2026 Highlight] Instructed temporal grounding for video MLLMs. | Code: https://github.com/NVlabs/VideoITG
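As a rough illustration of the idea (not the paper's actual pipeline), instructed temporal grounding can be reduced to scoring frames against an instruction embedding in a shared vision-language space and keeping the top-k; `select_frames` and the random features below are hypothetical stand-ins.

```python
# Hypothetical sketch: instruction-conditioned frame selection for a video MLLM.
import torch
import torch.nn.functional as F

def select_frames(frame_emb: torch.Tensor, instr_emb: torch.Tensor, k: int = 8):
    """frame_emb: (T, D) per-frame features; instr_emb: (D,) instruction feature.
    Both are assumed to live in a shared vision-language embedding space."""
    sim = F.cosine_similarity(frame_emb, instr_emb.unsqueeze(0), dim=-1)  # (T,)
    idx = sim.topk(k).indices.sort().values  # top-k frames, kept in temporal order
    return idx, sim

# Toy usage: random unit vectors stand in for real encoder outputs.
frames = F.normalize(torch.randn(64, 512), dim=-1)
instr = F.normalize(torch.randn(512), dim=-1)
idx, _ = select_frames(frames, instr, k=8)
print(idx.tolist())  # indices of the frames handed to the MLLM
```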
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
Paper • 2509.03951 • Published • 1 • Note: [CVPR 2026 Oral] MLLM-shaped negative textual space for OOD detection. | Code: https://github.com/ZhuWenjie98/ANTS
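A minimal sketch of the general recipe, assuming a CLIP-style shared embedding space: OOD-ness is read off as the probability mass a test image assigns to negative prompts versus in-distribution label prompts. The function name and random features are illustrative; the paper's actual test-time shaping of the negative space is more involved.

```python
# Hypothetical sketch: OOD score as probability mass on a negative textual space.
import torch
import torch.nn.functional as F

def ood_score(img, pos_txt, neg_txt, tau: float = 0.01):
    """img: (D,) image feature; pos_txt: (C, D) ID-class prompts;
    neg_txt: (M, D) negative prompts shaped at test time."""
    logits = torch.cat([pos_txt, neg_txt]) @ img / tau  # (C + M,)
    probs = logits.softmax(dim=-1)
    return probs[len(pos_txt):].sum()  # mass on negatives; higher = more OOD

D, C, M = 512, 10, 32
img = F.normalize(torch.randn(D), dim=-1)
pos = F.normalize(torch.randn(C, D), dim=-1)  # e.g. "a photo of a {class}"
neg = F.normalize(torch.randn(M, D), dim=-1)  # e.g. MLLM-generated negatives
print(float(ood_score(img, pos, neg)))
```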
AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models
Paper • 2410.20149 • Published • 1 • Note: [NeurIPS 2024] Adaptive negative proxies for OOD detection with VLMs. | Code: https://github.com/YBZh/OpenOOD-VLM
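The sketch below gestures at the core idea, under the assumption that negative proxies are image features accumulated online: a test feature that looks OOD is folded into the proxy of its nearest ID class via a momentum update. All names, thresholds, and the update rule itself are assumptions, not the paper's method.

```python
# Hypothetical sketch: momentum-updated, image-based negative proxies.
import torch
import torch.nn.functional as F

def adaneg_step(img, id_txt, proxies, thresh=0.2, momentum=0.9, tau=0.01):
    """img: (D,) test feature; id_txt: (C, D) ID prompts; proxies: (C, D)."""
    id_prob = (id_txt @ img / tau).softmax(dim=-1)          # ID affinity per class
    c = int(id_prob.argmax())                               # nearest ID class
    score = float(proxies[c] @ img) - float(id_prob.max())  # higher = more OOD
    if score > thresh:  # feature looks OOD: fold it into that class's proxy
        proxies[c] = F.normalize(momentum * proxies[c] + (1 - momentum) * img, dim=-1)
    return score

D, C = 512, 10
img = F.normalize(torch.randn(D), dim=-1)
id_txt = F.normalize(torch.randn(C, D), dim=-1)
proxies = F.normalize(torch.randn(C, D), dim=-1)
print(adaneg_step(img, id_txt, proxies))
```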
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
Paper • 2407.08966 • Published • Note: [ECCV 2024] Label-driven automated prompt tuning for OOD detection. | Code: https://github.com/YBZh/LAPT
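For context, label-driven prompt tuning builds on the CoOp-style recipe of optimizing learnable context vectors against frozen features; the toy below shows that base recipe only (LAPT's automation of prompt generation from labels is not reproduced), with all shapes and hyperparameters illustrative.

```python
# Hypothetical sketch: CoOp-style learnable prompt contexts (LAPT automates this).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int, dim: int, n_cls: int):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # shared context
        self.cls = nn.Parameter(torch.randn(n_cls, dim) * 0.02)  # per-class token

    def forward(self):  # (n_cls, n_ctx + 1, dim): context then class token
        ctx = self.ctx.unsqueeze(0).expand(self.cls.size(0), -1, -1)
        return torch.cat([ctx, self.cls.unsqueeze(1)], dim=1)

prompts = LearnablePrompt(n_ctx=4, dim=512, n_cls=10)
opt = torch.optim.Adam(prompts.parameters(), lr=1e-3)
img = F.normalize(torch.randn(8, 512), dim=-1)        # fake image-feature batch
labels = torch.randint(0, 10, (8,))
txt = F.normalize(prompts().mean(dim=1), dim=-1)      # crude "text encoder": mean-pool
loss = F.cross_entropy(img @ txt.t() / 0.01, labels)  # tune prompts on labeled ID data
loss.backward(); opt.step()
print(float(loss))
```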
Osprey: Pixel Understanding with Visual Instruction Tuning
Paper • 2312.10032 • Published • 3 • Note: [CVPR 2024] Pixel-level understanding via visual instruction tuning. | Code: https://github.com/CircleRadon/Osprey | HF: https://huggingface.co/sunshine-lwt/Osprey-7b
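One building block such pixel-level MLLMs commonly rely on is mask pooling: averaging vision-encoder features inside a segmentation mask to get a single region token for the LLM. The sketch below shows that generic operation, not Osprey's exact mask-aware extractor.

```python
# Hypothetical sketch: mask pooling a feature map into one region token.
import torch

def mask_pool(feat_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """feat_map: (D, H, W) vision-encoder features; mask: (H, W) binary mask."""
    m = mask.flatten().float()                        # (H*W,)
    f = feat_map.flatten(1)                           # (D, H*W)
    return (f * m).sum(dim=1) / m.sum().clamp(min=1)  # (D,) region token

feat = torch.randn(256, 24, 24)  # stand-in for real encoder output
mask = torch.zeros(24, 24)
mask[4:12, 6:16] = 1             # some region of interest
token = mask_pool(feat, mask)
print(token.shape)               # torch.Size([256]); spliced into the LLM input
```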
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
Paper • 2403.17589 • Published • 1 • Note: [CVPR 2024] Dual memory networks for VLM adaptation. | Code: https://github.com/YBZh/DMN
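Schematically (and only schematically), a dual-memory read can be a softmax-attention lookup over two key-value stores: a static memory of training features and a dynamic memory filled at test time, with their class votes summed. The sketch assumes unit-normalized features; the names and fusion rule are illustrative, not the paper's.

```python
# Hypothetical sketch: summing class votes from a static and a dynamic memory.
import torch
import torch.nn.functional as F

def dual_memory_logits(q, static_mem, static_lab, dyn_mem, dyn_lab, n_cls, tau=0.07):
    """q: (D,) query feature; *_mem: (N, D) cached keys; *_lab: (N,) int labels."""
    def read(mem, lab):
        if mem.numel() == 0:                   # dynamic memory may start empty
            return torch.zeros(n_cls)
        attn = (mem @ q / tau).softmax(dim=0)  # (N,) attention over memory slots
        return attn @ F.one_hot(lab, n_cls).float()  # (n_cls,) soft class vote
    return read(static_mem, static_lab) + read(dyn_mem, dyn_lab)

D, n_cls = 512, 10
q = F.normalize(torch.randn(D), dim=-1)
smem = F.normalize(torch.randn(100, D), dim=-1)  # training-set features
slab = torch.randint(0, n_cls, (100,))
dmem = F.normalize(torch.randn(16, D), dim=-1)   # features cached at test time
dlab = torch.randint(0, n_cls, (16,))
print(int(dual_memory_logits(q, smem, slab, dmem, dlab, n_cls).argmax()))
```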
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper • 2407.02392 • Published • 23 • Note: [IJCV 2025] Efficient visual projector for MLLMs. | Code: https://github.com/CircleRadon/TokenPacker
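A common shape for an efficient projector, sketched below under assumptions of our own: downsampled (coarse) tokens act as queries that cross-attend to the full-resolution patch tokens, cutting the token count fed to the LLM by stride². This is a generic coarse-to-fine compressor, not TokenPacker's exact design.

```python
# Hypothetical sketch: coarse queries cross-attending to fine patch tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineProjector(nn.Module):
    def __init__(self, dim: int, stride: int = 2, heads: int = 8):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, H*W, D) ViT patch tokens -> (B, (H/s)*(W/s), D)."""
        B, N, D = tokens.shape
        H = W = int(N ** 0.5)
        grid = tokens.transpose(1, 2).reshape(B, D, H, W)
        coarse = F.avg_pool2d(grid, self.stride)  # downsampled summary
        q = coarse.flatten(2).transpose(1, 2)     # (B, N/s^2, D) queries
        out, _ = self.attn(q, tokens, tokens)     # queries read fine detail
        return out + q                            # residual keeps the summary

proj = CoarseToFineProjector(dim=512)
x = torch.randn(1, 576, 512)  # 24x24 patch grid from a ViT
print(proj(x).shape)          # torch.Size([1, 144, 512]): 4x fewer tokens
```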