oguzhanercan's Collections: Image-Video MultiModal Understanding
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147
SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
Paper • 2501.01245 • Published • 5
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Paper • 2501.00599 • Published • 46
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Paper • 2501.08326 • Published • 34
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
Paper • 2501.07783 • Published • 8
Qwen2.5-VL Technical Report
Paper • 2502.13923 • Published • 217
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
Paper • 2503.07365 • Published • 61
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Paper • 2503.10596 • Published • 18
Large-scale Pre-training for Grounded Video Caption Generation
Paper • 2503.10781 • Published • 16
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
Paper • 2504.01934 • Published • 22
LiveVQA: Live Visual Knowledge Seeking
Paper • 2504.05288 • Published • 15
Paper • 2504.07491 • Published • 139
OmniCaptioner: One Captioner to Rule Them All
Paper • 2504.07089 • Published • 20
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Paper • 2504.05541 • Published • 15
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Paper • 2504.06958 • Published • 13
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Paper • 2504.08837 • Published • 44
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Paper • 2504.10479 • Published • 308
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper • 2504.09641 • Published • 16
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Paper • 2504.10443 • Published • 3
Describe Anything: Detailed Localized Image and Video Captioning
Paper • 2504.16072 • Published • 64
Seed1.5-VL Technical Report
Paper • 2505.07062 • Published • 157
Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Paper • 2505.05464 • Published • 11
Aya Vision: Advancing the Frontier of Multilingual Multimodality
Paper • 2505.08751 • Published • 13
MMaDA: Multimodal Large Diffusion Language Models
Paper • 2505.15809 • Published • 98
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Paper • 2505.16933 • Published • 34
LaViDa: A Large Diffusion Language Model for Multimodal Understanding
Paper • 2505.16839 • Published • 13
Ming-Omni: A Unified Multimodal Model for Perception and Generation
Paper • 2506.09344 • Published • 31
Is Extending Modality The Right Path Towards Omni-Modality?
Paper • 2506.01872 • Published • 24
Hidden in plain sight: VLMs overlook their visual representations
Paper • 2506.08008 • Published • 7
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Paper • 2506.09985 • Published • 31
Kwai Keye-VL Technical Report
Paper • 2507.01949 • Published • 132
Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor
Paper • 2507.07106 • Published • 2
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
Paper • 2507.20939 • Published • 57
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
Paper • 2507.22607 • Published • 47
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
Paper • 2510.09008 • Published • 16
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Paper • 2510.09608 • Published • 53
OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows
Paper • 2510.03506 • Published • 15
V-Thinker: Interactive Thinking with Images
Paper • 2511.04460 • Published • 98
Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
Paper • 2511.01617 • Published • 3
VideoSSR: Video Self-Supervised Reinforcement Learning
Paper • 2511.06281 • Published • 25
Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
Paper • 2510.27571 • Published • 19
VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
Paper • 2511.02712 • Published • 6
Beyond Language Modeling: An Exploration of Multimodal Pretraining
Paper • 2603.03276 • Published • 103
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
Paper • 2603.12255 • Published • 91
Taking Shortcuts for Categorical VQA Using Super Neurons
Paper • 2603.10781 • Published • 7
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Paper • 2603.12262 • Published • 31
Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
Paper • 2603.12793 • Published • 38