oguzhanercan's Collections: Image-Video General Tasks
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Paper
• 2501.04001
• Published • 47
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Paper
• 2501.03895
• Published • 52
An Empirical Study of Autoregressive Pre-training from Videos
Paper
• 2501.05453
• Published • 41
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training
Paper
• 2501.07556
• Published • 7
MINIMA: Modality Invariant Image Matching
Paper
• 2412.19412
• Published • 4
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Paper
• 2501.12375
• Published • 23
Intuitive physics understanding emerges from self-supervised pretraining on natural videos
Paper
• 2502.11831
• Published • 20
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Paper
• 2502.17157
• Published • 52
"Principal Components" Enable A New Language of Images
Paper
• 2503.08685
• Published • 12
What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization
Paper
• 2503.06698
• Published • 4
Segment Any Motion in Videos
Paper
• 2503.22268
• Published • 19
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
Paper
• 2510.12747
• Published • 39
VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval
Paper
• 2602.08099
• Published • 124
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Paper
• 2602.23339
• Published • 6
RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Sense Forecasting
Paper
• 2603.14941
• Published • 8