Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper • 2403.12596 • Published • 11
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 31
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • 2404.16994 • Published • 37
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Paper • 2405.14129 • Published • 14
Dense Connector for MLLMs
Paper • 2405.13800 • Published • 24
Merlin: Empowering Multimodal LLMs with Foresight Minds
Paper • 2312.00589 • Published • 27
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Paper • 2407.15754 • Published • 21
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 39
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Paper • 2407.18121 • Published • 17
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Paper • 2409.01071 • Published • 27
LongVLM: Efficient Long Video Understanding via Large Language Models
Paper • 2404.03384 • Published
Visual Context Window Extension: A New Perspective for Long Video Understanding
Paper • 2409.20018 • Published • 11
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Paper • 2410.10594 • Published • 29
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Paper • 2501.13106 • Published • 91