Video Understanding
- Vript: A Video Is Worth Thousands of Words (arXiv:2406.06040)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325)
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (arXiv:2406.01574)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (arXiv:2405.21075)
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos (arXiv:2405.20340)
- An Introduction to Vision-Language Modeling (arXiv:2405.17247)
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (arXiv:2404.16994)
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (arXiv:2404.13013)
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability (arXiv:2405.14129)
- Video ReCap: Recursive Captioning of Hour-Long Videos (arXiv:2402.13250)
- A Simple LLM Framework for Long-Range Video Question-Answering (arXiv:2312.17235)
- Retrieval-Augmented Egocentric Video Captioning (arXiv:2401.00789)
- Distilling Vision-Language Models on Millions of Videos (arXiv:2401.06129)
- World Model on Million-Length Video And Language With RingAttention (arXiv:2402.08268)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models (arXiv:2406.16338)
- Long Context Transfer from Language to Vision (arXiv:2406.16852)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (arXiv:2406.16860)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (arXiv:2407.15841)
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding (arXiv:2407.15754)
- PiTe: Pixel-Temporal Alignment for Large Video-Language Model (arXiv:2409.07239)
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models (arXiv:2410.02740)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions (arXiv:2412.09596)