MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models Paper • 2306.13394 • Published Jun 23, 2023
CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes Paper • 2310.09761 • Published Oct 15, 2023 • 1
FoPro: Few-Shot Guided Robust Webly-Supervised Prototypical Learning Paper • 2212.00465 • Published Dec 1, 2022
VITA: Towards Open-Source Interactive Omni Multimodal LLM Paper • 2408.05211 • Published Aug 9, 2024 • 50
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs Paper • 2411.15296 • Published Nov 22, 2024 • 21
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs Paper • 2411.19951 • Published Nov 29, 2024
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification Paper • 2412.00876 • Published Dec 1, 2024
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression Paper • 2412.04317 • Published Dec 5, 2024 • 1
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM Paper • 2411.00774 • Published Nov 1, 2024
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Paper • 2501.01957 • Published Jan 3, 2025 • 47
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy Paper • 2502.05177 • Published Feb 7, 2025 • 2
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Paper • 2505.03739 • Published May 6, 2025 • 10
What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation Paper • 2505.19569 • Published May 26, 2025
Solving the Catastrophic Forgetting Problem in Generalized Category Discovery Paper • 2501.05272 • Published Jan 9, 2025 • 1
Aligning and Prompting Everything All at Once for Universal Visual Perception Paper • 2312.02153 • Published Dec 4, 2023