- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 39
- llava-hf/llava-v1.6-mistral-7b-hf
  Image-Text-to-Text • 8B • Updated • 594k • 306
- llava-hf/llava-v1.6-vicuna-7b-hf
  Image-Text-to-Text • 7B • Updated • 33.3k • 30
- llava-hf/llava-v1.6-vicuna-13b-hf
  Image-Text-to-Text • 13B • Updated • 25.5k • 22
Collections
Collections including paper arxiv:2310.03744
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 39
- DeepSeek-VL: Towards Real-World Vision-Language Understanding
  Paper • 2403.05525 • Published • 49
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 11
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
  Paper • 2404.01331 • Published • 27
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90
- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 21
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 39
- PALO: A Polyglot Large Multimodal Model for 5B People
  Paper • 2402.14818 • Published • 23
- Woodpecker: Hallucination Correction for Multimodal Large Language Models
  Paper • 2310.16045 • Published • 17
- SILC: Improving Vision Language Pretraining with Self-Distillation
  Paper • 2310.13355 • Published • 9
- To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
  Paper • 2311.07574 • Published • 16
- MyVLM: Personalizing VLMs for User-Specific Queries
  Paper • 2403.14599 • Published • 17
- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 21
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 39
- Flamingo: a Visual Language Model for Few-Shot Learning
  Paper • 2204.14198 • Published • 16
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  Paper • 2301.12597 • Published • 2
- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 47
- Orion-14B: Open-source Multilingual Large Language Models
  Paper • 2401.12246 • Published • 14
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
  Paper • 2405.09215 • Published • 22
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
  Paper • 2405.14129 • Published • 14
- llava-hf/llava-1.5-7b-hf
  Image-Text-to-Text • 7B • Updated • 2.45M • 355
- llava-hf/llava-v1.6-mistral-7b-hf
  Image-Text-to-Text • 8B • Updated • 594k • 306
- llava-hf/llava-v1.6-34b-hf
  Image-Text-to-Text • 35B • Updated • 17.9k • 93
- llava-hf/llava-1.5-13b-hf
  Image-Text-to-Text • 13B • Updated • 11.4k • 34