Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (arXiv:2402.17177)
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions (arXiv:2402.17485)
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (arXiv:2403.00522)
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation (arXiv:2403.04692)
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions (arXiv:2311.12793)
FlashFace: Human Image Personalization with High-fidelity Identity Preservation (arXiv:2403.17008)
An Introduction to Vision-Language Modeling (arXiv:2405.17247)
arXiv:2406.09414
Vision language models are blind (arXiv:2407.06581)
SAM 2: Segment Anything in Images and Videos (arXiv:2408.00714)
MiniCPM-V: A GPT-4V Level MLLM on Your Phone (arXiv:2408.01800)
arXiv:2408.07009
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv:2408.08872)
Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637)
CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500)
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation (arXiv:2409.12576)
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation (arXiv:2412.07589)
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models (arXiv:2502.01061)
Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions (arXiv:2501.10020)
Qwen2.5-VL Technical Report (arXiv:2502.13923)
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (arXiv:2502.14786)
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model (arXiv:2411.19108)
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer (arXiv:2511.22699)