-
Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
Paper • 2401.09048 • Published • 10 -
Improving fine-grained understanding in image-text pre-training
Paper • 2401.09865 • Published • 18 -
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Paper • 2401.10891 • Published • 62 -
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
Paper • 2401.13627 • Published • 78
Collections
Discover the best community collections!
Collections including paper arxiv:2509.10441
-
MaskBit: Embedding-free Image Generation via Bit Tokens
Paper • 2409.16211 • Published • 17 -
Goku: Flow Based Video Generative Foundation Models
Paper • 2502.04896 • Published • 107 -
Discrete Audio Tokens: More Than a Survey!
Paper • 2506.10274 • Published • 32 -
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
Paper • 2506.20452 • Published • 18
-
Reconstruction Alignment Improves Unified Multimodal Models
Paper • 2509.07295 • Published • 40 -
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Paper • 2509.06951 • Published • 33 -
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Paper • 2509.06818 • Published • 29 -
Interleaving Reasoning for Better Text-to-Image Generation
Paper • 2509.06945 • Published • 16
-
Test-Time Scaling with Reflective Generative Model
Paper • 2507.01951 • Published • 108 -
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Paper • 2502.05171 • Published • 154 -
Autoregressive Diffusion Models
Paper • 2110.02037 • Published -
EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Paper • 2502.09509 • Published • 9
-
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Paper • 2311.12631 • Published • 14 -
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 61 -
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Paper • 2504.01956 • Published • 41 -
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Paper • 2506.23219 • Published • 7
-
Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
Paper • 2401.09048 • Published • 10 -
Improving fine-grained understanding in image-text pre-training
Paper • 2401.09865 • Published • 18 -
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Paper • 2401.10891 • Published • 62 -
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
Paper • 2401.13627 • Published • 78
-
Reconstruction Alignment Improves Unified Multimodal Models
Paper • 2509.07295 • Published • 40 -
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Paper • 2509.06951 • Published • 33 -
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Paper • 2509.06818 • Published • 29 -
Interleaving Reasoning for Better Text-to-Image Generation
Paper • 2509.06945 • Published • 16
-
Test-Time Scaling with Reflective Generative Model
Paper • 2507.01951 • Published • 108 -
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Paper • 2502.05171 • Published • 154 -
Autoregressive Diffusion Models
Paper • 2110.02037 • Published -
EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Paper • 2502.09509 • Published • 9
-
MaskBit: Embedding-free Image Generation via Bit Tokens
Paper • 2409.16211 • Published • 17 -
Goku: Flow Based Video Generative Foundation Models
Paper • 2502.04896 • Published • 107 -
Discrete Audio Tokens: More Than a Survey!
Paper • 2506.10274 • Published • 32 -
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
Paper • 2506.20452 • Published • 18
-
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Paper • 2311.12631 • Published • 14 -
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 61 -
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Paper • 2504.01956 • Published • 41 -
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Paper • 2506.23219 • Published • 7