paper seminar_251001
Reconstruction Alignment Improves Unified Multimodal Models
Paper
• 2509.07295
• Published • 40
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Paper
• 2509.06951
• Published • 33
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Paper
• 2509.06818
• Published • 29
Interleaving Reasoning for Better Text-to-Image Generation
Paper
• 2509.06945
• Published • 16
RewardDance: Reward Scaling in Visual Generation
Paper
• 2509.08826
• Published • 73
Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling
Paper
• 2509.01624
• Published • 7
Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
Paper
• 2509.06942
• Published • 19
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Paper
• 2509.15185
• Published • 29
LLM-I: LLMs are Naturally Interleaved Multimodal Creators
Paper
• 2509.13642
• Published • 9
Image Tokenizer Needs Post-Training
Paper
• 2509.12474
• Published • 9
InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Paper
• 2509.10441
• Published • 31
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Paper
• 2509.08519
• Published • 130
MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement
Paper
• 2509.01977
• Published • 13
GenCompositor: Generative Video Compositing with Diffusion Transformer
Paper
• 2509.02460
• Published • 26
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Paper
• 2508.20751
• Published • 90
Mixture of Contexts for Long Video Generation
Paper
• 2508.21058
• Published • 35
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Paper
• 2509.16197
• Published • 58
Lynx: Towards High-Fidelity Personalized Video Generation
Paper
• 2509.15496
• Published • 13
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
Paper
• 2509.17627
• Published • 66
Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
Paper
• 2509.19244
• Published • 12
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
Paper
• 2509.18824
• Published • 23
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Paper
• 2510.05094
• Published • 38
Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Paper
• 2509.25771
• Published • 11
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Paper
• 2510.01284
• Published • 37
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Paper
• 2510.02283
• Published • 98
UltraGen: High-Resolution Video Generation with Hierarchical Attention
Paper
• 2510.18775
• Published • 18