Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding Paper • 2604.05015 • Published 12 days ago • 233
MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs Paper • 2506.01674 • Published Jun 2, 2025 • 28
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation Paper • 2506.03147 • Published Jun 3, 2025 • 58
Navigating the RLHF Landscape: From Policy Gradients to PPO, GAE, and DPO for LLM Alignment Article • Feb 11, 2025 • 119
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation Paper • 2501.01895 • Published Jan 3, 2025 • 55
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement Paper • 2411.06558 • Published Nov 10, 2024 • 36
Kolors Virtual Try-On Space • Running on CPU Upgrade • Agents • 10k • Generate a virtual try-on image of a person wearing a garment
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model Paper • 2305.11176 • Published May 18, 2023
You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction Paper • 2205.14871 • Published May 30, 2022
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers Paper • 2405.05945 • Published May 9, 2024 • 4
Rethinking Mobile Block for Efficient Attention-based Models Paper • 2301.01146 • Published Jan 3, 2023
VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation Paper • 2405.18156 • Published May 28, 2024
ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models Paper • 2403.11289 • Published Mar 17, 2024