Animations
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens Paper • 2603.02138 • Published Mar 2 • 151
World Models
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft Paper • 2504.08388 • Published Apr 11, 2025 • 42
Advancing Open-source World Models Paper • 2601.20540 • Published Jan 28 • 135
3D / Mesh
TEXGen: a Generative Diffusion Model for Mesh Textures Paper • 2411.14740 • Published Nov 22, 2024 • 17
Gaussians and Nerfs
Sketch2NeRF: Multi-view Sketch-guided Text-to-3D Generation Paper • 2401.14257 • Published Jan 25, 2024 • 12
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation Paper • 2402.05054 • Published Feb 7, 2024 • 29
MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction Paper • 2402.12712 • Published Feb 20, 2024 • 18
GaussianObject: Just Taking Four Images to Get A High-Quality 3D Object with Gaussian Splatting Paper • 2402.10259 • Published Feb 15, 2024 • 15
Image Restoration
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild Paper • 2401.13627 • Published Jan 24, 2024 • 78
λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space Paper • 2402.05195 • Published Feb 7, 2024 • 19
TBR Papers (To Be Read)
3D-LLM: Injecting the 3D World into Large Language Models Paper • 2307.12981 • Published Jul 24, 2023 • 40
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study Paper • 2401.17981 • Published Jan 31, 2024 • 1
SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM Paper • 2312.02126 • Published Dec 4, 2023 • 2
Relightable Gaussian Codec Avatars Paper • 2312.03704 • Published Dec 6, 2023 • 32
Object Detection
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset Paper • 2402.05937 • Published Feb 8, 2024 • 14
End-to-End Object Detection with Transformers Paper • 2005.12872 • Published May 26, 2020 • 7
Multimodal
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling Paper • 2402.12226 • Published Feb 19, 2024 • 45
Kosmos-2: Grounding Multimodal Large Language Models to the World Paper • 2306.14824 • Published Jun 26, 2023 • 35
Datasets
TuringEnterprises/Open-RL Viewer • Updated Mar 4 • 40 • 419 • 179
HuggingFaceFW/fineweb Viewer • Updated Jul 11, 2025 • 52.5B • 618k • 2.74k
HuggingFaceFW/fineweb-edu Viewer • Updated Jul 11, 2025 • 3.5B • 349k • 1.02k
Audio
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System? Paper • 2412.18495 • Published Dec 24, 2024 • 9
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation Paper • 2410.17799 • Published Oct 23, 2024 • 12
Voxtral TTS Paper • 2603.25551 • Published 18 days ago • 58
Web Agents
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models Paper • 2401.13919 • Published Jan 25, 2024 • 32
Data Generation
Genie: Achieving Human Parity in Content-Grounded Datasets Generation Paper • 2401.14367 • Published Jan 25, 2024 • 8
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models Paper • 2402.13064 • Published Feb 20, 2024 • 50
3D Avatar Utils
Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance Paper • 2401.15687 • Published Jan 28, 2024 • 24
Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians Paper • 2312.03029 • Published Dec 5, 2023 • 27
DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation Paper • 2312.13578 • Published Dec 21, 2023 • 29
Splatter Image: Ultra-Fast Single-View 3D Reconstruction Paper • 2312.13150 • Published Dec 20, 2023 • 15
Spatial
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss Paper • 2402.05008 • Published Feb 7, 2024 • 23
DepthFM: Fast Monocular Depth Estimation with Flow Matching Paper • 2403.13788 • Published Mar 20, 2024 • 18
Utonia: Toward One Encoder for All Point Clouds Paper • 2603.03283 • Published Mar 3 • 185
LLM
Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains Paper • 2402.05140 • Published Feb 6, 2024 • 23
BitDelta: Your Fine-Tune May Only Be Worth One Bit Paper • 2402.10193 • Published Feb 15, 2024 • 21
QLoRA: Efficient Finetuning of Quantized LLMs Paper • 2305.14314 • Published May 23, 2023 • 61
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement Paper • 2402.14658 • Published Feb 22, 2024 • 84
Video
VideoPrism: A Foundational Visual Encoder for Video Understanding Paper • 2402.13217 • Published Feb 20, 2024 • 40
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding Paper • 2405.08344 • Published May 14, 2024 • 15
Helios: Real Real-Time Long Video Generation Model Paper • 2603.04379 • Published Mar 4 • 186
Agents
LLM Agent Operating System Paper • 2403.16971 • Published Mar 25, 2024 • 73
Real-Time Reasoning Agents in Evolving Environments Paper • 2511.04898 • Published Nov 7, 2025 • 13
AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications Paper • 2508.16279 • Published Aug 22, 2025 • 61
Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? Paper • 2604.03016 • Published 11 days ago • 36