- D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
  Paper • 2510.05684 • Published • 146
- Generalist IDM
  📊 Process gameplay videos to predict keyboard and mouse actions
- open-world-agents/Generalist-IDM-1B
  Image-Text-to-Text • 0.9B • Updated • 315 • 3
- open-world-agents/D2E-480p
  Viewer • Updated • 460 • 2.47k • 1
Collections
Collections including paper arxiv:2510.05684
- ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
  Paper • 2510.12693 • Published • 28
- Dr.LLM: Dynamic Layer Routing in LLMs
  Paper • 2510.12773 • Published • 32
- D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
  Paper • 2510.05684 • Published • 146
- BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
  Paper • 2510.08759 • Published • 46

- LLM Pruning and Distillation in Practice: The Minitron Approach
  Paper • 2408.11796 • Published • 60
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
  Paper • 2408.09174 • Published • 53
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 45
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
  Paper • 2408.11878 • Published • 64

- A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
  Paper • 2510.23587 • Published • 67
- D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
  Paper • 2510.05684 • Published • 146
- Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
  Paper • 2510.08673 • Published • 127
- Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
  Paper • 2511.08892 • Published • 215

- HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video
  Paper • 2510.05560 • Published • 8
- TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
  Paper • 2510.06217 • Published • 67
- Less is More: Recursive Reasoning with Tiny Networks
  Paper • 2510.04871 • Published • 513
- Fast-dLLM v2: Efficient Block-Diffusion LLM
  Paper • 2509.26328 • Published • 58

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 30
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 15
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23