Geometric Context Transformer for Streaming 3D Reconstruction • Paper • 2604.14141 • Published 4 days ago • 4
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens • Paper • 2604.04913 • Published 13 days ago • 10
MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines • Paper • 2603.06679 • Published 20 days ago • 5
MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios • Paper • 2603.28130 • Published 20 days ago • 11
AVO: Agentic Variation Operators for Autonomous Evolutionary Search • Paper • 2603.24517 • Published 25 days ago • 10
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders • Paper • 2603.19209 • Published about 1 month ago • 5
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning • Paper • 2603.14482 • Published Mar 15 • 30
V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising • Paper • 2603.16792 • Published Mar 17 • 3
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections • Paper • 2603.12180 • Published Mar 12 • 65
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model • Paper • 2602.17807 • Published Feb 19 • 7
Causal-JEPA: Learning World Models through Object-Level Latent Interventions • Paper • 2602.11389 • Published Feb 11 • 8
SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization • Paper • 2602.04811 • Published Feb 4 • 2
UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders • Paper • 2601.17950 • Published Jan 25 • 4
UM-Text: A Unified Multimodal Model for Image Understanding • Paper • 2601.08321 • Published Jan 13 • 12