arxiv:2604.08626

WildDet3D: Scaling Promptable 3D Detection in the Wild

Published on Apr 9 · Submitted by Weikai Huang on Apr 13
#2 Paper of the day
Abstract


Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).

Community


Today we're releasing WildDet3D, an open model that can look at a single photo and understand objects in three dimensions—how far away they are, how big they are, and how they're oriented in space.

Type a category name, click on an object, or pass in a 2D detection from another model—WildDet3D returns a full 3D bounding box. When a depth sensor is available, it folds that data in automatically for improved accuracy, no architecture changes needed.
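The post doesn't show the output format, but a "full 3D bounding box" of the kind described in the abstract is just a location, an extent, and an orientation. As a minimal sketch (the center/size/yaw parameterization here is an assumption, not taken from the release), here is how such a box expands into its eight corners:

```python
import numpy as np

def box_corners(center, dims, yaw):
    """Return the 8 corners (8x3 array) of a 3D box.

    center: (x, y, z) box center; dims: (w, h, l) full extents;
    yaw: rotation about the vertical (y) axis, in radians.
    """
    w, h, l = dims
    # Axis-aligned corner offsets before rotation.
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * (w / 2)
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * (h / 2)
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * (l / 2)
    corners = np.stack([x, y, z], axis=1)
    # Rotate about the vertical axis, then translate to the box center.
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return corners @ R.T + np.asarray(center)

# A 2 m wide, 1.5 m tall, 4 m long box centered 5 m in front of the camera.
corners = box_corners(center=(0.0, 0.0, 5.0), dims=(2.0, 1.5, 4.0), yaw=0.0)
```

With zero yaw the corners simply span the half-extents around the center, e.g. depth runs from 3 m to 7 m in this example.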

This means any vision system that already identifies objects in 2D can gain spatial awareness: a pair of smart glasses or a robotic arm can recover position, size, and orientation in 3D without being retrained.
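To illustrate what "pass in a 2D detection from another model" could look like, here is a minimal sketch of the three prompt modalities as data types; every class and function name below is hypothetical and not taken from the released code:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical prompt types mirroring the three modalities in the post.
@dataclass
class TextPrompt:
    category: str        # e.g. "chair"

@dataclass
class PointPrompt:
    x: float             # pixel coordinates of a click on the object
    y: float

@dataclass
class BoxPrompt:
    x1: float            # 2D box corners in pixel coordinates,
    y1: float            # e.g. from an off-the-shelf 2D detector
    x2: float
    y2: float

Prompt = Union[TextPrompt, PointPrompt, BoxPrompt]

def from_2d_detection(det: dict) -> BoxPrompt:
    """Wrap an existing 2D detector's output as a box prompt."""
    x1, y1, x2, y2 = det["bbox"]
    return BoxPrompt(x1, y1, x2, y2)

prompt = from_2d_detection({"bbox": [40.0, 60.0, 220.0, 310.0], "label": "package"})
```

The point is the shape of the interface, not the implementation: a 2D pipeline only needs to repackage boxes it already produces.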

On standard benchmarks, WildDet3D sets a new state of the art while training on a fraction of the compute used by prior methods. And in settings it was never trained on (autonomous driving environments, indoor spaces, and object categories it has never encountered), it nearly doubles the best prior scores.

We're also releasing WildDet3D-Data, the largest open 3D detection dataset available:

📊 1M+ images

📐 3.7M verified 3D annotations

🏷️ 13K+ object categories

✋ 100K+ human-annotated images

And there's a smartphone app📱—point your camera at a scene, select a category or draw a 2D box, and get 3D bounding boxes in real time → https://apps.apple.com/us/app/wilddet3d/id6760861157

Spatial intelligence is core to where AI is heading—the same model that helps an AR app place directions over a street can help a robot estimate the size of a package on a shelf. We think the most interesting applications are ones no one has built yet, and we're releasing everything openly for the benefit of the community.

📝 Blog: https://allenai.org/blog/wilddet3d (more demo videos)

🤖 Models: https://huggingface.co/collections/allenai/wilddet3d

📊 Code: https://github.com/allenai/WildDet3D

🗂️ Data: https://huggingface.co/datasets/allenai/WildDet3D-Data

🎮 Demo: https://huggingface.co/spaces/allenai/WildDet3D

📄 Tech report: https://allenai.org/papers/wilddet3d
