---
title: DepthLens
emoji: π
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.23.0
python_version: '3.10'
app_file: app.py
pinned: false
license: mit
---
# DepthLens: Monocular Depth Estimation

Estimate depth from a single image using state-of-the-art deep learning models. Upload any photo and get a detailed depth map showing how far away each part of the scene is.

## What It Does

DepthLens takes a single 2D image and predicts a per-pixel depth map; no stereo cameras, no LiDAR, just one photo. The output is a color-mapped visualization showing relative distances: warm colors (red/yellow) for nearby objects and cool colors (blue/purple) for faraway ones.
This is the same core technique used in autonomous vehicles, AR/VR applications, robotics, and 3D scene reconstruction.

## Project Structure

```
depthlens/
├── app.py                   # Gradio web app (for HuggingFace Spaces)
├── inference.py             # Standalone inference script
├── requirements.txt         # Python dependencies
├── models/
│   └── depth_estimator.py   # Model wrapper with MiDaS integration
├── utils/
│   └── visualization.py     # Depth map coloring and overlays
└── examples/                # Sample images for testing
```

## Quick Start

### 1. Install Dependencies

```bash
git clone https://github.com/getmokshshah/depthlens.git
cd depthlens
pip install -r requirements.txt
```

### 2. Run the Web App Locally

```bash
python app.py
```
Opens a Gradio interface at http://localhost:7860 where you can upload images and see depth maps.
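
For reference, the app boils down to a small amount of Gradio wiring. This is a minimal sketch, not the repo's actual `app.py`; the `predict` body here is a grayscale stand-in for the real depth model:

```python
import gradio as gr
from PIL import Image

def predict(image: Image.Image) -> Image.Image:
    # Stand-in body: the real app runs the depth model and colormaps the result.
    return image.convert("L")

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil", label="Input photo"),
    outputs=gr.Image(type="pil", label="Depth map"),
    title="DepthLens",
)

if __name__ == "__main__":
    demo.launch()  # serves at http://localhost:7860 by default
```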

### 3. Run Inference from the Command Line

```bash
# Single image
python inference.py --input photo.jpg --output depth_result.png

# Folder of images
python inference.py --input ./photos/ --output ./results/ --batch

# Choose model size
python inference.py --input photo.jpg --output depth.png --model large

# Save raw depth as a NumPy array
python inference.py --input photo.jpg --output depth.npy --save-raw
```

## Model Options

| Model | Speed | Quality | Memory | Best For |
|---|---|---|---|---|
| `small` (default) | ~0.5s/image | Good | ~200MB | Real-time apps, demos |
| `large` | ~3s/image | Best | ~1GB | High-quality results |

The small model (MiDaS v2.1 Small) is optimized for mobile and edge devices, and has been validated in resource-constrained environments while still producing accurate relative depth maps. The large model (DPT-Large) uses a Vision Transformer backbone for maximum accuracy.
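
Both networks are published on torch.hub by Intel ISL. A sketch of how the two sizes can be loaded, assuming DepthLens wraps the standard `intel-isl/MiDaS` hub entry points:

```python
import torch

# "MiDaS_small" and "DPT_Large" are the entry points published by the
# intel-isl/MiDaS hub repo; pick one to trade speed for quality.
model_name = "MiDaS_small"  # or "DPT_Large"
model = torch.hub.load("intel-isl/MiDaS", model_name)
model.eval()

# The same hub repo ships matching preprocessing transforms.
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = (transforms.small_transform if model_name == "MiDaS_small"
             else transforms.dpt_transform)
```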

## Visualization Modes

DepthLens generates three visualization styles, sketched in code after the list:
- Colored Depth Map: A viridis/inferno/magma colormap applied to the depth prediction, producing a striking false-color image
- Side-by-Side Comparison: Original image next to its depth map for easy comparison
- Depth Overlay: Semi-transparent depth map blended on top of the original image
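
A minimal sketch of all three styles, assuming the depth map is already normalized to [0, 1]; the helper names are illustrative, not the repo's actual `utils/visualization.py` API:

```python
import numpy as np
import matplotlib

def colorize(depth01: np.ndarray, cmap_name: str = "inferno") -> np.ndarray:
    """Colored depth map: apply a matplotlib colormap, return HxWx3 uint8 RGB."""
    rgba = matplotlib.colormaps[cmap_name](depth01)  # HxWx4 floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)

def side_by_side(image: np.ndarray, depth_rgb: np.ndarray) -> np.ndarray:
    """Original next to its depth map (both must share the same HxW)."""
    return np.concatenate([image, depth_rgb], axis=1)

def overlay(image: np.ndarray, depth_rgb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Semi-transparent depth map blended over the original."""
    blended = (1 - alpha) * image.astype(np.float32) + alpha * depth_rgb.astype(np.float32)
    return blended.astype(np.uint8)
```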

## How It Works

1. Preprocessing: The input image is resized and normalized to match the model's expected input format using MiDaS transforms
2. Depth Prediction: The image passes through a deep neural network (CNN or Vision Transformer) that outputs a per-pixel inverse depth map
3. Normalization: Raw depth values are normalized to the [0, 1] range for visualization
4. Colormap Application: NumPy and Matplotlib apply scientific colormaps to create visually informative depth images (the full pipeline is sketched below)
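
An end-to-end sketch of these four steps, assuming the `intel-isl/MiDaS` torch.hub entry points shown earlier (error handling and device placement omitted):

```python
import cv2
import numpy as np
import torch

model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    batch = transform(img)                   # 1. resize + normalize
    pred = model(batch)                      # 2. inverse depth, shape 1xH'xW'
    pred = torch.nn.functional.interpolate(  # resize back to the input size
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()

# 3. normalize to [0, 1]; 4. colormap (see the visualization sketch above)
depth01 = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)
```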

## Architecture Details

The small model uses the EfficientNet-Lite backbone with a lightweight decoder, designed for fast inference. The large model uses DPT (Dense Prediction Transformer), a Vision Transformer encoder with convolutional decoder heads that produces sharper depth boundaries and more consistent large-scale depth predictions.

## Understanding the Output

- Warm colors (red, orange, yellow) → close to the camera
- Cool colors (blue, purple) → far from the camera
- Depth values are relative, not absolute: the model predicts which parts are closer or farther, not exact distances in meters (see the snippet below)
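
Concretely, the raw array saved with `--save-raw` only supports ordering pixels by distance. A short check, assuming `depth.npy` stores the model's inverse-depth output (larger value means closer):

```python
import numpy as np

depth = np.load("depth.npy")            # raw inverse depth from --save-raw
y1, x1 = 100, 200                       # illustrative pixel coordinates
y2, x2 = 300, 400
if depth[y1, x1] > depth[y2, x2]:       # larger inverse depth = closer
    print("pixel 1 is closer to the camera than pixel 2")
# Note: depth[y, x] is not a distance in meters.
```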

## Performance

Benchmarked under resource-constrained conditions:
| Model | Resolution | Inference Time | Peak RAM |
|---|---|---|---|
| Small | 256×256 | ~0.4s | ~350MB |
| Small | 512×512 | ~0.8s | ~500MB |
| Large | 384×384 | ~2.8s | ~1.2GB |
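
Timings like these can be reproduced with a simple wall-clock loop. A sketch using the torch.hub setup above; the warm-up run and averaging count are illustrative choices, not the benchmark script actually used:

```python
import time
import cv2
import torch

model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
batch = transform(cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB))

with torch.no_grad():
    model(batch)                          # warm-up run, excluded from timing
    start = time.perf_counter()
    runs = 10
    for _ in range(runs):
        model(batch)
    print(f"~{(time.perf_counter() - start) / runs:.2f}s per image")
```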

## Configuration

### Inference Options

| Argument | Default | Description |
|---|---|---|
| `--input` | required | Path to image or folder |
| `--output` | required | Output path for results |
| `--model` | `small` | Model size: `small` or `large` |
| `--colormap` | `inferno` | Colormap: inferno, magma, viridis, plasma |
| `--side-by-side` | False | Generate side-by-side comparison |
| `--overlay` | False | Generate depth overlay on original |
| `--overlay-alpha` | 0.5 | Transparency for overlay mode |
| `--save-raw` | False | Save raw depth as .npy file |
| `--batch` | False | Process a folder of images |
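
These flags combine freely; for example, a magma-colored overlay at 60% opacity (an illustrative combination of the documented options):

```bash
python inference.py --input photo.jpg --output overlay.png \
    --overlay --overlay-alpha 0.6 --colormap magma
```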

## Use Cases

- 3D Scene Understanding: Understand spatial layout from a single photo
- Autonomous Systems: Depth perception for robots and drones
- AR/VR: Generate depth data for immersive experiences
- Photography: Create depth-based focus effects such as synthetic bokeh (see the sketch after this list)
- Accessibility: Help describe spatial relationships in scenes
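
The photography case is easy to prototype from the raw output. A minimal synthetic-bokeh sketch, assuming `depth.npy` is inverse depth from `--save-raw` at the same resolution as the photo; the blur kernel size is an arbitrary illustrative choice:

```python
import cv2
import numpy as np

img = cv2.imread("photo.jpg").astype(np.float32)
depth = np.load("depth.npy")                        # inverse depth, HxW
near = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # 1 = closest

blurred = cv2.GaussianBlur(img, (31, 31), 0)        # fully blurred layer
mask = near[..., None]                              # HxWx1 blend weight
bokeh = mask * img + (1 - mask) * blurred           # sharp near, blurry far
cv2.imwrite("bokeh.jpg", bokeh.astype(np.uint8))
```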

## Limitations

- Depth is relative, not metric: objects are ranked near-to-far, but without real-world distances
- Transparent and reflective surfaces (glass, mirrors, water) can confuse the model
- Very dark or overexposed regions may have unreliable depth predictions
- The model performs best on natural outdoor scenes and indoor rooms

## License

MIT License: free to use for research or commercial projects.

## Credits

- MiDaS: Ranftl et al., "Towards Robust Monocular Depth Estimation" (Intel ISL)
- DPT: Ranftl et al., "Vision Transformers for Dense Prediction" (ICCV 2021)
- Built with: PyTorch, OpenCV, Gradio, NumPy, Matplotlib