---
title: DepthLens
emoji: 🌀
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.23.0
python_version: '3.10'
app_file: app.py
pinned: false
license: mit
---

# DepthLens: Monocular Depth Estimation

Estimate depth from a single image using state-of-the-art deep learning models. Upload any photo and get a detailed depth map showing how far away each part of the scene is.

Try the Live Demo →


## What It Does

DepthLens takes a single 2D image and predicts a per-pixel depth map: no stereo cameras, no LiDAR, just one photo. The output is a color-mapped visualization showing relative distances: warm colors (red/yellow) for nearby objects and cool colors (blue/purple) for faraway ones.

This is the same core technique used in autonomous vehicles, AR/VR applications, robotics, and 3D scene reconstruction.

πŸ“ Project Structure

depthlens/
β”œβ”€β”€ app.py                  # Gradio web app (for HuggingFace Spaces)
β”œβ”€β”€ inference.py            # Standalone inference script
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ models/
β”‚   └── depth_estimator.py  # Model wrapper with MiDaS integration
β”œβ”€β”€ utils/
β”‚   └── visualization.py    # Depth map coloring and overlays
└── examples/               # Sample images for testing

## Quick Start

### 1. Install Dependencies

```bash
git clone https://github.com/getmokshshah/depthlens.git
cd depthlens
pip install -r requirements.txt
```

### 2. Run the Web App Locally

```bash
python app.py
```

This opens a Gradio interface at http://localhost:7860 where you can upload images and see their depth maps.
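For orientation, here is a minimal sketch of what an app like this looks like in Gradio. The `estimate_depth` function below is a hypothetical placeholder standing in for the real wrapper in `models/depth_estimator.py`:

```python
# Minimal Gradio sketch; `estimate_depth` is a hypothetical placeholder
# standing in for the real wrapper in models/depth_estimator.py.
import gradio as gr
import numpy as np
from matplotlib import colormaps


def estimate_depth(image: np.ndarray) -> np.ndarray:
    # Placeholder: the real app runs MiDaS here and returns a
    # per-pixel depth map normalized to [0, 1].
    gray = image.mean(axis=2)
    return (gray - gray.min()) / (np.ptp(gray) + 1e-8)


def predict(image: np.ndarray) -> np.ndarray:
    depth = estimate_depth(image)
    colored = colormaps["inferno"](depth)[..., :3]  # drop the alpha channel
    return (colored * 255).astype(np.uint8)


demo = gr.Interface(fn=predict, inputs=gr.Image(), outputs=gr.Image(),
                    title="DepthLens")

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```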

### 3. Run Inference from the Command Line

```bash
# Single image
python inference.py --input photo.jpg --output depth_result.png

# Folder of images
python inference.py --input ./photos/ --output ./results/ --batch

# Choose model size
python inference.py --input photo.jpg --output depth.png --model large

# Save raw depth as NumPy array
python inference.py --input photo.jpg --output depth.npy --save-raw
```

## Model Options

| Model | Speed | Quality | Memory | Best For |
|-------|-------|---------|--------|----------|
| `small` (default) | ~0.5 s/image | Good | ~200 MB | Real-time apps, demos |
| `large` | ~3 s/image | Best | ~1 GB | High-quality results |

The small model (MiDaS v2.1 Small) is optimized for mobile and edge devices, and has been validated in resource-constrained environments while still producing accurate relative depth maps. The large model (DPT-Large) uses a Vision Transformer backbone for maximum accuracy.
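Both sizes load through `torch.hub` from the upstream `intel-isl/MiDaS` repository. The snippet below follows the published MiDaS usage; the project's own wrapper in `models/depth_estimator.py` may differ in detail:

```python
# Loading either checkpoint from the upstream MiDaS torch.hub repo.
import torch

model_type = "MiDaS_small"  # or "DPT_Large" for the large model
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.eval()

# The same hub repo ships matching preprocessing transforms.
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = (transforms.small_transform if model_type == "MiDaS_small"
             else transforms.dpt_transform)
```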

## Visualization Modes

DepthLens generates multiple visualization styles (a minimal sketch of these steps follows the list):

- **Colored Depth Map**: a viridis/inferno/magma colormap applied to the depth prediction, producing a striking false-color image
- **Side-by-Side Comparison**: the original image next to its depth map for easy comparison
- **Depth Overlay**: a semi-transparent depth map blended on top of the original image

## How It Works

1. **Preprocessing**: the input image is resized and normalized to match the model's expected input format using the MiDaS transforms
2. **Depth Prediction**: the image passes through a deep neural network (CNN or Vision Transformer) that outputs a per-pixel inverse depth map
3. **Normalization**: raw depth values are normalized to the [0, 1] range for visualization
4. **Colormap Application**: NumPy and Matplotlib apply scientific colormaps to create visually informative depth images (see the sketch below)
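Condensed into code, the four steps look roughly like this, following the published MiDaS torch.hub example (the project's actual pipeline lives in `inference.py`):

```python
# The four pipeline steps with MiDaS small, per the upstream hub example.
import cv2
import numpy as np
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    batch = transform(img)                   # 1. preprocess: resize + normalize
    pred = midas(batch)                      # 2. predict inverse depth
    pred = torch.nn.functional.interpolate(  # upsample back to input resolution
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()

depth = pred.cpu().numpy()
depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # 3. normalize to [0, 1]
# 4. apply a colormap (see Visualization Modes above) to get the final image
```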

## Architecture Details

The small model uses an EfficientNet-Lite backbone with a lightweight decoder, designed for fast inference. The large model uses DPT (Dense Prediction Transformer), a Vision Transformer encoder with convolutional decoder heads, which produces sharper depth boundaries and more consistent large-scale depth predictions.

## Understanding the Output

- **Warm colors** (red, orange, yellow) → close to the camera
- **Cool colors** (blue, purple) → far from the camera
- Depth values are **relative, not absolute**: the model predicts which parts are closer or farther, not exact distances in meters

## Performance

Benchmarked under resource-constrained conditions:

| Model | Resolution | Inference Time | Peak RAM |
|-------|-----------|----------------|----------|
| Small | 256×256 | ~0.4 s | ~350 MB |
| Small | 512×512 | ~0.8 s | ~500 MB |
| Large | 384×384 | ~2.8 s | ~1.2 GB |

## Configuration

### Inference Options

| Argument | Default | Description |
|----------|---------|-------------|
| `--input` | required | Path to image or folder |
| `--output` | required | Output path for results |
| `--model` | `small` | Model size: `small` or `large` |
| `--colormap` | `inferno` | Colormap: `inferno`, `magma`, `viridis`, `plasma` |
| `--side-by-side` | `False` | Generate side-by-side comparison |
| `--overlay` | `False` | Generate depth overlay on original |
| `--overlay-alpha` | `0.5` | Transparency for overlay mode |
| `--save-raw` | `False` | Save raw depth as `.npy` file |
| `--batch` | `False` | Process a folder of images |
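These flags map onto a standard `argparse` setup; a sketch of how they might be declared (the real parser lives in `inference.py` and may differ):

```python
# Sketch of an argparse declaration matching the table above.
import argparse

parser = argparse.ArgumentParser(description="DepthLens inference")
parser.add_argument("--input", required=True, help="Path to image or folder")
parser.add_argument("--output", required=True, help="Output path for results")
parser.add_argument("--model", default="small", choices=["small", "large"])
parser.add_argument("--colormap", default="inferno",
                    choices=["inferno", "magma", "viridis", "plasma"])
parser.add_argument("--side-by-side", action="store_true",
                    help="Generate side-by-side comparison")
parser.add_argument("--overlay", action="store_true",
                    help="Generate depth overlay on original")
parser.add_argument("--overlay-alpha", type=float, default=0.5,
                    help="Transparency for overlay mode")
parser.add_argument("--save-raw", action="store_true",
                    help="Save raw depth as .npy file")
parser.add_argument("--batch", action="store_true",
                    help="Process a folder of images")
args = parser.parse_args()
```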

## Use Cases

- **3D Scene Understanding**: infer spatial layout from a single photo
- **Autonomous Systems**: depth perception for robots and drones
- **AR/VR**: generate depth data for immersive experiences
- **Photography**: create depth-based focus effects (synthetic bokeh; sketched below)
- **Accessibility**: help describe spatial relationships in scenes

## Limitations

- Depth is **relative, not metric**: objects are ranked near-to-far, but without real-world distances
- Transparent and reflective surfaces (glass, mirrors, water) can confuse the model
- Very dark or overexposed regions may have unreliable depth predictions
- The model performs best on natural outdoor scenes and indoor rooms

## License

MIT License: free to use for research or commercial projects.

## Credits

- **MiDaS**: Ranftl et al., "Towards Robust Monocular Depth Estimation" (Intel ISL)
- **DPT**: Ranftl et al., "Vision Transformers for Dense Prediction" (ICCV 2021)
- **Built with**: PyTorch, OpenCV, Gradio, NumPy, Matplotlib