---
title: DepthLens
emoji: πŸŒ€
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: "5.23.0"
python_version: "3.10"
app_file: app.py
pinned: false
license: mit
---
# DepthLens β€” Monocular Depth Estimation
Estimate depth from a single image using state-of-the-art deep learning models. Upload any photo and get a detailed depth map showing how far away each part of the scene is.
**[Try the Live Demo β†’](https://huggingface.co/spaces/getmokshshah/depthlens)**
---
## What It Does
DepthLens takes a single 2D image and predicts a per-pixel depth map β€” no stereo cameras, no LiDAR, just one photo. The output is a color-mapped visualization showing relative distances: warm colors (red/yellow) for nearby objects and cool colors (blue/purple) for faraway ones.
This is the same core technique used in autonomous vehicles, AR/VR applications, robotics, and 3D scene reconstruction.
## πŸ“ Project Structure
```
depthlens/
├── app.py                    # Gradio web app (for Hugging Face Spaces)
├── inference.py              # Standalone inference script
├── requirements.txt          # Python dependencies
├── models/
│   └── depth_estimator.py    # Model wrapper with MiDaS integration
├── utils/
│   └── visualization.py      # Depth map coloring and overlays
└── examples/                 # Sample images for testing
```
## Quick Start
### 1. Install Dependencies
```bash
git clone https://github.com/getmokshshah/depthlens.git
cd depthlens
pip install -r requirements.txt
```
### 2. Run the Web App Locally
```bash
python app.py
```
Opens a Gradio interface at `http://localhost:7860` where you can upload images and see depth maps.
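For reference, here is a minimal, self-contained sketch of what such an app can look like, built on the official Intel ISL torch.hub entry for the small MiDaS model (requires the `timm` package). Helper names like `predict_depth` are illustrative, not necessarily what `app.py` uses:
```python
# Minimal Gradio depth app sketch using the official MiDaS torch.hub entry.
import gradio as gr
import matplotlib
import numpy as np
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def predict_depth(image: np.ndarray) -> np.ndarray:
    """RGB uint8 image in, color-mapped depth image out."""
    with torch.no_grad():
        pred = midas(transform(image))          # inverse depth, (1, H', W')
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=image.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()
    depth = pred.cpu().numpy()
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return (matplotlib.colormaps["inferno"](depth)[..., :3] * 255).astype(np.uint8)

gr.Interface(fn=predict_depth,
             inputs=gr.Image(type="numpy"),
             outputs=gr.Image(type="numpy"),
             title="DepthLens").launch()        # serves on http://localhost:7860
```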
### 3. Run Inference from the Command Line
```bash
# Single image
python inference.py --input photo.jpg --output depth_result.png
# Folder of images
python inference.py --input ./photos/ --output ./results/ --batch
# Choose model size
python inference.py --input photo.jpg --output depth.png --model large
# Save raw depth as NumPy array
python inference.py --input photo.jpg --output depth.npy --save-raw
```
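Under the hood, `--batch` mode amounts to a loop over the input folder. A hypothetical sketch of those internals, reusing the `predict_depth` helper from the Gradio sketch above (the real `inference.py` may differ):
```python
# Hypothetical internals of --batch mode: one depth map per input image.
from pathlib import Path
import cv2

src, dst = Path("./photos"), Path("./results")   # illustrative paths
dst.mkdir(parents=True, exist_ok=True)
for path in sorted(src.glob("*.jpg")):
    img = cv2.cvtColor(cv2.imread(str(path)), cv2.COLOR_BGR2RGB)
    depth_rgb = predict_depth(img)               # HxWx3 uint8, color-mapped
    cv2.imwrite(str(dst / f"{path.stem}_depth.png"),
                cv2.cvtColor(depth_rgb, cv2.COLOR_RGB2BGR))
```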
## Model Options
| Model | Speed | Quality | Memory | Best For |
|-------|-------|---------|--------|----------|
| `small` (default) | ~0.5s/image | Good | ~200MB | Real-time apps, demos |
| `large` | ~3s/image | Best | ~1GB | High-quality results |
The **small** model (MiDaS v2.1 Small) is optimized for mobile and edge devices: it runs in roughly half a second per image even in resource-constrained environments while still producing accurate relative depth maps. The **large** model (DPT-Large) uses a Vision Transformer backbone for maximum accuracy.
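Both sizes map to official Intel ISL torch.hub entry points. A sketch of how a wrapper might select between them and pick the matching preprocessing transform (the actual `models/depth_estimator.py` may differ):
```python
# Load one of the two official MiDaS torch.hub models plus its transform.
import torch

def load_depth_model(size: str = "small"):
    hub_name = "MiDaS_small" if size == "small" else "DPT_Large"
    model = torch.hub.load("intel-isl/MiDaS", hub_name)
    model.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    transform = (transforms.small_transform if size == "small"
                 else transforms.dpt_transform)
    return model, transform
```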
## Visualization Modes
DepthLens generates multiple visualization styles:
- **Colored Depth Map**: A scientific colormap (`inferno`, `magma`, `viridis`, or `plasma`) applied to the depth prediction, producing a striking false-color image
- **Side-by-Side Comparison**: Original image next to its depth map for easy comparison
- **Depth Overlay**: Semi-transparent depth map blended on top of the original image
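The overlay mode, for example, is a plain alpha blend. A minimal sketch using OpenCV, assuming both inputs are RGB `uint8` arrays (`depth_overlay` is an illustrative name; see `utils/visualization.py` for the actual implementation):
```python
# Alpha-blend a color-mapped depth image onto the original image.
import cv2
import numpy as np

def depth_overlay(original: np.ndarray, depth_rgb: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Higher alpha shows more of the depth map (cf. --overlay-alpha)."""
    depth_rgb = cv2.resize(depth_rgb, (original.shape[1], original.shape[0]))
    return cv2.addWeighted(depth_rgb, alpha, original, 1.0 - alpha, 0.0)
```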
## How It Works
1. **Preprocessing**: The input image is resized and normalized to match the model's expected input format using MiDaS transforms
2. **Depth Prediction**: The image passes through a deep neural network (CNN or Vision Transformer) that outputs a per-pixel inverse depth map
3. **Normalization**: Raw depth values are normalized to [0, 1] range for visualization
4. **Colormap Application**: NumPy and Matplotlib apply scientific colormaps to create visually informative depth images
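Steps 3 and 4 in isolation look roughly like the sketch below (`colorize` is an illustrative name; the actual code lives in `utils/visualization.py`):
```python
# Normalize raw (inverse) depth to [0, 1], then map it through a
# Matplotlib colormap to produce an HxWx3 uint8 image.
import matplotlib
import numpy as np

def colorize(depth: np.ndarray, cmap: str = "inferno") -> np.ndarray:
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    rgba = matplotlib.colormaps[cmap](d)     # HxWx4 floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)
```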
### Architecture Details
The **small** model uses the EfficientNet-Lite backbone with a lightweight decoder, designed for fast inference. The **large** model uses DPT (Dense Prediction Transformer) β€” a Vision Transformer encoder with convolutional decoder heads that produces sharper depth boundaries and more consistent large-scale depth predictions.
## Understanding the Output
- **Warm colors** (red, orange, yellow) β†’ **close** to the camera
- **Cool colors** (blue, purple) β†’ **far** from the camera
- **Depth values are relative**, not absolute β€” the model predicts which parts are closer/farther, not exact distances in meters
## Performance
Benchmarked under resource-constrained conditions (exact numbers will vary with your hardware):
| Model | Resolution | Inference Time | Peak RAM |
|-------|-----------|----------------|----------|
| Small | 256Γ—256 | ~0.4s | ~350MB |
| Small | 512Γ—512 | ~0.8s | ~500MB |
| Large | 384Γ—384 | ~2.8s | ~1.2GB |
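To reproduce these timings on your own machine, average wall-clock time over several forward passes after a warm-up run. A minimal sketch:
```python
# Average seconds per forward pass, after one warm-up run.
import time
import torch

def time_model(model: torch.nn.Module, batch: torch.Tensor, runs: int = 10) -> float:
    model.eval()
    with torch.no_grad():
        model(batch)                         # warm-up (weights, kernels, caches)
        start = time.perf_counter()
        for _ in range(runs):
            model(batch)
    return (time.perf_counter() - start) / runs
```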
## Configuration
### Inference Options
| Argument | Default | Description |
|----------|---------|-------------|
| `--input` | required | Path to image or folder |
| `--output` | required | Output path for results |
| `--model` | `small` | Model size: `small` or `large` |
| `--colormap` | `inferno` | Colormap: `inferno`, `magma`, `viridis`, `plasma` |
| `--side-by-side` | `False` | Generate side-by-side comparison |
| `--overlay` | `False` | Generate depth overlay on original |
| `--overlay-alpha` | `0.5` | Transparency for overlay mode |
| `--save-raw` | `False` | Save raw depth as .npy file |
| `--batch` | `False` | Process a folder of images |
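A sketch of how these flags might be declared with `argparse` (the real `inference.py` may structure this differently):
```python
# CLI flags mirroring the table above.
import argparse

parser = argparse.ArgumentParser(description="DepthLens CLI inference")
parser.add_argument("--input", required=True, help="Path to image or folder")
parser.add_argument("--output", required=True, help="Output path for results")
parser.add_argument("--model", default="small", choices=["small", "large"])
parser.add_argument("--colormap", default="inferno",
                    choices=["inferno", "magma", "viridis", "plasma"])
parser.add_argument("--side-by-side", action="store_true",
                    help="Generate side-by-side comparison")
parser.add_argument("--overlay", action="store_true",
                    help="Generate depth overlay on original")
parser.add_argument("--overlay-alpha", type=float, default=0.5,
                    help="Transparency for overlay mode")
parser.add_argument("--save-raw", action="store_true",
                    help="Save raw depth as .npy file")
parser.add_argument("--batch", action="store_true",
                    help="Process a folder of images")
args = parser.parse_args()
```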
## Use Cases
- **3D Scene Understanding**: Understand spatial layout from a single photo
- **Autonomous Systems**: Depth perception for robots and drones
- **AR/VR**: Generate depth data for immersive experiences
- **Photography**: Create depth-based focus effects (synthetic bokeh; see the sketch after this list)
- **Accessibility**: Help describe spatial relationships in scenes
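As an illustration of the photography use case, a synthetic bokeh effect can be built by blurring the whole image and then compositing by depth. A minimal sketch, assuming a depth map already normalized to [0, 1] where larger values mean closer (as with normalized MiDaS inverse depth):
```python
# Keep near pixels sharp and blend far pixels toward a blurred copy.
import cv2
import numpy as np

def synthetic_bokeh(image: np.ndarray, depth01: np.ndarray,
                    ksize: int = 21) -> np.ndarray:
    """image: HxWx3 uint8 RGB; depth01: HxW floats in [0, 1], 1 = near."""
    blurred = cv2.GaussianBlur(image, (ksize, ksize), 0)
    mask = depth01[..., None]                # 1 = keep sharp, broadcast over RGB
    return (image * mask + blurred * (1.0 - mask)).astype(np.uint8)
```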
## Limitations
- Depth is **relative**, not metric β€” objects are ranked near-to-far but without real-world distances
- Transparent and reflective surfaces (glass, mirrors, water) can confuse the model
- Very dark or overexposed regions may have unreliable depth predictions
- The model performs best on natural outdoor scenes and indoor rooms
## License
MIT License β€” free to use for research or commercial projects.
## Credits
- **MiDaS**: Ranftl et al., "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer" (Intel ISL)
- **DPT**: Ranftl et al., "Vision Transformers for Dense Prediction" (ICCV 2021)
- **Built with**: PyTorch, OpenCV, Gradio, NumPy, Matplotlib