Depth Anything V2 ViT-S — D405 Close-Range Fine-tuned
Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human-Computer Interaction
CVPR 2026
Mukhiddin Toshpulatov1,2,4,5 · Wookey Lee2 · Suan Lee3 · Geehyuk Lee1
1SpaceTop, SoC, KAIST 2VoiceAI, BMSE, Inha University 3SoCS, Semyung University 4Dep. of CE, Gachon University, South Korea 5Jizzakh branch of the National University of Uzbekistan
Model Description
This is a fine-tuned Depth Anything V2 ViT-S model, adapted for metric depth estimation at close range (7–50 cm) using data from an Intel RealSense D405 stereo depth camera. The model reduces depth estimation error by 68% (12.3 mm → 3.84 mm MAE) compared to the pre-trained baseline in the operating range relevant to hand-surface interaction.
The fine-tuned model serves as the depth backbone for a real-time vision-based virtual keyboard system that achieves 94.2% contact detection accuracy at 30 FPS on consumer hardware.
Key Improvements over Base Model
| Metric | Pre-trained DA2-ViTS | This Model (fine-tuned) | Improvement |
|---|---|---|---|
| MAE (mm) | 12.3 | 3.84 | 68% reduction |
| δ1 (%) | 87.2 | 95.96 | +8.8 pp |
| RMSE (mm) | 18.4 | 4.8 | 74% reduction |
| abs_rel | 0.042 | 0.008 | 5.3× better |
| SiLog | 0.312 | 0.057 | 5.5× better |
Usage
Installation
```bash
pip install torch torchvision timm opencv-python numpy
git clone https://github.com/DepthAnything/Depth-Anything-V2.git
```
Inference
Critical: `max_depth` must be set to `0.5` at inference. The DPT head computes `depth = sigmoid(features) × max_depth`. Using the default `max_depth=20.0` will produce predictions 40× too large.
```python
import torch
import cv2
import numpy as np

from depth_anything_v2.dpt import DepthAnythingV2

# Initialize model with correct max_depth
model = DepthAnythingV2(
    encoder='vits',
    features=64,
    out_channels=[48, 96, 192, 384],
    max_depth=0.5  # MUST match training config
)

# Load fine-tuned weights
state_dict = torch.load(
    'depth_anything_v2_vits_d405_finetuned.pth',
    map_location='cpu'
)['model']

# Strip DDP prefix if present
cleaned = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(cleaned)
model.eval().cuda()

# Predict depth from RGB image
image = cv2.imread('your_image.png')
with torch.no_grad():
    depth_m = model.infer_image(image)  # Returns metric depth in meters

# Convert to millimeters
depth_mm = depth_m * 1000.0
print(f"Depth range: {depth_mm.min():.1f} - {depth_mm.max():.1f} mm")
```
Batch Inference
```python
from torchvision.transforms import Compose, Normalize, Resize, ToTensor

# Note: Resize and ToTensor expect PIL images; convert cv2 (BGR) frames first
transform = Compose([
    Resize((518, 518)),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Process a batch (image_list: list of PIL images)
images = torch.stack([transform(img) for img in image_list]).cuda()
with torch.no_grad():
    predictions = model(images)  # [B, 1, H, W] metric depth in meters
```
Training Details
Architecture
| Component | Specification |
|---|---|
| Encoder | ViT-S/14 (DINOv2 backbone) |
| Decoder | DPT head (features=64) |
| Output channels | [48, 96, 192, 384] |
| Depth activation | sigmoid × max_depth (0.5 m) |
| Parameters | ~24.8M |
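The depth activation in the table explains the `max_depth` warning above: the decoder's unbounded output is squashed through a sigmoid and scaled by `max_depth`, so an inference-time `max_depth` mismatch rescales every prediction by the same factor. A minimal pure-Python sketch of that final step (the real head applies it per-pixel to a tensor):

```python
import math

MAX_DEPTH = 0.5  # meters; must match the training config

def depth_activation(feature: float, max_depth: float = MAX_DEPTH) -> float:
    """Map an unbounded decoder activation to metric depth in (0, max_depth)."""
    return max_depth / (1.0 + math.exp(-feature))

# A zero activation lands at exactly half the range: 0.25 m
half_range = depth_activation(0.0)

# The same activation under the default max_depth=20.0 is 40x larger,
# which is exactly the failure mode the inference note warns about
scale_error = depth_activation(0.0, max_depth=20.0) / half_range
```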
Training Configuration
| Parameter | Value | Reference |
|---|---|---|
| Base checkpoint | depth_anything_v2_vits.pth (relative depth) | — |
| Loss function | SiLog (Scale-Invariant Logarithmic) | Paper §4.1 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999, wd=0.01) | Paper §4.1 |
| Encoder learning rate | 5 × 10⁻⁶ | Supp. §6 |
| Decoder learning rate | 5 × 10⁻⁵ (10× encoder) | Supp. §6 |
| LR schedule | Cosine annealing | Paper §4.1 |
| Effective batch size | 16 (bs=4 × grad_accum=4) | Paper §4.1 |
| Epochs | 80 | Paper §4.1 |
| Total optimizer steps | ~200K | Paper §4.1 |
| Input resolution | 518 × 518 | DA2 default |
| Depth range | 0.001 m – 0.5 m | Paper §3.1 |
| Augmentation | Random horizontal flip, color jitter | Paper §4.2 |
| Normalization | ImageNet mean/std | DA2 standard |
| Hardware | 1× NVIDIA RTX 3090 (24 GB) | — |
| Training time | ~8.2 hours | — |
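The SiLog objective penalizes log-depth errors while discounting a shared scale offset. A NumPy sketch under common assumptions (the variance weight `lam=0.85` is a conventional choice, not confirmed by the paper; the validity mask uses the 0.001–0.5 m depth range from the table):

```python
import numpy as np

def silog_loss(pred: np.ndarray, target: np.ndarray, lam: float = 0.85) -> float:
    """Scale-invariant logarithmic loss over valid ground-truth pixels.
    lam=1.0 makes the loss fully invariant to a global depth scale."""
    valid = (target > 1e-3) & (target < 0.5)  # 0.001-0.5 m depth range
    d = np.log(pred[valid]) - np.log(target[valid])
    # Clamp at zero to guard against tiny negative values from float error
    return float(np.sqrt(max((d ** 2).mean() - lam * d.mean() ** 2, 0.0)))

gt = np.random.default_rng(0).uniform(0.07, 0.5, size=1000)
perfect = silog_loss(gt, gt)              # exact prediction -> 0.0
scale_only = silog_loss(2 * gt, gt, 1.0)  # pure 2x scale error -> ~0 at lam=1
```

At `lam=0.85` the same 2× scale error is penalized, which keeps the loss anchored to metric scale rather than relative depth.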
Why Relative Pretrained Weights (Not Metric)?
The base depth_anything_v2_vits.pth (relative depth) was chosen over depth_anything_v2_metric_vits.pth (Hypersim-pretrained metric) because:
- The metric checkpoint is trained on indoor scenes at 0.5–20 m range, which introduces strong bias away from our 7–50 cm operating range
- The relative checkpoint provides a better feature foundation for domain adaptation to close-range hand-surface geometry
- Empirically, fine-tuning from relative weights converged faster and achieved lower validation loss
Training Dataset
D405 Hand-Surface Depth Dataset — 53,300 RGB-depth pairs from 15 participants at four camera angles (30°, 45°, 60°, 90°), captured with Intel RealSense D405.
| Split | Frames | Participants |
|---|---|---|
| Train | 42,640 (80%) | P01, P03, P04, P06, P07, P09–P13, P18 + supplementary |
| Validation | 5,330 (10%) | P05, P16 + supplementary |
| Test | 5,330 (10%) | P08, P17 + supplementary |
Splits are participant-stratified (no identity leakage).
Evaluation
Depth Metrics on Test Set
Evaluated on 9,021 held-out frames from 2 participants (P05, P08) across all 4 camera angles.
| Metric | Pre-trained | Fine-tuned (this model) |
|---|---|---|
| MAE (mm) | 12.3 | 3.84 |
| δ1 (%) | 87.2 | 95.96 |
| RMSE (mm) | 18.4 | 4.8 |
| abs_rel | 0.042 | 0.008 |
| SiLog | 0.312 | 0.057 |
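These metrics follow the standard Eigen-style definitions. A NumPy sketch of how they are typically computed over valid ground-truth pixels (the exact validity mask used in the paper is an assumption):

```python
import numpy as np

def depth_metrics(pred_mm: np.ndarray, gt_mm: np.ndarray) -> dict:
    """MAE, RMSE, delta1, and abs_rel over pixels with valid ground truth."""
    valid = gt_mm > 0  # D405 reports 0 for pixels with no depth
    pred, gt = pred_mm[valid], gt_mm[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        'MAE (mm)': float(np.abs(pred - gt).mean()),
        'RMSE (mm)': float(np.sqrt(((pred - gt) ** 2).mean())),
        'delta1 (%)': float((ratio < 1.25).mean() * 100.0),
        'abs_rel': float((np.abs(pred - gt) / gt).mean()),
    }

gt = np.full((4, 4), 100.0)  # flat surface at 100 mm
pred = gt + 2.0              # uniform +2 mm error
metrics = depth_metrics(pred, gt)  # MAE 2.0, delta1 100.0
```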
Downstream Task: Contact Detection
When used as the depth backbone for fingertip contact detection with a velocity-gated hysteresis state machine:
| Metric | Value |
|---|---|
| Contact detection accuracy | 94.2% |
| F1-score | 94.4% |
| False positive rate | 4.2% |
| Typing speed (WPM) | 45.6 |
| Character error rate | 3.1% |
| Inference speed | 30 FPS (RTX 3060 Ti) |
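The velocity-gated hysteresis idea can be sketched as a two-state machine: a touch fires only when the fingertip-surface gap drops below an enter threshold while the fingertip is descending, and it releases only above a higher exit threshold, which suppresses chatter near the surface. All thresholds below are illustrative placeholders, not the paper's tuned values:

```python
from typing import Optional

class ContactDetector:
    """Two-state (open/contact) machine with hysteresis and a velocity gate.
    Thresholds are illustrative, not the paper's tuned values."""

    def __init__(self, touch_mm: float = 4.0, release_mm: float = 7.0):
        self.touch_mm = touch_mm      # enter contact below this gap
        self.release_mm = release_mm  # leave contact only above this gap
        self.in_contact = False
        self._prev_gap: Optional[float] = None

    def update(self, gap_mm: float) -> bool:
        # Per-frame gap velocity (mm/frame); negative means descending
        vel = 0.0 if self._prev_gap is None else gap_mm - self._prev_gap
        self._prev_gap = gap_mm
        if not self.in_contact:
            # Velocity gate: only a descending fingertip may trigger contact
            if gap_mm < self.touch_mm and vel < 0.0:
                self.in_contact = True
        elif gap_mm > self.release_mm:
            # Hysteresis: release requires clearing the higher threshold
            self.in_contact = False
        return self.in_contact

det = ContactDetector()
states = [det.update(g) for g in (20.0, 10.0, 3.0, 5.0, 8.0)]
# states: [False, False, True, True, False]
```

Note how the 5.0 mm frame stays in contact because it sits between the two thresholds: that gap between enter and exit levels is what prevents a depth estimate jittering around a single threshold from producing spurious key repeats.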
Model Files
| File | Size | Description |
|---|---|---|
| `depth_anything_v2_vits_d405_finetuned.pth` | ~284 MB | Fine-tuned checkpoint (contains `model`, `optimizer`, `epoch`, `previous_best`) |
| `config.json` | — | Model configuration |
Checkpoint Contents
```python
import torch

checkpoint = torch.load(
    'depth_anything_v2_vits_d405_finetuned.pth',
    map_location='cpu'
)
checkpoint.keys()  # ['model', 'optimizer', 'epoch', 'previous_best']

# To extract model weights only:
model_weights = checkpoint['model']
```
Limitations
- Depth range: Optimized for 7–50 cm. Performance degrades significantly outside this range.
- Scene type: Trained on hand-surface typing scenes. May not generalize well to arbitrary indoor/outdoor scenes.
- max_depth dependency: Inference MUST use
max_depth=0.5. Mismatched values produce incorrect depth scales. - Camera specificity: Depth supervision comes from Intel RealSense D405. Performance may vary when applied to depth data from other sensors.
- Lighting: Trained primarily under controlled LED lighting (~5000K, 800+ lux). Performance in extreme lighting conditions is not validated.
Intended Use
Primary Use Cases
- Real-time fingertip contact detection for vision-based virtual keyboards
- Close-range metric depth estimation for hand-object interaction
- Monocular depth estimation in tabletop AR/VR scenarios
- Research on depth model adaptation for specialized domains
Out-of-Scope Use
- General-purpose depth estimation at arbitrary ranges
- Outdoor or long-range depth estimation
- Safety-critical applications without additional validation
Citation
```bibtex
@inproceedings{toshpulatov2026realtime,
  title={Real-Time Multimodal Fingertip Contact Detection via Depth and Motion
         Fusion for Vision-Based Human-Computer Interaction},
  author={Toshpulatov, Mukhiddin and Lee, Wookey and Lee, Suan and Lee, Geehyuk},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision
             and Pattern Recognition (CVPR)},
  year={2026}
}
```
Acknowledgements
- Depth Anything V2 — base model architecture and pre-trained weights
- DINOv2 — vision transformer backbone
- Intel RealSense — D405 depth camera for ground truth supervision
License
This model is released under the Apache 2.0 License.