Depth Anything V2 ViT-S — D405 Close-Range Fine-tuned

Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human-Computer Interaction

CVPR 2026

Mukhiddin Toshpulatov1,2,4,5 · Wookey Lee2 · Suan Lee3 · Geehyuk Lee1

1SpaceTop, SoC, KAIST    2VoiceAI, BMSE, Inha University    3SoCS, Semyung University    4Dep. of CE, Gachon University, South Korea    5Jizzakh branch of the National University of Uzbekistan

Paper GitHub Dataset Project Page

Model Description

This is a fine-tuned Depth Anything V2 ViT-S model, adapted for metric depth estimation at close range (7–50 cm) using data from an Intel RealSense D405 stereo depth camera. The model reduces depth estimation error by 68% (12.3 mm → 3.84 mm MAE) compared to the pre-trained baseline in the operating range relevant to hand-surface interaction.

The fine-tuned model serves as the depth backbone for a real-time vision-based virtual keyboard system that achieves 94.2% contact detection accuracy at 30 FPS on consumer hardware.

Key Improvements over Base Model

| Metric | Pre-trained DA2-ViTS | This Model (fine-tuned) | Improvement |
|---|---|---|---|
| MAE (mm) | 12.3 | 3.84 | 68% reduction |
| δ1 (%) | 87.2 | 95.96 | +8.8 pp |
| RMSE (mm) | 18.4 | 4.8 | 74% reduction |
| abs_rel | 0.042 | 0.008 | 5.3× better |
| SiLog | 0.312 | 0.057 | 5.5× better |

Usage

Installation

```bash
pip install torch torchvision timm opencv-python numpy
git clone https://github.com/DepthAnything/Depth-Anything-V2.git
```

Inference

**Critical:** `max_depth` must be set to `0.5` at inference. The DPT head computes `depth = sigmoid(features) × max_depth`, so using the default `max_depth=20.0` will produce predictions 40× too large.
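To see the scale of that mismatch, here is a minimal arithmetic sketch of the head's final scaling (a stand-in for illustration, not the actual DPT head code):

```python
import math

def head_output(feature, max_depth):
    """Illustrative stand-in for the head's final scaling: sigmoid(feature) * max_depth."""
    return 1.0 / (1.0 + math.exp(-feature)) * max_depth

# The same feature value, scaled by the wrong vs. the correct max_depth:
wrong = head_output(0.0, 20.0)  # 10.0 m
right = head_output(0.0, 0.5)   # 0.25 m
print(wrong / right)            # 40.0 - every prediction inflated 40x
```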

```python
import torch
import cv2
from depth_anything_v2.dpt import DepthAnythingV2

# Initialize model with correct max_depth
model = DepthAnythingV2(
    encoder='vits',
    features=64,
    out_channels=[48, 96, 192, 384],
    max_depth=0.5  # MUST match training config
)

# Load fine-tuned weights
state_dict = torch.load(
    'depth_anything_v2_vits_d405_finetuned.pth',
    map_location='cpu'
)['model']

# Strip DDP prefix if present
cleaned = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(cleaned)
model.eval().cuda()

# Predict depth from RGB image
image = cv2.imread('your_image.png')
with torch.no_grad():
    depth_m = model.infer_image(image)  # Returns metric depth in meters

# Convert to millimeters
depth_mm = depth_m * 1000.0
print(f"Depth range: {depth_mm.min():.1f} - {depth_mm.max():.1f} mm")
```
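A quick post-hoc check on the predicted scale can catch a mis-set `max_depth` before it corrupts downstream contact logic. The helper below is illustrative (its name and thresholds are not part of the released code); it flags outputs whose median sits far above the model's 7–50 cm operating band:

```python
import numpy as np

def check_depth_scale(depth_m, expected_range=(0.07, 0.5)):
    """Flag predictions whose median is far outside the plausible metric band.

    A median tens of times above the expected range usually means the model
    was built with the default max_depth=20.0 instead of 0.5.
    """
    med = float(np.median(depth_m))
    lo, hi = expected_range
    if med > hi * 4:
        return f"suspicious: median {med:.2f} m; check max_depth (default 20.0 inflates 40x)"
    return f"ok: median {med:.3f} m within expected scale"

# Example on a synthetic prediction around 0.2 m
print(check_depth_scale(np.full((518, 518), 0.2)))
```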

Batch Inference

```python
import cv2
import torch
from torchvision.transforms import Compose, Normalize, Resize, ToTensor

# ToTensor first so Resize operates on a CHW tensor rather than a raw array;
# cv2 loads BGR, so convert to RGB before transforming.
transform = Compose([
    ToTensor(),
    Resize((518, 518), antialias=True),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# image_paths: your list of image file paths
image_list = [cv2.cvtColor(cv2.imread(p), cv2.COLOR_BGR2RGB) for p in image_paths]

# Process a batch
images = torch.stack([transform(img) for img in image_list]).cuda()
with torch.no_grad():
    predictions = model(images)  # metric depth in meters, one map per image
```

Training Details

Architecture

| Component | Specification |
|---|---|
| Encoder | ViT-S/14 (DINOv2 backbone) |
| Decoder | DPT head (features=64) |
| Output channels | [48, 96, 192, 384] |
| Depth activation | sigmoid × max_depth (0.5 m) |
| Parameters | ~24.8M |

Training Configuration

| Parameter | Value | Reference |
|---|---|---|
| Base checkpoint | depth_anything_v2_vits.pth (relative depth) | |
| Loss function | SiLog (Scale-Invariant Logarithmic) | Paper §4.1 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999, wd=0.01) | Paper §4.1 |
| Encoder learning rate | 5 × 10⁻⁶ | Supp. §6 |
| Decoder learning rate | 5 × 10⁻⁵ (10× encoder) | Supp. §6 |
| LR schedule | Cosine annealing | Paper §4.1 |
| Effective batch size | 16 (bs=4 × grad_accum=4) | Paper §4.1 |
| Epochs | 80 | Paper §4.1 |
| Total optimizer steps | ~200K | Paper §4.1 |
| Input resolution | 518 × 518 | DA2 default |
| Depth range | 0.001 m – 0.5 m | Paper §3.1 |
| Augmentation | Random horizontal flip, color jitter | Paper §4.2 |
| Normalization | ImageNet mean/std | DA2 standard |
| Hardware | 1× NVIDIA RTX 3090 (24 GB) | |
| Training time | ~8.2 hours | |
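The "~200K optimizer steps" figure can be cross-checked from the train-set size and effective batch size quoted on this card:

```python
# Cross-check "~200K optimizer steps" using values from this card.
train_frames = 42_640    # training split size
effective_batch = 16     # bs=4 x grad_accum=4
epochs = 80

steps_per_epoch = train_frames // effective_batch
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2665 213200, i.e. ~200K
```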

Why Relative Pretrained Weights (Not Metric)?

The base depth_anything_v2_vits.pth (relative depth) was chosen over depth_anything_v2_metric_vits.pth (Hypersim-pretrained metric) because:

  • The metric checkpoint is trained on indoor scenes at 0.5–20 m range, which introduces strong bias away from our 7–50 cm operating range
  • The relative checkpoint provides a better feature foundation for domain adaptation to close-range hand-surface geometry
  • Empirically, fine-tuning from relative weights converged faster and achieved lower validation loss

Training Dataset

D405 Hand-Surface Depth Dataset — 53,300 RGB-depth pairs from 15 participants at four camera angles (30°, 45°, 60°, 90°), captured with Intel RealSense D405.

| Split | Frames | Participants |
|---|---|---|
| Train | 42,640 (80%) | P01, P03, P04, P06, P07, P09–P13, P18 + supplementary |
| Validation | 5,330 (10%) | P05, P16 + supplementary |
| Test | 5,330 (10%) | P08, P17 + supplementary |

Splits are participant-stratified (no identity leakage).
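The split sizes are consistent with an exact 80/10/10 partition of the 53,300 frames:

```python
# Verify the 80/10/10 split arithmetic from the table above.
total = 53_300
train, val, test = 42_640, 5_330, 5_330

assert train + val + test == total
print(train * 100 // total, val * 100 // total, test * 100 // total)  # 80 10 10
```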

Evaluation

Depth Metrics on Test Set

Evaluated on 9,021 held-out frames from 2 participants (P05, P08) across all 4 camera angles.

| Metric | Pre-trained | Fine-tuned (this model) |
|---|---|---|
| MAE (mm) | 12.3 | 3.84 |
| δ1 (%) | 87.2 | 95.96 |
| RMSE (mm) | 18.4 | 4.8 |
| abs_rel | 0.042 | 0.008 |
| SiLog | 0.312 | 0.057 |

Downstream Task: Contact Detection

When used as the depth backbone for fingertip contact detection with a velocity-gated hysteresis state machine:

| Metric | Value |
|---|---|
| Contact detection accuracy | 94.2% |
| F1-score | 94.4% |
| False positive rate | 4.2% |
| Typing speed (WPM) | 45.6 |
| Character error rate | 3.1% |
| Inference speed | 30 FPS (RTX 3060 Ti) |
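For intuition, a velocity-gated hysteresis state machine of the kind named above can be sketched as follows. The thresholds (`ENTER_MM`, `EXIT_MM`, `VEL_MM_S`) are illustrative assumptions, not the paper's tuned values:

```python
ENTER_MM = 5.0    # fingertip-surface distance to enter CONTACT (assumed)
EXIT_MM = 10.0    # larger distance required to leave CONTACT (hysteresis gap)
VEL_MM_S = 50.0   # approach-velocity gate for contact onset (assumed)

class ContactFSM:
    """Two-state contact detector: velocity-gated entry, hysteretic exit."""

    def __init__(self):
        self.in_contact = False

    def update(self, dist_mm, vel_mm_s):
        """dist_mm: fingertip-to-surface distance; vel_mm_s: velocity toward surface."""
        if not self.in_contact:
            # Require both proximity and sufficient approach speed to trigger,
            # which suppresses false positives from slow hovering.
            if dist_mm < ENTER_MM and vel_mm_s > VEL_MM_S:
                self.in_contact = True
        else:
            # Release only once the finger has clearly lifted (hysteresis),
            # which prevents chatter near the threshold.
            if dist_mm > EXIT_MM:
                self.in_contact = False
        return self.in_contact

fsm = ContactFSM()
states = [fsm.update(d, v) for d, v in [(30, 10), (4, 80), (6, 0), (12, -40)]]
print(states)  # [False, True, True, False]
```

The two-threshold design means a fingertip resting at ~6 mm stays "in contact" until it lifts past 10 mm, rather than flickering on and off.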

Model Files

| File | Size | Description |
|---|---|---|
| depth_anything_v2_vits_d405_finetuned.pth | ~284 MB | Fine-tuned checkpoint (contains model, optimizer, epoch, previous_best) |
| config.json | | Model configuration |

Checkpoint Contents

```python
checkpoint = torch.load('depth_anything_v2_vits_d405_finetuned.pth')
checkpoint.keys()  # ['model', 'optimizer', 'epoch', 'previous_best']

# To extract model weights only:
model_weights = checkpoint['model']
```

Limitations

  • Depth range: Optimized for 7–50 cm. Performance degrades significantly outside this range.
  • Scene type: Trained on hand-surface typing scenes. May not generalize well to arbitrary indoor/outdoor scenes.
  • max_depth dependency: Inference MUST use max_depth=0.5. Mismatched values produce incorrect depth scales.
  • Camera specificity: Depth supervision comes from Intel RealSense D405. Performance may vary when applied to depth data from other sensors.
  • Lighting: Trained primarily under controlled LED lighting (~5000K, 800+ lux). Performance in extreme lighting conditions is not validated.

Intended Use

Primary Use Cases

  • Real-time fingertip contact detection for vision-based virtual keyboards
  • Close-range metric depth estimation for hand-object interaction
  • Monocular depth estimation in tabletop AR/VR scenarios
  • Research on depth model adaptation for specialized domains

Out-of-Scope Use

  • General-purpose depth estimation at arbitrary ranges
  • Outdoor or long-range depth estimation
  • Safety-critical applications without additional validation

Citation

```bibtex
@inproceedings{toshpulatov2026realtime,
  title={Real-Time Multimodal Fingertip Contact Detection via Depth and Motion
         Fusion for Vision-Based Human-Computer Interaction},
  author={Toshpulatov, Mukhiddin and Lee, Wookey and Lee, Suan and Lee, Geehyuk},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision
             and Pattern Recognition (CVPR)},
  year={2026}
}
```

Acknowledgements

License

This model is released under the Apache 2.0 License.
