Depth Anything V2 ViT-S — D405 Close-Range Fine-tuned
Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human-Computer Interaction
CVPR 2026
Mukhiddin Toshpulatov1,2,4,5 · Wookey Lee2 · Suan Lee3 · Geehyuk Lee1
1SpaceTop, SoC, KAIST 2VoiceAI, BMSE, Inha University 3SoCS, Semyung University 4Dep. of CE, Gachon University, South Korea 5Jizzakh branch of the National University of Uzbekistan
Model Description
This is a fine-tuned Depth Anything V2 ViT-S model, adapted for metric depth estimation at close range (7–50 cm) using data from an Intel RealSense D405 stereo depth camera. The model reduces depth estimation error by 68% (12.3 mm → 3.84 mm MAE) compared to the pre-trained baseline in the operating range relevant to hand-surface interaction.
The fine-tuned model serves as the depth backbone for a real-time vision-based virtual keyboard system that achieves 94.2% contact detection accuracy at 30 FPS on consumer hardware.
Key Improvements over Base Model
| Metric | Pre-trained DA2-ViTS | This Model (fine-tuned) | Improvement |
|---|---|---|---|
| MAE (mm) | 12.3 | 3.84 | 68% reduction |
| δ1 (%) | 87.2 | 95.96 | +8.8 pp |
| RMSE (mm) | 18.4 | 4.8 | 74% reduction |
| abs_rel | 0.042 | 0.008 | 5.3× better |
| SiLog | 0.312 | 0.057 | 5.5× better |
Usage
Installation
```bash
pip install torch torchvision timm opencv-python numpy
git clone https://github.com/DepthAnything/Depth-Anything-V2.git
```
Inference
Critical: `max_depth` must be set to `0.5` at inference. The DPT head computes `depth = sigmoid(features) × max_depth`. Using the default `max_depth=20.0` will produce predictions 40× too large.
```python
import torch
import cv2
import numpy as np

from depth_anything_v2.dpt import DepthAnythingV2

# Initialize model with correct max_depth
model = DepthAnythingV2(
    encoder='vits',
    features=64,
    out_channels=[48, 96, 192, 384],
    max_depth=0.5  # MUST match training config
)

# Load fine-tuned weights
state_dict = torch.load(
    'depth_anything_v2_vits_d405_finetuned.pth',
    map_location='cpu'
)['model']

# Strip DDP prefix if present
cleaned = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(cleaned)
model.eval().cuda()

# Predict depth from RGB image
image = cv2.imread('your_image.png')
with torch.no_grad():
    depth_m = model.infer_image(image)  # Returns metric depth in meters

# Convert to millimeters
depth_mm = depth_m * 1000.0
print(f"Depth range: {depth_mm.min():.1f} - {depth_mm.max():.1f} mm")
```
Batch Inference
```python
from torchvision.transforms import Compose, Normalize, Resize, ToTensor

# Note: Resize and ToTensor expect PIL images; convert cv2 (BGR) frames first
transform = Compose([
    Resize((518, 518)),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Process a batch (image_list: list of PIL images)
images = torch.stack([transform(img) for img in image_list]).cuda()
with torch.no_grad():
    predictions = model(images)  # [B, 1, H, W] metric depth in meters
```
Training Details
Architecture
| Component | Specification |
|---|---|
| Encoder | ViT-S/14 (DINOv2 backbone) |
| Decoder | DPT head (features=64) |
| Output channels | [48, 96, 192, 384] |
| Depth activation | sigmoid × max_depth (0.5 m) |
| Parameters | ~24.8M |
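The depth activation in the table explains the `max_depth` warning above: the decoder's unbounded output is squashed through a sigmoid and scaled by `max_depth`, so an inference-time `max_depth` mismatch rescales every prediction by the same factor. A minimal pure-Python sketch of that final step (the real head applies it per-pixel to a tensor):

```python
import math

MAX_DEPTH = 0.5  # meters; must match the training config

def depth_activation(feature: float, max_depth: float = MAX_DEPTH) -> float:
    """Map an unbounded decoder activation to metric depth in (0, max_depth)."""
    return max_depth / (1.0 + math.exp(-feature))

# A zero activation lands at exactly half the range: 0.25 m
half_range = depth_activation(0.0)

# The same activation under the default max_depth=20.0 is 40x larger,
# which is exactly the failure mode the inference note warns about
scale_error = depth_activation(0.0, max_depth=20.0) / half_range
```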
Training Configuration
| Parameter | Value | Reference |
|---|---|---|
| Base checkpoint | depth_anything_v2_vits.pth (relative depth) | — |
| Loss function | SiLog (Scale-Invariant Logarithmic) | Paper §4.1 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999, wd=0.01) | Paper §4.1 |
| Encoder learning rate | 5 × 10⁻⁶ | Supp. §6 |
| Decoder learning rate | 5 × 10⁻⁵ (10× encoder) | Supp. §6 |
| LR schedule | Cosine annealing | Paper §4.1 |
| Effective batch size | 16 (bs=4 × grad_accum=4) | Paper §4.1 |
| Epochs | 80 | Paper §4.1 |
| Total optimizer steps | ~200K | Paper §4.1 |
| Input resolution | 518 × 518 | DA2 default |
| Depth range | 0.001 m – 0.5 m | Paper §3.1 |
| Augmentation | Random horizontal flip, color jitter | Paper §4.2 |
| Normalization | ImageNet mean/std | DA2 standard |
| Hardware | 1× NVIDIA RTX 3090 (24 GB) | — |
| Training time | ~8.2 hours | — |
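The SiLog objective penalizes log-depth errors while discounting a shared scale offset. A NumPy sketch under common assumptions (the variance weight `lam=0.85` is a conventional choice, not confirmed by the paper; the validity mask uses the 0.001–0.5 m depth range from the table):

```python
import numpy as np

def silog_loss(pred: np.ndarray, target: np.ndarray, lam: float = 0.85) -> float:
    """Scale-invariant logarithmic loss over valid ground-truth pixels.
    lam=1.0 makes the loss fully invariant to a global depth scale."""
    valid = (target > 1e-3) & (target < 0.5)  # 0.001-0.5 m depth range
    d = np.log(pred[valid]) - np.log(target[valid])
    # Clamp at zero to guard against tiny negative values from float error
    return float(np.sqrt(max((d ** 2).mean() - lam * d.mean() ** 2, 0.0)))

gt = np.random.default_rng(0).uniform(0.07, 0.5, size=1000)
perfect = silog_loss(gt, gt)              # exact prediction -> 0.0
scale_only = silog_loss(2 * gt, gt, 1.0)  # pure 2x scale error -> ~0 at lam=1
```

At `lam=0.85` the same 2× scale error is penalized, which keeps the loss anchored to metric scale rather than relative depth.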
Why Relative Pretrained Weights (Not Metric)?
The base depth_anything_v2_vits.pth (relative depth) was chosen over depth_anything_v2_metric_vits.pth (Hypersim-pretrained metric) because:
- The metric checkpoint is trained on indoor scenes at 0.5–20 m range, which introduces strong bias away from our 7–50 cm operating range
- The relative checkpoint provides a better feature foundation for domain adaptation to close-range hand-surface geometry
- Empirically, fine-tuning from relative weights converged faster and achieved lower validation loss
Training Dataset
D405 Hand-Surface Depth Dataset — 53,300 RGB-depth pairs from 15 participants at four camera angles (30°, 45°, 60°, 90°), captured with Intel RealSense D405.
| Split | Frames | Participants |
|---|---|---|
| Train | 42,640 (80%) | P01, P03, P04, P06, P07, P09–P13, P18 + supplementary |
| Validation | 5,330 (10%) | P05, P16 + supplementary |
| Test | 5,330 (10%) | P08, P17 + supplementary |
Splits are participant-stratified (no identity leakage).
Evaluation
Depth Metrics on Test Set
Evaluated on 9,021 held-out frames from 2 participants (P05, P08) across all 4 camera angles.
| Metric | Pre-trained | Fine-tuned (this model) |
|---|---|---|
| MAE (mm) | 12.3 | 3.84 |
| δ1 (%) | 87.2 | 95.96 |
| RMSE (mm) | 18.4 | 4.8 |
| abs_rel | 0.042 | 0.008 |
| SiLog | 0.312 | 0.057 |
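These metrics follow the standard Eigen-style definitions. A NumPy sketch of how they are typically computed over valid ground-truth pixels (the exact validity mask used in the paper is an assumption):

```python
import numpy as np

def depth_metrics(pred_mm: np.ndarray, gt_mm: np.ndarray) -> dict:
    """MAE, RMSE, delta1, and abs_rel over pixels with valid ground truth."""
    valid = gt_mm > 0  # D405 reports 0 for pixels with no depth
    pred, gt = pred_mm[valid], gt_mm[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        'MAE (mm)': float(np.abs(pred - gt).mean()),
        'RMSE (mm)': float(np.sqrt(((pred - gt) ** 2).mean())),
        'delta1 (%)': float((ratio < 1.25).mean() * 100.0),
        'abs_rel': float((np.abs(pred - gt) / gt).mean()),
    }

gt = np.full((4, 4), 100.0)  # flat surface at 100 mm
pred = gt + 2.0              # uniform +2 mm error
metrics = depth_metrics(pred, gt)  # MAE 2.0, delta1 100.0
```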
Downstream Task: Contact Detection
When used as the depth backbone for fingertip contact detection with a velocity-gated hysteresis state machine:
| Metric | Value |
|---|---|
| Contact detection accuracy | 94.2% |
| F1-score | 94.4% |
| False positive rate | 4.2% |
| Typing speed (WPM) | 45.6 |
| Character error rate | 3.1% |
| Inference speed | 30 FPS (RTX 3060 Ti) |
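The velocity-gated hysteresis idea can be sketched as a two-state machine: a touch fires only when the fingertip-surface gap drops below an enter threshold while the fingertip is descending, and it releases only above a higher exit threshold, which suppresses chatter near the surface. All thresholds below are illustrative placeholders, not the paper's tuned values:

```python
from typing import Optional

class ContactDetector:
    """Two-state (open/contact) machine with hysteresis and a velocity gate.
    Thresholds are illustrative, not the paper's tuned values."""

    def __init__(self, touch_mm: float = 4.0, release_mm: float = 7.0):
        self.touch_mm = touch_mm      # enter contact below this gap
        self.release_mm = release_mm  # leave contact only above this gap
        self.in_contact = False
        self._prev_gap: Optional[float] = None

    def update(self, gap_mm: float) -> bool:
        # Per-frame gap velocity (mm/frame); negative means descending
        vel = 0.0 if self._prev_gap is None else gap_mm - self._prev_gap
        self._prev_gap = gap_mm
        if not self.in_contact:
            # Velocity gate: only a descending fingertip may trigger contact
            if gap_mm < self.touch_mm and vel < 0.0:
                self.in_contact = True
        elif gap_mm > self.release_mm:
            # Hysteresis: release requires clearing the higher threshold
            self.in_contact = False
        return self.in_contact

det = ContactDetector()
states = [det.update(g) for g in (20.0, 10.0, 3.0, 5.0, 8.0)]
# states: [False, False, True, True, False]
```

Note how the 5.0 mm frame stays in contact because it sits between the two thresholds: that gap between enter and exit levels is what prevents a depth estimate jittering around a single threshold from producing spurious key repeats.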
Model Files
| File | Size | Description |
|---|---|---|
| `depth_anything_v2_vits_d405_finetuned.pth` | ~284 MB | Fine-tuned checkpoint (contains `model`, `optimizer`, `epoch`, `previous_best`) |
| `config.json` | — | Model configuration |
Checkpoint Contents
```python
import torch

checkpoint = torch.load(
    'depth_anything_v2_vits_d405_finetuned.pth',
    map_location='cpu'
)
checkpoint.keys()  # ['model', 'optimizer', 'epoch', 'previous_best']

# To extract model weights only:
model_weights = checkpoint['model']
```
Limitations
- Depth range: Optimized for 7–50 cm. Performance degrades significantly outside this range.
- Scene type: Trained on hand-surface typing scenes. May not generalize well to arbitrary indoor/outdoor scenes.
- max_depth dependency: Inference MUST use
max_depth=0.5. Mismatched values produce incorrect depth scales. - Camera specificity: Depth supervision comes from Intel RealSense D405. Performance may vary when applied to depth data from other sensors.
- Lighting: Trained primarily under controlled LED lighting (~5000K, 800+ lux). Performance in extreme lighting conditions is not validated.
Intended Use
Primary Use Cases
- Real-time fingertip contact detection for vision-based virtual keyboards
- Close-range metric depth estimation for hand-object interaction
- Monocular depth estimation in tabletop AR/VR scenarios
- Research on depth model adaptation for specialized domains
Out-of-Scope Use
- General-purpose depth estimation at arbitrary ranges
- Outdoor or long-range depth estimation
- Safety-critical applications without additional validation
Citation
```bibtex
@inproceedings{toshpulatov2026realtime,
  title={Real-Time Multimodal Fingertip Contact Detection via Depth and Motion
         Fusion for Vision-Based Human-Computer Interaction},
  author={Toshpulatov, Mukhiddin and Lee, Wookey and Lee, Suan and Lee, Geehyuk},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision
             and Pattern Recognition (CVPR)},
  year={2026}
}
```
Acknowledgements
- Depth Anything V2 — base model architecture and pre-trained weights
- DINOv2 — vision transformer backbone
- Intel RealSense — D405 depth camera for ground truth supervision
License
This model is released under the Apache 2.0 License.