
# YOLO26 Hand Pose & Face Detection Models

Custom-trained YOLO26 models for real-time hand tracking and face detection. Built for touchless kiosk interaction via dwell-based cursor control.

## Models

| Model | Task | Keypoints | ONNX (FP16) | ONNX (FP32) | Output Shape |
|---|---|---|---|---|---|
| `yolo26_hand_pose` | Hand detection + pose | 21 | 20.5 MB | 41.0 MB | (1, 300, 69) |
| `yolo26_face` | Face detection | n/a | 18.3 MB | 36.4 MB | (1, 300, 6) |

Both models use YOLO26's end-to-end NMS-free architecture: boxes are emitted directly in xyxy pixel format, so no NMS post-processing is needed.

## Hand Pose Output Format

Each of the 300 detections contains 69 values:

```
[x1, y1, x2, y2, confidence, class_id, kp1_x, kp1_y, kp1_vis, ..., kp21_x, kp21_y, kp21_vis]
 └─ bbox (xyxy) ─┘   │          │      └───────────────── 21 keypoints × 3 ────────────────┘
                     │          └─ always 0 (single class: hand)
                     └─ detection confidence
```
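Concretely, one detection row can be unpacked like this. A minimal sketch in plain Python; `parse_hand_detection` is a hypothetical helper for illustration, not part of the repo:

```python
# Sketch: unpack one raw 69-value detection row from the (1, 300, 69) output.
# `row` is any sequence of 69 floats taken from the model output.

def parse_hand_detection(row):
    """Split a detection into bbox (xyxy), confidence, class id, keypoints."""
    x1, y1, x2, y2, conf, cls = row[:6]
    # The remaining 63 values are 21 keypoints of (x, y, visibility).
    kpts = [tuple(row[6 + i * 3 : 9 + i * 3]) for i in range(21)]
    return (x1, y1, x2, y2), conf, int(cls), kpts
```

In practice you would still apply a confidence threshold over the 300 rows, since the fixed-size output includes low-confidence candidate slots.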

## Face Detection Output Format

Each of the 300 detections contains 6 values:

```
[x1, y1, x2, y2, confidence, class_id]
 └─ bbox (xyxy) ─┘   │          └─ always 0 (single class: face)
                     └─ detection confidence
```

## 21 Hand Keypoints

```
    8 (index tip) ← used for cursor
    |
    7
    |      12    16    20
    6      |     |     |
    |      11    15    19
    5      |     |     |
    |  4   10    14    18
    |  |   |     |     |
    |  3   9     13    17
    |  |   |     |     |
    |  2   |     |     |
    |  |   |     |     |
    |  1   |     |     |
    └──┴───┴─────┴─────┘
              0 (wrist)
```
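The indices follow the MediaPipe hand-landmark ordering (the dataset's keypoints were generated with MediaPipe). A small lookup table makes the layout explicit; the joint names below are abbreviated forms of MediaPipe's, listed as a convenience:

```python
# MediaPipe hand-landmark ordering: wrist, then 4 joints per finger.
# Note the thumb uses CMC/MCP/IP/TIP; the other fingers use MCP/PIP/DIP/TIP.
KEYPOINT_NAMES = (
    ["WRIST"]
    + ["THUMB_CMC", "THUMB_MCP", "THUMB_IP", "THUMB_TIP"]
    + [f"{finger}_{joint}"
       for finger in ("INDEX", "MIDDLE", "RING", "PINKY")
       for joint in ("MCP", "PIP", "DIP", "TIP")]
)

INDEX_TIP = KEYPOINT_NAMES.index("INDEX_TIP")  # 8, the cursor keypoint
```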

## Training Results

### Hand Pose

| Setting | Value |
|---|---|
| Base model | `yolo26s-pose.pt` (Ultralytics, pretrained on COCO-pose) |
| Architecture | YOLO26s-pose (10.6M params, 25.0 GFLOPs) |
| Dataset | Ultralytics Hand Keypoints (18,776 train / 7,847 val images) |
| Keypoints | 21 per hand (wrist + 4 per finger), generated with Google MediaPipe |
| GPU | NVIDIA H200 NVL (143 GB VRAM) on RunPod |
| Batch size | 512 |
| Epochs | 100 |
| Optimizer | AdamW (lr=0.002, momentum=0.9) |
| Training time | 1h 45min |
| Training cost | ~$6 (H200 NVL @ $3.40/hr) |
| Framework | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |

Final metrics (epoch 100/100):

| Metric | Score |
|---|---|
| Pose mAP50 | 0.942 |
| Pose mAP50-95 | 0.843 |
| Box mAP50 | 0.993 |
| Box mAP50-95 | 0.912 |

### Face Detection

| Setting | Value |
|---|---|
| Base model | `yolo26s.pt` (Ultralytics, pretrained on COCO) |
| Architecture | YOLO26s (9.6M params, 20.5 GFLOPs) |
| Dataset | WiderFace (12,876 train / 3,222 val images, downloaded via the HuggingFace CUHK-CSE mirror) |
| GPU | NVIDIA H200 NVL (143 GB VRAM) on RunPod |
| Batch size | 64 |
| Epochs | 50 |
| Optimizer | MuSGD (lr=0.01, momentum=0.9) |
| Training time | 54min |
| Training cost | ~$3 (H200 NVL @ $3.40/hr) |
| Framework | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |

Final metrics (epoch 50/50):

| Metric | Score |
|---|---|
| Box mAP50 | 0.744 |
| Box mAP50-95 | 0.413 |
| Precision | 0.861 |
| Recall | 0.656 |

**Note:** WiderFace scores appear low because the dataset includes extremely small faces (crowds, distant people). For the kiosk use case (a single person close to the camera), detection is highly reliable.
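Given that, a kiosk pipeline can simply discard tiny background faces by bounding-box area. A minimal sketch; the 2% area threshold is an illustrative assumption, not a tuned value:

```python
def filter_small_faces(boxes, frame_w, frame_h, min_area_frac=0.02):
    """Keep only xyxy boxes whose area exceeds min_area_frac of the frame."""
    frame_area = frame_w * frame_h
    kept = []
    for x1, y1, x2, y2 in boxes:
        if (x2 - x1) * (y2 - y1) >= min_area_frac * frame_area:
            kept.append((x1, y1, x2, y2))
    return kept
```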

## Total Training Cost

| Model | GPU | Time | Cost |
|---|---|---|---|
| Hand pose (100 epochs) | H200 NVL (143 GB) | 1h 45min | ~$6 |
| Face detection (50 epochs) | H200 NVL (143 GB) | 54min | ~$3 |
| **Total** | | 2h 39min | ~$9 |
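The total is straightforward to verify against the quoted $3.40/hr rate:

```python
# Sanity-check the cost table at the quoted $3.40/hr H200 NVL rate.
RATE_PER_HOUR = 3.40
hand_hours = 1 + 45 / 60   # 1h 45min
face_hours = 54 / 60       # 54min
total_cost = RATE_PER_HOUR * (hand_hours + face_hours)
# total_cost is about 9.01, matching the ~$9 total above
```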

## Quick Start

### Download Pre-trained Models

```bash
# Clone with Git LFS
git clone https://github.com/YOUR_USERNAME/yolo26-training.git
cd yolo26-training

# Models are in models/ (tracked with Git LFS)
ls -lh models/
```

### Export .pt to ONNX

```bash
pip install ultralytics onnx onnxslim

# Exports both FP32 and FP16 versions
bash scripts/export_onnx.sh checkpoints/yolo26_hand_pose.pt
bash scripts/export_onnx.sh checkpoints/yolo26_face.pt
```

### Inference (Python)

```python
from ultralytics import YOLO

# Hand pose
model = YOLO("checkpoints/yolo26_hand_pose.pt")
results = model.predict("image.jpg")

for r in results:
    keypoints = r.keypoints.xy   # [N, 21, 2] pixel coordinates
    boxes = r.boxes.xyxy         # [N, 4] bounding boxes
    confs = r.boxes.conf         # [N] confidence scores

# Face detection
model = YOLO("checkpoints/yolo26_face.pt")
results = model.predict("photo.jpg")
has_face = len(results[0].boxes) > 0
```
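The index-tip keypoint from the hand-pose results can drive the dwell-based cursor mentioned above. A framework-free sketch of the dwell logic; the 40 px radius and 1.0 s dwell time are illustrative assumptions, not tuned values:

```python
class DwellClicker:
    """Fire a 'click' when the cursor stays within a radius for dwell_s seconds."""

    def __init__(self, radius_px=40, dwell_s=1.0):
        self.radius_px = radius_px
        self.dwell_s = dwell_s
        self.anchor = None       # (x, y) where the current dwell started
        self.start_t = None      # timestamp when the current dwell started

    def update(self, x, y, t):
        """Feed the cursor position each frame; returns True once per click."""
        if self.anchor is None or (
            (x - self.anchor[0]) ** 2 + (y - self.anchor[1]) ** 2
            > self.radius_px ** 2
        ):
            self.anchor, self.start_t = (x, y), t   # moved: restart the dwell
            return False
        if t - self.start_t >= self.dwell_s:
            self.anchor, self.start_t = None, None  # fire once, then reset
            return True
        return False
```

Feed `update()` the (ideally smoothed) index-tip position each frame; it returns True exactly once when the position has stayed within the radius for the dwell time.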

### Inference (C# / ONNX Runtime + DirectML)

```csharp
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Load model with the DirectML execution provider
var options = new SessionOptions();
options.AppendExecutionProvider_DML();
using var session = new InferenceSession("yolo26_hand_pose.onnx", options);

// Run inference -> output shape [1, 300, 69]
// `inputs` holds one (1, 3, 640, 640) float tensor (see ONNX Compatibility)
using var results = session.Run(inputs);
var tensor = results.First().AsTensor<float>();

// Parse detection i: [x1, y1, x2, y2, conf, class, kp1_x, kp1_y, kp1_vis, ...]
var conf = tensor[0, i, 4];         // confidence
var indexTipX = tensor[0, i, 30];   // keypoint 8 (index tip) x: offset 6 + 8*3
var indexTipY = tensor[0, i, 31];   // keypoint 8 (index tip) y
```

## Train from Scratch

### On RunPod (or any cloud GPU)

```bash
pip install ultralytics onnxruntime-gpu onnx onnxslim

# Hand pose (~1h 45min on H200, ~3h on A40)
bash scripts/train_hand_pose.sh

# Face detection (~54min on H200, ~2h on A40)
bash scripts/train_face_detect.sh
```

### With Docker

```bash
docker build -t yolo26-training -f docker/Dockerfile .
docker run --gpus all -v "$(pwd)/models:/workspace/output" yolo26-training
```

### Recommended GPUs

| GPU | VRAM | Hand Pose | Face Detect | Total Cost |
|---|---|---|---|---|
| H200 NVL | 143 GB | 1h 45min | 54min | ~$9 |
| H100 SXM | 80 GB | ~2h 30min | ~1h 15min | ~$10 |
| A100 | 80 GB | ~3h | ~1h 30min | ~$6 |
| A40 | 48 GB | ~4h | ~2h | ~$3 |

## Training Tips

- **Hand pose:** batch=512 works on 48 GB+ VRAM. Lower to 128 on 24 GB GPUs.
- **Face detect:** batch=64 recommended. WiderFace has 100+ faces per image, so a higher batch causes OOM.
- **Early stopping:** both scripts use patience to stop early if metrics plateau.
- **WiderFace download:** the script auto-downloads from the HuggingFace CUHK-CSE mirror (the Google Drive links are unreliable).

## Project Structure

```
yolo26-training/
├── README.md
├── .gitignore
├── .gitattributes             # Git LFS tracking for .onnx and .pt files
├── models/                    # Exported ONNX models
│   ├── yolo26_hand_pose_fp16.onnx   # FP16 (20.5 MB), recommended
│   ├── yolo26_hand_pose_fp32.onnx   # FP32 (41.0 MB)
│   ├── yolo26_face_fp16.onnx        # FP16 (18.3 MB), recommended
│   └── yolo26_face_fp32.onnx        # FP32 (36.4 MB)
├── checkpoints/               # PyTorch checkpoints
│   ├── yolo26_hand_pose.pt          # Hand pose (19.3 MB stripped)
│   └── yolo26_face.pt               # Face detect (19.3 MB stripped)
├── scripts/
│   ├── train_hand_pose.sh     # Hand pose training (auto-downloads dataset)
│   ├── train_face_detect.sh   # Face detection training (auto-downloads WiderFace)
│   └── export_onnx.sh         # PT -> ONNX export (FP32 + FP16)
└── docker/
    └── Dockerfile             # GPU training container
```

## Technical Details

### Why YOLO26?

| Feature | Benefit |
|---|---|
| NMS-free (end-to-end) | No post-processing, consistent latency |
| RLE keypoints | More accurate keypoint localization |
| DFL removal | Cleaner ONNX graph, better DirectML/TensorRT compatibility |
| 43% faster on CPU | Better fallback on weak GPUs |
| Non-human keypoint support | Better for hand keypoints (no human-body bias) |

### ONNX Compatibility

| Property | Value |
|---|---|
| Opset | 17 |
| FP16 | Supported natively by DirectML, TensorRT, CoreML |
| Input | (1, 3, 640, 640) NCHW, RGB, normalized to [0, 1] |
| Tested on | ONNX Runtime 1.24 + DirectML (Windows), CPU (Linux) |
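The input contract can be met with a small preprocessing helper. A dependency-free sketch using nested lists for illustration (a real pipeline would use numpy or OpenCV, and must also resize or letterbox the frame to 640x640 first):

```python
def to_nchw_normalized(img):
    """Convert an HWC, RGB, uint8 image (nested lists) to a (1, 3, H, W)
    float structure scaled to [0, 1], matching the model's input contract."""
    h, w = len(img), len(img[0])
    return [[[[img[y][x][c] / 255.0 for x in range(w)]
              for y in range(h)]
             for c in range(3)]]
```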

### FP16 vs FP32

| | FP16 | FP32 |
|---|---|---|
| Size | ~50% smaller | Full size |
| Speed | Faster on GPU (native FP16 compute) | Standard |
| Accuracy | <0.1% difference | Baseline |
| Recommended for | Deployment | Debugging / fine-tuning |

### Datasets

| Dataset | Images | Annotations | Source |
|---|---|---|---|
| Hand Keypoints | 26,768 | 21 keypoints/hand (MediaPipe) | Ultralytics |
| WiderFace | 32,203 | 393,703 face bboxes | CUHK via HuggingFace |

## License


Trained on April 3, 2026 using Ultralytics 8.4.33 + PyTorch 2.8.0 + CUDA 12.8 on RunPod.
