# YOLO26 Hand Pose & Face Detection Models

Custom-trained YOLO26 models for real-time hand tracking and face detection. Built for touchless kiosk interaction via dwell-based cursor control.
## Models

| Model | Task | Keypoints | ONNX (FP16) | ONNX (FP32) | Output Shape |
|---|---|---|---|---|---|
| `yolo26_hand_pose` | Hand detection + pose | 21 | 20.5 MB | 41.0 MB | `(1, 300, 69)` |
| `yolo26_face` | Face detection | — | 18.3 MB | 36.4 MB | `(1, 300, 6)` |
Both models use YOLO26's end-to-end NMS-free architecture. Output boxes are in `xyxy` format — no post-processing needed.
## Hand Pose Output Format

Each of the 300 detections contains 69 values:

```
[x1, y1, x2, y2, confidence, class_id, kp1_x, kp1_y, kp1_vis, ..., kp21_x, kp21_y, kp21_vis]
 └─── bbox (xyxy) ──┘  │        │      └───────────── 21 keypoints × 3 ────────────────┘
                       │        └── always 0 (single class: hand)
                       └── detection confidence
```
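When reading the raw ONNX output directly (without the Ultralytics wrapper), a row can be unpacked as follows. This is a minimal NumPy sketch; `decode_hand_detections` is an illustrative helper, not part of this repo:

```python
import numpy as np

def decode_hand_detections(output, conf_threshold=0.5):
    """Split the (1, 300, 69) hand-pose output into boxes, confs, keypoints."""
    rows = output[0]                              # (300, 69)
    rows = rows[rows[:, 4] >= conf_threshold]     # keep confident detections
    boxes = rows[:, 0:4]                          # xyxy, pixel coords
    confs = rows[:, 4]
    keypoints = rows[:, 6:].reshape(-1, 21, 3)    # (N, 21, [x, y, vis])
    return boxes, confs, keypoints

# Demo on a synthetic output tensor: one confident detection, rest empty.
dummy = np.zeros((1, 300, 69), dtype=np.float32)
dummy[0, 0, :6] = [100, 120, 300, 340, 0.92, 0]   # bbox + conf + class
dummy[0, 0, 6 + 8 * 3] = 210.0                    # keypoint 8 (index tip) x
dummy[0, 0, 6 + 8 * 3 + 1] = 150.0                # keypoint 8 (index tip) y
boxes, confs, kps = decode_hand_detections(dummy)
print(boxes.shape, kps[0, 8])
```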
## Face Detection Output Format

Each of the 300 detections contains 6 values:

```
[x1, y1, x2, y2, confidence, class_id]
 └─── bbox (xyxy) ──┘  │        └── always 0 (single class: face)
                       └── detection confidence
```
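For the kiosk's "is someone in front of the camera?" check, a simple confidence threshold over the raw (1, 300, 6) tensor is enough. A sketch (`face_present` is an illustrative helper, not part of this repo):

```python
import numpy as np

def face_present(output, conf_threshold=0.5):
    """True if any of the 300 rows clears the confidence threshold."""
    return bool((output[0][:, 4] >= conf_threshold).any())

dummy = np.zeros((1, 300, 6), dtype=np.float32)
print(face_present(dummy))                    # False: no detections
dummy[0, 0] = [50, 60, 200, 240, 0.88, 0]
print(face_present(dummy))                    # True
```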
## 21 Hand Keypoints

```
   8 (index tip) ← used for cursor
   |
   7
   |        12   16   20
   6        |    |    |
   |        11   15   19
   5        |    |    |
   |   4    10   14   18
   |   |    |    |    |
   |   3    9    13   17
   |   |    |    |    |
   |   2    |    |    |
   |   |    |    |    |
   |   1    |    |    |
   └───┴────┴────┴────┘
      0 (wrist)
```
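Keypoint 8 (the index fingertip) drives the dwell-based cursor mentioned in the intro. A hedged sketch of how a dwell "click" could work — if the cursor stays inside a small radius for `dwell_s` seconds, fire a click. `DwellDetector` is an illustrative helper, not part of this repo:

```python
import time

class DwellDetector:
    """Fire a 'click' when the index-tip position dwells in one spot."""

    def __init__(self, radius_px=30.0, dwell_s=1.0):
        self.radius_px = radius_px
        self.dwell_s = dwell_s
        self._anchor = None   # position where the dwell timer started
        self._start = None

    def update(self, x, y, now=None):
        """Feed the latest index-tip position; returns True once per dwell."""
        now = time.monotonic() if now is None else now
        moved = self._anchor is not None and (
            (x - self._anchor[0]) ** 2 + (y - self._anchor[1]) ** 2
            > self.radius_px ** 2
        )
        if self._anchor is None or moved:
            self._anchor, self._start = (x, y), now   # restart the timer
            return False
        if now - self._start >= self.dwell_s:
            self._anchor = None   # re-arm so the next dwell can fire again
            return True
        return False

detector = DwellDetector(radius_px=30.0, dwell_s=1.0)
detector.update(100, 100, now=0.0)   # anchor set, timer starts
detector.update(104, 101, now=1.1)   # within radius, 1.1 s elapsed -> click
```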
## Training Results

### Hand Pose

| | |
|---|---|
| Base model | `yolo26s-pose.pt` (Ultralytics, pretrained on COCO-pose) |
| Architecture | YOLO26s-pose — 10.6M params, 25.0 GFLOPs |
| Dataset | Ultralytics Hand Keypoints — 18,776 train / 7,847 val images |
| Keypoints | 21 per hand (wrist + 4 per finger), generated with Google MediaPipe |
| GPU | NVIDIA H200 NVL (143 GB VRAM) on RunPod |
| Batch size | 512 |
| Epochs | 100 |
| Optimizer | AdamW (lr=0.002, momentum=0.9) |
| Training time | 1h 45min |
| Training cost | ~$6 (H200 NVL @ $3.40/hr) |
| Framework | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |
Final metrics (epoch 100/100):
| Metric | Score |
|---|---|
| Pose mAP50 | 0.942 |
| Pose mAP50-95 | 0.843 |
| Box mAP50 | 0.993 |
| Box mAP50-95 | 0.912 |
### Face Detection

| | |
|---|---|
| Base model | `yolo26s.pt` (Ultralytics, pretrained on COCO) |
| Architecture | YOLO26s — 9.6M params, 20.5 GFLOPs |
| Dataset | WiderFace — 12,876 train / 3,222 val images (downloaded via HuggingFace CUHK-CSE mirror) |
| GPU | NVIDIA H200 NVL (143 GB VRAM) on RunPod |
| Batch size | 64 |
| Epochs | 50 |
| Optimizer | MuSGD (lr=0.01, momentum=0.9) |
| Training time | 54min |
| Training cost | ~$3 (H200 NVL @ $3.40/hr) |
| Framework | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |
Final metrics (epoch 50/50):
| Metric | Score |
|---|---|
| Box mAP50 | 0.744 |
| Box mAP50-95 | 0.413 |
| Precision | 0.861 |
| Recall | 0.656 |
**Note:** WiderFace scores appear low because the dataset includes extremely small faces (crowds, distant people). For the kiosk use case (a single person close to the camera), detection is highly reliable.
## Total Training Cost

| Model | GPU | Time | Cost |
|---|---|---|---|
| Hand pose (100 epochs) | H200 NVL (143GB) | 1h 45min | ~$6 |
| Face detection (50 epochs) | H200 NVL (143GB) | 54min | ~$3 |
| **Total** | | 2h 39min | ~$9 |
## Quick Start

### Download Pre-trained Models

```bash
git clone https://github.com/YOUR_USERNAME/yolo26-training.git
cd yolo26-training
ls -lh models/
```
### Export .pt to ONNX

```bash
pip install ultralytics onnx onnxslim
bash scripts/export_onnx.sh checkpoints/yolo26_hand_pose.pt
bash scripts/export_onnx.sh checkpoints/yolo26_face.pt
```
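If you prefer the Python API over the shell script, the same export can be done with Ultralytics' `export()`. A sketch assuming the checkpoint paths above; `half=True` yields the FP16 variant:

```python
from ultralytics import YOLO

# PT -> ONNX via the Ultralytics Python API instead of scripts/export_onnx.sh.
for ckpt in ("checkpoints/yolo26_hand_pose.pt", "checkpoints/yolo26_face.pt"):
    model = YOLO(ckpt)
    model.export(format="onnx", opset=17, imgsz=640)             # FP32
    model.export(format="onnx", opset=17, imgsz=640, half=True)  # FP16
```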
### Inference (Python)

```python
from ultralytics import YOLO

# Hand pose
model = YOLO("checkpoints/yolo26_hand_pose.pt")
results = model.predict("image.jpg")
for r in results:
    keypoints = r.keypoints.xy
    boxes = r.boxes.xyxy
    confs = r.boxes.conf

# Face detection
model = YOLO("checkpoints/yolo26_face.pt")
results = model.predict("photo.jpg")
has_face = len(results[0].boxes) > 0
```
### Inference (C# / ONNX Runtime + DirectML)

```csharp
// Assumes the Microsoft.ML.OnnxRuntime.DirectML package.
var options = new SessionOptions();
options.AppendExecutionProvider_DML();

var session = new InferenceSession("yolo26_hand_pose.onnx", options);
// `inputs` is built from the preprocessed (1, 3, 640, 640) float tensor.
using var results = session.Run(inputs);
var tensor = results.First().AsTensor<float>();

// Row layout: [x1, y1, x2, y2, conf, class, kp1_x, kp1_y, kp1_vis, ...]
var conf = tensor[0, i, 4];
var indexTipX = tensor[0, i, 30];   // keypoint 8 (index tip): 6 + 8 * 3
var indexTipY = tensor[0, i, 31];
```
## Train from Scratch

### On RunPod (or any cloud GPU)

```bash
pip install ultralytics onnxruntime-gpu onnx onnxslim
bash scripts/train_hand_pose.sh
bash scripts/train_face_detect.sh
```

### With Docker

```bash
docker build -t yolo26-training -f docker/Dockerfile .
docker run --gpus all -v $(pwd)/models:/workspace/output yolo26-training
```
### Recommended GPUs

| GPU | VRAM | Hand Pose | Face Detect | Total Cost |
|---|---|---|---|---|
| H200 NVL | 143 GB | 1h 45min | 54min | ~$9 |
| H100 SXM | 80 GB | ~2h 30min | ~1h 15min | ~$10 |
| A100 | 80 GB | ~3h | ~1h 30min | ~$6 |
| A40 | 48 GB | ~4h | ~2h | ~$3 |
## Training Tips

- **Hand pose:** `batch=512` works on 48GB+ VRAM. Lower to 128 on 24GB GPUs.
- **Face detect:** `batch=64` recommended. WiderFace has 100+ faces/image — higher batch sizes cause OOM.
- **Early stopping:** Both scripts set `patience` to stop early if metrics plateau.
- **WiderFace download:** The script auto-downloads from the HuggingFace CUHK-CSE mirror (Google Drive links are unreliable).
## Project Structure

```
yolo26-training/
├── README.md
├── .gitignore
├── .gitattributes                     # Git LFS tracking for .onnx and .pt files
├── models/                            # Exported ONNX models
│   ├── yolo26_hand_pose_fp16.onnx    # FP16 (20.5 MB) — recommended
│   ├── yolo26_hand_pose_fp32.onnx    # FP32 (41.0 MB)
│   ├── yolo26_face_fp16.onnx         # FP16 (18.3 MB) — recommended
│   └── yolo26_face_fp32.onnx         # FP32 (36.4 MB)
├── checkpoints/                       # PyTorch checkpoints
│   ├── yolo26_hand_pose.pt           # Hand pose (19.3 MB stripped)
│   └── yolo26_face.pt                # Face detect (19.3 MB stripped)
├── scripts/
│   ├── train_hand_pose.sh            # Hand pose training (auto-downloads dataset)
│   ├── train_face_detect.sh          # Face detection training (auto-downloads WiderFace)
│   └── export_onnx.sh                # PT → ONNX export (FP32 + FP16)
└── docker/
    └── Dockerfile                     # GPU training container
```
## Technical Details

### Why YOLO26?

| Feature | Benefit |
|---|---|
| NMS-free (end-to-end) | No post-processing, consistent latency |
| RLE keypoints | More accurate keypoint localization |
| DFL removal | Cleaner ONNX, better DirectML/TensorRT compatibility |
| 43% faster on CPU | Better fallback on weak GPUs |
| Non-human keypoint support | Better for hand keypoints (no human-body bias) |
### ONNX Compatibility

| | |
|---|---|
| Opset | 17 |
| FP16 | Supported natively by DirectML, TensorRT, CoreML |
| Input | `(1, 3, 640, 640)` NCHW, RGB, normalized [0, 1] |
| Tested on | ONNX Runtime 1.24 + DirectML (Windows), CPU (Linux) |
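The input layout in the table above can be produced with a few NumPy operations. A sketch assuming the frame is already resized to 640×640 RGB (the repo's pipeline may use letterbox padding instead of a plain resize):

```python
import numpy as np

def preprocess(image_hwc_rgb):
    """uint8 (640, 640, 3) RGB frame -> float32 (1, 3, 640, 640) in [0, 1]."""
    x = image_hwc_rgb.astype(np.float32) / 255.0   # normalize to [0, 1]
    x = np.transpose(x, (2, 0, 1))                 # HWC -> CHW
    return x[np.newaxis, ...]                      # add batch dim: NCHW

frame = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
blob = preprocess(frame)
print(blob.shape, blob.dtype)
```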
### FP16 vs FP32

| | FP16 | FP32 |
|---|---|---|
| Size | ~50% smaller | Full size |
| Speed | Faster on GPU (native FP16 compute) | Standard |
| Accuracy | <0.1% difference | Baseline |
| Recommended for | Deployment | Debugging / fine-tuning |
### Datasets

| Dataset | Images | Annotations | Source |
|---|---|---|---|
| Hand Keypoints | 26,768 | 21 keypoints/hand (MediaPipe) | Ultralytics |
| WiderFace | 32,203 | 393,703 face bboxes | CUHK via HuggingFace |
## License

Trained on April 3, 2026 using Ultralytics 8.4.33 + PyTorch 2.8.0 + CUDA 12.8 on RunPod.