# YOLO26 Hand Pose & Face Detection Models

Custom-trained YOLO26 models for real-time hand tracking and face detection. Built for touchless kiosk interaction via dwell-based cursor control.
## Models

| Model | Task | Keypoints | ONNX (FP16) | ONNX (FP32) | Output Shape |
|---|---|---|---|---|---|
| `yolo26_hand_pose` | Hand detection + pose | 21 | 20.5 MB | 41.0 MB | `(1, 300, 69)` |
| `yolo26_face` | Face detection | — | 18.3 MB | 36.4 MB | `(1, 300, 6)` |
Both models use YOLO26's end-to-end NMS-free architecture. Output boxes are in `xyxy` format — no post-processing needed.
## Hand Pose Output Format

Each of the 300 detections contains 69 values:

```
[x1, y1, x2, y2, confidence, class_id, kp1_x, kp1_y, kp1_vis, ..., kp21_x, kp21_y, kp21_vis]
 └─── bbox (xyxy) ──┘  │        │      └───────────── 21 keypoints × 3 ────────────────┘
                       │        └── always 0 (single class: hand)
                       └── detection confidence
```
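When reading the raw ONNX output directly (without the Ultralytics wrapper), a row can be unpacked as follows. This is a minimal NumPy sketch; `decode_hand_detections` is an illustrative helper, not part of this repo:

```python
import numpy as np

def decode_hand_detections(output, conf_threshold=0.5):
    """Split the (1, 300, 69) hand-pose output into boxes, confs, keypoints."""
    rows = output[0]                              # (300, 69)
    rows = rows[rows[:, 4] >= conf_threshold]     # keep confident detections
    boxes = rows[:, 0:4]                          # xyxy, pixel coords
    confs = rows[:, 4]
    keypoints = rows[:, 6:].reshape(-1, 21, 3)    # (N, 21, [x, y, vis])
    return boxes, confs, keypoints

# Demo on a synthetic output tensor: one confident detection, rest empty.
dummy = np.zeros((1, 300, 69), dtype=np.float32)
dummy[0, 0, :6] = [100, 120, 300, 340, 0.92, 0]   # bbox + conf + class
dummy[0, 0, 6 + 8 * 3] = 210.0                    # keypoint 8 (index tip) x
dummy[0, 0, 6 + 8 * 3 + 1] = 150.0                # keypoint 8 (index tip) y
boxes, confs, kps = decode_hand_detections(dummy)
print(boxes.shape, kps[0, 8])
```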
## Face Detection Output Format

Each of the 300 detections contains 6 values:

```
[x1, y1, x2, y2, confidence, class_id]
 └─── bbox (xyxy) ──┘  │        └── always 0 (single class: face)
                       └── detection confidence
```
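For the kiosk's "is someone in front of the camera?" check, a simple confidence threshold over the raw (1, 300, 6) tensor is enough. A sketch (`face_present` is an illustrative helper, not part of this repo):

```python
import numpy as np

def face_present(output, conf_threshold=0.5):
    """True if any of the 300 rows clears the confidence threshold."""
    return bool((output[0][:, 4] >= conf_threshold).any())

dummy = np.zeros((1, 300, 6), dtype=np.float32)
print(face_present(dummy))                    # False: no detections
dummy[0, 0] = [50, 60, 200, 240, 0.88, 0]
print(face_present(dummy))                    # True
```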
## 21 Hand Keypoints

```
   8 (index tip) ← used for cursor
   |
   7
   |        12   16   20
   6        |    |    |
   |        11   15   19
   5        |    |    |
   |   4    10   14   18
   |   |    |    |    |
   |   3    9    13   17
   |   |    |    |    |
   |   2    |    |    |
   |   |    |    |    |
   |   1    |    |    |
   └───┴────┴────┴────┘
      0 (wrist)
```
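Keypoint 8 (the index fingertip) drives the dwell-based cursor mentioned in the intro. A hedged sketch of how a dwell "click" could work — if the cursor stays inside a small radius for `dwell_s` seconds, fire a click. `DwellDetector` is an illustrative helper, not part of this repo:

```python
import time

class DwellDetector:
    """Fire a 'click' when the index-tip position dwells in one spot."""

    def __init__(self, radius_px=30.0, dwell_s=1.0):
        self.radius_px = radius_px
        self.dwell_s = dwell_s
        self._anchor = None   # position where the dwell timer started
        self._start = None

    def update(self, x, y, now=None):
        """Feed the latest index-tip position; returns True once per dwell."""
        now = time.monotonic() if now is None else now
        moved = self._anchor is not None and (
            (x - self._anchor[0]) ** 2 + (y - self._anchor[1]) ** 2
            > self.radius_px ** 2
        )
        if self._anchor is None or moved:
            self._anchor, self._start = (x, y), now   # restart the timer
            return False
        if now - self._start >= self.dwell_s:
            self._anchor = None   # re-arm so the next dwell can fire again
            return True
        return False

detector = DwellDetector(radius_px=30.0, dwell_s=1.0)
detector.update(100, 100, now=0.0)   # anchor set, timer starts
detector.update(104, 101, now=1.1)   # within radius, 1.1 s elapsed -> click
```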
## Training Results

### Hand Pose

| | |
|---|---|
| Base model | `yolo26s-pose.pt` (Ultralytics, pretrained on COCO-pose) |
| Architecture | YOLO26s-pose — 10.6M params, 25.0 GFLOPs |
| Dataset | Ultralytics Hand Keypoints — 18,776 train / 7,847 val images |
| Keypoints | 21 per hand (wrist + 4 per finger), generated with Google MediaPipe |
| GPU | NVIDIA H200 NVL (143 GB VRAM) on RunPod |
| Batch size | 512 |
| Epochs | 100 |
| Optimizer | AdamW (lr=0.002, momentum=0.9) |
| Training time | 1h 45min |
| Training cost | ~$6 (H200 NVL @ $3.40/hr) |
| Framework | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |
Final metrics (epoch 100/100):
| Metric | Score |
|---|---|
| Pose mAP50 | 0.942 |
| Pose mAP50-95 | 0.843 |
| Box mAP50 | 0.993 |
| Box mAP50-95 | 0.912 |
### Face Detection

| | |
|---|---|
| Base model | `yolo26s.pt` (Ultralytics, pretrained on COCO) |
| Architecture | YOLO26s — 9.6M params, 20.5 GFLOPs |
| Dataset | WiderFace — 12,876 train / 3,222 val images (downloaded via HuggingFace CUHK-CSE mirror) |
| GPU | NVIDIA H200 NVL (143 GB VRAM) on RunPod |
| Batch size | 64 |
| Epochs | 50 |
| Optimizer | MuSGD (lr=0.01, momentum=0.9) |
| Training time | 54min |
| Training cost | ~$3 (H200 NVL @ $3.40/hr) |
| Framework | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |
Final metrics (epoch 50/50):
| Metric | Score |
|---|---|
| Box mAP50 | 0.744 |
| Box mAP50-95 | 0.413 |
| Precision | 0.861 |
| Recall | 0.656 |
**Note:** WiderFace scores appear low because the dataset includes extremely small faces (crowds, distant people). For the kiosk use case (a single person close to the camera), detection is highly reliable.
## Total Training Cost

| Model | GPU | Time | Cost |
|---|---|---|---|
| Hand pose (100 epochs) | H200 NVL (143GB) | 1h 45min | ~$6 |
| Face detection (50 epochs) | H200 NVL (143GB) | 54min | ~$3 |
| **Total** | | 2h 39min | ~$9 |
## Quick Start

### Download Pre-trained Models

```bash
git clone https://github.com/YOUR_USERNAME/yolo26-training.git
cd yolo26-training
ls -lh models/
```
### Export .pt to ONNX

```bash
pip install ultralytics onnx onnxslim
bash scripts/export_onnx.sh checkpoints/yolo26_hand_pose.pt
bash scripts/export_onnx.sh checkpoints/yolo26_face.pt
```
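If you prefer the Python API over the shell script, the same export can be done with Ultralytics' `export()`. A sketch assuming the checkpoint paths above; `half=True` yields the FP16 variant:

```python
from ultralytics import YOLO

# PT -> ONNX via the Ultralytics Python API instead of scripts/export_onnx.sh.
for ckpt in ("checkpoints/yolo26_hand_pose.pt", "checkpoints/yolo26_face.pt"):
    model = YOLO(ckpt)
    model.export(format="onnx", opset=17, imgsz=640)             # FP32
    model.export(format="onnx", opset=17, imgsz=640, half=True)  # FP16
```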
### Inference (Python)

```python
from ultralytics import YOLO

# Hand pose
model = YOLO("checkpoints/yolo26_hand_pose.pt")
results = model.predict("image.jpg")
for r in results:
    keypoints = r.keypoints.xy
    boxes = r.boxes.xyxy
    confs = r.boxes.conf

# Face detection
model = YOLO("checkpoints/yolo26_face.pt")
results = model.predict("photo.jpg")
has_face = len(results[0].boxes) > 0
```
### Inference (C# / ONNX Runtime + DirectML)

```csharp
// Assumes the Microsoft.ML.OnnxRuntime.DirectML package.
var options = new SessionOptions();
options.AppendExecutionProvider_DML();

var session = new InferenceSession("yolo26_hand_pose.onnx", options);
// `inputs` is built from the preprocessed (1, 3, 640, 640) float tensor.
using var results = session.Run(inputs);
var tensor = results.First().AsTensor<float>();

// Row layout: [x1, y1, x2, y2, conf, class, kp1_x, kp1_y, kp1_vis, ...]
var conf = tensor[0, i, 4];
var indexTipX = tensor[0, i, 30];   // keypoint 8 (index tip): 6 + 8 * 3
var indexTipY = tensor[0, i, 31];
```
## Train from Scratch

### On RunPod (or any cloud GPU)

```bash
pip install ultralytics onnxruntime-gpu onnx onnxslim
bash scripts/train_hand_pose.sh
bash scripts/train_face_detect.sh
```

### With Docker

```bash
docker build -t yolo26-training -f docker/Dockerfile .
docker run --gpus all -v $(pwd)/models:/workspace/output yolo26-training
```
### Recommended GPUs

| GPU | VRAM | Hand Pose | Face Detect | Total Cost |
|---|---|---|---|---|
| H200 NVL | 143 GB | 1h 45min | 54min | ~$9 |
| H100 SXM | 80 GB | ~2h 30min | ~1h 15min | ~$10 |
| A100 | 80 GB | ~3h | ~1h 30min | ~$6 |
| A40 | 48 GB | ~4h | ~2h | ~$3 |
## Training Tips

- **Hand pose:** `batch=512` works on 48GB+ VRAM. Lower to 128 on 24GB GPUs.
- **Face detect:** `batch=64` recommended. WiderFace has 100+ faces/image — higher batch sizes cause OOM.
- **Early stopping:** Both scripts set `patience` to stop early if metrics plateau.
- **WiderFace download:** The script auto-downloads from the HuggingFace CUHK-CSE mirror (Google Drive links are unreliable).
## Project Structure

```
yolo26-training/
├── README.md
├── .gitignore
├── .gitattributes                     # Git LFS tracking for .onnx and .pt files
├── models/                            # Exported ONNX models
│   ├── yolo26_hand_pose_fp16.onnx    # FP16 (20.5 MB) — recommended
│   ├── yolo26_hand_pose_fp32.onnx    # FP32 (41.0 MB)
│   ├── yolo26_face_fp16.onnx         # FP16 (18.3 MB) — recommended
│   └── yolo26_face_fp32.onnx         # FP32 (36.4 MB)
├── checkpoints/                       # PyTorch checkpoints
│   ├── yolo26_hand_pose.pt           # Hand pose (19.3 MB stripped)
│   └── yolo26_face.pt                # Face detect (19.3 MB stripped)
├── scripts/
│   ├── train_hand_pose.sh            # Hand pose training (auto-downloads dataset)
│   ├── train_face_detect.sh          # Face detection training (auto-downloads WiderFace)
│   └── export_onnx.sh                # PT → ONNX export (FP32 + FP16)
└── docker/
    └── Dockerfile                     # GPU training container
```
## Technical Details

### Why YOLO26?

| Feature | Benefit |
|---|---|
| NMS-free (end-to-end) | No post-processing, consistent latency |
| RLE keypoints | More accurate keypoint localization |
| DFL removal | Cleaner ONNX, better DirectML/TensorRT compatibility |
| 43% faster on CPU | Better fallback on weak GPUs |
| Non-human keypoint support | Better for hand keypoints (no human-body bias) |
### ONNX Compatibility

| | |
|---|---|
| Opset | 17 |
| FP16 | Supported natively by DirectML, TensorRT, CoreML |
| Input | `(1, 3, 640, 640)` NCHW, RGB, normalized [0, 1] |
| Tested on | ONNX Runtime 1.24 + DirectML (Windows), CPU (Linux) |
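The input layout in the table above can be produced with a few NumPy operations. A sketch assuming the frame is already resized to 640×640 RGB (the repo's pipeline may use letterbox padding instead of a plain resize):

```python
import numpy as np

def preprocess(image_hwc_rgb):
    """uint8 (640, 640, 3) RGB frame -> float32 (1, 3, 640, 640) in [0, 1]."""
    x = image_hwc_rgb.astype(np.float32) / 255.0   # normalize to [0, 1]
    x = np.transpose(x, (2, 0, 1))                 # HWC -> CHW
    return x[np.newaxis, ...]                      # add batch dim: NCHW

frame = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
blob = preprocess(frame)
print(blob.shape, blob.dtype)
```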
### FP16 vs FP32

| | FP16 | FP32 |
|---|---|---|
| Size | ~50% smaller | Full size |
| Speed | Faster on GPU (native FP16 compute) | Standard |
| Accuracy | <0.1% difference | Baseline |
| Recommended for | Deployment | Debugging / fine-tuning |
### Datasets

| Dataset | Images | Annotations | Source |
|---|---|---|---|
| Hand Keypoints | 26,768 | 21 keypoints/hand (MediaPipe) | Ultralytics |
| WiderFace | 32,203 | 393,703 face bboxes | CUHK via HuggingFace |
## License

Trained on April 3, 2026 using Ultralytics 8.4.33 + PyTorch 2.8.0 + CUDA 12.8 on RunPod.