# uitag-yolo11s-ui-detect-v1
A YOLO11s model fine-tuned to detect UI elements in desktop screenshots. Trained on GroundCUA (55K screenshots, 3.56M human-verified annotations) with SAHI-style tiling (640x640 tiles, 20% overlap). Built for the uitag set-of-marks (SoM) pipeline, where it runs alongside Apple Vision to provide 90.8% combined detection coverage on ScreenSpot-Pro.
## What it detects
Nine element classes derived from GroundCUA's annotation taxonomy:
| Class | Examples |
|---|---|
| Button | Toolbar buttons, dialog buttons, toggles |
| Menu | Menu bars, context menus, dropdowns |
| Input_Elements | Text fields, search boxes, spinners |
| Navigation | Tabs, breadcrumbs, tree nodes |
| Information_Display | Status bars, tooltips, labels |
| Sidebar | Side panels, nav rails |
| Visual_Elements | Icons, thumbnails, separators |
| Others | Scrollbars, handles, dividers |
| Unknown | Ambiguous elements |
## Evaluation
Detection coverage on ScreenSpot-Pro (1,581 targets, 26 professional applications, 3 platforms). Center-hit: does any detection's bounding box contain the center of the ground-truth target?
| Pipeline | Text | Icon | Overall |
|---|---|---|---|
| Apple Vision + this model | 92.7% | 87.6% | 90.8% |
| This model only | 82.4% | 75.7% | 80.1% |
| Apple Vision only | 66.4% | 42.5% | 57.3% |
The model is most useful in combination with Apple Vision: Vision provides OCR text labels, this model provides structural element detection. In the combined pipeline, icon coverage improves from 42.5% to 87.6% (+45.1 percentage points).
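The center-hit criterion used above is simple to express. Here is an illustrative helper (not part of uitag) that evaluates it for a single ground-truth target:

```python
def center_hit(gt_box, det_boxes):
    """True if any detection box contains the center of the ground-truth box.

    All boxes are (x1, y1, x2, y2). Illustrative helper, not uitag code.
    """
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    return any(x1 <= cx <= x2 and y1 <= cy <= y2
               for x1, y1, x2, y2 in det_boxes)
```

Coverage on a benchmark is then the fraction of targets for which `center_hit` is true over the pipeline's detections.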
Additional benchmarks (this model only, no Apple Vision):
| Benchmark | Metric | Score |
|---|---|---|
| GroundCUA (500 images, 30K GT elements) | Recall@IoU>=0.5 | 94.0% |
| GroundCUA | Precision@IoU>=0.5 | 83.6% |
| UI-Vision (1,181 images) | Recall@IoU>=0.5 | 83.5% |
GroundCUA scores are high because the model was trained on that distribution. UI-Vision and ScreenSpot-Pro are out-of-distribution evaluations.
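For reference, Recall@IoU>=0.5 counts a ground-truth element as found when some detection overlaps it with intersection-over-union of at least 0.5. A minimal IoU helper for axis-aligned boxes (illustrative, not uitag code):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```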
## Limitations
This model detects element bounding boxes. It does not perform OCR, read text content, or understand natural language instructions. It provides class labels (Button, Menu, etc.) but not semantic labels ("the save button").
The model was trained on GroundCUA, which contains 87 desktop applications. Applications with UI patterns not represented in GroundCUA may see lower coverage. In evaluation, the weakest results were on AutoCAD (68% center-hit in the combined pipeline) and FL Studio (79%).
The model uses tiled inference (640x640 tiles with 20% overlap) at both training and inference time. Running on a full-resolution image without tiling will produce very few detections.
## Training
| Parameter | Value |
|---|---|
| Base model | YOLO11s (pretrained) |
| Dataset | GroundCUA tiled (224K train, 25K val tiles) |
| Tile size | 640x640, 20% overlap |
| Classes | 9 |
| Epochs | 100 |
| Hardware | 2x H100 PCIe 80GB (DDP) |
| Wall clock | 19.75 hours |
| Optimizer | AdamW (cosine LR) |
| Augmentation | translate=0.1, scale=0.2 (no mosaic, no mixup, no rotation, no flip) |
| mAP@0.5 (val) | 0.792 |
| mAP@0.5:0.95 (val) | 0.540 |
Augmentation choices reflect the domain: UI elements are axis-aligned and spatially meaningful, so rotation, flipping, and mosaic are disabled. Mild translation and scale simulate dialog movement and responsive layouts. Removing geometric augmentation caused overfitting within 6 epochs in diagnostic runs.
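The table and notes above correspond roughly to the following ultralytics training call. This is a sketch under the standard ultralytics YOLO API; the dataset YAML name is hypothetical, and hyperparameters beyond those listed in the table are not published:

```python
from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # pretrained base
model.train(
    data="groundcua_tiled.yaml",  # hypothetical dataset config, 9 classes
    imgsz=640,
    epochs=100,
    optimizer="AdamW",
    cos_lr=True,          # cosine LR schedule
    translate=0.1,
    scale=0.2,
    mosaic=0.0, mixup=0.0,            # no mosaic, no mixup
    degrees=0.0, fliplr=0.0, flipud=0.0,  # no rotation, no flip
    device=[0, 1],        # 2x H100 via DDP
)
```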
## Usage
### With uitag (recommended)
```shell
pip install uitag[yolo]
uitag screenshot.png --yolo -o out/
```
The model weights are bundled with uitag at `uitag/models/yolo-ui.pt`.
### Standalone with ultralytics
```python
from ultralytics import YOLO
from PIL import Image

model = YOLO("path/to/yolo-ui.pt")

# Tiled inference (required; matches training)
img = Image.open("screenshot.png")
tile_size = 640
step = 512  # 640 * (1 - 0.2) -> 20% overlap
detections = []
for y in range(0, img.height, step):
    for x in range(0, img.width, step):
        tile = img.crop((x, y, min(x + tile_size, img.width),
                         min(y + tile_size, img.height)))
        results = model(tile, imgsz=640, conf=0.25)
        # Translate detections back to full-image coordinates
        for box in results[0].boxes:
            xyxy = box.xyxy[0].tolist()
            xyxy[0] += x; xyxy[1] += y; xyxy[2] += x; xyxy[3] += y
            detections.append((xyxy, int(box.cls), float(box.conf)))
```
Apply cross-tile NMS after collecting all detections.
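That cross-tile NMS step can be sketched as a greedy, confidence-ordered pass. This is illustrative only; uitag's actual merge strategy may differ:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def cross_tile_nms(dets, iou_thresh=0.5):
    """Greedy NMS over detections collected from all tiles.

    dets: list of (xyxy, cls, conf) already translated to
    full-image coordinates. Keeps the highest-confidence box in
    each overlapping cluster; duplicates from the 20% tile overlap
    are suppressed.
    """
    kept = []
    for box, cls, conf in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(iou(box, kbox) < iou_thresh for kbox, _, _ in kept):
            kept.append((box, cls, conf))
    return kept
```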
## Intended Use
This model is intended for UI element detection in desktop screenshots as part of an automated testing, accessibility, or agent pipeline. It is not intended for detecting UI elements in mobile screenshots, web-only interfaces, or non-screenshot images.
## Model Files
- `yolo-ui.pt`: YOLO11s weights (18 MB)
## Citation
If you use this model, please cite the GroundCUA dataset:
```bibtex
@article{groundcua2025,
  title={GroundCUA: Ground Truth for Computer Use Agents},
  author={ServiceNow Research},
  year={2025}
}
```
## License
MIT