# uitag-yolo11s-ui-detect-v1
A YOLO11s model fine-tuned to detect UI elements in desktop screenshots. Trained on GroundCUA (55K screenshots, 3.56M human-verified annotations) with SAHI-style tiling (640x640 tiles, 20% overlap). Built for the uitag set-of-marks (SoM) pipeline, where it runs alongside Apple Vision to provide 90.8% combined detection coverage on ScreenSpot-Pro.
## What it detects
Nine element classes derived from GroundCUA's annotation taxonomy:
| Class | Examples |
|---|---|
| Button | Toolbar buttons, dialog buttons, toggles |
| Menu | Menu bars, context menus, dropdowns |
| Input_Elements | Text fields, search boxes, spinners |
| Navigation | Tabs, breadcrumbs, tree nodes |
| Information_Display | Status bars, tooltips, labels |
| Sidebar | Side panels, nav rails |
| Visual_Elements | Icons, thumbnails, separators |
| Others | Scrollbars, handles, dividers |
| Unknown | Ambiguous elements |
## Evaluation
Detection coverage on ScreenSpot-Pro (1,581 targets, 26 professional applications, 3 platforms). Center-hit: does any detection's bounding box contain the center of the ground-truth target?
| Pipeline | Text | Icon | Overall |
|---|---|---|---|
| Apple Vision + this model | 92.7% | 87.6% | 90.8% |
| This model only | 82.4% | 75.7% | 80.1% |
| Apple Vision only | 66.4% | 42.5% | 57.3% |
The model is most useful in combination with Apple Vision: Vision provides OCR text labels, this model provides structural element detection. In the combined pipeline, icon coverage improves from 42.5% to 87.6% (+45.1 percentage points).
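The center-hit criterion used above is simple to express. Here is an illustrative helper (not part of uitag) that evaluates it for a single ground-truth target:

```python
def center_hit(gt_box, det_boxes):
    """True if any detection box contains the center of the ground-truth box.

    All boxes are (x1, y1, x2, y2). Illustrative helper, not uitag code.
    """
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    return any(x1 <= cx <= x2 and y1 <= cy <= y2
               for x1, y1, x2, y2 in det_boxes)
```

Coverage on a benchmark is then the fraction of targets for which `center_hit` is true over the pipeline's detections.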
Additional benchmarks (this model only, no Apple Vision):
| Benchmark | Metric | Score |
|---|---|---|
| GroundCUA (500 images, 30K GT elements) | Recall@IoU>=0.5 | 94.0% |
| GroundCUA | Precision@IoU>=0.5 | 83.6% |
| UI-Vision (1,181 images) | Recall@IoU>=0.5 | 83.5% |
GroundCUA scores are high because the model was trained on that distribution. UI-Vision and ScreenSpot-Pro are out-of-distribution evaluations.
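For reference, Recall@IoU>=0.5 counts a ground-truth element as found when some detection overlaps it with intersection-over-union of at least 0.5. A minimal IoU helper for axis-aligned boxes (illustrative, not uitag code):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```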
## Limitations
This model detects element bounding boxes. It does not perform OCR, read text content, or understand natural language instructions. It provides class labels (Button, Menu, etc.) but not semantic labels ("the save button").
The model was trained on GroundCUA, which contains 87 desktop applications. Applications with UI patterns not represented in GroundCUA may see lower coverage. In evaluation, the weakest results were on AutoCAD (68% center-hit in the combined pipeline) and FL Studio (79%).
The model uses tiled inference (640x640 tiles with 20% overlap) at both training and inference time. Running on a full-resolution image without tiling will produce very few detections.
## Training
| Parameter | Value |
|---|---|
| Base model | YOLO11s (pretrained) |
| Dataset | GroundCUA tiled (224K train, 25K val tiles) |
| Tile size | 640x640, 20% overlap |
| Classes | 9 |
| Epochs | 100 |
| Hardware | 2x H100 PCIe 80GB (DDP) |
| Wall clock | 19.75 hours |
| Optimizer | AdamW (cosine LR) |
| Augmentation | translate=0.1, scale=0.2 (no mosaic, no mixup, no rotation, no flip) |
| mAP@0.5 (val) | 0.792 |
| mAP@0.5:0.95 (val) | 0.540 |
Augmentation choices reflect the domain: UI elements are axis-aligned and spatially meaningful, so rotation, flipping, and mosaic are disabled. Mild translation and scale simulate dialog movement and responsive layouts. Removing geometric augmentation caused overfitting within 6 epochs in diagnostic runs.
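The table and notes above correspond roughly to the following ultralytics training call. This is a sketch under the standard ultralytics YOLO API; the dataset YAML name is hypothetical, and hyperparameters beyond those listed in the table are not published:

```python
from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # pretrained base
model.train(
    data="groundcua_tiled.yaml",  # hypothetical dataset config, 9 classes
    imgsz=640,
    epochs=100,
    optimizer="AdamW",
    cos_lr=True,          # cosine LR schedule
    translate=0.1,
    scale=0.2,
    mosaic=0.0, mixup=0.0,            # no mosaic, no mixup
    degrees=0.0, fliplr=0.0, flipud=0.0,  # no rotation, no flip
    device=[0, 1],        # 2x H100 via DDP
)
```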
## Usage
### With uitag (recommended)
```shell
pip install uitag[yolo]
uitag screenshot.png --yolo -o out/
```
The model weights are bundled with uitag at `uitag/models/yolo-ui.pt`.
### Standalone with ultralytics
```python
from ultralytics import YOLO
from PIL import Image

model = YOLO("path/to/yolo-ui.pt")

# Tiled inference (required; matches training)
img = Image.open("screenshot.png")
tile_size = 640
step = 512  # 640 * (1 - 0.2) -> 20% overlap
detections = []
for y in range(0, img.height, step):
    for x in range(0, img.width, step):
        tile = img.crop((x, y, min(x + tile_size, img.width),
                         min(y + tile_size, img.height)))
        results = model(tile, imgsz=640, conf=0.25)
        # Translate detections back to full-image coordinates
        for box in results[0].boxes:
            xyxy = box.xyxy[0].tolist()
            xyxy[0] += x; xyxy[1] += y; xyxy[2] += x; xyxy[3] += y
            detections.append((xyxy, int(box.cls), float(box.conf)))
```
Apply cross-tile NMS after collecting all detections.
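That cross-tile NMS step can be sketched as a greedy, confidence-ordered pass. This is illustrative only; uitag's actual merge strategy may differ:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def cross_tile_nms(dets, iou_thresh=0.5):
    """Greedy NMS over detections collected from all tiles.

    dets: list of (xyxy, cls, conf) already translated to
    full-image coordinates. Keeps the highest-confidence box in
    each overlapping cluster; duplicates from the 20% tile overlap
    are suppressed.
    """
    kept = []
    for box, cls, conf in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(iou(box, kbox) < iou_thresh for kbox, _, _ in kept):
            kept.append((box, cls, conf))
    return kept
```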
## Intended Use
This model is intended for UI element detection in desktop screenshots as part of an automated testing, accessibility, or agent pipeline. It is not intended for detecting UI elements in mobile screenshots, web-only interfaces, or non-screenshot images.
## Model Files
- `yolo-ui.pt`: YOLO11s weights (18 MB)
## Citation
If you use this model, please cite the GroundCUA dataset:
```bibtex
@article{groundcua2025,
  title={GroundCUA: Ground Truth for Computer Use Agents},
  author={ServiceNow Research},
  year={2025}
}
```
## License
MIT