
uitag-yolo11s-ui-detect-v1

A YOLO11s model fine-tuned to detect UI elements in desktop screenshots. Trained on GroundCUA (55K screenshots, 3.56M human-verified annotations) with SAHI-style tiling (640x640, 20% overlap). Built for the uitag SoM (Set-of-Mark) pipeline, where it runs alongside Apple Vision to provide a combined 90.8% detection coverage on ScreenSpot-Pro.

What it detects

Nine element classes derived from GroundCUA's annotation taxonomy:

| Class | Examples |
|---|---|
| Button | Toolbar buttons, dialog buttons, toggles |
| Menu | Menu bars, context menus, dropdowns |
| Input_Elements | Text fields, search boxes, spinners |
| Navigation | Tabs, breadcrumbs, tree nodes |
| Information_Display | Status bars, tooltips, labels |
| Sidebar | Side panels, nav rails |
| Visual_Elements | Icons, thumbnails, separators |
| Others | Scrollbars, handles, dividers |
| Unknown | Ambiguous elements |
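For standalone use, the class indices can be mapped to these names. The index order below is an assumption based on the taxonomy listing; the authoritative mapping is stored on the loaded model as `model.names` in ultralytics, so verify against that.

```python
# Assumed index order -- verify against the loaded model's `names` attribute.
CLASS_NAMES = {
    0: "Button",
    1: "Menu",
    2: "Input_Elements",
    3: "Navigation",
    4: "Information_Display",
    5: "Sidebar",
    6: "Visual_Elements",
    7: "Others",
    8: "Unknown",
}
```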

Evaluation

Detection coverage on ScreenSpot-Pro (1,581 targets, 26 professional applications, 3 platforms). Center-hit: does any detection's bounding box contain the center of the ground-truth target?

| Pipeline | Text | Icon | Overall |
|---|---|---|---|
| Apple Vision + this model | 92.7% | 87.6% | 90.8% |
| This model only | 82.4% | 75.7% | 80.1% |
| Apple Vision only | 66.4% | 42.5% | 57.3% |
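The center-hit criterion can be sketched as follows. Boxes are `(x1, y1, x2, y2)` tuples in pixel coordinates; the function name is illustrative, not part of the evaluation code:

```python
def center_hit(detections, target):
    """True if any detected box contains the center of the ground-truth box."""
    x1, y1, x2, y2 = target
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return any(dx1 <= cx <= dx2 and dy1 <= cy <= dy2
               for dx1, dy1, dx2, dy2 in detections)
```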

The model is most useful in combination with Apple Vision: Vision provides OCR text labels, this model provides structural element detection. In the combined pipeline, icon coverage improves from 42.5% to 87.6% (+45.1 percentage points).

Additional benchmarks (this model only, no Apple Vision):

| Benchmark | Metric | Score |
|---|---|---|
| GroundCUA (500 images, 30K GT elements) | Recall@IoU>=0.5 | 94.0% |
| GroundCUA (500 images, 30K GT elements) | Precision@IoU>=0.5 | 83.6% |
| UI-Vision (1,181 images) | Recall@IoU>=0.5 | 83.5% |

GroundCUA scores are high because the model was trained on that distribution. UI-Vision and ScreenSpot-Pro are out-of-distribution evaluations.

Limitations

This model detects element bounding boxes. It does not perform OCR, read text content, or understand natural language instructions. It provides class labels (Button, Menu, etc.) but not semantic labels ("the save button").

The model was trained on GroundCUA, which contains 87 desktop applications. Applications with UI patterns not represented in GroundCUA may see lower coverage. In evaluation, the weakest results were on AutoCAD (68% center-hit in the combined pipeline) and FL Studio (79%).

The model uses tiled inference (640x640 tiles with 20% overlap) at both training and inference time. Running on a full-resolution image without tiling will produce very few detections.

Training

| Parameter | Value |
|---|---|
| Base model | YOLO11s (pretrained) |
| Dataset | GroundCUA tiled (224K train, 25K val tiles) |
| Tile size | 640x640, 20% overlap |
| Classes | 9 |
| Epochs | 100 |
| Hardware | 2x H100 PCIe 80GB (DDP) |
| Wall clock | 19.75 hours |
| Optimizer | AdamW (cosine LR) |
| Augmentation | translate=0.1, scale=0.2 (no mosaic, no mixup, no rotation, no flip) |
| mAP@0.5 (val) | 0.792 |
| mAP@0.5:0.95 (val) | 0.540 |

Augmentation choices reflect the domain: UI elements are axis-aligned and spatially meaningful, so rotation, flipping, and mosaic are disabled. Mild translation and scale simulate dialog movement and responsive layouts. Removing geometric augmentation caused overfitting within 6 epochs in diagnostic runs.
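Assuming the run used the ultralytics trainer, the table above translates to roughly the following arguments. The dataset YAML path is hypothetical, and exact values beyond those stated in the table are not confirmed by this card:

```python
# Hyperparameters mirroring the training table, in ultralytics train() argument
# names. Usage would be roughly:
#   YOLO("yolo11s.pt").train(data="groundcua_tiles.yaml", **hyp)  # path is hypothetical
hyp = {
    "imgsz": 640,
    "epochs": 100,
    "optimizer": "AdamW",
    "cos_lr": True,       # cosine LR schedule
    "translate": 0.1,     # mild translation: simulates dialog movement
    "scale": 0.2,         # mild scale: simulates responsive layouts
    "mosaic": 0.0,        # disabled: tiles are spatially meaningful
    "mixup": 0.0,         # disabled
    "degrees": 0.0,       # no rotation: UI elements are axis-aligned
    "fliplr": 0.0,        # no horizontal flip
    "flipud": 0.0,        # no vertical flip
}
```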

Usage

With uitag (recommended)

```bash
pip install "uitag[yolo]"
uitag screenshot.png --yolo -o out/
```

The model weights are bundled with uitag at uitag/models/yolo-ui.pt.

Standalone with ultralytics

```python
from ultralytics import YOLO
from PIL import Image

model = YOLO("path/to/yolo-ui.pt")

# Tiled inference (required; matches training-time tiling)
img = Image.open("screenshot.png")
tile_size = 640
step = 512  # 20% overlap

all_boxes = []  # (x1, y1, x2, y2, conf, cls) in full-image coordinates
for y in range(0, img.height, step):
    for x in range(0, img.width, step):
        tile = img.crop((x, y, min(x + tile_size, img.width), min(y + tile_size, img.height)))
        results = model(tile, imgsz=640, conf=0.25)
        # Translate detections back to full-image coordinates and collect them
        for box in results[0].boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            all_boxes.append((x1 + x, y1 + y, x2 + x, y2 + y,
                              float(box.conf), int(box.cls)))
```

Apply cross-tile NMS after collecting all detections.
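Because neighboring tiles overlap by 20%, the same element is often detected twice near tile borders. A minimal greedy NMS over the merged detections might look like this (pure NumPy; the 0.5 IoU threshold is a tunable assumption, not a value from this card):

```python
import numpy as np

def cross_tile_nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes already in full-image coordinates.

    Returns the indices of kept boxes, highest score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # descending by confidence
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]  # drop duplicates of the kept box
    return keep
```

If detections carry class labels, running NMS per class (grouping boxes by `cls` first) avoids suppressing distinct elements of different types that happen to overlap.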

Intended Use

This model is intended for UI element detection in desktop screenshots as part of an automated testing, accessibility, or agent pipeline. It is not intended for detecting UI elements in mobile screenshots, web-only interfaces, or non-screenshot images.

Model Files

  • yolo-ui.pt — YOLO11s weights (18 MB)

Citation

If you use this model, please cite the GroundCUA dataset:

@article{groundcua2025,
  title={GroundCUA: Ground Truth for Computer Use Agents},
  author={ServiceNow Research},
  year={2025}
}

License

MIT
