---
license: mit
base_model: microsoft/Florence-2-base-ft
tags:
- florence2
- icon-captioning
- omniparser
- ui-understanding
- fine-tuned
datasets:
- FortAwesome/Font-Awesome
pipeline_tag: image-to-text
---

# OmniParser Florence-2 Fine-tuned Icon Captioner

Fine-tuned [Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft) for UI icon captioning, used as the caption model in [OmniParser v2](https://github.com/microsoft/OmniParser).

This model extends the original OmniParser icon-captioning weights with:

- **1,970 Font Awesome icons** with rotated synonym labels and diverse augmentations
- **Custom Google Maps icons**: Street View pegman and Street View rotation controls
- **Hard-negative training** to prevent false positives on similar-looking icons

## What changed from the base OmniParser weights

### Training data

- **Font Awesome Free**: 1,970 icons across the solid/regular/brands styles, each with multiple synonym labels from FA metadata (e.g., the sleigh icon is trained with the labels "sleigh", "christmas", "sled", "santa", and "reindeer")
- **70/30 weighted sampling**: the primary icon name gets 70% of training steps; alternate synonyms share the remaining 30%
- **Label smoothing** (0.1): prevents overconfidence on any single synonym
- **Screenshot anchors**: 91 icons from real Google Maps screenshots, labeled with the original model's own captions (prevents vocabulary drift)
- **Hard negatives**: 3 specific crops that were false positives in earlier training rounds

### Augmentations (training-time)

- Color inversion (black/white swap)
- Random foreground recoloring on white/gray backgrounds
- White foreground on random colored backgrounds
- Brightness, contrast, rotation, blur, and tint jitter
- Random rescale (downscale then upscale to introduce aliasing artifacts)
- JPEG compression (quality 15-60 for SVG icons, 50-85 for photo crops)

### Architecture

- Vision encoder: **frozen** (90.4M params) — preserves general icon feature extraction
- Language decoder: **trained** (141.0M params) — learns the new caption mappings
- 4 epochs, LR 2e-6, AdamW, batch size 8

## Benchmark: Google Maps screenshot (2570x2002, H100)

| Metric | Before | After |
|--------|--------|-------|
| Florence-2 latency | 342ms | 148ms |
| Total elements | 316 | 316 |
| Icons sent to Florence-2 | 71 | 71 |

### Key caption improvements

| Icon | Before | After |
|------|--------|-------|
| Street View pegman | "A notification or alert." | **"pegman"** |
| Rotation control (BL) | "Refresh or reload." | **"street view rotation"** |
| Rotation control (BR) | "A painting or painting tool." | **"street view rotation"** |
| Location marker | "Location or location marker." | "Location" |
| User profile | "a user profile or account." | "user profile" |
| Record player | "a record player." | "record player" |
| Suitcase | "a suitcase or baggage." | "suitcase" |

47 icon captions changed in total. Most changes are minor wording improvements (shorter, more precise), and 3 new custom icon types were learned.

## Usage

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "proteus-computer-use/omniparser-finetuned",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Caption a 64x64 icon crop (icon_crop is a PIL.Image).
# "<CAPTION>" is the standard Florence-2 captioning task prompt.
inputs = processor(images=icon_crop, text="<CAPTION>", return_tensors="pt").to(
    "cuda", torch.float16
)
generated_ids = model.generate(
    input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
    max_new_tokens=32, num_beams=3,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
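
## Appendix: the 70/30 synonym sampling, sketched

The 70/30 weighted sampling described under "Training data" can be sketched as follows. This is a minimal illustration, not the actual training code; `sample_label` is a hypothetical helper, and label smoothing (applied to the loss, not the sampling) is omitted.

```python
import random

def sample_label(primary, synonyms, p_primary=0.7):
    """Pick the caption target for one training step: the primary
    icon name 70% of the time, otherwise a uniformly chosen synonym.
    (Hypothetical helper mirroring the 70/30 split described above.)"""
    if not synonyms or random.random() < p_primary:
        return primary
    return random.choice(synonyms)

# Example: the sleigh icon with its Font Awesome synonym labels.
random.seed(0)
labels = [
    sample_label("sleigh", ["christmas", "sled", "santa", "reindeer"])
    for _ in range(10_000)
]
share_primary = labels.count("sleigh") / len(labels)  # close to 0.70
```

Over many steps the primary name dominates the gradient signal while the synonyms still appear often enough to be recognized at inference time.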