---
license: mit
base_model: microsoft/Florence-2-base-ft
tags:
- florence2
- icon-captioning
- omniparser
- ui-understanding
- fine-tuned
datasets:
- FortAwesome/Font-Awesome
pipeline_tag: image-to-text
---

# OmniParser Florence-2 Fine-tuned Icon Captioner

Fine-tuned [Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft) for UI icon captioning, used as the caption model in [OmniParser v2](https://github.com/microsoft/OmniParser).

This model extends the original OmniParser icon-captioning weights with:

- **1,970 Font Awesome icons** with rotated synonym labels and diverse augmentations
- **Custom Google Maps icons**: Street View pegman and Street View rotation controls
- **Hard-negative training** to prevent false positives on similar-looking icons

## What changed from the base OmniParser weights

### Training data

- **Font Awesome Free**: 1,970 icons across the solid/regular/brands styles, each with multiple synonym labels from FA metadata (e.g., the sleigh icon is trained with the labels "sleigh", "christmas", "sled", "santa", and "reindeer")
- **70/30 weighted sampling**: the primary icon name gets 70% of training steps; alternate synonyms share the remaining 30%
- **Label smoothing** (0.1): prevents overconfidence on any single synonym
- **Screenshot anchors**: 91 icons from real Google Maps screenshots, labeled with the original model's own captions (prevents vocabulary drift)
- **Hard negatives**: 3 specific crops that were false positives in earlier training rounds

### Augmentations (training-time)

- Color inversion (black/white swap)
- Random foreground recoloring on white/gray backgrounds
- White foreground on random colored backgrounds
- Brightness, contrast, rotation, blur, and tint jitter
- Random rescale (downscale then upscale to introduce aliasing artifacts)
- JPEG compression (quality 15-60 for SVG icons, 50-85 for photo crops)

### Architecture

- Vision encoder: **frozen** (90.4M params) — preserves general icon feature extraction
- Language decoder: **trained** (141.0M params) — learns the new caption mappings
- 4 epochs, LR 2e-6, AdamW, batch size 8

## Benchmark: Google Maps screenshot (2570x2002, H100)

| Metric | Before | After |
|--------|--------|-------|
| Florence-2 latency | 342ms | 148ms |
| Total elements | 316 | 316 |
| Icons sent to Florence-2 | 71 | 71 |

### Key caption improvements

| Icon | Before | After |
|------|--------|-------|
| Street View pegman | "A notification or alert." | **"pegman"** |
| Rotation control (BL) | "Refresh or reload." | **"street view rotation"** |
| Rotation control (BR) | "A painting or painting tool." | **"street view rotation"** |
| Location marker | "Location or location marker." | "Location" |
| User profile | "a user profile or account." | "user profile" |
| Record player | "a record player." | "record player" |
| Suitcase | "a suitcase or baggage." | "suitcase" |

47 icon captions changed in total. Most changes are minor wording improvements (shorter, more precise), and 3 new custom icon types were learned.

## Usage

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "proteus-computer-use/omniparser-finetuned",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Caption a 64x64 icon crop (icon_crop is a PIL.Image).
# "<CAPTION>" is the standard Florence-2 captioning task prompt.
inputs = processor(images=icon_crop, text="<CAPTION>", return_tensors="pt").to(
    "cuda", torch.float16
)
generated_ids = model.generate(
    input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
    max_new_tokens=32, num_beams=3,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
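
## Appendix: the 70/30 synonym sampling, sketched

The 70/30 weighted sampling described under "Training data" can be sketched as follows. This is a minimal illustration, not the actual training code; `sample_label` is a hypothetical helper, and label smoothing (applied to the loss, not the sampling) is omitted.

```python
import random

def sample_label(primary, synonyms, p_primary=0.7):
    """Pick the caption target for one training step: the primary
    icon name 70% of the time, otherwise a uniformly chosen synonym.
    (Hypothetical helper mirroring the 70/30 split described above.)"""
    if not synonyms or random.random() < p_primary:
        return primary
    return random.choice(synonyms)

# Example: the sleigh icon with its Font Awesome synonym labels.
random.seed(0)
labels = [
    sample_label("sleigh", ["christmas", "sled", "santa", "reindeer"])
    for _ in range(10_000)
]
share_primary = labels.count("sleigh") / len(labels)  # close to 0.70
```

Over many steps the primary name dominates the gradient signal while the synonyms still appear often enough to be recognized at inference time.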