Temporal-Trio: Multimodal Fine-Tuning with SLM
Model Description
This model is a vision-language captioning model fine-tuned on the RICO Screen2Words dataset to generate natural language descriptions of mobile UI screenshots.
The model takes a mobile interface screenshot as input and produces a short textual description summarizing the screen content. The base architecture is BLIP (Bootstrapping Language-Image Pre-training), which combines a vision encoder with a transformer-based language decoder.
The model was fine-tuned using the Hugging Face Transformers framework.
Model Details
- Developed by: Neha Nair, Niharika Paul, Niharika Saha
- Model type: Vision-Language Image Captioning
- Base model: Salesforce/blip-image-captioning-base
- Language: English
- License: Apache 2.0
- Framework: Hugging Face Transformers
Intended Use
Direct Use
This model can be used to generate captions for mobile UI screenshots.
Example applications include:
- UI accessibility tools
- Automated interface documentation
- Screen summarization
- UI understanding research
Out-of-Scope Use
The model is not intended for:
- General scene captioning outside UI screenshots
- OCR or precise text extraction from images
- Safety-critical decision systems
How to Use
Load Model
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained(
    "nikilovesml/temporal_trio-rico-screen2words-blip-caption"
)
model = BlipForConditionalGeneration.from_pretrained(
    "nikilovesml/temporal_trio-rico-screen2words-blip-caption"
)
```
Run Inference
```python
# BLIP expects RGB input; convert in case the screenshot is RGBA or grayscale
image = Image.open("example_ui.png").convert("RGB")

inputs = processor(image, return_tensors="pt")
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)
```
Training Data
The model was fine-tuned on the RICO Screen2Words dataset, which contains mobile UI screenshots paired with short natural language descriptions of each screen.
Dataset features:
- UI screenshots from diverse mobile applications
- Human-written screen descriptions
- Structured captions describing interface layout and function
Dataset link: rootsautomation/RICO-Screen2Words
Training Procedure
Preprocessing
Each dataset sample contains an image of a UI screen and a list of captions. The first caption was used as the training target. Images and captions were processed using the BLIP processor to generate pixel values (image encoder input) and tokenized caption sequences.
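The caption-selection step described above can be sketched as follows. This is a minimal illustration, not the training script; the `captions` field name is an assumption about the dataset schema and may differ in the actual release.

```python
def select_training_target(sample):
    """Pick the first caption as the training target.

    `sample` is assumed to be a dict holding a "captions" list of
    human-written screen descriptions; adjust the key if the
    dataset schema differs.
    """
    captions = sample["captions"]
    if not captions:
        raise ValueError("sample has no captions")
    return captions[0]


# Example with a mock sample (captions taken from the evaluation table):
mock = {"captions": ["display page of messages and other options",
                     "messages screen"]}
print(select_training_target(mock))
# prints "display page of messages and other options"
```

The selected caption and the screenshot are then passed together through the BLIP processor, which yields the pixel values and tokenized caption sequence used for training.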
Training Hyperparameters
| Parameter | Value |
|---|---|
| Batch size | 4 |
| Epochs | 3 |
| Learning rate | 5e-5 |
| Precision | fp16 |
| Framework | Hugging Face Trainer |
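The table above maps onto Hugging Face `TrainingArguments` roughly as sketched below. The `output_dir` is a placeholder, and all settings not listed in the table (optimizer, scheduler, evaluation strategy) are left at Trainer defaults; the actual training script may have configured them differently.

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameter table above.
training_args = TrainingArguments(
    output_dir="./blip-screen2words",   # placeholder path
    per_device_train_batch_size=4,      # Batch size
    num_train_epochs=3,                 # Epochs
    learning_rate=5e-5,                 # Learning rate
    fp16=True,                          # Precision: fp16 (mixed precision)
)
```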
Compute Infrastructure
- Hardware: NVIDIA T4 GPU (16GB VRAM)
- Platform: Google Colab
- Framework: Hugging Face Transformers
The model was intentionally trained using T4-compatible settings to ensure reproducibility on widely available hardware.
Evaluation
Evaluation was performed qualitatively by comparing generated captions against ground truth captions from the validation set.
| Ground Truth | Prediction |
|---|---|
| display page of messages and other options | display page of messages with options |
| create account details for a chat app | create account page for chat application |
| page displaying a search bar | search page of application |
Limitations
- Captions may occasionally be overly generic.
- The model may confuse application categories.
- Performance may degrade on UI layouts very different from the RICO dataset.
Environmental Impact
Training was conducted on a single NVIDIA T4 GPU for several hours using mixed-precision (fp16) training. Estimated carbon emissions are relatively low compared to large-scale multimodal model training runs.
Citation
If you use this model, please cite the RICO dataset:
Deka, B., et al. "Rico: A Mobile App Dataset for Building Data-Driven Design Applications." UIST 2017.