Temporal-Trio: Multimodal Fine-Tuning with SLM
Model Description
This model is a vision-language captioning model fine-tuned on the RICO Screen2Words dataset to generate natural language descriptions of mobile UI screenshots.
The model takes a mobile interface screenshot as input and produces a short textual description summarizing the screen content. The base architecture is BLIP (Bootstrapping Language-Image Pre-training), which combines a vision encoder with a transformer-based language decoder.
The model was fine-tuned using the Hugging Face Transformers framework.
Model Details
- Developed by: Neha Nair, Niharika Paul, Niharika Saha
- Model type: Vision-Language Image Captioning
- Base model: Salesforce/blip-image-captioning-base
- Language: English
- License: Apache 2.0
- Framework: Hugging Face Transformers
Intended Use
Direct Use
This model can be used to generate captions for mobile UI screenshots.
Example applications include:
- UI accessibility tools
- Automated interface documentation
- Screen summarization
- UI understanding research
Out-of-Scope Use
The model is not intended for:
- General scene captioning outside UI screenshots
- OCR or precise text extraction from images
- Safety-critical decision systems
How to Use
Load Model
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained(
    "nikilovesml/temporal_trio-rico-screen2words-blip-caption"
)
model = BlipForConditionalGeneration.from_pretrained(
    "nikilovesml/temporal_trio-rico-screen2words-blip-caption"
)
```
Run Inference
```python
# BLIP expects RGB input; convert in case the screenshot is RGBA or grayscale
image = Image.open("example_ui.png").convert("RGB")

inputs = processor(image, return_tensors="pt")
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)
```
Training Data
The model was fine-tuned on the RICO Screen2Words dataset, which contains mobile UI screenshots paired with short natural language descriptions of each screen.
Dataset features:
- UI screenshots from diverse mobile applications
- Human-written screen descriptions
- Structured captions describing interface layout and function
Dataset link: rootsautomation/RICO-Screen2Words
Training Procedure
Preprocessing
Each dataset sample contains an image of a UI screen and a list of captions. The first caption was used as the training target. Images and captions were processed using the BLIP processor to generate pixel values (image encoder input) and tokenized caption sequences.
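The caption-selection step described above can be sketched as follows. This is a minimal illustration, not the training script; the `captions` field name is an assumption about the dataset schema and may differ in the actual release.

```python
def select_training_target(sample):
    """Pick the first caption as the training target.

    `sample` is assumed to be a dict holding a "captions" list of
    human-written screen descriptions; adjust the key if the
    dataset schema differs.
    """
    captions = sample["captions"]
    if not captions:
        raise ValueError("sample has no captions")
    return captions[0]


# Example with a mock sample (captions taken from the evaluation table):
mock = {"captions": ["display page of messages and other options",
                     "messages screen"]}
print(select_training_target(mock))
# prints "display page of messages and other options"
```

The selected caption and the screenshot are then passed together through the BLIP processor, which yields the pixel values and tokenized caption sequence used for training.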
Training Hyperparameters
| Parameter | Value |
|---|---|
| Batch size | 4 |
| Epochs | 3 |
| Learning rate | 5e-5 |
| Precision | fp16 |
| Framework | Hugging Face Trainer |
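The table above maps onto Hugging Face `TrainingArguments` roughly as sketched below. The `output_dir` is a placeholder, and all settings not listed in the table (optimizer, scheduler, evaluation strategy) are left at Trainer defaults; the actual training script may have configured them differently.

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameter table above.
training_args = TrainingArguments(
    output_dir="./blip-screen2words",   # placeholder path
    per_device_train_batch_size=4,      # Batch size
    num_train_epochs=3,                 # Epochs
    learning_rate=5e-5,                 # Learning rate
    fp16=True,                          # Precision: fp16 (mixed precision)
)
```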
Compute Infrastructure
- Hardware: NVIDIA T4 GPU (16GB VRAM)
- Platform: Google Colab
- Framework: Hugging Face Transformers
The model was intentionally trained using T4-compatible settings to ensure reproducibility on widely available hardware.
Evaluation
Evaluation was performed qualitatively by comparing generated captions against ground truth captions from the validation set.
| Ground Truth | Prediction |
|---|---|
| display page of messages and other options | display page of messages with options |
| create account details for a chat app | create account page for chat application |
| page displaying a search bar | search page of application |
Limitations
- Captions may occasionally be overly generic.
- The model may confuse application categories.
- Performance may degrade on UI layouts very different from the RICO dataset.
Environmental Impact
Training was conducted on a single NVIDIA T4 GPU for several hours using mixed-precision (fp16) training. Estimated carbon emissions are relatively low compared to large-scale multimodal model training runs.
Citation
If you use this model, please cite the RICO dataset:
Deka, B., et al. "Rico: A Mobile App Dataset for Building Data-Driven Design Applications." UIST 2017.