Temporal-Trio_Multimodal-Fine-Tuning-with-SLM

Model Description

This model is a vision-language captioning model fine-tuned on the RICO Screen2Words dataset to generate natural language descriptions of mobile UI screenshots.

The model takes a mobile interface screenshot as input and produces a short textual description summarizing the screen content. The base architecture is BLIP (Bootstrapping Language-Image Pre-training), which combines a vision encoder with a transformer-based language decoder.

The model was fine-tuned using the Hugging Face Transformers framework.


Model Details

  • Developed by: Neha Nair, Niharika Paul, Niharika Saha
  • Model type: Vision-Language Image Captioning
  • Base model: Salesforce/blip-image-captioning-base
  • Language: English
  • Parameters: ~0.2B (F32, Safetensors)
  • License: Apache 2.0
  • Framework: Hugging Face Transformers

Intended Use

Direct Use

This model can be used to generate captions for mobile UI screenshots.

Example applications include:

  • UI accessibility tools
  • Automated interface documentation
  • Screen summarization
  • UI understanding research

Out-of-Scope Use

The model is not intended for:

  • General scene captioning outside UI screenshots
  • OCR or precise text extraction from images
  • Safety-critical decision systems

How to Use

Load Model

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

model_id = "nikilovesml/temporal_trio-rico-screen2words-blip-caption"

# The processor handles both image preprocessing and caption tokenization.
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

Run Inference

# BLIP expects RGB input; convert in case the screenshot has an alpha channel.
image = Image.open("example_ui.png").convert("RGB")

inputs = processor(images=image, return_tensors="pt")

# Greedy decoding by default; raise max_new_tokens for longer captions.
output = model.generate(**inputs, max_new_tokens=50)

caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)

Training Data

The model was fine-tuned on the RICO Screen2Words dataset, which contains mobile UI screenshots paired with short natural language descriptions of each screen.

Dataset features:

  • UI screenshots from diverse mobile applications
  • Human-written screen descriptions
  • Structured captions describing interface layout and function

Dataset link: rootsautomation/RICO-Screen2Words


Training Procedure

Preprocessing

Each dataset sample contains an image of a UI screen and a list of captions. The first caption was used as the training target. Images and captions were processed using the BLIP processor to generate pixel values (image encoder input) and tokenized caption sequences.
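That preprocessing step can be sketched as a single function. This is a minimal illustration, not the actual training script: the `image` and `captions` field names and the padding settings are assumptions.

```python
def preprocess(sample, processor):
    """Encode one dataset sample for BLIP captioning fine-tuning."""
    # Use the first human-written caption as the training target.
    caption = sample["captions"][0]
    enc = processor(
        images=sample["image"],
        text=caption,
        padding="max_length",
        return_tensors="pt",
    )
    # The decoder is trained to reproduce the caption token sequence.
    enc["labels"] = enc["input_ids"].clone()
    return enc
```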

Training Hyperparameters

  • Batch size: 4
  • Epochs: 3
  • Learning rate: 5e-5
  • Precision: fp16 (mixed precision)
  • Framework: Hugging Face Trainer
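The hyperparameters above map directly onto a Hugging Face `TrainingArguments` configuration. A minimal sketch (the `output_dir` name is illustrative, not taken from the training run):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="blip-screen2words",   # illustrative path
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=5e-5,
    fp16=True,  # mixed precision, suitable for the T4 GPU described below
)
```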

Compute Infrastructure

  • Hardware: NVIDIA T4 GPU (16GB VRAM)
  • Platform: Google Colab
  • Framework: Hugging Face Transformers

The model was intentionally trained using T4-compatible settings to ensure reproducibility on widely available hardware.


Evaluation

Evaluation was performed qualitatively by comparing generated captions against ground truth captions from the validation set.

  • Ground truth: display page of messages and other options
    Prediction: display page of messages with options
  • Ground truth: create account details for a chat app
    Prediction: create account page for chat application
  • Ground truth: page displaying a search bar
    Prediction: search page of application
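Beyond eyeballing pairs like those above, a crude quantitative signal can be computed with token-set overlap. This Jaccard-similarity sketch is an illustration only, not the evaluation method used for this model:

```python
def token_overlap(reference: str, prediction: str) -> float:
    """Jaccard similarity between the token sets of two captions."""
    ref = set(reference.lower().split())
    pred = set(prediction.lower().split())
    if not ref and not pred:
        return 1.0
    return len(ref & pred) / len(ref | pred)

# First example pair from the comparisons above.
print(token_overlap(
    "display page of messages and other options",
    "display page of messages with options",
))  # prints 0.625
```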

Limitations

  • Captions may occasionally be overly generic.
  • The model may confuse application categories.
  • Performance may degrade on UI layouts very different from the RICO dataset.

Environmental Impact

Training was conducted on a single NVIDIA T4 GPU for several hours using mixed precision training. Estimated carbon emissions are relatively low compared to large-scale multimodal models.


Citation

If you use this model, please cite the RICO dataset:

Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., & Kumar, R. (2017). Rico: A Mobile App Dataset for Building Data-Driven Design Applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST '17).
