rico-vit-gpt2-finetuned

Model Details

Model Description

rico-vit-gpt2-finetuned is a multimodal vision-language model fine-tuned on the RICO Screen2Words dataset to generate natural language captions for mobile app UI screenshots.

Given a screenshot of an Android app screen — such as a gallery page, login form, social profile, or settings menu — the model generates a short human-readable description of the interface.

The model is built by fine-tuning a VisionEncoderDecoder architecture combining a Vision Transformer (ViT) encoder and a GPT-2 decoder.

The project demonstrates multimodal fine-tuning with a small language model (SLM) decoder that trains and runs efficiently on a single T4 GPU.


Model Information

  • Developed by: Priyanka S, Padarthi Neha Sai
  • Institution: PES University
  • Model type: Vision-Language Model
  • Architecture: VisionEncoderDecoder (ViT + GPT-2)
  • Base model: nlpconnect/vit-gpt2-image-captioning
  • Task: Image Captioning
  • Language: English
  • Framework: Hugging Face Transformers

Uses

Direct Use

The model generates textual descriptions of mobile UI screenshots.

Example input:

  • Screenshot of a mobile application screen

Example output:


display of screen shows images on gallery page of app

Downstream Use

Potential applications include:

  • Accessibility tools that describe app screens for visually impaired users
  • UI/UX testing pipelines that generate readable descriptions of app states
  • Semantic search over mobile UI screenshot datasets
  • Automated annotation for UI research datasets

Out-of-Scope Use

This model should not be used in safety-critical systems or for automated decision-making.

Limitations include:

  • captions may be overly generic
  • small UI elements may be omitted
  • similar UI layouts may be confused

Bias, Risks, and Limitations

The model inherits limitations from both:

  • the base ViT-GPT2 architecture
  • the RICO Screen2Words dataset

Potential issues include:

  • captions missing small UI details
  • generic predictions for visually similar screens
  • difficulty distinguishing context-dependent screens

Because the model was trained on only 500 samples, it may struggle with rare UI layouts.


Recommendations

Users should treat generated captions as approximate descriptions rather than exact representations of the interface.

Improving performance may require:

  • larger training datasets
  • longer training schedules
  • larger vision-language models

How to Get Started with the Model

Example usage:

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image
import torch

model_name = "pes1ug23am219/rico-vit-gpt2-finetuned"

# Load the fine-tuned encoder-decoder model with its image processor and tokenizer.
model = VisionEncoderDecoderModel.from_pretrained(model_name)
feature_extractor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

# Preprocess the screenshot: resize to 224x224 and normalize.
image = Image.open("example_ui.png").convert("RGB")
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device)

# Generate a caption (training used a 32-token maximum).
with torch.no_grad():
    output = model.generate(pixel_values, max_length=32)

caption = tokenizer.decode(output[0], skip_special_tokens=True)
print(caption)

Training Details

Training Data

The model was fine-tuned using the RICO Screen2Words dataset.

RICO is a large-scale dataset of Android app UI screenshots; Screen2Words extends it with short natural language descriptions of each screen.

Dataset properties:

  • Total images: ~22,000
  • Captions per image: 5
  • Caption style: short UI descriptions
  • Subset used: 500 samples

Dataset split used for this project:

  • Training: 400 samples
  • Validation: 100 samples

The subset was used to ensure training runs within T4 GPU compute limits.


Training Procedure

Preprocessing

Each example contains:

  • Image: processed using ViTImageProcessor
  • Caption: tokenized using GPT-2 tokenizer

Images are resized to 224×224 and normalized using ImageNet statistics.

Captions are padded or truncated to a maximum length of 32 tokens.
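The two preprocessing steps above can be sketched as follows. This is illustrative only: the default ViTImageProcessor encodes the 224x224 resize-and-normalize recipe (the released checkpoint ships its own processor config with the exact mean/std), the padding token id reuses GPT-2's end-of-text id, and the example token ids are made up:

```python
from PIL import Image
import torch
from transformers import ViTImageProcessor

MAX_LEN, PAD_ID = 32, 50256  # GPT-2 has no pad token; its EOS id is reused

# Default processor: resize to 224x224 and normalize per its config.
processor = ViTImageProcessor()

def make_labels(token_ids):
    """Pad/truncate a tokenized caption to MAX_LEN and replace padding
    with -100 so it is ignored by the cross-entropy loss."""
    ids = token_ids[:MAX_LEN] + [PAD_ID] * max(0, MAX_LEN - len(token_ids))
    return torch.tensor([t if t != PAD_ID else -100 for t in ids])

# A blank 540x960 image stands in for a real screenshot; the token ids
# stand in for a GPT-2-tokenized caption.
pixel_values = processor(images=Image.new("RGB", (540, 960)),
                         return_tensors="pt").pixel_values
labels = make_labels([3167, 286, 4324])
print(pixel_values.shape, labels.shape)  # (1, 3, 224, 224) and (32,)
```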


Training Hyperparameters

  • Epochs: 5
  • Batch size: 4
  • Learning rate: 3e-5
  • Warmup steps: 20
  • Max caption length: 32 tokens
  • Precision: FP16 mixed precision

Training was performed using the Hugging Face Trainer API on a Tesla T4 GPU.
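The hyperparameter table maps onto a Trainer configuration roughly as follows. A sketch, not the exact training script: output_dir is a placeholder, and options not stated above (e.g. evaluation strategy) are omitted:

```python
import torch
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameter table above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="rico-vit-gpt2-finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    warmup_steps=20,
    fp16=torch.cuda.is_available(),  # FP16 mixed precision on the T4; off on CPU
    predict_with_generate=True,      # decode captions during evaluation
)
```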


Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was conducted on a held-out validation set of 100 UI screenshots from the RICO Screen2Words dataset.


Metrics

The model was evaluated using BLEU-4 (Bilingual Evaluation Understudy).

BLEU measures n-gram overlap between generated captions and reference captions.

This metric is widely used for:

  • machine translation
  • image captioning
  • text generation tasks
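As an illustration of the metric (not the project's exact evaluation script), BLEU-4 can be computed with NLTK; the caption strings below are made up, and Screen2Words supplies five references per screen:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy BLEU-4 computation; default weights are uniform over 1- to 4-grams.
references = [
    "display of screen shows images on gallery page of app".split(),
    "gallery page of a photo app".split(),
]
candidate = "screen shows images on gallery page".split()

# Smoothing avoids zero scores when a higher-order n-gram has no match.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.4f}")
```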

Results

  • BLEU-4: 0.1722

Summary

The model achieved a BLEU-4 score of 0.1722 on the validation set.

Given the small training subset (500 samples), the model demonstrates a reasonable ability to produce fluent captions in the style of Screen2Words descriptions.

Performance could be improved by training on the full dataset.


Model Architecture and Objective

The model uses the VisionEncoderDecoder architecture:

  • Vision Transformer (ViT) encoder: extracts visual features from screenshots
  • GPT-2 decoder: generates natural language captions

The encoder processes the image and produces embeddings that the decoder uses to generate captions autoregressively.
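The wiring can be sketched with tiny, randomly initialized configs (the sizes, token ids, and vocabulary below are arbitrary; the released model uses a full pretrained ViT-Base encoder and GPT-2 decoder):

```python
import torch
from transformers import (GPT2Config, ViTConfig,
                          VisionEncoderDecoderConfig, VisionEncoderDecoderModel)

# Tiny configs just to show the encoder-decoder wiring; sizes are arbitrary.
enc_cfg = ViTConfig(hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=128)
dec_cfg = GPT2Config(n_embd=64, n_layer=2, n_head=4, vocab_size=1000,
                     bos_token_id=0, eos_token_id=0)
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
config.decoder_start_token_id = dec_cfg.bos_token_id
config.pad_token_id = dec_cfg.eos_token_id
model = VisionEncoderDecoderModel(config=config)

# The encoder turns the image into patch embeddings; the decoder
# cross-attends to them and predicts caption tokens autoregressively.
pixel_values = torch.randn(1, 3, 224, 224)  # one 224x224 RGB screenshot
labels = torch.randint(0, 1000, (1, 8))     # an 8-token caption
out = model(pixel_values=pixel_values, labels=labels)
print(out.logits.shape)  # one logit vector per caption position
```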


Compute Infrastructure

Hardware

  • GPU: NVIDIA Tesla T4 (16GB VRAM)

Software

  • Python
  • PyTorch
  • Hugging Face Transformers
  • Hugging Face Datasets

Environmental Impact

Training was performed on a single T4 GPU for approximately 10 minutes.

Given the short training duration and small dataset size, the environmental impact is minimal.


Model Card Authors

Priyanka S and Padarthi Neha Sai, PES University

