rico-vit-gpt2-finetuned

Model Details

Model Description

rico-vit-gpt2-finetuned is a multimodal vision-language model fine-tuned on the RICO Screen2Words dataset to generate natural language captions for mobile app UI screenshots.

Given a screenshot of an Android app screen — such as a gallery page, login form, social profile, or settings menu — the model generates a short human-readable description of the interface.

The model is built by fine-tuning a VisionEncoderDecoder architecture combining a Vision Transformer (ViT) encoder and a GPT-2 decoder.

The project demonstrates multimodal fine-tuning with a small language model (SLM) decoder that trains and runs efficiently on a single T4 GPU.


Model Information

  • Developed by: Priyanka S, Padarthi Neha Sai
  • Institution: PES University
  • Model type: Vision-Language Model
  • Architecture: VisionEncoderDecoder (ViT + GPT-2)
  • Base model: nlpconnect/vit-gpt2-image-captioning
  • Task: Image Captioning
  • Language: English
  • Framework: Hugging Face Transformers

Uses

Direct Use

The model generates textual descriptions of mobile UI screenshots.

Example input:

  • Screenshot of a mobile application screen

Example output:


display of screen shows images on gallery page of app

Downstream Use

Potential applications include:

  • Accessibility tools that describe app screens for visually impaired users
  • UI/UX testing pipelines that generate readable descriptions of app states
  • Semantic search over mobile UI screenshot datasets
  • Automated annotation for UI research datasets

Out-of-Scope Use

This model should not be used in safety-critical systems or for automated decision-making.

Limitations include:

  • captions may be overly generic
  • small UI elements may be omitted
  • similar UI layouts may be confused

Bias, Risks, and Limitations

The model inherits limitations from both:

  • the base ViT-GPT2 architecture
  • the RICO Screen2Words dataset

Potential issues include:

  • captions missing small UI details
  • generic predictions for visually similar screens
  • difficulty distinguishing context-dependent screens

Because the model was trained on only 500 samples, it may struggle with rare UI layouts.


Recommendations

Users should treat generated captions as approximate descriptions rather than exact representations of the interface.

Improving performance may require:

  • larger training datasets
  • longer training schedules
  • larger vision-language models

How to Get Started with the Model

Example usage:

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image
import torch

model_name = "pes1ug23am219/rico-vit-gpt2-finetuned"

# Load the fine-tuned encoder-decoder model with its image processor and tokenizer.
model = VisionEncoderDecoderModel.from_pretrained(model_name)
feature_extractor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

# Preprocess the screenshot: resize to 224x224 and normalize.
image = Image.open("example_ui.png").convert("RGB")
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device)

# Generate a caption (training used a 32-token maximum).
with torch.no_grad():
    output = model.generate(pixel_values, max_length=32)

caption = tokenizer.decode(output[0], skip_special_tokens=True)
print(caption)

Training Details

Training Data

The model was fine-tuned using the RICO Screen2Words dataset.

RICO is a large-scale dataset of Android app UI screenshots; Screen2Words extends it with short natural language descriptions of each screen.

Dataset properties:

  • Total images: ~22,000
  • Captions per image: 5
  • Caption style: short UI descriptions
  • Subset used: 500 samples

Dataset split used for this project:

  • Training: 400 samples
  • Validation: 100 samples

The subset was used to ensure training runs within T4 GPU compute limits.


Training Procedure

Preprocessing

Each example contains:

  • Image: processed using ViTImageProcessor
  • Caption: tokenized using GPT-2 tokenizer

Images are resized to 224×224 and normalized using ImageNet statistics.

Captions are padded or truncated to a maximum length of 32 tokens.
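The two preprocessing steps above can be sketched as follows. This is illustrative only: the default ViTImageProcessor encodes the 224x224 resize-and-normalize recipe (the released checkpoint ships its own processor config with the exact mean/std), the padding token id reuses GPT-2's end-of-text id, and the example token ids are made up:

```python
from PIL import Image
import torch
from transformers import ViTImageProcessor

MAX_LEN, PAD_ID = 32, 50256  # GPT-2 has no pad token; its EOS id is reused

# Default processor: resize to 224x224 and normalize per its config.
processor = ViTImageProcessor()

def make_labels(token_ids):
    """Pad/truncate a tokenized caption to MAX_LEN and replace padding
    with -100 so it is ignored by the cross-entropy loss."""
    ids = token_ids[:MAX_LEN] + [PAD_ID] * max(0, MAX_LEN - len(token_ids))
    return torch.tensor([t if t != PAD_ID else -100 for t in ids])

# A blank 540x960 image stands in for a real screenshot; the token ids
# stand in for a GPT-2-tokenized caption.
pixel_values = processor(images=Image.new("RGB", (540, 960)),
                         return_tensors="pt").pixel_values
labels = make_labels([3167, 286, 4324])
print(pixel_values.shape, labels.shape)  # (1, 3, 224, 224) and (32,)
```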


Training Hyperparameters

  • Epochs: 5
  • Batch size: 4
  • Learning rate: 3e-5
  • Warmup steps: 20
  • Max caption length: 32 tokens
  • Precision: FP16 mixed precision

Training was performed using the Hugging Face Trainer API on a Tesla T4 GPU.
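The hyperparameter table maps onto a Trainer configuration roughly as follows. A sketch, not the exact training script: output_dir is a placeholder, and options not stated above (e.g. evaluation strategy) are omitted:

```python
import torch
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameter table above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="rico-vit-gpt2-finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    warmup_steps=20,
    fp16=torch.cuda.is_available(),  # FP16 mixed precision on the T4; off on CPU
    predict_with_generate=True,      # decode captions during evaluation
)
```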


Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was conducted on a held-out validation set of 100 UI screenshots from the RICO Screen2Words dataset.


Metrics

The model was evaluated using BLEU-4 (Bilingual Evaluation Understudy).

BLEU measures n-gram overlap between generated captions and reference captions.

This metric is widely used for:

  • machine translation
  • image captioning
  • text generation tasks
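As an illustration of the metric (not the project's exact evaluation script), BLEU-4 can be computed with NLTK; the caption strings below are made up, and Screen2Words supplies five references per screen:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy BLEU-4 computation; default weights are uniform over 1- to 4-grams.
references = [
    "display of screen shows images on gallery page of app".split(),
    "gallery page of a photo app".split(),
]
candidate = "screen shows images on gallery page".split()

# Smoothing avoids zero scores when a higher-order n-gram has no match.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.4f}")
```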

Results

  • BLEU-4: 0.1722

Summary

The model achieved a BLEU-4 score of 0.1722 on the validation set.

Given the small training subset (500 samples), the model demonstrates a reasonable ability to produce fluent captions in the style of Screen2Words descriptions.

Performance could be improved by training on the full dataset.


Model Architecture and Objective

The model uses the VisionEncoderDecoder architecture:

  • Vision Transformer (ViT) encoder: extracts visual features from screenshots
  • GPT-2 decoder: generates natural language captions

The encoder processes the image and produces embeddings that the decoder uses to generate captions autoregressively.
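The wiring can be sketched with tiny, randomly initialized configs (the sizes, token ids, and vocabulary below are arbitrary; the released model uses a full pretrained ViT-Base encoder and GPT-2 decoder):

```python
import torch
from transformers import (GPT2Config, ViTConfig,
                          VisionEncoderDecoderConfig, VisionEncoderDecoderModel)

# Tiny configs just to show the encoder-decoder wiring; sizes are arbitrary.
enc_cfg = ViTConfig(hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=128)
dec_cfg = GPT2Config(n_embd=64, n_layer=2, n_head=4, vocab_size=1000,
                     bos_token_id=0, eos_token_id=0)
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
config.decoder_start_token_id = dec_cfg.bos_token_id
config.pad_token_id = dec_cfg.eos_token_id
model = VisionEncoderDecoderModel(config=config)

# The encoder turns the image into patch embeddings; the decoder
# cross-attends to them and predicts caption tokens autoregressively.
pixel_values = torch.randn(1, 3, 224, 224)  # one 224x224 RGB screenshot
labels = torch.randint(0, 1000, (1, 8))     # an 8-token caption
out = model(pixel_values=pixel_values, labels=labels)
print(out.logits.shape)  # one logit vector per caption position
```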


Compute Infrastructure

Hardware

  • GPU: NVIDIA Tesla T4 (16GB VRAM)

Software

  • Python
  • PyTorch
  • Hugging Face Transformers
  • Hugging Face Datasets

Environmental Impact

Training was performed on a single T4 GPU for approximately 10 minutes.

Given the short training duration and small dataset size, the environmental impact is minimal.


Model Card Authors

Priyanka S and Padarthi Neha Sai, PES University

