# rico-vit-gpt2-finetuned

## Model Details

### Model Description
rico-vit-gpt2-finetuned is a multimodal vision-language model fine-tuned on the RICO Screen2Words dataset to generate natural language captions for mobile app UI screenshots.
Given a screenshot of an Android app screen — such as a gallery page, login form, social profile, or settings menu — the model generates a short human-readable description of the interface.
The model is built by fine-tuning a VisionEncoderDecoder architecture combining a Vision Transformer (ViT) encoder and a GPT-2 decoder.
The project demonstrates multimodal fine-tuning with a small language model (SLM) decoder that trains efficiently on a single NVIDIA T4 GPU.
### Model Information
- Developed by: Priyanka S, Padarthi Neha Sai
- Institution: PES University
- Model type: Vision-Language Model
- Architecture: VisionEncoderDecoder (ViT + GPT-2)
- Base model: nlpconnect/vit-gpt2-image-captioning
- Task: Image Captioning
- Language: English
- Framework: Hugging Face Transformers
### Model Sources
- Dataset: https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words
- Base Model: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
## Uses

### Direct Use
The model generates textual descriptions of mobile UI screenshots.

Example input:
- Screenshot of a mobile application screen

Example output:
`display of screen shows images on gallery page of app`
### Downstream Use
Potential applications include:
- Accessibility tools that describe app screens for visually impaired users
- UI/UX testing pipelines that generate readable descriptions of app states
- Semantic search over mobile UI screenshot datasets
- Automated annotation for UI research datasets
### Out-of-Scope Use
This model should not be used in safety-critical or automated decision-making systems.

Known limitations:
- captions may be overly generic
- small UI elements may be omitted
- visually similar UI layouts may be confused
## Bias, Risks, and Limitations
The model inherits limitations from both:
- the base ViT-GPT2 architecture
- the RICO Screen2Words dataset
Potential issues include:
- captions missing small UI details
- generic predictions for visually similar screens
- difficulty distinguishing context-dependent screens
Because the model was trained on only 500 samples, it may struggle with rare UI layouts.
### Recommendations
Users should treat generated captions as approximate descriptions rather than exact representations of the interface.
Improving performance may require:
- larger training datasets
- longer training schedules
- larger vision-language models
## How to Get Started with the Model

Example usage:
```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image
import torch

model_name = "pes1ug23am219/rico-vit-gpt2-finetuned"

# Load the fine-tuned model together with its image processor and tokenizer
model = VisionEncoderDecoderModel.from_pretrained(model_name)
feature_extractor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# Preprocess a UI screenshot into ViT pixel values
image = Image.open("example_ui.png").convert("RGB")
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values

# Generate a caption (max 32 tokens, matching the training setup)
with torch.no_grad():
    output = model.generate(pixel_values, max_length=32)
caption = tokenizer.decode(output[0], skip_special_tokens=True)
print(caption)
```
## Training Details

### Training Data
The model was fine-tuned using the RICO Screen2Words dataset.
RICO is a large-scale dataset of Android app UI screenshots; the Screen2Words extension pairs these screens with short natural language descriptions.
Dataset properties:
| Property | Value |
|---|---|
| Total images | ~22,000 |
| Captions per image | 5 |
| Caption style | Short UI descriptions |
| Subset used | 500 samples |
Dataset split used for this project:
| Split | Samples |
|---|---|
| Training | 400 |
| Validation | 100 |
The subset was used to ensure training runs within T4 GPU compute limits.
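A split like the one above can be produced by shuffling the 500 sample indices with a fixed seed and cutting at 400. This is an illustrative sketch only: the seed value and selection strategy are assumptions, and the original run may have used a different mechanism (e.g. the `datasets` library's `train_test_split`).

```python
import random

# Carve a 400/100 train/validation split out of the 500-sample subset.
# Seed 42 is an assumption chosen for reproducibility, not taken from
# the original training run.
indices = list(range(500))
random.Random(42).shuffle(indices)
train_idx, val_idx = indices[:400], indices[400:]
```

A fixed seed keeps the split reproducible across runs, which matters when comparing checkpoints trained on the same subset.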
### Training Procedure

#### Preprocessing
Each example contains:
- Image: processed using ViTImageProcessor
- Caption: tokenized using the GPT-2 tokenizer
Images are resized to 224×224 and normalized using ImageNet statistics.
Captions are padded or truncated to a maximum length of 32 tokens.
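The padding/truncation step can be sketched in plain Python. The pad token id is an assumption: GPT-2 has no dedicated pad token, so its EOS token (id 50256) is commonly reused for padding, as done here.

```python
MAX_LEN = 32
PAD_ID = 50256  # assumption: GPT-2's EOS token reused as the pad token

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Clip a caption's token ids to max_len, then right-pad with pad_id."""
    ids = token_ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))
```

In practice the tokenizer handles this via `padding="max_length"`, `truncation=True`, and `max_length=32`; the function above just makes the effect explicit.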
#### Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Batch size | 4 |
| Learning rate | 3e-5 |
| Warmup steps | 20 |
| Max caption length | 32 |
| Precision | FP16 mixed precision |
Training was performed using the Hugging Face Trainer API on a Tesla T4 GPU.
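The table above maps directly onto Hugging Face training arguments. The sketch below is a hypothetical reconstruction, not the exact configuration used: the argument names are the standard `Seq2SeqTrainingArguments` ones, and `output_dir` is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameter table above
training_args = Seq2SeqTrainingArguments(
    output_dir="rico-vit-gpt2-finetuned",  # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    warmup_steps=20,
    fp16=True,                   # mixed-precision training on the T4
    predict_with_generate=True,  # generate captions during evaluation
)
```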
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
Evaluation was conducted on a held-out validation set of 100 UI screenshots from the RICO Screen2Words dataset.
#### Metrics
The model was evaluated using BLEU-4 (Bilingual Evaluation Understudy).
BLEU measures n-gram overlap between generated captions and reference captions.
This metric is widely used for:
- machine translation
- image captioning
- text generation tasks
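For intuition, the metric can be sketched from scratch: unsmoothed BLEU-4 against a single reference is the geometric mean of clipped 1- to 4-gram precisions times a brevity penalty. The actual evaluation likely used a library implementation (e.g. NLTK or `evaluate`), which may differ in smoothing and multi-reference handling.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Unsmoothed BLEU-4 against a single whitespace-tokenized reference."""
    cand_tok, ref_tok = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand_tok, n))
        ref_counts = Counter(ngrams(ref_tok, n))
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages captions shorter than the reference.
    bp = 1.0 if len(cand_tok) >= len(ref_tok) else math.exp(1 - len(ref_tok) / len(cand_tok))
    return bp * math.exp(sum(log_precisions) / 4)
```

Short UI captions often share few 4-grams with the reference, which is one reason absolute BLEU-4 values in this setting tend to be low.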
### Results
| Metric | Score |
|---|---|
| BLEU-4 | 0.1722 |
#### Summary

The model achieved a BLEU-4 score of 0.1722 on the validation set.
Given the small training subset (500 samples), the model demonstrates a reasonable ability to produce fluent captions in the style of Screen2Words descriptions.
Performance could be improved by training on the full dataset.
## Model Architecture and Objective
The model uses the VisionEncoderDecoder architecture:
| Component | Role |
|---|---|
| Vision Transformer (ViT) | Extracts visual features from screenshots |
| GPT-2 Decoder | Generates natural language captions |
The encoder processes the image and produces embeddings that the decoder uses to generate captions autoregressively.
## Compute Infrastructure

### Hardware
- GPU: NVIDIA Tesla T4 (16GB VRAM)
### Software
- Python
- PyTorch
- Hugging Face Transformers
- Hugging Face Datasets
## Environmental Impact
Training was performed on a single T4 GPU for approximately 10 minutes.
Given the short training duration and small dataset size, the environmental impact is minimal.
## Model Card Authors
- Priyanka S
- Padarthi Neha Sai

PES University