# rico-smolvlm-full

SmolVLM-256M-Instruct fine-tuned with LoRA on RICO-Screen2Words for UI screenshot captioning. LoRA adapters were merged into the base model for single-artifact deployment.

## Model details

| Field | Value |
|---|---|
| Base model | HuggingFaceTB/SmolVLM-256M-Instruct |
| Dataset | rootsautomation/RICO-Screen2Words |
| Task | Image-to-text captioning of Android UI screenshots |
| Method | LoRA (r=16, α=32, all-linear, dropout=0.05) |
| Trainable params | ~7.7M / 264M (≈3%) |
| Precision | bfloat16 |
| Epochs | 1 |
| Learning rate | 2e-4, cosine schedule with 3% warmup |
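The LoRA setup in the table corresponds to a `peft` configuration along these lines. This is a sketch of what the config implies, not the exact training script; r, alpha, dropout, and the all-linear targeting come from the card, everything else is left at peft defaults:

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the table above.
# r=16, alpha=32, dropout=0.05, "all-linear" are from the card;
# remaining arguments are peft defaults, not the author's verified script.
lora_config = LoraConfig(
    r=16,                         # low-rank dimension
    lora_alpha=32,                # scaling numerator (effective scale = alpha / r = 2.0)
    lora_dropout=0.05,
    target_modules="all-linear",  # wrap every linear layer in the model
)
```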

## Usage

### Full model (recommended)

```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "AmitPrakash/rico-smolvlm-full"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16
).cuda().eval()  # requires a CUDA GPU; on CPU, drop .cuda() and use float32

image = Image.open("screenshot.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this UI screenshot accurately and concisely."}
    ]
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=96, do_sample=False)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
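Note that `generate` returns the prompt tokens followed by the new tokens, so the decoded string above includes the chat-template prompt. To keep only the generated caption, slice the output ids past the prompt length (`inputs["input_ids"].shape[1]`) before decoding. The trim itself is just array slicing, sketched here with numpy standing in for tensors:

```python
import numpy as np

# Toy stand-in: one sequence, a 4-token prompt followed by 3 generated ids.
prompt_len = 4
out = np.array([[101, 7, 8, 9, 42, 43, 44]])

# Keep only the tokens generated after the prompt, as you would with `out` above.
generated_only = out[:, prompt_len:]
print(generated_only.tolist())  # → [[42, 43, 44]]
```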

### LoRA adapter variant (load and merge before inference)

```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

base = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM-256M-Instruct", dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "AmitPrakash/rico-smolvlm-lora")
model = model.merge_and_unload().cuda().eval()
processor = AutoProcessor.from_pretrained("AmitPrakash/rico-smolvlm-lora")

# Then run inference exactly as in the full-model example above.
```
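`merge_and_unload` folds each adapter back into its base weight matrix, so inference carries no extra adapter layers. Per wrapped linear layer it computes W' = W + (α/r)·B·A. A minimal numpy sketch of that arithmetic, with toy shapes rather than the model's real dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4       # toy dims; the model uses r=16, alpha=32

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # LoRA down-projection (trained)
B = np.zeros((d_out, r))                 # LoRA up-projection (zero-initialized)
B[0, 0] = 1.0                            # pretend training moved B off zero

scaling = alpha / r                      # = 2.0, same ratio as alpha=32 / r=16
W_merged = W + scaling * (B @ A)         # what merge_and_unload performs

# The merged layer matches base + adapter applied separately.
x = rng.standard_normal(d_in)
assert np.allclose(W_merged @ x, W @ x + scaling * (B @ (A @ x)))
```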

## Design decisions

| Decision | Choice | Reason |
|---|---|---|
| Base model | SmolVLM-256M-Instruct | Smallest production-quality VLM; fits in <4 GB VRAM |
| Dataset column | `captions[0]` | The dataset stores five captions per screen as a list; the first is used for training |
| Fine-tuning | LoRA on all linear layers | ~3% trainable params, fast convergence, low memory |
| Deployment | Merged full model | No adapter-loading code at inference; simpler and faster |
| Precision | bfloat16 | Numerically stable on Ampere and newer GPUs |
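The memory and trainable-fraction figures above are easy to sanity-check (back-of-envelope only: 2 bytes per bf16 parameter, ignoring activations and KV cache):

```python
total_params = 264e6
trainable_params = 7.7e6
bytes_per_bf16 = 2

weight_gib = total_params * bytes_per_bf16 / 2**30
trainable_pct = 100 * trainable_params / total_params

print(f"bf16 weights: {weight_gib:.2f} GiB")   # ≈ 0.49 GiB, well under 4 GB
print(f"trainable:    {trainable_pct:.1f} %")  # ≈ 2.9 %
```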