# rico-smolvlm-full
SmolVLM-256M-Instruct fine-tuned with LoRA on RICO-Screen2Words for UI screenshot captioning. LoRA adapters were merged into the base model for single-artifact deployment.
## Model details

| Field | Value |
|---|---|
| Base model | HuggingFaceTB/SmolVLM-256M-Instruct |
| Dataset | rootsautomation/RICO-Screen2Words |
| Task | Image-to-text captioning of Android UI screenshots |
| Method | LoRA (r=16, α=32, all-linear, dropout=0.05) |
| Trainable params | ~7.7M / 264M (≈3%) |
| Precision | bfloat16 |
| Epochs | 1 |
| LR | 2e-4 with cosine warmup (3%) |
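For reference, the LoRA setup in the table can be expressed as a PEFT config. This is a sketch reconstructed from the hyperparameters above, not the exact training script; the argument names follow the `peft` library's `LoraConfig`:

```python
from peft import LoraConfig

# Hyperparameters from the table above; "all-linear" tells PEFT to wrap
# every linear layer in the model with a low-rank adapter.
lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor (α)
    lora_dropout=0.05,          # dropout applied to the LoRA branch
    target_modules="all-linear",
)
```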
## Usage

### Full model (recommended)
```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "AmitPrakash/rico-smolvlm-full"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16
).cuda().eval()

image = Image.open("screenshot.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this UI screenshot accurately and concisely."}
    ]
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Move input tensors to the same device as the model
inputs = {k: v.cuda() for k, v in processor(text=[text], images=[image], return_tensors="pt").items()}

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=96, do_sample=False)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
### LoRA adapter variant (load + merge before inference)
```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

base = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM-256M-Instruct", dtype=torch.bfloat16
)
# Attach the adapter, then fold its weights into the base model
model = PeftModel.from_pretrained(base, "AmitPrakash/rico-smolvlm-lora")
model = model.merge_and_unload().cuda().eval()
processor = AutoProcessor.from_pretrained("AmitPrakash/rico-smolvlm-lora")
# then run inference as above
```
## Design decisions

| Decision | Choice | Reason |
|---|---|---|
| Base model | SmolVLM-256M-Instruct | Smallest production-quality VLM; fits in <4 GB VRAM |
| Dataset column | `captions[0]` | Dataset stores 5 captions per screen as a list; the first is used for training |
| Fine-tuning | LoRA on all linear layers | ~3% trainable params, fast convergence, low memory |
| Deployment | Merged full model | No adapter loading code at inference; simpler and faster |
| Precision | bfloat16 | Numerically stable on Ampere+ GPUs |