# rico-smolvlm-full
SmolVLM-256M-Instruct fine-tuned with LoRA on RICO-Screen2Words for UI screenshot captioning. LoRA adapters were merged into the base model for single-artifact deployment.
## Model details

| Field | Value |
|---|---|
| Base model | HuggingFaceTB/SmolVLM-256M-Instruct |
| Dataset | rootsautomation/RICO-Screen2Words |
| Task | Image-to-text captioning of Android UI screenshots |
| Method | LoRA (r=16, α=32, all-linear, dropout=0.05) |
| Trainable params | ~7.7M / 264M (≈3%) |
| Precision | bfloat16 |
| Epochs | 1 |
| LR | 2e-4 with cosine warmup (3%) |
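For reference, the LoRA setup in the table can be expressed as a PEFT config. This is a sketch reconstructed from the hyperparameters above, not the exact training script; the argument names follow the `peft` library's `LoraConfig`:

```python
from peft import LoraConfig

# Hyperparameters from the table above; "all-linear" tells PEFT to wrap
# every linear layer in the model with a low-rank adapter.
lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor (α)
    lora_dropout=0.05,          # dropout applied to the LoRA branch
    target_modules="all-linear",
)
```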
## Usage

### Full model (recommended)
```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "AmitPrakash/rico-smolvlm-full"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16
).cuda().eval()

image = Image.open("screenshot.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this UI screenshot accurately and concisely."}
    ]
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Move input tensors to the same device as the model
inputs = {k: v.cuda() for k, v in processor(text=[text], images=[image], return_tensors="pt").items()}

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=96, do_sample=False)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
### LoRA adapter variant (load + merge before inference)
```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

base = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM-256M-Instruct", dtype=torch.bfloat16
)
# Attach the adapter, then fold its weights into the base model
model = PeftModel.from_pretrained(base, "AmitPrakash/rico-smolvlm-lora")
model = model.merge_and_unload().cuda().eval()
processor = AutoProcessor.from_pretrained("AmitPrakash/rico-smolvlm-lora")
# then run inference as above
```
## Design decisions

| Decision | Choice | Reason |
|---|---|---|
| Base model | SmolVLM-256M-Instruct | Smallest production-quality VLM; fits in <4 GB VRAM |
| Dataset column | `captions[0]` | Dataset stores 5 captions per screen as a list; the first is used for training |
| Fine-tuning | LoRA on all linear layers | ~3% trainable params, fast convergence, low memory |
| Deployment | Merged full model | No adapter loading code at inference; simpler and faster |
| Precision | bfloat16 | Numerically stable on Ampere+ GPUs |