---
license: cc-by-nc-4.0
library_name: transformers
tags:
- mobile-o
- multimodal
- unified-model
- vision-language
- text-to-image
- image-understanding
- on-device
- mobile
pipeline_tag: text-to-image
datasets:
- Amshaker/Mobile-O-Post-Train
- Amshaker/Mobile-O-SFT
- Amshaker/Mobile-O-Pre-Train
base_model:
- Efficient-Large-Model/Sana_600M_512px_diffusers
- apple/FastVLM-0.5B
---
## Overview

Mobile-O-0.5B is a compact unified vision-language-diffusion model that performs both multimodal understanding (VQA, OCR, reasoning) and image generation within a single architecture, designed for mobile and edge deployment.
| Spec | Detail |
|---|---|
| Total Parameters | 1.6B |
| Image Resolution | 512×512 |
| Image Generation | ~3 seconds on iPhone |
| Visual Understanding | ~0.4 seconds on iPhone |
| Memory Footprint | < 2GB |
## Supported Tasks

| Task | Input → Output |
|---|---|
| Conversational AI | Text → Text |
| Image Understanding | Image + Text → Text |
| Image Generation | Text → Image |
| Image Editing | Image + Text → Image |
## Quick Start

### Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Amshaker/Mobile-O-0.5B",
    repo_type="model",
    local_dir="checkpoints",
    allow_patterns=["final_merged_model_23620/*"]
)
```
### Image Understanding

```bash
python infer_und.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"
```

### Image Generation

```bash
python infer_gen.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"
```

### Image Editing

```bash
python infer_edit.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"
```
## Architecture

Mobile-O consists of three main components:

- Vision-Language Model (VLM): FastVLM-0.5B, pairing a FastViT vision encoder with a Qwen2-0.5B language backbone
- Diffusion Decoder: SANA-600M-512, a lightweight linear DiT with a VAE for 512×512 generation
- Mobile Conditioning Projector (MCP): a ~2.4M-parameter connector using layerwise feature fusion with temperature-scaled weights, depthwise-separable 1D convolutions, and efficient channel attention
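The MCP described above can be sketched in PyTorch. This is a minimal illustration of the three named ingredients (temperature-scaled layerwise fusion, a depthwise-separable 1D convolution, and an ECA-style channel-attention gate), not the released implementation; the layer count, hidden width, kernel sizes, and temperature value are all assumptions.

```python
import torch
import torch.nn as nn


class MobileConditioningProjector(nn.Module):
    """Hypothetical sketch of the MCP connector (shapes are assumptions)."""

    def __init__(self, num_layers: int = 24, dim: int = 896, temperature: float = 2.0):
        super().__init__()
        # Learnable per-layer logits; softmax with temperature yields fusion weights
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature
        # Depthwise-separable 1D convolution over the token axis
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)
        # Efficient channel attention: a tiny 1D conv over the pooled channel descriptor
        self.eca = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, hidden_states):
        # hidden_states: list of (batch, tokens, dim) tensors, one per VLM layer
        weights = torch.softmax(self.layer_logits / self.temperature, dim=0)
        fused = sum(w * h for w, h in zip(weights, hidden_states))  # (B, T, C)
        x = fused.transpose(1, 2)                                   # (B, C, T)
        x = self.pointwise(self.depthwise(x))
        # Pool over tokens -> (B, 1, C), gate channels with a sigmoid
        gate = torch.sigmoid(self.eca(x.mean(dim=2, keepdim=True).transpose(1, 2)))
        x = x * gate.transpose(1, 2)                                # channel-wise gating
        return x.transpose(1, 2)                                    # back to (B, T, C)
```

The fused (B, T, C) output would then serve as the conditioning signal for the diffusion decoder; a depthwise-separable conv plus a single-channel ECA conv keeps the parameter count small, consistent with the ~2.4M figure the card quotes.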
## Training

Mobile-O is trained in three stages:

1. Pre-training: cross-modal alignment on 4M text-image pairs
2. SFT: supervised fine-tuning on ~105K curated pairs
3. Post-training: unified multimodal training on ~105K quadruplets
## Related Resources

| Resource | Link |
|---|---|
| Mobile-O-1.5B | Model |
| Mobile-O-0.5B-iOS | iOS Components |
| iOS App Source Code | Mobile-O-App |
## Citation

```bibtex
@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:2602.20161},
  year={2026}
}
```
## License

Released under CC BY-NC 4.0. For research purposes only.