---
license: cc-by-nc-4.0
library_name: transformers
tags:
- mobile-o
- multimodal
- unified-model
- vision-language
- text-to-image
- image-understanding
- on-device
- mobile
pipeline_tag: text-to-image
datasets:
- Amshaker/Mobile-O-Post-Train
- Amshaker/Mobile-O-SFT
- Amshaker/Mobile-O-Pre-Train
base_model:
- Efficient-Large-Model/Sana_600M_512px_diffusers
- apple/FastVLM-0.5B
---
## Overview

Mobile-O-0.5B is a compact unified vision-language-diffusion model that performs both multimodal understanding (VQA, OCR, reasoning) and image generation within a single architecture, designed for mobile and edge deployment.
| Spec | Detail |
|---|---|
| Total Parameters | 1.6B |
| Image Resolution | 512×512 |
| Image Generation | ~3 seconds on iPhone |
| Visual Understanding | ~0.4 seconds on iPhone |
| Memory Footprint | < 2GB |
## Supported Tasks

| Task | Input → Output |
|---|---|
| Conversational AI | Text → Text |
| Image Understanding | Image + Text → Text |
| Image Generation | Text → Image |
| Image Editing | Image + Text → Image |
## Quick Start

### Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Amshaker/Mobile-O-0.5B",
    repo_type="model",
    local_dir="checkpoints",
    allow_patterns=["final_merged_model_23620/*"]
)
```
### Image Understanding

```bash
python infer_und.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"
```

### Image Generation

```bash
python infer_gen.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"
```

### Image Editing

```bash
python infer_edit.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"
```
## Architecture

Mobile-O consists of three main components:

- Vision-Language Model (VLM): FastVLM-0.5B, pairing a FastViT vision encoder with a Qwen2-0.5B language backbone
- Diffusion Decoder: SANA-600M-512, a lightweight linear DiT with a VAE for 512×512 generation
- Mobile Conditioning Projector (MCP): a ~2.4M-parameter connector using layerwise feature fusion with temperature-scaled weights, depthwise-separable 1D convolutions, and efficient channel attention
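The MCP described above can be sketched in PyTorch. This is a minimal illustration of the three named ingredients (temperature-scaled layerwise fusion, a depthwise-separable 1D convolution, and an ECA-style channel-attention gate), not the released implementation; the layer count, hidden width, kernel sizes, and temperature value are all assumptions.

```python
import torch
import torch.nn as nn


class MobileConditioningProjector(nn.Module):
    """Hypothetical sketch of the MCP connector (shapes are assumptions)."""

    def __init__(self, num_layers: int = 24, dim: int = 896, temperature: float = 2.0):
        super().__init__()
        # Learnable per-layer logits; softmax with temperature yields fusion weights
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature
        # Depthwise-separable 1D convolution over the token axis
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)
        # Efficient channel attention: a tiny 1D conv over the pooled channel descriptor
        self.eca = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, hidden_states):
        # hidden_states: list of (batch, tokens, dim) tensors, one per VLM layer
        weights = torch.softmax(self.layer_logits / self.temperature, dim=0)
        fused = sum(w * h for w, h in zip(weights, hidden_states))  # (B, T, C)
        x = fused.transpose(1, 2)                                   # (B, C, T)
        x = self.pointwise(self.depthwise(x))
        # Pool over tokens -> (B, 1, C), gate channels with a sigmoid
        gate = torch.sigmoid(self.eca(x.mean(dim=2, keepdim=True).transpose(1, 2)))
        x = x * gate.transpose(1, 2)                                # channel-wise gating
        return x.transpose(1, 2)                                    # back to (B, T, C)
```

The fused (B, T, C) output would then serve as the conditioning signal for the diffusion decoder; a depthwise-separable conv plus a single-channel ECA conv keeps the parameter count small, consistent with the ~2.4M figure the card quotes.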
## Training

Mobile-O is trained in three stages:

1. Pre-training: cross-modal alignment on 4M text-image pairs
2. SFT: supervised fine-tuning on ~105K curated pairs
3. Post-training: unified multimodal training on ~105K quadruplets
## Related Resources

| Resource | Link |
|---|---|
| Mobile-O-1.5B | Model |
| Mobile-O-0.5B-iOS | iOS Components |
| iOS App Source Code | Mobile-O-App |
## Citation

```bibtex
@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:2602.20161},
  year={2026}
}
```
## License

Released under CC BY-NC 4.0. For research purposes only.