Commit e5f5711 — Amshaker committed (0 parents)

Duplicate from Amshaker/Mobile-O-0.5B

Co-authored-by: Abdelrahman Shaker <Amshaker@users.noreply.huggingface.co>
.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
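The patterns above route large binaries through Git LFS. As an illustrative aside (not part of the repo), a rough check of which filenames these patterns would track can be sketched with `fnmatch` — note that real gitattributes matching has extra rules (e.g. `saved_model/**/*`) that `fnmatch` does not fully replicate:

```python
from fnmatch import fnmatch

# A subset of the LFS patterns from the .gitattributes above.
LFS_PATTERNS = ["*.safetensors", "*.bin", "*.pt", "*tfevents*", "tokenizer.json"]

def is_lfs_tracked(filename: str) -> bool:
    """Return True if the filename matches any of the listed LFS patterns."""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)

print(is_lfs_tracked("model.safetensors"))  # True
print(is_lfs_tracked("config.json"))        # False
```

This explains why `model.safetensors`, `gen_projector.bin`, and `tokenizer.json` below appear as LFS pointer files rather than inline content.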
README.md ADDED
@@ -0,0 +1,141 @@
+ ---
+ license: cc-by-nc-4.0
+ library_name: transformers
+ tags:
+ - mobile-o
+ - multimodal
+ - unified-model
+ - vision-language
+ - text-to-image
+ - image-understanding
+ - on-device
+ - mobile
+ pipeline_tag: text-to-image
+ datasets:
+ - Amshaker/Mobile-O-Post-Train
+ - Amshaker/Mobile-O-SFT
+ - Amshaker/Mobile-O-Pre-Train
+ base_model:
+ - Efficient-Large-Model/Sana_600M_512px_diffusers
+ - apple/FastVLM-0.5B
+ ---
+
+ <div align="center">
+
+ <h1>
+ <img src="https://github.com/Amshaker/Mobile-O/blob/main/assets/mobile-o-logo.png?raw=true" width="30" /> Mobile-O-0.5B
+ </h1>
+
+ **Unified Multimodal Understanding and Generation on Mobile Device**
+
+ <p>
+ <a href="https://arxiv.org/abs/2602.20161"><img src="https://img.shields.io/badge/arXiv-2602.20161-b31b1b.svg" alt="arXiv"></a>
+ <a href="https://github.com/Amshaker/Mobile-O"><img src="https://img.shields.io/badge/GitHub-Code-black.svg" alt="Code"></a>
+ <a href="https://amshaker.github.io/Mobile-O/"><img src="https://img.shields.io/badge/🌐-Project_Page-2563eb.svg" alt="Project Page"></a>
+ <a href="https://mobileo.cvmbzuai.com/"><img src="https://img.shields.io/badge/🚀-Live_Demo-10b981.svg" alt="Demo"></a>
+ <a href="https://huggingface.co/collections/Amshaker/mobile-o-datasets"><img src="https://img.shields.io/badge/🤗-Datasets-yellow.svg" alt="Datasets"></a>
+ <a href="https://apps.apple.com/app/mobile-o/id6759238106"><img src="https://img.shields.io/badge/-App_Store-black.svg" alt="App Store"></a>
+ </p>
+
+ </div>
+
+ ## 📌 Overview
+
+ Mobile-O-0.5B is a compact unified vision–language–diffusion model that performs both **multimodal understanding** (VQA, OCR, reasoning) and **image generation** within a single architecture, designed for mobile and edge deployment.
+
+ | Spec | Detail |
+ |------|--------|
+ | **Total Parameters** | 1.6B |
+ | **Image Resolution** | 512×512 |
+ | **Image Generation** | ~3 seconds on iPhone |
+ | **Visual Understanding** | ~0.4 seconds on iPhone |
+ | **Memory Footprint** | < 2GB |
+
+ ## 🎯 Supported Tasks
+
+ | Task | Input → Output |
+ |------|---------------|
+ | 💬 Conversational AI | Text → Text |
+ | 👁️ Image Understanding | Image + Text → Text |
+ | 🖼️ Image Generation | Text → Image |
+ | ✏️ Image Editing | Image + Text → Image |
+
+ ## 🚀 Quick Start
+
+ ### Download
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="Amshaker/Mobile-O-0.5B",
+     repo_type="model",
+     local_dir="checkpoints",
+     allow_patterns=["final_merged_model_23620/*"]
+ )
+ ```
+
+ ### Image Understanding
+
+ ```bash
+ python infer_und.py \
+     --model_path checkpoints/final_merged_model_23620/ \
+     --image_path assets/cute_cat.png \
+     --prompt "What is in the image?"
+ ```
+
+ ### Image Generation
+
+ ```bash
+ python infer_gen.py \
+     --model_path checkpoints/final_merged_model_23620/ \
+     --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"
+ ```
+
+ ### Image Editing
+
+ ```bash
+ python infer_edit.py \
+     --model_path checkpoints/final_merged_model_23620/ \
+     --image_path assets/cute_cat.png \
+     --prompt "Make the cat wear a hat"
+ ```
+
+ ## 🏗️ Architecture
+
+ Mobile-O consists of three main components:
+
+ - **Vision-Language Model (VLM):** [FastVLM-0.5B](https://github.com/apple/ml-fastvlm) — FastViT vision encoder + Qwen2-0.5B language backbone
+ - **Diffusion Decoder:** [SANA-600M-512](https://github.com/NVlabs/Sana) — lightweight linear DiT with VAE for 512×512 generation
+ - **Mobile Conditioning Projector (MCP):** ~2.4M param connector using layerwise feature fusion with temperature-scaled weights, depthwise-separable 1D convolutions, and efficient channel attention
+
+ ## 🏋️ Training
+
+ Trained in three stages:
+
+ 1. **Pre-training** — Cross-modal alignment on [4M text-image pairs](https://huggingface.co/datasets/Amshaker/Mobile-O-Pre-Train)
+ 2. **SFT** — Supervised fine-tuning on [~105K curated pairs](https://huggingface.co/datasets/Amshaker/Mobile-O-SFT)
+ 3. **Post-training** — Unified multimodal training on [~105K quadruplets](https://huggingface.co/datasets/Amshaker/Mobile-O-Post-Train)
+
+ ## 🔗 Related Resources
+
+ | Resource | Link |
+ |----------|------|
+ | 🤗 Mobile-O-1.5B | [Model](https://huggingface.co/Amshaker/Mobile-O-1.5B) |
+ | 🤗 Mobile-O-0.5B-iOS | [iOS Components](https://huggingface.co/Amshaker/Mobile-O-0.5B-iOS) |
+ | 📱 iOS App Source Code | [Mobile-O-App](https://github.com/Amshaker/Mobile-O/tree/main/Mobile-O-App) |
+
+ ## 📄 Citation
+
+ ```bibtex
+ @article{shaker2026mobileo,
+   title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
+   author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
+   journal={arXiv preprint arXiv:2602.20161},
+   year={2026}
+ }
+ ```
+
+ ## ⚖️ License
+
+ Released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). For research purposes only.
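The MCP's "layerwise feature fusion with temperature-scaled weights" from the README's Architecture section can be sketched in plain Python. This is an illustrative toy, not the released implementation: the fusion is assumed to be a softmax over one learnable scalar per VLM layer (four layers, matching `"vlm_num_layers": 4` in config.json), divided by a temperature, then used as mixing coefficients over the per-layer features; the feature dimension and all values here are made up.

```python
import math

def fuse_layers(layer_feats, weights, temperature=1.0):
    """Fuse per-layer feature vectors with temperature-scaled softmax weights.

    layer_feats: list of L feature vectors (each a list of floats)
    weights: L learnable scalars, one per layer (toy stand-ins here)
    temperature: sharpens (<1) or flattens (>1) the layer mixture
    """
    scaled = [w / temperature for w in weights]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    alphas = [e / total for e in exps]  # softmax over layers
    dim = len(layer_feats[0])
    # Weighted sum across layers, per feature dimension.
    return [sum(a * f[i] for a, f in zip(alphas, layer_feats)) for i in range(dim)]

# Four layers of toy 3-dim features.
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
fused = fuse_layers(feats, weights=[0.1, 0.2, 0.3, 0.4], temperature=0.5)
```

With uniform weights the softmax degenerates to a plain average of the layer features, which is a handy sanity check on the math.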
added_tokens.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "<image>": 151648,
+   "<|endoftext|>": 151643,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "[/IMG]": 151647,
+   "[IMG]": 151646
+ }
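The six added tokens above occupy a contiguous block of IDs at the top of the vocabulary, and the largest ID plus one equals the `"vocab_size": 151649` declared in config.json below. A quick consistency check:

```python
# The (token, id) pairs from added_tokens.json above.
ADDED_TOKENS = {
    "<|endoftext|>": 151643,
    "<|im_start|>": 151644,
    "<|im_end|>": 151645,
    "[IMG]": 151646,
    "[/IMG]": 151647,
    "<image>": 151648,
}

ids = sorted(ADDED_TOKENS.values())
assert ids == list(range(151643, 151649))  # contiguous block

vocab_size = max(ids) + 1
print(vocab_size)  # 151649, matching config.json
```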
config.json ADDED
@@ -0,0 +1,52 @@
+ {
+   "_attn_implementation_autoset": true,
+   "architectures": [
+     "MobileOForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "diffusion_name_or_path": "Efficient-Large-Model/Sana_600M_512px_diffusers",
+   "eos_token_id": 151645,
+   "freeze_mm_mlp_adapter": false,
+   "gen_pooling": "early_pool2d_4",
+   "hidden_act": "silu",
+   "hidden_size": 896,
+   "image_aspect_ratio": "pad",
+   "image_grid_pinpoints": null,
+   "initializer_range": 0.02,
+   "intermediate_size": 4864,
+   "is_train": false,
+   "max_position_embeddings": 32768,
+   "max_window_layers": 24,
+   "mm_hidden_size": 3072,
+   "mm_patch_merge_type": "flat",
+   "mm_projector_lr": null,
+   "mm_projector_type": "mlp2x_gelu",
+   "mm_use_im_patch_token": false,
+   "mm_use_im_start_end": false,
+   "mm_vision_select_feature": "patch",
+   "mm_vision_select_layer": -2,
+   "mm_vision_tower": "mobileclip_l_1024",
+   "model_type": "mobile_o_inference",
+   "num_attention_heads": 14,
+   "num_hidden_layers": 24,
+   "num_key_value_heads": 2,
+   "pad_token_id": 151643,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 1000000.0,
+   "sliding_window": 32768,
+   "tie_word_embeddings": false,
+   "tokenizer_model_max_length": 512,
+   "tokenizer_padding_side": "right",
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.51.3",
+   "tune_mm_mlp_adapter": false,
+   "unfreeze_mm_vision_tower": true,
+   "use_cache": true,
+   "use_mm_proj": true,
+   "use_sliding_window": false,
+   "vision_tower_pretrained": null,
+   "vlm_num_layers": 4,
+   "vocab_size": 151649
+ }
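A few derived quantities follow directly from the config above: the Qwen2-style backbone uses grouped-query attention, so the 14 query heads share 2 key/value heads, and the per-head dimension is `hidden_size / num_attention_heads`:

```python
import json

# The relevant subset of config.json above.
config = json.loads("""
{
  "hidden_size": 896,
  "num_attention_heads": 14,
  "num_key_value_heads": 2,
  "num_hidden_layers": 24,
  "intermediate_size": 4864
}
""")

head_dim = config["hidden_size"] // config["num_attention_heads"]
gqa_group = config["num_attention_heads"] // config["num_key_value_heads"]

print(head_dim)   # 64 dims per attention head
print(gqa_group)  # 7 query heads per shared KV head
```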
gen_projector.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8d7bf359cbb7138122aa5806ccbe833e84aae0afb5819800ec61eff6335937cb
+ size 888
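The three lines above are not the weights themselves but a Git LFS pointer: the real file is fetched by OID at checkout. The format is simple key/value lines, so a pointer can be parsed like this (illustrative helper, not part of the repo):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a git-lfs pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# The gen_projector.bin pointer shown above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:8d7bf359cbb7138122aa5806ccbe833e84aae0afb5819800ec61eff6335937cb
size 888"""

info = parse_lfs_pointer(pointer)
print(int(info["size"]))  # 888 bytes
```

The same format applies to `mm_projector.bin`, `model.safetensors`, and `tokenizer.json` below, whose `size` fields give the true on-disk sizes (e.g. ~4.8 GB for `model.safetensors`).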
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
mm_projector.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8ce8c4b96b023519dd18de21ff74d41b7bcfc82befc23f718d411b0418e59ff6
+ size 7116408
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:715db75c7d1aef8a8cd2c167f2a4f33438fe2f3d6ddb68b0c8b2f234d7bcd22b
+ size 4775658842
special_tokens_map.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "additional_special_tokens": [
+     "[IMG]",
+     "[/IMG]",
+     "<image>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fdd99b646b9315f486441715cc3dc39f4d3175b45b1bcc05c9aa6642b92b690a
+ size 11413649
tokenizer_config.json ADDED
@@ -0,0 +1,70 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "[IMG]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "[/IMG]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<image>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "[IMG]",
+     "[/IMG]",
+     "<image>"
+   ],
+   "bos_token": null,
+   "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 512,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "right",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
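The `chat_template` above is the standard Qwen2 ChatML layout: a default system turn is prepended when the conversation doesn't start with one, each message is wrapped in `<|im_start|>`/`<|im_end|>`, and an assistant header is appended when a generation prompt is requested. In practice you would render it via `tokenizer.apply_chat_template`, but a minimal plain-Python equivalent (an illustrative sketch, not the library code path) makes the format concrete:

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in the ChatML layout produced by the chat_template above."""
    out = ""
    # The template prepends a default system turn when the first message
    # is not a system message.
    if messages and messages[0]["role"] != "system":
        out += "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out

prompt = render_chatml([{"role": "user", "content": "What is in the image?"}])
print(prompt)
```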
vocab.json ADDED
The diff for this file is too large to render. See raw diff