Committed by berkeruveyik · Commit 9758569 · verified · 1 Parent(s): 0948744

Uploading FoodExtract-Vision demo folder

README.md CHANGED
@@ -1,12 +1,260 @@
  ---
- title: FoodExtract Vision
- emoji: 🏆
- colorFrom: red
- colorTo: pink
  sdk: gradio
- sdk_version: 6.5.1
  app_file: app.py
  pinned: false
  ---
  
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🍕🍔 FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction
+
+ [![Model on HuggingFace](https://img.shields.io/badge/🤗%20Model-FoodExtract--Vision--SmolVLM2--500M-blue)](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
+ [![Dataset on HuggingFace](https://img.shields.io/badge/🤗%20Dataset-vlm--food--4k--not--food-green)](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
+
+ ## 📋 Overview
+
+ **FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that classifies images as food/not-food and extracts structured food and drink tags in JSON format. Built on [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), it demonstrates that even small VLMs can be fine-tuned to reliably produce structured outputs for domain-specific tasks.
+
+
  ---
+ title: FoodExtract-Vision Fine-tuned VLM Structured Data Extractor
+ emoji: 🍕🍔
+ colorFrom: green
+ colorTo: blue
  sdk: gradio
  app_file: app.py
  pinned: false
+ license: apache-2.0
+ ---
+
+
+ ### 🎯 What Does It Do?
+
+ - **Input:** Any image (food or non-food)
+ - **Output:** Structured JSON containing:
+   - `is_food`: binary classification (0 or 1)
+   - `image_title`: short food-related caption
+   - `food_items`: list of visible food item nouns
+   - `drink_items`: list of visible drink item nouns
+
+ ### 💡 Example Output
+
+ ```json
+ {
+   "is_food": 1,
+   "image_title": "Tandoori chicken with naan bread",
+   "food_items": ["tandoori chicken", "naan bread", "rice", "salad"],
+   "drink_items": ["lassi"]
+ }
+ ```
+
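The schema above can be checked programmatically before the output is used downstream. The following is a minimal sketch and not part of the original repo: `parse_food_output` is a hypothetical helper, and it assumes the model's reply is either bare JSON or JSON wrapped in a Markdown code fence.

```python
import json
import re

REQUIRED_KEYS = {"is_food", "image_title", "food_items", "drink_items"}
FENCE = "`" * 3  # Markdown code fence, built here to keep this snippet fence-safe


def parse_food_output(raw: str) -> dict:
    """Parse and sanity-check a FoodExtract-style JSON reply (hypothetical helper)."""
    # Unwrap an optional fenced block (with or without a "json" tag).
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, flags=re.DOTALL)
    payload = match.group(1) if match else raw
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["is_food"] not in (0, 1):
        raise ValueError("is_food must be 0 or 1")
    for key in ("food_items", "drink_items"):
        if not isinstance(data[key], list):
            raise ValueError(f"{key} must be a list")
    return data


example = (
    '{"is_food": 1, "image_title": "Tandoori chicken with naan bread", '
    '"food_items": ["tandoori chicken", "naan bread", "rice", "salad"], '
    '"drink_items": ["lassi"]}'
)
result = parse_food_output(example)
print(result["food_items"])
```

Raising on malformed replies, rather than returning a partial dict, makes it easy to count how often a model breaks the schema.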
+ ---
+
+ ## 🏗️ Architecture & Training Pipeline
+
+ ### 🧠 Base Model
+
+ - **Model:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
+ - **Parameters:** ~500M
+ - **Precision:** `bfloat16`
+
+ ### 📊 Dataset
+
+ - **Source:** [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
+ - **Size:** 3,698 image-JSON pairs
+ - **Split:** 80% train / 20% validation
+ - **Content:**
+   - 🍔 Food images from the Food270 dataset (various cuisines, ingredients, prepared dishes)
+   - 🖼️ Non-food images (random internet images) to teach correct negative classification
+
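A minimal sketch of the 80/20 split described above. The seed and the plain index shuffle are assumptions made for illustration; the original training code may split the dataset differently.

```python
import random


def train_val_split(num_samples: int, train_fraction: float = 0.8, seed: int = 42):
    """Return shuffled (train, validation) index lists. The seed is an assumption."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for reproducibility
    cutoff = int(num_samples * train_fraction)
    return indices[:cutoff], indices[cutoff:]


train_idx, val_idx = train_val_split(3698)
print(len(train_idx), len(val_idx))
```

For the 3,698 pairs listed above this gives 2,958 training and 740 validation samples.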
+ ### 🔧 Two-Stage Training Strategy
+
+ Inspired by the [SmolDocling paper](https://arxiv.org/pdf/2503.11576), the fine-tuning follows a two-stage approach:
+
+ #### Stage 1: LLM Alignment (Frozen Vision Encoder) 🧊
+
+ - **Goal:** Teach the language model to output the desired JSON structure
+ - **Frozen:** Vision encoder parameters
+ - **Trainable:** LLM + connector layers
+ - **Learning Rate:** `2e-4`
+ - **Epochs:** 2
+ - **Batch Size:** 8 (with gradient accumulation of 4)
+
+ #### Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder) 🔥
+
+ - **Goal:** Allow the vision encoder to adapt for better food recognition
+ - **Trainable:** All parameters (vision encoder + LLM + connector)
+ - **Learning Rate:** `2e-6` (100× lower than Stage 1, to limit catastrophic forgetting)
+ - **Epochs:** 2
+ - **Batch Size:** 8 (with gradient accumulation of 4)
+
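The freeze/unfreeze step behind the two stages can be sketched as below. This is an illustration under stated assumptions, not the author's training code: `set_vision_encoder_trainable` is a hypothetical helper, the `"vision_model"` name marker is an assumption about how the vision encoder's parameters are named, and `ToyVLM` is a stand-in that only mimics the `named_parameters()` interface of `torch.nn.Module` so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def set_vision_encoder_trainable(model, trainable: bool, marker: str = "vision_model") -> int:
    """Toggle requires_grad on parameters whose name contains `marker`.

    Returns the number of parameter tensors touched.
    """
    touched = 0
    for name, param in model.named_parameters():
        if marker in name:
            param.requires_grad = trainable
            touched += 1
    return touched


class ToyVLM:
    """Stand-in exposing named_parameters() like torch.nn.Module does."""

    def __init__(self):
        self._params = {
            "model.vision_model.encoder.weight": SimpleNamespace(requires_grad=True),
            "model.connector.proj.weight": SimpleNamespace(requires_grad=True),
            "model.text_model.layers.0.weight": SimpleNamespace(requires_grad=True),
        }

    def named_parameters(self):
        return self._params.items()


toy = ToyVLM()

# Stage 1: freeze the vision encoder; LLM and connector stay trainable.
set_vision_encoder_trainable(toy, trainable=False)
stage1 = {name: p.requires_grad for name, p in toy.named_parameters()}

# Stage 2: unfreeze everything again.
set_vision_encoder_trainable(toy, trainable=True)
stage2 = {name: p.requires_grad for name, p in toy.named_parameters()}

print(stage1)
print(stage2)
```

With a real model, the same helper would be called on the loaded `torch.nn.Module` before building the trainer for each stage, after checking the actual parameter names.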
+ ### ⚙️ Training Configuration
+
+ | Parameter | Stage 1 | Stage 2 |
+ |---|---|---|
+ | Optimizer | `adamw_torch_fused` | `adamw_torch_fused` |
+ | Learning Rate | `2e-4` | `2e-6` |
+ | LR Scheduler | `constant` | `constant` |
+ | Warmup Ratio | `0.03` | `0.03` |
+ | Max Grad Norm | `1.0` | `1.0` |
+ | Precision | `bf16` | `bf16` |
+ | Gradient Checkpointing | ✅ | ✅ |
+ | Vision Encoder | ❄️ Frozen | 🔥 Unfrozen |
+
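As a rough worked example, the batch settings listed for both stages imply the effective batch size and step counts below. The single-device setup is an assumption, and the exact step count reported by a trainer may differ by a step or two depending on how it handles the last partial batch.

```python
import math

per_device_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1  # assumption: single-GPU training
train_samples = int(3698 * 0.8)  # 80% of the dataset
epochs_per_stage = 2

# Samples contributing to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Approximate number of optimizer updates.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
steps_per_stage = steps_per_epoch * epochs_per_stage

print(effective_batch_size, steps_per_epoch, steps_per_stage)
```

So each stage performs on the order of 186 optimizer updates at an effective batch size of 32.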
+ ---
+
+ ## 🚀 Quick Start
+
+ ### 📦 Installation
+
+ ```bash
+ pip install transformers torch gradio spaces accelerate pillow
+ ```
+
+ ### 🔮 Inference with Pipeline
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"
+
+ pipe = pipeline(
+     "image-text-to-text",
+     model=FINE_TUNED_MODEL_ID,
+     dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ prompt = """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
+
+ Only return valid JSON in the following form:
+
+ {
+   "is_food": 0,
+   "image_title": "",
+   "food_items": [],
+   "drink_items": []
+ }
+ """
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": "path/to/your/image.jpg"},
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ output = pipe(text=messages, max_new_tokens=256)
+ # A single conversation returns a list with one result dict;
+ # the last chat turn is the assistant's reply.
+ print(output[0]["generated_text"][-1]["content"])
+ ```
+
+ ### 🧪 Inference without Pipeline
+
+ ```python
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from PIL import Image
+
+ FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"
+
+ model = AutoModelForImageTextToText.from_pretrained(
+     FINE_TUNED_MODEL_ID,
+     attn_implementation="eager",
+     dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained(FINE_TUNED_MODEL_ID)
+
+ image = Image.open("path/to/your/image.jpg")
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "YOUR_PROMPT_HERE"},
+         ],
+     }
+ ]
+
+ # Cast floating-point inputs (pixel values) to the model's dtype.
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device, dtype=torch.bfloat16)
+
+ input_len = inputs["input_ids"].shape[-1]
+
+ with torch.inference_mode():
+     output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+
+ # Decode only the newly generated tokens, skipping the prompt.
+ decoded = processor.decode(output[0][input_len:], skip_special_tokens=True)
+ print(decoded)
+ ```
+
+ ---
+
+ ## 🎮 Gradio Demo
+
+ ### ▶️ Running Locally
+
+ ```bash
+ cd demos/FoodExtract-Vision-v1
+ python app.py
+ ```
+
+ The demo launches a Gradio interface that lets you:
+
+ 1. 📤 Upload any image
+ 2. 🔄 Compare outputs from the **base model** vs. the **fine-tuned model** side-by-side
+ 3. 📊 See structured JSON extraction in real-time
+
+ ---
+
+ ## 📁 Project Structure
+
+ ```
+ demos/FoodExtract-Vision-v1/
+ ├── app.py               # 🚀 Gradio demo application
+ ├── README.md            # 📖 This file
+ ├── requirements.txt     # 📦 Python dependencies
+ └── examples/            # 🖼️ Example images for the demo
+     ├── 36741.jpg
+     ├── IMG_3808.JPG
+     └── istockphoto-175500494-612x612.jpg
+ ```
+
+ ---
+
+ ## 📝 Key Learnings & Notes
+
+ ### ✅ What Worked
+
+ - 🏗️ **Two-stage training** significantly improved output quality compared to single-stage training
+ - 🧊 **Freezing the vision encoder first** allowed the LLM to learn the output format without interference
+ - 🐢 **Lower learning rate in Stage 2** (`2e-6` vs `2e-4`) prevented catastrophic forgetting of Stage 1 progress
+ - 🤏 Even a **500M parameter model** can learn reliable structured output generation
+
+ ### ⚠️ Important Notes
+
+ - **Dtype consistency:** Ensure model inputs match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model)
+ - **System prompt handling:** When not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt to avoid errors
+ - **`remove_unused_columns=False`:** Set this in the training arguments when using a custom data collator with `SFTTrainer`; otherwise the trainer drops dataset columns (such as the image) that the collator still needs
+
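The system-prompt note above can be sketched as follows. `fold_system_prompt` is a hypothetical helper, not part of the original repo, and it assumes the chat-message layout used in the inference examples (a list of role/content dicts whose content is either a string or a list of typed parts).

```python
def fold_system_prompt(messages: list) -> list:
    """Merge system messages into the first user message (hypothetical helper)."""
    system_texts, remaining = [], []
    for message in messages:
        if message["role"] != "system":
            remaining.append(message)
            continue
        content = message["content"]
        if isinstance(content, str):
            system_texts.append(content)
        else:
            system_texts.extend(p["text"] for p in content if p.get("type") == "text")
    if not system_texts:
        return remaining

    prefix = "\n\n".join(system_texts)
    for i, message in enumerate(remaining):
        if message["role"] != "user":
            continue
        content = message["content"]
        if isinstance(content, str):
            new_content = prefix + "\n\n" + content
        else:
            new_content, merged = [], False
            for part in content:
                if not merged and part.get("type") == "text":
                    # Prepend the system text to the first text part.
                    part = {**part, "text": prefix + "\n\n" + part["text"]}
                    merged = True
                new_content.append(part)
            if not merged:
                new_content.append({"type": "text", "text": prefix})
        remaining[i] = {**message, "content": new_content}
        return remaining
    raise ValueError("no user message to fold the system prompt into")


messages = [
    {"role": "system", "content": "You are a food tagging assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": "Classify the image."},
        ],
    },
]
folded = fold_system_prompt(messages)
print(folded[0]["content"][1]["text"])
```

The image part is left untouched and keeps its position, so the folded conversation can be passed to `processor.apply_chat_template` in place of the original one.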
+ ---
+
+ ## 🔗 Links
+
+ | Resource | URL |
+ |---|---|
+ | 🤗 Fine-tuned Model | [berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3](https://huggingface.co/berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3) |
+ | 🤗 Dataset | [berkeruveyik/vlm-food-4k-not-food-dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
+ | 🤗 Base Model | [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) |
+ | 📄 SmolDocling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
+ | 📚 TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
+ | 📚 PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |
+
+ ---
+
+ ## 📄 License
+
+ This demo is released under the Apache 2.0 license. Refer to the respective model and dataset cards for their own licensing terms.
  ---
  
app.py ADDED
@@ -0,0 +1,35 @@
+ import torch
+ import gradio as gr
+ import spaces
+ from transformers import pipeline
+
+ BASE_MODEL_ID = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
+ FINE_TUNED_MODEL_ID = 'berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3'
+ OUTPUT_TOKENS = 256
+
+ # Load both models once at startup so each request only runs inference.
+ print("[INFO] Loading Original Model")
+ original_pipeline = pipeline("image-text-to-text", model=BASE_MODEL_ID, dtype=torch.bfloat16, device_map="auto")
+
+ print("[INFO] Loading Fine-tuned Model")
+ ft_pipe = pipeline("image-text-to-text", model=FINE_TUNED_MODEL_ID, dtype=torch.bfloat16, device_map="auto")
+
+ def create_message(input_image):
+     # Single-turn chat: the image followed by the extraction prompt.
+     return [{'role': 'user', 'content': [{'type': 'image', 'image': input_image}, {'type': 'text', 'text': "Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.\n\nOnly return valid JSON in the following form:\n\n```json\n{\n 'is_food': 0, # int - 0 or 1 based on whether food/drinks are present (0 = no foods visible, 1 = foods visible)\n 'image_title': '', # str - short food-related title for what foods/drinks are visible in the image, leave blank if no foods present\n 'food_items': [], # list[str] - list of visible edible food item nouns\n 'drink_items': [] # list[str] - list of visible edible drink item nouns\n}\n```\n"}]}]
+
+ @spaces.GPU
+ def extract_foods_from_image(input_image):
+     input_image = input_image.resize(size=(512, 512))
+     input_message = create_message(input_image=input_image)
+     # Passing a batch of one conversation, so results are indexed [0][0].
+     original_pipeline_output = original_pipeline(text=[input_message], max_new_tokens=OUTPUT_TOKENS)
+     outputs_pretrained = original_pipeline_output[0][0]["generated_text"][-1]["content"]
+     ft_pipe_output = ft_pipe(text=[input_message], max_new_tokens=OUTPUT_TOKENS)
+     outputs_fine_tuned = ft_pipe_output[0][0]["generated_text"][-1]["content"]
+     return outputs_pretrained, outputs_fine_tuned
+
+ demo_title = "🍕🍔 FoodExtract-Vision: Fine-tuned SmolVLM2-500M"
+ demo_description = """* **Base model:** https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct\n* **Fine-tuning dataset:** https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset\n* **Fine-tuned model:** https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3\n\n## 📋 Overview\n\nThis demo showcases the power of fine-tuning for structured output generation. Compare a base vision-language model against its fine-tuned version specialized in extracting food and drink items from images in JSON format.\n\nThe **base model** often fails to follow the required output structure, producing inconsistent or unstructured responses. The **fine-tuned model** reliably generates valid JSON outputs matching the specified schema.\n\n## 🎯 Task Description\n\nBoth models receive identical input prompts requesting food/drink classification and extraction:\n\n````\nClassify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.\n\nOnly return valid JSON in the following form:\n\n```json\n{\n 'is_food': 0, # int - 0 or 1 based on whether food/drinks are present (0 = no foods visible, 1 = foods visible)\n 'image_title': '', # str - short food-related title for what foods/drinks are visible in the image, leave blank if no foods present\n 'food_items': [], # list[str] - list of visible edible food item nouns\n 'drink_items': [] # list[str] - list of visible edible drink item nouns\n}\n```\n````\n\n## 🔧 Training Details\n\nThe fine-tuned model was trained on **3,698 images** from the vlm-food-4k-not-food-dataset:\n- **Food images:** Multiple categories from the Food270 dataset including various cuisines, ingredients, and prepared dishes\n- **Non-food images:** Random internet images to teach the model to correctly identify non-food content\n- Each image is labeled with structured JSON outputs including classification, titles, and extracted food/drink items"""
+
+ demo = gr.Interface(fn=extract_foods_from_image, inputs=gr.Image(type="pil"), title=demo_title, description=demo_description, outputs=[gr.Textbox(lines=4, label="Original Model (not fine-tuned)"), gr.Textbox(lines=4, label="Fine-tuned Model")], examples=[["./examples/36741.jpg"], ["./examples/IMG_3808.JPG"], ["./examples/istockphoto-175500494-612x612.jpg"]])
+
+ if __name__ == "__main__":
+     demo.launch(share=True)
examples/36741.jpg ADDED
examples/IMG_3808.JPG ADDED
examples/istockphoto-175500494-612x612.jpg ADDED
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ transformers
+ torch
+ gradio
+ spaces
+ accelerate