Spaces: berkeruveyik/FoodExtract-Vision

berkeruveyik committed on Feb 7

Commit 9758569 (verified) · 1 Parent(s): 0948744

Uploading FoodExtract-Vision demo folder

Files changed (6)

README.md +254 -6
app.py +35 -0
examples/36741.jpg +0 -0
examples/IMG_3808.JPG +0 -0
examples/istockphoto-175500494-612x612.jpg +0 -0
requirements.txt +5 -0

README.md CHANGED Viewed

@@ -1,12 +1,260 @@
 ---
-title: FoodExtract Vision
-emoji: 🏆
-colorFrom: red
-colorTo: pink
 sdk: gradio
-sdk_version: 6.5.1
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# 🍕🔍 FoodExtract-Vision v1: Fine-tuned SmolVLM2-500M for Structured Food Tag Extraction
+[![Model on HuggingFace](https://img.shields.io/badge/🤗%20Model-FoodExtract--Vision--SmolVLM2--500M-blue)](https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3)
+[![Dataset on HuggingFace](https://img.shields.io/badge/🤗%20Dataset-vlm--food--4k--not--food-green)](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
+## 📋 Overview
+**FoodExtract-Vision** is a fine-tuned Vision-Language Model (VLM) that classifies images as food/not-food and extracts structured food and drink tags in JSON format. Built on top of [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), this model demonstrates that even small VLMs can be fine-tuned to reliably produce structured outputs for domain-specific tasks.
 ---
+title: FoodExtract-Vision Fine-tuned VLM Structured Data Extractor
+emoji: 🍕🔍
+colorFrom: green
+colorTo: blue
 sdk: gradio
 app_file: app.py
 pinned: false
+license: apache-2.0
+---
+### 🎯 What Does It Do?
+- **Input:** Any image (food or non-food)
+- **Output:** Structured JSON containing:
+  - `is_food` — binary classification (0 or 1)
+  - `image_title` — short food-related caption
+  - `food_items` — list of visible food item nouns
+  - `drink_items` — list of visible drink item nouns
+### 💡 Example Output
+```json
+{
+  "is_food": 1,
+  "image_title": "Tandoori chicken with naan bread",
+  "food_items": ["tandoori chicken", "naan bread", "rice", "salad"],
+  "drink_items": ["lassi"]
+}
+```
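Because the model returns this JSON as free text, downstream code usually has to parse and check it before use. The following is a minimal sketch, assuming the reply may be wrapped in a Markdown fence or surrounding prose; `parse_food_output` and `EXPECTED_TYPES` are hypothetical names, not part of this repository.

```python
import json

# Keys and types follow the schema listed above; the helper itself is hypothetical.
EXPECTED_TYPES = {
    "is_food": int,
    "image_title": str,
    "food_items": list,
    "drink_items": list,
}


def parse_food_output(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' and validate its schema."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(text[start : end + 1])
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"{key} should be of type {expected.__name__}")
    if data["is_food"] not in (0, 1):
        raise ValueError("is_food must be 0 or 1")
    return data


# Illustrative model reply with some leading prose around the JSON object.
raw = (
    'Sure: {"is_food": 1, "image_title": "Tandoori chicken with naan bread", '
    '"food_items": ["tandoori chicken", "naan bread"], "drink_items": ["lassi"]}'
)
result = parse_food_output(raw)
print(result["food_items"])
```

A strict validator like this also gives a simple way to measure how often a model's reply matches the schema.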
+---
+## 🏗️ Architecture & Training Pipeline
+### 🧠 Base Model
+- **Model:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
+- **Parameters:** ~500M
+- **Precision:** `bfloat16`
+### 📊 Dataset
+- **Source:** [`berkeruveyik/vlm-food-4k-not-food-dataset`](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset)
+- **Size:** 3,698 image-JSON pairs
+- **Split:** 80% train / 20% validation
+- **Content:**
+  - 🍔 Food images from the Food270 dataset (various cuisines, ingredients, prepared dishes)
+  - 🖼️ Non-food images (random internet images) to teach correct negative classification
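The README does not say how the 80/20 split was produced. As an illustration only, a reproducible split over 3,698 items could look like the sketch below; `split_indices` and the seed value are assumptions, not the author's method.

```python
import random


def split_indices(n_items: int, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle indices 0..n_items-1 with a fixed seed and split them in two."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, global state is untouched
    n_train = int(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


# 3,698 is the dataset size stated above; the seed is arbitrary.
train_idx, val_idx = split_indices(3698)
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the validation set stable across runs, which matters when comparing Stage 1 and Stage 2 checkpoints.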
+### 🔧 Two-Stage Training Strategy
+Inspired by the [SmolDocling paper](https://arxiv.org/pdf/2503.11576), the fine-tuning follows a two-stage approach:
+#### Stage 1: LLM Alignment (Frozen Vision Encoder) 🧊
+- **Goal:** Teach the language model to output the desired JSON structure
+- **Frozen:** Vision encoder parameters
+- **Trainable:** LLM + connector layers
+- **Learning Rate:** `2e-4`
+- **Epochs:** 2
+- **Batch Size:** 8 (with gradient accumulation of 4)
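Freezing the vision encoder amounts to deciding, per parameter, whether it receives gradients. The sketch below shows that selection by name prefix in plain Python. The parameter names and the `model.vision_model.` prefix are assumptions for illustration and should be checked against the actual model before use.

```python
def is_trainable(param_name: str, frozen_prefixes: tuple) -> bool:
    """A parameter is trainable unless its name starts with a frozen prefix."""
    return not param_name.startswith(frozen_prefixes)


# Hypothetical parameter names; a real model exposes many more.
param_names = [
    "model.vision_model.encoder.layers.0.mlp.fc1.weight",
    "model.connector.modality_projection.proj.weight",
    "model.text_model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]

STAGE1_FROZEN = ("model.vision_model.",)  # Stage 1: vision encoder frozen
STAGE2_FROZEN = ()                        # Stage 2: nothing frozen

stage1 = {name: is_trainable(name, STAGE1_FROZEN) for name in param_names}
stage2 = {name: is_trainable(name, STAGE2_FROZEN) for name in param_names}

for name in param_names:
    print(f"{name}: stage1={stage1[name]} stage2={stage2[name]}")
```

With a PyTorch model, the same decision would typically be applied by looping over `model.named_parameters()` and setting `param.requires_grad` for each entry.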
+#### Stage 2: Full Model Fine-tuning (Unfrozen Vision Encoder) 🔥
+- **Goal:** Allow the vision encoder to adapt for better food recognition
+- **Trainable:** All parameters (vision encoder + LLM + connector)
+- **Learning Rate:** `2e-6` (much lower to prevent catastrophic forgetting)
+- **Epochs:** 2
+- **Batch Size:** 8 (with gradient accumulation of 4)
+### ⚙️ Training Configuration
+| Parameter | Stage 1 | Stage 2 |
+|---|---|---|
+| Optimizer | `adamw_torch_fused` | `adamw_torch_fused` |
+| Learning Rate | `2e-4` | `2e-6` |
+| LR Scheduler | `constant` | `constant` |
+| Warmup Ratio | `0.03` | `0.03` |
+| Max Grad Norm | `1.0` | `1.0` |
+| Precision | `bf16` | `bf16` |
+| Gradient Checkpointing | ✅ | ✅ |
+| Vision Encoder | ❄️ Frozen | 🔥 Unfrozen |
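The table can be restated as a small config object, which also makes the effective batch size explicit. This is a sketch under stated assumptions: `StageConfig` is a hypothetical name, a single device is assumed, and the step count is a rough estimate that a real trainer may round differently.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    learning_rate: float
    epochs: int
    batch_size: int
    grad_accum_steps: int
    freeze_vision_encoder: bool

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer update (single device assumed).
        return self.batch_size * self.grad_accum_steps

    def optimizer_steps(self, n_train: int) -> int:
        # Rough estimate: updates per epoch, rounded up, times epochs.
        return self.epochs * math.ceil(n_train / self.effective_batch_size)


STAGE_1 = StageConfig("stage1", 2e-4, 2, 8, 4, freeze_vision_encoder=True)
STAGE_2 = StageConfig("stage2", 2e-6, 2, 8, 4, freeze_vision_encoder=False)

N_TRAIN = int(3698 * 0.8)  # 80% of the stated dataset size

for stage in (STAGE_1, STAGE_2):
    print(stage.name, stage.effective_batch_size, stage.optimizer_steps(N_TRAIN))
```

Under these assumptions each stage sees 32 samples per update, so two epochs come to well under 200 optimizer updates per stage.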
+---
+## 🚀 Quick Start
+### 📦 Installation
+```bash
+pip install transformers torch gradio spaces accelerate
+```
+### 🔮 Inference with Pipeline
+```python
+import torch
+from transformers import pipeline
+FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"
+pipe = pipeline(
+    "image-text-to-text",
+    model=FINE_TUNED_MODEL_ID,
+    dtype=torch.bfloat16,
+    device_map="auto",
+)
+prompt = """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
+Only return valid JSON in the following form:
+{
+  "is_food": 0,
+  "image_title": "",
+  "food_items": [],
+  "drink_items": []
+}
+"""
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "path/to/your/image.jpg"},
+            {"type": "text", "text": prompt},
+        ],
+    }
+]
+# Pass a batch of one conversation so the result is indexed as output[0][0], as app.py does
+output = pipe(text=[messages], max_new_tokens=256)
+print(output[0][0]["generated_text"][-1]["content"])
+```
+### 🧪 Inference without Pipeline
+```python
+import torch
+from transformers import AutoModelForImageTextToText, AutoProcessor
+from PIL import Image
+FINE_TUNED_MODEL_ID = "berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3"
+model = AutoModelForImageTextToText.from_pretrained(
+    FINE_TUNED_MODEL_ID,
+    attn_implementation="eager",
+    dtype=torch.bfloat16,
+    device_map="auto",
+)
+processor = AutoProcessor.from_pretrained(FINE_TUNED_MODEL_ID)
+image = Image.open("path/to/your/image.jpg")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": image},
+            {"type": "text", "text": "YOUR_PROMPT_HERE"},
+        ],
+    }
+]
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+input_len = inputs["input_ids"].shape[-1]
+with torch.inference_mode():
+    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+decoded = processor.decode(output[0][input_len:], skip_special_tokens=True)
+print(decoded)
+```
+---
+## 🎮 Gradio Demo
+### ▶️ Running Locally
+```bash
+cd demos/FoodExtract-Vision-v1
+python app.py
+```
+The demo launches a Gradio interface that lets you:
+1. 📤 Upload any image
+2. 🔄 Compare outputs from the **base model** vs. the **fine-tuned model** side-by-side
+3. 📊 See structured JSON extraction in real-time
+---
+## 📁 Project Structure
+```
+demos/FoodExtract-Vision-v1/
+├── app.py              # 🚀 Gradio demo application
+├── README.md           # 📖 This file
+├── requirements.txt    # 📦 Python dependencies
+└── examples/           # 🖼️ Example images for the demo
+    ├── 36741.jpg
+    ├── IMG_3808.JPG
+    └── istockphoto-175500494-612x612.jpg
+```
+---
+## 📝 Key Learnings & Notes
+### ✅ What Worked
+- 🏗️ **Two-stage training** significantly improved output quality compared to single-stage training
+- 🧊 **Freezing the vision encoder first** allowed the LLM to learn the output format without interference
+- 🐢 **Lower learning rate in Stage 2** (`2e-6` vs `2e-4`) prevented catastrophic forgetting of Stage 1 progress
+- 🤏 Even a **500M parameter model** can learn reliable structured output generation
+### ⚠️ Important Notes
+- **Dtype consistency:** Ensure model inputs match the model's dtype (e.g., `bfloat16` inputs for a `bfloat16` model)
+- **System prompt handling:** When not using `transformers.pipeline`, the system prompt may need to be folded into the user prompt to avoid errors
+- **`remove_unused_columns = False`** is critical when using a custom data collator with `SFTTrainer`
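The system prompt note can be made concrete with a small helper that merges any system message into the first user message. This is a minimal sketch, assuming chat messages use the list-of-parts content format shown in the inference examples; `fold_system_prompt` is a hypothetical helper, not part of this repository.

```python
def fold_system_prompt(messages: list) -> list:
    """Return new messages with system text prepended to the first user text part."""
    system_texts, remaining = [], []
    for message in messages:
        if message["role"] != "system":
            remaining.append(message)
            continue
        content = message["content"]
        if isinstance(content, str):
            system_texts.append(content)
        else:
            system_texts.extend(p["text"] for p in content if p.get("type") == "text")
    if not system_texts:
        return remaining

    system_text = "\n\n".join(system_texts)
    folded, merged = [], False
    for message in remaining:
        if not merged and message["role"] == "user":
            parts = []
            for part in message["content"]:
                if not merged and part.get("type") == "text":
                    # Copy the part so the caller's message is not mutated.
                    part = {**part, "text": f"{system_text}\n\n{part['text']}"}
                    merged = True
                parts.append(part)
            if not merged:
                # User message had no text part (e.g. image only): add one.
                parts.append({"type": "text", "text": system_text})
                merged = True
            message = {**message, "content": parts}
        folded.append(message)
    return folded


example = [
    {"role": "system", "content": "Only return valid JSON."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": "Classify the image."},
        ],
    },
]
folded_example = fold_system_prompt(example)
print(folded_example[0]["content"][1]["text"])
```

The helper returns new message dictionaries, so the original list can still be passed to `transformers.pipeline` unchanged. If the conversation contains no user message, the system text is dropped.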
+---
+## 🔗 Links
+| Resource | URL |
+|---|---|
+| 🤗 Fine-tuned Model | [berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3](https://huggingface.co/berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3) |
+| 🤗 Dataset | [berkeruveyik/vlm-food-4k-not-food-dataset](https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset) |
+| 🤗 Base Model | [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) |
+| 📄 SmolDocling Paper | [arxiv.org/pdf/2503.11576](https://arxiv.org/pdf/2503.11576) |
+| 📚 TRL Documentation | [huggingface.co/docs/trl](https://huggingface.co/docs/trl/main/en/index) |
+| 📚 PEFT GitHub | [github.com/huggingface/peft](https://github.com/huggingface/peft) |
+---
+## 📄 License
+This demo is released under the Apache 2.0 license (`license: apache-2.0` in the Space metadata). Refer to the respective model and dataset cards for their own licensing terms.
 ---

app.py ADDED Viewed

	@@ -0,0 +1,35 @@

+import torch
+import gradio as gr
+import spaces
+from transformers import pipeline
+BASE_MODEL_ID = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
+FINE_TUNED_MODEL_ID = 'berkeruveyik/FoodExtraqt-Vision-SmoLVLM2-500M-fine-tune-v3'
+OUTPUT_TOKENS = 256
+print("[INFO] Loading Original Model")
+original_pipeline = pipeline("image-text-to-text", model=BASE_MODEL_ID, dtype=torch.bfloat16, device_map="auto")
+print("[INFO] Loading Fine-tuned Model")
+ft_pipe = pipeline("image-text-to-text", model=FINE_TUNED_MODEL_ID, dtype=torch.bfloat16, device_map="auto")
+def create_message(input_image):
+    return [{'role': 'user', 'content': [{'type': 'image', 'image': input_image}, {'type': 'text', 'text': "Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.\n\nOnly return valid JSON in the following form:\n\n```json\n{\n  'is_food': 0, # int - 0 or 1 based on whether food/drinks are present (0 = no foods visible, 1 = foods visible)\n  'image_title': '', # str - short food-related title for what foods/drinks are visible in the image, leave blank if no foods present\n  'food_items': [], # list[str] - list of visible edible food item nouns\n  'drink_items': [] # list[str] - list of visible edible drink item nouns\n}\n```\n"}]}]
+@spaces.GPU
+def extract_foods_from_image(input_image):
+    input_image = input_image.resize(size=(512, 512))
+    input_message = create_message(input_image=input_image)
+    original_pipeline_output = original_pipeline(text=[input_message], max_new_tokens=OUTPUT_TOKENS)
+    outputs_pretrained = original_pipeline_output[0][0]["generated_text"][-1]["content"]
+    ft_pipe_output = ft_pipe(text=[input_message], max_new_tokens=OUTPUT_TOKENS)
+    outputs_fine_tuned = ft_pipe_output[0][0]["generated_text"][-1]["content"]
+    return outputs_pretrained, outputs_fine_tuned
+demo_title = "🍕🔍 FoodExtract-Vision: Fine-tuned SmolVLM2-500M"
+demo_description = """* **Base model:** https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct\n* **Fine-tuning dataset:** https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset\n* **Fine-tuned model:** https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3\n\n## 📋 Overview\n\nThis demo showcases the power of fine-tuning for structured output generation. Compare a base vision-language model against its fine-tuned version specialized in extracting food and drink items from images in JSON format.\n\nThe **base model** often fails to follow the required output structure, producing inconsistent or unstructured responses. The **fine-tuned model** reliably generates valid JSON outputs matching the specified schema.\n\n## 🎯 Task Description\n\nBoth models receive identical input prompts requesting food/drink classification and extraction:\n\n````\nClassify the given input image into food or not and if edible food or drink items are present, extract those to a list. 
If no food/drink items are visible, return empty lists.\n\nOnly return valid JSON in the following form:\n\n```json\n{\n  'is_food': 0, # int - 0 or 1 based on whether food/drinks are present (0 = no foods visible, 1 = foods visible)\n  'image_title': '', # str - short food-related title for what foods/drinks are visible in the image, leave blank if no foods present\n  'food_items': [], # list[str] - list of visible edible food item nouns\n  'drink_items': [] # list[str] - list of visible edible drink item nouns\n}\n```\n````\n\n## 🔧 Training Details\n\nThe fine-tuned model was trained on **3,698 images** from the vlm-food-4k-not-food-dataset:\n- **Food images:** Multiple categories from the Food270 dataset including various cuisines, ingredients, and prepared dishes\n- **Non-food images:** Random internet images to teach the model to correctly identify non-food content\n- Each image is labeled with structured JSON outputs including classification, titles, and extracted food/drink items"""
+demo = gr.Interface(fn=extract_foods_from_image, inputs=gr.Image(type="pil"), title=demo_title, description=demo_description, outputs=[gr.Textbox(lines=4, label="Original Model (not fine-tuned)"), gr.Textbox(lines=4, label="Fine-tuned Model")], examples=[["./examples/36741.jpg"], ["./examples/IMG_3808.JPG"], ["./examples/istockphoto-175500494-612x612.jpg"]])
+if __name__ == "__main__":
+    demo.launch(share=True)

examples/36741.jpg ADDED Viewed

examples/IMG_3808.JPG ADDED Viewed

examples/istockphoto-175500494-612x612.jpg ADDED Viewed

requirements.txt ADDED Viewed

	@@ -0,0 +1,5 @@

+transformers
+torch
+gradio
+spaces
+accelerate