🍽️ CaLoRAify: Food Calorie LoRA Adapter
A LoRA adapter fine-tuned on SmolVLM2-500M-Instruct to analyze food images and output structured nutritional breakdowns via a Chain-of-Thought reasoning loop, making it suitable for meal-logging applications.
Model Description
- Developed by: Unnatrathi
- Model type: LoRA adapter for Vision-Language Model
- Base model: HuggingFaceTB/SmolVLM2-500M-Instruct
- Language: English
- License: MIT
- Fine-tuned from: HuggingFaceTB/SmolVLM2-500M-Instruct
- Adapter size: ~40 MB (vs the ~500M-parameter base model)
- Trainable parameters: 3,178,496 (0.62% of total)
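The two figures above are consistent with each other and with the base model's advertised size; a quick arithmetic sanity check (no model download needed):

```python
# 3,178,496 trainable LoRA parameters at 0.62% of the total
trainable = 3_178_496
total = trainable / 0.0062
print(f"{total / 1e6:.0f}M total parameters")  # → 513M, consistent with the ~500M base model
```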
What It Does
Send any food photo → the model identifies ingredients → estimates portions → outputs structured JSON with calories and macros.
Output Format (CaLoRAify Reasoning Loop)
```
Ingredients detected: grilled chicken breast, steamed rice, broccoli.
JSON Summary: {"calories_kcal": 520, "protein_g": 42, "carbs_g": 38, "fat_g": 14, "fibre_g": 5}
```
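Downstream code (e.g. a meal-logging bot) can split the reasoning line from the JSON summary with stdlib parsing. A minimal sketch, assuming the two-part format shown above (the function name is illustrative):

```python
import json
import re


def parse_reasoning_loop(text: str) -> tuple[list[str], dict]:
    """Split a CaLoRAify response into (ingredients, macros)."""
    # Ingredients are a comma-separated list after "Ingredients detected:"
    ing_match = re.search(r"Ingredients detected:\s*(.+)", text)
    ingredients = [i.strip(" .") for i in ing_match.group(1).split(",")] if ing_match else []
    # The macro summary is the first {...} JSON object in the text
    json_match = re.search(r"\{.*\}", text, re.DOTALL)
    macros = json.loads(json_match.group(0)) if json_match else {}
    return ingredients, macros


reply = (
    "Ingredients detected: grilled chicken breast, steamed rice, broccoli.\n"
    'JSON Summary: {"calories_kcal": 520, "protein_g": 42, "carbs_g": 38, "fat_g": 14, "fibre_g": 5}'
)
ingredients, macros = parse_reasoning_loop(reply)
print(ingredients)  # → ['grilled chicken breast', 'steamed rice', 'broccoli']
print(macros["calories_kcal"])  # → 520
```

Falling back to empty results when a match is missing keeps the parser robust to malformed generations, which small VLMs occasionally produce.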
Training Details
Training Data
- Dataset: Codatta/MM-Food-100K
- Samples used: 2,000 real food images
- Categories: Restaurant food, homemade food, packaged food, raw ingredients
Training Procedure
| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, v_proj, k_proj |
| Learning rate | 2e-4 |
| Batch size | 2 |
| Gradient accumulation | 4 (effective batch: 8) |
| Epochs | 5 |
| Optimizer | paged_adamw_8bit |
| LR scheduler | cosine |
| Quantisation | 4-bit NF4 (bitsandbytes) |
| Hardware | Google Colab T4 GPU |
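For reproduction, the hyperparameters in the table map onto a PEFT/TRL setup roughly as follows. This is a sketch, not the exact training script; `output_dir` and the variable names are illustrative:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# LoRA adapter settings from the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj"],
    task_type="CAUSAL_LM",
)

# 4-bit NF4 quantisation of the frozen base model (bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# SFTTrainer arguments (effective batch size 2 x 4 = 8)
training_args = SFTConfig(
    output_dir="caloraify-lora",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
)
```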
Training Framework
- 🤗 Transformers 4.51.3
- 🤗 PEFT 0.15.2
- 🤗 TRL 0.17.0 (SFTTrainer)
- bitsandbytes 0.43.1
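To reproduce this environment, pin the versions listed above (package names as published on PyPI):

```shell
pip install transformers==4.51.3 peft==0.15.2 trl==0.17.0 bitsandbytes==0.43.1
```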
How to Use
```python
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from transformers.models.smolvlm.configuration_smolvlm import SmolVLMConfig
from transformers.models.smolvlm.modeling_smolvlm import SmolVLMForConditionalGeneration
from peft import PeftModel
from PIL import Image
import torch

MODEL_ID = "HuggingFaceTB/SmolVLM2-500M-Instruct"
ADAPTER_ID = "Unnatrathi/caloraify-lora-adapter"

# Registry patch for transformers 4.51.3
AutoModelForVision2Seq.register(SmolVLMConfig, SmolVLMForConditionalGeneration, exist_ok=True)

# Load processor
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model.eval()

# Run inference
image = Image.open("your_food_photo.jpg").convert("RGB")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What food is in this image? Reply: Ingredients detected: [list]"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[[image]], text=[prompt], return_tensors="pt", truncation=False)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)

# Decode only the newly generated tokens, not the prompt
new_tokens = out[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```
Full Project
This adapter is part of the CaLoRAify project, an AI-powered food calorie Telegram bot.
Limitations
- Inference is slow on CPU (~30–90 seconds per image)
- Accuracy improves with more training data (currently 2K samples)
- Works best with clear, well-lit food photos
- Portion estimation is approximate