
CaLoRAify – Food Calorie LoRA Adapter 🍽

A LoRA adapter fine-tuned on SmolVLM2-500M-Instruct to analyze food images and output structured nutritional breakdowns with a Chain-of-Thought reasoning loop.

Model Description

  • Developed by: Unnatrathi
  • Model type: LoRA adapter for Vision-Language Model
  • Base model: HuggingFaceTB/SmolVLM2-500M-Instruct
  • Language: English
  • License: MIT
  • Fine-tuned from: HuggingFaceTB/SmolVLM2-500M-Instruct
  • Adapter size: ~40 MB (vs. the 500M-parameter base model)
  • Trainable parameters: 3,178,496 (0.62% of total)

What It Does

Send any food photo → the model identifies ingredients → estimates portions → outputs structured JSON with calories and macros.

Output Format (CaLoRAify Reasoning Loop)

Ingredients detected: grilled chicken breast, steamed rice, broccoli.
JSON Summary: {"calories_kcal": 520, "protein_g": 42, "carbs_g": 38, "fat_g": 14, "fibre_g": 5}

The model first identifies ingredients in plain text (Chain-of-Thought), then emits structured JSON with nutritional estimates, which makes the output easy to consume in meal-logging applications.
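Downstream apps (for example, a meal logger) can split this two-part response into the free-text ingredient line and the machine-readable JSON. A minimal plain-Python sketch (the helper name and key checks are illustrative, not part of the adapter):

```python
import json
import re

EXPECTED_KEYS = {"calories_kcal", "protein_g", "carbs_g", "fat_g", "fibre_g"}

def parse_reasoning_loop(text: str) -> dict:
    """Split a CaLoRAify-style response into ingredients and nutrition JSON."""
    # Free-text Chain-of-Thought line: "Ingredients detected: ..."
    ing_match = re.search(r"Ingredients detected:\s*(.+)", text)
    ingredients = [i.strip(" .") for i in ing_match.group(1).split(",")] if ing_match else []

    # Structured part: the first {...} blob in the reply
    json_match = re.search(r"\{.*\}", text, re.DOTALL)
    nutrition = json.loads(json_match.group(0)) if json_match else {}

    missing = EXPECTED_KEYS - nutrition.keys()
    return {"ingredients": ingredients, "nutrition": nutrition, "missing_keys": sorted(missing)}

reply = (
    "Ingredients detected: grilled chicken breast, steamed rice, broccoli.\n"
    'JSON Summary: {"calories_kcal": 520, "protein_g": 42, "carbs_g": 38, "fat_g": 14, "fibre_g": 5}'
)
result = parse_reasoning_loop(reply)
```

Checking `missing_keys` lets a caller reject malformed generations before logging them.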

Training Details

Training Data

  • Dataset: Codatta/MM-Food-100K
  • Samples used: 2,000 real food images
  • Categories: Restaurant food, homemade food, packaged food, raw ingredients
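Each image–annotation pair has to be rendered into the chat format SFTTrainer expects, with the target text mirroring the Reasoning Loop output above. A hedged sketch (the `ingredients`/`macros` arguments are illustrative; MM-Food-100K's actual field names may differ):

```python
import json

def to_chat_sample(ingredients: list[str], macros: dict) -> list[dict]:
    """Build one user/assistant turn in the CaLoRAify Reasoning Loop format."""
    target = (
        f"Ingredients detected: {', '.join(ingredients)}.\n"
        f"JSON Summary: {json.dumps(macros)}"
    )
    return [
        {"role": "user", "content": [
            {"type": "image"},  # the PIL image itself is passed separately to the processor
            {"type": "text", "text": "What food is in this image? Reply: Ingredients detected: [list]"},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": target}]},
    ]

sample = to_chat_sample(
    ["grilled chicken breast", "steamed rice", "broccoli"],
    {"calories_kcal": 520, "protein_g": 42, "carbs_g": 38, "fat_g": 14, "fibre_g": 5},
)
```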

Training Procedure

| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, v_proj, k_proj |
| Learning rate | 2e-4 |
| Batch size | 2 |
| Gradient accumulation | 4 (effective batch: 8) |
| Epochs | 5 |
| Optimizer | paged_adamw_8bit |
| LR scheduler | cosine |
| Quantisation | 4-bit NF4 (bitsandbytes) |
| Hardware | Google Colab T4 GPU |
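These hyperparameters map directly onto standard PEFT/Transformers configuration objects. A minimal sketch of that mapping (assuming a typical `peft`/`bitsandbytes` setup, not the author's exact training script):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# LoRA hyperparameters from the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj"],
    task_type="CAUSAL_LM",
)

# 4-bit NF4 quantisation, matching the inference setup below
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Optimiser / schedule settings from the table
training_args = TrainingArguments(
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 8
    num_train_epochs=5,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    output_dir="caloraify-lora",
)
```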

Training Framework

  • 🤗 Transformers 4.51.3
  • 🤗 PEFT 0.15.2
  • 🤗 TRL 0.17.0 (SFTTrainer)
  • bitsandbytes 0.43.1
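To reproduce this environment (and avoid API drift at inference time), pinning the versions above is the safest route; a sketch assuming a pip-based setup (`torch` and `pillow` are added because the usage example below imports them):

```shell
pip install "transformers==4.51.3" "peft==0.15.2" "trl==0.17.0" "bitsandbytes==0.43.1" torch pillow
```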

How to Use

from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from transformers.models.smolvlm.configuration_smolvlm import SmolVLMConfig
from transformers.models.smolvlm.modeling_smolvlm import SmolVLMForConditionalGeneration
from peft import PeftModel
from PIL import Image
import torch

MODEL_ID   = "HuggingFaceTB/SmolVLM2-500M-Instruct"
ADAPTER_ID = "Unnatrathi/caloraify-lora-adapter"

# Registry patch for transformers 4.51.3
AutoModelForVision2Seq.register(SmolVLMConfig, SmolVLMForConditionalGeneration, exist_ok=True)

# Load processor
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model.eval()

# Run inference
image = Image.open("your_food_photo.jpg").convert("RGB")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What food is in this image? Reply: Ingredients detected: [list]"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[[image]], text=[prompt], return_tensors="pt", truncation=False)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)

new_tokens = out[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])

Full Project

This adapter is part of the CaLoRAify project, an AI-powered food calorie Telegram bot.

Limitations

  • Inference is slow on CPU (~30–90 seconds per image)
  • Accuracy improves with more training data (currently 2K samples)
  • Works best with clear, well-lit food photos
  • Portion estimation is approximate

