🍽️ CaLoRAify: Food Calorie LoRA Adapter
A LoRA adapter fine-tuned on SmolVLM2-500M-Instruct to analyze food images and output structured nutritional breakdowns via a Chain-of-Thought reasoning loop, making it suitable for meal-logging applications.
Model Description
- Developed by: Unnatrathi
- Model type: LoRA adapter for Vision-Language Model
- Base model: HuggingFaceTB/SmolVLM2-500M-Instruct
- Language: English
- License: MIT
- Fine-tuned from: HuggingFaceTB/SmolVLM2-500M-Instruct
- Adapter size: ~40 MB (vs the ~500M-parameter base model)
- Trainable parameters: 3,178,496 (0.62% of total)
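The two figures above are consistent with each other and with the base model's advertised size; a quick arithmetic sanity check (no model download needed):

```python
# 3,178,496 trainable LoRA parameters at 0.62% of the total
trainable = 3_178_496
total = trainable / 0.0062
print(f"{total / 1e6:.0f}M total parameters")  # → 513M, consistent with the ~500M base model
```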
What It Does
Send any food photo → the model identifies ingredients → estimates portions → outputs structured JSON with calories and macros.
Output Format (CaLoRAify Reasoning Loop)
```
Ingredients detected: grilled chicken breast, steamed rice, broccoli.
JSON Summary: {"calories_kcal": 520, "protein_g": 42, "carbs_g": 38, "fat_g": 14, "fibre_g": 5}
```
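Downstream code (e.g. a meal-logging bot) can split the reasoning line from the JSON summary with stdlib parsing. A minimal sketch, assuming the two-part format shown above (the function name is illustrative):

```python
import json
import re


def parse_reasoning_loop(text: str) -> tuple[list[str], dict]:
    """Split a CaLoRAify response into (ingredients, macros)."""
    # Ingredients are a comma-separated list after "Ingredients detected:"
    ing_match = re.search(r"Ingredients detected:\s*(.+)", text)
    ingredients = [i.strip(" .") for i in ing_match.group(1).split(",")] if ing_match else []
    # The macro summary is the first {...} JSON object in the text
    json_match = re.search(r"\{.*\}", text, re.DOTALL)
    macros = json.loads(json_match.group(0)) if json_match else {}
    return ingredients, macros


reply = (
    "Ingredients detected: grilled chicken breast, steamed rice, broccoli.\n"
    'JSON Summary: {"calories_kcal": 520, "protein_g": 42, "carbs_g": 38, "fat_g": 14, "fibre_g": 5}'
)
ingredients, macros = parse_reasoning_loop(reply)
print(ingredients)  # → ['grilled chicken breast', 'steamed rice', 'broccoli']
print(macros["calories_kcal"])  # → 520
```

Falling back to empty results when a match is missing keeps the parser robust to malformed generations, which small VLMs occasionally produce.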
Training Details
Training Data
- Dataset: Codatta/MM-Food-100K
- Samples used: 2,000 real food images
- Categories: Restaurant food, homemade food, packaged food, raw ingredients
Training Procedure
| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, v_proj, k_proj |
| Learning rate | 2e-4 |
| Batch size | 2 |
| Gradient accumulation | 4 (effective batch: 8) |
| Epochs | 5 |
| Optimizer | paged_adamw_8bit |
| LR scheduler | cosine |
| Quantisation | 4-bit NF4 (bitsandbytes) |
| Hardware | Google Colab T4 GPU |
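For reproduction, the hyperparameters in the table map onto a PEFT/TRL setup roughly as follows. This is a sketch, not the exact training script; `output_dir` and the variable names are illustrative:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# LoRA adapter settings from the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj"],
    task_type="CAUSAL_LM",
)

# 4-bit NF4 quantisation of the frozen base model (bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# SFTTrainer arguments (effective batch size 2 x 4 = 8)
training_args = SFTConfig(
    output_dir="caloraify-lora",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
)
```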
Training Framework
- 🤗 Transformers 4.51.3
- 🤗 PEFT 0.15.2
- 🤗 TRL 0.17.0 (SFTTrainer)
- bitsandbytes 0.43.1
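To reproduce this environment, pin the versions listed above (package names as published on PyPI):

```shell
pip install transformers==4.51.3 peft==0.15.2 trl==0.17.0 bitsandbytes==0.43.1
```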
How to Use
```python
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from transformers.models.smolvlm.configuration_smolvlm import SmolVLMConfig
from transformers.models.smolvlm.modeling_smolvlm import SmolVLMForConditionalGeneration
from peft import PeftModel
from PIL import Image
import torch

MODEL_ID = "HuggingFaceTB/SmolVLM2-500M-Instruct"
ADAPTER_ID = "Unnatrathi/caloraify-lora-adapter"

# Registry patch for transformers 4.51.3
AutoModelForVision2Seq.register(SmolVLMConfig, SmolVLMForConditionalGeneration, exist_ok=True)

# Load processor
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model.eval()

# Run inference
image = Image.open("your_food_photo.jpg").convert("RGB")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What food is in this image? Reply: Ingredients detected: [list]"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[[image]], text=[prompt], return_tensors="pt", truncation=False)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)

# Decode only the newly generated tokens, not the prompt
new_tokens = out[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```
Full Project
This adapter is part of the CaLoRAify project, an AI-powered food calorie Telegram bot.
Limitations
- Inference is slow on CPU (~30–90 seconds per image)
- Accuracy improves with more training data (currently 2K samples)
- Works best with clear, well-lit food photos
- Portion estimation is approximate