# Food / Not Food Classifier - csatv2_11m (v1)
Fastest throughput - DCT frequency compression, best for large-scale filtering (50M+ images)
A binary image classifier for detecting food and drink in images, trained via knowledge distillation from SigLIP2-so400m zero-shot labels on 3.1M images from DataComp-1B-food-and-drink-3M.
Part of a 3-model portfolio for the Nutrify food tracking pipeline.
## Model Details
| Field | Value |
|---|---|
| Architecture | csatv2.r512_in1k |
| Parameters | 10.7M |
| Input size | 512x512px |
| Labels | food_or_drink (0), not_food_or_drink (1) |
| FoodVision accuracy | 0.9216 |
| FoodVision F1 | 0.9471 |
| Training val accuracy | 0.9169 (epoch 5/5) |
| Throughput | 4639.1 img/s (batch 64, RTX 4090) |
## Quick Start
```python
import timm
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Load model
model = timm.create_model("csatv2.r512_in1k", pretrained=False, num_classes=2)
weights = hf_hub_download("mrdbourke/food-not-food-classifier-csatv2-v1", "model.safetensors")
model.load_state_dict(load_file(weights))
model.eval()

# Prepare transform (use timm's built-in config)
from timm.data import resolve_data_config, create_transform

data_config = resolve_data_config(model.pretrained_cfg)
data_config["input_size"] = (3, 512, 512)
transform = create_transform(**data_config, is_training=False)

# Classify
image = Image.open("photo.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0)
with torch.inference_mode():
    logits = model(input_tensor)
    probs = torch.softmax(logits, dim=1)
pred = logits.argmax(dim=1).item()
labels = ["food_or_drink", "not_food_or_drink"]
print(f"{labels[pred]}: {probs[0][pred]:.1%}")
```
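For the filtering use case this model targets, single-image inference wastes throughput. A minimal batched sketch, assuming the `model` and `transform` from the Quick Start above (the dataset class and function names here are illustrative, not part of this repo):

```python
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset

class ImagePathDataset(Dataset):
    """Loads images from a list of file paths and applies a transform."""
    def __init__(self, paths, transform):
        self.paths = paths
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        image = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(image), self.paths[idx]

def filter_food_images(paths, model, transform, batch_size=64, device="cuda"):
    """Return the subset of paths the model classifies as food_or_drink."""
    loader = DataLoader(ImagePathDataset(paths, transform),
                        batch_size=batch_size, num_workers=4)
    model = model.to(device).eval()
    kept = []
    with torch.inference_mode():
        for batch, batch_paths in loader:
            preds = model(batch.to(device)).argmax(dim=1)
            # Class 0 is food_or_drink per the label mapping above
            kept += [p for p, c in zip(batch_paths, preds.tolist()) if c == 0]
    return kept
```

Increasing `num_workers` keeps the GPU fed when image decoding is the bottleneck.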
## All 3 Models - Comparison
These models were trained together as part of the Nutrify food/not_food classifier portfolio. Pick the right one for your use case:
| Model | Role | FoodVision accuracy | FoodVision F1 | Params | Throughput | Repo |
|---|---|---|---|---|---|---|
| siglip2_base_256 | Highest accuracy | 91.3% | 94.1% | 92.9M | 2099.9 img/s | link |
| csatv2_11m | Fastest throughput | 92.2% | 94.7% | 10.7M | 4639.1 img/s | link |
| nextvit_small_384 | CoreML deployable | 92.2% | 94.7% | 30.7M | 1156.0 img/s | link |
## Evaluation - FoodVision Test Set
Evaluated on 153,911 human-labeled images from the Nutrify FoodVision dataset (118K food + 35K not_food). This is an out-of-distribution test - the model was trained on DataComp-1B web images, not FoodVision images.
| Metric | Value |
|---|---|
| Accuracy | 0.9216 |
| F1 | 0.9471 |
| Precision | 0.9883 |
| Recall | 0.9092 |
| Total samples | 153,911 |
| Correct | 141,848 |
| Wrong | 12,063 |
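The headline numbers are internally consistent and can be recomputed from the table above, as a quick sanity check:

```python
# Accuracy from the raw counts
total, correct = 153_911, 141_848
accuracy = correct / total  # matches the reported 0.9216

# F1 as the harmonic mean of the reported precision and recall
precision, recall = 0.9883, 0.9092
f1 = 2 * precision * recall / (precision + recall)  # matches the reported 0.9471
```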
## Training
### Data
- Source: mrdbourke/DataComp-1B-food-and-drink-3M (3.1M images from Recap-DataComp-1B)
- Training set: 2,952,644 images (all quality tiers)
- Validation set: 155,403 images
- Labels: Binary (food_or_drink vs not_food_or_drink)
### Distillation
- Teacher: google/siglip2-so400m-patch16-512 (878M params, zero-shot)
- Loss: hybrid distillation loss, `alpha * soft_KL + (1 - alpha) * hard_CE`
- Alpha: 0.7 | Temperature: 3.0
- Backbone LR: 0.0001 × 0.1 (differential learning rate after unfreezing)
- Epochs: 5 | Best epoch: 5
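The hybrid loss above can be sketched as follows (a minimal sketch; the function and variable names, and the conventional T² scaling of the KL term, are assumptions rather than details taken from the actual training code):

```python
import torch
import torch.nn.functional as F

ALPHA, T = 0.7, 3.0  # mixing weight and distillation temperature from the card

def distillation_loss(student_logits, teacher_logits, hard_labels):
    # Soft term: KL divergence between temperature-scaled distributions;
    # T^2 scaling keeps gradient magnitudes comparable across temperatures.
    soft_kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: standard cross-entropy against the zero-shot teacher labels
    hard_ce = F.cross_entropy(student_logits, hard_labels)
    return ALPHA * soft_kl + (1 - ALPHA) * hard_ce

# Example with random logits for a batch of 4, 2 classes
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
targets = torch.randint(0, 2, (4,))
loss = distillation_loss(student, teacher, targets)
```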
### Augmentations (torchvision.transforms.v2)
- RandomResizedCrop (scale 0.6-1.0) - food can be a small part of a scene
- RandomHorizontalFlip + RandomVerticalFlip - orientation robustness
- RandomRotation (15 deg) - tilted phone shots
- RandomPerspective (0.2, p=0.3) - angled views
- ColorJitter (B=0.4, C=0.4, S=0.3, H=0.05) - restaurant lighting variation
- GaussianBlur (p=0.2) - camera shake
- RandomGrayscale (p=0.02) - B&W web images
- RandomErasing (p=0.1) - partial occlusion
## Pipeline context
This model is part of the Nutrify VLM pipeline - a cascading filter system for building a food/drink image dataset from billion-scale web crawls:
- Text classification: 1B captions → 106M food/drink rows (ettin-150m)
- Structured extraction: FoodExtract-v2 on 106M rows (Gemma 3 270M)
- Image download: 3.1M images from filtered URLs
- SigLIP2 zero-shot: 92-prompt classification + embeddings (teacher labels)
- This model: Fast binary classifier for scale-up to 50M+ images
## Throughput Benchmarks (RTX 4090)
| Batch Size | img/s | VRAM (MB) |
|---|---|---|
| 64 | 4639.1 | 3476.6 |
| 128 | 4435.9 | 1001.9 |
| 256 | 4274.1 | 1807.2 |
| 512 | 4204.0 | 3417.8 |
Peak: 4639.1 img/s at batch 64
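The numbers above can be approximately reproduced with a minimal timing loop (a sketch assuming the `model` from the Quick Start; the warmup count and step count are illustrative):

```python
import time
import torch

def benchmark(model, batch_size=64, steps=50, device="cuda"):
    """Return approximate inference throughput in images per second."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, 512, 512, device=device)
    # CUDA kernels launch asynchronously, so synchronize around timing
    sync = torch.cuda.synchronize if device.startswith("cuda") else (lambda: None)
    with torch.inference_mode():
        for _ in range(5):  # warmup
            model(x)
        sync()
        start = time.perf_counter()
        for _ in range(steps):
            model(x)
        sync()
        elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed
```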
## Intended Use
- Primary: Fast filtering of food/drink images from large web-crawled datasets
- Secondary: Binary food detection in apps (food tracking, dietary logging)
- Not for: Fine-grained food classification (use a multi-class model), nutrition estimation
## Limitations
- Binary only - does not distinguish food types, cuisines, or specific items
- Trained on web images - may underperform on unusual angles, lighting, or cultural foods underrepresented in DataComp-1B
- Confidence scores are compressed by the distillation temperature (T=3) - use relative ranking, not absolute thresholds
- v1 model - trained on DataComp only; v2 will include human-verified FoodVision training data
## Related Resources
- Dataset: mrdbourke/DataComp-1B-food-and-drink-3M
- Text classifier: mrdbourke/ettin-150m-food-or-drink-classifier
- Food extraction: mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2
- Highest accuracy: mrdbourke/food-not-food-classifier-siglip2-v1
- CoreML deployable: mrdbourke/food-not-food-classifier-nextvit-v1
- App: nutrify.app
## Citation
```bibtex
@misc{food-not-food-csatv2-11m-v1,
  author    = {Daniel Bourke},
  title     = {Food/Not Food Classifier - csatv2_11m v1},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mrdbourke/food-not-food-classifier-csatv2-v1}
}
```