Food / Not Food Classifier — csatv2_11m (v1)

Fastest throughput — DCT frequency compression, best for large-scale filtering (50M+ images)

A binary image classifier for detecting food and drink in images, trained via knowledge distillation from SigLIP2-so400m zero-shot labels on 3.1M images from DataComp-1B-food-and-drink-3M.

Part of a 3-model portfolio for the Nutrify food tracking pipeline.

Model Details

Field	Value
Architecture	`csatv2.r512_in1k`
Parameters	10.7M
Input size	512x512px
Labels	`food_or_drink` (0), `not_food_or_drink` (1)
FoodVision accuracy	0.9216
FoodVision F1	0.9471
Training val accuracy	0.9169 (epoch 5/5)
Throughput	4639.1 img/s (batch 64, RTX 4090)

Quick Start

import timm
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from timm.data import resolve_data_config, create_transform

# Load model
model = timm.create_model("csatv2.r512_in1k", pretrained=False, num_classes=2)
weights = hf_hub_download("mrdbourke/food-not-food-classifier-csatv2-v1", "model.safetensors")
model.load_state_dict(load_file(weights))
model.eval()

# Prepare transform (use timm's built-in config, forced to 512x512 input)
data_config = resolve_data_config(model.pretrained_cfg)
data_config["input_size"] = (3, 512, 512)
transform = create_transform(**data_config, is_training=False)

# Classify
image = Image.open("photo.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0)

with torch.inference_mode():
    logits = model(input_tensor)
    probs = torch.softmax(logits, dim=1)
    pred = logits.argmax(dim=1).item()

labels = ["food_or_drink", "not_food_or_drink"]
print(f"{labels[pred]}: {probs[0][pred]:.1%}")

All 3 Models — Comparison

These models were trained together as part of the Nutrify food/not_food classifier portfolio. Pick the right one for your use case:

Model	Role	FV Accuracy	FV F1	Params	Throughput	Repo
siglip2_base_256	Highest accuracy	91.3%	94.1%	92.9M	2099.9 img/s	link
csatv2_11m	Fastest throughput	92.2%	94.7%	10.7M	4639.1 img/s	link
nextvit_small_384	CoreML deployable	92.2%	94.7%	30.7M	1156.0 img/s	link

Evaluation — FoodVision Test Set

Evaluated on 153,911 human-labeled images from the Nutrify FoodVision dataset (118K food + 35K not_food). This is an out-of-distribution test — the model was trained on DataComp-1B web images, not FoodVision images.

Metric	Value
Accuracy	0.9216
F1	0.9471
Precision	0.9883
Recall	0.9092
Total samples	153,911
Correct	141,848
Wrong	12,063
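The headline metrics can be cross-checked from the counts and the reported precision and recall. The short sketch below uses only the values in the table; nothing else is assumed.

```python
# Values copied from the evaluation table above.
total = 153_911
correct = 141_848
wrong = 12_063
precision = 0.9883
recall = 0.9092

# Sanity check: the correct and wrong counts should add up to the total.
assert correct + wrong == total

accuracy = correct / total
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9216
print(f"f1       = {f1:.4f}")        # f1       = 0.9471
```

Precision is well above recall, so on this test set the model's errors are mostly missed positives rather than false alarms.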

Training

Data

Source: mrdbourke/DataComp-1B-food-and-drink-3M — 3.1M images from Recap-DataComp-1B
Training set: 2,952,644 images (all quality tiers)
Validation set: 155,403 images
Labels: Binary (food_or_drink vs not_food_or_drink)

Distillation

Teacher: google/siglip2-so400m-patch16-512 (878M params, zero-shot)
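Loss: Hybrid distillation loss: alpha * soft_KL + (1 - alpha) * hard_CE (temperature-softened KL divergence against the teacher plus cross-entropy on the hard labels)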
Alpha: 0.7 | Temperature: 3.0
Backbone LR: 0.0001 * 0.1 = 0.00001 (differential learning rate: once unfrozen, the backbone trains at 0.1x the base LR of 0.0001)
Epochs: 5 | Best epoch: 5
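The loss formula above could be written in PyTorch roughly as follows. This is a minimal sketch, not the training code used for this model. It assumes the teacher's zero-shot output has already been reduced to two-class logits, that the hard labels are the teacher's argmax, and that the soft term is scaled by T^2 (a common convention this card does not confirm). The function name `hybrid_distillation_loss` is hypothetical.

```python
import torch
import torch.nn.functional as F


def hybrid_distillation_loss(student_logits, teacher_logits, hard_labels,
                             alpha=0.7, temperature=3.0):
    """alpha * soft_KL + (1 - alpha) * hard_CE (sketch, see assumptions above)."""
    # Soft targets: teacher and student distributions softened by the temperature.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    # KL(teacher || student), averaged over the batch.
    # The T^2 rescaling is an assumption, not stated in the model card.
    soft_kl = F.kl_div(student_log_probs, teacher_probs,
                       reduction="batchmean") * temperature ** 2
    # Hard targets: ordinary cross-entropy at temperature 1.
    hard_ce = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_kl + (1 - alpha) * hard_ce


# Toy batch of 8 examples with 2 classes (random values, for illustration only).
torch.manual_seed(0)
student_logits = torch.randn(8, 2)
teacher_logits = torch.randn(8, 2)
hard_labels = teacher_logits.argmax(dim=1)

loss = hybrid_distillation_loss(student_logits, teacher_logits, hard_labels)
print(loss.item())
```

With alpha = 0.7, most of the gradient signal comes from matching the teacher's softened distribution, while the cross-entropy term keeps the student's temperature-1 predictions tied to the hard labels.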

Augmentations (torchvision.transforms.v2)

RandomResizedCrop (scale 0.6-1.0) — food can be a small part of a scene
RandomHorizontalFlip + RandomVerticalFlip — orientation robustness
RandomRotation (15 deg) — tilted phone shots
RandomPerspective (0.2, p=0.3) — angled views
ColorJitter (B=0.4, C=0.4, S=0.3, H=0.05) — restaurant lighting variation
GaussianBlur (p=0.2) — camera shake
RandomGrayscale (p=0.02) — B&W web images
RandomErasing (p=0.1) — partial occlusion

Pipeline context

This model is part of the Nutrify VLM pipeline — a cascading filter system for building a food/drink image dataset from billion-scale web crawls:

Text classification: 1B captions → 106M food/drink rows (ettin-150m)
Structured extraction: FoodExtract-v2 on 106M rows (Gemma 3 270M)
Image download: 3.1M images from filtered URLs
SigLIP2 zero-shot: 92-prompt classification + embeddings (teacher labels)
This model: Fast binary classifier for scale-up to 50M+ images

Throughput Benchmarks (RTX 4090)

Batch Size	img/s	VRAM (MB)
64	4639.1	3476.6
128	4435.9	1001.9
256	4274.1	1807.2
512	4204.0	3417.8

Peak: 4639.1 img/s at batch 64
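As a rough worked example, the peak figure translates into wall-clock time for the 50M-image scale mentioned in this card as follows. The estimate covers the model forward pass only; it assumes image download, decoding, and data loading keep up with the GPU, which may not hold in practice.

```python
peak_throughput = 4639.1   # img/s at batch 64 on an RTX 4090 (from the table above)
num_images = 50_000_000    # the "50M+" scale mentioned in this card

seconds = num_images / peak_throughput
hours = seconds / 3600

print(f"{seconds:,.0f} s (about {hours:.1f} GPU-hours)")
# 10,778 s (about 3.0 GPU-hours)
```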

Intended Use

Primary: Fast filtering of food/drink images from large web-crawled datasets
Secondary: Binary food detection in apps (food tracking, dietary logging)
Not for: Fine-grained food classification (use a multi-class model), nutrition estimation

Limitations

Binary only — does not distinguish food types, cuisines, or specific items
Trained on web images — may underperform on unusual angles, lighting, or cultural foods underrepresented in DataComp-1B
Confidence scores are compressed due to distillation temperature (T=3) — use relative ranking, not absolute thresholds
v1 model: trained on DataComp only; v2 will include human-verified FoodVision training data
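One way to follow the relative-ranking advice above is to keep a fixed top fraction of images by food probability instead of applying a fixed cutoff. The sketch below is illustrative only: the helper `keep_top_fraction`, the file names, and the scores are all hypothetical.

```python
def keep_top_fraction(scored_images, fraction):
    """Return the highest-scoring `fraction` of (path, food_prob) pairs."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Sort by probability, highest first, then keep the leading slice.
    ranked = sorted(scored_images, key=lambda item: item[1], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical P(food_or_drink) scores for five images.
scores = [
    ("a.jpg", 0.71),
    ("b.jpg", 0.55),
    ("c.jpg", 0.68),
    ("d.jpg", 0.52),
    ("e.jpg", 0.63),
]

kept = keep_top_fraction(scores, fraction=0.4)
print([path for path, _ in kept])  # ['a.jpg', 'c.jpg']
```

The fraction to keep would need to be chosen by inspecting a labeled sample, since it depends on how much food content the source data contains.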

Related Resources

Dataset: mrdbourke/DataComp-1B-food-and-drink-3M
Text classifier: mrdbourke/ettin-150m-food-or-drink-classifier
Food extraction: mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2
Highest accuracy: mrdbourke/food-not-food-classifier-siglip2-v1
CoreML deployable: mrdbourke/food-not-food-classifier-nextvit-v1
App: nutrify.app

Citation

@misc{food-not-food-csatv2-11m-v1,
  author = {Daniel Bourke},
  title = {Food/Not Food Classifier — csatv2_11m v1},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/mrdbourke/food-not-food-classifier-csatv2-v1}
}
