Food / Not Food Classifier: csatv2_11m (v1)

Fastest throughput: uses DCT frequency compression; best suited to large-scale filtering (50M+ images).

A binary image classifier for detecting food and drink in images, trained via knowledge distillation from SigLIP2-so400m zero-shot labels on 3.1M images from DataComp-1B-food-and-drink-3M.

Part of a 3-model portfolio for the Nutrify food tracking pipeline.

Model Details

| Field | Value |
| --- | --- |
| Architecture | csatv2.r512_in1k |
| Parameters | 10.7M |
| Input size | 512x512 px |
| Labels | food_or_drink (0), not_food_or_drink (1) |
| FoodVision accuracy | 0.9216 |
| FoodVision F1 | 0.9471 |
| Training val accuracy | 0.9169 (epoch 5/5) |
| Throughput | 4639.1 img/s (batch 64, RTX 4090) |

Quick Start

import timm
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Load model
model = timm.create_model("csatv2.r512_in1k", pretrained=False, num_classes=2)
weights = hf_hub_download("mrdbourke/food-not-food-classifier-csatv2-v1", "model.safetensors")
model.load_state_dict(load_file(weights))
model.eval()

# Prepare transform (use timm's built-in config)
from timm.data import resolve_data_config, create_transform
data_config = resolve_data_config(model.pretrained_cfg)
data_config["input_size"] = (3, 512, 512)
transform = create_transform(**data_config, is_training=False)

# Classify
image = Image.open("photo.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0)

with torch.inference_mode():
    logits = model(input_tensor)
    probs = torch.softmax(logits, dim=1)
    pred = logits.argmax(dim=1).item()

labels = ["food_or_drink", "not_food_or_drink"]
print(f"{labels[pred]}: {probs[0][pred]:.1%}")

All 3 Models: Comparison

These models were trained together as part of the Nutrify food/not_food classifier portfolio. Pick the right one for your use case:

| Model | Role | FV Accuracy | FV F1 | Params | Throughput | Repo |
| --- | --- | --- | --- | --- | --- | --- |
| siglip2_base_256 | Highest accuracy | 91.3% | 94.1% | 92.9M | 2099.9 img/s | link |
| csatv2_11m | Fastest throughput | 92.2% | 94.7% | 10.7M | 4639.1 img/s | link |
| nextvit_small_384 | CoreML deployable | 92.2% | 94.7% | 30.7M | 1156.0 img/s | link |

Evaluation: FoodVision Test Set

Evaluated on 153,911 human-labeled images from the Nutrify FoodVision dataset (118K food + 35K not_food). This is an out-of-distribution test: the model was trained on DataComp-1B web images, not FoodVision images.

| Metric | Value |
| --- | --- |
| Accuracy | 0.9216 |
| F1 | 0.9471 |
| Precision | 0.9883 |
| Recall | 0.9092 |
| Total samples | 153,911 |
| Correct | 141,848 |
| Wrong | 12,063 |
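
The reported figures are internally consistent, which can be checked with a few lines of arithmetic. The sketch below only re-derives accuracy from the correct/wrong counts and F1 from the precision and recall listed above; the variable names are illustrative and it does not reproduce the evaluation itself.

```python
# Values copied from the evaluation table above.
precision, recall = 0.9883, 0.9092
correct, wrong = 141_848, 12_063

total = correct + wrong                               # should match "Total samples"
accuracy = correct / total                            # fraction of correct predictions
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of P and R

print(total, round(accuracy, 4), round(f1, 4))        # 153911 0.9216 0.9471
```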

Training

Data

  • Source: mrdbourke/DataComp-1B-food-and-drink-3M (3.1M images from Recap-DataComp-1B)
  • Training set: 2,952,644 images (all quality tiers)
  • Validation set: 155,403 images
  • Labels: Binary (food_or_drink vs not_food_or_drink)

Distillation

  • Teacher: google/siglip2-so400m-patch16-512 (878M params, zero-shot)
  • Loss: hybrid of soft-target KL divergence and hard-label cross-entropy: alpha * soft_KL + (1 - alpha) * hard_CE
  • Alpha: 0.7 | Temperature: 3.0
  • Backbone LR: 0.0001 * 0.1 = 0.00001 (differential learning rate, applied after the backbone is unfrozen)
  • Epochs: 5 | Best epoch: 5
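
The loss formula above can be sketched in PyTorch as follows. This is a minimal illustration, not the training code used for this model: it assumes the teacher supplies two-class logits per image, and the function name, the `batchmean` reduction, and the conventional T^2 rescaling of the KL term are all assumptions not stated in this card.

```python
import torch
import torch.nn.functional as F


def hybrid_distillation_loss(student_logits, teacher_logits, hard_labels,
                             alpha=0.7, temperature=3.0):
    """alpha * soft_KL + (1 - alpha) * hard_CE (hypothetical helper)."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=1)
    # KL(teacher || student), averaged over the batch.
    soft_kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Assumption: rescale by T^2 so gradient magnitudes stay comparable.
    soft_kl = soft_kl * temperature ** 2
    # Ordinary cross-entropy against the hard labels.
    hard_ce = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_kl + (1 - alpha) * hard_ce


# Toy batch: 4 images, 2 classes (index 0 = food_or_drink).
torch.manual_seed(0)
student_logits = torch.randn(4, 2)
teacher_logits = torch.randn(4, 2)
hard_labels = teacher_logits.argmax(dim=1)  # hard labels taken from the teacher
loss = hybrid_distillation_loss(student_logits, teacher_logits, hard_labels)
print(f"loss = {loss.item():.4f}")
```

With alpha = 0.7 most of the gradient signal comes from matching the teacher's softened distribution, while the cross-entropy term keeps the student anchored to the hard decision.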

Augmentations (torchvision.transforms.v2)

  • RandomResizedCrop (scale 0.6-1.0): food can be a small part of a scene
  • RandomHorizontalFlip + RandomVerticalFlip: orientation robustness
  • RandomRotation (15 deg): tilted phone shots
  • RandomPerspective (0.2, p=0.3): angled views
  • ColorJitter (B=0.4, C=0.4, S=0.3, H=0.05): restaurant lighting variation
  • GaussianBlur (p=0.2): camera shake
  • RandomGrayscale (p=0.02): B&W web images
  • RandomErasing (p=0.1): partial occlusion

Pipeline context

This model is part of the Nutrify VLM pipeline, a cascading filter system for building a food/drink image dataset from billion-scale web crawls:

  1. Text classification: 1B captions → 106M food/drink rows (ettin-150m)
  2. Structured extraction: FoodExtract-v2 on 106M rows (Gemma 3 270M)
  3. Image download: 3.1M images from filtered URLs
  4. SigLIP2 zero-shot: 92-prompt classification + embeddings (teacher labels)
  5. This model: Fast binary classifier for scale-up to 50M+ images

Throughput Benchmarks (RTX 4090)

| Batch Size | img/s | VRAM (MB) |
| --- | --- | --- |
| 64 | 4639.1 | 3476.6 |
| 128 | 4435.9 | 1001.9 |
| 256 | 4274.1 | 1807.2 |
| 512 | 4204.0 | 3417.8 |

Peak: 4639.1 img/s at batch 64
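
To put these numbers in context for the 50M+ image use case, the back-of-envelope estimate below converts throughput into wall-clock time. It is a rough sketch that counts model forward passes only; image download, decoding, and data loading are ignored and may dominate in practice. The helper name is hypothetical.

```python
def hours_to_filter(num_images: int, images_per_second: float) -> float:
    """Model-only time in hours; ignores download, decode, and data loading."""
    return num_images / images_per_second / 3600


# Throughput (img/s) per batch size, copied from the benchmark table above.
benchmarks = {64: 4639.1, 128: 4435.9, 256: 4274.1, 512: 4204.0}

best_batch = max(benchmarks, key=benchmarks.get)
hours = hours_to_filter(50_000_000, benchmarks[best_batch])
print(f"batch {best_batch}: ~{hours:.1f} h for 50M images on one GPU")
```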

Intended Use

  • Primary: Fast filtering of food/drink images from large web-crawled datasets
  • Secondary: Binary food detection in apps (food tracking, dietary logging)
  • Not for: Fine-grained food classification (use a multi-class model), nutrition estimation

Limitations

  • Binary only: does not distinguish food types, cuisines, or specific items
  • Trained on web images: may underperform on unusual angles, lighting, or cultural foods underrepresented in DataComp-1B
  • Confidence scores are compressed due to distillation temperature (T=3); use relative ranking, not absolute thresholds
  • v1 model: trained on DataComp only; v2 will include human-verified FoodVision training data
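
One way to see why a temperature compresses confidence while leaving rankings usable is to apply a plain softmax to the same logits at T=1 and T=3. The logits below are made up for illustration and do not come from this model; the sketch shows the general effect of temperature on a two-class softmax, not this model's actual calibration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (food, not_food) logits for a confident prediction.
p_t1 = softmax([4.0, 0.0], temperature=1.0)[0]
p_t3 = softmax([4.0, 0.0], temperature=3.0)[0]
print(f"T=1: {p_t1:.3f}  T=3: {p_t3:.3f}")  # T=1: 0.982  T=3: 0.791

# Ranking by food probability is unchanged by the temperature.
margins = [4.0, 1.0, -0.5, 2.5]  # hypothetical food-vs-not_food logit gaps


def rank(temperature):
    scores = [softmax([m, 0.0], temperature)[0] for m in margins]
    return sorted(range(len(margins)), key=lambda i: scores[i], reverse=True)


print(rank(1.0) == rank(3.0))  # True
```

In practice this suggests choosing a cutoff from the score distribution of a labeled sample (for example a top-k or percentile rule) rather than assuming a fixed value such as 0.5 or 0.9 carries its usual meaning.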

Citation

@misc{food-not-food-csatv2-11m-v1,
  author = {Daniel Bourke},
  title = {Food/Not Food Classifier: csatv2_11m v1},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/mrdbourke/food-not-food-classifier-csatv2-v1}
}