# Ettin-150m Food or Drink Classifier

A 150M parameter binary text classifier fine-tuned to detect food or drink content in image captions. Built for high-throughput inference on billion-scale datasets.
## Performance
| Metric | Value |
|---|---|
| Accuracy | 0.9462 |
| F1 | 0.9475 |
| Precision | 0.9424 |
| Recall | 0.9527 |
| Throughput | 5,874 rows/s (fp16, NVIDIA GeForce RTX 4090) |
| 1B rows ETA | ~47.3 hours |
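The ETA above follows directly from the throughput figure; a quick sanity check, assuming the quoted 5,874 rows/s is sustained:

```python
# Time to classify 1 billion rows at the quoted sustained throughput.
rows = 1_000_000_000
rows_per_s = 5_874

eta_hours = rows / rows_per_s / 3600
print(f"{eta_hours:.1f} hours")  # 47.3 hours
```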
### Confusion matrix
| | Pred: food/drink | Pred: not food/drink |
|---|---|---|
| True: food/drink | 159,311 | 7,913 |
| True: not food/drink | 9,733 | 151,130 |
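The headline metrics can be re-derived from the confusion matrix counts; a quick verification in plain Python:

```python
# Counts from the confusion matrix above.
tp = 159_311  # true food/drink, predicted food/drink
fn = 7_913    # true food/drink, predicted not food/drink
fp = 9_733    # true not food/drink, predicted food/drink
tn = 151_130  # true not food/drink, predicted not food/drink

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9462 precision=0.9424 recall=0.9527 f1=0.9475
```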
## Throughput benchmark (NVIDIA GeForce RTX 4090)
| Batch Size | Rows/s | Elapsed | Peak VRAM | Status |
|---|---|---|---|---|
| 128 | 7,513 | 43.7s | 0.68 GB | ✓ |
| 256 | 8,555 | 38.4s | 1.05 GB | ✓ |
| 512 | 8,154 | 40.2s | 1.80 GB | ✓ |
| 1024 | 8,038 | 40.8s | 3.31 GB | ✓ |
| 2048 | 7,956 | 41.2s | 6.32 GB | ✓ |
| 4096 | 7,747 | 42.4s | 12.34 GB | ✓ |
| 8192 | — | — | — | OOM |
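A minimal timing harness in the spirit of the sweep above (a hypothetical sketch, not the script used to produce these numbers) measures rows/s for any batch-classification callable across a list of batch sizes:

```python
import time

def measure_throughput(classify_batch, rows, batch_sizes):
    """Time `classify_batch` over all rows for each batch size; return rows/s."""
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        for i in range(0, len(rows), bs):
            classify_batch(rows[i:i + bs])
        elapsed = time.perf_counter() - start
        results[bs] = len(rows) / elapsed
    return results

# Demo with a dummy classifier; swap in a real pipeline call on GPU.
rates = measure_throughput(
    lambda batch: [0] * len(batch),
    rows=["a caption"] * 10_000,
    batch_sizes=[128, 256, 512],
)
```

On a real model you would also reset and read `torch.cuda.max_memory_allocated()` per batch size to reproduce the peak-VRAM column.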
## Usage
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="mrdbourke/ettin-150m-food-or-drink-classifier",
    device="cuda",  # or "cpu"
    torch_dtype="float16",  # for faster inference
)

# Single prediction
result = classifier("A bowl of ramen with soft-boiled egg and nori")
print(result)
# [{'label': 'food_or_drink', 'score': 0.9995}]

# Batch prediction
texts = [
    "A glass of red wine next to a cheese board",
    "A yellow tractor driving over a grassy hill",
    "Fresh squeezed orange juice with ice",
    "A laptop computer on a wooden desk",
]
results = classifier(texts, batch_size=512)
for text, r in zip(texts, results):
    print(f"{r['label']:<20s} {r['score']:.4f} {text}")
```
### PyTorch direct inference (maximum throughput)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("mrdbourke/ettin-150m-food-or-drink-classifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "mrdbourke/ettin-150m-food-or-drink-classifier",
    torch_dtype=torch.float16,
).to("cuda").eval()

texts = ["A bowl of ramen", "A red car on the highway"]
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to("cuda")

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits.float(), dim=-1)
preds = torch.argmax(probs, dim=-1)

labels = ["food_or_drink", "not_food_or_drink"]
for text, pred, prob in zip(texts, preds, probs):
    print(f"{labels[pred]:<20s} {prob[pred]:.4f} {text}")
```
## Labels
| Label | ID | Description |
|---|---|---|
| food_or_drink | 0 | Caption describes food, beverages, meals, ingredients, drinks |
| not_food_or_drink | 1 | Caption describes anything else |
## How it was made
This model was created through knowledge distillation:

- Teacher model: ModernBERT-large-zeroshot-v2.0 (400M params) classified 10M image captions from Recap-DataComp-1B as food or drink / not food or drink
- Labeled dataset: mrdbourke/food-or-drink-10m — 1.57M balanced rows (50/50 split)
- Supplementary data: mrdbourke/FoodExtract-135k — 135K human-labeled samples
- Student model: jhu-clsp/ettin-encoder-150m fine-tuned on both datasets, with both `re_caption` and `org_caption` used as separate training examples
## Training details
| Parameter | Value |
|---|---|
| Base model | jhu-clsp/ettin-encoder-150m |
| Parameters | 149,606,402 |
| Model size (fp16) | 285.4 MB |
| Training examples | 2,952,797 |
| Test examples | 328,087 |
| Epochs | 5 |
| Batch size | 128 |
| Learning rate | 2e-05 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| bf16 training | True |
| Max length | 512 |
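The hyperparameters in the table map directly onto a standard Hugging Face `TrainingArguments` setup. A sketch only, assuming vanilla `Trainer` fine-tuning; the output path is a placeholder, not taken from the original run:

```python
from transformers import TrainingArguments

# Config sketch mirroring the training-details table above.
training_args = TrainingArguments(
    output_dir="ettin-150m-food-or-drink-classifier",  # assumed placeholder
    num_train_epochs=5,
    per_device_train_batch_size=128,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
)
```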
## Why knowledge distillation?
The teacher model (ModernBERT-large zero-shot NLI) processes ~871 rows/s because it requires encoding hypothesis pairs for each label. The fine-tuned student model does a single forward pass with a classification head, achieving 5,874+ rows/s — a 7x speedup that makes billion-scale inference practical.
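The speedup claim checks out arithmetically:

```python
teacher_rows_per_s = 871    # ModernBERT-large zero-shot NLI
student_rows_per_s = 5_874  # fine-tuned Ettin-150m, fp16

speedup = student_rows_per_s / teacher_rows_per_s
print(f"{speedup:.1f}x")  # 6.7x, i.e. roughly the quoted 7x
```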
## Intended use
- Large-scale food/drink filtering: Extract food and drink content from billion-row image-text datasets
- Caption classification: Classify image captions as food/drink related or not
- Dataset curation: Filter web-scraped data for food/drink content
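For the filtering use cases, a small hypothetical helper (the function name and threshold are illustrative, not part of the model's API) keeps only captions the classifier labels as food/drink above a confidence cutoff:

```python
def filter_food_or_drink(texts, predictions, threshold=0.9):
    """Keep texts predicted as food_or_drink with score >= threshold.

    `predictions` is the list of {'label', 'score'} dicts that the
    text-classification pipeline returns, aligned with `texts`.
    """
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == "food_or_drink" and pred["score"] >= threshold
    ]

# Mocked pipeline output shown here; on real data, pass the classifier's results.
texts = ["A bowl of ramen", "A red car on the highway"]
preds = [{"label": "food_or_drink", "score": 0.99},
         {"label": "not_food_or_drink", "score": 0.98}]
kept = filter_food_or_drink(texts, preds)  # ["A bowl of ramen"]
```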
## License

Apache 2.0