# Ettin-150m Food or Drink Classifier

A 150M parameter binary text classifier fine-tuned to detect food or drink content in image captions. Built for high-throughput inference on billion-scale datasets.

## Performance

| Metric | Value |
|---|---|
| Accuracy | 0.9462 |
| F1 | 0.9475 |
| Precision | 0.9424 |
| Recall | 0.9527 |
| Throughput | 5,874 rows/s (fp16, NVIDIA GeForce RTX 4090) |
| 1B rows ETA | ~47.3 hours |
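The 1B-row ETA follows directly from the throughput figure; a quick sanity check (assuming the measured rate is sustained with no I/O overhead):

```python
# Estimate wall-clock time to classify n_rows at a sustained throughput.
# Assumes the measured 5,874 rows/s holds for the whole run.

def eta_hours(n_rows: int, rows_per_sec: float) -> float:
    """Hours needed to process n_rows at rows_per_sec."""
    return n_rows / rows_per_sec / 3600

print(f"{eta_hours(1_000_000_000, 5874):.1f} hours")  # 47.3 hours
```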

### Confusion matrix

| | Pred: food/drink | Pred: not food/drink |
|---|---|---|
| True: food/drink | 159,311 | 7,913 |
| True: not food/drink | 9,733 | 151,130 |
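The headline metrics can be recomputed directly from these counts; a minimal check, treating food/drink as the positive class:

```python
# Recompute accuracy/precision/recall/F1 from the confusion-matrix counts.
tp, fn = 159_311, 7_913    # true food/drink rows
fp, tn = 9_733, 151_130    # true not-food/drink rows

accuracy  = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.9462 0.9424 0.9527 0.9475
```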

### Throughput benchmark (NVIDIA GeForce RTX 4090)

| Batch size | Rows/s | Elapsed | Peak VRAM | Status |
|---|---|---|---|---|
| 128 | 7,513 | 43.7 s | 0.68 GB | ✓ |
| 256 | 8,555 | 38.4 s | 1.05 GB | ✓ |
| 512 | 8,154 | 40.2 s | 1.80 GB | ✓ |
| 1024 | 8,038 | 40.8 s | 3.31 GB | ✓ |
| 2048 | 7,956 | 41.2 s | 6.32 GB | ✓ |
| 4096 | 7,747 | 42.4 s | 12.34 GB | ✓ |
| 8192 | — | — | — | OOM |

## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="mrdbourke/ettin-150m-food-or-drink-classifier",
    device="cuda",          # or "cpu"
    torch_dtype="float16",  # fp16 for faster GPU inference
)

# Single prediction
result = classifier("A bowl of ramen with soft-boiled egg and nori")
print(result)
# [{'label': 'food_or_drink', 'score': 0.9995}]

# Batch prediction
texts = [
    "A glass of red wine next to a cheese board",
    "A yellow tractor driving over a grassy hill",
    "Fresh squeezed orange juice with ice",
    "A laptop computer on a wooden desk",
]
results = classifier(texts, batch_size=512)
for text, r in zip(texts, results):
    print(f"{r['label']:<20s} {r['score']:.4f}  {text}")
```

### PyTorch direct inference (maximum throughput)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("mrdbourke/ettin-150m-food-or-drink-classifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "mrdbourke/ettin-150m-food-or-drink-classifier",
    torch_dtype=torch.float16,
).to("cuda").eval()

texts = ["A bowl of ramen", "A red car on the highway"]
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to("cuda")

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
    # Softmax in fp32 for numerical stability
    probs = torch.softmax(outputs.logits.float(), dim=-1)
    preds = torch.argmax(probs, dim=-1)

labels = ["food_or_drink", "not_food_or_drink"]
for text, pred, prob in zip(texts, preds, probs):
    pred = pred.item()
    print(f"{labels[pred]:<20s} {prob[pred]:.4f}  {text}")
```

## Labels

| Label | ID | Description |
|---|---|---|
| food_or_drink | 0 | Caption describes food, beverages, meals, ingredients, drinks |
| not_food_or_drink | 1 | Caption describes anything else |

## How it was made

This model was created through knowledge distillation:

1. Teacher model: ModernBERT-large-zeroshot-v2.0 (400M params) classified 10M image captions from Recap-DataComp-1B as food or drink / not food or drink
2. Labeled dataset: mrdbourke/food-or-drink-10m, 1.57M balanced rows (50/50 split)
3. Supplementary data: mrdbourke/FoodExtract-135k, 135K human-labeled samples
4. Student model: jhu-clsp/ettin-encoder-150m fine-tuned on both datasets, with both re_caption and org_caption used as separate training examples
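The balancing in step 2 amounts to downsampling the majority class to a 50/50 split. A hypothetical helper sketching that step (the actual pipeline code isn't published here; function name and row format are assumptions):

```python
import random

def balance_binary(rows, label_key="label", seed=42):
    """Downsample the majority class to a 50/50 split (hypothetical sketch)."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == "food_or_drink"]
    neg = [r for r in rows if r[label_key] != "food_or_drink"]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced

rows = [{"label": "food_or_drink"}] * 8 + [{"label": "not_food_or_drink"}] * 2
print(len(balance_binary(rows)))  # 4 (2 of each class)
```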

## Training details

| Parameter | Value |
|---|---|
| Base model | jhu-clsp/ettin-encoder-150m |
| Parameters | 149,606,402 |
| Model size (fp16) | 285.4 MB |
| Training examples | 2,952,797 |
| Test examples | 328,087 |
| Epochs | 5 |
| Batch size | 128 |
| Learning rate | 2e-05 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| bf16 training | True |
| Max length | 512 |
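The fp16 model size follows from the parameter count at 2 bytes per parameter (assuming "MB" in the table means MiB, i.e. 1024² bytes):

```python
# fp16 stores 2 bytes per parameter; 285.4 "MB" matches mebibytes.
params = 149_606_402
fp16_mib = params * 2 / 1024**2
print(f"{fp16_mib:.1f} MiB")  # 285.4 MiB
```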

## Why knowledge distillation?

The teacher model (ModernBERT-large zero-shot NLI) processes ~871 rows/s because it must encode a premise-hypothesis pair for every candidate label. The fine-tuned student runs a single forward pass through a classification head, reaching 5,874+ rows/s, a roughly 7x speedup that makes billion-scale inference practical.
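The speedup and the billion-row comparison are straightforward to verify from the two throughput figures:

```python
# Compare teacher vs. student throughput at billion-row scale.
teacher_rps, student_rps = 871, 5_874

print(f"{student_rps / teacher_rps:.1f}x speedup")                  # 6.7x, ~7x
print(f"teacher 1B-row ETA: {1e9 / teacher_rps / 3600:.0f} hours")  # ~319 hours
```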

## Intended use

- Large-scale food/drink filtering: Extract food and drink content from billion-row image-text datasets
- Caption classification: Classify image captions as food/drink related or not
- Dataset curation: Filter web-scraped data for food/drink content
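For dataset curation, a common pattern is to keep only rows the classifier labels as food/drink above a confidence threshold. A minimal sketch over already-computed pipeline outputs (the 0.9 threshold and helper name are assumptions, not published settings):

```python
def keep_food_rows(texts, predictions, threshold=0.9):
    """Keep texts predicted food_or_drink with score >= threshold.

    `predictions` mirrors the pipeline's output format:
    one {"label": ..., "score": ...} dict per input text.
    """
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == "food_or_drink" and pred["score"] >= threshold
    ]

texts = ["Fresh orange juice", "A laptop on a desk"]
predictions = [
    {"label": "food_or_drink", "score": 0.998},
    {"label": "not_food_or_drink", "score": 0.991},
]
print(keep_food_rows(texts, predictions))  # ['Fresh orange juice']
```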

## License

Apache 2.0
