# Ettin-150m Food or Drink Classifier

A 150M parameter binary text classifier fine-tuned to detect food or drink content in image captions. Built for high-throughput inference on billion-scale datasets.
## Performance
| Metric | Value |
|---|---|
| Accuracy | 0.9462 |
| F1 | 0.9475 |
| Precision | 0.9424 |
| Recall | 0.9527 |
| Throughput | 5,874 rows/s (fp16, NVIDIA GeForce RTX 4090) |
| 1B rows ETA | ~47.3 hours |
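The ETA above follows directly from the throughput figure; a quick sanity check, assuming the quoted 5,874 rows/s is sustained:

```python
# Time to classify 1 billion rows at the quoted sustained throughput.
rows = 1_000_000_000
rows_per_s = 5_874

eta_hours = rows / rows_per_s / 3600
print(f"{eta_hours:.1f} hours")  # 47.3 hours
```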
### Confusion matrix
| | Pred: food/drink | Pred: not food/drink |
|---|---|---|
| True: food/drink | 159,311 | 7,913 |
| True: not food/drink | 9,733 | 151,130 |
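The headline metrics can be re-derived from the confusion matrix counts; a quick verification in plain Python:

```python
# Counts from the confusion matrix above.
tp = 159_311  # true food/drink, predicted food/drink
fn = 7_913    # true food/drink, predicted not food/drink
fp = 9_733    # true not food/drink, predicted food/drink
tn = 151_130  # true not food/drink, predicted not food/drink

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9462 precision=0.9424 recall=0.9527 f1=0.9475
```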
## Throughput benchmark (NVIDIA GeForce RTX 4090)
| Batch Size | Rows/s | Elapsed | Peak VRAM | Status |
|---|---|---|---|---|
| 128 | 7,513 | 43.7s | 0.68 GB | ✓ |
| 256 | 8,555 | 38.4s | 1.05 GB | ✓ |
| 512 | 8,154 | 40.2s | 1.80 GB | ✓ |
| 1024 | 8,038 | 40.8s | 3.31 GB | ✓ |
| 2048 | 7,956 | 41.2s | 6.32 GB | ✓ |
| 4096 | 7,747 | 42.4s | 12.34 GB | ✓ |
| 8192 | — | — | — | OOM |
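A minimal timing harness in the spirit of the sweep above (a hypothetical sketch, not the script used to produce these numbers) measures rows/s for any batch-classification callable across a list of batch sizes:

```python
import time

def measure_throughput(classify_batch, rows, batch_sizes):
    """Time `classify_batch` over all rows for each batch size; return rows/s."""
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        for i in range(0, len(rows), bs):
            classify_batch(rows[i:i + bs])
        elapsed = time.perf_counter() - start
        results[bs] = len(rows) / elapsed
    return results

# Demo with a dummy classifier; swap in a real pipeline call on GPU.
rates = measure_throughput(
    lambda batch: [0] * len(batch),
    rows=["a caption"] * 10_000,
    batch_sizes=[128, 256, 512],
)
```

On a real model you would also reset and read `torch.cuda.max_memory_allocated()` per batch size to reproduce the peak-VRAM column.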
## Usage
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="mrdbourke/ettin-150m-food-or-drink-classifier",
    device="cuda",  # or "cpu"
    torch_dtype="float16",  # for faster inference
)

# Single prediction
result = classifier("A bowl of ramen with soft-boiled egg and nori")
print(result)
# [{'label': 'food_or_drink', 'score': 0.9995}]

# Batch prediction
texts = [
    "A glass of red wine next to a cheese board",
    "A yellow tractor driving over a grassy hill",
    "Fresh squeezed orange juice with ice",
    "A laptop computer on a wooden desk",
]
results = classifier(texts, batch_size=512)
for text, r in zip(texts, results):
    print(f"{r['label']:<20s} {r['score']:.4f} {text}")
```
### PyTorch direct inference (maximum throughput)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("mrdbourke/ettin-150m-food-or-drink-classifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "mrdbourke/ettin-150m-food-or-drink-classifier",
    torch_dtype=torch.float16,
).to("cuda").eval()

texts = ["A bowl of ramen", "A red car on the highway"]
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to("cuda")

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits.float(), dim=-1)
preds = torch.argmax(probs, dim=-1)

labels = ["food_or_drink", "not_food_or_drink"]
for text, pred, prob in zip(texts, preds, probs):
    print(f"{labels[pred]:<20s} {prob[pred]:.4f} {text}")
```
## Labels
| Label | ID | Description |
|---|---|---|
| food_or_drink | 0 | Caption describes food, beverages, meals, ingredients, drinks |
| not_food_or_drink | 1 | Caption describes anything else |
## How it was made
This model was created through knowledge distillation:

- Teacher model: ModernBERT-large-zeroshot-v2.0 (400M params) classified 10M image captions from Recap-DataComp-1B as food or drink / not food or drink
- Labeled dataset: mrdbourke/food-or-drink-10m — 1.57M balanced rows (50/50 split)
- Supplementary data: mrdbourke/FoodExtract-135k — 135K human-labeled samples
- Student model: jhu-clsp/ettin-encoder-150m fine-tuned on both datasets, with both `re_caption` and `org_caption` used as separate training examples
## Training details
| Parameter | Value |
|---|---|
| Base model | jhu-clsp/ettin-encoder-150m |
| Parameters | 149,606,402 |
| Model size (fp16) | 285.4 MB |
| Training examples | 2,952,797 |
| Test examples | 328,087 |
| Epochs | 5 |
| Batch size | 128 |
| Learning rate | 2e-05 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| bf16 training | True |
| Max length | 512 |
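The hyperparameters in the table map directly onto a standard Hugging Face `TrainingArguments` setup. A sketch only, assuming vanilla `Trainer` fine-tuning; the output path is a placeholder, not taken from the original run:

```python
from transformers import TrainingArguments

# Config sketch mirroring the training-details table above.
training_args = TrainingArguments(
    output_dir="ettin-150m-food-or-drink-classifier",  # assumed placeholder
    num_train_epochs=5,
    per_device_train_batch_size=128,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
)
```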
## Why knowledge distillation?
The teacher model (ModernBERT-large zero-shot NLI) processes ~871 rows/s because it requires encoding hypothesis pairs for each label. The fine-tuned student model does a single forward pass with a classification head, achieving 5,874+ rows/s — a 7x speedup that makes billion-scale inference practical.
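The speedup claim checks out arithmetically:

```python
teacher_rows_per_s = 871    # ModernBERT-large zero-shot NLI
student_rows_per_s = 5_874  # fine-tuned Ettin-150m, fp16

speedup = student_rows_per_s / teacher_rows_per_s
print(f"{speedup:.1f}x")  # 6.7x, i.e. roughly the quoted 7x
```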
## Intended use
- Large-scale food/drink filtering: Extract food and drink content from billion-row image-text datasets
- Caption classification: Classify image captions as food/drink related or not
- Dataset curation: Filter web-scraped data for food/drink content
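For the filtering use cases, a small hypothetical helper (the function name and threshold are illustrative, not part of the model's API) keeps only captions the classifier labels as food/drink above a confidence cutoff:

```python
def filter_food_or_drink(texts, predictions, threshold=0.9):
    """Keep texts predicted as food_or_drink with score >= threshold.

    `predictions` is the list of {'label', 'score'} dicts that the
    text-classification pipeline returns, aligned with `texts`.
    """
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == "food_or_drink" and pred["score"] >= threshold
    ]

# Mocked pipeline output shown here; on real data, pass the classifier's results.
texts = ["A bowl of ramen", "A red car on the highway"]
preds = [{"label": "food_or_drink", "score": 0.99},
         {"label": "not_food_or_drink", "score": 0.98}]
kept = filter_food_or_drink(texts, preds)  # ["A bowl of ramen"]
```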
## License

Apache 2.0