Minimal Image Captioning with BLIP on Hugging Face
Community Article · Published February 2, 2026
In this post, I’m sharing a tiny script that loads a captioning model from the Hugging Face Hub and prints a caption for a given image.
Model used in this example: `norwoodsystems/image-caption`
BLIP docs (for reference):
- Transformers BLIP documentation
What this script does
- Loads a BLIP processor and captioning model from the Hub
- Reads an image file from the command line
- Runs `model.generate(...)` to create caption tokens
- Decodes the tokens into a human-readable caption
Install dependencies
```bash
pip install torch transformers pillow
```
If you plan to run on GPU, install a CUDA-enabled PyTorch build.
Minimal captioning script
Save as `caption.py`:

```python
import sys

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "norwoodsystems/image-caption"


def main():
    if len(sys.argv) < 2:
        print("Usage: python caption.py <image_path>")
        sys.exit(1)

    image_path = sys.argv[1]

    # Load processor + model
    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(
        MODEL_ID,
        use_safetensors=True,
    )

    # Optional: move to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Load image
    raw_image = Image.open(image_path).convert("RGB")

    # Preprocess
    inputs = processor(images=raw_image, return_tensors="pt").to(device)

    # Generate caption
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=40)

    # Decode
    description = processor.decode(generated_ids[0], skip_special_tokens=True)
    print("Description:", description)


if __name__ == "__main__":
    main()
```
Run it

```bash
python caption.py rose.png
```

Example output:

```
Description: a single red rose on a white background
```
Useful knobs to try
If you want to tune caption style and length:
- Longer/shorter captions: adjust `max_new_tokens`
- More deterministic output: `model.generate(**inputs, do_sample=False)`
- More variety: `model.generate(**inputs, do_sample=True, top_p=0.9, temperature=1.0)`
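If you switch between these settings often, a tiny helper can bundle them into named presets. This is just a sketch of my own — `generation_preset` and the preset names are not part of Transformers:

```python
def generation_preset(style="deterministic", max_new_tokens=40):
    """Bundle the generate() knobs above into named presets."""
    if style == "deterministic":
        # Greedy decoding: the same caption on every run
        return {"max_new_tokens": max_new_tokens, "do_sample": False}
    if style == "varied":
        # Nucleus sampling: more diverse captions
        return {"max_new_tokens": max_new_tokens, "do_sample": True,
                "top_p": 0.9, "temperature": 1.0}
    raise ValueError(f"unknown style: {style!r}")
```

Then calls become e.g. `model.generate(**inputs, **generation_preset("varied"))`.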
Where this is useful
- Auto-captioning for datasets
- Pre-labeling images before manual review
- Content search / indexing (“find images that contain …”)
- Accessibility features (alt-text suggestions)
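For the dataset and pre-labeling use cases, the per-image logic extends naturally to a folder. Here is a minimal sketch; `caption_folder` and its `caption_fn` callback are my own names, and `caption_fn` would wrap the load/preprocess/generate/decode steps from `caption.py`:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def caption_folder(folder, caption_fn):
    """Map image filename -> caption for every image in `folder`.

    caption_fn takes a Path and returns a caption string, e.g. by
    running the BLIP pipeline from the script above on that file.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in IMAGE_EXTS:
            results[path.name] = caption_fn(path)
    return results
```

Non-image files are skipped by extension, so the same folder can hold labels or metadata alongside the images.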
Notes & limitations
- Captions can be biased or incomplete (common for vision-language models)
- Results depend heavily on the model’s training data and target domain
- For production, consider batching, caching, and safety filtering (if user-generated images)
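On the caching point: generation is the expensive step, so one simple approach is to key a cache on a hash of the raw image bytes, letting repeat uploads of the same image skip the model entirely. A sketch, with function names of my own:

```python
import hashlib

def caption_with_cache(image_bytes, caption_fn, cache):
    """Return a cached caption for these exact bytes, computing it at most once.

    cache is any dict-like mapping; caption_fn maps raw image bytes to a
    caption string (e.g. by wrapping the BLIP pipeline above).
    """
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in cache:
        cache[key] = caption_fn(image_bytes)
    return cache[key]
```

In a real service the dict could be swapped for Redis or a database table keyed by the same hash.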
If you try this on your own images, I’d love to hear:
- where it works well
- where it fails (small objects, text in images, complex scenes, etc.)