Minimal Image Captioning with BLIP on Hugging Face

Community Article Published February 2, 2026

If you want a quick way to generate captions for images, BLIP (Bootstrapping Language-Image Pre-training) is one of the most accessible choices in the Transformers ecosystem. With a processor plus a conditional-generation model, you can caption an image in just a few lines of Python.

In this post, I’m sharing a tiny script that loads a captioning model from the Hugging Face Hub and prints a caption for a given image.

Model used in this example:

  • norwoodsystems/image-caption

BLIP docs (for reference):

  • Transformers BLIP documentation

What this script does

  1. Loads a BLIP processor and captioning model from the Hub
  2. Reads an image file from the command line
  3. Runs model.generate(...) to create caption tokens
  4. Decodes tokens into a human-readable caption

Install dependencies

pip install torch transformers pillow

If you plan to run on GPU, install a CUDA-enabled PyTorch build.
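For example, a CUDA build can be installed from the PyTorch wheel index; the `cu121` tag below is one example, and the right tag depends on your driver (check the selector on the PyTorch "Get Started" page):

```shell
# Example: install a CUDA 12.1 build of PyTorch
# (swap cu121 for the tag matching your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu121
```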


Minimal captioning script

Save as caption.py:

import sys
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "norwoodsystems/image-caption"

def main():
    if len(sys.argv) < 2:
        print("Usage: python caption.py <image_path>")
        sys.exit(1)

    image_path = sys.argv[1]

    # Load processor + model
    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(
        MODEL_ID,
        use_safetensors=True
    )

    # Optional: move to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Load image
    raw_image = Image.open(image_path).convert("RGB")

    # Preprocess
    inputs = processor(images=raw_image, return_tensors="pt").to(device)

    # Generate caption
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=40)

    # Decode
    description = processor.decode(generated_ids[0], skip_special_tokens=True)
    print("Description:", description)

if __name__ == "__main__":
    main()

Run it

python caption.py rose.png

Output example:

Description: a single red rose on a white background

Useful knobs to try

If you want to tune caption style and length:

  • Longer/shorter captions: adjust max_new_tokens

  • More deterministic output (greedy decoding, which is typically the default):

    model.generate(**inputs, do_sample=False)

  • More variety (sampling; a temperature above 1.0 increases randomness):

    model.generate(**inputs, do_sample=True, top_p=0.9, temperature=1.2)


Where this is useful

  • Auto-captioning for datasets
  • Pre-labeling images before manual review
  • Content search / indexing (“find images that contain …”)
  • Accessibility features (alt-text suggestions)
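For dataset-scale captioning, calling the script once per image is slow; processing images in fixed-size batches is the usual first improvement. A minimal sketch of the batching loop, where `caption_batch` is a hypothetical helper you would implement around the processor + `model.generate` calls from caption.py (it is not part of BLIP or Transformers):

```python
def chunked(items, batch_size):
    """Yield successive batch_size-sized slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def caption_dataset(image_paths, caption_batch, batch_size=8):
    """Map each path to its caption, calling caption_batch once per chunk."""
    captions = {}
    for batch in chunked(image_paths, batch_size):
        for path, caption in zip(batch, caption_batch(batch)):
            captions[path] = caption
    return captions
```

The `processor(...)` call accepts a list of images, so `caption_batch` can preprocess and generate for the whole chunk in one forward pass instead of one pass per image.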

Notes & limitations

  • Captions can be biased or incomplete (common for vision-language models)
  • Results depend heavily on the model’s training data and target domain
  • For production, consider batching, caching, and safety filtering (if user-generated images)
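The caching point above can be sketched with nothing but the standard library: hash each image's bytes and reuse the stored caption when the hash is unchanged. The cache file name and the `caption_fn` callable are illustrative placeholders, not part of BLIP:

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("captions_cache.json")  # illustrative cache location

def file_sha256(path):
    """Hash the image bytes so renamed-but-identical files still hit the cache."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def cached_caption(path, caption_fn):
    """Return a cached caption for `path`, calling caption_fn only on a miss."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = file_sha256(path)
    if key not in cache:
        cache[key] = caption_fn(path)
        CACHE_FILE.write_text(json.dumps(cache))
    return cache[key]
```

Here `caption_fn` would wrap the processor + `model.generate` steps from caption.py; for heavy use you would want a real store (SQLite, Redis) rather than one JSON file.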

If you try this on your own images, I’d love to hear:

  • where it works well
  • where it fails (small objects, text in images, complex scenes, etc.)
