Minimal Image Captioning with BLIP on Hugging Face
Community Article · Published February 2, 2026
In this post, I’m sharing a tiny script that loads a captioning model from the Hugging Face Hub and prints a caption for a given image.
Model used in this example: `norwoodsystems/image-caption`
BLIP docs (for reference):
- Transformers BLIP documentation
What this script does
- Loads a BLIP processor and captioning model from the Hub
- Reads an image file from the command line
- Runs `model.generate(...)` to create caption tokens
- Decodes the tokens into a human-readable caption
Install dependencies
```bash
pip install torch transformers pillow
```
If you plan to run on GPU, install a CUDA-enabled PyTorch build.
Minimal captioning script
Save as `caption.py`:

```python
import sys

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "norwoodsystems/image-caption"


def main():
    if len(sys.argv) < 2:
        print("Usage: python caption.py <image_path>")
        sys.exit(1)

    image_path = sys.argv[1]

    # Load processor + model
    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(
        MODEL_ID,
        use_safetensors=True,
    )

    # Optional: move to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Load image
    raw_image = Image.open(image_path).convert("RGB")

    # Preprocess
    inputs = processor(images=raw_image, return_tensors="pt").to(device)

    # Generate caption
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=40)

    # Decode
    description = processor.decode(generated_ids[0], skip_special_tokens=True)
    print("Description:", description)


if __name__ == "__main__":
    main()
```
Run it

```bash
python caption.py rose.png
```

Example output:

```
Description: a single red rose on a white background
```
Useful knobs to try
If you want to tune caption style and length:
- Longer/shorter captions: adjust `max_new_tokens`
- More deterministic output: `model.generate(**inputs, do_sample=False)`
- More variety: `model.generate(**inputs, do_sample=True, top_p=0.9, temperature=1.0)`
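If you switch between these settings often, a tiny helper can bundle them into named presets. This is just a sketch of my own — `generation_preset` and the preset names are not part of Transformers:

```python
def generation_preset(style="deterministic", max_new_tokens=40):
    """Bundle the generate() knobs above into named presets."""
    if style == "deterministic":
        # Greedy decoding: the same caption on every run
        return {"max_new_tokens": max_new_tokens, "do_sample": False}
    if style == "varied":
        # Nucleus sampling: more diverse captions
        return {"max_new_tokens": max_new_tokens, "do_sample": True,
                "top_p": 0.9, "temperature": 1.0}
    raise ValueError(f"unknown style: {style!r}")
```

Then calls become e.g. `model.generate(**inputs, **generation_preset("varied"))`.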
Where this is useful
- Auto-captioning for datasets
- Pre-labeling images before manual review
- Content search / indexing (“find images that contain …”)
- Accessibility features (alt-text suggestions)
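For the dataset and pre-labeling use cases, the per-image logic extends naturally to a folder. Here is a minimal sketch; `caption_folder` and its `caption_fn` callback are my own names, and `caption_fn` would wrap the load/preprocess/generate/decode steps from `caption.py`:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def caption_folder(folder, caption_fn):
    """Map image filename -> caption for every image in `folder`.

    caption_fn takes a Path and returns a caption string, e.g. by
    running the BLIP pipeline from the script above on that file.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in IMAGE_EXTS:
            results[path.name] = caption_fn(path)
    return results
```

Non-image files are skipped by extension, so the same folder can hold labels or metadata alongside the images.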
Notes & limitations
- Captions can be biased or incomplete (common for vision-language models)
- Results depend heavily on the model’s training data and target domain
- For production, consider batching, caching, and safety filtering (if user-generated images)
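On the caching point: generation is the expensive step, so one simple approach is to key a cache on a hash of the raw image bytes, letting repeat uploads of the same image skip the model entirely. A sketch, with function names of my own:

```python
import hashlib

def caption_with_cache(image_bytes, caption_fn, cache):
    """Return a cached caption for these exact bytes, computing it at most once.

    cache is any dict-like mapping; caption_fn maps raw image bytes to a
    caption string (e.g. by wrapping the BLIP pipeline above).
    """
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in cache:
        cache[key] = caption_fn(image_bytes)
    return cache[key]
```

In a real service the dict could be swapped for Redis or a database table keyed by the same hash.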
If you try this on your own images, I’d love to hear:
- where it works well
- where it fails (small objects, text in images, complex scenes, etc.)