---
license: apache-2.0
library_name: transformers
tags:
- image-captioning
- vit
- gpt2
- onnx
base_model: nlpconnect/vit-gpt2-image-captioning
pipeline_tag: image-to-text
---
# ViT-GPT2 Image Captioning — ONNX
ONNX export of [nlpconnect/vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) — a classic ViT encoder + GPT-2 decoder image captioner. ~240M parameters, trained on COCO captions.
Lightweight baseline captioner. Florence-2 is the better default for new projects (smaller, more capable, multi-task), but this one is useful when you need a plain "describe this image in one sentence" captioner with minimal dependencies.
Converted artifact. Training credit: nlpconnect.
## What this repo contains
```
config.json
generation_config.json
tokenizer.json
tokenizer_config.json
vocab.json
merges.txt
special_tokens_map.json
encoder_model.onnx # ViT image encoder
decoder_model.onnx # GPT-2 autoregressive decoder
```
Total: ~1.1 GB at fp32. Load with `optimum.onnxruntime.ORTModelForVision2Seq`.
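A minimal loading sketch, assuming the repo has already been downloaded to a local directory. The helper names `missing_files` and `load_captioner` are hypothetical. `use_cache=False` is an assumption based on the listing above, which shows no separate with-past decoder file. Generating a caption also needs pixel values from an image processor, which this sketch leaves out.

```python
from pathlib import Path

# File names copied from the listing above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "special_tokens_map.json",
    "encoder_model.onnx",
    "decoder_model.onnx",
]


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_captioner(model_dir):
    """Load the ONNX encoder/decoder pair and the tokenizer from a local directory."""
    # Imported lazily so the file check above works without optimum installed.
    from optimum.onnxruntime import ORTModelForVision2Seq
    from transformers import AutoTokenizer

    absent = missing_files(model_dir)
    if absent:
        raise FileNotFoundError(f"missing from {model_dir}: {absent}")

    # Assumption: only decoder_model.onnx is shipped (no with-past variant),
    # so key/value caching is switched off.
    model = ORTModelForVision2Seq.from_pretrained(model_dir, use_cache=False)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer
```

Running `missing_files` first gives a readable error for an incomplete download instead of a failure deep inside ONNX Runtime.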
## How it was produced
```
optimum-cli export onnx \
  --model nlpconnect/vit-gpt2-image-captioning \
  --task image-to-text \
  <output>
```
Conversion script: [`scripts/export-vit-gpt-image-captioning.ps1`](https://github.com/HeliosophLLC/DatumIngest/blob/main/scripts/export-vit-gpt-image-captioning.ps1) in the DatumIngest repo.
Toolchain: `optimum 1.24.0`, `transformers 4.45.2`, `torch 2.4.x`.
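When reproducing the export, it may help to compare the local environment against these versions first. The sketch below uses only the standard library. The function name `toolchain_mismatches` is hypothetical, and treating `2.4.x` as "any version starting with `2.4.`" is an assumption about what the toolchain line means.

```python
from importlib import metadata

# Versions quoted in the toolchain line above; "2.4" stands in for "2.4.x".
PINNED = {"optimum": "1.24.0", "transformers": "4.45.2", "torch": "2.4"}


def toolchain_mismatches(pinned=PINNED, lookup=metadata.version):
    """Return {package: installed version or None} for each package that differs."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # Accept an exact match or a longer version under the same prefix.
        matches = found is not None and (
            found == wanted or found.startswith(wanted + ".")
        )
        if not matches:
            mismatches[name] = found
    return mismatches
```

An empty result means the installed packages match the versions listed; it does not guarantee a byte-identical export.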
## Inference notes
| Setting | Value |
|---|---|
| Input resolution | 224×224 (resized + center-cropped by `preprocessor_config.json`) |
| Output | English caption, ~16-token median length |
| Max tokens | 16 (default in `generation_config.json`) |
| Domain | COCO-style natural scenes |
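
The 16-token cap in the table bounds the decoder loop. The sketch below shows how such a cap interacts with the end-of-sequence token, assuming plain greedy decoding; the actual strategy is whatever `generation_config.json` specifies. `greedy_decode` and `next_token_logits` are hypothetical names, with the latter standing in for one call to the decoder. The cap is treated here as a limit on newly generated tokens, which may differ by one from a limit that counts the start token.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    bos_token_id: int,
    eos_token_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Pick the highest-scoring token each step until EOS or the token cap."""
    tokens = [bos_token_id]
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # Index of the largest logit is the greedy choice.
        best = max(range(len(logits)), key=logits.__getitem__)
        if best == eos_token_id:
            break
        tokens.append(best)
    # Return only the generated ids, without the start token.
    return tokens[1:]
```

A caption that never emits the end-of-sequence token is cut off at the cap, which would explain a median length that sits right at the maximum.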
## License
**Apache-2.0** — same as upstream. `LICENSE` file included.