---
license: mit
pipeline_tag: image-classification
tags:
  - vision
  - vit
  - image-classification
---

# Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers (ICLR 2026)

This repository contains the weights for Jumbo, a simple and scalable architecture that makes Vision Transformers (ViTs) faster. Jumbo reduces patch token width while increasing global token width through a new "Jumbo" token processed by a shared, wider FFN.

## Model Description

ViTs are general and accurate, but often slow. Jumbo addresses this by reducing patch token width while adding a wider Jumbo token processed by its own wider FFN. This approach increases model capacity efficiently: the Jumbo FFN processes only a single token for speed, and its parameters are shared across all layers for memory efficiency. Crucially, Jumbo is attention-only and non-hierarchical, maintaining compatibility with plain ViT methods.
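The arithmetic behind this efficiency claim can be sketched with a toy cost model. All widths, depths, and token counts below are illustrative assumptions for the sketch, not the paper's actual configurations:

```python
# Illustrative cost sketch for a Jumbo-style FFN.
# All dimensions below are assumptions for illustration,
# not the configurations used in the paper.

def ffn_macs(width, hidden, n_tokens):
    """Multiply-accumulates for a 2-layer FFN (width -> hidden -> width)."""
    return 2 * width * hidden * n_tokens

def ffn_params(width, hidden):
    """Weight count for a 2-layer FFN, ignoring biases."""
    return 2 * width * hidden

n_patches = 196       # e.g. a 224x224 image with 16x16 patches (assumed)
patch_width = 192     # narrowed patch-token width (assumed)
jumbo_width = 768     # wider Jumbo-token width (assumed)
n_layers = 12         # transformer depth (assumed)

# Patch FFN: applied to every patch token in every layer.
patch_macs = ffn_macs(patch_width, 4 * patch_width, n_patches)

# Jumbo FFN: 4x wider here, but applied to a single token per layer.
jumbo_macs = ffn_macs(jumbo_width, 4 * jumbo_width, 1)

print(f"Jumbo FFN adds {jumbo_macs / patch_macs:.1%} extra FLOPs per layer")

# Sharing one Jumbo FFN across all layers stores its weights once,
# not n_layers times.
shared = ffn_params(jumbo_width, 4 * jumbo_width)
unshared = n_layers * shared
print(f"Shared Jumbo FFN: {shared:,} params vs {unshared:,} unshared")
```

With these assumed dimensions, the wide Jumbo FFN costs well under a tenth of the patch FFN's per-layer compute, because it touches only one token; and sharing its weights across layers divides its parameter footprint by the depth.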

## ImageNet-1K Performance

The following accuracies were achieved on ImageNet-1K:

| Model | Top-1 Accuracy |
|---|---|
| Jumbo-pico | 69.156% |
| Jumbo-nano | 74.528% |
| Jumbo-tiny | 78.366% |
| Jumbo-small | 82.558% |
| Jumbo-base | 84.954% |

## Usage

For installation and for running ImageNet-1K evaluation, attention visualization, and speed measurement, follow the instructions in the official repository.

### Installation

```bash
pip install -r requirements.txt
```

### Evaluation

```bash
python eval_i1k.py --model_path YOUR_PATH/jumbo_small.pth --model_size small
```

### Measuring Speed

```bash
python measure_speed.py --model_size small
```

### Visualizing Attention Maps

```bash
python visualize_attn.py --model_path YOUR_PATH/jumbo_small.pth --model_size small --out_dir YOUR_PATH/attn_maps --num_images 50
```

## Citation

```bibtex
@article{fuller2025thicker,
  title={Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers},
  author={Fuller, Anthony and Yassin, Yousef and Kyrollos, Daniel G. and Shelhamer, Evan and Green, James R.},
  journal={arXiv preprint arXiv:2502.15021},
  year={2025}
}
```