Visual Instruction Bottleneck Tuning
Paper: arXiv:2505.13946
Vittle (F) is the fixed-prior variant of Vittle (NeurIPS 2025), a visual instruction tuning framework for VLMs that improves robustness to distribution shifts via a variational information bottleneck.
```python
import torch
from transformers import AutoTokenizer

from vittle.model.language_model.vittle_llama import VittleLlamaForCausalLM

# Load the Vittle-7B fixed-prior checkpoint in bfloat16 on a single GPU.
model = VittleLlamaForCausalLM.from_pretrained(
    "changdae/vittle-7b-F",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("changdae/vittle-7b-F", use_fast=False)
```
Refer to the evaluation guide for full inference instructions.
| Property | Value |
|---|---|
| Base model | lmsys/vicuna-7b-v1.5 |
| Vision encoder | openai/clip-vit-large-patch14-336 |
| Bottleneck layer | 24 |
| Interpolation coefficient (alpha) | 0.5 |
| KLD strength (beta) | 0.1 |
| Learnable prior | No (fixed) |
| Training dtype | bfloat16 |
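To illustrate how the "Learnable prior: No (fixed)" and "KLD strength (beta)" entries above interact, the sketch below computes the standard closed-form KL term of a variational information bottleneck with a diagonal-Gaussian posterior and a fixed standard-normal prior, scaled by beta = 0.1. This is a minimal illustration of the generic VIB regularizer, not the paper's exact loss; the function name `kl_fixed_prior` is hypothetical.

```python
import math

def kl_fixed_prior(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    Hypothetical helper for illustration only.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

beta = 0.1  # KLD strength from the table above

# A posterior that exactly matches the fixed N(0, I) prior incurs no penalty.
penalty = beta * kl_fixed_prior([0.0, 0.0], [0.0, 0.0])  # → 0.0
```

With a fixed prior, only the posterior parameters are trained, and the beta-weighted penalty pulls the bottleneck representation toward the standard normal.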
```bibtex
@inproceedings{oh2025visual,
  title     = {Visual Instruction Bottleneck Tuning},
  author    = {Changdae Oh and Jiatong Li and Shawn Im and Sharon Li},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year      = {2025},
  url       = {https://openreview.net/forum?id=yzHiEmLSk8}
}
```