CoVLA – Vicuna-7B
CoVLA with Vicuna-7B backbone + CLIP ViT-L/14@336, trained on LLaVA-665K
CoVLA (Contextual Vision-Language Alignment) is a lightweight multimodal connector that uses CLS-attention-guided token selection (RawPool) to reduce visual tokens from 576→321 while preserving accuracy. Training details are in the paper:
CoVLA: Contextual Vision-Language Alignment via CLS-Guided Token Selection COLM 2026 (under review)
Model Details
| Field | Value |
|---|---|
| Base LLM | lmsys/vicuna-7b-v1.5 |
| Vision Tower | openai/clip-vit-large-patch14-336 |
| Connector | CoVLA RawPool (K=256, pool=8) |
| LoRA rank | 128 |
| LoRA alpha | 256 |
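To make the "CLS-attention-guided token selection" idea above concrete, here is a minimal NumPy sketch of the general top-K-plus-pooling pattern: keep the K patch tokens the CLS token attends to most, and average-pool the rest in groups of `pool`. All names here are illustrative assumptions, and the arithmetic is simplified (with N=576, K=256, pool=8 it yields 296 tokens, not the 321 produced by the actual RawPool connector; see the paper for the real scheme).

```python
import numpy as np

def rawpool_sketch(tokens, cls_attn, k=256, pool=8):
    """Illustrative sketch (not the paper's exact connector):
    keep the top-k tokens by CLS attention, average-pool the rest."""
    # tokens: (N, D) patch embeddings; cls_attn: (N,) CLS attention scores
    order = np.argsort(-cls_attn)          # most-attended tokens first
    keep = tokens[np.sort(order[:k])]      # top-K, restored to image order
    rest = tokens[np.sort(order[k:])]      # the remaining (N - K) tokens
    n_groups = rest.shape[0] // pool
    pooled = rest[: n_groups * pool].reshape(n_groups, pool, -1).mean(axis=1)
    return np.concatenate([keep, pooled], axis=0)

tokens = np.random.randn(576, 1024).astype(np.float32)  # CLIP ViT-L/14@336 grid
cls_attn = np.random.rand(576)
out = rawpool_sketch(tokens, cls_attn)
print(out.shape)  # (296, 1024)
```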
Repository Contents
adapter_model.safetensors – LoRA adapter weights (load with PEFT)
vision_connector.safetensors – CoVLA vision connector weights
adapter_config.json – PEFT LoRA config
config.json – Base LLM config
tokenizer.json – Tokenizer
stage2_metadata.json – Training metadata
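Each `.safetensors` file follows the safetensors layout: an 8-byte little-endian unsigned length, a JSON header mapping tensor names to dtype/shape/byte offsets, then the raw tensor bytes. As a quick sanity check, you can list the tensor names in `adapter_model.safetensors` or `vision_connector.safetensors` without loading any weights, using only the standard library (a minimal sketch; the demo file written below is synthetic):

```python
import json
import struct

def safetensors_header(path):
    """Read only the JSON header of a .safetensors file
    (tensor names, dtypes, shapes) without loading weights."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        return json.loads(f.read(header_len))

# Build a tiny synthetic safetensors file to demonstrate the reader.
meta = {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
blob = json.dumps(meta).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00" * 16)

print(sorted(safetensors_header("demo.safetensors")))  # ['w']
```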
Usage
```python
import sys

# 1. Clone the CoVLA code and put it on the path:
#    git clone https://github.com/junyong300/covla
sys.path.insert(0, "covla")

from src.models.model import CovlaModel

# 2. Load base LLM + LoRA adapter + CoVLA vision connector.
#    (Requires transformers, peft, and safetensors installed.)
model = CovlaModel.from_pretrained(
    llm_path="lmsys/vicuna-7b-v1.5",
    vision_tower="openai/clip-vit-large-patch14-336",
    lora_adapter="junyong300/covla-vicuna-7b",      # adapter_model.safetensors
    vision_connector="junyong300/covla-vicuna-7b",  # vision_connector.safetensors
)
```
See the CoVLA repository for full usage examples.
Performance
| Benchmark | CoVLA (321 tok) | MLP (576 tok) |
|---|---|---|
| MMBench Std | 82.5 | 81.8 |
| MMBench Circ | 75.4 | 74.6 |
| SEED | 72.4 | 72.5 |
| GQA | 49.3 | 50.8 |
| TextVQA | 64.0 | 64.5 |
Results shown are for the Qwen3-4B variant; see the paper for other backbones, including Vicuna-7B.
Citation
@inproceedings{covla2026,
title={CoVLA: Contextual Vision-Language Alignment via CLS-Guided Token Selection},
author={Park, Junyong and others},
booktitle={Conference on Language Modeling (COLM)},
year={2026}
}