CoVLA - Vicuna-7B

CoVLA with a Vicuna-7B backbone and a CLIP ViT-L/14@336 vision tower, trained on LLaVA-665K.

CoVLA (Contextual Vision-Language Alignment) is a lightweight multimodal connector that uses CLS-attention-guided token selection (RawPool) to reduce the number of visual tokens from 576 to 321 while preserving accuracy. Training details are in the paper:

CoVLA: Contextual Vision-Language Alignment via CLS-Guided Token Selection COLM 2026 (under review)
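The RawPool idea above can be sketched in PyTorch. This is a hypothetical illustration, not the released implementation: `rawpool_select` and its pooling scheme are assumptions, and the exact recipe that yields 321 tokens (including how the CLS token itself is handled) is specified in the paper.

```python
import torch

def rawpool_select(patch_tokens, cls_attn, k=256, pool=8):
    """Hypothetical sketch of CLS-guided token reduction (not the paper's code).

    patch_tokens: (B, N, D) patch embeddings, e.g. N=576 for ViT-L/14@336
    cls_attn:     (B, N) attention weights from the CLS token to each patch
    Keeps the top-k patches ranked by CLS attention, then average-pools the
    remaining N-k patches in groups of `pool`, yielding k + (N-k)//pool tokens.
    """
    B, N, D = patch_tokens.shape
    assert (N - k) % pool == 0, "leftover patches must divide evenly into pools"

    # Indices of the k patches the CLS token attends to most strongly
    top = cls_attn.topk(k, dim=1).indices                               # (B, k)
    keep = torch.gather(patch_tokens, 1, top.unsqueeze(-1).expand(-1, -1, D))

    # Average-pool the remaining N-k patches in fixed-size groups
    mask = torch.ones(B, N, dtype=torch.bool, device=patch_tokens.device)
    mask.scatter_(1, top, False)
    rest = patch_tokens[mask].view(B, N - k, D)
    pooled = rest.view(B, (N - k) // pool, pool, D).mean(dim=2)

    return torch.cat([keep, pooled], dim=1)            # (B, k + (N-k)//pool, D)
```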

Model Details

| Field | Value |
|---|---|
| Base LLM | lmsys/vicuna-7b-v1.5 |
| Vision Tower | openai/clip-vit-large-patch14-336 |
| Connector | CoVLA RawPool (K=256, pool=8) |
| LoRA rank | 128 |
| LoRA alpha | 256 |
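The adapter hyperparameters above map directly onto a PEFT `LoraConfig`. A minimal sketch: only `r` and `lora_alpha` come from this card; the target modules, dropout, and task type are assumptions (typical settings for LLaMA-family attention projections).

```python
from peft import LoraConfig

# r and lora_alpha are taken from the table above; everything else is assumed.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,                                        # assumption
    task_type="CAUSAL_LM",
)
```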

Repository Contents

adapter_model.safetensors    - LoRA adapter weights (load with PEFT)
vision_connector.safetensors - CoVLA vision connector weights
adapter_config.json          - PEFT LoRA config
config.json                  - Base LLM config
tokenizer.json               - Tokenizer
stage2_metadata.json         - Training metadata

Usage

# 1. Clone the CoVLA code
# git clone https://github.com/junyong300/covla
import sys; sys.path.insert(0, "covla")
from src.models.model import CovlaModel

# 2. Load the base LLM, LoRA adapter, and vision connector in one call
model = CovlaModel.from_pretrained(
    llm_path="lmsys/vicuna-7b-v1.5",
    vision_tower="openai/clip-vit-large-patch14-336",
    lora_adapter="junyong300/covla-vicuna-7b",      # loads adapter_model.safetensors
    vision_connector="junyong300/covla-vicuna-7b",  # loads vision_connector.safetensors
)

See the CoVLA repository for full usage examples.

Performance

| Benchmark | CoVLA (321 tok) | MLP (576 tok) |
|---|---|---|
| MMBench Std | 82.5 | 81.8 |
| MMBench Circ | 75.4 | 74.6 |
| SEED | 72.4 | 72.5 |
| GQA | 49.3 | 50.8 |
| TextVQA | 64.0 | 64.5 |

Note: the results above are reported for the Qwen3-4B variant of CoVLA; see the paper for other backbones.

Citation

@inproceedings{covla2026,
  title={CoVLA: Contextual Vision-Language Alignment via CLS-Guided Token Selection},
  author={Park, Junyong and others},
  booktitle={Conference on Language Modeling (COLM)},
  year={2026}
}