CoVLA - Vicuna-7B

CoVLA with a Vicuna-7B backbone and a CLIP ViT-L/14@336 vision tower, trained on LLaVA-665K.

CoVLA (Contextual Vision-Language Alignment) is a lightweight multimodal connector that uses CLS-attention-guided token selection (RawPool) to reduce the number of visual tokens from 576 to 321 while preserving accuracy. Training details are in the paper:

CoVLA: Contextual Vision-Language Alignment via CLS-Guided Token Selection COLM 2026 (under review)
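The RawPool idea above can be sketched in PyTorch. This is a hypothetical illustration, not the released implementation: `rawpool_select` and its pooling scheme are assumptions, and the exact recipe that yields 321 tokens (including how the CLS token itself is handled) is specified in the paper.

```python
import torch

def rawpool_select(patch_tokens, cls_attn, k=256, pool=8):
    """Hypothetical sketch of CLS-guided token reduction (not the paper's code).

    patch_tokens: (B, N, D) patch embeddings, e.g. N=576 for ViT-L/14@336
    cls_attn:     (B, N) attention weights from the CLS token to each patch
    Keeps the top-k patches ranked by CLS attention, then average-pools the
    remaining N-k patches in groups of `pool`, yielding k + (N-k)//pool tokens.
    """
    B, N, D = patch_tokens.shape
    assert (N - k) % pool == 0, "leftover patches must divide evenly into pools"

    # Indices of the k patches the CLS token attends to most strongly
    top = cls_attn.topk(k, dim=1).indices                               # (B, k)
    keep = torch.gather(patch_tokens, 1, top.unsqueeze(-1).expand(-1, -1, D))

    # Average-pool the remaining N-k patches in fixed-size groups
    mask = torch.ones(B, N, dtype=torch.bool, device=patch_tokens.device)
    mask.scatter_(1, top, False)
    rest = patch_tokens[mask].view(B, N - k, D)
    pooled = rest.view(B, (N - k) // pool, pool, D).mean(dim=2)

    return torch.cat([keep, pooled], dim=1)            # (B, k + (N-k)//pool, D)
```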

Model Details

| Field | Value |
|---|---|
| Base LLM | lmsys/vicuna-7b-v1.5 |
| Vision Tower | openai/clip-vit-large-patch14-336 |
| Connector | CoVLA RawPool (K=256, pool=8) |
| LoRA rank | 128 |
| LoRA alpha | 256 |
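The adapter hyperparameters above map directly onto a PEFT `LoraConfig`. A minimal sketch: only `r` and `lora_alpha` come from this card; the target modules, dropout, and task type are assumptions (typical settings for LLaMA-family attention projections).

```python
from peft import LoraConfig

# r and lora_alpha are taken from the table above; everything else is assumed.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,                                        # assumption
    task_type="CAUSAL_LM",
)
```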

Repository Contents

adapter_model.safetensors    - LoRA adapter weights (load with PEFT)
vision_connector.safetensors - CoVLA vision connector weights
adapter_config.json          - PEFT LoRA config
config.json                  - Base LLM config
tokenizer.json               - Tokenizer
stage2_metadata.json         - Training metadata

Usage

# 1. Clone the CoVLA code
# git clone https://github.com/junyong300/covla
import sys; sys.path.insert(0, "covla")
from src.models.model import CovlaModel

# 2. Load the base LLM, LoRA adapter, and vision connector in one call
model = CovlaModel.from_pretrained(
    llm_path="lmsys/vicuna-7b-v1.5",
    vision_tower="openai/clip-vit-large-patch14-336",
    lora_adapter="junyong300/covla-vicuna-7b",      # loads adapter_model.safetensors
    vision_connector="junyong300/covla-vicuna-7b",  # loads vision_connector.safetensors
)

See the CoVLA repository for full usage examples.

Performance

| Benchmark | CoVLA (321 tok) | MLP (576 tok) |
|---|---|---|
| MMBench Std | 82.5 | 81.8 |
| MMBench Circ | 75.4 | 74.6 |
| SEED | 72.4 | 72.5 |
| GQA | 49.3 | 50.8 |
| TextVQA | 64.0 | 64.5 |

Note: the results above are reported for the Qwen3-4B variant of CoVLA; see the paper for other backbones.

Citation

@inproceedings{covla2026,
  title={CoVLA: Contextual Vision-Language Alignment via CLS-Guided Token Selection},
  author={Park, Junyong and others},
  booktitle={Conference on Language Modeling (COLM)},
  year={2026}
}