# CoVLA · EXAONE4-1B

CoVLA with an EXAONE-4.0-1.2B backbone and a SigLIP2-NaFlex vision tower.
CoVLA (Contextual Vision-Language Alignment) is a lightweight multimodal connector that uses CLS-attention-guided token selection (RawPool) to reduce the visual token count from 576 to 321 while preserving accuracy. Training details are in the paper:

> **CoVLA: Contextual Vision-Language Alignment via CLS-Guided Token Selection**, COLM 2026 (under review)
## Model Details
| Field | Value |
|---|---|
| Base LLM | LGAI-EXAONE/EXAONE-4.0-1.2B |
| Vision Tower | google/siglip2-so400m-patch16-naflex |
| Connector | CoVLA RawPool (K=256, pool=8) |
| LoRA rank | 128 |
| LoRA alpha | 256 |
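The card does not spell out how K=256 and pool=8 combine to produce 321 tokens, but one reading consistent with the numbers (1 CLS + 256 selected patches + an 8×8 pooled grid = 321) can be sketched as follows. This is a hypothetical illustration, not the released implementation: the function name, the exact pooling scheme, and the token arithmetic are assumptions.

```python
import torch
import torch.nn.functional as F

def rawpool_sketch(patch_tokens, cls_token, cls_attn, k=256, pool=8):
    """Hypothetical RawPool sketch: keep the top-k patches ranked by
    CLS attention, append a pool x pool average-pooled summary of the
    full patch grid, and prepend the CLS token.
    With 576 patches: 1 + 256 + 8*8 = 321 output tokens."""
    B, N, D = patch_tokens.shape                     # N = 576 = 24 * 24
    side = int(N ** 0.5)
    top = cls_attn.topk(k, dim=1).indices            # (B, k) most salient patches
    kept = torch.gather(patch_tokens, 1, top.unsqueeze(-1).expand(-1, -1, D))
    grid = patch_tokens.transpose(1, 2).reshape(B, D, side, side)
    pooled = F.adaptive_avg_pool2d(grid, pool)       # (B, D, pool, pool)
    pooled = pooled.flatten(2).transpose(1, 2)       # (B, pool*pool, D)
    return torch.cat([cls_token.unsqueeze(1), kept, pooled], dim=1)
```

The key idea the sketch captures is that selection is guided by the vision tower's own CLS-to-patch attention, so the retained tokens are the ones the encoder already judged globally informative, while the pooled grid preserves coarse context from the discarded patches.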
## Repository Contents
- `adapter_model.safetensors`: LoRA adapter weights (load with PEFT)
- `vision_connector.safetensors`: CoVLA vision connector weights
- `adapter_config.json`: PEFT LoRA configuration
- `config.json`: base LLM configuration
- `tokenizer.json`: tokenizer
- `stage2_metadata.json`: training metadata
## Usage
```python
import sys

# 1. Clone the CoVLA code:
#    git clone https://github.com/junyong300/covla
sys.path.insert(0, "covla")

# 2. Import the CoVLA factory (recommended)
from src.models.model import CovlaModel

# 3. Load base model + LoRA adapter + vision connector
model = CovlaModel.from_pretrained(
    llm_path="LGAI-EXAONE/EXAONE-4.0-1.2B",
    vision_tower="google/siglip2-so400m-patch16-naflex",
    lora_adapter="junyong300/covla-exaone4-1b",      # loads adapter_model.safetensors
    vision_connector="junyong300/covla-exaone4-1b",  # loads vision_connector.safetensors
)
```
See the CoVLA repository for full usage examples.
## Performance
| Benchmark | CoVLA (321 tokens) | MLP baseline (576 tokens) |
|---|---|---|
| MMBench Std | 82.5 | 81.8 |
| MMBench Circ | 75.4 | 74.6 |
| SEED | 72.4 | 72.5 |
| GQA | 49.3 | 50.8 |
| TextVQA | 64.0 | 64.5 |
*Results shown are for the Qwen3-4B variant; see the paper for other backbones.*
## Citation
```bibtex
@inproceedings{covla2026,
  title={CoVLA: Contextual Vision-Language Alignment via CLS-Guided Token Selection},
  author={Park, Junyong and others},
  booktitle={Conference on Language Modeling (COLM)},
  year={2026}
}
```