# 🚀 CRIT-VL-8B
CRIT-VL-8B is a Vision-Language Model (VLM) fine-tuned for complex Cross-Modal Multi-Hop Reasoning. This model was trained to effectively connect text context with visual cues across multiple images, addressing the hallucination and grounding issues prevalent in existing VLMs.
This model is the official open-source release accompanying our paper accepted at CVPR 2026: "CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning".
## 📊 Model Details
- Base Model: OpenGVLab/InternVL3_5-8B-Pretrained (8B)
- Architecture: Vision-Language Model
- Training Data Recipe: The model was trained with supervised fine-tuning (SFT) on an optimized mix of the following datasets:
  - LLaVA-Onevision-Instruct
  - CRIT (+ Korean extension)
  - R1-Onevision (+ Korean extension)
- Training Infrastructure: Trained on an AWS ParallelCluster / Slurm environment.
## 💻 Quick Start
To use CRIT-VL-8B, you will need to allow custom code execution (`trust_remote_code=True`), since the model uses the InternVL architecture.
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "KU-MIIL/CRIT-VL-8B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# An 8B model in bfloat16 typically requires around 16-24 GB of VRAM,
# so it fits on a single GPU such as an RTX 3090/4090 or an A10G.
# (Note: V100 GPUs lack native bfloat16 support; use float16 there.)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()

# Example: generate a response (adapt the prompt and image handling
# to the InternVL documentation):
# response = model.chat(tokenizer, pixel_values, question, generation_config)
```
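The `model.chat` call above expects a `pixel_values` tensor. As a rough illustration only, here is a minimal single-tile preprocessing sketch (resize to 448×448 and normalize with ImageNet statistics). This is an assumption about the expected input format: InternVL's own `load_image` helper additionally performs dynamic tiling, so consult the InternVL documentation for the official pipeline.

```python
import torch
from PIL import Image

# ImageNet normalization constants (assumed to match the vision encoder).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_pixel_values(image: Image.Image, size: int = 448) -> torch.Tensor:
    """Resize, scale to [0, 1], normalize, and add a batch dimension.

    Single-tile sketch; the official InternVL preprocessing also tiles
    large images dynamically.
    """
    img = image.convert("RGB").resize((size, size))
    # Raw bytes -> (H, W, C) uint8 tensor -> (C, H, W) float tensor in [0, 1].
    arr = torch.frombuffer(bytearray(img.tobytes()), dtype=torch.uint8)
    arr = arr.view(size, size, 3).permute(2, 0, 1).float() / 255.0
    mean = torch.tensor(IMAGENET_MEAN).view(3, 1, 1)
    std = torch.tensor(IMAGENET_STD).view(3, 1, 1)
    return ((arr - mean) / std).unsqueeze(0)  # shape: (1, 3, size, size)
```

Before passing the result to `model.chat`, move it onto the GPU in the model's dtype, e.g. `pixel_values = load_pixel_values(img).to(torch.bfloat16).cuda()`.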
## 📖 Citation
If you find this model or the CRIT dataset useful in your research, please consider citing our CVPR 2026 paper:
```bibtex
@inproceedings{crit2026,
  title={CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning},
  author={Sung, Junyoung and Lyu, Seungwoo and Kim, Minjun and An, Sumin and Nagrani, Arsha and Seo, Paul Hongsuck},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```
## 🏢 Acknowledgements
This project was conducted by the Multimodal Interactive Intelligence Laboratory (MIIL) at Korea University.