# 🚀 CRIT-VL-8B
CRIT-VL-8B is a Vision-Language Model (VLM) fine-tuned for complex Cross-Modal Multi-Hop Reasoning. This model was trained to effectively connect text context with visual cues across multiple images, addressing the hallucination and grounding issues prevalent in existing VLMs.
This model is the official open-source release accompanying our paper accepted at CVPR 2026: "CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning".
## 📊 Model Details
- Base Model: OpenGVLab/InternVL3_5-8B-Pretrained (8B)
- Architecture: Vision-Language Model
- Training Data Recipe: The model was trained with supervised fine-tuning (SFT) on an optimized mix of the following datasets:
  - LLaVA-Onevision-Instruct
  - CRIT (+ Korean extension)
  - R1-Onevision (+ Korean extension)
- Training Infrastructure: Trained on an AWS ParallelCluster / Slurm environment.
## 💻 Quick Start
To use CRIT-VL-8B, you will need to allow custom code execution (`trust_remote_code=True`), since the model uses the InternVL architecture.
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "KU-MIIL/CRIT-VL-8B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# An 8B model in bfloat16 typically requires around 16-24 GB of VRAM,
# so it fits on a single GPU such as an RTX 3090/4090 or an A10G.
# (Note: V100 GPUs lack native bfloat16 support; use float16 there.)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()

# Example: generate a response (adapt the prompt and image handling
# to the InternVL documentation):
# response = model.chat(tokenizer, pixel_values, question, generation_config)
```
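The `model.chat` call above expects a `pixel_values` tensor. As a rough illustration only, here is a minimal single-tile preprocessing sketch (resize to 448×448 and normalize with ImageNet statistics). This is an assumption about the expected input format: InternVL's own `load_image` helper additionally performs dynamic tiling, so consult the InternVL documentation for the official pipeline.

```python
import torch
from PIL import Image

# ImageNet normalization constants (assumed to match the vision encoder).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_pixel_values(image: Image.Image, size: int = 448) -> torch.Tensor:
    """Resize, scale to [0, 1], normalize, and add a batch dimension.

    Single-tile sketch; the official InternVL preprocessing also tiles
    large images dynamically.
    """
    img = image.convert("RGB").resize((size, size))
    # Raw bytes -> (H, W, C) uint8 tensor -> (C, H, W) float tensor in [0, 1].
    arr = torch.frombuffer(bytearray(img.tobytes()), dtype=torch.uint8)
    arr = arr.view(size, size, 3).permute(2, 0, 1).float() / 255.0
    mean = torch.tensor(IMAGENET_MEAN).view(3, 1, 1)
    std = torch.tensor(IMAGENET_STD).view(3, 1, 1)
    return ((arr - mean) / std).unsqueeze(0)  # shape: (1, 3, size, size)
```

Before passing the result to `model.chat`, move it onto the GPU in the model's dtype, e.g. `pixel_values = load_pixel_values(img).to(torch.bfloat16).cuda()`.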
## 📖 Citation
If you find this model or the CRIT dataset useful in your research, please consider citing our CVPR 2026 paper:
```bibtex
@inproceedings{crit2026,
  title={CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning},
  author={Sung, Junyoung and Lyu, Seungwoo and Kim, Minjun and An, Sumin and Nagrani, Arsha and Seo, Paul Hongsuck},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```
## 🏢 Acknowledgements
This project was conducted by the Multimodal Interactive Intelligence Laboratory (MIIL) at Korea University.