SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality Paper • 2306.14610 • Published Jun 26, 2023 • 2
PixMo Collection A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. Read more at https://molmo.allenai.org/blog • 9 items • Updated Mar 2 • 88
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation Paper • 2412.15188 • Published Dec 19, 2024 • 2
nanoVLM: The simplest repository to train your VLM in pure PyTorch Article • Published May 21, 2025 • 255
MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis Paper • 2509.06617 • Published Sep 8, 2025 • 1