---
language: en
license: mit
library_name: nanovlm
tags:
  - vision-language
  - multimodal
  - smollm2
  - siglip
---

# nanoVLM - devilops/nanoVLM-222M

This is a nano Vision-Language Model (nanoVLM) trained as part of the COM-304 course.

## Model Description

The model consists of three main components:

- **Vision Backbone:** pretrained [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224)
- **Language Model:** pretrained [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)
- **Modality Projector:** a learnable linear layer preceded by a Pixel Shuffle reduction of the image tokens
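The Pixel Shuffle step reduces the number of image tokens before they enter the projector by folding an `r × r` spatial block of tokens into a single token with `r²`-times the channel dimension. The helper below is an illustrative sketch of this idea, not the repository's exact implementation; the grid size (14 × 14, i.e. 196 patch tokens from SigLIP at patch size 16 and resolution 224) follows from the vision backbone above, while the function name and `r = 2` factor are assumptions for the example.

```python
import torch

def pixel_shuffle_reduce(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Fold r x r blocks of image tokens into single, wider tokens.

    x: (batch, seq_len, dim) tokens on a square grid; seq_len must be a
    perfect square and its side divisible by r.
    Returns: (batch, seq_len / r^2, dim * r^2).
    """
    b, s, d = x.shape
    h = w = int(s ** 0.5)
    x = x.view(b, h // r, r, w // r, r, d)
    # Group the r x r block dims next to the channel dim, then flatten them in.
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (h // r) * (w // r), r * r * d)

# 196 tokens = 14 x 14 grid (SigLIP base, patch 16, 224 px input)
tokens = torch.randn(1, 196, 768)
reduced = pixel_shuffle_reduce(tokens, r=2)
print(reduced.shape)  # torch.Size([1, 49, 3072])
```

The linear projector then only has to map these 49 wider tokens into the language model's embedding space, which shrinks the image-token sequence the LM must attend over by a factor of `r²`.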

## Usage

You can load this model with the `VisionLanguageModel` class from the [nanoVLM](https://github.com/huggingface/nanoVLM) repository.

```python
# Requires the nanoVLM repository to be on your Python path
# (clone https://github.com/huggingface/nanoVLM and run from its root).
from models.vision_language_model import VisionLanguageModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Downloads the weights from the Hugging Face Hub on first use
model = VisionLanguageModel.from_pretrained("devilops/nanoVLM-222M").to(device)
```