---
language: en
license: mit
library_name: nanovlm
tags:
  - vision-language
  - multimodal
  - smollm2
  - siglip
---

# nanoVLM - devilops/nanoVLM-222M

This is a nano Vision-Language Model (nanoVLM) trained as part of the COM-304 course.

## Model Description

The model consists of three main components:

- **Vision Backbone:** pretrained [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224)
- **Language Model:** pretrained [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)
- **Modality Projector:** a learnable linear layer preceded by a Pixel Shuffle reduction of the image tokens
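The Pixel Shuffle step reduces the number of image tokens before they enter the projector by folding an `r × r` spatial block of tokens into a single token with `r²`-times the channel dimension. The helper below is an illustrative sketch of this idea, not the repository's exact implementation; the grid size (14 × 14, i.e. 196 patch tokens from SigLIP at patch size 16 and resolution 224) follows from the vision backbone above, while the function name and `r = 2` factor are assumptions for the example.

```python
import torch

def pixel_shuffle_reduce(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Fold r x r blocks of image tokens into single, wider tokens.

    x: (batch, seq_len, dim) tokens on a square grid; seq_len must be a
    perfect square and its side divisible by r.
    Returns: (batch, seq_len / r^2, dim * r^2).
    """
    b, s, d = x.shape
    h = w = int(s ** 0.5)
    x = x.view(b, h // r, r, w // r, r, d)
    # Group the r x r block dims next to the channel dim, then flatten them in.
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (h // r) * (w // r), r * r * d)

# 196 tokens = 14 x 14 grid (SigLIP base, patch 16, 224 px input)
tokens = torch.randn(1, 196, 768)
reduced = pixel_shuffle_reduce(tokens, r=2)
print(reduced.shape)  # torch.Size([1, 49, 3072])
```

The linear projector then only has to map these 49 wider tokens into the language model's embedding space, which shrinks the image-token sequence the LM must attend over by a factor of `r²`.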

## Usage

You can load this model with the `VisionLanguageModel` class from the [nanoVLM](https://github.com/huggingface/nanoVLM) repository.

```python
# Requires the nanoVLM repository to be on your Python path
# (clone https://github.com/huggingface/nanoVLM and run from its root).
from models.vision_language_model import VisionLanguageModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Downloads the weights from the Hugging Face Hub on first use
model = VisionLanguageModel.from_pretrained("devilops/nanoVLM-222M").to(device)
```