---
language: en
license: mit
library_name: nanovlm
tags:
- vision-language
- multimodal
- smollm2
- siglip
---

# nanoVLM - GuizMeuh/nanoVLM-222M

This is a nano Vision-Language Model (nanoVLM) trained as part of the COM-304 course.

## Model Description

The model consists of three main components:

- **Vision Backbone**: Pretrained `google/siglip-base-patch16-224`
- **Language Model**: Pretrained `HuggingFaceTB/SmolLM2-135M`
- **Modality Projector**: A learnable linear layer with Pixel Shuffle reduction.

## Usage

You can load this model using the `VisionLanguageModel` class from the `nanovlm` repository.

```python
from models.vision_language_model import VisionLanguageModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("GuizMeuh/nanoVLM-222M").to(device)
```
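
For intuition, the Pixel Shuffle reduction used by the modality projector can be sketched in plain Python. The idea is to merge each `r x r` block of image patch features into a single, longer feature vector, cutting the number of visual tokens by `r * r` while growing the channel dimension by the same factor. This is an illustrative sketch only: the function name and list-based grid layout are ours, not from the nanoVLM codebase, which operates on tensors.

```python
def pixel_shuffle_reduce(grid, r):
    """Merge each r x r block of feature vectors into one longer vector.

    grid: an H x W grid (list of lists) where each cell is a list of C floats.
    Returns an (H // r) x (W // r) grid whose cells hold C * r * r floats.
    """
    h, w = len(grid), len(grid[0])
    assert h % r == 0 and w % r == 0, "grid dims must be divisible by r"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            # Concatenate the r*r neighboring feature vectors into one cell.
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


# A 4x4 grid of 2-dim features becomes a 2x2 grid of 8-dim features:
grid = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
reduced = pixel_shuffle_reduce(grid, r=2)
```

After reduction, 16 visual tokens become 4, so the language model attends over far fewer image tokens; the learnable linear layer then maps these longer vectors into the LM's embedding space.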