---
language: en
license: mit
library_name: nanovlm
tags:
- vision-language
- multimodal
- smollm2
- siglip
---
# nanoVLM - GuizMeuh/nanoVLM-222M

This is a nano Vision-Language Model (nanoVLM) trained as part of the COM-304 course.
## Model Description

The model consists of three main components:

- **Vision Backbone**: the pretrained `google/siglip-base-patch16-224` image encoder
- **Language Model**: the pretrained `HuggingFaceTB/SmolLM2-135M` decoder
- **Modality Projector**: a learnable linear layer preceded by a pixel-shuffle reduction that compresses the number of image tokens (see the sketch below)
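
The projector maps SigLIP patch embeddings into the language model's embedding space while trading sequence length for channel width: a pixel shuffle with factor `s` packs each `s x s` block of patches into one token, cutting the image token count by `s^2`. The following is a minimal illustrative sketch, not the repo's exact code. It assumes SigLIP's 768-dim patch grid, SmolLM2-135M's 576-dim hidden size, and a shuffle factor of 2; the factor value in particular is an assumption.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Illustrative pixel-shuffle projector (hypothetical re-implementation,
    not the exact class from the nanovlm repo)."""

    def __init__(self, vit_dim=768, lm_dim=576, shuffle_factor=2):
        super().__init__()
        self.s = shuffle_factor
        # Pixel shuffle packs s*s neighbouring patches into the channel dim,
        # so the linear layer sees vit_dim * s^2 input features.
        self.proj = nn.Linear(vit_dim * shuffle_factor**2, lm_dim)

    def forward(self, x):
        # x: (batch, num_patches, vit_dim); num_patches must form a square grid
        b, n, d = x.shape
        h = w = int(n**0.5)
        x = x.view(b, h, w, d)
        # Split the grid into (h/s, w/s) blocks of s x s patches each.
        x = x.view(b, h // self.s, self.s, w // self.s, self.s, d)
        # Gather each block's s*s patches into the channel dimension.
        x = x.permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (h // self.s) * (w // self.s), d * self.s**2)
        return self.proj(x)  # (batch, n / s^2, lm_dim)

# SigLIP base at 224px yields a 14x14 grid = 196 patch embeddings of dim 768;
# with shuffle_factor=2 these become 49 tokens projected to SmolLM2's 576 dims.
tokens = ModalityProjector()(torch.randn(1, 196, 768))
print(tokens.shape)  # torch.Size([1, 49, 576])
```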
## Usage

You can load this model with the `VisionLanguageModel` class from the [nanoVLM](https://github.com/huggingface/nanoVLM) repository. Clone that repository first so its `models` package is on your Python path.
```python
# Requires the nanoVLM repository on the Python path
# (it provides the `models` package imported below).
import torch
from models.vision_language_model import VisionLanguageModel

# Load the pretrained checkpoint from the Hub and move it to GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("GuizMeuh/nanoVLM-222M").to(device)
```
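
Once the model is loaded, you can run image-conditioned generation. The sketch below continues from the snippet above; the helper names (`get_tokenizer`, `get_image_processor`), the `model.cfg` fields, and the `generate` signature are assumptions based on the nanoVLM repository's processing utilities, so verify them against your checkout. `example.jpg` is a placeholder path.

```python
# Inference sketch: the helpers and generate() signature are assumptions
# based on the nanoVLM repository; check them against your checkout.
from PIL import Image
from data.processors import get_tokenizer, get_image_processor

tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

# Tokenize the text prompt and preprocess a placeholder image.
prompt = "What is in this image?"
inputs = tokenizer([prompt], return_tensors="pt").to(device)
image = image_processor(Image.open("example.jpg").convert("RGB"))
image = image.unsqueeze(0).to(device)  # add batch dimension

# Generate a short answer conditioned on the image and prompt.
out = model.generate(inputs["input_ids"], image, max_new_tokens=32)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```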