---
language: en
license: mit
library_name: nanovlm
tags:
- vision-language
- multimodal
- smollm2
- siglip
---
# nanoVLM - GuizMeuh/nanoVLM-222M

This is a nano Vision-Language Model (nanoVLM) trained as part of the COM-304 course.
## Model Description

The model consists of three main components:

- **Vision Backbone**: the pretrained `google/siglip-base-patch16-224` image encoder
- **Language Model**: the pretrained `HuggingFaceTB/SmolLM2-135M` decoder
- **Modality Projector**: a learnable linear layer preceded by a pixel-shuffle reduction that compresses the number of image tokens (see the sketch below)
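
The projector maps SigLIP patch embeddings into the language model's embedding space while trading sequence length for channel width: a pixel shuffle with factor `s` packs each `s x s` block of patches into one token, cutting the image token count by `s^2`. The following is a minimal illustrative sketch, not the repo's exact code. It assumes SigLIP's 768-dim patch grid, SmolLM2-135M's 576-dim hidden size, and a shuffle factor of 2; the factor value in particular is an assumption.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Illustrative pixel-shuffle projector (hypothetical re-implementation,
    not the exact class from the nanovlm repo)."""

    def __init__(self, vit_dim=768, lm_dim=576, shuffle_factor=2):
        super().__init__()
        self.s = shuffle_factor
        # Pixel shuffle packs s*s neighbouring patches into the channel dim,
        # so the linear layer sees vit_dim * s^2 input features.
        self.proj = nn.Linear(vit_dim * shuffle_factor**2, lm_dim)

    def forward(self, x):
        # x: (batch, num_patches, vit_dim); num_patches must form a square grid
        b, n, d = x.shape
        h = w = int(n**0.5)
        x = x.view(b, h, w, d)
        # Split the grid into (h/s, w/s) blocks of s x s patches each.
        x = x.view(b, h // self.s, self.s, w // self.s, self.s, d)
        # Gather each block's s*s patches into the channel dimension.
        x = x.permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (h // self.s) * (w // self.s), d * self.s**2)
        return self.proj(x)  # (batch, n / s^2, lm_dim)

# SigLIP base at 224px yields a 14x14 grid = 196 patch embeddings of dim 768;
# with shuffle_factor=2 these become 49 tokens projected to SmolLM2's 576 dims.
tokens = ModalityProjector()(torch.randn(1, 196, 768))
print(tokens.shape)  # torch.Size([1, 49, 576])
```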
## Usage

You can load this model with the `VisionLanguageModel` class from the [nanoVLM](https://github.com/huggingface/nanoVLM) repository. Clone that repository first so its `models` package is on your Python path.
```python
# Requires the nanoVLM repository on the Python path
# (it provides the `models` package imported below).
import torch
from models.vision_language_model import VisionLanguageModel

# Load the pretrained checkpoint from the Hub and move it to GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("GuizMeuh/nanoVLM-222M").to(device)
```
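
Once the model is loaded, you can run image-conditioned generation. The sketch below continues from the snippet above; the helper names (`get_tokenizer`, `get_image_processor`), the `model.cfg` fields, and the `generate` signature are assumptions based on the nanoVLM repository's processing utilities, so verify them against your checkout. `example.jpg` is a placeholder path.

```python
# Inference sketch: the helpers and generate() signature are assumptions
# based on the nanoVLM repository; check them against your checkout.
from PIL import Image
from data.processors import get_tokenizer, get_image_processor

tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

# Tokenize the text prompt and preprocess a placeholder image.
prompt = "What is in this image?"
inputs = tokenizer([prompt], return_tensors="pt").to(device)
image = image_processor(Image.open("example.jpg").convert("RGB"))
image = image.unsqueeze(0).to(device)  # add batch dimension

# Generate a short answer conditioned on the image and prompt.
out = model.generate(inputs["input_ids"], image, max_new_tokens=32)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```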