	---
	license: apache-2.0
	library_name: vla-foundry
	tags:
	- foundry
	- vla_foundry
	- vlm
	- image-text-to-text
	---

	# Foundry-VLM-1.3B-200M

	A 1.3B-parameter vision-language model trained on 200M image-caption samples, part of the [VLA Foundry](https://github.com/TRI-ML/vla_foundry) collection.

	## Model Description

	- Architecture: ViT encoder (12 layers, 768 hidden dim, patch size 14, pixel-shuffle 2x) + Transformer decoder (24 layers, 2048 hidden dim, 16 heads)
	- Parameters: 1.3B (non-embedding)
	- Processor: SmolVLM2
	- Training data: 200M image-caption pairs from DataComp-DR-1B
	- LR schedule: Warmup followed by a constant learning rate for the first 165M samples, then cosine decay over the final 35M samples
	- LLM backbone: Initialized from [Foundry-LLM-1.2B-800B](https://huggingface.co/TRI-ML/Foundry-LLM-1.2B-800B)
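
As a rough sanity check on the figures above, the following sketch estimates the non-embedding parameter count from the listed layer counts and hidden sizes, and the number of visual tokens produced per image. It assumes standard Transformer blocks with a 4x MLP expansion and ignores biases, normalization layers, and the vision-to-text projector. The function names and the 224-pixel input resolution are hypothetical and are not taken from the vla_foundry code.

```python
def transformer_block_params(num_layers, hidden_dim, mlp_ratio=4):
    """Approximate weight count of a Transformer stack (assumed layout)."""
    attention = 4 * hidden_dim * hidden_dim        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * hidden_dim * hidden_dim  # up and down projections
    return num_layers * (attention + mlp)


def visual_tokens(image_size, patch_size=14, shuffle=2):
    """Tokens per image, assuming pixel-shuffle merges shuffle x shuffle patches."""
    patches_per_side = image_size // patch_size
    return (patches_per_side // shuffle) ** 2


# Layer counts and hidden sizes come from the list above.
vit_params = transformer_block_params(num_layers=12, hidden_dim=768)
decoder_params = transformer_block_params(num_layers=24, hidden_dim=2048)
total_params = vit_params + decoder_params

print(f"ViT encoder: ~{vit_params / 1e6:.0f}M")      # ~85M
print(f"Decoder:     ~{decoder_params / 1e9:.2f}B")  # ~1.21B
print(f"Total:       ~{total_params / 1e9:.2f}B")    # ~1.29B

# Hypothetical 224x224 input: 16x16 patches, merged 2x2 into 8x8 tokens.
print(visual_tokens(224))  # 64
```

Under these assumptions the estimate lands close to the stated 1.3B, with the decoder accounting for most of the total.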

	This checkpoint continues [Foundry-VLM-1.3B-165M](https://huggingface.co/TRI-ML/Foundry-VLM-1.3B-165M) with an additional 35M samples of training under cosine learning-rate decay. It serves as the vision-language backbone for the Foundry-VLA-1.7B action models.
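
The schedule described above (warmup, a constant phase up to 165M samples, then cosine decay over the last 35M) can be sketched as a function of samples seen. Only the 165M and 200M boundaries come from this card. The peak learning rate, the warmup length, the linear warmup shape, and decay to zero are placeholder assumptions, and `lr_at` is a hypothetical helper rather than part of vla_foundry.

```python
import math

CONSTANT_END = 165_000_000   # end of the constant phase (from the card)
TOTAL_SAMPLES = 200_000_000  # end of training (from the card)


def lr_at(samples_seen, peak_lr=1e-4, warmup_samples=1_000_000, min_lr=0.0):
    """Learning rate after `samples_seen` samples (illustrative values only)."""
    if samples_seen < warmup_samples:
        # Linear warmup from 0 to peak_lr (assumed shape).
        return peak_lr * samples_seen / warmup_samples
    if samples_seen <= CONSTANT_END:
        # Constant phase; the 165M checkpoint is taken at its end.
        return peak_lr
    # Cosine decay from peak_lr to min_lr over the final 35M samples.
    progress = min(1.0, (samples_seen - CONSTANT_END) / (TOTAL_SAMPLES - CONSTANT_END))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for samples in (0, 500_000, 100_000_000, 165_000_000, 182_500_000, 200_000_000):
    print(f"{samples:>11,d} samples -> lr = {lr_at(samples):.2e}")
```

Halfway through the decay phase (182.5M samples) the rate is half of the peak, and it reaches the minimum at 200M samples.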

	## Evaluation Results

	COCO-val captioning:

	| BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | CIDEr |
	|---|---|---|---|---|---|
	| 58.64 | 38.62 | 24.49 | 15.57 | 38.17 | 55.14 |

	## Usage

	```bash
	# Install vla_foundry from source in editable mode.
	git clone https://github.com/TRI-ML/vla_foundry.git
	cd vla_foundry
	pip install -e .
	```

	```python
	from vla_foundry.models.base_model import BaseModel

	# Load the pretrained checkpoint by its Hugging Face repository ID.
	model = BaseModel.from_pretrained("TRI-ML/Foundry-VLM-1.3B-200M")
	```

	## Links

	- Project page: [tri-ml.github.io/vla_foundry](https://tri-ml.github.io/vla_foundry/)
	- Paper: [VLA Foundry (arXiv 2604.19728)](https://arxiv.org/abs/2604.19728)
	- Code: [github.com/TRI-ML/vla_foundry](https://github.com/TRI-ML/vla_foundry)
	- Collection: [VLA Foundry collection](https://huggingface.co/collections/TRI-ML/vla-foundry)