---
license: apache-2.0
library_name: vla-foundry
tags:
- foundry
- vla_foundry
- vla
- robotics
- diffusion-policy
- flow-matching
pipeline_tag: robotics
---
# Foundry-VLA-1.7B-sim
A 1.7B parameter vision-language-action model for bimanual robotic manipulation, part of the [VLA Foundry](https://github.com/TRI-ML/vla_foundry) collection. Trained on simulated manipulation data only.
## Model Description
- **Architecture:** Foundry-VLM-1.3B vision-language backbone + flow-matching diffusion action head (24 layers, 1024 hidden dim, 16 heads) that conditions on the backbone's last 4 layers
- **Parameters:** 1.7B (non-embedding)
- **Action space:** 20-dim relative bimanual actions: per arm, xyz (3) + 6D rotation (6) + gripper (1), for two arms; see the rotation-decoding sketch after this list
- **Cameras:** 4 views (2 scene + 2 wrist)
- **Training data:** 102M samples from simulated bimanual manipulation tasks only
- **VLM backbone:** [Foundry-VLM-1.3B-200M](https://huggingface.co/TRI-ML/Foundry-VLM-1.3B-200M)
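
The 6D rotation outputs are most likely the continuous rotation representation of Zhou et al. (2019), though the model card does not say so explicitly. Assuming that convention, here is a minimal sketch of decoding the 6 numbers into a rotation matrix via Gram-Schmidt (illustrative only, not code from the vla_foundry repo):

```python
import torch
import torch.nn.functional as F

def rot6d_to_matrix(r6: torch.Tensor) -> torch.Tensor:
    """Decode a (..., 6) rotation representation into (..., 3, 3) rotation
    matrices by Gram-Schmidt orthonormalization of the two 3D vectors."""
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = F.normalize(a1, dim=-1)                   # first column
    # Remove the b1 component from a2, then normalize: second column
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)               # third column completes a right-handed frame
    return torch.stack([b1, b2, b3], dim=-1)       # columns b1, b2, b3
```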
## Evaluation Results
Success rates on 16 seen tasks and 3 unseen tasks (200 rollouts per task):
| Simulator | Seen (16 tasks) | Unseen (3 tasks) |
|---|---|---|
| CS | 60.3% | 8.2% |
| OSS | 41.0% | 11.7% |
## Usage
```bash
git clone https://github.com/TRI-ML/vla_foundry.git
cd vla_foundry
pip install -e .
```
```python
from vla_foundry.models.base_model import BaseModel

# Downloads the checkpoint from the Hugging Face Hub and builds the model
model = BaseModel.from_pretrained("TRI-ML/Foundry-VLA-1.7B-sim")
```
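
At inference time, a flow-matching head produces an action by integrating a learned velocity field from Gaussian noise (t = 0) to an action (t = 1). Below is a minimal sketch of that sampling loop, assuming a hypothetical `velocity_fn` standing in for the observation-conditioned action head; the actual vla_foundry inference API may differ:

```python
import torch

def sample_action(velocity_fn, action_dim=20, steps=10, device="cpu"):
    """Euler integration of the flow-matching ODE from noise to an action.

    `velocity_fn(x, t)` is a hypothetical stand-in for the action head's
    velocity prediction, conditioned on the VLM features.
    """
    x = torch.randn(1, action_dim, device=device)  # t = 0: Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        x = x + dt * velocity_fn(x, t)  # one Euler step along the learned flow
    return x  # t = 1: 20-dim relative action
```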
## Links
- **Project page:** [tri-ml.github.io/vla_foundry](https://tri-ml.github.io/vla_foundry/)
- **Paper:** [VLA Foundry (arXiv 2604.19728)](https://arxiv.org/abs/2604.19728)
- **Code:** [github.com/TRI-ML/vla_foundry](https://github.com/TRI-ML/vla_foundry)
- **Collection:** [VLA Foundry collection](https://huggingface.co/collections/TRI-ML/vla-foundry)