	---
	license: mit
	library_name: starVLA
	pipeline_tag: robotics
	tags:
	- vla
	- vision-language-action
	- robotics
	- qwen3-vl
	- flow-matching
	- pi-zero
	- manipulation
	- bridge
	- rt-1
	- oxe
	datasets:
	- IPEC-COMMUNITY/bridge_orig_lerobot
	- IPEC-COMMUNITY/fractal20220817_data_lerobot
	language:
	- en
	base_model:
	- Qwen/Qwen3-VL-4B-Instruct
	---

	# Qwen3VL-PI_v3-Bridge-RT-1

	A Vision-Language-Action (VLA) model from the [StarVLA](https://github.com/starVLA/starVLA)
	project, combining a Qwen3-VL-4B-Instruct backbone with a **layer-wise
	cross-attention flow-matching action head** (`QwenPI_v3`). The model is
	co-trained on the [Bridge V2](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot)
	and [RT-1 / Fractal](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot)
	slices of the Open X-Embodiment (OXE) collection, and is evaluated on the
	SimplerEnv WidowX benchmark.

	`QwenPI_v3` is StarVLA's open-weight realisation of the π₀.₅ recipe:

	1. Layer-wise cross-DiT flow-matching action head — every VLM layer's
	hidden state participates in cross-attention with the action DiT, instead
	of consuming only the last-layer feature.
	2. Compressed Action DiT — per-layer `LayerNorm + Linear` projectors
	compress the 2560-d Qwen3-VL hidden states down to a 1024-d DiT latent,
	shrinking the action-head footprint by ~6× while preserving the
	layer-wise interaction.
	3. Discretised-state language injection — proprioceptive state is
	quantised into 256 bins and appended to the instruction as plain tokens
	(`[STATE] <bins> [ACTION]`), so the VLM can attend to robot state with
	no additional encoder.
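
The discretised-state injection in item 3 can be sketched as follows. This is a minimal illustration, not StarVLA's implementation: the function names, the assumption that the state is already normalised to [-1, 1], the uniform bin edges, and the space-separated rendering of bin indices are all hypothetical. Only the 256 bins and the `[STATE] <bins> [ACTION]` layout come from the description above.

```python
from typing import List, Sequence

NUM_BINS = 256  # number of bins stated in the model card


def discretise_state(state: Sequence[float], low: float = -1.0, high: float = 1.0) -> List[int]:
    """Map each state dimension to an integer bin index in [0, NUM_BINS - 1].

    Assumption (hypothetical): the state is already normalised to [low, high]
    and the bins are uniform over that range.
    """
    bins = []
    for value in state:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The upper edge (fraction == 1.0) is folded into the last bin.
        bins.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return bins


def build_prompt(instruction: str, state: Sequence[float]) -> str:
    """Append the binned state to the instruction as plain text tokens."""
    bin_text = " ".join(str(b) for b in discretise_state(state))
    return f"{instruction} [STATE] {bin_text} [ACTION]"


example_state = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0]  # made-up 7-d state
prompt = build_prompt("put the carrot on the plate", example_state)
print(prompt)
```

Because the state travels through the tokenizer as ordinary text, the VLM needs no extra input branch, at the cost of a resolution limited to 256 levels per dimension.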

	---

	## Model Summary

	| | |
	| --- | --- |
	| Architecture | `QwenPI_v3` (Qwen3-VL + layer-wise cross-DiT flow-matching head) |
	| VLM backbone | [`Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
	| Action head | Layer-wise Flow-Matching DiT (36 layers, 1024 hidden, 16 heads) |
	| Action chunk | 16 steps |
	| Action / state dim | 7 / 7 (delta end-effector) |
	| Image resolution | 224 × 224, single 3rd-person view |
	| Inference timesteps | 4 (flow matching) |
	| Total parameters | ≈ 5.07 B |
	| License | MIT |
	| Codebase | [starVLA/starVLA](https://github.com/starVLA/starVLA) |
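
In flow matching, the inference timesteps are typically the steps of a numerical integration of the learned velocity field. The sketch below assumes plain forward Euler integration from noise at t = 0 to the action at t = 1 in 4 steps. The time convention, the integrator, and every name here are illustrative assumptions, and `toy_velocity` is a hand-written stand-in for the action DiT, not the trained model.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int = 4) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 7-d target action standing in for what the model would predict.
target_action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Velocity of the straight-line path towards target_action (toy stand-in)."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target_action, x)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target_action]
action = euler_sample(toy_velocity, noise, num_steps=4)
print(action)
```

With a straight-line path the velocity is constant along the trajectory, so a handful of Euler steps already lands on the target; the real head operates on a 16 × 7 action chunk and conditions on the VLM features.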

	### Parameter breakdown

	| Module | Parameters | Share |
	| --- | ---: | ---: |
	| `qwen_vl_interface` (Qwen3-VL-4B) | 4,437,815,808 | 87.5 % |
	| `action_model` (layer-wise FM DiT, hidden 1024) | 538,678,305 | 10.6 % |
	| `project_layers` (per-layer 2560 → 1024 projectors) | 94,593,024 | 1.9 % |
	| Total | 5,071,087,137 | 100 % |
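
The `project_layers` figure can be cross-checked with a few lines of arithmetic. The sketch assumes one projector per layer for 36 layers, a `LayerNorm` with learnable scale and shift, and a `Linear` layer with bias. These details are assumptions rather than something the table states, although they reproduce the reported count exactly.

```python
HIDDEN_VLM = 2560   # Qwen3-VL hidden size (from the card)
HIDDEN_DIT = 1024   # DiT latent size (from the card)
NUM_LAYERS = 36     # assumed: one projector per layer

layer_norm_params = 2 * HIDDEN_VLM                    # scale + shift
linear_params = HIDDEN_VLM * HIDDEN_DIT + HIDDEN_DIT  # weight + bias
project_layers = NUM_LAYERS * (layer_norm_params + linear_params)

# Module sizes as reported in the table above.
modules = {
    "qwen_vl_interface": 4_437_815_808,
    "action_model": 538_678_305,
    "project_layers": project_layers,
}
total = sum(modules.values())
shares = {name: round(100 * count / total, 1) for name, count in modules.items()}

print(f"project_layers = {project_layers:,}")
print(f"total          = {total:,}")
print(shares)
```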

	---

	## Training Data

	Co-training mixture `bridge_rt_1` (1 : 1 sampling):

	| Dataset | Embodiment | Source |
	| --- | --- | --- |
	| `bridge_orig_1.0.0_lerobot` | WidowX | [IPEC-COMMUNITY/bridge_orig_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot) |
	| `fractal20220817_data_0.1.0_lerobot` (RT-1) | Google Robot | [IPEC-COMMUNITY/fractal20220817_data_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot) |

	- Action representation: delta end-effector (7-d, gripper included)
	- Image observation: single primary RGB view, resized to 224 × 224
	- Per-dataset normalisation statistics are stored in
	[`dataset_statistics.json`](dataset_statistics.json).

	---

	## Training Recipe

	| | |
	| --- | --- |
	| Total steps | 100,000 (released checkpoints up to 60k) |
	| Warm-up steps | 5,000 |
	| Per-device batch size | 24 |
	| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
	| Precision | bf16, mixed-precision + gradient checkpointing |
	| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
	| LR (base / VLM) | 1e-5 |
	| LR (action head) | 1e-4 |
	| LR scheduler | `cosine_with_min_lr` (min lr 5e-7) |
	| Gradient clipping | 1.0 |
	| Flow-matching noise | Beta distribution (α = 1.5, β = 1.0), s = 0.999 |
	| Repeated diffusion steps | 8 |
	| Frozen modules | none (full fine-tuning) |
	| Attention impl. | FlashAttention-2 |

	The exact training config is preserved in
	[`config.yaml`](config.yaml) / [`config.full.yaml`](config.full.yaml), and the
	launch script in [`run_oxe_train.sh`](run_oxe_train.sh).
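
To make the schedule in the table concrete, the sketch below evaluates a warm-up plus cosine schedule with a learning-rate floor. It assumes a linear warm-up from zero and a single half-cosine decay down to the minimum, which is how schedulers of this name commonly behave; the exact implementation used for this run may differ, and `lr_at` is a hypothetical helper, not a StarVLA function. Applying the same 5e-7 floor to both parameter groups is also an assumption.

```python
import math


def lr_at(step: int, peak_lr: float, min_lr: float = 5e-7,
          warmup_steps: int = 5_000, total_steps: int = 100_000) -> float:
    """Learning rate at a given step: linear warm-up, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Peak rates from the table: 1e-5 for the VLM, 1e-4 for the action head.
for step in (0, 5_000, 40_000, 50_000, 60_000, 100_000):
    print(step, f"vlm={lr_at(step, 1e-5):.3e}", f"head={lr_at(step, 1e-4):.3e}")
```

Under these assumptions the released 40k to 60k checkpoints fall in the middle of the decay, where the learning rate is still a sizeable fraction of its peak.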

	---

	## Evaluation — SimplerEnv WidowX

	Evaluation follows the standard SimplerEnv WidowX protocol on four
	pick-and-place tasks (24 episodes per task per run). Numbers are success
	rates (↑).

	| Step | PutCarrotOnPlate | PutEggplantInBasket | PutSpoonOnTableCloth | StackGreenCubeOnYellowCube | Average |
	| ---: | ---: | ---: | ---: | ---: | ---: |
	| 40k | 0.688 | 0.917 | 0.750 | 0.333 | 0.672 |
	| 50k | 0.625 | 1.000 | 0.792 | 0.375 | 0.698 |
	| 60k | 0.667 | 1.000 | 0.750 | 0.167 | 0.646 |

	Best average: 69.8 % at the 50k checkpoint
	([`steps_50000_pytorch_model.pt`](checkpoints/steps_50000_pytorch_model.pt)),
	which we ship as the recommended checkpoint.
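
The Average column is the unweighted mean of the four per-task success rates, which the short check below recomputes from the table values and uses to pick the best checkpoint. The variable names are illustrative and the snippet does not read the files under `success_summary/`.

```python
# Per-task success rates from the table, in column order:
# PutCarrotOnPlate, PutEggplantInBasket, PutSpoonOnTableCloth, StackGreenCubeOnYellowCube
results = {
    40_000: [0.688, 0.917, 0.750, 0.333],
    50_000: [0.625, 1.000, 0.792, 0.375],
    60_000: [0.667, 1.000, 0.750, 0.167],
}

# Unweighted mean over the four tasks for each checkpoint.
averages = {step: sum(rates) / len(rates) for step, rates in results.items()}
best_step = max(averages, key=averages.get)

for step, avg in averages.items():
    print(f"{step:>6}: {avg:.3f}")
print("best checkpoint:", best_step)
```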

	For a comparison with other StarVLA frameworks on the same `bridge_rt_1`
	mixture and protocol, see the [StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).

	---

	## Repository Layout

	```
	.
	├── README.md # this model card
	├── config.yaml # minimal training config
	├── config.full.yaml # fully resolved training config
	├── run_oxe_train.sh # launch script used for this run
	├── dataset_statistics.json # per-dataset action/state normalisation stats
	├── summary.jsonl # training step summary
	├── success_summary/ # SimplerEnv evaluation logs and plots
	│ ├── success_summary.csv
	│ ├── raw_success.txt
	│ └── success_plot.png
	└── checkpoints/
	    ├── steps_50000_pytorch_model.pt # ← recommended checkpoint
	    └── ... # per-step evaluation logs
	```

	---

	## How to Use

	This checkpoint is consumed directly by the StarVLA training / evaluation
	stack. Clone StarVLA and load the checkpoint with the framework name
	`QwenPI_v3`:

	```bash
	git clone https://github.com/starVLA/starVLA.git
	cd starVLA
	# Follow installation instructions in the StarVLA README.
	```

	```python
	from huggingface_hub import snapshot_download
	from starVLA.model.framework.tools import load_framework_from_checkpoint

	ckpt_dir = snapshot_download("StarVLA/Qwen3VL-PI_v3-Bridge-RT-1")

	policy = load_framework_from_checkpoint(
	    framework_name="QwenPI_v3",
	    config_path=f"{ckpt_dir}/config.full.yaml",
	    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
	)
	# policy.predict_action(images, instruction, state) -> action chunk (16 × 7)
	```

	For end-to-end SimplerEnv evaluation see
	[`examples/SimplerEnv`](https://github.com/starVLA/starVLA/tree/main/examples/SimplerEnv).

	---

	## Intended Use & Limitations

	Intended use. Research on vision-language-action models, manipulation
	policy learning, and as a baseline for π-style flow-matching action heads
	on top of open-weight VLMs.

	Out-of-scope / limitations.

	- Trained only on Bridge (WidowX) + RT-1 (Google Robot) with a 7-d delta-EE
	action space — generalisation to other embodiments / action spaces is not
	guaranteed.
	- Single 224 × 224 third-person view; no wrist camera, no depth.
	- Evaluated only in SimplerEnv WidowX simulation; the released checkpoint
	has not been validated on real robots.
	- Inherits any biases / failure modes of the underlying Qwen3-VL-4B model.
	- Not safety-tuned. Do not deploy on physical robots without an external
	safety layer.

	---

	## Citation

	If you use this checkpoint, please cite StarVLA:

	```bibtex
	@article{starvla2026,
	title = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
	author = {StarVLA Community},
	journal = {arXiv preprint arXiv:2604.05014},
	year = {2026},
	url = {https://arxiv.org/abs/2604.05014}
	}
	```

	And the underlying VLM backbone:

	```bibtex
	@misc{qwen3vl,
	title = {Qwen3-VL},
	author = {Qwen Team},
	year = {2025},
	url = {https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct}
	}
	```

	## Acknowledgements

	- [Qwen Team](https://huggingface.co/Qwen) for the Qwen3-VL backbone.
	- [Physical Intelligence](https://www.physicalintelligence.company/) for the
	π₀ / π₀.₅ flow-matching action-head recipe that inspired `QwenPI_v3`.
	- [Open X-Embodiment](https://robotics-transformer-x.github.io/) and
	[IPEC-COMMUNITY](https://huggingface.co/IPEC-COMMUNITY) for the LeRobot
	conversions of Bridge V2 and RT-1.
	- [SimplerEnv](https://github.com/simpler-env/SimplerEnv) for the
	evaluation protocol.