---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
- vla
- vision-language-action
- robotics
- qwen3-vl
- flow-matching
- pi-zero
- manipulation
- bridge
- rt-1
- oxe
datasets:
- IPEC-COMMUNITY/bridge_orig_lerobot
- IPEC-COMMUNITY/fractal20220817_data_lerobot
language:
- en
base_model:
- Qwen/Qwen3-VL-4B-Instruct
---
# Qwen3VL-PI_v3-Bridge-RT-1
A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA)
project, combining a **Qwen3-VL-4B-Instruct** backbone with a **layer-wise
cross-attention flow-matching action head** (`QwenPI_v3`). The model is
co-trained on the [Bridge V2](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot)
and [RT-1 / Fractal](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot)
slices of the Open X-Embodiment (OXE) collection, and is evaluated on the
**SimplerEnv WidowX** benchmark.
`QwenPI_v3` is StarVLA's open-weight realisation of the π₀.₅ recipe:
1. **Layer-wise cross-DiT flow-matching action head**: every VLM layer's
   hidden state participates in cross-attention with the action DiT, instead
   of consuming only the last-layer feature.
2. **Compressed Action DiT**: per-layer `LayerNorm + Linear` projectors
   compress the 2560-d Qwen3-VL hidden states down to a 1024-d DiT latent,
   shrinking the action-head footprint by ~6× while preserving the
   layer-wise interaction.
3. **Discretised-state language injection**: proprioceptive state is
   quantised into 256 bins and appended to the instruction as plain tokens
   (`[STATE] <bins> [ACTION]`), so the VLM can attend to robot state with
   no additional encoder.
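The discretised-state injection in item 3 can be illustrated with a minimal sketch. It assumes the state has already been normalised to [-1, 1] and that bin indices are written as space-separated integers. The helper names `discretise_state` and `build_prompt` are hypothetical and are not taken from the StarVLA code, so the real tokenisation may differ.

```python
def discretise_state(state, n_bins=256, low=-1.0, high=1.0):
    """Map each state value to an integer bin index in [0, n_bins - 1].

    Assumes values are normalised to [low, high]; out-of-range values are clipped.
    """
    bins = []
    for value in state:
        clipped = min(max(value, low), high)
        # Scale to [0, 1], then to a bin index; the top edge falls into the last bin.
        index = int((clipped - low) / (high - low) * n_bins)
        bins.append(min(index, n_bins - 1))
    return bins


def build_prompt(instruction, state):
    """Append the quantised state to the instruction as plain text (hypothetical layout)."""
    bin_text = " ".join(str(b) for b in discretise_state(state))
    return f"{instruction} [STATE] {bin_text} [ACTION]"


# A hypothetical, already-normalised 7-d state.
prompt = build_prompt("put the carrot on the plate", [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0])
print(prompt)
```

Because the state becomes ordinary text, no extra state encoder or projection layer is needed; the cost is a few additional prompt tokens per step.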
---
## Model Summary
| | |
| --- | --- |
| **Architecture** | `QwenPI_v3` (Qwen3-VL + layer-wise cross-DiT flow-matching head) |
| **VLM backbone** | [`Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| **Action head** | Layer-wise Flow-Matching DiT (36 layers, 1024 hidden, 16 heads) |
| **Action chunk** | 16 steps |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **Total parameters** | **≈ 5.07 B** |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |
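
To make the "4 inference timesteps" entry concrete, here is a minimal sketch of fixed-step Euler integration of a flow-matching velocity field. It assumes the common convention that t = 0 is pure noise and t = 1 is the action; the card does not state which convention `QwenPI_v3` uses. The velocity function is a toy stand-in, not the learned action head, and it acts on a single hypothetical 7-d action rather than a full 16 × 7 chunk.

```python
import random


def euler_sample(velocity_fn, noise, num_steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the learned head: the exact velocity of the straight-line
# path x_t = (1 - t) * noise + t * target, written as a function of (x_t, t).
target = [0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 1.0]  # one hypothetical 7-d action


def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]
action = euler_sample(toy_velocity, noise, num_steps=4)
print([round(a, 6) for a in action])
```

With an exact straight-line velocity the integrator recovers the target in any number of steps; a learned velocity field is only approximate, so the step count trades latency against accuracy.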
### Parameter breakdown
| Module | Parameters | Share |
| --- | ---: | ---: |
| `qwen_vl_interface` (Qwen3-VL-4B) | 4,437,815,808 | 87.5 % |
| `action_model` (layer-wise FM DiT, hidden 1024) | 538,678,305 | 10.6 % |
| `project_layers` (per-layer 2560 → 1024 projectors) | 94,593,024 | 1.9 % |
| **Total** | **5,071,087,137** | **100 %** |
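
The projector count and the shares in the table can be checked with a few lines of arithmetic. The sketch assumes 36 projectors, each made of a `LayerNorm` over the 2560-d input (weight and bias) followed by a biased `Linear` from 2560 to 1024. That layout is an assumption rather than something read from the code, though it does reproduce the reported figure.

```python
HIDDEN_VLM = 2560   # Qwen3-VL hidden size quoted in this card
HIDDEN_DIT = 1024   # DiT latent size quoted in this card
NUM_LAYERS = 36     # assumed: one projector per layer of the 36-layer head

layer_norm = 2 * HIDDEN_VLM                      # affine weight + bias
linear = HIDDEN_VLM * HIDDEN_DIT + HIDDEN_DIT    # weight matrix + bias
project_layers = NUM_LAYERS * (layer_norm + linear)

modules = {
    "qwen_vl_interface": 4_437_815_808,
    "action_model": 538_678_305,
    "project_layers": project_layers,
}
total = sum(modules.values())
for name, count in modules.items():
    print(f"{name:18s} {count:>13,d} {100 * count / total:5.1f} %")
print(f"{'total':18s} {total:>13,d}")
```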
---
## Training Data
Co-training mixture **`bridge_rt_1`** (1 : 1 sampling):
| Dataset | Embodiment | Source |
| --- | --- | --- |
| `bridge_orig_1.0.0_lerobot` | WidowX | [IPEC-COMMUNITY/bridge_orig_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot) |
| `fractal20220817_data_0.1.0_lerobot` (RT-1) | Google Robot | [IPEC-COMMUNITY/fractal20220817_data_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot) |
- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in
[`dataset_statistics.json`](dataset_statistics.json).
---
## Training Recipe
| | |
| --- | --- |
| Total steps | 100,000 (released checkpoints up to 60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 24 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 1e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 5e-7) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
| Attention impl. | FlashAttention-2 |
The exact training config is preserved in
[`config.yaml`](config.yaml) / [`config.full.yaml`](config.full.yaml), and the
launch script in [`run_oxe_train.sh`](run_oxe_train.sh).
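
As a rough illustration of the schedule in the table, the sketch below computes a linear warm-up followed by cosine decay to a floor. It is not the scheduler implementation used in training: the function `lr_at` is hypothetical, and it assumes the same absolute floor of 5e-7 for both parameter groups, which the actual `cosine_with_min_lr` scheduler may handle differently.

```python
import math


def lr_at(step, base_lr, min_lr=5e-7, warmup_steps=5_000, total_steps=100_000):
    """Linear warm-up to base_lr, then cosine decay to min_lr (a sketch of the shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (base_lr - min_lr) * cosine


for step in (0, 5_000, 50_000, 100_000):
    print(step, f"vlm={lr_at(step, 1e-5):.2e}", f"head={lr_at(step, 1e-4):.2e}")

# Samples per optimiser step, assuming no gradient accumulation (not stated in this card).
print("global batch:", 24 * 8)
```

Under these assumptions the released 50k checkpoint sits roughly halfway through the decay phase, so its learning rate is still well above the floor.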
---
## Evaluation: SimplerEnv WidowX
We follow the standard SimplerEnv WidowX protocol on four pick-and-place
tasks (24 episodes per task per run). Numbers are success rates (higher is better, ↑).
| Step | PutCarrotOnPlate | PutEggplantInBasket | PutSpoonOnTableCloth | StackGreenCubeOnYellowCube | **Average** |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 40k | **0.688** | 0.917 | 0.750 | 0.333 | 0.672 |
| 50k | 0.625 | **1.000** | **0.792** | **0.375** | **0.698** |
| 60k | 0.667 | **1.000** | 0.750 | 0.167 | 0.646 |
Best average: **69.8 %** at the 50k checkpoint
([`steps_50000_pytorch_model.pt`](checkpoints/steps_50000_pytorch_model.pt)),
which we ship as the recommended checkpoint.
For a comparison with other StarVLA frameworks on the same `bridge_rt_1`
mixture and protocol, see the [StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).
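
The Average column is the unweighted mean of the four task success rates. The short check below recomputes it from the table values; the dictionary layout is for illustration only and is not the format of `success_summary.csv`.

```python
# Per-task success rates copied from the table, in column order:
# PutCarrotOnPlate, PutEggplantInBasket, PutSpoonOnTableCloth, StackGreenCubeOnYellowCube
results = {
    "40k": [0.688, 0.917, 0.750, 0.333],
    "50k": [0.625, 1.000, 0.792, 0.375],
    "60k": [0.667, 1.000, 0.750, 0.167],
}
averages = {step: sum(rates) / len(rates) for step, rates in results.items()}
best = max(averages, key=averages.get)
for step, avg in averages.items():
    print(step, f"{avg:.3f}")
print("best:", best)
```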
---
## Repository Layout
```
.
├── README.md                    # this model card
├── config.yaml                  # minimal training config
├── config.full.yaml             # fully resolved training config
├── run_oxe_train.sh             # launch script used for this run
├── dataset_statistics.json      # per-dataset action/state normalisation stats
├── summary.jsonl                # training step summary
├── success_summary/             # SimplerEnv evaluation logs and plots
│   ├── success_summary.csv
│   ├── raw_success.txt
│   └── success_plot.png
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    └── ...                            # per-step evaluation logs
```
---
## How to Use
This checkpoint is consumed directly by the StarVLA training / evaluation
stack. Clone StarVLA and load the checkpoint with the framework name
`QwenPI_v3`:
```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```
```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint
ckpt_dir = snapshot_download("StarVLA/Qwen3VL-PI_v3-Bridge-RT-1")
policy = load_framework_from_checkpoint(
    framework_name="QwenPI_v3",
    config_path=f"{ckpt_dir}/config.full.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (16 × 7)
```
For end-to-end SimplerEnv evaluation see
[`examples/SimplerEnv`](https://github.com/starVLA/starVLA/tree/main/examples/SimplerEnv).
---
## Intended Use & Limitations
**Intended use.** Research on vision-language-action models, manipulation
policy learning, and as a baseline for π-style flow-matching action heads
on top of open-weight VLMs.
**Out-of-scope / limitations.**
- Trained only on Bridge (WidowX) + RT-1 (Google Robot) with a 7-d delta-EE
  action space; generalisation to other embodiments or action spaces is not
  guaranteed.
- Single 224 × 224 third-person view; no wrist camera, no depth.
- Evaluated only in SimplerEnv WidowX simulation; the released checkpoint has
  not been validated on real robots.
- Inherits any biases / failure modes of the underlying Qwen3-VL-4B model.
- Not safety-tuned. Do **not** deploy on physical robots without an external
safety layer.
---
## Citation
If you use this checkpoint, please cite StarVLA:
```bibtex
@article{starvla2026,
title = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
author = {StarVLA Community},
journal = {arXiv preprint arXiv:2604.05014},
year = {2026},
url = {https://arxiv.org/abs/2604.05014}
}
```
And the underlying VLM backbone:
```bibtex
@misc{qwen3vl,
title = {Qwen3-VL},
author = {Qwen Team},
year = {2025},
url = {https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct}
}
```
## Acknowledgements
- [Qwen Team](https://huggingface.co/Qwen) for the Qwen3-VL backbone.
- [Physical Intelligence](https://www.physicalintelligence.company/) for the
  π₀ / π₀.₅ flow-matching action-head recipe that inspired `QwenPI_v3`.
- [Open X-Embodiment](https://robotics-transformer-x.github.io/) and
[IPEC-COMMUNITY](https://huggingface.co/IPEC-COMMUNITY) for the LeRobot
conversions of Bridge V2 and RT-1.
- [SimplerEnv](https://github.com/simpler-env/SimplerEnv) for the
evaluation protocol.