---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
- vla
- vision-language-action
- robotics
- qwen3-vl
- flow-matching
- pi-zero
- manipulation
- bridge
- rt-1
- oxe
datasets:
- IPEC-COMMUNITY/bridge_orig_lerobot
- IPEC-COMMUNITY/fractal20220817_data_lerobot
language:
- en
base_model:
- Qwen/Qwen3-VL-4B-Instruct
---

# Qwen3VL-PI_v3-Bridge-RT-1

A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA) project, combining a **Qwen3-VL-4B-Instruct** backbone with a **layer-wise cross-attention flow-matching action head** (`QwenPI_v3`). The model is co-trained on the [Bridge V2](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot) and [RT-1 / Fractal](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot) slices of the Open X-Embodiment (OXE) collection, and is evaluated on the **SimplerEnv WidowX** benchmark.

`QwenPI_v3` is StarVLA's open-weight realisation of the π₀.₅ recipe:

1. **Layer-wise cross-DiT flow-matching action head**: the action DiT cross-attends to every VLM layer's hidden state instead of consuming only the last-layer feature.
2. **Compressed Action DiT**: per-layer `LayerNorm + Linear` projectors compress the 2560-d Qwen3-VL hidden states down to a 1024-d DiT latent, shrinking the action-head footprint by ~6× while preserving the layer-wise interaction.
3. **Discretised-state language injection**: proprioceptive state is quantised into 256 bins and appended to the instruction as plain tokens (`[STATE] [ACTION]`), so the VLM can attend to robot state with no additional encoder.

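
The discretised-state injection described in point 3 can be illustrated with a minimal sketch. Only the bin count (256) and the `[STATE]` / `[ACTION]` markers come from this card. The helper names `discretise_state` and `build_prompt`, the assumption that the state is already normalised to `[-1, 1]`, the uniform bin edges, and the space-separated prompt layout are all illustrative assumptions, not StarVLA's actual implementation.

```python
import math

NUM_BINS = 256  # bin count stated in this model card


def discretise_state(state, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each state dimension to an integer bin index in [0, num_bins - 1].

    Assumption: the state is already normalised to [low, high] and the bins
    are uniform. Values outside the range are clipped to the nearest edge.
    """
    bins = []
    for value in state:
        clipped = min(max(float(value), low), high)
        scaled = (clipped - low) / (high - low)  # now in [0, 1]
        index = int(math.floor(scaled * num_bins))
        bins.append(min(index, num_bins - 1))  # `high` itself falls in the last bin
    return bins


def build_prompt(instruction, state):
    """Append bin indices to the instruction as plain text (hypothetical layout)."""
    tokens = " ".join(str(b) for b in discretise_state(state))
    return f"{instruction} [STATE] {tokens} [ACTION]"


print(build_prompt("put the spoon on the towel", [-1.0, 0.0, 1.0]))
# prints: put the spoon on the towel [STATE] 0 128 255 [ACTION]
```

Because the state ends up as ordinary text, the existing tokenizer and attention layers handle it, so no separate state encoder is needed.
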
---

## Model Summary

| | |
| --- | --- |
| **Architecture** | `QwenPI_v3` (Qwen3-VL + layer-wise cross-DiT flow-matching head) |
| **VLM backbone** | [`Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| **Action head** | Layer-wise Flow-Matching DiT (36 layers, 1024 hidden, 16 heads) |
| **Action chunk** | 16 steps |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **Total parameters** | **≈ 5.07 B** |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |

### Parameter breakdown

| Module | Parameters | Share |
| --- | ---: | ---: |
| `qwen_vl_interface` (Qwen3-VL-4B) | 4,437,815,808 | 87.5 % |
| `action_model` (layer-wise FM DiT, hidden 1024) | 538,678,305 | 10.6 % |
| `project_layers` (per-layer 2560 → 1024 projectors) | 94,593,024 | 1.9 % |
| **Total** | **5,071,087,137** | **100 %** |

---

## Training Data

Co-training mixture **`bridge_rt_1`** (1 : 1 sampling):

| Dataset | Embodiment | Source |
| --- | --- | --- |
| `bridge_orig_1.0.0_lerobot` | WidowX | [IPEC-COMMUNITY/bridge_orig_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot) |
| `fractal20220817_data_0.1.0_lerobot` (RT-1) | Google Robot | [IPEC-COMMUNITY/fractal20220817_data_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot) |

- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in [`dataset_statistics.json`](dataset_statistics.json).

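
One plausible reading of the 1 : 1 sampling ratio is that each training sample is drawn from either dataset with equal probability. The sketch below illustrates that reading only. `MIXTURE` and `sample_sources` are hypothetical names; the actual sampler lives in the StarVLA codebase and may differ, for example by balancing at the batch or episode level.

```python
import random
from collections import Counter

# Dataset names as listed in this card; equal weights encode the 1 : 1 ratio.
MIXTURE = {
    "bridge_orig_1.0.0_lerobot": 1.0,
    "fractal20220817_data_0.1.0_lerobot": 1.0,
}


def sample_sources(num_samples, mixture=MIXTURE, seed=0):
    """Choose a source dataset for each training sample from the mixture weights."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


# Count how often each dataset is picked over 10,000 draws.
counts = Counter(sample_sources(10_000))
for name, count in counts.items():
    print(f"{name}: {count / 10_000:.1%}")
```

With equal weights, both datasets contribute roughly half of the samples regardless of their raw sizes, which keeps the larger dataset from dominating training.
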
---

## Training Recipe

| | |
| --- | --- |
| Total steps | 100,000 (released checkpoints up to 60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 24 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 1e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 5e-7) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α=1.5, β=1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
| Attention impl. | FlashAttention-2 |

The exact training config is preserved in [`config.yaml`](config.yaml) / [`config.full.yaml`](config.full.yaml), and the launch script in [`run_oxe_train.sh`](run_oxe_train.sh).

---

## Evaluation: SimplerEnv WidowX

We follow the standard SimplerEnv WidowX protocol on four pick-and-place tasks (24 episodes per task per run). Numbers are success rates (↑); bold marks the best value in each column.

| Step | PutCarrotOnPlate | PutEggplantInBasket | PutSpoonOnTableCloth | StackGreenCubeOnYellowCube | **Average** |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 40k | **0.688** | 0.917 | 0.750 | 0.333 | 0.672 |
| 50k | 0.625 | **1.000** | **0.792** | **0.375** | **0.698** |
| 60k | 0.667 | **1.000** | 0.750 | 0.167 | 0.646 |

Best average: **69.8 %** at the 50k checkpoint ([`steps_50000_pytorch_model.pt`](checkpoints/steps_50000_pytorch_model.pt)), which we ship as the recommended checkpoint.

For a comparison with other StarVLA frameworks on the same `bridge_rt_1` mixture and protocol, see the [StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).

---

## Repository Layout

```
.
├── README.md                          # this model card
├── config.yaml                        # minimal training config
├── config.full.yaml                   # fully resolved training config
├── run_oxe_train.sh                   # launch script used for this run
├── dataset_statistics.json            # per-dataset action/state normalisation stats
├── summary.jsonl                      # training step summary
├── success_summary/                   # SimplerEnv evaluation logs and plots
│   ├── success_summary.csv
│   ├── raw_success.txt
│   └── success_plot.png
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    └── ...                            # per-step evaluation logs
```

---

## How to Use

This checkpoint is consumed directly by the StarVLA training / evaluation stack. Clone StarVLA and load the checkpoint with the framework name `QwenPI_v3`:

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/Qwen3VL-PI_v3-Bridge-RT-1")
policy = load_framework_from_checkpoint(
    framework_name="QwenPI_v3",
    config_path=f"{ckpt_dir}/config.full.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (16 × 7)
```

For end-to-end SimplerEnv evaluation, see [`examples/SimplerEnv`](https://github.com/starVLA/starVLA/tree/main/examples/SimplerEnv).

---

## Intended Use & Limitations

**Intended use.** Research on vision-language-action models and manipulation policy learning, and use as a baseline for π-style flow-matching action heads on top of open-weight VLMs.

**Out-of-scope / limitations.**

- Trained only on Bridge (WidowX) + RT-1 (Google Robot) with a 7-d delta-EE action space; generalisation to other embodiments / action spaces is not guaranteed.
- Single 224 × 224 third-person view; no wrist camera, no depth.
- Evaluated only on SimplerEnv WidowX simulation; the released checkpoint has not been validated on real robots.
- Inherits any biases / failure modes of the underlying Qwen3-VL-4B model.
- Not safety-tuned. Do **not** deploy on physical robots without an external safety layer.

---

## Citation

If you use this checkpoint, please cite StarVLA:

```bibtex
@article{starvla2026,
  title   = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
  author  = {StarVLA Community},
  journal = {arXiv preprint arXiv:2604.05014},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.05014}
}
```

And the underlying VLM backbone:

```bibtex
@misc{qwen3vl,
  title  = {Qwen3-VL},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct}
}
```

## Acknowledgements

- [Qwen Team](https://huggingface.co/Qwen) for the Qwen3-VL backbone.
- [Physical Intelligence](https://www.physicalintelligence.company/) for the π₀ / π₀.₅ flow-matching action-head recipe that inspired `QwenPI_v3`.
- [Open X-Embodiment](https://robotics-transformer-x.github.io/) and [IPEC-COMMUNITY](https://huggingface.co/IPEC-COMMUNITY) for the LeRobot conversions of Bridge V2 and RT-1.
- [SimplerEnv](https://github.com/simpler-env/SimplerEnv) for the evaluation protocol.