| --- |
| license: mit |
| library_name: starVLA |
| pipeline_tag: robotics |
| tags: |
| - vla |
| - vision-language-action |
| - robotics |
| - qwen3-vl |
| - flow-matching |
| - pi-zero |
| - manipulation |
| - bridge |
| - rt-1 |
| - oxe |
| datasets: |
| - IPEC-COMMUNITY/bridge_orig_lerobot |
| - IPEC-COMMUNITY/fractal20220817_data_lerobot |
| language: |
| - en |
| base_model: |
| - Qwen/Qwen3-VL-4B-Instruct |
| --- |
| |
| # Qwen3VL-PI_v3-Bridge-RT-1 |
| |
| A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA) |
| project, combining a **Qwen3-VL-4B-Instruct** backbone with a **layer-wise |
| cross-attention flow-matching action head** (`QwenPI_v3`). The model is |
| co-trained on the [Bridge V2](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot) |
| and [RT-1 / Fractal](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot) |
| slices of the Open X-Embodiment (OXE) collection, and is evaluated on the |
| **SimplerEnv WidowX** benchmark. |
|
|
| `QwenPI_v3` is StarVLA's open-weight realisation of the π₀.₅ recipe: |
|
|
| 1. **Layer-wise cross-DiT flow-matching action head**: every VLM layer's |
| hidden state participates in cross-attention with the action DiT, instead |
| of the DiT consuming only the last-layer feature. |
| 2. **Compressed Action DiT**: per-layer `LayerNorm + Linear` projectors |
| compress the 2560-d Qwen3-VL hidden states down to a 1024-d DiT latent, |
| shrinking the action-head footprint by ~6× while preserving the |
| layer-wise interaction. |
| 3. **Discretised-state language injection**: proprioceptive state is |
| quantised into 256 bins and appended to the instruction as plain tokens |
| (`[STATE] <bins> [ACTION]`), so the VLM can attend to robot state with |
| no additional encoder. |
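
The state-injection step in item 3 can be sketched in a few lines. This is a minimal illustration, not StarVLA's implementation: the normalised range `[-1, 1]`, the uniform binning, the space-separated bin format, and the helper names `discretise_state` and `build_prompt` are all assumptions; only the 256 bins and the `[STATE] <bins> [ACTION]` template come from the description above.

```python
from typing import List, Sequence

NUM_BINS = 256  # from the model card


def discretise_state(state: Sequence[float], low: float = -1.0, high: float = 1.0) -> List[int]:
    """Map each state dimension to an integer bin in [0, NUM_BINS - 1].

    Assumes the state was already normalised to [low, high] (hypothetical range).
    """
    bins = []
    for value in state:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The upper edge (fraction == 1.0) is clamped into the last bin.
        bins.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return bins


def build_prompt(instruction: str, state: Sequence[float]) -> str:
    """Append the binned state to the instruction as plain text (assumed format)."""
    bins = " ".join(str(b) for b in discretise_state(state))
    return f"{instruction} [STATE] {bins} [ACTION]"


example_state = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0]  # 7-d, made-up values
prompt = build_prompt("put the carrot on the plate", example_state)
print(prompt)
```

Because the state travels as ordinary text, the VLM tokenizer handles it like any other part of the instruction, which is why no separate state encoder is needed.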
|
|
| --- |
|
|
| ## Model Summary |
|
|
| | | | |
| | --- | --- | |
| | **Architecture** | `QwenPI_v3` (Qwen3-VL + layer-wise cross-DiT flow-matching head) | |
| | **VLM backbone** | [`Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) | |
| | **Action head** | Layer-wise Flow-Matching DiT (36 layers, 1024 hidden, 16 heads) | |
| | **Action chunk** | 16 steps | |
| | **Action / state dim** | 7 / 7 (delta end-effector) | |
| | **Image resolution** | 224 × 224, single 3rd-person view | |
| | **Inference timesteps** | 4 (flow matching) | |
| | **Total parameters** | **≈ 5.07 B** | |
| | **License** | MIT | |
| | **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) | |
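
To make the "4 inference timesteps" entry concrete, here is a minimal sketch of how a flow-matching sampler typically turns noise into an action with a fixed number of Euler steps. It is an illustration under stated assumptions, not the `QwenPI_v3` code: `euler_sample` and `toy_velocity` are hypothetical names, the integration direction (t from 0 to 1) is assumed, and the toy velocity field stands in for the learned action DiT. A flat 7-d vector is used instead of the full 16 × 7 chunk to keep the example short.

```python
from typing import Callable, List

NUM_INFERENCE_STEPS = 4  # from the model card

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int = NUM_INFERENCE_STEPS,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]
    return x


# Toy stand-in for the learned network: the velocity that carries x to TARGET
# along a straight line (made-up target values).
TARGET = [0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 1.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(target_i - x_i) / (1.0 - t) for target_i, x_i in zip(TARGET, x)]


action = euler_sample(toy_velocity, noise=[0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5])
print(action)
```

With a straight-line velocity field the four Euler steps land on the target, which is the intuition behind using so few steps at inference time.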
|
|
| ### Parameter breakdown |
|
|
| | Module | Parameters | Share | |
| | --- | ---: | ---: | |
| | `qwen_vl_interface` (Qwen3-VL-4B) | 4,437,815,808 | 87.5 % | |
| | `action_model` (layer-wise FM DiT, hidden 1024) | 538,678,305 | 10.6 % | |
| | `project_layers` (per-layer 2560 → 1024 projectors) | 94,593,024 | 1.9 % | |
| | **Total** | **5,071,087,137** | **100 %** | |
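
The figures in this table can be cross-checked with plain arithmetic. The sketch below assumes one projector per layer for 36 layers, each a `LayerNorm` with weight and bias followed by a `Linear` with bias; those details are assumptions, although they reproduce the reported `project_layers` count exactly.

```python
HIDDEN_VLM = 2560   # Qwen3-VL hidden size (from the card)
HIDDEN_DIT = 1024   # DiT latent size (from the card)
NUM_LAYERS = 36     # assumed: one projector per layer

# Assumed projector layout: LayerNorm(2560) with weight + bias, then Linear(2560 -> 1024) with bias.
layer_norm_params = 2 * HIDDEN_VLM
linear_params = HIDDEN_VLM * HIDDEN_DIT + HIDDEN_DIT
projector_params = NUM_LAYERS * (layer_norm_params + linear_params)

# Module sizes as reported in the table.
modules = {
    "qwen_vl_interface": 4_437_815_808,
    "action_model": 538_678_305,
    "project_layers": 94_593_024,
}
total = sum(modules.values())
shares = {name: round(100 * count / total, 1) for name, count in modules.items()}

print(projector_params, total, shares)
```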
|
|
| --- |
|
|
| ## Training Data |
|
|
| Co-training mixture **`bridge_rt_1`** (1 : 1 sampling): |
|
|
| | Dataset | Embodiment | Source | |
| | --- | --- | --- | |
| | `bridge_orig_1.0.0_lerobot` | WidowX | [IPEC-COMMUNITY/bridge_orig_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot) | |
| | `fractal20220817_data_0.1.0_lerobot` (RT-1) | Google Robot | [IPEC-COMMUNITY/fractal20220817_data_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot) | |
|
|
| - Action representation: **delta end-effector** (7-d, gripper included) |
| - Image observation: single primary RGB view, resized to 224 × 224 |
| - Per-dataset normalisation statistics are stored in |
| [`dataset_statistics.json`](dataset_statistics.json). |
|
|
| --- |
|
|
| ## Training Recipe |
|
|
| | | | |
| | --- | --- | |
| | Total steps | 100,000 (released checkpoints up to 60k) | |
| | Warm-up steps | 5,000 | |
| | Per-device batch size | 24 | |
| | Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) | |
| | Precision | bf16, mixed-precision + gradient checkpointing | |
| | Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) | |
| | LR (base / VLM) | 1e-5 | |
| | LR (action head) | 1e-4 | |
| | LR scheduler | `cosine_with_min_lr` (min lr 5e-7) | |
| | Gradient clipping | 1.0 | |
| | Flow-matching noise | β-distribution (α=1.5, β=1.0), s = 0.999 | |
| | Repeated diffusion steps | 8 | |
| | Frozen modules | none (full fine-tuning) | |
| | Attention impl. | FlashAttention-2 | |
|
|
| The exact training config is preserved in |
| [`config.yaml`](config.yaml) / [`config.full.yaml`](config.full.yaml), and the |
| launch script in [`run_oxe_train.sh`](run_oxe_train.sh). |
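
As a rough picture of the learning-rate schedule in the table, the sketch below combines a linear warm-up with cosine decay to a floor. It is an approximation under assumptions, not the scheduler used for this run: `lr_at` is a hypothetical helper, the warm-up is assumed linear from zero, and the same absolute floor of 5e-7 is applied to both parameter groups, whereas the real `cosine_with_min_lr` implementation may scale the floor per group. Refer to the config files for the authoritative settings.

```python
import math

TOTAL_STEPS = 100_000   # from the table
WARMUP_STEPS = 5_000    # from the table
MIN_LR = 5e-7           # from the table


def lr_at(step: int, peak_lr: float, min_lr: float = MIN_LR) -> float:
    """Linear warm-up to peak_lr, then cosine decay down to min_lr (assumed shape)."""
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Compare the VLM group (peak 1e-5) and the action-head group (peak 1e-4).
for step in (0, 5_000, 52_500, 100_000):
    print(step, lr_at(step, peak_lr=1e-5), lr_at(step, peak_lr=1e-4))
```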
|
|
| --- |
|
|
| ## Evaluation: SimplerEnv WidowX |
|
|
| We follow the standard SimplerEnv WidowX protocol on four pick-and-place |
| tasks (24 episodes per task per run). Numbers are success rates (↑, higher is better). |
|
|
| | Step | PutCarrotOnPlate | PutEggplantInBasket | PutSpoonOnTableCloth | StackGreenCubeOnYellowCube | **Average** | |
| | ---: | ---: | ---: | ---: | ---: | ---: | |
| | 40k | **0.688** | 0.917 | 0.750 | 0.333 | 0.672 | |
| | 50k | 0.625 | **1.000** | **0.792** | **0.375** | **0.698** | |
| | 60k | 0.667 | **1.000** | 0.750 | 0.167 | 0.646 | |
|
|
| Best average: **69.8 %** at the 50k checkpoint |
| ([`steps_50000_pytorch_model.pt`](checkpoints/steps_50000_pytorch_model.pt)), |
| which we ship as the recommended checkpoint. |
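
The Average column is the unweighted mean of the four per-task success rates. The short check below recomputes it from the table values and picks the best checkpoint; the variable names are illustrative only.

```python
# Per-task success rates copied from the table, in column order:
# PutCarrotOnPlate, PutEggplantInBasket, PutSpoonOnTableCloth, StackGreenCubeOnYellowCube
results = {
    "40k": [0.688, 0.917, 0.750, 0.333],
    "50k": [0.625, 1.000, 0.792, 0.375],
    "60k": [0.667, 1.000, 0.750, 0.167],
}

# Unweighted mean over the four tasks, rounded to three decimals as in the table.
averages = {step: round(sum(rates) / len(rates), 3) for step, rates in results.items()}
best_step = max(averages, key=averages.get)

print(averages)
print("best checkpoint:", best_step)
```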
|
|
| For comparison with other StarVLA frameworks on the same `bridge_rt_1` |
| mixture and protocol see the [StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md). |
|
|
| --- |
|
|
| ## Repository Layout |
|
|
| ``` |
| . |
| ├── README.md                          # this model card |
| ├── config.yaml                        # minimal training config |
| ├── config.full.yaml                   # fully resolved training config |
| ├── run_oxe_train.sh                   # launch script used for this run |
| ├── dataset_statistics.json            # per-dataset action/state normalisation stats |
| ├── summary.jsonl                      # training step summary |
| ├── success_summary/                   # SimplerEnv evaluation logs and plots |
| │   ├── success_summary.csv |
| │   ├── raw_success.txt |
| │   └── success_plot.png |
| └── checkpoints/ |
|     ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint |
|     └── ...                            # per-step evaluation logs |
| ``` |
|
|
| --- |
|
|
| ## How to Use |
|
|
| This checkpoint is consumed directly by the StarVLA training / evaluation |
| stack. Clone StarVLA and load the checkpoint with the framework name |
| `QwenPI_v3`: |
|
|
| ```bash |
| git clone https://github.com/starVLA/starVLA.git |
| cd starVLA |
| # Follow installation instructions in the StarVLA README. |
| ``` |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| from starVLA.model.framework.tools import load_framework_from_checkpoint |
| |
| ckpt_dir = snapshot_download("StarVLA/Qwen3VL-PI_v3-Bridge-RT-1") |
| |
| policy = load_framework_from_checkpoint( |
| framework_name="QwenPI_v3", |
| config_path=f"{ckpt_dir}/config.full.yaml", |
| checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt", |
| ) |
| # policy.predict_action(images, instruction, state) -> action chunk (16 × 7) |
| ``` |
|
|
| For end-to-end SimplerEnv evaluation see |
| [`examples/SimplerEnv`](https://github.com/starVLA/starVLA/tree/main/examples/SimplerEnv). |
|
|
| --- |
|
|
| ## Intended Use & Limitations |
|
|
| **Intended use.** Research on vision-language-action models, manipulation |
| policy learning, and as a baseline for π-style flow-matching action heads |
| on top of open-weight VLMs. |
|
|
| **Out-of-scope / limitations.** |
|
|
| - Trained only on Bridge (WidowX) + RT-1 (Google Robot) with a 7-d delta-EE |
| action space; generalisation to other embodiments / action spaces is not |
| guaranteed. |
| - Single 224 Γ 224 third-person view; no wrist camera, no depth. |
| - Evaluated only on SimplerEnv WidowX simulation; behaviour on real robots |
| has not been validated for the released checkpoint. |
| - Inherits any biases / failure modes of the underlying Qwen3-VL-4B model. |
| - Not safety-tuned. Do **not** deploy on physical robots without an external |
| safety layer. |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this checkpoint, please cite StarVLA: |
|
|
| ```bibtex |
| @article{starvla2026, |
| title = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing}, |
| author = {StarVLA Community}, |
| journal = {arXiv preprint arXiv:2604.05014}, |
| year = {2026}, |
| url = {https://arxiv.org/abs/2604.05014} |
| } |
| ``` |
|
|
| And the underlying VLM backbone: |
|
|
| ```bibtex |
| @misc{qwen3vl, |
| title = {Qwen3-VL}, |
| author = {Qwen Team}, |
| year = {2025}, |
| url = {https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct} |
| } |
| ``` |
|
|
| ## Acknowledgements |
|
|
| - [Qwen Team](https://huggingface.co/Qwen) for the Qwen3-VL backbone. |
| - [Physical Intelligence](https://www.physicalintelligence.company/) for the |
| π₀ / π₀.₅ flow-matching action-head recipe that inspired `QwenPI_v3`. |
| - [Open X-Embodiment](https://robotics-transformer-x.github.io/) and |
| [IPEC-COMMUNITY](https://huggingface.co/IPEC-COMMUNITY) for the LeRobot |
| conversions of Bridge V2 and RT-1. |
| - [SimplerEnv](https://github.com/simpler-env/SimplerEnv) for the |
| evaluation protocol. |
|
|