---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
  - vla
  - vision-language-action
  - robotics
  - qwen3-vl
  - flow-matching
  - pi-zero
  - manipulation
  - bridge
  - rt-1
  - oxe
datasets:
  - IPEC-COMMUNITY/bridge_orig_lerobot
  - IPEC-COMMUNITY/fractal20220817_data_lerobot
language:
  - en
base_model:
  - Qwen/Qwen3-VL-4B-Instruct
---

# Qwen3VL-PI_v3-Bridge-RT-1

A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA)
project, combining a **Qwen3-VL-4B-Instruct** backbone with a **layer-wise
cross-attention flow-matching action head** (`QwenPI_v3`). The model is
co-trained on the [Bridge V2](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot)
and [RT-1 / Fractal](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot)
slices of the Open X-Embodiment (OXE) collection, and is evaluated on the
**SimplerEnv WidowX** benchmark.

`QwenPI_v3` is StarVLA's open-weight realisation of the π₀.₅ recipe:

1. **Layer-wise cross-DiT flow-matching action head** — every VLM layer's
   hidden state participates in cross-attention with the action DiT, instead
   of consuming only the last-layer feature.
2. **Compressed Action DiT** — per-layer `LayerNorm + Linear` projectors
   compress the 2560-d Qwen3-VL hidden states down to a 1024-d DiT latent,
   shrinking the action-head footprint by ~6× while preserving the
   layer-wise interaction.
3. **Discretised-state language injection** — proprioceptive state is
   quantised into 256 bins and appended to the instruction as plain tokens
   (`[STATE] <bins> [ACTION]`), so the VLM can attend to robot state with
   no additional encoder.
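
The state injection in item 3 can be sketched as follows. This is a minimal illustration rather than StarVLA's implementation: it assumes the state is already normalised to `[-1, 1]` and that bin indices are written as space-separated integers. The helper names `discretise_state` and `build_prompt` are hypothetical.

```python
def discretise_state(state, num_bins=256, low=-1.0, high=1.0):
    """Map each state value to an integer bin index in [0, num_bins - 1]."""
    bins = []
    for value in state:
        # Assumption: out-of-range values are clipped to [low, high].
        clipped = min(max(value, low), high)
        frac = (clipped - low) / (high - low)
        # The upper edge (frac == 1.0) is folded into the last bin.
        bins.append(min(int(frac * num_bins), num_bins - 1))
    return bins


def build_prompt(instruction, state):
    """Append the binned state to the instruction as plain text tokens."""
    bin_text = " ".join(str(b) for b in discretise_state(state))
    return f"{instruction} [STATE] {bin_text} [ACTION]"


# 7-d state with illustrative values only.
state = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0]
print(build_prompt("put the carrot on the plate", state))
# put the carrot on the plate [STATE] 0 64 128 160 192 224 255 [ACTION]
```

Because the state becomes ordinary text, the VLM tokenizer handles it like any other part of the instruction.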

---

## Model Summary

| | |
| --- | --- |
| **Architecture** | `QwenPI_v3` (Qwen3-VL + layer-wise cross-DiT flow-matching head) |
| **VLM backbone** | [`Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| **Action head** | Layer-wise Flow-Matching DiT (36 layers, 1024 hidden, 16 heads) |
| **Action chunk** | 16 steps |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **Total parameters** | **≈ 5.07 B** |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |
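
To make the "4 inference timesteps" and "16 × 7 action chunk" entries concrete, here is a hedged sketch of Euler integration for a flow-matching head. It assumes noise sits at `t = 0` and actions at `t = 1`; the convention in the actual code may differ. `toy_velocity` is a hypothetical stand-in for the action DiT, not part of StarVLA.

```python
import random

CHUNK_LEN, ACTION_DIM, NUM_STEPS = 16, 7, 4  # values from the summary table


def euler_sample(velocity_fn, num_steps=NUM_STEPS, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
         for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # 0.0, 0.25, 0.5, 0.75 for four steps
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


# Toy target chunk; in the real model the velocity comes from the DiT,
# conditioned on the per-layer VLM features.
target = [[0.1 * d for d in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]


def toy_velocity(x, t):
    """Exact velocity field when the data distribution is a single chunk."""
    return [[(ti - xi) / (1.0 - t) for xi, ti in zip(x_row, t_row)]
            for x_row, t_row in zip(x, target)]


chunk = euler_sample(toy_velocity)
print(len(chunk), len(chunk[0]))  # 16 7
```

With this toy field the four Euler steps land on the target chunk up to floating-point error; a learned velocity field will only approximate this.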

### Parameter breakdown

| Module | Parameters | Share |
| --- | ---: | ---: |
| `qwen_vl_interface` (Qwen3-VL-4B) | 4,437,815,808 | 87.5 % |
| `action_model` (layer-wise FM DiT, hidden 1024) | 538,678,305 | 10.6 % |
| `project_layers` (per-layer 2560 → 1024 projectors) | 94,593,024 | 1.9 % |
| **Total** | **5,071,087,137** | **100 %** |
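
The `project_layers` count is consistent with 36 projectors, each a `LayerNorm(2560)` followed by a `Linear(2560 → 1024)`. The sketch below checks that arithmetic and recomputes the shares. It assumes affine LayerNorm (weight and bias) and a biased Linear layer, which the card does not state explicitly.

```python
NUM_LAYERS, VLM_DIM, DIT_DIM = 36, 2560, 1024

# Assumption: LayerNorm has weight + bias, Linear has weight + bias.
layer_norm = 2 * VLM_DIM
linear = VLM_DIM * DIT_DIM + DIT_DIM
project_layers = NUM_LAYERS * (layer_norm + linear)

modules = {
    "qwen_vl_interface": 4_437_815_808,  # from the table
    "action_model": 538_678_305,         # from the table
    "project_layers": project_layers,    # derived above
}
total = sum(modules.values())

for name, count in modules.items():
    print(f"{name:>18}: {count:>13,} ({100 * count / total:.1f} %)")
print(f"{'total':>18}: {total:>13,}")
```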

---

## Training Data

Co-training mixture **`bridge_rt_1`** (1 : 1 sampling):

| Dataset | Embodiment | Source |
| --- | --- | --- |
| `bridge_orig_1.0.0_lerobot` | WidowX | [IPEC-COMMUNITY/bridge_orig_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot) |
| `fractal20220817_data_0.1.0_lerobot` (RT-1) | Google Robot | [IPEC-COMMUNITY/fractal20220817_data_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot) |

- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in
  [`dataset_statistics.json`](dataset_statistics.json).

---

## Training Recipe

| | |
| --- | --- |
| Total steps | 100,000 (released checkpoints up to 60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 24 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 1e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 5e-7) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α=1.5, β=1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
| Attention impl. | FlashAttention-2 |

The exact training config is preserved in
[`config.yaml`](config.yaml) / [`config.full.yaml`](config.full.yaml), and the
launch script in [`run_oxe_train.sh`](run_oxe_train.sh).
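
As a rough guide to the learning-rate rows above, the following sketch computes a linear warm-up followed by cosine decay to a floor. It is a plain-Python approximation of what a `cosine_with_min_lr` schedule typically does, not the training code. How the floor applies to the action-head group (peak 1e-4) is not stated here, so only the base / VLM group is shown.

```python
import math

TOTAL_STEPS, WARMUP_STEPS = 100_000, 5_000  # from the recipe table


def lr_at(step, peak_lr, min_lr, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warm-up to peak_lr, then cosine decay that ends at min_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Base / VLM group: peak 1e-5, floor 5e-7.
for step in (0, 5_000, 50_000, 100_000):
    print(step, f"{lr_at(step, peak_lr=1e-5, min_lr=5e-7):.2e}")
```

Under this sketch, the released 40k to 60k checkpoints fall in the middle of the decay phase, well before the floor is reached.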

---

## Evaluation — SimplerEnv WidowX

Evaluation follows the standard SimplerEnv WidowX protocol on four
pick-and-place tasks (24 episodes per task per run). Numbers are success
rates (↑).

| Step | PutCarrotOnPlate | PutEggplantInBasket | PutSpoonOnTableCloth | StackGreenCubeOnYellowCube | **Average** |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 40k  | **0.688** | 0.917 | 0.750 | 0.333 | 0.672 |
| 50k  | 0.625 | **1.000** | **0.792** | **0.375** | **0.698** |
| 60k  | 0.667 | **1.000** | 0.750 | 0.167 | 0.646 |

Best average: **69.8 %** at the 50k checkpoint
([`steps_50000_pytorch_model.pt`](checkpoints/steps_50000_pytorch_model.pt)),
which we ship as the recommended checkpoint.
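
The averages and the choice of checkpoint can be reproduced directly from the per-task success rates in the table. This is a small check using the reported numbers only:

```python
# Per-task success rates, in table column order:
# PutCarrotOnPlate, PutEggplantInBasket, PutSpoonOnTableCloth,
# StackGreenCubeOnYellowCube.
results = {
    "40k": [0.688, 0.917, 0.750, 0.333],
    "50k": [0.625, 1.000, 0.792, 0.375],
    "60k": [0.667, 1.000, 0.750, 0.167],
}

# Unweighted mean over the four tasks.
averages = {step: sum(rates) / len(rates) for step, rates in results.items()}
best = max(averages, key=averages.get)

for step, avg in averages.items():
    print(step, f"{avg:.3f}")
print("best:", best)  # best: 50k
```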

For a comparison with other StarVLA frameworks on the same `bridge_rt_1`
mixture and protocol, see the [StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).

---

## Repository Layout

```
.
├── README.md                 # this model card
├── config.yaml               # minimal training config
├── config.full.yaml          # fully resolved training config
├── run_oxe_train.sh          # launch script used for this run
├── dataset_statistics.json   # per-dataset action/state normalisation stats
├── summary.jsonl             # training step summary
├── success_summary/          # SimplerEnv evaluation logs and plots
│   ├── success_summary.csv
│   ├── raw_success.txt
│   └── success_plot.png
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    └── ...                            # per-step evaluation logs
```

---

## How to Use

This checkpoint is consumed directly by the StarVLA training / evaluation
stack. Clone StarVLA and load the checkpoint with the framework name
`QwenPI_v3`:

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/Qwen3VL-PI_v3-Bridge-RT-1")

policy = load_framework_from_checkpoint(
    framework_name="QwenPI_v3",
    config_path=f"{ckpt_dir}/config.full.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (16 × 7)
```

For end-to-end SimplerEnv evaluation, see
[`examples/SimplerEnv`](https://github.com/starVLA/starVLA/tree/main/examples/SimplerEnv).

---

## Intended Use & Limitations

**Intended use.** Research on vision-language-action models, manipulation
policy learning, and as a baseline for π-style flow-matching action heads
on top of open-weight VLMs.

**Out-of-scope / limitations.**

- Trained only on Bridge (WidowX) + RT-1 (Google Robot) with a 7-d delta-EE
  action space — generalisation to other embodiments / action spaces is not
  guaranteed.
- Single 224 × 224 third-person view; no wrist camera, no depth.
- Evaluated only in SimplerEnv WidowX simulation; the released checkpoint
  has not been validated on real robots.
- Inherits any biases / failure modes of the underlying Qwen3-VL-4B model.
- Not safety-tuned. Do **not** deploy on physical robots without an external
  safety layer.

---

## Citation

If you use this checkpoint, please cite StarVLA:

```bibtex
@article{starvla2026,
  title   = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
  author  = {StarVLA Community},
  journal = {arXiv preprint arXiv:2604.05014},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.05014}
}
```

And the underlying VLM backbone:

```bibtex
@misc{qwen3vl,
  title  = {Qwen3-VL},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct}
}
```

## Acknowledgements

- [Qwen Team](https://huggingface.co/Qwen) for the Qwen3-VL backbone.
- [Physical Intelligence](https://www.physicalintelligence.company/) for the
  π₀ / π₀.₅ flow-matching action-head recipe that inspired `QwenPI_v3`.
- [Open X-Embodiment](https://robotics-transformer-x.github.io/) and
  [IPEC-COMMUNITY](https://huggingface.co/IPEC-COMMUNITY) for the LeRobot
  conversions of Bridge V2 and RT-1.
- [SimplerEnv](https://github.com/simpler-env/SimplerEnv) for the
  evaluation protocol.