---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
- vla
- vision-language-action
- robotics
- flow-matching
- cosmos
- gr00t
- manipulation
- libero
datasets:
- IPEC-COMMUNITY/libero_lerobot
language:
- en
base_model:
- nvidia/Cosmos-Predict2-2B-Video2World
---
# StarVLA-CosmoPredict2GR00T-LIBERO-4in1
A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA)
project, built on a **Cosmos-Predict2-2B** world model as the visual backbone,
driving a **GR00T-style DiT flow-matching action head** (`CosmoPredict2GR00T`).
The model is trained on the full **LIBERO 4-in-1** benchmark (libero_10 +
libero_goal + libero_object + libero_spatial combined).
`CosmoPredict2GR00T` is StarVLA's architecture that extracts visual
world-model features from **NVIDIA Cosmos-Predict2-2B** (a video-to-world
diffusion model) and feeds them into a cross-attention DiT flow-matching
action head inspired by the GR00T N1 design:
1. **Cosmos-Predict2 visual features**: the last-layer activations of
   `Cosmos-Predict2-2B-Video2World` serve as rich spatiotemporal visual
   representations. 32 target vision tokens are extracted and passed to the
   action head.
2. **Cross-attention flow-matching DiT**: a 16-layer DiT-B with
   cross-attention (cross-attention dim 2048, interleaved self-attention,
   adaptive LayerNorm) generates action chunks via flow matching.
3. **Language conditioning via instruction tokens**: the task instruction is
   tokenised and injected into the DiT cross-attention alongside the visual
   tokens; no separate VLM backbone is used.
---
## Model Summary
| | |
| --- | --- |
| **Architecture** | `CosmoPredict2GR00T` (Cosmos-Predict2 visual backbone + cross-attn FM DiT) |
| **Visual backbone** | [`Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) |
| **Action head** | Cross-attention Flow-Matching DiT-B (16 layers, 1024 hidden) |
| **Action chunk** | 8 steps (+ 7 future-window steps) |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |
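
The four inference timesteps listed above correspond to a coarse numerical integration of the learned velocity field. The following is a minimal sketch of such a sampling loop, assuming plain Euler integration from noise at t = 0 to actions at t = 1. `velocity_fn` is a hypothetical stand-in for the conditioned DiT and `toy_velocity` is an illustrative toy field; neither is part of the StarVLA API.

```python
import random


def sample_action_chunk(velocity_fn, horizon=8, action_dim=7, num_steps=4, seed=0):
    """Integrate da/dt = v(a, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise with shape (horizon, action_dim).
    actions = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t takes the values 0, 0.25, 0.5, 0.75 for num_steps=4
        velocity = velocity_fn(actions, t)
        actions = [
            [a + dt * v for a, v in zip(a_row, v_row)]
            for a_row, v_row in zip(actions, velocity)
        ]
    return actions


def toy_velocity(actions, t):
    """Toy field (assumption): straight-line flow toward an all-zero chunk."""
    return [[(0.0 - a) / (1.0 - t) for a in row] for row in actions]


chunk = sample_action_chunk(toy_velocity)
print(len(chunk), len(chunk[0]))
```

In the real model the velocity would come from the DiT conditioned on the vision and instruction tokens; the toy field only makes the loop runnable in isolation.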
---
## Training Data
**LIBERO 4-in-1** mixture (`libero_all`): all four LIBERO task suites
combined into a single training stream:
| Suite | Tasks | Description |
| --- | ---: | --- |
| `libero_10` | 10 | Long-horizon tabletop manipulation |
| `libero_goal` | 10 | Goal-conditioned rearrangement |
| `libero_object` | 10 | Object-centric pick-and-place |
| `libero_spatial` | 10 | Spatially varied placement |
- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in
[`dataset_statistics.json`](dataset_statistics.json).
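
Normalisation statistics like these are usually applied in reverse at inference time to map model outputs back to robot units. The sketch below assumes per-dimension scaling to [-1, 1] between a lower and an upper bound; the `q01` / `q99` key names and the numeric values are hypothetical, and the actual scheme should be read from `dataset_statistics.json` and the StarVLA code.

```python
def unnormalize_action(normalized, low, high):
    """Map each action dimension from [-1, 1] back to [low, high]."""
    return [
        0.5 * (value + 1.0) * (hi - lo) + lo
        for value, lo, hi in zip(normalized, low, high)
    ]


# Hypothetical statistics for a 7-d delta end-effector action (illustrative values only).
stats = {"q01": [-0.5] * 6 + [0.0], "q99": [0.5] * 6 + [1.0]}

# A normalised action at the centre of the range, with the last dimension at its upper bound.
action = unnormalize_action([0.0] * 6 + [1.0], stats["q01"], stats["q99"])
```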
---
## Training Recipe
| | |
| --- | --- |
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
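
The flow-matching noise row describes how the training timestep is drawn. A minimal sketch follows, assuming the mapping t = (s - x) / s with x ~ Beta(α, β), a convention found in GR00T-style action heads. Whether this run uses exactly this mapping is an assumption and should be checked against `config.yaml` and the StarVLA source.

```python
import random


def sample_flow_time(rng, alpha=1.5, beta=1.0, s=0.999):
    """Draw a training timestep from a shifted, flipped Beta distribution.

    Assumed mapping: t = (s - x) / s with x ~ Beta(alpha, beta).
    Because x can exceed s, t lies in roughly [-0.001, 1].
    """
    x = rng.betavariate(alpha, beta)  # x in [0, 1]; mass skewed toward 1 for alpha > beta
    return (s - x) / s


rng = random.Random(0)
times = [sample_flow_time(rng) for _ in range(1000)]
mean_time = sum(times) / len(times)
```

With α = 1.5 and β = 1.0 the Beta mean is 0.6, so under this mapping the sampled timesteps concentrate toward the low end of the interval, with a mean of about 0.4.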
The exact training config is preserved in
[`config.yaml`](config.yaml), and the launch script in
[`run_libero_train.sh`](run_libero_train.sh).
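
To make the schedule in the table concrete, here is a minimal sketch of the learning rate over training, assuming linear warm-up followed by a single cosine half-cycle that decays to the minimum rate. The function `lr_at` is illustrative and not part of StarVLA; the scheduler actually used may differ in details.

```python
import math


def lr_at(step, peak_lr, min_lr=1e-6, warmup_steps=5_000, total_steps=80_000):
    """Learning rate at a given step: linear warm-up, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (peak_lr - min_lr) * cosine


# Action-head learning rate (peak 1e-4) at the released checkpoint steps.
for step in (30_000, 40_000, 50_000):
    print(step, lr_at(step, peak_lr=1e-4))
```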
---
## Evaluation: LIBERO 4-in-1
Evaluation follows the standard LIBERO protocol (50 trials per task in each
suite). Numbers are success rates (higher is better).
| Step | libero_goal | libero_object | libero_spatial | **Avg (3 suites)** |
| ---: | ---: | ---: | ---: | ---: |
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| **50k** | **0.944** | **0.990** | **0.906** | **0.947** |
> `libero_10` was not evaluated for this run.
> Best checkpoint: **`steps_50000_pytorch_model.pt`**, avg **94.7 %** across libero_goal / object / spatial.
For comparison with other StarVLA frameworks see the
[StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).
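
The three-suite averages in the table can be reproduced from the per-suite success rates. The short sketch below uses only the numbers reported above and assumes an unweighted mean, which is appropriate here because every suite has the same number of tasks and trials.

```python
results = {
    "30k": {"libero_goal": 0.908, "libero_object": 0.980, "libero_spatial": 0.880},
    "40k": {"libero_goal": 0.948, "libero_object": 0.990, "libero_spatial": 0.884},
    "50k": {"libero_goal": 0.944, "libero_object": 0.990, "libero_spatial": 0.906},
}

# Unweighted mean over the three evaluated suites, rounded as in the table.
averages = {
    step: round(sum(rates.values()) / len(rates), 3)
    for step, rates in results.items()
}
best_step = max(averages, key=averages.get)
print(averages, best_step)
```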
---
## Repository Layout
```
.
├── README.md                        # this model card
├── config.yaml                      # training config
├── run_libero_train.sh              # launch script used for this run
├── dataset_statistics.json          # per-dataset action/state normalisation stats
├── summary.jsonl                    # training step summary
├── logs/                            # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                          # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt # recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```
---
## How to Use
```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```
```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

# Download the full repository (config, statistics, checkpoints) to the local cache.
ckpt_dir = snapshot_download("StarVLA/WM4A-CosmoPredict-GR00T-LIBERO-4in1")

# Build the framework from the training config and load the recommended checkpoint.
policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)

# policy.predict_action(images, instruction, state) -> action chunk of shape (8, 7)
```
For end-to-end LIBERO evaluation see
[`examples/LIBERO`](https://github.com/starVLA/starVLA/tree/main/examples/LIBERO).
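
Because the policy returns a chunk of 8 actions per call, a rollout typically alternates between predicting a chunk and executing it. The following is a minimal sketch of that loop, assuming the whole chunk is executed open-loop before re-planning. `predict_chunk` and `step_env` are hypothetical callables supplied by the caller, and the stubs exist only so the sketch runs without StarVLA or LIBERO installed.

```python
def run_episode(predict_chunk, step_env, first_obs, max_steps=600):
    """Execute predicted action chunks open-loop, re-planning after each chunk.

    predict_chunk(obs) -> list of actions; step_env(action) -> (obs, done).
    Both callables are hypothetical placeholders, not StarVLA or LIBERO APIs.
    """
    obs, executed = first_obs, 0
    while executed < max_steps:
        for action in predict_chunk(obs):
            obs, done = step_env(action)
            executed += 1
            if done or executed >= max_steps:
                return executed
    return executed


def stub_policy(obs):
    """Stub policy: returns one chunk of 8 steps x 7 action dimensions."""
    return [[0.0] * 7 for _ in range(8)]


counter = {"t": 0}


def stub_env(action):
    """Stub environment: reports the episode as finished at step 20."""
    counter["t"] += 1
    return counter["t"], counter["t"] >= 20


steps_taken = run_episode(stub_policy, stub_env, first_obs=0)
```

Executing fewer actions per chunk before re-planning trades extra inference calls for more reactive control; which setting the LIBERO evaluation scripts use should be checked in the linked examples.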
---
## Intended Use & Limitations
**Intended use.** Research on vision-language-action models, LIBERO tabletop
manipulation benchmarks, and as a baseline for dual VLM + world-model
conditioning architectures.
**Out-of-scope / limitations.** This model is trained exclusively on LIBERO
simulation data, which uses a Franka Emika Panda arm with delta end-effector
control. Real-robot transfer and cross-embodiment generalisation have not been
evaluated. Performance may degrade on out-of-distribution scenes, objects, or
instructions not present in the LIBERO training split.