---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
  - vla
  - vision-language-action
  - robotics
  - flow-matching
  - cosmos
  - gr00t
  - manipulation
  - libero
datasets:
  - IPEC-COMMUNITY/libero_lerobot
language:
  - en
base_model:
  - nvidia/Cosmos-Predict2-2B-Video2World
---

# StarVLA-CosmoPredict2GR00T-LIBERO-4in1

A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA)
project, built on a **Cosmos-Predict2-2B** world model as the visual backbone,
driving a **GR00T-style DiT flow-matching action head** (`CosmoPredict2GR00T`).  
The model is trained on the full **LIBERO 4-in-1** benchmark (libero_10 +
libero_goal + libero_object + libero_spatial combined).

`CosmoPredict2GR00T` is StarVLA's architecture that extracts visual
world-model features from **NVIDIA Cosmos-Predict2-2B** (a video-to-world
diffusion model) and feeds them into a cross-attention DiT flow-matching
action head inspired by the GR00T N1 design:

1. **Cosmos-Predict2 visual features**: the last-layer activations of
   `Cosmos-Predict2-2B-Video2World` serve as rich spatiotemporal visual
   representations. 32 target vision tokens are extracted and passed to the
   action head.
2. **Cross-attention flow-matching DiT**: a 16-layer DiT-B with
   cross-attention (cross-attention dim 2048, interleaved self-attention,
   adaptive LayerNorm) generates action chunks via flow matching.
3. **Language conditioning via instruction tokens**: the task instruction is
   tokenised and injected into the DiT cross-attention alongside the visual
   tokens; no separate VLM backbone is used.

---

## Model Summary

| | |
| --- | --- |
| **Architecture** | `CosmoPredict2GR00T` (Cosmos-Predict2 visual backbone + cross-attn FM DiT) |
| **Visual backbone** | [`Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) |
| **Action head** | Cross-attention Flow-Matching DiT-B (16 layers, 1024 hidden) |
| **Action chunk** | 8 steps (+ 7 future-window steps) |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |
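
The 4 inference timesteps in the table are the number of integration steps used to turn noise into an action chunk. The following is a minimal sketch of that sampling loop, assuming plain Euler integration from t = 0 (noise) to t = 1 (actions). `toy_velocity` and `sample_chunk` are hypothetical names for illustration, not StarVLA's API, and the toy velocity field stands in for the DiT, which in the real model is conditioned on visual and instruction tokens.

```python
import random

CHUNK_LEN, ACTION_DIM, NUM_STEPS = 8, 7, 4  # values from the summary table


def toy_velocity(x, t, target):
    """Stand-in for the DiT: the velocity that carries x to `target` at t = 1."""
    return [[(g - v) / (1.0 - t) for v, g in zip(x_row, g_row)]
            for x_row, g_row in zip(x, target)]


def sample_chunk(velocity_fn, target, num_steps=NUM_STEPS, seed=0):
    """Euler-integrate the velocity field from Gaussian noise to an action chunk."""
    rng = random.Random(seed)
    # Start from noise with the shape of one action chunk (8 x 7).
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t takes the values 0, 0.25, 0.5, 0.75 for 4 steps
        v = velocity_fn(x, t, target)
        x = [[xv + dt * vv for xv, vv in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


target = [[0.1 * d for d in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]
chunk = sample_chunk(toy_velocity, target)
```

Fewer integration steps mean fewer forward passes of the action head per chunk, which is why a small step count such as 4 is attractive for closed-loop control.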

---

## Training Data

**LIBERO 4-in-1** mixture (`libero_all`): all four LIBERO task suites
combined into a single training stream:

| Suite | Tasks | Description |
| --- | ---: | --- |
| `libero_10` | 10 | Long-horizon tabletop manipulation |
| `libero_goal` | 10 | Goal-conditioned rearrangement |
| `libero_object` | 10 | Object-centric pick-and-place |
| `libero_spatial` | 10 | Spatially varied placement |

- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in
  [`dataset_statistics.json`](dataset_statistics.json).
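
Normalisation statistics are typically applied to actions before training and inverted on the model output at inference time. The sketch below assumes min-max scaling to [-1, 1] and a statistics dictionary with per-dimension `min` and `max` lists. Both the key names and the scaling rule are assumptions for illustration; check `dataset_statistics.json` and the StarVLA code for the fields and transform actually used.

```python
def normalize(action, stats):
    """Map each action dimension from [min, max] to [-1, 1] (assumed scheme)."""
    out = []
    for a, lo, hi in zip(action, stats["min"], stats["max"]):
        span = hi - lo
        # A constant dimension has no range to scale, so map it to 0.
        out.append(0.0 if span == 0 else 2.0 * (a - lo) / span - 1.0)
    return out


def unnormalize(action, stats):
    """Invert `normalize`: map each dimension from [-1, 1] back to [min, max]."""
    return [lo + 0.5 * (a + 1.0) * (hi - lo)
            for a, lo, hi in zip(action, stats["min"], stats["max"])]


# Hypothetical 7-d statistics: six delta end-effector dims plus a gripper dim.
stats = {"min": [-1.0] * 6 + [0.0], "max": [1.0] * 6 + [1.0]}
raw = [0.5, -0.25, 0.0, 1.0, -1.0, 0.1, 1.0]
restored = unnormalize(normalize(raw, stats), stats)
```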

---

## Training Recipe

| | |
| --- | --- |
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | Beta distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |

The exact training config is preserved in
[`config.yaml`](config.yaml), and the launch script in
[`run_libero_train.sh`](run_libero_train.sh).
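
The warm-up and scheduler rows can be read together as one learning-rate curve. A rough sketch follows, assuming linear warm-up and then a single half-cosine decay to the minimum learning rate over the full 80,000 steps. `lr_at` is a hypothetical helper; the curve actually used in training is whatever `config.yaml` and the scheduler implementation define.

```python
import math

TOTAL_STEPS = 80_000   # from the recipe table
WARMUP_STEPS = 5_000
MIN_LR = 1e-6


def lr_at(step, peak_lr, total_steps=TOTAL_STEPS,
          warmup_steps=WARMUP_STEPS, min_lr=MIN_LR):
    """Learning rate at `step` for linear warm-up plus cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at peak, 0 at the end
    return min_lr + (peak_lr - min_lr) * cosine


# The two parameter groups share the schedule shape but use different peaks.
for step in (0, 5_000, 30_000, 50_000, 80_000):
    print(step, lr_at(step, peak_lr=2.5e-5), lr_at(step, peak_lr=1e-4))
```

Under these assumptions the released 30k to 50k checkpoints sit partway down the decay rather than at its end.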

---

## Evaluation: LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task per
suite). Numbers are success rates (↑, higher is better).

| Step | libero_goal | libero_object | libero_spatial | **Avg (3 suites)** |
| ---: | ---: | ---: | ---: | ---: |
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| **50k** | **0.944** | **0.990** | **0.906** | **0.947** |

> `libero_10` was not evaluated for this run.  
> Best checkpoint: **`steps_50000_pytorch_model.pt`**, with an average success rate of **94.7 %** across libero_goal / object / spatial.
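
The averages in the last column are unweighted means over the three evaluated suites. A short check that recomputes them from the per-suite numbers in the table (the variable and function names are illustrative):

```python
# Per-suite success rates copied from the evaluation table.
results = {
    "30k": {"libero_goal": 0.908, "libero_object": 0.980, "libero_spatial": 0.880},
    "40k": {"libero_goal": 0.948, "libero_object": 0.990, "libero_spatial": 0.884},
    "50k": {"libero_goal": 0.944, "libero_object": 0.990, "libero_spatial": 0.906},
}


def suite_average(scores):
    """Unweighted mean of the per-suite success rates."""
    return sum(scores.values()) / len(scores)


averages = {step: round(suite_average(scores), 3) for step, scores in results.items()}
best_step = max(averages, key=averages.get)

print(averages)   # {'30k': 0.923, '40k': 0.941, '50k': 0.947}
print(best_step)  # 50k
```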

For comparison with other StarVLA frameworks see the
[StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).

---

## Repository Layout

```
.
├── README.md                 # this model card
├── config.yaml               # training config
├── run_libero_train.sh       # launch script used for this run
├── dataset_statistics.json   # per-dataset action/state normalisation stats
├── summary.jsonl             # training step summary
├── logs/                     # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                   # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```

---

## How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/Qwen3VL-CosmoPredict2GR00T-LIBERO-4in1")

policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
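
Because the policy returns a chunk of 8 actions per call, a control loop usually executes some or all of the chunk before querying the policy again. The sketch below shows one such loop with a stub policy. `StubPolicy`, `run_episode`, `get_obs`, and `env_step` are hypothetical names; only the `predict_action(images, instruction, state)` call shape is taken from the comment above, and the real evaluation loop lives in the StarVLA LIBERO example.

```python
class StubPolicy:
    """Hypothetical stand-in that mimics the described call shape."""

    def predict_action(self, images, instruction, state):
        # One chunk: 8 time steps of 7-d delta end-effector actions.
        return [[0.0] * 7 for _ in range(8)]


def run_episode(policy, env_step, get_obs, instruction,
                max_steps=40, execute_per_chunk=8):
    """Query the policy, execute part of the chunk, then re-plan."""
    steps = 0
    while steps < max_steps:
        images, state = get_obs()
        chunk = policy.predict_action(images, instruction, state)
        for action in chunk[:execute_per_chunk]:
            done = env_step(action)  # assumed to return True when the task ends
            steps += 1
            if done or steps >= max_steps:
                return steps
    return steps


queries = []


def get_obs():
    queries.append(1)          # count policy queries for illustration
    return None, [0.0] * 7     # placeholder image and 7-d state


steps_taken = run_episode(StubPolicy(), lambda action: False, get_obs,
                          "pick up the bowl", max_steps=20)
```

Setting `execute_per_chunk` below 8 re-plans more often at the cost of more forward passes; which setting works best for this model is not stated in this card.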

For end-to-end LIBERO evaluation see
[`examples/LIBERO`](https://github.com/starVLA/starVLA/tree/main/examples/LIBERO).

---

## Intended Use & Limitations

**Intended use.** Research on vision-language-action models, LIBERO tabletop
manipulation benchmarks, and as a baseline for dual VLM + world-model
conditioning architectures.

**Out-of-scope / limitations.** This model is trained exclusively on LIBERO
simulation data with delta end-effector control of LIBERO's simulated Franka
Panda arm. Real-robot transfer and cross-embodiment generalisation have not
been evaluated. Performance may degrade on out-of-distribution scenes,
objects, or instructions not present in the LIBERO training split.