---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
  - vla
  - vision-language-action
  - robotics
  - flow-matching
  - cosmos
  - gr00t
  - manipulation
  - libero
datasets:
  - IPEC-COMMUNITY/libero_lerobot
language:
  - en
base_model:
  - nvidia/Cosmos-Predict2-2B-Video2World
---

# StarVLA-CosmoPredict2GR00T-LIBERO-4in1

A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA)
project, built on a **Cosmos-Predict2-2B** world model as the visual backbone,
driving a **GR00T-style DiT flow-matching action head** (`CosmoPredict2GR00T`).  
The model is trained on the full **LIBERO 4-in-1** benchmark (libero_10 +
libero_goal + libero_object + libero_spatial combined).

`CosmoPredict2GR00T` is StarVLA's architecture that extracts visual
world-model features from **NVIDIA Cosmos-Predict2-2B** (a video-to-world
diffusion model) and feeds them into a cross-attention DiT flow-matching
action head inspired by the GR00T N1 design:

1. **Cosmos-Predict2 visual features** — the last-layer activations of
   `Cosmos-Predict2-2B-Video2World` serve as rich spatiotemporal visual
   representations. 32 target vision tokens are extracted and passed to the
   action head.
2. **Cross-attention flow-matching DiT** — a 16-layer DiT-B with
   cross-attention (cross-attention dim 2048, interleaved self-attention,
   adaptive LayerNorm) generates action chunks via flow matching.
3. **Language conditioning via instruction tokens** — the task instruction is
   tokenised and injected into the DiT cross-attention alongside the visual
   tokens; no separate VLM backbone is used.
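
As a rough illustration of how a flow-matching head turns noise into an action chunk, the sketch below integrates a velocity field with four Euler steps, matching the inference setting listed in the Model Summary. `velocity_fn` and `toy_velocity` are hypothetical stand-ins for the DiT, and the time convention (noise at t = 0, actions at t = 1) is an assumption rather than something taken from the StarVLA code.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon=8, action_dim=7, num_steps=4, rng=None):
    """Euler-integrate da/dt = v(a, t) from t = 0 (noise) to t = 1 (actions)."""
    rng = np.random.default_rng() if rng is None else rng
    actions = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        actions = actions + dt * velocity_fn(actions, t)
    return actions

# Toy velocity field (assumption): the straight-line path from the current
# sample to a fixed target chunk. The real model predicts this with the DiT,
# conditioned on visual and instruction tokens.
target = np.zeros((8, 7))

def toy_velocity(actions, t):
    return (target - actions) / (1.0 - t)

chunk = sample_action_chunk(toy_velocity, rng=np.random.default_rng(0))
```

With this toy field the four Euler steps land on `target`, which makes the integrator easy to check in isolation before swapping in a learned velocity network.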

---

## Model Summary

| | |
| --- | --- |
| **Architecture** | `CosmoPredict2GR00T` (Cosmos-Predict2 visual backbone + cross-attn FM DiT) |
| **Visual backbone** | [`Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) |
| **Action head** | Cross-attention Flow-Matching DiT-B (16 layers, 1024 hidden) |
| **Action chunk** | 8 steps (+ 7 future-window steps) |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |

---

## Training Data

**LIBERO 4-in-1** mixture (`libero_all`) — all four LIBERO task suites
combined into a single training stream:

| Suite | Tasks | Description |
| --- | ---: | --- |
| `libero_10` | 10 | Long-horizon tabletop manipulation |
| `libero_goal` | 10 | Goal-conditioned rearrangement |
| `libero_object` | 10 | Object-centric pick-and-place |
| `libero_spatial` | 10 | Spatially varied placement |

- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in
  [`dataset_statistics.json`](dataset_statistics.json).
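
A policy trained on normalised actions needs its outputs mapped back to the robot's action range before execution. The sketch below shows one common scheme, a per-dimension linear map from [-1, 1] to [low, high]. The `low` / `high` keys and the values used here are hypothetical; check `dataset_statistics.json` for the actual field names and the scheme StarVLA applies.

```python
import numpy as np

def unnormalize_actions(normalized, low, high):
    """Map actions from [-1, 1] back to [low, high], independently per dimension."""
    normalized = np.clip(np.asarray(normalized, dtype=np.float64), -1.0, 1.0)
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    return 0.5 * (normalized + 1.0) * (high - low) + low

# Hypothetical statistics for a 7-d action (6 pose deltas + gripper).
stats = {"low": [-1.0] * 6 + [0.0], "high": [1.0] * 6 + [1.0]}

normalized_chunk = np.zeros((8, 7))  # stand-in for a predicted 8 x 7 chunk
restored = unnormalize_actions(normalized_chunk, stats["low"], stats["high"])
```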

---

## Training Recipe

| | |
| --- | --- |
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | Beta distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |

The exact training config is preserved in
[`config.yaml`](config.yaml), and the launch script in
[`run_libero_train.sh`](run_libero_train.sh).
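
To make the schedule in the table concrete, the sketch below computes the learning rate at a given step, assuming linear warm-up over 5,000 steps followed by a half-cosine decay to the 1e-6 floor at step 80,000. `lr_at_step` is a hypothetical helper written for illustration; it assumes both parameter groups decay to the same floor, which may differ from what the training code does.

```python
import math

def lr_at_step(step, base_lr, total_steps=80_000, warmup_steps=5_000, min_lr=1e-6):
    """Linear warm-up, then cosine decay from base_lr down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

# Learning rate of the action head (peak 1e-4) at the released checkpoints.
head_lrs = {step: lr_at_step(step, 1e-4) for step in (30_000, 40_000, 50_000)}
```

Under these assumptions the released checkpoints (30k to 50k) sit in the middle of the decay, well before the schedule reaches its floor.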

---

## Evaluation — LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task in each
suite). Numbers are success rates (↑).

| Step | libero_goal | libero_object | libero_spatial | **Avg (3 suites)** |
| ---: | ---: | ---: | ---: | ---: |
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| **50k** | **0.944** | **0.990** | **0.906** | **0.947** |

> `libero_10` was not evaluated for this run.  
> Best checkpoint: **`steps_50000_pytorch_model.pt`** — avg **94.7 %** across libero_goal / object / spatial.
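
The averages in the table are unweighted means over the three evaluated suites. Because each suite has 10 tasks run for 50 trials (500 episodes), this equals the pooled success rate. The short check below recomputes them from the per-suite numbers above.

```python
# Per-suite success rates copied from the evaluation table.
results = {
    30_000: {"libero_goal": 0.908, "libero_object": 0.980, "libero_spatial": 0.880},
    40_000: {"libero_goal": 0.948, "libero_object": 0.990, "libero_spatial": 0.884},
    50_000: {"libero_goal": 0.944, "libero_object": 0.990, "libero_spatial": 0.906},
}

episodes_per_suite = 10 * 50  # 10 tasks x 50 trials

# Unweighted mean over suites, rounded to three decimals as in the table.
averages = {step: round(sum(r.values()) / len(r), 3) for step, r in results.items()}
best_step = max(averages, key=averages.get)
```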

For comparison with other StarVLA frameworks see the
[StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).

---

## Repository Layout

```
.
├── README.md                 # this model card
├── config.yaml               # training config
├── run_libero_train.sh       # launch script used for this run
├── dataset_statistics.json   # per-dataset action/state normalisation stats
├── summary.jsonl             # training step summary
├── logs/                     # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                   # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```

---

## How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/Qwen3VL-CosmoPredict2GR00T-LIBERO-4in1")

policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
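
A chunked policy is typically run in a receding-horizon loop: predict a chunk, execute some or all of its actions, then re-plan from a fresh observation. The sketch below illustrates that pattern with a stand-in policy. `DummyPolicy`, `run_episode`, `get_observation`, and `apply_action` are hypothetical names, and the `predict_action` call shape is copied from the comment above rather than verified against the StarVLA API.

```python
import numpy as np

class DummyPolicy:
    """Stand-in that mimics the assumed call shape; not the StarVLA class."""
    def predict_action(self, images, instruction, state):
        return np.zeros((8, 7))  # one chunk: 8 steps x 7 action dims

def run_episode(policy, get_observation, apply_action, instruction,
                max_steps=40, execute_per_chunk=8):
    """Predict a chunk, execute up to `execute_per_chunk` actions, then re-plan."""
    steps = 0
    while steps < max_steps:
        images, state = get_observation()
        chunk = policy.predict_action(images, instruction, state)
        for action in chunk[:execute_per_chunk]:
            apply_action(action)
            steps += 1
            if steps >= max_steps:
                break
    return steps

# Fake environment hooks: a blank 224 x 224 RGB frame and a zero 7-d state.
executed = []
observe = lambda: (np.zeros((1, 224, 224, 3), dtype=np.uint8), np.zeros(7))
n_steps = run_episode(DummyPolicy(), observe, executed.append,
                      "put the bowl on the plate", max_steps=20)
```

Lowering `execute_per_chunk` makes the loop re-plan more often at the cost of more forward passes; the value used in the LIBERO evaluation scripts is not stated here.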

For end-to-end LIBERO evaluation see
[`examples/LIBERO`](https://github.com/starVLA/starVLA/tree/main/examples/LIBERO).

---

## Intended Use & Limitations

**Intended use.** Research on vision-language-action models, LIBERO tabletop
manipulation benchmarks, and as a baseline for dual VLM + world-model
conditioning architectures.

**Out-of-scope / limitations.** This model is trained exclusively on LIBERO
simulation data, which uses a Franka Emika Panda arm with delta end-effector
control. Real-robot transfer and cross-embodiment generalisation have not been
evaluated. Performance may degrade on out-of-distribution scenes, objects, or
instructions not present in the LIBERO training split.