---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
- vla
- vision-language-action
- robotics
- flow-matching
- cosmos
- gr00t
- manipulation
- libero
datasets:
- IPEC-COMMUNITY/libero_lerobot
language:
- en
base_model:
- nvidia/Cosmos-Predict2-2B-Video2World
---
# StarVLA-CosmoPredict2GR00T-LIBERO-4in1
**StarVLA-CosmoPredict2GR00T-LIBERO-4in1** is a **Vision-Language-Action (VLA)**
model from the [StarVLA](https://github.com/starVLA/starVLA) project. It uses a
**Cosmos-Predict2-2B** world model as the visual backbone to drive a
**GR00T-style DiT flow-matching action head** (`CosmoPredict2GR00T`), and is
trained on the full **LIBERO 4-in-1** benchmark (libero_10 + libero_goal +
libero_object + libero_spatial combined).
`CosmoPredict2GR00T` is StarVLA's architecture that extracts visual
world-model features from **NVIDIA Cosmos-Predict2-2B** (a video-to-world
diffusion model) and feeds them into a cross-attention DiT flow-matching
action head inspired by the GR00T N1 design:
1. **Cosmos-Predict2 visual features** — the last-layer activations of
`Cosmos-Predict2-2B-Video2World` serve as rich spatiotemporal visual
representations. 32 target vision tokens are extracted and passed to the
action head.
2. **Cross-attention flow-matching DiT** — a 16-layer DiT-B with
cross-attention (cross-attention dim 2048, interleaved self-attention,
adaptive LayerNorm) generates action chunks via flow matching; a sampler
sketch follows this list.
3. **Language conditioning via instruction tokens** — the task instruction is
tokenised and injected into the DiT cross-attention alongside the visual
tokens; no separate VLM backbone is used.
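At inference time the action head integrates a learned velocity field over 4
Euler steps. The sketch below illustrates that sampler, assuming a hypothetical
`velocity_net(x, t, context)` wrapper around the cross-attention DiT; the
function name and the plain Euler integrator are illustrative rather than the
exact StarVLA implementation.
```python
import torch

@torch.no_grad()
def sample_action_chunk(velocity_net, context, chunk_len=8, action_dim=7, num_steps=4):
    """Integrate the learned velocity field from noise (t=0) to actions (t=1).

    `context` holds the 32 Cosmos-Predict2 vision tokens plus the tokenised
    instruction; num_steps=4 matches the inference timesteps listed below.
    """
    x = torch.randn(1, chunk_len, action_dim)  # start the chunk from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((1,), k * dt)           # current flow time in [0, 1)
        v = velocity_net(x, t, context)        # DiT predicts the velocity dx/dt
        x = x + dt * v                         # Euler step toward the data
    return x                                   # (1, 8, 7) action chunk
```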
---
## Model Summary
| | |
| --- | --- |
| **Architecture** | `CosmoPredict2GR00T` (Cosmos-Predict2 visual backbone + cross-attn FM DiT) |
| **Visual backbone** | [`Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) |
| **Action head** | Cross-attention Flow-Matching DiT-B (16 layers, 1024 hidden) |
| **Action chunk** | 8 steps (+ 7 future-window steps) |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |
---
## Training Data
**LIBERO 4-in-1** mixture (`libero_all`) — all four LIBERO task suites
combined into a single training stream:
| Suite | Tasks | Description |
| --- | ---: | --- |
| `libero_10` | 10 | Long-horizon tabletop manipulation |
| `libero_goal` | 10 | Goal-conditioned rearrangement |
| `libero_object` | 10 | Object-centric pick-and-place |
| `libero_spatial` | 10 | Spatially varied placement |
- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in
[`dataset_statistics.json`](dataset_statistics.json) and applied as sketched
below.
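A minimal sketch of applying those statistics at inference time; the JSON keys
used here (`libero_all`, `action`, `mean`, `std`) are assumptions about the
file's schema, not confirmed field names.
```python
import json

import numpy as np

# Load the per-dataset stats shipped with this repo (schema assumed, see above).
with open("dataset_statistics.json") as f:
    stats = json.load(f)["libero_all"]["action"]  # hypothetical keys

mean = np.asarray(stats["mean"])  # (7,) per-dimension action mean
std = np.asarray(stats["std"])    # (7,) per-dimension action std

def unnormalize(action_chunk: np.ndarray) -> np.ndarray:
    """Map normalised model outputs back to delta end-effector commands."""
    return action_chunk * std + mean
```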
---
## Training Recipe
| | |
| --- | --- |
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 mixed precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (visual backbone) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α=1.5, β=1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
The exact training config is preserved in
[`config.yaml`](config.yaml), and the launch script in
[`run_libero_train.sh`](run_libero_train.sh).
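The flow-matching noise row above refers to sampling training timesteps from a
Beta prior. Below is one plausible reading as a hedged sketch; the exact mapping
from the Beta sample to the timestep (simply t = s · u here) is an assumption,
not the confirmed StarVLA formula.
```python
import torch

def sample_flow_timesteps(batch_size: int, alpha: float = 1.5, beta: float = 1.0,
                          s: float = 0.999) -> torch.Tensor:
    """Draw flow-matching training timesteps in [0, s) from Beta(alpha, beta).

    With alpha=1.5, beta=1.0 the density rises toward 1, biasing training
    toward timesteps near the data end of the flow; s keeps t strictly < 1.
    """
    u = torch.distributions.Beta(alpha, beta).sample((batch_size,))
    return s * u
```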
---
## Evaluation β€” LIBERO 4-in-1
Evaluation follows the standard LIBERO protocol (50 rollout trials per task).
Numbers are success rates (↑).
| Step | libero_goal | libero_object | libero_spatial | **Avg (3 suites)** |
| ---: | ---: | ---: | ---: | ---: |
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| **50k** | **0.944** | **0.990** | **0.906** | **0.947** |
> `libero_10` was not evaluated for this run.
> Best checkpoint: **`steps_50000_pytorch_model.pt`** — avg **94.7%** across libero_goal / object / spatial.
For comparison with other StarVLA frameworks see the
[StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).
---
## Repository Layout
```
.
├── README.md                   # this model card
├── config.yaml                 # training config
├── run_libero_train.sh         # launch script used for this run
├── dataset_statistics.json     # per-dataset action/state normalisation stats
├── summary.jsonl               # training step summary
├── logs/                       # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                     # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```
---
## How to Use
```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```
```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint
ckpt_dir = snapshot_download("StarVLA/Qwen3VL-CosmoPredict2GR00T-LIBERO-4in1")
policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
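Continuing from the snippet above, a hypothetical single rollout step. The
argument names follow the comment in the snippet, but the exact
`predict_action` signature should be checked against the StarVLA codebase.
```python
import numpy as np

image = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder 3rd-person RGB frame
state = np.zeros(7, dtype=np.float32)            # end-effector state incl. gripper

actions = policy.predict_action(
    images=[image],                              # single primary camera view
    instruction="put the bowl on the plate",     # free-form task instruction
    state=state,
)
# expected shape: (8, 7) delta end-effector action chunk
```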
For end-to-end LIBERO evaluation see
[`examples/LIBERO`](https://github.com/starVLA/starVLA/tree/main/examples/LIBERO).
---
## Intended Use & Limitations
**Intended use.** Research on vision-language-action models and LIBERO
tabletop manipulation benchmarks, and use as a baseline for dual VLM +
world-model conditioning architectures.
**Out-of-scope / limitations.** This model is trained exclusively on LIBERO
simulation data with Franka-style delta end-effector control. Real-robot
transfer and cross-embodiment generalisation have not been evaluated.
Performance may degrade on out-of-distribution scenes, objects, or
instructions not present in the LIBERO training split.