StarVLA-CosmoPredict2GR00T-LIBERO-4in1

A Vision-Language-Action (VLA) model from the StarVLA project, built on a Cosmos-Predict2-2B world model as the visual backbone, driving a GR00T-style DiT flow-matching action head (CosmoPredict2GR00T).
The model is trained on the full LIBERO 4-in-1 benchmark (libero_10 + libero_goal + libero_object + libero_spatial combined).

CosmoPredict2GR00T is StarVLA's architecture that extracts visual world-model features from NVIDIA Cosmos-Predict2-2B (a video-to-world diffusion model) and feeds them into a cross-attention DiT flow-matching action head inspired by the GR00T N1 design (a shape-level sketch follows the list):

  1. Cosmos-Predict2 visual features β€” the last-layer activations of Cosmos-Predict2-2B-Video2World serve as rich spatiotemporal visual representations. 32 target vision tokens are extracted and passed to the action head.
  2. Cross-attention flow-matching DiT β€” a 16-layer DiT-B with cross-attention (cross-attention dim 2048, interleaved self-attention, adaptive LayerNorm) generates action chunks via flow matching.
  3. Language conditioning via instruction tokens β€” the task instruction is tokenised and injected into the DiT cross-attention alongside the visual tokens; no separate VLM backbone is used.
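
To make the data flow concrete, here is a shape-level PyTorch sketch using the dimensions quoted above (32 vision tokens, cross-attention dim 2048, 16 DiT layers, hidden size 1024, 8 × 7 action chunks). All class and variable names (`CrossAttnDiTBlock`, `ActionHeadSketch`, etc.) are hypothetical illustrations, not the StarVLA implementation:

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """One DiT block: self-attention over action tokens, cross-attention into
    the (vision + instruction) context, then an MLP; adaLN conditioning is
    simplified here to an additive timestep shift."""
    def __init__(self, hidden=1024, ctx_dim=2048, heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True,
                                                kdim=ctx_dim, vdim=ctx_dim)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(hidden) for _ in range(3))
        self.ada = nn.Linear(hidden, hidden)   # timestep conditioning (simplified adaLN)

    def forward(self, x, ctx, t_emb):
        x = x + self.ada(t_emb)[:, None, :]
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]        # interleaved self-attn
        h = self.norm2(x)
        x = x + self.cross_attn(h, ctx, ctx, need_weights=False)[0]   # attend to vision + text
        return x + self.mlp(self.norm3(x))

class ActionHeadSketch(nn.Module):
    def __init__(self, layers=16, hidden=1024, ctx_dim=2048, act_dim=7, chunk=8):
        super().__init__()
        self.in_proj = nn.Linear(act_dim, hidden)
        self.t_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(),
                                     nn.Linear(hidden, hidden))
        self.blocks = nn.ModuleList(CrossAttnDiTBlock(hidden, ctx_dim)
                                    for _ in range(layers))
        self.out_proj = nn.Linear(hidden, act_dim)

    def forward(self, noisy_actions, ctx, t):
        # noisy_actions: (B, 8, 7), ctx: (B, 32 + n_text, 2048), t: (B,)
        x = self.in_proj(noisy_actions)
        t_emb = self.t_embed(t[:, None])
        for blk in self.blocks:
            x = blk(x, ctx, t_emb)
        return self.out_proj(x)   # predicted flow velocity, (B, 8, 7)

vision = torch.randn(2, 32, 2048)   # Cosmos-Predict2 last-layer vision tokens
text = torch.randn(2, 12, 2048)     # instruction tokens, assumed projected to ctx dim
ctx = torch.cat([vision, text], dim=1)
head = ActionHeadSketch()
v = head(torch.randn(2, 8, 7), ctx, torch.rand(2))
print(v.shape)                      # torch.Size([2, 8, 7])
```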

Model Summary

| Field | Value |
|---|---|
| Architecture | CosmoPredict2GR00T (Cosmos-Predict2 visual backbone + cross-attention flow-matching DiT) |
| Visual backbone | Cosmos-Predict2-2B-Video2World |
| Action head | Cross-attention flow-matching DiT-B (16 layers, 1024 hidden) |
| Action chunk | 8 steps (+ 7 future-window steps) |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| Inference timesteps | 4 (flow matching) |
| License | MIT |
| Codebase | starVLA/starVLA |
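
The "4 inference timesteps" row corresponds to integrating the learned flow from noise to an action chunk in four steps. Below is a minimal sampler assuming plain Euler integration of a velocity-predicting head (it reuses the hypothetical `head` and `ctx` from the sketch above; the actual StarVLA sampler may differ):

```python
import torch

def sample_actions(head, ctx, steps=4, chunk=8, act_dim=7):
    x = torch.randn(ctx.size(0), chunk, act_dim)    # Gaussian noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((ctx.size(0),), i * dt)
        v = head(x, ctx, t)                         # predicted velocity field
        x = x + dt * v                              # Euler step toward the data at t = 1
    return x                                        # normalised action chunk, (B, 8, 7)

actions = sample_actions(head, ctx)
```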

Training Data

LIBERO 4-in-1 mixture (libero_all) β€” all four LIBERO task suites combined into a single training stream:

| Suite | Tasks | Description |
|---|---|---|
| libero_10 | 10 | Long-horizon tabletop manipulation |
| libero_goal | 10 | Goal-conditioned rearrangement |
| libero_object | 10 | Object-centric pick-and-place |
| libero_spatial | 10 | Spatially varied placement |

  • Action representation: delta end-effector (7-d, gripper included)
  • Image observation: single primary RGB view, resized to 224 Γ— 224
  • Per-dataset normalisation statistics are stored in dataset_statistics.json (see the sketch below).
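
A hypothetical sketch of how those statistics might be applied, assuming OpenVLA-style q01/q99 scaling of actions to [-1, 1]; the key names and nesting are assumptions, so consult dataset_statistics.json itself for the real schema:

```python
import json
import numpy as np

with open("dataset_statistics.json") as f:
    stats = json.load(f)["libero_all"]["action"]   # nesting/key names are assumptions

q01, q99 = np.array(stats["q01"]), np.array(stats["q99"])

def normalise(a: np.ndarray) -> np.ndarray:
    """Raw delta end-effector action -> model space in [-1, 1]."""
    return 2.0 * (a - q01) / (q99 - q01 + 1e-8) - 1.0

def unnormalise(a: np.ndarray) -> np.ndarray:
    """Model output in [-1, 1] -> raw delta end-effector action."""
    return 0.5 * (a + 1.0) * (q99 - q01) + q01
```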

Training Recipe

| Setting | Value |
|---|---|
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 mixed precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / visual backbone) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min LR 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | Beta distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
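
For intuition, the schedule named above (linear warm-up, then cosine decay to a floor) behaves roughly as in this sketch; the authoritative implementation is the cosine_with_min_lr scheduler referenced by config.yaml:

```python
import math

def lr_at(step, base_lr=2.5e-5, min_lr=1e-6, warmup=5_000, total=80_000):
    if step < warmup:
        return base_lr * step / warmup                 # linear warm-up
    progress = (step - warmup) / (total - warmup)      # 0 -> 1 over the decay phase
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))   # 1 -> 0 cosine
    return min_lr + (base_lr - min_lr) * cos

print(lr_at(5_000), lr_at(80_000))   # 2.5e-05 at end of warm-up, 1e-06 at the end
```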

The exact training config is preserved in config.yaml, and the launch script in run_libero_train.sh.
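
The flow-matching objective implied by the recipe can be sketched as follows, assuming GR00T/pi0-style Beta-distributed timesteps scaled by s = 0.999 and linear interpolation paths between noise and actions (reusing the hypothetical `head` from earlier; the authoritative settings live in config.yaml):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(head, ctx, actions, alpha=1.5, beta=1.0, s=0.999):
    B = actions.size(0)
    t = s * torch.distributions.Beta(alpha, beta).sample((B,))  # Beta-distributed timesteps
    noise = torch.randn_like(actions)
    t_ = t[:, None, None]
    x_t = (1.0 - t_) * noise + t_ * actions     # linear path from noise (t=0) to data (t=1)
    target_v = actions - noise                  # ground-truth velocity along the path
    pred_v = head(x_t, ctx, t)
    return F.mse_loss(pred_v, target_v)
```

"Repeated diffusion steps: 8" presumably means each action chunk is paired with 8 independent (t, noise) draws per update; the sketch shows a single draw.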


Evaluation β€” LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task per suite). Numbers are success rates (↑).

| Step | libero_goal | libero_object | libero_spatial | Avg (3 suites) |
|---|---|---|---|---|
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| 50k | 0.944 | 0.990 | 0.906 | 0.947 |

libero_10 was not evaluated for this run.
Best checkpoint: steps_50000_pytorch_model.pt β€” avg 94.7 % across libero_goal / object / spatial.
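
The 3-suite average is a uniform mean of the per-suite rates; for the 50k checkpoint:

```python
rates = {"libero_goal": 0.944, "libero_object": 0.990, "libero_spatial": 0.906}
print(round(sum(rates.values()) / len(rates), 3))   # 0.947
```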

For comparison with other StarVLA frameworks see the StarVLA Model Zoo.


Repository Layout

```
.
├── README.md                 # this model card
├── config.yaml               # training config
├── run_libero_train.sh       # launch script used for this run
├── dataset_statistics.json   # per-dataset action/state normalisation stats
├── summary.jsonl             # training step summary
├── logs/                     # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                   # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```

How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow the installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

# Download this model repository from the Hugging Face Hub.
ckpt_dir = snapshot_download("StarVLA/WM4A-CosmoPredict-GR00T-LIBERO-4in1")

# Rebuild the CosmoPredict2GR00T framework and load the recommended checkpoint.
policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
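
A minimal call might look like the following; the exact predict_action signature is defined by the StarVLA framework, so the argument names here are assumptions based on the comment above:

```python
import numpy as np

# Hypothetical invocation; argument names/types are assumptions, check the
# StarVLA framework for the real predict_action signature.
image = np.zeros((224, 224, 3), dtype=np.uint8)   # single 3rd-person RGB view
state = np.zeros(7, dtype=np.float32)             # 7-d proprioceptive state
chunk = policy.predict_action(
    images=[image],
    instruction="put the bowl on the plate",
    state=state,
)
print(chunk.shape)  # expected: (8, 7) chunk of delta end-effector actions
```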

For end-to-end LIBERO evaluation see examples/LIBERO.


Intended Use & Limitations

Intended use. Research on vision-language-action models, LIBERO tabletop manipulation benchmarks, and as a baseline for conditioning flow-matching action heads on world-model features.

Out-of-scope / limitations. This model is trained exclusively on LIBERO simulation data with delta end-effector control (LIBERO's simulated Franka arm). Real-robot transfer and cross-embodiment generalisation have not been evaluated. Performance may degrade on out-of-distribution scenes, objects, or instructions not present in the LIBERO training split.
