StarVLA-CosmoPredict2GR00T-LIBERO-4in1

A Vision-Language-Action (VLA) model from the StarVLA project, built on a Cosmos-Predict2-2B world model as the visual backbone, driving a GR00T-style DiT flow-matching action head (CosmoPredict2GR00T).
The model is trained on the full LIBERO 4-in-1 benchmark (libero_10 + libero_goal + libero_object + libero_spatial combined).

CosmoPredict2GR00T is StarVLA's architecture that extracts visual world-model features from NVIDIA Cosmos-Predict2-2B (a video-to-world diffusion model) and feeds them into a cross-attention DiT flow-matching action head inspired by the GR00T N1 design (a shape-level sketch follows the list):

  1. Cosmos-Predict2 visual features β€” the last-layer activations of Cosmos-Predict2-2B-Video2World serve as rich spatiotemporal visual representations. 32 target vision tokens are extracted and passed to the action head.
  2. Cross-attention flow-matching DiT β€” a 16-layer DiT-B with cross-attention (cross-attention dim 2048, interleaved self-attention, adaptive LayerNorm) generates action chunks via flow matching.
  3. Language conditioning via instruction tokens β€” the task instruction is tokenised and injected into the DiT cross-attention alongside the visual tokens; no separate VLM backbone is used.
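
To make the data flow concrete, here is a shape-level PyTorch sketch using the dimensions quoted above (32 vision tokens, cross-attention dim 2048, 16 DiT layers, hidden size 1024, 8 × 7 action chunks). All class and variable names (`CrossAttnDiTBlock`, `ActionHeadSketch`, etc.) are hypothetical illustrations, not the StarVLA implementation:

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """One DiT block: self-attention over action tokens, cross-attention into
    the (vision + instruction) context, then an MLP; adaLN conditioning is
    simplified here to an additive timestep shift."""
    def __init__(self, hidden=1024, ctx_dim=2048, heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True,
                                                kdim=ctx_dim, vdim=ctx_dim)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(hidden) for _ in range(3))
        self.ada = nn.Linear(hidden, hidden)   # timestep conditioning (simplified adaLN)

    def forward(self, x, ctx, t_emb):
        x = x + self.ada(t_emb)[:, None, :]
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]        # interleaved self-attn
        h = self.norm2(x)
        x = x + self.cross_attn(h, ctx, ctx, need_weights=False)[0]   # attend to vision + text
        return x + self.mlp(self.norm3(x))

class ActionHeadSketch(nn.Module):
    def __init__(self, layers=16, hidden=1024, ctx_dim=2048, act_dim=7, chunk=8):
        super().__init__()
        self.in_proj = nn.Linear(act_dim, hidden)
        self.t_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(),
                                     nn.Linear(hidden, hidden))
        self.blocks = nn.ModuleList(CrossAttnDiTBlock(hidden, ctx_dim)
                                    for _ in range(layers))
        self.out_proj = nn.Linear(hidden, act_dim)

    def forward(self, noisy_actions, ctx, t):
        # noisy_actions: (B, 8, 7), ctx: (B, 32 + n_text, 2048), t: (B,)
        x = self.in_proj(noisy_actions)
        t_emb = self.t_embed(t[:, None])
        for blk in self.blocks:
            x = blk(x, ctx, t_emb)
        return self.out_proj(x)   # predicted flow velocity, (B, 8, 7)

vision = torch.randn(2, 32, 2048)   # Cosmos-Predict2 last-layer vision tokens
text = torch.randn(2, 12, 2048)     # instruction tokens, assumed projected to ctx dim
ctx = torch.cat([vision, text], dim=1)
head = ActionHeadSketch()
v = head(torch.randn(2, 8, 7), ctx, torch.rand(2))
print(v.shape)                      # torch.Size([2, 8, 7])
```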

Model Summary

| Field | Value |
|---|---|
| Architecture | CosmoPredict2GR00T (Cosmos-Predict2 visual backbone + cross-attention flow-matching DiT) |
| Visual backbone | Cosmos-Predict2-2B-Video2World |
| Action head | Cross-attention flow-matching DiT-B (16 layers, 1024 hidden) |
| Action chunk | 8 steps (+ 7 future-window steps) |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| Inference timesteps | 4 (flow matching) |
| License | MIT |
| Codebase | starVLA/starVLA |
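
The "4 inference timesteps" row corresponds to integrating the learned flow from noise to an action chunk in four steps. Below is a minimal sampler assuming plain Euler integration of a velocity-predicting head (it reuses the hypothetical `head` and `ctx` from the sketch above; the actual StarVLA sampler may differ):

```python
import torch

def sample_actions(head, ctx, steps=4, chunk=8, act_dim=7):
    x = torch.randn(ctx.size(0), chunk, act_dim)    # Gaussian noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((ctx.size(0),), i * dt)
        v = head(x, ctx, t)                         # predicted velocity field
        x = x + dt * v                              # Euler step toward the data at t = 1
    return x                                        # normalised action chunk, (B, 8, 7)

actions = sample_actions(head, ctx)
```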

Training Data

LIBERO 4-in-1 mixture (libero_all) β€” all four LIBERO task suites combined into a single training stream:

| Suite | Tasks | Description |
|---|---|---|
| libero_10 | 10 | Long-horizon tabletop manipulation |
| libero_goal | 10 | Goal-conditioned rearrangement |
| libero_object | 10 | Object-centric pick-and-place |
| libero_spatial | 10 | Spatially varied placement |

  • Action representation: delta end-effector (7-d, gripper included)
  • Image observation: single primary RGB view, resized to 224 Γ— 224
  • Per-dataset normalisation statistics are stored in dataset_statistics.json (see the sketch below).
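
A hypothetical sketch of how those statistics might be applied, assuming OpenVLA-style q01/q99 scaling of actions to [-1, 1]; the key names and nesting are assumptions, so consult dataset_statistics.json itself for the real schema:

```python
import json
import numpy as np

with open("dataset_statistics.json") as f:
    stats = json.load(f)["libero_all"]["action"]   # nesting/key names are assumptions

q01, q99 = np.array(stats["q01"]), np.array(stats["q99"])

def normalise(a: np.ndarray) -> np.ndarray:
    """Raw delta end-effector action -> model space in [-1, 1]."""
    return 2.0 * (a - q01) / (q99 - q01 + 1e-8) - 1.0

def unnormalise(a: np.ndarray) -> np.ndarray:
    """Model output in [-1, 1] -> raw delta end-effector action."""
    return 0.5 * (a + 1.0) * (q99 - q01) + q01
```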

Training Recipe

| Setting | Value |
|---|---|
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 mixed precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / visual backbone) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min LR 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | Beta distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
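
For intuition, the schedule named above (linear warm-up, then cosine decay to a floor) behaves roughly as in this sketch; the authoritative implementation is the cosine_with_min_lr scheduler referenced by config.yaml:

```python
import math

def lr_at(step, base_lr=2.5e-5, min_lr=1e-6, warmup=5_000, total=80_000):
    if step < warmup:
        return base_lr * step / warmup                 # linear warm-up
    progress = (step - warmup) / (total - warmup)      # 0 -> 1 over the decay phase
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))   # 1 -> 0 cosine
    return min_lr + (base_lr - min_lr) * cos

print(lr_at(5_000), lr_at(80_000))   # 2.5e-05 at end of warm-up, 1e-06 at the end
```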

The exact training config is preserved in config.yaml, and the launch script in run_libero_train.sh.
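
The flow-matching objective implied by the recipe can be sketched as follows, assuming GR00T/pi0-style Beta-distributed timesteps scaled by s = 0.999 and linear interpolation paths between noise and actions (reusing the hypothetical `head` from earlier; the authoritative settings live in config.yaml):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(head, ctx, actions, alpha=1.5, beta=1.0, s=0.999):
    B = actions.size(0)
    t = s * torch.distributions.Beta(alpha, beta).sample((B,))  # Beta-distributed timesteps
    noise = torch.randn_like(actions)
    t_ = t[:, None, None]
    x_t = (1.0 - t_) * noise + t_ * actions     # linear path from noise (t=0) to data (t=1)
    target_v = actions - noise                  # ground-truth velocity along the path
    pred_v = head(x_t, ctx, t)
    return F.mse_loss(pred_v, target_v)
```

"Repeated diffusion steps: 8" presumably means each action chunk is paired with 8 independent (t, noise) draws per update; the sketch shows a single draw.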


Evaluation β€” LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task per suite). Numbers are success rates (↑).

| Step | libero_goal | libero_object | libero_spatial | Avg (3 suites) |
|---|---|---|---|---|
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| 50k | 0.944 | 0.990 | 0.906 | 0.947 |

libero_10 was not evaluated for this run.
Best checkpoint: steps_50000_pytorch_model.pt β€” avg 94.7 % across libero_goal / object / spatial.
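
The 3-suite average is a uniform mean of the per-suite rates; for the 50k checkpoint:

```python
rates = {"libero_goal": 0.944, "libero_object": 0.990, "libero_spatial": 0.906}
print(round(sum(rates.values()) / len(rates), 3))   # 0.947
```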

For comparison with other StarVLA frameworks see the StarVLA Model Zoo.


Repository Layout

```
.
├── README.md                 # this model card
├── config.yaml               # training config
├── run_libero_train.sh       # launch script used for this run
├── dataset_statistics.json   # per-dataset action/state normalisation stats
├── summary.jsonl             # training step summary
├── logs/                     # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                   # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```

How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow the installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

# Download this model repository from the Hugging Face Hub.
ckpt_dir = snapshot_download("StarVLA/WM4A-CosmoPredict-GR00T-LIBERO-4in1")

# Rebuild the CosmoPredict2GR00T framework and load the recommended checkpoint.
policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
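
A minimal call might look like the following; the exact predict_action signature is defined by the StarVLA framework, so the argument names here are assumptions based on the comment above:

```python
import numpy as np

# Hypothetical invocation; argument names/types are assumptions, check the
# StarVLA framework for the real predict_action signature.
image = np.zeros((224, 224, 3), dtype=np.uint8)   # single 3rd-person RGB view
state = np.zeros(7, dtype=np.float32)             # 7-d proprioceptive state
chunk = policy.predict_action(
    images=[image],
    instruction="put the bowl on the plate",
    state=state,
)
print(chunk.shape)  # expected: (8, 7) chunk of delta end-effector actions
```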

For end-to-end LIBERO evaluation see examples/LIBERO.


Intended Use & Limitations

Intended use. Research on vision-language-action models, LIBERO tabletop manipulation benchmarks, and as a baseline for conditioning flow-matching action heads on world-model features.

Out-of-scope / limitations. This model is trained exclusively on LIBERO simulation data with delta end-effector control (LIBERO's simulated Franka arm). Real-robot transfer and cross-embodiment generalisation have not been evaluated. Performance may degrade on out-of-distribution scenes, objects, or instructions not present in the LIBERO training split.
