---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
- vla
- vision-language-action
- robotics
- flow-matching
- cosmos
- gr00t
- manipulation
- libero
datasets:
- IPEC-COMMUNITY/libero_lerobot
language:
- en
base_model:
- nvidia/Cosmos-Predict2-2B-Video2World
---
# StarVLA-CosmoPredict2GR00T-LIBERO-4in1
A Vision-Language-Action (VLA) model from the StarVLA
project, built on a Cosmos-Predict2-2B world model as the visual backbone,
driving a GR00T-style DiT flow-matching action head (CosmoPredict2GR00T).
The model is trained on the full LIBERO 4-in-1 benchmark (libero_10 +
libero_goal + libero_object + libero_spatial combined).
CosmoPredict2GR00T is StarVLA's architecture that extracts visual
world-model features from NVIDIA Cosmos-Predict2-2B (a video-to-world
diffusion model) and feeds them into a cross-attention DiT flow-matching
action head inspired by the GR00T N1 design:

- Cosmos-Predict2 visual features – the last-layer activations of Cosmos-Predict2-2B-Video2World serve as rich spatiotemporal visual representations; 32 target vision tokens are extracted and passed to the action head.
- Cross-attention flow-matching DiT – a 16-layer DiT-B with cross-attention (cross-attention dim 2048, interleaved self-attention, adaptive LayerNorm) generates action chunks via flow matching.
- Language conditioning via instruction tokens – the task instruction is tokenised and injected into the DiT cross-attention alongside the visual tokens; no separate VLM backbone is used.
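
A minimal PyTorch sketch of this data flow is shown below. Module names, batch/text-token sizes, and the block layout are illustrative assumptions (the card only states the 32 vision tokens, 16 layers, 1024 hidden size, and 2048 cross-attention dim); it is not the StarVLA implementation.

```python
import torch
import torch.nn as nn

B, N_VIS, N_TXT, CHUNK, ACT_DIM, COND_DIM, HID = 2, 32, 20, 8, 7, 2048, 1024

class CrossAttnDiTBlock(nn.Module):
    """One block: self-attention over action tokens, then cross-attention to the
    concatenated vision + instruction tokens (adaptive LayerNorm omitted for brevity)."""
    def __init__(self, hid, cond_dim):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hid, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hid, 8, kdim=cond_dim, vdim=cond_dim,
                                                batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hid, 4 * hid), nn.GELU(), nn.Linear(4 * hid, hid))
        self.n1, self.n2, self.n3 = nn.LayerNorm(hid), nn.LayerNorm(hid), nn.LayerNorm(hid)

    def forward(self, x, cond):
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.n2(x), cond, cond)[0]
        return x + self.mlp(self.n3(x))

# 1) Visual conditioning: 32 target vision tokens taken from the last layer of
#    Cosmos-Predict2-2B-Video2World (feature width here is a placeholder).
vision_tokens = torch.randn(B, N_VIS, COND_DIM)
# 2) Language conditioning: tokenised instruction embeddings, projected to the same width.
text_tokens = torch.randn(B, N_TXT, COND_DIM)
cond = torch.cat([vision_tokens, text_tokens], dim=1)

# 3) Flow-matching action head: a noisy action chunk is mapped to a velocity prediction.
noisy_actions = torch.randn(B, CHUNK, ACT_DIM)
x = nn.Linear(ACT_DIM, HID)(noisy_actions)
for _ in range(16):                              # 16 DiT-B layers
    x = CrossAttnDiTBlock(HID, COND_DIM)(x, cond)
velocity = nn.Linear(HID, ACT_DIM)(x)            # predicted flow, shape (B, 8, 7)
```
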
## Model Summary
| Field | Value |
|---|---|
| Architecture | CosmoPredict2GR00T (Cosmos-Predict2 visual backbone + cross-attn FM DiT) |
| Visual backbone | Cosmos-Predict2-2B-Video2World |
| Action head | Cross-attention Flow-Matching DiT-B (16 layers, 1024 hidden) |
| Action chunk | 8 steps (+ 7 future-window steps) |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| Inference timesteps | 4 (flow matching) |
| License | MIT |
| Codebase | starVLA/starVLA |
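
At inference, the action chunk is produced by integrating the learned velocity field from Gaussian noise over the 4 flow-matching timesteps listed above. The following is a generic Euler-sampler sketch, not the actual StarVLA sampler; `velocity_fn` is a placeholder for the DiT action head.

```python
import torch

def sample_action_chunk(velocity_fn, batch=1, chunk=8, act_dim=7, steps=4):
    x = torch.randn(batch, chunk, act_dim)          # start from Gaussian noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch,), i * dt)            # current flow time in [0, 1)
        x = x + dt * velocity_fn(x, t)              # Euler step along the predicted velocity
    return x                                        # denoised 8 x 7 action chunk

# Example with a dummy velocity field:
actions = sample_action_chunk(lambda x, t: -x)
print(actions.shape)                                # torch.Size([1, 8, 7])
```
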
## Training Data
LIBERO 4-in-1 mixture (libero_all) – all four LIBERO task suites
combined into a single training stream:

| Suite | Tasks | Description |
|---|---|---|
| libero_10 | 10 | Long-horizon tabletop manipulation |
| libero_goal | 10 | Goal-conditioned rearrangement |
| libero_object | 10 | Object-centric pick-and-place |
| libero_spatial | 10 | Spatially varied placement |

- Action representation: delta end-effector (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in dataset_statistics.json (see the sketch below)
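
A sketch of how these per-dataset statistics could be applied at train or inference time. The JSON schema (dataset key, "state"/"action" groups with "mean"/"std" fields) is an assumption for illustration; check dataset_statistics.json in this repo for the actual layout.

```python
import json
import numpy as np

with open("dataset_statistics.json") as f:
    stats = json.load(f)

def normalize(x, s):
    # z-score normalisation with the stored per-dimension statistics (assumed schema)
    return (np.asarray(x) - np.asarray(s["mean"])) / (np.asarray(s["std"]) + 1e-8)

def denormalize(x, s):
    # map model outputs back to the raw delta end-effector action space
    return np.asarray(x) * np.asarray(s["std"]) + np.asarray(s["mean"])

suite_stats = stats["libero_all"]                              # hypothetical dataset key
state_in = normalize(np.zeros(7), suite_stats["state"])        # 7-d proprio state
action_out = denormalize(np.zeros((8, 7)), suite_stats["action"])  # 8 x 7 action chunk
```
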
## Training Recipe
| Hyperparameter | Value |
|---|---|
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |

The exact training config is preserved in config.yaml, and the launch script in run_libero_train.sh.
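
For reference, the flow-matching training objective under the noise settings above (timesteps from a Beta(1.5, 1.0) distribution scaled by s = 0.999, repeated 8 times per sample) can be sketched as follows. The linear interpolation path and velocity target are a generic formulation, not the exact StarVLA loss code.

```python
import torch

def flow_matching_loss(velocity_fn, actions, repeats=8, alpha=1.5, beta=1.0, s=0.999):
    """actions: (B, 8, 7) ground-truth chunk; velocity_fn(x_t, t) -> predicted velocity."""
    losses = []
    for _ in range(repeats):                                   # "repeated diffusion steps"
        b = actions.shape[0]
        # Timesteps drawn from Beta(alpha, beta), scaled by s so t stays below 1.
        t = s * torch.distributions.Beta(alpha, beta).sample((b,))
        noise = torch.randn_like(actions)
        t_ = t.view(b, 1, 1)
        x_t = (1.0 - t_) * noise + t_ * actions                # linear interpolation path
        target = actions - noise                               # velocity of the linear path
        losses.append(((velocity_fn(x_t, t) - target) ** 2).mean())
    return torch.stack(losses).mean()

# Example with a dummy velocity field:
dummy_actions = torch.randn(4, 8, 7)
loss = flow_matching_loss(lambda x, t: torch.zeros_like(x), dummy_actions)
```
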
## Evaluation – LIBERO 4-in-1
Evaluation follows the standard LIBERO protocol (50 trials per task per suite). Numbers are success rates (↑).

| Step | libero_goal | libero_object | libero_spatial | Avg (3 suites) |
|---|---|---|---|---|
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| 50k | 0.944 | 0.990 | 0.906 | 0.947 |

libero_10 was not evaluated for this run.

Best checkpoint: steps_50000_pytorch_model.pt – avg 94.7% across libero_goal / object / spatial.
For comparison with other StarVLA frameworks see the StarVLA Model Zoo.
## Repository Layout
```
.
├── README.md                      # this model card
├── config.yaml                    # training config
├── run_libero_train.sh            # launch script used for this run
├── dataset_statistics.json        # per-dataset action/state normalisation stats
├── summary.jsonl                  # training step summary
├── logs/                          # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                        # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```
## How to Use
```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/StarVLA-CosmoPredict2GR00T-LIBERO-4in1")

policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)

# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
For end-to-end LIBERO evaluation see
examples/LIBERO.
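
A hypothetical single-step rollout with the loaded policy is sketched below; the predict_action keyword names and exact input formats are assumptions, so consult the evaluation code for the confirmed interface.

```python
import numpy as np

image = np.zeros((224, 224, 3), dtype=np.uint8)        # single 3rd-person RGB view, 224 x 224
state = np.zeros(7, dtype=np.float32)                   # 7-d proprio / end-effector state
instruction = "put the bowl on the plate"

# Keyword names are assumed for illustration.
actions = policy.predict_action(images=[image], instruction=instruction, state=state)
print(actions.shape)                                    # expected to be an (8, 7) action chunk
```
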
## Intended Use & Limitations
Intended use. Research on vision-language-action models, LIBERO tabletop manipulation benchmarks, and as a baseline for dual VLM + world-model conditioning architectures.
Out-of-scope / limitations. This model is trained exclusively on LIBERO simulation data with WidowX-style delta end-effector control. Real-robot transfer and cross-embodiment generalisation have not been evaluated. Performance may degrade on out-of-distribution scenes, objects, or instructions not present in the LIBERO training split.