---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
- vla
- vision-language-action
- robotics
- flow-matching
- cosmos
- gr00t
- manipulation
- libero
datasets:
- IPEC-COMMUNITY/libero_lerobot
language:
- en
base_model:
- nvidia/Cosmos-Predict2-2B-Video2World
---

# StarVLA-CosmoPredict2GR00T-LIBERO-4in1

A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA) project, built on a **Cosmos-Predict2-2B** world model as the visual backbone, driving a **GR00T-style DiT flow-matching action head** (`CosmoPredict2GR00T`). The model is trained on the full **LIBERO 4-in-1** benchmark (libero_10 + libero_goal + libero_object + libero_spatial combined).

`CosmoPredict2GR00T` is StarVLA's architecture that extracts visual world-model features from **NVIDIA Cosmos-Predict2-2B** (a video-to-world diffusion model) and feeds them into a cross-attention DiT flow-matching action head inspired by the GR00T N1 design:

1. **Cosmos-Predict2 visual features** — the last-layer activations of `Cosmos-Predict2-2B-Video2World` serve as rich spatiotemporal visual representations. 32 target vision tokens are extracted and passed to the action head.
2. **Cross-attention flow-matching DiT** — a 16-layer DiT-B with cross-attention (cross-attention dim 2048, interleaved self-attention, adaptive LayerNorm) generates action chunks via flow matching.
3. **Language conditioning via instruction tokens** — the task instruction is tokenised and injected into the DiT cross-attention alongside the visual tokens; no separate VLM backbone is used.

---

## Model Summary

| | |
| --- | --- |
| **Architecture** | `CosmoPredict2GR00T` (Cosmos-Predict2 visual backbone + cross-attn FM DiT) |
| **Visual backbone** | [`Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) |
| **Action head** | Cross-attention Flow-Matching DiT-B (16 layers, 1024 hidden) |
| **Action chunk** | 8 steps (+ 7 future-window steps) |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |

---

## Training Data

**LIBERO 4-in-1** mixture (`libero_all`) — all four LIBERO task suites combined into a single training stream:

| Suite | Tasks | Description |
| --- | ---: | --- |
| `libero_10` | 10 | Long-horizon tabletop manipulation |
| `libero_goal` | 10 | Goal-conditioned rearrangement |
| `libero_object` | 10 | Object-centric pick-and-place |
| `libero_spatial` | 10 | Spatially varied placement |

- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in [`dataset_statistics.json`](dataset_statistics.json).
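The statistics in `dataset_statistics.json` are what map the model's normalised outputs back to raw delta end-effector commands. Below is a minimal sketch of that un-normalisation step; the key names (`libero_all`, `action`, `q01`, `q99`) and the [-1, 1] normalisation range are assumptions about the file layout, not a documented StarVLA interface.

```python
import json
import numpy as np

# Hypothetical sketch: un-normalise a predicted action chunk using the
# per-dataset statistics shipped with this checkpoint. The key names
# ("libero_all", "action", "q01", "q99") are assumed, not documented.
with open("dataset_statistics.json") as f:
    stats = json.load(f)["libero_all"]["action"]

lo = np.array(stats["q01"])  # per-dimension lower bound (7-d)
hi = np.array(stats["q99"])  # per-dimension upper bound (7-d)

def unnormalise(chunk: np.ndarray) -> np.ndarray:
    """Map an (8, 7) action chunk from [-1, 1] back to raw delta-EE commands."""
    return 0.5 * (chunk + 1.0) * (hi - lo) + lo
```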
---

## Training Recipe

| | |
| --- | --- |
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α=1.5, β=1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |

The exact training config is preserved in [`config.yaml`](config.yaml), and the launch script in [`run_libero_train.sh`](run_libero_train.sh).

---

## Evaluation — LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task per suite). Numbers are success rates (↑).

| Step | libero_goal | libero_object | libero_spatial | **Avg (3 suites)** |
| ---: | ---: | ---: | ---: | ---: |
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| **50k** | **0.944** | **0.990** | **0.906** | **0.947** |

> `libero_10` was not evaluated for this run.
> Best checkpoint: **`steps_50000_pytorch_model.pt`** — avg **94.7 %** across libero_goal / object / spatial.

For comparison with other StarVLA frameworks see the [StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).

---

## Repository Layout

```
.
├── README.md                     # this model card
├── config.yaml                   # training config
├── run_libero_train.sh           # launch script used for this run
├── dataset_statistics.json       # per-dataset action/state normalisation stats
├── summary.jsonl                 # training step summary
├── logs/                         # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                       # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```

---

## How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/StarVLA-CosmoPredict2GR00T-LIBERO-4in1")

policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)

# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```

For end-to-end LIBERO evaluation see [`examples/LIBERO`](https://github.com/starVLA/starVLA/tree/main/examples/LIBERO).

---

## Intended Use & Limitations

**Intended use.** Research on vision-language-action models, LIBERO tabletop manipulation benchmarks, and as a baseline for world-model-conditioned VLA architectures.

**Out-of-scope / limitations.** This model is trained exclusively on LIBERO simulation data with delta end-effector control of a Franka Panda arm. Real-robot transfer and cross-embodiment generalisation have not been evaluated. Performance may degrade on out-of-distribution scenes, objects, or instructions not present in the LIBERO training split.
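---

## Rollout Sketch (Illustrative)

As a complement to the loading snippet in *How to Use*, the sketch below shows one way the 8 × 7 action chunks from `policy.predict_action` could be consumed in a LIBERO-style closed-loop rollout. The environment helper and observation keys (`make_libero_env`, `agentview_rgb`, `robot_state`) are hypothetical placeholders; the maintained evaluation loop lives in [`examples/LIBERO`](https://github.com/starVLA/starVLA/tree/main/examples/LIBERO).

```python
# Hypothetical rollout loop: `make_libero_env` and the observation keys are
# placeholders, not part of the LIBERO or StarVLA APIs.
env = make_libero_env(suite="libero_goal", task_id=0)
obs = env.reset()
instruction = env.task_description

done, step = False, 0
while not done and step < 600:
    chunk = policy.predict_action(
        images=obs["agentview_rgb"],   # single 224 × 224 third-person view
        instruction=instruction,
        state=obs["robot_state"],      # 7-d proprioceptive state
    )                                  # -> (8, 7) delta end-effector actions
    # Execute the whole chunk open-loop before querying the policy again.
    for action in chunk:
        obs, reward, done, info = env.step(action)
        step += 1
        if done:
            break
```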