---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
- vla
- vision-language-action
- robotics
- flow-matching
- cosmos
- gr00t
- manipulation
- libero
datasets:
- IPEC-COMMUNITY/libero_lerobot
language:
- en
base_model:
- nvidia/Cosmos-Predict2-2B-Video2World
---

# StarVLA-CosmoPredict2GR00T-LIBERO-4in1

A **Vision-Language-Action (VLA)** model from the [StarVLA](https://github.com/starVLA/starVLA)
project, built on the **Cosmos-Predict2-2B** world model as the visual backbone,
driving a **GR00T-style DiT flow-matching action head** (`CosmoPredict2GR00T`).
The model is trained on the full **LIBERO 4-in-1** benchmark (libero_10 +
libero_goal + libero_object + libero_spatial combined).

`CosmoPredict2GR00T` is StarVLA's architecture that extracts visual
world-model features from **NVIDIA Cosmos-Predict2-2B** (a video-to-world
diffusion model) and feeds them into a cross-attention DiT flow-matching
action head inspired by the GR00T N1 design:

1. **Cosmos-Predict2 visual features** – the last-layer activations of
   `Cosmos-Predict2-2B-Video2World` serve as rich spatiotemporal visual
   representations; 32 target vision tokens are extracted and passed to the
   action head.
2. **Cross-attention flow-matching DiT** – a 16-layer DiT-B with
   cross-attention (cross-attention dim 2048, interleaved self-attention,
   adaptive LayerNorm) generates action chunks via flow matching.
3. **Language conditioning via instruction tokens** – the task instruction is
   tokenised and injected into the DiT cross-attention alongside the visual
   tokens; no separate VLM backbone is used.
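
At inference time the head integrates a learned velocity field over 4
flow-matching steps to turn Gaussian noise into an 8 × 7 action chunk. Below is
a minimal sketch of that sampling loop; the module and argument names (`dit`,
`context`, `state`) are illustrative assumptions, not the StarVLA API:

```python
import torch

@torch.no_grad()
def sample_action_chunk(dit, vision_tokens, text_tokens, state,
                        num_steps=4, chunk_len=8, action_dim=7):
    """Euler-integrate the learned velocity field from noise to an action chunk."""
    # The 32 Cosmos vision tokens and the instruction tokens jointly form
    # the cross-attention context of the DiT.
    context = torch.cat([vision_tokens, text_tokens], dim=1)
    x = torch.randn(1, chunk_len, action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        v = dit(x, t, context=context, state=state)  # predicted velocity
        x = x + dt * v                               # one Euler step toward data
    return x  # (1, 8, 7): one chunk of 8 seven-dim actions
```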

---

## Model Summary

|  |  |
| --- | --- |
| **Architecture** | `CosmoPredict2GR00T` (Cosmos-Predict2 visual backbone + cross-attn FM DiT) |
| **Visual backbone** | [`Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) |
| **Action head** | Cross-attention flow-matching DiT-B (16 layers, 1024 hidden) |
| **Action chunk** | 8 steps (+ 7 future-window steps) |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |

---

## Training Data

**LIBERO 4-in-1** mixture (`libero_all`) – all four LIBERO task suites
combined into a single training stream:

| Suite | Tasks | Description |
| --- | ---: | --- |
| `libero_10` | 10 | Long-horizon tabletop manipulation |
| `libero_goal` | 10 | Goal-conditioned rearrangement |
| `libero_object` | 10 | Object-centric pick-and-place |
| `libero_spatial` | 10 | Spatially varied placement |

- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in
  [`dataset_statistics.json`](dataset_statistics.json); a usage sketch follows below.
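
For illustration, the snippet below shows how such stats are typically applied
to scale raw actions into the normalised range a flow-matching head is trained
on. The JSON schema assumed here (`libero_all` → `action` → `q01`/`q99`
quantile bounds) is a guess; verify it against the actual file:

```python
import json
import numpy as np

with open("dataset_statistics.json") as f:
    stats = json.load(f)

# Assumed schema: per-dataset 1st/99th action quantiles used for [-1, 1] scaling.
act = stats["libero_all"]["action"]
q01, q99 = np.asarray(act["q01"]), np.asarray(act["q99"])

def normalize_action(a: np.ndarray) -> np.ndarray:
    """Scale a raw 7-d delta-EE action into [-1, 1] using quantile bounds."""
    return np.clip(2.0 * (a - q01) / (q99 - q01) - 1.0, -1.0, 1.0)
```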

---

## Training Recipe

|  |  |
| --- | --- |
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 mixed precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
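
The two learning rates correspond to separate optimizer parameter groups. A
minimal sketch, assuming the policy exposes `backbone` and `action_head`
submodules (illustrative names, not the StarVLA API):

```python
import torch

optimizer = torch.optim.AdamW(
    [
        {"params": policy.backbone.parameters(), "lr": 2.5e-5},   # base / VLM LR
        {"params": policy.action_head.parameters(), "lr": 1e-4},  # action-head LR
    ],
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=1e-8,
)
```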

The exact training config is preserved in
[`config.yaml`](config.yaml), and the launch script in
[`run_libero_train.sh`](run_libero_train.sh).

---

## Evaluation – LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task per
suite); numbers are success rates (↑).

| Step | libero_goal | libero_object | libero_spatial | **Avg (3 suites)** |
| ---: | ---: | ---: | ---: | ---: |
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| **50k** | **0.944** | **0.990** | **0.906** | **0.947** |

> `libero_10` was not evaluated for this run.
> Best checkpoint: **`steps_50000_pytorch_model.pt`** – avg **94.7 %** across libero_goal / object / spatial.

For comparison with other StarVLA frameworks see the
[StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).

---

## Repository Layout

```
.
├── README.md                    # this model card
├── config.yaml                  # training config
├── run_libero_train.sh          # launch script used for this run
├── dataset_statistics.json      # per-dataset action/state normalisation stats
├── summary.jsonl                # training step summary
├── logs/                        # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                      # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```

---

## How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/StarVLA-CosmoPredict2GR00T-LIBERO-4in1")

policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
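
Once loaded, the policy returns an action chunk per query. A hedged example
call; the observation format (dict key, dtypes) is an assumption to check
against the StarVLA LIBERO examples:

```python
import numpy as np

# Dummy observation: one 224 × 224 third-person RGB frame plus a 7-d state.
images = {"image_primary": np.zeros((224, 224, 3), dtype=np.uint8)}
state = np.zeros(7, dtype=np.float32)

actions = policy.predict_action(images, "put the bowl on the plate", state)
# Expected shape: (8, 7), a chunk of 8 delta end-effector actions.
```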

For end-to-end LIBERO evaluation see
[`examples/LIBERO`](https://github.com/starVLA/starVLA/tree/main/examples/LIBERO).

---

## Intended Use & Limitations

**Intended use.** Research on vision-language-action models, LIBERO tabletop
manipulation benchmarks, and as a baseline for world-model-conditioned
action-head architectures.

**Out-of-scope / limitations.** This model is trained exclusively on LIBERO
simulation data with delta end-effector control. Real-robot transfer and
cross-embodiment generalisation have not been evaluated. Performance may
degrade on out-of-distribution scenes, objects, or instructions not present
in the LIBERO training split.