---
license: mit
library_name: starVLA
pipeline_tag: robotics
tags:
- vla
- vision-language-action
- robotics
- flow-matching
- cosmos
- gr00t
- manipulation
- libero
datasets:
- IPEC-COMMUNITY/libero_lerobot
language:
- en
base_model:
- nvidia/Cosmos-Predict2-2B-Video2World
---
# StarVLA-CosmoPredict2GR00T-LIBERO-4in1
**StarVLA-CosmoPredict2GR00T-LIBERO-4in1** is a **Vision-Language-Action (VLA)**
model from the [StarVLA](https://github.com/starVLA/starVLA) project. It uses a
**Cosmos-Predict2-2B** world model as the visual backbone to drive a
**GR00T-style DiT flow-matching action head** (`CosmoPredict2GR00T`), and is
trained on the full **LIBERO 4-in-1** benchmark (libero_10 + libero_goal +
libero_object + libero_spatial combined).
`CosmoPredict2GR00T` is StarVLA's architecture that extracts visual
world-model features from **NVIDIA Cosmos-Predict2-2B** (a video-to-world
diffusion model) and feeds them into a cross-attention DiT flow-matching
action head inspired by the GR00T N1 design:
1. **Cosmos-Predict2 visual features** — the last-layer activations of
`Cosmos-Predict2-2B-Video2World` serve as rich spatiotemporal visual
representations. 32 target vision tokens are extracted and passed to the
action head.
2. **Cross-attention flow-matching DiT** — a 16-layer DiT-B with
cross-attention (cross-attention dim 2048, interleaved self-attention,
adaptive LayerNorm) generates action chunks via flow matching; a sampler
sketch follows this list.
3. **Language conditioning via instruction tokens** — the task instruction is
tokenised and injected into the DiT cross-attention alongside the visual
tokens; no separate VLM backbone is used.
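At inference time the action head integrates a learned velocity field over 4
Euler steps. The sketch below illustrates that sampler, assuming a hypothetical
`velocity_net(x, t, context)` wrapper around the cross-attention DiT; the
function name and the plain Euler integrator are illustrative rather than the
exact StarVLA implementation.
```python
import torch

@torch.no_grad()
def sample_action_chunk(velocity_net, context, chunk_len=8, action_dim=7, num_steps=4):
    """Integrate the learned velocity field from noise (t=0) to actions (t=1).

    `context` holds the 32 Cosmos-Predict2 vision tokens plus the tokenised
    instruction; num_steps=4 matches the inference timesteps listed below.
    """
    x = torch.randn(1, chunk_len, action_dim)  # start the chunk from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((1,), k * dt)           # current flow time in [0, 1)
        v = velocity_net(x, t, context)        # DiT predicts the velocity dx/dt
        x = x + dt * v                         # Euler step toward the data
    return x                                   # (1, 8, 7) action chunk
```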
---
## Model Summary
| | |
| --- | --- |
| **Architecture** | `CosmoPredict2GR00T` (Cosmos-Predict2 visual backbone + cross-attn FM DiT) |
| **Visual backbone** | [`Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) |
| **Action head** | Cross-attention Flow-Matching DiT-B (16 layers, 1024 hidden) |
| **Action chunk** | 8 steps (+ 7 future-window steps) |
| **Action / state dim** | 7 / 7 (delta end-effector) |
| **Image resolution** | 224 × 224, single 3rd-person view |
| **Inference timesteps** | 4 (flow matching) |
| **License** | MIT |
| **Codebase** | [starVLA/starVLA](https://github.com/starVLA/starVLA) |
---
## Training Data
**LIBERO 4-in-1** mixture (`libero_all`) — all four LIBERO task suites
combined into a single training stream:
| Suite | Tasks | Description |
| --- | ---: | --- |
| `libero_10` | 10 | Long-horizon tabletop manipulation |
| `libero_goal` | 10 | Goal-conditioned rearrangement |
| `libero_object` | 10 | Object-centric pick-and-place |
| `libero_spatial` | 10 | Spatially varied placement |
- Action representation: **delta end-effector** (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in
[`dataset_statistics.json`](dataset_statistics.json) and applied as sketched
below.
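A minimal sketch of applying those statistics at inference time; the JSON keys
used here (`libero_all`, `action`, `mean`, `std`) are assumptions about the
file's schema, not confirmed field names.
```python
import json

import numpy as np

# Load the per-dataset stats shipped with this repo (schema assumed, see above).
with open("dataset_statistics.json") as f:
    stats = json.load(f)["libero_all"]["action"]  # hypothetical keys

mean = np.asarray(stats["mean"])  # (7,) per-dimension action mean
std = np.asarray(stats["std"])    # (7,) per-dimension action std

def unnormalize(action_chunk: np.ndarray) -> np.ndarray:
    """Map normalised model outputs back to delta end-effector commands."""
    return action_chunk * std + mean
```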
---
## Training Recipe
| | |
| --- | --- |
| Total steps | 80,000 (released checkpoints: 30k / 40k / 50k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 mixed precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (visual backbone) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | `cosine_with_min_lr` (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α=1.5, β=1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
The exact training config is preserved in
[`config.yaml`](config.yaml), and the launch script in
[`run_libero_train.sh`](run_libero_train.sh).
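The flow-matching noise row above refers to sampling training timesteps from a
Beta prior. Below is one plausible reading as a hedged sketch; the exact mapping
from the Beta sample to the timestep (simply t = s · u here) is an assumption,
not the confirmed StarVLA formula.
```python
import torch

def sample_flow_timesteps(batch_size: int, alpha: float = 1.5, beta: float = 1.0,
                          s: float = 0.999) -> torch.Tensor:
    """Draw flow-matching training timesteps in [0, s) from Beta(alpha, beta).

    With alpha=1.5, beta=1.0 the density rises toward 1, biasing training
    toward timesteps near the data end of the flow; s keeps t strictly < 1.
    """
    u = torch.distributions.Beta(alpha, beta).sample((batch_size,))
    return s * u
```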
---
## Evaluation β€” LIBERO 4-in-1
Evaluation follows the standard LIBERO protocol (50 rollout trials per task).
Numbers are success rates (↑).
| Step | libero_goal | libero_object | libero_spatial | **Avg (3 suites)** |
| ---: | ---: | ---: | ---: | ---: |
| 30k | 0.908 | 0.980 | 0.880 | 0.923 |
| 40k | 0.948 | 0.990 | 0.884 | 0.941 |
| **50k** | **0.944** | **0.990** | **0.906** | **0.947** |
> `libero_10` was not evaluated for this run.
> Best checkpoint: **`steps_50000_pytorch_model.pt`** — avg **94.7%** across libero_goal / object / spatial.
For comparison with other StarVLA frameworks see the
[StarVLA Model Zoo](https://github.com/starVLA/starVLA/blob/main/docs/model_zoo.md).
---
## Repository Layout
```
.
├── README.md                   # this model card
├── config.yaml                 # training config
├── run_libero_train.sh         # launch script used for this run
├── dataset_statistics.json     # per-dataset action/state normalisation stats
├── summary.jsonl               # training step summary
├── logs/                       # per-suite evaluation logs
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                     # evaluation rollout videos
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_40000_pytorch_model.pt
    └── steps_30000_pytorch_model.pt
```
---
## How to Use
```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```
```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint
ckpt_dir = snapshot_download("StarVLA/Qwen3VL-CosmoPredict2GR00T-LIBERO-4in1")
policy = load_framework_from_checkpoint(
    framework_name="CosmoPredict2GR00T",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
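Continuing from the snippet above, a hypothetical single rollout step. The
argument names follow the comment in the snippet, but the exact
`predict_action` signature should be checked against the StarVLA codebase.
```python
import numpy as np

image = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder 3rd-person RGB frame
state = np.zeros(7, dtype=np.float32)            # end-effector state incl. gripper

actions = policy.predict_action(
    images=[image],                              # single primary camera view
    instruction="put the bowl on the plate",     # free-form task instruction
    state=state,
)
# expected shape: (8, 7) delta end-effector action chunk
```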
For end-to-end LIBERO evaluation see
[`examples/LIBERO`](https://github.com/starVLA/starVLA/tree/main/examples/LIBERO).
---
## Intended Use & Limitations
**Intended use.** Research on vision-language-action models and LIBERO
tabletop manipulation benchmarks, and use as a baseline for dual VLM +
world-model conditioning architectures.
**Out-of-scope / limitations.** This model is trained exclusively on LIBERO
simulation data with Franka-style delta end-effector control. Real-robot
transfer and cross-embodiment generalisation have not been evaluated.
Performance may degrade on out-of-distribution scenes, objects, or
instructions not present in the LIBERO training split.