WanOFT-LIBERO-4in1

A Vision-Language-Action (VLA) model from the StarVLA project. It uses Wan2.2-TI2V-5B (a large-scale text-to-video diffusion model) as its visual backbone, paired with a lightweight MLP action head; together these form the WanOFT architecture.
The model is trained on the full LIBERO 4-in-1 benchmark (libero_10 + libero_goal + libero_object + libero_spatial combined).

WanOFT is StarVLA's architecture that leverages the rich spatiotemporal features of the Wan2.2 video diffusion model as visual representations, paired with a simple yet effective MLP action head:

  1. Wan2.2 visual features — last-layer activations of Wan2.2-TI2V-5B-Diffusers provide high-quality, motion-aware visual tokens that encode scene dynamics, making them well suited for manipulation policy learning.
  2. MLP action head (OFT-style) β€” a compact 3-layer MLP (hidden dim 3072) produces action predictions directly from the Wan2.2 visual features and instruction tokens, offering fast inference with minimal overhead.
  3. Language conditioning via instruction tokens β€” the task instruction is tokenised and concatenated with the visual tokens before the MLP head; no separate VLM backbone is used.
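
The data flow above can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the StarVLA implementation: the class name, token dimensions, pooling step, and activation choice are all assumptions.

```python
import torch
from torch import nn

class OFTStyleHead(nn.Module):
    """Illustrative sketch of the WanOFT action head described above."""

    def __init__(self, token_dim: int = 3072, hidden_dim: int = 3072,
                 chunk_len: int = 8, action_dim: int = 7):
        super().__init__()
        self.chunk_len, self.action_dim = chunk_len, action_dim
        # Compact 3-layer MLP (point 2 above); hidden dim 3072 as in the recipe.
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, chunk_len * action_dim),
        )

    def forward(self, visual_tokens: torch.Tensor,
                instruction_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate Wan2.2 visual tokens with instruction tokens (point 3),
        # pool over the sequence, and regress a flat action chunk.
        tokens = torch.cat([visual_tokens, instruction_tokens], dim=1)  # (B, T, D)
        flat = self.mlp(tokens.mean(dim=1))                             # (B, 56)
        return flat.view(-1, self.chunk_len, self.action_dim)           # (B, 8, 7)

head = OFTStyleHead()
chunk = head(torch.randn(2, 256, 3072), torch.randn(2, 32, 3072))  # -> (2, 8, 7)
```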

Model Summary

| Field | Value |
|---|---|
| Architecture | WanOFT (Wan2.2 visual backbone + MLP action head) |
| Visual backbone | Wan2.2-TI2V-5B-Diffusers |
| Action head | MLP (hidden dim 3072, OFT-style) |
| Action chunk | 8 steps (+ 7 future-window steps) |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| License | MIT |
| Codebase | starVLA/starVLA |

Training Data

LIBERO 4-in-1 mixture (libero_all) β€” all four LIBERO task suites combined into a single training stream:

| Suite | Tasks | Description |
|---|---|---|
| libero_10 | 10 | Long-horizon tabletop manipulation |
| libero_goal | 10 | Goal-conditioned rearrangement |
| libero_object | 10 | Object-centric pick-and-place |
| libero_spatial | 10 | Spatially varied placement |

  • Action representation: delta end-effector (7-d, gripper included)
  • Image observation: single primary RGB view, resized to 224 Γ— 224
  • Per-dataset normalisation statistics are stored in dataset_statistics.json.
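
As a worked sketch of how such statistics are typically applied at inference time, the snippet below un-normalises a predicted chunk. The JSON keys used here ("libero_all", "action", "mean", "std") are assumptions; inspect dataset_statistics.json for the actual schema.

```python
import json

import numpy as np

# Load the per-dataset normalisation statistics shipped with this repo.
with open("dataset_statistics.json") as f:
    stats = json.load(f)

# Hypothetical schema: adjust the keys to match the real file.
action_mean = np.asarray(stats["libero_all"]["action"]["mean"])  # shape (7,)
action_std = np.asarray(stats["libero_all"]["action"]["std"])    # shape (7,)

def unnormalize_actions(chunk: np.ndarray) -> np.ndarray:
    """Map a normalised (8, 7) action chunk back to raw delta end-effector units."""
    return chunk * action_std + action_mean
```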

Training Recipe

| Setting | Value |
|---|---|
| Total steps | 800,000 (released checkpoints: 10k–60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 (mixed precision) |
| Attention impl. | SDPA |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, weight decay 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min LR 1e-6) |
| Gradient clipping | 1.0 |
| Frozen modules | none (full fine-tuning) |

The exact training config is preserved in config.yaml / config.full.yaml, and the launch script in run_libero_train.sh.
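
To make the optimisation settings concrete, here is a minimal sketch of the optimizer and scheduler using the values above. The two stand-in modules and the assumption that cosine_with_min_lr maps to the HF transformers scheduler of the same name are mine; config.full.yaml remains the authoritative source.

```python
import torch
from torch import nn
from transformers import get_scheduler

# Stand-in modules; in the real run these are the Wan2.2 backbone and the MLP head.
backbone, action_head = nn.Linear(8, 8), nn.Linear(8, 8)

optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 2.5e-5},   # base / VLM LR
        {"params": action_head.parameters(), "lr": 1e-4},  # action-head LR
    ],
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=1e-8,
)

scheduler = get_scheduler(
    "cosine_with_min_lr",
    optimizer=optimizer,
    num_warmup_steps=5_000,
    num_training_steps=800_000,
    scheduler_specific_kwargs={"min_lr": 1e-6},
)

# Each step, after backward(): clip gradients to max-norm 1.0 per the recipe.
params = list(backbone.parameters()) + list(action_head.parameters())
torch.nn.utils.clip_grad_norm_(params, 1.0)
```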


Evaluation β€” LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task per suite). Numbers are success rates (↑); Avg is the unweighted mean over the four suites.

| Step | libero_10 | libero_goal | libero_object | libero_spatial | Avg (4 suites) |
|---|---|---|---|---|---|
| 10k | 0.364 | 0.772 | 0.986 | 0.808 | 0.732 |
| 20k | 0.750 | 0.900 | 0.942 | 0.896 | 0.872 |
| 30k | 0.722 | 0.920 | 0.978 | 0.882 | 0.876 |
| 40k | 0.788 | 0.934 | 0.978 | 0.872 | 0.893 |
| 50k | 0.772 | 0.924 | 0.978 | 0.864 | 0.885 |
| 60k | 0.860 | 0.954 | 0.978 | 0.874 | 0.916 |

Best checkpoint: steps_60000_pytorch_model.pt, with an average success rate of 91.6% across all four LIBERO suites.

For comparison with other StarVLA frameworks see the StarVLA Model Zoo.


Repository Layout

```
.
├── README.md                 # this model card
├── config.yaml               # minimal training config
├── config.full.yaml          # fully resolved training config
├── run_libero_train.sh       # launch script used for this run
├── dataset_statistics.json   # per-dataset action/state normalisation stats
├── summary.jsonl             # training step summary
├── logs/                     # per-suite evaluation logs
│   ├── libero_10/
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                   # evaluation rollout videos
└── checkpoints/
    ├── steps_60000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_50000_pytorch_model.pt
    ├── steps_40000_pytorch_model.pt
    ├── steps_30000_pytorch_model.pt
    ├── steps_20000_pytorch_model.pt
    └── steps_10000_pytorch_model.pt
```
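
The full snapshot contains all six checkpoints plus the rollout videos. If only the recommended checkpoint is needed, huggingface_hub's allow_patterns argument can restrict the download; a sketch, with file paths mirroring the layout above:

```python
from huggingface_hub import snapshot_download

# Fetch only the recommended checkpoint and the files needed to load it.
ckpt_dir = snapshot_download(
    "StarVLA/WM4A-Wan2d2-OFT-LIBERO-4in1",
    allow_patterns=[
        "config.yaml",
        "dataset_statistics.json",
        "checkpoints/steps_60000_pytorch_model.pt",
    ],
)
```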

How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow the installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/WM4A-Wan2d2-OFT-LIBERO-4in1")

policy = load_framework_from_checkpoint(
    framework_name="WanOFT",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_60000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
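
Continuing from the loaded policy, a hypothetical call shaped after the comment above; the argument names, input shapes, and instruction string are assumptions, and the actual signature is defined by the StarVLA framework:

```python
import numpy as np

image = np.zeros((224, 224, 3), dtype=np.uint8)   # single 3rd-person RGB view
state = np.zeros(7, dtype=np.float32)             # 7-d proprioceptive state

actions = policy.predict_action(
    images=[image],
    instruction="put the bowl on the plate",
    state=state,
)
# actions: an (8, 7) delta end-effector action chunk.
```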

For end-to-end LIBERO evaluation see examples/LIBERO.


Intended Use & Limitations

Intended use. Research on vision-language-action models, LIBERO tabletop manipulation benchmarks, and as a baseline for large video diffusion model features in VLA architectures.

Out-of-scope / limitations. This model is trained exclusively on LIBERO simulation data with Franka-style delta end-effector control. Real-robot transfer and cross-embodiment generalisation have not been evaluated. Performance may degrade on out-of-distribution scenes, objects, or instructions not present in the LIBERO training split.
