WanOFT-LIBERO-4in1

A Vision-Language-Action (VLA) model from the StarVLA project. It uses Wan2.2-TI2V-5B (a large-scale text-to-video diffusion model) as its visual backbone, paired with a lightweight MLP action head; together these form the WanOFT architecture.
The model is trained on the full LIBERO 4-in-1 benchmark (libero_10 + libero_goal + libero_object + libero_spatial combined).

WanOFT is StarVLA's architecture that leverages the rich spatiotemporal features of the Wan2.2 video diffusion model as visual representations, paired with a simple yet effective MLP action head:

  1. Wan2.2 visual features — last-layer activations of Wan2.2-TI2V-5B-Diffusers provide high-quality, motion-aware visual tokens that encode scene dynamics, making them well suited for manipulation policy learning.
  2. MLP action head (OFT-style) β€” a compact 3-layer MLP (hidden dim 3072) produces action predictions directly from the Wan2.2 visual features and instruction tokens, offering fast inference with minimal overhead.
  3. Language conditioning via instruction tokens β€” the task instruction is tokenised and concatenated with the visual tokens before the MLP head; no separate VLM backbone is used.
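
The data flow above can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the StarVLA implementation: the class name, token dimensions, pooling step, and activation choice are all assumptions.

```python
import torch
from torch import nn

class OFTStyleHead(nn.Module):
    """Illustrative sketch of the WanOFT action head described above."""

    def __init__(self, token_dim: int = 3072, hidden_dim: int = 3072,
                 chunk_len: int = 8, action_dim: int = 7):
        super().__init__()
        self.chunk_len, self.action_dim = chunk_len, action_dim
        # Compact 3-layer MLP (point 2 above); hidden dim 3072 as in the recipe.
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, chunk_len * action_dim),
        )

    def forward(self, visual_tokens: torch.Tensor,
                instruction_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate Wan2.2 visual tokens with instruction tokens (point 3),
        # pool over the sequence, and regress a flat action chunk.
        tokens = torch.cat([visual_tokens, instruction_tokens], dim=1)  # (B, T, D)
        flat = self.mlp(tokens.mean(dim=1))                             # (B, 56)
        return flat.view(-1, self.chunk_len, self.action_dim)           # (B, 8, 7)

head = OFTStyleHead()
chunk = head(torch.randn(2, 256, 3072), torch.randn(2, 32, 3072))  # -> (2, 8, 7)
```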

Model Summary

| Field | Value |
|---|---|
| Architecture | WanOFT (Wan2.2 visual backbone + MLP action head) |
| Visual backbone | Wan2.2-TI2V-5B-Diffusers |
| Action head | MLP (hidden dim 3072, OFT-style) |
| Action chunk | 8 steps (+ 7 future-window steps) |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| License | MIT |
| Codebase | starVLA/starVLA |

Training Data

LIBERO 4-in-1 mixture (libero_all) β€” all four LIBERO task suites combined into a single training stream:

| Suite | Tasks | Description |
|---|---|---|
| libero_10 | 10 | Long-horizon tabletop manipulation |
| libero_goal | 10 | Goal-conditioned rearrangement |
| libero_object | 10 | Object-centric pick-and-place |
| libero_spatial | 10 | Spatially varied placement |

  • Action representation: delta end-effector (7-d, gripper included)
  • Image observation: single primary RGB view, resized to 224 Γ— 224
  • Per-dataset normalisation statistics are stored in dataset_statistics.json.
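
As a worked sketch of how such statistics are typically applied at inference time, the snippet below un-normalises a predicted chunk. The JSON keys used here ("libero_all", "action", "mean", "std") are assumptions; inspect dataset_statistics.json for the actual schema.

```python
import json

import numpy as np

# Load the per-dataset normalisation statistics shipped with this repo.
with open("dataset_statistics.json") as f:
    stats = json.load(f)

# Hypothetical schema: adjust the keys to match the real file.
action_mean = np.asarray(stats["libero_all"]["action"]["mean"])  # shape (7,)
action_std = np.asarray(stats["libero_all"]["action"]["std"])    # shape (7,)

def unnormalize_actions(chunk: np.ndarray) -> np.ndarray:
    """Map a normalised (8, 7) action chunk back to raw delta end-effector units."""
    return chunk * action_std + action_mean
```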

Training Recipe

| Setting | Value |
|---|---|
| Total steps | 800,000 (released checkpoints: 10k–60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 (mixed precision) |
| Attention impl. | SDPA |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, weight decay 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min LR 1e-6) |
| Gradient clipping | 1.0 |
| Frozen modules | none (full fine-tuning) |

The exact training config is preserved in config.yaml / config.full.yaml, and the launch script in run_libero_train.sh.
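
To make the optimisation settings concrete, here is a minimal sketch of the optimizer and scheduler using the values above. The two stand-in modules and the assumption that cosine_with_min_lr maps to the HF transformers scheduler of the same name are mine; config.full.yaml remains the authoritative source.

```python
import torch
from torch import nn
from transformers import get_scheduler

# Stand-in modules; in the real run these are the Wan2.2 backbone and the MLP head.
backbone, action_head = nn.Linear(8, 8), nn.Linear(8, 8)

optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 2.5e-5},   # base / VLM LR
        {"params": action_head.parameters(), "lr": 1e-4},  # action-head LR
    ],
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=1e-8,
)

scheduler = get_scheduler(
    "cosine_with_min_lr",
    optimizer=optimizer,
    num_warmup_steps=5_000,
    num_training_steps=800_000,
    scheduler_specific_kwargs={"min_lr": 1e-6},
)

# Each step, after backward(): clip gradients to max-norm 1.0 per the recipe.
params = list(backbone.parameters()) + list(action_head.parameters())
torch.nn.utils.clip_grad_norm_(params, 1.0)
```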


Evaluation β€” LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task per suite). Numbers are success rates (↑); Avg is the unweighted mean over the four suites.

| Step | libero_10 | libero_goal | libero_object | libero_spatial | Avg (4 suites) |
|---|---|---|---|---|---|
| 10k | 0.364 | 0.772 | 0.986 | 0.808 | 0.732 |
| 20k | 0.750 | 0.900 | 0.942 | 0.896 | 0.872 |
| 30k | 0.722 | 0.920 | 0.978 | 0.882 | 0.876 |
| 40k | 0.788 | 0.934 | 0.978 | 0.872 | 0.893 |
| 50k | 0.772 | 0.924 | 0.978 | 0.864 | 0.885 |
| 60k | 0.860 | 0.954 | 0.978 | 0.874 | 0.916 |

Best checkpoint: steps_60000_pytorch_model.pt, with an average success rate of 91.6% across all four LIBERO suites.

For comparison with other StarVLA frameworks see the StarVLA Model Zoo.


Repository Layout

```
.
├── README.md                 # this model card
├── config.yaml               # minimal training config
├── config.full.yaml          # fully resolved training config
├── run_libero_train.sh       # launch script used for this run
├── dataset_statistics.json   # per-dataset action/state normalisation stats
├── summary.jsonl             # training step summary
├── logs/                     # per-suite evaluation logs
│   ├── libero_10/
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                   # evaluation rollout videos
└── checkpoints/
    ├── steps_60000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_50000_pytorch_model.pt
    ├── steps_40000_pytorch_model.pt
    ├── steps_30000_pytorch_model.pt
    ├── steps_20000_pytorch_model.pt
    └── steps_10000_pytorch_model.pt
```
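
The full snapshot contains all six checkpoints plus the rollout videos. If only the recommended checkpoint is needed, huggingface_hub's allow_patterns argument can restrict the download; a sketch, with file paths mirroring the layout above:

```python
from huggingface_hub import snapshot_download

# Fetch only the recommended checkpoint and the files needed to load it.
ckpt_dir = snapshot_download(
    "StarVLA/WM4A-Wan2d2-OFT-LIBERO-4in1",
    allow_patterns=[
        "config.yaml",
        "dataset_statistics.json",
        "checkpoints/steps_60000_pytorch_model.pt",
    ],
)
```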

How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow the installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/WM4A-Wan2d2-OFT-LIBERO-4in1")

policy = load_framework_from_checkpoint(
    framework_name="WanOFT",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_60000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```
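
Continuing from the loaded policy, a hypothetical call shaped after the comment above; the argument names, input shapes, and instruction string are assumptions, and the actual signature is defined by the StarVLA framework:

```python
import numpy as np

image = np.zeros((224, 224, 3), dtype=np.uint8)   # single 3rd-person RGB view
state = np.zeros(7, dtype=np.float32)             # 7-d proprioceptive state

actions = policy.predict_action(
    images=[image],
    instruction="put the bowl on the plate",
    state=state,
)
# actions: an (8, 7) delta end-effector action chunk.
```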

For end-to-end LIBERO evaluation see examples/LIBERO.


Intended Use & Limitations

Intended use. Research on vision-language-action models, LIBERO tabletop manipulation benchmarks, and as a baseline for large video diffusion model features in VLA architectures.

Out-of-scope / limitations. This model is trained exclusively on LIBERO simulation data with Franka-style delta end-effector control. Real-robot transfer and cross-embodiment generalisation have not been evaluated. Performance may degrade on out-of-distribution scenes, objects, or instructions not present in the LIBERO training split.
