# WanOFT-LIBERO-4in1
A Vision-Language-Action (VLA) model from the StarVLA
project, built on Wan2.2-TI2V-5B (a large-scale text-to-video diffusion
model) as the visual backbone, driving a lightweight MLP action head
(WanOFT).
The model is trained on the full LIBERO 4-in-1 benchmark (libero_10 +
libero_goal + libero_object + libero_spatial combined).
WanOFT is StarVLA's architecture that leverages the rich spatiotemporal features of the Wan2.2 video diffusion model as visual representations, paired with a simple yet effective MLP action head (a minimal forward-pass sketch follows the list):
- Wan2.2 visual features – last-layer activations of Wan2.2-TI2V-5B-Diffusers provide high-quality, motion-aware visual tokens that encode dynamics well suited to manipulation policy learning.
- MLP action head (OFT-style) – a compact 3-layer MLP (hidden dim 3072) produces action predictions directly from the Wan2.2 visual features and instruction tokens, offering fast inference with minimal overhead.
- Language conditioning via instruction tokens – the task instruction is tokenised and concatenated with the visual tokens before the MLP head; no separate VLM backbone is used.
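The sketch below illustrates this data flow in PyTorch; the module structure, dimensions, and mean-pooling of tokens are illustrative assumptions rather than the exact StarVLA implementation.

```python
import torch
import torch.nn as nn

class WanOFTHeadSketch(nn.Module):
    """Illustrative OFT-style head: Wan2.2 visual tokens + instruction tokens
    -> pooled features -> 3-layer MLP -> action chunk (8 steps x 7 dims).
    Dimensions and pooling are assumptions for exposition only."""

    def __init__(self, feat_dim=3072, hidden_dim=3072, chunk_len=8, action_dim=7):
        super().__init__()
        self.chunk_len, self.action_dim = chunk_len, action_dim
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, chunk_len * action_dim),
        )

    def forward(self, visual_tokens, instruction_tokens):
        # visual_tokens:      (B, N_v, feat_dim) last-layer Wan2.2 activations
        # instruction_tokens: (B, N_t, feat_dim) embedded task instruction
        fused = torch.cat(
            [visual_tokens.mean(dim=1), instruction_tokens.mean(dim=1)], dim=-1
        )
        actions = self.mlp(fused)                                  # (B, 8 * 7)
        return actions.view(-1, self.chunk_len, self.action_dim)  # (B, 8, 7)
```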
## Model Summary

| | |
|---|---|
| Architecture | WanOFT (Wan2.2 visual backbone + MLP action head) |
| Visual backbone | Wan2.2-TI2V-5B-Diffusers |
| Action head | MLP (hidden dim 3072, OFT-style) |
| Action chunk | 8 steps (+ 7 future-window steps) |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| License | MIT |
| Codebase | starVLA/starVLA |
## Training Data

LIBERO 4-in-1 mixture (libero_all) – all four LIBERO task suites combined into a single training stream:

| Suite | Tasks | Description |
|---|---|---|
| libero_10 | 10 | Long-horizon tabletop manipulation |
| libero_goal | 10 | Goal-conditioned rearrangement |
| libero_object | 10 | Object-centric pick-and-place |
| libero_spatial | 10 | Spatially varied placement |
- Action representation: delta end-effector (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in dataset_statistics.json (a loading example follows this list).
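A minimal sketch of applying those statistics at inference time; the JSON schema (per-suite mean/std for actions) and the key names are assumptions, since the exact layout of dataset_statistics.json is not documented here.

```python
import json
import numpy as np

# Assumed schema: {"libero_10": {"action": {"mean": [...7 values...],
#                                           "std":  [...7 values...]}}, ...}
with open("dataset_statistics.json") as f:
    stats = json.load(f)

def unnormalize_action(norm_action, suite="libero_10"):
    """Map a normalised 7-d action back to delta end-effector units."""
    a = stats[suite]["action"]
    return np.asarray(norm_action) * np.asarray(a["std"]) + np.asarray(a["mean"])
```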
## Training Recipe

| | |
|---|---|
| Total steps | 800,000 (released checkpoints: 10k–60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 8 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16, mixed precision |
| Attention impl. | SDPA |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 2.5e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min lr 1e-6) |
| Gradient clipping | 1.0 |
| Frozen modules | none (full fine-tuning) |
The exact training config is preserved in
config.yaml / config.full.yaml, and the
launch script in run_libero_train.sh.
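The snippet below sketches how the two learning rates, warm-up, and cosine-to-minimum schedule above could be reproduced in plain PyTorch; the parameter-group split and exact decay formula are assumptions standing in for the actual StarVLA training loop defined in config.yaml.

```python
import math
import torch

def build_optimizer_and_scheduler(backbone_params, head_params,
                                  total_steps=800_000, warmup_steps=5_000,
                                  min_lr=1e-6):
    # Two parameter groups: Wan2.2 backbone at 2.5e-5, action head at 1e-4.
    base_lrs = (2.5e-5, 1e-4)
    optimizer = torch.optim.AdamW(
        [{"params": backbone_params, "lr": base_lrs[0]},
         {"params": head_params, "lr": base_lrs[1]}],
        betas=(0.9, 0.95), eps=1e-8, weight_decay=1e-8,
    )

    def make_lambda(base_lr):
        def lr_lambda(step):
            if step < warmup_steps:                       # linear warm-up
                return step / max(1, warmup_steps)
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
            return max(min_lr / base_lr, cosine)          # cosine decay, floored at min lr
        return lr_lambda

    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, [make_lambda(lr) for lr in base_lrs]
    )
    return optimizer, scheduler

# Gradient clipping as in the recipe (applied each step before optimizer.step()):
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```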
## Evaluation – LIBERO 4-in-1

Evaluation follows the standard LIBERO protocol (50 trials per task per suite); numbers are success rates (higher is better).
| Step | libero_10 | libero_goal | libero_object | libero_spatial | Avg (4 suites) |
|---|---|---|---|---|---|
| 10k | 0.364 | 0.772 | 0.986 | 0.808 | 0.732 |
| 20k | 0.750 | 0.900 | 0.942 | 0.896 | 0.872 |
| 30k | 0.722 | 0.920 | 0.978 | 0.882 | 0.876 |
| 40k | 0.788 | 0.934 | 0.978 | 0.872 | 0.893 |
| 50k | 0.772 | 0.924 | 0.978 | 0.864 | 0.885 |
| 60k | 0.860 | 0.954 | 0.978 | 0.874 | 0.916 |
Best checkpoint: steps_60000_pytorch_model.pt – avg 91.6% across all four LIBERO suites.
For comparison with other StarVLA frameworks see the StarVLA Model Zoo.
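For reference, each suite score is the success rate over that suite's 10 tasks × 50 trials, and the Avg column is the plain mean over the four suites, as the short check below illustrates for the 60k row.

```python
# Per-suite scores reported for the 60k checkpoint (from the table above).
scores_60k = {"libero_10": 0.860, "libero_goal": 0.954,
              "libero_object": 0.978, "libero_spatial": 0.874}

avg = sum(scores_60k.values()) / len(scores_60k)
print(f"avg = {avg:.4f}")  # ~0.9165, reported as 0.916 in the table
```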
## Repository Layout

```
.
├── README.md                     # this model card
├── config.yaml                   # minimal training config
├── config.full.yaml              # fully resolved training config
├── run_libero_train.sh           # launch script used for this run
├── dataset_statistics.json       # per-dataset action/state normalisation stats
├── summary.jsonl                 # training step summary
├── logs/                         # per-suite evaluation logs
│   ├── libero_10/
│   ├── libero_goal/
│   ├── libero_object/
│   └── libero_spatial/
├── videos/                       # evaluation rollout videos
└── checkpoints/
    ├── steps_60000_pytorch_model.pt   # ← recommended checkpoint
    ├── steps_50000_pytorch_model.pt
    ├── steps_40000_pytorch_model.pt
    ├── steps_30000_pytorch_model.pt
    ├── steps_20000_pytorch_model.pt
    └── steps_10000_pytorch_model.pt
```
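If you only need the recommended weights rather than the full repository (which also contains rollout videos and five earlier checkpoints), a targeted download along these lines should work; the repo id is taken from this card's model listing and the file paths from the layout above.

```python
from huggingface_hub import hf_hub_download

# Fetch only the recommended checkpoint and its config, not the whole repo.
ckpt_path = hf_hub_download(
    repo_id="StarVLA/WM4A-Wan2d2-OFT-LIBERO-4in1",
    filename="checkpoints/steps_60000_pytorch_model.pt",
)
cfg_path = hf_hub_download(
    repo_id="StarVLA/WM4A-Wan2d2-OFT-LIBERO-4in1",
    filename="config.yaml",
)
```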
## How to Use

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```

```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/WM4A-Wan2d2-OFT-LIBERO-4in1")
policy = load_framework_from_checkpoint(
    framework_name="WanOFT",
    config_path=f"{ckpt_dir}/config.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_60000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (8 × 7)
```

For end-to-end LIBERO evaluation see examples/LIBERO.
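A rough single-step rollout sketch built on the predict_action interface noted in the comment above; it assumes the policy object from the previous snippet, and the observation format (uint8 RGB image, 7-d state, list-wrapped images) is an assumption to verify against examples/LIBERO.

```python
import numpy as np

# Hypothetical observation; in LIBERO evaluation these come from the simulator.
image = np.zeros((224, 224, 3), dtype=np.uint8)   # single 3rd-person RGB view
state = np.zeros(7, dtype=np.float32)             # 7-d proprioceptive state
instruction = "put the bowl on the stove"

action_chunk = policy.predict_action([image], instruction, state)
# Expected shape (8, 7): eight delta end-effector actions; execute the first
# few steps, then query the policy again with the new observation.
```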
## Intended Use & Limitations

Intended use. Research on vision-language-action models, evaluation on the LIBERO tabletop manipulation benchmarks, and use as a baseline for studying large video diffusion model features in VLA architectures.
Out-of-scope / limitations. This model is trained exclusively on LIBERO simulation data with Franka-style delta end-effector control. Real-robot transfer and cross-embodiment generalisation have not been evaluated. Performance may degrade on out-of-distribution scenes, objects, or instructions not present in the LIBERO training split.