# Learning Visual Feature-Based World Models via Residual Latent Action

This repository contains the weights for the model presented in the paper *Learning Visual Feature-Based World Models via Residual Latent Action*.

Project Page | GitHub | Colab Demo

## Introduction

World models predict future transitions from observations and actions. RLA-WM is a visual feature-based world model: it predicts future visual features instead of raw video pixels. It builds on a new latent action representation, the Residual Latent Action (RLA), learned from DINO residuals, and predicts RLA values via flow matching. This makes RLA-WM significantly faster and less prone to hallucination than video-diffusion world models.
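
For intuition, here is a minimal sketch of what flow-matching inference over a residual latent action could look like. All module names, dimensions, and the decoding step are illustrative assumptions, not the actual RLA-WM code; see the GitHub repository for the real implementation.

```python
import torch
import torch.nn as nn

# Toy sketch (not the RLA-WM API): sample a residual latent action (RLA) by
# integrating a learned velocity field, then apply it to the current DINO
# features. All dimensions and module names are illustrative assumptions.
FEAT_DIM, RLA_DIM = 384, 32

velocity = nn.Sequential(            # stand-in for v_theta(z_t, t | feat)
    nn.Linear(RLA_DIM + FEAT_DIM + 1, 256), nn.SiLU(), nn.Linear(256, RLA_DIM)
)
decoder = nn.Linear(FEAT_DIM + RLA_DIM, FEAT_DIM)  # stand-in feature predictor

@torch.no_grad()
def sample_rla(feat: torch.Tensor, n_steps: int = 10) -> torch.Tensor:
    """Euler-integrate the flow from Gaussian noise to an RLA sample."""
    z = torch.randn(feat.shape[0], RLA_DIM)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((feat.shape[0], 1), i * dt)
        z = z + dt * velocity(torch.cat([z, feat, t], dim=-1))
    return z

feat = torch.randn(1, FEAT_DIM)                 # current visual features
rla = sample_rla(feat)                          # flow-matched latent action
next_feat = feat + decoder(torch.cat([feat, rla], dim=-1))  # residual update
```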

The model enables advanced robot learning techniques, including learning from actionless demonstration videos and training visual RL policies entirely within the world model.
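
As a rough illustration of the second use case, a policy can be optimized purely on imagined feature-space rollouts, never touching the simulator. The sketch below uses toy stand-ins for the dynamics, policy, and reward head rather than the actual RLA-WM modules.

```python
import torch
import torch.nn as nn

# Illustrative only: policy learning "in imagination". `dynamics`, `policy`,
# and `reward_head` are toy stand-ins, not RLA-WM modules; the point is that
# every transition happens in visual-feature space.
FEAT_DIM, ACTION_DIM, HORIZON = 384, 7, 15      # sizes are assumptions

dynamics = nn.Linear(FEAT_DIM + ACTION_DIM, FEAT_DIM)  # stand-in world model
policy = nn.Linear(FEAT_DIM, ACTION_DIM)
reward_head = nn.Linear(FEAT_DIM, 1)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

feat = torch.randn(32, FEAT_DIM)                # batch of start features
total_return = 0.0
for _ in range(HORIZON):
    action = torch.tanh(policy(feat))           # imagined action
    feat = dynamics(torch.cat([feat, action], dim=-1))  # imagined transition
    total_return = total_return + reward_head(feat).mean()

# Backprop through the (in practice frozen) dynamics to improve the policy.
opt.zero_grad()
(-total_return).backward()
opt.step()
```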

## Quickstart

RLA-WM setup and inference demo:

```bash
# Environment setup
MAX_JOBS=1 uv sync
source .venv/bin/activate
export PYTHONPATH=.:./third_party/diffusion_policy

# Pretrained weights -> runs/weights/
hf download xyzhang368/RLA-WM --local-dir runs/weights

# Minimal dataset for inference -> data/maniskill/ + data/eval_handles/
mkdir -p data && cd data
hf download xyzhang368/RLA-WM --repo-type dataset --include "maniskill.tar" --local-dir . && tar -xf maniskill.tar
cd ..

# Run the inference notebook
jupyter notebook notebooks/inference_demo.ipynb
```
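
Before launching the notebook, an optional sanity check that the downloads landed where the demo expects them (paths taken from the quickstart above; the exact files inside each directory depend on the release):

```python
from pathlib import Path

# Confirm the weight and dataset directories exist and are non-empty.
for p in ["runs/weights", "data/maniskill", "data/eval_handles"]:
    path = Path(p)
    ok = path.exists() and any(path.iterdir())
    print(p, "->", "ok" if ok else "MISSING")
```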

## Citation

```bibtex
@article{zhang2026learning,
  title={{Learning Visual Feature-Based World Models via Residual Latent Action}},
  author={Zhang, Xinyu and Xu, Zhengtong and Tao, Yutian and Wang, Yeping and She, Yu and Boularias, Abdeslam},
  journal={arXiv preprint arXiv:2605.07079},
  year={2026},
  eprint={2605.07079},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```