---
license: apache-2.0
pipeline_tag: robotics
---
# From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
This repository contains the weights and documentation for the models presented in the paper [From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models](https://huggingface.co/papers/2605.04678).
The study investigates how latent actions can serve as an intermediate representation that lets vision-language-action (VLA) models be trained consistently across heterogeneous datasets. All models share a `Qwen3-VL-2B` backbone.
## Resources
- **Paper**: [huggingface.co/papers/2605.04678](https://huggingface.co/papers/2605.04678)
- **GitHub Repository**: [RUCKBReasoning/From_Pixels_to_Tokens](https://github.com/RUCKBReasoning/From_Pixels_to_Tokens)
## Model Variants
The paper compares four representative strategies for integrating latent action supervision against a baseline that uses none:
| Model | Latent supervision | Role in VLA training |
| ----------- | --------------------------- | --------------------------------------------------------- |
| `Baseline` | None | Direct action prediction without latent supervision |
| `LA-Align` | Image-based latent actions | Align internal VLM representations with latent embeddings |
| `LA-Direct` | Image-based latent actions | Directly decode latent actions as discrete tokens |
| `LA-Cond` | Image-based latent actions | Jointly decode latent actions and action representations |
| `LA-Tok` | Action-based latent actions | Map actions into discrete latent tokens |
## Training Example
Training runs through the `exp/train_vla.py` script. The example command below trains the baseline model on the `libero_goal` dataset, using a single node with one process:
```bash
torchrun --nnodes=1 --nproc_per_node=1 exp/train_vla.py \
--seed 42 \
--run_root_dir runs \
--save_checkpoint True \
--vla_id baseline \
--vlm_path /path/to/Qwen3-VL-2B \
--vlm_model_id Qwen3 \
--default_image_size 224 \
--data_root_dir /path/to/rlds_data \
--data_mix '["libero_goal"]' \
--shuffle_buffer_size 128 \
--image_aug True \
--window_size 8 \
--use_wrist_image True \
--use_proprio True \
--type training \
--epochs 10 \
--max_steps 20000 \
--global_batch_size 128 \
--per_device_batch_size 32 \
--learning_rate 1e-4 \
--weight_decay 0.01 \
--max_grad_norm 1.0 \
--lr_scheduler_type constant \
--warmup_ratio 0.03 \
--save_step 20000 \
--wandb_project your_project \
--use_wandb True
```
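The command above sets `--global_batch_size 128` and `--per_device_batch_size 32` with a single process. One plausible reading, which this README does not confirm, is that the script closes the gap with gradient accumulation. The sketch below works through that arithmetic under the assumption that global batch = per-device batch × number of processes × accumulation steps. `grad_accum_steps` is a hypothetical helper for illustration, not part of the repository.

```bash
# Hypothetical helper, not part of the repository.
# Assumption: global batch = per-device batch x processes x accumulation steps.
grad_accum_steps() {
  local global_batch=$1 per_device_batch=$2 num_procs=$3
  local per_step=$(( per_device_batch * num_procs ))
  # Reject combinations that do not divide evenly.
  if (( global_batch % per_step != 0 )); then
    echo "global batch $global_batch is not divisible by $per_step" >&2
    return 1
  fi
  echo $(( global_batch / per_step ))
}

# Values from the example command: 128 global, 32 per device, 1 process.
ACCUM_1GPU=$(grad_accum_steps 128 32 1)
# Same batch sizes on a hypothetical 4-process launch (--nproc_per_node=4).
ACCUM_4GPU=$(grad_accum_steps 128 32 4)
echo "1 process: $ACCUM_1GPU step(s); 4 processes: $ACCUM_4GPU step(s)"
```

Under this assumption, the single-process example would accumulate over 4 micro-batches per optimizer step, while a 4-process launch would need none. Check `exp/train_vla.py` for how the two batch-size flags are actually reconciled before relying on these numbers.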
## Citation
```bibtex
@article{pixels2tokens2026,
title = {From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models},
author = {Lin, Yihan and Li, Haoyang and Li, Yang and Shen, Haitao and Zhao, Yihan and Shao, Chao and Zhang, Jing},
journal = {arXiv preprint arXiv:2605.04678},
year = {2026},
doi = {10.48550/arXiv.2605.04678}
}
```