---
license: apache-2.0
pipeline_tag: robotics
---

# From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

This repository contains the weights and documentation for the models presented in the paper [From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models](https://huggingface.co/papers/2605.04678).

The study investigates how latent actions can serve as an intermediate representation that lets vision-language-action (VLA) models be trained consistently across heterogeneous datasets. All models share a `Qwen3-VL-2B` backbone.

## Resources
- **Paper**: [huggingface.co/papers/2605.04678](https://huggingface.co/papers/2605.04678)
- **GitHub Repository**: [RUCKBReasoning/From_Pixels_to_Tokens](https://github.com/RUCKBReasoning/From_Pixels_to_Tokens)

## Model Variants

The paper compares four representative strategies for integrating latent action supervision against a `Baseline` that uses no latent supervision:

| Model       | Latent supervision          | Role in VLA training                                      |
| ----------- | --------------------------- | --------------------------------------------------------- |
| `Baseline`  | None                        | Direct action prediction without latent supervision       |
| `LA-Align`  | Image-based latent actions  | Align internal VLM representations with latent embeddings |
| `LA-Direct` | Image-based latent actions  | Directly decode latent actions as discrete tokens         |
| `LA-Cond`   | Image-based latent actions  | Jointly decode latent actions and action representations  |
| `LA-Tok`    | Action-based latent actions | Map actions into discrete latent tokens                   |
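
To make "discrete latent tokens" concrete, here is a minimal sketch of nearest-neighbor vector quantization, one common way to turn continuous vectors into token ids. It is a generic illustration and not the tokenizer used in the paper: the function names, the 2-D "actions", and the four-entry codebook are all hypothetical.

```python
from typing import List, Sequence


def quantize_action(action: Sequence[float],
                    codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `action` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(action, codebook[i]))


def tokenize_chunk(actions: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous action vector to one discrete token id."""
    return [quantize_action(a, codebook) for a in actions]


# Hypothetical toy codebook; a real one would be learned and much larger.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
actions = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = tokenize_chunk(actions, codebook)
print(tokens)  # [0, 3, 2]
```

Once actions are expressed as ids like these, a language-model backbone can in principle predict them with its ordinary token-level objective. For how each variant actually builds and uses its latent actions, refer to the paper and the GitHub repository.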

## Training Example

Training runs through the `exp/train_vla.py` script. The following command trains the baseline model on the `libero_goal` dataset with a single process on a single node:

```bash
torchrun --nnodes=1 --nproc_per_node=1 exp/train_vla.py \
  --seed 42 \
  --run_root_dir runs \
  --save_checkpoint True \
  --vla_id baseline \
  --vlm_path /path/to/Qwen3-VL-2B \
  --vlm_model_id Qwen3 \
  --default_image_size 224 \
  --data_root_dir /path/to/rlds_data \
  --data_mix '["libero_goal"]' \
  --shuffle_buffer_size 128 \
  --image_aug True \
  --window_size 8 \
  --use_wrist_image True \
  --use_proprio True \
  --type training \
  --epochs 10 \
  --max_steps 20000 \
  --global_batch_size 128 \
  --per_device_batch_size 32 \
  --learning_rate 1e-4 \
  --weight_decay 0.01 \
  --max_grad_norm 1.0 \
  --lr_scheduler_type constant \
  --warmup_ratio 0.03 \
  --save_step 20000 \
  --wandb_project your_project \
  --use_wandb True
```
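
The command above sets `--global_batch_size 128` and `--per_device_batch_size 32` while launching a single process. A common convention is to bridge that gap with gradient accumulation. The sketch below shows the arithmetic under that assumption. Whether `exp/train_vla.py` derives its accumulation steps this way has not been confirmed, and the helper name is hypothetical.

```python
def grad_accum_steps(global_batch_size: int,
                     per_device_batch_size: int,
                     world_size: int) -> int:
    """Number of micro-batches to accumulate before each optimizer step.

    Assumes the usual convention:
        global = per_device * world_size * accumulation_steps
    """
    per_step = per_device_batch_size * world_size
    if per_step <= 0 or global_batch_size % per_step != 0:
        raise ValueError(
            "global_batch_size must be a positive multiple of "
            "per_device_batch_size * world_size"
        )
    return global_batch_size // per_step


# Values from the example command: 1 node x 1 process per node.
steps = grad_accum_steps(global_batch_size=128,
                         per_device_batch_size=32,
                         world_size=1)
print(steps)  # 4
```

Under the same assumption, raising `--nproc_per_node` to 4 would bring the accumulation steps down to 1 without changing the effective batch size.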

## Citation

```bibtex
@article{pixels2tokens2026,
  title   = {From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models},
  author  = {Lin, Yihan and Li, Haoyang and Li, Yang and Shen, Haitao and Zhao, Yihan and Shao, Chao and Zhang, Jing},
  journal = {arXiv preprint arXiv:2605.04678},
  year    = {2026},
  doi     = {10.48550/arXiv.2605.04678}
}
```