---
license: apache-2.0
pipeline_tag: robotics
---

# From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

This repository contains the weights and documentation for the models presented in the paper [From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models](https://huggingface.co/papers/2605.04678).

The study investigates how latent actions can serve as an intermediate representation that enables consistent vision-language-action (VLA) modeling across heterogeneous datasets. All model variants share a `Qwen3-VL-2B` backbone.

## Resources
- **Paper**: [arxiv.org/abs/2605.04678](https://huggingface.co/papers/2605.04678)
- **GitHub Repository**: [RUCKBReasoning/From_Pixels_to_Tokens](https://github.com/RUCKBReasoning/From_Pixels_to_Tokens)

## Model Variants

The paper compares a baseline with four representative strategies for integrating latent action supervision:

| Model       | Latent supervision          | Role in VLA training                                      |
| ----------- | --------------------------- | --------------------------------------------------------- |
| `Baseline`  | None                        | Direct action prediction without latent supervision       |
| `LA-Align`  | Image-based latent actions  | Align internal VLM representations with latent embeddings |
| `LA-Direct` | Image-based latent actions  | Directly decode latent actions as discrete tokens         |
| `LA-Cond`   | Image-based latent actions  | Jointly decode latent actions and action representations  |
| `LA-Tok`    | Action-based latent actions | Map actions into discrete latent tokens                   |
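
To make "map actions into discrete latent tokens" concrete, the sketch below shows generic nearest-neighbour vector quantization, one common way to turn continuous vectors into token ids. It is an illustration only and is not the tokenizer used by `LA-Tok`; the codebook values, the 2-D action vectors, and the helper names `quantize_action` and `tokenize_actions` are all hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_action(action: Sequence[float],
                    codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `action`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(action, codebook[i]))


def tokenize_actions(actions: Sequence[Sequence[float]],
                     codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous action vector to a discrete token id."""
    return [quantize_action(a, codebook) for a in actions]


# Hypothetical codebook with four 2-D entries (a real one would be learned).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical continuous actions to tokenize.
actions = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]
tokens = tokenize_actions(actions, CODEBOOK)
print(tokens)
```

In a learned tokenizer the codebook would be trained jointly with an encoder and decoder; here it is fixed by hand so the lookup step is easy to follow.
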

## Training Example

Training is performed using the `exp/train_vla.py` script. Below is an example command for training the baseline model on the `libero_goal` dataset:

```bash
torchrun --nnodes=1 --nproc_per_node=1 exp/train_vla.py \
  --seed 42 \
  --run_root_dir runs \
  --save_checkpoint True \
  --vla_id baseline \
  --vlm_path /path/to/Qwen3-VL-2B \
  --vlm_model_id Qwen3 \
  --default_image_size 224 \
  --data_root_dir /path/to/rlds_data \
  --data_mix '["libero_goal"]' \
  --shuffle_buffer_size 128 \
  --image_aug True \
  --window_size 8 \
  --use_wrist_image True \
  --use_proprio True \
  --type training \
  --epochs 10 \
  --max_steps 20000 \
  --global_batch_size 128 \
  --per_device_batch_size 32 \
  --learning_rate 1e-4 \
  --weight_decay 0.01 \
  --max_grad_norm 1.0 \
  --lr_scheduler_type constant \
  --warmup_ratio 0.03 \
  --save_step 20000 \
  --wandb_project your_project \
  --use_wandb True
```
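
The command sets `--global_batch_size 128` and `--per_device_batch_size 32` on a single process, so the two values differ by a factor of four. A minimal sketch of the arithmetic follows, assuming the script uses the common convention that the global batch equals per-device batch × number of processes × gradient accumulation steps. How `exp/train_vla.py` actually reconciles these flags is not documented here, and `grad_accumulation_steps` is a hypothetical helper, not part of the repository.

```python
def grad_accumulation_steps(global_batch_size: int,
                            per_device_batch_size: int,
                            world_size: int) -> int:
    """Accumulation steps needed to reach the global batch size.

    Assumes: global = per_device * world_size * accumulation_steps.
    """
    samples_per_step = per_device_batch_size * world_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            "global_batch_size must be a multiple of "
            "per_device_batch_size * world_size"
        )
    return global_batch_size // samples_per_step


# Values from the example command: 1 node x 1 process per node.
world_size = 1 * 1
steps = grad_accumulation_steps(128, 32, world_size)
print(steps)
```

Under the same assumption, running on four GPUs (`--nproc_per_node=4`) with these batch sizes would need no accumulation, since 32 × 4 already equals 128.
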

## Citation

```bibtex
@article{pixels2tokens2026,
  title   = {From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models},
  author  = {Lin, Yihan and Li, Haoyang and Li, Yang and Shen, Haitao and Zhao, Yihan and Shao, Chao and Zhang, Jing},
  journal = {arXiv preprint arXiv:2605.04678},
  year    = {2026},
  doi     = {10.48550/arXiv.2605.04678}
}
```