CokeAnd1ce/From_Pixels_to_Tokens

Add model card and robotics metadata

by nielsr HF Staff - opened 20 days ago

base: refs/heads/main ← from: refs/pr/1

README.md +72 -3

@@ -1,3 +1,72 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+pipeline_tag: robotics
+---
+# From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
+This repository contains the weights and documentation for the models presented in the paper [From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models](https://huggingface.co/papers/2605.04678).
+The study investigates how latent actions can serve as an intermediate representation that lets vision-language-action (VLA) models be trained consistently across heterogeneous datasets. All models share a `Qwen3-VL-2B` backbone.
+## Resources
+- **Paper**: [huggingface.co/papers/2605.04678](https://huggingface.co/papers/2605.04678)
+- **GitHub Repository**: [RUCKBReasoning/From_Pixels_to_Tokens](https://github.com/RUCKBReasoning/From_Pixels_to_Tokens)
+## Model Variants
+The paper compares four representative strategies for integrating latent action supervision against a `Baseline` trained without it:
+| Model       | Latent supervision          | Role in VLA training                                      |
+| ----------- | --------------------------- | --------------------------------------------------------- |
+| `Baseline`  | None                        | Direct action prediction without latent supervision       |
+| `LA-Align`  | Image-based latent actions  | Align internal VLM representations with latent embeddings |
+| `LA-Direct` | Image-based latent actions  | Directly decode latent actions as discrete tokens         |
+| `LA-Cond`   | Image-based latent actions  | Jointly decode latent actions and action representations  |
+| `LA-Tok`    | Action-based latent actions | Map actions into discrete latent tokens                   |
+## Training Example
+Training is performed using the `exp/train_vla.py` script. Below is an example command for training the baseline model on the `libero_goal` dataset:
+```bash
+torchrun --nnodes=1 --nproc_per_node=1 exp/train_vla.py \
+  --seed 42 \
+  --run_root_dir runs \
+  --save_checkpoint True \
+  --vla_id baseline \
+  --vlm_path /path/to/Qwen3-VL-2B \
+  --vlm_model_id Qwen3 \
+  --default_image_size 224 \
+  --data_root_dir /path/to/rlds_data \
+  --data_mix '["libero_goal"]' \
+  --shuffle_buffer_size 128 \
+  --image_aug True \
+  --window_size 8 \
+  --use_wrist_image True \
+  --use_proprio True \
+  --type training \
+  --epochs 10 \
+  --max_steps 20000 \
+  --global_batch_size 128 \
+  --per_device_batch_size 32 \
+  --learning_rate 1e-4 \
+  --weight_decay 0.01 \
+  --max_grad_norm 1.0 \
+  --lr_scheduler_type constant \
+  --warmup_ratio 0.03 \
+  --save_step 20000 \
+  --wandb_project your_project \
+  --use_wandb True
+```
+## Citation
+```bibtex
+@article{liu2025pixels,
+  title={From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models},
+  author={Dongyang Liu and Shitian Zhao and Le Zhuo and Weifeng Lin and Yu Qiao and Hongsheng Li and Peng Gao},
+  journal={arXiv preprint arXiv:2605.04678},
+  year={2026}
+}
+```
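
One detail of the training example is worth spelling out: the command launches a single process (`--nnodes=1 --nproc_per_node=1`) with `--per_device_batch_size 32` but `--global_batch_size 128`. If `exp/train_vla.py` reconciles the two through gradient accumulation (an assumption; the README does not say how the script handles this), the accumulation factor is the global batch size divided by the per-device batch size times the number of processes. A minimal sketch of that arithmetic, where `grad_accum_steps` is a hypothetical helper and not part of the repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper: derives gradient accumulation steps from the batch
# settings in the example command. Assumes the trainer uses gradient
# accumulation to reach the global batch size; the README does not confirm this.
grad_accum_steps() {
  local global_batch=$1   # value passed to --global_batch_size
  local per_device=$2     # value passed to --per_device_batch_size
  local world_size=$3     # nnodes * nproc_per_node
  local per_step=$(( per_device * world_size ))

  # The global batch must be an exact multiple of the samples seen per step.
  if (( global_batch % per_step != 0 )); then
    echo "global batch ${global_batch} is not a multiple of ${per_step}" >&2
    return 1
  fi

  echo $(( global_batch / per_step ))
}

# Values from the example command: 128 global, 32 per device, 1 process.
grad_accum_steps 128 32 1
```

With the values from the example this yields 4 on a single process. On four processes (`--nproc_per_node=4`) the same batch settings would need no accumulation, since 32 × 4 already equals 128.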