Commit · 67e5f0c
1 Parent(s): 6151e7b
Add model card and robotics metadata (#1)
- Add model card and robotics metadata (59ca31e10330719474b7e7e4d9c7fb9ea2f0d7dd)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
|
@@ -1,3 +1,72 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
pipeline_tag: robotics
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
+
# From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
|
| 7 |
+
|
| 8 |
+
This repository contains the weights and documentation for the models presented in the paper [From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models](https://huggingface.co/papers/2605.04678).
|
| 9 |
+
|
| 10 |
+
The study investigates how latent actions can serve as an intermediate representation that enables consistent vision-language-action (VLA) modeling across heterogeneous datasets. All model variants share the same `Qwen3-VL-2B` backbone.
|
| 11 |
+
|
| 12 |
+
## Resources
|
| 13 |
+
- **Paper**: [huggingface.co/papers/2605.04678](https://huggingface.co/papers/2605.04678)
|
| 14 |
+
- **GitHub Repository**: [RUCKBReasoning/From_Pixels_to_Tokens](https://github.com/RUCKBReasoning/From_Pixels_to_Tokens)
|
| 15 |
+
|
| 16 |
+
## Model Variants
|
| 17 |
+
|
| 18 |
+
The paper compares a baseline without latent supervision against four representative strategies for integrating latent action supervision:
|
| 19 |
+
|
| 20 |
+
| Model | Latent supervision | Role in VLA training |
|
| 21 |
+
| ----------- | --------------------------- | --------------------------------------------------------- |
|
| 22 |
+
| `Baseline` | None | Direct action prediction without latent supervision |
|
| 23 |
+
| `LA-Align` | Image-based latent actions | Align internal VLM representations with latent embeddings |
|
| 24 |
+
| `LA-Direct` | Image-based latent actions | Directly decode latent actions as discrete tokens |
|
| 25 |
+
| `LA-Cond` | Image-based latent actions | Jointly decode latent actions and action representations |
|
| 26 |
+
| `LA-Tok` | Action-based latent actions | Map actions into discrete latent tokens |
|
| 27 |
+
|
| 28 |
+
## Training Example
|
| 29 |
+
|
| 30 |
+
Training is performed using the `exp/train_vla.py` script. Below is an example command for training the baseline model on the `libero_goal` dataset:
|
| 31 |
+
|
| 32 |
+
```bash
|
| 33 |
+
torchrun --nnodes=1 --nproc_per_node=1 exp/train_vla.py \
|
| 34 |
+
--seed 42 \
|
| 35 |
+
--run_root_dir runs \
|
| 36 |
+
--save_checkpoint True \
|
| 37 |
+
--vla_id baseline \
|
| 38 |
+
--vlm_path /path/to/Qwen3-VL-2B \
|
| 39 |
+
--vlm_model_id Qwen3 \
|
| 40 |
+
--default_image_size 224 \
|
| 41 |
+
--data_root_dir /path/to/rlds_data \
|
| 42 |
+
--data_mix '["libero_goal"]' \
|
| 43 |
+
--shuffle_buffer_size 128 \
|
| 44 |
+
--image_aug True \
|
| 45 |
+
--window_size 8 \
|
| 46 |
+
--use_wrist_image True \
|
| 47 |
+
--use_proprio True \
|
| 48 |
+
--type training \
|
| 49 |
+
--epochs 10 \
|
| 50 |
+
--max_steps 20000 \
|
| 51 |
+
--global_batch_size 128 \
|
| 52 |
+
--per_device_batch_size 32 \
|
| 53 |
+
--learning_rate 1e-4 \
|
| 54 |
+
--weight_decay 0.01 \
|
| 55 |
+
--max_grad_norm 1.0 \
|
| 56 |
+
--lr_scheduler_type constant \
|
| 57 |
+
--warmup_ratio 0.03 \
|
| 58 |
+
--save_step 20000 \
|
| 59 |
+
--wandb_project your_project \
|
| 60 |
+
--use_wandb True
|
| 61 |
+
```
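
The command above sets `--global_batch_size 128` and `--per_device_batch_size 32` while launching a single process (`--nproc_per_node=1`). One common way for a training script to reconcile such values is gradient accumulation. Whether `exp/train_vla.py` does this is an assumption here, not something this README states. A minimal sketch of that arithmetic, using a hypothetical helper named `grad_accumulation_steps`:

```python
def grad_accumulation_steps(global_batch_size: int,
                            per_device_batch_size: int,
                            world_size: int) -> int:
    """Micro-batches to accumulate per optimizer step.

    Hypothetical helper for illustration only; it is not taken from
    exp/train_vla.py, which may handle batch sizes differently.
    """
    # Samples processed across all devices in one forward/backward pass.
    samples_per_pass = per_device_batch_size * world_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be a multiple of "
            "per_device_batch_size * world_size"
        )
    return global_batch_size // samples_per_pass


# Values from the example command: 128 global, 32 per device, 1 process.
print(grad_accumulation_steps(128, 32, 1))  # 4
```

Under this assumption, launching with `--nproc_per_node=4` and the same batch flags would need no accumulation, since 4 devices × 32 samples already equals the global batch of 128.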
|
| 62 |
+
|
| 63 |
+
## Citation
|
| 64 |
+
|
| 65 |
+
```bibtex
|
| 66 |
+
@article{liu2025pixels,
|
| 67 |
+
title={From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models},
|
| 68 |
+
author={Dongyang Liu and Shitian Zhao and Le Zhuo and Weifeng Lin and Yu Qiao and Hongsheng Li and Peng Gao},
|
| 69 |
+
journal={arXiv preprint arXiv:2605.04678},
|
| 70 |
+
year={2025}
|
| 71 |
+
}
|
| 72 |
+
```
|