Add model card and robotics metadata

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +72 -3
README.md CHANGED
@@ -1,3 +1,72 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: robotics
+ ---
+
+ # From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
+
+ This repository contains the weights and documentation for the models presented in the paper [From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models](https://huggingface.co/papers/2605.04678).
+
+ The study investigates how latent actions can serve as an intermediate representation for consistent vision-language-action (VLA) modeling across heterogeneous datasets. All models share a `Qwen3-VL-2B` backbone.
+
+ ## Resources
+ - **Paper**: [huggingface.co/papers/2605.04678](https://huggingface.co/papers/2605.04678)
+ - **GitHub Repository**: [RUCKBReasoning/From_Pixels_to_Tokens](https://github.com/RUCKBReasoning/From_Pixels_to_Tokens)
+
+ ## Model Variants
+
+ The paper compares four representative strategies for integrating latent action supervision against a `Baseline` trained without it:
+
+ | Model | Latent supervision | Role in VLA training |
+ | ----------- | --------------------------- | --------------------------------------------------------- |
+ | `Baseline` | None | Direct action prediction without latent supervision |
+ | `LA-Align` | Image-based latent actions | Align internal VLM representations with latent embeddings |
+ | `LA-Direct` | Image-based latent actions | Directly decode latent actions as discrete tokens |
+ | `LA-Cond` | Image-based latent actions | Jointly decode latent actions and action representations |
+ | `LA-Tok` | Action-based latent actions | Map actions into discrete latent tokens |
+
+ ## Training Example
+
+ Training is performed using the `exp/train_vla.py` script. Below is an example command for training the baseline model on the `libero_goal` dataset:
+
+ ```bash
+ torchrun --nnodes=1 --nproc_per_node=1 exp/train_vla.py \
+ --seed 42 \
+ --run_root_dir runs \
+ --save_checkpoint True \
+ --vla_id baseline \
+ --vlm_path /path/to/Qwen3-VL-2B \
+ --vlm_model_id Qwen3 \
+ --default_image_size 224 \
+ --data_root_dir /path/to/rlds_data \
+ --data_mix '["libero_goal"]' \
+ --shuffle_buffer_size 128 \
+ --image_aug True \
+ --window_size 8 \
+ --use_wrist_image True \
+ --use_proprio True \
+ --type training \
+ --epochs 10 \
+ --max_steps 20000 \
+ --global_batch_size 128 \
+ --per_device_batch_size 32 \
+ --learning_rate 1e-4 \
+ --weight_decay 0.01 \
+ --max_grad_norm 1.0 \
+ --lr_scheduler_type constant \
+ --warmup_ratio 0.03 \
+ --save_step 20000 \
+ --wandb_project your_project \
+ --use_wandb True
+ ```
+
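The example command sets `--global_batch_size 128` and `--per_device_batch_size 32` on a single process, so each forward/backward pass covers only a quarter of the global batch. The sketch below works out how many passes one optimizer step would need, assuming the script closes that gap with gradient accumulation. That is a common convention, but nothing in this PR confirms `exp/train_vla.py` behaves this way. The variable names are illustrative and are not flags or internals of the script.

```bash
#!/usr/bin/env bash
# Values copied from the example training command.
global_batch_size=128
per_device_batch_size=32
num_devices=1   # --nnodes=1 times --nproc_per_node=1

# Samples processed in one forward/backward pass across all devices.
samples_per_pass=$(( per_device_batch_size * num_devices ))

# ASSUMPTION: the remaining factor is handled by gradient accumulation.
# exp/train_vla.py is not confirmed to implement it this way.
if (( global_batch_size % samples_per_pass != 0 )); then
  echo "global batch size is not a multiple of the samples per pass" >&2
  exit 1
fi

grad_accum_steps=$(( global_batch_size / samples_per_pass ))
echo "gradient accumulation steps: ${grad_accum_steps}"
```

Under that assumption, raising `--nproc_per_node` to 4 with the same batch sizes would bring the factor down to 1.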
+ ## Citation
+
+ ```bibtex
+ @article{liu2025pixels,
+ title={From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models},
+ author={Dongyang Liu and Shitian Zhao and Le Zhuo and Weifeng Lin and Yu Qiao and Hongsheng Li and Peng Gao},
+ journal={arXiv preprint arXiv:2605.04678},
+ year={2025}
+ }
+ ```