	---
	license: apache-2.0
	pipeline_tag: robotics
	---

	# From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

	This repository contains the weights and documentation for the models presented in the paper [From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models](https://huggingface.co/papers/2605.04678).

	The study investigates how latent actions can serve as an intermediate representation that lets vision-language-action (VLA) models be trained consistently across heterogeneous datasets. All models share the same `Qwen3-VL-2B` backbone.

	## Resources
	- Paper: [huggingface.co/papers/2605.04678](https://huggingface.co/papers/2605.04678) (arXiv:2605.04678)
	- GitHub Repository: [RUCKBReasoning/From_Pixels_to_Tokens](https://github.com/RUCKBReasoning/From_Pixels_to_Tokens)
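
To fetch the files hosted in this repository, a minimal sketch using `huggingface_hub.snapshot_download` is shown below. The repository id is the one shown on this Hub page; the helper name `download_weights` and the default `local_dir` are illustrative assumptions, and the layout of the downloaded files is not described in this card.

```python
# Hub repository id, as shown on this model page.
REPO_ID = "CokeAnd1ce/From_Pixels_to_Tokens"


def download_weights(local_dir: str = "checkpoints/From_Pixels_to_Tokens") -> str:
    """Download every file in the repository and return the local folder path.

    `download_weights` and the default `local_dir` are hypothetical names
    chosen for this example; they are not part of the project's code.
    """
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    print(download_weights())
```

Running the script requires `pip install huggingface_hub` and network access; see the GitHub repository for how the checkpoints are meant to be loaded.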

	## Model Variants

	The paper compares four representative strategies for integrating latent action supervision against a `Baseline` trained without it:

	| Model | Latent supervision | Role in VLA training |
	| ----------- | --------------------------- | --------------------------------------------------------- |
	| `Baseline` | None | Direct action prediction without latent supervision |
	| `LA-Align` | Image-based latent actions | Align internal VLM representations with latent embeddings |
	| `LA-Direct` | Image-based latent actions | Directly decode latent actions as discrete tokens |
	| `LA-Cond` | Image-based latent actions | Jointly decode latent actions and action representations |
	| `LA-Tok` | Action-based latent actions | Map actions into discrete latent tokens |

	## Training Example

	Training is performed using the `exp/train_vla.py` script. Below is an example command for training the baseline model on the `libero_goal` dataset:

	```bash
	torchrun --nnodes=1 --nproc_per_node=1 exp/train_vla.py \
	--seed 42 \
	--run_root_dir runs \
	--save_checkpoint True \
	--vla_id baseline \
	--vlm_path /path/to/Qwen3-VL-2B \
	--vlm_model_id Qwen3 \
	--default_image_size 224 \
	--data_root_dir /path/to/rlds_data \
	--data_mix '["libero_goal"]' \
	--shuffle_buffer_size 128 \
	--image_aug True \
	--window_size 8 \
	--use_wrist_image True \
	--use_proprio True \
	--type training \
	--epochs 10 \
	--max_steps 20000 \
	--global_batch_size 128 \
	--per_device_batch_size 32 \
	--learning_rate 1e-4 \
	--weight_decay 0.01 \
	--max_grad_norm 1.0 \
	--lr_scheduler_type constant \
	--warmup_ratio 0.03 \
	--save_step 20000 \
	--wandb_project your_project \
	--use_wandb True
	```
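
The command sets `--global_batch_size 128` and `--per_device_batch_size 32` on a single process. This card does not state how the script reconciles the two. The sketch below assumes the common convention that the global batch is reached through gradient accumulation; it has not been checked against `exp/train_vla.py`, and the function names are illustrative.

```python
def grad_accumulation_steps(global_batch_size: int,
                            per_device_batch_size: int,
                            num_processes: int) -> int:
    """Micro-batches each process accumulates before one optimizer step.

    Assumes global = per_device * num_processes * accumulation_steps,
    which is a common convention and not confirmed for train_vla.py.
    """
    samples_per_pass = per_device_batch_size * num_processes
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be a multiple of "
            "per_device_batch_size * num_processes"
        )
    return global_batch_size // samples_per_pass


def samples_seen(max_steps: int, global_batch_size: int) -> int:
    """Total training samples consumed after max_steps optimizer steps."""
    return max_steps * global_batch_size


if __name__ == "__main__":
    # Values taken from the example command above (1 node x 1 process).
    print(grad_accumulation_steps(128, 32, 1))  # 4
    print(samples_seen(20000, 128))             # 2560000
```

Under this assumption, raising `--nproc_per_node` to 4 would reach the same global batch of 128 with no accumulation.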

	## Citation

	```bibtex
	@article{pixels2tokens2026,
	title = {From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models},
	author = {Lin, Yihan and Li, Haoyang and Li, Yang and Shen, Haitao and Zhao, Yihan and Shao, Chao and Zhang, Jing},
	journal = {arXiv preprint arXiv:2605.04678},
	year = {2026},
	doi = {10.48550/arXiv.2605.04678}
	}
	```