---
license: mit
pipeline_tag: robotics
---
# FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
Paper | Code | Project Collection
FrameSkip is a training-time frame selection framework for Vision-Language-Action (VLA) models. Instead of treating every frame in a dense robot demonstration trajectory as equally useful supervision, FrameSkip scores trajectory frames with lightweight cues and trains primarily from fewer but more informative frames.
FrameSkip is designed as a data-layer intervention: it changes which frames are exposed during training while leaving the VLA architecture, action head, loss function, and inference procedure unchanged.
## Highlights
- Efficiency: Achieves higher success rates while training on a compressed view of trajectories (retaining ~20% of frames).
- Architecture-agnostic: Operates entirely in the dataloader, making it compatible with various VLA architectures.
- Importance-guided: Uses action variation, visual-action coherence, task-progress priors, and gripper-transition preservation to score frames (two of these cues are sketched below).
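The action-variation and gripper-transition cues lend themselves to a compact illustration. The following is a minimal sketch, assuming actions arrive as an array of shape `(T, action_dim)` with a binary gripper command in the last dimension; `select_frames` and `keep_ratio` are illustrative names, not the released implementation, which also folds in the visual-action coherence and task-progress cues.

```python
import numpy as np

def select_frames(actions: np.ndarray, keep_ratio: float = 0.2) -> np.ndarray:
    """Return sorted indices of the most informative frames in one trajectory."""
    T = len(actions)
    if T < 2:
        return np.arange(T)

    # Action-variation cue: frames where the commanded action changes
    # sharply carry more supervision signal than near-static ones.
    deltas = np.linalg.norm(np.diff(actions, axis=0), axis=1)
    scores = np.concatenate(([deltas[0]], deltas))  # pad back to length T

    # Gripper-transition preservation: force-keep frames where the
    # (assumed binary, last-dimension) gripper command flips state.
    gripper_closed = actions[:, -1] > 0.5
    flips = np.flatnonzero(np.diff(gripper_closed.astype(int)) != 0)
    scores[flips] = np.inf
    scores[flips + 1] = np.inf  # keep both sides of the transition

    # Keep the top ~keep_ratio fraction of frames by score.
    k = max(1, int(keep_ratio * T))
    return np.sort(np.argsort(scores)[-k:])

# Example: pick ~20% of a toy 7-DoF trajectory (last dim = gripper).
actions = np.random.randn(100, 7)
actions[:, -1] = (np.arange(100) > 50).astype(float)  # gripper closes mid-way
print(select_frames(actions, keep_ratio=0.2))
```

Because selection happens on the trajectory before batching, a dataloader can apply it once per demonstration and leave the rest of the training loop untouched, which is what makes the approach architecture-agnostic.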
## Usage
FrameSkip is built on the starVLA training and evaluation stack. The released checkpoints follow the standard starVLA checkpoint format and can be loaded in the same way as starVLA VLA policies.
For simulation evaluation, follow the model loading and evaluation workflow of the QwenGR00T architecture in starVLA, substituting the path of the downloaded FrameSkip checkpoint.
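The snippet below is a hypothetical sketch of that workflow, not confirmed starVLA API: the module path `starvla.policies`, the helper `load_vla_policy`, the `predict` call, and the checkpoint path are all placeholders. Consult the starVLA QwenGR00T documentation for the actual entry points.

```python
# Hypothetical sketch only: module path, function names, and checkpoint path
# are placeholders; see the starVLA QwenGR00T docs for the real entry points.
from starvla.policies import load_vla_policy  # placeholder import

# FrameSkip checkpoints follow the standard starVLA format, so loading works
# the same as for any other starVLA policy; only the checkpoint path changes.
policy = load_vla_policy("checkpoints/FrameSkip-QwenGR00T")  # placeholder path
action = policy.predict(observation, instruction)            # placeholder call
```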
## Citation
If you find FrameSkip useful, please cite the following work:
```bibtex
@article{FrameSkip,
  title={FrameSkip: Learning from Fewer but More Informative Frames in VLA Training},
  author={Bin Yu and Shijie Lian and Xiaopeng Lin and Zhaolong Shen and Yuliang Wei and Changti Wu and Hang Yuan and Haishan Liu and Bailing Wang and Cong Huang and Kai Chen},
  journal={arXiv preprint arXiv:2605.13757},
  year={2026}
}
```