---
license: mit
datasets:
- Sylvest/LIBERO-plus
---

## CorridorVLA

This repository provides the official implementation of **CorridorVLA**.

> **Direct spatial constraints for Vision-Language-Action models via sparse physical anchors**

[arXiv](https://arxiv.org/abs/2604.21241)
[Code](https://github.com/lidc54/corridorVLA)

---

## TL;DR

* Explores an alternative to the common visual-style spatial guidance (e.g., predicting future images or videos): **text-style physical anchors**
* Predicts sparse **end-effector Δ-positions**
* Uses them to impose an **explicit corridor constraint** on action generation
* Achieves an **83.21% success rate on LIBERO-Plus**
---

## Motivation

<p align="center">
<img src="https://raw.githubusercontent.com/lidc54/corridorVLA/main/assets/motive.png" width="40%">
</p>

### Existing VLA paradigm

* Spatial guidance is encoded as visual-style tokens or latent features
* Action generation is influenced only indirectly, through the backbone features

### CorridorVLA

* Predicts **compact physical quantities** (spatial anchors)
* Applies them as **direct constraints in the loss**
* Needs no heavy visual intermediate representations

---

## Method Overview

<p align="center">
<img src="https://raw.githubusercontent.com/lidc54/corridorVLA/main/assets/framework.png" width="50%">
</p>

### Key components

**(1) Sparse Anchor Prediction**

* Predict $K$ future **Δ-position anchors**
* Represent the trajectory structure in a compact form
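As a rough illustration only (the function name and the uniform-subsampling rule are assumptions of this sketch, not the paper's exact recipe), $K$ sparse Δ-position anchors can be derived from a dense end-effector trajectory like this:

```python
import numpy as np

def delta_position_anchors(traj: np.ndarray, k: int) -> np.ndarray:
    """Subsample k future waypoints from a dense end-effector trajectory
    of shape (T, 3) and express them as offsets (Δ-positions) from the
    current position traj[0]. Hypothetical helper, not the official code."""
    t = len(traj)
    # Pick k indices spread uniformly over the future part of the trajectory.
    idx = np.linspace(1, t - 1, k).round().astype(int)
    return traj[idx] - traj[0]  # (k, 3) Δ-position anchors

# Example: a straight-line trajectory along x.
traj = np.stack([np.linspace(0.0, 1.0, 11), np.zeros(11), np.zeros(11)], axis=1)
anchors = delta_position_anchors(traj, k=3)
print(anchors.shape)  # (3, 3)
```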
**(2) Action Augmentation**

* Concatenate state-related physical quantities (e.g., Δ-positions) to the action vector
* Enable joint prediction of state and action, providing implicit alignment between state space and action space
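The augmentation step amounts to widening the action target so the head regresses state and action jointly. A minimal sketch, assuming a 7-DoF action and $K=3$ xyz anchors (these dimensions are illustrative, not from the paper):

```python
import numpy as np

def augment_action(action: np.ndarray, delta_pos: np.ndarray) -> np.ndarray:
    """Append flattened Δ-position anchors to an action vector so a single
    head predicts both; hypothetical layout, not the official code."""
    return np.concatenate([action, delta_pos.reshape(-1)])

action = np.zeros(7)          # e.g., a 7-DoF action
delta_pos = np.ones((3, 3))   # K=3 anchors in xyz
aug = augment_action(action, delta_pos)
print(aug.shape)  # (16,)
```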
**(3) Corridor Loss**

* Defines a tolerance region around the predicted trajectory
* Penalizes deviations outside the region while allowing smooth convergence within it

Behaves like a **structured smooth-L1 in trajectory space**.
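The loss itself is not spelled out in this card. One plausible reading of "tolerance region + smooth-L1-like behavior" (the corridor radius `eps` and all names here are assumptions of this sketch) is a loss that is zero inside a tube around the reference trajectory and grows smoothly outside it:

```python
import numpy as np

def corridor_loss(pred: np.ndarray, ref: np.ndarray, eps: float = 0.05) -> float:
    """Penalize predicted waypoints (N, 3) only when they leave a corridor
    of radius eps around the reference trajectory: zero inside the corridor,
    quadratic just past the boundary, linear farther out (smooth-L1 style).
    Hypothetical formulation, not the official implementation."""
    # Distance of each predicted waypoint to its reference waypoint.
    d = np.linalg.norm(pred - ref, axis=-1)
    # Overshoot beyond the corridor boundary (zero inside the corridor).
    over = np.maximum(d - eps, 0.0)
    # Smooth-L1 on the overshoot: quadratic for small values, linear for large.
    beta = eps
    loss = np.where(over < beta, 0.5 * over**2 / beta, over - 0.5 * beta)
    return float(loss.mean())

ref = np.zeros((4, 3))
inside = np.full((4, 3), 0.01)   # well inside the corridor
outside = np.full((4, 3), 0.2)   # outside the corridor
print(corridor_loss(inside, ref))   # → 0.0
print(corridor_loss(outside, ref))  # positive penalty
```

Inside the corridor the gradient vanishes, so the model converges smoothly; outside, the linear tail keeps gradients bounded, matching the "structured smooth-L1" description above.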
---

## Results

### LIBERO-Plus (GR00T-based)

| Variant | Description | AVG |
|---------|-------------|-----|
| base | | 75.23 |
| c1 | query=3 | 77.25 |
| c2 | + extra data | 77.25 |
| c3 | + Δpos anchors | 79.21 |
| **c4** | + corridor loss (**CorridorVLA**) | **83.21** |

Improvement:

* +7.98 points over the base model
* The largest gain comes from the **explicit spatial constraint**
---

## Implementation

* Built on **[StarVLA](https://github.com/starVLA/starVLA/commit/e1e6457c6cd124248f5ce7b2d3d40fb74f48c6fc)**
* Minimal changes:
  * a few extra prediction slots
  * additional loss terms
* No heavy architecture redesign

---

## Key Insights

* Spatial guidance can be **explicit (loss-level)** instead of implicit (feature-level)
* Physical quantities are more **action-aligned** and more **interpretable**
* Simple constraints can significantly improve **stability** and reduce **unstructured exploration**

---

## Citation

If you find this work useful, please cite:

```bibtex
@article{corridorvla2025,
  title={CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors},
  author={Dachong Li and ZhuangZhuang Chen and Jin Zhang and Jianqiang Li},
  year={2026},
  eprint={2604.21241},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2604.21241}
}
```