Add model card and metadata
This PR adds a model card for the paper [Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation](https://huggingface.co/papers/2605.11832).
The model card includes:
- Metadata for the `robotics` pipeline tag.
- `library_name: peft` metadata based on the presence of LoRA adapter configurations.
- Links to the paper, project page, and official GitHub repository.
- A summary of the key innovations: Geometry-Guided Gated Transformer (G3T) and Action Manifold Learning (AML).
- The official BibTeX citation.
README.md
ADDED
@@ -0,0 +1,36 @@
---
pipeline_tag: robotics
library_name: peft
---

# Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

This repository contains the weights for **Multi-view-VLA**, a Vision-Language-Action framework designed for robust and precise robotic manipulation.

[**Project Page**](https://junjxiao.github.io/Multi-view-VLA.github.io/) | [**Code**](https://github.com/junjxiao/Multi-view-VLA) | [**arXiv**](https://huggingface.co/papers/2605.11832)

## Introduction
Multi-view-VLA addresses the challenges of spatial perception and manipulation in Vision-Language-Action (VLA) models. Key features include:

- **Geometry-Guided Gated Transformer (G3T):** Addresses monocular depth ambiguity by leveraging multi-view diffusion priors for geometric guidance while adaptively filtering out occlusion noise (see the illustrative sketch after this list).
- **Action Manifold Learning (AML):** A direct action-prediction mechanism that bypasses the indirect noise/velocity regression used by diffusion-based policies, yielding more efficient action learning (see the sketch at the end of this section).
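
A minimal, hypothetical PyTorch sketch of the gating idea behind G3T (module and tensor names are assumptions for illustration, not the released implementation): policy tokens attend to multi-view geometric prior tokens, and a learned gate scales the injected geometric signal per token so that unreliable, e.g. occluded, evidence can be damped.

```python
import torch
import torch.nn as nn


class GatedGeometricFusion(nn.Module):
    """Illustrative gated cross-attention block (assumed names and shapes)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, policy_tokens: torch.Tensor, geom_tokens: torch.Tensor) -> torch.Tensor:
        # policy_tokens: (B, T, dim); geom_tokens: (B, V * N, dim) multi-view prior features.
        geom_signal, _ = self.cross_attn(policy_tokens, geom_tokens, geom_tokens)
        g = self.gate(policy_tokens)  # per-token, per-channel gate in [0, 1]
        return self.norm(policy_tokens + g * geom_signal)


if __name__ == "__main__":
    block = GatedGeometricFusion()
    out = block(torch.randn(2, 32, 512), torch.randn(2, 4 * 196, 512))
    print(out.shape)  # torch.Size([2, 32, 512])
```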

The model demonstrates superior success rates and robustness on benchmarks like LIBERO, RoboTwin 2.0, and real-world robotic tasks.
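
For contrast with diffusion-style heads that regress noise or velocity at sampled timesteps, the direct action prediction described in the AML bullet above can, in its simplest hypothetical form, regress the whole action chunk in one pass and train with a plain regression loss. The sketch below shows only that simplification; dimensions and names are assumptions, and the paper's manifold formulation is more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DirectActionHead(nn.Module):
    """Hypothetical direct-regression head: no noise schedule, no iterative denoising."""

    def __init__(self, dim: int = 512, horizon: int = 16, action_dim: int = 7):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, horizon * action_dim))

    def forward(self, policy_feat: torch.Tensor) -> torch.Tensor:
        # policy_feat: (B, dim) pooled policy feature -> (B, horizon, action_dim) action chunk.
        return self.proj(policy_feat).view(-1, self.horizon, self.action_dim)


def direct_action_loss(head: DirectActionHead, policy_feat: torch.Tensor, expert_actions: torch.Tensor) -> torch.Tensor:
    # One forward pass and an L1 loss on the predicted chunk, instead of
    # predicting noise/velocity at a sampled diffusion timestep.
    return F.l1_loss(head(policy_feat), expert_actions)
```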

## Usage
For detailed instructions on installation, training, and evaluation, please refer to the [official GitHub repository](https://github.com/junjxiao/Multi-view-VLA).
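
Since the checkpoints include LoRA adapter configurations (hence `library_name: peft`), loading them on top of a base model will likely follow the standard PEFT pattern sketched below. The repo id is a placeholder, the auto class is an assumption about the backbone, and the actual entry point may be the project's own loading code, so defer to the GitHub instructions.

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

ADAPTER_REPO = "<this-model-repo-id>"  # placeholder: this repository's Hub id

# The adapter config records which base model the LoRA weights were trained on.
peft_config = PeftConfig.from_pretrained(ADAPTER_REPO)
base_id = peft_config.base_model_name_or_path

# AutoModelForImageTextToText is assumed here; the backbone may require a different class.
processor = AutoProcessor.from_pretrained(base_id)
base_model = AutoModelForImageTextToText.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Attach the LoRA adapter weights on top of the (frozen) base model.
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO).eval()
```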

## Citation
If you find this work useful, please consider citing:

```bibtex
@article{xiao2026learning,
  title={Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation},
  author={Junjin Xiao and Dongyang Li and Yandan Yang and Shuang Zeng and Tong Lin and Xinyuan Chang and Feng Xiong and Mu Xu and Xing Wei and Zhiheng Ma and Qing Zhang and Wei-Shi Zheng},
  year={2026},
  journal={arxiv:2605.11832},
}
```

## Acknowledgement
This project builds upon [starVLA](https://github.com/starVLA/starVLA), [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [vggt](https://github.com/facebookresearch/vggt), [JiT](https://github.com/LTH14/JiT), [LeRobot](https://github.com/huggingface/lerobot), [Isaac-GR00T](https://github.com/NVIDIA/Isaac-GR00T), and [any4lerobot](https://github.com/Tavish9/any4lerobot).