---
license: apache-2.0
datasets:
- Vincent2025hello/usim
language:
- en
base_model:
- nvidia/GR00T-N1.5-3B
tags:
- robotics
- vla
- underwater robot
---
# U0 Final: Underwater Robot VLA Model

- **Model ID:** `Vincent2025hello/u0_final`
- **Base Model:** [nvidia/GR00T-N1.5-3B](https://huggingface.co/nvidia/GR00T-N1.5-3B)
- **License:** Apache 2.0
- **Paper:** [USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots](https://arxiv.org/abs/2510.07869)

---

## Model Description

This model is a **Vision-Language-Action (VLA)** policy fine-tuned from NVIDIA GR00T N1.5 (3B parameters) for the **U0 underwater robot** (based on BlueROV2). It takes dual-camera images, multi-sensor state readings, and a natural-language task description as input, and outputs 16-step action trajectories for autonomous underwater tasks.

## Fine-Tuning Details

| Item | Value |
|------|-------|
| Base Model | GR00T-N1.5-3B |
| Fine-Tuning Method | Full Fine-Tuning (with visual tuning) |
| Action Horizon | 16 steps |
| Denoising Steps | 4 (inference) |
| Embodiment Tag | `new_embodiment` |
| Data Config | `u0_bot` |
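
An action horizon of 16 means that each policy call returns 16 consecutive actions. The sketch below shows one common way to consume such chunks, under stated assumptions: `policy`, `get_observation`, and `send_action` are hypothetical callables standing in for the model and robot interfaces, and `steps_per_query` is an illustrative parameter (not part of this model's configuration) that sets how many actions of each chunk are executed before the policy is queried again.

```python
from typing import Callable, List

ACTION_HORIZON = 16  # steps per predicted chunk, from the table above

Action = List[float]
# Hypothetical policy interface: observation dict in, list of 16 actions out.
Policy = Callable[[dict], List[Action]]


def run_chunked(policy: Policy,
                get_observation: Callable[[], dict],
                send_action: Callable[[Action], None],
                total_steps: int,
                steps_per_query: int = ACTION_HORIZON) -> int:
    """Execute `total_steps` control steps and return the number of policy queries."""
    if not 1 <= steps_per_query <= ACTION_HORIZON:
        raise ValueError("steps_per_query must be between 1 and ACTION_HORIZON")
    queries = 0
    executed = 0
    while executed < total_steps:
        # Ask for a fresh chunk based on the latest observation.
        chunk = policy(get_observation())
        queries += 1
        if len(chunk) != ACTION_HORIZON:
            raise ValueError(f"expected {ACTION_HORIZON} actions, got {len(chunk)}")
        # Execute only the first `steps_per_query` actions of the chunk.
        for action in chunk[:steps_per_query]:
            if executed >= total_steps:
                break
            send_action(action)
            executed += 1
    return queries
```

Executing fewer than 16 steps per query trades extra inference calls for faster reaction to new observations.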

## Input / Output

### Inputs

- **Video** (dual camera): ego-view + wrist-view images (224×224)
- **State** (29-dim):
  - `joint_pos` (5): joint positions
  - `pwm` (8): thruster PWM values
  - `joint_v` (5): joint velocities
  - `dvl_v` (3): DVL velocity
  - `imu_av` (3): IMU angular velocity
  - `imu_la` (3): IMU linear acceleration
  - `pressure` (1): depth pressure
  - `dvl_h` (1): DVL altitude
- **Language**: natural language task description
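
For bookkeeping, the state fields can be treated as named slices of one flat vector. The following is a minimal sketch, not the project's data loader: the helper names are hypothetical, the concatenation order is assumed to follow the list above, and the `joint_pos` width is passed as an argument so the same helpers fit manipulators with a different joint count.

```python
from typing import Dict, List, Sequence, Tuple

Fields = List[Tuple[str, int]]


def state_fields(joint_pos_dim: int) -> Fields:
    """Return (name, width) pairs in the order the state fields are listed."""
    return [
        ("joint_pos", joint_pos_dim),
        ("pwm", 8),
        ("joint_v", 5),
        ("dvl_v", 3),
        ("imu_av", 3),
        ("imu_la", 3),
        ("pressure", 1),
        ("dvl_h", 1),
    ]


def state_slices(fields: Fields) -> Dict[str, slice]:
    """Map each field name to its slice in the flat state vector."""
    slices: Dict[str, slice] = {}
    start = 0
    for name, width in fields:
        slices[name] = slice(start, start + width)
        start += width
    return slices


def pack_state(readings: Dict[str, Sequence[float]], fields: Fields) -> List[float]:
    """Concatenate per-field readings into one flat list, checking each width."""
    flat: List[float] = []
    for name, width in fields:
        values = list(readings[name])
        if len(values) != width:
            raise ValueError(f"{name}: expected {width} values, got {len(values)}")
        flat.extend(float(v) for v in values)
    return flat
```

Given `fields = state_fields(n)`, indexing a packed vector with `state_slices(fields)["pwm"]` returns the eight thruster values.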

### Outputs

- **Action** (13-dim × 16 steps):
  - `joint_pos` (5): target joint positions
  - `pwm` (8): target thruster PWM values
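
A predicted chunk can be separated back into joint and thruster trajectories before it is sent to the robot. This is a rough sketch under two assumptions that the card does not state: each step is a flat vector, and the joint targets come first with the 8 PWM values last. The function name is hypothetical.

```python
from typing import Dict, List, Sequence

ACTION_HORIZON = 16  # steps per predicted chunk, as stated above
PWM_DIM = 8          # thruster PWM values per step, as stated above


def split_action_chunk(chunk: Sequence[Sequence[float]]) -> Dict[str, List[List[float]]]:
    """Split a 16-step chunk of flat action vectors into per-field trajectories."""
    if len(chunk) != ACTION_HORIZON:
        raise ValueError(f"expected {ACTION_HORIZON} steps, got {len(chunk)}")
    joint_traj: List[List[float]] = []
    pwm_traj: List[List[float]] = []
    for step in chunk:
        values = list(step)
        if len(values) <= PWM_DIM:
            raise ValueError("each step must hold joint targets plus 8 PWM values")
        # Assumed layout: [joint targets..., 8 PWM values].
        joint_traj.append(values[:-PWM_DIM])
        pwm_traj.append(values[-PWM_DIM:])
    return {"joint_pos": joint_traj, "pwm": pwm_traj}
```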
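
## Download Model

```bash
# The `hf` CLI ships with recent huggingface_hub releases;
# on older installs, use `huggingface-cli download` with the same arguments.
pip install -U huggingface_hub
hf download Vincent2025hello/u0_final --local-dir ./u0_final
```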
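
The same download can be scripted from Python with `huggingface_hub.snapshot_download`. This is a minimal sketch; the wrapper name `download_checkpoint` and its default directory are illustrative choices, not part of the U0 codebase.

```python
from pathlib import Path

REPO_ID = "Vincent2025hello/u0_final"  # model ID from this card


def download_checkpoint(local_dir: str = "./u0_final") -> Path:
    """Download every file in the model repository and return the local folder."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)
```

Calling `download_checkpoint()` requires network access and fetches the full checkpoint into `./u0_final`.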

## Training Code

The complete fine-tuning and evaluation framework is available at: [https://github.com/VincentGu2000/u0](https://github.com/VincentGu2000/u0)

## Citation

```bibtex
@misc{gu2025usimu0visionlanguageactiondataset,
  title={USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots},
  author={Junwen Gu and Zhiheng Wu and Pengxuan Si and Shuang Qiu and Yukai Feng and Luoyang Sun and Laien Luo and Lianyi Yu and Jian Wang and Zhengxing Wu},
  year={2025},
  eprint={2510.07869},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.07869},
}
```

## Acknowledgments

This model is fine-tuned from [NVIDIA GR00T N1.5](https://huggingface.co/nvidia/GR00T-N1.5-3B). We thank the NVIDIA GEAR team for open-sourcing the GR00T model and framework.