---
license: apache-2.0
datasets:
- Vincent2025hello/usim
language:
- en
base_model:
- nvidia/GR00T-N1.5-3B
tags:
- robotics
- vla
- underwater robot
---
# U0 Final — Underwater Robot VLA Model
**Model ID:** `Vincent2025hello/u0_final` \
**Base Model:** [nvidia/GR00T-N1.5-3B](https://huggingface.co/nvidia/GR00T-N1.5-3B) \
**License:** Apache 2.0 \
**Paper:** [USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots](https://arxiv.org/abs/2510.07869)
---
## Model Description
This model is a **Vision-Language-Action (VLA)** policy fine-tuned from NVIDIA GR00T N1.5 (3B parameters) for the **U0 underwater robot** (based on the BlueROV2 platform). It takes dual-camera visual observations, multi-sensor state inputs, and a natural-language task instruction, and outputs 16-step action trajectories for autonomous underwater tasks.
## Fine-Tuning Details
| Item | Value |
|------|-------|
| Base Model | GR00T-N1.5-3B |
| Fine-Tuning Method | Full Fine-Tuning (with visual tuning) |
| Action Horizon | 16 steps |
| Denoising Steps | 4 (inference) |
| Embodiment Tag | `new_embodiment` |
| Data Config | `u0_bot` |
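GR00T N1.5 generates each action chunk by iteratively denoising a Gaussian noise sample with its flow-matching action head; the "4 denoising steps" above are the number of integration steps run at inference. A toy NumPy sketch of that loop, with a placeholder velocity field standing in for the (vision-, state-, and language-conditioned) model:

```python
import numpy as np

def denoise_actions(velocity_fn, horizon=16, action_dim=13, num_steps=4, seed=0):
    """Toy Euler-integration sketch of flow-matching action generation.

    `velocity_fn` is a placeholder for the model's learned velocity field;
    the real policy conditions it on camera, state, and language inputs.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x  # action trajectory of shape (horizon, action_dim)

# Placeholder velocity field that simply pulls the sample toward zero.
trajectory = denoise_actions(lambda x, t: -x)
print(trajectory.shape)  # (16, 13)
```

This is only an illustration of why more denoising steps trade latency for sample quality; the actual integrator and conditioning live in the GR00T codebase.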
## Input / Output
### Inputs
- **Video** (dual camera): ego-view + wrist-view images (224×224)
- **State** (29-dim):
- `joint_pos` (6): joint positions
- `pwm` (8): thruster PWM values
- `joint_v` (5): joint velocities
- `dvl_v` (3): DVL velocity
- `imu_av` (3): IMU angular velocity
- `imu_la` (3): IMU linear acceleration
- `pressure` (1): depth pressure
- `dvl_h` (1): DVL altitude
- **Language**: natural language task description
### Outputs
- **Action** (13-dim × 16 steps):
- `joint_pos` (6): target joint positions
- `pwm` (8): target thruster PWM values
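For orientation, the modality layout above can be sketched as a dictionary of arrays. The key names below are illustrative assumptions (the authoritative names come from the `u0_bot` data config in the training repo); only the per-component shapes follow the card:

```python
import numpy as np

# Illustrative observation layout; key names are assumptions, shapes follow
# the spec above (leading axis 1 = one observation timestep).
obs = {
    "video.ego_view":   np.zeros((1, 224, 224, 3), dtype=np.uint8),
    "video.wrist_view": np.zeros((1, 224, 224, 3), dtype=np.uint8),
    "state.joint_pos":  np.zeros((1, 6), dtype=np.float32),
    "state.pwm":        np.zeros((1, 8), dtype=np.float32),
    "state.joint_v":    np.zeros((1, 5), dtype=np.float32),
    "state.dvl_v":      np.zeros((1, 3), dtype=np.float32),
    "state.imu_av":     np.zeros((1, 3), dtype=np.float32),
    "state.imu_la":     np.zeros((1, 3), dtype=np.float32),
    "state.pressure":   np.zeros((1, 1), dtype=np.float32),
    "state.dvl_h":      np.zeros((1, 1), dtype=np.float32),
    "annotation.human.task_description": ["<natural-language instruction>"],
}

# The policy returns a 16-step chunk over joint_pos (6) and pwm (8).
action_chunk = {
    "action.joint_pos": np.zeros((16, 6), dtype=np.float32),
    "action.pwm":       np.zeros((16, 8), dtype=np.float32),
}
```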
## Download Model
```bash
pip install -U huggingface_hub  # the `hf` CLI requires a recent version
hf download Vincent2025hello/u0_final --local-dir ./u0_final
```
## Training Code
The complete fine-tuning and evaluation framework is available at: [https://github.com/VincentGu2000/u0](https://github.com/VincentGu2000/u0)
## Citation
```bibtex
@misc{gu2025usimu0visionlanguageactiondataset,
  title={USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots},
  author={Junwen Gu and Zhiheng Wu and Pengxuan Si and Shuang Qiu and Yukai Feng and Luoyang Sun and Laien Luo and Lianyi Yu and Jian Wang and Zhengxing Wu},
  year={2025},
  eprint={2510.07869},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.07869},
}
```
## Acknowledgments
This model is fine-tuned from [NVIDIA GR00T N1.5](https://huggingface.co/nvidia/GR00T-N1.5-3B). We thank the NVIDIA GEAR team for open-sourcing the GR00T model and framework.