---
license: apache-2.0
datasets:
- Vincent2025hello/usim
language:
- en
base_model:
- nvidia/GR00T-N1.5-3B
tags:
- robotics
- vla
- underwater robot
---
# U0 Final — Underwater Robot VLA Model
**Model ID:** `Vincent2025hello/u0_final`
**Base Model:** [nvidia/GR00T-N1.5-3B](https://huggingface.co/nvidia/GR00T-N1.5-3B)
**License:** Apache 2.0
**Paper:** [USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots](https://arxiv.org/abs/2510.07869)
---
## Model Description
This model is a **Vision-Language-Action (VLA)** policy fine-tuned from NVIDIA GR00T N1.5 (3B parameters) for the **U0 underwater robot** (based on BlueROV2). It takes dual-camera visual observations and multi-sensor state inputs, and outputs 16-step action trajectories for autonomous underwater tasks.
## Fine-Tuning Details
| Item | Value |
|------|-------|
| Base Model | GR00T-N1.5-3B |
| Fine-Tuning Method | Full Fine-Tuning (with visual tuning) |
| Action Horizon | 16 steps |
| Denoising Steps | 4 (inference) |
| Embodiment Tag | `new_embodiment` |
| Data Config | `u0_bot` |
## Input / Output
### Inputs
- **Video** (dual camera): ego-view + wrist-view images (224×224)
- **State** (29-dim):
  - `joint_pos` (6): joint positions
  - `pwm` (8): thruster PWM values
  - `joint_v` (5): joint velocities
  - `dvl_v` (3): DVL velocity
  - `imu_av` (3): IMU angular velocity
  - `imu_la` (3): IMU linear acceleration
  - `pressure` (1): depth pressure
  - `dvl_h` (1): DVL altitude
- **Language**: natural language task description
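To make the input layout concrete, here is a minimal sketch that builds a zero-filled observation with the shapes listed above. The dictionary key names (`video.ego_view`, `video.wrist_view`, `state.<name>`, `annotation.task`), the leading time axis of length 1, the channel-last image layout, and the dtypes are all assumptions for illustration; the actual names and layout are defined by the `u0_bot` data config in the training repository. The per-key state sizes are copied from the list above and should be checked against the dataset metadata before use.

```python
import numpy as np

# Per-key state sizes, copied from the list above.
STATE_DIMS = {
    "joint_pos": 6,
    "pwm": 8,
    "joint_v": 5,
    "dvl_v": 3,
    "imu_av": 3,
    "imu_la": 3,
    "pressure": 1,
    "dvl_h": 1,
}
IMAGE_SIZE = (224, 224)  # (height, width) of each camera image


def make_dummy_observation(task: str) -> dict:
    """Build a zero-filled observation with the shapes described above.

    All key names are hypothetical placeholders, not confirmed names from
    the U0 data config. The leading axis of length 1 is an assumed single
    time step.
    """
    height, width = IMAGE_SIZE
    obs = {
        "video.ego_view": np.zeros((1, height, width, 3), dtype=np.uint8),
        "video.wrist_view": np.zeros((1, height, width, 3), dtype=np.uint8),
        "annotation.task": [task],
    }
    for name, dim in STATE_DIMS.items():
        obs[f"state.{name}"] = np.zeros((1, dim), dtype=np.float32)
    return obs


obs = make_dummy_observation("inspect the pipeline")
for key, value in obs.items():
    print(key, getattr(value, "shape", value))
```

A dummy observation like this is mainly useful for checking shapes end to end before connecting real camera and sensor streams.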
### Outputs
- **Action** (13-dim × 16 steps):
  - `joint_pos` (6): target joint positions
  - `pwm` (8): target thruster PWM values
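A predicted action chunk covers 16 future steps, so a controller typically walks through it one step at a time. The sketch below assumes the chunk arrives as a dictionary of per-key arrays shaped `(horizon, dim)`; that layout, the key names, and the helper `iter_action_steps` are hypothetical and not taken from the U0 codebase. The zero-filled arrays stand in for real model output.

```python
import numpy as np

ACTION_HORIZON = 16  # number of steps per predicted chunk, from the table above


def iter_action_steps(action_chunk: dict):
    """Yield one command dict per time step from a chunk of per-key arrays.

    Each value is expected to have shape (horizon, dim). All keys must
    share the same horizon, otherwise a ValueError is raised.
    """
    horizons = {arr.shape[0] for arr in action_chunk.values()}
    if len(horizons) != 1:
        raise ValueError(f"inconsistent horizons across keys: {sorted(horizons)}")
    (horizon,) = horizons
    for t in range(horizon):
        yield {name: arr[t] for name, arr in action_chunk.items()}


# Placeholder chunk standing in for real model output.
chunk = {
    "joint_pos": np.zeros((ACTION_HORIZON, 6), dtype=np.float32),
    "pwm": np.zeros((ACTION_HORIZON, 8), dtype=np.float32),
}

steps = list(iter_action_steps(chunk))
print(len(steps), steps[0]["joint_pos"].shape, steps[0]["pwm"].shape)
```

In practice a controller may execute only the first few steps of each chunk and then query the policy again; how many steps to execute is a deployment choice this sketch does not fix.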
## Download Model
```bash
# The `hf` CLI ships with recent huggingface_hub releases, so upgrade if needed
pip install -U huggingface_hub
hf download Vincent2025hello/u0_final --local-dir ./u0_final
```
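The same download can be done from Python, which is convenient inside a setup script. This is a minimal sketch using `snapshot_download` from `huggingface_hub`; the wrapper function `download_checkpoint` and its default directory are illustrative choices, not part of the U0 codebase.

```python
REPO_ID = "Vincent2025hello/u0_final"


def download_checkpoint(local_dir: str = "./u0_final") -> str:
    """Download the full model repository and return the local folder path."""
    # Imported lazily so this module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    path = download_checkpoint()
    print(f"Checkpoint downloaded to {path}")
```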
## Training Code
The complete fine-tuning and evaluation framework is available at: [https://github.com/VincentGu2000/u0](https://github.com/VincentGu2000/u0)
## Citation
```bibtex
@misc{gu2025usimu0visionlanguageactiondataset,
title={USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots},
author={Junwen Gu and Zhiheng Wu and Pengxuan Si and Shuang Qiu and Yukai Feng and Luoyang Sun and Laien Luo and Lianyi Yu and Jian Wang and Zhengxing Wu},
year={2025},
eprint={2510.07869},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2510.07869},
}
```
## Acknowledgments
This model is fine-tuned from [NVIDIA GR00T N1.5](https://huggingface.co/nvidia/GR00T-N1.5-3B). We thank the NVIDIA GEAR team for open-sourcing the GR00T model and framework.