---
license: apache-2.0
datasets:
- Vincent2025hello/usim
language:
- en
base_model:
- nvidia/GR00T-N1.5-3B
tags:
- robotics
- vla
- underwater robot
---
# U0 Final — Underwater Robot VLA Model

**Model ID:** `Vincent2025hello/u0_final`  
**Base Model:** [nvidia/GR00T-N1.5-3B](https://huggingface.co/nvidia/GR00T-N1.5-3B)  
**License:** Apache 2.0  
**Paper:** [USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots](https://arxiv.org/abs/2510.07869)

---

## Model Description

This model is a **Vision-Language-Action (VLA)** policy fine-tuned from NVIDIA GR00T N1.5 (3B parameters) for the **U0 underwater robot** (based on BlueROV2). It takes dual-camera visual observations and multi-sensor state inputs, and outputs 16-step action trajectories for autonomous underwater tasks.

## Fine-Tuning Details

| Item | Value |
|------|-------|
| Base Model | GR00T-N1.5-3B |
| Fine-Tuning Method | Full Fine-Tuning (with visual tuning) |
| Action Horizon | 16 steps |
| Denoising Steps | 4 (inference) |
| Embodiment Tag | `new_embodiment` |
| Data Config | `u0_bot` |
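
With a 16-step action horizon, one common way to deploy a chunked policy is to execute part of each predicted chunk and then query the model again with a fresh observation. The sketch below illustrates that pattern with a stand-in policy. The names `dummy_policy`, `run_chunked`, and `replan_every` are hypothetical, the code does not use the GR00T inference API, and it is not necessarily how U0 is run in the authors' framework.

```python
from typing import Callable, List

ACTION_HORIZON = 16  # steps per predicted chunk, per the table above

Action = List[float]
# Assumption: a policy maps a time index (standing in for an observation)
# to a chunk of ACTION_HORIZON actions.
Policy = Callable[[int], List[Action]]


def dummy_policy(t: int) -> List[Action]:
    """Hypothetical stand-in for the model: one-dimensional actions t, t+1, ..."""
    return [[float(t + i)] for i in range(ACTION_HORIZON)]


def run_chunked(policy: Policy, total_steps: int,
                replan_every: int = ACTION_HORIZON) -> List[Action]:
    """Execute `total_steps` actions, re-querying the policy every `replan_every` steps."""
    if not 1 <= replan_every <= ACTION_HORIZON:
        raise ValueError("replan_every must be between 1 and ACTION_HORIZON")
    executed: List[Action] = []
    t = 0
    while t < total_steps:
        chunk = policy(t)                       # fresh 16-step prediction
        n = min(replan_every, total_steps - t)  # do not overshoot the episode
        executed.extend(chunk[:n])              # keep only the first n actions
        t += n
    return executed


if __name__ == "__main__":
    actions = run_chunked(dummy_policy, total_steps=40, replan_every=8)
    print(len(actions), "actions executed")
```

A smaller `replan_every` reacts faster to new observations at the cost of more model calls; setting it to 16 executes every chunk in full.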

## Input / Output

### Inputs

- **Video** (dual camera): ego-view + wrist-view images (224×224)
- **State** (29-dim):
  - `joint_pos` (5): joint positions
  - `pwm` (8): thruster PWM values
  - `joint_v` (5): joint velocities
  - `dvl_v` (3): DVL velocity
  - `imu_av` (3): IMU angular velocity
  - `imu_la` (3): IMU linear acceleration
  - `pressure` (1): depth pressure
  - `dvl_h` (1): DVL altitude
- **Language**: natural language task description

### Outputs

- **Action** (13-dim × 16 steps):
  - `joint_pos` (5): target joint positions
  - `pwm` (8): target thruster PWM values
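
To send a predicted chunk to the robot, each action row has to be split back into joint targets and thruster targets. The sketch below is a minimal illustration that assumes every row is laid out as `joint_pos` followed by `pwm`, matching the order of the list above. That ordering is an assumption, not something confirmed here, and `split_action_chunk` and `make_fake_chunk` are hypothetical helpers rather than part of the U0 codebase.

```python
from typing import List, Sequence, Tuple

NUM_THRUSTERS = 8    # width of `pwm`, per the list above
ACTION_HORIZON = 16  # rows per predicted chunk

Rows = List[List[float]]


def split_action_chunk(chunk: Sequence[Sequence[float]]) -> Tuple[Rows, Rows]:
    """Split every row into (joint targets, thruster PWM targets).

    Assumes `pwm` occupies the last NUM_THRUSTERS entries of each row.
    """
    if len(chunk) != ACTION_HORIZON:
        raise ValueError(f"expected {ACTION_HORIZON} rows, got {len(chunk)}")
    joints: Rows = []
    pwm: Rows = []
    for row in chunk:
        if len(row) <= NUM_THRUSTERS:
            raise ValueError("row too short to hold joint and thruster targets")
        joints.append(list(row[:-NUM_THRUSTERS]))
        pwm.append(list(row[-NUM_THRUSTERS:]))
    return joints, pwm


def make_fake_chunk(joint_width: int) -> Rows:
    """Hypothetical stand-in for a model output: zero-filled rows."""
    return [[0.0] * (joint_width + NUM_THRUSTERS) for _ in range(ACTION_HORIZON)]


if __name__ == "__main__":
    # joint_width=2 is arbitrary and only exercises the function;
    # the real width comes from the model output.
    joints, pwm = split_action_chunk(make_fake_chunk(joint_width=2))
    print(len(joints), len(joints[0]), len(pwm[0]))
```

Slicing from the end of each row keeps the helper independent of the number of joint targets, so only the thruster count is hard-coded.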

## Download Model

```bash
# The `hf` CLI ships with recent huggingface_hub releases, so upgrade if needed
pip install -U huggingface_hub
hf download Vincent2025hello/u0_final --local-dir ./u0_final
```

## Training Code

The complete fine-tuning and evaluation framework is available at: [https://github.com/VincentGu2000/u0](https://github.com/VincentGu2000/u0)

## Citation

```bibtex
@misc{gu2025usimu0visionlanguageactiondataset,
      title={USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots}, 
      author={Junwen Gu and Zhiheng Wu and Pengxuan Si and Shuang Qiu and Yukai Feng and Luoyang Sun and Laien Luo and Lianyi Yu and Jian Wang and Zhengxing Wu},
      year={2025},
      eprint={2510.07869},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2510.07869}, 
}
```

## Acknowledgments

This model is fine-tuned from [NVIDIA GR00T N1.5](https://huggingface.co/nvidia/GR00T-N1.5-3B). We thank the NVIDIA GEAR team for open-sourcing the GR00T model and framework.