Document float32 and optional bfloat16 inference
README.md CHANGED
```diff
@@ -7,7 +7,7 @@ tags:
 - droid
 ---
 
-<img src="assets/MolmoAct2.svg" alt="MolmoAct Logo">
+<img src="assets/MolmoAct2.svg" alt="MolmoAct Logo" height="50">
 
 # **MolmoAct2-DROID**
 
```
```diff
@@ -25,7 +25,7 @@ This checkpoint is fine-tuned on the filtered DROID Franka mixture with absolute
 
 ## Intended Use
 
-Use this checkpoint for DROID-style inference or for further fine-tuning. Dataset normalization metadata is stored in `norm_stats.json`
+Use this checkpoint for DROID-style inference or for further fine-tuning. Dataset normalization metadata is stored in `norm_stats.json`. Pass `norm_tag="franka_droid"` at inference time.
 
 Continuous action prediction is the intended and recommended inference mode. Discrete action prediction is exposed for parity and debugging, but we use continuous actions by default.
 
```
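As a minimal sketch of how the new `norm_tag` argument is used with the `model` loaded later in the README (the `images`, `state`, and `task` keyword names follow the usage notes later in this diff; the exact signature and the state dimension here are assumptions, not the repository's documented API):

```python
import numpy as np

# Placeholder inputs; real use passes actual camera frames and the raw robot state.
exterior_1_rgb = np.zeros((224, 224, 3), dtype=np.uint8)  # RGB arrays are accepted
wrist_rgb = np.zeros((224, 224, 3), dtype=np.uint8)
state = np.zeros(8, dtype=np.float32)  # hypothetical state dimension

out = model.predict_action(
    images=[exterior_1_rgb, wrist_rgb],  # camera order must match training
    state=state,
    task="pick up the red block",
    norm_tag="franka_droid",             # selects the norm_stats.json entry
)
actions = out.actions                    # returned in robot scale
```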
```diff
@@ -109,7 +109,7 @@ processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
 model = AutoModelForImageTextToText.from_pretrained(
     repo_id,
     trust_remote_code=True,
-
+    dtype=torch.float32,
 ).to("cuda").eval()
 
 out = model.predict_action(
```
````diff
@@ -128,17 +128,35 @@ out = model.predict_action(
 actions = out.actions
 ```
 
+MolmoAct2 was trained with mixed precision. For our reported experiments, we ran inference in `float32`. This path uses the most GPU memory: roughly 26GB with CUDA graph enabled, or around 24GB without CUDA graph.
+
+If you have a GPU with less memory, you can run inference with `bfloat16` instead:
+
+```python
+model = AutoModelForImageTextToText.from_pretrained(
+    repo_id,
+    trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to("cuda").eval()
+
+with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
+    out = model.predict_action(...)
+```
+
+Using `bfloat16` is much more memory efficient and can run under 16GB of GPU memory in our tests. It usually does not hurt performance much.
+
+
 `images` should preserve camera order, for example `[exterior_1_rgb, wrist_rgb]`. Images may be PIL images or RGB arrays. `state` is the raw robot state, and actions are returned in robot scale.
 
 `normalize_language=True` is the default. It lowercases the task string and removes trailing sentence punctuation to match training preprocessing. Set it to `False` if you need to preserve the task text exactly.
 
-`enable_cuda_graph=True` is the default. The first few calls can be slow because the model warms up and captures CUDA graphs
+`enable_cuda_graph=True` is the default. The first few calls can be slow because the model warms up and captures CUDA graphs. Run several random warm-up calls before measuring deployment latency. `num_steps` controls the continuous flow solver and defaults to the checkpoint config value, 10.
 
 Depth reasoning is disabled for this checkpoint. Calling `enable_depth_reasoning=True` will raise an error.
 
 ## Discrete Actions
 
-Discrete action inference requires a caller-provided action tokenizer. It is not saved in this repository. Discrete mode decodes action tokens directly
+Discrete action inference requires a caller-provided action tokenizer; it is not saved in this repository. Discrete mode decodes action tokens directly; the continuous action expert is not used.
 
 ```python
 action_tokenizer = AutoProcessor.from_pretrained(
````
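To verify the quoted peak-memory numbers (roughly 26GB in `float32` with CUDA graphs, under 16GB in `bfloat16`) on your own GPU, a small check with PyTorch's standard allocator statistics (not part of this repo) looks like:

```python
import torch

torch.cuda.reset_peak_memory_stats()
out = model.predict_action(...)  # one full call, after warm-up
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")
```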
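The `normalize_language=True` behavior described in the diff amounts to roughly the following (an illustrative approximation, not the repository's actual preprocessing code):

```python
def normalize_task(task: str) -> str:
    # Lowercase and strip trailing sentence punctuation, per the README description.
    return task.lower().rstrip(".!?")

print(normalize_task("Pick up the red block."))  # -> "pick up the red block"
```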
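Likewise, the CUDA-graph warm-up advice translates into a timing pattern like this sketch (the loop count is arbitrary; the `...` stands in for inputs matching your deployment shapes):

```python
import time
import torch

# The first calls capture CUDA graphs and are not representative of steady-state latency.
for _ in range(5):
    model.predict_action(...)  # random/dummy inputs of the deployment shapes

torch.cuda.synchronize()
start = time.perf_counter()
out = model.predict_action(...)
torch.cuda.synchronize()
print(f"latency: {(time.perf_counter() - start) * 1000:.1f} ms")
```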
|