hqfang committed
Commit 2eb2a03 · Parent: 036a703

Document float32 and optional bfloat16 inference

Files changed (1): README.md (+23 -5)
README.md CHANGED
@@ -7,7 +7,7 @@ tags:
 - droid
 ---

-<img src="assets/MolmoAct2.svg" alt="MolmoAct Logo" style="width: auto; height: 50px;">
+<img src="assets/MolmoAct2.svg" alt="MolmoAct Logo" height="50">

 # **MolmoAct2-DROID**

@@ -25,7 +25,7 @@ This checkpoint is fine-tuned on the filtered DROID Franka mixture with absolute

 ## Intended Use

-Use this checkpoint for DROID-style inference or for further fine-tuning. Dataset normalization metadata is stored in `norm_stats.json`; pass `norm_tag="franka_droid"` at inference time.
+Use this checkpoint for DROID-style inference or for further fine-tuning. Dataset normalization metadata is stored in `norm_stats.json`. Pass `norm_tag="franka_droid"` at inference time.

 Continuous action prediction is the intended and recommended inference mode. Discrete action prediction is exposed for parity and debugging, but we use continuous actions by default.

@@ -109,7 +109,7 @@ processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
 model = AutoModelForImageTextToText.from_pretrained(
     repo_id,
     trust_remote_code=True,
-    torch_dtype=torch.float32,
+    dtype=torch.float32,
 ).to("cuda").eval()

 out = model.predict_action(
@@ -128,17 +128,35 @@ out = model.predict_action(
 actions = out.actions
 ```

+MolmoAct2 was trained with mixed precision. For our reported experiments, we ran inference in `float32`. This path uses the most GPU memory: roughly 26 GB with CUDA graphs enabled, or around 24 GB without.
+
+If you have a GPU with less memory, you can run inference with `bfloat16` instead:
+
+```python
+model = AutoModelForImageTextToText.from_pretrained(
+    repo_id,
+    trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to("cuda").eval()
+
+with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
+    out = model.predict_action(...)
+```
+
+Using `bfloat16` is much more memory efficient: it ran under 16 GB of GPU memory in our tests, and it usually does not hurt performance much.
+
+
 `images` should preserve camera order, for example `[exterior_1_rgb, wrist_rgb]`. Images may be PIL images or RGB arrays. `state` is the raw robot state, and actions are returned in robot scale.

 `normalize_language=True` is the default. It lowercases the task string and removes trailing sentence punctuation to match training preprocessing. Set it to `False` if you need to preserve the task text exactly.

-`enable_cuda_graph=True` is the default. The first few calls can be slow because the model warms up and captures CUDA graphs; run several random warm-up calls before measuring deployment latency. `num_steps` controls the continuous flow solver and defaults to the checkpoint config value, 10.
+`enable_cuda_graph=True` is the default. The first few calls can be slow because the model warms up and captures CUDA graphs. Run several random warm-up calls before measuring deployment latency. `num_steps` controls the continuous flow solver and defaults to the checkpoint config value, 10.

 Depth reasoning is disabled for this checkpoint. Calling `enable_depth_reasoning=True` will raise an error.

 ## Discrete Actions

-Discrete action inference requires a caller-provided action tokenizer. It is not saved in this repository. Discrete mode decodes action tokens directly; the continuous action expert is not used.
+Discrete action inference requires a caller-provided action tokenizer. It is not saved in this repository. Discrete mode decodes action tokens directly. The continuous action expert is not used.

 ```python
 action_tokenizer = AutoProcessor.from_pretrained(