Document float32 and optional bfloat16 inference
README.md CHANGED
```diff
@@ -7,7 +7,7 @@ tags:
 - droid
 ---
 
-<img src="assets/MolmoAct2.svg" alt="MolmoAct Logo">
+<img src="assets/MolmoAct2.svg" alt="MolmoAct Logo" height="50">
 
 # **MolmoAct2-DROID**
 
```
```diff
@@ -25,7 +25,7 @@ This checkpoint is fine-tuned on the filtered DROID Franka mixture with absolute
 
 ## Intended Use
 
-Use this checkpoint for DROID-style inference or for further fine-tuning. Dataset normalization metadata is stored in `norm_stats.json`
+Use this checkpoint for DROID-style inference or for further fine-tuning. Dataset normalization metadata is stored in `norm_stats.json`. Pass `norm_tag="franka_droid"` at inference time.
 
 Continuous action prediction is the intended and recommended inference mode. Discrete action prediction is exposed for parity and debugging, but we use continuous actions by default.
 
```
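As a minimal sketch of how the new `norm_tag` argument is used with the `model` loaded later in the README (the `images`, `state`, and `task` keyword names follow the usage notes later in this diff; the exact signature and the state dimension here are assumptions, not the repository's documented API):

```python
import numpy as np

# Placeholder inputs; real use passes actual camera frames and the raw robot state.
exterior_1_rgb = np.zeros((224, 224, 3), dtype=np.uint8)  # RGB arrays are accepted
wrist_rgb = np.zeros((224, 224, 3), dtype=np.uint8)
state = np.zeros(8, dtype=np.float32)  # hypothetical state dimension

out = model.predict_action(
    images=[exterior_1_rgb, wrist_rgb],  # camera order must match training
    state=state,
    task="pick up the red block",
    norm_tag="franka_droid",             # selects the norm_stats.json entry
)
actions = out.actions                    # returned in robot scale
```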
```diff
@@ -109,7 +109,7 @@ processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
 model = AutoModelForImageTextToText.from_pretrained(
     repo_id,
     trust_remote_code=True,
-
+    dtype=torch.float32,
 ).to("cuda").eval()
 
 out = model.predict_action(
```
````diff
@@ -128,17 +128,35 @@ out = model.predict_action(
 actions = out.actions
 ```
 
+MolmoAct2 was trained with mixed precision. For our reported experiments, we ran inference in `float32`. This path uses the most GPU memory: roughly 26GB with CUDA graph enabled, or around 24GB without CUDA graph.
+
+If you have a GPU with less memory, you can run inference with `bfloat16` instead:
+
+```python
+model = AutoModelForImageTextToText.from_pretrained(
+    repo_id,
+    trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to("cuda").eval()
+
+with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
+    out = model.predict_action(...)
+```
+
+Using `bfloat16` is much more memory efficient and can run under 16GB of GPU memory in our tests. It usually does not hurt performance much.
+
+
 `images` should preserve camera order, for example `[exterior_1_rgb, wrist_rgb]`. Images may be PIL images or RGB arrays. `state` is the raw robot state, and actions are returned in robot scale.
 
 `normalize_language=True` is the default. It lowercases the task string and removes trailing sentence punctuation to match training preprocessing. Set it to `False` if you need to preserve the task text exactly.
 
-`enable_cuda_graph=True` is the default. The first few calls can be slow because the model warms up and captures CUDA graphs
+`enable_cuda_graph=True` is the default. The first few calls can be slow because the model warms up and captures CUDA graphs. Run several random warm-up calls before measuring deployment latency. `num_steps` controls the continuous flow solver and defaults to the checkpoint config value, 10.
 
 Depth reasoning is disabled for this checkpoint. Calling `enable_depth_reasoning=True` will raise an error.
 
 ## Discrete Actions
 
-Discrete action inference requires a caller-provided action tokenizer. It is not saved in this repository. Discrete mode decodes action tokens directly
+Discrete action inference requires a caller-provided action tokenizer; it is not saved in this repository. Discrete mode decodes action tokens directly; the continuous action expert is not used.
 
 ```python
 action_tokenizer = AutoProcessor.from_pretrained(
````
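To verify the quoted peak-memory numbers (roughly 26GB in `float32` with CUDA graphs, under 16GB in `bfloat16`) on your own GPU, a small check with PyTorch's standard allocator statistics (not part of this repo) looks like:

```python
import torch

torch.cuda.reset_peak_memory_stats()
out = model.predict_action(...)  # one full call, after warm-up
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")
```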
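The `normalize_language=True` behavior described in the diff amounts to roughly the following (an illustrative approximation, not the repository's actual preprocessing code):

```python
def normalize_task(task: str) -> str:
    # Lowercase and strip trailing sentence punctuation, per the README description.
    return task.lower().rstrip(".!?")

print(normalize_task("Pick up the red block."))  # -> "pick up the red block"
```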
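Likewise, the CUDA-graph warm-up advice translates into a timing pattern like this sketch (the loop count is arbitrary; the `...` stands in for inputs matching your deployment shapes):

```python
import time
import torch

# The first calls capture CUDA graphs and are not representative of steady-state latency.
for _ in range(5):
    model.predict_action(...)  # random/dummy inputs of the deployment shapes

torch.cuda.synchronize()
start = time.perf_counter()
out = model.predict_action(...)
torch.cuda.synchronize()
print(f"latency: {(time.perf_counter() - start) * 1000:.1f} ms")
```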
|