---
license: other
license_name: rlwrld-model-license-v1.0
license_link: LICENSE.md
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- manipulation
- flow-matching
- rldx
- droid
base_model: RLWRLD/RLDX-1-PT
---
# RLDX-1-MT-DROID
[Paper](https://arxiv.org/abs/2605.03269) · [Project page](https://rlwrld.ai/rldx-1) · [Code](https://github.com/RLWRLD/RLDX-1) · [Models](https://huggingface.co/collections/RLWRLD/rldx-1)
<p align="center">
<img src="teaser.png" width="100%" alt="RLDX-1 teaser">
</p>

**RLDX-1** is a general-purpose Robot Foundation Model designed for dexterous
manipulation. Its **Multi-Stream Action Transformer (MSAT)** unifies
multimodal perception (visual + tactile), high-DoF actuation, and
memory-aware decision-making in a single architecture.

This repository hosts **`RLDX-1-MT-DROID`**: RLDX-1 **mid-trained** on the
[DROID](https://droid-dataset.github.io/) dataset (large-scale Franka-arm
teleoperation). Mid-training continues from the multi-source `RLDX-1-PT`
pretraining on an embodiment-specific corpus before downstream task
finetuning, making this checkpoint a stronger initialization than
`RLDX-1-PT` for Franka-style downstream tasks.
## Highlights
- **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and
action each get a dedicated stream coupled by joint self-attention —
an extension of MM-DiT to action modeling.
- **Motion awareness.** Multi-frame observations + a motion module
capture temporal dynamics; intermediate VLM layers compress video
tokens to keep the policy efficient.
- **Long-term memory.** A memory module fuses past cognition features
with the current ones for history-grounded decisions beyond a short
multi-frame window.
- **Physical sensing.** Tactile and torque signals enter as a dedicated
physics stream; the decoder is jointly trained to predict future physical
signals.
- **Three-stage training.** Pre-training (generalization) → mid-training
(functionality) → post-training (task adaptation), with synthetic data
augmenting rare manipulation scenarios.
- **Real-time inference.** Static graph capture + custom fused kernels
bring the all-modality model to **43.7 ms / step on RTX 5090
(1.63× speedup, >22 Hz)**.
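
The joint self-attention coupling in the first highlight can be illustrated with a toy NumPy sketch. This is not the RLDX-1 implementation: the stream names, token counts, model width, and single-head layout are assumptions chosen for illustration. Each stream keeps its own projection weights, and all tokens attend to one another in a single softmax, in the style of MM-DiT.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(streams, weights):
    """Single-head joint attention over several token streams.

    streams: dict name -> (n_tokens, d) array
    weights: dict name -> (Wq, Wk, Wv), each (d, d), private to that stream
    Returns a dict name -> (n_tokens, d) array.
    """
    names = list(streams)
    # Per-stream projections: each stream uses its own parameters.
    q = np.concatenate([streams[n] @ weights[n][0] for n in names])
    k = np.concatenate([streams[n] @ weights[n][1] for n in names])
    v = np.concatenate([streams[n] @ weights[n][2] for n in names])
    # One attention over the concatenated sequence couples the streams.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    # Split the result back into per-stream chunks.
    sizes = [streams[n].shape[0] for n in names]
    chunks = np.split(out, np.cumsum(sizes)[:-1])
    return dict(zip(names, chunks))

rng = np.random.default_rng(0)
d = 32  # hypothetical model width
streams = {
    "cognition": rng.normal(size=(64, d)),  # hypothetical token counts
    "physics": rng.normal(size=(8, d)),
    "action": rng.normal(size=(16, d)),
}
weights = {
    n: tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    for n in streams
}
out = joint_self_attention(streams, weights)
```

Each output stream has the same shape as its input, so blocks like this can be stacked while the streams keep separate parameters.
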
## Quick start
### Installation
```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```
### Inference
```python
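from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

# Load the DROID mid-trained checkpoint onto the first GPU.
policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-MT-DROID",
    embodiment_tag=EmbodimentTag.OXE_DROID,
    device="cuda:0",
)

# `observation` must be supplied by the caller (RGB frames, proprioceptive
# state, and the language instruction); its exact format is not shown here.
action = policy.get_action(observation)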
```
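
Because the policy outputs a chunk of actions rather than a single command (length 16, per the model details), a control loop typically executes part of a chunk before querying the policy again. The sketch below shows that pattern with a stand-in policy. `DummyPolicy`, `run_episode`, the `replan_every` parameter, the observation dict, and the 8-dimensional action are all hypothetical and not part of the RLDX API. It also assumes `get_action` returns an array of shape `(16, action_dim)`, which the snippet above does not confirm.

```python
import numpy as np

CHUNK_LEN = 16   # action chunk length stated in the model details
ACTION_DIM = 8   # hypothetical action dimension

class DummyPolicy:
    """Stand-in for a real policy; returns an all-zero action chunk."""

    def __init__(self):
        self.calls = 0

    def get_action(self, observation):
        self.calls += 1
        return np.zeros((CHUNK_LEN, ACTION_DIM))

def run_episode(policy, get_observation, send_action, n_steps, replan_every=8):
    """Execute `replan_every` actions from each chunk, then query again."""
    step = 0
    while step < n_steps:
        chunk = policy.get_action(get_observation())
        for action in chunk[:replan_every]:
            if step >= n_steps:
                break
            send_action(action)
            step += 1
    return step

policy = DummyPolicy()
sent = []
steps = run_episode(
    policy,
    get_observation=lambda: {"instruction": "pick up the cup"},  # placeholder
    send_action=sent.append,
    n_steps=20,
)
```

A smaller `replan_every` reacts faster to new observations at the cost of more policy calls; setting it to the full chunk length minimizes calls.
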
### Real-time serving (ZeroMQ)
```bash
uv run python rldx/eval/run_rldx_server.py \
  --model-path RLWRLD/RLDX-1-MT-DROID \
  --embodiment-tag OXE_DROID \
  --host 0.0.0.0 --port 20000
```
### Finetune from this checkpoint
```bash
uv run python rldx/experiment/launch_train.py \
  --base-model-path RLWRLD/RLDX-1-MT-DROID \
  --dataset-path /path/to/your/dataset \
  --embodiment-tag OXE_DROID \
  --video-length 4 --n-cog-tokens 64 \
  --global-batch-size 64 --learning-rate 1e-4 \
  --max-steps 60000 --output-dir ./outputs/my_finetune
```
For a full finetune walkthrough see
[`docs/training.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/training.md).
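
The flags in the command above imply a fixed sample budget, which is useful to compare against the size of your dataset. The arithmetic below works it out. The GPU count and the even split of the global batch across GPUs are assumptions for illustration; how the launch script actually distributes the batch is not stated here.

```python
# Values taken from the example finetune command.
global_batch_size = 64
max_steps = 60_000

# Total training samples drawn over the run.
samples_seen = global_batch_size * max_steps

# Hypothetical hardware setup: 8 GPUs sharing the global batch evenly.
num_gpus = 8
per_gpu_batch = global_batch_size // num_gpus

print(f"samples seen: {samples_seen:,}")
print(f"per-GPU batch on {num_gpus} GPUs: {per_gpu_batch}")
```
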
## Model details
- **Architecture:** Multi-Stream Action Transformer (MSAT) policy on a
Qwen3-VL backbone, with cognition tokens as the perceptual summary. Trained
with flow matching.
- **Inputs:** RGB video (default 4 frames), proprioceptive state, language
instruction.
- **Outputs:** Action chunks of length 16.
- **Embodiment tag:** `OXE_DROID`.
- **Base model:** [`RLWRLD/RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT).
- **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
- **Mid-train data:** DROID.
- **Params:** 8.1B.
For the full architectural walkthrough see
[`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md).
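
Flow-matching policies generate an action chunk by integrating a learned velocity field from noise (t = 0) to an action sample (t = 1). The sketch below shows Euler integration with a closed-form toy velocity field standing in for the network. `toy_velocity`, `sample_chunk`, the step count, and the 8-dimensional action are hypothetical, and RLDX-1's actual sampler and time convention may differ.

```python
import numpy as np

# Chunk length is from the model details; the action dimension is hypothetical.
CHUNK_LEN, ACTION_DIM = 16, 8

def toy_velocity(x_t, t, target):
    """Velocity of the straight-line path x_t = (1 - t) * x_0 + t * target.

    A trained model would predict this from observations; here it is
    computed in closed form from a known target chunk.
    """
    return (target - x_t) / (1.0 - t)

def sample_chunk(velocity_fn, rng, n_steps=10):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1."""
    x = rng.normal(size=(CHUNK_LEN, ACTION_DIM))  # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
target = np.linspace(-1.0, 1.0, CHUNK_LEN * ACTION_DIM).reshape(CHUNK_LEN, ACTION_DIM)
chunk = sample_chunk(lambda x, t: toy_velocity(x, t, target), rng)
```

With this toy field the integration lands on `target` up to floating-point error; with a learned field, the number of Euler steps trades sample quality against inference latency.
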
## RLDX-1 model family
| Checkpoint | Description |
|---|---|
| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation |
| [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone |
| [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune |
| [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune |
| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | Finetune on the four LIBERO task suites (goal, object, spatial, long) |
| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune |
| [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune |
| [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune |
| [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train (this repo) |
| [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) |
## Intended use & limitations
**Intended use.** As a strong initialization for downstream finetuning on
Franka-arm manipulation tasks; research on robotic manipulation; and
non-commercial real-robot deployment under the conditions of the RLWRLD
Model License v1.0.

**Out of scope.** Commercial deployment, military or weapons applications,
non-consensual surveillance, and any use that violates applicable laws or
regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list.

**Limitations.** Mid-train conditioning is most useful for Franka-style
embodiments. For very different morphologies (humanoid, dual-arm, mobile),
[`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) or
[`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) may be
better starting points. The memory, motion, and physics modules are
inactive in this checkpoint; enable them at finetune time if needed.
## Citation
```bibtex
@article{rldx2026,
title={RLDX-1 Technical Report},
author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
year={2026},
note={RLWRLD},
eprint={2605.03269},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2605.03269}
}
```
## License
Released under the **RLWRLD Model License v1.0** — a non-commercial license
with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for
the full text. By using this model you agree to those terms, including the
use restrictions in §3.5.