---
license: other
license_name: rlwrld-model-license-v1.0
license_link: LICENSE.md
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- manipulation
- flow-matching
- rldx
- droid
base_model: RLWRLD/RLDX-1-PT
---

# RLDX-1-MT-DROID

[Paper](https://arxiv.org/abs/2605.03269) · [Project page](https://rlwrld.ai/rldx-1) · [Code](https://github.com/RLWRLD/RLDX-1) · [Models](https://huggingface.co/collections/RLWRLD/rldx-1)

<p align="center">
  <img src="teaser.png" width="100%" alt="RLDX-1 teaser">
</p>

**RLDX-1** is a general-purpose Robot Foundation Model for dexterous
manipulation. Powered by a **Multi-Stream Action Transformer (MSAT)**, it
unifies multimodal perception (visual + tactile), high-DoF actuation, and
memory-aware decision-making in a single architecture.

This repository hosts **`RLDX-1-MT-DROID`** — RLDX-1 **mid-trained** on the
[DROID](https://droid-dataset.github.io/) dataset (large-scale Franka-arm
teleoperation). Mid-training continues from the multi-source `RLDX-1-PT`
pretraining on an embodiment-specific corpus before downstream task
finetuning, making this checkpoint a stronger initialization than
`RLDX-1-PT` for Franka-style downstream tasks.

## Highlights

- **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and
  action each get a dedicated stream coupled by joint self-attention — an
  extension of MM-DiT to action modeling.
- **Motion awareness.** Multi-frame observations + a motion module capture
  temporal dynamics; intermediate VLM layers compress video tokens to keep
  the policy efficient.
- **Long-term memory.** A memory module fuses past cognition features with
  the current ones for history-grounded decisions beyond a short
  multi-frame window.
- **Physical sensing.** Tactile and torque signals enter as a dedicated
  physics stream; the decoder is jointly trained to predict future
  physical signals.
- **Three-stage training.** Pre-training (generalization) → mid-training
  (functionality) → post-training (task adaptation), with synthetic data
  augmenting rare manipulation scenarios.
- **Real-time inference.** Static graph capture + custom fused kernels
  bring the all-modality model to **43.7 ms / step on RTX 5090
  (1.63× speedup, >22 Hz)**.
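
The joint self-attention coupling described above can be sketched minimally:
concatenate the per-stream tokens, run one attention pass so every token
attends across all streams, then split the result back per stream. The
stream sizes, dimensions, and random weights below are illustrative
stand-ins, not the actual MSAT implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(streams, d_head=32, seed=0):
    """Concatenate per-stream tokens, attend jointly, split back per stream.

    `streams` maps stream name -> (n_tokens, d_model) array. The weights
    are random stand-ins; a real MSAT layer has learned, per-stream
    projections.
    """
    rng = np.random.default_rng(seed)
    names = list(streams)
    lengths = [streams[n].shape[0] for n in names]
    x = np.concatenate([streams[n] for n in names], axis=0)  # (T, d_model)
    d_model = x.shape[1]
    w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                     for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(d_head))  # every token sees every stream
    out = attn @ v                             # (T, d_head)
    return dict(zip(names, np.split(out, np.cumsum(lengths)[:-1], axis=0)))

# Toy cognition / physics / action streams sharing one attention pass.
streams = {
    "cognition": np.random.default_rng(1).standard_normal((64, 128)),
    "physics": np.random.default_rng(2).standard_normal((8, 128)),
    "action": np.random.default_rng(3).standard_normal((16, 128)),
}
out = joint_self_attention(streams)
```

Each stream keeps its own token count on the way out, which is what lets
the architecture route cognition, physics, and action through dedicated
processing while still mixing information at every attention layer.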

## Quick start

### Installation

```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```

### Inference

```python
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-MT-DROID",
    embodiment_tag=EmbodimentTag.OXE_DROID,
    device="cuda:0",
)

action = policy.get_action(observation)
```
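
Here, `observation` is a dict of camera frames, proprioceptive state, and a
language instruction. A minimal sketch of building one is below; the key
names and shapes are assumptions for illustration (consult the embodiment
config for the real schema).

```python
import numpy as np

# Hypothetical observation for a DROID-style Franka setup. Key names and
# shapes are illustrative assumptions, not the verified RLDX-1 schema.
observation = {
    # 4-frame RGB clips (T, H, W, C), matching the default video length.
    "video.exterior_image": np.zeros((4, 224, 224, 3), dtype=np.uint8),
    "video.wrist_image": np.zeros((4, 224, 224, 3), dtype=np.uint8),
    # Proprioception over the same 4 frames: 7 joint angles + gripper width.
    "state.joint_position": np.zeros((4, 7), dtype=np.float32),
    "state.gripper_position": np.zeros((4, 1), dtype=np.float32),
    # Natural-language task instruction.
    "annotation.task": "put the marker in the cup",
}
```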

### Real-time serving (ZeroMQ)

```bash
uv run python rldx/eval/run_rldx_server.py \
  --model-path RLWRLD/RLDX-1-MT-DROID \
  --embodiment-tag OXE_DROID \
  --host 0.0.0.0 --port 20000
```
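
On the robot side, a client connects a ZeroMQ REQ socket to this server and
exchanges serialized observation/action messages. The wire format below
(JSON with these field names) is a hedged assumption for illustration; the
authoritative protocol lives in `rldx/eval/run_rldx_server.py`.

```python
import json
import numpy as np

def build_request(observation):
    """Pack an observation dict into a JSON request body.

    The "type"/"observation" framing and array-to-list serialization are
    illustrative assumptions, not the verified RLDX-1 wire protocol.
    """
    payload = {
        key: value.tolist() if isinstance(value, np.ndarray) else value
        for key, value in observation.items()
    }
    return json.dumps({"type": "get_action", "observation": payload})

msg = build_request({
    "state.joint_position": np.zeros(7, dtype=np.float32),
    "annotation.task": "wipe the table",
})

# With pyzmq, a REQ client would then do roughly:
#   socket = zmq.Context().socket(zmq.REQ)
#   socket.connect("tcp://<server-host>:20000")
#   socket.send_string(msg)
#   action = json.loads(socket.recv_string())
```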

### Finetune from this checkpoint

```bash
uv run python rldx/experiment/launch_train.py \
  --base-model-path RLWRLD/RLDX-1-MT-DROID \
  --dataset-path /path/to/your/dataset \
  --embodiment-tag OXE_DROID \
  --video-length 4 --n-cog-tokens 64 \
  --global-batch-size 64 --learning-rate 1e-4 \
  --max-steps 60000 --output-dir ./outputs/my_finetune
```

For a full finetuning walkthrough, see
[`docs/training.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/training.md).

## Model details

- **Architecture:** Multi-Stream Action Transformer (MSAT) policy on a
  Qwen3-VL backbone with a cognition-token perceptual summary. Trained with
  flow matching.
- **Inputs:** RGB video (default 4 frames), proprioceptive state, language
  instruction.
- **Outputs:** Action chunks of length 16.
- **Embodiment tag:** `OXE_DROID`.
- **Base model:** [`RLWRLD/RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT).
- **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
- **Mid-train data:** DROID.
- **Params:** 8.1B.
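
Flow matching produces an action chunk by integrating a learned velocity
field from Gaussian noise at t = 0 to actions at t = 1. The sketch below
Euler-integrates a stand-in closed-form field toward a fixed target chunk;
the real model's velocity network, step count, and schedule are not
specified here, and the 7-dim action space is an assumption.

```python
import numpy as np

CHUNK_LEN, ACTION_DIM = 16, 7  # chunk length 16 per the card; 7-DoF arm is assumed

def velocity_field(a_t, t):
    """Stand-in for the learned velocity network v_theta(a_t, t | obs).

    For illustration, this is the closed-form field that transports any
    starting noise to a fixed target chunk along straight-line paths.
    """
    target = np.full((CHUNK_LEN, ACTION_DIM), 0.5)
    return (target - a_t) / max(1.0 - t, 1e-3)

def sample_action_chunk(n_steps=10, seed=0):
    """Euler-integrate the flow ODE da/dt = v(a, t) from t=0 to t=1."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((CHUNK_LEN, ACTION_DIM))  # a_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        a = a + dt * velocity_field(a, i * dt)
    return a

chunk = sample_action_chunk()  # (16, 7) action chunk
```

With this toy field, the integration lands exactly on the target chunk; a
trained model instead conditions the velocity on the observation so the
endpoint is the predicted action sequence.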

For the full architectural walkthrough, see
[`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md).

## RLDX-1 model family

| Checkpoint | Description |
|---|---|
| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation |
| [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone |
| [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune |
| [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune |
| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO 4-task suite (goal, object, spatial, long) finetune |
| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune |
| [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune |
| [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune |
| [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train (this repo) |
| [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) |

## Intended use & limitations

**Intended use.** A strong initialization for downstream finetuning on
Franka-arm manipulation tasks; research on robotic manipulation; and
non-commercial real-robot deployment under the conditions of the RLWRLD
Model License v1.0.

**Out of scope.** Commercial deployment, military or weapons applications,
non-consensual surveillance, and any use that violates applicable laws or
regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list.

**Limitations.** Mid-train conditioning is most useful for Franka-style
embodiments. For substantially different morphologies (humanoid, dual-arm,
mobile), [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) or
[`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) may be
better starting points. The memory, motion, and physics modules are
inactive in this checkpoint; enable them at finetune time if needed.

## Citation

```bibtex
@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}
```

## License

Released under the **RLWRLD Model License v1.0** — a non-commercial license
with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md)
for the full text. By using this model you agree to those terms, including
the use restrictions in §3.5.