---
license: other
license_name: rlwrld-model-license-v1.0
license_link: LICENSE.md
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- manipulation
- flow-matching
- rldx
- droid
base_model: RLWRLD/RLDX-1-PT
---

# RLDX-1-MT-DROID

[Paper](https://arxiv.org/abs/2605.03269) · [Project page](https://rlwrld.ai/rldx-1) · [Code](https://github.com/RLWRLD/RLDX-1) · [Models](https://huggingface.co/collections/RLWRLD/rldx-1)

*RLDX-1 teaser*

**RLDX-1** is a general-purpose robot foundation model for dexterous manipulation. Powered by a **Multi-Stream Action Transformer (MSAT)**, it unifies multimodal perception (visual + tactile), high-DoF actuation, and memory-aware decision-making in a single architecture.

This repository hosts **`RLDX-1-MT-DROID`** — RLDX-1 **mid-trained** on the [DROID](https://droid-dataset.github.io/) dataset (large-scale Franka-arm teleoperation). Mid-training continues from the multi-source `RLDX-1-PT` pretraining with an embodiment-specific corpus before downstream task finetuning, making this checkpoint a stronger initialization than `RLDX-1-PT` for Franka-style downstream tasks.

## Highlights

- **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and action each get a dedicated stream coupled by joint self-attention — an extension of MM-DiT to action modeling.
- **Motion awareness.** Multi-frame observations plus a motion module capture temporal dynamics; intermediate VLM layers compress video tokens to keep the policy efficient.
- **Long-term memory.** A memory module fuses past cognition features with the current ones for history-grounded decisions beyond the short multi-frame window.
- **Physical sensing.** Tactile and torque signals enter through a dedicated physics stream; the decoder is jointly trained to predict future physical signals.
- **Three-stage training.** Pre-training (generalization) → mid-training (functionality) → post-training (task adaptation), with synthetic data augmenting rare manipulation scenarios.
- **Real-time inference.** Static graph capture and custom fused kernels bring the all-modality model to **43.7 ms/step on an RTX 5090 (1.63× speedup, >22 Hz)**.

## Quick start

### Installation

```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```

### Inference

```python
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-MT-DROID",
    embodiment_tag=EmbodimentTag.OXE_DROID,
    device="cuda:0",
)

action = policy.get_action(observation)
```

A hedged sketch of what `observation` might contain is given after the model details below.

### Real-time serving (ZeroMQ)

```bash
uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-MT-DROID \
    --embodiment-tag OXE_DROID \
    --host 0.0.0.0 --port 20000
```

A hedged client-side sketch also follows the model details below.

### Finetune from this checkpoint

```bash
uv run python rldx/experiment/launch_train.py \
    --base-model-path RLWRLD/RLDX-1-MT-DROID \
    --dataset-path /path/to/your/dataset \
    --embodiment-tag OXE_DROID \
    --video-length 4 --n-cog-tokens 64 \
    --global-batch-size 64 --learning-rate 1e-4 \
    --max-steps 60000 --output-dir ./outputs/my_finetune
```

For a full finetune walkthrough see [`docs/training.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/training.md).

## Model details

- **Architecture:** Multi-Stream Action Transformer (MSAT) policy on a Qwen3-VL backbone with a cognition-token perceptual summary. Trained with flow matching.
- **Inputs:** RGB video (default 4 frames), proprioceptive state, language instruction.
- **Outputs:** Action chunks of length 16.
- **Embodiment tag:** `OXE_DROID`.
- **Base model:** [`RLWRLD/RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT).
- **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
- **Mid-train data:** DROID.
- **Params:** 8.1B.

For the full architectural walkthrough see [`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md).
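**A note on flow matching.** The policy head is trained with flow matching over action chunks. As orientation only (the exact parameterization is defined in the paper; the linear path below is an assumption, not a claim about RLDX-1), the standard conditional flow-matching objective reads

$$
\mathcal{L}(\theta)=\mathbb{E}_{t,\,a_0,\,a_1}\left\lVert v_\theta(a_t, t \mid o) - (a_1 - a_0)\right\rVert^2,
\qquad a_t = (1-t)\,a_0 + t\,a_1,
$$

where \\(a_1\\) is the ground-truth action chunk, \\(a_0\\) is Gaussian noise, and \\(o\\) is the observation context; inference integrates the learned velocity field from noise to an action chunk.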
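**Building an observation.** The Quick start snippet above leaves `observation` undefined; its authoritative schema comes from the repo's data pipeline for the `OXE_DROID` embodiment tag. The sketch below is illustrative only: the key names, image resolution, and state dimension are assumptions, not the documented schema.

```python
import numpy as np

from rldx.data.embodiment_tags import EmbodimentTag
from rldx.policy.rldx_policy import RLDXPolicy

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-MT-DROID",
    embodiment_tag=EmbodimentTag.OXE_DROID,
    device="cuda:0",
)

# Hypothetical observation layout -- key names, resolution, and state
# dimension are assumptions, not the documented OXE_DROID schema.
observation = {
    "video": np.zeros((4, 224, 224, 3), dtype=np.uint8),  # 4 RGB frames (model default)
    "state": np.zeros((8,), dtype=np.float32),            # proprioceptive state
    "language_instruction": "put the carrot on the plate",
}

# The policy predicts an action chunk of length 16. A common receding-horizon
# pattern is to execute only a prefix of the chunk before re-querying.
actions = policy.get_action(observation)
for action in actions[:8]:
    ...  # send each step to the robot controller
```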
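**Querying the server.** The ZeroMQ server from the Quick start can be reached with a minimal REQ/REP client. The repository presumably ships a matching client under `rldx/eval`; the sketch below assumes a pickled-Python-object protocol, which may not match the actual wire format.

```python
import numpy as np
import zmq  # pyzmq

# Hypothetical client -- the real message format is defined by the server in
# rldx/eval; this only illustrates the REQ/REP round trip on port 20000.
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:20000")

observation = {
    "video": np.zeros((4, 224, 224, 3), dtype=np.uint8),
    "state": np.zeros((8,), dtype=np.float32),
    "language_instruction": "put the carrot on the plate",
}

socket.send_pyobj(observation)  # request: one observation (assumed pickle protocol)
actions = socket.recv_pyobj()   # reply: an action chunk
```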
## RLDX-1 model family

| Checkpoint | Description |
|---|---|
| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation |
| [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone |
| [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune |
| [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune |
| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO four-suite (goal, object, spatial, long) finetune |
| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune |
| [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune |
| [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 tabletop finetune |
| [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train (this repo) |
| [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) |

## Intended use & limitations

**Intended use.** A strong initialization for downstream finetuning on Franka-arm manipulation tasks; research on robotic manipulation; and non-commercial real-robot deployment under the conditions of the RLWRLD Model License v1.0.

**Out of scope.** Commercial deployment, military or weapons applications, non-consensual surveillance, and any use that violates applicable laws or regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list.

**Limitations.** Mid-train conditioning is most useful for Franka-style embodiments. For very different morphologies (humanoid, dual-arm, mobile), [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) or [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) may be better starting points. The memory, motion, and physics modules are inactive in this checkpoint; enable them at finetune time if needed.

## Citation

```bibtex
@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}
```

## License

Released under the **RLWRLD Model License v1.0** — a non-commercial license with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for the full text. By using this model you agree to those terms, including the use restrictions in §3.5.