| --- |
| license: other |
| license_name: rlwrld-model-license-v1.0 |
| license_link: LICENSE.md |
| library_name: transformers |
| pipeline_tag: robotics |
| tags: |
| - robotics |
| - vla |
| - vision-language-action |
| - manipulation |
| - flow-matching |
| - rldx |
| base_model: Qwen/Qwen3-VL-8B-Instruct |
| --- |
| |
| # RLDX-1 |
|
|
| [Paper](https://arxiv.org/abs/2605.03269) · [Project page](https://rlwrld.ai/rldx-1) · [Code](https://github.com/RLWRLD/RLDX-1) · [Models](https://huggingface.co/collections/RLWRLD/rldx-1) |
|
|
| <p align="center"> |
| <img src="teaser.png" width="100%" alt="RLDX-1 teaser"> |
| </p> |
|
|
| **RLDX-1** is a general-purpose Robot Foundation Model designed for dexterous |
| manipulation. Built on a **Multi-Stream Action Transformer (MSAT)**, it |
| unifies multimodal perception (visual + tactile), high-DoF actuation, and |
| memory-aware decision-making in a single architecture. RLDX-1 achieves |
| state-of-the-art performance across diverse simulation benchmarks and has |
| been validated on real-world hardware. |
|
|
| This repository hosts **`RLDX-1-PT`**, the foundation checkpoint pretrained on |
| a broad mixture of public manipulation corpora. All downstream |
| `RLDX-1-{FT,MT}-*` releases are finetuned from it. Use it as your starting |
| point for new embodiments and tasks. |
|
|
| <p align="center"> |
| <img src="architecture.png" width="90%" alt="RLDX-1 architecture"> |
| </p> |
|
|
| ## Highlights |
|
|
| - **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and |
| action each get a dedicated stream coupled by joint self-attention — |
| an extension of MM-DiT to action modeling. |
| - **Motion awareness.** Multi-frame observations + a motion module |
| capture temporal dynamics; intermediate VLM layers compress video |
| tokens to keep the policy efficient. |
| - **Long-term memory.** A memory module fuses past cognition features |
| with the current ones for history-grounded decisions beyond a short |
| multi-frame window. |
| - **Physical sensing.** Tactile and torque enter as a dedicated physics |
| stream; the decoder is jointly trained to predict future physical |
| signals. |
| - **Three-stage training.** Pre-training (generalization) → mid-training |
| (functionality) → post-training (task adaptation), with synthetic data |
| augmenting rare manipulation scenarios. |
| - **Real-time inference.** Static graph capture + custom fused kernels |
| bring the all-modality model to **43.7 ms / step on RTX 5090 |
| (1.63× speedup, >22 Hz)**. |
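
The real-time figures in the last bullet can be cross-checked with a few lines of Python. This sketch uses only the two numbers quoted above (43.7 ms per step and a 1.63× speedup). The unoptimized latency it prints is derived from those two numbers and is not a measurement reported in this card.

```python
# Figures quoted in the Highlights list.
STEP_LATENCY_MS = 43.7   # optimized latency per inference step on RTX 5090
SPEEDUP = 1.63           # reported speedup from graph capture + fused kernels


def control_rate_hz(latency_ms: float) -> float:
    """Maximum number of inference steps per second at a given latency."""
    return 1000.0 / latency_ms


def implied_baseline_ms(latency_ms: float, speedup: float) -> float:
    """Latency before optimization implied by the speedup (derived, not measured)."""
    return latency_ms * speedup


rate_hz = control_rate_hz(STEP_LATENCY_MS)
baseline_ms = implied_baseline_ms(STEP_LATENCY_MS, SPEEDUP)

print(f"{rate_hz:.1f} Hz")       # consistent with the ">22 Hz" claim
print(f"{baseline_ms:.1f} ms")   # implied latency without the optimizations
```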
|
|
| ## Released Checkpoints |
|
|
| This card describes `RLDX-1-PT` (foundation). The full RLDX-1 model family: |
|
|
| | Checkpoint | Description | Params | Embodiment Tag | |
| |---|---|---|---| |
| | [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation (this repo) | 6.9B | per-dataset | |
| | [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone | 8B | — | |
| | [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune | 6.9B | `GENERAL_EMBODIMENT` | |
| | [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune | 6.9B | `GENERAL_EMBODIMENT` | |
| | [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO finetune across its four task suites (Goal, Object, Spatial, Long) | 6.9B | `GENERAL_EMBODIMENT` | |
| | [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune | 6.9B | `OXE_FRACTAL` | |
| | [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune | 6.9B | `OXE_BRIDGE_ORIG` | |
| | [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune | 6.9B | `GENERAL_EMBODIMENT` | |
| | [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train | 8.1B | `OXE_DROID` | |
| | [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) | 8.1B | `GENERAL_EMBODIMENT` | |
|
|
| ## Performance |
|
|
| Success rate (%) of RLDX-1 finetuned on each benchmark's training set, |
| evaluated with the checkpoint listed in each row. |
|
|
| | Benchmark | Success Rate | Checkpoint | |
| |---|---|---| |
| | LIBERO (Avg) | 97.8 | `RLDX-1-FT-LIBERO` | |
| | LIBERO-Plus | 87.6 | `RLDX-1-FT-LIBERO` | |
| | SIMPLER Google-VM | 81.5 | `RLDX-1-FT-SIMPLER-GOOGLE` | |
| | SIMPLER Google-VA | 77.4 | `RLDX-1-FT-SIMPLER-GOOGLE` | |
| | SIMPLER WidowX | 71.9 | `RLDX-1-FT-SIMPLER-WIDOWX` | |
| | RoboCasa Kitchen (24 tasks) | 70.6 | `RLDX-1-FT-ROBOCASA` | |
| | GR-1 Tabletop | 58.7 | `RLDX-1-FT-GR1` | |
| | RoboCasa365 (Avg) | 31.5 | `RLDX-1-FT-RC365` | |
|
|
| ## Quick start |
|
|
| ```bash |
| git clone https://github.com/RLWRLD/RLDX-1.git |
| cd RLDX-1 |
| uv sync --python 3.10 |
| uv pip install -e . |
| ``` |
|
|
| ### Inference (single step) |
|
|
| ```python |
| from rldx.policy.rldx_policy import RLDXPolicy |
| from rldx.data.embodiment_tags import EmbodimentTag |
| |
| policy = RLDXPolicy( |
| model_path="RLWRLD/RLDX-1-FT-ROBOCASA", |
| embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT, |
| device="cuda:0", |
| ) |
| |
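| # `observation` is not defined in this snippet: the caller must build it in |
| # the format expected for the chosen embodiment tag. |
| action = policy.get_action(observation) |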
| ``` |
|
|
| `RLDX-1-PT` is pretrained on a multi-source mixture, so for direct inference |
| pair it with the embodiment tag matching your data source — e.g. |
| `OXE_FRACTAL`, `OXE_BRIDGE_ORIG`, `OXE_DROID`, `GALAXEA`, `AGIBOT_GRIPPER`, |
| `AGIBOT_DEXHAND`, `NEURAL_GR1`, `HUMANOID_EVERYDAY_G1`, |
| `HUMANOID_EVERYDAY_H1`, etc. For custom robots, finetune. |
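
As a small aid for picking a tag, the lookup below restates the Embodiment Tag column of the Released Checkpoints table. It is a sketch, not part of the `rldx` package: `CHECKPOINT_TAGS` and `tag_name_for` are hypothetical names, and the tags are kept as plain strings so the snippet runs without the repository installed.

```python
# Hypothetical helper: restates the "Embodiment Tag" column of the
# Released Checkpoints table. Not part of the rldx package.
CHECKPOINT_TAGS = {
    "RLWRLD/RLDX-1-FT-ROBOCASA": "GENERAL_EMBODIMENT",
    "RLWRLD/RLDX-1-FT-RC365": "GENERAL_EMBODIMENT",
    "RLWRLD/RLDX-1-FT-LIBERO": "GENERAL_EMBODIMENT",
    "RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE": "OXE_FRACTAL",
    "RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX": "OXE_BRIDGE_ORIG",
    "RLWRLD/RLDX-1-FT-GR1": "GENERAL_EMBODIMENT",
    "RLWRLD/RLDX-1-MT-DROID": "OXE_DROID",
    "RLWRLD/RLDX-1-MT-ALLEX": "GENERAL_EMBODIMENT",
}


def tag_name_for(model_path: str) -> str:
    """Return the embodiment tag name listed for a released checkpoint.

    RLDX-1-PT is deliberately absent: its tag is "per-dataset", so the
    caller has to choose the tag that matches the data source.
    """
    try:
        return CHECKPOINT_TAGS[model_path]
    except KeyError:
        raise KeyError(
            f"No fixed tag listed for {model_path!r}; "
            "choose the tag that matches your data source."
        ) from None


print(tag_name_for("RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX"))
```

If `EmbodimentTag` is a standard `enum.Enum` whose member names match these strings, `EmbodimentTag[tag_name_for(path)]` would convert the result into the value that `RLDXPolicy` expects. That is an assumption about the package and is not confirmed by this card.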
|
|
| ### Real-time serving (ZeroMQ) |
|
|
| ```bash |
| uv run python rldx/eval/run_rldx_server.py \ |
| --model-path RLWRLD/RLDX-1-FT-ROBOCASA \ |
| --embodiment-tag GENERAL_EMBODIMENT \ |
| --host 0.0.0.0 --port 20000 |
| ``` |
|
|
| A WebSocket server (`run_rldx_server_pi.py`) is also available for |
| openpi-compatible clients. |
|
|
| ### Finetune from `RLDX-1-PT` |
|
|
| ```bash |
| uv run python rldx/experiment/launch_train.py \ |
| --base-model-path RLWRLD/RLDX-1-PT \ |
| --dataset-path /path/to/your/dataset \ |
| --embodiment-tag GENERAL_EMBODIMENT \ |
| --video-length 4 --n-cog-tokens 64 \ |
| --global-batch-size 64 --learning-rate 1e-4 \ |
| --max-steps 60000 --save-steps 5000 \ |
| --output-dir ./outputs/my_finetune |
| ``` |
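
For planning compute and disk space, the flags above imply a fixed sample budget and checkpoint count. The sketch below only does that arithmetic. It assumes a checkpoint is written every `--save-steps` steps up to and including the final step, and that the global batch is split evenly across GPUs without gradient accumulation; neither is confirmed by this card, and `per_device_batch` is a hypothetical helper.

```python
# Values taken from the finetuning command above.
GLOBAL_BATCH_SIZE = 64
MAX_STEPS = 60_000
SAVE_STEPS = 5_000

# Total training samples drawn over the run (with repetition across epochs).
samples_seen = GLOBAL_BATCH_SIZE * MAX_STEPS

# Number of checkpoints, assuming one save every SAVE_STEPS steps.
num_checkpoints = MAX_STEPS // SAVE_STEPS


def per_device_batch(global_batch: int, num_gpus: int) -> int:
    """Per-GPU batch size, assuming an even split and no gradient accumulation."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the GPU count")
    return global_batch // num_gpus


print(samples_seen)                              # 3840000
print(num_checkpoints)                           # 12
print(per_device_batch(GLOBAL_BATCH_SIZE, 8))    # 8
```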
|
|
| To enable add-ons (memory / motion / physics) see the recipes in the |
| [main README](https://github.com/RLWRLD/RLDX-1#finetuning) and the |
| [`training.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/training.md) |
| guide. |
|
|
| ## Model details |
|
|
| - **Architecture:** Multi-Stream Action Transformer (MSAT) policy with a |
| Qwen3-VL vision-language backbone, cognition-token perceptual summary, |
| optional Transformer memory, motion module, and tactile/torque physics |
| encoder/decoder. Trained with flow matching. |
| - **Inputs:** RGB video (default 4 frames), state proprioception, optional |
| tactile / torque signals, language instruction. |
| - **Outputs:** Action chunks of length 16 (default `--action-horizon 16`). |
| - **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct). |
| - **Pretraining data:** A mixture of public manipulation corpora, covering |
| 27 [Open X-Embodiment (OXE)](https://robotics-transformer-x.github.io/) |
| datasets (DROID, Bridge, Fractal, Language Table, …) plus |
| [Galaxea](https://galaxea.ai/), [AgiBot World](https://agibot-world.com/) |
| (Gripper + Dexhand), ActionNet, Neural-Curated GR-1 humanoid trajectories, |
| and Unitree G1 / H1 from |
| [HumanoidEveryday](https://lipeng-zhou.github.io/HumanoidEveryday/). |
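
To make "trained with flow matching" and "action chunks of length 16" concrete, here is a minimal NumPy sketch of how a flow-matching policy typically samples a chunk: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. Only the horizon of 16 comes from this card. `ACTION_DIM`, `NUM_STEPS`, and `toy_velocity` (a closed-form stand-in for the learned, observation-conditioned network) are assumptions for illustration, and the actual RLDX-1 sampler may differ.

```python
import numpy as np

ACTION_HORIZON = 16   # from this card (default --action-horizon 16)
ACTION_DIM = 7        # hypothetical action dimension, for illustration only
NUM_STEPS = 10        # hypothetical number of Euler integration steps


def toy_velocity(x: np.ndarray, t: float, target: np.ndarray) -> np.ndarray:
    """Stand-in for the learned velocity network.

    On the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to the target is (target - x_t) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_action_chunk(target: np.ndarray, rng: np.random.Generator,
                        num_steps: int = NUM_STEPS) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (actions) with Euler steps."""
    x = rng.standard_normal(target.shape)    # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps                    # t stays below 1, so no division by zero
        x = x + dt * toy_velocity(x, t, target)
    return x


rng = np.random.default_rng(0)
target_chunk = np.zeros((ACTION_HORIZON, ACTION_DIM))   # placeholder "expert" actions
chunk = sample_action_chunk(target_chunk, rng)
print(chunk.shape)   # (16, 7)
```

With this toy field the Euler iterates land on the target at t = 1. A trained model replaces `toy_velocity` with a network that never sees the target and instead predicts the velocity from observations and the current noisy chunk.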
|
|
| For a full architectural walkthrough see |
| [`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md). |
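
The MSAT idea from the Highlights, dedicated streams coupled by joint self-attention, can be illustrated in the MM-DiT style: each stream keeps its own projection weights, but attention runs over the concatenation of all streams' tokens. The NumPy sketch below is single-head and omits normalization, MLPs, and timestep conditioning. The 64 cognition tokens and 16 action tokens echo `--n-cog-tokens 64` and the action horizon, assuming one token per action step; the physics token count, the width `D`, and all function names are assumptions, not the released implementation.

```python
import numpy as np

D = 32  # hypothetical model width
TOKENS = {"cognition": 64, "physics": 8, "action": 16}  # physics count is hypothetical


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_stream_weights(rng: np.random.Generator) -> dict:
    """One set of Q/K/V/output projections per stream (weights are not shared)."""
    return {
        name: {p: rng.standard_normal((D, D)) / np.sqrt(D) for p in ("q", "k", "v", "o")}
        for name in TOKENS
    }


def joint_attention(streams: dict, weights: dict) -> dict:
    """Project each stream with its own weights, then attend over all tokens jointly."""
    names = list(streams)
    q = np.concatenate([streams[n] @ weights[n]["q"] for n in names])
    k = np.concatenate([streams[n] @ weights[n]["k"] for n in names])
    v = np.concatenate([streams[n] @ weights[n]["v"] for n in names])
    mixed = softmax(q @ k.T / np.sqrt(D)) @ v     # every token sees every stream

    out, start = {}, 0
    for n in names:                               # split back into streams
        end = start + streams[n].shape[0]
        out[n] = streams[n] + mixed[start:end] @ weights[n]["o"]   # residual update
        start = end
    return out


rng = np.random.default_rng(0)
weights = init_stream_weights(rng)
streams = {n: rng.standard_normal((count, D)) for n, count in TOKENS.items()}
out = joint_attention(streams, weights)
print({n: o.shape for n, o in out.items()})
```

Because the attention matrix spans all three streams, changing the physics tokens changes the action stream's output even though the two streams share no weights.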
|
|
| ## Intended use & limitations |
|
|
| **Intended use.** Research on robotic manipulation, finetuning on custom |
| embodiments, simulation benchmarking, and non-commercial real-robot |
| deployment under the conditions of the RLWRLD Model License v1.0. |
|
|
| **Out of scope.** Commercial deployment, military or weapons applications, |
| non-consensual surveillance, and any use that violates applicable laws or |
| regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list. |
|
|
| **Limitations.** Performance depends heavily on embodiment match and data |
| distribution. The pretrained checkpoint is OXE-conditioned and is not |
| guaranteed to work zero-shot on novel embodiments without finetuning. |
| Memory, motion, and physics modules are inactive in `RLDX-1-PT`; they are |
| enabled only when the corresponding flags are set during finetuning (see |
| `RLDX-1-MT-ALLEX`). |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{rldx2026, |
| title={RLDX-1 Technical Report}, |
| author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others}, |
| year={2026}, |
| note={RLWRLD}, |
| eprint={2605.03269}, |
| archivePrefix={arXiv}, |
| url={https://arxiv.org/abs/2605.03269} |
| } |
| ``` |
|
|
| ## License |
|
|
| Released under the **RLWRLD Model License v1.0** — a non-commercial license |
| with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for |
| the full text. By using this model you agree to those terms, including the |
| use restrictions in §3.5. |
|
|