| --- |
| license: other |
| license_name: rlwrld-model-license-v1.0 |
| license_link: LICENSE.md |
| library_name: transformers |
| pipeline_tag: robotics |
| tags: |
| - robotics |
| - vla |
| - vision-language-action |
| - manipulation |
| - flow-matching |
| - rldx |
| - simpler |
| - widowx |
| base_model: RLWRLD/RLDX-1-PT |
| --- |
| |
| # RLDX-1-FT-SIMPLER-WIDOWX |
|
|
| [Paper](https://arxiv.org/abs/2605.03269) · [Project page](https://rlwrld.ai/rldx-1) · [Code](https://github.com/RLWRLD/RLDX-1) · [Models](https://huggingface.co/collections/RLWRLD/rldx-1) |
|
|
| <p align="center"> |
| <img src="teaser.png" width="100%" alt="RLDX-1 teaser"> |
| </p> |
|
|
| **RLDX-1** is a general-purpose robot foundation model for dexterous
| manipulation. Its **Multi-Stream Action Transformer (MSAT)** combines
| multimodal perception (visual and tactile), high-DoF actuation, and
| memory-aware decision-making in a single architecture.
|
|
| This repository hosts **`RLDX-1-FT-SIMPLER-WIDOWX`**, RLDX-1 finetuned for
| the **SimplerEnv WidowX** benchmark (BridgeData-style WidowX 250 tasks).
| It achieves a **71.9%** average success rate on this benchmark.
|
|
| ## Highlights |
|
|
| - **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and |
| action each get a dedicated stream coupled by joint self-attention — |
| an extension of MM-DiT to action modeling. |
| - **Motion awareness.** Multi-frame observations + a motion module |
| capture temporal dynamics; intermediate VLM layers compress video |
| tokens to keep the policy efficient. |
| - **Long-term memory.** A memory module fuses past cognition features |
| with the current ones for history-grounded decisions beyond a short |
| multi-frame window. |
| - **Physical sensing.** Tactile and torque enter as a dedicated physics |
| stream; the decoder is jointly trained to predict future physical |
| signals. |
| - **Three-stage training.** Pre-training (generalization) → mid-training |
| (functionality) → post-training (task adaptation), with synthetic data |
| augmenting rare manipulation scenarios. |
| - **Real-time inference.** Static graph capture and custom fused kernels
| bring the all-modality model to **43.7 ms per step on an RTX 5090
| (1.63× speedup, >22 Hz)**.
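The joint self-attention named in the first highlight can be illustrated with a small pure-Python sketch. This is not the RLDX-1 implementation: the stream names, the token values, and the identity query/key/value projections are assumptions made for illustration (MM-DiT-style models use learned per-stream projections). What it shows is the coupling pattern: tokens from every stream are concatenated, attend to one another, and are then split back into their streams.

```python
import math
from typing import Dict, List

Token = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(streams: Dict[str, List[Token]]) -> Dict[str, List[Token]]:
    """Attend over the concatenation of all streams, then split back.

    Assumption: identity Q/K/V projections stand in for learned
    per-stream projections, so only the coupling pattern is shown.
    """
    names = list(streams)
    tokens = [tok for name in names for tok in streams[name]]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    outputs: List[Token] = []
    for query in tokens:
        # Each token scores against every token from every stream.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )

    # Split the joint sequence back into per-stream token lists.
    result: Dict[str, List[Token]] = {}
    start = 0
    for name in names:
        count = len(streams[name])
        result[name] = outputs[start:start + count]
        start += count
    return result


# Hypothetical toy inputs: three streams with different token counts.
example_streams = {
    "cognition": [[1.0, 0.0], [0.0, 1.0]],
    "physics": [[1.0, 1.0]],
    "action": [[0.5, -0.5], [2.0, 0.0], [0.0, 2.0]],
}
mixed = joint_self_attention(example_streams)
```

Each stream keeps its own token count, but every output token is a weighted mix of tokens from all three streams, which is what lets action tokens condition on cognition and physics features.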
|
|
| ## Performance |
|
|
| | Benchmark | Success Rate | |
| |---|---| |
| | SIMPLER WidowX | **71.9%** | |
|
|
| ## Quick start |
|
|
| ### Installation |
|
|
| ```bash |
| git clone https://github.com/RLWRLD/RLDX-1.git |
| cd RLDX-1
| uv sync --python 3.10 |
| uv pip install -e . |
| ``` |
|
|
| ### Inference |
|
|
| ```python |
| from rldx.policy.rldx_policy import RLDXPolicy |
| from rldx.data.embodiment_tags import EmbodimentTag |
| |
| policy = RLDXPolicy( |
| model_path="RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX", |
| embodiment_tag=EmbodimentTag.OXE_BRIDGE_ORIG, |
| device="cuda:0", |
| ) |
| |
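| # `observation` must be supplied by the caller: the camera frames,
| # proprioceptive state, and language instruction for the current step.
| action = policy.get_action(observation)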
| ``` |
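Because the model emits action chunks rather than single actions, a control loop usually executes part or all of a chunk before querying the policy again. The sketch below shows that receding-horizon pattern with a stub in place of `RLDXPolicy`. The stub, the 7-D action size, the list-of-lists return format, and the `replan_every` parameter are all assumptions for illustration; check the repository for the actual return type of `get_action`.

```python
from typing import Callable, List

CHUNK_LEN = 16   # action-chunk length stated in the model details
ACTION_DIM = 7   # hypothetical action size, for illustration only

Action = List[float]


class StubPolicy:
    """Stand-in for RLDXPolicy that returns a dummy chunk (assumed format)."""

    def __init__(self) -> None:
        self.calls = 0

    def get_action(self, observation: dict) -> List[Action]:
        self.calls += 1
        return [[0.0] * ACTION_DIM for _ in range(CHUNK_LEN)]


def run_episode(
    policy: StubPolicy,
    get_observation: Callable[[], dict],
    apply_action: Callable[[Action], None],
    max_steps: int,
    replan_every: int = CHUNK_LEN,
) -> int:
    """Execute up to `replan_every` actions per chunk, then query again."""
    steps = 0
    while steps < max_steps:
        chunk = policy.get_action(get_observation())
        for action in chunk[:replan_every]:
            if steps >= max_steps:
                break
            apply_action(action)
            steps += 1
    return steps


# Example: 40 control steps, replanning after every 8 executed actions.
executed: List[Action] = []
demo_policy = StubPolicy()
total_steps = run_episode(demo_policy, lambda: {}, executed.append, 40, replan_every=8)
```

A smaller `replan_every` reacts faster to new observations at the cost of more policy queries; executing the full chunk minimizes queries.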
|
|
| ### Real-time serving (ZeroMQ) |
|
|
| ```bash |
| uv run python rldx/eval/run_rldx_server.py \ |
| --model-path RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX \ |
| --embodiment-tag OXE_BRIDGE_ORIG \ |
| --host 0.0.0.0 --port 20000 |
| ``` |
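When sizing a control loop around the server, it helps to convert the per-step latency quoted in the Highlights into a rate. The arithmetic below is a sanity check, not a measurement. It assumes the 1.63× figure is a ratio of per-step latencies for the same model, and it ignores network and serialization overhead, which the source does not quantify.

```python
def step_rate_hz(latency_ms: float) -> float:
    """Inference steps per second for a given per-step latency."""
    return 1000.0 / latency_ms


def implied_baseline_ms(optimized_ms: float, speedup: float) -> float:
    """Latency before optimization, assuming speedup = baseline / optimized."""
    return optimized_ms * speedup


OPTIMIZED_MS = 43.7   # per-step latency on an RTX 5090, from the Highlights
SPEEDUP = 1.63        # reported speedup, from the Highlights

rate_hz = step_rate_hz(OPTIMIZED_MS)                    # about 22.9 Hz
baseline_ms = implied_baseline_ms(OPTIMIZED_MS, SPEEDUP)  # about 71.2 ms
print(f"{rate_hz:.1f} Hz optimized, {baseline_ms:.1f} ms implied baseline")
```

The result is consistent with the ">22 Hz" claim. Any transport overhead between client and server lowers the achievable rate further.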
|
|
| To reproduce the benchmark numbers end-to-end, see |
| [`run_scripts/eval/simpler/README.md`](https://github.com/RLWRLD/RLDX-1/blob/main/run_scripts/eval/simpler/README.md). |
|
|
| ## Model details |
|
|
| - **Architecture:** Multi-Stream Action Transformer (MSAT) policy on a
| Qwen3-VL backbone, with a cognition-token perceptual summary. Trained
| with flow matching.
| - **Inputs:** RGB video (4 frames by default), proprioceptive state, and a
| language instruction.
| - **Outputs:** Action chunks of length 16. |
| - **Embodiment tag:** `OXE_BRIDGE_ORIG`. |
| - **Base model:** [`RLWRLD/RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT). |
| - **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct). |
| - **Finetune data:** SimplerEnv WidowX training set (BridgeData subset of OXE). |
| - **Params:** 6.9B. |
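Flow-matching policies typically generate an action by integrating a learned velocity field from noise at t = 0 to an action at t = 1. The sketch below shows that sampling loop with a fixed-step Euler integrator. It is a generic illustration, not the RLDX-1 sampler: the toy velocity field, the step count, and all function names are assumptions, and the source does not state which integrator or how many steps the model uses.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Toy stand-in for the learned network.

    For the linear path x_t = (1 - t) * x0 + t * target, the velocity is
    target - x0, which equals (target - x_t) / (1 - t) for t < 1.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Hypothetical 3-D example: start from a fixed "noise" sample.
noise = [0.5, -1.0, 2.0]
target_action = [0.1, 0.2, -0.3]
sampled = euler_sample(make_toy_velocity(target_action), noise, num_steps=10)
```

In a real policy the velocity function would be the network conditioned on observations and language, and the state `x` would be a full action chunk rather than a single short vector.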
|
|
| For the full architectural walkthrough see |
| [`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md). |
|
|
| ## RLDX-1 model family |
|
|
| | Checkpoint | Description | |
| |---|---| |
| | [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation | |
| | [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone | |
| | [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune | |
| | [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune | |
| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO finetune across four task suites (Goal, Object, Spatial, Long) |
| | [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune | |
| | [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune (this repo) | |
| | [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune | |
| | [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train | |
| | [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) | |
|
|
| ## Intended use & limitations |
|
|
| **Intended use.** Research on robotic manipulation, simulation benchmarking |
| on SimplerEnv WidowX, and non-commercial real-robot deployment under the |
| conditions of the RLWRLD Model License v1.0. |
|
|
| **Out of scope.** Commercial deployment, military or weapons applications, |
| non-consensual surveillance, and any use that violates applicable laws or |
| regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list. |
|
|
| **Limitations.** This checkpoint is conditioned on the WidowX 250
| (BridgeData) embodiment. For Google Robot evaluation, use
| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE);
| for other embodiments, finetune from
| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT).
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{rldx2026, |
| title={RLDX-1 Technical Report}, |
| author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others}, |
| year={2026}, |
| note={RLWRLD}, |
| eprint={2605.03269}, |
| archivePrefix={arXiv}, |
| url={https://arxiv.org/abs/2605.03269} |
| } |
| ``` |
|
|
| ## License |
|
|
| Released under the **RLWRLD Model License v1.0** — a non-commercial license |
| with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for |
| the full text. By using this model you agree to those terms, including the |
| use restrictions in §3.5. |
|
|