---
license: other
license_name: rlwrld-model-license-v1.0
license_link: LICENSE.md
library_name: transformers
pipeline_tag: robotics
tags:
  - robotics
  - vla
  - vision-language-action
  - manipulation
  - flow-matching
  - rldx
base_model: Qwen/Qwen3-VL-8B-Instruct
---

# RLDX-1

[Paper](https://arxiv.org/abs/2605.03269)  ·  [Project page](https://rlwrld.ai/rldx-1)  ·  [Code](https://github.com/RLWRLD/RLDX-1)  ·  [Models](https://huggingface.co/collections/RLWRLD/rldx-1)

<p align="center">
<img src="teaser.png" width="100%" alt="RLDX-1 teaser">
</p>

**RLDX-1** is a general-purpose Robot Foundation Model designed for dexterous
manipulation. Powered by a **Multi-Stream Action Transformer (MSAT)**, it
unifies multimodal perception (visual + tactile), high-DoF actuation, and
memory-aware decision-making in a single architecture. RLDX-1 achieves
state-of-the-art performance across diverse simulation benchmarks and has
been validated on real-world hardware.

This repository hosts **`RLDX-1-PT`** — a foundation checkpoint pretrained on
a broad mixture of public manipulation corpora, from which all downstream
`RLDX-1-{FT,MT}-*` releases are finetuned. Use it as the starting point for
new embodiments and tasks.

<p align="center">
<img src="architecture.png" width="90%" alt="RLDX-1 architecture">
</p>

## Highlights

- **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and
  action each get a dedicated stream coupled by joint self-attention —
  an extension of MM-DiT to action modeling.
- **Motion awareness.** Multi-frame observations + a motion module
  capture temporal dynamics; intermediate VLM layers compress video
  tokens to keep the policy efficient.
- **Long-term memory.** A memory module fuses past cognition features
  with the current ones for history-grounded decisions beyond a short
  multi-frame window.
- **Physical sensing.** Tactile and torque enter as a dedicated physics
  stream; the decoder is jointly trained to predict future physical
  signals.
- **Three-stage training.** Pre-training (generalization) → mid-training
  (functionality) → post-training (task adaptation), with synthetic data
  augmenting rare manipulation scenarios.
- **Real-time inference.** Static graph capture + custom fused kernels
  bring the all-modality model to **43.7 ms / step on RTX 5090
  (1.63× speedup, >22 Hz)**; see the capture sketch below.
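
The latency above comes from the authors' optimized stack. As a rough,
framework-level illustration of what static graph capture means, the snippet
below captures a stand-in forward pass with `torch.cuda.CUDAGraph` and
replays it on fresh inputs; the stand-in module, shapes, and warm-up counts
are placeholders, and the custom fused kernels are not reproduced here.

```python
import torch
import torch.nn as nn

# Stand-in for a policy forward pass (NOT the RLDX-1 model).
model = nn.Sequential(
    nn.Linear(512, 1024), nn.GELU(), nn.Linear(1024, 16 * 7)
).cuda().eval()
static_obs = torch.zeros(1, 512, device="cuda")

# Warm up on a side stream so allocator state settles before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_obs)
torch.cuda.current_stream().wait_stream(s)

# Capture a single forward pass into a static graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_obs)

# Per control step: copy fresh features into the static buffer, replay, read out.
static_obs.copy_(torch.randn(1, 512, device="cuda"))
graph.replay()
action_chunk = static_out.clone().view(16, 7)
```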

## Released Checkpoints

This card describes `RLDX-1-PT` (foundation). The full RLDX-1 model family:

| Checkpoint | Description | Params | Embodiment Tag |
|---|---|---|---|
| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation (this repo) | 6.9B | per-dataset |
| [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone | 8B | — |
| [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune | 6.9B | `GENERAL_EMBODIMENT` |
| [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune | 6.9B | `GENERAL_EMBODIMENT` |
| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO 4-task suite (goal, object, spatial, long) finetune | 6.9B | `GENERAL_EMBODIMENT` |
| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune | 6.9B | `OXE_FRACTAL` |
| [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune | 6.9B | `OXE_BRIDGE_ORIG` |
| [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune | 6.9B | `GENERAL_EMBODIMENT` |
| [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train | 8.1B | `OXE_DROID` |
| [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) | 8.1B | `GENERAL_EMBODIMENT` |

## Performance

Success rate (%) of RLDX-1 finetuned on each benchmark's training set,
evaluated with the linked checkpoint.

| Benchmark | Success Rate | Checkpoint |
|---|---|---|
| LIBERO (Avg) | 97.8 | `RLDX-1-FT-LIBERO` |
| LIBERO-Plus | 87.6 | `RLDX-1-FT-LIBERO` |
| SIMPLER Google-VM | 81.5 | `RLDX-1-FT-SIMPLER-GOOGLE` |
| SIMPLER Google-VA | 77.4 | `RLDX-1-FT-SIMPLER-GOOGLE` |
| SIMPLER WidowX | 71.9 | `RLDX-1-FT-SIMPLER-WIDOWX` |
| RoboCasa Kitchen (24 tasks) | 70.6 | `RLDX-1-FT-ROBOCASA` |
| GR-1 Tabletop | 58.7 | `RLDX-1-FT-GR1` |
| RoboCasa365 (Avg) | 31.5 | `RLDX-1-FT-RC365` |

## Quick start

```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```

### Inference (single step)

```python
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-FT-ROBOCASA",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

# `observation` holds the current camera images, proprioceptive state, and
# language instruction in the format expected by the policy.
action = policy.get_action(observation)
```

`RLDX-1-PT` is pretrained on a multi-source mixture, so for direct inference
pair it with the embodiment tag matching your data source — e.g.
`OXE_FRACTAL`, `OXE_BRIDGE_ORIG`, `OXE_DROID`, `GALAXEA`, `AGIBOT_GRIPPER`,
`AGIBOT_DEXHAND`, `NEURAL_GR1`, `HUMANOID_EVERYDAY_G1`,
`HUMANOID_EVERYDAY_H1`, etc. For custom robots, finetune.
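
For example, direct inference from the foundation checkpoint on a
Bridge-style setup might look like the sketch below. This is illustrative
only: it assumes the tag names listed above are exposed as `EmbodimentTag`
members and that `observation` follows the schema expected by the repo.

```python
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

# Foundation checkpoint paired with the tag of the matching pretraining source.
policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-PT",
    embodiment_tag=EmbodimentTag.OXE_BRIDGE_ORIG,  # assumed member name
    device="cuda:0",
)

action = policy.get_action(observation)
```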

### Real-time serving (ZeroMQ)

```bash
uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-FT-ROBOCASA \
    --embodiment-tag GENERAL_EMBODIMENT \
    --host 0.0.0.0 --port 20000
```

A WebSocket server (`run_rldx_server_pi.py`) is also available for
openpi-compatible clients.
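
The request/response schema is defined by the client utilities in the
RLDX-1 repository and is not documented on this card. Purely as an
illustration of querying a ZeroMQ policy server, a minimal REQ/REP client
could look like the following; the pickled-dict payload and field names are
assumptions, not the actual RLDX-1 wire format.

```python
import zmq  # pyzmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://localhost:20000")

# Hypothetical payload; use the repo's client code for the real schema.
observation = {
    "instruction": "put the bowl in the sink",
    # plus camera images and proprioceptive state in the expected format
}
sock.send_pyobj(observation)      # pickle-serialized request
action_chunk = sock.recv_pyobj()  # e.g. a (16, action_dim) array
print(action_chunk)
```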

### Finetune from `RLDX-1-PT`

```bash
uv run python rldx/experiment/launch_train.py \
    --base-model-path RLWRLD/RLDX-1-PT \
    --dataset-path /path/to/your/dataset \
    --embodiment-tag GENERAL_EMBODIMENT \
    --video-length 4 --n-cog-tokens 64 \
    --global-batch-size 64 --learning-rate 1e-4 \
    --max-steps 60000 --save-steps 5000 \
    --output-dir ./outputs/my_finetune
```

To enable add-ons (memory / motion / physics) see the recipes in the
[main README](https://github.com/RLWRLD/RLDX-1#finetuning) and the
[`training.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/training.md)
guide.

## Model details

- **Architecture:** Multi-Stream Action Transformer (MSAT) policy with a
  Qwen3-VL vision-language backbone, cognition-token perceptual summary,
  optional Transformer memory, motion module, and tactile/torque physics
  encoder/decoder. Trained with flow matching (illustrated after this list).
- **Inputs:** RGB video (default 4 frames), state proprioception, optional
  tactile / torque signals, language instruction.
- **Outputs:** Action chunks of length 16 (default `--action-horizon 16`).
- **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
- **Pretraining data:** A mixture of public manipulation corpora, covering
  27 [Open X-Embodiment (OXE)](https://robotics-transformer-x.github.io/)
  datasets (DROID, Bridge, Fractal, Language Table, …) plus
  [Galaxea](https://galaxea.ai/), [AgiBot World](https://agibot-world.com/)
  (Gripper + Dexhand), ActionNet, Neural-Curated GR-1 humanoid trajectories,
  and Unitree G1 / H1 from
  [HumanoidEveryday](https://lipeng-zhou.github.io/HumanoidEveryday/).
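
Flow matching here means the action head learns a velocity field that
transports Gaussian noise into an action chunk, and decoding integrates that
field over a few steps. The sketch below is a generic Euler-integration
sampler for illustration only; the network, dimensions, and step count are
placeholders, not the RLDX-1 implementation.

```python
import torch
import torch.nn as nn

ACTION_HORIZON, ACTION_DIM, NUM_STEPS = 16, 7, 10  # placeholder sizes

# Stand-in velocity field v_theta(x_t, t, cond). The real model conditions on
# the MSAT cognition/physics streams rather than a single feature vector.
velocity_net = nn.Sequential(
    nn.Linear(ACTION_HORIZON * ACTION_DIM + 1 + 512, 1024),
    nn.GELU(),
    nn.Linear(1024, ACTION_HORIZON * ACTION_DIM),
)

@torch.no_grad()
def sample_action_chunk(cond: torch.Tensor) -> torch.Tensor:
    """Euler-integrate dx/dt = v_theta(x, t, cond) from t=0 (noise) to t=1."""
    x = torch.randn(1, ACTION_HORIZON * ACTION_DIM)
    dt = 1.0 / NUM_STEPS
    for i in range(NUM_STEPS):
        t = torch.full((1, 1), i * dt)
        v = velocity_net(torch.cat([x, t, cond], dim=-1))
        x = x + dt * v
    return x.view(ACTION_HORIZON, ACTION_DIM)

chunk = sample_action_chunk(torch.randn(1, 512))  # one 16-step action chunk
```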

For a full architectural walkthrough see
[`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md).

## Intended use & limitations

**Intended use.** Research on robotic manipulation, finetuning on custom
embodiments, simulation benchmarking, and non-commercial real-robot
deployment under the conditions of the RLWRLD Model License v1.0.

**Out of scope.** Commercial deployment, military or weapons applications,
non-consensual surveillance, and any use that violates applicable laws or
regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list.

**Limitations.** Performance depends heavily on embodiment match and data
distribution. The pretrained checkpoint is OXE-conditioned and is not
guaranteed to work zero-shot on novel embodiments without finetuning.
Memory, motion, and physics modules are dormant in `RLDX-1-PT` and only
activate when the corresponding flags are wired during finetuning (see
`RLDX-1-MT-ALLEX`).

## Citation

```bibtex
@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}
```

## License

Released under the **RLWRLD Model License v1.0** — a non-commercial license
with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for
the full text. By using this model you agree to those terms, including the
use restrictions in §3.5.