---
license: other
license_name: rlwrld-model-license-v1.0
license_link: LICENSE.md
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- manipulation
- flow-matching
- rldx
- allex
- memory
- physics
base_model: RLWRLD/RLDX-1-PT
---
# RLDX-1-MT-ALLEX
[Paper](https://arxiv.org/abs/2605.03269)  ·  [Project page](https://rlwrld.ai/rldx-1)  ·  [Code](https://github.com/RLWRLD/RLDX-1)  ·  [Models](https://huggingface.co/collections/RLWRLD/rldx-1)
<p align="center">
<img src="teaser.png" width="100%" alt="RLDX-1 teaser">
</p>
**RLDX-1** is a general-purpose Robot Foundation Model designed for dexterous
manipulation. Powered by a **Multi-Stream Action Transformer (MSAT)**, it
unifies multimodal perception (visual + tactile), high-DoF
actuation, and memory-aware decision-making in a single architecture.
This repository hosts **`RLDX-1-MT-ALLEX`** — RLDX-1 **mid-trained on the
ALLEX humanoid dataset with every optional MSAT module enabled**: memory,
motion, physics (tactile + torque), and video. It is the most
feature-complete checkpoint in the family and the recommended initialization
for tasks that need long-horizon memory or rich physical signal
conditioning.
## Highlights
- **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and
action each get a dedicated stream coupled by joint self-attention —
an extension of MM-DiT to action modeling.
- **Motion awareness.** Multi-frame observations + a motion module
capture temporal dynamics; intermediate VLM layers compress video
tokens to keep the policy efficient.
- **Long-term memory.** A memory module fuses past cognition features
with the current ones for history-grounded decisions beyond a short
multi-frame window.
- **Physical sensing.** Tactile and torque enter as a dedicated physics
stream; the decoder is jointly trained to predict future physical
signals.
- **Three-stage training.** Pre-training (generalization) → mid-training
(functionality) → post-training (task adaptation), with synthetic data
augmenting rare manipulation scenarios.
- **Real-time inference.** Static graph capture + custom fused kernels
bring the all-modality model to **43.7 ms / step on RTX 5090
(1.63× speedup, >22 Hz)**.
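The real-time figure can be sanity-checked with a little arithmetic. The sketch below converts the reported per-step latency into a control rate and back-computes the latency implied before optimization. It assumes the 1.63× speedup is defined as baseline latency divided by optimized latency; the baseline itself is not stated here, so the second number is only an inference. Function names are illustrative.

```python
def steps_per_second(latency_ms: float) -> float:
    """Convert a per-step latency in milliseconds to a rate in Hz."""
    return 1000.0 / latency_ms


def implied_baseline_ms(latency_ms: float, speedup: float) -> float:
    """Latency before optimization, assuming speedup = baseline / optimized."""
    return latency_ms * speedup


optimized_ms = 43.7   # reported latency per step on RTX 5090
speedup = 1.63        # reported speedup factor

rate_hz = steps_per_second(optimized_ms)
baseline_ms = implied_baseline_ms(optimized_ms, speedup)

print(f"{rate_hz:.1f} Hz")       # 22.9 Hz
print(f"{baseline_ms:.1f} ms")   # 71.2 ms
```

The computed rate of about 22.9 Hz is consistent with the ">22 Hz" claim.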
## Quick start
### Installation
```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```
### Inference
```python
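from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

# Load the checkpoint onto the first GPU.
policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-MT-ALLEX",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

# `observation` is not defined in this snippet; the caller must build it
# from the inputs listed under "Model details".
action = policy.get_action(observation)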
```
### Real-time serving (ZeroMQ)
```bash
uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-MT-ALLEX \
    --embodiment-tag GENERAL_EMBODIMENT \
    --host 0.0.0.0 --port 20000
```
### Finetune from this checkpoint (preserve all add-ons)
To keep memory / motion / physics active during downstream finetuning,
mirror the original training flags. See
[`docs/training.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/training.md)
for the full recipe.
## Model details
- **Architecture:** Multi-Stream Action Transformer (MSAT) policy on a
Qwen3-VL backbone, with **memory + motion + physics + video** modules all
enabled. Trained with flow matching.
- **Inputs:** RGB video (4 frames), proprioceptive state, tactile and
torque streams, language instruction.
- **Outputs:** Action chunks of length 16, plus auxiliary flow-matching
predictions of future tactile / torque signals.
- **Embodiment tag:** `GENERAL_EMBODIMENT`.
- **Base model:** [`RLWRLD/RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT).
- **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
- **Mid-train data:** ALLEX dataset.
- **Params:** 8.1B.
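To make the input and output shapes above concrete, here is a minimal sketch of a control loop that keeps a rolling window of the 4 most recent frames and consumes a 16-step action chunk before querying the policy again. `dummy_policy`, `run_episode`, the string frames, and the padding rule are hypothetical stand-ins, not the RLDX API; a real deployment may re-plan before a chunk is exhausted.

```python
from collections import deque

NUM_FRAMES = 4    # RGB frames per observation (from "Inputs")
CHUNK_LEN = 16    # actions per predicted chunk (from "Outputs")


def dummy_policy(frames):
    """Hypothetical stand-in for a policy: returns CHUNK_LEN placeholder actions."""
    assert len(frames) == NUM_FRAMES
    newest = frames[-1]
    return [(newest, i) for i in range(CHUNK_LEN)]


def run_episode(num_steps):
    """Execute one action per step, re-querying the policy when the chunk runs out."""
    frames = deque(maxlen=NUM_FRAMES)   # rolling window; oldest frame drops out
    pending = deque()                   # actions left in the current chunk
    executed, queries = [], 0
    for step in range(num_steps):
        frame = f"frame_{step}"         # placeholder for a camera image
        if not frames:
            # Pad the window with the first frame until real history exists.
            frames.extend([frame] * NUM_FRAMES)
        else:
            frames.append(frame)
        if not pending:
            pending.extend(dummy_policy(list(frames)))
            queries += 1
        executed.append(pending.popleft())
    return executed, queries


executed, queries = run_episode(40)
print(len(executed), queries)   # 40 3
```

With a chunk length of 16, the 40 steps above trigger a policy query at steps 0, 16, and 32.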
For the full architectural walkthrough including how memory / motion /
physics modules are wired, see
[`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md).
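Because the model is trained with flow matching, generating an action at inference time amounts to integrating a velocity field from noise toward an action. The sketch below is a toy illustration of that idea only: it uses a closed-form velocity for a known target on the straight-line path `x_t = (1 - t) * x0 + t * x1`, where a trained network would predict the velocity instead. It is not the RLDX-1 sampler, and the step count, dimensions, and function names are assumptions.

```python
import random


def toy_velocity(x, t, target):
    """Velocity that moves x toward `target` along a straight path.

    For x_t = (1 - t) * x0 + t * x1 the velocity is x1 - x0, which equals
    (x1 - x_t) / (1 - t). A trained network would predict this instead.
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def euler_sample(target, num_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                               # t stays below 1, so no divide-by-zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


action = euler_sample(target=[0.5, -0.25, 1.0])
print([round(a, 6) for a in action])   # [0.5, -0.25, 1.0]
```

The velocity is constant along a straight path, so Euler integration recovers the target up to floating-point error; with a learned velocity field, the number of integration steps trades accuracy against latency.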
## RLDX-1 model family
| Checkpoint | Description |
|---|---|
| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation |
| [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone |
| [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune |
| [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune |
| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO finetune across the four task suites (goal, object, spatial, long) |
| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google Robot finetune (visual matching / variant aggregation) |
| [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune |
| [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune |
| [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train |
| [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | ALLEX mid-train with all add-ons (memory + motion + physics + video); this repo |
## Intended use & limitations
**Intended use.** Research on memory- and physical-signal-conditioned
robotic manipulation, mid-train initialization for downstream tasks that
benefit from history or tactile/torque conditioning, and non-commercial
real-robot deployment under the conditions of the RLWRLD Model License
v1.0.
**Out of scope.** Commercial deployment, military or weapons applications,
non-consensual surveillance, and any use that violates applicable laws or
regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list.
**Limitations.** The active memory, motion, and physics modules add inference
cost, and the physics stream is fully exploited only when tactile and torque
observations are fed at inference time. If the deployment hardware lacks
tactile or torque sensors, prefer
[`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) or
[`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID).
Conditioning is biased toward the ALLEX humanoid embodiment; for
fundamentally different morphologies, finetune.
## Citation
```bibtex
@article{rldx2026,
title={RLDX-1 Technical Report},
author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
year={2026},
note={RLWRLD},
eprint={2605.03269},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2605.03269}
}
```
## License
Released under the **RLWRLD Model License v1.0** — a non-commercial license
with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for
the full text. By using this model you agree to those terms, including the
use restrictions in §3.5.