---
license: other
license_name: rlwrld-model-license-v1.0
license_link: LICENSE.md
library_name: transformers
pipeline_tag: robotics
tags:
  - robotics
  - vla
  - vision-language-action
  - manipulation
  - flow-matching
  - rldx
  - gr1
  - tabletop
  - humanoid
base_model: RLWRLD/RLDX-1-PT
---

# RLDX-1-FT-GR1

[Paper](https://arxiv.org/abs/2605.03269)  ·  [Project page](https://rlwrld.ai/rldx-1)  ·  [Code](https://github.com/RLWRLD/RLDX-1)  ·  [Models](https://huggingface.co/collections/RLWRLD/rldx-1)

<p align="center">
<img src="teaser.png" width="100%" alt="RLDX-1 teaser">
</p>

**RLDX-1** is a general-purpose Robot Foundation Model designed for dexterous
manipulation. Built on a **Multi-Stream Action Transformer (MSAT)**, it
unifies multimodal perception (visual + tactile), high-DoF actuation, and
memory-aware decision-making in a single architecture.

This repository hosts **`RLDX-1-FT-GR1`**: RLDX-1 finetuned on the
**GR-1 Tabletop** benchmark, a 24-task humanoid manipulation suite using
the Fourier GR-1 humanoid platform. It achieves a **58.7%** success rate
averaged over the 24 tasks.

## Highlights

- **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and
  action each get a dedicated stream coupled by joint self-attention —
  an extension of MM-DiT to action modeling.
- **Motion awareness.** Multi-frame observations + a motion module
  capture temporal dynamics; intermediate VLM layers compress video
  tokens to keep the policy efficient.
- **Long-term memory.** A memory module fuses past cognition features
  with the current ones for history-grounded decisions beyond a short
  multi-frame window.
- **Physical sensing.** Tactile and torque enter as a dedicated physics
  stream; the decoder is jointly trained to predict future physical
  signals.
- **Three-stage training.** Pre-training (generalization) → mid-training
  (functionality) → post-training (task adaptation), with synthetic data
  augmenting rare manipulation scenarios.
- **Real-time inference.** Static graph capture + custom fused kernels
  bring the all-modality model to **43.7 ms / step on RTX 5090
  (1.63× speedup, >22 Hz)**.
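
The latency figures in the last bullet can be checked with a few lines of arithmetic. The sketch below assumes that "speedup" means unoptimized latency divided by optimized latency; the card does not state the baseline latency, so the value computed here is derived, not reported.

```python
def control_rate_hz(latency_ms: float) -> float:
    """Convert a per-step latency in milliseconds to steps per second."""
    return 1000.0 / latency_ms


def baseline_latency_ms(optimized_ms: float, speedup: float) -> float:
    """Implied unoptimized latency, assuming speedup = baseline / optimized."""
    return optimized_ms * speedup


optimized_ms = 43.7   # reported per-step latency on RTX 5090
speedup = 1.63        # reported speedup

rate_hz = control_rate_hz(optimized_ms)
baseline_ms = baseline_latency_ms(optimized_ms, speedup)
baseline_hz = control_rate_hz(baseline_ms)

print(f"optimized: {rate_hz:.1f} Hz")                           # optimized: 22.9 Hz
print(f"implied baseline: {baseline_ms:.1f} ms, {baseline_hz:.1f} Hz")  # implied baseline: 71.2 ms, 14.0 Hz
```

Under that assumption, the optimizations move the policy from roughly 14 Hz to just under 23 Hz, consistent with the ">22 Hz" claim.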

## Performance

| Benchmark | Success Rate |
|---|---|
| GR-1 Tabletop (24-task avg) | **58.7%** |

## Quick start

### Installation

```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```

### Inference

```python
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-FT-GR1",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

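# `observation` is supplied by the caller: camera frames, proprioceptive
# state, and the language instruction (see "Model details" for the inputs).
action = policy.get_action(observation)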
```
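
Because the model outputs action chunks (length 16, per "Model details"), a control loop typically executes part of each chunk before querying the policy again. The following is a minimal sketch of that receding-horizon pattern, not code from the RLDX repository. `run_chunked_control`, `fake_chunk`, and the callback signatures are hypothetical, and it assumes the policy call returns a sequence of per-step actions.

```python
from typing import Callable, List, Sequence

Action = Sequence[float]


def run_chunked_control(
    get_chunk: Callable[[dict], List[Action]],
    read_observation: Callable[[], dict],
    send_action: Callable[[Action], None],
    total_steps: int,
    execute_per_chunk: int,
) -> int:
    """Execute the first `execute_per_chunk` actions of each predicted chunk,
    then re-query the policy. Returns the number of policy queries made."""
    if execute_per_chunk < 1:
        raise ValueError("execute_per_chunk must be at least 1")
    queries = 0
    steps_done = 0
    while steps_done < total_steps:
        chunk = get_chunk(read_observation())
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        queries += 1
        for action in chunk[:execute_per_chunk]:
            if steps_done >= total_steps:
                break
            send_action(action)
            steps_done += 1
    return queries


# Stand-in policy (hypothetical): a 16-step chunk that counts up from obs["t"].
def fake_chunk(obs: dict) -> List[Action]:
    return [[float(obs["t"] + i)] for i in range(16)]


sent: List[Action] = []
queries = run_chunked_control(
    get_chunk=fake_chunk,
    read_observation=lambda: {"t": len(sent)},
    send_action=sent.append,
    total_steps=40,
    execute_per_chunk=8,
)
print(queries, len(sent))  # 5 40
```

Executing fewer actions per chunk re-plans more often at the cost of more inference calls; the right trade-off depends on the task and is not specified by this card.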

### Real-time serving (ZeroMQ)

```bash
uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-FT-GR1 \
    --embodiment-tag GENERAL_EMBODIMENT \
    --host 0.0.0.0 --port 20000
```

To reproduce the benchmark numbers end-to-end, see
[`run_scripts/eval/gr1_tabletop/README.md`](https://github.com/RLWRLD/RLDX-1/blob/main/run_scripts/eval/gr1_tabletop/README.md).

## Model details

- **Architecture:** Multi-Stream Action Transformer (MSAT) policy on a
  Qwen3-VL backbone with cognition-token perceptual summary. Trained with
  flow matching.
- **Inputs:** RGB video (default 4 frames, GR-1 ego-view), state
  proprioception, language instruction.
- **Outputs:** Action chunks of length 16 across the GR-1 humanoid's joint
  space (arms + waist + Fourier hands).
- **Embodiment tag:** `GENERAL_EMBODIMENT`.
- **Base model:** [`RLWRLD/RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT).
- **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
- **Finetune data:** GR-1 Tabletop training set.
- **Params:** 6.9B.
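
To make "trained with flow matching" concrete, the sketch below shows how a flow-matching policy generally draws an action chunk at inference time: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. This illustrates the general technique, not the RLDX-1 implementation. The hand-written `toy_velocity` stands in for the learned network (which would be conditioned on observations), and the step count, noise distribution, time convention, and 2-dimensional action space are all assumptions.

```python
import random
from typing import Callable, List

Chunk = List[List[float]]  # [horizon][action_dim]


def sample_chunk_euler(
    velocity: Callable[[Chunk, float], Chunk],
    horizon: int,
    action_dim: int,
    num_steps: int,
    rng: random.Random,
) -> Chunk:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (actions)."""
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [[xi + dt * vi for xi, vi in zip(row, vrow)] for row, vrow in zip(x, v)]
    return x


def toy_velocity(target: Chunk) -> Callable[[Chunk, float], Chunk]:
    """Velocity of the straight-line path x_t = (1 - t) * noise + t * target.
    A trained network would predict this quantity instead."""
    def velocity(x: Chunk, t: float) -> Chunk:
        return [
            [(ti - xi) / (1.0 - t) for xi, ti in zip(row, trow)]
            for row, trow in zip(x, target)
        ]
    return velocity


# Hypothetical target: a 16-step chunk over a 2-dimensional action space.
target = [[0.1 * i, -0.1 * i] for i in range(16)]
chunk = sample_chunk_euler(toy_velocity(target), horizon=16, action_dim=2,
                           num_steps=10, rng=random.Random(0))
max_error = max(abs(a - b) for row, trow in zip(chunk, target)
                for a, b in zip(row, trow))
print(len(chunk), len(chunk[0]))
```

With this toy field the integration lands on `target` up to floating-point error, which makes the loop easy to verify. A learned field gives only an approximation, and the number of Euler steps trades accuracy against latency.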

For the full architectural walkthrough see
[`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md).

## RLDX-1 model family

| Checkpoint | Description |
|---|---|
| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation |
| [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone |
| [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune |
| [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune |
| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO 4-task suite (goal, object, spatial, long) finetune |
| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune |
| [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune |
| [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune (this repo) |
| [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train |
| [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) |

## Intended use & limitations

**Intended use.** Research on humanoid robotic manipulation, simulation
benchmarking on GR-1 Tabletop, and non-commercial deployment on the GR-1
platform under the conditions of the RLWRLD Model License v1.0.

**Out of scope.** Commercial deployment, military or weapons applications,
non-consensual surveillance, and any use that violates applicable laws or
regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list.

**Limitations.** This checkpoint is conditioned on the GR-1 humanoid
embodiment. For other robots, finetune from
[`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) or
[`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) for
the most generalist initialization.

## Citation

```bibtex
@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}
```

## License

Released under the **RLWRLD Model License v1.0** — a non-commercial license
with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for
the full text. By using this model you agree to those terms, including the
use restrictions in §3.5.