	---
	license: other
	license_name: rlwrld-model-license-v1.0
	license_link: LICENSE.md
	library_name: transformers
	pipeline_tag: robotics
	tags:
	- robotics
	- vla
	- vision-language-action
	- manipulation
	- flow-matching
	- rldx
	base_model: Qwen/Qwen3-VL-8B-Instruct
	---

	# RLDX-1

	[Paper](https://arxiv.org/abs/2605.03269)  ·  [Project page](https://rlwrld.ai/rldx-1)  ·  [Code](https://github.com/RLWRLD/RLDX-1)  ·  [Models](https://huggingface.co/collections/RLWRLD/rldx-1)

	<p align="center">
	<img src="teaser.png" width="100%" alt="RLDX-1 teaser">
	</p>

	RLDX-1 is a general-purpose Robot Foundation Model designed for dexterous
	manipulation. Built on a Multi-Stream Action Transformer (MSAT), it
	unifies multimodal perception (visual + tactile), high-DoF
	actuation, and memory-aware decision-making in a single architecture. RLDX-1
	achieves state-of-the-art performance across diverse simulation benchmarks
	and has been validated on real-world hardware.

	This repository hosts `RLDX-1-PT` — a foundation checkpoint pretrained on
	a broad mixture of public manipulation corpora, from which all downstream
	`RLDX-1-{FT,MT}-*` releases are finetuned. Use it as your starting point for new
	embodiments and tasks.

	<p align="center">
	<img src="architecture.png" width="90%" alt="RLDX-1 architecture">
	</p>

	## Highlights

	- Multi-Stream Action Transformer (MSAT). Cognition, physics, and
	action each get a dedicated stream coupled by joint self-attention —
	an extension of MM-DiT to action modeling.
	- Motion awareness. Multi-frame observations + a motion module
	capture temporal dynamics; intermediate VLM layers compress video
	tokens to keep the policy efficient.
	- Long-term memory. A memory module fuses past cognition features
	with the current ones for history-grounded decisions beyond a short
	multi-frame window.
	- Physical sensing. Tactile and torque enter as a dedicated physics
	stream; the decoder is jointly trained to predict future physical
	signals.
	- Three-stage training. Pre-training (generalization) → mid-training
	(functionality) → post-training (task adaptation), with synthetic data
	augmenting rare manipulation scenarios.
	- Real-time inference. Static graph capture + custom fused kernels
	bring the all-modality model to **43.7 ms / step on RTX 5090
	(1.63× speedup, >22 Hz)**.
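
The latency figure in the last bullet converts to a control rate with simple arithmetic. The snippet below is a quick sanity check, not part of the RLDX-1 codebase. The implied baseline assumes the 1.63× speedup refers to the same per-step latency measured without graph capture and fused kernels, which the text above does not state explicitly.

```python
# Sanity check on the reported real-time numbers (values taken from the bullet above).
LATENCY_MS = 43.7   # reported per-step latency on RTX 5090
SPEEDUP = 1.63      # reported speedup factor


def rate_hz(latency_ms: float) -> float:
    """Convert a per-step latency in milliseconds to steps per second."""
    return 1000.0 / latency_ms


optimized_hz = rate_hz(LATENCY_MS)
# Assumption: the speedup is relative to an unoptimized run of the same model.
implied_baseline_ms = LATENCY_MS * SPEEDUP
baseline_hz = rate_hz(implied_baseline_ms)

print(f"optimized: {optimized_hz:.1f} Hz")
print(f"implied baseline: {implied_baseline_ms:.1f} ms/step, {baseline_hz:.1f} Hz")
```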

	## Released Checkpoints

	This card describes `RLDX-1-PT` (foundation). The full RLDX-1 model family:

	| Checkpoint | Description | Params | Embodiment Tag |
	|---|---|---|---|
	| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation (this repo) | 6.9B | per-dataset |
	| [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone | 8B | n/a |
	| [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune | 6.9B | `GENERAL_EMBODIMENT` |
	| [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune | 6.9B | `GENERAL_EMBODIMENT` |
	| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO 4-task suite (goal, object, spatial, long) finetune | 6.9B | `GENERAL_EMBODIMENT` |
	| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune | 6.9B | `OXE_FRACTAL` |
	| [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune | 6.9B | `OXE_BRIDGE_ORIG` |
	| [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune | 6.9B | `GENERAL_EMBODIMENT` |
	| [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train | 8.1B | `OXE_DROID` |
	| [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) | 8.1B | `GENERAL_EMBODIMENT` |

	## Performance

	Success rate (%) of RLDX-1 finetuned on each benchmark's training set,
	evaluated with the checkpoint listed in the last column.

	| Benchmark | Success Rate | Checkpoint |
	|---|---|---|
	| LIBERO (Avg) | 97.8 | `RLDX-1-FT-LIBERO` |
	| LIBERO-Plus | 87.6 | `RLDX-1-FT-LIBERO` |
	| SIMPLER Google-VM | 81.5 | `RLDX-1-FT-SIMPLER-GOOGLE` |
	| SIMPLER Google-VA | 77.4 | `RLDX-1-FT-SIMPLER-GOOGLE` |
	| SIMPLER WidowX | 71.9 | `RLDX-1-FT-SIMPLER-WIDOWX` |
	| RoboCasa Kitchen (24 tasks) | 70.6 | `RLDX-1-FT-ROBOCASA` |
	| GR-1 Tabletop | 58.7 | `RLDX-1-FT-GR1` |
	| RoboCasa365 (Avg) | 31.5 | `RLDX-1-FT-RC365` |

	## Quick start

	```bash
	git clone https://github.com/RLWRLD/RLDX-1.git
	cd RLDX-1
	uv sync --python 3.10
	uv pip install -e .
	```

	### Inference (single step)

	```python
	from rldx.policy.rldx_policy import RLDXPolicy
	from rldx.data.embodiment_tags import EmbodimentTag

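	policy = RLDXPolicy(
	    model_path="RLWRLD/RLDX-1-FT-ROBOCASA",
	    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
	    device="cuda:0",
	)

	# `observation` is supplied by the caller: camera frames, proprioceptive
	# state, and the language instruction (see "Model details" for the inputs).
	action = policy.get_action(observation)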
	```

	`RLDX-1-PT` is pretrained on a multi-source mixture, so for direct inference
	pair it with the embodiment tag matching your data source — e.g.
	`OXE_FRACTAL`, `OXE_BRIDGE_ORIG`, `OXE_DROID`, `GALAXEA`, `AGIBOT_GRIPPER`,
	`AGIBOT_DEXHAND`, `NEURAL_GR1`, `HUMANOID_EVERYDAY_G1`,
	`HUMANOID_EVERYDAY_H1`, etc. For custom robots, finetune.
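
Because the policy predicts action chunks (16 steps by default, see Model details), a control loop can query it once per chunk rather than once per step. The sketch below illustrates that pattern with a stand-in policy. `StubPolicy`, `run_episode`, `replan_every`, the 7-dimensional action, and the assumption that `get_action` returns a list of per-step actions are all hypothetical and not confirmed by the RLDX-1 API.

```python
from typing import Dict, List

ACTION_HORIZON = 16  # default chunk length stated in this card


class StubPolicy:
    """Hypothetical stand-in for RLDXPolicy; returns a dummy action chunk."""

    def __init__(self) -> None:
        self.calls = 0

    def get_action(self, observation: Dict) -> List[List[float]]:
        self.calls += 1
        # One zero action per step; the dimension 7 is illustrative only.
        return [[0.0] * 7 for _ in range(ACTION_HORIZON)]


def run_episode(policy, get_obs, apply_action, total_steps, replan_every=ACTION_HORIZON):
    """Execute `total_steps` actions, re-querying the policy every `replan_every` steps."""
    executed = 0
    while executed < total_steps:
        chunk = policy.get_action(get_obs())
        for action in chunk[:replan_every]:
            if executed >= total_steps:
                break
            apply_action(action)
            executed += 1
    return executed


policy = StubPolicy()
log: List[List[float]] = []
steps = run_episode(policy, get_obs=lambda: {}, apply_action=log.append, total_steps=40)
print(steps, policy.calls)
```

Setting `replan_every` below the chunk length trades more inference calls for fresher observations.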

	### Real-time serving (ZeroMQ)

	```bash
	uv run python rldx/eval/run_rldx_server.py \
	    --model-path RLWRLD/RLDX-1-FT-ROBOCASA \
	    --embodiment-tag GENERAL_EMBODIMENT \
	    --host 0.0.0.0 --port 20000
	```

	A WebSocket server (`run_rldx_server_pi.py`) is also available for
	openpi-compatible clients.

	### Finetune from `RLDX-1-PT`

	```bash
	uv run python rldx/experiment/launch_train.py \
	    --base-model-path RLWRLD/RLDX-1-PT \
	    --dataset-path /path/to/your/dataset \
	    --embodiment-tag GENERAL_EMBODIMENT \
	    --video-length 4 --n-cog-tokens 64 \
	    --global-batch-size 64 --learning-rate 1e-4 \
	    --max-steps 60000 --save-steps 5000 \
	    --output-dir ./outputs/my_finetune
	```
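
As a quick budget check for the command above, the flags imply the following number of checkpoints and training samples. This is plain arithmetic on the flag values. It assumes `--save-steps` writes one checkpoint every 5000 optimizer steps and that each step consumes exactly one global batch, neither of which is confirmed here.

```python
# Values copied from the launch command above.
MAX_STEPS = 60_000
SAVE_STEPS = 5_000
GLOBAL_BATCH_SIZE = 64

# Assumption: one checkpoint per SAVE_STEPS optimizer steps.
num_checkpoints = MAX_STEPS // SAVE_STEPS
# Assumption: every step consumes one full global batch.
samples_seen = MAX_STEPS * GLOBAL_BATCH_SIZE

print(f"checkpoints: {num_checkpoints}")
print(f"samples seen: {samples_seen:,}")
```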

	To enable add-ons (memory / motion / physics) see the recipes in the
	[main README](https://github.com/RLWRLD/RLDX-1#finetuning) and the
	[`training.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/training.md)
	guide.

	## Model details

	- Architecture: Multi-Stream Action Transformer (MSAT) policy with a
	Qwen3-VL vision-language backbone, cognition-token perceptual summary,
	optional Transformer memory, motion module, and tactile/torque physics
	encoder/decoder. Trained with flow matching.
	- Inputs: RGB video (default 4 frames), proprioceptive state, optional
	tactile / torque signals, language instruction.
	- Outputs: Action chunks of length 16 (default `--action-horizon 16`).
	- Backbone: [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
	- Pretraining data: A mixture of public manipulation corpora, covering
	27 [Open X-Embodiment (OXE)](https://robotics-transformer-x.github.io/)
	datasets (DROID, Bridge, Fractal, Language Table, …) plus
	[Galaxea](https://galaxea.ai/), [AgiBot World](https://agibot-world.com/)
	(Gripper + Dexhand), ActionNet, Neural-Curated GR-1 humanoid trajectories,
	and Unitree G1 / H1 from
	[HumanoidEveryday](https://lipeng-zhou.github.io/HumanoidEveryday/).
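
Flow matching, named above as the training objective, can be illustrated with a toy example. The sketch below is not the RLDX-1 implementation: it uses the common linear-interpolation path with noise at t = 0 and data at t = 1 (a convention assumed here), an oracle velocity field in place of the learned network, and a hypothetical 7-dimensional action. Only the chunk length of 16 comes from this card.

```python
import random

ACTION_HORIZON = 16  # chunk length from this card
ACTION_DIM = 7       # hypothetical action dimension


def training_pair(noise, target, t):
    """Return (x_t, velocity target) for the path x_t = (1 - t) * noise + t * target."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, target)]
    velocity = [a - n for n, a in zip(noise, target)]
    return x_t, velocity


def oracle_velocity(x_t, t, target):
    """Stand-in for the learned network: the exact velocity when the target is known."""
    return [(a - x) / (1.0 - t) for x, a in zip(x_t, target)]


def sample_chunk(noise, target, num_steps=10):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (action chunk)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest value is 1 - dt, so the division above is safe
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
size = ACTION_HORIZON * ACTION_DIM  # flattened action chunk
target = [rng.uniform(-1.0, 1.0) for _ in range(size)]
noise = [rng.gauss(0.0, 1.0) for _ in range(size)]

chunk = sample_chunk(noise, target)
max_err = max(abs(c - a) for c, a in zip(chunk, target))
print(f"max error after integration: {max_err:.2e}")
```

In training, a network would be regressed onto the `velocity` returned by `training_pair` at the point `x_t`, typically with a mean-squared error. At inference the learned network replaces `oracle_velocity`.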

	For a full architectural walkthrough see
	[`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md).
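
The joint self-attention coupling described in Highlights follows the MM-DiT pattern: each stream keeps its own projection weights, while attention is computed over the concatenated tokens of all streams. The single-head sketch below illustrates that idea only. The feature width, random weights, token counts, and helper names are hypothetical, and it omits everything else a real MSAT block would contain (multiple heads, normalization, MLPs, conditioning).

```python
import math
import random

DIM = 4  # hypothetical feature width
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(streams, weights):
    """streams: name -> tokens [n, DIM]; weights: name -> (Wq, Wk, Wv) per stream."""
    q, k, v, spans = [], [], [], {}
    for name, tokens in streams.items():
        wq, wk, wv = weights[name]  # stream-specific projections
        spans[name] = (len(q), len(q) + len(tokens))
        q += matmul(tokens, wq)
        k += matmul(tokens, wk)
        v += matmul(tokens, wv)
    # Every token attends over the concatenated sequence of all streams.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(DIM) for s in row]) for row in scores]
    out = matmul(attn, v)
    # Split the result back into per-stream token sets.
    return {name: out[a:b] for name, (a, b) in spans.items()}, attn


streams = {
    "cognition": rand_matrix(3, DIM),
    "physics": rand_matrix(2, DIM),
    "action": rand_matrix(4, DIM),
}
weights = {name: tuple(rand_matrix(DIM, DIM) for _ in range(3)) for name in streams}
outputs, attn = joint_attention(streams, weights)
print({name: (len(toks), len(toks[0])) for name, toks in outputs.items()})
```

Each row of `attn` spans all nine tokens, so an action token can read from cognition and physics tokens even though the three streams never share projection weights.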

	## Intended use & limitations

	Intended use. Research on robotic manipulation, finetuning on custom
	embodiments, simulation benchmarking, and non-commercial real-robot
	deployment under the conditions of the RLWRLD Model License v1.0.

	Out of scope. Commercial deployment, military or weapons applications,
	non-consensual surveillance, and any use that violates applicable laws or
	regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list.

	Limitations. Performance depends heavily on embodiment match and data
	distribution. The pretrained checkpoint is OXE-conditioned and is not
	guaranteed to work zero-shot on novel embodiments without finetuning.
	Memory, motion, and physics modules are inactive in `RLDX-1-PT` and take
	effect only when the corresponding flags are enabled during finetuning (see
	`RLDX-1-MT-ALLEX`).

	## Citation

	```bibtex
	@article{rldx2026,
	  title={RLDX-1 Technical Report},
	  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
	  year={2026},
	  note={RLWRLD},
	  eprint={2605.03269},
	  archivePrefix={arXiv},
	  url={https://arxiv.org/abs/2605.03269}
	}
	```

	## License

	Released under the RLWRLD Model License v1.0 — a non-commercial license
	with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for
	the full text. By using this model you agree to those terms, including the
	use restrictions in §3.5.