---
license: other
license_name: rlwrld-model-license-v1.0
license_link: LICENSE.md
library_name: transformers
pipeline_tag: robotics
tags:
  - robotics
  - vla
  - vision-language-action
  - manipulation
  - flow-matching
  - rldx
  - robocasa
base_model: RLWRLD/RLDX-1-PT
---

RLDX-1-FT-RC365

Paper · Project page · Code · Models

RLDX-1 teaser

RLDX-1 is a general-purpose Robot Foundation Model designed for dexterous manipulation. Built on a Multi-Stream Action Transformer (MSAT), it unifies multimodal perception (visual and tactile), high-DoF actuation, and memory-aware decision-making in a single architecture.

This repository hosts RLDX-1-FT-RC365 — RLDX-1 finetuned on the RoboCasa-365 cross-task generalization suite. It achieves 31.5% average success across the 365 tasks, which span a much broader scene and skill distribution than the standard RoboCasa Kitchen suite.

Highlights

Multi-Stream Action Transformer (MSAT). Cognition, physics, and action each get a dedicated stream coupled by joint self-attention — an extension of MM-DiT to action modeling.
Motion awareness. Multi-frame observations + a motion module capture temporal dynamics; intermediate VLM layers compress video tokens to keep the policy efficient.
Long-term memory. A memory module fuses past cognition features with the current ones for history-grounded decisions beyond a short multi-frame window.
Physical sensing. Tactile and torque signals enter as a dedicated physics stream; the decoder is jointly trained to predict future physical signals.
Three-stage training. Pre-training (generalization) → mid-training (functionality) → post-training (task adaptation), with synthetic data augmenting rare manipulation scenarios.
Real-time inference. Static graph capture + custom fused kernels bring the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz).
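
The real-time figures above can be cross-checked with a little arithmetic. The sketch below uses only the two numbers quoted in this card (43.7 ms per step and a 1.63× speedup). The baseline latency it prints is derived, not reported, and assumes the speedup is measured against the same model on the same GPU without the optimizations.

```python
STEP_LATENCY_MS = 43.7  # per-step latency quoted in this card (RTX 5090)
SPEEDUP = 1.63          # speedup quoted in this card

# Control rate implied by the per-step latency.
control_rate_hz = 1000.0 / STEP_LATENCY_MS

# Implied unoptimized latency (an inference from the two numbers above,
# assuming both were measured on the same model and hardware).
implied_baseline_ms = STEP_LATENCY_MS * SPEEDUP

print(f"control rate: {control_rate_hz:.1f} Hz")               # control rate: 22.9 Hz
print(f"implied baseline: {implied_baseline_ms:.1f} ms/step")  # implied baseline: 71.2 ms/step
```

The roughly 22.9 Hz result is consistent with the ">22 Hz" claim.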

Performance

Benchmark	Success Rate
RoboCasa-365 (365-task avg)	31.5%

Quick start

Installation

git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .

Inference

from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-FT-RC365",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

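# `observation` is supplied by the caller: the current camera frames,
# proprioceptive state, and language instruction, in the format the policy expects.
action = policy.get_action(observation)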
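
Because the model outputs action chunks of length 16, a control loop typically executes several actions per policy call. The following is a minimal sketch of such a loop, not the repository's evaluation code. `StubPolicy`, `run_episode`, the 7-dimensional dummy actions, and the assumption that `get_action` returns a plain list of actions are all hypothetical stand-ins so the example runs without the model.

```python
from collections import deque

CHUNK_LEN = 16  # chunk length stated under "Model details"


class StubPolicy:
    """Hypothetical stand-in for RLDXPolicy that returns a dummy action chunk."""

    def __init__(self):
        self.calls = 0

    def get_action(self, observation):
        self.calls += 1
        # 7-D zero actions are an assumption made purely for illustration.
        return [[0.0] * 7 for _ in range(CHUNK_LEN)]


def run_episode(policy, get_observation, apply_action, num_steps,
                replan_every=CHUNK_LEN):
    """Execute chunked actions, querying the policy every `replan_every` steps."""
    if not 1 <= replan_every <= CHUNK_LEN:
        raise ValueError("replan_every must be between 1 and CHUNK_LEN")
    queue = deque()
    for step in range(num_steps):
        if step % replan_every == 0:
            # Discard any leftover actions and plan from the newest observation.
            queue = deque(policy.get_action(get_observation()))
        apply_action(queue.popleft())


policy = StubPolicy()
executed = []
run_episode(policy, get_observation=lambda: {}, apply_action=executed.append,
            num_steps=40)
print(len(executed), policy.calls)
```

Setting `replan_every` below the chunk length trades extra inference calls for faster reaction to new observations.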

Real-time serving (ZeroMQ)

uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-FT-RC365 \
    --embodiment-tag GENERAL_EMBODIMENT \
    --host 0.0.0.0 --port 20000

To reproduce the benchmark numbers end-to-end, see run_scripts/eval/robocasa_365/README.md.

Model details

Architecture: Multi-Stream Action Transformer (MSAT) policy on a Qwen3-VL backbone with a cognition-token perceptual summary. Trained with flow matching.
Inputs: RGB video (default 4 frames), proprioceptive state, language instruction.
Outputs: Action chunks of length 16.
Embodiment tag: GENERAL_EMBODIMENT.
Base model: RLWRLD/RLDX-1-PT.
Backbone: Qwen/Qwen3-VL-8B-Instruct.
Finetune data: RoboCasa-365 (365 tasks).
Params: 6.9B.
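
To illustrate what "trained with flow matching" means at inference time, here is a generic sketch of sampling an action chunk by Euler-integrating a velocity field from noise (t = 0) to actions (t = 1). It is not RLDX-1's sampler. `toy_velocity` is a hypothetical closed-form stand-in for the learned network, and the action dimension and step count are assumptions.

```python
import random

CHUNK_LEN = 16  # action-chunk length stated in this card
ACTION_DIM = 7  # hypothetical action dimension, for illustration only


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    On the straight path x_t = (1 - t) * x_0 + t * x_1 toward a single
    target x_1, the velocity is x_1 - x_0 = (x_1 - x_t) / (1 - t).
    """
    return [[(g - v) / (1.0 - t) for v, g in zip(row, goal_row)]
            for row, goal_row in zip(x, target)]


def sample_chunk(target, num_steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from Gaussian noise to an action chunk."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
         for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


target = [[0.1 * i] * ACTION_DIM for i in range(CHUNK_LEN)]
chunk = sample_chunk(target)
print(len(chunk), len(chunk[0]))
```

In a real policy the velocity would come from the network conditioned on the observation, and the number of integration steps trades accuracy against latency.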

For the full architectural walkthrough see docs/architecture.md.
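
As a rough illustration of the MM-DiT-style coupling named in the highlights, the sketch below gives each stream its own query, key, and value projections and then runs a single softmax over the tokens of all streams. It is a toy, single-head version written for readability, not the MSAT implementation: the token counts, width, and random weights are assumptions, and multi-head splitting, output projections, normalization, and MLPs are omitted.

```python
import math
import random


def project(tokens, weight):
    """Multiply each token (row vector) by a d_in x d_out weight matrix."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(streams, weights):
    """streams: {name: list of tokens}; weights: {name: (Wq, Wk, Wv)}."""
    queries, keys, values, owners = [], [], [], []
    for name, tokens in streams.items():
        w_q, w_k, w_v = weights[name]  # each stream keeps its own projections
        queries += project(tokens, w_q)
        keys += project(tokens, w_k)
        values += project(tokens, w_v)
        owners += [name] * len(tokens)

    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = {name: [] for name in streams}
    for query, owner in zip(queries, owners):
        # One softmax over the tokens of every stream is what couples them.
        attn = softmax([scale * sum(q * k for q, k in zip(query, key))
                        for key in keys])
        mixed = [sum(a * v[d] for a, v in zip(attn, values))
                 for d in range(len(values[0]))]
        outputs[owner].append(mixed)  # route the result back to its stream
    return outputs


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 4  # toy width (assumption)
token_counts = {"cognition": 3, "physics": 2, "action": 5}  # toy sizes
streams = {n: random_matrix(rng, c, DIM) for n, c in token_counts.items()}
weights = {n: tuple(random_matrix(rng, DIM, DIM) for _ in range(3))
           for n in token_counts}
out = joint_self_attention(streams, weights)
print({name: len(tokens) for name, tokens in out.items()})
```

Because attention is computed jointly, changing the physics tokens changes the action-stream outputs even though the two streams share no projection weights.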

RLDX-1 model family

Checkpoint	Description
`RLDX-1-PT`	Multi-source pretrained foundation
`RLDX-1-VLM`	Qwen3-VL-8B vision-language backbone
`RLDX-1-FT-ROBOCASA`	RoboCasa Kitchen 24-task finetune
`RLDX-1-FT-RC365`	RoboCasa-365 cross-task finetune (this repo)
`RLDX-1-FT-LIBERO`	LIBERO finetune covering four task suites (goal, object, spatial, long)
`RLDX-1-FT-SIMPLER-GOOGLE`	SIMPLER Google VM/VA finetune
`RLDX-1-FT-SIMPLER-WIDOWX`	SIMPLER WidowX finetune
`RLDX-1-FT-GR1`	GR-1 Tabletop finetune
`RLDX-1-MT-DROID`	DROID mid-train
`RLDX-1-MT-ALLEX`	All add-ons (memory + motion + physics + video)

Intended use & limitations

Intended use. Research on robotic manipulation, generalization studies on the RoboCasa-365 suite, and non-commercial real-robot deployment under the conditions of the RLWRLD Model License v1.0.

Out of scope. Commercial deployment, military or weapons applications, non-consensual surveillance, and any use that violates applicable laws or regulations. See LICENSE.md §3.5 for the full list.

Limitations. RoboCasa-365 deliberately probes broad task generalization, so its absolute success rate is lower than that of the focused 24-task RoboCasa Kitchen finetune. For RoboCasa Kitchen specifically, prefer RLDX-1-FT-ROBOCASA. For other embodiments or datasets, finetune from RLDX-1-PT instead.

Citation

@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}

License

Released under the RLWRLD Model License v1.0 — a non-commercial license with attribution and share-alike requirements. See LICENSE.md for the full text. By using this model you agree to those terms, including the use restrictions in §3.5.