Use with the Transformers library

# Load model directly
from transformers import RLDX
model = RLDX.from_pretrained("RLWRLD/RLDX-1-MT-DROID", dtype="auto")
Quick Links

RLDX-1-MT-DROID

Paper  ·  Project page  ·  Code  ·  Models

(Figure: RLDX-1 teaser)

RLDX-1 is a general-purpose Robot Foundation Model designed for dexterous manipulation. Powered by a Multi-Stream Action Transformer (MSAT), it seamlessly unifies multimodal perception (visual + tactile), high-DoF actuation, and memory-aware decision-making in a single architecture.

This repository hosts RLDX-1-MT-DROID — RLDX-1 mid-trained on the DROID dataset (large-scale Franka-arm teleoperation). Mid-training continues from the multi-source RLDX-1-PT pretraining with an embodiment-specific corpus before downstream task finetuning, making this checkpoint a stronger initialization than RLDX-1-PT for any Franka-style downstream task.

Highlights

  • Multi-Stream Action Transformer (MSAT). Cognition, physics, and action each get a dedicated stream coupled by joint self-attention — an extension of MM-DiT to action modeling.
  • Motion awareness. Multi-frame observations + a motion module capture temporal dynamics; intermediate VLM layers compress video tokens to keep the policy efficient.
  • Long-term memory. A memory module fuses past cognition features with the current ones for history-grounded decisions beyond a short multi-frame window.
  • Physical sensing. Tactile and torque enter as a dedicated physics stream; the decoder is jointly trained to predict future physical signals.
  • Three-stage training. Pre-training (generalization) → mid-training (functionality) → post-training (task adaptation), with synthetic data augmenting rare manipulation scenarios.
  • Real-time inference. Static graph capture + custom fused kernels bring the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz).
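To make the multi-stream idea concrete, here is a minimal, illustrative sketch of MSAT-style joint self-attention: each stream (cognition, physics, action) keeps its own projections, but attention runs over the concatenated token sequence so the streams exchange information. This is a toy NumPy reconstruction from the description above, not the released implementation (which is multi-head, layered, and trained with flow matching).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(streams, proj):
    """streams: name -> (n_tokens, d) array; proj: name -> (Wq, Wk, Wv).

    Per-stream projections, shared attention over the concatenation --
    the MM-DiT-style coupling described above, reduced to one head.
    """
    names = list(streams)
    q = np.concatenate([streams[n] @ proj[n][0] for n in names])
    k = np.concatenate([streams[n] @ proj[n][1] for n in names])
    v = np.concatenate([streams[n] @ proj[n][2] for n in names])
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    # Split the joint output back into per-stream tensors.
    sizes = np.cumsum([streams[n].shape[0] for n in names])[:-1]
    return dict(zip(names, np.split(out, sizes)))

rng = np.random.default_rng(0)
d = 32  # toy embedding width; token counts below are also illustrative
streams = {n: rng.standard_normal((t, d))
           for n, t in [("cognition", 6), ("physics", 4), ("action", 16)]}
proj = {n: tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        for n in streams}
out = joint_self_attention(streams, proj)
```

Each stream comes back with its original token count, but every output token has attended to all three streams.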

Quick start

Installation

git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .

Inference

from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-MT-DROID",
    embodiment_tag=EmbodimentTag.OXE_DROID,
    device="cuda:0",
)

action = policy.get_action(observation)  # returns a 16-step action chunk
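The observation is a dict of camera frames, proprioceptive state, and the language instruction. The exact key names, camera views, and state dimension are defined by the embodiment config, so the layout below is purely illustrative (every key and shape here is an assumption, not the real OXE_DROID schema):

```python
import numpy as np

# Hypothetical observation layout -- check the OXE_DROID embodiment
# config for the real key names, camera views, and state dimension.
observation = {
    "video": np.zeros((4, 224, 224, 3), dtype=np.uint8),  # 4-frame RGB window
    "state": np.zeros((8,), dtype=np.float32),            # proprioception
    "language": "put the marker in the drawer",           # instruction
}
```

Per the model details below, the default observation window is 4 frames and `get_action` returns a 16-step action chunk.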

Real-time serving (ZeroMQ)

uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-MT-DROID \
    --embodiment-tag OXE_DROID \
    --host 0.0.0.0 --port 20000
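On the client side, each observation goes over a REQ/REP round trip to the server above. The wire format sketched here (a pickled dict with an "observation" field) is an assumption for illustration; check rldx/eval/run_rldx_server.py for the actual protocol.

```python
import pickle

def build_request(observation):
    # Serialize one observation for the round trip (assumed wire format).
    return pickle.dumps({"observation": observation})

def get_action(sock, observation):
    # `sock` is a connected zmq.REQ socket, e.g. to tcp://<host>:20000.
    sock.send(build_request(observation))
    return pickle.loads(sock.recv())
```

A REQ socket enforces the strict send/receive alternation this loop needs; for lower latency or pipelining you would switch to DEALER/ROUTER.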

Finetune from this checkpoint

uv run python rldx/experiment/launch_train.py \
    --base-model-path RLWRLD/RLDX-1-MT-DROID \
    --dataset-path /path/to/your/dataset \
    --embodiment-tag OXE_DROID \
    --video-length 4 --n-cog-tokens 64 \
    --global-batch-size 64 --learning-rate 1e-4 \
    --max-steps 60000 --output-dir ./outputs/my_finetune

For a full finetune walkthrough see docs/training.md.

Model details

  • Architecture: Multi-Stream Action Transformer (MSAT) policy on a Qwen3-VL backbone with cognition-token perceptual summary. Trained with flow matching.
  • Inputs: RGB video (default 4 frames), state proprioception, language instruction.
  • Outputs: Action chunks of length 16.
  • Embodiment tag: OXE_DROID.
  • Base model: RLWRLD/RLDX-1-PT.
  • Backbone: Qwen/Qwen3-VL-8B-Instruct.
  • Mid-train data: DROID.
  • Params: 8.1B.
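Because the policy emits 16-step action chunks, a common deployment pattern is receding-horizon execution: run only the first few actions of each chunk, then re-query the policy with a fresh observation. A minimal sketch of that loop (the `policy` and `env` interfaces here are generic stand-ins, not this project's API):

```python
def run_episode(policy, env, chunk_len=16, execute_steps=8, max_steps=64):
    """Execute `execute_steps` actions of each chunk before replanning.

    Smaller `execute_steps` reacts faster to new observations at the cost
    of more inference calls; `execute_steps == chunk_len` is open-loop
    chunk execution.
    """
    obs, t = env.reset(), 0
    while t < max_steps:
        chunk = policy.get_action(obs)        # shape: (chunk_len, action_dim)
        for action in chunk[:execute_steps]:  # replan before the chunk ends
            obs = env.step(action)
            t += 1
    return obs
```

At >22 Hz inference, replanning every 8 steps keeps the control loop well within real-time budgets.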

For the full architectural walkthrough see docs/architecture.md.

RLDX-1 model family

  • RLDX-1-PT: multi-source pretrained foundation
  • RLDX-1-VLM: Qwen3-VL-8B vision-language backbone
  • RLDX-1-FT-ROBOCASA: RoboCasa Kitchen 24-task finetune
  • RLDX-1-FT-RC365: RoboCasa-365 cross-task finetune
  • RLDX-1-FT-LIBERO: LIBERO four-suite (goal, object, spatial, long) finetune
  • RLDX-1-FT-SIMPLER-GOOGLE: SIMPLER Google VM/VA finetune
  • RLDX-1-FT-SIMPLER-WIDOWX: SIMPLER WidowX finetune
  • RLDX-1-FT-GR1: GR-1 tabletop finetune
  • RLDX-1-MT-DROID: DROID mid-train (this repo)
  • RLDX-1-MT-ALLEX: mid-train with all add-ons (memory + motion + physics + video)

Intended use & limitations

Intended use. As a strong initialization for downstream finetuning on Franka-arm manipulation tasks; research on robotic manipulation; and non-commercial real-robot deployment under the conditions of the RLWRLD Model License v1.0.

Out of scope. Commercial deployment, military or weapons applications, non-consensual surveillance, and any use that violates applicable laws or regulations. See LICENSE.md §3.5 for the full list.

Limitations. Mid-train conditioning is most useful for Franka-style embodiments. For very different morphologies (humanoid, dual-arm, mobile), RLDX-1-PT or RLDX-1-MT-ALLEX may be better starting points. The memory, motion, and physics modules are inactive in this checkpoint — enable them at finetune time if needed.

Citation

@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}

License

Released under the RLWRLD Model License v1.0 — a non-commercial license with attribution and share-alike requirements. See LICENSE.md for the full text. By using this model you agree to those terms, including the use restrictions in §3.5.
