DK-1 DreamDojo — 30K Checkpoint

Fine-tuned action-conditioned world model for the DK-1 bimanual robot arm, based on NVIDIA DreamDojo (Cosmos-Predict2.5-2B).

Experiment

Goal: Learn action conditioning for the DK-1 bimanual arm (14 DOF joint positions) on top of a pretrained YAM post-trained checkpoint.

Key design decisions:

  • action_dim=14 — dense input to ActionEmbedder (no sparse 384-dim multi-embodiment padding)
  • Zero-init fc2 + Kaiming fc1 — the ActionEmbedder's output is exactly zero at step 0, so the pretrained backbone is undisturbed when action conditioning is first attached
  • Differential learning rates: ActionEmbedder 1e-4, AdaLN modulation 1.6e-4, Backbone 1.6e-7 (quasi-frozen)
  • Delta actions normalized to [-1, 1] via p01/p99 statistics over dk1-merge-2026-03
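
The init scheme and learning-rate split above can be sketched as follows. This is a minimal numpy illustration, not the actual ActionEmbedder: the hidden width, the ReLU activation, and the two-layer MLP structure are assumptions; only the zero-init/Kaiming choice and the LR values come from the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)
action_dim, hidden, model_dim = 14, 1024, 2048  # hidden/model dims are illustrative

# Kaiming (He) init for fc1: N(0, sqrt(2 / fan_in)), suited to ReLU
fc1_w = rng.normal(0.0, np.sqrt(2.0 / action_dim), size=(hidden, action_dim))
# Zero-init fc2: the embedder emits exactly zero at step 0,
# so the pretrained backbone sees no perturbation initially
fc2_w = np.zeros((model_dim, hidden))

def action_embed(a):
    h = np.maximum(fc1_w @ a, 0.0)  # fc1 + ReLU (activation assumed)
    return fc2_w @ h                # fc2 -> all zeros before training

emb = action_embed(rng.normal(size=action_dim))
assert np.allclose(emb, 0.0)

# Differential learning rates from the design notes (group names illustrative)
lr_groups = {"action_embedder": 1e-4, "adaln_modulation": 1.6e-4, "backbone": 1.6e-7}
```

The zero output means gradients first flow into the embedder and AdaLN parameters, while the quasi-frozen backbone LR keeps the pretrained video prior intact.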

Dataset: andreaskoepf/dk1-merge-2026-03 — 1674 episodes, unified cameras (top/left_wrist/right_wrist), 360×640, 30 fps, 14-dim joint positions
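
The p01/p99 delta-action normalization mentioned above can be sketched like this. Function names and the synthetic data are illustrative; the actual statistics are computed over dk1-merge-2026-03.

```python
import numpy as np

def fit_norm_stats(deltas):
    """Per-dimension 1st/99th percentiles over the training-set action deltas."""
    return np.percentile(deltas, 1, axis=0), np.percentile(deltas, 99, axis=0)

def normalize(delta, p01, p99):
    """Map [p01, p99] linearly onto [-1, 1], clipping outliers."""
    x = 2.0 * (delta - p01) / (p99 - p01) - 1.0
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(0)
deltas = rng.normal(scale=0.05, size=(10_000, 14))  # fake 14-DOF joint deltas
p01, p99 = fit_norm_stats(deltas)
normed = normalize(deltas, p01, p99)
```

Percentile-based bounds (rather than min/max) keep rare outlier deltas from compressing the usable range of the action embedding.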

Training: 30,000 steps, batch size 4 (2×RTX 6000 Ada)

W&B: https://wandb.ai/andreaskoepf/dk1-dreamdojo

Checkpoint

model_ema_bf16.pt — EMA weights in bfloat16, no optimizer states.

Load with the DreamDojo codebase:

from cosmos_predict2._src.predict2.utils.model_loader import load_model_from_checkpoint

model, config = load_model_from_checkpoint(
    experiment_name="dreamdojo_2b_480_640_dk1_merge",
    s3_checkpoint_dir="model_ema_bf16.pt",
    config_file="cosmos_predict2/_src/predict2/action/configs/action_conditioned/config.py",
    load_ema_to_reg=True,  # load the EMA weights into the regular model parameters
)

Eval

The eval/ directory contains 32 comparison videos (random episodes from dk1-merge-2026-03), each showing three panels side by side: GT | generated with GT actions | generated with zero actions

PSNR (generated with GT actions vs. ground truth, 4 sampled episodes): 24.6, 34.0, 20.8, 27.5 dB (avg ≈ 26.7 dB)
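
The PSNR figures above can be reproduced with a standard per-video PSNR. A minimal sketch, assuming frames normalized to [0, 1]; the function name and synthetic data are illustrative:

```python
import numpy as np

def psnr(gt, pred, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped videos."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
gt = rng.random((8, 64, 64, 3))  # T x H x W x C video in [0, 1]
noisy = np.clip(gt + rng.normal(scale=0.01, size=gt.shape), 0.0, 1.0)
score = psnr(gt, noisy)
```
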
