# DK-1 DreamDojo — 30K Checkpoint
Fine-tuned action-conditioned world model for the DK-1 bimanual robot arm, based on NVIDIA DreamDojo (Cosmos-Predict2.5-2B).
## Experiment
**Goal:** Learn action conditioning for the DK-1 bimanual arm (14-DOF joint positions) on top of a pretrained YAM post-trained checkpoint.
Key design decisions:

- `action_dim=14` — dense input to the ActionEmbedder (no sparse 384-dim multi-embodiment padding)
- Zero-init fc2 + Kaiming-init fc1 — avoids disrupting the pretrained backbone at the start of training
- Differential learning rates: ActionEmbedder `1e-4`, AdaLN modulation `1.6e-4`, backbone `1.6e-7` (quasi-frozen)
- Delta actions normalized to `[-1, 1]` via p01/p99 statistics over dk1-merge-2026-03
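The zero-init fc2 + Kaiming fc1 scheme can be sketched as below. The class name, hidden size, and activation are illustrative assumptions, not the actual DreamDojo module; the point is that a zero-initialized output layer makes the embedder emit zeros at step 0, so the pretrained backbone initially behaves as if unconditioned.

```python
import torch
import torch.nn as nn

class ActionEmbedder(nn.Module):
    """Illustrative two-layer action MLP (names/sizes are assumptions)."""

    def __init__(self, action_dim: int = 14, hidden_dim: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(action_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        # Kaiming init for the first layer (ReLU fan-in scaling).
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity="relu")
        nn.init.zeros_(self.fc1.bias)
        # Zero init for the output layer: the embedder contributes
        # nothing at the start of fine-tuning.
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(actions)))
```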
**Dataset:** `andreaskoepf/dk1-merge-2026-03` — 1674 episodes, unified cameras (top / left_wrist / right_wrist), 360×640, 30 fps, 14-dim joint positions
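The p01/p99 delta-action normalization over this dataset can be sketched as follows. This is an illustrative re-implementation, not the actual preprocessing code; the function names and the clipping of outliers beyond the percentile range are assumptions.

```python
import numpy as np

def compute_norm_stats(deltas: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-dimension 1st/99th percentiles over (N, 14) joint-position deltas."""
    p01 = np.percentile(deltas, 1, axis=0)
    p99 = np.percentile(deltas, 99, axis=0)
    return p01, p99

def normalize_deltas(delta: np.ndarray, p01: np.ndarray, p99: np.ndarray) -> np.ndarray:
    """Linear map [p01, p99] -> [-1, 1], clipped so outliers stay in range."""
    scaled = 2.0 * (delta - p01) / (p99 - p01) - 1.0
    return np.clip(scaled, -1.0, 1.0)
```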
**Training:** 30,000 steps, batch size 4 (2× RTX 6000 Ada)
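The differential learning rates from the design decisions map naturally onto PyTorch optimizer parameter groups. A minimal sketch; the substring filters (`action_embedder`, `adaln`) are assumptions about the checkpoint's parameter naming, not the actual training config:

```python
import torch

def build_param_groups(model: torch.nn.Module) -> list[dict]:
    """Split trainable parameters into three lr groups by name."""
    embedder, adaln, backbone = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "action_embedder" in name:
            embedder.append(p)
        elif "adaln" in name:
            adaln.append(p)
        else:
            backbone.append(p)
    return [
        {"params": embedder, "lr": 1e-4},
        {"params": adaln, "lr": 1.6e-4},
        {"params": backbone, "lr": 1.6e-7},  # quasi-frozen backbone
    ]
```

Usage would then be e.g. `torch.optim.AdamW(build_param_groups(model))`, with each group's `lr` overriding the optimizer default.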
**W&B:** https://wandb.ai/andreaskoepf/dk1-dreamdojo
## Checkpoint
`model_ema_bf16.pt` — EMA weights in bfloat16, no optimizer states.
Load with the DreamDojo codebase:
```python
from cosmos_predict2._src.predict2.utils.model_loader import load_model_from_checkpoint

model, config = load_model_from_checkpoint(
    experiment_name="dreamdojo_2b_480_640_dk1_merge",
    s3_checkpoint_dir="model_ema_bf16.pt",
    config_file="cosmos_predict2/_src/predict2/action/configs/action_conditioned/config.py",
    load_ema_to_reg=True,
)
```
## Eval
The `eval/` directory contains 32 comparison videos (random episodes from dk1-merge-2026-03). Each video shows three panels side by side:

`GT | generated with GT actions | generated with zero actions`
PSNR (generated with GT actions vs. GT, 4 samples): 24.6, 34.0, 20.8, 27.5 dB (avg ≈26.7 dB)
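For reference, a minimal PSNR implementation consistent with the numbers above, assuming uint8 frames in [0, 255]. This is an illustrative sketch, not the eval script used to produce the reported values:

```python
import numpy as np

def psnr(gt: np.ndarray, pred: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```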