DreamZero DK-1 LoRA (r=64, alpha=16, 20k steps)

LoRA fine-tune of DreamZero-AgiBot on the DK-1 merged bimanual robot dataset.

Model Details

| Parameter | Value |
|---|---|
| Base model | DreamZero-AgiBot (Wan2.1-I2V-14B backbone) |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| LoRA targets | q, k, v, o, ffn.0, ffn.2 |
| LoRA init | Kaiming |
| Action horizon | 24 steps per chunk |
| Action dim | 14 (6+1 left arm/gripper, 6+1 right arm/gripper) |
| Video resolution | 640x352 (3 cameras tiled: top, left_wrist, right_wrist) |
| Frames per chunk | 33 |
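As a concrete reference for what the rank/alpha values above mean, here is a minimal pure-Python sketch of the LoRA parametrization (toy shapes for illustration; the card's actual rank is 64 and alpha is 16, giving an effective scale of 16/64 = 0.25):

```python
# Minimal LoRA sketch (no framework): the adapted weight is
#   W' = W + (alpha / r) * B @ A,
# with A (r x d_in) given a Kaiming-style init and B (d_out x r) initialized
# to zero, so training starts exactly from the base model.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_delta(A, B, alpha, r):
    s = alpha / r  # with the card's values, 16 / 64 = 0.25
    return [[s * v for v in row] for row in matmul(B, A)]

# Toy shapes: d_out=2, r=2 (the card uses r=64), d_in=3.
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[0.0, 0.0],  # B starts at zero -> the delta is zero at step 0
     [0.0, 0.0]]
assert lora_delta(A, B, alpha=16, r=64) == [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```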

Training

| Parameter | Value |
|---|---|
| Steps | 20,000 |
| Training time | ~55.5 hours |
| GPUs | 4x H100 80GB |
| Batch size | 1 per device (4 effective) |
| Learning rate | 1e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Precision | bf16 |
| DeepSpeed | ZeRO Stage 2 |
| Final loss | 0.056 |
| Final action loss | 0.007 |
| Final dynamics loss | 0.052 |
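A quick sanity check on the throughput implied by the table (plain arithmetic, using only the values above):

```python
# Throughput implied by the training table.
steps, gpus, per_device_bs = 20_000, 4, 1
effective_bs = gpus * per_device_bs   # 4 samples per optimizer step
samples_seen = steps * effective_bs   # 80,000 chunk samples total
sec_per_step = 55.5 * 3600 / steps    # ~10 s per optimizer step
assert effective_bs == 4 and samples_seen == 80_000
```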

The model was fine-tuned from the pretrained DreamZero-AgiBot checkpoint with LoRA adapters injected into the Wan2.1 DiT transformer layers. The action encoder/decoder heads were fully trained. Training used a cosine LR schedule with 5% linear warmup.
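The schedule can be sketched as follows. This is a generic cosine-with-warmup curve matching the card's numbers (5% linear warmup, peak 1e-5); the decay-to-zero floor is an assumption, and the training code's exact implementation may differ:

```python
import math

def lr_at(step, total_steps=20_000, base_lr=1e-5, warmup_frac=0.05):
    """Cosine decay with linear warmup: ramps 0 -> base_lr over the first
    5% of steps, then follows a half-cosine from base_lr down to 0."""
    warmup = int(total_steps * warmup_frac)  # 1,000 steps here
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

For example, `lr_at(500)` is mid-warmup (half the peak), `lr_at(1000)` is the peak 1e-5, and `lr_at(20_000)` has decayed to ~0.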

W&B run: dreamzero_dk1_merged_lora

Dataset

The DK-1 merged dataset contains 1,674 episodes (1.7M frames) of bimanual robot manipulation across 16 tasks, recorded at 30 FPS with 3 camera views (top, left_wrist, right_wrist) at 640x360 resolution.

Tasks include: fold t-shirt, put ball in cup, grab cap, transfer lego cube, pick up spoon, PCB placement, and various pick-and-place tasks.

Evaluation Results

GT-conditioned (4 chunks, 96 action steps)

Each chunk is conditioned on a fresh ground-truth frame. Measures action prediction accuracy.

| Task | MSE | Episodes |
|---|---|---|
| put the plastic in the cup | 0.013 | 2 |
| remove the pink thing from the box | 0.023 | 2 |
| put the ball in the cup | 0.026 | 11 |
| put the green bag in the box | 0.031 | 4 |
| put the pink thing in the box | 0.037 | 1 |
| put the plastic tube in the box | 0.046 | 1 |
| grab the cap | 0.048 | 4 |
| Pick up the spoon | 0.052 | 1 |
| Take a PCB from the box and place it in the testbed | 0.059 | 4 |
| Fold the t-shirt | 0.059 | 22 |
| Transfer the lego cube to the other arm | 0.070 | 11 |
| **Overall** | **0.050** | **63** |

AR rollout (12 chunks, 288 action steps from a single start frame)

Only the first chunk uses a GT frame; subsequent chunks condition on the previous prediction.
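The two conditioning modes can be sketched as follows. `model.predict_chunk` is a hypothetical interface that maps a conditioning frame to (predicted action chunk, predicted final frame); names and signatures are illustrative, not the `groot.vla` API:

```python
# GT-conditioned: every chunk is re-grounded on a fresh ground-truth frame,
# so errors cannot compound across chunks.
def gt_conditioned_eval(model, gt_frames, num_chunks=4):
    actions = []
    for k in range(num_chunks):
        chunk_actions, _ = model.predict_chunk(gt_frames[k])
        actions.extend(chunk_actions)  # 24 steps/chunk -> 96 total
    return actions

# AR rollout: only chunk 0 sees a GT frame; each later chunk conditions on
# the model's own predicted frame, so errors accumulate.
def ar_rollout_eval(model, start_frame, num_chunks=12):
    frame, actions = start_frame, []
    for _ in range(num_chunks):
        chunk_actions, frame = model.predict_chunk(frame)  # feed prediction back
        actions.extend(chunk_actions)  # 24 steps/chunk -> 288 total
    return actions
```

The gap between the two tables (0.050 vs 0.317 overall MSE) is exactly this compounding effect.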

| Task | MSE | Episodes |
|---|---|---|
| put the ball in the cup | 0.200 | 5 |
| put the green bag in the box | 0.258 | 1 |
| Fold the t-shirt | 0.325 | 7 |
| grab the cap | 0.422 | 1 |
| Transfer the Lego Cube to the other arm. | 0.490 | 1 |
| Take a PCB from the box and place it in the testbed | 0.623 | 1 |
| **Overall** | **0.317** | **16** |
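Both Overall rows are episode-count-weighted means of the per-task MSEs; reconstructing them from the tables (agreement is up to rounding of the per-task entries):

```python
def overall_mse(rows):
    """Episode-count-weighted mean of per-task MSEs (rows of (mse, count))."""
    return sum(mse * n for mse, n in rows) / sum(n for _, n in rows)

# Per-task rows copied from the GT-conditioned and AR-rollout tables.
gt_rows = [(0.013, 2), (0.023, 2), (0.026, 11), (0.031, 4), (0.037, 1),
           (0.046, 1), (0.048, 4), (0.052, 1), (0.059, 4), (0.059, 22),
           (0.070, 11)]
ar_rows = [(0.200, 5), (0.258, 1), (0.325, 7), (0.422, 1), (0.490, 1),
           (0.623, 1)]
assert abs(overall_mse(gt_rows) - 0.050) < 0.002  # table reports 0.050
assert abs(overall_mse(ar_rows) - 0.317) < 0.001  # table reports 0.317
```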

Usage

This checkpoint contains only LoRA adapter weights (~792MB). To use it, you need:

  1. Wan2.1-I2V-14B-480P — DiT backbone, VAE, CLIP, and T5 encoder
  2. DreamZero-AgiBot — Pretrained base VLA weights (43GB)
  3. DreamZero codebase — `groot.vla` model code

The `load_lora` method in `groot.vla.model.dreamzero.base_vla.VLA` handles the full loading sequence: base model weights first, then LoRA injection, then adapter weight loading.
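The ordering matters: adapters must be injected into the base model before the trained adapter weights can land. A toy illustration of that sequence using plain dicts in place of real checkpoints (this is not the `groot.vla` implementation, only the described order of operations):

```python
# Sketch of the load sequence: base weights -> LoRA injection -> adapter weights.
def load_lora(base_state, adapter_state, targets):
    model = dict(base_state)            # 1. load base model weights
    for name in targets:                # 2. inject fresh (untrained) adapters
        model[f"{name}.lora_A"] = "init"
        model[f"{name}.lora_B"] = "zeros"
    model.update(adapter_state)         # 3. trained adapter weights overwrite
    return model
```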

Eval Outputs on HF

  • eval_gt4_v2/ — 63 episodes, GT-conditioned, 4 chunks (videos, action plots, summary)
  • eval_ar12/ — 16 episodes, AR rollout, 12 chunks
  • Each episode contains: video_pred_tiled.mp4, video_comparison.mp4, actions.png