DreamZero DK-1 LoRA (r=64, alpha=16, 20k steps)

LoRA fine-tune of DreamZero-AgiBot on the DK-1 merged bimanual robot dataset.

Model Details

| Parameter | Value |
|---|---|
| Base model | DreamZero-AgiBot (Wan2.1-I2V-14B backbone) |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| LoRA targets | q, k, v, o, ffn.0, ffn.2 |
| LoRA init | Kaiming |
| Action horizon | 24 steps per chunk |
| Action dim | 14 (6+1 left arm/gripper, 6+1 right arm/gripper) |
| Video resolution | 640x352 (3 cameras tiled: top, left_wrist, right_wrist) |
| Frames per chunk | 33 |
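As a concrete reference for what the rank/alpha values above mean, here is a minimal pure-Python sketch of the LoRA parametrization (toy shapes for illustration; the card's actual rank is 64 and alpha is 16, giving an effective scale of 16/64 = 0.25):

```python
# Minimal LoRA sketch (no framework): the adapted weight is
#   W' = W + (alpha / r) * B @ A,
# with A (r x d_in) given a Kaiming-style init and B (d_out x r) initialized
# to zero, so training starts exactly from the base model.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_delta(A, B, alpha, r):
    s = alpha / r  # with the card's values, 16 / 64 = 0.25
    return [[s * v for v in row] for row in matmul(B, A)]

# Toy shapes: d_out=2, r=2 (the card uses r=64), d_in=3.
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[0.0, 0.0],  # B starts at zero -> the delta is zero at step 0
     [0.0, 0.0]]
assert lora_delta(A, B, alpha=16, r=64) == [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```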

Training

| Parameter | Value |
|---|---|
| Steps | 20,000 |
| Training time | ~55.5 hours |
| GPUs | 4x H100 80GB |
| Batch size | 1 per device (4 effective) |
| Learning rate | 1e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Precision | bf16 |
| DeepSpeed | ZeRO Stage 2 |
| Final loss | 0.056 |
| Final action loss | 0.007 |
| Final dynamics loss | 0.052 |
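A quick sanity check on the throughput implied by the table (plain arithmetic, using only the values above):

```python
# Throughput implied by the training table.
steps, gpus, per_device_bs = 20_000, 4, 1
effective_bs = gpus * per_device_bs   # 4 samples per optimizer step
samples_seen = steps * effective_bs   # 80,000 chunk samples total
sec_per_step = 55.5 * 3600 / steps    # ~10 s per optimizer step
assert effective_bs == 4 and samples_seen == 80_000
```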

The model was fine-tuned from the pretrained DreamZero-AgiBot checkpoint with LoRA adapters injected into the Wan2.1 DiT transformer layers. The action encoder/decoder heads were fully trained. Training used a cosine LR schedule with 5% linear warmup.
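The schedule can be sketched as follows. This is a generic cosine-with-warmup curve matching the card's numbers (5% linear warmup, peak 1e-5); the decay-to-zero floor is an assumption, and the training code's exact implementation may differ:

```python
import math

def lr_at(step, total_steps=20_000, base_lr=1e-5, warmup_frac=0.05):
    """Cosine decay with linear warmup: ramps 0 -> base_lr over the first
    5% of steps, then follows a half-cosine from base_lr down to 0."""
    warmup = int(total_steps * warmup_frac)  # 1,000 steps here
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

For example, `lr_at(500)` is mid-warmup (half the peak), `lr_at(1000)` is the peak 1e-5, and `lr_at(20_000)` has decayed to ~0.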

W&B run: dreamzero_dk1_merged_lora

Dataset

The DK-1 merged dataset contains 1,674 episodes (1.7M frames) of bimanual robot manipulation across 16 tasks, recorded at 30 FPS with 3 camera views (top, left_wrist, right_wrist) at 640x360 resolution.

Tasks include: fold t-shirt, put ball in cup, grab cap, transfer lego cube, pick up spoon, PCB placement, and various pick-and-place tasks.

Evaluation Results

GT-conditioned (4 chunks, 96 action steps)

Each chunk is conditioned on a fresh ground-truth frame. Measures action prediction accuracy.

| Task | MSE | Episodes |
|---|---|---|
| put the plastic in the cup | 0.013 | 2 |
| remove the pink thing from the box | 0.023 | 2 |
| put the ball in the cup | 0.026 | 11 |
| put the green bag in the box | 0.031 | 4 |
| put the pink thing in the box | 0.037 | 1 |
| put the plastic tube in the box | 0.046 | 1 |
| grab the cap | 0.048 | 4 |
| Pick up the spoon | 0.052 | 1 |
| Take a PCB from the box and place it in the testbed | 0.059 | 4 |
| Fold the t-shirt | 0.059 | 22 |
| Transfer the lego cube to the other arm | 0.070 | 11 |
| **Overall** | **0.050** | **63** |

AR rollout (12 chunks, 288 action steps from a single start frame)

Only the first chunk uses a GT frame; subsequent chunks condition on the previous prediction.
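The two conditioning modes can be sketched as follows. `model.predict_chunk` is a hypothetical interface that maps a conditioning frame to (predicted action chunk, predicted final frame); names and signatures are illustrative, not the `groot.vla` API:

```python
# GT-conditioned: every chunk is re-grounded on a fresh ground-truth frame,
# so errors cannot compound across chunks.
def gt_conditioned_eval(model, gt_frames, num_chunks=4):
    actions = []
    for k in range(num_chunks):
        chunk_actions, _ = model.predict_chunk(gt_frames[k])
        actions.extend(chunk_actions)  # 24 steps/chunk -> 96 total
    return actions

# AR rollout: only chunk 0 sees a GT frame; each later chunk conditions on
# the model's own predicted frame, so errors accumulate.
def ar_rollout_eval(model, start_frame, num_chunks=12):
    frame, actions = start_frame, []
    for _ in range(num_chunks):
        chunk_actions, frame = model.predict_chunk(frame)  # feed prediction back
        actions.extend(chunk_actions)  # 24 steps/chunk -> 288 total
    return actions
```

The gap between the two tables (0.050 vs 0.317 overall MSE) is exactly this compounding effect.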

| Task | MSE | Episodes |
|---|---|---|
| put the ball in the cup | 0.200 | 5 |
| put the green bag in the box | 0.258 | 1 |
| Fold the t-shirt | 0.325 | 7 |
| grab the cap | 0.422 | 1 |
| Transfer the Lego Cube to the other arm. | 0.490 | 1 |
| Take a PCB from the box and place it in the testbed | 0.623 | 1 |
| **Overall** | **0.317** | **16** |
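Both Overall rows are episode-count-weighted means of the per-task MSEs; reconstructing them from the tables (agreement is up to rounding of the per-task entries):

```python
def overall_mse(rows):
    """Episode-count-weighted mean of per-task MSEs (rows of (mse, count))."""
    return sum(mse * n for mse, n in rows) / sum(n for _, n in rows)

# Per-task rows copied from the GT-conditioned and AR-rollout tables.
gt_rows = [(0.013, 2), (0.023, 2), (0.026, 11), (0.031, 4), (0.037, 1),
           (0.046, 1), (0.048, 4), (0.052, 1), (0.059, 4), (0.059, 22),
           (0.070, 11)]
ar_rows = [(0.200, 5), (0.258, 1), (0.325, 7), (0.422, 1), (0.490, 1),
           (0.623, 1)]
assert abs(overall_mse(gt_rows) - 0.050) < 0.002  # table reports 0.050
assert abs(overall_mse(ar_rows) - 0.317) < 0.001  # table reports 0.317
```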

Usage

This checkpoint contains only LoRA adapter weights (~792MB). To use it, you need:

  1. Wan2.1-I2V-14B-480P — DiT backbone, VAE, CLIP, and T5 encoder
  2. DreamZero-AgiBot — Pretrained base VLA weights (43GB)
  3. DreamZero codebase — `groot.vla` model code

The `load_lora` method in `groot.vla.model.dreamzero.base_vla.VLA` handles the full loading sequence: base model weights first, then LoRA injection, then adapter weight loading.
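The ordering matters: adapters must be injected into the base model before the trained adapter weights can land. A toy illustration of that sequence using plain dicts in place of real checkpoints (this is not the `groot.vla` implementation, only the described order of operations):

```python
# Sketch of the load sequence: base weights -> LoRA injection -> adapter weights.
def load_lora(base_state, adapter_state, targets):
    model = dict(base_state)            # 1. load base model weights
    for name in targets:                # 2. inject fresh (untrained) adapters
        model[f"{name}.lora_A"] = "init"
        model[f"{name}.lora_B"] = "zeros"
    model.update(adapter_state)         # 3. trained adapter weights overwrite
    return model
```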

Eval Outputs on HF

  • eval_gt4_v2/ — 63 episodes, GT-conditioned, 4 chunks (videos, action plots, summary)
  • eval_ar12/ — 16 episodes, AR rollout, 12 chunks
  • Each episode contains: video_pred_tiled.mp4, video_comparison.mp4, actions.png