# DreamZero DK-1 LoRA (r=64, alpha=16, 20k steps)
LoRA fine-tune of DreamZero-AgiBot on the DK-1 merged bimanual robot dataset.
## Model Details
| Parameter | Value |
|---|---|
| Base model | DreamZero-AgiBot (Wan2.1-I2V-14B backbone) |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| LoRA targets | q, k, v, o, ffn.0, ffn.2 |
| LoRA init | Kaiming |
| Action horizon | 24 steps per chunk |
| Action dim | 14 (6+1 left arm/gripper, 6+1 right arm/gripper) |
| Video resolution | 640x352 (3 cameras tiled: top, left_wrist, right_wrist) |
| Num frames | 33 per chunk |
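As a rough illustration of the adapter configuration above (not the actual implementation, which lives in the DreamZero codebase), a rank-64, alpha-16 LoRA adds a low-rank update `(alpha/rank) * B @ A` on top of each frozen target weight; the shapes below are hypothetical:

```python
import numpy as np

# Illustrative sketch only: how one rank-64, alpha-16 LoRA adapter modifies a
# single projection. The real adapters sit on the Wan2.1 DiT layers listed
# above (q, k, v, o, ffn.0, ffn.2); d_model here is a made-up shape.
d_model, rank, alpha = 1024, 64, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_model, d_model)) * 0.02  # frozen base weight
A = rng.standard_normal((rank, d_model)) * 0.02     # down-projection (Kaiming-style init in the real run)
B = np.zeros((d_model, rank))                       # up-projection, zero-init so training starts at the base model

scaling = alpha / rank                # 16 / 64 = 0.25
W_merged = W + scaling * (B @ A)      # effective weight at inference

print(scaling)  # 0.25
```

With `B` zero-initialized, the merged weight equals the base weight at step 0, so fine-tuning departs smoothly from the pretrained checkpoint.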
## Training
| Parameter | Value |
|---|---|
| Steps | 20,000 |
| Training time | ~55.5 hours |
| GPUs | 4x H100 80GB |
| Batch size | 1 per device (4 effective) |
| Learning rate | 1e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Precision | bf16 |
| DeepSpeed | ZeRO Stage 2 |
| Final loss | 0.056 |
| Final action loss | 0.007 |
| Final dynamics loss | 0.052 |
The model was fine-tuned from the pretrained DreamZero-AgiBot checkpoint with LoRA adapters injected into the Wan2.1 DiT transformer layers. The action encoder/decoder heads were trained in full (no LoRA). Training used a cosine LR schedule with 5% linear warmup.
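The schedule described above can be sketched as follows; decay to zero after the cosine phase is an assumption, as the actual trainer may use a nonzero floor:

```python
import math

# Hedged sketch of the training schedule: linear warmup over the first 5% of
# steps, then cosine decay from the peak LR of 1e-5.
TOTAL_STEPS = 20_000
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 1,000 steps
PEAK_LR = 1e-5

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS          # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(0))       # 0.0
print(lr_at(1_000))   # peak: 1e-05
print(lr_at(20_000))  # 0.0 (assumed floor)
```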
W&B run: `dreamzero_dk1_merged_lora`
## Dataset
The DK-1 merged dataset contains 1,674 episodes (1.7M frames) of bimanual robot manipulation across 16 tasks, recorded at 30 FPS with 3 camera views (top, left_wrist, right_wrist) at 640x360 resolution.
Tasks include: fold t-shirt, put ball in cup, grab cap, transfer lego cube, pick up spoon, PCB placement, and various pick-and-place tasks.
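A quick back-of-the-envelope from the figures above (the 1.7M frame count is rounded in this card, so these are approximate):

```python
# Approximate per-episode statistics derived from the dataset summary above.
EPISODES = 1_674
TOTAL_FRAMES = 1_700_000  # rounded figure from the card
FPS = 30

frames_per_episode = TOTAL_FRAMES / EPISODES    # ~1,016 frames
seconds_per_episode = frames_per_episode / FPS  # ~34 s per episode

print(round(frames_per_episode))   # 1016
print(round(seconds_per_episode))  # 34
```

So a typical episode is roughly half a minute long, spanning several 33-frame / 24-action-step chunks.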
## Evaluation Results
### GT-conditioned (4 chunks, 96 action steps)
Each chunk is conditioned on a fresh ground-truth frame. Measures action prediction accuracy.
| Task | MSE | Count |
|---|---|---|
| put the plastic in the cup | 0.013 | 2 |
| remove the pink thing from the box | 0.023 | 2 |
| put the ball in the cup | 0.026 | 11 |
| put the green bag in the box | 0.031 | 4 |
| put the pink thing in the box | 0.037 | 1 |
| put the plastic tube the box | 0.046 | 1 |
| grab the cap | 0.048 | 4 |
| Pick up the spoon | 0.052 | 1 |
| Take a PCB from the box and place it in the testbed | 0.059 | 4 |
| Fold the t-shirt | 0.059 | 22 |
| Transfer the lego cube to the other arm | 0.070 | 11 |
| Overall | 0.050 | 63 |
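The "Overall" row is the count-weighted mean of the per-task MSEs. Recomputing it from the rounded table values lands at ≈0.049, consistent with the reported 0.050 given that each per-task entry is itself rounded:

```python
# Sanity check on the GT-conditioned table: "Overall" is the count-weighted
# mean of per-task MSE. Values are copied from the table above.
per_task = [  # (mse, episode count)
    (0.013, 2), (0.023, 2), (0.026, 11), (0.031, 4), (0.037, 1),
    (0.046, 1), (0.048, 4), (0.052, 1), (0.059, 4), (0.059, 22), (0.070, 11),
]
total = sum(n for _, n in per_task)
overall = sum(m * n for m, n in per_task) / total

print(total)    # 63 episodes
print(overall)  # ~0.049, matching the reported 0.050 up to rounding
```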
### AR rollout (12 chunks, 288 action steps from a single start frame)
Only the first chunk uses a GT frame; subsequent chunks condition on the previous prediction.
| Task | MSE | Count |
|---|---|---|
| put the ball in the cup | 0.200 | 5 |
| put the green bag in the box | 0.258 | 1 |
| Fold the t-shirt | 0.325 | 7 |
| grab the cap | 0.422 | 1 |
| Transfer the Lego Cube to the other arm. | 0.490 | 1 |
| Take a PCB from the box and place it in the testbed | 0.623 | 1 |
| Overall | 0.317 | 16 |
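The autoregressive evaluation loop described above can be sketched schematically; `predict_chunk` is a hypothetical stand-in for the real model call, which also predicts video frames:

```python
# Schematic of the AR rollout protocol: only the first chunk is conditioned on
# a ground-truth frame; every later chunk conditions on the model's own
# predicted frame, so errors can compound (hence the higher MSE above).
CHUNKS = 12
ACTIONS_PER_CHUNK = 24

def predict_chunk(frame):
    """Hypothetical stand-in: returns (action chunk, predicted final frame)."""
    actions = [[0.0] * 14 for _ in range(ACTIONS_PER_CHUNK)]  # 14-dim bimanual actions
    return actions, frame

def ar_rollout(gt_start_frame):
    frame, trajectory = gt_start_frame, []
    for _ in range(CHUNKS):
        actions, frame = predict_chunk(frame)  # re-condition on own prediction
        trajectory.extend(actions)
    return trajectory

traj = ar_rollout(gt_start_frame="frame_0")
print(len(traj))  # 288 = 12 chunks x 24 action steps
```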
## Usage
This checkpoint contains only LoRA adapter weights (~792MB). To use it, you need:
- Wan2.1-I2V-14B-480P — DiT backbone, VAE, CLIP, and T5 encoder
- DreamZero-AgiBot — Pretrained base VLA weights (43GB)
- DreamZero codebase — `groot.vla` model code

The `load_lora` method in `groot.vla.model.dreamzero.base_vla.VLA` handles the full loading sequence: base model weights first, then LoRA injection, then adapter weight loading.
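The three-stage ordering matters: LoRA modules must exist before adapter weights can be loaded into them. A schematic stand-in (the real logic is `VLA.load_lora`; the function and names below are hypothetical):

```python
# Schematic of the loading order only; not the actual groot.vla implementation.
def load_model(base_ckpt: str, lora_ckpt: str) -> dict:
    state = {}
    state["base"] = base_ckpt          # 1. load base DreamZero-AgiBot weights
    state["lora_injected"] = True      # 2. inject LoRA modules into the DiT layers
    assert state["lora_injected"], "adapters need injected modules to load into"
    state["adapters"] = lora_ckpt      # 3. load the ~792MB adapter weights
    return state

model = load_model("dreamzero_agibot.pt", "dk1_lora.safetensors")
print(model["lora_injected"])  # True
```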
## Eval Outputs on HF

- `eval_gt4_v2/` — 63 episodes, GT-conditioned, 4 chunks (videos, action plots, summary)
- `eval_ar12/` — 16 episodes, AR rollout, 12 chunks
- Each episode contains: `video_pred_tiled.mp4`, `video_comparison.mp4`, `actions.png`