Multi-Task DiT Policy: Coffee Capsules v1
Diffusion Transformer (DiT) policy trained on villekuosmanen/bin_pick_pack_coffee_capsules for robotic bin-picking.
Training Details
| Parameter | Value |
|---|---|
| Architecture | DiT with CLIP ViT-B/16 vision encoder + CLIP text conditioning |
| Dataset | 47,865 samples, 200 episodes |
| State/Action dim | 17D: joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1) |
| Delta actions | All dims except 6D rotation (absolute) |
| Normalization | Ramen (q02/q98 percentile, per-timestep, per-dim, clipped [-1.5, 1.5]); 6D rotation exempt |
| Batch size | 64 |
| Training steps | 100,000 |
| Mixed precision | AMP (fp16) |
| Optimizer | AdamW, grad clip 1.0 |
| Hardware | NVIDIA GH200 (Isambard-AI) |
| Training time | ~13.5 hours |
| Final loss | 0.0056 |
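The normalization row can be illustrated with a minimal NumPy sketch. It is not the repository's Ramen implementation. It assumes that the q02/q98 percentiles are mapped linearly to [-1, 1] before clipping, that action chunks are shaped `(N, T, D)`, and that the 6D rotation occupies indices 10 to 15 of the 17D vector (following the order listed in the table). The function names `fit_percentile_stats` and `normalize` are hypothetical.

```python
import numpy as np

# Assumed index range of rot6d in the 17D vector:
# joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1).
ROT6D = slice(10, 16)


def fit_percentile_stats(actions, lo=2.0, hi=98.0):
    """Compute per-timestep, per-dim percentiles.

    actions: array of shape (N, T, D) holding N action chunks.
    Returns (q_lo, q_hi), each of shape (T, D).
    """
    q_lo = np.percentile(actions, lo, axis=0)
    q_hi = np.percentile(actions, hi, axis=0)
    return q_lo, q_hi


def normalize(actions, q_lo, q_hi, clip=1.5, eps=1e-8):
    """Map [q_lo, q_hi] to [-1, 1] (assumed mapping), then clip to [-clip, clip].

    The 6D rotation dims are copied through unchanged.
    """
    scale = np.maximum(q_hi - q_lo, eps)  # guard against constant dims
    out = 2.0 * (actions - q_lo) / scale - 1.0
    out = np.clip(out, -clip, clip)
    out[..., ROT6D] = actions[..., ROT6D]  # rotation is exempt
    return out
```

Clipping at 1.5 rather than 1.0 leaves some headroom for values beyond the 2nd and 98th percentiles while still bounding outliers.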
Checkpoints
| Checkpoint | Steps | sha256 (model.safetensors) |
|---|---|---|
| checkpoint_25000 | 25k | 1fd5a94b...f6d860 |
| checkpoint_50000 | 50k | fe248b44...a5eec2 |
| checkpoint_75000 | 75k | 880c9654...aeb906 |
| checkpoint_90000 | 90k | a0911055...1e1fa8 |
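A downloaded `model.safetensors` can be checked against the table with a short script like the one below. Because the digests above are truncated, the sketch compares only the visible prefix and suffix, so a match is a sanity check and not a full verification. The helper names `sha256_of_file` and `matches_truncated` are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_truncated(digest, truncated):
    """Compare a full hex digest with a 'prefix...suffix' string.

    If the string contains no '...', fall back to an exact comparison.
    """
    prefix, sep, suffix = truncated.partition("...")
    if not sep:
        return digest == truncated
    return digest.startswith(prefix) and digest.endswith(suffix)
```

For example, `matches_truncated(sha256_of_file(local_path), "a0911055...1e1fa8")` would check the 90k checkpoint, where `local_path` points at wherever the file was downloaded.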
Each checkpoint contains:
- model.safetensors: model weights (952 MB)
- config.json: model configuration
- ramen_stats.pt: normalization statistics (required for inference)
Task
Pick a single coffee capsule from the cardboard tray and drop it inside the brown cardboard container holding a plastic bag.
W&B
Training logs: wandb.ai/pravsels/multitask-dit-policy/runs/g024s3gd
Usage
from multitask_dit_policy.model import MultiTaskDiTPolicy
policy = MultiTaskDiTPolicy.load("pravsels/multitask-dit-coffee-capsules-v1/checkpoint_90000")
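To interpret a 17D state or action vector, the layout from the Training Details table can be unpacked as in the sketch below. Several points are assumptions and not confirmed by this card: the components are taken to be stored in the order listed in the table, the 6D rotation is taken to hold the first two columns of the rotation matrix (a common convention, recovered here by Gram-Schmidt), and deltas are taken to be relative to the current state. All function names are hypothetical and are not part of `multitask_dit_policy`.

```python
import numpy as np

ROT6D = slice(10, 16)  # assumed position of rot6d in the 17D vector


def split_action(vec):
    """Split a (..., 17) vector into named parts (assumed ordering)."""
    vec = np.asarray(vec, dtype=np.float64)
    if vec.shape[-1] != 17:
        raise ValueError("expected a 17D vector")
    return {
        "joint_pos": vec[..., 0:7],
        "eef_xyz": vec[..., 7:10],
        "rot6d": vec[..., 10:16],
        "gripper": vec[..., 16:17],
    }


def rot6d_to_matrix(rot6d):
    """Convert a (..., 6) rotation to (..., 3, 3) via Gram-Schmidt.

    Assumes the 6 values are the first two columns of the rotation matrix.
    """
    rot6d = np.asarray(rot6d, dtype=np.float64)
    a1, a2 = rot6d[..., 0:3], rot6d[..., 3:6]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 as columns


def apply_action(state, action):
    """Add delta dims to the state; take the rotation from the action as-is."""
    state = np.asarray(state, dtype=np.float64)
    action = np.asarray(action, dtype=np.float64)
    nxt = state + action
    nxt[..., ROT6D] = action[..., ROT6D]  # rotation is absolute, not a delta
    return nxt
```

Any such post-processing would apply to un-normalized actions, that is, after the statistics in `ramen_stats.pt` have been used to undo the normalization.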