Multi-Task DiT Policy: Coffee Capsules v1

Diffusion Transformer (DiT) policy trained on villekuosmanen/bin_pick_pack_coffee_capsules for robotic bin-picking.

Training Details

| Parameter | Value |
| --- | --- |
| Architecture | DiT with CLIP ViT-B/16 vision encoder + CLIP text conditioning |
| Dataset | 47,865 samples, 200 episodes |
| State/Action dim | 17D: joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1) |
| Delta actions | All dims except 6D rotation (absolute) |
| Normalization | Ramen (q02/q98 percentile, per-timestep, per-dim, clipped to [-1.5, 1.5]); 6D rotation exempt |
| Batch size | 64 |
| Training steps | 100,000 |
| Mixed precision | AMP (fp16) |
| Optimizer | AdamW, grad clip 1.0 |
| Hardware | NVIDIA GH200 (Isambard-AI) |
| Training time | ~13.5 hours |
| Final loss | 0.0056 |
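
The state/action layout, delta-action rule, and percentile normalization listed above can be illustrated with a small NumPy sketch. This is not the training code. The index order (taken from the order in which the components are listed), the choice of the current state as the delta reference, and the linear mapping of [q02, q98] onto [-1, 1] are all assumptions made for illustration, and every name below is hypothetical.

```python
import numpy as np

# Assumed index layout, following the order given in the table.
JOINT_POS = slice(0, 7)
EEF_XYZ = slice(7, 10)
ROT6D = slice(10, 16)
GRIPPER = slice(16, 17)


def to_delta_actions(actions, state):
    """Convert absolute actions (T, 17) to deltas against a state (17,).

    Assumption: deltas are taken relative to the current state.
    The 6D rotation dims are left absolute.
    """
    delta = actions - state  # broadcasts over the T timesteps
    delta[:, ROT6D] = actions[:, ROT6D]
    return delta


def fit_stats(chunks):
    """Per-timestep, per-dim 2nd/98th percentiles over chunks (N, T, 17)."""
    q02 = np.percentile(chunks, 2, axis=0)
    q98 = np.percentile(chunks, 98, axis=0)
    return q02, q98


def normalize(chunk, q02, q98, eps=1e-8):
    """Map [q02, q98] to [-1, 1] (assumed mapping), then clip to [-1.5, 1.5].

    The 6D rotation dims are exempt and passed through unchanged.
    """
    scaled = 2.0 * (chunk - q02) / np.maximum(q98 - q02, eps) - 1.0
    out = np.clip(scaled, -1.5, 1.5)
    out[..., ROT6D] = chunk[..., ROT6D]
    return out
```

At inference time the stored statistics would be applied in reverse to map model outputs back to physical units, which is presumably why the statistics file ships with each checkpoint.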

Checkpoints

| Checkpoint | Steps | sha256 (model.safetensors) |
| --- | --- | --- |
| checkpoint_25000 | 25k | 1fd5a94b...f6d860 |
| checkpoint_50000 | 50k | fe248b44...a5eec2 |
| checkpoint_75000 | 75k | 880c9654...aeb906 |
| checkpoint_90000 | 90k | a0911055...1e1fa8 |

Each checkpoint contains:

  • model.safetensors: model weights (952 MB)
  • config.json: model configuration
  • ramen_stats.pt: normalization statistics (required for inference)
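
The sha256 values in the checkpoint table are abbreviated, so a downloaded model.safetensors can only be checked against the visible prefix and suffix. The following is a minimal sketch of such a check using the standard library; the helper names are hypothetical, and a prefix/suffix match is weaker than comparing the full digest.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weights need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_truncated(full_hex, truncated):
    """Compare a full hex digest against a 'prefix...suffix' abbreviation."""
    prefix, _, suffix = truncated.partition("...")
    return full_hex.startswith(prefix) and full_hex.endswith(suffix)
```

For example, `matches_truncated(sha256_of_file("model.safetensors"), "a0911055...1e1fa8")` would check a local copy of the checkpoint_90000 weights against the table entry.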

Task

Pick a single coffee capsule from the cardboard tray and drop it inside the brown cardboard container holding a plastic bag.

W&B

Training logs: wandb.ai/pravsels/multitask-dit-policy/runs/g024s3gd

Usage

```python
from multitask_dit_policy.model import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.load("pravsels/multitask-dit-coffee-capsules-v1/checkpoint_90000")
```
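
If the checkpoint files are needed locally first, one option is to fetch a single checkpoint folder with `huggingface_hub.snapshot_download`. The sketch below assumes the repository stores each checkpoint in a subfolder named after it, as the load path above suggests. The helper names are hypothetical, and whether `MultiTaskDiTPolicy.load` accepts a local directory is not confirmed here.

```python
from pathlib import Path

REPO_ID = "pravsels/multitask-dit-coffee-capsules-v1"
# Files listed under "Each checkpoint contains".
CHECKPOINT_FILES = ("model.safetensors", "config.json", "ramen_stats.pt")


def checkpoint_patterns(checkpoint):
    """Build repo-relative paths for one checkpoint (assumed subfolder layout)."""
    return [f"{checkpoint}/{name}" for name in CHECKPOINT_FILES]


def fetch_checkpoint(checkpoint="checkpoint_90000"):
    """Download only the files of one checkpoint and return its local folder."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(checkpoint),
    )
    return Path(root) / checkpoint
```

Restricting the download with `allow_patterns` avoids pulling all four checkpoints, each of which carries roughly 952 MB of weights.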