PI0 Hanoi Policy Gradient Checkpoint (30k steps)
This is a checkpoint for the PI0 (Physical Intelligence 0) model trained on the Hanoi task using a subtask-based approach with policy gradient methods.
This model was presented in the paper "The Price Is Not Right: Neuro-Symbolic Methods Outperform VLAs on Structured Long-Horizon Manipulation Tasks with Significantly Lower Energy Consumption".
Model Details
- Task: Hanoi Tower puzzle (subtask decomposition)
- Training Steps: 30,000
- Model Type: Policy gradient with subtask learning
- Framework: JAX/Flax
- Dataset: hanoi_300_lerobot
- Architecture: Vision-Language-Action model with subtask decomposition (subtask masking disabled; see below)
Key Features
- Subtask Learning: Decomposes Hanoi puzzle into manageable subtasks
- No Masking: Trained without subtask masking for better generalization across subtask boundaries
- Policy Gradient: Uses direct policy gradient optimization
- End-to-End: Learns from visual observations to actions
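The subtask decomposition can be illustrated with the standard recursive Hanoi solver, in which each disk move becomes one pick-and-place subtask. This is a minimal sketch for intuition only; the function name and the exact subtask phrasing are illustrative assumptions, not the decomposition used in training:

```python
def hanoi_subtasks(n, src="A", dst="C", aux="B"):
    """Decompose an n-disk Hanoi puzzle into an ordered list of
    pick-and-place subtasks, one per disk move (illustrative only)."""
    if n == 0:
        return []
    return (hanoi_subtasks(n - 1, src, aux, dst)      # clear the top n-1 disks
            + [f"move disk {n} from {src} to {dst}"]  # move the largest disk
            + hanoi_subtasks(n - 1, aux, dst, src))   # restack on top of it

# A 3-disk puzzle yields the optimal 2**3 - 1 = 7 subtasks.
subtasks = hanoi_subtasks(3)
```

Each such subtask corresponds to one pick-and-place segment of the demonstration trajectories the policy is trained on.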
Checkpoint Structure
- params/: Model parameters
- train_state/: Training state
- assets/: Additional assets, including normalization statistics
- _CHECKPOINT_METADATA: Checkpoint metadata
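A quick way to sanity-check a downloaded checkpoint is to verify this directory layout before handing it to the loading code. The helper below is a hypothetical stdlib-only sketch based on the structure listed above (actual loading is done through the openpi framework):

```python
from pathlib import Path

# Expected top-level entries, taken from the checkpoint structure above.
EXPECTED = ["params", "train_state", "assets", "_CHECKPOINT_METADATA"]

def missing_checkpoint_entries(ckpt_dir):
    """Return the expected checkpoint entries absent from ckpt_dir.

    An empty list means the layout matches; a non-empty list usually
    indicates an incomplete download or a wrong directory.
    """
    root = Path(ckpt_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```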
Usage
This checkpoint can be loaded and evaluated using the openpi framework. Detailed instructions for running experiments in Robosuite via Docker can be found in the official GitHub repository.
Training Configuration
- Dataset: hanoi_300_lerobot
- Training approach: Subtask-based policy gradient
- Subtask masking: Disabled (no_masking)
- Total training steps: 30,000
- Model: PI0 with subtask decomposition
Technical Details
- Subtask Approach: Decomposes Hanoi puzzle into logical subtasks
- No Masking: Allows the model to learn transitions across subtask boundaries
- Policy Gradient: Direct policy optimization without value function
- Vision Input: Processes visual observations from robot cameras (agentview and eye-in-hand)
- Action Output: Generates robot manipulation actions
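The difference between masked and unmasked (no_masking) training can be sketched with a per-timestep loss: with a subtask mask, timesteps outside the current subtask are dropped from the loss, whereas with masking disabled every timestep contributes, so the model also sees the transitions between subtasks. This is an illustrative NumPy sketch with an MSE loss, not the exact training objective:

```python
import numpy as np

def per_step_loss(pred, target, subtask_mask=None):
    """Mean-squared action loss over a trajectory of shape (T, action_dim).

    subtask_mask: optional (T,) 0/1 array selecting timesteps inside the
    current subtask. With subtask_mask=None (the no_masking setting used
    for this checkpoint), all timesteps contribute, including those at
    subtask boundaries. (Illustrative sketch only.)
    """
    per_step = np.mean((pred - target) ** 2, axis=-1)  # (T,)
    if subtask_mask is None:
        return per_step.mean()
    m = subtask_mask.astype(per_step.dtype)
    return (per_step * m).sum() / np.maximum(m.sum(), 1.0)
```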