SFT-Robo2: beat_block_hammer

OpenVLA-OFT SFT checkpoint for the beat_block_hammer task from RoboTwin 2.0, trained to match the SimpleVLA-RL paper (arXiv:2509.09674) settings.

Model Details

Base model: openvla/openvla-7b
Fine-tuning method: LoRA (rank 32) on LLaMA2-7B backbone
Training objective: Cross-entropy on discrete action tokens (NOT L1 regression)
Architecture: Single-view image + proprioception + language instruction -> one chunk of 25 actions x 14-D (bimanual ALOHA)
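
To make the discrete-token objective concrete, here is a minimal sketch of how one normalized action chunk could be turned into classification targets. It assumes 256 uniform bins over [-1, 1] and one token per action dimension; the bin count and the helper names discretize / undiscretize are illustrative assumptions, not values or functions taken from this checkpoint, and the mapping from bin index to vocabulary id is omitted.

```python
N_BINS = 256            # assumption: bin count is not stated in this card
NUM_ACTIONS_CHUNK = 25  # from the training config
ACTION_DIM = 14         # from the training config


def discretize(value, n_bins=N_BINS):
    """Map a normalized action value in [-1, 1] to a bin index in [0, n_bins - 1]."""
    clipped = max(-1.0, min(1.0, value))
    idx = int((clipped + 1.0) / 2.0 * n_bins)
    return min(idx, n_bins - 1)  # value == 1.0 falls into the last bin


def undiscretize(idx, n_bins=N_BINS):
    """Return the center of bin idx, i.e. the continuous value the token decodes to."""
    width = 2.0 / n_bins
    return -1.0 + (idx + 0.5) * width


# A dummy chunk of 25 actions x 14 dimensions, already normalized to [-1, 1].
chunk = [[0.0] * ACTION_DIM for _ in range(NUM_ACTIONS_CHUNK)]

# Flatten to one target token per action dimension; cross-entropy is taken
# over these targets instead of regressing the continuous values.
tokens = [discretize(v) for step in chunk for v in step]
print(len(tokens))
```

Under these assumptions each chunk corresponds to 25 x 14 = 350 target tokens, and decoding a token recovers the original value to within half a bin width.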

Training Config

Parameter	Value
max_steps	10,000
batch_size	8 per GPU x 2 GPUs = 16 global
learning_rate	5e-4
lora_rank	32
num_images_in_input	1 (head camera only)
use_l1_regression	False
use_film	False
use_proprio	True
use_diffusion	False
image_aug	True
NUM_ACTIONS_CHUNK	25
ACTION_DIM	14
ACTION_PROPRIO_NORMALIZATION_TYPE	bounds
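
The bounds setting is commonly understood as min/max scaling of each action and proprioception dimension to [-1, 1]. The sketch below illustrates that reading; the function names, the eps term, and the per-dimension limits are illustrative assumptions, and the real limits would come from the dataset statistics stored with the checkpoint.

```python
ACTION_DIM = 14


def normalize_bounds(x, lo, hi, eps=1e-8):
    """Scale each dimension from [lo, hi] to [-1, 1]."""
    return [2.0 * (xi - l) / (h - l + eps) - 1.0 for xi, l, h in zip(x, lo, hi)]


def unnormalize_bounds(y, lo, hi, eps=1e-8):
    """Invert normalize_bounds: map [-1, 1] back to [lo, hi]."""
    return [0.5 * (yi + 1.0) * (h - l + eps) + l for yi, l, h in zip(y, lo, hi)]


# Hypothetical per-dimension limits, for illustration only.
lo = [-1.5] * ACTION_DIM
hi = [1.5] * ACTION_DIM

raw_action = [0.75] * ACTION_DIM
normalized = normalize_bounds(raw_action, lo, hi)
restored = unnormalize_bounds(normalized, lo, hi)
print(normalized[0], restored[0])
```

Predicted actions have to be un-normalized with the same statistics that were used in training before they are sent to the robot.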

Training Data

1000 expert demonstrations from RoboTwin 2.0 beat_block_hammer task
Collected with the cuRobo motion planner under full domain randomization
950 train / 50 val split (5% validation)
Expert success rate: ~70% (only successful trajectories were kept in the dataset)

Evaluation Results

Metric	Value
Success rate (seed 0, 100 episodes)	45.0%
Success rate (seed 1, partial run, 16 episodes)	50.0%
Paper SFT baseline (Table 4)	28.1%
Prior Phase 1 reproduction	35.2%

Evaluated with greedy decoding (do_sample=False) on 100 held-out scenarios using the demo_randomized config.
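
Because the two runs differ a lot in size, it can help to look at the uncertainty around each success rate. The sketch below computes an approximate 95% Wilson score interval from the counts implied by the table (45 of 100 and 8 of 16). It treats episodes as independent trials, which is an assumption, and the helper name wilson_interval is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


seed0 = wilson_interval(45, 100)  # 45.0% of 100 episodes
seed1 = wilson_interval(8, 16)    # 50.0% of 16 episodes

for name, (low, high) in (("seed 0", seed0), ("seed 1", seed1)):
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

The interval for the 16-episode run is much wider than the one for the full 100-episode run, so the seed 1 figure is best read as a rough sanity check rather than a separate measurement.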

Compatibility

This checkpoint is compatible with:

SimpleVLA-RL RL training script (run_openvla_oft_rl_twin2.sh)
RoboTwin 2.0 evaluation suite (policy/openvla-oft/eval.sh)

Important: This checkpoint predicts discrete action tokens trained with cross-entropy (LLaMA2 head), NOT L1 regression with an MLP action head. Make sure your inference code sets use_l1_regression=False and passes the proprioceptive state to predict_action.
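
One way to catch a mismatched setup early is a small pre-flight check of the inference settings against the values listed under Training Config. The helper below is a hypothetical sketch, not part of SimpleVLA-RL or RoboTwin; it assumes the settings are available as a plain dict keyed by the flag names used in this card.

```python
# Settings this checkpoint was trained with, copied from the Training Config table.
EXPECTED = {
    "use_l1_regression": False,
    "use_diffusion": False,
    "use_film": False,
    "use_proprio": True,
    "num_images_in_input": 1,
}


def check_inference_config(cfg):
    """Return a list of mismatches between cfg and the settings in this card."""
    problems = []
    for key, expected in EXPECTED.items():
        actual = cfg.get(key)
        # Compare types too, so that e.g. True is not accepted in place of 1.
        if type(actual) is not type(expected) or actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems


# Example: a config that wrongly enables the L1 regression head.
bad_cfg = dict(EXPECTED, use_l1_regression=True)
for problem in check_inference_config(bad_cfg):
    print(problem)
```

An empty list means the flags agree with this card; it does not verify that proprioceptive state is actually passed at call time.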