pi0.5 Build Block Tower - RLT Stage 1 (Encoder-Decoder)

RL Token encoder-decoder trained on top of a frozen pi0.5 baseline VLA for building a block tower. Implements Stage 1 of the RL Token method (Xu et al., 2026): a lightweight transformer encoder-decoder compresses VLA prefix embeddings into a single RL token via autoregressive reconstruction.

Experiment

  • Objective: Train RLT encoder-decoder to produce a compact RL token representation from frozen VLA prefix embeddings.
  • VLA backbone: Baseline 55k checkpoint (pravsels/pi05-build-block-tower-baseline), frozen (rl_vla_loss_weight=0.0).
  • Encoder-decoder: 2-layer transformer, 8 heads, dim=2048, SwiGLU FFN.
  • Loss: Autoregressive reconstruction of VLA prefix embeddings (L2).
  • Validation: Deterministic episode-level 90/10 train/val split with held-out episode IDs saved in assets/episode_split.json.
  • Steps: 10,000

Config

  • Config name: pi05_rl_token_build_block_tower
  • Model: Pi0RLConfig (pi05=True, action_horizon=50, rl_vla_loss_weight=0.0)
  • Batch size: 36
  • Learning rate: 5e-5 cosine decay (1k warmup)
  • Optimizer: AdamW (gradient clip norm 1.0)
  • EMA decay: 0.999
  • Delta actions: enabled
  • State/action space: 7D joint-space
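The learning-rate and EMA settings above can be sketched in plain Python. This is a minimal illustration, not the training code: it assumes a linear warmup, a cosine decay that spans the remaining steps of the 10,000-step run, and a final learning rate of zero, none of which are stated in the config summary. The names `lr_at` and `ema_update` are hypothetical.

```python
import math


def lr_at(step, peak_lr=5e-5, warmup_steps=1_000, total_steps=10_000, end_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Assumptions (not confirmed by the config summary): warmup is linear,
    decay runs until total_steps, and the schedule ends at end_lr.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


def ema_update(ema_value, param_value, decay=0.999):
    """One exponential-moving-average step for a single parameter value."""
    return decay * ema_value + (1.0 - decay) * param_value


if __name__ == "__main__":
    for step in (0, 500, 1_000, 5_500, 9_999):
        print(step, lr_at(step))
```

With an EMA decay of 0.999, each update moves the averaged weights 0.1% of the way toward the current weights, so the average reflects roughly the last thousand steps.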

Dataset

  • villekuosmanen/build_block_tower (200 episodes, LeRobot v2.1)
  • Train/val separation is by whole episode, not timestep, to avoid leakage.
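A deterministic episode-level split can be sketched as follows. This is an illustration under stated assumptions rather than the script used for this repo: the seed, the shuffle-based selection, and the `{"train": [...], "val": [...]}` layout are hypothetical, and the actual schema of assets/episode_split.json may differ.

```python
import json
import random


def split_episodes(episode_ids, val_fraction=0.1, seed=0):
    """Split whole episodes into train/val so no episode appears in both.

    The seed and the output layout are assumptions made for illustration.
    """
    ids = sorted(episode_ids)            # fixed starting order
    rng = random.Random(seed)            # local RNG, independent of global state
    rng.shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return {"train": sorted(ids[n_val:]), "val": sorted(ids[:n_val])}


split = split_episodes(range(200))
print(len(split["train"]), len(split["val"]))  # 180 20

# Serialising the held-out IDs lets later evaluations reuse the same split.
split_json = json.dumps(split)
```

Every frame of an episode then inherits its episode's assignment, so consecutive, highly correlated frames never straddle the train/val boundary.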

Checkpoint Hashes

Verify integrity by running find params -type f | sort | xargs sha256sum | sha256sum from the checkpoint step directory (checkpoints/9999/).
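Where the shell tools are unavailable, the same digest can be approximated in Python. The sketch below assumes the listing format that sha256sum prints by default (hex digest, two spaces, path, newline) and a byte-order sort, which matches `sort` only under the C locale; `tree_digest` and `file_sha256` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of one file, read in chunks so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digest(root="params"):
    """Hash the sorted per-file listing, mirroring the shell pipeline.

    Paths are written the way find prints them for a plain relative root
    such as "params" (for example "params/sub/file").
    """
    root = Path(root)
    files = sorted(
        p.as_posix() for p in root.rglob("*") if p.is_file() and not p.is_symlink()
    )
    outer = hashlib.sha256()
    for name in files:
        outer.update(f"{file_sha256(name)}  {name}\n".encode("utf-8"))
    return outer.hexdigest()
```

Calling `tree_digest("params")` from checkpoints/9999/ should give the value in the SHA-256 column if these assumptions hold; if it does not, compare against the shell command run with LC_ALL=C before concluding the files differ.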

| Step | Train Loss | Val Loss | SHA-256 |
|---|---|---|---|
| 9,999 | 216.8683 | 286.5721 | 4378fc1886f7eef6adab8a123ec491cde783c9aa94cd60a0b57757314862ed95 |

W&B

Evaluation

Evaluation artifacts for the frozen 9999 checkpoint are available under evals/.

All evaluations depend on two artifacts together:

  • the frozen base pi0.5 VLA: pravsels/pi05-build-block-tower-baseline
  • the Stage 1 RLT encoder-decoder checkpoint in this repo: step 9999

Cosine Similarity (ID vs OOD)

Tested on 1 ID episode (1,028 frames) and 1 OOD episode (486 frames). Raw cosine similarity does not cleanly separate ID (build_block_tower) from OOD (drop_footbag_into_dice_tower).

| Comparison | Mean cosine | Std |
|---|---|---|
| Within ID (build_block_tower) | 0.972 | 0.010 |
| Within OOD (drop_footbag) | 0.988 | 0.005 |
| Cross-task (ID vs OOD) | 0.974 | 0.006 |
| Episode-level (mean-pooled) | 0.994 | n/a |

Within-ID similarity (0.972) and cross-task similarity (0.974) are nearly identical: the token doesn't distinguish between tasks any more than it distinguishes between frames of the same task. The most likely failure mode is not a dead token, but a token dominated by a large shared component, with useful information compressed into smaller residual directions.
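The shared-component hypothesis can be illustrated on synthetic vectors. The sketch below is not the evaluation code and uses no tokens from this repo: it builds two hypothetical "tasks" that share one large direction, shows that raw cosine similarity barely separates them, and then subtracts the mean token, one simple way (an assumption here, not a method taken from the eval log) to expose the residual directions.

```python
import numpy as np


def mean_pairwise_cosine(a, b=None):
    """Mean cosine similarity within a set (b is None) or across two sets."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    if b is None:
        sims = a @ a.T
        upper = np.triu_indices(len(a), k=1)   # skip self-pairs and duplicates
        return float(sims[upper].mean())
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())


rng = np.random.default_rng(0)
dim = 64
shared = 10.0 * rng.standard_normal(dim)       # large component common to all tokens
task_a = rng.standard_normal(dim)              # small task-specific offsets
task_b = rng.standard_normal(dim)
id_tokens = shared + task_a + 0.3 * rng.standard_normal((100, dim))
ood_tokens = shared + task_b + 0.3 * rng.standard_normal((100, dim))

raw_within = mean_pairwise_cosine(id_tokens)
raw_cross = mean_pairwise_cosine(id_tokens, ood_tokens)

# Remove the component shared by every token before comparing directions.
mean_token = np.concatenate([id_tokens, ood_tokens]).mean(axis=0)
centered_within = mean_pairwise_cosine(id_tokens - mean_token)
centered_cross = mean_pairwise_cosine(id_tokens - mean_token, ood_tokens - mean_token)

print(f"raw:      within={raw_within:.3f} cross={raw_cross:.3f}")
print(f"centered: within={centered_within:.3f} cross={centered_cross:.3f}")
```

In this toy setup the raw similarities are both close to 1, while the centered within-task similarity stays positive and the centered cross-task similarity turns negative. Whether the real RL tokens behave the same way would need to be checked on the extracted features.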

See evals/2026-03-27_rl_token_eval/eval_log.md for the full interpretation and the accompanying JSON/plot artifacts.

Reconstruction Ablation (step 5k vs 10k)

Tests whether the RL token carries meaningful information by comparing decoder reconstruction loss under three conditions: real token, zero vector, and shuffled (batch-neighbour's) token. Evaluated on 32 timesteps (4 batches × 8) from the train split. Loss is the squared L2 error (L2²) per token, summed over the embedding dimension, averaged over valid tokens per example, then averaged over the batch.

| Condition | Mean L2² loss (step 5k) | Mean L2² loss (step 10k) |
|---|---|---|
| Real RL token | 365.4 | 226.2 |
| Neighbour's token | 401.4 (+10%) | 316.3 (+40%) |
| Zero vector | 850.1 (+133%) | 1038.3 (+359%) |

Percentages are relative to the real RL token loss at the same checkpoint. Mean pairwise cosine similarity between tokens decreased from 0.990 (5k) to 0.970 (10k), consistent with tokens differentiating more across examples.

From step 5k to 10k the real-token loss falls (365.4 to 226.2) while the penalty for substituting a neighbour's token or a zero vector grows, which indicates the RL token is learning genuine information rather than collapsing. The modest neighbour gap is expected for a single-task dataset (200 episodes, same prompt): tokens are legitimately similar across same-task observations.
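The ablation protocol described above can be sketched in NumPy. This is a toy illustration, not the evaluation script: `decode` stands in for the real autoregressive decoder, the neighbour's token is taken by rolling the batch by one position (an assumption about how the shuffle is done), and the data is synthetic. Only the loss reduction follows the description in this section.

```python
import numpy as np


def masked_recon_loss(pred, target, mask):
    """Squared L2 per token (summed over the embedding dim), averaged over
    valid tokens per example, then averaged over the batch.

    pred, target: (batch, tokens, dim); mask: (batch, tokens), 1 = valid.
    """
    per_token = ((pred - target) ** 2).sum(axis=-1)
    per_example = (per_token * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
    return float(per_example.mean())


def ablation_losses(decode, tokens, target, mask):
    """Reconstruction loss with the real, a neighbour's, and a zero token."""
    conditions = {
        "real": tokens,
        "neighbour": np.roll(tokens, shift=1, axis=0),  # token of the previous batch item
        "zero": np.zeros_like(tokens),
    }
    return {name: masked_recon_loss(decode(tok), target, mask)
            for name, tok in conditions.items()}


# Toy setup: each target sequence is its own token repeated, plus small noise.
rng = np.random.default_rng(0)
batch, seq_len, dim = 8, 5, 16
tokens = 5.0 * rng.standard_normal(dim) + rng.standard_normal((batch, dim))
target = tokens[:, None, :] + 0.1 * rng.standard_normal((batch, seq_len, dim))
mask = np.ones((batch, seq_len))


def toy_decode(tok):
    """Hypothetical stand-in decoder: broadcast the token over the sequence."""
    return np.repeat(tok[:, None, :], seq_len, axis=1)


losses = ablation_losses(toy_decode, tokens, target, mask)
print({name: round(value, 2) for name, value in losses.items()})
```

Because the toy tokens share a large common component, the zero-vector condition is penalised most and the neighbour condition only moderately, the same ordering as in the table.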

See evals/2026-03-27_rl_token_eval/recon_ablation_progression.md for full analysis and per-batch breakdowns.

Probe Suite (step 10k)

Tests whether the frozen RL token is informative enough for downstream actor-critic work by training lightweight PyTorch probes on extracted features (5k train / 1k val samples, episode-level split).

| Probe | Input | Target | Val Metric | Baseline | Delta |
|---|---|---|---|---|---|
| Action MLP | rl_token + state | VLA action chunk | MSE 0.1517 | state-only: 0.1612 | -6% |
| Action MLP | rl_token + state | ground-truth action | MSE 0.0088 | n/a | n/a |
| Linear state | rl_token | normalized state | MSE 0.0555 | random vector: 0.0785 | -29% |
| Subtask classifier | rl_token | 11 subtask classes | accuracy 19.9% | chance: 9.1% | 2.2x |

Interpretation:

  • Adding the RL token to the state input lowers validation MSE for VLA action-chunk prediction by 6% relative to the state-only baseline, indicating it contributes beyond raw proprioception.
  • State information is linearly decodable from the RL token (29% below random baseline). Val loss decreases monotonically over 40 epochs without divergence.
  • Subtask classification is 2.2× above chance, but train loss drops to 0.03 while val loss diverges (2.0 → 5.3). The probe is overfitting, likely because 200 episodes split across 11 subtask classes leaves too few examples per class to learn from.
  • Verdict: moderate pass. The RL token carries sufficient information for Stage 3 critic training.
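The linear-state probe and its random-vector baseline can be sketched as below. This is a simplified stand-in, not the probe suite: it swaps the gradient-trained PyTorch probe for a closed-form ridge regression in NumPy, uses synthetic features in place of extracted RL tokens, and splits rows by simple slicing because the synthetic samples are independent (the real suite splits by episode). All names and sizes are hypothetical.

```python
import numpy as np


def fit_linear_probe(features, targets, ridge=1e-3):
    """Closed-form ridge regression with a bias column (swapped in for SGD)."""
    x = np.hstack([features, np.ones((len(features), 1))])
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_mse(weights, features, targets):
    """Mean squared error of the fitted probe on held-out rows."""
    x = np.hstack([features, np.ones((len(features), 1))])
    return float(((x @ weights - targets) ** 2).mean())


rng = np.random.default_rng(0)
n_train, n_val, token_dim, state_dim = 500, 100, 32, 7
n_total = n_train + n_val

# Synthetic "tokens" that linearly encode a 7D state, plus observation noise.
tokens = rng.standard_normal((n_total, token_dim))
mixing = 0.1 * rng.standard_normal((token_dim, state_dim))
states = tokens @ mixing + 0.05 * rng.standard_normal((n_total, state_dim))
control = rng.standard_normal((n_total, token_dim))   # random-vector baseline input

token_mse = probe_mse(
    fit_linear_probe(tokens[:n_train], states[:n_train]),
    tokens[n_train:], states[n_train:],
)
random_mse = probe_mse(
    fit_linear_probe(control[:n_train], states[:n_train]),
    control[n_train:], states[n_train:],
)
relative_delta = (token_mse - random_mse) / random_mse
print(f"token={token_mse:.4f} random={random_mse:.4f} delta={relative_delta:+.0%}")
```

The Delta column in the probe table is the same relative quantity: (0.0555 - 0.0785) / 0.0785 is about -29%. The gap in this toy example is far larger than in the real results because the synthetic state is linear in the token by construction.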

See evals/2026-03-30_probe_suite/metrics.json for full per-epoch training histories.

Repo Structure

assets/                                # Norm stats plus deterministic episode split metadata
checkpoints/9999/params/               # Model weights (params only)
evals/2026-03-27_rl_token_eval/        # Cosine analysis, reconstruction ablation JSONs, plots
evals/2026-03-30_probe_suite/          # Probe suite metrics (action, linear, subtask probes)
README.md                              # This file
TRAINING_LOG.md                        # Training log