π₀ (VLA-Arena Fine-tuned)

About VLA-Arena

VLA-Arena is a comprehensive benchmark designed to quantitatively characterize the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this with a structured task-design framework that quantifies difficulty along three orthogonal axes:

  1. Task Structure: 170+ tasks grouped into four key dimensions:
    • Safety: Operating reliably under strict constraints.
    • Distractor: Handling environmental unpredictability.
    • Extrapolation: Generalizing to unseen scenarios.
    • Long Horizon: Executing complex, multi-step tasks.
  2. Language Command: Variations in instruction complexity.
  3. Visual Observation: Perturbations in visual input.

Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is performed on L0 tasks, and the model is then evaluated on the higher difficulty levels to assess its ability to generalize and to strictly follow safety constraints.
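The L0-train / L1-L2-evaluate protocol above can be sketched as a simple split over the task list. The task names and the level encoding below are illustrative assumptions, not the benchmark's actual naming scheme:

```python
def split_by_level(tasks):
    """Partition tasks into a fine-tuning set (L0) and evaluation sets (L1, L2)."""
    train = [t for t in tasks if t["level"] == 0]
    evals = {lvl: [t for t in tasks if t["level"] == lvl] for lvl in (1, 2)}
    return train, evals

# Hypothetical task entries for illustration only.
tasks = [
    {"name": "pick_cube", "level": 0},
    {"name": "pick_cube_distractor", "level": 1},
    {"name": "pick_cube_unseen_scene", "level": 2},
]
train, evals = split_by_level(tasks)
```

Only the L0 subset feeds the fine-tuning pipeline; L1 and L2 are held out for evaluation.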

Model Overview

This model is π₀ (Pi-Zero) fine-tuned on demonstration data generated from VLA-Arena. It serves as a robust baseline for evaluating performance across the benchmark's Safety, Distractor, Extrapolation, and Long Horizon dimensions.

This checkpoint utilizes a decoupled architecture, employing separate components for vision-language reasoning and low-level action generation, both adapted via LoRA for efficient fine-tuning.


Model Architecture

The π₀ architecture separates the VLM backbone from the action-generation policy, allowing specialized optimization of both semantic understanding and motor control.

| Component | Description |
| --- | --- |
| VLM Backbone | Gemma-2B (vision-language model) adapted via LoRA (`gemma_2b_lora`) |
| Action Expert | Gemma-300M adapted for control via LoRA (`gemma_300m_lora`) |
| Action Space | 7-DoF continuous control (end-effector pose + gripper) |
| Adaptation | Low-Rank Adaptation (LoRA) applied to both backbone and expert |
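The components above can be summarized as a configuration sketch. The field names and the LoRA rank are illustrative assumptions, not openpi's actual config schema:

```python
from dataclasses import dataclass

@dataclass
class Pi0ArenaConfig:
    """Hypothetical summary of this checkpoint's architecture choices."""
    vlm_backbone: str = "gemma_2b_lora"     # vision-language reasoning
    action_expert: str = "gemma_300m_lora"  # low-level action generation
    action_dim: int = 7                     # end-effector pose + gripper
    lora_rank: int = 16                     # assumed; not stated in this card

cfg = Pi0ArenaConfig()
```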

Key Feature: Decoupled Reasoning & Control

By pairing a gemma_2b_lora backbone for high-level reasoning with a gemma_300m_lora expert for action generation, π₀ balances computational efficiency and control precision. This separation allows the model to handle complex instructions while maintaining high-frequency control.
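The LoRA adaptation applied to both components follows the standard low-rank update y = W₀x + (α/r)·B·A·x, with the pretrained weight W₀ frozen. A minimal sketch with illustrative shapes and hyperparameters (π₀'s actual rank and scaling are not stated in this card):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16   # illustrative dimensions only

W0 = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01 # trainable low-rank down-projection
B = np.zeros((d_out, r))              # trainable up-projection, zero-initialized

def lora_forward(x):
    """Adapted layer: frozen base output plus scaled low-rank correction."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer initially matches the base layer.
assert np.allclose(lora_forward(x), W0 @ x)
```

Because B starts at zero, fine-tuning begins from exactly the pretrained behavior and only the small A/B matrices receive updates.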


Training Details

Dataset

This model was trained on the VLA-Arena/VLA_Arena_L0_L_lerobot_openpi dataset. This dataset contains demonstration data collected from VLA-Arena, formatted specifically for LeRobot and OpenPi training pipelines.
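The shape of a single demonstration frame in a LeRobot-style dataset can be sketched as below. The exact feature keys used by VLA_Arena_L0_L_lerobot_openpi are assumptions for illustration, not verified against the dataset:

```python
# Hypothetical single frame; key names follow common LeRobot conventions
# but are not confirmed for this specific dataset.
frame = {
    "observation.images.base": "<HxWx3 uint8 array>",  # camera observation
    "observation.state": [0.0] * 7,                    # proprioceptive state
    "action": [0.0] * 7,                               # 7-DoF action target
    "task": "pick up the cube",                        # language instruction
}
```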

Hyperparameters

The model was fine-tuned using LoRA (Low-Rank Adaptation) to ensure memory efficiency during training. The weights were initialized from a pre-trained checkpoint using a CheckpointWeightLoader, with updates restricted to parameters specified by the LoRA freeze filter.
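The freeze-filter idea can be sketched as a predicate over parameter names: only LoRA adapter parameters receive gradient updates, everything else stays frozen. The "lora" substring convention and the parameter names below are assumptions, not openpi's actual filter implementation:

```python
def is_trainable(param_name: str) -> bool:
    """Assumed convention: LoRA adapter parameters carry 'lora' in their name."""
    return "lora" in param_name.lower()

# Hypothetical parameter names for illustration.
params = [
    "vlm/attn/q_proj/kernel",       # frozen pretrained weight
    "vlm/attn/q_proj/lora_a",       # trainable adapter
    "action_expert/mlp/lora_b",     # trainable adapter
]
trainable = [p for p in params if is_trainable(p)]
```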

| Parameter | Value |
| --- | --- |
| Max Training Steps | 30,000 |
| Global Batch Size | 32 |
| Optimizer | AdamW |
| LR Schedule | CosineDecaySchedule |
| EMA | Disabled (None) |
| Data Loader Workers | 2 |
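A cosine-decay schedule over the 30,000-step run can be sketched as follows. The warmup length and the peak/final learning rates are illustrative assumptions; the card does not state them:

```python
import math

def cosine_decay_lr(step, peak_lr=2.5e-5, final_lr=2.5e-6,
                    warmup_steps=1000, total_steps=30_000):
    """Linear warmup to peak_lr, then cosine decay down to final_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr + (peak_lr - final_lr) * cosine

lr_start = cosine_decay_lr(0)        # warmup begins at 0
lr_final = cosine_decay_lr(30_000)   # decays to final_lr
```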

Evaluation & Usage

This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).

For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.


Dataset used to train VLA-Arena/pi0-vla-arena-fintuned-LoRA: VLA-Arena/VLA_Arena_L0_L_lerobot_openpi