# OpenVLA (VLA-Arena Fine-tuned)
## About VLA-Arena
VLA-Arena is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes:
- Task Structure: 170+ tasks grouped into four key dimensions:
  - Safety: Operating reliably under strict constraints.
  - Distractor: Handling environmental unpredictability.
  - Extrapolation: Generalizing to unseen scenarios.
  - Long Horizon: Executing complex, multi-step tasks.
- Language Command: Variations in instruction complexity.
- Visual Observation: Perturbations in visual input.
Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.
## Model Overview

This model is OpenVLA (7B) fine-tuned on demonstration data generated from VLA-Arena. It serves as a strong generalist baseline, leveraging the pre-trained knowledge of the original OpenVLA architecture while adapting to the benchmark's specific safety and manipulation requirements.
This checkpoint was trained using Low-Rank Adaptation (LoRA) to efficiently adapt the large language backbone without full-parameter fine-tuning.
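The idea behind LoRA can be sketched in a few lines: each frozen pretrained weight $W$ is augmented with a trainable low-rank product $BA$, scaled by $\alpha/r$, so only the small $A$ and $B$ matrices receive gradients. A minimal NumPy illustration with toy layer sizes (the $\alpha$ value is an assumption; the card only specifies $r=32$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 32   # toy layer sizes; rank r=32 matches this run

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (zero-init)
alpha = 32                             # scaling factor (assumed; not stated in the card)

def lora_forward(x):
    # Frozen path plus the low-rank update, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# At initialization B = 0, so the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
# Only A and B train: 2 * r * d parameters instead of d * d.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Because $B$ is zero-initialized, training starts exactly at the pretrained policy, and the trainable parameter count scales with $r$ rather than with the full layer width.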
## Model Architecture
OpenVLA builds upon a strong Vision-Language Model backbone, directly tokenizing robotic actions into the language model's vocabulary.
| Component | Description |
|---|---|
| Backbone | Llama 2 7B (language) + SigLIP & DINOv2 (vision) |
| Action Space | 7-DoF continuous control, discretized into tokens |
| Adaptation Strategy | LoRA (rank $r=32$) |
| Quantization | None (full-precision loading during LoRA training) |
### Key Feature: Direct Action Prediction

OpenVLA treats robot actions as just another "language": actions are emitted as tokens, letting the model leverage the reasoning capabilities of the Llama 2 backbone to plan and execute motion sequences from visual inputs and textual instructions.
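The discretization this relies on can be sketched as follows: each of the 7 action dimensions is clipped to a normalized range and mapped to one of a fixed number of bins, each bin corresponding to a token in the language model's vocabulary. The 256-bin count follows the OpenVLA paper; the helper names and the normalization range here are illustrative:

```python
import numpy as np

def discretize_action(action, n_bins=256, low=-1.0, high=1.0):
    """Map each continuous action dimension to a discrete bin index (a token id offset)."""
    action = np.clip(action, low, high)
    # Scale to [0, 1], then to integer bin indices 0 .. n_bins - 1.
    return ((action - low) / (high - low) * (n_bins - 1)).round().astype(int)

def undiscretize_action(bins, n_bins=256, low=-1.0, high=1.0):
    """Recover approximate continuous values from bin indices (bin centers)."""
    return low + bins / (n_bins - 1) * (high - low)

a = np.array([0.1, -0.5, 0.9, 0.0, 0.3, -0.2, 1.0])  # one 7-DoF action
tokens = discretize_action(a)
recovered = undiscretize_action(tokens)
# Round-trip error is bounded by half a bin width: (high - low) / (2 * (n_bins - 1)).
```

At inference time the model generates these bin tokens autoregressively and the controller de-tokenizes them back into a continuous 7-DoF command.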
## Training Details

### Dataset

This model was trained on the `VLA-Arena/VLA_Arena_L0_L_rlds` dataset, which consists of diverse robotic manipulation demonstrations covering the arena's foundational L0 tasks.
### Hyperparameters
The model was fine-tuned using LoRA on an 8-GPU cluster. Image augmentation was enabled to improve robustness against visual distractors.
| Parameter | Value |
|---|---|
| Max Training Steps | 200,000 |
| Global Batch Size | 128 (16 per device $\times$ 8 GPUs) |
| Optimizer | AdamW |
| Learning Rate ($\eta$) | $5.0 \times 10^{-4}$ |
| Shuffle Buffer Size | 100,000 |
| Image Augmentation | Enabled |
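Under these hyperparameters, a launch command for the upstream OpenVLA LoRA fine-tuning script would look roughly like the sketch below. The flag names follow the public `vla-scripts/finetune.py`, but the paths, base-model id, and dataset name are placeholders, not the exact values used for this checkpoint:

```shell
# Sketch of an 8-GPU LoRA fine-tuning launch (paths and dataset name are placeholders).
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path "openvla/openvla-7b" \
  --data_root_dir /path/to/rlds/datasets \
  --dataset_name vla_arena_l0 \
  --batch_size 16 \
  --learning_rate 5e-4 \
  --max_steps 200000 \
  --shuffle_buffer_size 100000 \
  --image_aug True \
  --lora_rank 32 \
  --lora_dropout 0.0 \
  --use_quantization False
```

With 16 samples per device across 8 GPUs, this reproduces the global batch size of 128 from the table above.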
### LoRA Configuration

Unlike lighter-weight adaptations, this training run used a moderate rank without quantization to preserve model fidelity.
| Parameter | Value |
|---|---|
| LoRA Rank ($r$) | 32 |
| LoRA Dropout | 0 |
| 4-bit Quantization | Disabled |
## Evaluation & Usage
This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).
For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.
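For single-step inference outside the benchmark harness, OpenVLA checkpoints are typically loaded through `transformers`. A hedged sketch following the upstream OpenVLA examples: the repo id, `unnorm_key`, and prompt wording below are placeholders and may need adjusting for this checkpoint.

```python
def build_prompt(instruction: str) -> str:
    # OpenVLA's standard prompt template for action prediction.
    return f"In: What action should the robot take to {instruction.lower()}?\nOut:"

def predict_action(image, instruction: str,
                   model_id: str = "openvla/openvla-7b",
                   unnorm_key: str = "bridge_orig"):
    """Load the policy and predict one 7-DoF action (requires a CUDA GPU).

    Heavy dependencies are imported lazily so the sketch stays importable.
    `model_id` and `unnorm_key` are placeholders; point them at this
    checkpoint and its dataset-statistics key.
    """
    import torch
    from transformers import AutoModelForVision2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    vla = AutoModelForVision2Seq.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).to("cuda")
    inputs = processor(build_prompt(instruction), image).to("cuda", dtype=torch.bfloat16)
    return vla.predict_action(**inputs, unnorm_key=unnorm_key, do_sample=False)
```

The returned action is a de-normalized 7-DoF command; for benchmark-grade numbers, use the VLA-Arena evaluation scripts instead of this standalone loop.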