# OpenVLA (VLA-Arena Fine-tuned)
## About VLA-Arena
VLA-Arena is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes:
- Task Structure: 170+ tasks grouped into four key dimensions:
  - Safety: Operating reliably under strict constraints.
  - Distractor: Handling environmental unpredictability.
  - Extrapolation: Generalizing to unseen scenarios.
  - Long Horizon: Executing complex, multi-step tasks.
- Language Command: Variations in instruction complexity.
- Visual Observation: Perturbations in visual input.
Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.
## Model Overview

This model is OpenVLA (7B) fine-tuned on demonstration data generated from VLA-Arena. It serves as a strong generalist baseline, leveraging the pre-trained knowledge of the original OpenVLA architecture while adapting to the benchmark's specific safety and manipulation requirements.
This checkpoint was trained using Low-Rank Adaptation (LoRA) to efficiently adapt the large language backbone without full-parameter fine-tuning.
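The idea behind LoRA can be sketched in a few lines: each frozen pretrained weight $W$ is augmented with a trainable low-rank product $BA$, scaled by $\alpha/r$, so only the small $A$ and $B$ matrices receive gradients. A minimal NumPy illustration with toy layer sizes (the $\alpha$ value is an assumption; the card only specifies $r=32$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 32   # toy layer sizes; rank r=32 matches this run

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (zero-init)
alpha = 32                             # scaling factor (assumed; not stated in the card)

def lora_forward(x):
    # Frozen path plus the low-rank update, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# At initialization B = 0, so the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
# Only A and B train: 2 * r * d parameters instead of d * d.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Because $B$ is zero-initialized, training starts exactly at the pretrained policy, and the trainable parameter count scales with $r$ rather than with the full layer width.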
## Model Architecture
OpenVLA builds upon a strong Vision-Language Model backbone, directly tokenizing robotic actions into the language model's vocabulary.
| Component | Description |
|---|---|
| Backbone | Llama 2 7B (language) + SigLIP & DINOv2 (vision) |
| Action Space | 7-DoF continuous control, discretized into tokens |
| Adaptation Strategy | LoRA (rank $r=32$) |
| Quantization | None (full-precision loading during LoRA training) |
### Key Feature: Direct Action Prediction

OpenVLA treats robot actions as just another "language": actions are emitted as tokens, letting the model leverage the reasoning capabilities of the Llama 2 backbone to plan and execute motion sequences from visual inputs and textual instructions.
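The discretization this relies on can be sketched as follows: each of the 7 action dimensions is clipped to a normalized range and mapped to one of a fixed number of bins, each bin corresponding to a token in the language model's vocabulary. The 256-bin count follows the OpenVLA paper; the helper names and the normalization range here are illustrative:

```python
import numpy as np

def discretize_action(action, n_bins=256, low=-1.0, high=1.0):
    """Map each continuous action dimension to a discrete bin index (a token id offset)."""
    action = np.clip(action, low, high)
    # Scale to [0, 1], then to integer bin indices 0 .. n_bins - 1.
    return ((action - low) / (high - low) * (n_bins - 1)).round().astype(int)

def undiscretize_action(bins, n_bins=256, low=-1.0, high=1.0):
    """Recover approximate continuous values from bin indices (bin centers)."""
    return low + bins / (n_bins - 1) * (high - low)

a = np.array([0.1, -0.5, 0.9, 0.0, 0.3, -0.2, 1.0])  # one 7-DoF action
tokens = discretize_action(a)
recovered = undiscretize_action(tokens)
# Round-trip error is bounded by half a bin width: (high - low) / (2 * (n_bins - 1)).
```

At inference time the model generates these bin tokens autoregressively and the controller de-tokenizes them back into a continuous 7-DoF command.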
## Training Details

### Dataset

This model was trained on the `VLA-Arena/VLA_Arena_L0_L_rlds` dataset, which consists of diverse robotic manipulation demonstrations covering the arena's foundational L0 tasks.
### Hyperparameters
The model was fine-tuned using LoRA on an 8-GPU cluster. Image augmentation was enabled to improve robustness against visual distractors.
| Parameter | Value |
|---|---|
| Max Training Steps | 200,000 |
| Global Batch Size | 128 (16 per device $\times$ 8 GPUs) |
| Optimizer | AdamW |
| Learning Rate ($\eta$) | $5.0 \times 10^{-4}$ |
| Shuffle Buffer Size | 100,000 |
| Image Augmentation | Enabled |
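Under these hyperparameters, a launch command for the upstream OpenVLA LoRA fine-tuning script would look roughly like the sketch below. The flag names follow the public `vla-scripts/finetune.py`, but the paths, base-model id, and dataset name are placeholders, not the exact values used for this checkpoint:

```shell
# Sketch of an 8-GPU LoRA fine-tuning launch (paths and dataset name are placeholders).
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path "openvla/openvla-7b" \
  --data_root_dir /path/to/rlds/datasets \
  --dataset_name vla_arena_l0 \
  --batch_size 16 \
  --learning_rate 5e-4 \
  --max_steps 200000 \
  --shuffle_buffer_size 100000 \
  --image_aug True \
  --lora_rank 32 \
  --lora_dropout 0.0 \
  --use_quantization False
```

With 16 samples per device across 8 GPUs, this reproduces the global batch size of 128 from the table above.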
### LoRA Configuration

Unlike lighter-weight adaptations, this training run used a moderate rank without quantization to preserve model fidelity.
| Parameter | Value |
|---|---|
| LoRA Rank ($r$) | 32 |
| LoRA Dropout | 0 |
| 4-bit Quantization | Disabled |
## Evaluation & Usage
This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).
For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.
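For single-step inference outside the benchmark harness, OpenVLA checkpoints are typically loaded through `transformers`. A hedged sketch following the upstream OpenVLA examples: the repo id, `unnorm_key`, and prompt wording below are placeholders and may need adjusting for this checkpoint.

```python
def build_prompt(instruction: str) -> str:
    # OpenVLA's standard prompt template for action prediction.
    return f"In: What action should the robot take to {instruction.lower()}?\nOut:"

def predict_action(image, instruction: str,
                   model_id: str = "openvla/openvla-7b",
                   unnorm_key: str = "bridge_orig"):
    """Load the policy and predict one 7-DoF action (requires a CUDA GPU).

    Heavy dependencies are imported lazily so the sketch stays importable.
    `model_id` and `unnorm_key` are placeholders; point them at this
    checkpoint and its dataset-statistics key.
    """
    import torch
    from transformers import AutoModelForVision2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    vla = AutoModelForVision2Seq.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).to("cuda")
    inputs = processor(build_prompt(instruction), image).to("cuda", dtype=torch.bfloat16)
    return vla.predict_action(**inputs, unnorm_key=unnorm_key, do_sample=False)
```

The returned action is a de-normalized 7-DoF command; for benchmark-grade numbers, use the VLA-Arena evaluation scripts instead of this standalone loop.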