1. Environment Setup

# Create and activate conda environment
conda create -n capvector-openvla-oft python=3.10.16 -y
conda activate capvector-openvla-oft

# Install PyTorch
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0

# Install this repository in editable mode, which also installs its dependencies
pip install -e .

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
#   =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation

If you are uncertain about the version of a dependency, please refer to our complete envs list.
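To confirm that the pinned versions from the commands above are the ones actually installed, a small check along these lines may help. This is a minimal sketch: the helper name `find_mismatches` is hypothetical, and the pinned versions are copied from the install commands above.

```python
from importlib import metadata

# Versions pinned in the install commands above.
EXPECTED = {
    "torch": "2.2.0",
    "torchvision": "0.17.0",
    "torchaudio": "2.2.0",
    "flash-attn": "2.5.5",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for every package whose version differs."""
    mismatches = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        # Ignore local build tags such as "2.2.0+cu121".
        if found is None or found.split("+")[0] != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {found or 'not installed'}")
```

Run it inside the activated conda environment; no output means every pinned package matches.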

2. Data Preparation

First, clone and install the LIBERO repo and required packages:

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt

(Optional, only needed if you plan to launch training) To download the LIBERO datasets that we used in our fine-tuning experiments, run the command below or download them manually. This fetches the LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-10 datasets in RLDS format (~10 GB total). You can use them to fine-tune openvla-SF or to train other methods such as OpenVLA. These are the same datasets used in the original OpenVLA project. If needed, see details on how to download the original non-RLDS datasets here.

git clone git@hf.co:datasets/openvla/modified_libero_rlds ./data/libero  # or download manually

After the download, the directory structure should look like this:

capvector-oft
    ├── data
    ·   ├── libero
        │   ├── libero_10_no_noops
        │   │   └── 1.0.0  (It contains some json files and 32 tfrecord files)
        │   ├── libero_goal_no_noops
        │   │   └── 1.0.0  (It contains some json files and 16 tfrecord files)
        │   ├── libero_object_no_noops
        │   │   └── 1.0.0  (It contains some json files and 32 tfrecord files)
        │   ├── libero_spatial_no_noops
        │   │   └── 1.0.0  (It contains some json files and 16 tfrecord files)
        │
        └── other benchmarks ...
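A quick way to confirm that the download matches the tree above is to count the tfrecord shards in each dataset directory. The sketch below is illustrative only: the function name is hypothetical, the shard counts are taken from the tree above, and it assumes that shard file names contain the substring "tfrecord".

```python
from pathlib import Path

# Shard counts as listed in the directory tree above.
EXPECTED_SHARDS = {
    "libero_10_no_noops": 32,
    "libero_goal_no_noops": 16,
    "libero_object_no_noops": 32,
    "libero_spatial_no_noops": 16,
}


def check_libero_layout(root, expected=EXPECTED_SHARDS, version="1.0.0"):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    for name, want in expected.items():
        version_dir = Path(root) / name / version
        if not version_dir.is_dir():
            problems.append(f"missing directory: {version_dir}")
            continue
        # Assumption: every shard file name contains "tfrecord".
        found = sum(
            1 for p in version_dir.iterdir() if p.is_file() and "tfrecord" in p.name
        )
        if found != want:
            problems.append(f"{name}: expected {want} tfrecord files, found {found}")
    return problems


if __name__ == "__main__":
    for problem in check_libero_layout("data/libero"):
        print(problem)
```

Run it from the capvector-oft directory; it prints nothing when all four datasets are complete.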

3. Set up the conda environment if you have not already done so (see step 1 above or the instructions in SETUP.md).

4. Obtain the CapVector on any dataset (e.g., LIBERO or ROBOTWIN) and merge it into OpenVLA to obtain $\theta_{meta}$

First, download the OpenVLA checkpoint (openvla-7b) and place it in the ./ckpts/ folder. The directory structure should look like this:

capvector-oft
    ├── ckpts
    ·   ├── openvla-7b
        │   ├── added_tokens.json
        │   ├── model-00001-of-00003.safetensors
        │   └── ...
        ·

Then, run the script that matches your dataset:

cd capvector-oft
bash capvector/interpolate.sh #LIBERO
bash capvector/initialized_interpolate_shell/get_vector_robotwin.sh #ROBOTWIN or other custom datasets (PROPRIO_DIM modification required)

Alternatively, you can skip the steps above and download $\theta_{meta}$ directly:

capvector-openvla-7b

This $\theta_{meta}$ was obtained on LIBERO-Spatial: the capability vector is merged into the OpenVLA weights with a vector weight of 1.
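One way to picture the merge is as weight-space arithmetic of the form $\theta_{meta} = \theta_{base} + w \cdot v$. The sketch below is an assumption, not the logic of capvector/interpolate.sh: it treats the capability vector $v$ as the per-parameter difference between two checkpoints (the scripts define which ones), and it uses plain Python lists in place of tensors. All function and variable names are hypothetical.

```python
def capability_vector(enhanced, reference):
    """Per-parameter difference enhanced - reference (assumed form of the vector)."""
    return {
        key: [e - r for e, r in zip(enhanced[key], reference[key])]
        for key in reference
    }


def merge(base, vector, weight=1.0):
    """theta_meta = theta_base + weight * vector, applied parameter by parameter."""
    return {
        key: [b + weight * v for b, v in zip(base[key], vector[key])]
        for key in base
    }


# Toy two-parameter "checkpoints", for illustration only.
reference = {"layer.weight": [1.0, 2.0]}
enhanced = {"layer.weight": [1.5, 1.0]}
base = {"layer.weight": [0.5, -1.0]}

vector = capability_vector(enhanced, reference)
theta_meta = merge(base, vector, weight=1.0)  # vector weight = 1, as stated above
```

Under this assumption, a weight of 0 returns the base weights unchanged, and a weight of 1 adds the full vector.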

Place the downloaded checkpoint in the ./ckpts/ folder. The directory structure should look like this:

capvector-oft
    ├── ckpts
    ·   ├── openvla-7b
        ├── capvector-openvla-7b
        ·

5. Start Training

conda activate capvector-openvla-oft
cd capvector-oft

First, be sure you have downloaded the LIBERO datasets, as mentioned in the Data Preparation Section: libero_spatial_no_noops, libero_object_no_noops, libero_goal_no_noops, libero_10_no_noops. ("_no_noops" stands for no no-op actions, i.e., training samples with near-zero actions are filtered out).

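Then, prepare the LoRA diff and place it in capvector-oft/capvector/lora_diff (the training command below expects capvector/lora_diff/sf_150000_steps_spatial_adapter_diff.safetensors). It is used to compute the orthogonal loss.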

Next, launch the fine-tuning script below, setting --nproc-per-node in the first line to the number of GPUs, X (the command as written uses X = 1). Before running, set the shell variables TASK (spatial, object, goal, or 10) and VERSION (a run name of your choice). With TASK=spatial, the command launches fine-tuning on LIBERO-Spatial with the hyperparameters that we used in our paper. A batch size of 8 per GPU requires ~74 GB of VRAM, and a batch size of 1 per GPU requires ~30 GB. Training results are stored in the location determined by --run_root_dir and --run_id_override.

You can use the command below or consult training_scripts/training.sh directly.


torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune_regular_loss.py \
  --vla_path ckpts/capvector-openvla-7b \
  --data_root_dir data/libero/ \
  --dataset_name libero_${TASK}_no_noops \
  --run_root_dir experiments/training_results/ \
  --use_l1_regression True \
  --use_diffusion False \
  --use_film False \
  --num_images_in_input 2 \
  --use_proprio True \
  --batch_size 8 \
  --learning_rate 5e-4 \
  --scheduler CosineAnnealingLR \
  --max_steps 150100 \
  --save_freq 150000 \
  --save_latest_checkpoint_only True \
  --merge_lora_during_training True \
  --regularization_lora_vector_path capvector/lora_diff/sf_150000_steps_spatial_adapter_diff.safetensors \
  --regularization_weight 1e-4 \
  --image_aug True \
  --lora_rank 32 \
  --wandb_entity "YOUR_WANDB_ENTITY" \
  --wandb_project "YOUR_WANDB_PROJECT" \
  --run_id_override "$VERSION"

The above training command should reproduce our CapVector results if X = 1 and the 150K step checkpoint is evaluated.
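The command above reads two shell variables, TASK and VERSION. The sketch below shows how TASK expands into the dataset name and how the number of GPUs affects the global batch size. The helper names are hypothetical, the TASK values are inferred from the four dataset directories listed earlier, and the batch-size arithmetic assumes that each torchrun process consumes its own --batch_size samples per step with no gradient accumulation.

```python
LIBERO_TASKS = ("spatial", "object", "goal", "10")


def dataset_name(task):
    """Expand TASK the same way the command expands libero_${TASK}_no_noops."""
    if task not in LIBERO_TASKS:
        raise ValueError(f"TASK must be one of {LIBERO_TASKS}, got {task!r}")
    return f"libero_{task}_no_noops"


def global_batch_size(num_gpus, per_gpu_batch_size=8):
    """Samples per optimizer step across all GPUs (assumes no gradient accumulation)."""
    return num_gpus * per_gpu_batch_size


print(dataset_name("spatial"), global_batch_size(num_gpus=1))
```

Because X multiplies the global batch size under this assumption, changing X may change the training dynamics relative to the X = 1 setting.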

Please be sure to test your policy with the same device/GPU used to train it! Otherwise, performance may drop substantially. You may be able to avoid the performance drop if you merge the LoRA weights into the base model on the downstream device used for testing (e.g., if you train on H100 and then merge on A100 before testing on A100). You can see our script vla-scripts/merge_lora_weights_and_save.py for merging the LoRA adapter into the base model offline. It's okay if you already merged LoRA weights into the base OpenVLA model during fine-tuning; you can always redownload the base model and merge again as long as you still have the LoRA adapter (merge_lora_weights_and_save.py will handle this for you).

6. Inference

Run the command below to evaluate the checkpoints you trained yourself:

python experiments/robot/libero/run_libero_eval.py \
  --pretrained_checkpoint experiments/training_results/$VERSION \
  --task_suite_name libero_${TASK}

Notes:

The evaluation script will run 500 trials by default (10 tasks x 50 episodes each). You can modify the number of trials per task by setting --num_trials_per_task. You can also change the random seed via --seed. There are other arguments in the script; we set them to the default values that work with the openvla-SF checkpoints above.
The evaluation script logs results locally. You can also log results in Weights & Biases by setting --use_wandb True and specifying --wandb_project <PROJECT> and --wandb_entity <ENTITY>.
The results reported in our paper were obtained using Python 3.10.16, PyTorch 2.2.0, and bidirectional transformers on an NVIDIA H100 GPU. Please stick to these package versions if possible.
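To evaluate several task suites or to shorten a run, it can be convenient to build the evaluation command programmatically. The sketch below uses only the flags named above. The helper `eval_command` and the run name "my_run" are hypothetical, and the checkpoint path assumes the same experiments/training_results/$VERSION layout as the command above.

```python
import shlex


def eval_command(version, task, num_trials_per_task=None, seed=None):
    """Build the argument list for run_libero_eval.py using only flags named above."""
    cmd = [
        "python", "experiments/robot/libero/run_libero_eval.py",
        "--pretrained_checkpoint", f"experiments/training_results/{version}",
        "--task_suite_name", f"libero_{task}",
    ]
    # Optional overrides; omitted flags keep the script's defaults.
    if num_trials_per_task is not None:
        cmd += ["--num_trials_per_task", str(num_trials_per_task)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


# "my_run" is a placeholder for your own VERSION value.
print(shlex.join(eval_command("my_run", "spatial", num_trials_per_task=10)))
```

The returned list can be passed to `subprocess.run` as is. With 10 tasks per suite, setting --num_trials_per_task to 10 gives 100 trials instead of the default 500.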