Overview
This document describes how to integrate a Vision-Language-Action-Critic (VLAC) into the SimpleVLA-RL training stack. The goal is to replace the simulator-provided terminal success signal during training with a VLAC-predicted value, while preserving the simulator signal for evaluation.
Repository structure (high level)
verl Core RL training framework for SimpleVLA-RL. Uses Ray for distributed orchestration and vLLM for fast inference/parallelization. Originally built for LLM RL, adapted here for VLA training. Ships with a pre-trained OpenVLA-OFT model and the LIBERO simulation environment as Python packages.
examples Entry scripts and helpers for training and evaluation.
`run_openvla_oft_rl.sh` is the main training entrypoint; `eval_openvla_oft.sh` runs evaluation (with `trainer.val_only=True`).
evo_vlac External module imported from the VLAC project. Provides models that predict task progress/success for robot manipulation.
`evo_vlac/examples` shows how to run the critic on images/videos and obtain pairwise critic scores, per-frame values, and done/success estimates.
Objective
Training Do not use the environment `done` during training. Termination and terminal rewards are determined by VLAC only. After each step, call VLAC to detect `done`.
- If VLAC-determined `done == True`: terminate the episode immediately and set the terminal reward to 1.0 (no need to call the VLAC value model again).
- Else if `max_step` is reached: terminate the rollout, call the VLAC model to compute the value list, and use the last value as the reward.
- Else: continue the episode.
Evaluation Keep using the environment `done` signal to compute the success rate. VLAC is not used to determine evaluation success.
Rollout termination logic
- Training: "done" signal from VLAC
- Testing: "done" signal from the environment
Reward logic
- Training: if "done" detected by VLAC, reward=1.0; else, reward comes from VLAC value.
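The training-time termination and reward rules above can be sketched as a single decision function. This is a minimal illustration, not the real integration: `vlac_predict_done` and `vlac_value_list` are hypothetical stand-ins for whatever interface the VLAC service actually exposes.

```python
def step_reward(frames, step_idx, max_step, vlac_predict_done, vlac_value_list):
    """Return (terminate, reward) for the current rollout step during training.

    frames: in-memory list of frames observed so far.
    vlac_predict_done: callable frames -> bool (hypothetical VLAC done detector).
    vlac_value_list: callable frames -> list of per-frame values (hypothetical).
    """
    if vlac_predict_done(frames):
        # VLAC says the task is done: terminate with terminal reward 1.0,
        # without calling the value model again.
        return True, 1.0
    if step_idx + 1 >= max_step:
        # Rollout budget exhausted: query VLAC for the value list and use
        # the last value as the terminal reward.
        values = vlac_value_list(frames)
        return True, values[-1]
    # Otherwise keep stepping the episode; no intermediate reward.
    return False, 0.0
```

At evaluation time this function is bypassed entirely and the environment's own `done` signal determines success.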
Integration approach (service-oriented)
Why service
`verl` is Ray/vLLM-managed; VLAC is managed via Hugging Face and ms-swift. To keep the two systems decoupled and switchable, expose VLAC as a lightweight service that `verl` calls when needed.
Deployment flexibility The VLAC service can run on the same node (sharing a GPU memory fraction) or on a different node.
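On the `verl` side, the service call can be a plain HTTP POST, which works unchanged whether VLAC runs on the same node or a remote one. A minimal stdlib-only client sketch follows; the endpoint URL and JSON schema are assumptions for illustration, not the actual VLAC service API.

```python
import json
import urllib.request


def query_vlac(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON payload to a (hypothetical) VLAC service endpoint and
    return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Using plain HTTP keeps the trainer free of any VLAC-specific dependencies, so the service can be swapped or moved across nodes by changing only the URL.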
Transport options During training, frames are kept in memory as a Python list (TODO: double-confirm this in the code during implementation); they are not written to disk until the episode finishes. Therefore, prefer sending the frames themselves in the service request rather than passing file paths.
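Since frames stay in memory, they can be shipped in the request body without ever touching disk. One simple option, sketched below under the assumption that each frame is already compressed image bytes (e.g. JPEG), is base64-in-JSON; the payload schema is an illustration, not the real service contract.

```python
import base64
import json


def encode_frames(frames: list[bytes]) -> str:
    """Pack a list of in-memory image byte strings into a JSON payload,
    base64-encoding each frame so it survives JSON transport."""
    payload = {
        "frames": [base64.b64encode(f).decode("ascii") for f in frames],
        "format": "jpeg",  # assumed encoding; adjust to the actual frames
    }
    return json.dumps(payload)


def decode_frames(payload_str: str) -> list[bytes]:
    """Inverse of encode_frames, for the service side."""
    payload = json.loads(payload_str)
    return [base64.b64decode(s) for s in payload["frames"]]
```

Base64 adds roughly 33% size overhead; if that matters at scale, a binary transport (e.g. multipart upload or raw bytes with a length-prefixed framing) avoids it at the cost of a slightly more involved service interface.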
Additional Notes
- Prefer a simple implementation; avoid complicated environments such as Docker.
- GPU resource sharing: Likely to share GPUs with SimpleVLA-RL on H100 80GB cards. During rollout generation, usage is ~20–30 GB; during actor gradient updates, peaks at ~60–70 GB.
- Development environment: Use this desktop only for editing and light, non-intrusive smoke tests. I will push to a server for real training/evaluation.
- Collaboration: Before starting, I will confirm the open questions above with you and then proceed with the plan.