X-VLA v20: BEHAVIOR-1K Task 0 (Turn on Radio)

Fine-tune of 2toINF/X-VLA-Pt on a single BEHAVIOR-1K task (turning on the radio receiver), using the v20 architecture from markli1hoshipu/behavior1k-xvla @ v20.

Available checkpoints

| Subfolder | Steps | Total loss | joints | skill_cls | progress |
| --- | --- | --- | --- | --- | --- |
| ckpt-30000/ | 30,000 | 0.0232 | 0.0204 | 0.0000 | 0.0027 |
| ckpt-40000/ | 40,000 | 0.0165 | 0.0158 | 0.0000 | 0.0007 |
| ckpt-50000/ | 50,000 | 0.0131 | 0.0117 | 0.0000 | 0.0014 |
| ckpt-60000/ | 60,000 (final) | 0.0135 | 0.0107 | 0.0000 | 0.0028 |

Each subfolder is fully self-contained; load any of them with subfolder="ckpt-XXXXX" (see Usage).

Architecture (v20)

  • Additive per-task + per-skill soft prompts (task_prompt_hub[task_id] + skill_prompt_hub[skill_id]), 32 tokens × 1024 dim each, zero-initialized
  • Skill-conditioned progress head (predicts position within current skill segment)
  • Skill classifier head on pooled VLM features, trained with sqrt-inverse-frequency weighted CE (λ=0.1, on detached features)
  • Skill-enriched language instructions at training time, e.g. "Turn on the radio receiver. Current: move to radio."
  • 23-D action space for the R1Pro robot (3 base-qvel + 4 trunk + 7 arm L + 1 grip L + 7 arm R + 1 grip R), 30-step action horizon, 3 RGB cameras (head + L/R wrist)
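The 23-D action layout in the last bullet can be sketched as a set of named slices. This is only an illustration: the component names are hypothetical, and the index order is assumed to follow the order in which the components are listed above, which should be verified against the repository before use.

```python
from collections import OrderedDict

# Component widths in the order the model card lists them.
# ASSUMPTION: this is also the index order of the flat action vector.
# The names below are illustrative, not identifiers from the repository.
ACTION_LAYOUT = OrderedDict([
    ("base_qvel", 3),
    ("trunk", 4),
    ("arm_left", 7),
    ("gripper_left", 1),
    ("arm_right", 7),
    ("gripper_right", 1),
])


def build_slices(layout):
    """Map each component name to the slice it occupies in the flat vector."""
    slices, start = {}, 0
    for name, width in layout.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


ACTION_SLICES, ACTION_DIM = build_slices(ACTION_LAYOUT)


def split_action(action):
    """Split one flat action (a sequence of ACTION_DIM numbers) by component."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    return {name: list(action[s]) for name, s in ACTION_SLICES.items()}
```

Applied to each of the 30 steps in a predicted action chunk, `split_action` yields per-component commands; the widths sum to 23, matching the stated action dimension.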

Training

| Setting | Value |
| --- | --- |
| Base model | 2toINF/X-VLA-Pt |
| Dataset | Hoshipu/behavior-1k-mp-collected-turning-on-radio (success split) |
| Trainable episodes | 1354 (task 0) |
| Total frames | 2,845,413 |
| GPUs | 8 × NVIDIA H200 |
| Per-GPU batch | 32 |
| Effective batch | 256 |
| Precision | bf16 |
| LR (core) | 5e-5, cosine decay to 5e-6 |
| LR (VLM, soft prompts) | 0.1 × LR (so 5e-6 → 5e-7) |
| Warmup | 2000 steps |
| Freeze schedule | VLM + transformer core frozen for first 1000 steps (only soft prompts + action heads train) |
| Iterations | 60,000 |
| Optimizer | AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0 |
| Wall-clock | ~14 h 51 m |
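The schedule settings above can be sketched in a few lines of plain Python. The numbers come from the table, but the exact shape is an assumption: the sketch uses linear warmup followed by cosine decay over the remaining steps, and the training code may differ in detail. The function names are hypothetical.

```python
import math

PEAK_LR = 5e-5        # core LR from the table
FINAL_LR = 5e-6       # cosine decay floor from the table
WARMUP_STEPS = 2000
TOTAL_STEPS = 60_000
VLM_LR_SCALE = 0.1    # VLM and soft prompts use 0.1 x the core LR
FREEZE_STEPS = 1000   # VLM + transformer core frozen before this step


def core_lr(step):
    """Core learning rate at a step (ASSUMED: linear warmup, then cosine)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


def vlm_lr(step):
    """Learning rate for the VLM and soft-prompt parameter group."""
    return VLM_LR_SCALE * core_lr(step)


def core_is_frozen(step):
    """True while only the soft prompts and action heads receive updates."""
    return step < FREEZE_STEPS


# Arithmetic implied by the table: 8 GPUs x 32 samples per GPU.
effective_batch = 8 * 32
samples_seen = effective_batch * TOTAL_STEPS
# ASSUMPTION: one training sample corresponds to one dataset frame.
passes_over_data = samples_seen / 2_845_413
```

Under the one-sample-per-frame assumption, 60,000 steps at an effective batch of 256 correspond to somewhat more than five passes over the 2,845,413 frames.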

Loss curve (selected steps)

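| Step | total | joints | skill_cls | progress |
| --- | --- | --- | --- | --- |
| 0 | 15.70 | 15.11 | 0.582 | 0.006 |
| 1000 | 0.568 | 0.555 | 0.005 | 0.008 |
| 5000 | 0.107 | 0.101 | 0.000 | 0.006 |
| 10000 | 0.021 | 0.018 | 0.000 | 0.003 |
| 20000 | 0.029 | 0.025 | 0.000 | 0.003 |
| 30000 | 0.023 | 0.020 | 0.000 | 0.003 |
| 40000 | 0.017 | 0.016 | 0.000 | 0.001 |
| 50000 | 0.013 | 0.012 | 0.000 | 0.001 |
| 59980 | 0.014 | 0.011 | 0.000 | 0.003 |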
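As a quick sanity check on the logged values, the short script below tests whether each reported total matches the plain sum of the three logged components. It only inspects the numbers printed in this card; whether the logged skill_cls term already includes its loss weight is not something these values can settle.

```python
# (step, total, joints, skill_cls, progress), copied from the loss-curve table.
ROWS = [
    (0, 15.70, 15.11, 0.582, 0.006),
    (1000, 0.568, 0.555, 0.005, 0.008),
    (5000, 0.107, 0.101, 0.000, 0.006),
    (10000, 0.021, 0.018, 0.000, 0.003),
    (20000, 0.029, 0.025, 0.000, 0.003),
    (30000, 0.023, 0.020, 0.000, 0.003),
    (40000, 0.017, 0.016, 0.000, 0.001),
    (50000, 0.013, 0.012, 0.000, 0.001),
    (59980, 0.014, 0.011, 0.000, 0.003),
]


def sum_gaps(rows):
    """Return {step: |total - (joints + skill_cls + progress)|} per row."""
    return {
        step: abs(total - (joints + skill_cls + progress))
        for step, total, joints, skill_cls, progress in rows
    }


gaps = sum_gaps(ROWS)
max_gap = max(gaps.values())
for step, gap in gaps.items():
    print(f"step {step:>6}: gap {gap:.4f}")
```

Every gap stays within the rounding precision of the printed values, so the logged total appears consistent with an unweighted sum of the logged components.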

Usage

from transformers import AutoModel, AutoConfig

REPO = "Hoshipu/xvla-v20-task0-mp-radio"
CKPT = "ckpt-60000"   # or ckpt-30000 / ckpt-40000 / ckpt-50000

config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model  = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)

Or deploy as an inference WebSocket server (handles all pre/post-processing for OmniGibson observations):

git clone -b v20 https://github.com/markli1hoshipu/behavior1k-xvla.git
cd behavior1k-xvla
bash setup.sh

# Download a single checkpoint to a local dir
huggingface-cli download Hoshipu/xvla-v20-task0-mp-radio \
    --include "ckpt-60000/*" \
    --local-dir ./xvla-v20-task0-mp-radio

cd behavior1k_training
python deploy_b1k.py --model_path ../xvla-v20-task0-mp-radio/ckpt-60000 --port 8000

See INFERENCE_README.md for the protocol.

Files in each ckpt-XXXXX/ subfolder

  • model.safetensors: model weights (3.5 GB, bf16)
  • config.json, preprocessor_config.json, tokenizer*, vocab.json, merges.txt: config / tokenizer
  • modeling_xvla.py, configuration_xvla.py, transformer.py: patched v20 modules (additive task+skill prompts, skill classifier, progress head)
  • modeling_florence2.py, configuration_florence2.py, action_hub.py, processing_xvla.py β€” unchanged from base 2toINF/X-VLA-Pt
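After downloading a checkpoint, it can be useful to confirm that the files listed above are present before loading. The helper below is a minimal sketch using only the standard library; the function name is hypothetical, and the `tokenizer*` entry is treated as a glob because the card does not enumerate the exact tokenizer file names.

```python
from pathlib import Path

# Exact file names taken from the list in this card.
REQUIRED_FILES = [
    "model.safetensors",
    "config.json",
    "preprocessor_config.json",
    "vocab.json",
    "merges.txt",
    "modeling_xvla.py",
    "configuration_xvla.py",
    "transformer.py",
    "modeling_florence2.py",
    "configuration_florence2.py",
    "action_hub.py",
    "processing_xvla.py",
]
# The card lists "tokenizer*" without naming individual files,
# so at least one match for the pattern is required.
REQUIRED_GLOBS = ["tokenizer*"]


def missing_checkpoint_files(ckpt_dir):
    """Return the expected names or patterns that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [pat for pat in REQUIRED_GLOBS if not any(root.glob(pat))]
    return missing
```

For example, `missing_checkpoint_files("./xvla-v20-task0-mp-radio/ckpt-60000")` returns an empty list when the download is complete, and the names of any absent files otherwise.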