X-VLA v20 — BEHAVIOR-1K Task 0 (Turn on Radio)

Fine-tune of 2toINF/X-VLA-Pt on a single BEHAVIOR-1K task (turning on the radio receiver), using the v20 architecture from markli1hoshipu/behavior1k-xvla @ v20.

Available checkpoints

Subfolder	Steps	Total loss	Joints loss	Progress loss
`ckpt-30000/`	30,000	0.0232	0.0204	0.0027
`ckpt-40000/`	40,000	0.0165	0.0158	0.0007
`ckpt-50000/`	50,000	0.0131	0.0117	0.0014
`ckpt-60000/`	60,000 (final)	0.0135	0.0107	0.0028

Each subfolder is fully self-contained; load any of them by passing subfolder="ckpt-XXXXX" (see Usage).
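To make the choice between checkpoints concrete, the sketch below copies the loss values from the table above into a dictionary and picks the subfolder with the lowest value for a given column. The names `CHECKPOINTS` and `best_checkpoint` are illustrative and not part of the repository. These are training losses only, so a lower value here is not a guarantee of a higher task success rate.

```python
# Loss values copied from the checkpoint table above.
CHECKPOINTS = {
    "ckpt-30000": {"total": 0.0232, "joints": 0.0204, "progress": 0.0027},
    "ckpt-40000": {"total": 0.0165, "joints": 0.0158, "progress": 0.0007},
    "ckpt-50000": {"total": 0.0131, "joints": 0.0117, "progress": 0.0014},
    "ckpt-60000": {"total": 0.0135, "joints": 0.0107, "progress": 0.0028},
}


def best_checkpoint(metric="total"):
    """Return the subfolder name with the lowest logged loss for `metric`."""
    return min(CHECKPOINTS, key=lambda name: CHECKPOINTS[name][metric])


if __name__ == "__main__":
    for metric in ("total", "joints", "progress"):
        print(metric, "->", best_checkpoint(metric))
```

By these numbers the final checkpoint has the lowest joints loss, while ckpt-50000 has a slightly lower total loss. The returned name can be passed directly as the `subfolder` argument.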

Architecture (v20)

Additive per-task + per-skill soft prompts (task_prompt_hub[task_id] + skill_prompt_hub[skill_id]), 32 tokens × 1024 dim each, zero-initialized
Skill-conditioned progress head (predicts position within current skill segment)
Skill classifier head on pooled VLM features, trained with sqrt-inverse-frequency weighted cross-entropy (λ=0.1, on detached features)
Skill-enriched language instructions at training time, e.g. "Turn on the radio receiver. Current: move to radio."
23-D action space for the R1Pro robot (3 base-qvel + 4 trunk + 7 arm L + 1 grip L + 7 arm R + 1 grip R), 30-step action horizon, 3 RGB cameras (head + L/R wrist)
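Three of the items above can be sketched in plain Python: the additive soft prompt, the sqrt-inverse-frequency class weights, and the 23-D action layout. This is an illustration under assumptions, not the code in modeling_xvla.py. The action segments are assumed to sit in the vector in the order listed above, the class weights are assumed to be normalized to mean 1, and all function names, the "press button" skill, and the counts are hypothetical.

```python
import math

PROMPT_TOKENS, PROMPT_DIM = 32, 1024  # sizes stated in the list above

# Assumption: segments appear in the 23-D vector in the order the README lists them.
ACTION_LAYOUT = [
    ("base_qvel", 3),
    ("trunk", 4),
    ("arm_left", 7),
    ("gripper_left", 1),
    ("arm_right", 7),
    ("gripper_right", 1),
]
ACTION_DIM = sum(width for _, width in ACTION_LAYOUT)
ACTION_HORIZON = 30


def zero_prompt():
    """A zero-initialized soft prompt of shape (32, 1024) as nested lists."""
    return [[0.0] * PROMPT_DIM for _ in range(PROMPT_TOKENS)]


def combined_prompt(task_prompt, skill_prompt):
    """Element-wise sum of a per-task and a per-skill soft prompt."""
    return [
        [t + s for t, s in zip(task_row, skill_row)]
        for task_row, skill_row in zip(task_prompt, skill_prompt)
    ]


def sqrt_inv_freq_weights(counts):
    """Class weights proportional to 1/sqrt(count), rescaled to mean 1 (assumed)."""
    raw = {name: 1.0 / math.sqrt(n) for name, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: w / mean for name, w in raw.items()}


def split_action(action):
    """Split one 23-D action into named segments, following ACTION_LAYOUT."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    parts, start = {}, 0
    for name, width in ACTION_LAYOUT:
        parts[name] = list(action[start:start + width])
        start += width
    return parts


if __name__ == "__main__":
    print(ACTION_DIM, split_action(list(range(ACTION_DIM)))["trunk"])
    # Hypothetical skill counts, for illustration only.
    print(sqrt_inv_freq_weights({"move to radio": 400, "press button": 4}))
```

Because both prompt hubs start at zero, the combined prompt is also all zeros at step 0, so the fine-tune starts from the base model's behavior and the prompts only take effect as they are trained.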

Training

Setting	Value
Base model	`2toINF/X-VLA-Pt`
Dataset	`Hoshipu/behavior-1k-mp-collected-turning-on-radio` (success split)
Trainable episodes	1,354 (task 0)
Total frames	2,845,413
GPUs	8 × NVIDIA H200
Per-GPU batch	32
Effective batch	256
Precision	bf16
LR (core)	5e-5, cosine decay to 5e-6
LR (VLM, soft prompts)	0.1 × core LR (so 5e-6, decaying to 5e-7)
Warmup	2000 steps
Freeze schedule	VLM + transformer core frozen for first 1000 steps (only soft prompts + action heads train)
Iterations	60,000
Optimizer	AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0
Wall-clock	~14 h 51 m
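The learning-rate settings above can be read as the following schedule. This is a sketch under assumptions: the warmup is taken to be linear from 0, and the cosine decay is taken to run from the end of warmup to step 60,000. The table fixes only the endpoints, and the function names are illustrative.

```python
import math

PEAK_LR = 5e-5            # core LR from the table
FINAL_LR = 5e-6           # cosine decay target from the table
WARMUP_STEPS = 2000
TOTAL_STEPS = 60_000
SLOW_GROUP_SCALE = 0.1    # VLM and soft prompts train at 0.1 x core LR

EFFECTIVE_BATCH = 8 * 32  # 8 GPUs x per-GPU batch of 32


def core_lr(step):
    """Core LR at `step`: linear warmup (assumed), then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


def slow_group_lr(step):
    """LR for the VLM and soft-prompt parameter groups."""
    return SLOW_GROUP_SCALE * core_lr(step)


if __name__ == "__main__":
    for step in (0, 1000, 2000, 30_000, 60_000):
        print(step, core_lr(step), slow_group_lr(step))
```

The effective batch of 256 in the table is the product of the GPU count and the per-GPU batch, which suggests no gradient accumulation was used.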

Loss curve (selected steps)

Step	total	joints	skill_cls	progress
0	15.70	15.11	0.582	0.006
1000	0.568	0.555	0.005	0.008
5000	0.107	0.101	0.000	0.006
10000	0.021	0.018	0.000	0.003
20000	0.029	0.025	0.000	0.003
30000	0.023	0.020	0.000	0.003
40000	0.017	0.016	0.000	0.001
50000	0.013	0.012	0.000	0.001
59980	0.014	0.011	0.000	0.003
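One way to sanity-check the table is to test whether the total equals the sum of the three component columns. The sketch below copies the rows as printed and reports any step where the difference exceeds a rounding tolerance. The tolerance of 0.005 is a choice made for this example. The README does not say whether the logged skill_cls value is already scaled by λ, so this checks only the printed numbers.

```python
# (step, total, joints, skill_cls, progress), copied from the table above.
LOSS_ROWS = [
    (0, 15.70, 15.11, 0.582, 0.006),
    (1000, 0.568, 0.555, 0.005, 0.008),
    (5000, 0.107, 0.101, 0.000, 0.006),
    (10000, 0.021, 0.018, 0.000, 0.003),
    (20000, 0.029, 0.025, 0.000, 0.003),
    (30000, 0.023, 0.020, 0.000, 0.003),
    (40000, 0.017, 0.016, 0.000, 0.001),
    (50000, 0.013, 0.012, 0.000, 0.001),
    (59980, 0.014, 0.011, 0.000, 0.003),
]
TOLERANCE = 0.005  # allows for rounding in the printed values


def residual(row):
    """Printed total minus the sum of the printed components."""
    _, total, joints, skill_cls, progress = row
    return total - (joints + skill_cls + progress)


def rows_outside_tolerance(rows=LOSS_ROWS, tol=TOLERANCE):
    """Steps whose total differs from the component sum by more than `tol`."""
    return [row[0] for row in rows if abs(residual(row)) > tol]


if __name__ == "__main__":
    for row in LOSS_ROWS:
        print(row[0], round(residual(row), 4))
    print("outside tolerance:", rows_outside_tolerance())
```

Every row is consistent with total = joints + skill_cls + progress to within rounding. The small rise in total loss between steps 10000 and 20000 comes from the joints term.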

Usage

from transformers import AutoConfig, AutoModel

REPO = "Hoshipu/xvla-v20-task0-mp-radio"
CKPT = "ckpt-60000"   # or ckpt-30000 / ckpt-40000 / ckpt-50000

# trust_remote_code=True is needed because each subfolder ships its own
# modeling_xvla.py / configuration_xvla.py (see the file list below).
config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model  = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)

Or deploy as an inference WebSocket server (handles all pre/post-processing for OmniGibson observations):

git clone -b v20 https://github.com/markli1hoshipu/behavior1k-xvla.git
cd behavior1k-xvla
bash setup.sh

# Download a single checkpoint to a local dir
huggingface-cli download Hoshipu/xvla-v20-task0-mp-radio \
    --include "ckpt-60000/*" \
    --local-dir ./xvla-v20-task0-mp-radio

cd behavior1k_training
python deploy_b1k.py --model_path ../xvla-v20-task0-mp-radio/ckpt-60000 --port 8000

See INFERENCE_README.md for the protocol.

Files in each `ckpt-XXXXX/` subfolder

model.safetensors — model weights (3.5 GB, bf16)
config.json, preprocessor_config.json, tokenizer*, vocab.json, merges.txt — config / tokenizer
modeling_xvla.py, configuration_xvla.py, transformer.py — patched v20 modules (additive task+skill prompts, skill classifier, progress head)
modeling_florence2.py, configuration_florence2.py, action_hub.py, processing_xvla.py — unchanged from base 2toINF/X-VLA-Pt
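After downloading a single checkpoint, a quick check that the local directory is complete can save a failed load later. The sketch below compares a directory against the file names listed above. The helper name `missing_files` is illustrative, and the check covers presence only, not file size or integrity.

```python
from pathlib import Path

# File names taken from the list above; "tokenizer*" is matched as a glob.
REQUIRED_FILES = [
    "model.safetensors",
    "config.json",
    "preprocessor_config.json",
    "vocab.json",
    "merges.txt",
    "modeling_xvla.py",
    "configuration_xvla.py",
    "transformer.py",
    "modeling_florence2.py",
    "configuration_florence2.py",
    "action_hub.py",
    "processing_xvla.py",
]
TOKENIZER_GLOB = "tokenizer*"


def missing_files(ckpt_dir):
    """Return the expected names that are absent from `ckpt_dir`."""
    root = Path(ckpt_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any(root.glob(TOKENIZER_GLOB)):
        missing.append(TOKENIZER_GLOB)
    return missing


if __name__ == "__main__":
    # Path matches the huggingface-cli download example in Usage.
    print(missing_files("./xvla-v20-task0-mp-radio/ckpt-60000"))
```

An empty list means every listed file is present.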