X-VLA v20: BEHAVIOR-1K Task 0 (Turn on Radio)
Fine-tune of 2toINF/X-VLA-Pt on a single BEHAVIOR-1K task (turning on the radio receiver), using the v20 architecture from markli1hoshipu/behavior1k-xvla @ v20.
Available checkpoints
| Subfolder | Steps | total loss | joints | skill_cls | progress |
|---|---|---|---|---|---|
| ckpt-30000/ | 30,000 | 0.0232 | 0.0204 | 0.0000 | 0.0027 |
| ckpt-40000/ | 40,000 | 0.0165 | 0.0158 | 0.0000 | 0.0007 |
| ckpt-50000/ | 50,000 | 0.0131 | 0.0117 | 0.0000 | 0.0014 |
| ckpt-60000/ | 60,000 (final) | 0.0135 | 0.0107 | 0.0000 | 0.0028 |
Each subfolder is fully self-contained; load any of them with subfolder="ckpt-XXXXX" (see Usage).
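As a small illustration of choosing between the subfolders, the sketch below copies the losses from the table above and picks the checkpoint with the lowest value for a given column. The helper name `best_checkpoint` is hypothetical, and a lower training loss does not by itself guarantee better task success in simulation.

```python
# Training losses copied from the "Available checkpoints" table.
CHECKPOINTS = {
    "ckpt-30000": {"total": 0.0232, "joints": 0.0204, "skill_cls": 0.0000, "progress": 0.0027},
    "ckpt-40000": {"total": 0.0165, "joints": 0.0158, "skill_cls": 0.0000, "progress": 0.0007},
    "ckpt-50000": {"total": 0.0131, "joints": 0.0117, "skill_cls": 0.0000, "progress": 0.0014},
    "ckpt-60000": {"total": 0.0135, "joints": 0.0107, "skill_cls": 0.0000, "progress": 0.0028},
}


def best_checkpoint(metric="total"):
    """Return the subfolder name with the lowest reported value for `metric`.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    return min(CHECKPOINTS, key=lambda name: CHECKPOINTS[name][metric])


if __name__ == "__main__":
    for metric in ("total", "joints", "progress"):
        print(metric, "->", best_checkpoint(metric))
```

By total loss the 50,000-step checkpoint is marginally ahead of the final one, while the final checkpoint has the lowest joints loss, so it may be worth evaluating both.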
Architecture (v20)
- Additive per-task + per-skill soft prompts (task_prompt_hub[task_id] + skill_prompt_hub[skill_id]), 32 tokens × 1024 dim each, zero-initialized
- Skill-conditioned progress head (predicts position within current skill segment)
- Skill classifier head on pooled VLM features, trained with sqrt-inverse-frequency weighted CE (λ=0.1, on detached features)
- Skill-enriched language instructions at training time, e.g. "Turn on the radio receiver. Current: move to radio."
- 23-D action space for the R1Pro robot (3 base-qvel + 4 trunk + 7 arm L + 1 grip L + 7 arm R + 1 grip R), 30-step action horizon, 3 RGB cameras (head + L/R wrist)
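To make the 23-D action layout concrete, here is a minimal sketch that splits one action vector into the named components listed above. The component sizes come from this card; the ordering inside the vector is assumed to follow the order in which the components are listed, and the names `ACTION_LAYOUT` and `split_action` are hypothetical. Check the repository's action_hub.py for the actual convention.

```python
# Component sizes as listed in the model card. ASSUMPTION: the components
# appear in the action vector in the same order as they are listed.
ACTION_LAYOUT = [
    ("base_qvel", 3),
    ("trunk", 4),
    ("arm_left", 7),
    ("gripper_left", 1),
    ("arm_right", 7),
    ("gripper_right", 1),
]
ACTION_DIM = sum(size for _, size in ACTION_LAYOUT)  # 3 + 4 + 7 + 1 + 7 + 1 = 23
HORIZON = 30  # action steps predicted per inference call, per the card


def split_action(action):
    """Split one flat 23-D action into a dict of named component lists."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    parts, start = {}, 0
    for name, size in ACTION_LAYOUT:
        parts[name] = list(action[start:start + size])
        start += size
    return parts


if __name__ == "__main__":
    # A dummy action chunk: HORIZON steps of ACTION_DIM values each.
    chunk = [[float(i) for i in range(ACTION_DIM)] for _ in range(HORIZON)]
    first = split_action(chunk[0])
    print({name: len(values) for name, values in first.items()})
```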
Training
| Setting | Value |
|---|---|
| Base model | 2toINF/X-VLA-Pt |
| Dataset | Hoshipu/behavior-1k-mp-collected-turning-on-radio (success split) |
| Trainable episodes | 1,354 (task 0) |
| Total frames | 2,845,413 |
| GPUs | 8 × NVIDIA H200 |
| Per-GPU batch | 32 |
| Effective batch | 256 |
| Precision | bf16 |
| LR (core) | 5e-5, cosine decay to 5e-6 |
| LR (VLM, soft prompts) | 0.1 × LR (so 5e-6 → 5e-7) |
| Warmup | 2000 steps |
| Freeze schedule | VLM + transformer core frozen for first 1000 steps (only soft prompts + action heads train) |
| Iterations | 60,000 |
| Optimizer | AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0 |
| Wall-clock | ~14 h 51 m |
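The learning-rate settings in the table can be sketched as a function of the training step. The peak and final rates, warmup length, iteration count, and the 0.1 scale for the VLM and soft-prompt group come from the table. The shape of the warmup (linear) and the choice to start the cosine decay after warmup are assumptions, and the freeze schedule is not modelled here. The function names are hypothetical.

```python
import math

PEAK_LR = 5e-5        # core learning rate (from the table)
MIN_LR = 5e-6         # value the cosine decay ends at (from the table)
WARMUP_STEPS = 2000   # from the table
TOTAL_STEPS = 60_000  # from the table
VLM_LR_SCALE = 0.1    # VLM / soft-prompt groups use 0.1 x the core LR


def core_lr(step):
    """Core learning rate at `step`.

    ASSUMPTIONS: warmup is linear, and the cosine decay runs from the end
    of warmup to TOTAL_STEPS. The training script may differ.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


def vlm_lr(step):
    """Learning rate for the VLM and soft-prompt parameter groups."""
    return VLM_LR_SCALE * core_lr(step)


if __name__ == "__main__":
    for step in (0, 1000, 2000, 30_000, 60_000):
        print(step, f"core={core_lr(step):.2e}", f"vlm={vlm_lr(step):.2e}")
```

Under these assumptions the VLM group peaks at 5e-6 and ends at 5e-7, matching the range given in the table.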
Loss curve (selected steps)
| Step | total | joints | skill_cls | progress |
|---|---|---|---|---|
| 0 | 15.70 | 15.11 | 0.582 | 0.006 |
| 1000 | 0.568 | 0.555 | 0.005 | 0.008 |
| 5000 | 0.107 | 0.101 | 0.000 | 0.006 |
| 10000 | 0.021 | 0.018 | 0.000 | 0.003 |
| 20000 | 0.029 | 0.025 | 0.000 | 0.003 |
| 30000 | 0.023 | 0.020 | 0.000 | 0.003 |
| 40000 | 0.017 | 0.016 | 0.000 | 0.001 |
| 50000 | 0.013 | 0.012 | 0.000 | 0.001 |
| 59980 | 0.014 | 0.011 | 0.000 | 0.003 |
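The tabulated values appear consistent, up to rounding, with the total being the sum of the three component columns. The check below only tests that arithmetic on the numbers above. Whether the skill_cls column is reported before or after the λ=0.1 weighting is not stated in this card, so treat the interpretation as an assumption.

```python
# (step, total, joints, skill_cls, progress), copied from the loss-curve table.
ROWS = [
    (0, 15.70, 15.11, 0.582, 0.006),
    (1000, 0.568, 0.555, 0.005, 0.008),
    (5000, 0.107, 0.101, 0.000, 0.006),
    (10000, 0.021, 0.018, 0.000, 0.003),
    (20000, 0.029, 0.025, 0.000, 0.003),
    (30000, 0.023, 0.020, 0.000, 0.003),
    (40000, 0.017, 0.016, 0.000, 0.001),
    (50000, 0.013, 0.012, 0.000, 0.001),
    (59980, 0.014, 0.011, 0.000, 0.003),
]


def residual(row):
    """Absolute gap between the reported total and the sum of its components."""
    _, total, joints, skill_cls, progress = row
    return abs(total - (joints + skill_cls + progress))


max_residual = max(residual(row) for row in ROWS)

if __name__ == "__main__":
    for row in ROWS:
        print(f"step {row[0]:>5}: residual {residual(row):.4f}")
    print(f"largest residual: {max_residual:.4f}")
```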
Usage
from transformers import AutoConfig, AutoModel

REPO = "Hoshipu/xvla-v20-task0-mp-radio"
CKPT = "ckpt-60000"  # or ckpt-30000 / ckpt-40000 / ckpt-50000

# trust_remote_code=True lets Transformers run the custom modeling code
# (modeling_xvla.py etc.) shipped inside each checkpoint subfolder.
config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
Or deploy as an inference WebSocket server (handles all pre/post-processing for OmniGibson observations):
git clone -b v20 https://github.com/markli1hoshipu/behavior1k-xvla.git
cd behavior1k-xvla
bash setup.sh
# Download a single checkpoint to a local dir
huggingface-cli download Hoshipu/xvla-v20-task0-mp-radio \
  --include "ckpt-60000/*" \
  --local-dir ./xvla-v20-task0-mp-radio
cd behavior1k_training
python deploy_b1k.py --model_path ../xvla-v20-task0-mp-radio/ckpt-60000 --port 8000
See INFERENCE_README.md for the protocol.
Files in each ckpt-XXXXX/ subfolder
- model.safetensors: model weights (3.5 GB, bf16)
- config.json, preprocessor_config.json, tokenizer*, vocab.json, merges.txt: config / tokenizer
- modeling_xvla.py, configuration_xvla.py, transformer.py: patched v20 modules (additive task+skill prompts, skill classifier, progress head)
- modeling_florence2.py, configuration_florence2.py, action_hub.py, processing_xvla.py: unchanged from base 2toINF/X-VLA-Pt