X-VLA v20: Pick-Up Specialist on BEHAVIOR-1K Tasks 0, 1, 6

Fine-tune of 2toINF/X-VLA-Pt on three BEHAVIOR-1K tasks, with training data filtered to the pick-up sub-skill segments (plus the press segments for task 0). Uses the v20 architecture from markli1hoshipu/behavior1k-xvla @ v20.

What this model is

A pick-up specialist. Trained on the official behavior-1k/2025-challenge-demos data for tasks:

| Task | Description | Allowed skills during training |
| --- | --- | --- |
| 0 | Turn on the radio receiver | pick up from, press |
| 1 | Put soda cans in the trash can | pick up from |
| 6 | Take Easter eggs out of the wicker basket | pick up from |

Frames whose active skill segment was not in the allowed list (navigation, placing, etc.) were skipped at sample time. The result is a model that learns dexterous grasping and pressing motions without spending capacity on the navigation and placing phases.

The skill filter is implemented via the B1K_ALLOWED_SKILLS_PER_TASK environment variable added in v20, a JSON map from task ID to the list of allowed skill IDs:

B1K_ALLOWED_SKILLS_PER_TASK='{"0":[2,67],"1":[2],"6":[2]}'
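As a rough illustration of how such a filter could be applied at sample time, the sketch below parses the variable and decides whether a frame is kept. It is not the repository's implementation: the function names, the `(task_id, skill_id)` frame representation, and the choice to leave unlisted tasks unfiltered are all assumptions made for the example.

```python
import json
import os

ENV_VAR = "B1K_ALLOWED_SKILLS_PER_TASK"


def load_allowed_skills(default: str = "{}") -> dict[int, set[int]]:
    """Parse the JSON map of task ID -> allowed skill IDs from the environment."""
    raw = json.loads(os.environ.get(ENV_VAR, default))
    # JSON object keys are always strings, so convert them to integer task IDs.
    return {int(task): set(skills) for task, skills in raw.items()}


def keep_frame(task_id: int, skill_id: int, allowed: dict[int, set[int]]) -> bool:
    """Return True if a frame with this active skill should be sampled.

    Assumption: a task with no entry in the map is left unfiltered.
    """
    if task_id not in allowed:
        return True
    return skill_id in allowed[task_id]


# Same value as in the model card.
os.environ[ENV_VAR] = '{"0":[2,67],"1":[2],"6":[2]}'
allowed = load_allowed_skills()

# Hypothetical (task_id, skill_id) pairs standing in for dataset frames.
frames = [(0, 2), (0, 67), (0, 5), (1, 67), (6, 2)]
kept = [frame for frame in frames if keep_frame(*frame, allowed)]
print(kept)
```

Here skill 67 is kept for task 0 but dropped for task 1, which matches the per-task allow lists in the table above.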

Training

| Setting | Value |
| --- | --- |
| Base model | 2toINF/X-VLA-Pt |
| Dataset | behavior-1k/2025-challenge-demos (tasks 0, 1, 6) |
| Trainable episodes | 600 (200 per task) |
| Total frames | 3,007,423 in dataset; ≈ 1,099,352 trainable after skill filter |
| GPUs | 4 × NVIDIA H200 |
| Per-GPU batch | 32 |
| Effective batch | 128 |
| Precision | bf16 |
| LR (core) | 1e-4, cosine decay to 1e-5 |
| LR (VLM, soft prompts) | 0.1 × LR |
| Warmup | 2000 steps |
| Iterations | 60,000 |
| Optimizer | AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0 |
| Skill class weights | recomputed for filtered distribution (only 2 active skills) |
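To make the schedule and batch arithmetic concrete, here is a minimal sketch of the learning-rate curve implied by the settings above. The card does not state the warmup shape or whether the cosine phase starts after warmup, so linear warmup followed by cosine decay over the remaining steps is an assumption, as are the function names. The pass count assumes one sample corresponds to one trainable frame.

```python
import math

BASE_LR = 1e-4        # core learning rate
MIN_LR = 1e-5         # floor reached at the end of cosine decay
WARMUP_STEPS = 2_000
TOTAL_STEPS = 60_000
VLM_LR_SCALE = 0.1    # VLM and soft prompts train at 0.1 x the core LR


def core_lr(step: int) -> float:
    """Core learning rate at a given step (assumed linear warmup, then cosine)."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def vlm_lr(step: int) -> float:
    """Learning rate for the VLM and soft-prompt parameter group."""
    return VLM_LR_SCALE * core_lr(step)


effective_batch = 4 * 32                        # GPUs x per-GPU batch
samples_seen = TOTAL_STEPS * effective_batch    # samples drawn over the run
passes = samples_seen / 1_099_352               # approx. passes over filtered frames

for step in (0, 2_000, 30_000, 60_000):
    print(step, f"{core_lr(step):.2e}", f"{vlm_lr(step):.2e}")
print(effective_batch, samples_seen, round(passes, 2))
```

Under these assumptions the run draws 7,680,000 samples, roughly seven passes over the filtered frames.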

Filtered skill distribution

| Skill | Frames | Frame % | Weight |
| --- | --- | --- | --- |
| pick up from | 1,041,232 | 94.7% | 0.382 |
| press | 58,120 | 5.3% | 1.618 |
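The card does not say how the class weights were computed. One scheme that reproduces the listed values to three decimals is inverse square-root frequency weighting, rescaled so the weights average to 1. The sketch below is that inferred scheme, not the confirmed training code, and the function name is hypothetical.

```python
import math

# Frame counts from the filtered skill distribution.
frames = {"pick up from": 1_041_232, "press": 58_120}


def inverse_sqrt_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by 1/sqrt(frequency), rescaled to a mean of 1."""
    total = sum(counts.values())
    raw = {name: 1.0 / math.sqrt(n / total) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: w * scale for name, w in raw.items()}


weights = inverse_sqrt_weights(frames)
total_frames = sum(frames.values())
for name, n in frames.items():
    share = 100.0 * n / total_frames
    print(f"{name:<13} {share:5.1f}%  weight {weights[name]:.3f}")
```

The square root softens the correction: plain inverse-frequency weighting would give press a weight about 18 times that of pick up from, whereas this scheme gives a ratio of about 4.2.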

Loss trajectory (selected steps)

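| Step | total | joints | skill_cls | progress |
| --- | --- | --- | --- | --- |
| 0 | 14.94 | 14.49 | 0.438 | 0.017 |
| 1000 | 0.862 | 0.854 | 0.000 | 0.007 |
| 10000 | 0.028 | 0.026 | 0.000 | 0.002 |
| 20000 | 0.128 | 0.109 | 0.014 | 0.004 |
| 30000 | 0.052 | 0.048 | 0.000 | 0.005 |
| 40000 | 0.069 | 0.066 | 0.000 | 0.003 |
| 50000 | 0.019 | 0.016 | 0.000 | 0.003 |
| 59940 | 0.041 | 0.038 | 0.000 | 0.003 |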

Architecture (v20)

  • Additive per-task + per-skill soft prompts (task_prompt_hub[task_id] + skill_prompt_hub[skill_id])
  • Skill-conditioned progress head
  • Skill classifier head with dataset-specific weighted CE
  • Skill-enriched language instructions
  • 23-D R1Pro action space, 30-step horizon, 3 RGB cameras
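The additive soft-prompt lookup in the first bullet can be pictured with a small toy example. Only the names `task_prompt_hub` and `skill_prompt_hub` come from the card. The prompt length, embedding width, initialization, and the use of plain Python lists in place of learnable embedding tensors are assumptions chosen to keep the sketch dependency-free.

```python
import random

PROMPT_LEN = 4   # hypothetical number of soft-prompt tokens
EMBED_DIM = 8    # hypothetical embedding width


def make_prompt(rng: random.Random) -> list[list[float]]:
    """Create one soft prompt of shape (PROMPT_LEN, EMBED_DIM)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
            for _ in range(PROMPT_LEN)]


rng = random.Random(0)
# One prompt per task and one per skill; in a real model these would be
# learnable parameters rather than fixed lists.
task_prompt_hub = {task_id: make_prompt(rng) for task_id in (0, 1, 6)}
skill_prompt_hub = {skill_id: make_prompt(rng) for skill_id in (2, 67)}


def soft_prompt(task_id: int, skill_id: int) -> list[list[float]]:
    """Element-wise sum of the task prompt and the skill prompt."""
    task_p = task_prompt_hub[task_id]
    skill_p = skill_prompt_hub[skill_id]
    return [[t + s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(task_p, skill_p)]


prompt = soft_prompt(task_id=0, skill_id=67)
print(len(prompt), len(prompt[0]))
```

Because the two prompts are summed rather than concatenated, the prompt length stays fixed and a skill prompt is shared across every task that uses that skill.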

Usage

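from transformers import AutoConfig, AutoModel

REPO = "Hoshipu/xvla-v20-tasks016-pickup"
CKPT = "ckpt-60000"  # checkpoint subfolder inside the repo

# trust_remote_code=True lets transformers load the custom modeling files shipped with the checkpoint.
config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)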

For inference, you'll typically want to gate this model on the active high-level skill: invoke it only when the system is in a pick-up phase. The deploy server in behavior1k_training/deploy_b1k.py predicts the skill class automatically from the VLM features.
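A minimal sketch of such gating is shown below. It does not reflect the API of deploy_b1k.py. The `specialist` and `fallback` callables, the function name, and the reuse of the training allow lists as the gating condition are all assumptions for illustration.

```python
from typing import Any, Callable, Optional

# Skills this checkpoint was trained on, per task (from the training filter).
TRAINED_SKILLS = {0: {2, 67}, 1: {2}, 6: {2}}

Policy = Callable[[Any], Any]


def gated_action(task_id: int, skill_id: int, observation: Any,
                 specialist: Policy,
                 fallback: Optional[Policy] = None) -> Any:
    """Call the specialist only for skills it was trained on.

    For any other skill, defer to the fallback policy if one is given,
    otherwise return None so the caller can decide what to do.
    """
    if skill_id in TRAINED_SKILLS.get(task_id, set()):
        return specialist(observation)
    if fallback is not None:
        return fallback(observation)
    return None


# Stub policies standing in for real models.
specialist = lambda obs: ("specialist", obs)
generalist = lambda obs: ("generalist", obs)

print(gated_action(0, 67, "obs", specialist, generalist))
print(gated_action(1, 67, "obs", specialist, generalist))
print(gated_action(6, 5, "obs", specialist))
```

Gating per task rather than on a single global skill set keeps the specialist away from press segments in tasks 1 and 6, where it saw no press data.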

Files (under ckpt-60000/)

  • model.safetensors: final 60k-step weights (3.5 GB, bf16)
  • config.json, preprocessor_config.json, tokenizer*, vocab.json, merges.txt
  • modeling_xvla.py, configuration_xvla.py, transformer.py: patched v20 modules
  • modeling_florence2.py, configuration_florence2.py, action_hub.py, processing_xvla.py: unchanged from base