X-VLA v20 — Pick-Up Specialist on BEHAVIOR-1K Tasks 0, 1, 6

Fine-tune of 2toINF/X-VLA-Pt on three BEHAVIOR-1K tasks, with training data filtered to the pick-up sub-skill segments (plus `press` segments for task 0). Uses the v20 architecture from markli1hoshipu/behavior1k-xvla @ v20.

What this model is

A pick-up specialist, trained on the official behavior-1k/2025-challenge-demos data for the following tasks:

Task	Description	Allowed skills during training
0	Turn on the radio receiver	`pick up from`, `press`
1	Put soda cans in the trash can	`pick up from`
6	Take Easter eggs out of the wicker basket	`pick up from`

Frames whose active skill segment was not in the task's allowed list (navigation, placing, etc.) were skipped at sample time. The result is a model that learns dexterous grasping and pressing motions without spending capacity on the navigation and placing phases.

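The skill filter is implemented via the B1K_ALLOWED_SKILLS_PER_TASK env var added in v20, a JSON object that maps each task ID to its allowed skill IDs. Judging by the table above, skill ID 2 corresponds to `pick up from` and skill ID 67 to `press`: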

B1K_ALLOWED_SKILLS_PER_TASK='{"0":[2,67],"1":[2],"6":[2]}'
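A minimal sketch of how a per-task skill filter like this could be applied at sample time. The function names `load_allowed_skills` and `is_trainable` are hypothetical and not taken from the repository, and the handling of tasks missing from the mapping is an assumption.

```python
import json
import os


def load_allowed_skills(env_name="B1K_ALLOWED_SKILLS_PER_TASK"):
    """Parse the env var into {task_id: set(skill_ids)}; empty dict if unset."""
    raw = os.environ.get(env_name, "")
    if not raw:
        return {}
    # JSON object keys are always strings, so convert them back to ints.
    return {int(task): set(skills) for task, skills in json.loads(raw).items()}


def is_trainable(task_id, skill_id, allowed):
    """Keep a frame only if its active skill is allowed for its task.

    Assumption: tasks absent from the mapping are left unfiltered.
    """
    if task_id not in allowed:
        return True
    return skill_id in allowed[task_id]


os.environ["B1K_ALLOWED_SKILLS_PER_TASK"] = '{"0":[2,67],"1":[2],"6":[2]}'
allowed = load_allowed_skills()
print(is_trainable(0, 67, allowed))  # skill 67 on task 0 is kept
print(is_trainable(1, 67, allowed))  # skill 67 on task 1 is skipped
```

Filtering at sample time rather than rewriting the dataset keeps the original episodes intact, so the allowed list can change between runs without re-exporting any data.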

Training

Setting	Value
Base model	`2toINF/X-VLA-Pt`
Dataset	`behavior-1k/2025-challenge-demos` (tasks 0, 1, 6)
Trainable episodes	600 (200 per task)
Total frames in dataset	3,007,423; trainable after skill filter: 1,099,352 (about 36.6%)
GPUs	4 × NVIDIA H200
Per-GPU batch	32
Effective batch	128
Precision	bf16
LR (core)	1e-4, cosine decay to 1e-5
LR (VLM, soft prompts)	0.1 × core LR (1e-5 at peak)
Warmup	2000 steps
Iterations	60,000
Optimizer	AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0
Skill class weights	recomputed for filtered distribution (only 2 active skills)
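The learning-rate settings in the table can be sketched as a plain function of the step. This is an illustration only: the table does not say whether warmup is linear, whether the 2000 warmup steps count toward the 60,000 iterations, or whether the 0.1 multiplier follows the same schedule, so all three are assumptions here, and the function names are hypothetical.

```python
import math

BASE_LR = 1e-4     # core learning rate from the table
MIN_LR = 1e-5      # cosine floor from the table
WARMUP = 2000      # warmup steps from the table
TOTAL = 60_000     # iterations from the table
VLM_SCALE = 0.1    # multiplier for VLM and soft-prompt parameters


def core_lr(step):
    """Linear warmup (assumed), then cosine decay from BASE_LR to MIN_LR."""
    if step < WARMUP:
        return BASE_LR * step / WARMUP
    progress = min(1.0, (step - WARMUP) / (TOTAL - WARMUP))
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def vlm_lr(step):
    """Assumed to track the core schedule at a fixed 0.1 ratio."""
    return VLM_SCALE * core_lr(step)


effective_batch = 4 * 32  # GPUs x per-GPU batch = 128, as in the table

for step in (0, 2000, 31_000, 60_000):
    print(step, f"{core_lr(step):.2e}", f"{vlm_lr(step):.2e}")
```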

Filtered skill distribution

Skill	Frames	Frame %	Class weight
pick up from	1,041,232	94.7%	0.382
press	58,120	5.3%	1.618
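The card does not state how the class weights were computed. One scheme that happens to reproduce the listed values from the frame counts is inverse square-root frequency weighting, rescaled so the weights average to 1. The sketch below is an inference from the numbers, not the repository's confirmed formula.

```python
import math

# Frame counts from the table above.
frames = {"pick up from": 1_041_232, "press": 58_120}
total = sum(frames.values())

# Inverse square-root frequency: a common way to soften class imbalance.
raw = {name: 1.0 / math.sqrt(count / total) for name, count in frames.items()}

# Rescale so the weights sum to the number of classes (mean weight of 1).
scale = len(raw) / sum(raw.values())
weights = {name: value * scale for name, value in raw.items()}

for name in frames:
    print(f"{name}: {frames[name] / total:.1%} of frames, weight {weights[name]:.3f}")
```

Under this scheme the rare `press` class is upweighted by roughly 4.2× relative to `pick up from`, well below the roughly 18× that plain inverse-frequency weighting would give.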

Loss trajectory (selected steps)

Step	total	joints	skill_cls	progress
0	14.94	14.49	0.438	0.017
1000	0.862	0.854	0.000	0.007
10000	0.028	0.026	0.000	0.002
20000	0.128	0.109	0.014	0.004
30000	0.052	0.048	0.000	0.005
40000	0.069	0.066	0.000	0.003
50000	0.019	0.016	0.000	0.003
59940	0.041	0.038	0.000	0.003
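As a quick sanity check on the table, the logged totals match the unweighted sum of the three component losses to within the rounding of the printed values. The snippet below only checks the arithmetic of the numbers shown; it says nothing about how the training code combines the terms.

```python
# (step, total, joints, skill_cls, progress) copied from the table above.
rows = [
    (0, 14.94, 14.49, 0.438, 0.017),
    (1000, 0.862, 0.854, 0.000, 0.007),
    (10000, 0.028, 0.026, 0.000, 0.002),
    (20000, 0.128, 0.109, 0.014, 0.004),
    (30000, 0.052, 0.048, 0.000, 0.005),
    (40000, 0.069, 0.066, 0.000, 0.003),
    (50000, 0.019, 0.016, 0.000, 0.003),
    (59940, 0.041, 0.038, 0.000, 0.003),
]

# Absolute gap between the logged total and the sum of its components.
gaps = {
    step: abs(total - (joints + skill_cls + progress))
    for step, total, joints, skill_cls, progress in rows
}
max_gap = max(gaps.values())

for step, gap in gaps.items():
    print(step, f"{gap:.3f}")
```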

Architecture (v20)

Additive per-task + per-skill soft prompts (task_prompt_hub[task_id] + skill_prompt_hub[skill_id])
Skill-conditioned progress head
Skill classifier head with dataset-specific weighted CE
Skill-enriched language instructions
23-D R1Pro action space, 30-step horizon, 3 RGB cameras
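The additive soft-prompt lookup in the first item can be illustrated in plain Python. The hub names come from the list above, but the prompt length, hidden size, initialization and the helpers `make_hub` and `soft_prompt` are hypothetical. In the real model these would be learned tensors, not random lists.

```python
import random

NUM_PROMPT_TOKENS = 4  # hypothetical prompt length
HIDDEN_DIM = 8         # hypothetical embedding width


def make_hub(ids, seed):
    """Build {id: prompt}, where each prompt is a tokens x dim list of floats."""
    rng = random.Random(seed)
    return {
        i: [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
            for _ in range(NUM_PROMPT_TOKENS)]
        for i in ids
    }


task_prompt_hub = make_hub(ids=(0, 1, 6), seed=0)   # one prompt per task
skill_prompt_hub = make_hub(ids=(2, 67), seed=1)    # one prompt per skill


def soft_prompt(task_id, skill_id):
    """Element-wise sum of the task prompt and the skill prompt."""
    task = task_prompt_hub[task_id]
    skill = skill_prompt_hub[skill_id]
    return [[t + s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(task, skill)]


prompt = soft_prompt(task_id=0, skill_id=67)
print(len(prompt), len(prompt[0]))  # prompt tokens, hidden size
```

Summing the two prompts keeps the prompt length fixed and lets the same skill prompt be shared across every task that uses that skill, instead of storing one prompt per (task, skill) pair.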

Usage

from transformers import AutoModel, AutoConfig

REPO = "Hoshipu/xvla-v20-tasks016-pickup"
CKPT = "ckpt-60000"  # checkpoint subfolder inside the repo

# trust_remote_code=True is needed because the repo ships custom modeling code.
config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model  = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)

For inference, you'll typically want to gate this model on the active high-level skill — invoke it only when the system is in a pick-up phase. The deploy server in behavior1k_training/deploy_b1k.py predicts the skill class automatically from the VLM features.
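A minimal sketch of the gating idea, assuming the surrounding system already has a predicted skill ID and some other policy for the remaining phases. `select_policy`, the two stand-in policies and the non-pick-up skill ID 5 are hypothetical; the actual routing logic lives in the deploy server and is not reproduced here.

```python
# Skill IDs this checkpoint was trained on, taken from the env var setting.
# Note that skill 67 was only allowed for task 0 during training.
SPECIALIST_SKILLS = {2, 67}


def select_policy(predicted_skill_id, specialist, fallback):
    """Route to the specialist only during skills it was trained on."""
    return specialist if predicted_skill_id in SPECIALIST_SKILLS else fallback


# Hypothetical stand-ins for real policies; each returns a label, not actions.
def specialist_policy(observation):
    return "pickup-specialist"


def generalist_policy(observation):
    return "generalist"


for skill_id in (2, 67, 5):
    policy = select_policy(skill_id, specialist_policy, generalist_policy)
    print(skill_id, policy(observation=None))
```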

Files (under `ckpt-60000/`)

model.safetensors — final 60k-step weights (3.5 GB, bf16)
config.json, preprocessor_config.json, tokenizer*, vocab.json, merges.txt
modeling_xvla.py, configuration_xvla.py, transformer.py — patched v20 modules
modeling_florence2.py, configuration_florence2.py, action_hub.py, processing_xvla.py — unchanged from base

Model tree for Hoshipu/xvla-v20-tasks016-pickup: microsoft/Florence-2-large (base model), 2toINF/X-VLA-Pt (fine-tuned)