---
license: mit
library_name: transformers
tags:
- robotics
- vla
- behavior-1k
- x-vla
- pick-up
base_model: 2toINF/X-VLA-Pt
---

# X-VLA v20: Pick-Up Specialist on BEHAVIOR-1K Tasks 0, 1, 6

Fine-tune of [`2toINF/X-VLA-Pt`](https://huggingface.co/2toINF/X-VLA-Pt) on three BEHAVIOR-1K tasks, with training data **filtered to only the pick-up sub-skill segments**. Uses the **v20 architecture** from [markli1hoshipu/behavior1k-xvla @ v20](https://github.com/markli1hoshipu/behavior1k-xvla/tree/v20).

## What this model is

A pick-up specialist. Trained on the official [`behavior-1k/2025-challenge-demos`](https://huggingface.co/datasets/behavior-1k/2025-challenge-demos) data for tasks:

| Task | Description | Allowed skills during training |
|---|---|---|
| 0 | Turn on the radio receiver | `pick up from`, `press` |
| 1 | Put soda cans in the trash can | `pick up from` |
| 6 | Take Easter eggs out of the wicker basket | `pick up from` |

Frames whose active skill segment was not in the allowed list (navigation, placing, etc.) were skipped at sample time. The result is a model that learns dexterous grasping + press-on motions without spending capacity on the navigation/place phases.
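
The frame filtering described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names `parse_allowed_skills` and `keep_frame`, the `(task_id, skill_id)` frame tuples, and the non-allowed skill ID `5` are hypothetical. The JSON value is the one this card gives for `B1K_ALLOWED_SKILLS_PER_TASK`; reading ID `2` as `pick up from` and `67` as `press` is inferred from the task table, and rejecting tasks missing from the mapping is an assumption.

```python
import json
import os

# Value copied from this card; the training code reads it from the environment.
CARD_VALUE = '{"0":[2,67],"1":[2],"6":[2]}'


def parse_allowed_skills(raw):
    """Turn the JSON string into {task_id: set_of_allowed_skill_ids}."""
    return {int(task): set(skills) for task, skills in json.loads(raw).items()}


def keep_frame(task_id, skill_id, allowed):
    """Return True if the frame's active skill is allowed for its task.

    Assumption: a task missing from the mapping keeps no frames.
    """
    return skill_id in allowed.get(task_id, set())


allowed = parse_allowed_skills(
    os.environ.get("B1K_ALLOWED_SKILLS_PER_TASK", CARD_VALUE)
)

# Hypothetical frames as (task_id, active_skill_id) pairs.
frames = [(0, 2), (0, 67), (0, 5), (1, 2), (1, 67), (6, 2)]
kept = [f for f in frames if keep_frame(f[0], f[1], allowed)]
print(kept)
```

Filtering at sample time, rather than rewriting the dataset, keeps the original episodes intact and lets the same data serve other skill subsets by changing only the env var.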

The skill filter is implemented via the `B1K_ALLOWED_SKILLS_PER_TASK` env var added in v20:

```bash
B1K_ALLOWED_SKILLS_PER_TASK='{"0":[2,67],"1":[2],"6":[2]}'
```

## Training

| Setting | Value |
|---|---|
| Base model | `2toINF/X-VLA-Pt` |
| Dataset | [`behavior-1k/2025-challenge-demos`](https://huggingface.co/datasets/behavior-1k/2025-challenge-demos) (tasks 0, 1, 6) |
| Trainable episodes | 600 (200 per task) |
| Total frames in dataset | 3,007,423; trainable after skill filter ≈ **1,099,352** |
| GPUs | 4 × NVIDIA H200 |
| Per-GPU batch | 32 |
| Effective batch | 128 |
| Precision | bf16 |
| LR (core) | **1e-4**, cosine decay to 1e-5 |
| LR (VLM, soft prompts) | 0.1 × core LR |
| Warmup | 2000 steps |
| Iterations | 60,000 |
| Optimizer | AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0 |
| Skill class weights | recomputed for filtered distribution (only 2 active skills) |

### Filtered skill distribution

| Skill | Frames | Frame % | Weight |
|---|---|---|---|
| pick up from | 1,041,232 | 94.7% | 0.382 |
| press | 58,120 | 5.3% | 1.618 |

### Loss trajectory (selected steps)

| Step | total | joints | skill_cls | progress |
|---|---|---|---|---|
| 0 | 14.94 | 14.49 | 0.438 | 0.017 |
| 1000 | 0.862 | 0.854 | 0.000 | 0.007 |
| 10000 | 0.028 | 0.026 | 0.000 | 0.002 |
| 20000 | 0.128 | 0.109 | 0.014 | 0.004 |
| 30000 | 0.052 | 0.048 | 0.000 | 0.005 |
| 40000 | 0.069 | 0.066 | 0.000 | 0.003 |
| 50000 | 0.019 | 0.016 | 0.000 | 0.003 |
| 59940 | 0.041 | 0.038 | 0.000 | 0.003 |

## Architecture (v20)

- **Additive per-task + per-skill soft prompts** (`task_prompt_hub[task_id] + skill_prompt_hub[skill_id]`)
- **Skill-conditioned progress head**
- **Skill classifier head** with dataset-specific weighted CE
- **Skill-enriched language instructions**
- 23-D R1Pro action space, 30-step horizon, 3 RGB cameras

## Usage

```python
from transformers import AutoModel, AutoConfig

REPO = "Hoshipu/xvla-v20-tasks016-pickup"
CKPT = "ckpt-60000"  # checkpoint subfolder inside the repo

# trust_remote_code=True is required because the repo ships its own modeling files.
config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
```

For inference, you'll typically want to **gate** this model on the active high-level skill: invoke it only when the system is in a pick-up phase. The deploy server in `behavior1k_training/deploy_b1k.py` predicts the skill class automatically from the VLM features.

## Files (under `ckpt-60000/`)

- `model.safetensors`: final 60k-step weights (3.5 GB, bf16)
- `config.json`, `preprocessor_config.json`, `tokenizer*`, `vocab.json`, `merges.txt`
- `modeling_xvla.py`, `configuration_xvla.py`, `transformer.py`: patched v20 modules
- `modeling_florence2.py`, `configuration_florence2.py`, `action_hub.py`, `processing_xvla.py`: unchanged from base
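
As a sanity check on the "Filtered skill distribution" table, the sketch below recomputes class weights from the frame counts. This card does not state the weighting formula; inverse-square-root frequency weighting, rescaled so the weights average to 1, is an assumption that happens to be consistent with the listed values. The function name `inv_sqrt_weights` is hypothetical.

```python
import math


def inv_sqrt_weights(frame_counts):
    """Weights proportional to 1/sqrt(class frequency), rescaled to mean 1.

    Assumption: this formula is inferred from the numbers in the card,
    not taken from the training code.
    """
    total = sum(frame_counts.values())
    raw = {k: 1.0 / math.sqrt(n / total) for k, n in frame_counts.items()}
    scale = len(raw) / sum(raw.values())  # makes the weights average to 1
    return {k: v * scale for k, v in raw.items()}


# Frame counts from the "Filtered skill distribution" table.
counts = {"pick up from": 1_041_232, "press": 58_120}
for skill, weight in inv_sqrt_weights(counts).items():
    print(f"{skill}: {weight:.3f}")
```

Under this assumption the rare `press` class is upweighted by roughly the square root of the frame ratio instead of the full ratio, which tempers the imbalance correction.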