---
license: mit
library_name: transformers
tags:
- robotics
- vla
- behavior-1k
- x-vla
- pick-up
base_model: 2toINF/X-VLA-Pt
---

# X-VLA v20 — Pick-Up Specialist on BEHAVIOR-1K Tasks 0, 1, 6

Fine-tune of [`2toINF/X-VLA-Pt`](https://huggingface.co/2toINF/X-VLA-Pt) on three BEHAVIOR-1K tasks, with training data **filtered to only the pick-up sub-skill segments**. Uses the **v20 architecture** from [markli1hoshipu/behavior1k-xvla @ v20](https://github.com/markli1hoshipu/behavior1k-xvla/tree/v20).

## What this model is

A pick-up specialist, trained on the official [`behavior-1k/2025-challenge-demos`](https://huggingface.co/datasets/behavior-1k/2025-challenge-demos) data for the following tasks:

| Task | Description | Allowed skills during training |
|---|---|---|
| 0 | Turn on the radio receiver | `pick up from`, `press` |
| 1 | Put soda cans in the trash can | `pick up from` |
| 6 | Take Easter eggs out of the wicker basket | `pick up from` |

Frames whose active skill segment was not in the task's allowed list (navigation, placing, etc.) were skipped at sampling time. The resulting model learns dexterous grasping and pressing motions without spending capacity on the navigation and placement phases.

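The skill filter is implemented via the `B1K_ALLOWED_SKILLS_PER_TASK` environment variable added in v20. Its value is a JSON object mapping each task ID to the list of allowed skill IDs; comparing it with the table above, skill ID 2 appears to correspond to `pick up from` and 67 to `press`: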
```bash
B1K_ALLOWED_SKILLS_PER_TASK='{"0":[2,67],"1":[2],"6":[2]}'
```
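
To make the filtering concrete, here is a minimal sketch of how a value like this could be parsed and applied at sampling time. It is not the repository's implementation: `parse_allowed_skills` and `keep_frame` are hypothetical helpers, and keeping every frame of a task that has no entry in the mapping is an assumption.

```python
import json
import os

def parse_allowed_skills(raw):
    """Parse the JSON mapping into {task_id: set(skill_ids)} with integer keys."""
    if not raw:
        return {}
    return {int(task): set(skills) for task, skills in json.loads(raw).items()}

def keep_frame(task_id, skill_id, allowed):
    """Return True if a frame with this active skill should be sampled."""
    if task_id not in allowed:  # assumption: tasks without an entry are unfiltered
        return True
    return skill_id in allowed[task_id]

raw = os.environ.get(
    "B1K_ALLOWED_SKILLS_PER_TASK",
    '{"0":[2,67],"1":[2],"6":[2]}',  # value used for this model
)
allowed = parse_allowed_skills(raw)
```

With the value above, `keep_frame(0, 67, allowed)` keeps a press frame from task 0, while `keep_frame(1, 67, allowed)` drops the same skill for task 1.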

## Training

| Setting | Value |
|---|---|
| Base model | `2toINF/X-VLA-Pt` |
| Dataset | [`behavior-1k/2025-challenge-demos`](https://huggingface.co/datasets/behavior-1k/2025-challenge-demos) (tasks 0, 1, 6) |
| Trainable episodes | 600 (200 per task) |
| Total frames in dataset | 3,007,423 |
| Trainable frames after skill filter | **1,099,352** (about 36.6% of the total) |
| GPUs | 4 × NVIDIA H200 |
| Per-GPU batch | 32 |
| Effective batch | 128 |
| Precision | bf16 |
| LR (core) | **1e-4**, cosine decay to 1e-5 |
| LR (VLM, soft prompts) | 0.1 × core LR |
| Warmup | 2000 steps |
| Iterations | 60,000 |
| Optimizer | AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0 |
| Skill class weights | recomputed for filtered distribution (only 2 active skills) |
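
The learning-rate settings in the table can be sketched as a schedule function. This is a hedged illustration, not the training script: the linear warmup shape, the choice to start the cosine decay right after warmup, and the assumption that the VLM and soft-prompt group follows the same shape at 0.1× are all guesses; only the numeric values come from the table.

```python
import math

PEAK_LR = 1e-4        # core learning rate from the table
FINAL_LR = 1e-5       # cosine floor from the table
WARMUP_STEPS = 2000
TOTAL_STEPS = 60_000
VLM_LR_SCALE = 0.1    # multiplier for the VLM and soft-prompt parameters

def core_lr(step):
    """Learning rate for the core parameters at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from PEAK_LR to FINAL_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))

def vlm_lr(step):
    """Learning rate for the VLM and soft-prompt group (assumed same shape)."""
    return VLM_LR_SCALE * core_lr(step)
```

Under these assumptions the core rate peaks at step 2000 and reaches the 1e-5 floor at step 60,000.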

### Filtered skill distribution

| Skill | frames | frame % | weight |
|---|---|---|---|
| pick up from | 1,041,232 | 94.7% | 0.382 |
| press | 58,120 | 5.3% | 1.618 |
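
The card does not state how the weights were computed. As an observation rather than a documented formula, the values in the table are reproduced to three decimals by inverse-square-root frequency weighting, rescaled so that the weights average to 1. The function name below is hypothetical.

```python
import math

frames = {"pick up from": 1_041_232, "press": 58_120}  # counts from the table

def inv_sqrt_weights(counts):
    """Weights proportional to 1/sqrt(count), rescaled to average 1."""
    raw = {name: 1.0 / math.sqrt(n) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: w * scale for name, w in raw.items()}

weights = inv_sqrt_weights(frames)
shares = {name: n / sum(frames.values()) for name, n in frames.items()}
```

The two frame counts also sum to the 1,099,352 trainable frames reported in the training table.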

### Loss trajectory (selected steps)

Each `total` value equals the sum of the `joints`, `skill_cls`, and `progress` columns, up to rounding.

| Step | total | joints | skill_cls | progress |
|---|---|---|---|---|
| 0 | 14.94 | 14.49 | 0.438 | 0.017 |
| 1000 | 0.862 | 0.854 | 0.000 | 0.007 |
| 10000 | 0.028 | 0.026 | 0.000 | 0.002 |
| 20000 | 0.128 | 0.109 | 0.014 | 0.004 |
| 30000 | 0.052 | 0.048 | 0.000 | 0.005 |
| 40000 | 0.069 | 0.066 | 0.000 | 0.003 |
| 50000 | 0.019 | 0.016 | 0.000 | 0.003 |
| 59940 | 0.041 | 0.038 | 0.000 | 0.003 |

## Architecture (v20)

- **Additive per-task + per-skill soft prompts** (`task_prompt_hub[task_id] + skill_prompt_hub[skill_id]`)
- **Skill-conditioned progress head**
- **Skill classifier head** with dataset-specific weighted CE
- **Skill-enriched language instructions**
- 23-D R1Pro action space, 30-step horizon, 3 RGB cameras
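
To illustrate what a weighted cross-entropy for the skill classifier head does with the two class weights from the training section, here is a small pure-Python sketch. It is an assumption-laden toy: the class indices (0 for `pick up from`, 1 for `press`) are hypothetical, the real head presumably has more classes, and normalizing by the summed target weights mirrors the common weighted-mean convention rather than anything confirmed about the v20 code.

```python
import math

CLASS_WEIGHTS = [0.382, 1.618]  # [pick up from, press]; indices are hypothetical

def weighted_cross_entropy(logits, targets, weights):
    """Weighted CE over a batch, normalized by the summed target weights."""
    total, norm = 0.0, 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_z - row[target]  # negative log-likelihood of the true class
        total += weights[target] * nll
        norm += weights[target]
    return total / norm

logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
targets = [0, 1, 1]
loss = weighted_cross_entropy(logits, targets, CLASS_WEIGHTS)
```

Compared with uniform weights, the uncertain `press` example in the last row contributes more to the loss, which is the intended effect of up-weighting the rarer class.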

## Usage

```python
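from transformers import AutoConfig, AutoModel

REPO = "Hoshipu/xvla-v20-tasks016-pickup"
CKPT = "ckpt-60000"  # checkpoint subfolder inside the repo

# trust_remote_code=True is needed because the checkpoint ships its own
# modeling code (modeling_xvla.py and related files), which runs locally.
config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)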
```

Because navigation and placement frames were excluded from training, you'll typically want to **gate** this model on the active high-level skill at inference time and invoke it only when the system is in a pick-up phase (or, for task 0, a press phase). The deploy server in `behavior1k_training/deploy_b1k.py` predicts the skill class automatically from the VLM features.
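
A gating layer of this kind can be sketched as a small dispatcher. Everything here is hypothetical and independent of `deploy_b1k.py`: `specialist` and `fallback` stand for any callables that map an observation to an action chunk, and `active_skill` is assumed to come from a planner or a skill classifier.

```python
PICKUP_SKILLS = {"pick up from", "press"}  # skills this checkpoint was trained on

def select_policy(active_skill, specialist, fallback):
    """Route to the specialist only during the phases it was trained on."""
    return specialist if active_skill in PICKUP_SKILLS else fallback

def step(observation, active_skill, specialist, fallback):
    """Run one control step with whichever policy matches the active skill."""
    policy = select_policy(active_skill, specialist, fallback)
    return policy(observation)
```

In this sketch a navigation or placement phase is always handled by `fallback`, so the specialist never sees states outside its training distribution.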

## Files (under `ckpt-60000/`)

- `model.safetensors` — final 60k-step weights (3.5 GB, bf16)
- `config.json`, `preprocessor_config.json`, `tokenizer*`, `vocab.json`, `merges.txt`
- `modeling_xvla.py`, `configuration_xvla.py`, `transformer.py` — patched v20 modules
- `modeling_florence2.py`, `configuration_florence2.py`, `action_hub.py`, `processing_xvla.py` — unchanged from base