
	---
	license: mit
	library_name: transformers
	tags:
	- robotics
	- vla
	- behavior-1k
	- x-vla
	- pick-up
	base_model: 2toINF/X-VLA-Pt
	---

	# X-VLA v20 — Pick-Up Specialist on BEHAVIOR-1K Tasks 0, 1, 6

	Fine-tune of [`2toINF/X-VLA-Pt`](https://huggingface.co/2toINF/X-VLA-Pt) on three BEHAVIOR-1K tasks, with training data filtered to the `pick up from` skill segments (plus `press` for task 0). Uses the v20 architecture from [markli1hoshipu/behavior1k-xvla @ v20](https://github.com/markli1hoshipu/behavior1k-xvla/tree/v20).

	## What this model is

	A pick-up specialist, trained on the official [`behavior-1k/2025-challenge-demos`](https://huggingface.co/datasets/behavior-1k/2025-challenge-demos) data for the following tasks:

	| Task | Description | Allowed skills during training |
	|---|---|---|
	| 0 | Turn on the radio receiver | `pick up from`, `press` |
	| 1 | Put soda cans in the trash can | `pick up from` |
	| 6 | Take Easter eggs out of the wicker basket | `pick up from` |

	Frames whose active skill segment was not in the allowed list (navigation, placing, etc.) were skipped at sample time. The resulting model learns dexterous grasping and pressing motions without spending capacity on the navigation and placing phases.

	The skill filter is implemented via the `B1K_ALLOWED_SKILLS_PER_TASK` env var added in v20, a JSON object that maps each task ID to its list of allowed skill IDs:
	```bash
	B1K_ALLOWED_SKILLS_PER_TASK='{"0":[2,67],"1":[2],"6":[2]}'
	```
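
As a rough illustration of how such a per-task allow-list could be applied at sample time, the sketch below parses the JSON value and decides whether a frame is kept. The helper names `load_allowed_skills` and `keep_frame` are hypothetical and not taken from the repository, and leaving unlisted tasks unfiltered is an assumption.

```python
import json
import os

# Value copied from the model card; the parsing logic is an illustrative assumption.
os.environ["B1K_ALLOWED_SKILLS_PER_TASK"] = '{"0":[2,67],"1":[2],"6":[2]}'


def load_allowed_skills(var="B1K_ALLOWED_SKILLS_PER_TASK"):
    """Parse the env var into {task_id: set(skill_ids)}; empty dict if unset."""
    raw = os.environ.get(var, "")
    if not raw:
        return {}
    return {int(task): set(skills) for task, skills in json.loads(raw).items()}


def keep_frame(task_id, skill_id, allowed):
    """Keep a frame if its task has no filter or its skill is on the allow-list."""
    if task_id not in allowed:
        return True  # assumption: tasks missing from the map are left unfiltered
    return skill_id in allowed[task_id]


allowed = load_allowed_skills()
print(keep_frame(0, 67, allowed))  # skill 67 is listed for task 0
print(keep_frame(1, 67, allowed))  # skill 67 is not listed for task 1
```

JSON object keys are always strings, so the sketch converts them to integers before comparing against task IDs.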

	## Training

	| Setting | Value |
	|---|---|
	| Base model | `2toINF/X-VLA-Pt` |
	| Dataset | [`behavior-1k/2025-challenge-demos`](https://huggingface.co/datasets/behavior-1k/2025-challenge-demos) (tasks 0, 1, 6) |
	| Trainable episodes | 600 (200 per task) |
	| Total frames in dataset | 3,007,423 |
	| Trainable frames after skill filter | ≈ 1,099,352 |
	| GPUs | 4 × NVIDIA H200 |
	| Per-GPU batch | 32 |
	| Effective batch | 128 (4 × 32) |
	| Precision | bf16 |
	| LR (core) | 1e-4, cosine decay to 1e-5 |
	| LR (VLM, soft prompts) | 0.1 × core LR |
	| Warmup | 2000 steps |
	| Iterations | 60,000 |
	| Optimizer | AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0 |
	| Skill class weights | recomputed for filtered distribution (only 2 active skills) |
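
To make the schedule concrete, here is a minimal sketch of a warmup plus cosine-decay learning rate built from the values in the table. The card does not say how warmup is shaped or over which span the cosine runs, so the linear warmup from 0 and the decay over the post-warmup steps are assumptions, as is applying the 0.1 factor to the scheduled value. The last lines estimate how many passes over the filtered frames the run makes, assuming one training sample per trainable frame.

```python
import math

BASE_LR, MIN_LR = 1e-4, 1e-5      # from the table
WARMUP, TOTAL = 2_000, 60_000     # from the table
EFFECTIVE_BATCH = 4 * 32          # 4 GPUs x 32 per GPU
TRAINABLE_FRAMES = 1_099_352      # from the table


def lr_at(step, scale=1.0):
    """Core LR at `step`; pass scale=0.1 for the VLM / soft-prompt group."""
    if step < WARMUP:
        # assumption: linear warmup from 0 to the peak LR
        lr = BASE_LR * step / WARMUP
    else:
        # assumption: cosine decay from BASE_LR to MIN_LR over the remaining steps
        t = (step - WARMUP) / (TOTAL - WARMUP)
        lr = MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * t))
    return scale * lr


# assumption: each sample in a batch corresponds to one trainable frame
passes = TOTAL * EFFECTIVE_BATCH / TRAINABLE_FRAMES
print(f"peak LR {lr_at(WARMUP):.1e}, final LR {lr_at(TOTAL):.1e}")
print(f"samples drawn / trainable frames = {passes:.2f}")
```

Under these assumptions the run draws about 7.68 million samples, roughly seven passes over the filtered frames.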

	### Filtered skill distribution

	| Skill | Frames | Share of frames | Class weight |
	|---|---|---|---|
	| pick up from | 1,041,232 | 94.7% | 0.382 |
	| press | 58,120 | 5.3% | 1.618 |
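
The card does not state how the class weights were derived. One scheme that happens to reproduce both values to three decimals is inverse-square-root frequency weighting rescaled so the weights average 1; the sketch below checks this. Treat the scheme as an assumption about the printed numbers, not a description of the training code.

```python
frames = {"pick up from": 1_041_232, "press": 58_120}  # from the table

total = sum(frames.values())
share = {name: count / total for name, count in frames.items()}

# Assumed scheme: weight proportional to 1/sqrt(frequency), rescaled to mean 1.
raw = {name: p ** -0.5 for name, p in share.items()}
scale = len(raw) / sum(raw.values())
weights = {name: scale * w for name, w in raw.items()}

for name in frames:
    print(f"{name}: share={share[name]:.1%}, weight={weights[name]:.3f}")
```

The two frame counts also sum to 1,099,352, matching the trainable-frame figure in the training table.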

	### Loss trajectory (selected steps)

	| Step | total | joints | skill_cls | progress |
	|---|---|---|---|---|
	| 0 | 14.94 | 14.49 | 0.438 | 0.017 |
	| 1000 | 0.862 | 0.854 | 0.000 | 0.007 |
	| 10000 | 0.028 | 0.026 | 0.000 | 0.002 |
	| 20000 | 0.128 | 0.109 | 0.014 | 0.004 |
	| 30000 | 0.052 | 0.048 | 0.000 | 0.005 |
	| 40000 | 0.069 | 0.066 | 0.000 | 0.003 |
	| 50000 | 0.019 | 0.016 | 0.000 | 0.003 |
	| 59940 | 0.041 | 0.038 | 0.000 | 0.003 |
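
The table does not say how `total` is composed. The reported values are consistent, up to rounding, with an unweighted sum of the three component losses, which the short check below verifies. This is an observation about the printed numbers only; any loss weighting in the training code is not documented here.

```python
# (step, total, joints, skill_cls, progress), copied from the table
rows = [
    (0, 14.94, 14.49, 0.438, 0.017),
    (1000, 0.862, 0.854, 0.000, 0.007),
    (10000, 0.028, 0.026, 0.000, 0.002),
    (20000, 0.128, 0.109, 0.014, 0.004),
    (30000, 0.052, 0.048, 0.000, 0.005),
    (40000, 0.069, 0.066, 0.000, 0.003),
    (50000, 0.019, 0.016, 0.000, 0.003),
    (59940, 0.041, 0.038, 0.000, 0.003),
]

# Largest gap between the reported total and the sum of its parts.
max_gap = max(abs(total - (joints + skill + prog))
              for _, total, joints, skill, prog in rows)
print(f"largest |total - sum of parts| = {max_gap:.3f}")
```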

	## Architecture (v20)

	- Additive per-task + per-skill soft prompts (`task_prompt_hub[task_id] + skill_prompt_hub[skill_id]`)
	- Skill-conditioned progress head
	- Skill classifier head with dataset-specific weighted CE
	- Skill-enriched language instructions
	- 23-D R1Pro action space, 30-step horizon, 3 RGB cameras
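
The additive soft-prompt expression in the first bullet can be sketched in plain Python as follows. The prompt length, embedding width, random initialisation, and the `make_hub` and `soft_prompt` helpers are hypothetical stand-ins; only the lookup-and-add pattern `task_prompt_hub[task_id] + skill_prompt_hub[skill_id]` comes from the card.

```python
import random

NUM_PROMPT_TOKENS, DIM = 4, 8  # hypothetical sizes, for illustration only
random.seed(0)


def make_hub(ids):
    """One (tokens x dim) prompt per id; learnable in practice, random here."""
    return {
        i: [[random.gauss(0.0, 0.02) for _ in range(DIM)]
            for _ in range(NUM_PROMPT_TOKENS)]
        for i in ids
    }


task_prompt_hub = make_hub([0, 1, 6])   # task ids from this card
skill_prompt_hub = make_hub([2, 67])    # skill ids from the env var example


def soft_prompt(task_id, skill_id):
    """Element-wise sum of the task prompt and the skill prompt."""
    task_rows = task_prompt_hub[task_id]
    skill_rows = skill_prompt_hub[skill_id]
    return [[a + b for a, b in zip(t_row, s_row)]
            for t_row, s_row in zip(task_rows, skill_rows)]


prompt = soft_prompt(task_id=0, skill_id=67)
print(len(prompt), len(prompt[0]))
```

With an additive factorisation, the number of prompt tables grows with the number of tasks plus the number of skills, not with the number of (task, skill) pairs.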

	## Usage

	```python
	from transformers import AutoConfig, AutoModel

	REPO = "Hoshipu/xvla-v20-tasks016-pickup"
	CKPT = "ckpt-60000"  # checkpoint subfolder inside the repo

	# trust_remote_code=True is required: the repo ships custom modeling code (modeling_xvla.py).
	config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
	model = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
	```

	For inference, you'll typically want to gate this model on the active high-level skill — invoke it only when the system is in a pick-up phase. The deploy server in `behavior1k_training/deploy_b1k.py` predicts the skill class automatically from the VLM features.
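
A minimal sketch of such gating, assuming the caller already has a predicted skill ID and a separate fallback policy for other phases. The names `select_policy`, `specialist_policy`, and `fallback_policy` are hypothetical and do not reflect the interface of `deploy_b1k.py`.

```python
# Same task -> skill ids as the training filter shown in this card.
PICKUP_SKILLS = {0: {2, 67}, 1: {2}, 6: {2}}


def select_policy(task_id, skill_id, specialist, fallback):
    """Return the specialist for trained (task, skill) pairs, else the fallback."""
    if skill_id in PICKUP_SKILLS.get(task_id, set()):
        return specialist
    return fallback


def specialist_policy(observation):
    """Stand-in for this model; a real policy would return an action chunk."""
    return "specialist"


def fallback_policy(observation):
    """Stand-in for whatever handles navigation and placing phases."""
    return "fallback"


print(select_policy(1, 2, specialist_policy, fallback_policy)(None))
print(select_policy(1, 67, specialist_policy, fallback_policy)(None))
```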

	## Files (under `ckpt-60000/`)

	- `model.safetensors` — final 60k-step weights (3.5 GB, bf16)
	- `config.json`, `preprocessor_config.json`, `tokenizer*`, `vocab.json`, `merges.txt`
	- `modeling_xvla.py`, `configuration_xvla.py`, `transformer.py` — patched v20 modules
	- `modeling_florence2.py`, `configuration_florence2.py`, `action_hub.py`, `processing_xvla.py` — unchanged from base
