---
license: mit
library_name: transformers
tags:
- robotics
- vla
- behavior-1k
- x-vla
- pick-up
base_model: 2toINF/X-VLA-Pt
---
# X-VLA v20: Pick-Up Specialist on BEHAVIOR-1K Tasks 0, 1, 6

Fine-tune of [`2toINF/X-VLA-Pt`](https://huggingface.co/2toINF/X-VLA-Pt) on three BEHAVIOR-1K tasks, with training data **filtered to only the pick-up sub-skill segments**. Uses the **v20 architecture** from [markli1hoshipu/behavior1k-xvla @ v20](https://github.com/markli1hoshipu/behavior1k-xvla/tree/v20).

## What this model is

A pick-up specialist, trained on the official [`behavior-1k/2025-challenge-demos`](https://huggingface.co/datasets/behavior-1k/2025-challenge-demos) data for these tasks:

| Task | Description | Allowed skills during training |
|---|---|---|
| 0 | Turn on the radio receiver | `pick up from`, `press` |
| 1 | Put soda cans in the trash can | `pick up from` |
| 6 | Take Easter eggs out of the wicker basket | `pick up from` |

Frames whose active skill segment was not in the allowed list (navigation, placing, etc.) were skipped at sampling time. The model therefore learns dexterous grasping and pressing motions without spending capacity on the navigation and placing phases.

The skill filter is implemented via the `B1K_ALLOWED_SKILLS_PER_TASK` environment variable added in v20, which maps each task ID to its list of allowed skill IDs:

```bash
B1K_ALLOWED_SKILLS_PER_TASK='{"0":[2,67],"1":[2],"6":[2]}'
```
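
As a rough illustration of how such a filter could behave, the sketch below parses the variable and decides whether a frame is kept. The helper names `load_allowed_skills` and `is_frame_trainable` are hypothetical, the reading of skill ID 2 as `pick up from` and 67 as `press` is inferred from the task table, and the handling of unlisted tasks is an assumption; the actual v20 code may differ.

```python
import json
import os

def load_allowed_skills(env=os.environ):
    """Parse B1K_ALLOWED_SKILLS_PER_TASK into {task_id: set(skill_ids)}.

    Hypothetical helper; returns an empty dict when the variable is unset.
    """
    raw = env.get("B1K_ALLOWED_SKILLS_PER_TASK")
    if not raw:
        return {}
    return {int(task): set(skills) for task, skills in json.loads(raw).items()}

def is_frame_trainable(allowed, task_id, skill_id):
    """Keep a frame only if its active skill is allowed for its task.

    Assumption: tasks missing from the mapping are left unfiltered.
    """
    if task_id not in allowed:
        return True
    return skill_id in allowed[task_id]

# Example with the setting used for this model.
env = {"B1K_ALLOWED_SKILLS_PER_TASK": '{"0":[2,67],"1":[2],"6":[2]}'}
allowed = load_allowed_skills(env)
```

With this mapping, a `press` frame (ID 67) is kept for task 0 but skipped for tasks 1 and 6.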
## Training

| Setting | Value |
|---|---|
| Base model | `2toINF/X-VLA-Pt` |
| Dataset | [`behavior-1k/2025-challenge-demos`](https://huggingface.co/datasets/behavior-1k/2025-challenge-demos) (tasks 0, 1, 6) |
| Trainable episodes | 600 (200 per task) |
| Total frames in dataset | 3,007,423; trainable after skill filter: **1,099,352** |
| GPUs | 4 × NVIDIA H200 |
| Per-GPU batch | 32 |
| Effective batch | 128 |
| Precision | bf16 |
| LR (core) | **1e-4**, cosine decay to 1e-5 |
| LR (VLM, soft prompts) | 0.1 × LR |
| Warmup | 2000 steps |
| Iterations | 60,000 |
| Optimizer | AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0 |
| Skill class weights | recomputed for filtered distribution (only 2 active skills) |
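
The learning-rate settings in the table can be read as a warmup followed by a cosine decay. The sketch below is one plausible interpretation, assuming the warmup is linear from 0 and the cosine spans the remaining steps up to iteration 60,000; the function names are hypothetical and the training script may implement the schedule differently.

```python
import math

BASE_LR = 1e-4    # core learning rate from the table
MIN_LR = 1e-5     # value the cosine decays to
WARMUP = 2000     # warmup steps
TOTAL = 60_000    # total iterations
VLM_SCALE = 0.1   # multiplier for the VLM and soft-prompt group

def core_lr(step):
    """Learning rate of the core parameter group at a given step.

    Assumption: linear warmup from 0, then cosine decay over WARMUP..TOTAL.
    """
    if step < WARMUP:
        return BASE_LR * step / WARMUP
    progress = min(1.0, (step - WARMUP) / (TOTAL - WARMUP))
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

def vlm_lr(step):
    """Learning rate of the VLM and soft-prompt group (0.1 x core)."""
    return VLM_SCALE * core_lr(step)
```

Under these assumptions the core rate peaks at 1e-4 at step 2000 and reaches 1e-5 at step 60,000, with the VLM group a factor of ten lower throughout.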
### Filtered skill distribution

| Skill | Frames | Frame % | Weight |
|---|---|---|---|
| pick up from | 1,041,232 | 94.7% | 0.382 |
| press | 58,120 | 5.3% | 1.618 |
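
This card does not state how the class weights were derived. One scheme that reproduces the listed values to three decimals is inverse-square-root frequency weighting, rescaled so the weights average to 1. The sketch below is offered only as that assumption, not as a description of the training code; `inv_sqrt_weights` is a hypothetical name.

```python
import math

# Frame counts from the filtered skill distribution table.
frames = {"pick up from": 1_041_232, "press": 58_120}

def inv_sqrt_weights(counts):
    """Weight each class by 1/sqrt(frequency), then rescale to mean 1."""
    total = sum(counts.values())
    raw = {name: 1.0 / math.sqrt(count / total) for name, count in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: weight * scale for name, weight in raw.items()}

weights = inv_sqrt_weights(frames)
```

The square root softens the correction compared with plain inverse frequency, which would weight `press` roughly 18 times more than `pick up from` instead of roughly 4 times.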
### Loss trajectory (selected steps)

| Step | total | joints | skill_cls | progress |
|---|---|---|---|---|
| 0 | 14.94 | 14.49 | 0.438 | 0.017 |
| 1000 | 0.862 | 0.854 | 0.000 | 0.007 |
| 10000 | 0.028 | 0.026 | 0.000 | 0.002 |
| 20000 | 0.128 | 0.109 | 0.014 | 0.004 |
| 30000 | 0.052 | 0.048 | 0.000 | 0.005 |
| 40000 | 0.069 | 0.066 | 0.000 | 0.003 |
| 50000 | 0.019 | 0.016 | 0.000 | 0.003 |
| 59940 | 0.041 | 0.038 | 0.000 | 0.003 |
## Architecture (v20)

- **Additive per-task + per-skill soft prompts** (`task_prompt_hub[task_id] + skill_prompt_hub[skill_id]`)
- **Skill-conditioned progress head**
- **Skill classifier head** with dataset-specific weighted CE
- **Skill-enriched language instructions**
- 23-D R1Pro action space, 30-step horizon, 3 RGB cameras
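
To make the additive soft-prompt idea concrete, here is a toy sketch in plain Python. It assumes each hub maps an ID to a prompt of identical shape (a list of token embeddings) and that the two are summed element-wise. The function name `combine_prompts`, the dictionary layout, and the tiny sizes are illustrative assumptions; the real modules in `modeling_xvla.py` presumably operate on learned tensors.

```python
def combine_prompts(task_prompt_hub, skill_prompt_hub, task_id, skill_id):
    """Element-wise sum of a task prompt and a skill prompt.

    Each prompt is a list of token embeddings (lists of floats) of equal shape.
    """
    task_prompt = task_prompt_hub[task_id]
    skill_prompt = skill_prompt_hub[skill_id]
    if len(task_prompt) != len(skill_prompt):
        raise ValueError("prompt lengths differ")
    return [
        [t + s for t, s in zip(task_tok, skill_tok)]
        for task_tok, skill_tok in zip(task_prompt, skill_prompt)
    ]

# Toy hubs: 2 prompt tokens of width 3 (real sizes are not stated in this card).
task_prompt_hub = {0: [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]}
skill_prompt_hub = {2: [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]}
prompt = combine_prompts(task_prompt_hub, skill_prompt_hub, task_id=0, skill_id=2)
```

Because the prompts are summed, not concatenated, the prompt length seen by the model stays fixed regardless of how many tasks or skills exist.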
## Usage

```python
from transformers import AutoConfig, AutoModel

REPO = "Hoshipu/xvla-v20-tasks016-pickup"
CKPT = "ckpt-60000"  # checkpoint subfolder inside the repo

# trust_remote_code=True is required because the checkpoint ships custom modeling code.
config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
```

For inference, you'll typically want to **gate** this model on the active high-level skill: invoke it only when the system is in a pick-up phase. The deploy server in `behavior1k_training/deploy_b1k.py` predicts the skill class automatically from the VLM features.
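
A minimal sketch of such gating, assuming the caller already has a predicted skill ID and a separate fallback policy for the other phases. `select_policy`, the stand-in callables, and the skill IDs (taken from the `B1K_ALLOWED_SKILLS_PER_TASK` setting) are illustrative and do not reflect the interface of `deploy_b1k.py`.

```python
# (task, skill) pairs this specialist was trained on, per the skill filter setting.
PICKUP_SKILLS = {0: {2, 67}, 1: {2}, 6: {2}}

def select_policy(task_id, skill_id, specialist, fallback):
    """Return the specialist only for (task, skill) pairs it was trained on."""
    if skill_id in PICKUP_SKILLS.get(task_id, set()):
        return specialist
    return fallback

# Stand-in callables; real policies would map observations to action chunks.
specialist = lambda obs: "pickup-specialist"
fallback = lambda obs: "generalist"

policy = select_policy(task_id=1, skill_id=2, specialist=specialist, fallback=fallback)
```

Routing unknown tasks and skills to the fallback keeps the specialist from being queried on phases it never saw during training.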
## Files (under `ckpt-60000/`)

- `model.safetensors`: final 60k-step weights (3.5 GB, bf16)
- `config.json`, `preprocessor_config.json`, `tokenizer*`, `vocab.json`, `merges.txt`
- `modeling_xvla.py`, `configuration_xvla.py`, `transformer.py`: patched v20 modules
- `modeling_florence2.py`, `configuration_florence2.py`, `action_hub.py`, `processing_xvla.py`: unchanged from base