X-VLA v20: Pick-Up Specialist on BEHAVIOR-1K Tasks 0, 1, 6
Fine-tune of 2toINF/X-VLA-Pt on three BEHAVIOR-1K tasks, with training data filtered to only the pick-up sub-skill segments (plus the press segments for task 0). Uses the v20 architecture from markli1hoshipu/behavior1k-xvla @ v20.
What this model is
A pick-up specialist. Trained on the official behavior-1k/2025-challenge-demos data for tasks:
| Task | Description | Allowed skills during training |
|---|---|---|
| 0 | Turn on the radio receiver | pick up from, press |
| 1 | Put soda cans in the trash can | pick up from |
| 6 | Take Easter eggs out of the wicker basket | pick up from |
Frames whose active skill segment was not in the allowed list (navigation, placing, etc.) were skipped at sample time. The resulting model learns dexterous grasping and pressing motions without spending capacity on the navigation and placing phases.
The skill filter is implemented via the B1K_ALLOWED_SKILLS_PER_TASK env var added in v20:
B1K_ALLOWED_SKILLS_PER_TASK='{"0":[2,67],"1":[2],"6":[2]}'
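As a rough illustration of how such a filter could work, the sketch below parses the env var and drops frames whose skill is not allowed for their task. The helper names (`load_allowed_skills`, `is_trainable`) are hypothetical and not taken from the training code. The ID mapping (2 = pick up from, 67 = press) is inferred from the task table and the env var value. Treating a task with no entry as unfiltered is an assumption.

```python
import json
import os

# Assumed ID-to-name mapping, inferred from the task table and the env var value.
SKILL_NAMES = {2: "pick up from", 67: "press"}


def load_allowed_skills(env_var="B1K_ALLOWED_SKILLS_PER_TASK"):
    """Parse the JSON env var into {task_id: set(skill_ids)}; empty dict if unset."""
    raw = os.environ.get(env_var, "")
    if not raw:
        return {}
    return {int(task): set(skills) for task, skills in json.loads(raw).items()}


def is_trainable(task_id, skill_id, allowed):
    """Keep a frame only if its active skill is allowed for its task."""
    if task_id not in allowed:  # assumption: a task with no entry is left unfiltered
        return True
    return skill_id in allowed[task_id]


os.environ["B1K_ALLOWED_SKILLS_PER_TASK"] = '{"0":[2,67],"1":[2],"6":[2]}'
allowed = load_allowed_skills()

# Hypothetical (task_id, skill_id) frame labels; skill 5 stands in for any other skill.
frames = [(0, 2), (0, 67), (0, 5), (1, 67), (6, 2)]
kept = [f for f in frames if is_trainable(*f, allowed)]
print(kept)  # [(0, 2), (0, 67), (6, 2)]
```

Note that press frames are kept only for task 0, so a press frame in task 1 is dropped.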
Training
| Setting | Value |
|---|---|
| Base model | 2toINF/X-VLA-Pt |
| Dataset | behavior-1k/2025-challenge-demos (tasks 0, 1, 6) |
| Trainable episodes | 600 (200 per task) |
| Total frames in dataset | 3,007,423; trainable after skill filter: 1,099,352 |
| GPUs | 4 × NVIDIA H200 |
| Per-GPU batch | 32 |
| Effective batch | 128 |
| Precision | bf16 |
| LR (core) | 1e-4, cosine decay to 1e-5 |
| LR (VLM, soft prompts) | 0.1 × core LR |
| Warmup | 2000 steps |
| Iterations | 60,000 |
| Optimizer | AdamW, betas=(0.9, 0.95), wd=0.0, grad-clip 1.0 |
| Skill class weights | recomputed for filtered distribution (only 2 active skills) |
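The schedule and batch settings in the table can be sketched as follows. The warmup shape (linear) and the choice to start the cosine decay right after warmup are assumptions, since the table gives only the endpoints. The pass count further assumes that one training sample corresponds to one frame.

```python
import math

BASE_LR, MIN_LR = 1e-4, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 2000, 60_000


def core_lr(step):
    """Linear warmup (assumed), then cosine decay from BASE_LR to MIN_LR."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def vlm_lr(step):
    """VLM and soft-prompt parameters use 0.1 x the core learning rate."""
    return 0.1 * core_lr(step)


effective_batch = 4 * 32                      # GPUs x per-GPU batch = 128
samples_seen = effective_batch * TOTAL_STEPS  # 7,680,000 samples over the run
passes = samples_seen / 1_099_352             # just under 7 passes over the filtered frames

print(f"lr at end of warmup: {core_lr(WARMUP_STEPS):.1e}")
print(f"lr at final step:    {core_lr(TOTAL_STEPS):.1e}")
print(f"approx. passes over filtered data: {passes:.2f}")
```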
Filtered skill distribution
| Skill | frames | frame % | weight |
|---|---|---|---|
| pick up from | 1,041,232 | 94.7% | 0.382 |
| press | 58,120 | 5.3% | 1.618 |
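The weights in this table are consistent with inverse-square-root frequency weighting, rescaled so the weights average to 1. This formula is inferred from the numbers and is not stated by the source, so the sketch below is a plausibility check rather than the training code.

```python
import math

# Frame counts from the filtered skill distribution table.
frames = {"pick up from": 1_041_232, "press": 58_120}


def inv_sqrt_weights(counts):
    """Weights proportional to 1/sqrt(count), rescaled so they average to 1."""
    raw = {name: 1.0 / math.sqrt(n) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: w * scale for name, w in raw.items()}


weights = inv_sqrt_weights(frames)
total = sum(frames.values())
for name, n in frames.items():
    print(f"{name}: {100 * n / total:.1f}% of frames, weight {weights[name]:.3f}")
```

Under this assumption the script reproduces the table to three decimals (0.382 and 1.618). The two frame counts also sum to 1,099,352, the trainable frame total reported in the training table.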
Loss trajectory (selected steps)
| Step | total | joints | skill_cls | progress |
|---|---|---|---|---|
| 0 | 14.94 | 14.49 | 0.438 | 0.017 |
| 1000 | 0.862 | 0.854 | 0.000 | 0.007 |
| 10000 | 0.028 | 0.026 | 0.000 | 0.002 |
| 20000 | 0.128 | 0.109 | 0.014 | 0.004 |
| 30000 | 0.052 | 0.048 | 0.000 | 0.005 |
| 40000 | 0.069 | 0.066 | 0.000 | 0.003 |
| 50000 | 0.019 | 0.016 | 0.000 | 0.003 |
| 59940 | 0.041 | 0.038 | 0.000 | 0.003 |
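A quick check on the rows above suggests that the total is the unweighted sum of the three components, up to rounding. The loss weighting is not stated in the source, so this is an observation about the logged numbers only.

```python
# (step, total, joints, skill_cls, progress) copied from the table above.
rows = [
    (0, 14.94, 14.49, 0.438, 0.017),
    (1000, 0.862, 0.854, 0.000, 0.007),
    (10000, 0.028, 0.026, 0.000, 0.002),
    (20000, 0.128, 0.109, 0.014, 0.004),
    (30000, 0.052, 0.048, 0.000, 0.005),
    (40000, 0.069, 0.066, 0.000, 0.003),
    (50000, 0.019, 0.016, 0.000, 0.003),
    (59940, 0.041, 0.038, 0.000, 0.003),
]

# Largest gap between the logged total and the plain sum of its components.
max_gap = max(abs(total - (joints + skill + prog)) for _, total, joints, skill, prog in rows)
print(f"largest |total - sum of parts| = {max_gap:.3f}")
```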
Architecture (v20)
- Additive per-task + per-skill soft prompts (task_prompt_hub[task_id] + skill_prompt_hub[skill_id])
- Skill-conditioned progress head
- Skill classifier head with dataset-specific weighted CE
- Skill-enriched language instructions
- 23-D R1Pro action space, 30-step horizon, 3 RGB cameras
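The additive soft-prompt lookup from the first bullet can be illustrated with plain Python lists. The prompt sizes, the random initialisation, and the `make_hub` helper are hypothetical. The actual modules presumably store learned tensors, and their shapes are not given in the source.

```python
import random

NUM_PROMPT_TOKENS, DIM = 4, 8  # hypothetical sizes, not taken from the model config


def make_hub(ids, seed):
    """One (tokens x dim) prompt per ID; random values stand in for trained weights."""
    rng = random.Random(seed)
    return {
        i: [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(NUM_PROMPT_TOKENS)]
        for i in ids
    }


task_prompt_hub = make_hub([0, 1, 6], seed=0)   # one prompt per task
skill_prompt_hub = make_hub([2, 67], seed=1)    # one prompt per skill (IDs from the env var)


def soft_prompt(task_id, skill_id):
    """Element-wise sum of the task prompt and the skill prompt."""
    task_p, skill_p = task_prompt_hub[task_id], skill_prompt_hub[skill_id]
    return [[a + b for a, b in zip(t_row, s_row)] for t_row, s_row in zip(task_p, skill_p)]


prompt = soft_prompt(task_id=0, skill_id=67)
print(len(prompt), len(prompt[0]))  # 4 8
```

With the additive form, the model stores one prompt per task plus one per skill instead of one per (task, skill) pair, and the same skill prompt is shared across tasks.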
Usage
from transformers import AutoModel, AutoConfig
REPO = "Hoshipu/xvla-v20-tasks016-pickup"
CKPT = "ckpt-60000"
config = AutoConfig.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
model = AutoModel.from_pretrained(REPO, subfolder=CKPT, trust_remote_code=True)
For inference, you'll typically want to gate this model on the active high-level skill: invoke it only when the system is in a pick-up phase. The deploy server in behavior1k_training/deploy_b1k.py predicts the skill class automatically from the VLM features.
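One way to express this gating is a small router that picks a policy based on the active skill. Everything here is a hypothetical sketch: `select_policy` and the stand-in policies are not part of the repository, and including press in the gated set is an assumption based on the task 0 training filter.

```python
# Skills this checkpoint was trained on (press is included by assumption, see above).
SPECIALIST_SKILLS = {"pick up from", "press"}


def select_policy(active_skill, specialist, fallback):
    """Route to the specialist only during skills it was trained on."""
    return specialist if active_skill in SPECIALIST_SKILLS else fallback


# Stand-in policies; a real system would wrap model inference here.
def specialist_policy(observation):
    return "specialist_action"


def fallback_policy(observation):
    return "fallback_action"


for skill in ["navigate to", "pick up from", "place on"]:
    policy = select_policy(skill, specialist_policy, fallback_policy)
    print(skill, "->", policy(observation=None))
```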
Files (under ckpt-60000/)
- model.safetensors: final 60k-step weights (3.5 GB, bf16)
- config.json, preprocessor_config.json, tokenizer*, vocab.json, merges.txt
- modeling_xvla.py, configuration_xvla.py, transformer.py: patched v20 modules
- modeling_florence2.py, configuration_florence2.py, action_hub.py, processing_xvla.py: unchanged from base