HapticVLA: Contact-Rich Manipulation via Vision-Language-Action Model without Inference-Time Tactile Sensing
Paper: arXiv:2603.15257
HapticVLA is distilled from a tactile-conditioned SA-RWFM teacher via offline action-level knowledge distillation. At inference it is an unmodified SmolVLA: no tactile sensors, no extra modules, zero overhead.
HapticVLA achieves an 86.7% mean success rate (vs. 61.7% for the SmolVLA baseline and 75.0% for the tactile teacher) and the lowest force error rate of any evaluated model (5.0%), lower even than the tactile-equipped teacher, all without tactile sensing at inference.
Base model: lerobot/smolvla_base (450M params)

Per-task results:

| Task | Success Rate | Force Errors |
|---|---|---|
| Eggs | 95% | 1/20 |
| Can | 75% | 5/20 |
| Waffles | 90% | 2/20 |
| Mean | 86.7% | 8/60 |

Comparison across models:

| Model | Mean Success | Force Error Rate | Tactile Required |
|---|---|---|---|
| SmolVLA (Baseline) | 61.7% | 26.7% | No |
| SA-RWFM (Teacher) | 75.0% | 11.7% | Yes |
| HapticVLA (Ours) | 86.7% | 5.0% | No |
Distillation targets blend ground-truth actions with teacher predictions: (1-α)·GT + α·teacher_prediction.

Loading the released checkpoint:

```python
import torch

# Load the distilled policy checkpoint on CPU.
checkpoint = torch.load("best/model.pt", map_location="cpu")
```
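A minimal sketch of the action-level relabeling implied by the formula above, assuming ground-truth and teacher actions are already aligned tensors. The function name, shapes, and the default α below are illustrative, not the released training code.

```python
import torch

def relabel_actions(gt_actions: torch.Tensor,
                    teacher_actions: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Blend ground-truth actions with teacher predictions for distillation.

    Implements (1 - alpha) * GT + alpha * teacher_prediction;
    alpha = 0.5 is an illustrative default, not the paper's value.
    """
    return (1.0 - alpha) * gt_actions + alpha * teacher_actions

# Example: a chunk of 10 timesteps of 7-DoF actions (placeholder shapes).
gt = torch.randn(10, 7)
teacher = torch.randn(10, 7)
targets = relabel_actions(gt, teacher, alpha=0.5)
print(targets.shape)  # torch.Size([10, 7])
```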
See Advanced-Robotic-Manipulation/crab for the full inference pipeline.
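Because the distilled policy is a stock SmolVLA, inference can follow the usual lerobot policy API. The sketch below assumes lerobot's SmolVLAPolicy with `from_pretrained` and `select_action`; the import path differs across lerobot versions, and the observation keys, shapes, and task string are placeholders rather than the released pipeline.

```python
import torch
# Import path is version-dependent; newer lerobot releases expose the policy
# under lerobot.policies.smolvla.modeling_smolvla instead.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load the base SmolVLA weights; the distilled HapticVLA checkpoint would be
# loaded the same way from a local checkpoint directory.
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# One observation step: camera frame, proprioceptive state, and task string.
# No tactile inputs are required at inference. Keys and shapes are placeholders.
batch = {
    "observation.images.top": torch.zeros(1, 3, 256, 256),
    "observation.state": torch.zeros(1, 6),
    "task": ["pick up the egg without crushing it"],
}

with torch.no_grad():
    action = policy.select_action(batch)
print(action.shape)
```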
If you use this model, please cite our paper:
```bibtex
@article{gubernatorov2026hapticvla,
  title={HapticVLA: Contact-Rich Manipulation via Vision-Language-Action Model without Inference-Time Tactile Sensing},
  author={Gubernatorov, Konstantin and Sannikov, Mikhail and Mikhalchuk, Ilya and Kuznetsov, Egor and Artemov, Makar and Ouwatobi, Ogunwoye Faith and Fernando, Marcelino and Asanov, Artem and Guo, Ziang and Tsetserukou, Dzmitry},
  journal={arXiv preprint arXiv:2603.15257},
  year={2026}
}
```