---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- robotics
- failure-detection
- manipulation
- vision-language
- multi-view
- chain-of-thought
- internvl
base_model: OpenGVLab/InternVL3-8B
---
# Guardian — Multi-View VLM for Robotic Planning & Execution Failure Detection (Thinking variant)
**Guardian** is a vision-language model fine-tuned for **unified planning and execution verification** in robotic manipulation. Given an instruction and one or more images of the robot scene, it predicts whether a proposed plan is correct (planning verification) or whether a subtask was successfully executed (execution verification), and emits an explicit chain-of-thought reasoning trace alongside the final answer.
This checkpoint (`guardian-thinking`) is the **thinking** variant: it is trained to produce, and at inference time generates, `<think> ... </think>` reasoning before the final `<answer>` and `<category>` tags. A lighter no-CoT counterpart (`guardian-vanilla`) is released separately.
| Project page | Paper | Code | Data |
|---|---|---|---|
| [di.ens.fr/willow/research/guardian](https://www.di.ens.fr/willow/research/guardian/) | [arXiv:2512.01946](https://arxiv.org/abs/2512.01946) | [GitHub](https://github.com/) | [🤗 Guardian collection](https://huggingface.co/collections/paulpacaud/robotic-failure-detection-dataset-and-model-guardian) |
## Model summary
- **Architecture**: InternVL3-8B (Qwen2.5-7B LLM + InternViT-300M-448px-V2.5), fine-tuned with LoRA (rank 16) on the LLM only; visual encoder and MLP connector kept frozen.
- **Capabilities**:
  - **Planning verification**: from an initial scene image and a proposed list of subtasks, decide whether the plan is correct.
  - **Execution verification**: from before/after observations of a subtask (single-view or multi-view), decide whether the subtask succeeded.
  - **Thinking mode**: every prediction is preceded by an explicit reasoning trace.
- **Output format**:
  - Thinking: `<think> reasoning </think> <answer> True|False </answer> <category> ... </category>`
- **Training data**: FailCoT (RLBench-Fail + BridgeDataV2-Fail), ~30K planning + execution failures with reasoning traces. See the paper *Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation* (Pacaud et al., 2026).
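The output format listed above can be split into its three fields with a few lines of Python. This is a minimal sketch, not part of the Guardian codebase: the names `GuardianOutput` and `parse_guardian_output` are hypothetical, and the wrapper in the Guardian repo may already handle parsing for you.

```python
import re
from typing import NamedTuple, Optional


class GuardianOutput(NamedTuple):
    """Parsed fields of one generation (hypothetical container, not from the repo)."""
    reasoning: Optional[str]
    answer: Optional[bool]
    category: Optional[str]


def _tag(text: str, name: str) -> Optional[str]:
    """Return the stripped content of the first <name>...</name> pair, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_guardian_output(text: str) -> GuardianOutput:
    """Split raw generated text into reasoning, boolean answer, and category."""
    answer: Optional[bool] = None
    answer_text = _tag(text, "answer")
    if answer_text is not None and answer_text.lower() in ("true", "false"):
        answer = answer_text.lower() == "true"
    # Missing or malformed tags are reported as None rather than guessed.
    return GuardianOutput(_tag(text, "think"), answer, _tag(text, "category"))
```

Returning `None` for a missing or malformed tag lets the caller decide how to treat an unparseable generation, for example by retrying or by treating it as an inconclusive check.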
## Quick start
The simplest way to run Guardian is the lightweight wrapper shipped in the Guardian repo (`examples/guardian.py`):
```python
from examples.guardian import Guardian

guardian = Guardian(
    model_path="<path>/guardian-thinking",
    thinking=True,
)

# Planning verification: 1 image of the initial scene
answer, category = guardian.verify_plan(
    img_paths_list=["/path/to/start_img.png"],
    task_instruction="stack the red cup on the blue cup",
    plan=str([
        "grasp red cup",
        "move grasped object on top of blue cup",
        "release",
    ]),
)

# Execution verification: 2, 6, or 8 images (before/after, possibly multi-view)
answer, category = guardian.verify_subtask(
    img_paths_list=[
        "/path/to/start_left.png",
        "/path/to/start_right.png",
        "/path/to/start_wrist.png",
        "/path/to/end_left.png",
        "/path/to/end_right.png",
        "/path/to/end_wrist.png",
    ],
    task_instruction="stack the red cup on the blue cup",
    subtask_instruction="grasp red cup",
)
```
For execution verification, the wrapper accepts:
- **2 images** — single-view: `[start, end]`
- **6 images** — three views: `[start_left, start_right, start_wrist, end_left, end_right, end_wrist]`
- **8 images** — four views, similarly ordered.
See [`docs/RUN_DEMO.md`](https://github.com/) in the Guardian repo for the full demo.
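Because the image list puts all start frames first and all end frames second, in the same view order, it can help to build it from per-view dictionaries instead of by hand. The helper below is an illustrative sketch: `build_execution_image_list` is a hypothetical name that does not come from the Guardian repo, and the caller supplies the view order because the four-view layout is not spelled out here.

```python
from typing import Dict, List, Sequence


def build_execution_image_list(
    start: Dict[str, str],
    end: Dict[str, str],
    view_order: Sequence[str],
) -> List[str]:
    """Return [start views..., end views...] with both halves in view_order."""
    if len(set(view_order)) != len(view_order):
        raise ValueError("view_order must not contain duplicate view names")
    if len(view_order) not in (1, 3, 4):
        # 1, 3, or 4 views give the 2-, 6-, or 8-image inputs listed above.
        raise ValueError("expected 1, 3, or 4 views")
    if set(start) != set(view_order) or set(end) != set(view_order):
        raise ValueError("start and end must each provide exactly the views in view_order")
    return [start[v] for v in view_order] + [end[v] for v in view_order]


# Usage: three views, producing the six-image ordering shown in the quick start.
paths = build_execution_image_list(
    start={"left": "/path/to/start_left.png",
           "right": "/path/to/start_right.png",
           "wrist": "/path/to/start_wrist.png"},
    end={"left": "/path/to/end_left.png",
         "right": "/path/to/end_right.png",
         "wrist": "/path/to/end_wrist.png"},
    view_order=["left", "right", "wrist"],
)
```

The resulting `paths` list can then be passed as `img_paths_list` to `verify_subtask`.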
## Downloading the checkpoint
```bash
hf download paulpacaud/guardian-thinking \
  --local-dir ./data/failure_forge/models/guardian-thinking
```
The codebase expects the checkpoint to live under `./data/failure_forge/models/guardian-thinking/`.
## Evaluation
Guardian is evaluated on three real-robot OOD benchmarks bundled at [`paulpacaud/Guardian-FailCoT-OOD-datasets`](https://huggingface.co/datasets/paulpacaud/Guardian-FailCoT-OOD-datasets) — UR5-Fail, RoboFail, RoboVQA — plus the in-distribution test splits of FailCoT (RLBench-Fail / BridgeDataV2-Fail).
To reproduce the evaluation results, follow [`docs/Offline_VQA_Evaluation.md`](https://github.com/) in the Guardian repo. Headline numbers from Table II of the paper:
| Benchmark | Execution acc. | Planning acc. |
|---|---|---|
| RoboFail | 0.86 | 0.70 |
| UR5-Fail | 0.77 | 0.89 |
| RoboVQA | 0.85 | — |
## Intended use
Guardian is designed as a plug-and-play verification module for robotic manipulation pipelines (e.g. as the verifier in 3D-LOTUS++): at each planning step or subtask boundary, query Guardian; on a failure, trigger replanning or re-execution.
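The query-then-recover pattern described above might look like the loop below. This is a sketch under stated assumptions, not the integration used in 3D-LOTUS++ or in the paper: `run_with_verification`, `execute`, `verify`, and `max_retries` are all hypothetical names, and `verify` stands in for whatever code calls Guardian and converts its answer into a boolean.

```python
from typing import Callable, List


def run_with_verification(
    subtasks: List[str],
    execute: Callable[[str], None],
    verify: Callable[[str], bool],
    max_retries: int = 2,
) -> bool:
    """Run subtasks in order, re-executing any subtask the verifier rejects.

    Returns True if every subtask was eventually verified, and False as soon
    as one subtask still fails after 1 + max_retries attempts (the point at
    which a real pipeline would hand control back to the planner).
    """
    for subtask in subtasks:
        for _attempt in range(1 + max_retries):
            execute(subtask)      # hypothetical: command the robot
            if verify(subtask):   # hypothetical: query the verifier
                break
        else:
            # The inner loop finished without a successful verification.
            return False
    return True
```

Keeping the verifier behind a plain callable means the control loop does not depend on any particular model checkpoint, which fits the plug-and-play role described here.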
## Citation
```bibtex
@misc{pacaud2026guardian_failcot,
title = {Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation},
author = {Paul Pacaud and Ricardo Garcia and Shizhe Chen and Cordelia Schmid},
year = {2026},
eprint = {2512.01946},
archivePrefix = {arXiv},
primaryClass = {cs.RO}
}
```
If you specifically build on the earlier Guardian workshop paper:
```bibtex
@inproceedings{pacaud2025guardian,
title = {Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models},
author = {Paul Pacaud and Ricardo Garcia Pinel and Shizhe Chen and Cordelia Schmid},
booktitle = {Workshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025},
year = {2025},
url = {https://openreview.net/forum?id=wps46mtC9B}
}
```
## License
Released under the Apache 2.0 license, inheriting the license of the InternVL3-8B base model.