---
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - robotics
  - failure-detection
  - manipulation
  - vision-language
  - multi-view
  - chain-of-thought
  - internvl
base_model: OpenGVLab/InternVL3-8B
---

# Guardian — Multi-View VLM for Robotic Planning & Execution Failure Detection (Thinking variant)

**Guardian** is a vision-language model fine-tuned for **unified planning and execution verification** in robotic manipulation. Given an instruction and one or more images of the robot scene, it predicts whether a proposed plan is correct (planning verification) or whether a subtask was successfully executed (execution verification), and emits an explicit chain-of-thought reasoning trace alongside the final answer.

This checkpoint (`guardian-thinking`) is the **thinking** variant: it is trained and run with `<think> ... </think>` reasoning emitted before the final `<answer>` and `<category>` tokens. A lighter no-CoT counterpart (`guardian-vanilla`) is released separately.

| Project page | Paper | Code | Data |
|---|---|---|---|
| [di.ens.fr/willow/research/guardian](https://www.di.ens.fr/willow/research/guardian/) | [arXiv:2512.01946](https://arxiv.org/abs/2512.01946) | [GitHub](https://github.com/) | [🤗 Guardian collection](https://huggingface.co/collections/paulpacaud/robotic-failure-detection-dataset-and-model-guardian) |

## Model summary

- **Architecture**: InternVL3-8B (Qwen2.5-7B LLM + InternViT-300M-448px-V2.5), fine-tuned with LoRA (rank 16) on the LLM only; visual encoder and MLP connector kept frozen.
- **Capabilities**:
  - **Planning verification** — from an initial scene image and a proposed list of subtasks, decide whether the plan is correct.
  - **Execution verification** — from before/after observations of a subtask (single-view or multi-view), decide whether the subtask succeeded.
  - **Thinking mode** — every prediction is preceded by an explicit reasoning trace.
- **Output format**:
  - Thinking: `<think> reasoning </think> <answer> True|False </answer> <category> ... </category>` (a parsing sketch follows this list)
- **Training data**: FailCoT (RLBench-Fail + BridgeDataV2-Fail), ~30K planning + execution failures with reasoning traces. See the paper *Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation* (Pacaud et al., 2026).
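
The tagged output can be split into its parts with a simple regex. A minimal sketch (the tag layout follows the format above; the helper name is ours, not part of the Guardian codebase):

```python
import re

def parse_guardian_output(text: str):
    """Split a Guardian 'thinking' generation into (reasoning, success, category).

    Expects the tag layout shown above:
    <think> ... </think> <answer> True|False </answer> <category> ... </category>
    A missing tag yields None for that field.
    """
    def grab(tag: str):
        match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", text, flags=re.DOTALL)
        return match.group(1) if match else None

    reasoning = grab("think")
    answer = grab("answer")
    category = grab("category")
    success = None if answer is None else answer.strip() == "True"
    return reasoning, success, category

# Example on a synthetic generation
reasoning, success, category = parse_guardian_output(
    "<think> the gripper holds the red cup </think> <answer> True </answer> <category> none </category>"
)
```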

## Quick start

The simplest way to run Guardian is to use the lightweight wrapper shipped in the Guardian repo (`examples/guardian.py`):

```python
from examples.guardian import Guardian

guardian = Guardian(
    model_path="<path>/guardian-thinking",
    thinking=True,
)

# Planning verification: 1 image of the initial scene
answer, category = guardian.verify_plan(
    img_paths_list=["/path/to/start_img.png"],
    task_instruction="stack the red cup on the blue cup",
    plan=str([
        "grasp red cup",
        "move grasped object on top of blue cup",
        "release",
    ]),
)

# Execution verification: 2, 6, or 8 images (before/after, possibly multi-view)
answer, category = guardian.verify_subtask(
    img_paths_list=[
        "/path/to/start_left.png",
        "/path/to/start_right.png",
        "/path/to/start_wrist.png",
        "/path/to/end_left.png",
        "/path/to/end_right.png",
        "/path/to/end_wrist.png",
    ],
    task_instruction="stack the red cup on the blue cup",
    subtask_instruction="grasp red cup",
)
```

For execution verification, the wrapper accepts:
- **2 images** — single-view: `[start, end]`
- **6 images** — three views: `[start_left, start_right, start_wrist, end_left, end_right, end_wrist]`
- **8 images** — four views, similarly ordered.

See [`docs/RUN_DEMO.md`](https://github.com/) in the Guardian repo for the full demo.
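
If you prefer to bypass the wrapper, the checkpoint should load directly through `transformers`. The sketch below is only illustrative: it assumes the LoRA weights are merged into the released checkpoint, that the standard InternVL3 remote-code `chat` interface is exposed, and uses a simplified single-tile preprocessing; the exact prompt templates live in the wrapper.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

path = "paulpacaud/guardian-thinking"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Single 448x448 tile normalized with ImageNet statistics, matching InternViT-300M-448px.
preprocess = transforms.Compose([
    transforms.Resize((448, 448), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = preprocess(Image.open("/path/to/start_img.png").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

# Illustrative question only; the wrapper builds the actual planning/execution prompts.
question = (
    "<image>\nTask: stack the red cup on the blue cup.\n"
    "Plan: ['grasp red cup', 'move grasped object on top of blue cup', 'release']\n"
    "Is this plan correct?"
)
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=512, do_sample=False))
print(response)
```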

## Downloading the checkpoint

```bash
hf download paulpacaud/guardian-thinking \
    --local-dir ./data/failure_forge/models/guardian-thinking
```

The codebase expects the checkpoint to live under `./data/failure_forge/models/guardian-thinking/`.
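
If you prefer to stay in Python, the same download works with `huggingface_hub.snapshot_download` (equivalent to the CLI call above):

```python
from huggingface_hub import snapshot_download

# Fetch the full checkpoint to the directory the codebase expects.
snapshot_download(
    repo_id="paulpacaud/guardian-thinking",
    local_dir="./data/failure_forge/models/guardian-thinking",
)
```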

## Evaluation

Guardian is evaluated on three real-robot OOD benchmarks bundled at [`paulpacaud/Guardian-FailCoT-OOD-datasets`](https://huggingface.co/datasets/paulpacaud/Guardian-FailCoT-OOD-datasets) — UR5-Fail, RoboFail, RoboVQA — plus the in-distribution test splits of FailCoT (RLBench-Fail / BridgeDataV2-Fail).

To reproduce the evaluation, follow [`docs/Offline_VQA_Evaluation.md`](https://github.com/) in the Guardian repo. Headline numbers from Table II of the paper:

| Benchmark | Execution acc. | Planning acc. |
|---|---|---|
| RoboFail  | 0.86 | 0.70 |
| UR5-Fail  | 0.77 | 0.89 |
| RoboVQA   | 0.85 | —    |

## Intended use

Guardian is designed as a plug-and-play verification module for robotic manipulation pipelines (e.g. as the verifier in 3D-LOTUS++): at each planning step or subtask boundary, query Guardian; on a failure, trigger replanning or re-execution.
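
As an illustration of that loop, here is a minimal sketch of an execution-time integration using the wrapper from the quick start; `capture_views` and `execute_subtask` are hypothetical stand-ins for the host pipeline's cameras and robot controller, and the retry policy is only an example:

```python
from examples.guardian import Guardian

guardian = Guardian(model_path="<path>/guardian-thinking", thinking=True)

def run_with_verification(task_instruction, plan, max_retries=1):
    """Run a plan subtask by subtask, asking Guardian to verify each execution
    and re-executing a subtask when a failure is detected."""
    for subtask in plan:
        for _ in range(max_retries + 1):
            start_views = capture_views()   # hypothetical: image paths before executing the subtask
            execute_subtask(subtask)        # hypothetical: call into the robot controller
            end_views = capture_views()     # hypothetical: image paths after executing the subtask

            answer, category = guardian.verify_subtask(
                img_paths_list=start_views + end_views,  # e.g. 3 start views + 3 end views
                task_instruction=task_instruction,
                subtask_instruction=subtask,
            )
            if answer in (True, "True"):    # check the wrapper's exact return type in examples/guardian.py
                break
            print(f"Subtask '{subtask}' judged failed ({category}); re-executing")
        else:
            raise RuntimeError(f"Subtask '{subtask}' keeps failing; replanning required")
```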

## Citation

```bibtex
@misc{pacaud2026guardian_failcot,
  title  = {Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation},
  author = {Paul Pacaud and Ricardo Garcia and Shizhe Chen and Cordelia Schmid},
  year   = {2026},
  eprint = {2512.01946},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}
```

If you specifically build on the earlier Guardian workshop paper:

```bibtex
@inproceedings{pacaud2025guardian,
  title     = {Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models},
  author    = {Paul Pacaud and Ricardo Garcia Pinel and Shizhe Chen and Cordelia Schmid},
  booktitle = {Workshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025},
  year      = {2025},
  url       = {https://openreview.net/forum?id=wps46mtC9B}
}
```

## License

Released under the Apache 2.0 license, inheriting the license of the InternVL3-8B base model.