Commit 3340866 · 1 Parent(s): 7d8fc87
preTrained predictions are ready for observation
- README.md +1 -405
- maskcut/README.md +1 -292
- third_party/README.md +1 -293
README.md
CHANGED
|
@@ -1,405 +1 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
**Cut**-and-**LE**a**R**n (**CutLER**) is a simple approach for training object detection and instance segmentation models without human annotations.
|
| 4 |
-
It outperforms previous SOTA by **2.7 times** for AP50 and **2.6 times** for AR on **11 benchmarks**.
|
| 5 |
-
|
| 6 |
-
<p align="center"> <img src='docs/teaser_img.jpg' align="center" > </p>
|
| 7 |
-
|
| 8 |
-
> [**Cut and Learn for Unsupervised Object Detection and Instance Segmentation**](http://people.eecs.berkeley.edu/~xdwang/projects/CutLER/)
|
| 9 |
-
> [Xudong Wang](https://people.eecs.berkeley.edu/~xdwang/), [Rohit Girdhar](https://rohitgirdhar.github.io/), [Stella X. Yu](https://www1.icsi.berkeley.edu/~stellayu/), [Ishan Misra](https://imisra.github.io/)
|
| 10 |
-
> FAIR, Meta AI; UC Berkeley
|
| 11 |
-
> CVPR 2023
|
| 12 |
-
|
| 13 |
-
[[`project page`](http://people.eecs.berkeley.edu/~xdwang/projects/CutLER/)] [[`arxiv`](https://arxiv.org/abs/2301.11320)] [[`colab`](https://colab.research.google.com/drive/1NgEyFHvOfuA2MZZnfNPWg1w5gSr3HOBb?usp=sharing)] [[`bibtex`](#citation)]
|
| 14 |
-
|
| 15 |
-
Unsupervised video instance segmentation (**VideoCutLER**) is also supported. ***We demonstrate that video instance segmentation models can be learned without using any human annotations, without relying on natural videos (ImageNet data alone is sufficient), and even without motion estimations!*** The code is available [here](videocutler).
|
| 16 |
-
|
| 17 |
-
<p align="center">
|
| 18 |
-
<img src="docs/demos_videocutler.gif" width=100%>
|
| 19 |
-
</p>
|
| 20 |
-
|
| 21 |
-
> [**VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation**](https://people.eecs.berkeley.edu/~xdwang/projects/VideoCutLER/videocutler.pdf)
|
| 22 |
-
> [Xudong Wang](https://people.eecs.berkeley.edu/~xdwang/), [Ishan Misra](https://imisra.github.io/), Ziyun Zeng, [Rohit Girdhar](https://rohitgirdhar.github.io/), [Trevor Darrell](https://people.eecs.berkeley.edu/~trevor/)
|
| 23 |
-
> UC Berkeley; FAIR, Meta AI
|
| 24 |
-
> CVPR 2024
|
| 25 |
-
|
| 26 |
-
[[`code`](videocutler/README.md)] [[`PDF`](https://people.eecs.berkeley.edu/~xdwang/projects/VideoCutLER/videocutler.pdf)] [[`arxiv`](https://arxiv.org/abs/2308.14710)] [[`bibtex`](#citation)]
|
| 27 |
-
|
| 28 |
-
## Features
|
| 29 |
-
- We propose the MaskCut approach to generate pseudo-masks for multiple objects in an image.
|
| 30 |
-
- CutLER can learn unsupervised object detectors and instance segmentors solely on ImageNet-1K.
|
| 31 |
-
- CutLER exhibits strong robustness to domain shifts when evaluated on 11 different benchmarks across domains like natural images, video frames, paintings, sketches, etc.
|
| 32 |
-
- CutLER can serve as a pretrained model for fully/semi-supervised detection and segmentation tasks.
|
| 33 |
-
- We also propose VideoCutLER, a surprisingly simple unsupervised video instance segmentation (UVIS) method that does not rely on optical flow. ImageNet-1K is all we need for training a SOTA UVIS model!
|
| 34 |
-
|
| 35 |
-
## Installation
|
| 36 |
-
See [installation instructions](INSTALL.md).
|
| 37 |
-
|
| 38 |
-
## Dataset Preparation
|
| 39 |
-
See [Preparing Datasets for CutLER](datasets/README.md).
|
| 40 |
-
|
| 41 |
-
## Method Overview
|
| 42 |
-
<p align="center">
|
| 43 |
-
<img src="docs/pipeline.jpg" width=55%>
|
| 44 |
-
</p>
|
| 45 |
-
Cut-and-Learn has two stages: 1) generating pseudo-masks with MaskCut and 2) learning unsupervised detectors from pseudo-masks of unlabeled data.
|
| 46 |
-
|
| 47 |
-
### 1. MaskCut
|
| 48 |
-
|
| 49 |
-
MaskCut can be used to provide segmentation masks for multiple instances of each image.
|
| 50 |
-
<p align="center">
|
| 51 |
-
<img src="docs/maskcut.gif" width=100%>
|
| 52 |
-
</p>
|
| 53 |
-
|
| 54 |
-
### MaskCut Demo
|
| 55 |
-
|
| 56 |
-
Try out the MaskCut demo using Colab (no GPU needed): [Open in Colab](https://colab.research.google.com/drive/1X05lKL_IBRvZB7q6n6pb4w00_tIYjGlf?usp=sharing)
|
| 57 |
-
|
| 58 |
-
Try out the web demo: [Hugging Face Spaces](https://huggingface.co/spaces/facebook/MaskCut) (thanks to [@hysts](https://github.com/hysts)!)
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
If you want to run MaskCut locally, we provide `demo.py`, which visualizes the pseudo-masks produced by MaskCut.
|
| 64 |
-
Run it with:
|
| 65 |
-
```
|
| 66 |
-
cd maskcut
|
| 67 |
-
python demo.py --img-path imgs/demo2.jpg \
|
| 68 |
-
--N 3 --tau 0.15 --vit-arch base --patch-size 8 \
|
| 69 |
-
[--other-options]
|
| 70 |
-
```
|
| 71 |
-
We provide a few demo images in `maskcut/imgs/`. To run `demo.py` on CPU, add `--cpu` when running the demo script.
|
| 72 |
-
For imgs/demo4.jpg, you need to use "--N 6" to segment all six instances in the image.
|
| 73 |
-
Below are some visualizations of the pseudo-masks on the demo images.
|
| 74 |
-
<p align="center">
|
| 75 |
-
<img src="docs/maskcut-demo.jpg" width=100%>
|
| 76 |
-
</p>
|
| 77 |
-
|
| 78 |
-
### Generating Annotations for ImageNet-1K with MaskCut
|
| 79 |
-
To generate pseudo-masks for ImageNet-1K using MaskCut, first set up the ImageNet-1K dataset according to the instructions in [datasets/README.md](datasets/README.md), then execute the following command:
|
| 80 |
-
```
|
| 81 |
-
cd maskcut
|
| 82 |
-
python maskcut.py \
|
| 83 |
-
--vit-arch base --patch-size 8 \
|
| 84 |
-
--tau 0.15 --fixed_size 480 --N 3 \
|
| 85 |
-
--num-folder-per-job 1000 --job-index 0 \
|
| 86 |
-
--dataset-path /path/to/dataset/traindir \
|
| 87 |
-
--out-dir /path/to/save/annotations
|
| 88 |
-
```
|
| 89 |
-
Generating pseudo-masks for all 1.3 million images in 1,000 folders takes a significant amount of time, so we recommend splitting the work across multiple runs. Each run processes a smaller number of image folders, selected with "--num-folder-per-job" and "--job-index". Once all runs are complete, merge the resulting json files with the following command:
|
| 90 |
-
```
|
| 91 |
-
python merge_jsons.py \
|
| 92 |
-
--base-dir /path/to/save/annotations \
|
| 93 |
-
--num-folder-per-job 2 --fixed-size 480 \
|
| 94 |
-
--tau 0.15 --N 3 \
|
| 95 |
-
--save-path imagenet_train_fixsize480_tau0.15_N3.json
|
| 96 |
-
```
|
| 97 |
-
The "--num-folder-per-job", "--fixed-size", "--tau" and "--N" of merge_jsons.py should match the ones used to run maskcut.py.
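The split across runs can be sketched as follows. This is a minimal illustration, assuming each job takes a contiguous slice of the class folders; `folders_for_job` and `num_jobs` are hypothetical helpers that are not part of `maskcut.py`, and the actual script may divide the folders differently.

```python
import math


def folders_for_job(num_folders: int, num_folder_per_job: int, job_index: int) -> range:
    """Folder indices one job would process (assumed contiguous split)."""
    start = job_index * num_folder_per_job
    end = min(start + num_folder_per_job, num_folders)
    return range(start, end)


def num_jobs(num_folders: int, num_folder_per_job: int) -> int:
    """Number of runs needed to cover every folder exactly once."""
    return math.ceil(num_folders / num_folder_per_job)


# ImageNet-1K has 1,000 class folders; with 2 folders per job that is 500 runs.
jobs = num_jobs(1000, 2)
first = list(folders_for_job(1000, 2, 0))
last = list(folders_for_job(1000, 2, jobs - 1))
print(jobs, first, last)  # 500 [0, 1] [998, 999]
```

Under this assumption, `--job-index` ranges from 0 to `jobs - 1`, and the last job may receive fewer folders when the folder count is not a multiple of `--num-folder-per-job`.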
|
| 98 |
-
|
| 99 |
-
We also provide a submitit script to launch the pseudo-mask generation process with multiple nodes.
|
| 100 |
-
```
|
| 101 |
-
cd maskcut
|
| 102 |
-
bash run_maskcut_with_submitit.sh
|
| 103 |
-
```
|
| 104 |
-
After that, you can use "merge_jsons.py" to merge all these json files as described above.
|
| 105 |
-
|
| 106 |
-
### 2. CutLER
|
| 107 |
-
|
| 108 |
-
### Inference Demo for CutLER with Pre-trained Models
|
| 109 |
-
Try out the CutLER demo using Colab (no GPU needed): [Open in Colab](https://colab.research.google.com/drive/1NgEyFHvOfuA2MZZnfNPWg1w5gSr3HOBb?usp=sharing)
|
| 110 |
-
|
| 111 |
-
Try out the web demo: [Hugging Face Spaces](https://huggingface.co/spaces/facebook/CutLER) (thanks to [@hysts](https://github.com/hysts)!)
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
Try out the Replicate demo and the API: [Replicate](https://replicate.com/cjwbw/cutler)
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
If you want to run CutLER demos locally,
|
| 118 |
-
1. Pick a model and its config file from [model zoo](#model-zoo),
|
| 119 |
-
for example, `model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN.yaml`.
|
| 120 |
-
2. We provide `demo.py`, which can run the built-in configs. Run it with:
|
| 121 |
-
```
|
| 122 |
-
cd cutler
|
| 123 |
-
python demo/demo.py --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_demo.yaml \
|
| 124 |
-
--input demo/imgs/*.jpg \
|
| 125 |
-
[--other-options]
|
| 126 |
-
--opts MODEL.WEIGHTS /path/to/cutler_w_cascade_checkpoint
|
| 127 |
-
```
|
| 128 |
-
The configs are written for training, so for evaluation `MODEL.WEIGHTS` must point to a checkpoint from the model zoo.
|
| 129 |
-
This command will run the inference and show visualizations in an OpenCV window.
|
| 130 |
-
<!-- For details of the command line arguments, see `demo.py -h` or look at its source code
|
| 131 |
-
to understand its behavior. Some common arguments are: -->
|
| 132 |
-
* To run __on cpu__, add `MODEL.DEVICE cpu` after `--opts`.
|
| 133 |
-
* To save outputs to a directory (for images) or a file (for webcam or video), use `--output`.
|
| 134 |
-
|
| 135 |
-
Below are some visualizations of the model predictions on the demo images.
|
| 136 |
-
<p align="center">
|
| 137 |
-
<img src="docs/cutler-demo.jpg" width=100%>
|
| 138 |
-
</p>
|
| 139 |
-
|
| 140 |
-
### Unsupervised Model Learning
|
| 141 |
-
Before training the detector, it is necessary to use MaskCut to generate pseudo-masks for all ImageNet data.
|
| 142 |
-
You can either use the pre-generated json file directly by downloading it from [here](http://dl.fbaipublicfiles.com/cutler/maskcut/imagenet_train_fixsize480_tau0.15_N3.json) and placing it under "DETECTRON2_DATASETS/imagenet/annotations/", or generate your own pseudo-masks by following the instructions in [MaskCut](#1-maskcut).
|
| 143 |
-
|
| 144 |
-
We provide a script, `train_net.py`, that can train all the configs provided in CutLER.
|
| 145 |
-
To train a model with "train_net.py", first set up the ImageNet-1K dataset following [datasets/README.md](datasets/README.md), then run:
|
| 146 |
-
```
|
| 147 |
-
cd cutler
|
| 148 |
-
export DETECTRON2_DATASETS=/path/to/DETECTRON2_DATASETS/
|
| 149 |
-
python train_net.py --num-gpus 8 \
|
| 150 |
-
--config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN.yaml
|
| 151 |
-
```
|
| 152 |
-
|
| 153 |
-
If you want to train a model using multiple nodes, you may need to adjust [some model parameters](https://arxiv.org/abs/1706.02677) and some SBATCH command options in "tools/train-1node.sh" and "tools/single-node_run.sh", then run:
|
| 154 |
-
```
|
| 155 |
-
cd cutler
|
| 156 |
-
sbatch tools/train-1node.sh \
|
| 157 |
-
--config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN.yaml \
|
| 158 |
-
MODEL.WEIGHTS /path/to/dino/d2format/model \
|
| 159 |
-
OUTPUT_DIR output/
|
| 160 |
-
```
|
| 161 |
-
You can also convert a pre-trained DINO model to detectron2's format by yourself following [this link](https://github.com/facebookresearch/moco/tree/main/detection).
|
| 162 |
-
|
| 163 |
-
### Self-training
|
| 164 |
-
We further improve performance by self-training the model on its predictions.
|
| 165 |
-
|
| 166 |
-
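First, get model predictions on ImageNet by running the command below (`MODEL.WEIGHTS` loads the previous stage/round checkpoint, and `OUTPUT_DIR` is where the model predictions are saved):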
|
| 167 |
-
```
|
| 168 |
-
python train_net.py --num-gpus 8 \
|
| 169 |
-
--config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN.yaml \
|
| 170 |
-
--test-dataset imagenet_train \
|
| 171 |
-
--eval-only TEST.DETECTIONS_PER_IMAGE 30 \
|
| 172 |
-
MODEL.WEIGHTS output/model_final.pth \
|
| 173 |
-
OUTPUT_DIR output/ # path to save model predictions
|
| 174 |
-
```
|
| 175 |
-
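Second, run the following command to generate the json file for the first round of self-training (`--new-pred` loads the model predictions, `--prev-ann` is the path to the old annotation file, and `--save-path` is where the new annotation file is saved):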
|
| 176 |
-
```
|
| 177 |
-
python tools/get_self_training_ann.py \
|
| 178 |
-
--new-pred output/inference/coco_instances_results.json \
|
| 179 |
-
--prev-ann DETECTRON2_DATASETS/imagenet/annotations/imagenet_train_fixsize480_tau0.15_N3.json \
|
| 180 |
-
--save-path DETECTRON2_DATASETS/imagenet/annotations/cutler_imagenet1k_train_r1.json \
|
| 181 |
-
--threshold 0.7
|
| 182 |
-
```
|
| 183 |
-
Finally, place "cutler_imagenet1k_train_r1.json" under "DETECTRON2_DATASETS/imagenet/annotations/", then launch the self-training process:
|
| 184 |
-
```
|
| 185 |
-
python train_net.py --num-gpus 8 \
|
| 186 |
-
--config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml \
|
| 187 |
-
--train-dataset imagenet_train_r1 \
|
| 188 |
-
MODEL.WEIGHTS output/model_final.pth \
|
| 189 |
-
OUTPUT_DIR output/self-train-r1/ # path to save checkpoints
|
| 190 |
-
```
|
| 191 |
-
|
| 192 |
-
You can repeat the steps above to perform multiple rounds of self-training and adjust some arguments as needed (e.g., "--threshold" for round 1 and 2 can be set to 0.7 and 0.65, respectively; "--train-dataset" for round 1 and 2 can be set to "imagenet_train_r1" and "imagenet_train_r2", respectively; MODEL.WEIGHTS for round 1 and 2 should point to the previous stage/round checkpoints). Ensure that all annotation files are placed under DETECTRON2_DATASETS/imagenet/annotations/.
|
| 193 |
-
Please ensure that "--train-dataset", json file names and locations match the ones specified in "cutler/data/datasets/builtin.py".
|
| 194 |
-
Please refer to this [instruction](https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html) for guidance on using custom datasets.
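The per-round arguments described above can be collected in one place. The sketch below is only an illustration: `RoundPlan` and `plan_round` are hypothetical names, and using the previous round's json as `--prev-ann` from round 2 onwards is an assumption rather than something the text states. The thresholds and file names are the ones given above.

```python
from dataclasses import dataclass

ANN_DIR = "DETECTRON2_DATASETS/imagenet/annotations"

# Thresholds for rounds 1 and 2 are taken from the text; later rounds are not specified.
THRESHOLDS = {1: 0.7, 2: 0.65}


@dataclass
class RoundPlan:
    round_idx: int
    threshold: float    # value for --threshold
    train_dataset: str  # value for --train-dataset
    prev_ann: str       # value for --prev-ann
    save_path: str      # value for --save-path


def plan_round(round_idx: int) -> RoundPlan:
    """Collect the arguments for one self-training round."""
    if round_idx not in THRESHOLDS:
        raise ValueError(f"no threshold documented for round {round_idx}")
    if round_idx == 1:
        prev_ann = f"{ANN_DIR}/imagenet_train_fixsize480_tau0.15_N3.json"
    else:
        # Assumption: each round starts from the previous round's annotations.
        prev_ann = f"{ANN_DIR}/cutler_imagenet1k_train_r{round_idx - 1}.json"
    return RoundPlan(
        round_idx=round_idx,
        threshold=THRESHOLDS[round_idx],
        train_dataset=f"imagenet_train_r{round_idx}",
        prev_ann=prev_ann,
        save_path=f"{ANN_DIR}/cutler_imagenet1k_train_r{round_idx}.json",
    )


for r in (1, 2):
    print(plan_round(r))
```

Keeping the names in one helper makes it easier to check them against the entries in "cutler/data/datasets/builtin.py" before launching a round.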
|
| 195 |
-
|
| 196 |
-
You can also directly download the MODEL.WEIGHTS and annotations used for each round of self-training:
|
| 197 |
-
<table><tbody>
|
| 198 |
-
<!-- START TABLE -->
|
| 199 |
-
<!-- TABLE BODY -->
|
| 200 |
-
<!-- ROW: round 1 -->
|
| 201 |
-
<tr><td align="center">round 1</td>
|
| 202 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_cascade_r1.pth">cutler_cascade_r1.pth</a></td>
|
| 203 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/maskcut/cutler_imagenet1k_train_r1.json">cutler_imagenet1k_train_r1.json</a></td>
|
| 204 |
-
</tr>
|
| 205 |
-
<!-- ROW: round 2 -->
|
| 206 |
-
<tr><td align="center">round 2</td>
|
| 207 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_cascade_r2.pth">cutler_cascade_r2.pth</a></td>
|
| 208 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/maskcut/cutler_imagenet1k_train_r2.json">cutler_imagenet1k_train_r2.json</a></td>
|
| 209 |
-
</tr>
|
| 210 |
-
</tbody></table>
|
| 211 |
-
|
| 212 |
-
### Unsupervised Zero-shot Evaluation
|
| 213 |
-
To evaluate a model's performance on 11 different datasets, please refer to [datasets/README.md](datasets/README.md) for instructions on preparing the datasets. Next, select a model from the model zoo, specify the "model_weights", "config_file" and the path to "DETECTRON2_DATASETS" in `tools/eval.sh`, then run the script.
|
| 214 |
-
```
|
| 215 |
-
bash tools/eval.sh
|
| 216 |
-
```
|
| 217 |
-
|
| 218 |
-
### Model Zoo
|
| 219 |
-
We show zero-shot unsupervised object detection performance (AP50 | AR) on 11 different datasets spanning a variety of domains. ^: CutLER using Mask R-CNN as a detector; *: CutLER using Cascade Mask R-CNN as a detector.
|
| 220 |
-
<table><tbody>
|
| 221 |
-
<!-- START TABLE -->
|
| 222 |
-
<!-- TABLE HEADER -->
|
| 223 |
-
<tr><th valign="bottom">Methods</th>
|
| 224 |
-
<th valign="bottom">Models</th>
|
| 225 |
-
<th valign="bottom">COCO</th>
|
| 226 |
-
<th valign="bottom">COCO20K</th>
|
| 227 |
-
<th valign="bottom">VOC</th>
|
| 228 |
-
<th valign="bottom">LVIS</th>
|
| 229 |
-
<th valign="bottom">UVO</th>
|
| 230 |
-
<th valign="bottom">Clipart</th>
|
| 231 |
-
<th valign="bottom">Comic</th>
|
| 232 |
-
<th valign="bottom">Watercolor</th>
|
| 233 |
-
<th valign="bottom">KITTI</th>
|
| 234 |
-
<th valign="bottom">Objects365</th>
|
| 235 |
-
<th valign="bottom">OpenImages</th>
|
| 236 |
-
<!-- TABLE BODY -->
|
| 237 |
-
</tr>
|
| 238 |
-
<tr><td align="center">Prev. SOTA</td>
|
| 239 |
-
<td valign="bottom">-</td>
|
| 240 |
-
<td align="center">9.6 | 12.6</td>
|
| 241 |
-
<td align="center">9.7 | 12.6</td>
|
| 242 |
-
<td align="center">15.9 | 21.3</td>
|
| 243 |
-
<td align="center">3.8 | 6.4</td>
|
| 244 |
-
<td align="center">10.0 | 14.2</td>
|
| 245 |
-
<td align="center">7.9 | 15.1</td>
|
| 246 |
-
<td align="center">9.9 | 16.3</td>
|
| 247 |
-
<td align="center">6.7 | 16.2</td>
|
| 248 |
-
<td align="center">7.7 | 7.1</td>
|
| 249 |
-
<td align="center">8.1 | 10.2</td>
|
| 250 |
-
<td align="center">9.9 | 14.9</td>
|
| 251 |
-
</tr>
|
| 252 |
-
<!-- ROW: Box/Mask AP for CutLER -->
|
| 253 |
-
|
| 254 |
-
<tr><td align="center">CutLER^</td>
|
| 255 |
-
<td valign="bottom"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_mrcnn_final.pth">download</a></td>
|
| 256 |
-
<td align="center">21.1 | 29.6</td>
|
| 257 |
-
<td align="center">21.6 | 30.0</td>
|
| 258 |
-
<td align="center">36.6 | 41.0</td>
|
| 259 |
-
<td align="center">7.7 | 18.7</td>
|
| 260 |
-
<td align="center">29.8 | 38.4</td>
|
| 261 |
-
<td align="center">20.9 | 38.5</td>
|
| 262 |
-
<td align="center">31.2 | 37.1</td>
|
| 263 |
-
<td align="center">37.3 | 39.9</td>
|
| 264 |
-
<td align="center">15.3 | 25.4</td>
|
| 265 |
-
<td align="center">19.5 | 30.0</td>
|
| 266 |
-
<td align="center">17.1 | 26.4</td>
|
| 267 |
-
</tr>
|
| 268 |
-
<!-- ROW: Box/Mask AP for CutLER -->
|
| 269 |
-
|
| 270 |
-
<tr><td align="center">CutLER*</td>
|
| 271 |
-
<td valign="bottom"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_cascade_final.pth">download</a></td>
|
| 272 |
-
<td align="center">21.9 | 32.7</td>
|
| 273 |
-
<td align="center">22.4 | 33.1</td>
|
| 274 |
-
<td align="center">36.9 | 44.3</td>
|
| 275 |
-
<td align="center">8.4 | 21.8</td>
|
| 276 |
-
<td align="center">31.7 | 42.8</td>
|
| 277 |
-
<td align="center">21.1 | 41.3</td>
|
| 278 |
-
<td align="center">30.4 | 38.6</td>
|
| 279 |
-
<td align="center">37.5 | 44.6</td>
|
| 280 |
-
<td align="center">18.4 | 27.5</td>
|
| 281 |
-
<td align="center">21.6 | 34.2</td>
|
| 282 |
-
<td align="center">17.3 | 29.6</td>
|
| 283 |
-
</tr>
|
| 284 |
-
</tbody></table>
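As a rough cross-check of the headline improvement, the table rows can be averaged across the 11 datasets. This is a sketch under one assumption: that the improvement is read as a ratio of dataset-averaged scores between the CutLER* row and the Prev. SOTA row. How the authors aggregated the numbers is not stated here, and `ratio_of_means` is a hypothetical helper.

```python
# (AP50, AR) per dataset, copied from the table above in column order:
# COCO, COCO20K, VOC, LVIS, UVO, Clipart, Comic, Watercolor, KITTI, Objects365, OpenImages
PREV_SOTA = [
    (9.6, 12.6), (9.7, 12.6), (15.9, 21.3), (3.8, 6.4), (10.0, 14.2), (7.9, 15.1),
    (9.9, 16.3), (6.7, 16.2), (7.7, 7.1), (8.1, 10.2), (9.9, 14.9),
]
CUTLER_CASCADE = [
    (21.9, 32.7), (22.4, 33.1), (36.9, 44.3), (8.4, 21.8), (31.7, 42.8), (21.1, 41.3),
    (30.4, 38.6), (37.5, 44.6), (18.4, 27.5), (21.6, 34.2), (17.3, 29.6),
]


def mean(values):
    return sum(values) / len(values)


def ratio_of_means(new_rows, old_rows, column):
    """Ratio of dataset-averaged scores for one column (0 = AP50, 1 = AR)."""
    return mean([row[column] for row in new_rows]) / mean([row[column] for row in old_rows])


ap50_ratio = ratio_of_means(CUTLER_CASCADE, PREV_SOTA, 0)
ar_ratio = ratio_of_means(CUTLER_CASCADE, PREV_SOTA, 1)
print(f"AP50: {ap50_ratio:.2f}x, AR: {ar_ratio:.2f}x")
```

With the values in the table, both ratios land close to the 2.7x (AP50) and 2.6x (AR) figures quoted at the top of the README. Small differences may come from rounding or from a different aggregation.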
|
| 285 |
-
|
| 286 |
-
## Semi-supervised and Fully-supervised Learning
|
| 287 |
-
CutLER can also serve as a pretrained model for training fully supervised object detection and instance segmentation models and improves performance on COCO, including on few-shot benchmarks.
|
| 288 |
-
|
| 289 |
-
### Training & Evaluation in Command Line
|
| 290 |
-
You can find all the semi-supervised and fully-supervised learning configs provided in CutLER under `model_zoo/configs/COCO-Semisupervised`.
|
| 291 |
-
|
| 292 |
-
To train a model using K% labels with `train_net.py`, first set up the COCO dataset according to [datasets/README.md](datasets/README.md) and specify K value in the config file, then run:
|
| 293 |
-
```
|
| 294 |
-
python train_net.py --num-gpus 8 \
|
| 295 |
-
--config-file model_zoo/configs/COCO-Semisupervised/cascade_mask_rcnn_R_50_FPN_{K}perc.yaml \
|
| 296 |
-
MODEL.WEIGHTS /path/to/cutler_pretrained_model
|
| 297 |
-
```
|
| 298 |
-
|
| 299 |
-
You can find all config files used to train supervised models under `model_zoo/configs/COCO-Semisupervised`.
|
| 300 |
-
The configs are written for 8-GPU training. To train on 1 GPU, you may need to [change some parameters](https://arxiv.org/abs/1706.02677), e.g. the number of GPUs (`--num-gpus your_num_gpus`), the learning rate (`SOLVER.BASE_LR your_base_lr`) and the batch size (`SOLVER.IMS_PER_BATCH your_batch_size`).
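The linked paper describes the linear scaling rule: when the total batch size is multiplied by k, the learning rate is multiplied by k as well. Below is a minimal sketch of that rule, assuming it applies unchanged to these configs. The reference values (batch size 16, learning rate 0.02) are placeholders for illustration and are not read from the CutLER config files; `scale_for_gpus` is a hypothetical helper.

```python
def scale_for_gpus(ref_lr: float, ref_batch: int, ref_gpus: int, num_gpus: int):
    """Keep the per-GPU batch size fixed and scale the learning rate linearly."""
    if ref_batch % ref_gpus != 0:
        raise ValueError("reference batch size must divide evenly across GPUs")
    per_gpu = ref_batch // ref_gpus
    new_batch = per_gpu * num_gpus
    new_lr = ref_lr * new_batch / ref_batch
    return new_lr, new_batch


# Placeholder reference setup: 8 GPUs, total batch size 16, base LR 0.02.
lr, batch = scale_for_gpus(ref_lr=0.02, ref_batch=16, ref_gpus=8, num_gpus=1)
print(f"--num-gpus 1 SOLVER.IMS_PER_BATCH {batch} SOLVER.BASE_LR {lr}")
```

The printed overrides can be appended to the `train_net.py` command. Very small batch sizes may still need a longer schedule or warmup, so treat the result as a starting point.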
|
| 301 |
-
|
| 302 |
-
### Evaluation
|
| 303 |
-
To evaluate a model's performance, use
|
| 304 |
-
```
|
| 305 |
-
python train_net.py \
|
| 306 |
-
--config-file model_zoo/configs/COCO-Semisupervised/cascade_mask_rcnn_R_50_FPN_{K}perc.yaml \
|
| 307 |
-
--eval-only MODEL.WEIGHTS /path/to/checkpoint_file
|
| 308 |
-
```
|
| 309 |
-
For more options, see `python train_net.py -h`.
|
| 310 |
-
|
| 311 |
-
### Model Zoo
|
| 312 |
-
We fine-tune a Cascade R-CNN model initialized with CutLER or MoCo-v2 on varying amounts of labeled COCO data, and show results (Box | Mask AP) on the val2017 split below:
|
| 313 |
-
|
| 314 |
-
<table><tbody>
|
| 315 |
-
<!-- START TABLE -->
|
| 316 |
-
<!-- TABLE HEADER -->
|
| 317 |
-
<th valign="bottom">% of labels</th>
|
| 318 |
-
<th valign="bottom">1%</th>
|
| 319 |
-
<th valign="bottom">2%</th>
|
| 320 |
-
<th valign="bottom">5%</th>
|
| 321 |
-
<th valign="bottom">10%</th>
|
| 322 |
-
<th valign="bottom">20%</th>
|
| 323 |
-
<th valign="bottom">30%</th>
|
| 324 |
-
<th valign="bottom">40%</th>
|
| 325 |
-
<th valign="bottom">50%</th>
|
| 326 |
-
<th valign="bottom">60%</th>
|
| 327 |
-
<th valign="bottom">80%</th>
|
| 328 |
-
<th valign="bottom">100%</th>
|
| 329 |
-
<!-- TABLE BODY -->
|
| 330 |
-
<!-- ROW: Box/Mask AP for CutLER -->
|
| 331 |
-
<tr><td align="center">MoCo-v2</td>
|
| 332 |
-
<td align="center">11.8 | 10.0</td>
|
| 333 |
-
<td align="center">16.2 | 13.8</td>
|
| 334 |
-
<td align="center">20.5 | 17.8</td>
|
| 335 |
-
<td align="center">26.5 | 23.0</td>
|
| 336 |
-
<td align="center">32.5 | 28.2</td>
|
| 337 |
-
<td align="center">35.5 | 30.8</td>
|
| 338 |
-
<td align="center">37.3 | 32.3</td>
|
| 339 |
-
<td align="center">38.7 | 33.6</td>
|
| 340 |
-
<td align="center">39.9 | 34.6</td>
|
| 341 |
-
<td align="center">41.6 | 36.0</td>
|
| 342 |
-
<td align="center">42.8 | 37.0</td>
|
| 343 |
-
</tr>
|
| 344 |
-
<!-- ROW: Mask AP -->
|
| 345 |
-
<tr><td align="center">CutLER</td>
|
| 346 |
-
<td align="center">16.8 | 14.6</td>
|
| 347 |
-
<td align="center">21.6 | 18.9</td>
|
| 348 |
-
<td align="center">27.8 | 24.3</td>
|
| 349 |
-
<td align="center">32.2 | 28.1</td>
|
| 350 |
-
<td align="center">36.6 | 31.7</td>
|
| 351 |
-
<td align="center">38.2 | 33.3</td>
|
| 352 |
-
<td align="center">39.9 | 34.7</td>
|
| 353 |
-
<td align="center">41.5 | 35.9</td>
|
| 354 |
-
<td align="center">42.3 | 36.7</td>
|
| 355 |
-
<td align="center">43.8 | 37.9</td>
|
| 356 |
-
<td align="center">44.7 | 38.5</td>
|
| 357 |
-
</tr>
|
| 358 |
-
<!-- ROW: Model Downloads -->
|
| 359 |
-
<tr><td align="center">Download</td>
|
| 360 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_1perc.pth">model</a></td>
|
| 361 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_2perc.pth">model</a></td>
|
| 362 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_5perc.pth">model</a></td>
|
| 363 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_10perc.pth">model</a></td>
|
| 364 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_20perc.pth">model</a></td>
|
| 365 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_30perc.pth">model</a></td>
|
| 366 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_40perc.pth">model</a></td>
|
| 367 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_50perc.pth">model</a></td>
|
| 368 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_60perc.pth">model</a></td>
|
| 369 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_semi_80perc.pth">model</a></td>
|
| 370 |
-
<td align="center"><a href="http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_fully_100perc.pth">model</a></td>
|
| 371 |
-
</tr>
|
| 372 |
-
</tbody></table>
|
| 373 |
-
|
| 374 |
-
Both MoCo-v2 and our CutLER are trained for the 1x schedule using Detectron2, except for extremely low-shot settings with 1% or 2% labels. When training with 1% or 2% labels, we train both MoCo-v2 and our model for 3,600 iterations with a batch size of 16.
|
| 375 |
-
|
| 376 |
-
## License
|
| 377 |
-
The majority of CutLER, Detectron2 and DINO are licensed under the [CC-BY-NC license](LICENSE); however, portions of the project are available under separate license terms: TokenCut, Bilateral Solver and CRF are licensed under the MIT license. If you later add other third-party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0.
|
| 378 |
-
|
| 379 |
-
## Ethical Considerations
|
| 380 |
-
CutLER's wide range of detection capabilities may introduce similar challenges to many other visual recognition methods.
|
| 381 |
-
As the image can contain arbitrary instances, it may impact the model output.
|
| 382 |
-
|
| 383 |
-
## How to get support from us?
|
| 384 |
-
If you have any general questions, feel free to email us at [Xudong Wang](mailto:xdwang@eecs.berkeley.edu), [Ishan Misra](mailto:imisra@meta.com) and [Rohit Girdhar](mailto:rgirdhar@meta.com). If you have code or implementation-related questions, please feel free to send emails to us or open an issue in this codebase (We recommend that you open an issue in this codebase, because your questions may help others).
|
| 385 |
-
|
| 386 |
-
## Citation
|
| 387 |
-
If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.
|
| 388 |
-
```
|
| 389 |
-
@inproceedings{wang2023cut,
|
| 390 |
-
title={Cut and learn for unsupervised object detection and instance segmentation},
|
| 391 |
-
author={Wang, Xudong and Girdhar, Rohit and Yu, Stella X and Misra, Ishan},
|
| 392 |
-
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
|
| 393 |
-
pages={3124--3134},
|
| 394 |
-
year={2023}
|
| 395 |
-
}
|
| 396 |
-
```
|
| 397 |
-
|
| 398 |
-
```
|
| 399 |
-
@article{wang2023videocutler,
|
| 400 |
-
title={VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation},
|
| 401 |
-
author={Wang, Xudong and Misra, Ishan and Zeng, Ziyun and Girdhar, Rohit and Darrell, Trevor},
|
| 402 |
-
journal={arXiv preprint arXiv:2308.14710},
|
| 403 |
-
year={2023}
|
| 404 |
-
}
|
| 405 |
-
```
|
|
|
|
| 1 |
+
Under Development for research
|
maskcut/README.md
CHANGED
|
@@ -1,293 +1,2 @@
|
|
| 1 |
-
# (CVPR 2022) TokenCut
|
| 2 |
-
PyTorch implementation of **TokenCut**:
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
**Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut**
|
| 6 |
-
|
| 7 |
-
*[Yangtao Wang](https://yangtaowang95.github.io), [Xi Shen](https://xishen0220.github.io/), [Shell Xu Hu](http://hushell.github.io/), [Yuan Yuan](https://yyuanad.github.io/), [James L. Crowley](http://crowley-coutaz.fr/jlc/jlc.html), [Dominique Vaufreydaz](https://research.vaufreydaz.org/)*
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
[[Project page](https://www.m-psi.fr/Papers/TokenCut2022/)]
|
| 11 |
-
[[GitHub (Video Segmentation)](https://github.com/YangtaoWANG95/TokenCut_video)]
|
| 12 |
-
[[Paper](https://arxiv.org/pdf/2202.11539.pdf)]
|
| 13 |
-
[[Colab demo](https://colab.research.google.com/github/YangtaoWANG95/TokenCut/blob/master/inference_demo.ipynb)]
|
| 14 |
-
[[Hugging Face demo](https://huggingface.co/spaces/yangtaowang/TokenCut)]
|
| 15 |
-
|
| 16 |
-
<p align="center">
|
| 17 |
-
<img width="100%" alt="TokenCut teaser" src="examples/overview.png">
|
| 18 |
-
</p>
|
| 19 |
-
|
| 20 |
-
If our project is helpful for your research, please consider citing:
|
| 21 |
-
```
|
| 22 |
-
@inproceedings{wang2022tokencut,
|
| 23 |
-
title={Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut},
|
| 24 |
-
author={Wang, Yangtao and Shen, Xi and Hu, Shell Xu and Yuan, Yuan and Crowley, James L. and Vaufreydaz, Dominique},
|
| 25 |
-
booktitle={Conference on Computer Vision and Pattern Recognition},
|
| 26 |
-
year={2022}
|
| 27 |
-
}
|
| 28 |
-
```
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
## Table of Contents
|
| 32 |
-
* [1. Updates](#1-updates)
|
| 33 |
-
* [2. Installation](#2-installation)
|
| 34 |
-
* [2.1 Dependencies](#21-dependencies)
|
| 35 |
-
* [2.2 Data](#22-data)
|
| 36 |
-
* [3. Quick Start](#3-quick-start)
|
| 37 |
-
* [3.1 Detecting an object in one image](#31-detecting-an-object-in-one-image)
|
| 38 |
-
* [3.2 Segmenting a salient region in one image](#32-segmenting-a-salient-region-in-one-image)
|
| 39 |
-
* [4. Evaluation](#4-evaluation)
|
| 40 |
-
* [4.1 Unsupervised object discovery](#41-unsupervised-object-discovery)
|
| 41 |
-
* [4.2 Unsupervised saliency detection](#42-unsupervised-saliency-detection)
|
| 42 |
-
* [4.3 Weakly supervised object detection](#43-weakly-supervised-object-detection)
|
| 43 |
-
* [5. Acknowledgement](#5-acknowledgement)
|
| 44 |
-
|
| 45 |
-
## 1. Updates
|
| 46 |
-
|
| 47 |
-
***09/06/2022***
|
| 48 |
-
Our extension work, [TokenCut Video Segmentation](https://github.com/YangtaoWANG95/TokenCut_video), is released!
|
| 49 |
-
|
| 50 |
-
***03/10/2022***
|
| 51 |
-
Created a 480p demo using [Gradio](https://github.com/gradio-app/gradio). Try out the web demo: [Hugging Face Spaces](https://huggingface.co/spaces/yangtaowang/TokenCut)
|
| 52 |
-
|
| 53 |
-
Internet image results:
|
| 54 |
-
<p >
|
| 55 |
-
<img width="20%" alt="TokenCut visualizations" src="examples/internet_image/kungfu_pred.jpg">
|
| 56 |
-
<img width="20%" alt="TokenCut visualizations" src="examples/internet_image/kungfu_attn.jpg">
|
| 57 |
-
<img width="19%" alt="TokenCut visualizations" src="examples/internet_image/pokemon_pred.jpg">
|
| 58 |
-
<img width="19.5%" alt="TokenCut visualizations" src="examples/internet_image/pokemon_attn.jpg">
|
| 59 |
-
</p>
|
| 60 |
-
|
| 61 |
-
***02/26/2022***
|
| 62 |
-
Integrated into [Huggingface Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the web demo: [Hugging Face Spaces](https://huggingface.co/spaces/akhaliq/TokenCut)
|
| 63 |
-
|
| 64 |
-
***02/26/2022***
|
| 65 |
-
A simple TokenCut Colab Demo is available.
|
| 66 |
-
|
| 67 |
-
***02/21/2022***
|
| 68 |
-
Initial commit: the TokenCut code is released, including evaluation of unsupervised object discovery, unsupervised saliency detection, and weakly supervised object localization.
|
| 69 |
-
|
| 70 |
-
## 2. Installation
|
| 71 |
-
### 2.1 Dependencies
|
| 72 |
-
|
| 73 |
-
This code was implemented with Python 3.7, PyTorch 1.7.1 and CUDA 11.2. Please refer to [the official installation instructions](https://pytorch.org/get-started/previous-versions/). If CUDA 10.2 has been properly installed:
|
| 74 |
-
```
|
| 75 |
-
pip install torch==1.7.1 torchvision==0.8.2
|
| 76 |
-
```
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
To install the additional dependencies, run the following command:
|
| 80 |
-
|
| 81 |
-
```
|
| 82 |
-
pip install -r requirements.txt
|
| 83 |
-
```
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
### 2.2 Data
|
| 87 |
-
|
| 88 |
-
We provide quick download commands in [DOWNLOAD_DATA.md](./DOWNLOAD_DATA.md) for VOC2007, VOC2012, COCO, CUB, ImageNet, ECSSD, DUTS and DUT-OMRON as well as DINO checkpoints.
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
## 3. Quick Start
|
| 92 |
-
|
| 93 |
-
### 3.1 Detecting an object in one image
|
| 94 |
-
|
| 95 |
-
We provide TokenCut visualizations of the bounding box prediction (`pred`) and the attention map (`attn`). Use `all` to produce all visualization results.
|
| 96 |
-
<!--Following are scripts to apply TokenCut to an image defined via the `image_path` parameter and visualize the predictions (`pred`), the eigen attention map of the Figure 3 in the paper (`attn`), and all the figures(`all`). Box predictions are also stored in the output directory given by parameter `output_dir`. -->
|
| 97 |
-
|
| 98 |
-
```
|
| 99 |
-
python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize pred
|
| 100 |
-
python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize attn
|
| 101 |
-
python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize all
|
| 102 |
-
```
|
| 103 |
-
|
| 104 |
-
### 3.2 Segmenting a salient region in one image
|
| 105 |
-
|
| 106 |
-
We provide TokenCut segmentation results as follows:
|
| 107 |
-
|
| 108 |
-
```
|
| 109 |
-
cd unsupervised_saliency_detection
|
| 110 |
-
python get_saliency.py --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --vit-arch small --patch-size 16 --img-path ../examples/VOC07_000036.jpg --out-dir ./output
|
| 111 |
-
```
|
| 112 |
-
|
| 113 |
-
## 4. Evaluation
|
| 114 |
-
Following are the different steps to reproduce the results of **TokenCut** presented in the paper.
|
| 115 |
-
|
| 116 |
-
### 4.1 Unsupervised object discovery
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
<p align="center">
|
| 120 |
-
<img width="30%" alt="TokenCut visualizations" src="examples/ex000012.jpg">
|
| 121 |
-
<img width="13.2%" alt="TokenCut visualizations" src="examples/ex000036.jpg">
|
| 122 |
-
<img width="19.8%" alt="TokenCut visualizations" src="examples/ex000064.jpg">
|
| 123 |
-
</p>
|
| 124 |
-
|
| 125 |
-
#### PASCAL-VOC
|
| 126 |
-
In order to apply TokenCut and compute corloc results (VOC07 68.8, VOC12 72.1), please launch:
|
| 127 |
-
```
|
| 128 |
-
python main_tokencut.py --dataset VOC07 --set trainval
|
| 129 |
-
python main_tokencut.py --dataset VOC12 --set trainval
|
| 130 |
-
```
|
| 131 |
-
|
| 132 |
-
To extract DINO features, which correspond to [the KEY features in DINO](https://github.com/XiSHEN0220/LOST/blob/main/main_lost.py#L259-L261), run:
|
| 133 |
-
```
|
| 134 |
-
mkdir features
|
| 135 |
-
python main_lost.py --dataset VOC07 --set trainval --save-feat-dir features/VOC2007
|
| 136 |
-
```
|
| 137 |
-
|
| 138 |
-
#### COCO
|
| 139 |
-
|
| 140 |
-
Results are provided on the 2014 annotations, following previous works. The following command gives results on the 20k-image subset of the COCO dataset (corloc 58.8), following previous literature. Note that the 20k images are a subset of the `train` set.
|
| 141 |
-
```
|
| 142 |
-
python main_tokencut.py --dataset COCO20k --set train
|
| 143 |
-
```
|
| 144 |
-
|
| 145 |
-
#### Different models
|
| 146 |
-
We have tested the method on different setups of the ViT model. The corloc results are presented in the following table (more can be found in the paper).
|
| 147 |
-
|
| 148 |
-
<table>
|
| 149 |
-
<tr>
|
| 150 |
-
<th>arch</th>
|
| 151 |
-
<th>pre-training</th>
|
| 152 |
-
<th colspan="3">dataset</th>
|
| 153 |
-
</tr>
|
| 154 |
-
<tr>
|
| 155 |
-
<th></th>
|
| 156 |
-
<th></th>
|
| 157 |
-
<th>VOC07</th>
|
| 158 |
-
<th>VOC12</th>
|
| 159 |
-
<th>COCO20k</th>
|
| 160 |
-
</tr>
|
| 161 |
-
<tr>
|
| 162 |
-
<td>ViT-S/16</td>
|
| 163 |
-
<td>DINO</td>
|
| 164 |
-
<td>68.8</td>
|
| 165 |
-
<td>72.1</td>
|
| 166 |
-
<td>58.8</td>
|
| 167 |
-
</tr>
|
| 168 |
-
<tr>
|
| 169 |
-
<td>ViT-S/8</td>
|
| 170 |
-
<td>DINO</td>
|
| 171 |
-
<td>67.3</td>
|
| 172 |
-
<td>71.6</td>
|
| 173 |
-
<td>60.7</td>
|
| 174 |
-
</tr>
|
| 175 |
-
<tr>
|
| 176 |
-
<td>ViT-B/16</td>
|
| 177 |
-
<td>DINO</td>
|
| 178 |
-
<td>68.8</td>
|
| 179 |
-
<td>72.4</td>
|
| 180 |
-
<td>59.0</td>
|
| 181 |
-
</tr>
|
| 182 |
-
</table>
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
The results above on the `VOC07` dataset can be obtained by launching:
|
| 186 |
-
```
|
| 187 |
-
python main_tokencut.py --dataset VOC07 --set trainval #VIT-S/16
|
| 188 |
-
python main_tokencut.py --dataset VOC07 --set trainval --patch_size 8 #VIT-S/8
|
| 189 |
-
python main_tokencut.py --dataset VOC07 --set trainval --arch vit_base #VIT-B/16
|
| 190 |
-
```
|
| 191 |
-
|
| 192 |
-
### 4.2 Unsupervised saliency detection
|
| 193 |
-
|
| 194 |
-
<p align="center">
|
| 195 |
-
<img width="28%" alt="TokenCut visualizations" src="examples/ex_usd_2.jpg">
|
| 196 |
-
<img width="12.5%" alt="TokenCut visualizations" src="examples/ex_usd_3.jpg">
|
| 197 |
-
<img width="14%" alt="TokenCut visualizations" src="examples/ex_usd_1.jpg">
|
| 198 |
-
</p>
|
| 199 |
-
|
| 200 |
-
To evaluate on the ECSSD, DUTS, and DUT-OMRON datasets:
|
| 201 |
-
```
|
| 202 |
-
python get_saliency.py --out-dir ECSSD --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset ECSSD
|
| 203 |
-
|
| 204 |
-
python get_saliency.py --out-dir DUTS --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset DUTS
|
| 205 |
-
|
| 206 |
-
python get_saliency.py --out-dir DUT --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset DUT
|
| 207 |
-
```
|
| 208 |
-
This should give:
|
| 209 |
-
|
| 210 |
-
<table>
|
| 211 |
-
<tr>
|
| 212 |
-
<th>Method</th>
|
| 213 |
-
<th colspan="3"> ECSSD</th>
|
| 214 |
-
<th colspan="3"> DUTS</th>
|
| 215 |
-
<th colspan="3"> DUT-OMRON</th>
|
| 216 |
-
</tr>
|
| 217 |
-
<tr>
|
| 218 |
-
<th></th>
|
| 219 |
-
<th>maxF</th>
|
| 220 |
-
<th>IoU</th>
|
| 221 |
-
<th>Acc</th>
|
| 222 |
-
<th>maxF</th>
|
| 223 |
-
<th>IoU</th>
|
| 224 |
-
<th>Acc</th>
|
| 225 |
-
<th>maxF</th>
|
| 226 |
-
<th>IoU</th>
|
| 227 |
-
<th>Acc</th>
|
| 228 |
-
</tr>
|
| 229 |
-
<tr>
|
| 230 |
-
<td>TokenCut</td>
|
| 231 |
-
<td>80.3</td>
|
| 232 |
-
<td>71.2</td>
|
| 233 |
-
<td>91.8</td>
|
| 234 |
-
<td>67.2</td>
|
| 235 |
-
<td>57.6</td>
|
| 236 |
-
<td>90.3</td>
|
| 237 |
-
<td>60.0</td>
|
| 238 |
-
<td>53.3</td>
|
| 239 |
-
<td>88.0</td>
|
| 240 |
-
</tr>
|
| 241 |
-
<tr>
|
| 242 |
-
<td>TokenCut + BS</td>
|
| 243 |
-
<td>87.4</td>
|
| 244 |
-
<td>77.2</td>
|
| 245 |
-
<td>93.4</td>
|
| 246 |
-
<td>75.5</td>
|
| 247 |
-
<td>62.4</td>
|
| 248 |
-
<td>91.4</td>
|
| 249 |
-
<td>69.7</td>
|
| 250 |
-
<td>61.8</td>
|
| 251 |
-
<td>89.7</td>
|
| 252 |
-
</tr>
|
| 253 |
-
</table>
|
| 254 |
-
|
| 255 |
-
### 4.3 Weakly supervised object detection
|
| 256 |
-
|
| 257 |
-
<p align="center">
|
| 258 |
-
<img width="15%" alt="TokenCut visualizations" src="examples/ex_wsd_1.jpg">
|
| 259 |
-
<img width="24%" alt="TokenCut visualizations" src="examples/ex_wsd_2.jpg">
|
| 260 |
-
<img width="18.9%" alt="TokenCut visualizations" src="examples/ex_wsd_3.jpg">
|
| 261 |
-
|
| 262 |
-
</p>
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
#### Fine-tune DINO small on CUB
|
| 266 |
-
|
| 267 |
-
To fine-tune ViT-S/16 on CUB on a single node with 4 GPUs for 1000 epochs, run:
|
| 268 |
-
```
|
| 269 |
-
python -m torch.distributed.launch --nproc_per_node=4 main.py --data_path /path/to/data --batch_size_per_gpu 256 --dataset cub --weight_decay 0.005 --pretrained_weights ./dino_deitsmall16_pretrain.pth --epoch 1000 --output_dir ./path/to/checkpoin --lr 2e-4 --warmup-epochs 50 --num_labels 200 --num_workers 16 --n_last_blocks 1 --avgpool_patchtokens true --arch vit_small --patch_size 16
|
| 270 |
-
```
|
| 271 |
-
|
| 272 |
-
#### Evaluation on CUB
|
| 273 |
-
To evaluate a fine-tuned ViT-S/16 on CUB val with a single GPU run:
|
| 274 |
-
```
|
| 275 |
-
python eval.py --pretrained_weights ./path/to/checkpoint --dataset cub --data_path ./path/to/data --batch_size_per_gpu 1 --no_center_crop
|
| 276 |
-
```
|
| 277 |
-
This should give:
|
| 278 |
-
```
|
| 279 |
-
Top1_cls: 79.12, top5_cls94.80, gt_loc: 0.914, top1_loc:0.723
|
| 280 |
-
```
|
| 281 |
-
|
| 282 |
-
#### Evaluation on ImageNet
|
| 283 |
-
|
| 284 |
-
To evaluate a ViT-S/16 fine-tuned on ImageNet val with a single GPU, run:
|
| 285 |
-
```
|
| 286 |
-
python eval.py --pretrained_weights /path/to/checkpoint --classifier_weights /path/to/linear_weights --dataset imagenet --data_path ./path/to/data --batch_size_per_gpu 1 --num_labels 1000 --no_center_crop --input_size 256 --tau 0.2 --patch_size 16 --arch vit_small
|
| 287 |
-
```
|
| 288 |
-
|
| 289 |
-
## 5. Acknowledgement
|
| 290 |
-
|
| 291 |
-
The TokenCut code is built on top of [LOST](https://github.com/valeoai/LOST), [DINO](https://github.com/facebookresearch/dino), [SegSwap](https://github.com/XiSHEN0220/SegSwap), and [Bilateral_Solver](https://github.com/poolio/bilateral_solver). We sincerely thank the authors for their great work.
|
| 292 |
-
|
| 293 |
|
| 1 |
|
| 2 |
+
**Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut**
|
third_party/README.md
CHANGED
|
@@ -1,293 +1 @@
|
|
| 1 |
-
|
| 2 |
-
PyTorch implementation of **TokenCut**:
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
**Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut**
|
| 6 |
-
|
| 7 |
-
*[Yangtao Wang](https://yangtaowang95.github.io), [Xi Shen](https://xishen0220.github.io/), [Shell Xu Hu](http://hushell.github.io/), [Yuan Yuan](https://yyuanad.github.io/), [James L. Crowley](http://crowley-coutaz.fr/jlc/jlc.html), [Dominique Vaufreydaz](https://research.vaufreydaz.org/)*
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
[[Project page](https://www.m-psi.fr/Papers/TokenCut2022/)]
|
| 11 |
-
[[ Github (Video Segmentation) ](https://github.com/YangtaoWANG95/TokenCut_video)]
|
| 12 |
-
[[Paper](https://arxiv.org/pdf/2202.11539.pdf)]
|
| 13 |
-
[Colab demo](https://colab.research.google.com/github/YangtaoWANG95/TokenCut/blob/master/inference_demo.ipynb)
|
| 14 |
-
[Hugging Face Spaces demo](https://huggingface.co/spaces/yangtaowang/TokenCut)
|
| 15 |
-
|
| 16 |
-
<p align="center">
|
| 17 |
-
<img width="100%" alt="TokenCut teaser" src="examples/overview.png">
|
| 18 |
-
</p>
|
| 19 |
-
|
| 20 |
-
If our project is helpful for your research, please consider citing:
|
| 21 |
-
```
|
| 22 |
-
@inproceedings{wang2022tokencut,
|
| 23 |
-
title={Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut},
|
| 24 |
-
author={Wang, Yangtao and Shen, Xi and Hu, Shell Xu and Yuan, Yuan and Crowley, James L. and Vaufreydaz, Dominique},
|
| 25 |
-
booktitle={Conference on Computer Vision and Pattern Recognition},
|
| 26 |
-
year={2022}
|
| 27 |
-
}
|
| 28 |
-
```
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
## Table of Contents
|
| 32 |
-
* [1. Updates](#1-updates)
|
| 33 |
-
* [2. Installation](#2-installation)
|
| 34 |
-
* [2.1 Dependencies](#21-dependencies)
|
| 35 |
-
* [2.2 Data](#22-data)
|
| 36 |
-
* [3. Quick Start](#3-quick-start)
|
| 37 |
-
* [3.1 Detecting an object in one image](#31-detecting-an-object-in-one-image)
|
| 38 |
-
* [3.2 Segmenting a salient region in one image](#32-segmenting-a-salient-region-in-one-image)
|
| 39 |
-
* [4. Evaluation](#4-evaluation)
|
| 40 |
-
* [4.1 Unsupervised object discovery](#41-unsupervised-object-discovery)
|
| 41 |
-
* [4.2 Unsupervised saliency detection](#42-unsupervised-saliency-detection)
|
| 42 |
-
* [4.3 Weakly supervised object detection](#43-weakly-supervised-object-detection)
|
| 43 |
-
* [5. Acknowledgement](#5-acknowledgement)
|
| 44 |
-
|
| 45 |
-
## 1. Updates
|
| 46 |
-
|
| 47 |
-
***09/06/2022***
|
| 48 |
-
An extension of this work, [TokenCut Video Segmentation](https://github.com/YangtaoWANG95/TokenCut_video), is released!
|
| 49 |
-
|
| 50 |
-
***03/10/2022***
|
| 51 |
-
Created a 480p demo using [Gradio](https://github.com/gradio-app/gradio). Try out the [web demo](https://huggingface.co/spaces/yangtaowang/TokenCut).
|
| 52 |
-
|
| 53 |
-
Internet image results:
|
| 54 |
-
<p >
|
| 55 |
-
<img width="20%" alt="TokenCut visualizations" src="examples/internet_image/kungfu_pred.jpg">
|
| 56 |
-
<img width="20%" alt="TokenCut visualizations" src="examples/internet_image/kungfu_attn.jpg">
|
| 57 |
-
<img width="19%" alt="TokenCut visualizations" src="examples/internet_image/pokemon_pred.jpg">
|
| 58 |
-
<img width="19.5%" alt="TokenCut visualizations" src="examples/internet_image/pokemon_attn.jpg">
|
| 59 |
-
</p>
|
| 60 |
-
|
| 61 |
-
***02/26/2022***
|
| 62 |
-
Integrated into [Hugging Face Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the [web demo](https://huggingface.co/spaces/akhaliq/TokenCut).
|
| 63 |
-
|
| 64 |
-
***02/26/2022***
|
| 65 |
-
A simple TokenCut Colab Demo is available.
|
| 66 |
-
|
| 67 |
-
***02/21/2022***
|
| 68 |
-
Initial commit: the TokenCut code is released, including evaluation of unsupervised object discovery, unsupervised saliency detection, and weakly supervised object localization.
|
| 69 |
-
|
| 70 |
-
## 2. Installation
|
| 71 |
-
### 2.1 Dependencies
|
| 72 |
-
|
| 73 |
-
This code was implemented with Python 3.7, PyTorch 1.7.1 and CUDA 11.2. Please refer to [the official installation instructions](https://pytorch.org/get-started/previous-versions/). If CUDA 10.2 has been properly installed:
|
| 74 |
-
```
|
| 75 |
-
pip install torch==1.7.1 torchvision==0.8.2
|
| 76 |
-
```
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
To install the additional dependencies, run the following command:
|
| 80 |
-
|
| 81 |
-
```
|
| 82 |
-
pip install -r requirements.txt
|
| 83 |
-
```
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
### 2.2 Data
|
| 87 |
-
|
| 88 |
-
We provide quick download commands in [DOWNLOAD_DATA.md](./DOWNLOAD_DATA.md) for VOC2007, VOC2012, COCO, CUB, ImageNet, ECSSD, DUTS and DUT-OMRON as well as DINO checkpoints.
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
## 3. Quick Start
|
| 92 |
-
|
| 93 |
-
### 3.1 Detecting an object in one image
|
| 94 |
-
|
| 95 |
-
We provide TokenCut visualizations of the bounding box prediction (`pred`) and the attention map (`attn`). Use `all` to produce all visualization results.
|
| 96 |
-
<!--Following are scripts to apply TokenCut to an image defined via the `image_path` parameter and visualize the predictions (`pred`), the eigen attention map of the Figure 3 in the paper (`attn`), and all the figures(`all`). Box predictions are also stored in the output directory given by parameter `output_dir`. -->
|
| 97 |
-
|
| 98 |
-
```
|
| 99 |
-
python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize pred
|
| 100 |
-
python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize attn
|
| 101 |
-
python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize all
|
| 102 |
-
```
|
| 103 |
-
|
| 104 |
-
### 3.2 Segmenting a salient region in one image
|
| 105 |
-
|
| 106 |
-
We provide TokenCut segmentation results as follows:
|
| 107 |
-
|
| 108 |
-
```
|
| 109 |
-
cd unsupervised_saliency_detection
|
| 110 |
-
python get_saliency.py --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --vit-arch small --patch-size 16 --img-path ../examples/VOC07_000036.jpg --out-dir ./output
|
| 111 |
-
```
|
| 112 |
-
|
| 113 |
-
## 4. Evaluation
|
| 114 |
-
Following are the different steps to reproduce the results of **TokenCut** presented in the paper.
|
| 115 |
-
|
| 116 |
-
### 4.1 Unsupervised object discovery
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
<p align="center">
|
| 120 |
-
<img width="30%" alt="TokenCut visualizations" src="examples/ex000012.jpg">
|
| 121 |
-
<img width="13.2%" alt="TokenCut visualizations" src="examples/ex000036.jpg">
|
| 122 |
-
<img width="19.8%" alt="TokenCut visualizations" src="examples/ex000064.jpg">
|
| 123 |
-
</p>
|
| 124 |
-
|
| 125 |
-
#### PASCAL-VOC
|
| 126 |
-
In order to apply TokenCut and compute corloc results (VOC07 68.8, VOC12 72.1), please launch:
|
| 127 |
-
```
|
| 128 |
-
python main_tokencut.py --dataset VOC07 --set trainval
|
| 129 |
-
python main_tokencut.py --dataset VOC12 --set trainval
|
| 130 |
-
```
|
| 131 |
-
|
| 132 |
-
To extract DINO features, which correspond to [the KEY features in DINO](https://github.com/XiSHEN0220/LOST/blob/main/main_lost.py#L259-L261), run:
|
| 133 |
-
```
|
| 134 |
-
mkdir features
|
| 135 |
-
python main_lost.py --dataset VOC07 --set trainval --save-feat-dir features/VOC2007
|
| 136 |
-
```
|
| 137 |
-
|
| 138 |
-
#### COCO
|
| 139 |
-
|
| 140 |
-
Results are provided on the 2014 annotations, following previous works. The following command gives results on the 20k-image subset of the COCO dataset (corloc 58.8), following previous literature. Note that the 20k images are a subset of the `train` set.
|
| 141 |
-
```
|
| 142 |
-
python main_tokencut.py --dataset COCO20k --set train
|
| 143 |
-
```
|
| 144 |
-
|
| 145 |
-
#### Different models
|
| 146 |
-
We have tested the method on different setups of the ViT model. The corloc results are presented in the following table (more can be found in the paper).
|
| 147 |
-
|
| 148 |
-
<table>
|
| 149 |
-
<tr>
|
| 150 |
-
<th>arch</th>
|
| 151 |
-
<th>pre-training</th>
|
| 152 |
-
<th colspan="3">dataset</th>
|
| 153 |
-
</tr>
|
| 154 |
-
<tr>
|
| 155 |
-
<th></th>
|
| 156 |
-
<th></th>
|
| 157 |
-
<th>VOC07</th>
|
| 158 |
-
<th>VOC12</th>
|
| 159 |
-
<th>COCO20k</th>
|
| 160 |
-
</tr>
|
| 161 |
-
<tr>
|
| 162 |
-
<td>ViT-S/16</td>
|
| 163 |
-
<td>DINO</td>
|
| 164 |
-
<td>68.8</td>
|
| 165 |
-
<td>72.1</td>
|
| 166 |
-
<td>58.8</td>
|
| 167 |
-
</tr>
|
| 168 |
-
<tr>
|
| 169 |
-
<td>ViT-S/8</td>
|
| 170 |
-
<td>DINO</td>
|
| 171 |
-
<td>67.3</td>
|
| 172 |
-
<td>71.6</td>
|
| 173 |
-
<td>60.7</td>
|
| 174 |
-
</tr>
|
| 175 |
-
<tr>
|
| 176 |
-
<td>ViT-B/16</td>
|
| 177 |
-
<td>DINO</td>
|
| 178 |
-
<td>68.8</td>
|
| 179 |
-
<td>72.4</td>
|
| 180 |
-
<td>59.0</td>
|
| 181 |
-
</tr>
|
| 182 |
-
</table>
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
The results above on the `VOC07` dataset can be obtained by launching:
|
| 186 |
-
```
|
| 187 |
-
python main_tokencut.py --dataset VOC07 --set trainval #VIT-S/16
|
| 188 |
-
python main_tokencut.py --dataset VOC07 --set trainval --patch_size 8 #VIT-S/8
|
| 189 |
-
python main_tokencut.py --dataset VOC07 --set trainval --arch vit_base #VIT-B/16
|
| 190 |
-
```
|
| 191 |
-
|
| 192 |
-
### 4.2 Unsupervised saliency detection
|
| 193 |
-
|
| 194 |
-
<p align="center">
|
| 195 |
-
<img width="28%" alt="TokenCut visualizations" src="examples/ex_usd_2.jpg">
|
| 196 |
-
<img width="12.5%" alt="TokenCut visualizations" src="examples/ex_usd_3.jpg">
|
| 197 |
-
<img width="14%" alt="TokenCut visualizations" src="examples/ex_usd_1.jpg">
|
| 198 |
-
</p>
|
| 199 |
-
|
| 200 |
-
To evaluate on the ECSSD, DUTS, and DUT-OMRON datasets:
|
| 201 |
-
```
|
| 202 |
-
python get_saliency.py --out-dir ECSSD --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset ECSSD
|
| 203 |
-
|
| 204 |
-
python get_saliency.py --out-dir DUTS --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset DUTS
|
| 205 |
-
|
| 206 |
-
python get_saliency.py --out-dir DUT --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset DUT
|
| 207 |
-
```
|
| 208 |
-
This should give:
|
| 209 |
-
|
| 210 |
-
<table>
|
| 211 |
-
<tr>
|
| 212 |
-
<th>Method</th>
|
| 213 |
-
<th colspan="3"> ECSSD</th>
|
| 214 |
-
<th colspan="3"> DUTS</th>
|
| 215 |
-
<th colspan="3"> DUT-OMRON</th>
|
| 216 |
-
</tr>
|
| 217 |
-
<tr>
|
| 218 |
-
<th></th>
|
| 219 |
-
<th>maxF</th>
|
| 220 |
-
<th>IoU</th>
|
| 221 |
-
<th>Acc</th>
|
| 222 |
-
<th>maxF</th>
|
| 223 |
-
<th>IoU</th>
|
| 224 |
-
<th>Acc</th>
|
| 225 |
-
<th>maxF</th>
|
| 226 |
-
<th>IoU</th>
|
| 227 |
-
<th>Acc</th>
|
| 228 |
-
</tr>
|
| 229 |
-
<tr>
|
| 230 |
-
<td>TokenCut</td>
|
| 231 |
-
<td>80.3</td>
|
| 232 |
-
<td>71.2</td>
|
| 233 |
-
<td>91.8</td>
|
| 234 |
-
<td>67.2</td>
|
| 235 |
-
<td>57.6</td>
|
| 236 |
-
<td>90.3</td>
|
| 237 |
-
<td>60.0</td>
|
| 238 |
-
<td>53.3</td>
|
| 239 |
-
<td>88.0</td>
|
| 240 |
-
</tr>
|
| 241 |
-
<tr>
|
| 242 |
-
<td>TokenCut + BS</td>
|
| 243 |
-
<td>87.4</td>
|
| 244 |
-
<td>77.2</td>
|
| 245 |
-
<td>93.4</td>
|
| 246 |
-
<td>75.5</td>
|
| 247 |
-
<td>62.4</td>
|
| 248 |
-
<td>91.4</td>
|
| 249 |
-
<td>69.7</td>
|
| 250 |
-
<td>61.8</td>
|
| 251 |
-
<td>89.7</td>
|
| 252 |
-
</tr>
|
| 253 |
-
</table>
|
| 254 |
-
|
| 255 |
-
### 4.3 Weakly supervised object detection
|
| 256 |
-
|
| 257 |
-
<p align="center">
|
| 258 |
-
<img width="15%" alt="TokenCut visualizations" src="examples/ex_wsd_1.jpg">
|
| 259 |
-
<img width="24%" alt="TokenCut visualizations" src="examples/ex_wsd_2.jpg">
|
| 260 |
-
<img width="18.9%" alt="TokenCut visualizations" src="examples/ex_wsd_3.jpg">
|
| 261 |
-
|
| 262 |
-
</p>
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
#### Fine-tune DINO small on CUB
|
| 266 |
-
|
| 267 |
-
To fine-tune ViT-S/16 on CUB on a single node with 4 GPUs for 1000 epochs, run:
|
| 268 |
-
```
|
| 269 |
-
python -m torch.distributed.launch --nproc_per_node=4 main.py --data_path /path/to/data --batch_size_per_gpu 256 --dataset cub --weight_decay 0.005 --pretrained_weights ./dino_deitsmall16_pretrain.pth --epoch 1000 --output_dir ./path/to/checkpoin --lr 2e-4 --warmup-epochs 50 --num_labels 200 --num_workers 16 --n_last_blocks 1 --avgpool_patchtokens true --arch vit_small --patch_size 16
|
| 270 |
-
```
|
| 271 |
-
|
| 272 |
-
#### Evaluation on CUB
|
| 273 |
-
To evaluate a fine-tuned ViT-S/16 on CUB val with a single GPU run:
|
| 274 |
-
```
|
| 275 |
-
python eval.py --pretrained_weights ./path/to/checkpoint --dataset cub --data_path ./path/to/data --batch_size_per_gpu 1 --no_center_crop
|
| 276 |
-
```
|
| 277 |
-
This should give:
|
| 278 |
-
```
|
| 279 |
-
Top1_cls: 79.12, top5_cls94.80, gt_loc: 0.914, top1_loc:0.723
|
| 280 |
-
```
|
| 281 |
-
|
| 282 |
-
#### Evaluation on ImageNet
|
| 283 |
-
|
| 284 |
-
To evaluate a ViT-S/16 fine-tuned on ImageNet val with a single GPU, run:
|
| 285 |
-
```
|
| 286 |
-
python eval.py --pretrained_weights /path/to/checkpoint --classifier_weights /path/to/linear_weights --dataset imagenet --data_path ./path/to/data --batch_size_per_gpu 1 --num_labels 1000 --no_center_crop --input_size 256 --tau 0.2 --patch_size 16 --arch vit_small
|
| 287 |
-
```
|
| 288 |
-
|
| 289 |
-
## 5. Acknowledgement
|
| 290 |
-
|
| 291 |
-
The TokenCut code is built on top of [LOST](https://github.com/valeoai/LOST), [DINO](https://github.com/facebookresearch/dino), [SegSwap](https://github.com/XiSHEN0220/SegSwap), and [Bilateral_Solver](https://github.com/poolio/bilateral_solver). We sincerely thank the authors for their great work.
|
| 292 |
-
|
| 293 |
-
|
|
|
|
| 1 |
+
**Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut**
|