---
license: apache-2.0
tags:
  - image-classification
  - multi-label-classification
  - booru
  - tagger
  - danbooru
  - e621
  - dinov3
  - vit
pipeline_tag: image-classification
---

# DINOv3 ViT-H/16+ Booru Tagger

A multi-label image tagger trained on **e621** and **Danbooru** annotations, using a
[DINOv3 ViT-H/16+](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
backbone fine-tuned end-to-end with a single linear projection head.

## Model Details

| Property | Value |
|---|---|
| Backbone | `facebook/dinov3-vith16plus-pretrain-lvd1689m` |
| Architecture | ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens |
| Head | `Linear((1 + 4) × 1280 → 74 625)` — CLS + 4 register tokens concatenated |
| Vocabulary | **74 625 tags** (min frequency ≥ 50 across training set) |
| Input resolution | Any multiple of 16 px — trained at 512 px, generalises to higher resolutions |
| Input normalisation | ImageNet mean/std `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]` |
| Output | Raw logits — apply `sigmoid` for per-tag probabilities |
| Parameters | ~632 M (backbone) + ~480 M (head) |
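
The shape arithmetic in the table can be checked by hand. The sketch below is only an illustration and is not taken from the repository's scripts. The helper names are hypothetical, rounding a side length *down* to the patch grid is one possible choice, and the presence of a bias term in the head is an assumption.

```python
PATCH = 16            # patch size of ViT-H/16+
HIDDEN = 1280         # hidden dimension
NUM_REGISTERS = 4     # register tokens
NUM_TAGS = 74_625     # vocabulary size


def snap_to_patch(size: int, patch: int = PATCH) -> int:
    """Round a side length down to a multiple of the patch size (at least one patch)."""
    return max(patch, (size // patch) * patch)


def num_tokens(height: int, width: int) -> int:
    """Patch tokens plus CLS and register tokens, for sides already on the patch grid."""
    return (height // PATCH) * (width // PATCH) + 1 + NUM_REGISTERS


def head_params(bias: bool = True) -> int:
    """Parameter count of Linear((1 + 4) * 1280 -> 74 625); the bias is an assumption."""
    in_features = (1 + NUM_REGISTERS) * HIDDEN
    return in_features * NUM_TAGS + (NUM_TAGS if bias else 0)


if __name__ == "__main__":
    print(snap_to_patch(1000))   # 992
    print(num_tokens(512, 512))  # 1029
    print(head_params())         # 477674625
```

At the 512 px training resolution this gives 32 × 32 patch tokens plus 5 special tokens, and the head comes out at roughly 478 M parameters, in line with the ~480 M figure in the table.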

## Training

| Hyperparameter | Value |
|---|---|
| Training data | e621 + Danbooru (parquet) |
| Batch size | 32 |
| Learning rate | 1e-6 |
| Warmup steps | 50 |
| Loss | `BCEWithLogitsLoss` with per-tag `pos_weight = (neg/pos)^(1/T)`, cap 100 |
| Optimiser | AdamW (β₁=0.9, β₂=0.999, wd=0.01) |
| Precision | bfloat16 (backbone) / float32 (projection + loss) |
| Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |

![eval_viz](./eval_viz.png)

## Usage

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

Or manually:

```bash
pip install torch torchvision safetensors Pillow requests \
            python-multipart fastapi uvicorn jinja2 aiofiles
```

### 2. Download model files

```bash
huggingface-cli download lodestones/taggerine \
    tagger_proto.safetensors \
    tagger_vocab_with_categories_and_alias_updated.json \
    tagger_ui_server.py \
    inference_tagger_standalone.py \
    --local-dir .
```

> **Note:** `tagger_proto.safetensors` is ~5.3 GB. Make sure you have enough disk space.

### 3. Download the `tagger_ui/` templates folder

The server requires the `tagger_ui/templates/` directory to be present alongside `tagger_ui_server.py`:

```bash
huggingface-cli download lodestones/taggerine \
    --include "tagger_ui/**" \
    --local-dir .
```

### 4. Run the Web UI

```bash
python tagger_ui_server.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab_with_categories_and_alias_updated.json \
    --port 7860
# → open http://localhost:7860
```

**CPU-only machine?** Add `--device cpu` (inference will be slower):

```bash
python tagger_ui_server.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab_with_categories_and_alias_updated.json \
    --device cpu \
    --port 7860
```

### Standalone CLI inference (no server)

```bash
python inference_tagger_standalone.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab_with_categories_and_alias_updated.json \
    --images photo.jpg \
    --topk 30
```

## Files

| File | Description |
|---|---|
| `tagger_proto.safetensors` | Model weights (bfloat16) |
| `tagger_vocab_with_categories_and_alias_updated.json` | `{"idx2tag": [...], "tag2category": {...}}` — 74 625 tags with category metadata |
| `tagger_vocab_with_categories.json` | Same without alias data |
| `tagger_vocab.json` | Minimal vocab — `{"idx2tag": [...]}` only |
| `inference_tagger_standalone.py` | Self-contained CLI inference script (no `transformers` dep) |
| `tagger_ui_server.py` | FastAPI + Jinja2 web UI server |
| `requirements.txt` | Python dependencies |
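
Since the model outputs raw logits and the vocabulary files carry an `idx2tag` list, turning a logit vector into tag names is a small post-processing step. The following is a minimal sketch using only the standard library. The function names, the `0.5` threshold, and the three-tag toy vocabulary are assumptions made for illustration; the bundled scripts may decode differently.

```python
import json
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_tags(logits, idx2tag, threshold=0.5, topk=None):
    """Return (tag, probability) pairs at or above `threshold`, highest first."""
    scored = [(idx2tag[i], sigmoid(logit)) for i, logit in enumerate(logits)]
    kept = [(tag, p) for tag, p in scored if p >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if topk is None else kept[:topk]


if __name__ == "__main__":
    # Toy stand-in for tagger_vocab.json, which has the shape {"idx2tag": [...]}.
    vocab = json.loads('{"idx2tag": ["solo", "anthro", "sitting"]}')
    logits = [2.0, -1.0, 0.0]  # made-up logits, one per vocabulary entry
    for tag, prob in decode_tags(logits, vocab["idx2tag"]):
        print(f"{tag}: {prob:.3f}")
```

To use the real vocabulary, replace the `json.loads(...)` call with `json.load(open("tagger_vocab.json"))` and pass logits of length 74 625.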

## Tag Vocabulary

Tags are sourced from e621 and Danbooru annotations and cover:

- **Subject** — species, character count, gender (`solo`, `duo`, `anthro`, `1girl`, `male`, …)
- **Body** — anatomy, fur/scale/skin markings, body parts
- **Action / pose** — `looking at viewer`, `sitting`, …
- **Scene** — background, lighting, setting
- **Style** — `digital art`, `hi res`, `sketch`, `watercolor`, …
- **Rating** — explicit content tags are included; filter as needed for your use case

Minimum tag frequency threshold: **50** occurrences across the combined dataset.
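
A frequency threshold like this can be applied with a simple counting pass. The sketch below shows the general idea only and is not the script used to build the released vocabulary: the function name, the per-post de-duplication, and the alphabetical ordering of the result are all assumptions.

```python
from collections import Counter


def build_vocab(posts, min_count=50):
    """Keep tags that appear in at least `min_count` posts.

    `posts` is an iterable of tag lists, one list per image.
    """
    # set(tags) counts a tag at most once per post (an assumed convention).
    counts = Counter(tag for tags in posts for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    toy_posts = [
        ["solo", "anthro", "sitting"],
        ["solo", "1girl"],
        ["solo", "anthro"],
    ]
    # A lower threshold is used here only because the toy dataset is tiny.
    print(build_vocab(toy_posts, min_count=2))  # ['anthro', 'solo']
```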

## Limitations

- Evaluated on booru-style illustrations and furry art; on photographic images or other kinds of artwork it works only to some extent.
- The vocabulary reflects the biases of e621 and Danbooru annotation practices.

## License

Apache 2.0