---
license: apache-2.0
tags:
- image-classification
- multi-label-classification
- booru
- tagger
- danbooru
- e621
- dinov3
- vit
pipeline_tag: image-classification
---
# DINOv3 ViT-H/16+ Booru Tagger
A multi-label image tagger trained on **e621** and **Danbooru** annotations, using a
[DINOv3 ViT-H/16+](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
backbone fine-tuned end-to-end with a single linear projection head.
## Model Details
| Property | Value |
|---|---|
| Backbone | `facebook/dinov3-vith16plus-pretrain-lvd1689m` |
| Architecture | ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens |
| Head | `Linear((1 + 4) × 1280 → 74 625)` — CLS + 4 register tokens concatenated |
| Vocabulary | **74 625 tags** (min frequency ≥ 50 across training set) |
| Input resolution | Any multiple of 16 px — trained at 512 px, generalises to higher resolutions |
| Input normalisation | ImageNet mean/std `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]` |
| Output | Raw logits — apply `sigmoid` for per-tag probabilities |
| Parameters | ~632 M (backbone) + ~480 M (head) |
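As the table notes, the head emits one raw logit per tag and this is a multi-label problem, so per-tag probabilities come from an element-wise sigmoid (no softmax across tags). A minimal pure-Python sketch of that post-processing step, using toy logits and hypothetical tag names in place of the real 74 625-dim output:

```python
import math

def sigmoid(x: float) -> float:
    """Convert one raw logit to a per-tag probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Toy stand-in for the model output: 5 logits instead of 74,625.
tags = ["solo", "1girl", "sketch", "outdoors", "watercolor"]  # hypothetical
logits = [3.2, -1.5, 0.4, -4.0, 1.1]

# Each tag is scored independently; thresholding at 0.5 is one simple policy.
probs = {tag: sigmoid(z) for tag, z in zip(tags, logits)}
predicted = [tag for tag, p in probs.items() if p >= 0.5]
```

In practice you would threshold per tag (or take a top-k, as the CLI's `--topk` flag does) rather than use a single global cutoff.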
## Training
| Hyperparameter | Value |
|---|---|
| Training data | e621 + Danbooru (parquet) |
| Batch size | 32 |
| Learning rate | 1e-6 |
| Warmup steps | 50 |
| Loss | `BCEWithLogitsLoss` with per-tag `pos_weight = (neg/pos)^(1/T)`, cap 100 |
| Optimiser | AdamW (β₁=0.9, β₂=0.999, wd=0.01) |
| Precision | bfloat16 (backbone) / float32 (projection + loss) |
| Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |
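The `pos_weight` row can be made concrete: each tag's positive-class weight is its negative/positive count ratio, tempered by the exponent `1/T` and clipped at 100. A sketch under assumptions — the counts are toy values and `T = 2.0` is illustrative, not the released setting:

```python
def pos_weight(n_pos: int, n_neg: int, T: float = 2.0, cap: float = 100.0) -> float:
    """Per-tag BCE positive weight: (neg/pos)^(1/T), clipped at `cap`.

    T softens the raw imbalance ratio; T=1 would recover neg/pos exactly.
    T=2.0 here is an assumption for illustration.
    """
    if n_pos == 0:
        return cap  # never-positive tag: fall back to the cap
    return min((n_neg / n_pos) ** (1.0 / T), cap)

# A rare tag (50 positives in 1,000,000 images) hits the cap,
# while a common tag gets a weight close to 1.
rare = pos_weight(50, 999_950)        # sqrt(19999) ≈ 141 → clipped to 100
common = pos_weight(400_000, 600_000) # sqrt(1.5) ≈ 1.22
```

The tempering keeps rare tags from dominating the loss while still up-weighting them relative to common tags.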

## Usage
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
Or manually:
```bash
pip install torch torchvision safetensors Pillow requests \
python-multipart fastapi uvicorn jinja2 aiofiles
```
### 2. Download model files
```bash
huggingface-cli download lodestones/taggerine \
tagger_proto.safetensors \
tagger_vocab_with_categories_and_alias_updated.json \
tagger_ui_server.py \
inference_tagger_standalone.py \
--local-dir .
```
> **Note:** `tagger_proto.safetensors` is ~5.3 GB. Make sure you have enough disk space.
### 3. Download the `tagger_ui/` templates folder
The server requires the `tagger_ui/templates/` directory to be present alongside `tagger_ui_server.py`:
```bash
huggingface-cli download lodestones/taggerine \
--include "tagger_ui/**" \
--local-dir .
```
### 4. Run the Web UI
```bash
python tagger_ui_server.py \
--checkpoint tagger_proto.safetensors \
--vocab tagger_vocab_with_categories_and_alias_updated.json \
--port 7860
# → open http://localhost:7860
```
**CPU-only machine?** Add `--device cpu` (inference will be slower):
```bash
python tagger_ui_server.py \
--checkpoint tagger_proto.safetensors \
--vocab tagger_vocab_with_categories_and_alias_updated.json \
--device cpu \
--port 7860
```
### Standalone CLI inference (no server)
```bash
python inference_tagger_standalone.py \
--checkpoint tagger_proto.safetensors \
--vocab tagger_vocab_with_categories_and_alias_updated.json \
--images photo.jpg \
--topk 30
```
## Files
| File | Description |
|---|---|
| `tagger_proto.safetensors` | Model weights (bfloat16) |
| `tagger_vocab_with_categories_and_alias_updated.json` | `{"idx2tag": [...], "tag2category": {...}}` — 74 625 tags with category metadata |
| `tagger_vocab_with_categories.json` | Same without alias data |
| `tagger_vocab.json` | Minimal vocab — `{"idx2tag": [...]}` only |
| `inference_tagger_standalone.py` | Self-contained CLI inference script (no `transformers` dep) |
| `tagger_ui_server.py` | FastAPI + Jinja2 web UI server |
| `requirements.txt` | Python dependencies |
## Tag Vocabulary
Tags are sourced from e621 and Danbooru annotations and cover:
- **Subject** — species, character count, gender (`solo`, `duo`, `anthro`, `1girl`, `male`, …)
- **Body** — anatomy, fur/scale/skin markings, body parts
- **Action / pose** — `looking at viewer`, `sitting`, …
- **Scene** — background, lighting, setting
- **Style** — `digital art`, `hi res`, `sketch`, `watercolor`, …
- **Rating** — explicit content tags are included; filter as needed for your use case
Minimum tag frequency threshold: **50** occurrences across the combined dataset.
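Because explicit tags ship in the vocabulary, filtering is the caller's responsibility. A sketch of category-based filtering against the vocab layout described in the Files table (`idx2tag` list plus `tag2category` map); the entries and category names below are toy assumptions, not values from the real JSON:

```python
import json

# Toy stand-in for tagger_vocab_with_categories_and_alias_updated.json;
# the real file has 74,625 entries.
vocab_json = json.dumps({
    "idx2tag": ["solo", "sketch", "explicit_tag_example"],
    "tag2category": {
        "solo": "general",
        "sketch": "meta",
        "explicit_tag_example": "explicit",
    },
})

vocab = json.loads(vocab_json)
blocked = {"explicit"}  # categories to drop for a SFW deployment

# Keep only tags whose category is allowed; indices line up with idx2tag.
safe_tags = [
    tag for tag in vocab["idx2tag"]
    if vocab["tag2category"].get(tag) not in blocked
]
```

The same map can drive masking at inference time (zeroing blocked indices in the logit vector) instead of post-filtering tag strings.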
## Limitations
- Evaluated on booru-style illustrations and furry art; performance transfers to photographic images and other art styles only to some extent.
- The vocabulary reflects the biases of e621 and Danbooru annotation practices.
## License
Apache 2.0