Commit e59eadd by lodestones (parent 5ab7ede): Update README.md
---
license: apache-2.0
tags:
- image-classification
- multi-label-classification
- booru
- tagger
- danbooru
- e621
- dinov3
- vit
pipeline_tag: image-classification
---

# DINOv3 ViT-H/16+ Booru Tagger

A multi-label image tagger trained on **e621** and **Danbooru** annotations, using a
[DINOv3 ViT-H/16+](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
backbone fine-tuned end-to-end with a single linear projection head.

## Model Details

| Property | Value |
|---|---|
| Backbone | `facebook/dinov3-vith16plus-pretrain-lvd1689m` |
| Architecture | ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens |
| Head | `Linear((1 + 4) × 1280 → 74 625)` — CLS + 4 register tokens concatenated |
| Vocabulary | **74 625 tags** (min frequency ≥ 50 across training set) |
| Input resolution | Any multiple of 16 px — trained at 512 px, generalises to higher resolutions |
| Input normalisation | ImageNet mean/std `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]` |
| Output | Raw logits — apply `sigmoid` for per-tag probabilities |
| Parameters | ~632 M (backbone) + ~480 M (head) |

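The head parameter count in the table follows directly from the `Linear((1 + 4) × 1280 → 74 625)` shape; a quick arithmetic sanity check (no model code, just the dimensions from the table):

```python
# Sanity-check the head size implied by the Model Details table.
in_features = (1 + 4) * 1280   # CLS token + 4 register tokens, hidden dim 1280
num_tags = 74_625              # vocabulary size

weights = in_features * num_tags   # linear weight matrix entries
biases = num_tags                  # one bias per tag
head_params = weights + biases

print(f"head parameters: {head_params / 1e6:.1f} M")
# → head parameters: 477.7 M (the table rounds this to ~480 M)
```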
## Training

| Hyperparameter | Value |
|---|---|
| Training data | e621 + Danbooru (parquet) |
| Batch size | 32 |
| Learning rate | 1e-6 |
| Warmup steps | 50 |
| Loss | `BCEWithLogitsLoss` with per-tag `pos_weight = (neg/pos)^(1/T)`, cap 100 |
| Optimiser | AdamW (β₁=0.9, β₂=0.999, wd=0.01) |
| Precision | bfloat16 (backbone) / float32 (projection + loss) |
| Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |

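The tempered `pos_weight` formula from the loss row can be sketched in a few lines. This is an illustration of the formula only, not the training code, and the temperature `T` below is an assumption (the table states the `(neg/pos)^(1/T)` form and the cap of 100, but not the value of `T`):

```python
def pos_weight(pos: int, neg: int, T: float = 2.0, cap: float = 100.0) -> float:
    """Tempered positive-class weight: (neg/pos)^(1/T), capped.

    T = 2.0 is hypothetical; the README specifies only the functional
    form and the cap of 100.
    """
    return min((neg / pos) ** (1.0 / T), cap)

# A tag present in 100 of 10 000 images: neg/pos = 99.
print(pos_weight(pos=100, neg=9_900))    # ≈ 9.95 with T = 2
print(pos_weight(pos=1, neg=9_999_999))  # extreme imbalance hits the cap → 100.0
```

The cap keeps extremely rare tags from dominating the loss, while the `1/T` exponent compresses the raw imbalance ratio.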
## Usage

### Standalone (no `transformers` dependency)

```python
from inference_tagger_standalone import Tagger

tagger = Tagger(
    checkpoint_path="2026-03-28_22-57-47.safetensors",
    vocab_path="tagger_vocab.json",
    device="cuda",
)

tags = tagger.predict("photo.jpg", topk=40)
# → [("solo", 0.98), ("anthro", 0.95), ...]

# or threshold-based
tags = tagger.predict("https://example.com/image.jpg", threshold=0.35)
```
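
Since the model outputs raw logits (see Model Details), the selection step behind `topk` / `threshold` amounts to a sigmoid followed by sorting or filtering. A minimal sketch of that post-processing, using a toy vocabulary and made-up logits rather than the script's actual code:

```python
import math

def select_tags(logits, idx2tag, topk=None, threshold=None):
    # Sigmoid turns each raw logit into an independent per-tag probability.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    pairs = sorted(zip(idx2tag, probs), key=lambda p: p[1], reverse=True)
    if threshold is not None:
        return [(t, p) for t, p in pairs if p >= threshold]
    return pairs[:topk]

# Toy 4-tag vocabulary with illustrative logits:
vocab = ["solo", "anthro", "sketch", "watercolor"]
print(select_tags([3.0, 1.0, -2.0, -5.0], vocab, topk=2))       # two best tags
print(select_tags([3.0, 1.0, -2.0, -5.0], vocab, threshold=0.5))  # all tags above 0.5
```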

### CLI

```bash
# top-30 tags, pretty output
python inference_tagger_standalone.py \
    --checkpoint 2026-03-28_22-57-47.safetensors \
    --vocab tagger_vocab.json \
    --images photo.jpg https://example.com/image.jpg \
    --topk 30

# comma-separated string (pipe into diffusion trainer)
python inference_tagger_standalone.py ... --format tags

# JSON
python inference_tagger_standalone.py ... --format json
```

### Web UI

```bash
pip install fastapi uvicorn jinja2 aiofiles

python tagger_ui_server.py \
    --checkpoint 2026-03-28_22-57-47.safetensors \
    --vocab tagger_vocab.json \
    --port 7860
# → open http://localhost:7860
```

## Files

| File | Description |
|---|---|
| `*.safetensors` | Model weights (bfloat16) |
| `tagger_vocab.json` | `{"idx2tag": [...]}` — 74 625 tag strings ordered by training frequency |
| `inference_tagger_standalone.py` | Self-contained inference script (no `transformers` dep) |
| `tagger_ui_server.py` | FastAPI + Jinja2 web UI server |
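
The `{"idx2tag": [...]}` layout of the vocab file needs nothing beyond the standard library to read. A minimal sketch using an in-memory stand-in for `tagger_vocab.json` (the real file has 74 625 entries):

```python
import json

# Stand-in for tagger_vocab.json with three tags instead of 74 625.
raw = '{"idx2tag": ["solo", "duo", "anthro"]}'
idx2tag = json.loads(raw)["idx2tag"]

# Model output index i corresponds to tag idx2tag[i].
print(idx2tag[0])    # → solo
print(len(idx2tag))  # → 3 (74 625 for the real vocabulary)
```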
107
+
108
+ ## Tag Vocabulary
109
+
110
+ Tags are sourced from e621 and Danbooru annotations and cover:
111
+
112
+ - **Subject** — species, character count, gender (`solo`, `duo`, `anthro`, `1girl`, `male`, …)
113
+ - **Body** — anatomy, fur/scale/skin markings, body parts
114
+ - **Action / pose** — `looking at viewer`, `sitting`, …
115
+ - **Scene** — background, lighting, setting
116
+ - **Style** — `digital art`, `hi res`, `sketch`, `watercolor`, …
117
+ - **Rating** — explicit content tags are included; filter as needed for your use case
118
+
119
+ Minimum tag frequency threshold: **50** occurrences across the combined dataset.
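
Filtering explicit tags, as the Rating bullet suggests, can be a simple membership check over the predicted (tag, probability) pairs. The blocklist below is purely illustrative; build a real one from the vocabulary for your use case:

```python
# Hypothetical blocklist; not part of the released files.
BLOCKED = {"explicit", "nsfw"}

predictions = [("solo", 0.98), ("explicit", 0.91), ("sketch", 0.40)]
safe = [(t, p) for t, p in predictions if t not in BLOCKED]
print(safe)  # the ("explicit", 0.91) pair is dropped
```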

## Limitations

- Evaluated on booru-style illustrations and furry art; performance on photographic
  images or other art styles is untested.
- The vocabulary reflects the biases of e621 and Danbooru annotation practices.

## License

Apache 2.0