	---
	license: apache-2.0
	tags:
	- image-classification
	- multi-label-classification
	- booru
	- tagger
	- danbooru
	- e621
	- dinov3
	- vit
	pipeline_tag: image-classification
	---

	# DINOv3 ViT-H/16+ Booru Tagger

	A multi-label image tagger trained on e621 and Danbooru annotations, using a
	[DINOv3 ViT-H/16+](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
	backbone fine-tuned end-to-end with a single linear projection head.

	## Model Details

	| Property | Value |
	|---|---|
	| Backbone | `facebook/dinov3-vith16plus-pretrain-lvd1689m` |
	| Architecture | ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens |
	| Head | `Linear((1 + 4) × 1280 → 74 625)`, applied to the concatenated CLS + 4 register tokens |
	| Vocabulary | 74 625 tags (min frequency ≥ 50 across training set) |
	| Input resolution | Any multiple of 16 px; trained at 512 px, generalises to higher resolutions |
	| Input normalisation | ImageNet mean/std `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]` |
	| Output | Raw logits; apply `sigmoid` for per-tag probabilities |
	| Parameters | ~632 M (backbone) + ~480 M (head) |
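
As a quick sanity check on the table, the sketch below recomputes the head size and shows one way to snap an image side to a multiple of the 16 px patch size and apply the ImageNet normalisation. The helper names are hypothetical, and the bias term is an assumption (it is the `Linear` default, but the table does not say whether the head uses one).

```python
PATCH = 16          # patch size of ViT-H/16+
HIDDEN = 1280       # hidden dim from the table
TOKENS = 1 + 4      # CLS + 4 register tokens, concatenated
NUM_TAGS = 74_625   # vocabulary size

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def head_param_count(bias: bool = True) -> int:
    """Parameters of Linear(TOKENS * HIDDEN -> NUM_TAGS); bias is an assumption."""
    in_features = TOKENS * HIDDEN
    return in_features * NUM_TAGS + (NUM_TAGS if bias else 0)


def snap_to_patch(size: int, patch: int = PATCH) -> int:
    """Round a side length down to a multiple of the patch size (at least one patch)."""
    return max(patch, (size // patch) * patch)


def normalise_pixel(rgb):
    """Map an 8-bit RGB triple to ImageNet-normalised floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


print(head_param_count())   # 477674625, i.e. roughly 480 M
print(snap_to_patch(1000))  # 992
```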

	## Training

	| Hyperparameter | Value |
	|---|---|
	| Training data | e621 + Danbooru (parquet) |
	| Batch size | 32 |
	| Learning rate | 1e-6 |
	| Warmup steps | 50 |
	| Loss | `BCEWithLogitsLoss` with per-tag `pos_weight = (neg/pos)^(1/T)`, cap 100 |
	| Optimiser | AdamW (β₁=0.9, β₂=0.999, wd=0.01) |
	| Precision | bfloat16 (backbone) / float32 (projection + loss) |
	| Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |
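
The per-tag weighting in the loss row can be illustrated in plain Python. This is a minimal sketch, not the training code: the value of `T` is not given above, so `temperature` is a caller-supplied assumption, and the function names are hypothetical. The loss follows the standard definition of binary cross-entropy on logits with a positive-class weight.

```python
import math


def pos_weight(pos: int, neg: int, temperature: float, cap: float = 100.0) -> float:
    """(neg / pos) ** (1 / T), capped at `cap`. T is not specified in this card."""
    if pos <= 0:
        return cap  # assumption: a tag with no positives gets the maximum weight
    return min((neg / pos) ** (1.0 / temperature), cap)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, target: float, weight: float) -> float:
    """-[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))] for one tag."""
    # log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x)
    return weight * target * softplus(-logit) + (1.0 - target) * softplus(logit)


# A tag present in 100 of 10 000 images, with an assumed T = 2
w = pos_weight(pos=100, neg=9_900, temperature=2.0)
print(round(w, 3))  # 9.95
print(round(weighted_bce_with_logits(0.0, 1.0, w), 3))
```

Larger `T` flattens the weights toward 1, so rare tags are boosted less aggressively than with the raw `neg/pos` ratio.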

	![eval_viz](./eval_viz.png)

	## Usage

	### 1. Install dependencies

	```bash
	pip install -r requirements.txt
	```

	Or manually:

	```bash
	pip install torch torchvision safetensors Pillow requests \
	python-multipart fastapi uvicorn jinja2 aiofiles
	```

	### 2. Download model files

	```bash
	huggingface-cli download lodestones/taggerine \
	tagger_proto.safetensors \
	tagger_vocab_with_categories_and_alias_updated.json \
	tagger_ui_server.py \
	inference_tagger_standalone.py \
	--local-dir .
	```

	> Note: `tagger_proto.safetensors` is ~5.3 GB. Make sure you have enough disk space.

	### 3. Download the `tagger_ui/` templates folder

	The server requires the `tagger_ui/templates/` directory to be present alongside `tagger_ui_server.py`:

	```bash
	huggingface-cli download lodestones/taggerine \
	--include "tagger_ui/**" \
	--local-dir .
	```
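
Before starting the server, it can help to confirm that the downloads from steps 2 and 3 landed where expected. The sketch below is a hypothetical pre-flight check (the function name and the choice of paths to verify are assumptions based on the commands above, not part of the repository).

```python
from pathlib import Path

# Files fetched in step 2 and the templates directory fetched in step 3
REQUIRED_FILES = (
    "tagger_proto.safetensors",
    "tagger_vocab_with_categories_and_alias_updated.json",
    "tagger_ui_server.py",
    "inference_tagger_standalone.py",
)
REQUIRED_DIRS = ("tagger_ui/templates",)


def missing_paths(root="."):
    """Return the required files and directories that are absent under `root`."""
    root = Path(root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    gaps = missing_paths()
    print("all files present" if not gaps else "missing: " + ", ".join(gaps))
```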

	### 4. Run the Web UI

	```bash
	python tagger_ui_server.py \
	--checkpoint tagger_proto.safetensors \
	--vocab tagger_vocab_with_categories_and_alias_updated.json \
	--port 7860
	# → open http://localhost:7860
	```

	CPU-only machine? Add `--device cpu` (inference will be slower):

	```bash
	python tagger_ui_server.py \
	--checkpoint tagger_proto.safetensors \
	--vocab tagger_vocab_with_categories_and_alias_updated.json \
	--device cpu \
	--port 7860
	```

	### Standalone CLI inference (no server)

	```bash
	python inference_tagger_standalone.py \
	--checkpoint tagger_proto.safetensors \
	--vocab tagger_vocab_with_categories_and_alias_updated.json \
	--images photo.jpg \
	--topk 30
	```
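
To tag a whole folder, one option is to build the same command once per image from Python. This sketch uses only the flags shown above; whether `--images` accepts several paths in one call is not documented here, so it conservatively issues one invocation per file. The helper names and the list of image suffixes are assumptions.

```python
import sys
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set of inputs


def build_command(
    image,
    topk=30,
    checkpoint="tagger_proto.safetensors",
    vocab="tagger_vocab_with_categories_and_alias_updated.json",
):
    """Argument list mirroring the CLI call shown above, for a single image."""
    return [
        sys.executable, "inference_tagger_standalone.py",
        "--checkpoint", checkpoint,
        "--vocab", vocab,
        "--images", str(image),
        "--topk", str(topk),
    ]


def commands_for_folder(folder, topk=30):
    """One command per image file found directly inside `folder`."""
    images = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [build_command(p, topk) for p in images]
```

Each command can then be passed to `subprocess.run(cmd, check=True)`. Note that this reloads the checkpoint for every image, so for large batches the web UI server is likely the faster route.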

	## Files

	| File | Description |
	|---|---|
	| `tagger_proto.safetensors` | Model weights (bfloat16) |
	| `tagger_vocab_with_categories_and_alias_updated.json` | `{"idx2tag": [...], "tag2category": {...}}`, 74 625 tags with category metadata |
	| `tagger_vocab_with_categories.json` | Same without alias data |
	| `tagger_vocab.json` | Minimal vocab: `{"idx2tag": [...]}` only |
	| `inference_tagger_standalone.py` | Self-contained CLI inference script (no `transformers` dep) |
	| `tagger_ui_server.py` | FastAPI + Jinja2 web UI server |
	| `requirements.txt` | Python dependencies |
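
For readers writing their own post-processing, the sketch below shows how raw logits could be turned into ranked tags using the `idx2tag` list from any of the vocabulary files. It assumes the logit at position `i` corresponds to `idx2tag[i]`; the function names and the `threshold` parameter are hypothetical and not taken from the bundled scripts.

```python
import json
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def load_idx2tag(path):
    """Read the `idx2tag` list from a vocabulary JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["idx2tag"]


def top_tags(logits, idx2tag, topk=30, threshold=0.0):
    """Return up to `topk` (tag, probability) pairs with probability >= threshold."""
    probs = [sigmoid(x) for x in logits]
    ranked = sorted(zip(idx2tag, probs), key=lambda pair: pair[1], reverse=True)
    return [(tag, p) for tag, p in ranked[:topk] if p >= threshold]


# Toy example with a three-tag vocabulary
print(top_tags([2.0, -1.0, 0.0], ["solo", "sketch", "anthro"], topk=2))
```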

	## Tag Vocabulary

	Tags are sourced from e621 and Danbooru annotations and cover:

	- Subject — species, character count, gender (`solo`, `duo`, `anthro`, `1girl`, `male`, …)
	- Body — anatomy, fur/scale/skin markings, body parts
	- Action / pose — `looking at viewer`, `sitting`, …
	- Scene — background, lighting, setting
	- Style — `digital art`, `hi res`, `sketch`, `watercolor`, …
	- Rating — explicit content tags are included; filter as needed for your use case

	Minimum tag frequency threshold: 50 occurrences across the combined dataset.
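
The frequency cut-off can be pictured as follows. This is only an illustrative sketch: it assumes a tag is counted once per post and that the resulting list is ordered by descending frequency, neither of which is stated above, and `build_vocab` is a hypothetical name.

```python
from collections import Counter


def build_vocab(tag_lists, min_freq=50):
    """Keep tags that occur in at least `min_freq` posts.

    `tag_lists` is an iterable of per-post tag lists from the combined dataset.
    """
    # set() so a tag repeated within one post is only counted once (assumption)
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    kept = [tag for tag, count in counts.items() if count >= min_freq]
    kept.sort(key=lambda tag: (-counts[tag], tag))  # assumed ordering
    return {"idx2tag": kept}


# Toy dataset with a lowered threshold so the effect is visible
posts = [["solo", "anthro"]] * 3 + [["solo", "sketch"]]
print(build_vocab(posts, min_freq=3))  # {'idx2tag': ['solo', 'anthro']}
```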

	## Limitations

	- Evaluated on booru-style illustrations and furry art only. The model works to some extent on photographic images and other art styles, but its performance there has not been measured.
	- The vocabulary reflects the biases of e621 and Danbooru annotation practices.

	## License

	Apache 2.0