# Evaluating Anime Models Systematically - Basics


I was trying to refine my character models when I realized that the way I've been making models is really inefficient.
It typically goes like this: tweak some configs or data, try some random prompts, see if the results look okay.
It would be helpful to establish a well-defined procedure.
It's also apparent that to evaluate fine-tuned models, it's essential to know and quantify how the base models perform as a baseline.
So here I am, trying to evaluate base models.


I collected 1000 random prompts from Danbooru posts from 2021-2022 with the query `chartags:0 -is:child -rating:e,q order:random score:>=10 filetype:jpg,png,webp ratio:0.45..2.1`
and generated 1000 640x640 images with them for each of 3 widely-used anime models:
[animefull-latest](https://huggingface.co/deepghs/animefull-latest),
[Counterfeit-V3.0](https://civitai.com/models/4468?modelVersionId=57618), [MeinaMix_V11](https://huggingface.co/Meina/MeinaMix_V11).

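For context, the generation loop was roughly as follows. This is a minimal sketch, not my exact script: the local checkpoint path, prompt file, and per-prompt seeding are assumptions, since I kept configs identical across models rather than documenting them.

```python
# Minimal sketch of the generation loop (paths and seeds are assumed).
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical local path to a diffusers-format copy of the model under test.
pipe = StableDiffusionPipeline.from_pretrained(
    "./models/animefull-latest",
    torch_dtype=torch.float16,
    safety_checker=None,  # disabled, as noted in the miscellaneous notes below
).to("cuda")

with open("prompts.txt") as f:  # hypothetical file: one Danbooru tag string per line
    prompts = [line.strip() for line in f]

for i, prompt in enumerate(prompts):
    # One fixed seed per prompt index so every model sees the same initial noise.
    generator = torch.Generator("cuda").manual_seed(i)
    image = pipe(prompt, width=640, height=640, generator=generator).images[0]
    image.save(f"out/{i:04d}.png")
```
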
A model can be evaluated on a number of aspects: fidelity, text-image alignment, aesthetics, and diversity. Let's go through them one by one.


## Fidelity


Generated images should be indistinguishable from real ones. They should make sense and not contain obvious errors such as extra limbs, mutated fingers, glitches, or random blobs.
In the literature, it's common to use metrics based on distribution distance, such as FID (Fréchet Inception Distance) and IS (Inception Score). I calculated the KID (Kernel Inception Distance) of the 3 sets of generated images against the 1000 real images.

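The post doesn't pin down an implementation, so as one plausible route, here's a sketch using `torchmetrics`' `KernelInceptionDistance`; the folder layout and subset size are assumptions.

```python
# Sketch: KID between a generated set and the real reference set, via torchmetrics.
import glob

import numpy as np
import torch
from PIL import Image
from torchmetrics.image.kid import KernelInceptionDistance

def load_batch(paths, size=299):
    """Stack images into an (N, 3, size, size) uint8 tensor for the Inception backbone."""
    arrays = [np.array(Image.open(p).convert("RGB").resize((size, size))) for p in paths]
    return torch.from_numpy(np.stack(arrays)).permute(0, 3, 1, 2)

kid = KernelInceptionDistance(subset_size=100)  # subset_size must not exceed the set size
kid.update(load_batch(sorted(glob.glob("real/*"))), real=True)   # hypothetical folders
kid.update(load_batch(sorted(glob.glob("out/*"))), real=False)
kid_mean, kid_std = kid.compute()
print(f"KID: {kid_mean:.5f} (+/- {kid_std:.5f})")
```
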
| model | KID (lower better) |
|---|---|
| animefull-latest | 0.01192 |
| Counterfeit-V3.0 | 0.01807 |
| MeinaMix_V11 | 0.01345 |

It seems that KID does not align with human evaluation, which would generally rate animefull-latest as the worst of the three.
This is kind of expected, since models with a strong style have an image feature distribution quite different from that of random real images.

I also tried multimodal LLMs, including GPT-4V and LLaVA, and unfortunately found them quite useless. GPT-4V is supposedly SOTA, but it is clearly poor at spotting generation errors.

|  |
| |
| |
|  |
| |
So currently I can't find a procedure that computes a usable fidelity score for anime models; this will have to wait until someone trains a specialized model.

## Text-Image Alignment

Generated images should not contradict the text prompts. A popular metric is the CLIP score, the cosine similarity of the projected CLIP embeddings of the image and the text.
There's also [PickScore_v1](https://huggingface.co/yuvalkirstain/PickScore_v1), which is fine-tuned on human preference data.
Neither is well-suited to anime models, because Booru tag prompts are very different from the natural captions these models were trained on.

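For reference, the vanilla CLIP score is straightforward to compute with `transformers`; the model choice here is illustrative, not something this post's numbers rely on.

```python
# Sketch: CLIP score = cosine similarity of projected image/text embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, prompt: str) -> float:
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```
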
Models using Booru-tag prompts can instead be evaluated with a tagger. Specifically, I used [wd-v1-4-moat-tagger-v2](https://huggingface.co/SmilingWolf/wd-v1-4-moat-tagger-v2) with a threshold of 0.35.
A tag accuracy score can be defined as `#{prompted tags correctly reproduced}/#{prompted tags}`, macro-averaged over all images. Here are the scores:

| model | tag accuracy (higher better) |
|---|---|
| animefull-latest | 0.464328 |
| Counterfeit-V3.0 | 0.434574 |
| MeinaMix_V11 | 0.375389 |

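To make the definition concrete, here's a sketch of the aggregation. `predict_tags` is a hypothetical stand-in for running the tagger and keeping tags that score at or above 0.35; the formula itself follows the definition above.

```python
# Sketch: macro-averaged tag accuracy over (prompted tags, generated image) pairs.
def predict_tags(image_path: str) -> set[str]:
    """Hypothetical helper: run wd-v1-4-moat-tagger-v2 and keep tags scoring >= 0.35."""
    raise NotImplementedError

def tag_accuracy(prompts: list[set[str]], image_paths: list[str]) -> float:
    per_image = []
    for prompted, path in zip(prompts, image_paths):
        reproduced = prompted & predict_tags(path)
        per_image.append(len(reproduced) / len(prompted))  # accuracy for this image
    return sum(per_image) / len(per_image)                 # macro-average over images
```
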
It can be seen that fine-tunes or merges may produce nicer images, but at the cost of controllability.


## Aesthetics


Images should be pretty. While this is generally subjective, there are models that output an aesthetic score, either averaged from many people's preferences or personalized.
There are CLIP-based models ([aesthetic-predictor](https://github.com/LAION-AI/aesthetic-predictor), [improved-aesthetic-predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor))
and some custom models ([anime-aesthetic](https://huggingface.co/spaces/skytnt/anime-aesthetic-predict), [cafe_aesthetic](https://huggingface.co/cafeai/cafe_aesthetic)).

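As a sketch of how the CLIP-based predictors work: a CLIP ViT-L/14 image embedding is L2-normalized and passed through a small regression head. The layer sizes and checkpoint name below are from my reading of the improved-aesthetic-predictor repo and should be treated as assumptions.

```python
# Sketch: CLIP embedding -> small MLP -> aesthetic score (architecture assumed).
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

head = nn.Sequential(  # assumed layout of the aesthetic regression head
    nn.Linear(768, 1024), nn.Dropout(0.2),
    nn.Linear(1024, 128), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.Dropout(0.1),
    nn.Linear(64, 16),
    nn.Linear(16, 1),
)
# Weights would come from the repo's checkpoint, e.g.:
# head.load_state_dict(torch.load("sac+logos+ava1-l14-linearMSE.pth"))  # keys may need remapping

def aesthetic_score(image_path: str) -> float:
    inputs = proc(images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # predictor expects L2-normalized embeddings
        return float(head(emb))
```
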
I computed the average scores from improved-aesthetic-predictor and anime-aesthetic:

| model | improved-aesthetic-predictor (higher better) | anime-aesthetic (higher better) |
|---|---|---|
| animefull-latest | 6.124954 | 0.639767 |
| Counterfeit-V3.0 | 6.359464 | 0.789190 |
| MeinaMix_V11 | 6.474662 | 0.829989 |

The two scores appear to agree.

Interestingly, GPT-4V does a reasonable job at this.


## Diversity

Even with the same prompt, given different random seeds, generated images should not be repetitive.
There's the DIV score defined in the [Dreambooth paper](https://arxiv.org/pdf/2208.12242.pdf), which measures pairwise image similarity with LPIPS.
That metric isn't applicable to this particular set of images, since each prompt was generated only once, so I will leave it to a future update.

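For a future run with several seeds per prompt, a DIV-style score can be sketched with the `lpips` package; the exact averaging in the paper may differ, so treat this as an approximation.

```python
# Sketch: mean pairwise LPIPS among same-prompt generations (higher = more diverse).
from itertools import combinations

import lpips
import torch

loss_fn = lpips.LPIPS(net="vgg")  # backbone choice assumed; the package defaults to AlexNet

def div_score(images: list[torch.Tensor]) -> float:
    """images: (3, H, W) tensors scaled to [-1, 1], all generated from one prompt."""
    dists = [float(loss_fn(a.unsqueeze(0), b.unsqueeze(0)))
             for a, b in combinations(images, 2)]
    return sum(dists) / len(dists)
```
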
## Conclusions

It's possible to programmatically produce a set of numbers for a given base model, and we can use those numbers as a proxy for the model's overall performance.


## Miscellaneous notes

I used diffusers, and 13 images from animefull-latest came out as solid black for unknown reasons, even with the safety checker disabled and a single-precision VAE.
These images and their counterparts in the other sets were excluded from the metrics calculation.

The images and prompts can be found [here](https://huggingface.co/datasets/gustproof/sd-data/tree/main/db1k).

It's possible that some models perform better with special configs, but for simplicity I kept the settings the same for all models.

The code for image generation and metrics is quite messy, so I will not upload it right now, but feel free to ask questions or give suggestions.

I'll probably create a fidelity model eventually if no one else does, but it will take a while.



Prompts with more tags have lower tag accuracy:


The effect of tag position is measurable, albeit less pronounced. The trend at positions 20-25 may be due to the 77-token limit wraparound.


The next post will be about evaluating character models.