
What model was used to generate natural language captions for the dataset?

#126
by analogspiderweb - opened

I'm curious which LLM/VLM was used for the captions, and how capable it is at understanding more obscure / underrepresented concepts. This was already asked a couple of months ago, but from what I've seen, no straight answer has been given.
LoRAs have worked for me for more specific concepts; this is just to gauge how the model was captioned and perhaps understand what I could be doing differently.

Going by tdrussel's tips on how to train a LoRA (https://civitai.com/models/2536147/greg-rutkowski-style-anima),
he probably used Gemma, though not Gemma 4, which came out this April. In any case, that link should be useful to you; it should answer all your questions.

I've trained multiple LoRAs with:

  • Tags only
  • Tags 50% + Natlang 50% (<100 words)
  • Natlang only (<250 words)

And I found that natural-language-only captions work best in my case. I used ToriiGate-0.5 in short mode and manually removed the last one or two sentences if they described the overall vibe or the style of the image. Now that I think of it, that process could also be automated by passing all the caption files to another LLM and asking it to remove any emotional / style / concept-describing elements.
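The cleanup step above could also be approximated with a simple heuristic before (or instead of) involving another LLM. A minimal sketch, assuming a naive sentence split and a hypothetical keyword list; the actual style/vibe detection the poster did by hand is more nuanced than this:

```python
import re

# Hypothetical keyword list for sentences that describe overall style or
# mood rather than concrete image content; tune for your own captions.
STYLE_WORDS = {"style", "atmosphere", "mood", "vibe", "aesthetic", "overall"}

def strip_style_sentences(caption: str, max_drop: int = 2) -> str:
    """Drop up to `max_drop` trailing sentences that look style/vibe-related."""
    # Naive split on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", caption.strip())
    dropped = 0
    while sentences and dropped < max_drop:
        if any(w in sentences[-1].lower() for w in STYLE_WORDS):
            sentences.pop()
            dropped += 1
        else:
            break
    return " ".join(sentences)

cap = ("A knight in silver armor stands on a cliff. "
       "Storm clouds gather behind him. "
       "The overall mood is dark and painterly.")
print(strip_style_sentences(cap))
# → A knight in silver armor stands on a cliff. Storm clouds gather behind him.
```

A keyword filter like this will miss paraphrased style descriptions, which is why passing the files through an LLM (or doing it manually, as above) is more reliable.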

A multi-resolution training approach also helped tremendously in my case. I used [512, 768, 1024] instead of [512, 1024, 1536]. Training at 512-768 px for concepts or characters and at 1024 px and above for texture might be the solution in your case without changing anything else.
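One way to read the multi-resolution setup above is as bucketing: each image gets snapped to the nearest resolution in the training set. A minimal sketch, assuming shorter-side matching; the poster's actual trainer and bucketing logic are not specified:

```python
# Resolutions from the [512, 768, 1024] setup mentioned above.
RESOLUTIONS = [512, 768, 1024]

def pick_bucket(width: int, height: int, resolutions=RESOLUTIONS) -> int:
    """Snap an image to the closest training resolution by its shorter side."""
    # Use the shorter side so the whole image still fits after resizing.
    short_side = min(width, height)
    return min(resolutions, key=lambda r: abs(r - short_side))

print(pick_bucket(800, 1200))   # → 768
print(pick_bucket(2000, 2000))  # → 1024
print(pick_bucket(500, 640))    # → 512
```

With the smaller [512, 768, 1024] set, more images land in the mid-size buckets, which matches the observation that 512-768 px suffices for concepts and characters.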

Kinda unrelated to the title question, but this addresses your "perhaps understand what I could be doing differently".
