
What model was used to generate natural language captions for the dataset?

#126
by analogspiderweb - opened

I'm curious which LLM/VLM was used for the captions, and how capable it is at understanding more obscure / underrepresented concepts. This was already asked a couple of months ago, but from what I've seen, no straight answer has been given.
LoRAs have worked for me for more specific concepts; this is just to gauge how the model was captioned and perhaps understand what I could be doing differently.

Going by tdrussel's tips on how to train a LoRA (https://civitai.com/models/2536147/greg-rutkowski-style-anima),
he probably used Gemma, though not Gemma 4, which came out this April. In any case, that link should be useful to you; it should answer all your questions.

I've trained multiple LoRAs with:

  • Tags only
  • Tags 50% + Natlang 50% (<100 words)
  • Natlang only (<250 words)

And I found that natural-language-only captions work best in my case. I used ToriiGate-0.5 in short mode and manually removed the last one or two sentences if they described the overall vibe or the style of the image. Now that I think of it, that process could also be automated by passing all the caption files to another LLM and asking it to remove any emotional / style / concept-describing elements.
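The cleanup step above could also be approximated with a simple heuristic before (or instead of) involving another LLM. A minimal sketch, assuming a naive sentence split and a hypothetical keyword list; the actual style/vibe detection the poster did by hand is more nuanced than this:

```python
import re

# Hypothetical keyword list for sentences that describe overall style or
# mood rather than concrete image content; tune for your own captions.
STYLE_WORDS = {"style", "atmosphere", "mood", "vibe", "aesthetic", "overall"}

def strip_style_sentences(caption: str, max_drop: int = 2) -> str:
    """Drop up to `max_drop` trailing sentences that look style/vibe-related."""
    # Naive split on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", caption.strip())
    dropped = 0
    while sentences and dropped < max_drop:
        if any(w in sentences[-1].lower() for w in STYLE_WORDS):
            sentences.pop()
            dropped += 1
        else:
            break
    return " ".join(sentences)

cap = ("A knight in silver armor stands on a cliff. "
       "Storm clouds gather behind him. "
       "The overall mood is dark and painterly.")
print(strip_style_sentences(cap))
# → A knight in silver armor stands on a cliff. Storm clouds gather behind him.
```

A keyword filter like this will miss paraphrased style descriptions, which is why passing the files through an LLM (or doing it manually, as above) is more reliable.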

A multi-resolution training approach also helped tremendously in my case. I used [512, 768, 1024] instead of [512, 1024, 1536]. Training at 512-768 px for concepts or characters and at 1024 px and above for texture might be the solution in your case without changing anything else.
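One way to read the multi-resolution setup above is as bucketing: each image gets snapped to the nearest resolution in the training set. A minimal sketch, assuming shorter-side matching; the poster's actual trainer and bucketing logic are not specified:

```python
# Resolutions from the [512, 768, 1024] setup mentioned above.
RESOLUTIONS = [512, 768, 1024]

def pick_bucket(width: int, height: int, resolutions=RESOLUTIONS) -> int:
    """Snap an image to the closest training resolution by its shorter side."""
    # Use the shorter side so the whole image still fits after resizing.
    short_side = min(width, height)
    return min(resolutions, key=lambda r: abs(r - short_side))

print(pick_bucket(800, 1200))   # → 768
print(pick_bucket(2000, 2000))  # → 1024
print(pick_bucket(500, 640))    # → 512
```

With the smaller [512, 768, 1024] set, more images land in the mid-size buckets, which matches the observation that 512-768 px suffices for concepts and characters.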

Kinda unrelated to the title question, but this addresses your "perhaps understand what I could be doing differently".
