All "mega" versions output cartoony or heavily filtered video when doing I2V

#120
by cipherstream - opened

Like the title says, anything I generate using I2V with any of the mega versions results in a video that looks very cartoony, as if everything was heavily airbrushed or had filters applied. This wasn't the case with the previous versions. v10, for example, can create videos via I2V that look quite realistic, keeping the same look as the original image.

I'm surprised no one else has mentioned this yet. It is so bad by comparison that I am still using v10 for everything.

I agree. Especially noticeable when I'm doing long VACE videos. Everything becomes more and more cartoony. However, it is only with VACE that I can do long videos because it's not possible to feed multiple overlapping frames into a non-VACE model, right? VACE has other problems with motion and body language. Phr00t is saying he is addressing this in the next MEGA. But probably nothing about the cartoony thing yet. He is asking for side-by-side evidence of v10 vs MEGA.

@VainGuard
I'm working on a long-running WF for VACE (MEGA) that keeps the character consistent over the whole length of the video. The results so far are looking promising. Not sure when I will release that WF, but it should be within 1-2 days.

I don't have your ambition to make workflows for others, I just want to say: using a reference image with latent trim did the consistency trick for me with 'infinite video'.

I'm working on a new "Mega" version that I'm hoping will improve this.

@Phr00t , were you looking for some evidence that Mega was doing cartoon vs v10 realistic?

Prompt: A close up shot of a green monster's face. He is very angry and eating popcorn. A nice woman behind him whispers into his ear. Hard light. Orbital camera. Photorealistic.

Prompt: Cinematic scene. Live action scene. Real life. High quality photography, photorealistic, fine details, real skin. In a cafe. A close up shot of a clown. He is very angry and eating spaghetti. A nice circus woman behind him whispers into his ear.

It's neat that Mega can do cartoony/CGI, but it's sad when it's the default and I don't know what keywords would make it realistic like v10. I can sometimes trick Mega with "a man that is acting and dressing like a green monster" if I get a lucky seed. But the real problem is when we do extended videos: the model wants to go cartoony. The skin morphs into CGI with no texture and high contrast. This also changes the way NSFW elements are handled.

I am very interested in what Mega v5 will bring.

Are these with the NSFW LORA version?

I much prefer the cartoon green monster which is what I would expect from that prompt. The second "real" image looks like you describe: a man with makeup on, which is why a more specific prompt could help if that is what you are looking for.

The first clown one looks a little better to me, but not that much different as realism is concerned.

The "skin morphing to CGI" is a similar issue I am trying to improve with a future "Mega" version.

@Phr00t For the videos posted, the v10 was NSFW and the Mega v4 was plain. But I did tests with Mega v3.1 NSFW and had the same problem. I did do prompt massage tests like "man dressed and acting like a green monster", in which case Mega sometimes did a real man, but it also did an even heavier cartoon depending on the luck of the seed. I even got one that turned the girl into CGI too. I didn't want to flood the thread with all my tests.

So while you might prefer the non-realistic versions, I am trying to do all realistic stuff by default. (The green man looks ugly, but I respect the closer realism.) And it is fine if the model does cartoon/CGI, but we need a way to override the forced cartoon. I am trying to prompt with all sorts of variations of "photorealistic/high quality/real skin" but that's all ignored. I did try a CFG of 2 with "cartoon", etc. in the negative prompt. It doubled the generation time and only worked once in tests.
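For anyone wondering why a CFG of 2 roughly doubles generation time: with classifier-free guidance, each sampling step runs the model twice, once with the positive prompt and once with the negative prompt, and then blends the two predictions. At a CFG of 1 the blend reduces to the positive prediction alone, so the negative pass can be skipped. Here is a minimal sketch of the blend, with plain Python lists standing in for the model outputs. The function names are made up for illustration, and the skip at CFG 1 is an assumption about how the sampler is implemented.

```python
def cfg_blend(cond, uncond, scale):
    """Classifier-free guidance blend: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def passes_per_step(scale):
    # Assumption: the sampler skips the negative-prompt pass when scale == 1,
    # because the blend then equals the positive prediction anyway.
    return 1 if scale == 1 else 2


cond = [0.5, 1.0]     # stand-in for the positive-prompt prediction
uncond = [0.25, 0.0]  # stand-in for the negative-prompt prediction

print(cfg_blend(cond, uncond, 1.0))  # identical to cond
print(cfg_blend(cond, uncond, 2.0))  # pushed away from uncond
print(passes_per_step(1.0), passes_per_step(2.0))
```

So the negative prompt only has any effect above CFG 1, and that is also where the extra model pass comes from.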

I also did monkey tests. The v10 NSFW (bottom) looked more realistic as well. The Mega v4 (top) looks very good, but I prefer v10. The subtle difference means a lot when dealing with the uncanny valley.

Prompt: Photorealistic. A close up shot of a monkey face. He is very angry and eating popcorn. A crazy woman behind him whispers into his ear.

LoRAs are helping at least with some of my action that I'm trying to do. I'm doing tests now.

I don't want to bake any style in "by default" as I want the prompt followed with as little bias as possible. v10 perhaps had too much human realism bias, where "Mega" might have a touch too much cartoon. I think I will be able to finally upload "Mega v5" tomorrow (hopefully), which I think is a step up in overall quality (but it's ultimately fuzzy math). I'm still working out some minor issues, but this is a frame from my internal "Mega v5" with your same prompt:

image

sa_solver/beta (still trying to find the best parameters) example of current internal "Mega v5":

image

Those are screenshots from a botched model build, so it should be even better than that :D

@Phr00t , thank you for your tests. These look great!

I finally figured something out after doing more tests. (I was doing silly things like translating the prompt to Chinese and trying all sorts of tags.) But I finally got a realistic green monster in Mega without just a man in green makeup. (Maybe you can say this does look like a man in green makeup, but I say no because his features and proportions are morphed Hulk-like but have skin texture and fine hair, which is what I was going for in this case.) The formula of the prompt is more important than I expected. I was putting "Photorealistic" as a prefix and/or suffix. That had no bearing on the subject. I finally put "photorealistic green monster" together and things started working. So simple, but so powerful.

"A close up shot of a photorealistic real life green monster's face. He has real hair beard, is very angry and eating popcorn. A nice woman behind him whispers into his ear. Hard light. Incredible skin texture pores hair." (sic: I avoid commas and extra 'and's)

Putting this same prompt back into v10, it would just make a guy with some green makeup. So Mega became superior with photorealism with the prompt formula change. There is still some seed luck going on here because most of the time the skin had a CGI like sheen to it. I'm happy that maybe Mega can work the way I want if I get better at prompting the way it needs.

I need to do more tests/prompt re-writes as when I was trying to do graveyard tombstones, vampires, tilted speed lapse photography, those things came out like PS2-parity video game graphics. (While the same prompts were automatically realistic in v10.) Do I just have to call out every item as "photorealistic x"? I'll do it if I have to. Then the real test is to see if that will help the extended length videos not become cartoony.
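If every item really does need its own "photorealistic x", a small helper can do the rewrite so the prompt doesn't have to be edited by hand each time. This is only a sketch: `tag_subjects` is a made-up name, it assumes you supply the list of subjects yourself, and it does simple whole-word matching with no grammar handling (it won't fix "an" vs "a", for example).

```python
import re


def tag_subjects(prompt, subjects, tag="photorealistic"):
    """Put `tag` directly in front of each listed subject in the prompt."""
    for subject in subjects:
        # Whole-word match; the lookbehind skips subjects already tagged,
        # so running the helper twice does not stack the tag.
        pattern = re.compile(
            r"(?<!%s )\b%s\b" % (re.escape(tag), re.escape(subject)),
            re.IGNORECASE,
        )
        prompt = pattern.sub(lambda m: f"{tag} {m.group(0)}", prompt)
    return prompt


prompt = (
    "A close up shot of a green monster's face. "
    "A nice woman behind him whispers into his ear."
)
print(tag_subjects(prompt, ["green monster", "woman"]))
```

Whether tagging every subject actually helps on extended videos is still the open question from the tests above.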

Mega v5 is out, give it a try!

Is there a specific layer we should try to skip when using MEGA v4? With other WAN models, I got good face results by skipping layers 8, 9 and 10 at 720p.

@Phr00t Mega v5 is amazing, man! I have a lot of confidence with these tests.

I2V issues also fixed?

Not with the standard workflow. You want to use one of the extended workflows (either Tom's or mine) because it fills the VACE buffer with 12 frames of your image (you could adjust this down if it still works). It trims these at the end. So you do lose frames and you can only do 4 second segments because it has to leave latent room for the segment overlap. You can use the extended workflow with a single prompt line.
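To make the frame loss concrete, here is a rough sketch of the budget for one segment. Only the 12 padding frames come from the description above. The window size, the frame rate and the overlap count are placeholder assumptions, so plug in whatever your workflow actually uses, and `usable_frames` is a made-up helper name.

```python
def usable_frames(window, pad=12, overlap=0):
    """Frames of new video that one segment contributes.

    window:  frames generated per segment (placeholder, check your workflow)
    pad:     copies of the start image placed in the VACE buffer, trimmed later
    overlap: frames shared with the previous segment (0 for the first segment)
    """
    new = window - pad - overlap
    if new <= 0:
        raise ValueError("pad + overlap leaves no room for new frames")
    return new


FPS = 16     # assumption: output frame rate
WINDOW = 81  # assumption: frames generated per segment

first = usable_frames(WINDOW, pad=12)
later = usable_frames(WINDOW, pad=12, overlap=8)  # overlap of 8 is a placeholder
print(first, "frames,", first / FPS, "seconds")
print(later, "frames,", later / FPS, "seconds")
```

Lowering `pad`, if the result still holds up, gives those frames straight back to the segment.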

You will still lose some character consistency, especially with extended segments. However, if you have a non-action scene, you can turn the reference image on in addition to the start image (using the same photo for both if you like) and it will try to enforce the consistency. You can also do a mix where you use T2V mode with the reference image turned on, and the results are interesting.

I know what I have to do is create WAN LoRAs for my characters to get character adherence. Too bad VACE's reference image doesn't respect transparency or a mask, because forcing the background ruins the use of the reference image.

Interesting Idea... I'll give it a try, report back. Doubt it'd top V9 or 10's I2V, but fingers crossed!
