The aspect ratio problem was solved by NovelAI when they trained SD v1.4 on images with different aspect ratios using a technique they call "Aspect Ratio Bucketing", and after that it became commonly used in the final stage of training.
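The core idea is simple enough to sketch. Here's a minimal, illustrative version in Python; the 64-pixel step, the 512x768 pixel budget, and the dimension limits are my assumptions, not NovelAI's exact values (their post has the real details). You build a set of width/height buckets with a roughly constant pixel count, then assign each training image to the bucket with the closest aspect ratio:

    import math

    # Minimal sketch of aspect ratio bucketing. The step, pixel budget, and
    # dimension limits here are assumptions; NovelAI's actual values and
    # their resizing/cropping logic differ (see their blog post).
    def build_buckets(max_pixels=512 * 768, step=64, min_dim=256, max_dim=1024):
        buckets = set()
        for w in range(min_dim, max_dim + 1, step):
            # tallest height (a multiple of `step`) that stays within the pixel budget
            h = min(max_dim, (max_pixels // w) // step * step)
            if h >= min_dim:
                buckets.add((w, h))
                buckets.add((h, w))  # portrait counterpart
        return sorted(buckets)

    def assign_bucket(img_w, img_h, buckets):
        # choose the bucket whose aspect ratio is closest (in log space) to the image's
        target = math.log(img_w / img_h)
        return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

    buckets = build_buckets()
    print(assign_bucket(1920, 1080, buckets))  # a 16:9-ish bucket, (832, 448) with these defaults

Batches are then drawn from one bucket at a time, so every image in a batch shares the same resolution.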
Nothing here seems that impressive, and none of the ratios shown deviate much from what anything post-SDXL can already do.
Might have been impressed if some extreme letterbox or vertical-banner-style portrait had been shown, but everything here works fine in SDXL (and especially Flux), and the cat image doesn't even feature a press conference or journalists.
The two images shown in the article using the new method are sort of… stylized, or slightly cartoonish, in a way that the images generated without their method are not. Their images also have a “perfectly framed, looking straight at the camera” quality, which looks a little artificial. The images not using their method have a more natural look (although, obviously, they have the issue with the duplicated subject).
I wonder if that is an unavoidable result of their method, or just a minor issue (of course it is hard to get infinite compute as an academic; maybe they just need to train more. Is that a thing? I don’t AI).
Cartoonish output is a problem across the board. If you explicitly ask DALL-E for a "photograph" of something, you will very often get a result that looks like a cartoonified illustration. Prompt writers resort to specifying exact camera models and lenses to try to constrain the process.
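For what it's worth, the workaround usually looks something like this (the wording is illustrative only, not a guaranteed fix):

    # Illustrative only: the kind of camera-gear phrasing people bolt onto
    # prompts to steer diffusion models toward photographic output.
    prompt = (
        "photograph of a red fox in a meadow, "
        "shot on Canon EOS 5D Mark IV, 85mm f/1.8 lens, "
        "shallow depth of field, natural lighting"
    )

    # Models that accept a negative prompt (e.g. Stable Diffusion) also get
    # nudged away from illustration styles:
    negative_prompt = "cartoon, illustration, painting, 3d render, digital art"

DALL-E doesn't take a negative prompt, so there you're stuck piling the photography terms into the prompt itself.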
There are fine-tuned models out there that can generate near-photorealistic results. The base SD models, and those offered by the major AI service sites, have a more stylized look to them, probably partly so they work across a wider array of prompts that may include non-photorealistic subjects, and partly for safety.
SD 1.5 still has this issue, particularly with specific subjects (e.g. the owl), any time you step significantly beyond the stock 512x512 resolution (e.g. out to 1024x512). SDXL, while more stable, can also suffer from it.
The trouble is really the "window" the model operates within.
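It's easy to see with diffusers if you have a GPU handy (model ID and prompt here are just examples, not from the article; results vary by seed):

    # Sketch of the "window" problem with SD 1.5: the model was trained around
    # 512x512, so doubling the width tends to tile the subject.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "photograph of a single owl perched on a branch"

    # Native resolution: usually one owl.
    pipe(prompt, height=512, width=512).images[0].save("owl_512.png")

    # Stepping well outside the training window: the model often repeats its
    # ~512px "view", so you frequently get two owls side by side.
    pipe(prompt, height=512, width=1024).images[0].save("owl_1024x512.png")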
The original paper [0] this article is based on raises a few questions for me. It compares the authors' new technique against Stable Diffusion but fails to specify which version of SD they're using for that comparison. It doesn't mention how example outputs were chosen (were they cherry-picked?). For non-square images, they seem to have specifically chosen resolutions that the other models weren't trained to output (e.g., 384 x 512) without also including ones that they were trained on (e.g., 896 x 1152). I wonder how this new technique would compare with all of that accounted for.
https://blog.novelai.net/novelai-improvements-on-stable-diff...