The aspect ratio problem was solved by NovelAI when they trained SD v1.4 on images with different aspect ratios using a technique they call "Aspect Ratio Bucketing", and after that it became commonly used in the final stage of training.
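The core idea is simple enough to sketch. Here's a minimal, illustrative version in Python; the 64-pixel step, the 512x768 pixel budget, and the dimension limits are my assumptions, not NovelAI's exact values (their post has the real details). You build a set of width/height buckets with a roughly constant pixel count, then assign each training image to the bucket with the closest aspect ratio:

    import math

    # Minimal sketch of aspect ratio bucketing. The step, pixel budget, and
    # dimension limits here are assumptions; NovelAI's actual values and
    # their resizing/cropping logic differ (see their blog post).
    def build_buckets(max_pixels=512 * 768, step=64, min_dim=256, max_dim=1024):
        buckets = set()
        for w in range(min_dim, max_dim + 1, step):
            # tallest height (a multiple of `step`) that stays within the pixel budget
            h = min(max_dim, (max_pixels // w) // step * step)
            if h >= min_dim:
                buckets.add((w, h))
                buckets.add((h, w))  # portrait counterpart
        return sorted(buckets)

    def assign_bucket(img_w, img_h, buckets):
        # choose the bucket whose aspect ratio is closest (in log space) to the image's
        target = math.log(img_w / img_h)
        return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

    buckets = build_buckets()
    print(assign_bucket(1920, 1080, buckets))  # a 16:9-ish bucket, (832, 448) with these defaults

Batches are then drawn from one bucket at a time, so every image in a batch shares the same resolution.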
Nothing here seems that impressive, and none of the ratios shown deviate much from what anything post-SDXL can already do.
Might have been impressed if some extreme letterbox or vertical-banner-style portrait had been shown, but everything here works fine in SDXL (and especially Flux), and the cat image doesn't even feature a press conference or journalists.
The two images shown in the article using the new method are sort of… stylized, or slightly cartoonish, in a way that the images generated without their method are not. Their images also have a “perfectly framed, looking straight at the camera” quality, which looks a little artificial. The images not using their method have a more natural look (although, obviously, they have the issue with the duplicated subject).
I wonder if that is an unavoidable result of their method, or just a minor issue (of course it is hard to get infinite compute as an academic; maybe they just need to train more. Is that a thing? I don’t AI).
Cartoonish output is a problem across the board. If you explicitly ask DALL-E for a "photograph" of something, you will very often get a result that looks like a cartoonified illustration. Prompt writers resort to specifying exact camera models and lenses to try to constrain the process.
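For what it's worth, the workaround usually looks something like this (the wording is illustrative only, not a guaranteed fix):

    # Illustrative only: the kind of camera-gear phrasing people bolt onto
    # prompts to steer diffusion models toward photographic output.
    prompt = (
        "photograph of a red fox in a meadow, "
        "shot on Canon EOS 5D Mark IV, 85mm f/1.8 lens, "
        "shallow depth of field, natural lighting"
    )

    # Models that accept a negative prompt (e.g. Stable Diffusion) also get
    # nudged away from illustration styles:
    negative_prompt = "cartoon, illustration, painting, 3d render, digital art"

DALL-E doesn't take a negative prompt, so there you're stuck piling the photography terms into the prompt itself.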
There are fine-tuned models out there that can generate near-photorealistic results. The base SD models, and those offered by the major AI service sites, have a more stylized look to them, probably partly so they work across a wider array of prompts that may include non-photorealistic subjects, and partly for safety.
SD 1.5 still has this issue, particularly with specific subjects (e.g. the owl), any time you step significantly beyond the stock 512x512 resolution (e.g. out to 1024x512). SDXL, while more stable, can also suffer from it.
The trouble is really the "window" the model operates within.
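It's easy to see with diffusers if you have a GPU handy (model ID and prompt here are just examples, not from the article; results vary by seed):

    # Sketch of the "window" problem with SD 1.5: the model was trained around
    # 512x512, so doubling the width tends to tile the subject.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "photograph of a single owl perched on a branch"

    # Native resolution: usually one owl.
    pipe(prompt, height=512, width=512).images[0].save("owl_512.png")

    # Stepping well outside the training window: the model often repeats its
    # ~512px "view", so you frequently get two owls side by side.
    pipe(prompt, height=512, width=1024).images[0].save("owl_1024x512.png")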
The original paper [0] this article is based on raises a few questions for me. It compares the authors' new technique against Stable Diffusion but fails to specify which version of SD they're using for that comparison. It doesn't mention how example outputs were chosen (were they cherry-picked?). For non-square images, they seem to have specifically chosen resolutions that the other models weren't trained to output (e.g., 384 x 512) without also including ones that they were trained on (e.g., 896 x 1152). I wonder how this new technique would compare with all of that accounted for.
https://blog.novelai.net/novelai-improvements-on-stable-diff...