Transformers output fixed-length sequences. For this transformer they chose 256 pixels per side, i.e. 32 "image tokens" per side, each of which decodes to an 8-by-8 pixel "patch".
You can technically increase or decrease this, or use a different aspect ratio, by using more or fewer image tokens, but the choice is fixed once you start training. A longer sequence also requires more "decodes" from the backbone VQGAN model (which converts between image tokens and pixels), so inference takes longer.
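To make the arithmetic concrete, here's a rough sketch of how resolution, patch size, and sequence length relate (the helper name and example numbers are mine, not pulled from the model's code):

```python
def image_token_count(width_px: int, height_px: int, patch_px: int = 8) -> int:
    """Number of image tokens the transformer has to produce for a given
    output resolution, assuming each token decodes to a patch_px x patch_px patch."""
    assert width_px % patch_px == 0 and height_px % patch_px == 0, \
        "resolution must be a multiple of the patch size"
    return (width_px // patch_px) * (height_px // patch_px)

print(image_token_count(256, 256))  # 32 x 32 = 1024 tokens per image
print(image_token_count(512, 512))  # doubling each side quadruples the sequence: 4096
print(image_token_count(512, 256))  # a different aspect ratio is just a different grid: 2048
```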
CLIP-guided VQGAN gets around this by averaging the CLIP score over multiple "cutouts" of the whole image, which allows a broad range of resolutions and aspect ratios.
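For anyone curious what the cutout trick looks like, here's a minimal sketch using PyTorch and the openai `clip` package. It's the general idea only, assuming a decoded image tensor in [0, 1]; the function name and hyperparameters are illustrative, not from any particular notebook:

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a watercolor painting of a fox"]).to(device))
    text_feat = F.normalize(text_feat, dim=-1)

def clip_score(image: torch.Tensor, n_cutouts: int = 32, cut_px: int = 224) -> torch.Tensor:
    """Average CLIP similarity over random square cutouts of `image`
    (a 1x3xHxW tensor in [0, 1]); H, W and aspect ratio are unconstrained."""
    _, _, h, w = image.shape
    cutouts = []
    for _ in range(n_cutouts):
        # random square crop, between half of and the full short side
        size = int(torch.randint(min(h, w) // 2, min(h, w) + 1, ()).item())
        top = int(torch.randint(0, h - size + 1, ()).item())
        left = int(torch.randint(0, w - size + 1, ()).item())
        crop = image[:, :, top:top + size, left:left + size]
        cutouts.append(F.interpolate(crop, size=cut_px, mode="bilinear",
                                     align_corners=False))
    batch = torch.cat(cutouts).to(device=device, dtype=text_feat.dtype)
    # (real guidance code would also apply CLIP's pixel normalization here)
    img_feat = F.normalize(model.encode_image(batch), dim=-1)
    return (img_feat @ text_feat.T).mean()  # one scalar: the averaged score

# e.g. a 640x384 decoded frame scores just as easily as a square one
score = clip_score(torch.rand(1, 3, 384, 640))
```

In a guided run you'd backpropagate this averaged score into the VQGAN latents each step; since every cutout gets resized to CLIP's 224x224 input, the underlying image can be whatever resolution or aspect ratio you like.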
It's already being scaled up to 256x256 from something smaller anyway. You could add an extra upscaler to go further, which I've tried with moderate success, but you're basically doing CSI-style "enhance" over and over.
If it depends on the hardware, what would the limit be if you rented the biggest machine available in the cloud?