What is the maximum resolution possible with this?

If it depends on the hardware, what would be the limit when one rents the biggest machine available in the cloud?



Fixed size of 256x256. It cannot go any bigger or smaller.


Out of curiosity: why can't it be changed? I know nothing about this field, so... thanks!


Transformers output fixed-length sequences. For this transformer they chose 256 pixels per side: a 32x32 grid of "image tokens", where each token decodes to an 8-by-8 pixel "patch" (32 patches x 8 pixels = 256 pixels per side).

You can technically increase or decrease this, or use a different aspect ratio by using more or fewer image tokens, but the sequence length is fixed once training starts. A larger grid also means the transformer has to generate more tokens and the backbone VQGAN (whose decoder is responsible for converting image tokens back into pixels) has to decode a larger grid, so inference takes longer.
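To make the cost concrete, here's a minimal sketch of the sequence-length arithmetic in Python. The 8-pixel patch size matches the figures above; the real value depends on the VQGAN's downsampling factor, and the function name is just mine:

    # Sequence-length arithmetic for a patch-based image transformer.
    # patch=8 matches the numbers above; other VQGANs downsample by 16.
    def image_token_count(width: int, height: int, patch: int = 8) -> int:
        """Number of image tokens the transformer must generate per image."""
        assert width % patch == 0 and height % patch == 0, \
            "resolution must be a multiple of the patch size"
        return (width // patch) * (height // patch)

    print(image_token_count(256, 256))  # 32 x 32 grid -> 1024 tokens
    print(image_token_count(512, 512))  # 4x the pixels -> 4096 tokens
    print(image_token_count(512, 256))  # wider aspect ratio -> 2048 tokens

Doubling the resolution quadruples the token count, and since the transformer generates tokens one at a time, generation time grows with it.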

CLIP-guided VQGAN can get around this by taking the average CLIP score over multiple "cutouts" of the whole image, which allows a broad range of resolutions and aspect ratios.
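For anyone curious what that looks like in code, here's a minimal sketch of the cutout-averaging idea using PyTorch and OpenAI's CLIP package. The cutout count, size range, and function name are illustrative (not what any particular notebook uses), and CLIP's usual input mean/std normalization is omitted for brevity:

    import torch
    import torch.nn.functional as F
    import clip  # https://github.com/openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)
    model = model.float()  # fp32 weights so plain fp32 image tensors work

    def cutout_clip_loss(image, prompt, n_cuts=32):
        # image: (1, 3, H, W) tensor in [0, 1] on `device`; H and W just
        # need to be >= 224, which frees the generator from one fixed size.
        _, _, h, w = image.shape
        cuts = []
        for _ in range(n_cuts):
            size = int(torch.randint(224, min(h, w) + 1, ()).item())
            top = int(torch.randint(0, h - size + 1, ()).item())
            left = int(torch.randint(0, w - size + 1, ()).item())
            crop = image[:, :, top:top + size, left:left + size]
            # CLIP's ViT-B/32 expects 224x224 inputs.
            cuts.append(F.interpolate(crop, size=224, mode="bilinear",
                                      align_corners=False))
        img_feats = F.normalize(model.encode_image(torch.cat(cuts)), dim=-1)
        txt_feats = F.normalize(
            model.encode_text(clip.tokenize(prompt).to(device)), dim=-1)
        # Average similarity over all cutouts; negate so lower is better.
        return -(img_feats @ txt_feats.T).mean()

In a VQGAN+CLIP loop this loss gets backpropagated into the VQGAN latents, so the output image can be whatever size you ask the decoder to produce.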


It's already being scaled up to 256x256 from something smaller anyway. You could add an extra upscaler to go further, which I've tried with moderate success, but you're basically doing CSI-style 'enhance' over and over.


Because that is how the network is trained. You could modify the network size and retrain to get different resolutions.



