Transformers output fixed-length sequences. For this transformer they chose 256 pixels per side, i.e. 32 "image tokens" per side, each of which decodes to an 8-by-8 pixel "patch".
You can technically increase or decrease this, or use a different aspect ratio, by using more or fewer image tokens, but the choice is fixed once you start training. A longer sequence also requires more "decodes" from the backbone VQGAN model (which converts between image tokens and pixels), so inference takes longer.
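To make the arithmetic concrete, here's a rough sketch of how resolution, patch size, and sequence length relate (the helper name and example numbers are mine, not pulled from the model's code):

```python
def image_token_count(width_px: int, height_px: int, patch_px: int = 8) -> int:
    """Number of image tokens the transformer has to produce for a given
    output resolution, assuming each token decodes to a patch_px x patch_px patch."""
    assert width_px % patch_px == 0 and height_px % patch_px == 0, \
        "resolution must be a multiple of the patch size"
    return (width_px // patch_px) * (height_px // patch_px)

print(image_token_count(256, 256))  # 32 x 32 = 1024 tokens per image
print(image_token_count(512, 512))  # doubling each side quadruples the sequence: 4096
print(image_token_count(512, 256))  # a different aspect ratio is just a different grid: 2048
```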
CLIP-guided VQGAN gets around this by averaging the CLIP score over multiple "cutouts" of the whole image, which allows a broad range of resolutions and aspect ratios.
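For anyone curious what the cutout trick looks like, here's a minimal sketch using PyTorch and the openai `clip` package. It's the general idea only, assuming a decoded image tensor in [0, 1]; the function name and hyperparameters are illustrative, not from any particular notebook:

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a watercolor painting of a fox"]).to(device))
    text_feat = F.normalize(text_feat, dim=-1)

def clip_score(image: torch.Tensor, n_cutouts: int = 32, cut_px: int = 224) -> torch.Tensor:
    """Average CLIP similarity over random square cutouts of `image`
    (a 1x3xHxW tensor in [0, 1]); H, W and aspect ratio are unconstrained."""
    _, _, h, w = image.shape
    cutouts = []
    for _ in range(n_cutouts):
        # random square crop, between half of and the full short side
        size = int(torch.randint(min(h, w) // 2, min(h, w) + 1, ()).item())
        top = int(torch.randint(0, h - size + 1, ()).item())
        left = int(torch.randint(0, w - size + 1, ()).item())
        crop = image[:, :, top:top + size, left:left + size]
        cutouts.append(F.interpolate(crop, size=cut_px, mode="bilinear",
                                     align_corners=False))
    batch = torch.cat(cutouts).to(device=device, dtype=text_feat.dtype)
    # (real guidance code would also apply CLIP's pixel normalization here)
    img_feat = F.normalize(model.encode_image(batch), dim=-1)
    return (img_feat @ text_feat.T).mean()  # one scalar: the averaged score

# e.g. a 640x384 decoded frame scores just as easily as a square one
score = clip_score(torch.rand(1, 3, 384, 640))
```

In a guided run you'd backpropagate this averaged score into the VQGAN latents each step; since every cutout gets resized to CLIP's 224x224 input, the underlying image can be whatever resolution or aspect ratio you like.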
It's already being scaled up to 256x256 from something smaller anyway. You could add an extra upscaler to go further, which I've tried with moderate success, but you're basically doing CSI-style "enhance" over and over.
If it depends on the hardware, what would the limit be if you rented the biggest machine available in the cloud?