I haven't read through everything yet, but in the examples video[1] they state that they read only a portion of the sensor to keep the frame rate high, and that the camera returns one row at a time. A square image would require reading more rows, so each frame would take longer and the frame rate would drop.
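
As a rough back-of-the-envelope sketch of that trade-off: if I assume readout time scales linearly with the number of rows and ignore exposure, blanking and transfer overhead, the frame rate is bounded by 1 / (rows * time per row). The row time and image sizes below are made-up numbers for illustration, not figures from the video.

```python
def max_frame_rate(rows: int, row_time_us: float) -> float:
    """Upper bound on frames per second when the sensor is read one row at a time."""
    frame_time_s = rows * row_time_us * 1e-6  # total readout time for one frame
    return 1.0 / frame_time_s


ROW_TIME_US = 10.0  # hypothetical per-row readout time, not from the video
WIDTH = 1024        # hypothetical image width in pixels

strip_fps = max_frame_rate(rows=64, row_time_us=ROW_TIME_US)      # short, wide strip
square_fps = max_frame_rate(rows=WIDTH, row_time_us=ROW_TIME_US)  # square image

print(f"64-row strip:  {strip_fps:.1f} fps")
print(f"{WIDTH}-row square: {square_fps:.1f} fps")
```

Under these assumptions the strip is 16x faster simply because it reads 16x fewer rows; the width doesn't enter into it at all.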
[1] https://youtu.be/-gMy8k4nHtw?t=154