That looks like something that can be solved by autoregressive models of today, ...

That looks like something that can be solved by autoregressive models of today, no architectural changes needed.

What you need is: good image understanding, at least GPT-5 tier, general purpose reasoning over images training, and then some domain-specific training, or at least some few-shot guidance to get it to adopt the correct reasoning patterns.

If I had to guess which model would be able to do it best out of the box, few-shot, I'd say Gemini 3 Pro.

There is nothing preventing an autoregressive LLM from revisiting images and rewriting the texts as new clues come in. This is how they can solve puzzles like sudoku.