It can definitely see color - I asked it to go to Bing and find the two most prominent colors in the Bing background image, and it did so just fine. It seems extremely lazy though; it prematurely reported most of the tasks I gave it as "completed" after the first or second step (usually navigating to the relevant website).
I believe the models are mostly capable of executing these tasks but, as you rightly indicated, they are 'lazy'. I think this 'laziness' is there to conserve resources as much as possible: given the current state of the AI market, the infrastructure is being heavily subsidized for the user. That perhaps incentivizes the model to produce an optimum result, one that satisfies the user while consuming the least amount of resources.
This is also why most 'vibe' coding projects fail: by default, the model is always going to give this optimum ('lazy') result.