Except if it's a messy div soup with various shitty absolute and relative pixel offsets, where the only way to know what refers to what is by rendering it and using gestalt principles.
It does, because it's hard to infer where each element will end up in the render. A checkbox may be set up in a shitty way such that the corresponding text label is not properly placed in the DOM, so it's hard to tell what the checkbox controls just based on the DOM tree. You have to take into account the styling and pixel placement, i.e. render it properly and look at it.
That's just one obvious example, but the principle holds more generally.
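To make the checkbox case concrete, here is a rough Python sketch. Everything in it is made up for illustration: the `Box` type, the coordinates, and both lookup heuristics are assumptions, not how any real browser or agent resolves labels. It only shows that "next label in document order" and "closest label after layout" can disagree.

```python
from dataclasses import dataclass
import math

@dataclass
class Box:
    tag: str   # element name
    text: str  # visible text, if any
    x: float   # hypothetical left offset after layout, in px
    y: float   # hypothetical top offset after layout, in px

# Hypothetical flattened DOM in document order. The positions stand in for
# what absolute/relative offsets might produce once the page is rendered.
dom = [
    Box("input", "", x=20, y=20),                          # the checkbox
    Box("label", "Subscribe to newsletter", x=40, y=100),  # next in the DOM
    Box("label", "Accept terms", x=40, y=20),              # next to it on screen
]

def label_after_in_dom(dom, i):
    """Guess the label from markup alone: first label after element i."""
    for box in dom[i + 1:]:
        if box.tag == "label":
            return box.text
    return None

def label_nearest_on_screen(dom, i):
    """Guess the label from layout: closest label by rendered position."""
    target = dom[i]
    labels = [b for b in dom if b.tag == "label"]
    nearest = min(labels, key=lambda b: math.hypot(b.x - target.x, b.y - target.y))
    return nearest.text

dom_guess = label_after_in_dom(dom, 0)          # "Subscribe to newsletter"
render_guess = label_nearest_on_screen(dom, 0)  # "Accept terms"
```

The two guesses differ, and only the second matches what a user would see next to the checkbox.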
Spatial continuity has nothing to do w/ how neural networks interpret an array of numbers. In fact, there is nothing about the topology of the input that is in any way relevant to what calculations are done by the network. You are imposing an anthropomorphic structure that does not exist anywhere in the algorithm & how it processes information. Here is an example to demonstrate my point: https://x.com/s_scardapane/status/1975500989299105981
It would have to implicitly render the HTML+CSS to know which two elements visually end up next to each other, if the markup is spaghetti and badly done.
It's not ridiculous if you understand how neural networks actually work. Your perception of the numbers has nothing to do w/ the logic of the arithmetic in the network.
The original comment I replied to said "You can navigate a website without visually decoding the image of a website." I replied that decoding is necessary to know where the elements will end up in a visual arrangement, because often that carries semantics. A label that is rendered next to another element can be crucial for understanding the functioning of the program. It's nontrivial to tell, just from the HTML or whatever tree structure, where each element will appear in 2D after rendering.
2D rendering is not necessary for processing information by neural networks. In fact, the image is flattened into a 1D array & loses the topological structure almost entirely b/c the topology is not relevant to the arithmetic performed by the network.
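A small Python sketch of the flattening claim, under a narrow assumption: a single fully connected layer with toy integer weights. The helper names `flatten` and `dense` are made up here, and the sketch says nothing about layers that are built around neighbourhoods, such as convolutions. It shows that if the pixels are shuffled and the weight columns are shuffled the same way, the layer's output is unchanged.

```python
import random

def flatten(image):
    """Row-major flattening of a 2D grid into a 1D list."""
    return [px for row in image for px in row]

def dense(weights, x):
    """One fully connected layer without bias: a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

random.seed(0)
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
x = flatten(image)

# Toy integer weights (hypothetical values) so the comparison below is exact.
weights = [[random.randint(-3, 3) for _ in x] for _ in range(2)]

# Shuffle the pixel order, and reorder each weight row the same way.
perm = list(range(len(x)))
random.shuffle(perm)
x_shuffled = [x[j] for j in perm]
weights_shuffled = [[row[j] for j in perm] for row in weights]

# Same products, summed in a different order: the output is identical.
assert dense(weights, x) == dense(weights_shuffled, x_shuffled)
```

For this kind of layer, which pixel was next to which in the original grid does not enter the arithmetic at all.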
I'm talking about HTML (or other markup, in the form of text) vs image. My point is that markup given simply as text tokens will be much harder to interpret, since it's not clear where the elements will end up. I guess I can't make this any more clear.
The guy you are talking to is either an utter moron, severely autistic, or for some weird reason he is trolling (it is a fresh account). I applaud you for trying to be kind and explain things to him; I personally would not have the patience.
I'm not angry, I'm disappointed. People are going out of their way to help you understand a topic, and the best you can do is be patronising? It means you are overconfident, ignorant, rude, slow to learn, disrespectful and, I think, ungrateful.
If you read through the thread, to me it's apparent.
Even if you make burner accounts, this behaviour doesn't help one grow.