Why do you think we have fully self driving cars instead of just more simplistic beacon systems? Why doesn't McDonald's have a fully automated kitchen?
New technology is slow due to risk aversion, it's very rare for people to just tear up what they already have to re-implement new technology from the ground up. We always have to shoe-horn new technology into old systems to prove it first.
There are just so many factors that get solved by working with what already exists.
About your self-driving car point, I feel like the approach I'm seeing is akin to designing a humanoid robot that uses its robotic feet to control the brake and accelerator pedals, and its hand to move the gear selector.
Open Pilot (https://comma.ai/openpilot) connects to your cars brain and sends acceleration, turning, etc signals to drive the car for you.
Both Open Pilot and Tesla FSD use regular cameras (ie. eyes) to try and understand the environment just as a human would. That is where my analogy is coming from.
I could say the same about using a humanoid robot to log on to your computer and open chrome. My point is also that we made no changes to the road network to enable FSD.
Yeah, that would be pretty good honestly. It could immediately upgrade every car ever made to self driving and then it could also do your laundry without buying a new washing machine and everything else. It's just hard to do. But it will happen.
Yes, it sounds very cool and sci-fi, but having a humanoid control the car seems less safe than having the spinning cameras and other sensors that are missing from older cars or those that weren't specifically built to be self-driving. I suppose this is why even human drivers are assisted by automatic emergency braking.
I am more leaning into the idea that an efficient self-driving car wouldn't even need to have a steering wheel, pedals, or thin pillars to help the passengers see the outside environment or be seen by pedestrians.
The way this ties back to the computer use models is that a lot of webpages have stuff designed for humans would make it difficult for a model to navigate them well. I think this was the goal of the "semantic web".
Every website and application is just layers of data. Playwright and similar tools have options for taking Snapshots that contain data like text, forms, buttons, etc that can be interacted with on a site. All the calls a website makes are just APIs. Even a native application is made up of WinForms that can be inspected.
Ah, so now you're turning LLMs into web browsers capable of parsing Javascript to figure out what a human might be looking at, let's see how many levels deep we can go.
Just inspect the memory content of the process. It's all just numbers at the end of the day & algorithms do not have any understanding of what the numbers mean other than generating other numbers in response to the input numbers. For the record I agree w/ OP, screenshots are not a good interface for the same reasons that trains, subways, & dedicates lanes for mass transit are obviously superior to cars & their associated attendant headaches.
Maybe some day, sure. We may eventually live in a utopia where everyone has quick, efficient, accessible mass transit available that allows them to move between any two points on the globe with unfettered grace.
That'd be neat.
But for now: The web exists, and is universal. We have programs that can render websites to an image in memory (solved for ~30 years), and other programs that can parse images of fully-rendered websites (solved for at least a few years), along with bots that can click on links (solved much more recently).
Point was process memory is the source of truth, everything else is derived & only throws away information that a neural network can use to make better decisions. Presentation of data is irrelevant to a neural network, it's all just numbers & arithmetic at the end of the day.
We can build tons of infrastructure for cars that didn't exist before but can't for other things anymore? Seems like society is just becoming lethargic.
Of course it is, everything is impractical except autogenerating mouse clicks on a browser. Anyone else starting to get late stage cryptocurrency vibes before the crash?
Actually making self driving cars is not so impractical -- insanely expensive and resource heavy and difficult, yes, but the payoffs are so large that it's not impractical.
In my country there's a multi-airline API for booking plane tickets, but the cheapest of economy carriers only accept bookings directly on their websites.
If you want to make something that can book every airline? Better be able to navigate a website.
Except if its a messy div soup with various shitty absolute and relative pixel offsets where the only way to know what refers to what is by rendering it and using gestalt principles.
It does, because it's hard to infer where each element will end up in the render. So a checkbox may be set up in a shitty way such that the corresponding text label is not properly placed in the DOM, so it's hard to tell what the checkbox controls just based on the DOM tree. You have to take into account the styling and placement pixel stuff, ie render it properly and look at it.
That's just one obvious example, but the principle holds more generally.
Spatial continuity has nothing to do w/ how neural networks interpret an array of numbers. In fact, there is nothing about the topology of the input that is any way relevant to what calculations are done by the network. You are imposing an anthropomorphic structure that does not exist anywhere in the algorithm & how it processes information. Here is an example to demonstrate my point: https://x.com/s_scardapane/status/1975500989299105981
It would have to implicitly render the HTML+CSS to know which two elements visually end up next to each other, if the markup is spaghetti and badly done.
It's not ridiculous if you understand how neural networks actually work. Your perception of the numbers has nothing to do w/ the logic of the arithmetic in the network.
The original comment I replied to said "You can navigate a website without visually decoding the image of a website." I replied that decoding is necessary to know where the elements will end up in a visual arrangement, because often that carries semantics. A label that is rendered next to another element can be crucial for understanding the functioning of the program. It's nontrivial just from the HTML or whatever tree structure where each element will appear in 2D after rendering.
2D rendering is not necessary for processing information by neural networks. In fact, the image is flattened into 1D array & loses the topological structure almost entirely b/c the topology is not relevant to the arithmetic performed by the network.
I'm talking about HTML (or other markup, in the form of text) vs image. That simply getting the markup as text tokens will be much harder to interpret since it's not clear where the elements will end up. I guess I can't make this any more clear.
The guy you are talking to is either an utter moron, severely autistic, or for some weird reason he is trolling ( it is a fresh account. I applaud you for trying to be kind and explain things to him, I personally would not have the patience.
Im not angry, Im disappointed. People are going out of their way to help you understand a topic, and the best you can do is be patronising? It means you are over confident, ignorant, rude, slow to learn, disrespectful and I think ungrateful.
If you read through the thread - to me its apparent.
Even if you make burner accounts, this behaviour doesn't help one grow.
It's on the brand of stuff that works. Expert systems and formal symbolic if-else, rules based reasoning was tried, it failed. Real life is messy and fat-tailed.
Yes, and here they also operate deterministic GUI tools. Thing is, many GUI programs are not designed so well. Their best interface and the only interface they were tested and designed for is the visual one.
What you say is 100% true until it’s not. It seems like a weird thing to say (what I’m saying), but please consider we’re in a time period where everything we say is true, minute by minute, and no more. It could be the next version of this just works, and works really well.