
It doesn't seem like it makes sense to train AI around human user interfaces, which aren't really efficient. It's like building a mechanical horse.


Why do you think we have fully self-driving cars instead of just simpler beacon systems? Why doesn't McDonald's have a fully automated kitchen?

Adoption of new technology is slow due to risk aversion; it's very rare for people to just tear up what they already have and re-implement it from the ground up. We always have to shoehorn new technology into old systems to prove it first.

There are just so many factors that get solved by working with what already exists.


About your self-driving car point, I feel like the approach I'm seeing is akin to designing a humanoid robot that uses its robotic feet to control the brake and accelerator pedals, and its hand to move the gear selector.


Open Pilot (https://comma.ai/openpilot) connects to your car's brain and sends acceleration, turning, etc. signals to drive the car for you.

Both Open Pilot and Tesla FSD use regular cameras (i.e. eyes) to try to understand the environment just as a human would. That is where my analogy is coming from.

I could say the same about using a humanoid robot to log on to your computer and open Chrome. My point is also that we made no changes to the road network to enable FSD.


Yeah, that would be pretty good honestly. It could immediately upgrade every car ever made to self-driving, and then it could also do your laundry without buying a new washing machine, and everything else. It's just hard to do. But it will happen.


Yes, it sounds very cool and sci-fi, but having a humanoid control the car seems less safe than having the spinning cameras and other sensors that are missing from older cars or those that weren't specifically built to be self-driving. I suppose this is why even human drivers are assisted by automatic emergency braking.

I'm leaning more into the idea that an efficient self-driving car wouldn't even need to have a steering wheel, pedals, or thin pillars to help the passengers see the outside environment or be seen by pedestrians.

The way this ties back to the computer-use models is that a lot of webpages have stuff designed for humans that makes it difficult for a model to navigate them well. I think this was the goal of the "semantic web".


> I'm leaning more into the idea that an efficient self-driving car wouldn't even need to have a steering wheel, pedals

We always make our way back to trains


By the time it happens, you and I will probably be in the ground.


I could add self-driving to my existing fleet? Sounds intriguing.


> Why do you think we have fully self-driving cars instead of just simpler beacon systems?

While the self-driving car industry aims to replace all humans with machines, I don't think this is the case with browser automation.

I see this technology as more similar to a crash dummy than a self-driving system. It's designed to simulate a human in very niche scenarios.


Right, let's make APIs for everything...

[Looks around and sees people not making APIs for everything]

Well that didn't work.


Every website and application is just layers of data. Playwright and similar tools have options for taking snapshots that contain data like text, forms, and buttons that can be interacted with on a site. All the calls a website makes are just APIs. Even a native application is built from controls (e.g. WinForms) that can be inspected.
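
A minimal sketch of that idea, using Playwright's accessibility snapshot from Python (an older API, but it illustrates the point; the URL is just a placeholder): it returns roles, names, and values as structured data instead of pixels.

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        # Accessibility-tree snapshot: text, buttons, and form fields as a
        # nested dict of roles/names rather than a rendered screenshot.
        print(page.accessibility.snapshot())
        browser.close()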


Ah, so now you're turning LLMs into web browsers capable of parsing JavaScript to figure out what a human might be looking at. Let's see how many levels deep we can go.


Just inspect the memory content of the process. It's all just numbers at the end of the day & algorithms do not have any understanding of what the numbers mean other than generating other numbers in response to the input numbers. For the record I agree w/ OP, screenshots are not a good interface for the same reasons that trains, subways, & dedicated lanes for mass transit are obviously superior to cars & their attendant headaches.


Maybe some day, sure. We may eventually live in a utopia where everyone has quick, efficient, accessible mass transit available that allows them to move between any two points on the globe with unfettered grace.

That'd be neat.

But for now: The web exists, and is universal. We have programs that can render websites to an image in memory (solved for ~30 years), and other programs that can parse images of fully-rendered websites (solved for at least a few years), along with bots that can click on links (solved much more recently).
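
The render-to-image step, for instance, is a few lines with any headless browser today (a sketch with Playwright; any headless Chromium wrapper would do, and the URL is a placeholder):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://example.com")
        png_bytes = page.screenshot(full_page=True)  # the fully rendered page, in memory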

Maybe tomorrow will be different.


Point was, process memory is the source of truth; everything else is derived, & derivation only throws away information that a neural network could use to make better decisions. Presentation of data is irrelevant to a neural network, it's all just numbers & arithmetic at the end of the day.


This is just like the comments suggesting we need sensors and signs specifically for self-driving cars for them to work.

It'll never happen, so companies need to deal with the reality we have.


We can build tons of infrastructure for cars that didn't exist before but can't for other things anymore? Seems like society is just becoming lethargic.


No, it's just hilariously impractical if you bother to think about it for more than five seconds.


Of course it is, everything is impractical except autogenerating mouse clicks on a browser. Anyone else starting to get late stage cryptocurrency vibes before the crash?


Actually making self-driving cars is not so impractical -- insanely expensive and resource-heavy and difficult, yes, but the payoffs are so large that it's not impractical.


In my country there's a multi-airline API for booking plane tickets, but the cheapest economy carriers only accept bookings directly on their websites.

If you want to make something that can book every airline? Better be able to navigate a website.


You can navigate a website without visually decoding the image of a website.


Except if it's a messy div soup with various shitty absolute and relative pixel offsets, where the only way to know what refers to what is by rendering it and using Gestalt principles.


None of that matters to neural networks.


It does, because it's hard to infer where each element will end up in the render. A checkbox may be set up in a shitty way such that the corresponding text label is not properly placed in the DOM, so it's hard to tell what the checkbox controls just from the DOM tree. You have to take into account the styling and pixel placement, i.e. render it properly and look at it.

That's just one obvious example, but the principle holds more generally.
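
To make the checkbox example concrete, here is a hedged sketch (Python + Playwright; the URL and selectors are hypothetical) of recovering the visual neighbour of a checkbox when the DOM order doesn't help: ask the layout engine for rendered bounding boxes and match by proximity.

    from playwright.sync_api import sync_playwright

    def center(box):
        return (box["x"] + box["width"] / 2, box["y"] + box["height"] / 2)

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://example.com/form")  # hypothetical form page
        cx, cy = center(page.locator("input[type=checkbox]").first.bounding_box())
        # Find the candidate label whose *rendered* box lies nearest the
        # checkbox, regardless of where it sits in the DOM tree.
        candidates = []
        for el in page.locator("label, span, div").all():
            box = el.bounding_box()  # None for elements that aren't rendered
            if box:
                x, y = center(box)
                candidates.append(((x - cx) ** 2 + (y - cy) ** 2, el.inner_text()))
        print(min(candidates, key=lambda c: c[0])[1])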


Spatial continuity has nothing to do w/ how neural networks interpret an array of numbers. In fact, there is nothing about the topology of the input that is in any way relevant to what calculations are done by the network. You are imposing an anthropomorphic structure that does not exist anywhere in the algorithm & how it processes information. Here is an example to demonstrate my point: https://x.com/s_scardapane/status/1975500989299105981
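
The claim is checkable in a few lines: self-attention without positional encodings is permutation-equivariant, so shuffling the "patches" merely shuffles the outputs. A minimal PyTorch sketch (stock torch.nn, nothing else assumed):

    import torch

    torch.manual_seed(0)
    attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
    x = torch.randn(1, 9, 16)   # 9 "patches", no positional encoding
    perm = torch.randperm(9)

    out, _ = attn(x, x, x)
    out_p, _ = attn(x[:, perm], x[:, perm], x[:, perm])

    # Outputs match up to the same permutation (floating-point tolerance).
    print(torch.allclose(out[:, perm], out_p, atol=1e-5))  # True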


It would have to implicitly render the HTML+CSS to know which two elements visually end up next to each other, if the markup is spaghetti and badly done.


The linked post demonstrates arbitrary re-ordering of image patches. Spatial continuity is not relevant to neural networks.


That's ridiculous, sorry. If that were so, we wouldn't have positional encodings in vision transformers.
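
Which is the standard counterpoint: a ViT-style model re-injects layout by adding a positional embedding to each patch embedding before attention ever runs. A generic sketch of that setup (illustrative shapes, not any particular library's internals):

    import torch

    num_patches, dim = 196, 768                      # 14x14 patches of a 224px image
    patch_embed = torch.nn.Linear(16 * 16 * 3, dim)  # flatten each 16x16 RGB patch
    pos_embed = torch.nn.Parameter(torch.zeros(1, num_patches, dim))  # learned positions

    patches = torch.randn(1, num_patches, 16 * 16 * 3)
    tokens = patch_embed(patches) + pos_embed        # position now matters downstream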


It's not ridiculous if you understand how neural networks actually work. Your perception of the numbers has nothing to do w/ the logic of the arithmetic in the network.


Do you know what "positional encoding" means?


Completely irrelevant to the point being made.


Why are you talking about image processing? The guy you’re talking to isn't.


What do you suppose "render" means?


The original comment I replied to said "You can navigate a website without visually decoding the image of a website." I replied that decoding is necessary to know where the elements will end up in a visual arrangement, because that often carries semantics. A label that is rendered next to another element can be crucial for understanding the functioning of the program. It's nontrivial to tell, just from the HTML or whatever tree structure, where each element will appear in 2D after rendering.


2D rendering is not necessary for processing information by neural networks. In fact, the image is flattened into a 1D array & loses the topological structure almost entirely b/c the topology is not relevant to the arithmetic performed by the network.


I'm talking about HTML (or other markup, in the form of text) vs. an image: simply getting the markup as text tokens will be much harder to interpret, since it's not clear where the elements will end up. I guess I can't make this any more clear.


The guy you are talking to is either an utter moron, severely autistic, or for some weird reason he is trolling (it is a fresh account). I applaud you for trying to be kind and explain things to him; I personally would not have the patience.


Calm down gramps, it's not good for the heart to be angry all the time.


I'm not angry, I'm disappointed. People are going out of their way to help you understand a topic, and the best you can do is be patronising? It means you are overconfident, ignorant, rude, slow to learn, disrespectful, and I think ungrateful.

If you read through the thread, to me it's apparent.

Even if you make burner accounts, this behaviour doesn't help one grow.

But hey, gramps could be wrong, eh?


Relax buddy, it's not that serious.


Yeah, you are right, sorry bro, I didn't have enough coffee yesterday. My bad.


No one is perfect.


Reminds me of WALL-E where there is a keypad with a robot finger to press buttons on it.


We're training natural language models to reason by emulating reasoning in natural language, so it's very on brand.


It's on the brand of stuff that works. Expert systems and formal symbolic, rules-based if-else reasoning were tried; they failed. Real life is messy and fat-tailed.


And yet we give agents deterministic tools to use rather than tell them to compute everything in-model!
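
A tool call is just the model choosing a name and arguments while ordinary code does the deterministic part; a generic sketch (the names are illustrative, not any framework's API):

    # The model emits something like {"name": "add", "args": {"a": 2.0, "b": 3.0}};
    # the runtime executes it deterministically instead of the model
    # "computing" the sum in its weights.
    def add(a: float, b: float) -> float:
        return a + b

    tools = {"add": add}

    def dispatch(call: dict):
        return tools[call["name"]](**call["args"])

    print(dispatch({"name": "add", "args": {"a": 2.0, "b": 3.0}}))  # 5.0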


Yes, and here they also operate deterministic GUI tools. Thing is, many GUI programs are not designed so well. Their best interface and the only interface they were tested and designed for is the visual one.


What you say is 100% true until it's not. It seems like a weird thing to say (what I'm saying), but please consider we're in a time period where everything we say is true minute by minute, and no more. It could be that the next version of this just works, and works really well.


It's not about efficiency but access. Many services do not provide programmatic access.


If we could build mechanical horses, they would be absolutely amazing!



