doesn't seem like it makes sense to train AI around human user interfaces which ...

jklinger410 · 2025-10-07T20:56:34 1759870594

Why do you think we have fully self driving cars instead of just more simplistic beacon systems? Why doesn't McDonald's have a fully automated kitchen?

New technology is slow due to risk aversion; it's very rare for people to just tear up what they already have to re-implement new technology from the ground up. We always have to shoe-horn new technology into old systems to prove it first.

There are just so many factors that get solved by working with what already exists.

layman51 · 2025-10-07T21:14:11 1759871651

About your self-driving car point, I feel like the approach I'm seeing is akin to designing a humanoid robot that uses its robotic feet to control the brake and accelerator pedals, and its hand to move the gear selector.

jklinger410 · 2025-10-08T14:29:08 1759933748

Open Pilot (https://comma.ai/openpilot) connects to your car's brain and sends acceleration, turning, etc. signals to drive the car for you.

Both Open Pilot and Tesla FSD use regular cameras (i.e., eyes) to try and understand the environment just as a human would. That is where my analogy is coming from.

I could say the same about using a humanoid robot to log on to your computer and open Chrome. My point is also that we made no changes to the road network to enable FSD.

bonoboTP · 2025-10-07T21:16:27 1759871787

Yeah, that would be pretty good honestly. It could immediately upgrade every car ever made to self driving and then it could also do your laundry without buying a new washing machine and everything else. It's just hard to do. But it will happen.

layman51 · 2025-10-07T21:54:46 1759874086

Yes, it sounds very cool and sci-fi, but having a humanoid control the car seems less safe than having the spinning cameras and other sensors that are missing from older cars or those that weren't specifically built to be self-driving. I suppose this is why even human drivers are assisted by automatic emergency braking.

I am more leaning into the idea that an efficient self-driving car wouldn't even need to have a steering wheel, pedals, or thin pillars to help the passengers see the outside environment or be seen by pedestrians.

The way this ties back to the computer use models is that a lot of webpages have stuff designed for humans that would make it difficult for a model to navigate them well. I think fixing this was the goal of the "semantic web".

jklinger410 · 2025-10-08T14:26:30 1759933590

> I am more leaning into the idea that an efficient self-driving car wouldn't even need to have a steering wheel, pedals

We always make our way back to trains

viking123 · 2025-10-08T06:17:53 1759904273

By the time it happens, you and I will probably be under the ground.

iAMkenough · 2025-10-07T21:35:13 1759872913

I could add self-driving to my existing fleet? Sounds intriguing.

alganet · 2025-10-07T23:38:08 1759880288

> Why do you think we have fully self driving cars instead of just more simplistic beacon systems?

While the self-driving car industry aims to replace all humans with machines, I don't think this is the case with browser automation.

I see this technology as more similar to a crash dummy than a self-driving system. It's designed to simulate a human in very niche scenarios.

pixl97 · 2025-10-07T20:47:44 1759870064

Right, let's make APIs for everything...

[Looks around and sees people not making APIs for everything]

Well that didn't work.

odie5533 · 2025-10-07T21:06:30 1759871190

Every website and application is just layers of data. Playwright and similar tools have options for taking snapshots that contain data like text, forms, buttons, etc. that can be interacted with on a site. All the calls a website makes are just APIs. Even a native application is made up of WinForms that can be inspected.
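
A minimal sketch of the "layers of data" idea above, assuming plain static HTML and using only Python's standard-library `html.parser` rather than Playwright itself. The `InteractiveElements` class, the `INTERACTIVE_TAGS` set, and the sample page are hypothetical names made up for illustration; a real page with JavaScript-generated content would need an actual browser engine.

```python
from html.parser import HTMLParser

# Tags treated as "interactive" here; this set is an illustrative assumption.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElements(HTMLParser):
    """Collects (tag, attributes) for every interactive element seen."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            self.found.append((tag, dict(attrs)))


def list_interactive(html):
    """Parse the markup once and return the interactive elements in order."""
    parser = InteractiveElements()
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical sample page, not taken from any real site.
SAMPLE_PAGE = """
<form action="/search">
  <input type="text" name="q">
  <button type="submit">Search</button>
</form>
<a href="/about">About</a>
"""

elements = list_interactive(SAMPLE_PAGE)
for tag, attrs in elements:
    print(tag, attrs)
```

This only recovers what the markup declares. It says nothing about where each element ends up on screen.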

pixl97 · 2025-10-07T21:24:47 1759872287

Ah, so now you're turning LLMs into web browsers capable of parsing JavaScript to figure out what a human might be looking at. Let's see how many levels deep we can go.

measurablefunc · 2025-10-07T21:36:24 1759872984

Just inspect the memory content of the process. It's all just numbers at the end of the day & algorithms do not have any understanding of what the numbers mean other than generating other numbers in response to the input numbers. For the record I agree w/ OP, screenshots are not a good interface for the same reasons that trains, subways, & dedicated lanes for mass transit are obviously superior to cars & their associated attendant headaches.

ssl-3 · 2025-10-07T22:14:10 1759875250

Maybe some day, sure. We may eventually live in a utopia where everyone has quick, efficient, accessible mass transit available that allows them to move between any two points on the globe with unfettered grace.

That'd be neat.

But for now: The web exists, and is universal. We have programs that can render websites to an image in memory (solved for ~30 years), and other programs that can parse images of fully-rendered websites (solved for at least a few years), along with bots that can click on links (solved much more recently).

Maybe tomorrow will be different.

measurablefunc · 2025-10-07T22:22:39 1759875759

Point was process memory is the source of truth, everything else is derived & only throws away information that a neural network can use to make better decisions. Presentation of data is irrelevant to a neural network, it's all just numbers & arithmetic at the end of the day.

TulliusCicero · 2025-10-07T20:56:06 1759870566

This is just like the comments suggesting we need sensors and signs specifically for self-driving cars for them to work.

It'll never happen, so companies need to deal with the reality we have.

password54321 · 2025-10-07T22:25:07 1759875907

We can build tons of infrastructure for cars that didn't exist before but can't for other things anymore? Seems like society is just becoming lethargic.

TulliusCicero · 2025-10-08T04:21:31 1759897291

No, it's just hilariously impractical if you bother to think about it for more than five seconds.

password54321 · 2025-10-08T11:04:04 1759921444

Of course it is, everything is impractical except autogenerating mouse clicks on a browser. Anyone else starting to get late stage cryptocurrency vibes before the crash?

TulliusCicero · 2025-10-08T17:31:13 1759944673

Actually making self driving cars is not so impractical -- insanely expensive and resource heavy and difficult, yes, but the payoffs are so large that it's not impractical.

michaelt · 2025-10-07T20:53:02 1759870382

In my country there's a multi-airline API for booking plane tickets, but the cheapest of economy carriers only accept bookings directly on their websites.

If you want to make something that can book every airline? Better be able to navigate a website.

odie5533 · 2025-10-07T21:07:05 1759871225

You can navigate a website without visually decoding the image of a website.

bonoboTP · 2025-10-07T21:18:20 1759871900

Except if it's a messy div soup with various shitty absolute and relative pixel offsets where the only way to know what refers to what is by rendering it and using gestalt principles.

measurablefunc · 2025-10-07T21:53:01 1759873981

None of that matters to neural networks.

bonoboTP · 2025-10-07T22:00:58 1759874458

It does, because it's hard to infer where each element will end up in the render. So a checkbox may be set up in a shitty way such that the corresponding text label is not properly placed in the DOM, so it's hard to tell what the checkbox controls just based on the DOM tree. You have to take into account the styling and placement pixel stuff, i.e., render it properly and look at it.

That's just one obvious example, but the principle holds more generally.
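
To make the checkbox example concrete, here is a minimal sketch, assuming a label lookup that relies only on `<label for="...">` links in the markup. The `LabelLinker` class and the sample page are hypothetical, and the inline pixel offsets are invented for illustration. The first checkbox is labelled properly; the second has its text in an unrelated `div` that only lines up with it after layout.

```python
from html.parser import HTMLParser


class LabelLinker(HTMLParser):
    """Maps checkbox ids to label text using only <label for="..."> links."""

    def __init__(self):
        super().__init__()
        self.checkbox_ids = []
        self.labels = {}          # id -> label text
        self._current_for = None  # 'for' target of the <label> being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "checkbox":
            self.checkbox_ids.append(attrs.get("id"))
        elif tag == "label":
            self._current_for = attrs.get("for")

    def handle_data(self, data):
        # Only text inside a <label for="..."> counts as a label.
        if self._current_for is not None and data.strip():
            self.labels[self._current_for] = data.strip()

    def handle_endtag(self, tag):
        if tag == "label":
            self._current_for = None


def labels_from_dom(html):
    """Return {checkbox id: label text or None} from markup alone."""
    linker = LabelLinker()
    linker.feed(html)
    linker.close()
    return {cid: linker.labels.get(cid) for cid in linker.checkbox_ids}


# Hypothetical page: the second checkbox and its text are tied together
# only by their absolute positions, not by the tree structure.
PAGE = """
<label for="news">Send me the newsletter</label>
<input type="checkbox" id="news">
<div style="position:absolute; left:40px; top:120px">Share my data</div>
<div><div><input type="checkbox" id="share" style="position:absolute; left:10px; top:120px"></div></div>
"""

result = labels_from_dom(PAGE)
print(result)
```

The lookup finds the text for `news` but returns `None` for `share`. Recovering that association would mean computing where the elements land, which is the rendering step described above.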

measurablefunc · 2025-10-07T22:03:53 1759874633

Spatial continuity has nothing to do w/ how neural networks interpret an array of numbers. In fact, there is nothing about the topology of the input that is in any way relevant to what calculations are done by the network. You are imposing an anthropomorphic structure that does not exist anywhere in the algorithm & how it processes information. Here is an example to demonstrate my point: https://x.com/s_scardapane/status/1975500989299105981

bonoboTP · 2025-10-07T22:26:04 1759875964

It would have to implicitly render the HTML+CSS to know which two elements visually end up next to each other, if the markup is spaghetti and badly done.

measurablefunc · 2025-10-07T22:37:54 1759876674

The linked post demonstrates arbitrary re-ordering of image patches. Spatial continuity is not relevant to neural networks.

bonoboTP · 2025-10-07T22:49:38 1759877378

That's ridiculous, sorry. If that were so, we wouldn't have positional encodings in vision transformers.
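
A toy sketch of what is at stake in the positional-encoding point, assuming a single self-attention layer with identity projections followed by mean pooling, written in pure Python. The patch vectors and position vectors are made-up numbers, and this is not any particular vision transformer. Without position information the pooled output does not change when the patches are shuffled; once a per-slot position vector is added, it does.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(tokens):
    """Single-head self-attention (identity projections), then mean-pool."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        weights = softmax([dot(query, key) for key in tokens])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return [sum(o[d] for o in outputs) / len(outputs) for d in range(dim)]


def add_positions(tokens, positions):
    """Add the position vector of slot i to whichever token sits in slot i."""
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, positions)]


def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for x, y in zip(a, b))


# Made-up "patch" features and position vectors, for illustration only.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.3]]
shuffled = [patches[2], patches[0], patches[3], patches[1]]
positions = [[0.0, 0.0], [0.1, -0.2], [0.4, 0.3], [-0.3, 0.6]]

same_without = close(attention_pool(patches), attention_pool(shuffled))
same_with = close(
    attention_pool(add_positions(patches, positions)),
    attention_pool(add_positions(shuffled, positions)),
)
print(same_without, same_with)
```

In this toy setup the layer is order-insensitive only as long as nothing tells it where each patch came from.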

measurablefunc · 2025-10-07T22:58:16 1759877896

It's not ridiculous if you understand how neural networks actually work. Your perception of the numbers has nothing to do w/ the logic of the arithmetic in the network.

bonoboTP · 2025-10-07T23:19:12 1759879152

Do you know what "positional encoding" means?

measurablefunc · 2025-10-07T23:21:58 1759879318

Completely irrelevant to the point being made.

ionwake · 2025-10-07T22:29:35 1759876175

Why are you talking about image processing? The guy you’re talking to isn’t.

measurablefunc · 2025-10-07T22:34:43 1759876483

What do you suppose "render" means?

bonoboTP · 2025-10-07T22:48:18 1759877298

The original comment I replied to said "You can navigate a website without visually decoding the image of a website." I replied that decoding is necessary to know where the elements will end up in a visual arrangement, because often that carries semantics. A label that is rendered next to another element can be crucial for understanding the functioning of the program. It's nontrivial just from the HTML or whatever tree structure where each element will appear in 2D after rendering.

measurablefunc · 2025-10-07T23:00:38 1759878038

2D rendering is not necessary for processing information by neural networks. In fact, the image is flattened into a 1D array & loses the topological structure almost entirely b/c the topology is not relevant to the arithmetic performed by the network.
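
For reference, a small sketch of what row-major flattening does to neighbourhood structure, assuming a plain list-of-lists grid. The `flatten` and `to_2d_index` helpers are hypothetical names. Horizontal neighbours stay adjacent in the 1D array, vertical neighbours end up one row-width apart, and the 2D coordinate can be recovered only if the width is known.

```python
def flatten(grid):
    """Row-major flattening of a 2D grid into a 1D list."""
    return [value for row in grid for value in row]


def to_2d_index(flat_index, width):
    """Recover (row, col) from a flat index, given the original width."""
    return divmod(flat_index, width)


grid = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
width = len(grid[0])
flat = flatten(grid)

# Distance in the flat array between a cell and its right / lower neighbour.
right_gap = flat.index(grid[1][2]) - flat.index(grid[1][1])
down_gap = flat.index(grid[2][1]) - flat.index(grid[1][1])
print(right_gap, down_gap, to_2d_index(flat.index(grid[2][1]), width))
```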

bonoboTP · 2025-10-07T23:39:11 1759880351

I'm talking about HTML (or other markup, in the form of text) vs image. Simply getting the markup as text tokens makes it much harder to interpret, since it's not clear where the elements will end up. I guess I can't make this any more clear.

ionwake · 2025-10-08T09:23:21 1759915401

The guy you are talking to is either an utter moron, severely autistic, or for some weird reason he is trolling (it is a fresh account). I applaud you for trying to be kind and explain things to him, I personally would not have the patience.

measurablefunc · 2025-10-08T18:29:10 1759948150

Calm down gramps, it's not good for the heart to be angry all the time.

ionwake · 2025-10-09T12:07:20 1760011640

I'm not angry, I'm disappointed. People are going out of their way to help you understand a topic, and the best you can do is be patronising? It means you are overconfident, ignorant, rude, slow to learn, disrespectful and I think ungrateful.

If you read through the thread, to me it's apparent.

Even if you make burner accounts, this behaviour doesn't help one grow.

But hey gramps could be wrong eh

measurablefunc · 2025-10-09T22:54:49 1760050489

Relax buddy, it's not that serious.

ionwake · 2025-10-10T10:05:32 1760090732

Yeah you are right, sorry bro, I didn't have enough coffee yesterday. My bad.

measurablefunc · 2025-10-10T16:26:31 1760113591

No one is perfect.

aidenn0 · 2025-10-07T22:33:37 1759876417

Reminds me of WALL-E where there is a keypad with a robot finger to press buttons on it.

CuriouslyC · 2025-10-07T20:50:11 1759870211

We're training natural language models to reason by emulating reasoning in natural language, so it's very on brand.

bonoboTP · 2025-10-07T21:19:13 1759871953

It's on the brand of stuff that works. Expert systems and formal symbolic if-else, rule-based reasoning were tried and failed. Real life is messy and fat-tailed.

CuriouslyC · 2025-10-07T21:54:04 1759874044

And yet we give agents deterministic tools to use rather than tell them to compute everything in-model!

bonoboTP · 2025-10-07T22:02:20 1759874540

Yes, and here they also operate deterministic GUI tools. Thing is, many GUI programs are not designed so well. Their best interface and the only interface they were tested and designed for is the visual one.

ivape · 2025-10-07T21:57:32 1759874252

What you say is 100% true until it’s not. It seems like a weird thing to say (what I’m saying), but please consider we’re in a time period where everything we say is true, minute by minute, and no more. It could be that the next version of this just works, and works really well.

wahnfrieden · 2025-10-07T20:47:44 1759870064

It's not about efficiency but access. Many services do not provide programmatic access.

golol · 2025-10-07T21:39:18 1759873158

If we could build mechanical horses they would be absolutely amazing!