Hacker News

It is embarrassingly, shockingly bad, because these models are advertised and sold as being capable of understanding images.

Evidently, all these models still fall short.



It's surprising because these models are pretty ok at some vision tasks. The existence of a clear failure mode is interesting and informative, not embarrassing.


Not only are they capable of understanding images (the kind people might actually feed into such a system: photographs), but they're pretty good at it.

A modern robot would struggle to fold socks and put them in a drawer, but they're great at making cars.


I mean, with some of the recent demos, robots have gotten a lot better at folding things and putting them away. Not saying it's anywhere close to human level, but it's taken a pretty massive leap from being a joke just a few years ago.


They're hardly being advertised or sold on that premise. They advertise and sell themselves, because people try them out and find out they work, and tell their friends and/or audiences. ChatGPT is probably the single biggest bona-fide organic marketing success story in recorded history.


This is fantastic news for software engineers. Turns out that all those execs who've decided to incorporate AI into their product strategy have already tried it out and ensured that it will actually work.


> Turns out that all those execs who've decided to incorporate AI into their product strategy have already tried it out and ensured that it will actually work.

The 2-4-6 game comes to mind. They may well have verified that the AI will work, but it's hard to learn the skill of thinking about how to falsify a belief.


You mean this one here? - https://mathforlove.com/lesson/2-4-6-puzzle/

Looking at the example patterns given:

  MATCH
  2, 4, 6
  8, 10, 12
  12, 14, 16
  20, 40, 60

  NOT MATCH
  10, 8, 6

If the answer is "numbers in ascending order", then this is a perfect illustration of synthetic vs. realistic examples. The numbers indeed fit that rule, so in theory, everything is fine. In practice, you'd be an ass to give such examples on a test, because they strongly hint that the rule is more complex. Real data from a real process is almost never misleading in this way[0]. In fact, if you sampled such sequences from a real process, you'd be better off assuming the rule is "2k, 2(k+1), 2(k+2)", and treating the last example as some weird outlier.

Might sound like pointless nitpicking, but I think it's something to keep in mind wrt. generative AI models, because the way they're trained makes them biased towards reality and away from synthetic examples.

--

[0] - It could be if you have very, very bad luck with sampling. Like winning a lottery, except the prize sucks.
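To make the point concrete, here's a minimal sketch of the situation above. The rule names and the "falsifying" probe are my own illustration, not part of the puzzle as linked: the broad rule (ascending numbers) and the narrow rule hinted at by the first three examples (2k, 2(k+1), 2(k+2)) agree on every confirming example, so only a sequence chosen to falsify the narrow hypothesis tells them apart.

```python
def ascending(seq):
    # The broad rule: strictly increasing numbers.
    return all(a < b for a, b in zip(seq, seq[1:]))

def even_steps_of_two(seq):
    # The narrow rule the first examples hint at: 2k, 2(k+1), 2(k+2).
    return (len(seq) == 3 and seq[0] % 2 == 0
            and seq[1] == seq[0] + 2 and seq[2] == seq[0] + 4)

# The first three MATCH examples satisfy BOTH rules, so they teach nothing
# about which rule is actually in force.
confirming = [(2, 4, 6), (8, 10, 12), (12, 14, 16)]
for seq in confirming:
    assert ascending(seq) and even_steps_of_two(seq)

# Only a probe designed to break the narrow hypothesis separates them --
# which is exactly what the "outlier" example 20, 40, 60 does.
probe = (20, 40, 60)
print(ascending(probe), even_steps_of_two(probe))  # True False
```

The point being: if you only ever test confirming examples, both hypotheses look equally good forever.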


That's the one. Though in the version I heard, you set your own rule rather than just using the example.

I'd say that every black swan is an example of a real process that is misleading.

But more than that, I mentioned verified/falsified, as in the difference between the two in science. We got a long way with verification alone (Karl Popper only died in 1994), but falsification does seem to make a difference.


Who cares about execs? They know the models work, but for them "works" is defined as "makes them money", not "does anything useful".

I'm talking about regular people, who actually use these tools productively and can tell the models are up to tasks that were previously unachievable.


Execs are important in the context of a discussion of how LLMs are advertised and sold.


I see this complaint about LLMs all the time - that they're advertised as being infallible but fail the moment you give them a simple logic puzzle or ask for a citation.

And yet... every interface to every LLM has a "ChatGPT can make mistakes. Check important info." style disclaimer.

The hype around this stuff may be deafening, but it's often not entirely the direct fault of the model vendors themselves, who even put out lengthy papers describing their many flaws.


There's evidently a large gap between what researchers publish, the disclaimers a vendor makes, and what gets broadcast on CNBC, no surprise there.


A bit like how Tesla Full Self-Driving is not to be used as self-driving. Or any other small print. Or ads in general. Lying by deliberately giving the wrong impression.


It would have to be called ChatAGI to be like Tesla FSD, where the company named it something it most definitely is not.


Humans are also shockingly bad at these tasks. And guess where the labeling came from…


Why do people expect these models, designed to be humanlike in their training, to be 100% perfect?

Humans fuck up all the time.



