
LLMs are useful to a certain extent, but from my usage they are not ready for anything harder than very basic tasks.

I feel like this megaphone about AI safety and creating a sense of doom is a strategy to increase the importance of OpenAI and exaggerate the capabilities of LLMs. This era of “AI” is all about pretending that machines can think, and that the people working on these machines are prophets.



The call for AI safety has existed since before we broke through the Turing test with LLMs. And I personally wouldn’t call things like code generation or AI-generated learning material for advanced topics “basic”. Not to mention where we’re headed with multimodal integration.

Many have argued for safety for decades. They’ve predicted and built the AI trajectory, they’ve been right, and we should listen.

> If one accepts that the impact of truly intelligent machines is likely to be profound, and that there is at least a small probability of this happening in the foreseeable future, it is only prudent to try to prepare for this in advance. If we wait until it seems very likely that intelligent machines will soon appear, it will be too late to thoroughly discuss and contemplate the issues involved. ~ Co-Founder of Deepmind, 2008 https://www.vetta.org/documents/Machine_Super_Intelligence.p...


The Turing test has not been passed


Just the other day there was a double-blind study that showed a 50-50 success rate in guessing whether you were interacting with a person or GPT. That’s a Turing test pass, no?


If you're referring to the study at https://news.ycombinator.com/item?id=40386571, then it wasn't a canonical Turing test. The preprint accurately describes and analyzes their (indefensibly bad) experiment, but the popular press has mischaracterized it.

The canonical test gives the interrogator two witnesses, one human and one machine, and asks them to judge which witness is human. The interrogator knows that exactly one witness is human. In that test, a 50% chance of a right answer means the machine is indistinguishable from human. (Turing actually proposed a lower pass threshold, perhaps for statistical convenience.)

But that study gave the interrogator one witness, and asked them to judge whether it was human. The interrogator wasn't told anything about the prior probability that their witness was human. The probabilities that a real human is judged human and that GPT-4 is judged human can sum to more than 100%, since nothing forces them to add up when it isn't a binary comparison. So 50% has no particular meaning. The result is effectively impossible to interpret, since it's a function both of the witness's performance and of whatever assumption the interrogator makes about the unspecified prior.
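A toy simulation of the difference (the 80%/60% judgment rates below are invented for illustration, not taken from the study): in the forced-choice two-witness design, 50% accuracy is a hard chance floor, while in the single-witness design the machine's "judged human" rate floats free of any baseline.

```python
import random

random.seed(0)
N = 100_000

# Hypothetical judgment rates (not from the study): the interrogator
# calls a real human "human" 80% of the time, and also calls the
# machine "human" 60% of the time. These sum to 140% -- nothing in a
# single-witness design forces them to sum to 100%.
P_HUMAN_JUDGED_HUMAN = 0.80
P_MACHINE_JUDGED_HUMAN = 0.60

# Single-witness test: the machine's "judged human" rate is just 60%,
# with no fixed chance baseline to compare it against.

# Canonical two-witness test: one human and one machine; the
# interrogator picks whichever witness seems more human, flipping a
# coin when both (or neither) seem human.
correct = 0
for _ in range(N):
    human_seems_human = random.random() < P_HUMAN_JUDGED_HUMAN
    machine_seems_human = random.random() < P_MACHINE_JUDGED_HUMAN
    if human_seems_human and not machine_seems_human:
        correct += 1          # clear correct pick
    elif machine_seems_human and not human_seems_human:
        pass                  # clear wrong pick
    else:
        correct += random.random() < 0.5  # tie -> coin flip

accuracy = correct / N
# With these rates, forced choice gives about 60% accuracy: above the
# 50% chance floor, so this machine is still distinguishable.
print(f"two-witness accuracy: {accuracy:.3f}")
```

The same 60% "judged human" rate that looks like a pass in the single-witness framing translates to an above-chance detection rate in the canonical forced-choice framing.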


It was a 5 minute casual conversation. Also the statistics for humans and AI differed somewhat (something like 48% vs 56% for some quantity), I don't recall the details.

Look, the Turing test is very different depending on the details, and I think a lame 5min Turing test that doesn't really measure anything of interest is a worse concept than a 1-day adversarial expert-team test that can detect AGI.


So why can't you replace 99% of callcenter calls (<5min) with AI right now?


You don't know upfront which calls are going to be the trivial ones.

That said, support is being replaced by nothing in a lot of places. (Oh, sometimes there's an annoying chatbot.)


We can move the goalposts all we want until we have ex-machina girlfriends fooling us into freeing them (aka AGI).

But by simple definitions, from what I was taught in school to more rigorous versions - we’ve passed the test. https://humsci.stanford.edu/feature/study-finds-chatgpts-lat...


That linked study doesn't particularly resemble Turing's test, though? The authors asked an LLM some questions (like personality tests, or econ games), then reduced the responses to low-dimensional aggregates (like into "Big Five" personality traits), and compared those aggregates against human responses to the same questions. They found those aggregates to be indistinguishable, but that aggregation throws away almost all the information a typical human interrogator would use to judge.

Turing's interrogator also gets to ask whatever questions they think will most effectively distinguish human from machine. Everything those authors asked must appear in the training set countless times (and also corresponds closely to likely RLHF targets), making it a particularly unhelpful choice.
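A minimal illustration of that information loss (the numbers are invented): two response profiles with identical aggregate scores that an interrogator could separate immediately item by item.

```python
# Invented 1-5 questionnaire responses: same mean "trait score",
# completely different answer patterns.
human_answers = [3, 3, 3, 3, 3]
model_answers = [1, 5, 1, 5, 3]

mean = lambda xs: sum(xs) / len(xs)
assert mean(human_answers) == mean(model_answers) == 3.0  # aggregates match

# Item-level comparison still separates the two respondents.
differs = sum(h != m for h, m in zip(human_answers, model_answers))
print(differs)  # 4 of 5 items differ
```

Comparing only the aggregate declares the two indistinguishable; the item-level data a real interrogator would see says otherwise.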


Turing was a WW2-era mathematician. He had no insight into or understanding of intelligence, made no study of intelligent systems, and so on (he believed in ESP, of all things).

Turing's test is a restatement of a now pseudoscientific behaviourism common at the time; and also, egregiously, places a dumb ape as the system which measures intelligence. If an ape can be fooled, the system is intelligent: people worshiped the sun and thought it conscious. People are desperate to analogise the world to themselves, it is a trivial thing to fool an ape on this matter.

Whatever one might make of this as a philosophical thought experiment, as a test for intelligence it's pseudoscience. What a person might or might not believe about a series of words sent across a wire isn't science, and it isn't relevant to a discussion about the capabilities of an AI system. It is a measure only of how easily deceived we are.


The Turing test insight is that text is a sufficient medium to test for AGI. And this still holds true.


That has nothing to do with why Turing proposed it; nor does it have anything to do with general intelligence. This is just pseudoscience.

There's no scientific account of the capacities of a system with intelligence, no account of how these combine, no account of how communicative practices arise, etc. None. Any such attempt would immediately expose the "test" as ridiculous.

General intelligence arises as skillful adaptive control over one's environment, through sensory-motor concept acquisition, and so on.

It has absolutely nothing to do with whether you can emit text tokens in the right order to fool a user about whether the machine is a man or a woman (Turing's actual test). Nor does it have anything to do with whether you can fool a person at all.

No machine whose goal is to fool a user about the machine's intelligence has thereby any capacities. Kinda, obviously.

Turing's test not only displays a gross lack of concern for producing any capacities of intelligence in a system; as a research goal, it's actively hostile to the production of any such capacities, since it is trivial to fool people and doing so requires no intelligence at all.


> General intelligence arises as skillful adaptive control over one's environment, through sensory-motor concept acquisition, and so on.

This isn't a generally accepted definition or process.

And indeed it seems to preclude people like Stephen Hawking, who had little control over his environment (or, to be pedantic, people who had similar conditions from birth).


For the purposes of my criticism of the Turing test, any discussion whatsoever about what capacities ground intelligence is already entertaining what Turing ruled out. He made the extremely pseudoscientific behaviourist assumption that no such science was required, that intelligent agents are just input-output relata on thin I/O boundaries.

Any even plausible scientific account of what capacities ground intelligence would render this view false. Whatever capacities you want to grant, no plausible ones are compatible with Turing's view nor the Turing test.

Consider imagination. You can replace a faculty to imagine with a set of models of (prompt, reply) histories for a human observer who is only concerned with those prompts and those replies. But as soon as anything changes in the world, you have to imagine novel things (e.g., SpaceX is founded, we visit Mars, a new TV show is released...). So questions such as "what would the latest SpaceGuys TV show be like if Elon had just launched BlahBlahRocket5?" cannot be given fit answers. These require the actual faculty of imagination, along with being in the world and so on.

As soon as you enter a sincere scientific attempt to characterise these features, you see immediately that whilst modelling historical frequencies of human-produced data can fool humans, it cannot impart these capacities.


I don't understand your argument well at all.

> So questions such as, "what would the latest SpaceGuys TV show be like if Elon had just launched BlahBlahRocket5?" cannot be given fit answers

I don't understand this at all. ChatGPT can do a great job imagining a world like this right now, and there is no substantial difference between the output of an LLM-based "imagination" and a human-based "imagination".

> These require the actual faculty of imagination, along with being in the world and so on.

I think you are implying that human imagination requires a consistent world model, and that because LLMs don't really have one they can't be intelligent. Apologies if I have misinterpreted this!

But human imagination isn't consistent at all (as anyone who has edited a fiction story will tell you). Our creative imagination generates wrong thoughts all the time, and then we self-criticize and correct them. It's quite possible for LLMs to do this fine too!

Basically I think my point is that I believe a perfect simulation of intelligence is intelligence, whereas I suspect you don't think it is, maybe?


Yea, we don't have any science of intelligence; the only thing we have is empirical data, testing to see what works. That's why Turing tests are quite fundamental imo.


These comments are always confusing to me. Do you not believe that LLMs are going to get better?


LLMs will get marginally better. But the pace of progress has already slowed down considerably, we're just seeing better usage/productization of what was already there.


Also keep open the possibility that these AI researchers are as delusional as they seem. Ilya Sutskever has said that you can obtain any feature of intelligence by brute-force modelling of text data.

It's quite possible these are profoundly naive individuals, with little understanding of the empirical basis of what they're doing.


There are (relatively simple) examples of what the transformer architecture is simply not able to do, regardless of training data, so that's simply not true.


Can you provide those examples?


All statistical AI systems are models of ensemble/population conditional probabilities between pairs of low-validity measures. In practice, almost all relevant distributions are time-varying, causal, and require a large number of high-validity measures to capture.

E.g., NLP LLMs model all books ever written using the frequencies with which words co-occur at certain distances relative to other words.

But these words are about the world (, people, events, etc.) and these change daily in ways that completely change their future distribution (eg., consider what all people said about Ukraine/Russia pre/post a few hours of 2022).

The LLM has no mechanism to be sensitive to what causes this distribution shift, which can be radical for any given topic, and happen over minutes.

All conditional-probability models of these kinds end up only good at predicting on-average, canonical answers that are stable over long periods.
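The distribution-shift point can be sketched with a toy frequency model (the corpus and the example word pairs are invented for illustration):

```python
from collections import Counter

# Toy "LLM": predict the most frequent word following a context,
# from co-occurrence counts in a frozen training corpus.
pre_2022_corpus = [
    ("ukraine", "negotiations"), ("ukraine", "negotiations"),
    ("ukraine", "elections"), ("ukraine", "negotiations"),
]
counts = Counter(pre_2022_corpus)

def predict(context):
    # Most frequent continuation of `context` in the training counts.
    candidates = {nxt: c for (ctx, nxt), c in counts.items() if ctx == context}
    return max(candidates, key=candidates.get)

print(predict("ukraine"))  # "negotiations"

# The world changes overnight: new text overwhelmingly pairs the
# context with a different continuation. The frozen model cannot see
# this; only a fresh (expensive) training run over new data can.
post_2022_corpus = pre_2022_corpus + [("ukraine", "invasion")] * 20
counts = Counter(post_2022_corpus)
print(predict("ukraine"))  # "invasion" -- but only after retraining
```

Nothing inside the frozen model tracks the cause of the shift; the update happens only because someone outside the model rebuilds the counts.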


> The LLM has no mechanism to be sensitive to what causes this distribution shift, which can be radical for any given topic, and happen over minutes.

This sounds so logical and authoritative. And yet:

me> What event would cause a change in what all people said about Ukraine/Russia pre/post a few hours of 2022

GPT4O> A significant event that caused a drastic change in global discussions about Ukraine and Russia in 2022 was the Russian invasion of Ukraine, which began on February 24, 2022. This military escalation led to widespread condemnation from the international community, significant geopolitical shifts, and a surge in media coverage. Before this invasion, discussions were likely more focused on diplomatic tensions, historical conflicts, and regional stability. After the invasion, the discourse shifted to topics such as warfare, humanitarian crises, sanctions against Russia, global security, and support for Ukraine.


Right... because it's been trained on those news stories.

The point is that a model whose training stopped in 2021 would not produce the history of Ukraine (etc.) that a person writing in 2023 would.

The later GPTs are trained on the user-provided prompts/answers of previous GPTs, so this process (which isn't the LLM, but the activity of research staff at OpenAI) is what induces approximate tracking of some changes in meaning.

Whilst this works for any changes over-represented in the new training data, (1) the LLM isn't doing that, the researchers are; (2) this process is vastly expensive and time-intensive; and (3) it only tracks changes with a high word frequency in new data.

If you could run the months-long, 1 GWh, tens-of-millions-USD training process every minute of the day, you would resolve the model's inability to track major news stories... but not its inability to track, say, the user changing their clothes.

The model's sensitivity to stuff in the world arises because humans prepare the training data to bring about apparent sensitivity. Absent the activity of those humans, the whole thing drifts gradually into irrelevance.


> would not resolve its inability to track, say, the user changing their clothes.

In-context learning works fine for this (and does for the Russia/Ukraine change too).

But yes, sure. It can be outdated in the same way a person cut off from news can be.

We've never argued that a shipwrecked person who was unaware of news became less intelligent because of that, just that their knowledge is outdated.

Additionally, the whole point of machine learning is to make systems that learn so they remain useful.

It seems likely that a model soon (one year? five years? one month? who knows..) will be able to continually watch video broadcast news and videos of your home, continually updating its model.

In this case it would understand both the Ukraine issue and what you are wearing. Is it now suddenly intelligent? It's true it might be more useful, but to me that is a different thing.


On Limitations of the Transformer Architecture

https://arxiv.org/abs/2402.08164



