Current techniques can sometimes get causation right. What's it going to take to get it right as reliably as a person?
I predict that the "Transformers Plus Scale" formula will not magically deliver reliability. Other ideas will be needed.
Many people seem to be so impressed that these models ever get something right that they assume always getting it right will be trivial. Well, mark this page and come back in five years. All the crowing about The Bitter Lesson will look quaint, and everything I said here will be vindicated.
Exponential improvement on benchmarks is an iron law... until it isn't.
I read the first link and I don't agree with your gloss on it at all. It sounds to me like people interpret causality in a more nuanced and practical way than researchers expected.
I thought that the isolation of "causal islands" would clearly affect performance.
But the Wason selection task[1] has hard numbers. This is an example:
You are shown a set of four cards placed on a table, each of which has a number on one side and a colored patch on the other side. The visible faces of the cards show 3, 8, red and brown. Which card(s) must you turn over in order to test the truth of the proposition that if a card shows an even number on one face, then its opposite face is red?
In Wason's study, not even 10% of subjects found the correct solution.[5] This result was replicated in 1993.
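For concreteness, the task can be solved mechanically: a card is worth turning over exactly when some possible hidden face could falsify the rule. Below is a minimal brute-force sketch in Python; the encoding and helper names are my own illustration, not from any of the sources discussed here.

    # Rule under test: "if a card shows an even number, its other face is red".
    NUMBERS = [3, 8]           # possible number faces
    COLORS = ["red", "brown"]  # possible color faces

    def falsifies(number, color):
        # The rule fails only for an even number paired with a non-red face.
        return number % 2 == 0 and color != "red"

    def must_turn(visible):
        # Turn a card exactly when some hidden face could falsify the rule.
        if isinstance(visible, int):
            return any(falsifies(visible, color) for color in COLORS)
        return any(falsifies(number, visible) for number in NUMBERS)

    print([card for card in [3, 8, "red", "brown"] if must_turn(card)])
    # -> [8, 'brown']  (most subjects instead pick 8 and red)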
"if.. then" is logical deduction but also causal reasoning of course.
I gave Bing Chat the task (altered from your example) without telling it the task's name. What did it do? It searched the web for "Wason selection task" (!) and then proceeded to give the wrong solution, based on the internet references it found.
Apparently it got confused because the web examples differed from mine, and what was correct in the web examples was incorrect in mine. Sigh. I guess GPT-4 or 5 will handle it?
We should remember that, thanks to RLHF, ChatGPT has received a ton of feedback on the correct way to solve common benchmark problems.
I have seen multiple instances in which it got a problem laughably wrong, the failure got publicized, and then hours or days later it was always giving the right answer.
I think it would be interesting to see benchmarks on that.
I tested that example on ChatGPT and it was correct, with a really good explanation, even when I modified the question away from the example on Wikipedia.
Some related results (e.g. https://arxiv.org/pdf/2206.14576.pdf, see the "Causal reasoning: Interventions after passive observations" section) indicate it would be competitive.
Actually, https://arxiv.org/pdf/2207.07051.pdf is DeepMind running Wason (and other) tests on Chinchilla, and they find it scores between 40% and 60% on "realistic" and "shuffled realistic" Wason tasks (I think; it's hard to read. See Figure 5).
Yeah, but ChatGPT is not worse at causal reasoning than at any other reasoning tasks. Its intelligence is limited, and smarter systems, such as Bing Chat, consistently do better at arbitrary reasoning tasks.
To be clear, I also suspect that pure LLMs have some fundamental reasoning limits, but I'm not certain of it.
> Current techniques can sometimes get causation right. What's it going to take to get it right as reliably as a person?
In the same way that Sydney is able to refer to the internet, an AI that gets causation right would probably need the ability to refer to a formal logic engine.
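As a sketch of what that hand-off might look like, here is the Wason rule given to the real Z3 solver (z3-solver on PyPI). The formulas are hand-written stand-ins for the translation step the LLM would have to perform; that division of labor is my assumption, not an established design.

    from z3 import Bool, Implies, Not, Solver, sat

    even, red = Bool("even"), Bool("red")
    rule = Implies(even, red)  # "if even, then red" as a material conditional

    def consistent(*facts):
        # Ask Z3 whether the rule can still hold given the observed faces.
        solver = Solver()
        solver.add(rule, *facts)
        return solver.check() == sat

    print(consistent(even, Not(red)))  # False: an even, non-red card refutes the rule
    print(consistent(Not(even), red))  # True: the 3 and red cards can never refute it

The deduction itself is then exact; the model only has to get the translation into formulas right.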