Current techniques can sometimes get causation right. What's it going to take to get it right as reliably as a person?
I predict that the "Transformers Plus Scale" formula will not magically deliver reliability. Other ideas will be needed.
Many people seem to be so impressed that these models ever get something right that they assume always getting it right will be trivial. Well, mark this page and come back in five years. All the crowing about The Bitter Lesson will look quaint, and everything I said here will be vindicated.
Exponential improvement on benchmarks is an iron law... until it isn't.
I read the first link and I don't agree with your gloss on it at all. It sounds to me like people interpret causality in a more nuanced and practical way than researchers expected.
I thought that the isolation of "causal islands" would clearly affect performance.
But the Wason selection task[1] has hard numbers. This is an example:
You are shown a set of four cards placed on a table, each of which has a number on one side and a colored patch on the other side. The visible faces of the cards show 3, 8, red and brown. Which card(s) must you turn over in order to test the truth of the proposition that if a card shows an even number on one face, then its opposite face is red?
In Wason's study, not even 10% of subjects found the correct solution.[5] This result was replicated in 1993.
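For concreteness, the task can be solved mechanically: a card is worth turning over exactly when some possible hidden face could falsify the rule. Below is a minimal brute-force sketch in Python; the encoding and helper names are my own illustration, not from any of the sources discussed here.

    # Rule under test: "if a card shows an even number, its other face is red".
    NUMBERS = [3, 8]           # possible number faces
    COLORS = ["red", "brown"]  # possible color faces

    def falsifies(number, color):
        # The rule fails only for an even number paired with a non-red face.
        return number % 2 == 0 and color != "red"

    def must_turn(visible):
        # Turn a card exactly when some hidden face could falsify the rule.
        if isinstance(visible, int):
            return any(falsifies(visible, color) for color in COLORS)
        return any(falsifies(number, visible) for number in NUMBERS)

    print([card for card in [3, 8, "red", "brown"] if must_turn(card)])
    # -> [8, 'brown']  (most subjects instead pick 8 and red)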
"if.. then" is logical deduction but also causal reasoning of course.
I gave Bing Chat the task (altered from your example) without telling it the task's name. What did it do? It searched the web for "Wason selection task" (!) and then proceeded to give the wrong solution, based on the internet references it found.
Apparently it got confused because the web examples differed from mine, and what was correct in the web examples was incorrect in mine. Sigh. I guess GPT-4 or 5 will handle it?
We should remember that, thanks to RLHF, ChatGPT has received a ton of feedback on the correct way to solve common benchmark problems.
I have seen multiple instances in which it got a problem laughably wrong, the failure got publicized, and then hours or days later it was always giving the right answer.
I think it would be interesting to see benchmarks on that.
I tested that example on ChatGPT and it was correct, with a really good explanation, even when I modified the question away from the example on Wikipedia.
Some related results (e.g. https://arxiv.org/pdf/2206.14576.pdf, see the "Causal reasoning: Interventions after passive observations" section) indicate it would be competitive.
Actually, https://arxiv.org/pdf/2207.07051.pdf is DeepMind running Wason (and other) tests on Chinchilla, and they find it scores between 40% and 60% on "realistic" and "shuffled realistic" Wason tasks (I think; it's hard to read. See Figure 5).
Yeah, but ChatGPT is not worse at causal reasoning than at any other reasoning tasks. Its intelligence is limited, and smarter systems, such as Bing Chat, consistently do better at arbitrary reasoning tasks.
To be clear, I also suspect that pure LLMs have some fundamental reasoning limits, but I'm not certain of it.
> Current techniques can sometimes get causation right. What's it going to take to get it right as reliably as a person?
In the same way that Sydney is able to refer to the internet, an AI that gets causation right would probably need the ability to refer to a formal logic engine.
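As a sketch of what that hand-off might look like, here is the Wason rule given to the real Z3 solver (z3-solver on PyPI). The formulas are hand-written stand-ins for the translation step the LLM would have to perform; that division of labor is my assumption, not an established design.

    from z3 import Bool, Implies, Not, Solver, sat

    even, red = Bool("even"), Bool("red")
    rule = Implies(even, red)  # "if even, then red" as a material conditional

    def consistent(*facts):
        # Ask Z3 whether the rule can still hold given the observed faces.
        solver = Solver()
        solver.add(rule, *facts)
        return solver.check() == sat

    print(consistent(even, Not(red)))  # False: an even, non-red card refutes the rule
    print(consistent(Not(even), red))  # True: the 3 and red cards can never refute it

The deduction itself is then exact; the model only has to get the translation into formulas right.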