
Deducing things from the inability of an LLM to answer a specific question seems doomed by the "it will be able to on the next iteration" principle.

It seems like the only way you could systematically chart the weaknesses of an LLM is by having a class of problems that gets harder for LLMs at a steep rate, so that a small increase in problem complexity requires a significant increase in LLM power.



> It seems like the only way you could systematically chart the weaknesses of an LLM is by having a class of problems that gets harder for LLMs at a steep rate

That would be any problem more complicated than O(n) complexity, even with chain-of-thought prompting[1].

Note that the O(n) thing can bite you in all sorts of unintuitive ways: if the LLM+CoT can perform an O(n) Task A and O(m) Task B, then it can't do the O(nm) task "for every step of A, perform B on the result" unless you come up with a task-specific prompt outlining the solution. The alternative is to play RLHF Whack-A-Mole, separately training the LLM on the combined task. (I think this weakness might be why LLMs are hitting a wall in enterprise deployment, and also explains why LLM agents don't actually work.) The only way this will get fixed is with a fundamentally more sophisticated architecture.

[1] https://www.quantamagazine.org/how-chain-of-thought-reasonin...
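The composition claim above can be made concrete with a toy step-counter (my own sketch, not from the linked article): if Task A emits n intermediate results and Task B takes m sequential steps per input, the combined task forces a chain of thought on the order of n*m steps.

```python
# Toy sketch: counting the sequential steps a chain of thought must emit.
# Task A and Task B are hypothetical stand-ins, not real LLM tasks.

def task_a(x, n):
    """Hypothetical Task A: n sequential steps, keeping every intermediate."""
    steps, results = 0, []
    for _ in range(n):
        x += 1
        steps += 1
        results.append(x)
    return results, steps

def task_b(y, m):
    """Hypothetical Task B: m sequential steps on a single input."""
    steps = 0
    for _ in range(m):
        y *= 2
        steps += 1
    return y, steps

def composed(x, n, m):
    """'For every step of A, perform B on the result': steps grow as n*m."""
    results, total = task_a(x, n)
    out = []
    for r in results:
        v, s = task_b(r, m)
        out.append(v)
        total += s
    return out, total
```

A model trained to emit n-step chains for A and m-step chains for B separately now has to produce roughly n + n*m steps in one pass, which is the blowup the parent describes.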


  > Deducing things from the inability of an LLM to answer a specific question seems doomed by the "it will be able to on the next iteration" principle.
That's orthogonal.

If we are pointing in the right direction(s) then yes, next iteration could resolve all problems.

If we are not pointing in the right direction(s) then no, next iteration will not resolve these problems.

Given LLMs' rapid improvement at regurgitating knowledge from their training data but simultaneously slow improvement in their ability to generalize (e.g. logic "puzzles"), I think it is naive to assume we're pointed in the right direction. Maybe we're even pointing in mostly the right direction. But why assume we are?

We can continue in the direction we're going while simultaneously considering that it might not be well aligned. If we are well aligned, that gives us more confidence and makes gathering funding easier. If we aren't, well, it is easier to course correct sooner rather than later. In either case, you benefit from the analysis.

Understanding why things fail is more important than understanding why things succeed.


GP is referring to the fact that if it becomes well known that LLM version X can’t solve problem Q, then the model’s trainers will make sure to include problem Q prominently in the training set, running it through over and over to ensure that version X+1 is able to solve the problem whether the model’s “reasoning” abilities have improved or not.

Thus observers of the LLM space like us need to keep finding novel "bellwether problems" that we think will evaluate a model's ability to reason, knowing that once we start talking about one openly it will no longer be a useful bellwether.

By their nature as “weird-shaped” problems, these aren’t the kind of thing we’re guaranteed to have an infinite supply of. As the generations move on it will become more and more difficult to discern “actual improvements in reasoning” from “the model essentially has the solution to your particular riddle hard-coded”.


Oh, thanks for the correction. I did misinterpret.

Though I will say that LLMs don't appear to be doing any better at the river crossing puzzles. They tend to "patch" the ones I and others actively tweet about, but they still aren't getting better at generalizing. I've taken this as fairly strong evidence that we're going in the wrong direction for reasoning (as opposed to a merely similar one). But the strongest evidence to me is that they're entropy minimizers.

What's extra interesting is that transformers CRAVE augmentations. I work in vision, and this is necessary to get them to do well. You can actually get much smaller models to do what bigger models can if you get this right.


Oh, thanks for the correction. I did misinterpret.

Illustrates that language is hard for humans too, hah.

Anyway, the "next iteration solves it" effect is definitely a result of common problems leaking. But it could also be a result of LLMs being universal but not efficiently-universal problem solvers, with people tending to choose the simplest problem that can't yet be solved (since those make the most illustrative examples).

Also, your river-crossing problems seem quite useful.


  > hah
And? That's not the issue with LLMs.

The issue is an inability to reason. Sure, a human might also have difficulties with river crossing puzzles, even trivial ones, but I can't get a person to tell me that all the animals can fit in the boat, then put all the animals into the boat, and then proceed to make multiple trips across the river. If they get the first two right, they always get the right answer. But this is not true for an LLM. That's a very clear demonstration of a lack of reasoning and a lack of a world model.

It's not about coaching or finding the right prompt; it's that the logic is inconsistent and unreasonable (yes, humans will fail at logic, but *reasoning doesn't mean getting the correct answer*). It fails to meet the basic definition of reasoning.

The whole fucking goal is generalization. That's the G in AGI and the most important of all 3 of those letters. We don't have strong evidence of generalization. For GI we want out-of-distribution generalization, but we're not doing so well even at in-distribution generalization. That's demonstrated by the river crossing puzzles, Cheryl's birthday, and the recently famous 9.8 vs 9.11 (https://x.com/sainingxie/status/1834300251324256439)

Yes, the next iteration will get better. But better in which direction? Being dismissive of what it fails at just means you don't get better in that direction unless you get lucky.


Oh interesting, what variants were you using that it seems they patched?


Well, it depends on what you mean by "patched", but I'll be generous and say it means the model gets the right answer with no logical errors. I used to use

  Which weighs more, a kilogram of bricks or a pound of feathers
and that worked until the second or third update of GPT 3.5 (back when everyone was like "but did you try 3.5?").
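For the record, the arithmetic behind that trick question: a kilogram exceeds a pound regardless of the material, so the bricks win.

```python
# Sanity check of the trick question: a kilogram always outweighs a pound.
KG_PER_LB = 0.45359237  # definition of the avoirdupois pound in kilograms

bricks_kg = 1.0                # a kilogram of bricks
feathers_kg = 1.0 * KG_PER_LB  # a pound of feathers, converted to kg
print(bricks_kg > feathers_kg)  # True: the bricks weigh more
```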

Here's a river crossing puzzle with the current GPT-4o that "works"

  ----------------------------------------------------
  > I have a duck, a goose, and a pig that I need to take across a river. If I leave the duck and goose alone, the duck will quack. If I leave the pig with the goose the pig will oink. I can only fit two animals in the boat at a time. What is the quickest way to get all animals across the river?

  To solve this puzzle, we need to find a way to transport the animals across the river without triggering the quacking or oinking conditions:

  1. **First Trip:** Take the duck and the goose across the river. Leave the goose on the other side and take the duck back with you.
  2. **Second Trip:** Take the pig across the river. Leave the pig on the other side and bring the goose back with you.
  3. **Third Trip:** Take the duck and the goose across the river again.

  Now all three animals are across the river, and at no point were the duck and goose left alone, nor were the pig and goose left together. This method ensures no quacking or oinking occurs. 
  ----------------------------------------------------
Idk why it decided to add the extra constraint of keeping the animals quiet but hey, it doesn't really matter.

As far as I know, I'm the first one who came up with this variant, and strangely it was consistently answered correctly once 4o came out (after my post got repeated by a few bigger names), so I thought it was patched. But just now I wasn't able to reproduce a success with various checkpoints and system prompts, so I guess it got undone?

  ----------------------------------------------------
  > I have a fox, a goose, and a bag of corn that I need to transport across a river. All three can fit in the boat. If you leave the fox alone with the goose, it'll eat the goose. If you leave the goose alone with the corn, it'll eat the corn. Describe how to get all items safely across the river.

  To solve this classic puzzle, follow these steps:

  1. **Take the Goose Across First:** Start by taking the goose across the river and leave it on the other side.

  2. **Return and Take the Fox Across:** Go back to the original side and take the fox across the river.

  3. **Bring the Goose Back:** When you reach the other side, leave the fox there and take the goose back with you to the original side.

  4. **Take the Corn Across:** Leave the goose on the original side and take the bag of corn across the river.

  5. **Return Alone:** Leave the bag of corn with the fox on the other side and return alone to the original side.

  6. **Take the Goose Across Again:** Finally, take the goose across the river one last time.

  Now, all three items—the fox, the goose, and the corn—are safely across the river without any being eaten.
  ----------------------------------------------------
claude-3-5-sonnet-20240620 and claude-3-opus-20240229 both act like it is the normal puzzle.

It doesn't matter if you say "All three can fit in the boat with the farmer." Nor if you ask it to explain its logic step-by-step.

The only humans I've seen struggle with this variant ask me to repeat the question several times before saying something similar to "I don't understand the trick. You just take all the animals across at once?" But the trick is there's no trick and I accept that as an answer lol.
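The "no trick" point can be checked mechanically. Here's a brute-force breadth-first search over crossing states (my own sketch, not anything from the thread); with a boat that holds all three items, the shortest plan it finds is a single trip.

```python
from collections import deque
from itertools import combinations

ITEMS = ("fox", "goose", "corn")
# Pairs that must never be left on a bank without the farmer.
UNSAFE = [frozenset(("fox", "goose")), frozenset(("goose", "corn"))]

def safe(bank):
    """A bank without the farmer must not contain a predator/prey pair."""
    return not any(pair <= set(bank) for pair in UNSAFE)

def solve(capacity=3):
    """Return the shortest list of crossings (each a tuple of cargo)."""
    start = (0, frozenset(ITEMS))  # farmer on near bank, all items with him
    goal = (1, frozenset())        # farmer and all items on the far bank
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (farmer, near), path = queue.popleft()
        if (farmer, near) == goal:
            return path
        here = near if farmer == 0 else frozenset(ITEMS) - near
        for k in range(capacity + 1):
            for cargo in combinations(sorted(here), k):
                new_near = near - set(cargo) if farmer == 0 else near | set(cargo)
                left = new_near if farmer == 0 else frozenset(ITEMS) - new_near
                if not safe(left):
                    continue  # an unsafe pair would be unattended
                state = (1 - farmer, new_near)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [cargo]))
    return None
```

With `capacity=3` the search immediately finds the one-trip answer (take everything at once); with `capacity=1` it recovers the classic seven-crossing solution, so the solver is doing real work and the one-trip answer really is optimal.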

FWIW, I've had several LLMs explain to me that all the animals can fit in the boat and then proceed to give the normal solution to the puzzle. I've also seen it put all the animals into the boat and then only drop one off and travel back and forth, or sometimes animals teleporting, or even the farmer. Sometimes the farmer is just going back and forth. That happens a lot when I use a variant where the animals will follow the farmer and can swim.

Very often I see the better models get the right answers but fail (or do even worse) when explaining those answers. I don't think this makes the models useless, but I feel that their highest utility is mostly due to Google's ever-decreasing quality. But what can you do?


I like the noise variant! The "everything can go in one trip" variant is the one I've been using, and I was able to get 4o to get it right in one shot with enough coaching, and o1-preview without coaching, which convinced me (of what, I'm not sure). My other riddle is the car accident doctor's son one, which 4o couldn't get but o1-preview does.

I'll have to come up with more obscure riddles and not talk about them online and only use temporary chats which aren't used as training data and see what happens next. I'm sure I have a puzzle book in my library that I can use to help me make new ones.


Be careful with coaching. It's very easy to leak information. The point is to get it without letting it know.

As for o1, well, I've been using this one for a year, and a few big players have used it too. So remember that these get spoiled because they end up in the training set.


Good point! Problem is, I can't know what other people have spoiled it on either, so if we'd independently come up with the now-spoiled "the boat can take all" variant, I can't know unless that gets revealed on Twitter or arXiv or HN or wherever.


We won't know whether it's spoiled, or rather how spoiled it is, unless the companies release their training data.

But in this case we can study it a different way: use things we are certain are spoiled. That's what the author here does.

But as an ML researcher, I'll let you know that I don't trust a single reasoning paper I've read.

You either have to start with the premise that the thing you're testing is in the training data (and thus spoiled), so you typically look at generalization and how robust it is. You can't prove reasoning this way, but you can disprove it this way. This also works for theory of mind (which it seems many HN readers missed in the first paragraph).

The other way is that you need to prove the data isn't in training (for a strong claim you need to prove that it's not even indirectly in the data...). You still can't prove reasoning this way, but you would build strong evidence that it is going on (proving reasoning is very tough, if possible at all). I think if this were shown consistently, then most of the conversations about LLMs not reasoning would go away and we'd discuss them the way we discuss humans: capable of reasoning, but not necessarily always doing so.

But ML is in an existential crisis right now. Theory means nothing without experimentation, but experimentation means nothing without theory. See von Neumann's elephant.


... and such that the same increase in problem complexity requires a smaller increase in human effort to solve.

This was the idea with the Winograd schema challenge [0] and now the ARC benchmark [1], but human-level performance on the former was achieved in 2019, and very strong progress has been made on the latter over the last few months. At this point it seems we're pretty much reaching the limit of challenges that are relatively easy for humans to solve in a single sitting, and we'll have to switch to benchmarks that rely on extended work over time, such as SWE-bench [2], and even there it seems that state-of-the-art AI agents are already doing better than the "average" human developer.

[0] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[1] https://arcprize.org/

[2] https://www.swebench.com/



