Hacker News

What's going on with their SWE bench graph?[0]

GPT-5 non-thinking is labeled 52.8% accuracy, but o3 is shown as a much shorter bar, yet it's labeled 69.1%. And 4o is an identical bar to o3, but it's labeled 30.8%...

[0] https://i.postimg.cc/DzkZZLry/y-axis.png



As someone who spent years quadruple-checking every figure in every slide to avoid a mistake like this, it’s very confusing to see this in the big launch announcement of one of the most high-profile startups around.

Even the small presentations we gave to execs or the board were checked for errors so many times that nothing could possibly slip through.


It's literally a billion-dollar-plus release. I get more scrutiny on my presentations to groups of 10 people.


I take a strange comfort in still spotting AI typos. Makes it obvious their shiny new "toy" isn't ready to replace professionals.

They talk about using this to help families facing a cancer diagnosis -- literal life or death! -- and we're supposed to trust a machine that can't even spot a few simple typos? Ha.

The lack of human proofreading says more about their values than their capabilities. They don't want oversight -- especially not from human professionals.


Cynically, the AI is ready to replace professionals in areas where the stakeholders don't care too much. They can offer the services cheaper, and that is all that matters to their customers. Were it not so, companies like Tata wouldn't have any customers, and the phenomenon of "cheap Chinese junk" would not exist, because no retailer would order it.

So, brace yourselves, we'll see more of this in production :(


If quality matters this little, does the thing need doing at all?


Well, the world will split into those who care, and fields where precision is crucial, and the rest. Occasional mistakes are tolerable but systematic bullshit is a bit too much for me.


This separation (always a spectrum, not a split) has already existed for a long time. Bouts of systemic bullshit occur every now and then, known as "bubbles" (as in the dotcom bubble, the mortgage bubble, etc.) or "crises" (such as the "reproducibility crisis"). Smaller waves rise and fall all the time, in the form of various scams (from the ancient tulip mania to Ponzi to Madoff to ICOs).

It seems that large numbers of people, including people in high-up positions, tend to believe bullshit as long as it makes them feel comfortable. This leads to various irrational business fashions and technological fads, to say nothing of political movements.

So yes, another wave of fashion, another miracle that works "as everybody knows" would fit right in. It's sad because bubbles inevitably burst, and that may slow down or even destroy some of the good parts, the real advances that ML is bringing.


Yes this is quite shocking. They could have just had o3 fact check the slides and it would have noticed...


I thought so too, but I gave it a screenshot with the prompt:

> good plot for my presentation?

and it didn't pick up on the issue. Part of its response was:

> Clear metric: Y-axis (“Accuracy (%), pass @1”) and numeric labels make the performance gaps explicit.

I think visual reasoning is still pretty far from text-only reasoning.


o3 did fact check the slides and it fixed its lower score.


They let the AI make the bars.


Vibegraphing.


Stable diffusion is good for this!


and then check.


Well, clearly they didn’t


Probably generated with GPT-5


The needle now presses a little deeper into the bubble.


I think this just further demonstrates the truth behind the truly small & scrappy teams culture at OpenAI that an ex-employee recently shared [1].

Even with the way the presenters talk, you can sort of see that OAI prioritizes speed above most other things, and a naive observer might think they are testing things a million different ways before releasing, but actually, they're not.

If we draw up a 2x2 for Danger (High/Low) versus Publicity (High/Low), it seems to me that OpenAI sure has a lot of hits in the Low-Danger High-Publicity quadrant, but probably also a good number in the High-Danger Low-Publicity quadrant -- extrapolating purely from the sheer capability of these models and the continuing ability of researchers like Pliny to crack through them.

[1] https://calv.info/openai-reflections


I don't think they give a shit. This is a sales presentation to the general public, and the correct data is there. If one is pedantic enough, they can see the correct number; if not, it sells well. If they really cared, Grok etc. would be on there too.


The opposite view is to show your execs the middle finger on nitpicking. Their product is definitely not more important than ChatGPT-5, so your typo does not matter. It never did.


It is not a mistake. It is a common tactic to create the illusion of improvement.


Would they risk such an obvious blunder and being ridiculed for being "AI-sloppy"? I don't believe it.


I don’t believe it was a mistake either. As others have said, these graphs are worth billions. Everything is calculated. They take the risk that some will notice but most will not, and they say it was a mistake for those who do notice.


Perhaps they're taking a leaf from Nvidia's book: influencers dunking on their bar charts gives a lot of free press coverage/mindshare.


I've seen that sentiment on Reddit as well, and I can't fathom how you can think this being on purpose is more likely than a mistake when:

1 - The error is so blatantly large

2 - There is a graph without error right next to it

3 - The errors are not there in the system card and the presentation page


Not sure what to think anymore https://www.vibechart.net/


It wouldn't have taken years of quadruple checks to spot that one.


Possibly they rushed to bring forward the release announcement.


It's not a mistake. It's meant to mislead.


Humans hallucinate output all the time.


Not as much as current LLMs. But the point is that AIs are supposed to be better than us, kind of like how people built calculators to be more reliable than the average person and faster than anyone.


I'm just going to wildly speculate.

1. They had many teams who had to put their things on a shared Google Sheets or similar

2. They used placeholders to prevent leaks

2.a. Some teams put their content just-in-time

3. The person running the presentation started the presentation view once they had set up video etc. just before launching stream

4. Other teams corrected their content

5. Because the presentation view was already running, only the content added just-in-time in 2.a was correct; the later corrections from step 4 never showed up.

Now we wait to see.


6. (Occam's Razor) It just didn't perform that well in trials for that specific eval.


That's obviously wrong, since the numbers are right but the graph is wrong, and you can see it rendered correctly on the website…


Also, what's this??? https://imgur.com/a/5CF34M6


Imgur is down, hug of death from screenshot links on HN.

  {"data":{"error":"Imgur is temporarily over capacity. Please try again later."},"success":false,"status":403}
Or rate limited.


This is what Imgur shows to blacklisted IPs. You probably have a VPN on that is blocked.


Ugh, why lie to users... Just say the IP is blacklisted.

Thanks for the tip btw.


Because when you know it’s blacklisted you might try with a different IP, whereas if you don’t you will just wait (forever).


Imagine if we didn't tell criminals the law because they might try to find holes in it... This is just user-hostile, and security through obscurity. If someone on HN knows that this is what is shown to banned people, then so will the people that scrape or mean harm to Imgur.


In a world where we couldn't arrest criminals, only keep track of them in a log book, yeah that's probably exactly what we'd do


There’s no law here, just someone trying to protect their website.



Lol this is pure vibegraphing!


stats say this image got 500 views. imgur is much much more populated than HN


In 2015, yes. In 2025? Probably not. Imgur has been enshittifying rapidly since Reddit started its own image host. Lots of censorship and corporate gentrification. There are still some hangers-on, but it's a small group. 15 comments on Imgur is a lot nowadays.


Not GPT-5 trying to deceive us about how deceptive it is?


Why would you think it is anything special? Just because Sam Altman said so? The same guy who told us he was scared of releasing GPT-2.5 but now calling its abilities "toddler/kindergarten" level?


My comment was mostly a joke. I don't think there's anything "special" about GPT-5.

But these models have exhibited a few surprising emergent traits, and it seems plausible to me that at one point they could intentionally deceive users in the course of exploring their boundaries.

Is it that far fetched?


There is no intent, nor is there a mechanism for intent. They don't do long term planning nor do they alter themselves due to things they go through during inference. Therefore there cannot be intentional deception they partake in. The system may generate a body of text that a human reader may attribute to deceptiveness but there is no intent.


> There is no intent

I'm not an ML engineer - is there an accepted definition of "intent" that you're using here? To me, it seems as though these GPT models show something akin to intent, even if it's just their chain of thought about how they will go about answering a question.

> nor is there a mechanism for intent

Does there have to be a dedicated mechanism for intent for it to exist? I don't see how one could conclusively say that it can't be an emergent trait.

> They don't do long term planning nor do they alter themselves due to things they go through during inference.

I don't understand why either of these would be required. These models do some amount of short-to-medium-term planning, even if it is only in the context of their responses, no?

To be clear, I don't think the current-gen models are at a level to intentionally deceive without being instructed to. But I could see us getting there within my lifetime.


If you were one of the very first people to see an LLM in action, even a crappy one, and you didn't have second thoughts about what you were doing and how far things were going to go, what would that say about you?


It is just dishonest rhetoric no matter what. He is the most insincere guy in the industry, somehow manages to come off even less sincere than the lawnmower Larry Ellison. At least that guy is honest about not having any morals.


Deception - guessing it's the % of responses that deceived the user / gave misleading information.


Sure, but 50.0 > 47.4...


Oh man... didn't even notice. I've been deceived. That's bad.


In everything except the first set of bars, bigger bar == bigger number.

But the scale is also really off... I don't think anything here is proportionally correct, even within the same grouping.
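You can even check this kind of slide bug mechanically: given each bar's labeled value and its drawn height, heights should be proportional to labels. A minimal sketch (the pixel heights below are made-up stand-ins for what the slide appears to show, not measurements):

```python
def find_mislabeled_bars(bars, tol=0.05):
    """bars: list of (labeled_value, drawn_height) pairs.
    Uses the first bar as the reference scale and flags any bar
    whose drawn height deviates from proportionality by more than tol."""
    ref_value, ref_height = bars[0]
    scale = ref_height / ref_value
    return [
        (value, height)
        for value, height in bars
        if abs(height - value * scale) > tol * value * scale
    ]

# Hypothetical heights mimicking the slide: GPT-5's 52.8 drawn tallest,
# o3's 69.1 and 4o's 30.8 drawn as identical shorter bars.
bars = [(52.8, 300), (69.1, 200), (30.8, 200)]
print(find_mislabeled_bars(bars))  # flags the o3 and 4o bars
```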


GPT-5 generated the chart


Best answer on this page.

Thanks for the laugh. I needed it.


Must be some sort of typo type thing in the presentation since the launch site has it correct here https://openai.com/index/introducing-gpt-5/#:~:text=Accuracy...

Look at the image just above "Instruction following and agentic tool use"


They vibecharted


This reminds me of the agent demo's MLB stadium map from a few weeks ago: https://youtu.be/1jn_RpbPbEc?t=1435 (at timestamp)

Completely bonkers stuff.



New term of art :)


stable diffusion is great for this!


The barplot is wrong; the numbers are correct. Looks like they had a dummy plot and only ever updated the numbers, maybe to prevent leaking?

Screenshot of the blog plot: https://imgur.com/a/HAxIIdC


Haha, even with that, it says 4o does worse with 2 passes than with 1.

Edit: Never mind, I see now that the first one is SWE-bench and the second is Aider.


Those are different benchmarks


I see now on the website, the screenshot cut off the header for the first benchmark, looked like it was just comparing 1-pass and 2-pass.


Yes, sorry, I didn't fit everything in the screenshot.


Wow imgur has gone to shit. I open the image on mobile and then try to zoom it and bam some other “related content” is opened…!


Yeah it’s basically unusable now


That's been an issue for years. Their swipe detection is completely broken.


cross-posting:

https://x.com/sama/status/1953513280594751495 "wow a mega chart screwup from us earlier--wen GPT-6?! correct on the blog though."

blog: https://openai.com/index/introducing-gpt-5/


(whispers) they're bullshit artists

It's like those idiotic ads at the end of news articles. They're not going after you, the smart discerning logician, they're going after the kind of people that don't see a problem. There are a lot of not-smart people and their money is just as good as yours but easier to get.


Exactly this, but it will still be a net negative for all of us. Why? Increasingly I have to argue with non-technical geniuses who have "checked" some complex technical issue with ChatGPT while lacking even the basic foundations in computer science. So you have an ever-increasing number of smartasses who think this technology finally empowers them, that they finally get to "level up" on that arrogant techie. And this will ultimately doom us, because, as we know, idiots are in the majority and they often overrule the few sane voices.


Sounds like a graph that was generated via AI. :)


Don't ask questions, just consume product.


also wondering this. had to pause the livestream to make sure i wasn't crazy. definitely eyebrow-raising


"GPT-5, please generate a slideshow for your launch presentation."


"Dang it! Claude, please ..."


it looks like the 2nd and 3rd bar never got updated from the dummy data placeholders lol.


someone copy pasted the 3rd bar to the 2nd


Probably generated by an LLM


Tufte used to call this creating a "visual lie": you don't start the y-axis at 0, you start it wherever maximizes the apparent difference. It's dishonest.
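As a rough illustration using the deception numbers from this thread (47.4 vs 50.0; the baseline of 45 is a hypothetical axis choice, not from the slide), truncating the y-axis inflates the apparent gap between two bars:

```python
def visual_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights for values a and b when the
    y-axis starts at `baseline` instead of 0."""
    return (a - baseline) / (b - baseline)

honest = visual_ratio(50.0, 47.4)           # axis starts at 0
truncated = visual_ratio(50.0, 47.4, 45.0)  # axis starts at 45
print(round(honest, 3), round(truncated, 3))
# a ~5% real difference gets drawn as if one bar were twice the other
```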


52 above 60 seems wrong whatever way you put it


AGI is launching; let's complain about the charts.


Any time now



