
Watching the livestream now, the improvement over their current models on the benchmarks is very small. I know they seemed to be trying to temper our expectations leading up to this, but this is much less improvement than I was expecting


I have a suspicion that while the major AI companies have been pretty samey and competing in the same space for a while now, the market is going to force them to differentiate a bit, and we're going to see OpenAI bow out of the race toward extremely high levels of intelligence, instead choosing to focus on justifying their valuations by optimizing for cost and for conversational/normal-intelligence/personal-assistant use cases. After all, most of their users just want to use it to cheat at school, get relationship advice, and write business emails. They also have Ive's company to continue investing in.

Meanwhile, Anthropic & Google have more room in their P/S ratios to continue to spend effort on logarithmic intelligence gains.

Doesn't mean we won't see more and more intelligent models out of OpenAI, especially in the o-series, but at some point you have to make payroll and reality hits.


I think this is pretty much what we've already seen happening, in fact.


> I know they seemed to be trying to temper our expectations leading up to this

Before the release of the model Sam Altman tweeted a picture of the Death Star appearing over the horizon of a planet.


Is he suggesting his company is designed with a womp-rat-sized opening that, if you shoot into it, makes the whole thing explode?


You know, I used to bullseye small thermal exhaust ports in my T16 back home, they're not much smaller than womp rats.


You know, I used to bullseye T16s in my womp rat back home, they're not much bigger than thermal exhaust ports.


lol


He also said he had an existential crisis that he was completely useless now at work.


Good that he finally came to the realization lol


Law of diminishing returns.

We’re talking about less than a 10% performance gain, for a shitload of data, time, and money investment.


I'm not sure what "10% performance gain" is supposed to mean here; but moving from "It does a decent job 95% of the time but screws it up 5%" to "It does a decent job 98% of the time and screws it up 2%" to "It does a decent job 99.5% of the time and only screws it up 0.5%" are major qualitative improvements.
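
To make the "qualitative" point concrete, here's a rough back-of-envelope sketch of how per-response reliability compounds over a multi-step task. The 20-step chain and the independence assumption are mine, purely for illustration, not from any benchmark:

    # Per-step reliability compounds quickly over a chain of steps.
    # The step count (20) and independence are illustrative assumptions.
    for per_step in (0.95, 0.98, 0.995):
        print(per_step, round(per_step ** 20, 3))
    # 0.95  -> ~0.358 overall success
    # 0.98  -> ~0.668
    # 0.995 -> ~0.905

In other words, halving the per-response error rate can be the difference between a multi-step workflow that mostly fails and one that mostly succeeds.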


Yeah I think that throwing more and more compute at the same training data produces smaller and smaller gains.

Maybe quantum compute would be a significant enough computing leap to meaningfully move the needle again.


What exactly is being moved? It's trained on human data; you can't make code better than what's already been written out there by humans.


Some think it's possible; I don't. We agree, actually.


GPT-5 is #1 on WebDev Arena with +75 pts over Gemini 2.5 Pro and +100 pts over Claude Opus 4:

https://lmarena.ai/leaderboard


This same leaderboard lists a bunch of models, including 4o, beating out Opus 4, which seems off.


In my experience Opus 4 isn't as good for day to day coding tasks as Sonnet 4. It's better as a planner


"+100 points" sounds like a lot until you do the ELO math and see that means 1 out of 3 people still preferred Claud Opus 4's response. Remember 1 out of 2 would place the models dead even.


That eval hasn't been relevant for a while now. Performance there just doesn't seem to correlate well with real-world performance.


What does +75 arbitrary points mean in practice? Can we come up with units that relate to something in the real world?


Also, the code demos are all using GPT-5 MAX on Cursor. Most of us will not be able to use it like that all the time. They should have shown it without MAX mode as well.


Sam said maybe two years ago that they want to avoid "mic drop" releases, and instead want to stick to incremental steps.

This is day one, so there is probably another 10-20% in optimizations that can be squeezed out of it in the coming months.


Then why increment the version number here? This is clearly styled like a "mic drop" release but without the numbers to back it up. It's a really bad look when comparing the crazy jump from GPT-3 to GPT-4 to this slight improvement with GPT-5.


GPT-5 was highly anticipated and people have thought it would be a step change in performance for a while. I think at some point they had to just do it and rip the bandaid off, so they could move past 5.


Maybe it's time to switch to year-based versioning, or to increment by an integer for every small new feature like everyone else does.


Honestly, I think the big thing is the sycophancy. It's starting to reach the mainstream that ChatGPT can cause people to 'go crazy'.

This gives them an out. "That was the old model, look how much better this one tests on our sycophancy test we just made up!!"


Because it is a 100x training-compute model over 4.

GPT-5.5 will be a 10x compute jump.

4.5 was 10x over 4.


Even worse optics. They scaled the training compute by 100x and got <1% improvement on several benchmarks.


It is almost as if there's a documented limit on how much you can squeeze out of autoregressive transformers by throwing compute at them.
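
Roughly, yes: the published scaling-law fits are power laws in compute, so each constant multiple of compute buys a shrinking absolute improvement. A toy sketch, with an exponent I picked only for illustration (loosely in the range of the Kaplan et al. compute fits; nothing about GPT-5's actual training run is public):

    # Toy power-law scaling: relative loss ~ compute ** -alpha.
    # alpha here is an illustrative assumption, not a measured value for any OpenAI model.
    alpha = 0.05
    for mult in (10, 100):
        print(mult, round(mult ** -alpha, 3))
    # 10x compute  -> loss falls to ~0.891 of baseline
    # 100x compute -> loss falls to ~0.794 of baseline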


Is 1% relative to more recent models like o3, or the (old and obsolete at this point) GPT-4?


It was relative to the number the comment I replied to included. I would assume GPT-5 is nowhere near 100x the parameters of o3. My point is that if this release isn't notable because of parameter count, nor (importantly) performance, what is it notable for? I guess it unifies the thinking and non-thinking models, but this is more of a product improvement, not a model improvement.


The fact that it unifies the regular model and the reasoning model is a big change. I’m sure internally it’s a big change, but also in terms of user experience.

I feel it’s worthy of a major increment, even if benchmarks aren’t significantly improved.


Claude Code already does that. It is an improvement, but not a big change in any way.


Well yeah, but it’s a major break from the previous slate of OpenAI models. What else were they going to call it that makes any sense? o4o?


He said that because even then he saw the writing on the wall that LLMs would plateau.


> Sam said maybe two years ago that they want to avoid "mic drop" releases, and instead want to stick to incremental steps.

He also said that AGI was coming in early 2025.

People who can't stop drinking the Kool-Aid are really becoming ridiculous.


The hallucination benchmarks did show major improvement. We know existing benchmarks are nearly useless at this point. It's reliability that matters more.


I’m more worried about how they still confidently reason through things incorrectly all the time, which isn’t quite the same as hallucination, but it’s in a similar vein.


Yeah, people never do that. Or at least I don't. I don't know about you.


There’s a name for the fallacy you’re using.


I'm sure I'm repeating someone else, but it sounds like we're coming over the S-curve.


My thought exactly.

Diminishing returns.

... here's hoping it leads to progress.


It is at least much cheaper and seems faster.

They also announced gpt-5-pro but I haven't seen benchmarks on that yet.


I am hoping there is a "One more thing" that shows the pro version with great benchmark scores


I mean that's just the consequence of releasing a new model every couple of months. If OpenAI had stayed mostly silent since the GPT-4 release (like they did for most iterations) and only now released 5, then nobody would be complaining about weak gains in benchmarks.


If everyone else had stayed silent as well, then I would agree. But as it is right now they are juuust about managing to match the current pace of the other contenders. Which actually is fine, but they have previously set quite high expectations. So some will probably be disappointed at this.


Well, it was their choice to call it GPT-5 and not GPT-4.2.


It is significantly better than 4, so calling it 4.2 would be rather silly.


Is it? That's not super obvious from the results they're showing.


Yes it is, if we're talking about the original GPT-4 release or even GPT-4o. What about the results they've shown is not obvious?


I see incremental improvements in almost all domains?


If they had stayed silent since GPT-4, nobody would care what OpenAI was releasing as they would have become completely irrelevant compared to Gemini/Claude.



