“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as...

rco8786 · 2025-05-22T17:33:25 1747935205

It could be! But that's also what people said about all the models before it!

kmacdough · 2025-05-22T19:29:59 1747942199

And they might all be right!

> This tech could lead to...

I don't think he's saying this is the version that will suddenly trigger a Renaissance. Rather, it's one solid step that makes the path ever more promising.

Sure, everyone gets a bit overexcited each release until they find the bounds. But the bounds are expanding, and the need for careful prompt engineering is diminishing. Ever since 3.7, Claude has been a regular part of my process for the mundane. And so far 4.0 seems to take less fighting for me.

A good question would be when can AI take a basic prompt, gather its own requirements and build meaningful PRs off basic prompt. I suspect it's still at least a couple of paradigm shifts away. But those seem to be coming every year or faster.

sagarpatil · 2025-05-23T03:31:51 1747971111

Did you not see the live stream? They took a feature request for excalidraw (table support) and Claude 4 worked on it for 90 minutes and the PR was working as expected. I’m not sure if they were using sonnet or opus.

andrepd · 2025-05-23T11:15:34 1747998934

Pre-prepared demos don't impress me.

spiderfarmer · 2025-05-23T21:51:26 1748037086

By that logic sport athletes don’t impress you. Movies don’t impress you. Theater doesn’t impress you. Your date won’t impress you. Becoming a parent won’t impress you.

Most things in life take years of preparation.

xaoz · 2025-05-24T01:51:55 1748051515

The issue with preprepared demos is that you can carefully curate the scenario and make choices that you know are likely to show the best outcome, out of a wide range of possibilities. If you know your model (or VC product demo etc) performs poorly under certain conditions, you simply avoid them. This is a reason to be somewhat skeptical about demos.

rco8786 · 2025-05-24T12:50:04 1748091004

No, the logic would be that a football player's 40 time doesn't impress, because you want to see the on field performance. Or a movie trailer doesn't impress, because it's only meant to get you to watch the movie which might be trash. A first date won't impress you, it will take multiple dates to understand and love a human.

etc.

max_on_hn · 2025-05-22T18:49:38 1747939778

I am incredibly eager to see what affordable coding agents can do for open source :) in fact, I should really be giving away CheepCode[0] credits to open source projects. Pending any sort of formal structure, if you see this comment and want free coding agent runs, email me and I’ll set you up!

[0] My headless coding agents product, similar to “assign to copilot” but works from your task board (Linear, Jira, etc) on multiple tasks in parallel. So far simple/routine features are already quite successful. In general the better the tests, the better the resulting code (and yes, it can and does write its own tests).

troupo · 2025-05-23T07:48:04 1747986484

> I am incredibly eager to see what affordable coding agents can do for open source :)

Oh, we know exactly what they will do: they will drive devs insane: https://www.reddit.com/r/ExperiencedDevs/comments/1krttqo/my...

losvedir · 2025-05-23T19:32:46 1748028766

I dunno, looking through those issues I'd be more annoyed by all the randos grandstanding in my PRs.

troupo · 2025-05-23T20:15:09 1748031309

And not all the "fix this - i fixed - no you didn't - here's the fix — there's no fix" back and forth with the AI?

There's very little grandstanding in the comments. They are all very tame, all things considered.

dr_dshiv · 2025-05-23T07:25:32 1747985132

Especially since the EU just made open source contributors liable for cybersecurity (Cyber Resilience Act). Just let AI contribute and ur good

hn111 · 2025-05-23T07:57:57 1747987077

Didn’t they make an exception for open-source projects? https://opensource.org/blog/the-european-regulators-listened...

dr_dshiv · 2025-05-23T16:08:51 1748016531

“Anyone opensourcing anything while in the course of ‘commercial activity’ will be fully liable. Effectively they rugpulled the Apache2 / MIT licenses... all opensource released by small businesses is fucked. where the was no red tape now there is infinite liability”

This is my current understanding, from a friend not a lawyer. Would appreciate any insight from folks here.

spiderfarmer · 2025-05-23T21:53:06 1748037186

So you’re willfully spreading FUD in hopes someone will enlighten you?

dr_dshiv · 2025-05-23T22:58:36 1748041116

The law is real. What emotions would you suggest are appropriate?

spiderfarmer · 2025-05-24T04:41:28 1748061688

Open Source is exempt, provided you don’t make a profit: https://kevinboone.me/open_source_liability.html

dr_dshiv · 2025-05-24T08:30:36 1748075436

So it applies to anyone who figures out how to monetize open-source contributions. Seems like a major issue to me. Not exactly something that makes Europe a good place for tech.

spiderfarmer · 2025-05-26T05:07:53 1748236073

The US has the reputation that rich competitors will abuse the judicial system to sue you into bankruptcy. Still, a lot of people want to start their tech startup there.

andrepd · 2025-05-23T11:17:18 1747999038

Yeah, just the usual hn FUD about the EU.

throwaway894345 · 2025-05-23T16:22:40 1748017360

Can you reconcile that with this sibling comment?

https://news.ycombinator.com/item?id=44074070

I don’t have an opinion, just trying to make sense of contradictory claims.

ModernMech · 2025-05-22T16:59:38 1747933178

That's kind of my benchmark for whether or not these models are useful. I've got a project that needs some extensive refactoring to get working again. Mostly upgrading packages, but also it will require updating the code to some new language semantics that didn't exist when it was written. So far, current AI models can make essentially zero progress on this task. I'll keep trying until they can!

yosito · 2025-05-22T17:29:49 1747934989

Personally, I don't believe AI is ever going to get to that level. I'd love to be proven wrong, but I really don't believe that an LLM is the right tool for a job that requires novel thinking about out of the ordinary problems like all the weird edge cases and poor documentation that comes up when trying to upgrade old software.

9dev · 2025-05-22T17:44:51 1747935891

Actually, I think the opposite: Upgrading a project that needs dependency updates to new major versions—let’s say Zod 4, or Tailwind 3—requires reading the upgrade guides and documentation, and transferring that into the project. In other words, transforming text. It’s thankless, stupid toil. I’m very confident I will not be doing this much more often in my career.

mikepurvis · 2025-05-22T17:56:25 1747936585

Absolutely, this should be exactly the kind of task a bot should be perfect for. There's no abstraction, no design work, no refactoring, no consideration of stakeholders, just finding instances of whatever is old and busted and changing it for the new hotness.

maoberlehner · 2025-05-23T07:52:23 1747986743

It seems logical, but still, my experience is the complete opposite. I think that it is an inherent problem with the technology. "Upgrade from Library v4 to Library v5" probably heavily triggers all the weights related to "Library," which most likely is a cocktail of all the training data from all the versions (makes me wonder how LLMs are even as good as they are at writing code with one version consistently - I assume because the weights related to a particular version become reinforced by every token matching the syntax of a particular version - and I guess this is the problem for those kinds of tasks).

For the (complex) upgrade use case, LLMs fail completely in my tests. I think in this case, the only way it can succeed is by searching (and finding!) for an explicit upgrade guide that describes how to upgrade from version v4 to v5 with all the edge cases relevant for your project in it.

More often than not, a guide like this just does not exist. And then you need (human?) ingenuity, not just "rename `oldMethodName` to `newMethodName` (when talking about a major upgrade like Angular 0 to Angular X or Vue 2 to Vue 3 and so on).

dvfjsdhgfv · 2025-05-22T19:34:09 1747942449

So that was my conviction, too. However, in my tests it seems like upgrading to a version a model hasn't seen is for some reason problematic, in spite of giving it the complete docs, examples of new API usage etc. This happens even with small snippets, even though they can deal with large code fragments with older APIs they are very "familiar" with.

mikepurvis · 2025-05-22T20:06:39 1747944399

Okay so less of a "this isn't going to work at all" and more just not ready for prime-time yet.

cardanome · 2025-05-22T18:19:17 1747937957

Theoretically we don't even need AI. If semantics were defined well enough and maintainers actually concerned about and properly tracking breaking changes we could have tools that automatically upgrade our code. Just a bunch of simple scripts that perform text transformations.

The problem is purely social. There are language ecosystems where great care is taken to not break stuff and where you can let your project rot for a decade or two and still come back to and it will perfectly compile with the newest release. And then there is the JS world where people introduce churn just for the sake of their ego.

Maintaining a project is orders of magnitudes more complex than creating a new green field project. It takes a lot of discipline. There is just a lot, a lot of context to keep in mind that really challenges even the human brain. That is why we see so many useless rewrites of existing software. It is easier, more exciting and most importantly something to brag about on your CV.

Ai will only cause more churn because it makes it easier to create more churn. Ultimately leaving humans with more maintenance work and less fun time.

afavour · 2025-05-22T18:30:53 1747938653

> and maintainers actually concerned about and properly tracking breaking changes we could have tools that automatically upgrade our code

In some cases perhaps. But breaking changes aren’t usually “we renamed methodA to methodB”, it’s “we changed the functionality for X,Y, Z reasons”. It would be very difficult to somehow declaratively write out how someone changes their code to accommodate for that, it might change their approach entirely!

mdaniel · 2025-05-22T18:48:59 1747939739

There are programmatic upgrade tools, some projects ship them even right now https://github.com/codemod-com/codemod

I think there are others in that space but that's the one I knew of. I think it's a relevant space for Semgrep, too, but I don't know if they are interested in that case

yosito · 2025-05-23T00:20:25 1747959625

That assumes accurate documentation, upgrade guides that cover every edge case, and the miracle of package updates not causing a cascade of unforeseen compatibility issues.

crummy · 2025-05-23T01:27:51 1747963671

There might be a lot of prior work out there to train on though.

There's some software out there that's supposed to help with this kind of thing for Java upgrades already: https://github.blog/changelog/2025-05-19-github-copilot-app-...

MobiusHorizons · 2025-05-23T03:37:18 1747971438

Except that for breaking changes you frequently need to know why it was done the old way in order to know what behavior it ago have after the update.

csomar · 2025-05-23T04:43:52 1747975432

That's the easiest task for an LLM to do. Upgrading from x.y to z.y is for the most part syntax changes. The issue is that most of the documentation sucks. The LLM issue is that it doesn't have access to that documentation in the first place. Coding LLMs should interact with LSPs like humans do. You ask the LSP for all possible functions, you read the function docs and then you type from the available list of options.

LLMs can in theory do that but everyone is busy burning GPUs.

dakna · 2025-05-22T19:22:36 1747941756

Google demoed an automated version upgrade for Android libraries during I/O 2025. The agent does multiple rounds and checks error messages during each build until all dependencies work together.

Agentic Experiences: Version Upgrade Agent

https://youtu.be/ubyPjBesW-8?si=VX0MhDoQ19Sc3oe-

yosito · 2025-05-23T00:18:35 1747959515

So it works in controlled and predictable circumstances. That doesn't mean it works in unknown circumstances.

tmpz22 · 2025-05-22T17:07:28 1747933648

And IMO it has a long way to go. There is a lot of nuance when orchestrating dependencies that can cause subtle errors in an application that are not easily remedied.

For example a lot of llms (I've seen it in Gemini 2.5, and Claude 3.7) will code non-existent methods in dynamic languages. While these runtime errors are often auto-fixable, sometimes they aren't, and breaking out of an agentic workflow to deep dive the problem is quite frustrating - if mostly because agentic coding entices us into being so lazy.

mikepurvis · 2025-05-22T18:00:31 1747936831

"... and breaking out of an agentic workflow to deep dive the problem is quite frustrating"

Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".

Or better yet, the bot is able to recognize its own limitations and proactively surface these instances, be like hey human I'm not sure what to do in this case; based on the docs I think it should be A or B, but I also feel like C should be possible yet I can't get any of them to work, what do you think?

As humans, it's perfectly normal to put up a WIP PR and then solicit this type of feedback from our colleagues; why would a bot be any different?

dvfjsdhgfv · 2025-05-22T19:37:09 1747942629

> Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".

Still, the big short-term danger being you're left with code that seems to work well but has subtle bugs in it, and the long-term danger is that you're left with a codebase you're not familiar with.

mikepurvis · 2025-05-22T19:57:19 1747943839

Being left with an unfamiliar codebase is always a concern and comes about through regular attrition, particularly if inadequate review is not in place or people are cycling in and out of the org too fast for proper knowledge transfer (so, cultural problems basically).

If anything, I'd bet that agent-written code will get better review than average because the turn around time on fixes is fast and no one will sass you for nit-picking, so it's "worth it" to look closely and ensure it's done just the way you want.

jasonthorsness · 2025-05-22T17:43:12 1747935792

The agents will definitely need a way to evaluate their work just as well as a human would - whether that's a full test suite, tests + directions on some manual verification as well, or whatever. If they can't use the same tools as a human would they'll never be able to improve things safely.

soperj · 2025-05-22T19:48:40 1747943320

> if mostly because agentic coding entices us into being so lazy.

Any coding I've done with Claude has been to ask it to build specific methods, if you don't understand what's actually happening, then you're building something that's unmaintainable. I feel like it's reducing typing and syntax errors, sometime it leads me down a wrong path.

weq · 2025-05-23T04:25:55 1747974355

I can just imagine it now, you launch your AI coded first product and get a bug in production, and the only way the AI can fix the bug is to re-write and deploy the app with a different library. Your then proceed to show the changelog to the CCB for approval including explaining the fix to the client trying to explain its risk profile for their signoff.

"Yeh, we solved the duplicate name appearing the table issue by moving databases engines and UI frameworks to ones more suited to the task"

mewpmewp2 · 2025-05-23T12:31:50 1748003510

I think this type of thing needs agent which has access to the documentation to read about nuances of the language and package versions, definitely a way to investigate types, interfaces. Problem is that training data has so much mixed data it can easily confuse the AI to mix up versions, APIs etc.

epolanski · 2025-05-23T09:50:02 1747993802

> having package upgrades and other mostly-mechanical stuff handled automatically

Those are already non-issues mostly solved by bots.

In any case, where I think AI could help here would be by summarizing changes, conflicts, impact on codebase and possibly also conduct security scans.

BaculumMeumEst · 2025-05-22T16:59:36 1747933176

Anyone see news of when it’s planned to go live in copilot?

vinhphm · 2025-05-22T17:06:27 1747933587

The option just shown up in Copilot settings page for me

bbor · 2025-05-22T19:22:45 1747941765

Turns out Opus 4 starts at their $40/mo ("Pro+") plan which is sad, and they serve o4-mini and Gemini as well so it's a bit less exclusive than this announcement implies. That said, I have a random question for any Anthropic-heads out there:

GitHub says "Claude Opus 4 is hosted by Anthropic PBC. Claude Sonnet 4 is hosted by Anthropic 1P."[1]. What's Anthropic 1P? Based on the only Kagi result being a deployment tutorial[2] and the fact that GitHub negotiated a "zero retention agreement" with the PBC but not whatever "1P" is, I'm assuming it's a spinoff cloud company that only serves Claude...? No mention on the Wikipedia or any business docs I could find, either.

Anyway, off to see if I can access it from inside SublimeText via LSP!

[1] https://docs.github.com/en/copilot/using-github-copilot/ai-m...

[2] https://github.com/anthropics/prompt-eng-interactive-tutoria...

Workaccount2 · 2025-05-22T20:16:53 1747945013

Google launched Jules two days ago, which is the gemini coding agent[1]. I was pretty quickly accepted into the beta and you get 5 free tasks a day.

So far I have found it pretty powerful, its also the first time an LLM has ever stopped while working to ask me a question or for clarification.

[1]https://jules.google/

l1n · 2025-05-22T19:57:40 1747943860

1P = Anthropic's first party API, e.g. not through Bedrock or Vertex

Barbing · 2025-05-30T19:38:35 1748633915

Interesting, first link changed now:

"Claude Opus 4 and Claude Sonnet 4 are hosted by Anthropic PBC and Google Cloud Platform."

They also mention:

"GitHub has provider agreements in place to ensure data is not used for training."

They go on to elaborate. Perhaps this kind of offering instills confidence in some who might not trust model providers 1:1, but believe they will respect their contract with a large customer like Microsoft (GitHub).

BaculumMeumEst · 2025-05-22T17:09:03 1747933743

Same! Rock and roll!

denysvitali · 2025-05-22T17:49:59 1747936199

I got rate-limited in like 5 seconds. Wow

minimaxir · 2025-05-22T17:17:46 1747934266

The keynote confirms it is available now.

jasonthorsness · 2025-05-22T17:43:53 1747935833

Gotta love keynotes with concurrent immediate availability

brookst · 2025-05-22T19:59:58 1747943998

Not if you work there

echelon · 2025-05-22T20:16:15 1747944975

That's just a few weeks of DR + prep, a feature freeze, and oncall with bated breath.

Nothing any rank and file hasn't been through before with a company that relies on keynotes and flashy releases for growth.

Stressful, but part and parcel. And well-compensated.

brookst · 2025-05-23T01:39:20 1747964360

Sometimes. When things work great.

Sometimes you just hear “BTW your previously-soft-released feature will be on stage day after tomorrow, probably don’t make any changes until after the event, and expect 10x traffic”

phito · 2025-05-23T19:52:38 1748029958

I don't see how a LLM could do better than a bot, eg renovate

ed_elliott_asc · 2025-05-22T20:44:02 1747946642

Until it pushes a severe vulnerability which takes a big service doen