>executing large-scale changes in entire repositories in 3 years
You can look at SWE-Agent: it solved 12 percent of the GitHub issues in its test dataset. It probably depends on your definition of large-scale.
This will get much better; it is a new problem with lots of unexplored details, and we will likely get GPT-5 this year, which according to Altman is supposed to be a jump in performance similar to the one from 3.5 to 4.
This is a laughable definition of large-scale. It's also a misrepresentation of the situation: it was 12% of issues in a dataset drawn from the repositories of the top 5,000 PyPI packages. Further, "solves" is an incredibly generous word, so I'm assuming you didn't read the source or any of the attempts to use this service. Here's one where it deletes half the code and replaces the network handling with a comment saying to handle network handling: https://github.com/TBD54566975/tbdex-example-android/pull/14...
"This will get much better" is the statement I've been hearing for the past year and a half. I heard it 2 years ago about the metaverse. I heard it 3 years ago about DAOs. I heard it 5 years ago about blockchains...
What I do see is a lot more lies. It turns out things are zooming along at the speed of light if you only read headlines from sponsored posts.
We unfortunately have no idea what they consider a success! That's just one of the most recent attempts, by some random user who wanted to use the program in the real world.