GitHub Copilot – Lessons (medium.com/sids-tech-cafe)
32 points by sidcool on Aug 4, 2024 | 25 comments


There have been many exaggerated claims surfacing across Twitter and the blogosphere. What I found to resonate best with my own experience is that Copilot is okay on simple tasks, but actually misguides, confuses, and in fact breaks your flow of thought on anything beyond that. I like that sort of smart autocompletion and snippet retrieval for boilerplate and template code, but for the business logic proper it's just awful. Either it doesn't get it at all and wants to insert some crap (and the IDE isn't always helpful in flagging how distracting that is), or it hallucinates something that seems to fit at first sight, but moments after you accept the suggestion you have to undo it, because you realize how far off it is and how it misrepresents the domain entities or logic. It's as if a junior dev suddenly injected that code into your shared file buffer in a pair programming session while you were focusing. So while Copilot autocomplete is on, you always have to analyze whatever a gung-ho quasi-junior dev knee-jerks into your file buffer.

As many have witnessed, Copilot Chat has been just terrible. I've given up my hopes for the current wave of AI evolution.

What is uniform across all LLMs is the lack of nuance. Even when they manage to generate some domain-specific output that does make sense, there's always a lack of detail and nuance to the domain, regardless of how you try to trick your prompts into retrieving something from a very specific context. I'm impressed by how much easier it's become for me to get a digest of longer texts, but at the same time I'm disappointed by the quality of the results. It is very rare that I get what I actually ask for. It's like talking to a mid-level consultant who pretends to know everything but whose output is rather questionable, and you just give up and seek to end the meeting.


>As many have witnessed, Copilot Chat has been just terrible. I've given up my hopes for the current wave of AI evolution.

Hey, I'd love to hear more about why you think Copilot Chat performs poorly.

I personally had a very similar experience and eventually decided to build a solution for it while in YC (shameless plug but I wrote about the Copilot bugs that annoyed me here: https://docs.double.bot/copilot)


> ...okay on simple tasks, but actually misguides, confuses, and in fact breaks your flow of thought on anything beyond that.

Beyond the LLMs themselves, don't you think this is related to the data used for training? I mean, this echoes what you find as a developer in the wild: you can find answers to "all" the simple things, far fewer concrete answers to complex stuff, and almost nothing for very complex stuff and/or specific programming domains.


I think half the problem is the form factor of using it as an autocomplete.

I might instinctively know what to write, hit tab, and miss that what Copilot actually wrote isn't what I wanted to write. For example, it might add to a list when I meant to subtract from it, and my quick scan doesn't catch it. So I miss it at first, bump into a bug when writing tests, and spend more time fixing it than if I had just typed it out to begin with.
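To make that concrete, a trivial hypothetical (the list and the suggestion are made up, not a real Copilot completion):

    prices = [3, 5, 8]

    # What I meant to write: drop the stale entry.
    # prices.remove(5)

    # What a plausible completion inserts instead, one tab away:
    prices.append(5)

    print(prices)  # [3, 5, 8, 5] -- wrong, and easy to miss on a quick scan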


For me, Copilot and its sisters have been like self-driving cars: a slightly more advanced IDE autocomplete, analogous to a slightly more advanced cruise control. But it's far from Level 5 self driving and it's not obvious whether we will ever reach that.

You still have to keep your hands on the wheel and you need driving expertise. But since writing uncommitted code has less disastrous potential consequences, it is much more usable.

I still don't believe the people who claim they are using AI to write or rewrite entire codebases. Maybe for the first version of toy projects. But I've yet to see anyone using AI to automatically write entire features that span an enterprise software codebase.


I'm the author of aider, an AI pair programming tool. I use aider to develop aider, and it keeps track of how much of its own code it writes.

Aider wrote 58% of the code in the last release, and >40% of the previous few.

The release history page [0] plots this stat for each release over the last 12+ months. The overall trend is pretty cool, especially since Claude 3.5 Sonnet.

It’s not an enterprise code base, but it’s not a toy code base either.

[0] https://aider.chat/HISTORY.html


I think what many people miss is how much better long-context LLMs are getting (needle-in-a-haystack retrieval) and how important context is. With GitHub Copilot or with Continue for VS Code, the main issue lies in how they decide which context to provide: ideally it would be a graph of all function calls and class instantiations, so that whenever your cursor is in a particular spot you could hop through the graph and gather all the important code pieces as context (sketched below).

Currently Continue uses vector search over chunks, which is just a crutch. I'm not sure what Copilot or Aider does, but the right context is key.
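To make the call-graph idea concrete, here is a minimal Python sketch using the stdlib ast module (the function names and the fixed-depth traversal are my own simplification, not how any of these tools actually work):

    import ast

    def build_call_graph(source: str) -> dict[str, set[str]]:
        """Map each function to the names of the functions it calls directly."""
        graph: dict[str, set[str]] = {}
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                graph[node.name] = {
                    call.func.id
                    for call in ast.walk(node)
                    if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
                }
        return graph

    def context_for(function: str, graph: dict[str, set[str]], depth: int = 2) -> set[str]:
        """Collect everything reachable from `function` within `depth` hops."""
        seen: set[str] = set()
        frontier = {function}
        for _ in range(depth):
            frontier = {callee for f in frontier for callee in graph.get(f, set())} - seen
            seen |= frontier
        return seen

Given the function enclosing the cursor, context_for would tell the assistant which other definitions to pull into the prompt, instead of guessing via embeddings.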

Another way to improve a coding assistant is to move away from simple paradigms to agent-based workflows that can deploy sub-agents and break down tasks, all in the background autonomously while you code, and then surface suggestions or changes. This will become possible with increasing inference speeds (Groq) and better agent frameworks.
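As a sketch of what that could look like (llm here is a hypothetical stand-in for whatever model API you would actually call, Groq or otherwise):

    from concurrent.futures import ThreadPoolExecutor

    def llm(prompt: str) -> str:
        """Hypothetical stand-in for a real model call."""
        raise NotImplementedError

    def plan(task: str) -> list[str]:
        # Ask the model to break the task into small independent sub-tasks.
        reply = llm(f"Break this coding task into small independent steps:\n{task}")
        return [line.strip() for line in reply.splitlines() if line.strip()]

    def suggest(task: str) -> list[str]:
        # Fan the sub-tasks out to sub-agents in the background and collect
        # their proposed changes for the developer to review.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(
                lambda step: llm(f"Propose a code change for this step:\n{step}"),
                plan(task),
            ))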

Most devs at our company say LLMs are useless for coding and that GitHub Copilot is a glorified autocomplete costing 20 USD. I think the tech and ecosystem will improve a lot over time, and they underestimate LLM abilities due to their bias.


To your point: check out Supermaven. I have no affiliation. But it has way more context than Copilot and is way better at suggestions that make sense in the codebase.


I kinda wonder what the percentage would be for my use of autocomplete. I only ever write out variable names once, the other times they are completed. Like I write "wacz" and it sees it matches with WArehouseZoneController and inputs that.

I probably type less than half the code I commit, even without AI assistance?


Certainly. That statistic is misleading without knowing what exactly it measures. The author no doubt knows it. I am surprised you're the only person on HN pointing it out.


Ok you’ve convinced me to give Aider a try. But is there a way to feed it documentation of a particular library or API? I want it to read the Stripe docs for me for example and then implement a feature.


A team in our company did an actual refactoring for a customer using some generative AI. They claimed it gave them an estimated 10-15% speed boost - but then again, people who'd try this out probably also buy into the hype and will say they got a speed boost regardless. Also, there was obviously no control group to compare against, so not much gained there.


I'm not dismissing their achievement, but estimation is the most complicated thing in programming, almost as difficult as solving P=NP, so I would be very cautious when someone claims this.


> Copilot (that uses GPT 4 underneath)

I'm not sure this is true. Copilot Chat uses GPT-4o, but I've never seen a clarifying statement that the more immediate inline Copilot uses anything more than a GPT-3 variant.

You would think that if that was the case, it would have been heavily sold.

The thing is, if people using Copilot are assuming GPT-4 but actually seeing the results of GPT-3, it goes a significant way toward explaining why they are sometimes underwhelmed or disappointed.

I'm happy to be corrected with a link that specifically clarifies that inline Copilot uses 4+.


Every time I see someone extolling the virtues of LLM code generators I can't help but have the opinion that they must be shitty developers.

Copilot, Cody, JetBrains' AI thing. All of them are helpful tools for a small set of tasks, like a super-charged code completion tool. But nothing more than that.


My favorite thing to use them for is JSONSchema/TypeScript interface code generation. The alternative for large objects is that I might just YOLO the code. So Copilot’s stupid but fast typing skillset is pretty useful. These kinds of translation tasks are unsurprisingly an area where a language model performs very well.
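For anyone who hasn't done that chore by hand: it's a purely mechanical translation, roughly the toy Python version below applied to a sample object (a real JSON Schema converter handles far more cases, like optional fields and unions):

    import json

    PRIMITIVES = {str: "string", bool: "boolean", int: "number", float: "number"}

    def ts_type(value) -> str:
        """Map a JSON sample value to a rough TypeScript type."""
        if value is None:
            return "null"
        if isinstance(value, list):
            return (ts_type(value[0]) if value else "unknown") + "[]"
        if isinstance(value, dict):
            fields = "; ".join(f"{k}: {ts_type(v)}" for k, v in value.items())
            return "{ " + fields + " }"
        return PRIMITIVES[type(value)]

    def to_interface(name: str, sample: dict) -> str:
        body = "\n".join(f"  {k}: {ts_type(v)};" for k, v in sample.items())
        return f"interface {name} {{\n{body}\n}}"

    sample = json.loads('{"id": 7, "tags": ["a", "b"], "meta": {"active": true}}')
    print(to_interface("Item", sample))
    # interface Item {
    #   id: number;
    #   tags: string[];
    #   meta: { active: boolean };
    # }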


Yep, and this is a great time saver, but more importantly, a focus and morale saver. There's a definite benefit to being freed from toil, from boring stuff like that.


We use Copilot as fancy auto-complete. Our original hopes for it were more than that, but it hasn't been up to the task. Yes, it solves leetcode as the article mentions, but that is next to useless for most of what our developers do. What isn't useless is how good it is at replacing code snippets, especially because it's very transferable between developers: they no longer build up an archive of personal snippets, or at least not as many of them. So it's much easier to onboard new developers and get them to be productive than it was before Copilot.

I don't think anyone at our shop has high hopes for LLMs in programming beyond efficiency anymore. I'd like to see GitHub Copilot head in a direction where it's capable of auto-updating documentation such as JSDoc when functionality changes. LLMs are already excellent at writing documentation for "good" code, but the real trick is keeping it up to date as things change (see the sketch below). I know this is also a change-management issue, but in the world where I mainly work, the time to properly maintain things isn't always prioritized by the business at large. That obviously costs the business down the line, often grievously so, but as long as "IT" has very little pull in many organisations, it's just the state of things. I'd personally love for them to get better at writing and updating tests, but so far we've been far less successful with that than the author has.
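The detection half of that seems tractable already. Here's a rough Python sketch (my own, not an existing Copilot feature) that hashes each function body so you would know which docs to regenerate; the LLM rewrite step would hang off stale_docs:

    import ast
    import hashlib

    def body_hashes(source: str) -> dict[str, str]:
        """Hash each function's source so staleness can be detected later."""
        return {
            node.name: hashlib.sha256(ast.unparse(node).encode()).hexdigest()
            for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.FunctionDef)
        }

    def stale_docs(old: dict[str, str], new: dict[str, str]) -> list[str]:
        """Functions whose implementation changed since docs were last written."""
        return [name for name, digest in new.items()
                if name in old and old[name] != digest]

Naively this also fires when only the docstring changes; a more careful version would strip docstrings before hashing.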

As far as efficiency and quality go, our in-house measurements point in two directions. For inexperienced developers, quality has dropped with the use of LLMs, which in our house comes down entirely to how employees (and this is not just developers) tend to trust LLMs more than they would trust search results. So much so that a lot of AI usage has basically been banned from the wider organisation by the upper decision makers, because quality is important in what we do. Yes, I know this is ironic when you look at how they prioritize IT in an organisation where 90% of our employees use a computer 100% of their working hours. Anyway, as far as efficiency goes, there are two sides. When Copilot is used as fancy auto-complete we see an increase in work output across every kind of developer; however, when it's used as a "sparring partner" we see a significant decrease. We don't have the resources to do a lot of pair programming, and a couple of developers might do direct sparring on computation challenges for 1-2 hours a week. They are free to do more, and they aren't punished for it as we don't do any sort of hourly registration of work, but 1-2 hours is the average. Sometimes it increases if they are dealing with complex business processes or if we're onboarding someone new.

> Copilot is very useful to scan existing code for any errors or missed edge cases

Aside from tests, I think this is the one part of the article we really haven't seen borne out in our (very anecdotal) testing. But maybe that's down to us still learning how to adopt it properly, or a difference in coding style? Anyway, almost all of our errors aren't in the actual code but rather in a misrepresentation, misunderstanding, or unreported change of business logic, and this has been the area where LLMs have been the weakest for us.


I've only used Copilot with the option to use open source code disabled. It's taken the boredom out of dealing with boilerplate-heavy code and manual copy/paste - stuff you could already handle with templates, snippets, and keyboard macros, of course - but as you say, it's not really much good for anything else, and I've seen plenty of code that is hard to review because an LLM created it.

In terms of what the article calls skill atrophy, this is probably why I limit my usage to snippets on steroids. I tried GPT-4 more directly a few times, and while it was all impressive at the start, it's all surface level, and the hallucinations are bad but incredibly subtle.


Well, this seems to highlight that modifying code is harder than creating it, and yet again the LLM does the easy thing okay.


I am curious. Any reason why this post went from the front page to 156+ rank in a matter of minutes?


My whole thing with AI is, "But what if it's wrong?" At least with GitHub Copilot, I can correct the autocompleted output. I am less convinced about other ideas, such as having computers built solely with LLMs as an OS.


OP here. Happy to hear any feedback on the content and presentation of the post.


Good post, thanks for engaging.


Happy to.



