Calling things "slop" is just begging the question. The real differentiating factor is that, in the past, "human-generated slop" at least took effort to produce. Perhaps, in the process of producing it, the human notices what's happening and reconsiders (or even better, improves it such that it's no longer "slop".) Claude has no such inhibitions. So, when you look at a big bunch of code that you haven't read yet, are you more or less confident when you find out an LLM wrote it?
If you try to one-shot it, sure. But if you question Claude, point out the errors of its ways, tell it to refactor and ultrathink, and point out that two things have similar functionality and could be merged, you get much better results. It can write unhinged code with duplicate unused variable definitions that don't work, and it'll fix it up if you call it out, or you can just do it yourself. (Cue questions of whether, in that case, it would just be faster to do it yourself.)
I have a Claude Max subscription. When I think of bad Claude code, I'm not thinking about unused variable definitions. I'm thinking about the times you turn on ultrathink, allow it to access tools and negotiate its solution, and it still churns out an over-complicated yet only partially correct solution that breaks. I totally trust Claude to fix linting errors.
It's hard to really discuss in the abstract, though. Why was the generated code overly complicated? (I mean, I believe you when you say it was, but it doesn't leave much room for discussion.) Similarly, what's partially correct about it? How many additional prompts does it take before you a) use it as a starting point, b) use it because it works, c) don't use any of it and just throw it away, or d) post about why it was lousy to all of the Internet reachable from your local ASN?
I've read your questions a few times and I'm a bit perplexed. What kind of answers are you expecting me to give you here? Surely if you use Claude Code or other tools you'd know that the answers are so varying and situation specific it's not really possible for me to give you solid answers.
However much you're comfortable sharing! Obviously ideal would be the full source for the "overly complicated" solution, but naturally that's a no go, so even just more words than a two word phrase "overly complicated". Was it complicated because it used 17 classes with no inheritance and 5 would have done it? Was it overly complicated because it didn't use functions and so has the same logic implemented in 5 different places?
I'm not asking you, generically, about what bad code do LLMs produce. It sounds like you used Claude Code in a specific situation and found the generated code lacking. I'm not questioning that it happened to you, I'm curious in what ways it was bad for your specific situation more specifically than "overly complicated". How was it overly complicated?
Even if you can't answer that, maybe you could help me reword the phrasing of my original comment so it's less perplexing?
You're proposing a truism: if you don't get a good result, it's either because your query is bad or because the LLM isn't good enough to provide a good result.
Yes, that is how this works. I'm talking about the case where you're providing a good query and getting poor results. Claiming that this can be solved by more LLM conversations and ultrathink is cope.
I've claimed neither. I actually prefer restarting or rolling back quickly rather than trying to re-work suboptimal outputs - less chance of being rabbit holed. Just add what I've learned to the original ticket/prompt.
I have pretty much the same amount of confidence when I receive AI-generated or non-AI-generated code to review: my confidence is based on the person guiding the LLM, and their ability to do that.
Much more so than before, I'll comfortably reject a PR that is hard to follow, for any reason, including size. IMHO, the biggest change that LLMs have brought to the table is that clean code and refactoring are no longer expensive, and should no longer be bargained for, neglected or given the lip service that they have received throughout most of my career. Test suites and documentation, too.
(Given the nature of working with LLMs, I also suspect that clean, idiomatic code is more important than ever, since LLMs have presumably been trained on that, but this is just a personal superstition, that is probably increasingly false, but also feels harmless)
The only time I think it is appropriate to land a large amount of code at once is if it is a single act of entirely brain dead refactoring, doing nothing new, such as renaming a single variable across an entire codebase, or moving/breaking/consolidating a single module or file. And there better be tests. Otherwise, get an LLM to break things up and make things easier for me to understand, for crying out loud: there are precious few reasons left not to make reviewing PRs as easy as possible.
So, I posit that the emotional reaction from certain audiences is still the largest, most exhausting difference.
The code I've seen generated by others has been pretty terrible in aggregate, particularly over time as the lack of understanding and coherent thought starts to show. Quite happy without it thanks, haven't seen it adding value yet.
Or is the bad code you've seen generated by others pretty terrible, but the good code you've seen generated by others blends in as human-written?
My last major PR included a bunch of tests written completely by AI with some minor tweaking by hand, and my MR was praised with, "love this approach to testing."
I think maybe there's another step too - breaking the design up into small enough pieces that the LLM can follow it, and you can understand the output.
So do all the hard work yourself and let the AI do some of the typing, which you’ll then have to spend extra time reviewing closely in case its RNG factor made it change an important detail. And with all the extra up-front design, planning, instructions, and context you need to provide to the LLM, I’m not sure I’m saving on typing. A lot of people recommend going meta and having LLMs generate a good prompt and sequence of steps to follow, but I’ve only seen that kinda sorta work for the most trivial tasks.
Unless you're doing something fabulously unique (at which point I'm jealous you get to work on such a thing), they're pretty good at cribbing the design of things if it's something that's been well documented online (canonically, a CRUD SaaS app, with minor UI modification to support your chosen niche).
And if you are doing something fabulously unique, the LLM can still write all the code around it, likely help with many of the components, give you at least a first pass at tests, and enable rapid, meaningful refactors after each feature PR.
I don't really understand your point. It reads like you're saying "I like good code, it doesn't matter if it comes from a person or an LLM. If a person is good at using an LLM, it's fine." Sure, but the problem people have with LLMs is their _propensity_ to create slop in comparison to humans. Dismissing other people's observations as purely an emotional reaction just makes it seem like you haven't carefully thought about other people's experiences.
My point is that, if I can do it right, others can too. If someone's LLM is outputting slop, they are obviously doing something different: I'm using the same LLMs.
All the LLM hate here isn't observation, it's sour grapes. Complaining about slop and poor code quality outputs is confessing that you haven't taken the time to understand what is reasonable to ask for, and that you aren't educating your junior engineers on how to interact with LLMs.
> Perhaps, in the process of producing it, the human notices what's happening and reconsiders (or even better, improves it such that it's no longer "slop".)
Given the same ridiculously large and complex change: if it were handwritten, only a seriously insensitive and arrogant crackpot could, knowing what's inside, submit it expecting you to accept it without a long and painful process, rather than first improving it to the best of their ability. With LLM assistance, on the other hand, even a mildly incompetent but valuable colleague or contributor, someone you care about, might underestimate the complexity and cost of what they didn't actually write and believe that there is nothing to improve.
There's probably a difference in degree, however. Alopecia Areata is much less common, while regular male pattern baldness is very common.
There's also the fact that Alopecia Areata is actually more common in women, which I imagine exaggerates the distress compared to the more run of the mill MPB.
I realize you didn't mean to use a study on Alopecia Areata, but the difference in degree could be quite large.
It's also possible that people taking finasteride are a more strongly selected group of people who are distressed about hair loss, and are therefore more likely to exhibit depression, etc. As in, if people with androgenetic alopecia are more likely to be depressed, people who take finasteride may be a sampling of those who are distressed enough to seek and maintain treatment.
Additionally, the kind of person who would reach for prescription medication vs accepting hair loss may be predisposed to depression. I.e. this may be selecting for people who struggle with self-acceptance generally.
I also wonder whether there's some degree of placebo going on. Patients know finasteride is anti-androgenic; perhaps when they inevitably experience some symptoms associated with hypogonadism they assume the worst and lament the choice between having hair and feeling youthful. This would also explain why many who get off finasteride don't notice their symptoms improve.
Personal bias: I've taken finasteride for years with no side effects.
This is exactly why people thought isotretinoin (brand name Accutane) caused suicides (and required huge hurdles to access for years). It turns out that people suffering from physical disfigurements, such as acne, are more prone to suicide than the general population. Not sure if this is also true of androgenetic alopecia but it would hardly be surprising.
I don't think we're saying different things. People who are distressed about their appearance are more likely to be depressed, and people who seek medicine and surgeries are probably more distressed still, and therefore more likely to be depressed, and so on.
It did jump out at me that the paper repeatedly cites studies that found a correlation between finasteride and psychological side effects, and then talks about them as though they're evidence of causation.
I remember discussing with some coworkers a year(?) ago about autocomplete vs chat, and we were basically in agreement that autocomplete was the better feature of the two.
Since we've had Claude Code for a few months I think our opinions have shifted in the opposite direction. I believe my preference for autocomplete was driven by the weaknesses of Chat/Agent Mode + Claude Sonnet 3.5 at the time, rather than the strengths of autocomplete itself.
At this point, I write the code myself without any autocomplete. When I want the help, Claude Code is open in a terminal to lend a hand. As you mentioned, autocomplete has this weird effect where instead of considering the code, you're sort of subconsciously trying to figure out what the LLM is trying to tell you with its suggestions, which is usually a waste of time.
LSP giving us high-quality autocomplete for nearly every language has made simple LLM-driven autocomplete less magical. Yes, it has good suggestions some of the time, but it's not really revolutionary.
On the other hand, I love Cursor's autocomplete implementation. It doesn't just provide suggestions for the current cursor location; it also suggests where the cursor should jump next within the file. You change a function name and just press tab a couple of times to change the name in the docstring and everywhere else. Granted, refactoring tools have done that forever for function names, but now it works for everything. And if you do something repetitive, it picks up on what you are doing and turns it into a couple of quick keypresses.
I often ask Claude to scan through the code first and then come back with questions related to the task. It sometimes comes back with useful questions, but most of the time it acts like a university student looking for participation marks in a tutorial, choosing questions that signal understanding rather than ones that are actually helpful.
It's often said that formal verification moves bugs from the implementation to the specification, and there's also the problem of proving equality between the formally proven specification and the implementation.
The first problem is just hard; the second can be approached by using a language like F* or Dafny that compiles to executable code. IIRC Amazon has had some success using Lean for the specification and using fuzzing to test that the implementation matches it.
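To make the fuzzing idea concrete, here's a minimal sketch in Python with Hypothesis (not Amazon's actual Lean setup, and the function names spec_sort/impl_sort are made up for illustration): you keep an obviously-correct executable "spec" next to the real implementation and fuzz them against each other.

    # Fuzz an implementation against an executable "spec": the spec is a
    # deliberately naive but easy-to-trust version of the same function.
    from hypothesis import given, strategies as st

    def spec_sort(xs):
        # Executable specification: selection sort, written for clarity, not speed.
        xs = list(xs)
        out = []
        while xs:
            m = min(xs)
            xs.remove(m)
            out.append(m)
        return out

    def impl_sort(xs):
        # The "real" implementation we want to show agrees with the spec.
        return sorted(xs)

    @given(st.lists(st.integers()))
    def test_impl_matches_spec(xs):
        # Hypothesis generates many random inputs and checks agreement.
        assert impl_sort(xs) == spec_sort(xs)

    if __name__ == "__main__":
        test_impl_matches_spec()

It obviously doesn't give you a proof, but it does narrow the gap between "formally specified" and "what actually runs".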
It does say something that the models simultaneously:
a) "know" that they're not able to do it for the reason you've outlined (as in, you can ask about the limitations of LLMs for counting letters in words)
b) still blindly engage with the query and get the wrong answer, with no disclaimer or commentary.
If you asked me how many atoms there are in a chair, I wouldn't just give you a large natural number with no commentary.
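For reference, a minimal sketch of what the model actually gets to "see", assuming the tiktoken library and its cl100k_base encoding:

    # The model receives token ids, not letters, which is why letter
    # counting is an awkward fit. The exact split depends on the encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                              # a few token ids, not 10 per-letter symbols
    print([enc.decode([i]) for i in ids])   # sub-word chunks rather than individual letters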
A factor might be that they are trained to behave like people who can see letters.
During training they have no ability to not comply, and during inference they have no ability to choose to operate differently than during training.
A better test would be a pre-prompt or co-prompt requesting that they only answer questions about sub-token information if they believe they actually have reason to know the answer.
Your prompting suggestion would certainly make them much better at this task, I would think.
I think it just points to the fact that LLMs have no "sense of self". They have no real knowledge or understanding of what they know or what they don't know. LLMs will not even reliably play the character of a machine assistant: run them long enough and they will play the character of a human being with a physical body[0]. All of this points to "Claude the LLM" being just the mask it produces tokens through at first.
The "count the number of 'r's in strawberry" test seems to just be the easiest/fastest way to watch the mask slip. Just like that, they're mindlessly acting like a human.
Groq and Cerebras definitely have the t/s, but their hardware is tremendously expensive, even compared to the standard data center GPUs. Worth keeping in mind if we're talking about a $20 subscription.