> Except in very minor cases, duplication is virtually always worth fixing.
I disagree with the severity of this, and would posit that there are duplications that can't be "fixed" by an abstraction.
There are many instances I've encountered where two pieces of code coincided to look similar at a certain point in time. As the codebase evolved, so did the two pieces of code, their usage and their dependencies, until the similarity was almost gone. An early abstraction that would've grouped those coincidentally similar pieces of code would then have to stretch to cover both evolutions.
A "wrong abstraction" in that case isn't an ill-fitting abstraction where a better one was available, it's any (even the best possible) abstraction in a situation that has no fitting generalization, at all.
Agreed. Abstractions also tend to be more resistant to change, both at a technical level and at a social level.
At a technical level an abstraction will have more call sites to worry about in different contexts, the more wrong the initial abstraction the harder it will be to change.
The social level is maybe even more problematic. Abstractions seem more important than calling code and will experience more friction in code review. This change friction can also increase with the "wrongness" of the initial abstraction: the starting point makes less sense, so a reviewer needs to work harder to understand the context. If the abstraction is gnarly enough, it's possible that the reason for the abstraction is almost obscured. Even someone who knows _how_ it works might have lost the forest for the trees and push back on changes that simplify or improve it, if the change is a sufficiently large departure from the initial state. In this case you'll often see small incremental changes get accepted more easily, but that just makes the shared code a bit gnarlier for next time.
This is my beef with naively applied DDD, separation of concerns, and design patterns.
Usually what happens is the 'clean' code ideal comes first, and then the implementation is squeezed into it. This then informs the organisation (or architecture) of the rest of the codebase and your software design has become a matter of putting pegs into the right-shaped holes.
I have never found that kind of highly abstracted code easier to work with than some simple procedural alternative that is easy to delete and easy to refactor, so long as effort was put into writing it well.
Of course, the patterns have a purpose and do help when used nicely - a lot of code you write will fall into some of those patterns even without you explicitly mentioning it. It's just...doing it for the sake of it is a problem.
The pile of abstractions stacks to the moon but when you chase down what's actually happening a 60 file repo ends up holding like 12 lines of actual programming.
> At a technical level an abstraction will have more call sites to worry about in different contexts, the more wrong the initial abstraction the harder it will be to change.
As I recently called it, infrastructure and systems lose agility as they gain dependency and move down the stack.
If you have like 1 customer and they have good retries, honestly: fuck everything. Deploy master, in fact, deploy every keystroke to prod. It'll be fine.
At the same time, about 30k - 40k FTEs of our B2B customers depend on one of my Postgres instances during business hours, and about twice that during different holiday seasons. Honestly? Nothing touches the system-level settings of these database systems unless we have pondered a change for 2 weeks. And even then we will schedule an approved change over 4 weeks across applicable postgres clusters. The carnage a bad change at this level can cause is ridiculous enough that we don't risk it.
I try to think about whether two concepts are innately similar or incidentally similar. Computing compounding interest for a home equity loan and a mortgage might be innately similar: a desired change to one will probably imply a desired change to the other. Computing growth of a fruit fly population and computing compounding interest for a loan might be incidentally similar. That holds right up until you change your "computeExponentialGrowth" function to handle occasional decimations from environmental sources, and anyone looking at the code wonders what the heck that means for a loan.
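To sketch that in Python (the function name and the decimation parameter are made up, purely to illustrate the drift):

    # Shared because loans and fruit flies both "grow exponentially"...
    def compute_exponential_growth(principal, rate, periods, decimation_rate=0.0):
        value = principal
        for _ in range(periods):
            value *= (1 + rate)
            # Added for the fruit-fly model; meaningless for a mortgage,
            # but every loan call site now has to know to pass 0.0.
            value *= (1 - decimation_rate)
        return value

    loan_balance = compute_exponential_growth(250_000, 0.004, 360)
    fly_population = compute_exponential_growth(100, 0.3, 52, decimation_rate=0.1)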
If you've got your abstractions correct, then the exponential growth term and the decimation term will be partial differentials which will compose together nicely
Duplication can sometimes be useful, for instance if you have many small variations on a central process. Trying to make one process with all the edge cases baked in leads to overly-complex, hard to reason about, expensive software.
In my experience, the right way to handle this sort of situation is to create a functional mini-DSL for the process that handles all the implementation details, then create a "default" process which serves as a template. If a process needs slightly different logic, just copy the template, update the DSL to support any new logic, and update the template with the new DSL statements. This approach lets you give semantic meaning to implementation details, and you can see where all the different custom logic is at a glance by looking at all the template copies. As long as the template is only calling out to DSL actions with no internal logic of its own and process flow is correctly encapsulated in the DSL, you should never need to update templates to change behavior, only update the DSL.
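A rough Python sketch of the shape I mean (the DSL verbs and the "invoice" variant are invented for illustration):

    # The "DSL": small, single-purpose actions that hide the implementation details.
    def fetch(source):
        return list(source)                      # stand-in for real I/O

    def validate(rows, schema):
        return [r for r in rows if r is not None]

    def transform(rows, rules):
        return [f"{rules}:{r}" for r in rows]

    def publish(rows, target):
        target.extend(rows)

    # The "default" template: nothing but DSL calls, no logic of its own.
    def default_process(source, target):
        rows = fetch(source)
        rows = validate(rows, schema="standard")
        rows = transform(rows, rules="standard")
        publish(rows, target)

    # A variant: copy the template and adjust the DSL calls it makes.
    def invoice_process(source, target):
        rows = fetch(source)
        rows = validate(rows, schema="invoice")
        rows = transform(rows, rules="invoice")
        rows = transform(rows, rules="eu-vat")   # the one extra step this variant needs
        publish(rows, target)

The behavior lives in the DSL; each template reads as a plain description of its variant.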
Instead of one gigantic function with 50 parameters, you have 100 "template" functions that all make use of 60 different "helper" functions (what you're calling the DSL).
Instead of castles-of-logic abstraction, it's nuts-and-bolts or grass-roots abstractions. I've never come across a name for this development style.
But it generally works extremely well when building processes for tens/hundreds of data formats or customers or what have you.
Though it's less true today and in languages that are not C or Fortran. Even something like C++ or Java has the template method pattern, which gets you 80% of the way there. Dynamic languages like Python or Ruby tend to have pretty reasonable facilities for building DSLs, as do more modern languages like Scala and Rust.
This is generally my approach to data ingress/egress (ETL)... I'd rather have a hundred similar, small scripts for each data source than try to create one complex (monstrosity) application to handle them all.
So, to maintain your project, it's not sufficient for me to know language X -- I also have to learn a bunch of domain-specific sub-languages, which any developer on the project may have implemented in response to an arbitrary problem, which almost certainly has no kind of formal specification or stability guarantee, and which can change over time in any way without warning? No sir.
"So, to maintain your project, it's not sufficient for me to know language X -- I also have to learn a bunch of functions, and understand the business processes that are used to create the abstractions, and trust code from my coworkers with just unit tests and type checking? No sir."
It seems like you're conflating language-level concepts like syntax, semantics, grammar, etc. with program-level concepts like types, functions, names, etc. But these are two categorically different things.
I think one example where duplication > abstraction is in tests. I personally find tests that have a ton of extra helper classes/functions to do stuff like set up fixtures or do assertions to be painful to deal with. Taken to an extreme you end up with a mini test framework that obscures the actual test cases and is as hard to understand as the code in question.
I'm not against shared test fixtures or some utility functions, but IMHO, it's better to have some duplication but clearer tests.
> I personally find tests that have a ton of extra helper classes/functions to do stuff like set up fixtures or do assertions to be painful to deal with.
I think it depends on the context. For example, I typically agree, but when I was writing authz tests [0], I ended up writing a DSL so that 1) I'd be more inclined to write the thousands and thousands of tests, and 2) I'd be able to focus on the actual authz assertion and not on verbose setup.
I couldn't imagine writing those policy tests without that abstraction. I would have lost my mind with all of the repetition, and would have almost assuredly made mistakes.
Thank you for the link. This is inspiring. Do you have any resources you could link to that would explain some or all of the style for these tests? The general approach I mean.
Honestly, I don't have many resources to provide. I read a lot of policy tests via GitHub search (e.g. path:spec/policies/*/*.rb), but couldn't find anything that looked like what I wanted. I wrote the DSL as-needed in order to fully test my app's authz while migrating from Pundit to ActionPolicy (while also introducing fine-grained permissions).
It's not the prettiest when you actually look beneath the covers [0], but it does what I wanted -- providing an easy way to write exhaustive authz tests. Without the DSL, I probably wouldn't have written the tests. The PR for said migration was massive [1], and was a prerequisite to going open source [2].
I like it when you have nice, composable utility functions. Ideally each test contains a short preamble setting up the appropriate context for the test to run. The preamble elucidates what the tests are actually testing. It can also serve as documentation on how to use those functions.
There will probably be some duplication across tests, but if the utility functions are idempotent/composable, they're usually pretty easy to read/understand and equally mechanical to write/update.
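A quick pytest-style sketch of what I mean (can_approve and the builders are made-up stand-ins):

    # The function under test (trivial stand-in for illustration).
    def can_approve(user, order):
        return user["admin"] or order["total"] < 1000

    # Small, composable builders; each test's preamble documents its own context.
    def make_user(name="alice", admin=False):
        return {"name": name, "admin": admin}

    def make_order(user, total=100):
        return {"user": user["name"], "total": total, "status": "new"}

    def test_admin_can_approve_any_order():
        admin = make_user(admin=True)
        order = make_order(make_user("bob"), total=5000)
        assert can_approve(admin, order)

    def test_large_orders_need_admin_approval():
        clerk = make_user("carol")
        order = make_order(make_user("bob"), total=5000)
        assert not can_approve(clerk, order)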
I would add that you should duplicate the common, cross-cutting setup (eg. faked/mocked dependencies that don't matter), but make the test conditions themselves explicit.
You get a feel for the correct granularity the more tests you write within the codebase. If you try to be too clever in saving boilerplate, you'll cause pain for future modifications and maintainers. Sometimes fixing "clever" tests takes longer than the code change itself.
Given a long enough timeline, every abstraction turns wrong.
The answer isn't to not abstract, the answer is to tear it out when it turns wrong. That was actually the original point of the popular article that mainstreamed this view - that we shouldn't be afraid of tearing abstractions out, not that we shouldn't make them in the first place. Most people just read headlines though.
The resistance to tearing out a bad abstraction isn't just cultural: combining two different functions into one is a lossy operation, which makes splitting an abstraction harder than creating it in the first place.
While the functions are distinct the call sites are self-documenting. You know which calls are for which purpose because the names are different. After combining them to deduplicate the code, you've lost that information, and to disentangle the abstraction now requires you to infer and reintroduce that lost information.
It's not that it can't be done, but there is real friction that doesn't just exist in people's heads.
It feels like the same thread you're describing, but I guess it's pulling on the other end of it. It's thinking about how to name things in a way that makes it easier to see that the implementations might diverge later, and simplify actually doing so (by preserving more of this intentional context).
Tearing out an abstraction requires a lot of knowledge about the abstraction and how it is used throughout the system. Over time, statistically zero maintainers will have that level of knowledge about any non-trivial system.
In contrast, duplicated code may be annoying, but it is usually simple to understand and maintain by anyone with knowledge of the project and the programming language.
Leaky abstractions are way way way worse than no abstraction at all.
A good example of this is operations type stuff, like the pile of shell scripts or terraform files or whatever that get used to deploy your app. These scripts benefit greatly from a one to one relationship between the thing you're creating and the written text describing it. Not having a situation where changing one thing breaks everything else is a huge help there.
> An early abstraction that would've grouped those coincidentally similar pieces of code would then have to stretch to cover both evolutions.
This seems to be the underlying assumption behind most uses of the "duplication is cheaper than the wrong abstraction" quote, but the assumption is simply incorrect. You should almost never try to expand abstractions in this manner. If you don't treat the abstractions relating to the thing you want to change in your codebase as "the" place where you need to make your change, and instead eagerly make new abstractions and throw old ones away as required, you won't really run into this problem.
In fact, this predominant mindset where creating abstractions is strongly discouraged leads to the very problem it's based on. Because creating abstractions carries a stigma, junior developers and the like will simply modify the existing abstraction instead of creating new ones when appropriate, producing the aforementioned kind of mess where abstractions become complicated through repeated modification.
Additionally, if someone has made a "wrong" abstraction based on something silly like two pieces of code simply being similar in terms of their structure and those use cases start to drift apart, you should feel eager to simply split apart the abstraction, be it into bare implementations or two new abstractions, or any other combination. Abstractions are cheap as long as you don't give them special significance.
I think there's a middle ground here. The original quote does not mean DRY=bad, abstraction=bad. The point is there is a non-zero cost to these things. A bad abstraction can, as you say, accumulate to something terrible through inertia or inexperience. A bad abstraction, even if caught early, was probably not worthwhile - I mean, it took time just to make the original one right? This does not mean that we should be scared of abstraction in general, but in my opinion abstractions that are purely for the sake of reducing duplication should be viewed with an extra level of apprehension.
When an abstraction evolves to a point where it needs to be split into two separate implementations to meet diverging needs…
you will need to replace that abstraction with duplication.
Which is the right thing to do because that duplication is cheaper than maintaining the wrong abstraction.
I think this post makes the mistake of thinking that the only way in which duplication comes up is that it is discovered in the codebase, and we have the choice of abstracting it away or keeping it.
On the contrary, duplication can - and should - be consciously introduced to fix bad abstractions when we find them in the codebase.
> When an abstraction evolves to a point where it needs to be split into two separate implementations to meet diverging needs…
you will need to replace that abstraction with duplication.
Hard disagree. When the formerly common parts of an abstraction evolve to no longer be common, then that duplication no longer exists. There now exists two abstractions, one for each of the diverging needs. There may be some leftover commonality that can be abstracted out, but it's no longer the original abstraction.
The point is that they were never actually common in the first place, only superficially similar.
You're saying we should look for duplications, abstract them, and then every time a change needs to be made to the abstraction to suit only one of the use cases, refactor the codebase to de-abstract and re-duplicate, undoing the work we did in the name of DRY in the first place.
That is a lot more work and a lot more confusion and a lot more headache for maintainers and reviewers than copy-pasting the thing the first time, having realized that the duplication was incidental, not structural.
Let's take this line of reasoning to its extreme:
I notice that there's a section of my code that's repeated twice where we add one to a value, so I abstract it into a function called add1(x:int). Some time later, at places where add1 is used we sometimes need to actually add a value other than one, so we need to make a decision: do we refactor everything and re-duplicate, or do we stick to the DRY principle and make our abstraction more accommodating? The path of least resistance is to stick to DRY because it's a smaller and more comprehensible commit, so we add an optional arg, add1(x: int, operand?: int). Some time later one of the callers to this function needs to pass a vector instead of a single value, so we need our add1 function to have polymorphism and conditional logic in it now, and potentially more arguments. Sooner or later we have a frankenfunction that's hundreds of lines long and branches a bazillion ways and might as well be a Turing machine in itself.
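A sketch of that slide in Python (all names and parameters invented):

    # Step 1: innocent deduplication.
    def add1(x):
        return x + 1

    # Step 2: "just one optional arg".
    def add1(x, operand=1):
        return x + operand

    # Step 3: one caller passes vectors, another wants clamping...
    def add1(x, operand=1, elementwise=False, clamp=None):
        if elementwise:
            result = [xi + operand for xi in x]
            if clamp is not None:
                result = [min(r, clamp) for r in result]
        else:
            result = x + operand
            if clamp is not None:
                result = min(result, clamp)
        return result
    # ...and the name "add1" no longer describes anything the function does.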
> You're saying ... refactor the codebase to de-abstract and re-duplicate, undoing the work we did in the name of DRY in the first place.
That's the exact opposite of what I'm advocating for, but perhaps I didn't express myself well.
> Sooner or later we have a frankenfunction that's hundreds of lines long and branches a bazillion ways and might as well be a turing machine in itself.
Yeah, that's not a good abstraction, and not at all what I meant.
To some extent I agree, though I don't think DRY means to remove all similar looking lines of code and put that behind a procedure. Generic code vs abstractions are different.
Instead, any given task (which already is an abstraction) should exist in only one place. That is DRY, I would paraphrase it to mean any given abstraction should be done in one place (and combine with SRP to say further that one place should only do that one abstraction)
If one place can be updated independently of another, it argues it is not the same task to begin with. DRY'ing that code is a misnomer IMHO; instead that code is being put behind a procedure and is being made generic (and not necessarily more abstract. Abstracting hides details, putting a block of code behind a procedure with full parameterization is not hiding details, it's just a procedure [and let us hark back to the days of procedural programming and the ways that can become a mess])
DRY and SRP (single responsibility principle, AKA the DnD principle) need to be considered together.
> I don't think DRY means to remove all similar looking lines of code and put that behind a procedure.
Applying DRY primarily to structural duplication never occurred to me, and only this discussion brought this way of thinking to my attention. It's always been about semantic duplication to me. Often, but not necessarily, semantically duplicate code has structurally duplicate lines of code.
But now I think I understand why some junior programmers I consulted for denied that their system was rampant with the duplication I was seeing. To them, the code was different because it looked different; that it all was doing more or less the same thing, but with different inputs, seemed lost on them. I have a vague recollection of one of them saying something about "parameterizing" the code, and then dismissing it.
I'm going to have to dig through my notes from that gig to see if I can better clarify to myself the implications of the focus on only structural duplication and why it might lead a programmer to overlook very obvious opportunities for removing semantic duplication, or not understand how to fix it even if they do see it.
ETA: I looked waaaay back to the ur-discussion on the c2 wiki and found this: "It's okay to have mechanical, textual duplication (the equivalent of caching values: a repeatable, automatic derivation of one source file from some meta-level description), as long as the authoritative source is well known." https://wiki.c2.com/?DontRepeatYourself
"Every piece of knowledge must have a single, unambiguous, authoritative representation within a system".
What piece of knowledge does your add1 function represent? I don't think your strawman is actually DRY (which is a problem - damp strawmen can mildew).
I do agree that there are sometimes tradeoffs that can make a less DRY approach a better one, but not all deduplication is "DRY".
Every element of data, core capability, or logical object should have a single authoritative source. Having multiple db connections or ui controllers would, in fact, be a horrible nightmare, obviously.
But not all code is knowledge, some of it's just boring work that does something to some local stuff and it doesn't necessarily need to be sharable with other parts of the code base, just because the logic looks kinda similar.
Right. My quote was the original formulation of DRY as a principle, from The Pragmatic Programmer, and I think it's a good principle.
The common (mis?)understanding of DRY as primarily syntactic both overshoots and undershoots. As many here have discussed, it can lead to combining things you shouldn't (I jokingly call this "Huffman coding"), but it can also fail to recommend combining knowledge when it is represented differently. If I'm saying "there is a button here" in my HTML and in my CSS and in my JS, that's three places for that piece of knowledge even if those three places don't look anything alike. Changing the CSS to "here is what my buttons look like" and the JS to "here is how my buttons behave" would be DRYer.
In many cases you cannot see the correct abstraction without introducing the duplication back. When working with particularly messy code I often do sort of https://en.wikipedia.org/wiki/Karnaugh_map of important variable states to see what actually happens before I can refactor it.
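Roughly, in Python terms (the flags and the legacy function are stand-ins):

    from itertools import product

    # A stand-in for the tangled function under study.
    def legacy_behaviour(is_admin, is_trial, has_quota):
        if is_admin:
            return "allow"
        if is_trial and not has_quota:
            return "deny"
        return "allow" if has_quota else "queue"

    # Enumerate every state combination and record what actually happens.
    for state in product([False, True], repeat=3):
        print(*state, "->", legacy_behaviour(*state))

Dumping the full truth table back out lets you see which states actually matter before deciding how to restructure.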
This is basically introducing the duplication back.
Whether you keep the duplicated code or refactor it in a different way is another question, what matters for the "duplication is cheaper than wrong abstraction" to be true is just the fact that by introducing abstraction early you wasted time refactoring one way and back. Refactoring isn't free. So in fact leaving the duplication there would have been cheaper - Q.E.D.
It doesn't mean you should never risk it, but it does mean you should think hard before you do it.
In that situation, the correct thing to do is, when the two pieces drift away from each other, to recognize that they are no longer the same abstraction and to break the connection. That may be painful - you have to look at everywhere that abstraction is used and figure out which thing it really is, and change the code to reflect it.
But if that's going to happen, then in the early days, a little duplication was probably better.
> There are many instances I've encountered where two pieces of code coincided to look similar at a certain point in time. As the codebase evolved, so did the two pieces of code, their usage and their dependencies, until the similarity was almost gone. An early abstraction that would've grouped those coincidentally similar pieces of code would then have to stretch to cover both evolutions.
Then you split that abstraction again. It's very cheap and very quick.
Many people talk about the issue like it was an absolute in the code, but that's the wrong approach. If you end up writing 4 functions that are the same, by all means, merge it into one.
If then you need to add a parameter only this code path uses and rest doesn't care about, by all means split it back. Moving blocks of code around is cheap.
I think the key here is the oft repeated but often poorly understood maxim to favor composition ("has a") over inheritance ("is a").
If you have a mixin (or other means of composition) that you use in several places and one diverges, it's easy to remove it. If you use inheritance, it's going to be more painful.
A language that offers OOP via prototypes instead of classes like JS can (sometimes) give you the best of both worlds, but it will confuse a lot of devs who aren't familiar with that kind of OO design.
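A rough Python sketch of the contrast (the classes are invented for illustration):

    # Inheritance: Report *is a* CsvExportable; dropping that later means
    # touching the hierarchy and every subclass that relied on it.
    class CsvExportable:
        def to_csv(self):
            return ",".join(str(v) for v in self.fields())

    class Report(CsvExportable):
        def fields(self):
            return [1, 2, 3]

    # Composition: the report *has an* exporter; when this report diverges,
    # swap or remove the collaborator without disturbing anything else.
    class CsvExporter:
        def export(self, fields):
            return ",".join(str(v) for v in fields)

    class ComposedReport:
        def __init__(self, exporter=CsvExporter()):
            self.exporter = exporter

        def fields(self):
            return [1, 2, 3]

        def to_csv(self):
            return self.exporter.export(self.fields())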
Splitting the abstraction is never cheap and quick, mostly because of politics. With duplicated code you often can assign a single responsible owner to each duplication.
However, once abstracted, the code may suddenly be used by a number of different teams. You will need to get this work on their roadmap, increasing the friction to get this done. In many companies, this will also end up in endless discussions about the new approach.
The solution there would be to make the abstraction "opt-in", such that a team can elect to duplicate or abstract as desired. It also helps if the "main" abstraction is itself composed from smaller abstractions, from which downstream teams could then pick and choose rather than having to either fully abstract or fully duplicate.
This is a good point. Following Conway's Law, a team may choose to duplicate code or do things theoretically sub-optimally simply to avoid having to deal with other teams.
You're absolutely right that it's important to look beyond how two modules superficially look right now, and look instead at how they change. However, if you've always defined your abstractions based on what their consumers need rather than what their implementations have in common, then you shouldn't ever need to stretch them. They're not trying to "cover" both cases, they're trying to solve a problem that both cases have. Your two cases are not implementations of the abstraction, they are consumers of it. If one case grows to not have that problem, it just stops asking for that abstraction. If it grows to have more problems, it just asks for more abstractions. The original abstraction, if based on a common need, doesn't have to change.
That's not to say abstractions never change -- they do. But they change because your understanding of the sub-problem they're solving has changed, not because their implementations or consumers have changed.
Sometimes duplication is cheaper than the wrong abstraction.
And
Sometimes it's better to abstract away a duplication rather than let it lie.
And that's the mark of becoming a master at the craft. Being able to recognize all of these various slight permutations of state and what to do about them.
Rules of thumb really need to be told like this, or they will be misused - either by newbies who don't know any better, or by unpleasant programmers who will shove their dogmatic beliefs down your throat with the common wisdom as an excuse.
As a FYI, just as it's OK to abstract away duplication in code, it's OK to do the opposite, remove abstraction and add duplication.
So in your particular case, it could have been possible to abstract away the code at that point in time and once they diverge, remove the abstraction and duplicate, then adjust one of the duplicates (which no longer is a proper duplicate really).
> As a FYI, just as it's OK to abstract away duplication in code, it's OK to do the opposite, remove abstraction and add duplication.
> So in your particular case, it could have been possible to abstract away the code at that point in time and once they diverge, remove the abstraction and duplicate, then adjust one of the duplicates (which no longer is a proper duplicate really).
This sounds nice in theory, but the reality is that the effort required to make these two kinds of changes is not symmetric. It's about 10 times easier to get a PR approved and merged that combines similar looking code into a function than vice versa. If you have any suspicion at all that an abstraction you're making may need to be removed and duplicated in the future, you're better off just never abstracting in the first place.
It sucks pushing a change which unwinds an abstraction like that through code review. It's usually a lot easier to just never abstract it in the first place.
I buy into the same belief as you here, but I guess you could easily argue that you could create a suitably fitting abstraction earlier on with the understanding that you can "detach" them once the point comes where they're fundamentally different.
The point of abstraction is to reduce the number of concepts in play. If you're still tracking which old concept is "really" being used every time, you haven't actually abstracted over anything, you're just naming things badly.
> The point of abstraction is to reduce the number of concepts in play.
I'm not sure I agree with this. For me, the point of abstraction is to divide the concepts between the layers you introduce, effectively hiding concepts from the layers where you don't want to have to care about them. Oftentimes an abstraction adds to the total number of concepts at play, but hides them beneath/above the layers.
The problem is that there's an impetus to continue working on top of established facilities, because it's usually incrementally less work than reworking a piece of code into something else. Plus it's difficult to recognize ahead of time when something is about to become a problem, rather than fix something that's already a problem.
Also, you become a better programmer if you write duplicate code and then learn how to abstract it for cases that make sense. I also don't believe that dupe code is always a bad thing. Like everything else in software engineering, IT DEPENDS.
I think this is an example that highlights abstract vs generic code. An abstraction should have hidden those evolutions entirely, which potentially means there would have been two code paths behind the abstraction. Moving that logic to a generic procedure with full parameterization I wouldn't call an abstraction (it's code that has been made generic). Generic code is more complex. DRY is not about making everything that can be generic, generic - it's about making sure a single thing is done in one place (and only that one place, DRY & SRP go together).
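A small sketch of that distinction in Python (names invented): the abstraction hides which path runs, while the generic procedure makes every caller steer it.

    # Abstraction: callers depend on "send a receipt" and never see the divergence.
    class ReceiptSender:
        def send(self, order): ...

    class EmailReceiptSender(ReceiptSender):
        def send(self, order):
            print(f"emailing receipt for {order}")

    class PostalReceiptSender(ReceiptSender):
        def send(self, order):
            print(f"printing and mailing receipt for {order}")

    # Generic code: one procedure, fully parameterized; nothing is hidden,
    # and every caller has to know about every option.
    def send_receipt(order, by_email=True, postal_address=None):
        if by_email:
            print(f"emailing receipt for {order}")
        elif postal_address:
            print(f"mailing receipt for {order} to {postal_address}")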
My problem is that writing a function to remove duplication often brings up the question of where to put it. If it's only called inside one module, it doesn't matter really. But if not, you've created a dependency. Which is bad.
I think how much you hate that may depend on your language and the program. Some big enterprise Java monolith is a garbage dump of thousands of small files. So who cares. In C, with no namespaces and with headers to manage, you care more.
The problem is that "duplication is cheaper than the wrong abstraction" is basically an excuse that lazy devs use not to engineer their code.
The other one I hear a lot is "it's not realistic to reach 100% test coverage / type safety" when submitting code with `any` all over it and zero tests.
One of the biggest problems with deduplication is you can end up with shared code that's full of corner case handling for different situations. Then your nice shared example becomes a tangled rats nest you can't unravel.
You got a good point about code evolution. Has anyone taken a look at it from a biological perspective? Seems like such problems can occur in genetics and nature might have come up with some tricks we can use
> An early abstraction that would've grouped those coincidentally similar pieces of code would then have to stretch to cover both evolutions.
In that case, my takeaway would be that it ain't the abstraction itself that's wrong, but the unwillingness to get rid of it (or decompose it) when it no longer serves its purpose.
So... you kept modifying the two similar pieces of code until they became dissimilar. Why do you think that you wouldn't be able to modify the abstraction if you saw that it doesn't fit anymore?
I think part of the issue here is that a fair number of programmers work in shops where they have very limited agency. They are tasked with making the minimum defensible change to add a feature or fix a bug. They are not allowed to change the tests or suggest refactoring. So those things just don't occur.
I see, but wouldn't this lack of agency, or, more precisely the inability to escalate the problem to someone with more edit rights be the actual issue?
I could see this also happening in situations where such problems arise at component interface boundaries, where the pressure not to change comes not as an organizational policy but rather inability to influence external components (eg. the OS offers a poor abstraction -- the user-space program developer wouldn't be changing how OS works to rectify things). But, this is again sort of an administrative problem, because, ideally, the user-space program developer should be able to convince the OS developer to change the interface if it's found to be not a good abstraction...
But, yes, I can see how in practice that'd be a very difficult thing to do.
I guess I was trying to claim that in a lot of places there isn't such a person.
yeah - if an application company runs up against a poorly designed OS interface - they usually aren't aware, or just back away slowly. they don't have the scope or mandate to pursue it.
that kind of behavior often extends to library dependencies, and even internal interfaces that the company should ostensibly own.
it's not so much an administrative problem as a desire to limit spending and time on software and do a kind of agile mvp glue job over an unbounded number of external dependencies. that often leads to an unmaintainable hairball. but if the alternative is to hire a bunch of really experienced developers and let them 'do the right thing' for 5 years...
even if you do, there is a still a good chance that the output isn't gonna be that great. software people lost a lot of credibility with that model. we've probably swung too far in the other direction
I might have to say some unkind things here, but statements like:
instead of “duplication is cheaper than the wrong abstraction”, I would say “duplication is cheaper than confusing code littered with conditional logic”.
seem to be looking at this problem from an extremely narrow context.
The truth is that the phrase "wrong abstraction" is (more or less) unquantifiable, which makes the original phrase, as employed, sort of like a koan. It addresses the very human tendency to see patterns in noise, and our ability to "transmit" such hallucinations to other humans via natural language and other means.
The closest I can get to - given my at-best-apprentice status as a formal programmer - is the quantitative test I developed for CCS (conditional content systems), where the abstraction lies in the SNS[1], and the de-duplication mechanism is applicability[2]. Since each applicability statement carries its own overhead, there's a limit on how much "abstraction" the model can take before it's using quantitatively more keystrokes than duplication.
The test goes like this: take the flat text procedures for ALL the configurations, and add it together. Now, take the conditionalized, applicability-laden procedure that unifies those procedures, and measure its file size. If the latter is LARGER than the former, then you're using the wrong SNS/applicability model for rolling up this content.
Thing is, this is inevitable if you throw enough dissimilar configurations at a CCS, because each configuration has its own overhead, and eventually that outpaces the content itself.
You can address this in a bunch of ways - like adding a containing pseudo-product that has all the configurations inside of it - but the actual real Product Management might not let you build on the applicability like that, because the Product itself isn't sold that way. Any other abstraction isn't available to you, because in the end this is natural language, which - unlike structured language - resists first order abstractions really well. This is one of those instances where, yes, the abstraction of the SNS/Applicability is worse - quantifiably - than duplication. All that complexity would be better handled via version control fork/branch relationships - far outside of the realm of natural language.
[1] standard numbering system, a sort of numeric designator of functional systems, the primary way that content is designated as semi-independent modules.
[2] conditional "chunks" that turn on and off depending on the applicability statement
It'd be wonderful if we could measure the utility of software engineering choices by counting keystrokes or measuring file sizes or putting them in a turbo encabulator and seeing which one has more modial interaction with its magneto-reluctance. Unfortunately, reality is just too complicated, with far too many tradeoffs to be balanced. I'd recommend deep thought and discussion about the domain over looking at a graph of your codebase's sinusoidal repleneration.
> All that complexity would be better handled via version control fork/branch relationships
Holy smokes, my turbo-sarcasmo detector just broke! But yeah, that's more or less the TLDR of my point. The phrase "wrong abstraction" does some heavy lifting, but it's not a bad concept, even if largely a qualitative one. No one should use a single metric to toss ginormous architecture decisions - they're tools to inform educated judgement, not replace it.
Re: fork/branch shenanigans, no, you're right, that's not an optimal way to handle variance... in a normal programming language. In the context of natural language, it's not the same kettle of fish, because, well, lots of reasons, probably the most prominent being the "messy unidirectionality" of NL that's all mish-mashed with its extremely complex grammar vs constructed languages. Chopping up giant documents into tiny pieces a la CCS[1] systems has made this a stew of problems, but for some reason Leadership is fond of the idea. It's not unlikely that specialized on-prem LLMs are going to nuke the CCS concept from orbit in the next five years, except for those cases where the CCS is a contractual requirement for doing the work.
The saying 'duplication is cheaper than the wrong abstraction' is a gem of a saying, but like many pieces of wisdom, takes experience to fully understand.
I first saw the saying when DRY was being applied without any nuance. If a piece of code appeared in two places, it was obvious, and important, to factor it out, because that was 'good coding practice'.
The saying being discussed was pushback against that kneejerk, thoughtless application of DRY. The 'cheaper than the wrong abstraction' is pointing out that DRY isn't a 'no tradeoff' policy. By factoring out any duplication, many uses pass through the same code. If the uses don't quite match, there is a tendency for the code to get modified to fit them anyway. This, over time, makes the shared code simultaneously unfit for use, and widely used. A recipe for poor code quality and system health. Ironically, this is the outcome that DRY was called in to address.
The most important thing to note about DRY is that it's not about code -- it's about knowledge. You should not repeat knowledge -> logic, constants, etc. If the temperature is 87 and the price of the widget is 87 that is coincidence and not repetition.
There should just be one source of truth for any logic or process. If you duplicate that then bad things will eventually happen.
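A tiny Python sketch of the distinction (values invented):

    # Two facts that happen to share a value today; merging them into one
    # constant would couple the thermostat to the price list.
    MAX_SAFE_TEMPERATURE_F = 87
    WIDGET_PRICE_USD = 87

    # One fact used in two places; this is the knowledge DRY says to keep single.
    SALES_TAX_RATE = 0.0825

    def invoice_total(subtotal):
        return subtotal * (1 + SALES_TAX_RATE)

    def refund_total(subtotal):
        return subtotal * (1 + SALES_TAX_RATE)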
> You should not repeat knowledge -> logic, constants, etc.
It's easy to shoot yourself in the foot with that. I spent a few years working with binary formats, where you have lots of constants (byte offsets, flags, etc).
There were two approaches to dealing with this:
1) Some people just put the raw numbers in the code, and if it wasn't clear added a comment
2) Other people used constants for everything, defined somewhere else, often through multiple layers of abstraction.
If you follow DRY, then you should always choose 2). But in my experience, this often makes the code extremely hard to read. You often have to look through a dozen header files to find the constants used in a 5 line bit of code, and it becomes really hard to reason about the code.
And in the end you almost always still need some duplication, because you have other files where you can't use the constants (eg. in sample files, docs, external libraries, etc.)
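A sketch of the two styles for a made-up format (offsets and field names invented):

    import struct

    # Style 1: raw numbers where they're used, with a comment.
    def read_record_count_v1(data):
        return struct.unpack_from("<I", data, 0x10)[0]   # record count lives at offset 0x10

    # Style 2: everything behind named constants, possibly defined far away.
    MAGIC_LEN = 8
    VERSION_LEN = 4
    FLAGS_LEN = 4
    RECORD_COUNT_OFFSET = MAGIC_LEN + VERSION_LEN + FLAGS_LEN

    def read_record_count_v2(data):
        return struct.unpack_from("<I", data, RECORD_COUNT_OFFSET)[0]

Both read the same bytes; the argument is about which version is easier to check against a hex dump.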
I don't understand? Why do you need to know the value of the constant if the name is descriptive enough?
X = 5; // File header offset
vs.
X = FILE_HEADER_OFFSET;
I don't think there's much value in the idea that just because someone can implement something badly, it validates the whole idea. You should always use constants instead of raw values if they have some meaning. If someone, somehow, manages to make that so complicated that it's a pain it's still not the fault of the concept.
> And in the end you almost always still need some duplication, because you have other files where you can't use the constants
And those will invariably end up out of date and incorrect at some point.
When I'm looking at a hex dump, FILE_HEADER_OFFSET doesn't tell me where to start looking for the header. So I need to open a new tab in my source code editor, find the header file where FILE_HEADER_OFFSET was defined, only to figure out that it is defined as FILE_MAGIC_LENGTH + RESERVED_FLAG_LENGTH, and FILE_MAGIC_LENGTH is defined as LENGTH(FILE_MAGIC_LITERAL), and FILE_MAGIC_LITERAL is not defined in the source code, because it is provided as a compiler flag that is generated by the Configure script, and there are three conflicting definitions of RESERVED_FLAG_LENGTH, and you are not sure which one is the correct one.
Get a better editor. I can mouse over FILE_HEADER_OFFSET anywhere it's used and get the value.
But still, you're describing a bit of a wild setup. If the value of FILE_HEADER_OFFSET is 42, how did you come to calculating the value? What if MEMORY_BLOB_OFFSET is also coincidentally 42? If you have dozens of source files filled with literal magic values that's completely unmaintainable and inscrutable.
A constant doesn't have to be in a separate file if it's only used in one place.
Results may vary and depend on the code in question as well as the language you are using.
We - a former team a couple of years ago using Java - started to duplicate code in Java, because we were totally tired of interface'ing and class'ing everything away that was not DRY. It became too tedious to bloat code with them, as well as having to understand whole classes when all you got was references to other interfaces etc.
If there is a small service architecture like in Angular with TypeScript, abstracting away becomes fun and useful.
It all depends. But what I really do not miss is the pile of interfaces in Java and C#. These became so tough to grasp and entangled, that we DRY'ed this cesspool. DRY on DRY so to say.
So your issue was with the nature of the language and the size of the project more than the application of DRY?
I think I see what you're getting at, but I've certainly also seen very large Java projects that are simple at a high level and composed in such a way that they're still legible without a ton of duplication. These might be somewhat orthogonal concepts.
I think it’s down to the systems, and I think the people who favour abstraction often forget who needs to write it. Duplication isn’t just cheaper than the wrong abstraction, it’s cheaper than almost any abstraction. Not because it should be, mind you, but because duplication works for a tired Thursday afternoon programmer and abstraction doesn’t. Maybe it’s because I spent some time in management, but a key concept I worked with when I did that was how we have two modes of mental capacity. One where we have the energy and wit to do the right thing, and one where we haven’t slept for a week, and, well… it’s Thursday afternoon after a day of too many useless meetings.
I think the best way I saw it put was for a Theme-park to coin a slogan that any employee would be able to find inspiration in when dealing with a customer on that Thursday afternoon. To me most abstractions are similar to having a slogan along the lines of “Think Different”, which is an absolutely useless concept when you’re tired and dealing with an angry customer in your summer job about an hour before you clock out.
I obviously don’t think you should avoid all abstraction. The author of the article is right, theoretically at least, it’s just that this way of thinking rarely works out. Similar to you, my experience is that it tends to fail after a few years of changing needs.
These days I favour abstraction only when its use is never altered in the slightest. For everything else duplication is so much easier to handle over 5+ year periods. Of course there are many ways to deal with this. Small single purpose functions are abstractions as well, just don't build big OOP hierarchies. Because they just don't work for those Thursday afternoons.
Entirely agree. DRYing is compression. It serves a similar purpose. Like a mathematical equation, it's terse, light, elegant, easy to carry in one's pocket, yet loaded with meaning. It's not, however, zero cost. At maintenance time, it needs unpacking. DRYing is also not an exact science, nor a hard rule. It's the factorization of a specific idea (at least that's what it should be). When applied, it requires usability and cognitive considerations. That's the delicate trade-off. "Will people coming after me have an easy time figuring this out?" The newbie who comes to an existing code base and proceeds to indiscriminately DRY things up often doesn't realize that the reason they were able to do so in the first place is because, although repetitive, they could understand the original source, often without much effort. That's why repetition is cheaper than the wrong abstraction.
I'm a DevOps engineer. I totally buy duplication is better than the wrong abstraction but I'd like to nuance it: duplication is better than an abstraction used by two disparate parties (groups of people that don't talk to each other).
This is in agreement with Conway's law, which absolutely governs everything I do. I work on a DevOps team that supports several different development teams all working on different things. The code I write for those teams I often duplicate along team boundary lines. Build scripts, for example, I write and I put them in each team's git repository. These might look very similar. This allows the scripts to grow and change and evolve according to the different teams needs without the teams needing to talk to each other.
"Proper duplication" goes back to separation of concern. If you have two different concerns (using the lens of Conway's law, two very different teams) using the same code, perhaps they should not be using the same code because that is not a separation of concern. Separate the concerns by separating the code paths both concerns use.
This type of duplication is praised in more depth on wingolog[1]. I highly recommend reading it as something every engineer should read.
It's very important to know when to duplicate and when not to do so, because duplicating at the wrong time can lead to pain, but not duplicating can lead to pain also.
This. I think you’re hitting the nail on the head. The question is whether there are multiple dependencies on a given bit of code. When there are multiple dependencies, changing the code because one of them wants something means the code needs to be checked and tested against all the other places the code is being used. And it’s really really common to have inadequate understanding and inadequate test coverage, so things break, and hence people develop superstitions about code that shouldn’t be touched.
Another way of putting it is that if the code is really truly duplicated, then it doesn’t need to change at all. If it has to change, the need for change is there because the multiple parties depending on that code have slightly different needs and slightly different ideas about what they want. Abstracting the code to make deduplication happen is just a way of spackling over those differences, but it can and does often cause trouble down the road, even when it’s done well. Once abstracted for two dependencies, a third dependency or more without test coverage can make changes exponentially more dangerous and error prone.
Duplication is good when forking for separate parties (or separate dependencies), each of whom may wish to customize the code, and now they are free to do so without the risk or fear of breaking someone else. I feel like the author of the article didn’t understand the benefits of duplication.
Very sad hearing this from a DevOps engineer. While the config smear of ops is encouraged by their tools (Terrorform is a fantastic example), a DevOps engineer that does not dedicate themselves to DRY practices will erode the productivity of an organization by default. I remember how Terrable things were before our Ops team developed strong module abstractions for our infrastructure. And get them to talk to each other.
This is a very amateurish take
The author very clearly (at least at the time of writing this) has not dealt with complex code bases.
> If I were to see a confusing piece of code littered with conditional logic, I wouldn’t see it and think “oh, there’s an incorrect abstraction”, I would just think, “oh, there’s a piece of crappy code”. It’s neither an abstraction nor wrong, it’s just bad code.
This is the primary issue. The author does not recognize that poor abstractions can involve more than just a lot of conditional logic. Sometimes that conditional logic bubbles up in places far removed from where the bad abstraction was made.
A simple (real) example of this. I've seen code where "hey, these two objects share a field, let's pull out a base object and have them both inherit from it, after all, duplication is bad!". Then later on, "hey, here are two other objects with the same field, but they don't have that old base object's field, duplication is bad, so let's make a third base class".
This sort of thinking resulted in a really gnarly object graph. But further, downstream code had to do type checks and casting to compensate for this bad abstraction.
All because the original dev didn't want to duplicate a field on two otherwise unrelated objects.
And worse, you, the dev working on this code years later, are left with the choice: keep it as is, or rewrite and touch 100s of files, potentially breaking large amounts of code.
Oh, and not to mention the unit tests that accompanied such code - ironically filled to the brim with duplication around this hierarchy, making minor changes massive.
On smaller less complex code bases you rarely see this comedy/tragedy play out.
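A minimal Python sketch of the shape being described (class names invented):

    # "They both have created_at, so extract a base class!"
    class Timestamped:
        def __init__(self, created_at):
            self.created_at = created_at

    class Invoice(Timestamped): ...
    class AuditEvent(Timestamped): ...

    # Later: two more objects share a different field, so another base appears.
    class Named:
        def __init__(self, name):
            self.name = name

    class Customer(Named): ...
    class Product(Named): ...

    # Downstream code now compensates with type checks.
    def describe(obj):
        if isinstance(obj, Timestamped) and isinstance(obj, Named):
            return f"{obj.name} at {obj.created_at}"
        if isinstance(obj, Timestamped):
            return f"something at {obj.created_at}"
        if isinstance(obj, Named):
            return obj.name
        return repr(obj)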
Class inheritance is flawed because it tries to be two things at once: a shared "surface" (public members, polymorphism, etc.) and shared implementation. An abstraction is only a surface -- this could be an interface, a function declaration, or even a data model. It almost never happens that implementation-sharing and surface-sharing completely coincide, and this is why class inheritance is falling out of favor and something I completely avoid (occasionally I will use abstract classes, but I usually regret it later). This is where "favor composition over inheritance" comes from. I'd go so far to say that because they cannot be completely divorced from implementation details, base classes cannot even be called abstractions.
So if "wrong abstractions" includes shoddy base-class shenanigans, then the statement becomes almost tautological. Of course duplication is better than class inheritance -- everything is better than class inheritance. So the real statement there is "class inheritance is actually awful", which is important to understand, but a side point to this debate.
If you don't count class inheritance as abstraction, then the tradeoff between code duplication vs. abstraction becomes much more nuanced, and that's what all this discussion is about. I certainly don't agree that ignoring class inheritance is a signal that the author is amateurish. Many complex codebases have no class inheritance at all.
Inheritance is very useful in domains like game engines, where it is very common to have a base object such as "Node" that has some properties that every object in the scene graph must have, which all share the same implementation. For example they should all have a parent property and a collection of children, and ways to modify those properties. They'll also share methods such as "render" which probably must be overridden in every subclass. It's not impossible to solve this with interfaces and composition but those solutions are sub-optimal.
An example you might be more familiar with is the DOM of a web browser - every element has some basic properties and methods that all share an implementation.
Quite the opposite: game engines are one of the few places where the sub-optimality and fundamental problems with object inheritance became so overwhelming that people started abandoning their deeply ingrained CS 101 models of Dog : Animal and invented Entity-Component-System architecture, which at its extreme uses no object inheritance at all and is a deeply "relational" model. Game engines which don't do this were either mostly developed before ECS was invented/popularized (Unreal) or are specifically targeting beginners who have little more than a CS 101 understanding of OO programming (and also following Unreal's lead).
DOM elements are a better example, but just because that's how they are done doesn't mean that's how they should be done. Does a <script> element really need a "focus()" method? It has one. Does a <br> element need an "innerHTML" property? It has one. Does a <head> element need an "offsetHeight" property? It has one. If you look at the history of the development of HTML and JavaScript as a shining ideal of software engineering, you're certainly in the minority (this is all before TypeScript, which is a shining ideal of type systems!). The HTMLElement class has 134 properties, most of which make no sense for most elements. It has a long history and a lot of excuses for becoming what it is today, but I would not recommend you follow that lead in your own designs.
Not really. Composition has been the preferred technique for a long time already.
A lot of Games and GUIs that use inheritance worked in spite of that inheritance. In more complex object graphs there were always things like override boolean DoNotActuallyRender() in one or two children of the RenderableNode class to account for special behaviour.
ECS is just the nail in the coffin of inheritance in game engines. And it's not even new anymore, it has been fashionable for what, almost 15 years now?
> It's not impossible to solve this with interfaces and composition but those solutions are sub-optimal.
The growing popularity of ECS and data-oriented design in game engines suggests otherwise: keeping components separate from entities enables both performance enhancements and separations of concerns that are much more difficult to achieve with the traditional inheritance-based approach. To illustrate a bit:
> it is very common to have a base object such as "Node" that has some properties that every object in the scene graph must have, which all share the same implementation. For example they should all have a parent property and a collection of children, and ways to modify those properties.
You don't need subclasses for that; you just need a table of entity IDs (where both the things to render and the scene itself are entities) and parent IDs, which you can then recursively walk to get the entities you want to render:
WITH RECURSIVE entity_children AS (
  SELECT id, parent FROM entities
  WHERE parent = $scene_entity_id
  UNION ALL
  SELECT e.id, e.parent
  FROM entities AS e
  JOIN entity_children AS ec ON e.parent = ec.id
)
INSERT INTO scene_entities (scene, entity)
SELECT $scene_entity_id, id
FROM entity_children;
(Obviously you probably won't actually be running SQL queries in a game engine's rendering loop; this is just to illustrate the logic.)
Once you've got that list...
> They'll also share methods such as "render" which probably must be overridden in every subclass.
You don't need subclasses for that; you just need a table of entity IDs and things to render, which you can then query and send to the GPU:
INSERT INTO some_buffer_in_GPU_memory (entity, mesh, texture, position)
SELECT se.entity, em.mesh, et.texture, ep.position
FROM scene_entities AS se
JOIN entity_meshes AS em ON se.entity = em.entity
JOIN entity_textures AS et ON se.entity = et.entity
JOIN entity_positions AS ep ON se.entity = ep.entity
WHERE se.scene = $scene_entity_id;
(Again: you probably ain't actually using SQL for this; this is also overly simplified, since most modern game engines use all sorts of other stuff besides a mesh, texture, and position when rendering something. Note also that "em.mesh", "et.texture", and "ep.position" need not be actual meshes/textures/positions, but could instead be indices into buffers already on the GPU.)
The key advantage in both of these cases is that the parent/child data and the render data can live where they make the most sense, and can be processed by independently-running systems with minimal contention. This is critical for processing game logic in parallel - something which the game industry is learning the hard way with legacy engines that can't fully exploit multicore hardware.
I was looking for a word to describe my feeling about the article and "amateurish" fits the bill.
What mostly put me off was this: (for example, the same several lines of code duplicated across distant parts of the codebase dozens of times, and with inconsistent names which make the duplication hard to notice or track down)
It is a silly example, because in such a scenario there is no way you can even start writing an abstraction to handle it.
The other part is what cogman10 wrote: a wrong abstraction is not "simply a piece of code gathering if statements". A wrong abstraction is a piece of code, or a whole part of a system, where you cannot simply add an if statement and get going. A wrong abstraction might be something that actively prevents you from changing code in a meaningful way.
There is also another comment I would riff off, about DevOps and having scripts per team/domain: even if those scripts mostly look the same, you never know what a team will require. Nowadays domain-driven development is in vogue, mostly because it recognizes that separation of concerns is much more important than DRY.
To finish off, the author also assumes abstractions are born from de-duplicating code; since we are discussing "duplication is cheaper", I wanted to end with a rant. The worst abstractions I have seen in practice were born in the heads of "Astronaut Architects" who built systems top-down, making stuff up "because it should be like that". Other bad ones were written by junior devs who were high on DRY.
Better rule is the first time you duplicate, just copy/paste the code. But once you duplicate a third time, that's a pretty good signal that you have something that is common enough to be abstracted, and you can write the abstraction then.
In my experience, one is unlikely to remember how many times a piece of code has been copy-pasted, unless they are doing it in the same coding session. Everyone copy-pastes, thinking that they'll DRY it up next time, and the codebase turns into a mess.
Well, when you make a new abstraction, that same PR will presumably also include changes to some N call sites to make use of that abstraction, right? The important thing is that N>=3. Anything less than that is, more often than not, premature. And in my experience, premature and/or leaky abstractions are far (far) more harmful to codebases than copy-pasted code.
I abstract, i.e., put code into its own function, even when N=1, to make the surrounding code more readable and simpler. At N=2, I definitely abstract. Abstracting doesn't mean building class hierarchies when they are not needed; I abstract at the required level.
Premature abstraction, in my book, means abstracting in expectation of future needs. It is different than abstracting small and often. My abstractions are not premature, because they address existing needs.
I view copy-pasted code as a cancer of a codebase. It is very difficult to tell whether two similar-looking pieces of code behave the same way, which makes them practically impossible to DRY away later, whereas calling the same function leaves no room for guessing. Copy-pasted code reduces readability and understandability, and makes code longer. It never happens under my supervision.
fn a
    foo a
    foo b
    foo c
    bar d
    bar e
    baz f
    baz g
    baz h

and

fn b
    foo
    bar
    baz

fn foo
    a
    b
    c

fn bar
    d
    e

fn baz
    f
    g
    h
which of a or b is more readable and simpler? (Spoiler: it's almost always a.)
DRY isn't about repetition of text in a source file, it's about repetition of authority in your domain model. You evaluate it at an architectural level, not in PR diffs or whatever.
Of course fn a is simpler and more readable, simply because it's not 200 lines long and there are no indentations and no states to keep track of, etc. This example is not at all like real code and is just a straw man of my argument.
DRY at any architectural level corresponds naturally to repetition of text in source files.
I think your disagreements are valid, but I don't think it is fair to say this is an amateurish take or to infer the author's level of experience. Your example of unnecessary inheritance hierarchies (which I have also faced many times in real-world scenarios) may even be a symptom of exactly what the author is saying: what you might call a "bad fitting abstraction" the author would just call "bad code". The implementation details of how code gets shared (composition vs inheritance) are a subtle but still vital consideration in the cost-benefit analysis. The author is observing that it might be misleading or dangerous advice to urge developers to choose duplication just because issues with abstraction have been historically observed, which I completely agree with, and I do not consider myself an amateur. I also agree with you and other posts that the author fails to mention the (exponentially higher) costs of abstraction boundaries that also span human organizational boundaries.
i wouldn't create a base class until there are a non-trivial amount of common properties shared by several classes and i find that i am adding more such common properties. and when a class appears that doesn't have one of these common properties, then perhaps it makes more sense to move that one no-longer-common property out of the base class back into the individual classes so that again i can have all classes share the same base class.
> Not every piece of code is an abstraction of course. To me, an abstraction is a piece of code that’s expressed in high-level language so that the distracting details are abstracted away. If I were to see a confusing piece of code littered with conditional logic, I wouldn’t see it and think “oh, there’s an incorrect abstraction”, I would just think, “oh, there’s a piece of crappy code”. It’s neither an abstraction nor wrong, it’s just bad code.
The wrong abstraction isn't crappy code itself. It is a reasonable looking piece of code that will force the next person into writing crappy code to accommodate it.
Edit: I think the entire project of TensorFlow is a good example of this. They built the library around a "graph" entity, and anything you did had to be shoehorned to fit that. That worked OK for some straightforward neural networks and situations for a while. As the area evolved though, it proved very burdensome. They tried to evolve it into TensorFlow 2.0 which was more forgiving, but by that point it was too late, the ecosystem became a mess. PyTorch stole the thunder because they didn't make the wrong abstraction (though I'm not sure if "duplicating" is what helped them do that)
Abstraction is not just about hiding code - it's about reducing options. You purposefully reduce options to make the system easier to reason about. A "function" in a programming language is an abstraction over machine code. It looks like variables have scope in an isolated environment, and it looks like the braces mean something, but it's compiled down to machine instructions that have no such concept. Goto considered harmful, but compiled machine code is littered with jump instructions (of course). You can do a lot of funky tricks with machine code that the higher abstraction of a programming language doesn't let you do. When you create an abstraction you reduce options for the user of that abstraction. So abstractions tend to gather cruft over time because users want those restrictions relaxed to do their special thing.
Absolutely right. One of the most important questions to ask an abstraction is: what can I not do with this? If the answer is "nothing -- you can do everything you could before", then the abstraction is an inner platform. The entire power that abstraction brings is in "focusing" on the problems we care about solving; it must make other problems impossible (ideally ones we don't care about). It follows from the No Free Lunch theorem.
One way to make sure your abstractions are focused on solving the right problems is to always define them based on what you need, not based on what you have. The root of the abstraction vs. duplication debate comes down to this. Indeed it's unhelpful to look at two pieces of code and say "these look the same; I will abstract them!". Instead you say "wow these have really similar needs; I will define exactly what that need is and they'll both ask for it."
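For instance (a made-up sketch, with hypothetical names): two call sites might both need "retry an operation that can fail transiently", so the helper is defined around that need rather than around whichever lines happened to repeat.

    import time

    # Hypothetical helper, named after the shared need, not after the
    # duplicated text it replaces.
    def retry(operation, attempts=3, delay_seconds=0.1):
        for attempt in range(attempts):
            try:
                return operation()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay_seconds)

    # Both callers ask for the need they share, nothing more (names invented):
    # retry(lambda: fetch_invoice(42))
    # retry(lambda: send_welcome_email(user), attempts=5)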
I resent how much we've trained developers to value concision over everything else. I can't tell you how many times I've seen people use DRY as a justification to alias stuff that's already heavily abstracted by the framework that they use, ending up with less useful interfaces. Either that, or they'll explode the cognitive load by building crazy type hierarchies and inserting opaque anti-patterns like factories and decorators and whatnot.
These are "the wrong abstractions" in the sense that they're not actually crappy code full of conditionals and are actually well-redacted and not all that hard to decipher. They're "the wrong abstractions" in the sense that there's either a way to do it that is simpler and makes fewer assumptions, or in the sense that they are worse than "no abstraction" which is to say sticking to the abstractions that have already been invented for you by people whose jobs it is to do that exact work for millions of engineers and are therefore probably way better equipped.
The most important underlying issue isn't discussed in the article:
DRY must be understood and applied correctly.
"Every piece of knowledge must have a single, unambiguous, authoritative representation within a system"
The keyword here is _knowledge_.
When we see duplication, repetition and so on, then that might be because that piece of code represents:
- data of different entities that have similar structures
- logic that just happens to be similar.
- boilerplate code
None of these things have anything to do with representing the same piece of knowledge in a program. In fact, you can easily get into trouble _especially_ if you think the first two things are violating DRY when they are not.
I agree with the article wholeheartedly, though. If your code or data _model_ is not DRY, you can get into trouble very easily. Very nasty bugs, regressions during maintenance or extension, hours spent in frustration, money lost, etc. On top of that: non-DRY code almost always _proliferates incidental complexity_, because if you don't fix it, then eventually you patch over it.
Here's the best-case scenario: even if you are aware that the code is not DRY, do everything right, and turn multiple knobs at the same time to change or extend it correctly instead of fixing it, you will do so with much more reluctance and it will be much more mentally taxing.
Non-DRY code is by definition complex: You now have more interconnected parts than you need. So really, if you make your code more DRY, you _simplify_ it.
My favorite counter-acronym to DRY is WET: write everything twice (or thrice!). Doing and then redoing it once you understand it better is the best way to learn how to apply DRY correctly.
> Non-DRY code is by definition complex: You now have more interconnected parts than you need. So really, if you make your code more DRY, you _simplify_ it.
It really depends, I think there are some assumptions here that could use clarification. The whole point of choosing duplication is to disconnect parts that shouldn’t be connected, so I don’t understand what you mean about non-DRY code being more interconnected than duplicated code. Conscious duplication (often called “forking”) allows people who depend on a piece of code to change it without breaking anyone else. When you merge two pieces of similar code, they already had two or more separate uses, and you’re adding a new connection, tying together the fates of two or more different users. From now on, if they don’t have exactly the same agenda, there will be tension and/or bugs.
If deduplication requires adding an abstraction layer, then that absolutely is adding complexity, and it happens because the code being de-duplicated was not exactly the same. Code that’s truly duplicated doesn’t need to change in order to de-duplicate. So you can delete a copy in that case and centralize the dependencies onto the remaining copy. That eliminates code but doesn’t really simplify; it has the potential to simplify future development, but it doesn’t simplify the code at the moment of deletion. With modern build systems and project structures, however, it might take a lot of work and it might add complexity to get the DRY code into the right spot where it’s visible to everyone who needs it. Another reason for duplication is to avoid having to do backflips to get the code into the right file or scope.
Well all repeated code can evolve independently too, so by that definition, how can non-DRY code exist?
This discussion is a bit too abstract and losing meaning as a result. I think you're glossing over the very real scenarios of code ("knowledge") that's similar but not exactly the same; of copies of code that are similar being merged; of code that's exactly the same being intentionally forked and "repeated" and placed nearby; and others. Repeated code definitely can be the same piece of "knowledge" and exist in multiple places without coordination. After that the copies might drift in different directions, and the DRY dogma says this is bad. This is what people mean when they talk about DRY principles: the general idea to look for and merge similar code, and to prevent multiple pieces of similar code from being able to evolve independently.
DRY in fact doesn't really come up if two pieces of code are exactly the same; then it's pretty obvious and easy to factor it out. It's a non-issue. The reason this is a concept we talk about is because bits of code are similar but not the same, or they're the same but people need them to be modifiable without risk.
I don't know about using the word "knowledge" to represent processes with dependencies that change over time; it doesn't seem like the best terminology for this discussion. Specifically, the word "knowledge" tends to imply a timeless, static quality to the code. In reality all of the issues worth discussing here, all of the problems with DRY, and all of the benefits of DRY, relate to how code changes over time.
Your point about similar but not the same code being different knowledge actually kind-of illustrates exactly why merging similar bits of code is dangerous, precisely why you shouldn’t just apply DRY principles blindly, and why sometimes rejecting the idea of DRY is appropriate in a given situation.
The point that I wanted to make above is that DRY has a precise meaning and we seem to talk past each other because you use a much looser definition of DRY than I.
With that loose definition we think in terms of arbitrary similarity and duplication. And I absolutely agree with you that it's dangerous to use abstraction to patch over this perceived similarity.
Let's look at a trivial example of applying DRY correctly:
Say we have a database schema with a column representing some URI of an entity like "person/12" which encodes ":kind/:id". It's trivially obvious that this should be a computed column, which is derived from other columns that represent the "single, unambiguous, authoritative" pieces of knowledge of these values.
(It's no coincidence that DRY applies neatly to data models. The Pragmatic Programmer, which coined the term, uses data models to explain the principle as well).
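As a rough sketch of that computed-value idea, in Python rather than SQL (names are invented): the kind and id are the authoritative knowledge, and the URI is always derived from them instead of being stored as a second, independently editable copy.

    from dataclasses import dataclass

    @dataclass
    class EntityRef:
        kind: str  # authoritative knowledge
        id: int    # authoritative knowledge

        @property
        def uri(self) -> str:
            # Derived knowledge: always computed, never stored separately.
            return f"{self.kind}/{self.id}"

    print(EntityRef("person", 12).uri)  # "person/12"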
Really what DRY teaches us is to have a separation between authoritative knowledge and derived knowledge. Typically we're talking about data representing information here, but it can just as well be a piece of logic that is used to validate user input that you better have in just one place (think of the fun that you would have otherwise...) and so on.
Of course we need derived knowledge all the time, but we want to treat that very differently. We use things like code generators, cache invalidation, materialized views, macros, temporal databases and so on in order to protect ourselves from having to coordinate derived knowledge in tandem with authoritative knowledge.
And it's not just IT related. Even game programmers and compiler writers like to use DRY code in order to reduce coordinating (and growing) memory allocations, because it's often faster to compute values on the fly from a cached array, than to fetch memoized values.
> I don’t know about using the word “knowledge” to represent processes with dependencies that change over time, it doesn’t seem like the best terminology for this discussion. Specifically, the word “knowledge” tends to imply a timeless static quality to the code. In reality all the of issues worth discussing here, all of the problems with DRY, and all of the benefits of DRY, relate to how code changes over time.
I fully agree with the latter statement here. Thinking of how code changes over time and the implied coordination, is like a razor that we can use to determine whether some piece of code is violating actual DRY and whether we should do something about it.
> My favorite counter-acronym to DRY is WET: write everything twice (or thrice!). Doing and then redoing it once you understand it better is the best way to learn how to apply DRY correctly.
I don't like it.
It's so much worse than DRY.
Imagine you have something that is duplicated. Instead of deduplicating it, you leave it.
Later at one location you make modifications, like rename a variable, introduce an optimization, whatever.
Now, if you don't explicitly remember, that duplication is hidden.
Months later you need to change or use the functionality somewhere else.
How likely is it you'll edit it just in one place or introduce another duplication?
Now let that run for a few years and you'll have this all over the code base.
Sounds bad, the only problem is it’s a straw man imaginary scenario. Write everything twice isn’t a call to leave repeated code around at all, it’s a call to learn by doing rather than believe that dogmatic principles like DRY can get you there the first time. The idea behind WET is to use the right tool for the job, and acknowledging that you probably won’t know what the right tool is until after you’ve tried doing the job. DRY is the right tool for some jobs, but not all jobs.
There are downsides to duplicating code, and there are downsides to merging code. I’ve seen examples of such downsides in both directions in practice. The main downside of DRY I was trying to point out in other comments is that multiple dependencies on code adds additional complexity and risk regardless of the quality of the abstraction. A lot of people here are arguing the quality of the abstraction is what matters, but that’s only sometimes true. And in the cases where eschewing DRY is called for, it often has nothing to do with the quality of the abstraction.
For testing I prefer DAMP. Descriptive And Meaningful Prose/Phrases. I’ve watched otherwise smart people wrestle with testing boilerplate when requirements change and I’ve had my fill for this lifetime.
Each test is a separate story. At most, tests in a suite should share setup code. Anything more than that is coupling of tests, which is a no-no. The distinction between mocks and fakes is the most common place I see this blow up in our faces. Fakes result in coupling of tests. They are difficult to write, so they get amortized across ten tests, making new requirements difficult or impossible to add without accidentally removing coverage of other requirements.
A fairly common pattern that I've seen over and over in multiple domains is this:
Given a group of "things" with a start and stop date, list all the things that are "active" during a given date range.
Someone abstracted it because we have several "things" that use this logic.
Then it had a bug because some of the things are inclusive and some are exclusive.
Then it had a bug because some of the things use dates and some timestamps.
Then it had a bug because some of the things are timezone aware and some are not.
So we started down the path of a rather simple query construction becoming a complex thing with flags for inclusive/exclusive for start and end, timezone settings ...
> So we started down the path of a rather simple query construction becoming a complex thing with flags for inclusive/exclusive for start and end, timezone settings ...
Forcing the caller to _think_ about inclusivity and timezone awareness is not a bad thing, rather the opposite. These are important decisions to be taken: the abstraction is not trivial because what it abstracts actually does have inherent complexity.
If the abstraction forces you to take the necessary decisions (inclusive? timezone?) without having to think of how to implement them, it doesn't sound like a bad abstraction. Too often these decisions are not thought about, and the expected behaviour is "whatever is implemented".
Who knows whether some things are inclusive or not? Who knows what use dates and timestamps? It seems like this should be abstracted somewhere and this knowledge codified in one single place. It sounds like your abstraction, in this case, isn't very abstract at all.
That is common for bad abstractions -- they add a layer but they don't actually encapsulate any knowledge. To use this abstraction, you shouldn't be passing any flags for inclusive/exclusive, etc -- it should know that for you.
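A minimal sketch of what that could look like (all names invented, and only the inclusive/exclusive quirk shown): the per-thing knowledge lives in one registry, so callers never pass flags.

    from dataclasses import dataclass
    from datetime import date

    @dataclass(frozen=True)
    class ActiveWindowSpec:
        end_inclusive: bool  # some things end on their stop date, some the day before

    # The knowledge about each kind of "thing" is codified once, here.
    SPECS = {
        "subscription": ActiveWindowSpec(end_inclusive=True),
        "promotion": ActiveWindowSpec(end_inclusive=False),
    }

    def is_active(kind: str, start: date, stop: date, on: date) -> bool:
        spec = SPECS[kind]
        if spec.end_inclusive:
            return start <= on <= stop
        return start <= on < stop

    # Callers never decide about inclusivity; the abstraction knows:
    print(is_active("promotion", date(2024, 1, 1), date(2024, 2, 1), date(2024, 2, 1)))  # False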
That sounds like your code just isn't properly typed.
For example in Rust the first bug would be caught by `Range` vs `RangeInclusive`.
The second bug would trivially be caught because dates and timestamps are different types.
The third is trickier, but (depending on exactly what you mean) that can be caught with static types too.
Pointing your finger in the wrong place IMO. If anything this refactoring highlighted worrying inconsistencies in your code that probably would have cropped up as bugs elsewhere.
Great example. One way to avoid these problems is having lots of tests written for the various uses of the abstracted thing so you know they’re all covered. But also, if all of these things function in different nuanced ways, is it really any benefit to have them all jammed into the same abstraction in the first place? I’ve found this comes down to personal taste. I prefer a little duplication if it means not having to “own” an abstraction that I’ll need to heavily document and hope people read the documentation for in order to not break. But some would rather own the one point of failure.
One should at minimum give things that behave differently different names, which is a common practice in data modeling.
I expect all those bugs to return again and again as different people maintain that code. At least with code deduplication they would have a clear alarm telling them their knowledge is wrong and they must pay attention. But with each query doing everything people will just assume they know it all.
The point isn’t the interface though it’s the implementation. And if many of those things are implementing the same search functionality slightly differently, you’re back to the same spot, except now your bugs are spread across multiple sites, often with duplication.
The underlying issue is just that correctness is hard I think.
Let me give an example bad abstraction that isn't due to littered conditionals, but still very bad.
One time company A had a database, and code that loaded persisted object state from it. Some of the objects could be soft deleted. Rather than check various objects for soft deletion, the team decided to check all objects for soft deletion, regardless of their type, by querying a table where objects had to be listed if they were still live (not soft deleted).
Fast forward a few years, everybody follows this pattern, and there is massive hotspotting of that central "object lifetime" table that has basically two columns (object_id, is_deleted) that becomes a latency bottleneck because absolutely everything is joining on it all the time.
Truth is, it made it convenient to code with this, because you never had two ways of checking whether an object was live, and by construction you could never make the mistake of operating on a soft deleted object or forgetting to implement lifecycle deletion.
But man was that a poor abstraction. It was probably redundant with database functionality. It gave soft deletion capabilities even to things that didn't need soft deletion. It had a significant latency cost. But everybody adding a new object type just picked it because it was the way the company had decided it would do soft deletion.
I feel you are describing an implementation that was once fine but is no longer satisfactory, rather than an abstraction, which perhaps could have been made easier to fix with a bit more abstraction: a function to do the soft deletion if possible, with a better-performing (albeit probably more complex) way of determining whether soft deletion was an option.
> Sounds like it was just what was needed at the time
The problem I've seen often in codebases is that as an abstraction or pattern grows more unwieldy, they don't take the time to update it.
They often don't get revisited until they're so bad that they can't be ignored.
Handling something with a switch or if/else is fine if there are only 2 or 3 options, but people will often just keep piling on. When it's 10 things, changing it becomes much more work, so people will continue to add to it. Then when it breaks at 20 things, someone will come in and say "Why did we write it this way in the first place? It doesn't make any sense!"
I'm often torn between pragmatically writing the simplest code possible and being proactive about abstracting early to prevent an eventual breakdown of the pattern.
How does a switch break at 20 items? Any respectable compiler or interpreter should handle that fine. If it was 32k cases, I could imagine why it would raise an error. But 20? Seriously?
Often, writing more cases into a switch statement is way easier and less boiler-plate-y than abstracting it out to subclasses or a dictionary or whatever.
I have seen a ton of time wasted due to the wrong abstraction.
Though it's a question of how much and what you duplicate.
Which means I somewhat partially agree with the article, which is more nuanced than the title implies.
One of the most common cases of bad de-duplication is de-duplicating code which happens to be mostly the same, but where nothing from a business-logic point of view makes it the same.
Or code which differs mainly in points that the language in use needs a lot of complexity to abstract over.
In my experience, a more powerful type system, like in Scala, Haskell or Rust, on one side has the benefit of making the refactoring much less bug-prone, but on the other side makes it easier to drift into "abstraction introduces too much complexity" territory. In the end, using a type system _appropriately_ is a skill, and one which some technically very skilled people are missing.
Though what I also realized is that with a strict type system, "top-down" abstractions using e.g. custom traits/interfaces/abstract classes tend to be much more likely to cause issues than composite, bottom-up abstractions that use closures to fill in the missing parts. Sadly, this kind of abstraction, while simple in the simple case, is also prone to needing some limited degree of higher-kinded typing in the less simple cases. This puts limits on how much you can practically apply it in many languages (or it accidentally becomes too complex due to missing intuitive notation for the limited higher-kinded parts needed).
Though the most important thing for many projects is to make the code easy to change. And by this I mean changing the source code, not having complicated abstractions that allow you to use the same source code in many different ways even though you only use it in one way at any point in time.
There are two situations I’ve observed where Sunk Cost Fallacy reliably doesn’t kick in. One is three line functions and unit tests. The other is duplicated code. It’s better to err on the side of mistakes that people don’t get precious about fixing later.
A lot of the arguments I have with coworkers end up being about friction and blind spots about friction. “You” think these things don’t slow you or others down later, but I have a bibliography of incidents that say you’re wrong. Wishful thinking is married to magic thinking, and they have a child named “mortgaging the future”.
I dislike code duplication. But do you know what I like even less?
Giant functions with 12 keyword arguments passed up and down a call stack, because those functions have many callers which want slightly different things.
Choosing the wrong abstraction often leads to endless kludges and special cases. Two warning signs are functions with 12+ keyword arguments, and strange class hierarchies full of callbacks that only interact with a few functions.
The problem with all programming advice is that it needs to come with George Orwell's classic advice to "Break any of these rules sooner than say anything outright barbarous."
If programming advice makes your code look obviously gross, ignore the advice.
Worse, for those 12-parameter functions (the parameters invariably booleans, or enums if you are lucky), usually only a small subset of all possible flag combinations is tested (or even meaningful). The worst part is when the flags are directly or indirectly under user (or configuration) control and the application can go into uncharted territory.
Worse still, good luck refactoring those functions when you have no idea which combinations are actually meant to be supported and what their original semantics were.
Ah, but the solution is to turn that 12 argument function into a class that does one thing (runs the function), and dependency inject all those arguments. It still totally sucks, but you can pretend you're writing "clean" code by obfuscating the parameter passing.
Abstraction tends to shift logic from procedural to structural.
Rather than 12 keyword arguments and 12 branches in 1 big function, it should be 12 small classes (in OOP) or 12 small functions (in FP) that each handle one of the branches. All organized in some way so that the logic of executing those parts is expressed in the structure of the code.
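A rough sketch of that structural style (invented names, and far fewer than 12 variants): the choice of behaviour lives in a dispatch table of small functions rather than in keyword arguments and branches.

    def export_csv(rows):
        return "\n".join(",".join(map(str, r)) for r in rows)

    def export_tsv(rows):
        return "\n".join("\t".join(map(str, r)) for r in rows)

    # One entry per variant; adding a variant adds a function, not a flag.
    EXPORTERS = {"csv": export_csv, "tsv": export_tsv}

    def export(rows, fmt):
        return EXPORTERS[fmt](rows)

    print(export([[1, 2], [3, 4]], "csv"))  # "1,2\n3,4"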
> Rather than 12 keyword arguments and 12 branches in 1 big function, it should be 12 small classes (in OOP) or 12 small functions (in FP) that each handle one of the branches.
I mean, sure, you could convert your library into 12 little classes, or a collection of purely-functional combinators. Sometimes that helps. Sometimes it makes the situation even worse.
Some of the most terrifyingly inappropriate abstractions I've seen in my career involved complex class hierarchies, or worse, things like "abstract interpretation over the free monad."
There's no substitute for asking, "Are these things I'm trying to abstract over actually similar in any fundamental way?" And "Is this code actually just horrible?"
> "Are these things I'm trying to abstract over actually similar in any fundamental way?"
I'm not sure why this is even a point; if there's no similarity of the things that are being abstracted, why would one even discuss abstraction in the first place? The point of abstraction is that there is some fundamental similarity that the abstraction addresses. `ICloudStorage` abstracts `GoogleCloudStorage` and `AwsS3Storage` because at some level, they both have the same abstract operations: read, write, delete, etc.
The point is that there may be superficial similarity. Code that happens to look the same, but represents different "pieces of knowledge" that are likely to change independently probably shouldn't be unified.
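A made-up illustration of that kind of superficial similarity:

    # These two functions are textually identical today, but they encode
    # different pieces of knowledge (a UI constraint vs. a rule imposed by a
    # payments provider) and will drift for unrelated reasons.
    def valid_display_name(name: str) -> bool:
        return 1 <= len(name) <= 20

    def valid_card_nickname(name: str) -> bool:
        return 1 <= len(name) <= 20

    # Merging them would couple a screen-layout decision to a third-party rule.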
> I don’t see how it can be said, without qualification, that duplication is cheaper than the wrong abstraction.
I mean, this statement IS qualified. The word "wrong" is doing some heavy lifting. Part of what makes an abstraction wrong is when it is expensive to use as tiny differences emerge in the requirements.
It’s also wrong when active epics contain a third implementation of the “same” pattern.
It's been a while since I've seen as much time wasted as when trying to abstract the second implementation only to be proven wrong by the third. So instead of being, for instance, 8, 8 and 16 points to implement, it ends up barely squeaking by as 8, 16 and then 16 again.
It’s one thing to fight the Rule of Three for things that might happen. It’s quite another when it will happen.
You’re lucky then. I joined a company that had a team of inexperienced engineers where every form or details page was a separate program and the render functions were several hundred lines long by themselves. When I joined they had a dozen pages that were each so buggy adding new ones was nearly infeasible and fixing bugs took most of the dev time. Duplicate code can certainly slow down the dev process and kill a startup.
I have seen much damage from duplicate code at multiple organizations. I have seen thoughtful abstractions work successfully to mitigate it, and rarely encountered the opposite. I have encountered multiple pejoratives: copypasta coders, couch developers, et al.
The key factor when de-duping some code is to know whether the code is the same because they express the same abstraction or due to coincidence.
If they are the same abstraction then they should always be the same and you're doing the right thing to de-dupe.
If they are the same due to coincidence de-duping will tie together things that should be independent. As development continues the implementations will need to diverge. That's when you get the rat's nest of conditional logic. It's a lot easier to add a parameter and conditional logic to a function than rip it out.
It's not always easy to tell if two bits of code are the same due to coincidence or not... it might come down to nuances of business considerations that the developer has no idea about (or, since we're talking about predicting the future, no one knows about).
I don't think it can be done perfectly. But it's worth considering why not to de-dupe before you do it.
I was looking for this. There are definitely two types of duplication. For example not every use of the number 16 should be replaced with a SIXTEEN constant. However if the maximum allowed password length is 16 you shouldn't be writing 16 all over you code, you should be writing MAX_PASSWORD_CODEPOINTS because your system may depend on that value being consistent.
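As a tiny sketch (the constant name comes from the comment above; the rest is invented):

    MAX_PASSWORD_CODEPOINTS = 16  # the one authoritative place for this rule

    def validate_password(password: str) -> bool:
        return len(password) <= MAX_PASSWORD_CODEPOINTS

    PASSWORD_HINT = f"Use at most {MAX_PASSWORD_CODEPOINTS} characters."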
Although I would disagree that you should never deduplicate things that are coincidentally the same. Sometimes code that is coincidentally the same can have the same bugs and require the same updates over time, so deduplicating them can reduce maintenance cost and remove bugs. However I wouldn't race to deduplicate these things. Just if they become frequent patterns or have remained the same for long enough to justify the effort to unify them.
The author is going into technicalities without much actual substance, ending with: it depends.
I think whenever we, as programmers, try to pin down a certain principle, it bites us. Hard. DRY was cool as an observation but when it got turned into a law we saw the spaghetti code.
Duplication, on the other hand, is detested almost as much as the goto statement. Let me tell you, it's not that bad. Duplicate code makes everything more flexible. It helps you to NOT bend over backwards in order to change a line of code. It allows you to NOT touch anyone else's code.
So many good things. Of course, I agree with the author's summary of the bad things that can happen with duplicated code. But there's a litmus test for that:
If you have to make changes in multiple blocks of duplicated code in order to change the behavior of something, there's a problem. DRY out the code so you only have to touch 1 place.
If, however, 2 blocks of code LOOK similar but aren't actually the same, and changing one block doesn't make the other block outdated and stale, you are good to go.
Judge and decide. It's just 2 approaches that when taken to an extreme can cause a lot of pain, but if used with common sense, nothing is simpler.
> Duplication, on the other hand, is detested almost as much as the goto statement.
Honestly, even the goto statement isn't that bad. It's pretty useful in C code. I'm not saying anyone should put it in a new language, but the amount of hate it gets is really just related to BASIC monstrosities from the 1970s, not any real-world applications of it.
> I think “the wrong abstraction” is a confused way of referring to poorly-de-duplicated code.
But I believe this is similar to a no true Scotsman fallacy. “If you just make the right abstraction, de-duplicating is fine!”
Yes, if you're good at making the right abstraction, it's not worse! Those are the cases when I definitely do the refactoring: when I know for sure that I know the right abstraction. Otherwise, I defer the decision to an older, smarter, wiser me (or a future maintainer).
I have been down both roads. I’ve seen unwieldy abstractions reduce a codebase down to a giant pile of edge cases, and I’ve seen codebases where making a single change to the design has required editing dozens of files. Where I’ve ended up over the years is to abstract the “big” things. The types that represent your domain. The pieces of the data layer that need to be exactly the same every time. After that, solve for large classes of problems. This may be an abstraction, a usage pattern, or just a function. Transaction management, logging, etc.
Know that if you try to wrap ANYTHING in an adapter “in case we want to swap it out later” that this almost never happens, and when it does the abstraction you came up with is probably inadequate. Transaction handling in one tech is different than another. Or logging context is handled via disposable scopes instead of as part of the log entry. For those cases, if someone isn’t already maintaining a good abstraction (like MassTransit) then it probably doesn’t exist.
It depends. If it's not some core area of the code, but more like a script, some code that lives at the periphery, it might be better to "duplicate" almost similar code that is hard to abstract.
I saw attempts to remove "duplication" that made the code so hairy and hard to read, as opposed to very readable. I put duplication in quotes, because code might be similar, but not 100%.
Some code is easy to deduplicate.
Some code might be hard, and if the overengineering is done to remove 2 occurrences at some code periphery, it is not worth it.
Duplication is a superpower if you can put your OCD into a box for a little bit and frame it as a temporary stepping stone.
Refactoring nightmare codebases can become trivial if you don't mind a few copies of "the same thing" being kept around to satisfy serializers and other legacy APIs. Writing mappers between nearly-equivalent types sucks really hard but it still sucks a lot less than saying things like "lets just rewrite the whole product".
Duplication is cheaper because of how most programmers write code at their job:
- write stuff as fast as possible, without having time to think about overall architecture, especially if it involves having to cooperate with other devs. It's easier to just implement something that's as quick and as simple as possible so that it can be passed off to someone else with minimum effort.
- no need to communicate the abstraction semantics - no need for documentation outlining the abstraction, reasoning, possible expansion, etc.
- it's much easier to make localized changes. A well-written abstraction will cover some logic that might be spread across multiple areas. Changing something major in the abstraction requires understanding how the abstraction affects all its applications in the same ticket. Whereas duplicated code can result in a ticket being resolved by just making a change to a specific code block, like a function.
- Things that work well aren't appreciated. If it's easy to update an abstraction for a new feature, that'll just be the expected outcome. When a change like the previous point is needed, it's much more memorable because the frustrating experience is likely to be longer and more strenuous. We also tend to remember negative experiences over positive ones.
- Abstractions require reading more code with additional levels of indirection and devs don't like reading other people's code.
- Writing things well requires effort, so bad abstractions are more likely.
- More mature projects tend to have more abstractions because of their additional complexity, so I would guess that there's a strong correlation between difficult projects and frequency of abstractions.
- Some people went absolutely nuts with writing blogs posts, and evangelizing certain techniques which were completely unnecessary in an effort to push out content. There's lots of things to write about on implementing abstractions. But little in the other direction other than don't write unnecessary abstractions.
The flip side is, duplication is bad because when you find a bug and fix it, did you fix it everywhere? How many places were there where the bug needed fixing? Are you sure you got them all? It's much easier when there's only one place that you have to fix.
And I am only half-joking about that. I don't think that effort is that visible and often goes unrewarded. I feel like a lot of managers don't directly, but indirectly use number of tickets closed as a sign of productivity which affects promotions and compensation.
Obviously YMMV, but teams that care about their code quality to such an extent are rarer than places that act as ticket factories.
The flip side of the flip side is bad because when you fix the bug in the "abstraction", the de-duplicated piece of code: how do you know you did not break something you don't know about?
Duplication is easier because once you fix that single place, you are 100% sure you fixed that place and did not break 10 other places. Maybe the answer is "by writing unit tests", but only when you actually write those unit tests do you find out whether you broke something.
Funny story time: we had an add/edit popup in the system; because the two looked the same, a dev just made them a "single thing". For something like 3 months it went: dev1 fixed something -> qa2 found bug X -> dev2 fixed something -> qa1 found bug Y -> dev3 fixed something -> qa3 again found bug X. When I got into the code base I noticed that ping-pong, because somehow I was the only sane person to check the git history, and I split things up. Something like that has happened multiple times in my career.
I'm getting shot in the foot by this right now as our team embarks on tackling some long-term tech debt.
The approach we've found that works is health checks and manually looking into cases when we think we've fixed a bug, as that will often point us to a piece of duplicated code we missed that we can bring into the fold.
I passionately disagree with this. Abstractions inherently introduce some level of opaqueness, and they're only useful in the context of making things more maintainable. Duplicated code is easier to reason about because its intent is closer to the problem it originally solved.
> for example, the same several lines of code duplicated across distant parts of the codebase dozens of times, and with inconsistent names which make the duplication hard to notice or track down
While I think there’s merit in deduplicating these situations, one pitfall is introducing coupling and tangled dependencies when DRYing.
There are ways around this of course, but I’ve come across a number of instances where deduplication has led to unnecessary coupling between modules.
The whiplash I get from reading this article is massive. One second they agree that bad abstraction (filled with conditionals) is bad but then say:
> So instead of “duplication is cheaper than the wrong abstraction”, I would say “duplication is cheaper than confusing code littered with conditional logic”. But I actually wouldn’t say that, because I don’t believe duplication is cheaper. I think it’s usually much more expensive.
(emphasis on the last sentence)
I couldn't disagree more. In fact it's an incredibly "junior dev" mindset that sees 2 pieces of similar (or _even identical_) code and is compelled to abstract it. Unless there are at a _minimum_ of 3 implementations I think it's always better to duplicate. I've watched too many "common" functions grow over time with way too many arguments, too many conditionals, and way too confusing for anyone to easily follow. The most egregious is different return values based on arguments passed in. I'm not talking "array of strings" or "null" but "array of strings" or "single string" (or worse).
Abstraction can be fun to write and it feels like you are doing something to help "future proof" (also XKCD 927 [0]) but in reality it boxes people in (especially if you try to abstract with less than 3 real implementations) and leads to overly complicated code, or worse "clever" code.
As I've grown as a dev I'm less and less inclined to write "magic" or highly abstracted code and prefer dealing with "boilerplate" that I can tweak as needed for the individual use-case. Only once I have a clear pattern of code that's been deployed and used for a good bit of time do I reach for abstraction/reusable code.
> I've watched too many "common" functions grow over time with way too many arguments, too many conditionals, and way too confusing for anyone to easily follow.
This is not the fault of the abstraction. This is the fault of (especially junior developers) treating abstractions as sacred and non-disposable, which is itself the result of a mindset in which creating abstractions is discouraged. You should almost never modify an abstraction. Don't modify abstractions to cover new use cases, and you more or less won't run into any of these issues. If you need to, create new abstractions and throw old ones away.
> Unless there are at a _minimum_ of 3 implementations I think it's always better to duplicate.
This is a silly rule to follow, except for the most inexperienced of developers, perhaps. It doesn't take long to gather enough experience to be able to recognize in most cases whether some instance of duplication is coincidental (structurally similar by happenstance, which could be "abstracted" in a macro-like manner, resulting in something quite fragile to changes) or whether you're actually encoding some piece of knowledge into an abstraction. Advice like waiting until a piece of code repeats three times encourages developers to think about abstractions in terms of structural similarity, which is exactly the opposite of how abstraction should be considered.
> This is a silly rule to follow, except for the most inexperienced of developers, perhaps.
Perhaps you'd consider me inexperienced though I don't consider myself to be so. I've learned enough times that neither I, nor my colleagues, can accurately predict the future and every time we think we know the cases that code will need to handle in the future we guess wrong more often than not.
What I'm trying to say is until you are sure a piece of code is literally the same, or has tiny differences that you can cleanly abstract, you shouldn't try to guess how future code will use the abstraction. It's the same rule of mine where I try to never proactively add functionality to a function/piece of code. You think that you are saving your future self (or peers) time, but too many times I've seen people guess wrong at what extra functionality we will need, and then that code never gets touched and/or gets migrated/updated for years before someone realizes there is no calling code that uses that functionality but we have been dragging it along this whole time.
Could you check everywhere and make sure it's not being used and thus can be removed? Maybe but I understand the desire to make as few changes as possible and preserve the functionality as it was when you first went to edit the code. Overall that's a good idea when making changes and sometimes you don't always know what params all the clients are passing to an endpoint to be sure of if something is still in use or not.
> I couldn't disagree more. In fact it's an incredibly "junior dev" mindset that sees 2 pieces of similar (or _even identical_) code and is compelled to abstract it. Unless there are at a _minimum_ of 3 implementations I think it's always better to duplicate. I've watched too many "common" functions grow over time with way too many arguments, too many conditionals, and way too confusing for anyone to easily follow. The most egregious is different return values based on arguments passed in. I'm not talking "array of strings" or "null" but "array of strings" or "single string" (or worse).
I agree with you here and tend to rather, if possible, deduplicate subsystems or sub-functions of similar looking/identical code and keep the duplicate public surfaces.
> I agree with you here and tend to rather, if possible, deduplicate subsystems or sub-functions of similar looking/identical code and keep the duplicate public surfaces.
Completely agree, take the small parts that are standalone/discrete and abstract them. I greatly prefer something like
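(a hypothetical sketch of the idea, with invented names:)

    # The genuinely common, standalone sub-function is deduplicated...
    def _normalize_address(raw: dict) -> dict:
        return {"street": raw["street"].strip().title(), "zip": raw["zip"].strip()}

    # ...while the two public surfaces stay separate and free to diverge later.
    def prepare_billing_address(raw: dict) -> dict:
        return _normalize_address(raw)

    def prepare_shipping_address(raw: dict) -> dict:
        address = _normalize_address(raw)
        address["residential"] = raw.get("residential", True)
        return address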
> As I've grown as a dev I'm less and less inclined to write "magic" or highly abstracted code and prefer dealing with "boilerplate" that I can tweak as needed for the individual use-case.
This is part of creating abstractions to benefit the reader, not the writer of the code.
I'm currently refactoring a python package that was designed to make writing ETLs very elegant (it worked!), but as a consequence, when something goes wrong, figuring out what happened involves poring through 4 different modules, class hierarchies, and trying to track variables through multiple layers of abstraction. It's a nightmare for debugging.
Simple boilerplate is repetitive and boring, but man would it be so much easier to read
> Simple boilerplate is repetitive and boring, but man would it be so much easier to read
Yep, and I'll fully admit when I first started out I hated this idea and wanted everything to be super-DRY, but I've swung back in the opposite direction (or at least to a good mean). I had a developer ask semi-recently why we had some boilerplate when the function in question was simply calling another function on the parent class: why not just call the parent function directly (it was protected; they wanted to just make it public)? I explained that yes, right now we were essentially doing a straight pass-through (this was for a CRUD layer), but that we had learned over and over that over time we need to add things like business logic, validation, or data migrations, and this way we just need to change our "intermediate" function instead of adding one later and having to change all the places that were calling the "direct" function. Same idea as with getters/setters: you don't always "need" them when you first write them, but having those hooks is invaluable down the line.
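A minimal sketch of that pass-through hook (names invented; the persistence details are elided):

    class BaseRepository:
        def _save(self, record: dict) -> dict:
            ...  # actual persistence lives in the base class
            return record

    class UserRepository(BaseRepository):
        def save_user(self, record: dict) -> dict:
            # Currently a straight pass-through; future validation, business
            # logic, or data migrations slot in here without touching callers.
            return self._save(record)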
In a world of changing requirements it can be difficult to know what the right abstraction is going to be. I am happy to accept some duplication early in the development cycle until the requirements have settled. Only then it's possible to go back and refactor (which admittedly doesn't always happen in practice).
I believe duplication should raise eyebrows but it can be justified.
I disagree with OP. You cannot abstract after two or three dupes because you don't even know what you have or need yet.
Let it breathe, let it stink for a bit - THEN make an informed decision about what to refactor and abstract. You're just jerking off otherwise, and I hate working with code that's been abstracted early for no reason.
His description of his understanding does not include any reference to the "wrong"-ness of abstractions that shouldn't exist. If I read him as-is, I should conclude that the idea is to never make any abstraction at all. It obviously cannot be that, since that would be stupid.
"Wrong" abstractions are already bastardized from their first iteration. Developers decide to code them nonetheless because they estimate that their "awkwardness" is worth it in comparison to code duplication. What they fail to realize is that, unlike code duplication, which just "is there", the awkwardness of the abstraction will compound.
Duplication is the last resort, when one has established that he couldn't find any non-wrong abstraction.
The underlying problem is that the "don't repeat yourself" principle is often in conflict with the "single responsibility principle". Structurally, this comes down to the problem of managing dependencies. Over the years, the problem of dependency management has become bigger and more difficult to tame.
The same problem holds for internal code as well as external code. Duplicating code creates one kind of dependency problem (feature drift). Shared code creates another kind of dependency problem (increased coupling). Broadly speaking, solutions which reduce coupling are going to be cheaper to maintain.
Ideally, there would be clear, well defined layers with narrow communication protocols.
When you work in a very large and complex codebase you encounter a few things that this author doesn’t seem to consider or thinks are very minor:
1. Refactoring something introduces non-negligible risk. Consider a class with many fields and multiple mutexes it uses to control concurrent access to those fields. Even just consolidating those mutexes introduces the hard-to-conclusively-find-in-testing risk of introducing deadlocks and livelocks. And that's like the base case of refactoring the class: anything involving splitting the class up, moving data fields up or down the stack, or changing the way member functions (which acquire locks) call each other is even more complicated and risky. It is just not worth refactoring this thing unless you have a very very good reason.
2. A function or object often has a many-to-many relationship in what it touches: it is called or accessed from multiple places and it calls and accesses many things. Non-trivial improvements to abstractions typically involve changes at both ends: which may be “as simple” as updating all the call sites to take a new argument or handle a different kind of error (hopefully all your call sites are structured so error handling is compatible with their abstraction!) or as complex as completely refactoring multiple levels up and down the stack to reflect better-abstracted semantics.
No, you shouldn't lazily copy-paste around such problems when they are straightforward enough. But it can be so much less work (and again, less risk of breaking things) to use composition + wrappers, or inheritance, or to copy some little chunk of code than to do things the "right" way.
3. Let’s face it, your cool new abstraction sounds right in your head, but in a complex system it may just be playing abstraction whackamole once all the bugs and edge cases you’re not initially considering get addressed. It may be impossible to fully understand the entire system from beginning to end, without which it’s hard to be confident you’re actually improving things before embarking on your epic partial rewrite, or at the very least know you’re not changing semantics around some arbitrarily-drawn box. But if you’re not even changing the semantics, see point 1.
> My understanding of the “duplication is cheaper than the wrong abstraction” idea, based on Sandi Metz’s post about it, is as follows. When a programmer refactors a piece of code to be less duplicative, that programmer replaces the duplicative code with a new, non-duplicative abstraction.
I think one of the main takeways from Sandi Metz's quote is that you should postpone creating the abstraction until after you have the duplicated code. Sometimes you will remove the duplication when you have just two implementations, sometimes you will want many more. Once you have the repeated code it's relatively easy to make the right abstraction.
As someone who has made a good life over the years by taking advantage of the security bugs (either to build my embedded empires--aka, jailbreaking--or to directly collect bounties) caused by all of the people who hate abstraction so much (or are merely so bad at doing it that they don't know how to do it well) that they vehemently argue that duplication is not merely a temporary pragmatic decision to incur potentially-dangerous architectural debt which you intend to come back and fix later but is somehow better than even trying to address it, I guess I find this discussion thread of people almost 100% tearing into this article's fundamental premise... kind of fun? ;P
So, yes, yes: please do continue to ensure you have so much boilerplate in your "flat and easy to understand" code that you eventually make a fatal mistake (potentially simply while doing a merge commit), refuse to factor your safety checks out into abstractions that prevent you from making the same mistake twice due to your refusal to "obfuscate the underlying API everyone knows how to use", and (my true favorite) litter your code with multiple implementations of the same algorithms that have very subtle differences in them (so called "parser differentials") as you insist on every single programming language in use having its own copy of the algorithm "for ergonomic reasons, as IPC/FFI would be crazy when I can just import a second one off-the-shelf".
"To me, an abstraction is a piece of code that’s expressed in high-level language so that the distracting details are abstracted away. If I were to see a confusing piece of code littered with conditional logic, I wouldn’t see it and think “oh, there’s an incorrect abstraction”, I would just think, “oh, there’s a piece of crappy code”. It’s neither an abstraction nor wrong, it’s just bad code."
Of course, if bad code is not an abstraction, then there can be no such thing as a bad abstraction!
More to the point, code littered with conditional logic might well be both good code and a good abstraction. There's a somewhat well-known article out there claiming that Netscape shot itself in the foot by deciding to rewrite the browser from scratch. As an example of how that went wrong, the author mentions the hapless developer trying to write code to work with some hardware component (the great many different dial-up modems that were out there at the time, IIRC), discovering that most of them had unique quirks that had to be respected, even when they nominally conformed to the same spec.
The thing is, you can no more apply abstraction to a program until everything is simple than you can apply compression to a file until it's down to a byte. What's really at issue here, as Fred Brooks noted many years ago, is the difficult problem of satisfying the demands of the context's essential complexity while keeping a lid on the implementation's accidental complexity.
There are a lot of ways for good code to express bad abstractions. The abstraction could be inconsistent with other parts of the system, inconsistent with the concepts it is meant to represent, inconsistent with its own observable behavior, inherently complex or hard to reason about, inconvenient to actually use, poorly suited to whatever people actually use it for...
I've seen a lot of code that is perfectly clean and "well-organized" as code but organized into absolutely awful abstractions.
None of that goes against your core point, I just think that seeing the code and its abstractions separately is an important perspective for understanding code design.
On the flip side, it's also totally possible to have bad code but a good abstraction. Some of the best abstractions I've worked with have painful implementations, and it didn't impinge on the quality of the abstraction itself! Of course, the bad code made life a lot more painful for the people responsible for implementing and maintaining the abstraction, and I'm sure it required some real skill and experience to keep that from manifesting to users of the abstraction, but they managed it.
I think the core of the difference can be found in the What exactly is meant by “the wrong abstraction”? paragraph. Admittedly, the quoted article is also a bit confusing here, but I think it's easy to resolve.
I think the wisdom of the original saying is hard to understand when you just look at any piece of code as it exists. Instead, imagine the future. You have two pieces of code that do similar things - you can centralize them (with a bunch of conditionals) to have a "single" code path, or you can allow them to stay separate (perhaps confusing new people). The wisdom of "duplication is cheaper" is to observe that it will generally be less work to allow the duplication than to maintain the circumstantial needs over time. Each time you need to "do the same thing again but a little different" you can either add more conditionals to a single piece of code, or add another instance of 'duplication' which can just deal with the concerns at hand. It's not about "crappy code" - it's about the difficulty of having one piece of functionality serve many masters over time.
IMO, in general, you will also find that if you have many 'duplicated' copies of code, it will often be easier to see the truly duplicated sub-sections that you can DRY out into a common subroutine. I find that is easier to see with duplicates than with a single piece of complex code.
Software architecture is a domain where hard and fast rules don't work.
This is all about understanding tradeoffs and nuance.
In general, I believe that abstractions should be used in moderation; de-duplication is not always an improvement, especially in the long run.
I've made this mistake a lot as I tend to be quite obsessive with so-called code "cleanliness".
It is good that novice programmers are warned about the dark side of abstractions, but ultimately they'll have to experience it by themselves to fully grasp why and how they can be detrimental.
One of the major shifts in my coding style over the past ten years has been to increase the amount of duplication. My threshold for "I should really dedupe that" increased from ~3:7 lines to ~10:50. Looking back this was driven by two main factors: testing and performance optimization.
The testing side is just that tests become awful much faster than normal code if you dedupe them. Unit tests are supposed to be simple and independent, but deduping makes them correlated and complex. You think you'll make things simpler by extracting the common setup from twenty tests into one method, but instead you've coupled the tests so they can't individually be tweaked and laid the seeds for a monster incomprehensible test object to grow from.
The performance side is that often improving performance requires removing abstraction layers so everything is in one spot, allowing irrelevant cases to be removed. Adding the abstraction layers ahead of time makes performance worse to start with, from all the jumping and "paper over one more difference" flag checking, and also makes performance improvements harder later.
If two things are supposed to behave analogously, I'm nowadays much more likely to enforce this by testing the analogy rather than by sharing the implementation.
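A minimal sketch of what "testing the analogy" can look like in practice, with hypothetical function names and a deliberately duplicated implementation (assumes pytest):

```python
import pytest

def legacy_discount(price: float, tier: str) -> float:
    return price * {"gold": 0.8, "silver": 0.9}.get(tier, 1.0)

def checkout_discount(price: float, tier: str) -> float:
    # Duplicated on purpose; free to diverge later for checkout-specific rules.
    if tier == "gold":
        return price * 0.8
    if tier == "silver":
        return price * 0.9
    return price

# The two implementations stay separate, but the analogy between them is
# enforced by a test instead of by a shared implementation.
@pytest.mark.parametrize("price", [0.0, 10.0, 99.99])
@pytest.mark.parametrize("tier", ["gold", "silver", "bronze"])
def test_discounts_agree(price, tier):
    assert legacy_discount(price, tier) == checkout_discount(price, tier)
```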
If working in a solitary codebase, this problem isn't very interesting. Do whatever makes your life easier.
If you're working on any kind of code that serves as a library to other code, don't mutate the signatures of your public methods/functions. Once that signature is released, the only changes to its output should be bug fixes. If you have a need for two very similar functions, you should use two wrapper functions with the common code in a third.
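Something like the following sketch, with made-up names, is what the two-wrappers-plus-a-third suggestion amounts to:

```python
def _render(template: str, values: dict, strict: bool) -> str:
    # Shared implementation; private, so it can change freely.
    if strict and any(v is None for v in values.values()):
        raise ValueError("missing value")
    return template.format(**values)

def render(template: str, values: dict) -> str:
    """Original public signature: stays frozen once released."""
    return _render(template, values, strict=False)

def render_strict(template: str, values: dict) -> str:
    """The new, similar need gets its own wrapper; existing callers are untouched."""
    return _render(template, values, strict=True)
```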
Oh... So that's how you end up with class names ending in FactoryFactory... Factorisation at any cost without making sure it makes sense and will keep making sense...
Once you've seen enough code, the right abstraction becomes easier to spot.
Applications are more similar than they are different. That's why we have the concept of design patterns since these occur with enough frequency that we should just give the abstraction a name instead of re-inventing it each time.
Problem today -- my observation -- is that many younger devs don't ever bother learning design patterns so we end up with 1) devs who aren't aware of common, existing patterns codified decades ago and then 2) think that the "wrong abstraction" is expensive partially because of a lack of knowledge of the "right abstraction" to use.
True enough. But in some companies, there's so much push that the wrong abstraction gets left in place as refactoring and rewriting get pushed down the priority stack and never happens. The way to circumvent this is to not declare the task done while it's still the wrong abstraction, but there's still a (present) schedule cost compared to just duplicating the code and tweaking it.
The cost of DRY (Don't Repeat Yourself)-ing up your code can be high, in that it increases the coupling of your code, and potentially lowers its cohesion.
Consider a function def foo(a: int), called from call sites C1 and C2. Eventually C1 wants something out of foo() that it doesn't offer, but, critically, something that C2 _doesn't care about or need_. The author of foo() adds a new default argument: def foo(a: int, b: int = 0), and then there is a conditional block in foo() that deals just with this new b argument.
You've now potentially broken call site C2 by exposing it to changes that it doesn't care about. Put another way: you should only deduplicate code if _all_ the call sites will _always_ change for the same reason. Otherwise, you're lowering the quality of the code by increasing coupling and lowering cohesion. Copying and pasting the code in this case makes sense, because C1 and C2 have entirely different needs out of foo(). Over time, foo() will accumulate more and more default arguments as the author stridently attempts to keep everything DRY, and the overall code base becomes more and more fragile.
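Roughly the situation being described, with the body invented just for illustration:

```python
def foo(a: int, b: int = 0) -> int:
    result = a * 2
    if b:                 # branch added only because C1 needed it
        result += b       # C2 never asked for this, but now runs through it
    return result

foo(3)         # C2: must trust that the new branch really is inert when b == 0
foo(3, b=7)    # C1: the only reason the parameter exists
```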
Your code is still DRY, and you are using polymorphism (foos of different type signatures) instead of if/else. The behavior of foo(int) doesn't change, so you don't require additional tests for foo(int); the fooInternals<X,Y,Z> aren't public, and you have now added tests for foo(int, int). You aren't paying any additional costs in terms of maintenance. You aren't increasing behavioral risk at C2 for calling foo(int). You are only paying more for foo(int, int), and those are costs that you would have to pay regardless of whether foo(int, int) literally duplicated the body of foo(int) for the common pieces or refactored the common pieces out. You save cost in maintaining both foo(int) and foo(int, int) if the common pieces need to change, as you are adding tests for the behavioral changes to both the foo(int) and foo(int, int) tests, but are only making a single change in the common code.
Also, when doing this, the abstraction is the original foo(int), not the new, additional foo(int, int). Abstraction is the assumption of some parameterized behavior via hard-coding. Here, the new, additional parameterized behavior introduced by the second b:int parameter is abstracted away in the original foo(int), not in the new foo(int, int). That doesn't make the original foo(int) abstraction wrong, because it is used in at least one call site (C2).
Only when all call sites must change to accommodate something that a new parameter allows through more than one change-set can you begin to call an abstraction wrong. Otherwise, it is a simple bug that was fixed by a single change-set.
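As a sketch of that alternative (names hypothetical; in a language with overloading both entry points would simply share the name foo, while in Python they need distinct names):

```python
def _foo_internals(a: int) -> int:
    # The shared, non-public pieces.
    return a * 2

def foo(a: int) -> int:
    # Unchanged signature and behavior for existing call sites such as C2.
    return _foo_internals(a)

def foo_with_offset(a: int, b: int) -> int:
    # The new behavior C1 needs, tested separately; foo() stays untouched.
    return _foo_internals(a) + b
```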
The elephant in the room is without strong static typing and a good type checker changing abstractions is somewhere between a significant pain in the ass and downright perilous.
In my experience when you have those things, whether you make significant changes to your API or decide to dedupe old divergent copy pastes, it’s largely just busy work — very little thought involved. The type checker says change line 135 in file foo. Okay, next.
Duplication is neither "cheaper than the wrong abstraction" nor is it "one of the most dangerous mistakes in coding".
There's a cost to abstraction. There's a cost to duplication. Our job, as engineers, is to stop applying blanket statements and instead reason about the tradeoffs. And no, they aren't static tradeoffs either, because requirements and constraints don't stay static.
For some reason programmers think that an "abstraction" is the same as just naming something. If I take a bunch of code that will only work given specific, concrete conditions and give it a name like "setup()" then I have "abstracted" it.
People who know what abstraction means, and people who use it to mean indirection or naming things, will of course never agree about how useful it is.
"Duplication is cheaper than the wrong abstraction" makes sense coming from the Rails community. Between the meta-programming, lack of static types, large amount of unit tests, etc. Rails has a tendency to lock a project into an abstraction choice & is very expensive to change. The pain is particularly intense during major version Rails upgrades. From my experience, the Rails framework got in the way & bogged down project velocity. It was difficult to move away up to ~2010 as many of the jobs were locked into Rails. There were many frustrated Rails programmers around that time. When node.js, Go, & other languages/platforms came out, there were finally full stack libraries that did not lock in abstractions as heavily as Rails. Nowdays, I use astro.js, solid.js, & target isomorphic libraries. The flexibility of Javascript with the static types of Typescript make changing abstractions significantly easier. The Javascript ecosystem spent far too long focusing on SPAs when the isomorphic MPA was low hanging fruit.
Whenever this conversation is had - it seems to completely dismiss the idea of domain. Duplication doesn't happen in a vacuum - it happens within a certain context. Some acceptable conditions for duplication include:
* If two things are semantically different within the context of a domain but require similar functionality.
* Code paths with different risk profiles.
* When new functionality is evolving with domain learnings.
Fixing a bug in one place and being sure only one place was affected and being sure that one place was really fixed is cheaper
-
than fixing a bug in one place that affects 20, where in 15 places it was a proper fix, in 5 places it will break in unforeseen ways when users do something different, and somehow an additional 2 places are totally broken because no one ever knew they were affected.
In about 1990 I got tasked with building an installation and configuration system for the hardware and software package my company built. It was an Ethernet card and a TCP/IP suite being added to the PCs of the era (that had an AT/ISA bus where you had to find a free address block, then jumper the card to have the correct address, lotsa fun.)
I wrote the first system targeted at AT&Ts Unix for the 386. After it was completely done, I was assigned to do the same for Xenix. After that was completely done, I got assigned to do SCO (Santa Cruz Operation) Unix. After that, Interactive Systems (ISC). Each system had its own architecture for installation and configuration. I didn't know in advance anything about the different systems, nor any knowledge that the other systems were on the horizon. As I was writing the second system, I was refactoring like mad to avoid duplicating code, and feeling very proud due to previously learning the horrors of duplicate code. I can't remember details, but among other things files had to be placed in a specific directory hierarchy for each system, and various files had to perform certain (different) functions on each system. When I turned to the third and fourth target systems, the refactoring just became weirder and more complicated, but I was determined to avoid duplication.
Historically it turned out we never revised these releases. With 20-20 hindsight, it's a case where the refactoring was completely pointless, and code duplicated 4 times would have been way faster to create, and easier to maintain if we had made new releases. I think part of Sandi's point is that YAGNI applies as well ... a higher level abstraction may accommodate changes that never arrive, or the changes may be so large that NO abstraction will cover it.
On the opposite end of the spectrum, in 1980 (yeah, I'm really old) as a summer-hire, I'd written in HP-Basic this very funky single-purpose very primitive data base system. When I returned 9 months later after graduating, two full time guys had made small changes to the system, but one guy had made a breaking change, the other guy got pissed off and duplicated _the entire program_ (a single file, to be sure) and made one small change. Thereafter I had to maintain two versions of the thing. Gaaack. It was the ultimate lesson in "don't duplicate code". (It was also in the days before we had a version control system or diff, so backing out and correcting the change wasn't practical.) Mel, where are you now?
- duplication of what, and how many times? Three times? Five? Forty-seven? Four hundred? One line, Five lines, or fifty?
- does the abstraction completely de-duplicate? Or does it turn N big duplications into some big common code plus N much smaller duplications?
You should almost never duplicate more than two, at most three, times. Or, should I say, never have to duplicate. If duplication is the best solution, because there isn't a suitable abstraction or we can't find it, there should be a macro system which can condense the duplication. That is to say, if you have macros, there is always an abstraction that can be found, namely syntactic abstraction. Identify what is common between the duplications, and turn that into a template. The variant parts become parameters.
If that turns out later to be worse than duplication, you can just expand it and keep the expansions or identify some other way to deduplicate them.
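Python has no macro system, so as a rough stand-in for the idea above, here is the same move done with a higher-order function (names invented): the common skeleton becomes the template, the variant parts become parameters, and "expanding" it back just means inlining the body at each call site.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

def process_records(records: Iterable[T],
                    keep: Callable[[T], bool],                # variant part 1
                    transform: Callable[[T], R]) -> list[R]:  # variant part 2
    out = []
    for record in records:
        if keep(record):
            out.append(transform(record))
    return out

# Two former duplicates become two parameterizations of the one template:
evens_squared = process_records(range(10), lambda n: n % 2 == 0, lambda n: n * n)
odds_as_text  = process_records(range(10), lambda n: n % 2 == 1, str)
```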
That something is the wrong abstraction is something you can only know after the fact; at the time you build the abstraction it is - or at least it should be - a reasonable choice. And later on, as the code evolves, there are two possible outcomes: the abstraction remains a good choice, or the abstraction stops being a good choice and you have to change it. Maybe it can be saved with some refactoring, maybe it has to go completely.
But at the very least you had a working abstraction for some time and you can easily figure out all the places where this functionality is used and you have a single place to make changes when you have to make them instead of having to hunt down all the different places with slightly different implementations. Even if an abstraction breaks completely down and has to be split up into several implementations, each of those will usually have several usages which would all still be repetitions without the abstraction.
> So far so good, perhaps. But, by creating this new abstraction, the programmer signals to posterity that this new abstraction is “the way things should be” and that this new abstraction ought not to be messed with. As a result, this abstraction gets bastardized over time as maintainers of the code need to change it yet simultaneously feel compelled to preserve it.
How I've been thinking about abstractions that turn bad over time is: It was likely the correct abstraction at the time it was made, given the requirements the writer had on hand. Now that the abstraction is wrong, don't muck with it. Gather the new requirements and write a v2.
I think the vast majority of abstractions will go bad over time. To abstract is to generalize, and generalizations become invalid over time because the world evolves over time. It's sort of like trying to preserve a summary of a book that is continuously having new pages added and existing pages replaced.
I think I would want to look for an accurate "representation" and expression of the right problem, not any particular abstraction technique or mechanical refactoring.
Refactoring code to your understanding helps you understand the code but leaves the code in a different organisation to how it was, adapted for your mental model of the problem.
If programming languages were expressive enough, we could represent things how they are and replicate that base pattern to different cases or scenarios and that would be enough but unfortunately our languages are not expressive of our high level intent and invariants we want to maintain. (Such as extensibility or hookability)
In other words, get the mental model for the problem right and the abstraction will be invisible and the solution shall be obvious.
Abstraction impedance mismatch is when people introduce a design pattern or a strategy that is harder to understand than the problem that was being solved and obfuscates it.
I wish more people simply were happy with using themselves whatever set of beliefs/techiques they deemed best (abstraction, duplication, whatever), preaching nothing, and arguing less.
Which is to say, there will never be a single truth for these topics. So why not build a mindset that is ready for encountering differing opinions, diverse code?
When your job is mentoring, RCA or cleaning up after other people (hello) then these aren’t opinions and aesthetics. They’re empirical evidence and/or coping mechanisms.
Invalidating people’s coping mechanisms without proposing your own never goes over well. And sometimes even then.
When diagnosing a production issue, we don't have the luxury of entertaining five different ways to solve the same problem. And code smells slow down debugging-under-the-gun: because most bugs live in code smells, the smells draw your attention, and then often enough prove to be a false signal.
If you don’t do any of these things, then it’s challenging to have empathy for or understanding of the people who do. The people keeping the wheels on deserve the benefit of the doubt. In fact anyone who will stand up and fix problems when they arise deserves a bigger vote on how things get done. Everyone else’s opinions are theoretical rather than vocational.
I read a blog post somewhere (don't remember where) that describes the process of unfactoring (multiplying?) code as an exercise. Copy/paste the code until there's one straight code path per use case. Then examine the similarities and factor the code again. What you end up with will often be different from what you started with, and probably simpler, especially if the code had begun to drift from its original author's design.
So, "unfactor" the code and then factor it again. Let's call it... "refactoring."
My $0.02, then, is that "the wrong abstraction" assumes that you are unwilling to change it. What if we were comfortable tearing down our classes all willy nilly and replacing them with some other thing? Is it too risky? Does it hurt too many feelings?
Maybe the problem lies there, instead of in duplicate vs. abstract.
In my career as a software dev I've found one thing to be true: every paradigm I ingest that opens new windows of opportunity is great at first pass, and the more I learn, the narrower the scope in which it can be applied. (This is kinda true in life, too. Like when people say, "It's econ 101 or bio 101," etc. What seems like a statement about 'common' knowledge is actually an indication of how shallow your knowledge is!)
Specifically related to this topic is a talk by Dan Abramov called, "The Wet Codebase" - He says it better than I can sum up and has visual aids : https://www.youtube.com/watch?v=17KCHwOwgms
Other have pointed out code that is similar in function vs similar by coincidence and I think that thought alone is worth chewing on.
When I was younger I was more productive when I didn't contemplate such matters. Maybe I wrote a lot of junky code but I got a lot of working stuff done. Now my time is wasted reading clout chasers and their opinions. Reading about coding is such a bad habit when it stops you from coding.
There's an aspect of "not seeing what others are seeing" here.
> I think “the wrong abstraction” is a confused way of referring to poorly-de-duplicated code. Here’s why. [...]
> So instead of “duplication is cheaper than the wrong abstraction”, I would say “duplication is cheaper than confusing code littered with conditional logic”. But I actually wouldn’t say that, because I don’t believe duplication is cheaper. I think it’s usually much more expensive.
It seems the author is considering 'cost' to be the mechanical effort of managing the sync/desync of the DRY code. What it doesn't consider is that distinct intents can incidentally share the same implementation at a given moment. That is when it's not a good idea to DRY, because the pieces are not meant to stay in sync.
> Duplication is bad. In fact, duplication is one of the most dangerous mistakes in coding.
I have to disagree with this; the article feels lofty in its assumption that when you start to program you know what to abstract. More often, people begin abstracting due to misguided axioms like "DRY" rather than to solve a problem with a real cost-benefit trade-off. DRY as a goal in itself is fairly dangerous.
I can't count how many convoluted and confusing frameworks people have put together under this misguided perspective. It's not atypical for an abstraction born of "DRY" motivations to be more code, and more brittle, than just copying and pasting 2 lines in 15 places.
Not to say abstractions are inherently bad, but to the point of abstracting for the sake of DRY is a mistake.
The problem is with being either dogmatic or thoughtless in either direction. I've seen what you're talking about: people combine code religiously because of DRY, leading to insane pyramids of abstractions that are impossible to modify.
However, I've also seen people copy and paste everything they ever need. When that happens, those offshoots gradually evolve independently from one another, and introducing a proper abstraction becomes a huge slog. I've spent hours reading through git blame trying to piece together a phylogenetic tree of the various copies of the same code so we can ensure that the new abstraction contains all relevant features and bug fixes. I wish those developers had thought more carefully about DRY.
I think the best balance is to use these catch phrases as principles to guide your decision making, while being willing to make exceptions when they don't apply. If DRY makes you think for a second before copying a piece of code, it's done its job, even if you decide that this situation really does call for a copy.
What seems to serve me best is keep things as simple as possible. If you add abstractions, do so to make the rest of the code easier and less complex. If you must do something complicated, break it apart as pragmatically as possible and do it in the simplest way possible. Favor YAGNI (you aren't going to need it) over corporate-wide libraries that lock you in.
Keep your codebase discoverable first. Structure by feature/function not type. Favor the local developer experience first. If you cannot open, follow and run the code easily, your developers won't be able to onboard quickly. Someone else will have to continue with your mess, make it as orderly as possible. I find that docker-compose can help a lot on this front, as can developer containers.
I think mislabeling something as a duplication is where most of these issues stem from.
Humans love to pattern match, we find patterns in things that often have no real pattern. It is not uncommon in my experience to see patterns in code, label the code as not DRY, and attempt to DRY it up. If the "duplication" detected was, in fact, not a duplication but rather code that just happens to be similar, the abstraction will often go awry.
My rule-of-thumb is to prioritize maintenance over authorship. Am I writing this code in a way that makes it easier for future me or another programmer to change it, or am I optimizing for a sleek diff in my code review? I think our code can look like breadboards instead of a bespoke printed circuit board, we have compilers for that.
I think it is worth distinguishing proper opaque abstractions, that are defined by a contract, from convenience macro-like "abstractions" that are defined by their implementation.
The former are for abstracting different implementations behind an interface and/or decoupling, and require thought, planning, and careful consideration for their evolution.
The latter are purely for convenience, to save some typing, some mental overhead when understanding code (although they can increase it just as well) and to centralize minor bug fixes or common features. For long term evolution and divergence, these abstractions should simply be macro-expanded instead of trying to refit them for the new requirements.
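A small sketch of the two kinds, with hypothetical names: the Protocol is the contract-style abstraction (callers depend only on the interface), while the helper under it is the convenience kind, defined entirely by its implementation and easiest to just inline if requirements diverge.

```python
import json
from typing import Protocol

class BlobStore(Protocol):
    """Opaque, contract-defined abstraction: many implementations can live behind it."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

def put_json(store: BlobStore, key: str, obj: dict) -> None:
    # Macro-like convenience "abstraction": saves typing at call sites and
    # centralizes one small detail; if a caller's needs diverge, expand it inline.
    store.put(key, json.dumps(obj).encode("utf-8"))
```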
It's curious there's no formal concept of "unduplication" - splitting a single abstraction originally created to avoid duplication, now littered with conditionals and spaghetti, into separate abstractions that now do something unrelated.
I wish people would have this saying in the Node.js and JavaScript community. I disagree with OP about this topic.
Abstractions are like the foundations of a building. Imagine that you're building an apartment block and your job is to build the foundations but you're unsure about how tall the building will be.
If you build it on mud, that might be fine for a one story construction but once other builders start adding additional storeys on top, it will become totally unsuitable and the whole thing will have to be rebuilt from scratch. Not only that, the costs will begin to materialize immediately because those who build on top of your foundation will make all sorts of bad decisions because of your poor judgment; they might decide to build the walls out of cheap wood instead of bricks simply because it's lighter and they don't want the building to tilt and sink into the mud... Then because wood was chosen as the material, there may be a termite infestation and builders will have to apply a special varnish on the entire surface of the building... Then the varnish will turn out to be toxic and will need to be removed. Every inch of the building will have to be polished with sandpaper and painted over... And when the next storey will need to be added, they will be forced to make it out of cardboard... Then the tenants on the top floor will want their money back and the whole building will need to be destroyed anyway; all that back and forth will have been nothing but a waste of time. You would have saved an entire decade and millions of dollars if the foundation had been laid on solid bedrock in the first place. Just one small sub-par decision which triggered an avalanche of terrible decisions.
I think duplicating code makes sense and can be a wise decision early in the project because it's essentially a refusal to lay the foundation until there is more clarity about the scope of the project. It's a lot easier to refactor and combine duplicated code into a new abstraction than it is to refactor one abstraction into a different abstraction. Not to mention that developers become very attached to abstractions (including incorrect abstractions) and it tends to upset people once they're invested in it.
To talk about “duplication is cheaper than the wrong abstraction” without invoking "dependencies" at all (and their costs) means the entire premise has been missed.
Another tell:
> Don’t try to make one thing act like two things. Instead, separate it into two things.
If abstractions were so easily split like this, then the advice wouldn't hold. But they never are. Abstractions immediately accumulate dependencies making it near impossible to split them, as we all learn after living in anything other than toy code bases.
The hallmark of a junior (i.e. someone who has not been to battle much) is making de-duplication of code a priority and not understanding the cost of dependencies.
It took me a long time and many thousands of lines of code written, read and re-written in order to understand one thing:
Code is supposed to reflect the intention.
Good code reflects that intention smoothly, like a well-written paragraph of a book reflects the events that happened in the story.
DRY makes sense semantically, when a piece of code always needs to be the same as another piece of code - that's when you isolate it into a function with a semantically meaningful name and behavior. Applying DRY without understanding and indiscriminately leads only to confusion and needless complexity.
DRY to me means having a single authoritative source. So for instance, if I need to define a person data structure then I use protobuf. I can add validation rules, and types to it. I can generate bindings for java, go, ruby, etc and they can all rely on the same person structure, with the same validations. Code is technically copied but there is still a single authoritative source.
If I need handle bank transactions, then I will create a single "microservice" that knows how to create a transaction and update the account balance. I wouldn't want that logic duplicated in multiple places.
An important context is the use case. Grossly speaking, business applications tend to have a shorter lifetime and faster cycle time than system code like, say, the Linux kernel or gcc. So the cost of refactoring in the latter case is amortized over a longer timescale; when you have rapid business needs it can often be better to just make the change in two or three places and move on, because in a few years the whole thing will be replaced.
We all know of exceptions to those examples (quick-and-dirty code that survives decades later) but I think that's the way to think about it.
>It seems to me that what’s meant by “the wrong abstraction” is “a confusing piece of code littered with conditional logic”. I don’t really see how it makes sense to call that an abstraction at all, let alone the wrong abstraction.
No, it means the wrong abstraction. Like forcing a one-size-fits-all abstraction on a few pieces of duplicated code, and not waiting for them to grow to enough cases to hint at what is the best pattern/abstraction/architecture to handle them (perhaps more than one, for different classes of cases that somebody might otherwise just shove into a single abstraction prematurely).
I have my umbrella at the ready for a downvote hailstorm: it makes perfect sense that the OP is hearing this repeated in the Rails community, as they are already enmired in the wrong abstraction ¯\_(ツ)_/¯
I think the answer here can be different depending upon the ecosystem. I confidently believe that abstraction is better instrumented and practiced in functional programming languages than those of the still-dominant object-oriented paradigm. Awkward abstractions are much easier to grow and stumble upon when the basic unit (an object) encourages private, greedy, encapsulation of data and method implementations. In functional languages, living up to DRY (don't repeat yourself) is a much more immediate and clear proposition.
I like Dan Abramov's "The Wet Codebase" (https://www.youtube.com/watch?v=17KCHwOwgms) -- I've been guilty of doing just what he says in his talk at first, removing all duplications and making the codebase DRY. But then I came to like "prefer duplication over the wrong abstraction", as Sandi Metz puts it.
Sometimes it's good to wait to have more data to make an easier and more informed decision.
> To me, an abstraction is a piece of code that’s expressed in high-level language so that the distracting details are abstracted away
That might be what an abstraction is to the author, but it's not a correct definition. Abstraction has nothing at all to do with high- or low-level languages.
- If your code has a bug, you will be better off without duplication, so that the bug must only be fixed once.
- If you will have to change the behavior of your code for product reasons, duplication is often better, because user needs are idiosyncratic. If the code is fully factored, you may have to pass in flags to indicate which behavior should be used in which case.
Learning to anticipate which of these two cases you might find yourself in in the future comes with experience.
I found this blog post low on insight and thoughtfulness. I've worked with engineers in the past who had an inflated esteem not just of their own abilities but of the nature of the business domain they were ostensibly building solutions inside. I have found that in many cases there's a level of naivete commingled with arrogance that comes from never having worked with an intrinsically complex enough problem to understand the true cost of abstraction, which is always nonzero.
Now, it is the case that there are many cases where the cost of abstraction is low enough to not be ROI negative. But there are many cases otherwise. Other commentators here have done a great job of detailing that space -- that incidental and actual repetition vary, that abstractions should exist to reduce optionality and ease of reasoning rather than simply reducing code, and those are all correct. But at a very basic level, all of those observations reflect the most critical missing factor from this post, which is context.
No software is created or operated in a vacuum. Every piece of software is created by humans to solve problems for themselves or other humans. So every piece of software is downstream of the working processes of those humans. Given that these working processes are subject to change and evolution, changes in requirements aren't edge cases but table stakes. This means that often the cost of an abstraction is not just whether it's the wrong abstraction at a point in time, but also whether it's an abstraction that is likely to erode over time given a particular working process.
With that said, a lot of this post seems like an exposition of this central point:
> If I were to see a confusing piece of code littered with conditional logic, I wouldn’t see it and think “oh, there’s an incorrect abstraction”, I would just think, “oh, there’s a piece of crappy code”. It’s neither an abstraction nor wrong, it’s just bad code.
I've seen this dismissal from many engineers over my career, and in every case, without fail, it reflected an inability to deeply read and understand the code, its history, and likely its future. To all the engineers out there reading this: thinking like this will prevent you from maturing from a junior engineer to a mid level engineer, never mind a mid level engineer to a senior engineer or engineering leader. You've been forewarned.
There is a simple merit: if some code is complicated enough to make you think twice before modifying it because you’ll need to modify all the copies (and you realize that it will be not easy) - then it is better to make this code DRY.
There are some simple pieces of code that are cheap to copy and modify later. And nothing wrong will happen if you do not apply future modifications to every copy. A code like this doesn't have to be DRY.
Typing on phone so I'll be brief. The key concept I find missing from this piece is "locality".
When dealing with a complex and/or unfamiliar codebase, locality (by which I mean "I can understand this thing here without jumping around the codebase") can make up for a lot of other deficiencies.
And imho, deduped code with an excess of if statements is actually one of the least bad things to encounter.
Many programmers believe that the more complex the better.
In my experience good code looks silly simple, such that you might think the problem was easy. And thus underrate the author ...
I have never read someone else's code at work and complained that a function is too big with too many if clauses.
However, deep call trees are really hard to comprehend. Especially if some function is called multiple times in the same call stack (unless the algorithm is recursive in a good way).
This article isn’t making a distinction between the interface provided by an abstraction and the implementation details of that abstraction, which I think causes it to come to the wrong conclusions.
A bad abstraction is an interface which causes the implementation to be more complex than necessary. Uses of the interface might still look perfectly simple, but if the abstraction is bad the overall complexity could be higher.
I dunno about these debates. Feels too subjective. Nobody can spit out a number and say "look, duplication is correlated with bugs!" So it's pure taste. Maybe when hiring, we should have a simple survey that asks "tabs or spaces?", "duplication or early abstraction?", and then we only hire people who agree with the team. (joking!)
Every time you create an abstraction to remove duplication, you're tying two pieces of code together and creating a common dependency. The more dependencies you have, the harder it is to change code, because a change in one place reverberates in many places.
To me, that's the cost. You gain a decrease in code size and verbosity at a cost of making localized changes more difficult.
I call this a distinction between "inherent sameness" and "incidental sameness".
Yes, right now, those two servers have the same number of processor cores. But who's to say that after a hardware update that will still be true?
Conversely, the fact that every processor has a certain number of cores is inherent to the way we represent a processor.
In my line of industrial automation, it's almost always cheaper to pay the cost of complexity up front, and assume that every conveyor VFD might get replaced with a different model, or with a contactor, somewhere down the line. That duplication is cheap when the line is on the integrator's shop floor. Any downtime later on, when enormous dependencies have come to rely on that line, is more costly.
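A tiny sketch of the distinction, with made-up fields: the core count of a processor is inherent sameness and belongs in the shared model; the fact that two particular servers have the same core count today is incidental, so it is stored per server rather than factored into one shared constant.

```python
from dataclasses import dataclass

@dataclass
class Processor:
    model: str
    cores: int                      # inherent: every processor has a core count

@dataclass
class Server:
    name: str
    cpu: Processor                  # each server carries its own value

web = Server("web-1", Processor("Xeon-4310", 12))
db  = Server("db-1",  Processor("Xeon-4310", 12))   # equal today, incidentally
```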
I find it's better to keep abstractions small and independent, so you can mix and match. Too big, and they risk not fitting future change well. Even if the smaller ones create a bit more work or "mini duplication", it's worth it to have that flexibility.
Did I miss where he justifies the statement "duplication is one of the most dangerous mistakes in coding"? That has not been my experience and it's the crux of the value judgement here so I'd expect him to explain why it's so bad.
This whole article is based on a bad reading of the problem.
The problem that happens when code is first duplicated is that the correct abstraction is a fundamental UNKNOWN.
If you knew the right way to de-duplicate it, you would of course always construct that abstraction, because that would always be better.
What happens in practice is that the wrong abstraction is usually chosen.
Then that incorrect abstraction isn't usually held around because of "[feeling] honor-bound to retain the existing abstraction" (if that's a direct quote from Sandi then I disagree with the quote and feel it has entirely the wrong emphasis). The problem is that it is always easier to add a new knob to the bad abstraction than it is to go back and de-dup the whole code and fix the abstraction. So the bad abstraction tends to accrete more bad abstractions on top of it until it becomes a mess because of doing the cheap, easy thing.
We should not do that. But the realities of software development are that when you are dealing with an orthogonal problem, you WILL wind up adding a knob to something that can be done in a day, rather than taking 2 weeks to refactor a different subsystem that your original problem only barely touches and isn't the primary concern of whatever business objective you are trying to deliver.
So the advice is to let it sit for awhile. Let the code accrete a few more requirements over the weeks or months ahead, and when you find yourself doing a double edit to both sides of the code and the right abstraction is clear to you then go ahead and de-duplicate it.
Note that if the problem is TRIVIAL then go ahead and de-dup it right from the start. This isn't advice for junior programmers who are faced with something as simple as dropping two hash keys into an array and then iterating over it so that it makes it easy to add a third key. This is more about having two classes which are fairly similar and extracting a whole base class and jamming all the shared code into the base with a tightly-coupled poorly-thought-out "wide" interface (using inheritance as a hammer to de-dup code). And the whole problem becomes even worse if someone external might come along and pick up that base class and start using it with the existing API and you might be locking yourself into a shitty API that you can't change without breaking backwards compatibility.
And even if you're in a "non-OO" language like Go you can still make this mistake by designing bad interfaces, it is the exact same thing.
Don't re-use via inheritance; re-use via dependency injection.
A (well-tested) software component that gets dependency-injected should be considered "final". If it makes sense to adjust it, you can still do so - there's nothing preventing you from this. But you should always be aware that logic relying on the dependency may behave differently in a way you haven't foreseen. If you just want to make a change for a single place in your software, you can easily replace the dependency with another one implementing the same interface. You could even decorate the original dependency if you want to re-use most of its code.
What I want to say is that nearly all abstraction issues come from inheritance, and in many, many cases there's no need to use inheritance at all.
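A minimal sketch of that approach, with hypothetical names: re-use via an injected dependency rather than a base class, and decorate the original implementation when one caller needs slightly different behaviour.

```python
from typing import Protocol

class Notifier(Protocol):
    def send(self, message: str) -> None: ...

class EmailNotifier:
    def send(self, message: str) -> None:
        print(f"email: {message}")

class ThrottledNotifier:
    """Decorates an existing Notifier instead of subclassing it."""
    def __init__(self, inner: Notifier, limit: int) -> None:
        self._inner, self._limit, self._sent = inner, limit, 0

    def send(self, message: str) -> None:
        if self._sent < self._limit:
            self._inner.send(message)
            self._sent += 1

class OrderService:
    def __init__(self, notifier: Notifier) -> None:   # the injected dependency
        self._notifier = notifier

    def place_order(self, item: str) -> None:
        self._notifier.send(f"order placed: {item}")

# Swap or wrap the dependency without touching OrderService:
OrderService(ThrottledNotifier(EmailNotifier(), limit=100)).place_order("book")
```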
This code golfing always boils down to “it depends”. Senior engineers by definition nod sagely and everyone else looks around nervously. It’s a tough break. Both approaches are correct and wrong. It depends.
Wrong abstractions get patched by making the abstraction configurable, and that is a slippery slope: keep adding more arguments and configuration options, and there is no end to it. Sometimes duplication is indeed cheaper.
I have a hot take on this, which I hope will resonate with at least a few people: duplication, even of blocks of up to a few long statements, rarely bothers me, because I remember all the duplications as a single instance. I have an extraordinary memory, and this makes a huge difference in how I think of and write code. Or anything, really. I save everything I've ever written, like bash history, but everything, and refer back to it and copy-paste somewhere else. I wonder if anyone else has this. This doesn't affect how I think of production code, but it hugely affects my workflow.
People are conflating abstracting with centralizing. A central piece of code is not necessarily an abstraction, so seeing such pieces as (wrong) abstractions is... wrong. They can be centralized by necessity, regardless of how poor an abstraction they seem to be.
I clicked on the "more nuanced and comprehensive post" and the real TL;DR is "I define duplication differently than everybody else, and, by that definition, claim that duplication is always bad"
> Just because a piece of duplication costs something doesn’t automatically mean that the de-duplicated version costs less. It doesn’t happen very often, but sometimes a de-duplication unavoidably results in code that’s so generalized that it’s virtually impossible to understand. In these cases the duplicated version may be the lesser of two evils.
I've never so viscerally disagreed with a link on this website. Particularly this point:
> Duplication is bad. In fact, duplication is one of the most dangerous mistakes in coding.
This to me reads insane, fanatical. One of the biggest benefits of duplication that the author fails to identify the locality of logic. When, not if, things break, there's a large benefit to having all of the logic contained to a few heavy-lifter classes that contain bespoke logic and are fit-for-purpose.
"The wrong abstraction" in this case is bending over backward to fit your data into another API just to cut down on code duplication; it is better to have code with clean, uninterrupted data flow than code that frequently needs to re-translate the data to be consumed by different APIs, then decode the results back to useful logic. The translation/decoding steps are new places to introduce bugs, and the more translation or decoding required, the more bug-prone the code will be.
A good abstraction to de-duplicate code should not add complexity to the existing call sites. If you've squinted and decided that two systems are close enough that they can be abstracted together, you're likely making one or both of those code paths much more treacherous.
As a programmer, if you don't create an abstraction you'll never be more than a 1x programmer. Abstraction is how to get more productive than simply how fast you can type.
Yes, the wrong abstraction is bad. But almost every argument whether it's for/against duplication or for/against abstraction usually starts with the hidden premise that you're stuck with whatever choice you've made and code you've written forever. The underlying issue is the fear of change and the sunk cost fallacy of already written code. If you have the wrong abstraction, you can change it. If you created too much duplication, you can remove it.
> The underlying issue is the fear of change and the sunk cost fallacy of already written code. If you have the wrong abstraction, you can change it.
It's not that trivial. Consider that the wrong abstraction is reflected into your API (common), and consider that your API has many users. You are stuck with it, or you have to convince multiple teams (or, god forbid, external customers) to migrate to a better API. This can constitute a humongous waste of SWE-hours ($$$$$) and take quarters to accomplish, assuming you can get any buy in.
I think it comes down to what your organization looks like and how many users are going to be touching your code. If your abstraction is just for yourself internally and everyone else is not allowed to touch it, then fine. You will own the tech debt if the abstraction is wrong. If your abstraction has other users at your company, or external customers, it had better be the right one or at least an unavoidable stepping stone.
> If you created too much duplication, you can remove it.
This is actually true. Refactoring duplicated logic is a lot easier than fixing bad abstractions.
It is that trivial. There is no alternative. You either have an API or you don't and you either change it or you don't. Hand wringing over the potential of making a mistake is a waste of time and effort. You will make a mistake. You will never get it perfect. You just have to deal with it.
> Refactoring duplicated logic is a lot easier than fixing bad abstractions.
Then you've just created an abstraction with all that potential to be bad sometime in the future.
> You will make a mistake. You will never get it perfect. You just have to deal with it.
These two comments sound at odds. First statement says it's easy. Second statement says it's hard.
We can agree that hard things don't get solved without iterating. But a productive response to abstraction (which is really API design) being hard is not to say "stop handwringing, just do it." Instead, you can employ various strategies such as preferring experienced people to do it, making sure they did a good job of gathering requirements and considered the risks of their approach, spending time testing customer/developer ergonomics, etc. You can also defer producing an abstraction until your system is a bit larger and the duplication is becoming too much to handle, since you have a larger sample size of potential uses for your abstraction to help you converge on the correct API.
Good abstractions can be the difference between success and failure, between organizational velocity and technical debt quagmire. Saying "we should always build abstractions" when it's difficult to build them correctly in one go sounds totally wrong on its face.
Trivial doesn't mean easy; it means unimportant. And in this case I'm referring to the whole idea of abstraction or not. You're going to do it. You should do it. Sometimes you have to do it. Discussing it is pointless. Just move on to the how.
> Saying "we should always build abstractions" when it's difficult to build them correctly in one go sounds totally wrong on its face.
If you don't do that first "one go", how do you get around to the right abstraction? You're telling me you wouldn't use any intuition, planning, or thinking to build that first wrong abstraction? It seems above you just described how to do exactly that. Just writing without any abstraction in mind at all is just a huge waste of time. The only time it's painful is when you decide that you can't change anything you've done. But doing no abstraction is just as painful -- you just pay for it differently.
Okay Mr. English, trivial may be defined as unimportant / trifling, but if you're going to label something like that then you have to consider why you're calling it unimportant. It still seems you're saying it's cheap, easy, and easily undone. It's not.
Abstraction building is _consequential_.
> If you don't do that first "one go" how do get around to right abstraction?
Right back at you: how do you get the data about how to build your abstraction without first living without one to understand what parts of the system should be abstracted? Starting with the abstraction and assuming you can fix it later is a luxury for overstaffed teams and people working on systems that nobody actually uses.
You do realize, abstractions incur mental load on your users? Have you ever tried to use internal tools built ostensibly for your use case, only to find that it's taking more time and effort than if you lived without one? It's happened to me so many times, and the consequence of adopting a shitty abstraction has burned me so many times personally that I'm gravely aware of the downsides.
Consider a streaming data tool that stupidly allowed users to put arbitrary data in a string field, which became abused for purposes it shouldn't have been used for -- now it's critical to the company and would take a YEAR of dev resources to retire, all the while being an endless source of headache and P0 SRE escalations.
Consider an ML evaluation tool that requires you to understand how a bunch of leaky abstractions nest together. When you finally figure out how it works and it does 1 thing for you (generate P/R curves), it becomes extremely difficult to modify or debug without again understanding those horrible abstractions. Once you've spent a full quarter adopting it, you wish your team had just written something custom-but-maintainable instead of trying to get shoehorned into a tool that clearly wasn't thinking about you, and whose owners have left the company, leaving you to absorb the cost.
> But doing no abstraction is just as painful -- you just pay for it differently.
It's less painful in the long run: doing no abstraction to start is often the right path. People often overrate the benefits of abstraction and underrate the clarity of repetition, especially when the repetition is trivial.
> how do you get the data about how to build your abstraction without first living without one to understand what parts of the system should be abstracted?
The minute I have any duplicated logic it gets combined as much as possible. I hate duplicating anything. Some duplication is impossible to avoid and I hate that too. I hate specifying the same logic in 2 places (say client-side and server-side for web validation). I will do anything possible to avoid that.
Good abstractions reduce mental load. I wouldn't be able to manage 20+ different applications if they didn't share a huge amount of common internal framework and have as little duplication of logic as possible. I'm actually pretty militant about ensuring every abstraction is a programmer benefit and will remove layers if they serve no purpose.
I almost always build things out as a library. If we are consuming an external service, I will make that its own library/framework to be consumed by the project it interacts with, even if it's only ever used by one project. This is always more work initially but has never failed to be the right choice.
Yes, in sum, but bad ones increase it, and it's not always immediately obvious to everyone that an abstraction is bad. A lot of that comes from experience.
> The minute I have any duplicated logic it gets combined as much as possible. I hate duplicating anything. Some duplication is impossible to avoid and I hate that too. I hate specifying the same logic in 2 places (say client-side and server-side for web validation). I will do anything possible to avoid that.
Good for you Glenn Coco! Work in any scaled software organization and you'll see that combining logic too early is full of hazard.
> I almost always build things out as a library ... This is always more work initially but has never failed to be right choice.
Maybe to you. I wonder if someone who's ever tried to use your library down the line ever thought "this was abstracted in a really suboptimal way but I have to live with it." Even if they did, I don't think you'd know. I've been that person too many times. People have got to stop building shitty libraries when they're not necessary and only serve to obfuscate the logic.
Your argument just boils down to bad programmers writing bad code. It has nothing to do with abstraction or not. You'd be just as unhappy with an entire project that's just one giant 20,000 line file.
That's why I think the whole rant about abstraction is pointing fingers in the wrong place. Everyone either wants to find the silver bullet that will produce code perfectly the first time or find some concept to blame when it isn't. If you have experience, use abstraction to accelerate your development.
> Work in any scaled software organization and you'll see that combining logic too early is full of hazard.
Don't combine logic and you need twice as many programmers to do the work: the front-end guy to do the JavaScript validation, the backend guy to do the server validation, and the project manager to ensure they're both always the same. I started this rant by saying if you don't abstract, you'll always be slow. And I stand by that; many projects with dozens of programmers are just manually performing work that could have been abstracted away at the start.
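To make the "combine it" idea concrete, here's a minimal sketch of one way to do it (all names are hypothetical, assuming a Java back end that ships its validation rules to the front end as data instead of having the front end re-implement them):
```
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical example: each validation rule is declared exactly once, as data.
record FieldRule(String field, Pattern pattern, int maxLength, String message) {

    boolean isValid(String value) {
        return value != null
                && value.length() <= maxLength
                && pattern.matcher(value).matches();
    }

    // The same rule can be serialized and shipped to the client, so the
    // front end interprets it instead of re-implementing it in JavaScript.
    Map<String, Object> toClientSpec() {
        return Map.<String, Object>of(
                "field", field,
                "pattern", pattern.pattern(),
                "maxLength", maxLength,
                "message", message);
    }
}

class SignupRules {
    static final List<FieldRule> RULES = List.of(
            new FieldRule("email", Pattern.compile("^[^@\\s]+@[^@\\s]+$"), 254,
                    "Please enter a valid email address"),
            new FieldRule("username", Pattern.compile("^[a-z0-9_]{3,20}$"), 20,
                    "3-20 lowercase letters, digits or underscores"));

    // Server-side enforcement uses the exact same rule objects.
    static List<String> validate(Map<String, String> form) {
        return RULES.stream()
                .filter(rule -> !rule.isValid(form.get(rule.field())))
                .map(FieldRule::message)
                .collect(Collectors.toList());
    }
}
```
The front end still needs a little code to interpret the exported rules, but the decision about what counts as valid lives in exactly one place.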
> Your argument just boils down to bad programmers writing bad code.
Good programmers can write bad abstractions as well, because good abstractions require sufficient understanding of what you're trying to abstract, and we all have blind spots (the unknown unknowns). Thus the caution.
> It has nothing to do with abstraction or not.
"Bad code" is one thing, "bad abstraction" is a very specific subset of that problem.
> Everyone either wants to find the silver bullet that will produce code perfectly the first time or find some concept to blame when it isn't.
I think you're missing the point. Bad abstractions, especially when depended on by many users, are 10x bugs. Nobody's saying "make all code procedural and never abstract anything" -- however it's very valid to say "problems caused by bad abstractions are super bad so let's be extra careful." I mentioned strategies for being extra careful earlier: let more experienced folks do the design, defer the abstraction until you have more information, etc.
There are also levels to everything. Of course there are no repercussions for how you structured the classes in your internal-facing webapp. Nobody cares about your codebase except you. If you're building a foundational building block of a complex system, however (e.g. the message bus for a self-driving car), you had better make the best approximation of the right answer from the get-go, because that system isn't going to be rebuilt for many years.
> Don't combine logic and you need twice as many programmers to do the work.
This doesn't ring true. Duplicated logic doesn't always mean double work.
> I started this rant by saying if you don't abstract, you'll always be slow.
Yes and my point is: if you think abstracting is always a net win, you're probably green and haven't seen the myriad cases where it bites you.
I'm anything but green. And I have, of course, made all these mistakes over my career. But if you're capable of running in 5th gear, don't take advice that says stay in 1st until you've finished the product.
> Then you've just created an abstraction with all that potential to be bad sometime in the future.
I would argue that you've then created an abstraction, but with all the hindsight allowing you to create the _correct_ abstraction (or at least a much better chance of approaching "correct").
100% agree, especially about refactoring duplicated logic. Super-duplicated code begs to be refactored, and having many examples of the same functionality helps you build an API that is robust without adding "what-if" functionality to try to futureproof code (impossible).
Migrating to better APIs is done all the time. It is not an issue worth discussing anymore. But even if you have to maintain an API, that doesn't mean you cannot change the underlying implementation.
I'm working with an API right now that is absolutely based on duplicated code. They have a system for querying items, and the API does it 3 different ways depending on the endpoint. I just found a new one the other day and I hated it -- why is this one endpoint unnecessarily different from all the rest!
I'm building a library to call this API and I've abstracted over all these differences so my callers never have to know how messed up the underlying API is -- they get a consistent experience regardless.
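Roughly, the shape of that kind of shim looks like this (hypothetical names, a sketch of the idea rather than the real library): every endpoint, however it actually pages, gets adapted to one cursor-based page fetch, and callers only ever see a plain iterable.
```
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: one page-fetching signature, many adapters behind it.
// A fetch takes an opaque cursor (null for the first page) and returns the
// items plus the next cursor (null when there are no more pages).
record Page<T>(List<T> items, String nextCursor) {}

class PagedQuery<T> implements Iterable<T> {
    private final Function<String, Page<T>> fetchPage;

    PagedQuery(Function<String, Page<T>> fetchPage) {
        this.fetchPage = fetchPage;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private Page<T> page = fetchPage.apply(null); // first page
            private int index = 0;

            @Override
            public boolean hasNext() {
                // Move to the next page when the current one is exhausted.
                while (index >= page.items().size() && page.nextCursor() != null) {
                    page = fetchPage.apply(page.nextCursor());
                    index = 0;
                }
                return index < page.items().size();
            }

            @Override
            public T next() {
                return page.items().get(index++);
            }
        };
    }
}
```
Each messy endpoint then only needs a small adapter that maps its particular offset/cursor/token scheme onto that one Page shape, and callers just write a for-each loop without knowing which endpoint they hit.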
I'm not sure that's true. When I look at the code bases of some of the most productive coders I can think of (John Carmack and the Doom and Quake code bases, for example), they tend to be fairly conservative with how much they abstract things. There's a lot of very thoughtful data structure usage (git is another good example of this) and diligence in maintaining a coding standard (which generally has little to do with formatting), but most of the code seems to be more concrete and task-focused rather than abstract.
I think Casey Muratori has a good way of thinking about this, with his concept of "semantic compression" ( https://caseymuratori.com/blog_0015 ). To me that's a lot more valuable than the ideas you get in, say, something like Clean Code (a book whose popularity has, in my opinion, been a disaster for the industry).
I'm positive they don't have much in the way of duplicated code.
Abstraction can be as simple as a function.
It seems odd to give really good examples of abstraction and then sort of do a No true Scotsman argument on it. "It can't be abstraction because it's not terrible."
so your argument is, "you've got to write functions, therefore abstraction is always good" ?
yeah man, I don't start in main() and never write another function ever again. You got me. But I also don't aggressively police duplication, accepting that I would always rather see what's happening without file-hopping. The correct abstractions will simplify code, and they will seem obvious (even if only in retrospect). Abstractions that force me to continually re-frame the problem I am trying to solve in terms of someone else's use case are antithetical to writing the sort of code that I do.
You don't have to write functions. Have you ever seen 10,000 lines of code that were just a single function? I have. It also had nested if statements 5 levels deep to handle all sorts of logic, with lots of duplication. It was unmaintainable. Yet it did meaningful work, and it could have been refactored to a fraction of the size and still do the same work. But, to be honest, when I had to fix it I just went in and fixed it in the 20 places that needed changing, because it was impossible to follow.
You like good abstractions and you hate bad abstractions. I couldn't agree with that more.
There are useful abstractions and useless abstractions. A lot of the GoF designs are bad abstractions (if used indiscriminately) and crutches for bad languages. However, using problem-focused abstractions is a big time-saving strategy.
Wrong abstractions percolate through your system: assumptions about how your abstraction is supposed to work harden into ossification that hides the concrete implementations and their actual, generally simpler, constraints.
Basically, your refactoring work now requires understanding all the user code that relies on the wrong aspects of your abstraction, finding a way to correct it if you're lucky, and making it work exactly the same way the duplicated code would have.
And I didn't even mention implementations that drift in ways incompatible with the abstraction, a large source of errors and regrets.
The good bet for productivity is recognizable implementation patterns and duplication.
In the end, refactoring duplicated code that has had time to settle and drift in legitimate ways, to find your correct abstraction, is a blast.
I disagree and I can provide an example. I'm creating a library to interface with a REST API. The creators of this REST API obviously didn't do any abstractions and they have multiple implementations for the same exact process: paged queries of items. There's no reason for them to be different -- they're just different because, I assume, different developers built them differently at different times. Nobody looked at this and said "This is all the same so we should have one single common implementation abstracting over all the endpoints."
However, as the developer of the interface library, I can abstract over all the differences and give my consumers the exact same experience regardless of the API endpoint. And that's exactly what I did. So now they're all more productive because they don't need to know all these unimportant details. They don't even need to know that there's a REST API. In fact, this REST API replaces a previous API implemented with a completely different technology, and we are swapping the whole thing out with minimal changes because I abstracted it years ago.
Not all abstractions are wrong. Not all concrete implementations are simpler. My goal with an abstraction is to take something else and make it closer to what we need, because most technology has a wide audience with wide requirements. I'm a narrow audience with narrow requirements, so I can hide the vast complexity that I simply don't care about.
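For what it's worth, the overall shape of that library is nothing fancy (hypothetical names, just a sketch of the structure): callers depend on a narrow interface shaped around what we need, and all the endpoint quirks live in one replaceable implementation.
```
import java.util.List;

// Hypothetical domain type and a narrow interface shaped around what *we*
// need, not around whatever the remote API happens to expose.
record Item(String id, String name) {}

interface ItemCatalog {
    List<Item> findByCategory(String category);
    Item get(String id);
}

// Hypothetical minimal HTTP helper; in real code this wraps an HTTP client.
interface RestClient {
    <T> T getJson(String path, Class<T> type);
    <T> List<T> getJsonList(String path, Class<T> type);
}

// Implementation for the current REST API. All the inconsistent endpoints,
// pagination styles and auth details stay behind this wall.
class RestItemCatalog implements ItemCatalog {
    private final RestClient client;

    RestItemCatalog(RestClient client) { this.client = client; }

    @Override
    public List<Item> findByCategory(String category) {
        return client.getJsonList("/v2/items?category=" + category, Item.class);
    }

    @Override
    public Item get(String id) {
        return client.getJson("/v2/items/" + id, Item.class);
    }
}
```
When the underlying technology is swapped out, only a second ItemCatalog implementation has to be written; the calling code never changes.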
Wrong abstractions start off as right abstractions and slowly become wrong abstractions. What's the point at which a right abstraction becomes a wrong one? Am I sure I can identify that point? Can someone else? Can someone else who has no knowledge of the original assumptions that were implied during the initial abstraction?
There are two kinds of abstractions in your codebase: the ones everyone complains about and the ones that no one has ever seen.
My rule of thumb is thus: Have I repeated myself three times doing the EXACT same thing? Then CONSIDER abstracting it away. Otherwise, make as many implicit dependencies explicit as possible and keep slightly repeating yourself until you are exactly repeating yourself.
Your rule of thumb isn't great: if you have some important logic duplicated in two places and a year from now it has a bug -- are you going to remember to change it in both places? But, let's be honest, you probably would not create that code in the first place -- you'd have abstracted it automatically without even thinking about it.
These conversations generally tend to completely discount experience. Junior programmers are often terrible at abstractions -- they either do way too much or way too little. Can I give them a hard and fast rule that they can use to never make that mistake? No, I can't. It doesn't exist. The only reason I know what's good or bad is because I've done it wrong thousands of times.
That's the problem with every single one of these articles that prescribe one true solution. It's not at all that simple.
> When, not if, things break, there's a large benefit to having all of the logic contained to a few heavy-lifter classes that contain bespoke logic and are fit-for-purpose.
Things break in cycles. You'll have worked around a first wave and be happy that you didn't abstract your code too much, since you can just fix one side of your logic very locally. It also means you didn't touch the other sides that weren't directly affected but probably needed a fix in a slightly different, overall similar, way.
So you'll see your code instances all break one way or another, and you'll fix them one by one instead of hardening a central piece where you could focus your testing efforts.
Of course it's a topic that needs nuance, but if you identify a piece of code as duplicated, there will be no free lunch. Either you pay the effort of abstracting upfront, or you pay for the local fixes down the line; neither approach is fundamentally wrong. I see it as a bet that either pays off or doesn't.
I think a key distinction often lost here is that generic code and abstract code are different. Abstract code hides details; generic code allows its use in more places. When hiding details, code often also becomes more generic. Making code generic does not necessarily hide details; it can very well expose additional details.
Also seemingly not mentioned - SRP (single responsibility principle). SRP & DRY should be considered together. If a person DRY'ies up code without regard to SRP, they're making any code that can be generic, generic. A rule of thumb is generic code is 3x more expensive than non-generic code.
==============
To illustrate, here is an example (and pretend that these examples are duplicated in 20 different places that all need the account balance sum):
--------------
Example (1) - non-generic, non-abstract
```
int savingsBalance = 1;
int checkingBalance = 1;
int totalBalance = savingsBalance + checkingBalance;
```
--------------
Example (2) - generic, minimally abstract
```
int savingsBalance = 1;
int checkingBalance = 1;
int totalBalance = addBalances(savingsBalance, checkingBalance);
```
--------------
Example (3) - abstract, potentially generic:
```
int totalBalance = addBalances();
```
===============
Now consider what happens if we need to add a 'brokerage account balance' to the mix (and let's say we get that value via an API call). These examples change in the following ways:
Example (1), updated:
```
int savingsBalance = 1;
int checkingBalance = 1;
int brokerageBalance = fetchBrokerageBalance();
int totalBalance = savingsBalance + checkingBalance + brokerageBalance;
```
Example (2), updated:
```
int savingsBalance = 1;
int checkingBalance = 1;
int brokerageBalance = fetchBrokerageBalance();
int totalBalance = addBalances(savingsBalance, checkingBalance, brokerageBalance);
```
Example (3), updated & unchanged:
```
int totalBalance = addBalances();
```
Example (1) & Example (2) have similar scaling behavior here (scaling relative to complexity). This illustrates a very key difference between abstract and generic code.
Now, let's say on the other hand that whether we should include the brokerage balance is conditional. In example 1, we have the same logic to be applied in 20 different places. We can mutate example 3 to be more generic (e.g. pass in a flag - `addBalances(Flags.includeBrokerageAccount)`). At this point we can say that the abstraction is wrong and needs to be split into different methods (which is fine!). Making example 3 more generic is more complex; we incur the penalty of having generic code. Example 1 is arguably the worst to have, since we will get subtle errors if we fail to update everything. In part these design principles are there to help protect updates and make them safe (very similar to the ACID guarantees of databases, which make it so you can update data without breaking the overall database).
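A quick sketch of that fork in the road, continuing the balance example (names are hypothetical): the flag version makes the one shared method more generic and pushes a decision onto all 20 call sites, while splitting keeps each method focused.
```
class Balances {
    // Hypothetical accessors standing in for the values used in the examples above.
    int savingsBalance()        { return 1; }
    int checkingBalance()       { return 1; }
    int fetchBrokerageBalance() { return 1; } // pretend this is the API call

    // Option A: make the abstraction more generic with a flag.
    // Every one of the 20 call sites now has to decide which flag to pass,
    // and the shared method body grows a conditional.
    int addBalances(boolean includeBrokerage) {
        int total = savingsBalance() + checkingBalance();
        if (includeBrokerage) {
            total += fetchBrokerageBalance();
        }
        return total;
    }

    // Option B: split the abstraction into focused methods instead.
    // Each call site picks the method that matches its intent, and neither
    // method carries logic its callers don't need.
    int cashBalance() {
        return savingsBalance() + checkingBalance();
    }

    int totalBalanceIncludingBrokerage() {
        return cashBalance() + fetchBrokerageBalance();
    }
}
```
Both options keep the 20 call sites from drifting out of sync; they just pay for it in different places.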
Another mention, which I won't go into detail on: boilerplate code has yet different characteristics.
In sum, it's largely a question of what kind of coupling is best and how to deal with that coupling. Duplicated code is coupled without any runtime or compile time checks that it stays in sync (if you forget to update something from example 1 above, it's a bug!). Keeping code consolidated into a common procedure does not remove that coupling; it just changes the nature of the coupling and makes it more explicit. Common code between micro-services couples those micro-services together (and that can be very bad).
Thus, we need to look at a lot of things when applying DRY: we need to consider SRP, whether we are coupling services together, and whether or not we are simply making non-generic code generic.
It's hard to explain such complicated concepts super concisely. What I'm getting at is that DRY is often equated with merely making code generic and re-used, while the goal of DRY is not at all about re-use. Generic code is more complicated than non-generic code, thus if we make code generic just for the sake of using it in many places, that is likely going to make things more complex. It's a fundamental misunderstanding that DRY is simply the act of using human pattern matching to make all similar-looking code generic and re-used. Instead DRY is more about: "are we sanitizing data before sending it to the front-end? Then that should be done in one place" (see the sketch after this comment). "Where are we configuring database connections?" etc.
Further DRY should not be the only guiding factor, SRP & coupling should always be considered at the same time as DRY.
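To make the "one place" point concrete, a minimal sketch (hypothetical names): every response path funnels its outgoing fields through a single sanitizer, so a change in policy is a change in one file.
```
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: the "what is safe to send to the front end" decision
// lives in exactly one place.
final class ResponseSanitizer {

    private ResponseSanitizer() {}

    // Escape the handful of characters that matter for HTML injection.
    static String sanitize(String value) {
        return value
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Every endpoint calls this one method on its outgoing fields.
    static Map<String, String> sanitizeAll(Map<String, String> fields) {
        return fields.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        entry -> sanitize(entry.getValue())));
    }
}
```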
We need to have a kind of a footnote in the pedagogy of software engineering, and engineering in general that states to avoid advice (“wisdom”, lol) expounded by blowhards. You can identify it usually by the title - it’ll have a Grand Style that betrays arrogance.
Lots of people from GoF onwards think they qualify to preach bullshit ultimatums, thinking they have it all figured out. I don’t think any of them have any fucking clue what should actually be considered harmful, what should be the two/three “hardest things in computer science”, and other nonsensical bullshit they write. With apologies to Dijkstra who I do find to have been one of the shining lights of computer science and engineering but is often misquoted/out-contexted for that considered harmful thing. His letters do betray a higher plane of wisdom.
The more recent “what programmers need to know about {x}” as if the author has any clue is just the continuation of “I’ve learned this last week/in my last project and it’s the most important thing,” instead of the trivia that it really is, or shit that’s abstracted for us nowadays and only serves to make the author feel superior. Just fuck off with all of that nonsense.
Coincidentally, I’m going to go and read the Hamming book as it’s got tangible value having been written by someone who has done something worthwhile in their career.
It sounds more like the idea you propose is "just do whatever" and that those guys (seasoned devs and instructors) have absolutely no experience behind their advice.
There's nothing particularly nonsensical about the "two/three “hardest things in computer science” (although it was said half in jest).
The vast majority of advice like that is garbage and is trying to borrow the authority of the few good articles that come out with similarly structured titles.
Usually by people who mistake "the product is successful" for "the product is well engineered". Or who mistake their rewrite from "the worst way to solve the problem" to "the second worst way to solve the problem" <hyperbole> for "this is the best way to solve the problem".
A significant amount of things said about computer "science" and engineering is opinions, more so than most believe or are willing to admit. That doesn't mean it's all wrong, but that not everything is universally applicable just because a smart person said a thing.
I enjoy the attitude, but sometimes people like to read someone else’s view on something, or to gain insights on something they don’t know anything about.
Writing authoritatively might be the only way people can get people to read some things. I’m ok with that.
I'm a copy+paste programmer, and proud of it. It's quicker, easier, and most importantly: someone else's problem to fix, if they're the type of developer who disagrees with this coding style.
I'll keep churning out duplicated code and you guys can keep refactoring against it. We all get paid so what's the problem?