Excellent article. However, software, like everything else, is subject to the laws of physics. Entropy always wins in the end. No matter how good the original engineering and planning, without maintenance it will all fall apart soon enough.
Right. It’s not as if the site will suddenly fall over because all the engineers walked out. It’s that when an issue comes up, there may not be the expertise on hand to fix it. So the remaining Twitter engineers should expect a rough time over the coming months.
You assume that small issues are ever prioritized in the first place. My experience with large tech organizations is that their priorities are driven by internal culture, not best practices, just like any other company.
Although I haven't worked at Twitter, if it is anything like the places I have worked, the small issues get ignored because nobody makes their career on things that don't show up on investor, shareholder, or upper-management radars. If a problem can be papered over so it becomes someone else's problem in the future, it will be. What actually gets worked on is flashier: redesigning the timeline AGAIN so users spend more time on the website, or (in the case that sent me back to working for small startups) spending three weeks with 12 stakeholders deciding on a super trivial design decision.
There must be some plan for this, though, right? If it were me and I were trying to do more with less I'd plan to cut some features (for instance, are Spaces that critical to the operation? Because it seems like running them would be demanding) and try to migrate things to managed services to reduce the operational load.
I don't have a DR plan for "mad billionaire buys the company", no, but I do agree with the OOM reaper logic - I start cutting the services that use the most resources that are least necessary.
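To make the OOM-reaper analogy concrete, here's a toy sketch of that triage: rank each service by resource cost divided by how necessary it is, and cut from the top. All the service names and numbers are invented for illustration.

```python
# Hypothetical "OOM reaper" triage for services: high resource use and
# low necessity means you get cut first. Names/scores are made up.

def reap_order(services):
    """Return service names, most expendable first.

    services: dict name -> (resource_cost, necessity), where necessity
    is a subjective 1-10 score and resource_cost is in arbitrary units.
    """
    def badness(item):
        name, (cost, necessity) = item
        return cost / necessity  # expensive + optional => cut first
    return [name for name, _ in
            sorted(services.items(), key=badness, reverse=True)]

services = {
    "timeline": (100, 10),  # expensive but essential
    "spaces":   (80, 2),    # expensive, arguably optional
    "dms":      (30, 8),
}
print(reap_order(services))  # ['spaces', 'timeline', 'dms']
```

Just like the kernel's OOM killer, the hard part in practice is the necessity score, not the sort.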
I don't mean that they had one sitting around, but anyone who's stayed with the company has to find this their most urgent priority, I would think, between the focus on cutting operational costs and the simple fact that they have far fewer people around.
Well, that was a bit of a joke, but I have been around the block a few times: I've been on one side or the other of a merger, and gone public to private and back (as an IC/soldier, never management). The Twitter acquisition is the craziest thing I've ever seen; there's a reason it's big news.
These kinds of things have formal courtship-type periods and legal filings that take months, and then they happen, and then executives plan on rolling out layoffs and cost-cutting measures over years, not days or weeks, as the "plan" seems to be here. This really is madness, and I can't blame people for thinking the site might just wobble and collapse at some point, though as the author goes into in very nice detail, we do try to build resilient systems. I'm certainly curious how it all plays out, but you won't catch me making any guesses.
This kind of schedule and chaos looks more familiar for acquisitions of a small company I think. I guess it is strange for it to happen to a company of Twitter's size, but I think the big-news aspect of it has more to do with how much journalists use Twitter.
> I guess it is strange for it to happen to a company of Twitter's size, but I think the big-news aspect of it has more to do with how much journalists use Twitter.
I think you hit on the main point: the reason this is constantly in the mainstream media is that the mainstream media so heavily relied on, and still relies on, Twitter[1].
If this were any other large company imploding in a Musk acquisition we'd see fewer stories about it because it affects journalists less.
--------
[1] It's going to be especially funny if journalists complain that "#LearnToCode" type of tweeters aren't getting banned, and get told in response to go off and create their own twitter.
No matter where on the political spectrum one may lie, it's always satisfying to see a group being fed their own lines back.
> This kind of schedule and chaos looks more familiar for acquisitions of a small company
That's certainly possible and I wouldn't know what that would look like, and yep, in theory there are lots of financial rules and regulations regarding public companies that Musk made a bit of a hash of.
I can't agree that Twitter's journalist demographic was mostly responsible for this being big news, though I'm sure that plays a part. Musk brings the circus with him wherever he goes, but this was nuts even for him. And any $44 billion tech buyout would be news without those factors.
software, like everything, is subject to laws of physics
I disagree; math would be a closer analogy. And indeed, arithmetic still works the way it did a millennium ago. Closer to the present, I have binaries from the late 80s that still work today (and I use them semi-regularly).
Indeed, much of the impetus of the software industry seems to be to propagate the illusion that software somehow needs constant "maintenance" and change. For the preservation of its own self-interest, of course: much like the company that makes physical objects too robust and runs out of customers, planned obsolescence and the desire to change things (and to justify the change so someone can be paid to make it) are still very much with us.
It's possible to make things which last. Unfortunately, much of the time, other economic considerations preclude that.
If software ran without side effects, perhaps. But it doesn't. Databases grow, files are uploaded, logs pile up, messages and events propagate, and filesystems fill up. This is why entropy matters.
Exactly. Tiny memory leaks in seldom-called functions can also cause slow degradation over time. People wonder why a simple restart seems to 'fix a boatload of problems', but this is often the reason why.
> I disagree; math would be a closer analogy. And indeed, arithmetic still works the way it did a millennium ago. Closer to the present, I have binaries from the late 80s that still work today (and I use them semi-regularly).
Sure, those binaries might work the same when executed. The probability of that is never quite 100%, but as you pointed out, the rules of arithmetic aren’t expected to change any time soon. Unfortunately, software does not exist in its own micro-verse; it’s subject to the laws of physics acting on the machines it runs on. So while you might be able to write scripts that still work decades later, it’s much harder to ensure those scripts consistently run for decades. RAM chips, CPUs, and everything in between are guaranteed to eventually fail if left running unsupervised in perpetuity.

Entropy rises with complexity. At Twitter’s scale, running a software service means globally distributed cloud infrastructure: likely hundreds of services, deployed to many hardware instances distributed across the globe. Twitter isn’t one script running once and producing a single result; it’s hundreds if not thousands of systems interacting with one another across many physical machines. Layers of redundancy help, but ultimately cascading failures are a mathematical certainty. Many would argue the best strategy for reducing downtime on these systems is to optimize for low recovery time when you do fail.
Software is also bound to the world in other ways. Similarly to how most business processes, products and even more generally, tools, change over time, so too do the requirements placed on software systems made to facilitate or automate these things.
Ultimately the only way to escape the maintenance cost of software is to stop running it. The longer you leave a software system running, the more likely it is to eventually stop.
Even if the entirety of Twitter.com were mathematically proven correct, it still would run on servers that are made of physical bits that are subject to entropy.
It’s possible to make things that last if you are in total control of the whole stack, including hardware.
Embedded systems that still do their job after 30 years do exist but they live in isolation in a specific and controlled environment, and are built for a limited, unchanging task.
On the other hand, complex web software is built on layer upon layer that is not in Twitter’s complete control.
Hardware changes regularly, requiring changes at the lower levels of the OS and inducing potential changes in behaviour and performance, which in turn require adaptation.
And that’s before considering security, an eternally moving goalpost. Not just at the OS or network level, but also at the business level.
Twitter et al. are not living in a locked-down context; they live in the messy world of human interactions, and that alone requires constant tweaking.
So yes, a binary is more like a mathematical construct and by itself it won’t rot, but if the world around that binary changes, you need to change the binary as well, and for that you need maintenance. The amount required depends on the complexity, brittleness and how well your stack is engineered, but implying it’s a con is a bit extreme.
Computation is literally bound by entropy. Math has no such limitations unless you explicitly define them.
I thoroughly recommend researching entropy as it relates to, e.g., information theory, systems engineering, and even (perhaps especially) machine learning.
Computation is ultimately about what we can compute _in this universe_ and the forward flow of time is an emergent property from the universe’s innate entropic guarantees.
Time is “pre-sorted” for us thanks to entropy, enabling us to define algorithmic complexities over the time domain in the first place.
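For anyone following the information-theory pointer above: the standard starting point is Shannon entropy, the average surprise of a distribution in bits. A quick illustrative sketch:

```python
# Shannon entropy H(p) = -sum(p * log2(p)), in bits, over nonzero
# probabilities. Maximum surprise for uniform distributions,
# zero surprise for certainty.
import math

def shannon_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(shannon_entropy([0.25] * 4))        # fair 4-way choice: 2.0 bits
print(shannon_entropy([1.0]))             # certainty: 0.0 bits
```

The thermodynamic and information-theoretic senses of "entropy" are formally linked (Landauer's principle), which is what makes the comment above more than a metaphor.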
> we don’t think so. the prod incident we heard about involved someone making an ill-advised choice to reactivate a large account, causing a huge load on the social-graph system, on the night before a prolonged high-traffic event.
Spot on. I absolutely hate the attitude that software sitting there just gathers wear and tear as if it's a mechanical device. Software is written with a particular target platform in mind: x86, ARM, Nvidia GPUs, FPGA soft-processors, etc. If the hardware you are running on doesn't change, your software should still function. If the specs of that target platform don't change, your software should still function. If the specs of the target platform change but a hard-working compiler engineer has done the work to make sure your software gracefully uses the new features (for example, a compiler optimizing with AVX instructions), your software should still function.
The fact that most software doesn't continue to function even on the same platform, and on the same hardware, is a massive indictment of the software industry's standard practices.
Complex software has complex failure modes though.
An application running on a single platform, self-contained and with some basic failovers such as redundancy (2+ machines running the same application), should have ridiculously high uptimes.
A distributed and complex system with interdependent components, under variable load, with different capacities for its subsystems, running across thousands of machines will inevitably encounter some unforeseen state that degrades the system as a whole. It can be a small component cascading a failure in an unexpected way, or a big component failing spectacularly due to a bug, a race condition, or a multitude of other issues that are not entirely predictable, or guarded against, at the time the software is written.
The latter is what has "wear and tear": it's not one piece of software, it's a whole system of programs communicating with each other on hardware in varying states of decay. You can design and build it to be resilient against a multitude of predictable issues, but you can never expect it to run perfectly fine unattended.
Unfortunately it’s not this simple because most non-trivial software is written with a dependency tree, every node of which may discover vulnerabilities (or performance problems) which, when patched, trigger update cascades in this tree.
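That update cascade is just graph traversal: patching one node forces a rebuild of everything that depends on it, directly or transitively. A small sketch over a made-up reverse-dependency graph:

```python
# BFS over the reverse-dependency graph: everything reachable from
# the patched package must be rebuilt/redeployed. Package names are
# invented for illustration.
from collections import deque

def rebuild_set(dependents, patched):
    """dependents: pkg -> list of packages that depend on it."""
    to_rebuild, queue = set(), deque([patched])
    while queue:
        pkg = queue.popleft()
        for dep in dependents.get(pkg, []):
            if dep not in to_rebuild:
                to_rebuild.add(dep)
                queue.append(dep)
    return to_rebuild

dependents = {
    "openssl": ["libcurl", "auth-service"],
    "libcurl": ["our-app"],
}
print(rebuild_set(dependents, "openssl"))
# one openssl CVE -> libcurl, auth-service, and our-app all churn
```

The deeper and wider the tree, the larger the blast radius of any single patch, which is where the maintenance cost actually comes from.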
You're forgetting that software on its own is basically useless. In order for it to provide value, it has to be operated by a physical machine. All running software is physical, with spinning disks and mechanical relays, electrons being pushed back and forth, and photons flying around. Twitter is not a piece of software; it's a complicated physical system. Software is an abstraction.
Don't forget that users have been trained to be incredibly fault-tolerant as a result of how flaky software in general can be. Now that cars are having BSODs, that tolerance may reach new levels or just evaporate.
Thanks, I agree. I think small issues can build up over time; it'll be interesting to see what happens. Wish I could see the future post-mortem docs from the outages.
Turning it off and then back on again probably fixes the issue. It's very unlikely there's a grand ticking time bomb just waiting to bring it all down. Recycling servers will probably keep it running.
> Turning it off and then back on again probably fixes the issue.
Turning a large scale system entirely off and on is never simple. Invariably you’ll run into some kind of circular dependency that must be manually investigated. And even tracking those down becomes tricky.
Classic examples are things like DNS, service locators, or authentication systems. And large tech companies are notorious for NIH-syndrome for all of those.
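The cold-start problem above is a topological-sort problem: services can only boot after their dependencies, and a cycle (say, DNS needs auth and auth needs DNS) means there is no clean startup order without manual intervention. A sketch using the standard library, with invented service names:

```python
# Cold-start planner: compute a boot order from a dependency map,
# or report the cycle that makes a clean bootstrap impossible.
from graphlib import TopologicalSorter, CycleError

def startup_order(deps):
    """deps: service -> set of services it needs before it can start."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as e:
        return f"circular dependency: {e.args[1]}"

print(startup_order({"app": {"auth"}, "auth": {"dns"}, "dns": set()}))
# ['dns', 'auth', 'app']
print(startup_order({"auth": {"dns"}, "dns": {"auth"}}))  # deadlocked
```

In a real outage the dependency map itself is usually undocumented, which is why these restarts turn into archaeology.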
There’s so much redundancy built into modern distributed systems that you can reliably bounce a VM without issue. You can reliably roll bounce a series of VMs.
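The roll-bounce the comment describes is simple in outline: restart machines one at a time, waiting for each to report healthy before moving on, so redundancy keeps the service up throughout. A sketch where `restart` and `check_health` are stand-ins for real orchestration calls:

```python
# Rolling restart: one VM at a time, gated on a health check, so the
# rest of the fleet keeps serving traffic. Abort if a VM stays down.

def rolling_bounce(vms, restart, check_health, retries=3):
    for vm in vms:
        restart(vm)
        for _ in range(retries):
            if check_health(vm):
                break
        else:
            raise RuntimeError(f"{vm} failed to come back; aborting roll")

# toy usage: every VM "recovers" immediately
bounced = []
rolling_bounce(["vm-1", "vm-2", "vm-3"],
               restart=bounced.append,
               check_health=lambda vm: True)
print(bounced)  # all three restarted, one at a time
```

The abort-on-unhealthy branch is the important part: a roll that keeps going past a dead VM quietly eats your redundancy margin.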
Twitter doesn’t have unique scale problems by today’s standards.
First, Mastodon isn’t the only option - it’s not like both couldn’t fail just as both AIM and MySpace did.
Second, Mastodon runs on open protocols. That has good and bad points - for example, it won’t grow as quickly as a project with huge corporate backing - but it does mean that there’s a more direct link between the community and its longevity. Twitter isn’t just flailing because Musk is doing management by bong rip but also because he’s desperately trying to get out of a financial hole. Open source projects have different kinds of financial challenges but they’re never on the hook trying to fill a hole measured in billions of dollars, either. Given the number of communities older than Twitter I’d say it’s far from proven that Twitter will outlast anything.
I found it really interesting that one of the main Iceland instances is being run on a Raspberry Pi out of some student's living room in Sweden. Over 500 users at the moment.
https://types.pl/@tritlo/109383888427885539
Why? A single app running on a single server is several orders of magnitude more resilient than a spaghetti clusterfuck of services upon services. Twitter could be brought down by a single expired certificate.
- completely dead, with invalid SSL certificates or expired domains.
So you would have to keep moving to another Mastodon instance (if you're lucky) or try to run your own instance and join the many instances with the issues above.
There is no monetary incentive to keep a Mastodon instance running, and we both know that begging for donations doesn't scale.
There's a gigantic flood of new users (i.e. literally multiplying the userbase) in the past few weeks, so yeah, a lot of servers are restricting signups to cope.
I don't think "begging for donations" needs to scale, if instances get too large to keep running then smaller instances should (and do) split off, they can still talk to each other after all.