
As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?


To my understanding the main problem is DynamoDB being down, and DynamoDB is what a lot of AWS services use for their eventing systems behind the scenes. So there's probably something like 500 billion unprocessed events that'll need to be processed once they get everything back online. It's gonna be a long one.


500 billion events. It always blows my mind how many people use AWS.


I know nothing. But I'd imagine the number of 'events' generated during this period of downtime will eclipse that number every minute.


"I felt a great disturbance in us-east-1, as if millions of outage events suddenly cried out in terror and were suddenly silenced"

(Be interesting to see how many events currently going to DynamoDB are actually outage information.)


I wonder how many companies have properly designed their clients, so that the timing before a re-attempt is randomised and the retry interval backs off exponentially.


Nowadays I think a single immediate retry is preferred over exponential backoff with jitter.

If you ran into a problem that an instant retry can't fix, chances are you will be waiting so long that your own customer doesn't care anymore.


Most companies will use the AWS SDK client's default retry policy.
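
For what it's worth, the default can be tuned; a minimal sketch using boto3/botocore's documented retry config (the client and attempt count here are just illustrative):

    import boto3
    from botocore.config import Config

    # "standard" retry mode backs off exponentially with jitter;
    # "adaptive" additionally applies client-side rate limiting.
    config = Config(retries={"max_attempts": 5, "mode": "standard"})
    dynamodb = boto3.client("dynamodb", config=config)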


Why randomized?


It’s the Thundering Herd Problem.

See https://en.wikipedia.org/wiki/Thundering_herd_problem

In short, if it's all on the same schedule you'll end up with surges of requests followed by lulls. You want that evened out to reduce stress on the server end.
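
A minimal sketch of the usual mitigation, exponential backoff with "full jitter" (the function name, base delay, and cap are assumptions for illustration, not any particular SDK's API):

    import random
    import time

    def call_with_backoff(operation, max_attempts=8, base=0.5, cap=60.0):
        # The sleep before retry n is drawn uniformly from [0, min(cap, base * 2**n)],
        # so clients that failed at the same moment don't all retry at the same moment.
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))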


Thank you. Bonsai and adzm as well. :)


It's just a safe pattern that's easy to implement. If your service's back-off attempts happen to be synced, for whatever reason, even if they are backing off and not slamming AWS with retries, when it comes online they might slam your backend.

It's also polite to external services but at the scale of something like AWS that's not a concern for most.


> they might slam your backend

Heh


Helps distribute retries rather than having millions synchronize


Yes, with no prior knowledge the mathematically correct estimate is:

time left = time so far

But as you note, prior knowledge will enable a better guess.


Yeah, the Copernican Principle.

> I visited the Berlin Wall. People at the time wondered how long the Wall might last. Was it a temporary aberration, or a permanent fixture of modern Europe? Standing at the Wall in 1969, I made the following argument, using the Copernican principle. I said, Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here. My visit is random in time. So if I divide the Wall’s total history, from the beginning to the end, into four quarters, and I’m located randomly somewhere in there, there’s a fifty-percent chance that I’m in the middle two quarters—that means, not in the first quarter and not in the fourth quarter.

> Let’s suppose that I’m at the beginning of that middle fifty percent. In that case, one-quarter of the Wall’s ultimate history has passed, and there are three-quarters left in the future. In that case, the future’s three times as long as the past. On the other hand, if I’m at the other end, then three-quarters have happened already, and there’s one-quarter left in the future. In that case, the future is one-third as long as the past.

https://www.newyorker.com/magazine/1999/07/12/how-to-predict...


This thought process suggests something very wrong. The guess "it will last again as long as it has lasted so far" doesn't give any real insight. The wall was actually as likely to end five months from when they visited it, as it was to end 500 years from then.

What this "time-wise Copernican principle" gives you is a guarantee that, if you apply this logic every time you have no other knowledge and have to guess, you will get the least mean error over all of your guesses. For some events, you'll guess that they'll end in 5 minutes, and they actually end 50 years later. For others, you'll guess they'll take another 50 years and they actually end 5 minutes later. Add these two up, and overall you get 0 - you won't have a bias toward either overestimating or underestimating.

But this doesn't actually give you any insight into how long the event will actually last. For a single event, with no other knowledge, the probability that it will end after 1 minute is equal to the probability that it will end after the same duration that it lasted so far, and it is equal to the probability that it will end after a billion years. There is nothing at all that you can say about the probability of an event ending from pure mathematics like this - you need event-specific knowledge to draw any conclusions.

So while this Copernican principle sounds very deep and insightful, it is actually just a pretty trite mathematical observation.


But you will never guess that the latest TikTok craze will last another 50 years, and you'll never guess that Saturday Night Live (which premiered in 1075) will end 5 minutes from now. Your guesses are thus more likely to be accurate than if you ignored the information about how long something has lasted so far.


Sure, but the opposite also applies. If in 1969 you guessed that the wall would last another 20 years, then in 1989 you'll guess that the Berlin Wall will last another 40 years - when in fact it was about to fall. And in 1949, when the wall was a few months old, you'll guess that it will last for a few months at most.

So no, you're not very likely to be right at all. Now sure, if you guess "50 years" for every event, your average error rate will be even worse, across all possible events. But it is absolutely not true that it's more likely that SNL will last for another 50 years than for another 10 years. They are all exactly as likely, given the information we have today.


If I understand the original theory, we can work out the math with a little more detail... (For clarity, the Berlin Wall was erected in 1961.)

- In 1969 (8 years after the wall was erected): You'd calculate that there's a 50% chance that the wall will fall between 1972 (8x4/3=11 years) and 1993 (8x4=32 years)

- In 1989 (28 years after the wall was erected): You'd calculate that there's a 50% chance that the wall will fall between 1998 (28x4/3=37 years) and 2073 (28x4=112 years)

- In early 1962 (when the wall was, say, 6 months old): You'd calculate that there's a 50% chance that the wall will fall between 1962 (0.5x4/3=0.667 years) and 1963 (0.5x4=2 years)

I found doing the math helped to point out how wide a range the estimate provides. And 50% of the time you use this estimation method, the true value will indeed fall within the estimated range. It's also worth pointing out that, if your visit was at a random moment between 1961 and 1989, there's only a 3.6% chance that you visited in the final year of its 28-year span, and a 1.8% chance that you visited in the first 6 months.
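
A tiny sketch of the same arithmetic, expressed as remaining years rather than calendar years (the function name is just for illustration):

    def copernican_interval(t_obs):
        # 50% interval for the remaining lifetime under this argument:
        # t_more lies between t_obs/3 and 3*t_obs.
        return t_obs / 3, 3 * t_obs

    for label, t_obs in [("1969 visit", 8), ("1989 visit", 28), ("6-month-old wall", 0.5)]:
        lo, hi = copernican_interval(t_obs)
        print(f"{label}: between {lo:.2f} and {hi:.2f} more years")
    # 1969 visit: between 2.67 and 24.00 more years
    # 1989 visit: between 9.33 and 84.00 more years
    # 6-month-old wall: between 0.17 and 1.50 more years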


However,

> Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here.

It's relatively unlikely that you'd visit the Berlin Wall shortly after it's erected or shortly before it falls, and quite likely that you'd visit it somewhere in the middle.


No, it's exactly as likely that I'll visit it at any one time in its lifetime. Sure, if we divide its lifetime into 4 quarters, it's more likely I'm in quarters 2-3 than in either of 1 or 4. But this is sleight of hand: it's still exactly as likely that I'm in quarters 2-3 as in quarter 1 or 4 - or, in other words, it's as likely I'm at one of the ends of the lifetime as it is that I am in the middle.


>So no, you're not very likely to be right at all.

Well 1/3 of the examples you gave were right.


> Saturday Night Live (which premiered in 1075)

They probably had a great skit about the revolt of the Earls against William the Conqueror.


> while this Copernican principle sounds very deep and insightful, it is actually just a pretty trite mathematical observation

It's important to flag that the principle is not trite, and it is useful.

There's been a misunderstanding of the distribution after the measurement of "time taken so far" (illuminated in the other thread), which has led to this incorrect conclusion.

To bring the core clarification from the other thread here:

The distribution is uniform before you get the measurement of time taken already. But once you get that measurement, it's no longer uniform. There's a decaying curve whose shape is defined by the time taken so far. Such that the estimate `time_left=time_so_far` is useful.


If this were actually correct, then any event ending would be a freak accident, since, according to you, the probability of something continuing increases drastically with its age. That is, according to your logic, the probability of the Berlin Wall falling within the year was at its lowest point in 1989, when it actually fell. In 1949, when it was a few months old, the probability that it would last for at least 40 years was minuscule, and that probability kept increasing rapidly until the day the wall collapsed.


That's a paradox that comes from getting ideas mixed up.

The most likely time to fail is always "right now", i.e. this is the part of the curve with the greatest height.

However, the average expected future lifetime increases as a thing ages, because survival is evidence of robustness.

Both of these statements are true and are derived from:

P(survival) = t_obs / (t_obs + t_more)

There is no contradiction.


Why is the most likely time right now? What makes right now more likely than in five minutes? I guess you're saying if there's nothing that makes it more likely to fail at any time than at any other time, right now is the only time that's not precluded by it failing at other times? I.E. it can't fail twice, and if it fails right now it can't fail at any other time, but even if it would have failed in five minutes it can still fail right now first?


Yes that's pretty much it. There will be a decaying probability curve, because given you could fail at any time, you are less likely to survive for N units of time than for just 1 unit of time, etc.


> However, the average expected future lifetime increases as a thing ages, because survival is evidence of robustness.

This is a completely different argument that relies on various real-world assumptions, and has nothing to do with the Copernican principle, which is an abstract mathematical concept. And I actually think this does make sense, for many common categories of processes.

However, even this estimate is quite flawed, and many real-world processes that intuitively seem to follow it, don't. For example, looking at an individual animal, it sounds kinda right to say "if it survived this long, it means it's robust, so I should expect it will survive more". In reality, the lifetime of most animals is a bimodal distribution - they either die very young, because of glaring genetic defects or simply because they're small, fragile, and inexperienced; or they die at some common age that is species dependent. For example, a human that survived to 20 years of age has about the same chance of reaching 80 as one that survived to 60 years of age. And an alien who has no idea how long humans live and tries to apply this method may think "I met this human when they were 80 years old - so they'll probably live to be around 160".


Ah no, it is the Copernican principle, in mathematical form.


> The wall was actually as likely to end five months from when they visited it, as it was to end 500 years from then.

I don't think this is correct; as in, something that has been there for, say, hundreds of years has a higher probability of still being there in a hundred years than something that has been there for a month.


Is this a weird Monty Hall thing where the person next to you didn't visit the wall randomly (maybe they decided to visit on some anniversary of the wall), so for them the expected lifetime of the wall is different?


Note that this is equivalent to saying "there's no way to know". This guess doesn't give any insight, it's just the function that happens to minimize the total expected error for an unknowable duration.

Edit: I should add that, more specifically, this is a property of the uniform distribution: it applies to any event for which EndsAfter(t) is uniformly distributed over all t > 0.


I'm not sure about that. Is it not sometimes useful for decision making, when you don't have any insight as to how long a thing will be? It's better than just saying "I don't know".


Not really, unless you care about something like "when I look back at my career, I don't want to have had a bias toward either underestimating or overestimating outages". That's all this logic gives you: for every time you underestimate a crisis, you'll be equally likely to overestimate a different crisis. I don't think this is in any way actually useful.

Also, the worst thing you can get from this logic is to think that it is actually most likely that the future duration equals the past duration. This is very much false, and it can mislead you if you think it's true. In fact, with no other insight, all future durations are equally likely for any particular event.

The better thing to do is to get some event-specific knowledge, rather than trying to reason from a priori logic. That will easily beat this method of estimation.


You've added some useful context, but I think you're downplaying its use. It's non-obvious, and in many cases better than just saying "we don't know". For example, if some company's server has been down for an hour, and you don't know anything more, it would be reasonable to say to your boss: "I'll look into it, but without knowing more about it, statistically we have a 50% chance of it being back up in an hour".

> The better thing to do is to get some event-specific knowledge, rather than trying to reason from a priori logic

True, and all the posts above have acknowledged this.


> "I'll look into it, but without knowing more about it, statistically we have a 50% chance of it being back up in an hour"

This is exactly what I don't think is right. This particular outage has the same a priori chance of being back in 20 minutes, in one hour, in 30 hours, in two weeks, etc.


Ah, that's not correct... That explains why you think it's "trite" (which it isn't).

The distribution is uniform before you get the measurement of time taken already. But once you get that measurement, it's no longer uniform. There's a decaying curve whose shape is defined by the time taken so far. Such that the statement above is correct, and the estimate `time_left=time_so_far` is useful.


Can you suggest some mathematical reasoning that would apply?

If P(1 more minute | 1 minute so far) = x, then why would P(1 more minute | 2 minutes so far) < x?

Of course, P(it will last for 2 minutes total | 2 minutes elapsed) = 0, but that can only increase the probabilities of any subsequent duration, not decrease them.


That's inverted, it would be:

If: P(1 more minute | 1 minute so far) = x

Then: P(1 more minute | 2 minutes so far) > x

The curve is:

P(survival) = t_obs / (t_obs + t_more)

(t_obs is the time observed to have survived so far, t_more the additional time to survive)

Case 1 (x): It has lasted 1 minute (t_obs=1). The probability of it lasting 1 more minute is: 1 / (1 + 1) = 1/2 = 50%

Case 2: It has lasted 2 minutes (t_obs=2). The probability of it lasting 1 more minute is: 2 / (2 + 1) = 2/3 ≈ 67%

I.e. the curve is a decaying curve, but the shape / height of it changes based on t_obs.

That gets to the whole point of this, which is that the length of time something has survived is useful / provides some information on how long it is likely to survive.
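
A quick simulation sketch of the delta-t argument behind that curve, assuming the observation point is uniformly distributed over the total lifetime (the function name and trial count are just illustrative):

    import random

    def simulated_survival_prob(t_obs, t_more, trials=200_000):
        # The elapsed fraction of the total lifetime at the moment of
        # observation, r = t_obs / T, is assumed uniform on (0, 1].
        # The remaining lifetime is then T - t_obs = t_obs * (1 - r) / r.
        survived = 0
        for _ in range(trials):
            r = 1.0 - random.random()  # uniform on (0, 1], avoids division by zero
            if t_obs * (1 - r) / r > t_more:
                survived += 1
        return survived / trials

    print(simulated_survival_prob(1, 1))  # ~0.50, matching 1 / (1 + 1)
    print(simulated_survival_prob(2, 1))  # ~0.67, matching 2 / (2 + 1)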


> P(survival) = t_obs / (t_obs + t_more)

Where are you getting this formula from? Either way, it doesn't have the property we were originally discussing - the claim that the best estimate of the duration of an event is double its current age. That is, by this formula, the probability of anything collapsing in the next millisecond is P(1 more millisecond | t_obs) = t_obs / (t_obs + 1ms) ~= 1 for any t_obs >> 1ms. So by this logic, the best estimate for how much longer an event will take is that it will end right away.

The formula I've found that appears to summarize the original "Copernican argument" for duration is more complex - for 50% confidence, it would say:

  P(t_more in [1/3 t_obs, 3t_obs]) = 50%
That is, given that we have a 50% chance to be experiencing the middle part of an event, we should expect its future life to be between one third and three times its past life.

Of course, this can be turned on its head: we're also 50% likely to be experiencing the extreme ends of an event, so by the same logic we can also say that P(t_more = 0 [we're at the very end] or t_more = +inf [we're at the very beginning and it could last forever] ) is also 50%. So the chance t_more > t_obs is equal to the chance it's any other value. So we have precisely 0 information.

The bottom line is that you can't get more information out of a uniform distribution. If we assume all future durations have the same probability, then they have the same probability, and we can't predict anything useful about them. We can play word games, like this 50% CI thing, but it's just that - word games, not actual insight.


I think the main thing to clarify is:

It's not a uniform distribution after the first measurement, t_obs. That enables us to update the distribution, and it becomes a decaying one.

I think you mistakenly believe the distribution is still uniform after that measurement.

The best guess, that it will last for as long as it already survived for, is actually the "median" of that distribution. The median isn't the highest point on the probability curve, but the point where half the area under the curve is before it, and half the area under the curve is after it.

And the above equation is consistent with that: setting t_obs / (t_obs + t_more) = 1/2 gives t_more = t_obs.


I used Claude to get the outage start and end times from the post-event summaries for major historical AWS outages: https://aws.amazon.com/premiumsupport/technology/pes/

The cumulative distribution actually ends up pretty exponential, which (I think) means that if you estimate the amount of time left in the outage as the mean of all outages that are longer than the current outage, you end up with a flat value that's around 8 hours, if I've done my maths right.

Not a statistician so I'm sure I've committed some statistical crimes there!

Unfortunately I can't find an easy way to upload images of the charts I've made right now, but you can tinker with my data:

    cause,outage_start,outage_duration,incident_duration
    Cell management system bug,2024-07-30T21:45:00.000000+0000,0.2861111111111111,1.4951388888888888
    Latent software defect,2023-06-13T18:49:00.000000+0000,0.08055555555555555,0.15833333333333333
    Automated scaling activity,2021-12-07T15:30:00.000000+0000,0.2861111111111111,0.3736111111111111
    Network device operating system bug,2021-09-01T22:30:00.000000+0000,0.2583333333333333,0.2583333333333333
    Thread count exceeded limit,2020-11-25T13:15:00.000000+0000,0.7138888888888889,0.7194444444444444
    Datacenter cooling system failure,2019-08-23T03:36:00.000000+0000,0.24583333333333332,0.24583333333333332
    Configuration error removed setting,2018-11-21T23:19:00.000000+0000,0.058333333333333334,0.058333333333333334
    Command input error,2017-02-28T17:37:00.000000+0000,0.17847222222222223,0.17847222222222223
    Utility power failure,2016-06-05T05:25:00.000000+0000,0.3993055555555555,0.3993055555555555
    Network disruption triggering bug,2015-09-20T09:19:00.000000+0000,0.20208333333333334,0.20208333333333334
    Transformer failure,2014-08-07T17:41:00.000000+0000,0.13055555555555556,3.4055555555555554
    Power loss to servers,2014-06-14T04:16:00.000000+0000,0.08333333333333333,0.17638888888888887
    Utility power loss,2013-12-18T06:05:00.000000+0000,0.07013888888888889,0.11388888888888889
    Maintenance process error,2012-12-24T20:24:00.000000+0000,0.8270833333333333,0.9868055555555555
    Memory leak in agent,2012-10-22T17:00:00.000000+0000,0.26041666666666663,0.4930555555555555
    Electrical storm causing failures,2012-06-30T02:24:00.000000+0000,0.20902777777777776,0.25416666666666665
    Network configuration change error,2011-04-21T07:47:00.000000+0000,1.4881944444444444,3.592361111111111
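
A rough sketch of one reading of that estimator (assuming the data above is saved as outages.csv and that the duration columns are fractions of a day):

    import csv

    # Load historical outage durations and convert days to hours.
    with open("outages.csv") as f:
        durations_h = sorted(float(row["outage_duration"]) * 24
                             for row in csv.DictReader(f))

    def estimated_remaining_hours(elapsed_h):
        # Mean duration of all historical outages longer than the current one.
        longer = [d for d in durations_h if d > elapsed_h]
        return sum(longer) / len(longer) if longer else None

    for h in (1, 2, 4, 8):
        print(f"{h}h elapsed -> estimator gives ~{estimated_remaining_hours(h):.1f}h")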


Generally, expect issues for the rest of the day: AWS will recover slowly, then anyone that relies on AWS will recover slowly. All the background jobs which are stuck will need processing.


Rule of thumb is that the estimated remaining duration of an outage is equal to the current elapsed duration of the outage.


1440 min



