As you try and push a system towards 100% reliability, you need to understand yo...

As you try and push a system towards 100% reliability, you need to understand your risk model better and better. When you get to levels around 5 nines, very unlikely events which could cause an hour's outage every 20 years start to dominate. In a system as complex as a large datacenter, it is always going to be difficult and expensive to understand all of these risks, and even more difficult and expensive to design around them.

That is why redundancy is so important. Instead of fighting an a battle which is exponentially increasing in difficulty, you chose to optimize the reliability of a single component. You give up optimizing each of the really complex subsystems (datacenters) at a certain level (3 or 4 nines) and focus on optimizing the reliability of a really simple component for detecting failures and directing traffic to the online datacenter.

Reliability engineers have known this for a really long time. If you can fit redundancy into your design, it is almost always a cheaper way to approach high reliability than optimizing the reliability of each subsystem.