I don't know if you saw this posted on HN, but Netflix test their system really ...

jdrobins2000 · on April 22, 2013

Wish I had thought of Chaos Monkey, a much cooler name than whatever I called my version of it.

A few years ago I built an automated test system in Perl, complete with a message bus and a message listener container for running tasks on various servers. One of the automated tests I wrote had a component that would, at random intervals, kill processes, unmount shared filesystems, take interfaces offline, etc. to force failovers. The test then verified that all processes and resources had failed over, that all tasks were reassigned to other nodes, and that no jobs were dropped or stalled.
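For readers who want something concrete, here is a rough sketch of that kind of fault-injection loop. It is written in Python rather than Perl, and it is not the original system: `Cluster`, `kill`, `restore`, and `run_chaos` are hypothetical names, and the "cluster" is a toy in-memory model, not real servers. The one thing it tries to show is the shape of the test: inject a fault at a random interval, then check that no jobs were lost.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Cluster:
    """Toy in-memory stand-in for nodes that share a pool of jobs (hypothetical)."""
    jobs_by_node: Dict[str, List[str]]

    def kill(self, node: str) -> None:
        # Simulate a node failure: remove the node and fail its jobs over
        # to whichever surviving node currently has the fewest jobs.
        orphaned = self.jobs_by_node.pop(node)
        if not self.jobs_by_node:
            raise RuntimeError("no surviving node to fail over to")
        for job in orphaned:
            target = min(self.jobs_by_node, key=lambda n: len(self.jobs_by_node[n]))
            self.jobs_by_node[target].append(job)

    def restore(self, node: str) -> None:
        # Bring the node back empty, like a fresh replacement instance.
        self.jobs_by_node.setdefault(node, [])

    def all_jobs(self) -> List[str]:
        return sorted(job for jobs in self.jobs_by_node.values() for job in jobs)


def run_chaos(
    cluster: Cluster,
    rounds: int,
    rng: random.Random,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> List[str]:
    """Kill one random node per round and assert that no job is dropped."""
    expected = cluster.all_jobs()
    killed = []
    for _ in range(rounds):
        sleep(rng.uniform(1, 30))  # random interval between faults
        victim = rng.choice(sorted(cluster.jobs_by_node))
        cluster.kill(victim)
        killed.append(victim)
        # The invariant under test: every job still lives on some node.
        assert cluster.all_jobs() == expected, f"jobs lost after killing {victim}"
        cluster.restore(victim)
    return killed


if __name__ == "__main__":
    demo = Cluster({"a": ["j1", "j2"], "b": ["j3"], "c": []})
    print(run_chaos(demo, rounds=5, rng=random.Random()))
```

In a real setup, `kill` would do the nasty things (kill a process, unmount a filesystem, down an interface), the check would query the actual job scheduler, and `sleep` would be `time.sleep`. Passing in a seeded `random.Random` makes a failing run reproducible.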

It is really the only way to ensure you've covered your bases: beating the shit out of your system repeatedly. It uncovered a bunch of big holes and some very obscure ones too, and once we got those fixed it ran pretty much flawlessly.

smackfu · on April 22, 2013

The interesting problem is that, due to system complexity, the underlying AWS system seems to come up with more and more interesting failure modes that the testing could never catch. For example, Netflix had a major outage recently, on Christmas Eve 2012.

http://techblog.netflix.com/2012/12/a-closer-look-at-christm...

mistermumble · on April 22, 2013

Isn't Netflix hosted on AWS?

svedlin · on April 22, 2013

Netflix is indeed running on AWS. Some details about their back-end here:

http://techblog.netflix.com/2012/12/aws-reinvent-was-awesome...

jherrick · on April 22, 2013

They also constantly kill machines (and replace them with fresh instances of the image) that participate in key load-balanced activities.