
I don't know if you saw this posted on HN, but Netflix tests its system really well. They use the so-called Chaos Monkey [1], which shuts down random servers on a whim. This lets them detect and get rid of hidden dependencies, i.e. make each part tolerate failures in other parts of the system.

[1] http://techblog.netflix.com/2011/07/netflix-simian-army.html
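
To make the idea concrete, here is a toy sketch in Python. It is not Netflix's implementation: the `Instance`, `Pool`, and `chaos_kill` names are made up for illustration, and the "servers" are just in-memory objects. It only shows the principle that a randomly chosen server is shut down and the rest of the system is expected to keep answering.

```python
import random


class Instance:
    """A stand-in for one server (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Pool:
    """A toy load-balanced pool that retries on another instance when one fails."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        # Try the instances in random order and skip the ones that are down.
        for inst in random.sample(self.instances, len(self.instances)):
            try:
                return inst.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instance left")


def chaos_kill(pool, rng):
    """Shut down one randomly chosen live instance and return it (None if all are down)."""
    live = [i for i in pool.instances if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


rng = random.Random(42)  # seeded so a failing run can be replayed
pool = Pool(Instance(f"web-{n}") for n in range(3))
victim = chaos_kill(pool, rng)
response = pool.handle("GET /")  # must still succeed with one server gone
```

If `pool.handle` raised here, that would be exactly the kind of hidden dependency on a single server that this approach is meant to expose.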



Wish I had thought of chaos monkey, much cooler than whatever I called my version of it.

A few years ago I built an automated test system in Perl, complete with a message bus and a message listener container for running tasks on various servers. One of the automated tests I wrote had a component that would periodically (at random intervals) kill processes, unmount shared filesystems, take interfaces offline, etc. to cause failovers. It then verified that all processes and resources had failed over, that all tasks had been reassigned to other nodes, and that no jobs were dropped or stalled.
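
The original was Perl and is not shown here, so what follows is only a rough Python sketch of the shape of such a test, under loud assumptions: `Cluster`, `run_fault_injection`, and the node and job names are hypothetical, every fault (killed process, unmounted filesystem, downed interface) is collapsed into "the node fails", and the random intervals are simulated with a counter instead of real sleeps.

```python
import random


class Cluster:
    """Toy cluster: every job has an owner node; a failed node hands its jobs off."""

    def __init__(self, nodes, jobs):
        self.up = set(nodes)
        self.assignment = {job: nodes[i % len(nodes)] for i, job in enumerate(jobs)}

    def fail_node(self, node):
        # Simulated fault: the node goes away and its jobs move to the survivors.
        self.up.discard(node)
        survivors = sorted(self.up)
        for idx, (job, owner) in enumerate(list(self.assignment.items())):
            if owner == node and survivors:
                self.assignment[job] = survivors[idx % len(survivors)]

    def recover_node(self, node):
        self.up.add(node)

    def stalled_jobs(self):
        # A job is stalled if its owner is not currently up.
        return [job for job, owner in self.assignment.items() if owner not in self.up]


def run_fault_injection(cluster, nodes, rounds, rng):
    """Each round: advance a simulated clock by a random interval, fail a random
    node, record any stalled jobs, then bring the node back."""
    clock = 0.0
    log = []
    for _ in range(rounds):
        clock += rng.uniform(1.0, 30.0)  # random gap between faults (simulated seconds)
        victim = rng.choice(nodes)
        cluster.fail_node(victim)
        log.append((round(clock, 1), victim, cluster.stalled_jobs()))
        cluster.recover_node(victim)
    return log


nodes = ["node-a", "node-b", "node-c"]
jobs = [f"job-{n}" for n in range(10)]
cluster = Cluster(nodes, jobs)
log = run_fault_injection(cluster, nodes, rounds=50, rng=random.Random(7))
failures = [entry for entry in log if entry[2]]  # rounds where something stalled
```

An empty `failures` list means every job was reassigned after every injected fault. In a real system the interesting part is whatever ends up in that list.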

It is really the only way to ensure you've covered your bases - beating the shit out of your system repetitively. It uncovered a bunch of big holes and some very obscure ones too, and once we got those fixed it ran pretty much flawlessly.


The interesting problem is that the underlying AWS platform, because of its complexity, keeps coming up with new and more interesting failure modes that this kind of testing could never catch. For example, Netflix had a major outage recently, on Christmas Eve 2012.

http://techblog.netflix.com/2012/12/a-closer-look-at-christm...


Isn't Netflix hosted on AWS?


Netflix is indeed running on AWS. Some details about their back-end here:

http://techblog.netflix.com/2012/12/aws-reinvent-was-awesome...


They also constantly kill machines that participate in key load-balanced activities (and replace them with fresh instances of the image).



