I don't know if you saw this posted on HN, but Netflix tests its system really well. They use the so-called Chaos Monkey [1], which shuts down random servers on a whim. This lets them detect and get rid of hard dependencies, i.e. make each service tolerate failures in other parts of the system.
Wish I had thought of the name Chaos Monkey, much cooler than whatever I called my version of it.
A few years ago I built an automated test system in Perl, complete with a message bus and a message listener container for running tasks on various servers. One of the automated tests I wrote had a component that would, at random intervals, kill processes, unmount shared filesystems, offline interfaces, etc. to cause failovers. The test then verified that all processes and resources had failed over, that all tasks had been reassigned to other nodes, and that no jobs were dropped or stalled.
It is really the only way to ensure you've covered your bases: beating the shit out of your system over and over. It uncovered a bunch of big holes and some very obscure ones too, and once we got those fixed it ran pretty much flawlessly.
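For anyone curious what such a component looks like, here is a rough sketch of the random fault-injection loop described above. It is Python rather than the original Perl, and every name in it (`run_fault_injection`, `ToyCluster`, `kill_random_node`, `revive_node`, `all_jobs_placed`) is hypothetical. The real system killed processes, unmounted filesystems and offlined interfaces; this toy only mutates in-memory state, so treat it as an illustration of the loop, not of the actual tool.

```python
import random
import time
from typing import Callable, List, Optional, Tuple

# A fault is a (name, action) pair; the action breaks something on purpose.
Fault = Tuple[str, Callable[[], None]]


def run_fault_injection(
    faults: List[Fault],
    verify: Callable[[], bool],
    rounds: int,
    min_wait: float = 1.0,
    max_wait: float = 30.0,
    rng: Optional[random.Random] = None,
    sleep: Optional[Callable[[float], None]] = None,
) -> List[Tuple[str, bool]]:
    """Wait a random interval, inject a random fault, then check the system."""
    if rng is None:
        rng = random.Random()
    if sleep is None:
        sleep = time.sleep
    results = []
    for _ in range(rounds):
        sleep(rng.uniform(min_wait, max_wait))  # random interval between faults
        name, inject = rng.choice(faults)       # pick which fault to cause
        inject()
        results.append((name, verify()))        # did failover keep every job placed?
    return results


class ToyCluster:
    """In-memory stand-in for a cluster (an assumption, purely for illustration)."""

    def __init__(self, nodes: List[str], jobs: List[str]) -> None:
        self.nodes = list(nodes)
        self.alive = set(nodes)
        # Spread jobs round-robin across the nodes.
        self.assignment = {job: nodes[i % len(nodes)] for i, job in enumerate(jobs)}

    def kill_random_node(self, rng: random.Random) -> None:
        if len(self.alive) <= 1:
            return  # always keep one node so failover has a target
        victim = rng.choice(sorted(self.alive))
        self.alive.discard(victim)
        survivors = sorted(self.alive)
        # Failover: move every job from the dead node to a surviving one.
        for i, (job, node) in enumerate(sorted(self.assignment.items())):
            if node == victim:
                self.assignment[job] = survivors[i % len(survivors)]

    def revive_node(self, rng: random.Random) -> None:
        dead = sorted(set(self.nodes) - self.alive)
        if dead:
            self.alive.add(rng.choice(dead))

    def all_jobs_placed(self) -> bool:
        # No job may be left assigned to a dead node.
        return all(node in self.alive for node in self.assignment.values())


if __name__ == "__main__":
    rng = random.Random()
    cluster = ToyCluster(["a", "b", "c", "d"], [f"job{i}" for i in range(10)])
    faults = [
        ("kill_node", lambda: cluster.kill_random_node(rng)),
        ("revive_node", lambda: cluster.revive_node(rng)),
    ]
    # Short waits so the demo finishes quickly.
    for name, ok in run_fault_injection(faults, cluster.all_jobs_placed, rounds=10,
                                        min_wait=0.01, max_wait=0.05, rng=rng):
        print(name, "OK" if ok else "JOBS LOST")
```

Passing `rng` and `sleep` in as parameters is just a convenience for the sketch: it makes a run reproducible from a seed and lets a test skip the real waiting.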
The interesting problem is that the underlying AWS system, as it grows more complex, seems to come up with more and more novel failure modes that this kind of testing could never catch. For example, Netflix had a major outage recently, on Christmas Eve 2012.
[1] http://techblog.netflix.com/2011/07/netflix-simian-army.html