Tuesday, April 20, 2010

Failure as a Feature

One need only peruse the EC2 forums a bit to realize that EC2 instances fail.  Shock.  Horror.  Servers failing?  What kind of crappy service is this, anyway?  The truth, of course, is that all servers can and eventually will fail.  EC2 instances, Rackspace CloudServers, GoGrid servers, Terremark virtual machines, even that trusty Sun box sitting in your colo.  They all can fail and therefore they all will fail eventually.

What's wonderful and transformative about running your applications in public clouds like EC2 and CloudServers is not that the servers never fail but that when they do fail you can actually do something about it.  Quickly.  And programmatically.  From an operations point of view, the killer feature of the cloud is the API.  Using the APIs, I can not only detect that there is a problem with a server, I can actually correct it.  As easily as I can start a server, I can stop one and replace it with a new one.
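That detect-and-replace loop can be sketched in a few lines of Python.  This is a minimal illustration, not boto or the real EC2 API: `describe_instances` and `launch_instance` here are hypothetical stand-ins for the provider's actual calls (on EC2, think DescribeInstances and RunInstances).

```python
# Minimal sketch of automated detect-and-replace.  The two helper
# functions are hypothetical stand-ins for real cloud API calls.

def describe_instances():
    # Stub: in production this would query the cloud provider's API
    # (e.g. EC2 DescribeInstances) for the current state of the fleet.
    return [
        {"id": "i-1", "state": "running"},
        {"id": "i-2", "state": "terminated"},
        {"id": "i-3", "state": "running"},
    ]

def launch_instance():
    # Stub for the provider's "start a server" call (e.g. RunInstances).
    return {"id": "i-new", "state": "pending"}

def heal(instances):
    """Launch one replacement for every instance that has died."""
    dead_states = ("terminated", "stopped")
    return [launch_instance() for inst in instances
            if inst["state"] in dead_states]

if __name__ == "__main__":
    replacements = heal(describe_instances())
    print("launched %d replacement(s)" % len(replacements))
```

The point isn't the ten lines of code; it's that the whole cycle, detect, decide, replace, is scriptable at all, which it never was with a box in a colo.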

Now, to do this effectively I really need to think about my application and my deployment differently.  When you have physical servers in a colo, failure of a server is, well, failure.  It's something to be dreaded.  Something that you worry about.  Something that usually requires money and trips to the data center to fix.

But for apps deployed on the cloud, failure is a feature.  Seriously.  Knowing that any server can fail at any time and knowing that I can detect that and correct that programmatically actually allows me to design better apps.  More reliable apps.  More resilient and robust apps.  Apps that are designed to keep running with nary a blip when an individual server goes belly up.

Trust me.  Failure is a feature.  Embrace it.  If you don't understand that, you don't understand the cloud.


  1. What if the failure is due to a security problem? Would you really want to simply start up another instance (with the same problems)?

    What if the failure is due to behavior exhibited by a flaw in the application that occurs after X number of connections/requests/etc...? Starting a new instance doesn't resolve that, it just starts the failure cycle over again.

    I agree in general that failure is to be expected and should definitely be considered as an integral component of any application architecture, but I also think the detection of such failures is not always as straightforward as it sounds. Failure of an application != failure of a server, after all, and it's quite possible the "server" is running just fine while the application is completely whacked.

    Yes, whacked is a technical term. ;-)


  2. Very true. I'm focusing here on "hardware" failure of servers, which does happen and from which you can recover in a completely automated manner.

    But there are infinite dimensions of whacked when it comes to applications, and many of them do not yield easily to automated recovery. However, most of those also look different from plain old server failure from a monitoring POV. For example, if your server stops responding but its CPU is pegged, that's a different failure mode than the server simply going dark and needs to be treated differently.

    No matter what the issue is, though, from an operations perspective having an API to your infrastructure is critical in detecting problems and recovering from them.
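    (The distinction in that comment — dark server vs. wedged application — is exactly the kind of thing a monitoring script can encode. A rough, purely illustrative sketch; the signal names and categories are made up for the example, not taken from any real monitoring API:

```python
# Hypothetical sketch: classify a failure mode from two basic
# monitoring signals, so automation knows which failures it can
# fix on its own and which need a human.

def classify(responding, cpu_pct):
    """responding: did the health check answer?  cpu_pct: last
    reported CPU utilization, or None if the host is unreachable."""
    if not responding and cpu_pct is None:
        return "server-dark"   # host gone: safe to replace automatically
    if not responding and cpu_pct >= 95:
        return "app-wedged"    # host alive, app spinning: page a human
    return "healthy"

if __name__ == "__main__":
    print(classify(False, None))   # dark server -> automated replacement
    print(classify(False, 99))     # wedged app -> different handling
```

Only the first case is the plain old server failure the post is about; the second is one of those infinite dimensions of whacked.)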


  3. Amazon needs to offer 'Failure as a Service' - you enable this service and your EC2 instances (and other services) randomly fail with a given probability, so you could fully stress test your HA solution reliability.

  4. Well, by having this wonderful API to my operations, I can quite easily code that up myself 8^)
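    (For the record, the "code that up myself" part really is small. A hypothetical sketch of the random-failure idea from comment 3 — `pick_victims` and the termination step are illustrative names, not a real service or API:

```python
import random

# Hypothetical "Failure as a Service" sketch: given a fleet of
# instance ids, choose which ones to kill, each with independent
# probability p, to stress-test an HA setup.

def pick_victims(instance_ids, p, rng=None):
    """Return the subset of instance_ids to terminate."""
    rng = rng or random.Random()
    return [i for i in instance_ids if rng.random() < p]

if __name__ == "__main__":
    fleet = ["i-1", "i-2", "i-3", "i-4"]
    for victim in pick_victims(fleet, 0.25):
        # Stand-in for the real provider call (e.g. TerminateInstances).
        print("terminating", victim)
```

Run it from cron with a small p and you have a crude self-inflicted failure service.)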

  5. Mitch,

    That clarification works for me. I believe a few years ago there was a similar concept for data centers, "Built to Fail", that centered around the increasingly low cost of commodity hardware and a growing lack of concern regarding length of hardware life/reliability. Virtualization and cloud certainly hammer that home in this case.