One need only peruse the EC2 forums a bit to realize that EC2 instances fail. Shock. Horror. Servers failing? What kind of crappy service is this, anyway? The truth, of course, is that all servers can and eventually will fail: EC2 instances, Rackspace CloudServers, GoGrid servers, Terremark virtual machines, even that trusty Sun box sitting in your colo. They all can fail, and therefore they all will fail eventually.
What's wonderful and transformative about running your applications in public clouds like EC2 and CloudServers is not that the servers never fail but that when they do fail you can actually do something about it. Quickly. And programmatically. From an operations point of view, the killer feature of the cloud is the API. Using the APIs, I can not only detect that there is a problem with a server, I can actually correct it. As easily as I can start a server, I can stop one and replace it with a new one.
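To make the detect-and-replace idea concrete, here is a minimal sketch in Python. It does not use any real provider's SDK: `FakeCloud`, `start_server`, `stop_server`, `is_healthy` and `replace_failed` are all names I made up for illustration, and the "cloud" is just an in-memory dictionary. With a real provider you would swap in its own API calls and a real health check.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List


@dataclass
class FakeCloud:
    """In-memory stand-in for a cloud provider's API (hypothetical, not a real SDK)."""
    healthy: Dict[str, bool] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def start_server(self) -> str:
        # Assumption: the provider hands back an identifier for the new server.
        server_id = f"srv-{next(self._ids)}"
        self.healthy[server_id] = True
        return server_id

    def stop_server(self, server_id: str) -> None:
        # Forget the server entirely; stopping an unknown server is a no-op.
        self.healthy.pop(server_id, None)

    def is_healthy(self, server_id: str) -> bool:
        # Unknown servers count as unhealthy.
        return self.healthy.get(server_id, False)


def replace_failed(cloud: FakeCloud, fleet: List[str]) -> List[str]:
    """Return a new fleet in which every unhealthy server has been swapped out."""
    new_fleet = []
    for server_id in fleet:
        if cloud.is_healthy(server_id):
            new_fleet.append(server_id)
        else:
            cloud.stop_server(server_id)            # discard the broken instance
            new_fleet.append(cloud.start_server())  # launch a replacement
    return new_fleet


# Demo: start three servers, break one, and let the loop repair the fleet.
cloud = FakeCloud()
fleet = [cloud.start_server() for _ in range(3)]  # srv-1, srv-2, srv-3
cloud.healthy["srv-2"] = False                    # simulate a failure
fleet = replace_failed(cloud, fleet)
print(fleet)  # ['srv-1', 'srv-4', 'srv-3']
```

Run something like `replace_failed` on a schedule and a dead server becomes a routine event rather than a trip to the data center.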
Now, to do this effectively I really need to think about my application and my deployment differently. When you have physical servers in a colo, failure of a server is, well, failure. It's something to be dreaded. Something that you worry about. Something that usually requires money and trips to the data center to fix.
But for apps deployed on the cloud, failure is a feature. Seriously. Knowing that any server can fail at any time, and knowing that I can detect and correct that programmatically, actually allows me to design better apps. More reliable apps. More resilient and robust apps. Apps that are designed to keep running with nary a blip when an individual server goes belly up.
Trust me. Failure is a feature. Embrace it. If you don't understand that, you don't understand the cloud.