The Cloud and Outages: Five Key Lessons

This week a location for the EC2 product of Amazon Web Services suffered a major extended outage. Predictably, there has been a lot of hand-wringing and many proclamations that the cloud is unreliable. In fact, this event should focus everyone’s minds on which problems a move to the cloud solves and which it doesn’t. The cloud solves many problems, but not all.

There are some clear lessons to be learned from this latest outage, not just about the cloud but about how to build resilient infrastructure set-ups that keep delivering when things go wrong (because they eventually will). In this post I’ll examine what this outage tells us about the cloud and data centre based computing in general, and how customers might best respond and adapt.

Lesson 1: Both Cloud and Dedicated Computing Have Single Points of Failure

Moving to the cloud can improve a great many aspects of computing, from provisioning to greater flexibility and transparency. One thing it doesn’t remove is the reliance on a set of raw materials and infrastructure to provide computing power. In a nutshell, any internet computing infrastructure relies on power, cooling (temperature control) and telecom carrier connectivity. That’s common to cloud infrastructure and dedicated infrastructure. Have a problem with any of those factors and both a cloud and its dedicated sibling will suffer the same type of outage. So, moving to the cloud does not eliminate the single points of failure that any one data centre has.

At the micro level, the cloud can offer greater reliability than a dedicated set-up because cloud providers are wholly focused on delivering performance and availability. However, single points of failure still exist for any cloud in any one location, and customers and vendors need to be clear about this.

Lesson 2: Size is No Protection from Outages

Regardless of the size of the vendor, outages of the type seen by AWS this week can and do happen from time to time. Size is no protection against such issues, simply because both small and large vendors ultimately rely on the same largest building block of the internet: the data centre. Data centres are single points of failure that will eventually suffer some kind of outage. The larger the element that fails in the chain of computing, the larger and more serious the outage. So, if something fails at the data centre level, the impact is very large.

Data centres employ N+1 and higher redundancy set-ups to avoid outages, but outages will always happen eventually. The key point is that customers needing very high availability in the long term need their own N+1 approach to their computing locations, whether for cloud or dedicated computing.
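
To see why a second location matters, here is a minimal Python sketch of the arithmetic. It assumes each location fails independently of the others, which real outages do not always respect, and the availability figures are purely illustrative rather than measurements of any vendor or data centre.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def combined_availability(site_availabilities):
    """Availability of a service that stays up while at least one site is up.

    Assumes sites fail independently of each other (an idealisation).
    """
    return 1 - prod(1 - a for a in site_availabilities)


def expected_downtime_hours(availability):
    """Expected downtime per year for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


# Illustrative figures only: two locations, each available 99.9% of the time.
single = 0.999
dual = combined_availability([single, single])

print(f"one site:  {expected_downtime_hours(single):.2f} hours/year")
print(f"two sites: {expected_downtime_hours(dual):.4f} hours/year")
```

Under these assumptions the expected downtime falls from hours per year to well under a minute. Correlated failures, such as a shared vendor control plane or a regional network problem, would narrow that gap in practice.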

Lesson 3: All Data Centres Are Not Equal

The resilience of data centres varies enormously as does the price of space within them. Cloud vendors take very differing approaches to data centre locations and that has an impact on their long-term outage profile.

Vendors such as AWS are known to run their own data centres at a level of Tier II or low Tier III maximum and take relatively low cost locations away from networking hubs. This approach is a kind of warehouse type approach often employing relatively low density computing.

Other vendors, such as ourselves, take the approach of locating in premium high Tier III or Tier IV data centres in network hub locations. Both approaches have benefits and drawbacks, but customers should know which strategy their cloud vendor employs because it affects both the outage profile of that vendor and the strategy customers themselves should adopt.

A high tier data centre in a premium network hub location will generally suffer far fewer and shorter outages than a lower tier facility. It will generally be more expensive too, of course, requiring a higher density computing set-up to be cost effective. This is the strategy we employ because in the long term it delivers higher availability and reduces the very serious data centre outages that are the most damaging to customers.

Just as it matters where you locate physical hardware, it matters where a cloud vendor locates its hardware, and what quality and redundancy of facility and networking it employs.

Vendor openness about their data centre locations and ratings is therefore critical for customers to make informed purchasing decisions.

Lesson 4: The Price-Performance-Reliability Metric

We’ve discussed previously on this blog the importance of looking at price-performance as opposed to trying to compare naked prices. With the subject of outages being so topical, this seems the right time to refine the concept by introducing the additional variable of reliability.

Most customers care about the confidence they can have in their computing being available at any one time. Some customers, such as those doing periodic data processing, may care less. However, it is fair to say the majority do care about availability and reliability. Thus, like for like, a cloud that can deliver better reliability and availability over time has a higher value.

Adding the reliability metric to price-performance gives a more complete understanding of the value behind the prices offered by a cloud vendor. Like for like, a cloud vendor located in a low quality data centre with much less redundancy will score lower on a price-performance-reliability metric than the same offering in a much higher quality data centre.
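
One way to make the comparison concrete is sketched below in Python. The formula (performance multiplied by availability, divided by price) is an assumption made for illustration, not an established metric, and the offerings and numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Offering:
    name: str
    monthly_price: float   # in any one currency
    performance: float     # e.g. a benchmark score; higher is better
    availability: float    # fraction of time available, 0.0 to 1.0


def value_score(o: Offering) -> float:
    """Hypothetical price-performance-reliability score; higher is better."""
    return o.performance * o.availability / o.monthly_price


# Invented figures: identical price and performance, different facilities.
offerings = [
    Offering("low-tier facility", 100.0, 500.0, 0.995),
    Offering("high-tier facility", 100.0, 500.0, 0.9999),
]

# Print the offerings from best to worst score.
for o in sorted(offerings, key=value_score, reverse=True):
    print(f"{o.name}: {value_score(o):.4f}")
```

A simple product like this probably understates how much an outage costs a business that depends on being online, so customers with strict availability needs may want to weight reliability more heavily.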

So customers should know where their computing is being conducted and factor this into their comparative analysis when deciding who they want to place their computing with.

Lesson 5: Achieving a Highly Robust Set-Up Is Cheaper and Easier in the Cloud

We’ve seen that data centres remain single points of failure for both dedicated and cloud computing environments. With the right information, customers can make an intelligent analysis of different cloud offerings in light of the points outlined above. This can match the customer’s computing need to the right cloud vendor, but what if you diversify?

High availability environments generally employ multiple sites and multiple vendors to achieve their reliability. The great news is that in the cloud, creating such a set-up is significantly easier, quicker and more convenient than with dedicated hardware. Perhaps the greatest shock this week is not that a major cloud had an outage but simply the number of very significant websites that were apparently relying on one vendor in one location only. That doesn’t work for dedicated hardware and it doesn’t work for cloud computing.
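
As a rough illustration of the multi-site, multi-vendor idea, the Python sketch below picks the first healthy endpoint from a priority list. The endpoint URLs and the health check are hypothetical stand-ins. A real set-up would probe each site over the network and would typically switch traffic at the DNS or load balancer level.

```python
from typing import Callable, Optional, Sequence


def first_healthy(endpoints: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint that passes its health check, else None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical endpoints in priority order: two vendors, separate locations.
ENDPOINTS = [
    "https://app.vendor-a.example",
    "https://app.vendor-b.example",
]

# Stand-in health check simulating an outage at the first vendor.
# A real check would probe each endpoint over the network.
down = {"https://app.vendor-a.example"}
active = first_healthy(ENDPOINTS, lambda url: url not in down)
print(active)
```

With only one entry in the list, an outage at that vendor leaves nothing to fall back to, which is the situation many sites found themselves in this week.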

The final and clearest point from this week is that customers using cloud computing should look at employing the same multi-location, multi-vendor strategies that they would use with dedicated hardware if they wish to achieve very high availability over the long term. Such a set-up can be created at a significantly lower cost in time and money than with traditional dedicated solutions. Outages happen in and out of the cloud, and the way customers can guard against them remains remarkably similar.