Server Monitoring: The foundation of good cloud infrastructure

Moving from traditional servers to cloud servers and infrastructure provides a golden opportunity to re-think computing architecture in order to take advantage of the flexibility and responsiveness that the cloud has to offer.

Not often discussed, server monitoring has only gained in importance with the move to the cloud. In this blog post I outline how server monitoring can form an integral part of your cloud infrastructure and, when properly implemented, how it can open up new avenues to significant cost savings whilst protecting performance.

Creating infrastructure that auto-scales is a critical part of cloud computing. Ideally, scaling and load balancing should occur independently at each layer in reaction to load and performance data. Likewise, the ability to deploy resources instantly to exactly the areas of your infrastructure that need additional capacity is uniquely possible in the cloud. What implicitly underlies all of these structures is accurate, timely data on the performance and status of your cloud servers.

Dynamic Cloud Server Infrastructure

As a company our philosophy is very much one of giving our customers control and an open set of choices. That approach means open software and networking layers, full customer root access to cloud servers and a high degree of transparency about what we do and how we do it. The result of this freedom is that we don’t have visibility inside our customers’ servers. We’d argue that’s actually an advantage for our customers, but for auto-scaling and load balancing it does mean it’s something that needs input from our customers and is not offered implicitly in our platform.

The great news is that implementing features such as load balancing and scaling is significantly easier and quicker than trying, say, to migrate existing infrastructure onto more restrictive clouds. In other words, your return on investment from using an open cloud like ours and investing time in targeted, relevant load balancing and scaling is significantly higher than the return from re-architecting your infrastructure to work on a highly proprietary cloud. Such clouds offer less control to the user but often have load balancing and auto-scaling baked in. That trade-off isn’t in favour of the cloud’s customers at all. Not only that, but load balancing and scaling implemented in a bespoke way are significantly more effective than ‘one-size-fits-all’ cloud vendor solutions.

Computing is a means to an end and most (but not all) of our customers are using our infrastructure to offer a service of some kind. Ultimately they wish to purchase enough resources and deploy them efficiently to deliver a certain level of service performance; that’s the ‘end’ and our cloud is the ‘means’. In a traditional non-dynamic solution, this requires purchasing enough hardware to cover peak demand loads, with servers and other infrastructure running at low utilisation for the majority of the time. In the cloud, even customers with mild variances in load over time can make significant savings by varying the size of their deployed cloud resources in response to the load placed on that service. In order to do this you require the following basic information on a per-server basis:

  • CPU load information
  • Memory usage information
  • Application performance information
  • Network traffic information

This information is the starting point for building intelligent infrastructure that reacts to load and performance information in order to ensure reliable service delivery for the customer. Essentially the idea is to build infrastructure that scales back during quieter periods and brings on extra capacity during high demand periods, and to achieve this dynamically rather than according to a pre-determined schedule.
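To make the idea concrete, here is a minimal sketch in Python of how the four per-server measurements listed above might feed a scaling decision for one layer of servers. Everything in it is an assumption for illustration: the `ServerSample` fields, the `scaling_decision` function and the threshold values are hypothetical and are not part of any CloudSigma API. Sensible thresholds will differ from one service to the next.

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    """One monitoring sample for a single server (hypothetical fields)."""
    cpu_load: float        # fraction of CPU in use, 0.0 to 1.0
    memory_used: float     # fraction of RAM in use, 0.0 to 1.0
    app_latency_ms: float  # application response time in milliseconds
    network_mbps: float    # throughput; recorded here but unused by this simple rule


def scaling_decision(samples, high=0.75, low=0.30, max_latency_ms=500.0):
    """Return 'scale_up', 'scale_down' or 'hold' for one layer of servers.

    The thresholds are illustrative assumptions, not recommendations.
    """
    if not samples:
        return "hold"  # no data: do nothing rather than guess

    avg_cpu = sum(s.cpu_load for s in samples) / len(samples)
    avg_mem = sum(s.memory_used for s in samples) / len(samples)
    worst_latency = max(s.app_latency_ms for s in samples)

    # Add capacity if any resource is hot or the application is slow.
    if avg_cpu > high or avg_mem > high or worst_latency > max_latency_ms:
        return "scale_up"
    # Remove capacity only when everything is comfortably quiet.
    if avg_cpu < low and avg_mem < low and worst_latency < max_latency_ms / 2:
        return "scale_down"
    return "hold"
```

Running a check like this separately for each layer (web, application, database) is one way to let layers scale independently of each other.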

How best to implement this is a blog post in itself, which I’ll post in the near future. Suffice it to say the foundation of such a strategy is good system and application level monitoring. Without such information you are effectively flying blind. Essentially it’s a game of ‘find the bottleneck’, which needs detailed, elegantly structured information.

I’ve got three basic rules for server monitoring which I outline below.

Rule #1: First do no harm

As with the physician’s maxim of ‘first do no harm’, server monitoring shouldn’t place undue load on the servers you deploy it to. Server monitoring doesn’t need to be a resource hog and shouldn’t be. Unfortunately it’s only too easy to deploy server monitoring solutions that become a significant component of a server’s resource needs. There really isn’t any point in having a server monitoring solution on your database server that ends up slowing the performance of the database itself just to tell you that the database is slow! David Roth, CEO of AppFirst, comments:

“Historically the more data that was collected with performance monitoring tools the worse it affected the overhead of the application being monitored; this led to production applications with very little monitoring. AppFirst has been able to change this delivering complete visibility with 0.6-1.0% overhead.”

It’s an approach that we at CloudSigma agree with. So rule #1 of server monitoring in my book is to look for a solution that doesn’t have a significant impact on the underlying systems it is monitoring. The correct solution will vary from customer to customer depending on your particular needs, but in no case should you deploy a resource-heavy server monitoring system on cloud infrastructure.
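One way to hold a monitoring agent to rule #1 is to measure what the agent itself costs. The Python sketch below is a rough illustration only: `collect_metrics` is a hypothetical stand-in for a real collector, and the 1% budget is an assumption loosely echoing the overhead figure quoted above. It compares the CPU time the process consumes with the wall-clock time elapsed, using the standard library’s `time.process_time` and `time.monotonic`.

```python
import time


def collect_metrics():
    """Hypothetical stand-in for a real collector; it just does a little work."""
    return {"timestamp": time.time(), "checksum": sum(range(1000))}


def measure_overhead(collector, interval_s=1.0, rounds=5):
    """Estimate the fraction of one CPU core used while sampling periodically."""
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    for _ in range(rounds):
        collector()
        time.sleep(interval_s)  # idle time between samples is not CPU time
    cpu_used = time.process_time() - cpu_start
    wall_elapsed = time.monotonic() - wall_start
    return cpu_used / wall_elapsed


def within_budget(overhead, budget=0.01):
    """True if the measured overhead is at or below the budget (1% assumed)."""
    return overhead <= budget
```

Calling `measure_overhead(collect_metrics)` on a test server before rolling an agent out gives a first estimate. It only captures CPU cost, so memory, disk and network use of the agent would need checking separately.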

Rule #2: Prevention is better than cure; a proactive approach

The ultimate aim of server monitoring in the cloud is to deliver the right information in time to allow your cloud infrastructure to react before any significant deterioration in your performance is experienced. As a first step it’s important to have what I call reactive information. That’s information such as when the latency of your web server file retrieval is too high or when your database is responding to requests sluggishly. Increasingly it’s necessary to go beyond a reactive stance and to implement proactive measures that take action when the first warning signs manifest themselves, before performance is seriously impacted.

As I outlined in my post on benchmarking cloud servers, never underestimate the importance of real empirical information. In order to find telltale warning signs of when your infrastructure may be becoming overloaded, you’ll likely need to simulate traffic to that point or look at your server logs from such an event if the data is available. Almost always there are indicators from your infrastructure and applications that can give advance warning of when additional capacity is needed and, crucially, exactly where.
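As a rough illustration of turning such an indicator into advance warning, the Python sketch below fits a straight line to recent samples of a metric and estimates how many more samples will pass before it crosses a limit. The function names, the sample values and the limit are all hypothetical. A linear trend is only one simple assumption; real load rarely grows in a straight line.

```python
import math


def slope_per_sample(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_breach(values, threshold):
    """Estimate how many more samples until the trend reaches threshold.

    Returns 0 if the latest value is already at or above the threshold,
    and None if there is no data or the trend is flat or falling.
    """
    if not values:
        return None
    latest = values[-1]
    if latest >= threshold:
        return 0
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    return math.ceil((threshold - latest) / slope)


# Hypothetical latency readings in milliseconds, one per minute.
latency_ms = [100, 110, 120, 130, 140]
print(samples_until_breach(latency_ms, threshold=200))  # prints 6
```

With one sample per minute, a result of 6 would mean roughly six minutes of warning in which to bring on extra capacity.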

Next-generation server monitoring solutions can assist you in flagging events that can then be codified as early warning signals. These events can then be used as triggers for adjusting cloud infrastructure ahead of significant performance degradation. David Roth of AppFirst continued:

“95% of the time IT finds out about application issues from their users. Using the right tools to monitor your servers and applications allows you to be proactive and get ahead of issues before they become serious and affect your users.”

As a cloud vendor we ourselves have multiple monitoring systems at the various levels of our infrastructure, including ‘canaries’, which are our test cloud servers spread throughout our system. They feed back end-user level information on our cloud’s performance. When we see significant changes on our test cloud servers it can often be the first sign of a problem switch, strange network traffic or some other issue that hasn’t fully developed yet.
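What counts as a ‘significant change’ on a canary has to be defined somehow. The Python sketch below shows one simple possibility, offered as an assumption and not as a description of how our own canaries work: compare the latest reading with the mean of recent baseline readings and flag it if it lies more than a chosen number of standard deviations away. The function name and the three-sigma default are hypothetical.

```python
import statistics


def is_anomalous(baseline, latest, max_sigma=3.0):
    """True if latest is more than max_sigma standard deviations from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline: any change at all is worth a look.
        return latest != mean
    return abs(latest - mean) / stdev > max_sigma


# Hypothetical canary response times in milliseconds.
baseline_ms = [20.0, 21.0, 19.0, 20.5, 19.5]
print(is_anomalous(baseline_ms, 20.4))  # prints False
print(is_anomalous(baseline_ms, 35.0))  # prints True
```

A check like this is cheap to run on every canary reading, which fits with rule #1 as well.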

Prevention is most definitely better than a cure when talking about your computing infrastructure, which is why it’s rule #2.

Rule #3: Keep an open mind

Performance inhibitors can come from many different sources. When your infrastructure hits a performance ceiling it means there is a bottleneck somewhere, and it’s the job of a server monitoring solution to highlight where that bottleneck is. When setting up server monitoring it’s all too easy to build bias into the system by restricting monitoring to very specific areas and aspects of your infrastructure. This is a mistake. Often the source of a problem can be totally different from your initial gut instinct.

Take a holistic, infrastructure-wide approach and keep an open mind. Many of our customers have found bottlenecks in their infrastructure that they didn’t know existed by using such methods. The solution to a great many performance issues can be counter-intuitive. Building server monitoring around a pre-determined idea of where problems lie will exclude many improvements that might otherwise be thrown up. Rule #3 is all about making a multi-faceted implementation of server monitoring. Your implementation should capture performance issues arising from many different sources, however ‘unlikely’ they may seem as a source of performance issues at the time of implementation.
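As a small illustration of casting a wide net, the Python sketch below ranks every monitored metric by how close it is to saturation instead of checking only the metrics you already suspect. The metric names and values are invented for the example, and expressing each metric as a fraction of its capacity is an assumption that will not suit every kind of measurement.

```python
def rank_bottlenecks(utilisation, top=3):
    """Return the `top` most saturated metrics as (name, fraction) pairs."""
    ranked = sorted(utilisation.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical readings, each expressed as a fraction of capacity.
readings = {
    "web.cpu": 0.35,
    "db.cpu": 0.40,
    "db.disk_io": 0.97,
    "cache.memory": 0.60,
    "lb.network": 0.20,
}

for name, fraction in rank_bottlenecks(readings, top=2):
    print(f"{name}: {fraction:.0%}")
```

In this made-up example a team watching only CPU would see nothing wrong, while the ranking puts database disk I/O at the top of the list.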

Plan, Build, Iterate

Properly implemented, server monitoring in the cloud confers a disproportionate benefit over and above its deployment on dedicated hardware. There are advanced, cost-effective solutions out there that can more than pay their way in cost savings and performance improvements. Server monitoring is worth doing properly and it’s worth improving over time. Hopefully some of the ideas I’ve raised in this blog post will prove of use in the future for anyone adding server monitoring to their cloud infrastructure.


  • Conal Duffy

    Well said, Patrick. It’s a really informative post; monitoring like this gets you all the details of a server at the right point in time.

    Regards
    Conal