Technology Strategy interview with our CEO by Techworld

Jan Hedström of Techworld (www.techworld.se) recently interviewed CloudSigma CEO Patrick Baillie about our company’s technology strategy and choices. A full transcript of the interview is included below.

You use KVM virtualisation, while some other providers have chosen Xen. What factors were important when you made the decision to use KVM?

KVM forms part of the mainstream Linux kernel and for that reason has the full weight of the Linux developer community behind it. For virtualisation, stability and security are two critical elements, elements which open source software has time and again proven to be the best at delivering over the long term. Although a relatively late entrant into the hypervisor space, KVM has been gaining in momentum over time to the point where it is now overtaking other hypervisors like Xen in terms of performance and core functionality. It is no coincidence that companies like IBM, HP and Intel are fellow members of the Open Virtualisation Alliance whose aim is to promote uptake of KVM. Since making the choice of KVM early on we believe we’ve seen our choice validated. We fully expect KVM to continue to iterate faster than other alternatives and create a widening technological lead.

KVM itself has some key features which make it very attractive to us. Firstly, it has the ability to virtualise cores themselves. This allows us to sell CPU in GHz rather than per core, which we’d have to do with Xen. It means that our customers can specify the number of cores on a server separately from the raw GHz. This uniquely allows our customers to tailor the number of CPU threads based on their requirements. For example, big data processing usually benefits from parallelisation. On our platform, a customer could create, for example, a 4GHz CPU server and specify 8 cores, giving them more CPU threads than would normally be assigned for that size. This has a very dramatic effect on their computing efficiency and price/performance. Likewise, another customer needing fewer threads but bigger core sizes could create the same sized 4GHz CPU server and specify just 2 cores. All this is possible because of KVM’s virtualisation implementation and not possible in the same way with hypervisors like Xen.
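The trade-off described above can be illustrated with a little arithmetic. This is only a sketch: it assumes the purchased GHz are divided evenly across the virtual cores, which the interview does not state, and the function name per_core_ghz is hypothetical rather than part of any CloudSigma API.

```python
def per_core_ghz(total_ghz: float, cores: int) -> float:
    """Return the GHz available to each virtual core.

    Assumption (not from the interview): the total allocation is
    split evenly across the cores the customer specifies.
    """
    if cores < 1:
        raise ValueError("a server needs at least one core")
    return total_ghz / cores


# The same 4 GHz purchase in the two shapes mentioned in the interview.
many_threads = per_core_ghz(4.0, 8)  # more, smaller cores for parallel work
few_threads = per_core_ghz(4.0, 2)   # fewer, bigger cores

print(f"8 cores: {many_threads} GHz each; 2 cores: {few_threads} GHz each")
```

Under that even-split assumption, the customer pays for the same 4GHz in both cases and only the shape of the server changes.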

Secondly, KVM offers an open platform and conveys full root/administrative access for the cloud servers it creates. Essentially, as a hypervisor it doesn’t intrude into the software layer, allowing full control by the user. For this reason users in our cloud can run any operating system and version that they like. We therefore have customers running Windows, Linux, Unix (for example FreeBSD) and other OSs like Solaris without issue in our cloud. Again this isn’t possible with other cloud platforms.

What software are you using to set up the server and storage nodes, or have you put together the software yourself using a generic Linux distribution?

Our cloud offering is built in modular components but can be thought of as having three basic layers. The first layer is the front-facing layer that users see, which consists of the web console, the public API and, in the future, mobile applications as well. The second layer is the billing, customer management and resource allocation layer. Both the first and second layers are areas where we concentrate as a company, as we feel this is the value-added part of the cloud where we can design features that make our offering unique and attractive to customers. It also gives us the freedom to take completely different approaches, which we simply couldn’t do by buying and using an all-in-one off-the-shelf package. The final layer is the software running on the server hosts themselves, which utilises Linux with KVM as described above.

The Openstack project has attracted a large number of companies participating in its community. What is your view on Openstack, is that project of interest to CloudSigma?

Actually Openstack has largely failed to attract any significant interest from other public IaaS providers and that’s very telling. Most of the support has come from private cloud vendors or companies that are already Rackspace customers and want to interface with them more easily. For sure in those areas Openstack has been successful. Openstack is simply a translation of Rackspace’s vision of IaaS, which we find to be highly limiting. As a company they have more than 80% of their revenues from traditional dedicated server hosting and as such have adopted a model that isn’t disruptive to their legacy business. This doesn’t, however, expose the true flexibility and power that IaaS can actually offer customers. As a pure play IaaS provider we are liberated to take a completely different approach that is highly disruptive to traditional hosting business models. That’s why we give our customers full control over their servers, have no fixed server sizes, offer high performance and provide transparent utility billing without resource bundling.

Since Q3 last year we have had a working Rackspace API emulator which allows customers to talk to our cloud through that API. In reality almost no customer uses it, as it only offers a subset of the flexibility of our platform and main API. So in summary, if someone wants to use Openstack in conjunction with our cloud it won’t be problematic at all to do so; however, most customers choose to take advantage of our extra native features.

Actually the API is not the real challenge in delivering cross cloud interoperability whilst still encouraging innovation. The real problem comes from the lack of data portability that most clouds offer. For example with CloudSigma you can FTP out your full drive images at any time and receive them as RAW ISO files. This offers a cheap and convenient way to migrate away from CloudSigma giving full data portability. Do other providers including Rackspace offer the same data portability?

Far more interesting to us are companies and services such as enStratus, Rightscale and jclouds. They offer cross-cloud integration but also do full integration so they aren’t the least common denominator. Customers can therefore choose to use just a common subset or take advantage of vendor specific features in a transparent way. That model encourages innovation and gives customers control and understanding over what vendor lock-in aspects they might be exposing themselves to.

What tools do you use for systems management and configuration? Have you been able to use off-the-shelf free software, like Puppet or cfengine?

As an IaaS provider we don’t have access into our customers’ cloud servers. This gives them a level of freedom, security and control that other clouds don’t offer. As such the sorts of packages you are talking about aren’t that relevant to our cloud.

We do use multiple system monitoring tools both internally and externally in our cloud. This includes network infrastructure, network traffic flows, server host performance and availability, cloud services availability (such as the web console and the API server) and test VM availability and performance. We use open source tools for this but we aren’t able to publicly disclose them.

What are the most important tools that you use for systems and network monitoring?

We use open source solutions for this plus our own proprietary monitoring of various system resources. Again we aren’t able to publicly disclose the particular system choices that we have made.

You have written that CloudSigma uses RAID 6 sets within the servers, and provided good reasons why not to use a SAN. But are there any major drawbacks (economic or technical) to using standard servers as storage nodes, do you think?

Our cloud in general is built on the principle of modularity and simplicity. Modularity reduces single points of failure and limits the impacts of failures when they happen in relation to the overall cloud. Simplicity means that problems when identified are that much more readily solved. The combination of the two means that if a problem occurs it has a limited impact zone and is relatively simple to resolve. SANs break both those principles which is the main reason why we don’t use them.

The main drawback of local storage is that it can become unavailable if that particular server goes down. The impact is limited, and in the extreme circumstance of an unrecoverable hardware failure on a host we simply swap the drives out into another ready server.

You also wrote (in the forum) that you recently implemented a solution for distributing the data across nodes. Is that replication achieved using iSCSI, or some other, more high-level protocol?

I think I wrote we were in the process of deploying it; it isn’t live yet on the production system. We do use iSCSI in general for accessing drives and that won’t change. The new system replicates and breaks virtual drives into blocks and distributes them across hundreds of drives. It delivers more consistent storage performance because a drive’s impact is spread across so many physical drive heads. It also means that the failure of any particular server doesn’t affect overall availability of that virtual drive. In essence it eliminates the one drawback of local storage versus using a SAN, which is why we are so excited about seeing it move to production later this year.
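The general idea of splitting a virtual drive into replicated blocks spread over many hosts can be sketched as follows. This is not CloudSigma’s implementation, which the interview does not describe; the hash-based placement, the replica count of two, and the names place_blocks and survives_host_failure are all assumptions made for illustration.

```python
import hashlib


def place_blocks(drive_id: str, num_blocks: int, hosts: list[str],
                 replicas: int = 2) -> dict[int, list[str]]:
    """Map each block of a virtual drive to `replicas` distinct hosts.

    Hypothetical placement rule: hash the (drive, block) pair to pick a
    starting host, then take the following hosts for the extra copies.
    """
    if not 1 <= replicas <= len(hosts):
        raise ValueError("replicas must be between 1 and the number of hosts")
    placement = {}
    for block in range(num_blocks):
        digest = hashlib.sha256(f"{drive_id}:{block}".encode()).digest()
        start = int.from_bytes(digest[:8], "big") % len(hosts)
        placement[block] = [hosts[(start + i) % len(hosts)]
                            for i in range(replicas)]
    return placement


def survives_host_failure(placement: dict[int, list[str]],
                          failed_host: str) -> bool:
    """True if every block still has a copy on some host other than the failed one."""
    return all(any(host != failed_host for host in copies)
               for copies in placement.values())


hosts = [f"host-{i}" for i in range(10)]  # hypothetical host names
placement = place_blocks("vdrive-1", 1000, hosts)
print(all(survives_host_failure(placement, h) for h in hosts))
```

Because each block’s copies sit on different hosts, losing any single host leaves every block readable, and the blocks of one virtual drive end up spread over all the hosts rather than concentrated on one.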

I suppose that achieving optimal load-balancing across nodes, and avoiding bottlenecks, must be a tricky problem for a cloud provider. Has this been the case for CloudSigma?

We give customers the power to implement their own load balancing through our open software and networking approach. At a cloud management level, yes, proper resource management is crucial to delivering excellent customer performance and, more importantly, consistent performance levels. The two bottlenecks usually experienced in a public cloud are IO performance and networking performance. We are now in the process of migrating our cloud to only 10GigE networking right down to the individual server node level, and all new hardware is based on this new 10GigE backbone (such as our new US location). This has both lowered latency and significantly increased throughput capacity in our cloud to the point where networking isn’t a bottleneck any more for our users. With regards to storage performance, we actually lose money on our storage pricing because we now invest so heavily in state of the art magnetic disks as well as software management systems for storage. We will also be rolling out SSD storage options very shortly that will allow customers to dramatically boost performance by moving onto SSD based storage. This will also shed the highest storage load off our magnetic backbone storage system.

One key aspect of load management is our adaptive burst pricing mechanism. By having floating prices for on-demand computing (not our subscription pricing, which is fixed), we are able to allow prices to adjust to demand in a dynamic way over time. We use short 5-minute billing cycles, which makes this process very granular and accurate. A customer of ours recently graphed these prices over time so you can now visually see them adjusting in response to load at http://cloudsigma.o.lindekleiv.com/.
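To show how a floating price and 5-minute cycles might combine into a bill, here is a rough sketch. The price figures, the per-GHz-hour unit and the function name burst_cost are invented for illustration; the interview only states that prices float with demand and that billing cycles are 5 minutes long.

```python
CYCLE_MINUTES = 5  # billing cycle length stated in the interview


def burst_cost(ghz_per_cycle: list[float],
               price_per_ghz_hour: list[float]) -> float:
    """Sum the charge for each cycle at that cycle's floating price.

    Assumption: usage is metered in GHz and priced per GHz-hour, so each
    5-minute cycle is charged for 5/60 of an hour.
    """
    if len(ghz_per_cycle) != len(price_per_ghz_hour):
        raise ValueError("need exactly one price per billing cycle")
    hours_per_cycle = CYCLE_MINUTES / 60
    return sum(ghz * price * hours_per_cycle
               for ghz, price in zip(ghz_per_cycle, price_per_ghz_hour))


# Fifteen minutes on a 4 GHz server; the price rises in the middle
# cycle as demand rises (hypothetical numbers).
usage = [4.0, 4.0, 4.0]
prices = [0.012, 0.018, 0.012]
print(f"charge: {burst_cost(usage, prices):.4f}")
```

With cycles this short, a brief price spike only affects the few minutes in which it occurs rather than a whole hour of usage.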

The end result of all this is excellent performance for customers. You can judge for yourself at CloudSleuth’s independent performance monitoring site making sure to select Europe as the region.

End of interview