Customers often ask about CPU steal time, especially those who use their CPUs heavily and for whom it’s a key performance criterion. There are quite a few differences in the setup and behaviour of CPUs and cores between physical and virtual environments. Even between cloud providers there are setup differences that make a like-for-like comparison difficult at first glance. For this reason we thought it useful to provide a brief overview of our setup and CPU allocation logic, and to explain the most common sources of CPU steal time.
Firstly, for those unfamiliar with the concept, CPU steal time is the time that a virtual CPU in your cloud server has to wait for the real physical CPU while the hypervisor is busy using it for other things (such as other virtual machines/cloud servers). This is a great article about CPU steal time that’s well worth reading.
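To make the definition concrete, here is a minimal sketch of how steal time can be measured from inside a Linux guest. It assumes the standard /proc/stat layout, where steal is the eighth counter on the aggregate "cpu" line. The function names are our own illustration, not part of any tool.

```python
import time

# Column order of the aggregate "cpu" line in Linux /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels report fewer columns; treat missing ones as 0.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU time reported as steal between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight columns are summed for the total.
    total = sum(delta[name] for name in FIELDS[:8])
    return 100.0 * delta["steal"] / total if total > 0 else 0.0


def sample_steal(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the steal %."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return steal_percent(before, after)
```

On a Linux cloud server, calling `sample_steal(5.0)` gives the average steal percentage over five seconds. Tools such as top report the same counter in their "st" column.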
A Little Information on our CPU Set-up
The first thing to understand relates to the way cores are allocated between virtual machines on each physical compute node hosting your computing. CPUs and their cores at CloudSigma are shared. In other words, we do not pin a customer cloud server to specific cores. The CPU time is assigned by the physical compute node’s scheduler dynamically and everything is shared. We believe this has a number of benefits in delivering more reliable performance holistically by allowing the compute node to make sensible allocation adjustments on the fly to balance load.
In combination with this, we use Control Groups (cgroups for short) to guarantee enough CPU time for each cloud server, in line with the resources you have set via the server size. In the end, the scheduler and cgroups together decide what to do with any remaining resources. It’s also worth noting that we reserve a set of specific cores outside the range of allocation for customer computing workloads. These cores run the operating system of the physical host, and we reserve additional cores specifically for processing networking and storage operations. All of this is designed to increase the stability of the overall machine and to deliver reliable performance levels to you over time, independent of other customers’ load.
The Sources of CPU Steal Time in a Virtualized Environment
Unlike in a physical environment, there are multiple situations in which you can experience CPU steal time, because things are more complex in a multi-tenant virtualized environment. Not all of them mean you are receiving less CPU time than you should. In fact, you can often soak up spare CPU cycles beyond your allocated size, although that is not a situation in which you’d see CPU steal time. The three most common situations are outlined below in more detail.
Your Cloud Server is Overloaded
It happens! Everyone wants to use as close as possible to the full capacity they are paying for. However, if the CPU allocated to your virtual cloud server is not enough to process the workload, you can see CPU steal time as things back up and queue within the virtual CPU. If this is the root cause of the CPU steal time, the resolution is to resize the cloud server. If it is only a temporary overload, you can safely leave things unchanged and you’ll see CPU steal time disappear when your load goes down.
The Physical Server Hosting your Cloud Server is Overloaded
If the host is overloaded, this is a failure on our side. It’s rare but it can happen. In this case we use live migration to move virtual machines to other physical compute nodes without disruption, bringing load back down to normal levels. Generally we maintain hosts well below full load, so if you continue to observe this over an extended period please contact us and our free 24/7 support can check the physical host you are on. If it’s not overloaded then it’s unlikely to be the root cause of your CPU steal time.
You are Using a Smaller Virtual Core Size
At CloudSigma we give you the ability to define the virtual core size so that, for any given cloud server size, you can for example have more CPU threads by using a larger number of smaller virtual cores. The operating system within the cloud server will always see each core as the full physical size. So if the physical core is 2.6GHz but you set your VM to be 4GHz with two cores, each virtual core will be 2GHz. You will therefore always see steal time, but that is because each virtual core is only allocated a pro rata share of the physical core rather than the full core. As such, you should always adjust any calculations of CPU steal time to take account of smaller virtual core sizing if you are using it. To avoid this, you can expand the virtual core size to the full CPU core size (e.g. Intel v4 2.6GHz).
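The adjustment described above can be sketched with some simple arithmetic. This is only a rough pro rata model under our own assumptions, not an official formula. It assumes the baseline steal is what a fully busy virtual core would report, and the function names are purely illustrative.

```python
def virtual_core_ghz(server_ghz, cores):
    """Size of each virtual core when a server's GHz is split across cores."""
    return server_ghz / cores


def expected_baseline_steal(virtual_ghz, physical_ghz):
    """Steal % a fully busy virtual core would show purely from its sizing.

    Assumption: the guest sees a full physical core but only receives the
    pro rata share virtual_ghz / physical_ghz of it.
    """
    if virtual_ghz >= physical_ghz:
        return 0.0
    return 100.0 * (1.0 - virtual_ghz / physical_ghz)


def adjusted_steal(observed_percent, virtual_ghz, physical_ghz):
    """Observed steal % minus the sizing baseline, never below zero."""
    baseline = expected_baseline_steal(virtual_ghz, physical_ghz)
    return max(0.0, observed_percent - baseline)


if __name__ == "__main__":
    # The example from the text: a 4GHz server with two cores on 2.6GHz hardware.
    vcore = virtual_core_ghz(4.0, 2)
    print(f"virtual core size: {vcore:.1f}GHz")
    print(f"baseline steal at full load: {expected_baseline_steal(vcore, 2.6):.1f}%")
```

In the 2GHz on 2.6GHz example, roughly 23% steal at full load would be explained by the core sizing alone under this model. Only steal above that level would point to one of the other two causes.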
CPU steal time in the cloud is a bit more complex than in traditional single-tenant physical environments, but it definitely still exists. However, the way operating systems report CPU steal time hasn’t been adjusted for these different conditions, so you can get false positives. When you do find CPU steal time, it usually means there is a resource constraint, and we hope this post helps you to quickly identify the root cause and ensure continued smooth operations.
- Are we stealing from you? Understanding CPU Steal Time in the Cloud - November 17, 2016