Customers often ask about CPU steal time, especially those who use their CPUs heavily and for whom it’s a key performance criterion. There are quite a few differences in the setup and behaviour of CPUs and cores between physical and virtual environments. Even between cloud providers there are setup differences that make a like-for-like comparison difficult on the face of things. For this reason we thought it useful to provide a brief overview of our setup and CPU allocation logic, and to explain the most common sources of CPU steal time.
So firstly, for those unfamiliar with the concept, CPU steal time is the time that a virtual CPU in your cloud server spends waiting for the real physical CPU while the hypervisor is using it for other things (such as other virtual machines/cloud servers). This is a great article about CPU steal time that’s well worth reading.
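To make the definition concrete, here is a minimal sketch of how steal time could be measured from inside a Linux cloud server. It assumes the standard /proc/stat layout, where the eighth counter on the aggregate "cpu" line is steal. The function names are ours, chosen for illustration only.

```python
import time


def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    if len(values) < 8:
        raise ValueError("this kernel does not report a steal counter")
    return values


def steal_percent(before, after):
    """Percentage of CPU time reported as steal between two samples.

    Counter order: user, nice, system, idle, iowait, irq, softirq, steal.
    Guest time is already included in user/nice, so only the first eight
    counters are summed for the total.
    """
    b = parse_cpu_line(before)
    a = parse_cpu_line(after)
    total = sum(a[:8]) - sum(b[:8])
    if total <= 0:
        return 0.0
    return 100.0 * (a[7] - b[7]) / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # sampling interval; an arbitrary choice for this sketch
    second = read_cpu_line()
    print(f"steal over the last 5s: {steal_percent(first, second):.1f}%")
```

Tools such as top and vmstat report the same figure in their "st" column, so this is mainly useful if you want to log or alert on steal time yourself.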
A Little Information on our CPU Set-up
The first thing to understand relates to the way we allocate cores between virtual machines on each physical compute node hosting your computing. CPUs and their cores at CloudSigma are shared. In other words, we do not pin a customer cloud server to specific cores. The CPU time is assigned by the physical compute node’s scheduler dynamically and everything is shared. We believe this has a number of benefits in delivering more reliable performance holistically by allowing the compute node to make sensible allocation adjustments on the fly to balance load.
In combination with this, we use Control Groups (cgroups for short) to guarantee enough CPU time for each cloud server, in line with the resources you have set via the server size. The scheduler then decides what to do with any remaining resources. It’s also worth noting that we reserve a set of specific cores outside the range of allocation for customer computing workloads. We use these cores to run the operating system of the physical host. In addition, we reserve further cores for processing networking and storage operations. All of this aims to increase the stability of the overall machine. Furthermore, it helps deliver reliable performance levels to you over time, independent of other customers’ load.
The Sources of CPU Steal Time in a Virtualized Environment
Unlike in a physical environment, there are multiple sources and situations in which you can experience CPU steal time in a multi-tenant virtualized environment, because things are more complex there. Not all of them mean you are failing to receive the CPU time you should be. In fact, in many cases you can soak up spare CPU cycles beyond your allocated size, although that is not a situation in which you would see CPU steal time. The three most common situations follow below in more detail.
Your Cloud Server is Overloaded
It happens! Everyone wants to use as close as possible to the full capacity they are paying for. However, if the CPU allocated to your cloud server is not enough to process the workload, you can see CPU steal time as things back up and queue within the virtual CPU. If this is the root cause of the CPU steal time, then the resolution is to resize the cloud server. If it is only a temporary overload, you can safely leave things unchanged. You’ll see the CPU steal time disappear when your load goes down.
The Physical Server Hosting your Cloud Server is Overloaded
A host overload is a failure on our side. It’s rare, but it can happen. In this case we use live migration to move virtual machines to other physical compute nodes without disruption, bringing load back down to normal levels. Generally we maintain hosts well below full load, so if you continue to observe steal time over an extended period, please contact us. Our free 24/7 support can check the physical host you are on. If there is no overload, then the host is unlikely to be the root cause of your CPU steal time.
You are Using a Smaller Virtual Core Size
At CloudSigma we give you the ability to define the virtual core size for any cloud server size, so you can, for example, have more CPU threads by using a larger number of smaller virtual cores. The operating system within the cloud server, however, will always see each core as the full physical size.
For example, if the physical core is 2.6GHz and your cloud server is sized at 4GHz across two cores, each virtual core will be 2GHz. Because the operating system sees each core as a full 2.6GHz core while you receive only a pro rata 2GHz share of it, you will always see steal time. As such, you should always adjust any calculation of CPU steal time to take account of smaller virtual core sizing if you are using it. To avoid this, expand the virtual core size to the full CPU core size (e.g. Intel v4 2.6GHz).
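As a rough illustration of that adjustment, the sketch below estimates the share of a physical core that a smaller virtual core cannot use, and subtracts it from an observed steal figure. It assumes the virtual core is fully busy and that the pro rata share is simply the ratio of virtual to physical core size. This is our simplification, not a description of how the scheduler accounts for time, and the function names are hypothetical.

```python
def expected_baseline_steal(physical_ghz, virtual_ghz):
    """Upper-bound steal percentage caused by virtual core sizing alone.

    Assumes a fully loaded virtual core that receives virtual_ghz out of a
    physical core of physical_ghz.
    """
    if not 0 < virtual_ghz <= physical_ghz:
        raise ValueError("virtual core size must be in (0, physical core size]")
    return 100.0 * (1 - virtual_ghz / physical_ghz)


def adjusted_steal(observed_percent, physical_ghz, virtual_ghz):
    """Observed steal minus the sizing baseline, floored at zero."""
    baseline = expected_baseline_steal(physical_ghz, virtual_ghz)
    return max(0.0, observed_percent - baseline)


if __name__ == "__main__":
    # The example from the text: 2GHz virtual cores on 2.6GHz physical cores.
    print(f"{expected_baseline_steal(2.6, 2.0):.1f}%")
    print(f"{adjusted_steal(30.0, 2.6, 2.0):.1f}%")
```

Under these assumptions, a 2GHz virtual core on a 2.6GHz physical core could show up to roughly 23% steal under full load from sizing alone, so only steal well above that level would point to one of the other two causes.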
CPU steal time in the cloud is a bit more complex than in traditional single-tenant physical environments, but it definitely still exists. Operating systems have not, however, adjusted how they report CPU steal time for these different conditions, which means you can get false positives. Even so, when you do find CPU steal time it usually means a resource constraint is at work. We hope this post helps you to quickly identify the root cause and ensure continued smooth operations.
- Are We Stealing from You? Understanding CPU Steal Time in the Cloud - November 17, 2016