As a company focused on improving storage performance in virtualized datacenters, we spend a lot of time focused on just that: storage performance.
However, Martin, one of our software architects, recently ran into a problem that manifested as inconsistent performance even after all the obvious variables had been accounted for.
Here, Martin explains the symptoms:
“We started to notice that some of our VMs were performing at half the speed of others. We looked at variables like workload, server specs, and hypervisor patch versions. Then we looked at all the data from our detailed performance instrumentation to see if there was something we were missing.”
At one point, Martin realized that along with inconsistent performance there was also inconsistent reporting of CPU utilization. While two hosts might have the same CPU model and the same workload, they would report different utilization and perform differently.
Here is one example of the difference in performance that Martin saw: the time it takes to run a SHA-1 hash. In the first screen below, you can see that one host was performing this operation in roughly 7 usec. In the second screen you can see it taking over 14 usec to run the same operation.
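The original screenshots came from Martin's own tooling, but the same kind of measurement can be reproduced with a minimal sketch like the one below: hash a fixed buffer many times and report the average microseconds per hash. The buffer size and iteration count are illustrative assumptions, not the values Martin used.

```python
import hashlib
import time

def time_sha1(data: bytes, iterations: int = 10000) -> float:
    """Return the average time per SHA-1 hash of `data`, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        hashlib.sha1(data).digest()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6  # seconds -> microseconds

# Hash a 4 KiB buffer as a rough stand-in for the original test.
usec_per_hash = time_sha1(b"\x00" * 4096)
print(f"SHA-1 over 4 KiB: {usec_per_hash:.1f} usec per hash")
```

Running this on two hosts with the same CPU model should produce similar numbers; a roughly 2x difference, as Martin saw, points to something throttling one of them.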
With further investigation, Martin isolated the problem to a feature found in both Intel and AMD chips and exposed through the ACPI (Advanced Configuration and Power Interface) standard. Modern chips have a mechanism that reduces the frequency of the CPU when it is doing less work; Intel calls this "SpeedStep," while AMD calls it "Cool'n'Quiet." The goal is to reduce overall power consumption in the datacenter, but for some workloads the net effect can be poor performance.
In Martin's case, the bursty but CPU-intensive nature of his workload made this feature a poor fit: the CPU would drop to a lower frequency during lulls, then lack the headroom to complete the next burst of operations quickly. We’ll let Martin explain the next step in resolving this:
“The setting that turns this on or off can be configured in two places, the BIOS and the hypervisor, with a variety of options. I needed a simple way to figure out how it was working on any given machine.”
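On a Linux host, one quick way to see the same information is the kernel's cpufreq interface in sysfs, which exposes the current governor and frequency. This is a hedged analogue, not Martin's method: his hosts ran ESX, where the policy is visible in the vSphere Client, and the sysfs paths below exist only on Linux machines with the cpufreq driver loaded.

```python
from pathlib import Path

CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def throttle_ratio(cur_khz: int, max_khz: int) -> float:
    """Fraction of the CPU's maximum frequency it is currently running at."""
    return cur_khz / max_khz

if CPUFREQ.exists():  # present on Linux hosts with the cpufreq driver loaded
    cur = int((CPUFREQ / "scaling_cur_freq").read_text())
    mx = int((CPUFREQ / "cpuinfo_max_freq").read_text())
    print(f"governor: {(CPUFREQ / 'scaling_governor').read_text().strip()}")
    print(f"cpu0 at : {throttle_ratio(cur, mx):.0%} of max ({cur} / {mx} kHz)")
else:
    print("cpufreq sysfs not available on this machine")
```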
Here’s what he did: he graphed the “run” and “used” times for a CPU. “Run” indicates wall-clock time for a given VM’s set of operations. “Used” indicates how much time that set of operations would have taken at the CPU’s nominal speed – its highest sustained speed.
In this case, you can see that “used” is half the value of “run,” which means the CPU is running at half its potential speed. This is a good example of a case where turning the CPU frequency-scaling feature off would result in significantly better performance.
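The comparison above reduces to a simple ratio: “used” divided by “run” estimates the CPU’s effective speed as a fraction of nominal. The sketch below uses hypothetical numbers matching the case described, where “used” is half of “run.”

```python
def effective_speed(used_ms: float, run_ms: float) -> float:
    """Estimate effective CPU speed as a fraction of nominal speed.

    'used' is the work expressed in time at nominal frequency;
    'run' is the wall-clock time the vCPU actually took to do it.
    """
    return used_ms / run_ms

# Hypothetical values: used is half of run, so the CPU is running
# at half its potential speed.
print(f"{effective_speed(used_ms=100.0, run_ms=200.0):.0%}")  # 50%
```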
At this point, Martin had both identified the root cause of the performance issue and found a clever way to check whether it applied to a given host. Let’s see what he did to resolve the problem.
“I wanted to avoid having to reboot the host every time I wanted to make this type of change, so I changed the Power Management setting in the BIOS to OS Controlled. Then I was able to change the setting in the vSphere Client to do all my testing without reboots. For our lab, I decided to make these our default settings. I found that ESX’s “Balanced” setting provided better performance than the Dell BIOS, and the ESX “Max Performance” setting resulted in the best performance for us.”
Here are some storage performance results in each of those cases. As you can see, letting the motherboard control the performance resulted in read latency of 11 msec, using the Balanced setting resulted in read latency of 8 msec, and using Max Performance resulted in read latency of around 5 msec - less than half of the original latency!
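To put those numbers in relative terms, a quick calculation shows the percentage improvement of each setting over the motherboard-controlled baseline. The latencies are the ones reported above; the helper function is just illustrative arithmetic.

```python
def latency_reduction(before_ms: float, after_ms: float) -> float:
    """Percentage reduction in latency relative to the baseline."""
    return (before_ms - after_ms) / before_ms * 100

baseline = 11.0  # msec read latency with motherboard-controlled power management
for label, lat in [("Balanced", 8.0), ("Max Performance", 5.0)]:
    print(f"{label}: {lat} msec read latency, "
          f"{latency_reduction(baseline, lat):.0f}% lower than baseline")
```

For these figures, Balanced cuts read latency by roughly a quarter, and Max Performance cuts it by more than half.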
Below are some of the resources Martin used to understand the best practices for CPU power management in VMware environments.