After writing about a performance issue we had with HT (Hyper-threading) on Hyper-V hosts, and having seen this problem on more Hyper-V hosts since, I thought I would give an update on the things I’ve learned about this issue.
I’ve learned several key signs for identifying the issue, and that the issue exists in all Hyper-V versions so far. At least up until 2016; I haven’t tested 2019 yet.
The first thing to keep in the back of your mind is that the performance figures are skewed when HT is enabled. In the worst case, the performance counters for the host show about half of the real load. With HT enabled, a 35% utilization on the host means that in reality it is closer to 70%. (Don’t use Task Manager on the host; use perfmon counters, Veeam Task Manager (free) or VMM to get the actual %CPU utilization.) The reason is simple: HT doesn’t give you more actual cores, but in the reported performance numbers the logical cores are counted as real ones. With HT disabled, these numbers represent the actual load of the server.
As for the CPU performance counters for the virtual machines, it’s a bit more complicated. They are more dependent on the load on the host. For example:
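The doubling described above can be written down as a small helper. This is only a sketch of the worst-case arithmetic in Python; the function name and the cap at 100% are my own assumptions for illustration, not something Hyper-V or perfmon exposes.

```python
def worst_case_real_utilization(reported_pct: float, ht_enabled: bool = True) -> float:
    """Estimate real (physical core) utilization from the host counter, worst case."""
    if not 0 <= reported_pct <= 100:
        raise ValueError("reported_pct must be between 0 and 100")
    # With HT, every physical core is counted as two logical cores, so in the
    # worst case the reported figure is half of the real load.
    factor = 2.0 if ht_enabled else 1.0
    # Assumption: cap at 100%, since the real load cannot exceed that.
    return min(100.0, reported_pct * factor)


print(worst_case_real_utilization(35))                    # 70.0
print(worst_case_real_utilization(35, ht_enabled=False))  # 35.0
```

Keep in mind this is the worst case; at low load the real figure can sit anywhere between the reported value and double that value.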
- a 5 core host with HT enabled,
- 5 virtual machines (no overcommitting),
- each VM has 1 vCPU.
With HT enabled this host has 10 logical cores, which means a maximum load of 10% of the host per vCPU. If only one or two virtual machines have a high load and the others remain mostly idle, the busy virtual machine(s) can reach this limit. The tricky part is that once the load on the host gets over a certain point, this 10% CPU usage on the host really equals 20% real CPU utilization on the host. As soon as a Hyper-V host with HT enabled begins to experience scheduling issues, the performance counters of the virtual machines won’t get near the 10% CPU utilization anymore. These numbers can be an indication that the host is having these problems.
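The numbers in this example can be checked with a few lines of Python. This is a rough sketch of the arithmetic only; the function name is made up for illustration, and it assumes HT exposes exactly two logical cores per physical core.

```python
def vcpu_ceiling_pct(physical_cores: int, ht_enabled: bool) -> float:
    """Maximum share of the host (in %) that a single vCPU can account for."""
    # Assumption: HT doubles the number of logical cores the host reports.
    logical_cores = physical_cores * (2 if ht_enabled else 1)
    return 100.0 / logical_cores


# The example host: 5 physical cores, 1 vCPU per virtual machine.
reported_ceiling = vcpu_ceiling_pct(5, ht_enabled=True)   # share of 10 logical cores
real_ceiling = vcpu_ceiling_pct(5, ht_enabled=False)      # share of 5 real cores

print(reported_ceiling)  # 10.0
print(real_ceiling)      # 20.0
```

So a virtual machine that shows 10% on the HT-enabled host can in the worst case be keeping a full physical core busy, which is 20% of the real capacity of this host.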
On a Hyper-V host with HT enabled and loads above 35%, stalls of the virtual machines can occur. I call them stalls for now. The effect can be described like this: when you click on a folder in Explorer, for example, it takes some time before Explorer reacts. But when it reacts, it appears fast. The same goes for every application used.
When these stalls occur depends on which heavy-load services are running in the virtual machines. From what I’ve seen, the more threads these virtual machines use, the sooner the stalls begin to occur. With virtual machines that have a high load and a lot of threads (virtual machines with RDSH, for instance), the stalls begin at about 35% CPU usage on the host (70% real usage without HT). With machines that use fewer threads but still create a heavy load, the stalls begin at over 40%, which is more like 80% real CPU usage on the host.
Enabling Hyper-threading has about the same effect as overcommitting the CPU resource by a factor of 2. The tricky thing about it is that you won’t get the real performance numbers, which makes it harder to identify the cause of performance issues.
We’ve also seen a really bad performance issue on HT-enabled Hyper-V hosts running Linux guests. The worst was a Linux NFS server as a guest machine on Hyper-V 2016. With HT enabled, the performance of the VM was really terrible. Even with the correct Hyper-V drivers, the average CPU load was stuck above 1.x and the NFS throughput was terrible. After disabling HT on the Windows 2016 Hyper-V host, the average CPU load and NFS performance were what we’d expect. The weird thing here was that the only VM getting any load was the NFS server. And this only had 4 vCPUs on a 52 core host!
As more and more real cores are added to CPUs, you can also wonder how useful Hyper-threading is these days. For some applications/services it could still give an advantage. But for most Hyper-V installs I’ve seen, it clearly doesn’t give an advantage, and it makes troubleshooting a little tricky. For me, the reasons mentioned above have led to my own best practice rule of disabling Hyper-threading on every Hyper-V server.