Hi folks,
Dan Cuomo back for our next installment in this blog series on synthetic accelerations. This time we’ll cover Windows Server 2012 R2. If you haven’t already read the first post on Server 2012, I’d recommend reviewing that before proceeding here. Following this post, we’ll tackle Windows Server 2016, and finally Windows Server 2019.
Windows Server 2012 R2 added a few improvements to VMQ, including Dynamic VMQ, which allows the processor handling network traffic destined for a virtual machine to be changed on the fly based on CPU load heuristics.
However, the most important feature introduced in 2012 R2 was vRSS which is critical to understand as we move into the future operating systems like Windows Server 2016 and 2019. vRSS is the basis for how we exceed the bandwidth of a single CPU as you’ll see in this post.
As such, we’ll focus this article on vRSS then cover Dynamic VMQ so you can understand the journey when we complete our series with Windows Server 2016 and 2019.
Virtual Receive Side Scaling (vRSS) was introduced in Windows Server 2012 R2. vRSS improves performance over VMQ alone by taking on a couple of new responsibilities, both in the guest and on the host.
In the Guest
Virtual CPUs fill up just like any other CPU and can become a bottleneck just like CPUs on the host. If the virtual machine has multiple virtual CPUs, it can alleviate this chokepoint by splitting the processing for the received network traffic across each virtual CPU in the guest.
If you want to visualize this architecturally, a virtual NIC attaches to a “port” on the Hyper-V Virtual switch (vSwitch). This is similar to a physical port on a physical switch (except that it’s all virtual). In the picture below you can see that the vSwitch operates at layer 2 (Ethernet). If you remember from our first article, legacy VMQ operates similarly, using a MAC + VLAN filter to separate traffic into different VMQs.
Once the traffic reaches the vmNIC, it needs to traverse the network stack and be processed by the Guest’s vCPUs. In the guest, you’ll see the network traffic distributed in task manager similar to this thanks to vRSS.
This operates by using RSS inside the guest, hence the intuitive name vRSS 😊. Enabling or disabling this can be done by modifying the Receive Side Scaling property on the vNIC inside the guest.
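If you'd rather check or flip this from PowerShell than from the NIC property page, a minimal sketch run inside the guest might look like the following (the adapter name "Ethernet" is a placeholder, and the advanced property display name can vary by driver):

# Confirm RSS is enabled on the guest's vNIC (run inside the VM)
Get-NetAdapterRss -Name "Ethernet" | Format-List Name, Enabled, MaxProcessors, RssProcessorArray

# Enable RSS on the vNIC if it was turned off
Enable-NetAdapterRss -Name "Ethernet"

# Alternatively, toggle the advanced property directly (display name varies by driver)
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Receive Side Scaling" -DisplayValue "Enabled"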
On the Host
But vRSS ain’t no one-trick pony! It has responsibilities on the host as well. Here’s how that works:
When VMQ interrupts a processor, it always interrupts the “base processor” for a specific VMQ. In the example below, the VM with the MAC Address ending in 7A-35-0C always interrupts processor 17.
Next, we use RSS technology to spread the flows across other available processors in the processor array. While this is a bit of work for the system, the benefit is that the throughput can surpass the bandwidth of a single VMQ. Remember, the base processor must always receive the packets first, but the heavy lifting is performed by other available processors.
Note: You may recognize that vRSS and VMQ are working together. CPU spreading with vRSS requires VMQ.
Now a VM can receive approximately the combined throughput of however many processors are engaged to do its bidding. Typically, the default is 8 processors. Excellent…
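If you want to see which base processor each VMQ interrupts on your own host, one option is the Get-NetAdapterVmqQueue cmdlet we'll mention again later in this post. A quick sketch (the adapter name "pNIC01" is a placeholder, and the exact columns depend on your NIC driver):

# List each VMQ on the physical adapter, its MAC/VLAN filter, and the processor it interrupts
Get-NetAdapterVmqQueue -Name "pNIC01" | Format-Table Name, QueueID, MacAddress, VlanID, Processor, VmFriendlyName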
Here’s a diagram that shows the packet flow in the picture below.
Note: The virtual switch from the picture above was replaced with the vRSS indirection table; however, this is the same feature. vRSS is a function of the virtual switch.
The base CPU is chosen out of the available processors in the RSS processor array on the system. In the screenshot below, we constrain the processor array to processors 12 – 16. You can see that the physical adapter (which has a virtual switch attached) updates its RSSProcessorArray property accordingly.
Note: There is no RSS indirection table shown because a virtual switch is attached and TCP/IP is no longer bound to this adapter. Instead, the virtual switch manages the indirection table.
Next, if we look at Get-NetAdapterVMQ, you can see that the BaseVmqProcessor for the adapter has been updated to one in the RSSProcessorArray.
And the processors listed below for each VM have been constrained between 12 and 16 as we set in a previous command.
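If you want to reproduce this on your own host, here's a rough sketch of the commands involved (the adapter name "pNIC01" is a placeholder; the RSS settings of a vSwitch-bound adapter feed the VMQ processor array):

# Constrain the RSS processor array on the vSwitch-bound physical adapter to CPUs 12 - 16
Set-NetAdapterRss -Name "pNIC01" -BaseProcessorNumber 12 -MaxProcessorNumber 16

# Verify the base VMQ processor now falls inside that range
Get-NetAdapterVmq -Name "pNIC01" | Format-List Name, Enabled, BaseVmqProcessor, MaxProcessors

After this, the per-VM queue listing shown earlier should report processors only within the 12 – 16 range.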
Note: In this example, we use one possible command to get the VMQ assigned for a vNIC. You could also use Get-NetAdapterVMQQueue.
As you can see, we did not need to use the Set-NetAdapterVMQ cmdlets from the previous blog post (although they would have worked as well).
As we move forward to the next article on Windows Server 2016, you’ll begin to understand how integral RSS is to VMQ. In Windows Server 2012, the two technologies were mutually exclusive (per adapter). In Windows Server 2012 R2, RSS isn’t yet the peanut butter to VMQ’s jelly, but you can see there is a budding friendship starting to form…In the 2012 R2 days, probably closer to orange mocha frappuccinos…
vRSS intends to engage the fewest processors necessary to process the received traffic. This is for efficiency: there is a tax to engage and spread across each additional CPU. At low throughput, this may mean that only the base CPU is engaged. In high-throughput environments, however, the number of CPUs that can be engaged is equal to the value of MaxProcessors.
At maximum throughput, the base CPU may be receiving so many packets that it’s only able to process a minimal number of packets itself. Instead, it hands out packets to the others in the processor array.
In the example shown above, this VM will receive throughput equivalent to what approximately 2 – 3 processors can crank out (because the MaxProcessors value is limited to 3). As mentioned before, this is because most of the base CPU’s work is simply handing packets out to other processors. Despite engaging three processors, you’re getting a bit less than that.
In fact, you may find that despite setting MaxProcessors to 8, only a few processors are engaged. This is likely because your base processor is pegged and cannot hand out any more packets to other processors.
Note: The defaults are specific to your NIC, however in most cases, MaxProcessors defaults to 8 and was changed in this example only for visualization purposes.
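If you want to experiment with this yourself, here's a hedged sketch for capping the spread (again, "pNIC01" is a placeholder; check your NIC vendor's guidance before changing defaults in production):

# Cap how many processors vRSS/VMQ may engage for this vSwitch-bound adapter
Set-NetAdapterVmq -Name "pNIC01" -MaxProcessors 3

# Confirm the new value
Get-NetAdapterVmq -Name "pNIC01" | Format-List Name, BaseVmqProcessor, MaxProcessors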
As you can imagine, this is far better than VMQ alone. A single VM could receive approximately 15 Gbps using this mechanism – a nearly 3x improvement! In the next post on Windows Server 2016 we’ll explain how VMMQ enhances this to get even better throughput and lower the CPU cost, improving your system’s density.
You can see which processors are being engaged for virtual traffic by opening perfmon, changing to the histogram view, and adding the Hyper-V Virtual Switch Processor Object with the Packets from External/sec counter. Next use NTTTCP or CTSTraffic to send traffic from another machine to your VM.
In the picture below, you can see that two processors are engaged: CPU12 (shown in Red) and CPU13 (shown in green). Each CPU that is processing packets will be represented by a bar. The height of the bar indicates the number of packets being processed by that specific CPU.
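If you'd rather sample the same counter from PowerShell instead of perfmon, something along these lines should work (this is a sketch for observation only; instance names vary per system):

# Sample packets processed per vSwitch processor, once per second for 10 seconds
Get-Counter -Counter "\Hyper-V Virtual Switch Processor(*)\Packets from External/sec" -SampleInterval 1 -MaxSamples 10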
The next major improvement in 2012 R2 was called Dynamic VMQ. To make better use of VMQ and available system resources, the OS was improved in Windows Server 2012 R2 to allow a VMQ-to-processor assignment to be updated dynamically based on CPU load.
If you refer back to the article on 2012, you can see an obvious challenge where multiple VMs using the same processor can easily starve one another. In other words, the maximum throughput of a single CPU core does not change from around 5 or 6 Gbps.
Imagine there are 3 VMs all with a VMQ but assigned to the same processor, and that processor can process a total of 6 Gbps. Each VM could potentially receive 2 Gbps if they were distributed evenly.
However, network traffic isn’t that predictable. It’s just as likely that 5 Gbps goes to VM1 and the remaining processing power is split among the other VMs. When the other VMs suddenly need more throughput, it’s not available for them because VM1 is already dominating the core used to process the data.
In this picture, VM03 is assigned to the same core as VM01. In this way, VM01 and VM03 are in competition for that CPU core; each VM’s performance is gated by the other VMs sharing that processor.
With Dynamic VMQ however, if CPU1 is not able to process the incoming packets, one of the queues can be reassigned to interrupt a different processor if one is able to sustain the workload.
The measurements that determine whether a queue should be reassigned are made in Windows (this is another one of vRSS’s responsibilities). Since the queues can now move, and the queues themselves live in the NIC, you should pay special attention to keeping your NIC’s driver and firmware up to date.
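A quick way to see which driver each adapter is running so you can compare it against the vendor's latest release (firmware details generally come from the vendor's own tools, and property names can vary slightly between OS versions):

# List driver details for each physical adapter
Get-NetAdapter -Physical | Format-Table Name, InterfaceDescription, DriverProvider, DriverVersionString, DriverDate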
Dynamic VMQ is all about enabling your workloads to have consistent performance. If you have a night-owl for a CEO, and he tries to login at night when the backups are running, you want him to have the same performance as when he works during the day. Otherwise, you might have to work the same hours!
That said, there are a couple of problems in this implementation, which will be addressed in Windows Server 2019. Most notably, 2012 R2 does not handle bursty workloads very well. If you have a system that suddenly needs max throughput, Dynamic VMQ isn’t able to deliver it immediately. Instead, you’ll notice that the system ramps up slowly, maxing out at around 15 Gbps. This is because the system only reserves the bare minimum resources required, pre-allocating only the single VMQ available when network throughput begins. As demand for throughput increases, the system takes some measurements, vRSS expands to more CPUs, it takes some more measurements, expands to more CPUs, and finally max throughput is reached.
As a refresher, the default queue is where all traffic that doesn’t match a filter lands; this is a shared VMQ.
With vRSS, however, this is less painful because packets from the default queue can be distributed to other available processors. VMs that land on this shared VMQ will not reach the same performance as VMs with a dedicated queue, but the delta is not as severe, and it enables a much higher density of VMs on the same hardware.
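If you're curious which queues are shared, one rough way to spot the default queue is to look for the entry without a MAC filter; on most NICs that is the default queue (the adapter name "pNIC01" is a placeholder, and the exact output depends on your driver):

# Queues without a MAC filter are typically the shared (default) queue
Get-NetAdapterVmqQueue -Name "pNIC01" | Where-Object { -not $_.MacAddress }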
Host virtual NICs cannot leverage vRSS or Dynamic VMQ. This will be addressed in later versions; however, if your host vNICs need more than what a single VMQ can provide, Windows Server 2012 R2 can’t help you.
If you look at the properties of a Host virtual NIC there is no RSS capability listed.
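You can confirm the same thing from PowerShell; on 2012 R2 a host vNIC typically reports no RSS capability at all (the name "vEthernet (Management)" is a placeholder for your host vNIC):

# Host vNICs on 2012 R2 generally return nothing (or show RSS as unavailable) here
Get-NetAdapterRss -Name "vEthernet (Management)" -ErrorAction SilentlyContinue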
We had a similar list in the article on Windows Server 2012. I’ll keep this in each blog so you can see the progression. As you can see in 2012 R2, there were performance and workload stability improvements, however the management experience didn't change much.
Other best practices
All 2012 best practices carry forward, including NOT disabling RSS unless directed by Microsoft Support for troubleshooting. Disabling RSS is not a supported configuration on Windows. RSS can be temporarily disabled at Microsoft Support’s recommendation to troubleshoot whether there is an issue with the feature, but it should not be permanently disabled, which would leave Windows in an unsupported state.
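If RSS was disabled at some point for troubleshooting, here's a minimal sketch for checking and restoring it (adapter names are placeholders):

# Check which adapters currently have RSS enabled
Get-NetAdapter | Get-NetAdapterRss | Format-Table Name, Enabled

# Re-enable RSS on an adapter where it was turned off
Enable-NetAdapterRss -Name "pNIC01"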
Summary of Advantages
Summary of Disadvantages
As you can see, vRSS and Dynamic VMQ brought significant enhancements to Hyper-V networking for their time. Together they allowed nearly triple the throughput into a virtual machine AND some level of throughput consistency for the workload by balancing across available processors. In our next article, we’ll discuss the rearchitecture of network cards and how this enabled us to overcome some of the remaining shortcomings in 2012 R2.
Thanks for reading,
Dan