Making MsQuic Blazing Fast
Published Apr 14 2021

It’s been a year since we open sourced MsQuic and a lot has happened since then, both in the industry (QUIC v1 in the final stages) and in MsQuic. As far as MsQuic goes, we’ve been hard at work adding new features, improving stability and more; but improving performance has been one of our primary ongoing efforts. MsQuic recently passed the 1000th commit mark, with nearly 200 of those for PRs related to performance work. We’ve improved single connection upload speeds from 1.67 Gbps in July 2020 to as high as 7.99 Gbps with the latest builds*.

 

[Chart: single-connection upload throughput (Gbps) across recent MsQuic Git commits]

* Windows Preview OS builds; User-mode using Schannel; and server-class hardware with USO.
** x-axis above reflects the number of Git commits back from HEAD.

 

Defining Performance

 

“Performance” means a lot of different things to different people. When we talk with the Windows file sharing (SMB) team, it’s always about single-connection, bulk throughput: how many gigabits per second can you upload or download? With HTTP, it’s more often about the maximum number of requests per second (RPS) a server can handle, or the per-request latency: how many microseconds of latency do you add to a request? For a general-purpose QUIC solution, all of these are important to us. But even these different scenarios can be ambiguous in their definition. That’s why we’re working to standardize the process by which we measure the various performance scenarios. Not only does this give a very clear statement of exactly what is being measured and how, it has also allowed us to do cross-implementation performance testing. Four other implementations (that we know of) have implemented the “perf” protocol we’ve defined.
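
To make those metrics concrete, here’s a rough sketch of how bulk throughput, RPS, and per-request latency percentiles reduce to simple arithmetic over measured timestamps. This is purely illustrative (it isn’t part of any MsQuic tool, and the sample numbers are made up):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

// Comparison helper for qsort, used to compute latency percentiles.
static int CompareUint64(const void* A, const void* B) {
    uint64_t a = *(const uint64_t*)A, b = *(const uint64_t*)B;
    return (a > b) - (a < b);
}

int main(void) {
    // Hypothetical measurements: total bytes uploaded over a timed interval,
    // and per-request completion latencies captured in microseconds.
    const uint64_t BytesTransferred = 12500000000ull;  // 12.5 GB
    const uint64_t ElapsedMicroseconds = 12500000ull;  // 12.5 seconds
    uint64_t RequestLatenciesUs[] = { 85, 90, 92, 95, 99, 101, 110, 130, 250, 900 };
    const size_t RequestCount = sizeof(RequestLatenciesUs) / sizeof(RequestLatenciesUs[0]);

    // Bulk throughput: bits transferred per second, reported in Gbps.
    double Gbps = (BytesTransferred * 8.0) / (ElapsedMicroseconds * 1000.0);

    // Requests per second: completed requests divided by elapsed seconds.
    double Rps = RequestCount / (ElapsedMicroseconds / 1000000.0);

    // Per-request latency: sort the samples and read off percentiles.
    qsort(RequestLatenciesUs, RequestCount, sizeof(uint64_t), CompareUint64);
    uint64_t P50 = RequestLatenciesUs[RequestCount / 2];
    uint64_t P99 = RequestLatenciesUs[(RequestCount * 99) / 100];

    printf("Throughput: %.2f Gbps, RPS: %.0f, latency p50/p99: %llu/%llu us\n",
           Gbps, Rps, (unsigned long long)P50, (unsigned long long)P99);
    return 0;
}
```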

 

Performance-First Design

 

As already mentioned above, performance has been a primary focus of our efforts. Since the very start of our work on QUIC, we’ve had both HTTP and SMB scenarios driving pretty much every design decision we’ve made. It comes down to the following: The design must be both performant for a single operation and highly parallelizable for many. For SMB, a few connections must be able to achieve extremely high throughput. On the other hand, HTTP needs to support tens of thousands of parallel connections/requests with very low latency.

 

This design initially led to significant improvements at the UDP layer. We added support for UDP send segmentation and receive coalescing. Together, these interfaces allow a user-mode app to batch UDP payloads into large contiguous buffers that only need to traverse the networking stack once per batch, as opposed to once per datagram. This greatly increased bulk UDP datagram throughput for user mode.
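
On Windows these are the USO/URO socket options used by the MsQuic datapath; as a rough illustration of the same idea, here’s a sketch using the analogous Linux socket options (UDP_SEGMENT for send segmentation, UDP_GRO for receive coalescing). The helper name and the segment size are just placeholders:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

// Fallback definitions (values from linux/udp.h) for older C libraries.
#ifndef SOL_UDP
#define SOL_UDP IPPROTO_UDP
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103
#endif
#ifndef UDP_GRO
#define UDP_GRO 104
#endif

// Sketch: one large send buffer is handed to the kernel once and split into
// many datagrams (send segmentation); on receive, many datagrams are handed
// to the app as one coalesced buffer (receive coalescing).
int ConfigureUdpBatching(int Socket) {
    int SegmentSize = 1200; // bytes per datagram within a batched send
    if (setsockopt(Socket, SOL_UDP, UDP_SEGMENT,
                   &SegmentSize, sizeof(SegmentSize)) != 0) {
        return -1; // kernel too old or option unsupported
    }
    int Enable = 1; // opt in to receive coalescing
    if (setsockopt(Socket, SOL_UDP, UDP_GRO, &Enable, sizeof(Enable)) != 0) {
        return -1;
    }
    return 0;
}

int main(void) {
    int Socket = socket(AF_INET, SOCK_DGRAM, 0);
    if (Socket < 0 || ConfigureUdpBatching(Socket) != 0) {
        perror("UDP batching not available");
        return 1;
    }
    // From here, a single send of a large buffer is split into SegmentSize-byte
    // datagrams by the kernel (or by the NIC, if the offload is supported).
    return 0;
}
```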

 

These design requirements have led to some significant complexity inside MsQuic as well. The QUIC protocol work and the UDP (and below) work are separated onto their own threads. In scenarios with a small number of connections, these threads generally spread across separate processors, allowing for higher throughput. In scenarios with a large number of connections, which effectively saturate all the processors with work, we do additional work to improve parallelization.
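
As a much-simplified sketch of that split (this is not MsQuic’s code; the queue, names, and counts are all illustrative), one thread can model the UDP datapath handing received datagrams to a separate QUIC worker thread, letting the two run on different processors:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_SIZE 64

// A small ring buffer standing in for the hand-off between the datapath
// thread (producer) and the QUIC worker thread (consumer).
typedef struct {
    int Datagrams[QUEUE_SIZE];  // stand-in for received UDP datagrams
    int Head, Tail, Count;
    bool Done;
    pthread_mutex_t Lock;
    pthread_cond_t Signal;
} DatagramQueue;

static DatagramQueue Queue = {
    .Lock = PTHREAD_MUTEX_INITIALIZER,
    .Signal = PTHREAD_COND_INITIALIZER
};

// Datapath thread: "receives" datagrams and queues them for QUIC processing.
static void* UdpDatapathThread(void* Context) {
    (void)Context;
    for (int i = 0; i < 100; i++) {
        pthread_mutex_lock(&Queue.Lock);
        while (Queue.Count == QUEUE_SIZE) {
            pthread_cond_wait(&Queue.Signal, &Queue.Lock);
        }
        Queue.Datagrams[Queue.Tail] = i;
        Queue.Tail = (Queue.Tail + 1) % QUEUE_SIZE;
        Queue.Count++;
        pthread_cond_broadcast(&Queue.Signal);
        pthread_mutex_unlock(&Queue.Lock);
    }
    pthread_mutex_lock(&Queue.Lock);
    Queue.Done = true;
    pthread_cond_broadcast(&Queue.Signal);
    pthread_mutex_unlock(&Queue.Lock);
    return NULL;
}

// QUIC worker thread: drains the queue and does the protocol-level work.
static void* QuicWorkerThread(void* Context) {
    (void)Context;
    int Processed = 0;
    for (;;) {
        pthread_mutex_lock(&Queue.Lock);
        while (Queue.Count == 0 && !Queue.Done) {
            pthread_cond_wait(&Queue.Signal, &Queue.Lock);
        }
        if (Queue.Count == 0 && Queue.Done) {
            pthread_mutex_unlock(&Queue.Lock);
            break;
        }
        Queue.Head = (Queue.Head + 1) % QUEUE_SIZE; // "process" one datagram
        Queue.Count--;
        Processed++;
        pthread_cond_broadcast(&Queue.Signal);
        pthread_mutex_unlock(&Queue.Lock);
    }
    printf("Processed %d datagrams\n", Processed);
    return NULL;
}

int main(void) {
    pthread_t Datapath, Worker;
    pthread_create(&Datapath, NULL, UdpDatapathThread, NULL);
    pthread_create(&Worker, NULL, QuicWorkerThread, NULL);
    pthread_join(Datapath, NULL);
    pthread_join(Worker, NULL);
    return 0;
}
```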

 

Those are just a few of the (bigger) impacts our performance-driven design has had on MsQuic architecture. This design process has affected every part of MsQuic from the API down to the platform abstraction layer.

 

Making Performance Testing Integral to CI

 

Claiming a performant design means nothing without data to back it up. Additionally, we found that occasional, mostly manual, performance testing led to even more issues. First off, to make reasonable comparisons of performance results, we needed to reduce the number of factors that might affect the results. We found that a manual process added a lot of variability to the results because of the significant setup and tool complexity. Removing the “middleman” was super important, but frequent testing has been even more important. If we only tested once a month, it was next to impossible to identify the cause of any regressions found in the latest results, let alone prevent them from happening in the first place. That inevitably led to a significant amount of wasted time tracking down the problem, all while anyone using the code in the meantime was running with regressed performance.

 

For these reasons, we’ve invested significant resources into making performance testing a first-class citizen in our CI automation. We run the full performance suite of tests for every single PR, every commit to main, and for every release branch. If a pull request affects performance, we know before it’s even merged into main. If it regresses performance, it’s not merged. With this system in place, we have pretty much guaranteed performance in main will only go up. This has also allowed us to confidently take external contributions to the code without fear of any regressions.

 


 

Another significant part of this automation is generating our Performance Dashboard. Every run of our CI pipeline for commits to main generates a new data point and automatically updates the data on the dashboard. The main page is designed to give a quick look at the current state of the system and any recent changes. There are various other pages that can be used to drill down into the data.

 

Progress So Far

 

As indicated in the chart at the beginning, we’ve made a lot of performance improvements over the last year. One nice feature of the dashboard is the ability to click on a data point and get linked directly to the relevant Git commit. This allows us to easily find which code change caused a given change in performance. Below is a list of just a few of the recent commits that had the biggest impact on single-connection upload performance.

 

  • d985d44 – Improves the flow control tuning logic
  • 1f4bfd7 – Refactors the perf tool
  • ec6a3c0 – Fixes a kernel issue related to starving NIC packet buffers
  • be57c4a – Refactors how we use the RSS processor to schedule work
  • 084d034 – Refactors OpenSSL crypto abstraction layer
  • 9f10e02 – Switches to OpenSSL 1.1.1 branch instead of 3.0
  • ee9fc96 – Adds GSO support to Linux data path abstraction
  • a5e67c3 – Refactors UDP send logic to run on data path thread

 

Most of these changes came about from this simple process:

 

  1. Collect performance traces.
  2. Analyze traces for bottlenecks.
  3. Improve the biggest bottleneck.
  4. Test for regressions.
  5. Repeat.

 

This is an ongoing process to continually improve performance. We’ve done considerable work to make parts of this process easier. For instance, we’ve created our own WPA plugin for analyzing MsQuic performance traces. We also continue to spend time stabilizing our existing performance results so that we can better catch regressions going forward.

 

Future Work

 

We’ve done a lot of work so far and come a long way, but the push for improved performance is never ending. There’s always another bottleneck to improve or eliminate. There’s always a slightly better, faster way of doing things. There’s always more tooling that can be created to improve the overall process. We will continue to put effort into all of these.

 

Going forward, we want to investigate additional hardware offloads and software optimization techniques. We want to build upon the work going on in the ecosystem, help standardize these optimizations, and integrate them into the OS platform and then into MsQuic. Our hope is to make MsQuic the first choice for customer workloads by bringing the network performance benefits QUIC promises without having to trade off computational efficiency.

 

As always, for more info on MsQuic, continue reading on GitHub.

 

-- The MsQuic Team (Anthony Rossi, Nick Banks, Praveen Balasubramanian, & Thad House)
