Improving performance has always been a major goal for MsQuic. Recently, we have invested heavily in achieving ultra-low latency with MsQuic. We have prototyped a fully functioning XDP data path for MsQuic that bypasses the Windows kernel TCP/IP stack. While working with internal partners on this technology stack, we learned several interesting lessons about balancing performance, which are the focus of this blog post. The details of MsQuic + XDP will be covered in a separate article.
We look at two performance metrics here:
- RPS - The number of HTTP-style requests completed per second.
- Latency - The amount of time (in microseconds) taken for a single request to complete.
Our goal is to get latency as low as possible while maintaining the highest RPS. We will walk through what we measured and dive deep into why we can’t always get both.
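To make the two metrics concrete, here is a minimal sketch of how they could be derived from per-request latency samples. The function name `summarize`, the nearest-rank percentile method, and the sample values are assumptions for illustration; this is not how secnetperf computes its results.

```python
import math

def summarize(latencies_us, duration_s):
    """Return RPS and selected latency percentiles for completed requests."""
    ordered = sorted(latencies_us)

    def percentile(p):
        # Nearest-rank percentile: the smallest sample such that at least
        # p percent of all samples are less than or equal to it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "rps": len(ordered) / duration_s,
        "p50_us": percentile(50),
        "p99_us": percentile(99),
        "max_us": ordered[-1],
    }

# Hypothetical samples: ten requests completed within a 1 ms window.
samples_us = [80, 85, 90, 95, 100, 100, 105, 110, 120, 400]
stats = summarize(samples_us, duration_s=0.001)
print(stats)
```

A single slow request barely moves the median but dominates the high percentiles, which is why a latency distribution is more informative than an average.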
Performance Compared to Using UDP Sockets
Before getting to the main topic, let’s peek at the performance we are getting with MsQuic and XDP, which is what motivated us to use XDP. The following graph shows the RPS and latency distribution from an RPS benchmark. The benchmark is configured to allow only one outstanding request: a new request is not initiated until the previous request completes. The details of the test environment and configuration can be found at the end of this article. In the graph, we can see that bypassing the kernel TCP/IP stack with XDP not only gives us a huge latency reduction but also more than doubles RPS in this simple scenario.
Measure Performance: one request at a time vs multiple parallel requests
Next, we compare the performance of 1 parallel request vs 4 parallel requests. The graph below shows that latency at all percentiles increases by more than 100% with 4 parallel requests. However, RPS also more than doubles.
Let’s take a closer look at why both RPS and latency increase. In MsQuic (with or without XDP), streams that belong to the same connection are processed on the same thread. This essentially means that only one CPU processes all streams on a given connection. MsQuic uses this design to align connections with RSS, which reduces cache thrashing and improves performance. [Receive Side Scaling (RSS) is a network driver technology that enables the efficient distribution of network receive processing across multiple CPUs via tuple-based hashing.]
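The effect of tuple-based hashing can be sketched as follows. This is only an illustration: RSS-capable NICs typically use a Toeplitz hash and an indirection table, whereas this sketch substitutes CRC32 as a stand-in, and the function name `pick_cpu` and the addresses are hypothetical.

```python
import zlib

def pick_cpu(src_ip, src_port, dst_ip, dst_port, cpu_count):
    """Map a UDP 4-tuple to a CPU index (a stand-in for RSS hashing)."""
    # CRC32 is NOT the hash RSS uses; it is chosen here only because it is
    # deterministic and available in the standard library.
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % cpu_count

# Hypothetical connection. Every stream on it shares the same 4-tuple,
# so the stream ID plays no part in choosing the CPU.
conn = ("10.0.0.2", 50000, "10.0.0.1", 443)
cpus_for_streams = {stream_id: pick_cpu(*conn, cpu_count=8) for stream_id in range(4)}
print(cpus_for_streams)
```

Because the hash input is the tuple rather than the stream, opening more streams on one connection never spreads the work across more CPUs.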
With this design, when the server sends responses, all four responses (each of which maps to a stream) have to be processed sequentially, one by one. This unavoidably introduces latency. At the same time, having more parallel requests encourages batching, which is what raises RPS. For example, when sending responses for multiple streams, MsQuic can use a single syscall to notify XDP to send a whole batch of packets. XDP itself also performs better when sending packets in batches rather than one by one.
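The benefit of batching can be illustrated with a simple cost model in which every packet has a fixed processing cost and every syscall has a fixed overhead. The function name, the packet count, and the cost values are hypothetical and are not measurements of MsQuic or XDP.

```python
def send_cost_us(packets, batch_size, per_packet_us, per_syscall_us):
    """Total CPU time to send `packets` when one syscall flushes up to `batch_size` packets."""
    syscalls = -(-packets // batch_size)  # ceiling division
    return packets * per_packet_us + syscalls * per_syscall_us

# Hypothetical costs: 1 us of work per packet and 5 us per syscall, for 28 packets.
one_by_one = send_cost_us(packets=28, batch_size=1, per_packet_us=1.0, per_syscall_us=5.0)
batched = send_cost_us(packets=28, batch_size=28, per_packet_us=1.0, per_syscall_us=5.0)
print(one_by_one, batched)
```

In this model the per-packet work is unchanged; only the fixed overhead is amortized, so the gain grows with the ratio of syscall cost to per-packet cost.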
The simple diagram below shows how processing multiple streams on one worker thread/core can introduce latency. Stream 5 in this diagram has to wait until streams 0-4 have been processed.
Let’s scale up the benchmark a little to measure RPS and latency over different numbers of parallel requests. The graphs below show that once the number of parallel requests reaches 20, the RPS improvement is negligible, but latency keeps going up. This means that one connection/core cannot keep up with the number of parallel requests. Adding more parallel requests essentially just adds queueing delay in the worker thread queue.
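This saturation behavior can be reproduced with an idealized closed-loop model of a single worker. The sketch below uses standard asymptotic bounds and Little's law; the function name and the timing values are hypothetical, and the model ignores queueing below saturation, so it understates latency at small request counts.

```python
def closed_loop(parallel, service_us, off_core_us):
    """Idealized bounds for one worker serving `parallel` always-outstanding requests."""
    unloaded_latency_us = service_us + off_core_us
    # Throughput is limited either by the number of requests in flight or by
    # the worker, which completes at most one request per service time.
    rps = min(parallel / unloaded_latency_us, 1 / service_us) * 1_000_000
    # Little's law: requests in flight = throughput * latency.
    latency_us = parallel / rps * 1_000_000
    return rps, latency_us

# Hypothetical numbers: 10 us of worker time and 90 us spent elsewhere per request.
for n in (1, 4, 10, 20, 40):
    rps, latency_us = closed_loop(n, service_us=10, off_core_us=90)
    print(f"{n:>2} parallel: {rps:>9.0f} RPS, {latency_us:>5.0f} us")
```

With these made-up numbers the worker saturates at 10 parallel requests. Beyond that point RPS stays flat and every additional request only lengthens the queue, which is the shape described above.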
Scale Up Horizontally: more cores and connections
An effective way to bring down latency is to use more connections, which fundamentally just adds more cores to process requests. We repeat the same benchmark from the previous section, but this time we use 2 connections and ensure that they run on two separate threads/cores. Compared to the 1-connection benchmark, the graphs below show that peak RPS almost doubles and latency drops significantly.
The following figure illustrates, at a high level, how adding another thread/core affects latency. In this diagram, streams are evenly distributed across 2 connections on 2 threads/cores. Stream 5 only needs to wait for stream 1 and stream 3 to finish processing.
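The waiting behavior in the two diagrams can be expressed in a few lines. This sketch assumes streams are assigned to workers round-robin by stream number, which matches the diagrams but is only an illustration of the idea; the function name is hypothetical.

```python
def streams_ahead(stream_id, workers):
    """Streams queued ahead of `stream_id` when streams are spread round-robin over workers."""
    return [s for s in range(stream_id) if s % workers == stream_id % workers]

print(streams_ahead(5, workers=1))  # [0, 1, 2, 3, 4]
print(streams_ahead(5, workers=2))  # [1, 3]
```

With one worker, stream 5 waits behind five streams. With two workers, it waits behind only two.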
Test Environment
- CPU: Intel Xeon Gold 6230 x 2 (20 cores / 40 threads)
- RAM: 32GB RDIMM, DDR4 2933MT/s x 12 (384 GB)
- NIC: Mellanox ConnectX-4 Lx EN MCX4131A-GCAT PCIe 3.0 x8 (50 Gigabit)
We use secnetperf (available on GitHub) to benchmark RPS. Each request is 512 B, and each response is 8 KB. These sizes are the same for every benchmark in this blog post.
All benchmarks are performed over (encrypted) QUIC and use a prototype of XDP-for-Windows. Absolute values are not indicative of the final product and should continue to improve.
Conclusion
Optimizing for both RPS and latency often involves a trade-off, and performance is always relative to the scenario. Application/service owners should measure performance in their own scenarios to find the right balance. Scaling up horizontally is an effective way to achieve both ultra-low latency and high RPS.
In the future, we will keep investing in ultra-low latency work in MsQuic using XDP:
- Support interrupt mode for XDP (instead of always polling) to reduce idle CPU usage.
- Support app-driven execution to eliminate additional app context switch overhead.
We will continue to share our findings along the journey. Ideas and code contributions are always welcome.