
Networking Blog
6 MIN READ

Algorithmic improvements boost TCP performance on the Internet

huanyi
Microsoft
May 12, 2021

Improved network performance over the Internet is essential for edge devices connecting to the cloud. Last mile performance impacts user-perceived latencies and is an area of focus for our online services like M365, SharePoint, and Bing. Although the next-generation transport QUIC is on the horizon, TCP is the dominant transport protocol today. Improvements made to TCP's performance directly improve response times and download/upload speeds.

 

The Internet last mile and wide area networks (WAN) are characterized by high latency and a long tail of networks that suffer from packet loss and reordering. Higher latency, packet loss, jitter, and reordering all impact TCP's performance. Over the past few years, we have invested heavily in improving TCP WAN performance and engaged with the IETF standards community to help advance the state of the art. In this blog we will walk through our journey and show how we made big strides in improving performance between Windows Server 2016 and the upcoming Windows Server 2022.

 

Introduction 

There are two important building blocks of TCP which govern its performance over the Internet: Congestion Control and Loss Recovery. The goal of congestion control is to determine the amount of data that can be safely injected into the network to maintain good performance and minimize congestion. Slow Start is the initial stage of congestion control where TCP ramps up its sending rate quickly until a congestion signal (packet loss, ECN, etc.) occurs. The steady-state Congestion Avoidance stage follows Slow Start, where different TCP congestion control algorithms use different approaches to adjust the amount of data in flight.
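As a rough illustration (a sketch of the textbook behavior per RFC 5681, not the Windows implementation), the two stages differ mainly in how fast the congestion window (cwnd) grows per acknowledgement:

```python
# Minimal sketch of the two congestion control stages, with the window
# counted in segments. Illustration only, not the Windows TCP code.

def on_ack(cwnd, ssthresh, newly_acked_segments):
    """Grow the congestion window when new data is acknowledged."""
    for _ in range(newly_acked_segments):
        if cwnd < ssthresh:
            cwnd += 1            # Slow Start: roughly doubles cwnd every RTT
        else:
            cwnd += 1.0 / cwnd   # Congestion Avoidance: about +1 segment per RTT
    return cwnd

def on_congestion_signal(cwnd):
    """On packet loss or ECN, cut the window and remember the new threshold."""
    ssthresh = max(cwnd / 2.0, 2)
    return ssthresh
```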

 

Loss Recovery is the process of detecting and recovering from packet loss during transmission. TCP can infer that a segment is lost by looking at the ACK feedback from the receiver, and retransmit any segments inferred lost. When loss recovery fails, TCP uses the retransmission timeout (RTO, usually 300ms in WAN scenarios) as the last resort to retransmit the lost segments. When the RTO timer fires, TCP returns to Slow Start from the first unacknowledged segment. This long wait period and the subsequent congestion response significantly impact performance, so optimizing Loss Recovery algorithms enhances throughput and reduces latency.
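The two recovery paths can be contrasted with a toy sketch (invented helper names and a simplified RTO formula; the real stack is far more involved):

```python
import time
from dataclasses import dataclass, field

DUP_ACK_THRESHOLD = 3   # classic fast-retransmit trigger
MIN_RTO = 0.3           # the ~300 ms WAN floor mentioned above, in seconds

@dataclass
class Connection:
    cwnd: float = 10.0
    srtt: float = 0.1
    dup_acks: int = 0
    rto_deadline: float = field(default_factory=lambda: time.monotonic() + MIN_RTO)

    def retransmit_first_unacked(self):
        print("retransmitting first unacknowledged segment")

def on_ack(conn: Connection, is_duplicate: bool):
    if is_duplicate:
        conn.dup_acks += 1
        if conn.dup_acks == DUP_ACK_THRESHOLD:
            # Loss inferred from ACK feedback alone: fast retransmit, no timer.
            conn.retransmit_first_unacked()
    else:
        conn.dup_acks = 0
        conn.rto_deadline = time.monotonic() + max(2 * conn.srtt, MIN_RTO)

def on_timer(conn: Connection):
    if time.monotonic() >= conn.rto_deadline:
        # Last resort: the RTO fires, we retransmit from the first
        # unacknowledged segment and fall back to Slow Start.
        conn.cwnd = 1.0
        conn.retransmit_first_unacked()
```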

 

Improving Slow Start: HyStart++ 

We determined that the traditional slow start algorithm overshoots the optimal rate and is likely to hit an RTO during slow start due to massive packet loss. We explored the use of an algorithm called HyStart to mitigate this problem. HyStart triggers an exit from Slow Start when the connection latency is observed to increase. However, we found that false positives sometimes cause a premature exit from slow start, limiting performance. We developed a variant of HyStart to mitigate premature Slow Start exit in networks with delay jitter: when HyStart is triggered, rather than going to the Congestion Avoidance stage, we use LSS (Limited Slow Start), an increase algorithm that is less aggressive than Slow Start but more aggressive than Congestion Avoidance. We have published our ongoing work on the HyStart algorithm as an IETF draft adopted by the TCPM working group: HyStart++: Modified Slow Start for TCP (ietf.org).
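A simplified sketch of the idea follows; the delay-increase check and the LSS growth divisor are illustrative placeholders, not the exact parameters from the draft:

```python
from dataclasses import dataclass

@dataclass
class CongestionState:
    cwnd: float = 10.0
    in_slow_start: bool = True
    in_limited_slow_start: bool = False

def on_round_end(state: CongestionState, min_rtt_this_round: float,
                 min_rtt_last_round: float, rtt_threshold: float):
    """Exit standard Slow Start when per-round delay grows by more than a threshold."""
    if state.in_slow_start and min_rtt_this_round >= min_rtt_last_round + rtt_threshold:
        state.in_slow_start = False
        state.in_limited_slow_start = True   # switch to LSS, not Congestion Avoidance

def on_ack(state: CongestionState, acked_segments: int, lss_divisor: int = 4):
    if state.in_slow_start:
        state.cwnd += acked_segments                  # standard Slow Start growth
    elif state.in_limited_slow_start:
        state.cwnd += acked_segments / lss_divisor    # LSS: slower than SS, faster than CA
    else:
        state.cwnd += acked_segments / state.cwnd     # Congestion Avoidance
```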

 

Loss recovery performance: Proportional Rate Reduction 

HyStart++ helps prevent the overshoot problem so that we enter loss recovery from Slow Start with fewer packets lost. However, loss recovery itself can also incur packet losses if we retransmit in large bursts. Proportional Rate Reduction (PRR) is a loss recovery algorithm that accurately adjusts the number of bytes in flight throughout the entire loss recovery period so that, at the end of recovery, it is as close as possible to the congestion window. We enabled PRR by default in the Windows 10 May 2019 Update (19H1).
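A rough sketch of the PRR send-quota calculation, loosely following RFC 6937 (illustrative only, not the Windows implementation):

```python
import math
from dataclasses import dataclass

@dataclass
class PrrState:
    ssthresh: int        # target window (bytes) at the end of recovery
    recover_fs: int      # bytes in flight when recovery started
    pipe: int            # current estimate of bytes in flight
    mss: int = 1460
    prr_delivered: int = 0
    prr_out: int = 0

def prr_bytes_to_send(state: PrrState, delivered_now: int) -> int:
    """How many bytes may be (re)transmitted on this ACK during recovery."""
    state.prr_delivered += delivered_now
    if state.pipe > state.ssthresh:
        # Proportional mode: reduce in-flight data gradually toward ssthresh.
        sndcnt = math.ceil(state.prr_delivered * state.ssthresh
                           / state.recover_fs) - state.prr_out
    else:
        # Under the target: build back toward ssthresh, at most one extra MSS.
        limit = max(state.prr_delivered - state.prr_out, delivered_now) + state.mss
        sndcnt = min(state.ssthresh - state.pipe, limit)
    return max(sndcnt, 0)
```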

 

Re-implementing TCP RACK: Time-based loss recovery 

After implementing PRR and HyStart++, we still noticed that we tended to consistently hit an RTO during loss recovery when many packets were lost in one congestion window. After looking at the traces, we figured out that lost retransmits were causing TCP to time out. The RACK implementation shipped in Server 2016 is unable to recover lost retransmits. A fully RFC-compliant RACK implementation (which can recover lost retransmits) requires per-segment state tracking, but in Server 2016 per-segment state is not stored.

 

In Server 2016, we built a simple circular-array-based data structure to track the send time of blocks of data in one congestion window. The RACK implementation built on this data structure has many limitations, including being unable to recover lost retransmits. During the development of the Windows 10 May 2020 Update, we built per-segment state tracking for TCP, and in Server 2022 we shipped a new RACK implementation which can recover lost retransmits.
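The core of time-based detection is straightforward once per-segment send timestamps are available; here is a rough sketch of the RACK idea from RFC 8985 (not the actual Windows data structures):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    seq: int
    sent_time: float          # per-segment state: when this (re)transmission was sent
    acked: bool = False
    lost: bool = False

def rack_detect_losses(segments: List[Segment], newest_acked: Segment,
                       reo_wnd: float) -> List[Segment]:
    """Mark segments lost based on send time rather than duplicate-ACK counting.

    A segment (including a retransmission) is presumed lost if it was sent
    more than reo_wnd earlier than a segment that has already been (S)ACKed.
    """
    newly_lost = []
    for seg in segments:
        if seg.acked or seg.lost:
            continue
        if seg.sent_time + reo_wnd < newest_acked.sent_time:
            seg.lost = True          # eligible for retransmission without an RTO
            newly_lost.append(seg)
    return newly_lost
```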

 

(Note that Tail Loss Probe (TLP), which is part of the RACK/TLP RFC and helps recover faster from tail losses, has also been implemented and enabled by default since Windows Server 2016.)

 

Improving resilience to network reordering 

Last year, Dropbox and Samsung reported to us that Windows TCP had poor upload performance in their networks due to network reordering. We bumped up the priority of reordering resilience, and in the Windows version currently under development we have completed our RACK implementation, which is now fully compliant with the RFC. Dropbox and Samsung confirmed that they no longer observe upload performance problems with the new implementation. You can find how we collaborated with the Dropbox engineers here. In our automated WAN performance tests, we also found that throughput in the reordering test cases improved more than 10x.

 

Benchmarks 

To measure the performance improvements, we set up a WAN environment by creating two NICs on a machine and connecting them with an emulated link where bandwidth, round-trip time, random loss, reordering, and jitter can be controlled. We ran performance benchmarks on this testbed for Server 2016, Server 2019, and Server 2022 using an A/B testing framework we previously built that makes it easy to automate testing and data analysis. We used the current Windows build 21359 for Server 2022 in the benchmarks, since we plan to backport all TCP performance improvements to Server 2022 soon.

 

Let’s look at non-reordering scenarios first. We emulated 100Mbps bandwidth and tested the three OS versions under four different round trip times (25ms, 50ms, 100ms, 200ms) and two different flow sizes (32MB, 128MB). The bottleneck buffer size was set to 1 BDP. The results are averaged over 10 iterations. 
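For reference (simple arithmetic on the stated parameters): 1 BDP at 100 Mbps is 100 Mbit/s × RTT, so roughly 312 KB for the 25 ms case and 100 Mbit/s × 0.2 s = 20 Mbit ≈ 2.5 MB for the 200 ms case; the bottleneck buffer therefore grows with the emulated RTT.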

 

Server 2022 is the clear winner in all categories because RACK significantly reduces RTOs occurring during loss recovery. Goodput is improved by up to 60% (200ms case). Server 2019 did well in relatively high latency cases (>= 50ms). However, for 25ms RTT, Server 2016 outperformed Server 2019. After digging into the traces, we noticed that the Server 2016 receive window tuning algorithm is more conservative than the one in Server 2019 and it happened to throttle the sender, indirectly preventing the overshoot problem. 

 

Now let's look at reordering scenarios. Here is how we emulate network reordering: we set a per-packet probability of reordering. Once a packet is chosen to be reordered, it is delayed by a specified amount of time instead of the configured RTT, so it arrives ahead of packets sent before it and shows up out of order at the receiver. We tested a 1% reordering rate and a 5ms reordering delay. Server 2016 and Server 2019 achieved extremely low goodput due to the lack of reordering resilience. In Server 2022, the new RACK implementation avoided most unnecessary loss recoveries and achieved reasonable performance. We can see goodput is up over 40x in the 128MB, 200ms RTT case. In the other cases, we are seeing at least a 5x goodput improvement.
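A minimal sketch of that per-packet logic (a toy model of the emulation described above, not our actual test tool):

```python
import random

def emulated_delay(rtt: float, reorder_prob: float = 0.01,
                   reorder_delay: float = 0.005) -> float:
    """Return the artificial delay applied to one packet on the emulated link.

    With probability reorder_prob the packet gets only reorder_delay (e.g. 5 ms)
    instead of the configured RTT, so it overtakes packets sent before it and
    is seen out of order at the receiver.
    """
    if random.random() < reorder_prob:
        return reorder_delay
    return rtt
```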

 

Next Steps 

We have come a long way improving Windows TCP performance on the Internet. However, there are still several issues that we will need to solve in future releases.

  • We are unable to measure specific performance improvements from PRR in the A/B tests. This needs more investigation. 
  • We have found issues with HyStart++ in networks with jitter, so we are working on making the algorithm more resilient to jitter.
  • The reassembly queue limit (the maximum number of discontiguous data blocks allowed in the receive queue) turns out to be another factor that affects our WAN performance. After this limit is reached, the receiver discards any subsequent out-of-order data segments until in-order data fills the gaps. While these segments are being discarded, the receiver can only send back SACKs that carry no new information, which can stall the sender (see the receiver-side sketch below).
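A rough receiver-side sketch of that behavior (illustrative names and limit value, not the Windows code):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Receiver:
    reassembly_queue: List[Tuple[int, int]] = field(default_factory=list)  # (start, end) blocks
    reassembly_queue_limit: int = 16   # hypothetical limit, for illustration only

    def on_out_of_order_segment(self, start: int, end: int) -> List[Tuple[int, int]]:
        """Handle a segment that arrives beyond a hole in the byte stream.

        Returns the SACK blocks to advertise. Once the queue is full, new
        out-of-order data is dropped and the SACK carries nothing new, which
        can stall the sender until in-order data fills the gap.
        """
        if len(self.reassembly_queue) >= self.reassembly_queue_limit:
            return list(self.reassembly_queue)          # no new information
        self.reassembly_queue.append((start, end))
        return list(self.reassembly_queue)              # includes the new block
```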

-- Windows TCP Dev Team (Matt Olson, Praveen Balasubramanian, Yi Huang) 

Updated May 12, 2021
Version 1.0

10 Comments

  • Simple_Thought

    It would be great if this was ported back to Windows 10. There are still upwards of two years of support left for IoT / LTS, which I have on a few systems that unfortunately are not eligible for moving to Windows 11.

  • koensayr

    Hi there, it appears this has been fixed in the Windows 11 22H1 update, but I'm curious if there are any plans to backport this fix to Windows 10 machines?

     

    Happy to support this effort any way I can.

  • QuantumFailure

    huanyi Hi! I'm sorry to drop into your thread sporadically but this is the only seemingly relevant resource I could find anywhere on this subject. I have a new PC build with the latest Windows 11 (and all up-to-date drivers, etc) and I have a very hard time generally surfing the internet. I pulled up Wireshark (everyone's best friend) and tracked dozens of TCP conversations that end in this pattern:

    TCP spurious retransmission (SPUR) --> DUP ACK --> SPUR --> DUP ACK --> SPUR --> DUP ACK --> ............... --> RST

    I've confirmed using developer tools in my browser that this is actually the direct cause of websites stalling and failing to load. For example, for a given stall, I'd take the browser record of the request, analyze the timestamps, duration, etc and compare that with a TCP conversation picked up by Wireshark. Essentially all stalls match up with a SPUR/DUP ACK/RST sequence.

    I've submitted tickets to ASUS, Intel, Microsoft... none seem to have any answers. I've been at this for weeks now and want to find some relief to this problem.

    Do you (or anyone else!) have any idea what's happening here and how I can alleviate this issue?

    Thank you in advance! I'll drop the specs below
    Benjamin

    Network Adapter Intel(R) Ethernet Controller (3) I225-V -- driver v2.1.1.7 (Microsoft signed)
    Intel(R) Management Engine Interface -- driver 2145.1.42.0 (Microsoft signed)
    MOBO ASUS Prime Z590-a -- BIOS v1402
    NVIDIA GeForce RTX 3060-Ti -- driver v31.0.15.1694

    Processor 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz 3.60 GHz
    Installed RAM 32.0 GB (31.8 GB usable)
    System type 64-bit operating system, x64-based processor
    Edition Windows 11 Pro
    Version 21H2
    Installed on ‎8/‎16/‎2022
    OS build 22000.856
    Experience Windows Feature Experience Pack 1000.22000.856.0

     

  • myNET_Sebastian, can you try the latest prerelease version of Windows? It should have all the improvements and bug fixes needed for reordering resilience.

  • GaryNebbett, you are right about the DSACK bug; we fixed it in Nickel back in October 2021. The latest Windows prerelease version should have the fix.

  • myNET_Sebastian

    Hi huanyi,

    we are also waiting for this improvement, but so far we can't see any better speeds in the latest Windows 10 and Windows 11 updates.

    Is there a plan for when this feature will be ported to Windows 10/11?

  • GaryNebbett

    There appears to be a bug in the Windows 11 implementation (at least up to build 22000.434) of this functionality; the "sanity checking" performed on DSACKs in routine TcpReceiveSack has a flaw that means DSACKs are never recognized - this affects the "tuning" ability of the RACK mechanism. I wrote a more detailed description of the problem here: https://gary-nebbett.blogspot.com/2022/01/windows-11-tcpip-congestion-control.html

  • thecooldean

    When will this be implemented in Windows 10? It has been implemented in Windows 11 but not in Windows 10. I have been getting terrible upload speeds due to out-of-order packet delivery, as mentioned in:

    https://docs.microsoft.com/en-us/answers/questions/89768/slow-wired-upload-speed-vs-linux-on-same-hardware.html?page=3&pageSize=10&sort=oldest

    https://docs.microsoft.com/en-us/answers/questions/330172/extreamly-slow-upload-speed-in-windows-all-other-o.html

     

    Please kindly let me know if these improvements will be implemented in Windows 10.