With great backend service power comes great productivity. Over the years, we’ve tuned SharePoint in Microsoft 365 to deliver turbocharged experiences. We’ve invested billions of dollars across research and engineering to drive massive improvements to the capabilities, performance, scalability, and security when working with documents, sites, integrated email, and chat content experiences, plus fast, efficient desktop sync.
We want to take a moment to share a behind-the-scenes look at our service and network optimizations, and at how we design, manage, and monitor SharePoint across each “mile” of connectivity between you and Microsoft datacenters.
Scroll through all and let us know what you think. And don’t miss the embedded, related video toward the end.
Networking issues are wide and varied. To help us optimize network connectivity, we look across the whole of the world’s networking infrastructure and begin to assess it across three connection milestones. The First Mile is in our full control at Microsoft. Within the Middle Mile, we partner with thousands of internet service providers (ISPs) to optimize how Microsoft services get routed. And it is in the Last Mile that we commonly see customers manage their corporate networks and device connectivity of their employees (mobile, web and desktop).
Our goal is to take the traffic bound for SharePoint and OneDrive and get it into the global Microsoft network as quickly as possible, allowing us to route that traffic in the most efficient, direct manner and to take advantage of the performance optimizations we make and update across the entire SharePoint platform in Microsoft 365. We will cover these optimizations in the next section.
The SharePoint service in Microsoft 365 is unique in that it manages an extremely varied mix of user traffic: web pages that support experiences like portals and team sites, collaboration traffic from Office applications, and data-heavy scenarios such as OneDrive sync and video streaming (Stream content now utilizes the SharePoint file platform for streaming video). Such a varied traffic mix at high volumes can result in bottlenecks across the Microsoft cloud networking infrastructure, resulting in packet loss or what is commonly known as “congestion.” To address this, we optimize the global SharePoint datacenter infrastructure from the ground up, ranging from our physical infrastructure of servers, routers, and load balancers to configuration adjustments across networking protocol layers.
Our global server fleets have third-party network interface cards (NICs) that come installed and configured with factory settings. While monitoring low-level network traffic, we noticed that our servers under load were exhibiting packet loss, which we traced to NICs running out of buffer space and discarding packets. Working with our third-party vendor, we tuned buffer depths on our NICs to eliminate packet discards. We also coupled this with a custom NDIS driver that combines the packets of a large transfer into a single packet to use the available buffer space efficiently.
While debugging file transfer speeds, we noticed speeds were lower than network design targets. Upon further investigation, we narrowed it down to our servers not sending the optimal amount of data to take advantage of the large bandwidth-delay product (BDP) available due to the high-bandwidth Microsoft global network that interconnects across continents.
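To make the bandwidth-delay product concrete, here is a minimal sketch of the arithmetic. The link speed, round-trip time, and window size are illustrative assumptions, not figures from the Microsoft network, and the function names are hypothetical.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: int, rtt_ms: int) -> float:
    """Bytes that must be in flight to keep a path fully utilized."""
    return bandwidth_bps * rtt_ms / 1000 / 8


def max_throughput_bps(window_bytes: int, rtt_ms: int) -> float:
    """Upper bound on throughput when at most window_bytes are in flight."""
    return window_bytes * 8 * 1000 / rtt_ms


# Assumed example path: 10 Gbit/s intercontinental link with a 150 ms round trip.
bdp = bandwidth_delay_product_bytes(10_000_000_000, 150)
print(f"BDP: {bdp / 1e6:.1f} MB in flight needed to fill the path")

# A sender limited to a 64 KiB window on the same path is capped at
# roughly 3.5 Mbit/s, no matter how fast the link is.
capped = max_throughput_bps(64 * 1024, 150)
print(f"Throughput with a 64 KiB window: {capped / 1e6:.2f} Mbit/s")
```

The gap between those two numbers is the kind of shortfall described above: a fast, long-distance path is only used fully when the sender keeps enough data in flight.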
SharePoint in Microsoft 365, like our on-premises server, runs on the Windows + IIS + ASP.NET stack. We partner with the Microsoft Windows and IIS teams to assess HTTP.sys and optimize how data is passed all the way from the application tier into ASP.NET/IIS and the Windows networking stack, allowing us to maximize data in transit.
With more data flowing into the network, we design the service to avoid network overload. SharePoint adopted TCP CUBIC (including RACK-TLP) as our congestion control protocol within the Microsoft network. This allows us to take advantage of available network capacity by quickly ramping up congestion windows coupled with better recovery from congestion events; most congestion events within the Microsoft network are transient in nature vs. systemic capacity issues.
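As a rough illustration of how CUBIC ramps the congestion window back up after a loss, the sketch below evaluates the window-growth function from the public CUBIC specification (RFC 8312), using that document's default constants. It is not the Windows implementation, and it does not model RACK-TLP loss detection; the function names are hypothetical.

```python
C = 0.4      # CUBIC scaling constant (RFC 8312 default)
BETA = 0.7   # multiplicative decrease factor (RFC 8312 default)


def cubic_k(w_max: float) -> float:
    """Seconds needed to grow back to w_max after a congestion event."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t: float, w_max: float) -> float:
    """Congestion window (in segments) t seconds after a congestion event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


# Assume the window was 100 segments when the loss occurred.
w_max = 100.0
for t in (0.0, 1.0, 2.0, 4.0, 6.0):
    print(f"t={t:.0f}s  cwnd={cubic_window(t, w_max):6.1f} segments")
```

The curve starts at 70% of the previous maximum, climbs quickly, flattens as it approaches the old maximum, and then probes beyond it. That shape suits transient congestion events, where capacity returns soon after the loss.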
With a globally distributed user base of more than 200 million users, our goal is to get close to our users and quickly onboard their traffic to the Microsoft network, or what we call the ‘first mile,’ so that it benefits from the optimizations we called out above. We do this by leveraging our service front doors, or network PoPs (points of presence). You can think of service front doors as the closest piece of Microsoft infrastructure to our users, allowing us to provide low-latency onramps to our network. The ever-expanding set of service front doors exists around the globe in key cities, interconnected with thousands of internet service providers (ISPs) to route SharePoint and OneDrive cloud traffic in the most efficient manner and reach our network as fast as possible. We use AnyCast routing to connect the user to the closest service “front door” from their location. Regardless of where their data is stored, they will always enter the Microsoft network via the closest front door, then get routed through our dedicated, internal network to the datacenter hosting their data.
While the SharePoint service has made significant optimizations within the ‘first mile,’ we often notice user connectivity challenges in the last mile - the network segment controlled by our customers. To help our customers understand how their users connect to Microsoft, we provide insights to help identify and address network bottlenecks, available in the Microsoft 365 admin center.
The Network connectivity page distills an aggregate of numerous network performance metrics. This snapshot represents your enterprise network perimeter health, represented by a points value ranging from 0 - 100. A higher value indicates optimal network connectivity.
We leverage anonymized telemetry from first-party applications, such as the OneDrive sync client or Office applications, to gather low-level networking information. This is then analyzed against our published best practices for network connectivity to produce an in-depth view of your organization’s connectivity to Microsoft, surfacing issues such as lengthy backhauls or intermediary devices such as proxies that impact performance, along with recommendations to remediate them. We also provide a side-by-side comparison of how your organization is doing relative to Microsoft 365 customers in each location to help you benchmark your connectivity.
Learn more about the Microsoft 365 network connectivity center.
While network performance is key to moving the bits around, we have also optimized our storage layers for peak performance and reliability.
Azure storage – file ‘chunking’
We store all SharePoint and OneDrive file data securely on Azure leveraging Azure SQL for metadata and Azure Blob storage for the file contents. To provide maximum flexibility in how we store and retrieve file contents, we run every incoming file through a process called “chunking.” The incoming file is split into smaller “chunks,” individually encrypted with a unique key per chunk and written in parallel across two Azure regions for redundancy. For example, if we are storing a 500KB file, we would chunk the incoming file into five chunks of 100KB, and then encrypt each of the five chunks with a unique key and write each chunk as blobs to two Azure regions, in total ten blobs written to our storage system in parallel.
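The 500KB example above can be sketched as follows. This is a simplified illustration, not the service's code: the region names, the 32-byte key size, and the helper names are assumptions, 1KB is taken as 1,000 bytes, and the encryption step is only represented by generating a unique key per chunk.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int
    data: bytes
    key: bytes  # unique per-chunk key; actual encryption is omitted here


def split_into_chunks(data: bytes, chunk_size: int) -> list[Chunk]:
    """Split a file into fixed-size chunks, each with its own random key."""
    return [
        Chunk(index=i, data=data[offset:offset + chunk_size],
              key=secrets.token_bytes(32))
        for i, offset in enumerate(range(0, len(data), chunk_size))
    ]


def plan_writes(chunks: list[Chunk], regions: list[str]) -> list[tuple[str, int]]:
    """List every (region, chunk index) blob write; these can run in parallel."""
    return [(region, chunk.index) for chunk in chunks for region in regions]


file_bytes = bytes(500 * 1000)                      # a 500KB file
chunks = split_into_chunks(file_bytes, 100 * 1000)  # 100KB chunks
writes = plan_writes(chunks, ["region-a", "region-b"])
print(len(chunks), "chunks,", len(writes), "blob writes")
```

Because every chunk carries its own key and its own blob, a chunk can be read, rewritten, or re-fetched from the other region without touching the rest of the file.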
When downloading or retrieving the file, our compute nodes reach out to the closest Azure storage location to quickly retrieve the different chunks of the file in parallel. If, for some reason, a chunk retrieval is taking longer than expected, we automatically reach out to the secondary region to fetch the chunk and continue processing the download request, giving us the ability to handle transient issues without impacting end-user performance. All this chunking is completely transparent to our users and to the applications that interact with SharePoint.
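The parallel read with a fallback to the secondary region can be sketched like this. The timeout value, the `primary` and `secondary` fetch functions, and the chunk identifiers are all hypothetical stand-ins; the real service's thresholds and storage calls are not described in this post.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def download(chunk_ids, primary, secondary, timeout_s=0.1):
    """Fetch all chunks from the primary region in parallel, falling back
    to the secondary region for any chunk that is not ready in time."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {cid: pool.submit(primary, cid) for cid in chunk_ids}
        parts = []
        for cid in chunk_ids:
            try:
                # Wait up to timeout_s for this chunk from the primary region.
                parts.append(futures[cid].result(timeout=timeout_s))
            except FutureTimeout:
                parts.append(secondary(cid))
        # Note: leaving the "with" block still waits for slow primary reads
        # to finish; a production version would cancel or detach them.
        return b"".join(parts)


def primary(cid):
    if cid == 2:
        time.sleep(0.3)  # simulate a transient slowdown for one chunk
    return f"P{cid}".encode()


def secondary(cid):
    return f"S{cid}".encode()


result = download([0, 1, 2, 3], primary, secondary)
print(result)
```

In this simulated run, only the slow chunk is served from the secondary region, and the chunks are still reassembled in their original order.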
OneDrive differential sync
Differential sync builds on the ability to store files in a chunked manner. It is a capability that allows the OneDrive Sync client to sync only the parts of large files that have changed, not the entire file. The OneDrive Sync client calculates which parts of the file have changed locally and uploads only those parts to the server. Server-side, we again take advantage of our chunking and merge each change into the appropriate chunk without needing to read and write the entire file from our storage layers. This makes the file synchronization process faster for these files. It also reduces the time taken to upload and download a file, as well as the bandwidth consumed. This month we are rolling out the ability to leverage differential sync for all file types (JPEG, PDF, MOV, MP4, etc.) stored in OneDrive and SharePoint.
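One simple way to picture the "which parts changed" step is to hash fixed-size chunks of the old and new versions and compare them. This is an illustrative sketch only: the post does not say how the OneDrive Sync client detects changes, and the chunk size, SHA-256 hashing, and function names here are assumptions. Fixed-offset comparison also does not handle insertions that shift later bytes, or files that shrink.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int) -> list[str]:
    """SHA-256 digest of each fixed-size chunk of the file."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: bytes, new: bytes, chunk_size: int) -> list[int]:
    """Indexes of chunks in `new` that differ from, or do not exist in, `old`."""
    old_hashes = chunk_hashes(old, chunk_size)
    new_hashes = chunk_hashes(new, chunk_size)
    return [
        i for i, digest in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != digest
    ]


old_version = bytes(500_000)
edited = bytearray(old_version)
edited[250_000] = 1  # a one-byte local edit in the middle of the file
new_version = bytes(edited)

to_upload = changed_chunks(old_version, new_version, 100_000)
print("chunks to upload:", to_upload)
```

Here a one-byte edit to a 500,000-byte file means uploading a single 100,000-byte chunk instead of the whole file, which is where the bandwidth and time savings come from.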
Learn more about how OneDrive sync works.
Fluid Framework is an innovative technology and set of experiences that will make collaboration seamless. It breaks down barriers between apps. With it, people coauthor at industry-leading speed: authors can deconstruct content into collaborative building blocks, use them across numerous applications, and combine them in new, more flexible types of documents.
All Fluid content (and components) within the Microsoft 365 ecosystem gets stored as “files” in SharePoint and OneDrive. With Fluid, the service sends every keystroke back to SharePoint to process the incoming stream of changes. These changes get relayed to other co-authors working on the same content, all in a near-real-time co-authoring experience. Behind the scenes, we persist the changes into our storage system using B-Trees to map Fluid’s distributed data structures to storage blobs, which allows for O(log n) performance as we read and write parts of the Fluid file.
Learn more about Fluid Framework.
What are streaming APIs? Streaming APIs take advantage of networking and storage investments with built-in resiliency. Recent updates improve Web user experiences, OneDrive sync, and chat and meetings. Office applications are optimized across the desktop and the Web (Word, Excel, and PowerPoint).
“Turbocharging Microsoft 365 cloud user experiences” video by Shyam Narayan:
At all layers, from our datacenters out to you and your employees, we monitor and optimize the backend services, storage, and applications. We do this on a continuous basis by focusing on high-quality product performance and efficient end-user connectivity.
- Shyam Narayan, principal PM manager and Mark Kashman, senior product manager - Microsoft