In March 2019, Microsoft announced the general availability of Live Events in Microsoft 365, a video and town hall solution that lets enterprises host live and on-demand events at scale across Yammer, Stream, and Microsoft Teams.
As we continue to build features and experiences that improve live events, I wanted to spend some time talking about how we embarked on this mission and how we're continuing to test and scale Live Events in Yammer to unprecedented levels.
The modern workplace requires a modern video event experience that scales to the size of enterprises and supports worldwide viewership and traffic without lag or overload. This is where Live Events in Yammer comes in. The initial goal was to support up to 5 concurrent live events per tenant, with each event supporting up to 10k attendees, or, in other words, 50k concurrent users per tenant. To better understand the traffic pattern and the potential spikes Yammer would need to support, we looked at how attendees of past Skype for Business broadcasts connected over time. For an event with 50k attendees, about 10k of those users joined over a one-minute span near the beginning of the event.
Managing such an intake would be a challenge, and the way our architecture is designed created additional difficulties. In Yammer, when clients receive a notification that a message has been posted in a Live Event, a storm of requests ensues: 50k requests would be issued within a second to reload the feed and display the new message(s) for all users. Fortunately, our clients implement jittering, which helped distribute that 50k load over a window of 2 minutes. The highest actual request rate we had to support in this initial iteration was 830 requests per second (rps), well above our usual pattern of approximately 200 rps (which would also need to be sustained throughout the event).
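The client-side jittering described above can be sketched as follows. This is a minimal illustration, not Yammer's actual client code; the function names and the exact window length are assumptions. The idea is simply that each client waits a uniformly random delay before reloading the feed, so a burst of notifications turns into a roughly even trickle of requests across the window.

```python
import random

# Assumed jitter window; the post mentions a ~2-minute spread.
JITTER_WINDOW_SECONDS = 120.0

def jittered_delay(window_seconds: float = JITTER_WINDOW_SECONDS) -> float:
    """Pick a uniform random delay so clients don't all reload at once."""
    return random.uniform(0.0, window_seconds)

def on_message_notification(reload_feed, schedule) -> None:
    """On a 'new message' notification, schedule the feed reload after a
    random delay instead of reloading immediately (hypothetical hooks)."""
    schedule(jittered_delay(), reload_feed)
```

With uniform jitter, the average arrival rate is simply the number of clients divided by the window length, which is what keeps the spike survivable for the servers.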
To tackle this complex problem, we discussed multiple options and decided to address scalability for the write path and the read path.
For the purposes of this article we'll focus on the read path improvements, mainly because (1) the write path improvements are still evolving as we build a geo-distributed network, and (2) Yammer's usage pattern means far more people read content in a network than post to it.
To test on staging or not to test on staging? To build our own tooling or not? We ultimately agreed to test in production, because no other environment matched it in configuration, capacity, and settings. Rarely is a demo environment as complex as the production environment, and that was the case here as well; this gap is part of why chaos engineering has become an established practice. In a truly distributed system, there are simply too many variables to keep in sync. Here's how we worked through the process:
Some interesting war stories:
It took several cycles, but we’ve now gotten to a state where we can safely increase the rate limit to support millions of users across multiple customers globally, all viewing and participating in a Live Event in Yammer, no matter where in the world they are.
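A rate limit like the one mentioned above is commonly enforced with a token bucket. The sketch below is an illustrative, generic implementation, not Yammer's actual rate limiter; the class name and parameters are assumptions. It admits a steady `rate` of requests per second while allowing short bursts up to `capacity`.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refills at `rate` tokens/sec,
    holds at most `capacity` tokens; each allowed request costs one."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: bursts allowed immediately
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it is throttled."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Raising the limit safely, as described above, then amounts to increasing `rate` and `capacity` gradually while watching downstream load.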
Next mission: flying to Mars. Our user growth has accelerated even further in recent years, and we want to be ready for our customers when it continues, so we will carry on our journey to improve, re-architect, and scale. Keep an eye on this blog to stay up to date with how we're building Yammer.
Andrei is a Software Engineer on the Yammer team.