In March of 2019, Microsoft announced the general availability of Live Events in Microsoft 365, a video and town hall solution for enterprises to host live and on-demand events at scale across Yammer, Stream, and Microsoft Teams.
As we continue to build features and experiences that improve live events, I wanted to spend some time to talk about how we embarked on this mission and how we’re continuing to test and scale Live Events in Yammer to unprecedented levels.
The modern workplace requires a modern video event experience that scales to the size of enterprises and supports worldwide viewership and traffic without lag or overload. This is where Live Events in Yammer comes in. The initial goal was to support up to 5 concurrent live events per tenant, with each event supporting up to 10k attendees, or in other words, 50k concurrent users per tenant. We began by trying to better understand the traffic pattern and the potential spikes that Yammer would need to support, so we looked at how attendees of past Skype for Business broadcasts were distributed over time as they connected to the broadcast. It seemed that for an event with 50k attendees, about 10k of those users would join over a one-minute timespan near the beginning of the event.
Managing such an intake would be a challenge, and the way our architecture is designed created additional difficulties. In Yammer, when clients receive a notification that a message has been posted in a Live Event, a storm of requests ensues. That means 50k requests would be issued in a second to reload the feed and display the new message(s) for all users. Fortunately, our clients implement jittering, which helped distribute that 50k load over a window of 2 minutes. Thus, the highest actual rate we had to support in this initial iteration was 830 requests per second (rps), considerably higher than our usual pattern of approx. 200 rps (which would also need to be supported throughout the event).
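To make the jittering idea concrete, here is a minimal sketch in Python. It assumes each client picks a uniformly random delay inside the window before reloading its feed; the actual jitter strategy used by the Yammer clients is not described here, and the function names are hypothetical. The sketch only illustrates how a one-second burst is spread out, and does not attempt to reproduce the figures quoted above.

```python
import random
from collections import Counter


def jittered_delays(num_clients, window_seconds, seed=None):
    """Return one random delay (in seconds) per client.

    Assumption: delays are drawn uniformly from the window. Real clients
    may use a different distribution.
    """
    rng = random.Random(seed)
    return [rng.random() * window_seconds for _ in range(num_clients)]


def requests_per_second(delays):
    """Bucket the delays into whole seconds and count requests in each bucket."""
    return Counter(int(delay) for delay in delays)


# 50k clients notified at the same instant, spread over a 2-minute window.
delays = jittered_delays(num_clients=50_000, window_seconds=120, seed=1)
buckets = requests_per_second(delays)
peak_rps = max(buckets.values())
print(f"peak requests in any one second: {peak_rps}")
```

Without jitter, all 50,000 reloads would land in the same second; with it, no single second carries more than a small fraction of them.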
To tackle this complex problem, we discussed multiple options and decided to address scalability for the write path and the read path.
For the purposes of this article we're going to focus on the read path improvements, mainly because (1) the write path improvements are still undergoing changes as we build a geo-distributed network and (2) the nature of Yammer assumes more people will read content in a network than post.
To test on staging or not to test on staging? To build our own tooling or not? We ultimately agreed to test on production because any other environment was not on par with it in terms of configuration, capacity and settings. Rarely is a demo environment as complex as the production environment and that was the case here as well. That's why chaos engineering is now a practice. In a truly distributed system, there are just simply too many variables to keep in sync. Here’s how we worked through the process:
We considered whether to use existing tooling or build new tooling for load testing. We decided on doing both. In hindsight that was a curse and a blessing, but the work paid off and enabled us to make steps towards building a more reliable infrastructure and test our new geo-replicated environment. Having a semi-automated and repeatable process helped us iterate more quickly and with confidence.
We decided to iteratively raise the rate limit on our services to allow for more traffic to flow through and observe the various bottlenecks at the circuit level. That's why testing on staging was irrelevant for the core scenarios. Any failures and corrections we would make could not be transposed to production.
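The iterative approach can be sketched roughly as follows. This is an illustration only, not our actual tooling: `ramp_rate_limit`, `fake_load_test`, the step size, and the error threshold are all hypothetical names and values chosen for the example.

```python
def ramp_rate_limit(start_rps, target_rps, step_rps, run_load_test,
                    max_error_rate=0.01):
    """Raise the rate limit step by step until a bottleneck shows up.

    run_load_test(limit) is a caller-supplied function (hypothetical here)
    that drives traffic at the given limit and returns the observed error
    rate as a fraction. Returns the highest limit that stayed healthy, or
    None if even the starting limit failed.
    """
    limit = start_rps
    last_good = None
    while limit <= target_rps:
        error_rate = run_load_test(limit)
        if error_rate > max_error_rate:
            break  # bottleneck found: stop, investigate, fix, then re-run
        last_good = limit
        limit += step_rps
    return last_good


def fake_load_test(limit_rps):
    """Stand-in for a real load test: pretend a bottleneck appears above 600 rps."""
    return 0.0 if limit_rps <= 600 else 0.2


highest_healthy = ramp_rate_limit(200, 1000, 100, fake_load_test)
print(highest_healthy)
```

Each time the loop stops early, the failing limit points at the next bottleneck to investigate before the ramp is repeated.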
Because our load testing and traffic generation tool was still under construction, we used the simplest tool we could find. The Azure DevOps load test feature came to the rescue. It helped us evaluate our maximum load and iron out issues in our microservices.
Some interesting war stories:
Connection latency and Nginx. We observed during our load tests that a lot of the upstream connections to our authentication service were timing out for no apparent reason and not respecting the connection keep-alive. Our Wavefront metrics suggested they were being held occupied until timeout, no matter how large that timeout was. After a couple of weeks of deep dives to solve this latency issue, we realized that we had removed Nginx from handling the requests before handing them over to our Dropwizard service. Pro tip: Dropwizard uses Jetty. Re-introducing Nginx fixed the problem, as Nginx keeps a smaller pool of connections when talking to Jetty, is much better at handling incoming connections, and respects the keep-alive.
The drop in the number of connections maintained by Dropwizard
The drop in latency between a downstream service and the upstream service with Nginx added in front of it
It took several cycles, but we’ve now gotten to a state where we can safely increase the rate limit to support millions of users across multiple customers globally, all viewing and participating in a Live Event in Yammer, no matter where in the world they are.
Next mission. Flying to Mars. Our user growth has been increasing at an even more accelerated pace in recent years. We want to be ready for our customers as that growth continues, and so we will continue our journey to improve, re-architect and scale. Keep an eye on this blog to stay up to date with how we’re building Yammer.