We wanted to share what happened during the recent outage in Azure Lab Services.
First, we are very conscious of the impact of the outage in the current context, and we are deeply sorry. It should not have happened, but it did. We will share our learnings and preventative steps for the future once we are back to the expected level of service for all Labs. In the meantime, at the risk of oversharing, this is what happened.
The way the service works is that we have capacity for VMs and related resources pre-assigned to our service in Azure. We have monitoring in place that allows us to augment our capacity as needed. In normal conditions, this process allows us to maintain a healthy reserve as we grow. We are no longer in normal conditions. The world has had to move to remote working and learning in a very short time, so our priority over the last weeks has been to provide Labs to the customers who need them.
Teams, Windows Virtual Desktop, and other services also needed more capacity to meet the increased demand. In Labs, we had unused capacity we could reclaim from Labs and VMs no longer in use that customers never deleted. In our pricing model, we only charge for active hours, so many customers do not delete their Labs and VMs once they no longer need them. Over the last year we focused on implementing features and improving the quality of the experience, and delayed implementing reclaiming procedures. This seemed to be the right trade-off for where we are in the service life cycle.
So, over the last week, we implemented a reclaiming process to free up the unused capacity. We started by removing Labs that “reserved” capacity but were never published, then VMs that were never started, followed by VMs in Labs that had seen no usage in a long time. We surfaced this information in the UI and sent emails to the owners of Labs scheduled to be reclaimed so that they had a chance to keep them if needed. In Labs we reclaimed, we kept the Template VM in case they needed to be republished. It has been very front of mind to make sure there would be no loss of data for anyone. We also had a pocket of capacity we could reclaim more aggressively: the various Labs we create for ourselves to test and reproduce issues, kept in isolated subscriptions for that purpose.
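The staged criteria above can be sketched as a simple priority function. This is purely illustrative: the field names, the dict shape, and the 90-day idle threshold are our assumptions for the sketch, not the service's actual implementation.

```python
from datetime import datetime, timedelta

# Assumed cutoff for "no usage in a long time" -- illustrative only.
IDLE_THRESHOLD = timedelta(days=90)

def reclaim_stage(lab, now):
    """Return the stage at which a Lab's capacity would be reclaimed.

    Lower stages are reclaimed first; None means the Lab is kept.
    The `lab` dict shape here is hypothetical.
    """
    if not lab["published"]:
        return 1  # reserved capacity but was never published
    if lab["vm_last_started"] is None:
        return 2  # VMs that were never started
    if now - lab["vm_last_started"] > IDLE_THRESHOLD:
        return 3  # no usage in a long time
    return None   # active Lab: keep it
```

Running the stages in order (1, then 2, then 3) mirrors the gradual rollout described above, with the least-disruptive reclaims happening first.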
We ran these procedures gradually, region to region, and scenario by scenario. The program worked and we reclaimed enough capacity to put us in a good position to meet demand. We were in a good place, or so we thought.
We then realized that some of the “internal” subscriptions actually contained customer Labs, and that we had deleted them along with ours. It should not have happened. It did. When a user creates a Lab, our allocation algorithm decides in which specific subscription to create the Lab based on current usage, unpublished Labs, expected Lab size, and a few other factors. A few “internal” subscriptions were mistakenly configured to be part of that pool. So, when we applied the much more aggressive criteria to the “internal” subscriptions, we deleted Labs we should not have deleted. Again, it should not have happened.
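To make the failure mode concrete, here is a hypothetical sketch of how such an allocation pool might choose a subscription, and where a mis-flagged subscription would slip in. Every name and field below is our assumption for illustration, not the service's code.

```python
def pick_subscription(subscriptions, lab_size):
    """Choose a pool subscription with enough headroom for a new Lab.

    Headroom accounts for current usage and capacity reserved by
    unpublished Labs. Hypothetical sketch; field names are assumptions.
    """
    candidates = [
        s for s in subscriptions
        # A test subscription whose `internal` flag was mistakenly left
        # False would slip into this pool and receive customer Labs --
        # the misconfiguration behind the incident.
        if not s["internal"]
        and s["capacity"] - s["used"] - s["reserved_unpublished"] >= lab_size
    ]
    if not candidates:
        raise RuntimeError("no subscription with enough free capacity")
    # Prefer the subscription with the most headroom.
    return max(candidates,
               key=lambda s: s["capacity"] - s["used"] - s["reserved_unpublished"])
```

Under this sketch, the aggressive reclaiming criteria were keyed off the same `internal` flag, so a subscription wrongly in the pool was treated as internal-only during cleanup even though it held customer Labs.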
Over the last few days, we have been heads down trying to recover these Labs. They fall into two categories:
Labs that could be recovered. In some regions we were able to find the disks still intact and have recreated these Labs. Access to the Lab should have resumed by now. The only impact to these Labs is that if you need to increase the VM pool, you will need to re-publish the Lab first.
Labs that cannot be recovered. These will need to be recreated. If they were published, this means loss of data for the students. We cannot say how much this pains us.
If your Lab was not recovered, it will be identified in the Labs UI. Very shortly, you will be able to open your Lab in read-only mode and view the roster, schedule, quota, and other configuration. You will also be able to delete your Lab once you no longer need it.
We understand this potentially requires significant effort. You can contact us at AzLabsCOVIDSupport@microsoft.com with any questions or for help.
In trying to meet demand, we ran the reclaiming faster than we should have, and we failed to catch the issue before it was too late. We will extract the learnings and improve our internal processes to prevent anything similar from happening again. As mentioned already, this should not have happened, and we are very sorry.
Please do not hesitate to contact us,
The Labs team