Mastering Maintenance - Strategies for seamless auto-updates on Azure SQL Managed Instance

Microsoft

Jan 22, 2024

Introduction

When it comes to managing your Azure SQL Managed Instances and ensuring their reliability and performance, understanding planned maintenance events is important. In this blog post, we will demystify how maintenance works, what to expect during the maintenance process, and some of the considerations and best practices associated with them.

Across Azure services, a unified process is followed to ensure that changes are carefully introduced and validated before reaching the production stage. This approach revolves around the following key principles:

Quality Assurance through Testing: Changes go through rigorous test and integration validations to ensure they meet Azure's quality standards before deployment.
Gradual Deployment and Continuous Monitoring: Changes are rolled out in a gradual manner, allowing Azure to measure health signals continuously. This enables the detection of unexpected impacts that might not have surfaced during testing.
Avoiding Production Impact: The goal is to prevent changes from causing problems in the broader production environment. Steps are taken to stop problematic changes before they reach a wide audience.

To automate change deployment while adhering to the principles above, Azure follows the Safe Deployment Practice (SDP) framework. This applies to all Azure SQL Managed Instance logical upgrades, that is, all SQL-related upgrades. The SDP framework ensures that all code and configuration changes progress through specific stages, monitored by health metrics. Automated actions and alerts are triggered if any degradation is detected, ensuring timely responses to potential issues.

The deployment journey begins with developers modifying their code and testing it on their systems. The code then moves to staging environments, where interaction between different components is tested. Azure's integration environment comes into play here, as it is dedicated to testing interactions between specific Azure services.

The subsequent stages include:

Canary Regions: Publicly referred to as "Early Updates Access Program" regions, these full-scale Azure regions host a diverse set of services, including first-party, third-party, and invited external customers. Canary regions undergo extensive end-to-end validations and scenario coverage at scale, aiming to replicate patterns found in public Azure regions.
Pilot Phase: After successful validation in canary regions, changes enter the pilot phase. This phase, while still relatively small in scale, introduces more diversity in hardware and configurations. It is especially vital for hardware-dependent services like core storage and compute infrastructure.
Broad Deployment: Upon positive outcomes from the pilot phase, changes are incrementally deployed to broader Azure regions. Region pairing is respected throughout this process. If any negative health signals arise, the deployment is paused for a thorough examination. Any changes that might introduce a regression are excluded from the payload before deployment proceeds.

The Role of Maintenance Windows

Azure maintenance windows are designated periods of time during which Microsoft performs routine maintenance tasks on its infrastructure. These tasks can include updates to the underlying infrastructure, software patches, or other improvements aimed at enhancing security, performance, or stability.

Maintenance windows are the key to orchestrating smooth updates and patches across Azure's vast infrastructure. Once a change reaches the region where your instance is deployed, the maintenance window restricts the time slots in which that deployment can be applied to your instance. By narrowing these slots, Azure aims to minimize potential disruptions caused by maintenance activities.

During maintenance windows, Azure services are fully online, but might experience transient faults such as brief network interruptions or temporary loss of connectivity to a database. These transient faults can be mitigated with proper retry logic at the application level, which handles them by automatically retrying failed operations.

Understanding Maintenance Window Options

Currently, Azure SQL Managed Instance offers three distinct maintenance window options, each tailored to different needs:

Default Window: This window spans from 5 PM to 8 AM of the next day, encompassing all days of the week. While it does not mean that every day within this range will see maintenance events, it signifies that these days are candidates for deployments.

Weekday Slot: Operating from Monday to Thursday, this window runs from 10 PM in the evening until 6 AM the next morning. The narrower time span allows for focused maintenance activities, while still ensuring your Azure SQL Managed Instances are online during most of the window.

Weekend Slot: Extending from Friday to Sunday, the weekend slot provides an extended period for maintenance activities. Like the other slots, instances remain operational during this window, with potential failovers having limited impact.

Note: The local time zone is determined by the location of the Azure region that hosts the resource and may observe daylight saving time in accordance with the local time zone definition. It is not determined by the time zone configured on the managed instance.
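
To make the three options concrete, here is a minimal Python sketch that checks whether a given region-local time falls inside a window. It rests on assumptions that the text above does not spell out: the listed days are taken to be the days on which a window opens (so the Thursday weekday slot ends on Friday morning), and the weekend slot is assumed to use the same 10 PM to 6 AM hours as the weekday slot. The names MaintenanceWindow, WINDOWS and in_window are hypothetical and not part of any Azure SDK.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class MaintenanceWindow:
    name: str
    start_days: frozenset  # weekdays (Monday=0) on which the window may open
    start: time            # opening time, region-local
    end: time              # closing time on the following day, region-local


WINDOWS = {
    "default": MaintenanceWindow("default", frozenset(range(7)), time(17), time(8)),
    "weekday": MaintenanceWindow("weekday", frozenset({0, 1, 2, 3}), time(22), time(6)),
    # Assumption: the weekend slot uses the same hours as the weekday slot.
    "weekend": MaintenanceWindow("weekend", frozenset({4, 5, 6}), time(22), time(6)),
}


def in_window(window: MaintenanceWindow, local_dt: datetime) -> bool:
    """Return True if a region-local datetime falls inside an overnight window."""
    t = local_dt.time()
    if t >= window.start:
        # Evening part: the window opened today.
        return local_dt.weekday() in window.start_days
    if t < window.end:
        # Morning part: the window opened yesterday.
        return (local_dt - timedelta(days=1)).weekday() in window.start_days
    return False
```

For example, in_window(WINDOWS["weekday"], datetime(2024, 1, 26, 3, 0)) treats 3 AM on a Friday as the tail of the Thursday weekday slot. The datetime passed in must already be expressed in the time zone of the hosting Azure region, as the note above explains.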

Maintenance Process

Once you select a maintenance window and the service configuration completes, planned maintenance occurs only during the window of your choice. Maintenance does not happen in every window; when a window opens and maintenance has been scheduled for the underlying services, Azure's deployment process begins. The virtual machines hosting the managed instances go through a rolling upgrade, in which individual virtual machines are updated one by one rather than all at once. If all the necessary virtual machines or hosts are patched within the window, the event concludes. However, if there are outstanding updates, the event might be extended to the next day or week, based on the maintenance window configuration.

The maintenance event may contain updates for hardware, firmware, operating system, satellite software components, or the SQL database engine. They are typically combined into a single batch to minimize the incidence of maintenance events. In the case of SQL Managed Instance, updates are combined in two batches, one focused on physical infrastructure, and another one focused on SQL engine and logical infrastructure.

Throughout the planned maintenance event, resources within your Azure SQL Managed Instance environment remain accessible, so for the most part your databases stay operational and responsive. Towards the end of the maintenance event, a brief reconfiguration takes place, during which some changes are applied to your database resources. This period is intentionally kept extremely short, typically lasting less than 8 seconds, which minimizes the impact on your application's availability.

If your application is in the middle of a long-running process (for example, a long-running query) when the reconfiguration occurs, it may need to reestablish its connection to the database. This is akin to what happens in an on-premises scenario when a primary database fails over to a secondary one. Robust retry logic in your application is therefore crucial during planned maintenance events. This logic should handle temporary connection interruptions gracefully: if a connection is lost, the application should automatically attempt to reconnect. A well-implemented retry mechanism lets your application continue its operations after a brief interruption.
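
The retry behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation from any particular driver: TransientError is a hypothetical placeholder for whichever exceptions your database driver raises on a dropped connection, and run_with_retry, along with its default attempt count and delays, is an example choice.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver's transient connection error."""


def run_with_retry(operation, max_attempts=5, base_delay=1.0, max_delay=30.0,
                   sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Waits grow as 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so many clients do not reconnect in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The operation passed in should open a fresh connection on each call and be safe to repeat, because a query interrupted by the reconfiguration is rerun from the start. With the defaults shown, the waits before the final attempt add up to roughly 15 seconds, which is longer than the reconfiguration period of typically less than 8 seconds.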

Azure SQL Managed Instance failover for high availability refers to the process and mechanisms Azure employs to ensure that your SQL Managed Instances remain highly available, especially in the face of potential disruptions such as hardware or software failures, maintenance activities, or other unexpected events. Azure SQL Managed Instance automatically handles failover in the background. In the event of a failure, it automatically switches to a standby replica to ensure minimal disruption to services.

Failovers play a crucial role in maintaining service availability during maintenance windows. If a host or VM with an Azure SQL Managed Instance primary replica is being patched, a failover can swiftly transfer operations to ensure continuity. These failovers typically occur one to two times during a maintenance slot, each lasting around 8 seconds.

For Azure SQL Managed Instances in the Business Critical service tier, failovers are even faster due to the optimized setup of Always On availability groups and local storage.

If you implement disaster recovery by configuring auto-failover groups in Azure SQL Managed Instance, we recommend that you replicate workloads across regional pairs to benefit from Azure’s isolation and availability policies. Failover groups in paired regions also perform better than those in unpaired regions. Azure guarantees that paired regions are not deployed to at the same time. However, the order of deployment is not guaranteed, so it is not possible to predict which region will be upgraded first: sometimes it will be your geo-primary instance, and sometimes the geo-secondary.

If your Azure SQL Managed Instance uses auto-failover groups that are not aligned with Azure region pairing, select different maintenance window schedules for your geo-primary and geo-secondary instances. For example, you can select the Weekday maintenance window for your geo-secondary instance and the Weekend maintenance window for your geo-primary instance.
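
The guidance above can be turned into a simple configuration check. The Python sketch below is illustrative only: the function name, the window labels, and the regions_are_paired flag are hypothetical, and the rule that only the Weekday and Weekend combination avoids overlap is an inference from the window descriptions earlier in this post (the Default window spans every night, so it overlaps with both of the other slots).

```python
VALID_WINDOWS = {"default", "weekday", "weekend"}


def maintenance_may_coincide(primary_window: str, secondary_window: str,
                             regions_are_paired: bool) -> bool:
    """Return True if both geo-replicas could be patched in the same time slot."""
    for window in (primary_window, secondary_window):
        if window not in VALID_WINDOWS:
            raise ValueError(f"unknown maintenance window: {window!r}")
    if regions_are_paired:
        # Paired regions are not deployed to at the same time.
        return False
    # Unpaired regions: only Weekday plus Weekend keeps the slots disjoint.
    return {primary_window, secondary_window} != {"weekday", "weekend"}
```

A deployment script could call this for every failover group and flag any pair for which it returns True.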

Although rare, failures or interruptions during a maintenance event can occur. In case of a failure, changes are rolled back, and the maintenance will be rescheduled to another time.

Considerations

Azure SQL Managed Instance consists of service components hosted on a dedicated set of isolated virtual machines that run inside the customer's virtual network subnet. These virtual machines form a “virtual machine group” that can host multiple managed instances. All instances hosted in a virtual machine group share the same maintenance window. Specifying a different maintenance window for a managed instance, during its creation or afterwards, means that the instance must be placed in a virtual machine group with the corresponding maintenance window. If there is no such virtual machine group in the subnet, a new one must be created first to accommodate the instance. Accommodating additional instances in an existing virtual machine group may require the group to be resized.
You cannot restore a database while a maintenance window change is in progress.
The expected duration of configuring a maintenance window on a managed instance can be calculated using the estimated duration of instance management operations.
Each new virtual machine group in a subnet requires additional IP addresses according to the virtual cluster IP address allocation.
Changing the maintenance window for an existing managed instance also requires temporary additional IP capacity, similar to scaling the number of vCores for the respective service tier.
Configuring or changing the maintenance window changes the IP address of the instance, within the IP address range of the subnet.

Advance Notifications

You can opt in to receive notifications 24 hours prior to the maintenance event, immediately before maintenance starts, and when the maintenance window is completed. Check the Resource health center for more information. To receive emails, advance notifications must be configured. For more information, see Advance notifications.

As of this writing, advance notifications for maintenance windows are in public preview for Azure SQL Managed Instance.

Best Practices for Resiliency During Maintenance Operations

Implement retry logic in your applications for all transient errors common to the cloud environment. See this sample source code for the connection retry logic.
Use the latest drivers to connect to SQL Managed Instance. Newer drivers have a better implementation of transient error handling.
Test your application for resiliency prior to maintenance events by using the user-initiated failover functionality for SQL Managed Instance.
From a networking perspective, choose the redirect connectivity mode over proxy. With redirect, the client connects directly to the node hosting the database and does not need to go through the gateway. This makes already established connections to the SQL managed instance resilient to gateway maintenance.
Reference: Connection types - Azure SQL Managed Instance | Microsoft Learn.
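
When you test resiliency with a user-initiated failover, it helps to measure how long the instance is actually unreachable from your application's point of view. The Python sketch below is one possible approach, not an official tool: measure_outages and its parameters are hypothetical, and probe stands for any function you supply that raises an exception when a trivial query (such as SELECT 1) fails.

```python
import time


def measure_outages(probe, duration_s, interval_s=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll probe() for duration_s seconds and return observed outage intervals.

    Each interval is a (start, end) tuple in seconds since polling began.
    """
    outages = []
    outage_start = None
    t0 = clock()
    while True:
        now = clock() - t0
        if now >= duration_s:
            break
        try:
            probe()
            ok = True
        except Exception:
            ok = False
        if not ok and outage_start is None:
            outage_start = now                   # first failed probe
        elif ok and outage_start is not None:
            outages.append((outage_start, now))  # first successful probe after
            outage_start = None
        sleep(interval_s)
    if outage_start is not None:
        outages.append((outage_start, duration_s))  # still down when polling ended
    return outages
```

Start the polling loop, trigger the failover, and compare the reported intervals with the behavior of your retry logic. The resolution of the measurement is limited by interval_s and by how long a failing probe takes to time out.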

Conclusion

By understanding what to expect during a planned maintenance event and having appropriate measures in place, such as retry logic, error handling, and the proper connectivity mode, you can help ensure a smooth transition through these events and maintain a high level of service availability for your Azure SQL Managed Instance.