Importance of application resiliency in the cloud
High availability is a fundamental part of SQL Managed Instance (MI) platform that works transparently for your database applications. As such, failovers from primary to secondary nodes in case of node degradation or fault detection, or during our regular monthly software updates are an expected occurrence for all applications using SQL Managed Instance in Azure. This is why it is important to ensure your applications (legacy apps migrated to the cloud or cloud born apps) are "cloud-ready" and resilient to transient errors typical for the cloud environments.
Your cloud-ready apps need to follow these principles:
Recommended best practices for making your apps cloud-ready are:
To learn more on building cloud-ready apps and test them with user-initiated failover functionality for SQL Managed Instance, consider this video Testing cloud app readiness with failover resiliency for SQL MI (15 minutes in length).
To learn more how to execute and monitor user-initiated failover on SQL MI, read further:
Initiate SQL Managed Instance failover on-demand
In August 2020, we have released a new feature user-initiated manual failover allowing users to manually trigger a failover on SQL Managed Instance using PowerShell or CLI commands, or through invoking an API call. Manually initiated failover on a managed instance will be an equivalent of the automated failover for high availability and software patches initiated automatically by the service.
This functionality will allow you to test your applications end to end for transient errors fault resiliency on automatic failovers in case of planned or unplanned events before deploying to production. In addition to testing how failover impacts existing database sessions, it can also help verify if it changes the end-to-end performance due to changes in the network latency. In some cases, if performance issues are encountered on SQL MI, manually invoking a failover to a new node can help mitigate the performance issue.
Because the restart operation is intrusive and a large number of them could stress the platform, only one user-initiated manual failover call is allowed every 15 minutes for each managed instance (this was reduced from the original 30 minutes in October 2020).
Ensuring that your applications are failover resilient prior to deploying to production will help mitigate the risk of application faults in production and will contribute to application availability for your customers. |
How is High Availability (HA) implemented on a Managed Instance?
Azure SQL Managed Instance (MI) is offered in two service tiers, one is Business Critical (BC) and the other one is GP (General Purpose). Both service tiers offer High Availability (HA), with different technical implementations, as follows:
|
|
|
Using the user-initiated manual failover functionality, manually initiating a failover on MI BC service tier will result in a failover of the primary node to one of the three secondary nodes. As secondary read-only nodes on the MI BC service tier can be used for read scale-out from a single node (out of three read-only secondary nodes), the user initiated manual failover capability allows also a failover of read-only replica. This means that users can manually failover the read scale-out from the current to one of the two other available read-only secondary nodes.
Manually initiating a failover on MI GP service tier will result in deallocation of the primary node, and allocation of a new node from the pool of available nodes, and reattachment of the storage from the old to the new node.
How to initiate a manual failover on SQL Managed Instance?
RBAC permissions required
User initiating a failover will need to have one of the following RBAC roles:
Using PowerShell
The minimum version of Az.Sql needs to be v2.9.0 (download link), or use Azure Cloud Shell from the Azure portal that always has the latest PowerShell version available.
If you have several Azure subscriptions, first ensure that you select the appropriate subscription where your target SQL MI is located:
PowerShell |
Select-AzureRmSubscription <SubscriptionID> |
Use PS command Invoke-AzSqlInstanceFailover with the following example to initiate failover of the primary node, applicable to both BC and GP service tier:
PowerShell |
Invoke-AzSqlInstanceFailover -ResourceGroupName "ResourceGroup01" -Name "ManagedInstance01" |
Use the following PS command to failover read secondary node, applicable to BC service tier only:
PowerShell |
Invoke-AzSqlInstanceFailover -ResourceGroupName "ResourceGroup01" -Name "ManagedInstance01" -ReadableSecondary |
Using CLI
Ensure to have the latest CLI scripts installed.
Use az sql mi failover CLI command with the following example to initiate failover of the primary node, applicable to both BC and GP service tier:
CLI |
az sql mi failover -g myresourcegroup -n myinstancename |
Use the following CLI command to failover read secondary node, applicable to BC service tier only:
CLI |
az sql mi failover -g myresourcegroup -n myinstancename --replica-type ReadableSecondary |
Using Rest API
For advanced users who would perhaps like to automate failovers of their SQL Managed Instances for purposes of implementing continuous testing pipeline, or automated performance mitigators, this can be accomplished through initiating failover through an API call, see Managed Instances - Failover REST API for details.
To initiate failover using REST API call, first generate the Authentication Token. One way to do that is to use a Postman client. Initiating the API call from any other client should generally work as well. This token is used as Authorization property in the header of API request and it is mandatory.
The following is an example of the API URI to call:
API URI |
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/provi... |
The following are API call properties that can be passed in the call:
API property |
Parameter |
subscriptionId |
Subscription ID to which managed instance is deployed |
resourceGroupName |
Resource group that contains managed instance |
managedInstanceName |
Name of managed instance |
replicaType |
(Optional) (Primary|ReadableSecondary)
This is the type of replica to be failed over: primary or readable secondary. |
api-version |
Static value and currently needs to be “2019-06-01-preview" |
API response will be one of the following two:
Track the operation status
Note: Completion of the failover process (not the actual short unavailability) might take several minutes at a time in case of high-intensity workloads. This is because the instance engine is taking care of all current transactions on the primary and catch up on the secondary, prior to being able to failover. |
Monitoring the failover
SQL MI Business Critical
To monitor the progress of user initiated manual failover for Business Critical service tier, execute the following T-SQL query in your favorite client (such is SSMS) on SQL Managed Instance. It will read the system view sys.dm_hadr_fabric_replica_states and report replicas available on the instance. Refresh the same query after initiating the manual failover.
T-SQL |
SELECT DISTINCT replication_endpoint_url, fabric_replica_role_desc FROM sys.dm_hadr_fabric_replica_states |
Receiving success confirmation from a PowerShell command, or from the API response indicates a successfully completed failover operation. Therefore, monitoring of the failover process is not required. It is however shown in this article for illustration purposes only. Please note that examples in this article do not include monitoring failover of secondary for SQL MI BC SKU. |
T-SQL output prior to initiating the failover will indicate the current primary replica on the MI BC containing one primary and three secondaries in the AlwaysOn Availability Group.
Monitor failover of secondary replica for BC (Business Critical) Service tier
Monitoring failover of secondary replica for MI BC is not available through DMVs. Receiving a response of success from PowerShell or API is sufficient confirmation that failover has been successful.
Monitor failover for GP (General Purpose) Service tier
As MI GP service tier is a single node system replaced with another node on the failover, you will not be able to see the role change using the above DMV example for MI BC service tier. Your T-SQL query output for MI GP service tier will always show a single node before and after the failover, something as the following:
However, there is an alternative way to monitor the GP instance failover by looking at the last start time of the SQL engine before and after the failover. Use the following T-SQL command before and after the failover to see the SQL engine start time change:
T-SQL |
SELECT DISTINCT sqlserver_start_time, sqlserver_start_time_ms_ticks FROM sys.dm_os_sys_info |
The output will be the timestamp when the SQL engine was started which should be different before and after the failover. For example, your output before the failover might show something as follows:
After the failover, the timestamp will be further in time than the previous reading. This indicates a new start time of the SQL engine, therefore indicating that failover has occurred.
Also consider the following to note the failover is in progress:
Functional limitations
Disclaimer
Please note that products and options presented in this article are subject to change. This article reflects the user-initiated manual failover option available for Azure SQL Managed Instance in February, 2021.
Closing remarks
If you find this article useful, please like it on this page and share through social media.
To share this article, you can use the Share button below, or this short link: https://aka.ms/mifailover-techblog.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.