User initiated manual failover on SQL Managed Instance
Published Jan 28 2021 08:02 AM 13.7K Views
Microsoft

Importance of application resiliency in the cloud

 

High availability is a fundamental part of SQL Managed Instance (MI) platform that works transparently for your database applications.  As such, failovers from primary to secondary nodes in case of node degradation or fault detection, or during our regular monthly software updates are an expected occurrence for all applications using SQL Managed Instance in Azure. This is why it is important to ensure your applications (legacy apps migrated to the cloud or cloud born apps) are "cloud-ready" and resilient to transient errors typical for the cloud environments.

 

Your cloud-ready apps need to follow these principles:

  • The application must be able to detect faults when they occur
  • The application must be able to determine if the fault is transient
  • The application needs to retry an operation in case of transient issues and keep count of the retries
  • The application must use an appropriate strategy for the retries (number of retries, delay between each attempt)

 

Recommended best practices for making your apps cloud-ready are:

  • Use the latest drivers to connect to SQL Managed Instance. Newer drivers have a stronger implementation of transient error handling resiliency.
  • Use the latest drivers to your advantage, the following parameters are recommended to be added to your application connection string for SQL MI: “Connect Timeout=30;ConnectRetryCount=3;ConnectRetryInterval=10”.
  • Implement retry logic in your applications for all transient errors common to the cloud environment. See this sample source code for the connection retry logic. Use this as a template to modify your apps.
  • Learn more about best practices for transient fault handling in applications in the cloud.
  • Finally, test your applications for the resiliency using user-initiated manual failover for SQL Managed Instance (subject of this blogpost).

To learn more on building cloud-ready apps and test them with user-initiated failover functionality for SQL Managed Instance, consider this video Testing cloud app readiness with failover resiliency for SQL MI (15 minutes in length).

 

To learn more how to execute and monitor user-initiated failover on SQL MI, read further:

 

Initiate SQL Managed Instance failover on-demand

 

In August 2020, we have released a new feature user-initiated manual failover allowing users to manually trigger a failover on SQL Managed Instance using PowerShell or CLI commands, or through invoking an API call. Manually initiated failover on a managed instance will be an equivalent of the automated failover for high availability and software patches initiated automatically by the service.

 

This functionality will allow you to test your applications end to end for transient errors fault resiliency on automatic failovers in case of planned or unplanned events before deploying to production. In addition to testing how failover impacts existing database sessions, it can also help verify if it changes the end-to-end performance due to changes in the network latency. In some cases, if performance issues are encountered on SQL MI, manually invoking a failover to a new node can help mitigate the performance issue.

 

Because the restart operation is intrusive and a large number of them could stress the platform, only one user-initiated manual failover call is allowed every 15 minutes for each managed instance (this was reduced from the original 30 minutes in October 2020).

 

Ensuring that your applications are failover resilient prior to deploying to production will help mitigate the risk of application faults in production and will contribute to application availability for your customers.

 

How is High Availability (HA) implemented on a Managed Instance?

 

Azure SQL Managed Instance (MI) is offered in two service tiers, one is Business Critical (BC) and the other one is GP (General Purpose). Both service tiers offer High Availability (HA), with different technical implementations, as follows:

 

  • HA for SQL Managed Instance BC (Business Critical) service tier was built based on AlwaysOn Availability Groups (AG) technology, resulting in such MI consisting of the total of 4 nodes - one primary and three secondary R/O replicas. In case of a failover, one of the secondary replicas becomes primary. This type failover typically takes only a few short seconds.
business-critical-service-tier.png
  • HA for SQL Managed Instance GP (General Purpose) service tiers was based on multiple redundancy of the storage layer and it is based on a single primary node. In case of a failover, a new node is taken from the pool of standby nodes, and the storage is re-attached from the old to the new primary node. This type of failover typically takes under a minute.
general-purpose-service-tier.png

 

Using the user-initiated manual failover functionality, manually initiating a failover on MI BC service tier will result in a failover of the primary node to one of the three secondary nodes. As secondary read-only nodes on the MI BC service tier can be used for read scale-out from a single node (out of three read-only secondary nodes), the user initiated manual failover capability allows also a failover of read-only replica. This means that users can manually failover the read scale-out from the current to one of the two other available read-only secondary nodes.

 

Manually initiating a failover on MI GP service tier will result in deallocation of the primary node, and allocation of a new node from the pool of available nodes, and reattachment of the storage from the old to the new node.

 

How to initiate a manual failover on SQL Managed Instance?

 

RBAC permissions required

 

User initiating a failover will need to have one of the following RBAC roles:

 

  • Subscription Owner role, or
  • Managed Instance Contributor role, or
  • Custom role with the following permission:
    • Microsoft.Sql/managedInstances/failover/action

 

Using PowerShell

 

The minimum version of Az.Sql needs to be v2.9.0 (download link), or use Azure Cloud Shell from the Azure portal that always has the latest PowerShell version available.

 

If you have several Azure subscriptions, first ensure that you select the appropriate subscription where your target SQL MI is located:

PowerShell

Select-AzureRmSubscription <SubscriptionID>

 

Use PS command Invoke-AzSqlInstanceFailover with the following example to initiate failover of the primary node, applicable to both BC and GP service tier:

PowerShell

Invoke-AzSqlInstanceFailover -ResourceGroupName "ResourceGroup01" -Name "ManagedInstance01"

 

Use the following PS command to failover read secondary node, applicable to BC service tier only:

PowerShell

Invoke-AzSqlInstanceFailover -ResourceGroupName "ResourceGroup01" -Name "ManagedInstance01" -ReadableSecondary

 

Using CLI

 

Ensure to have the latest CLI scripts installed.

 

Use az sql mi failover CLI command with the following example to initiate failover of the primary node, applicable to both BC and GP service tier:

CLI

az sql mi failover -g myresourcegroup -n myinstancename

 

Use the following CLI command to failover read secondary node, applicable to BC service tier only:

CLI

az sql mi failover -g myresourcegroup -n myinstancename --replica-type ReadableSecondary

 

Using Rest API

 

For advanced users who would perhaps like to automate failovers of their SQL Managed Instances for purposes of implementing continuous testing pipeline, or automated performance mitigators, this can be accomplished through initiating failover through an API call, see Managed Instances - Failover REST API for details.

 

To initiate failover using REST API call, first generate the Authentication Token. One way to do that is to use a Postman client. Initiating the API call from any other client should generally work as well. This token is used as Authorization property in the header of API request and it is mandatory.

 

The following is an example of the API URI to call:

API URI
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/provi...

 

The following are API call properties that can be passed in the call:

API property

Parameter

subscriptionId

Subscription ID to which managed instance is deployed

resourceGroupName

Resource group that contains managed instance

managedInstanceName

Name of managed instance

replicaType

(Optional) (Primary|ReadableSecondary)

 

This is the type of replica to be failed over: primary or readable secondary.
If not specified, failover will be initiated on the primary replica by default.

api-version

Static value and currently needs to be “2019-06-01-preview"

 

API response will be one of the following two:

  • 202 Accepted
  • One of the 400 request errors.

 

Track the operation status

 

 

Note: Completion of the failover process (not the actual short unavailability) might take several minutes at a time in case of high-intensity workloads. This is because the instance engine is taking care of all current transactions on the primary and catch up on the secondary, prior to being able to failover.

 

Monitoring the failover

 

SQL MI Business Critical

 

To monitor the progress of user initiated manual failover for Business Critical service tier, execute the following T-SQL query in your favorite client (such is SSMS) on SQL Managed Instance. It will read the system view sys.dm_hadr_fabric_replica_states and report replicas available on the instance. Refresh the same query after initiating the manual failover.

T-SQL

SELECT DISTINCT replication_endpoint_url, fabric_replica_role_desc FROM sys.dm_hadr_fabric_replica_states

 

Receiving success confirmation from a PowerShell command, or from the API response indicates a successfully completed failover operation. Therefore, monitoring of the failover process is not required. It is however shown in this article for illustration purposes only. Please note that examples in this article do not include monitoring failover of secondary for SQL MI BC SKU.

 

T-SQL output prior to initiating the failover will indicate the current primary replica on the MI BC containing one primary and three secondaries in the AlwaysOn Availability Group.

 

Replica 001.PNG

 

  • In this example we see the primary and three secondary replicas on MI BC node, each assigned to an internal IP address starting with 10.0.0.X. The primary node in this example has been allocated to the internal IP address 10.0.0.16.
  • Upon execution of a failover, running this query again would need to indicate a change of the primary node.

Replica 002.PNG

 

  • In this particular example, we can see that before the failover, the primary node IP was 10.0.0.16, whereas after the failover this node became secondary, and the new primary node became node with the IP 10.0.0.22.

 

Monitor failover of secondary replica for BC (Business Critical) Service tier

 

Monitoring failover of secondary replica for MI BC is not available through DMVs. Receiving a response of success from PowerShell or API is sufficient confirmation that failover has been successful.

 

Monitor failover for GP (General Purpose) Service tier

 

As MI GP service tier is a single node system replaced with another node on the failover, you will not be able to see the role change using the above DMV example for MI BC service tier. Your T-SQL query output for MI GP service tier will always show a single node before and after the failover, something as the following:

gp 01.PNG

However, there is an alternative way to monitor the GP instance failover by looking at the last start time of the SQL engine before and after the failover. Use the following T-SQL command before and after the failover to see the SQL engine start time change:

T-SQL

SELECT DISTINCT sqlserver_start_time, sqlserver_start_time_ms_ticks FROM sys.dm_os_sys_info

The output will be the timestamp when the SQL engine was started which should be different before and after the failover. For example, your output before the failover might show something as follows:

before.PNG

 

After the failover, the timestamp will be further in time than the previous reading. This indicates a new start time of the SQL engine, therefore indicating that failover has occurred. 

after.png

Also consider the following to note the failover is in progress:

  • Upon executing the failover of MI GP, if you refresh any T-SQL query there will be no availability (loss of connectivity) from your client until the node failover has been executed (typically under a minute). The IP address might, or might not change, depending of the automated system has decided to failover to a new node. The loss of connectivity to MI GP during the failover will be the indication of the failover execution. 

Functional limitations

 

  • Throttle mechanism is implemented to guard from potentially too many failovers. As such, you can initiate one failover on the same MI every 15 minutes. If this is the case, there will be an error message shown when attempting to initiate a manual failover within this protected time frame.

Image 03.PNG

 

  • For BC instances there must exist quorum of replicas for the failover request to be accepted. This means that failover can be executed only in the case all replicas are healthy. If this is not the case, and if one of replicas is unhealthy or being rebuilt, you will not be able to manually initiate a failover at such time.
  • It is not possible to specify which readable secondary replica to initiate the failover on. The system will determine this automatically. is because MI allows automatically for only a single read-only replica to be available to customers.
  • If there was a database recently created on SQL MI, user-initiated failover will not be possible until automated systems perform the first full backup of a new database. This is to protect the automated backup process from the failover and ensure backup integrity. Depending on the database size that needs to be backed up for the first time, the time to wait until you are able to initiate a failover might vary.
  • Failover will not be allowed if there exists a database restore in progress. In addition, databases that were just restored will need to be backed up for the first time. This means that failover is now allowed for the duration of time it takes to restore a database, plus additional time for the automated systems to make the full first backup of the restored database.

 

Disclaimer

 

Please note that products and options presented in this article are subject to change. This article reflects the user-initiated manual failover option available for Azure SQL Managed Instance in February, 2021.

 

Closing remarks

 

If you find this article useful, please like it on this page and share through social media.

 

To share this article, you can use the Share button below, or this short link: https://aka.ms/mifailover-techblog.

Co-Authors
Version history
Last update:
‎Jul 22 2022 02:35 AM
Updated by: