Automated NFS volume failover using automounter with Azure NetApp Files
Published Jun 19 2023 10:51 AM

Table of Contents

Abstract

Introduction

About the test environment

About automounter

Azure NetApp Files replication

Configure Azure NetApp Files replication

Setting up automounter

The autofs configuration

NFS mount failovers using autofs

Providing write access after an NFS mount failover to a replicated destination

What happens when the source volume is available again?

Resyncing the relationship

Considerations for failover automation

Scripting out replication management tasks

Manually reverse resync using the Azure NetApp Files commands from cloud shell

Conclusion

Additional Information

 

Abstract

Looking for a way to automate NFS mount failovers when a network or availability zone becomes unavailable? Automounter can help. This article covers how to configure autofs for automatic mount failovers in Azure NetApp Files.

 

Co-authors: Justin Parisi, Azure NetApp Files Technical Marketing Engineer

 

Introduction

When you are using NFS in the cloud for your business-critical applications, you can ill afford outages. But sometimes, things happen. Networks can fail, zones can go down, and then your application is left in a nebulous state waiting for the NFS mount to respond to read or write operations.

 

With Azure NetApp Files, volumes can be replicated across availability zones within a region to provide redundant copies of important datasets, but failover is a manual process with several steps that can take valuable time away from your application. The source and destination are separate volumes, even when they sit in different availability zones of the same region, and their names must be different. They will also likely have different IP addresses in the Virtual Network (VNet), so in the event of a failover, a remount is needed to gain access to the replicated volume. If there are many clients, remounting them all can be a challenging task.

 

But what if, with a few simple configuration changes, you could automate the failover process so that no admin interaction is required?

 

GeertVanTeylingen_0-1686917835256.png

With automounter, it is possible to provide faster failovers to volume replicas in the event of an outage in your source volume’s availability zone. Let's dig in.

 

About the test environment

The following environmental configuration was used for this example.

 

  • Ubuntu Linux 20.04.6 (IP address – 10.1.12.28)
  • Autofs version 5.1.6
  • Azure NetApp Files NFSv3 source volume in availability zone AZ01
    (Mount path: 10.1.13.5:/crr-site1)
  • Azure NetApp Files NFSv3 destination volume in availability zone AZ02
    (Mount path: 10.2.13.5:/crr-dest1)

 

About automounter

Automounter is a package you can install on nearly any Linux distribution that does exactly what it says it does – provides automatic mount functionality. This includes NFS mounts. A service (autofs) runs in the background of your client and references a set of configurable map files to mount external storage when you navigate to one of the specified paths in the files.

 

For instance, if you want to configure home directories to automount when your end users navigate to ~, automounter can do that.

 

Automounter also provides advanced functionality, such as load balancing and failovers. Simply configure your automount files with multiple entries and autofs will try each one in order. Timeouts can be configured to control how fast a retry to the new location occurs and how long an NFS mount will stay mounted when no activity is detected.

 

By listing both the source and destination Azure NetApp Files volumes in the automounter map file, you allow autofs to switch from the source volume to the destination volume without any administrator interaction.

 

Azure NetApp Files replication

Azure NetApp Files provides data protection through volume replication between availability zones. You can asynchronously replicate data from an Azure NetApp Files volume (source) in one zone to another Azure NetApp Files volume (destination) in another zone using cross-zone replication. This capability enables you to fail over your critical application if a zonal outage or disaster happens.

 

By default, a replicated destination volume will be read-only until you break the relationship. Once you break the relationship and write new data to the destination volume, you can replicate the changes back to the source to ensure there is no data loss. However, this means you must decide if the failover site only needs read access or would need read-write. In some use cases (such as AI/ML data lakes for model training), read-only access may be sufficient. In other cases, you may also need to automate the relationship management tasks. We will cover that in this article, too.

 

Configure Azure NetApp Files replication

In Azure NetApp Files, you can easily configure replication between zones (or regions). Complete instructions to do that can be found here.

 

Basic steps are as follows:

 

  • Create or use an existing source volume
  • Locate the source volume resource ID
  • Create a destination volume for data replication in your desired availability zone
  • Authorize replication from the source volume

 

Once that is done, Azure NetApp Files will create the relationship, initialize it and in a few minutes, you will be able to access the destination volume (read-only) from an NFS client.

 

To test access, simply access the volume in Azure NetApp Files and locate “Mount instructions” in the left-hand menu.

 

GeertVanTeylingen_1-1686918470755.png

 

Once you have mounted the destination volume using NFS on your client, try to read and write to the share. Expected behavior is that reads should be allowed (as determined by permissions and access controls) and writes should fail, as replication destinations are read-only - until the relationship is broken.

 

For example:

root@contoso-server:/misc/crr# mount | grep crr
10.2.13.5:/crr-dest1 on /misc/crr type nfs (rw,nosuid,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=60,retrans=2,sec=sys,mountaddr=10.2.13.5,mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=10.2.13.5)
root@contoso-server:/misc/crr# ls
file1  file2
root@contoso-server:/misc/crr# touch newfile
touch: cannot touch 'newfile': Read-only file system
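
The manual touch test above can be wrapped in a small helper when many mounts need checking. The following is a minimal sketch, not part of the original test setup: the function names and the probe file name are hypothetical, and it only reports whether a file can be created in the given directory.

```shell
# Return 0 if a file can be created in the given directory, 1 otherwise.
# A read-only replication destination is expected to return 1.
# (Hypothetical helper; the probe file name is an arbitrary choice.)
is_writable_mount() {
    dir=$1
    probe="$dir/.write-probe.$$"
    # The subshell keeps a failed redirection from aborting the caller.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

# Print "writable" or "not writable" for the given directory.
report_mount_access() {
    if is_writable_mount "$1"; then
        echo "writable"
    else
        echo "not writable"
    fi
}

# Example (on the client used in this article):
#   report_mount_access /misc/crr
```

Against a replication destination that has not been broken, the helper should report "not writable", matching the "Read-only file system" error shown above.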

 

Setting up automounter

The first step for setting up an automounter for the NFS client is to install (or upgrade) the autofs package. The steps will vary depending on the flavor of Linux in use, but the following shows examples for three of the major distributions.

 

RedHat

RHEL-client# sudo yum install -y autofs

 

Ubuntu

Ubuntu-client# sudo apt install -y autofs

 

SUSE Linux/SLES

SUSE-client# sudo zypper install -y autofs

 

(!) Note

 

If you already have autofs, you can replace the word “install” with “upgrade” to use the latest version of the package.

 

 

 Autofs leverages files located in /etc on your client to configure mount points. Map files are orchestrated by a master file called auto.master. This file defines the names and locations of map files for autofs to reference when starting up. Changes to this file would require a restart of the autofs service to be applied properly.

 

A map file defines a mount path and mount options, as well as the mountpoint path you want to define for your client. Map entries can use wildcards or absolute paths for the mounts. In this example, we will use auto.misc for the map entry to a source and destination volume in Azure NetApp Files.

 

The autofs configuration

 

In the auto.master for this test, there is this line:

/misc   /etc/auto.misc --timeout 10

 

This entry will point autofs to the file located at /etc/auto.misc for the mount information. Mounts performed by autofs will be done under the /misc directory path. There is a timeout of 10 seconds configured, which means mounts that are not accessed in 10 seconds will unmount. This timeout should be configured for a value that best fits your application environment. 10 seconds was used here to allow the failover to happen faster, but 10 seconds may prove to be too aggressive for your clients. For instance, what happens if there is only a brief outage in your availability zone? Do you really want/need failover to occur if the availability zone is back up after a few minutes?

 

(!) Note

 

On this client, the path /misc needed to be created.
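
If you script the client setup, the master map line should be added only once, no matter how often the script runs. The sketch below shows one way to do that; the ensure_line helper is hypothetical, and the commented usage assumes a systemd-based client and root privileges.

```shell
# Append a line to a file only if that exact line is not already present.
# (Hypothetical helper for illustration.)
ensure_line() {
    file=$1
    line=$2
    touch "$file"
    # -x matches whole lines only, -F treats the text literally.
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (run as root; assumes a systemd-based client):
#   mkdir -p /misc
#   ensure_line /etc/auto.master '/misc   /etc/auto.misc --timeout 10'
#   systemctl restart autofs
```

Running the helper a second time leaves the file unchanged, so the same script can be reapplied safely across many clients.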

 

 

The following entry is all that was needed for the auto.misc file.

crr -fstype=nfs,rw,nosuid,hard,tcp,timeo=60 10.1.13.5:/crr-site1 10.2.13.5:/crr-dest1

 

The above entry does the following:

 

  • Uses crr as a mount path for the NFS export
  • Mounts as “nfs” (which will negotiate to the highest supported NFS version in the Azure NetApp Files instance. In this case, NFSv3.)
  • Uses mount options of rw, nosuid, hard, tcp,timeo=60
  • Specifies the source volume export path and then the destination volume export path, separated by a space
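
To make the structure of the entry explicit, here is a small sketch that assembles a map entry from its parts: the key, the mount options, and the export locations in the order given. The build_map_entry function is a hypothetical helper, not an autofs tool.

```shell
# Print one autofs map entry: key, mount options (prefixed with "-"),
# then every location exactly as given, separated by single spaces.
# (Hypothetical helper for illustration.)
build_map_entry() {
    key=$1
    options=$2
    shift 2
    printf '%s -%s' "$key" "$options"
    for location in "$@"; do
        printf ' %s' "$location"
    done
    printf '\n'
}

# Example (reproduces the entry used in this article):
#   build_map_entry crr fstype=nfs,rw,nosuid,hard,tcp,timeo=60 \
#       10.1.13.5:/crr-site1 10.2.13.5:/crr-dest1
```

The example call prints the same line shown above for /etc/auto.misc, with the source volume listed before the destination.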

 

Once this is configured, autofs will mount the NFS export as soon as the client navigates to the mount path. In this case, it is /misc/crr.

 

Example:

root@contoso-server:/# mount | grep crr
root@contoso-server:/#
root@contoso-server:/# cd /misc/crr
root@contoso-server:/misc/crr# mount | grep crr
10.1.13.5:/crr-site1 on /misc/crr type nfs (rw,nosuid,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=60,retrans=2,sec=sys,mountaddr=10.1.13.5,mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=10.1.13.5)
root@contoso-server:/misc/crr# ls -la
total 8
drwxrwx--- 2 root root 4096 May 12 15:42 .
drwxr-xr-x 4 root root    0 May 17 18:36 ..
drwxrwxrwx 5 root root 4096 May 16 18:39 .snapshot
-rw-r--r-- 1 root root    0 May 12 15:42 file1
-rw-r--r-- 1 root root    0 May 12 15:42 file2

 

That is all great, but what if the source volume is unavailable? What happens then?

 

NFS mount failovers using autofs

The above shows what map entry was used in auto.misc. It included both source and destination volume export paths in Azure NetApp Files. So, if you used that example in your configuration, there is really nothing else to do for autofs. Automounter will attempt to access the source volume first (since it is first in the list) and after a 10-second timeout (configurable), it will move on to the next entry. No admin interaction (such as service restarts or client reboots) is needed. To simulate loss of access to the source volume for this test, a rule was created in the Network Security Group to block NFS ports (i.e., 111, 635, 2049) between the client and source Azure NetApp Files instance.

 

GeertVanTeylingen_2-1686918738654.png

 

Example:

root@contoso-server:/# mount | grep crr
root@contoso-server:/#
root@contoso-server:/# cd /misc/crr
root@contoso-server:/misc/crr# mount | grep crr
10.2.13.5:/crr-dest1 on /misc/crr type nfs (rw,nosuid,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=60,retrans=2,sec=sys,mountaddr=10.2.13.5,mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=10.2.13.5)
root@contoso-server:/misc/crr# ls -la
total 8
drwxrwx--- 2 root root 4096 May 12 15:42 .
drwxr-xr-x 4 root root    0 May 17 18:36 ..
drwxrwxrwx 5 root root 4096 May 16 18:39 .snapshot
-rw-r--r-- 1 root root    0 May 12 15:42 file1
-rw-r--r-- 1 root root    0 May 12 15:42 file2 
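
When checking many clients, it helps to extract the active export from the mount output rather than reading it by eye. The following is a minimal sketch under stated assumptions: both function names are hypothetical, the parsing assumes the Linux `mount` output format shown above, and the two export paths are the ones from this article's test environment.

```shell
# Read `mount` output on stdin and print the "server:/export" currently
# mounted on the given mount point (prints nothing if it is not mounted).
# (Hypothetical helper; assumes "<export> on <path> type nfs ..." lines.)
active_nfs_export() {
    mount_point=$1
    awk -v mp="$mount_point" \
        '$2 == "on" && $3 == mp && $5 ~ /^nfs/ { print $1 }'
}

# Print "source", "destination", or "unknown" for an export string.
# The two export paths are the ones from this article's test environment.
classify_export() {
    case $1 in
        10.1.13.5:/crr-site1) echo "source" ;;
        10.2.13.5:/crr-dest1) echo "destination" ;;
        *) echo "unknown" ;;
    esac
}

# Example:
#   classify_export "$(mount | active_nfs_export /misc/crr)"
```

After the failover shown above, the example call would report "destination"; before the outage it would report "source".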

Some considerations:

 

  • If an application was reading or writing to the share during the outage, there may be a need to address the application – autofs is not application-aware. Data loss may occur. In some cases, you may want the application to no longer have access to a mount point in an outage. Check with your application vendor for recommendations.

 

  • Replicated volumes will also have some lag between replicas, depending on how you configured the schedules for SLOs (Service Level Objectives). The minimum available replication schedule in Azure NetApp Files currently is 10-minute intervals, meaning the destination volume can be missing up to 20 minutes' worth of data relative to the source volume.

 

  • Support for autofs issues falls under the NFS client OS (Operating System) provider.

 

  • If write access is required for the destination volume, the replication relationship must be broken. See below on how to accomplish that.

 

Providing write access after an NFS mount failover to a replicated destination

As mentioned, destination volumes in a replication will not allow writes until broken. Breaking a replication relationship is quite simple in Azure NetApp Files, however.

 

Complete steps can be found here.

 

Basically, navigate to the destination volume, select “Replication” from the left menu and click “Break Peering” at the top of the page.

 

GeertVanTeylingen_3-1686918805667.png

 

Once that is done, the destination will be immediately read-writeable (provided you have permissions to write to the share).

root@contoso-server:/misc/crr# touch peer-broken
root@contoso-server:/misc/crr# ls -la
total 8
drwxrwx--- 2 root root 4096 May 17 19:07 .
drwxr-xr-x 4 root root    0 May 17 18:44 ..
drwxrwxrwx 5 root root 4096 May 17 18:39 .snapshot
-rw-r--r-- 1 root root    0 May 12 15:42 file1
-rw-r--r-- 1 root root    0 May 12 15:42 file2
-rw-r--r-- 1 root root    0 May 17 19:07 peer-broken
root@contoso-server:/misc/crr# mount | grep crr
10.2.13.5:/crr-dest1 on /misc/crr type nfs (rw,nosuid,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=60,retrans=2,sec=sys,mountaddr=10.2.13.5,mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=10.2.13.5)

 

What happens when the source volume is available again?

Eventually, the source volume will be available once again. When that happens, autofs will automatically mount the first entry in the map again when there is an unmount/remount/service restart – again, with no admin interaction.

 

This sounds good in theory, but if you have already started writing to the destination volume, then you might not want the client to fail back until you resync the replication, as the new data will not exist on the source volume.

 

For example, when the NSG (Network Security Group) rule is set to allow traffic, autofs remounts the source volume and the newly created file (named “peer-broken”) is not shown:

root@contoso-server:/misc/crr# ls -la
total 8
drwxrwx--- 2 root root 4096 May 17 19:07 .
drwxr-xr-x 4 root root    0 May 17 18:44 ..
drwxrwxrwx 5 root root 4096 May 17 18:39 .snapshot
-rw-r--r-- 1 root root    0 May 12 15:42 file1
-rw-r--r-- 1 root root    0 May 12 15:42 file2
root@contoso-server:/misc/crr# mount | grep crr
10.1.13.5:/crr-site1 on /misc/crr type nfs (rw,nosuid,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=60,retrans=2,sec=sys,mountaddr=10.1.13.5,mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=10.1.13.5)

 

This remount behavior can be controlled with timeout values, but if a client reboots or the service restarts, then the source volume will be remounted. As such, carefully consider whether automatic failover with autofs is truly beneficial to your application. In read-only scenarios, there is little to no danger in using autofs for automatic failovers, but when data is being written, care must be taken, and alternate workflows involving external automation with conditional actions should be considered instead.

 

Resyncing the relationship

If you broke the replication relationship and wrote data to the destination volume, you need to resync the relationship to ensure the source volume is identical to the destination.

 

Full explanation of that is here.

 

Like other tasks for replication in Azure NetApp Files, this one is simple, too. Just navigate to your source volume’s replication page and select “Reverse Resync.”

 

GeertVanTeylingen_4-1686918937511.png

 

Once the relationship is resynced, the mirror state will show “Mirrored” and the replication relationship will be re-established.

 

Now, the new file you wrote to the destination (in this example, peer-broken) will show up on the source.

 

Example:

root@contoso-server:/misc/crr1# ls -la
total 8
drwxrwx--- 2 root root 4096 May 17 19:07 .
drwxr-xr-x 4 root root    0 May 17 19:18 ..
drwxrwxrwx 5 root root 4096 May 17 20:30 .snapshot
-rw-r--r-- 1 root root    0 May 12 15:42 file1
-rw-r--r-- 1 root root    0 May 12 15:42 file2
-rw-r--r-- 1 root root    0 May 17 19:07 peer-broken
root@contoso-server:/misc/crr1# mount | grep crr
10.1.13.5:/crr-site1 on /misc/crr1 type nfs (rw,nosuid,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=60,retrans=2,sec=sys,mountaddr=10.1.13.5,mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=10.1.13.5)

 

Considerations for failover automation

While the automounter portion of NFS mount failover is straightforward, some steps still require intervention, such as making the failed-over destination volume writeable. These steps can be scripted, but think carefully before making your replicated volume writeable to prevent data loss, as automounter is not application-aware.

 

Things to keep in mind:

  • Is your dataset read-only or read-write?
  • If read-write, are files constantly being written or only written to after a specific application process is enacted (such as a scan or a read of a large dataset)?
  • What amount of time of loss of access to a file share can your application handle?
  • Does your application fail if a write fails, or does it provide a retry mechanism?
  • Are there periods of inactivity in your application? If so, for how long?
  • How much of a delta of changes in data can your application handle?

 

The concern here about automating the destination volume to become read-writeable is application awareness. When a cutover happens, the application does not know there has been a cutover. All it cares about is that there is access to the data – which there now is.

 

While a replicated volume can have as little as a 20-minute delta in data differences, there is still a difference. Will your application be able to handle that difference? What if the source volume in the availability zone is unable to replicate again after an outage?

 

What happens if a file is being actively written to in the middle of the outage? How will your application react to that file being suddenly gone? Can it restart the file write?

 

After availability zone access is restored, automounter will re-mount the source volume as soon as there is a remount. This can happen after a timeout period or client reboot. How will your application handle that scenario, where the data it just finished writing is no longer available because a replication resync has not been performed yet?

 

Read-only data sets will not have any issues here, as there will be no active writes or changes to consider. But if your application is writing data, you need to ensure automatic failovers will not cause more damage than losing access to the volumes altogether.

 

That all being said, if you do want to automate the replication portions of this process in Azure NetApp Files, here are some starting points.

 

Scripting out replication management tasks

You can use the Azure PowerShell Az module and Azure Cloud Shell to run commands in Azure NetApp Files to manage your replication relationships.

 

This article covered that you would need to break a relationship to make a volume read-writeable. To do that, you can use the AZ.NetAppFiles modules. Specifically, you would use Suspend-AzNetAppFilesReplication. This command is run against the source volume.

 

Example:

PS /> Suspend-AzNetAppFilesReplication -ResourceGroupName "contoso.rg" -AccountName "eastus2-account" -PoolName "eastUS2-pool" -VolumeName "crr-site1"
PS /> Get-AzNetAppFilesReplicationStatus -ResourceGroupName "contoso.rg" -AccountName "eastus2-account" -VolumeName "crr-site1" -PoolName "eastUS2-pool"

Healthy             : True
RelationshipStatus  : Idle
MirrorState         : Broken
TotalProgress       : 17504
ErrorMessage        :

 

Doing this will make the destination volume read-write. As mentioned, there are several considerations for doing this due to automounter’s lack of application awareness.
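
A script that breaks or resyncs a relationship usually needs to wait until the mirror state has actually changed before it continues. The sketch below polls for an expected state from a Linux shell. It is an illustration under assumptions rather than a tested workflow: wait_for_mirror_state and anf_mirror_state are hypothetical helper names, and anf_mirror_state assumes the Azure CLI command `az netappfiles volume replication status` with a `mirrorState` field in its output; verify both against your installed CLI version.

```shell
# Poll a command that prints the mirror state until it matches the
# expected value. Arguments: expected state, maximum attempts, seconds
# between attempts, then the command (and its arguments) to run.
# Returns 0 on a match, 1 if the attempts are exhausted.
wait_for_mirror_state() {
    expected=$1
    attempts=$2
    delay=$3
    shift 3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # A failing status command is treated as "no state yet".
        state=$("$@" 2>/dev/null) || state=""
        if [ "$state" = "$expected" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# ASSUMPTION: Azure CLI call that prints only the mirror state.
# Arguments: resource group, account, pool, volume.
anf_mirror_state() {
    az netappfiles volume replication status \
        --resource-group "$1" --account-name "$2" \
        --pool-name "$3" --name "$4" \
        --query mirrorState --output tsv
}

# Example (names taken from the PowerShell example in this article):
#   wait_for_mirror_state Broken 30 10 \
#       anf_mirror_state contoso.rg eastus2-account eastUS2-pool crr-site1
```

Passing the status command as arguments keeps the polling loop independent of the tooling, so the same loop can wrap a different status command if you prefer.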

 

Manually reverse resync using the Azure NetApp Files commands from cloud shell

Once the source volume’s access is restored, if you have broken the relationship, you will need to resync the relationship to capture any changes made on the destination volume. Resume-AzNetAppFilesReplication is what you would use for that. Running it on the source volume will make that volume the new destination volume to capture any changes made on the destination during the failover event.

 

PS /> Resume-AzNetAppFilesReplication -ResourceGroupName "contoso.rg" -AccountName "eastus2-account" -VolumeName "crr-site1" -PoolName "eastUS2-pool"
PS /> Get-AzNetAppFilesReplicationStatus -ResourceGroupName "contoso.rg" -AccountName "eastus2-account" -VolumeName "crr-site1" -PoolName "eastUS2-pool"

Healthy             : True
RelationshipStatus  : Idle
MirrorState         : Mirrored
TotalProgress       : 20856
ErrorMessage        :

 

In the image below, we can see that the original source (crr-site1) is now the destination volume after running resume and the original destination (crr-dest1) is now the source volume.

 

GeertVanTeylingen_5-1686919274221.png

 

Once you verify the new data from the destination volume has been replicated to the source volume, you can restore the original direction: run the suspend command again on the source volume, and then run resume on the original destination volume (in this case, crr-dest1).

 

PS /> Suspend-AzNetAppFilesReplication -ResourceGroupName "contoso.rg" -AccountName "eastus2-account" -VolumeName "crr-site1" -PoolName "eastUS2-pool"
PS /> Resume-AzNetAppFilesReplication -ResourceGroupName "contoso.rg" -AccountName "contoso-westus2" -PoolName "westUS2-pool" -VolumeName "crr-dest1"
PS /> Get-AzNetAppFilesReplicationStatus -ResourceGroupName "contoso.rg" -AccountName "eastus2-account" -VolumeName "crr-site1" -PoolName "eastUS2-pool"

Healthy             : True
RelationshipStatus  : Idle
MirrorState         : Mirrored
TotalProgress       : 3352
ErrorMessage        :

 

Now the original source (crr-site1) is the source again, and the original destination (crr-dest1) is now the destination.

 

GeertVanTeylingen_6-1686919335449.png

 

Any scripts you create using these steps should consider the application’s fault tolerance, failover logic and whether any manual intervention should be incorporated.

 

Conclusion

 

NFS automounter is a straightforward way not only to automate your NFS mounts, but also to add failover logic to your NFS clients. This solution is ideal for read-only workloads, such as AI/ML training, but can also be extended into read-write use cases with the proper guardrails.

 

Additional Information

  1. Use availability zones for high availability in Azure NetApp Files | Microsoft Learn
  2. Cross-zone replication of Azure NetApp Files volumes | Microsoft Learn
  3. Create volume replication for Azure NetApp Files | Microsoft Learn
  4. Manage disaster recovery using Azure NetApp Files cross-region replication | Microsoft Learn
  5. Automounter
Last update:
‎Jun 16 2023 05:52 AM