Azure High Performance Computing (HPC) Blog

Azure Managed Lustre: not your grandparents' parallel file system

Microsoft

Aug 03, 2023

Lustre has long been the gold standard for extreme performance and scalability amongst parallel file systems used in HPC. The open-source community has continually improved Lustre features and performance to keep pace with the ever-growing demands of HPC, and despite being over twenty years old, Lustre is still powering many of the fastest supercomputers in the world with file systems that can store hundreds of petabytes and deliver terabytes per second of performance. It’s little wonder that when most people hear Lustre, they envision petabytes of data stored across dozens of racks and thousands of hard drives.

Taking a step back though, is a single, massive file system really the ideal solution for most of today’s HPC workloads? It’s fast and convenient, but putting hundreds of users on a single file system results in the noisy neighbor problem; performance can be unpredictable as different workloads collide. Managing a single huge file system also increases the odds of corner cases—like network disruptions or failovers—happening and generating cryptic errors that are hard to diagnose. In an ideal world, each project—or maybe each user—would have their own, private parallel file system that doesn’t share performance with others, and administration of these private file systems would be handled automatically.

This might sound like a dream in the traditional on-premises HPC world, but this is exactly what we’ve done with Azure Managed Lustre File System (AMLFS).

Azure Managed Lustre in a nutshell

With a few clicks of a web interface or an Azure Resource Manager template, AMLFS lets you provision an all-flash Lustre file system in minutes. You get the IP address of your Lustre Management Service (MGS) back, and you just mount Lustre on your compute nodes as you would with any other Lustre file system. Your compute nodes use it like any other file system, and since it is Lustre, the performance and scalability benefits of Lustre are immediately available.

What’s different is that this Lustre file system is all yours. If someone else in Azure is running a job that creates a million files, you won’t ever know it because your Lustre servers and SSDs are exclusively yours. Because there’s no difference in price if you provision two 64 TiB file systems or a single 128 TiB file system, your apps with challenging I/O patterns can get their own file systems to separate them from the rest with no extra cost. And because Azure manages the Lustre servers on your behalf, you manage the Lustre file system as a single entity which makes managing multiple AMLFS instances straightforward.

Because you have your own Lustre file system, you can also shut it off when you no longer need the performance. AMLFS natively integrates with Azure Blob through Lustre’s Hierarchical Storage Management (HSM) capabilities, meaning data in your blob storage account is transparently hydrated into your Lustre file system as you read it. When you no longer need the parallel performance of Lustre, a simple click or terminal command will sync your files back to objects. You don’t have to manually copy files to or from AMLFS; you don’t even need to have any compute nodes running to perform this data tiering. All the data migration is handled within the service itself, so there’s no babysitting file transfers as you spin up and down AMLFS instances.

So, while Azure Managed Lustre is still Lustre, thinking about it only as a tool to deploy a single monolithic file system leaves a lot of value on the table. Instead, think about it as a way to accelerate I/O performance for specific users, applications, or projects on-demand.

Rethinking how Lustre can address I/O challenges in Azure

It’s common to hear people describe file systems in terms of how many petabytes they can store and how much bandwidth they can deliver: for example, a 4 PiB file system capable of 100 GB/s. It’s tempting to apply this thinking to Azure Managed Lustre and calculate how much it would cost to provision a 4 PiB Lustre file system that runs 24/7 for three years as you would on-prem.

Don’t do this!

You’d be taking a solution (a single 4 PiB file system) and fitting your problem (storing 4 PiB of performance-sensitive data) to it. Instead, it’s more helpful to think about the problem and optimize the solution for it.

Let’s take the above example of a 4 PiB file system that delivers 100 GB/s, since it resembles what we often encounter when a customer comes to us with HPC requirements. When we dig into what they really want to achieve with this 4 PiB file system though, we often find that the real problem is that, for example, there are eight different projects that will each generate up to 500 TiB, and each of them needs up to 100 GB/s to support its largest parallel jobs.

In this case, it’s actually better to give each project its own 500 TiB file system so that their jobs aren’t impacting each other’s performance, and by using the 250 MB/s/TiB AMLFS service, each 500 TiB file system would still be able to deliver 100 GB/s to applications. Giving each project its own file system is perfectly reasonable with AMLFS as well because Azure manages the deployment, operations, and maintenance of the Lustre servers that make up an instance for you.
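The sizing arithmetic here is simple enough to sketch in a few lines of Python. The function name and the decimal convention (1 GB/s = 1000 MB/s) are our own assumptions for illustration; the capacity and per-TiB throughput are the ones from the example above.

```python
def aggregate_gb_per_s(capacity_tib: float, mb_per_s_per_tib: float) -> float:
    """Aggregate bandwidth (GB/s) of a file system whose throughput scales with capacity.

    Assumes decimal units for bandwidth: 1 GB/s = 1000 MB/s.
    """
    return capacity_tib * mb_per_s_per_tib / 1000.0


# One 500 TiB file system per project at 250 MB/s/TiB
per_project = aggregate_gb_per_s(500, 250)
print(f"{per_project:.0f} GB/s per project")  # prints "125 GB/s per project"
```

Each project's file system therefore has some headroom above the 100 GB/s that its largest jobs need.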

The next thing to consider is how often each project is actively processing all their data. While some workloads do run 24/7/365, it’s more common to have alternating periods of high-intensity computing followed by relative quiet. For example,

A new geophysical dataset might come in and trigger a few days of intense parallel processing followed by a week of interactive visualization, team discussion, and planning next steps.
Highly parallel modeling may occur during the work week when financial markets are open, but weekends are relatively quiet.
A high-priority project comes up that pulls engineers off one project to help with another, causing that original project to go quiet.

In these cases where a project is in a quiet period, you don’t really need a 100 GB/s parallel file system. With AMLFS, you can use Lustre HSM to dehydrate the file system down to blob storage, then completely deprovision the file system with just a few commands or clicks. When the project is ready to pick back up, it’s just a few more commands or clicks to recreate the file system and rehydrate it from blob. What’s more, if you don’t need the full performance of Lustre but need to check on a few files during a quiet period, you can use Blob NFS or BlobFuse to mount your blob container directly on your compute nodes without needing to provision a whole new AMLFS file system. Your files and data are still accessible as if they were being read through AMLFS, but the scalability (and cost) would be lower.

This flexibility of AMLFS allows you to do things with Lustre that are impossible with traditional monolithic file systems. For example, you can

Dial up or down the bandwidth of your file system as needs change. There’s nothing stopping you from dehydrating a 125 MB/s/TiB AMLFS instance and rehydrating it as a 250 MB/s/TiB file system, and there’s no file copying or migration needed since HSM takes care of that for you.
Expand or contract the size of your AMLFS instance. You can also dehydrate a 500 TiB file system and rehydrate it as a 100 TiB file system. All your files still appear in the smaller file system, but you’ll only be able to pull 100 TiB of data in at once. Again, no file copies are needed since AMLFS always shows all your files in the Lustre mount even if their contents remain only in blob.
Combine the rich features of Azure Blob into HPC workflows. Once data has been written and tiered to blob, you can add blob index tags to manage and find this data later, apply data lifecycle management policies to automatically move data to colder access tiers as it ages, or automatically trigger a serverless data processing pipeline using Azure Blob storage bindings for Azure Functions.

In a traditional Lustre file system, dynamically scaling performance or capacity is a manual, fault-prone process, and integrating Lustre with rich metadata processing capabilities and data processing automation requires building complex and custom software around the file system. With AMLFS, these capabilities are all available through programmable APIs and automation.

The Azure Managed Lustre File System experience in practice

This all sounds good on paper, but it’s fair to ask how much extra complexity is involved with using such a flexible service. Let’s walk through a concrete example of what using AMLFS might look like for a project that will need up to 100 TiB of storage and 10 GB/s of bandwidth. Conceptually, we can think about the storage architecture as looking something like this:

Now let's walk through how your workflow could use this.

Step 1. Data ingest

If the project involves processing data from an external source (DNA sequencers or telescopes, for example), the first step is to copy that into an Azure Blob storage container dedicated to that project using tools like sftp or azcopy. Let’s say there’s 10 TiB of raw input data; with azcopy and plenty of network bandwidth, this might take half an hour.[1]

Step 2. Prepare the high-performance project space

Once the initial project data is in the project’s blob container, we would then create an AMLFS instance.

Let’s say we expect this project to last a month; the simplest approach is to create one AMLFS instance dedicated to this project that will stay running for the whole month. We’d provision a 100 TiB AMLFS instance so that we can store all the project’s inputs, outputs, and scratch data at its biggest point. Knowing we want at least 10 GB/s, we need at least 10 GB/s per 100 TiB, or 100 MB/s/TiB. The lowest-cost AMLFS option that meets this requirement is the 125 MB/s/TiB offering.
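As a rough sketch, the tier selection above can be written as a small Python helper. The tier list is an assumption for illustration (only the 125 and 250 MB/s/TiB offerings are mentioned in this post, so check what is currently available), as is the premise that a lower throughput tier is always the lower-cost one.

```python
# Assumption: throughput tiers in MB/s per TiB; only these two appear in this post.
TIERS_MB_PER_S_PER_TIB = (125, 250)


def required_mb_per_s_per_tib(target_gb_per_s: float, capacity_tib: float) -> float:
    """Per-TiB throughput needed to reach a target bandwidth (1 GB/s = 1000 MB/s)."""
    return target_gb_per_s * 1000.0 / capacity_tib


def lowest_tier(target_gb_per_s: float, capacity_tib: float,
                tiers=TIERS_MB_PER_S_PER_TIB) -> int:
    """Return the smallest tier that meets the target, assuming smaller means cheaper."""
    need = required_mb_per_s_per_tib(target_gb_per_s, capacity_tib)
    for tier in sorted(tiers):
        if tier >= need:
            return tier
    raise ValueError(f"no tier provides {need:.0f} MB/s/TiB; consider more capacity")


print(required_mb_per_s_per_tib(10, 100))  # 100.0
print(lowest_tier(10, 100))                # 125
```

If no tier is fast enough at the requested capacity, the helper raises an error rather than guessing; in practice, the alternative would be to provision more capacity than the data strictly needs.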

The AMLFS instance can be created using either the Azure portal or an ARM template, and the blob container storing our input data will be specified during provisioning. It can take up to an hour for the AMLFS instance to deploy, and if the input data is spread over millions of files, it could take longer since AMLFS needs to read all the filenames from our blob container to populate the Lustre namespace during this provisioning.

Step 3. Run your HPC workloads!

Once AMLFS has been provisioned, we can mount it using either the Lustre client included with the Azure HPC images or prebuilt Lustre clients. The contents of the attached blob container will automatically be visible from the Lustre mount, and data will be copied into Lustre the first time each file is opened and read. By only loading data into AMLFS when it is accessed, we can connect a relatively small AMLFS instance to an enormous blob container if only a small portion of the container’s contents will be “hot” and require high bandwidth access.
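For illustration, here is a minimal Python sketch that assembles the mount command from the MGS IP address. The IP address and mount point are hypothetical placeholders, and the file system name (lustrefs) and mount options (noatime,flock) are assumptions; use the values shown for your own instance.

```python
import shlex


def lustre_mount_command(mgs_ip, mount_point, fsname="lustrefs",
                         options=("noatime", "flock")):
    """Build (but do not run) a Lustre client mount command as an argument list.

    fsname and options are assumed defaults, not values taken from this post.
    """
    source = f"{mgs_ip}@tcp:/{fsname}"  # standard Lustre "<MGS NID>:/<fsname>" form
    return ["mount", "-t", "lustre", "-o", ",".join(options), source, mount_point]


# Hypothetical MGS address and mount point
cmd = lustre_mount_command("10.0.0.4", "/mnt/amlfs")
print(shlex.join(cmd))
# prints: mount -t lustre -o noatime,flock 10.0.0.4@tcp:/lustrefs /mnt/amlfs
```

The command itself needs root privileges and an installed Lustre client, so the sketch only prints it rather than executing it.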

New files can be created and written to Lustre by parallel applications throughout the project, and these changes will not be automatically copied back to blob. This keeps temporary scratch data from being unnecessarily duplicated to blob, but we can use either Lustre HSM (lfs hsm_archive) or an AMLFS archive job to synchronize files back to blob when we know a file or dataset should be kept long-term.

In a sense, AMLFS provides the convenience of a read-through cache for blob storage containers, and Lustre HSM commands provide fine-grained control over this “cache.” We can hydrate files’ contents without reading them from a client using lfs hsm_restore, and we can write back modified data using lfs hsm_archive. Unlike a cache though, data in AMLFS will never be “evicted” automatically, so you’ll always get consistent performance when accessing data that’s been loaded into AMLFS.
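A workflow script could wrap these two HSM commands along the following lines. This is only a sketch: the helper names and file paths are hypothetical, and actually running the commands requires a client with AMLFS mounted, the lfs utility on the PATH, and possibly elevated privileges.

```python
import subprocess

# The two lfs subcommands named above
HSM_SUBCOMMANDS = {"restore": "hsm_restore", "archive": "hsm_archive"}


def hsm_command(action, paths):
    """Build the lfs command line for an HSM action on one or more files."""
    if action not in HSM_SUBCOMMANDS:
        raise ValueError(f"unsupported HSM action: {action!r}")
    paths = list(paths)
    if not paths:
        raise ValueError("at least one file path is required")
    return ["lfs", HSM_SUBCOMMANDS[action], *paths]


def run_hsm(action, paths):
    """Run the command on a Lustre client; raises CalledProcessError on failure."""
    return subprocess.run(hsm_command(action, paths), check=True,
                          capture_output=True, text=True).stdout


# Hypothetical paths: pre-load an input file, then write back a result
print(hsm_command("restore", ["/mnt/amlfs/input/sample.dat"]))
print(hsm_command("archive", ["/mnt/amlfs/results/output.h5"]))
```

Keeping command construction separate from execution makes it easy to log or dry-run exactly what would be sent to Lustre before touching any data.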

Step 4. Decommission the high-performance project space

Once a project has completed and its data no longer requires high-performance access, we can deprovision its AMLFS instance. First, the data on AMLFS that we want to keep must be archived back to blob using an AMLFS archive job or Lustre HSM (lfs hsm_archive). This will transparently push data from Lustre back to blob while AMLFS stays online and the data remains fully accessible.

AMLFS handles this data movement for us though, so we can also unmount Lustre from all our clients and deprovision those compute node VMs while the archive job is running. This data movement is optimized by AMLFS, so archiving 10 TiB of modified data back to blob could complete in less than an hour.[2]

The status of an archive job can be checked via the Azure portal or using Lustre HSM (lfs hsm_state), and once the archive job has completed, it’s safe to deprovision (delete!) the AMLFS instance.

Long-term data management

Once our modified data is back in blob, it can be accessed and managed just like any other blob.

All data on blob storage can be tiered using data lifecycle management policies so that, for example, data that hasn’t been modified in one month is tiered from hot to cool storage to reduce costs without impacting retrieval time. If after six months the data is not modified, a different policy could tier it to an archival tier where its cost is lowest, but it may take hours to retrieve.
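To make this concrete, here is a sketch of what such a policy might look like, built as a Python dictionary and printed as JSON. The rule name, the prefix, and the translation of "one month" and "six months" into 30 and 180 days are our assumptions; the field names follow the Azure Storage lifecycle management policy format as we understand it and should be checked against the current documentation before use.

```python
import json


def lifecycle_policy(prefix, cool_after_days=30, archive_after_days=180):
    """Build a lifecycle policy that tiers block blobs by days since last modification."""
    return {
        "rules": [{
            "enabled": True,
            "name": "tier-project-data",  # hypothetical rule name
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": [prefix],  # container name, optionally plus a path prefix
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                        "tierToArchive": {"daysAfterModificationGreaterThan": archive_after_days},
                    }
                },
            },
        }]
    }


# Hypothetical container name for the project
policy = lifecycle_policy("project-container/")
print(json.dumps(policy, indent=2))
```

The resulting JSON is what would be applied to the storage account through the portal or other management tooling.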

Regardless of whether objects are in hot, cool, or archive tiers, their metadata will always be available in our blob container. The names and locations of all the objects in our project’s blob container will always look the same to users and applications as their contents are tiered up and down—files never have to be copied or moved to different directories to get the cost benefits of storing data in cooler tiers. This largely applies to file-based access too; we can mount our project’s blob container using BlobFuse, Blob NFS, or a new AMLFS instance, and all our files and directories will be there regardless of what blob tier they’re in.[3]

Reading data from objects/files in different tiers is also straightforward. Data in hot and cool tiers can be read instantaneously via BlobFuse, Blob NFS, or AMLFS, but reading data from cool objects has a higher cost per I/O. To access data from an object that has been archived, it must first be “rehydrated” in a separate step, using the Azure portal or a command-line tool, before it can be read.

Fine tuning how AMLFS is integrated into workflows

We gave each project its own AMLFS instance and chose to keep AMLFS running for the entire lifetime of the project in this example, but that’s not the only way to integrate AMLFS into the way we work.

AMLFS provides the flexibility to create, hydrate, dehydrate, and destroy AMLFS instances on a per-user, per-project, or even per-job basis. Instances with different performance levels and capacities can also be spun up and down as jobs’ needs evolve over time. However, it is possible to go overboard! If a job only needs high-performance access to data for a few minutes, using AMLFS may be overkill since the provisioning process itself will take a few minutes.

Keep the following in mind when deciding how long to keep AMLFS instances running:

AMLFS itself takes a few minutes to provision and configure before it’s ready for you to use.
Using AMLFS with a container containing millions of files will take longer to provision since AMLFS has to read all objects’ metadata before it’s ready to use.
Reading data from blob into AMLFS (hydrating data) takes time regardless of whether that data is pre-loaded using HSM or being loaded the first time it is read. This data loading may be limited by the egress limit of the blob storage account (120 Gbps by default), so repeatedly creating and destroying 500 TiB AMLFS instances may cause a lot of time to be spent loading data into AMLFS.
Repeatedly hydrating and dehydrating data with AMLFS may incur transaction costs on the blob storage account. Be especially mindful when using AMLFS with containers that contain objects in cool or cold tiers. Accessing data already loaded into AMLFS has no transaction costs, but HSM operations that perform a lot of I/O operations to blob may.

There are no hard and fast rules about how long you should plan to keep AMLFS instances online; it depends on how long that data will require high-performance access and how long it will take to create a new AMLFS instance and load data from blob into it.
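A back-of-the-envelope estimate can help with this decision. The sketch below assumes the blob storage account's egress limit is the only bottleneck, which makes the result a best case; the function name is ours, and the 120 Gbps figure is the default limit mentioned above.

```python
TIB = 2**40  # bytes per TiB


def hydrate_hours(data_tib: float, egress_gbit_per_s: float) -> float:
    """Best-case hours to load data from blob, limited only by account egress."""
    bytes_per_second = egress_gbit_per_s * 1e9 / 8  # gigabits/s -> bytes/s
    return data_tib * TIB / bytes_per_second / 3600


# Fully hydrating a 500 TiB instance at the default 120 Gbps egress limit
print(f"{hydrate_hours(500, 120):.1f} hours")  # prints "10.2 hours"
```

At roughly ten hours per full hydration, recreating a 500 TiB instance for every short job would clearly spend more time loading data than computing on it, whereas hydrating only the hot subset of a container changes the picture considerably.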

[1] This rate would be limited by the 10 GB/s of our AMLFS instance in this example.

[2] Assuming a blob storage account with a default ingress limit of 7.5 GB/s.

[3] Objects in the archive tier are different; in certain circumstances, you will have to rehydrate them from archive into a warmer tier before they will appear in the file namespace.

Updated Aug 03, 2023

Version 2.0

hpc

storage

glennklockwood
