First published on MSDN on Sep 17, 2018
Written by Kanchan Mehrotra, Tony Wu, and Rakesh Patil from AzureCAT. Reviewed by Solliance. Edited by Nanette Ray.
This article is also available as an eBook:
Find Part 1 of this series here:
Lustre is currently the most widely used parallel virtual file system (PVFS) in high-performance computing (HPC) solutions. Lustre file systems can scale to tens of thousands of client nodes and tens of petabytes of storage. Our Lustre file system performed well on Azure for large file systems. The testing results in this article provide more details.
Lustre clusters contain four kinds of systems:
File system clients, which access the file system.
Object storage servers (OSSs), which provide file I/O service for one or more object storage targets (OSTs).
Metadata servers (MDSs), which manage the names and directories in the file system and store them on a metadata target (MDT).
Management servers (MGSs), which act as master nodes for the cluster and store its configuration information.
A single node can be used to serve as both an MDS and MGS as we have done in our testing. Figure 1 shows a standard Lustre file system architecture.
Figure 1. Lustre file system architecture
Lustre runs on a variety of Linux kernels from Linux distributions, including Red Hat Enterprise Linux (RHEL) and CentOS. For our performance evaluation, we installed Lustre 2.9.0 on an OpenLogic CentOS 7.3 image from the Azure Marketplace and configured the nodes as follows:
To scale the cluster, we added more OSSs, each with 10 additional disks of 1 TB. We ran performance tests for varying numbers of OSS nodes: 1, 2, 4, 8, 12…20. The test environment also used Open MPI (included in CentOS 7.2) as the Message Passing Interface (MPI).
TIP: This document walks you through a manual Lustre installation, so you can see how to scale specific components later if desired. However, you can also install and configure Lustre 2.9.0 using an Azure Resource Manager template available on GitHub. To test your setup, install the testing tools described later in this document. Azure Marketplace also includes Intel Cloud Edition for Lustre software.
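On each node, we first updated the operating system and installed the Linux Integration Services (LIS) drivers, which are delivered by the microsoft-hyper-v package: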
sudo yum update
sudo reboot
sudo yum install microsoft-hyper-v
NOTE: LIS packages are intended to improve network performance, but we performed our benchmark tests without them. We didn't see any performance improvement when the LIS packages were installed and sometimes even saw lower performance. When we tested, the NDIS single root I/O virtualization (SR-IOV) interface was not available, but it is expected to help when it becomes available.
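With the base system prepared, we added yum repository definitions for the Lustre server packages, the matching e2fsprogs packages, and the Lustre client packages. The entries below can be saved, for example, in a file such as /etc/yum.repos.d/LustrePack.repo (the file name is only a suggestion; any .repo file under /etc/yum.repos.d works):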
[lustreserver]
name=lustreserver
baseurl=https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el7/server/
enabled=1
gpgcheck=0
[e2fs]
name=e2fs
baseurl=https://downloads.hpdd.intel.com/public/e2fsprogs/latest/el7/
enabled=1
gpgcheck=0
[lustreclient]
name=lustreclient
baseurl=https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el7/client/
enabled=1
gpgcheck=0
yum repolist
yum install lustre-2.9.0-1.el7.x86_64
yum install e2fsprogs
#Configure LNET
echo "options lnet networks=tcp"> /etc/modprobe.d/lnet.conf
#Automatically start servers
chkconfig lnet --add
chkconfig lnet on
chkconfig lustre --add
chkconfig lustre on
reboot
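After the reboot, you can optionally confirm that LNET loaded with the TCP network configured by listing the node's network identifiers (each configured interface should report an address ending in @tcp):
lctl list_nids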
Whether you created a single MGS/MDS node as we did or separate nodes, the next step is to configure the node in the cluster. The following parameters are used:
fsname: Names the file system, for example, LustreFS
mgs: Indicates that the node is an MGS
mdt: Indicates that the node hosts an MDT (that is, it acts as the MDS)
backfstype: Identifies the backing file system
To configure the MGS/MDS node, execute the following:
mkfs.lustre --fsname=LustreFS --mgs --mdt --backfstype=ldiskfs --reformat /dev/sdc
mkdir /mnt/mgsmds
mount -t lustre /dev/sdc /mnt/mgsmds
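To verify that the MGS and MDT came up, you can optionally list the local Lustre devices:
lctl dl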
Before adding the OSS node to the Lustre cluster, we implemented RAID level 0 on all the additional disks attached to the machine (10 × 1 TB). RAID 0 provides no fault tolerance, but we selected the locally redundant storage replication option for Azure Storage, which provides three-way mirroring on disk, and we wanted to stress-test the file system. We first installed the mdadm package, which allowed us to configure and manage RAID devices in Linux.
To implement RAID 0 on the storage node:
yum clean all && yum update
yum install mdadm -y
ls -l /dev | grep sd
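Next, create the RAID 0 array itself; a minimal sketch, assuming the 10 data disks appear as /dev/sdc through /dev/sdl (adjust the device list to match the output of the previous command):
mdadm --create /dev/md0 --level=0 --raid-devices=10 /dev/sd[c-l]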
cat /proc/mdstat
RAID 0 is now configured on the OSS server. On the node, run the following commands to configure the object storage target (OST):
mkfs.lustre --fsname=LustreFS --backfstype=ldiskfs --reformat --ost --mgsnode=MGSMDS00 --index=0 /dev/md0
mkdir /mnt/oss
mount -t lustre /dev/md0 /mnt/oss
NOTE: Whenever you add another OSS node to the cluster, increment the index (for example, from 0 to 1).
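For example, the OST on a second OSS node would be formatted with the next index value:
mkfs.lustre --fsname=LustreFS --backfstype=ldiskfs --reformat --ost --mgsnode=MGSMDS00 --index=1 /dev/md0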
For the client node to work with the Lustre cluster, you must install the required RPM packages and configure the node. For kernel versions other than the one we used, see the Lustre Support Matrix.
yum install lustre-client-dkms-2.9.0-1.el7.noarch --skip-broken
mkdir <directory>
mount -t lustre <hostname>@tcp:/<filesystem> <directory>
Where:
<directory> is the directory to be mounted with the Lustre file system, such as /mnt/lustre.
<hostname> is the name of the host used for the management server. In our case, MDS and MGS are served by a single node named MGSMDS00.
<filesystem> is the name of the Lustre file system. We specified the name LustreFS earlier while creating the file system (see “Configure MGS/MDS node”).
For example:
mkdir /mnt/lustre
mount -t lustre MGSMDS00@tcp:/LustreFS /mnt/lustre
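Once the mount succeeds, you can optionally confirm that the client sees the MDT and every OST with the lfs utility:
lfs df -h /mnt/lustre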
To validate performance of the Lustre file system, we ran tests to measure maximum throughput and IOPs. We changed various IOR test parameters, including the transfer rate, block size, number of processes, and number of client nodes.
Our results were good but inconclusive because of the many factors that affected the performance of our system.
To measure read and write throughput, we used the following IOR syntax:
mpiexec --hosts=<client_hostname> -np <n_procs> /lustre/software/ior/bin/ior \
  -a MPIIO -v -B -F -w -t 32m -b 4G \
  -o /mnt/lustre/test.`date +"%Y-%m-%d_%H-%M-%S"`
Where:
-v indicates verbose mode
-B indicates bypassing the cache
-F indicates file per process
-w indicates write operation
-t indicates the transfer size (size of each transfer in bytes)
-b indicates block size
-o indicates output test file
<client_hostname> is the name of the host used as the client
<n_procs> is the number of processes
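For example, a single-client run with 16 processes (the hostname client0 and the process count are placeholders) would look like this:
mpiexec --hosts=client0 -np 16 /lustre/software/ior/bin/ior \
  -a MPIIO -v -B -F -w -t 32m -b 4G \
  -o /mnt/lustre/test.`date +"%Y-%m-%d_%H-%M-%S"`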
Table 1. Maximum throughput results for Lustre file system
Figure 2 presents the write throughput results, with the y-axis showing maximum throughput and the x-axis showing the number of OSS nodes.
Figure 2. Graphical representation of write throughput results
Figure 3 shows how read throughput improves with the number of server nodes. The y-axis shows maximum throughput, and the x-axis shows the number of OSS nodes.
Figure 3. Graphical representation of read throughput results
To measure IOPs, we used the following syntax:
mpiexec --hosts=<client_hostname> -np <n_procs> /lustre/software/ior/bin/ior \
  -a MPIIO -v -z -F -w -r -t 4k -b 320m \
  -o /mnt/lustre/test.`date +"%Y-%m-%d_%H-%M-%S"`
Where:
-v indicates verbose mode
-z indicates random (rather than sequential) offsets within the file
-F indicates file per process
-w indicates write operation
-r indicates read operation
-t indicates the transfer size (size of each transfer in bytes)
-b indicates block size
-o indicates output test file
<client_hostname> is the name of the host used as the client
<n_procs> is the number of processes
The results are shown in Table 2.
Table 2. IOPs results for Lustre file system
Figure 4 provides the write IOPs results. The y-axis shows IOPs, and the x-axis shows the number of OSS nodes.
Figure 4. Write IOPs showing linear scale as nodes are added
Our nonlinear read IOPs results, shown in Figure 5, were unexpected. We ran out of time for further investigation, but we suspect Lustre caching effects were the culprit.
Figure 5. Read IOPs showing nonlinear scale
For the metadata operation testing, we used MDTest with the following syntax on the client node:
mpiexec --hosts=<client_hostname> -np <n_procs> /lustre/software/mdtest/mdtest \
  -I 10 -i 2 -z 5 -b 5 -d /mnt/lustre/
Where:
-I indicates the number of items per directory in the tree
-i indicates the number of iterations
-b indicates the branching factor of the directory tree
-z indicates the depth of the directory tree
-d is the directory where the tests are run (multiple paths require fully qualified pathnames)
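As a concrete example, a four-process run from a single client (the hostname client0 is a placeholder) would be invoked as:
mpiexec --hosts=client0 -np 4 /lustre/software/mdtest/mdtest \
  -I 10 -i 2 -z 5 -b 5 -d /mnt/lustre/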
Figure 6 shows the MDTest output.
Figure 6. Sample MDTest output for client host with four processes
Overall, system performance was good, but we noticed a few testing quirks. Here are our observations and conclusions:
Thank you for reading!
This article is also available as an eBook:
AzureCAT Guidance
"Hands-on solutions, with our heads in the Cloud!"