First published on MSDN on Nov 27, 2018
Written by Kanchan Mehrotra, Tony Wu, and Rakesh Patil from AzureCAT. Reviewed by Solliance. Edited by Nanette Ray.
This article is also available as an eBook:
Find previously published parts of this series here:
To validate the assumptions made when designing a PVFS, monitoring is essential. Our informal performance testing included only a basic level of monitoring, but we recommend monitoring both the cluster and the nodes to see if the system needs to be tuned or sized to suit the demand. For example, to monitor I/O performance, you can use the tools included with the operating system, such as iostat, shown in Figure 1.
Figure 1. Monitoring I/O performance using the iostat command
For cluster-level management, Ganglia offers scalable, distributed monitoring for HPC systems, such as clusters and grids. Ganglia is based on a hierarchical design targeted at federations of clusters. It makes use of widely used technologies, such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. Figure 2 shows an aggregated view from Ganglia, displaying compute clients and other components.
Figure 2. Ganglia aggregated view
Ganglia includes three main components:
To install Ganglia for monitoring the status of the grids or cluster:
BeeGFS includes lightweight monitoring that provides client statistics from the perspective of the metadata operations as well as throughput and storage nodes statistics. As Figure 3 and Figure 4 show, you can monitor individual nodes, see an aggregated view, and get real-time feedback in one-second intervals.
Figure 3. BeeGFS monitoring views
Figure 4. Monitoring work requests in BeeGFS
The goal of our performance testing was to see how scalable Lustre, GlusterFS, and BeeGFS are based on a default configuration. Use our results as a baseline and guide for sizing the servers and storage configuration you need to meet your I/O performance requirements. Performance tuning is critical.
Our test results were not always uniform—we spent more time performance-tuning BeeGFS than Lustre or GlusterFS. When we tested our deployments, we got good throughput for both Lustre and GlusterFS, and performance scaled as expected as we added storage nodes.
Our performance results were based on a raw configuration of the file systems obtained in a lab with an emphasis on scalability. Many variables affect performance, from platform issues to network drivers. Network performance, for instance, obviously affects a networked file system.
We hope to do more comprehensive testing in the future to investigate these results. In the meantime, we can’t emphasize repeatability enough in any deployment. Use Azure Resource Manager templates, run scripts, and do what you can to automate the execution of your test environment on Azure. Repeatability is the cornerstone of solid benchmarking.
For more information, see the following resources:
Thank you for reading!
This article is also available as an eBook:
Azure CAT Guidance
"Hands-on solutions, with our heads in the Cloud!"
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.