Azure NetApp Files (ANF) became generally available in May 2019. Since then it has been widely adopted across industries, including many silicon companies running their Electronic Design Automation (EDA) workloads on Azure. Azure NetApp Files offers three service levels with guaranteed throughput, supports NFSv3, NFSv4.1, and SMB mount protocols from Windows or Linux VMs, and takes only minutes to set up. Enterprises can seamlessly migrate their applications to Azure with an on-premises-like experience and performance.
The EDA workloads are generated by the SPEC SFS® benchmark suite, which is designed to “measure file server throughput and response time.” The suite generates EDA operations at a 3:2 ratio of EDA_FRONTEND to EDA_BACKEND to simulate a classic IC-design workload. The distribution of the operations is illustrated below:
(From SPEC SFS® 2014)
The goal of this article is to share lessons learned from running the SPEC SFS® EDA stress test on Azure NetApp Files, covering architecture, performance tuning, cost-effectiveness, and scalability.
Architecture
The test was performed in the Azure East US region. A mix of Premium and Ultra service levels with different volume sizes was tested. E64dsv4 or D64dsv4 VMs acted as clients generating the EDA workload operations; they resided in the same Proximity Placement Group with Accelerated Networking enabled.
The average ping latency from VMs to ANF was around 0.6~0.8 milliseconds.
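As a quick sanity check, the round-trip latency from a client VM to the volume's mount target can be measured with ping (a minimal sketch; 10.1.x.x is a placeholder for your ANF mount IP):
# measure round-trip latency from the client VM to the ANF mount target
ping -c 20 10.1.x.x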
The overall performance is determined by two key metrics: operations per second and response time (in milliseconds). The test incrementally generates EDA workload operations, and the response time is recorded at each load point. The operations per second also translate into a combined read/write throughput (MB/s).
Performance Tuning
The table below shows the eight options that were examined and their performance impact. Please see the Appendix for details and applicability. Unless specifically stated otherwise, NFS vers=3, mountproto=tcp, and the default MTU of 1500 were used.
The results below show that the first three options (A, B & C) each significantly improve response time and maximize throughput, and their effects are additive.
NFSv4.1 showed poor performance in this test compared to NFSv3, so be cautious about using it unless there are specific security requirements.
TCP showed slightly better performance than UDP (‘mountproto=udp’).
Appropriate rsize/wsize values can improve performance, but be cautious when modifying the defaults (1 MB/1 MB), as improper values can also degrade performance.
There was no significant impact on performance when changing the VM's MTU from the default 1500 bytes to 9000 bytes (the effective mount options and MTU can be verified as sketched below).
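To confirm which values are actually in effect on a client, the negotiated NFS mount options and the NIC MTU can be checked directly (a minimal sketch; eth0 and the grep pattern are assumptions for a typical client VM):
# show the negotiated NFS options (rsize, wsize, vers, proto) for all NFS mounts
nfsstat -m
# or read them from /proc/mounts
grep nfs /proc/mounts
# check (and, if desired, change) the NIC MTU
ip link show eth0 | grep mtu
sudo ip link set dev eth0 mtu 9000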
Cost-Effectiveness Analysis
One advantage of ANF is that it provides three different service levels with different pricing structures. Users can therefore change the service level and volume size to reach the most cost-effective sweet spot when running their applications on Azure.
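For example, a volume's quota can be resized on the fly with the Azure CLI (a hedged sketch; the resource group, account, pool, and volume names are placeholders, and the --usage-threshold unit should be verified against your CLI version's help):
# resize an existing ANF volume's quota (recent CLI versions take the value in GiB)
az netappfiles volume update \
  --resource-group myRG \
  --account-name myANFAccount \
  --pool-name myPool \
  --name ultravol \
  --usage-threshold 4096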
The chart below shows that high-IOPS EDA workloads (such as the LSF events share) can generally be handled at the Premium service level without excessively large volumes.
Scalability
Summary
It's important to keep in mind that in the real world, storage performance is affected by a wide range of factors. This article is by no means an ultimate guide; rather, it shares lessons learned from running the standard EDA benchmarking tools and examines some generic performance best practices that can be applied to your applications running on Azure.
Generally, the first four options (A, B, C & D) are suggested when applicable. As their effects are additive, they also improve maximum IOPS and bandwidth in a regular FIO test:
And, as stated above, please be cautious when changing rsize/wsize (option F), as it can also impact performance in unexpected ways; a sample FIO command is sketched below.
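For reference, a generic random-read FIO job can be run against the mounted volume as follows (a minimal sketch; the job parameters and the /mnt/ultravol path are illustrative, not the exact ones used in this test):
# 4K random reads, 8 jobs, queue depth 64, direct I/O against the NFS mount
fio --name=anf-randread --directory=/mnt/ultravol \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=64 --numjobs=8 --size=4G \
    --runtime=60 --time_based --group_reporting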
Appendix:
1. Tuning /etc/sysctl.conf (option B) example on Ev4/Dv4:
sudo vi /etc/sysctl.conf
Append or update the following attributes:
net.core.somaxconn = 65536
net.core.netdev_max_backlog = 300000
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 87380 16777216
net.ipv4.tcp_fin_timeout = 5
To make the change effective:
sudo sysctl -p
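Optionally, confirm the new values are active (a minimal sketch):
# verify a few of the kernel parameters after reloading
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem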
2. Upgrade the Linux kernel to 5.3 or later to be able to use “nconnect” (option C). Please note that you will need to reboot the VM at the end of the upgrade, so it might not be applicable in some cases.
# CentOS/Redhat 7+
sudo rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
# CentOS/Redhat 8+
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo yum -y --enablerepo=elrepo-kernel install kernel-ml
sudo reboot
# check version:
uname -r
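After rebooting into the new kernel and remounting with nconnect, you can verify that multiple TCP connections to the NFS server were established (a minimal sketch; 10.1.x.x is a placeholder for the mount IP):
# count the TCP connections from this client to the NFS server (port 2049);
# with nconnect=16, up to 16 established connections are expected
ss -tn | grep 10.1.x.x:2049 | wc -l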
3. actimeo and nocto
The actimeo and nocto mount options are used primarily to increase raw performance. Please review NetApp ONTAP’s Best Practice Guide for their applicability to your applications.
4. mounting examples:
TCP:
sudo mount -t nfs -o rw,nconnect=16,nocto,actimeo=600,hard,rsize=1048576,wsize=1048576,vers=3,tcp 10.1.x.x:/ultravol ultravol
UDP:
sudo mount -t nfs -o rw,nconnect=16,nocto,actimeo=600,hard,rsize=1048576,wsize=1048576,vers=3,mountproto=udp 10.1.x.x:/ultravol ultravol
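To make the mount persistent across reboots, the same options can be placed in /etc/fstab (a minimal sketch, assuming the mount point is /mnt/ultravol, which is a placeholder):
# /etc/fstab entry (single line)
10.1.x.x:/ultravol /mnt/ultravol nfs rw,nconnect=16,nocto,actimeo=600,hard,rsize=1048576,wsize=1048576,vers=3,tcp 0 0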