Performance impact of enabling Accelerated Networking on HBv2 and HC virtual machines
Microsoft


Azure Accelerated Networking is now available on HBv2, HC and HB virtual machines (VMs). Enabling this feature improves networking performance between VMs when connecting over the Ethernet-based vNICs, which is useful for scenarios like high-performance filesystems created on Azure VMs and mounted against client compute VMs. In this article we measure the network latency, bandwidth and I/O performance connecting HPC VMs to an NFS server, with Accelerated Networking enabled and disabled to see the impact. This article also covers network tuning to get the best performance with Accelerated Networking on HBv2 and HB VMs.

 

Ethernet network latency and bandwidth benchmarks

The ntttcp tool (https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-bandwidth-testing) was used to run the Accelerated Networking bandwidth tests.

The following command line parameters were used:

 

ntttcp -r -m 64,* --show-tcp-retrans --show-nic-packets eth0               # on receiver
ntttcp -s -m 64,*,$server_ip --show-tcp-retrans --show-nic-packets eth0    # on sender

 

Network latencies were measured using the Linux sockperf tool.

/usr/sbin/sysctl -w net.core.busy_poll=50
/usr/sbin/sysctl -w net.core.busy_read=50
sockperf server -i $server_ip --tcp -p 8201                               # on receiver
sockperf ping-pong -i $server_ip -p 8201 -t 20 --tcp --pps=max            # on sender

 

[Figure: CormacGarvey_0-1612987099244.png]

[Figure: CormacGarvey_4-1612987498861.png]

 

NOTE: CentOS-HPC 7.8 was used for all network latency and bandwidth benchmarks. With CentOS-HPC 7.8,
the HBv2 network bandwidth test did not reach the expected ~38 Gbps even with the network tuning
described below, whereas with CentOS-HPC 7.7 we were able to achieve ~38 Gbps. An updated version of
CentOS-HPC 7.8 will be released at a later date to correct this performance problem.

 

Ethernet network tuning for HB120_v2 and HB60

On HB120_v2 and HB60, some manual network tuning is needed to see the performance benefits of Accelerated Networking.

NOTE: Network tuning will be included in future Marketplace HPC images.

Here are the manual network tuning steps:

  • Change the number of multi-purpose channels for the eth2 network device. The default number of multi-purpose channels on HBv2, HB and HC SKUs is 31. In our testing, 4 multi-purpose channels gave the best performance.
ethtool -L eth2 combined 4
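
The change can be checked by reading the channel counts back with `ethtool -l` (lowercase). The following is a minimal sketch rather than part of the original tuning steps: the helper name `current_combined` is hypothetical, and it assumes the usual `ethtool -l` layout, in which a "Current hardware settings" section contains a "Combined:" line.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -l` output on stdin and print the
# current (not the pre-set maximum) number of combined channels.
current_combined() {
    awk '/^Current hardware settings/ { cur = 1 }
         cur && /^Combined:/         { print $2; exit }'
}

# Example use on the VM (needs ethtool and the eth2 device):
#   ethtool -l eth2 | current_combined
```

If the value printed is not 4, the `ethtool -L` command above did not take effect on that device.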
  • Pin the first four multi-purpose channels of device eth2 to vNUMA 0.

To get the first four multi-purpose channel indices:

ls /sys/class/net/eth2/device/msi_irqs

Map the first four multi-purpose channels to vNUMA 0:

echo "0" > /proc/irq/${irq_index[0]}/smp_affinity_list
echo "1" > /proc/irq/${irq_index[1]}/smp_affinity_list
echo "2" > /proc/irq/${irq_index[2]}/smp_affinity_list
echo "3" > /proc/irq/${irq_index[3]}/smp_affinity_list
NOTE: The map_irq_to_numa.sh script in the azurehpc git repo does this automatically: https://github.com/Azure/azurehpc/tree/master/experimental/AccelNet_tuning
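
The echo commands above rely on an `irq_index` list that the steps do not define. One way to build it is sketched below; this is not the azurehpc script. It assumes that the IRQ numbers under `msi_irqs`, sorted numerically, follow the channel order, which should be checked against `/proc/interrupts` on the VM. The names `pin_irqs`, `IFACE`, `NUM_CHANNELS` and `DRY_RUN` are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: IRQ numbers listed under msi_irqs, sorted
# numerically, correspond to the channels in order.
IFACE="${IFACE:-eth2}"
NUM_CHANNELS="${NUM_CHANNELS:-4}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = print the writes instead of performing them

# Pin the first $2 IRQs listed in directory $1 to cores 0, 1, 2, ...
# The body runs in a subshell so its variables stay local.
pin_irqs() (
    core=0
    for irq in $(ls "$1" | sort -n); do
        [ "$core" -ge "$2" ] && break
        target="/proc/irq/${irq}/smp_affinity_list"
        if [ "$DRY_RUN" = "1" ]; then
            echo "echo $core > $target"
        else
            echo "$core" > "$target"    # needs root
        fi
        core=$((core + 1))
    done
)

irq_dir="/sys/class/net/${IFACE}/device/msi_irqs"
if [ -d "$irq_dir" ]; then
    pin_irqs "$irq_dir" "$NUM_CHANNELS"
fi
```

By default the sketch only prints the writes it would make, so the mapping can be reviewed before it is rerun as root with DRY_RUN=0.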
  • Pin your executable (e.g. ntttcp) to vNUMA 0.
taskset -c 0-3 ntttcp <ntttcp_args>
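
The `taskset -c 0-3` command assumes that cores 0-3 belong to vNUMA 0. This can be confirmed from the standard Linux sysfs file `/sys/devices/system/node/node0/cpulist`. The following minimal sketch reads that file; the helper name `node_cpulist` and its optional second argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the CPU list of a NUMA node as reported by sysfs.
# $1 = node number, $2 = sysfs node directory (defaults to the standard path).
node_cpulist() {
    cat "${2:-/sys/devices/system/node}/node$1/cpulist"
}

# Example use on the VM: pin ntttcp to the cores sysfs reports for node 0
#   taskset -c "$(node_cpulist 0)" ntttcp <ntttcp_args>
```

If the reported list differs from 0-3, the cores used for the IRQ affinity and for taskset should be adjusted together so that both stay on the same vNUMA node.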

 

I/O Performance benchmark 

We performed synthetic I/O benchmarks (FIO) on HC44 and HB120_v2 connected to an NFS server, to determine the performance impact of Accelerated Networking on network storage I/O performance.

 

NFS server configuration

D64s_v4 (6 x P30 disks)
NFS server used CentOS 7.8 and HPC I/O clients used CentOS-HPC 7.8
Expected theoretical peak I/O performance = ~1200 MB/s (Due to D64s_v4 and P30 disk limits)
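
The ~1200 MB/s figure can be reproduced with simple arithmetic. The sketch below assumes limits that the article does not state explicitly: 200 MB/s per P30 disk and 1200 MB/s of uncached disk throughput for the D64s_v4 VM size. The function name `peak_mbps` is hypothetical.

```shell
#!/bin/sh
# Assumed limits (not stated in the article): 200 MB/s per P30 disk and
# 1200 MB/s uncached disk throughput for the D64s_v4 VM size.
DISKS=6
P30_MBPS=200
VM_LIMIT_MBPS=1200

# Peak throughput is the smaller of the aggregate disk limit and the VM limit.
# Arguments: number of disks, MB/s per disk, VM limit in MB/s.
peak_mbps() (
    disks_total=$(( $1 * $2 ))
    if [ "$disks_total" -lt "$3" ]; then
        echo "$disks_total"
    else
        echo "$3"
    fi
)

peak_mbps "$DISKS" "$P30_MBPS" "$VM_LIMIT_MBPS"    # prints 1200
```

Under these assumptions both limits coincide at 1200 MB/s, so adding more P30 disks to this VM size would not be expected to raise the ceiling.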

 

[Figure: CormacGarvey_5-1612987606845.png]

[Figure: CormacGarvey_7-1612987742405.png]

[Figure: CormacGarvey_0-1612990804433.png]

[Figure: CormacGarvey_1-1612990837629.png]

NOTE: In this I/O benchmark, the D64s_v4 and P30 disk limits of the NFS server restricted I/O performance
even though the network had bandwidth to spare. With a network storage solution that has faster disks
or higher throughput, greater gains in I/O performance would be expected from enabling Accelerated Networking.

 

Summary

  • Enabling Accelerated Networking on HPC VMs significantly improves front-end network performance (latency and bandwidth).
  • HB120_v2 and HB60 SKUs require network tuning to benefit from Accelerated Networking.
  • Accelerated Networking improves network storage I/O performance, especially read I/O at lower client counts.