We’re excited to announce the release of Network HUD - A new feature that proactively identifies and remediates operational networking issues on Azure Stack HCI. Network HUD is available in the November update for both 21H2 and 22H2 Azure Stack HCI subscribers!
Running and maintaining a network for your business applications is a hard job. Ensuring a workload is stable and optimized requires coordination across the physical network (switch, cabling, NIC), the host operating system (e.g., virtual switch, virtual NICs), and of course the application that runs inside the VMs or containers. Each of these has its own configuration and capabilities, and each may be managed by a different team. Even if you’ve perfectly implemented your “golden configuration”, your environment may still experience the ripple effect of a bad configuration in another part of your network that degrades your application performance.
Given the sheer number of tools and technologies needed to manage the network components listed above, complexity in the network has reached an all-time high. If you look across the Azure Stack HCI OS and all of its event logs, performance counters, and tooling, there is an enormous amount of information at your fingertips. However, distilling all of this when an issue occurs frequently requires expertise and time, and unfortunately it is reactive: the analysis only begins once a problem has already occurred.
This is where Network HUD shines. Network HUD analyzes the information coming from event logs, performance counters, tooling like Pktmon, network traffic, and the physical network devices in real-time to identify issues BEFORE they happen. In many cases, it will PREVENT issues from occurring by modifying your system to ensure that issues are not exacerbated. When it can’t prevent an issue, Network HUD will alert you with an actionable message that tells you how to solve the problem in your environment. Over time, we plan to enhance Network HUD with learning capabilities that, for example, will identify high and low traffic times to ensure that maintenance tasks do not interfere with workloads achieving their expected performance levels.
In our next blog post, we’ll look at the capabilities that are shipping with the November content update. In this article, we’ll discuss some basics of Network HUD.
Getting started with Network HUD is easy. To install Network HUD, first ensure you’ve installed the November update on your 21H2 or 22H2 Azure Stack HCI release. Next, check out the installation instructions located here, which outline some additional steps. Finally, review the requirements below, which Network HUD needs in order to do its work.
Network HUD understands how you intend to use your adapters and as a result can manage stability across the cluster. Imagine Node1 in your cluster has an unstable adapter. If the other nodes are not informed of the issue, the healthy nodes could overwhelm Node1 and cause a larger issue (e.g., cluster crashes or Storage Spaces Direct rebuilds).
To address this, Network HUD works in tandem with Network ATC. When Network HUD identifies instability on one node, it informs Network ATC, which can manage the cluster-wide configuration and ensure that the healthy nodes do not overload the degraded node.
As a result, Network HUD requires that an adapter is part of a Network ATC intent.
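One way to check this requirement is to list the Network ATC intents and the adapters each one covers. The following is a minimal sketch, assuming it runs on a cluster node where the Network ATC and FailoverClusters cmdlets are available; the property names selected are assumptions based on typical cmdlet output and may differ in your build.

```powershell
# Sketch only: assumes a cluster node with Network ATC and FailoverClusters cmdlets.
$clusterName = (Get-Cluster).Name

# List each intent and the adapters it covers.
# IntentName and NetAdapterNamesAsList are assumed property names.
Get-NetIntent -ClusterName $clusterName |
    Select-Object IntentName, NetAdapterNamesAsList

# Check that each intent was applied on every node.
Get-NetIntentStatus -ClusterName $clusterName |
    Select-Object Host, IntentName, ConfigurationStatus, ProvisioningStatus
```

Any adapter you expect Network HUD to watch should appear in one of the listed intents.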
Network HUD takes advantage of capabilities in the physical switch to ensure that your configuration matches what’s on the physical network. For example, we can determine whether the locally connected switchports have the correct data center bridging configuration required for RDMA storage traffic to function (and as previously mentioned, we know which switchports to look at because the adapters are part of a Network ATC storage intent).
To ensure Network HUD can validate the physical network, make sure each switch connected to your cluster nodes is one of the devices that we’ve verified has the necessary capabilities.
Microsoft has worked with each of these vendors to ensure that the devices listed support the capabilities that we require for Azure Stack HCI. If your device is not listed, contact your vendor as your deployment is not using an approved switch.
Importantly, we may introduce new requirements to bring additional functionality or improved stability to your deployments in the future. While we earnestly attempt to work with every vendor to maintain a stable list of devices, some devices will not maintain their status moving forward. Therefore, if you’re purchasing a new switch, we recommend you use one of the devices listed with the latest OS version.
Network HUD uses the existing cluster health infrastructure that your cluster is already using for Storage Spaces Direct.
To get timely alerts even if you don’t have Windows Admin Center or the Azure Portal open when issues occur, ensure that you’ve configured Azure Insights for your Azure Stack HCI cluster and set up action groups in the portal (if you haven’t already).
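Because the alerts flow through the cluster health infrastructure, one way to view current faults locally is the `Get-HealthFault` cmdlet. This is a sketch under the assumption that Network HUD faults appear alongside the other cluster health faults; the exact fault types and fields are not described in this article.

```powershell
# Sketch only: run on a cluster node.
# Lists all current health faults reported for the cluster;
# Format-List shows every field rather than assuming specific property names.
Get-HealthFault | Format-List
```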
Network HUD is a true cloud service that runs on-premises. At the time of writing, there are several issues that Network HUD can detect (disconnecting NICs, resetting NICs, PCIe oversubscription, etc.). However, we’re actively working on bringing more capabilities to you and will make these available as soon as they’re ready. We’ll announce the availability of new Network HUD content here on the blog and of course update our documentation, so stay tuned!
That’s right, just like any other Azure service, the latest updates will be available to you and will bring quality improvements as well as new capabilities that detect more issues. We’re closely connected with our support teams as they bring word of new and emerging issues that Network HUD could detect.
At the time of writing, you’ll periodically need to run:
Install-Module -Name Az.StackHCI.NetworkHUD -Force
on each of the nodes in your cluster whenever new content updates are available. We expect to improve this experience in the future but would love to hear your feedback.
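If PowerShell remoting is enabled between your nodes, the per-node step could be scripted from a single node. This is a minimal sketch, assuming the FailoverClusters module is available and that each node can reach the module repository; only the `Install-Module` line comes from this article.

```powershell
# Sketch only: assumes it runs on a cluster node with the FailoverClusters
# module and that PowerShell remoting (WinRM) is enabled between nodes.
$nodes = (Get-ClusterNode).Name

Invoke-Command -ComputerName $nodes -ScriptBlock {
    # Install command as given in this article.
    Install-Module -Name Az.StackHCI.NetworkHUD -Force
}
```

Running it again after a new content release picks up the newer module version on every node.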
Network HUD is a new feature, available with the November update on Azure Stack HCI, that detects operational network issues that cause instability or degrade performance. It distills the various indicators of problems generated by event logs, performance counters, the physical network, and more to proactively identify issues and alert you with contextual messages that you can act on. It also integrates with the existing alerting mechanisms you’re already used to and leverages Network ATC for intent-based analytics and remediation.
With Network HUD, your network will soon run more smoothly. You can rest easy knowing that it’s there to watch for and prevent network issues. Install now!
Dan “Heads up” Cuomo