Core Infrastructure and Security Blog

Azure Monitor: How To Get Alerts for Disconnected Arc Agents

BrunoGabrielli
Jul 22, 2024

[20240906 - Update: This alert is available in AMBA as of release 2024-06-05.]

 

Ciao Readers,

Just a week without writing :smile:. The time for another blog post has arrived :lol::lol:.

 

In this post, I am going to show you how to set up alerts for disconnected Arc agents using Azure Monitor. If you are not familiar with Azure Arc, it is a service that lets you manage and govern your hybrid cloud resources from a single pane of glass. More about it in the Azure Arc overview public documentation page.

 

One of the benefits of using Arc is that it allows you to collect data from your hybrid resources, so you can monitor their health and performance. It is a prerequisite for enabling Azure Monitor on them. With that in mind, why is it important to get an alert when a hybrid virtual machine gets disconnected, or when the Arc agent status is reported as Offline? Ouch, you did not know they were offline!!!

 

 

 

There are several reasons, ranging from management to compliance and including monitoring, why you need to know whether your resources are communicating properly :smile:. Let me give you a few of them:

 

  1. When a hybrid virtual machine is onboarded, every connection is authenticated using a Managed Identity created automatically during the onboarding process. This System Assigned Managed Identity is renewed automatically and can be marked as expired if the system does not communicate for more than 60 days. Should this be the case, there is no way to reset the identity. You have to offboard and re-onboard the machine, together with all the installed extensions and configurations.
  2. When the hybrid machine is disconnected, no monitoring data can be sent. This can lead to some really bad outcomes, such as:
    1. Customers go blind about infrastructure health.
    2. The machine keeps the unsent monitoring data in a local cache on the C drive, using up to 10 GB of disk space.
    3. Old cached data is deleted, so monitoring data loss is expected.
    4. Machines with small disks can quickly and easily run out of disk space. Can you imagine that on a Domain Controller?
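
To make the first reason more tangible, here is a minimal Python sketch that estimates how many days are left before a disconnected machine crosses the 60-day window mentioned above. The function name and the sample dates are hypothetical, and the 60-day figure is simply the one quoted in this post, so treat it as a parameter rather than a guarantee.

```python
from datetime import datetime, timedelta, timezone

# Window quoted in this post; adjust it if your environment behaves differently.
EXPIRY_WINDOW = timedelta(days=60)


def days_until_expiry(last_contact: datetime, now: datetime) -> int:
    """Whole days left before the window closes; 0 once it has elapsed."""
    remaining = (last_contact + EXPIRY_WINDOW) - now
    return max(remaining.days, 0)


# Hypothetical example: last contact on July 1st, checked on July 22nd.
now = datetime(2024, 7, 22, tzinfo=timezone.utc)
last_contact = datetime(2024, 7, 1, tzinfo=timezone.utc)
print(days_until_expiry(last_contact, now))  # prints 39
```

A number like this can help you prioritize which disconnected machines to look at first.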

I just gave you two reasons and, given them, I do not think you need any more, right? By now, I think you see why it is important to be alerted as soon as possible when an Arc agent gets disconnected. Yes, the sooner, the better.

 

Therefore, you will agree with me that it is necessary to create an alert. To do so, you can take advantage of the ability to Create alerts with Azure Resource Graph and Log Analytics.

Let us have a look at the query to be used. The query should return one row per monitored server whose last status is reported as Disconnected (any alert should give you actionable information, and the affected resource is the first item on the list).

A good query should return records for hybrid machines that have not been connected for a given amount of time. The value is your choice, but I would recommend something not too wide (15 minutes could be a good compromise).

Once you have a good record set, you should configure the alert rule to use Table rows as the Measure and Count as the Aggregation type. The Aggregation granularity, which drives the data range the query will consider, could be set to 1 day.

 

 

The alert rule logic will then be configured to measure the number of rows returned by the query. The alert will fire if any records (even a single one) are returned.
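
As a rough mental model only (this is not how Azure Monitor is implemented), the condition described above boils down to counting the returned rows and comparing the count with a threshold of zero. The function name and the sample row in this Python sketch are hypothetical.

```python
def alert_should_fire(rows, threshold=0):
    """Model of 'Measure: Table rows, Aggregation: Count, greater than threshold'."""
    return len(rows) > threshold


# Hypothetical query results: an empty set stays quiet, a single row fires.
print(alert_should_fire([]))                       # prints False
print(alert_should_fire([{"Computer": "srv-01"}]))  # prints True
```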

 

 

 

Assuming that you want an alert for resources that have not been connected for the last 15 minutes, you create an alert rule that uses a query similar to the following one:

 

 

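// Query Azure Resource Graph from Log Analytics
arg("").resources
// Keep only Arc-enabled servers that currently report as Disconnected
| where type == "microsoft.hybridcompute/machines"
| where tostring(properties.status) == "Disconnected"
// Keep only machines whose status changed at least 15 minutes ago
| extend lastContactedDate = todatetime(properties.lastStatusChange)
| where lastContactedDate <= ago(15m)
| extend status = tostring(properties.status)
// One row per affected machine, with the resource id first
| project id, Computer=name, status, lastContactedDate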
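
If it helps to reason about what the query selects, the same filter can be sketched in plain Python against an in-memory list. This is only an illustration: the function name, the dictionary layout, and the sample machines are all hypothetical and do not come from any Azure SDK.

```python
from datetime import datetime, timedelta, timezone


def disconnected_machines(machines, now, threshold=timedelta(minutes=15)):
    """Return one row per machine that has been Disconnected for at least `threshold`."""
    cutoff = now - threshold  # plays the role of ago(15m) in the query
    rows = []
    for m in machines:
        if m["status"] != "Disconnected":
            continue
        if m["lastStatusChange"] <= cutoff:
            rows.append({
                "id": m["id"],
                "Computer": m["name"],
                "status": m["status"],
                "lastContactedDate": m["lastStatusChange"],
            })
    return rows


# Hypothetical inventory, evaluated at 12:00 UTC.
now = datetime(2024, 7, 22, 12, 0, tzinfo=timezone.utc)
machines = [
    {"id": "/arc/srv-01", "name": "srv-01", "status": "Disconnected",
     "lastStatusChange": now - timedelta(hours=1)},
    {"id": "/arc/srv-02", "name": "srv-02", "status": "Connected",
     "lastStatusChange": now - timedelta(days=2)},
    {"id": "/arc/srv-03", "name": "srv-03", "status": "Disconnected",
     "lastStatusChange": now - timedelta(minutes=5)},
]
print([row["Computer"] for row in disconnected_machines(machines, now)])  # prints ['srv-01']
```

Note that srv-03 is left out because it has been disconnected for only 5 minutes, which is inside the 15-minute tolerance.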

 

 

Running the suggested query returns a result set similar to the one in the following image, which fires the alert in line with the alert logic condition provided as a sample:

 

 

I trust you will all be more than able to continue with the alert creation; hence, I am going to stop here to avoid tiring your eyes any further :happyface:.

 

Thanks for reading through!!!

 

Disclaimer

The sample scripts are not supported under any Microsoft standard support program or service. The sample scripts are provided AS IS without a warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.

 

Updated Sep 18, 2024
Version 2.0
  • niranfreeman

    Once these alerts are received, is there a workaround for restoring the disconnected Arc agents?

  • niranfreeman unfortunately, by design, there is no channel to reconnect them from the Azure Portal, if this is your question.

    I too find it hard to reconnect them at scale, e.g. after an internet outage or network issues.

    Sometimes changes or updates to the Arc agent also prevent a simple reconnection because the certificates have changed. So you would need to update the agent to the latest monthly release first. In this case it is easier to remove and reinstall the agent while also removing the Arc object.

    Then, if the agent was offline for too long, the connection will be invalid.

    In general the workflow is clunky, as removing the object also breaks things like tags. So at the end of the day you would want to automate this process via PowerShell*.

    I hope that the auto-updater for Arc will improve things a lot, especially by updating offline Arc agents and trying to reconnect them automatically.

    *There is a PowerShell module for Arc by Microsoft, as well as an improved one by MVP colleague Kaido Järvemets, which you can find through his blog.

  • Deyemu

    Thanks for the query, this is very useful. I have this set up and the query correctly finds disconnected agents when run manually; however, the log search alert rule never fires for me. Should the rule trigger its action once per evaluation frequency, or only when the number of disconnected machines changes?

  • Hello Deyemu ,

    I would try changing both the aggregation granularity and the frequency of evaluation down to 5 or 10 minutes. I would also set the alert rule to split by name.