Much needed improvement in reliability of SCOM Linux monitoring - Agents randomly going grey

Microsoft

Mar 10, 2021

"Linux agents have heartbeat issues. Literally everyday heartbeat alerts are detected on random Linux machines, but then when those machines are checked, they are found to be up and running. To solve this problem in the console, the agents under question are restarted. But that is an issue when there are so many agents and a lot of heartbeat alerts!" - Large SCOM customer monitoring 1000s of Linux agents.

The origin of the problem

The problem was a big one. Until the release of SCOM 2019 UR1, individual Linux agents would randomly 'grey out' in the SCOM Ops and Web consoles, as shown in the image below, where the second agent is ‘greyed out’.

A ‘grey agent’ technically means that the state of the agent is ‘Unknown’. The agent might be healthy or unhealthy, but the Management Server that is watching the agent can’t reach it to determine its health. Imagine going on a vacation and losing your phone along the way. You might be healthy and having the time of your life, but the people concerned about you have no way of reaching you to find out how you are, because the connection is dead.

On a technical level, a management server regularly sends a signal to each agent it is monitoring, and expects a reply from each one of them (called an agent heartbeat). When an agent fails to send a heartbeat response upon receipt of a signal from the management server, it greys out in the console.
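The heartbeat check described above can be sketched as a small model. This is only an illustration, not SCOM code: the `HeartbeatTracker` class, the agent names, the 30-second timeout, and the "Reachable" label are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatTracker:
    """Hypothetical model of a management server tracking heartbeat replies."""
    timeout_s: float
    last_reply: dict = field(default_factory=dict)

    def record_reply(self, agent: str, now: float) -> None:
        # Remember when this agent last answered a heartbeat signal.
        self.last_reply[agent] = now

    def state(self, agent: str, now: float) -> str:
        last = self.last_reply.get(agent)
        if last is None or now - last > self.timeout_s:
            # No recent reply: health cannot be determined (grey agent).
            return "Unknown"
        return "Reachable"


tracker = HeartbeatTracker(timeout_s=30.0)
tracker.record_reply("linux-01", now=0.0)
tracker.record_reply("linux-02", now=0.0)
tracker.record_reply("linux-01", now=25.0)  # linux-02 misses this round

states = {name: tracker.state(name, now=40.0) for name in ("linux-01", "linux-02")}
print(states)  # {'linux-01': 'Reachable', 'linux-02': 'Unknown'}
```

Note that "Unknown" says nothing about whether the machine is actually down. It only records that no reply arrived within the timeout.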

Why is that a problem

Agents greying out leads to alerts being generated. This is a problem for the system administrator who is monitoring the system, for the following reasons:

Based on the alerts, they contact the Linux Admin in charge of the Linux servers. The Linux Admin then checks the Linux agent, only to find it up and running. This wastes the time of both the Linux Admin and the System Admin.
The pattern is random. The System Admin cannot tell which greyed-out agent alert is real and which is a false alarm, so every greyed-out agent has to be checked.
Sometimes agents grey out late at night or after midnight. The System Admin then has to deal with the alerts at those odd hours.

The solution

After multiple customers brought the issue to our attention, further investigation traced it to the agent. Previously, the Linux agent ran a single process that was responsible both for sending heartbeats and for managing the collection of performance metrics. If, for any reason, that process became too busy collecting performance metrics, it would stall. A stalled process could not send heartbeats to the management server either, which caused the agent to grey out.
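To make the failure mode concrete, here is a minimal simulation of a single process that can only answer a heartbeat after it finishes a round of metric collection. The function names, the collection durations, and the 30-second timeout are invented for illustration and do not reflect the agent's real internals.

```python
def coupled_heartbeat_times(collection_durations):
    """Simulate one process doing both jobs.

    The process can reply to a heartbeat only after each metric collection
    finishes, so a slow collection delays the next reply.
    """
    clock = 0
    sent = []
    for duration in collection_durations:
        clock += duration   # busy collecting; no heartbeat possible meanwhile
        sent.append(clock)  # heartbeat finally goes out
    return sent


def longest_silence(sent_times):
    """Longest stretch (in seconds) without any heartbeat, starting at time 0."""
    previous = [0] + sent_times[:-1]
    return max(later - earlier for earlier, later in zip(previous, sent_times))


TIMEOUT_S = 30                 # assumed management server timeout
durations = [5, 5, 120, 5]     # third collection stalls for two minutes

sent = coupled_heartbeat_times(durations)
greyed_out = longest_silence(sent) > TIMEOUT_S
print(sent, greyed_out)  # [5, 10, 130, 135] True
```

The machine itself stays up the whole time. Only the heartbeat is late, which is exactly what makes the resulting alert a false alarm.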

From SCOM 2019 UR1 onwards, the SCOM Linux agent includes a dedicated process for sending the heartbeat. This process, called ‘omiagent’ and running under the ‘omi’ user, is now responsible for sending a heartbeat response to the management server whenever the management server sends a heartbeat request to the agent.
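The idea of a dedicated heartbeat worker can be sketched as follows. This is a simplified stand-in, not the agent's implementation: it uses a thread rather than a separate process, replies on a fixed interval instead of answering real requests from a management server, and all names and timings are hypothetical.

```python
import threading
import time


def heartbeat_responder(stop: threading.Event, replies: list, interval_s: float) -> None:
    """Dedicated worker: only records heartbeat replies, never collects metrics."""
    while not stop.is_set():
        replies.append(time.monotonic())
        stop.wait(interval_s)  # sleep, but wake immediately when asked to stop


def collect_metrics(busy_s: float) -> None:
    """Stand-in for a slow performance metric collection."""
    time.sleep(busy_s)


stop = threading.Event()
replies = []
worker = threading.Thread(target=heartbeat_responder, args=(stop, replies, 0.02))
worker.start()

collect_metrics(0.3)  # the collector is busy, yet heartbeats keep flowing

stop.set()
worker.join()
print(f"heartbeats sent while collecting: {len(replies)}")
```

Because the heartbeat worker shares no work with the collector, a slow collection no longer delays the replies.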

How it solved the problem

Since the problem existed in the SCOM Linux agent and not the Management Server, the fix takes effect for customers who upgrade their agents to the latest version. Because the performance metrics process has been decoupled from the heartbeat process, agents no longer grey out unless there is a genuine problem. For the system and Linux administrators, this solves the problems of:

Random greying out of agents
False alarms

The overall impact is increased reliability of SCOM Linux monitoring. This saves time otherwise spent investigating false alarms, and it relieves the stress of needlessly worrying whether an agent going grey in the Ops console is a false alarm or actually indicates that the Linux server is down or unreachable.