11-16-2017 01:18 AM
I work in a team that manages monitoring for our on-prem Linux environment. We have been asked to manage the monitoring for the cloud-based solution that our internal BUs are progressing with. We have noticed a massive lag between a threshold being breached and the time we receive an alert. In some cases it took over 9 hours, and that was just a basic heartbeat search query, run every 5 minutes over the last 4 hours, which should have generated an alert.
My question to the community is: what other solutions have you turned to that helped you overcome this latency issue? Real time is a must for us, as we are in the financial sector, where time lost is money lost.
Any suggestions or feedback are welcome.
11-16-2017 04:04 AM
You can use OMS for monitoring; it provides near-real-time monitoring for metrics and can also generate alerts.
11-16-2017 05:19 AM
Thank you for responding. We currently use OMS but have noticed a lag between the threshold being breached and the alert firing. Even though alerting is set to 5-minute polling, it takes far longer for the alert to be triggered. I verify this by typing Alert in Log Analytics, which returns nothing for the past hour. That is why I was wondering whether people use anything other than OMS for metric monitoring.
11-16-2017 09:49 AM
Can you post your query? I think you may be doing something in the query that is causing that level of lag. I'd say 20 minutes is a pretty reliable level of lag from condition to alert in my experience, so this sounds like either something wrong with the query you're using or there is some latency elsewhere in the system.
I have this query for latency that runs as an alert, and it pretty reliably gives me an idea when things are slow in the system:
| order by TimeGenerated
| limit 1
I alert on that when the number of entries is less than 1.
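To make the alert condition above concrete outside Log Analytics, here is a minimal Python sketch of the same idea: look at the records inside a look-back window and fire when fewer than one came back. This is only an illustration, not the query itself; the function names, the sample timestamps, and the 20-minute and 60-minute windows are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

def recent_entries(timestamps, now, window):
    """Return the timestamps that fall inside the look-back window ending at `now`."""
    cutoff = now - window
    return [t for t in timestamps if cutoff <= t <= now]

def should_alert(timestamps, now, window, minimum=1):
    """Fire when fewer than `minimum` records arrived inside the window."""
    return len(recent_entries(timestamps, now, window)) < minimum

def newest_lag(timestamps, now):
    """Age of the most recent record, or None when there are no records."""
    return now - max(timestamps) if timestamps else None

# Hypothetical sample data: the newest record is 25 minutes old.
now = datetime(2017, 11, 16, 9, 49, tzinfo=timezone.utc)
records = [now - timedelta(minutes=m) for m in (45, 30, 25)]

print(should_alert(records, now, timedelta(minutes=20)))  # True
print(should_alert(records, now, timedelta(minutes=60)))  # False
print(newest_lag(records, now))                           # 0:25:00
```

With a 20-minute window nothing has arrived recently enough, so the check fires; widening the window to 60 minutes makes it pass. Tracking the age of the newest record alongside the alert gives a rough measure of how far behind the data is.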
As for your cloud-based solution: OMS/Azure Log Analytics isn't well suited to endpoint monitoring (such as URLs and DNS responses). I've turned to Anturis for endpoint monitoring in the past, as it is very low cost and can monitor anything with a URL attached to it.
Out of curiosity, how close to real time do you need to be? The last time I checked the SLA for OMS, latency of up to 2 hours was within the SLA, but we've recently moved over to Azure Log Analytics and I haven't seen the SLAs for Azure Log Analytics.
11-16-2017 11:44 AM
For near-real-time alerting scenarios on metrics, we have announced a public preview: https://azure.microsoft.com/en-au/blog/get-alerts-faster-with-near-real-time-alerting-for-azure-plat...
Additionally, we are currently reviewing this SLA, in particular as it relates to warm path logging.