Observability using ADX
Customers are looking for cost-effective tools for enterprise monitoring initiatives across their technology infrastructure. I wanted to share my experience from customer scenarios where Azure Data Explorer (ADX) was the critical component in building a scalable observability monitoring solution. ADX is a fully managed data service for streaming rapidly changing data and converting it into real-time insights, enabling deeper analysis of the behavior of networks, applications, services, databases, or any technical component that generates logs. The goal is to design comprehensive operational monitoring with proactive controls.
3 Pillars of Observability:
A monitoring solution built for observability:
- understands application and infrastructure designs.
- identifies events and helps set up workflows/alerts.
- facilitates proactive planning of mitigation efforts.
- enables us to understand the full context of activity in any technology solution or application.
- helps manage resources effectively and meet SLAs, SLOs, and SLIs.
The turn-key observability monitoring solutions available in the market may have built-in features for logs, traces, and alerts, but they can come with downsides: multiplying costs, limited integration, and little flexibility for deeper analysis beyond what is built in.
Because ADX was built for big-data telemetry workloads, it shines in a few key areas:
- It is cost effective while handling huge volumes of data from different sources.
- As a PaaS platform, it gives you complete control of your data and the flexibility to implement any feature you need.
- It offers flexibility in ingesting data, with a wide range of options.
Solution:
The Azure solution uses ADX as the storage/ETL/analysis/visualization tool, with its rich set of built-in time-series analytics and machine learning features. The system scales to any volume, type, or source of timestamped records: millions of records and gigabytes of data per day can be loaded, handled, analyzed, and managed seamlessly.
- Ingestion: Data from on-premises, Azure, and third-party clouds can be ingested into ADX, either batched (by duration, size, or event count) or streamed. Integration is easier with the built-in connectors readily available in the service.
Ingestion can use one of the following methods (a small ADX-side setup sketch follows the list):
- Managed pipelines using Event Hubs / Event Grid / IoT Hub.
- Built-in connectors in Azure Data Factory to ingest the data into ADX.
- Ingestion plug-ins such as Logstash, Kafka, and the Apache Spark connector.
- The Power Apps connector, which connects to a wide variety of ingestion sources.
https://learn.microsoft.com/en-us/azure/data-explorer/ingest-data-overview
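Whichever connector you choose, the ADX side needs a destination table, an ingestion mapping, and (optionally) a batching policy. A minimal sketch follows; the table name RawLogs, its columns, and the mapping name are hypothetical placeholders, not part of any specific connector:

```kusto
// Hypothetical destination table for timestamped telemetry records
.create table RawLogs (Timestamp: datetime, Source: string, Severity: string, Payload: dynamic)

// JSON ingestion mapping so the Event Hub / ADF connection knows how incoming fields map onto columns
.create table RawLogs ingestion json mapping 'RawLogsMapping' '[{"column":"Timestamp","Properties":{"path":"$.timestamp"}},{"column":"Source","Properties":{"path":"$.source"}},{"column":"Severity","Properties":{"path":"$.severity"}},{"column":"Payload","Properties":{"path":"$.payload"}}]'

// Batch by duration, item count, or size - whichever limit is reached first
.alter table RawLogs policy ingestionbatching @'{"MaximumBatchingTimeSpan":"00:05:00","MaximumNumberOfItems":1000,"MaximumRawDataSizeMB":1024}'
```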
- Analysis: Big data analytics with ADX offers a rich set of time-series capabilities.
The ADX cluster is designed to be a performant system that can run Kusto queries over millions of records. There is an elaborate set of time-series, geospatial, tabular, and window functions that can be aggregated by different dimensions, plus simple, easy-to-use enrichment functionality and machine learning analytics that come out of the box, as listed below:
- Records can be parsed, unpacked, and expanded into dynamic columns.
- Easily generate reports, alerts, metrics, and views on different dimensions of the time-series data.
- You can go deeper with Kusto queries on the data using built-in ML features such as forecasting, clustering, and anomaly detection (see the query sketch after this list).
- Custom ML analysis code can be written in Python in the ADX editor and included in your analysis.
- There are multiple ways to enable AI/ML capabilities on the data in ADX outside of the tool, using the following methods:
- ADX as a source for modeling/text analytics in Synapse.
- ADX as input to other machine learning environments.
- Sourcing the data for supervised learning or other deep learning frameworks.
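As a minimal sketch of the built-in time-series analytics, the query below turns raw log records into an hourly error-count series and flags anomalous points using decomposition-based anomaly detection. The RawLogs table and its columns are the same hypothetical placeholders used in the ingestion sketch:

```kusto
// Hourly error counts per source over the last 14 days, with anomaly flags
let window = 14d;
RawLogs
| where Timestamp > ago(window) and Severity == "Error"
| make-series ErrorCount = count() default = 0
    on Timestamp from ago(window) to now() step 1h by Source
// Returns a flag (+1/-1/0), a score, and a baseline for every point in each series
| extend (AnomalyFlags, AnomalyScore, Baseline) =
    series_decompose_anomalies(ErrorCount, 1.5, -1, 'linefit')
| render anomalychart
```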
- Visualization: The data staged in the ADX cluster can be used directly in BI dashboards. Below are different ways to accomplish this (a sample panel query follows the list):
- Huge volumes of data can be fed into dynamic, parameterized dashboards in ADX.
- There is a wide range of visualization options when the ADX database is integrated with an Azure Managed Grafana instance. Azure Managed Grafana is a fully managed service for analytics and monitoring solutions. It's supported by Grafana Enterprise, which provides extensible data visualization along with the high availability and security of the Azure cloud.
- Summary statistics of the performance metrics from ADX can be presented in canned Power BI reports.
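A dashboard tile or Grafana panel is ultimately just a Kusto query with a render hint. A hedged sketch, again assuming the hypothetical RawLogs table with a durationMs/service pair inside its dynamic Payload column:

```kusto
// p95 request duration per service in 5-minute bins over the last day
RawLogs
| where Timestamp > ago(1d)
| extend DurationMs = todouble(Payload.durationMs), Service = tostring(Payload.service)
| summarize P95DurationMs = percentile(DurationMs, 95) by bin(Timestamp, 5m), Service
| render timechart
```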
End-to-End Observability: In an example scenario, ServiceNow data is pushed to Event Hubs; non-Azure cloud logs are streamed to Azure Event Hubs or landed in ADLS; Logstash or Azure Data Factory handles third-party logs; and logs from Azure Monitor or Log Analytics are stored as well.
The general idea is that once you understand the footprint of your data, you can backtrack to the incident. By drilling down into the details of the encapsulated request object, we can connect related events coming from disparate systems. The result is a layer of information that can identify connected blocking events, dependencies, exceptions, traces, and so on. We can build a monitoring system that ties together the time dimension, issue, severity, and type of issue from the different sources.
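To make the correlation idea concrete, here is a hedged sketch that ties exceptions back to the gateway requests that preceded them via a shared correlation id. The table names (AppExceptions, GatewayRequests) and columns are hypothetical; in practice they would be whatever your ingestion pipelines produce:

```kusto
// Correlate application exceptions with the gateway requests that triggered them
let lookback = 1h;
AppExceptions
| where Timestamp > ago(lookback)
| project ExceptionTime = Timestamp, CorrelationId, ExceptionType, Severity
| join kind=inner (
    GatewayRequests
    | where Timestamp > ago(lookback)
    | project RequestTime = Timestamp, CorrelationId, Uri, ResponseCode
  ) on CorrelationId
| where ExceptionTime between (RequestTime .. RequestTime + 5m)   // exception shortly after the request
| summarize Exceptions = count() by ExceptionType, Uri, ResponseCode, Severity
| order by Exceptions desc
```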
Implementation Details:
This will be a solution covering “Generate – Collect – Ingest – Process/Conform – Store – Model/Serve/Deliver” analysis on logs, metrics, and traces.
- Set up system health monitoring for the different aspects of the incoming operational data.
- Set up triggers for compliance, service health, cloud optimization suggestions, BC/DR performance, time to detect and time to act, performance/remediation alerts, etc.
- Any performance deviation or degradation can be identified and alerted on.
- Once an issue is surfaced, engineers can test and mitigate it.
- Define different levels of granularity for the analysis, per the required customizations.
- Retry logic, buffering, handling, and throttling ranges can be set up accordingly.
- The solution will enable end-to-end correlation of the logs that are part of a business process.
- This will be the foundation for an incident management system.
- Initially, parent-child dependencies are identified for the related business workflows.
- The hierarchy needs to be set up only once; after that, the analysis is easy.
- The solution can trigger remediation work automatically.
- Outages and linked incidents can be predicted to kick off the automation and avoid any escalation.
- Within a few days, postmortem analysis can lead you to review changes or follow-ups from custom reports.
- Correlation of resource metrics/logs/health, connectivity, traffic, and diagnostics is easier.
- The design decisions to consider will be based on your technology infrastructure, as listed below:
- NSGs, DNS, application gateways, firewalls, load balancers, Traffic Manager, app telemetry analytics, hybrid connectivity, and variables on network events.
- The number of services in dev/prod environments, and the number of apps and components that need to be monitored and set up with metric-based alerting.
- Whether a microservices architecture is in place.
- Whether streaming processes are implemented.
- Alerting needs metrics, while analysis needs logs/checkpoints, within acceptable thresholds.
- Equal partnership is needed from the teams who own all the above resources.
- Within the Azure ecosystem, monitoring is easier if Azure Monitor, Log Analytics, and Application Insights are enabled and integrated into this solution.
- Outside Azure, use the platforms' first-party tools for collection.
- The data loaded into ADX supports query exploration, learning, dashboarding, and integration with third-party tools if needed.
- Operational scenarios may not need more than two weeks of data.
- Choose the caching/persistence strategy, or serverless processing (Spark, ML cluster) on the archived data.
- Rather than one monolithic solution, it can be a scalable solution across clusters.
- Cross-cluster queries give meaningful insights as well (see the sketch below).
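A cross-cluster query is a plain Kusto union over tables that live in different clusters. A minimal sketch, assuming two hypothetical regional clusters that both hold the RawLogs table in an Observability database:

```kusto
// Query the same logical table across two regional clusters at once
union
    cluster('monitoring-eastus.eastus.kusto.windows.net').database('Observability').RawLogs,
    cluster('monitoring-westeu.westeurope.kusto.windows.net').database('Observability').RawLogs
| where Timestamp > ago(1h) and Severity == "Error"
| summarize Errors = count() by Source, bin(Timestamp, 5m)
```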
Efficient Cost Management:
The solution built with ADX delivers a lot more capacity per dollar when costs are controlled. The cost of a custom ADX solution can be controlled with appropriate compression, caching, and archiving techniques and a price-sensitive design.
- ADX cost is mainly made up of two factors:
- Compute: the number of VMs you are running for the cluster. You pay for the virtual machines along with an ADX markup based on the number of cores.
- Overall storage: the persistent layer of ADX is ADLS Gen2 storage. You pay for the data stored there based on your ADX retention period.
As data is ingested into ADX, each column is indexed and compressed. The amount of compression varies, and you can verify what you actually get after ingestion, but the median to expect is roughly 7x compression relative to the uncompressed size of the ingested data. This alone makes ADX a very cost-effective big data platform, but there are other levers you control to drive the cost down further.
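One way to check the ratio after ingestion (a sketch, assuming the hypothetical RawLogs table) is to compare original and extent sizes from the table details:

```kusto
// Compare the original ingested size with the compressed, indexed size on disk
.show table RawLogs details
| project TableName, TotalRowCount, TotalOriginalSize, TotalExtentSize,
          CompressionRatio = todouble(TotalOriginalSize) / todouble(TotalExtentSize)
```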
Controlling ADX cost:
- Utilize Optimized Autoscale. This will help keep your cluster at the correct size and scale in/out as needed.
- Utilize Reserved Instances of ADX over pay-as-you-go to get up to 30% discount https://learn.microsoft.com/en-us/azure/data-explorer/pricing-reserved-capacity
- Optimize the cache configuration based on query patterns (less cache = fewer VMs).
- Example: if 95% of your queries are for the last 20 days, set your cache policy to 20 days.
- If different tables have different cache requirements, set the caching policy per table.
- Downsample with materialized views.
- Only keep the data as long as you need analytics from it.
- Example: if you have a regulatory requirement to keep data for 10 years but only need it queryable for 1 year, set the retention in ADX to 1 year and use continuous export to send the data to an archive storage account for the 10-year retention (the corresponding commands are sketched after this list).
- To limit the amount of log data coming in, enable adaptive sampling / endpoint sampling.
- Enable traces only when debugging is needed.
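The caching, retention, downsampling, and continuous-export levers above map to a handful of control commands. A minimal sketch, assuming the hypothetical RawLogs table and an external table RawLogsArchive that has already been created over the archive storage account:

```kusto
// Keep only the most-queried window in hot (SSD) cache, per table
.alter table RawLogs policy caching hot = 20d

// Keep the data itself for 1 year, then soft-delete it
.alter table RawLogs policy retention softdelete = 365d recoverability = disabled

// Downsample: hourly error counts maintained incrementally as a materialized view
.create materialized-view HourlyErrors on table RawLogs
{
    RawLogs
    | summarize ErrorCount = countif(Severity == "Error") by Source, bin(Timestamp, 1h)
}

// Long-term archive: continuously export rows to the external (storage) table
.create-or-alter continuous-export ArchiveRawLogs
    over (RawLogs)
    to table RawLogsArchive
    with (intervalBetweenRuns = 1h)
<| RawLogs
```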
Conclusion:
The details explained above help plan and design an end-to-end solution that is customized to the customer's needs with a cost-efficient design. It can scale to the required capacity and performance, while giving you full access to the control and data planes of the monitoring solution.