07-05-2019 03:43 PM - edited 07-16-2019 11:12 AM
Over the past few years, Azure Data Explorer has become a critical part of running Live Site for most internal Azure services. This post describes a simple approach to model objects in your services using event streams ingested into Azure Data Explorer. The sample streams described below are based on real streams used by teams in Azure Compute (including my team) to monitor and manage all Virtual Machines (VMs) and host servers in Azure. We’ve built a lot of tooling in Azure Compute around the data modeling and query patterns shared below.
If you don’t already have a pipeline set up for your service logs to be ingested into Azure Data Explorer, take a look at this guide.
Object properties are modeled as columns of a table and each row is a snapshot representation of that object.
Let’s look at a sample event schema for a VM.
VMSnapshot // table name assumed for illustration
| take 1
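To make the examples below concrete, here is a sketch of what such a snapshot table could look like. The table and column names are assumptions for illustration, not the actual production schema:

```kusto
// Hypothetical schema for a VM snapshot table; names are illustrative.
.create table VMSnapshot (
    PreciseTimeStamp: datetime,  // when the snapshot was emitted
    VMId: string,                // unique VM identifier
    Region: string,              // Azure region hosting the VM
    State: string,               // lifecycle state, e.g. Running, Stopped, Deleted
    HostId: string,              // host server the VM is placed on
    BuildCommitHash: string      // build version of the host service
)
```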
Our services emit these snapshots periodically and on every column change – the latter is essential for seeing when things changed without waiting for the next snapshot period.
We do this for every key object modeled in the system – service endpoints, configuration values, VMs, host servers, etc. This snapshot pattern works well for long-lived objects – short-lived objects like requests typically have their properties emitted as part of the initial request processing. You can also consider pushing immutable properties of an object as a separate stream with a lower snapshot frequency.
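As a sketch of that last idea (every table and column name here is an assumption), the mutable state can be joined back to a lower-frequency immutable-properties stream on demand:

```kusto
// Hypothetical: combine the latest mutable state with a separate,
// lower-frequency stream of immutable properties.
VMSnapshot
| summarize arg_max(PreciseTimeStamp, State) by VMId
| join kind=inner (
    VMImmutableProperties            // hypothetical table, snapshotted e.g. daily
    | summarize arg_max(PreciseTimeStamp, VMSize, CreatedTime) by VMId
) on VMId
| project VMId, State, VMSize, CreatedTime
```

This keeps the high-frequency stream narrow while still making rarely-changing properties queryable.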
In the example above, if you wanted to get the latest state of a single object, this is quite straightforward:
VMSnapshot // table name assumed
| where VMId == "3f576830-2d8f-481b-8448-bca561090831"
| top 1 by PreciseTimeStamp desc
What if you wanted the latest snapshot for several objects at a time? This is where arg_max is helpful.
arg_max finds the row that maximizes the given expression within each group. If there are n VMs in your query, arg_max can be used to maximize PreciseTimeStamp for every unique VMId – the end result is n rows, where each row is the latest snapshot for that VM.
For example, the query below finds the latest VM state distribution in the region westus.
VMSnapshot // table name assumed
| where Region == "westus" and PreciseTimeStamp > ago(2h)
| summarize arg_max(PreciseTimeStamp, *) by VMId
| summarize count() by State
The other thing to note is that the query looks back 2 hours – 2x the snapshot interval, which is hourly for this table. Since the full snapshot event is emitted on every property change for the VM, this query gives you the latest state of every object in Azure Data Explorer, not just what was visible at the hourly snapshot.
Also, the query result might include VMs that have already been deleted. To deal with this, we model the deletion itself as a state on the snapshot, so that such results can be filtered if required.
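For instance, assuming deletion is modeled as a Deleted value in the State column (the exact value is an assumption, as is the table name), the earlier query can exclude deleted VMs after picking the latest snapshot:

```kusto
VMSnapshot
| where Region == "westus" and PreciseTimeStamp > ago(2h)
| summarize arg_max(PreciseTimeStamp, *) by VMId
| where State != "Deleted"   // assumed state value marking deletion
| summarize count() by State
```

The ordering matters: filtering before arg_max would simply surface each VM’s last non-deleted snapshot instead of dropping deleted VMs.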
Since we also model every service configuration value and the whole service itself as an object, we use the pattern above to track each config value and build version as rollouts occur in the fleet.
For example, the query below gives us the version distribution of our services running across Azure host servers.
HostSnapshot // table name assumed
| where PreciseTimeStamp > ago(2h) // 2x the snapshot interval
| summarize arg_max(PreciseTimeStamp, *) by HostId
| summarize dcount(HostId) by Region, BuildCommitHash
The BuildCommitHash in our snapshots enables us to track every code change committed as it’s deployed across the fleet. We have tooling that takes in a commit, gets the latest snapshots of the service, runs git merge-base and gives the engineer the exact spread of host servers that have that fix. We also use similar queries to set up reports using Microsoft Flow to track build versions, A/B testing of config and more.
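A sketch of what a config rollout report query could look like, assuming a hypothetical ConfigValue column on the host snapshot table (both names are assumptions):

```kusto
// Hypothetical: track the rollout of a config flag across hosts.
HostSnapshot
| where PreciseTimeStamp > ago(2h)      // 2x the snapshot interval
| summarize arg_max(PreciseTimeStamp, *) by HostId
| summarize Hosts = dcount(HostId) by Region, ConfigValue  // ConfigValue is an assumed column
```

The same shape works for any property being rolled out – swap the grouping column for build version, feature flag, or OS image.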
Let’s say you wanted to know when the version of a host service changed using snapshots – you could do a self-join with the same table, though this makes for a complicated query. This pattern is solved quite cleanly with the sort and prev operators.
The prev operator allows you to access rows before the current row being processed (from a sorted dataset). To find when a value changed, you can now compare row n to row n-1 quite easily.
HostSnapshot // table name assumed
| where HostId == "a3d1a948-53ae-4175-8030-d7d174c427d1"
| sort by PreciseTimeStamp asc
| extend shouldSelect = iff(BuildCommitHash != prev(BuildCommitHash), true, false)
| where shouldSelect
| project-away shouldSelect
The query above gives you the service changes for one host server. It can be modified to look at all host servers by sorting by HostId and then PreciseTimeStamp. The query below also creates a new column that contains the old value and the new value being transitioned to.
HostSnapshot // table name assumed
| sort by HostId, PreciseTimeStamp asc
| extend prevBuildCommitHash = prev(BuildCommitHash)
| extend buildCommitHashChange = case(
    HostId != prev(HostId), "",
    BuildCommitHash != prevBuildCommitHash, strcat(prevBuildCommitHash, " -> ", BuildCommitHash),
    "")
| where buildCommitHashChange != ""
| project PreciseTimeStamp, HostId, buildCommitHashChange
Besides these snapshot events, our services push a lot of other traditional log events to Azure Data Explorer. What if you wanted to get the last known property of an object as another log event occurs? For example, let’s say you wanted to get a property of the host server right before a Windows event occurs. The simplest option might be to log this property on the Windows event table itself – however, this approach doesn’t scale well when an object has hundreds of properties.
One of the biggest advantages to having your object snapshots and log events both be in Azure Data Explorer is that it enables you to easily combine this information – you can take any log event, find the object associated with that event and pull in properties for that object and add it to the original event, making it seem like that property was logged natively on the event.
This pattern is based on time window joins described here. The query below takes all host windows events and adds the last known BuildCommitHash of the host server to each event. Note that events without a matching snapshot prior to the event are dropped here, though this can be tweaked depending on your use case. The lookback window here is 2 hours, twice the snapshot interval for this object.
let lookupWindow = 2h;
let lookupBin = lookupWindow / 2.0;
HostWindowsEvents // events table; name assumed
| extend TimeKey = range(bin(PreciseTimeStamp - lookupWindow, lookupBin),
    bin(PreciseTimeStamp + lookupBin, lookupBin),
    lookupBin)
| mvexpand TimeKey to typeof(datetime)
| join kind=leftouter (
    HostSnapshot // snapshot table; name assumed
    | project Object1Time = PreciseTimeStamp, HostId, BuildCommitHash,
        TimeKey = bin(PreciseTimeStamp, lookupBin)
) on HostId, TimeKey
| extend timeDistance1 = case(
    BuildCommitHash == "", totimespan(365d), // No snapshot found. Ignore these records.
    Object1Time > PreciseTimeStamp, totimespan(365d), // Snapshot is newer than the event. Ignore this as well.
    PreciseTimeStamp - Object1Time) // Compute the time distance between the event and snapshot
| where timeDistance1 != totimespan(365d) // 365d is just used as a sentinel to filter out invalid matches
| summarize hint.shufflekey = HostId arg_min(timeDistance1, *) by HostId, PreciseTimeStamp // Find the closest snapshot. Shufflekey is used to optimize query performance.
| project PreciseTimeStamp, HostId, BuildCommitHash, Message
This pattern allows most log events to only carry object identifiers and pull in all properties about that object dynamically using snapshots. This is especially useful in cases where the service that emits log events doesn’t have all the relevant properties available about the object – this is becoming more prevalent with microservices. Internally, we use similar queries for A/B testing and automated anomaly detection.
In a follow-up post, I’ll be sharing more details about a Live Site tool built around these object snapshots that will be available on GitHub soon.
Edit: The follow-up post on Orb is here.