Finding Data Gaps Using Window Functions

Microsoft

My team focuses on processing streaming data. When there are gaps in streams for even a minute, it can cause inaccuracies in our output. Over the years, we’ve written a variety of different queries to find gaps in datasets, but I’ll describe one of our favorites below. If you deal with time series data too, be sure to check out the official docs for a lot more information about time series analysis.

 

For a quick visual inspection, I can use a make-series operator and a timechart:

T

| make-series count() on EventTime in range(ago(7d), now(), 1m)

| render timechart

 

With the chart, getting the exact start and end time of the gap means carefully moving my mouse around on the screen, and that’s prone to errors. The query below shows how I can get an ordered list of all the minutes that have data and then find any gaps bigger than five minutes.

let newSessionLimit = 5m;

let dataFrequency = 1m;

T

| make-series count() on EventTime in range(ago(7d), now(), dataFrequency)

| mv-expand EventTime to typeof(datetime), count_

| where count_ != 0

| order by EventTime asc

| extend prevTime = prev(EventTime)

| where EventTime - prevTime > newSessionLimit

| project GapStart=prevTime+dataFrequency, GapEnd=EventTime

 

GapStart

GapEnd

2019-02-06 01:55:00.0000000

2019-02-06 10:16:00.0000000

2019-02-06 16:55:00.0000000

2019-02-06 18:01:00.0000000

2019-02-06 18:55:00.0000000

2019-02-07 00:16:00.0000000

2019-02-07 00:15:00.0000000

2019-02-07 03:01:00.0000000

2019-02-07 04:55:00.0000000

2019-02-08 14:01:00.0000000

 

The query makes use of one of the window functions: prev(). The window functions operate on serialized/ordered data to help you do operations involving nearby rows. The prev() function references a field in the previous row. This lets us compare the EventTime in our current row with the one right before it and determine if there was a gap big enough to interest us.

 

The window functions also give you the ability to add row numbers and do cumulative sums, but be warned that serializing your data can incur a high performance penalty. Employ them with caution, but when used wisely, they can open up a whole world of analysis that isn’t otherwise available.

3 Replies

@Ben Martens Hi!

 

This query was super useful for myself and the work I do.

 

I was wondering if I could be lucky enough to ask a little question and hope you can send me in the right direction.

 

I work with streaming data that is available in the Data explorer. I use the exact same query above to assess whether there was a data gap in a certain place. The different sources are defined in a column field (a column called place lets say) and all the data for the different places is stored in the same table.

 

Is there a way to run the query below for each source?

 

Tried something like the below and some other things but didn't work.

 

| make-series count() on Time in range(ago(7d), now(), dataFrequency) by Place

At the moment, when I run the query for a particular place, I just define something like:

| where Place == "New York"

 

Hope my question makes sense.

 

Thanks in advance!

 

Regards,


Santi. 

 

 

@Santi143 Apologies for the delay while I was on vacation over the holidays. Yes, it's possible to do this for multiple sources. Here's an example where I've added the "EventSource" column and the query looks for gaps in each source individually even though the same event is coming from both sources:

 

let T =
datatable(EventTime:datetime, Event:string, EventSource:string)[
    datetime(2020-01-03 19:00), 'a', '1',
    datetime(2020-01-03 19:01), 'a', '1',
    datetime(2020-01-03 19:07), 'a', '1',
    datetime(2020-01-03 19:00), 'a', '2',
    datetime(2020-01-03 19:01), 'a', '2',
    datetime(2020-01-03 19:02), 'a', '2',
    datetime(2020-01-03 19:03), 'a', '2',
    datetime(2020-01-03 19:04), 'a', '2',
    datetime(2020-01-03 19:05), 'a', '2',
    datetime(2020-01-03 19:06), 'a', '2',
    datetime(2020-01-03 19:07), 'a', '2',
]
;
let newSessionLimit = 5m;
let dataFrequency = 1m;
T
| make-series count() on EventTime in range(datetime(2020-01-03 19:00), datetime(2020-01-03 19:10), dataFrequency) by EventSource
| mv-expand EventTime to typeof(datetime), count_
| where count_ != 0
| order by EventSource, EventTime asc
| extend prevTime = prev(EventTime), prevSource = prev(EventSource)
| where EventTime - prevTime > newSessionLimit and EventSource == prevSource
| project GapStart=prevTime+dataFrequency, GapEnd=EventTime, EventSource

  

Hi @Ben Martens

 

No problem at all. Thanks for taking the time to answer.

 

Hope you enjoyed the holidays!

 

Just tried it and works brilliant. Thanks a lot!

 

Regards,


Santi.