Blog Post

Azure Data Explorer Blog
3 MIN READ

Now in preview: Materialized Views Over Materialized Views

DrewFurgiuele's avatar
DrewFurgiuele
Icon for Microsoft rankMicrosoft
Mar 15, 2022

Consider this: you’ve developed a data ingestion strategy that is taking in remote thermostat readings. Usually, the devices report in on a set frequency and you’re able to calculate aggregate readings an hourly interval. A materialized view could be created that does this calculation and stores the results out for querying. But what if something causes some of this data to become duplicated? You’d first have to eliminate these duplicates, re-ingest the data, and then do your calculations again.

 

This is where we can leverage creating a materialized view over a materialized view. Our first materialized view will handle the deduplication, and our second can handle the aggregation of the deduplicated data.

 

At a high level, this is how you’d accomplish this:

 

  1. First, create a materialized view over your source data using the take_any(*) function. This function will return one row out of a collection of identical rows, based on your aggregation criteria.
  2. Once that view is created, you’ll then create another down sampling materialized view that references the first materialized view.

Here’s a sample of our raw data that is being ingested into our database. We’ve got a unique device ID and some sample readings by the time they were generated on the device to when they were queued to our Event Hub. One table is holding both the readings for temperature and humidity, by device:

 

 

This data looks pretty good, but what if some of these readings become duplicated? Our first materialized view will take care of effectively de-duping any readings that come in during that five-minute window:

 

.create materialized-view with (lookback="12:00:00") SmartDeviceDedupe_MV on table SmartDeviceTelemetry {
    SmartDeviceTelemetry
| summarize take_any(*) by DeviceID, ReadingType,  EventCreatedTime
}

In the code above, we’re creating our new materialized view. We’ll read the data from the SmartDeviceTelemetry table, then use the take_any(*) function aggregated by the device ID and  reading type, and EventCreatedTime (which is our potential duplicated value).

 

We’re also setting the lookback property (which tells the view how far to look back at the ingestion time of the records to look for matches, which can lessen view overhead when looking for duplicates). We’re setting this to 12 hours, since we’re confident we shouldn’t see duplicates outside of a that time window based on how our devices are sending data.

 

Let’s see how it looks:

 

SmartDeviceDedupe_MV

 

As you can see, our materialized view is now storing one reading for a given 5 minute period, for each device, for each reading type. With our first materialized view in place, we'll create our next materialized view on this data, which will be a view that summarizes the minimum, maximum, and average reading types for each device by hour, using this deduped data provided by the first materialized view:

 

.create materialized-view SmartDeviceReadingsPerHour_MV on materialized-view SmartDeviceDedupe_MV {
SmartDeviceDedupe_MV
| summarize max(Measurement), min(Measurement), avg(Measurement) by DeviceID, ReadingType, bin(EventCreatedTime, 1h)
}

 

This view now aggregates and summarizes the average, minimum, and maximum temperature readings from each device using the de-duped  data that the first materialized view is providing, and since this is a materialized view itself, it is also persisted in the same way.

 

It is also worth noting that the function materialized_view() is a best practice when querying nested views like ours because it instructs the engine to ONLY look at the data that is already materialized in the view. The function can take an optional max_age parameter that allows you to specify a time range to check if you need to return just the materialized data or the entire view. You can still query the new materialized view like you would any other table or view to retrieve all the data but know that performance can suffer without it.

 

Then we can query this new materialized view to summarize our readings take the minimum and maximums (24 hours) for temperature readings:

 

SmartDeviceReadingsPerHour_MV
| where ReadingType == "Temperature"
| summarize DailyMax=max(max_Measurement), DailyMin=min(min_Measurement) by DeviceID, bin(EventCreatedTime, 1d)

Or, using the materialized_view() function for only the materialized data:

 

materialized_view("SmartDeviceReadingsPerHour_MV") 
| where ReadingType == "Temperature"
| summarize DailyMax=max(max_Measurement), DailyMin=min(min_Measurement) by DeviceID, bin(EventCreatedTime, 1d)

 

 

Conclusion

 

Materialized views can be an extremely powerful and high-performing option for your data inside of Azure Data Explorer. With the inclusion of the materialized views over materialized views feature, you can extend this out even further for very large datasets that need deduplication and additional aggregation. For more details around how materialized views are created and more examples, be sure to check out the documentation. Happy querying!

Updated Mar 15, 2022
Version 3.0
  • wernerzirkel's avatar
    wernerzirkel
    Brass Contributor

    It might be helpful if the materialized views could controlled by a materializationbatching policy some day. In this example, the second materialized view might not have to merged within seconds, because it provides hourly aggregations anyway. Imagine using bin(eventCreatedTime,1d). The db admin could optimize the cluster performance with this.

  • wernerzirkel's avatar
    wernerzirkel
    Brass Contributor

    How does the lookback parameter work in the follower matView? 

     

    Example:

    view1 with (lookback="30:00:00") on table1    //this will check the ingestion_time of table1

    view2 with (lookback="01:00:00") on view1    // does view2 use the ingestion_time of view1 ? Or does it look at table1?