Last month at //build 2019 we announced several new capabilities that tightly integrate Azure Data Explorer with the Azure data lake, increasing flexibility and reducing costs for running cloud-scale interactive analytics workloads. The first, Continuous Export, continuously saves data streaming into Azure Data Explorer to the lake, producing well-formed hierarchical data lake constructs, a task that typically requires separate, expensive, high-latency, batch-oriented compute workloads. The second is the ability to use Azure Data Explorer to query data stored in the lake in its native format. Both capabilities rely on external tables, one as an export target and one as a query input, and both are in public preview. Here is how to unlock even more value from your data lake investment.
Azure Data Explorer is an extremely fast big data interactive analytics engine, delivering blazing fast queries over billions of records. That performance is achieved, among other factors, by leveraging high-speed SSD drives, data-compute proximity, and highly optimized data organization and indexing, as described in the Azure Data Explorer whitepaper. Still, there are situations in which the data you need to analyze was never ingested or has already been archived: years of historical data already sitting in the lake, large streams of data that are rarely queried, or data exported out of Azure Data Explorer for archiving purposes.
With the new external table query capability, the data lake is now at your fingertips. You can leverage the same powerful and user-friendly query language to interactively explore and analyze data where it resides, whether it is ingested into Azure Data Explorer, stored in the data lake, or both.
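As a minimal sketch, assuming an external table named TaxiRidesArchive defined over files in the lake with a pickup_datetime column (both names are illustrative placeholders):

```kusto
// Count rides per day directly over the files in the data lake,
// without ingesting them into the cluster first.
external_table("TaxiRidesArchive")
| where pickup_datetime between (datetime(2018-12-01) .. datetime(2018-12-31))
| summarize Rides = count() by bin(pickup_datetime, 1d)
```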
In this //build demo, you can see how patterns from the current year, with data available in Azure Data Explorer hot storage, are compared with patterns from previous years that are available only in the data lake.
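That pattern can be sketched roughly as follows, reusing the hypothetical names from above: an ingested TaxiRides table holds the current year on the cluster, while the TaxiRidesArchive external table exposes earlier years from the lake.

```kusto
// Daily ride counts for the current year, served from data ingested in the cluster...
let thisYear = TaxiRides
    | where pickup_datetime >= datetime(2019-01-01)
    | summarize Rides = count() by Day = dayofyear(pickup_datetime)
    | extend Source = "2019 (cluster)";
// ...compared with the same measure for the previous year, read straight from the lake.
let lastYear = external_table("TaxiRidesArchive")
    | where pickup_datetime between (datetime(2018-01-01) .. datetime(2019-01-01))
    | summarize Rides = count() by Day = dayofyear(pickup_datetime)
    | extend Source = "2018 (data lake)";
union thisYear, lastYear
| render linechart
```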
You can also try it yourself: use the Samples database on the help cluster to explore the New York City taxi rides public data set by running a few queries:

* Which cab type is used more, green or yellow?
* Cab type usage over a month
* Comparing two Christmas weeks
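The first of these, for instance, reduces to a short aggregation. A minimal sketch, assuming an external table and a cab_type column named as below (both are placeholders; use whichever names the sample queries reference):

```kusto
// Compare total ride counts for green vs. yellow cabs,
// reading the raw files directly from the lake.
external_table("TaxiRides")
| summarize Rides = count() by cab_type
| render piechart
```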
Alternatively, use your own Azure Data Explorer cluster: create an external table pointing to the same raw data in Azure Storage and query that external table.
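A minimal sketch of such a definition, assuming CSV files in a blob container you own (the account, container, schema, and key placeholder below are all illustrative):

```kusto
// Define an external table over CSV blobs in Azure Storage.
// Nothing is ingested; queries read the files in place.
.create external table TaxiRidesArchive (
    pickup_datetime: datetime,
    cab_type: string,
    trip_distance: real,
    fare_amount: real
)
kind=blob
dataformat=csv
(
    h@'https://<storageaccount>.blob.core.windows.net/taxirides;<account-key>'
)
```

Once defined, the table is queried with external_table("TaxiRidesArchive"), exactly as in the earlier examples.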
Now, with queries directly over the data lake, users can be even more efficient by choosing the right storage approach for each data set: keep frequently queried data ingested in Azure Data Explorer, where it benefits from the hot cache and native indexes, and leave rarely queried or archived data in the lake, queried in place through external tables*.

* Parquet support coming soon. Parquet files do carry standard column-based indexes, but these are not as performant as Azure Data Explorer's native indexes.
Based on actual usage and the frequency of the different query patterns, data engineers and developers can decide to ingest data from the data lake to accelerate queries over it, or to set up continuous export plus a retention policy to continuously move data out of the cluster into a queryable archive that can also be easily used by other tools and engines.
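As an illustrative sketch (the export name, interval, and retention period are placeholders, and the two commands are issued separately), that setup might look roughly like this:

```kusto
// Continuously export new data arriving in TaxiRides to the
// external table in the lake, roughly once an hour.
.create-or-alter continuous-export TaxiRidesExport
to table TaxiRidesArchive
with (intervalBetweenRuns=1h)
<| TaxiRides

// Keep only the most recent year on the cluster; older data stays
// queryable (and usable by other engines) through the external table.
.alter-merge table TaxiRides policy retention softdelete = 365d
```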
Now, how do these queries affect costs? Queries over the data lake read data from the data lake, and as such generate read transactions that are billed to the storage account's subscription. The compute cycles required to evaluate the query run on the Azure Data Explorer cluster's dedicated compute resources; they contribute to the overall load on the cluster but do not generate any direct billing, as long as the load generated by these queries does not require an increase in cluster size.
External tables are in preview, which means both that we would love to hear your feedback, questions, and suggestions (Twitter, Tech Community, Stack Overflow, UserVoice) and that more functionality is on the way, such as Parquet support and additional URL partitioning schemes.