In my previous article, we explored about the fundamentals of ADX export and leveraging ADX to overcome the challenges with big data export.
In this article, we will explore effective techniques to use with ADX export. ADX supports an on-demand export and continuous export. This article describes the techniques about on-demand export. Let's dive into the export command behavior and techniques to control the export behavior.
The below export query on my cluster outputs about few millions of records and the command tries to export all the data into the provided storage container using the blob secret.
.export async to csv (h@"https://bhaskar.blob.core.windows.net/kustoexport;BlobSecretKey")
with (sizeLimit=10000000, namePrefix="export", includeHeaders="all")
<|materialized_view('TransactMV')
| where Timestamp >= datetime("2023-02-01") and Timestamp <= datetime("2023-02-28")
By default, export commands are distributed such that there may be many concurrent writes to storage. That means, if you have more nodes & more extents in the cluster, most of the cluster nodes participate in the export operation and share the load to produce the files faster.
The outcome of the export operation and egress of ADX export works blazing fast.
Well, the export works without any issues. So, what is the problem?
The surplus of files generated by the distribution process through the export process creates various challenges for the consumer. Firstly, managing and organizing a large number of files becomes cumbersome and time-consuming. The consumer now needs to invest extra effort in sorting, categorizing, and maintaining an inventory of the numerous files.
Secondly, the increased volume of files can lead to confusion and difficulty in locating specific data or information. With a limited number of files, it is relatively easier for the consumer to navigate and access the desired content. However, when confronted with an excessive number of files, finding the relevant data becomes a daunting task, potentially resulting in delays and productivity losses.
So, how to limit the total number of export files while keeping up the performance of the operation?
Non-distribution:
One technique is to disable the distribution flag by using "distributed=false". This flag instructs ADX to use non-distributed approach and produce single file from each node unless its file size limit is reached.
.export async to csv (h@"https://bhaskar.blob.core.windows.net/kustoexport;BlobSecretKey")
with (sizeLimit=950000000, namePrefix="export", includeHeaders="all", distributed=false) <|
materialized_view('TransactMV')
| where Timestamp >= datetime("2023-02-01") and Timestamp <= datetime("2023-02-28")
The same export operation outputs very few export files when the distribution is disabled. Also, notice that another export file gets created only when the size is exceeding 950MB as specified in the export operation. ADX supports a max file size limit of 1GB for export operation.
Note: If the export query yields higher volume in say millions of records, this approach could lead to latency and memory issues as all the extents data has to collected and egressed by a single node on the cluster. So, use this approach with caution and for low volume data exports.
Data Partitioning: Another effective technique is by leveraging data partitioning. Data partitioning involves dividing your data into logical segments based on specific criteria such as time, location, or any other relevant attribute based on your table schema. By partitioning your data, you can export individual partitions as separate files, thus controlling the total file count. This approach helps in parallelizing exports, optimizing query performance, and simplifying downstream processing.
Note: Be sure to select the partition column which has high cardinality of your data for optimized performance.
Now, let's apply the partitioning technique to the original export query we used above and see the difference in results.
.export async to csv (h@"https://bhaskar.blob.core.windows.net/kustoexport;BlobSecretKey")
with (sizeLimit=950000000, namePrefix="export", includeHeaders="all") <|
materialized_view('TransactMV')
| where Timestamp >= datetime("2023-02-01") and Timestamp <= datetime("2023-02-28")
| extend P = hash(TransactSource, 4) // create an artificial partitionkey to split the data. You should choose a high cardinality column for this, to increase chances of even distribution between partitions
| partition hint.strategy=shuffle by P
(
project-away P // remove the extended partition column
)
The above export operation partitions the data by specified column into 4 partitions along with a partition shuffle hint strategy for optimized performance. At the end, the temporary partition is removed from the export operation using project-away. You can control the total number of partitions based on the extents data and this hints ADX to use total number of nodes in the cluster to participate in the export operation with partition.
Time Slicing: Another technique to export the data is by time slicing. If your data is time-series based or has a temporal aspect, you can employ time slicing techniques to limit the file count during export. Time slicing involves dividing the data into smaller time intervals, such as daily, weekly, monthly, or quarterly, and exporting each interval as a separate file. This approach simplifies data management and facilitates incremental processing and analysis.
// Export to limit the data by each week
.export async to csv (h@"https://bhaskar.blob.core.windows.net/kustoexport;BlobSecretKey")
with (sizeLimit=950000000, namePrefix="Feb01_Feb07_export", includeHeaders="all") <|
materialized_view('TransactMV')
| where Timestamp >= datetime("2023-02-01") and Timestamp <= datetime("2023-02-07")
| extend P = hash(TransactSource, 4) // create an artificial partitionkey to split the data. You should choose a high cardinality column for this, to increase chances of even distribution between partitions
| partition hint.strategy=shuffle by P
(
project-away P // remove the extended partition column
)
Export Batching: ADX provides the ability to batch export data, allowing you to specify the maximum size for each exported file. By carefully selecting appropriate batch sizes, you can ensure that the exported files meet your desired file count limit.
Here are some of the recommendations based on my experience with Export feature.
- Limit the export scope: Before initiating an export, carefully define the scope of data that needs to be exported. Avoid exporting unnecessary or redundant data. The smaller the dataset, the faster the export process will be.
-
Optimize query performance: Ensure that the underlying queries used for data retrieval are optimized for speed. Consider using appropriate query hints and leveraging ADX's query optimization features to minimize query execution time. Note: Try leveraging "consume" command to understand the query usage especially memory usage, runtime, caching look up etc., The "Consume" is a great handy command for any query. Use as-is query and append "| consume" at the end for usage.
-
Use proper partitioning: Partitioning your data can significantly enhance export performance. By dividing your data into smaller, manageable partitions, you can parallelize the export process, allowing multiple exports to occur simultaneously. This is especially beneficial when dealing with large datasets.
-
Utilize aggregations: If your export requires aggregated data, consider precomputing and storing the aggregations as part of your data model. This can eliminate the need for real-time aggregation during the export, resulting in faster performance.
-
Advantage of parallelism: ADX supports parallel query execution and export operations. Leverage this capability by using multiple parallel threads or processes to export data. Distributing the workload across multiple resources can significantly speed up the export process. Note: By default each cluster supports a maximum of 100 export operations. So, have proper checks in place when triggering multiple parallel threads, check the cluster capacity and implement proper queueing mechanisms.
These specific techniques and approaches you choose for ADX exports will depend on the nature of your application, data structure, and specific requirements. It's important to analyze your needs and tailor the export process accordingly.
Thanks for reading through this post and if you have any feedback or questions, feel free to comment on this post. Thank you for your time!