Overcoming the Challenges of Big Data Export with ADX Export
Published Feb 23 2023 02:52 AM 6,474 Views
Microsoft

Overview

One of the key services we manage today in one of our financial reporting portals (A portal that gives access to exclusive tools to our customers and drives their business growth) is "Export service" that provides ability to export customer's business critical information. With the click of a button, users can export key financial data such as account balances, transaction history, and other important financial information. This feature is incredibly powerful as it enables our customers export their transactional data to a variety of file formats, such as CSV (Comma separated value) or TSV (Tab separated value) and use it for data analysis to make informed decisions in their financial, marketing, and overall business strategies.  

 

Today, the export service taps into Elasticsearch that stores our big data, to extract valuable data that empowers customers to evaluate their business needs. Exporting data from a large data store like Elasticsearch can be a complex and time-consuming process, particularly when dealing with big data containing vast amounts of information. 

 

Currently, there are many challenges associated with exporting data from Elasticsearch. These include issues such as

  • Slow query performance
  • Need for data transformation.
  • Unable to perform multiple aggregations at query runtime.
  • Joining multiple entities at query runtime.

Additionally, the sheer volume of data can make it difficult to manage and analyze, particularly when working with legacy systems or systems that have been in place for many years. 

 

With recent migration from Elasticsearch to ADX, we've migrated billions of data (approximately 30TB+) to ADX. We're managing billions of data and trying to make sense of it all. Well, what if we told you that we've just discovered a secret tool that makes life a whole lot easier?

"Yes, we're talking about the Export feature with ADX"!

 

This article provides an overview of the challenges with the current export service using Elasticsearch and the improvements and simplified approach of the same export service using ADX (Azure Data Explorer) export feature. 

 

Export service using Elasticsearch

As described above, customers require data for their organizational needs that empower their business needs.  Some of the scenarios include:

  • A Customer wants to export 6 months worth of their sales transactions for 6 months. 
  • A Customer wants to export their revenue information for the last 3 years for their organization’s location.  

These export operations are pretty intensive, and the current volume of these export requests range from thousands to millions of records. 

 

Below is the high-level approach of the service that extracts the data from Elasticsearch, transforms it, and provides an export file for customers once the operation is complete.

 

High level workflow of export process using ElasticsearchHigh level workflow of export process using Elasticsearch

 

  1. Request initiation: The customer creates an export request through the API or customer portal. The request includes any filters, sorting options, or other parameters to define the scope of the export.
  2. Process request: The export service validates the export request, adds it to the request queue and starts the export process asynchronously in the background. This allows the user to continue to use the UX application while the export operation is running.
  3. Process request data in the memory layer: The export service reads the requested data from Elasticsearch and loads it into a memory layer. This is pretty intensive operation as bigger export requests takes longer time to read the data, adding necessary headers, and applying any requested filters or sorting options. 
  4. Transform data: The service compresses the file(s) into a ZIP archive to reduce the file size and make it easier to download and finally uploads to a blob.
  5. Download file: Once the service has completed all of the steps above, it makes the file available for the customer in a secured way. The customer is provided with a time-limited URL generated with shared access signature (SAS) token to secure the download. The customer can then download the file using the secure URL from the application.

However, due to the sheer volume of data involved and the complexity of working with Elasticsearch, the export service described above sometimes experiences issues with timeouts and transient errors. These issues cause several problems for customers, which is why we are constantly working worked on to optimize and improve the performance of our service to give better experience to our customers. 

 

Challenges

The current approach for exporting data involves reading data from Elasticsearch using a .NET service. This approach has led to several challenges, including high memory consumption, frequent failures, timeout issues. There is a lot of overhead especially the service needs to be scaled up to perform intensive big export operations.  

  1. Scalability: As the amount of data in the big data store grows on daily basis, it becomes increasingly difficult and resource-intensive to process and export data in a timely manner. This can lead to performance issues, time-consuming processes, and increased costs.

  2. Data processing complexity: In some cases, it may be necessary to preprocess or transform the data before exporting it. This can involve significant time and resources to build and maintain the processing infrastructure, as well as additional storage requirements for intermediate data.

    1. Aggregations:  One of the frequent issues is aggregating the data with Elasticsearch. Elasticsearch offers limited number of aggregation types, and there some types of grouping aggregations that aren't possible with Elasticsearch. The more complex the aggregation query, the more likely it is to impact search performance.
    2. Joins: Unlike traditional database, Elasticsearch is not a SQL type of database and doesn't have a direct way to join the indices. It offers different flavors to combine the data in the form of multi-index search or denormalization of data. This can be cumbersome and leads to performance issues especially when you want to join multiple indices on a larger volume of data. 
  3. Performance: Exporting large volumes of data can be resource-intensive, requiring significant amounts of processing power, memory, and storage. This can lead to performance issues, time-consuming processes, and increased costs.

Simplified export service using ADX

One of the key advantages of using ADX for exporting data is its ability to handle large datasets with ease. Unlike Elasticsearch, which can sometimes struggle to handle millions of records, ADX has been specifically designed to handle big data at scale. This means that customers can rely on the portal to provide them with a comprehensive dataset that captures the full scope of their financial transactions, sales details, revenue figures, and other key business metrics. This is achieved through a number of key features, including the ability to perform data transformations and filtering, as well as the ability to export data in a variety of different formats. 

 

Below is high level workflow of the same export service that is leverages the power of ADX’s export feature. The benefit of using ADX for exporting data is the simplicity and ease ease-of of-use of its export functionality. With ADX, customers can quickly and easily generate a custom export request through the customer portal or API, specifying their preferred filters and other parameters. Once the request is submitted, ADX processes the data in the background and generates a download link that the customer can use to access the data. The data file is generated and available to download within a few moments. High level workflow of export process using ADX export feature.High level workflow of export process using ADX export feature. 

 

 

By leveraging ADX & its powerful export feature, the challenges on the service side are simplified as the complete export operation is handled through ADX. The overhead of reading the data from the data store by chunks, formatting it and constructing the file at the service end is completely avoided by using ADX export as it does everything for you. This approach also avoids common timeout issues, memory issues and other file management issues. 

 

Some of the major challenges like aggregations, joins are simplified as ADX export has higher performance with aggregations. On the other hand, ADX provides a simple and an intuitive language, Kusto Query Language (KQL), which allows customers to perform complex aggregations with ease. KQL has built-in functions for common aggregation tasks like summing, counting, and averaging data, as well as more advanced operations like percentile calculations and trend analysis. KQL also supports group-by and time-series aggregation, which makes it easy to organize and summarize large datasets.

 

By utilizing the ADX export feature, we observed data exports 20-30% faster than our current process of reading data directly from Elasticsearch.

 

In summary, while Elasticsearch offers a wide range of aggregation features, ADX provides a more simplified and optimized approach to aggregation, thanks to its columnar database architecture, intuitive query language, and scalable cloud-based infrastructure. This makes ADX a great choice for businesses and organizations that need to perform complex data analysis and visualization on large datasets.

 

Using the ADX export command is incredibly simple, allowing for easy and efficient retrieval of data without the need for complex configurations. Below is a sample export operation that exports transactional data of a customer from an ADX table to a storage account. For more information refer to this original documentation on exporting data from ADX to storage account  and exporting to external tables - Export data to SQL - Azure Data Explorer | Microsoft Learn.

 

 

 

 

// A sample async export command that exports all the exchange rates information.
.export
  async compressed
  to csv (
    h@"https://mystorage.blob.core.windows.net/adxexport/request1;secretKey"
  ) with (
    sizeLimit=9500000, // In bytes, max size: 1GB
    namePrefix="export", 
    includeHeaders="all",
    encoding ="UTF8NoBOM"
  )
  <| SaleTransactions | where FiscalYear  == 2022 and CustomerId == 252423 

 

 

 

The ADX export feature offers several advantages for exporting data from a big data store, including:

  1. Query optimization: ADX is designed to optimize queries for fast and efficient data retrieval. This means that data can be read directly from the big data store, without the need for a separate processing layer. This reduces the time and resources required for data export and provides a more efficient and streamlined process.

  2. Customizable export formats: ADX supports a range of export formats, including CSV, TSV, JSON, and Parquet. This allows users to export data in the format that best suits their needs, without the need for additional data processing or conversion steps. This can save time and resources and reduce complexity.

  3. Automatic compression: ADX supports automatic compression of exported data, which can reduce the amount of storage space required for the exported data. This can save on storage costs and improve performance by reducing the amount of data that needs to be transferred.

  4. Scalability: ADX is highly scalable and can handle large volumes of data and concurrent user requests without performance degradation. This means that it can provide a reliable and efficient data export process, even in scenarios with high demand.
  5. Integration with other Azure services: ADX can be easily integrated with other Azure services, such as Azure Blob Storage, Data Factory, and Logic Apps. This enables users to build customized and scalable data export pipelines, without the need for complex coding or infrastructure management.

Overall, using ADX export offers several advantages for exporting data from a big data store. It provides a more efficient, scalable, and secure process for exporting data, with customizable export formats, automatic compression, and integration with other Azure services. This can save time and resources, reduce complexity, and provide a more reliable and secure data export process.

 

Another powerful way of exporting data is by using Azure Data Share serviceUsing Data Share, a data provider can share data and manage their shares all in one place. They can stay in control of how their data is handled by specifying terms of use for their data share. The data consumer must accept these terms before being able to receive the data. Data providers can specify the frequency at which their data consumers receive updates. Access to new updates can be revoked at any time by the data provider.

 

Related Articles:

 

Let us know if you have any comments on this article for any improvements. In the future article, will share more details about on ADX export usage with examples and tips.

 

Co-Authors
Version history
Last update:
‎Feb 22 2023 05:50 PM
Updated by: