Azure Data Explorer supports native ingestion from Amazon S3

Update 03/01/2024 - S3 ingestion now supports presigned URLs.

 

Today we are excited to launch the ability to ingest data from Amazon Simple Storage Service (S3) into Azure Data Explorer (ADX) natively. 

 

Amazon S3 is one of the most popular object storage services. AWS customers use Amazon S3 to store data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, IoT devices, log analytics, and big data analytics. 

Azure Data Explorer (ADX) is a fully managed, high-performance, big data analytics platform that makes it easy to analyze high volumes of data in near real time. ADX supports ingesting data from a wide variety of sources such as Azure Blob, ADLS Gen2, Azure Event Hubs, Azure IoT Hub, as well as popular open-source technologies such as Kafka, Logstash, and Telegraf. With the new S3 support, customers can bring data in from S3 natively without relying on complex ETL pipelines. 

 

How does it work? 

 

The .ingest into command ingests data into a table by "pulling" the data from one or more cloud storage files. The command now supports Amazon S3 URLs with the syntax below; read more in the docs. 

 

Ingest a file using IAM credentials: 

 

.ingest into table Table (
  h'https://<bucket_name>.s3.<region_name>.amazonaws.com/<object_name>;AwsCredentials=<AWS_ACCESS_ID>,<AWS_SECRET_KEY>')

 

 

Ingest a file using S3 presigned URLs: 

 

.ingest into table Table (
  h'https://<bucket_name>.s3.<region_name>.amazonaws.com/<object_name>?<presigned_string>')

Please note that by using the above command, you interact directly with the Kusto Engine service. This isn't recommended for production-grade ingestion solutions because: 

  • The client code issuing the command might overwhelm the Kusto Engine service with ingestion requests, since it isn't aware of the Engine service's capacity. 
  • The client code must implement any retry and error-handling logic. 
  • Ingestion is impossible when the Kusto Engine service is unavailable. 

The recommended approach is to ingest the data via the Data Management service, which batches and ingests the data at high throughput. The ingestion batching policy can be set on databases or tables, as shown below. This method is the preferred and most performant type of ingestion.  
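
For illustration, a batching policy could be tuned with a control command like the following (the table name and threshold values here are assumptions; a batch is sealed as soon as the first threshold is met):

.alter table MyLogs policy ingestionbatching
'{"MaximumBatchingTimeSpan": "00:01:00", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'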

All the ADX SDKs have been updated to support S3 ingestion, including batch ingestion from S3.  

 

Continuous data ingestion from S3

 

Let’s imagine you have applications running on AWS that periodically store logs in S3, or that you use S3 as a staging layer, and you want to load that data into Azure Data Explorer for ad-hoc analysis and reporting. 

Prior to the S3 ingestion support in ADX, depending on the volume and frequency of the incoming data, you might use an ETL process to move data from S3 to Azure Blob before ingesting into ADX, or read the file content in AWS Lambda or Azure Functions and ingest directly into ADX. The former approach requires you to duplicate the data, adding cost and complexity, and the latter proves challenging, especially if you are moving large files.  

 

With the new native S3 ingestion support, you can simplify this process as shown below: 

 

[Figure: High-level flow]

 

Once a file lands in Amazon S3, S3 invokes your Lambda function asynchronously with an event that includes details about the object. The Lambda function, using the ADX SDK, sends the object URL along with authentication tokens to your Azure Data Explorer ingestion endpoint. 
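
As a minimal sketch, the Lambda handler could look like the following. This assumes the Amazon.Lambda.Core and Amazon.Lambda.S3Events packages, and IngestToAdxAsync is a hypothetical helper wrapping the SDK call shown in the code sample further below:

// Minimal handler sketch; IngestToAdxAsync is a hypothetical helper.
public async Task FunctionHandler(S3Event s3Event, ILambdaContext context)
{
    foreach (var record in s3Event.Records)
    {
        // Each record describes one object that landed in the bucket.
        var bucket = record.S3.Bucket.Name;
        var key = record.S3.Object.Key;
        await IngestToAdxAsync(bucket, record.AwsRegion, key);
    }
}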

 

Let’s see what happens behind the scenes when ADX receives the above details. 

 

[Figure: Low-level flow]

 

  1. S3 invokes AWS Lambda when a new object is received. 
  2. AWS Lambda, using the ADX SDK, posts a message to the Azure storage queue that includes the file metadata, the object URL, and an authentication token to fetch the file. 
  3. ADX gets notified of incoming files.
  4. Depending on your ADX batching policy, ADX pulls the data from S3 when a batch is sealed. 

 

The process in the dotted box is transparent to the end user and completely managed by ADX. 

 

Code sample 

 

We have a sample AWS Lambda function written in .NET that you can refer to in the GitHub repository. 

 

AWS Lambda in this scenario is extremely lightweight: it does not process the data and just sends a message on to ADX using the SDK. This keeps the Lambda cost minimal and relies on ADX to do the heavy lifting.  

The other thing to call out in the code sample is the endpoint you’ll be sending the data to: we use the ingestion endpoint of the Azure Data Explorer cluster. Note that in the code, environment variables are used to store the cluster details and ADX authentication credentials, which are used to dynamically build the ingestion endpoint with KustoConnectionStringBuilder. You can choose to store the authentication credentials in AWS Secrets Manager or Parameter Store, and you can similarly store the AWS IAM credentials securely. 
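
For illustration, the cluster details could be read from the environment like this (the variable names below are assumptions, not necessarily the ones used in the sample):

// Illustrative only: the environment variable names are assumptions.
var clusterName = Environment.GetEnvironmentVariable("ADX_CLUSTER_NAME");
var appId = Environment.GetEnvironmentVariable("ADX_APP_ID");
var appKey = Environment.GetEnvironmentVariable("ADX_APP_KEY");
var authority = Environment.GetEnvironmentVariable("ADX_TENANT_ID");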

// Use the cluster's ingestion endpoint (the "ingest-" prefix) for queued ingestion.
var kustoConnectionStringBuilderDM = new KustoConnectionStringBuilder($"https://ingest-{clusterName}.kusto.windows.net").WithAadApplicationKeyAuthentication(appId, appKey, authority); 

// The queued ingest client hands the request off to the Data Management service.
IKustoIngestClient client = KustoIngestFactory.CreateQueuedIngestClient(kustoConnectionStringBuilderDM); 

// Build the S3 object URL from the event record and append the AWS credentials.
var uri = $"https://{s3.Bucket.Name}.s3.{record.AwsRegion}.amazonaws.com/{s3.Object.Key}"; 

await client.IngestFromStorageAsync(uri: $"{uri};AwsCredentials={awsCredentials}", ingestionProperties: kustoIngestionProperties, sourceOptions); 
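
The snippet references a couple of variables defined elsewhere in the sample. As a rough sketch under assumed names, they might be set up like this (the database and table names are placeholders):

// Placeholders for illustration: target database and table in ADX.
var kustoIngestionProperties = new KustoIngestionProperties("MyDatabase", "MyLogs");

// Optional hints for the service, e.g. the source size taken from the S3 event.
var sourceOptions = new StorageSourceOptions { Size = s3.Object.Size };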

You can refer to the GitHub repository for detailed instructions on an end-to-end setup. 

 

Try this today, for free! 

  

If you’re interested in trying this out today, you can start with your own free Azure Data Explorer cluster to host your incoming data. For more details on getting started with free Data Explorer clusters, refer to ADX - start for free.   

 
