Query Delta Lake files using T-SQL language in Azure Synapse Analytics
Published May 27 2021 09:00 AM
Microsoft

Azure Synapse now enables you to query data stored in Delta Lake format. This has been one of the top feedback requests, and we are happy to announce that this feature is now available in public preview.

In this article you will learn how to run T-SQL queries on Delta Lake storage from your Synapse workspace.

 

What is Delta Lake?

Delta Lake is an open-source storage format that enables you to update your big data sets with guaranteed ACID transaction behavior. Delta Lake is a layer placed on top of your existing Azure Data Lake data and can be fully managed using the Apache Spark APIs available in both Azure Synapse and Azure Databricks.

Delta Lake is one of the most popular updateable formats in big data solutions. It is frequently used by data engineers who need to prepare, clean, or update data stored in a data lake, or who run machine learning experiments on that data.

 

Why would you query Delta Lake with Azure Synapse?

Azure Synapse provides a serverless SQL endpoint with every Synapse workspace. This brings a few key benefits when it comes to querying Delta Lake data:

  • Easy data sharing between Azure Synapse and Azure Databricks without the need to copy and transform data.
  • Connecting a large ecosystem of reporting and analysis tools with your data stored in Delta Lake format.
  • A favorable pay-per-use consumption model where you don’t need to pre-provision resources; you pay only for the queries that you execute.

You can use Azure Synapse and Azure Databricks to prepare and modify your Delta Lake data sets placed in Azure Data Lake storage. Once your data engineers have prepared the data, your data analysts can create reports using tools such as Power BI.

Using the serverless query endpoint in Azure Synapse, you can create a relational layer on top of your Delta Lake files that directly references the location where Azure Synapse and Azure Databricks modify the data. This way, you get real-time analytics on top of the Delta Lake data set without waiting for a pipeline to copy and prepare the data.

 


 

Data sharing without copying, loading, or transforming Delta Lake files is the main benefit of serverless SQL pools. The serverless endpoint in Azure Synapse represents a bridge between the reporting/analytics layer, where you use tools such as Power BI or Azure Analysis Services, and your data stored in Delta Lake format. This enables a variety of tools that work with T-SQL endpoints to access Delta Lake data.

In this solution, every role in your organization that works with big data can use the preferred tools to complete their tasks:

  • Data engineers can keep using the standard tools for data preparation and transformation (for example, notebooks in Azure Synapse or a Jupyter environment).
  • Data analysts can keep using their favorite reporting tools such as Power BI to analyze data and present the reports to the end users.

Azure Synapse enables your teams to implement end-to-end solutions on top of Delta Lake files with no friction and no need to change their standard tools.

 

How to query Delta Lake in Azure Synapse?

The serverless endpoint in Azure Synapse (serverless SQL pool) enables you to easily query data stored in Delta Lake format. You just need to provide the URI of the Delta Lake folder to the OPENROWSET function and specify that the format is DELTA, as shown in the query below. If you have plain parquet files, you can easily convert them to Delta Lake format using Apache Spark.

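A minimal sketch of such a query is shown below. The storage account, container, and folder names are placeholders; replace them with the location of your own Delta Lake folder.

-- Point BULK at the root of the Delta Lake folder (the folder that contains the _delta_log subfolder)
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/<delta-folder>/',
        FORMAT = 'DELTA'
    ) AS [result];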

 

This query enables you to explore the data in your Delta Lake data sets. The OPENROWSET function automatically determines the columns and their data types by inspecting the Delta Lake folder.

When you complete exploration, you can create views or external tables on top of your Delta Lake folder. Partitioned views are the preferred approach if you have a partitioned Delta Lake structure, because they allow your queries to eliminate the partitions that do not contain the data the query needs to return.
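As a rough sketch, a basic (non-partitioned) view over the same placeholder Delta Lake folder could look like the statement below; the view name dbo.MyDeltaTable is just an example, and the view must be created in a user database on the serverless SQL pool, not in master.

-- Create a reusable relational object over the Delta Lake folder
CREATE VIEW dbo.MyDeltaTable
AS
SELECT *
FROM OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/<delta-folder>/',
        FORMAT = 'DELTA'
    ) AS [result];

Reporting tools can then query dbo.MyDeltaTable like any other view, without referencing the underlying storage location directly.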

Tools and applications such as Power BI or Azure Analysis Services can read data from these views or external tables without needing to know that the underlying data is stored in Delta Lake format.

Learn more about Delta Lake query capabilities in the Synapse documentation.

 

Feedback

This feature is currently in public preview, and the Azure Synapse team is happy to hear your feedback.

Some features, such as temporal/time-travel queries, automatic synchronization of Delta Lake tables created in Spark pools, and updates of Delta Lake data, are not yet available in the public preview. We would be happy to get your feedback on new features for this scenario, so please post your ideas on the Azure Feedback site.

 

7 Comments
Iron Contributor

Are there known limitations?

I just tried to access a Delta Lake folder via serverless SQL pool, but it ran into an error. Apparently it is struggling with a datetime column.

Error handling external file: 'Inserting value to batch for column type DATETIME2 failed. Invalid argument provided.'. File/External table name: 'https://.../.../part-00002-45b1c248-95c6-40d0-b5cc-98e7e749caae-c000.snappy.parquet'.

Loading the folder into a PySpark dataframe via notebook works fine.

Microsoft

Hi @_MartinB,

 

The known issues are documented here: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand#delta... Please contact me in a direct message to share more details about this error. Looking at the message, it seems that you cannot read the underlying parquet file. Are you able to directly reference this file and read it using TYPE=PARQUET?
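For reference, a direct read of a single parquet file with OPENROWSET (where the parquet format is selected with FORMAT='PARQUET') would look roughly like the sketch below; the URL is a placeholder, not the actual file path from the error message.

-- Reference one of the parquet files from the Delta folder directly
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/<delta-folder>/<file>.snappy.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];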

 

Jovan

Copper Contributor

Small correction, Delta Lake is not part of Apache.  First reference in the article says "Apache Delta Lake". 

Copper Contributor

Hi @JovanPop ,

Querying from SSMS works like a charm when creating and querying the data through views.

These views can be used successfully in Power BI as well.


However, when trying to source Azure Analysis Services from these views, I get a NO_BROWSETABLE error message. Any recommendations or workarounds for this?

Thanks,

Paul


PS: using VS 2019 16.10.1 with Analysis Services Projects 2.9.17


 

 

Copper Contributor

What kind of data load can this architecture support? Does it support large volumes of data? Is there any case study or completed project that demonstrates support for a huge volume of data residing in Azure Data Lake?

Copper Contributor

Hi @JovanPop 

 

Since yesterday, I have been getting some errors on our serverless pool. We are mainly using the Delta format; the errors we are getting are below.

 

Attempt to create an empty container map on a non null topology

 

 

This had been working fine for the past few weeks without any issues. Is there any planned maintenance going on behind the scenes?

 

Could you help? Do you need more information?

 

Kind Regards,

Ahsan

Microsoft

@Ahsan245 @paulmooij Please file support tickets with a repro and information about the workspace.

Last update: Sep 15 2021 12:09 PM