In this blog, we detail how GitHub leveraged new COPY command functionality that automatically creates the target table when ingesting complex data types from Parquet files. These improvements bring efficiency and performance gains, all within the COPY command. Note that this feature is currently not available when loading hash-distributed tables from Parquet files.
Azure Synapse Analytics brings the worlds of data integration, big data, and enterprise data warehousing together into a single service for end-to-end analytics at cloud scale. The ever-increasing volume and variety of data has led to the introduction of new data and file types. Apache Parquet is a columnar file format that is efficient and provides optimizations to speed up queries. Parquet supports nested data types, allowing it to represent a huge variety of data. The COPY command does not need strict ‘CONTROL’ permissions on the data warehouse to load the data, and depending on the workload, its file-splitting feature provides better ingestion performance.
On 23rd September 2020, Microsoft announced automatic schema discovery within the COPY command, which gives you the option to perform automatic table creation. Using automatic schema discovery, you can ingest complex data types, a capability that was not previously easy to achieve. Complex data types ingested using the 'auto_create_table' flag in the COPY command are mapped to varchar(max) columns upon ingestion.
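As a minimal sketch of this flow, a COPY statement with automatic table creation might look like the following. The storage account, container, path, table name, and credential choice here are all hypothetical placeholders, not values from the original setup:

```sql
-- Hypothetical account/container/table names; complex columns (arrays, maps)
-- in the Parquet files will land as varchar(max) in the auto-created table.
COPY INTO dbo.support_issues
FROM 'https://myaccount.blob.core.windows.net/data/issues/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    AUTO_CREATE_TABLE = 'ON',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```

If dbo.support_issues does not already exist, the command creates it from the Parquet schema; simple columns keep their natural SQL types while complex ones become varchar(max).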
Defining and mapping source data into target tables is a cumbersome process, especially when tables contain a large number of columns. Automatic schema discovery simplifies this process and works alongside the automatic table creation process. When used, the system automatically creates the target table based on the schema of the Parquet files, which it derives by inspecting the first and last files (in alphanumerical order) within the list.
With the increase in the variety of data, there was a need for the COPY command to support complex data types. The COPY command can now recognize and ingest complex data types like maps and arrays, which gives customers increased performance and flexibility when it comes to data ingestion.
The structured rows of Parquet files are organized into Rowgroups. A Rowgroup is a group of rows that are compressed into the Columnstore format. A Columnstore is data that is logically organized as a table with rows and columns, and physically stored in a column-wise data format. Previously, customers had to choose the ‘COMPRESSION’ option manually when ingesting data. With the Auto Compression feature, you no longer need to choose the ‘COMPRESSION’ option; the auto-table creation process automatically discovers and handles input files that qualify for Rowgroup compression.
A number of customers, including GitHub, observed a substantial performance improvement with the easy-to-use automatic schema discovery, automatic table creation, and automatic compression features. Below we detail how GitHub was able to leverage these features.
GitHub is the world’s leading developer platform. As the home to more than 73 million developers across the globe, GitHub is where developers can create, share, and ship the best code possible through a single, seamless experience.
1. Parquet files containing complex data types, created using Trino/Hive, are stored in Azure Data Lake Storage (ADLS). Most of these Parquet files include attributes with complex data types like maps and arrays. Examples of the data include:
Labels and assignees for a support issue are stored as an array
Details of A/B experiments are stored as an array of maps
2. The COPY command is used to ingest data from Azure Data Lake Storage (ADLS) into Dedicated SQL pools within Azure Synapse Analytics. The complex data types are then parsed with the OPENJSON() function, and the data is made available for reporting.
Note: GitHub is currently migrating from Trino/Hive to Dedicated SQL pools within Azure Synapse Analytics. The above is a combination of the current and future state architecture.
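As a hedged sketch of the parsing in step 2, assume a hypothetical dbo.support_issues table whose labels column is a varchar(max) holding a JSON array such as '["bug","triage"]' after ingestion. OPENJSON can then shred the array into one row per element:

```sql
-- Hypothetical table and column names; the labels column is assumed to
-- contain a JSON array serialized into varchar(max) during ingestion.
SELECT i.issue_id,
       l.[value] AS label
FROM dbo.support_issues AS i
CROSS APPLY OPENJSON(i.labels) AS l
WHERE l.[value] = 'bug';   -- e.g. find all issues tagged 'bug'
```

CROSS APPLY runs OPENJSON once per row, so each issue fans out into as many rows as it has labels, which makes per-tag filtering and aggregation straightforward.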
GitHub analyzes and derives insights from the various customer support tickets it receives. The different categories of support tickets are tagged uniquely, creating an unbounded number of tags. The tags are stored in an array format, and customers want to analyze the data associated with a particular kind of tag.
As GitHub stores data in various complex data types, arrays being one of them, it needs to analyze and process these complex data types so that the data can be made available to customers for analysis and for deriving insights.
During the data ingestion process into Dedicated SQL pools within Azure Synapse Analytics, PolyBase did not support complex data types such as maps and arrays. This lack of support prevented complex data types from being ingested.
Automatic schema discovery, along with the automatic table creation feature, was used, and complex Parquet data types were automatically mapped to varchar(max) columns. This is embedded in the COPY command. JSON functions are then used to parse the attributes of the complex data types.
Using the ‘COPY’ command with the ‘Auto Create Table’ flag turned on, the data ingestion process creates a new table automatically by discovering the structure of the Parquet files from the first and last files (in alphanumerical order) within the list.
COPY INTO [table] FROM 'location' WITH (..., AUTO_CREATE_TABLE = 'ON')
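After ingestion, one way to confirm how the complex columns were mapped is to inspect the auto-created table's metadata. This is a generic T-SQL sketch using a hypothetical table name:

```sql
-- Hypothetical table name; complex Parquet columns should appear as
-- varchar with max_length = -1 (i.e., varchar(max)).
SELECT c.name       AS column_name,
       t.name       AS type_name,
       c.max_length
FROM sys.columns AS c
JOIN sys.types   AS t ON c.user_type_id = t.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.support_issues');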
The screenshot above (Azure Portal, Azure Synapse Analytics) depicts the new complex array data type from a Parquet file that is now supported.
Automatic schema discovery, along with the auto-table creation process, makes it easy for customers to automatically map and load complex data types present in Parquet files, such as arrays and maps, into Dedicated SQL pools within Azure Synapse Analytics. Rowgroup compression is automatically enabled when customers enable the auto-create table option within the COPY command. Start taking advantage of all these features today to simplify data ingestion with Azure Synapse Analytics!