When Polybase was made available for use in Azure, Microsoft provided two schemas which allowed you to connect to Blob Store via WASBS and to Azure Data Lake via ADL. Authentication for Blob Store is Storage Account Key based, and ADL makes use of an Azure AD Application Service Principal; both of these will only allow you to connect to storage which has no network restrictions in place.
With Azure Storage Account V2 we've introduced a new schema called ABFSS which allows you to connect to both secured and unsecured storage. The following link provides additional information on Azure Storage Accounts.
The implementation is ideal for connecting to external storage with the least amount of effort and configuration at DW level, and you can be up and running in a few minutes, ready to import files from your storage account.
In Part 1 we reviewed the Secured Storage Account configuration with Polybase.
In Part 3 we will review integration tools which make use of Polybase to load data (ADF/Databricks).
In Part 4 we will review how to modify your existing Polybase configuration without recreating it.
Requirements
No firewall is enabled at the Storage Account level and Allow All Networks is enabled.
If you are making use of the ADL schema, the storage account has to be an Azure Data Lake Gen1 account.
If you are on an Azure Storage Account V2, you have to make use of the ABFSS schema.
Currently SAS tokens are not supported with External Data Source TYPE = HADOOP.
Database Security Consideration
Before we proceed further, we need to address the permissions a user requires to create all the required objects. To make this easier and more manageable, I would suggest creating a role within your Data Warehouse and making users members of that role, which avoids having to grant and revoke permissions individually as users are added or removed. The permissions required are elevated, and I would therefore advise that only Database Administrators, senior project members or knowledgeable users are granted them.
Below are all the permissions you will require within the database. The script will create the role, assign the permissions to the role and add the designated user account as a role member. Specify the required Data Warehouse database name and user account.
CREATE ROLE Polybase_Users;

GRANT CREATE TABLE TO Polybase_Users;
GRANT ALTER ANY SCHEMA TO Polybase_Users;
GRANT ALTER ANY EXTERNAL DATA SOURCE TO Polybase_Users;
GRANT ALTER ANY EXTERNAL FILE FORMAT TO Polybase_Users;
GRANT CONTROL ON DATABASE::[<Your_DW_Name>] TO Polybase_Users;

EXEC sp_addrolemember 'Polybase_Users', '<yourUsername>';
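If you want to verify the role and its membership afterwards, the following query against the database principal catalog views lists the members of Polybase_Users:

-- List all members of the Polybase_Users role
SELECT r.name AS role_name, m.name AS member_name
FROM sys.database_role_members drm
JOIN sys.database_principals r ON r.principal_id = drm.role_principal_id
JOIN sys.database_principals m ON m.principal_id = drm.member_principal_id
WHERE r.name = 'Polybase_Users';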
Step 1 - Obtain Storage Account Key
The Storage Account Key is obtained from the Azure Portal: browse to the target Storage Account resource, select the Access Keys blade and copy an access key from the window. (When you refresh or regenerate the keys, subsequent authentication will fail. If you do regenerate the keys, update the Database Scoped Credential with the new key as the secret.)
Step 2 - Database Credential Configuration
The next step is to configure the database credentials which will be used by your External Data Source.
If you have never created any external objects and this is the first time you are configuring Polybase, you have to start by creating a Master Key within your database. (Azure SQL Data Warehouse does not require a password to be specified for the Master Key. If you do make use of a password, keep a record of it.)
CREATE MASTER KEY; -- the ENCRYPTION BY PASSWORD = 'somepassword' clause is optional in Azure SQL Data Warehouse
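If you are unsure whether a Master Key already exists in the database, a quick check against the sys.symmetric_keys catalog view will tell you (the database master key is always named ##MS_DatabaseMasterKey##):

-- Returns a row if the database master key already exists
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';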
Once the key is created, proceed with creating the credential which will refer to the Storage Account Key you obtained in the previous step. The Storage Account Key will be stored as the secret of the Database Scoped Credential. The identity can be any text, except an OAuth 2.0 string when using the WASBS schema. OAuth 2.0 is only supported with the ADL schema on Gen1 and the ABFSS schema on Azure Storage Gen2 for Data Lake.
2.a. Blob Store Credential
CREATE DATABASE SCOPED CREDENTIAL <Your_Credential_Name>
WITH IDENTITY = '<Your Account Name>',
SECRET = '<azure_storage_account_key>' ;
2.b. Azure Data Lake Gen1 and Gen2 Credentials (OAuth 2.0 is now supported on Gen2)
CREATE DATABASE SCOPED CREDENTIAL <Your_Credential_Name>
WITH
IDENTITY = '<client_id>@<OAuth_2.0_Token_EndPoint>',
SECRET = '<key>' ;
Additional information on how to create the OAuth 2.0 Credentials can be found here.
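For context, the OAuth 2.0 token endpoint for a Data Lake Gen1 credential typically takes the form https://login.microsoftonline.com/<tenant_id>/oauth2/token, and the client id and key come from your Azure AD Application Service Principal. A minimal sketch with placeholder values only (the credential name ADLS_OAuth_Credential is just an example):

CREATE DATABASE SCOPED CREDENTIAL ADLS_OAuth_Credential
WITH
IDENTITY = '<client_id>@https://login.microsoftonline.com/<tenant_id>/oauth2/token',
SECRET = '<service_principal_key>' ;
-- <client_id>, <tenant_id> and <service_principal_key> are placeholders for your own
-- Azure AD Application (Service Principal) values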
For additional information review the documentation on Database Scoped Credentials.
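To see which Database Scoped Credentials already exist in the database (the secret itself is never exposed), you can query the sys.database_scoped_credentials catalog view:

-- List the credentials defined in the current database
SELECT name, credential_identity, create_date, modify_date
FROM sys.database_scoped_credentials;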
Step 3 - Create External Data Source
When creating the External Data Source, obtain the name of the Storage Account you are connecting to as well as the name of the container, or the ADL account name when connecting to Data Lake Gen1. The following schemas and paths should be used when connecting to storage, as explained in our documentation on External Data Sources.
| Data Source | Location Prefix | Path |
| --- | --- | --- |
| Azure Blob Storage | wasb[s] | <container>@<storage_account>.blob.core.windows.net |
| Data Lake Gen 1 | adl | <storage_account>.azuredatalakestore.net |
| Storage Account V2 | abfss | <container>@<storage_account>.dfs.core.windows.net |
3.a. Blob Store Connection
When connecting to a Blob Store you will have to create it as follows:
CREATE EXTERNAL DATA SOURCE <My_Data_Source_Name>
WITH ( LOCATION = 'wasbs://<My_Container>@<Storage_Account>.blob.core.windows.net/' ,
CREDENTIAL = <Your_Credential_Name> ,
TYPE = HADOOP ) ;
3.b. Data Lake Gen 1 Connection
When connecting to ADLS storage on Gen1 you will have to create it as follows:
CREATE EXTERNAL DATA SOURCE <My_Data_Source_Name>
WITH ( LOCATION = 'adl://<Storage_Account>.azuredatalakestore.net' ,
CREDENTIAL = <Your_Credential_Name>,
TYPE = HADOOP ) ;
3.c. Storage Account V2
When connecting to storage on a Gen2 Storage Account you will have to create it as follows:
CREATE EXTERNAL DATA SOURCE <My_Data_Source_Name>
WITH ( LOCATION = 'abfss://<My_Container>@<Storage_Account>.dfs.core.windows.net' ,
CREDENTIAL = <Your_Credential_Name>,
TYPE = HADOOP ) ;
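Once the data source is created, you can confirm how it was registered (location, type and linked credential) by querying the sys.external_data_sources catalog view:

-- List the external data sources defined in the current database
SELECT name, location, type_desc, credential_id
FROM sys.external_data_sources;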
Step 4 – Create the External file format
When connecting to files stored in an external location, we need to specify what the file looks like in terms of row and column delimiters, any specific date/time formats we need to persist, and what type of file the source file is. Additional information on this can be found here.
In the example I will create a file format that references a delimited text file which is comma delimited and has no header row.
CREATE EXTERNAL FILE FORMAT NoHeader_CSV
WITH (FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
USE_TYPE_DEFAULT = True)
)
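As a variation only, here is a sketch of a file format for source files that carry a header row, use a fixed date format and are gzip compressed. The name Header_CSV_Gzip is just an example; FIRST_ROW, DATE_FORMAT and DATA_COMPRESSION are documented options for delimited text in Azure SQL Data Warehouse:

CREATE EXTERNAL FILE FORMAT Header_CSV_Gzip
WITH (FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
FIRST_ROW = 2, -- skip the header row in each file
DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss',
USE_TYPE_DEFAULT = FALSE),
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
) ;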
Step 5 – Create External Table
The final step is to create the external table which will be consuming the credentials we’ve created and connecting to the External Data Source that has been provided.
When creating the above-mentioned objects in the Data Warehouse we are only validating the syntax, and if the syntax is valid the object will be created. Therefore, you will only know that there is a problem once you create your External Table, as only then do we authenticate against the storage using the information that has been provided.
CREATE EXTERNAL TABLE <YourTableName>
(
Column1 varchar(50) NULL,
Column2 int NOT NULL,
Column3 datetime2,
...
)
WITH
(
LOCATION = '/path/filename',--Path will be the root of the specified container
DATA_SOURCE = <My_Data_Source_Name> ,
FILE_FORMAT = NoHeader_CSV
) ;
If you followed the instructions and guidelines provided and specified a valid path and file name, you will be able to create the External Table successfully and select from the object without failure.
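From here, a common next step (sketched below under assumed table and column names) is to land the external data into a regular Data Warehouse table with CREATE TABLE AS SELECT, which loads from storage through Polybase in parallel:

-- Example staging load; dbo.MyStagingTable and the column list are placeholders
CREATE TABLE dbo.MyStagingTable
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT Column1, Column2, Column3
FROM <YourTableName> ;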
Solution Overview
The intention of the solution is to allow you to make use of Polybase as quickly as possible, with the fewest steps, to connect to a Storage Account with no firewall rules or network restrictions in place.
Part 3, Integration tools which make use of Polybase to load data (ADF/Databricks), will follow soon.