Polybase User Guide - Part 2: Unrestricted Storage, the Original Configuration
Published Aug 26 2019

When Polybase was made available in Azure, Microsoft provided two schemas which allowed you to connect to Blob Storage via WASBS and to Azure Data Lake via ADL. Authentication for Blob Storage is Storage Account Key based, and ADL makes use of an Azure AD Application Service Principal; both will only allow you to connect to storage which has no network restrictions in place.

 

In Azure Storage Account V2 we've introduced a new schema called ABFSS which allows you to connect to both secured and unsecured storage. The following link provides additional information on Azure Storage Accounts.

 

This implementation is ideal for connecting to external storage with the least amount of effort and configuration at the Data Warehouse level; you can be up and running in a few minutes, ready to import files from your storage account.

 

In Part 1 we reviewed the Secured Storage Account configuration with Polybase.

In Part 3 we will review Integration tools which make use of Polybase to load data (ADF/Databricks)

In Part 4 we will review how to modify your existing Polybase configuration without recreation

 

 

Requirements

No firewall is enabled at the Storage Account level and "Allow access from all networks" is enabled.

If you are making use of the ADL schema, the storage account has to be a Gen1 Azure Data Lake account.

If you are on an Azure Storage Account V2 you have to make use of the ABFSS schema.

Currently, SAS tokens are not supported with External Data Source TYPE = HADOOP.

 

Database Security Consideration

 

Before we proceed further, we need to address the permissions a user requires to create all the required objects. To make this easier and more manageable, I would suggest creating a role within your Data Warehouse and making users role members, which avoids having to grant and revoke permissions individually as users are added or removed. The required permissions are elevated, and therefore I would advise that only Database Administrators, senior project members or knowledgeable users are granted them.

Below are all the permissions you will require within the database. The script will create the role, assign the permissions to the role and add the designated user account to the role. Specify the required Data Warehouse database name and user account.

 

CREATE ROLE Polybase_Users;

GRANT CREATE TABLE TO Polybase_Users;

GRANT ALTER ANY SCHEMA TO Polybase_Users;

GRANT ALTER ANY EXTERNAL DATA SOURCE TO Polybase_Users;

GRANT ALTER ANY EXTERNAL FILE FORMAT TO Polybase_Users;

GRANT CONTROL ON DATABASE::[<Your_DW_DB_Name>] TO Polybase_Users;

EXEC sp_addrolemember 'Polybase_Users', '<yourUsername>';
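If you later need to confirm who holds the role, or remove a member, the standard catalog views and sp_droprolemember cover it; a quick sketch:

-- List current members of the Polybase_Users role
SELECT r.name AS role_name, m.name AS member_name
FROM sys.database_role_members rm
JOIN sys.database_principals r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals m ON rm.member_principal_id = m.principal_id
WHERE r.name = 'Polybase_Users';

-- Remove a user from the role when they leave the project
EXEC sp_droprolemember 'Polybase_Users', '<yourUsername>';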

 

Step 1 - Obtain Storage Account Key 

The Storage Account Key is obtained from the Azure Portal: browse to the target Storage Account resource, select the Access Keys blade and copy an Access Key from the window. (When you refresh or recycle the keys, subsequent authentication will fail, so if you do refresh them, update the Database Scoped Credential with the new key secret.)

(Screenshot: the Access Keys blade of the Storage Account in the Azure Portal)
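If you do rotate the keys after the credential from Step 2 exists, the credential can be updated in place rather than dropped and recreated; a minimal sketch (using the credential name you choose in Step 2):

ALTER DATABASE SCOPED CREDENTIAL <Your_Credential_Name>
WITH IDENTITY = '<Your Account Name>',
SECRET = '<new_azure_storage_account_key>'; -- the refreshed key from the portal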

 

Step 2 - Database Credential Configuration

 

The next step is to configure the database credential which will be used by your External Data Source.

If you have never created any external objects and this is the first time you are configuring Polybase, you have to start out by creating a Master Key within your database. (Azure Data Warehouse does not require a password to be specified for the Master Key; if you do use a password, take record of it.)

 

CREATE MASTER KEY; -- optionally: CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'somepassword';
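If you are not sure whether a Master Key already exists in the database, you can check the sys.symmetric_keys catalog view before running the statement above; a quick check:

SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##'; -- a row here means the Master Key already exists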

 

Once the key is created, proceed with creating the credential which refers to the Storage Account Key you obtained in the previous step. The Storage Account Key will be stored as the secret of the Database Scoped Credential. The identity can be any text, except an OAuth 2.0 string when using the WASBS schema; OAuth 2.0 is only supported with the ADL schema on Gen1 and the ABFSS schema on Azure Storage Gen2 for Data Lake.

 

2.a. Blob Store Credential 

 

CREATE DATABASE SCOPED CREDENTIAL <Your_Credential_Name>
WITH IDENTITY = '<Your Account Name>',
SECRET = '<azure_storage_account_key>';

 

2.b. Azure Data Lake Gen1 and Gen2 Credentials (OAuth 2.0 is now supported on Gen2)

CREATE DATABASE SCOPED CREDENTIAL <Your_Credential_Name>
WITH IDENTITY = '<client_id>@<OAuth_2.0_Token_EndPoint>',
SECRET = '<key>';

 Additional information on how to create the OAuth 2.0 Credentials can be found here.

For additional information review the documentation on Database Scoped Credentials.
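Once created, the credentials can be confirmed (secrets are never exposed) through the sys.database_scoped_credentials catalog view; a quick check:

SELECT name, credential_identity, create_date
FROM sys.database_scoped_credentials;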

 

Step 3 - Create External Data Source

When creating the External Data Source, obtain the name of the Storage Account you are connecting to as well as the name of the container, or the ADL account name when connecting to Data Lake Gen1. The following schemas and paths should be used when connecting to storage, as explained in our documentation on External Data Sources.

 

Data Source        | Location Prefix | Path
Azure Blob Storage | wasb[s]         | <container>@<storage_account>.blob.core.windows.net
Data Lake Gen 1    | adl             | <storage_account>.azuredatalakestore.net
Storage Account V2 | abfss           | <container>@<storage_account>.dfs.core.windows.net

 

3.a. Blob Store Connection

When connecting to a blob store you will have to create it as follows:

CREATE EXTERNAL DATA SOURCE <My_Data_Source_Name>
WITH ( LOCATION = 'wasbs://<My_Container>@<Storage_Account>.blob.core.windows.net/' ,
CREDENTIAL = <Your_Credential_Name> ,
TYPE = HADOOP ) ;

 

3.b. Data Lake Gen 1 Connection

When connecting to ADLS storage on Gen1 you will have to create it as follows:

CREATE EXTERNAL DATA SOURCE <My_Data_Source_Name>
WITH ( LOCATION = 'adl://<Storage_Account>.azuredatalakestore.net' ,
CREDENTIAL = <Your_Credential_Name>,
TYPE = HADOOP ) ;

 

3.c. Storage Account V2

When connecting to a Gen2 Storage Account you will have to create it as follows:

 

CREATE EXTERNAL DATA SOURCE <My_Data_Source_Name>
WITH ( LOCATION = 'abfss://<My_Container>@<Storage_Account>.dfs.core.windows.net' ,
CREDENTIAL = <Your_Credential_Name>,
TYPE = HADOOP ) ;
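Whichever schema you use, the created data sources can be reviewed via the sys.external_data_sources catalog view; a quick check:

SELECT name, location, type_desc
FROM sys.external_data_sources;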

  

Step 4 – Create the External file format

When connecting to files stored in an external location, we need to specify what the file looks like in terms of row and column delimiters, any specific date/time formats that need to be persisted, and what type of file the source is. Additional info on this can be found here.

 

In this example I will create a file format that references a delimited text file which is comma delimited and has no header row.

 

CREATE EXTERNAL FILE FORMAT NoHeader_CSV
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (
          FIELD_TERMINATOR = ',',
          STRING_DELIMITER = '"',
          USE_TYPE_DEFAULT = TRUE)
) ;
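If your files instead arrive with a header row, gzip compression, or a fixed date format, the same statement takes a few additional options; a sketch under those assumptions (the format name Header_CSV_Gzip is illustrative):

CREATE EXTERNAL FILE FORMAT Header_CSV_Gzip
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (
          FIELD_TERMINATOR = ',',
          STRING_DELIMITER = '"',
          FIRST_ROW = 2, -- skip the header row in each file
          DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss',
          USE_TYPE_DEFAULT = TRUE),
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
) ;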

 

 

Step 5 – Create External Table

 

The final step is to create the external table, which consumes the credential we've created and connects to the External Data Source that has been provided.

 

When creating the above-mentioned objects in the Data Warehouse, only the syntax is validated; if the syntax is valid, the object will be created. Therefore, you will only know that there is a problem once you create your External Table, as only then do we authenticate against the storage using the information that has been provided.

 

CREATE EXTERNAL TABLE <YourTableName>
(
Column1 varchar(50) NULL,
Column2 int, -- external tables do not support IDENTITY columns; use a plain data type
Column3 datetime2,
...
)
WITH
(
LOCATION = '/path/filename', -- the path is relative to the root of the specified container
DATA_SOURCE = <My_Data_Source_Name> ,
FILE_FORMAT = NoHeader_CSV
) ;

 

If you followed the instructions and guidelines provided and specified a valid path and file name, you will be able to create the External Table successfully and select from the object without failure.
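With the external table selecting cleanly, the usual next step in the Data Warehouse is to land the data in a regular distributed table with CTAS, the pattern the Part 3 tooling builds on; a minimal sketch (the table name and distribution choice are illustrative):

CREATE TABLE dbo.MyStagingTable
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM <YourTableName> ; -- Polybase reads the files in parallel during the load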

 

 Solution Overview

 

The intention of this solution is to allow you to make use of Polybase as quickly as possible, with the least number of steps, to connect to a Storage Account with no firewall rules or network restrictions in place.

 

Part 3, covering integration tools which make use of Polybase to load data (ADF/Databricks), will follow soon.

 

 

Version history
Last update: Apr 20 2020