SQL Server Support Blog

5 MIN READ

Polybase for beginners.

Microsoft

Dec 20, 2019

Polybase for beginners.

In regards to Polybase, I have seen some confusion from customers about what it is, the problems that it solves and when to use it; so I decided to come up with a brief blog post to explain Polybase in some short paragraphs.

Hopefully after reading this you will walk out with a good enough grasp of it, even if you are completely new to this topic.

What is data virtualization?

In simple terms data virtualization means querying external data objects from a database as if they were local objects. The most basic example is to make a third party database table (such as Oracle) available for queries as if it was a SQL Server table.

Polybase is SQL Server’s 2019 data virtualization solution. SQL Server can now be the central hub for all your data.

When to use it?

It is common to see a wide variety of data sources and data platforms in modern organizations. By using data virtualization over disparate data sources, you get the following benefits:

A common language to query the data. Your staff does not need to learn several different programming/query languages and you can combine data using a unifying paradigm.
A common model for security. You can leverage the SQL Server and Active Directory permissions to simplify authentication and security administration.
No need to move the data and increase the stewardship.

So, this replaces data movement? I no longer need Extract Transform Load jobs?

This depends on the scenario. If all you do with your data movement process is moving data for one database to another (for example Mongo to SQL SERVER), data virtualization can very well fulfill the requirement.

If you need to process unstructured or semi structured files coming from an external source, where you cannot trust the validity of the data and you need to take actions regarding discarded rows and other types of data processing; your ETL packages are still necessary. Hopefully, this problem of format untrustworthiness in data exchange will become less of an issue in the future as data format standards are adopted and stricter data stewardship is implemented across organizations.

Isn’t this the same functionality that linked servers have?

Linked servers also allow for querying external data from SQL server however there are some key differences:

A linked server is a data source definition at the server level. Polybase objects represent not just the data source but the metadata for the tables, giving a more richer integration model out of the box, for the reasons I just explained in the previous section.

With linked there will be 1 thread (either in process or out of process) will be used in sql server during the reading from the data source. Polybase has dedicated services for processing and moving the external data.

Linked servers support data modification while Polybase doesn’t. Polybase was conceived as a technology for mainly reading data. This may change in the future.

Linked servers use OLEDB while Polybase uses ODBC.

Polybase offers out of the box support for SQL Server, Oracle, Teradata, MongoDB, Azure Blob Storage and Hadoop.

What are OLEDB and ODBC? How are they different?

Both ODBC and OLEDB are application programing interfaces for connecting to databases and both specifications were developed by Microsoft. They define a set of methods as well as datatypes that serve as a common interface for clients to read data.

Interestingly enough, OLEDB is a newer more feature rich technology, but ODBC is now preferred mostly due to having much wider adoption.

To use an ODBC connection, you must install the ODBC driver for the database you want to connect to. After the installation program registers the driver you can review your installed client drivers using the ODBC Data Source Administrator (64-Bit).

What it means exactly to have "out of the box" support for SQL Server, Oracle, Teradata, MongoDB, Azure Blob Storage and Hadoop?

First, it means that you don’t need to install any drivers for these data sources, Microsoft provides its own out of the box. You will use a specific “native protocol” name in the data source location definition.

For example, to define an Oracle data source the command would be like this:

CREATE EXTERNAL DATA SOURCE external_data_source_name

WITH (LOCATION = 'oracle://<server address>[:<port>]',

CREDENTIAL = credential_name)

For SQL Server an Oracle, it also means native support for partitioned tables. Polybase can read different partitions using multiple workers concurrently in a scale out group (scale out groups are a discussion for another day).

Use Example: Oracle

Let’s suppose that you want to access a table named PERSON on an Oracle contained database named CDB1, from a SQL SERVER 2019 instance with the Polybase feature installed.

Ensure your database has a master key created. If not you will need to create one.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongP@ssword1'

Create credential (username/password pair) to connect to your Oracle database. The credentials are safely stored in your database using the DB’s master key.

CREATE DATABASE SCOPED CREDENTIAL [ORA_C_HR] WITH IDENTITY = N'C##HR', Secret = 'StrongP@ssword1';

Now create the datasource definition.

The data source requires:

The location: This is a URL that contains the location prefix (the data source type, oracle, sqlserver, odbc, etc.), the name of the server and the port.
The credential to connect.
Any additional options for the driver in the form of a connection string can be passed using the CONNECTION_OPTIONS parameter.

More info: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver15

Let’s suppose that your server is named oraserver and listens in the default port (1521).

CREATE EXTERNAL DATA SOURCE [ORA_SERVER]

WITH (LOCATION = N'oracle://oraserver:1521', CREDENTIAL = [ORA_C_HR]);

Now you can create the table. In this example the table is named “Person” and is under the USER C##HR in the database CDB1, this is specified in the LOCATION option for the CREATE TABLE statement.

CREATE EXTERNAL TABLE dbo.[PERSON]

(

[ID] DECIMAL(38,0) NOT NULL,

[NAME] VARCHAR(100) COLLATE Latin1_General_100_BIN2_UTF8 NOT NULL

)

WITH (LOCATION = N'[CDB1].[C##HR].[PERSON]', DATA_SOURCE = [ORA_SERVER]);

You can now query the table Person like if it was a SQL Server native table. Remember you cannot issue update or insert statements, only queries.

SELECT TOP 100 * FROM dbo.[PERSON]

This is it for today’s post. Hopefully, you will have a better idea of what SQL SERVER 2019 Polybase is.

I will be posting more in-depth examples for generic ODBC types in the next days.

Regards.

Updated Dec 20, 2019