In regards to Polybase, I have seen some confusion from customers about what it is, the problems that it solves and when to use it; so I decided to come up with a brief blog post to explain Polybase in some short paragraphs.
Hopefully after reading this you will walk out with a good enough grasp of it, even if you are completely new to this topic.
In simple terms data virtualization means querying external data objects from a database as if they were local objects. The most basic example is to make a third party database table (such as Oracle) available for queries as if it was a SQL Server table.
Polybase is SQL Server’s 2019 data virtualization solution. SQL Server can now be the central hub for all your data.
It is common to see a wide variety of data sources and data platforms in modern organizations. By using data virtualization over disparate data sources, you get the following benefits:
This depends on the scenario. If all you do with your data movement process is moving data for one database to another (for example Mongo to SQL SERVER), data virtualization can very well fulfill the requirement.
If you need to process unstructured or semi structured files coming from an external source, where you cannot trust the validity of the data and you need to take actions regarding discarded rows and other types of data processing; your ETL packages are still necessary. Hopefully, this problem of format untrustworthiness in data exchange will become less of an issue in the future as data format standards are adopted and stricter data stewardship is implemented across organizations.
Linked servers also allow for querying external data from SQL server however there are some key differences:
Both ODBC and OLEDB are application programing interfaces for connecting to databases and both specifications were developed by Microsoft. They define a set of methods as well as datatypes that serve as a common interface for clients to read data.
Interestingly enough, OLEDB is a newer more feature rich technology, but ODBC is now preferred mostly due to having much wider adoption.
To use an ODBC connection, you must install the ODBC driver for the database you want to connect to. After the installation program registers the driver you can review your installed client drivers using the ODBC Data Source Administrator (64-Bit).
First, it means that you don’t need to install any drivers for these data sources, Microsoft provides its own out of the box. You will use a specific “native protocol” name in the data source location definition.
For example, to define an Oracle data source the command would be like this:
CREATE EXTERNAL DATA SOURCE external_data_source_name
WITH (LOCATION = 'oracle://<server address>[:<port>]',
CREDENTIAL = credential_name)
For SQL Server an Oracle, it also means native support for partitioned tables. Polybase can read different partitions using multiple workers concurrently in a scale out group (scale out groups are a discussion for another day).
Let’s suppose that you want to access a table named PERSON on an Oracle contained database named CDB1, from a SQL SERVER 2019 instance with the Polybase feature installed.
Ensure your database has a master key created. If not you will need to create one.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongP@ssword1'
Create credential (username/password pair) to connect to your Oracle database. The credentials are safely stored in your database using the DB’s master key.
CREATE DATABASE SCOPED CREDENTIAL [ORA_C_HR] WITH IDENTITY = N'C##HR', Secret = 'StrongP@ssword1';
Now create the datasource definition.
The data source requires:
Let’s suppose that your server is named oraserver and listens in the default port (1521).
CREATE EXTERNAL DATA SOURCE [ORA_SERVER]
WITH (LOCATION = N'oracle://oraserver:1521', CREDENTIAL = [ORA_C_HR]);
Now you can create the table. In this example the table is named “Person” and is under the USER C##HR in the database CDB1, this is specified in the LOCATION option for the CREATE TABLE statement.
CREATE EXTERNAL TABLE dbo.[PERSON]
[ID] DECIMAL(38,0) NOT NULL,
[NAME] VARCHAR(100) COLLATE Latin1_General_100_BIN2_UTF8 NOT NULL
WITH (LOCATION = N'[CDB1].[C##HR].[PERSON]', DATA_SOURCE = [ORA_SERVER]);
You can now query the table Person like if it was a SQL Server native table. Remember you cannot issue update or insert statements, only queries.
SELECT TOP 100 * FROM dbo.[PERSON]
This is it for today’s post. Hopefully, you will have a better idea of what SQL SERVER 2019 Polybase is.
I will be posting more in-depth examples for generic ODBC types in the next days.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.