ADF now supports data integration with Snowflake
Published Jun 07 2020 08:21 PM
Microsoft

Azure Data Factory provides more than 90 built-in connectors, allowing you to easily integrate with various data stores regardless of their variety or volume, whether they are on premises or in the cloud.

 

We are glad to share that ADF has newly added a Snowflake connector, with the following capabilities to fulfill your Snowflake data integration needs:

  • Ingest data from Snowflake into any supported sink (e.g. a data lake) using the Copy activity, which utilizes Snowflake’s COPY into <location> command to achieve the best performance;
  • Look up data from Snowflake for ADF control flow consumption using the Lookup activity;
  • Load data into Snowflake from the 90+ supported sources using the Copy activity, which takes advantage of Snowflake’s COPY into <table> command to achieve the best performance (an illustrative sketch of both commands follows this list).
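
For context, the statements the Copy activity drives on the Snowflake side look roughly like the sketch below. This is illustrative only: the database, table, stage URLs, and SAS token are hypothetical placeholders, and ADF generates the actual commands for you at runtime.

-- Copy activity with Snowflake as source: COPY INTO <location> (export)
copy into 'azure://<hypothetical_account>.blob.core.windows.net/staging/export/'
from my_db.my_schema.orders
credentials = (azure_sas_token = '<sas-token>')
file_format = (type = 'csv' compression = 'gzip');

-- Copy activity with Snowflake as sink: COPY INTO <table> (load)
copy into my_db.my_schema.orders
from 'azure://<hypothetical_account>.blob.core.windows.net/staging/ingest/'
credentials = (azure_sas_token = '<sas-token>')
file_format = (type = 'csv');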

You can now find the Snowflake connector in the ADF connector gallery UI. Learn more from the Snowflake connector documentation.


 

Upcoming:

Next, we plan to enable the following to further enrich the ADF-Snowflake integration capabilities. Stay tuned!

  • Load semi-structured data from JSON files into a Snowflake VARIANT column using the Copy activity, for subsequent data processing in Snowflake (a minimal Snowflake-side sketch follows this list). [Update 2020/7/12: Data Factory now supports direct copy between JSON files and Snowflake semi-structured data types; learn more from direct copy from Snowflake and direct copy to Snowflake.]
  • Support Snowflake in Mapping Data Flow as source and sink transformations to provide seamless ETL. [Update 2020/8/31: ADF now enables the Snowflake connector in Mapping Data Flow; learn more from this blog.]
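
For reference, on the Snowflake side, loading JSON into a VARIANT column comes down to a COPY INTO with a JSON file format, roughly as sketched below (the table name, stage URL, and SAS token are hypothetical placeholders):

create table if not exists raw_events (payload variant);

copy into raw_events
from 'azure://<hypothetical_account>.blob.core.windows.net/staging/events/'
credentials = (azure_sas_token = '<sas-token>')
file_format = (type = 'json');
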
69 Comments
Copper Contributor

At one of our energy clients (I can't name them for confidentiality reasons) we found that the performance of ADF when loading data into Snowflake was not great. We raised a ticket with Microsoft, and the MS support team advised using Matillion inside Azure. After that, the data load performance improved. This was around 3-4 months ago. Is this still the case for large-volume data loads?

Microsoft

@avixorld this Snowflake connector utilizes Snowflake's recommended and most performant way to load data and can scale to large data volumes. I'm not sure what exact mechanism your client used earlier. If you run into any perf issue going forward, please keep raising support tickets and we'd love to look into the specific case.

Copper Contributor

Great update @Linda_Wang !!!

 

I tried executing a Snowflake stored procedure from the Lookup activity; it succeeds at the database level but errors out in the UI:

 

{ "errorCode": "2100", "message": "Failure happened on 'Source' side. ErrorCode=UserErrorOdbcInvalidQueryString,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The following ODBC Query is not valid: 'call CDM.test1('c')'.,Source=Microsoft.DataTransfer.Runtime.GenericOdbcConnectors,'", "failureType": "UserError", "target": "Lookup1", "details": [] }

 

 

Is this a known issue?

Copper Contributor

Thank you for the update @Linda_Wang 

 

Please let us know if Snowflake stored procedure execution is going to be available through this connector in the future. It looks like this only covers ingesting data into Snowflake, not executing any DML or DDL statements, including stored procedures.

Microsoft

@adfguy currently the ADF Snowflake connector doesn't support executing stored procedures. We will track the ask. Do you mind sharing your scenario with Lookup + stored procedure?

Microsoft

@Srini2020 Snowflake stored procedure execution is on the radar, though we don't have a timeline at the moment. To understand more: what kind of Snowflake DML/DDL operations do you want to execute in the workflow?

Copper Contributor

@Linda_Wang   Hi Linda. Hope you're doing good.

 

I created a schemaless dataset against the Snowflake linked service, then used that dataset in a Lookup activity and called a stored procedure from it.

Surprisingly, the stored proc ran just fine in the database, but ADF doesn't seem to capture the return value and it fails.

 

Please see the attached screenshots for more detail: the query log at the Snowflake level, the ADF execution log, the Lookup activity configured to call the Snowflake stored procedure, and the schemaless dataset.

Microsoft

@adfguy thanks for sharing the details. Executing a stored proc in the source/Lookup is not supported today; the way ADF submits queries to Snowflake doesn't handle stored procedures, as you observed. I was trying to understand more about your workflow logic with Lookup + stored procedure, to see what scenario is blocked by the lack of this capability, or whether you could use a query instead. Hope that clarifies.

Copper Contributor

@Linda_Wang 

So it seems we need to wait until the Stored Procedure activity rolls out support for Snowflake.

 

Copper Contributor

@Linda_Wang We would like to execute CREATE TABLE, INSERT INTO, and MERGE commands using a Snowflake stored procedure activity.
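
For illustration, such DML could be wrapped in a Snowflake JavaScript stored procedure along these lines (the schema, table, and procedure names are hypothetical), which the requested activity would then simply call:

create or replace procedure edw.load_orders()
returns string
language javascript
as
$$
  // hypothetical MERGE from a staging table into a target table
  var sql = "merge into edw.orders t using edw.orders_stage s " +
            "on t.order_id = s.order_id " +
            "when matched then update set t.amount = s.amount " +
            "when not matched then insert (order_id, amount) values (s.order_id, s.amount)";
  snowflake.execute({sqlText: sql});
  return "MERGE done";
$$;

-- what the ADF activity would need to run:
call edw.load_orders();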

 

Thanks

Copper Contributor

Does it work with Snowflake residing in GCP or AWS?

Microsoft

@Alaurier the source supports any Snowflake flavor; the sink supports Snowflake on Azure.

Copper Contributor

Hi,

 

Does ADLS Gen1 support copying data into Snowflake using ADF?

In Snowflake we are currently using Blob storage as the staging area; can we avoid this with the ADF Snowflake connector?

Copper Contributor

@Srinivas_b14 Using Blob storage as staging is still required for the ADF Snowflake connector as well. Also, it only takes a SAS token and not a managed identity, which requires manual intervention to update the SAS token when it is rotated periodically for security and compliance.

@Linda_Wang Please correct me if I'm wrong.

Copper Contributor

We are trying to connect to a Snowflake instance but are receiving errors while setting up the connection in ADF.

 

If using the Azure-hosted Integration Runtime, we receive the following exception when testing the connection:

CURLerror (curl_easy_perform() failed) - code=51 msg='SSL peer certificate or SSH remote key was not OK' osCode=13 osMsg='Permission denied'

If using a self-hosted Integration Runtime, we get:

CURLerror (curl_easy_perform() failed) - code=60 msg='Peer certificate cannot be authenticated with given CA certificates' osCode=2

There are no SSL certificate errors when trying to access the database through the browser. The host is on xxxxxxx.eu-central-1.snowflakecomputing.com.

 

Have you come across this error before? Any pointers?

Brass Contributor

@Linda_Wang a stored proc activity (or something like an Execute SQL Task in SSIS) would be useful for executing queries, but also for running DDL commands such as CREATE/ALTER WAREHOUSE. This would allow scaling a warehouse up at the start of the pipeline and back down at the end.
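
For illustration, the statements such an activity would need to run are simple Snowflake SQL along these lines (the warehouse name and sizes are hypothetical):

-- at the start of the pipeline: scale the warehouse up
alter warehouse etl_wh set warehouse_size = 'LARGE';

-- ... run the copy/transform activities ...

-- at the end of the pipeline: scale it back down
alter warehouse etl_wh set warehouse_size = 'XSMALL';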

Microsoft

@KoenVerbeeck thanks for the feedback. A script task for Snowflake is on our future plan, though I don't have an ETA to share at the moment.

Microsoft

@BenMcIntyre for the self-hosted IR, please make sure you try the latest version. It also looks to me like the issue is related to your network/firewall settings, where some traffic is intercepted or blocked (https://community.snowflake.com/s/article/Error-Peer-certificate-cannot-be-authenticated-with-given-...). If the issue persists, I'd suggest you contact Snowflake to guide you through troubleshooting.

Copper Contributor

Thanks for the suggestions @Linda_Wang. Do you have any idea why this would happen when we were initially using the Azure-hosted IR? We only tried the self-hosted option as a last resort.

Copper Contributor

@Linda_Wang the Microsoft documentation doesn't seem to mention any controls over the Snowflake stage; an external stage is created dynamically by default. Is there a plan to make this setting controllable from within ADF and provide the ability to reference existing Snowflake external stages?

Copper Contributor

Hi @Linda_Wang,

 

I'm trying to use this connector, but when I try to get a sample of the data I receive an error (screenshot attached).


 

However, the connection test is successful (screenshot attached).


 

I've researched this but did not find any related information :(

 

Any ideas?

 

Thanks

Brass Contributor

Hi @JhonatanReco,

you can specify a database in the linked service (screenshot attached).


If you have multiple databases and you don't want to create a linked service for each, you can specify a "USE myDB;" statement in the pre-copy script, or you can just use three-part names, e.g. myDb.mySchema.myTable.

Brass Contributor

Sorry, by three-part name I should have specified this:

myDB.mySchema.myTable.

 

EDIT: updated in original comment.

Copper Contributor

Adding support for stored procedure calls would be huge for winning ADF adoption in Snowflake projects. Staging data is a great start, but it leaves a gap in how to handle the data once it's in Snowflake without a third-party tool, Databricks notebooks, or Snowflake Tasks (the last of which would break the dependency between the ADF pipeline and downstream activity).

Copper Contributor

I have the connector working with Snowflake as the sink, using the native COPY INTO command. Everything is working just fine; however, I would like to add a second, default-valued column to the sink table. The COPY INTO command has transformation parameters that allow for this (https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#transformation-parameters). Is it possible to configure the column list in the ADF connector to utilize this functionality?

Copper Contributor

@Linda_Wang Thanks for the update. This is a really nice feature. We have Blob files in JSON format and want to load them into a Snowflake stage. As you mentioned above, JSON format support will be in an upcoming release. Do you have any timeline/ETA for this JSON support feature?

Microsoft

@ChrisPuuri You are right that during each copy ADF will create an external stage and clean it up at the end. We don't have a plan to support referencing an existing external stage at the moment. Do you see any issue with the current model?

Microsoft

@MartinJohnMadsen Good to know the connector works for you. I don't think Snowflake COPY INTO supports adding a column with a static value that does not exist in the source; the transformation parameters can do column reordering/omission, and we do leverage that for column mapping behavior. Kindly clarify if I've missed anything.
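
For illustration, the column reordering/omission form of COPY INTO that this refers to looks roughly like the following (the table, stage, and column names are hypothetical):

copy into my_db.my_schema.target_table (c1, c3)
from (select t.$1, t.$3 from @my_stage/data.csv t)
file_format = (type = 'csv');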

Microsoft

@RavindraAsabe we are going to release loading between JSON and Snowflake VARIANT by the end of this week.

Copper Contributor

Hello, @Linda_Wang.

Thanks for responding to questions on this topic. It really helps us understand all the possibilities and limitations of this connector.

I wanted to ask whether the "ROLE" connection parameter is supported. I can see that there is no such field in the linked service view, but it can be added through "Additional connection properties".

Anyway, even with a role specified in the properties or directly in the connection string, I was not able to query Snowflake due to this error:

ERROR [57P03] No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command. 

This happens despite a warehouse being specified for the linked service.

A 'use warehouse' statement in the query text does not work either.

Is there any way to work around this issue?

Copper Contributor

Thank you @Linda_Wang for working with us on this. The connector is amazing and has the potential to simplify a lot of our ELT work with Snowflake.

 

My scenario is this: the source data has three columns (C1, C2, C3) stored in a Parquet file in Blob. The target table in Snowflake has four columns (C1, C2, C3, C4), where the last column, C4, has a default value of CURRENT_TIMESTAMP(). When I manually run the COPY INTO from an external location, I can specify the column list explicitly like this:

 

copy into snowflake_table (C1, C2, C3)
from 'azure://...'
credentials=(azure_sas_token='...');

If I don't specify the column names, I get the same error I see from the ADF connector: Snowflake can't copy 3 columns of data into a table with 4 columns. But specifying the column names makes this work, since the fourth column has a default value. If there were a way to configure the column list in the connector, we could use the default column to stamp every row we load with the load time (which is the goal).

 

 

Copper Contributor

@Linda_Wang I have tested this using an explicit schema in the copy activity (excluding the column with the default timestamp). That behaves as expected: the default column in Snowflake is populated with the timestamp. I think this will work for us, but I wanted to follow up for anyone else who might find this.

Copper Contributor

@AzureFrank We have exactly the same need for calling stored procedures in Snowflake from ADF. This would be a huge win for us for the same reasons you detailed above.

Copper Contributor

Thanks @Linda_Wang, no, we don't have any issues with creating the stages and then cleaning them up after the copy activity completes. I was just curious; it doesn't seem to create much extra overhead.

 

Our biggest needs at this point are 1) the ability to call a stored procedure or send a SQL script to Snowflake to create a table/view (this is huge for us!), and 2) the ability to load JSON files, which sounds like it's coming soon. Curious whether the first item is on the roadmap.

Copper Contributor

@Linda_Wang which is the appropriate team to create a ticket with when we run into issues, MSFT or Snowflake? We are seeing some issues as we deploy from a Development to a Test data factory and attempt to switch the connection string from our 'EDW_DEV' to our 'EDW_TEST' database.

Microsoft

@ChrisPuuri Good to know you have no issue or concern with the external stage behavior.

 

Snowflake stored proc support is on the plan, though I don't yet have a definitive timeline to share.

Loading data between JSON files and Snowflake VARIANT is in the release process, and we are shooting to go live by the end of this week.

 

For the issue you mentioned, you can create a ticket with MSFT to take a look first.

 

Microsoft

@MartinJohnMadsen the default mapping behavior in the copy activity is to map columns by name in a case-sensitive manner. It should work even if the source column count is less than the sink column count, as long as all source columns exist in the sink with identical names. Let me know if you see differently. Explicit column mapping also works, as you mentioned.

And Snowflake stored proc activity support is on the plan.

Microsoft

@oleksandrkravchuk the ADF Snowflake linked service UI currently shows the most common and basic properties; you can extend it via "Additional connection properties", e.g. to add the ROLE setting.

'use warehouse' is not supported in the Snowflake connector source query. When copying data from Snowflake, the query only takes a single SELECT statement.

ADF uses the warehouse you specified in the linked service at runtime. The error seems to indicate that the warehouse name is wrong or the warehouse is inactive. Please double-check.

Copper Contributor

@oleksandrkravchuk @Linda_Wang We got a similar error when using the Snowflake connector. The linked service tested OK, but when trying to use the connector in a pipeline we got the "use warehouse" error. We solved this by specifying a default role, namespace, and warehouse for the service account user in Snowflake. It appears that the Snowflake connector uses the default Snowflake role, which is typically the PUBLIC role. Hope that helps.
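
For anyone hitting the same error, setting those defaults on the service account is plain Snowflake SQL along these lines (the user, role, warehouse, and namespace names are hypothetical):

alter user adf_service_user set default_role = etl_role;
alter user adf_service_user set default_warehouse = 'ETL_WH';
alter user adf_service_user set default_namespace = 'EDW_DEV.PUBLIC';

-- the default role also needs usage on the warehouse
grant usage on warehouse ETL_WH to role etl_role;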

Copper Contributor

@Linda_Wang do you have any sense of the timing for stored procedure support? We also have this requirement and would like to keep our whole workflow running in ADF, but we need the ability to execute a stored proc, or at least execute some SQL from ADF, to do some transformation and insert into another table.

Microsoft

@dschuler36 stored proc activity support is on our H2CY20 plan now.

Note added 2020/12/14: stored proc activity support for Snowflake has been deferred. We don't currently have a timeline.

Copper Contributor

@Linda_Wang awesome, thanks for the update!

Copper Contributor

@Linda_Wang Thanks for the update. We are looking forward to this as well.

Copper Contributor

Looking forward to stored procedure support or native SQL support for ETL. Currently this limitation is forcing us to look elsewhere. Question: does Databricks let us run a SQL procedure?

Copper Contributor

@AzureFrank As a workaround, I tried the "Pre-copy script" option in the Copy Data activity for running stored procedures, and it seems to work. I set up a dummy table and a dummy file that loads 0 rows but executes my procedure first.
This is not optimal, I know, but if you really need to execute a stored procedure, it could be an option for you.
Let's hope we have a real solution soon.
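
A minimal sketch of what the "Pre-copy script" field could contain under this workaround (the procedure name is hypothetical); the copy itself then just loads the zero-row dummy file into the dummy table:

-- contents of the Pre-copy script field
call my_schema.my_transform_proc();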

Copper Contributor

Genius idea about the copy pre-script! @Linda_Wang, any idea when Snowflake could be added to the Stored Procedure activity? I don't know the dates for H2CY20.

Copper Contributor

@sonh13 H2CY20, I believe, means the second half of calendar year 2020.

Microsoft

What are the current max file sizes when using Azure Blob as a sink for Snowflake data? I'm running into file size limits when exporting 5 GB of data.

Copper Contributor

Can I use Common Data Service (CDS) as the source and Snowflake on AWS as the sink for an ADF integration, to send data from CDS to Snowflake on AWS?

Copper Contributor

Since Snowflake supports ADLS Gen2 for staging files, do you know if the Copy Data activity will support it as well, or do we have to stick to Blob storage?
