ADF Adds Support for Inline Datasets and Common Data Model to Data Flows
Published Jun 04 2020 05:51 PM
Microsoft

Azure Data Factory makes ETL even easier when working with corporate data entities by adding support for inline datasets and the Common Data Model (CDM public preview connector). With CDM, you can express common schemas and semantics across applications. By including CDM as a source and destination format in ADF's data flow engine, you can now read from CDM entity files using both manifest files and model.json. Likewise, you can write to CDM's manifest format by using an ADLS Gen2 sink in your ADF data flow.

 

[screenshot: cdm1.png]

 

The way that we're exposing the CDM format to your ADF Data Flow source and sink transformations is via the new inline dataset option. Previously, ADF required you to create or use an existing dataset, which is a shared entity across an entire factory. But now data flows allow you to define your source and sink formats inline in the transformation without requiring a shared dataset.

 

There is a new "Type" selector on data flow source and sink transformations. The default is "Dataset", which covers the most common use case; existing data flows require no changes and continue to follow the same workflow of creating or choosing a shared dataset for your source and sink.

 

CDM is the first ADF format to be available only as an inline dataset. Choose "Common Data Model" from the list of inline types, then pick ADLS Gen2 as your linked service, pointing to where your CDM entity files are located. All of the rest of the data flow transformation logic remains fully available in your flow.

 

NOTE: When using the model.json source type from Power BI or Power Platform dataflows, if you encounter "corpus path is null or empty" errors, the cause is likely a formatting issue in the partition location path inside the model.json. Here is how to fix it (a small script that automates these steps follows the list):

 

1. Open the model.json file in a text editor
2. Find the partitions.Location property 
3. Change "blob.core.windows.net" to "dfs.core.windows.net"
4. Fix any "%2F" encoding in the URL to "/"
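
If your model.json has many entities or partitions, hand-editing gets tedious. Below is a minimal Python sketch that applies steps 2-4 across the whole file. It assumes the standard CDM folders layout, where each entry in entities[] carries a partitions[] array with a location URL (the property name may appear as "location" or "Location" depending on which service produced the file):

import json

# Rewrite partition locations in a CDM model.json so that ADF can read them.
# Assumes the standard CDM folders layout: entities[] -> partitions[] -> location.
with open("model.json", encoding="utf-8") as f:
    model = json.load(f)

for entity in model.get("entities", []):
    for partition in entity.get("partitions", []):
        key = "location" if "location" in partition else "Location"
        loc = partition.get(key, "")
        # Step 3: use the ADLS Gen2 (DFS) endpoint instead of the Blob endpoint.
        loc = loc.replace("blob.core.windows.net", "dfs.core.windows.net")
        # Step 4: undo URL-encoded path separators.
        loc = loc.replace("%2F", "/").replace("%2f", "/")
        partition[key] = loc

with open("model.json", "w", encoding="utf-8") as f:
    json.dump(model, f, indent=2)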

 

You can mix and match linked service and dataset types, too. For example, ADF lets you read from an Azure SQL source database, dedupe some of the table data with an Aggregate transformation, and then sink the transformed rows into CDM partitions, as sketched below. Likewise, you could take an existing set of CDM entities and load them into a SQL database after transforming the incoming entity data.
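
To make that first flow concrete, here is a rough sketch of the equivalent logic in pandas rather than in ADF data flow terms. The connection string, table name, and key column are illustrative assumptions, and the CDM sink is approximated by writing a single data partition file:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical Azure SQL connection; substitute your own server and database.
engine = create_engine(
    "mssql+pyodbc://user:pass@myserver.database.windows.net/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Source: read the table from Azure SQL.
rows = pd.read_sql("SELECT * FROM dbo.Accounts", engine)

# Aggregate transformation: dedupe by keeping the first row per business key.
deduped = rows.groupby("AccountId", as_index=False).first()

# CDM sink: ADF would also emit the manifest metadata; here we only mimic
# writing one data partition file.
deduped.to_csv("Accounts-partition-000.csv", index=False)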

 

 

You'll notice that the source and sink settings are more complex when using an inline dataset. Because you are not using a shared configuration (dataset), ADF needs to capture details about your CDM entity file locations and corpus settings directly in the transformation. As more formats are added to the inline dataset option, the source and sink settings properties will change depending on the format you select. Note that with the CDM format, ADF expects the entity reference to be located in ADLS Gen2 or GitHub, so you will select the appropriate linked service to point to the entity files.
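
Before wiring up the transformation, it can help to check which metadata files your CDM folder actually contains, since manifest-based and model.json-based folders are configured differently. Here is a quick sketch that assumes a local copy or mount of the folder (the path is hypothetical):

from pathlib import Path

# Hypothetical local copy or mount of the CDM folder from ADLS Gen2.
folder = Path("./commondataservice-env-org")

# Manifest-style CDM folders contain *.manifest.cdm.json documents, while
# Power BI / Power Platform dataflow folders have a model.json at the root.
manifests = sorted(folder.glob("**/*.manifest.cdm.json"))
model_json = folder / "model.json"

print("manifest documents:", [str(p) for p in manifests])
print("model.json present:", model_json.exists())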

 

 

 

When using inline datasets, you may want to import the target schema, just as you do with Import Schema in ADF datasets. In inline mode, this is available directly inside the source and sink transformations.

 

In the use case above, I'm using the CDM sink format type, so the CDM entity definition includes a rich set of schema semantics that ADF can read and use for your mappings and transformations. I click Import Schema to load the columns and data types into my sink mapping, and then View Schema to set schema drift or validate schema, as well as to view the columns defined in my target entity schema.

60 Comments
Copper Contributor

Nice feature !!

Copper Contributor

Just in time. I'm a Power BI dev without Premium, but with access to Azure, so I may use ADF as a poor man's Power BI dataflow. This feature makes it easier to make CDM data in ADLS readable by Power BI and AAS.

Copper Contributor

Exactly what we need! Is there any documentation / help available for using model.json? It doesn't seem to connect right off the bat. Hope to see this in GA soon!

Copper Contributor

+1 on a model.json example. Hoping to use it with the Power Apps Export to Data Lake option that uses model.json 

Copper Contributor

Hi,

 

I have been trying to use this connector since it was in preview mode.

I have difficulty understanding why the linked service test connection fails in the 'Source options' while testing OK on its settings blade. Pic below:

 

[screenshot: hshokry_1-1591954284711.png]

 

Microsoft

@hshokry Is that storage account protected in a private network or is it a public endpoint?

Copper Contributor

Can you provide a sample screenshot of the source options for the inline dataset? I am exporting my CDS data to an Azure Data Lake Gen2 account, which works, but I can't figure out how to consume that information as an inline source in Data Factory. The container structure is as follows:

-[azuredatalakestorageaccount]
--Blob containers
---commondataservice-[environmentname]-[orgname]
----model.json
----[entityname]
-----2020.json
----[entityname]
-----2020.json

I don't understand what to set as root location and Entity reference. I assume the metadata format is Model.json since that is the file that CDS created. 

Brass Contributor

An example of using model.json with Power BI dataflow data would be really helpful. I've tried every combination of paths and entities and nothing seems to work. I constantly get the error:

 

DF-CDM_005 at Source 'source1': Could not read '*entity name*' from the 'cdm' namespace. Reason 'com.microsoft.commondatamodel.objectmodel.storage.StorageAdapterException: Could not read content at path: /logical/*entity name*'

Copper Contributor

An example of using model.json would be nice.

I'm getting the same errors as @JonesTrac. I'm not sure what to input in the Root location.

Microsoft

Here is a model.json example ... Set the "Root location" to the folder where your model.json is located. Make sure all locations in your model.json are accessible to your factory.

 

[screenshot: modeljson.png]

[screenshots: modeljson2.png, modeljson1.png]
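
In case the screenshots do not render, here is a minimal sketch of the model.json shape the source reads, written as a Python dict so it can be dumped to disk. All names and the storage URL are placeholders, not values from the screenshots:

import json

# Minimal model.json skeleton; every name and URL below is a placeholder.
model = {
    "name": "ExampleModel",
    "version": "1.0",
    "entities": [
        {
            "$type": "LocalEntity",
            "name": "Account",
            "attributes": [
                {"name": "accountid", "dataType": "guid"},
                {"name": "name", "dataType": "string"},
            ],
            "partitions": [
                {
                    "name": "Account-part-000",
                    # Must be reachable by the factory, on the DFS endpoint:
                    "location": "https://mystore.dfs.core.windows.net/powerbi/ExampleModel/Account/part-000.csv",
                }
            ],
        }
    ],
}

with open("model.json", "w", encoding="utf-8") as f:
    json.dump(model, f, indent=2)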

Copper Contributor

Hi @Mark Kromer ,

 

Thanks for the examples.

I don't know how you got this folder structure, since the 'Export to data lake' feature in the Power Platform outputs a different folder structure, as documented here.

[screenshot: entity-data-in-lake.png]

 

As you can see, the model.json file is directly under the collection, and this causes trouble with the ADF source options.

The 'folderPath' field is required, but the model.json is directly under the collection, so it has no folder path.

Brass Contributor

Thanks Mark, really appreciate it. What level of permissions is required for the data factory, as I'm still getting these errors? Is an account key for the Data Lake linked service enough? Is it a problem if the entity name contains spaces?

Microsoft

@MathiasBL To use model.json as a source, ADF will also need the exported partition data. Is there a set of CSV files associated with your model in a partition folder?

Microsoft

@JonesTrac Your error looks like a missing entity reference in your model. Would you be able to submit an Azure ticket so that we can have an engineer help you with the model?

Copper Contributor

@Mark Kromer yes, there's data as documented.

It seems like this feature doesn't go well together, for the moment, with the OOB 'Export to data lake' feature in the power platform?

 

[screenshot: export-data-lake-account-snapshots.png]

Brass Contributor

Thanks Mark, the model was created by PowerBI dataflows and hasn't been edited so is that a known incompatibility with this DataFactory feature?


Microsoft

@JonesTrac  I'm not able to tell from the message. It sounds like there is an entity that is not resolved, but would need an engineer to look deeper at the error.

Brass Contributor

Thanks @Mark Kromer, will put in a ticket

Copper Contributor

@MathiasBL I have the exact same problem, but I was able to get around the source folder issue by putting a "/" for the folder path. I still wasn't able to get it to work, due to the error "the corpus path is null or empty" when I tried to import the schema. I am using model.json, not a manifest, so I'm not sure why it is requiring a corpus path. I opened a discussion about this specific issue here, if anyone wants to follow: https://docs.microsoft.com/en-us/answers/questions/37384/cant-use-inline-common-data-model-data-set-...

Brass Contributor

For anyone struggling to get Power BI dataflows into ADF this way: I have been told by a support engineer that they are incompatible. You would have thought the whole point of a Common Data Model was that it was common across services, so this is certainly a bizarre omission. They plan to work with the Power BI team to include it, but there is no ETA for this. :facepalm:

Copper Contributor

Hi all,

I am getting the error even when connecting using a manifest. Please help.

[screenshot: Rahul_Jain450_1-1593086356223.png]

Error:
DF-CDM_007 at Source 'sourceFromCDS': CDM path /commondataservice-poc-org/ not valid

 

Copper Contributor

@Mark Kromer We've heard conflicting takes on ADF's capability to generate/create the manifest and related schema documents given a non-CDM source in a mapping data flow. I've seen no documentation explicitly stating this functionality, but MS reps have represented to us that this is possible.

 

Can you confirm whether the capability to automatically create CDM schema documents, including the manifest, exists or not?

Microsoft

@rcgriner ADF data flow sink can produce CDM manifest files with partitioned output. You must have an existing CDM entity reference set in the sink properties for that to work.

Brass Contributor

Thanks @Mark Kromer. Does this mean you need to manually define the entity yourself, or can ADF create the entity definition too? Right now I keep getting errors like "Entity '/CoreData/WellMetaData/Geo.cdm.json/Geography' not found in corpus". How much of the CDM structure must already exist for ADF to sink new entities into it?

Microsoft

@JonesTrac You must have an entity definition and point to it from the corpus folder in your sink. You will then import that CDM schema so that you can map to the entity properties in your sink transformation.

Copper Contributor

@Mark Kromer I have submitted an Azure ticket as well; can you please help?

case 120062623000705

I am getting the error even when connecting using a manifest. Please help.

Error:
DF-CDM_007 at Source 'sourceFromCDS': CDM path /commondataservice-poc-org/ not valid

Copper Contributor

I have an entity definition and point to it from the corpus folder in the sink, but I am still getting the error. Please help: DF-CDM_007 at Source/ not valid

Microsoft

If you are using model.json from Power BI or Power Platform dataflows, I've added a note in the blog post above to help when you encounter "the corpus path is null or empty".

 

NOTE: When using model.json source type from Power BI or Power Platform dataflows and you encounter "corpus path is null or empty" errors, it is likely due to formatting issues of the partition location path in the model.json. Here is how to fix it:

 

1. Open the model.json file in a text editor
2. Find the partitions.Location property 
3. Change "blob.core.windows.net" to "dfs.core.windows.net"
4. Fix any "%2F" encoding in the URL to "/"

 

Copper Contributor

@Mark Kromer Really appreciate the follow-up. My partition location paths in model.json already meet these criteria, but I still receive the "corpus path is null or empty" error. I also double-checked to make sure the paths are valid. Let me know if there is anything else to try.

Copper Contributor

@Mark Kromer Really appreciate the follow-up, but I am using a manifest file to import data with ADF into ADLS Gen2, and that is when I get the error.

I have also set up the correct corpus path.

Copper Contributor
 

[screenshot: corpus.png]

 

I have tried many different paths and am still getting the error. You can refer to the folder structure below:

[screenshot: path.png]

Microsoft

If you are still having troubles getting your CDM sources to work, please either contact us via Azure Support ticket or use the feedback button on the top of the ADF UI (smiley face) so that we can look into your individual configuration.

Copper Contributor

Already raised the following case last week: 120062623000705. 10 days have passed already.

Brass Contributor

Hi @Rahul_Jain1785, I finally got ADF to sink into a CDM folder last night. You can see my configuration below if that helps.

[screenshots: TopLevelDatalake.PNG, Entity Folder.PNG, ADFSetup.PNG]

The root location is coreapidatalake/TracCoreDataLake/T****Core/WellMetaData. This is the folder that contains the ShaleWellGeography entity.

Entity path is T****Core/WellMetaData/ShaleWellGeography.cdm.json/ShaleWellGeography

I had to create the entity schema documents myself using the guidance here (https://docs.microsoft.com/en-us/common-data-model/creating-schemas). ADF then imported that schema and allowed me to set up a mapping in the sink. When copying the data, ADF created the wellmetadata.manifest.json file, the config.json file, and the ShaleWellGeography folder with the parquet data partitions. Hope that helps!
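
For anyone else writing entity schema documents by hand, here is a minimal sketch of the general shape of a *.cdm.json entity definition, again as a Python dict for dumping to disk. The attribute names are placeholders, and the exact document structure should be checked against the creating-schemas guidance linked above:

import json

# Minimal entity schema document (*.cdm.json) skeleton; attribute names are placeholders.
entity_doc = {
    "jsonSchemaSemanticVersion": "1.0.0",
    "imports": [{"corpusPath": "cdm:/foundations.cdm.json"}],
    "definitions": [
        {
            "entityName": "ShaleWellGeography",
            "hasAttributes": [
                {"name": "wellId", "dataType": "string"},
                {"name": "latitude", "dataType": "double"},
                {"name": "longitude", "dataType": "double"},
            ],
        }
    ],
}

with open("ShaleWellGeography.cdm.json", "w", encoding="utf-8") as f:
    json.dump(entity_doc, f, indent=2)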

Copper Contributor

@Mark Kromer does Microsoft have a working example where the Power Apps 'Export to data lake' output (model.json) is used as a source/sink?

Still experiencing issues even with the note you added.

Copper Contributor

Hi Mark, I am facing an issue with CDM parsing. All the timestamp fields are returning as NULL. Am I missing anything here, please?

Microsoft

@ghurruprasath Where is your model.json file coming from? Is it an export from Power Platform? Something else?

Copper Contributor

@Mark Kromer It's coming from Dynamics CRM, Mark.

Microsoft

@ghurruprasath Try this: When you click "Import schema" on the Source projection, set the proper timestamp conversion in the dialog box for time fields:

[screenshot: timefields.png]

Copper Contributor

My source data format is MM/dd/yyyy HH:mm:ss ("04/04/2020 17:49:31"). However, none of the timestamp conversions seems to parse the data correctly. I have also noticed that whichever time format I set while importing the schema, the next time I click "Import schema" it is populated with a fixed format, as shown below.

 
 

[screenshots: date.png, test2.png]

Microsoft

@ghurruprasath Do you mind if we take this over to the Q&A forum? I'd like to get a few CDM SMEs engaged on it and it's much easier to have dialog on your efforts there instead of this comments section: https://docs.microsoft.com/en-us/answers/questions/topics/azure-data-factory.html. TY!

Copper Contributor

@Mark Kromer Thanks for your suggestion on setting the time format in "infer drifted column"; it worked for the date field. However, datetimeoffset fields are not working. I have the field createdon (directly from Export to Data Lake from the CDM) formatted like 2020-01-03T07:46:37.0000000+00:00. None of the date format options there seems to work for this data type. Can you please suggest something?

Copper Contributor

@Mark Kromer @ghurruprasath Same issue here with datetimeoffset. This problem is also described in https://docs.microsoft.com/en-us/answers/questions/68493/azure-data-factory-cdm-inline-dataset-error.... No solution yet.

Copper Contributor

Hi,

I am experiencing similar issues.

In my case, the CDM sources come from a Power Apps dataflow stored in the data lake.

By default, all of the datetime fields are NULL.

When I add a format, the SinkCreatedOn and SinkModifiedOn fields turn out fine.

[screenshot: Jkibiki_0-1599817533916.png]

 

But the CreatedOn field, which interests me most, is still null. Using the format ('yyyy-MM-dd\\'T\\'HH:mm:ss.SSS') does not help either.

 

Microsoft

@Jkibiki Can you copy/paste an example here of the timestamp field that is being generated in your source data?

Copper Contributor

@Mark Kromer 

My flow is as follows:
Power Apps dataflow -> CDM custom entity (historization enabled) -> export to data lake

 

I am most interested in retrieving datetime values from the CDM auto-generated field "CreatedOn".

 

I have checked the .csv on the storage account, and the format of CreatedOn is as follows.

[screenshot: Jkibiki_0-1599911284932.png]


@TimEllens How can I modify model.json to take formatting into account? Can you please provide a screen share or something similar?

 

J.




Copper Contributor

I gave up on those datetime mappings and added a custom string timestamp marker in Power Query.

Copper Contributor

For others: this is the format that worked for me: timestampFormats: ['dd-MM-yyyy','yyyy.MM.dd HH:mm:ss','yyyy-MM-dd\'T\'HH:mm:ss\'Z\'','yyyy-MM-dd\'T\'HH:mm:ss.SSS\'Z\'','MM/dd/yyyy HH:mm:ss','yyyy-MM-dd\'T\'HH:mm:ss.SSS']
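
A quick way to sanity-check sample values against patterns like these locally; the Python strptime patterns below are rough equivalents of the Java-style ADF patterns, and the sample values are illustrative:

from datetime import datetime

# Approximate strptime equivalents of two of the Java-style patterns above.
samples = {
    "04/04/2020 17:49:31": "%m/%d/%Y %H:%M:%S",           # MM/dd/yyyy HH:mm:ss
    "2020-01-03T07:46:37.000Z": "%Y-%m-%dT%H:%M:%S.%fZ",  # yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
}
for value, fmt in samples.items():
    print(value, "->", datetime.strptime(value, fmt))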

Another question: is anyone else experiencing issues with multi-line rows? Normally I would use a double quote as the quote character and as the escape character for CSV files. Inline datasets do not seem to include this option. Is there any way to work around this issue? Is it on the backlog?

Example:

"Test Test Test Test
Test Test ""Test Test Test"" Test Test"

Copper Contributor

@JonesTrac I have followed exactly the steps you showed in the snapshots and used a standard entity schema (Account.cdm.json), but when trying to Import Schema or Preview Data in the sink transform, it fails with this message: "at Sink 'sink1': java.util.concurrent.ExecutionException: java.lang.NullPointerException"

Have you come across this before?

 

Thanks
