Azure Data Factory adds support for XML format
Published Jul 17 2020 07:20 AM
Microsoft

We are glad to announce that in Azure Data Factory, you can now extract data from XML files by using copy activity and mapping data flow. With this capability, you can either load XML data directly into another data store/file format, or transform your XML data and then store the results in a lake or database.

 

XML format is supported as a source on all file-based connectors. You can find the new "XML" option in the available format list:

[screenshot]

 

When reading data from XML files, you have the flexibility to define whether to apply schema validation, to specify how to handle namespaces, and to map selective fields or reshape the hierarchy to match downstream data needs.
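
For reference, a minimal sketch of an XML dataset definition in JSON, assuming a blob storage linked service (names and paths below are placeholders; see the XML format documentation for the full property list):

{
    "name": "ExampleXmlDataset",
    "properties": {
        "type": "Xml",
        "linkedServiceName": {
            "referenceName": "<your storage linked service>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "<container>",
                "folderPath": "<folder/subfolder>"
            },
            "encodingName": "UTF-8"
        }
    }
}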

 

Copy activity source example:

[screenshot]
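
In JSON terms, a copy activity source along the lines of the screenshot might look roughly like this (the validation, namespace, and type-detection settings correspond to the options described above; values are illustrative):

"source": {
    "type": "XmlSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFileName": "*.xml"
    },
    "formatSettings": {
        "type": "XmlReadSettings",
        "validationMode": "xsd",
        "namespaces": true,
        "detectDataType": true
    }
}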

 

In mapping data flow, you can use XML as a source either via an XML dataset or via an inline dataset.

Data flow source example with XML dataset:

[screenshot]

 

To learn more about XML support in Azure Data Factory, see XML format on the documentation site.

34 Comments
Copper Contributor

This is a great addition.

Brass Contributor

Hi @Linda_Wang,

 

Thanks for the good news.

 

A few questions everyone would like answered:

  • What XML processor is in use behind the scenes?
    Is it .NET Framework or something else?
  • When will sink support be introduced?
  • Any future XSLT support?
  • Any future XQuery support?
  • Excerpt from the documentation: "...When using XSD or DTD to validate XML files, the XSD/DTD must be referred inside the XML files through relative path...".
    This is a very serious limitation. Are you working to remove this dependency?
  • Is the ADF XSD validation conformant with the XSD 1.1 standard?
    XSD 1.1 became a W3C Recommendation on April 5, 2012.
  • Does the XSD validation support a master XSD with imported or included XSDs?
    This is a very common scenario.

For example,

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="includeMe.xsd"/>
  ...
</xs:schema>

 

Copper Contributor

This is great news.

 

Any future plan to have sink support?

Copper Contributor

This is a great addition. Thanks!!

Microsoft

@PuranJoshi Thanks for the feedback. For now we don't have a plan to support XML as a sink.

Copper Contributor

Yeah! I expected a lot for this!

Copper Contributor

Excellent! I was trying to convert an XML file to CSV through a Copy Activity pipeline but am getting an error: 'Could not find the schema information for the element'.

 

Any idea about this? Can someone help me?

 

Thank you :)

Microsoft

@BijuNambiarC The error message seems to indicate that the XSD was not found. Did you configure XSD schema validation in ADF? If so, please turn off the validation and try again. Also note the XML connector behavior on schema validation: https://docs.microsoft.com/en-us/azure/data-factory/format-xml#xml-connector-behavior.

Copper Contributor

@Linda_Wang I am retrieving an XML file from an HTTP connection in Data Factory (from a SOAP API). What is the supported method to sink/save this data to Data Lake in order to run Data Flow transformations on the data?

Copper Contributor

Does anyone have a method to save/sink this retrieved XML data from the HTTP connection? Binary isn't an option; it needs to be something that can be parsed out into a database.

Microsoft
@Mysticode since it's a SOAP API, you can use a copy activity with XML format on top of the HTTP connector as the source, and use the copy activity "mapping" to map the fields directly to the sink database. If you need complex transformation in between, which is where Data Flow fits, you can use a copy activity to copy to the lake and then apply a Data Flow: configure the copy source and sink with XML/Binary format, one on the HTTP connector and one on the data lake connector.
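
As a rough illustration of the first pattern (not a confirmed configuration): the copy source uses XML format over the HTTP connector, and the SOAP headers, body, and sink type below are placeholders:

"source": {
    "type": "XmlSource",
    "storeSettings": {
        "type": "HttpReadSettings",
        "requestMethod": "POST",
        "additionalHeaders": "Content-Type: application/soap+xml",
        "requestBody": "<soap:Envelope>...</soap:Envelope>"
    },
    "formatSettings": {
        "type": "XmlReadSettings",
        "validationMode": "none"
    }
},
"sink": {
    "type": "AzureSqlSink"
}
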
Copper Contributor

@Linda_Wang Do you have an example of this? I can't find a file type that works for saving the XML at the sink. I need it to be something that I can parse through.

Copper Contributor

As @Linda_Wang described, you can copy the results of the HTTP call to the data lake. I got this to work with both binary and XML. For complex XML support, I was able to copy it to a SQL Server instance where you can use T-SQL XML functions to parse the results: I used a binary dataset, copied the results from the data lake, and loaded them into a string column in a table, but XML should work too with an XML data type. I haven't tried that yet, but it may be a good solution for complex XML parsing and processing inline instead of saving it to a table. In my case I had to create a loop because there is a limit on the rows returned in the SOAP envelope. But that is another story. Hope this helps.
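
A minimal sketch of the first leg of this approach, assuming a binary dataset over an HTTP linked service as source and a binary dataset on ADLS Gen2 as sink (all names and the SOAP body are illustrative):

"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "HttpReadSettings",
            "requestMethod": "POST",
            "requestBody": "<soap:Envelope>...</soap:Envelope>"
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings"
        }
    }
}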

 

[screenshots]

 

 

Copper Contributor

Hi @Linda_Wang,

 

Is there any limitation on converting XML files to CSV using the copy activity? A 1 GB file gets converted within 2 hours without any parallelism/DIU/blocks, but a 2 GB file fails with a memory exception.

 

Any suggestions for processing a large XML file?

 

Thank you

Copper Contributor

@WillT985 I must be missing something. This is what I have so far.

 

1. Linked Service: HTTP type, base URL is my external API. Basic auth, username, password

2. A dataset of HTTP/XML type, connected to my Linked Service above. I have entered a base URL. It connects without issue.

3. Pipeline activity of "Copy data"

  • Source dataset: the HTTP/XML dataset from above
  • Request method: POST
  • Additional headers: Content-Type: application/soap+xml;charset=UTF-8;action="http://outsidesoft.com/webservices/API/Authenticate"
  • Request body: <soap:abc123....><soap:/Header/>....<soap:Body><api:Authenticate>...etc etc

What are the next steps? When I try to set up a Sink dataset, I choose my data lake, but when it comes to choosing a format I run into a problem. I want to save the file in a format I can parse. If I choose CSV, then run a debug, I get the following error:

ErrorCode=SchemaMappingFailedInHierarchicalToTabularStage,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to process hierarchical to tabular stage, error message: One or more errors occurred.,Source=Microsoft.DataTransfer.ClientLibrary,'",

 

What are the proper steps to take to get a sink working?

My future plans are to do another "Copy data", and set up the variable to get the token as you demo'd above, and then make subsequent API calls with the token and save that newly retrieved data to the data lake in their own tables.

 

Thank you all

Copper Contributor

Linda mentioned configuring the Sink and Source as XML for Data Lake, but this is not an option. The only options are: Avro, Binary, DelimitedText, Json, ORC, Parquet.

 

Also, Linda, my Sink is an HTTP/XML dataset connected to a linked service of type HTTP. Do you have a demo of how to set up a proper copy that uses an HTTP linked service source, to map fields from a SOAP API?

 

If I select DelimitedText/CSV for the Data Lake dataset's format and debug the pipeline, I get the following error:

ErrorCode=SchemaMappingFailedInHierarchicalToTabularStage,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to process hierarchical to tabular stage, error message: One or more errors occurred.,Source=Microsoft.DataTransfer.ClientLibrary,'",

Copper Contributor

@Mysticode 

In my case, I first created a copy activity that uses a binary dataset source type to call the Marketing Cloud SOAP API and puts the results (XML envelope) into the data lake via a binary dataset sink. My binary source dataset uses an HTTP linked service; under the base URL you would use your API URL from your MC, but again the result gets dumped to the data lake as a binary sink. You are correct that there is no XML sink, so you will need to use a binary dataset for the sink that points to a location/folder in ADLS.

Then I created a new XML source dataset that points to the binary sink from the step above. I use this source in a new copy activity; for my sink I use an Azure DB table that has the columns I want to map. Then under mappings you can import schemas and you will see the XML structure. Here is where it can get complex: if it's a simple collection reference then you should be able to map it. Here is how my mapping looks.

 
 

[screenshot]
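
In JSON terms, such a mapping uses the copy activity's TabularTranslator with a collectionReference pointing at the repeating node; the paths and column names below are hypothetical, not the actual Marketing Cloud structure:

"translator": {
    "type": "TabularTranslator",
    "collectionReference": "$['Envelope']['Body']['Results']",
    "mappings": [
        { "source": { "path": "['ID']" }, "sink": { "name": "Id" } },
        { "source": { "path": "['Status']" }, "sink": { "name": "Status" } }
    ]
}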

Copper Contributor

Great addition. Does this also include the XML data type? I have JSON files and one of the columns is XML. I had a quick look and I'm still not sure whether this is supported for either copy activity or mapping data flows.

 

Microsoft

@sandeepthachan please file a support ticket for the copy activity perf and memory exception issue; an engineer can look into your particular case, and we may need more info on your exact data shape. Meanwhile, you can give mapping data flow a try for very large XML files.

Microsoft

@DC_07 thanks for sharing the use case. This XML format support is orthogonal to the XML data type in your data. The XML field in your case will be treated as a string.

Copper Contributor

@Linda_Wang thanks for the reply. I have tested the copy activity with the XML column mapped as string, and the error is the same as before this release :sad: The XML column from the JSON file is being inserted into an XML data type in Azure SQL DB.

 

Operation on target Copy data1 failed: ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A database operation failed. Please search error to get more details.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=XML parsing: line 1, character 55, unable to switch the encoding,Source=.Net SqlClient Data Provider,SqlErrorNumber=9402,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=9402,State=1,Message=XML parsing: line 1, character 55, unable to switch the encoding,},],'
Copper Contributor

@Mysticode not sure if you got it working. If you use a delimited CSV sink, it should work and you should get back the XML string in a row and column, so just map the one column. Here is my sink:

[screenshots]
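
A rough sketch of such a delimited-text sink dataset (location and names are placeholders):

{
    "name": "SoapResponseCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<data lake linked service>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "<filesystem>",
                "folderPath": "<folder>"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "firstRowAsHeader": true
        }
    }
}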

 

Copper Contributor

@Linda_Wang 

 

I'm trying with a data flow, which succeeds for files of 1 GB or less, but whenever it's 2 GB or more it fails with a memory error. Can you help me solve this issue?

 

Thank you

Microsoft

@BijuC Thank you for trying it out and raising the issue. Please file a support ticket on the memory error; we would need to collect some context about your runs for investigation.

Copper Contributor

@Linda_Wang Thank you for the reply. Could you please provide the URL to raise the support ticket?

Microsoft

@BijuC You can go to Azure portal -> navigate to your data factory -> on the left panel you can find "New support request" under "Support + troubleshooting".

Copper Contributor

Hi,

 

Could you please tell me how to configure XSD validation? All I see here is that it needs to be referenced inside the XML files using a relative path. Could you please show an example? It would be really helpful.

 

Thanks and Regards

Abhishek

Copper Contributor

Hey 

 

May I know how I can load data into an XML file in ADF? If sink isn't an option, what are my alternatives for handling this scenario?

Do respond if anyone has any ideas.

Copper Contributor

Hi @Linda_Wang 

 

I have a copy activity in a pipeline where the source is an XML file and the sink is a SQL table. I have some issues with mapping the XML file due to its hierarchical structure. I wonder if it is possible to use XPath in the mapping to pick the XML nodes.

Has anyone used XPath in this circumstance?

 

Thanks 
John Balta

Copper Contributor

Hi @Linda_Wang

 

The XML validation seems to be very restrictive: "...When using XSD or DTD to validate XML files, the XSD/DTD must be referred inside the XML files through relative path..."

 

The Stack Overflow response (https://stackoverflow.com/questions/63923010/xml-validation-in-azure-data-factory) shows the use of xsi:noNamespaceSchemaLocation="order.xsd" to specify the XSD.

 

But how can this be done for an external source where we don't have any control over the XML content, so there is no way we can add an "xsi:noNamespaceSchemaLocation" attribute?

Copper Contributor

Hi @Linda_Wang

I have an XML in the form below. I am trying to read it through a copy activity and load it into CSV format in a flattened-out structure, but I am only getting the first row of each node. When I try through data flow I am able to read it correctly. Could you please suggest how I can read it through the copy activity?

 

<?xml version="1.0" encoding="UTF-8"?>
<root>
<temproot id='124117' name='Los Angeles'>
<price logid='3...







Copper Contributor

Hi @Linda_Wang,

I am not able to flatten the below XML file into a flat CSV structure through the copy activity, while I am able to do so in Data Flow. Can you please help?

<root>
<temproot id='124117' name='Los Angeles'>
<price logid='317497' date='24-MAY-21' max='177.62' min='176.62'/>
<price logid='317107' date='26-MAY-21' max='178.64' min='177.89'/>
</temproot>
<temproot id='124119' name='North West'>
<price logid='31702' date='25-MAY-21' max='68.2' min='68.16' />
<price logid='31782' date='19-MAY-21' max='66.58' min='66.54'/>
</temproot>
</root>

Copper Contributor

Hi Team,

 

How do we convert a string message to XML format?

Suppose we have XML file data present as a string message. In Python we would convert/parse it into XML by using et.ElementTree(et.fromstring(msg)) or xmltodict.parse(msg).

 

In the same way, how do we parse a string message into XML format?

 

Thanks,

venkata sk

 

 

Copper Contributor

I'd like to know how to execute a function such as formatDateTime during the mapping on the copy task. I posted a question on the Microsoft Q&A site here.
