Read an XML Spreadsheet

Copper Contributor

Hi - just started learning Power Query / Get & Transform this month - what an incredible tool! My background is VBA, so getting data cleaned/shaped without writing any code is incredible!

 

Has anyone tried reading a XML Spreadsheet? Here's the header of my data:

 

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Styles>
<Style ss:ID="date"><NumberFormat ss:Format="Short Date"/></Style>
<Style ss:ID="bold"><Font ss:Bold="1"/></Style>
<Style ss:ID="default"></Style>
</Styles>
<Worksheet ss:Name="Data">
<Table>
<Row ss:StyleID="bold"><Cell><Data ss:Type="String">Company</Data></Cell><Cell><Data ss:Type="String">Job</Data></Cell><Cell><Data ss:Type="String">Emp Num</Data></Cell><Cell><Data ss:Type="String">Emp Name</Data></Cell><Cell><Data ss:Type="String">SSN</Data></Cell><Cell><Data ss:Type="String">Period Date</Data></Cell><Cell><Data ss:Type="String">Skill</Data></Cell><Cell><Data ss:Type="String">Hour Type</Data></Cell><Cell><Data ss:Type="String">ST</Data></Cell><Cell><Data ss:Type="String">OT</Data></Cell><Cell><Data ss:Type="String">DT</Data></Cell><Cell><Data ss:Type="String">HW</Data></Cell><Cell><Data ss:Type="String">HP</Data></Cell><Cell><Data ss:Type="String">Skill Category</Data></Cell><Cell><Data ss:Type="String">Skill Category Name</Data></Cell><Cell><Data ss:Type="String">FD</Data></Cell><Cell><Data ss:Type="String">Calc No.</Data></Cell><Cell><Data ss:Type="String">Description</Data></Cell><Cell><Data ss:Type="String">Contribution</Data></Cell><Cell><Data ss:Type="String">VH</Data></Cell><Cell><Data ss:Type="String">Contribution Amount</Data></Cell><Cell><Data ss:Type="String">Has Dues</Data></Cell><Cell><Data ss:Type="String">Dues Amnt</Data></Cell><Cell><Data ss:Type="String">Has PAC</Data></Cell><Cell><Data ss:Type="String">PAC Amount</Data></Cell><Cell><Data ss:Type="String">Has 401k</Data></Cell><Cell><Data ss:Type="String">401K Amount</Data></Cell><Cell><Data ss:Type="String">401K Rate</Data></Cell><Cell><Data ss:Type="String">Premium</Data></Cell><Cell><Data ss:Type="String">Has ROTH</Data></Cell><Cell><Data ss:Type="String">ROTH Amount</Data></Cell><Cell><Data ss:Type="String">ROTH Rate</Data></Cell><Cell><Data ss:Type="String">Roth Premium</Data></Cell><Cell><Data ss:Type="String">Address</Data></Cell><Cell><Data ss:Type="String">Local</Data></Cell><Cell><Data ss:Type="String">Union Skill</Data></Cell><Cell><Data ss:Type="String">Skill Name</Data></Cell><Cell><Data ss:Type="String">Skill Class Prior Rate</Data></Cell><Cell><Data ss:Type="String">Skill Class Current Rate</Data></Cell><Cell><Data ss:Type="String">Skill Class Due</Data></Cell><Cell><Data ss:Type="String">Skill Class HRA</Data></Cell><Cell><Data ss:Type="String">Skill Class 401A</Data></Cell><Cell><Data ss:Type="String">Skill Credit Union Code</Data></Cell><Cell><Data ss:Type="String">Skill Credit Union Rate</Data></Cell><Cell><Data ss:Type="String">Skill 401k Code</Data></Cell><Cell><Data ss:Type="String">Skill 401k Rate</Data></Cell><Cell><Data ss:Type="String">Unassigned</Data></Cell><Cell><Data ss:Type="String">Company Name</Data></Cell><Cell><Data ss:Type="String">CO Addr1</Data></Cell><Cell><Data ss:Type="String">CO Addr2</Data></Cell><Cell><Data ss:Type="String">CO City</Data></Cell><Cell><Data ss:Type="String">CO State</Data></Cell><Cell><Data ss:Type="String">CO Zip</Data></Cell><Cell><Data ss:Type="String">Master Agreement</Data></Cell><Cell><Data ss:Type="String">Year</Data></Cell><Cell><Data ss:Type="String">Period</Data></Cell><Cell><Data ss:Type="String">Check</Data></Cell><Cell><Data ss:Type="String">Tran</Data></Cell><Cell><Data ss:Type="String">Key</Data></Cell><Cell><Data ss:Type="String">Union</Data></Cell><Cell><Data ss:Type="String">Gross</Data></Cell><Cell><Data ss:Type="String">Manual Check</Data></Cell></Row>
</Table>
</Worksheet>
</Workbook>

Data is received monthly with 150K+ rows give or take. I used a small subset of about 2K rows to create my M Code, but when applied to the full dataset, I've waited over 30min without any results and just canceled the process.

 

Here's the M Code I have (It uses Ken Puls' Parameter Table to get the filepath - local, not on a shared drive)

let
    SolutionFile = fnGetParameter("File"),
    Source = Xml.Tables(File.Contents(SolutionFile)),
    #"Filtered Rows" = Table.SelectRows(Source, each not Text.Contains([Name], "Styles")),
    Table = #"Filtered Rows"{0}[Table],
    #"Filtered Rows1" = Table.SelectRows(Table, each ([Name] = "Table")),
    Table1 = #"Filtered Rows1"{0}[Table],
    Table2 = Table1{0}[Table],
    #"Removed Columns" = Table.RemoveColumns(Table2,{"urn:schemas-microsoft-com:office:spreadsheet"}),
    AddedIndex = Table.AddIndexColumn(#"Removed Columns", "Index", 0, 1),
    #"Expanded Cell" = Table.ExpandTableColumn(AddedIndex, "Cell", {"Data"}, {"Cell.Data"}),
    #"Expanded Cell.Data" = Table.ExpandTableColumn(#"Expanded Cell", "Cell.Data", {"Element:Text"}, {"Cell.Data.Element:Text"}),
    #"Renamed Columns" = Table.RenameColumns(#"Expanded Cell.Data",{{"Cell.Data.Element:Text", "Data"}, {"Index", "Row"}}),
    #"Added Index" = Table.AddIndexColumn(#"Renamed Columns", "Index", 0, 1),
    #"Inserted Modulo" = Table.AddColumn(#"Added Index", "Inserted Modulo", each Number.Mod([Index], 63), type number),
    #"Removed Columns1" = Table.RemoveColumns(#"Inserted Modulo",{"Index"}),
    #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Removed Columns1", {{"Inserted Modulo", type text}}, "en-US"), List.Distinct(Table.TransformColumnTypes(#"Removed Columns1", {{"Inserted Modulo", type text}}, "en-US")[#"Inserted Modulo"]), "Inserted Modulo", "Data"),
    #"Removed Columns2" = Table.RemoveColumns(#"Pivoted Column",{"Row"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns2"),
    Custom1 = Table.RowCount(#"Promoted Headers")
in
    Custom1

Currently I get great performance if I open the file first and save as an Excel file. A bit better performance if I save it as a CSV. Obviously, with those two formats, it's already in table format, I don't have to parse the XML.

 

I don't know if I'm parsing the XML correctly or efficiently. I'd really like to be able to use the XML file directly rather than taking the extra "open/save as" route.

 

Thanks for any tips/tricks.

1 Reply

Using Table notation, I can get directly to the data I need:
Data = Source{1}[Table]{0}[Table]{0}[Table]

 

But as I afeared, applying that to the original data set, I get this error:
Expression.Error: Evaluation ran out of memory and can't continue.