Help request. Power query manipulation of messy data

Copper Contributor

New to Power Query (mind is kinda blown I never new this existed, been using the heck out of Excel and array formulas/nested if's to do similar things, in a harder way and repeatedly, for years).

 

My "starting point" is right after pdf import of a table. The pdf is diabolical. 300+ pg's, each of which has a table. When I import, PowerQuery does not recognize the table spanning multiple pg's, and does not recognize the table on every pg. So I need to set up query steps for the pg's, BUT the each pg does not import as the same number of columns!

 

End goal is to tease out those chunks of data that are 4 rows tall but with the qty is stuck to the side, in a nearby column. Eventually having one set per row.

 

Two questions:

1. Anyone see an elegant way to do a single pg? I am working on replacing, [images] and null's with blank; Collapsing all columns down to one, comma delimited; Replacing multiple commas with one comma; then splitting things back out. But no success yet.

2. The pdf has ~300pgs, one table on each page. I import and and select all the pg's via check box, which loads each pg as a query. How do I automatically apply my "applied steps" to all queries? It seems I can copy the "advanced editor" lines in to each pg, but I don't want to do that 300 times, every week.

I could maybe append all the pg's first, if I can get my solution to be nimble enough.

Since all tables are from the same pdf, the query editor doesn't pick a sample table, like it does when you load multiple pdfs at once.

 

I have a sample excel file. Page002 and Page003 are examples of how the pages import. Goal is a manually created version. 

 

https://docs.google.com/spreadsheets/d/1NGGPFo1I9DjO2CtfRHCwGom5F0R00dN8/edit?usp=sharing&ouid=10402...

 

Thank you for any ideas!

4 Replies

I am trying to figure out how to post my sample file.

Edit:  Added google docs link. If there is a better way, Please let me know. 

Hi @pau11y 

 

With the sample you shared this seems doable. To confirm, could you share a PDF file?

Thank you for your offer, by the way. I got some traction on this via responses in the PowerBI Subreddit. Basically creating new column that is a list of the data in the rows. then using using features of the data to peel everything back out in to more organized new columns.
to detect where to split the two vertical groups, the table was transposed, then a new column detected the number of entries in each row, the assumption being the two rows with the most entries correspond roughly to where the split in the orginally table needs to happen.

Now I trying to figure out how to automate these steps with a custom function. I will make a new post regarding this. Thank you again.
skipping the "let" and "source" entries:

Page1 = Source{[Id="Page002"]}[Data],
#"Replaced Value" = Table.ReplaceValue(Page1,"[image]",null,Replacer.ReplaceValue,Table.ColumnNames(Page1)),
#"Added Conditional Column" = Table.AddColumn(#"Replaced Value", "Custom", each List.IsEmpty(List.RemoveNulls(Record.FieldValues(_)))),
#"Remove null rows" = Table.SelectRows(#"Added Conditional Column", each ([Custom] = false)),
#"Removed Columns" = Table.RemoveColumns(#"Remove null rows",{"Custom"}),
#"Idenitfy rows with unnecessary text" = Table.AddColumn(#"Removed Columns", "Custom", each Text.Contains(List.RemoveNulls(Record.FieldValues(_)){0},"page") or Text.Contains(List.RemoveNulls(Record.FieldValues(_)){0},"MESSENGER") or Text.Contains(List.RemoveNulls(Record.FieldValues(_)){0},"CLUTCH")),
#"Remove rows" = Table.SelectRows(#"Idenitfy rows with unnecessary text", each ([Custom] = false)),
#"Removed Columns1" = Table.RemoveColumns(#"Remove rows",{"Custom"}),
#"Transposed Table" = Table.Transpose(#"Removed Columns1"),
//#"Add col to identify all null rows2" = Table.AddColumn(#"Transposed Table", "Custom", each List.IsEmpty(List.RemoveNulls(Record.FieldValues(_)))),
//#"Removed null rows 2" = Table.SelectRows(#"Add col to identify all null rows2", each ([Custom] = false)),
//#"Removed Columns2" = Table.RemoveColumns(#"Removed null rows 2",{"Custom"}),
#"Added Index" = Table.AddIndexColumn(#"Transposed Table", "Index", 0, 1, Int64.Type),
#"Add col to identify rows with most items" = Table.AddColumn(#"Added Index", "Custom", each List.Count(List.RemoveNulls(Record.FieldValues(_)))),
#"Sorted Rows" = Table.Sort(#"Add col to identify rows with most items",{{"Custom", Order.Descending}}),
#"Kept First Rows" = Table.FirstN(#"Sorted Rows",2),
#"Sorted Rows1" = Table.Sort(#"Kept First Rows",{{"Index", Order.Descending}}),
#"Transposed Table1" = Table.Transpose(#"Transposed Table"),
#"Left Side" = Table.SelectColumns(#"Transposed Table1",List.FirstN(Table.ColumnNames(#"Transposed Table1"), #"Sorted Rows1"[Index]{0}-1)),
#"Right Side" = Table.SelectColumns(#"Transposed Table1",List.RemoveFirstN(Table.ColumnNames(#"Transposed Table1"), #"Sorted Rows1"[Index]{0}-1)),
#"Demoted Headers" = Table.DemoteHeaders(#"Right Side"),
#"Removed Top Rows2" = Table.Skip(#"Demoted Headers",1),
#"Appending left side with right side" = Table.Combine({#"Left Side", #"Removed Top Rows2"}),
#"Add col to identify all null rows3" = Table.AddColumn(#"Appending left side with right side", "Custom", each List.IsEmpty(List.RemoveNulls(Record.FieldValues(_)))),
#"Removed null rows 3" = Table.SelectRows(#"Add col to identify all null rows3", each ([Custom] = false)),
#"Removed Columns3" = Table.RemoveColumns(#"Removed null rows 3",{"Custom"}),
#"Added Custom3" = Table.AddColumn(#"Removed Columns3", "List", each List.RemoveNulls(Record.FieldValues(_))),
#"Added Custom4" = Table.AddColumn(#"Added Custom3", "Size", each if Text.Contains([List]{0},"Size") then [List]{0} else null),
#"Added Custom5" = Table.AddColumn(#"Added Custom4", "Price", each if Text.Contains([List]{0},"Price") then [List]{0} else null),
#"Added Custom6" = Table.AddColumn(#"Added Custom5", "Quantity", each if Text.Contains(List.Last([List]),"Qty") then List.Last([List]) else null),
#"Added Custom7" = Table.AddColumn(#"Added Custom6", "Status", each if List.Contains({"DISCONTINUED","Other options"},[List]{0}) then [List]{0} else null),
#"Added Custom8" = Table.AddColumn(#"Added Custom7", "ID", each if Text.Contains([List]{0},"ADB") then [List]{0} else null),

#"Removed Columns4" = Table.RemoveColumns(#"Added Custom8",{"Status"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Removed Columns4",{"Size", "Price", "Quantity", "ID"}),
#"Filled Up" = Table.FillUp(#"Removed Other Columns1",{"Size", "Price", "Quantity"}),
#"Filtered Rows3" = Table.SelectRows(#"Filled Up", each ([ID] <> null)),
#"Identify which ID's have status" = Table.SelectColumns(#"Added Custom8",{"Status", "ID"}),
#"Filled Down" = Table.FillDown(#"Identify which ID's have status",{"ID"}),
#"Filtered Rows4" = Table.SelectRows(#"Filled Down", each ([Status] = "DISCONTINUED")),
#"Lookup which ID's are disc" = #"Filtered Rows3",
#"Merged Queries" = Table.NestedJoin(#"Lookup which ID's are disc", {"ID"}, #"Filtered Rows4", {"ID"}, "Custom2", JoinKind.LeftOuter),
#"Expanded Custom2" = Table.ExpandTableColumn(#"Merged Queries", "Custom2", {"Status"}, {"Status"}),
#"Replaced Value2" = Table.ReplaceValue(#"Expanded Custom2","Size: ","",Replacer.ReplaceText,{"Size"}),
#"Replaced Value3" = Table.ReplaceValue(#"Replaced Value2","Price: ","",Replacer.ReplaceText,{"Price"}),
#"Replaced Value4" = Table.ReplaceValue(#"Replaced Value3","Qty: ","",Replacer.ReplaceText,{"Quantity"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Replaced Value4",{{"Price", type number}, {"Quantity", type number}}),
#"Reordered Columns" = Table.ReorderColumns(#"Changed Type1",{"ID", "Size", "Price", "Quantity", "Status"})

in
#"Reordered Columns"