Feb 28 2018
03:25 PM
- last edited on
Jul 12 2019
11:04 AM
by
TechCommunityAP
Feb 28 2018
03:25 PM
- last edited on
Jul 12 2019
11:04 AM
by
TechCommunityAP
Hello,
I have a list of about 75 webpages (within the same website) from which I need to extract an specific table on each of these pages. I have been doing this one by one, (please see file attached with a sample set of extracted data) by copying the URLs which I previously extracted and placed on an excel sheet as individual links and pasting in the first query dialog box, then on the Navigator window, I select the table I want ( in this case it is consistently "Table0") and click on "load". and I get the table on a new sheet. From here I edit and sort it according to my need.
BUT.... I figure that there must be a better way to automate this extraction. How do I do that?
Here is the example link:
http://www.usendoscopy.com/Products/roth-net-retriever---polyp.aspx
There is only one table contained on each of these URLs (pictured) which lists product details but I don’t know what is the sintax I need to target it for extraction when I choose "Data"--> "Queries and Connections" --> "From Web", or if that is even the bes way to extract it.
Mar 01 2018 01:49 AM
Hi Marcio,
Didn't catch you'd like to combine all tables from web in one or keep them separate. If the first
- create the table with all your links
- based on your query create the function which extracts the table from web page
let Source = (link as text, name as text) => let Source = Web.Page(Web.Contents(link)), Data0 = Source{0}[Data], #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Product number", type text}, {"", type text}, {"sheath diameter", type text}, {"length", type text}, {"approximate net size", type text}, {"Unit/box", Int64.Type}, {"Quantity", type text}, {"2", type text}}), #"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"", "Description"}}), #"Replaced Value" = Table.ReplaceValue(#"Renamed Columns","",name,Replacer.ReplaceValue,{"Description"}), #"Duplicated Column" = Table.DuplicateColumn(#"Replaced Value", "Description", "Description - Copy"), #"Renamed Columns1" = Table.RenameColumns(#"Duplicated Column",{{"Description - Copy", "Product Name"}}), #"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Quantity", "2"}), #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns",{"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"}) in #"Reordered Columns" in Source
- make query on your table with links
- in it add one more column with name of the product which is part of URL
- add another column with function above, parameters will be the link and name fields
- keep only latest column
- expand it - you'll have all your web tables combines
let Source = Excel.CurrentWorkbook(){[Name="Links"]}[Content], #"Changed Type" = Table.TransformColumnTypes(Source,{{"Links", type text}}), #"Added Custom1" = Table.AddColumn(#"Changed Type", "Description", each [Links]), #"Split Column by Delimiter" = Table.SplitColumn(#"Added Custom1", "Description", Splitter.SplitTextByDelimiter("/Products/", QuoteStyle.Csv), {"Description.1", "Description.2"}), #"Removed Columns1" = Table.RemoveColumns(#"Split Column by Delimiter",{"Description.1"}), #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Description.2", "Description"}}), #"Replaced Value" = Table.ReplaceValue(#"Renamed Columns","-"," ",Replacer.ReplaceText,{"Description"}), #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value",".aspx","",Replacer.ReplaceText,{"Description"}), #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Tables", each fnTransfomLinkTable([Links], [Description])), #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Links", "Description"}), #"Expanded Tables" = Table.ExpandTableColumn(#"Removed Columns", "Tables", {"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"}, {"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"}) in #"Expanded Tables"
Please see file attached
Code could be optimized, it's too straight forward, just to illustrate an idea. Performance is not good, my guess it's mainly due to web connection
Mar 02 2018 06:33 PM
Hello Sergei,
I had no idea how "in over my head" I would be getting. You just did what the IT guys said couldn't be done.
But here is the issue I face: Almost every table embedded in the (now much longer) list of links on the updated sample fileI I have attached contains columns with different names, or even a different number of columns.
I should have provided with that varied sampling before. I tried to figure out on my own the logical steps you took but the variable of different column names and different number of coulmns have me at a loss.
Would you be willing to take another look at the file and teach me what I need to do to address the above? I am looking to have all of the tables contained in the links combined into one table. If not possible, I can use multiple tables and will combine them manually if I have to.
I am really thankful for the time you took with me. I realize the level of Q&A's that go on in this forum and I know I need to take classes in Querys urgently! :)
Thanks again for your help.
Cheers,
Marcio
Mar 03 2018 10:24 AM - edited Mar 03 2018 01:56 PM
Hi Marcio,
Yes, that's bit complicated. In general tables on the web have three parts
- 1st which is Product Number and empty column
- 3rd which is Unix/box, Quantity and empty column
(above exists in each table in same format);
- 2nd which could have different column names, different number of columns and in some cases could be missed at all.
Moreover, some pages do not have tables, nothing to pick-up.
Perhaps the logic is more complex, but that what I seen. Based on this the function to transform web tables looks like
let Source = (link as text, name as text) => let Source = Web.Page(Web.Contents(link)), Data0 = Source{0}[Data], DataEnd = Table.SelectRows(Source, each ([ClassName] = "uk-table uk-text-nowrap")){0}[Data], DataMiddle = try Table.SelectRows(Source, each ([ClassName] = "uk-table")){0}[Data] otherwise Table.FromRows({}), ListOfColumns = List.Combine({ List.FirstN(Table.ColumnNames(Data0),2), List.FirstN(Table.ColumnNames(DataEnd),2), Table.ColumnNames(DataMiddle) }), RemoveLastEmpty = Table.RemoveColumns(Data0,{"2"}), ReorderAsInList = Table.ReorderColumns(RemoveLastEmpty,ListOfColumns), RenameFirstEmpty = Table.RenameColumns(ReorderAsInList,{{"", "Product Name"}}), ProductName = Table.ReplaceValue(RenameFirstEmpty,"",name,Replacer.ReplaceValue,{"Product Name"}), DemoteHeaders = Table.DemoteHeaders(ProductName), EmptyTable = Table.FromRows({}), Result = if Source{0}[Source] = "Service" then EmptyTable else DemoteHeaders in Result in Source
It has two branches of code - if web table exists we return transformed table, otherwise empty one.
Main query Links is practically the same, I only take as the source the table in your last excel sheet with links and Product names, named the table as "Products"
let Source = Excel.CurrentWorkbook(){[Name="Products"]}[Content], ToText = Table.TransformColumnTypes(Source,{{"Link", type text}, {"Product Name", type text}}), AddFromWeb = Table.AddColumn(ToText, "Tables", each fnTransfomLinkTable([Link], [Product Name])), RemoveUnused = Table.RemoveColumns(AddFromWeb,{"Link", "Product Name"}), ExpandWebTables = Table.ExpandTableColumn(RemoveUnused, "Tables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7"}, {"Product number", "Product Name", "Unit/box", "Quantity", "Text1", "Text2", "Text3"}), RemoveEmptyTables = Table.SelectRows(ExpandWebTables, each ([Product number] <> null)) in RemoveEmptyTables
As result all web tables are combined in one returned to Excel. Together with headers for each web table since they are different.
Alternatively that could be unpivoted table generated by slightly modified above queries. Result is ProductsUnpivoted sheet. But it takes much more time to be generated.
Mar 06 2018 05:09 AM
Apropos to this question, check out
Fast forward to about 41:30.
Dale
Mar 06 2018 05:59 AM
Hi Dale,
Is that something specific for this concrete task here?
Mar 06 2018 07:03 AM