Help with Queries and Connections, Data from Web from multiple URLs

Copper Contributor

Hello,

 

I have a list of about 75 webpages (within the same website)   from which I need to extract an specific table on each of these pages.  I have been doing this one by one, (please see file attached  with a sample set of extracted data) by copying the  URLs which I previously extracted and placed  on an excel sheet as individual links  and pasting in the first query dialog box, then on the Navigator window, I select the table I want ( in this case it is consistently "Table0") and click on "load". and I get the table on a new sheet. From here I edit and sort it  according  to my need.

BUT....  I figure that there must be a better way to automate this extraction.   How do I do that?

 

 Here is the example link:

 http://www.usendoscopy.com/Products/roth-net-retriever---polyp.aspx

 There is only one  table contained on each of these URLs (pictured)  which lists product details  but I don’t know what is the sintax I need to target it for extraction when I choose "Data"-->  "Queries and Connections" --> "From Web", or if that is even the bes way to extract it.

Table-sample.PNG

 

6 Replies

Hi Marcio,

 

Didn't catch you'd like to combine all tables from web in one or keep them separate. If the first

- create the table with all your links

- based on your query create the function which extracts the table from web page

let

    Source = (link as text, name as text) => let

    Source = Web.Page(Web.Contents(link)),
    Data0 = Source{0}[Data],
    #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Product number", type text}, {"", type text}, {"sheath diameter", type text}, {"length", type text}, {"approximate net size", type text}, {"Unit/box", Int64.Type}, {"Quantity", type text}, {"2", type text}}),
    #"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"", "Description"}}),
    #"Replaced Value" = Table.ReplaceValue(#"Renamed Columns","",name,Replacer.ReplaceValue,{"Description"}),
    #"Duplicated Column" = Table.DuplicateColumn(#"Replaced Value", "Description", "Description - Copy"),
    #"Renamed Columns1" = Table.RenameColumns(#"Duplicated Column",{{"Description - Copy", "Product Name"}}),
    #"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Quantity", "2"}),
    #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns",{"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"})
in
    #"Reordered Columns"

in Source

- make query on your table with links

- in it add one more column with name of the product which is part of URL

- add another column with function above, parameters will be the link and name fields

- keep only latest column

- expand it - you'll have all your web tables combines

 

let
    Source = Excel.CurrentWorkbook(){[Name="Links"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Links", type text}}),
    #"Added Custom1" = Table.AddColumn(#"Changed Type", "Description", each [Links]),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Added Custom1", "Description", Splitter.SplitTextByDelimiter("/Products/", QuoteStyle.Csv), {"Description.1", "Description.2"}),
    #"Removed Columns1" = Table.RemoveColumns(#"Split Column by Delimiter",{"Description.1"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Description.2", "Description"}}),
    #"Replaced Value" = Table.ReplaceValue(#"Renamed Columns","-"," ",Replacer.ReplaceText,{"Description"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value",".aspx","",Replacer.ReplaceText,{"Description"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Tables", each fnTransfomLinkTable([Links], [Description])),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Links", "Description"}),
    #"Expanded Tables" = Table.ExpandTableColumn(#"Removed Columns", "Tables", {"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"}, {"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"})
in
    #"Expanded Tables"

Please see file attached

Code could be optimized, it's too straight forward, just to illustrate an idea. Performance is not good, my guess it's mainly due to web connection

 

 

 

 

Hello Sergei,

 

I had no idea how "in over my head" I would be getting.  You just did what the IT guys said couldn't be done.

 

But  here is the issue I face:  Almost every table  embedded in the (now much longer) list  of links on the updated sample fileI I have attached contains columns with different names, or even a different number of columns.  

 

I should have provided with that varied sampling before.  I tried to figure out on my own the logical steps you took but the variable of different column names and different number of coulmns have me at a loss.

 

Would you be willing to take another look at the file and teach me what I need to do to address the above?  I am looking to have all of the tables contained in the links combined into one table.  If not possible, I can use multiple tables and will combine them manually if I have to.  

 

I am really thankful for the time you took with me.  I realize the level of Q&A's that go on in this forum and I know I need to take classes in Querys urgently!  :)

 

Thanks again for your help.

 

Cheers,

 

Marcio

Hi Marcio,

 

Yes, that's bit complicated. In general tables on the web have three parts

- 1st which is Product Number and empty column

- 3rd which is Unix/box, Quantity and empty column

(above exists in each table in same format);

- 2nd which could have different column names, different number of columns and in some cases could be missed at all.

 

Moreover, some pages do not have tables, nothing to pick-up.

 

Perhaps the logic is more complex, but that what I seen. Based on this the function to transform web tables looks like

 

let

    Source = (link as text, name as text) => let

    Source = Web.Page(Web.Contents(link)),

    Data0 = Source{0}[Data],

    DataEnd = Table.SelectRows(Source, each ([ClassName] = "uk-table uk-text-nowrap")){0}[Data],
    DataMiddle = try Table.SelectRows(Source, each ([ClassName] = "uk-table")){0}[Data] otherwise Table.FromRows({}),

    ListOfColumns = List.Combine({
        List.FirstN(Table.ColumnNames(Data0),2),
        List.FirstN(Table.ColumnNames(DataEnd),2),
        Table.ColumnNames(DataMiddle)
    }),

    RemoveLastEmpty = Table.RemoveColumns(Data0,{"2"}),
    ReorderAsInList = Table.ReorderColumns(RemoveLastEmpty,ListOfColumns),
    RenameFirstEmpty = Table.RenameColumns(ReorderAsInList,{{"", "Product Name"}}),
    ProductName = Table.ReplaceValue(RenameFirstEmpty,"",name,Replacer.ReplaceValue,{"Product Name"}),
    DemoteHeaders = Table.DemoteHeaders(ProductName),

    EmptyTable = Table.FromRows({}),

    Result =
        if Source{0}[Source] = "Service"
        then EmptyTable
        else DemoteHeaders

in
    Result

    in Source

It has two branches of code - if web table exists we return transformed table, otherwise empty one.

 

Main query Links is practically the same, I only take as the source the table in your last excel sheet with links and Product names, named the table as "Products"

let
    Source = Excel.CurrentWorkbook(){[Name="Products"]}[Content],
    ToText = Table.TransformColumnTypes(Source,{{"Link", type text}, {"Product Name", type text}}),
    AddFromWeb = Table.AddColumn(ToText, "Tables", each fnTransfomLinkTable([Link], [Product Name])),
    RemoveUnused = Table.RemoveColumns(AddFromWeb,{"Link", "Product Name"}),
    ExpandWebTables = Table.ExpandTableColumn(RemoveUnused, "Tables",
        {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7"},
        {"Product number", "Product Name", "Unit/box", "Quantity", "Text1", "Text2", "Text3"}),
    RemoveEmptyTables = Table.SelectRows(ExpandWebTables, each ([Product number] <> null))
in
    RemoveEmptyTables

As result all web tables are combined in one returned to Excel. Together with headers for each web table since they are different.

 

Alternatively that could be unpivoted table generated by slightly modified above queries. Result is ProductsUnpivoted sheet. But it takes much more time to be generated.

Apropos to this question, check out

 

Deep Dive into Power Query Formula Language

 

Fast forward to about 41:30.

 

Dale

 

Hi Dale,

 

Is that something specific for this concrete task here?

Similar to what you have created here, the specific reference demonstrates how to create a function and then iterate across a set of URLs to pull table information.