Forum Discussion

Marcio Flores's avatar
Marcio Flores
Copper Contributor
Feb 28, 2018

Help with Queries and Connections, Data from Web from multiple URLs

Hello,

 

I have a list of about 75 webpages (within the same website)   from which I need to extract an specific table on each of these pages.  I have been doing this one by one, (please see file attached  with a sample set of extracted data) by copying the  URLs which I previously extracted and placed  on an excel sheet as individual links  and pasting in the first query dialog box, then on the Navigator window, I select the table I want ( in this case it is consistently "Table0") and click on "load". and I get the table on a new sheet. From here I edit and sort it  according  to my need.

BUT....  I figure that there must be a better way to automate this extraction.   How do I do that?

 

 Here is the example link:

 http://www.usendoscopy.com/Products/roth-net-retriever---polyp.aspx

 There is only one  table contained on each of these URLs (pictured)  which lists product details  but I don’t know what is the sintax I need to target it for extraction when I choose "Data"-->  "Queries and Connections" --> "From Web", or if that is even the bes way to extract it.

 

6 Replies

  • SergeiBaklan's avatar
    SergeiBaklan
    Diamond Contributor

    Hi Marcio,

     

    Didn't catch you'd like to combine all tables from web in one or keep them separate. If the first

    - create the table with all your links

    - based on your query create the function which extracts the table from web page

    let
    
        Source = (link as text, name as text) => let
    
        Source = Web.Page(Web.Contents(link)),
        Data0 = Source{0}[Data],
        #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Product number", type text}, {"", type text}, {"sheath diameter", type text}, {"length", type text}, {"approximate net size", type text}, {"Unit/box", Int64.Type}, {"Quantity", type text}, {"2", type text}}),
        #"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"", "Description"}}),
        #"Replaced Value" = Table.ReplaceValue(#"Renamed Columns","",name,Replacer.ReplaceValue,{"Description"}),
        #"Duplicated Column" = Table.DuplicateColumn(#"Replaced Value", "Description", "Description - Copy"),
        #"Renamed Columns1" = Table.RenameColumns(#"Duplicated Column",{{"Description - Copy", "Product Name"}}),
        #"Removed Columns" = Table.RemoveColumns(#"Renamed Columns1",{"Quantity", "2"}),
        #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns",{"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"})
    in
        #"Reordered Columns"
    
    in Source

    - make query on your table with links

    - in it add one more column with name of the product which is part of URL

    - add another column with function above, parameters will be the link and name fields

    - keep only latest column

    - expand it - you'll have all your web tables combines

     

    let
        Source = Excel.CurrentWorkbook(){[Name="Links"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"Links", type text}}),
        #"Added Custom1" = Table.AddColumn(#"Changed Type", "Description", each [Links]),
        #"Split Column by Delimiter" = Table.SplitColumn(#"Added Custom1", "Description", Splitter.SplitTextByDelimiter("/Products/", QuoteStyle.Csv), {"Description.1", "Description.2"}),
        #"Removed Columns1" = Table.RemoveColumns(#"Split Column by Delimiter",{"Description.1"}),
        #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Description.2", "Description"}}),
        #"Replaced Value" = Table.ReplaceValue(#"Renamed Columns","-"," ",Replacer.ReplaceText,{"Description"}),
        #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value",".aspx","",Replacer.ReplaceText,{"Description"}),
        #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Tables", each fnTransfomLinkTable([Links], [Description])),
        #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Links", "Description"}),
        #"Expanded Tables" = Table.ExpandTableColumn(#"Removed Columns", "Tables", {"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"}, {"Product number", "Description", "Unit/box", "sheath diameter", "length", "approximate net size", "Product Name"})
    in
        #"Expanded Tables"

    Please see file attached

    Code could be optimized, it's too straight forward, just to illustrate an idea. Performance is not good, my guess it's mainly due to web connection

     

     

     

     

    • Marcio Flores's avatar
      Marcio Flores
      Copper Contributor

      Hello Sergei,

       

      I had no idea how "in over my head" I would be getting.  You just did what the IT guys said couldn't be done.

       

      But  here is the issue I face:  Almost every table  embedded in the (now much longer) list  of links on the updated sample fileI I have attached contains columns with different names, or even a different number of columns.  

       

      I should have provided with that varied sampling before.  I tried to figure out on my own the logical steps you took but the variable of different column names and different number of coulmns have me at a loss.

       

      Would you be willing to take another look at the file and teach me what I need to do to address the above?  I am looking to have all of the tables contained in the links combined into one table.  If not possible, I can use multiple tables and will combine them manually if I have to.  

       

      I am really thankful for the time you took with me.  I realize the level of Q&A's that go on in this forum and I know I need to take classes in Querys urgently!  :)

       

      Thanks again for your help.

       

      Cheers,

       

      Marcio

      • SergeiBaklan's avatar
        SergeiBaklan
        Diamond Contributor

        Hi Marcio,

         

        Yes, that's bit complicated. In general tables on the web have three parts

        - 1st which is Product Number and empty column

        - 3rd which is Unix/box, Quantity and empty column

        (above exists in each table in same format);

        - 2nd which could have different column names, different number of columns and in some cases could be missed at all.

         

        Moreover, some pages do not have tables, nothing to pick-up.

         

        Perhaps the logic is more complex, but that what I seen. Based on this the function to transform web tables looks like

         

        let
        
            Source = (link as text, name as text) => let
        
            Source = Web.Page(Web.Contents(link)),
        
            Data0 = Source{0}[Data],
        
            DataEnd = Table.SelectRows(Source, each ([ClassName] = "uk-table uk-text-nowrap")){0}[Data],
            DataMiddle = try Table.SelectRows(Source, each ([ClassName] = "uk-table")){0}[Data] otherwise Table.FromRows({}),
        
            ListOfColumns = List.Combine({
                List.FirstN(Table.ColumnNames(Data0),2),
                List.FirstN(Table.ColumnNames(DataEnd),2),
                Table.ColumnNames(DataMiddle)
            }),
        
            RemoveLastEmpty = Table.RemoveColumns(Data0,{"2"}),
            ReorderAsInList = Table.ReorderColumns(RemoveLastEmpty,ListOfColumns),
            RenameFirstEmpty = Table.RenameColumns(ReorderAsInList,{{"", "Product Name"}}),
            ProductName = Table.ReplaceValue(RenameFirstEmpty,"",name,Replacer.ReplaceValue,{"Product Name"}),
            DemoteHeaders = Table.DemoteHeaders(ProductName),
        
            EmptyTable = Table.FromRows({}),
        
            Result =
                if Source{0}[Source] = "Service"
                then EmptyTable
                else DemoteHeaders
        
        in
            Result
        
            in Source

        It has two branches of code - if web table exists we return transformed table, otherwise empty one.

         

        Main query Links is practically the same, I only take as the source the table in your last excel sheet with links and Product names, named the table as "Products"

        let
            Source = Excel.CurrentWorkbook(){[Name="Products"]}[Content],
            ToText = Table.TransformColumnTypes(Source,{{"Link", type text}, {"Product Name", type text}}),
            AddFromWeb = Table.AddColumn(ToText, "Tables", each fnTransfomLinkTable([Link], [Product Name])),
            RemoveUnused = Table.RemoveColumns(AddFromWeb,{"Link", "Product Name"}),
            ExpandWebTables = Table.ExpandTableColumn(RemoveUnused, "Tables",
                {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7"},
                {"Product number", "Product Name", "Unit/box", "Quantity", "Text1", "Text2", "Text3"}),
            RemoveEmptyTables = Table.SelectRows(ExpandWebTables, each ([Product number] <> null))
        in
            RemoveEmptyTables

        As result all web tables are combined in one returned to Excel. Together with headers for each web table since they are different.

         

        Alternatively that could be unpivoted table generated by slightly modified above queries. Result is ProductsUnpivoted sheet. But it takes much more time to be generated.

Resources