Forum Discussion

1 Reply

  • Lorenzo
    Silver Contributor

    Hi

    It's almost all about parsing the HTML code and transforming it into tables with Html.Table (https://learn.microsoft.com/en-us/powerquery-m/html-table). Note that a couple of times I got the error "Unable to connect..."
    I have no idea what you want to do with the content of each PDF, so the query below stops after getting the content of each file.

    Power Query:

    let
        Source = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/" ),
        HtmlTextToTable = #table(type table [HtmlText = Text.Type],
            {{Source}}
        ),
        SelectedTextBetweenPreTags = Table.AddColumn( HtmlTextToTable, "BetweenPreTags", each
            Text.BetweenDelimiters( [HtmlText], "<pre>", "</pre>" )
        ),
        RemovedHtmlTextColumn = Table.SelectColumns( SelectedTextBetweenPreTags, {"BetweenPreTags"} ),
        RemovedDoubleQuotes = Table.ReplaceValue( RemovedHtmlTextColumn, """", "",
            Replacer.ReplaceText, {"BetweenPreTags"}
        ),
        PdfParentLink = Table.AddColumn( RemovedDoubleQuotes, "PdfParentLink", each
            let
                tableFromHtml = Html.Table( [BetweenPreTags], {{"ParentLink", "a", each "https://www.etsi.org" & [Attributes][href]}} )
            in
                // The root directory doesn't contain any file ==> skip the 1st record (the link back to it)
                Table.Skip( tableFromHtml, 1 ),
            Table.Type
        ),
        RemovedOtherColumn = Table.SelectColumns( PdfParentLink, {"PdfParentLink"}),
        ExpandedPdfParentLink = Table.ExpandTableColumn( RemovedOtherColumn, "PdfParentLink", {"ParentLink"} ),
    
        // There seems to be a single file per Directory...
        PdfFileName = Table.AddColumn( ExpandedPdfParentLink, "PdfName", each
            let
                webContent = Web.BrowserContents( [ParentLink] ),
                betweenHrefTag1 = Text.BetweenDelimiters( webContent, "<a href=", "</a>", 1 )
            in
                Text.AfterDelimiter( betweenHrefTag1, ">", {0, RelativePosition.FromEnd} ),
            Text.Type
        ),
        PdfContents = Table.AddColumn( PdfFileName, "PdfContents", each
            Pdf.Tables( Web.Contents( [ParentLink] & [PdfName] ) , [Implementation = "1.3"] ),
            Table.Type
        )
    in
        PdfContents
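
To make the PdfFileName step easier to follow, here is a minimal offline sketch of the same two text functions applied to a made-up HTML string. SampleHtml and the file name ts_sample.pdf are hypothetical stand-ins, not the real ETSI markup; the sketch assumes, as the query above does, that the first link on each sub-directory page points back to the parent directory and the second one is the PDF.

```powerquery
let
    // Hypothetical stand-in for one sub-directory page (not real ETSI markup)
    SampleHtml = "<pre><a href=""/deliver/sample/"">[To Parent Directory]</a><br>"
        & "<a href=""/deliver/sample/01.00.00_60/ts_sample.pdf"">ts_sample.pdf</a><br></pre>",

    // Index 1 = second occurrence of "<a href=" (the first one is the parent-directory link)
    SecondAnchor = Text.BetweenDelimiters( SampleHtml, "<a href=", "</a>", 1 ),

    // Keep what follows the last ">": the visible link text, i.e. the file name
    FileName = Text.AfterDelimiter( SecondAnchor, ">", {0, RelativePosition.FromEnd} )
in
    FileName   // "ts_sample.pdf"
```

If a directory ever holds more than one file, this index-based approach only returns the first of them, so the step would need to be reworked (for instance with Html.Table, as in the PdfParentLink step).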

     
