Forum Discussion

anupambit1797's avatar
anupambit1797
Iron Contributor
Dec 04, 2025
Solved

How to write a script or any PQ or in Excel to download the zip files from a Webpage

Dear Experts,

                     Greetings!

https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/

 

Could you please help me on how to download the pdf.zip files from above for all the versions?

Using a single command in Excel or PQ-option.

Thanks in Advance,

Br,

Anupam

  • A better option:

    let
        Source = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/" ),
        UrlRoot = "https://www.etsi.org",
        // All Parent 'directory' link end with ".00_60/"
        ParentLinks = Html.Table( Source,
            { {"Parent_Link", "a[href$='.00_60/']", each UrlRoot & [Attributes][href]} }
        ),
        PdfContents = Table.AddColumn( ParentLinks, "PdfContents", each
            let
                childContents = Web.BrowserContents( [Parent_Link] ),
                // Only one pdf file in sub-directory
                pdfLink = Table.FirstValue(
                    Html.Table( childContents,
                        { {"Pdf_Link", "a[href$='.pdf']", each UrlRoot & [Attributes][href]} }
                    )
                )
            in
              Pdf.Tables( Web.Contents( pdfLink ) , [Implementation = "1.3"] ),
              Table.Type
        )
    in
        PdfContents

     

    EDIT: CSS Selectors: https://www.w3schools.com/cssref/css_selectors.php

    let
        OneParentLink = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/15.02.00_60/" ),
        UrlRoot = "https://www.etsi.org",    
    
        // Get <a href>'s ending with '.pdf' 
        OptionOne = Html.Table( OneParentLink,
            { {"Pdf_Link", "a[href$='.pdf']", each UrlRoot & [Attributes][href]} }
        ),
    
        OptionTwo = Html.Table( OneParentLink,
            { {"Pdf_Link", "a", each UrlRoot & [Attributes][href]} },
            [RowSelector = "a[href$='.pdf']"]
        )
    in
        OptionTwo

     

5 Replies

  • Lorenzo's avatar
    Lorenzo
    Silver Contributor

    Hi anupambit1797​ 

    To help people who search for existing solutions could you please:

    1. Update the title of this thread to better reflect the challenge - ex. How to write a script or any PQ or Excel formula to download the pdf files from a Webpage
    2. Provide feedback/mark as solved if appropriate - Hope you saw the Better option that's more straightforward than the initial proposal

    Thanks & let me know if you need more help with this issue
    Cheers

    • anupambit1797's avatar
      anupambit1797
      Iron Contributor

      Thanks Lorenzo​ , I am not sure how to update the title here as no option to Edit seems after creating post.. or may be Iam not able to find it.. I marked the last solution as solved.

       

      Sorry , for delay as I am still on leave till Jan-5, and watching mails intermittently only.

       

      Happy Holidays!!

       

      Thanks & Regards

      Anupam Shrivastava

      • Lorenzo's avatar
        Lorenzo
        Silver Contributor

        anupambit1797​ 
        No worries at all re. the delay & glad you sorted out the next steps.

        Just checked How To update the initial post :

        • Open the thread
        • Click the gear at the top right of the window...

        Nice holidays too

        Best
        Lz.

  • Lorenzo's avatar
    Lorenzo
    Silver Contributor

    A better option:

    let
        Source = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/" ),
        UrlRoot = "https://www.etsi.org",
        // All Parent 'directory' link end with ".00_60/"
        ParentLinks = Html.Table( Source,
            { {"Parent_Link", "a[href$='.00_60/']", each UrlRoot & [Attributes][href]} }
        ),
        PdfContents = Table.AddColumn( ParentLinks, "PdfContents", each
            let
                childContents = Web.BrowserContents( [Parent_Link] ),
                // Only one pdf file in sub-directory
                pdfLink = Table.FirstValue(
                    Html.Table( childContents,
                        { {"Pdf_Link", "a[href$='.pdf']", each UrlRoot & [Attributes][href]} }
                    )
                )
            in
              Pdf.Tables( Web.Contents( pdfLink ) , [Implementation = "1.3"] ),
              Table.Type
        )
    in
        PdfContents

     

    EDIT: CSS Selectors: https://www.w3schools.com/cssref/css_selectors.php

    let
        OneParentLink = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/15.02.00_60/" ),
        UrlRoot = "https://www.etsi.org",    
    
        // Get <a href>'s ending with '.pdf' 
        OptionOne = Html.Table( OneParentLink,
            { {"Pdf_Link", "a[href$='.pdf']", each UrlRoot & [Attributes][href]} }
        ),
    
        OptionTwo = Html.Table( OneParentLink,
            { {"Pdf_Link", "a", each UrlRoot & [Attributes][href]} },
            [RowSelector = "a[href$='.pdf']"]
        )
    in
        OptionTwo

     

  • Lorenzo's avatar
    Lorenzo
    Silver Contributor

    Hi

    It's almost all about parsing the HTML code and transforming it to Tables (https://learn.microsoft.com/en-us/powerquery-m/html-table). Note that I had a couple of times error "Unable to connect..."
    No idea what you want to do with the content of each PDF so the below query stops after getting the content of each file

     

    Power Query:

    let
        Source = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/" ),
        HtmlTextToTable = #table(type table [HtmlText = Text.Type],
            {{Source}}
        ),
        SelectedTextBetweenPreTags = Table.AddColumn( HtmlTextToTable, "BetweenPreTags", each
            Text.BetweenDelimiters( [HtmlText], "<pre>", "</pre>" )
        ),
        RemovedHtmlTextColumn = Table.SelectColumns( SelectedTextBetweenPreTags, {"BetweenPreTags"} ),
        RemovedDoubleQuotes = Table.ReplaceValue( RemovedHtmlTextColumn, """", "",
            Replacer.ReplaceText, {"BetweenPreTags"}
        ),
        PdfParentLink = Table.AddColumn( RemovedDoubleQuotes, "PdfParentLink", each
            let
                tableFromHtml = Html.Table( [BetweenPreTags], {{"ParentLink", "a", each "www.etsi.org" & [Attributes][href]}} )
            in
                // Root Directory doesn't content any file ==> Skip 1st record
                Table.Skip( tableFromHtml, 1 ),
            Table.Type
        ),
        RemovedOtherColumn = Table.SelectColumns( PdfParentLink, {"PdfParentLink"}),
        ExpandedPdfParentLink = Table.ExpandTableColumn( RemovedOtherColumn, "PdfParentLink", {"ParentLink"} ),
    
        // There seems to be a single file per Directory...
        PdfFileName = Table.AddColumn( ExpandedPdfParentLink, "PdfName", each
            let
                webContent = Web.BrowserContents( [ParentLink] ),
                betweenHrefTag1 = Text.BetweenDelimiters( webContent, "<a href=", "</a>", 1 )
            in
                Text.AfterDelimiter( betweenHrefTag1, ">", {0, RelativePosition.FromEnd} ),
            Text.Type
        ),
        PdfContents = Table.AddColumn( PdfFileName, "PdfContents", each
            Pdf.Tables( Web.Contents( [ParentLink] & [PdfName] ) , [Implementation = "1.3"] ),
            Table.Type
        )
    in
        PdfContents

     

Resources