Forum Discussion
How to write a script or any PQ or in Excel to download the zip files from a Webpage
Dear Experts,
Greetings!
https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/
Could you please help me on how to download the pdf.zip files from above for all the versions?
Using a single command in Excel or PQ-option.
Thanks in Advance,
Br,
Anupam
A better option:
let Source = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/" ), UrlRoot = "https://www.etsi.org", // All Parent 'directory' link end with ".00_60/" ParentLinks = Html.Table( Source, { {"Parent_Link", "a[href$='.00_60/']", each UrlRoot & [Attributes][href]} } ), PdfContents = Table.AddColumn( ParentLinks, "PdfContents", each let childContents = Web.BrowserContents( [Parent_Link] ), // Only one pdf file in sub-directory pdfLink = Table.FirstValue( Html.Table( childContents, { {"Pdf_Link", "a[href$='.pdf']", each UrlRoot & [Attributes][href]} } ) ) in Pdf.Tables( Web.Contents( pdfLink ) , [Implementation = "1.3"] ), Table.Type ) in PdfContentsEDIT: CSS Selectors: https://www.w3schools.com/cssref/css_selectors.php
let OneParentLink = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/15.02.00_60/" ), UrlRoot = "https://www.etsi.org", // Get <a href>'s ending with '.pdf' OptionOne = Html.Table( OneParentLink, { {"Pdf_Link", "a[href$='.pdf']", each UrlRoot & [Attributes][href]} } ), OptionTwo = Html.Table( OneParentLink, { {"Pdf_Link", "a", each UrlRoot & [Attributes][href]} }, [RowSelector = "a[href$='.pdf']"] ) in OptionTwo
5 Replies
- LorenzoSilver Contributor
To help people who search for existing solutions could you please:
- Update the title of this thread to better reflect the challenge - ex. How to write a script or any PQ or Excel formula to download the pdf files from a Webpage
- Provide feedback/mark as solved if appropriate - Hope you saw the Better option that's more straightforward than the initial proposal
Thanks & let me know if you need more help with this issue
Cheers- anupambit1797Iron Contributor
Thanks Lorenzo , I am not sure how to update the title here as no option to Edit seems after creating post.. or may be Iam not able to find it.. I marked the last solution as solved.
Sorry , for delay as I am still on leave till Jan-5, and watching mails intermittently only.
Happy Holidays!!
Thanks & Regards
Anupam Shrivastava
- LorenzoSilver Contributor
anupambit1797
No worries at all re. the delay & glad you sorted out the next steps.Just checked How To update the initial post :
- Open the thread
- Click the gear at the top right of the window...
Nice holidays too
Best
Lz.
- LorenzoSilver Contributor
A better option:
let Source = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/" ), UrlRoot = "https://www.etsi.org", // All Parent 'directory' link end with ".00_60/" ParentLinks = Html.Table( Source, { {"Parent_Link", "a[href$='.00_60/']", each UrlRoot & [Attributes][href]} } ), PdfContents = Table.AddColumn( ParentLinks, "PdfContents", each let childContents = Web.BrowserContents( [Parent_Link] ), // Only one pdf file in sub-directory pdfLink = Table.FirstValue( Html.Table( childContents, { {"Pdf_Link", "a[href$='.pdf']", each UrlRoot & [Attributes][href]} } ) ) in Pdf.Tables( Web.Contents( pdfLink ) , [Implementation = "1.3"] ), Table.Type ) in PdfContentsEDIT: CSS Selectors: https://www.w3schools.com/cssref/css_selectors.php
let OneParentLink = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/15.02.00_60/" ), UrlRoot = "https://www.etsi.org", // Get <a href>'s ending with '.pdf' OptionOne = Html.Table( OneParentLink, { {"Pdf_Link", "a[href$='.pdf']", each UrlRoot & [Attributes][href]} } ), OptionTwo = Html.Table( OneParentLink, { {"Pdf_Link", "a", each UrlRoot & [Attributes][href]} }, [RowSelector = "a[href$='.pdf']"] ) in OptionTwo - LorenzoSilver Contributor
Hi
It's almost all about parsing the HTML code and transforming it to Tables (https://learn.microsoft.com/en-us/powerquery-m/html-table). Note that I had a couple of times error "Unable to connect..."
No idea what you want to do with the content of each PDF so the below query stops after getting the content of each filePower Query:
let Source = Web.BrowserContents( "https://www.etsi.org/deliver/etsi_ts/138300_138399/138306/" ), HtmlTextToTable = #table(type table [HtmlText = Text.Type], {{Source}} ), SelectedTextBetweenPreTags = Table.AddColumn( HtmlTextToTable, "BetweenPreTags", each Text.BetweenDelimiters( [HtmlText], "<pre>", "</pre>" ) ), RemovedHtmlTextColumn = Table.SelectColumns( SelectedTextBetweenPreTags, {"BetweenPreTags"} ), RemovedDoubleQuotes = Table.ReplaceValue( RemovedHtmlTextColumn, """", "", Replacer.ReplaceText, {"BetweenPreTags"} ), PdfParentLink = Table.AddColumn( RemovedDoubleQuotes, "PdfParentLink", each let tableFromHtml = Html.Table( [BetweenPreTags], {{"ParentLink", "a", each "www.etsi.org" & [Attributes][href]}} ) in // Root Directory doesn't content any file ==> Skip 1st record Table.Skip( tableFromHtml, 1 ), Table.Type ), RemovedOtherColumn = Table.SelectColumns( PdfParentLink, {"PdfParentLink"}), ExpandedPdfParentLink = Table.ExpandTableColumn( RemovedOtherColumn, "PdfParentLink", {"ParentLink"} ), // There seems to be a single file per Directory... PdfFileName = Table.AddColumn( ExpandedPdfParentLink, "PdfName", each let webContent = Web.BrowserContents( [ParentLink] ), betweenHrefTag1 = Text.BetweenDelimiters( webContent, "<a href=", "</a>", 1 ) in Text.AfterDelimiter( betweenHrefTag1, ">", {0, RelativePosition.FromEnd} ), Text.Type ), PdfContents = Table.AddColumn( PdfFileName, "PdfContents", each Pdf.Tables( Web.Contents( [ParentLink] & [PdfName] ) , [Implementation = "1.3"] ), Table.Type ) in PdfContents