how do i extract data from a pdf file using power query in excel?

Copper Contributor

hi team,

 

Please i would like to find out if there is a way to extract data from a pdf file using power query in excel?

 

regards!!

16 Replies

@samuel_kodjoe 

If you are on Windows, on Excel 365 and not on semi-annual channel you shall have this connector

image.png

Perhaps these days availability bit wider, not sure. But in any case that's for subscription model.

wow.....so there is no ad ins i can install to access this with 2019 version?

@samuel_kodjoe 

Copy/paste the following code in a new blank query using the Advanced Editor

let
    Source = Record.ToTable(
        Record.RemoveFields(#shared,
            Record.FieldNames(#sections[Section1])
        )
    ),
    SelectedTypeFunction = Table.SelectRows(Source, each [Value] is function),
    RemovedValue = Table.SelectColumns(SelectedTypeFunction, {"Name"}),
    SortedByName = Table.Sort(RemovedValue, {{"Name", Order.Ascending}})
in
    SortedByName

 

then load the result of that query on a sheet. If you find Pdf.Tables in column [Name] the function is available in your version of Excel/PQ. Per its name the function is looking for table(s) in the PDF...

hi Lz!
I saw Pdf Tables in column name after loading the code, how then do i locate it.
Thanks for the feedback.

@samuel_kodjoe 

Given you don't have the corresponding wizard in Excel user interface, you have to code it yourself. You start with a new blank query and in the formula bar, something like:

= Pdf.Tables(File.Contents("FolderPath\Example.pdf"), [Implementation="1.2"])

Then, assuming the function finds a Table in your PDF, click on it in the [Data] column:

 

Demo.png

NB: re. [Implementation=x.y] the Pdf.Tables doc. says:

  • The newest version should always give the best results
    I've seen a couple of cases where this wasn't true so you have to test what gives you the best result based on your PDF
  • Valid values are "1.3", "1.2", "1.1", or null 
    1.3 isn't avail. yet. Please refer to DanielPerelman's reply on this thread

please this is what i get trying to implement for 1.2. the other values are giving me invalid answers

Carefully read the error message you get, it clearly indicates what the problem is. I checked the formula you entered and it's OK (no syntaxt error nor anything like that)

@L z. 

Small comment, 1.3 is available now. Not sure about exact PQ version.

On Current Channel v2106 b14131.20278 (latest Click-to-Run) "1.3" is still not avail.
(Same goes with PBI Desktop v2.94.921.0 - June 2021)
If you have it on the Insider Channel it should land here in a few weeks...

@L z. 

Yes, that's PQ 2.95.223.0 on insiders channel.

I have a similar question to read a PDF.  The PDF is link to a PDF file share to the public (no login required); however, I not figure out how to get Power Query to read that publicly shared PDF file.

 

Is there a trick related to this using the same method described or is there another set of instructions associated with reading a PDF file referenced as a URL?

 

I tried the Power Query get from web, but it seems to think it needs to authenticate to access the file even though authentication is not required.

@Russ_MS_Commmunity 

The following works with a PDF stored on a Personal OneDrive:

- From a Web Browser, download the PDF file
- Open the Downloads page of the Web Browser (pic. from MSFT Edge below)
- Copy the Download link:

Sample.png

In Excel:
- Data (tab) > From Web > Paste the Download link > OK > ...


If you want to try with the PDF I shared, a Download link is:

 

 

https://public.am.files.1drv.com/y4mp7lEIREyPqh2W_UW3tzsSdWp6Kpim9Hpj2b2Sp-OcoTenAZ-33ASspv_OnvEkzD3RpLWfT4ftsglbRAfyHuRz5AIZq3FNL8HRX8-n6-0eV7bV1nTQkSrT77dOVHFIwewQYfCE0hNshQGrbt1JW4dVZuYOMEJy0yN4R7_3CweweFjm6wPeZWFYJxC8QSrloB5Fk9lsiU4a2RjqmGAMdxThaztoypmCrUsT8j7kO3H35E?AVOverride=1
It did not seem to resolve the problem.

Going to Data >> From Web >> Pasting in the download link is the process I originally used. However, it interprets the OneDrive address as something that requires an authentication.

It comes back with an error message "We could not authenticate with the credentials authenticated. Please try again." There is no authentication. required to view and download the file, but when you try to put the URL in power query it doesn't seem to understand that.

@Russ_MS_Commmunity 

When OneDrive isn't unresponsive this works no problem here. The attached sample query contains a download link and the expected dataset is returned to Excel

 

After putting in place the query, if I clear to Permissions to OneDrive (the one highlighted below that matches the download link - https://public.am.files.1drv.com...):

Sample.png

 

of course, I'm asked to Authenticate (even as Anonymous) the next time I refresh the query

 

IMHO you should have a look at your Data Source Settings and in case this doesn't help provide more information regarding your context, i.e. where the PDF is actually stored (ex. OneDrive Business, SharePoint...)

@Russ_MS_Commmunity 

Providing no feedback, good or bad, doesn't help people who Search (you did - thanks)

@L z. I have not been able to get the above solution proposed to work successfully.  In every attempt on multiple computers and multiple networks it is still seeking a login.