Forum Discussion

Jeff_Lipschultz's avatar
Jeff_Lipschultz
Copper Contributor
Jun 19, 2023
Solved

Converting PDF to Excel

I'm a struggling with the should-be-simple task of converting a pdf to Excel table.  I can use the Get Data ->From PDF (Select Multiple Items) function to Query but it puts all the tables (there are 27) into separate sheets (by page in the PDF).  I tried to consolidate the sheets and struggled with this too, unfortunately.  There has got to be an easy way.  The original file is on this web page:  https://cdn1.sportngin.com/attachments/document/0b07-2896686/2023_Participant_List_70.3_Waco_4.27.23.pdf

  • Jeff_Lipschultz 

     

    Thanks for sharing a file. Combining all Tables in the Pdf you shared seems to do it (attached), just edit the query to change the path to the file - line File.Contents("ToBeReplaceWithYourPath")

    let
        Source = Pdf.Tables(
            File.Contents("D:\Lorenzo\Downloads\2023_Participant_List_70.3_Waco_4.27.23.pdf"),
            [Implementation="1.3"]
        ),
        KindTable = Table.SelectRows(Source, each ([Kind] = "Table")),
        RemovedOtherColumns = Table.SelectColumns(KindTable,{"Data"}),
        CombinedTables = Table.Combine(RemovedOtherColumns[Data]),
        PromotedHeaders = Table.PromoteHeaders(CombinedTables, [PromoteAllScalars=true]),
        ChangedTypes = Table.TransformColumnTypes(PromotedHeaders,
            {{"LAST NAME", type text}, {"FIRST NAME", type text}, {"GENDER", type text}, {"AGE AS OF#(lf)12/31/2023", Int64.Type}, {"COUNTRY", type text}}
        )
    in
        ChangedTypes

      

6 Replies

  • Lorenzo's avatar
    Lorenzo
    Silver Contributor

    Jeff_Lipschultz 

     

    Thanks for sharing a file. Combining all Tables in the Pdf you shared seems to do it (attached), just edit the query to change the path to the file - line File.Contents("ToBeReplaceWithYourPath")

    let
        Source = Pdf.Tables(
            File.Contents("D:\Lorenzo\Downloads\2023_Participant_List_70.3_Waco_4.27.23.pdf"),
            [Implementation="1.3"]
        ),
        KindTable = Table.SelectRows(Source, each ([Kind] = "Table")),
        RemovedOtherColumns = Table.SelectColumns(KindTable,{"Data"}),
        CombinedTables = Table.Combine(RemovedOtherColumns[Data]),
        PromotedHeaders = Table.PromoteHeaders(CombinedTables, [PromoteAllScalars=true]),
        ChangedTypes = Table.TransformColumnTypes(PromotedHeaders,
            {{"LAST NAME", type text}, {"FIRST NAME", type text}, {"GENDER", type text}, {"AGE AS OF#(lf)12/31/2023", Int64.Type}, {"COUNTRY", type text}}
        )
    in
        ChangedTypes

      

      • Lorenzo's avatar
        Lorenzo
        Silver Contributor

        Jeff_Lipschultz 

         

        Not avanced at all you can do it with the PQ User Interface only
        The PDF you shared is really clean and doesn't require any specific transformation if you combine the Tables (not the Pages). The starting point is to select (in the PQ Navigator) the file name only and not Pages or Tables - as pictured in this thread

  • Patrick2788's avatar
    Patrick2788
    Silver Contributor
    The PDF looks clean enough that you may want to try your luck with opening the PDF in Word which can convert it to a docx. Then you can copy/paste to Excel.
    • Jeff_Lipschultz's avatar
      Jeff_Lipschultz
      Copper Contributor

      Patrick2788 Great idea.  But unfortunately, it did not copy paste into Word very cleanly (or do you have a hack for that, too?).

      • Jeff_Lipschultz's avatar
        Jeff_Lipschultz
        Copper Contributor
        But, you got me thinking! I cut and paste as a text file and I was able to import that TXT file cleanly into excel!

Resources