Improve PDF data Connector for reading pages


 Aug 26 2022

I have already submitted this as a private message, but I feel it deserves a wider audience.


I'm interested in using Excel to extract information from text passages within PDF documents. Whereas most people just want to extract a table, I'm interested in searching the document's entire text. After quite a bit of work, I have come up with a way to achieve this in Power Query using the PDF connector and a series of steps that reliably split pretty much any text into its constituent sentences. However, all this has one major flaw: the PDF connector itself cannot reliably interpret the text on pages. The words are pulled through okay but are often merged (e.g. hellohowareyou) or randomly parsed into separate columns, as shown here: Capture1.PNG


If I open a PDF in Adobe Acrobat and convert it to a text file, Acrobat recognises the words separately and doesn't break sentences with a line feed (each line extends horizontally as required). As a workaround, I have found that by converting PDF documents into text files and connecting to these instead, I can more reliably transform the data into sentences and search pretty much any PDF for the data I need. The image below shows the much-improved result of doing this.




Although the data requires further transformations to separate it into sentences (I use regex for this), the second image shows the data in a much more logical form. This is much easier to transform because the sentences don't need to be 'fixed', just separated.


In contrast, no amount of transformation applied to the first result can undo the already merged text.


Additionally, the PDF connector can lead to text 'slipping', where, after merging, text may be appended to the wrong sentence. This is rare but, again, shouldn't happen. With the text file, this does not occur.


The M code for the PDF connection is below. The PDF is online, so you should be able to open it:


let
    Source = Pdf.Tables(Web.Contents(""), [Implementation="1.3"]),
    #"Filtered Rows" = Table.SelectRows(Source, each ([Kind] = "Page")),
    #"Expanded Data" = Table.ExpandTableColumn(#"Filtered Rows", "Data", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18"})
in
    #"Expanded Data"
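
For anyone trying to reproduce this, the "randomly parsed into separate columns" symptom can be partly smoothed over by joining each row's cells back into a single line of text. Below is a minimal sketch of that idea, not part of my actual query. SamplePage is a hypothetical stand-in for one page's [Data] table, and the column names and cell values are made up for illustration. It does nothing for words that have already been merged.

```powerquery
let
    // Hypothetical stand-in for one page's [Data] table as returned by Pdf.Tables
    SamplePage = #table(
        {"Column1", "Column2", "Column3"},
        {
            {"Hexylene glycol is of", "low acute", null},
            {"toxicity to", null, "aquatic organisms."}
        }
    ),
    // For each row: drop null cells, convert the rest to text, join with one space
    AddLine = Table.AddColumn(
        SamplePage,
        "Line",
        each Text.Combine(
            List.Transform(List.RemoveNulls(Record.FieldValues(_)), Text.From),
            " "
        ),
        type text
    ),
    // Keep only the rejoined line
    Result = Table.SelectColumns(AddLine, {"Line"})
in
    Result
```

Because it reads every field of the row via Record.FieldValues, this approach does not depend on knowing in advance how many columns (Column1 to Column18 in my case) the connector produced.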


M Code for the Text file: 


Source = Table.FromColumns({Lines.FromBinary(File.Contents("C:\Desktop\TEST.txt"))})
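
Once the lines are loaded, they still have to be separated into sentences. I use regex for that step, which is not shown here. As a rough illustration only, here is a regex-free sketch using standard M list functions. It assumes a sentence ends at any word whose last character is ".", "!" or "?", so it will wrongly split on abbreviations such as "e.g.". SampleLines and the step names are hypothetical.

```powerquery
let
    // Hypothetical sample: two lines as they might arrive from the text file
    SampleLines = {
        "Skin and eye effects are reversible. Hexylene glycol is not",
        "a skin sensitiser. No standard fertility studies are available."
    },
    // Rejoin the lines into one passage, then break it into words
    Passage = Text.Combine(SampleLines, " "),
    Words = Text.Split(Passage, " "),
    // Assumption: a word closes a sentence when it ends in ".", "!" or "?"
    EndsSentence = (word as text) as logical =>
        List.Contains({".", "!", "?"}, Text.End(word, 1)),
    // Walk the words, collecting finished sentences and the one in progress
    Folded = List.Accumulate(
        Words,
        [Done = {}, Current = {}],
        (state, word) =>
            let
                Next = state[Current] & {word}
            in
                if EndsSentence(word)
                then [Done = state[Done] & {Text.Combine(Next, " ")}, Current = {}]
                else [Done = state[Done], Current = Next]
    ),
    // Keep any trailing fragment that lacks closing punctuation
    Sentences = Folded[Done]
        & (if List.IsEmpty(Folded[Current]) then {} else {Text.Combine(Folded[Current], " ")}),
    Result = Table.FromColumns({Sentences}, {"Sentence"})
in
    Result
```

With the two sample lines above, this yields three rows, one per sentence, with the sentence that straddled the line break stitched back together.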



Unfortunately, you will need to generate the text file yourself, as I can't upload .txt files here. You can just copy all of the text below and paste it into a text file named TEST on your desktop.


SIAM 13, 6-9 November 2001 UK/ICCA

CAS No. 107-41-5
Chemical Name Hexylene glycol (2-methyl pentane-2,4-diol)
Structural Formula CH3 CHOH CH2 C (OH)(CH3)2 (NB the commercial substance is a racemic mixture)
RECOMMENDATIONS The chemical is currently of low priority for further work.
SUMMARY CONCLUSIONS OF THE SIAR Human Health Hexylene glycol is of relatively low acute toxicity to mammals, the acute oral LD50 is >2000 and <5000 mg/kg (range >2000-4700 mg/kg) while the dermal LD50 is >2000 mg/kg (range >1.84-12.3 g/kg). The acute inhalational LC50 is = the saturated vapour concentration. Recent skin and eye irritation guideline studies indicate that hexylene glycol has low potential to irritate the skin and is slightly irritating to the eye. Skin and eye effects are reversible. Hexylene glycol is not a skin sensitiser. Repeated exposure by oral gavage to rats at 50, 150 or 450 mg/kg/day hexylene glycol for 90 days, with additional animals at the top dose also allowed a 4 week exposure-free recovery period, resulted in hepatocellular hypertrophy and increased liver weight, male rat specific nephropathy and inflammatory changes in the forestomach and to a lesser extent the glandular stomach. The liver changes were reversible and considered an adaptive physiological response to increased metabolic demand. The male rat nephropathy was partially reversible and associated with an increased severity of acidophilic globules, subsequently identified by specific staining (Masson’s trichrome) as alpha-2-microglobulins, and considered of questionable biological significance to humans. Changes in the stomach (reversible) and forestomach (partially reversible) were considered attributable to local irritation induced by the gavage procedure. The NOAEL for this local effect being 50 mg/kg/day. The systemic NOAEL for this guideline study is considered to be 450 mg/kg/day with a no effect level for local irritation to the stomach and forestomach of 50 mg/kg/day. Hexylene glycol is not genotoxic in either mammalian or non-mammalian cells in vitro. No standard fertility studies are available. 
No effects on the gonads were observed in a good quality 90-day oral gavage study in rats, which were, administered hexylene glycol at doses up to 450 mg/kg/day by oral gavage. Therefore no studies are required under the SIDS regarding fertility. In a good quality developmental toxicity study, in which rats received 30, 300 or 1000 mg/kg/day hexylene glycol by oral gavage, the LOAEL for maternal toxicity was 1000 mg/kg/day, based on slightly reduced weight gain at this top dose level. Greater pre-implanation loss observed at this dose level may be regarded of questionable biological significance. This dose level was also the LOAEL for foetotoxicity based on a, slight delay in ossification, a greater number of fetuses with extra thoraco-lumbar ribs, and a slight decrease (not statistically significant) in foetal body weight. There was no evidence of teratogenicity up to the limit dose of 1000 mg/kg.


SIAM 13, 6-9 November 2001 UK/ICCA
Environment The environmental effects database meets the requirements of the SIDS data package. Hexylene glycol is of low acute toxicity to aquatic organisms. The lowest valid 96h LC50 for fish was 8510 mg/l (Mosquito fish, Gambusia affinis) and the lowest valid 48h EC50 for invertebrates was 2800 mg/l (Ceriodaphnia reticulata). Tadpoles of the frog Rana catesbiana were tested, with a 96 hour EC50 = 11800 mg/l. The 72 hour EC50 for the freshwater alga Selenastrum capricornutum is >429 mg/l (highest level tested) based on both growth rate and biomass. The PNECaqua derived from the lowest toxicity value is 4.3 mg/l, based on an assessment factor of 100 applied to the algal EC50, in accordance with OECD guidance. No data are available on terrestrial or sediment organisms but PNEC values have been derived for the sediment and terrestrial compartments using equilibrium partitioning, 0.295 mg/kg wt for sediment and 0.0786 mg/kg for soil. Exposure The combined market for hexylene glycol in Europe and the USA for 2000 is 15000 tonnes. The principal end uses are in industrial coatings (45%) and as a chemical intermediate (20%). Hexylene glycol occurs as a component in a large number of products for industrial and consumer use. Hexylene glycol is a liquid, melting point – 50°C, boiling point 197.5°C, vapour pressure 0.07hPa at 20°C, it is fully miscible in water and has a calculated n-octanol water partition coefficient (log Kow) of 0.58. There are no aqueous streams from the production process but small amounts of hexylene glycol will be present in the output to the wastewater treatment plant from spills and cleaning operations. Hexylene glycol can also enter the aqueous and terrestrial environment from end uses such as in agricultural products and down hole lubricants for oil and gas fields. Under normal manufacturing practices there should be no emissions to the atmosphere. Low levels of emissions may occur as a result of spills and cleaning operations. 
The main application is in industrial surface coatings and there is potential here for release to the atmosphere. There is a potential for occupational and consumer exposure through inhalation and skin contact although exposures through inhalation are expected to be low due to the low vapour pressure. Consumer exposure to hexylene glycol will occur principally through its use in cosmetics, antifreezes and hydraulic fluids. Exposure to aerosols is possible as a result of industrial spraying with paints containing hexylene glycol. Indirect exposures via the environment (e.g. ingestion of surface water contaminated with hexylene glycol) are also possible. The calculated half-life for the photo-oxidation (reaction with hydroxyl radicals) of hexylene glycol in air is 9 hours. Hexylene glycol is not expected to undergo direct photolysis and is not susceptible to hydrolysis. Hexylene glycol is predicted to distribute in the environment primarily to water or water and soil. Based on a calculated log Kow of 0.58 which suggests a log Koc of <1, hexylene glycol has low potential to bioaccumulate (BCF=3) and low potential for sorption to soil. In water, hydrolysis and photodegradation are not expected to occur. Hexylene glycol is at least inherently biodegradable.
NATURE OF FURTHER WORK RECOMMENDED No further work is indicated.



Sorry if this is all overkill; I just figured that if you can replicate my issue, you will have a perfect understanding of what I am pointing out. 


I believe that if the PDF connector could be updated to parse the raw text better, automating PDF searches and pulling data directly into Excel would become far more powerful.