Mar 03 2024 09:55 AM
I am struggling with extracting data from PDF documents, and I have come upon what I hope is a reasonable strategy. In looking at how to address the data in a PDF, I have found that location data such as bbox parameters varies from one PDF doc to another, even though our eyeballs say it is the same document. After converting to a Word doc, I have found that I can locate text by paragraph number: I have tested a handful of PDF-derived Word docs and found that the info I need is reliably in paragraph X. I am wondering, however, if I simply "got lucky". The question is this: is it reasonable to expect consistency in paragraph number across Word docs derived from PDFs that have the same "look" but may have been scanned in with different systems or software? How would I test this on a large scale?
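
For the large-scale test, one possible approach is to tally, across every converted Word doc, the paragraph index at which the target text actually appears. If nearly all docs agree on one index, paragraph X is probably stable. If the tally is spread out, the handful of test docs were likely lucky. The following is only a rough sketch under some assumptions: the function names (`read_paragraphs`, `find_match_index`, `tally_positions`), the folder name, and the `Invoice No` pattern are all hypothetical placeholders, and the target text is assumed to be recognizable by a regular expression. It reads the .docx directly with the standard library (a .docx is a zip archive containing `word/document.xml`) and counts every `w:p` element, including those inside tables, so the indices may not match the paragraph numbers reported by other tools.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx_path):
    """Return the text of every w:p element in a .docx, in document order."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t runs
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs


def find_match_index(paragraphs, pattern):
    """Return the index of the first paragraph matching pattern, or None."""
    for i, text in enumerate(paragraphs):
        if re.search(pattern, text):
            return i
    return None


def tally_positions(docs, pattern):
    """docs maps a doc name to its list of paragraphs.

    Returns a Counter of {paragraph index (or None if not found): doc count}.
    """
    return Counter(find_match_index(paras, pattern) for paras in docs.values())


# Hypothetical usage (folder name and pattern are placeholders):
#
#   from pathlib import Path
#   docs = {p.name: read_paragraphs(p) for p in Path("converted_docs").glob("*.docx")}
#   print(tally_positions(docs, r"Invoice No:\s*\d+").most_common())
```

A tally dominated by a single index would support locating by paragraph number. A spread of indices, or many `None` entries, would suggest matching on the text pattern itself rather than on position.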