I’m not a software developer, but I like to use Python to help speed up some of my office work. One of my regular tasks is to print a stack of ~40 sheets of paper, highlight key information for each entry (about 3 entries per page), and fill out a spreadsheet with that information, which then gets loaded into our software.
This is time-consuming, and I’d like to write a program that can scan the OCR-ed PDFs and pull the relevant information into a CSV.
I’m confident I could handle it from there, but I know that PDFs are tricky files to work with. Are there any Python modules that might be a good fit for the approach I’m hoping to take here? Thanks!
PyMuPDF is excellent for extracting ‘structured’ text from a PDF page, though I believe ‘pulling out relevant information’ will still be a manual task, UNLESS the text you’re working with can be parsed into meaningful units.
That’s because ‘textual’ content in a PDF is nothing other than a bunch of instructions to draw glyphs inside a rect that represents a page. Utilities that come with MuPDF or Poppler arrange those glyphs (not always perfectly) into ‘blocks’, ‘lines’, and ‘words’ based solely on whitespace separation. The programmer who uses those utilities in an end-user-facing application then has to figure out how to create the illusion (so to speak) that the user is selecting/copying/searching for paragraphs, sentences, and so on, in proper reading order.
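To make that concrete, here’s a minimal sketch of what those units look like in PyMuPDF (the filename is a placeholder; the tuple layouts are as I remember them from the docs):

```python
import fitz  # PyMuPDF

doc = fitz.open("scanned.pdf")  # placeholder filename
page = doc[0]

# "words" gives one tuple per whitespace-separated word:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
for x0, y0, x1, y1, word, block_no, line_no, word_no in page.get_text("words"):
    print(block_no, line_no, word_no, word)

# "blocks" gives coarser units: (x0, y0, x1, y1, text, block_no, block_type)
for block in page.get_text("blocks"):
    print(block[4])  # the block's text
```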
PyMuPDF comes with a rich collection of convenience functions to make all that less painful, like dehyphenation, eliminating superfluous whitespace, etc., but you’ll still need some further processing to pick out the humanly relevant info.
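For instance, something like this (a sketch; TEXT_DEHYPHENATE is the flag name I remember from the PyMuPDF docs):

```python
import fitz  # PyMuPDF

doc = fitz.open("scanned.pdf")  # placeholder filename
text = doc[0].get_text(
    "text",
    flags=fitz.TEXT_DEHYPHENATE,  # rejoin words hyphenated across line breaks
)
print(text)
```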
Python’s built-in regex capabilities can suffice for that parsing; but if not, you might want to look into NLTK tools, which apply sophisticated methods to tokenize words & sentences.
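As a rough sketch, assuming your entries carry labeled fields like ‘Invoice No:’ and ‘Amount:’ (entirely made up; swap in whatever your documents actually say):

```python
import csv
import re

# Hypothetical labels; adjust the pattern to your real documents.
ENTRY_RE = re.compile(r"Invoice No:\s*(\d+)\s+Amount:\s*\$?([\d.,]+)")

# Stand-in for the plaintext you pulled out of one page.
page_text = "Invoice No: 12345 Amount: $99.50\nInvoice No: 67890 Amount: $12.00"

with open("entries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["invoice_no", "amount"])
    for match in ENTRY_RE.finditer(page_text):
        writer.writerow(match.groups())  # one CSV row per matched entry
```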
EDIT: I really should’ve mentioned some proper full-text search tools. Once you have a good plaintext representation of a PDF page, you might want to feed that representation into tools like the following to index it properly for relevant info:
https://lunr.readthedocs.io/en/latest/ – this is easy to use & set up, esp. in a Python project (there’s a small sketch below).
… it’s based on principles that are put to use in this full-scale, ‘industrial strength’ full-text search engine: https://solr.apache.org/ – it’s a bit of a pain to set up, but Python can interface with it through any HTTP client. Once you set up some kind of mapping between search tokens/keywords/tags, the plaintext page, & the actual PDF, you can get from a phrase search, for example, to a bunch of vector graphics (i.e. the PDF) relatively painlessly.
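Here’s the lunr sketch I promised, just to show the shape of it (document IDs and field names are made up; I’m going from the lunr.py quickstart):

```python
from lunr import lunr

# One document per PDF page; "id" maps back to the file/page it came from.
documents = [
    {"id": "report.pdf#page1", "body": "plaintext of page 1 goes here"},
    {"id": "report.pdf#page2", "body": "plaintext of page 2 goes here"},
]

idx = lunr(ref="id", fields=("body",), documents=documents)

# Each result carries the ref, which you can map back to the original PDF page.
for result in idx.search("plaintext"):
    print(result["ref"], result["score"])
```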
pypdf was recently updated to version 3… it sometimes takes a bit of wrangling for more specific use cases: I’ve used it in conjunction with reportlab when I needed to add text and other bits with a bit more flexibility.
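If it helps, this is roughly the combination I mean: draw the extra text with reportlab onto a blank single-page overlay, then composite that onto each page with pypdf (filenames, text, and coordinates are placeholders):

```python
from io import BytesIO

from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas

# Draw the extra text onto an in-memory, single-page overlay PDF.
buf = BytesIO()
c = canvas.Canvas(buf)
c.drawString(72, 720, "Stamped by script")  # placeholder position/text
c.save()
buf.seek(0)

overlay = PdfReader(buf).pages[0]

reader = PdfReader("input.pdf")  # placeholder filename
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(overlay)  # composite the overlay onto each page
    writer.add_page(page)

with open("output.pdf", "wb") as f:
    writer.write(f)
```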
From what I understand, PyPDF3 and PyPDF4 are separate from pypdf, which is the modern version of PyPDF2 as of last year.
That’s correct AFAIK. The maintainers of PyPDF2 merged it back into the original pypdf project for version 3, I believe.
BTW, you might want to not tell anyone: there are a few stories on Reddit of people who were fired after they (or someone else) automated time-consuming tasks like that.