Many years ago, I regularly had to parse specifications of protocols from variou...

dredmorbius · on March 3, 2020

I've discovered page-oriented processing in awk, which is a godsend for parsing PDFs.

See:

https://news.ycombinator.com/item?id=22156456

In the GNU Awk User's Guide:

https://www.gnu.org/software/gawk/manual/html_node/Multiple-...

Tracking column and field widths across page breaks is ... interesting, but more tractable.
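
For readers who don't use awk, here is a minimal Python sketch of the same page-oriented idea. It assumes the text came from a tool such as pdftotext, which separates pages with a form feed character (`\f`); in awk the equivalent is to set `RS = "\f"` so that each page becomes one record. The helper names `split_pages` and `page_records` and the sample text are hypothetical, for illustration only.

```python
def split_pages(text):
    """Split extracted text into pages on form feed characters."""
    pages = text.split("\f")
    # Assumption: the extractor also ends the last page with a form feed,
    # which leaves an empty trailing chunk that we drop.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def page_records(text):
    """Yield (page_number, non_blank_lines) for each page."""
    for number, page in enumerate(split_pages(text), start=1):
        lines = [line for line in page.splitlines() if line.strip()]
        yield number, lines


# Made-up sample: two pages, each terminated by a form feed.
sample = "Header A\nrow 1\n\fHeader B\nrow 2\n\f"
for number, lines in page_records(sample):
    print(number, lines)
```

Handling one page at a time makes it easier to reset per-page state, such as column positions, at each page break.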

mtlogstdo · on March 3, 2020

A while ago I worked for an epub firm that used a similar approach: we took PDFs and produced Flash (yes, that old) versions for the web, and built iOS and Android apps for the publisher.

I've come across most of the problems in this post, but the most memorable was when we were asked to support Arabic: suddenly all your previous assumptions are backwards!

haberman · on March 3, 2020

Oh my goodness, this whole thread is deja vu from some code I wrote to parse my bank statements. I arrived at exactly the same solution of "pdftotext -layout" followed by a custom parser in Python. And ran into the same difficulty with tables: I wrote a custom table parser that uses heuristics to decide where column breaks are.
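
One possible column-break heuristic for layout-preserved text, sketched in Python. This is not the parser described above, only an illustration: it treats any run of at least `min_gap` character columns that are blank in every row as a column separator. The function names, the `min_gap` parameter, and the sample rows are all hypothetical.

```python
def find_column_breaks(lines, min_gap=2):
    """Return (start, end) spans of character columns blank in every line."""
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    gaps = []
    start = None
    # The trailing False is a sentinel that closes any run still open at the end.
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            # Skip leading/trailing margins and gaps narrower than min_gap.
            if i - start >= min_gap and start > 0 and i < width:
                gaps.append((start, i))
            start = None
    return gaps


def split_row(line, gaps):
    """Cut one line into cells at the given gap spans."""
    cells = []
    pos = 0
    for start, end in gaps:
        cells.append(line[pos:start].strip())
        pos = end
    cells.append(line[pos:].strip())
    return cells


# Made-up statement rows, laid out in fixed-width columns.
raw = [
    ("Date", "Description", "Amount"),
    ("03/01", "Coffee shop", "4.50"),
    ("03/02", "Grocery store", "62.10"),
]
rows = [f"{a:<12}{b:<20}{c}" for a, b, c in raw]

gaps = find_column_breaks(rows)
for row in rows:
    print(split_row(row, gaps))
```

Requiring a gap of two or more blank columns keeps single spaces inside a cell ("Coffee shop") from being mistaken for column breaks. Real statements will likely need more than this, for example when a long description runs into the next column.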