Many years ago, I regularly had to parse specifications of protocols from various electronic exchanges. The general approach I used was to do a first pass using a Linux tool to convert it to text: pdftotext. Something like:
After that, it was a matter of writing and tweaking custom text parsers (in python or java) until the output was acceptable, generally an XML file consumed by the build (mainly to generate code).
A frequent need was to parse tables describing fields (name, id, description, possible values etc.). Unfortunately, sometimes tables spanned several pages and the column width was different on every page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).
As someone else said, this is like black magic, but kind of fun :)
I worked for an epub firm that used a similar approach a while ago - we took PDFs and produced Flash (yes, that old) versions for online, and created iOS and Android apps for the publisher.
I've come across most of the problems in this post but the most memorable thing was when we were asked to support Arabic, when suddenly all your previous assumptions are backwards!
Oh my goodness, this whole thread is deja vu from some code I wrote to parse my bank statements. I arrived at exactly the same solution of "pdftotext -layout" followed by a custom parser in Python. And ran into the same difficulty with tables: I wrote a custom table parser that uses heuristics to decide where column breaks are.
A frequent need was to parse tables describing fields (name, id, description, possible values etc.). Unfortunately, sometimes tables spanned several pages and the column width was different on every page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).
As someone else said, this is like black magic, but kind of fun :)
Edit: grammar