
Is there a tool that works for the limited subset of PDFs generated by LaTeX? Do those documents have more structure than the average PDF? Less? It'd be nice to extract text from scientific articles at least.


I spent some time extracting abstracts from NLP papers (ACL conferences) and it was mostly straightforward. Using pdfquery to convert PDF -> XML gave each character as its own element, and the characters were mostly ordered sensibly and grouped into paragraphs.

However... this broke down in some cases: mainly with formatted text (italics and the like, whose y coordinates are slightly offset from the regular characters on the same line), but also with some PDFs that looked like they were compiled in a nonstandard way. As a result I ended up chucking the XML structure entirely and reassembling the text from character-level coordinates.
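The coordinate-based reassembly can be sketched roughly like this (a minimal illustration, not the linked code: it assumes characters have already been pulled out of the XML as (x, y, text) tuples, and uses a y tolerance to absorb the offset baselines of formatted text):

```python
# Sketch: rebuild reading order from character-level coordinates.
# Input: (x, y, text) tuples for each character, in arbitrary order.
# The y_tolerance groups characters whose baselines differ slightly
# (e.g. italics offset from regular text on the same line).

def group_into_lines(chars, y_tolerance=2.0):
    """Group (x, y, text) tuples into text lines, top-to-bottom."""
    lines = []  # list of (representative_y, [(x, text), ...])
    for x, y, ch in sorted(chars, key=lambda c: -c[1]):  # top of page first
        for line_y, line_chars in lines:
            if abs(line_y - y) <= y_tolerance:
                line_chars.append((x, ch))
                break
        else:
            lines.append((y, [(x, ch)]))
    # Within each line, order characters left-to-right by x.
    return ["".join(ch for _, ch in sorted(cs)) for _, cs in lines]

# "PDF" sits at y=99, slightly below its neighbours at y=100,
# but within tolerance it still joins the same line.
chars = [(0, 100, "H"), (6, 100, "i"), (14, 99, "P"), (20, 99, "D"),
         (26, 99, "F"), (0, 80, "o"), (6, 80, "k")]
print(group_into_lines(chars))  # ['HiPDF', 'ok']
```

A real document needs more care (word spacing from x gaps, multi-column layouts), but this is the core of it.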

I'm not sure I could take this experience and say that extracting _all text_ would be straightforward. Hopefully for most documents the XML is nicely structured, but I imagine there are many more opportunities for inconsistencies in how the PDF is generated when thinking about diagrams, tables etc. rather than just abstracts.

Considered writing up a blog post about my experiences with the above but imagined that it was far too niche. Code's here [1] if it's of interest.

[1] https://gist.github.com/GuyAglionby/4b55d00803710f2e2e9877fd...


I've had remarkably good results in general (for reading) using the Poppler library's "pdftotext" utility. Since it defaults to writing output to file, I wrap that in a bash function to arrive at a less-like pager, with page breaks noted:

    pdfless ()
    {
        pdftotext -layout "$1" - |
        sed 's/\f/\n\n ----------------- ----------------- <page> ----------------- ----------------- \n\n\n/g' |
        ${PAGER:-less -S}
    }
The key is the "-layout" argument, which preserves the original layout of the document. This ... may not be what you want visually, but it makes backing out the original text somewhat easier.
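The sed substitution works because pdftotext emits a form-feed character (\f) between pages, so in a script you can split the output into pages directly. A tiny sketch, using a stand-in string rather than real pdftotext output:

```python
# pdftotext separates pages with a form-feed character (U+000C).
# Given the extracted text, splitting on it recovers per-page strings.
text = "page one\ftwo\fthree"  # stand-in for `pdftotext -layout` output
pages = text.split("\f")
print(pages)  # ['page one', 'two', 'three']
```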

Of course, requesting the LaTeX sources would be preferred.


Not a general tool, but arxiv-vanity - which produces webpages of articles submitted to arXiv - works by parsing the source code that's submitted along with the PDF. You could probably use this data to train a model that converts between PDF, TeX, and HTML.



