robust pdf title extraction
I end up with a lot of PDF documents lying around – at last count, a few thousand files. Unfortunately, most of them have rather obscure names, making it annoying to find what I want, or what is interesting. For example, these are the documents I’ve recently downloaded:
wodet3-paper12.pdf jong_afst.pdf tut_gpu_2012_03.pdf lecture1-1.pdf natella_binary_sfi_edcc_2012.pdf TR-Farrukh-58.pdf 730959.pdf NLSEmagic_Paper.pdf M23584378H1770Q2.pdf G89T37P10W263075.pdf journal_online.pdf manus_Jour-INFORMATION-Camera.pdf 12011.VitekJan.Paper.pdf R3X8722476T2X278.pdf 1203.0321.pdf
I previously tried to organize everything using something like Papers, which is a lovely product, but it still required effort from me and isn’t very useful now that I no longer have a Mac.
I’ve also tried to rectify this situation via half-hearted attempts at using pdftotext and grabbing the first 10 words of text, but more often than not I was left with incomprehensible garbage.
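For reference, that earlier approach amounted to something like the sketch below. This is a reconstruction rather than my original script, and the naive_title name is made up here, but the idea is the same: shell out to pdftotext for the first page and keep the first ten words.

import subprocess
import sys

def naive_title(pdf_file):
    # Render just the first page to plain text ('-' sends output to
    # stdout) and hope the title happens to come first.
    text = subprocess.check_output(
        ['pdftotext', '-f', '1', '-l', '1', pdf_file, '-'])
    return ' '.join(text.split()[:10])

if __name__ == '__main__':
    for f in sys.argv[1:]:
        print f, ' -- ', naive_title(f)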
Today I had some spare time, and far too much interest in this problem, and I managed to come up with an easy and fairly effective solution, even if it does resemble a Rube Goldberg machine. After digging around for various PDF conversion utilities, I discovered that pdftohtml not only generates reasonable output, but can also be told to emit an easily parsed XML format.
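For a typical paper, the first page of that XML looks roughly like this. The snippet is abridged and hand-written for illustration (exact attributes and values vary with the pdftohtml version); the important part is that each text fragment carries a font id, and the title is usually the first run of fragments sharing one:

<pdf2xml>
  <page number="1" height="1188" width="918">
    <fontspec id="0" size="24" family="Times" color="#000000"/>
    <fontspec id="1" size="11" family="Times" color="#000000"/>
    <text top="122" left="196" width="526" height="28" font="0">An Example Paper Title</text>
    <text top="180" left="350" width="218" height="14" font="1">A. Author, B. Author</text>
    ...
  </page>
</pdf2xml>

From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents: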
from BeautifulSoup import BeautifulStoneSoup
import subprocess
import sys
import tempfile

def extract_pdf_title(pdfdata):
    # pdftohtml wants a file on disk, so spool the PDF data into a
    # temporary file first.
    src_file = tempfile.NamedTemporaryFile(delete=True)
    src_file.write(pdfdata)
    src_file.flush()
    try:
        # -c -s -i: complex layout, single document, skip images;
        # -f 1 -l 1 converts only the first page, and -xml -stdout
        # writes the XML straight to stdout, so there is no temporary
        # output file to read back or clean up.
        command = ['pdftohtml', '-c', '-s', '-i', '-stdout',
                   '-f', '1', '-l', '1', '-xml', src_file.name]
        xml_data, _ = subprocess.Popen(command,
                                       stdout=subprocess.PIPE,
                                       stderr=subprocess.PIPE).communicate()
    except OSError, e:
        print >>sys.stderr, 'Error running pdftohtml:', e
        return ''

    dom = BeautifulStoneSoup(xml_data)
    text = dom.findAll('text')

    # Let the title be the first run of text elements that share a font.
    # If the font changes before we have more than 5 characters, we were
    # probably looking at a header or footnote mark, so start over.
    title_text = ''
    last_font = None
    for t in text:
        if last_font is not None and t.get('font') != last_font:
            if len(title_text) > 5:
                break
            else:
                title_text = ''
        title_text += t.getText().encode('utf-8') + ' '
        last_font = t.get('font')
    return title_text.strip()

if __name__ == '__main__':
    for f in sys.argv[1:]:
        print f, ' -- ', extract_pdf_title(open(f).read())
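Saved as extract_title.py (a name I've picked here), running python extract_title.py *.pdf prints each filename next to its extracted title, which is usually enough to figure out which obscurely named download is which.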