I end up with a lot of PDF documents lying around – at last count, a few thousand files. Unfortunately, most of these documents end up with rather obscure names, making it annoying to find what I want, or what might be interesting. For example, these are the documents I’ve recently downloaded:
wodet3-paper12.pdf jong_afst.pdf tut_gpu_2012_03.pdf lecture1-1.pdf natella_binary_sfi_edcc_2012.pdf TR-Farrukh-58.pdf 730959.pdf NLSEmagic_Paper.pdf M23584378H1770Q2.pdf G89T37P10W263075.pdf journal_online.pdf manus_Jour-INFORMATION-Camera.pdf 12011.VitekJan.Paper.pdf R3X8722476T2X278.pdf 1203.0321.pdf
I previously tried to organize everything using something like Papers, which is a lovely product, but it still required effort on my part and isn’t much use now that I no longer have a Mac.
I’ve also tried to rectify this situation with half-hearted attempts at using pdftotext and grabbing the first 10 words of text, but more often than not I was left with yet more incomprehensible garbage.
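For reference, that earlier approach amounted to something like the following sketch (assuming poppler's pdftotext is on the PATH; the exact commands I used back then differed, but the idea is the same):

```python
import subprocess

def first_words(pdf_path, n=10):
    # Extract plain text from page 1 only; "-" sends pdftotext's output to stdout.
    text = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Keep the first n whitespace-separated tokens as a stand-in "title".
    return " ".join(text.split()[:n])
```

The problem is that the first ten words of a paper are frequently a header, a DOI, or OCR noise rather than anything resembling a title.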
Today, I had some spare time and far too much interest in this problem, and I managed to come up with an easy and fairly effective solution. It also resembles a Rube Goldberg machine. After digging around for various PDF conversion utilities, I discovered that pdftohtml not only generated reasonable output, but could also be told to emit an easily parsed XML format. From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents:
View the code on Gist.
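Roughly, the idea looks like the sketch below: convert just the first page to pdftohtml's XML format, map each fontspec id to its point size, and treat whatever text is set in the largest font as the title. The largest-font heuristic and the helper name are my illustration of the approach, not necessarily the exact code in the Gist, and it assumes poppler's pdftohtml plus the lxml parser for BeautifulSoup are installed.

```python
import subprocess
import sys
from bs4 import BeautifulSoup

def guess_title(pdf_path):
    # Convert only the first page to the pdf2xml format, printed to stdout,
    # ignoring images and suppressing progress messages.
    xml = subprocess.run(
        ["pdftohtml", "-xml", "-stdout", "-i", "-q", "-f", "1", "-l", "1", pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout

    soup = BeautifulSoup(xml, "xml")

    # Each <fontspec> declares a font id and its size; build an id -> size map.
    sizes = {spec["id"]: float(spec["size"]) for spec in soup.find_all("fontspec")}

    # Keep only <text> elements that actually contain visible text.
    texts = [t for t in soup.find_all("text") if t.get_text(strip=True)]
    if not texts:
        return None

    # Heuristic: the title is whatever is set in the largest font on page 1.
    biggest = max(sizes.get(t["font"], 0.0) for t in texts)
    parts = [t.get_text(strip=True) for t in texts
             if sizes.get(t["font"], 0.0) == biggest]
    return " ".join(parts)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, "->", guess_title(path))
```

From there, renaming the files to their guessed titles is a one-liner, though a sanity check on the result is worthwhile before touching a few thousand PDFs.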