Taming exam results in pdf with pdftools

mareviv

4 years ago

[This article was first published on R – Tales of R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

You can also check this post, written in #blogdown, here: taming-exam-results-with-pdf.

Introduction

There are several ways to mine tables and other content from a pdf, using R. After a lot of trial & error, here’s how I managed to extract global exam results from an international, massive, yearly examination, the EDAIC.

This is my first use case of “pdf mining” with R, and also a fairly simple one. However, more complex and very fine examples of this can be found elsewhere, using both pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist on a numeric, 6-digit code and a binary result: “FAIL / PASS”. I was particularly interested into seeing how many of them passed the exam, as some indirect measure of how “hard” it can be.

Mining the table

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

install.packages("pdftools")
library(pdftools) 
txt <- pdf_text("EDAIC.pdf") 
txt[1] 
class(txt[1]) 
  [1] "EDAIC Part I 2017                                                  Overall Results\n                                         Candidate N°       Result\n                                            107131            FAIL\n                                            119233            PASS\n                                            123744            FAIL\n                                            127988            FAIL\n                                            133842            PASS\n                                            135692            PASS\n                                            140341            FAIL\n                                            142595            FAIL\n                                            151479            PASS\n                                            151632            PASS\n                                            152787            PASS\n                                            157691            PASS\n                                            158867            PASS\n                                            160211            PASS\n                                            161970            FAIL\n                                            162536            PASS\n                                            163331            PASS\n                                            164442            FAIL\n                                            164835            PASS\n                                            165734            PASS\n                                            165900            PASS\n                                            166469            PASS\n                                            167241            FAIL\n                                            167740            PASS\n                                            168151            FAIL\n                                            168331            PASS\n                                            168371            FAIL\n                                            168711            FAIL\n                                            169786            PASS\n                                            170721            FAIL\n                                            170734            FAIL\n                                            170754            PASS\n                                            170980            PASS\n                                            171894            PASS\n                                            171911            PASS\n                                            172047            FAIL\n                                            172128            PASS\n                                            172255            FAIL\n                                            172310            PASS\n                                            172706            PASS\n                                            173136            FAIL\n                                            173229            FAIL\n                                            174336            PASS\n                                            174360            PASS\n                                            175177            FAIL\n                                            175180            FAIL\n                                            175184            FAIL\nYour candidate number is indicated on your admission document        Page 1 of 52\n"
  [1] "character"

These commands return a lenghty blob of text. Fortunately, there are some \n symbols that signal the new lines in the original document.

We will use these to split the blob into something more approachable, using tidyversal methods…

Split the blob.
Transform the resulting list into a character vector with unlist.
Trim leading white spaces with stringr::str_trim.

library(tidyverse) 
library(stringr) 
tx2 <- strsplit(txt, "\n") %>% # divide by carriage returns
  unlist() %>% 
  str_trim(side = "both") # trim white spaces
tx2[1:10]
   [1] "EDAIC Part I 2017                                                  Overall Results"
   [2] "Candidate N°       Result"                                                         
   [3] "107131            FAIL"                                                            
   [4] "119233            PASS"                                                            
   [5] "123744            FAIL"                                                            
   [6] "127988            FAIL"                                                            
   [7] "133842            PASS"                                                            
   [8] "135692            PASS"                                                            
   [9] "140341            FAIL"                                                            
  [10] "142595            FAIL"

Remove the very first row.
Transform into a tibble.

tx3 <- tx2[-1] %>% 
  data_frame() 
tx3
  # A tibble: 2,579 x 1
                             .
                         <chr>
   1 Candidate N°       Result
   2    107131            FAIL
   3    119233            PASS
   4    123744            FAIL
   5    127988            FAIL
   6    133842            PASS
   7    135692            PASS
   8    140341            FAIL
   9    142595            FAIL
  10    151479            PASS
  # ... with 2,569 more rows

Use tidyr::separate to split each row into two columns.
Remove all spaces.

tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%  
  mutate(key = gsub('\\s+', '', key)) %>%
  mutate(value = gsub('\\s+', '', value)) 
tx4
  # A tibble: 2,579 x 2
           key    value
         <chr>    <chr>
   1 Candidate N°Result
   2    107131     FAIL
   3    119233     PASS
   4    123744     FAIL
   5    127988     FAIL
   6    133842     PASS
   7    135692     PASS
   8    140341     FAIL
   9    142595     FAIL
  10    151479     PASS
  # ... with 2,569 more rows

Remove rows that do not represent table elements.

tx5 <- tx4[grep('^[0-9]', tx4[[1]]),] 
tx5
  # A tibble: 2,424 x 2
        key value
      <chr> <chr>
   1 107131  FAIL
   2 119233  PASS
   3 123744  FAIL
   4 127988  FAIL
   5 133842  PASS
   6 135692  PASS
   7 140341  FAIL
   8 142595  FAIL
   9 151479  PASS
  10 151632  PASS
  # ... with 2,414 more rows

Extracting the results

We already have the table! now it’s time to get to the summary:

library(knitr)
tx5 %>%
  group_by(value) %>%
  summarise (count = n()) %>%
  mutate(percent = paste( round( (count / sum(count)*100) , 1), "%" )) %>% 
  kable()

value	count	percent
FAIL	1017	42 %
PASS	1407	58 %

From these results we see that the EDAIC-Part1 exam doesn’t have a particularly high clearance rate. It is currently done by medical specialists, but its dificulty relies in a very broad list of subjects covered, ranging from topics in applied physics, the entire human physiology, pharmacology, clinical medicine and latest guidelines.

Despite being a hard test to pass -and also the exam fee-, it’s becoming increasingly popular among anesthesiologists and critical care specialists that wish to stay up-to date with the current medical knowledge and practice.

To leave a comment for the author, please follow the link and comment on their blog: R – Tales of R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.