Introduction
There are several ways to mine tables and other content from a PDF using R. After a lot of trial and error, here’s how I managed to extract the overall results of a massive, international, yearly examination: the EDAIC.
This is my first use case of “PDF mining” with R, and a fairly simple one. More complex and very fine examples can be found elsewhere, using both the pdftools and tabulizer packages.
As can be seen from the original PDF, exam results are anonymous. They consist of a numeric, 6-digit code and a binary result: “FAIL / PASS”. I was particularly interested in seeing how many candidates passed the exam, as an indirect measure of how “hard” it can be.
Mining the table
In this case I preferred pdftools, as it allowed me to extract the whole content of the PDF:
    install.packages("pdftools")
    library(pdftools)

    txt <- pdf_text("EDAIC.pdf")
    txt[1]
    class(txt[1])

    [1] "EDAIC Part I 2017 Overall Results\n Candidate N° Result\n 107131 FAIL\n 119233 PASS\n 123744 FAIL\n 127988 FAIL\n 133842 PASS\n 135692 PASS\n 140341 FAIL\n 142595 FAIL\n 151479 PASS\n 151632 PASS\n 152787 PASS\n 157691 PASS\n 158867 PASS\n 160211 PASS\n 161970 FAIL\n 162536 PASS\n 163331 PASS\n 164442 FAIL\n 164835 PASS\n 165734 PASS\n 165900 PASS\n 166469 PASS\n 167241 FAIL\n 167740 PASS\n 168151 FAIL\n 168331 PASS\n 168371 FAIL\n 168711 FAIL\n 169786 PASS\n 170721 FAIL\n 170734 FAIL\n 170754 PASS\n 170980 PASS\n 171894 PASS\n 171911 PASS\n 172047 FAIL\n 172128 PASS\n 172255 FAIL\n 172310 PASS\n 172706 PASS\n 173136 FAIL\n 173229 FAIL\n 174336 PASS\n 174360 PASS\n 175177 FAIL\n 175180 FAIL\n 175184 FAIL\nYour candidate number is indicated on your admission document Page 1 of 52\n"
    [1] "character"
These commands return a lengthy blob of text. Fortunately, it contains `\n` symbols that mark the line breaks of the original document.
We will use these to split the blob into something more approachable, using tidyverse methods:
- Split the blob.
- Transform the resulting `list` into a `character vector` with `unlist`.
- Trim leading and trailing white space with `stringr::str_trim`.
    library(tidyverse)
    library(stringr)

    tx2 <- strsplit(txt, "\n") %>%   # split on newline characters
      unlist() %>%                   # flatten the per-page list into one vector
      str_trim(side = "both")        # trim leading and trailing white space

    tx2[1:10]

     [1] "EDAIC Part I 2017 Overall Results"
     [2] "Candidate N° Result"
     [3] "107131 FAIL"
     [4] "119233 PASS"
     [5] "123744 FAIL"
     [6] "127988 FAIL"
     [7] "133842 PASS"
     [8] "135692 PASS"
     [9] "140341 FAIL"
    [10] "142595 FAIL"
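To try the split, unlist and trim steps without the PDF at hand, here is a minimal base-R sketch on a made-up two-page vector. The `pages` object and its contents are invented for illustration, and `trimws()` is swapped in for `stringr::str_trim()` so that no package is needed.

```r
# Hypothetical stand-in for the pdf_text() output: one string per page
pages <- c(
  "EDAIC Part I 2017 Overall Results\n   107131   FAIL\n   119233   PASS\n",
  "EDAIC Part I 2017 Overall Results\n   123744   FAIL\n"
)

# strsplit() returns a list holding one character vector per page
pieces <- strsplit(pages, "\n", fixed = TRUE)

# Flatten the list, then trim leading and trailing white space
lines <- trimws(unlist(pieces), which = "both")

print(class(pieces))   # a list, which is why unlist() is needed
print(length(lines))   # number of lines across both pages
print(lines[2])        # first data row, without its leading spaces
```

Because the split produces one list element per page, `unlist()` is what turns the multi-page result into a single vector of lines.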
- Remove the very first row.
- Transform into a `tibble`.
    # data_frame() is deprecated in recent tibble versions; tibble() is its replacement
    tx3 <- tx2[-1] %>% data_frame()
    tx3

    # A tibble: 2,579 x 1
       .
       <chr>
     1 Candidate N° Result
     2 107131 FAIL
     3 119233 PASS
     4 123744 FAIL
     5 127988 FAIL
     6 133842 PASS
     7 135692 PASS
     8 140341 FAIL
     9 142595 FAIL
    10 151479 PASS
    # ... with 2,569 more rows
- Use `tidyr::separate` to split each row into two columns.
- Remove all spaces.
    tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%
      mutate(key = gsub('\\s+', '', key)) %>%
      mutate(value = gsub('\\s+', '', value))
    tx4

    # A tibble: 2,579 x 2
       key       value
       <chr>     <chr>
     1 Candidate N°Result
     2 107131    FAIL
     3 119233    PASS
     4 123744    FAIL
     5 127988    FAIL
     6 133842    PASS
     7 135692    PASS
     8 140341    FAIL
     9 142595    FAIL
    10 151479    PASS
    # ... with 2,569 more rows
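As I understand it, `separate()` with `extra = "merge"` and two target columns splits each row at the first space only, and the `gsub()` calls then remove whatever spaces are left. The base-R sketch below imitates that behaviour on invented rows; `rows`, `first_space` and `result` are illustrative names, and this is not how `tidyr` is implemented.

```r
# Made-up rows in the shape produced by the trimming step
rows <- c(
  "107131      FAIL",
  "119233      PASS",
  "Your candidate number is indicated on your admission document"
)

# Position of the first space in each row (every row here contains one)
first_space <- as.integer(regexpr(" ", rows, fixed = TRUE))

# Everything before the first space is the key, the rest is the value
key   <- substr(rows, 1, first_space - 1)
value <- substr(rows, first_space + 1, nchar(rows))

# Remove all remaining white space, as the gsub() calls in the post do
key   <- gsub("\\s+", "", key)
value <- gsub("\\s+", "", value)

result <- data.frame(key, value, stringsAsFactors = FALSE)
print(result)
```

The footer line ends up with the key `Your`, which is why non-table rows still need to be filtered out afterwards.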
- Remove rows that do not represent table elements.
    tx5 <- tx4[grep('^[0-9]', tx4[[1]]),]
    tx5

    # A tibble: 2,424 x 2
       key    value
       <chr>  <chr>
     1 107131 FAIL
     2 119233 PASS
     3 123744 FAIL
     4 127988 FAIL
     5 133842 PASS
     6 135692 PASS
     7 140341 FAIL
     8 142595 FAIL
     9 151479 PASS
    10 151632 PASS
    # ... with 2,414 more rows
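The filter above keeps any row whose key starts with a digit. A stricter variant, sketched here in base R on invented data, requires the full 6-digit code and a FAIL/PASS value. The data frame `tx` and the names `is_result` and `clean` are hypothetical. The last lines check the row counts printed in this post, under the assumption that each of the 52 pages carries exactly three non-table lines (title, column header, footer).

```r
# Made-up key/value rows, including two that are not table entries
tx <- data.frame(
  key   = c("Candidate", "107131", "119233", "Your", "123744"),
  value = c("Result", "FAIL", "PASS", "candidatenumber", "FAIL"),
  stringsAsFactors = FALSE
)

# Keep rows whose key is exactly six digits and whose value is FAIL or PASS
is_result <- grepl("^[0-9]{6}$", tx$key) & tx$value %in% c("FAIL", "PASS")
clean <- tx[is_result, ]
print(clean)

# Row-count check: 52 pages x 3 non-table lines, minus the title line
# that was already dropped when the first row was removed (assumption)
non_table_rows <- 52 * 3 - 1
print(2579 - 2424 == non_table_rows)
```

If that assumption about the page layout holds, the 155 rows removed by the filter are exactly the repeated titles, headers and footers.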
Extracting the results
We now have the table! It’s time to summarise it:
    library(knitr)

    tx5 %>%
      group_by(value) %>%
      summarise(count = n()) %>%
      mutate(percent = paste(round((count / sum(count) * 100), 1), "%")) %>%
      kable()
value | count | percent |
---|---|---|
FAIL | 1017 | 42 % |
PASS | 1407 | 58 % |
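The percentages can be re-derived from the two counts alone. This base-R sketch takes the numbers from the table above; the names `counts`, `shares` and `labels` are mine. It also uses `paste0()` and `format()` rather than `paste()`, a formatting choice that keeps one decimal place and avoids the space before the percent sign.

```r
# Counts taken from the summary table in the post
counts <- c(FAIL = 1017, PASS = 1407)

# Share of each result, as a percentage rounded to one decimal
shares <- round(counts / sum(counts) * 100, 1)

# Labels with a fixed single decimal and no space before the percent sign
labels <- paste0(format(shares, nsmall = 1), "%")

print(sum(counts))   # total number of candidates in the table
print(shares)
print(labels)
```

The total matches the 2,424 rows of the filtered table, so no candidate was lost between filtering and summarising.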
From these results we see that the EDAIC Part I exam doesn’t have a particularly high pass rate. It is taken by medical specialists, and its difficulty lies in the very broad range of subjects covered: applied physics, the whole of human physiology, pharmacology, clinical medicine and the latest guidelines.
Despite being a hard test to pass, and despite the exam fee, it is becoming increasingly popular among anesthesiologists and critical care specialists who wish to stay up to date with current medical knowledge and practice.