
About Risks and Side-Effects… Consult your Purrr-Macist

[This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers].

Capture errors, warnings and messages, but keep your list operations going

In a recent post about text mining, I discussed some solutions to webscraping the contents of our STATWORX blog using the purrr package. However, while preparing the next episode of my series on text mining, I remembered a little gimmick that I found quite helpful along the way. Thus, a little detour: How do I capture side-effects and errors when I perform operations on lists with purrr rather than using a loop?

First of all, a quick motivating example: Imagine we want to build a parser for the blog section of our STATWORX website. However, we have no idea how many entries were posted in the meantime (the hive of data scientists in our office is usually pretty quick at writing these things). Thus, we need such a function to be robust in the sense that it can endure the cruelties of "404 – Not found" error messages and still continue parsing after running into an error.

How could this possibly() work?

So let's use some beautiful purrr adverbs to fearlessly record all our outputs, errors and warnings, rather than stopping and asking the user to handle side-effects the exact moment errors turn up. These adverbs are reminiscent of try(), but they are a little more convenient for operations on lists.
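To make the comparison with base R error handling concrete, here is a minimal sketch. The helper half() and the inputs are made up for illustration and are not part of the original parser; the point is only that possibly() modifies the function once, whereas tryCatch() has to be written around every call.

```r
library(purrr)

# Hypothetical helper for illustration: fails on non-numeric input
half <- function(x) x / 2

inputs <- list(10, "a", 4)

# Base R: wrap each call in tryCatch() by hand
base_result <- lapply(inputs, function(x) {
  tryCatch(half(x), error = function(e) NA_real_)
})

# purrr: modify the function once, then map as usual
safe_half <- possibly(half, otherwise = NA_real_)
purrr_result <- map(inputs, safe_half)

# Both approaches should yield the same list
identical(base_result, purrr_result)
```

The modified function safe_half() can be reused in any later map() call without repeating the error-handling code.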

Let's consider a more complex motivating example first, but no worries – there are more obvious examples to help explain the nitty-gritties further down this page. The R-code below illustrates our use of possibly() for later use with purrr::map(). First, let us have a look at what we tried to achieve with our function. More specifically, what happens between the curly braces below: Our robust_parse() function will simply parse HTML webpages for other links using URLs that we provide it with. In this case, we simply use paste0() to create a vector of links to our blog overview pages, extract the weblinks from each of these pages using XML::xpathSApply(), pipe these weblinks into a data_frame and clean our results from duplicates using dplyr::filter() – there are various overview pages that group our blogs by category – and dplyr::distinct().

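library(purrr)  # possibly(), map_df()
library(dplyr)  # %>%, data_frame(), filter(), distinct()
library(XML)    # htmlParse(), xpathSApply()

robust_parse <- possibly(function(value){
  # Parse one blog overview page and collect every link (href) on it
  htmlParse(paste0("http://www.statworx.com/de/blog/page/",
                   value, "/")) %>%
    xpathSApply(., "//a/@href") %>%
    data_frame(.) %>%
    # Keep blog links, then drop overview and category pages
    filter(., grepl("/blog", .)) %>%
    filter(., !grepl("/blog/$|/blog/page/|/data-science/|/statistik/", .)) %>%
    distinct()
  }, otherwise = NULL)  # return NULL instead of failing on a missing page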

Second, let us inspect how we employ possibly() in this context. possibly() expects two things from us: the function to be modified and the argument otherwise, stating what it is supposed to return when things go south. In this case, we want NULL as an output value. Another popular choice would be NA, signaling that somewhere, we have not produced a string as intended. However, in our example we are happy with NULL, since we only want to parse the pages that exist and do not require a specific listing of pages that do not exist (or what happened when we did not find a page).
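The difference between the two choices for otherwise can be sketched with a toy function. Here fake_parse() is a hypothetical stand-in for a page parser that "fails" for ids above 2; it does not touch the network and is not the author's code.

```r
library(purrr)

# Hypothetical stand-in for a page parser: "fails" for ids above 2
fake_parse <- function(id) {
  if (id > 2) stop("404 - Not found")
  paste0("page-", id)
}

ids <- 1:4

# otherwise = NULL: failures leave NULL entries that compact() can drop
with_null <- map(ids, possibly(fake_parse, otherwise = NULL))
found <- compact(with_null)

# otherwise = NA: failures stay visible, one entry per input
with_na <- map_chr(ids, possibly(fake_parse, otherwise = NA_character_))
```

With NULL, the failed pages simply vanish from the cleaned result; with NA, the output keeps one slot per input, so we can still see which ids failed.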

webpages <- map_df(0:100, ~robust_parse(.)) %>%
  unlist

webpages

.1 
"https://www.statworx.com/de/blog/strsplit-but-keeping-the-delimiter/"
.2 
"https://www.statworx.com/de/blog/data-science-in-python-vorstellung-von-nuetzlichen-datenstrukturen-teil-1/"
.3 
"https://www.statworx.com/de/blog/burglr-stealing-code-from-the-web/"
.4 
"https://www.statworx.com/de/blog/regularized-greedy-forest-the-scottish-play-act-i/"
...

Third, we use our new function robust_parse() to operate on a vector or list of integers from 0 to 100 (possible numbers of subpages we want to parse) and have a quick look at the beautiful links we extracted. Just as a reminder, below you find the code to extract and clean the contents of the individual pages, using another map_df()-based loop – which is the focus of another post.

tidy_statworx_blogs <- map_df(webpages, ~read_html(.) %>% 
                                htmlParse(., asText = TRUE) %>%
                                xpathSApply(., "//p", xmlValue) %>%
                                paste(., collapse = "\n") %>%
                                gsub("\n", "", .) %>%
                                data_frame(text = .) %>%
                                unnest_tokens(word, text) %>%
                                anti_join(data_frame(word = stopwords("de"))) %>% 
                                anti_join(data_frame(word = stopwords("en"))) %>% 
                                mutate(author = .$word[2]))

However, we actually want to go back to our purrr-helpers and see what they can do for us. To be more specific, rather than helpers, these are actually called adverbs since we use them to modify the behavior of a function (i.e. a verb). Our current robust_parse() function does not produce entries when the loop does not successfully find a webpage to parse for links. Consider the situation where you intend to keep track of unsuccessful operations and of errors that arise along the way. Instead of further exploring purrr adverbs using the above code, let us look at a much easier example to see the contexts in which purrr adverbs might help you out.

A much easier example: Try dividing a character string by 2

Suppose there is an element in our list where our amazing division powers are useless: We are going to try to divide all the elements in our list by 2, but this time, we want purrr to note where the function i_divide_things resists dividing particular elements for us. Again, the otherwise argument helps us define our output in situations that are beyond the scope of our function.

i_divide_things <- possibly(function(value){
                      value / 2},
                      otherwise = "I won't divide this for you.")

# Let's try our new function

purrr::map(list(1, 2, "a", 6), ~ i_divide_things(.))

[[1]]
[1] 0.5

[[2]]
[1] 1

[[3]]
[1] "I won't divide this for you."

[[4]]
[1] 3

However, consider the case where "something did not work out" might not suffice and you want to keep track of the errors themselves while still retaining the entire output (warnings are a job for quietly(), covered further down). A job for safely(): As illustrated below, wrapping our function in safely() gives us a nested list as output. For each element of the input, the output provides two components, $result and $error. For all iterations where a list element is numeric, $result holds the numeric output and $error is empty (= NULL). Only for the third list element, where our function stumbled over a character input, did we capture an error message, alongside the result we defined using otherwise.

i_divide_things <- safely(function(value){
                      value /2},
                      otherwise = "This did not quite work out.")

purrr::map(list(1, 2, "a", 6), ~ i_divide_things(.))

[[1]]
[[1]]$result
[1] 0.5

[[1]]$error
NULL


[[2]]
[[2]]$result
[1] 1

[[2]]$error
NULL


[[3]]
[[3]]$result
[1] "This did not quite work out."

[[3]]$error
<simpleError in value/2: non-numeric argument to binary operator>


[[4]]
[[4]]$result
[1] 3

[[4]]$error
NULL
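A nested list like this is a little unwieldy to read, so it can help to pull the two components apart. The following is a minimal sketch; the names safe_divide, failed and ok_values are made up for illustration, and otherwise is left at its default (NULL) to keep failed results easy to drop.

```r
library(purrr)

safe_divide <- safely(function(value) value / 2)

results <- map(list(1, 2, "a", 6), safe_divide)

# Extract one component from every element by name
errors <- map(results, "error")
values <- map(results, "result")

# Which inputs failed?
failed <- map_lgl(errors, ~ !is.null(.x))
which(failed)

# Keep only the successful results
ok_values <- unlist(values[!failed])

# Read the message of the captured error
conditionMessage(errors[[3]])
```

Here the extraction relies on map() accepting a component name in place of a function, which works because every element has the same $result and $error structure.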

In the example above, we only saw our errors after looping over all elements of our list, by inspecting the output list. However, safely() also has the quiet argument, set to TRUE by default. If we set it to FALSE, the errors are reported the very moment they occur.
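As a small sketch of the quiet argument (the name loud_divide and the NA_real_ fallback are chosen for illustration only): the error should be reported on the console while map() is still running, and the loop carries on with the remaining elements.

```r
library(purrr)

# quiet = FALSE reports each error as soon as it is caught
loud_divide <- safely(function(value) value / 2,
                      otherwise = NA_real_,
                      quiet = FALSE)

results <- map(list(1, "a", 6), loud_divide)

# The error is still stored in the output list as well
results[[2]]$error
```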

Now, we want to have a quick look at quietly(). We will raise a warning, emit a message and print an output. This is to illustrate where purrr saves the individual components that our function returns. For each element of our input, the returned list provides four components: $result, $output, $warnings and $messages.

i_divide_things <- purrr::quietly(function(value){
  if (is.numeric(value)) {
    print(value / 2)
  } else {
    warning("Can't be done. Printing this instead.")
    message("Why would you even try dividing this?")
    print(value)
  }
})

purrr::map(list(1, "a", 6), ~i_divide_things(.))

[[1]]
[[1]]$result
[1] 0.5

[[1]]$output
[1] "[1] 0.5"

[[1]]$warnings
character(0)

[[1]]$messages
character(0)


[[2]]
[[2]]$result
[1] "a"

[[2]]$output
[1] "[1] \"a\""

[[2]]$warnings
[1] "Can't be done. Printing this instead."

[[2]]$messages
[1] "Why would you even try dividing this?\n"


[[3]]
[[3]]$result
[1] 3

[[3]]$output
[1] "[1] 3"

[[3]]$warnings
character(0)

[[3]]$messages
character(0)
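Once the side-effects are stored per element, they can be summarised across the whole list. A minimal sketch, using a slightly simplified variant of the function above (quiet_divide and its warning text are illustrative names, not the original code). Note that quietly() records warnings, messages and printed output, but an actual error would still stop the loop.

```r
library(purrr)

quiet_divide <- quietly(function(value) {
  if (!is.numeric(value)) {
    warning("Can't be done. Returning the input instead.")
    return(value)
  }
  value / 2
})

results <- map(list(1, "a", 6), quiet_divide)

# Flag every element that raised at least one warning
had_warning <- map_lgl(results, ~ length(.x$warnings) > 0)

# Collect all warning texts in one character vector
all_warnings <- unlist(map(results, "warnings"))
```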

Last, there is auto_browse(), which drops us into R's interactive browser (the debugger in RStudio) as soon as the wrapped function throws an error and brings the user to the approximate location of the error. This case is illustrated in the screenshot below.

i_divide_things <- purrr::auto_browse(function(value){

    print(value / 2)
})

purrr::map(list(1, "a", 6), ~i_divide_things(.)) 

Splendid - this was a quick wrap-up of how to wrap your functions for handling side-effects in your operations on lists using adverbs of purrr. Happy wrapping everyone!

About the author

David Schlepps

David is a member of the Data Science team and is interested in R and Markdown. In his spare time, he enjoys playing the guitar.

The post About Risks and Side-Effects… Consult your Purrr-Macist first appeared on STATWORX.
