goodreads: what books #estoniareads?

[This article was first published on reigo.eu, and kindly contributed to R-bloggers].

The Estonian National Broadcasting Agency ran a wonderful campaign called “Eesti loeb” (Estonia reads), in which people were asked to post a picture of themselves with a book they are reading (or have read) and find inspiring. Photos of readers with their favourite books were published afterwards.

But which books are the most popular among Estonians? Inspired by the campaign, we are going to answer this question by analyzing the books read by members of the Eesti Estonia group on goodreads. R will be used to get and analyze the data.

If you are not interested in the R details, feel free to jump to the results.

API

First we need to get data about the books read by members of Eesti Estonia. We are going to use the goodreads API to access the data on the website. To get access to the goodreads API, one has to register for a developer key. After getting my goodreads API key, I took advice from Andrie de Vries's blog post (I chose the option that sounded most reasonable for my purposes) and saved the key to my .Renviron file. I also found the “R startup” chapter of “Efficient R programming” helpful when learning about .Renviron.

Note: as this is my first time working with APIs, any comments on how to improve the code or methods I share are more than welcome.

After saving the goodreads API key to .Renviron, one can access it with:

api_key <- Sys.getenv("GOODREADS_API_KEY")
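Sys.getenv() returns an empty string when a variable is not set, so a missing key would only show up later as a failed request. A minimal sketch of a guard follows; get_api_key() is a hypothetical helper name and is not part of the original code:

```r
# Hypothetical helper: fetch the key and fail early if it is missing.
get_api_key <- function(var = "GOODREADS_API_KEY") {
  key <- Sys.getenv(var)  # returns "" when the variable is unset
  if (!nzchar(key)) {
    stop("Environment variable ", var,
         " is not set; add it to .Renviron and restart R.")
  }
  key
}

# Usage: api_key <- get_api_key()
```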

Getting data

Our plan is to get the user IDs of the members of the Eesti Estonia goodreads group. Using these IDs we can get the books users have on their read shelves. With this data we can easily find out how many group members have read each book. We use the functions get_group() and get_shelf() to get the raw data. To turn the raw data into tidy tibbles we use the functions group_to_df() and shelf_to_df() [1]:

library(tidyverse)
library(httr)
library(xml2)
library(rvest)
library(stringr)

get_group <- function(group_id = 45927, gr_api_key, page = 1) {

  # function for getting goodreads group info (one page of members as parsed XML)
  url <- paste0("https://www.goodreads.com/group/members/", group_id, ".xml")
  members <- httr::GET(
    url = url, 
    query = list(sort = "last_online", key = gr_api_key, page = page)
  )
  members <- httr::content(members, as = "parsed")
  members
}

get_shelf <- function(gr_id, gr_api_key, n = 200, page = 1, 
                      shelf = c("read", "currently-reading", "to-read")) {
  # function for getting shelf content (one page of a user's shelf as parsed XML)
  url <- "https://www.goodreads.com/review/list?"
  shelf <- shelf[1]
  shelf <- GET(
    url = url, 
    query = list(v = 2, key = gr_api_key, id = gr_id, 
                 shelf = shelf, per_page = n, page = page)
  )
  shelf_contents <- content(shelf, as = "parsed")
  return(shelf_contents)
}

group_to_df <- function(group) {
  
  member_id <- group %>% 
    xml_find_all("//user") %>% 
    as_list() %>% 
    map_chr(~ .$id[[1]])
  
  df <- tibble(member_id)
  df
}

shelf_to_df <- function(shelf) {
  # function for getting data frame from xml input 
  shelf_as_list <- xml2::as_list(shelf)[["reviews"]]
  title   <- shelf %>% xml_find_all("//title") %>% xml_text()
  book_id <- map_chr(shelf_as_list, ~ .$book$id[[1]])
  df <- tibble(title, book_id)
  df
}
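To see what the parsing step does without calling the API, here is a small offline sketch. The XML below is made up and only imitates the general shape of a group-members response (the real response has more fields and may be nested differently), and it reads the IDs with XPath directly rather than with as_list() and map_chr() as group_to_df() does:

```r
library(xml2)

# Made-up XML; element names other than <user> and <id> are assumptions.
sample_xml <- read_xml(
  "<response>
     <group_users>
       <group_user><user><id>101</id><first_name>Mari</first_name></user></group_user>
       <group_user><user><id>202</id><first_name>Jaan</first_name></user></group_user>
     </group_users>
   </response>"
)

# Same idea as group_to_df(): find every <user> node, then read its <id> child.
user_nodes <- xml_find_all(sample_xml, "//user")
member_id  <- xml_text(xml_find_first(user_nodes, "./id"))
member_id
```

The result is a character vector with one ID per user node, which is the form the member_id column takes in the tibble returned by group_to_df().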

We can look up the “Eesti Estonia” group ID from its URL: it is 45927. At the time of writing, the group had 23 pages of members, totalling a little under 690 members.

Now we can get all group members' IDs and use them to programmatically find and download the users' shelves. We use purrr::map_df to get and store the member IDs and shelves. Since not all members allow access to their data and some members have not listed any books on their read shelves, we use tryCatch() to keep the loop from breaking. We save the shelves one by one so that we can count the members who have read at least one book. We use this number as the effective group size, which comes in handy when calculating the percentage of members who have read a certain book. [2] Also, notice that the code below downloads only the first 1000 books (the first 5 pages, each with at most 200 entries) that users have on their read shelves. [3]

members <- map_df(1:23, function(p) {
  get_group(group_id = 45927, gr_api_key = api_key, page = p) %>% 
    group_to_df()
})

# this will take a while :)
shelves <- map_df(members$member_id, function(id) {
  
  shelf <- map_df(1:5, function(page) {
    x <- get_shelf(gr_id = id, gr_api_key = api_key,
                   shelf = "read", n = 200, page = page)
    tryCatch({
      shelf_to_df(shelf = x)
    }, error = function(cond) tibble())  # empty tibble if the shelf cannot be parsed
  }) 
  
  saveRDS(shelf, paste0("shelves/read-", id, ".rds"))
  shelf
  
})

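# count members with at least one book; read from the same relative
# "shelves" directory that the files were saved to above
n_not_zero <- map(dir("shelves", full.names = TRUE), readRDS) %>% 
  map_lgl(function(x) nrow(x) > 0) %>% sum()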

saveRDS(members, "members.rds")
saveRDS(shelves, "shelves.rds")
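The effective group size described above can be illustrated on toy data. This is only a sketch in base R: the three shelves below are invented stand-ins for the saved per-member files, and the names toy_shelves, readers and share are not from the original code:

```r
# Toy stand-ins for the per-member shelves; real shelves come from the API.
toy_shelves <- list(
  member_a = data.frame(title = c("Book X", "Book Y"), book_id = c("1", "2")),
  member_b = data.frame(title = character(0), book_id = character(0)),  # empty shelf
  member_c = data.frame(title = "Book X", book_id = "1")
)

# Effective group size: members with at least one book on the shelf.
n_not_zero <- sum(vapply(toy_shelves, function(x) nrow(x) > 0, logical(1)))

# Number of readers per book, and their share among the active members.
all_books <- do.call(rbind, toy_shelves)
readers   <- table(all_books$title)
share     <- readers / n_not_zero
share
```

Here two of the three members count as active, so "Book X" is read by 2 of 2 and "Book Y" by 1 of 2. Dividing by all three members instead would understate both shares, which is the reason for using the effective group size.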

Results

Having this data we can easily find out which are the most popular books among the group members:

df <- shelves %>%
  count(title) %>%
  arrange(-n) %>%
  slice(1:15) %>%
  mutate(title = ifelse(str_detect(title, "arry"), str_sub(title, -17, -2), title)) %>%
  arrange(-n, title) %>%
  mutate(title = forcats::fct_reorder(title, n))

# est variable is added by manually exploring data
df %>%
  bind_cols(est = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)) %>%
  mutate(`cntry` = ifelse(est == 1, "Estonian", "other")) %>%
  ggplot(aes(title, n / n_not_zero)) +
  geom_bar(stat = "identity", aes(fill = cntry)) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("#0072CE", "#000000"),
                    name = "Book author \nnationality") +
  labs(x = "", y = "% of users in goodreads 'Eesti Estonia' \ngroup who have read the book") +
  geom_text(data = df %>% slice(1) %>% mutate(text = paste(n, "out of", n_not_zero)),
            aes(title, n / n_not_zero - .055, label = text), color = "white", size = 3) +
  hrbrthemes::theme_ipsum() +
  theme(axis.text =  element_text(size = 11),
        axis.title.x = element_text(size = 11),
        legend.text = element_text(size = 11))
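The str_sub() call in the code above shortens the Harry Potter titles so that they fit on the axis. A small sketch of what it does, using two example titles; the first assumes goodreads appends the series tag in the form "(Harry Potter, #2)":

```r
library(stringr)

titles <- c(
  "Harry Potter and the Chamber of Secrets (Harry Potter, #2)",
  "Rehepapp ehk november"
)

# For titles containing "arry", keep the characters from 17th-from-last to
# second-to-last, i.e. the series tag without its surrounding parentheses.
short <- ifelse(str_detect(titles, "arry"), str_sub(titles, -17, -2), titles)
short
# "Harry Potter, #2"  "Rehepapp ehk november"
```

The fixed offsets work only because the tag is exactly 18 characters long; a two-digit series number or a different tag format would need a pattern-based extraction instead.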

Andrus Kivirähk’s Rehepapp ehk november takes first place. Almost 40% of group members have it on their read shelf. The movie version of “Rehepapp ehk november” came out in 2017, which gives anyone interested a second medium through which to discover this popular story.

In fourth place we see one of the Estonian classics: A. H. Tammsaare’s Tõde ja õigus I. My initial guess was that books by Estonian authors would dominate the list. [4] That turned out not to be the case; nevertheless, I am happy to see 4 books by Estonian authors in the TOP15. The popularity of the Harry Potter series came as a little surprise to me, probably because I am biased: I have not read any of the books myself. Scrolling through the “Eesti loeb” campaign photos of readers with their favourite books, we see both authors, A. H. Tammsaare and J. K. Rowling, mentioned more than once.


  1. Half the credit for these functions goes to Max Humber’s blog post about how to use the goodreads API in R.

  2. Real percentages may obviously differ, but for our purposes I find this a good enough approximation. If your read shelf is empty, then you are probably not active on goodreads.

  3. There are 8 members with more than 1000 books on their read shelf, which means the reader count for any book can be off by up to 8.

  4. When the Harry Potter series is considered as one book, the Estonian/other ratio in the TOP15 remains the same.
