Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Motivation
I could finally made to the movies for watching the new Star Wars release this weekend. Although this episode wasn’t that spectacular, in my view, it did inspire some data seeking afterwards. I wanted to know how this film compares to others top movies in terms of worldwide grossing as well as within the Star Wars series.
Fortunately, there is a wealth–though incomplete–list of the top grossing films of all time at < boxofficemojo.com>. Although the information is right in the front-page, I’d rather like something more visual teasing. So, I decided to see how it goes with *R* and the new *ggplot2* package release. Also, because I must scrap the data from the *Box Office* website, I will need a function to handle the HTML structure of those tables. The function `readHTMLTable()` from the *XML* package can certainly be an asset here.
The setup
First, let’s load the packages we’ll need.
library(XML) library(ggplot2)
In what follows is my setup for using the readHTMLTable
function to retrieve, cleaning, and arrange HTML tables in a data.frame format. I’d rather wrap everything in a single function, but keeping the three snippets apart is rather easy to make out.
The first function will pull out all tables on the webpage as a list of data.frames, and I’ll give them similar names.
GetTable <- function(t) { table <- readHTMLTable(t)[[2]] names(table) <- c("Rank", "Title", "Studio", "Worldwide", "Domestic", "DomesticPct", "Overseas", "OverseasPct", "Year") boxdf <- as.data.frame(lapply(table[-1, ], as.character), stringsAsFactors=FALSE) boxdf <- as.data.frame(boxdf, stringsAsFactors=FALSE) boxdf <- transform(boxdf, Year = ifelse(Year==0, NA, Year)) return(boxdf) }
The data will come dirty
with lots of tags and marks, so a little janitor work will be necessary. The following code does just that.
CleanDataFrame <- function(boxdf) { clean <- function(col) { col <- gsub("$", "", col, fixed = TRUE) col <- gsub("%", "", col, fixed = TRUE) col <- gsub(",", "", col, fixed = TRUE) col <- gsub("^", "", col, fixed = TRUE) return(col) } boxdf <- sapply(boxdf, clean) boxdf <- as.data.frame(boxdf, stringsAsFactors=FALSE) return(boxdf) }
The next snippet is the main piece. It will construct the URLs based on the number of pages we feed in and will call the two preceding functions.
BoxOfficeMojoScraper <- function(npages) { # This line constructs the URLs urls <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", 1:npages, "&p=.htm", sep = "") # The next line scrapes every table in the URLs formed boxdf <- do.call("rbind", lapply(urls, GetTable)) # This does the janitor work boxdf <- CleanDataFrame(boxdf) # The next lines arrange the data to my needs cols <- c(1, 4:9) boxdf[, cols] <- sapply(boxdf[, cols], as.numeric) boxdf$Studio <- as.factor(boxdf$Studio) return(boxdf) }
I’m scrapping the first 7 pages of the target address http://www.boxofficemojo.com/alltime/world/. It will bring missing values too, don’t worry for the time being.
npages <- 7 box <- BoxOfficeMojoScraper(npages) ## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
Results
Our new acquired data is a data.frame with more than 620 rows or films, with the oldest dating back to 1939.
str(box) ## 'data.frame': 628 obs. of 9 variables: ## $ Rank : num 1 2 3 4 5 6 7 8 9 10 ... ## $ Title : chr "Avatar" "Titanic" "Jurassic World" "Star Wars: The Force Awakens" ... ## $ Studio : Factor w/ 38 levels "Art.","BV","Col.",..: 7 22 32 2 2 32 2 36 2 2 ... ## $ Worldwide : num 2788 2187 1669 1602 1520 ... ## $ Domestic : num 760 659 652 781 623 ... ## $ DomesticPct: num 27.3 30.1 39.1 48.7 41 23.3 32.7 28.4 31.4 33.7 ... ## $ Overseas : num 2028 1528 1017 821 896 ... ## $ OverseasPct: num 72.7 69.9 60.9 51.3 59 76.7 67.3 71.6 68.6 66.3 ... ## $ Year : num 2009 1997 2015 2015 2012 ... # the oldest: min(box$Year, na.rm=TRUE) ## [1] 1939
The following chart displays the grossing worldwide values for the top 25 ranked movies of all time. As it shines out, the Star Wars: The Force Awakens is doing pretty well worldwide. It’s ranked fourth now, but it just began to play in China this week, so it may unseat Titanic over the next weeks, and Avatar in the long run.
If you want to reproduce the very same plot decoration of this post, you’ll have to install the development version of SciencesPo package, and add + theme_538(legend="top")
to the following code.
box2 <- subset(box, Rank<=25) ggplot(box2) + geom_bar(aes(x=reorder(Title, Worldwide), y=Worldwide, fill="Worldwide"), stat = "identity") + geom_bar(aes(x=Title, y=Domestic, fill="Domestic"),alpha=.5, stat = "identity") + scale_fill_manual(name="Grossing", values=c(Worldwide="#A6CEE3", Domestic="#386CB0")) + coord_flip() + labs(x=NULL, y=NULL, title="Top 25 Films by Worldwide Grosses (US$ Millions)")
Next, how The Force Awakens compares with other episodes of Star Wars?
# will search for Star Wars names within the data.frame: box3 <- subset(box, grepl("^Star Wars", box$Title)|grepl("^Return of the Jedi", box$Title)|grepl("^The Empire Strikes Back", box$Title)) # will wrap the axis labels wrap_20 <- function(x)gsub('(.{1,20})(\s|$)', '\1n', x) ggplot(box3) + geom_bar(aes(x=reorder(wrap_20(Title), Worldwide), y=Worldwide, fill="Worldwide"), stat = "identity") + geom_bar(aes(x=wrap_20(Title), y=Domestic, fill="Domestic"), alpha=.5, stat = "identity") + scale_fill_manual(name="Grossing", values=c(Worldwide="#A6CEE3", Domestic="#386CB0")) + coord_flip() + labs(x=NULL, y=NULL, title="Star Wars Grosses (US$ Millions)")
The new release of Star Wars by Disney is making its way to the top films of all-time. Although the values are not in current currency, old films in the list may have grosses based on multiple releases, and that the The Force Awakens just began its journey.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.