Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Motivation
I finally made it to the movies to watch the new Star Wars release this weekend. Although this episode wasn’t that spectacular, in my view, it did inspire some data seeking afterwards. I wanted to know how this film compares to other top movies in terms of worldwide gross, as well as to the other films in the Star Wars series.
Fortunately, there is a rich, though incomplete, list of the top-grossing films of all time at <boxofficemojo.com>. Although the information is right on the front page, I’d rather have something more visually appealing. So I decided to see how it goes with *R* and the new *ggplot2* package release. Also, because I must scrape the data from the *Box Office Mojo* website, I will need a function to handle the HTML structure of those tables. The function `readHTMLTable()` from the *XML* package can certainly be an asset here.
The setup
First, let’s load the packages we’ll need.
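The post names two packages, *XML* and *ggplot2*. A minimal setup, assuming both are already installed and that nothing else is required, might look like this:

```r
# Packages named in the post; any other package is outside this sketch.
library(XML)      # provides readHTMLTable() for parsing HTML tables
library(ggplot2)  # used for the charts in the Results section
```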
What follows is my setup for using the `readHTMLTable()` function to retrieve, clean, and arrange HTML tables in a data.frame format. I could wrap everything in a single function, but keeping the three snippets apart makes them easier to follow.
The first function will pull out all tables on the webpage as a list of data.frames, and I’ll give them similar names.
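As a rough sketch of that first step, the helper below parses every table in an HTML document and gives the resulting data.frames uniform names. The name `get_tables()`, the `table_1`, `table_2`, ... naming scheme, and the small inline page are assumptions made for illustration, not the original code.

```r
library(XML)

# Hypothetical helper: parse every <table> in an HTML document into a
# named list of data.frames. `html` can be a URL, a file path, or, with
# as_text = TRUE, a string holding the markup itself.
get_tables <- function(html, as_text = FALSE) {
  doc <- htmlParse(html, asText = as_text)
  tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
  # Give all tables similar, predictable names.
  names(tables) <- paste0("table_", seq_along(tables))
  tables
}

# Placeholder page (not real box-office data) so the sketch runs offline.
page <- "<html><body><table>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td>Film A</td></tr>
  <tr><td>2</td><td>Film B</td></tr>
</table></body></html>"

tabs <- get_tables(page, as_text = TRUE)
str(tabs$table_1)
```

Passing a URL instead of the inline string should work the same way, provided the page is reachable and still serves plain HTML tables.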
The data will come dirty, with lots of tags and marks, so a little janitorial work will be necessary. The following code does just that.
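A cleaning step of this kind could look like the sketch below. It assumes the monetary columns arrive as text with dollar signs, thousands separators, and footnote carets, and that missing entries show up as empty strings or "n/a". Both the helper name `clean_money()` and those assumptions about the raw format are illustrative.

```r
# Hypothetical cleaner: strip "$", "," and "^" from a character vector,
# treat empty or "n/a" entries as missing, and convert the rest to numeric.
clean_money <- function(x) {
  x <- gsub("[$,^]", "", x)      # inside [], all three characters are literal
  x <- trimws(x)
  x[x %in% c("", "n/a")] <- NA   # assumed markers for missing values
  as.numeric(x)
}

# Made-up strings in the assumed raw format, not figures from the site.
clean_money(c("$1,234.5", "$987.6^", "n/a"))
```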
The next snippet is the main piece. It will construct the URLs based on the number of pages we feed in and will call the two preceding functions.
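The driver described here can be sketched as two small pieces: one that builds the page URLs and one that applies a reader and a cleaner to each of them. The query parameter `pg` is an assumption about how the site paginates, and `build_urls()`, `scrape_pages()`, `read_page`, and `clean_page` are hypothetical names standing in for the original functions.

```r
# Build one URL per results page. The "pg" query parameter is assumed;
# check the site's actual pagination scheme before relying on it.
build_urls <- function(pages,
                       base = "http://www.boxofficemojo.com/alltime/world/",
                       param = "pg") {
  paste0(base, "?", param, "=", seq_len(pages))
}

# Apply a reader and a cleaner (both supplied by the caller) to every URL
# and stack the resulting data.frames into one.
scrape_pages <- function(urls, read_page, clean_page) {
  pages <- lapply(urls, function(u) clean_page(read_page(u)))
  do.call(rbind, pages)
}

urls <- build_urls(7)
urls
```

Keeping the reader and cleaner as arguments means the driver can be tried with stand-in functions before pointing it at the live site.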
I’m scraping the first 7 pages of the target address http://www.boxofficemojo.com/alltime/world/. The result will contain missing values too, but don’t worry about that for the time being.
Results
Our newly acquired data is a data.frame with more than 620 rows, one per film, with the oldest film dating back to 1939.
The following chart displays the worldwide gross for the top 25 ranked movies of all time. As it shows, *Star Wars: The Force Awakens* is doing pretty well worldwide. It’s ranked fourth now, but it just began to play in China this week, so it may unseat *Titanic* over the next few weeks, and *Avatar* in the long run.
If you want to reproduce the exact plot styling of this post, you’ll have to install the development version of the *SciencesPo* package and add `+ theme_538(legend="top")` to the following code.
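A plain-*ggplot2* version of such a chart might be sketched as follows. The helper name `plot_top_grossing()` and the column names `title` and `worldwide` are assumptions about how the scraped data.frame is laid out; the theme from *SciencesPo* is left out so the sketch depends only on *ggplot2*.

```r
library(ggplot2)

# Hypothetical plotting helper. `films` is assumed to have a character
# column `title` and a numeric column `worldwide` (worldwide gross).
plot_top_grossing <- function(films, n = 25) {
  # Keep the n highest-grossing films.
  top <- head(films[order(-films$worldwide), ], n)
  ggplot(top, aes(x = reorder(title, worldwide), y = worldwide)) +
    geom_bar(stat = "identity") +
    coord_flip() +   # horizontal bars keep long film titles readable
    labs(x = NULL, y = "Worldwide gross",
         title = paste("Top", nrow(top), "films by worldwide gross"))
}
```

Calling `plot_top_grossing(films)` on the scraped data.frame returns a ggplot object, so further layers or a theme can be added with `+`.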
Next, how does *The Force Awakens* compare with the other episodes of *Star Wars*?
Disney’s new Star Wars release is making its way toward the top films of all time. Keep in mind that the values presented are not adjusted for inflation, and that older films in the list may have their grosses based on multiple releases, whereas *The Force Awakens* has only just begun its journey.
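One simple way to isolate the series, assuming every episode carries "Star Wars" in its title and that the scraped data.frame has a `title` column, is a literal string match. The helper name `star_wars_only()` is hypothetical.

```r
# Hypothetical filter: keep only rows whose title contains "Star Wars".
# Assumes a character column named `title`.
star_wars_only <- function(films) {
  keep <- grepl("Star Wars", films$title, fixed = TRUE)
  films[keep, , drop = FALSE]
}
```

The resulting subset can then be charted the same way as the full ranking.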