Yet Another Movie: IMDB Top 250 movies
I’m not a big movie person. Nonetheless I have a media library with quite a few films in and I wondered how many “films to see before you die”-type movies I had in the collection, and how many were missing. I used R to find the answers.
I’ve described previously how to get a plain text dump of a Plex database using WebTools-NG. I did that for the Movies library of my Plex Media Server. Now, for the list of “films to see before you die”. I searched a bit and found a few text files which claimed to be meta-rated as the best. I was a bit suspicious about these. In the end, I figured I should just use the IMDB’s Top 250 Movies, which could be scraped with rvest.
The code
Let’s get the Top 250 movies:
library(rvest)
library(XML)
library(xml2)
library(fuzzyjoin)
library(dplyr)

# IMDB Top 250 Movies are here
url <- "http://www.imdb.com/chart/top?ref_=nv_wl_img_3"
page <- read_html(url)

movie.nodes <- html_nodes(page, '.titleColumn a')
movie.name <- html_text(movie.nodes)

sec <- html_nodes(page, '.secondaryInfo')
# to get the year we need to remove ) and ( and then get text)
year <- as.numeric(gsub(")", "", gsub("\\(", "", html_text(sec))))

rating.nodes <- html_nodes(page, '.imdbRating strong')
rating <- as.numeric(html_text(rating.nodes))

imdb <- data.frame(Title = movie.name, Year = year, Rating = rating)
Now we have a data frame of the movies, with Title, Year and the IMDB rating.
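The year strings come back from the page wrapped in brackets, e.g. "(1994)". As a minimal sketch of that clean-up step in isolation, here it is run on a few made-up sample strings rather than the live page. The sample vector and the check at the end are my own illustration, not part of the original script.

```r
# Sample strings in the bracketed form that the year text arrives in
raw_year <- c("(1994)", "(1957)", "(2008)")

# "[()]" matches either bracket, so one gsub() call replaces the nested pair
year <- as.numeric(gsub("[()]", "", raw_year))
year

# A cheap sanity check before going further: no value failed to parse
stopifnot(!anyNA(year))
```

A check like the last line is worth keeping in a scraping script, since a change in the page layout tends to show up first as NA values or an empty vector.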
We can load in the Plex library so that we can match them up, but we don’t need all the data.
libfile <- file.choose()
libdf <- read.delim(libfile, sep = "|")
# we only need title and year
pms <- libdf %>% select(Title, Year)
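To see what reading a pipe-delimited dump looks like without a real export to hand, here is a small sketch that parses a made-up sample from a string. Only Title and Year are column names taken from the script; the Studio column and its values are hypothetical, and a real WebTools-NG export will have its own set of columns.

```r
# A made-up sample standing in for the exported file
sample_txt <- "Title|Year|Studio
Alien|1979|Example Studio A
A Clockwork Orange|1972|Example Studio B"

# text = lets read.delim() parse a string instead of a file on disk
libdf <- read.delim(text = sample_txt, sep = "|")

# Keep only the two columns needed for matching (base R equivalent of select())
pms <- libdf[, c("Title", "Year")]
str(pms)
```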
Now we have two data frames to perform the matching.
The first issue is that we can’t simply use the titles for matching because remakes and different versions of movies will cause a mismatch. To get around this we can use Title and Year as a combination for fuzzy matching.
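A tiny illustration of why the year matters, using two films that share a title. The 2019 remake of Aladdin is my own example and the data frame is a toy, not the real data.

```r
# Two films sharing a title; only the year tells them apart
films <- data.frame(Title = c("Aladdin", "Aladdin"),
                    Year  = c(1992, 2019))

# Title alone is ambiguous
any(duplicated(films$Title))

# Title plus Year gives a unique key
films$titleyear <- paste(films$Title, films$Year)
films$titleyear
any(duplicated(films$titleyear))
```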
# to match movies, we need Title-Year combination
imdb$titleyear <- paste(imdb$Title, imdb$Year)
pms$titleyear <- paste(pms$Title, pms$Year)

# fuzzy matching
match <- stringdist_join(imdb, pms,
                         by = 'titleyear',
                         mode = 'left',
                         method = "jw",
                         max_dist = 99, # could set this a lot lower
                         distance_col = 'dist') %>%
  group_by(titleyear.x) %>%
  slice_min(order_by = dist, n = 1) # gives the best match found
Fuzzy matching is needed because a simple string comparison will get derailed pretty easily by capitalisation and other minor issues. So we need something a little more forgiving to do the matching.
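To make "more forgiving" concrete, here is a short sketch comparing one IMDB key against a few candidate keys, first with exact comparison and then with the Jaro-Winkler distance from the stringdist package (which fuzzyjoin uses for its distance calculations). The candidate strings are chosen by me for illustration; the first pair is the same one as the A Clockwork Orange row in the output further down.

```r
library(stringdist)

imdb_key <- "A Clockwork Orange 1971"

# Candidate keys as they might appear in the Plex library
candidates <- c("A Clockwork Orange 1972",  # same film, year off by one
                "a clockwork orange 1971",  # same film, different capitalisation
                "Cats 2019")                # unrelated film

# Exact comparison rejects all three
imdb_key == candidates

# Jaro-Winkler distance: 0 is identical, values near 1 are very different
d <- stringdist(imdb_key, candidates, method = "jw")
round(d, 3)
```

The near-miss on the year should come out with a much smaller distance than the unrelated film, which is what makes a distance threshold workable.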
Now we can have a look at the matches by typing match
> # have a look at matches
> match
# A tibble: 254 × 8
# Groups:   titleyear.x [250]
   Title.x               Year.x Rating titleyear.x                Title.y                   Year.y titleyear.y           dist
   <chr>                  <dbl>  <dbl> <chr>                      <chr>                     <chr>  <chr>                <dbl>
 1 12 Angry Men            1957    9   12 Angry Men 1957          12 Monkeys                1995   12 Monkeys 1995     0.282
 2 12 Years a Slave        2013    8.1 12 Years a Slave 2013      Oz the Great and Powerful 2013   Oz the Great and P… 0.348
 3 1917                    2019    8.2 1917 2019                  Cats                      2019   Cats 2019           0.296
 4 2001: A Space Odyssey   1968    8.3 2001: A Space Odyssey 1968 2001: A Space Odyssey     1968   2001: A Space Odys… 0
 5 3 Idiots                2009    8.3 3 Idiots 2009              The Incredibles           2004   The Incredibles 20… 0.286
 6 A Beautiful Mind        2001    8.2 A Beautiful Mind 2001      Beautiful Noise           2014   Beautiful Noise 20… 0.312
 7 A Clockwork Orange      1971    8.2 A Clockwork Orange 1971    A Clockwork Orange        1972   A Clockwork Orange… 0.0290
 8 A Separation            2011    8.2 A Separation 2011          Separado!                 2010   Separado! 2010      0.189
 9 Aladdin                 1992    8   Aladdin 1992               Aladdin                   1992   Aladdin 1992        0
10 Alien                   1979    8.4 Alien 1979                 Alien                     1979   Alien 1979          0
# 244 more rows
# Use `print(n = ...)` to see more rows
We have several perfect matches in the first 10 rows; these have a distance of 0. There are some less-good-but-still-valid matches, such as A Clockwork Orange, where the year differs between IMDB and Plex (distance 0.029). Then there are a bunch of clear “not matched” movies, e.g. 12 Angry Men and 12 Years a Slave, where the nearest Plex entry is a different film. Judging by these rows, a distance of 0.1 or more means the match is not a true one.
Note that it says there are 244 more rows and shows us 10 (a total of 254 when we should have only 250). The 4 extra matches are duplicates caused by a same-distance match to two different movies in the Plex library. Let’s get rid of them and then figure out our totals.
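The extra rows arise because slice_min() keeps every row tied for the minimum by default. A small sketch on toy data (the letters are placeholders, not real titles) shows the effect, and also that with_ties = FALSE is one alternative to removing the duplicates afterwards. I have not run this variant against the real library, so treat it as an option rather than a drop-in replacement.

```r
library(dplyr)

# Toy stand-in for the join result: film "A" ties between two library entries
toy <- data.frame(titleyear.x = c("A", "A", "B"),
                  titleyear.y = c("P", "Q", "R"),
                  dist        = c(0.30, 0.30, 0.00))

# Default behaviour keeps every row tied for the minimum, so "A" appears twice
ties_kept <- toy %>%
  group_by(titleyear.x) %>%
  slice_min(order_by = dist, n = 1)

# with_ties = FALSE keeps exactly one row per group
one_each <- toy %>%
  group_by(titleyear.x) %>%
  slice_min(order_by = dist, n = 1, with_ties = FALSE)

nrow(ties_kept)
nrow(one_each)
```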
# remove duplicate fuzzy match fails
match <- match[!duplicated(match[ , "titleyear.x"]), ] # leaves 250 rows

# matches with distance of 0.1 or more are not a match
# these are the movies we want to look at
match <- match %>% filter(dist >= 0.1) # leaves 190 rows
So I have 60 of the IMDB’s Top 250 Movies. This is not very high. In my defence, I am not a movie buff and my movie collection is not particularly huge.
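The 60 is simply the 250 films in the chart minus the 190 left after filtering. As a minimal sketch of getting both totals in one go, the rows can be labelled instead of dropped. The distances below are the ten values from the printed rows earlier, so the counts apply to those ten films only; the variable names are my own.

```r
# The ten distances from the rows printed earlier, standing in for match$dist
dist <- c(0.282, 0.348, 0.296, 0, 0.286, 0.312, 0.0290, 0.189, 0, 0)

threshold <- 0.1  # the cut-off used in the post

# Label each film instead of dropping rows, so both totals stay available
status <- ifelse(dist < threshold, "have", "missing")
table(status)
```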
So what are those movies that I am missing? Let’s sort them by rating, highest first, and figure out what I should add with some urgency!
match <- match[order(-match$Rating), ]
# write file
lapply(match$titleyear.x, write, "Output/Data/imdb.txt", append = TRUE)
The Shawshank Redemption | 1994 | 9.2 |
12 Angry Men | 1957 | 9 |
The Dark Knight | 2008 | 9 |
Schindler’s List | 1993 | 8.9 |
The Good, the Bad and the Ugly | 1966 | 8.8 |
Fight Club | 1999 | 8.7 |
Inception | 2010 | 8.7 |
Interstellar | 2014 | 8.6 |
It’s a Wonderful Life | 1946 | 8.6 |
Life Is Beautiful | 1997 | 8.6 |
I have at least seen some of those films at some point in the past.
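The script above writes only the titleyear.x value for each film, whereas the listing shows title, year and rating separated by pipes. I don't know how the listing itself was produced; as one possible sketch, the three columns can be pasted together and written out in one call. The data frame here is a two-row toy, and the temporary file is a stand-in for the Output/Data/imdb.txt path.

```r
# Toy stand-in for the sorted 'match' data frame
todo <- data.frame(Title.x = c("The Shawshank Redemption", "12 Angry Men"),
                   Year.x  = c(1994, 1957),
                   Rating  = c(9.2, 9))

# One line per film in the form: Title | Year | Rating |
out_lines <- paste0(paste(todo$Title.x, todo$Year.x, todo$Rating, sep = " | "), " |")

# Temporary file used here so the sketch runs anywhere
out_file <- tempfile(fileext = ".txt")
writeLines(out_lines, out_file)
readLines(out_file)
```

Unlike write() with append = TRUE, writeLines() overwrites the file, so re-running the script does not pile up duplicate lines.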
—
The post title comes from “Yet Another Movie” by Pink Floyd from “A Momentary Lapse of Reason”.