IMDb datasets: 3 centuries of movie rankings visualized
The Question
I am a sucker for IMDb ratings, so don’t judge me. They are my priors before watching almost anything on a screen (a home screen, that is). But across movies (feature films), TV movies, and TV (mini) series, IMDb ratings are highly inconsistent. For example, the series The Boys has a rating of 8.7, and so does Martin Scorsese’s movie Goodfellas. Does it make sense that The Boys ranks as high as #16 among all titles in the IMDb database (counting those with at least 25,000 user votes)? In other words, are these ratings apples and oranges, and if so, by how much?
Rating Distributions
To start, I downloaded the IMDb datasets (here). Let’s look at the distributions of title ratings by type, namely movie (i.e. feature film), TV movie, TV mini series, and TV series, split between fiction and documentaries:
Title ratings drift towards higher values depending on title type (shown on the right), in this order: movie, TV movie, TV mini series, and TV series. So ratings of movies and TV series do come from different distributions and represent different things, like apples and oranges. But how different are they? (We will focus on fiction titles only from this point on.)
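For readers who want to reproduce the grouping behind these distributions, here is a minimal sketch, not the code used for the charts. It assumes the two dataset files title.basics.tsv.gz and title.ratings.tsv.gz with their documented columns (tconst, titleType, genres, averageRating, numVotes). The function name, the 1,000-vote default, and the rule "any title with the Documentary genre counts as a documentary" are my own assumptions.

```python
import csv
from collections import defaultdict

# The four title types compared in this post, as spelled in title.basics.
TITLE_TYPES = ("movie", "tvMovie", "tvMiniSeries", "tvSeries")


def ratings_by_type(basics_lines, ratings_lines, min_votes=1000):
    """Join basics and ratings on tconst; group ratings by (type, kind).

    Both arguments are iterables of TSV lines (open files work too).
    min_votes is an assumed threshold, not a value confirmed for the charts.
    """
    rated = {}
    # QUOTE_NONE: titles may contain quote characters that are not CSV quoting.
    for row in csv.DictReader(ratings_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if int(row["numVotes"]) >= min_votes:
            rated[row["tconst"]] = float(row["averageRating"])

    groups = defaultdict(list)
    for row in csv.DictReader(basics_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row["titleType"] in TITLE_TYPES and row["tconst"] in rated:
            # genres is a comma-separated list; a missing value (\N) falls under fiction.
            genres = row["genres"].split(",")
            kind = "documentary" if "Documentary" in genres else "fiction"
            groups[(row["titleType"], kind)].append(rated[row["tconst"]])
    return dict(groups)


# Usage with the downloaded files (paths are assumptions):
# import gzip
# with gzip.open("title.basics.tsv.gz", "rt", encoding="utf-8", newline="") as b, \
#      gzip.open("title.ratings.tsv.gz", "rt", encoding="utf-8", newline="") as r:
#     groups = ratings_by_type(b, r)
```

Each list in the result can then be fed to any histogram or density plot to compare title types side by side.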
Percentiles
If a title has the all-time best rating, then no doubt it’s worth a try (say, among titles with at least 1,000 votes; the number of votes is an important consideration, but we let it slide here and may come back to it later). Why? Because 100% of titles are rated below it or, at best, the same, which indicates exceptional quality. In statistics such a rating has a name: the 100th percentile. Following the same logic, the 99th percentile is the rating at or above 99% of all titles in the database (again, don’t forget the minimum threshold on the number of votes).
Based on the above, we can assign IMDb titles to groups by the highest percentile they reach: the 99th percentile suggests that the title is among the very best, the 95th excellent, the 90th very good, the 75th good, the 50th average, and the 25th bad. Feel free to choose and name percentiles differently in your own analysis, but we stick with this convention for this post. The last piece of the puzzle is to take percentiles not across the whole IMDb set but for each title type separately, and then compare them:
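The tier assignment described above can be sketched in a few lines. This is an illustration under assumptions, not the post's original code: the function names are made up, the "inclusive" quantile method is one of several reasonable choices, and the label for ratings under the 25th percentile is my own invention since the post does not name that group.

```python
from statistics import quantiles

# Percentile tiers used in this post, highest first.
TIERS = [(99, "very best"), (95, "excellent"), (90, "very good"),
         (75, "good"), (50, "average"), (25, "bad")]


def tier_cutoffs(ratings):
    """Return {percentile: rating cutoff} for the ratings of ONE title type."""
    # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = quantiles(ratings, n=100, method="inclusive")
    return {p: cuts[p - 1] for p, _ in TIERS}


def tier(rating, cutoffs):
    """Name of the highest tier whose cutoff the rating reaches."""
    for p, name in TIERS:
        if rating >= cutoffs[p]:
            return name
    return "below bad"  # hypothetical label, not from the post
```

Calling tier_cutoffs once per title type (movie, TV movie, and so on) and comparing the resulting cutoffs is what makes ratings comparable across types.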
Going back to our example, 8.7 among TV series places The Boys firmly in “Excellent” (95th percentile), while Goodfellas at 8.7 sits at the top of “Very Best” (99th percentile) among movies: a noticeable difference between the two.
The difference becomes even more meaningful in the lower tiers, “Very Good” (90th percentile) and below: while a rating of 7.6 suffices for a movie (e.g. Love Actually) to place in “Very Good”, a TV series must reach 8.4 to qualify for the same 90th percentile. In fact, a TV series with a 7.6 rating (like Grey’s Anatomy) places just above the “Average” 50th percentile. Furthermore, a rating of 8 would place a movie firmly in the top 5%, while the same 8 for a TV series barely cracks the top 25%.
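The same comparison can be run in the other direction: given one rating, ask what share of titles of each type it matches or beats. A minimal sketch follows, with a hypothetical function name and toy numbers that are made up for illustration (they are not IMDb data).

```python
from bisect import bisect_right


def percentile_rank(rating, ratings):
    """Percent of titles in `ratings` that are rated at or below `rating`."""
    ordered = sorted(ratings)
    return 100.0 * bisect_right(ordered, rating) / len(ordered)


# Toy data only: the same rating of 8 lands very differently in each group.
toy_movies = [5, 6, 7, 8]
toy_series = [7, 8, 9, 9]
movie_rank = percentile_rank(8, toy_movies)   # 100.0
series_rank = percentile_rank(8, toy_series)  # 50.0
```

Counting ties as "at or below" matches the definition used earlier, where the best rating sits at the 100th percentile.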
Percentiles Extra
Comparing and analyzing ratings between title types can be helped by organizing and visualizing the same percentile data in a few different ways:
- Overlapping bar charts by title types:
- Line chart by title types:
- Line chart by percentiles:
What About Documentaries?
The title percentiles above excluded documentaries. To compare ratings between fiction and documentary titles, the following visual breaks down rating percentiles for fiction and documentaries by title type:
For whatever reason, IMDb users rate documentaries more generously than their fiction counterparts across all title types.
Historical Perspective Mixed with Film Trivia
The oldest film on IMDb is Passage de Venus, made in 1874. It is rated 6.9 with 1,282 votes (as of January 2020) and is filed under title type short and genre Documentary. In chronological order it is followed by 2 titles in 1878 (the short animation Le singe musicien and the short documentary Sallie Gardner at a Gallop), 1 in 1881 (the short documentary Athlete Swinging a Pick), 1 in 1883 (the short documentary Buffalo Running), and 1 in 1885 (the short animation L’homme machine). Starting with 1887, which cranked out 45 titles in total, there are no more gap years, but that production feast would be surpassed only in 1894, with 97 titles. The first title of type movie (and the only one that year), Reproduction of the Corbett and Fitzsimmons Fight, was filmed in 1897 under the Documentary, News, and Sport genres. Lastly, the first year in which the total number of titles exceeded the year’s own numerical value is 1952, with 2,059 shorts, movies, etc. under its belt. Did I just say lastly? One more factoid, if you’ll excuse me: production in 2020 (35,109 titles in total) set us back exactly 10 years, to the level of 2010, when 35,062 titles were produced, while the absolute record belongs to 2017 with 51,231 films in total.
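Counts like these come from tallying titles by year. Here is a minimal sketch, assuming the startYear column of title.basics is used as the production year and that missing years appear as the literal marker \N. Both function names are hypothetical.

```python
from collections import Counter


def titles_per_year(start_years):
    """Count titles per year from startYear strings, skipping missing values."""
    # IMDb marks missing values with a backslash followed by N.
    return Counter(int(y) for y in start_years if y != "\\N")


def first_year_exceeding_itself(counts):
    """Earliest year whose title count is larger than the year number itself."""
    for year in sorted(counts):
        if counts[year] > year:
            return year
    return None
```

The resulting counter, sorted by year, is also all that is needed to plot production over time.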
What about visualizing film production over time?
Final Thoughts
The IMDb dataset turned out to be richer and deeper than I expected, and I have only scratched the surface. There is plenty to play with: genres, runtimes, adult movies (yes, probably for compliance, IMDb flags each title as adult or not), and, of course, ratings. IMDb uses an adjusted (weighted) rating formula, based on averages and the number of user votes, in its rankings (see Weighted Average Ratings), so the title averageRating we looked at can’t be taken at face value after all.
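To give a feel for how a vote-weighted adjustment behaves, here is a sketch of a Bayesian-style shrinkage formula of the kind often cited in connection with IMDb's Top 250 list. Treat it as an illustration only: I am not claiming this is the formula IMDb applies today, and the function and parameter names are my own.

```python
def weighted_rating(avg_rating, num_votes, min_votes, global_mean):
    """Shrink a title's average rating towards the global mean.

    WR = v / (v + m) * R + m / (v + m) * C, where R is the title's average,
    v its vote count, m an assumed vote threshold, and C the mean rating
    over all titles. Few votes pull WR towards C; many votes pull it to R.
    """
    v, m = num_votes, min_votes
    return (v / (v + m)) * avg_rating + (m / (v + m)) * global_mean
```

With as many votes as the threshold, a title ends up exactly halfway between its own average and the global mean, which shows why two titles with the same raw average can rank differently.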