24 Days of R: Day 10
How often is someone nominated for an Academy Award? Who has been nominated most often? Is there a difference between leading and supporting roles? Important questions. To answer them, I'm using a list of Academy Award nominees and winners. I obtained the data from aggdata.com, which offers a few free data sets. We'll open the file, do some basic cleanup, and then look at the results for Michael Caine. Note that these results only run through 2010.
dfAwards = read.csv("./Data/academy_awards.csv", stringsAsFactors = FALSE)
dfAwards = dfAwards[, 1:5]
dfAwards$Year = as.numeric(substr(dfAwards$Year, 1, 4))
colnames(dfAwards) = gsub(".", "", colnames(dfAwards), fixed = TRUE)
dfAwards$Won = dfAwards$Won == "YES"
dfCaine = subset(dfAwards, Nominee == "Michael Caine")
row.names(dfCaine) = NULL
FirstNominated = min(dfCaine$Year)
FirstWon = min(dfCaine$Year[dfCaine$Won == TRUE])
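The code above stops at `FirstNominated` and `FirstWon`. As a minimal sketch, the nomination count, win count, and wait for a first win could be derived like this. The data frame here is typed in by hand as a stand-in for the `dfCaine` subset (it is an assumption, not read from the aggdata file), and the names `nNominations`, `nWins`, and `yearsToFirstWin` are hypothetical.

```r
# Hand-typed stand-in for the dfCaine subset (assumed rows, not the real file).
dfCaine <- data.frame(
  Year = c(1966, 1972, 1983, 1986, 1999, 2002),
  Won  = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# One row per nomination, so the row count is the number of nominations.
nNominations <- nrow(dfCaine)

# Won is logical, so summing it counts the TRUE values.
nWins <- sum(dfCaine$Won)

# Gap between the first nomination and the first win.
FirstNominated  <- min(dfCaine$Year)
FirstWon        <- min(dfCaine$Year[dfCaine$Won])
yearsToFirstWin <- FirstWon - FirstNominated

print(c(nominations = nNominations, wins = nWins, wait = yearsToFirstWin))
```

If a nominee has never won, `min()` on the empty subset returns `Inf` with a warning, so a real analysis would probably want to guard that case.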
Michael Caine has been nominated 6 times and has won 2 times. It took 20 years for him to win his first award. That's a long time. My guess is that actors, more than actresses, receive multiple nominations and receive them over a longer period of time. I'll split the data into actor and actress categories to test this.
dfAwards$Gender = "Other"
dfAwards$Gender[grepl("Actor", dfAwards$Category)] = "Actor"
dfAwards$Gender[grepl("Actress", dfAwards$Category)] = "Actress"
dfActors = subset(dfAwards, Gender != "Other")
row.names(dfActors) = NULL
library(plyr)
plyActor = ddply(dfActors, .(Nominee, Gender), summarize,
                 FirstNominated = min(Year),
                 NumberNominated = length(Year),
                 LastNominated = max(Year))
plyActor$Span = plyActor$LastNominated - plyActor$FirstNominated
row.names(plyActor) = NULL
meanActor = mean(plyActor$Span[plyActor$Gender == "Actor"])
meanActress = mean(plyActor$Span[plyActor$Gender == "Actress"])
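The classify-then-summarize step above can be sketched on toy data without plyr. This is only an illustration: the nominees "A", "B", and "C" are made up, the `Category` labels are assumed rather than copied from the real file, and base R's `aggregate()` and `tapply()` are swapped in for `ddply()`.

```r
# Toy rows; nominee names and Category labels are assumptions for illustration.
dfToy <- data.frame(
  Nominee  = c("A", "A", "A", "B", "B", "C"),
  Category = c("Actor -- Leading Role", "Actor -- Supporting Role",
               "Actor -- Leading Role", "Actress -- Leading Role",
               "Actress -- Supporting Role", "Directing"),
  Year     = c(1970, 1985, 1990, 1980, 1984, 1975),
  stringsAsFactors = FALSE
)

# "Actress" does not contain the substring "Actor", so the two patterns
# never match the same row and the assignment order does not matter.
dfToy$Gender <- "Other"
dfToy$Gender[grepl("Actor", dfToy$Category)]   <- "Actor"
dfToy$Gender[grepl("Actress", dfToy$Category)] <- "Actress"
dfToyActors <- subset(dfToy, Gender != "Other")

# Span per nominee: last nomination year minus first nomination year.
spanByNominee <- aggregate(Year ~ Nominee + Gender, data = dfToyActors,
                           FUN = function(y) max(y) - min(y))
names(spanByNominee)[3] <- "Span"

# Mean span within each gender group.
meanSpan <- tapply(spanByNominee$Span, spanByNominee$Gender, mean)
print(meanSpan)
```

Someone nominated only once gets a span of 0, which pulls the group means down. That is worth keeping in mind when reading the comparison.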
We see that the mean length of time between first and last nomination is fairly comparable. Men have a slightly longer span, but only just. A box plot of the span looks like this:
library(ggplot2)
ggplot(plyActor, aes(factor(Gender), Span)) + geom_boxplot()
We'll do the same for number of nominations. It's a similar window into the potential longevity of someone's career, or the degree to which someone commands attention.
actorNominees = mean(plyActor$NumberNominated[plyActor$Gender == "Actor"])
actressNominees = mean(plyActor$NumberNominated[plyActor$Gender == "Actress"])
ggplot(plyActor, aes(factor(Gender), NumberNominated)) + geom_boxplot()
Out of curiosity, just who are those individuals whose career spans exceed 40 years? And which people have been nominated 10 or more times?
plyActor[plyActor$Span > 40, ]
##               Nominee  Gender FirstNominated NumberNominated LastNominated
## 321       Henry Fonda   Actor           1940               2          1981
## 455    Julie Christie Actress           1965               4          2007
## 466 Katharine Hepburn Actress           1932              12          1981
## 655       Paul Newman   Actor           1958               9          2002
## 671     Peter O'Toole   Actor           1962               8          2006
##     Span
## 321   41
## 455   42
## 466   49
## 655   44
## 671   44

plyActor[plyActor$NumberNominated >= 10, ]
##               Nominee  Gender FirstNominated NumberNominated LastNominated
## 77        Bette Davis Actress           1934              11          1962
## 345    Jack Nicholson   Actor           1969              12          2002
## 466 Katharine Hepburn Actress           1932              12          1981
## 594      Meryl Streep Actress           1978              16          2009
##     Span
## 77    28
## 345   33
## 466   49
## 594   31
OK, I could see that. Katharine Hepburn, Paul Newman, Julie Christie, Bette Davis. A superficial look suggests that career longevity may not differ much by gender. Mind, I'd love to have more data to explore this further. In the meantime, I think I'm going to go watch “On Golden Pond”. I saw it when it first came out and it was clearly one hell of a movie for older performers.
Tomorrow: Unsure what will be covered. I'm going to a PostgreSQL meetup, so possibly that.
sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] knitr_1.4.1      RWordPress_0.2-3 ggplot2_0.9.3.1  plyr_1.8
##
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-3   dichromat_2.0-0    digest_0.6.3
##  [4] evaluate_0.4.7     formatR_0.9        grid_3.0.2
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-29
## [10] munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5
## [13] RCurl_1.95-4.1     reshape2_1.2.2     scales_0.2.3
## [16] stringr_0.6.2      tools_3.0.2        XML_3.98-1.1
## [19] XMLRPC_0.3-0