
Warrior Zombies from Outer Space II: Mayhem Unleashed

[This article was first published on Hot Damn, Data!, and kindly contributed to R-bloggers.]

Given the speed at which I consume them, it’s only justified that the first post on this blog is about movies. (Although, by that logic, it could have equally well been about sandwiches, Nutella, or tissue paper. Note to self: Look for a Nutella consumption dataset.)
Anyway, this post is about movie taglines – specifically, the words that constitute them.
The data is pretty much there for the picking – IMDb hosts a number of freely available1 datasets, and one of them is about taglines.

The data is in an odd format, but at least it’s all available in one place. After the coding equivalent of jamming the fork into the toaster and jerking it around until something pops, I have the data in a usable structure.
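The post doesn't show the raw layout, so here is only a minimal sketch of the fork-in-the-toaster step in base R. It assumes a hypothetical format in which each title line starts with `# ` and ends with a year in parentheses, and the taglines follow on tab-indented lines. The function name `parse_taglines` and the sample lines are made up for illustration, and the real file may well differ.

```r
# Assumed (hypothetical) raw layout, not taken from the post:
#   # Movie Title (1999)
#   <tab>First tagline
#   <tab>Second tagline
parse_taglines <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]      # drop blank separator lines
  is_title <- startsWith(lines, "# ")        # a title line opens a new record
  record_id <- cumsum(is_title)              # each line gets the index of the last title seen
  keep <- !is_title & record_id > 0          # tagline lines that belong to some title
  titles <- sub("^# ", "", lines[is_title])
  title <- titles[record_id[keep]]
  data.frame(
    title = title,
    # pull the four-digit year out of the trailing "(1999)"
    year = as.integer(sub(".*\\(([0-9]{4})[^)]*\\).*", "\\1", title)),
    tagline = trimws(lines[keep]),
    stringsAsFactors = FALSE
  )
}

# Made-up sample input
raw <- c(
  "# Alpha (1954)",
  "\tA story of love.",
  "\tLove conquers all.",
  "",
  "# Beta (2003)",
  "\tSometimes family is everything."
)
taglines <- parse_taglines(raw)
print(taglines)
```

One row per tagline (rather than one row per movie) keeps the later word counts simple, at the cost of repeating the title and year.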

Once here, R’s tm package makes quick work of the word frequency analysis, and I have derived a dataset with common words and their frequencies in movie taglines. After removing some highly frequent English words (articles, pronouns, some prepositions, etc.), here’s a list of the most used words in movie taglines, ordered by frequency:

love, life, story, world, time, film, comedy, death, woman, don’t

Not many surprises there – until we look at the fraction of taglines these terms occur in:

love     7.5%
life     6.0%
story    5.0%
world    3.8%
time     3.1%
film     2.3%
comedy   2.2%
death    2.1%
woman    2.1%
dont     1.9%


These numbers are way higher than I expected. ‘Love’ alone occurs in a whopping 7.5% of all movie taglines!
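For the curious, the two numbers above (how often a word occurs, and in what fraction of taglines it occurs) can be sketched in a few lines. This is a base R stand-in rather than the tm pipeline the post actually used; the sample taglines and the tiny stop-word list are made up for illustration.

```r
# Made-up sample taglines
taglines <- c(
  "A story of love and life.",
  "Love is the story of the world.",
  "Don't look back.",
  "The time of your life!"
)

# Tiny illustrative stop-word list (assumption); a real list is much longer
stop_words <- c("a", "and", "is", "of", "the", "your")

tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z ]", "", text)    # strip punctuation, so "don't" becomes "dont"
  words <- strsplit(text, " +")[[1]]
  words[nzchar(words) & !(words %in% stop_words)]
}

tokens <- lapply(taglines, tokenize)

# Total number of occurrences of each term across all taglines
term_freq <- sort(table(unlist(tokens)), decreasing = TRUE)

# Fraction of taglines in which each term appears at least once
doc_freq <- table(unlist(lapply(tokens, unique)))
doc_fraction <- sort(doc_freq / length(taglines), decreasing = TRUE)

print(term_freq)
print(round(100 * doc_fraction, 1))    # as percentages
```

The `unique` call is what separates the two measures: a word repeated within one tagline adds to its total count but still counts as a single tagline.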
Here’s a visual representation of the words you’d have seen most often in movie taglines (the size of each word is proportional to the frequency of its occurrence).
[Full size image on imgur]

Yes Hollywood, we see right through you.
The R code to parse the data and make the word cloud is available on github if you’re interested.

I’m kinda keen to know if this trend has been constant through the years. Let’s do the same thing, except looking at the taglines decade by decade. Here’s the list of top 10 words in movie taglines from each decade2 – from the forties to the teens (Teens? Onesies? I like ‘onesies’).



1940s      1950s      1960s      1970s      1980s      1990s      2000s      2010s
action     story      love       love       love       love       love       love
love       love       story      story      story      life       life       life
story      world      world      film       life       story      story      story
thrills    terror     picture    time       time       time       world      world
adventure  adventure  film       world      hes        world      time       sometimes
romance    woman      woman      life       world      comedy     sometimes  dont
gun        screen     adventure  movie      comedy     hes        film       time
west       picture    life       death      adventure  murder     dont       film
thrill     gun        time       terror     terror     film       comedy     family
screen     girl       motion     hes        movie      dont       family     cant

Again, I think a visual representation might come in handy.

[Full size image on imgur]



‘story’ and ‘love’ are part of the Top 10 list in each decade, but the other words are distinctly symbolic of the movies of each era:

  • The 40s are the years of ‘action’, ‘adventure’, ‘thrills’ and ‘west’.
  • The 50s go slightly more romantic, and scale stuff up, adding ‘woman’, ‘girl’, ‘world’ and ‘terror’.
  • In the 60s, ‘girl’ is out, but ‘woman’ is still in. No more ‘gun’ and ‘terror’; instead, it’s about ‘life’ and ‘time’, both of which are here to stay.
  • ‘terror’ makes a comeback in the 70s; ‘adventure’ goes out. And ‘death’ is explored.
  • In the 80s, ‘comedy’ makes the list for the first time.
  • The 90s are the only time ‘murder’ was cool.
  • The 00s (I like to call these the noughties) and (the early part of) the 10s show a distinct change in the values that sell: ‘family’, ‘sometimes’ and ‘cant’ are popular.

I’ve put up word clouds for each decade on imgur, and the code to generate them on github.
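The decade-by-decade breakdown boils down to bucketing each movie by decade and repeating the word count within each bucket. Here is a minimal base R sketch of that idea; the `movies` data frame, the helper `top_words`, and the short stop-word list are all invented for illustration and are not the code linked above.

```r
# Toy (year, tagline) pairs standing in for the real data
movies <- data.frame(
  year = c(1947, 1948, 1985, 1989, 2012),
  tagline = c(
    "Action and thrills",
    "Action in the west",
    "A comedy about love",
    "Love hurts",
    "Sometimes love wins"
  ),
  stringsAsFactors = FALSE
)

# 1947 -> "1940s", 2012 -> "2010s"
movies$decade <- paste0(floor(movies$year / 10) * 10, "s")

# Hypothetical helper: the n most frequent non-stop-words in a set of taglines
top_words <- function(texts, n = 10,
                      stop_words = c("a", "about", "and", "in", "the")) {
  words <- unlist(strsplit(gsub("[^a-z ]", "", tolower(texts)), " +"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  tab <- sort(table(words), decreasing = TRUE)
  names(tab)[seq_len(min(n, length(tab)))]
}

# One character vector of top words per decade
top_by_decade <- lapply(split(movies$tagline, movies$decade), top_words, n = 3)
print(top_by_decade)
```

Words with equal counts come out in whatever order `sort` leaves them, so ranks within a tie shouldn't be read too closely.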

So that’s that for frequent words. But I’m also after words that are frequent exclusively in high (or low) rated movies. Or to look at it another way, words that, in retrospect, are indicative of the movie’s success.

One way of doing this is to segment the data into different parts by performance, and do the same analysis as above. But the prior frequencies will likely dominate these lists. What I really want is words whose presence (or absence) is highly indicative of the movie’s rating.

NOTE: Some math to follow. If you’re uncomfortable with arithmetic and/or statistics, skip a couple of paragraphs.

For a given term, if D1 is the distribution of movie ratings with the term present in the tagline, and D2 is the distribution of movie ratings with the term absent from the tagline, I’m going to define my divergence/separation metric as3,4:

$$\text{Divergence} = \frac{\left(\operatorname{mean}(D_1) - \operatorname{mean}(D_2)\right)^2}{\operatorname{sd}(D_1) + \operatorname{sd}(D_2)}$$

<Obligatory CORRELATION DOES NOT IMPLY CAUSATION warning>

Adding such words will not automatically make your movie successful – this is offered as a post-event descriptive analysis, not a predictive one. I’m not implying any causality here.

</warning>

This divergence is just a magnitude: it says how strongly a term separates the ratings, not in which direction. So I had to split the most divergent terms into a ‘good movie’ keywords list and a ‘bad movie’ keywords list.
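As a rough sketch, the metric (squared difference of the two means, divided by the sum of the two standard deviations) takes only a few lines of base R. The function names and toy ratings are made up, and using the sign of the mean difference to tell ‘good’ from ‘bad’ terms is an assumption on my part; the post doesn't say how the split was actually done.

```r
# Divergence for one term: (mean(D1) - mean(D2))^2 / (sd(D1) + sd(D2))
divergence <- function(ratings, has_term) {
  d1 <- ratings[has_term]     # ratings of movies whose tagline contains the term
  d2 <- ratings[!has_term]    # ratings of movies whose tagline does not
  (mean(d1) - mean(d2))^2 / (sd(d1) + sd(d2))
}

# Assumed way to recover the direction the squared term throws away:
# +1 if the term goes with higher ratings, -1 if with lower ones
direction <- function(ratings, has_term) {
  sign(mean(ratings[has_term]) - mean(ratings[!has_term]))
}

# Toy data: six movies, the first three have the term in their tagline
ratings  <- c(8, 9, 7, 3, 4, 5)
has_term <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

print(divergence(ratings, has_term))
print(direction(ratings, has_term))
```

In the toy data the means are 8 and 4 and both standard deviations are 1, so the divergence is 16 / 2 = 8 with a positive direction. Note that `sd` returns `NA` for a group with a single movie, so very rare terms need filtering out beforehand.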
So, without further math or ado, the 10 terms that correspond to highest ratings:

animation, masterpiece, vision, magnificent, production, startling, french, glorious, smashing, grand         

And the 10 terms that correspond to the lowest ratings:

outer, zombies, ancient, woods, experiment, pray, tonight, mayhem, warrior, unleashed

Again, I’ve put up code to generate these lists on github.

If these lists make you have second thoughts about making Warrior Zombies from Outer Space II: Mayhem Unleashed, don’t be disheartened – because like I said earlier, there is certainly a correlation, but it’s not necessarily a causal relationship5. And hey, I know a bunch of people who would watch the hell out of that movie.



Footnotes
1 Going through and adhering to the legal clauses for use of the datasets is left as an exercise for the reader.
2 The punctuation has been removed from the data to make the analysis easier. So if you see “cant”, that’s probably “can’t”, and so on.
3 It is possible that a better metric might have been used, or even a simpler one, but for some reason, I went with this. Other suggestions are welcome.
4 IMDb ratings are, arguably, not the best indicators of movie success, but they’re certainly one way of estimating it, and there is probably going to be a future post analyzing how reliable a measure this is.
5 EDIT: Revisiting this, the final two lists of words don’t seem particularly robust.




To leave a comment for the author, please follow the link and comment on their blog: Hot Damn, Data!.
