Opting for shorter movies? Be aware you might be cutting the entertainment too!
[This article was first published on Exploring and experiencing analytics, and kindly contributed to R-bloggers].
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Hello Friends,
This time I thought I would bring in a little more spice and focus on movies. I don’t know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my movie repository, which now spans some TBs, I feel a little lost. Apart from a general rating or a perception, the length of the movie plays a role in the choice, for a simple reason: the movie needs to be squeezed in between other demanding priorities.
So last Saturday, when I was in the middle of this process and searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, “But the short movies are generally not so good”. I did not pay much heed to that then (don’t conclude anything from this, please), but later on I thought, hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little bit about correlation, the normal distribution, etc. I use R, but it is so simple that we could even use an Excel sheet to do the same.
Correlation:
This is an indicator whose value lies between -1 and 1, and it indicates the strength of the linear relationship between two variables. Leaving the jargon aside, in many cases we simply want to relate two features. Typical laws of physics, like the one relating speed and displacement, may show a perfect correlation, but those are not the point of interest. A more interesting question is whether there is a relation between, say:
a) IQ Score of a person and Salary drawn
b) No. of obese people in an area vis-à-vis no. of fast-food centers in the locality
c) No. of Facebook friends and relationship shelf life
d) No. of hours spent in office and attrition rate for an organization
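An underlying technicality I must point out here: the correlation coefficient can be computed for any two numeric variables, but the usual significance test for it assumes that both variables follow, at least approximately, a normal distribution.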
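To make the idea of a correlation coefficient concrete, here is a minimal sketch in R. The numbers are toy data invented purely for illustration (they are not from the movie list), and the variable names `hours` and `score` are hypothetical. It computes the Pearson correlation from its definition and compares it with the built-in `cor()`.

```r
# Toy data invented purely for illustration (not from the movie list)
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(52, 55, 61, 60, 68, 71)

# Pearson correlation from its definition: the co-variation of the two
# variables divided by the product of their individual variations
r_manual <- sum((hours - mean(hours)) * (score - mean(score))) /
  sqrt(sum((hours - mean(hours))^2) * sum((score - mean(score))^2))

# The built-in function should give the same number
r_builtin <- cor(hours, score)

print(r_manual)
print(r_builtin)
```

A value close to 1 means the two variables rise together almost along a straight line, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.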
Normal Distribution:
This is the most commonly used probability distribution: a bell-shaped curve with equal spread on both sides of the mean. Associate and manager alike, you must have heard about normalization and the bell curve when you face or do an appraisal. Many random phenomena across disciplines approximately follow a normal distribution. The image below is taken from the internet.
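As a small illustration of the "equal spread on both sides of the mean" property, the sketch below uses R's built-in `dnorm()` and `pnorm()` on the standard normal distribution (mean 0, standard deviation 1). The cut-off points chosen here are arbitrary examples.

```r
# Standard normal distribution: mean 0, standard deviation 1

# Symmetry: the density is the same at equal distances on either side of the mean
left  <- dnorm(-1.5)
right <- dnorm(1.5)

# Share of values falling within one and two standard deviations of the mean
within_1sd <- pnorm(1) - pnorm(-1)
within_2sd <- pnorm(2) - pnorm(-2)

print(c(left, right))
print(round(c(within_1sd, within_2sd), 4))
```

Roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two, which is what gives the curve its familiar bell shape.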
So I picked up the movie information from IMDB (http://www.imdb.com/), like any one of us would, and put it in a structured form like the one below. The highlighted columns may not be required at this point; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted on the same.
Name | Year of Release | Rating | Duration | Small Desc |
Skyfall | 2012 | 8.1 | 143 | Bond’s loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost. |
At this point I have taken 183 movies and stored them as a CSV file.
First things first: there are various formal ways to test whether a variable follows a normal distribution, but I would just plot histograms and see how they look. Both variables seem to follow a normal distribution closely.
Below are the commands for quick reference. What I just adore about R is its simplicity: with just a few commands we are done.
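For readers curious about one of the formal ways mentioned above, here is a hedged sketch using the Shapiro-Wilk test, `shapiro.test()`, from base R. Since the movie file is not reproduced here, the vector `duration_sim` is simulated stand-in data (an assumption for illustration only), not the real durations.

```r
set.seed(42)  # only so the illustration is reproducible

# Simulated stand-in for movie durations in minutes; NOT the real data
duration_sim <- rnorm(183, mean = 120, sd = 20)

# Shapiro-Wilk test: the null hypothesis is that the sample
# comes from a normal distribution
sw <- shapiro.test(duration_sim)

# A small p-value (say, below 0.05) would cast doubt on normality
print(sw$p.value)

# Optional visual check: points close to the line suggest approximate normality
# qqnorm(duration_sim); qqline(duration_sim)
```

On the real data, the same call could be made on the rating and duration vectors in place of `duration_sim`.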
film <- read.csv("film.csv", header = TRUE)  # Read the file into a data frame
x <- as.matrix(film)  # Convert the data frame to a matrix
y <- as.numeric(x[, 3])  # Convert the movie rating (column 3) to a numeric vector
z <- as.numeric(x[, 4])  # Convert the movie duration (column 4) to a numeric vector
hist(z, col = "green", border = "black", xlab = "Duration", ylab = "mvfreq", main = "Mv Duration Distribution", breaks = 7)
hist(y, col = "blue", border = "black", xlab = "mvRtng", ylab = "mvfreq", main = "Mv Rtng Distribution", breaks = 9)
cor(y, z)  # Calculate the correlation coefficient between rating and duration
Interestingly, the correlation turns out to be 0.48 in this case, which says there is a positive correlation between these two variables, and the correlation is not small. We can set up the hypothesis “There is no correlation”, choose a level of significance, and test the hypothesis. With 183 movies, a value of 0.48 is well beyond what chance alone would typically produce, so I am sure we would reject the hypothesis that there is no correlation.
So, one way or another, the rating tends to go up with the duration of the movie.
I leave the interpretation to you, but next time you might look at the movie duration before taking a call! Mr. Director, this might be a tip for you, who knows. And as for me, maybe wifey is always right. Maybe all that is short is not that sweet.
With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.
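The hypothesis test mentioned above can be sketched using only the two numbers reported in this post, the correlation (0.48) and the number of movies (183). This is a minimal illustration of the standard t-test for a correlation coefficient, which relies on the normality assumption discussed earlier; the names `t_stat` and `p_value` are just illustrative.

```r
r <- 0.48  # correlation reported in the post
n <- 183   # number of movies in the post

# Test statistic for the null hypothesis "there is no correlation"
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(t_stat)
print(p_value)

# With the actual rating and duration vectors available,
# cor.test(y, z) performs the same test directly
```

The test statistic comes out a little above 7, far larger than the values of around 2 that usually mark the 5% significance threshold, so the "no correlation" hypothesis would be rejected.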
To leave a comment for the author, please follow the link and comment on their blog: Exploring and experiencing analytics.