I like you and you like me…but what does it all mean? (Part 1)
Tinder is a popular matchmaking application that allows users to connect with others with whom they share a physical attraction. New members build their profile by importing their age, gender, geographic information, and photos from their Facebook account. Users are then presented with profiles that meet their search criteria and can like or dislike them. Unlike traditional online dating sites, members can only communicate with individuals with whom they share a mutual affinity (you liked them and they liked you).
Tinder is a product that offers an interesting case study for statisticians and data scientists who want to understand how human beings interact on mobile dating applications. Given that large-scale data collection is nearly impossible without a large team of interns, I decided to collect data on the profiles that were presented to me over a one-week period. My goal was to extract information on users and their profiles in order to determine whether certain people were more likely to like my Tinder profile. After a couple of days, I realized that receiving likes on Tinder was a difficult proposition, and I was forced to adjust the data in order to have a robust occurrence rate. Using Naive Bayes, I attempted to glean insights from the data I collected.
```r
> head(dat)
  Hair_Color  Race Text Pictures Age Miles_Away Shared_Interest Overweight Liked_You
1      Black White    Y        5  23      Close               0          N         N
2     Blonde White    N        4  23      Close               1          N         N
3      Black Other    Y        4  28      Close               4          N         N
4     Blonde White    Y        5  23      Close               0          N         N
5     Blonde White    N        4  21      Close               1          N         N
6   Brunette White    Y        6  23      Close               0          N         N
...
```
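If you want to put together a similar data set yourself, a minimal sketch of the preparation step might look like the following. The file name `tinder_profiles.csv` and the factor coding are my assumptions, not details from the original data collection:

```r
# A minimal sketch of assembling a data frame with the columns shown above.
# "tinder_profiles.csv" is a hypothetical file name.
dat <- read.csv("tinder_profiles.csv", stringsAsFactors = FALSE)

# Treat the categorical fields as factors; Pictures, Age, and
# Shared_Interest stay numeric.
factor_cols <- c("Hair_Color", "Race", "Text", "Miles_Away",
                 "Overweight", "Liked_You")
dat[factor_cols] <- lapply(dat[factor_cols], factor)
```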
- The Naive Bayes classifier did a surprisingly good job (60 to 70% accuracy) of predicting whether a user liked me, in both the training and test data.
- Based on the logistic regression model, the most important predictors of whether someone liked me were the number of pictures on their profile, their hair color, and their physical distance from me. The predicted probabilities of someone liking me were higher for users who had fewer pictures, were farther away, and were brunettes (see the sketch after this list).
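To make the second point concrete, here is a small sketch of pulling predicted probabilities from the fitted logistic regression (the object `mod` created in the code block further down) for made-up profiles; every value in `new_profiles` is hypothetical and only meant for illustration:

```r
# Two hypothetical profiles that differ only in picture count and hair color;
# assumes the logistic regression "mod" has already been fit (see below).
new_profiles <- data.frame(
  Hair_Color      = c("Brunette", "Blonde"),
  Text            = c("Y", "Y"),
  Pictures        = c(3, 6),
  Age             = c(25, 25),
  Miles_Away      = c("Close", "Close"),
  Shared_Interest = c(1, 1)
)

# Predicted probability that each hypothetical profile would have liked me
predict(mod, newdata = new_profiles, type = "response")
```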
Part 1 of this series is focused on providing a high-level overview of the problem and what I found. In Part 2, I’ll offer a review of Naive Bayes classification and provide a worked-out example.
```r
library(klaR)     # provides NaiveBayes(); assumed to be the package used here
library(effects)  # effect plots

# Split the data: 2/3 training, 1/3 test
train.ind <- sample(1:nrow(dat), ceiling(nrow(dat) * 2/3), replace = FALSE)

# Fit the Naive Bayes classifier on the training data
nb.res <- NaiveBayes(Liked_You ~ Hair_Color + Text + Pictures + Age + Miles_Away,
                     data = dat[train.ind, ])

# Predict on the test data and compute classification accuracy
nb.pred  <- predict(nb.res, dat[-train.ind, ])
accuracy <- table(nb.pred$class, dat[-train.ind, "Liked_You"])
sum(diag(accuracy)) / sum(accuracy)

# Logistic regression on the training data
mod <- glm(Liked_You ~ Hair_Color + Text + Pictures + Age + Miles_Away + Shared_Interest,
           data = dat[train.ind, ], family = binomial(link = "logit"))

# Effect plots for the key predictors
plot(effect("Pictures", mod), rescale.axis = FALSE)
plot(effect("Miles_Away", mod), rescale.axis = FALSE)
plot(effect("Hair_Color", mod), rescale.axis = FALSE)

# In-sample accuracy of the logistic regression at a 0.5 cutoff
fit <- fitted(mod)
accuracy <- table(fit > .5, dat[train.ind, "Liked_You"])
sum(diag(accuracy)) / sum(accuracy)
```
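Because a single random split can give a noisy accuracy estimate on a small data set, one refinement (not something done in the original analysis) is to repeat the split, refit, and average the test accuracy; a minimal sketch:

```r
# Not part of the original analysis: repeat the split/fit/score cycle
# and average the test accuracy over many random splits.
set.seed(123)  # arbitrary seed for reproducibility
acc <- replicate(50, {
  idx    <- sample(1:nrow(dat), ceiling(nrow(dat) * 2/3), replace = FALSE)
  nb_fit <- NaiveBayes(Liked_You ~ Hair_Color + Text + Pictures + Age + Miles_Away,
                       data = dat[idx, ])
  pred   <- predict(nb_fit, dat[-idx, ])
  tab    <- table(pred$class, dat[-idx, "Liked_You"])
  sum(diag(tab)) / sum(tab)
})
mean(acc)  # average test accuracy over 50 random splits
```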