Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Juan M. Lavista Ferres , Senior Director of Data Science at Microsoft
In what was one of the most viral episodes of 2017, political science Professor Robert E Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad’s interview.
The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe via Facebook, Twitter and reporters from reliable sources like Time.com thought the woman that went after the children was her nanny, when in fact, the woman in the video was Robert’s wife, Jung-a Kim, who is Korean.
The confusion over this episode caused a second viral wave calling out that people that thought she was the nanny should feel bad for being stereotypical.
We decided to embrace the uncertainty and take a data science based approach to estimating the chances that the person was the nanny or the mother of the child, based on the evidence people had from watching the news.
@David_Waddell What would that mean, please? Re-broadcasting it on BBC TV, or just here on Twitter? Is this kinda thing that goes 'viral' and gets weird?
— Robert E Kelly (@Robert_E_Kelly) March 10, 2017
What evidence did viewers of the video have?
- the person is American Caucasian
- the person is professional
- there are two kids
- the caretaker is Asian
We then look for probability values for these statistics. (Given that Professor Kelly is American, all statistics are based on US data.)
- Probability (Asian Wife | Caucasian Husband) = 1% [Married couples in the United States in 2010]
- Probability of (Household has Nanny | husband is professional) = 3.5% [The Three Faces of Work-Family Conflict, page 9, Figure 3]
- Probability of (Asian | Nanny) = 6% [Caregiver Statistics: Demographics]
- Probability of (Stay at home mom) = 14% and Probability of (Stay at home mom | Asian Wife) = 30% [Stay-at-Home Mothers by Demographic Group ]
We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.
library(bnlearn) set.seed(3) net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]") plot(net)
The last step is to fit the parameters of the Bayesian network conditional on its structure, the bn.fit
function runs the EM algorithm to learn CPT for all different nodes in the above graph.
yn <- c("yes", "no") ca <- c("caucacian","other") ao <- c("asian","other") nw <- c("nanny","wife") cptHusbandDemographics <- matrix(c(0.85, 0.15), ncol=2, dimnames=list(NULL, ca)) #[1] cptHusbandIsProfessional <- matrix(c(0.81, 0.19), ncol=2, dimnames=list(NULL, yn)) #[2] cptNannyDemographics <- matrix(c(0.06, 0.94), ncol=2, dimnames=list(NULL, ao)) # [3] cptWifeDemographics <- matrix(c(0.01, 0.99, 0.33, 0.67), ncol=2, dimnames=list("WifeDemographics"=ao, "HusbandDemographics"=ca)) #[1] cptStayAtHomeMom <- c(0.3, 0.7, 0.14, 0.86, 0.125, 0.875, 0.125, 0.875) #[4] dim(cptStayAtHomeMom) <- c(2, 2, 2) dimnames(cptStayAtHomeMom) <- list("StayAtHomeMom"=yn, "WifeDemographics"=ao, "HusbandIsProfessional"=yn) cptHouseholdHasNanny <- c(0.01, 0.99, 0.035, 0.965, 0.00134, 0.99866, 0.00134, 0.99866) #[5] dim(cptHouseholdHasNanny) <- c(2, 2, 2) dimnames(cptHouseholdHasNanny) <- list("HouseholdHasNanny"=yn, "StayAtHomeMom"=yn, "HusbandIsProfessional"=yn) cptCaretaker <- c(0.5, 0.5, 0.999, 0.001, 0.01, 0.99, 0.001, 0.999) dim(cptCaretaker) <- c(2, 2, 2) dimnames(cptCaretaker) <- list("Caretaker"=nw, "StayAtHomeMom"=yn, "HouseholdHasNanny"=yn) cptCaretakerEthnicity <- c(0.99, 0.01, 0.99, 0.01, 0.99, 0.01, 0.01, 0.99, 0.01,0.99,0.99,0.01,0.01,0.99,0.01,0.99) dim(cptCaretakerEthnicity) <- c(2, 2, 2,2) dimnames(cptCaretakerEthnicity) <- list("CaretakerEthnicity"=ao,"Caretaker"=nw, "WifeDemographics"=ao ,"NannyDemographics"=ao) net.disc <- custom.fit(net, dist=list(HusbandDemographics=cptHusbandDemographics, HusbandIsProfessional=cptHusbandIsProfessional, WifeDemographics=cptWifeDemographics, StayAtHomeMom=cptStayAtHomeMom, HouseholdHasNanny=cptHouseholdHasNanny, Caretaker=cptCaretaker, NannyDemographics=cptNannyDemographics,CaretakerEthnicity=cptCaretakerEthnicity))
Once we have the model, we can query the network using cpquery
to estimate the probability of the events and calculate the probability that the person is the nanny or the wife based on the evidence we have (husband is Caucasian and professional, caretaker is Asian). Based on this evidence the output is that the probability that she is the wife is 90% vs. 10% that she is the nanny.
probWife <- cpquery(net.disc, (Caretaker=="wife"),HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian",n=1000000) probNanny <- cpquery(net.disc, (Caretaker=="nanny"),HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian",n=1000000) [1] "The probability that the caretaker is his wife = 0.898718647764449" [1] "The probability that the caretaker is the nanny = 0.110892031547457"
In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!
The bnlearn package is available on CRAN. You can find the R code behind this post here on GitHub or here as a Jupyter Notebook.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.