Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I wanted yet another opportunity to get to use the fabulous caret package, but also to finally give plot.ly a try. To scratch both itches, I dipped into the UCI machine learning library yet again and came up with a survey data set on the topic of contraceptive choice in Indonesia. This was an interesting opportunity for me to learn about a far-off place while practicing some fun data skills.
According to recent estimates, Indonesia is home to some 250 million individuals and over the years, thanks to government intervention, has had its fertility rate slowed down from well over 5 births per woman, to a current value of under 2.4 births per woman. Despite this slow down, Indonesia is not generating enough jobs to satisfy the population. Almost a fifth of their youth labour force (aged 15-24) are unemployed (19.6%), a whole 6.3% more than a recent estimate of the youth unemployment rate in Canada (13.3%). When you’re talking about a country with 250 million individuals (with approximately 43.4 million 15-24 year olds), that’s the difference between about 5.8 million unemployed and 8.5 million unemployed teenagers/young adults. The very idea is frightening! Hence the government’s focus (decades ago and now) on promoting contraceptive method use to its population.
That is the spirit behind the survey data we’ll look at today! First, download the data from my plot.ly account and load it into R.
library(plotly) library(caret) library(dplyr) library(reshape2) library(scales) cmc = read.csv("cmc.csv", colClasses = c(Wife.Edu = "factor", Husband.Edu = "factor", Wife.Religion = "factor", Wife.Working = "factor", Husband.Occu = "factor", Std.of.Living = "factor", Media.Exposure = "factor", Contraceptive.Method.Used = "factor")) levels(cmc$Contraceptive.Method.Used) = c("None","Short-Term","Long-Term") table(cmc$Contraceptive.Method.Used) None Short-Term Long-Term 629 333 511 prop.table(table(cmc$Contraceptive.Method.Used)) None Short-Term Long-Term 0.4270197 0.2260692 0.3469111
Everything in this data set is stored as a number, so the first thing I do is to define as factors what the documentation suggests are factors. Then, we see the numeric breakdown of how many women fell into each contraceptive method category. It’s 1473 responses overall, and ‘None’ is the largest category, although it is not hugely different from the others (although a chi squared test does tell me that the proportions are significantly different from one another).
Next up, let’s set up the cross validation and training vs testing indices, then train a bunch of models, use a testing sample to compare performance, and then we’ll see some graphs
# It's training time! control = trainControl(method = "cv") in_train = createDataPartition(cmc$Contraceptive.Method.Used, p=.75, list=FALSE) contraceptive.model = train(Contraceptive.Method.Used ~ ., data=cmc, method="rf", metric="Kappa", trControl=control, subset=in_train) contraceptive.model.gbm = train(Contraceptive.Method.Used ~ ., data=cmc, method="gbm", metric="Kappa", trControl=control, subset=in_train, verbose=FALSE) contraceptive.model.svm = train(Contraceptive.Method.Used ~ ., data=cmc, method="svmRadial", metric="Kappa", preProc = c('center','scale'), trControl=control, subset=in_train) contraceptive.model.c50 = train(Contraceptive.Method.Used ~ ., data=cmc, method="C5.0", metric="Kappa", trControl=control, subset=in_train, verbose=FALSE) dec = expand.grid(.decay=c(0,.0001,.01,.05,.10)) control.mreg = trainControl(method = "cv") contraceptive.model.mreg = train(Contraceptive.Method.Used ~ ., data=cmc, method="multinom", metric="Kappa", trControl=control.mreg, tuneGrid = dec, subset=in_train, verbose=FALSE) # And now it's testing time... cmc.test = cmc[-in_train,] cmc.test = cmc.test %>% mutate( rf.class = predict(contraceptive.model, cmc.test, type="raw"), gbm.class = predict(contraceptive.model.gbm, cmc.test, type="raw"), svm.class = predict(contraceptive.model.svm, cmc.test, type="raw"), mreg.class = predict(contraceptive.model.mreg, cmc.test, type="raw"), c50.class = predict(contraceptive.model.c50, cmc.test, type="raw")) # Here I'm setting up a matrix to host some performance statistics from each of the models cmatrix.metrics = data.frame(model = c("rf", "gbm", "svm", "mreg","c50"), kappa = rep(NA,5), sensitivity1 = rep(NA,5), sensitivity2 = rep(NA,5), sensitivity3 = rep(NA,5), specificity1 = rep(NA,5), specificity2 = rep(NA,5), specificity3 = rep(NA,5)) # For each of the models, I use confusionMatrix to give me the performance statistics that I want for (i in 11:15) { cmatrix.metrics[i-10,"kappa"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$overall[2][[1]] cmatrix.metrics[i-10, "sensitivity1"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[1,1] cmatrix.metrics[i-10, "sensitivity2"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[2,1] cmatrix.metrics[i-10, "sensitivity3"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[3,1] cmatrix.metrics[i-10, "specificity1"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[1,2] cmatrix.metrics[i-10, "specificity2"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[2,2] cmatrix.metrics[i-10, "specificity3"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[3,2] } # Now I transform my cmatrix.metrics matrix into a long format suitable for ggplot, graph it, and then post it to plot.ly cmatrix.long = melt(cmatrix.metrics, id.vars=1, measure.vars=c(2:8)) ggplot(cmatrix.long, aes(x=model, y=value)) + geom_point(stat="identity", colour="blue", size=3) + facet_grid(~variable) + theme(axis.text.x=element_text(angle=90,vjust=0.5, colour="black",size=12), axis.text.y=element_text(colour="black", size=12), strip.text=element_text(size=12), axis.title.x=element_text(size=14,face="bold"), axis.title.y=element_text(size=14, face="bold")) + ggtitle("Performance Stats for Contraceptive Choice Models") + scale_x_discrete("Model") + scale_y_continuous("Value") set_credentials_file("my.username","my.password") py = plotly() py$ggplotly()
And behold the beautiful graph (after some tinkering in the plot.ly interface, of course)
Firstly, the kappa stats are telling us that after you factor into accuracy the results you would expect by chance, the models don’t add a humongous level of predictive power. So, we won’t collect our nobel prize on this one! The classes are evidently more difficult to positively identify than they are to negatively identify, as evidenced by the lower sensitivity scores than specificity scores. Interestingly, class 1 (users of NO contraceptive methods) was the most positively identifiable.
Secondly, it looks like we have a list of the top 3 performers in terms of kappa, sensitivity and specificity (in no particular order because I they don’t appear that different):
- gbm (gradient boosting machine)
- svm (support vector machine)
- c50
Now that we have a subset of models, let’s see what they have to say about variable importance. Once we see what they have to say, we’ll do some more fun graphing to see the variables in action.
gbm.imp = varImp(contraceptive.model.gbm) svm.imp = varImp(contraceptive.model.svm) c50.imp = varImp(contraceptive.model.c50)
I haven’t used ggplot here because I decided to try copy pasting the tables directly into plot.ly for the heck of it. Let’s have a look at variable importance according to the three models now:
plot(Contraceptive.Method.Used ~ Wife.Age, data=cmc, main="Contraceptive Method Used According to Age of Wife") cmethod.by.kids = melt(dcast(cmc, Num.Kids ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4), variable.name="Contraceptive.Method.Used", value.name="Num.Records") ggplot(cmethod.by.kids, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Number of Kids") ggplot(cmethod.by.kids, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Number of Kids") cmethod.by.wife.edu = melt(dcast(cmc, Wife.Edu ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4), variable.name="Contraceptive.Method.Used", value.name="Num.Records") ggplot(cmethod.by.wife.edu, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Education Level of Wife") ggplot(cmethod.by.wife.edu, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Education Level of Wife")
And now the graphs:
Okay so this isn’t a plotly graph, but I do rather like the way this mosaic plot conveys the information. The insight here is after a certain age (around age 36) the women seem to be less inclined to report the use of long-term contraceptive methods, and more inclined to report no contraceptive use at all. Could it be that the older women feel that they are done with child bearing, and so do not feel the need for contraception anymore? Perhaps someone more knowledgeable on the matter could enlighten me!
Here’s a neat one, showing us the perhaps not-so-revelatory result that as the education level of the wife increases, the likelihood that she will report no contraception diminishes. Here it is in fact the short term contraceptive methods where we see the biggest increase in likelihood as the education level increases. I don’t think you can cite education in and of itself that causes women to choose contraception, because perhaps it’s higher socio-economic status which leads them to pursue higher education which leads them to choose contraception. I’m not sure this survey will answer that question, however!
Finally, we have number of kids, which exhibits an odd relationship with contraceptive choice. It appears as though at least half of the women with 1 child reported no contraception, but that proportion goes down when you look at women with 3 children. After that, women are more and more likely to cite no contraception, likely reflecting their age.
Conclusion and observations:
As I mentioned earlier, I don’t expect to pick up any nobel prizes from this analysis. Substantively, the most interesting thing that came out of this analysis for me is that it was stage of life factors (# kids and age) in addition to the wife’s education (and possibly income, but that wasn’t measured) which formed the most predictive variables in the classification of who uses which contraceptive methods. Naively, I expected wife’s religion to be amongst the most predictive.
According to UNICEF, 6/10 drop-outs in primary school are girls. That increases to 7/10 in secondary school. Stop that trend from happening, and then who knows what improvements might result? Perhaps continuing to invest in girls’ education will help lay the foundation for the later pursuit of higher education and the slowing down of their population expansion.
Lastly, I have some comments about plotly:
- It was a big buzz kill to discover that I couldn’t embed my plotly plots into my wordpress blog. Bummer.
- The plots that did result were very nice looking!
- I found myself getting impatient figuring out where to click to customize my plots to my liking. There’s something about going from a command line to a gui which is odd for me!
- I initially thought it was cool that I could save a theme for my plot once I customized it to my liking. I was disappointed to learn that the theme was not saved as an absolute set of visual characteristics. For some reason those characteristics changed depending on the initial graph I imported from R and I could not just apply it and be done with it.
- I found myself wondering if I there was a better way of getting my R graphs into plotly than the proscribed py$ggplotly() method. It’s not my biggest complaint, but somehow I’d rather just have one command to batch upload all of my graphs.
I’ll be watching plotly as it evolves and hope to see it improve even more than it has already! Good luck!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.