Introduction to R for Data Science :: Session 8 [Appendix]
Appendix to Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression
Welcome to Introduction to R for Data Science, Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression [Web-scraping with tm.plugin.webmining. The tm package corpora structures: accessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().]
The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.
Check out the Course Overview to access the learning material presented thus far.
Data Science Serbia Course Pages [in Serbian]
Startit Course Pages [in Serbian]
Lecturers
- dipl. ing Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Data Science Serbia
- Goran S. Milovanović, PhD, DataScientist@DiploFoundation, Data Science Mentor at Springboard, Data Science Serbia
Summary of Session 8, 17. June 2016 :: Intro to Text Mining in R + Binomial Logistic Regression.
Intro to Text Mining in R + Binomial Logistic Regression: Web-scraping with tm.plugin.webmining. The tm package corpora structures: accessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().
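As a reminder of the last item in the summary, here is a minimal sketch of Maximum Likelihood Estimation for a binomial logistic regression with Nelder-Mead from optim(). It is not taken from the session script: the data are simulated, and the names negLogLik, fit and glmFit, as well as the "true" coefficients -0.5 and 1.2, are assumptions made for illustration only.

```r
# Minimal sketch, not the session script: ML estimation of a logistic
# regression with optim(). All data are simulated for illustration.
set.seed(42)
n <- 500
x <- rnorm(n)
# assumed "true" parameters used only to simulate the outcome
p <- plogis(-0.5 + 1.2 * x)
y <- rbinom(n, size = 1, prob = p)

# Negative log-likelihood of the Bernoulli model with a logit link:
# logLik = sum(y * eta - log(1 + exp(eta))), with eta = b0 + b1 * x
negLogLik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Nelder-Mead is derivative-free; it only needs the function value
fit <- optim(par = c(0, 0), fn = negLogLik, x = x, y = y,
             method = "Nelder-Mead")

# glm() maximizes the same likelihood by iteratively reweighted least squares
glmFit <- glm(y ~ x, family = binomial(link = "logit"))

# the two sets of estimates should agree closely
rbind(optim = fit$par, glm = unname(coef(glmFit)))
```

Both approaches maximize the same likelihood, so any visible difference between the two rows should come from the convergence tolerance of Nelder-Mead rather than from the model.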
R script :: Session 8
Split data into training and test
#### w. training vs. test data set
library(ggplot2) # needed for the coefficient plot below

# split into test and training
dim(dataSet)
choice <- sample(1:475, 250, replace = F)
test <- which(!(c(1:475) %in% choice))
trainData <- dataSet[choice, ]
newData <- dataSet[test, ]

# check!
sum(dataSet$Category[choice])/length(choice) # proportion of dotCom in training
sum(dataSet$Category[test])/length(test) # proportion of dotCom in test

# Binomial Logistic Regression: use glm w. logit link
bLRmodel <- glm(Category ~ .,
                family = binomial(link = 'logit'),
                control = list(maxit = 500),
                data = trainData)
sumLR <- summary(bLRmodel)
sumLR

# Coefficients
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)

# Wald statistics significant? (this Wald z is asymptotically normally distributed)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05) & (!(rownames(coefLR) == "(Intercept)")))

# which predictors worked?
rownames(coefLR)[w]

# plot coefficients {ggplot2}
plotFrame <- coefLR[w, ]
plotFrame$Estimate <- round(plotFrame$Estimate, 2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
plotFrame$Features <- factor(plotFrame$Features,
                             levels = plotFrame$Features,
                             ordered = T)
# refer to columns by name inside aes(), not via plotFrame$
ggplot(data = plotFrame, aes(x = Features, y = Estimate)) +
  geom_line(group = 1) +
  geom_point(color = "red", size = 2.5) +
  geom_point(color = "white", size = 2) +
  xlab("Features") +
  ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coefficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle = 90))

# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel), 50)
plot(density(fitted(bLRmodel)), main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)), col = "red", border = "black")

# Prediction from the model
predictions <- predict(bLRmodel, newdata = newData, type = 'response')
predictions <- ifelse(predictions >= 0.5, 1, 0)
trueCategory <- newData$Category
meanClasError <- mean(predictions != trueCategory)
accuracy <- 1 - meanClasError
accuracy # probably rather poor..? - Why? - Think!

# Try to train a binomial regression model many times by randomly assigning
# documents to the training and test data set
# What happens? Why?
# *Look* at your data set and *think* about it before actually modeling it.
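The exercise suggested at the end of the script (train the model many times on random splits and see what happens) could be sketched roughly as follows. This is an illustration under stated assumptions, not the course solution: simData is a simulated stand-in for a sparse tf-idf matrix, and splitAccuracy is a hypothetical helper introduced here.

```r
# Minimal sketch, not the course solution: repeat the random train/test
# split many times and inspect the spread of test accuracy.
# simData is simulated; it only mimics a sparse tf-idf feature matrix.
set.seed(8)
nDocs <- 200
nTerms <- 10
# sparse non-negative features: roughly 90% of the entries are exactly zero
X <- matrix(rexp(nDocs * nTerms) * rbinom(nDocs * nTerms, 1, 0.1),
            nrow = nDocs, ncol = nTerms)
simData <- data.frame(Category = rbinom(nDocs, 1, plogis(X[, 1] - X[, 2])), X)

# hypothetical helper: one random split, one model fit, one test accuracy
splitAccuracy <- function(dat, nTrain) {
  choice <- sample(seq_len(nrow(dat)), nTrain, replace = FALSE)
  trainData <- dat[choice, ]
  newData <- dat[-choice, ]
  # sparse features often trigger convergence or separation warnings
  model <- suppressWarnings(
    glm(Category ~ ., family = binomial(link = "logit"),
        control = list(maxit = 500), data = trainData)
  )
  predictions <- suppressWarnings(
    predict(model, newdata = newData, type = "response")
  )
  mean(ifelse(predictions >= 0.5, 1, 0) == newData$Category)
}

# 50 independent random splits of the same data
accuracies <- replicate(50, splitAccuracy(simData, nTrain = 100))
summary(accuracies)
```

If the accuracies vary noticeably from split to split, the fitted model depends heavily on which documents happen to land in the training set, which is the kind of instability one would expect when most feature values are zero.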
Readings :: Session 9: Binomial and Multinomial Logistic Regression [23. June, 2016, @Startit.rs, 19h CET]
- Yves Croissant, Estimation of multinomial logit models in R: The mlogit Packages