Introduction to R for Data Science :: Session 8 [Appendix]

Posted on June 20, 2016 by The Exactness of Mind in R bloggers

[This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers].

Appendix to Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression

Welcome to Introduction to R for Data Science, Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression [Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().]

The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Check out the Course Overview to access the learning material presented thus far.

Data Science Serbia Course Pages [in Serbian]

Startit Course Pages [in Serbian]

Lecturers

dipl. ing Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Data Science Serbia
Goran S. Milovanović, PhD, DataScientist@DiploFoundation, Data Science Mentor at Springboard, Data Science Serbia

Summary of Session 8, 17. June 2016 :: Intro to Text Mining in R + Binomial Logistic Regression.

Intro to Text Mining in R + Binomial Logistic Regression: Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().
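
As a companion to the reminder on Maximum Likelihood Estimation with Nelder-Mead from optim(), here is a minimal sketch of how a binomial logistic model could be estimated by minimizing its negative log-likelihood directly. The data are simulated, and the names trueBeta, negLogLik, mlFit and glmFit are illustrative assumptions; none of them come from the Session 8 script.

```r
# Simulated data (an assumption for illustration; not the course data set)
set.seed(42)
n <- 500
x <- rnorm(n)
trueBeta <- c(-0.5, 1.2)   # hypothetical intercept and slope
y <- rbinom(n, size = 1, prob = plogis(trueBeta[1] + trueBeta[2] * x))

# Negative log-likelihood of the binomial logistic model:
# logLik = sum( y * eta - log(1 + exp(eta)) ), with eta the linear predictor
negLogLik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Minimize with Nelder-Mead; x and y are passed on to negLogLik via '...'
mlFit <- optim(par = c(0, 0), fn = negLogLik, x = x, y = y,
               method = "Nelder-Mead")
mlFit$par

# Compare with glm(), which fits the same model by iteratively
# reweighted least squares
glmFit <- glm(y ~ x, family = binomial(link = "logit"))
coef(glmFit)
```

The two sets of estimates should agree closely; Nelder-Mead is derivative-free, so small differences in the later decimals are to be expected.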

Session 8 R Script

R script :: Session 8

Split data into training and test

#### w. training vs. test data set
# split into test and training
dim(dataSet)
n <- nrow(dataSet) # 475 rows in the original script
choice <- sample(1:n, 250, replace = FALSE) # row indices of the training set
test <- which(!(1:n %in% choice)) # all remaining rows form the test set
trainData <- dataSet[choice,]
newData <- dataSet[test,]
# check!
sum(dataSet$Category[choice])/length(choice) # proportion of dotCom in training
sum(dataSet$Category[test])/length(test) # proportion of dotCom in test
 
# Binomial Logistic Regression: use glm w. logit link
bLRmodel <- glm(Category ~.,
                family=binomial(link='logit'),
                control = list(maxit = 500),
                data=trainData)
 
sumLR <- summary(bLRmodel)
sumLR
 
# Coefficients
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)
# Wald statistics significant? (the Wald z is asymptotically standard normal under H0)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05)&(!(rownames(coefLR) == "(Intercept)")))
# which predictors worked?
rownames(coefLR)[w]
 
# plot coefficients {ggplot2}
library(ggplot2)
plotFrame <- coefLR[w,]
plotFrame$Estimate <- round(plotFrame$Estimate,2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
# keep the features in the order of their estimates on the x-axis
plotFrame$Features <- factor(plotFrame$Features, levels = plotFrame$Features, ordered = TRUE)
# refer to columns by name inside aes(), not via plotFrame$...
ggplot(data = plotFrame, aes(x = Features, y = Estimate)) +
  geom_line(group=1) + geom_point(color="red", size=2.5) + geom_point(color="white", size=2) +
  xlab("Features") + ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coefficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle=90))
 
# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel),50)
plot(density(fitted(bLRmodel)),
     main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)), 
        col="red", 
        border="black")
 
# Prediction from the model
predictions <- predict(bLRmodel,
                       newdata=newData,
                       type='response')
 
predictions <- ifelse(predictions >= 0.5,1,0)
trueCategory <- newData$Category
 
meanClasError <- mean(predictions != trueCategory)
accuracy <- 1-meanClasError
accuracy # probably rather poor..? - Why? - Think!
 
# Try to train a binomial regression model many times by randomly assigning 
# documents to the training and test data set
# What happens? Why?
 
# *Look* at your data set and *think* about it before actually modeling it.
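
One possible way to set up the exercise above (training the model many times on random training/test assignments) is sketched here. It runs on a small simulated stand-in, not on the course data: simData, its features f1 and f2, and the helper splitAccuracy() are hypothetical names introduced only for this illustration. With the real tf-idf features the picture may look quite different, which is the point of the exercise.

```r
# Simulated stand-in for dataSet (an assumption; the real data set
# holds tf-idf scores as features)
set.seed(2016)
nDocs <- 475
simData <- data.frame(
  Category = rbinom(nDocs, 1, 0.5),
  f1 = rnorm(nDocs),
  f2 = rnorm(nDocs)
)
# let the first feature carry some signal about the category
simData$f1 <- simData$f1 + simData$Category

# One random split: fit on nTrain rows, return accuracy on the remaining rows
splitAccuracy <- function(data, nTrain = 250) {
  choice <- sample(seq_len(nrow(data)), nTrain, replace = FALSE)
  trainData <- data[choice, ]
  newData <- data[-choice, ]
  model <- glm(Category ~ ., family = binomial(link = "logit"),
               data = trainData)
  probs <- predict(model, newdata = newData, type = "response")
  predicted <- ifelse(probs >= 0.5, 1, 0)
  mean(predicted == newData$Category)
}

# Repeat the random assignment 100 times and inspect the spread
accuracies <- replicate(100, splitAccuracy(simData))
summary(accuracies)
sd(accuracies)
```

To try this on the course data, pass dataSet to splitAccuracy() in place of simData and compare how much the accuracy varies from split to split.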

Readings :: Session 9: Binomial and Multinomial Logistic Regression [23. June, 2016, @Startit.rs, 19h CET]

Yves Croissant, Estimation of multinomial logit models in R: The mlogit Packages