Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Welcome to Introduction to R for Data Science, Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression [Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcommings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().]
The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.
Check out the Course Overview to acess the learning material presented thus far.
Data Science Serbia Course Pages [in Serbian]
Startit Course Pages [in Serbian]

Lecturers
- dipl. ing Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Data Science Serbia
- Goran S. Milovanović, Phd, DataScientist@DiploFoundation, Data Science Mentor at Springboard, Data Science Serbia
Summary of Session 8, 17. June 2016 :: Intro to Text Mining in R + Binomial Logistic Regression.
Intro to Text Mining in R + Binomial Logistic Regression. Intro to Text Mining in R + Binomiral Logistic Regression: Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcommings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().
Intro to R for Data Science SlideShare :: Session 8
R script :: Session 8
########################################################
# Introduction to R for Data Science
# SESSION 8 :: 16 June, 2016
# Binomiral Logistic Regression + Intro to Text Mining in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
 
# clear
rm(list=ls())
 
# libraries
library(tm)
library(tm.plugin.webmining)
 
#### NOTE
#### Suggestion: Skip WebCorpus() calls from the tm.plugin.webmining
#### (skip everything before: # START HERE load)
#### Data set is available from GitHub :: https://github.com/GoranMilovanovic/IntroRDataScience
#### File: Session8.RData
#### Download link: https://goo.gl/cgJ3J3
 
#### Part I Information Retrieval: ICT market
# 2 categories: dotcom vs hardware companies
# dotCom category:
# NASDAQ:GOOGL is Alphabet Inc, NASDAQ:AMZN is Amazon, NASDAQ:JD is JD.com, NASDAQ:FB is Facebook,
# NYSE:BABA is Alibaba
# hardware category:
# NYSE:HPQ is HP, NASDAQ:AAPL is Apple, KRX:005930 is Samsung Electronics, TPE:2354 is Foxconn,
# NYSE:IBM is IBM
 
# source: Google Finance
# search queries: .com vs. hardware companies
searchQueries <- list(c('NASDAQ:GOOGL','NASDAQ:AMZN','NASDAQ:JD','NASDAQ:FB','NYSE:BABA'),
                      c('NYSE:HPQ','NASDAQ:AAPL','KRX:005930','TPE:2354','NYSE:IBM'));
 
# retrive w. tm.plugin.webmining 
# retrieve news for dotCom
dotCom <- lapply(searchQueries[[1]], function(x) {
                 WebCorpus(GoogleFinanceSource(x))
});
# now retrieve news for hardware companies
hardware <- lapply(searchQueries[[2]], function(x) {
  WebCorpus(GoogleFinanceSource(x))
});
 
# source: Google News
searchQueriesNews <- list(c("Google", "Amazon", "JD.com", "Facebook", "Alibaba"), 
                       c("Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
 
# retrieve news for dotCom companies 
dotComNews <- lapply(searchQueriesNews[[1]], function(x) {
  googleNewsSRC <- GoogleNewsSource(x,
                                    params = list(hl = "en",
                                                  q = x,
                                                  ie = "utf-8",
                                                  num = 30,
                                                  output = "rss"))
 
  WebCorpus(googleNewsSRC)
});
 
# retrieve news for hardware companies
hardwareNews <- lapply(searchQueriesNews[[2]], function(x) {
  googleNewsSRC <- GoogleNewsSource(x,
                                    params = list(hl = "en",
                                                  q = x,
                                                  ie = "utf-8",
                                                  num = 30,
                                                  output = "rss"))
 
  WebCorpus(googleNewsSRC)
});
 
# save workspace; NOTE: tm has functions to save corpora
sessionDir <- 
  "YOUR WORKING DIRECTORY HERE"
setwd(sessionDir)
save.image(file="Session8.RData")
 
# START HERE load
sessionDir <- 
  "YOUR WORKING DIRECTORY HERE"
setwd(sessionDir)
load(file="Session8.RData")
 
# tm corpora are lists
 
# a tm corpus (WebCorpus, more specifically)
dotCom[[1]]
# a PlaintTextDocument in the first dotCom corpus
class(dotCom[[1]][[1]])
dotCom[[1]][[1]]
# document metadata
dotCom[[1]][[1]]$meta
# document content
dotCom[[1]][[1]]$content
 
# let's add another tag the document metadata structure: dotCom
dotCom <- lapply(dotCom, function(x) {
  x <- tm_map(x, function(doc) { # tm_map works over tm corpora, similarly to lapply
    meta(doc, "category") <- "dotCom"
    return(doc)
    })
})
dotCom[[1]][[1]]$meta
 
# add a category tag to the document metadata structure: dotComNews
dotComNews <- lapply(dotComNews, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "dotCom"
    return(doc)
  })
})
dotComNews[[1]][[1]]$meta
 
# add a category tag to the document metadata structure: hardware
hardware <- lapply(hardware, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "hardware"
    return(doc)
  })
})
hardware[[1]][[1]]$meta
 
# add a category tag to the document metadata structure: hardwareNews
hardwareNews <- lapply(hardwareNews, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "hardware"
    return(doc)
  })
})
hardwareNews[[1]][[1]]$meta
 
# combine corpora w. do.call() and c()
dotCom <- do.call(c, dotCom) # do.call comes handy; similar to lapply
# Learn more about do.call: http://www.stat.berkeley.edu/~s133/Docall.html
hardware <- do.call(c, hardware)
dotComNews <- do.call(c, dotComNews)
hardwareNews <- do.call(c, hardwareNews)
workCorpus <- c(dotCom, dotComNews, hardware, hardwareNews)
 
# are there any duplicates?
urls <- as.character(meta(workCorpus,tag="origin"))
uurls <- unique(urls)
w <- unlist(lapply(uurls, function(x) {which(urls %in% x)[1]}))
workCorpus <- workCorpus[w]
 
# empty docs?
docLengths <- as.numeric(lapply(workCorpus, function(x) {nchar(x$content)}))
w <- which(docLengths<10)
workCorpus <- workCorpus[-w]
 
#### Part II Text Preprocessing
 
### text pre-processing
workCorpus[[1]]$content # we need to clean-up the docs
# reminder: https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
# Regex in R
removeSpecial <- function(x) {
  # replacing w. space might turn out to be handy
  x$content <- gsub("\\t|\\r|\\n"," ",x$content)
  return(x)
}
# example: 
cleanDoc <- removeSpecial(workCorpus[[1]])
cleanDoc$content
# removeSpecial with tm_map {tm}
workCorpus <- tm_map(workCorpus, removeSpecial)
workCorpus[[1]]$content
 
#### {tm} by the book text pre-processing
 
# remove punctuation
workCorpus <- tm_map(workCorpus, removePunctuation) # built in
workCorpus[[1]]$content
 
# remove numbers
workCorpus <- tm_map(workCorpus, removeNumbers) # built in
workCorpus[[1]]$content
 
# all characters tolower
toLower <- function(x) {
  x$content <- tolower(x$content)
  return(x)
}
workCorpus <- tm_map(workCorpus, toLower)
workCorpus[[1]]$content
 
# remove stop words
workCorpus <- tm_map(workCorpus, removeWords, stopwords("english")) # built in
workCorpus[[1]]$content
 
# stemming
# see: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
library(SnowballC) # nice stemming algorithm (Porter, 1980)
# see: https://cran.r-project.org/web/packages/SnowballC/index.html
workCorpus <- tm_map(workCorpus, stemDocument)
workCorpus[[1]]$content
 
# remove potentially dangerous predictors: company names
# Why? Why? (In a nutshell: we are trying to discover something here,
# not to 'confirm' our potentially redundant knowledge)
predWords <- tolower(c("Alphabet", "Google", "Amazon", "JD.com", "Facebook", "Alibaba", 
                       "HP", "Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
predWordsRegex <- paste(stemDocument(predWords),collapse="|");
replacePreds <- function(x) {
  x$content <- gsub(predWordsRegex,
                    " ", 
                    x$content)
  return(x)
}
workCorpus <- tm_map(workCorpus, replacePreds)
workCorpus[[1]]$content
 
# strip whitespace
workCorpus <- tm_map(workCorpus, stripWhitespace)
workCorpus[[1]]$content
 
#### Part III Feature Selection: Term-Document Frequency Matrix (TDM)
 
# TDM
# rows = terms; columns = docs
# we will use the td-idf weighting
# see: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
dT <- TermDocumentMatrix(workCorpus,
                              control = list(tolower = FALSE,
                                             wordLengths <- c(3, Inf),
                                             weighting = weightTfIdf)
)
 
#  dT is a *sparse matrix*; maybe you want to learn more about the R {slam} package
# https://cran.r-project.org/web/packages/slam/index.html
class(dT)
dT$i # rows w. non-zero entries
dT$j # columns w. non-zero entries
dT$v # non-zero entry in [i,j]
dT$nrow # "true" row dimension
dT$ncol # "true" column dimension
dT$dimnames$Terms # self-explanatory
dT$dimnames$Docs # self-explanatory
 
# remove sparse terms
docTerm <- removeSparseTerms(dT, sparse = .75) # sparse is in (0,1]
dim(docTerm)
docTerm$dimnames$Terms
# see: [1] http://www.inside-r.org/packages/cran/tm/docs/removeSparseTerms
# A term-document matrix where those terms from x are removed which have 
# at least a sparse percentage of empty (i.e., terms occurring 0 times in a document) 
# elements. [1]
# NOTE: **very important**; in a real-world application
# you would probably need to run many models with features obtained from various levels of sparcity
# and perform model selection.
 
# as.matrix
docTerm <- as.matrix(docTerm)
colnames(docTerm) <- paste0("d",seq(1:dim(docTerm)[2]))
# dimensions
dim(docTerm)
# inspect candidate features
tfIdf <- rowSums(docTerm)
tfIdf <- sort(tfIdf, decreasing = T)
head(tfIdf, 10)
tail(tfIdf, 10)
# list candidate features
length(tfIdf)
names(tfIdf)
# pick some words w
w <- sample(1:length(tfIdf),4,replace = F)
par(mfcol = c(2,2))
for (i in 1:length(w)) {
  hist(docTerm[w[i],],
       main = paste0("Distribution of '",names(tfIdf)[w[i]],"'"),
       cex.main = .85,
       xlab = "Tf-Idf",
       ylab = "Count"
       )
} # quite interesting, isn't it? - How do you perform regression with these?

#### Part IV Binomial Logistic Regression
 
# How well can the selected words predict the document category?
# How to relate continuous, non-normally distributed predictors, to a categorical outcome?
# Idea: Logistic function
par(mfcol=c(1,1))
logistic <- function(t) {exp(t)/(1+exp(t))}
curve(logistic, from = -10, to = 10, n = 1000,
      main="Logistic Function",
      cex.main = .85,
      xlab = "t",
      ylab = "Logistic(t)",
      cex.lab = .85)
# and then let t be b0 + b1*x1 + b2*x2 + ... + bn*xn

# prepare data set
dataSet <- data.frame(t(docTerm))
dataSet$Category <- as.character(meta(workCorpus, tag="category")) 
table(dataSet$Category)
 
# does every word play at least some role in each category?
w1 <- which(colSums(dataSet[which(dataSet$Category == "dotCom"), 1:(dim(dataSet)[2]-1)])==0)
colnames(dataSet)[w1]
w2 <- which(colSums(dataSet[which(dataSet$Category == "hardware"), 1:(dim(dataSet)[2]-1)])==0)
colnames(dataSet)[w2]
 
# recode category
library(plyr)
dataSet$Category <- as.numeric(revalue(dataSet$Category, c("dotCom"=1, "hardware"=0)))
 
#### Logistic Regression
# Binomial Logistic Regression: use glm w. logit link (link='logit' is default)
bLRmodel <- glm(Category ~.,
             family=binomial(link='logit'),
             control = list(maxit = 500),
             data=dataSet)
 
sumLR <- summary(bLRmodel)
sumLR
 
# Coefficients
# NOTE: coefficients here relate a unit change in the predictor to the logit[P(Outcome)] 
# logit(p) = log(p/(1-p)) - also known as log-odds
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)
# Wald statistics significant? (this Wald z is normally distributed)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05)&(!(rownames(coefLR) == "(Intercept)")))
# which predictors worked?
rownames(coefLR)[w]
# NOTE: Wald statistic (z) is dangerous:
# as the coefficient gets higher, its standard error inflates thus underestimating z
# Beware of z...
 
# plot coefficients {ggplot2}
library(ggplot2)
plotFrame <- coefLR[w,]
plotFrame$Estimate <- round(plotFrame$Estimate,2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
plotFrame$Features <- factor(plotFrame$Features, levels = plotFrame$Features, ordered=T)
ggplot(data = plotFrame, aes(x = plotFrame$Features, y = plotFrame$Estimate)) +
  geom_line(group=1) + geom_point(color="red", size=2.5) + geom_point(color="white", size=2) +
  xlab("Features") + ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coeficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle=90))

# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel),50)
plot(density(fitted(bLRmodel)),
     main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)), 
                col="red", 
                border="black")

# coefficients related to odds (and not log-odds): simply
exp(bLRmodel$coefficients)[w]
# check max
max(exp(bLRmodel$coefficients)[w]) # huge? why? - think!
 
#### Reminder: Maximum Likelihood Estimation
normData <- rnorm(10000, mean = 5.75, sd = 1.25)
normLogLike <- function(params,x) {
  mean <- params[1]
  sd <- abs(params[2]) # dnorm generates NaNs if sd < 0
  dens <- dnorm(x,mean,sd)
  w <- which(dens==0)
  dens[w] <- .Machine$double.xmin	
  return(-(sum(log(dens)))) # negative logLike, for minimization w. optim()
}
# test
normLogLike(normData,c(2,.77))
# ML estimation
# random initial values
startMean <- runif(1,-100,100)
startSd <- runif(1,-100,100)
mlFit <- optim(c(startMean, startSd),
               fn = normLogLike,
               x=normData,
               control = list(maxit = 50000))
# ML estimates
mlFit$par
# cmp. true paramaters: mean = 5.75, sd = 1.25
 
# model Chi-Square
CSq <- bLRmodel$null.deviance - bLRmodel$deviance # this difference ~ Chi-Square Distribution
CSq
dfCSq <- bLRmodel$df.null - bLRmodel$df.residual # null - residual (model) degrees of freedom
dfCSq
# Chi-Square significance test in R:
pCSq = 1-pchisq(CSq, dfCSq) # 1 - c.d.f. = P(Chi-Square larger than this obtained by chance)
pCSq
 
# AIC = Akaike information criterion  (-2*LogLikelihood+2*k, k = num.parameters)
bLRmodel$aic
Session 8 Appendix
Readings :: Session 9: Binomial and Multinomial Logistic Regression [23. June, 2016, @Startit.rs, 19h CET]
- Yves Croissant, Estimation of multinomial logit models in R: The mlogit Packages
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
