Text Mining with R – Comparing Word Counts in Two Text Documents

Kay Cichini

9 years ago

[This article was first published on theBioBucket*, and kindly contributed to R-bloggers].


Here’s what I came up with to compare word counts in two pieces of text. If you have any ideas, I’d love to learn about alternatives!

## a function that compares word counts in two texts
## x, y:   character strings holding the two texts
## stem:   if TRUE, reduce words to their stems before counting
## minlen: minimum number of characters a word needs to be counted
## marg:   if TRUE, add marginal sums to the result
wordcount <- function(x, y, stem = FALSE, minlen = 1, marg = FALSE) {

  # tm provides removePunctuation() and stemDocument();
  # stemming additionally needs the SnowballC package
  require(tm)

  # strip punctuation and split on whitespace
  x_clean <- unlist(strsplit(removePunctuation(x), "\\s+"))
  y_clean <- unlist(strsplit(removePunctuation(y), "\\s+"))

  # drop words shorter than minlen and convert to lower case
  x_clean <- tolower(x_clean[nchar(x_clean) >= minlen])
  y_clean <- tolower(y_clean[nchar(y_clean) >= minlen])

  # optionally stem the words
  if (stem) {
    x_clean <- stemDocument(x_clean)
    y_clean <- stemDocument(y_clean)
  }

  x_tab <- table(x_clean)
  y_tab <- table(y_clean)

  cnam <- sort(unique(c(names(x_tab), names(y_tab))))

  # count matrix: one row per text, one column per word, plus margins
  z <- matrix(0, 3, length(cnam) + 1,
              dimnames = list(c("x", "y", "rowsum"), c(cnam, "colsum")))
  z["x", names(x_tab)] <- x_tab
  z["y", names(y_tab)] <- y_tab
  z["rowsum", ] <- colSums(z)
  z[, "colsum"] <- rowSums(z)

  # return words as rows; drop the margins unless requested
  if (marg) t(z) else t(z[c("x", "y"), cnam, drop = FALSE])
}

## example
x <- "Hello new, new world, this is one of my nice text documents - I wrote it today"
y <- "Good bye old, old world, this is a nicely and well written text document"

wordcount(x, y, stem = TRUE, minlen = 3, marg = TRUE)
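
As for alternatives, here is one possible sketch in base R that avoids the tm dependency. It is only an illustration under a few assumptions, not part of the original function: the helper name word_counts() is made up for this example, punctuation is stripped with a plain regular expression rather than removePunctuation(), and stemming is left out entirely. The idea is to give both texts the same factor levels so that table() reports zero counts for words that occur in only one of them.

```r
# Sketch of a tm-free alternative; word_counts() is a hypothetical helper
# name used only for this illustration, and no stemming is performed.
word_counts <- function(x, y, minlen = 1) {

  # lower-case, strip punctuation, split on whitespace, drop short words
  tokenize <- function(s) {
    w <- strsplit(gsub("[[:punct:]]", "", tolower(s)), "\\s+")[[1]]
    w[nchar(w) >= minlen]
  }
  xw <- tokenize(x)
  yw <- tokenize(y)

  # shared factor levels make table() return a count (possibly 0)
  # for every word of the combined vocabulary
  vocab <- sort(unique(c(xw, yw)))
  counts <- cbind(
    x = as.vector(table(factor(xw, levels = vocab))),
    y = as.vector(table(factor(yw, levels = vocab)))
  )
  rownames(counts) <- vocab
  counts
}

## example
word_counts("Hello new, new world", "Good bye old, old world")
```

The result has one row per word and one column per text, so marginal sums can be added afterwards with rowSums() and colSums() if they are needed.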

