Text Mining with R – Comparing Word Counts in two Text Documents
[This article was first published on theBioBucket*, and kindly contributed to R-bloggers.]
Here’s what I came up with to compare word counts in two pieces of text. If you have any ideas, I’d love to learn about alternatives!
## a function that compares word counts in two texts
wordcount <- function(x, y, stem = FALSE, minlen = 1, marg = FALSE) {
  # tm provides removePunctuation() and stemDocument();
  # stemDocument() also needs the SnowballC package to be installed
  library(tm)

  # strip punctuation, split on whitespace, drop short tokens, lowercase
  x_clean <- unlist(strsplit(removePunctuation(x), "\\s+"))
  y_clean <- unlist(strsplit(removePunctuation(y), "\\s+"))
  x_clean <- tolower(x_clean[nchar(x_clean) >= minlen])
  y_clean <- tolower(y_clean[nchar(y_clean) >= minlen])

  # optionally reduce the words to their stems
  if (stem) {
    x_clean <- stemDocument(x_clean)
    y_clean <- stemDocument(y_clean)
  }

  x_tab <- table(x_clean)
  y_tab <- table(y_clean)
  cnam <- sort(unique(c(names(x_tab), names(y_tab))))

  # 3 x (n + 1) matrix: counts for x and y plus the margins
  z <- matrix(0, 3, length(cnam) + 1,
              dimnames = list(c("x", "y", "rowsum"), c(cnam, "colsum")))
  z["x", names(x_tab)] <- x_tab
  z["y", names(y_tab)] <- y_tab
  z["rowsum", ] <- colSums(z)
  z[, "colsum"] <- rowSums(z)

  # return words in rows; drop the margins unless marg = TRUE
  if (marg) t(z) else t(z[c("x", "y"), cnam, drop = FALSE])
}

## example
x <- "Hello new, new world, this is one of my nice text documents - I wrote it today"
y <- "Good bye old, old world, this is a nicely and well written text document"
wordcount(x, y, stem = TRUE, minlen = 3, marg = TRUE)
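One possible alternative, offered only as a minimal sketch: if stemming is not needed, the same comparison can be done in base R without tm. The function name wordcount_base is hypothetical and not part of the original post. The idea is to build one shared vocabulary and count each text against it via factor levels, so that words missing from one text show up as zero counts.

```r
# Hypothetical base-R alternative: no tm dependency, no stemming.
wordcount_base <- function(x, y, minlen = 1) {
  # strip punctuation, split on whitespace, drop short tokens, lowercase
  tokenize <- function(txt) {
    words <- unlist(strsplit(gsub("[[:punct:]]+", "", txt), "\\s+"))
    tolower(words[nchar(words) >= minlen])
  }
  x_words <- tokenize(x)
  y_words <- tokenize(y)

  # shared vocabulary, so both count vectors line up word by word
  vocab <- sort(unique(c(x_words, y_words)))

  # factor() with fixed levels makes table() report zero counts too
  counts <- cbind(
    x = as.vector(table(factor(x_words, levels = vocab))),
    y = as.vector(table(factor(y_words, levels = vocab)))
  )
  rownames(counts) <- vocab
  counts
}

wordcount_base("Hello new, new world", "Good bye old, old world", minlen = 3)
```

The result has one row per word and one column per text. Margins are left out here; rowSums() and colSums() can be applied to the returned matrix if they are wanted.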