Cleaning sentences by recursively merging words using R
[This article was first published on NumberTheory » R stuff, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A question on StackOverflow really sparked my attention. The aim was to clean up a dataset of inappropriately spaced words. For example:
> word5 <- "hotter the doghou se would be bec ause the co lor was diffe rent"
My approach was to create what I call a wordpair
object. The word pair object for the example sentence looks like:
> abc1_pairs word1 word2 1 hotter the 2 the doghou 3 doghou se 4 se would 5 would be 6 be bec 7 bec ause 8 ause the 9 the col 10 col or 11 or was 12 was diffe 13 diffe rent
Then we iterate over the word pairs, and check if they are correct words using the aspell
function in R, and recursively keep merging words until no new correct words can be found. The code I created to create the wordpair
object, transform a wordpair back to a list of words, and some additional functions can be found at the end of this post.
Applied to the example dataset this would result in:
> word5 <- "hotter the doghou se would be bec ause the col or was diffe rent" > abc1 = strsplit(word5, split = " ")[[1]] > abc1_pairs = wordlist2wordpairs(abc1) > abc1_pairs word1 word2 1 hotter the 2 the doghou 3 doghou se 4 se would 5 would be 6 be bec 7 bec ause 8 ause the 9 the col 10 col or 11 or was 12 was diffe 13 diffe rent > abc1_merged_pairs = merge_wordpairs(abc1_pairs) Number of words about to be merged in this pass: 4 > merged_sentence = paste(wordpairs2wordlist(abc1_merged_pairs), collapse = " ") > c(word5, merged_sentence) [1] "hotter the doghou se would be bec ause the col or was diffe rent" [2] "hotter the doghouse would be because the color was different"
# A bunch of functions # Data transformation wordlist2wordpairs = function(word_list) { require(plyr) wordpairs = ldply(seq_len(length(word_list) - 1), function(idx) return(c(word_list[idx], word_list[idx+1]))) names(wordpairs) = c("word1", "word2") return(wordpairs) } wordpairs2wordlist = function(wordpairs) { return(c(wordpairs[[1]], wordpairs[[2]][nrow(wordpairs)])) } # Some checking functions # Is the word correct? word_correct = function(word) return(nrow(aspell(factor(word))) == 0) # Merge two words merge_word = function(word1, word2) return(paste(word1, word2, sep = "")) # Merge a specific pair, option to postpone deletion of pair merge_specific_pair = function(wordpairs, idx, delete_pair = TRUE) { # merge pair into word merged_word = do.call("merge_word", wordpairs[idx,]) # assign the pair to the idx above if(!(idx == 1)) wordpairs[idx - 1, "word2"] = merged_word if(!(idx == nrow(wordpairs))) wordpairs[idx + 1, "word1"] = merged_word # assign the pair to the index below (if not last one) if(delete_pair) wordpairs = wordpairs[-idx,] return(wordpairs) } # Recursively delete wordpairs which lead to a correct word merge_wordpairs = function(wordpairs) { require(plyr) merged_pairs = as.character(mlply(wordpairs, merge_word)) correct_words_idxs = which(sapply(merged_pairs, word_correct)) if(length(correct_words_idxs) == 0) { return(wordpairs) } else { message(sprintf("Number of words about to be merged in this pass: %s", length(correct_words_idxs))) for(idx in correct_words_idxs) { wordpairs = merge_specific_pair(wordpairs, idx, delete_pair = FALSE) } return(merge_wordpairs(wordpairs[-correct_words_idxs,])) # recursive call } }
To leave a comment for the author, please follow the link and comment on their blog: NumberTheory » R stuff.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.