[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].
I’m sure we all have our own words we use way too often.
Text analysis can also be used to discover patterns in writing; for a writer, it may be helpful in revealing an over-reliance on certain words and phrases. For today’s demonstration, I read in my (still in-progress) novel – a murder mystery called Killing Mr. Johnson – and did the same type of text analysis I’ve been demonstrating in recent posts.
To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.
setwd("~/Dropbox/Writing/Killing Mr. Johnson")

library(tidyverse)

KMJ_text <- read_lines('KMJ_full.txt')

KMJ <- tibble(KMJ_text) %>%
  mutate(linenumber = row_number())
I kept my line numbers, which I could use in some future analysis. For now, I’m going to tokenize my data, drop stop words, and examine my most frequently used words.
library(tidytext)

KMJ_words <- KMJ %>%
  unnest_tokens(word, KMJ_text) %>%
  anti_join(stop_words)

## Joining, by = "word"

KMJ_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 75) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
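Since overused phrases can be just as telling as overused words, the same tokenizer can be pointed at word pairs instead of single words. A minimal sketch of that extension, assuming the same KMJ tibble from above (the object name KMJ_bigrams is mine, not from the original analysis):

library(tidytext)
library(tidyr)

# Tokenize into two-word phrases (bigrams) instead of single words
KMJ_bigrams <- KMJ %>%
  unnest_tokens(bigram, KMJ_text, token = "ngrams", n = 2) %>%
  # Split each bigram so stop words can be filtered from either position
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

The anti_join trick doesn't work directly on bigrams, which is why the phrase is split into two columns and each column is filtered against the stop word list before counting.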