Statistics Sunday: Using Text Analysis to Become a Better Writer
[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]
I’m sure we all have our own words we use way too often.
Text analysis can also be used to discover patterns in writing, and for a writer, it may help reveal when we depend too much on certain words and phrases. For today’s demonstration, I read in my (still in-progress) novel, a murder mystery called Killing Mr. Johnson, and did the same type of text analysis I’ve been demonstrating in recent posts.
To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.
    setwd("~/Dropbox/Writing/Killing Mr. Johnson")

    library(tidyverse)

    KMJ_text <- read_lines('KMJ_full.txt')

    KMJ <- tibble(KMJ_text) %>%
      mutate(linenumber = row_number())
I kept my line numbers, which I could use in some future analysis. For now, I’m going to tokenize my data, drop stop words, and examine my most frequently used words.
    library(tidytext)

    KMJ_words <- KMJ %>%
      unnest_tokens(word, KMJ_text) %>%
      anti_join(stop_words)

    ## Joining, by = "word"

    KMJ_words %>%
      count(word, sort = TRUE) %>%
      filter(n > 75) %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n)) +
      geom_col() +
      xlab(NULL) +
      coord_flip()
Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It’s a murder mystery, so I’m not too surprised that words like “body” and “death” are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the “-ly adverbs.”
    ly_words <- KMJ_words %>%
      filter(str_detect(word, ".ly")) %>%
      count(word, sort = TRUE)

    head(ly_words)

    ## # A tibble: 6 x 2
    ##   word         n
    ##   <chr>    <int>
    ## 1 emily      599
    ## 2 finally     80
    ## 3 quickly     60
    ## 4 emily’s     53
    ## 5 suddenly    39
    ## 6 quietly     38
Since my main character is named Emily, she was accidentally picked up by str_detect: the pattern ".ly" matches any character followed by "ly", whether or not the word is an adverb. A few other top words that aren’t actually -ly adverbs also pop up in the list. I’ll filter those out, then take a look at what I have left.
    filter_out <- c("emily", "emily's", "emily’s", "family", "reply", "holy")

    ly_words <- ly_words %>%
      filter(!word %in% filter_out)

    ly_words %>%
      filter(n > 10) %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n)) +
      geom_col() +
      xlab(NULL) +
      coord_flip()
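A side note on the pattern itself: ".ly" is unanchored, so it matches any character followed by "ly" anywhere in a word, not only at the end. Below is a minimal sketch of an anchored alternative, "ly$", which I did not use in the analysis above. The word list is made up for illustration and is not taken from the novel, and the names `loose`, `anchored`, `not_adverbs`, and `ly_candidates` are hypothetical.

```r
library(stringr)

# Hypothetical sample words, chosen for illustration only
words <- c("finally", "quickly", "emily", "emily's", "flying", "family", "only")

# Unanchored pattern from the post: any character followed by "ly", anywhere
loose <- words[str_detect(words, ".ly")]

# Anchored pattern: the word must end in "ly"
anchored <- words[str_detect(words, "ly$")]

# Names and nouns ending in "ly" still need a manual exclusion list
not_adverbs <- c("emily", "family")
ly_candidates <- setdiff(anchored, not_adverbs)

print(loose)
print(anchored)
print(ly_candidates)
```

Anchoring drops words like "flying" and possessives like "emily's", but it would not replace the manual filter, since "emily" and "family" still end in "ly".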