Statistics Sunday: Using Text Analysis to Become a Better Writer

Posted on August 19, 2018 in R bloggers

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].

We all have words we love to use, and perhaps use too much. For example, I tend to lean on the same transitional words, to the point that, before I submit a manuscript, I run a find-all to see how many times I’ve used some of my favorites, e.g., additionally, though, and so on.

I’m sure we all have our own words we use way too often.

Text analysis can be used to discover patterns in writing, and for a writer, it may help reveal when we depend too much on certain words and phrases. For today’s demonstration, I read in my (still in-progress) novel – a murder mystery called Killing Mr. Johnson – and did the same type of text analysis I’ve been demonstrating in recent posts.

To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.

setwd("~/Dropbox/Writing/Killing Mr. Johnson")

library(tidyverse)

KMJ_text <- read_lines('KMJ_full.txt')

KMJ <- tibble(KMJ_text) %>%
  mutate(linenumber = row_number())
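For readers without the manuscript file, here is a minimal sketch of the same read-and-number step on a throwaway file. The file contents and the names tmp, demo_text, and demo are made up for illustration; only read_lines, tibble, mutate, and row_number come from the code above.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the manuscript: three short lines in a temp file
tmp <- tempfile(fileext = ".txt")
write_lines(c("Chapter One",
              "It was a dark night.",
              "Emily quickly opened the door."), tmp)

# One element per line of the file
demo_text <- read_lines(tmp)

# tibble() names the column after the vector it was given;
# row_number() then records each line's position
demo <- tibble(demo_text) %>%
  mutate(linenumber = row_number())

demo
```

The text column takes its name from the vector, which is why the later tokenizing step refers to a column called KMJ_text.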

I kept my line numbers, which I could use in some future analysis. For now, I’m going to tokenize my data, drop stop words, and examine my most frequently used words.

library(tidytext)
KMJ_words <- KMJ %>%
  unnest_tokens(word, KMJ_text) %>%
  anti_join(stop_words)

## Joining, by = "word"
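To see what these two steps do on a small input, here is a rough sketch with two made-up sentences. The toy data are an assumption for illustration only. Passing by = "word" explicitly is optional and simply silences the joining message shown above.

```r
library(dplyr)
library(tidytext)

# Two made-up lines standing in for the novel's text
toy <- tibble(text = c("Emily quickly opened the door.",
                       "The body was finally found.")) %>%
  mutate(linenumber = row_number())

toy_words <- toy %>%
  # one row per word; unnest_tokens lowercases and strips punctuation by default
  unnest_tokens(word, text) %>%
  # drop every row whose word appears in tidytext's stop_words table
  anti_join(stop_words, by = "word")

toy_words
```

Each surviving word keeps the linenumber of the line it came from, so the position information is still available after tokenizing.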

KMJ_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 75) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()
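As a side note on the reorder() step in the plot code above, the following sketch (with made-up words) shows what it does to the word column. It turns the words into a factor whose levels are sorted by count, and that ordering is what the bar chart follows.

```r
library(dplyr)

# Made-up tokens for illustration
toy <- tibble(word = c("body", "emily", "emily", "death", "emily", "body"))

toy_counts <- toy %>%
  count(word, sort = TRUE) %>%        # most frequent word first
  mutate(word = reorder(word, n))     # factor levels ordered by n, ascending

# Levels run from least to most frequent
levels(toy_counts$word)
```

With coord_flip(), the first factor level is drawn at the bottom of the axis, so the most frequent word ends up as the top bar.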

Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It’s a murder mystery, so I’m not too surprised that words like “body” and “death” are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the “-ly adverbs.”

ly_words <- KMJ_words %>%
  filter(str_detect(word, ".ly")) %>%
  count(word, sort = TRUE)

head(ly_words)

## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 emily      599
## 2 finally     80
## 3 quickly     60
## 4 emily’s     53
## 5 suddenly    39
## 6 quietly     38

Since my main character is named Emily, she was accidentally picked up by my string detect function: the pattern ".ly" matches any character followed by "ly" anywhere in a word, not only at the end. A few other top words also pop up in the list that aren’t actually -ly adverbs. I’ll filter those out, then take a look at what I have left.

filter_out <- c("emily", "emily's", "emily’s","family", "reply", "holy")

ly_words <- ly_words %>%
  filter(!word %in% filter_out)
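One possible refinement, sketched here on a made-up vector of words rather than the novel's data, is to anchor the pattern with $ so that it only matches words ending in "ly". This is an alternative to the pattern used above, not what produced the output in this post.

```r
library(stringr)

# Made-up examples for illustration
words <- c("quickly", "emily", "emily's", "family", "lying", "flyer", "only")

# Unanchored: any character followed by "ly", anywhere in the word
loose <- str_detect(words, ".ly")

# Anchored: the word must end in "ly"
anchored <- str_detect(words, "ly$")

data.frame(words, loose, anchored)
```

Anchoring drops matches like "emily's" and "flyer", but words such as "emily" and "family" still end in "ly", so a manual exclusion list like filter_out remains necessary.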

ly_words %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()