Where do letters occur in words
A while back I came across an interesting graphic showing where letters tend to occur in English words (http://www.prooffreader.com/2014/05/graphing-distribution-of-english.html). The other day I decided to make a similar one for letters in Danish words, using R.
I downloaded all the abstracts from the Danish Wikipedia and made my own version, as you can see here:
Here is how you can do it:
# First you need to load in some text
library(rvest)
# I’ll grab an article from FiveThirtyEight.com as a showcase.
# I did my analysis on all the Danish abstracts from Wikipedia (took a while!)
# When you do your final analysis you’ll want as much text as possible too.
# We grab the html data
html_data <- read_html("http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/") # read_html() replaces the deprecated html()
# We extract some text
textfile <- html_data %>% html_nodes("p") %>% html_text(trim = TRUE)
# We collapse it into a single string
textfile <- paste(textfile, collapse = " ")
# Then we need to do a little string manipulation
library(stringr)
# We set all text to lower case
textfile <- str_to_lower(textfile)
# We remove all punctuation and all digits
textfile <- str_replace_all(textfile, "[[:punct:]]|[[:digit:]]", "")
# Then we split the string into individual words (on any run of whitespace) and drop empty strings
words <- unique(unlist(str_split(textfile, "\\s+")))
words <- words[words != ""]
# And we count the letters in each word
word_length <- nchar(words)
# And we split each word into its individual letters
split_words <- str_split(words, "")
# Then we create a loop to find the position of each letter in each word
# If you have national letters like we do in Denmark, you include them like this: for(i in c(letters, "æ", "ø", "å"))
for(i in letters){ # We loop through all the letters
# Create empty list to hold data later
letter_place.list <- c()
# We find the position of each letter in the words (that we split apart)
letter_data <- lapply(split_words, function(x) which(x == i))
# A nested loop calculates the relative position of the letter in each word
for(y in 1:length(word_length)){
# We find the relative position
letter_place <- unlist(lapply(letter_data[y], function(x) x/word_length[y]))
# We add that position to a list of positions
letter_place.list <- c(letter_place.list, letter_place)
}
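# On the first pass we create a new list to hold all the data; after that we append the results from the loop
# Note: if you re-run the script in the same session, remove letter_place.data first, otherwise the results are appended a second time
if(!exists("letter_place.data")) letter_place.data <- list(letter_place.list) else letter_place.data <- append(letter_place.data, list(letter_place.list))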
# We make sure to name each list properly
names(letter_place.data)[length(letter_place.data)] <- i
}
# Now we have a nested list with the data we need, but first we’ll convert it to a long form data frame
# We create an empty data frame to hold the data
letter_place.data.df <- data.frame()
# Then we create a loop to put the data from each letter list into the data frame
for(z in 1:length(letter_place.data)){ # We loop through each nested list
tryCatch({ # I add the tryCatch so the loop doesn’t break if there is an error (can occur if a letter is missing)
# Here we extract the data from the letter list and create a data frame
loop_data <- data.frame(letter = names(letter_place.data)[z], value = letter_place.data[[z]], stringsAsFactors = FALSE)
# We then bind all the data frames together
letter_place.data.df <- rbind(letter_place.data.df, loop_data)
}, error=function(e){}) # Ends the tryCatch
}
# We check to see if we have all the letters
unique(letter_place.data.df$letter)
# We change the letters back to upper case for aesthetics in the graphic
letter_place.data.df$letter <- str_to_upper(letter_place.data.df$letter)
library(ggplot2)
# We create a density plot with free y scales to show the distribution, choose a red fill colour, and facet wrap it to show each individual letter
p <- ggplot(letter_place.data.df, aes(x = value)) + geom_density(aes(fill = "red")) + facet_wrap(~ letter, scales = "free_y")
# We add appropriate text to the title and axes
p <- p + labs(title = "Where do letters typically appear in English words", y = "Appearance", x = "Word length", fill = "")
# We set a deeper red, choose the minimal theme, remove axis markers and grid, and remove the legend
p <- p + scale_fill_brewer(palette = "Set1") + theme_minimal() +
theme(axis.ticks = element_blank(), axis.text.y = element_blank(), axis.text.x = element_blank(),
legend.position = "none", panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# Voila! Here it is
p
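The nested loops above can get slow on a large corpus, so here is a minimal vectorised sketch of the same relative-position calculation (position of the letter divided by the length of the word). This is not the code I used for the graphic: the helper name relative_positions is hypothetical, and it uses base R's strsplit() instead of stringr so that it runs on its own.

```r
# Hypothetical helper (not part of the original analysis): returns one row per
# letter occurrence together with its relative position in the word.
relative_positions <- function(words) {
  words <- words[nchar(words) > 0]     # drop empty strings
  chars <- strsplit(words, "")         # one character vector per word
  lens  <- lengths(chars)              # number of letters in each word
  data.frame(
    letter = unlist(chars),
    # positions 1..n within each word, divided by that word's length n
    value  = unlist(lapply(lens, seq_len)) / rep(lens, times = lens),
    stringsAsFactors = FALSE
  )
}

pos <- relative_positions(c("cat", "at"))
# Keep only the letters you want to plot (add "æ", "ø", "å" here if needed)
pos <- pos[pos$letter %in% letters, ]
```

The result has the same long form (a letter column and a value column) as letter_place.data.df, so it should be possible to pass it to the ggplot call in the same way.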