Statistics Sunday: Tokenizing Text
[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]
In text analysis, a token is any meaningful unit of text that can be analyzed. Frequently, a token is a word, and tokenizing splits the text into individual words and counts how many times each word appears. But a token could also be a phrase (such as each two-word combination in a text, called a bigram), a sentence, a paragraph, or even a whole chapter. Obviously, the size of the token you choose shapes what kind of analysis you can do. Generally, people choose smaller tokens, like words.
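For instance, tokenizing into bigrams only requires a different token argument to unnest_tokens() from the tidytext package (introduced below). A minimal sketch, using a sentence I made up for illustration:

library(tibble)
library(tidytext)

# An example sentence, just for illustration
sentence <- tibble(text = "a token is any meaningful unit of text")

sentence %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# Each row now holds a two-word combination: "a token", "token is", ...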
Let’s use R to download the text of a classic book (which I did previously in this post, but today I’ll do it in an easier way) and tokenize it by word.
Any text available in the Project Gutenberg repository can be downloaded, with header and footer information stripped out, using the gutenbergr package.
install.packages("gutenbergr") library(gutenbergr)
The package comes with a dataset, called gutenberg_metadata, that lists every available text by ID. Let’s use The War of the Worlds by H.G. Wells as our target book. We can find it like this:
library(tidyverse)

## ── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

gutenberg_metadata %>%
  filter(title == "The War of the Worlds")

## # A tibble: 3 x 8
##   gutenberg_id title   author   gutenberg_autho… language gutenberg_books…
##          <int> <chr>   <chr>               <int> <chr>    <chr>
## 1           36 The Wa… Wells, … 30               en       Movie Books/Sci…
## 2         8976 The Wa… Wells, … 30               en       Movie Books/Sci…
## 3        26291 The Wa… Wells, … 30               en       <NA>
## # ... with 2 more variables: rights <chr>, has_text <lgl>
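As you can see, a title can match several Project Gutenberg entries. If needed, extra conditions on other metadata columns (language and has_text both appear in gutenberg_metadata) can narrow the list. A sketch:

# Keep only English entries that actually have downloadable text
gutenberg_metadata %>%
  filter(title == "The War of the Worlds",
         language == "en",
         has_text)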
The ID for The War of the Worlds is 36. Now I can use that information to download the text of the book into a data frame with the gutenbergr function gutenberg_download.
warofworlds<-gutenberg_download(36) ## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest ## Using mirror http://aleph.gutenberg.org
Now I have a dataset with two variables: one containing the Project Gutenberg ID for the text (which is helpful if you create a dataset with multiple texts, perhaps all by the same author or within the same genre) and one containing a line of text per row. To tokenize our dataset, we need the tidytext package.
install.packages("tidytext") library(tidytext)
We can tokenize with the function unnest_tokens: the first argument names the output column that will hold the tokens (word), and the second names the input column containing the text (text). By default, unnest_tokens splits the text into single words.
tidywow<-warofworlds %>% unnest_tokens(word, text)
Now we have a dataset with each word from the book, one after the other. There are duplicates in here, because I haven't yet told R to count the words. Before I do that, I probably want to tell R to ignore extremely common words, like "the," "and," "to," and so on. In text analysis, these are called stop words, and tidytext comes with a dataset called stop_words that can be used to drop stop words from your text data.
tidywow<-tidywow %>% anti_join(stop_words) ## Joining, by = "word"
Last, we have R count up the words.
wowtokens<-tidywow %>% count(word, sort=TRUE) head(wowtokens) ## # A tibble: 6 x 2 ## word n ## <chr> <int> ## 1 martians 163 ## 2 people 159 ## 3 black 122 ## 4 time 121 ## 5 road 104 ## 6 night 102
After removing stop words, of which there may be hundreds or thousands in any text, the most common words are: Martians, people, black, time, road, and night.
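A natural next step, not part of the walkthrough above, is to chart the top words. A minimal sketch with ggplot2 (already loaded with the tidyverse), using dplyr's top_n() to keep the 15 most frequent words:

wowtokens %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%  # order bars by frequency
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +                       # horizontal bars for readable labels
  labs(x = NULL, y = "Number of appearances")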