Data Science Capstone – Milestone Report

John

3 years ago

[This article was first published on R – NetworkX, and kindly contributed to R-bloggers].


Executive Summary

This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This Milestone Report describes the major features of the data and summarizes my exploratory data analysis. To get started, I downloaded the Coursera SwiftKey dataset. I have also defined my plans for creating the predictive model(s) and a Shiny app as the data product.

All the code is attached in the Appendix.

Files used:

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

File details and stats

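Let’s have a look at the files. I determined the number of lines, number of characters, and number of words for each of the 3 datasets (Blog, News and Twitter). I also calculated some basic stats on the number of words per line (WPL). In the table below, LinesNEmpty is the number of non-empty lines and CharsNWhite is the number of non-whitespace characters.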

File	Lines	LinesNEmpty	Chars	CharsNWhite	TotalWords	WPL_Min	WPL_Mean	WPL_Max
blogs	899288	899288	206824382	170389539	37570839	0	41.75107	6726
news	1010242	1010242	203223154	169860866	34494539	1	34.40997	1796
twitter	2360148	2360148	162096241	134082806	30451170	1	12.75065	47
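As a minimal sketch of how statistics like these can be computed, the snippet below uses base R on a made-up three-line vector. The appendix uses the stringi package on the real files; splitting on whitespace here is a simplifying assumption, so its word counts can differ from those of stri_count_words.

```r
# Toy stand-in for one of the text files (hypothetical data)
lines <- c("The quick brown fox",
           "jumps over",
           "the lazy dog today and tomorrow")

# Words per line, approximated by splitting on runs of whitespace
words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))

# Collect the same kind of summary as in the table above
stats <- c(
  Lines      = length(lines),
  Chars      = sum(nchar(lines)),
  TotalWords = sum(words_per_line),
  WPL_Min    = min(words_per_line),
  WPL_Mean   = mean(words_per_line),
  WPL_Max    = max(words_per_line)
)
print(stats)
```

An empty line contributes zero words under this approach, which is one way a minimum of 0 words per line can arise.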

Sample the data

The data files are huge, so I take a 1% sample of every file and save it to the RDS file sample.rds to save space. The sample can then be loaded to start the analysis.
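The sampling step can be sketched on toy data as follows. The vector big_file is a hypothetical stand-in for one of the real files, and a temporary file is used instead of sample.rds so the example leaves nothing behind.

```r
# Hypothetical stand-in for one large file: 1,000 short lines
set.seed(613)  # same seed as in the appendix, for reproducibility
big_file <- paste("line", seq_len(1000))

# Draw 1% of the lines without replacement
sample_size <- floor(length(big_file) * 0.01)
small <- sample(big_file, sample_size)

# Round-trip through an RDS file (a temporary file here)
rds_path <- tempfile(fileext = ".rds")
saveRDS(small, rds_path)
restored <- readRDS(rds_path)

print(length(restored))            # number of sampled lines
print(identical(small, restored))  # the RDS round trip preserves the data
```

Fixing the seed before sampling means the same lines are drawn on every run, which keeps the later frequency tables reproducible.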

Preprocessing the data

After loading the sample RDS file, I created a corpus and started to analyse the data with the tm library.

The data contains a lot of information that I do not need and that is not useful. To clean it up, I converted the text to lowercase, removed punctuation and numbers, and removed stopwords (English ones, in this case). After that, I performed stemming; a stem is the form of a word to which affixes can be attached. For example, wait, waits, waited and waiting all share the stem wait. Removing all these characters left a lot of extra whitespace, which I stripped as well.
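To make the cleaning steps concrete, here is a small base R sketch on a single made-up sentence. It is not the tm pipeline from the appendix: the stopword list is a tiny illustrative assumption, and the stemming step only runs if the SnowballC package happens to be installed.

```r
# Hypothetical input line
raw <- "I waited 3 hours; she WAITS, and they're still waiting!"

# Tiny illustrative stopword list (an assumption; tm's English list is much longer)
stop_words <- c("i", "she", "and", "theyre", "still")

clean <- tolower(raw)                           # convert to lowercase
clean <- gsub("[[:punct:]]", "", clean)         # remove punctuation
clean <- gsub("[[:digit:]]", "", clean)         # remove numbers
tokens <- strsplit(trimws(clean), "\\s+")[[1]]  # split on runs of whitespace
tokens <- tokens[!tokens %in% stop_words]       # remove stopwords
print(tokens)

# Stemming with the Porter stemmer, only if SnowballC is available
if (requireNamespace("SnowballC", quietly = TRUE)) {
  print(SnowballC::wordStem(tokens, language = "porter"))
}
```

Because punctuation is removed before the stopwords, a contraction such as "they're" becomes "theyre" and no longer matches a stopword list that spells it with an apostrophe. This is probably why forms such as "cant" and "im" show up in the n-gram tables of this report.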

N-gram Tokenization

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

An n-gram of size 1 is referred to as a “unigram”, size 2 is a “bigram” and size 3 is a “trigram”.

The RWeka package was used to build the n-gram tokenizers that create the unigrams, bigrams and trigrams.
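To illustrate the definition, the following base R sketch builds word n-grams from a short token vector. The helper make_ngrams is a hypothetical name introduced for this example; the report itself relies on RWeka's NGramTokenizer.

```r
# Build all word n-grams of size n from a character vector of tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Example tokens, already cleaned and stemmed
tokens <- c("cant", "wait", "see", "new", "york", "citi")

print(make_ngrams(tokens, 1))  # unigrams
print(make_ngrams(tokens, 2))  # bigrams
print(make_ngrams(tokens, 3))  # trigrams
```

A sequence of k tokens yields k - n + 1 n-grams, so the six tokens above give six unigrams, five bigrams and four trigrams.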

Exploratory Analysis

Now I’m ready to perform exploratory analysis on the data. It is helpful to find the most frequently occurring terms among the unigrams, bigrams and trigrams.
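The counting idea can be sketched in base R on a few made-up, already-cleaned documents. The appendix does this with a TermDocumentMatrix from tm; the names toy_docs, bigrams_of and bigram_freq below are hypothetical and exist only for this illustration.

```r
# Hypothetical, already-cleaned documents
toy_docs <- c("cant wait see",
              "cant wait new york",
              "new york citi",
              "look like new york")

# Bigrams of one document: pair each word with its successor
bigrams_of <- function(doc) {
  w <- strsplit(doc, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

# Count bigrams across all documents and sort by frequency
counts <- table(unlist(lapply(toy_docs, bigrams_of)))
freq <- sort(counts, decreasing = TRUE)
bigram_freq <- data.frame(term = names(freq), freq = as.integer(freq))

print(head(bigram_freq, 2))
```

In this toy corpus "new york" is the most frequent bigram, followed by "cant wait".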

Unigrams

term	freq
will	3124
said	3048
just	3019
one	2974
like	2953
get	2949
time	2598
can	2465
day	2277
year	2127

Bigrams

term	freq
right now	270
last year	220
look like	217
cant wait	193
new york	186
last night	167
year ago	162
look forward	154
feel like	150
high school	150

Trigrams

term	freq
happi mother day	46
cant wait see	43
new york citi	30
happi new year	28
let us know	21
look forward see	20
cinco de mayo	17
two year ago	17
new york time	16
im pretti sure	14

Development Plan

The next steps of this capstone project are to create the predictive model(s) based on the N-gram Tokenization and to deploy the result as a data product. Here are my steps:

Establish the predictive model(s) by using N-gram Tokenizations.
Optimize the code for faster processing.
Develop the data product, a Shiny App, to predict the next word based on user input.
Create a Slide Deck for pitching my algorithm and Shiny App.
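One possible shape for such a model is a simple backoff lookup: try the trigram table with the last two words of the input, and fall back to the bigram table if nothing matches. The sketch below is only an illustration under that assumption and not the final model; the frequency tables hold toy counts, and best_match and predict_next are hypothetical helper names.

```r
# Toy n-gram frequency tables (made-up counts, for illustration only)
bigrams <- data.frame(term = c("new york", "new year", "last year"),
                      freq = c(5, 3, 4), stringsAsFactors = FALSE)
trigrams <- data.frame(term = c("happi new year", "new york citi", "new york time"),
                       freq = c(3, 4, 2), stringsAsFactors = FALSE)

# Last word of the most frequent n-gram that starts with `prefix`; NA if none
best_match <- function(prefix, ngrams) {
  hits <- ngrams[startsWith(ngrams$term, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  top <- hits$term[which.max(hits$freq)]
  tail(strsplit(top, " ", fixed = TRUE)[[1]], 1)
}

# Try trigrams with the last two words, then back off to bigrams
predict_next <- function(input) {
  w <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  if (length(w) == 0) return(NA_character_)
  if (length(w) >= 2) {
    guess <- best_match(paste(tail(w, 2), collapse = " "), trigrams)
    if (!is.na(guess)) return(guess)
  }
  best_match(tail(w, 1), bigrams)
}

print(predict_next("I love new york"))  # found in the trigram table
print(predict_next("happy new"))        # falls back to the bigram table
```

In a real application the user input would need the same cleaning and stemming as the training corpus, otherwise "happy" never matches the stemmed form "happi", as the second call shows.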

Appendix

Appendix – Load libraries, doParallel and files

# Loading Libraries
library(doParallel)
library(tm)
library(stringi)
library(RWeka)
library(dplyr)
library(knitr)       # provides kable()
library(kableExtra)
library(SnowballC)
library(ggplot2)

# Setting up doParallel: leave two cores free, but always use at least one
set.seed(613)
n_cores <- max(1, detectCores() - 2, na.rm = TRUE)
registerDoParallel(cores = n_cores)

# Show files used
directory_us <- file.path(".", "data", "final", "en_US")
dir(directory_us)

Appendix A – File details and stats

# Loading files and show summaries
# Binary mode ("rb") avoids a premature end-of-file on Windows
# when a file contains control characters
blogs_con <- file(file.path(directory_us, "en_US.blogs.txt"), "rb")
blogs <- readLines(blogs_con, encoding="UTF-8", skipNul = TRUE)
close(blogs_con)

news_con <- file(file.path(directory_us, "en_US.news.txt"), "rb")
news <- readLines(news_con, encoding="UTF-8", skipNul = TRUE)
close(news_con)

twitter_con <- file(file.path(directory_us, "en_US.twitter.txt"), "rb")
twitter <- readLines(twitter_con, encoding="UTF-8", skipNul = TRUE)
close(twitter_con)

# Create stats of files
WPL <- sapply(list(blogs,news,twitter),function(x)
  summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL) <- c('WPL_Min','WPL_Mean','WPL_Max')
rawstats <- data.frame(
  File = c("blogs","news","twitter"), 
  t(rbind(sapply(list(blogs,news,twitter),stri_stats_general),
          TotalWords = sapply(list(blogs,news,twitter),stri_stats_latex)["Words",],
          WPL))
)
# Show stats in table
kable(rawstats) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Appendix B – Sample the data

# Sample of data
set.seed(613)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
saveRDS(data.sample, 'sample.rds')

# Clean up the objects we do not use anymore
rm(blogs, blogs_con, data.sample, directory_us, news, news_con, rawstats, twitter, 
   twitter_con, WPL)

Appendix C – Preprocessing the data

# Load the RDS file
data <- readRDS("sample.rds")
# Create a Corpus
docs <- VCorpus(VectorSource(data))
# Remove data we do not need
# content_transformer() keeps each document a text document after tolower
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
# Do stemming
docs <- tm_map(docs, stemDocument)
# Strip whitespaces
docs <- tm_map(docs, stripWhitespace)

Appendix D – N-gram Tokenization

# Create tokenization functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Create plain text format
docs <- tm_map(docs, PlainTextDocument)

Appendix E – Exploratory Analysis

# Create TermDocumentMatrix with Tokenizations and Remove Sparse Terms
tdm_freq1 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = unigram)), 0.9999)
tdm_freq2 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = bigram)), 0.9999)
tdm_freq3 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = trigram)), 0.9999)

# Create frequencies 
uni_freq <- sort(rowSums(as.matrix(tdm_freq1)), decreasing=TRUE)
bi_freq <- sort(rowSums(as.matrix(tdm_freq2)), decreasing=TRUE)
tri_freq <- sort(rowSums(as.matrix(tdm_freq3)), decreasing=TRUE)

# Create DataFrames
uni_df <- data.frame(term=names(uni_freq), freq=uni_freq)   
bi_df <- data.frame(term=names(bi_freq), freq=bi_freq)   
tri_df <- data.frame(term=names(tri_freq), freq=tri_freq)

# Show head 10 of unigrams
kable(head(uni_df,10))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
# Plot head 20 of unigrams
head(uni_df,20) %>% 
  ggplot(aes(reorder(term,-freq), freq)) +
  geom_bar(stat = "identity") +
  ggtitle("20 Most Frequent Unigrams") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

# Show head 10 of bigrams
kable(head(bi_df,10))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
# Plot head 20 of bigrams
head(bi_df,20) %>% 
  ggplot(aes(reorder(term,-freq), freq)) +
  geom_bar(stat = "identity") +
  ggtitle("20 Most Frequent Bigrams") +
  xlab("Bigrams") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

# Show head 10 of trigrams
kable(head(tri_df,10))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
# Plot head 20 of trigrams
head(tri_df,20) %>% 
  ggplot(aes(reorder(term,-freq), freq)) +
  geom_bar(stat = "identity") +
  ggtitle("20 Most Frequent Trigrams") +
  xlab("Trigrams") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

John

I’m a creative, imaginative, free-thinking daydreamer and strategist who needs freedom, peace and space to brainstorm and to fantasize about new and surprising solutions. I generate ideas and solve difficult problems, see all options, judge accurately and want to get to the bottom of things.

Interested in Data Science, Data Analytics, Running, Crossfit, Obstacle Running and Coffee.


The post Data Science Capstone – Milestone Report appeared first on NetworkX.
