Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
qdapRegex 0.2.0 & qdapTools 1.1.0 have been released to CRAN. This post covers some of the packages’ updates and features and provides an integrated demonstration of extracting and viewing in-text APA 6 style citations from an MS Word (.docx) document.
qdapRegex 0.2.0
The qdapRegex package is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R’s own regular expression functions, or with add-on string manipulation packages such as stringr and stringi. The qdapRegex package serves a dual purpose of being both functional and educational.
New Features/Changes
Here are a select few of the new features (the original post links to a complete list of changes):
- is.regex added as a logical check of a regular expression’s validity (conforms to R’s regular expression rules).
- Case wrapper functions, TC (title case), U (upper case), and L (lower case) added for convenient case manipulation.
- rm_citation_tex added to remove/extract/replace bibkey citations from a .tex (LaTeX) file.
- regex_cheat data set and cheat function added to act as a quick reference for common regex task operations such as lookaheads.
- explain added to view a visual representation of a regular expression using http://www.regexper.com and http://rick.measham.id.au/paste/explain. Also takes named regular expressions from the regex_usa or other supplied dictionary.
The last two functions, regex_cheat & explain, provide educational regex tools. regex_cheat provides a cheatsheet of common regex elements. explain interfaces with http://www.regexper.com & http://rick.measham.id.au/paste/explain.
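To make two of these ideas concrete without relying on the package itself, here is a minimal base R sketch. The helper is_valid_regex is a hypothetical name of our own (it is not qdapRegex's is.regex and may not match its implementation); it simply asks R to compile the pattern and reports whether that worked. The second part shows a lookahead, the kind of operation the cheatsheet is meant to remind you of.

```r
# Hypothetical helper (an assumption, not part of qdapRegex):
# TRUE when R can compile the pattern as a Perl-style regex.
is_valid_regex <- function(pattern) {
  tryCatch({
    regexpr(pattern, "", perl = TRUE)  # compilation fails for a bad pattern
    TRUE
  }, error = function(e) FALSE, warning = function(w) FALSE)
}

ok  <- is_valid_regex("a+b")  # well-formed pattern
bad <- is_valid_regex("(a")   # unbalanced parenthesis

# Lookahead: match digits only when they are followed by " dollars"
x <- "I paid 20 dollars and 5 euros"
amounts <- regmatches(x, gregexpr("\\d+(?= dollars)", x, perl = TRUE))[[1]]
```

Lookaheads need perl = TRUE in base R, which is why the validity check above also compiles with the Perl engine.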
qdapTools 1.1.0
qdapTools is an R package that contains tools associated with the qdap package that may be useful outside of the context of text analysis.
New Features/Changes
- loc_split added to split data forms (list, vector, data.frame, matrix) on a vector of integer locations.
- matrix2long makes a long format data.frame. It takes a matrix object, stacks all columns and adds identifying columns by repeating row and column names accordingly.
- read_docx added to read in .docx documents.
- split_vector picks up a regex argument to allow for regular expression search of break location.
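The first two items can be pictured with a small base R sketch. The helpers split_at and matrix_to_long are hypothetical names of our own, and the boundary convention (each location starts a new piece) and the output column names are assumptions; the qdapTools functions may differ in both.

```r
# Hypothetical helper: split a vector so each location starts a new piece
split_at <- function(x, locs) {
  # The group id increases by one at every position listed in `locs`
  groups <- cumsum(seq_along(x) %in% locs)
  unname(split(x, groups))
}

pieces <- split_at(letters[1:6], c(3, 5))

# Hypothetical helper: stack the columns of a matrix into a long data.frame
matrix_to_long <- function(m) {
  data.frame(
    row   = rep(rownames(m), times = ncol(m)),  # row names repeat per column
    col   = rep(colnames(m), each = nrow(m)),   # each column name fills a block
    value = as.vector(m),                       # values in column-major order
    stringsAsFactors = FALSE
  )
}

m <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
long <- matrix_to_long(m)
```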
Integrated Demonstration
In this demonstration we will use url_dl to grab a .docx file from the Internet. We’ll then read this document in with read_docx. We’ll use split_vector to split the text from the .docx into a main body and a references section. rm_citation will then be used to extract in-text APA 6 style citations. Last, we will view frequencies and visualize the distribution of the citations using ggplot2. The original post links to a complete script of the R code used here.
First we’ll make sure we have the correct versions of the packages, install them if necessary, and load the required packages for the demonstration:
Map(function(x, y) {
    if (!x %in% list.files(.libPaths())) {
        install.packages(x)
    } else {
        if (packageVersion(x) < y) {
            install.packages(x)
        } else {
            message(sprintf("Version of %s is suitable for demonstration", x))
        }
    }
}, c("qdapRegex", "qdapTools"), c("0.2.0", "1.1.0"))

lapply(c("qdapRegex", "qdapTools", "ggplot2", "qdap"), require, character.only=TRUE)
Now let’s grab the .docx document, read it in, and split into body/references sections:
## Download .docx
url_dl("http://umlreading.weebly.com/uploads/2/5/2/5/25253346/whole_language_timeline-updated.docx")

## Read in .docx
txt <- read_docx("whole_language_timeline-updated.docx")

## Remove non ascii characters
txt <- rm_non_ascii(txt)

## Split into body/references sections
parts <- split_vector(txt, split = "References", include = TRUE, regex=TRUE)

## View body
parts[[1]]

## View references
parts[[2]]
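If you want to see what the split step does without downloading anything, here is a rough base R sketch on toy data. The helper split_on_pattern is a hypothetical name of our own; it breaks only at the first match and, with include = TRUE, keeps the matching element at the start of the second piece. That mirrors our reading of the call above but is an assumption about split_vector, not its implementation.

```r
# Toy stand-in for the paragraphs read from the .docx (invented text)
txt <- c("Intro paragraph.", "Body paragraph.", "References", "First reference entry.")

# Hypothetical helper: break a character vector at the first regex match
split_on_pattern <- function(x, pattern, include = TRUE) {
  hit <- grep(pattern, x)[1]       # index of the first matching element
  if (is.na(hit)) return(list(x))  # no break found: return one piece
  before <- x[seq_len(hit - 1)]
  # Keep or drop the matching element itself
  after <- if (include) x[hit:length(x)] else x[-seq_len(hit)]
  list(before, after)
}

parts <- split_on_pattern(txt, "^References$")
```

Here parts[[1]] plays the role of the body and parts[[2]] the references section.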
Now we can extract the in-text APA 6 citations and view frequencies:
## Extract citations in order of appearance
rm_citation(unbag(parts[[1]]), extract=TRUE)[[1]]

## Extract citations by section
rm_citation(parts[[1]], extract=TRUE)

## Frequency
left_just(cites <- list2df(sort(table(rm_citation(unbag(parts[[1]]),
    extract=TRUE)), TRUE), "freq", "citation")[2:1])
##    citation                                                   freq
## 1  Walker, 2008                                               14
## 2  Flesch (1955)                                              2
## 3  Adams (1990)                                               1
## 4  Anderson, Hiebert, Scott, and Wilkinson (1985)             1
## 5  Baumann & Hoffman, 1998                                    1
## 6  Baumann, 1998                                              1
## 7  Bond and Dykstra (1967)                                    1
## 8  Chall (1967)                                               1
## 9  Clay (1979)                                                1
## 10 Goodman and Goodman (1979)                                 1
## 11 McCormick & Braithwaite, 2008                              1
## 12 Read Adams (1990)                                          1
## 13 Stahl and Miller (1989)                                    1
## 14 Stahl and Millers (1989)                                   1
## 15 Word Perception Intrinsic Phonics Instruction Gates (1951) 1
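For readers curious about what extract-and-tally looks like underneath, here is a deliberately simplified base R sketch. The pattern below is our own toy regex, not the one rm_citation uses: it only handles "Name, 2008" and "Name (1955)" forms with authors joined by commas, "and", or "&", and it ignores cases such as "et al." or multiple years. The sentences are invented for illustration.

```r
# Invented sentences for illustration only
body_text <- paste(
  "Phonics was emphasized (Walker, 2008).",
  "Flesch (1955) argued for it, as did Bond and Dykstra (1967).",
  "Later work agreed (Walker, 2008)."
)

# Toy pattern (an assumption, far simpler than qdapRegex's):
# capitalized author names, then either ", YYYY" or " (YYYY)"
apa_pattern <- paste0(
  "[A-Z][a-z]+",                           # first author
  "(?:(?:,? (?:and|&) |, )[A-Z][a-z]+)*",  # further authors
  "(?:, \\d{4}| \\(\\d{4}\\))"             # year, parenthetical or narrative
)

# Pull every match in order of appearance, then count and sort
found <- regmatches(body_text, gregexpr(apa_pattern, body_text, perl = TRUE))[[1]]
cite_freq <- sort(table(found), decreasing = TRUE)
```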
Now we can find the locations of the citations in the text and plot a distribution of the in-text citations throughout the text:
## Distribution of citations (find locations)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
    m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
    data.frame(
        citation=x,
        start = m[[1]] -5,
        end = m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
    )
}))

## Plot the distribution
ggplot(cite_locs) +
    geom_segment(aes(x=start, xend=end, y=citation, yend=citation), size=3,
        color="yellow") +
    xlab("Duration") +
    scale_x_continuous(expand = c(0,0),
        limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
    theme_grey() +
    theme(
        panel.grid.major=element_line(color="grey20"),
        panel.grid.minor=element_line(color="grey20"),
        plot.background = element_rect(fill="black"),
        panel.background = element_rect(fill="black"),
        panel.border = element_rect(colour = "grey50", fill=NA, size=1),
        axis.text=element_text(color="grey50"),
        axis.title=element_text(color="grey50")
    )
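The location step is worth a closer look on its own. The sketch below uses an invented sentence and plain base R to show how gregexpr with fixed=TRUE reports where each occurrence starts and how long it is. Literal matching matters here because citations such as "Flesch (1955)" contain parentheses, which would otherwise be treated as regex metacharacters. The names body_text, starts, and ends are our own.

```r
# Invented sentence for illustration only
body_text <- "Early work (Walker, 2008) was later revisited (Walker, 2008)."

# fixed = TRUE treats the citation as a literal string, not a regex
m <- gregexpr("Walker, 2008", body_text, fixed = TRUE)[[1]]

starts <- as.integer(m)                        # character offset of each match
ends <- starts + attr(m, "match.length") - 1   # last character of each match

# Check that the offsets really point at the citation
recovered <- substring(body_text, starts, ends)
```

The demonstration code pads each start and end by 5 characters so short citations remain visible as segments in the plot. When a string is not found at all, gregexpr returns -1, so a more defensive version would filter on starts > 0 before plotting.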