Fixing APA citations from Pandoc with stringr
Pandoc is awesome. It’s the universal translator for plain-text
documents. I especially like that it can do inline citations. I write
@Jones2005 proved aliens exist
and pandoc produces “Jones (2005) proved
aliens exist”.
But it doesn’t quite do APA style citations correctly. A citation
like @SimpsonFlanders2006 found...
renders as “Simpson & Flanders (2006)
found…”. Inline citations are not supposed to have an ampersand. It should be
“Simpson and Flanders (2006) found…”.
In the grand scheme of writing and revising, these errors are tedious low-level stuff. But I have colleagues who will read a draft of a manuscript and write unnecessary comments about how to cite stuff in APA. And the problem is just subtle and pervasive enough that it doesn’t make sense to manually fix the citations each time I generate my manuscript. My current project has 15 of these ill-formatted citations. That number is just big enough to make manual correction error-prone: it is easy to miss 1 in 15.
Find and replace
I wrote a quick R function that replaces all those inlined ampersands with “and”s.
library("stringr")
fix_inline_citations <- function(text) {
Let’s assume that an inline citation ends with an author’s last name followed by a parenthesized year: SomeKindOfName (2001). We encode these assumptions into regular expression patterns, prefixed with re_.
The year is pretty easy. If it looks weird, it’s because I prefer to escape special punctuation like ( using brackets like [(]. Otherwise, a year is just four digits: \\d{4}.
re_inline_year <- "[(]\\d{4}[)]"
What’s in a name? Here we have to stick our necks out a little bit more about our assumptions. I’m going to assume a last name is any combination of letters, hyphens and spaces (spaces needed for von Name).
re_author <- "[[:alpha:]- ]+"
re_author_year <- paste(re_author, re_inline_year)
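As a quick sanity check on these sub-patterns, here is a small sketch run on its own, outside the function. The patterns are the ones defined above; the test strings and the names `candidates` and `matches` are made up for illustration.

```r
library("stringr")

# Patterns copied from the function body above
re_inline_year <- "[(]\\d{4}[)]"
re_author <- "[[:alpha:]- ]+"
re_author_year <- paste(re_author, re_inline_year)

# Hypothetical test strings (not taken from the manuscript)
candidates <- c(
  "Name (2005)",
  "Hyphen-Name (2005)",
  "von Name (2005)",
  "Name, 2005)")

# Only the last string lacks the "Name (year)" form
matches <- str_detect(candidates, re_author_year)
matches
#> [1]  TRUE  TRUE  TRUE FALSE
```

Note that paste() joins the two pieces with a single space, which is what puts the space between the name and the opening parenthesis.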
We define the ampersand.
re_ampersand <- " & "
Lookaround, lookaround. Our last regular expression trick is positive lookahead.
Suppose we want just the word “hot” from the larger word “hotdog”. Using just hot would match too many things, like the “hot” in “hoth”. Using hotdog would match the whole word “hotdog”, which is more than we asked for. Lookaround patterns allow us to impose more constraints on a pattern. In the “hotdog” example, positive lookahead hot(?=dog) says find “hot” only if it is immediately followed by “dog”.
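The “hotdog” example can be tried directly. This is a minimal sketch using stringr; the vector `words` and the name `hot_before_dog` are just for illustration.

```r
library("stringr")

words <- c("hotdog", "hoth", "hot")

# Plain "hot" is found in all three words
plain <- str_detect(words, "hot")

# Positive lookahead: match "hot" only when "dog" comes next.
# The lookahead checks for "dog" but leaves it out of the match.
hot_before_dog <- str_extract(words, "hot(?=dog)")
hot_before_dog
#> [1] "hot" NA    NA
```

Because “dog” is not part of the match, a replacement on this pattern would change only “hot” and leave “dog” in place. That is the property we need for swapping the ampersand without touching the author name and year.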
We use positive lookahead to find only the ampersands followed by an author name and a year. We replace the strings that match this pattern with and’s.
re_ampersand_author_year <- sprintf("%s(?=%s)", re_ampersand, re_author_year)
str_replace_all(text, re_ampersand_author_year, " and ")
}
We can now test our function on a variety of names that it should and should not fix.
do_fix <- c(
"Jones & Name (2005) found...",
"Jones & Hyphen-Name (2005) found...",
"Jones & Space Name (2005) found...",
"Marge, Maggie, & Lisa (2005) found...")
fix_inline_citations(do_fix)
#> [1] "Jones and Name (2005) found..."
#> [2] "Jones and Hyphen-Name (2005) found..."
#> [3] "Jones and Space Name (2005) found..."
#> [4] "Marge, Maggie, and Lisa (2005) found..."
do_not_fix <- c(
"...have been found (Jones & Name, 2005)",
"...have been found (Jones & Hyphen-Name, 2005)",
"...have been found (Jones & Space Name, 2005)",
"...have been found (Marge, Maggie, & Lisa, 2005)")
fix_inline_citations(do_not_fix)
#> [1] "...have been found (Jones & Name, 2005)"
#> [2] "...have been found (Jones & Hyphen-Name, 2005)"
#> [3] "...have been found (Jones & Space Name, 2005)"
#> [4] "...have been found (Marge, Maggie, & Lisa, 2005)"
By the way, our final regular expression re_ampersand_author_year is
& (?=[[:alpha:]- ]+ [(]\d{4}[)])
It’s not very readable or comprehensible in that form, so that’s why we built it up step by step from easier sub-patterns like re_author and re_inline_year. (Which is a micro-example of the strategy of managing complexity by combining/composing simpler primitives.)
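To see the composed pattern for yourself, the sub-patterns can be rebuilt outside the function and printed. This sketch needs only base R and reuses the definitions from the function body.

```r
# Rebuild the pattern from its parts
re_inline_year <- "[(]\\d{4}[)]"
re_author <- "[[:alpha:]- ]+"
re_author_year <- paste(re_author, re_inline_year)
re_ampersand <- " & "
re_ampersand_author_year <- sprintf("%s(?=%s)", re_ampersand, re_author_year)

# writeLines() shows the pattern as the regex engine sees it,
# with the R string escape "\\d" printed as "\d"
writeLines(re_ampersand_author_year)
#>  & (?=[[:alpha:]- ]+ [(]\d{4}[)])
```

The printed pattern starts with a space because re_ampersand includes the spaces on both sides of the ampersand.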
Steps towards production
These are complications that arose as I tried to use the function on my actual manuscript:
Placing it in a build pipeline. My text starts with an RMarkdown file that is knitted into a markdown file and rendered into other formats by pandoc. Because this function post-processes output from pandoc, I can’t just hit the “Knit” button in RStudio. I had to make a separate script that calls rmarkdown::render to convert my .Rmd file into a .md file, which can then be processed by this function.
Don’t fix too much. When pandoc does your references for you, it also does a bibliography section. But it would be wrong to fix the ampersands there. So I have to do a bit of fussing around by finding the line "## References" and processing just the text up until that line.
Accounting for encoding. I use readr::read_lines and stringi::stri_write_lines to read and write the text file to preserve the encoding of characters. (readr just released its own write_lines today actually, so I can’t vouch for it yet.)
False matches are still possible. Suppose I’m citing a publication by an organization, like Johnson & Johnson, where that ampersand is part of the name. That citation would wrongly be corrected. I have yet to face that issue in practice though.
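The “don’t fix too much” step might look something like the sketch below. This is not my actual build script: the helper name `fix_manuscript_body`, its `heading` argument, and the sample lines are hypothetical, and the function from this post is restated in compact form so the block runs on its own. The last sample line is contrived so that it would trigger the pattern if it were not skipped.

```r
library("stringr")

# Compact restatement of the function from this post
fix_inline_citations <- function(text) {
  str_replace_all(text, " & (?=[[:alpha:]- ]+ [(]\\d{4}[)])", " and ")
}

# Hypothetical helper: fix only the lines before the references heading.
# `lines` is a character vector with one element per line of the .md file,
# e.g. as returned by readr::read_lines().
fix_manuscript_body <- function(lines, heading = "## References") {
  heading_at <- which(lines == heading)
  # No heading found: treat the whole document as body text
  if (length(heading_at) == 0) {
    return(fix_inline_citations(lines))
  }
  body <- seq_len(heading_at[1] - 1)
  lines[body] <- fix_inline_citations(lines[body])
  lines
}

# Made-up document for illustration
md_lines <- c(
  "Simpson & Flanders (2006) found...",
  "...as found earlier (Jones & Name, 2005).",
  "",
  "## References",
  "Reprinted in Jones & Name (2005).")

fixed <- fix_manuscript_body(md_lines)
fixed[1]
#> [1] "Simpson and Flanders (2006) found..."
```

Everything from the heading onward is returned unchanged, so the contrived last line keeps its ampersand. The result could then be written back out with stringi::stri_write_lines, as described above.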