Useful Functions in R for Manipulating Text Data

Eric Cai - The Chemical Statistician

8 years ago

[This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers.]

Introduction

In my current job, I study HIV at the genetic and biochemical levels. Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text. (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from the HIV’s RNA.) In this post, I describe some common functions in R that I often use for text processing.

Obtaining Basic Information about Character Variables

In R, I often work with text data in the form of character variables. To check if a variable is a character variable, use the is.character() function.

> year = 2014
> is.character(year)
[1] FALSE

If a variable is not a character variable, you can convert it to a character variable using the as.character() function.

> year.char = as.character(year)
> is.character(year.char)
[1] TRUE
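Both functions also accept whole vectors, which helps when a column of identifiers has been read in as numbers. Here is a minimal sketch; the vector "sample.ids" is a hypothetical example that I made up for illustration.

```r
# Hypothetical numeric sample identifiers (made up for illustration)
sample.ids = c(101, 102, 103)

# is.character() gives one answer for the whole vector, not one per element
is.character(sample.ids)          # FALSE

# as.character() converts every element of the vector
sample.ids.char = as.character(sample.ids)
sample.ids.char                   # "101" "102" "103"
is.character(sample.ids.char)     # TRUE
```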

A basic piece of information about a character variable is the number of characters in its string. Use the nchar() function to obtain this information.

> nchar(year.char)
[1] 4
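It is easy to confuse nchar() with length(), so a short sketch of the difference may help. The vector "seqs" below is a hypothetical example, not data from my work.

```r
# Hypothetical short sequences (made up for illustration); the last one is empty
seqs = c('ATCG', 'GGACTC', '')

# nchar() returns one character count per element
nchar(seqs)       # 4 6 0

# length() counts the elements of the vector, not the characters in them
length(seqs)      # 3
```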

Pattern Matching and Manipulation

I often need to combine several character variables into one string, and the paste() function is useful for that. Notice my use of the “sep =” argument to specify that I want to separate the variables with a single space (which is also the default).

> first = 'The'
> second = 'Chemical'
> third = 'Statistician'
> my.name = paste(first, second, third, sep = ' ')
> my.name
[1] "The Chemical Statistician"
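Two close relatives of this call are worth knowing: paste0(), which joins without a separator, and the “collapse =” argument, which joins the elements of a single vector. The sketch below uses hypothetical inputs chosen only for illustration.

```r
# paste0() behaves like paste() with sep = '' (no separator)
paste0('AT', 'CG')                                   # "ATCG"

# collapse joins the elements of one vector into a single string
bases = c('A', 'T', 'C', 'G')
joined = paste(bases, collapse = '')
joined                                               # "ATCG"

# sep works within each pair of elements; collapse then joins the results
labels = paste(c('seq', 'seq'), 1:2, sep = '_', collapse = ', ')
labels                                               # "seq_1, seq_2"
```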

A common task in my job is determining whether or not a short sequence of nucleotides/amino acids is present in a much longer sequence. Essentially, I want to determine if a pattern of text exists in a character variable. The grepl() function is useful for that; in fact, the pattern of interest can be searched in multiple character variables simultaneously – just combine the variables using the c() function!

> x = 'ATCG'
> y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
> z = 'CTATCGGGTAGCT'
> grepl(x, c(y, z))
[1] TRUE TRUE
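By default, grepl() treats the pattern as a regular expression. For plain nucleotide letters that makes no difference, but setting fixed = TRUE asks for a literal match, which is safer if a pattern could ever contain special characters. Here is a small sketch; the sequence “w” is a hypothetical non-matching example that I added for contrast.

```r
x = 'ATCG'
y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
z = 'CTATCGGGTAGCT'
w = 'GGGCCC'    # hypothetical sequence that does not contain the pattern

# fixed = TRUE matches the pattern literally rather than as a regular expression
hits = grepl(x, c(y, z, w), fixed = TRUE)
hits                          # TRUE TRUE FALSE

# Logical indexing keeps only the sequences that contain the pattern
matching = c(y, z, w)[hits]
length(matching)              # 2
```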

If you want to determine precisely where “x” is located along “y” and along “z”, use the gregexpr() function.

> gregexpr(x, c(y, z))
[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 3
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

The output of gregexpr(x, c(y, z)) is a list of 2 objects.

The first object contains the positional information about the pattern “x” in the variable “y”.
- “x” appears twice in the variable “y” – at positions 19 and 25. (Specifically, the “A” in x = ‘ATCG’ appears at positions 19 and 25.)
The second object contains the positional information about the pattern “x” in the variable “z”.
- “x” appears once in the variable “z” – at position 3.
In both objects, the “match.length” attribute gives the length of each match, which is 4 characters here.

To extract these positions, you must first slice the list into its 2 objects – use double square brackets, [[ ]], to do this. Then, you can extract the positions from each object – use single square brackets, [ ], to do this. For simplicity, let’s assign the output of gregexpr(x, c(y, z)) to a variable named “pos”.

> pos = gregexpr(x, c(y, z))

> pos[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE

> pos[[1]][1]
[1] 19

> pos[[1]][2]
[1] 25
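If you only need the start positions and not the attributes, one possible approach is to strip the attributes from every element of the list at once. This is a minimal sketch rather than the only way to do it; the names “starts” and “no.match”, and the sequence ‘GGGCCC’, are hypothetical and used only for illustration.

```r
x = 'ATCG'
y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
z = 'CTATCGGGTAGCT'
pos = gregexpr(x, c(y, z))

# as.integer() drops the attributes and leaves just the start positions
starts = lapply(pos, as.integer)
starts[[1]]     # 19 25
starts[[2]]     # 3

# When the pattern is absent, gregexpr() reports a single position of -1
no.match = as.integer(gregexpr(x, 'GGGCCC')[[1]])
no.match        # -1
```

Checking for -1 before using the positions avoids treating a failed search as a real location.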

If you want to extract a portion of a string, use the substr() function. For example, if I know that the first 3 nucleotides of a particular DNA sequence are junk, I would want to discard them and extract the rest of that sequence only. Let’s use the variable “y” to illustrate this.

> y
[1] "GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
> substr(y, 4, nchar(y))
[1] "CTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
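Two related uses may be handy. The substring() function runs to the end of the string when you give it only a start position, and substr() can pull out a match once you know where it starts. The sketch below reuses “y” and the first match position (19) found with gregexpr() earlier; the names “rest” and “motif” are hypothetical.

```r
y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'

# substring() extracts to the end of the string when only a start is given,
# so nchar(y) is not needed
rest = substring(y, 4)
rest == substr(y, 4, nchar(y))     # TRUE

# Extract the 4 characters starting at position 19, the first 'ATCG' match
motif = substr(y, 19, 22)
motif                              # "ATCG"
```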

Further Information

John Myles White, who co-wrote the excellent “Machine Learning for Hackers” with Drew Conway, has a nice blog entry on some other useful functions for text processing in R. If you have any more suggestions, please share them in the comments!

