Useful Functions in R for Manipulating Text Data
Introduction
In my current job, I study HIV at the genetic and biochemical levels. Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text. (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from the HIV’s RNA.) In this post, I describe some common functions in R that I often use for text processing.
Obtaining Basic Information about Character Variables
In R, I often work with text data in the form of character variables. To check if a variable is a character variable, use the is.character() function.
> year = 2014
> is.character(year)
[1] FALSE
If a variable is not a character variable, you can convert it to a character variable using the as.character() function.
> year.char = as.character(year)
> is.character(year.char)
[1] TRUE
A basic piece of information about a character variable is the number of characters in the string. Use the nchar() function to obtain this information.
> nchar(year.char)
[1] 4
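These functions also work on whole vectors, which is convenient when there are many sequences to check at once. The following is a minimal sketch; the vector "seqs" and its contents are made up for illustration and are not from my data.

```r
# Hypothetical example sequences (invented for illustration)
seqs <- c("ATCG", "GGACTC", "CT")

# is.character() reports on the vector as a whole, not on each element
seqs.are.text <- is.character(seqs)
seqs.are.text

# nchar() is vectorised: it returns one length per element
seq.lengths <- nchar(seqs)
seq.lengths

# A number is coerced to a string before its characters are counted
year.length <- nchar(2014)
year.length
```

Here, seq.lengths holds 4, 6 and 2, and year.length is 4, just as with the converted "year.char" above.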
Pattern Matching and Manipulation
I often need to combine several character variables into one string, and the paste() function is useful for that. Notice my use of the “sep =” option to specify that I want to separate the variables with 1 space.
> first = 'The'
> second = 'Chemical'
> third = 'Statistician'
> my.name = paste(first, second, third, sep = ' ')
> my.name
[1] "The Chemical Statistician"
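Two related options are worth knowing: paste0() joins its arguments with no separator, and the "collapse =" option joins the elements of a single vector into one string. Here is a short sketch of both; the vector "codons" is a hypothetical example, not real data.

```r
first <- 'The'
second <- 'Chemical'
third <- 'Statistician'

# paste0() is paste() with an empty separator
no.spaces <- paste0(first, second, third)
no.spaces

# collapse joins the elements of ONE vector into a single string
words <- c(first, second, third)
my.name2 <- paste(words, collapse = ' ')
my.name2

# Hypothetical codons glued back into one sequence
codons <- c('ATG', 'GGA', 'TAA')
full.seq <- paste(codons, collapse = '')
full.seq
```

The difference matters for vectors: "sep" goes between the separate arguments, while "collapse" goes between the elements of a vector.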
A common task in my job is determining whether or not a short sequence of nucleotides or amino acids is present in a much longer sequence. Essentially, I want to determine if a pattern of text exists in a character variable. The grepl() function is useful for that; in fact, the pattern of interest can be searched for in multiple character variables simultaneously – just combine the variables using the c() function!
> x = 'ATCG'
> y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
> z = 'CTATCGGGTAGCT'
> grepl(x, c(y, z))
[1] TRUE TRUE
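One thing to keep in mind is that grepl() treats the pattern as a regular expression by default. For a plain sequence like 'ATCG' this makes no difference, but if the pattern contains special characters such as "." then setting "fixed = TRUE" forces a literal match. A small sketch follows; the sequence "w" is a made-up example that lacks the motif.

```r
x <- 'ATCG'
y <- 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
z <- 'CTATCGGGTAGCT'
w <- 'GGGCCC'  # hypothetical sequence that does not contain x

# fixed = TRUE matches the pattern literally, character for character
hits <- grepl(x, c(y, z, w), fixed = TRUE)
hits

# As a regular expression, "." matches any single character
regex.hit <- grepl('A.CG', 'AGCG')

# With fixed = TRUE, "." must be a literal dot, so there is no match
literal.hit <- grepl('A.CG', 'AGCG', fixed = TRUE)
```

In this sketch, hits is TRUE TRUE FALSE, regex.hit is TRUE and literal.hit is FALSE.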
If you want to determine precisely where “x” is located along “y” and along “z”, use the gregexpr() function.
> gregexpr(x, c(y, z))
[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 3
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE
The output of gregexpr(x, c(y, z)) is a list of 2 objects.
- The first object contains the positional information about the pattern “x” in the variable “y”.
  - “x” appears twice in the variable “y” – at positions 19 and 25. (Specifically, the “A” in x = ‘ATCG’ appears at positions 19 and 25.)
- The second object contains the positional information about the pattern “x” in the variable “z”.
  - “x” appears once in the variable “z” – at position 3.
To extract these positions, you must first slice the list into its 2 objects – use double square brackets, [[ ]], to do this. Then, you can extract the positions from each object – use single square brackets, [ ], to do this. For simplicity, let’s assign the output of gregexpr(x, c(y, z)) to a variable named “pos”.
> pos = gregexpr(x, c(y, z))
> pos[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE
> pos[[1]][1]
[1] 19
> pos[[1]][2]
[1] 25
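When there are many sequences, indexing each list element by hand gets tedious. The sketch below shows one possible way to pull out all the start positions at once with lapply(). It also covers the case where the pattern is absent, for which gregexpr() returns -1. The sequence "w" and the names "starts" and "n.matches" are my own, chosen for illustration.

```r
x <- 'ATCG'
y <- 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
z <- 'CTATCGGGTAGCT'
w <- 'GGGCCC'  # hypothetical sequence that does not contain x

pos <- gregexpr(x, c(y, z, w), fixed = TRUE)

# gregexpr() reports -1 when there is no match, so keep only the
# positive positions; subsetting with [ ] also drops the attributes
starts <- lapply(pos, function(p) p[p > 0])
starts

# Number of (non-overlapping) matches found in each sequence
n.matches <- sapply(starts, length)
n.matches
```

Here, starts contains 19 and 25 for “y”, 3 for “z”, and an empty vector for “w”, so n.matches is 2, 1 and 0.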
If you want to extract a portion of a string, use the substr() function. For example, if I know that the first 3 nucleotides of a particular DNA sequence are junk, I would want to discard them and extract the rest of that sequence only. Let’s use the variable “y” to illustrate this.
> y
[1] "GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
> substr(y, 4, nchar(y))
[1] "CTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
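A close relative of substr() is substring(), which runs to the end of the string when no end position is given and accepts vectors of start and end positions. The sketch below uses it to trim “y” and to pull out the 4-character windows at the match positions found earlier (19 and 25); the variable names are my own.

```r
y <- 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'

# With no end position, substring() extracts through the end of the string
trimmed <- substring(y, 4)
trimmed

# Start positions of 'ATCG' in y, as reported by gregexpr() above
starts <- c(19, 25)

# substring() is vectorised over the start and end positions,
# so this extracts one 4-character window per start position
windows <- substring(y, starts, starts + 3)
windows
```

In this case, trimmed is the same string that substr(y, 4, nchar(y)) returns, and both elements of windows are "ATCG", which confirms the positions.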
Further Information
John Myles White, who co-wrote the excellent “Machine Learning for Hackers” with Drew Conway, has a nice blog entry on some other useful functions for text processing in R. If you have any more suggestions, please share them in the comments!