Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
I recently needed to work with date values that look like this:
| mydate | 
| Jan 23/2 | 
| Aug 5/20 | 
| Dec 17/2 | 
I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash. I needed to think about how to program this carefully in both R and SAS, because
- the length of the day could be 1 or 2 characters long
- I needed a code that adapted to this varying length from observation to observation
- there is no function in either language that is suited exactly for this purpose.
In this tutorial, I will show you how to do this in both R and SAS. I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!
Extracting a String Between 2 Characters in R
I will write a function called getstr() in R to extract a string between 2 characters. The strategy is simple:
- Find the position of the initial character and add 1 to it – that is the initial position of the desired string.
- Find the position of the final character and subtract 1 from it – that is the final position of the desired string.
- Use the substr() function to extract the desired string inclusively between the initial position and final position as found in Steps 1-2.
##### Extracting a String Between 2 Characters in R
##### By Eric Cai - The Chemical Statistician
# clear all variables in workspace
rm(list=ls(all=TRUE))
# create a vector of 3 example dates
mydate = c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')
# getstr() is my customized function
# it extracts a string between 2 characters in a string variable
getstr = function(mystring, initial.character, final.character)
{
 
     # check that all 3 inputs are character variables
     if (!is.character(mystring))
     {
          stop('The parent string must be a character variable.')
     }
 
     if (!is.character(initial.character))
     {
          stop('The initial character must be a character variable.')
     }
 
 
     if (!is.character(final.character))
     {
          stop('The final character must be a character variable.')
     }
  
 
     # pre-allocate a vector to store the extracted strings
     snippet = rep(0, length(mystring))
 
 
 
     for (i in 1:length(mystring))
     {
          # extract the initial position
          initial.position = gregexpr(initial.character, mystring[i])[[1]][1] + 1
  
          # extract the final position
          final.position = gregexpr(final.character, mystring[i])[[1]][1] - 1
 
          # extract the substring between the initial and final positions, inclusively
          snippet[i] = substr(mystring[i], initial.position, final.position)
     }
 
     return(snippet)
}
# use the getstr() function to extract the day between the comma and the slash in "mydate"
getstr(mydate, ' ', '/')
Here is the output from getstr() on the vector “mydate”
> getstr(mydate, ' ', '/') [1] "23" "5" "17"
Extracting a String Between 2 Characters in SAS
I will write a macro program called %getstr(). It will accept a data set and the string variable as inputs, and it will create a new data set with the day extracted as a new variable.
The only tricky part in this macro program was creating a new data set name. The input data set is called “dates”, and I wanted to create a new data set called “dates2″. I accomplished that by appending %dataset with “.2″ within the macro.
First, let’s create the input data set. Notice my use of the “#” as a delimiter when inputting the dates.
data dates; infile datalines dlm = '#'; input mydate $; datalines; Jan 23/2015# Aug 5/2001# Dec 17/2007 ; run;
Let’s now write the macro program %getstr(). It will create a new data set with the appendix “2”.
%macro getstr(dataset, string_variable);
data &dataset.2;
          set &dataset;
          * search the string for the position of the space after the month;
           space_position = INDEX(&string_variable, ' ');
          * search the string for the position of the slash after the month;
           slash_position = INDEX(&string_variable, '/');
          * calculate the length between the space and the slash;
           space_to_slash = slash_position - space_position;
          * extract the day from the original string (the character(s) between the space and the slash;
           day = substr(&string_variable, space_position, space_to_slash);
run;
%mend getstr;
Let’s use the %getstr() macro program to create a new data set called “dates2″ that contains the day of each date. I’ll print the results afteward.
%getstr(dates, mydate); proc print data = dates2 noobs; run;
Here is the output; if you prefer, you can modify the macro program to drop the variables “space_position” and “substring_afterspace”.
| mydate | space_position | substring_afterspace | day | 
|---|---|---|---|
| Jan 23/2 | 4 | 23/2 | 23 | 
| Aug 5/20 | 4 | 5/20 | 5 | 
| Dec 17/2 | 4 | 17/2 | 17 | 
Filed under: Data Analysis, R programming, SAS Programming Tagged: dates, macro, macro program, R, R programming, SAS, text, text processing
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
