Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
I recently needed to work with date values that look like this:
mydate |
Jan 23/2 |
Aug 5/20 |
Dec 17/2 |
I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash. I needed to think about how to program this carefully in both R and SAS, because
- the length of the day could be 1 or 2 characters long
- I needed code that adapts to this varying length from observation to observation
- there is no function in either language that is suited exactly for this purpose.
In this tutorial, I will show you how to do this in both R and SAS. I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!
Extracting a String Between 2 Characters in R
I will write a function called getstr() in R to extract a string between 2 characters. The strategy is simple:
- Find the position of the initial character and add 1 to it – that is the initial position of the desired string.
- Find the position of the final character and subtract 1 from it – that is the final position of the desired string.
- Use the substr() function to extract the desired string inclusively between the initial position and final position as found in Steps 1-2.
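As a quick check of the arithmetic in these steps, here is a minimal sketch on a single date. The variable names are my own for illustration, and it uses regexpr() with fixed = TRUE so that the space and the slash are matched as literal characters.

```r
# a single example date in the same format as the dates in this post
x <- "Aug 5/2011"

# regexpr() returns the position of the first match;
# fixed = TRUE treats the pattern as a literal character, not a regular expression
space.position <- regexpr(" ", x, fixed = TRUE)[1]
slash.position <- regexpr("/", x, fixed = TRUE)[1]

# Step 1: the desired string starts 1 position after the space
initial.position <- space.position + 1

# Step 2: the desired string ends 1 position before the slash
final.position <- slash.position - 1

# Step 3: substr() includes both the initial and the final positions
day <- substr(x, initial.position, final.position)
print(day)
# [1] "5"
```

Here the space is at position 4 and the slash is at position 6, so the day occupies position 5 only.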
##### Extracting a String Between 2 Characters in R
##### By Eric Cai - The Chemical Statistician

# clear all variables in workspace
rm(list=ls(all=TRUE))

# create a vector of 3 example dates
mydate = c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')

# getstr() is my customized function
# it extracts a string between 2 characters in a string variable
getstr = function(mystring, initial.character, final.character)
{
     # check that all 3 inputs are character variables
     if (!is.character(mystring))
     {
          stop('The parent string must be a character variable.')
     }

     if (!is.character(initial.character))
     {
          stop('The initial character must be a character variable.')
     }

     if (!is.character(final.character))
     {
          stop('The final character must be a character variable.')
     }

     # pre-allocate a character vector to store the extracted strings
     snippet = character(length(mystring))

     for (i in seq_along(mystring))
     {
          # find the initial position (1 after the first occurrence of the initial character)
          # fixed = TRUE matches the character literally, not as a regular expression
          initial.position = gregexpr(initial.character, mystring[i], fixed = TRUE)[[1]][1] + 1

          # find the final position (1 before the first occurrence of the final character)
          final.position = gregexpr(final.character, mystring[i], fixed = TRUE)[[1]][1] - 1

          # extract the substring between the initial and final positions, inclusively
          snippet[i] = substr(mystring[i], initial.position, final.position)
     }

     return(snippet)
}

# use the getstr() function to extract the day between the space and the slash in "mydate"
getstr(mydate, ' ', '/')
Here is the output from getstr() on the vector “mydate”:
> getstr(mydate, ' ', '/')
[1] "23" "5" "17"
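If you prefer to avoid the loop, the same three steps can be applied to the whole vector at once, because both regexpr() and substr() are vectorized. The following is only a sketch of that alternative; getstr_vec() is a hypothetical name of my own, not a function from any package.

```r
# the same 3 example dates as above
mydate <- c("Jan 23/2012", "Aug 5/2011", "Dec 17/2011")

# getstr_vec() is a hypothetical, loop-free variant of getstr()
getstr_vec <- function(mystring, initial.character, final.character) {
  # check that the parent string is a character vector
  stopifnot(is.character(mystring))

  # position of the first occurrence of each delimiter in every element;
  # fixed = TRUE matches the delimiters literally, and as.integer() drops
  # the match attributes that regexpr() attaches to its result
  initial.position <- as.integer(regexpr(initial.character, mystring, fixed = TRUE)) + 1L
  final.position <- as.integer(regexpr(final.character, mystring, fixed = TRUE)) - 1L

  # substr() accepts vectors of start and stop positions
  substr(mystring, initial.position, final.position)
}

result <- getstr_vec(mydate, " ", "/")
print(result)
# [1] "23" "5"  "17"
```

Like getstr(), this sketch uses the first occurrence of each delimiter, so it assumes that the initial character appears before the final character in every string.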
Extracting a String Between 2 Characters in SAS
I will write a macro program called %getstr(). It will accept a data set and the string variable as inputs, and it will create a new data set with the day extracted as a new variable.
The only tricky part in this macro program was creating the new data set name. The input data set is called “dates”, and I wanted to create a new data set called “dates2”. I accomplished that by writing &dataset.2 within the macro: the period marks the end of the macro variable reference &dataset, so the “2” is appended directly to the data set name.
First, let’s create the input data set. Notice my use of the “#” as a delimiter when inputting the dates.
data dates;
     infile datalines dlm = '#';
     input mydate $;
     datalines;
Jan 23/2015#
Aug 5/2001#
Dec 17/2007
;
run;
Let’s now write the macro program %getstr(). It will create a new data set whose name is the input data set’s name with the suffix “2”.
%macro getstr(dataset, string_variable);

data &dataset.2;
     set &dataset;

     * search the string for the position of the space after the month;
     space_position = INDEX(&string_variable, ' ');

     * search the string for the position of the slash after the day;
     slash_position = INDEX(&string_variable, '/');

     * calculate the length between the space and the slash;
     space_to_slash = slash_position - space_position;

     * extract the day from the original string (the character(s) between the space and the slash);
     * start 1 position after the space and stop 1 position before the slash;
     day = substr(&string_variable, space_position + 1, space_to_slash - 1);
run;

%mend getstr;
Let’s use the %getstr() macro program to create a new data set called “dates2” that contains the day of each date. I’ll print the results afterward.
%getstr(dates, mydate);

proc print data = dates2 noobs;
run;
Here is the output; if you prefer, you can modify the macro program to drop the variables “space_position”, “slash_position” and “space_to_slash”.
mydate | space_position | slash_position | space_to_slash | day |
---|---|---|---|---|
Jan 23/2 | 4 | 7 | 3 | 23 |
Aug 5/20 | 4 | 6 | 2 | 5 |
Dec 17/2 | 4 | 7 | 3 | 17 |
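For comparison, the positional variables from the macro can be reproduced in R as a cross-check. This is a rough sketch rather than a translation of the macro. It assumes that the dates in the SAS output are cut to 8 characters because a character variable read with a bare $ gets a default length of 8, and it mimics that cut with substr().

```r
# the 3 dates that were entered in the SAS data step
full_dates <- c("Jan 23/2015", "Aug 5/2001", "Dec 17/2007")

# assumption: keep only the first 8 characters to mimic the values shown in the SAS output
mydate <- substr(full_dates, 1, 8)

# positions of the first space and the first slash, as INDEX() finds them in the macro
space_position <- as.integer(regexpr(" ", mydate, fixed = TRUE))
slash_position <- as.integer(regexpr("/", mydate, fixed = TRUE))

# the day is the text strictly between the space and the slash
day <- substr(mydate, space_position + 1L, slash_position - 1L)

# collect everything in a data frame named after the SAS data set
dates2 <- data.frame(mydate, space_position, slash_position, day,
                     stringsAsFactors = FALSE)
print(dates2)
```

The day is still extracted correctly from the shortened values, because the space and the slash both fall within the first 8 characters of each date.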