A common task performed during data preparation or data analysis is the manipulation of strings.
Regular expressions are meant to assist in such and similar tasks.
A regular expression is a pattern that describes a set of strings.
Regular expressions can range from simple patterns (such as finding a single number) to complex ones (such as identifying UK postcodes).
R implements a set of “regular expression rules” that are largely shared with other programming languages, and it also supports some variants, such as Perl-like regular expressions.
Also, whether a specific pattern is found can sometimes depend on the system locale.
These patterns can be used through several base R functions, such as:
- grep
- grepl
- regexpr
- gregexpr
- sub
- gsub
- strsplit
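As a rough orientation, here is a minimal sketch of what each of these functions returns. The vector x and its values are hypothetical sample data chosen for illustration; they are not part of the exercises.

```r
# Hypothetical sample data (not the exercise data)
x <- c("apple 12", "banana", "cherry 7")

idx       <- grep("[0-9]", x)      # indices of the elements that contain a digit
found     <- grepl("[0-9]", x)     # one TRUE/FALSE per element
first_pos <- regexpr("[0-9]", x)   # position of the first match in each element, -1 if none
all_pos   <- gregexpr("[0-9]", x)  # a list with the positions of all matches per element
one_repl  <- sub("[0-9]", "#", x)  # replaces only the first match in each element
all_repl  <- gsub("[0-9]", "#", x) # replaces every match
parts     <- strsplit(x, " ")      # splits each element on a space, returns a list

idx       # 1 3
one_repl  # "apple #2" "banana" "cherry #"
all_repl  # "apple ##" "banana" "cherry #"
```

Note that regexpr and gregexpr also attach a match.length attribute to their results, which says how long each match is.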
Since this topic includes both learning a set of rules and several different R functions, I’ll split this subject into a series of three exercise sets.
Answers to the exercises are available here.
With regex you can often get correct results in more than one way, so if you have different solutions, feel free to post them.
Character class
A character class is a list of characters enclosed between square brackets (i.e. [ and ]), which matches any *single* character in that list.
For example, [0359abC] means “find a pattern with one of the characters 0, 3, 5, 9, a, b or C”.
There are some “shortcuts” that allow us to find specific ranges of digits or characters:
- [0-9] means any digit
- [A-Z] means any upper case character
- [a-z] means any lower case character
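To illustrate, here is a small sketch using an explicit list and a digit range. The vector samples is a hypothetical example, not the exercise data. I stick to the digit range here because the interpretation of letter ranges can depend on the locale.

```r
# Hypothetical sample strings (not the exercise data)
samples <- c("year 2016", "no digits here", "C3PO")

# Does each string contain at least one of the characters 0, 3, 5, 9, a, b or C?
in_list <- grepl("[0359abC]", samples)
in_list    # TRUE FALSE TRUE

# Does each string contain at least one digit?
has_digit <- grepl("[0-9]", samples)
has_digit  # TRUE FALSE TRUE
```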
Let’s create a variable called text1 and populate it with the value “The current year is 2016”.
Exercise 1
Create a variable called my_pattern and implement the required pattern for finding any digit in the variable text1.
Use the grepl function to verify whether there is a digit in the string variable.
Exercise 2
Use function gregexpr to find all the positions in text1 where there is a digit.
Place the results in a variable called string_position.
Predefined classes of characters
In many cases, we will look for specific types of characters (for example, any digit, any letter, any whitespace, etc).
For this purpose, there are several predefined classes of characters that save us a lot of typing.
Note: The interpretation of some predefined classes depends on the locale. The “standard” interpretation is that of the POSIX locale.
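Below are some “popular” predefined classes and their meaning. Note that the square brackets are part of the class name, so inside a pattern a class must itself be placed within a bracket expression (for example, [[:digit:]]):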
1. [:alnum:]
Alphanumeric characters: [:alpha:] and [:digit:].
2. [:alpha:]
Alphabetic characters: [:lower:] and [:upper:] can also be used.
3. [:digit:]
Digits: 0 1 2 3 4 5 6 7 8 9.
4. [:blank:]
Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.
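Here is a minimal sketch of how predefined classes are written inside a pattern. The string s is a hypothetical example. The class names include their own square brackets, so they appear inside a second pair of brackets, and several classes can share one bracket expression.

```r
# Hypothetical sample string (not the exercise data)
s <- "Room 42B, floor 3"

has_digit <- grepl("[[:digit:]]", s)                # is there any digit?
masked    <- gsub("[[:digit:]]", "#", s)            # replace every digit
no_blanks <- gsub("[[:blank:]]", "_", s)            # replace every space or tab
stripped  <- gsub("[[:digit:][:blank:]]", "", s)    # two classes in one bracket expression

masked     # "Room ##B, floor #"
no_blanks  # "Room_42B,_floor_3"
stripped   # "RoomB,floor"
```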
Exercise 3
Create a variable called my_pattern and implement the required pattern for finding one digit and one uppercase alphanumeric character, in variable text1.
This time, combine predefined classes in the regex pattern.
Use the grepl function to verify whether the searched pattern exists in the string.
Exercise 4
Use function regexpr to find the position of the first space in text1.
Place the results in a variable called first_space.
Special single character
The period (“.”) matches any single character.
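A quick sketch of the period in action, using a hypothetical vector words that is not part of the exercises:

```r
# Hypothetical sample words (not the exercise data)
words <- c("cat", "cot", "ct", "coat")

# "c.t" requires exactly one character (any character) between "c" and "t"
matches <- grepl("c.t", words)
matches  # TRUE TRUE FALSE FALSE
```

"ct" fails because nothing sits between the two letters, and "coat" fails because two characters do.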
Exercise 5
Create a pattern that checks in text1 if there is a lowercase character, followed by any character and then by a digit.
Exercise 6
Find the starting position of the above string. Place the results in a variable called string_pos2.
Special symbols
There are several “special symbols” that assist in the definition of specific patterns.
Note that in R you must put an extra backslash in front of these special symbols when writing them in a string (for example, "\\d" for \d):
The symbol \w matches a ‘word’ character and \W is its negation.
Symbols \d, \s, \D and \S denote the digit and space classes and their negations.
As you may have noticed, some special symbols have their parallel “predefined classes”.
(For example, \d is equivalent to [0-9] and to [[:digit:]].)
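The doubled backslash is easiest to see in a short sketch. The string s below is a hypothetical example, not the exercise data.

```r
# Hypothetical sample string (not the exercise data)
s <- "Order 66 shipped"

has_digit   <- grepl("\\d", s)     # "\\d" in R source is the regex \d
underscored <- gsub("\\s", "_", s) # replace every whitespace character
digits_only <- gsub("\\D", "", s)  # drop everything that is not a digit
word_chars  <- gsub("\\W", "", s)  # drop everything that is not a word character

underscored  # "Order_66_shipped"
digits_only  # "66"
word_chars   # "Order66shipped"

# The string literal "\\d" holds just two characters: a backslash and a d
nchar("\\d")  # 2
```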
Exercise 7
Find the following pattern: one space followed by two lowercase letters and one more space.
Use a function that returns the starting point of the found string and place its result in string_pos3.
Metacharacters
There are several metacharacters in the “regex syntax”. Here I’ll introduce two popular ones:
The caret ("^") – anchors the pattern to the beginning of the string, so the match must start there.
The dollar sign ("$") – anchors the pattern to the end of the string, so the match must end there.
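As a small illustration of both anchors, here is a sketch on a hypothetical vector x (not the exercise data):

```r
# Hypothetical sample strings (not the exercise data)
x <- c("2016 was a leap year", "The year is 2016")

starts_with_digit <- grepl("^[0-9]", x)  # a digit as the very first character
ends_with_digit   <- grepl("[0-9]$", x)  # a digit as the very last character

starts_with_digit  # TRUE FALSE
ends_with_digit    # FALSE TRUE

# Anchors work in replacement functions too: only a leading "The" is replaced
replaced <- sub("^The", "This", x)
replaced  # "2016 was a leap year" "This year is 2016"
```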
Exercise 8
Using the sub function, replace the pattern found in the previous exercise with the string “ is not ”.
Place the resulting string in text2 variable.
Repetition Characters
There are several ways of dealing with the repetition of characters in the “regex syntax”. Here I’ll introduce the “Curly brackets” syntax:
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more than m times.
By default repetition is greedy, so the maximal possible number of repeats is used.
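The three forms and the greedy behaviour can be checked with the match.length attribute that regexpr attaches to its result. This is only a sketch on a hypothetical string s, not the exercise data.

```r
# Hypothetical sample string (not the exercise data)
s <- "123456"

exact    <- regexpr("[0-9]{4}", s)    # exactly four digits
at_least <- regexpr("[0-9]{2,}", s)   # two or more digits
bounded  <- regexpr("[0-9]{2,3}", s)  # between two and three digits

# Greedy matching takes as many repeats as the quantifier allows
attr(exact, "match.length")     # 4
attr(at_least, "match.length")  # 6
attr(bounded, "match.length")   # 3

# A string with fewer than four consecutive digits does not match {4}
has_four <- grepl("[0-9]{4}", c("42", "1234"))
has_four  # FALSE TRUE
```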
Exercise 9 
Find in text2 the following pattern: four digits at the end of the string.
Use a function that returns the starting point of the found string and place its result in string_pos4.
Exercise 10
Using the substr function, and according to the position of the string found in the previous exercise, extract the first two digits found at the end of text2.
