
Text Data Analysis in R: Understanding grep, grepl, sub and gsub

[This article was first published on A Statistician's R Notebook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

https://carlalexander.ca/beginners-guide-regular-expressions/
< section id="introduction" class="level2">

Introduction

In text data analysis, being able to search for patterns, validate their existence, and perform substitutions is crucial. R provides powerful base functions like grep, grepl, sub, and gsub to handle these tasks efficiently. This blog post will delve into how these functions work, using examples ranging from simple to complex, to show how they can be leveraged for text manipulation, classification, and grouping tasks.

< section id="understanding-grep-and-grepl" class="level2">

1. Understanding grep and grepl

< section id="what-is-grep" class="level3">

What is grep?

< section id="what-is-grepl" class="level3">

What is grepl?

< section id="differences-advantages-and-disadvantages" class="level3">

Differences, Advantages, and Disadvantages
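As a minimal sketch of the difference between the two functions, the following uses a small hypothetical vector (`fruit_text` is not part of the original post). grep reports where the matches are, while grepl reports, for every element, whether it matches.

```r
# Hypothetical example vector, used only for illustration
fruit_text <- c("apple pie", "banana split", "cherry tart", "apple crumble")

# grep() returns the integer positions of the matching elements
positions <- grep("apple", fruit_text)                 # 1 4

# With value = TRUE, grep() returns the matching elements themselves
matches <- grep("apple", fruit_text, value = TRUE)

# grepl() returns one TRUE/FALSE per element, so its length equals length(fruit_text)
flags <- grepl("apple", fruit_text)                    # TRUE FALSE FALSE TRUE

# which() converts the logical vector back into positions
same_result <- identical(which(flags), positions)      # TRUE
```

In practice, grep is convenient for subsetting by position, and grepl fits naturally into ifelse calls and new data frame columns.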

< section id="using-sub-and-gsub-for-text-substitution" class="level2">

2. Using sub and gsub for Text Substitution

< section id="what-is-sub" class="level3">

What is sub?

< section id="what-is-gsub" class="level3">

What is gsub?

< section id="differences-advantages-and-disadvantages-1" class="level3">

Differences, Advantages, and Disadvantages
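A minimal sketch of the difference, using a hypothetical vector (`fruit` is an assumption made for illustration): sub stops after the first match in each element, whereas gsub keeps going until every match is replaced.

```r
# Hypothetical example vector, used only for illustration
fruit <- c("banana", "papaya")

# sub() replaces only the first match in each element
first_only <- sub("a", "A", fruit)    # "bAnana" "pApaya"

# gsub() replaces every match in each element
all_matches <- gsub("a", "A", fruit)  # "bAnAnA" "pApAyA"
```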

< section id="practical-examples-with-a-synthetic-dataset" class="level2">

3. Practical Examples with a Synthetic Dataset

< section id="example-dataset" class="level3">

Example Dataset

For the purposes of this blog post, we’ll create a synthetic dataset. This dataset is a data frame that contains two columns: id and text. Each row represents a unique text entry with a corresponding identifier.

# Creating a synthetic data frame
text_data <- data.frame(
  id = 1:15,
  text = c("Cats are great pets.",
           "Dogs are loyal animals.",
           "Birds can fly high.",
           "Fish swim in water.",
           "Horses run fast.",
           "Rabbits hop quickly.",
           "Cows give milk.",
           "Sheep have wool.",
           "Goats are curious creatures.",
           "Lions are the kings of the jungle.",
           "Tigers have stripes.",
           "Elephants are large animals.",
           "Monkeys are very playful.",
           "Giraffes have long necks.",
           "Zebras have black and white stripes.")
)
< section id="explanation-of-the-dataset" class="level3">

Explanation of the Dataset

< section id="applying-grep-grepl-sub-and-gsub" class="level3">

Applying grep, grepl, sub, and gsub

< section id="example-1-using-grep-to-find-specific-words" class="level4">

Example 1: Using grep to find specific words

# Find rows containing the word 'are'
indices <- grep("are", text_data$text, ignore.case = TRUE)
result_grep <- text_data[indices, ]
result_grep
   id                               text
1   1               Cats are great pets.
2   2            Dogs are loyal animals.
9   9       Goats are curious creatures.
10 10 Lions are the kings of the jungle.
12 12       Elephants are large animals.
13 13          Monkeys are very playful.

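Explanation: grep("are", text_data$text, ignore.case = TRUE) searches the text column for the pattern “are”, ignoring case, and returns the indices of the matching elements; indexing text_data with those indices returns the corresponding rows. The pattern matches “are” anywhere in a string, not only as a separate word, although in this dataset every match happens to be the word “are”.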
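To restrict a search to whole words, the pattern can be wrapped in word boundaries (\\b). The sketch below uses a hypothetical vector (`sentences` is not from the dataset above) in which “are” also appears inside longer words.

```r
# Hypothetical sentences, used only for illustration
sentences <- c("Cats are great pets.", "Take care of them.", "Share the food.")

# Plain substring match: also hits "care" and "Share"
substring_hits <- grep("are", sentences, value = TRUE)

# Whole-word match: \\b marks a word boundary on each side of "are"
word_hits <- grep("\\bare\\b", sentences, value = TRUE)
```

Here substring_hits contains all three sentences, while word_hits contains only the first.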

< section id="example-2-applying-grepl-for-conditional-checks" class="level4">

Example 2: Applying grepl for conditional checks

# Add a new column indicating if the word 'fly' is present

text_data$contains_fly <- grepl("fly", text_data$text)
text_data
   id                                 text contains_fly
1   1                 Cats are great pets.        FALSE
2   2              Dogs are loyal animals.        FALSE
3   3                  Birds can fly high.         TRUE
4   4                  Fish swim in water.        FALSE
5   5                     Horses run fast.        FALSE
6   6                 Rabbits hop quickly.        FALSE
7   7                      Cows give milk.        FALSE
8   8                     Sheep have wool.        FALSE
9   9         Goats are curious creatures.        FALSE
10 10   Lions are the kings of the jungle.        FALSE
11 11                 Tigers have stripes.        FALSE
12 12         Elephants are large animals.        FALSE
13 13            Monkeys are very playful.        FALSE
14 14            Giraffes have long necks.        FALSE
15 15 Zebras have black and white stripes.        FALSE

Explanation: grepl("fly", text_data$text) checks each element of the text column for the presence of the word “fly” and returns a logical vector. This vector is then added as a new column contains_fly.
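Because grepl returns a logical vector, the same result can also be used to count or filter rows. A small sketch, assuming a hypothetical three-row data frame (`animals`) rather than the full dataset:

```r
# Hypothetical data frame, used only for illustration
animals <- data.frame(
  id = 1:3,
  text = c("Birds can fly high.", "Fish swim in water.", "Bats fly at night."),
  stringsAsFactors = FALSE
)

has_fly <- grepl("fly", animals$text)

# Count the matching rows (TRUE counts as 1)
n_fly <- sum(has_fly)                     # 2

# Keep only the matching rows, or only the non-matching texts
fly_rows <- animals[has_fly, ]
no_fly_text <- animals[!has_fly, "text"]  # "Fish swim in water."
```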

< section id="example-3-using-sub-to-replace-a-pattern-in-text" class="level4">

Example 3: Using sub to replace a pattern in text

# Replace the first occurrence of ' a ' (a standalone 'a' between spaces) with ' A '

text_data$text_sub <- sub(" a ", " A ", text_data$text)
text_data[,c("text","text_sub")]
                                   text                             text_sub
1                  Cats are great pets.                 Cats are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.

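Explanation: sub(" a ", " A ", text_data$text) replaces the first occurrence of “ a ” (a standalone lowercase “a” with a space on each side) with “ A ” in each element of the text column and stores the result in a new column, text_sub. None of the sentences in this dataset contains a standalone “a”, so text_sub is identical to text.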

< section id="example-4-applying-gsub-for-global-pattern-replacement" class="level4">

Example 4: Applying gsub for global pattern replacement

# Replace all occurrences of ' a ' (a standalone 'a' between spaces) with ' A '

text_data$text_gsub <- gsub(" a ", " A ", text_data$text)
text_data[,c("text","text_gsub")]
                                   text                            text_gsub
1                  Cats are great pets.                 Cats are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.

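Explanation: gsub(" a ", " A ", text_data$text) replaces all occurrences of “ a ” with “ A ” in each element of the text column and stores the result in a new column, text_gsub. Because no sentence in this dataset contains a standalone “a”, text_gsub is again identical to text; sub and gsub give different results only when an element contains more than one match.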
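To see the space-delimited pattern actually change something, here is a sketch on two hypothetical sentences (`story` and `short` are assumptions, not part of the dataset). It also shows one limitation of matching on literal spaces: an “a” at the very start of a string has no leading space and is missed, whereas word boundaries catch it.

```r
# Hypothetical sentences, used only for illustration
story <- "Once a cat and a dog met a bird."

first_only <- sub(" a ", " A ", story)   # only the first standalone "a" changes
all_hits <- gsub(" a ", " A ", story)    # every standalone "a" changes

# A sentence-initial "a" is not preceded by a space
short <- "a cat saw a dog"
with_spaces <- gsub(" a ", " A ", short)        # "a cat saw A dog"
with_boundaries <- gsub("\\ba\\b", "A", short)  # "A cat saw A dog"
```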

< section id="example-5-text-based-grouping-and-assignment" class="level3">

Example 5: Text-based Grouping and Assignment

Let’s group the texts based on the presence of the word “fly” and assign a category.

# Add a new column 'category' based on the presence of the word 'fly'

text_data$category <- ifelse(grepl("fly", text_data$text, ignore.case = TRUE), "Can Fly", "Cannot Fly")
text_data[,c("text","category")]
                                   text   category
1                  Cats are great pets. Cannot Fly
2               Dogs are loyal animals. Cannot Fly
3                   Birds can fly high.    Can Fly
4                   Fish swim in water. Cannot Fly
5                      Horses run fast. Cannot Fly
6                  Rabbits hop quickly. Cannot Fly
7                       Cows give milk. Cannot Fly
8                      Sheep have wool. Cannot Fly
9          Goats are curious creatures. Cannot Fly
10   Lions are the kings of the jungle. Cannot Fly
11                 Tigers have stripes. Cannot Fly
12         Elephants are large animals. Cannot Fly
13            Monkeys are very playful. Cannot Fly
14            Giraffes have long necks. Cannot Fly
15 Zebras have black and white stripes. Cannot Fly

Explanation: grepl("fly", text_data$text, ignore.case = TRUE) checks for the presence of the word “fly” in each element of the text column, ignoring case. The ifelse function is then used to create a new column category, assigning “Can Fly” if the word is present and “Cannot Fly” otherwise.
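The same idea extends to more than two groups by nesting ifelse calls. The following is only a sketch: the vector `text` and the category labels are hypothetical choices made for illustration.

```r
# Hypothetical texts and category labels, used only for illustration
text <- c("Birds can fly high.", "Fish swim in water.", "Horses run fast.")

# The first matching condition wins; anything left over falls into "Other"
category <- ifelse(grepl("fly", text, ignore.case = TRUE), "Flies",
            ifelse(grepl("swim", text, ignore.case = TRUE), "Swims", "Other"))
# category is "Flies" "Swims" "Other"
```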

< section id="additional-examples" class="level3">

Additional Examples

< section id="example-6-using-grep-to-find-multiple-patterns" class="level4">

Example 6: Using grep to find multiple patterns

# Find rows containing the words 'great' or 'loyal'
indices <- grep("great|loyal", text_data$text, ignore.case = TRUE)
text_data[indices,c("text") ]
[1] "Cats are great pets."    "Dogs are loyal animals."

Explanation: grep("great|loyal", text_data$text, ignore.case = TRUE) searches for the words “great” or “loyal” in the text column, ignoring case, and returns the indices of the matching rows. Because only the text column is selected, the result is printed as a character vector rather than as a data frame.
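When the list of search words lives in a vector, the alternation pattern can be built with paste instead of being typed by hand. A minimal sketch, assuming hypothetical objects `keywords` and `sentences`:

```r
# Hypothetical keyword list and sentences, used only for illustration
keywords <- c("great", "loyal")
sentences <- c("Cats are great pets.", "Dogs are loyal animals.", "Birds can fly high.")

# Join the keywords with the regex "or" operator
pattern <- paste(keywords, collapse = "|")   # "great|loyal"

hits <- grep(pattern, sentences, ignore.case = TRUE, value = TRUE)
```

This assumes the keywords contain no regex metacharacters; words containing characters such as “.” or “(” would need escaping first.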

< section id="example-7-using-gsub-for-complex-substitutions" class="level4">

Example 7: Using gsub for complex substitutions

# Replace all occurrences of 'animals' with 'creatures' and 'pets' with 'companions'

text_data$text_gsub_complex <- gsub("animals", "creatures", gsub("pets", "companions", text_data$text))
text_data[,c("text","text_gsub_complex")]
                                   text                    text_gsub_complex
1                  Cats are great pets.           Cats are great companions.
2               Dogs are loyal animals.            Dogs are loyal creatures.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.       Elephants are large creatures.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.

Explanation: The inner gsub replaces all occurrences of ‘pets’ with ‘companions’, and the outer gsub replaces all occurrences of ‘animals’ with ‘creatures’ in each element of the text column. The resulting text is stored in a new column text_gsub_complex.
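Nesting gsub calls becomes hard to read once there are more than two or three replacements. One possible alternative, sketched here with hypothetical names (`replacements`, `sentences`), is to keep the pairs in a named vector and apply them in a loop.

```r
# Hypothetical lookup table (names are the old words, values the new ones)
replacements <- c(pets = "companions", animals = "creatures")
sentences <- c("Cats are great pets.", "Dogs are loyal animals.")

cleaned <- sentences
for (old in names(replacements)) {
  # fixed = TRUE treats the search term as literal text, not as a regex
  cleaned <- gsub(old, replacements[[old]], cleaned, fixed = TRUE)
}
# cleaned is "Cats are great companions." "Dogs are loyal creatures."
```

The replacements are applied in order, so a later pair can also match text introduced by an earlier one.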

< section id="example-8-using-grepl-with-multiple-conditions" class="level4">

Example 8: Using grepl with multiple conditions

# Add a new column indicating if the text contains either 'large' or 'playful'

text_data$contains_large_or_playful <- grepl("large|playful", text_data$text)
text_data[,c("text","contains_large_or_playful")]
                                   text contains_large_or_playful
1                  Cats are great pets.                     FALSE
2               Dogs are loyal animals.                     FALSE
3                   Birds can fly high.                     FALSE
4                   Fish swim in water.                     FALSE
5                      Horses run fast.                     FALSE
6                  Rabbits hop quickly.                     FALSE
7                       Cows give milk.                     FALSE
8                      Sheep have wool.                     FALSE
9          Goats are curious creatures.                     FALSE
10   Lions are the kings of the jungle.                     FALSE
11                 Tigers have stripes.                     FALSE
12         Elephants are large animals.                      TRUE
13            Monkeys are very playful.                      TRUE
14            Giraffes have long necks.                     FALSE
15 Zebras have black and white stripes.                     FALSE

Explanation: grepl("large|playful", text_data$text) checks each element of the text column for the presence of the words “large” or “playful” and returns a logical vector. This vector is then added as a new column contains_large_or_playful.
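The | operator expresses “either”. To require that both words are present, two grepl results can be combined with &. A sketch on a hypothetical vector (`sentences`, with a third sentence invented for illustration):

```r
# Hypothetical sentences, used only for illustration
sentences <- c("Elephants are large animals.",
               "Monkeys are very playful.",
               "Large and playful dogs.")

# Either word present
either <- grepl("large|playful", sentences, ignore.case = TRUE)   # TRUE TRUE TRUE

# Both words present
both <- grepl("large", sentences, ignore.case = TRUE) &
  grepl("playful", sentences, ignore.case = TRUE)                 # FALSE FALSE TRUE
```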

< section id="understanding-regular-expressions" class="level2">

4. Understanding Regular Expressions

Regular expressions (regex) are powerful tools used for pattern matching and text manipulation. They allow you to define complex search patterns using a combination of literal characters and special symbols. R’s grep, grepl, sub, and gsub functions all support the use of regular expressions.

< section id="key-components-of-regular-expressions" class="level3">

Key Components of Regular Expressions
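A minimal sketch of a few common building blocks (anchors, the dot, character classes, and quantifiers), applied to a hypothetical vector `words`:

```r
# Hypothetical example vector, used only for illustration
words <- c("cat", "cot", "coat", "cart", "Cat 9")

starts_c    <- grepl("^c", words)      # ^ anchors the match at the start
ends_t      <- grepl("t$", words)      # $ anchors the match at the end
one_between <- grepl("c.t", words)     # . matches exactly one character
class_ao    <- grepl("c[ao]t", words)  # [ao] matches either "a" or "o"
one_or_more <- grepl("c.+t", words)    # + means one or more of the preceding item
has_digit   <- grepl("[0-9]", words)   # [0-9] matches any digit
```

For example, one_between is TRUE only for “cat” and “cot”, while one_or_more is also TRUE for “coat” and “cart”. All patterns here are case-sensitive, so “Cat 9” matches only has_digit.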

< section id="examples-with-regular-expressions" class="level3">

Examples with Regular Expressions

Using the same synthetic dataset, let’s explore how to apply regular expressions with grep, grepl, sub, and gsub.

< section id="example-1-matching-text-that-starts-with-a-specific-word" class="level4">

Example 1: Matching Text that Starts with a Specific Word

# Find rows where text starts with the word 'Cats'
indices <- grep("^Cats", text_data$text)
text_data[indices,c("text")]
[1] "Cats are great pets."

Explanation: grep("^Cats", text_data$text) uses the ^ metacharacter to find rows where the text starts with “Cats”.

< section id="example-2-matching-text-that-ends-with-a-specific-word" class="level4">

Example 2: Matching Text that Ends with a Specific Word

# Find rows where text ends with the word 'water.'
indices <- grep("water\\.$", text_data$text)
text_data[indices,c("text")]
[1] "Fish swim in water."

Explanation: grep("water\\.$", text_data$text) uses the $ metacharacter to find rows where the text ends with “water.” The \\. escapes the dot, which would otherwise act as a metacharacter matching any character; the backslash is doubled because it must itself be escaped inside an R string.
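The effect of escaping the dot can be seen on a small hypothetical vector (`phrases` is an assumption for illustration). When no regex features are needed at all, fixed = TRUE is an alternative that treats the whole pattern as literal text.

```r
# Hypothetical phrases, used only for illustration
phrases <- c("Fish swim in water.", "waterfall", "Add water, then stir.")

# Unescaped dot: matches "water" followed by any character
unescaped <- grepl("water.", phrases)              # TRUE TRUE TRUE

# Escaped dot: matches "water" followed by a literal period
escaped <- grepl("water\\.", phrases)              # TRUE FALSE FALSE

# fixed = TRUE: the pattern is plain text, so no escaping is needed
literal <- grepl("water.", phrases, fixed = TRUE)  # TRUE FALSE FALSE
```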

< section id="example-3-matching-text-that-contains-a-specific-pattern" class="level4">

Example 3: Matching Text that Contains a Specific Pattern

# Find rows where text contains 'great', then any single character, then 'pets'
indices <- grep("great.pets", text_data$text)
text_data[indices,c("text")]
[1] "Cats are great pets."

Explanation: grep("great.pets", text_data$text) uses the . metacharacter to match any single character between “great” and “pets”; in this dataset that character is the space.
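Since the dot accepts any single character, it can match more than intended. The sketch below compares it with stricter alternatives on a hypothetical vector (`variants`):

```r
# Hypothetical variants, used only for illustration
variants <- c("great pets", "great-pets", "greatpets", "great  pets")

# Dot: exactly one character of any kind between the words
any_char <- grepl("great.pets", variants)              # TRUE TRUE FALSE FALSE

# Literal space: exactly one space between the words
one_space <- grepl("great pets", variants)             # TRUE FALSE FALSE FALSE

# [[:space:]]+ : one or more whitespace characters between the words
some_space <- grepl("great[[:space:]]+pets", variants) # TRUE FALSE FALSE TRUE
```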

< section id="example-4-using-gsub-with-regular-expressions" class="level3">

Example 4: Using gsub with Regular Expressions

# Replace all words starting with an uppercase 'C' with 'Animal'
text_data$text_gsub_regex <- gsub("\\bC\\w+", "Animal", text_data$text)
text_data[,c("text","text_gsub_regex")]
                                   text                      text_gsub_regex
1                  Cats are great pets.               Animal are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                    Animal give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.

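Explanation: gsub("\\bC\\w+", "Animal", text_data$text) replaces all words starting with ‘C’ (\\b indicates a word boundary, C matches the character ‘C’, and \\w+ matches one or more word characters) with “Animal”. The match is case-sensitive, so only “Cats” and “Cows” are replaced; lowercase words such as “curious” and “creatures” in row 9 are left unchanged.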
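Instead of discarding the matched word, the replacement can reuse it through a capture group: parentheses in the pattern capture text, and \\1 in the replacement refers back to it. A minimal sketch on a hypothetical subset of sentences (`sentences` and the angle-bracket tagging are illustrative choices):

```r
# Hypothetical subset of sentences, used only for illustration
sentences <- c("Cats are great pets.", "Cows give milk.", "Goats are curious creatures.")

# Wrap every word that starts with an uppercase "C" in angle brackets;
# \\1 inserts whatever the parenthesised group matched
tagged <- gsub("\\b(C\\w+)", "<\\1>", sentences)
# tagged is "<Cats> are great pets." "<Cows> give milk." "Goats are curious creatures."
```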

< section id="example-5-using-grepl-to-check-for-complex-patterns" class="level4">

Example 5: Using grepl to Check for Complex Patterns

# Add a new column indicating if the text contains a word ending with 's'
text_data$contains_s_end <- grepl("\\b\\w+s\\b", text_data$text)
text_data[,c("text","contains_s_end")]
                                   text contains_s_end
1                  Cats are great pets.           TRUE
2               Dogs are loyal animals.           TRUE
3                   Birds can fly high.           TRUE
4                   Fish swim in water.          FALSE
5                      Horses run fast.           TRUE
6                  Rabbits hop quickly.           TRUE
7                       Cows give milk.           TRUE
8                      Sheep have wool.          FALSE
9          Goats are curious creatures.           TRUE
10   Lions are the kings of the jungle.           TRUE
11                 Tigers have stripes.           TRUE
12         Elephants are large animals.           TRUE
13            Monkeys are very playful.           TRUE
14            Giraffes have long necks.           TRUE
15 Zebras have black and white stripes.           TRUE

Explanation: grepl("\\b\\w+s\\b", text_data$text) checks each element of the text column for the presence of a word ending with ‘s’. Here, \\b indicates a word boundary, \\w+ matches one or more word characters, and s matches the character ‘s’.
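grepl only reports whether a match exists. To see which words matched, base R's gregexpr and regmatches can be combined, as in this sketch on two hypothetical sentences (`sentences`):

```r
# Hypothetical sentences, used only for illustration
sentences <- c("Cats are great pets.", "Fish swim in water.")

# gregexpr() locates every match; regmatches() extracts the matched text
locations <- gregexpr("\\b\\w+s\\b", sentences)
s_words <- regmatches(sentences, locations)

# s_words is a list with one element per sentence:
# the first holds "Cats" and "pets", the second is empty
```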

< section id="conclusion" class="level2">

Conclusion

The grep, grepl, sub, and gsub functions in R are powerful tools for text data analysis. They allow for efficient searching, pattern matching, and text manipulation, making them essential for any data analyst or data scientist working with textual data. By understanding how to use these functions and leveraging regular expressions, you can perform a wide range of text processing tasks, from simple searches to complex pattern replacements and text-based classifications.
