Extracting Numbers from Strings in R

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers.]

Introduction

Hello! Today, we’ll jump into something I think is a pretty neat task in data processing: extracting numbers from strings. We’ll explore three different methods using base R, the stringr package, and the stringi package. Each method has its own strengths, so let’s get started!

Examples

Extracting Numbers with Base R

Base R can do this without any extra packages: gregexpr() finds the matches with a regular expression, and regmatches() pulls them out of the string. Here’s a simple example:

# Sample string
text <- "The price is 45 dollars and 50 cents."

# Extract numbers using regular expressions
numbers <- gregexpr("[0-9]+", text)
result <- regmatches(text, numbers)

# Convert to numeric
numeric_result <- as.numeric(unlist(result))

print(numeric_result)
[1] 45 50

Explanation:

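  1. gregexpr("[0-9]+", text) locates every run of one or more digits in the text and returns the match positions and lengths, not the matched strings themselves.
  2. regmatches(text, numbers) uses those positions to extract the matched substrings, returning a list with one element per input string.
  3. unlist(result) flattens that list into a single character vector.
  4. as.numeric() converts the character strings to numeric values.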
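
Because gregexpr() and regmatches() are vectorised, the same steps also work on a character vector with several strings. The sketch below uses a made-up vector called prices (not part of the original example) and converts each list element separately instead of calling unlist(), so you can still tell which numbers came from which string:

```r
# Hypothetical input: several strings, one of them with no digits at all
prices <- c("apples 3, pears 12", "no numbers here", "total: 150")

# One list element per input string
matches <- regmatches(prices, gregexpr("[0-9]+", prices))

# Convert element by element so the result keeps the shape of the input
numbers_by_string <- lapply(matches, as.numeric)

print(numbers_by_string)
```

The string without digits comes back as an empty numeric vector, so the result always has the same length as the input.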

Extracting Numbers with stringr

The stringr package offers a more user-friendly approach to string manipulation. Here’s how you can extract numbers:

library(stringr)

# Sample string
text <- "The price is 45 dollars and 50 cents."

# Extract numbers using stringr
numbers <- str_extract_all(text, "\\d+")

# Convert to numeric
numeric_result <- as.numeric(unlist(numbers))

print(numeric_result)
[1] 45 50

Explanation:

  1. str_extract_all(text, "\\d+") extracts all sequences of digits from the text. "\\d+" is the regular expression \d+, which matches one or more digits; the backslash is doubled because it is written inside an R string.
  2. unlist(numbers) and as.numeric() convert the result to numeric, as explained in the base R method.
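
The pattern \\d+ only picks up unsigned whole numbers, so "3.5" would come back as 3 and 5. If your data has decimals or negative values, one possible variation is a pattern with an optional minus sign and an optional decimal part. This is a sketch on a made-up sentence, and the pattern is an assumption about what counts as a number in your data: it would also read the hyphen in a range such as "10-20" as a minus sign.

```r
library(stringr)

# Hypothetical sample string with a decimal and a negative value
reading <- "The temperature fell from 3.5 to -2 degrees in 10 minutes."

# -?            optional minus sign
# \\d+          one or more digits
# (?:\\.\\d+)?  optional decimal point followed by digits
numbers <- str_extract_all(reading, "-?\\d+(?:\\.\\d+)?")

# Convert to numeric: 3.5, -2 and 10
numeric_result <- as.numeric(unlist(numbers))

print(numeric_result)
```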

Extracting Numbers with stringi

The stringi package is the library that stringr is built on, and it provides fast, consistent string functions of its own. Here’s an example:

library(stringi)

# Sample string
text <- "The price is 45 dollars and 50 cents."

# Extract numbers using stringi
numbers <- stri_extract_all_regex(text, "\\d+")

# Convert to numeric
numeric_result <- as.numeric(unlist(numbers))

print(numeric_result)
[1] 45 50

Explanation:

  1. stri_extract_all_regex(text, "\\d+") extracts all sequences of digits from the text using regular expressions.
  2. As before, unlist(numbers) and as.numeric() are used to convert the result to numeric values.
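
One detail to watch when you pass several strings: by default stri_extract_all_regex() returns NA for a string with no match, and the omit_no_match argument changes that to an empty vector. A minimal sketch on a made-up vector called rooms:

```r
library(stringi)

# Hypothetical input: the second string contains no digits
rooms <- c("room 12", "no digits")

# Default: a string without a match yields NA
default_matches <- stri_extract_all_regex(rooms, "\\d+")

# omit_no_match = TRUE: a string without a match yields an empty vector
strict_matches <- stri_extract_all_regex(rooms, "\\d+", omit_no_match = TRUE)

print(default_matches)
print(strict_matches)
```

Which behaviour you want depends on what happens next: after unlist() and as.numeric(), the default keeps an NA placeholder for the empty string, while omit_no_match = TRUE drops it.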

Comparison and Conclusion

  • Base R needs no additional packages, but the two-step gregexpr() and regmatches() syntax is more verbose.
  • stringr does the same job in a single call to str_extract_all(), which makes the code easier to read and write.
  • stringi is the engine stringr is built on, and calling it directly suits performance-critical tasks.
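
If you want to convince yourself that the three approaches return the same numbers, a quick check like the one below can help. It is only a sketch: the samples vector is made up, and the same pattern, [0-9]+, is used for all three so that the comparison is like for like.

```r
library(stringr)
library(stringi)

# Hypothetical sample strings, each containing at least one run of digits
samples <- c("The price is 45 dollars and 50 cents.", "Order 7 shipped in 2 boxes")

# Same pattern, three different tools
base_result    <- as.numeric(unlist(regmatches(samples, gregexpr("[0-9]+", samples))))
stringr_result <- as.numeric(unlist(str_extract_all(samples, "[0-9]+")))
stringi_result <- as.numeric(unlist(stri_extract_all_regex(samples, "[0-9]+")))

# TRUE when all three numeric vectors match exactly
all_agree <- identical(base_result, stringr_result) &&
  identical(base_result, stringi_result)

print(all_agree)
```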

Try It Yourself!

I encourage you to try these methods on your own data. Extracting numbers from strings is a useful skill, especially when working with messy data. Experiment with different strings and see which method you prefer. Happy coding!
