String Manipulation with Stringr
What is a string?
In coding, strings, also known as character strings, are sequences of characters that are surrounded by quotation marks. This often includes letters and can include numbers. Character strings are often used for names and categorizing data and it can be extremely useful to learn how to work with this type easily.
Note: if you prefer a video tutorial, you can see it here
Basic string manipulation
"hello world" is an example of a basic character string. It contains two words that are surrounded by quotation marks. Typically when we are working with strings we have more than one, and they will be in a vector or in one or more columns of a dataframe. The main package we will be using to manipulate strings is stringr. There are of course many ways to do these operations, but stringr is the tidyverse method for string manipulation, so the grammar and structure are consistent with other tidyverse packages.
First we will create a vector of strings called fruits that contains the names of 5 fruits.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(stringr)
fruits <- c("Apple", "Banana", "Kiwi", "Pineapple", "Grape")
If we want to know how long each string in our vector is, we can use the functions str_count() or str_length(). Called without a pattern, both tell us how many characters are in each string.
str_count(fruits)
## [1] 5 6 4 9 5
str_length(fruits)
## [1] 5 6 4 9 5
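The two functions only agree when str_count() is called without a pattern. As a small sketch using the same fruits vector, supplying a pattern makes str_count() count matches instead of characters (the names n_chars, n_p, and n_na are just illustrative):

```r
library(stringr)

fruits <- c("Apple", "Banana", "Kiwi", "Pineapple", "Grape")

# With no pattern, str_count() counts every character, like str_length()
n_chars <- str_count(fruits)

# With a pattern, str_count() counts how many times the pattern occurs.
# Matching is case sensitive, so the capital P in "Pineapple" is not counted.
n_p <- str_count(fruits, "p")

# A missing value stays missing rather than being counted as "NA"
n_na <- str_length(c("Kiwi", NA))

n_chars
n_p
n_na
```

For plain string lengths, str_length() is the clearer choice; str_count() is more useful once you have a pattern to count.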
Now, say we want to answer a few questions:
## Which strings end with the letter e?
str_ends(fruits, "e")
## [1]  TRUE FALSE FALSE  TRUE  TRUE
## Which strings start with a capital A? (matching is case sensitive)
str_starts(fruits, "A")
## [1]  TRUE FALSE FALSE FALSE FALSE
## Do any strings have "pple" in them?
str_detect(fruits, "pple")
## [1]  TRUE FALSE FALSE  TRUE FALSE
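The same questions can also be asked with str_detect() and regular-expression anchors, which is handy to know because the patterns carry over to other functions. This is a minimal sketch on the same fruits vector; the variable names are only for illustration:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Kiwi", "Pineapple", "Grape")

# "$" anchors the pattern to the end of the string
ends_e <- str_detect(fruits, "e$")

# "^" anchors the pattern to the start of the string
starts_A <- str_detect(fruits, "^A")

# Matching is case sensitive, so a lowercase "a" finds no match at the start
starts_a <- str_detect(fruits, "^a")

# regex(ignore_case = TRUE) makes the match ignore case
starts_a_any_case <- str_detect(fruits, regex("^a", ignore_case = TRUE))

ends_e
starts_A
starts_a
starts_a_any_case
```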
You can also quickly convert all letters to the same case by using str_to_lower() or str_to_upper(), which can be handy for making everything uniform so it is easier to match up or group by later.
str_to_lower(fruits)
## [1] "apple"     "banana"    "kiwi"      "pineapple" "grape"
str_to_upper(fruits)
## [1] "APPLE"     "BANANA"    "KIWI"      "PINEAPPLE" "GRAPE"
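To see why uniform case helps with matching and grouping, here is a small sketch. The messy vector is made up for the example and is not part of the original data:

```r
library(stringr)

# Hypothetical input where the same fruit was typed with different capitalisation
messy <- c("apple", "APPLE", "Apple", "kiwi", "Kiwi")

# Before cleaning, every spelling counts as its own category
n_before <- length(unique(messy))

# After lowercasing, the spellings collapse into two categories
clean <- str_to_lower(messy)
n_after <- length(unique(clean))

# str_to_title() capitalises the first letter if you want display-ready labels
labels <- str_to_title(unique(clean))

n_before
n_after
labels
```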
Now let's create a new vector that has day labels and we will:
- Replace the first part of each string with the word “sample”
- Split each string into two separate strings
- Pull out the number from each string
library(purrr)
myString <- c("Day_01", "Day_02", "Day_03", "Day_04")
myString %>%
  str_replace(pattern = "Day", replacement = "sample") %>%
  str_split(pattern = "_") %>%
  map(2)
## [[1]]
## [1] "01"
##
## [[2]]
## [1] "02"
##
## [[3]]
## [1] "03"
##
## [[4]]
## [1] "04"
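Notice that because stringr is part of the tidyverse, its functions take the string as their first argument, so I was able to pipe each function one after the other. The result is a list because str_split() returns one element per input string and map(2) picks the second piece out of each.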
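If a plain character vector is more convenient than a list, there are a couple of options. This is only a sketch of two alternatives, not the approach used above: map_chr() from purrr returns a character vector, and str_extract() pulls out the digits directly without splitting.

```r
library(stringr)
library(purrr)

myString <- c("Day_01", "Day_02", "Day_03", "Day_04")

# map_chr() works like map() but returns a character vector instead of a list
nums_chr <- myString %>%
  str_split(pattern = "_") %>%
  map_chr(2)

# str_extract() returns the first run of digits in each string
nums_extract <- str_extract(myString, "[0-9]+")

# The pieces are still strings; convert them if you need actual numbers
nums_int <- as.integer(nums_extract)

nums_chr
nums_extract
nums_int
```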
Advanced string manipulation
Now we are going to work with strings that are in a column in a dataframe and we will learn how to subset a dataframe by the strings we want, how to change strings, and how to find where certain strings are.
We will use the murders data set from the dslabs package. It contains the population and the total number of gun murders for each US state (plus the District of Columbia).
library(dslabs)
data("murders")
head(murders)
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65
We see that this dataframe has 2 columns that are character strings (state and abb), and the region column is a factor but can easily be changed into a character type. Often you will want to pull out only the rows of a dataframe that meet certain criteria. To do that based on strings, we will need to use filter() and str_detect(). Note that a pattern such as "A" matches a capital A anywhere in the string, not only at the start. We pull out all of the rows with states that:
## States whose names contain a capital A (here, the same states that start with A)
murders %>%
  filter(str_detect(string = state, pattern = "A"))
##      state abb region population total
## 1  Alabama  AL  South    4779736   135
## 2   Alaska  AK   West     710231    19
## 3  Arizona  AZ   West    6392017   232
## 4 Arkansas  AR  South    2915918    93

## States whose names contain a capital A or C (this also matches names such as North Carolina)
murders %>%
  filter(str_detect(string = state, pattern = "A|C"))
##                   state abb    region population total
## 1               Alabama  AL     South    4779736   135
## 2                Alaska  AK      West     710231    19
## 3               Arizona  AZ      West    6392017   232
## 4              Arkansas  AR     South    2915918    93
## 5            California  CA      West   37253956  1257
## 6              Colorado  CO      West    5029196    65
## 7           Connecticut  CT Northeast    3574097    97
## 8  District of Columbia  DC     South     601723    99
## 9        North Carolina  NC     South    9535483   286
## 10       South Carolina  SC     South    4625364   207

## States that are in states.of.interest
states.of.interest <- c("Texas", "Louisiana", "Mississippi", "Alabama", "Florida")
## need to collapse the multiple strings into one string with the | symbol between them
states.of.interest <- paste(states.of.interest, collapse = "|")
states.of.interest
## [1] "Texas|Louisiana|Mississippi|Alabama|Florida"
murders %>%
  filter(str_detect(string = state, pattern = states.of.interest))
##         state abb region population total
## 1     Alabama  AL  South    4779736   135
## 2     Florida  FL  South   19687653   669
## 3   Louisiana  LA  South    4533372   351
## 4 Mississippi  MS  South    2967297   120
## 5       Texas  TX  South   25145561   805

## States whose names contain neither a capital A nor a capital C
murders %>%
  filter(str_detect(string = state, pattern = "A|C", negate = TRUE))
##            state abb        region population total
## 1       Delaware  DE         South     897934    38
## 2        Florida  FL         South   19687653   669
## 3        Georgia  GA         South    9920000   376
## 4         Hawaii  HI          West    1360301     7
## 5          Idaho  ID          West    1567582    12
## 6       Illinois  IL North Central   12830632   364
## 7        Indiana  IN North Central    6483802   142
## 8           Iowa  IA North Central    3046355    21
## 9         Kansas  KS North Central    2853118    63
## 10      Kentucky  KY         South    4339367   116
## 11     Louisiana  LA         South    4533372   351
## 12         Maine  ME     Northeast    1328361    11
## 13      Maryland  MD         South    5773552   293
## 14 Massachusetts  MA     Northeast    6547629   118
## 15      Michigan  MI North Central    9883640   413
## 16     Minnesota  MN North Central    5303925    53
## 17   Mississippi  MS         South    2967297   120
## 18      Missouri  MO North Central    5988927   321
## 19       Montana  MT          West     989415    12
## 20      Nebraska  NE North Central    1826341    32
## 21        Nevada  NV          West    2700551    84
## 22 New Hampshire  NH     Northeast    1316470     5
## 23    New Jersey  NJ     Northeast    8791894   246
## 24    New Mexico  NM          West    2059179    67
## 25      New York  NY     Northeast   19378102   517
## 26  North Dakota  ND North Central     672591     4
## 27          Ohio  OH North Central   11536504   310
## 28      Oklahoma  OK         South    3751351   111
## 29        Oregon  OR          West    3831074    36
## 30  Pennsylvania  PA     Northeast   12702379   457
## 31  Rhode Island  RI     Northeast    1052567    16
## 32  South Dakota  SD North Central     814180     8
## 33     Tennessee  TN         South    6346105   219
## 34         Texas  TX         South   25145561   805
## 35          Utah  UT          West    2763885    22
## 36       Vermont  VT     Northeast     625741     2
## 37      Virginia  VA         South    8001024   250
## 38    Washington  WA          West    6724540    93
## 39 West Virginia  WV         South    1852994    27
## 40     Wisconsin  WI North Central    5686986    97
## 41       Wyoming  WY          West     563626     5
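In the above examples, the negate = TRUE argument is key for pulling out the states that do not match the pattern. Sometimes it is easier to tell R which rows you don't want rather than which ones you do want, and this is where the negate argument is useful. Keep in mind that the pattern "A|C" matches a capital A or C anywhere in the name, which is why District of Columbia, North Carolina, and South Carolina are matched by "A|C" and are missing from the negated result. To test only the first letter, anchor the pattern to the start of the string with ^.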
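The difference between "contains" and "starts with" is easiest to see on a small vector. The sketch below uses a hand-picked subset of state names (so it runs without dslabs) and str_subset(), which returns the matching strings themselves. It also shows %in% as an alternative when you want exact matches against a list of names rather than a regular expression.

```r
library(stringr)

# A small illustrative subset of state names, chosen for this example
states <- c("Alabama", "California", "District of Columbia",
            "North Carolina", "Texas", "Iowa")

# Unanchored: a capital A or C anywhere in the name
contains_AC <- str_subset(states, "A|C")

# Anchored with ^: only names whose first letter is A or C
starts_AC <- str_subset(states, "^[AC]")

# negate = TRUE keeps the names that do NOT match
not_starts_AC <- str_subset(states, "^[AC]", negate = TRUE)

# For exact matches against a set of names, %in% avoids regular expressions entirely
exact <- states[states %in% c("Texas", "Alabama")]

contains_AC
starts_AC
not_starts_AC
exact
```

In a dataframe the same patterns drop straight into filter(str_detect(state, "^[AC]")).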
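Another way you may need to manipulate a column of character strings is to change the words themselves. For example, in the murders dataframe, we will change the names of the regions so they are all one word and all lowercase. To do this we will combine the mutate() and str_replace() functions. str_replace() changes only the first match in each string, while str_replace_all() changes every match and also accepts a named vector of several replacements.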
murders %>%
  distinct(region)
##          region
## 1         South
## 2          West
## 3     Northeast
## 4 North Central

# to just change one region
murders %>%
  mutate(region = str_replace(string = region, pattern = "South", replacement = "south")) %>%
  head()
##        state abb region population total
## 1    Alabama  AL  south    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  south    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

# to change them all at the same time
murders %>%
  mutate(region = str_replace_all(string = region,
                                  c("South" = "south",
                                    "West" = "west",
                                    "North Central" = "north_central",
                                    "Northeast" = "northeast"))) %>%
  head(n = 20)
##                   state abb        region population total
## 1               Alabama  AL         south    4779736   135
## 2                Alaska  AK          west     710231    19
## 3               Arizona  AZ          west    6392017   232
## 4              Arkansas  AR         south    2915918    93
## 5            California  CA          west   37253956  1257
## 6              Colorado  CO          west    5029196    65
## 7           Connecticut  CT     northeast    3574097    97
## 8              Delaware  DE         south     897934    38
## 9  District of Columbia  DC         south     601723    99
## 10              Florida  FL         south   19687653   669
## 11              Georgia  GA         south    9920000   376
## 12               Hawaii  HI          west    1360301     7
## 13                Idaho  ID          west    1567582    12
## 14             Illinois  IL north_central   12830632   364
## 15              Indiana  IN north_central    6483802   142
## 16                 Iowa  IA north_central    3046355    21
## 17               Kansas  KS north_central    2853118    63
## 18             Kentucky  KY         south    4339367   116
## 19            Louisiana  LA         south    4533372   351
## 20                Maine  ME     northeast    1328361    11
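When the goal is simply "lowercase, with spaces turned into underscores", spelling out every replacement is not strictly necessary. As a hedged alternative sketch (not what the code above does), the same result can be reached by combining str_to_lower() with a single str_replace_all() call. The regions vector below just repeats the four region names so the example is self-contained:

```r
library(stringr)

# The four region names from the murders data, typed out for the example
regions <- c("South", "West", "Northeast", "North Central")

# Explicit recoding with a named vector, as in the example above
recoded <- str_replace_all(regions,
                           c("South" = "south",
                             "West" = "west",
                             "North Central" = "north_central",
                             "Northeast" = "northeast"))

# General rule: lowercase everything, then replace each space with an underscore
general <- str_replace_all(str_to_lower(regions), " ", "_")

# str_replace() only changes the first match, str_replace_all() changes them all
first_only <- str_replace("a_b_c", "_", " ")
every_match <- str_replace_all("a_b_c", "_", " ")

recoded
general
first_only
every_match
```

The named-vector form is still the better fit when the new labels do not follow a simple rule.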
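Lastly, if you want to find the positions of the rows that contain certain strings, you can use the str_which() function. For example, suppose I want the row indices of the states in the South region. Searching for "_South" rather than "South" in the combined column below matches the region part only, so a state such as South Dakota (region North Central) is not picked up.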
murders %>%
  ## creating a new column with the state name and region together just for the example
  mutate(state_region = paste(state, region, sep = "_")) %>%
  pull(state_region) %>%
  str_which(pattern = "_South")
## [1]  1  4  8  9 10 11 18 19 21 25 34 37 41 43 44 47 49
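To make the relationship between the matching functions concrete, here is a small sketch on a made-up vector of state_region labels. str_which() gives positions, str_detect() gives TRUE/FALSE, and str_subset() gives the matching strings themselves:

```r
library(stringr)

# Hypothetical labels in the same state_region format as above
labels <- c("Alabama_South", "Alaska_West", "Arkansas_South",
            "South Dakota_North Central", "Maine_Northeast")

# Positions of the matching elements
idx <- str_which(labels, "_South")

# str_which() is a shortcut for which(str_detect())
idx_long <- which(str_detect(labels, "_South"))

# The matching strings, either by indexing or directly with str_subset()
by_index <- labels[idx]
by_subset <- str_subset(labels, "_South")

idx
by_subset
```

Note that "South Dakota_North Central" is not matched, because the underscore in the pattern ties the match to the region part of the label.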