[This article was first published on K & L Fintech Modeling, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post shows a simple R trick for capitalizing names whose parts are separated by a delimiter.
Problem
The problem is to capitalize names whose parts are separated by a punctuation mark (“.”). For example, “BABACAR.THIOMBANE” is to be converted to “Babacar.Thiombane”, as follows.

    ---------------------------------------------------
    name(input)                    output
    ---------------------------------------------------
    BABACAR.THIOMBANE              Babacar.Thiombane
    DAMEN.THACKER                  Damen.Thacker
    GABE.QUINNETT           ==>    Gabe.Quinnett
    JAVARY.CHRISTMAS               Javary.Christmas
    SCOTT.BLAKNEY                  Scott.Blakney
    BABACAR.THIOMBANE.AAA          Babacar.Thiombane.Aaa
    BABACAR                        Babacar
    ---------------------------------------------------
The obvious first attempt is the str_to_title() function from the stringr package, but it produces the wrong output, as follows.

    > df$wrong <- str_to_title(df$name)
    > print(df)
                       name                 wrong
    1     BABACAR.THIOMBANE     Babacar.thiombane
    2         DAMEN.THACKER         Damen.thacker
    3         GABE.QUINNETT         Gabe.quinnett
    4      JAVARY.CHRISTMAS      Javary.christmas
    5         SCOTT.BLAKNEY         Scott.blakney
    6 BABACAR.THIOMBANE.AAA Babacar.thiombane.aaa
    7               BABACAR               Babacar
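This happens because str_to_title() does not treat a “.” between letters as a word boundary, so the whole name counts as a single word and only its first letter is capitalized. The delimiter therefore has to be handled explicitly.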
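A quick check makes the difference visible. The sketch below assumes the stringr package is installed and compares the same name written with a period and with a space; the variable names are illustrative only.

```r
library(stringr)

# Period as separator: treated as one word, so only the first letter is capitalized
with_period <- str_to_title("BABACAR.THIOMBANE")

# Space as separator: treated as two words, so both parts are capitalized
with_space <- str_to_title("BABACAR THIOMBANE")

print(with_period)  # "Babacar.thiombane"
print(with_space)   # "Babacar Thiombane"
```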
Useful R functions
The following functions are used for solving the above problem.
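- scan(text = "BABACAR.THIOMBANE", sep = ".", what = "")
  --> [1] "BABACAR" "THIOMBANE"
- gsub("[.]", " ", "BABACAR.THIOMBANE")
  --> [1] "BABACAR THIOMBANE"
- gsub(" ", ".", "BABACAR THIOMBANE")
  --> [1] "BABACAR.THIOMBANE"
- paste(c("BABACAR", "THIOMBANE"), collapse = ".")
  --> [1] "BABACAR.THIOMBANE"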
The scan() function reads data into a vector or list, splitting the input on a delimiter. gsub(a, b, x) replaces every match of a with b in x; because a is a regular expression, special characters such as “.” must be wrapped in “[]” to be matched literally. paste() concatenates strings with a separator (a space by default), and its collapse argument joins the elements of a vector into a single string.
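The need for “[]” can be checked directly. The following base R sketch contrasts an unprotected "." pattern, which matches any character, with three ways of matching a literal period; the variable names are illustrative only.

```r
x <- "BABACAR.THIOMBANE"

# Unprotected "." is a regex wildcard: every character is replaced by a space
unprotected <- gsub(".", " ", x)

# Three equivalent ways to match a literal period
bracketed <- gsub("[.]", " ", x)               # character class, as in this post
escaped   <- gsub("\\.", " ", x)               # backslash escape
fixed     <- gsub(".", " ", x, fixed = TRUE)   # no regex interpretation at all

print(nchar(trimws(unprotected)))  # 0, nothing but spaces is left
print(bracketed)                   # "BABACAR THIOMBANE"
```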
Using these functions, we can implement the following R code.
    library(stringr) # str_to_title

    # input data as one column
    # (stringsAsFactors = FALSE keeps the names as character strings in R < 4.0)
    df <- data.frame(
        name = c("BABACAR.THIOMBANE",
                 "DAMEN.THACKER",
                 "GABE.QUINNETT",
                 "JAVARY.CHRISTMAS",
                 "SCOTT.BLAKNEY",
                 "BABACAR.THIOMBANE.AAA",
                 "BABACAR"),
        stringsAsFactors = FALSE)

    print(df)

    # Wrong method: only the first letter of each name is capitalized
    df$wrong <- str_to_title(df$name)
    print(df)

    # Method 1: split each name on ".", capitalize the parts, rejoin with "."
    # (quiet = TRUE suppresses scan()'s "Read n items" message)
    df$method1 <- sapply(df$name, function(x) {
        paste(str_to_title(
            scan(text = x, sep = ".", what = "", quiet = TRUE)),
            collapse = ".")}, USE.NAMES = FALSE)

    # Method 2: turn "." into spaces, capitalize, turn spaces back into "."
    df$method2 <- gsub(" ", ".", str_to_title(gsub("[.]", " ", df$name)))

    print(df)
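We implement two methods. Method 1 uses sapply() to process the names one at a time: each name is split into its parts with scan(), every part is capitalized with str_to_title(), and the parts are rejoined with paste(). Method 2 is simpler and fully vectorized: gsub() replaces each “.” with a space, str_to_title() capitalizes the resulting words, and a second gsub() turns the spaces back into “.”. Method 2 assumes that the names contain no spaces of their own, since those would also be converted to “.”.
The output below shows that both methods give the correct result.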
    > print(df[,c(1,3,4)])
                       name               method1               method2
    1     BABACAR.THIOMBANE     Babacar.Thiombane     Babacar.Thiombane
    2         DAMEN.THACKER         Damen.Thacker         Damen.Thacker
    3         GABE.QUINNETT         Gabe.Quinnett         Gabe.Quinnett
    4      JAVARY.CHRISTMAS      Javary.Christmas      Javary.Christmas
    5         SCOTT.BLAKNEY         Scott.Blakney         Scott.Blakney
    6 BABACAR.THIOMBANE.AAA Babacar.Thiombane.Aaa Babacar.Thiombane.Aaa
    7               BABACAR               Babacar               Babacar
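For comparison, the same result can probably be reached in base R alone, without stringr. The sketch below is not part of the original post: capitalize_parts(), capitalize_regex() and the sep argument are hypothetical names chosen for illustration, and both functions assume ASCII names without NA values.

```r
# Hypothetical helper: split on a literal delimiter, capitalize each part, rejoin.
# Note that strsplit() drops a trailing empty part, so a trailing delimiter is lost.
capitalize_parts <- function(x, sep = ".") {
  vapply(strsplit(x, sep, fixed = TRUE), function(parts) {
    # Upper-case the first letter of each part and lower-case the rest
    parts <- paste0(toupper(substring(parts, 1, 1)),
                    tolower(substring(parts, 2)))
    paste(parts, collapse = sep)
  }, character(1))
}

# Hypothetical regex variant for the "." delimiter only: lower-case everything,
# then upper-case any letter at the start of the string or right after a "."
capitalize_regex <- function(x) {
  gsub("(^|[.])([a-z])", "\\1\\U\\2", tolower(x), perl = TRUE)
}

names_in  <- c("BABACAR.THIOMBANE", "BABACAR.THIOMBANE.AAA", "BABACAR")
res_parts <- capitalize_parts(names_in)
res_regex <- capitalize_regex(names_in)

print(res_parts)
print(res_regex)
```

The split-and-rejoin version accepts other delimiters, for example capitalize_parts("JEAN-LUC", sep = "-"), while the regex version stays a one-liner at the cost of hard-coding the period.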
In practice we may encounter similar or harder problems that require complicated and time-consuming data manipulation.
The example above is only the tip of the iceberg, but R helps a great deal when it is used appropriately. \(\blacksquare\)
To leave a comment for the author, please follow the link and comment on their blog: K & L Fintech Modeling.