Capitalization of Names using R code

[This article was first published on K & L Fintech Modeling, and kindly contributed to R-bloggers].
This post shows a simple R trick for capitalizing names that may contain a delimiter.

Problem


The problem is to capitalize each part of a name whose parts are separated by a punctuation mark ("."). For example, "BABACAR.THIOMBANE" is to be converted to "Babacar.Thiombane", as follows.

               name(input)                      output
         BABACAR.THIOMBANE           Babacar.Thiombane
             DAMEN.THACKER               Damen.Thacker
             GABE.QUINNETT   ==>         Gabe.Quinnett
          JAVARY.CHRISTMAS            Javary.Christmas
             SCOTT.BLAKNEY               Scott.Blakney
     BABACAR.THIOMBANE.AAA       Babacar.Thiombane.Aaa
                   BABACAR                     Babacar


A natural first attempt is the str_to_title() function from the stringr library, but it produces the wrong output, as follows.

    > df$wrong <- str_to_title(df$name)
    > print(df)
                       name                 wrong
    1     BABACAR.THIOMBANE     Babacar.thiombane
    2         DAMEN.THACKER         Damen.thacker
    3         GABE.QUINNETT         Gabe.quinnett
    4      JAVARY.CHRISTMAS      Javary.christmas
    5         SCOTT.BLAKNEY         Scott.blakney
    6 BABACAR.THIOMBANE.AAA Babacar.thiombane.aaa
    7               BABACAR               Babacar

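As the output shows, str_to_title() treats a name containing "." as a single word, so only the first letter of the whole string is capitalized. The delimiter therefore has to be handled explicitly.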

Useful R functions


The following functions are used for solving the above problem.

  • scan(text = "BABACAR.THIOMBANE", sep = ".", what = "")
    –> [1] "BABACAR" "THIOMBANE"
  • gsub("[.]", " ", "BABACAR.THIOMBANE")
    –> [1] "BABACAR THIOMBANE"
  • gsub(" ", ".", "BABACAR THIOMBANE")
    –> [1] "BABACAR.THIOMBANE"
  • paste(c("BABACAR", "THIOMBANE"), collapse = ".")
    –> [1] "BABACAR.THIOMBANE"

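The scan() function reads data into a vector or list, splitting the input on a delimiter. The gsub(a, b, x) function replaces every match of a with b in x. Because a is a regular expression, special characters such as "." must be wrapped in "[]" when they appear in a so that they are matched literally. The paste() function concatenates strings with a delimiter (the default is a space); its collapse argument joins the elements of a vector into a single string.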
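The behaviour of these building blocks can be checked in isolation. The following is a minimal sketch in base R; the variable names are only illustrative, and quiet = TRUE and fixed = TRUE are optional arguments that the original calls do not use.

```r
x <- "BABACAR.THIOMBANE"

# scan() splits the string on "."; quiet = TRUE suppresses the
# "Read 2 items" message that scan() would otherwise print
parts <- scan(text = x, sep = ".", what = "", quiet = TRUE)

# An unbracketed "." is a regex wildcard, so every character is replaced
wildcard <- gsub(".", " ", x)

# "[.]" matches a literal period; fixed = TRUE is an alternative
# that turns off regex interpretation altogether
literal_class <- gsub("[.]", " ", x)
literal_fixed <- gsub(".", " ", x, fixed = TRUE)

# sep joins separate arguments; collapse joins the elements of one vector
joined_sep      <- paste("BABACAR", "THIOMBANE", sep = ".")
joined_collapse <- paste(parts, collapse = ".")

print(parts)
print(literal_class)
print(joined_collapse)
```

Comparing wildcard with literal_class shows why the brackets matter: without them, the whole name is turned into spaces.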

Using these functions, we can implement the following R code.

    library(stringr) # str_to_title

    # input data as one column named "name"
    df <- data.frame(
        name = c("BABACAR.THIOMBANE",
                 "DAMEN.THACKER",
                 "GABE.QUINNETT",
                 "JAVARY.CHRISTMAS",
                 "SCOTT.BLAKNEY",
                 "BABACAR.THIOMBANE.AAA",
                 "BABACAR"),
        stringsAsFactors = FALSE)
    print(df)

    # Wrong method: only the first letter of the whole string is capitalized
    df$wrong <- str_to_title(df$name)
    print(df)

    # Method 1: split on ".", capitalize each part, join with "." again
    df$method1 <- sapply(df$name, function(x) {
                        paste(str_to_title(
                            scan(text = x, sep = ".", what = "", quiet = TRUE)),
                            collapse = ".")},
                        USE.NAMES = FALSE)

    # Method 2: replace "." with " ", capitalize each word, restore "."
    df$method2 <- gsub(" ", ".", str_to_title(gsub("[.]", " ", df$name)))
    print(df)

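We implement two methods. Method 1 uses sapply() to process the names one at a time: each name is split into its parts with scan(), every part is capitalized, and the parts are joined again with paste(). Method 2 uses two gsub() calls around str_to_title(): the delimiter is temporarily replaced with a space so that each part is treated as a separate word, and then restored. Method 2 is simpler than Method 1.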
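The split, capitalize, and join idea behind Method 1 can also be written without stringr. The sketch below is not the author's code: it swaps in base R strsplit(), toupper() and tolower(), and the function names capitalize_parts and capitalize_regex, as well as the sep argument, are hypothetical. Unlike str_to_title(), it capitalizes only the first letter of each delimited part and does not look for words inside a part.

```r
# Split each name on a literal delimiter, capitalize every part, rejoin.
capitalize_parts <- function(x, sep = ".") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    # first letter upper case, remaining letters lower case
    p <- paste0(toupper(substring(p, 1, 1)), tolower(substring(p, 2)))
    paste(p, collapse = sep)
  }, character(1))
}

# Single-regex variant for "." only: lower-case everything, then
# upper-case each letter at the start or right after a period.
# "\\U" in the replacement requires perl = TRUE.
capitalize_regex <- function(x) {
  gsub("(^|[.])([a-z])", "\\1\\U\\2", tolower(x), perl = TRUE)
}

name <- c("BABACAR.THIOMBANE", "BABACAR.THIOMBANE.AAA", "BABACAR")
print(capitalize_parts(name))
print(capitalize_regex(name))
```

Passing the delimiter as an argument, for example capitalize_parts("ANNA-MARIA", sep = "-"), lets the same function handle names separated by other characters.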

The output below shows that both methods give the expected result.

    > print(df[,c(1,3,4)])
                       name               method1               method2
    1     BABACAR.THIOMBANE     Babacar.Thiombane     Babacar.Thiombane
    2         DAMEN.THACKER         Damen.Thacker         Damen.Thacker
    3         GABE.QUINNETT         Gabe.Quinnett         Gabe.Quinnett
    4      JAVARY.CHRISTMAS      Javary.Christmas      Javary.Christmas
    5         SCOTT.BLAKNEY         Scott.Blakney         Scott.Blakney
    6 BABACAR.THIOMBANE.AAA Babacar.Thiombane.Aaa Babacar.Thiombane.Aaa
    7               BABACAR               Babacar               Babacar


We may encounter similar or more difficult problems that require complicated and time-consuming data manipulation.

The example above is only the tip of the iceberg. R will help us when we use it appropriately.

