Site icon R-bloggers

Partially renaming columns using a lookup table

[This article was first published on R | Tim Tiefenbach, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Intro

Usually data sets come with short column names, which makes it easy to clean and manipulate the data. However, when presenting the data to stakeholders, in form of tables or plots, we often need longer, meaningful names. In many cases we have a lookup table which contains long and short versions of the column names so that we can “easily” replace the names when needed.

Below we’ll look at how to rename columns using different approaches in R. To make this a little bit more challenging, we’ll add three conditions:

  1. The lookup table is not complete, that means the lookup table only covers a subset of the columns in our data set.
  2. We are working with a subset of the original data, that means, the lookup table, although being not complete, holds actually more column name pairs than there are actually columns in the subset of our data.
  3. The sorting of the lookup table is different from the sorting of our actual column names.

Without those three conditions partially renaming columns is actually not a big deal. In real world settings however, there are many cases where we have to rename columns under one or more of the above conditions. Especially, since we often use short column names in the analysis and just rename them in the final step when creating a report. The latter almost never contains all the columns names of our originial data set.

It is interesting to see how the three large paradigms in R, base R, ‘data.table’ and ‘dplyr’ compare in handling this problem.

This post concludes by looking at how we would tackle the same problem in Python’s ‘pandas’ library.

Let’s start with the setup.

Setup

We take the mtcars data set and create lookup data.frame called recode_df based on the information from the documentation ?mtcars. Next, we apply the three conditions mentioned above (see code comments) and assign this new data to mycars.

recode_df <- data.frame(
  old = names(mtcars),
  new = c("Miles per galon", "Number of cylinders", "Displacement (cu.in.)",
          "Gross horsepower","Rear axle ratio", "Weight (1000 lbs)", 
          "1/4 mile time", "Engine (0=automatic, 1=manual)",
          "Number of forward gears", "Transmission (0=automatic, 1=manual)",
          "Number of carbuertors")
  )

# condition 3: The lookup table has a different sorting than the actual column names
# Here we choose a alphabetical ordering for the lookup table:
recode_df <- recode_df[order(recode_df$old),]
rownames(recode_df) <- NULL

# condition 2: we are only working with a subset of the data
# Here we only use every second column:
every_2nd_col <- seq(from = 1, to = length(mtcars), by = 2)

mycars <- mtcars[,every_2nd_col]

# condition 1: the data has a column that is not part of the lookup table
# Here we take the rownames and put them in a dedicated column `model` ...
# ... which is no included in `recode_df`
mycars <- cbind(model = rownames(mycars),
                data.frame(mycars, row.names = NULL))

str(mycars)
#> 'data.frame':  32 obs. of  7 variables:
#>  $ model: chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ mpg  : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ disp : num  160 160 108 258 360 ...
#>  $ drat : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ qsec : num  16.5 17 18.6 19.4 17 ...
#>  $ am   : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ carb : num  4 4 1 1 2 1 4 2 2 4 ...

As as final step lets write both data.frame’s mycars and recode_df from R to two separate csv files, so that we can load them easily into Python later on. (In RMarkdown we could of course access the objects created in R from Python via the r object, but lets stick to csv files to make this reproducible for all users.)

write.csv(mycars, "mycars.csv")
# available at:
# read.csv("https://raw.githubusercontent.com/TimTeaFan/tt_website/main/content/post/2022-rename-columns/mycars.csv")

write.csv(recode_df, "recode_df.csv")
# available at:
# read.csv("https://raw.githubusercontent.com/TimTeaFan/tt_website/main/content/post/2022-rename-columns/recode_df.csv")

base R

Lets start with base R. If it weren’t for the three conditions outlined above, renaming columns in base R would be really easy. It would basically boil down to a classic lookup using match() as index to extract [ the new names:

recode_df$new[match(names(mycars), recode_df$old)]
#> [1] NA                        "Miles per galon"        
#> [3] "Displacement (cu.in.)"   "Rear axle ratio"        
#> [5] "1/4 mile time"           "Number of forward gears"
#> [7] "Number of carbuertors"

But as we can see, this creates an NA for the column name that is not included in the lookup table model.

However, slightly modifying this answer on SO by Gregor Thomas, shows how to get rid of the NAs and only overwrite those column names that are present in our lookup table:

# create a new object so that we don't overwrite our original `mycars` data
mycars_base <- mycars

# create an index vector with match to find those names in`recode_df` ... 
# ...that are present in our data
idx_vec <- match(recode_df$old, names(mycars_base))

# assign the names
names(mycars_base)[na.omit(idx_vec)] <- recode_df$new[!is.na(idx_vec)]

# use `str()` for better printing
str(mycars_base)
#> 'data.frame':  32 obs. of  7 variables:
#>  $ model                  : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ Miles per galon        : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ Displacement (cu.in.)  : num  160 160 108 258 360 ...
#>  $ Rear axle ratio        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ 1/4 mile time          : num  16.5 17 18.6 19.4 17 ...
#>  $ Number of forward gears: num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ Number of carbuertors  : num  4 4 1 1 2 1 4 2 2 4 ...

Although this approach is not very verbose, it does require some serious thinking about matching, extracting and indexing. It feels like there should be a cleaner solution for a common problem like this, so lets have a look how we can tackle this problem using ‘data.table’ and ‘dplyr’.

data․table

The ‘data.table’ package sometimes has the reputation for offering a cryptic, arcane syntax, but many users don’t know that the package also contains many helpful functions which are pretty straight-forward to use. In our case we can apply data.table::setnames() out of the box. It takes a data.table, a vector of old and new column names and finally all we have to do is to set the skip_absent argument to TRUE, to prevent ‘data.table’ from raising an error, since not all of the names in our lookup table are present in the data.

Unlike base R, the names are changed “by reference”, meaning that we don’t need to assign the result to a new variable, since no copy is made. Instead the data is “modified in place”.

library(data.table)

mycars_dt <- as.data.table(mycars)

setnames(mycars_dt,
         old = recode_df$old,
         new = recode_df$new,
         skip_absent = TRUE)

str(mycars_dt)
#> Classes 'data.table' and 'data.frame':  32 obs. of  7 variables:
#>  $ model                  : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ Miles per galon        : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ Displacement (cu.in.)  : num  160 160 108 258 360 ...
#>  $ Rear axle ratio        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ 1/4 mile time          : num  16.5 17 18.6 19.4 17 ...
#>  $ Number of forward gears: num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ Number of carbuertors  : num  4 4 1 1 2 1 4 2 2 4 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

dplyr (tidyverse)

Renaming columns in ‘dplyr’ is as easy as df %>% rename("new_name" = "old_name"). At least when we use dplyr::rename() interactively and type in each old and new name manually. Usually when working programmatically with ‘dplyr’ we supply a named vector to functions that take the ellipsis ... as argument, like rename(), and splice it in using the bang-bang-bang operator !!!.

library(dplyr)

# create a named vector
recode_vec <- setNames(recode_df$old, recode_df$new)

# splice it in with `!!!`
mycars %>% 
  rename(!!! recode_vec)
#> Error in `rename()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `cyl` doesn't exist.

However, this won’t work in our case, since our lookup table, and the named vector that we constructed with it, contains column name pairs that don’t exist in our data, which leads to the above error.

Intuitively, one want’s fall back to the base R approach where we extract only those names from our named vector that our data actually contains:

# extract column name pairs that are actually in our data and splice into `rename()`
mycars %>% 
  rename(!!! recode_vec[recode_vec %in% names(mycars)]) %>% 
  glimpse() # for better printing
#> Rows: 32
#> Columns: 7
#> $ model                     <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", …
#> $ `Miles per galon`         <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24…
#> $ `Displacement (cu.in.)`   <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 36…
#> $ `Rear axle ratio`         <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.…
#> $ `1/4 mile time`           <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15…
#> $ `Number of forward gears` <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ `Number of carbuertors`   <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4,…

And although this works, it is not as easy and clean as we expect. However, there is an even easier way in ‘dplyr’ which doesn’t come with the need of neither splicing nor extracting: dplyr::any_of().

Usually any_of() takes a character vector with column names and is used inside dplyr::select() to select “any of” the columns in the vector. It won’t throw an error when any of the column names is not actually in our data.

The cool, and undocumented feature is that we can use any_of() inside rename() and that we can supply it a named vector to do the renaming for us:

# first lets create new object
# ... so that we don't overwrite our original `mycars` data
mycars_tidy <- mycars

# here we construct the same named vector as above
recode_vec <- setNames(recode_df$old, recode_df$new)

mycars_tidy <- mycars_tidy %>% 
  rename(any_of(recode_vec))

# use `glimpse()` for better printing
mycars_tidy %>%
  glimpse()
#> Rows: 32
#> Columns: 7
#> $ model                     <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", …
#> $ `Miles per galon`         <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24…
#> $ `Displacement (cu.in.)`   <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 36…
#> $ `Rear axle ratio`         <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.…
#> $ `1/4 mile time`           <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15…
#> $ `Number of forward gears` <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ `Number of carbuertors`   <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4,…

The good thing about both approaches in ‘data.table’ and ‘dplyr’ is that we can deliberately ignore errors when not all of the column name pairs are present in our data. Either by setting data.table::setnames()‘s skip_absent argument to TRUE or by using tidyselect::any_of(). Which means we can also raise errors if we need them: setting skip_absent to FALSE or using tidyselect::all_of() (which is the opposite of any_of() and requires all column names to be present in the data).

Finally, lets have a look at how we would solve this problem in Python’s ‘pandas’ library.

Python pandas

In Python we can use ‘pandas’s pd.DataFrame.rename function to rename columns of a DataFrame. The only thing we need to take care of is supplying the columns argument with a dictionary of column name pairs {"old" : "new"}. There are several ways to create a dictionary from our recode_df. Below we use a pandas approach with set_index(...).to_dict(), but we could also have used Pythons dict(zip()) functions.

import pandas as pd

# setup
mycars = pd.read_csv("mycars.csv", index_col = 0)
recode_df = pd.read_csv("recode_df.csv",  index_col = 0)

# Create dictionary from `recode_df`
recode_dic = recode_df.set_index('old')['new'].to_dict()
# Alternatively: dict(zip(recode_df['old'], recode_df['new']))

# Rename columns
(mycars
  .rename(
     columns=recode_dic,
     inplace=True
     )
 )

mycars.info()

#> 
#> Int64Index: 32 entries, 1 to 32
#> Data columns (total 7 columns):
#>  #   Column                   Non-Null Count  Dtype  
#> ---  ------                   --------------  -----  
#>  0   model                    32 non-null     object 
#>  1   Miles per galon          32 non-null     float64
#>  2   Displacement (cu.in.)    32 non-null     float64
#>  3   Rear axle ratio          32 non-null     float64
#>  4   1/4 mile time            32 non-null     float64
#>  5   Number of forward gears  32 non-null     int64  
#>  6   Number of carbuertors    32 non-null     int64  
#> dtypes: float64(4), int64(2), object(1)
#> memory usage: 2.0+ KB

When it comes to renaming columns, we can see that ‘pandas’ is pretty similar to ‘dplyr’, even more so when we write our Python code according to the "Effective Pandas" style. However, it also resembles ‘data.table’ in two aspects. First, when setting the inplace argument to True the DataFrame is modified in place, no copy is made, and we don’t need to assign the result back to a variable. Second, rename has an argument errors which is set to 'ignore' by default. If we want pandas to throw an error if not all columns are present in our data, we can set it to 'raise'.

That’s it. I hope you enjoyed reading about renaming columns in R and Python. If you have a better way of renaming columns (especially in base R) let me know in the comments below or via Twitter, Mastodon or Github.

< details class="sess"> < summary class="session-header"> Session Info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-02-05
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package    * version date (UTC) lib source
#>  data.table * 1.14.2  2021-09-27 [1] CRAN (R 4.2.0)
#>  dplyr      * 1.1.0   2023-01-29 [1] CRAN (R 4.2.0)
#>  reticulate * 1.26    2022-08-31 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ─ Python configuration ───────────────────────────────────────────────────────
#>  python:         /usr/local/Caskroom/miniconda/base/bin/python3.9
#>  libpython:      /usr/local/Caskroom/miniconda/base/lib/libpython3.9.dylib
#>  pythonhome:     /usr/local/Caskroom/miniconda/base:/usr/local/Caskroom/miniconda/base
#>  version:        3.9.12 (main, Apr  5 2022, 01:53:17)  [Clang 12.0.0 ]
#>  numpy:          /usr/local/Caskroom/miniconda/base/lib/python3.9/site-packages/numpy
#>  numpy_version:  1.22.3
#>  
#>  NOTE: Python version was forced by RETICULATE_PYTHON
#> 
#> ──────────────────────────────────────────────────────────────────────────────
To leave a comment for the author, please follow the link and comment on their blog: R | Tim Tiefenbach.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version