Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exploring-Data
is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (+fun).
Sign up Here to join the many other subscribers who also nerd out on new tips and tricks ????
And if you enjoy the post be sure to share it
TweetWhat to Expect
I’m excited to share pro-tips that will expedite your process for cleaning and standardizing column names in your data; this is a critical yet sometimes overlooked step in the cleaning + tidying of data.
There are a couple of handy functions()
available in R
to help effectively execute these tasks.
By the end of this short article you’ll have a couple of new tricks up your sleeve for getting those column names just the way you want them ????
Data Wrangling Toolkit ????
Load our Libraries
library(tidyverse) # Work-Horse Package library(tidytuesdayR) # Access Data from Tidy Tuesday library(janitor) # Data Cleaning Package library(purrr) # Functional Programming Toolkit
Let’s Get Some Data
I’m grabbing a couple of data-sets from the Tidy Tuesday Project that will help us walk through a couple of examples together.
# Get Marine Mammal Data cetacean_week <- tidytuesdayR::tt_load("2018-12-18") cetacean_raw_tbl <- cetacean_week$allCetaceanData # Get NFL Salary Data nfl_salary_week <- tidytuesdayR::tt_load("2018-04-09") nfl_salary_raw_tbl <- nfl_salary_week$nfl_salary
Each of these data-sets contain column naming useful for emphasizing the value in the aforementioned functions.
Let’s start with the janitor
library and it’s nifty function called clean_names()
.
Janitor Makes Life Easy
My head exploded ???? when learning about the Janitor
library – it’s one of my favorite’s and I use the clean_names()
function ALL the time.
Standardizing our naming convention upfront in our data cleaning pipeline can save enormous amounts of time downstream. I’m a big fan of the ???? snake_case
???? naming convention and so I typically like the columns of my data to follow that pattern.
Fortunately, the janitor::clean_names()
function has built in functionality
to programmatically clean up our column names – my favorite part is that by default it favors the snake_case
naming convention.
Let’s Look at an Example
Pulling a few columns from our marine-mammal data we see that our columns are not in our preferred snake_case
convention.
# Get subset of columns for example cetacean_subset_tbl <- cetacean_raw_tbl %>% # Select columns using helper_functions() select(contains("origin"), contains("date"), COD) # Transpose Data to view Column Names glimpse(cetacean_subset_tbl) ## Rows: 2,194 ## Columns: 6 ## $ originDate <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03, 1979-… ## $ originLocation <chr> "Marineland Florida", "Dolphin Research Center", "SeaW… ## $ statusDate <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ transferDate <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ entryDate <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03, 1979-… ## $ COD <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
Mixed Naming, Let’s Standardize
As you can see we’ve got columns in lowerCamel and also in UPPERCASE. To standardize, let’s now use the clean_names()
function to tidy these up.
# Clean up Column Names + Glimpse Output cetacean_subset_tbl %>% clean_names() %>% glimpse() ## Rows: 2,194 ## Columns: 6 ## $ origin_date <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03, 1979… ## $ origin_location <chr> "Marineland Florida", "Dolphin Research Center", "Sea… ## $ status_date <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ transfer_date <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ entry_date <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03, 1979… ## $ cod <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Those Column Names Look Great!
Now imagine you have a picky partner/colleague who insists on a format like ALLCAPS – you’ve tried to convince them otherwise but they insist ????
# Standardize Column Naming - ALLCAPS cetacean_cols_allcaps_tbl <- cetacean_subset_tbl %>% clean_names(case = "all_caps") # Glimpse Output cetacean_cols_allcaps_tbl %>% glimpse() ## Rows: 2,194 ## Columns: 6 ## $ ORIGIN_DATE <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03, 1979… ## $ ORIGIN_LOCATION <chr> "Marineland Florida", "Dolphin Research Center", "Sea… ## $ STATUS_DATE <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ TRANSFER_DATE <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ ENTRY_DATE <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03, 1979… ## $ COD <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Awesome!
Nice way to be a team-player ????
Next Example: set_names()
With clean_names()
in our tool bag, we can now combine it with set_names()
to programmatically standardize ALL of our column names using advanced techniques.
Let’s take a quick peak at the columns from the NFL Salary data.
nfl_salary_raw_tbl %>% names() ## [1] "year" "Cornerback" "Defensive Lineman" ## [4] "Linebacker" "Offensive Lineman" "Quarterback" ## [7] "Running Back" "Safety" "Special Teamer" ## [10] "Tight End" "Wide Receiver"
Now imagine instead of requiring snake_case, the columns need to be lower-case with a dash
instead of an underscore
in between words.
The set_names()
function allows us to Set the Names of a Vector
programmatically.
Using the names()
function above, we can pass a vector of our column names and manipulate each name in similar fashion.
Let’s look at an example.
nfl_salary_raw_tbl %>% clean_names() %>% names() ## [1] "year" "cornerback" "defensive_lineman" ## [4] "linebacker" "offensive_lineman" "quarterback" ## [7] "running_back" "safety" "special_teamer" ## [10] "tight_end" "wide_receiver"
We’ve effectively used clean_names()
to quickly clean up our column names.
However, we still need to replace those underscores with dashes.
Check this out ????
nfl_salary_raw_tbl %>% clean_names() %>% set_names(names(.) %>% str_replace_all("_", "-")) %>% glimpse() ## Rows: 800 ## Columns: 11 ## $ year <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2… ## $ cornerback <dbl> 11265916, 11000000, 10000000, 10000000, 10000000,… ## $ `defensive-lineman` <dbl> 17818000, 16200000, 12476000, 11904706, 11762782,… ## $ linebacker <dbl> 16420000, 15623000, 11825000, 10083333, 10020000,… ## $ `offensive-lineman` <dbl> 15960000, 12800000, 11767500, 10358200, 10000000,… ## $ quarterback <dbl> 17228125, 16000000, 14400000, 14100000, 13510000,… ## $ `running-back` <dbl> 12955000, 10873833, 9479000, 7700000, 7500000, 70… ## $ safety <dbl> 8871428, 8787500, 8282500, 8000000, 7804333, 7652… ## $ `special-teamer` <dbl> 4300000, 3725000, 3556176, 3500000, 3250000, 3225… ## $ `tight-end` <dbl> 8734375, 8591000, 8290000, 7723333, 6974666, 6133… ## $ `wide-receiver` <dbl> 16250000, 14175000, 11424000, 11415000, 10800000,…
I learned this trick in the Data Science for Business 101 course taught by Matt Dancho.
At first, I was puzzled by the names(.)
component and didn’t understand what the period
was doing. In the course I learned that using the dot
(.) enables passing the incoming tibble to multiple-spots
in the function.
set_names()
is a vectorized function and so the first argument is a vector. The dot
functionality in R
allows us to take the incoming tibble and pass it to the names(.)
function. Once we have the names in a vector we use the str_replace_all()
function to replace the underscore
with a dash
.
The str_replace_all()
function uses regular expression pattern matching and so the options are endless for how creative you can get here.
Wrap-Up
That’s it for today!
We used clean_names()
and set_names()
to effectively standardize our column naming conventions.
Get the code here: Github Repo.
Subscribe + Share
Enter your Email Here to get the latest from Exploring-Data in your inbox.
PS: Be Kind and Tidy your Data ????
Learn R Fast ????
Interested in expediting your R
learning path?
Head on over to Business Science and join me on the journey.
Link to my favorite R
course: Data Science for Business 101
FREE Jumpstart Data-Science Course (opened for a limited time)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.