
R Packages: {janitor} for Data Cleaning

[This article was first published on Exploring Data, and kindly contributed to R-bloggers.]

  • Github Page as of 8/10/20

    Quick Overview

    Exploring-Data is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (+fun).

    Sign up Here to join the many other subscribers who also nerd out on new tips and tricks.

    And if you enjoy the post, be sure to share it.

    New Series: Exploring R {packages}

    When I discover new and helpful functions, I light up with excitement.

    Interestingly, I have a pattern of finding one or two useful functions in a {package}, but rarely explore further to discover other useful functionality.

    That’s what this New Series is all about – Exploration.

    In each post, I will share a bit about how I was using a {package} and then use a case-study to highlight other functionality I discovered to be useful.

    Leave a Comment

    Leave a comment at the end to let me know if you like this style of post. The feedback will be considered in determining the direction of this series.

    Examples of feedback:

    1. Was the post too long?
    2. Do you like the case-study approach?
    3. Could I have just shared the functions without the case-study?
    4. Was sharing about the pre-processing pro-tip helpful or distracting?

    janitor {package}

    This is a wonderful {package} built by Sam Firke.

    Take the time to explore the Github Page for the {janitor} package, where Sam describes it as follows:

    janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.

    There are many more functions (20+) in the package that we will not cover – head over to the following page to learn more about them: Overview of janitor functions.
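    A quick taste of a few of those extras, using a tiny made-up data frame (the data below is invented for illustration; the functions themselves are standard {janitor} exports):

    # A tiny made-up data frame with a fully empty row and a fully empty column
    messy <- tibble::tibble(
      id    = c(1, 2, NA),
      value = c(10, 20, NA),
      blank = c(NA, NA, NA)
    )

    # Drop rows and columns that are entirely NA
    janitor::remove_empty(messy, which = c("rows", "cols"))

    # Convert an Excel numeric date serial into a proper Date
    janitor::excel_numeric_to_date(41103)   # 2012-07-13

    # Round 0.5 up, rather than to even as base round() does
    janitor::round_half_up(2.5)             # 3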

    Case-Study

    The case-study will provide and illustrate the following:

    1. A pro-tip for setting up a pre-processing data pipeline.
    2. The function I use often: clean_names().
    3. Newly discovered functions from {janitor}.

    Let’s dive in…

    Imagine being tasked with doing an analysis on Starbucks coffee locations. Your manager has provided you with raw data from coffee chains and requested that you:

    1. QA the data for duplicates (by store and by location).
    2. Tabulate the various types of Starbucks Ownership:
      • Worldwide &
      • US (lower 48)
    3. Deliver a US map that identifies patterns in ownership types.

    To streamline your efforts and get swiftly to making that map, you decide to leverage the {janitor} package.

    Load our Libraries

    library(tidyverse)     # Work-Horse Package
    library(janitor)       # Data cleaning (+tabulating data)
    library(tidyquant)     # Business Ready Plots
    library(ggthemes)      # Clean ggplot theme for Maps
    library(USAboundaries) # Get state name/code mapping

    Let’s Get Some Data

    For our case-study we are using data from the Tidy Tuesday Project archive.

    # Import Data ----
    # tuesdata <- tidytuesdayR::tt_load("2018-05-07") 

    Pro-Tip: Pre-Processing Pipeline

    When working with new data, I’ll typically set up a pre-processing step at the beginning of the script. It typically starts out with no steps, and steps get added as I move through my analysis.

    The idea is that as you conduct your Exploratory Data Analysis (EDA), you will discover pre-processing steps that need to be added to your pipeline.

    In this post, I’ll illustrate this technique by adding to our pipeline as we go; however, this data pipeline would live near the top of the script and would not move.

    Step 1

    Save raw data to a variable.

    # coffee_chains_raw <- tuesdata$week6_coffee_chains

    Step 2

    Immediately save the raw data to a new variable labeled with the suffix _processed.

    # Beginning of Pre-Processing Pipeline
    coffee_chains_processed <- coffee_chains_raw

    This obviously has ZERO pre-processing done to the data at this point. The point, though, is that as you discover areas of your data that require attention, you can then circle back to this pipeline and add those steps.

    This may seem odd, but the beauty is that when you inevitably discover a data-cleaning step deeper into your analysis, you don’t have to go back and rename the variables you created along the way – you simply add the step to the pipeline and keep working with your processed data as you move swiftly through your analysis.

    I picked up this pro-tip while watching David Robinson in his Tidy Tuesday Screencasts – check those out here: Tidy Tuesday R Screencasts

    # Hat-Tip to D-Rob

    Step 3

    Begin Exploring your Data and conducting your analysis.

    At this point, I’ll do a bit of EDA to familiarize myself with the data I’m working with; the goal is always to get a high-level understanding of the data so that I can pick up on nuances, along with data-integrity issues that need attention (and get dealt with in the pre-processing pipeline).

    Initial Exploration

    Let’s look at these raw data using the tibble::glimpse() function.

    The glimpse() function allows us to quickly assess column names, data-types, and also view a sample of the values contained in each column – you can read more about the glimpse() function in my archived post, Examining Data with glimpse().

    coffee_chains_processed %>% 
      tibble::glimpse()
    ## Rows: 25,600
    ## Columns: 13
    ## $ Brand            <chr> "Starbucks", "Starbucks", "Starbucks", "Starbucks", …
    ## $ `Store Number`   <chr> "47370-257954", "22331-212325", "47089-256771", "221…
    ## $ `Store Name`     <chr> "Meritxell, 96", "Ajman Drive Thru", "Dana Mall", "T…
    ## $ `Ownership Type` <chr> "Licensed", "Licensed", "Licensed", "Licensed", "Lic…
    ## $ `Street Address` <chr> "Av. Meritxell, 96", "1 Street 69, Al Jarf", "Sheikh…
    ## $ City             <chr> "Andorra la Vella", "Ajman", "Ajman", "Abu Dhabi", "…
    ## $ `State/Province` <chr> "7", "AJ", "AJ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ",…
    ## $ Country          <chr> "AD", "AE", "AE", "AE", "AE", "AE", "AE", "AE", "AE"…
    ## $ Postcode         <chr> "AD500", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "31…
    ## $ `Phone Number`   <chr> "376818720", NA, NA, NA, NA, NA, NA, NA, "26670052",…
    ## $ Timezone         <chr> "GMT+1:00 Europe/Andorra", "GMT+04:00 Asia/Dubai", "…
    ## $ Longitude        <dbl> 1.53, 55.47, 55.47, 54.38, 54.54, 54.49, 54.49, 54.6…
    ## $ Latitude         <dbl> 42.51, 25.42, 25.39, 24.48, 24.51, 24.40, 24.40, 24.…

    Immediately, we can see that our column names are not optimal for analysis. Personally, I’m VERY biased towards snake_case and therefore always like to get column names into that format.

    janitor::clean_names()

    In comes janitor::clean_names() to the rescue ⛑️

    By default, clean_names() outputs column names in snake_case format – maybe this is one of the reasons it’s in my top 10 favorite functions in R.

    Let’s test it out on our coffee data.

    # clean_names() with default naming
    coffee_chains_processed %>% 
      janitor::clean_names() %>% 
      base::names()
    ##  [1] "brand"          "store_number"   "store_name"     "ownership_type"
    ##  [5] "street_address" "city"           "state_province" "country"       
    ##  [9] "postcode"       "phone_number"   "timezone"       "longitude"     
    ## [13] "latitude"

    Awesome!

    You’ll notice the function took care of the / in State/Province and replaced it with an underscore – simply amazing.
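    To see a bit more of what clean_names() handles, here’s a minimal sketch on a throwaway data frame with deliberately messy names (the column names below are invented for illustration):

    # deliberately messy column names
    ugly <- tibble::tibble(
      `First Name`     = "Ada",
      `State/Province` = "CA",
      `% Change`       = 0.12,
      REPEATED         = 1,
      repeated         = 2
    )

    ugly %>% 
      janitor::clean_names() %>% 
      base::names()
    ## Expect names along the lines of:
    ## "first_name" "state_province" "percent_change" "repeated" "repeated_2"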

    Naming Convention Options

    If you prefer a different naming convention – I’m not sure why you would – then you can use the case argument.

    # clean_names() with diff. naming convention
    coffee_chains_processed %>% 
      clean_names(case = "small_camel") %>% 
      names()
    ##  [1] "brand"         "storeNumber"   "storeName"     "ownershipType"
    ##  [5] "streetAddress" "city"          "stateProvince" "country"      
    ##  [9] "postcode"      "phoneNumber"   "timezone"      "longitude"    
    ## [13] "latitude"

    Pre-Processing Addition

    Now let’s add this step to our pre-processing pipeline.

    # Adding to our Pre-Processing Pipeline
    coffee_chains_processed <- coffee_chains_raw %>% 
      
      # clean up column names
      janitor::clean_names()

    janitor::get_dupes()

    get_dupes() is at the top of the list for newly discovered functionality within the {janitor} package.

    This is one of those things you need to do often (check for duplicates) and {janitor} makes it simple.

    Going back to our case-study, our manager asked us to check for duplicated records (a common data-cleaning and EDA step).

    Let’s subset our data and investigate.

    coffee_chains_processed %>% 
      
      # subset data by store and by location
      dplyr::select(brand, store_number, 
                    city, state_province, country) %>% 
      
      # identify duplicated records
      janitor::get_dupes()
    ## # A tibble: 2 x 6
    ##   brand     store_number city  state_province country dupe_count
    ##   <chr>     <chr>        <chr> <chr>          <chr>        <int>
    ## 1 Starbucks 19773-160973 Seoul 11             KR               2
    ## 2 Starbucks 19773-160973 Seoul 11             KR               2

    Using janitor::get_dupes() we’ve quickly identified a potential issue: store number 19773-160973 has duplicated records.

    Let’s investigate further.

    # filter to store with dupes
    coffee_chains_processed %>% 
      
      # filter to store and glimpse data
      dplyr::filter(store_number == "19773-160973") %>% 
      glimpse()
    ## Rows: 2
    ## Columns: 13
    ## $ brand          <chr> "Starbucks", "Starbucks"
    ## $ store_number   <chr> "19773-160973", "19773-160973"
    ## $ store_name     <chr> "Yoido IFC Mall - 1F", "Yoido IFC Mall - 1F"
    ## $ ownership_type <chr> "Joint Venture", "Joint Venture"
    ## $ street_address <chr> "23 & 23-1, Yoido-Dong, Yongdongpo-Gu, 1F, #101", "23 …
    ## $ city           <chr> "Seoul", "Seoul"
    ## $ state_province <chr> "11", "11"
    ## $ country        <chr> "KR", "KR"
    ## $ postcode       <chr> "153-023", "153-023"
    ## $ phone_number   <chr> NA, NA
    ## $ timezone       <chr> "GMT+09:00 Asia/Seoul", "GMT+09:00 Asia/Seoul"
    ## $ longitude      <dbl> NA, 126.92
    ## $ latitude       <dbl> NA, 37.53

    Look carefully and you’ll notice that the latitude/longitude are missing for one of these records.

    We need lat/long for mapping and so we will want to prioritize the records with those data. Also, we don’t want duplicated records to interfere with our tabulations later on in this analysis.

    Let’s quickly look and see how much data is missing from the lat/long columns.

    # plot missing data (using raw data)
    DataExplorer::plot_missing(
      title   = "% of Missing Data (filtered to cols w/missing data)",
      data    = coffee_chains_raw,
      ggtheme = tidyquant::theme_tq(), 
      missing_only = TRUE) 

    The plot shows essentially 0% missing data for lat/long, which suggests the store identified earlier is the only record with missing coordinates (too small a share to register when plotted).

    We will filter that record out in our data-cleaning step.
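    If you prefer to confirm the exact count rather than eyeball a plot, a quick dplyr summary over the raw data gets there too (a small sketch using the original, un-cleaned column names):

    # count records missing coordinates (raw column names)
    coffee_chains_raw %>% 
      dplyr::summarise(
        missing_latitude  = sum(is.na(Latitude)),
        missing_longitude = sum(is.na(Longitude))
      )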

    Pre-Processing Addition

    Now let’s add this step to our pre-processing pipeline.

    # Adding to our Pre-Processing Pipeline
    coffee_chains_processed <- coffee_chains_raw %>% 
      
      # clean up column names
      janitor::clean_names() %>% 
      
      # filter out records missing lat/long values
      dplyr::filter(!is.na(latitude), !is.na(longitude))

    Let’s use get_dupes() to confirm the problem is solved.

    coffee_chains_processed %>% 
      
      # subset data
      dplyr::select(brand, store_number, city, state_province, country) %>% 
      
      # identify duplicated records
      janitor::get_dupes()
    ## # A tibble: 0 x 6
    ## # … with 6 variables: brand <chr>, store_number <chr>, city <chr>,
    ## #   state_province <chr>, country <chr>, dupe_count <int>
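    As a side note, get_dupes() also accepts unquoted column names, so the select() step above can be folded into the call itself; the difference is that this form returns the full records (all columns) plus dupe_count, rather than just the subset. A sketch equivalent to the check we just ran:

    # same duplicate check, specifying the key columns inside get_dupes()
    coffee_chains_processed %>% 
      janitor::get_dupes(brand, store_number, city, state_province, country)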

    Starbucks Analysis

    Now that we’ve done our due diligence and dealt with the data issues, let’s knock out this analysis by tabulating these data and compiling a map or two.

    Before doing so, let’s add one final step to our pre-processing data pipeline.

    Pre-Processing Addition

    The final step is to subset the columns needed to complete the analysis.

    # Adding to our Pre-Processing Pipeline
    coffee_chains_processed <- coffee_chains_raw %>% 
      
      # clean up column names
      janitor::clean_names() %>% 
      
      # filter out records missing lat/long values
      dplyr::filter(!is.na(latitude), !is.na(longitude)) %>% 
      
      # subset columns for analysis
      dplyr::select(brand, ownership_type, country, 
                    state_province, latitude, longitude)

    View Data

    # view first 5 rows
    coffee_chains_processed %>% head(5)
    ## # A tibble: 5 x 6
    ##   brand     ownership_type country state_province latitude longitude
    ##   <chr>     <chr>          <chr>   <chr>             <dbl>     <dbl>
    ## 1 Starbucks Licensed       AD      7                  42.5      1.53
    ## 2 Starbucks Licensed       AE      AJ                 25.4     55.5 
    ## 3 Starbucks Licensed       AE      AJ                 25.4     55.5 
    ## 4 Starbucks Licensed       AE      AZ                 24.5     54.4 
    ## 5 Starbucks Licensed       AE      AZ                 24.5     54.5

    Tabulate Data (worldwide)

    Let’s start by looking at Ownership Types worldwide.

    janitor::tabyl() stuck out to me because of the ease with which it generates frequency tables.

    Check it out.

    # generate frequency table
    coffee_chains_processed %>% 
      
      # filter data
      dplyr::filter(brand == "Starbucks") %>% 
      
      # tabulate and arrange data
      janitor::tabyl(ownership_type) %>% 
      arrange(desc(percent)) %>% 
      
      # formatting
      janitor::adorn_totals() %>% 
      janitor::adorn_pct_formatting() %>% 
      rmarkdown::paged_table()

    Using just the tabyl() function, we were able to generate frequencies along with the percent of total.

    However, {janitor} is packed full of other goodies – the creators have crafted a number of adorn_* options for formatting our outputs. I used adorn_totals() and adorn_pct_formatting() to tidy up the table and make it ready for presentation.

    Simply Amazing.
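    To give a feel for a few more of the adorn_* helpers, here’s a sketch of a two-way tabyl crossing ownership type with country. The adorn_percentages(), adorn_pct_formatting(), and adorn_ns() calls are standard {janitor} functions, though this particular cut of the data (and the assumption that the "CA" and "GB" country codes appear in it) isn’t from the original analysis:

    # two-way frequency table: ownership type by country
    coffee_chains_processed %>% 
      
      # filter data (country codes assumed to be present in these data)
      dplyr::filter(brand == "Starbucks",
                    country %in% c("US", "CA", "GB")) %>% 
      
      # cross-tabulate ownership type against country
      janitor::tabyl(ownership_type, country) %>% 
      
      # row percentages, formatted, with raw counts appended in parentheses
      janitor::adorn_percentages("row") %>% 
      janitor::adorn_pct_formatting(digits = 1) %>% 
      janitor::adorn_ns()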

    Tabulate Data (US, lower 48)

    # generate frequency table
    coffee_chains_processed %>% 
      
      # filter data
      dplyr::filter(brand   == "Starbucks", 
                    country == "US", 
                    state_province != "AK", 
                    state_province != "HI") %>% 
      
      # tabulate and arrange data
      janitor::tabyl(ownership_type) %>% 
      arrange(desc(percent)) %>% 
      
      # formatting
      janitor::adorn_totals() %>% 
      janitor::adorn_pct_formatting() %>% 
      rmarkdown::paged_table()

    All Starbucks are either company owned, which is almost all of them, or else they’re “licensed” locations, which are the Starbucks in airports, supermarkets, etc. – Charles Partrick

    Map Starbucks Locations

    Now let’s make those maps and get this analysis wrapped up.

    Let’s start by getting a general sense of where in the US these Starbucks are located.

    Data Manipulation

    # Data Manipulation
    starbucks_lower_48 <- coffee_chains_processed %>% 
      
      # filter data
      dplyr::filter(brand   == "Starbucks", 
                    country == "US", 
                    state_province != "AK", 
                    state_province != "HI") 

    Data Visualization

    # Data Visualization
    starbucks_lower_48 %>% 
      
      # setup ggplot canvas + US borders
      ggplot(aes(longitude, latitude, color = ownership_type)) +
      
      # add geometries
      borders("state") +
      geom_point(size = .75, alpha = 0.5) +
      
      # formatting
      ggthemes::theme_map() +   # remove x/y for tidy map
      coord_map() +             # scales map (simple approach)
      scale_color_manual(values = c("#2c3e50", "#18BC9C")) +
      labs(title = "Starbucks Locations by Ownership Type (Lower 48)",
           color = "Ownership Type")

    That’s a solid map, but I think we can do better at identifying patterns in ownership types.

    Let’s calculate the ratio of Corporate (Company Owned) vs. Licensed ownership and map that at the state level.

    Data Acquisition (state boundaries)

    # Get state level lat/long table
    states <- ggplot2::map_data("state") %>% 
      tibble() %>% 
      mutate(region = str_to_title(region))

    Data Manipulation

    # Data Manipulation
    ownership_ratios_by_state <- starbucks_lower_48 %>% 
      
      # count ownership types by state
      group_by(state_province, ownership_type) %>% 
      summarize(n = n()) %>% 
      ungroup() %>% 
      
      # pivot data and calculate ratios
      pivot_wider(names_from = ownership_type, 
                  values_from = n) %>% 
      clean_names() %>% 
      mutate(corp_vs_lic = company_owned/licensed) %>% 
      
      # join to get state names from codes
      left_join(USAboundaries::state_codes %>% 
              select(state_name, state_abbr),
              by = c("state_province" = "state_abbr")) %>% 
      
      # reorder columns
      select(state_name, everything())

    View Data

    ownership_ratios_by_state %>% head()
    ## # A tibble: 6 x 5
    ##   state_name  state_province company_owned licensed corp_vs_lic
    ##   <chr>       <chr>                  <int>    <int>       <dbl>
    ## 1 Alabama     AL                        48       36       1.33 
    ## 2 Arkansas    AR                        35       19       1.84 
    ## 3 Arizona     AZ                       196      283       0.693
    ## 4 California  CA                      1943      839       2.32 
    ## 5 Colorado    CO                       227      250       0.908
    ## 6 Connecticut CT                        83       35       2.37

    Data Visualization

    ownership_ratios_by_state %>% 
      
      # join to get state boundaries (lat/long)
      left_join(states, by = c("state_name" = "region")) %>% 
      
      # setup ggplot canvas + US borders
      ggplot(aes(long, lat, fill = corp_vs_lic, group = group)) +
      
      # add geometries
      geom_polygon() +
      ggplot2::borders("state") +
      
      # formatting
      ggthemes::theme_map() + # remove x/y for tidy map
      theme(legend.position = c(.9, .05)) +
      coord_map(projection = "mercator") + # scales map projection 
      scale_fill_gradient2(low = "white", high = "#18BC9C") + 
      labs(title = "Ratio of Corporate vs. Licensed Starbucks in the US (Lower 48)",
           subtitle = "Darker green equates to more corporate locations compared to licensed establishments.",
           fill = "Ratio of\nCorporate/\nLicensed") 

    This represents the data in a way that helps us identify patterns – our manager will be pleased.

    Wrap Up

    I hope you enjoyed the first post in this new series.

    Leave a comment and let me know.

    Get the code here: Github Repo.

    Subscribe + Share

    Enter your Email Here to get the latest from Exploring-Data in your inbox.

    PS: Be Kind and Tidy your Data.

    PPS: Leave a comment to help guide the subsequent posts in this series.

    Learn R Fast

    I’ve been learning Data Science at Business Science University.

    Join me on the journey.

    Check out this link to get 15% off of the courses that are helping 1000s of analytics professionals take their careers to the next level: Business Science Courses

    Good luck.
