Easy data cleaning with the janitor package
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The janitor
package contains only a little number of functions but nevertheless it is surprisingly convenient to use them. I never really fully appreciated its functionality until I took a look into the documentation. So let’s dive into this package. You can find a video version of this blog post at
INSERT VIDEO HERE
Clean column names
As everyone working with data knows, data sets rarely come in a clean format. Often, the necessary cleaning process already starts with the column names. Here, take this data set from TidyTuesday, week 41.
library(tidyverse) nurses <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-05/nurses.csv') names(nurses) ## [1] "State" ## [2] "Year" ## [3] "Total Employed RN" ## [4] "Employed Standard Error (%)" ## [5] "Hourly Wage Avg" ## [6] "Hourly Wage Median" ## [7] "Annual Salary Avg" ## [8] "Annual Salary Median" ## [9] "Wage/Salary standard error (%)" ## [10] "Hourly 10th Percentile" ## [11] "Hourly 25th Percentile" ## [12] "Hourly 75th Percentile" ## [13] "Hourly 90th Percentile" ## [14] "Annual 10th Percentile" ## [15] "Annual 25th Percentile" ## [16] "Annual 75th Percentile" ## [17] "Annual 90th Percentile" ## [18] "Location Quotient" ## [19] "Total Employed (National)_Aggregate" ## [20] "Total Employed (Healthcare, National)_Aggregate" ## [21] "Total Employed (Healthcare, State)_Aggregate" ## [22] "Yearly Total Employed (State)_Aggregate"
These column names are intuitively easy to understand but not necessarily easy to process by code. There are too many white spaces and other special characters. Therefore, I accompany most data inputs by clean_names()
from the janitor
package.
library(janitor) nurses |> clean_names() |> names() ## [1] "state" ## [2] "year" ## [3] "total_employed_rn" ## [4] "employed_standard_error_percent" ## [5] "hourly_wage_avg" ## [6] "hourly_wage_median" ## [7] "annual_salary_avg" ## [8] "annual_salary_median" ## [9] "wage_salary_standard_error_percent" ## [10] "hourly_10th_percentile" ## [11] "hourly_25th_percentile" ## [12] "hourly_75th_percentile" ## [13] "hourly_90th_percentile" ## [14] "annual_10th_percentile" ## [15] "annual_25th_percentile" ## [16] "annual_75th_percentile" ## [17] "annual_90th_percentile" ## [18] "location_quotient" ## [19] "total_employed_national_aggregate" ## [20] "total_employed_healthcare_national_aggregate" ## [21] "total_employed_healthcare_state_aggregate" ## [22] "yearly_total_employed_state_aggregate"
Did you see what happened? White spaces were converted to _
and parentheses were removed. Even the %
signs were converted to percent
. Now, these labels are easy to understand AND process by code. This does not mean that you are finished cleaning but at least now the columns are more accessible.
Remove empty and or constant columns and rows
Data sets with empty or superfluous rows or columns are not a rare sighting. This is especially true if you work with Excel files because there will be a lot of empty cells. Take a look at the dirty Excel data set from janitor’s GitHub page. It looks like this when you open it with Excel.
Taking a look just at this picture we may notice a couple of things.
First, Jason Bourne is teaching at a school. I guess being a trained assassin qualifies him to teach physical education. Also - and this is just a hunch - undercover work likely earned him his “Theater” certification.
Second, the header above the actual table will be annoying, so we must skip the first line when we read the data set.
Third, the column names are not ideal but we know how to deal with that by now.
Fourth, there are empty rows and columns we can get rid of.
Fifth, there is a column that contains only ‘YES’. Therefore it contains no information at all and can be removed.
So, let us read and clean the data. The janitor
package will help us with remove_empty()
and remove_constant()
.
xl_file <- readxl::read_excel('dirty_data.xlsx', skip = 1) |> clean_names() |> remove_empty() |> remove_constant() xl_file ## # A tibble: 12 × 9 ## first_name last_name employee_status subject hire_date percent_allocated ## <chr> <chr> <chr> <chr> <dbl> <dbl> ## 1 Jason Bourne Teacher PE 39690 0.75 ## 2 Jason Bourne Teacher Drafting 43479 0.25 ## 3 Alicia Keys Teacher Music 37118 1 ## 4 Ada Lovelace Teacher <NA> 38572 1 ## 5 Desus Nice Administration Dean 42791 1 ## 6 Chien-Shiung Wu Teacher Physics 11037 0.5 ## 7 Chien-Shiung Wu Teacher Chemistry 11037 0.5 ## 8 James Joyce Teacher English 36423 0.5 ## 9 Hedy Lamarr Teacher Science 27919 0.5 ## 10 Carlos Boozer Coach Basketball 42221 NA ## 11 Young Boozer Coach <NA> 34700 NA ## 12 Micheal Larsen Teacher English 40071 0.8 ## # ℹ 3 more variables: full_time <chr>, certification_9 <chr>, ## # certification_10 <chr>
Here, remove_empty()
defaulted to remove, both, rows and colums. If we wish, we can change that by setting e.g. which = 'rows'
.
Now, we may also want to see the hire_data
in a sensible format. For example, in this dirty data set, Jason Bourne was hired on 39690
. Luckily, our janitor
can make sense of it all.
xl_file |> mutate(hire_date = excel_numeric_to_date(hire_date)) ## # A tibble: 12 × 9 ## first_name last_name employee_status subject hire_date percent_allocated ## <chr> <chr> <chr> <chr> <date> <dbl> ## 1 Jason Bourne Teacher PE 2008-08-30 0.75 ## 2 Jason Bourne Teacher Drafting 2019-01-14 0.25 ## 3 Alicia Keys Teacher Music 2001-08-15 1 ## 4 Ada Lovelace Teacher <NA> 2005-08-08 1 ## 5 Desus Nice Administration Dean 2017-02-25 1 ## 6 Chien-Shiung Wu Teacher Physics 1930-03-20 0.5 ## 7 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.5 ## 8 James Joyce Teacher English 1999-09-20 0.5 ## 9 Hedy Lamarr Teacher Science 1976-06-08 0.5 ## 10 Carlos Boozer Coach Basketba… 2015-08-05 NA ## 11 Young Boozer Coach <NA> 1995-01-01 NA ## 12 Micheal Larsen Teacher English 2009-09-15 0.8 ## # ℹ 3 more variables: full_time <chr>, certification_9 <chr>, ## # certification_10 <chr>
Rounding
To my surprise shock, R uses some unexpected rounding rule. In my world, whenever a number ends in .5
, standard rounding would round up. Apparently, R uses something called banker’s rounding that in these cases rounds towards the next even number. Take a look.
round(seq(0.5, 4.5, 1)) ## [1] 0 2 2 4 4
I would expect that the rounded vector contains the integers from one to five. Thankfully, janitor
offers a convenient rounding function.
round_half_up(seq(0.5, 4.5, 1)) ## [1] 1 2 3 4 5
Ok, so that gives us a new function for rounding towards integers. But what is really convenient is that janitor
can round_to_fraction
s.
round_to_fraction(seq(0.5, 2.0, 0.13), denominator = 4) ## [1] 0.50 0.75 0.75 1.00 1.00 1.25 1.25 1.50 1.50 1.75 1.75 2.00
Here, I rounded the numbers to the next quarters (denominator = 4
) but of course any fraction is possible. You can now live the dream of rounding towards arbitrary fractions.
Find matches in multiple characteristics
In my opinion, the get_dupes()
function is really powerful. It allows us to find “similar” observations in a data set based on certain characteristics. For example, the starwars
data set from dplyr
contains a lot of informationon characters from the Star Wars movies. Possibly, we want to find out which characters are similar w.r.t. to certain traits.
starwars |> get_dupes(eye_color, hair_color, skin_color, sex, homeworld) |> select(1:8) ## # A tibble: 6 × 8 ## eye_color hair_color skin_color sex homeworld dupe_count name height ## <chr> <chr> <chr> <chr> <chr> <int> <chr> <int> ## 1 blue black yellow female Mirial 2 Luminara U… 170 ## 2 blue black yellow female Mirial 2 Barriss Of… 166 ## 3 blue blond fair male Tatooine 2 Luke Skywa… 172 ## 4 blue blond fair male Tatooine 2 Anakin Sky… 188 ## 5 brown brown light female Naboo 2 Padmé Amid… 185 ## 6 brown brown light female Naboo 2 Dormé 165
So, Luke and Anakin Skywalker are similar to one another. Who would have thought that. Sadly, I don’t enough about Star Wars to know whether the other matches are similarly “surprising”. In any case, the point here is that we can easily find matches according to arbitrarily many characteristics. Conveniently, these characteristics are the first columns of the new output and we get a dupe_count
.
Count all the combinations with tabyl()
I frequently find myself counting how many different things a column in a data set consists of. Let’s look at the mpg
data set from the ggplot2
package.
mpg ## # A tibble: 234 × 11 ## manufacturer model displ year cyl trans drv cty hwy fl class ## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> ## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp… ## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp… ## 3 audi a4 2 2008 4 manu… f 20 31 p comp… ## 4 audi a4 2 2008 4 auto… f 21 30 p comp… ## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp… ## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp… ## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp… ## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp… ## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp… ## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp… ## # ℹ 224 more rows
There, we can count how many different car classes are available in the data set.
mpg |> count(class, sort = TRUE) ## # A tibble: 7 × 2 ## class n ## <chr> <int> ## 1 suv 62 ## 2 compact 47 ## 3 midsize 41 ## 4 subcompact 35 ## 5 pickup 33 ## 6 minivan 11 ## 7 2seater 5
But that’s rarely where I stop. Most of the time, I also want to get the percentages.
mpg |> count(class, sort = TRUE) |> mutate(prop = n / sum(n)) ## # A tibble: 7 × 3 ## class n prop ## <chr> <int> <dbl> ## 1 suv 62 0.265 ## 2 compact 47 0.201 ## 3 midsize 41 0.175 ## 4 subcompact 35 0.150 ## 5 pickup 33 0.141 ## 6 minivan 11 0.0470 ## 7 2seater 5 0.0214
With tabyl()
from {janitor}
there’s a much faster way to do all of that in one go.
mpg |> tabyl(class) ## class n percent ## 2seater 5 0.02136752 ## compact 47 0.20085470 ## midsize 41 0.17521368 ## minivan 11 0.04700855 ## pickup 33 0.14102564 ## subcompact 35 0.14957265 ## suv 62 0.26495726
The only drawback this has is that we do not get a tibble returned but a data frame. I do like the tibble output much more though. You can enforce that with as_tibble()
.
mpg |> tabyl(class) |> as_tibble() ## # A tibble: 7 × 3 ## class n percent ## <chr> <int> <dbl> ## 1 2seater 5 0.0214 ## 2 compact 47 0.201 ## 3 midsize 41 0.175 ## 4 minivan 11 0.0470 ## 5 pickup 33 0.141 ## 6 subcompact 35 0.150 ## 7 suv 62 0.265
It’s tedious to include that every time you count something though. So if you want to do more tabyl()
calls, you might want to override the print()
defaults for tabyl
objects.
print.tabyl <- function(x, ...) { tib <- x |> as_tibble(.name_repair = 'minimal') if ('n' %in% colnames(tib)) { tib |> arrange(desc(n)) |> print() } else { tib |> janitor::clean_names() |> print() } } mpg |> tabyl(class) ## class n percent ## 2seater 5 0.02136752 ## compact 47 0.20085470 ## midsize 41 0.17521368 ## minivan 11 0.04700855 ## pickup 33 0.14102564 ## subcompact 35 0.14957265 ## suv 62 0.26495726
Tabyl works with vectors too
The nice thing about tabyl()
is that it works the same for vectors.
fruits <- sample(fruit, size = 125, replace = TRUE) tabyl(fruits) ## fruits n percent ## apricot 3 0.024 ## avocado 2 0.016 ## banana 1 0.008 ## bell pepper 1 0.008 ## bilberry 1 0.008 ## blood orange 2 0.016 ## blueberry 2 0.016 ## boysenberry 1 0.008 ## breadfruit 2 0.016 ## canary melon 3 0.024 ## cantaloupe 3 0.024 ## cherimoya 4 0.032 ## cherry 1 0.008 ## clementine 2 0.016 ## cloudberry 1 0.008 ## coconut 2 0.016 ## cranberry 2 0.016 ## currant 2 0.016 ## damson 1 0.008 ## date 2 0.016 ## dragonfruit 2 0.016 ## durian 2 0.016 ## eggplant 2 0.016 ## elderberry 1 0.008 ## feijoa 4 0.032 ## goji berry 4 0.032 ## gooseberry 4 0.032 ## grape 1 0.008 ## grapefruit 3 0.024 ## guava 4 0.032 ## honeydew 1 0.008 ## huckleberry 1 0.008 ## jackfruit 1 0.008 ## jujube 2 0.016 ## kiwi fruit 3 0.024 ## kumquat 1 0.008 ## lime 2 0.016 ## loquat 1 0.008 ## lychee 1 0.008 ## mandarine 1 0.008 ## mulberry 1 0.008 ## nectarine 3 0.024 ## nut 1 0.008 ## olive 1 0.008 ## orange 3 0.024 ## pamelo 1 0.008 ## passionfruit 3 0.024 ## peach 1 0.008 ## persimmon 2 0.016 ## physalis 3 0.024 ## pineapple 1 0.008 ## plum 1 0.008 ## pomegranate 2 0.016 ## pomelo 1 0.008 ## purple mangosteen 3 0.024 ## quince 2 0.016 ## raisin 1 0.008 ## rambutan 1 0.008 ## raspberry 1 0.008 ## rock melon 2 0.016 ## salal berry 3 0.024 ## star fruit 2 0.016 ## tamarillo 3 0.024 ## tangerine 3 0.024 ## ugli fruit 1 0.008
And with two-way & three-way counts too
Even cooler, tabyl()
can generate two-way and three-way counts for you. In the case of three-way counts, you will get a list of tibbles.
mpg |> tabyl(class, trans) ## class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5) ## 2seater 0 0 1 0 0 0 0 ## compact 2 1 8 4 0 2 2 ## midsize 3 0 14 5 0 1 0 ## minivan 0 1 8 0 2 0 0 ## pickup 0 0 12 8 0 0 0 ## subcompact 0 0 11 4 0 0 0 ## suv 0 0 29 18 4 0 1 ## auto(s6) manual(m5) manual(m6) ## 1 0 3 ## 5 18 5 ## 6 9 3 ## 0 0 0 ## 0 8 5 ## 1 16 3 ## 3 7 0 mpg |> tabyl(class, trans, fl) ## $c ## class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5) ## 2seater 0 0 0 0 0 0 0 ## compact 0 0 0 0 0 0 0 ## midsize 0 0 0 0 0 0 0 ## minivan 0 0 0 0 0 0 0 ## pickup 0 0 0 0 0 0 0 ## subcompact 0 0 0 1 0 0 0 ## suv 0 0 0 0 0 0 0 ## auto(s6) manual(m5) manual(m6) ## 0 0 0 ## 0 0 0 ## 0 0 0 ## 0 0 0 ## 0 0 0 ## 0 0 0 ## 0 0 0 ## ## $d ## class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5) ## 2seater 0 0 0 0 0 0 0 ## compact 0 0 0 0 0 0 0 ## midsize 0 0 0 0 0 0 0 ## minivan 0 0 0 0 0 0 0 ## pickup 0 0 0 0 0 0 0 ## subcompact 0 0 1 0 0 0 0 ## suv 0 0 1 1 0 0 0 ## auto(s6) manual(m5) manual(m6) ## 0 0 0 ## 0 1 0 ## 0 0 0 ## 0 0 0 ## 0 0 0 ## 0 1 0 ## 0 0 0 ## ## $e ## class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5) ## 2seater 0 0 0 0 0 0 0 ## compact 0 0 0 0 0 0 0 ## midsize 0 0 0 0 0 0 0 ## minivan 0 0 1 0 0 0 0 ## pickup 0 0 0 2 0 0 0 ## subcompact 0 0 0 0 0 0 0 ## suv 0 0 2 2 0 0 0 ## auto(s6) manual(m5) manual(m6) ## 0 0 0 ## 0 0 0 ## 0 0 0 ## 0 0 0 ## 0 0 1 ## 0 0 0 ## 0 0 0 ## ## $p ## class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5) ## 2seater 0 0 1 0 0 0 0 ## compact 2 0 0 4 0 1 0 ## midsize 2 0 1 3 0 1 0 ## minivan 0 0 0 0 0 0 0 ## pickup 0 0 0 0 0 0 0 ## subcompact 0 0 0 0 0 0 0 ## suv 0 0 4 2 0 0 1 ## auto(s6) manual(m5) manual(m6) ## 1 0 3 ## 4 5 5 ## 4 2 2 ## 0 0 0 ## 0 0 0 ## 0 1 2 ## 0 1 0 ## ## $r ## class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5) ## 2seater 0 0 0 0 0 0 0 ## compact 0 1 8 0 0 1 2 ## midsize 1 0 13 2 0 0 0 ## minivan 0 1 7 0 2 0 0 ## pickup 0 0 12 6 0 0 0 ## subcompact 0 0 10 3 0 0 0 ## suv 0 0 22 13 4 0 0 ## auto(s6) manual(m5) manual(m6) ## 0 0 0 ## 1 12 0 ## 2 7 1 ## 0 0 0 ## 0 8 4 ## 1 14 1 ## 3 6 0
Style your tabyls
The nice thing with tabyls
is that you can customize them through adorn_*()
functions. For example, you can add totals or make the percent labels nicer.
mpg |> tabyl(class) |> # Adds a total row adorn_totals() |> # Makes the percent columns look nicer adorn_pct_formatting() ## class n percent ## 2seater 5 2.1% ## compact 47 20.1% ## midsize 41 17.5% ## minivan 11 4.7% ## pickup 33 14.1% ## subcompact 35 15.0% ## suv 62 26.5% ## Total 234 100.0%
Or you could also use counts and percentages in two-way counts.
mpg |> tabyl(class, trans) |> # Adds a total row adorn_totals() |> # Turns counts into percentages adorn_percentages() |> # Makes percentages look nice adorn_pct_formatting() |> # includes the counts (n) again adorn_ns() |> # Adds a title row adorn_title() ## trans ## class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5) ## 2seater 0.0% (0) 0.0% (0) 20.0% (1) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) ## compact 4.3% (2) 2.1% (1) 17.0% (8) 8.5% (4) 0.0% (0) 4.3% (2) 4.3% (2) ## midsize 7.3% (3) 0.0% (0) 34.1% (14) 12.2% (5) 0.0% (0) 2.4% (1) 0.0% (0) ## minivan 0.0% (0) 9.1% (1) 72.7% (8) 0.0% (0) 18.2% (2) 0.0% (0) 0.0% (0) ## pickup 0.0% (0) 0.0% (0) 36.4% (12) 24.2% (8) 0.0% (0) 0.0% (0) 0.0% (0) ## subcompact 0.0% (0) 0.0% (0) 31.4% (11) 11.4% (4) 0.0% (0) 0.0% (0) 0.0% (0) ## suv 0.0% (0) 0.0% (0) 46.8% (29) 29.0% (18) 6.5% (4) 0.0% (0) 1.6% (1) ## Total 2.1% (5) 0.9% (2) 35.5% (83) 16.7% (39) 2.6% (6) 1.3% (3) 1.3% (3) ## ## auto(s6) manual(m5) manual(m6) ## 20.0% (1) 0.0% (0) 60.0% (3) ## 10.6% (5) 38.3% (18) 10.6% (5) ## 14.6% (6) 22.0% (9) 7.3% (3) ## 0.0% (0) 0.0% (0) 0.0% (0) ## 0.0% (0) 24.2% (8) 15.2% (5) ## 2.9% (1) 45.7% (16) 8.6% (3) ## 4.8% (3) 11.3% (7) 0.0% (0) ## 6.8% (16) 24.8% (58) 8.1% (19)
Conclusion
Sweeeet! We learned a whole lot of convenient helper functions from the janitor package. I hope you enjoyed this little tutorial. Have a great day and see you next time. And if you found this helpful, here are some other ways I can help you:
- 3 Minute Wednesdays: A weekly newsletter with bite-sized tips and tricks for R users
- Insightful Data Visualizations for “Uncreative” R Users: A course that teaches you how to leverage
{ggplot2}
to make charts that communicate effectively without being a design expert.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.