Data Cleaning in R: 2 R Packages to Clean and Validate Datasets
Real-world datasets are messy. Unless the dataset was created for teaching purposes, you’ll likely spend hours or even tens of hours cleaning it before you can show it on a dashboard. That’s where two packages for data cleaning in R come into play: janitor and data.validator. Today you’ll learn how to use them together.
If you’re a software engineer, think of data cleaning and validation as writing and testing code. Data cleaning is like coding an app: it takes a huge amount of time to get it working correctly. Validation is like testing: you can’t be sure the app works as expected until you’ve tested it properly. They’re not two separate concepts; rather, one is an extension of the other.
Regardless of whether you’re a software engineer or a data scientist, combining these two is the way to go.
Table of contents:
- Data Cleaning in R with the Janitor Package
- Data Validation in R with the data.validator Package
- Summary
Data Cleaning in R with the Janitor Package
So, what is janitor? Put simply, it’s an R package with simple functions for examining and cleaning dirty data. It can format data frame column names, isolate duplicate and partially duplicate records, isolate empty and constant data, and much more!
We’ll use janitor extensively throughout this section to clean custom datasets and to isolate duplicates in the well-known Iris dataset.
Cleaning column names
Imagine you had a dataset with terribly formatted column names. Would you clean them by hand? Well, that’s an option if you only have a couple of them. Real-world datasets often have hundreds of columns, so the by-hand approach is a no-go.
The janitor package has a nifty clean_names() function that reformats data frame column names.
The snippet below creates a data frame with inconsistent column names: some are blank, repeated, or have unwanted characters. janitor cleans them instantly:
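The original snippet is not reproduced here, so the following is a minimal sketch of the idea. The data frame `messy_df` and its column names are made up for illustration; only `clean_names()` comes from janitor.

```r
library(janitor)

# Hypothetical data frame (one row of NA values) used only to carry
# messy column names: camelCase, symbols, a repeated name, and a blank.
messy_df <- as.data.frame(matrix(ncol = 5))
names(messy_df) <- c(
  "firstName",
  "% successful (2009)",
  "REPEAT VALUE",
  "REPEAT VALUE",
  ""
)

# clean_names() returns the same data frame with syntactic,
# unique, snake_case column names.
clean_df <- clean_names(messy_df)
names(clean_df)
```

After the call, every name should be lowercase snake_case and unique, which makes the columns safe to reference with `$` or inside dplyr verbs.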
Cleaning column names – Approach #2
There’s another way to clean data frame column names: the make_clean_names() function. It works on a character vector of names rather than on a whole data frame.
The snippet below shows a tibble of the Iris dataset:
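As a sketch of that starting point, assuming the tibble package is installed (the variable name `iris_tbl` is just illustrative):

```r
library(tibble)

# Convert the built-in iris data frame to a tibble and print it.
# Note the dot-separated column names such as Sepal.Length.
iris_tbl <- as_tibble(iris)
iris_tbl
```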
Separating words with a dot could lead to messy or unreadable R code. It’s preferred to use underscores instead. janitor can do it automatically for you:
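One way this might look is sketched below. The copy `iris_clean` is an illustrative name; `make_clean_names()` takes the vector of existing names and returns the cleaned vector, which is then assigned back.

```r
library(janitor)

# Work on a copy so the built-in iris dataset stays untouched.
iris_clean <- iris

# make_clean_names() operates on a character vector, so pass it
# the current names and assign the result back.
names(iris_clean) <- make_clean_names(names(iris_clean))
names(iris_clean)
```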
The column names are now much more consistent with what you’d find in other datasets.
Removing empty data
It’s not rare to get a dataset with missing values. Filling them isn’t always straightforward. Approaches like average value imputation are often naive, and you should have good domain knowledge before using them.
Sometimes, it’s best to remove missing values altogether. The remove_empty() function drops rows, columns, or both when they consist entirely of missing values. Take a look at an example:
Easy, right? Feel free to experiment with different options for the which parameter to get the hang of it.
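Here is a minimal sketch with a made-up data frame (`df` and its columns are hypothetical). Note that only rows and columns that are entirely NA are dropped; a row with some missing values is kept.

```r
library(janitor)

# Hypothetical data: row 3 and the "notes" column contain only NA.
# Row 2 has a single NA and should survive.
df <- data.frame(
  id    = c(1, 2, NA, 4),
  score = c(10, NA, NA, 40),
  notes = c(NA, NA, NA, NA)
)

# Drop fully empty rows and fully empty columns in one call.
cleaned <- remove_empty(df, which = c("rows", "cols"))
cleaned
```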
Removing constant data
A column with only one unique value is useless. It provides no value for analysis, visualization, or even training machine learning models. It’s best to remove such columns entirely. Use the remove_constant() function for the task:
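A short sketch, again with an invented data frame (`df` and its columns are assumptions for illustration):

```r
library(janitor)

# Hypothetical data: "source" holds the same value in every row.
df <- data.frame(
  id     = 1:4,
  source = rep("sensor_a", 4),
  value  = c(2.5, 3.1, 4.7, 1.2)
)

# remove_constant() drops columns that contain a single unique value.
trimmed <- remove_constant(df)
names(trimmed)
```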
Keep in mind: Only remove constant data if you’re 100% certain other values are not possible. For example, maybe you’re looking at a small sample of a larger dataset that originally has multiple values in the given column. Be extra careful.
Isolating duplicate data
Missing data is no fun, but duplicates can be even worse! Two rows with identical values convey the same information. If they’re in the dataset by accident, they might skew your analysis and models if left untouched.
Luckily, janitor comes with a get_dupes() function you can use to check. The code snippet below considers a row a duplicate only if the values in all columns are identical:
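A sketch of that check on the Iris dataset might look like this (the variable name `dupes_all` is illustrative):

```r
library(janitor)

# With no columns specified, get_dupes() compares entire rows.
# The result contains every row that has an exact copy, plus a
# dupe_count column saying how many times it occurs.
dupes_all <- get_dupes(iris)
dupes_all
```

The built-in iris data contains a single pair of identical rows, so the result has two rows, each with a dupe_count of 2.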
You can also specify the columns that will be used when checking for duplicates:
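The columns used in the original example are not shown, so the pair below (Sepal.Length and Species) is an assumption chosen for illustration:

```r
library(janitor)

# Rows count as duplicates when they share the same values in the
# listed columns only; all other columns are ignored for the check.
dupes_subset <- get_dupes(iris, Sepal.Length, Species)
head(dupes_subset)
nrow(dupes_subset)
```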
As you can see, we get many more duplicates the second time, simply because fewer columns were used for the check.
Janitor package – summary
The janitor
package is extremely powerful when it comes to data cleaning in R. We’ve explored basic functionality which is enough to clean most datasets. So, what’s the next step?
As mentioned earlier, the next step is data validation. It will make sure all test cases have passed.
Data Validation in R with the data.validator Package
Appsilon’s data.validator is a go-to package for scalable and reproducible data validation. You can use it to validate a dataset in the IDE, and you can even export an interactive report. You’ll learn how to do both.
For simplicity’s sake, we’ll use the Iris dataset for validation. You’re free to use any dataset and any validation condition.
You’ll have to start by creating a report object and then use validation functions, such as validate_if() and validate_cols(), to validate conditions:
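The exact checks from the original snippet are not reproduced here. The sketch below follows the pattern from the data.validator README and uses predicates from assertr; the three conditions and their descriptions are assumptions picked for illustration.

```r
library(assertr)
library(magrittr)
library(data.validator)

# Create an empty report object that will collect the results.
report <- data_validation_report()

# Chain the checks on iris and store the outcome in the report.
# The conditions below are illustrative, not the original ones.
validate(iris, name = "Verifying flower dataset") %>%
  validate_if(Sepal.Length > 0,
              description = "Sepal length is greater than 0") %>%
  validate_cols(in_set(c("setosa", "versicolor", "virginica")), Species,
                description = "Species is one of three known categories") %>%
  validate_cols(within_bounds(0, 4), Sepal.Width,
                description = "Sepal width is between 0 and 4") %>%
  add_results(report)

# Print a summary of passed and failed validations to the console.
print(report)
```

With these particular conditions, the Sepal.Width check is the one expected to fail, since a few iris rows have a sepal width above 4.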
It looks like one validation failed with three violations. You can’t see more details in the console, unfortunately. But what you can do is create an HTML report instead:
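A minimal sketch of the export step is shown below, assuming rmarkdown and pandoc are available. The single check and the output location (a temporary directory) are illustrative choices; `save_report()` writes to validation_report.html in the working directory by default.

```r
library(assertr)
library(magrittr)
library(data.validator)

# Build a small report first; the check is an illustrative assumption.
report <- data_validation_report()

validate(iris, name = "Verifying flower dataset") %>%
  validate_cols(within_bounds(0, 4), Sepal.Width,
                description = "Sepal width is between 0 and 4") %>%
  add_results(report)

# Render the report as an interactive HTML file.
out_dir <- tempdir()
save_report(report,
            output_file = "validation_report.html",
            output_dir = out_dir)

# Path of the generated file, ready to open in a browser.
file.path(out_dir, "validation_report.html")
```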
Unlike with the console option, now you can click on the Show button to get detailed insights into why the validation check failed:
As you can see, the Sepal.Width column was outside the given range in these three instances, so the validation check failed.
Want to learn more about data.validator? Read our complete guide on the Appsilon blog.
Summary of Data Cleaning in R
Long story short: it’s crucial to clean and validate your dataset before continuing with analysis, visualization, or predictive modeling. Today you’ve seen how to approach this task with two highly capable R packages. They work best when used together, with janitor for data cleaning and data.validator for validation.
For a homework assignment, we recommend you download any messy dataset of your choice and use the two packages discussed here for cleaning and validation. Share your results with us on Twitter – @appsilon. We’d love to see what you come up with.
The post Data Cleaning in R: 2 R Packages to Clean and Validate Datasets appeared first on Appsilon | Enterprise R Shiny Dashboards.