Every data science project needs a data validation step. It’s a crucial part, especially when feeding data into machine learning models, because you don’t want errors or unexpected behavior in a production environment. Data validation lets you check data before it reaches the model and ensure it isn’t corrupted. And yes, you can automate data quality reports!
Today you’ll learn how to work with datasets in R, in order to create automated R data quality reports. The best part – you’ll build a UI around the data validation logic, so you can easily upload CSV files for examination and validation, and R Shiny will handle the rest. Let’s dive straight in!
Looking to create reports with R Markdown? Read our top tips on code, images, comments, tables, and more.
Table of contents:
- How to Approach R Data Quality Checks for Reporting
- Introduction to the data.validator Package
- Data Validation and Automated Reporting in R Shiny
- Summing Up Automated R Data Quality Reporting
How to Approach R Data Quality Checks for Reporting
In machine learning, you typically train and optimize a model once and then deploy it. What happens from there is more or less the Wild West, because you can’t know how users will use your model. For example, if you build a web interface around it and allow data entry, you can expect some users to enter values that have nothing to do with the data your model was trained on. If a feature value typically ranges between 0 and 10 but the user enters 10000, things will go wrong.
Some of these risks can be avoided with basic form validation, but sooner or later an unexpected request will go through. That’s where data quality checks play a huge role.
So, what are your options? Several R data quality packages are available, but today we’ll focus on Appsilon’s data.validator. It’s a package we developed for scalable and reproducible data validations, and it includes a ton of functions for adding checks to the data.
Other options for automated data quality reports with R exist, such as pagedown and officer, and you’re free to use them. We found these alternatives capable of report automation, but nowhere near as interactive and scalable as data.validator.
Let’s dive into the examples next.
Introduction to the data.validator Package
The package is available on CRAN, which means you can install it easily through the R console:
install.packages("data.validator")
Alternatively, you can install the latest development version:
remotes::install_github("Appsilon/data.validator")
We’ll work with the Iris dataset for R data quality reports, and we recommend you download the CSV version instead of using the one built into R. You’ll see reasons why in the following section.
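If you’d rather stick with the dataset built into R, a quick sketch like the one below produces an equivalent CSV. The lowercase, dot-separated column names and capitalized variety labels are assumptions matching the downloadable file; adjust them if your copy differs:

# Reshape R's built-in iris to match the downloadable CSV
# (assumed layout: lowercase dotted column names, capitalized variety labels)
df <- datasets::iris
names(df) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "variety")
df$variety <- tools::toTitleCase(as.character(df$variety))  # "setosa" -> "Setosa"
write.csv(df, "iris.csv", row.names = FALSE)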
For now, just load these couple of libraries and read the dataset:
library(assertr)
library(dplyr)
library(data.validator)

df_iris <- read.csv("/path/to/iris.csv")
head(df_iris)
Here’s what it looks like:
Now, how can you implement data quality checks on this dataset? Just imagine you were to build a machine learning model on it, and you want to somehow regulate the constraints of user-entered data going into the forecasting. First, you need to think of the conditions that have to be satisfied. Let’s mention a few:
- Values can’t be missing (NA)
- Column values have to be in a given range, let’s say between 0 and 10 for sepal.width and sepal.length
You can add these and many other conditions to a data validation report. Here’s how:
# A custom function that returns a predicate
between <- function(a, b) {
  function(x) {
    a <= x & x <= b
  }
}

# Initialize the report
report <- data_validation_report()

# Add validation results to the report
validate(data = df_iris, description = "Iris Dataset Validation Test") %>%
  validate_cols(predicate = in_set(c("Setosa", "Virginica", "Versicolor")), variety,
                description = "Correct species category") %>%
  validate_cols(predicate = not_na, sepal.length:variety,
                description = "No missing values") %>%
  validate_cols(predicate = between(0, 10), sepal.length,
                description = "Column sepal.length between 0 and 10") %>%
  validate_cols(predicate = between(0, 10), sepal.width,
                description = "Column sepal.width between 0 and 10") %>%
  add_results(report = report)

# Print the report
print(report)
The between() function is user-defined, allowing you to check whether a value falls within a range. You can define your own custom functions in a similar manner, or use the ones built into data.validator.
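For example, a predicate that rejects non-positive measurements could look like the following. The is_positive() helper is hypothetical, not part of the package; any function that maps a column to TRUE/FALSE values can serve as a predicate:

# A hypothetical custom predicate: TRUE only for non-missing, strictly positive values
is_positive <- function(x) {
  !is.na(x) & x > 0
}

# It would then plug into a validation chain like any other predicate:
# validate_cols(predicate = is_positive, sepal.length,
#               description = "Column sepal.length is positive")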
Here’s what the printed report looks like:
Saving Reports Locally – HTML, CSV, and TXT
You can also save the report locally in HTML format and open it by running the following code:
save_report(report)
browseURL("validation_report.html")
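If you’re generating these reports on a schedule, you’ll probably want a distinct file name per run. In the version of the package we used, save_report() accepts output_file and output_dir arguments; treat the exact parameter names as an assumption and confirm with ?save_report for your version:

# Sketch: a date-stamped report for scheduled runs
# (output_file / output_dir argument names are assumptions -- check ?save_report)
if (!dir.exists("reports")) dir.create("reports")
save_report(report,
            output_file = paste0("validation_report_", Sys.Date(), ".html"),
            output_dir = "reports")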
In case you prefer a simpler, flatter design without colors, change the value of the ui_constructor parameter:
save_report(report, ui_constructor = render_raw_report_ui)
browseURL("validation_report.html")
You’re not limited to saving R data quality reports to HTML. After all, it’s not the most straightforward file format to automatically parse and see if any validation checks failed. For that reason, we include an option for saving reports as CSV files:
save_results(report, "results.csv")
It’s not as visually attractive, sure, but you can easily write scripts that would read these CSV files if you want to automate data quality checks.
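As a sketch, a monitoring script could read that file back and fail loudly when any check didn’t pass. The column names below (a type column holding values like "success") reflect the output we saw from save_results(); treat them as assumptions and inspect your own CSV first:

# Sketch of an automated check on the saved results
# (the "type" column and its values are assumptions -- inspect your CSV first)
results <- read.csv("results.csv", stringsAsFactors = FALSE)
failed <- results[results$type != "success", ]
if (nrow(failed) > 0) {
  message(nrow(failed), " validation check(s) did not pass:")
  print(failed$description)
} else {
  message("All validation checks passed.")
}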
And finally, there’s an option to save the report summary as a plain text document:
save_summary(report, "validation_log.txt")
The saved summary looks just like the report printed to the R console, which may be the format you prefer.
Want to dive deeper into data.validator? This article further explores validation functions.
You now know the basic idea behind R data quality checks and reports, so next, we’ll take this a step further by introducing R Shiny.
Data Validation and Automated Reporting in R Shiny
By now you’ve created a data validation report with data.validator, so why bring R Shiny into the mix? The answer is simple – it allows you to build an application around the data validation logic, which further simplifies data quality checks for non-technical users. The app you’ll create in a minute lets the user upload a CSV file, for which a validation report is displayed.
We recommend you save the following snippet in a separate CSV file. It contains a couple of Iris instances that will fail the data validation test:
"sepal.length","sepal.width","petal.length","petal.width","variety" 5.1,3.5,1.4,.2,"Setosa" 100,3,1.4,.2,"Setosa" 4.7,3.2,47,.2,"Setosa" 4.6,3.1,1.5,.2,"Sertosa" 5,NA,1.4,.2,"Setosa"
As you can see, in each failing row either the species is misspelled, a value is missing, or a value is out of range.
Our Shiny app will have a sidebar that allows for CSV file upload and the main section that renders the head of the uploaded CSV file and its validation report.
Keep in mind: R Shiny already has a validate() function, so we have to write data.validator::validate() explicitly to avoid confusion and errors:
library(shiny)
library(data.validator)
library(assertr)
library(dplyr)

# data.validator helper function
between <- function(a, b) {
  function(x) {
    ifelse(!is.na(x), a <= x & x <= b, FALSE)
  }
}

ui <- fluidPage(
  titlePanel("Appsilon's data.validator Shiny Example"),
  sidebarLayout(
    sidebarPanel(
      fileInput(inputId = "dataFile", label = "Choose CSV File",
                multiple = FALSE, accept = c(".csv")),
      checkboxInput(inputId = "header", label = "Data has a Header row", value = TRUE)
    ),
    mainPanel(
      tableOutput(outputId = "datasetHead"),
      uiOutput(outputId = "validation")
    )
  )
)

server <- function(input, output, session) {
  # Store the dataset as a reactive value
  data <- reactive({
    req(input$dataFile)
    tryCatch(
      {
        df <- read.csv(file = input$dataFile$datapath, header = input$header)
      },
      error = function(e) {
        stop(safeError(e))
      }
    )
  })

  # Render the table with the first 5 rows
  output$datasetHead <- renderTable({
    return(head(data(), 5))
  })

  # Render the data validation report
  output$validation <- renderUI({
    report <- data_validation_report()
    data.validator::validate(data(), description = "Iris Dataset Validation Test") %>%
      validate_cols(in_set(c("Setosa", "Virginica", "Versicolor")), variety,
                    description = "Correct species category") %>%
      validate_cols(predicate = not_na, sepal.length:variety,
                    description = "No missing values") %>%
      validate_cols(predicate = between(0, 10), sepal.length,
                    description = "Column sepal.length between 0 and 10") %>%
      validate_cols(predicate = between(0, 10), sepal.width,
                    description = "Column sepal.width between 0 and 10") %>%
      validate_cols(predicate = between(0, 10), petal.length,
                    description = "Column petal.length between 0 and 10") %>%
      validate_cols(predicate = between(0, 10), petal.width,
                    description = "Column petal.width between 0 and 10") %>%
      add_results(report)

    render_semantic_report_ui(get_results(report = report))
  })
}

shinyApp(ui = ui, server = server)
Here’s what the app looks like:
And that’s how easy it is to build a UI around the data validation pipeline. You can (and should) add more checks, especially for custom datasets with many attributes – see the sketch below. The overall procedure stays identical; only the validation part gets longer.
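As an illustration, here are two extra checks you could append, building on the df_iris and report objects from the earlier section. validate_if() comes from data.validator and within_bounds() from assertr in the versions we used, but double-check the signatures against your installed packages:

# Sketch: two additional checks added to the same report
validate(data = df_iris, description = "Additional Iris checks") %>%
  # Row-level rule (our own illustrative example): sepals in this dataset
  # should be longer than petals
  validate_if(sepal.length > petal.length,
              description = "Sepal longer than petal") %>%
  # Column-level rule using assertr's built-in within_bounds() predicate
  validate_cols(predicate = within_bounds(0, 10), petal.width,
                description = "Column petal.width between 0 and 10") %>%
  add_results(report = report)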
Let’s make a short recap next.
Summing Up Automated R Data Quality Reporting
In data science, it’s essential to stay on top of your data. You never know what the user may enter into a form or how the data may change over time, so that’s where data validation and automated data quality checks come in handy. You should create constraints around data going into a machine learning model if you want to guarantee reasonable predictions. Quality data in – quality prediction out.
Appsilon’s data.validator package simplifies data quality checks with a ton of built-in functions. You can also declare custom ones, so there’s no hard limit on the checks you want to make. The package can also save data quality reports in HTML, CSV, and TXT formats, and it’s fully compatible with R Shiny. What more do you need?
What are your thoughts on data.validator and automated data quality checks in general? Which tools/packages do you use daily? Let us know in the comment section below, and don’t hesitate to reach out on Twitter – @appsilon. We’d love to hear from you.
Does R Shiny seem more interesting by the day? Our detailed guide shows you how to make a career out of it.