Writing Functions in R: Example One
A. Background
In previous posts, I covered a number of useful functions and packages for writing reusable code. I wanted to build on that information by providing a working example of how to put together a function. In particular, I will walk through the process of writing a function that evaluates a time series. Going through a task step by step will hopefully be useful for those who are just starting to use R for programming and writing more abstract, generalizable code. This example will use a mix of the data.table package, base R, and various tidyverse functions. In general, I would say it is important to be versatile and make use of the many tools and functions available in the R ecosystem.
For this blog post, we will use the following data from the forecastxgb package. Because the original data is stored as a ts object, we will convert it with as.xts and then as.data.table to get it into our desired format. The final data is stored as a data table entitled myts. While I do use both ts and xts objects, I generally use data frames or data tables when I am putting together generalizable functions that pertain to time series analysis.
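    library(data.table)
    library(xts)
    library(forecast)  # supplies woolyrnq and tbats(), which is used later

    # convert the ts object to xts, then to a data.table
    myts = as.data.table(as.xts(woolyrnq))
    colnames(myts) = c("date", "value")
    myts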
B. Initial Steps
The first function I will put together will take time series data and evaluate whether some common characteristics are present. There are three arguments to this function. The first is the data set. The second and third are the names of the data column and the date column, both of which have default values.
Here is what our initial outline would look like for this function.
    Forecast_Evaluation <- function(full_dat, data_column = "value", date_column = "date"){
      ...
    }
As a first step in writing this function, we may want to check that its inputs meet certain conditions. For example, to check that the user provided a data table as the input, we could use the assert_that function from the assertthat package. If the input is not a data.table, the function will throw an error message and the remaining code in the function will not be executed.
    Forecast_Evaluation <- function(full_dat, data_column = "value", date_column = "date"){
      assertthat::assert_that(any(class(full_dat) == "data.table"),
                              msg = "the data is not stored in a data.table...please investigate")
      print("Completed execution!")
    }
Let us try this code out using different inputs.
    Forecast_Evaluation(myts, data_column = "value", date_column = "date")
    # [1] "Completed execution!"

    Forecast_Evaluation(as.data.frame(myts), data_column = "value", date_column = "date")
    # Error: the data is not stored in a data.table...please investigate
In the first example, we called the function with a data.table as the input and the name of a column present in that data, and it executed without error. In the second example, an error is thrown because the input is a data.frame rather than a data.table.
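The same kind of guard can also be written without assertthat. The sketch below is one base R way to do it, assuming a hypothetical helper named check_class that is not part of the original function. It uses inherits() rather than comparing class() to a string, and it is demonstrated on a plain data frame so that it needs no extra packages.

```r
# Hypothetical helper (not from the original post): stop with a readable
# message unless `x` inherits from the class named in `what`.
check_class <- function(x, what) {
  if (!inherits(x, what)) {
    stop("expected an object of class '", what, "' but got '",
         paste(class(x), collapse = "/"), "'", call. = FALSE)
  }
  invisible(TRUE)
}

df <- data.frame(date = 1:3, value = c(10, 12, 11))

check_class(df, "data.frame")   # passes silently

# A failing check raises an error; tryCatch() captures its message here
outcome <- tryCatch(
  check_class(df, "data.table"),
  error = function(e) conditionMessage(e)
)
outcome
```

Setting call. = FALSE keeps the error message short, which tends to be easier to read when the check sits inside a larger function.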
While this type of defensive programming is useful in some cases, I tend to avoid getting too obsessed with checking conditions. With that said, when it comes to more intricate projects, I will actually create a separate function to check conditions. For example, the following user-defined function checks whether an argument is a vector, returns FALSE when the argument is NULL and NULL is not allowed, and stops execution if the argument is not a vector.
    is_vector <- function(x, class, length = 1, nullable = TRUE, verbose = FALSE, debug = FALSE) {
      # NULL input is acceptable only when nullable = TRUE
      if (is.null(x) && !nullable) {
        return(FALSE)
      } else if (is.null(x)) {
        return(invisible(TRUE))
      }
      # report the argument as the caller wrote it, along with its actual class
      if (!is.vector(x)) {
        stop('Argument "', deparse(substitute(x)), '" is not a vector. Argument expects a ',
             class, ' vector of length ', length, ', not ', base::class(x)[1], '.')
      }
      return(TRUE)
    }
The function above for checking conditions would then be inserted in the function as follows.
    Forecast_Evaluation <- function(full_dat, data_column = "value", date_column = "date"){
      assertthat::assert_that(any(class(full_dat) == "data.table"),
                              msg = "the data is not stored in a data.table...please investigate")
      if(!is_vector(data_column, 'character', 1, FALSE)) {
        stop('data_column is not defined.')
      }
      print("Completed execution!")
    }
This certainly complicates the code, but it is still worth considering when putting together code for a package or more complex processes.
Let us run the function using the condition checker functions that I defined.
    Forecast_Evaluation(myts, data_column = NULL, date_column = "date")
    # Error in Forecast_Evaluation(myts, data_column = NULL) :
    #   data_column is not defined.

    Forecast_Evaluation(myts, data_column = "value", date_column = "date")
    # [1] "Completed execution!"
In the first example, the code throws an error because the data_column argument is NULL and the check does not allow NULL values. The second example runs because we have provided the function with a data table and a data_column input that is a character vector of length one.
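Beyond checking for NULL, it can also be worth confirming that the supplied column name actually exists in the data. The following is a minimal sketch of such a check. The name validate_column is hypothetical and introduced here only for illustration; it is not part of the original code.

```r
# Hypothetical validator (an assumption, not the author's code): confirms that
# `column` is a single, non-missing character string naming a column of `dat`.
validate_column <- function(dat, column) {
  if (!is.character(column) || length(column) != 1L || is.na(column)) {
    stop("column must be a single character string", call. = FALSE)
  }
  if (!column %in% names(dat)) {
    stop("column '", column, "' not found in the data", call. = FALSE)
  }
  invisible(TRUE)
}

dat <- data.frame(date = 1:4, value = c(5, 7, 6, 8))

validate_column(dat, "value")   # passes silently

# Each bad input below raises an error; try() with silent = TRUE captures it
bad_null    <- try(validate_column(dat, NULL), silent = TRUE)
bad_missing <- try(validate_column(dat, "price"), silent = TRUE)
inherits(bad_null, "try-error")
inherits(bad_missing, "try-error")
```

Because the check only relies on names(), the same helper should work for a data.table as well as a data frame.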
C. Storing the Output
Before we start putting the function together, one thing we will need is some sort of data structure where we can save the results. When we take a time series and assess its characteristics, we want to save each of those results in a data structure that is initialized at the start of the function. It is best to use a list rather than a data frame because, if some sort of loop is required, repeatedly rbinding rows together tends to be inefficient.
    Forecast_Evaluation <- function(full_dat, data_column = "value", date_column = "date"){
      Evaluation_Results <- list()
    }
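To illustrate the point about lists and loops, here is a small base R sketch that stores one result per iteration in a pre-allocated list and binds the pieces together once at the end. The toy series and the object names are made up for this example.

```r
# Toy input: three short series (illustrative values only)
series_list <- list(a = c(1, 3, 2, 5), b = c(10, 9, 12, 11), c = c(4, 4, 6, 5))

# Pre-allocate a named list with one slot per series
results <- vector("list", length(series_list))
names(results) <- names(series_list)

for (nm in names(series_list)) {
  x <- series_list[[nm]]
  # store a one-row summary in the list instead of rbinding inside the loop
  results[[nm]] <- data.frame(series = nm, mean = mean(x), sd = sd(x))
}

# A single rbind over all the pieces
summary_table <- do.call(rbind, results)
summary_table
```

With data.table loaded, rbindlist(results) would be an alternative to do.call(rbind, results) for the final step.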
D. Select Input Variables
The next step is to select the data we need for the ‘analysis’. We need a way to take the values assigned to date_column and data_column and use them to select the corresponding columns. Because this particular function relies on data.table for data storage, there are a number of ways to select a column when its name is stored in a variable.
The three main ways that this can be done are with the following commands.
    full_dat[, ..data_column]
    full_dat[, get(data_column)]
    full_dat[, data_column, with = FALSE]
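These three commands do not all return the same type of object, which matters when the result is passed to a function that expects a plain vector. The short sketch below uses a made-up three-row data.table (not the wool data) to show the difference.

```r
library(data.table)

# Small illustrative table; the values are arbitrary
dt <- data.table(date = 1:3, value = c(2.5, 3.0, 4.5))
data_column <- "value"

as_dt_prefix <- dt[, ..data_column]               # one-column data.table
as_vector    <- dt[, get(data_column)]            # plain numeric vector
as_dt_with   <- dt[, data_column, with = FALSE]   # one-column data.table

is.data.table(as_dt_prefix)
is.numeric(as_vector)
is.data.table(as_dt_with)
```

Only the get() form hands back the column itself, which is convenient when the values go straight into a statistical test.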
I am partial to using the get function, so let us select the right data by adding the following lines to our function.
    Forecast_Evaluation <- function(full_dat, data_column = "value", date_column = "date"){
      Evaluation_Results <- list()
      dates = full_dat[, get(date_column)]
      vals = full_dat[, get(data_column)]
    }
Since there is no real need to assign the selected columns to intermediate variables, let us pass the selection directly to each test instead.
    Forecast_Evaluation <- function(full_dat, data_column = "value", date_column = "date"){
      Evaluation_Results <- list()
      # Phillips-Perron unit root test (stats package)
      Evaluation_Results[["stationary"]] <- PP.test(full_dat[, get(data_column)])
      # TRUE if the model fitted by forecast::tbats() includes a seasonal component
      Evaluation_Results[["seasonal"]] <- !is.null(forecast::tbats(full_dat[, get(data_column)])$seasonal)
      # Ljung-Box test for autocorrelation (stats package)
      Evaluation_Results[["auto-correlated"]] <- Box.test(full_dat[, get(data_column)], type="Ljung-Box")
      return(Evaluation_Results)
    }
What this code does is take the name of the data_column that was specified and use it to assess whether the time series is stationary, seasonal, or autocorrelated. The result of each check is saved into the list entitled Evaluation_Results that was created at the start of the function. The return function ensures that the results are returned.
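The stationarity and autocorrelation entries hold full test objects of class htest, so their p-values can be pulled out with $p.value. The sketch below shows one way to reduce such results to TRUE/FALSE flags. The helper name flag_results, the 0.05 cut-off, and the simulated series are all assumptions made for illustration, and the seasonality check is left out so that the example needs only base R.

```r
# Simulated series used purely for illustration
set.seed(42)
x <- rnorm(100)

# The same two base R tests that the function stores in its list
tests <- list(
  stationary        = PP.test(x),
  `auto-correlated` = Box.test(x, type = "Ljung-Box")
)

# Hypothetical helper: TRUE when a test's p-value falls below `alpha`.
# For PP.test a small p-value rejects a unit root (evidence of stationarity);
# for Box.test it rejects independence (evidence of autocorrelation).
flag_results <- function(test_list, alpha = 0.05) {
  vapply(test_list, function(tst) tst$p.value < alpha, logical(1))
}

flags <- flag_results(tests)
flags
```

Whether 0.05 is an appropriate threshold depends on the application, so it is exposed as an argument rather than hard-coded.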
E. Testing
The first iteration of this basic function is now written. Let us now test it out. For more complex and involved processes such as a package, I would suggest using the testthat package. However, for this simple function we will just execute the function and save the results to a variable.
    Results = Forecast_Evaluation(myts, data_column = "value", date_column = "date")
    Results
    # $stationary
    #
    #     Phillips-Perron Unit Root Test
    #
    # data:  full_dat[, get(data_column)]
    # Dickey-Fuller = -4.5857, Truncation lag parameter = 4, p-value = 0.01
    #
    # $seasonal
    # [1] FALSE
    #
    # $`auto-correlated`
    #
    #     Box-Ljung test
    #
    # data:  full_dat[, get(data_column)]
    # X-squared = 75.155, df = 1, p-value < 2.2e-16
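For readers curious about what a testthat check might look like, here is a rough sketch. It does not test Forecast_Evaluation itself. Instead it uses a simplified stand-in, evaluate_series, which is a hypothetical function written for this example that keeps only the two base R tests and runs on simulated data.

```r
library(testthat)

# Simplified stand-in for Forecast_Evaluation (an assumption for this sketch):
# it keeps only the two base R tests and selects the column with [[ ]],
# which works for data frames and data tables alike.
evaluate_series <- function(full_dat, data_column = "value") {
  stopifnot(data_column %in% names(full_dat))
  vals <- full_dat[[data_column]]
  list(
    stationary        = PP.test(vals),
    `auto-correlated` = Box.test(vals, type = "Ljung-Box")
  )
}

# Simulated data used only to exercise the function
set.seed(1)
toy <- data.frame(date = 1:60, value = rnorm(60))

test_that("evaluate_series returns a named list of htest objects", {
  res <- evaluate_series(toy, data_column = "value")
  expect_type(res, "list")
  expect_named(res, c("stationary", "auto-correlated"))
  expect_s3_class(res$stationary, "htest")
  expect_s3_class(res[["auto-correlated"]], "htest")
})

test_that("a missing column produces an error", {
  expect_error(evaluate_series(toy, data_column = "price"))
})
```

Tests like these check the shape of the output rather than specific statistics, which keeps them stable if the underlying data changes.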
F. Conclusion
So there you have it. A basic example of how to write functions in R. I wrote this for beginners so that you can slowly walk through the process and have it make more sense than a typical computer science tutorial. In part two, I will investigate a more involved user defined function to automate a forecasting task.