Efficient R: Performant data.frame constructors
About as.data.frame
data.frame() and as.data.frame() are such ubiquitous functions that we rarely think twice about using them to create dataframes or to convert other objects to dataframes.
However, they are slow, often several times slower than a purpose-built constructor.
This is somewhat surprising given how heavily they are used, and given that the ‘data.frame’ object is the de facto standard for tabular data in R.
The slowness is, however, a direct result of the large amount of error checking and validation code they contain, which is perhaps understandable for something so widely used. You simply don’t know what is going to be thrown at the function, so it needs to do its best or fail gracefully.
Below, we demonstrate the inefficiencies of as.data.frame() versus efficient ‘data.frame’ constructors from the ‘ichimoku’ package coded for performance.
For benchmarking, the ‘microbenchmark’ package will be used. It is usual to compare the median time over a large number of runs, and we will use 1,000 runs in each of the cases below.
Matrix conversion benchmarking
A 100×10 matrix of random data drawn from the normal distribution is created as the object ‘matrix’.
This will be converted into a dataframe using as.data.frame() and ichimoku::matrix_df().
```r
library(ichimoku)
library(microbenchmark)

matrix <- matrix(rnorm(1000), ncol = 10, dimnames = list(1:100, letters[1:10]))

dim(matrix)
[1] 100  10

head(matrix)
           a          b          c          d          e          f
1  1.0470296 -0.6531076  0.8278910  0.9708001 -0.1014626  1.2253514
2 -0.1436921  0.5482620  1.3607562  0.8354925  0.7415475 -0.1541012
3 -0.2369179  1.0897400 -0.8158241  0.2736871 -0.1851880 -1.0202761
4 -0.1883866 -0.4844175 -0.3421133  0.8321749  0.5960344  0.4411143
5 -0.2062340  0.9212781 -0.3687319 -0.2210680 -0.9493628  0.2689948
6 -1.2267639  0.7466243 -0.1845343  1.3502588 -1.1756389 -1.2925598
           g          h           i          j
1  1.2828614  0.4024055 -0.04694549 -1.1447872
2 -0.6869244 -0.2542681  0.48761441 -0.6505677
3 -0.5479265 -0.5446966  0.09914298  0.8869836
4 -0.4678586  0.9396598 -0.89564969  1.0552123
5 -0.6126510  0.4527644 -1.43793557  0.5074292
6  1.0173340  0.2888818  0.09522833 -0.1836863

microbenchmark(as.data.frame(matrix), matrix_df(matrix), times = 1000)
Unit: microseconds
                  expr    min      lq     mean  median      uq      max neval
 as.data.frame(matrix) 31.148 32.6130 34.37202 33.2355 34.3305  410.675  1000
     matrix_df(matrix) 14.508 15.6355 22.78279 16.0575 16.8175 6297.590  1000

identical(as.data.frame(matrix), matrix_df(matrix))
[1] TRUE

all.equal(as.data.frame(matrix), matrix_df(matrix))
[1] TRUE
```
As can be seen, the outputs are identical, but ichimoku::matrix_df(), which is designed to be a performant ‘data.frame’ constructor, is around twice as fast.
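To make the idea concrete, here is a minimal base R sketch of a matrix-to-dataframe constructor with almost no validation. The name fast_matrix_df() is hypothetical and the body is an assumption about how such a constructor could look; it is not the source of ichimoku::matrix_df().

```r
# Hypothetical helper for illustration; not the ichimoku::matrix_df() source.
fast_matrix_df <- function(x) {
  stopifnot(is.matrix(x))  # single cheap check; input is otherwise trusted
  n <- nrow(x)
  p <- ncol(x)

  # Split the matrix into a list of plain (unnamed) column vectors
  cols <- vector("list", p)
  for (j in seq_len(p)) cols[[j]] <- as.vector(x[, j])

  # Fall back to default names where the matrix has none
  cn <- colnames(x)
  if (is.null(cn)) cn <- paste0("V", seq_len(p))
  rn <- rownames(x)
  if (is.null(rn)) rn <- seq_len(n)

  # Set the attributes that make a list a data.frame
  structure(cols, names = cn, class = "data.frame", row.names = rn)
}

m <- matrix(rnorm(20), ncol = 4, dimnames = list(NULL, letters[1:4]))
df <- fast_matrix_df(m)
dim(df)
```

The sketch skips the name checking and type handling that as.data.frame() performs, which is where any time saving would come from.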
xts conversion benchmarking
The ‘xts’ format is a popular choice for large time series data as each observation is indexed by a unique valid timestamp.
As an example, we use the ichimoku() function from the ‘ichimoku’ package which creates ichimoku objects inheriting the ‘xts’ class. We run ichimoku() on the sample data contained within the package to create an ‘xts’ object ‘cloud’.
This will be converted into a dataframe using as.data.frame() and ichimoku::xts_df().
```r
library(ichimoku)
library(microbenchmark)

cloud <- ichimoku(sample_ohlc_data)

xts::is.xts(cloud)
[1] TRUE

dim(cloud)
[1] 260  12

print(cloud[1:6], plot = FALSE)
            open  high   low close cd tenkan kijun senkouA senkouB
2020-01-02 123.0 123.1 122.5 122.7 -1     NA    NA      NA      NA
2020-01-03 122.7 122.8 122.6 122.8  1     NA    NA      NA      NA
2020-01-05 122.8 123.4 122.4 123.3  1     NA    NA      NA      NA
2020-01-06 123.3 124.3 123.3 124.1  1     NA    NA      NA      NA
2020-01-07 124.1 124.8 124.0 124.8  1     NA    NA      NA      NA
2020-01-08 124.8 125.4 124.5 125.3  1     NA    NA      NA      NA
           chikou cloudTop cloudBase
2020-01-02  122.9       NA        NA
2020-01-03  123.0       NA        NA
2020-01-05  123.9       NA        NA
2020-01-06  123.6       NA        NA
2020-01-07  122.5       NA        NA
2020-01-08  122.6       NA        NA

microbenchmark(as.data.frame(cloud), xts_df(cloud), times = 1000)
Unit: microseconds
                 expr     min       lq      mean   median       uq      max neval
 as.data.frame(cloud) 230.269 236.3060 252.35890 240.4095 246.1955 6871.517  1000
        xts_df(cloud)  33.862  36.9205  45.36266  38.6870  40.7065 5703.421  1000
```
It can be seen that ichimoku::xts_df(), which is designed to be a performant ‘data.frame’ constructor, is over 6x as fast.
```r
df1 <- as.data.frame(cloud)

is.data.frame(df1)
[1] TRUE

str(df1)
'data.frame':	260 obs. of  12 variables:
 $ open     : num  123 123 123 123 124 ...
 $ high     : num  123 123 123 124 125 ...
 $ low      : num  122 123 122 123 124 ...
 $ close    : num  123 123 123 124 125 ...
 $ cd       : num  -1 1 1 1 1 1 -1 0 -1 -1 ...
 $ tenkan   : num  NA NA NA NA NA ...
 $ kijun    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouA  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouB  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ chikou   : num  123 123 124 124 122 ...
 $ cloudTop : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cloudBase: num  NA NA NA NA NA NA NA NA NA NA ...

df2 <- xts_df(cloud)

is.data.frame(df2)
[1] TRUE

str(df2)
'data.frame':	260 obs. of  13 variables:
 $ index    : POSIXct, format: "2020-01-02 00:00:00" ...
 $ open     : num  123 123 123 123 124 ...
 $ high     : num  123 123 123 124 125 ...
 $ low      : num  122 123 122 123 124 ...
 $ close    : num  123 123 123 124 125 ...
 $ cd       : num  -1 1 1 1 1 1 -1 0 -1 -1 ...
 $ tenkan   : num  NA NA NA NA NA ...
 $ kijun    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouA  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouB  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ chikou   : num  123 123 124 124 122 ...
 $ cloudTop : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cloudBase: num  NA NA NA NA NA NA NA NA NA NA ...
```
The outputs are slightly different as xts_df() preserves the date-time index of ‘xts’ objects as a new first column ‘index’ which is POSIXct in format. The default as.data.frame() constructor converts the index into the row names, which is not desirable as the dates are coerced to type ‘character’.
So it can be seen that in this case, not only is the performant constructor faster, it is also more fit for purpose.
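The difference between the two approaches can be reproduced in base R without ‘xts’. The sketch below uses a made-up three-row series (the variable names are illustrative, and the second constructor is an assumption about the general approach, not the xts_df() source) to show that assigning date-times as row names stores them as character, whereas keeping them as a column preserves the POSIXct class.

```r
# Illustrative data: a date-time index and one numeric series
idx <- as.POSIXct(c("2020-01-02", "2020-01-03", "2020-01-05"), tz = "UTC")
close <- c(122.7, 122.8, 123.3)

# Row-name approach: the index is coerced to character on assignment
df_rn <- data.frame(close = close)
rownames(df_rn) <- idx
class(attr(df_rn, "row.names"))

# Column approach: the index is kept as a POSIXct first column
df_col <- structure(list(index = idx, close = close),
                    class = "data.frame",
                    row.names = seq_along(idx))
class(df_col$index)

# Date arithmetic works directly on the column, with no re-parsing needed
diff(df_col$index)
```

With the row-name version, the strings would have to be parsed back into date-times before any such arithmetic could be done.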
When to use performant constructors
- When plotting data that is not already a ‘data.frame’ object using ‘ggplot2’. For example, if you have time series data in the ‘xts’ format, calling a ‘ggplot2’ plot method automatically converts the data into a dataframe, as ggplot() only works with dataframes internally. Fortunately it does not use as.data.frame() but its own constructor ggplot2::fortify(). Benchmarked below, it is somewhat faster than as.data.frame(), but the performant constructor ichimoku::xts_df() is still almost 4x as fast.
```r
microbenchmark(as.data.frame(cloud), ggplot2::fortify(cloud), xts_df(cloud), times = 1000)
Unit: microseconds
                    expr     min       lq      mean   median       uq      max neval
    as.data.frame(cloud) 231.683 240.5755 260.94003 246.4725 253.3840 5246.692  1000
 ggplot2::fortify(cloud) 132.811 145.2860 170.82074 153.1985 162.7935 4828.824  1000
           xts_df(cloud)  34.382  38.1695  41.71392  40.5490  42.7240  381.869  1000
```
- In a context where performance is critical. This is usually in interactive environments such as a Shiny app, perhaps with real time data, where slow code can reduce responsiveness or cause bottlenecks in execution.
- Within packages. It is usually safe to use performant constructors within functions or for internal unexported functions. If programming best practices are followed, the input and output types of functions are kept consistent, so the input to the constructor can be controlled and its behaviour is predictable. Appropriate unit tests can also catch any issues early.
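As a sketch of the last point, an internal constructor can be checked against as.data.frame() as the reference implementation. Both .mat_df() and test_mat_df() are hypothetical names used only for illustration; in a real package the check would normally live in the test suite rather than alongside the function.

```r
# Hypothetical internal constructor (illustrative only): assumes a matrix
# with column names and no row names, and performs no validation.
.mat_df <- function(x) {
  cols <- lapply(seq_len(ncol(x)), function(j) as.vector(x[, j]))
  structure(cols, names = colnames(x), class = "data.frame",
            row.names = seq_len(nrow(x)))
}

# Minimal check against the reference implementation for the one input
# shape the constructor is expected to receive.
test_mat_df <- function() {
  m <- matrix(as.numeric(1:12), ncol = 3,
              dimnames = list(NULL, c("a", "b", "c")))
  isTRUE(all.equal(.mat_df(m), as.data.frame(m)))
}

test_mat_df()
```

Because the constructor is only ever called with inputs of a known shape, a small number of such checks covers the cases that matter.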
When to question the use of performant constructors
- For user-facing functions. Having no validation or error-checking code means that a performant constructor may behave unpredictably on data it was not designed for. Within a function, there is a specific, or at most finite, range of objects that a constructor can receive. Once that limit is removed, an error can be expected whenever the input is not what the constructor was written for. If this is made clear to the user, there are adequate instructions on proper usage, and the occasional error message is acceptable, then it is still reasonable to use the performant constructor.
- When the constructor needs to handle a range of input types. as.data.frame() is actually an S3 generic with a variety of methods for different object classes. If required to handle a variety of different types of input, it may be easier (if not more performant) to rely on as.data.frame() rather than write code which handles different scenarios.
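One possible middle ground, sketched below under the assumption that only matrices need the fast path, is to define your own S3 generic with a performant method for the class you control and a default method that falls back to as.data.frame(). The generic name to_df() is hypothetical.

```r
# Hypothetical generic: dispatches on the class of x
to_df <- function(x, ...) UseMethod("to_df")

# Fast path for matrices, with minimal handling of missing column names
to_df.matrix <- function(x, ...) {
  cols <- lapply(seq_len(ncol(x)), function(j) as.vector(x[, j]))
  cn <- colnames(x)
  if (is.null(cn)) cn <- paste0("V", seq_len(ncol(x)))
  structure(cols, names = cn, class = "data.frame",
            row.names = seq_len(nrow(x)))
}

# Everything else goes through the validated base R route
to_df.default <- function(x, ...) as.data.frame(x, ...)

to_df(matrix(1:4, nrow = 2))       # uses the fast method
to_df(list(a = 1:2, b = 3:4))      # falls back to as.data.frame()
```

Unexpected inputs then still receive the error checking of as.data.frame(), at the cost of one extra dispatch step.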
What is a performant constructor
First of all, it is possible to directly use the functions matrix_df() and xts_df() which are exported from the ‘ichimoku’ package. Given the nature of the R ecosystem, this is indeed encouraged.
However, having seen the advantages of using a performant constructor above, we can now turn to the ‘what’ for the curious.
What lies behind those functions? Some variation of the below:
```r
# structure() is used to set the 'class' and other attributes on an object
structure(list(vec1, vec2, vec3),
          class = "data.frame",
          row.names = seq_len(length(vec1)))
```
- A data.frame is simply a list (where each element must be the same length).
- It has an attribute ‘class’ which equals ‘data.frame’.
- It must have row names, which is usually just an integer sequence.
Note:
- The vectors in the list (vec1, vec2, vec3, etc.) must be the same length, otherwise a corrupt data.frame warning will be generated.
- If row names are missing then the data will still be present, but dim() will show a 0-row dataframe and its print method will not display the data.
- Row names are not limited to an integer sequence. They can be dates for example. However if dates are set as row names, they are first coerced to type ‘character’.
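The first two notes can be checked directly in base R. The snippet below is a small illustration using made-up column values; it shows that structure() itself performs no validation, so both malformed objects are created without complaint.

```r
# Unequal column lengths: construction succeeds silently, and the problem
# only surfaces later when the object is used
bad <- structure(list(a = 1:3, b = 1:2),
                 class = "data.frame",
                 row.names = 1:3)
lengths(unclass(bad))

# Missing row names: the data is still stored in the list...
no_rn <- structure(list(a = 1:3), class = "data.frame")
no_rn$a

# ...but the object reports zero rows
dim(no_rn)
```

A constructor used inside a package can guard against both cases cheaply, for example by comparing the lengths of the columns before calling structure().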
In conclusion, dataframes are not complicated structures but essentially lists with a couple of constraints. Indeed you can see that the underlying data type of a dataframe is just a list:
```r
class(df1)
[1] "data.frame"

typeof(df1)
[1] "list"

class(df2)
[1] "data.frame"

typeof(df2)
[1] "list"
```
References
ichimoku R package site: https://shikokuchuo.net/ichimoku/
ichimoku CRAN page: https://CRAN.R-project.org/package=ichimoku