
Timing Working With a Row or a Column from a data.frame


In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.

We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.

We tend to work with medium size data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.

Introduction

We will time the above tasks both using base R, and dplyr 0.8.1. We will actually perform the timings using the tibble variation of data.frames, as this (hopefully) should not affect the base-R timings, and may potentially help dplyr (and we want to time each system “used well”).
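As a concrete (if simplified) sketch of the setup, the test data might be built along these lines; the sizes and column names below are our own illustration, not the exact ones used in the experiments.

    library(tibble)

    # build a wide numeric data frame and convert it to a tibble
    mk_data <- function(nrow, ncol) {
      d <- as.data.frame(matrix(data = 0.0, nrow = nrow, ncol = ncol))
      colnames(d) <- sprintf("col_%06d", seq_len(ncol))
      as_tibble(d)
    }

    d <- mk_data(nrow = 100000, ncol = 100)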

Below are the results in graphical form (images link through to the full code of the experiments).

The experiments

We won’t go deep into the details of the timings, as we are just interested in relative performance.

Timings

Selecting the first column from a data.frame

First we look at the time to select the first column from a data frame, with time to complete the task plotted as a function of the number of columns in the task.

For both the base-R bracket notation and dplyr::select(), the time grows with the number of columns. In both cases this is undesirable, but it can be excused: it may not be a good design trade-off to maintain fast column lookup structures, since column names may change and column names can repeat. However, the ranges of behavior differ: base-R slows down on larger tasks, but is always fast; dplyr::select() takes many seconds on the larger tasks.
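To make the comparison concrete, here is a sketch of the kind of timing being discussed (our own illustration, assuming the tibble d built above and the microbenchmark package; the experiments themselves were run over a range of column counts).

    library(dplyr)
    library(microbenchmark)

    microbenchmark(
      base_R = d[[1]],        # base-R: extract the first column by position
      dplyr  = select(d, 1),  # dplyr: select the first column by position
      times = 10L
    )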

The ratio of the time for dplyr::select() to complete the task over the time it takes base-R to complete the task is given here.

The take-away: dplyr::select() is routinely thousands of times slower than base R on this task, and the times are substantial for large data.

Altering the first column in a data.frame

As one would expect, the time to alter the first column in a data.frame shows similar behavior to the time to select the first column.
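For reference, the two styles of alteration we have in mind look roughly like this (again a sketch under the assumptions above; col_000001 is the first column name from our illustrative data).

    library(dplyr)

    # base-R: assign into the first column in place (on a copy, to keep d unchanged)
    d_base <- d
    d_base[[1]] <- d_base[[1]] + 1

    # dplyr: recompute the column with mutate()
    d_dplyr <- mutate(d, col_000001 = col_000001 + 1)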

(We apologize for the swapped colors in the above graph; it is due to our using alphabetical order as the color order.)

Selecting the first row from a data.frame

Selecting the first row from a data.frame can be fast thanks to row indexing structures. It may or may not be constant time, as the indexing structures may hold non-trivial keys (not just integers). Picking a row from a more general database, by contrast, requires a full scan unless there are index keys.

However, selecting the first row by index should in fact be easy. Let’s look at the timings.
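The two row-selection variants we are timing look roughly like this (same illustrative data as above).

    library(dplyr)

    first_row_base  <- d[1, ]       # base-R: index the first row directly
    first_row_dplyr <- slice(d, 1)  # dplyr: slice() picks rows by position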

It appears R does achieve constant-time access to the first row when the row indices are simple integers. dplyr::slice() shows growing times, even though it is defined to work with indices (so has a good chance of being constant time). These timings are faster than the column examples, as both systems are designed to work with many rows.

data.table

We did not time data.table, as it is known to be very fast both on trivial and non-trivial workloads (at all scales). Here are some shared data.table timings. If you are working with data at even medium scale, we strongly recommend data.table.
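For completeness, the analogous data.table operations would look roughly like this (our own sketch, not timings from this note).

    library(data.table)

    dt <- as.data.table(d)                 # convert the illustrative data

    first_col <- dt[[1]]                   # select the first column
    dt[, col_000001 := col_000001 + 1]     # alter the first column by reference
    first_row <- dt[1, ]                   # select the first row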

Conclusion

If you are working at a data scale where the above effects are present, and you are unsatisfied with how long your analyses take to run, you may want to time both dplyr and base-R variations of small extracts of your work pattern to see the trade-offs of each methodology. The point is: even if your code currently takes a long time to run, there may be an easy variation of it that is in fact fast.

R is not intrinsically slow in working with data.

If you are not seeing slow results the above issues can be safely ignored.
