Establishing Meaningful Performance Comparisons between R and Python
R vs Python
Performance comparisons between R and Python suck.
Most seem to be run in Jupyter Notebook, and many use Python's rpy2 library to run poorly optimized R code. I'm not an anti-for()-loop zealot (yes, you can use them effectively in R), but thanks to the base::*apply() family and their beautiful purrr::map*() children, there are usually better solutions.
Unfortunately, some of these comparisons arbitrarily test loops in R where you would never, ever write them. In a language where vectors serve as the fundamental data structure, it makes no sense that code like this receives such prominent treatment in seemingly every test…
normal_distibution <- rnorm(2500)
bad_R <- vector(mode = "numeric", length = length(normal_distibution))
# loop over indices with seq_along(); looping over the values themselves
# would index bad_R by the random draws rather than by position
for(i in seq_along(normal_distibution)) {
  bad_R[i] <- normal_distibution[i] * normal_distibution[i]
}
If we had to do something explicitly “loopy”, we’d still probably do something like this…
not_so_good_R <- vapply(normal_distibution, function(x) x^2, numeric(1))
identical(bad_R, not_so_good_R)
## [1] TRUE
… but it’s still taking advantage of the fact that normal_distibution is a homogeneous collection of atomic values: a vector.
all(is.vector(normal_distibution), is.atomic(normal_distibution))
## [1] TRUE
With that in mind, just do this…
good_R <- normal_distibution^2
identical(bad_R, good_R)
## [1] TRUE
In Python, using reticulate here, we can do this in a whole bunch of ways…
py_run_string("
normal_distibution_py = r.normal_distibution
py_index_results = [None]*len(normal_distibution_py)
py_append_results = []
py_dict_results = {}
", convert = FALSE)

py_loop_index <- ("
for i in range(len(normal_distibution_py)):
    py_index_results[i] = normal_distibution_py[i]**2
")

py_loop_append <- ("
for i in normal_distibution_py:
    py_append_results.append(i**2)
")

py_loop_dict <- ("
for i in range(len(normal_distibution_py)):
    py_dict_results[i] = normal_distibution_py[i]**2
")

py_list_comp <- ("
[x**2 for x in normal_distibution_py]
")
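One approach deliberately absent from that list is the closest Python analog to R's vectorized ^2: a numpy array (numpy is installed into the conda environment later in this post). A minimal sketch, not part of the benchmark, using a locally generated stand-in for r.normal_distibution:

```python
import numpy as np

# stand-in for r.normal_distibution; with reticulate this would come from R
normal_distibution_py = np.random.normal(size=2500)

# element-wise, like R's normal_distibution^2
vectorized = normal_distibution_py ** 2

# agrees with the pure-Python list comprehension
assert np.allclose(vectorized, [x ** 2 for x in normal_distibution_py])
```

The pure-Python variants above are what these "language war" posts usually test, which is part of the point: idiomatic numeric Python reaches for numpy, just as idiomatic R reaches for vectorization.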
… but what runs fastest?
speeds <- mark(
  for(i in seq_along(normal_distibution)) bad_R[i] <- normal_distibution[i] * normal_distibution[i],
  vapply(normal_distibution, function(x) x^2, numeric(1)),
  normal_distibution^2,
  py_run_string(py_loop_index, convert = FALSE),
  py_run_string(py_loop_append, convert = FALSE),
  py_run_string(py_loop_dict, convert = FALSE),
  py_run_string(py_list_comp),
  check = FALSE,
  iterations = 100
)
| | expression | mean | median |
|---|---|---|---|
| Good R | normal_distibution^2 | 2.02us | 1.98us |
| Python | [x**2 for x in normal_distibution_py] | 675.43us | 595.75us |
| Python | for i in normal_distibution_py: py_append_results.append(i**2) | 935.9us | 864.79us |
| Python | for i in range(len(normal_distibution_py)): py_dict_results[i] = normal_distibution_py[i]**2 | 1.2ms | 1.11ms |
| Python | for i in range(len(normal_distibution_py)): py_index_results[i] = normal_distibution_py[i]**2 | 1.37ms | 1.16ms |
| Not-Good R | vapply(normal_distibution, function(x) x^2, numeric(1)) | 2.18ms | 1.79ms |
| Bad R | for (i in seq_along(normal_distibution)) bad_R[i] <- normal… | 55.08ms | 49.69ms |
In these conditions and for this task, we can say two things:
- All the Python solutions are faster than the poorly-optimized R solutions.
- The optimized R solution is faster than all the Python solutions.
That said, there are issues with this test.
Are we really testing the same thing?
In terms of the exact steps that a computer takes to crunch the numbers? No, but that’s not very realistic or useful.
In terms of reaching a desired result? Ignoring that pure Python list()s are not inherently homogeneous, yes.
py_run_string("py_append_results = []")
py_run_string(py_loop_append)
all.equal(good_R, py$py_append_results)
## [1] TRUE
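That homogeneity caveat is easy to see directly: a pure Python list will happily mix types, whereas an R atomic vector coerces everything to a single type. An illustrative sketch, not from the original benchmark:

```python
# a Python list is a container of references; element types can vary freely
mixed = [1, "a", 3.0, None]
assert [type(x).__name__ for x in mixed] == ["int", "str", "float", "NoneType"]

# R's c(1, "a", 3.0), by contrast, coerces everything to character:
# "1" "a" "3"
```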
Is running the Python code through R’s reticulate actually fair?
Is it less fair than running rpy2 in Python? After running all these tests, I’d say that reticulate is fairer.
Is this even a good task to compare performance?
Based on the number of articles including a similar test, you’d almost think so. I don’t entirely agree, as that’s a bit reductionist. The R solution is only the variable followed by literally two characters: ^2.
But, I do think it serves as a great example of fundamental differences in the languages.
Considering the above results and the simplicity of the good R solution, it illustrates how easily arbitrary handicaps can be placed on the R code, which you’ll find in many of these “language war” articles. I hope that’s simply due to ignorant assumptions, but then the author shouldn’t be writing an article claiming authority.
While there are articles that do make a point of notifying the reader that the tests are lacking, some will sell the results as gospel anyway. Others seem to dismiss the merits of rigor entirely.
In a field referred to as “Data Science”, the mountain of articles discussing such poor metrics is concerning. Consider how many newcomers seem to use them when choosing a language in which to invest their time, and often money. (BTW, the answer is both, but get great at one before tackling the other.)
With that in mind, what would an objective comparison look like?
Here’s a barrage of tests applied to a task that’s both common in practice and common in these “language war” tests: reading a .csv file to a data frame. This is a task for which many articles assert Python’s superiority, despite the evidence here and elsewhere.
However, the real goal is to experiment with methods that can be used to make future tests involving less trivial tasks more objective and thus more useful to everyone.
I also think it’s a cool demonstration of some RStudio and {reticulate} sweetness. I hope it spurs some interest in how awesome a multilingual workflow can be.
If you want to skip a pile of monotonous code, go ahead and jump to the results.
Otherwise, the entire workflow is here to scrutinize…
library(bench)
library(kableExtra); options(knitr.kable.NA = "")
library(scales)
library(tidyverse)
Reproducible Python Environment
library(reticulate)
conda_create("r-py-benchmarks", c("python=3.6", "numpy", "pandas"))
use_condaenv("r-py-benchmarks", required = TRUE)
The Data
The data come from a neutral third party in the form of a .csv, which can be obtained from the Majestic Million CSV.
Download and Read Data Set
file_url <- "http://downloads.majestic.com/majestic_million.csv"
temp_file <- tempfile(fileext = ".csv")
download.file(file_url, destfile = temp_file)
test_df <- read_csv(temp_file)
Quick Inspection
glimpse(test_df)
## Observations: 1,000,000
## Variables: 12
## $ GlobalRank     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ TldRank        <int> 1, 2, 3, 4, 5, 6, 1, 7, 8, 9, 2, 10, 3, 11, 12,...
## $ Domain         <chr> "google.com", "facebook.com", "youtube.com", "t...
## $ TLD            <chr> "com", "com", "com", "com", "com", "com", "org"...
## $ RefSubNets     <int> 463232, 451237, 410764, 409068, 303679, 292966,...
## $ RefIPs         <int> 2963708, 3046847, 2444016, 2546940, 1139322, 13...
## $ IDN_Domain     <chr> "google.com", "facebook.com", "youtube.com", "t...
## $ IDN_TLD        <chr> "com", "com", "com", "com", "com", "com", "org"...
## $ PrevGlobalRank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ PrevTldRank    <int> 1, 2, 3, 4, 5, 6, 1, 7, 8, 9, 2, 10, 3, 11, 12,...
## $ PrevRefSubNets <int> 462861, 451086, 410676, 408692, 303296, 292918,...
## $ PrevRefIPs     <int> 2966284, 3049605, 2447455, 2549623, 1138675, 13...

test_df %>%
  summarise_all(funs(sum(is.na(.)))) %>% # where the NAs at?
  gather(Variable, NAs) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
Variable | NAs |
---|---|
GlobalRank | 0 |
TldRank | 0 |
Domain | 0 |
TLD | 0 |
RefSubNets | 0 |
RefIPs | 0 |
IDN_Domain | 0 |
IDN_TLD | 0 |
PrevGlobalRank | 0 |
PrevTldRank | 0 |
PrevRefSubNets | 0 |
PrevRefIPs | 0 |
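For reference, the same NA audit on the Python side might look something like this (an illustrative pandas sketch with toy data, not part of the benchmark):

```python
import pandas as pd

# toy frame standing in for test_df
df = pd.DataFrame({"Domain": ["google.com", None], "GlobalRank": [1, 2]})

# like summarise_all(funs(sum(is.na(.)))): count missing values per column
na_counts = df.isna().sum()
assert na_counts["Domain"] == 1
assert na_counts["GlobalRank"] == 0
```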
Write the .csv
Each subset below is written to disk with write_csv(df, path), where the data frame and path vary by file size.
Small
The “small” .csv consists of the first 100 rows.
(small_df <- test_df %>% slice(1:100))
## # A tibble: 100 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##         <int>   <int> <chr>  <chr>      <int>  <int> <chr>      <chr>
##  1          1       1 googl~ com       463232 2.96e6 google.com com
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com
## # ... with 90 more rows, and 4 more variables: PrevGlobalRank <int>,
## #   PrevTldRank <int>, PrevRefSubNets <int>, PrevRefIPs <int>

(small_rows <- nrow(small_df)) %>% comma() %>% cat("rows")
## 100 rows

path_small_csv <- "test-data/small_csv.csv"
write_csv(small_df, path_small_csv)
Medium
The “medium” .csv consists of the first 5,000 rows.
(medium_df <- test_df %>% slice(1:5000))
## # A tibble: 5,000 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##         <int>   <int> <chr>  <chr>      <int>  <int> <chr>      <chr>
##  1          1       1 googl~ com       463232 2.96e6 google.com com
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com
## # ... with 4,990 more rows, and 4 more variables: PrevGlobalRank <int>,
## #   PrevTldRank <int>, PrevRefSubNets <int>, PrevRefIPs <int>

(med_rows <- nrow(medium_df)) %>% comma() %>% cat("rows")
## 5,000 rows

path_medium_csv <- "test-data/medium_csv.csv"
write_csv(medium_df, path_medium_csv)
Big
The “big” .csv stacks all 1,000,000 rows five times, creating a 5,000,000 row .csv.
(big_df <- test_df %>% rerun(.n = 5) %>% bind_rows())
## # A tibble: 5,000,000 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##         <int>   <int> <chr>  <chr>      <int>  <int> <chr>      <chr>
##  1          1       1 googl~ com       463232 2.96e6 google.com com
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com
## # ... with 4,999,990 more rows, and 4 more variables:
## #   PrevGlobalRank <int>, PrevTldRank <int>, PrevRefSubNets <int>,
## #   PrevRefIPs <int>

(big_rows <- nrow(big_df)) %>% comma() %>% cat("rows")
## 5,000,000 rows

path_big_csv <- "test-data/big_csv.csv"
write_csv(big_df, path_big_csv)
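For comparison, the rerun(.n = 5) %>% bind_rows() stacking has a direct pandas analog (an illustrative sketch with toy data, not part of the benchmark):

```python
import pandas as pd

df = pd.DataFrame({"GlobalRank": [1, 2, 3], "TLD": ["com", "com", "org"]})

# like rerun(.n = 5) %>% bind_rows(): stack five copies, renumbering the index
big = pd.concat([df] * 5, ignore_index=True)
assert len(big) == 5 * len(df)
```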
The Code
The following steps were taken to “standardize” code.
- R and Python functions:
  - File paths are assigned to a "*_csv.csv" variable.
  - The column data types are identified ahead of time via a *_col_specs variable in order to maximize read speed. In future tests, it would be interesting to skip this step.
    - All "numeric" data are read as double via:
      - "double" for utils::read.csv() and data.table::fread()
      - readr::col_double() for readr::read_csv()
      - float for pandas.read_csv()
    - This is to standardize numeric usage, as my understanding is that both R's doubles and Python's floats are doubles in the underlying C code. It also prevents the need to import numpy in every call to a Python script. If this is incorrect, don't hesitate to say so.
  - The function assigns the result to an internal df variable.
  - The function explicitly return()s the data frame.
- .R and .py Script Execution:
  - .R scripts are called via system() instead of source(), as source() appeared to offer a potentially unfair advantage.
  - Similarly, .py scripts were tested via system(), reticulate::py_run_file(), and reticulate::py_run_string() instead of reticulate::source_python(), to minimize the number of steps required for execution and minimize potential handicaps.
- .R and .py Script Code:
  - Relevant packages are loaded via R's library() or Python's import.
  - File paths are assigned to a "*_csv.csv" variable.
  - The column data types are identified ahead of time via a *_col_specs variable.
    - All "numeric" data are read as doubles.
  - Data frames are assigned to a variable upon reading the file.
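The assumption above — that R's doubles and Python's floats are both C doubles — can be sanity-checked from the Python side; CPython's float is documented as being implemented with a C double:

```python
import struct
import sys

# a C double occupies 8 bytes and carries a 53-bit significand
assert struct.calcsize("d") == 8
assert sys.float_info.mant_dig == 53

# the same limits R reports via .Machine$double.xmax and .Machine$double.eps
print(sys.float_info.max, sys.float_info.epsilon)
```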
inspect_script <- function(path) {
  url_base <- "https://github.com/syknapptic/syknapptic/tree/master/content/post/"
  contents <- read_lines(path)
  cat("File available at", paste0(url_base, path), "\n")
  cat("```\n")
  cat("# ", path, " ", rep("=", (80 - nchar(path) - 2)), "\n", sep = "")
  contents %>% walk(cat, "\n")
  cat("```\n\n")
}
R
“Base” - utils::read.csv()
Local R Function
base_col_specs <- c("double", "double", "character", "character",
                    "double", "double", "character", "character",
                    "double", "double", "double", "double")

base_test <- function(path) {
  df <- read.csv(file = path, colClasses = base_col_specs)
  return(df)
}
Scripts to Source by Operating System via system()
c("r/base_test_small.R", "r/base_test_med.R", "r/base_test_big.R") %>% walk(inspect_script)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_small.R
# r/base_test_small.R ===========================================================
path_small_csv <- "test-data/small_csv.csv"
base_col_specs <- c("double", "double", "character", "character", "double", "double", "character", "character", "double", "double", "double", "double")
df <- read.csv(file = path_small_csv, colClasses = base_col_specs)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_med.R
# r/base_test_med.R =============================================================
path_medium_csv <- "test-data/medium_csv.csv"
base_col_specs <- c("double", "double", "character", "character", "double", "double", "character", "character", "double", "double", "double", "double")
df <- read.csv(file = path_medium_csv, colClasses = base_col_specs)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_big.R
# r/base_test_big.R =============================================================
path_big_csv <- "test-data/big_csv.csv"
base_col_specs <- c("double", "double", "character", "character", "double", "double", "character", "character", "double", "double", "double", "double")
df <- read.csv(file = path_big_csv, colClasses = base_col_specs)
readr::read_csv()
Local R Function
library(readr)

readr_col_specs <- list(col_double(), col_double(), col_character(), col_character(),
                        col_double(), col_double(), col_character(), col_character(),
                        col_double(), col_double(), col_double(), col_double())

readr_test <- function(path) {
  df <- read_csv(file = path, col_types = readr_col_specs)
  return(df)
}
Scripts to Source by Operating System via system()
c("r/readr_test_small.R", "r/readr_test_med.R", "r/readr_test_big.R") %>% walk(inspect_script)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_small.R
# r/readr_test_small.R ==========================================================
library(readr)
path_small_csv <- "test-data/small_csv.csv"
readr_col_specs <- list(col_double(), col_double(), col_character(), col_character(), col_double(), col_double(), col_character(), col_character(), col_double(), col_double(), col_double(), col_double())
df <- read_csv(file = path_small_csv, col_types = readr_col_specs)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_med.R
# r/readr_test_med.R ============================================================
library(readr)
path_medium_csv <- "test-data/medium_csv.csv"
readr_col_specs <- list(col_double(), col_double(), col_character(), col_character(), col_double(), col_double(), col_character(), col_character(), col_double(), col_double(), col_double(), col_double())
df <- read_csv(file = path_medium_csv, col_types = readr_col_specs)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_big.R
# r/readr_test_big.R ============================================================
library(readr)
path_big_csv <- "test-data/big_csv.csv"
readr_col_specs <- list(col_double(), col_double(), col_character(), col_character(), col_double(), col_double(), col_character(), col_character(), col_double(), col_double(), col_double(), col_double())
df <- read_csv(file = path_big_csv, col_types = readr_col_specs)
data.table::fread()
Local R Function
library(data.table)

datatable_col_specs <- c("double", "double", "character", "character",
                         "double", "double", "character", "character",
                         "double", "double", "double", "double")

datatable_test <- function(path) {
  df <- fread(file = path, colClasses = datatable_col_specs)
  return(df)
}
Scripts to Source by Operating System via system()
c("r/datatable_test_small.R", "r/datatable_test_med.R", "r/datatable_test_big.R") %>% walk(inspect_script)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_small.R
# r/datatable_test_small.R ======================================================
library(data.table)
path_small_csv <- "test-data/small_csv.csv"
datatable_col_specs <- c("double", "double", "character", "character", "double", "double", "character", "character", "double", "double", "double", "double")
df <- fread(file = path_small_csv, colClasses = datatable_col_specs)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_med.R
# r/datatable_test_med.R ========================================================
library(data.table)
path_medium_csv <- "test-data/medium_csv.csv"
datatable_col_specs <- c("double", "double", "character", "character", "double", "double", "character", "character", "double", "double", "double", "double")
df <- fread(file = path_medium_csv, colClasses = datatable_col_specs)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_big.R
# r/datatable_test_big.R ========================================================
library(data.table)
path_big_csv <- "test-data/big_csv.csv"
datatable_col_specs <- c("double", "double", "character", "character", "double", "double", "character", "character", "double", "double", "double", "double")
df <- fread(file = path_big_csv, colClasses = datatable_col_specs)
Python
pandas.read_csv()
Local Python Function
import pandas

path_small_csv = 'test-data/small_csv.csv'
path_medium_csv = 'test-data/medium_csv.csv'
path_big_csv = 'test-data/big_csv.csv'

pandas_col_specs = {
  'GlobalRank':float, 'TldRank':float, 'Domain':str, 'TLD':str,
  'RefSubNets':float, 'RefIPs':float, 'IDN_Domain':str, 'IDN_TLD':str,
  'PrevGlobalRank':float, 'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float
}

def pandas_test_small():
    df = pandas.read_csv(filepath_or_buffer = path_small_csv, dtype = pandas_col_specs, low_memory = False)
    return(df)

def pandas_test_medium():
    df = pandas.read_csv(filepath_or_buffer = path_medium_csv, dtype = pandas_col_specs, low_memory = False)
    return(df)

def pandas_test_big():
    df = pandas.read_csv(filepath_or_buffer = path_big_csv, dtype = pandas_col_specs, low_memory = False)
    return(df)
Scripts to Source via system() and reticulate::py_run_file(..., convert = FALSE)
c("py/pandas_test_small.py", "py/pandas_test_med.py", "py/pandas_test_big.py") %>% walk(inspect_script)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_small.py
# py/pandas_test_small.py =======================================================
import pandas
path_small_csv = 'test-data/small_csv.csv'
pandas_col_specs = { 'GlobalRank':float, 'TldRank':float, 'Domain':str, 'TLD':str, 'RefSubNets':float, 'RefIPs':float, 'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float }
df = pandas.read_csv(filepath_or_buffer = path_small_csv, dtype = pandas_col_specs, low_memory = False)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_med.py
# py/pandas_test_med.py =========================================================
import pandas
path_medium_csv = 'test-data/medium_csv.csv'
pandas_col_specs = { 'GlobalRank':float, 'TldRank':float, 'Domain':str, 'TLD':str, 'RefSubNets':float, 'RefIPs':float, 'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float }
df = pandas.read_csv(filepath_or_buffer = path_medium_csv, dtype = pandas_col_specs, low_memory = False)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_big.py
# py/pandas_test_big.py =========================================================
import pandas
path_big_csv = 'test-data/big_csv.csv'
pandas_col_specs = { 'GlobalRank':float, 'TldRank':float, 'Domain':str, 'TLD':str, 'RefSubNets':float, 'RefIPs':float, 'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float }
df = pandas.read_csv(filepath_or_buffer = path_big_csv, dtype = pandas_col_specs, low_memory = False)
reticulate::py_run_string(..., convert = FALSE)
py_run_string("
import pandas

path_small_csv = 'test-data/small_csv.csv'
path_medium_csv = 'test-data/medium_csv.csv'
path_big_csv = 'test-data/big_csv.csv'

pandas_col_specs = { 'GlobalRank':float, 'TldRank':float, 'Domain':str, 'TLD':str, 'RefSubNets':float, 'RefIPs':float, 'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float }

def retic_pandas_test_small():
    df = pandas.read_csv(filepath_or_buffer = path_small_csv, dtype = pandas_col_specs, low_memory = False)
    return(df)

def retic_pandas_test_medium():
    df = pandas.read_csv(filepath_or_buffer = path_medium_csv, dtype = pandas_col_specs, low_memory = False)
    return(df)

def retic_pandas_test_big():
    df = pandas.read_csv(filepath_or_buffer = path_big_csv, dtype = pandas_col_specs, low_memory = False)
    return(df)
", convert = FALSE)
Dependencies Only
c("r/test_load_readr.R", "r/test_load_datatable.R", "py/test_load_pandas.py") %>% walk(inspect_script)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/test_load_readr.R
# r/test_load_readr.R ===========================================================
library(readr)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/test_load_datatable.R
# r/test_load_datatable.R =======================================================
library(data.table)
File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/test_load_pandas.py
# py/test_load_pandas.py ========================================================
import pandas
The Test
100 iterations were run to provide a reasonable balance between rigor and compute time.
n_iterations <- 100
All the code was tested via the {bench} package and its bench::mark() function. This package was selected over others mainly as a chance to take it for a test drive.
The convert argument of the reticulate::py_run_string() and reticulate::py_run_file() calls is set to FALSE to minimize any handicap.
results <- mark(
  base_test(path_small_csv),
  readr_test(path_small_csv),
  datatable_test(path_small_csv),
  system("Rscript r/base_test_small.R"),
  system("Rscript r/readr_test_small.R"),
  system("Rscript r/datatable_test_small.R"),
  py$pandas_test_small(),
  py_run_string("retic_pandas_test_small()", convert = FALSE),
  py_run_file("py/pandas_test_small.py", convert = FALSE),
  system("python py/pandas_test_small.py"),
  base_test(path_medium_csv),
  readr_test(path_medium_csv),
  datatable_test(path_medium_csv),
  system("Rscript r/base_test_med.R"),
  system("Rscript r/readr_test_med.R"),
  system("Rscript r/datatable_test_med.R"),
  py$pandas_test_medium(),
  py_run_string("retic_pandas_test_medium()", convert = FALSE),
  py_run_file("py/pandas_test_med.py", convert = FALSE),
  system("python py/pandas_test_med.py"),
  base_test(path_big_csv),
  readr_test(path_big_csv),
  datatable_test(path_big_csv),
  system("Rscript r/base_test_big.R"),
  system("Rscript r/readr_test_big.R"),
  system("Rscript r/datatable_test_big.R"),
  py$pandas_test_big(),
  py_run_string("retic_pandas_test_big()", convert = FALSE),
  py_run_file("py/pandas_test_big.py", convert = FALSE),
  system("python py/pandas_test_big.py"),
  check = FALSE,
  filter_gc = FALSE,
  iterations = n_iterations
)

package_results <- mark(
  system("Rscript r/test_load_readr.R"),
  system("Rscript r/test_load_datatable.R"),
  system("python py/test_load_pandas.py"),
  check = FALSE,
  filter_gc = FALSE,
  iterations = n_iterations
)
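For readers who want a pure-Python point of comparison for the iteration approach, the standard library's timeit plays a role similar to bench::mark() (an illustrative sketch only, not equivalent to the harness above):

```python
import timeit

# time 100 iterations of a list-comprehension square, like iterations = 100 above
xs = [0.1 * i for i in range(2500)]
total = timeit.timeit(lambda: [x ** 2 for x in xs], number=100)
per_iteration = total / 100
assert per_iteration > 0
```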
Initial Carpentry
package_results_df <- package_results %>%
  unnest() %>%
  mutate(package = case_when(
    str_detect(expression, "datatable") ~ "data.table",
    str_detect(expression, "readr") ~ "readr",
    str_detect(expression, "pandas") ~ "pandas"
  )) %>%
  mutate(call = case_when(
    package == "data.table" ~ "library(data.table)",
    package == "readr" ~ "library(readr)",
    package == "pandas" ~ "import pandas"
  ))

package_medians_df <- package_results_df %>%
  rename(median_package = median, min_package = min, max_package = max) %>%
  distinct(package, median_package, min_package, max_package) %>%
  add_row(median_package = bench_time(0), package = "utils")

all_exprs <- results$expression
system_calls <- all_exprs %>% str_subset("^system\\(")
local_r_fun_calls <- all_exprs %>% str_subset("^(base|readr|datatable)_test\\(")
python_eng_calls <- all_exprs %>% str_subset("^py\\$")
reticulate_calls <- all_exprs %>% str_subset("py_run")
knitr_calls <- c(local_r_fun_calls, python_eng_calls, reticulate_calls)

results_df <- results %>%
  unnest() %>%
  mutate(package = case_when(
    str_detect(expression, "datatable") ~ "data.table",
    str_detect(expression, "readr") ~ "readr",
    str_detect(expression, "pandas") ~ "pandas",
    TRUE ~ "utils"
  )) %>%
  mutate(call = case_when(
    str_detect(expression, "base") ~ "utils::read.csv()",
    str_detect(expression, "readr") ~ "readr::read_csv()",
    str_detect(expression, "datatable") ~ "data.table::fread()",
    str_detect(expression, "py_run_string") ~ "reticulate::py_run_string()",
    str_detect(expression, "py_run_file") ~ "reticulate::py_run_file()",
    str_detect(expression, "pandas") ~ "pandas.read_csv()"
    ) %>%
    str_pad(max(nchar(.)), side = "right") # enforce left alignment in plots
  ) %>%
  mutate(execution_type = case_when(
    expression %in% system_calls ~ "Sourced Script",
    expression %in% knitr_calls ~ "knitr Engine"
  )) %>%
  mutate(dependency_status = case_when(
    expression %in% system_calls ~ "Dependencies Loaded on Execution (Sourced Script)",
    expression %in% knitr_calls ~ "Dependencies Pre-Loaded")) %>%
  mutate(lang = if_else(str_detect(expression, "pandas"), "Python", "R")) %>%
  mutate(file_size = str_extract(expression, "small|med|big")) %>%
  mutate(rows = case_when(
    file_size == "small" ~ small_rows,
    file_size == "med" ~ med_rows,
    file_size == "big" ~ big_rows
  )) %>%
  left_join(package_medians_df, by = "package")

gg_df <- results_df %>%
  mutate(n_rows = rows) %>%
  arrange(rows) %>%
  mutate(rows = rows %>% comma() %>% paste("Rows") %>% as_factor()) %>%
  group_by(expression) %>%
  mutate(med_time = as.numeric(median(time))) %>%
  ungroup() %>%
  arrange(desc(med_time)) %>%
  mutate(call = as_factor(call)) %>%
  arrange(desc(lang)) %>%
  mutate(lang = as_factor(lang))
The Results
theme_simple <- function(pnl_ln_col = "black", line_type = "dotted", cap_size = 10, facet = NULL, ...) {
  theme_minimal(15, "serif") +
    theme(legend.title = element_blank(),
          legend.text = element_text(size = 12),
          legend.position = "top",
          panel.grid.minor.x = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.grid.major.y = element_line(colour = pnl_ln_col, linetype = line_type),
          legend.key.size = unit(1.5, "lines"),
          axis.text.y = element_text("mono", face = "bold", hjust = 0, size = 12),
          plot.caption = element_text(size = cap_size),
          ...)
}

prep_lab <- function(lab) {
  lab <- substitute(lab)
  bquote(italic(paste(" ", .(lab), " ")))
}

t_R <- prep_lab(t[R])
t_Python <- prep_lab(t[Python])
t_import_pandas <- prep_lab(t[Python]~-~max~group("(",t[import~~pandas],")"))

plot_times <- function(df, ...) {
  plot_init <- df %>%
    ggplot(aes(call, time)) +
    stat_ydensity(aes(fill = lang, color = lang), scale = "width", bw = 0.01, trim = FALSE) +
    scale_fill_manual(values = c("#165CAA", "#ffde57"), labels = c(t_R, t_Python)) +
    scale_color_manual(values = c("#BFC2C5", "#4584b6"), labels = c(t_R, t_Python)) +
    coord_flip() +
    theme_simple()

  if(length(vars(...))) {
    n_rows <- sort(df$n_rows, decreasing = TRUE)[[1]]
    plot_fin <- plot_init +
      facet_wrap(vars(...), ncol = 1, scales = "free") +
      labs(x = NULL, y = "Execution Time",
           title = str_glue("CSV to Data Frame: {comma(n_rows)} Rows"),
           caption = str_glue("{n_iterations} iterations"))
  } else {
    plot_fin <- plot_init +
      labs(x = NULL, y = "Execution Time",
           title = "Dependency Load Times",
           caption = str_glue("{n_iterations} iterations")) +
      geom_text(aes(y = median, label = paste("Median Time:", median)),
                color = "darkgreen", nudge_x = 0.515)
  }

  plot_fin
}
Execution Times
At 100 rows, R is faster, with base R’s utils::read.csv() finishing first.
gg_df %>% filter(file_size == "small") %>% plot_times(facet = dependency_status)
At 5,000 rows, R is still faster. In the sourced scripts, pandas.read_csv() has nearly caught up with utils::read.csv(), but data.table::fread() has pulled away.
gg_df %>% filter(file_size == "med") %>% plot_times(facet = dependency_status)
At 5,000,000 rows, we’ve reached the size where time differences would actually be noticeable.
The advantage of utils::read.csv()’s lack of dependencies has run its course, and pandas.read_csv() is faster in nearly every case.
That said, readr::read_csv() is still faster than pandas.read_csv() and, as most R users would expect, data.table::fread() is by far the fastest.
gg_df %>% filter(file_size == "big") %>% plot_times(facet = dependency_status)
tl;dr
gg_df %>%
  mutate(dependency_status = dependency_status %>%
           str_remove("\\s\\(.*$") %>%
           str_replace("Loaded on", "Loaded\non")) %>%
  ggplot(aes(call, time)) +
  stat_ydensity(aes(fill = lang, color = lang), scale = "width", bw = 0.01, trim = FALSE) +
  scale_fill_manual(values = c("#165CAA", "#ffde57"), labels = c(t_R, t_Python)) +
  scale_color_manual(values = c("#BFC2C5", "#4584b6"), labels = c(t_R, t_Python)) +
  coord_flip() +
  theme_simple(pnl_ln_col = "gray") +
  theme(axis.text = element_text(size = 8),
        strip.text = element_text(size = 12),
        strip.text.y = element_text(face = "bold", size = 15),
        panel.background = element_rect(fill = "transparent", size = 0.5)) +
  facet_grid(rows ~ dependency_status, scales = "free", switch = "y", space = "free") +
  labs(x = NULL, y = "Time",
       title = "R vs Python - CSV to Data Frame",
       caption = "12 columns, 100 iterations each")
Appendices
Dependency Load Times
package_results_df %>%
  mutate(lang = if_else(str_detect(expression, "pandas"), "Python", "R")) %>%
  arrange(desc(lang)) %>%
  mutate(lang = as_factor(lang)) %>%
  plot_times()
gg_df %>%
  filter(dependency_status == "Dependencies Loaded on Execution (Sourced Script)") %>%
  filter(file_size == "big") %>%
  mutate(adjusted_time = if_else(lang == "Python", time - max_package, NA_real_)) %>%
  rename(original_time = time) %>%
  gather(time_type, time, original_time, adjusted_time) %>%
  drop_na(time) %>%
  mutate(descrip = case_when(
    lang == "R" ~ "Original R Time",
    lang == "Python" & time_type == "original_time" ~ "Original Python Time",
    lang == "Python" & time_type == "adjusted_time" ~ "Adjusted Python Time"
  )) %>%
  arrange(desc(descrip)) %>%
  mutate(descrip = as_factor(descrip)) %>%
  ggplot(aes(call, time, fill = descrip)) +
  stat_ydensity(width = 1, size = 0, color = "transparent", scale = "width", bw = 0.01, trim = FALSE) +
  scale_fill_manual(values = c("#165CAA", "#ffde57", "#ff9051"),
                    labels = c(t_R, t_Python, t_import_pandas)) +
  guides(fill = guide_legend(nrow = 3, label.hjust = 0)) +
  coord_flip() +
  theme_simple() +
  labs(x = NULL, y = "Execution Time",
       title = "Comparing Sourced Scripts with Adjusted Python Times",
       caption = str_glue("CSV to Data Frame: {comma(big_rows)} Rows"))
Summary Tables
results_df %>%
  select(rows, lang, execution_type, call, mean, median, `itr/sec`, n_gc, mem_alloc) %>%
  distinct() %>%
  arrange(rows, desc(lang)) %>%
  mutate(rows = comma(rows),
         `itr/sec` = round(`itr/sec`, 2),
         n_gc = ifelse(execution_type == "Sourced Script", "unknown", n_gc),
         mem_alloc = ifelse(execution_type == "Sourced Script", "unknown", mem_alloc)) %>%
  mutate_at(vars(-c(rows, lang)),
            funs(cell_spec(., background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                           color = ifelse(lang == "R", "#002963", "#809100")))) %>%
  mutate(lang = lang %>%
           cell_spec(background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                     color = ifelse(lang == "R", "#002647", "#809100"))) %>%
  mutate(n_gc = if_else(str_detect(n_gc, "unknown"), "unknown", n_gc),
         mem_alloc = if_else(str_detect(mem_alloc, "unknown"), "unknown", mem_alloc)) %>%
  rename(garbage_collections = n_gc, language = lang, memory_allocated = mem_alloc) %>%
  rename_all(funs(str_to_title(str_replace(., "_", " ")))) %>%
  kable(caption = "CSV to Data Frame Times", escape = FALSE, digits = 2) %>%
  kable_styling(bootstrap_options = "condensed", font_size = 12) %>%
  collapse_rows(columns = 1:3, valign = "top")
| Rows | Language | Execution Type | Call | Mean | Median | Itr/Sec | Garbage Collections | Memory Allocated |
|---|---|---|---|---|---|---|---|---|
| 100 | R | knitr Engine | utils::read.csv() | 891.85us | 834.96us | 1121.26 | 0 | 364168 |
| | | | readr::read_csv() | 3.05ms | 2.9ms | 327.96 | 0 | 143680 |
| | | | data.table::fread() | 1.4ms | 1.36ms | 711.99 | 0 | 276896 |
| | | Sourced Script | utils::read.csv() | 205.24ms | 200.39ms | 4.87 | unknown | unknown |
| | | | readr::read_csv() | 548.78ms | 512.73ms | 1.82 | unknown | unknown |
| | | | data.table::fread() | 322.13ms | 295.43ms | 3.1 | unknown | unknown |
| | Python | knitr Engine | pandas.read_csv() | 74.99ms | 48.48ms | 13.34 | 9 | 1298288 |
| | | | reticulate::py_run_string() | 5.02ms | 3.75ms | 199.04 | 1 | 2840 |
| | | | reticulate::py_run_file() | 5.78ms | 3.75ms | 173.05 | 0 | 9504 |
| | | Sourced Script | pandas.read_csv() | 559.01ms | 530.84ms | 1.79 | unknown | unknown |
| 5,000 | R | knitr Engine | utils::read.csv() | 18.53ms | 18.37ms | 53.98 | 0 | 1975368 |
| | | | readr::read_csv() | 10.17ms | 9.75ms | 98.3 | 0 | 1621744 |
| | | | data.table::fread() | 5ms | 4.73ms | 200.02 | 0 | 675120 |
| | | Sourced Script | utils::read.csv() | 221.48ms | 216.81ms | 4.52 | unknown | unknown |
| | | | readr::read_csv() | 529.75ms | 522.03ms | 1.89 | unknown | unknown |
| | | | data.table::fread() | 296.08ms | 294.28ms | 3.38 | unknown | unknown |
| | Python | knitr Engine | pandas.read_csv() | 83.11ms | 67.52ms | 12.03 | 10 | 3058232 |
| | | | reticulate::py_run_string() | 24.05ms | 24.31ms | 41.58 | 0 | 2840 |
| | | | reticulate::py_run_file() | 22.14ms | 21.67ms | 45.16 | 0 | 2840 |
| | | Sourced Script | pandas.read_csv() | 577.44ms | 552.81ms | 1.73 | unknown | unknown |
| 5,000,000 | R | knitr Engine | utils::read.csv() | 19.55s | 19.4s | 0.05 | 168 | 2052818952 |
| | | | readr::read_csv() | 7.11s | 7.08s | 0.14 | 107 | 1567378576 |
| | | | data.table::fread() | 2.73s | 2.6s | 0.37 | 28 | 665639952 |
| | | Sourced Script | utils::read.csv() | 23.25s | 23.07s | 0.04 | unknown | unknown |
| | | | readr::read_csv() | 10.06s | 10.05s | 0.1 | unknown | unknown |
| | | | data.table::fread() | 3.78s | 3.78s | 0.26 | unknown | unknown |
| | Python | knitr Engine | pandas.read_csv() | 20.23s | 19.67s | 0.05 | 23 | 1921138232 |
| | | | reticulate::py_run_string() | 13.26s | 13.25s | 0.08 | 0 | 2840 |
| | | | reticulate::py_run_file() | 13.26s | 13.26s | 0.08 | 0 | 2840 |
| | | Sourced Script | pandas.read_csv() | 13.8s | 13.77s | 0.07 | unknown | unknown |
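The R side of the table above comes from bench::mark(); for readers who want to reproduce the per-call timing idea on the Python side without reticulate, here is a minimal stdlib-only sketch. It swaps pandas.read_csv() for the built-in csv module so it runs with no third-party dependencies, and all names in it are illustrative, not from the benchmark scripts.

```python
import csv
import statistics
import tempfile
import time

# Build a small throwaway CSV (the benchmark files have 12 columns).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    writer = csv.writer(f)
    writer.writerow([f"col_{i}" for i in range(12)])
    for row in range(100):
        writer.writerow([row * i for i in range(12)])
    path = f.name

def read_once(path):
    """One full CSV read into a list of rows."""
    with open(path, newline="") as f:
        return list(csv.reader(f))

# Time 100 iterations, as in the table's 100-iteration setup.
times = []
for _ in range(100):
    start = time.perf_counter()
    rows = read_once(path)
    times.append(time.perf_counter() - start)

print(f"median: {statistics.median(times):.6f}s over {len(times)} iterations")
```

Unlike bench::mark(), this reports wall-clock time only; it has no visibility into garbage collections or memory allocation, which is also why those columns read "unknown" for the sourced-script runs above.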
package_results_df %>%
  mutate(lang = if_else(str_detect(expression, "\\.py"), "Python", "R"),
         `itr/sec` = round(`itr/sec`, 2)) %>%
  select(lang, call, min, mean, median, max, `itr/sec`) %>%
  distinct() %>%
  mutate_at(vars(-lang),
            funs(cell_spec(., background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                           color = ifelse(lang == "R", "#002963", "#809100")))) %>%
  mutate(lang = lang %>%
           cell_spec(background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                     color = ifelse(lang == "R", "#002647", "#809100"))) %>%
  rename(language = lang) %>%
  rename_all(str_to_title) %>%
  kable(caption = "Dependency Load Times", escape = FALSE, digits = 2) %>%
  kable_styling(bootstrap_options = "condensed", font_size = 12) %>%
  collapse_rows(columns = 1, valign = "top")
| Language | Call | Min | Mean | Median | Max | Itr/Sec |
|---|---|---|---|---|---|---|
| R | library(readr) | 506ms | 510ms | 507ms | 552ms | 1.96 |
| | library(data.table) | 305ms | 308ms | 305ms | 407ms | 3.25 |
| Python | import pandas | 569ms | 611ms | 607ms | 911ms | 1.64 |
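Measuring import time from inside a running Python session is misleading, because a second `import` of the same module just hits the module cache. A minimal sketch of the fresh-interpreter approach is below; it uses the stdlib `json` module as a stand-in for pandas so it runs anywhere, and the reported figure necessarily includes interpreter start-up, so it overstates the pure import cost.

```python
import subprocess
import sys
import time

def import_time(module, runs=5):
    """Time `import <module>` in a fresh interpreter each run,
    since repeated imports in one process hit sys.modules."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)

# `json` stands in for pandas here so the sketch needs no third-party deps.
print(f"import json: {import_time('json'):.3f}s (includes interpreter start-up)")
```

On Python 3.7+, `python -X importtime -c "import pandas"` gives a finer-grained breakdown without this subprocess scaffolding.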
Environment
IDE
rstudio_info <- rstudioapi::versionInfo() # obtain in interactive session
write_rds(rstudio_info, "test-data/rstudio_info.rds")

read_rds("test-data/rstudio_info.rds") %>%
  as_tibble() %>%
  mutate(IDE = "RStudio") %>%
  select(IDE, mode, version) %>%
  mutate(version = as.character(version)) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
| IDE | mode | version |
|---|---|---|
| RStudio | desktop | 1.1.453 |
R
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] data.table_1.11.5     bindrcpp_0.2.2        forcats_0.3.0
##  [4] stringr_1.3.1         dplyr_0.7.6           purrr_0.2.5
##  [7] readr_1.1.1           tidyr_0.8.1           tibble_1.4.2.9004
## [10] ggplot2_3.0.0.9000    tidyverse_1.2.1.9000  scales_0.5.0.9000
## [13] reticulate_1.9.0.9001 kableExtra_0.9.0      bench_1.0.1
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17      lubridate_1.7.4   lattice_0.20-35
##  [4] utf8_1.1.4        assertthat_0.2.0  digest_0.6.15
##  [7] psych_1.8.4       R6_2.2.2          cellranger_1.1.0
## [10] plyr_1.8.4        evaluate_0.10.1   highr_0.7
## [13] httr_1.3.1        blogdown_0.7.1    pillar_1.3.0.9000
## [16] rlang_0.2.1       lazyeval_0.2.1    readxl_1.1.0
## [19] rstudioapi_0.7    Matrix_1.2-14     rmarkdown_1.10.7
## [22] selectr_0.4-1     foreign_0.8-70    munsell_0.5.0
## [25] broom_0.4.5       compiler_3.5.1    modelr_0.1.2
## [28] xfun_0.3          pkgconfig_2.0.1   mnormt_1.5-5
## [31] htmltools_0.3.6   tidyselect_0.2.4  bookdown_0.7
## [34] fansi_0.2.3       viridisLite_0.3.0 crayon_1.3.4
## [37] withr_2.1.2       grid_3.5.1        nlme_3.1-137
## [40] jsonlite_1.5      gtable_0.2.0      magrittr_1.5
## [43] cli_1.0.0         stringi_1.2.3     profmem_0.5.0
## [46] reshape2_1.4.3    xml2_1.2.0        htmldeps_0.1.0
## [49] tools_3.5.1       glue_1.2.0        hms_0.4.2
## [52] parallel_3.5.1    yaml_2.1.19       colorspace_1.3-2
## [55] rvest_0.3.2       knitr_1.20.8      bindr_0.1.1
## [58] haven_1.1.2
Python
import sys
import numpy
import pandas

print(sys.version)
## 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
print(numpy.__version__)
## 1.14.5
print(pandas.__version__)
## 0.23.1
System
CPU
cat("CPU:\n", system("wmic cpu get name", intern = TRUE)[[2]])
## CPU:
##  Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Memory
ram_df <- system("wmic MEMORYCHIP get BankLabel, Capacity, Speed", intern = TRUE) %>%
  str_trim() %>%
  as_tibble() %>%
  slice(2:3) %>%
  separate(value, into = c("BankLabel", "Capacity", "Speed"), sep = "\\s{2,}")

ram_df %>%
  rename_all(str_replace, "L", " L") %>%
  kable() %>%
  kable_styling(full_width = FALSE)
| Bank Label | Capacity | Speed |
|---|---|---|
| DIMM A | 17179869184 | 2400 |
| DIMM B | 17179869184 | 2400 |
ram_df %>%
  mutate(Capacity = as.numeric(Capacity) / 1e9,
         Speed = as.numeric(Speed)) %>%
  summarise(`Capacity in GB` = sum(Capacity),
            `Speed in MHz` = unique(Speed)) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
| Capacity in GB | Speed in MHz |
|---|---|
| 34.35974 | 2400 |
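One unit-conversion note on that figure: wmic reports capacities in bytes, and dividing by 1e9 yields decimal gigabytes (GB), not binary gibibytes (GiB). Each 17,179,869,184-byte module is exactly 16 GiB, so the pair is 32 GiB but 34.36 GB. A quick check, plain arithmetic using only the numbers in the table above:

```python
module_bytes = 17_179_869_184  # per-DIMM capacity reported by wmic

gib = module_bytes / 2**30   # binary gigabytes (GiB)
gb = module_bytes / 10**9    # decimal gigabytes (GB), as in the R script

print(f"per module: {gib:.0f} GiB = {gb:.9f} GB")
print(f"total: {2 * gib:.0f} GiB = {2 * gb:.5f} GB")
```

The total in decimal GB rounds to 34.35974, matching the summary table.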
Storage
cat("SSD:\n", system("wmic diskdrive get Model", intern = TRUE)[[2]])
## SSD:
##  PM951 NVMe SAMSUNG 512GB