Running the Same Task in Python and R
According to a KDD poll, fewer respondents (by rate) used only R in 2017 than in 2016. At the same time, more respondents (by rate) used only Python in 2017 than in 2016.
Let’s take this as an excuse to take a quick look at what happens when we try a task in both systems.
For our task we picked the painful exercise of directly reading a 50,000,000-row by 50-column data set into memory on a machine with only 8GB of RAM.
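To get a feel for why this is painful, here is a rough back-of-envelope sketch of how much memory a dense table of that shape would need. The post does not say what the column types are, so the cell widths below are assumptions chosen for illustration, and the 8GB is taken as 8e9 bytes.

```python
# Back-of-envelope memory estimate for a 50,000,000 x 50 table.
# The per-cell widths are assumptions; the post does not state column types.
ROWS = 50_000_000
COLS = 50
RAM_BYTES = 8 * 10**9  # the 8GB machine, taken as 8e9 bytes


def in_memory_bytes(rows: int, cols: int, bytes_per_cell: int) -> int:
    """Bytes needed to hold a dense rows-by-cols table of fixed-width cells."""
    return rows * cols * bytes_per_cell


cells = ROWS * COLS
print(f"cells: {cells:,}")

for label, width in [
    ("8-byte cells (e.g. float64)", 8),
    ("4-byte cells (e.g. int32)", 4),
    ("1-byte cells (e.g. int8)", 1),
]:
    need = in_memory_bytes(ROWS, COLS, width)
    print(f"{label}: {need / 1e9:.1f} GB, {need / RAM_BYTES:.2f}x of RAM")
```

Under these assumptions, only fairly narrow column types fit comfortably in RAM, and any reader that makes extra copies while parsing has even less room to work with.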
In Python, the Pandas package takes around 6 minutes to read the data, and then one is ready to work.
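The exact script behind that timing is not shown, so the following is only a minimal sketch of how one might time such a read with `pandas.read_csv()`. The file here is a small hypothetical stand-in (10,000 rows by 50 columns, written to a temporary directory) so the example runs quickly; the file name and column names are made up for illustration.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in data: 10,000 rows by 50 columns of small integers.
# The real task in the post uses 50,000,000 rows.
n_rows, n_cols = 10_000, 50
rng = np.random.default_rng(0)
small = pd.DataFrame(
    rng.integers(0, 100, size=(n_rows, n_cols)),
    columns=[f"col_{i}" for i in range(n_cols)],
)

# Write the stand-in file to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
small.to_csv(path, index=False)

# Time the read itself.
start = time.perf_counter()
df = pd.read_csv(path)
elapsed = time.perf_counter() - start

# Report the shape and how much memory the resulting frame occupies.
mem_bytes = int(df.memory_usage(deep=True).sum())
print(f"read {df.shape[0]} rows x {df.shape[1]} columns in {elapsed:.3f}s")
print(f"in-memory size: {mem_bytes / 1e6:.1f} MB")
```

For a file near the limit of RAM, passing narrower types through the `dtype` argument of `read_csv()` is one way to shrink the in-memory size, though whether that was needed here is not stated.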
In R, both utils::read.csv() and readr::read_csv() fail with out-of-memory messages. So if your view of R is “base R only”, or “base R plus tidyverse only”, or “tidyverse only”, reading this file is a “hard task.”
With the above narrow view, one would have no choice but to move to Python if one wants to get the job done.
Or, we could remember data.table. While data.table is obviously not part of the tidyverse, data.table has been a best practice in R for around 12 years. It can read the data and is ready to work in R in under a minute.
In conclusion, to get things done in a pinch: learn Python or learn data.table. And, in my opinion, “tidyverse first teaching” (commonly code for “tidyverse only teaching”) may not serve the R community well in the long run.