
RvsPython #5.1: Making the Game even with Python’s Best Practices

[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Well, it turns out that my last blog post, which claimed that R was over 220 times faster than Python, got a lot of (constructive) criticism: I wasn’t using “best practices” in Python, which is why my Python code was so slow. This is a totally acceptable critique, so I’ve decided to write a follow-up and rewrite the code to make a more even playing field for R and Python.

In this blog I’m going to do the following:

  1. Perform Monte Carlo simulations using R and Python with for loops.
  2. Use some of Python’s best practices and see how they compare to R’s lapply.

(Note: There is already a popular article comparing for loops in R and Python against R’s lapply here)

Disclaimer

This is a follow up blog to my last write up comparing R and Python with Monte Carlo simulations. For context, check out the blog here.
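
For readers who skipped the earlier post: the estimate rests on the fact that the integral of 4*sqrt(1 - u^2) over [0, 1] equals pi, so the average of that expression over uniform draws approximates pi. Here is a minimal Python sketch of that idea. The helper name `estimate_pi` and the fixed seed are assumptions for illustration and are not part of the original code.

```python
import numpy as np

def estimate_pi(n_points, seed=0):
    """Monte Carlo estimate of pi from n_points uniform draws on [0, 1)."""
    # The seed is an assumption, used only to make the sketch reproducible
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n_points)
    # The mean of 4*sqrt(1 - u^2) approximates its integral over [0, 1], which is pi
    return float(np.mean(4 * np.sqrt(1 - u ** 2)))

estimate = estimate_pi(1_000_000)
print(estimate, abs(estimate - np.pi))
```

As in the tables below, the error should generally shrink as the number of points grows.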

Using for loops in R

In the last post, generating the data was not timed. I thought it would be a good idea to include it in the total processing time, so this time I generate the data with R’s for loop and start the timer just before that step.

I did my best to make my R code as similar as possible to the Python code in the last blog. If you see an issue, please comment!

#' Define Number of points we want to estimate
n<-c(10,100,1000,10000,100000,1000000)

#' Our Transformation function

y<- function(u) {
  4*sqrt(1-u^2)
}

#' Start the timer

startTime<-Sys.time()

#' Generate our random uniform variables
x<-list()

for(i in 1:length(n)){
  x[[i]]<-runif(n[i])
}

#' Transform our uniform variables.

yvals<-list()
for (i in 1:length(x)){
  #' Need to define this so that the list element will be populated
  #' See: https://stackoverflow.com/questions/14333525/error-in-tmpk-subscript-out-of-bounds-in-r
  yvals[[i]]<-1
  for(j in 1:length(x[[i]])){
    yvals[[i]][j]<-y(x[[i]][j])
  }
}


#' Calculate our approximations of pi

avgs<- c()

for(i in 1:length(yvals)){
  avgs[i]<-mean(yvals[[i]])
}

endTime<-Sys.time()-startTime

endTime


## Time difference of 1.009413958 secs


data.frame(n, "MC Estimate"=unlist(avgs), "Difference from True Pi"= abs(unlist(avgs)-pi))


##         n MC.Estimate Difference.from.True.Pi
## 1      10 3.281637132         0.1400444782036
## 2     100 3.391190973         0.2495983193740
## 3    1000 3.090265904         0.0513267494211
## 4   10000 3.143465663         0.0018730098616
## 5  100000 3.141027069         0.0005655842822
## 6 1000000 3.141768899         0.0001762457079

Using for loops in Python (From previous blog)

As in the previous blog, here is the code I used to run the Monte Carlo algorithm with for loops. I have heard there are more accurate ways to time this code, but since I want it to be similar to my R code, I am doing it this way.

import numpy as np
import pandas as pd
import time

# Define Number of points we want to estimate

n = [10, 100, 1000, 10000, 100000, 1000000]

# Our Transformation function

def y(x):
    return 4 * np.sqrt(1 - x ** 2)


# Start the timer (data generation is included in the timing)

startTime = time.time()

# Generate our random uniform variables

x = [np.random.uniform(size=size) for size in n]

# Transform our uniform variables one element at a time

yvals = []
for array in x:
    yval=[]
    for i in array:
        yval.append(y(i))
    yvals.append(yval)

avgs=[]

for array in yvals:
  avgs.append(np.mean(array))

endTime= time.time()-startTime

# How long it took to run our code
print("Time difference of "+ str(endTime) + " secs\n")


# Output


## Time difference of 2.790393352508545 secs


print("Estimated Values of Pi\n")


## Estimated Values of Pi


pd.DataFrame({"n":n,
              "MC Estimate":avgs,
              "Difference from True Pi": [np.abs(avg-np.pi) for avg in avgs]})


##          n  MC Estimate  Difference from True Pi
## 0       10     3.259405                 0.117812
## 1      100     3.351556                 0.209963
## 2     1000     3.130583                 0.011009
## 3    10000     3.126542                 0.015050
## 4   100000     3.144484                 0.002891
## 5  1000000     3.140740                 0.000853


library(reticulate)

py$endTime/as.numeric(endTime)


## [1] 2.764369693

OK, so with for loops R isn’t as fast as I initially stated. However, on my machine R is still more than twice as fast as Python (about 2.8 times) when both use for loops.

Hey, it ain’t 220, but it’s something.
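
As for the more accurate timing mentioned earlier, Python's standard library offers the `timeit` module. Below is a rough sketch of how the loop version could be timed with it. The smaller input of 100,000 points and the helper name `loop_transform` are assumptions for illustration; this is not the setup that produced the numbers above.

```python
import timeit
import numpy as np

def y(x):
    return 4 * np.sqrt(1 - x ** 2)

def loop_transform(values):
    """Element-by-element transform, mirroring the for-loop version."""
    out = []
    for v in values:
        out.append(y(v))
    return out

# Assumed input size and seed, chosen to keep the sketch quick and reproducible
rng = np.random.default_rng(0)
sample = rng.uniform(size=100_000)

# repeat() times the call several times; the minimum is usually the least noisy figure
timings = timeit.repeat(lambda: loop_transform(sample), number=1, repeat=3)
best = min(timings)
print(f"Best of 3 runs: {best:.4f} secs")
```

Repeating the measurement and taking the minimum reduces the influence of whatever else the machine happens to be doing during a single run.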

Using R’s “Best Practices” (Using the apply family)

Instead of using for loops, a faster alternative is to use the apply family of functions, namely sapply and lapply.

#' Start the timer
startTime<-Sys.time()

#' Generate our random uniform variables
x<-sapply(n,runif)

yvals<-lapply(x,y)
avgs<-lapply(yvals,mean)

newendTime<-Sys.time()-startTime
newendTime


## Time difference of 0.1879060268 secs


#' Speed - for loop vs apply
as.numeric(endTime)/as.numeric(newendTime)


## [1] 5.371908366

Using Python’s best practices

After getting several comments of (constructive) criticism about how the comparison was not fair, here’s some new code implementing some of the best practices for writing faster code.

This is some code that I saw posted in a comment on my LinkedIn post (thank you Thomas Halvorson), which is pretty similar in structure to the R code I have listed above.

I’m sure there are better ways out there (I have seen a lot of very good solutions in the comments on the last blog), but I found this one to be the most readable, and it follows a structure similar to R’s.

(Let me know if you have something better!)


startTime= time.time()

x = [np.random.uniform(size=size) for size in n]
yvals = list(map(y, x))
avgs = list(map(np.mean, yvals))

endTime= time.time()-startTime

# How long it took to run our code
print("Time difference of "+ str(endTime) + " secs\n")


## Time difference of 0.0629582405090332 secs

Comparing R with Python now we have:

as.numeric(newendTime)/py$endTime


## [1] 2.984613695

Python is nearly 3 times faster on my machine using the updated code.
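
One likely reason for the gap: `map(y, x)` calls `y` once per array, so NumPy does the arithmetic on whole arrays rather than one element at a time in a Python loop. Here is a small sketch checking that both approaches give the same values on a toy input. The names `looped` and `vectorized`, the seed and the input size are illustrative assumptions.

```python
import numpy as np

def y(x):
    return 4 * np.sqrt(1 - x ** 2)

# Assumed toy input: 1,000 uniform draws with a fixed seed
rng = np.random.default_rng(0)
u = rng.uniform(size=1_000)

# One Python-level call to y per element
looped = np.array([y(v) for v in u])

# A single call to y on the whole array
vectorized = y(u)

print(np.allclose(looped, vectorized))
```

The results match, but the second form makes far fewer Python-level function calls, which is where most of the time in the loop version tends to go.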

Conclusion

Well, you live and learn. Best practices can make or break your code: on my machine, Python went from more than twice as slow as R with for loops to nearly 3 times faster than R’s apply version once the code was updated.


Thank you everyone for reading my last blog post and pointing out some obvious issues that I didn’t notice! I definitely will be using the map() function more often in my Python code!


To leave a comment for the author, please follow the link and comment on their blog: r – bensstats.
