
RObservations #11 Within()- Base R’s Mutate() function

[This article was first published on r – bensstats, and kindly contributed to R-bloggers.]

Introduction

As surprising as it may sound, many new R programmers are unaware of Base R syntax and the powerful functions that come with it, many of which do what functions from packages like dplyr and the rest of the tidyverse do. In this short blog post I am going to talk about the within() function and how closely it mirrors the dplyr package’s mutate() function.

The lore I have heard is that within() and mutate() both stem from healthy competition between RStudio and R Core to create a function that does exactly this job. If someone has a verifiable source for this I would love to see it myself!

Disclaimer: I am a big fan of the work Hadley Wickham and RStudio have done and have personally adopted a lot of tidy practices myself. This blog post is to highlight a Base R function that I had overlooked and only recently learned about. I want to thank my professor Georges Monette for showing me this. I would have never known about it without him!

For the example I will be using the all-too-classic iris data set.

The Question

Suppose we’re interested in creating a new variable that gives us the sepal length/width ratio for each flower measured. How would we do this?

Using mutate()

mutate() is generally written by piping the data set to the function and then defining the variables.

For brevity, I’ll just show the first six rows.

library(dplyr)

iris %>% mutate(Sepal.LWRatio=Sepal.Length/Sepal.Width) %>% head()

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species  Sepal.LWRatio
         5.1          3.5           1.4          0.2  setosa        1.457143
         4.9          3.0           1.4          0.2  setosa        1.633333
         4.7          3.2           1.3          0.2  setosa        1.468750
         4.6          3.1           1.5          0.2  setosa        1.483871
         5.0          3.6           1.4          0.2  setosa        1.388889
         5.4          3.9           1.7          0.4  setosa        1.384615

Using within()

Interestingly enough, we can do the same thing with within()! The syntax looks pretty similar to mutate()’s when no pipes are used.

head(
within(iris,{
  Sepal.LWRatio = Sepal.Length/Sepal.Width
})
)

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species  Sepal.LWRatio
         5.1          3.5           1.4          0.2  setosa        1.457143
         4.9          3.0           1.4          0.2  setosa        1.633333
         4.7          3.2           1.3          0.2  setosa        1.468750
         4.6          3.1           1.5          0.2  setosa        1.483871
         5.0          3.6           1.4          0.2  setosa        1.388889
         5.4          3.9           1.7          0.4  setosa        1.384615
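
As a small extension of the example above, here is a sketch of several assignments inside one within() call, using only base R. The names iris2, Sepal.LWRatio.Rounded and tmp are hypothetical and chosen just for illustration. It shows that later expressions can use variables created earlier in the block, and that scratch variables should be removed with rm() if they are not meant to end up as columns.

```r
# Sketch: multiple assignments in a single within() block (base R only).
iris2 <- within(iris, {
  Sepal.LWRatio <- Sepal.Length / Sepal.Width
  # A later expression can refer to a variable created earlier in the block.
  Sepal.LWRatio.Rounded <- round(Sepal.LWRatio, 2)
  # Hypothetical scratch variable: anything left in the block becomes a column,
  tmp <- Sepal.Length * 2
  # so remove it before the block ends.
  rm(tmp)
})

names(iris2)
```

When more than one new variable is created this way, it is worth checking names() on the result, since the order of the new columns may not match the order in which they were assigned.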

Interchangeability

It’s important to note that the way I wrote the code above is only how I would personally use these functions. There are no hard rules for them. I can just as easily write either one with or without ‘tidy’ syntax.

# Looks like mutate()

iris %>% within({
  Sepal.LWRatio = Sepal.Length/Sepal.Width
}) %>% head()

# Looks like within()
head(
mutate(iris, 
       Sepal.LWRatio = Sepal.Length/Sepal.Width)
)
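
For readers who want the piped style without loading any package, the same idea can be sketched with R’s native pipe. This assumes R 4.1.0 or later, where |> was introduced; the names piped and nested are hypothetical.

```r
# Assumes R >= 4.1.0 (native pipe). No packages are needed.
piped <- iris |>
  within({
    Sepal.LWRatio <- Sepal.Length / Sepal.Width
  }) |>
  head()

# The same computation written as nested calls.
nested <- head(within(iris, {
  Sepal.LWRatio <- Sepal.Length / Sepal.Width
}))

# Both styles should give the same data frame.
identical(piped, nested)
```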

Which one is faster?

Now for what really matters: which one is faster? For this we will use the rbenchmark package. To make the comparison as even as possible I will not use any pipes, which may slow the code down.

library(rbenchmark)

benchmark(
  'mutate()'=mutate(iris, Sepal.LWRatio = Sepal.Length/Sepal.Width),
  'within()'=within(iris,{Sepal.LWRatio<-Sepal.Length/Sepal.Width}),
  replications = 1000
  )

test      replications  elapsed  relative  user.self  sys.self  user.child  sys.child
mutate()          1000     1.60     8.889       1.56      0.02          NA         NA
within()          1000     0.18     1.000       0.17      0.00          NA         NA

The table shows that within() is much faster than mutate() here: about 8.9 times faster over 1,000 replications (0.18 versus 1.60 seconds elapsed). Very interesting indeed.
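
Before reading too much into timings, it can help to confirm that both calls actually return the same data. The following is a minimal sketch of such a check; the names base_result, dplyr_result and same are hypothetical, and the dplyr comparison only runs if dplyr is installed.

```r
# Result from base R.
base_result <- within(iris, {
  Sepal.LWRatio <- Sepal.Length / Sepal.Width
})

# Compare against dplyr only when the package is available.
same <- NA
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result <- dplyr::mutate(iris, Sepal.LWRatio = Sepal.Length / Sepal.Width)
  # all.equal() compares values and attributes within a numeric tolerance.
  same <- isTRUE(all.equal(base_result, dplyr_result))
}

same
```

Keep in mind that iris has only 150 rows, so the benchmark above mostly reflects per-call overhead; the gap may look different on larger data.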

Which one uses more memory?

Now let’s talk about production. To keep applications lightweight, keeping memory use low is essential. To check how much memory is used, we will use Hadley Wickham’s pryr package.

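library(pryr)

# Note: object_size() is applied to the function objects themselves here,
# so it reports their size, not the memory allocated when they are called.
df <- tibble::tibble(
  Function = c('mutate()', 'within()'),
  Bytes = c(object_size(mutate), object_size(within)))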

df

Function  Bytes
mutate()   1312
within()   1296

The mutate() function object is slightly larger than within()’s (1312 versus 1296 bytes), but the difference is minute. Keep in mind that object_size() applied to a function measures the function object itself, not the memory used when it runs. In terms of application size, I wouldn’t sweat a difference this small.
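
A complementary measurement is the size of the data each call produces rather than the size of the function. Here is a minimal sketch using base R’s object.size() from the utils package; the names with_ratio, size_before and size_after are hypothetical. It measures the resulting data frame only, not peak memory during the call.

```r
# Size of the data frame before and after adding the ratio column (base R only).
with_ratio <- within(iris, {
  Sepal.LWRatio <- Sepal.Length / Sepal.Width
})

size_before <- as.numeric(object.size(iris))
size_after  <- as.numeric(object.size(with_ratio))

# Bytes added by the new numeric column (150 doubles plus some overhead).
size_after - size_before
```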

Conclusion

As far as I can see, Base R’s within() is a faster alternative to dplyr’s mutate(), at least on this small benchmark. I’m sure there is something even faster out there with data.table, but I still have to spend some time with it. (If you know some data.table code for this I would love to see it!)
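
For what it’s worth, one possible data.table version is sketched below. It is not benchmarked here, it assumes the data.table package is installed (the post itself does not use it), and the name iris_dt is hypothetical.

```r
# Sketch only: runs when the data.table package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  # Copy iris into a data.table so the built-in data set is left untouched.
  iris_dt <- as.data.table(iris)
  # := adds the column by reference instead of copying the whole table.
  iris_dt[, Sepal.LWRatio := Sepal.Length / Sepal.Width]
  print(head(iris_dt))
}
```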

Thank you for reading!
