One of the things I do frequently in my research is to apply some function to a large number of rows in a data set. I am a great fan of the loop structure in R and use it a lot. I know one should vectorize and avoid loops in R where possible; however, for me it is easier to write my code using loops, as it mirrors the way I think about a given data problem. This has the unfortunate consequence that my R scripts sometimes become unwieldy and take hours to run. Reading The Art of R Programming, I recently came across a simple way to avoid large loop structures: feeding functions to the apply() function. I had read about this before, but the explanation in that book opened my eyes.
Among the more difficult things in R to wrap one's head around is the apply family of functions. Essentially they are wrappers around loops, and allow you to apply a given function to the elements of matrices, vectors or lists. The apply() function works with matrices, operating over their rows or columns, and is the one I mainly use. Often in my scripts I will have entries like:
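As an illustration only, an entry of this kind might look like the sketch below. The matrix `data`, its size, and the operations inside the function are assumptions chosen for the example, not taken from an actual script.

```r
# Hypothetical data: 5 rows and 3 columns of standard normal draws
set.seed(1)
data <- matrix(rnorm(15), nrow = 5, ncol = 3)

# The 1 after the data argument applies the anonymous function to each row
m <- apply(data, 1, function(x) {
  centered <- x - mean(x)   # deviation of each value from the row mean
  max(abs(centered))        # largest absolute deviation in the row
})
```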
This is an example of using an anonymous function: the operations to perform are written inside the function(x) argument, where x refers to each row of the matrix in turn (had I written 2 instead of 1 after the data argument, the function would have been applied to each column). This allows you to avoid writing a loop, but a lot of code ends up inside each apply() call, and if we use the same procedure on several different matrices we repeat that code and make the R script unnecessarily complex. Hence not much is gained compared to simply writing the loop explicitly.
One solution is to first capture your operations in a named function, and then pass this function to apply(). This has the advantage of letting you reuse the function, and you can place all your custom functions at the beginning of the script, making your code easier to read and less cluttered.
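A minimal sketch of the idea, assuming a hypothetical helper `value_range()` and two made-up matrices `a` and `b`: the helper is written once near the top of the script and then reused in several apply() calls.

```r
# Hypothetical helper, defined once: spread of a numeric vector
value_range <- function(x) {
  max(x) - min(x)
}

# Two made-up matrices of different shapes
set.seed(42)
a <- matrix(rnorm(12), nrow = 4)   # 4 rows, 3 columns
b <- matrix(runif(20), nrow = 5)   # 5 rows, 4 columns

# The same function is reused without repeating its body
range_a_rows <- apply(a, 1, value_range)   # one value per row of a
range_b_rows <- apply(b, 1, value_range)   # one value per row of b
range_b_cols <- apply(b, 2, value_range)   # one value per column of b
```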
To give an example, the code below first creates a matrix with three columns, where each column is a draw of 100 random values from a normal distribution with mean 0 and standard deviation 1. A function is then defined to find outliers in the rows of the matrix. The function first calculates the mean of a row, subtracts each value in the row from that mean, and takes the absolute value. It then finds the position of the largest of these absolute deviations in the row (i.e. is it in place 1, 2 or 3) and the largest deviation itself. It returns these two values in a vector called out. Finally, the function is fed to apply() and used on the generated data.
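The steps just described can be sketched as follows. The names `data`, `outliers` and `result`, and the seed, are assumptions made for this illustration; only the vector name `out` comes from the description above.

```r
# 100 rows, 3 columns; each column holds 100 draws from N(0, 1)
set.seed(123)
data <- matrix(rnorm(300), nrow = 100, ncol = 3)

# Find the value in a row that lies furthest from the row mean
outliers <- function(x) {
  dev <- abs(mean(x) - x)   # absolute deviation from the row mean
  pos <- which.max(dev)     # position (1, 2 or 3) of the largest deviation
  out <- c(pos, max(dev))   # position first, then the deviation itself
  out
}

# Apply the named function to every row of the matrix
result <- apply(data, 1, outliers)
```

Because each call returns a vector of length two, apply() stores the results as columns, so `result` has 2 rows and 100 columns; `t(result)` gives one row per row of `data` if that layout is more convenient.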
As you can see, we have pulled the operations out of the apply() call and into their own function. This saves us the effort of rewriting these operations every time we use apply() to find outliers, reducing the number of lines in the R script and making the code more readable.
To leave a comment for the author, please follow the link and comment on their blog: The PolStat Feed.