Writing a for-loop in R
There may be no R topic that is more controversial than the humble for-loop. And, to top it off, good help is hard to find. I was astounded by the lack of useful posts when I googled “for loops in R” (the top result linked to a page that did not exist). In fact, even searching for help within R is not easy, and not even that helpful when successful (?for won’t get you anywhere; ?'for' will get you the help page, but it is by no means exhaustive). So, at the request of Sam, a faithful reader of the Paleocave blog, I’m going to throw my hat into the ring and brace myself for the potential onslaught of internet troll wrath.
How to loop in R
Use the for loop if you want to do the same task a specific number of times.
It looks like this.
for (counter in vector) {commands}
I’m going to set up a loop to square every element of my dataset, foo, which contains the odd integers from 1 to 100 (keep in mind that vectorizing would be faster for my trivial example; see below).
foo = seq(1, 100, by=2)
foo.squared = NULL
for (i in 1:50 ) {
foo.squared[i] = foo[i]^2
}
If the goal is to create a new vector, you first have to set up a vector to store things in before running the loop. This is the foo.squared = NULL part. This was a hard lesson for me to learn. R doesn’t like being told to operate on a vector that doesn’t exist yet. So, we set up an empty vector to add stuff to later (note that this isn’t the most speed-efficient way to do this, but it’s fairly fool-proof). Next, the real for-loop begins. This code says we’ll loop 50 times (1:50). The counter we set up is ‘i’ (but you can put whatever variable name you want there). On each pass, i equals the number of the loop we are on (for the first loop, i=1; second loop, i=2), and the ith element of our new vector foo.squared is set to the square of the ith element of foo.
If you are new to programming, it is sometimes difficult to keep straight the difference between the number of the loop you are on and the value of the vector element being operated on. For example, when we’ve looped through the instructions 4 times, the next loop will be loop number 5 (so i=5). However, the 5th element of foo will be foo[5], which is equal to 9. Therefore, foo.squared[5] should equal 81.
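As a quick check of the counter-versus-value distinction, here is a small sketch that reruns the loop from above and then inspects the fifth position. Nothing new is assumed beyond the foo and foo.squared names already used in this post.

```r
foo = seq(1, 100, by = 2)   # odd integers 1, 3, ..., 99
foo.squared = NULL          # empty container, set up before the loop

for (i in 1:50) {
  foo.squared[i] = foo[i]^2
}

i               # the counter keeps its last value after the loop: 50
foo[5]          # value of the 5th element: 9
foo.squared[5]  # its square: 81
```

The counter (5) only tells R which slot to read and which slot to write; the value in that slot (9) is what gets squared.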
Silly mistakes to be made
If you are having problems with your loop, it could be one of these silly mental slips.
Did you reset your vector inside the loop? Is it possible you put a new.vector = NULL inside the loop instead of before it? Yeah, I’ve done it. About 45 minutes later I finally figured out what was wrong with my loop.
Did you forget to subscript your new vector? Possibly the inside of your loop looks like this:
{ foo.squared = foo[i]^2 }
You are missing the square brackets with a counter on the left side of the equals sign. This will result in foo.squared containing only one value, the last value calculated by the loop.
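To see the missing-subscript mistake in action, the sketch below runs both versions side by side. The names wrong and right are made up for this illustration only.

```r
foo = seq(1, 100, by = 2)
wrong = NULL
right = NULL

for (i in 1:50) {
  wrong = foo[i]^2      # no subscript: the whole object is overwritten each pass
  right[i] = foo[i]^2   # subscripted: one slot is filled per pass
}

length(wrong)  # 1
wrong          # 9801, the square of the last element (99)
length(right)  # 50
```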
Why is this controversial?
A little background:
1) Loops are slow in R. This fact puts lots of R users on the defensive from the very beginning. Users of almost any other language can just bring up looping speed when they want to get under R users’ skins. The fact is, for many people, it doesn’t matter. Computers are fast, and even slow looping will likely accomplish what you need in a reasonable length of time unless you are working with a really huge dataset. And there are lots of workarounds for users of big data in R.
2) R itself is written largely in C (along with some Fortran and R itself). When you set up a vector in R, you can easily do operations on the entire vector (this is the vectorization that gets discussed so frequently in R literature).
foo.squared = foo^2
Underneath the R code you just executed is blazingly fast C code running loops to get you the answer. The upshot here is that C is much faster than R, and if you can get what you seek in R by applying a command to a whole vector, it’s typically a good idea to do so.
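A minimal sketch to confirm that the vectorized one-liner and the explicit loop give the same answer. The names looped and vectorized are placeholders introduced here, not anything from base R.

```r
foo = seq(1, 100, by = 2)

# Explicit R-level loop
looped = numeric(length(foo))
for (i in seq_along(foo)) {
  looped[i] = foo[i]^2
}

# Vectorized: the looping happens in compiled code
vectorized = foo^2

all(looped == vectorized)  # TRUE
```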
3) R is a functional language, and as a result flow control is somewhat de-emphasized. Many R natives would prefer that you use the apply family of functions rather than writing a for-loop (often possible, but not always). Adding a layer of vitriol to this preference for the apply command is the rumor (left over from the S language, from which R was derived) that apply is faster than a for-loop. This is false (at least theoretically), because inside the code for the apply command is a for-loop written in R. There are a couple of functions in the apply family which do avoid R loops and therefore probably are faster than a loop. But most apply functions are no faster than a well constructed loop (more on well constructed later). Using apply is best left for another post; we have plenty to tackle just learning how to write a half-way decent loop.
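Just so the comparison is concrete, here is a rough sketch of the same squaring task done with sapply (one member of the apply family) and with a loop. The helper square and the result names are hypothetical, made up for this illustration.

```r
foo = seq(1, 100, by = 2)
square = function(x) x^2   # hypothetical helper for this example

# apply-family version
via.sapply = sapply(foo, square)

# for-loop version
via.loop = numeric(length(foo))
for (i in seq_along(foo)) {
  via.loop[i] = square(foo[i])
}

all(via.sapply == via.loop)  # TRUE
```

Both routes call square once per element; the choice between them is mostly about style, not a guaranteed speed difference.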
Some more advanced looping thoughts
If you are writing a for-loop inside of a larger construct, the number of times you want to loop could depend on the length of a vector, which could change depending on other factors. Therefore, you can set up the “counter in vector” part of the loop like this:
for (i in 1:length(foo)) {
#stuff to do the number of times that foo is long
}
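One caveat with 1:length(foo): if the vector ever turns out to be empty, 1:0 counts down and the loop body still runs twice. A hedged alternative is seq_along(), which is part of base R. The sketch below uses a made-up empty vector and two made-up counters to show the difference.

```r
empty = numeric(0)       # hypothetical zero-length input

1:length(empty)          # 1 0
seq_along(empty)         # integer(0)

runs.colon = 0
for (i in 1:length(empty)) runs.colon = runs.colon + 1

runs.seq = 0
for (i in seq_along(empty)) runs.seq = runs.seq + 1

runs.colon  # 2: the body ran for i = 1 and i = 0
runs.seq    # 0: the body never ran
```

For a non-empty vector the two forms produce the same indices, so swapping in seq_along(foo) costs nothing.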
The well constructed loop
If you are running into speed problems, there are a couple of things to try (see also The R Inferno).
Get as much stuff as possible out of the loop. If there are any operations that could be done to the vector prior to looping, get them outside of those curly brackets.
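As an illustration of moving work outside the curly brackets, the sketch below centers a vector on its mean. The task and all names (x, x.mean, centered.slow, centered.fast) are invented for this example; the point is only that mean(x) does not change between passes, so it can be computed once.

```r
x = seq(1, 100, by = 2)

# mean(x) is recomputed on every pass through the loop
centered.slow = numeric(length(x))
for (i in seq_along(x)) {
  centered.slow[i] = x[i] - mean(x)
}

# mean(x) is computed once, before the loop
x.mean = mean(x)
centered.fast = numeric(length(x))
for (i in seq_along(x)) {
  centered.fast[i] = x[i] - x.mean
}

all(centered.slow == centered.fast)  # TRUE
```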
Avoid growing your object
In the example above we created an empty vector to store our new values in (foo.squared). That vector is empty, and every time we go through the loop we grow the vector by one. It would be faster if we could set up our vector to be the right length ahead of time and then simply fill that vector with the correct values.
foo.squared = numeric(length=50) #generates a vector of 50 zeros; now we run the loop as before
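Spelled out in full, the preallocated version might look like the sketch below. It sizes the container from length(foo) instead of hard-coding 50, which is my own small variation on the line above.

```r
foo = seq(1, 100, by = 2)

# Allocate the full-length container once, before the loop
foo.squared = numeric(length = length(foo))

for (i in seq_along(foo)) {
  foo.squared[i] = foo[i]^2   # fills an existing slot; nothing grows
}

length(foo.squared)  # 50
foo.squared[50]      # 9801
```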
Of course, sometimes when we write loops we don’t know how many things are going to come out the other end. Usually we can guess an upper bound, though. It’s going to be faster to partially fill a very long vector using a loop, and then get rid of the meaningless stuff at the end, than to grow a vector one loop at a time. We can make a very large vector full of NAs and dump the leftover NAs at the end. Give these two loops a try and note the speed difference on your computer.
bar = seq(1,200000, by=2)
bar.squared = rep(NA, 200000)
for (i in 1:length(bar) ) {
bar.squared[i] = bar[i]^2
}
#get rid of excess NAs
bar.squared = bar.squared[!is.na(bar.squared)]
summary(bar.squared)
Versus
bar = seq(1, 200000, by=2)
bar.squared = NULL
for (i in 1:length(bar) ) {
bar.squared[i] = bar[i]^2
}
summary(bar.squared)
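If you would rather measure than eyeball it, one option is to wrap each version in system.time(), which is part of base R. This is only a sketch: the names filled, grown, time.prealloc and time.grown are made up here, and the size of the gap will vary with your machine and your version of R.

```r
bar = seq(1, 200000, by = 2)

# Version 1: oversized container of NAs, trimmed afterwards
time.prealloc = system.time({
  filled = rep(NA, 200000)
  for (i in seq_along(bar)) {
    filled[i] = bar[i]^2
  }
  filled = filled[!is.na(filled)]   # drop the unused NAs
})

# Version 2: container grown one element per pass
time.grown = system.time({
  grown = NULL
  for (i in seq_along(bar)) {
    grown[i] = bar[i]^2
  }
})

time.prealloc["elapsed"]
time.grown["elapsed"]

# Both approaches should produce the same values
all(filled == grown)
```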