Functions and iteration

[This article was first published on Numerilab, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Functions and iteration

Introduction

Functions in R allow us to automate things, instead of copy-pasting code.

3 major advantages :

  • Functions can have names, so it makes code easier to read
  • If your requirements change, you only have one place where to change your code
  • It minimises the risks associated with copy-pasting (e.g. forgetting to change a variable name, etc.)

When to write a function?

It is commonly recommended that start writing a function whenever you copy-paste some code for the second time (i.e. you have 3 copies of the code)

For example, if you have the relative abundance of species in 3 communities :

com1 = c(0.5,0.3,0.2)
com2 = c(0.7,0.2,0.1)
com3 = c(0.9,0.1)

You could calculate Shannon diversity for the first community with :

-sum(com1*log(com1))

[1] 1.029653

And then for the second and third one with :

-sum(com2*log(com2))

[1] 0.8018186

-sum(com3*log(com3))

[1] 0.325083

At this point, we’ve copy-pasted our code twice, so it is time to turn it into a function…

A good first step when writing a function is to determine what the function needs, what are its inputs. In this case, the function needs the relative abundances of a community. Let’s call them p, like in the traditional notation.

p <- com1
-sum(p*log(p))

[1] 1.029653

p <- com2
-sum(p*log(p))

[1] 0.8018186

Notice that I immediately test my code, to make sure that I’ve extracted everything that I neede to make the code work.

Then, you only need to wrap this code and tell R that you function needs one argument, p

If you don’t mention anything, a function in R returns the result of the last command it executes

diversite_shannon <- function(p) {
  -sum(p*log(p))
}
diversite_shannon(com1)

[1] 1.029653

diversite_shannon(com2)

[1] 0.8018186

diversite_shannon(com3)

[1] 0.325083

Our code is now much more readable AND easier to maintain

(yes, we’re still copy-pasting some stuff…)

Conditional execution

You can insert conditional statements inside your functions (anywhere in fact), using the if keyword

The typical structure of an IF statement goes like this :

if (condition) {
  # gets run if condition is true
} else {
  # gets run if condition is false
}

salutations <- function(nom) {
  if (nom == "Charles") {
   print("Allo Charles")
  } else {
    print("Qui êtes vous?")
  }
}
salutations("Esteban")

[1] "Qui êtes vous?"

salutations("Charles")

[1] "Allo Charles"

Note that it is optional for your function to return a value

Arguments

Functions in R can have as many arguments as you wish

Usually, in R, the first arguments contain data, whereas the last ones contain details about how to do the calculations.

These arguments about calculation details can have default values, which the user only has to change if needed.

pile_face <- function(n,probabilite_pile = 0.5) {
  sample(
    c("pile","face"),
    size = n,
    prob = c(probabilite_pile,1 - probabilite_pile),
    replace = TRUE
  )
}
pile_face(25)

 [1] "face" "face" "pile" "face" "face" "pile" "pile" "pile" "pile" "face"
[11] "pile" "pile" "face" "pile" "pile" "pile" "face" "pile" "face" "face"
[21] "pile" "face" "pile" "pile" "pile"

pile_face(25, 0.9)

 [1] "pile" "pile" "pile" "pile" "pile" "pile" "pile" "pile" "pile" "pile"
[11] "pile" "pile" "pile" "pile" "pile" "pile" "face" "pile" "pile" "pile"
[21] "face" "pile" "pile" "pile" "pile"

Defensive programming

When you become comfortable with writing functions, there will rapidly come a point where you won’t remember how you coded everything inside every function you’ve written and the constaints associated.

For example, our function to calculate Shannon’s diversity expects to receive relative frequencies. The calculation is not defined if the sum of the p values is not 1.

However, our function let’s us do the calculation on absolute numbers

diversite_shannon(c(1,5,25,12))

[1] -118.338

To protect our future-self, we can add some checks, which will stop the calculations if some condition is not met.

diversite_shannon <- function(p) {
  stopifnot(sum(p) == 1)
  - sum(p*log(p))
}
diversite_shannon(c(1,5,25,12))

Error in diversite_shannon(c(1, 5, 25, 12)): sum(p) == 1 is not TRUE

You can also write more user-friendly messages with just a little more code

diversite_shannon <- function(p) {
  if (sum(p) != 1) {
    stop("L'argument p doit contenir des probabilités relatives, dont la somme est 1.")
  }
  - sum(p*log(p))
}
diversite_shannon(c(1,5,25,12))

Error in diversite_shannon(c(1, 5, 25, 12)): L'argument p doit contenir des probabilités relatives, dont la somme est 1.

The special dot-dot-dot (…) argument

Function definitions in R can also contain a special argument named dot-dot-dot. This argument, if present, will catch every argument that the function received and was not explicitely named in the function definition.

It can be especially useful to send these arguments back to another function.

If anytime you need an histogram you like to have it blue instead of white, you could create your own function, which calls the original, adding you favorite or often used options:

bleustogram <- function(...){
  hist(..., col = "royalblue")
}
bleustogram(rnorm(50))

bleustogram(rlnorm(100))

Many exit points

An R function can contain many points where it stops to return a value. In this case, these points need to be defined explicitely :

diversite <- function(p, indice = "shannon") {
  if (indice == "shannon") {
    return(-sum(p*log(p)))
  } else if (indice == "simpson") {
    return(sum(p^2))
  } else {
    stop("L'indice doit être shannon ou simpson")
  }
}
diversite(c(0.5,0.3,0.2))

[1] 1.029653

diversite(c(0.5,0.3,0.2), indice = "simpson")

[1] 0.38

About the environment

Anything that you create inside a function exist only there, and cannot be accessed from the outside. Objects are reset every the functio is called.

A special feature of R is that, if a variable is not defined inside a function, R will also search outside of it, in the global environment, for an object having the same name.

b <- 2
f1 <- function(d) {
  d * b
}
f1(3)

[1] 6

f2 <- function(d) {
  b <- 8
  d * b
}
f2(3)

[1] 24

b

[1] 2

d

Error in eval(expr, envir, enclos): object 'd' not found

This is why it is important to be very careful while extracting the arguments necessary to your function. It is important to restart your R session once in a while, to make sure that things are not “accidently working” because an object with a particular name exist in your working environment.

Introducing interation

Beside function, another programming technique to reduce code duplication is the use of iterations (syn. repetitions)

There are two styles of iteration in R : Imperative programming and functional programming.

Imperative programming includes FOR and WHILE loops. It is often the most intuitive way to begin, because the concepts are explicit.

Consequently, imperative programming involves lots of “plumbing code”, that repeats itself from a loop to another and “drowns” the actual intent of the code. Functional programming allows us to extract to the core task of our code and produces denser, easier to read and maintain code.

FOR loops

Let’s start with the Baby Shark lyrics : https://genius.com/Pinkfong-baby-shark-lyrics

If one wanted to automate the writing of the first verse of the song, we could write :

for (i in 1:3) {
  print("Baby shark, doo doo doo doo doo doo")
}
print("Baby shark!")

[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark!"

A FOR loop has 2 main components :

  • The first line defines how many times we wish to do the loop
  • The code between {} defines the code we wish to repeat.

Each iteration doesn’t need to do an identical job, their action can be customized based on the index the loop is currently at.

mots = c("Baby shark","Mommy shark","It's the end")
for (i in seq_along(mots)) { # vs. 1:length(mots)
  print(paste0(mots[i],", doo doo doo doo doo doo"))
}

[1] "Baby shark, doo doo doo doo doo doo"
[1] "Mommy shark, doo doo doo doo doo doo"
[1] "It's the end, doo doo doo doo doo doo"

Nesting

You probably saw that one coming from a mile, but you can also nest loops inside one another:

mots = c("Baby shark","Mommy shark","It's the end")
for (i in seq_along(mots)) {
  for (j in 1:3) {
    print(paste0(mots[i],", doo doo doo doo doo doo"))
  }
  print(paste0(mots[i],"!"))
}

[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark!"
[1] "Mommy shark, doo doo doo doo doo doo"
[1] "Mommy shark, doo doo doo doo doo doo"
[1] "Mommy shark, doo doo doo doo doo doo"
[1] "Mommy shark!"
[1] "It's the end, doo doo doo doo doo doo"
[1] "It's the end, doo doo doo doo doo doo"
[1] "It's the end, doo doo doo doo doo doo"
[1] "It's the end!"

Keeping a result for each iteration

In a loop, it is often useful to keep the result of some calculation at each iteration. In such a case, it is strongly recommended to pre-allocate our result object before the loop begins. This is the key to fast loops in R.

For example, let’s prepare a loop that would calculate the absolute value of a series of numbers

nombres <- c(-1,0,1,-5)
valeurs_absolues <- vector("double", length(nombres))
for (i in seq_along(nombres)) {
  valeurs_absolues[i] <- abs(nombres[i])
}
valeurs_absolues

[1] 1 0 1 5

Unknown number of iterations

When you don’t know beforehand how many times our loop will run, there exist a second R structure which allows this kind of iteration : the while loop.

These loops are particularly useful in simulations.

For example, let’s run some code to determine how many coin tosses are needed to obtain a sequence of 3 heads in a row.

tirages <- 0
piles_de_suite <- 0

while (piles_de_suite < 3) {

  resultat <- pile_face(1)

  if ((resultat) == "pile") {
    piles_de_suite <- piles_de_suite + 1
  } else {
    piles_de_suite <- 0
  }

  tirages <- tirages + 1
}

tirages

[1] 33

The Map family of functions

As said before, iteration can be tackly in a completely different way, using functionnal programming. In R, the purrr package contains many functions that allows us to do functional programming in a simple and intuitive way.

library(purrr)

The principle behind each of these functions is all the same : instead of writing the code that makes the loop work, you give it a function and a series of elements to apply it on.

There are many map functions, depending on the return type you want to obtain.

  • map returns a list
  • map_lgl returns TRUE/FALSE
  • map_int returns integers
  • map_dbl returns floating point numbers
  • map_chr returns text
  • map_df returns a data.frame object

Let’s revisit our code snippet about absolute values. It could be converted to this simple piece of code

nombres <- c(-1,0,1,-5)
valeurs_absolues <- map_dbl(nombres, abs)
valeurs_absolues

[1] 1 0 1 5

We thus save a lot of plumbing code about how to make that loop

You can also give map a custom made function, e.g. let’s say you wanted to calculate the absolute value of these numbers and then add 10.

nombres <- c(-1,0,1,-5)
ma_fonction <- function(x) {
  abs(x) + 10
}
map_dbl(nombres, ma_fonction)

[1] 11 10 11 15

If this custom made function is used only in this code, you can define it anonymously inside the map function :

nombres <- c(-1,0,1,-5)
map_dbl(nombres, function(x) {abs(x) + 10})

[1] 11 10 11 15

Generally, this use is limited to short pieces of code.

Les raccourcis inclus

Shortcuts included in map functions

One of the advantages of the map functions in purrr package is that it enables us to cut the obvious parts of our code with shortcuts :

  nombres <- c(-1,0,1,-5)
  map_dbl(nombres, ~ abs(.) + 10)

  [1] 11 10 11 15

You can replace the function part with a ~ and the name of the variable with a dot.

Managing issues

When using map functions on long series of data, it can happen that our function fails for a reason or another. When a problem happens, the map function stops, with an error message, but you don’t retrieve all the instances that worked before the issue.

Let’s go back to our Shannon diversity function. If one of our communities does not contain relative frequencies, we loose all the remaining results…

communautes <- list(
  c(0.5,0.3,0.2),
  c(0.9,0.1),
  c(10,3,1)
)

map_dbl(communautes,diversite_shannon)

Error in .f(.x[[i]], ...): L'argument p doit contenir des probabilités relatives, dont la somme est 1.

The purrr package contains many adverbs to manage these kinds of situations. In each case, these adverbs wrap our function by modifying it’s behaviour in case of errors. Here we’re only seing the simplest case, possibly, to which you need to supply a value in case of error.

map_dbl(communautes,possibly(diversite_shannon,NA))

[1] 1.029653 0.325083       NA

There is also a map function to build data.frames

map_df(communautes,function(x){
  data.frame(
    richesse = length(x),
    shannon = diversite(x, indice = "shannon"),
    simpson = diversite(x, indice = "simpson")
  )
})

  richesse    shannon simpson
1        3   1.029653    0.38
2        2   0.325083    0.82
3        3 -26.321688  110.00

This function can also me tremendously useful when you need to read a bunch of csv files from a folder and bind them in a single data.frame

fichiers <- list.files(
  "/Dossier/Avec/Les/Donnees",
  pattern = "*.csv",
  full.names = TRUE
)
tableau <- map_df(fichiers,read.csv)

data.frames on input *and * output

To illustrate this case, let’s imagine a scenario where you have explored the relationship between body weight and brain weight in mammals.

library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

data(msleep)

msleep %>%
  ggplot(aes(x = bodywt, y = brainwt)) +
  geom_point(aes(color = vore)) +
  scale_x_log10() +
  scale_y_log10()

Warning: Removed 27 rows containing missing values (geom_point).

You would now like to run a regression per group to compare the parameter values.

First, let’s see how we’d do that for a single group, the herbivores :

x <- msleep %>% filter(vore == "herbi")
m <- lm(log(brainwt)~log(bodywt),data = x)
summary(m)


Call:
lm(formula = log(brainwt) ~ log(bodywt), data = x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.79504 -0.28970 -0.08648  0.19723  1.12209

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.94213    0.12464  -39.65  < 2e-16 ***
log(bodywt)  0.74212    0.03007   24.68 2.48e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4864 on 18 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.9713,	Adjusted R-squared:  0.9697
F-statistic: 609.1 on 1 and 18 DF,  p-value: 2.484e-15

It is not necessarily simple to extract the slope estimate from this model object to put them in a data.frame object.

str(m)

List of 13
 $ coefficients : Named num [1:2] -4.942 0.742
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "log(bodywt)"
 $ residuals    : Named num [1:20] -0.6656 0.6216 0.1733 -0.0253 0.5345 ...
  ..- attr(*, "names")= chr [1:20] "2" "4" "5" "6" ...
 $ effects      : Named num [1:20] 15.3808 12.0043 0.3263 -0.0469 0.4878 ...
  ..- attr(*, "names")= chr [1:20] "(Intercept)" "log(bodywt)" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:20] -0.195 -2.942 -2.336 -5.178 -5.586 ...
  ..- attr(*, "names")= chr [1:20] "2" "4" "5" "6" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:20, 1:2] -4.472 0.224 0.224 0.224 0.224 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20] "2" "4" "5" "6" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "log(bodywt)"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.22 1.01
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 18
 $ na.action    : 'omit' Named int [1:12] 1 3 12 13 15 17 19 21 22 25 ...
  ..- attr(*, "names")= chr [1:12] "1" "3" "12" "13" ...
 $ xlevels      : Named list()
 $ call         : language lm(formula = log(brainwt) ~ log(bodywt), data = x)
 $ terms        :Classes 'terms', 'formula'  language log(brainwt) ~ log(bodywt)
  .. ..- attr(*, "variables")= language list(log(brainwt), log(bodywt))
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "log(brainwt)" "log(bodywt)"
  .. .. .. ..$ : chr "log(bodywt)"
  .. ..- attr(*, "term.labels")= chr "log(bodywt)"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  .. ..- attr(*, "predvars")= language list(log(brainwt), log(bodywt))
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "log(brainwt)" "log(bodywt)"
 $ model        :'data.frame':	20 obs. of  2 variables:
  ..$ log(brainwt): num [1:20] -0.86 -2.32 -2.16 -5.2 -5.05 ...
  ..$ log(bodywt) : num [1:20] 6.397 2.695 3.512 -0.317 -0.868 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language log(brainwt) ~ log(bodywt)
  .. .. ..- attr(*, "variables")= language list(log(brainwt), log(bodywt))
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "log(brainwt)" "log(bodywt)"
  .. .. .. .. ..$ : chr "log(bodywt)"
  .. .. ..- attr(*, "term.labels")= chr "log(bodywt)"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  .. .. ..- attr(*, "predvars")= language list(log(brainwt), log(bodywt))
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "log(brainwt)" "log(bodywt)"
  ..- attr(*, "na.action")= 'omit' Named int [1:12] 1 3 12 13 15 17 19 21 22 25 ...
  .. ..- attr(*, "names")= chr [1:12] "1" "3" "12" "13" ...
 - attr(*, "class")= chr "lm"

m$coefficients

(Intercept) log(bodywt)
  -4.942134    0.742122

In the tidyverse, there is a package made especially for such situations, to extract numbers from models and put them in a data.frame.

library(broom)
tidy(m)

# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -4.94     0.125      -39.6 5.69e-19
2 log(bodywt)    0.742    0.0301      24.7 2.48e-15

You could also have been interest in model-level numbers instead :

glance(m)

# A tibble: 1 x 11
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
1     0.971         0.970 0.486      609. 2.48e-15     2  -12.9  31.8  34.8
# … with 2 more variables: deviance <dbl>, df.residual <int>

Now we have in hand everything we need to calculate a regression model per group

resultats <- msleep %>%
  group_by(vore) %>%
  group_modify(function(x,...){
    m <- lm(log(brainwt)~log(bodywt),data = x)
    tidy(m)
  })
resultats

# A tibble: 10 x 6
# Groups:   vore [5]
   vore    term        estimate std.error statistic  p.value
   <chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 carni   (Intercept)   -4.47     0.278     -16.1  8.61e- 7
 2 carni   log(bodywt)    0.667    0.112       5.95 5.71e- 4
 3 herbi   (Intercept)   -4.94     0.125     -39.6  5.69e-19
 4 herbi   log(bodywt)    0.742    0.0301     24.7  2.48e-15
 5 insecti (Intercept)   -5.11     0.167     -30.6  7.66e- 5
 6 insecti log(bodywt)    0.702    0.0477     14.7  6.84e- 4
 7 omni    (Intercept)   -4.45     0.202     -22.0  7.76e-13
 8 omni    log(bodywt)    0.869    0.0724     12.0  4.28e- 9
 9 <NA>    (Intercept)   -4.97     0.345     -14.4  7.23e- 4
10 <NA>    log(bodywt)    0.837    0.194       4.33 2.28e- 2

See that, just as when building functions, we made sure to test our code before integrating in the iteration code

You can then vizualize all these coefficients in a single, synthetic plot.

resultats %>%
  ggplot(aes(x = vore, y = estimate)) +
  geom_point() +
  geom_linerange(aes(ymin = estimate-std.error, ymax = estimate+std.error)) +
  facet_wrap(~term, scale = "free_y", ncol = 1)

To leave a comment for the author, please follow the link and comment on their blog: Numerilab.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)