R Training – Functions and programming blocks

Posted on March 20, 2017 by jprimav in R bloggers | 0 Comments

[This article was first published on R – SLOW DATA, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the second module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

Introduction

Typing commands each time you need them is not very efficient
Control statements and user-defined functions help us to move forward and write more productive and complex programs

Import, overview and analyse simple data

Reading Data

load iris dataset

data("iris")
dt <- iris

Overview Data

overview data using str() and summary()
What’s the maximum value of Petal.Width?
How many levels has Species?
Have a look at the first 10 and last 3 rows using head() and tail()

Remind:

head() and tail() display 6 rows by default, but you can choose how many with argument n

# Have an overview of the data
str(dt)

summary(dt)
# Have a look at the first 10 and last 3 rows (use head and tail) 
head(dt, n = 10)
tail(dt, 3)

Spot missing values

Check for missing values using any() and is.na()
Start doing that in two steps
Then nest the two functions to obtain same result

Remind:

is.na() is vectorized, as most of R functions
Vectorized functions are used same way with scalars, vectors, matrices, etc.
What changes is the value returned which depend on input provided…
…What kind of object do you expect to be returned by is.na applied to a dataframe?

# Check if there is at least one missing value
idx <- is.na(dt)
any(idx)
any(is.na(dt)) # same as above but in a single line

Explore factors

Discover how many different levels there are in Species variable using unique() or table() functions

unique(iris$Species)

table(iris$Species)

Summarize numerical variables

Calculate mean petal width in this dataset using mean()

Remind:

you can access variables in a dataframe with $ operator
Or with the [] operator specifying in the colum slot either…
…an integer index indicating which column to pick
…or a string indicating the exact name of variable to pick

# Calculate the average petal width
mean(dt$Petal.Width)

Summarize numerical variables from pre-filtered data

Calculate average petal width only for setosa iris
Calculate average petal length only for setosa iris

Tips:

Anytime you have a doubt on the exact spell (or position) of a variable consider having a quick view at them with names() or str()
You could create a filtered dataset first and summarize it after…
…but this is not very efficient
Rather get advantage of [] operator which allows you to query rows and columns at same time

# Calculate same indicators, but only for setosa iris
mean(  dt[ dt$Species=="setosa", "Petal.Width" ]   )  
mean(  dt[ dt$Species=="setosa", "Petal.Length" ]   )

Storing analysis’ results

Store last results in a vector mystats using function c()
Visualize the results by printing the vector

Tips:

Consider store separately the two results…
…then use them to fill the vector, our final output
To visualize the value of an object you can use print() (explicit printing) or simply the name of the object (auto-printing)

# Save these last two results in a vector called mystats
avg_set_wid <- mean(  dt[ dt$Species=="setosa", "Petal.Width" ]   )  
avg_set_len <- mean(  dt[ dt$Species=="setosa", "Petal.Length" ]   )  
mystat <- c(avg_set_wid, avg_set_len)

Change variable names

Change these names this way:

Petal.Width -> pwidth
Petal.Length -> plength

…using names() function

Remind:

names() allow you to access names attribute of the data frame

names(dt)[3:4] <- c("plength", "pwidth")

Explore numerical data with histograms

Create a histogram of the variable plength using hist()
Create another one only for setosa iris

Make it prettier by:

changing number of bins (nclass=30)
title (main=“yourTitle”)
x-axis label (xlab=“yourLabel”)

Store the plot in a variable called myhist

# Create a histogram of the variable mcost
hist(dt$plength)
# create a better histogram of variable mcost only for rows where there was actually at least a material claim
hist(dt[dt$Species=="setosa", "pwidth"])
# create you final output by changing the number of bins, the title and the x-axis label
hist(dt[dt$Species=="setosa", "pwidth"], nclass = 30, main = "Petal Length Setosa Iris", xlab = "")

# save this plot in a variable called myhist
myhist <- hist(dt[dt$Species=="setosa", "pwidth"], nclass = 30, main = "Petal Length Setosa Iris", xlab = "", col = "yellow")

Lists

Presenting lists

Lists are ordered collection of objects (called list’s components)
Components can be of any type (logical vector, matrix, function,…, whatever!)
Lists’ components are numbered and may be referred to as such
When components are named they can be referred to by name
Subscripting operator […] can be used to obtain sublists of a list… -…but to select single components you need [[. . . ]] or $
Let’s now create some objects we will then store in a list

Create a list

Create an object named ‘x’ taking value ‘MyReport’
Store the objects mystats, myhist and x in a list named mylist using list()
Give a name to each component in the list using the form list(namecomponent=component,. . . )
Check the number of components in mylist with function lenght()

# Create a simple character object taking value "My analysis"
myreport <- "my analysis"
myreport2 <- c("my analysis")
myreport2 == myreport  # they are the same!

# Store the objects you have created (mystats, myhist and myreport) in a list
mylist <- list(mystat, myhist, myreport)
# give a name to each object of the list
mylist <- list(mystat = mystat, myhist = myhist, myreport = myreport)
# check the number of components in mylist with function lenght()
length(mylist)

Play with sublists

Sublist mylist keeping only the first entry
Sublist mylist keeping only the first and third entries
Play a bit with […] operator filled with positive (and negative) integer indexes as well as variables’ names (since our list has named components) …Do you think is possible to extract the second element of mystat component with []?

# Play with [] operator and sublists
mylist[1]
mylist[c(1,3)]
# sum(mylist[1]) # This trigger an error! you need [[]]

Play with extraction of lists’ components

Extract first component using [[. . . ]], [[“. . .”]] and $ operators
Experiment partial matching with $ operator
Extract the second element of first component
Sum mystat vector contained in mylist Remind: – […] for lists returns sub-lists – [[…]] returns lists’ components

# Play with [[]] operator to extract components
mylist[[1]]
mylist[["mystat"]]
mylist$mystat
mylist$myr
mylist[["mystat"]][1]

Grouped expressions

Presenting grouped expressions

In R every command is a function returning some value
Commands may be grouped together in braces: {expr_1 expr_2 }
…in which case the value of the group is the result of the last expression in the group evaluated
Grouped expressions are very useful together with control statements available in the R language

Control statements: conditional execution

if-else statement

If-else conditional construction takes the form

if (expr_1) expr_2 else expr_3

where:

expr_1 must evaluate to a single logical value
expr_2 is returned when expr_1 returns TRUE
expr_3 is returned elsewhere

Play with if-else stataments

Use an if-else statement to test whether or not data contain at least one missing values
then trigger a warning in the first case (something like ‘Watch out! there are missing values…’)
or simply print a message otherwise (something like ‘everything’s fine, go ahead…’)

Tips:

Use the following functions: any(), is.na(), print(), warning()

if(any(is.na(dt))) {
  warning("Watch out missing values!")
} else {
  print("No missing values here")
}

Play further with if-else statements

Use an if-else statement to test whether data is of class dataframe
Then create a variable taking value the number of observation in the data when condition is met and NA otherwise

Tips:

use functions: class(), nrow()

y <- if(class(dt)=="data.frame") {
  nrow(dt)
} else {
  NA
}
y

# change the condition from data.frame to factor and contorl you get a NA in this case
y <- if(class(dt)=="factor") {
  nrow(dt)
} else {
  NA
}
y

Vectorized if/else: ifelse() function

ifelse() is a vectorized version of the if/else construct
Create a variable named plength_high taking value 1 if plength>5 and 0 otherwise using ifelse()

Tip:

ifelse() has this structure ifelse(condition, if_true, if_false)

# Use the vectorized version of if/else construct, ifelse() function
plength_high <- ifelse(dt$plength>5, 1, 0)

Control statement: repetitive execution

looping

A for loop statment has the form:

for (name in expr_1) expr_2

where: – name is the loop variable – expr_1 is a vector expression, (often a sequence like 1:20) – expr_2 is often a grouped expression with its sub-expressions written in terms of the loop variable

For loops make that expr_2 is repeatedly evaluated as name ranges through the values in the vector result of expr_1

Play with loops

Use a for statment to loop over integer sequence 1:10 and print iteratively the loop variable

for(i in 1:10) {
  print(i)
}

…not very useful, right?

Use for statement to loop over columns of dataset and print the class of each of it

for(i in 1:ncol(dt)) {
  print(class(dt[,i]))
}

Repeat it but store the classes in a vector

myclasses <- NULL
for(i in 1:ncol(dt)) {
  myclasses <- c(myclasses, class(dt[,i]))
}
myclasses

# or like this
myclasses <- NULL
for(i in 1:ncol(dt)) {
  myclasses[i] <- class(dt[,i])
}
myclasses

Tips:

in grouped expressions auto-printing do not apply (use explicit printing)
initialize your target vector before, out of the loop
You can use NULL to initialize an empty object

Play further with loops

A loop variable can be anything, not only an integer sequence
Use a for statement to loop over all possible values of Species
Then use them to calculate average petal length for each Species
Final output is a numerical vector
Bind result vector with the values of Species to make the result readable
Order the matrix by Species (alphabetically)
Round average lengths to have no decimal place

Tips:

Use functions cbind(), order(), round()

out <- NULL
for(i in unique(dt$Species)) {
  dtsubset <- dt[dt$Species==i,]
  out <- c(out, mean(dtsubset$plength))
}
out

lspecies <- cbind(unique(dt$Species), out)
ord <- order(lspecies[,1])
lspecies <- lspecies[ord,]
lspecies
lspecies[,2] <- round(lspecies[,2], digits = 0)
lspecies

Alternative (better) ways to loop

for looping is not usually the best way to obtain a result in R
Code that take a whole object view is likely to be both clearer and faster
More advanced ways to obtain that same result are available in R
Previous loops can be obtained with:

sapply(dt, class)
tapply(X = dt$plengthExposure, INDEX = dt$Species,
FUN = mean)

Apply family of functions

apply functions implement looping
You can imagine apply functions as doing the following behind the scenes:
SPLIT up some data into smaller pieces
APPLY a function to each piece
then COMBINE the results

?sapply

Functions

Presenting functions in R

Functions represent one of the most powerful tool of R
Somehow the transition between interactive and developing programming mode
A function is defined by an assignment of the form:

name <- function(arg_1, arg_2) expression

where:

expression usually is a grouped expression that uses the arguments to calculate a value
A call to the function then usually takes the form:

name(arg_1 = value_1, arg_2 = value_2)

Functions can be treated much like any other object
They can be nested
They return last expression evaluated

Functions, arguments and defaults

Arguments in R can be matched by order and by name
if you mantain the exact order of function definition then there’s no need to specify the names of the arguments
Arguments can have default values:

name <- function(arg_1, arg_2 = NULL, arg_3 = 1)

arguments without default values must be always specified when calling the function

Built-in functions

Most of the functions supplied as part of the R system are themselves written in R and thus do not differ materially from user written functions

Check the R code behind built-in functions like mean, sd, etc by simply typing their names

sd

Many functions call some primitive functions whose code (written in C) is masked

Functions in loaded packages

library(e1071)
skewness

Write your functions

Write a simple function called AvgLenIris taking no arguments and returning average petal length of Iris
Call the function to see if it works (remember the parenthesis even if no argument is needed…)
Store the value returned by the function in an object named ‘y’

AvgLenIris <- function() {
   out <- mean(dt$plength)
   return(out)
}

y <- AvgLenIris()

Generalize the function adding an argument which will be the field to average, call it AvgAnyIris

AvgAnyIris <- function(field) mean(dt[,field])
AvgAnyIris(field = "pwidth")

AvgAnyIris(field = "plength")

Functions with control structures

Write a new function AvgAnyIris identical to previous one but with a control over argument validity
Control whether field is numeric using is.numeric()
Test the control structures are working

Remind:

Remind that integers are also numeric
Remind you can revert logical expressions with ! operator

Tips:

Use if statement followed by stop() function
stop() takes as argument a string/message to print whether a condition is met

AvgAnyIrisCtrl <- function(field) {
  if(!is.numeric(dt[,field])) stop("field must be numeric")
  out <- mean(dt[,field])
  out
}

AvgAnyIrisCtrl("plength")
# AvgAnyIrisCtrl("Species") # this triggers error. It is no sense to ask for the average of a factor variable

Functions with optional arguments

Write a function GroupedAvgAnyIris similar to last one but with an additional argument indicating a categorical variable to group the results by (univariate analysis)
argument indicating grouping variable has default to NULL
When no group value is provided then the same result of previous function (overall average) should be returned

Tips:

Use a for loop or tapply() as seen before

GroupedAvgAnyIris <- function(field, group=NULL) {
  if(!is.numeric(dt[,field])) stop("field must be numeric")
  
  if(is.null(group)) {
    out <- mean(dt[,field])
  } else {
    if(!is.factor(dt[,group])) stop("group must be a factor")
    out <- tapply(X = dt[,field], INDEX = dt[,group], FUN = mean)
  }
  out
}

GroupedAvgAnyIris("plength")

GroupedAvgAnyIris("plength", group = "Species")

Functions returning lists

Sometimes functions need to return more than one object
In this case lists come in very handy
Write a simple function returning a list with the division, multiplication, addition and difference between any two numbers
Call the function and store the result in a new object
Check the result with str() function
Extract the division with $ operator

mystat <- function(a, b) {
  list(div = a/b, mult = a*b, add = a+b, diff = a-b)
}

mystatvalue <- mystat(1,2)
str(mystatvalue)
mystatvalue$div

Probability distributions & simulation

Probability distributions

R language implements the tables of main probability distributions (normal, poisson, binomial, etc.)

For each distribution R provide 4 useful functions:

pnorm(), evaluate the cumulative distribution function
qnorm(), quantile function (given q, the smallest x such that P(X <= x) > q)
dnorm(), the probability density function
rnorm(), simulate from the distribution
Replace norm with pois, binom, gamma, ecc. and you have same functions for these other distributions

Let’s simulate some deviates

Simulate 1000 random numbers from normal distribution with mean=2 and sd=1 and store them in an object called “x” (use rnorm())
simulate 1000 random numbers from gamma distribution with scale=1000 and shape=0.8 and store it in an object called “z” (use rgamma())
Each distribution has its own parameters
Sometimes they have default values, sometimes not
Default parameters for normal distribution are those of standard distr. (mean=0, sd=1)

nn <- rnorm(n = 1000, mean = 2, sd = 1)
gg <- rgamma(n = 1000, scale = 750, shape = 0.8)

Visualize distributions

Plot the histogram of simulated data using hist() function
Adjust the number of bins with nclass argument
Replace the absolute frequency count with the relative one using argument probability=TRUE
Superimpose the theoretical density on the histogram

hist(nn, nclass = 30, probability = TRUE)
curve(dnorm(x, mean = 2, sd = 1), col = "red", add = TRUE)

hist(gg, nclass = 30, probability = TRUE)
curve(dgamma(x, scale = 750, shape = 0.8), col = "red", add = TRUE)

Quantiles & probabilities

Calculate the 95th quantile of normal deviates using quantile() and compare it with the theoretical one using qnorm()
Calculate the probability (relative frequency) of having values above 2000 for gamma deviates (tip: use length() function) and compare it with the theoretical one using pgamma()

# Calculate the 95th quantile of nn and compare it with the theoretical one
quantile(x = nn, probs = 0.95)

qnorm(p = 0.95, mean = 2, sd = 1)

# Calculate the probability (relative frequency) of having values above 2000 for gg
# and compare it with the theoretical one
length(gg[gg>2000])/length(gg)

Sampling

Use sample() function to sample randomly 4 elements from the sequence 1:10 without replacement:
use the size argument
replacement argument is FALSE by default, so no need to specify it
Use sample() to divide PolicyPtf dataset into a training (80%) and a test (20%) sample

# Use sample() function to sample randomly 4 elements from the sequence 1:10 without replacement
ss <- 1:10
sample(x = ss, size = 4)

# Sample with the following probabilities instead c(1/2, rep(1/18,9)) instead
sample(x = ss, size = 4, prob = c(1/2, rep(1/8, 9)))

# Use sample() to divide PolicyPtf dataset into a training (80%) and a test (20%) sample
idx_train <- sample(1:nrow(dt), size = 0.8*nrow(dt))
train <- dt[idx_train,]
test <- dt[-idx_train,]

That’s it for this module! If you have gone through all this code you should have learnt how to use use functions or programming blocks to develop more advanced programming than simple interactive commands.

When you’re ready, go ahead with the third module: R training – data manipulation.

The post R Training – Functions and programming blocks appeared first on SLOW DATA.

To leave a comment for the author, please follow the link and comment on their blog: R – SLOW DATA.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Import, overview and analyse simple data

Reading Data

Overview Data

Spot missing values

Explore factors

Summarize numerical variables

Summarize numerical variables from pre-filtered data

Storing analysis’ results

Change variable names

Explore numerical data with histograms

Lists

Presenting lists

Create a list

Play with sublists

Play with extraction of lists’ components

Grouped expressions

Presenting grouped expressions

Control statements: conditional execution

if-else statement

Play with if-else stataments

Play further with if-else statements

Vectorized if/else: ifelse() function

Control statement: repetitive execution

looping

Play with loops

Play further with loops

Alternative (better) ways to loop

Apply family of functions

Functions

Presenting functions in R

Functions, arguments and defaults

Built-in functions

Functions in loaded packages

Write your functions

Functions with control structures

Functions with optional arguments

Functions returning lists

Probability distributions & simulation

Probability distributions

Let’s simulate some deviates

Visualize distributions

Quantiles & probabilities

Sampling

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)