R Training – Functions and programming blocks
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is the second module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.
Introduction
- Typing commands each time you need them is not very efficient
- Control statements and user-defined functions help us to move forward and write more productive and complex programs
Import, overview and analyse simple data
Reading Data
- load iris dataset
data("iris") dt <- iris
Overview Data
- overview data using str() and summary()
- What’s the maximum value of Petal.Width?
- How many levels has Species?
- Have a look at the first 10 and last 3 rows using head() and tail()
Remind:
- head() and tail() display 6 rows by default, but you can choose how many with argument n
# Have an overview of the data str(dt)
summary(dt) # Have a look at the first 10 and last 3 rows (use head and tail) head(dt, n = 10) tail(dt, 3)
Spot missing values
- Check for missing values using any() and is.na()
- Start doing that in two steps
- Then nest the two functions to obtain same result
Remind:
- is.na() is vectorized, as most of R functions
- Vectorized functions are used same way with scalars, vectors, matrices, etc.
- What changes is the value returned which depend on input provided…
- …What kind of object do you expect to be returned by is.na applied to a dataframe?
# Check if there is at least one missing value idx <- is.na(dt) any(idx) any(is.na(dt)) # same as above but in a single line
Explore factors
- Discover how many different levels there are in Species variable using unique() or table() functions
unique(iris$Species)
table(iris$Species)
Summarize numerical variables
- Calculate mean petal width in this dataset using mean()
Remind:
- you can access variables in a dataframe with $ operator
- Or with the [] operator specifying in the colum slot either…
- …an integer index indicating which column to pick
- …or a string indicating the exact name of variable to pick
# Calculate the average petal width mean(dt$Petal.Width)
Summarize numerical variables from pre-filtered data
- Calculate average petal width only for setosa iris
- Calculate average petal length only for setosa iris
Tips:
- Anytime you have a doubt on the exact spell (or position) of a variable consider having a quick view at them with names() or str()
- You could create a filtered dataset first and summarize it after…
- …but this is not very efficient
- Rather get advantage of [] operator which allows you to query rows and columns at same time
# Calculate same indicators, but only for setosa iris mean( dt[ dt$Species=="setosa", "Petal.Width" ] ) mean( dt[ dt$Species=="setosa", "Petal.Length" ] )
Storing analysis’ results
- Store last results in a vector mystats using function c()
- Visualize the results by printing the vector
Tips:
- Consider store separately the two results…
- …then use them to fill the vector, our final output
- To visualize the value of an object you can use print() (explicit printing) or simply the name of the object (auto-printing)
# Save these last two results in a vector called mystats avg_set_wid <- mean( dt[ dt$Species=="setosa", "Petal.Width" ] ) avg_set_len <- mean( dt[ dt$Species=="setosa", "Petal.Length" ] ) mystat <- c(avg_set_wid, avg_set_len)
Change variable names
Change these names this way:
- Petal.Width -> pwidth
- Petal.Length -> plength
…using names() function
Remind:
- names() allow you to access names attribute of the data frame
names(dt)[3:4] <- c("plength", "pwidth")
Explore numerical data with histograms
- Create a histogram of the variable plength using hist()
- Create another one only for setosa iris
Make it prettier by:
- changing number of bins (nclass=30)
- title (main=“yourTitle”)
- x-axis label (xlab=“yourLabel”)
Store the plot in a variable called myhist
# Create a histogram of the variable mcost hist(dt$plength) # create a better histogram of variable mcost only for rows where there was actually at least a material claim hist(dt[dt$Species=="setosa", "pwidth"]) # create you final output by changing the number of bins, the title and the x-axis label hist(dt[dt$Species=="setosa", "pwidth"], nclass = 30, main = "Petal Length Setosa Iris", xlab = "")
# save this plot in a variable called myhist myhist <- hist(dt[dt$Species=="setosa", "pwidth"], nclass = 30, main = "Petal Length Setosa Iris", xlab = "", col = "yellow")
Lists
Presenting lists
- Lists are ordered collection of objects (called list’s components)
- Components can be of any type (logical vector, matrix, function,…, whatever!)
- Lists’ components are numbered and may be referred to as such
- When components are named they can be referred to by name
- Subscripting operator […] can be used to obtain sublists of a list… -…but to select single components you need [[. . . ]] or $
- Let’s now create some objects we will then store in a list
Create a list
- Create an object named ‘x’ taking value ‘MyReport’
- Store the objects mystats, myhist and x in a list named mylist using list()
- Give a name to each component in the list using the form list(namecomponent=component,. . . )
- Check the number of components in mylist with function lenght()
# Create a simple character object taking value "My analysis" myreport <- "my analysis" myreport2 <- c("my analysis") myreport2 == myreport # they are the same!
# Store the objects you have created (mystats, myhist and myreport) in a list mylist <- list(mystat, myhist, myreport) # give a name to each object of the list mylist <- list(mystat = mystat, myhist = myhist, myreport = myreport) # check the number of components in mylist with function lenght() length(mylist)
Play with sublists
- Sublist mylist keeping only the first entry
- Sublist mylist keeping only the first and third entries
- Play a bit with […] operator filled with positive (and negative) integer indexes as well as variables’ names (since our list has named components) …Do you think is possible to extract the second element of mystat component with []?
# Play with [] operator and sublists mylist[1] mylist[c(1,3)] # sum(mylist[1]) # This trigger an error! you need [[]]
Play with extraction of lists’ components
- Extract first component using [[. . . ]], [[“. . .”]] and $ operators
- Experiment partial matching with $ operator
- Extract the second element of first component
- Sum mystat vector contained in mylist Remind: – […] for lists returns sub-lists – [[…]] returns lists’ components
# Play with [[]] operator to extract components mylist[[1]] mylist[["mystat"]] mylist$mystat mylist$myr mylist[["mystat"]][1]
Grouped expressions
Presenting grouped expressions
- In R every command is a function returning some value
- Commands may be grouped together in braces: {expr_1 expr_2 }
- …in which case the value of the group is the result of the last expression in the group evaluated
- Grouped expressions are very useful together with control statements available in the R language
Control statements: conditional execution
if-else statement
- If-else conditional construction takes the form
if (expr_1) expr_2 else expr_3
where:
- expr_1 must evaluate to a single logical value
- expr_2 is returned when expr_1 returns TRUE
- expr_3 is returned elsewhere
Play with if-else stataments
- Use an if-else statement to test whether or not data contain at least one missing values
- then trigger a warning in the first case (something like ‘Watch out! there are missing values…’)
- or simply print a message otherwise (something like ‘everything’s fine, go ahead…’)
Tips:
- Use the following functions: any(), is.na(), print(), warning()
if(any(is.na(dt))) { warning("Watch out missing values!") } else { print("No missing values here") }
Play further with if-else statements
- Use an if-else statement to test whether data is of class dataframe
- Then create a variable taking value the number of observation in the data when condition is met and NA otherwise
Tips:
- use functions: class(), nrow()
y <- if(class(dt)=="data.frame") { nrow(dt) } else { NA } y
# change the condition from data.frame to factor and contorl you get a NA in this case y <- if(class(dt)=="factor") { nrow(dt) } else { NA } y
Vectorized if/else: ifelse() function
- ifelse() is a vectorized version of the if/else construct
- Create a variable named plength_high taking value 1 if plength>5 and 0 otherwise using ifelse()
Tip:
- ifelse() has this structure ifelse(condition, if_true, if_false)
# Use the vectorized version of if/else construct, ifelse() function plength_high <- ifelse(dt$plength>5, 1, 0)
Control statement: repetitive execution
looping
- A for loop statment has the form:
for (name in expr_1) expr_2
where: – name is the loop variable – expr_1 is a vector expression, (often a sequence like 1:20) – expr_2 is often a grouped expression with its sub-expressions written in terms of the loop variable
For loops make that expr_2 is repeatedly evaluated as name ranges through the values in the vector result of expr_1
Play with loops
- Use a for statment to loop over integer sequence 1:10 and print iteratively the loop variable
for(i in 1:10) { print(i) }
…not very useful, right?
- Use for statement to loop over columns of dataset and print the class of each of it
for(i in 1:ncol(dt)) { print(class(dt[,i])) }
- Repeat it but store the classes in a vector
myclasses <- NULL for(i in 1:ncol(dt)) { myclasses <- c(myclasses, class(dt[,i])) } myclasses
# or like this myclasses <- NULL for(i in 1:ncol(dt)) { myclasses[i] <- class(dt[,i]) } myclasses
Tips:
- in grouped expressions auto-printing do not apply (use explicit printing)
- initialize your target vector before, out of the loop
- You can use NULL to initialize an empty object
Play further with loops
- A loop variable can be anything, not only an integer sequence
- Use a for statement to loop over all possible values of Species
- Then use them to calculate average petal length for each Species
- Final output is a numerical vector
- Bind result vector with the values of Species to make the result readable
- Order the matrix by Species (alphabetically)
- Round average lengths to have no decimal place
Tips:
- Use functions cbind(), order(), round()
out <- NULL for(i in unique(dt$Species)) { dtsubset <- dt[dt$Species==i,] out <- c(out, mean(dtsubset$plength)) } out lspecies <- cbind(unique(dt$Species), out) ord <- order(lspecies[,1]) lspecies <- lspecies[ord,] lspecies lspecies[,2] <- round(lspecies[,2], digits = 0) lspecies
Alternative (better) ways to loop
- for looping is not usually the best way to obtain a result in R
- Code that take a whole object view is likely to be both clearer and faster
- More advanced ways to obtain that same result are available in R
- Previous loops can be obtained with:
sapply(dt, class) tapply(X = dt$plengthExposure, INDEX = dt$Species, FUN = mean)
Apply family of functions
- apply functions implement looping
- You can imagine apply functions as doing the following behind the scenes:
- SPLIT up some data into smaller pieces
- APPLY a function to each piece
- then COMBINE the results
?sapply
Functions
Presenting functions in R
- Functions represent one of the most powerful tool of R
- Somehow the transition between interactive and developing programming mode
- A function is defined by an assignment of the form:
name <- function(arg_1, arg_2) expression
where:
- expression usually is a grouped expression that uses the arguments to calculate a value
- A call to the function then usually takes the form:
name(arg_1 = value_1, arg_2 = value_2)
- Functions can be treated much like any other object
- They can be nested
- They return last expression evaluated
Functions, arguments and defaults
- Arguments in R can be matched by order and by name
- if you mantain the exact order of function definition then there’s no need to specify the names of the arguments
- Arguments can have default values:
name <- function(arg_1, arg_2 = NULL, arg_3 = 1)
- arguments without default values must be always specified when calling the function
Built-in functions
Most of the functions supplied as part of the R system are themselves written in R and thus do not differ materially from user written functions
Check the R code behind built-in functions like mean, sd, etc by simply typing their names
sd
- Many functions call some primitive functions whose code (written in C) is masked
Functions in loaded packages
library(e1071) skewness
Write your functions
- Write a simple function called AvgLenIris taking no arguments and returning average petal length of Iris
- Call the function to see if it works (remember the parenthesis even if no argument is needed…)
- Store the value returned by the function in an object named ‘y’
AvgLenIris <- function() { out <- mean(dt$plength) return(out) } y <- AvgLenIris()
- Generalize the function adding an argument which will be the field to average, call it AvgAnyIris
AvgAnyIris <- function(field) mean(dt[,field]) AvgAnyIris(field = "pwidth")
AvgAnyIris(field = "plength")
Functions with control structures
- Write a new function AvgAnyIris identical to previous one but with a control over argument validity
- Control whether field is numeric using is.numeric()
- Test the control structures are working
Remind:
- Remind that integers are also numeric
- Remind you can revert logical expressions with ! operator
Tips:
- Use if statement followed by stop() function
- stop() takes as argument a string/message to print whether a condition is met
AvgAnyIrisCtrl <- function(field) { if(!is.numeric(dt[,field])) stop("field must be numeric") out <- mean(dt[,field]) out } AvgAnyIrisCtrl("plength") # AvgAnyIrisCtrl("Species") # this triggers error. It is no sense to ask for the average of a factor variable
Functions with optional arguments
- Write a function GroupedAvgAnyIris similar to last one but with an additional argument indicating a categorical variable to group the results by (univariate analysis)
- argument indicating grouping variable has default to NULL
- When no group value is provided then the same result of previous function (overall average) should be returned
Tips:
- Use a for loop or tapply() as seen before
GroupedAvgAnyIris <- function(field, group=NULL) { if(!is.numeric(dt[,field])) stop("field must be numeric") if(is.null(group)) { out <- mean(dt[,field]) } else { if(!is.factor(dt[,group])) stop("group must be a factor") out <- tapply(X = dt[,field], INDEX = dt[,group], FUN = mean) } out } GroupedAvgAnyIris("plength")
GroupedAvgAnyIris("plength", group = "Species")
Functions returning lists
- Sometimes functions need to return more than one object
- In this case lists come in very handy
- Write a simple function returning a list with the division, multiplication, addition and difference between any two numbers
- Call the function and store the result in a new object
- Check the result with str() function
- Extract the division with $ operator
mystat <- function(a, b) { list(div = a/b, mult = a*b, add = a+b, diff = a-b) } mystatvalue <- mystat(1,2) str(mystatvalue) mystatvalue$div
Probability distributions & simulation
Probability distributions
- R language implements the tables of main probability distributions (normal, poisson, binomial, etc.)
For each distribution R provide 4 useful functions:
- pnorm(), evaluate the cumulative distribution function
- qnorm(), quantile function (given q, the smallest x such that P(X <= x) > q)
- dnorm(), the probability density function
- rnorm(), simulate from the distribution
- Replace norm with pois, binom, gamma, ecc. and you have same functions for these other distributions
Let’s simulate some deviates
- Simulate 1000 random numbers from normal distribution with mean=2 and sd=1 and store them in an object called “x” (use rnorm())
- simulate 1000 random numbers from gamma distribution with scale=1000 and shape=0.8 and store it in an object called “z” (use rgamma())
- Each distribution has its own parameters
- Sometimes they have default values, sometimes not
- Default parameters for normal distribution are those of standard distr. (mean=0, sd=1)
nn <- rnorm(n = 1000, mean = 2, sd = 1) gg <- rgamma(n = 1000, scale = 750, shape = 0.8)
Visualize distributions
- Plot the histogram of simulated data using hist() function
- Adjust the number of bins with nclass argument
- Replace the absolute frequency count with the relative one using argument probability=TRUE
- Superimpose the theoretical density on the histogram
hist(nn, nclass = 30, probability = TRUE) curve(dnorm(x, mean = 2, sd = 1), col = "red", add = TRUE)
hist(gg, nclass = 30, probability = TRUE) curve(dgamma(x, scale = 750, shape = 0.8), col = "red", add = TRUE)
Quantiles & probabilities
- Calculate the 95th quantile of normal deviates using quantile() and compare it with the theoretical one using qnorm()
- Calculate the probability (relative frequency) of having values above 2000 for gamma deviates (tip: use length() function) and compare it with the theoretical one using pgamma()
# Calculate the 95th quantile of nn and compare it with the theoretical one quantile(x = nn, probs = 0.95) qnorm(p = 0.95, mean = 2, sd = 1) # Calculate the probability (relative frequency) of having values above 2000 for gg # and compare it with the theoretical one length(gg[gg>2000])/length(gg)
Sampling
- Use sample() function to sample randomly 4 elements from the sequence 1:10 without replacement:
- use the size argument
- replacement argument is FALSE by default, so no need to specify it
- Use sample() to divide PolicyPtf dataset into a training (80%) and a test (20%) sample
# Use sample() function to sample randomly 4 elements from the sequence 1:10 without replacement ss <- 1:10 sample(x = ss, size = 4) # Sample with the following probabilities instead c(1/2, rep(1/18,9)) instead sample(x = ss, size = 4, prob = c(1/2, rep(1/8, 9))) # Use sample() to divide PolicyPtf dataset into a training (80%) and a test (20%) sample idx_train <- sample(1:nrow(dt), size = 0.8*nrow(dt)) train <- dt[idx_train,] test <- dt[-idx_train,]
That’s it for this module! If you have gone through all this code you should have learnt how to use use functions or programming blocks to develop more advanced programming than simple interactive commands.
When you’re ready, go ahead with the third module: R training – data manipulation.
The post R Training – Functions and programming blocks appeared first on SLOW DATA.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.