Introduction to R
Quickstart
Installation
Variables
Values can be assigned to variables with the operators <-, = or ->.
# assign 1 to variable x
x <- 1
# or
x = 1
# or
1 -> x
Functions
R functions are invoked by name, followed by parentheses containing zero or more arguments.
# summing 1+2+3+4+5
sum(1,2,3,4,5)
Packages
Functionality beyond what the core R library offers is available through R packages. To install an additional package, call the install.packages function.
# install the "xts" package
install.packages('xts')
There are two ways to invoke functions from add-on packages: using the package namespace or loading the package.
# using the namespace.
# Invoke the function as package_name::function_name
xts::is.xts(1)
# loading the package with the 'require' function.
# This makes its functions available without using namespaces
require(xts)
is.xts(1)
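A script can also check whether a package is installed before using it. The following is a minimal sketch that assumes the base function requireNamespace(), which returns TRUE or FALSE without attaching the package; the variable name has_xts is only illustrative.

```r
# Check whether the 'xts' package is installed, without attaching it
has_xts <- requireNamespace("xts", quietly = TRUE)

if (has_xts) {
  # namespace-qualified call, no library()/require() needed
  print(xts::is.xts(1))
} else {
  message("Package 'xts' is not installed; run install.packages('xts') first.")
}
```

Note that library() stops with an error when a package is missing, whereas require() returns FALSE with a warning, so the return value of require() should be checked if the script depends on the package.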
Help
R provides extensive documentation. Enter ?function_name to access the documentation of a function.
# examples
?sum
?mean
?rnorm
Basic Data Types
Several basic R data types occur frequently in routine R calculations.
Numeric
Decimal values are called numerics in R. Numeric is the default computational data type. If a decimal value is assigned to a variable x as follows, x will be of numeric type.
x <- 10.5 # assign a decimal value
class(x) # class of x
## [1] "numeric"
Furthermore, even if a whole number is assigned to a variable x, it is still stored as a numeric value.
x <- 10 # assign a whole number
is.integer(x) # is integer?
## [1] FALSE
Integer
To create an integer variable in R, call the as.integer function.
x <- as.integer(10) # assign an integer data type
is.integer(x) # is integer?
## [1] TRUE
Integers can also be declared by appending an L suffix.
x <- 10L # assign an integer data type
is.integer(x) # is integer?
## [1] TRUE
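The difference between the two storage types can be inspected directly. The sketch below uses only base R functions; the variable names are illustrative.

```r
x <- 10      # stored as a double-precision numeric
y <- 10L     # stored as an integer

typeof(x)            # "double"
typeof(y)            # "integer"
x == y               # TRUE: the values compare equal
identical(x, y)      # FALSE: the storage types differ

# as.integer() truncates toward zero rather than rounding
z <- as.integer(10.9)
z                    # 10
```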
Complex
Complex numbers are of complex type.
z <- 3+4i # assign a complex number
class(z) # class of z
## [1] "complex"
Basic functions which support complex arithmetic are:
Re(z) # real part
## [1] 3
Im(z) # imaginary part
## [1] 4
Mod(z) # modulus
## [1] 5
Arg(z) # argument
## [1] 0.9272952
Conj(z) # complex conjugate
## [1] 3-4i
Logical
A logical value is often created via comparison between variables.
x <- 2 > 1 # is 2 greater than 1?
x
## [1] TRUE
Standard logical operations are & (and), | (or), and ! (not).
u <- TRUE
v <- FALSE
u & v
## [1] FALSE
u | v
## [1] TRUE
!u
## [1] FALSE
Character
A character object is used to represent string values in R. Two character values can be concatenated with the paste function.
address <- 'example'
domain <- 'gmail.com'
paste(address, domain, sep = '@')
## [1] "example@gmail.com"
However, it is often more convenient to create a readable string with the sprintf function, which uses C-style format specifiers.
sprintf("%s has %d dollars", "Sam", 100)
## [1] "Sam has 100 dollars"
To replace the first occurrence of the word “little” with the word “big” in a string, use the sub function.
sub("little", "big", "Mary has a little lamb.")
## [1] "Mary has a big lamb."
More functions for string manipulation can be found in the R documentation.
?sub
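Since sub() replaces only the first match, its companion gsub() (documented on the same help page) is the usual choice when every occurrence should be replaced. A short sketch with an illustrative input string:

```r
s <- "a little lamb and a little goat"

first_only  <- sub("little", "big", s)    # replaces the first match only
all_matches <- gsub("little", "big", s)   # replaces every match

first_only    # "a big lamb and a little goat"
all_matches   # "a big lamb and a big goat"
```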
Basic Data Structures
Vector
The basic data structure in R is the vector. Vectors are usually created with the c() function, short for combine:
c(1,2,3)
## [1] 1 2 3
A vector can contain only elements of the same data type. If the inputs differ in type, R coerces them to a common type.
c(FALSE,1,"2")
## [1] "FALSE" "1" "2"
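The conversion follows a fixed order: logical values are promoted to integer, integer to numeric, and numeric to character. The sketch below illustrates this with base R only; the variable name nums is illustrative.

```r
class(c(TRUE, 2L))     # "integer":   logical is promoted to integer
class(c(2L, 3.5))      # "numeric":   integer is promoted to double
class(c(3.5, "a"))     # "character": everything becomes a string

# Explicit conversion back; strings that cannot be parsed become NA
# (R emits a warning, suppressed here to keep the output clean)
nums <- suppressWarnings(as.numeric(c("1", "2.5", "abc")))
nums                   # 1.0 2.5 NA
```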
Named Vector
# Declaring a named vector
c('first' = 1, 'second' = 2, 'third' = 3)
##  first second  third
##      1      2      3
# Generating a named vector
x <- c(1,2,3) # vector
n <- c('first','second','third') # vector of names
names(x) <- n # assign names
x
##  first second  third
##      1      2      3
Matrix
A matrix is a collection of elements of the same data type arranged in a two-dimensional rectangular layout. Matrices are usually created with the matrix() function:
matrix(data = c(1,2,3,4,5,6), # the data elements
       ncol = 3, # number of columns
       nrow = 2, # number of rows
       byrow = TRUE) # fill matrix by rows
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
Named Matrix
# Declaring a named matrix
matrix(data = c(1,2,3,4,5,6), # the data elements
       ncol = 3, # number of columns
       nrow = 2, # number of rows
       byrow = TRUE, # fill matrix by rows
       dimnames = list( # list containing names
         c('r1','r2'), # rownames
         c('c1','c2','c3') # colnames
       ))
##    c1 c2 c3
## r1  1  2  3
## r2  4  5  6
# Generating a named matrix
M <- matrix(data = c(1,2,3,4,5,6), # the data elements
            ncol = 3, # number of columns
            nrow = 2, # number of rows
            byrow = TRUE) # fill matrix by rows
rn <- c('r1','r2') # vector of rownames
cn <- c('c1','c2','c3') # vector of colnames
rownames(M) <- rn # assign rownames
colnames(M) <- cn # assign colnames
M
##    c1 c2 c3
## r1  1  2  3
## r2  4  5  6
Data Frame
A data frame is used for storing data tables. It is similar to a matrix, but a data.frame can hold columns of different data types, while a matrix can store only a single data type. Data frames are usually created with the data.frame() function. Beware that in R versions before 4.0.0, data.frame() by default turns strings into factors (a factor is a vector that can contain only predefined values, and is used to store categorical data). Since R 4.0.0 the default is stringsAsFactors = FALSE; passing it explicitly suppresses the conversion in every version:
v1 <- c(10,20,30) # numeric vector
v2 <- c('a','b','c') # character vector
v3 <- c(TRUE,TRUE,FALSE) # logical vector
data.frame(v1, v2, v3, stringsAsFactors = FALSE) # data.frame
##   v1 v2    v3
## 1 10  a  TRUE
## 2 20  b  TRUE
## 3 30  c FALSE
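The effect of the stringsAsFactors argument can be checked by inspecting the class of each column. This is a small sketch; the names df_chr and df_fct are illustrative.

```r
v1 <- c(10, 20, 30)       # numeric vector
v2 <- c('a', 'b', 'c')    # character vector

df_chr <- data.frame(v1, v2, stringsAsFactors = FALSE)  # keep strings as-is
df_fct <- data.frame(v1, v2, stringsAsFactors = TRUE)   # convert strings to factors

sapply(df_chr, class)     # v1 "numeric", v2 "character"
sapply(df_fct, class)     # v1 "numeric", v2 "factor"
levels(df_fct$v2)         # the predefined values of the factor: "a" "b" "c"
```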
Named Data Frame
# Declaring a named data.frame
v1 <- c(10,20,30) # numeric vector
v2 <- c('a','b','c') # character vector
v3 <- c(TRUE,TRUE,FALSE) # logical vector
data.frame('c1' = v1, # column named 'c1'
           'c2' = v2, # column named 'c2'
           'c3' = v3, # column named 'c3'
           row.names = c('r1', 'r2', 'r3'), # vector of rownames
           stringsAsFactors = FALSE) # suppress character conversion
##    c1 c2    c3
## r1 10  a  TRUE
## r2 20  b  TRUE
## r3 30  c FALSE
# Generating a named data.frame
v1 <- c(10,20,30) # numeric vector
v2 <- c('a','b','c') # character vector
v3 <- c(TRUE,TRUE,FALSE) # logical vector
rn <- c('r1','r2','r3') # vector of rownames
cn <- c('c1','c2','c3') # vector of colnames
df <- data.frame(v1, v2, v3, stringsAsFactors = FALSE) # data.frame
rownames(df) <- rn # assign rownames
colnames(df) <- cn # assign colnames
df
##    c1 c2    c3
## r1 10  a  TRUE
## r2 20  b  TRUE
## r3 30  c FALSE
List
A list is a generic structure which can be thought of as an ordered set of objects. Lists are usually created with the list() function:
list(matrix(100), data.frame(1,2,3), c('a','b','c','d'))
## [[1]]
##      [,1]
## [1,]  100
##
## [[2]]
##   X1 X2 X3
## 1  1  2  3
##
## [[3]]
## [1] "a" "b" "c" "d"
Named List
# Declaring a named list
list('matrix' = matrix(100), # matrix
     'data.frame' = data.frame(1,2,3), # data.frame
     'vector' = c('a','b','c','d')) # vector
## $matrix
##      [,1]
## [1,]  100
##
## $data.frame
##   X1 X2 X3
## 1  1  2  3
##
## $vector
## [1] "a" "b" "c" "d"
# Generating a named list
M <- matrix(100) # matrix
df <- data.frame(1,2,3) # data.frame
v <- c('a','b','c','d') # vector
n <- c('matrix','data.frame','vector') # vector of names
l <- list(M, df, v) # list
names(l) <- n # assign names
l
## $matrix
##      [,1]
## [1,]  100
##
## $data.frame
##   X1 X2 X3
## 1  1  2  3
##
## $vector
## [1] "a" "b" "c" "d"
Environment
Generally, an environment is similar to a list, with four important exceptions:
- Every name in an environment is unique.
- The names in an environment are not ordered (i.e., it doesn’t make sense to ask what the first element of an environment is).
- An environment has a parent (nested structure).
- Environments have reference semantics.
To create an environment manually, use new.env().
x <- new.env() # create a new environment
x
## <environment: 0x000000001cace6c0>
Basic Operations
Subsetting
Vector
Values in a vector are retrieved by using the single square bracket [] operator.
s = c("aaa"="a", "bbb"="b", "ccc"="c", "ddd"="d", "eee"="e")
s # print the full vector
## aaa bbb ccc ddd eee
## "a" "b" "c" "d" "e"
# retrieve the 3rd element
s[3]
## ccc
## "c"
# drop the 3rd element
s[-3]
## aaa bbb ddd eee
## "a" "b" "d" "e"
# out-of-range returns NA
s[10]
## <NA>
##   NA
# retrieve the 2nd, 3rd, 5th and 5th element
i <- c(2,3,5,5)
s[i]
## bbb ccc eee eee
## "b" "c" "e" "e"
# drop the 1st and 3rd element
i <- c(1,3)
s[-i]
## bbb ddd eee
## "b" "d" "e"
# retrieve the elements named 'ddd' and 'bbb'
i <- c('ddd','bbb')
s[i]
## ddd bbb
## "d" "b"
# retrieve the 3rd element using a logical vector
i <- c(FALSE,FALSE,TRUE,FALSE,FALSE)
s[i]
## ccc
## "c"
# the logical vector will be recycled if it is shorter than the vector to subset
i <- c(FALSE,TRUE) # -> c(FALSE,TRUE,FALSE,TRUE,FALSE)
s[i]
## bbb ddd
## "b" "d"
# select elements greater than 'b'
i <- s > 'b'
s[i]
## ccc ddd eee
## "c" "d" "e"
Matrix
Values in a matrix are retrieved by using the single square bracket [] operator.
M <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
rownames(M) <- c('r1','r2','r3')
colnames(M) <- c('c1','c2','c3','c4')
M # print the full matrix
##    c1 c2 c3 c4
## r1  1  2  3  4
## r2  5  6  7  8
## r3  9 10 11 12
# retrieve the element in 2nd row, 3rd column
M[2,3]
## [1] 7
# retrieve the 1st row
M[1,]
## c1 c2 c3 c4
##  1  2  3  4
# retrieve the 1st column
M[,1]
## r1 r2 r3
##  1  5  9
# retrieve the 2nd and 3rd row
i <- c(2,3)
M[i,]
##    c1 c2 c3 c4
## r2  5  6  7  8
## r3  9 10 11 12
# drop the 1st and 3rd column
i <- c(1,3)
M[,-i]
##    c2 c4
## r1  2  4
## r2  6  8
## r3 10 12
# retrieve the elements in 1st and 3rd row, 2nd and 4th column
M[c(1,3),c(2,4)]
##    c2 c4
## r1  2  4
## r3 10 12
# retrieve the rows named 'r1' and 'r3'
i <- c('r1','r3')
M[i,]
##    c1 c2 c3 c4
## r1  1  2  3  4
## r3  9 10 11 12
# retrieve the columns named 'c2' and 'c4'
i <- c('c2','c4')
M[,i]
##    c2 c4
## r1  2  4
## r2  6  8
## r3 10 12
# retrieve the 3rd row of the columns named 'c2' and 'c4'
i <- c('c2','c4')
M[3,i]
## c2 c4
## 10 12
# retrieve the 1st row using a logical vector
i <- c(TRUE,FALSE,FALSE)
M[i,]
## c1 c2 c3 c4
##  1  2  3  4
# the logical vector will be recycled if it is shorter than the number of rows/columns to subset
i <- c(TRUE,FALSE) # -> c(TRUE,FALSE,TRUE)
M[i,]
##    c1 c2 c3 c4
## r1  1  2  3  4
## r3  9 10 11 12
# select the values of column 'c4' in the rows where 'c3' is less than twice 'c1'
i <- M[,'c3'] < 2*M[,'c1']
M[i,'c4']
## r2 r3
##  8 12
Data Frame
Elements of a data.frame are retrieved with the single square bracket [] operator, as with a matrix. In addition, the $ or [[]] operators can be used to retrieve columns.
df <- data.frame('age' = c(48,18,51), 'sex' = c('M','F','M'))
df # print full data.frame
##   age sex
## 1  48   M
## 2  18   F
## 3  51   M
# retrieve the "age" column
df$age # equivalent to df[["age"]] or df[,"age"]
## [1] 48 18 51
# retrieve the age of males ("M")
i <- df$sex == "M" # equivalent to df[["sex"]]=="M" or df[,"sex"]=="M"
df$age[i] # equivalent to df[["age"]][i] or df[i,"age"]
## [1] 48 51
List
A list is subsetted using the single square bracket [] operator.
l <- list(
  'data' = data.frame('age' = c(48,18,51), 'sex' = c('M','F','M')),
  'letters' = c('a','b','c'),
  'extra' = c(1:5)
)
l # print full list
## $data
##   age sex
## 1  48   M
## 2  18   F
## 3  51   M
##
## $letters
## [1] "a" "b" "c"
##
## $extra
## [1] 1 2 3 4 5
# select the 1st and 3rd elements
i <- c(1,3)
l[i]
## $data
##   age sex
## 1  48   M
## 2  18   F
## 3  51   M
##
## $extra
## [1] 1 2 3 4 5
# select the elements named "extra" and "letters"
i <- c("extra","letters")
l[i]
## $extra
## [1] 1 2 3 4 5
##
## $letters
## [1] "a" "b" "c"
# drop the "extra" element
l["extra"] <- NULL
l
## $data
##   age sex
## 1  48   M
## 2  18   F
## 3  51   M
##
## $letters
## [1] "a" "b" "c"
Objects in a list are retrieved by using the operator [[]] or $.
# extract the 2nd object
l[[2]]
## [1] "a" "b" "c"
# extract the "data" object
l$data # equivalent to l[["data"]]
##   age sex
## 1  48   M
## 2  18   F
## 3  51   M
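The practical difference between [] and [[]] on a list is the type of the result: the former returns a (shorter) list, the latter returns the object itself. A minimal sketch with an illustrative list:

```r
l <- list(letters = c('a', 'b', 'c'), extra = 1:5)

sub_list <- l["letters"]     # single brackets: a list of length 1
element  <- l[["letters"]]   # double brackets: the character vector itself

class(sub_list)    # "list"
class(element)     # "character"
length(sub_list)   # 1
length(element)    # 3
```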
Environment
An environment is not subsettable, i.e. the [] operator cannot be used. Objects in an environment are retrieved by using the operator [[]], $ or the function get().
x <- new.env() # create a new environment
x$a <- 1 # create a new object in the environment
x$a
## [1] 1
x[["a"]]
## [1] 1
get("a", envir = x)
## [1] 1
Remember that an environment is similar to a list, but has reference semantics.
x <- list() # using a list
x$a <- 1 # assign 1 to the element "a" in x
y <- x # COPY x to y
x$a <- 2 # assign 2 to the element "a" in x
y$a # what happens to the element "a" in y?
## [1] 1
x <- new.env() # using an environment
x$a <- 1 # assign 1 to the element "a" in x
y <- x # REFERENCE x to y
x$a <- 2 # assign 2 to the element "a" in x
y$a # what happens to the element "a" in y?
## [1] 2
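Reference semantics also matter when objects are passed to functions. In the sketch below, the hypothetical helper increment() modifies its argument; the change is visible to the caller only when the argument is an environment.

```r
# Hypothetical helper: adds 1 to the element "n" of its argument
increment <- function(counter){
  counter$n <- counter$n + 1
  invisible(NULL)
}

e <- new.env()        # environment: passed by reference
e$n <- 0
increment(e)
e$n                   # 1: the caller's environment was modified

l <- list(n = 0)      # list: the function works on a copy
increment(l)
l$n                   # 0: the caller's list is unchanged
```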
Arithmetic
Arithmetic operations on vectors and matrices are performed element by element; data.frames are treated as matrices when they contain one data type only. If two vectors are of unequal length, the shorter one is recycled to match the longer vector. For example, the following vectors u and v have different lengths, and their sum is computed by recycling the values of the shorter vector u.
u <- c(10, 20, 30)
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
M <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), ncol = 3, nrow = 3, byrow = TRUE)
# vector + vector
u + v
## [1] 11 22 33 14 25 36 17 28 39
# vector + 1
u + 1
## [1] 11 21 31
# vector * 2
u * 2
## [1] 20 40 60
# matrix + 1
M + 1
##      [,1] [,2] [,3]
## [1,]    2    3    4
## [2,]    5    6    7
## [3,]    8    9   10
# matrix + vector (the vector is recycled down the columns)
M + u
##      [,1] [,2] [,3]
## [1,]   11   12   13
## [2,]   24   25   26
## [3,]   37   38   39
# matrix + matrix
M + M
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    8   10   12
## [3,]   14   16   18
# matrix * vector (element-wise)
M * u
##      [,1] [,2] [,3]
## [1,]   10   20   30
## [2,]   80  100  120
## [3,]  210  240  270
# matrix product (rows x columns)
M %*% u
##      [,1]
## [1,]  140
## [2,]  320
## [3,]  500
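Because a matrix is stored column by column, M + u adds u[i] to every element of row i. The sketch below shows one way, using the base function sweep(), to add u[j] to every element of column j instead, and what happens when the longer length is not a multiple of the shorter one. The names by_col and partial are illustrative.

```r
u <- c(10, 20, 30)
M <- matrix(1:9, nrow = 3, byrow = TRUE)

# Add u[j] to column j by sweeping over the second margin (columns)
by_col <- sweep(M, MARGIN = 2, STATS = u, FUN = "+")
by_col[1, ]   # 11 22 33

# Recycling still happens when the lengths are not multiples of each other,
# but R emits a warning (suppressed here to keep the output clean)
partial <- suppressWarnings(c(1, 2, 3) + c(10, 20))
partial       # 11 22 13
```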
Control Structures
Control structures determine the order in which expressions are executed in R.
if
The body of an if statement is executed only if its condition evaluates to TRUE.
if(1<2){
  print('executing if')
}
## [1] "executing if"
if-else
The if-else combination is probably the most commonly used control structure in R (or perhaps any language). This structure allows you to test a condition and act on it depending on whether it’s true or false.
if(1>2){
  print('executing if')
} else {
  print('executing else')
}
## [1] "executing else"
You can have a series of tests by following the initial if with any number of else ifs.
if(1>2){
  print('executing if')
} else if(1<2) {
  print('executing else-if')
} else {
  print('executing else')
}
## [1] "executing else-if"
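The condition of an if statement is expected to be a single TRUE or FALSE (recent versions of R raise an error for longer conditions). To test every element of a vector, the vectorized base function ifelse() can be used instead, as in this short sketch with illustrative names:

```r
x <- c(1, 5, 10)

# ifelse(test, yes, no) evaluates the test element by element
size <- ifelse(x > 4, "big", "small")
size   # "small" "big" "big"
```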
for
In R, for loops take an iterator variable and assign it successive values from a sequence or vector. For loops are most commonly used for iterating over the elements of an object (list, vector, etc.).
somevector <- 1:5
for(i in somevector){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
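When looping over the indices of an object rather than its elements, the base function seq_along() is a safer choice than 1:length(x), because 1:0 evaluates to c(1, 0) when the object is empty. A small sketch with illustrative counters:

```r
empty <- character(0)   # a vector with no elements

n_bad <- 0
for (i in 1:length(empty)) n_bad <- n_bad + 1    # 1:0 is c(1, 0): runs twice

n_ok <- 0
for (i in seq_along(empty)) n_ok <- n_ok + 1     # empty sequence: never runs

n_bad   # 2
n_ok    # 0
```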
while
While loops begin by testing a condition. If it is true, then they execute the loop body. Once the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits. While loops can potentially result in infinite loops if not written properly. Use with care!
val <- 1
while(val < 5) {
  val <- val + 1
  print(val)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
repeat
repeat initiates an infinite loop right from the start. These are not commonly used in statistical or data analysis applications but they do have their uses. The only way to exit a repeat loop is to call break.
val <- 5
repeat {
  print(val)
  val <- val+1
  if (val == 10){
    break
  }
}
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
break
The break statement is used inside a loop (repeat, for, while) to stop the iterations and transfer control to the code after the loop. In nested loops, where one loop sits inside another, break exits only the innermost loop that is being evaluated.
x <- 1:4
for (i in x) {
  if (i == 2) {
    break
  }
  print(i)
}
## [1] 1
next
next skips the rest of the current iteration and jumps to the next cycle of the loop, starting with the evaluation of the loop condition. It lets you skip an iteration without terminating the loop.
x <- 1:4
for (i in x) {
  if (i == 2) {
    next
  }
  print(i)
}
## [1] 1
## [1] 3
## [1] 4
Loop Functions
https://bookdown.org/rdpeng/rprogdatascience/loop-functions.html
R has some functions which implement looping in a compact form to make your life easier.
- lapply(): Loop over a list and evaluate a function on each element
- sapply(): Same as lapply but try to simplify the result
- apply(): Apply a function over the margins of an array
lapply
The lapply() function does the following simple series of operations:
- it loops over a list, iterating over each element in that list
- it applies a function to each element of the list (a function that you specify)
- it returns a list
Here’s an example of applying the mean() function to all elements of a list. If the original list has names, the names will be preserved in the output.
x <- list(a = 1:10, b = 1:100)
lapply(x, FUN = mean)
## $a
## [1] 5.5
##
## $b
## [1] 50.5
You can use lapply() to evaluate a function multiple times, each time with a different argument. Below is an example that calls the runif() function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.
lapply(1:4, runif)
## [[1]]
## [1] 0.2875775
##
## [[2]]
## [1] 0.7883051 0.4089769
##
## [[3]]
## [1] 0.8830174 0.9404673 0.0455565
##
## [[4]]
## [1] 0.5281055 0.8924190 0.5514350 0.4566147
When you pass a function to lapply(), lapply() takes elements of the list and passes them as the first argument of the function you are applying. In the above example, the first argument of runif() is n, and so the elements of the sequence 1:4 all got passed to the n argument of runif().
Functions that you pass to lapply() may have other arguments. For example, the runif() function has a min and max argument too. Here is where the ... argument to lapply() comes into play. Any arguments that you place in the ... argument will get passed down to the function being applied to the elements of the list.
Here, the min = 0 and max = 10 arguments are passed down to runif() every time it gets called.
lapply(1:4, runif, min = 0, max = 10)
## [[1]]
## [1] 9.568333
##
## [[2]]
## [1] 4.533342 6.775706
##
## [[3]]
## [1] 5.726334 1.029247 8.998250
##
## [[4]]
## [1] 2.4608773 0.4205953 3.2792072 9.5450365
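The function passed to lapply() does not need a name; it can be defined inline as an anonymous function. The sketch below also shows an extra argument (here the illustrative parameter p) being forwarded through the ... mechanism described above.

```r
# Anonymous function applied to each element
squares <- lapply(1:3, function(i) i^2)
unlist(squares)   # 1 4 9

# Extra named arguments after the function are forwarded to it
powers <- lapply(1:3, function(i, p) i^p, p = 3)
unlist(powers)    # 1 8 27
```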
sapply
The sapply() function behaves similarly to lapply(); the only real difference is in the return value. sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm:
- if the result is a list where every element is length 1, then a vector is returned
- if the result is a list where every element is a vector of the same length (> 1), a matrix is returned
- if it can’t figure things out, a list is returned
x <- list(a = 1:10, b = 1:100)
sapply(x, FUN = mean)
##    a    b
##  5.5 50.5
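The matrix case of the simplification rules can be seen when the applied function returns a vector of fixed length. The sketch below also shows the base function vapply(), which works like sapply() but requires the expected type and length of each result to be declared, so the shape of the output never depends on the input. The names m and means are illustrative.

```r
# Each call returns a vector of length 2, so sapply() builds a 2 x 3 matrix
m <- sapply(1:3, function(i) c(i, i^2))
dim(m)    # 2 3
m[2, ]    # 1 4 9

# vapply() checks that every result is a single numeric value
x <- list(a = 1:10, b = 1:100)
means <- vapply(x, mean, FUN.VALUE = numeric(1))
means     # a = 5.5, b = 50.5
```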
apply
The apply() function is used to evaluate a function over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix or data.frame.
Here we create a 20 by 10 matrix of Normal random numbers.
x <- matrix(rnorm(200), 20, 10)
Compute the mean of each column: MARGIN = 2.
apply(x, MARGIN = 2, FUN = mean)
## [1] -0.10796846 0.15666451 0.17238484 -0.02491125 0.02062309 -0.21667563
## [7] -0.22015146 0.20890536 0.02170259 0.02202638
Compute the mean of each row: MARGIN = 1.
apply(x, MARGIN = 1, FUN = mean)
## [1] 0.406234477 -0.015545859 0.131356598 0.076696478 -0.299281746
## [6] 0.647064814 0.125209819 -0.234601007 0.232658077 -0.628908133
## [11] 0.092009303 0.003069751 -0.717961239 0.186628459 -0.248612490
## [16] -0.377757063 0.092656464 -0.172082143 0.391801315 0.374564059
User-Defined Functions
Abstracting code into many small functions is key for writing nice R code. Functions are defined by code with a specific format:
functionName <- function(arg1, arg2, arg3=NULL, ...) {
  # code here...
  return(...)
}
where
- functionName: the name of the function (case sensitive)
- arg1, arg2, arg3, ...: input values
- arg3=NULL: default value. If arg3 is not provided when calling the function, NULL will be used instead
- return(): the output value
Define a function to compute the sum of the first n integer numbers.
sumInt <- function(n){
  s <- sum(1:n)
  return(s)
}
Compute the sum of the first 10 integers
sumInt(10)
## [1] 55
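One caveat of sumInt() is the case n = 0: the expression 1:0 evaluates to c(1, 0), so the function returns 1 rather than 0. The variant below is a sketch (the name sumIntSafe is hypothetical) that uses the base function seq_len(), which yields an empty sequence for n = 0.

```r
# Hypothetical variant of sumInt() that also handles n = 0
sumIntSafe <- function(n){
  s <- sum(seq_len(n))   # seq_len(0) is empty, and sum() of nothing is 0
  return(s)
}

sumIntSafe(10)   # 55
sumIntSafe(0)    # 0
sum(1:0)         # 1, which is what the original definition returns for n = 0
```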
Define a function to compute the p-norm of a vector x. By default, compute the Euclidean norm (p = 2).
# note: this definition masks base::norm() in the current session
norm <- function(x, p = 2){
  # the infinity norm is the largest absolute value
  if (is.infinite(p)) return(max(abs(x)))
  d <- sum(abs(x)^p)^(1/p)
  return(d)
}
Compute the Euclidean norm of the vector c(1,1)
norm(x = c(1,1)) # equivalent to norm(x = c(1,1), p = 2)
## [1] 1.414214
Compute the 3-norm of the vector c(1,1)
norm(x = c(1,1), p = 3)
## [1] 1.259921
Compute the \(\infty\)-norm of the vector c(1,1)
norm(x = c(1,1), p = Inf)
## [1] 1
Scope of functions
When an R function is called, it first creates a temporary local environment. This local environment is nested within the environment where the function was defined (here, the global environment), which means that, from the local environment, you can also access any object from the global environment (though relying on this is not considered good practice). As soon as the function ends, the local environment is destroyed along with all the objects in it.
# define function
test1 <- function(){
  teststring <- 'This object is destroyed as soon as the function ends!'
  return(invisible())
}
# run function
test1()
# try to access teststring
teststring
## Error in eval(expr, envir, enclos): object 'teststring' not found
When R encounters an object name, it first searches the local environment. If it finds the object there, it uses that one; otherwise it searches the enclosing environment, here the global environment.
# global i
i <- 1
# define function
test2 <- function(){
  # there is no i in the local environment -> search in parent environment
  i <- i*10
  # return
  return(i)
}
# run function
test2()
## [1] 10
# the global variable has not changed
i
## [1] 1
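The same lookup rule is what makes closures work: a function created inside another function keeps access to the environment in which it was defined. The sketch below uses a hypothetical make_counter() helper and the <<- operator, which assigns to an existing variable in an enclosing environment instead of creating a local one.

```r
# Hypothetical helper: returns a function that remembers how often it was called
make_counter <- function(){
  count <- 0                 # lives in the environment of this call
  function(){
    count <<- count + 1      # updates 'count' in the enclosing environment
    count
  }
}

counter <- make_counter()
first  <- counter()
second <- counter()
first    # 1
second   # 2
```

Here the local environment of make_counter() is not destroyed when the call ends, because the returned function still refers to it.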
Comments
All text after the # sign on the same line is considered a comment.