
R Training – The Basics

[This article was first published on R – SLOW DATA, and kindly contributed to R-bloggers.]

This is the first module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

Tips

# this is a comment
x <- 1
y <- 2 # this is also a comment

 

R and your machine

Working directory

getwd()

Change working directory

dir.create("C:/Users/pc/Desktop/RFundamentalsWeek1") # fit this path to your machine (e.g. "C:/Users/YOUR-USER-NAME/Desktop/RFundamentalsWeek1")

setwd("C:/Users/pc/Desktop/RFundamentalsWeek1")

dir.create("sub")

…this creates a sub-folder in your working directory

Check content folder

dir() # can you see "sub"?

dir("C:/Users")

dir("./sub") # "." stands for your working directory
dir("..") # ".." moves you one level up

R workspace

ls() # character(0) indicates empty

x <- 1

Remove objects from workspace

y <- 99; msg <- "Hello"; msg2 <- "Hi"

rm("x")

rm(list=ls()) # removes everything; in R it is very common to nest functions

?rm

 

R basic objects and operators

Objects’ classes in R

"Hola" # character, any string within quotes
3.14 # numeric, any real number
4L # integer, any integer number
TRUE # logical, TRUE or FALSE reserved words

class("Hello")
class(3.14)
class(4L)
class(4) # without suffix "L" all numbers are numeric by default
class(TRUE)

Arithmetic operators

3 + 4
3 - 4
3 * 4
3 / 4
abs(3 - 4)
3^4 # or 3**4
sqrt(4)

3 == 4 # equality
"a" == "a"
3 > 4 # greater than
3 <= 4 # less than or equal to
3 != 4 # not equal to
"hello" == "Hello"

4 >= 3 & 3==3
4 < 3 | 3==3

Atomic vectors

length("Hello")
length(2)
length(TRUE)

More complex data structures

 

Create vectors with ‘combine’ function

c("Hola", "Ciao", "Hello", "Bonjour") # character vector
c(0.99, 2.4, 1.4, 5.9) # numeric vector
c(1L, 2L, 3L, 4L) # integer vector
c(TRUE, TRUE, FALSE, TRUE) # logical vector

class(c("Hola", "Ciao", "Hello", "Bonjour"))
class(c(0.99, 2.4, 1.4, 5.9))
class(c(1L, 2L, 3L, 4L))
class(c(TRUE, TRUE, FALSE, TRUE))

Other ways to create vectors

seq(from = 1, to = 4, by = 1)
seq(from=1, to=4) # by=1 is default
seq(1, 4) # arguments in R can be matched by position
1:4 # common operations in R have shortcuts

rep(x = "a", times = 4) # replicate "a" four times
rep("a", 4) # same as above
rep(c("a", "b"), times = 2) # same but for a vector
rep(c("a", "b"), each = 2) # element-by-element

Subsetting vectors

x <- 1:10
x >= 5
idx <- (x > 5)
x[idx] # all values of x greater than 5
x[x < 7] # calculate index directly within brackets

x[1] # 1st element
x[c(1,5)] # 1st and 5th element

x[-1] # all but the 1st
x[-c(1,10)] # all but the 1st and the 10th

Arithmetic and logical operators are vectorized

c(1, 2, 3, 4) + c(5, 6, 7, 8)
c(1, 2, 3, 4) / c(5, 6, 7, 8)
sqrt(c(1, 2, 3, 4))
c(1, 2, 3, 4) == c(5, 6, 7, 8)
c(1, 2, 3, 4) != c(5, 6, 7, 8)

Vectorization + Recycling

c(1, 2, 3) + c(5, 6, 7) # simple element-by-element

c(1, 2) + c(5, 6, 7, 8) # shortest vector "recycled"
c(1, 2, 1, 2) + c(5, 6, 7, 8)

c(1, 2) + c(5, 6, 7) # recycling + warning
r <- c(1, 2, 1) + c(5, 6, 7)

Useful functions for numerical objects

mynum <- c(3.14, 6, 8.99, 10.21, 10, 56.9, 32.1, 2.3)
sum(mynum)
mean(mynum)
sd(mynum) # standard deviation
median(mynum)

Install a package and use its functions

install.packages("e1071") # install the package
library(e1071) # load the package

skewness(mynum)
kurtosis(mynum)

Useful functions for logical objects

mylogic <- c(FALSE, TRUE, FALSE, rep(TRUE, 3)) # prefer TRUE/FALSE: T and F are ordinary variables that can be overwritten
sum(mylogic)

which(mylogic)

any(mylogic) # is at least one of the values TRUE?
all(mylogic) # are all of the values TRUE?

Useful functions for character objects

mychar <- c("201510", "201511", "201512", "201601")

substr(x = mychar, start = 1, stop = 4) # the ubiquitous substring...
nchar("Hello") # number of characters in a string

paste("I", "m", sep = "'")
paste("N.", 1, sep="") # 1 is coerced to "1"

gsub(pattern = "20", replacement = "", x = mychar)

Implicit coercion

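Coercion happens when an object is forced to belong to a different class. Implicit coercion is applied automatically by R when values of different classes are combined, for example numeric vs character: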

c(1.7, "a")
class(c(1.7, "a"))

c(FALSE, 2)
class(c(TRUE, 2))

c("a", TRUE)
class(c("a", TRUE))

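What holds here is a principle of least common denominator: all elements are converted to the most flexible class present, in the order logical < integer < numeric < character.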
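
The ordering can be checked one step at a time. The sketch below is only illustrative; the variable names (mixed_lgl_int and so on) are made up for this example and are not part of the original material.

```r
# Combining two classes: the result takes the more flexible of the two
mixed_lgl_int <- c(TRUE, 2L)   # logical + integer   -> integer
mixed_int_num <- c(2L, 3.5)    # integer + numeric   -> numeric
mixed_num_chr <- c(3.5, "a")   # numeric + character -> character

class(mixed_lgl_int)
class(mixed_int_num)
class(mixed_num_chr)

# Logical values become 1 (TRUE) and 0 (FALSE) when coerced to numbers,
# which is why sum() can count TRUE values
sum(c(TRUE, FALSE, TRUE)) # 2
```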

Explicit coercion

x <- c(0, 1, 2, 3, 4, 5, 6)
class(x)

as.character(x)
as.logical(x) # 0 = FALSE, any non-zero number = TRUE

as.numeric(c("a", "b", "c")) # NA NA NA, with a warning
as.logical(c("a", "b", "c")) # NA NA NA

 

Special values in R

Missing values

NA <- 1 # This will trigger an error!

year <- c(2012, 2013, 2014)
gwp <- c(NA, 98.7, 32.5)

class(gwp)
is.na(gwp) # indicates which elements are missing
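
Missing values propagate through most calculations, so it may help to see how they are usually handled. This is a small sketch; gwp_example is a hypothetical copy of the gwp values used above.

```r
gwp_example <- c(NA, 98.7, 32.5)   # same values as gwp above

mean(gwp_example)                  # NA: a missing value makes the result missing
mean(gwp_example, na.rm = TRUE)    # 65.6: NA is dropped before averaging

gwp_example[!is.na(gwp_example)]   # keep only the non-missing elements
```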

Other special values

help(reserved)

x <- NULL # useful to initialize objects to be filled later

1/0 # Inf (infinity)
-1/0 # -Inf (minus infinity)
0/0 # NaN (not a number, i.e. undefined)
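
Each special value has its own test function, and NULL can serve as an empty starting point for an object that grows later. A minimal sketch follows; the object name res is an assumption made for this example, and growing a vector inside a loop is shown only for illustration.

```r
is.infinite(1/0) # TRUE
is.nan(0/0)      # TRUE
is.na(0/0)       # TRUE: NaN also counts as missing
is.null(NULL)    # TRUE

# Start from NULL and fill the object step by step
res <- NULL
for (i in 1:3) {
  res <- c(res, i^2)
}
res # 1 4 9
```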

 

Matrices

Matrix underlying structure

x <- 1:6 # take a vector
dim(x) # NULL: vectors do not have a dimension attribute
dim(x) <- c(2, 3) # impose a 2x3 dimension (2 rows, 3 columns)
class(x) # here it is a matrix!
x

  • This tricky way to create a matrix is not so common, but it is useful to understand the underlying structure of objects in R…
  • …and so be able to better manipulate them for future needs
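
To make the underlying structure visible, the sketch below inspects the attributes before and after setting dim. The names v and m_byrow are hypothetical and used only here.

```r
v <- 1:6
attributes(v)     # NULL: a plain vector carries no attributes

dim(v) <- c(2, 3)
attributes(v)     # now holds a dim attribute equal to 2 3

# Values are laid out column by column
v[1, 2]           # 3
v[3]              # 3: the same element, still reachable by vector position

# matrix() can fill row by row instead, if asked
m_byrow <- matrix(1:6, nrow = 2, byrow = TRUE)
m_byrow[1, 2]     # 2
```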

More common ways to create matrices

  • with function matrix()

m <- matrix(data = 1:6, nrow = 2, ncol = 3)
class(m)
dim(m)

  • by binding rows or columns with functions rbind() or cbind()

x <- 1:3
y <- 10:12
m1 <- cbind(x,y)
m2 <- rbind(x,y)
class(m1)
class(m2)

Subsetting matrices

  • Matrices can be subset using (i,j)-style index

m[1,2] # one single element
m[1,] # one full row
m[,3] # one full column
m[,-1] # all columns but the first

  • Can you think about another way to obtain the last result?
  • Tip: use an integer vector with function c()
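
One possible answer, sketched on a hypothetical matrix m_ex built the same way as m above:

```r
m_ex <- matrix(data = 1:6, nrow = 2, ncol = 3)

m_ex[, -1]       # drop the first column
m_ex[, c(2, 3)]  # keep the 2nd and 3rd columns explicitly
m_ex[, 2:3]      # same selection using the ":" shortcut

identical(m_ex[, -1], m_ex[, c(2, 3)]) # TRUE
```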

Factors

Nominal factors

  • Factors are used to describe items that can have a finite number of values (i.e. categories)
  • You can see them as positive-integer-sequences with labels

f <- factor( c("f", "m", "m", "f", "f") )
class(f)

  • Factors have a levels attribute listing their unique categories
  • Access levels attribute with levels() function

attributes(f)
levels(f)
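
The "integer sequence with labels" view can be checked directly. In this sketch, sex is a hypothetical factor holding the same values as f above.

```r
sex <- factor(c("f", "m", "m", "f", "f"))

typeof(sex)      # "integer": the values are stored as integer codes
as.integer(sex)  # 1 2 2 1 1
levels(sex)      # "f" "m": the labels attached to codes 1 and 2

# Codes plus labels rebuild the original strings
levels(sex)[as.integer(sex)]
```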

Ordered factors

  • If a factor has a natural order this should be specified

fo <- factor( c("low", "med", "low", "high"), ordered = TRUE)

  • Default order is alphabetical

fo <- factor(fo, levels = c("low", "med", "high"), ordered = TRUE) # re-order; assigning to levels(fo) would only rename the labels

  • It can sometimes be useful to re-order nominal factors too (e.g. to change the default base level taken by a GLM)

f <- factor(f, levels = c("m", "f")) # change alphabetical default without renaming the values
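
A caveat worth checking when re-ordering: assigning a new vector to levels() replaces the labels by position, so the data values themselves change, whereas calling factor() with a levels argument changes only the order. The sketch below uses hypothetical objects (sizes, relabelled, reordered) to show the difference.

```r
sizes <- factor(c("low", "med", "low", "high"), ordered = TRUE)
levels(sizes)             # "high" "low" "med": alphabetical default

# Assigning to levels() renames the labels: the data are altered
relabelled <- sizes
levels(relabelled) <- c("low", "med", "high")
as.character(relabelled)  # "med" "high" "med" "low"

# Rebuilding the factor re-orders the levels and leaves the data intact
reordered <- factor(sizes, levels = c("low", "med", "high"), ordered = TRUE)
as.character(reordered)   # "low" "med" "low" "high"
reordered[1] < reordered[4] # TRUE: "low" now sorts before "high"
```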

  • Obtain frequency count of factor combinations with table()

table(f)

 

Data Frames

Create a data frame from scratch

  • The R structure that most closely mimics a SAS data set (i.e. a ‘cases by variables’ matrix of data)
  • In R terms, it is a collection of vectors and factors all having the same length
  • A data frame generally has names and row.names attributes to label variables and observations respectively
  • You create a data.frame with function data.frame()

df <- data.frame( x = 1:3, y = c("a", "b", "c"),
f = factor( c("m", "f", "m") ) )
class(df)
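
The names and row.names attributes mentioned above can be read and changed with functions of the same name. A brief sketch on a hypothetical data frame df_ex; the row labels "r1" to "r3" are invented for the example.

```r
df_ex <- data.frame(x = 1:3, y = c("a", "b", "c"))

names(df_ex)       # "x" "y": variable labels
row.names(df_ex)   # "1" "2" "3": default observation labels
dim(df_ex)         # 3 rows, 2 columns

# Row names can be replaced and then used for subsetting
row.names(df_ex) <- c("r1", "r2", "r3")
df_ex["r2", "x"]   # 2
```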

  • Although more often you will create a data.frame by reading some data from a file (excel, internet, SAS, etc.)
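
As an illustration of reading from a file, the sketch below writes a few rows to a temporary CSV file and reads them back. The CSV format, the temporary file and the object names (tmp_file, iris_head) are assumptions made for this example; other formats such as Excel or SAS need additional packages.

```r
# Write the first rows of a built-in data set to a temporary CSV file
tmp_file <- tempfile(fileext = ".csv")
write.csv(head(iris), file = tmp_file, row.names = FALSE)

# Read the file back into a data frame
iris_head <- read.csv(tmp_file, stringsAsFactors = TRUE)
class(iris_head)
str(iris_head)

unlink(tmp_file) # delete the temporary file
```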

Read some data

  • R provides some example data we can use to practice
  • We will use the ‘iris’ data.frame which gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris

data("iris") # load the example data in the workspace

  • Have an overview of the data using these functions

str(iris) # returns a compact summary of R objects
summary(iris) # few statistics for each variable
head(iris, n = 20) # visualize first 20 observations
tail(iris) # last 6 observations

Subset data frames

  • [i,j]-index notation is valid also for data.frames

iris[1,1]
iris[1,5]

  • Additionally you can retain one or more variables by name

iris$Sepal.Length # using $ operator
iris[, "Sepal.Length"] # quoting variable's name in j slot
iris[, c("Sepal.Length", "Sepal.Width")]

  • Tip: after you type ‘$’ wait for RStudio auto-completion menu
  • Tip: In general, press ‘Tab’ to bring up RStudio’s auto-completion options

Analyse data frames

  • Use the mean() function to get some overall statistics from this data

mean(iris$Petal.Length)

  • Calculate same statistic only for Setosa iris:

mean( iris[ iris$Species =="setosa", "Petal.Length" ] )

Useful functions to analyse data frames

You can see that the syntax becomes twisted quite rapidly when more complex manipulation is needed (filter rows, select columns, etc.)

  • Use with() and subset() to make your program more readable
  • with() lets you refer to a data frame’s variables directly

with( iris, sum(Petal.Width)/sum(Petal.Length) )

  • subset() returns a dataframe meeting certain conditions

subset( iris, Species=="setosa" )

  • Try out this more compact code:

with( subset(iris, Species=="setosa"),
sum(Petal.Width)/sum(Petal.Length))

Subset data frames to remove missing values

  • To better explore missing values in R let’s create a new dataframe like iris but with some missing values
  • Don’t worry too much about this code for now, we’ll see for loops in the second module

i_na <- sample(x = 1:nrow(iris), size = 0.1*nrow(iris) )
j_na <- sample(1:4, size = 0.1*nrow(iris), replace = TRUE)
iris_na <- iris
for(k in 1:length(i_na)) {
  iris_na[i_na[k], j_na[k]] <- NA
}

  • Check whether there are missing values with the is.na() function

sum( is.na(iris_na) ) # should equal the number of sampled rows (15)

  • Find where the missing values are using the which() function with the arr.ind = TRUE argument

class(is.na(iris_na))
w <- which( x = is.na(iris_na), arr.ind = TRUE )
head(w)
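
The effect of arr.ind is easier to see on a small object. A sketch with a hypothetical 2x3 matrix mat_na:

```r
mat_na <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 2)
mat_na

# Without arr.ind: positions in the underlying vector
which(is.na(mat_na))                        # 2 6

# With arr.ind = TRUE: one (row, col) pair per missing value
pos <- which(is.na(mat_na), arr.ind = TRUE)
pos                                         # row 2 col 1, row 2 col 3
```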

Note: When x has more than one dimension, the arr.ind argument tells R whether array indices (row and column) should be returned.

Now subset the dataframe by eliminating rows where Petal.Width is missing:

iris_clean <- subset(iris_na, !is.na(Petal.Width))

  • A method to eliminate all records including at least one missing value (no matter in which variable) is with function complete.cases()
  • It returns a logical vector indicating which cases (full record) are complete

good <- complete.cases(iris_na) # "good" is a logical vector
iris_clean <- iris_na[good,] # keep only the complete rows of iris_na
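
A small sketch of what complete.cases() returns, using a hypothetical data frame df_na. Base R's na.omit() is shown alongside it because it drops the same rows in one step.

```r
df_na <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

complete.cases(df_na)           # TRUE FALSE FALSE
df_na[complete.cases(df_na), ]  # only the first row survives

na.omit(df_na)                  # same rows, removed in a single call
```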

Add new variables to a data frame

  • Add a variable with random values 1:10 using sample() function

iris$new <- sample(1:10, nrow(iris),
replace = TRUE)

Calculating an index of correlation

  • Pearson correlation can be easily calculated in R with function cor()

cor(iris[,which(lapply(iris, class)=="numeric")])
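
An alternative way to pick the numeric columns, offered as a sketch rather than a replacement, is sapply() with is.numeric(). Note one difference: is.numeric() is also TRUE for integer columns, so a variable such as the random integer one added earlier would be included here. The names num_cols and cm are hypothetical.

```r
num_cols <- sapply(iris, is.numeric)  # one TRUE/FALSE per column
num_cols

cm <- cor(iris[, num_cols])           # Pearson correlation matrix
round(cm, 2)

# Correlation between two single variables
cor(iris$Petal.Length, iris$Petal.Width)
```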

That’s it for this module! If you have gone through all this code you should have learnt the fundamentals of the R language.

When you’re ready, go ahead with the second module: R training – functions and programming blocks.

The post R Training – The Basics appeared first on SLOW DATA.
