R Vocabulary – Part 1
To be a proficient R user, you need to read and understand the material in the book Advanced R by Hadley Wickham. The second chapter in this book is on vocabulary – a list of functions from the base, stats and utils packages which all R users should be familiar with. In a series of posts, we will attempt to learn most of the functions mentioned in the chapter using some examples.
We will skip the function ? and start with str. According to its documentation, str can be used to display the internal structure of an R object. Let us look at a few simple examples first.
x <- c(1, 2, 3)
str(x)
## num [1:3] 1 2 3
x <- c(1L, 2L)
str(x)
## int [1:2] 1 2
x <- c(TRUE, FALSE, TRUE, TRUE)
str(x)
## logi [1:4] TRUE FALSE TRUE TRUE
x <- c("a", "b", "c")
str(x)
## chr [1:3] "a" "b" "c"
x <- c(1 + 2i, 3 + 0i, 1i)
str(x)
## cplx [1:3] 1+2i 3+0i 0+1i
str(charToRaw("radmuzom"))
## raw [1:8] 72 61 64 6d ...
From the above examples, we see that for atomic vectors, str displays the type, the number of elements in the vector and the first few elements. What happens if we apply str to functions?
str(c)
## function (...)
str(ls)
## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
##     pattern, sorted = TRUE)
str(print)
## function (x, ...)
It is interesting to see that the output differs across functions. That is because c is a primitive function, ls is an ordinary R function, while print is an S3 generic function. This can be verified by typing the function name in the console without any parentheses. An explanation of primitive functions and S3 generics is beyond the scope of this post.
Let us now look at lists.
l <- list(x = 1, a = "A")
str(l)
## List of 2
##  $ x: num 1
##  $ a: chr "A"
l2 <- list(m = matrix(1:4, nrow = 2), l = l)
str(l2)
## List of 2
##  $ m: int [1:2, 1:2] 1 2 3 4
##  $ l:List of 2
##   ..$ x: num 1
##   ..$ a: chr "A"
l3 <- list(l = l, l2 = l2, w = rnorm(10))
str(l3)
## List of 3
##  $ l :List of 2
##   ..$ x: num 1
##   ..$ a: chr "A"
##  $ l2:List of 2
##   ..$ m: int [1:2, 1:2] 1 2 3 4
##   ..$ l:List of 2
##   .. ..$ x: num 1
##   .. ..$ a: chr "A"
##  $ w : num [1:10] -0.0122 0.5986 0.9694 -0.7869 -1.3261 ...
From the output, we notice that str displays the name of the list elements, their class and the basic structure similar to the one we saw for vectors. Use the max.level argument to restrict the level of nesting in the output.
str(l3, max.level = 2)
## List of 3
##  $ l :List of 2
##   ..$ x: num 1
##   ..$ a: chr "A"
##  $ l2:List of 2
##   ..$ m: int [1:2, 1:2] 1 2 3 4
##   ..$ l:List of 2
##  $ w : num [1:10] -0.0122 0.5986 0.9694 -0.7869 -1.3261 ...
A common use of str is to compactly look at the structure of a dataset.
str(InsectSprays)
## 'data.frame': 72 obs. of 2 variables:
##  $ count: num 10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
The above example shows that this dataset is a data.frame object comprising 72 observations and 2 variables. The first variable is count, which is a numeric vector while the second variable is spray which is a factor with 6 levels.
From the above examples, it is hopefully clear that if you are unsure of what an R object is, str provides information that helps you understand its structure. For datasets, it also shows the number of rows and columns.
%in% and match are most useful for matching the elements of one vector against another vector.
x <- c(6, 37)
y <- sample(1:100, 1000, replace = TRUE)
x %in% y
## [1] TRUE TRUE
match(x, y)
## [1]  26 119
which(y == 6)
## [1]  26 156 190 233 295 316 360 390 492 618 648 667 968 987
Note that the length of the result returned by %in% is the same as the length of its first argument. match only returns the index of the first occurrence of each value of x in y.
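The relationship between the two functions can be made explicit with a small deterministic example. This is only a sketch: the vectors below are made up for illustration and are not the random sample used above. It shows that match returns NA for a value that is not found, and that %in% agrees with checking whether match found a position.

```r
x <- c(6, 37, 101)
y <- c(10, 37, 6, 37)

match(x, y)    # position of the first match, NA if there is none
## [1]  3  2 NA
x %in% y
## [1]  TRUE  TRUE FALSE

# %in% gives the same answer as "match found a position"
identical(x %in% y, !is.na(match(x, y)))
## [1] TRUE
```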
We won’t spend too much time on =, <- and <<- in this article. However, do remember that these are functions and we can use backticks to call them in the “usual” way for functions. The -> and ->> operators are rarely used.
`<-`(x, 3)
x
## [1] 3
1 -> x
x
## [1] 1
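The <<- operator is not demonstrated above, so here is a minimal sketch of its typical use. The names make_counter and counter are hypothetical and chosen only for this illustration. Unlike <-, which always assigns in the current environment, <<- modifies an existing variable found in an enclosing environment.

```r
make_counter <- function() {
  count <- 0
  function() {
    # <<- finds count in the enclosing function's environment and updates it there
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()
## [1] 1
counter()
## [1] 2
```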
$, [ and [[ are operators which act on vectors, matrices, arrays or lists to extract or replace parts. They are described in great detail in the chapter Subsetting.
head returns the first parts of a variety of different objects, but is most useful for vectors or data frames. tail works similarly but returns the last parts of the object.
y <- sample(1:100, 1000, replace = TRUE)
head(y)
## [1] 87 50 25 46  2 33
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
head(cars, n = 10)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
head(ls)
##
## 1 function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
## 2     pattern, sorted = TRUE)
## 3 {
## 4     if (!missing(name)) {
## 5         pos <- tryCatch(name, error = function(e) e)
## 6         if (inherits(pos, "error")) {
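Since tail is mentioned but not shown, the following short sketch uses a made-up vector v to illustrate it, together with negative values of n, which drop elements from the other end instead of selecting them.

```r
v <- 1:10
tail(v, n = 3)
## [1]  8  9 10
head(v, n = -3)   # everything except the last three elements
## [1] 1 2 3 4 5 6 7
tail(v, n = -3)   # everything except the first three elements
## [1]  4  5  6  7  8  9 10
```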
subset is used to return the parts of a vector, matrix or data frame which meet the conditions provided as an argument to the function. It is most useful for data frames.
subset(cars, speed < 10 & dist > 10)
##   speed dist
## 4     7   22
## 5     8   16
subset(cars, speed < 10 & dist > 10, select = speed)
##   speed
## 4     7
## 5     8
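For comparison, the same selection can be written with the [ operator. This is a sketch meant to show what subset saves you from typing; the names by_subset and by_bracket are illustrative only.

```r
by_subset  <- subset(cars, speed < 10 & dist > 10, select = speed)

# Equivalent bracket indexing; drop = FALSE keeps the result a data frame
by_bracket <- cars[cars$speed < 10 & cars$dist > 10, "speed", drop = FALSE]

all.equal(by_subset, by_bracket)
## [1] TRUE
```

One difference worth keeping in mind: subset drops rows for which the condition evaluates to NA, whereas logical indexing with [ returns a row of NAs for each of them.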
with is used to evaluate an R expression in an environment constructed from data. For interactive use, it usually saves some typing and is nicer to read.
For example, instead of
plot(cars$speed, cars$dist)
one can use
with(cars, plot(speed, dist))
assign is used to assign a value to a name in an environment.
rm(x)
f <- function() {
  assign("x", 1, pos = 1)
}
f()
x
## [1] 1
In the above example, the pos argument is a positive integer which denotes the position in the search list. This causes x to have the value 1 in the global environment.
get is in some sense the opposite of assign.
x <- 3
g <- function() {
  get("x")
}
g()
## [1] 3
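Both assign and get also accept an envir argument, which is often clearer than a search-list position. The sketch below uses a scratch environment e (a name chosen only for this example) so that nothing is written to the global environment.

```r
e <- new.env()
assign("x", 10, envir = e)   # create x inside e
get("x", envir = e)
## [1] 10
exists("x", envir = e, inherits = FALSE)
## [1] TRUE

# assign is handy when the variable name is only known at run time
for (nm in c("a", "b")) assign(nm, toupper(nm), envir = e)
get("a", envir = e)
## [1] "A"
```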
all.equal is used to compare objects and report differences.
x <- c(2, 3)
y <- c(2, 3)
all.equal(x, y)
## [1] TRUE
x <- 1
all.equal(x, y)
## [1] "Numeric: lengths (1, 2) differ"
x <- c(1, 2)
all.equal(x, y)
## [1] "Mean relative difference: 0.6666667"
l1 <- list(x = c(1, 2), y = c("A", "B"))
l2 <- list(x = c(1, 2))
all.equal(l1, l2)
## [1] "Length mismatch: comparison on first 1 components"
l2 <- list(x = c(1, 2), y = c("C", "D"))
all.equal(l1, l2)
## [1] "Component \"y\": 2 string mismatches"
l2 <- list(x = c(1, 2), y = c("A", "B"))
all.equal(l1, l2)
## [1] TRUE
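A point the examples above do not show is that all.equal compares numbers up to a small tolerance, which makes it the usual choice for floating-point values. The sketch below uses illustrative variables a and b. Because all.equal returns a character description rather than FALSE when objects differ, it is normally wrapped in isTRUE inside conditions.

```r
a <- 0.1 + 0.2
b <- 0.3
a == b                      # exact comparison fails due to rounding error
## [1] FALSE
all.equal(a, b)             # equal within the default tolerance
## [1] TRUE
isTRUE(all.equal(1, 1.1))   # safe to use in an if statement
## [1] FALSE
```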
identical is used to safely and reliably test whether two objects are exactly equal. In if or while statements, and in logical expressions which use && or ||, identical ensures that a single logical value is obtained.
2 == c(1, 2)
## [1] FALSE TRUE
identical(2, c(1, 2))
## [1] FALSE
1 == NULL
## logical(0)
identical(1, NULL)
## [1] FALSE
identical(1, 1.0)
## [1] TRUE
identical(1, 1L)
## [1] FALSE
We will not look at the relational operators !=, ==, >, >=, < and <= in detail here. However, it is worth remembering that these operators are vectorized and follow the vector recycling rule: if one of the vectors is shorter than the other, the elements of the shorter vector are recycled.
x <- c(1, 2)
y <- c(1, 2)
x < y
## [1] FALSE FALSE
x <- 1
x < y
## [1] FALSE TRUE
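Recycling is easier to see when the shorter vector has more than one element. The following sketch, with made-up values, shows a length-2 vector being reused three times to match a length-6 vector.

```r
x <- 1:6
y <- c(2, 5)
x < y    # y is recycled to 2 5 2 5 2 5
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE

# If the longer length is not a multiple of the shorter one
# (for example 1:3 < c(2, 5)), R still recycles but issues a warning.
```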
is.na should be used to test whether elements are missing. Note that one should not use the == relational operator.
x <- NA
is.na(x)
## [1] TRUE
x == NA
## [1] NA
Also, note that there are separate constants for missing values of the atomic vector types.
x <- c(NA, NA)
class(x)
## [1] "logical"
x <- c(NA, 1.0)
class(x)
## [1] "numeric"
x <- c(NA_character_, NA_character_)
class(x)
## [1] "character"
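The typed missing-value constants can be inspected directly with typeof. This short sketch lists the constants for the logical, integer, double and character types.

```r
typeof(NA)              # the plain NA is logical
## [1] "logical"
typeof(NA_integer_)
## [1] "integer"
typeof(NA_real_)
## [1] "double"
typeof(NA_character_)
## [1] "character"
```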
complete.cases is used to check which cases have no missing values and is most useful with data frames. For data frames, it returns a logical vector specifying which rows have no missing values across the entire sequence.
d <- data.frame(
  x = c(1, NA, 2),
  y = c("A", "B", NA)
)
complete.cases(d)
## [1] TRUE FALSE FALSE
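In practice, the logical vector returned by complete.cases is mostly used to filter or count rows. The sketch below rebuilds the same small data frame so that it runs on its own.

```r
d <- data.frame(
  x = c(1, NA, 2),
  y = c("A", "B", NA)
)

d[complete.cases(d), ]    # keep only the rows without missing values
##   x y
## 1 1 A
sum(!complete.cases(d))   # number of rows with at least one NA
## [1] 2
```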
is.finite returns a logical vector specifying which elements are finite. Note that is.finite returns FALSE not only for infinite values but also for NaN (“not a number”).
x <- c(1, 3.0, Inf, NaN, 7)
is.finite(x)
## [1] TRUE TRUE FALSE FALSE TRUE
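It can help to see is.finite side by side with its relatives is.infinite, is.nan and is.na. The vector below is made up to include one value of each special kind.

```r
x <- c(1, Inf, -Inf, NaN, NA)
is.finite(x)
## [1]  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)
## [1] FALSE  TRUE  TRUE FALSE FALSE
is.nan(x)
## [1] FALSE FALSE FALSE  TRUE FALSE
is.na(x)     # TRUE for both NaN and NA
## [1] FALSE FALSE FALSE  TRUE  TRUE
```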
The basic math functions are explained via the examples below. While the examples use “scalar” values in most cases, all the operations are vectorized. Examples using the trigonometric functions are not provided.
5 * 3
## [1] 15
`*`(5, 3)
## [1] 15
5.1 * 2L
## [1] 10.2
5 * (2 + 3i)
## [1] 10+15i
(2 + 3i) + 7
## [1] 9+3i
(2 + 3i) - 7
## [1] -5+3i
3 / 5
## [1] 0.6
3L / 5L
## [1] 0.6
(3 + 7i) / 6
## [1] 0.5+1.166667i
2 ^ 3
## [1] 8
2.2 ^ 7.5
## [1] 369.9731
(2 + 3i) ^ 3
## [1] -46+9i
(2 + 3i) ^ (3 + 4i)
## [1] -0.2045529+0.8966233i
7 %% 5 # remainder
## [1] 2
7 %/% 5 # integer division
## [1] 1
abs(5)
## [1] 5
abs(5 + 3i)
## [1] 5.830952
abs(-5)
## [1] 5
sign(2)
## [1] 1
sign(-2)
## [1] -1
sign(0)
## [1] 0
sign(2 + 3i)
## Error in sign(2 + (0+3i)): unimplemented complex function
ceiling(c(3.2, 3.8))
## [1] 4 4
floor(c(3.2, 3.8))
## [1] 3 3
trunc(c(3.2, 3.8))
## [1] 3 3
round(c(3.2, 3.8))
## [1] 3 4
round(c(3.275, 3.811), digits = 2)
## [1] 3.27 3.81
signif(c(3.2, 3.8))
## [1] 3.2 3.8
signif(c(3.275, 3.811), digits = 2)
## [1] 3.3 3.8
round(-2.3)
## [1] -2
round(33, digits = -1) # nearest 10
## [1] 30
round(75, digits = -2) # nearest 100
## [1] 100
exp(1)
## [1] 2.718282
exp(-1)
## [1] 0.3678794
log(3)
## [1] 1.098612
log(-2)
## Warning in log(-2): NaNs produced
## [1] NaN
log(exp(3))
## [1] 3
exp(log(3))
## [1] 3
log10(100)
## [1] 2
log2(1024)
## [1] 10
sqrt(25)
## [1] 5
sqrt(-25)
## Warning in sqrt(-25): NaNs produced
## [1] NaN
sqrt(3 + 4i)
## [1] 2+1i
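To illustrate the point about vectorization, and to show how %% and %/% behave with negative and non-integer values, here is a small sketch with made-up inputs. In R the remainder takes the sign of the divisor, and the two operators satisfy x == (x %/% y) * y + x %% y up to rounding error.

```r
x <- c(7, -7, 7.5)
y <- 5
x %/% y    # integer division rounds towards minus infinity
## [1]  1 -2  1
x %% y     # remainder has the sign of the divisor
## [1] 2.0 3.0 2.5
all.equal(x, (x %/% y) * y + x %% y)
## [1] TRUE
```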
max finds the maximum element of numeric or character vectors. pmax does an element-by-element comparison and returns the largest among the first elements, the largest among the second elements and so on. If the vectors are not of equal length, the elements of the shorter vectors are recycled.
max(c(1, 2.3), c(2.7, 1.5), c(4, 2.2))
## [1] 4
pmax(c(1, 2.3), c(2.7, 1.5), c(4, 2.2))
## [1] 4.0 2.3
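A common use of the recycling behaviour of pmax is to clamp values from below. The following sketch uses made-up data and also shows the na.rm argument, which controls how missing values are treated.

```r
x <- c(1, 5, 3, 8)
pmax(x, 4)                           # 4 is recycled against every element
## [1] 4 5 4 8
pmax(c(1, NA, 3), 2)                 # NA propagates by default
## [1]  2 NA  3
pmax(c(1, NA, 3), 2, na.rm = TRUE)   # ignore NA where another value exists
## [1] 2 2 3
```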
min and pmin work in the same way as above. prod and sum calculate the product and sum of the values present in their arguments. diff is used to calculate lagged differences between successive values (the default lag is 1).
prod(rnorm(10) + 1)
## [1] -0.009277813
sum(rnorm(10) + 1)
## [1] 3.521682
prod(c(1i, 1 + 2i))
## [1] -2+1i
diff(1:10)
## [1] 1 1 1 1 1 1 1 1 1
diff(1:10, lag = 3)
## [1] 3 3 3 3 3 3 3
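Because 1:10 has constant differences, the effect of the arguments to diff is easier to see on a non-linear sequence. This sketch uses the first five square numbers (stored in sq, an illustrative name) and also shows the differences argument, which applies diff repeatedly, and the na.rm argument of sum.

```r
sq <- c(1, 4, 9, 16, 25)
diff(sq)
## [1] 3 5 7 9
diff(sq, lag = 2)           # sq[i + 2] - sq[i]
## [1]  8 12 16
diff(sq, differences = 2)   # same as diff(diff(sq))
## [1] 2 2 2
sum(c(1, NA, 3), na.rm = TRUE)
## [1] 4
```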
The cumulative versions of max, min, prod and sum return the cumulative results as a vector. For the nth element, it will apply the function on the nth element and the result of the cumulative function till the (n-1)th element.
x <- 1:10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cumprod(x)
## [1] 1 2 6 24 120 720 5040 40320
## [9] 362880 3628800
cummax(x)
## [1] 1 2 3 4 5 6 7 8 9 10
cummin(x)
## [1] 1 1 1 1 1 1 1 1 1 1
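The description above, where each element combines the current value with the running result, is the pattern that Reduce implements when called with accumulate = TRUE. The sketch below uses a non-monotonic made-up vector so that cummax is less trivial than with 1:10; it is meant to illustrate the idea, not to suggest that the built-in functions are implemented this way.

```r
x <- c(3, 1, 4, 1, 5)
cummax(x)
## [1] 3 3 4 4 5
Reduce(max, x, accumulate = TRUE)
## [1] 3 3 4 4 5
all.equal(cumsum(x), Reduce(`+`, x, accumulate = TRUE))
## [1] TRUE
```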
Next we will look at some of the basic descriptive statistical functions. The mean, median, standard deviation and variance of a variable are calculated as follows.
x <- rnorm(10)
mean(x)
## [1] 0.5545163
median(x)
## [1] 0.3892076
sd(x)
## [1] 0.7962907
var(x)
## [1] 0.6340788
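To make the definitions concrete, the sketch below recomputes the variance by hand on a small fixed vector. var and sd use the sample formula with a denominator of n - 1, and sd is the square root of var. The name var_manual is illustrative only.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
mean(x)
## [1] 5

# Sample variance: sum of squared deviations divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)
all.equal(var(x), var_manual)
## [1] TRUE
all.equal(sd(x), sqrt(var(x)))
## [1] TRUE
```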
cor is used to calculate the correlation between a pair of variables. The method argument is used to specify which method to use - the Pearson correlation coefficient, Kendall’s rank correlation or Spearman’s rank correlation. The default method is the Pearson coefficient.
x <- rnorm(100)
y <- rnorm(100)
cor(x, y)
## [1] -0.06544386
cor(x, y, method = "kendall")
## [1] -0.02909091
cor(x, y, method = "spearman")
## [1] -0.03924392
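One way to build intuition for the Spearman method is that it is the Pearson correlation computed on the ranks of the data. The sketch below checks this on simulated data; set.seed is used only to make the example reproducible and is not part of the original examples.

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)

# Spearman's coefficient equals Pearson's coefficient of the ranks
all.equal(cor(x, y, method = "spearman"), cor(rank(x), rank(y)))
## [1] TRUE
```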