Vectors and Functions
In the previous set we started with arithmetic operations on vectors. We’ll take this a step further now, by practising functions to summarize, sort and round the elements of a vector.
So far, the functions we have practised (log, sqrt, exp, sin, cos, and acos) always return a vector with the same length as the input vector. In other words, the function is applied element by element to the elements of the input vector. Not all functions behave this way though. For example, the function min(x) returns a single value (the minimum of all values in x), regardless of whether x has length 1, 100 or 100,000.
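The difference can be sketched with a small vector. The values below are made up for illustration and are not part of the exercises:

```r
# Hypothetical example vector
x <- c(4, 9, 16, 25)

# sqrt works element by element: the output has the same length as x
roots <- sqrt(x)
length(roots)      # 4

# min collapses the whole vector into a single value
smallest <- min(x)
length(smallest)   # 1
```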
Before starting the exercises, please note this is the third set in a series of five: In the first two sets, we practised creating vectors and vector arithmetic. In the fourth set (posted next week) we will practise regular sequences and replications.
You can find all sets right now in our ebook Start Here To Learn R – vol. 1: Vectors, arithmetic, and regular sequences. The book also includes all solutions (carefully explained), and the fifth and final set of the series. This final set focuses on the application of the concepts you learned in the first four sets, to real-world data.
One more thing: I would really appreciate your feedback on these exercises: Which ones did you like? Which ones were too easy or too difficult? Please let me know what you think here!
Exercise 1
Did you know R actually has lots of built-in datasets that we can use to practise? For example, the rivers data “gives the lengths (in miles) of 141 “major” rivers in North America, as compiled by the US Geological Survey” (you can find this description, and additional information, if you enter help(rivers) in R). Also, for an overview of all built-in datasets, enter data().
Have a look at the rivers data by simply entering rivers at the R prompt. Create a vector v with 8 elements, containing the number of elements (length) in rivers, their sum (sum), mean (mean), median (median), variance (var), standard deviation (sd), minimum (min) and maximum (max).
(Solution)
Exercise 2
For many functions, we can tweak their result through additional arguments. For example, the mean function accepts a trim argument, which removes a given fraction of observations from each end (the lowest and the highest values) of the vector before the mean is calculated.
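As a minimal sketch of the trim argument, consider a made-up vector of five values (not one of the exercise vectors). With trim = 0.2, one observation (20% of five) is dropped from each end before averaging:

```r
# Hypothetical vector with one extreme value
v <- c(1, 2, 3, 4, 100)

plain_mean   <- mean(v)               # uses all five values: 22
trimmed_mean <- mean(v, trim = 0.2)   # drops 1 and 100, averages 2, 3, 4: 3
```

The trimmed mean is much less affected by the single extreme value than the plain mean.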
- What is the result of mean(c(-100, 0, 1, 2, 3, 6, 50, 73), trim=0.25)? Don’t use R, but try to infer the result from the explanation of the trim argument I just gave. Then check your answer with R.
- Calculate the mean of rivers after trimming the 10 highest and the 10 lowest observations. Hint: first calculate the trim fraction, using the length function.
(Solution)
Exercise 3
Some functions accept multiple vectors as inputs. For example, the cor function accepts two vectors and returns their correlation coefficient. The women data “gives the average heights and weights for American women aged 30-39”. It contains two vectors, height and weight, which we can access after entering attach(women) (we’ll discuss the details of attach in a later chapter).
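To get a feel for cor before applying it to the women data, here is a small sketch with two made-up vectors, where b is an exact linear function of a:

```r
# Hypothetical data: b is an exact linear function of a
a <- c(1, 2, 3, 4, 5)
b <- 2 * a + 1

r_up   <- cor(a, b)    # perfectly increasing relationship: 1 (up to rounding)
r_down <- cor(a, -b)   # perfectly decreasing relationship: -1 (up to rounding)
```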
- Using the cor function, show that the average height and weight of these women are almost perfectly correlated.
- Calculate their covariance, using the cov function.
- The cor function accepts a third argument, method, which allows for three distinct methods (“pearson”, “kendall”, “spearman”) to calculate the correlation. Repeat part (a) of this exercise for each of these methods. Which method is used by default (i.e. without specifying the method explicitly)?
(Solution)
Exercise 4
In the previous three exercises, we practised functions that accept one or more vectors of any length as input, but return a single value as output. We’re now returning to functions that return a vector of the same length as their input vector. Specifically, we’ll practise rounding functions. R has several functions for rounding. Let’s start with floor, ceiling, and trunc:
- floor(x) rounds to the largest integer not greater than x
- ceiling(x) rounds to the smallest integer not less than x
- trunc(x) returns the integer part of x
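The three definitions only differ for numbers that are not already integers, and the sign matters. A minimal sketch with one positive and one negative value (chosen arbitrarily for illustration):

```r
# Arbitrary example values: one positive, one negative
x <- c(2.7, -2.7)

fl <- floor(x)     #  2 -3  (always rounds down)
ce <- ceiling(x)   #  3 -2  (always rounds up)
tr <- trunc(x)     #  2 -2  (drops the decimals, i.e. rounds towards zero)
```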
To appreciate the difference between the three, I suggest you first play around a bit in R with them. Just pick any number (with or without a decimal point, positive or negative), and see the result each of these functions gives you. Then make it somewhat closer to the next integer (either above or below), or flip the sign, and see what happens. Then continue with the following exercise:
Below you will find a series of arguments (x), and results (y), that can be obtained by choosing one or more of the 3 functions above (e.g. y <- floor(x)). Which of the above 3 functions could have been used in each case? First, choose your answer without using R, then check with R.
x <- c(300.99, 1.6, 583, 42.10)
y <- c(300, 1, 583, 42)
x <- c(152.34, 1940.63, 1.0001, -2.4, sqrt(26))
y <- c(152, 1940, 1, -2, 5)
x <- -c(3.2, 444.35, 1/9, 100)
y <- c(-3, -444, 0, -100)
x <- c(35.6, 670, -5.4, 3^3)
y <- c(36, 670, -5, 27)
(Solution)
Exercise 5
In addition to trunc, floor, and ceiling, R also has the round and signif rounding functions. The latter two accept a second argument, digits. In case of round, this is the number of decimal places, and in case of signif, the number of significant digits. As with the previous exercise, first play around a little, and see how these functions behave. Then continue with the exercise below:
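As a starting point for that experimentation, the following sketch applies both functions to a single arbitrary number (not taken from the exercise) with different values of digits:

```r
# Arbitrary example value
x <- 123.456

r1 <- round(x, digits = 1)    # one decimal place: 123.5
r0 <- round(x, digits = 0)    # no decimal places: 123
s2 <- signif(x, digits = 2)   # two significant digits: 120
s4 <- signif(x, digits = 4)   # four significant digits: 123.5
```

Note that round counts digits after the decimal point, while signif counts digits starting from the first non-zero digit, so the same value of digits can give quite different results.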
Below you will find a series of arguments (x), and results (y), that can be obtained by choosing one, or both, of the 2 functions above (e.g. y <- round(x, digits=d)). Which of these functions could have been used in each case, and what should the value of d be? First, choose your answer without using R, then check with R.
x <- c(35.63, 300.20, 0.39, -57.8)
y <- c(36, 300, 0, -58)
x <- c(153, 8642, 10, 39.842)
y <- c(153.0, 8640.0, 10.0, 39.8)
x <- c(3.8, 0.983, -23, 7.1)
y <- c(3.80, 0.98, -23.00, 7.10)
(Solution)
Exercise 6
Ok, let’s continue with a really interesting function: cumsum. This function returns a vector of the same length as its input vector. But contrary to the previous functions, the value of an element in the output vector depends not only on its corresponding element in the input vector, but also on all previous elements in the input vector. So, its results are cumulative, hence the cum prefix. Take for example: cumsum(c(0, 1, 2, 3, 4, 5)), which returns: 0, 1, 3, 6, 10, 15. Do you notice the pattern?
Functions that behave similarly to cumsum are cumprod, cummax and cummin. From just their naming, you might already have an idea how they work, and I suggest you play around a bit with them in R before continuing with the exercise.
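A minimal sketch of such an experiment, using a short made-up vector so that each running result is easy to verify by hand:

```r
# Hypothetical example vector
v <- c(2, 5, 3, 1, 4)

running_product <- cumprod(v)   # 2 10 30 30 120
running_max     <- cummax(v)    # 2  5  5  5   5
running_min     <- cummin(v)    # 2  2  2  1   1
```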
- The nhtemp data contain “the mean annual temperature in degrees Fahrenheit in New Haven, Connecticut, from 1912 to 1971”. (Although nhtemp is not a vector, but a time series object (which we’ll learn the details of later), for the purpose of this exercise this doesn’t really matter.) Use one of the four functions above to calculate the maximum mean annual temperature in New Haven observed since 1912, for each of the years 1912-1971.
- Suppose you put $1,000 in an investment fund that will exhibit the following annual returns in the next 10 years: 9% 18% 10% 7% 2% 17% -8% 5% 9% 33%. Using one of the four functions above, show how much money your investment will be worth at the end of each year for the next 10 years, assuming returns are re-invested every year. Hint: If an investment returns e.g. 4% per year, it will be worth 1.04 times as much after one year, 1.04 * 1.04 times as much after two years, 1.04 * 1.04 * 1.04 times as much after three years, etc.
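The arithmetic in the hint can be sketched for the constant 4% case as follows. The variable names are made up for illustration, and the power shortcut used here only works because the return is the same every year; with the varying returns of the exercise you need one of the four functions above instead:

```r
# Hypothetical constant annual return of 4%, as in the hint
rate  <- 0.04
years <- 1:3

growth <- (1 + rate)^years   # 1.04, 1.0816, 1.124864
value  <- 1000 * growth      # worth of a $1,000 investment after each year
```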
(Solution)
Exercise 7
R has several functions for sorting data: sort takes a vector as input, and returns the same vector with its elements sorted in increasing order. To reverse the order, you can add a second argument: decreasing=TRUE.
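For instance, with a small made-up vector (not one of the exercise datasets), both sort orders can be sketched as:

```r
# Hypothetical example vector
v <- c(3.5, -1, 10, 0)

asc  <- sort(v)                      # -1.0  0.0  3.5 10.0
desc <- sort(v, decreasing = TRUE)   # 10.0  3.5  0.0 -1.0
```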
- Use the women data (exercise 3) and create a vector x with the elements of the height vector sorted in decreasing order.
- Let’s look at the rivers data (exercise 1) from another perspective. Looking at the 141 data points in rivers, at first glance it seems quite a lot have zero as their last digit. Let’s examine this a bit closer. Use the modulo operator you practised in exercise 9 of the previous exercise set to isolate the last digit of each element of the rivers vector, sort the digits in increasing order, and look at the sorted vector on your screen. How many are zero?
- What is the total length of the 4 longest rivers combined? Hint: Sort the rivers vector from longest to shortest, and use one of the cum... functions to show their combined length. Read off the appropriate answer from your screen.
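As a reminder of how the modulo operator isolates a last digit, here is a sketch on a few made-up whole numbers (the name lengths_mi and its values are hypothetical and not taken from rivers):

```r
# Hypothetical whole-number lengths in miles
lengths_mi <- c(735, 320, 1459, 600)

# The remainder after division by 10 is the last digit of each element
last_digit <- lengths_mi %% 10   # 5 0 9 0
```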
(Solution)
Exercise 8
Another sorting function is rank, which returns the ranks of the values of a vector. Have a look at the following output:
x <- c(100465, -300, 67.1, 1, 1, 0)
rank(x)
## [1] 6.0 1.0 5.0 3.5 3.5 2.0
- Can you describe in your own words what rank does?
- In exercise 3(c) you estimated the correlation between height and weight, using Spearman’s rho statistic. Try to replicate this using the cor function without the method argument (i.e., using its default Pearson method), and using rank to first obtain the ranks of height and weight.
(Solution)
Exercise 9
A third sorting function is order. Have a look again at the vector x introduced in the previous exercise, and the output of order applied to this vector:
x <- c(100465, -300, 67.1, 1, 1, 0)
order(x)
## [1] 2 6 4 5 3 1
- Can you describe in your own words what order does? Hint: look at the output of sort(x) if you run into trouble.
- Remember the time series of mean annual temperature in New Haven, Connecticut, in exercise 6? Have a look at the output of order(nhtemp):
order(nhtemp)
## [1] 6 15 29 9 13 3 5 12 7 23 1 24 47 25 51 14 18 32 16 11 56 8 17
## [24] 28 45 52 31 37 4 22 36 39 54 19 34 26 49 30 33 53 55 21 27 58 10 50
## [47] 57 59 43 44 35 2 46 48 40 20 60 41 38 42
Given that the starting year for this series is 1912, in which years did the lowest and highest mean annual temperature occur?
- What is the result of order(sort(x)), if x is a vector of length 100, and all of its elements are numbers? Explain your answer.
(Solution)
Exercise 10
In exercise 1 of this set, we practised the max function, followed by the cummax function in exercise 6. In the final exercise of this set, we’re returning to this topic, and will practise yet another function to find a maximum. While the former two functions applied to a single vector, it’s also possible to find a maximum across multiple vectors.
- First let’s see how max deals with multiple vectors. Create two vectors x and y, where x contains the first 5 even numbers greater than zero, and y contains the first 5 odd numbers greater than zero. Then see what max does, as in max(x, y). Is there a difference with max(y, x)?
- Now, try pmax(x, y), where p stands for “parallel”. Without using R, what do you think it will return? Then check, and perhaps refine, your answer with R.
- Now try to find the parallel minimum of x and y. Again, first try to write down the output you expect. Then check with R (I assume you can guess the appropriate name of the function).
- Let’s move from two to three vectors. In addition to x and y, add -x as a third vector. Write down the expected output for the parallel minima and maxima, then check your answer with R.
- Finally, let’s find out how pmax handles vectors of different lengths. Write down the expected output for the following statements, then check your answer with R.
pmax(x, 6)
pmax(c(x, x), y)
pmin(x, c(y, y), 3)
(Solution)