Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Contents
- Introduction
- How to use this article
- Selecting (subsetting) relevant data from a data.frame
- Constructing meaningful subsets simply and safely
- More ways to provide subset indices
- Sorting data
- Sorting by multiple vectors with different order
- Alternatives to base R
- TL;DR – Just want the code
- Exercises
- References
- Exercise answers
- Discuss the article
Introduction
In the previous article we discussed and learned how to reshape data with base R to a form that is practical for our use. In this one, we will look at basic data manipulation techniques, namely obtaining relevant subsets of our data and sorting the data to suit our needs. The key will be safety and avoiding complication and confusion as much as possible. This is why we:
- try to avoid using
subset
, as this function is implemented via non-standard evaluation. - also skip
$
as it uses partial matching and is hardly used with variables as column names. - not mention more details related to
list
properties ofdata.frames
here as the topic could get confusing. If you would like to go to more important detail, we strongly recommend a thorough read of the subsetting chapter of Hadley Wickham’s Advanced R
How to use this article
- This article is best used with an R session opened in a window next to it – you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed – it is a
case4base
after all! - If you have no time for reading, you can click here to get just the code with commentary
First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):
gdi <- read.csv( stringsAsFactors = FALSE , url("https://jozefhajnala.gitlab.io/r/post/data/ESA2010_GDI.csv") ) head(gdi[, 1:6, drop = FALSE]) ## country Y.1995 Y.1996 Y.1997 Y.1998 Y.1999 ## 1 EU 28 NA NA NA NA 5982392.8 ## 2 Euro area 19 NA NA NA NA 4393727.3 ## 3 Belgium 140734.1 141599.4 145023.2 149705.2 153804.0 ## 4 Bulgaria 1036.0 1468.1 12367.4 14921.1 16052.8 ## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0 ## 6 Denmark 566363.0 578102.0 591416.0 621236.0 614893.0
Selecting (subsetting) relevant data from a data.frame
In this paragraph, we will try to show how to subset with as little hassle as possible while preserving the maximum safety in your operations. We shall go into more detail later in the article. The standard approach to subsetting data.frames
can be summarised:
dataframe_name[row_subset, col_subset, drop = FALSE]
Where:
dataframe_name
is the name of thedata.frame
we are subsettingrow_subset
is a vector specifying the subset of rowscol_subset
is a vector specifying the subset of columnsdrop = FALSE
is to make sure the result does not get simplyfied when not indented. This should always be used, unless we specifically want to simplify the result (e.g. to a vector for indexing)
Constructing meaningful subsets simply and safely
In practice, we of course will most of the time not select rows and/or columns with positions known apriori, but based on more variable conditions. For this purpose, the advised way would be is to construct logical vectors:
Let us now subset the rows of our data to get the data for countries that have known (not NA
) value in the year 2016 and this value is less than 1 million:
rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000 gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 3 Belgium 243825.50 ## 4 Bulgaria 60237.00 ## 8 Estonia 12548.30 ## 9 Ireland 97318.90 ## 11 Spain 698701.00 ## 13 Croatia 0.00 ## 16 Latvia 15737.79 ## 17 Lithuania 24743.49 ## 18 Luxembourg 20155.80 ## 21 Netherlands 357383.00 ## 22 Austria 214980.60 ## 24 Portugal 128789.39 ## 26 Slovenia 24756.63 ## 27 Slovakia 48882.91 ## 28 Finland 126590.00 ## 33 Switzerland 458641.00
Note that when creating the
rowidx
we omitted thedrop = FALSE
despite the aforementioned best practice. This is because in this particular case we consciously welcome the result being simplified to a vector, as its use is only as an index for subsetting.
More ways to provide subset indices
Subsetting can be done in a few ways. We will now use them to show a subset the first two and the 27th row and the first, 22nd and 23rd column, giving us the GDI for EU28, Euro Area 19 and Slovakia in the years 2015 and 2016:
- Logical vectors
TRUE
for rows/columns to subset,FALSE
for those to omit
st1 <- gdi[c(TRUE, TRUE, rep(FALSE, 24), TRUE, rep(FALSE, 8)) , c(TRUE, rep(FALSE, 20), rep(TRUE, 2)) , drop = FALSE ]
- Numeric vectors of row/column numbers to subset
st2 <- gdi[c(1:2, 27) , c(1, 22:23) , drop = FALSE ]
- Negative numeric vectors of row/column numbers to omit
st3 <- gdi[c(-3:-26, -28:-35) , c(-2:-21) , drop = FALSE ]
- Character vectors of row/column names to subset
st4 <- gdi[c("1", "2", "27") # we do not have very meaningful rownames , c("country", "Y.2015", "Y.2016") , drop = FALSE ] st4 ## country Y.2015 Y.2016 ## 1 EU 28 9439578.39 9454683.60 ## 2 Euro area 19 6598231.27 6736686.43 ## 27 Slovakia 47464.71 48882.91
- All of the above give identical results
identical(st1, st2) && identical(st2, st3) && identical(st3, st4) ## [1] TRUE
Tips
- The above methods are also working and safe for matrices
- Negative and positive numeric vectors cannot be combined
Sorting data
The concept of subsetting can easily be generalized for sorting purposes. We can easily see this by the following command giving us identical data.frame
to GDI:
identical(gdi[0:nrow(gdi), 0:ncol(gdi), drop = FALSE], gdi) ## [1] TRUE
To get the rows and columns in reverse order:
gdi_reversed <- gdi[nrow(gdi):0, ncol(gdi):0, drop = FALSE]
To get the order by GDI in 2016, we simply find the rowidx
with order
and subset with it:
rowidx <- order(gdi[, "Y.2016"]) gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 13 Croatia 0.00 ## 8 Estonia 12548.30 ## 16 Latvia 15737.79 ## 18 Luxembourg 20155.80 ## 17 Lithuania 24743.49 ## 26 Slovenia 24756.63 ## 27 Slovakia 48882.91 ## 4 Bulgaria 60237.00 ## 9 Ireland 97318.90 ## 28 Finland 126590.00 ## 24 Portugal 128789.39 ## 22 Austria 214980.60 ## 3 Belgium 243825.50 ## 21 Netherlands 357383.00 ## 33 Switzerland 458641.00 ## 11 Spain 698701.00 ## 6 Denmark 1091542.00 ## 23 Poland 1136916.00 ## 14 Italy 1142273.40 ## 30 United Kingdom 1330854.00 ## 12 France 1425435.00 ## 32 Norway 1627136.00 ## 7 Germany 2019917.00 ## 29 Sweden 2402587.00 ## 5 Czech Republic 2523229.00 ## 2 Euro area 19 6736686.43 ## 1 EU 28 9454683.60 ## 10 Greece NA ## 15 Cyprus NA ## 19 Hungary NA ## 20 Malta NA ## 25 Romania NA ## 31 Iceland NA ## 34 Turkey NA ## 35 Serbia NA
That looks good, but we may want to order the rows that have NA
as GDI in 2016 alphabetically by country (or generalize even further). To use multiple vectors for ordering is also very simple:
rowidx <- order(gdi[, "Y.2016"], gdi[, "country"]) gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 13 Croatia 0.00 ## 8 Estonia 12548.30 ## 16 Latvia 15737.79 ## 18 Luxembourg 20155.80 ## 17 Lithuania 24743.49 ## 26 Slovenia 24756.63 ## 27 Slovakia 48882.91 ## 4 Bulgaria 60237.00 ## 9 Ireland 97318.90 ## 28 Finland 126590.00 ## 24 Portugal 128789.39 ## 22 Austria 214980.60 ## 3 Belgium 243825.50 ## 21 Netherlands 357383.00 ## 33 Switzerland 458641.00 ## 11 Spain 698701.00 ## 6 Denmark 1091542.00 ## 23 Poland 1136916.00 ## 14 Italy 1142273.40 ## 30 United Kingdom 1330854.00 ## 12 France 1425435.00 ## 32 Norway 1627136.00 ## 7 Germany 2019917.00 ## 29 Sweden 2402587.00 ## 5 Czech Republic 2523229.00 ## 2 Euro area 19 6736686.43 ## 1 EU 28 9454683.60 ## 15 Cyprus NA ## 10 Greece NA ## 19 Hungary NA ## 31 Iceland NA ## 20 Malta NA ## 25 Romania NA ## 35 Serbia NA ## 34 Turkey NA
To order in descending order, we can use decreasing = TRUE
, to see NAs first we can use na.last = FALSE
rowidx <- order(gdi[, "Y.2016"], decreasing = TRUE, na.last = FALSE) gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 10 Greece NA ## 15 Cyprus NA ## 19 Hungary NA ## 20 Malta NA ## 25 Romania NA ## 31 Iceland NA ## 34 Turkey NA ## 35 Serbia NA ## 1 EU 28 9454683.60 ## 2 Euro area 19 6736686.43 ## 5 Czech Republic 2523229.00 ## 29 Sweden 2402587.00 ## 7 Germany 2019917.00 ## 32 Norway 1627136.00 ## 12 France 1425435.00 ## 30 United Kingdom 1330854.00 ## 14 Italy 1142273.40 ## 23 Poland 1136916.00 ## 6 Denmark 1091542.00 ## 11 Spain 698701.00 ## 33 Switzerland 458641.00 ## 21 Netherlands 357383.00 ## 3 Belgium 243825.50 ## 22 Austria 214980.60 ## 24 Portugal 128789.39 ## 28 Finland 126590.00 ## 9 Ireland 97318.90 ## 4 Bulgaria 60237.00 ## 27 Slovakia 48882.91 ## 26 Slovenia 24756.63 ## 17 Lithuania 24743.49 ## 18 Luxembourg 20155.80 ## 16 Latvia 15737.79 ## 8 Estonia 12548.30 ## 13 Croatia 0.00
Sorting by multiple vectors with different order
To order by multiple columns differently, for numeric vectors we can use a simple -
, since negated numeric vector will order in reverse order. To order our GDI dataset by GDI in 2016 descending and then by country alphabetically:
rowidx <- order(-gdi[, "Y.2016"], gdi[, "country"]) gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 1 EU 28 9454683.60 ## 2 Euro area 19 6736686.43 ## 5 Czech Republic 2523229.00 ## 29 Sweden 2402587.00 ## 7 Germany 2019917.00 ## 32 Norway 1627136.00 ## 12 France 1425435.00 ## 30 United Kingdom 1330854.00 ## 14 Italy 1142273.40 ## 23 Poland 1136916.00 ## 6 Denmark 1091542.00 ## 11 Spain 698701.00 ## 33 Switzerland 458641.00 ## 21 Netherlands 357383.00 ## 3 Belgium 243825.50 ## 22 Austria 214980.60 ## 24 Portugal 128789.39 ## 28 Finland 126590.00 ## 9 Ireland 97318.90 ## 4 Bulgaria 60237.00 ## 27 Slovakia 48882.91 ## 26 Slovenia 24756.63 ## 17 Lithuania 24743.49 ## 18 Luxembourg 20155.80 ## 16 Latvia 15737.79 ## 8 Estonia 12548.30 ## 13 Croatia 0.00 ## 15 Cyprus NA ## 10 Greece NA ## 19 Hungary NA ## 31 Iceland NA ## 20 Malta NA ## 25 Romania NA ## 35 Serbia NA ## 34 Turkey NA
For non-numeric vectors, we can take advantage of the xtfrm
function, which returns a numeric vector which will sort in the same order as the one provided to it. Then we just use -
to get a vector that will order in reverse order. To order our GDI dataset by GDI ascending in 2016 and then by country reverse-alphabetically:
rowidx <- order(gdi[, "Y.2016"], -xtfrm(gdi[, "country"])) gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 13 Croatia 0.00 ## 8 Estonia 12548.30 ## 16 Latvia 15737.79 ## 18 Luxembourg 20155.80 ## 17 Lithuania 24743.49 ## 26 Slovenia 24756.63 ## 27 Slovakia 48882.91 ## 4 Bulgaria 60237.00 ## 9 Ireland 97318.90 ## 28 Finland 126590.00 ## 24 Portugal 128789.39 ## 22 Austria 214980.60 ## 3 Belgium 243825.50 ## 21 Netherlands 357383.00 ## 33 Switzerland 458641.00 ## 11 Spain 698701.00 ## 6 Denmark 1091542.00 ## 23 Poland 1136916.00 ## 14 Italy 1142273.40 ## 30 United Kingdom 1330854.00 ## 12 France 1425435.00 ## 32 Norway 1627136.00 ## 7 Germany 2019917.00 ## 29 Sweden 2402587.00 ## 5 Czech Republic 2523229.00 ## 2 Euro area 19 6736686.43 ## 1 EU 28 9454683.60 ## 34 Turkey NA ## 35 Serbia NA ## 25 Romania NA ## 20 Malta NA ## 31 Iceland NA ## 19 Hungary NA ## 10 Greece NA ## 15 Cyprus NA
Alternatives to base R
- dplyr::select and dplyr::filter
- Using data.table
TL;DR – Just want the code
No time for reading? Click here to get just the code with commentary
Exercises
- What is the difference between
gdi[3, 3]
andgdi[3, 3, drop = FALSE]
? - What is the difference between
gdi[-3, 3]
andgdi[3, -3]
? What aboutgdi[-3, 3, drop = FALSE]
? - Why does
rowidx <- order(-gdi[, "country"])
not work as expected? Can you fix? - Why cannot we omit the first part of the & in
rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000
. What would happen if we just didrowidx <- gdi[, "Y.2016"] < 1000000
? - Bonus question 1: Why is
identical(gdi[, "Y.2016", drop = FALSE], gdi["Y.2016"])
- Bonus question 2: Why is
identical(gdi[, "Y.2016"], gdi[["Y.2016"]])
References
- Advanced R’s chapter on subsetting
- and on data types
Exercise answers
Discuss the article
Please enable JavaScript to view the comments powered by Disqus.R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.