How to subset rows from a data frame in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
After understanding “how to subset columns data in R“; this article aims to demonstrate row subsetting using base R and the “dplyr” package. Let’s see how to subset rows from a data frame in R and the flow of this article is as follows:
- Data
- Reading Data
- Subset an nth row from a data frame
- Subset range of rows from a data frame
- Conditionally subset rows from a data frame
Prerequisites:
- R
- R Studio (for ease)
Data
We will use s and p 500 companies financials data to demonstrate row data subsetting. Interestingly, this data is available under the PDDL licence. For this article, we have altered the original file to have a limited number of columns only. However, we have a great post explaining how to prepare data for analysis in R in 5 steps using multiple CSV files where we have split the original file into multiple files and combined them to produce an original result. The CSV file we are using in this article is a result of how to prepare data for analysis in R in 5 steps article.
Reading Data
Before importing data into R, it is important to know that R keeps the data into the memory of your computer. As a result, performance may be impacted. In other words, we should calculate memory requirements in advance.
Let’s import constituents-financials_csv.csv
file into R using read.csv
function.
financials <- read.csv("constituents-financials_csv.csv")
The command above has created a new object called financials
. This object is a data frame which contains all the data from financials CSV file. Let’s look at the structure of the newly created data frame.
str(financials) ## 'data.frame': 505 obs. of 6 variables: ## $ Symbol : chr "MMM" "AOS" "ABT" "ABBV" ... ## $ Name : chr "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ... ## $ Sector : chr "Industrials" "Industrials" "Health Care" "Health Care" ... ## $ Price : num 222.9 60.2 56.3 108.5 150.5 ... ## $ Price.Earnings: num 24.3 27.8 22.5 19.4 25.5 ... ## $ Dividend.Yield: num 2.33 1.15 1.91 2.5 1.71 ...
Interestingly, the result of str(financials)
command shows that there are 505 observations (rows) and 14 variables (columns). Now let’s look at different ways of row subsetting from a data frame.
Subset nth row from a data frame
Using base R
Interestingly, if you type financials
on the console you would find that R will display several observations from the data frame financials
. The result could be overwhelming. Let’s have a look at the result and observe. To save space head
function is used here with argument 2 to show only two observations.
head(financials,2) ## Symbol Name Sector Price Price.Earnings Dividend.Yield ## 1 MMM 3M Company Industrials 222.89 24.31 2.332862 ## 2 AOS A.O. Smith Corp Industrials 60.24 27.76 1.147959
The number next to the two # symbols identifies the row uniquely. This number is known as the index. To select an nth row we have to supply the number of the row in bracket notation. Here is the example where we are selecting the 7th row of financials
data frame:
financials[7,] ## Symbol Name Sector Price Price.Earnings Dividend.Yield ## 7 AYI Acuity Brands Inc Industrials 145.41 18.22 0.3511853
Square bracket notation is one way of subsetting data from a data frame. We should be careful when selecting rows and columns; especially the position of the expression concerning a comma. Expression left side of the comma is operated upon observations (rows) and on the other hand, expression at the right side of the comma is operated upon variables (columns).
Using the dplyr package
Following is the nth row selection using the dplyr package. You can see the result is identical. Function "row" used below returns a matrix of an integer as the row number or index.
library(dplyr) financials %>% filter(row(financials) == 7) ## Symbol Name Sector Price Price.Earnings Dividend.Yield ## 1 AYI Acuity Brands Inc Industrials 145.41 18.22 0.3511853
Subset range of rows from a data frame
Using base R
It is interesting to know that we can select any row by just supplying the number or the index of that row with square brackets to get the result. Similarly, we can retrieve the range of rows as well. This can be done by simply providing the range in square brackets notations. Let’s look at the example by selecting 3 rows starting from 2nd row.
financials[2:4,] ## Symbol Name Sector Price Price.Earnings Dividend.Yield ## 2 AOS A.O. Smith Corp Industrials 60.24 27.76 1.147959 ## 3 ABT Abbott Laboratories Health Care 56.27 22.51 1.908982 ## 4 ABBV AbbVie Inc. Health Care 108.48 19.41 2.499560
But what is the use of this functionality?
Well, in complex scenario you may want to derive the range based on different conditions and observe the result. Interestingly, a range can also be defined within square brackets using concatenation function to select a range of rows from a data frame. Here is an example. Result for both the commands is identical.
financials[c(2:4),] ## Symbol Name Sector Price Price.Earnings Dividend.Yield ## 2 AOS A.O. Smith Corp Industrials 60.24 27.76 1.147959 ## 3 ABT Abbott Laboratories Health Care 56.27 22.51 1.908982 ## 4 ABBV AbbVie Inc. Health Care 108.48 19.41 2.499560
Using the dplyr function
financials %>% filter(row(financials) >= 2 & row(financials) <= 4) ## Symbol Name Sector Price Price.Earnings Dividend.Yield ## 1 AOS A.O. Smith Corp Industrials 60.24 27.76 1.147959 ## 2 ABT Abbott Laboratories Health Care 56.27 22.51 1.908982 ## 3 ABBV AbbVie Inc. Health Care 108.48 19.41 2.499560
Though using the dplyr package for this purpose is an overkill. We must note that the same operation can be performed in many ways.
Conditional subsetting of rows from a data frame
The subsetting we have seen in two sections above is simple but in real-world, the requirement may be based on complex conditions. In this section, our emphasis will be on conditional subsetting of the rows from a data frame.
Scenario
Let’s retrieve all the rows from the financials data frame where variable (column) symbol is “ACN” or “APTV” and the variable sector is either “Industrials” or “Consumer Discretionary”.
Using base R
financials[(financials$Symbol == "ACN" | financials$Symbol == "APTV") & (financials$Sector == "Consumer Discretionary" | financials$Sector == "Industrials"),] ## Symbol Name Sector Price Price.Earnings Dividend.Yield ## 54 APTV Aptiv Plc Consumer Discretionary 89.27 69.74
We can see that only one row falls into the subsetting criterion. Notice the use of “&”, “|” and “==” operators to define “AND”, “OR” and equality conditions respectively. Additionally, note that full subsetting criterion is written on the left side of the comma as we are filtering only rows.
Using the dplyr package
The dplyr package offers great functionality when it comes to filtering observations (rows). Let’s look at the same scenario by using dplyr function.
financials %>% filter((Symbol == "ACN" | Symbol == "APTV") & (Sector == "Consumer Discretionary" | Sector == "Industrials")) ## Symbol Name Sector Price Price.Earnings Dividend.Yield ## 1 APTV Aptiv Plc Consumer Discretionary 89.27 69.74 0.9392678
The result is identical and the difference is clear that we do not have to use the “$” subscript to define the column while applying the conditions. This is a normal notation when using the dplyr package.
I hope you enjoyed the read. Please do not forget to share and comment below. If you wish, you can download the R code file containing all commands listed here at our GitHub repository.
Image Credit unsplash.comR-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.