How to subset a data frame column data in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this article, we present the audience with different ways of subsetting data from a data frame column using base R and dplyr. Let’s check out how to subset a data frame column data in R. The summary of the content of this article is as follows:
- Data
- Reading Data
- Subset a data frame column data
- Subset all data from a data frame
- Subset column from a data frame
- Subset multiple columns from a data frame
- Subset all columns data but one from a data frame
- Subset columns which share same character or string at the start of their name
Prerequisites:
- R
- R Studio (for ease)
Assumption: Working directory is set and datasets are stored in the working directory. setwd() command is used to set the working directory. Supply the path of directory enclosed in double quotes to set it as a working directory.
Data
We will use s and p 500 companies financials data to demonstrate row data subsetting. Interestingly, this data is available under the PDDL licence.
We have a great post explaining how to prepare data for analysis in R in 5 steps using multiple CSV files where we have split the original file into multiple files and combined them to produce an original result. The CSV file we are using in this article is a result of how to prepare data for analysis in R in 5 steps article.
Reading data
Data can come from any source, it can be a flat file, database system, or handwritten notes. Usually, flat files are the most common source of the data. In this section, we will see how to load data from a CSV file. Most importantly, if we are working with a large dataset then we must check the capacity of our computer as R keep the data into memory. Let’s read the CSV file into R.
financials <- read.csv("constituents-financials_csv.csv")
The command above will import the content of the constituents-financials_csv.csv file into an object called the financials
. Object financials
is a data frame that contains all the data from the constituents-financials_csv.csv file. If you have a relation database experience then we can loosely compare this to a relational database object “table”. Similar to tables, data frames also have rows and columns, and data is presented in rows and columns form.
To clarify, function read.csv above take multiple other arguments other than just the name of the file. Information on additional arguments can be found at read.csv. Let’s continue learning how to subset a data frame column data in R.
Subset a data frame column data
Before we learn how to subset columns data in R from a data frame "financials", I would recommend learning the following three functions using "financials" data frame:
names(financials) ## [1] "Symbol" "Name" "Sector" "Price" ## [5] "Price.Earnings" "Dividend.Yield" "Earnings.Share" "X52.Week.Low" ## [9] "X52.Week.High" "Market.Cap" "EBITDA" "Price.Sales" ## [13] "Price.Book" "SEC.Filings"
Command names(financials)
above would return all the column names of the data frame. Checking column names just after loading the data is useful as this will make you familiar with the data frame. Do not worry about the numbers in the square brackets just yet, we will look at them in a future article. The names of the columns are listed next to the numbers in the brackets and there are a total of 14 columns in the financials
data frame.
str(financials) ## 'data.frame': 505 obs. of 14 variables: ## $ Symbol : chr "MMM" "AOS" "ABT" "ABBV" ... ## $ Name : chr "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ... ## $ Sector : chr "Industrials" "Industrials" "Health Care" "Health Care" ... ## $ Price : num 222.9 60.2 56.3 108.5 150.5 ... ## $ Price.Earnings: num 24.3 27.8 22.5 19.4 25.5 ... ## $ Dividend.Yield: num 2.33 1.15 1.91 2.5 1.71 ... ## $ Earnings.Share: num 7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ... ## $ X52.Week.Low : num 259.8 68.4 64.6 125.9 162.6 ... ## $ X52.Week.High : num 175.5 48.9 42.3 60 114.8 ... ## $ Market.Cap : num 1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ... ## $ EBITDA : num 9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ... ## $ Price.Sales : num 4.39 3.58 3.74 6.29 2.6 ... ## $ Price.Book : num 11.34 6.35 3.19 26.14 10.62 ... ## $ SEC.Filings : chr "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ...
Command str(financials)
would return the structure of the data frame. Function str()
compactly displays the internal structure of the object, be it data frame or any other. Above is the structure of the financials
data frame. The result from str()
function above shows the data type of the columns financials
data frame has, as well as sample data from the individual columns.
dim(financials) ## [1] 505 14
Command dim(financials)
mentioned above will result in dimensions of the financials
data frame or in other words total number of rows and columns this data frame has. In statistics terms, a column is a variable and row is an observation. Data frame financials has 505 observations and 14 variables.
Subset all data from a data frame
In base R, just putting the name of the data frame financials
on the prompt will display all of the data for that data frame. Commands head(financials)
or head(financials, 10)
, 10 is just to show the parameter that head function can take which limit the number of lines. Similarly, tail(financials) or tail(financials, 10) will be helpful to quickly check the data from the end.
Subset column from a data frame
In base R, you can specify the name of the column that you would like to select with $ sign (indexing tagged lists) along with the data frame. The command head(financials$Population, 10)
would show the first 10 observations from column Population from data frame financials:
head(financials$Name, 10) ## [1] "3M Company" "A.O. Smith Corp" ## [3] "Abbott Laboratories" "AbbVie Inc." ## [5] "Accenture plc" "Activision Blizzard" ## [7] "Acuity Brands Inc" "Adobe Systems Inc" ## [9] "Advance Auto Parts" "Advanced Micro Devices Inc"
What we have done above can also be done using dplyr package. As per rdocumentation.org “dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” Here is a command using dplyr package which selects Population column from the financials data frame:
You can see the presentation of the result between subsetting using $ sign (element names operator) and using dplyr package.
Subsetting multiple columns from a data frame
Using base R
The following command will help subset multiple columns. In the command below first two columns are selected from the data frame financials. If you see the result for command names(financials)
above, you would find that "Symbol" and "Name" are the first two columns.
head(financials[,c(1,2)],10) ## Symbol Name ## 1 MMM 3M Company ## 2 AOS A.O. Smith Corp ## 3 ABT Abbott Laboratories ## 4 ABBV AbbVie Inc. ## 5 ACN Accenture plc ## 6 ATVI Activision Blizzard ## 7 AYI Acuity Brands Inc ## 8 ADBE Adobe Systems Inc ## 9 AAP Advance Auto Parts ## 10 AMD Advanced Micro Devices Inc
Or we can supply the name of the columns and select them. Here is an example:
head(financials[,c("Symbol","Name")],10) ## Symbol Name ## 1 MMM 3M Company ## 2 AOS A.O. Smith Corp ## 3 ABT Abbott Laboratories ## 4 ABBV AbbVie Inc. ## 5 ACN Accenture plc ## 6 ATVI Activision Blizzard ## 7 AYI Acuity Brands Inc ## 8 ADBE Adobe Systems Inc ## 9 AAP Advance Auto Parts ## 10 AMD Advanced Micro Devices Inc
Using dplyr package select clause
result <- financials %>% select(Symbol, Name) head(result, 10) ## Symbol Name ## 1 MMM 3M Company ## 2 AOS A.O. Smith Corp ## 3 ABT Abbott Laboratories ## 4 ABBV AbbVie Inc. ## 5 ACN Accenture plc ## 6 ATVI Activision Blizzard ## 7 AYI Acuity Brands Inc ## 8 ADBE Adobe Systems Inc ## 9 AAP Advance Auto Parts ## 10 AMD Advanced Micro Devices Inc
Using base R
Any number of columns can be selected this way by giving the number or the name of the column within a vector. Let’s find out the first, fourth, and eleventh column from the financials
data frame. Remember, instead of the number you can give the name of the column enclosed in double-quotes:
head(financials[,c(1,4,11)],10) ## Symbol Price EBITDA ## 1 MMM 222.89 9048000000 ## 2 AOS 60.24 601000000 ## 3 ABT 56.27 5744000000 ## 4 ABBV 108.48 10310000000 ## 5 ACN 150.51 5643228000 ## 6 ATVI 65.83 2704000000 ## 7 AYI 145.41 587800000 ## 8 ADBE 185.16 2538040000 ## 9 AAP 109.63 853941000 ## 10 AMD 11.22 339000000
Using dplyr package select clause
result <- financials %>% select(Symbol, Price, EBITDA) head(result, 10) ## Symbol Price EBITDA ## 1 MMM 222.89 9048000000 ## 2 AOS 60.24 601000000 ## 3 ABT 56.27 5744000000 ## 4 ABBV 108.48 10310000000 ## 5 ACN 150.51 5643228000 ## 6 ATVI 65.83 2704000000 ## 7 AYI 145.41 587800000 ## 8 ADBE 185.16 2538040000 ## 9 AAP 109.63 853941000 ## 10 AMD 11.22 339000000
Subset all columns but one from a data frame
This approach is called subsetting by the deletion of entries. In base R you can specify which column you would like to exclude from the selection by putting a minus sign in from of it. Let’s try:
result <- head(financials[,-6],10) dim(result) ## [1] 10 13
Now if we analyse the result of the above command, we can see the dimension of the result variable is showing 10 observations (rows) and 13 variables (columns). If you check the result of command dim(financials) above, you can see there were total 14 variables in the financials data frame but as we have excluded the sixth column using -6 in column section in command result <- head(financials[,-6],10) which returned a result for all columns except sixth. A similar operation can be performed using dplyr package and instead of using the minus sign on the number of a column, you can use it directly on the name of the column. Here is the example where we would exclude column “EBITDA” form the result set:
result <- select(financials, -EBITDA) dim(head(result,10)) ## [1] 10 13
Subsetting all columns which start with a particular character or string
If you go back to the result of names(financials)
command you would see that few column names start with the same string. Columns we particularly interested in here start with word “Price”. Following R command using dplyr package will help us subset these two columns by writing as little code as possible. Imagine a scenario when you have several columns which start with the same character or string and in such scenario following command will be helpful:
result <- select(financials, starts_with("Price")) head(result, 10) ## Price Price.Earnings Price.Sales Price.Book ## 1 222.89 24.31 4.390271 11.34 ## 2 60.24 27.76 3.575483 6.35 ## 3 56.27 22.51 3.740480 3.19 ## 4 108.48 19.41 6.291571 26.14 ## 5 150.51 25.47 2.604117 10.62 ## 6 65.83 31.80 10.595120 5.16 ## 7 145.41 18.22 1.795347 3.55 ## 8 185.16 52.31 13.092818 11.06 ## 9 109.63 19.54 1.130106 2.51 ## 10 11.22 187.00 2.109195 21.47
I hope you enjoyed this post and learned how to subset a data frame column data in R. If it helped you in any way then please do not forget to share this post.
Image Credit unsplash.comR-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.