[This article was first published on dataENQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Welcome to the second part of this two-part series on data manipulation in R. This article aims to present the reader with different ways of data aggregation and sorting. Here is the composition of this article.
- Aggregation in R
- Why aggregate data?
- Data used in this article
- Aggregation using base R
- Aggregation using the aggregate function
- Assigning a name to an aggregated column in aggregate function in R
- Aggregating multiple columns using an aggregate function in R
- Aggregating multiple columns data by multiple functions using the aggregate function in R
- Aggregation using by function
- Aggregation using sweep function
- Aggregation using dplyr
- Aggregation using group_by and summarize functions
- Aggregation by group_by and summarize functions with multiple variables and functions
- Aggregation using group_by and summarize functions with a range of variables
- Aggregation using group_by and summarize_if function
- Aggregation using group_by and summarize_at function
- Aggregation using group_by and summarize_all function
- Aggregation using data.table
- Converting data frame to data table
- Aggregating single variable using data table
- Assigning a custom name to aggregated variable in data table
- Aggregating multiple variables using multiple aggregation functions in data table
- Aggregating variables as well as filtering observations in data table
- Counting the number of observations within a group in data table
- Aggregation using sqldf
- Sorting in R
- Why sort data?
- Sorting using base R
- Sorting data using order function
- Sorting data in descending order using order function
- Sorting using dplyr
- Sorting data using arrange function
- Sorting data in descending order using arrange function
- Sorting using data.table
- Sorting data using order function
- Sorting data in descending order using order function
- Sorting data using setorder function
- Sorting data in descending order using setorder function
- Sorting using sqldf
Aggregation in R
To begin with, let’s look at the term aggregation. As per the Cambridge dictionary, “the process of combining things or amounts into a single group or total” is aggregation.
Why aggregate data?
Usually, the data has two types, qualitative and quantitative. These two are statistical terms. Qualitative data defines the characteristics of the data. Labels, properties, attributes, and categories are all examples of qualitative data. As the name suggests, data that express the quality is qualitative data. On the other hand, quantitative data represent numbers. In other disciplines like data warehousing or business intelligence, qualitative data is equivalent to dimensions and quantitative data to measures.
Data analysis is a complex process and involves several steps. Some of these steps may include data to be examined by its quality. Usually, the qualitative data’s granularity is higher. Granularity is the level of detail.
For example, a dataset that contains all countries of the world may have multiple variables describing qualitative data. The name of the country would be at the most granular level, as all countries’ names would be unique and, no two countries will have the same name. Whereas, granularity level would rise as we look at the countries by continents and then by the hemisphere.
Similarly, in an analysis, data is examined on various levels by its different qualities. At this point, aggregation comes into the picture. It is required if you want to explore quantitative data elements by its quality that sits higher than the most granular level.
In this article, we will practice different aggregation functions and options offered by base R and other packages like dplyr and data.table. Note that this article does not aim to list all the functions that (if used in a way) can aggregate data, whereas the aim here is to explain the concept of data aggregation. R is a rich language that offers different ways of doing the same thing.
Data used in this article
We will use the data that we have prepared in our previous post How to prepare data for analysis in R. Click on this link to download the original data. This data is available under the PDDL license. However, to save space and make it cleaner, we will remove a few columns from the data frame. It’s an opportunity for us to revise how to remove a column from the data frame. Let’s first create the data frame and understand the data.
financials <- read.csv("constituents-financials_csv.csv") str(financials) ## 'data.frame': 505 obs. of 14 variables: ## $ Symbol : chr "MMM" "AOS" "ABT" "ABBV" ... ## $ Name : chr "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ... ## $ Sector : chr "Industrials" "Industrials" "Health Care" "Health Care" ... ## $ Price : num 222.9 60.2 56.3 108.5 150.5 ... ## $ Price.Earnings: num 24.3 27.8 22.5 19.4 25.5 ... ## $ Dividend.Yield: num 2.33 1.15 1.91 2.5 1.71 ... ## $ Earnings.Share: num 7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ... ## $ X52.Week.Low : num 259.8 68.4 64.6 125.9 162.6 ... ## $ X52.Week.High : num 175.5 48.9 42.3 60 114.8 ... ## $ Market.Cap : num 1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ... ## $ EBITDA : num 9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ... ## $ Price.Sales : num 4.39 3.58 3.74 6.29 2.6 ... ## $ Price.Book : num 11.34 6.35 3.19 26.14 10.62 ... ## $ SEC.Filings : chr "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ...
Now we have our data frame ready, let’s reshape our data frame by selecting a few columns to keep the results clean and tidy.
financials <- financials %>% select(Symbol, Sector, Price, X52.Week.Low, X52.Week.High) head(financials,10) ## Symbol Sector Price X52.Week.Low X52.Week.High ## 1 MMM Industrials 222.89 259.770 175.490 ## 2 AOS Industrials 60.24 68.390 48.925 ## 3 ABT Health Care 56.27 64.600 42.280 ## 4 ABBV Health Care 108.48 125.860 60.050 ## 5 ACN Information Technology 150.51 162.600 114.820 ## 6 ATVI Information Technology 65.83 74.945 38.930 ## 7 AYI Industrials 145.41 225.360 142.000 ## 8 ADBE Information Technology 185.16 204.450 114.451 ## 9 AAP Consumer Discretionary 109.63 169.550 78.810 ## 10 AMD Information Technology 11.22 15.650 9.700
Our final data now has five variables and 505 observations.
Aggregation using base R
Aggregation using the aggregate function
When talking about aggregation, the aggregate function is the most obvious choice. We can use the aggregate function with data frames and time series. We can be specific with the signature if we are using the data frame by writing aggregate.data.frame. Both aggregate and aggregate.data.frame will result in the same data.
Aggregate is a generic function that can be used for both data frames and time series.
We will look at this function by different scenarios. Before the use of any aggregation, the use case must exist. It means why you want to aggregate data. This information is crucial as this will guide you to extract correctly summarized data that will answer your question.
So the scenario here is to “find out the total price of all shares by all sectors”. If you recall from the section above on qualitative and quantitative data, the variables sector and price represents qualitative and quantitative data, respectively. This information is crucial to put the variables into their correct place within the function.
aggregate(financials$Price, by = list(financials$Sector), FUN = sum) ## Group.1 x ## 1 Consumer Discretionary 10418.90 ## 2 Consumer Staples 2711.98 ## 3 Energy 1852.40 ## 4 Financials 6055.81 ## 5 Health Care 8083.46 ## 6 Industrials 7831.47 ## 7 Information Technology 8347.00 ## 8 Materials 2559.67 ## 9 Real Estate 2927.52 ## 10 Telecommunication Services 100.81 ## 11 Utilities 1545.45
Assigning a name to an aggregated column in aggregate function in R
The result above shows that the function aggregate split the original data into small subsets by sector variable and applied sum function over the price variable. The result above shows x as the name of the summarised column. If you want to rename it, a little tweak in the code using setName function will do the trick. Here is an example.
setNames(aggregate(financials$Price, by = list(financials$Sector), FUN = sum),c("Sector","Total.Price")) ## Sector Total.Price ## 1 Consumer Discretionary 10418.90 ## 2 Consumer Staples 2711.98 ## 3 Energy 1852.40 ## 4 Financials 6055.81 ## 5 Health Care 8083.46 ## 6 Industrials 7831.47 ## 7 Information Technology 8347.00 ## 8 Materials 2559.67 ## 9 Real Estate 2927.52 ## 10 Telecommunication Services 100.81 ## 11 Utilities 1545.45
Aggregating multiple columns using the aggregate function in R
Now to apply the same function on multiple variables, all you have to do is to supply an expression to subset the required columns from the data frame as an argument. Here is an example, we have used setName function on top to assign meaningful names to the variables.
The scenario here is to “find out the average price, average 52-weeks low price, and the average 52-week high price of all shares by all sectors”.
setNames(aggregate(financials[,c("Price","X52.Week.Low","X52.Week.High")], by = list(financials$Sector), FUN = mean), c("Sector","Average.Price","Average.Low","Average.High")) ## Sector Average.Price Average.Low Average.High ## 1 Consumer Discretionary 124.03452 146.93143 96.09236 ## 2 Consumer Staples 79.76412 92.83229 68.92944 ## 3 Energy 57.88750 72.58969 48.14123 ## 4 Financials 89.05603 101.82185 72.69447 ## 5 Health Care 132.51574 160.75853 103.71925 ## 6 Industrials 116.88761 134.57948 90.83702 ## 7 Information Technology 119.24286 138.77864 91.89142 ## 8 Materials 102.38680 118.03885 85.58325 ## 9 Real Estate 88.71273 110.55045 82.87809 ## 10 Telecommunication Services 33.60333 41.69333 29.50367 ## 11 Utilities 55.19464 68.49732 52.80232
Aggregating multiple columns data by multiple functions using the aggregate function in R
Moving on to the next level of the scenario, we would like to apply multiple functions over the different variables. It is possible by defining a custom function. The requirement now is to “find out the minimum and maximum values of price, 52-weeks low price of all shares by all sectors”. Here is an example.
aggregate(financials[c("Price","X52.Week.Low")], by = list(financials$Sector), FUN = function(x) c(min(x),max(x))) ## Group.1 Price.1 Price.2 X52.Week.Low.1 X52.Week.Low.2 ## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990 ## 2 Consumer Staples 19.96 208.73 21.175 229.500 ## 3 Energy 2.82 169.16 6.590 199.830 ## 4 Financials 13.38 509.38 16.530 594.520 ## 5 Health Care 25.20 601.00 29.930 697.260 ## 6 Industrials 14.45 334.30 30.590 361.790 ## 7 Information Technology 11.22 1007.71 15.650 1198.000 ## 8 Materials 17.16 387.65 20.250 435.150 ## 9 Real Estate 14.01 409.98 21.530 495.345 ## 10 Telecommunication Services 16.20 49.04 27.610 54.770 ## 11 Utilities 10.06 145.29 12.050 159.640
The above command can be written in different ways. Here are a couple of examples. Note the use of cbind and formula (~).
aggregate(cbind(financials$Price,financials$X52.Week.Low) ~financials$Sector, FUN = function(x) c(min(x),max(x))) ## financials$Sector V1.1 V1.2 V2.1 V2.2 ## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990 ## 2 Consumer Staples 19.96 208.73 21.175 229.500 ## 3 Energy 2.82 169.16 6.590 199.830 ## 4 Financials 13.38 509.38 16.530 594.520 ## 5 Health Care 25.20 601.00 29.930 697.260 ## 6 Industrials 14.45 334.30 30.590 361.790 ## 7 Information Technology 11.22 1007.71 15.650 1198.000 ## 8 Materials 17.16 387.65 20.250 435.150 ## 9 Real Estate 14.01 409.98 21.530 495.345 ## 10 Telecommunication Services 16.20 49.04 27.610 54.770 ## 11 Utilities 10.06 145.29 12.050 159.640 aggregate(cbind(Price,X52.Week.Low) ~ Sector, data = financials, FUN = function(x) c(min(x),max(x))) ## Sector Price.1 Price.2 X52.Week.Low.1 X52.Week.Low.2 ## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990 ## 2 Consumer Staples 19.96 208.73 21.175 229.500 ## 3 Energy 2.82 169.16 6.590 199.830 ## 4 Financials 13.38 509.38 16.530 594.520 ## 5 Health Care 25.20 601.00 29.930 697.260 ## 6 Industrials 14.45 334.30 30.590 361.790 ## 7 Information Technology 11.22 1007.71 15.650 1198.000 ## 8 Materials 17.16 387.65 20.250 435.150 ## 9 Real Estate 14.01 409.98 21.530 495.345 ## 10 Telecommunication Services 16.20 49.04 27.610 54.770 ## 11 Utilities 10.06 145.29 12.050 159.640
With the aggregate function, possibilities are endless. We can do a lot using the aggregate function. What we have seen in the sections above is just a part. Here are some other arguments which you can add to the aggregate function based on your requirements.