Data manipulation in r using data frames – an extensive article of basics part2 – aggregation and sorting
[This article was first published on dataENQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Welcome to the second part of this two-part series on data manipulation in R. This article aims to present the reader with different ways of data aggregation and sorting. Here is the composition of this article.
- Aggregation in R
- Why aggregate data?
- Data used in this article
- Aggregation using base R
- Aggregation using the aggregate function
- Assigning a name to an aggregated column in aggregate function in R
- Aggregating multiple columns using an aggregate function in R
- Aggregating multiple columns data by multiple functions using the aggregate function in R
- Aggregation using by function
- Aggregation using sweep function
- Aggregation using dplyr
- Aggregation using group_by and summarize functions
- Aggregation by group_by and summarize functions with multiple variables and functions
- Aggregation using group_by and summarize functions with a range of variables
- Aggregation using group_by and summarize_if function
- Aggregation using group_by and summarize_at function
- Aggregation using group_by and summarize_all function
- Aggregation using data.table
- Converting data frame to data table
- Aggregating single variable using data table
- Assigning a custom name to aggregated variable in data table
- Aggregating multiple variables using multiple aggregation functions in data table
- Aggregating variables as well as filtering observations in data table
- Counting the number of observations within a group in data table
- Aggregation using sqldf
- Sorting in R
- Why sort data?
- Sorting using base R
- Sorting data using order function
- Sorting data in descending order using order function
- Sorting using dplyr
- Sorting data using arrange function
- Sorting data in descending order using arrange function
- Sorting using data.table
- Sorting data using order function
- Sorting data in descending order using order function
- Sorting data using setorder function
- Sorting data in descending order using setorder function
- Sorting using sqldf
Aggregation in R
To begin with, let’s look at the term aggregation. As per the Cambridge dictionary, “the process of combining things or amounts into a single group or total” is aggregation.
Why aggregate data?
Usually, the data has two types, qualitative and quantitative. These two are statistical terms. Qualitative data defines the characteristics of the data. Labels, properties, attributes, and categories are all examples of qualitative data. As the name suggests, data that express the quality is qualitative data. On the other hand, quantitative data represent numbers. In other disciplines like data warehousing or business intelligence, qualitative data is equivalent to dimensions and quantitative data to measures.
Data analysis is a complex process and involves several steps. Some of these steps may include data to be examined by its quality. Usually, the qualitative data’s granularity is higher. Granularity is the level of detail.
For example, a dataset that contains all countries of the world may have multiple variables describing qualitative data. The name of the country would be at the most granular level, as all countries’ names would be unique and, no two countries will have the same name. Whereas, granularity level would rise as we look at the countries by continents and then by the hemisphere.
Similarly, in an analysis, data is examined on various levels by its different qualities. At this point, aggregation comes into the picture. It is required if you want to explore quantitative data elements by its quality that sits higher than the most granular level.
In this article, we will practice different aggregation functions and options offered by base R and other packages like dplyr and data.table. Note that this article does not aim to list all the functions that (if used in a way) can aggregate data, whereas the aim here is to explain the concept of data aggregation. R is a rich language that offers different ways of doing the same thing.
Data used in this article
We will use the data that we have prepared in our previous post How to prepare data for analysis in R. Click on this link to download the original data. This data is available under the PDDL license. However, to save space and make it cleaner, we will remove a few columns from the data frame. It’s an opportunity for us to revise how to remove a column from the data frame. Let’s first create the data frame and understand the data.
financials <- read.csv("constituents-financials_csv.csv") str(financials) ## 'data.frame': 505 obs. of 14 variables: ## $ Symbol : chr "MMM" "AOS" "ABT" "ABBV" ... ## $ Name : chr "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ... ## $ Sector : chr "Industrials" "Industrials" "Health Care" "Health Care" ... ## $ Price : num 222.9 60.2 56.3 108.5 150.5 ... ## $ Price.Earnings: num 24.3 27.8 22.5 19.4 25.5 ... ## $ Dividend.Yield: num 2.33 1.15 1.91 2.5 1.71 ... ## $ Earnings.Share: num 7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ... ## $ X52.Week.Low : num 259.8 68.4 64.6 125.9 162.6 ... ## $ X52.Week.High : num 175.5 48.9 42.3 60 114.8 ... ## $ Market.Cap : num 1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ... ## $ EBITDA : num 9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ... ## $ Price.Sales : num 4.39 3.58 3.74 6.29 2.6 ... ## $ Price.Book : num 11.34 6.35 3.19 26.14 10.62 ... ## $ SEC.Filings : chr "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ...
Now we have our data frame ready, let's reshape our data frame by selecting a few columns to keep the results clean and tidy.
financials <- financials %>% select(Symbol, Sector, Price, X52.Week.Low, X52.Week.High) head(financials,10) ## Symbol Sector Price X52.Week.Low X52.Week.High ## 1 MMM Industrials 222.89 259.770 175.490 ## 2 AOS Industrials 60.24 68.390 48.925 ## 3 ABT Health Care 56.27 64.600 42.280 ## 4 ABBV Health Care 108.48 125.860 60.050 ## 5 ACN Information Technology 150.51 162.600 114.820 ## 6 ATVI Information Technology 65.83 74.945 38.930 ## 7 AYI Industrials 145.41 225.360 142.000 ## 8 ADBE Information Technology 185.16 204.450 114.451 ## 9 AAP Consumer Discretionary 109.63 169.550 78.810 ## 10 AMD Information Technology 11.22 15.650 9.700
Our final data now has five variables and 505 observations.
Aggregation using base R
Aggregation using the aggregate function
When talking about aggregation, the aggregate function is the most obvious choice. We can use the aggregate function with data frames and time series. We can be specific with the signature if we are using the data frame by writing aggregate.data.frame. Both aggregate and aggregate.data.frame will result in the same data.
Aggregate is a generic function that can be used for both data frames and time series.
We will look at this function by different scenarios. Before the use of any aggregation, the use case must exist. It means why you want to aggregate data. This information is crucial as this will guide you to extract correctly summarized data that will answer your question.
So the scenario here is to "find out the total price of all shares by all sectors". If you recall from the section above on qualitative and quantitative data, the variables sector and price represents qualitative and quantitative data, respectively. This information is crucial to put the variables into their correct place within the function.
aggregate(financials$Price, by = list(financials$Sector), FUN = sum) ## Group.1 x ## 1 Consumer Discretionary 10418.90 ## 2 Consumer Staples 2711.98 ## 3 Energy 1852.40 ## 4 Financials 6055.81 ## 5 Health Care 8083.46 ## 6 Industrials 7831.47 ## 7 Information Technology 8347.00 ## 8 Materials 2559.67 ## 9 Real Estate 2927.52 ## 10 Telecommunication Services 100.81 ## 11 Utilities 1545.45
Assigning a name to an aggregated column in aggregate function in R
The result above shows that the function aggregate split the original data into small subsets by sector variable and applied sum function over the price variable. The result above shows x as the name of the summarised column. If you want to rename it, a little tweak in the code using setName function will do the trick. Here is an example.
setNames(aggregate(financials$Price, by = list(financials$Sector), FUN = sum),c("Sector","Total.Price")) ## Sector Total.Price ## 1 Consumer Discretionary 10418.90 ## 2 Consumer Staples 2711.98 ## 3 Energy 1852.40 ## 4 Financials 6055.81 ## 5 Health Care 8083.46 ## 6 Industrials 7831.47 ## 7 Information Technology 8347.00 ## 8 Materials 2559.67 ## 9 Real Estate 2927.52 ## 10 Telecommunication Services 100.81 ## 11 Utilities 1545.45
Aggregating multiple columns using the aggregate function in R
Now to apply the same function on multiple variables, all you have to do is to supply an expression to subset the required columns from the data frame as an argument. Here is an example, we have used setName function on top to assign meaningful names to the variables.
The scenario here is to "find out the average price, average 52-weeks low price, and the average 52-week high price of all shares by all sectors".
setNames(aggregate(financials[,c("Price","X52.Week.Low","X52.Week.High")], by = list(financials$Sector), FUN = mean), c("Sector","Average.Price","Average.Low","Average.High")) ## Sector Average.Price Average.Low Average.High ## 1 Consumer Discretionary 124.03452 146.93143 96.09236 ## 2 Consumer Staples 79.76412 92.83229 68.92944 ## 3 Energy 57.88750 72.58969 48.14123 ## 4 Financials 89.05603 101.82185 72.69447 ## 5 Health Care 132.51574 160.75853 103.71925 ## 6 Industrials 116.88761 134.57948 90.83702 ## 7 Information Technology 119.24286 138.77864 91.89142 ## 8 Materials 102.38680 118.03885 85.58325 ## 9 Real Estate 88.71273 110.55045 82.87809 ## 10 Telecommunication Services 33.60333 41.69333 29.50367 ## 11 Utilities 55.19464 68.49732 52.80232
Aggregating multiple columns data by multiple functions using the aggregate function in R
Moving on to the next level of the scenario, we would like to apply multiple functions over the different variables. It is possible by defining a custom function. The requirement now is to "find out the minimum and maximum values of price, 52-weeks low price of all shares by all sectors". Here is an example.
aggregate(financials[c("Price","X52.Week.Low")], by = list(financials$Sector), FUN = function(x) c(min(x),max(x))) ## Group.1 Price.1 Price.2 X52.Week.Low.1 X52.Week.Low.2 ## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990 ## 2 Consumer Staples 19.96 208.73 21.175 229.500 ## 3 Energy 2.82 169.16 6.590 199.830 ## 4 Financials 13.38 509.38 16.530 594.520 ## 5 Health Care 25.20 601.00 29.930 697.260 ## 6 Industrials 14.45 334.30 30.590 361.790 ## 7 Information Technology 11.22 1007.71 15.650 1198.000 ## 8 Materials 17.16 387.65 20.250 435.150 ## 9 Real Estate 14.01 409.98 21.530 495.345 ## 10 Telecommunication Services 16.20 49.04 27.610 54.770 ## 11 Utilities 10.06 145.29 12.050 159.640
The above command can be written in different ways. Here are a couple of examples. Note the use of cbind and formula (~).
aggregate(cbind(financials$Price,financials$X52.Week.Low) ~financials$Sector, FUN = function(x) c(min(x),max(x))) ## financials$Sector V1.1 V1.2 V2.1 V2.2 ## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990 ## 2 Consumer Staples 19.96 208.73 21.175 229.500 ## 3 Energy 2.82 169.16 6.590 199.830 ## 4 Financials 13.38 509.38 16.530 594.520 ## 5 Health Care 25.20 601.00 29.930 697.260 ## 6 Industrials 14.45 334.30 30.590 361.790 ## 7 Information Technology 11.22 1007.71 15.650 1198.000 ## 8 Materials 17.16 387.65 20.250 435.150 ## 9 Real Estate 14.01 409.98 21.530 495.345 ## 10 Telecommunication Services 16.20 49.04 27.610 54.770 ## 11 Utilities 10.06 145.29 12.050 159.640 aggregate(cbind(Price,X52.Week.Low) ~ Sector, data = financials, FUN = function(x) c(min(x),max(x))) ## Sector Price.1 Price.2 X52.Week.Low.1 X52.Week.Low.2 ## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990 ## 2 Consumer Staples 19.96 208.73 21.175 229.500 ## 3 Energy 2.82 169.16 6.590 199.830 ## 4 Financials 13.38 509.38 16.530 594.520 ## 5 Health Care 25.20 601.00 29.930 697.260 ## 6 Industrials 14.45 334.30 30.590 361.790 ## 7 Information Technology 11.22 1007.71 15.650 1198.000 ## 8 Materials 17.16 387.65 20.250 435.150 ## 9 Real Estate 14.01 409.98 21.530 495.345 ## 10 Telecommunication Services 16.20 49.04 27.610 54.770 ## 11 Utilities 10.06 145.29 12.050 159.640
With the aggregate function, possibilities are endless. We can do a lot using the aggregate function. What we have seen in the sections above is just a part. Here are some other arguments which you can add to the aggregate function based on your requirements.
Argument | Use |
na.action | Use this to specify if your data contain NA or missing values and how you want to handle them |
simplify | If used results will be presented in vector or matrix form |
formula | Depending on subsetting requirement formulas like y ~ x or cbind(y1,y2) ~ x1 + x2 can be used |
data | The data frame in question which has the variables to subset and aggregate |
drop | Use if you want to drop unused combination of grouping values |
subset | Use it to limit the operation to certain observations |
Lastly, to name a few, you can use sum, count, first, last, mean, median, min, max, and sd functions with the aggregate function. If you think this list should have more function names, then please add a comment in the comment box below.
Aggregation using by function
The by function is different from aggregate as it is a wrapper for tapply function, this means it also encapsulates the functionality of tapply function. The return value from "by" function depends on the usage of simplify argument. If the value of simplify argument is TRUE then it returns a list or array otherwise it always returns a list. Here is an example of by function. In this example, we are trying to find the sum of price variable by sector.
All other arguments in this function are self-explanatory except INDICES that represent factors or a list of factors.
by(data=financials$Price, INDICES = financials$Sector, FUN = sum, simplify = TRUE) ## financials$Sector: Consumer Discretionary ## [1] 10418.9 ## ------------------------------------------------------------ ## financials$Sector: Consumer Staples ## [1] 2711.98 ## ------------------------------------------------------------ ## financials$Sector: Energy ## [1] 1852.4 ## ------------------------------------------------------------ ## financials$Sector: Financials ## [1] 6055.81 ## ------------------------------------------------------------ ## financials$Sector: Health Care ## [1] 8083.46 ## ------------------------------------------------------------ ## financials$Sector: Industrials ## [1] 7831.47 ## ------------------------------------------------------------ ## financials$Sector: Information Technology ## [1] 8347 ## ------------------------------------------------------------ ## financials$Sector: Materials ## [1] 2559.67 ## ------------------------------------------------------------ ## financials$Sector: Real Estate ## [1] 2927.52 ## ------------------------------------------------------------ ## financials$Sector: Telecommunication Services ## [1] 100.81 ## ------------------------------------------------------------ ## financials$Sector: Utilities ## [1] 1545.45
Aggregation using sweep function
sweep(financials["Price"], MARGIN = 1, STATS = 0,FUN = sum) ## [1] 52434.47
The function above is returning the sum of variable Price. Over the years, I didn't use the sweep function for my work. If you know the aggregation scenario where this function would be most useful, then please comment in the comment box below. All other arguments are obvious except STATS, my understanding of this argument is that this is the value that you would like to apply to the variable in conjunction with the function argument. So, I have set STATS to 0 and used sum as the function, which results in the total Price variable. If I set STATS to 1, then all the values of variable Price would be increased with one, and the total will be different.
Though sweep function doesn't result in grouped or subsetted data as others, I thought this function is worth mentioning. Similarly, other functions from the apply family are available to aggregate data, but you may encounter limitations on grouping or performance requirements. If you want, you can achieve aggregation results using these functions, but there are better and faster options available, and from this point onwards, we will concentrate on those.
Aggregation using dplyr
Aggregation using group_by and summarize functions
The three functions we have seen above may fit into different aggregation scenarios, but in terms of ease of use, I consider the group_by and summarize functions of the dplyr package. The performance of functions is subjective and dependent on the size and operation of the data. I personally never evaluated the performance though wherever I have read, group_by is praised for its performance. Let's look at an example of aggregating data using the group by function.
The scenario here is to "find out the total price of all shares by all sectors".
financials %>% group_by(Sector) %>% summarize(total.price = sum(Price)) ## `summarise()` ungrouping output (override with `.groups` argument) ## # A tibble: 11 x 2 ## Sector total.price ## <chr> <dbl> ## 1 Consumer Discretionary 10419. ## 2 Consumer Staples 2712. ## 3 Energy 1852. ## 4 Financials 6056. ## 5 Health Care 8083. ## 6 Industrials 7831. ## 7 Information Technology 8347 ## 8 Materials 2560. ## 9 Real Estate 2928. ## 10 Telecommunication Services 101. ## 11 Utilities 1545.Simply supplying the variable name within the group_by function is enough to set the grouping of observations by the value of the selected variable. The summarize function is the engine where the data is operated upon using the function of your choice. In the example above, I have chosen the sum function which is applied over grouped observation. This strategy or method of aggregation is known as split-apply-combine. First, we are splitting the observations into groups followed by applying a function and then combining results to present.
If you were observant, then you may have noticed a warning as well. This warning is a hint to remind you how you want to control the grouping of data. You can suppress this warning by using options(dplyr.summarise.inform = FALSE). For more information, please click this link.
Also, did you notice how the results are printed above?
Nothing wrong with that, it is because the return value is in the form of tibble. You can print.data.frame() function to print is in a nice format. Here is an example.
financials %>% group_by(Sector) %>% summarize(total.price = sum(Price)) %>% print.data.frame() ## `summarise()` ungrouping output (override with `.groups` argument) ## Sector total.price ## 1 Consumer Discretionary 10418.90 ## 2 Consumer Staples 2711.98 ## 3 Energy 1852.40 ## 4 Financials 6055.81 ## 5 Health Care 8083.46 ## 6 Industrials 7831.47 ## 7 Information Technology 8347.00 ## 8 Materials 2559.67 ## 9 Real Estate 2927.52 ## 10 Telecommunication Services 100.81 ## 11 Utilities 1545.45
Lastly, did you notice that the summarised variable is now named total.price? By simply putting the new name in front of the aggregation function will do the trick. It is not mandatory, but I am sure you wouldn't like the default name.
Aggregation by group_by and summarize functions with multiple variables and functions
Let's progress ahead with our scenario and find out the total price, minimum 52-week low and maximum 52-week high price of all shares and by all sectors. The aim here is to show the use of multiple variables with multiple aggregation functions.
financials %>% group_by(Sector) %>% summarize(total.price = sum(Price), min.52.week.low = min(X52.Week.Low), max.52.week.high = max(X52.Week.High)) %>% print.data.frame() ## `summarise()` ungrouping output (override with `.groups` argument) ## Sector total.price min.52.week.low max.52.week.high ## 1 Consumer Discretionary 10418.90 13.480 1589.0000 ## 2 Consumer Staples 2711.98 21.175 152.0100 ## 3 Energy 1852.40 6.590 125.4600 ## 4 Financials 6055.81 16.530 368.0000 ## 5 Health Care 8083.46 29.930 459.3400 ## 6 Industrials 7831.47 30.590 256.4000 ## 7 Information Technology 8347.00 15.650 824.3000 ## 8 Materials 2559.67 20.250 302.0101 ## 9 Real Estate 2927.52 21.530 361.9000 ## 10 Telecommunication Services 100.81 27.610 42.8000 ## 11 Utilities 1545.45 12.050 124.1800
Also, we can supply more than one variable in the group_by function if you want to group data by multiple variables.
Aggregation using group_by and summarize functions with a range of variables
You can use the summarize function in a variety of ways. If you have several variables to aggregate, then the following type of command would help.
financials %>% group_by(Sector) %>% summarize(across(Price:X52.Week.High, ~sum(.x))) %>% print.data.frame() ## `summarise()` ungrouping output (override with `.groups` argument) ## Sector Price X52.Week.Low X52.Week.High ## 1 Consumer Discretionary 10418.90 12342.240 8071.759 ## 2 Consumer Staples 2711.98 3156.298 2343.601 ## 3 Energy 1852.40 2322.870 1540.519 ## 4 Financials 6055.81 6923.886 4943.224 ## 5 Health Care 8083.46 9806.271 6326.874 ## 6 Industrials 7831.47 9016.825 6086.081 ## 7 Information Technology 8347.00 9714.505 6432.399 ## 8 Materials 2559.67 2950.971 2139.581 ## 9 Real Estate 2927.52 3648.165 2734.977 ## 10 Telecommunication Services 100.81 125.080 88.511 ## 11 Utilities 1545.45 1917.925 1478.465
Aggregation using group_by and summarize_if function
Or, if you want to aggregate all of the numeric variables from your data frame, then the following version of summarize function would help.
financials %>% group_by(Sector) %>% summarize_if(is.numeric,sum) %>% print.data.frame() ## Sector Price X52.Week.Low X52.Week.High ## 1 Consumer Discretionary 10418.90 12342.240 8071.759 ## 2 Consumer Staples 2711.98 3156.298 2343.601 ## 3 Energy 1852.40 2322.870 1540.519 ## 4 Financials 6055.81 6923.886 4943.224 ## 5 Health Care 8083.46 9806.271 6326.874 ## 6 Industrials 7831.47 9016.825 6086.081 ## 7 Information Technology 8347.00 9714.505 6432.399 ## 8 Materials 2559.67 2950.971 2139.581 ## 9 Real Estate 2927.52 3648.165 2734.977 ## 10 Telecommunication Services 100.81 125.080 88.511 ## 11 Utilities 1545.45 1917.925 1478.465
Aggregation using group_by and summarize_at function
Summarize_at, on the other hand, helps you to aggregate more than one variable in one go. The requirement is that the variables are supplied as vector or vars.
financials %>% group_by(Sector) %>% summarize_at(c("Price","X52.Week.High"),sum) %>% print.data.frame() ## Sector Price X52.Week.High ## 1 Consumer Discretionary 10418.90 8071.759 ## 2 Consumer Staples 2711.98 2343.601 ## 3 Energy 1852.40 1540.519 ## 4 Financials 6055.81 4943.224 ## 5 Health Care 8083.46 6326.874 ## 6 Industrials 7831.47 6086.081 ## 7 Information Technology 8347.00 6432.399 ## 8 Materials 2559.67 2139.581 ## 9 Real Estate 2927.52 2734.977 ## 10 Telecommunication Services 100.81 88.511 ## 11 Utilities 1545.45 1478.465
Aggregation using group_by and summarize_all function
Lastly, summarize_all function helps to aggregate all the values from the data frame, and we can use multiple aggregation functions as well. Note in our data frame we have two character variables, let's prepare the data for using summarize_all function and aggregate data to show total and minimum values of all variable across the data frame.
financials %>% select(-Symbol) %>% group_by(Sector) %>% summarize_all(c(sum,min)) ## # A tibble: 11 x 7 ## Sector Price_fn1 X52.Week.Low_fn1 X52.Week.High_f… Price_fn2 X52.Week.Low_fn2 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Consu… 10419. 12342. 8072. 10.4 13.5 ## 2 Consu… 2712. 3156. 2344. 20.0 21.2 ## 3 Energy 1852. 2323. 1541. 2.82 6.59 ## 4 Finan… 6056. 6924. 4943. 13.4 16.5 ## 5 Healt… 8083. 9806. 6327. 25.2 29.9 ## 6 Indus… 7831. 9017. 6086. 14.4 30.6 ## 7 Infor… 8347 9715. 6432. 11.2 15.6 ## 8 Mater… 2560. 2951. 2140. 17.2 20.2 ## 9 Real … 2928. 3648. 2735. 14.0 21.5 ## 10 Telec… 101. 125. 88.5 16.2 27.6 ## 11 Utili… 1545. 1918. 1478. 10.1 12.0 ## # … with 1 more variable: X52.Week.High_fn2 <dbl>
Aggregation using data.table
Converting data frame to data table
Aggregation using the data table is a step further from dplyr in terms of simplicity. We write the aggregation code as if we are subsetting the data frame. Before we progress, let's convert our data frame into a data table.
financials <- setDT(financials) class(financials) ## [1] "data.table" "data.frame"
Aggregating single variable using data table
Let's look at an example where the scenario is to find out the total price of all shares by all sectors.
financials[,sum(Price), by=Sector] ## Sector V1 ## 1: Industrials 7831.47 ## 2: Health Care 8083.46 ## 3: Information Technology 8347.00 ## 4: Consumer Discretionary 10418.90 ## 5: Utilities 1545.45 ## 6: Financials 6055.81 ## 7: Materials 2559.67 ## 8: Real Estate 2927.52 ## 9: Consumer Staples 2711.98 ## 10: Energy 1852.40 ## 11: Telecommunication Services 100.81
Assigning a custom name to an aggregated variable in data table
The result above shows the aggregated value by the selected variable. However, the aggregated variable's name is not friendly. Let's give it a proper name.
financials[,list(total.price=sum(Price)), by=Sector]
financials[,.(total.price=sum(Price)), by=Sector] ## Sector total.price ## 1: Industrials 7831.47 ## 2: Health Care 8083.46 ## 3: Information Technology 8347.00 ## 4: Consumer Discretionary 10418.90 ## 5: Utilities 1545.45 ## 6: Financials 6055.81 ## 7: Materials 2559.67 ## 8: Real Estate 2927.52 ## 9: Consumer Staples 2711.98 ## 10: Energy 1852.40 ## 11: Telecommunication Services 100.81
Aggregating multiple variables using multiple aggregation functions in data table
financials[,.(total.price=sum(Price), mim.52.week.low=min(X52.Week.Low)), by=Sector]
Or
financials[,list(total.price=sum(Price), mim.52.week.low=min(X52.Week.Low)), by=Sector] ## Sector total.price mim.52.week.low ## 1: Industrials 7831.47 30.590 ## 2: Health Care 8083.46 29.930 ## 3: Information Technology 8347.00 15.650 ## 4: Consumer Discretionary 10418.90 13.480 ## 5: Utilities 1545.45 12.050 ## 6: Financials 6055.81 16.530 ## 7: Materials 2559.67 20.250 ## 8: Real Estate 2927.52 21.530 ## 9: Consumer Staples 2711.98 21.175 ## 10: Energy 1852.40 6.590 ## 11: Telecommunication Services 100.81 27.610
Aggregating variables as well as filtering observations in data table
What if we want to aggregate as well as filter the observations by supplying specific conditions. The data table offers a solution to this. Here is an example.
financials[Sector == "Industrials",list(total.price=sum(Price), mim.52.week.low=min(X52.Week.Low)), by=Sector] ## Sector total.price mim.52.week.low ## 1: Industrials 7831.47 30.59
Counting the number of observations within a group in data table
Although we can use different aggregation functions, what if we want to count the observations within a group? We can do this using .N special variable.
financials[,.N, by=Sector] ## Sector N ## 1: Industrials 67 ## 2: Health Care 61 ## 3: Information Technology 70 ## 4: Consumer Discretionary 84 ## 5: Utilities 28 ## 6: Financials 68 ## 7: Materials 25 ## 8: Real Estate 33 ## 9: Consumer Staples 34 ## 10: Energy 32 ## 11: Telecommunication Services 3
If the requirement is to group observations using multiple variables, then the following syntax can be used for by argument.
by=.(variable to group 1, variable to group 2)
Aggregation using sqldf
I love sqldf, it makes things easy, but if you are not familiar with SQL then it may be a little challenging. Aggregation in sqldf require us to understand SQL, and hence I would just list one example here. SQL is an easy language to learn and I will soon be posing articles on SQL.
Let's define our scenario, find out the total price and minimum 52-week low price from all shares by all sectors where aggregated total price is greater than 5000. This scenario includes more than one variable with the use of more than one aggregation function as well as filtration criterion over grouped observations.
sqldf('select Sector, sum(Price), min("X52.Week.Low") from financials group by Sector having sum(Price) > 5000') ## Sector sum(Price) min("X52.Week.Low") ## 1 Consumer Discretionary 10418.90 13.48 ## 2 Financials 6055.81 16.53 ## 3 Health Care 8083.46 29.93 ## 4 Industrials 7831.47 30.59 ## 5 Information Technology 8347.00 15.65
Let's give aggregated columns a custom name.
sqldf('select Sector, sum(Price) as "total.price", min("X52.Week.Low") as "min.52.week.low" from financials group by Sector having sum(Price) > 5000') ## Sector total.price min.52.week.low ## 1 Consumer Discretionary 10418.90 13.48 ## 2 Financials 6055.81 16.53 ## 3 Health Care 8083.46 29.93 ## 4 Industrials 7831.47 30.59 ## 5 Information Technology 8347.00 15.65
Sorting in R
Why sort data?
It is a fact that the human brain looks for patterns. Identifying patterns is crucial while performing analysis, and sorted data helps a lot in finding these patterns. Sort help understanding the data, moreover, in profiling the data. Sort functionality help put data into an order which can provide information to help your analysis.
Sorting data using base R
Sorting data using order function
In base R, function order help sort the data. The usage of the order function is straight forward. Here is an example.
financials[order(Symbol),c("Symbol","Price")] ## Symbol Price ## 1: A 65.05 ## 2: AAL 48.60 ## 3: AAP 109.63 ## 4: AAPL 155.15 ## 5: ABBV 108.48 ## --- ## 501: XYL 70.24 ## 502: YUM 76.30 ## 503: ZBH 115.53 ## 504: ZION 50.71 ## 505: ZTS 71.51
Sorting data in descending order using order function
We are subsetting variables Symbol and Price from data frame financials and sorting the output by Symbol variable's values. If we want to sort the data in descending order, then function desc comes to our rescue.
financials[order(desc(Symbol)),c("Symbol","Price")] ## Symbol Price ## 1: ZTS 71.51 ## 2: ZION 50.71 ## 3: ZBH 115.53 ## 4: YUM 76.30 ## 5: XYL 70.24 ## --- ## 501: ABBV 108.48 ## 502: AAPL 155.15 ## 503: AAP 109.63 ## 504: AAL 48.60 ## 505: A 65.05
Sorting data using dplyr
Sorting data using the arrange function
dplyr package has a function arrange which help sort data. Here is an example.
financials %>% group_by(Sector) %>% summarize(sum(Price)) %>% arrange(Sector) %>% print.data.frame() ## `summarise()` ungrouping output (override with `.groups` argument) ## Sector sum(Price) ## 1 Consumer Discretionary 10418.90 ## 2 Consumer Staples 2711.98 ## 3 Energy 1852.40 ## 4 Financials 6055.81 ## 5 Health Care 8083.46 ## 6 Industrials 7831.47 ## 7 Information Technology 8347.00 ## 8 Materials 2559.67 ## 9 Real Estate 2927.52 ## 10 Telecommunication Services 100.81 ## 11 Utilities 1545.45
Sorting data in descending order using arrange function
Let's reorder the data in descending order.
financials %>% group_by(Sector) %>% summarize(sum(Price)) %>% arrange(desc(Sector)) %>% print.data.frame() ## `summarise()` ungrouping output (override with `.groups` argument) ## Sector sum(Price) ## 1 Utilities 1545.45 ## 2 Telecommunication Services 100.81 ## 3 Real Estate 2927.52 ## 4 Materials 2559.67 ## 5 Information Technology 8347.00 ## 6 Industrials 7831.47 ## 7 Health Care 8083.46 ## 8 Financials 6055.81 ## 9 Energy 1852.40 ## 10 Consumer Staples 2711.98 ## 11 Consumer Discretionary 10418.90
Sorting data using data table
Sorting data using order function
Data table has two functions order and setorder to help sort the data. Here is an example but first convert the data frame to data table and then sort.
financials <- setDT(financials) financials[order(Sector),sum(Price), by=Sector] ## Sector V1 ## 1: Consumer Discretionary 10418.90 ## 2: Consumer Staples 2711.98 ## 3: Energy 1852.40 ## 4: Financials 6055.81 ## 5: Health Care 8083.46 ## 6: Industrials 7831.47 ## 7: Information Technology 8347.00 ## 8: Materials 2559.67 ## 9: Real Estate 2927.52 ## 10: Telecommunication Services 100.81 ## 11: Utilities 1545.45
Sorting data in descending order using order function
Let's reverse the order now using desc function.
financials[order(desc(Sector)),sum(Price), by=Sector] ## Sector V1 ## 1: Utilities 1545.45 ## 2: Telecommunication Services 100.81 ## 3: Real Estate 2927.52 ## 4: Materials 2559.67 ## 5: Information Technology 8347.00 ## 6: Industrials 7831.47 ## 7: Health Care 8083.46 ## 8: Financials 6055.81 ## 9: Energy 1852.40 ## 10: Consumer Staples 2711.98 ## 11: Consumer Discretionary 10418.90
Sorting data using setorder function
Now let's look at the sorting using setorder function.
order.data <- setorder(financials[,c("Symbol","Price")],Symbol) order.data ## Symbol Price ## 1: A 65.05 ## 2: AAL 48.60 ## 3: AAP 109.63 ## 4: AAPL 155.15 ## 5: ABBV 108.48 ## --- ## 501: XYL 70.24 ## 502: YUM 76.30 ## 503: ZBH 115.53 ## 504: ZION 50.71 ## 505: ZTS 71.51
Sorting data in descending order using setorder function
Same setorder function can be used to reverse the order of the data by simply putting a minus sign in the front. Here is an example.
order.data <- setorder(financials[,c("Symbol","Price")],-Symbol) order.data ## Symbol Price ## 1: ZTS 71.51 ## 2: ZION 50.71 ## 3: ZBH 115.53 ## 4: YUM 76.30 ## 5: XYL 70.24 ## --- ## 501: ABBV 108.48 ## 502: AAPL 155.15 ## 503: AAP 109.63 ## 504: AAL 48.60 ## 505: A 65.05
Sorting data using sqldf
Data sorting using sqldf is simply achieved using the "order by" clause in the SQL command.
financials <- read.csv("constituents-financials_csv.csv") sqldf('select Sector, sum(Price), min("X52.Week.Low") from financials group by Sector order by Sector') ## Sector sum(Price) min("X52.Week.Low") ## 1 Consumer Discretionary 10418.90 13.480 ## 2 Consumer Staples 2711.98 21.175 ## 3 Energy 1852.40 6.590 ## 4 Financials 6055.81 16.530 ## 5 Health Care 8083.46 29.930 ## 6 Industrials 7831.47 30.590 ## 7 Information Technology 8347.00 15.650 ## 8 Materials 2559.67 20.250 ## 9 Real Estate 2927.52 21.530 ## 10 Telecommunication Services 100.81 27.610 ## 11 Utilities 1545.45 12.050
Reversing the order of the data using desc clause. Here is an example.
sqldf('select Sector, sum(Price), min("X52.Week.Low") from financials group by Sector order by Sector desc') ## Sector sum(Price) min("X52.Week.Low") ## 1 Utilities 1545.45 12.050 ## 2 Telecommunication Services 100.81 27.610 ## 3 Real Estate 2927.52 21.530 ## 4 Materials 2559.67 20.250 ## 5 Information Technology 8347.00 15.650 ## 6 Industrials 7831.47 30.590 ## 7 Health Care 8083.46 29.930 ## 8 Financials 6055.81 16.530 ## 9 Energy 1852.40 6.590 ## 10 Consumer Staples 2711.98 21.175 ## 11 Consumer Discretionary 10418.90 13.480
Concludingly, you may have noticed that we have not discussed NA values at all. Almost all aggregation function allows na.rm argument. Please try and comment if you find out that any particular function did not support it.
Well done, you've made to the end. Thank you for reading this article. I hope you've liked it. Please share it with your friends and if you are agreeing or disagreeing with any point above then please comment in the comment box below.
To leave a comment for the author, please follow the link and comment on their blog: dataENQ.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.