Basic ideas on aggregate, plyr and crosstables!
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A common task using R is the investigation of one particular dataset. Usually we have a mixture of numerical and categorial data and are interested in some statistics (e.g. means and so on).
And there are a lot of threads, blogs etc around that. Sorry for adding another one, but so I remember myself.
Let’s take as example the dataset diamonds that is included in the ggplot2-package.
library(ggplot2) data(diamonds) str(diamonds) ## 'data.frame': 53940 obs. of 10 variables: ## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ... ## $ cut : Factor w/ 5 levels "Fair","Good",..: 5 4 2 4 2 3 3 3 1 3 ... ## $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ... ## $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 2 3 5 4 2 6 7 3 4 5 ... ## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ... ## $ table : num 55 61 65 58 58 57 57 55 61 61 ... ## $ price : int 326 326 327 334 335 336 336 337 337 338 ... ## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ... ## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ... ## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
We can see that the datsets consist of a mixture of variales and types.
Let’s say we are interest in the average price of diamonds with regards to the category of clarity.
Do it with agggregate
Aggregate comes with the stats package. The relevant call to answer our questions could look like
aggregate(price ~ clarity, diamonds, mean) ## clarity price ## 1 I1 3924 ## 2 SI2 5063 ## 3 SI1 3996 ## 4 VS2 3925 ## 5 VS1 3839 ## 6 VVS2 3284 ## 7 VVS1 2523 ## 8 IF 2865
Et voilà. Other way of typing would be the usage of the “list” argument as below.
We can also enlarge the list argument by other categorical variables, as shown as well, using clarity and color.
aggregate(diamonds[, c("price")], list(CLARITY = diamonds$clarity), mean) ## CLARITY x ## 1 I1 3924 ## 2 SI2 5063 ## 3 SI1 3996 ## 4 VS2 3925 ## 5 VS1 3839 ## 6 VVS2 3284 ## 7 VVS1 2523 ## 8 IF 2865 aggregate(diamonds[, c("price")], list(CLARITY = diamonds$clarity, COLOR = diamonds$color), mean) ## CLARITY COLOR x ## 1 I1 D 3863 ## 2 SI2 D 3931 ## 3 SI1 D 2976 ## 4 VS2 D 2587 ## 5 VS1 D 3030 ## 6 VVS2 D 3351 ## 7 VVS1 D 2948 ## 8 IF D 8307 ## 9 I1 E 3488 ## 10 SI2 E 4174 ## 11 SI1 E 3162 ## 12 VS2 E 2751 ## 13 VS1 E 2856 ## 14 VVS2 E 2500 ## 15 VVS1 E 2220 ## 16 IF E 3669 ## 17 I1 F 3342 ## 18 SI2 F 4473 ## 19 SI1 F 3714 ## 20 VS2 F 3757 ## 21 VS1 F 3797 ## 22 VVS2 F 3476 ## 23 VVS1 F 2804 ## 24 IF F 2751 ## 25 I1 G 3546 ## 26 SI2 G 5022 ## 27 SI1 G 3775 ## 28 VS2 G 4416 ## 29 VS1 G 4131 ## 30 VVS2 G 3845 ## 31 VVS1 G 2867 ## 32 IF G 2558 ## 33 I1 H 4453 ## 34 SI2 H 6100 ## 35 SI1 H 5032 ## 36 VS2 H 4722 ## 37 VS1 H 3781 ## 38 VVS2 H 2649 ## 39 VVS1 H 1846 ## 40 IF H 2288 ## 41 I1 I 4302 ## 42 SI2 I 7003 ## 43 SI1 I 5355 ## 44 VS2 I 5691 ## 45 VS1 I 4633 ## 46 VVS2 I 2968 ## 47 VVS1 I 2035 ## 48 IF I 1995 ## 49 I1 J 5254 ## 50 SI2 J 6521 ## 51 SI1 J 5186 ## 52 VS2 J 5311 ## 53 VS1 J 4884 ## 54 VVS2 J 5142 ## 55 VVS1 J 4034 ## 56 IF J 3364
We clearly achieve the same result using the above “formula notation”:
aggregate(price ~ clarity + color, diamonds, mean) ## clarity color price ## 1 I1 D 3863 ## 2 SI2 D 3931 ## 3 SI1 D 2976 ## 4 VS2 D 2587 ## 5 VS1 D 3030 ## 6 VVS2 D 3351 ## 7 VVS1 D 2948 ## 8 IF D 8307 ## 9 I1 E 3488 ## 10 SI2 E 4174 ## 11 SI1 E 3162 ## 12 VS2 E 2751 ## 13 VS1 E 2856 ## 14 VVS2 E 2500 ## 15 VVS1 E 2220 ## 16 IF E 3669 ## 17 I1 F 3342 ## 18 SI2 F 4473 ## 19 SI1 F 3714 ## 20 VS2 F 3757 ## 21 VS1 F 3797 ## 22 VVS2 F 3476 ## 23 VVS1 F 2804 ## 24 IF F 2751 ## 25 I1 G 3546 ## 26 SI2 G 5022 ## 27 SI1 G 3775 ## 28 VS2 G 4416 ## 29 VS1 G 4131 ## 30 VVS2 G 3845 ## 31 VVS1 G 2867 ## 32 IF G 2558 ## 33 I1 H 4453 ## 34 SI2 H 6100 ## 35 SI1 H 5032 ## 36 VS2 H 4722 ## 37 VS1 H 3781 ## 38 VVS2 H 2649 ## 39 VVS1 H 1846 ## 40 IF H 2288 ## 41 I1 I 4302 ## 42 SI2 I 7003 ## 43 SI1 I 5355 ## 44 VS2 I 5691 ## 45 VS1 I 4633 ## 46 VVS2 I 2968 ## 47 VVS1 I 2035 ## 48 IF I 1995 ## 49 I1 J 5254 ## 50 SI2 J 6521 ## 51 SI1 J 5186 ## 52 VS2 J 5311 ## 53 VS1 J 4884 ## 54 VVS2 J 5142 ## 55 VVS1 J 4034 ## 56 IF J 3364
Since I find the formula quotation much more intuitive I’ll continue using that one.
It allows also to pass further arguments, like the probabilities for the quantile-function:
aggregate(price ~ clarity + color, diamonds, FUN = quantile, probs = 0.8) ## clarity color price ## 1 I1 D 5139 ## 2 SI2 D 5257 ## 3 SI1 D 5096 ## 4 VS2 D 3305 ## 5 VS1 D 4057 ## 6 VVS2 D 4477 ## 7 VVS1 D 3781 ## 8 IF D 16015 ## 9 I1 E 4191 ## 10 SI2 E 5150 ## 11 SI1 E 5068 ## 12 VS2 E 3585 ## 13 VS1 E 3466 ## 14 VVS2 E 2857 ## 15 VVS1 E 2514 ## 16 IF E 6002 ## 17 I1 F 4570 ## 18 SI2 F 5163 ## 19 SI1 F 5292 ## 20 VS2 F 6598 ## 21 VS1 F 7265 ## 22 VVS2 F 8218 ## 23 VVS1 F 3501 ## 24 IF F 2656 ## 25 I1 G 6209 ## 26 SI2 G 6785 ## 27 SI1 G 5342 ## 28 VS2 G 7031 ## 29 VS1 G 7508 ## 30 VVS2 G 7952 ## 31 VVS1 G 3327 ## 32 IF G 2575 ## 33 I1 H 6530 ## 34 SI2 H 9257 ## 35 SI1 H 7173 ## 36 VS2 H 7510 ## 37 VS1 H 6796 ## 38 VVS2 H 4269 ## 39 VVS1 H 2304 ## 40 IF H 2449 ## 41 I1 I 5696 ## 42 SI2 I 12918 ## 43 SI1 I 8736 ## 44 VS2 I 9791 ## 45 VS1 I 8383 ## 46 VVS2 I 3934 ## 47 VVS1 I 2549 ## 48 IF I 2062 ## 49 I1 J 6576 ## 50 SI2 J 11865 ## 51 SI1 J 7952 ## 52 VS2 J 8055 ## 53 VS1 J 8507 ## 54 VVS2 J 8770 ## 55 VVS1 J 8803 ## 56 IF J 5361
However, we face one issue here: For “at a glance” investigation, the output is not easily readable because it is in “long-format” (is this the right expression?). What I’d like to have, is a kind of cross-table. The MS-Excel user might recognize this as “pivot tables”.
For this purpose we have to make us of the xtabs-function that creates a contingency table.
First, we create a dataframe using our aggregate function and then we pass this dataframe to xtabs, specifying that we want to see the price on the other variables:
dmnds <- aggregate(price ~ clarity + color, diamonds, mean) xtabs(price ~ ., data = dmnds) ## color ## clarity D E F G H I J ## I1 3863 3488 3342 3546 4453 4302 5254 ## SI2 3931 4174 4473 5022 6100 7003 6521 ## SI1 2976 3162 3714 3775 5032 5355 5186 ## VS2 2587 2751 3757 4416 4722 5691 5311 ## VS1 3030 2856 3797 4131 3781 4633 4884 ## VVS2 3351 2500 3476 3845 2649 2968 5142 ## VVS1 2948 2220 2804 2867 1846 2035 4034 ## IF 8307 3669 2751 2558 2288 1995 3364
That’s nice, isn’t it?
But as announced earlier, there is another way of achieving the same result: The ddply-function from the plyr-package.
dmnds.ddply <- ddply(.data = diamonds, .(clarity, color), summarise, PRICEMEAN = mean(price)) dmnds.ddply ## clarity color PRICEMEAN ## 1 I1 D 3863 ## 2 I1 E 3488 ## 3 I1 F 3342 ## 4 I1 G 3546 ## 5 I1 H 4453 ## 6 I1 I 4302 ## 7 I1 J 5254 ## 8 SI2 D 3931 ## 9 SI2 E 4174 ## 10 SI2 F 4473 ## 11 SI2 G 5022 ## 12 SI2 H 6100 ## 13 SI2 I 7003 ## 14 SI2 J 6521 ## 15 SI1 D 2976 ## 16 SI1 E 3162 ## 17 SI1 F 3714 ## 18 SI1 G 3775 ## 19 SI1 H 5032 ## 20 SI1 I 5355 ## 21 SI1 J 5186 ## 22 VS2 D 2587 ## 23 VS2 E 2751 ## 24 VS2 F 3757 ## 25 VS2 G 4416 ## 26 VS2 H 4722 ## 27 VS2 I 5691 ## 28 VS2 J 5311 ## 29 VS1 D 3030 ## 30 VS1 E 2856 ## 31 VS1 F 3797 ## 32 VS1 G 4131 ## 33 VS1 H 3781 ## 34 VS1 I 4633 ## 35 VS1 J 4884 ## 36 VVS2 D 3351 ## 37 VVS2 E 2500 ## 38 VVS2 F 3476 ## 39 VVS2 G 3845 ## 40 VVS2 H 2649 ## 41 VVS2 I 2968 ## 42 VVS2 J 5142 ## 43 VVS1 D 2948 ## 44 VVS1 E 2220 ## 45 VVS1 F 2804 ## 46 VVS1 G 2867 ## 47 VVS1 H 1846 ## 48 VVS1 I 2035 ## 49 VVS1 J 4034 ## 50 IF D 8307 ## 51 IF E 3669 ## 52 IF F 2751 ## 53 IF G 2558 ## 54 IF H 2288 ## 55 IF I 1995 ## 56 IF J 3364
Well, perhaps a bit verbose compared to the use of aggregate, but ddply allows great flexibility!
In order to bring it now into a “pivot table”, we can make use of (a) the xtabs-function again, or (b) use the cast-function in the reshape2-package (NB: Use dcast for dataframes):
dcast(dmnds.ddply, clarity ~ color, value.var = "PRICEMEAN") ## clarity D E F G H I J ## 1 I1 3863 3488 3342 3546 4453 4302 5254 ## 2 SI2 3931 4174 4473 5022 6100 7003 6521 ## 3 SI1 2976 3162 3714 3775 5032 5355 5186 ## 4 VS2 2587 2751 3757 4416 4722 5691 5311 ## 5 VS1 3030 2856 3797 4131 3781 4633 4884 ## 6 VVS2 3351 2500 3476 3845 2649 2968 5142 ## 7 VVS1 2948 2220 2804 2867 1846 2035 4034 ## 8 IF 8307 3669 2751 2558 2288 1995 3364
Again, we have a “formula interface” which makes the application straightforward. The parameter “value.var” specifies which column to take to apply the function to – very helpful in case of big datasets.
Alright, and last but not least a oneliner that achieves the same result, combining the melt-function and the cast-function (the melt-function is part of the reshape2 package and converts “datatables” in long format.)
dcast(melt(diamonds, id.vars = c("cut", "color", "clarity"), measure.vars = "price"), formula = clarity ~ color, mean) ## clarity D E F G H I J ## 1 I1 3863 3488 3342 3546 4453 4302 5254 ## 2 SI2 3931 4174 4473 5022 6100 7003 6521 ## 3 SI1 2976 3162 3714 3775 5032 5355 5186 ## 4 VS2 2587 2751 3757 4416 4722 5691 5311 ## 5 VS1 3030 2856 3797 4131 3781 4633 4884 ## 6 VVS2 3351 2500 3476 3845 2649 2968 5142 ## 7 VVS1 2948 2220 2804 2867 1846 2035 4034 ## 8 IF 8307 3669 2751 2558 2288 1995 3364
So first, the dataset is brought entirely in long format, and then cast is applied. But this clearly is the most time-consuming way of doing it (in terms of computation time)…
Bottom line:
My favorite function for not too big datasets is the combination of ddply and cast, sine it allows for most flexibility, I feel. That flexibility come with the price of syntax, which you’ll have to learn. Aggregate and xtabs are quickly applied for small problems (maybe also for big ones, but I do not understand how to apply for more complex tasks that ddply allows).
Additionally, I’ve heard talking about the data.table package, that seems to be the fastest. Go and check!
References:
mages’ blog on by, apply and friends
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.