Site icon R-bloggers

R 101 – Aggregate By Quarter

[This article was first published on Data Driven Security, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We were asked a question on how to (in R) aggregate quarterly data from what I believe was a daily time series. This is a pretty common task and there are many ways to do this in R, but we’ll focus on one method using the zoo and dplyr packages. Let’t get those imports out of the way:

library(dplyr)
library(zoo)
library(ggplot2)

Now, we need some data. This could be from a database, log file or even Excel spreadsheet or CSV. Since we’re focusing on the aggregation and not the parsing, let’s generate some data, for daily failed logins in calendar year 2014:

set.seed(1492)

yr_2014 <- seq(from=as.Date("2014-01-01"), 
                              to=as.Date("2014-12-31"), 
                              by="day")

logins <- data_frame(date=yr_2014,
                     failures=round(rlnorm(length(yr_2014)) * 
                                      sample(10:50, 1)), 0.5, 3)

glimpse(logins)

## Observations: 365
## Variables:
## $ date     (date) 2014-01-01, 2014-01-02, 2014-01-03, 2014-01-04, 2014...
## $ failures (dbl) 18, 13, 6, 91, 24, 46, 14, 34, 10, 48, 45, 11, 8, 40,...
## $ 0.5      (dbl) 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5...
## $ 3        (dbl) 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...

Using set.seed makes the pseudo-random draws via rlnorm repeatable on other systems. We can get a better look at that data:

ggplot(logins, aes(x=date, y=failures)) + 
  geom_bar(stat="identity") +
  labs(x=NULL, y="# Login Failures\n") +
  theme_bw() +
  theme(panel.grid=element_blank()) +
  theme(panel.border=element_blank())

We can then, summarize the number of failed logins by quarter using as.yearqtr:

logins %>% 
  mutate(qtr=as.yearqtr(date)) %>% 
  count(qtr, wt=failures) -> total_failed_logins_by_qtr

total_failed_logins_by_qtr

## Source: local data frame [4 x 2]
## 
##       qtr    n
## 1 2014 Q1 4091
## 2 2014 Q2 5915
## 3 2014 Q3 6141
## 4 2014 Q4 5229

NOTE: you can control the way those quarter labels look with the format parater to as.yearqtr:

format

character string specifying format. "%C", "%Y", "%y" and "%q", if present, are replaced with the century, year, last two digits of the year, and quarter (i.e. a number between 1 and 4), respectively.

But you can also get more intra-quarter detail as well by looking at the distribution of failed logins:

logins %>% 
  mutate(qtr=as.character(as.yearqtr(date))) %>% 
  ggplot() +
  geom_violin(aes(x=qtr, y=failures), fill="#cab2d6") +
  geom_boxplot(aes(x=qtr, y=failures), alpha=0) +
  scale_y_continuous(expand=c(0, 0)) +
  labs(x=NULL, y=NULL, title="\nDistribution of login failures per quarter") +
  coord_flip() +
  theme_bw() +
  theme(panel.grid=element_blank()) +
  theme(panel.border=element_blank()) +
  theme(axis.ticks.y=element_blank())

To leave a comment for the author, please follow the link and comment on their blog: Data Driven Security.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.