
ggmail + forecast = how many emails I will get tomorrow?

This article was first published on SmarterPoland.pl » English, and kindly contributed to R-bloggers.


During eRum 2016, Adam Zagdański gave a very good tutorial on time series modeling. Among other things, I learned that the forecast package (created by Rob Hyndman) has gained cool new plots based on the ggplot2 package.

Let’s use it to play with mailbox statistics for my Gmail account!

1. Get the data

Follow this link to download the data from your Gmail account as a single mbox file.
It may be large (15 GB in my case), but for the next steps it is enough to keep only the headers. grep + cat will do the job.
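
The post does the filtering with grep + cat in the shell. For readers who prefer to stay in R, here is a minimal sketch of the same idea: stream the mbox file in chunks and keep only the lines that start with "Date: ". The function name keep_date_headers() and its arguments are made up for this illustration, and the sketch assumes that only the Date headers are needed later. Note that a message body can also contain a line starting with "Date: " (for example in a forwarded email), so the result is an approximation.

```r
# Hypothetical helper (not from the original post): copy only the "Date:"
# lines of an mbox file to a smaller text file. The input is read in chunks,
# so the whole mailbox never has to fit in memory.
keep_date_headers <- function(mbox_path, out_path, chunk_size = 100000L) {
  input <- file(mbox_path, open = "r")
  on.exit(close(input), add = TRUE)
  output <- file(out_path, open = "w")
  on.exit(close(output), add = TRUE)

  kept <- 0L
  repeat {
    lines <- readLines(input, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    # useBytes = TRUE avoids errors on lines with invalid encodings.
    hits <- lines[grepl("^Date: ", lines, useBytes = TRUE)]
    writeLines(hits, output, useBytes = TRUE)
    kept <- kept + length(hits)
  }
  invisible(kept)  # number of header lines written
}

# Usage with hypothetical file names:
# keep_date_headers("mail.mbox", "dates.txt")
```

Reading in chunks is slower than grep, but it keeps memory use flat even for a mailbox of many gigabytes.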

2. Read headers

The readLines() function can read the header lines. The lubridate package then helps to extract the dates and convert them to R date-time objects.
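
As a rough sketch of this step (not the author's original code), the snippet below parses a few made-up Date header lines with lubridate::parse_date_time(). It assumes headers in the usual "day month year time offset" layout and an English locale for the month abbreviations. Real mailboxes contain messier variants, which would come out as NA here.

```r
library(lubridate)

# Made-up header lines for illustration. Real ones would come from something
# like readLines("dates.txt"), where "dates.txt" is a hypothetical file that
# holds only the Date: headers.
headers <- c(
  "Date: Mon, 17 Oct 2016 10:23:45 +0200",
  "Date: Sat, 22 Oct 2016 23:59:01 -0500",
  "Date: 23 Oct 2016 08:00:00 +0000"
)

# Drop the "Date: " prefix and the optional day-of-week ("Mon, ").
stamps <- sub("^Date:\\s*", "", headers)
stamps <- sub("^[A-Za-z]{3},\\s*", "", stamps)

# d = day, b = abbreviated month name, Y = year, H M S = time, z = UTC offset.
# Lines that do not match become NA (with a warning) instead of stopping.
when <- parse_date_time(stamps, orders = "d b Y H M S z", tz = "UTC")

# Calendar day of each email, counted in UTC.
day <- as_date(when)
```

Counting days in UTC is a simplification. To count in local time instead, with_tz() can shift the timestamps before as_date() is applied.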

3. Basic gg-exploration

I started with daily aggregates: the number of emails per day.
The ts() function converts the vector of aggregates into a time series object.
Then I used the autoplot() function to plot the time series. Since the result is a ggplot2 plot, you can easily add a smooth trend with the geom_smooth() function.
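
The steps above can be sketched as follows. The data are simulated (two years of days with fewer emails on weekends), since the real mailbox is not available. The choice of frequency = 7 is my assumption of a weekly cycle, and the names email_day, counts and emails_ts are made up for this example.

```r
library(forecast)
library(ggplot2)

# Simulated stand-in for the parsed email dates (not the author's data):
# 104 weeks starting on a Monday, with fewer emails on weekends.
set.seed(1)
days <- seq(as.Date("2014-01-06"), by = "day", length.out = 728)
is_weekend <- format(days, "%u") %in% c("6", "7")
n_per_day <- rpois(length(days), lambda = ifelse(is_weekend, 20, 80))
email_day <- rep(days, times = n_per_day)  # one entry per email

# Daily aggregates. The factor levels keep days with zero emails,
# which a plain table(email_day) would silently drop.
all_days <- seq(min(email_day), max(email_day), by = "day")
counts <- as.vector(table(factor(as.character(email_day),
                                 levels = as.character(all_days))))

# frequency = 7 declares a weekly cycle, so the time axis is in weeks.
emails_ts <- ts(counts, frequency = 7)

# autoplot() returns a ggplot object, so ggplot2 layers can be added to it.
p <- autoplot(emails_ts) +
  geom_smooth() +
  labs(x = "week", y = "emails per day")
print(p)
```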

There is some trend, but what about seasonality?
The geom_boxplot() function is useful for checking whether there are differences between days of the week or between months.
It turns out that the number of emails per day differs a lot between weekdays and weekends.
Also, August is the email-lightest month, with only 60 emails per day on average.
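
A minimal sketch of such boxplots, again on simulated counts rather than the author's mailbox. The data frame daily and its columns are made up for this example. The simulation only builds in a weekday versus weekend difference, so the monthly boxes will look flat here, unlike in the real data.

```r
library(ggplot2)
library(lubridate)

# Simulated daily counts (not the author's data).
set.seed(2)
daily <- data.frame(
  day = seq(as.Date("2014-01-01"), as.Date("2015-12-31"), by = "day")
)

# Ordered factors for the grouping variables; week_start = 1 puts Monday first.
daily$weekday <- wday(daily$day, label = TRUE, week_start = 1)
daily$month <- month(daily$day, label = TRUE)

# With week_start = 1, days 6 and 7 are Saturday and Sunday.
is_weekend <- wday(daily$day, week_start = 1) >= 6
daily$emails <- rpois(nrow(daily), lambda = ifelse(is_weekend, 20, 80))

# One box per day of the week, and one box per month.
p_weekday <- ggplot(daily, aes(x = weekday, y = emails)) + geom_boxplot()
p_month <- ggplot(daily, aes(x = month, y = emails)) + geom_boxplot()
print(p_weekday)
print(p_month)
```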

4. Time Series

The decompose() + autoplot() functions extract and plot the trend and seasonal components of the time series. A multiplicative seasonal component is probably more appropriate here, but the additive one is presented below since its values are easier to read off the y axis.
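
To make the two options concrete: the additive decomposition writes the series as trend + seasonal + remainder, while the multiplicative one writes it as trend * seasonal * remainder. The sketch below runs both on simulated data with a weekly pattern and a growing level. The numbers in weekly_shape are invented, and frequency = 7 is my assumption.

```r
library(forecast)
library(ggplot2)

# Simulated series (not the author's data): a made-up weekly pattern whose
# size grows with the overall level, which is the multiplicative situation.
set.seed(3)
weekly_shape <- c(80, 85, 90, 85, 75, 25, 20)  # Monday to Sunday, invented
n_weeks <- 52
level <- seq(1, 1.5, length.out = 7 * n_weeks)
counts <- rpois(7 * n_weeks, lambda = rep(weekly_shape, n_weeks) * level)
emails_ts <- ts(counts, frequency = 7)

# Additive: the seasonal component is in emails per day and sums to zero.
additive <- decompose(emails_ts, type = "additive")
# Multiplicative: the seasonal component is a factor that averages to one.
multiplicative <- decompose(emails_ts, type = "multiplicative")

p <- autoplot(additive)
print(p)

# Seasonal effect for each position in the week.
round(additive$figure, 1)
round(multiplicative$figure, 2)
```

The additive figure reads directly as "so many emails more or fewer than the trend", which is why it is easier to interpret on the y axis.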

A lot of models can be fitted with the forecast package. Of the different choices, the scariest is the forecast from the Holt method. It is scary because of the trend.
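
For illustration, here is a minimal sketch of a Holt forecast with forecast::holt() on a simulated series with a rising level. The series, the 28-day horizon and the object names are my assumptions, not the author's setup.

```r
library(forecast)
library(ggplot2)

# Simulated daily counts with a slowly rising level (not the author's data).
set.seed(4)
n_days <- 7 * 52
counts <- rpois(n_days, lambda = seq(60, 90, length.out = n_days))
emails_ts <- ts(counts, frequency = 7)

# Holt's linear trend method: exponential smoothing with a level and a
# trend term, with the last estimated trend extended over the horizon h.
fc <- holt(emails_ts, h = 28)

# Point forecast for the next day, i.e. "how many emails tomorrow?".
fc$mean[1]

# The plot shows the history, the point forecasts and prediction intervals.
p <- autoplot(fc) + labs(x = "week", y = "emails per day")
print(p)
```

Holt's method ignores the weekly seasonality and simply extends the latest trend, which is what makes a long-horizon forecast look alarming. Calling holt() with damped = TRUE flattens the trend over the horizon.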
