Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Today we are introducing tibbletime v0.0.2, and we’ve got a ton of new features in store for you. We have functions for converting to flexible time periods with the ~period formula~ and making/calculating custom rolling functions with rollify() (plus a bunch more new functionality!). We’ll take the new functionality for a spin with some weather data (from the weatherData package). However, the new tools make tibbletime useful in a number of broad applications such as forecasting, financial analysis, business analysis and more! We truly view tibbletime as the next phase of time series analysis in the tidyverse. If you like what we do, please connect with us on social media to stay up on the latest Business Science news, events and information!
Introduction
We are excited to announce the release of tibbletime v0.0.2 on CRAN. Loads of new functionality has been added, including:
- Generic period support: Perform time-based calculations by a number of supported periods using a new ~period formula~.
- Creating series: Use create_series() to quickly create a tbl_time object initialized with a regular time series.
- Rolling calculations: Turn any function into a rolling version of itself with rollify().
- A number of smaller tweaks and helper functions to make life easier.
As we further develop tibbletime, it is becoming clearer that the package is a tool that should be used in addition to the rest of the tidyverse. The combination of the two makes time series analysis in the tidyverse much easier to do!
In this post
Today we will take a look at weather data for New York and San Francisco from 2013. It will be an exploratory analysis to show off some of the new features in tibbletime, but the package itself has much broader application. As we will see, tibbletime’s time-based functionality can be a valuable data manipulation tool to help with:
- Product and sales forecasting
- Financial analysis with custom rolling functions
- Grouping data into time buckets to analyze change over time, which is great for any part of an organization including sales, marketing, manufacturing, and HR!
Data and packages
The datasets used are from a neat package called weatherData. While weatherData has functionality to pull weather data for a number of cities, we will use the built-in datasets. We encourage you to explore the weatherData API if you’re interested in collecting weather data.
To get started, load the following packages:
- tibbletime: Time-aware tibbles for the tidyverse
- tidyverse: Loads packages including dplyr, tidyr, purrr, and ggplot2
- corrr: Tidy correlations
- weatherData: Slick package for getting weather data
Also, load the datasets from weatherData, “NewYork2013” and “SFO2013”.
Combine and convert
To tidy up, we first join our data sets together using bind_rows(). Passing a named list of tibbles along with specifying the .id argument allows bind_rows() to create a new City reference column for us.
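As a minimal sketch of that step, the example below binds two small made-up tibbles (hypothetical stand-ins for the weatherData datasets, with assumed Time and Temperature columns) and lets bind_rows() build the City column from the list names.

```r
library(dplyr)

# Toy stand-ins for the two city datasets (values are made up for illustration)
new_york <- tibble(
  Time        = as.POSIXct(c("2013-01-01 00:51:00", "2013-01-01 01:51:00"), tz = "UTC"),
  Temperature = c(41, 39)
)
san_fran <- tibble(
  Time        = as.POSIXct(c("2013-01-01 00:56:00", "2013-01-01 01:56:00"), tz = "UTC"),
  Temperature = c(48, 47)
)

# The list names ("NYC", "SFO") become the values of the new City column
weather <- bind_rows(list(NYC = new_york, SFO = san_fran), .id = "City")

weather
```

The .id column is placed first, so the combined tibble has the columns City, Time and Temperature.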
Next, we will convert to tbl_time and group by our City variable. We can tell this is a tbl_time object from the Index: Time line that gets printed along with the tibble.
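A sketch of the conversion, assuming a combined tibble with City, Time and Temperature columns (the toy data here is made up). It uses as_tbl_time() with the index argument to name the time column.

```r
library(dplyr)
library(tibbletime)

# Toy combined data (hypothetical values standing in for the real weather data)
weather <- tibble(
  City        = c("NYC", "NYC", "SFO", "SFO"),
  Time        = as.POSIXct(c("2013-01-01 00:51:00", "2013-01-01 01:51:00",
                             "2013-01-01 00:56:00", "2013-01-01 01:56:00"),
                           tz = "UTC"),
  Temperature = c(41, 39, 48, 47)
)

# Declare Time as the index, then group by city
weather_tt <- weather %>%
  as_tbl_time(index = Time) %>%
  group_by(City)

weather_tt
```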
Period conversion
The first new idea to introduce is the ~period formula~. This tells the tibbletime functions how you want to time-group your data. It is specified as multiple ~ period, with examples being 1~d for “every 1 day,” and 4~m for “every 4 months.”
In our original data, it looks like weather is an hourly dataset, with each new data point coming in on the 51st minute of the hour for NYC and the 56th minute for SFO. Unfortunately, a number of points don’t follow this. Check out the following rows:
What we want is 1 row per hour, and in this case we get two rows for NYC hour 12.
We can use as_period() to ensure that we only have 1 row for each hour.
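To make the idea concrete, here is a rough dplyr/lubridate sketch of "one row per hour" on made-up data. It is not how as_period() is implemented, and keeping the first observation in each hour is an assumption made for this illustration.

```r
library(dplyr)
library(lubridate)

# Toy NYC data with an extra off-schedule reading in hour 12 (values are made up)
weather <- tibble(
  City        = "NYC",
  Time        = as.POSIXct(c("2013-01-01 11:51:00", "2013-01-01 12:18:00",
                             "2013-01-01 12:51:00", "2013-01-01 13:51:00"),
                           tz = "UTC"),
  Temperature = c(37, 36, 36, 35)
)

# Bucket rows by city and hour, then keep the earliest row in each bucket
hourly <- weather %>%
  arrange(City, Time) %>%
  group_by(City, Hour = floor_date(Time, "hour")) %>%
  slice(1) %>%
  ungroup() %>%
  select(-Hour)

hourly
```

Hour 12 had two readings (12:18 and 12:51); after the reduction only the 12:18 row remains, so each hour contributes exactly one row.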
Now that we have our data in an hourly format, we probably don’t care about which minute it came in on. We can floor the date column using a helper function, time_floor(). Credit to Hadley Wickham because this is essentially a convenient wrapper around lubridate::floor_date(). Setting the period to 1~h floors each row to the beginning of its hour.
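Since the helper is described as a wrapper around lubridate::floor_date(), the underlying behavior can be sketched with lubridate directly. The timestamps are made up for illustration.

```r
library(lubridate)

# Readings arriving at minute 51 and minute 56 of their hours
times <- as.POSIXct(c("2013-01-01 12:51:00", "2013-01-01 13:56:00"), tz = "UTC")

# Floor each timestamp down to the start of the hour it falls in
floored <- floor_date(times, unit = "hour")

format(floored, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```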
Visualize the data
Now that we have cleaned up a bit, let’s visualize the data.
Seems like hourly data is a bit overwhelming for this kind of chart. Let’s convert to daily and try again.
That’s better. It looks like NYC has a much wider range of temperatures than SFO. Both seem to be hotter in summer months.
Period-based summaries
The dplyr::summarise() function is very useful for taking grouped summaries. time_summarise() takes this a step further by allowing you to summarise by period.
Below we take a look at the average and standard deviation of the temperatures calculated at monthly and bimonthly intervals.
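For readers who want to see what a period-based summary amounts to, the following is a rough equivalent of the monthly case written with plain dplyr and lubridate on made-up data. It is only a sketch of the idea, not tibbletime's implementation, and the column names are assumptions.

```r
library(dplyr)
library(lubridate)

# Toy data: two readings per city per month (values are made up)
weather <- tibble(
  City        = rep(c("NYC", "SFO"), each = 4),
  Time        = rep(as.POSIXct(c("2013-01-10", "2013-01-20",
                                 "2013-02-10", "2013-02-20"), tz = "UTC"),
                    times = 2),
  Temperature = c(30, 34, 36, 40, 50, 54, 52, 56)
)

# Group by city and calendar month, then summarise each bucket
monthly <- weather %>%
  group_by(City, Month = floor_date(Time, "month")) %>%
  summarise(
    mean_temp = mean(Temperature),
    sd_temp   = sd(Temperature),
    .groups   = "drop"
  )

monthly
```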
A closer look at July
July seemed to be one of the hottest months for NYC, so let’s take a closer look at it.
To just grab July dates, use time_filter(). If you haven’t seen this before, a time formula is used to specify the dates to filter for. The one-sided formula below expands to include all dates in the range 2013-07-01 00:00:00 ~ 2013-07-31 23:59:59.
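The expanded range can also be spelled out by hand. This sketch filters made-up data with explicit start and end timestamps in dplyr, which is what the shorthand saves you from typing; it does not use the time formula itself.

```r
library(dplyr)

# Toy data straddling the July boundaries (values are made up)
weather <- tibble(
  Time        = as.POSIXct(c("2013-06-30 23:51:00", "2013-07-01 00:51:00",
                             "2013-07-31 23:51:00", "2013-08-01 00:51:00"),
                           tz = "UTC"),
  Temperature = c(75, 74, 80, 79)
)

# Explicit bounds matching the range the one-sided formula expands to
july_start <- as.POSIXct("2013-07-01 00:00:00", tz = "UTC")
july_end   <- as.POSIXct("2013-07-31 23:59:59", tz = "UTC")

july <- weather %>%
  filter(Time >= july_start, Time <= july_end)

july
```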
To visualize July’s weather, we will make a boxplot of the temperatures. Specifically, we will slice July into intervals of 2 days, and create a series of boxplots based on the data inside those intervals. To do this, we will use time_collapse(), which collapses a column of dates into a column of the same length, but where every row in a time interval shares the same date. You can use this resulting column for further grouping or labeling operations.
Let’s visualize to see if we can gain any insights. Wow, San Fran maintained a constant cool average of 60 degrees in the hottest month of the year!
Period and rolling correlations
Finally, we will look at the correlation of temperatures in our two cities in a few different ways.
First, let’s look at the overall correlation. The corrr package provides a nice way to accomplish this with data frames.
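A small sketch of the corrr workflow on made-up data, assuming the temperatures have already been spread into one column per city (the NYC and SFO column names are assumptions). correlate() returns the correlation matrix as a tibble, and focus() narrows it to the column of interest.

```r
library(dplyr)
library(corrr)

# Toy wide-format temperatures, one column per city (values are made up)
temps <- tibble(
  NYC = c(30, 32, 35, 33, 31, 36),
  SFO = c(55, 54, 56, 57, 55, 58)
)

# Correlation tibble, narrowed to the NYC column
nyc_cor <- temps %>%
  correlate() %>%
  focus(NYC)

nyc_cor
```

With only two cities the result is a single row: the correlation of SFO with NYC.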
Next, let’s look at monthly correlations. The general idea will be to nest each month into its own data frame, apply correlate() to each nested data frame, and then unnest the results. We will use time_nest() to easily perform the monthly nesting.
For each month, calculate the correlation tibble and then focus() on the NYC column. Then unnest and floor the results.
It seems that summer and fall months tend to have higher correlation than colder months.
And finally we will calculate the rolling correlation of NYC and SFO temperatures. The “width” of our roll will be 360 hours, which is 15 days (roughly half a month) since we are in hourly format.
There are a number of ways to do this, but for this example we introduce rollify(), which takes any function that you give it and creates a rolling version of it. The first argument to rollify() is the function that you want to modify, and it is passed to rollify() in the same way that you would pass a function to purrr::map(). The second argument is the window size. Call the rolling function just as you would call a non-rolling version of cor() from inside mutate().
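Here is a minimal sketch of that pattern on made-up data. A window of 5 is used instead of 360 so the toy example stays small, and the NYC and SFO column names are assumptions.

```r
library(dplyr)
library(tibbletime)

# Build a rolling version of cor() using purrr-style formula syntax
rolling_cor <- rollify(~cor(.x, .y), window = 5)

# Toy wide-format temperatures (values are made up)
temps <- tibble(
  NYC = c(30, 32, 35, 33, 31, 36, 38, 37),
  SFO = c(55, 54, 56, 57, 55, 58, 57, 59)
)

# Call the rolling function inside mutate(), just like the plain cor()
temps <- temps %>%
  mutate(cor_5 = rolling_cor(NYC, SFO))

temps
```

The first window - 1 rows are NA because a full window is not yet available; from the fifth row on, each value is the correlation over the current row and the four before it.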
It looks like the correlation is definitely not stable throughout the year, so that initial correlation value of .65 definitely has to be taken with a grain of salt!
Rolling Functions: Pros/Cons and Recommendations
There are a number of ways to do rolling functions, and the right choice depends on your needs. If you are interested in:
- Flexibility: Use rollify(). You can literally turn any function into a “tidy” rolling function. Think everything from rolling statistics to rolling regressions. Whatever you can dream up, it can do. The speed is fast, but not quite as fast as other Rcpp based libraries.
- Performance: Use the roll package, which uses RcppParallel as its backend, making it the fastest option available. The only downside is flexibility since you cannot create custom rolling functions and are bound to those available.
Wrapping up
We’ve touched on a few of the new features in tibbletime v0.0.2. Notably:
- rollify() for rolling functions
- as_period() with generic periods
- time_collapse() for collapsing date columns
A full change log can be found in the NEWS file on Github or CRAN.
We are always open to new ideas and encourage you to submit an issue on our Github repo here.
Have fun with tibbletime!
Final thoughts
Mind you this is only v0.0.2. We have a lot of work to do, but we couldn’t wait any longer to share this. Feel free to kick the tires on tibbletime, and let us know your thoughts. Please submit any comments, issues or bug reports to us on GitHub here. Enjoy!
About Business Science
Business Science takes the headache out of data science. We specialize in applying machine learning and data science in business applications. We help businesses that seek to build out this capability but may not have the resources currently to implement predictive analytics. Business Science works with clients as diverse as startups to Fortune 500 and seeks to guide organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!
Connect with Business Science
Connect, communicate and collaborate with us! The easiest way to do so is via social media. Connect with us on:
- @bizScienc on twitter!
- Facebook!
- LinkedIn!
- Sign up for our insights blog to stay updated!
- If you like our software, star our GitHub packages!