
Change Point Detection in Time Series with R and Tableau

[This article was first published on Data * Science + R, and kindly contributed to R-bloggers.]

Introduction

Happy new year to all of you. Even if you are still fighting the aftereffects of your new year’s party, the following may help get you moving again, because that is exactly what this blog post is about: activity.

Regardless of the business you work in, I bet that customer activity is something that matters. An active customer is typically someone who is receptive to offers, whereas inactivity is a good indicator of increasing churn probability or simply of a deteriorating customer relationship. That’s why we try to keep our customers happy and engaged. Different groups in the business would therefore benefit from monitoring changes in customer activity. For example, marketing could send a special offer if a customer’s activity increases, or a sales agent could contact the customer and ask how to help extend current usage. Customer care could call when they see a drop in usage and ask whether there is a problem and how they can assist. Algorithms developed for change detection are a perfect fit for all of this, as they tell you when a change in customer activity occurred.

This blog post will show how to apply such algorithms to univariate time series representing customer activity and how to present the results graphically. Visualizing the identified breaks provides an additional benefit for understanding customer behavior and also for understanding how those algorithms work.

Data Understanding and Preparation

Let’s start by having a look at the data used in this article. Customer activity appears in many forms, and it depends on the type of business, the product, and the technical platform what is measurable and what is not. During my first experiments at work, I dealt with login information, which in essence consists of an ID and a time stamp for each login. It turned out that the number of logins per day is highly correlated with monthly revenue and a low churn probability, so monitoring this kind of KPI was strongly advised. As this data cannot be made public, I’ll use artificial data for this posting. A nice side effect of this approach is that we know the exact properties of the artificial data and can compare them with the outcome of the statistical modeling. This makes it a lot easier to understand which method is best suited for the data at hand. A simple way to approximate a sequence of count data is to draw random numbers from a Poisson distribution. Its only parameter is the average number of events, called lambda, which here translates to the average number of logins per day. To simulate login data for a couple of hypothetical customers, an R script along the following lines can be used (please see the comments in the code for an explanation):
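The original script is not reproduced here, so the following is only a minimal sketch of how such a simulation could look. The helper function, the customer setups, and the column names are illustrative assumptions; only the fields trueMean and signature reappear later in the post.

# Simulate daily login counts for hypothetical customers. Each customer is
# defined by a set of segments, each with its own Poisson mean (lambda).
set.seed(42)

simulateCustomer <- function(custId, lambdas, segLengths) {
  # draw daily login counts, switching lambda at each (true) change point
  logins <- unlist(mapply(rpois, segLengths, lambdas))
  # the true mean for every day, used later for visual comparison
  trueMean <- rep(lambdas, segLengths)
  # "signature" encodes the true change points and segment means as text
  signature <- paste0("changes: ",
                      paste(cumsum(head(segLengths, -1)), collapse = ","),
                      " | means: ", paste(lambdas, collapse = ","))
  data.frame(customer = custId,
             day = seq_along(logins),
             logins = logins,
             trueMean = trueMean,
             signature = signature)
}

# three hypothetical customers with different activity patterns
customers <- rbind(
  simulateCustomer(1, lambdas = c(5, 10, 3), segLengths = c(100, 80, 120)),
  simulateCustomer(2, lambdas = c(2, 8),     segLengths = c(150, 150)),
  simulateCustomer(3, lambdas = 6,           segLengths = 300)  # no change
)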

Change Point Detection Packages in R

Thanks to the R community, there are several packages on CRAN that focus on change point detection. The following packages are especially useful because they are not restricted to a special application domain and are applicable to time series in general:

  - cpm: sequential and batch change detection using parametric and nonparametric test statistics
  - bcp: Bayesian analysis of change point problems
  - ecp: nonparametric multiple change point analysis based on energy statistics

There are further R packages for change point detection (for example the changepoint package). An extensive overview of packages, prototypical code, and code snippets can be found here. But for this post we continue with the three packages listed above.

Change Point Detection in R and Tableau

The following section shows how to create an interface for configuring and examining the listed change point detection methods and how to visualize the results in Tableau for comparison and exploration. For this we use the Tableau-R connection, which enables us to have everything inside a single Tableau dashboard.

Giving direct visual feedback on the results is important: it immediately shows how each algorithm reacts to the data at hand and to changes in its parameters.

The dashboard itself has a very simple structure, showing the empirical observations together with the true means on top and the results of the three packages below. Each of the four parts is a simple multi-line chart displaying the “observed” login counts plus a line for the estimated (or true) segment means. For the three estimate charts we additionally add “signature” to the tooltip shelf. This signature is a text string containing the true change points and segment means and was created as part of the data generation.

Parameters on the right side of the dashboard allow the user to interact with the algorithms and the underlying data by choosing a customer, filtering for a specific period, or changing the configuration of the change detection methods.
For each of the three packages, a calculated field is created in Tableau that calls a Tableau/R interface function. To estimate the change points, a simple workflow is implemented:

  1. Load relevant packages and initialize parameters,
  2. Trigger change point detection,
  3. Extract the change point locations, if necessary by applying filtering or significance testing, and
  4. Calculate the segment means based on the identified change points and return results to Tableau.

For the cpm package, the calculated field follows the workflow above (a sketch of the code is shown below). The special case for the cpm method is that the detection points should also be displayed. Therefore, a second vector with the same length as the given time series is initialized in R. For every observation, this vector records whether it is a detection point or not. For each detection point we store the value of the corresponding number of logins from the same day. At the end, this vector is combined with the vector containing the segment means and handed back to Tableau as a string. Back in Tableau, the string is split and both sub-strings are converted into numerical values.
The change point detection method itself takes two parameters: one is the test statistic, and the second is the number of observations at the beginning of the series within which no change point will be identified (a kind of burn-in phase). The available test statistics detect different types of change, depending on what we know about the distribution. As we are interested in changes in the location of the mean in our scenario (user activity increasing or decreasing over time), the Mann-Whitney test statistic is used as default.
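A minimal sketch of the R part of this calculated field could look like the following. processStream() with its cpmType and startup arguments comes from the cpm package; the function name and the string format are illustrative. Inside Tableau, this logic would live in a SCRIPT_STR() calculated field, with the login counts and the two parameters passed in as .arg1, .arg2, and .arg3.

library(cpm)

# logins: vector of daily login counts handed over from Tableau
cpmSegments <- function(logins, cpmType = "Mann-Whitney", startup = 20) {
  res <- processStream(logins, cpmType = cpmType, startup = startup)

  # segment means between consecutive change points
  breaks <- c(0, res$changePoints, length(logins))
  means <- numeric(length(logins))
  for (i in seq_len(length(breaks) - 1)) {
    idx <- (breaks[i] + 1):breaks[i + 1]
    means[idx] <- mean(logins[idx])
  }

  # second vector: the login value at each detection point, NA elsewhere
  detections <- rep(NA_real_, length(logins))
  detections[res$detectionTimes] <- logins[res$detectionTimes]

  # combine both vectors into one string; Tableau splits and converts it back
  paste(paste(round(means, 2), collapse = ";"),
        paste(detections, collapse = ";"), sep = "|")
}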

Regarding the bcp approach, we use three parameters. Two of them are the tuning parameters γ and λ, each with a default value of 0.2. A guideline from the package vignette is that in situations “where there aren’t too many changes” γ should be small, and in situations “where the changes that do occur are of a reasonable size” λ should be small (more information about both parameters can be found in the original paper). The last parameter is a probability threshold for the estimated posterior probabilities: if the posterior probability of an observation is above the threshold, the observation is considered a change point.
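As a sketch of this step: in the bcp package the two tuning parameters correspond to the arguments p0 and w0 (both defaulting to 0.2), while the wrapper function and the threshold handling below are illustrative.

library(bcp)

# logins: daily login counts; p0/w0 are bcp's tuning parameters,
# threshold is the posterior-probability cutoff from the dashboard
bcpChangePoints <- function(logins, p0 = 0.2, w0 = 0.2, threshold = 0.5) {
  fit <- bcp(logins, p0 = p0, w0 = w0)
  # an observation is a change point if its posterior probability
  # of a change exceeds the chosen threshold
  which(fit$posterior.prob >= threshold)
}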

Similar to hierarchical clustering, the ecp package offers a top-down and a bottom-up approach to change point detection. We use the top-down approach (as recommended by the package authors) and connect two parameters to Tableau. One controls the minimal number of observations between two change points (“closeness”). The other is a threshold for the significance test that is carried out for every potential change point. The higher this value, the more likely an observation is classified as a significant change point.
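A corresponding sketch, assuming the two parameters map to the min.size and sig.lvl arguments of ecp’s top-down method e.divisive() (the wrapper itself is illustrative):

library(ecp)

# logins: daily login counts; e.divisive() expects a matrix,
# one column per dimension of the series
ecpChangePoints <- function(logins, min.size = 30, sig.lvl = 0.05) {
  fit <- e.divisive(matrix(logins, ncol = 1),
                    sig.lvl = sig.lvl, min.size = min.size)
  # fit$estimates contains the change points plus the two artificial
  # endpoints 1 and n + 1, which we drop
  setdiff(fit$estimates, c(1, length(logins) + 1))
}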

That’s it. The screenshot at the beginning of the post shows what the result looks like.

Summary

The final dashboard provides a direct view of how the different change point detection methods perform on various time series. Changing the parameters – either for a specific method or for the underlying data – gives immediate feedback without any need to change code or to confront the analyst with a programming language like R. It is also easy to add new parameters to the dashboard or to use the pattern described above to add completely new methods for change point detection.
Using your own data is easy as well: just bring it into the same structure as the presented toy data and change the data connection afterwards.

The two fields “trueMean” and “signature”, which are normally not known for real-life data, can be imputed with constants (for example, set the unknown true mean to 0 and the signature to “N/A”).
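As a minimal illustration, assuming your data sits in a data frame called realData with the same column layout:

# stub out the fields that only exist for the simulated data
realData$trueMean  <- 0
realData$signature <- "N/A"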

I hope this was worth reading, and it would make me happy if you left a short comment. As always, the underlying workbook (as a twbx) can be found here.
