Introduction
Happy New Year to all of you. Even if you are still fighting the aftereffects of your New Year’s party, the following may help get you more active, because that is exactly what this blog post is about – activity.
Regardless of the business you are working in, I bet that customer activity is something that matters. An active customer is typically someone who is receptive to offers, whereas inactivity is a good indicator of increasing churn probability, or simply of a deteriorating customer relationship. That’s why we try to keep our customers happy and engaged. Because of that, different groups in a business would benefit from monitoring changes in customer activity. For example, marketing can send a special offer if a customer’s activity increases, or a sales agent can get in touch and ask whether he or she can help expand current usage. Customer care can call if they see a drop in usage, ask if there is any problem and offer assistance. For all of this, algorithms developed for change detection are a perfect fit, as they tell you when a change in customer activity occurred.
This blog post will show how to apply such algorithms to univariate time series representing customer activity and how to present the results graphically. Visualizing the identified breaks provides an additional benefit for understanding customer behavior, and also for understanding how those algorithms work.
Data Understanding and Preparation
Let’s start by having a look at the data used in this article. Customer activity appears in multiple forms, and what is measurable depends on the type of business, the product and the technical platform. During first experiments at work, I had to deal with login information, which in essence consists of an ID and a time stamp of the login. It turned out that the number of logins per day is highly correlated with monthly revenue and a low churn probability, so monitoring this kind of KPI was strongly advised. As this kind of data cannot be made public, I’ll use some artificial data for this posting. A nice side effect of this approach is that we know the exact properties of the artificial data and can compare them with the outcome of the statistical modeling. This makes it a lot easier to understand which method is best suited for the data at hand. A simple way to approximate a sequence of count data is to draw random numbers from a Poisson distribution. Its only parameter, lambda, is the average number of events – here, the average number of logins per day. To simulate login data for a couple of hypothetical customers, an R script along the following lines can be used (please see the comments in the code for an explanation):
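A minimal sketch of such a script – the helper function `simulate_customer` and the exact column names are illustrative, although `trueMean` and `signature` mirror the fields used later in the dashboard:

```r
# Hypothetical data generation sketch: daily login counts are drawn from a
# Poisson distribution whose lambda changes at known points, so the true
# change points and segment means are available for comparison later on.
set.seed(42)

simulate_customer <- function(id, lambdas, seg_lengths) {
  # one rpois() draw per segment, then flatten into a single series
  counts <- unlist(mapply(rpois, n = seg_lengths, lambda = lambdas,
                          SIMPLIFY = FALSE))
  data.frame(
    customer  = id,
    day       = seq_along(counts),
    logins    = counts,
    trueMean  = rep(lambdas, seg_lengths),
    # "signature": the true change points and segment means as one string
    signature = paste(cumsum(seg_lengths), lambdas,
                      sep = ":", collapse = "; ")
  )
}

# a couple of hypothetical customers with different activity patterns
logins <- rbind(
  simulate_customer("A", lambdas = c(10, 4),    seg_lengths = c(60, 60)),
  simulate_customer("B", lambdas = c(5, 12, 7), seg_lengths = c(40, 40, 40))
)
```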
Change Point Detection Packages in R
Thanks to the R community, there are already several packages on CRAN that focus on change point detection. The following packages are especially useful because they are not restricted to a special application domain and are applicable to time series in general:
- CPM – “Parametric and Nonparametric Sequential Change Detection in R”:
Useful for detecting multiple change points in a time series from an unknown underlying distribution. Another bonus is that the method is applicable to data streams, where each observation is only considered once. Because of the “stream nature” of the cpm approach, a second output is the set of detection points themselves. They mark the time when a change point is detected by the algorithm and thereby quantify the detection delay. Unfortunately, the cpm package is no longer maintained on CRAN. For Windows users, I uploaded a zipped version of the installed package from my R library here. It should work with R 3.0 and 3.1 under Windows 7/8.
- BCP – “An R Package for Performing a Bayesian Analysis of Change Point Problems”:
A package using Markov chain Monte Carlo to find multiple change points within a sequence. The implementation is generalized to the multivariate case, in which all sequences are expected to have a constant mean within each segment, although that mean is not necessarily the same across sequences. Finding change points in multivariate time series is not discussed in this posting.
- ECP – “An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data”:
Another package for detecting multiple change points within a time series; it is also applicable to multivariate time series and makes no assumptions about the distribution. It uses an approach similar to hierarchical clustering, with either a divisive or an agglomerative procedure to identify the change points. Even though it is not explicitly stated, the “E” most probably stands for energy, as the method uses the energy statistic of Székely and Rizzo to identify changes.
There are further packages in R for change point detection (for example the changepoint package). An extensive overview of packages, prototypical code and code snippets can be found here. But for this post we continue with the three packages listed above.
Change Point Detection in R and Tableau
The following section shows how to create an interface to configure and examine the listed change point detection methods and how to visualize the results in Tableau for comparison and exploration. For this we use the Tableau-R connection, which enables us to have everything inside a single Tableau dashboard.
Giving direct visual feedback on the results is important for the following reasons:
- First, when you engage the analyst directly in the change point detection process, he or she can incorporate background knowledge about dates and possible effects of external events. This kind of knowledge is not easily available to the algorithms themselves. Doing this might reveal that the drop in usage at the end of February is not because the customer is thinking about cancelling, but because your company launched a new product in February and the customer is now simply using a different tool.
- Second, detecting a change point is not the end of the process, because a decision is then needed on whether the change point requires action (dropping from 100 logins/day to 25 logins/day – of course! But from 75 to 68?). This can lead to a more or less complex decision process involving soft facts and contextual knowledge. Directly involving an analyst might increase the overall decision quality.
- And third, in practice you will primarily be confronted with unlabeled data that carries no indication of the true number of change points. For such datasets the assumptions of the different methods are difficult to check. Providing visual feedback on how those algorithms perform gives the person confronted with the “change” (e.g. a customer care agent) a tool to back up their decision on what to do next (e.g. to call a customer). In general, it supports the analyst’s own judgment by presenting a second opinion.
The dashboard itself uses a very simple structure, showing the empirical observations together with the true means on top and the results of the three packages below. Each of the four parts displays the “observed” login counts plus a line for the estimated segment means, and is just a simple multi-line chart. Only for the three estimates do we additionally add “signature” to the tooltip shelf. This signature is a text string containing the true change points and segment means, and was created as part of the data generation.
Parameters on the right side of the dashboard allow the user to interact with the algorithms or the underlying data by choosing a customer, filtering for a specific period or changing the configuration of the change detection methods.
For each of the three packages a calculated field is created in Tableau that calls a Tableau/R interface function. To estimate the change points, a simple workflow is implemented:
- Load relevant packages and initialize parameters,
- Trigger change point detection,
- Extract the change point locations, if necessary by applying filtering or significance testing, and
- Calculate the segment means based on the identified change points and return results to Tableau.
For the cpm package the code looks as follows:
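As the original calculated field is not reproduced here, the following standalone sketch implements the described workflow with the cpm package; the toy input series and the delimiter format of the return string are assumptions (inside the Tableau calculated field, the login counts would arrive via `.arg1` etc.):

```r
# Hedged sketch of the cpm workflow described above
library(cpm)

set.seed(1)
logins <- c(rpois(60, lambda = 10), rpois(60, lambda = 4))  # toy series

# Mann-Whitney statistic for changes in location, 20 observations burn-in
res <- processStream(logins, cpmType = "Mann-Whitney", startup = 20)

# piecewise means between consecutive change points
bounds <- c(0, res$changePoints, length(logins))
means  <- rep(NA_real_, length(logins))
for (i in seq_len(length(bounds) - 1)) {
  idx <- (bounds[i] + 1):bounds[i + 1]
  means[idx] <- mean(logins[idx])
}

# detection points keep the observed login count of that day, NA elsewhere
detect <- rep(NA_real_, length(logins))
detect[res$detectionTimes] <- logins[res$detectionTimes]

# hand both vectors back to Tableau as one delimited string
paste(paste(round(means, 2), collapse = ";"),
      paste(detect, collapse = ";"), sep = "|")
```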
A special feature of the cpm method is that the detection points should also be displayed. Therefore, a second vector with the same length as the given time series is initialized in R. For every observation, this vector indicates whether it is also a detection point. For each detection point we store the corresponding number of logins from the same day. At the end, this vector is combined with the vector containing the segment means and handed back to Tableau as a string. Back in Tableau, the string is split and both sub-strings are converted into numerical values.
The change point detection method itself uses two parameters: one is the test statistic, the other is the number of initial observations for which no change point will be flagged (a kind of burn-in phase). Multiple test statistics are offered, depending on what we know about the distribution or the type of change. As we are interested in changes in the location of the mean in our scenario (user activity increasing or decreasing over time), the Mann-Whitney test statistic is used as the default.
Regarding the bcp approach, we use three parameters. Two of them are the tuning parameters γ and λ, both with a default value of 0.2. A guideline from the package vignette is that γ should be small in situations “where there aren’t too many changes”, and λ should be small in situations “where the changes that do occur are of a reasonable size” (more information about both parameters can be found in the original paper). The last parameter is a probability threshold for the estimated posterior probabilities: if the posterior probability is above the threshold, the observation is considered a change point.
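A minimal sketch of this step with the bcp package; `w0` and `p0` are the package’s names for the two tuning parameters (their mapping to the γ and λ of the original paper follows the package documentation), and the threshold value is only illustrative:

```r
# Hedged sketch of the bcp step
library(bcp)

set.seed(1)
logins <- c(rpois(60, lambda = 10), rpois(60, lambda = 4))

fit <- bcp(logins, w0 = 0.2, p0 = 0.2)

# observations whose posterior change probability exceeds the threshold
threshold     <- 0.7   # illustrative value, exposed as a Tableau parameter
change_points <- which(fit$posterior.prob > threshold)

# per-observation posterior means can serve directly as segment means
segment_means <- fit$posterior.mean[, 1]
```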
Similar to hierarchical clustering, the ecp package offers a top-down and a bottom-up approach for change point detection. We use the top-down approach (as recommended by the package authors) and connect two parameters to Tableau. One controls the minimal number of observations between two change points (“closeness”). The other is the significance level used in the test carried out for every potential change point; the higher this value, the more likely an observation is classified as a significant change point.
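A minimal sketch using the divisive (top-down) procedure of the ecp package; the parameter values shown are the package defaults and only illustrative:

```r
# Hedged sketch of the ecp step
library(ecp)

set.seed(1)
logins <- c(rpois(60, lambda = 10), rpois(60, lambda = 4))

fit <- e.divisive(matrix(logins, ncol = 1),
                  sig.lvl  = 0.05,  # significance level for each split test
                  min.size = 30)    # minimal observations between change points

# estimated locations: the first and last entries mark the series boundaries,
# interior values are the detected change points
fit$estimates
```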
That’s it. The screenshot at the beginning of the post shows what the result looks like.
Summary
The final dashboard provides a direct view of how the different change point detection methods perform on various time series. Changing the parameters – either for a specific method or for the underlying data – gives immediate feedback without any need to change code or to confront the analyst with a programming language like R. It is also easy to add new parameters to the dashboard, or to use the pattern described above to add completely new methods for change point detection.
Using your own data is easy as well: just bring it into the same structure as the presented toy data and change the data connection afterwards.
The two fields “trueMean” and “signature”, which are normally not known for real-life data, can be imputed with constants (for example, set the unknown true mean to 0 and the signature to “N/A”).
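For example (assuming your data is held in a data frame, here hypothetically called `your_data`):

```r
# constant placeholders for fields that are unknown in real-life data
your_data$trueMean  <- 0
your_data$signature <- "N/A"
```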
I hope this was worth reading, and it would make me happy if you left a short comment. As always, the underlying workbook can be found here as a twbx file.