Using recurrent neural networks to segment customers

tvladeck

4 years ago

[This article was first published on R – Gradient Metrics, and kindly contributed to R-bloggers].

Understanding consumer segments is key to any successful business. Analytically, a segmentation involves clustering a dataset to find groups of similar customers. What “similar” means is defined by the data that goes into the clustering: it could be demographic, attitudinal, or other characteristics. That data is in turn often limited by the clustering algorithms themselves: most require some kind of tabular data structure, and common techniques like k-means require strictly numeric input. Breaking out of these restrictions has been one of our top priorities since starting the company.

So what do you do when you want to find segments of customers who are “similar” because they behave similarly, that is, because their experience with your brand has been similar? How would you define that? Increasingly, companies are collecting sequence data, with each entry being an interaction with a customer, be it a purchase, reading an email, visiting the website, etc. Given the popularity of deep learning techniques for sequence-related learning tasks, we thought applying neural networks to customer segmentation was the natural approach.

This post builds on our previous customer journey segmentation post and demonstrates a prototype of a deep learning approach to behavior sequence segmentation. We wanted to investigate whether we could leverage the internal state of a recurrent neural network (RNN) trained on complex sequences of data to identify distinctive customer segments. It turns out that we can, and it works well.

Data description

Our client recorded one row of behavioral data for each customer interaction, such as receiving an email, opening an email, or using the app, so a single user’s “sequence” looks like the table below. Note that each sequence can have a variable number of rows.

User ID	Cancel	Sent Email	Open email	Click email	App used	Site visited	Days since last interaction
1001	0	0	0	0	0	1	0
1001	0	0	0	0	0	1	2
1001	0	0	0	0	0	1	4
1001	0	1	0	0	0	0	5
1001	0	0	1	0	0	0	7
1001	0	0	0	0	0	1	1
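
Rows like those above can be grouped into one variable-length sequence per user before they are fed to a network. The sketch below is illustrative Python, not the R code used for the project; the column names, the `build_sequences` helper, and user 1002 are assumptions made for the example.

```python
from collections import defaultdict

# Column order follows the table above; the names themselves are illustrative.
FEATURES = ["cancel", "sent_email", "open_email", "click_email",
            "app_used", "site_visited", "days_since_last"]

def build_sequences(rows):
    """Group (user_id, *features) rows into one list of feature vectors per user.

    Rows are assumed to arrive in chronological order for each user.
    """
    sequences = defaultdict(list)
    for user_id, *features in rows:
        if len(features) != len(FEATURES):
            raise ValueError(f"expected {len(FEATURES)} features, got {len(features)}")
        sequences[user_id].append([float(x) for x in features])
    return dict(sequences)

rows = [
    (1001, 0, 0, 0, 0, 0, 1, 0),
    (1001, 0, 0, 0, 0, 0, 1, 2),
    (1001, 0, 0, 0, 0, 0, 1, 4),
    (1001, 0, 1, 0, 0, 0, 0, 5),
    (1001, 0, 0, 1, 0, 0, 0, 7),
    (1001, 0, 0, 0, 0, 0, 1, 1),
    (1002, 0, 1, 0, 0, 0, 0, 0),  # hypothetical second user, not from the post
    (1002, 0, 0, 1, 0, 0, 0, 3),
]
sequences = build_sequences(rows)
```

Each user ends up with a list whose length equals their number of interactions, which is why the first layer of the network has to accept variable-length input.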

Developing the Neural Network

We developed a very simple neural network architecture, described below. For this sample of customers, we knew whether or not they had churned by the time the data was collected, so our “X’s” were the sequences of customer behavior and our “Y’s” were 0/1 labels indicating whether the customer had churned. The network therefore has a recurrent input layer, which can handle variable-length sequences, and a sigmoid output layer, which outputs a churn probability between 0 and 1. In between, we included a dense layer to make the network more powerful and to generate the encodings.

Layer	Input dimension	Output dimension
Recurrent	Variable	10
Dense	10	10 (used for encoding)
Sigmoid	10	1
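
To make the data flow through these three layers concrete, here is a rough forward pass in plain Python. It is a sketch under several assumptions, not the model from the project (which was built with Keras in R): the post does not say which recurrent cell or activations were used, so a simple tanh recurrent cell and a ReLU dense layer are assumed, biases are left out, and the weights are random placeholders rather than trained values.

```python
import math
import random

N_FEATURES = 7   # columns per interaction row in the data description
N_HIDDEN = 10    # recurrent and dense output size from the table above

def random_matrix(n_rows, n_cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)] for _ in range(n_rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def init_params(seed=0):
    """Random placeholder weights; a real model would learn these from churn labels."""
    rng = random.Random(seed)
    return {
        "W_in": random_matrix(N_HIDDEN, N_FEATURES, rng),    # input -> hidden
        "W_rec": random_matrix(N_HIDDEN, N_HIDDEN, rng),     # hidden -> hidden
        "W_dense": random_matrix(N_HIDDEN, N_HIDDEN, rng),   # hidden -> encoding
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)],  # encoding -> logit
    }

def forward(sequence, params):
    """Return (encoding, churn_probability) for one variable-length sequence."""
    hidden = [0.0] * N_HIDDEN
    for step in sequence:  # one recurrent update per interaction row
        from_input = matvec(params["W_in"], step)
        from_state = matvec(params["W_rec"], hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
    # Dense layer (ReLU assumed); its output is the ten-dimensional encoding.
    encoding = [max(0.0, z) for z in matvec(params["W_dense"], hidden)]
    # Sigmoid output: a churn probability strictly between 0 and 1.
    logit = sum(w * e for w, e in zip(params["w_out"], encoding))
    probability = 1.0 / (1.0 + math.exp(-logit))
    return encoding, probability

params = init_params()
short_seq = [[0, 0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 0, 5]]
long_seq = short_seq + [[0, 0, 1, 0, 0, 0, 7], [0, 0, 0, 0, 0, 1, 1]]
enc_short, p_short = forward(short_seq, params)
enc_long, p_long = forward(long_seq, params)
```

Sequences of two and four rows both come out as ten numbers plus one probability. The recurrent layer folds any number of steps into a fixed-size state, and that fixed size is what makes the dense layer's output usable as an encoding.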

We used Keras (on R) to specify and train the network. After training the network on the churn data, we used the weights from the Recurrent and Dense layers to produce a set of encodings for each user. After feeding in a user’s sequence, we get a ten-dimensional numeric encoding out:

User ID	Encoding_1	Encoding_2	Encoding_3	Encoding_4	Encoding_5	…
1001	0	0	0.4	12.8	0.5
1002	0.1	1.3	0.9	14.7	141.0
1003	0.1	1.3	0.9	14.7	141.0
1004	0.1	1.3	0.9	14.7	141.0
1005	0.0	0.0	0.0	0.5	0

Clustering the RNN encodings

The encodings summarize what the network has learned about each user’s sequence. Although they do not have any inherent meaning, we can use them in a clustering algorithm to identify distinct segments, which is exactly what we did.

We decided to run DBSCAN on the encoded sequence data. DBSCAN had the advantage (in this case) of being able to handle non-linearities in the data and of not requiring the number of clusters to be specified in advance. K-means performed similarly.
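
For readers unfamiliar with DBSCAN, the following is a compact, unoptimized sketch of the algorithm in plain Python. It is meant to show the mechanics only; it is not the implementation used in the project, and the `eps` and `min_pts` values and the two-dimensional toy points are made up for illustration (the real encodings are ten-dimensional).

```python
import math

NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including idx itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may be relabeled later as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also core, keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical 2-D "encodings": two dense groups and one isolated point.
toy_points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
              (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
              (10.0, 0.0)]
toy_labels = dbscan(toy_points, eps=0.5, min_pts=3)
```

The number of clusters is never passed in; it falls out of the density parameters, and points that belong to no dense region are labeled as noise.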

Results

The DBSCAN algorithm identified five distinct clusters with some significant and valuable differences between them.

Segment	Percentage of customers	Avg. E-mails Clicked	Avg. E-mails Opened	Avg. App Actions	Avg. Site Visits	Avg. Churn Date	Churn percentage
1	0.3%	2.11	22.8	16.4	18.2	325	30.1%
2	34.5%	1.13	11.5	3.6	8.1	308	16.7%
3	59.5%	0.3	3.2	0.1	2.9	88	98%
4	5.5%	4.0	27.0	89.5	16.5	337	0.1%
5	0.2%	0.5	2.0	0.0	1.5	93	93%
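
A summary table of this kind can be produced by averaging the original behavioral metrics within each cluster. The sketch below shows one way to do that in plain Python; the `profile_segments` helper, its argument names, and the toy numbers are hypothetical and are not the client's data.

```python
from collections import defaultdict

def profile_segments(labels, churned, features):
    """Summarise each cluster: share of customers, churn rate, and feature means.

    labels   -- cluster label per customer
    churned  -- 0/1 churn flag per customer
    features -- dict of metric name -> list of per-customer values
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    total = len(labels)
    profile = {}
    for label, idxs in sorted(members.items()):
        row = {
            "pct_customers": 100.0 * len(idxs) / total,
            "churn_pct": 100.0 * sum(churned[i] for i in idxs) / len(idxs),
        }
        for name, values in features.items():
            row["avg_" + name] = sum(values[i] for i in idxs) / len(idxs)
        profile[label] = row
    return profile

# Hypothetical toy data, not the client's numbers.
labels = [0, 0, 0, 1, 1]
churned = [1, 1, 0, 0, 0]
features = {"emails_opened": [2, 4, 3, 20, 30], "site_visits": [1, 2, 3, 10, 20]}
profile = profile_segments(labels, churned, features)
```

Because the encodings themselves have no inherent meaning, profiling the clusters on the raw metrics is what makes the segments interpretable.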

Although the clusters are fairly imbalanced (likely an artifact of clustering encodings that came from a supervised model), the number of days since the first interaction is clearly a strong driver in defining segments. The key takeaway here is that the clusters with the highest churn rates (segments 3 and 5) have an interaction history of roughly three months or less. This business absolutely must focus on getting customers through the first three months to decrease the likelihood of churning early.

Takeaways

Sequence data is increasingly being captured by brands, and methods for exploring it must be developed.
Recurrent neural networks are an effective way of generating encodings for behavioral sequence data.
Clustering the encodings (results of intermediate layers) of a neural network can be an effective way of peering inside the black box.

We welcome any thoughts or comments you might have, and feel free to share this blog post with your friends and colleagues!
