Pareto-NBD Customer Lifetime Value
Posted on 6 April, 2019 by Brian Tags: customer lifetime value, pareto-nbd, smc Category: customer_lifetime_value
Suppose you have a bunch of customers who make repeat purchases, some more frequently than others. There are a few things you might like to know about these customers, such as
- which customers are still active (i.e. not yet churned) and likely to continue purchasing from you; and
- how many purchases you can expect from each customer.
Modelling this directly is more difficult than it might seem at first. A customer that regularly makes purchases every day might be considered at risk of churning if they haven’t purchased anything in the past week, whereas a customer that regularly purchases once per month would not be considered at risk of churning. That is, churn and purchase frequency are closely related. The difficulty is that we never observe the moment of churn for any customer, so we have to model it probabilistically.
There are a number of established models for estimating this, the most well-known perhaps being the SMC model (a.k.a. the Pareto-NBD model). There are already some implementations using maximum likelihood or Gibbs sampling. In this post, we’ll explain how the model works, make some prior predictive simulations, and fit a version implemented in Stan.
\(\DeclareMathOperator{\dbinomial}{Binomial} \DeclareMathOperator{\dbern}{Bernoulli} \DeclareMathOperator{\dpois}{Poisson} \DeclareMathOperator{\dnorm}{Normal} \DeclareMathOperator{\dt}{t} \DeclareMathOperator{\dcauchy}{Cauchy} \DeclareMathOperator{\dexp}{Exp} \DeclareMathOperator{\duniform}{Uniform} \DeclareMathOperator{\dgamma}{Gamma} \DeclareMathOperator{\dinvgamma}{InvGamma} \DeclareMathOperator{\invlogit}{InvLogit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\ddirichlet}{Dirichlet} \DeclareMathOperator{\dbeta}{Beta}\)
Data Generating Process
Let’s describe the model first by simulation. Suppose we have a company that is 2 years old and has a total of 1000 customers, \(C\), who have made at least one purchase from us. We’ll assume a linear rate of customer acquisition, so that the first purchase date is simply a uniform random variable over the 2 years of the company’s existence. These assumptions are just to keep the example concrete, and are not so important for understanding the model.
```r
# assumes the tidyverse is loaded
customers <- tibble(id = 1:1000) %>%
  mutate(
    end = 2 * 365,                    # company age in days
    start = runif(n(), 0, end - 1),   # date of first purchase
    T = end - start                   # total observation time
  )
```
The \(T\)-variable is the total observation time, counted from the date of first joining to the present day.
First the likelihood. Each customer \(c \in C\) is assumed to have a certain lifetime, \(\tau_c\), starting on their join-date. During their lifetime, they will purchase at a constant rate, \(\lambda_c\), so that they will make \(k \sim \dpois(t\lambda_c)\) purchases over a time-interval \(t\). Once their lifetime is over, they will stop purchasing. We only observe the customer for \(T_c\) units of time, and this observation time can be either larger or smaller than the lifetime, \(\tau_c\). Since we don’t observe \(\tau_c\) itself, we will assume it follows an exponential distribution, i.e. \(\tau_c \sim \dexp(\mu_c)\).
The following function generates possible observations given \(\mu\) and \(\lambda\).
```r
sample_conditional <- function(mu, lambda, T) {
  # lifetime
  tau <- rexp(1, mu)

  # start with 0 purchases
  t <- 0
  k <- 0

  # simulate time till next purchase
  wait <- rexp(1, lambda)

  # keep purchasing till end of life/observation time
  while (t + wait <= pmin(T, tau)) {
    t <- t + wait
    k <- k + 1
    wait <- rexp(1, lambda)
  }

  # return tabular data
  tibble(
    mu = mu,
    lambda = lambda,
    T = T,
    tau = tau,
    k = k,
    t = t
  )
}

s <- sample_conditional(0.01, 1, 30)
```
| mu | lambda | T | tau | k | t |
|---|---|---|---|---|---|
| 0.01 | 1 | 30 | 49.63373 | 39 | 29.21926 |
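Repeating this simulation many times for a single set of parameters gives a feel for the variability of the data generating process. The following is just a quick illustrative sketch: the sims object and the summary column names are our own, and purrr’s map_dfr is assumed to be available via the tidyverse.

```r
sims <- map_dfr(1:1000, ~ sample_conditional(0.01, 1, 30))

sims %>%
  summarise(
    outlives_window = mean(tau > T),  # how often the lifetime exceeds the observation time
    mean_k = mean(k),                 # average number of observed purchases
    mean_t = mean(t)                  # average time of the last observed purchase
  )
```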
Now the priors. Typically, \(\mu\) and \(\lambda\) are given gamma priors, which we’ll use too. However, the expected mean lifetime \(\mathbb E (\tau) = \frac{1}{\mu}\) is easier to reason about than \(\mu\), so we’ll put an inverse gamma distribution on \(\frac{1}{\mu}\). The reciprocal of an inverse gamma distribution has a gamma distribution, so \(\mu\) will still end up with a gamma distribution.
The mean lifetime in our simulated example will be around 2 months (60 days), with a standard deviation of 30 days. The mean purchase rate will be once a fortnight, with a standard deviation of around 0.05 purchases per day.
```r
set.seed(2017896)

# inverse gamma prior on the expected lifetime: mean 60 days, sd 30 days
etau_mean <- 60
etau_variance <- 30^2
etau_beta <- etau_mean^3 / etau_variance + etau_mean
etau_alpha <- etau_mean^2 / etau_variance + 2

# gamma prior on the purchase rate: mean 1/14 per day, sd 0.05 per day
lambda_mean <- 1 / 14
lambda_variance <- 0.05^2
lambda_beta <- lambda_mean / lambda_variance
lambda_alpha <- lambda_mean * lambda_beta

df <- customers %>%
  mutate(
    etau = rinvgamma(n(), etau_alpha, etau_beta),  # rinvgamma() from the invgamma package
    mu = 1 / etau,
    lambda = rgamma(n(), lambda_alpha, lambda_beta)
  ) %>%
  group_by(id) %>%
  group_modify(~ sample_conditional(.$mu, .$lambda, .$T)) %>%
  ungroup()
```
| id | mu | lambda | T | tau | k | t |
|---|---|---|---|---|---|---|
| 1 | 0.0241091 | 0.2108978 | 295.3119 | 32.2814622 | 6 | 29.46052 |
| 2 | 0.0122084 | 0.0135551 | 673.2100 | 11.5250690 | 0 | 0.00000 |
| 3 | 0.0032994 | 0.0789800 | 357.1805 | 4.7921238 | 0 | 0.00000 |
| 4 | 0.0227431 | 0.0980176 | 270.0511 | 141.4766791 | 10 | 125.60765 |
| 5 | 0.0270742 | 0.0429184 | 608.9049 | 5.7293256 | 0 | 0.00000 |
| 6 | 0.0208168 | 0.0661296 | 666.1305 | 0.9481004 | 0 | 0.00000 |
The simulated lifetimes are mostly under 3 months, but there are also some more extreme values of up to around a year.
The purchase rates are mostly around once a fortnight, but there are also rates as high as 4 purchases per week and as low as one per quarter.
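We can put rough numbers on these summaries directly from the simulated draws. The quantile levels below are an arbitrary choice of ours, added just for illustration.

```r
# simulated lifetimes, in days
quantile(df$tau, probs = c(0.05, 0.5, 0.95))

# simulated purchase rates, in purchases per week
quantile(7 * df$lambda, probs = c(0.05, 0.5, 0.95))
```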
Likelihood
The likelihood is somewhat complicated, so we’ll derive a more concise expression for it. Knowing the lifetime simplifies the probabilities, so we’ll condition on \(\tau\) and then marginalise the likelihood over it.
\[ \begin{align} \mathbb P (k, t \mid \mu, \lambda) &= \int_{\tau = t}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau \\ &= \int_{\tau = t}^T \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau + \int_{\tau = T}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau \\ &= \int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) \, d\tau \\ &\hphantom{=} + \int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (T-t)\lambda) \cdot \dexp(\tau \mid \mu) \, d\tau \end{align} \]
The right-hand side is straightforward. The Poisson probabilities can be pulled out of the integral since they are independent of \(\tau\), turning the remaining integral into the survival function of the exponential distribution.
\[ \begin{align} \text{RHS} &= \int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (T-t)\lambda) \cdot \dexp(\tau \mid \mu) \, d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda}}{k!} e^{-(T-t)\lambda} \int_T^\infty \dexp(\tau \mid \mu) \, d\tau \\ &= \frac{(t\lambda)^k e^{-T\lambda}}{k!} e^{-T\mu} \\ &= \frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!} \end{align} \]
The left-hand side is a little more involved.
\[ \begin{align} \text{LHS} &= \int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) \, d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda} }{k!} \int_t^T e^{-(\tau - t)\lambda} \mu e^{-\tau\mu} \, d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda} }{k!} e^{t\lambda} \mu \int_t^T e^{-\tau(\lambda + \mu)} \, d\tau \\ &= \frac{(t\lambda)^k }{k!} \mu \left. \frac{ e^{-\tau(\lambda + \mu)}}{-(\lambda + \mu)} \right|_t^T \\ &= \frac{(t\lambda)^k }{k!} \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \end{align} \]
Adding both expressions gives our final expression for the likelihood
\[ \begin{align} \mathbb P (k, t \mid \mu, \lambda) &= \frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!} + \frac{(t\lambda)^k }{k!} \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \\ &\propto \lambda^k e^{-T(\lambda + \mu)} + \lambda^k \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \\ &= \frac{\lambda^k}{\lambda + \mu} \left( \mu e^{-t(\lambda + \mu)} - \mu e^{-T(\lambda + \mu)} + \mu e^{-T(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) \\ &= \frac{\lambda^k}{\lambda + \mu} \left( \mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) , \end{align} \]
where we dropped any factors independent of the parameters, \(\lambda, \mu\). This expression agrees with equation 2 in ML07.
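As a quick sanity check on this algebra, we can compare the closed-form expression against a direct numerical marginalisation over \(\tau\). This is just an illustrative sketch: the helper names lik_closed_form and lik_numeric are our own, and the closed form below keeps the \((t\lambda)^k / k!\) factor that we drop above.

```r
# closed-form likelihood, before dropping constant factors
lik_closed_form <- function(k, t, T, mu, lambda) {
  (t * lambda)^k / factorial(k) *
    (exp(-T * (lambda + mu)) +
       mu * (exp(-t * (lambda + mu)) - exp(-T * (lambda + mu))) / (lambda + mu))
}

# direct numerical marginalisation over tau
lik_numeric <- function(k, t, T, mu, lambda) {
  # tau in (t, T]: k purchases by t, then none between t and tau
  alive_until_tau <- integrate(
    function(tau) dpois(k, t * lambda) * dpois(0, (tau - t) * lambda) * dexp(tau, mu),
    lower = t,
    upper = T
  )$value

  # tau > T: k purchases by t, then none between t and T
  alive_past_T <- dpois(k, t * lambda) * dpois(0, (T - t) * lambda) *
    pexp(T, mu, lower.tail = FALSE)

  alive_until_tau + alive_past_T
}

lik_closed_form(6, 29.46, 295.31, 0.024, 0.21)
lik_numeric(6, 29.46, 295.31, 0.024, 0.21)
```

The two values should agree up to the numerical tolerance of integrate().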
Another way to view this likelihood is as a mixture of censored observations, but where the mixture probability \(p(\mu, \lambda) := \frac{\mu}{\lambda + \mu}\) depends on the parameters. We can write this alternative interpretation as
\[ \begin{align} \mathbb P(k, t \mid \mu, \lambda) &\propto p \dpois(k \mid t\lambda)S(t \mid \mu) \\ &\hphantom{\propto}+ (1 - p) \dpois(k \mid t\lambda)\dpois(0 \mid (T-t)\lambda)S(T \mid \mu) , \end{align} \]
where \(S\) is the survival function of the exponential distribution. In other words, either we censor at \(t\) with probability \(p\), or we censor at \(T\) with probability \(1 - p\). Note that
- either decreasing the expected lifetime (i.e. increasing \(\mu\)) or decreasing the purchase rate increases \(p\);
- if \(t \approx T\), then the two censored distributions are approximately equal. The larger \(\lambda\) is, the closer \(t\) must be to \(T\) for this approximation to hold.
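To make the mixture interpretation concrete, here is a small sketch that evaluates it directly; p_mix and lik_mixture are illustrative names of ours, and lik_closed_form refers to the check in the previous section.

```r
p_mix <- function(mu, lambda) mu / (lambda + mu)

lik_mixture <- function(k, t, T, mu, lambda) {
  p <- p_mix(mu, lambda)

  # censor at t with probability p ...
  p * dpois(k, t * lambda) * pexp(t, mu, lower.tail = FALSE) +
    # ... or censor at T with probability 1 - p
    (1 - p) * dpois(k, t * lambda) * dpois(0, (T - t) * lambda) *
      pexp(T, mu, lower.tail = FALSE)
}

lik_mixture(6, 29.46, 295.31, 0.024, 0.21)  # agrees with lik_closed_form above
```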
To implement this in Stan, we’ll need the log-likelihood, which is given by
\[ \log\mathbb P (k, t \mid \mu, \lambda) = k \log\lambda - \log(\lambda + \mu) + \log\left(\mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) . \]
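For reference, the same log-likelihood can be written in R with a log-sum-exp over the two terms inside the logarithm, mirroring what the Stan model below does. This is a sketch of ours: log_lik_pnbd is not a function from the post, and logSumExp() is assumed to come from the matrixStats package.

```r
log_lik_pnbd <- function(k, t, T, mu, lambda) {
  k * log(lambda) - log(lambda + mu) +
    matrixStats::logSumExp(c(
      log(mu) - t * (lambda + mu),      # censor-at-t term
      log(lambda) - T * (lambda + mu)   # censor-at-T term
    ))
}

log_lik_pnbd(6, 29.46, 295.31, 0.024, 0.21)
```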
Stan implementation
Let’s take a look at our Stan implementation. Note that Stan uses the log-likelihood, which we can increment via the target variable. We have also used the log_sum_exp function for numerical stability, where \(\text{log\_sum\_exp}(x, y) := \log(e^x + e^y)\).
```r
pnb <- here('models/pnbd.stan') %>% stan_model()
```

S4 class stanmodel 'pnbd' coded as follows:

```stan
data {
  int<lower = 1> n;            // number of customers
  vector<lower = 0>[n] t;      // time to most recent purchase
  vector<lower = 0>[n] T;      // total observation time
  vector<lower = 0>[n] k;      // number of purchases observed

  // user-specified parameters
  real<lower = 0> etau_alpha;
  real<lower = 0> etau_beta;
  real<lower = 0> lambda_alpha;
  real<lower = 0> lambda_beta;
}

parameters {
  vector<lower = 0>[n] lambda; // purchase rate
  vector<lower = 0>[n] etau;   // expected mean lifetime
}

transformed parameters {
  vector<lower = 0>[n] mu = 1.0 ./ etau;
}

model {
  // priors
  etau ~ inv_gamma(etau_alpha, etau_beta);
  lambda ~ gamma(lambda_alpha, lambda_beta);

  // likelihood
  target += k .* log(lambda) - log(lambda + mu);
  for (i in 1:n) {
    target += log_sum_exp(
      log(lambda[i]) - (lambda[i] + mu[i]) * T[i],
      log(mu[i]) - (lambda[i] + mu[i]) * t[i]
    );
  }
}
```
Let’s fit the model to our simulated data, using the correct priors.
```r
pnb_fit <- rstan::sampling(
  pnb,
  data = compose_data(   # compose_data() from the tidybayes package
    df,
    etau_alpha = etau_alpha,
    etau_beta = etau_beta,
    lambda_alpha = lambda_alpha,
    lambda_beta = lambda_beta
  ),
  control = list(max_treedepth = 15),
  chains = 4,
  cores = 4,
  warmup = 1000,
  iter = 3000
)
```
Using the default max_treedepth of 10 showed problems with the energy diagnostic, with the etau parameters seemingly the most problematic. Increasing it to 15 resolved these issues.
```r
pnb_fit %>% check_hmc_diagnostics()
```

```
Divergences: 0 of 8000 iterations ended with a divergence.
Tree depth: 0 of 8000 iterations saturated the maximum tree depth of 15.
Energy: E-BFMI indicated no pathological behavior.
```
There are also no problems with the effective sample sizes, although the etau parameters typically have the lowest ratios.
```r
pnb_neff <- pnb_fit %>%
  neff_ratio() %>%
  tibble(
    ratio = .,
    parameter = names(.)
  ) %>%
  arrange(ratio) %>%
  head(5)
```
| ratio | parameter |
|---|---|
| 0.3877002 | lp__ |
| 0.4838375 | etau[716] |
| 0.5157888 | etau[442] |
| 0.5722499 | etau[367] |
| 0.5803245 | etau[443] |
The rhat statistic also looks good.
```r
pnb_rhat <- pnb_fit %>%
  rhat() %>%
  tibble(
    rhat = .,
    parameter = names(.)
  ) %>%
  summarise(min(rhat), max(rhat))
```
| min(rhat) | max(rhat) |
|---|---|
| 0.9995222 | 1.001541 |
Around 50% of our 50% posterior intervals contain the true value, which is a good sign.
```r
calibration <- pnb_fit %>%
  spread_draws(mu[id], lambda[id]) %>%
  mean_qi(.width = 0.5) %>%
  inner_join(df, by = 'id') %>%
  summarise(
    mu = mean(mu.lower <= mu.y & mu.y <= mu.upper),
    lambda = mean(lambda.lower <= lambda.y & lambda.y <= lambda.upper)
  ) %>%
  gather(parameter, fraction)
```
| parameter | fraction |
|---|---|
| mu | 0.495 |
| lambda | 0.498 |
Discussion
We described the data generating process behind the Pareto-NBD model, implemented a model in Stan using our derivation of the likelihood, and fit the model to simulated data. The diagnostics didn’t indicate any convergence problems, and around 50% of the 50% posterior intervals contained the true parameter values. However, we used our knowledge of the prior distribution to fit the model. It would be better to use a hierarchical prior to relax this requirement.
As a next step, it would be interesting to extend the model to
- estimate spend per purchase;
- use hierarchical priors on \(\mu\) and \(\lambda\);
- allow correlation between \(\mu\) and \(\lambda\); and
- allow covariates, such as cohorts.