Fitting a spline with PyMC3

Posts | Joshua Cook

1 year ago

[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Introduction

Often, the relationship we want to model between some $x$ and $y$ is not a straight line. Instead, the parameters of the model are expected to vary over $x$. There are multiple ways to handle this situation, one of which is to fit a spline. The spline is effectively multiple individual lines, each fit to a different section of $x$, that are tied together at their boundaries, often called knots. Below is an example of how to fit a spline using the Bayesian framework PyMC3.

Fitting a spline with PyMC3

Below is a full working example of how to fit a spline using the probabilistic programming language PyMC3. The data and model are taken from Statistical Rethinking 2e by Richard McElreath. As the book uses Stan (another advanced probabilistic programming language), the modeling code is primarily taken from the GitHub repository of the PyMC3 implementation of Statistical Rethinking. My contributions are primarily explanation and additional analyses of the data and results.

Set-up

Below is the code to import packages and set some variables used in the analysis. Most of the libraries and modules are likely familiar. Those that may be less well known are ‘ArviZ’, ‘patsy’, and ‘plotnine’. ‘ArviZ’ is a library for managing the components of a Bayesian model. I will use it to manage the results of fitting the model and for some standard data visualizations. The ‘patsy’ library is an interface to statistical modeling using a specific formula language similar to that used in the R language. Finally, ‘plotnine’ is a plotting library that implements the “Grammar of Graphics” system based on the ‘ggplot2’ R package. As I have a lot of experience with R, I found ‘plotnine’ far more natural than the “standard” in Python data science, ‘matplotlib’.

from pathlib import Path
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as gg
import pymc3 as pm
import seaborn as sns
from patsy import dmatrix
# Set default theme for 'plotnine'.
gg.theme_set(gg.theme_minimal())
# For reproducibility.
RANDOM_SEED = 847
np.random.seed(RANDOM_SEED)
# Path to the data used in Statistical Rethinking.
rethinking_data_path = Path("../data/rethinking_data")

Data

The data for this example are the day of the year (doy) on which some cherry trees first bloomed in each year (year). We will ignore the other columns for now.

d = pd.read_csv(rethinking_data_path / "cherry_blossoms.csv")
d2 = d.dropna(subset=["doy"]).reset_index(drop=True)
d2.head(n=10)

	year	doy	temp	temp_upper	temp_lower
0	812	92	nan	nan	nan
1	815	105	nan	nan	nan
2	831	96	nan	nan	nan
3	851	108	7.38	12.1	2.66
4	853	104	nan	nan	nan
5	864	100	6.42	8.69	4.14
6	866	106	6.44	8.11	4.77
7	869	95	nan	nan	nan
8	889	104	6.83	8.48	5.19
9	891	109	6.98	8.96	5

There are 827 years with doy data.

>>> d2.shape
(827, 5)

Below are the doy values plotted over year.

(
    gg.ggplot(d2, gg.aes(x="year", y="doy"))
    + gg.geom_point(color="black", alpha=0.4, size=1.3)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(x="year", y="days of year", title="Cherry blossom data")
)

Model

We will fit the following model.

$D \sim \mathcal{N}(\mu, \sigma)$
$\quad \mu = a + Bw$
$\qquad a \sim \mathcal{N}(100, 10)$
$\qquad w \sim \mathcal{N}(0, 10)$
$\quad \sigma \sim \text{Exp}(1)$

The day of first bloom $D$ is modeled as normally distributed with mean $\mu$ and standard deviation $\sigma$. The mean is a linear model composed of a y-intercept $a$ and a spline with basis matrix $B$ and weights $w$. Both $a$ and $w$ have relatively weak normal priors.
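
To get a feel for what these priors imply, a minimal NumPy sketch can draw from them directly. The basis `B` used here is a random stand-in whose rows sum to 1 (an assumption for illustration, not the real b-spline basis built later), and the names `n_obs` and `n_basis` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(847)

n_obs, n_basis = 50, 17  # hypothetical sizes, chosen only for illustration

# Stand-in basis: non-negative rows that sum to 1, similar in spirit to a b-spline basis.
B = rng.dirichlet(np.ones(n_basis), size=n_obs)

# One draw from each prior in the model above.
a = rng.normal(100, 10)               # a ~ Normal(100, 10)
w = rng.normal(0, 10, size=n_basis)   # w ~ Normal(0, 10), one weight per basis column
sigma = rng.exponential(1.0)          # sigma ~ Exp(1): rate 1, so scale 1

mu = a + B @ w                        # linear model: mu = a + Bw
D = rng.normal(mu, sigma)             # D ~ Normal(mu, sigma)

print(mu.shape, D.shape)
```

Each row of `B` mixes the weights in `w`, so `mu` has one value per observation even though there are only `n_basis` weights.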

Prepare the spline

We can now prepare the spline matrix. First, we must determine the boundaries of the spline, often referred to as “knots” because the different lines will be tied together at these boundaries to make a continuous and smooth curve. For this example, we will create 15 knots placed at evenly spaced quantiles of the year data (the x-axis).

num_knots = 15
knot_list = np.quantile(d2.year, np.linspace(0, 1, num_knots))

>>> knot_list
array([ 812., 1036., 1174., 1269., 1377., 1454., 1518., 1583., 1650.,
1714., 1774., 1833., 1893., 1956., 2015.])
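
Placing knots at quantiles, rather than at evenly spaced x-values, puts roughly the same number of observations between each pair of knots. The sketch below illustrates this with synthetic, unevenly distributed "years" (made-up data, not the cherry blossom set).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic x-values that are sparse early and dense late (illustrative only).
x = np.sort(
    np.concatenate([rng.uniform(800, 1400, 100), rng.uniform(1400, 2000, 700)])
)

num_knots = 5
quantile_knots = np.quantile(x, np.linspace(0, 1, num_knots))
even_knots = np.linspace(x.min(), x.max(), num_knots)

# Count the observations falling between each pair of consecutive knots.
counts_quantile, _ = np.histogram(x, bins=quantile_knots)
counts_even, _ = np.histogram(x, bins=even_knots)

print("quantile knots:", counts_quantile.tolist())
print("even knots:    ", counts_even.tolist())
```

With quantile knots every interval holds about the same number of points, so each segment of the spline is informed by a similar amount of data.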

Below is a plot of the data we are modeling, with the knots indicated by the vertical gray lines.

(
    gg.ggplot(d2, gg.aes(x="year", y="doy"))
    + gg.geom_point(color="black", alpha=0.4, size=1.3)
    + gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.8)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(x="year", y="days of year", title="Cherry blossom data with spline knots")
)

We can get an idea of what the spline will look like by fitting a LOESS curve (a local polynomial regression).

(
    gg.ggplot(d2, gg.aes(x="year", y="doy"))
    + gg.geom_point(color="black", alpha=0.4, size=1.3)
    + gg.geom_smooth(method="loess", span=0.3, size=1.5, color="blue", linetype="-")
    + gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.8)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(x="year", y="days of year", title="Cherry blossom data with spline knots")
)

Another way of visualizing what the spline should look like is to plot individual linear models over the data between each knot. The spline will effectively be a compromise between these individual models and a continuous curve.

d2["knot_group"] = [np.where(a <= knot_list)[0][0] for a in d2.year]
d2["knot_group"] = pd.Categorical(d2["knot_group"], ordered=True)
(
    gg.ggplot(d2, gg.aes(x="year", y="doy"))
    + gg.geom_point(color="black", alpha=0.4, size=1.3)
    + gg.geom_smooth(gg.aes(group="knot_group"), method="lm", size=1.5, color="red", linetype="-")
    + gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.8)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(x="year", y="days of year", title="Cherry blossom data with spline knots")
)
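
The grouping step above assigns each year the index of the first knot that is greater than or equal to it. As a small sketch with a handful of hand-picked example years (not taken from the data), the same grouping can be computed without a Python loop using `np.searchsorted`.

```python
import numpy as np

# Knot positions copied from the printed knot_list above.
knot_list = np.array([812., 1036., 1174., 1269., 1377., 1454., 1518., 1583.,
                      1650., 1714., 1774., 1833., 1893., 1956., 2015.])

years = np.array([812, 815, 1036, 1037, 2015])  # a few example years

# Approach used in the post: index of the first knot that is >= the year.
groups_loop = np.array([np.where(y <= knot_list)[0][0] for y in years])

# Vectorized equivalent: leftmost insertion point into the sorted knots.
groups_vec = np.searchsorted(knot_list, years, side="left")

print(groups_loop.tolist())  # [0, 1, 1, 2, 14]
print(groups_vec.tolist())   # [0, 1, 1, 2, 14]
```

Note that only years equal to the first knot (812) land in group 0, so that group is very small under this rule.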

Finally we can use ‘patsy’ to create the matrix $B$ that will be the b-spline basis for the regression. The degree is set to 3 to create a cubic b-spline.

B = dmatrix(
    "bs(year, knots=knots, degree=3, include_intercept=True) - 1",
    {"year": d2.year.values, "knots": knot_list[1:-1]},
)

>>> B
DesignMatrix with shape (827, 17)
Columns:
['bs(year, knots=knots, degree=3, include_intercept=True)[0]',
'bs(year, knots=knots, degree=3, include_intercept=True)[1]',
'bs(year, knots=knots, degree=3, include_intercept=True)[2]',
'bs(year, knots=knots, degree=3, include_intercept=True)[3]',
'bs(year, knots=knots, degree=3, include_intercept=True)[4]',
'bs(year, knots=knots, degree=3, include_intercept=True)[5]',
'bs(year, knots=knots, degree=3, include_intercept=True)[6]',
'bs(year, knots=knots, degree=3, include_intercept=True)[7]',
'bs(year, knots=knots, degree=3, include_intercept=True)[8]',
'bs(year, knots=knots, degree=3, include_intercept=True)[9]',
'bs(year, knots=knots, degree=3, include_intercept=True)[10]',
'bs(year, knots=knots, degree=3, include_intercept=True)[11]',
'bs(year, knots=knots, degree=3, include_intercept=True)[12]',
'bs(year, knots=knots, degree=3, include_intercept=True)[13]',
'bs(year, knots=knots, degree=3, include_intercept=True)[14]',
'bs(year, knots=knots, degree=3, include_intercept=True)[15]',
'bs(year, knots=knots, degree=3, include_intercept=True)[16]']
Terms:
'bs(year, knots=knots, degree=3, include_intercept=True)' (columns 0:17)
(to view full data, use np.asarray(this_obj))
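
The 17 columns follow from the knot layout: 13 interior knots plus degree 3 plus 1 gives 17 basis functions. As a rough sketch of what a b-spline basis is, the function below evaluates one with the Cox-de Boor recursion in plain NumPy. It assumes the boundary knots are the first and last entries of `knot_list`, and `bspline_basis` is a hypothetical helper written for illustration, not part of ‘patsy’.

```python
import numpy as np

def bspline_basis(x, interior_knots, lower, upper, degree=3):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    # Clamped knot vector: each boundary knot is repeated degree + 1 times.
    t = np.concatenate([[lower] * (degree + 1), interior_knots, [upper] * (degree + 1)])
    n_basis = len(t) - degree - 1

    # Degree 0: indicator functions of the half-open knot intervals.
    basis = np.zeros((len(x), len(t) - 1))
    for i in range(len(t) - 1):
        basis[:, i] = (t[i] <= x) & (x < t[i + 1])
    # Close the last non-empty interval on the right so x == upper is covered.
    basis[x == upper, n_basis - 1] = 1.0

    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        new = np.zeros((len(x), len(t) - 1 - d))
        for i in range(len(t) - 1 - d):
            left_den = t[i + d] - t[i]
            right_den = t[i + d + 1] - t[i + 1]
            if left_den > 0:
                new[:, i] += (x - t[i]) / left_den * basis[:, i]
            if right_den > 0:
                new[:, i] += (t[i + d + 1] - x) / right_den * basis[:, i + 1]
        basis = new
    return basis

# Knot positions copied from the printed knot_list above.
knot_list = np.array([812., 1036., 1174., 1269., 1377., 1454., 1518., 1583.,
                      1650., 1714., 1774., 1833., 1893., 1956., 2015.])
x_grid = np.linspace(knot_list[0], knot_list[-1], 200)
B_manual = bspline_basis(x_grid, knot_list[1:-1], knot_list[0], knot_list[-1])

print(B_manual.shape)                          # (200, 17)
print(np.allclose(B_manual.sum(axis=1), 1.0))  # True
```

Every row sums to 1, which is why the basis with `include_intercept=True` already spans a constant term.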

The b-spline basis is plotted below.

spline_df = (
    pd.DataFrame(B)
    .assign(year=d2.year.values)
    .melt("year", var_name="spline_i", value_name="value")
)
(
    gg.ggplot(spline_df, gg.aes(x="year", y="value"))
    + gg.geom_line(gg.aes(group="spline_i", color="spline_i"))
    + gg.scale_color_discrete(guide=gg.guide_legend(ncol=2))
    + gg.labs(x="year", y="basis", color="spline idx")
)

Fitting

Finally, the model can be built using PyMC3. A graphical diagram shows the organization of the model parameters.

with pm.Model() as m4_7:
    a = pm.Normal("a", 100, 10)
    w = pm.Normal("w", mu=0, sd=10, shape=B.shape[1])
    mu = pm.Deterministic("mu", a + pm.math.dot(np.asarray(B, order="F"), w.T))
    sigma = pm.Exponential("sigma", 1)
    D = pm.Normal("D", mu, sigma, observed=d2.doy)

pm.model_to_graphviz(m4_7)

We draw 2,000 samples from the posterior distribution per chain, after 2,000 tuning steps each, along with samples for prior and posterior predictive checks.

with m4_7:
    prior_pc = pm.sample_prior_predictive(random_seed=RANDOM_SEED)
    trace_m4_7 = pm.sample(2000, tune=2000, random_seed=RANDOM_SEED)
    post_pc = pm.sample_posterior_predictive(trace_m4_7, random_seed=RANDOM_SEED)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, w, a]
Sampling 2 chains, 0 divergences: 100%|██████████| 8000/8000 [00:30<00:00, 259.07draws/s]
The number of effective samples is smaller than 25% for some parameters.
100%|██████████| 4000/4000 [00:06<00:00, 591.29it/s]

As mentioned above, the model and sampling results are collated into an ArviZ object for ease of use.

az_m4_7 = az.from_pymc3(
    model=m4_7, trace=trace_m4_7, posterior_predictive=post_pc, prior=prior_pc
)

Fit parameters

Below is a table summarizing the posterior distributions of the model parameters. The posteriors of $a$ and $\sigma$ are quite narrow while those for $w$ are wider. This is likely because all of the data points are used to estimate $a$ and $\sigma$ whereas only a subset are used for each value of $w$. The number of effective samples for $a$ is quite low, though, likely due to autocorrelation of the MCMC chains (this is visible in the following plots of the trace).

az.summary(az_m4_7, var_names=["a", "w", "sigma"])

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_mean	ess_sd	ess_bulk	ess_tail	r_hat
a	103.303	2.424	98.879	107.724	0.098	0.069	615	614	618	1027	1
w[0]	-2.876	3.862	-10.391	4.208	0.11	0.078	1225	1225	1223	2105	1
w[1]	-0.92	3.944	-7.944	6.87	0.109	0.077	1303	1303	1306	1794	1
w[2]	-0.95	3.64	-7.69	5.972	0.115	0.082	994	994	995	1799	1
w[3]	4.896	2.917	-1.005	10.029	0.099	0.07	871	871	872	1236	1
w[4]	-0.827	2.937	-6.642	4.437	0.105	0.075	776	776	781	1199	1
w[5]	4.384	2.969	-0.994	10.089	0.098	0.069	921	921	922	1600	1
w[6]	-5.305	2.848	-10.728	-0.249	0.103	0.073	771	771	774	1226	1
w[7]	7.899	2.845	2.319	12.968	0.098	0.07	848	818	849	1546	1
w[8]	-0.974	2.921	-6.47	4.431	0.1	0.07	861	861	863	1402	1
w[9]	3.132	3.007	-2.191	9.091	0.1	0.071	910	906	913	1399	1
w[10]	4.676	2.909	-0.455	10.563	0.104	0.074	780	780	781	1377	1
w[11]	-0.085	2.952	-5.434	5.604	0.098	0.069	909	909	911	1468	1
w[12]	5.6	2.947	0.167	11.29	0.104	0.073	809	809	813	1279	1
w[13]	0.784	3.116	-5.015	6.579	0.103	0.073	924	924	927	1382	1
w[14]	-0.782	3.333	-7.152	5.164	0.104	0.073	1030	1030	1030	1404	1
w[15]	-6.933	3.501	-13.454	-0.133	0.106	0.075	1091	1091	1084	1684	1
w[16]	-7.61	3.292	-14.056	-1.642	0.104	0.075	1003	965	1005	1368	1
sigma	5.946	0.147	5.66	6.199	0.002	0.002	3684	3671	3709	2558	1
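
To illustrate why autocorrelation lowers the effective sample size, here is a deliberately simplified, single-chain estimator applied to simulated chains. It is only a sketch: ArviZ computes `ess_bulk` and `ess_tail` with a more elaborate multi-chain method, and `simple_ess` is a hypothetical helper, not an ArviZ function.

```python
import numpy as np

def simple_ess(chain):
    """Crude ESS: N / (1 + 2 * sum of autocorrelations up to the first negative one)."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocovariance at lags 0..n-1.
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0
    for k in range(1, n):
        if rho[k] < 0:  # truncate the sum at the first negative estimate
            break
        tau += 2.0 * rho[k]
    return n / tau

rng = np.random.default_rng(1)
n = 2000

# Independent draws: each sample carries new information.
iid = rng.normal(size=n)

# AR(1) chain with strong positive autocorrelation (coefficient 0.9).
ar = np.zeros(n)
for i in range(1, n):
    ar[i] = 0.9 * ar[i - 1] + rng.normal()

ess_iid = simple_ess(iid)
ess_ar = simple_ess(ar)
print(round(ess_iid), round(ess_ar))
```

The autocorrelated chain yields far fewer effective samples than its nominal length, which is the same pattern seen for $a$ in the table above.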

We can visualize the trace (MCMC samples) of $a$ and $\sigma$, again showing they were confidently estimated.

az.plot_trace(az_m4_7, var_names=["a", "sigma"])
plt.show()

A forest plot shows that the posterior distributions of the $w$ values are wider, though some fall mostly away from 0, indicating a non-null effect/association.

az.plot_forest(az_m4_7, var_names=["w"], combined=True)
plt.show()

Another visualization of the fit spline values is to plot them multiplied against the basis matrix. The knot boundaries are shown in gray again, but now the spline bases are multiplied against the values of $w$ (represented as the rainbow-colored curves). The dot product of $B$ and $w$, the actual computation in the linear model, is shown in blue.

wp = trace_m4_7["w"].mean(0)
spline_df = (
    pd.DataFrame(B * wp.T)
    .assign(year=d2.year.values)
    .melt("year", var_name="spline_i", value_name="value")
)
spline_df_merged = (
    pd.DataFrame(np.dot(B, wp.T))
    .assign(year=d2.year.values)
    .melt("year", var_name="spline_i", value_name="value")
)
(
    gg.ggplot(spline_df, gg.aes(x="year", y="value"))
    + gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.5)
    + gg.geom_line(data=spline_df_merged, linetype="-", color="blue", size=2, alpha=0.7)
    + gg.geom_line(gg.aes(group="spline_i", color="spline_i"), alpha=0.7, size=1)
    + gg.scale_color_discrete(guide=gg.guide_legend(ncol=2), color_space="husl")
    + gg.theme(figure_size=(10, 5))
    + gg.labs(x="year", y="basis", title="Fit spline", color="spline idx")
)
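
The relationship between the two data frames in the plot can be checked on a toy example: scaling each basis column by its weight and summing across columns gives the same result as the dot product. The small `B` and `w` below are made-up stand-ins, not the fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in basis (5 observations, 3 basis functions) and weights; values are made up.
B = rng.dirichlet(np.ones(3), size=5)
w = np.array([2.0, -1.0, 0.5])

weighted = B * w                  # broadcasting scales column j by w[j]; shape (5, 3)
combined = weighted.sum(axis=1)   # summing across columns gives one value per row

print(weighted.shape)                # (5, 3)
print(np.allclose(combined, B @ w))  # True
```

In the plot, the colored curves correspond to the columns of `weighted` and the blue curve to `combined`.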

Model predictions

Lastly, we can visualize the predictions of the model by plotting the posterior mean of $\mu$ and its 94% HDI over the data.

post_pred = az.summary(az_m4_7, var_names=["mu"]).reset_index(drop=True)
d2_post = d2.copy().reset_index(drop=True)
d2_post["pred_mean"] = post_pred["mean"]
d2_post["pred_hdi_lower"] = post_pred["hdi_3%"]
d2_post["pred_hdi_upper"] = post_pred["hdi_97%"]

(
    gg.ggplot(d2_post, gg.aes(x="year"))
    + gg.geom_ribbon(
        gg.aes(ymin="pred_hdi_lower", ymax="pred_hdi_upper"), alpha=0.3, fill="tomato"
    )
    + gg.geom_line(gg.aes(y="pred_mean"), color="firebrick", alpha=1, size=2)
    + gg.geom_point(gg.aes(y="doy"), color="black", alpha=0.4, size=1.3)
    + gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.8)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(
        x="year",
        y="days of year",
        title="Cherry blossom data with posterior predictions",
    )
)
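
The ribbon uses the `hdi_3%` and `hdi_97%` columns, which together bound a 94% highest density interval (HDI). As a rough sketch of the idea, the narrowest interval containing a given share of the samples can be found by sorting them and sliding a window. The `hdi` function below is a hypothetical helper that assumes a unimodal distribution; it is not the ArviZ implementation, and the draws are simulated rather than taken from the model.

```python
import numpy as np

def hdi(samples, prob=0.94):
    """Narrowest interval spanning about `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    k = int(np.floor(prob * n))    # index offset between the two interval ends
    widths = x[k:] - x[: n - k]    # width of every candidate interval
    i = int(np.argmin(widths))     # start of the narrowest one
    return x[i], x[i + k]

rng = np.random.default_rng(3)
draws = rng.normal(0.0, 1.0, size=20_000)  # simulated stand-in for posterior draws

lo, hi = hdi(draws, prob=0.94)
inside = np.mean((draws >= lo) & (draws <= hi))
print(round(lo, 2), round(hi, 2), round(inside, 3))
```

For a symmetric distribution such as this one, the HDI is close to the equal-tailed interval. The two differ for skewed posteriors.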
