Locally Sparse Functional Regression
Overview
In this post we present a new estimation procedure for functional linear regression, useful when the regression surface – or curve – is assumed to be exactly zero within specific regions of its domain. Our approach relies on regularization techniques, combining a B-spline representation of the unknown coefficient function with a tailored overlap group lasso penalty. The methodology is illustrated on the well-known Swedish mortality dataset and can be employed by R users through the package fdaSP.
Introduction
We consider the framework in which a functional response yi(s), with s∈S, is observed together with a functional covariate xi(t), with t∈T, for i=1,…,n. The simplest modeling strategy for function-on-function regression is the concurrent model, which assumes that T=S and that the covariate x(⋅) influences y(s) only through its value x(s) at the domain point s∈S. When both variables are centered, the relation is given by
$$y_i(s) = x_i(s)\,\psi(s) + e_i(s),$$
where ψ(s) is the regression function and ei(s) is a functional zero-mean random error. The more general approach, named nonconcurrent functional linear model, allows yi(s) to entirely depend on the functional regressor in the following way
$$y_i(s) = \int_{\mathcal{T}} x_i(t)\,\psi(t,s)\,dt + e_i(s).$$
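To fix ideas, here is a minimal simulation sketch of the nonconcurrent model on a common grid, with the integral approximated by a Riemann sum. All the choices below (grid size, covariate curves, the surface psi) are illustrative and are not taken from the paper.

```r
set.seed(1)
n <- 50; G <- 100
tt <- seq(0, 1, length.out = G)   # grid on the covariate domain T
ss <- seq(0, 1, length.out = G)   # grid on the response domain S
# smooth toy covariate curves x_i(t): random combinations of sine and cosine
X <- t(replicate(n, rnorm(1) * sin(2 * pi * tt) + rnorm(1) * cos(2 * pi * tt)))
# a toy regression surface psi(t, s), concentrated around the diagonal t = s
psi <- outer(tt, ss, function(t, s) exp(-50 * (t - s)^2))
E <- matrix(rnorm(n * G, sd = 0.1), n, G)   # functional errors e_i(s)
# y_i(s) = integral of x_i(t) psi(t, s) dt + e_i(s), integral as a Riemann sum
Y <- X %*% psi * (tt[2] - tt[1]) + E
dim(Y)   # n x G matrix of response curves
```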
Locally sparse functional model
B-splines and sparsity
In order to represent functional objects using basis expansions, we select a basis {θl(s), l = 1, …, L} of dimension L in the space of square-integrable functions on S and a basis {φm(t), m = 1, …, M} of dimension M in the space of square-integrable functions on T. Exploiting a tensor product of these two, we represent the regression coefficient ψ as
$$\psi(t,s) = \sum_{m=1}^{M}\sum_{l=1}^{L} \psi_{ml}\,\varphi_m(t)\,\theta_l(s) = \big(\varphi_1(t),\dots,\varphi_M(t)\big) \begin{pmatrix} \psi_{1,1} & \cdots & \psi_{1,L} \\ \vdots & \ddots & \vdots \\ \psi_{M,1} & \cdots & \psi_{M,L} \end{pmatrix} \begin{pmatrix} \theta_1(s) \\ \vdots \\ \theta_L(s) \end{pmatrix} = \varphi(t)^{T}\,\Psi\,\theta(s),$$
where ψml∈R for l = 1, …, L and m = 1, …, M. A key point is to assume that the basis functions in this representation are B-splines (De Boor, 1978) of order d with L−d and M−d interior knots, respectively. Suitable zero patterns in the B-spline basis coefficients of Ψ induce sparsity in ψ(t,s). Let τ1 < ⋯ < τM−d+2 and σ1 < ⋯ < σL−d+2 denote the knots defining the tensor product splines, with τ1, τM−d+2 and σ1, σL−d+2 being the boundaries of the two domains, and let Dml ⊂ D = T × S be the rectangular subset defined as Dml = (τm, τm+1) × (σl, σl+1) for m = 1, …, M−d+1 and l = 1, …, L−d+1. To obtain ψ(t,s) = 0 for each (t,s) ∈ Dml, it is sufficient that all the coefficients ψm′l′ with m′ = m, …, m+d−1 and l′ = l, …, l+d−1 are jointly zero. The following figure further clarifies this concept.
Bivariate example with tensor product cubic B-splines (d=4). The top row shows different coefficient matrix patterns, while the bottom row shows the corresponding spline, where dots represent knots and the set D is highlighted in red. The first two columns show a coefficient matrix with isolated zeros and with a (d-1) x (d-1) block of zeros, respectively. Neither is able to produce a sparse function. In the last column, conversely, an entire d x d block of coefficients is null and the resulting function is indeed sparse.
The previous figure suggests that, in general, ψ(t,s) equals zero on the region identified by two pairs of consecutive knots if the related d × d block of coefficients of Ψ is entirely set to zero. Therefore, Ψ should be suitably partitioned into several blocks of dimension d × d, on which a joint sparsity penalty is imposed.
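This property can be checked directly with the splines package that ships with base R. The snippet below is only an illustration (the values of d, M, L, the knot placement and the coefficients are arbitrary): zeroing a whole d × d block of tensor-product coefficients yields a surface that vanishes exactly on the corresponding rectangle Dml.

```r
library(splines)

d <- 4                                       # cubic B-splines
M <- 10; L <- 10                             # basis dimensions (arbitrary here)
br_t <- seq(0, 1, length.out = M - d + 2)    # breakpoints tau on T
br_s <- seq(0, 1, length.out = L - d + 2)    # breakpoints sigma on S
kt <- c(rep(0, d - 1), br_t, rep(1, d - 1))  # full knot sequences with
ks <- c(rep(0, d - 1), br_s, rep(1, d - 1))  # repeated boundary knots

tt <- seq(0, 1, length.out = 201)
ss <- seq(0, 1, length.out = 201)
Bt <- splineDesign(kt, tt, ord = d)          # 201 x M matrix of phi_m(t)
Bs <- splineDesign(ks, ss, ord = d)          # 201 x L matrix of theta_l(s)

set.seed(1)
Psi <- matrix(rnorm(M * L), M, L)            # generic coefficient matrix
m <- 3; l <- 5
Psi[m:(m + d - 1), l:(l + d - 1)] <- 0       # zero a whole d x d block

surf <- Bt %*% Psi %*% t(Bs)                 # psi(t, s) = phi(t)' Psi theta(s)
in_t <- tt >= br_t[m] & tt <= br_t[m + 1]
in_s <- ss >= br_s[l] & ss <= br_s[l + 1]
max(abs(surf[in_t, in_s]))                   # 0: the surface vanishes on D_ml
```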
Note that in the simpler situation where a functional covariate xi(t) is observed together with a scalar response yi, the following simplifications apply. The functional linear model becomes
$$y_i = \int_{\mathcal{T}} x_i(t)\,\psi(t)\,dt + e_i,$$
Univariate cubic B-spline basis and resulting spline functions. Dashed curves correspond to bases with a zero-valued coefficient. Only when d = 4 consecutive coefficients are zero is the resulting spline function null on a set of positive Lebesgue measure (in red).
Overlap group Lasso
With the B-spline sparsity properties described above in mind, in order to estimate a locally sparse regression surface we minimize the objective function
$$\frac{1}{2}\sum_{i=1}^{n}\int_{\mathcal{S}}\left(y_i(s) - \int_{\mathcal{T}} x_i(t)\,\psi(t,s)\,dt\right)^2 ds + \lambda\,\Omega(\Psi),$$
but, instead of specifying the functional form of the penalty Ω(⋅) as the widely employed Lasso (Tibshirani, 1996) or group Lasso (Yuan and Lin, 2006), neither of which works properly here (see the paper for details), we propose a penalty truly tailored to the problem. First, instead of a disjoint partition, we define an overlapping sequence of blocks of size d × d. Specifically, we introduce the block index b = 1, …, B, with the total number of blocks B = (M−d+1) × (L−d+1). Notably, there is a block for each set Dml. This overlapping structure, obtained by sliding a block of minimal size, allows the null region D0, where ψ is exactly zero, to be the union of any collection of sets Dml. An example of overlapping covering when d = 4 (cubic splines) is shown in the following figure.
Overlapping covering of the coefficient matrix (L = M = 12) with the first 4 blocks of size d × d, d = 4. Colors according to the balancing vector.
This construction suggests specifying a penalty Ω for overlapping groups of coefficients, an approach that has attracted significant interest in the last decade; see Jacob, Obozinski and Vert (2009), Jenatton, Audibert and Bach (2011), and Lim and Hastie (2015). Being interested in the sparsity structure of the matrix of coefficients Ψ rather than in its support, we particularize the previous problem as
$$\frac{1}{2}\sum_{i=1}^{n}\int_{\mathcal{S}}\left(y_i(s) - \int_{\mathcal{T}} x_i(t)\,\psi(t,s)\,dt\right)^2 ds + \lambda \sum_{b=1}^{B+1} \lVert c_b \odot \psi \rVert_2,$$
where λ > 0 is a fixed penalization parameter and Ω is specified as the sum of B+1 Euclidean norms ‖cb ⊙ ψ‖2, where ψ = vec(Ψ) and ⊙ denotes the Hadamard product. The index b denotes a block of coefficients in Ψ, with the first B blocks following the construction above and the last block containing all the coefficients in Ψ. The vectors cb, of size ML, extract the correct subset of entries of Ψ and also contain known constants that equally balance the penalization of the coefficients. This balancing is needed because parameters close to the boundaries appear in fewer groups than the central ones. Note that this penalty constitutes a special case of the norm defined by Jenatton, Audibert, and Bach (2011).
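As an illustration, the overlapping covering of the previous figure can be built as follows. The per-coefficient count computed at the end only shows why balancing is needed; it does not reproduce the exact balancing constants cb used in the paper.

```r
d <- 4; M <- 12; L <- 12                  # as in the figure above
blocks <- list()
for (m in 1:(M - d + 1)) {
  for (l in 1:(L - d + 1)) {
    # positions, in vec(Psi) (column-major), of the d x d block starting at (m, l)
    idx <- as.vector(outer(m:(m + d - 1), (l:(l + d - 1)) - 1,
                           function(i, j) i + j * M))
    blocks[[length(blocks) + 1]] <- idx
  }
}
length(blocks)                            # B = (M - d + 1) * (L - d + 1) = 81
# number of blocks each coefficient belongs to: boundary coefficients appear
# in fewer blocks, which is why the c_b contain balancing constants
counts <- matrix(tabulate(unlist(blocks), nbins = M * L), M, L)
counts[1:5, 1:5]
```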
In the case of a scalar response, the objective function becomes
$$\frac{1}{2}\sum_{i=1}^{n}\left(y_i - \int_{\mathcal{T}} x_i(t)\,\psi(t)\,dt\right)^2 + \lambda \sum_{b=1}^{B+1} \lVert c_b \odot \psi \rVert_2,$$
where ψ and cb are vectors of dimension M and each of the first B blocks contains d consecutive coefficients, as in the following example.
Overlapping covering of the coefficient vector (M = 12) when the response is scalar with the first 4 blocks of size d = 4. Colors according to the balancing vector.
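The scalar-response analogue of the previous sketch is even simpler; again, the balancing constants in cb are not reproduced here.

```r
d <- 4; M <- 12                                   # as in the figure above
blocks_1d <- lapply(1:(M - d + 1), function(b) b:(b + d - 1))
length(blocks_1d)                                 # B = M - d + 1 = 9
tabulate(unlist(blocks_1d), nbins = M)            # boundary coefficients lie in fewer blocks
```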
Computational Considerations
To develop an efficient computational strategy, we introduce the empirical counterparts of the quantities described in the previous section, assuming that a sample of response curves yi, i = 1, …, n, is observed on a common grid of G points, i.e. yi = (yi(s1), …, yi(sG))T. Let also xi be the related functional covariate, observed on a possibly different grid of points, common across i, that for simplicity and without loss of generality we assume to be of length G. Let X be the n × G matrix with the xi in its rows and Y the n × G matrix with the yi in its rows. Let Φ and Θ be the M × G and L × G matrices defined as
$$\Phi = \begin{pmatrix} \varphi_1(t_1) & \cdots & \varphi_1(t_G) \\ \vdots & & \vdots \\ \varphi_m(t_1) & \cdots & \varphi_m(t_G) \\ \vdots & & \vdots \\ \varphi_M(t_1) & \cdots & \varphi_M(t_G) \end{pmatrix}, \qquad \Theta = \begin{pmatrix} \theta_1(s_1) & \cdots & \theta_1(s_G) \\ \vdots & & \vdots \\ \theta_l(s_1) & \cdots & \theta_l(s_G) \\ \vdots & & \vdots \\ \theta_L(s_1) & \cdots & \theta_L(s_G) \end{pmatrix},$$
so that, up to quadrature constants, the discretized model can be written in matrix form as Y = XΦTΨΘ + E, with E the matrix of errors.
Applying the vectorization operator to each side of this equality, we obtain the optimization problem
$$\frac{1}{2}\,\lVert y - Z\psi\rVert_2^2 + \lambda \sum_{b=1}^{B+1} \lVert D_b\,\psi \rVert_2,$$
where y = vec(Y), ψ = vec(Ψ) is the vector of coefficients of dimension LM, Z = ΘT ⊗ XΦT, and Db = diag(cb) is a diagonal matrix whose elements correspond to the elements of the vector cb. The minimization of the quantity above is challenging because of the non-separability of the overlap group Lasso penalty. We propose a Majorization-Minimization (MM) algorithm (Lange, 2016) to obtain the solution, although other choices are viable, e.g. the Alternating Direction Method of Multipliers (ADMM); see Boyd et al. (2011). The MM procedure works in two steps: first, a majorizing function Q(ψ | ψ̂k), based on the current estimate, is determined; second, this function is minimized. Alternating between the two steps guarantees convergence to a solution of the original problem. The curious reader can find the expression of Q(ψ | ψ̂k) in the original paper; here we just point out that the function is quadratic and admits an explicit solution in the form of a generalized ridge regression. A further speed-up when the dimension of the problem is high is obtained by exploiting the Sherman-Morrison-Woodbury matrix identity, also known as the matrix inversion lemma.
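To make the recipe concrete, here is a minimal sketch of the vectorized design matrix and of one MM iteration. The majorizer below is the standard quadratic bound on the Euclidean norm; the exact majorizing function Q(ψ | ψ̂k) derived in the paper may differ in its details, and the helper names (build_Z, mm_step, D_list) are our own, not part of fdaSP.

```r
# Z = Theta^T (Kronecker product) X Phi^T, as in the vectorized problem above
build_Z <- function(X, Phi, Theta) {
  kronecker(t(Theta), X %*% t(Phi))
}

# One MM iteration: each group norm is majorized by the quadratic bound
# ||v||_2 <= ||v||_2^2 / (2 ||v0||_2) + ||v0||_2 / 2, so the majorized
# objective is a generalized ridge problem with an explicit solution.
mm_step <- function(psi_k, Z, y, lambda, D_list, eps = 1e-8) {
  p <- length(psi_k)
  W <- matrix(0, p, p)
  for (D in D_list) {                       # D_b = diag(c_b), one matrix per block
    nb <- sqrt(sum((D %*% psi_k)^2))        # ||D_b psi_k||_2
    W  <- W + crossprod(D) / max(nb, eps)   # D_b' D_b / ||D_b psi_k||_2
  }
  drop(solve(crossprod(Z) + lambda * W, crossprod(Z, y)))
}
```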
When the response is scalar, the following modifications apply. The design matrix is Z = XΦT, y and ψ are vectors of dimension n and M, respectively, and the diagonal matrix Db is of dimension M × M. The optimization problem is otherwise unchanged and the same class of algorithms can be employed to obtain the solution.
Swedish mortality revisited
We apply the proposed locally sparse estimator to the Swedish mortality dataset, where the aim is to predict the log-hazard function in a given year from the same quantity in the previous year. We implement the model with d = 4 and M = L = 20 basis functions in each dimension and select the optimal value of λ by means of cross-validation. A perspective plot of the estimated surface is shown in the figure below. The estimate shows a marked positive diagonal, confirming the positive influence on the log-hazard rate at age s of the previous year's curve evaluated in a neighborhood of s. At the same time, the flat zero regions outside the diagonal suggest that the curves evaluated at distant ages have no influence. Our estimate is more regular than those of previous approaches, and its qualitative interpretation is sharper and easier; compare, for example, with Figure 10.11 of Ramsay, Hooker, and Graves (2009).
Estimated regression surface for the Swedish mortality dataset.
This result demonstrates the practical relevance of the proposed approach. Indeed, the resulting estimate, while reminiscent of a concurrent model and inheriting its ease of interpretation, gives further insights and improves the fit, representing the desired intermediate solution between the concurrent and nonconcurrent models.
R package
The method is implemented in R through the package fdaSP, available on CRAN. The function f2fSP fits the functional-response regression model, while f2sSP fits the scalar-response one. The two counterparts f2fSP_cv and f2sSP_cv select the tuning parameter by means of cross-validation.
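A hedged sketch of a typical workflow is reported below. The argument names (mY, mX, L, M, lambda) and the name of the element returned by the cross-validation routine are assumptions based on the package documentation and should be checked against ?f2fSP and ?f2fSP_cv in the installed version.

```r
# install.packages("fdaSP")
library(fdaSP)

# Y: n x G matrix of response curves, X: n x G matrix of covariate curves
# (for the Swedish mortality example, log-hazard curves of consecutive years);
# argument and element names below are assumed, check the package help pages
cvfit <- f2fSP_cv(mY = Y, mX = X, L = 20, M = 20)   # cross-validate lambda
fit   <- f2fSP(mY = Y, mX = X, L = 20, M = 20,
               lambda = cvfit$lambda.min)           # assumed element name
```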
This post is based on
Bernardi, M., Canale, A. and Stefanucci, M. (2022). Locally Sparse Function-on-Function Regression, Journal of Computational and Graphical Statistics, 32:3, 985-999, DOI: 10.1080/10618600.2022.2130926
References
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the
Lasso,” Journal of the Royal Statistical Society, Series B, 58,
267–288.
Yuan, M., and Lin, Y. (2006), “Model Selection and Estimation in
Regression with Grouped Variables,” Journal of the Royal Statistical
Society, Series B, 68, 49–67.
Jacob, L., Obozinski, G., and Vert, J.-P. (2009), “Group Lasso with
Overlap and Graph Lasso,” in Proceedings of the 26th Annual
International Conference on Machine Learning, pp. 433–440.
Jenatton, R., Audibert, J.-Y., and Bach, F. (2011), “Structured Variable
Selection with Sparsity-Inducing Norms,” Journal of Machine Learning
Research, 12, 2777–2824.
Lim, M., and Hastie, T. (2015), “Learning Interactions via Hierarchical
Group-Lasso Regularization,” Journal of Computational and Graphical
Statistics, 24, 627–654.
Lange, K. (2016), “MM optimization algorithms.” Society for Industrial
and Applied Mathematics, Philadelphia, PA.
Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011),
“Distributed optimization and statistical learning via the alternating
direction method of multipliers,” Foundations and Trends® in Machine
Learning, 3, 1–122.
Ramsay, J. O., Hooker, G., and Graves, S. (2009), Functional Data
Analysis with R and MATLAB, New York: Springer.