Locally Sparse Functional Regression


Overview

In this post we present a new estimation procedure for functional linear regression useful when the regression surface – or curve – is supposed to be exactly zero within specific regions of its domain. Our approach involves regularization techniques, merging a B-spline representation of the unknown coefficient function with a peculiar overlap group lasso penalty. The methodology is illustrated on the well-known Swedish mortality dataset and can be employed by R users through the package fdaSP.

Introduction

We consider the framework in which a functional response $y_i(s)$, with $s \in S$, is observed together with a functional covariate $x_i(t)$, with $t \in T$, for $i = 1, \dots, n$. The simplest modeling strategy for function-on-function regression is the concurrent model, which assumes that $T = S$ and that the covariate $x(\cdot)$ influences $y(s)$ only through its value $x(s)$ at the domain point $s \in S$. When both variables are centered, the relation is given by

$$ y_i(s) = x_i(s)\,\psi(s) + e_i(s), $$

where $\psi(s)$ is the regression function and $e_i(s)$ is a zero-mean functional random error. The more general approach, named the nonconcurrent functional linear model, allows $y_i(s)$ to depend on the entire functional regressor in the following way:

$$ y_i(s) = \int_T x_i(t)\,\psi(t,s)\,dt + e_i(s). $$

This model allows $T \neq S$, and the bivariate function $\psi(t,s)$ represents the impact of $x(\cdot)$ evaluated at $t \in T$ on $y_i(s)$; it is usually a "dense" function. Our goal is to introduce a nonconcurrent functional linear model that allows for local sparsity patterns. Specifically, we want $\psi(t,s) = 0$ for $(t,s) \in D_0$, where $D_0$ is a suitable subset of the domain $D = S \times T$ of $\psi(\cdot,\cdot)$, thus inducing locally sparse Hilbert-Schmidt operators. Note that the set $D_0$ is unknown and needs to be estimated from the data.

Locally sparse functional model

B-splines and sparsity

In order to represent functional objects using basis expansions, we select a basis $\{\theta_l(s),\ l = 1, \dots, L\}$ of dimension $L$ in the space of square integrable functions on $S$ and a basis $\{\varphi_m(t),\ m = 1, \dots, M\}$ of dimension $M$ in the space of square integrable functions on $T$. Exploiting a tensor product expansion of these two, we represent the regression coefficient $\psi$ as

$$ \psi(t,s) = \sum_{m=1}^{M}\sum_{l=1}^{L} \psi_{ml}\,\varphi_m(t)\,\theta_l(s) = \big(\varphi_1(t), \dots, \varphi_M(t)\big) \begin{pmatrix} \psi_{1,1} & \cdots & \psi_{1,L} \\ \vdots & \ddots & \vdots \\ \psi_{M,1} & \cdots & \psi_{M,L} \end{pmatrix} \begin{pmatrix} \theta_1(s) \\ \vdots \\ \theta_L(s) \end{pmatrix} = \varphi(t)^{\mathsf T}\, \Psi\, \theta(s), $$

where $\psi_{ml} \in \mathbb{R}$ for $l = 1, \dots, L$ and $m = 1, \dots, M$. A key point is to assume that the elements of this representation are B-splines (De Boor, 1978) of order $d$ with $L-d$ and $M-d$ interior knots, respectively. Suitable zero patterns in the B-spline basis coefficients of $\Psi$ induce sparsity in $\psi(t,s)$. Let $\tau_1 < \dots < \tau_m < \dots < \tau_{M-d+2}$ and $\sigma_1 < \dots < \sigma_l < \dots < \sigma_{L-d+2}$ denote the knots defining the tensor product splines, with $\tau_1, \tau_{M-d+2}$ and $\sigma_1, \sigma_{L-d+2}$ being the boundaries of the two domains, and let $D_{ml} \subset D$ be the rectangular subset of $D$ defined as $D_{ml} = (\tau_m, \tau_{m+1}) \times (\sigma_l, \sigma_{l+1})$ for $m = 1, \dots, M-d+1$ and $l = 1, \dots, L-d+1$. To obtain $\psi(t,s) = 0$ for each $(t,s) \in D_{ml}$, it is sufficient that all the coefficients $\psi_{m'l'}$ with $m' = m, \dots, m+d-1$ and $l' = l, \dots, l+d-1$ are jointly zero. The following figure further clarifies this concept.

Bivariate example with tensor product cubic B-splines ($d = 4$). The top row shows different coefficient matrix patterns, while the bottom row shows the corresponding spline, where dots represent knots and the zero set is highlighted in red. The first two columns show a coefficient matrix with isolated zeros and a $(d-1) \times (d-1)$ block of zeros, respectively; neither is able to produce a sparse function. In the last column, conversely, an entire $d \times d$ block of coefficients is null and the resulting function is indeed sparse.

The previous figure suggests that, in general, $\psi(t,s)$ equals zero on the region identified by two pairs of consecutive knots if the related $d \times d$ block of coefficients of $\Psi$ is entirely set to zero. Therefore, $\Psi$ should be suitably partitioned into several blocks of dimension $d \times d$ on which a joint sparsity penalty is induced.
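As a quick numerical check of this property, the following R sketch (my own illustration, not code from the paper or from the fdaSP package) builds a tensor-product cubic B-spline surface with splines::splineDesign, zeroes a $d \times d$ block of coefficients, and verifies that the resulting surface vanishes on the corresponding rectangle of the domain. The knot placement and block position are arbitrary choices.

```r
library(splines)

d <- 4                                    # cubic B-splines (order 4)
inner <- seq(0.2, 0.8, by = 0.2)          # interior knots, same on both axes
knots <- c(rep(0, d), inner, rep(1, d))   # boundary knots repeated d times
M <- L <- length(knots) - d               # basis dimensions (here 8)

tg <- sg <- seq(0, 1, length.out = 201)
Phi_t   <- splineDesign(knots, tg, ord = d)   # 201 x M: phi_m(t) on the grid
Theta_s <- splineDesign(knots, sg, ord = d)   # 201 x L: theta_l(s) on the grid

set.seed(1)
Psi <- matrix(rnorm(M * L), M, L)         # dense coefficient matrix

# Zero an entire d x d block of coefficients
Psi0 <- Psi
m <- 3; l <- 4
Psi0[m:(m + d - 1), l:(l + d - 1)] <- 0

surf <- Phi_t %*% Psi0 %*% t(Theta_s)     # psi(t, s) evaluated on the grid

# The surface must vanish on the rectangle spanned only by the zeroed basis
# functions, i.e. (knots[m+d-1], knots[m+d]) x (knots[l+d-1], knots[l+d]).
in_rect <- outer(tg > knots[m + d - 1] & tg < knots[m + d],
                 sg > knots[l + d - 1] & sg < knots[l + d])
max(abs(surf[in_rect]))                   # numerically zero
max(abs(surf[!in_rect]))                  # clearly non-zero elsewhere
```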

Note that in the simpler situation where a functional covariate $x_i(t)$ and a scalar response $y_i$ are observed, we have the following simplifications. The functional linear model is

$$ y_i = \int_T x_i(t)\,\psi(t)\,dt + e_i, $$

the regression function is expanded as $\psi(t) = \sum_{m=1}^{M} \psi_m \varphi_m(t)$, and the set $D_m = (\tau_m, \tau_{m+1})$ is an interval of the real line. To obtain $\psi(t) = 0$ for each $t \in D_m$, it is sufficient that all the coefficients $\psi_{m'}$ with $m' = m, \dots, m+d-1$ are jointly zero: $\psi(t)$ equals zero on the interval identified by a pair of consecutive knots if the related $d$-dimensional subvector of coefficients is entirely set to zero. This behaviour is illustrated in the next figure.

Univariate cubic B-spline basis and resulting spline functions. Dashed curves correspond to basis functions with a zero-valued coefficient. Only when $d = 4$ consecutive coefficients are zero is the resulting spline function null on a set of positive Lebesgue measure (in red).

Overlap group Lasso

Keeping in mind the B-spline sparsity properties described above, in order to estimate a locally sparse regression surface we minimize the objective function

$$ \frac{1}{2n} \sum_{i=1}^{n} \int_S \left( y_i(s) - \int_T x_i(t)\,\psi(t,s)\,dt \right)^2 ds + \lambda\, \Omega(\Psi), $$

but, instead of choosing the penalty $\Omega(\cdot)$ to be the widely employed Lasso (Tibshirani, 1996) or group Lasso (Yuan and Lin, 2006), which would not work properly here (see the paper for details), we propose a penalty truly tailored to the problem. First, instead of a disjoint partition, we define an overlapping sequence of blocks of size $d \times d$. Specifically, we introduce the block index $b = 1, \dots, B$, with the total number of blocks $B = (M-d+1) \times (L-d+1)$, so that there is one block for each set $D_{ml}$. This overlapping group structure allows $D_0$ to be the union of any collection of sets $D_{ml}$, each reachable by moving a block of minimum size. An example of the overlapping covering when $d = 4$ (cubic splines) is shown in the following figure.

Overlapping covering of the coefficient matrix (L = M = 12) with the first 4 blocks of size d × d, d = 4. Colors according to the balancing vector.

This construction suggests specifying a penalty $\Omega$ for overlapping groups of coefficients, a topic that has attracted significant interest in the last decade; see Jacob, Obozinski and Vert (2009), Jenatton, Audibert and Bach (2011) and Lim and Hastie (2015). Being interested in the sparsity structure of the coefficient matrix $\Psi$ rather than in its support, we particularize the previous problem as

$$ \frac{1}{2n} \sum_{i=1}^{n} \int_S \left( y_i(s) - \int_T x_i(t)\,\psi(t,s)\,dt \right)^2 ds + \lambda \sum_{b=1}^{B+1} \| c_b \circ \psi \|_2, $$

where $\lambda > 0$ is a fixed penalization parameter and $\Omega$ is specified as the sum of $B+1$ Euclidean norms $\| c_b \circ \psi \|_2$, with $\psi = \mathrm{vec}(\Psi)$ and $\circ$ denoting the Hadamard (element-wise) product. The index $b$ denotes a block of coefficients in $\Psi$: the first $B$ blocks follow the construction described above, and the last block contains all the coefficients in $\Psi$. The vectors $c_b$, of size $ML$, extract the correct subset of entries of $\Psi$ and also contain known constants that equally balance the penalization of the coefficients. This balancing is needed because parameters close to the boundaries appear in fewer groups than central ones. Note that this penalty constitutes a special case of the norm defined by Jenatton, Audibert, and Bach (2011).

In the case of a scalar response, the objective function becomes

$$ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \int_T x_i(t)\,\psi(t)\,dt \right)^2 + \lambda \sum_{b=1}^{B+1} \| c_b \circ \psi \|_2, $$

where $\psi$ and the $c_b$ are vectors of dimension $M$ and each of the first $B$ blocks contains $d$ consecutive coefficients, as in the following example.

Overlapping covering of the coefficient vector (M = 12) when the response is scalar with the first 4 blocks of size d = 4. Colors according to the balancing vector.
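To make the block structure and the balancing vectors concrete, here is a small R sketch (again my own illustration) for the scalar-response case with M = 12 and d = 4. The balancing rule used below, weighting each coefficient by the inverse square root of the number of blocks it belongs to, is one plausible way to equalize the penalization; the exact constants used in the paper may differ.

```r
M <- 12          # number of B-spline coefficients
d <- 4           # cubic splines: blocks of d consecutive coefficients
B <- M - d + 1   # number of overlapping blocks

# blocks[[b]] holds the indices of the coefficients belonging to block b
blocks <- lapply(seq_len(B), function(b) b:(b + d - 1))

# Number of blocks each coefficient appears in (fewer near the boundaries)
counts <- tabulate(unlist(blocks), nbins = M)

# One plausible balancing: down-weight coefficients appearing in many blocks
balance <- 1 / sqrt(counts)

# Balancing vectors c_b: zero outside block b, balancing weight inside
c_list <- lapply(blocks, function(idx) {
  cb <- numeric(M)
  cb[idx] <- balance[idx]
  cb
})

# Overlap group Lasso penalty at a coefficient vector psi;
# the (B + 1)-th block simply contains all the coefficients.
penalty <- function(psi, c_list, lambda = 1) {
  lambda * (sum(vapply(c_list, function(cb) sqrt(sum((cb * psi)^2)), numeric(1))) +
              sqrt(sum(psi^2)))
}

set.seed(1)
penalty(rnorm(M), c_list)
```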

Computational Considerations

To develop an efficient computational strategy, we introduce the empirical counterparts of the quantities described in the previous section, assuming that we observe a sample of response curves $y_i$, $i = 1, \dots, n$, on a common grid of $G$ points, i.e. $y_i = (y_i(s_1), \dots, y_i(s_G))^{\mathsf T}$. Let $x_i$ be the related functional covariate, observed on a possibly different but common-across-$i$ grid of points that, for simplicity and without loss of generality, we assume to be of length $G$ as well. Let $X$ be the $n \times G$ matrix with the $x_i$ in its rows, and let $\Phi$ and $\Theta$ be the $M \times G$ and $L \times G$ matrices defined as

$$ \Phi = \begin{pmatrix} \varphi_1(t_1) & \cdots & \varphi_1(t_G) \\ \vdots & & \vdots \\ \varphi_M(t_1) & \cdots & \varphi_M(t_G) \end{pmatrix}, \qquad \Theta = \begin{pmatrix} \theta_1(s_1) & \cdots & \theta_1(s_G) \\ \vdots & & \vdots \\ \theta_L(s_1) & \cdots & \theta_L(s_G) \end{pmatrix}, $$

and let $Y$ and $E$ be the $n \times G$ matrices obtained as $Y = (y_1, \dots, y_n)^{\mathsf T}$ and $E = (e_1, \dots, e_n)^{\mathsf T}$, with $e_i = (e_i(s_1), \dots, e_i(s_G))^{\mathsf T}$. The function-on-function linear regression model can be equivalently written in matrix form as

$$ Y = X \Phi^{\mathsf T} \Psi\, \Theta + E. $$

Applying the vectorization operator on each side of the equality above, we obtain the new optimization problem

$$ \frac{1}{2} \| y - Z\psi \|_2^2 + \lambda \sum_{b=1}^{B+1} \| D_b \psi \|_2, $$

where $y = \mathrm{vec}(Y)$, $\psi = \mathrm{vec}(\Psi)$ is the coefficient vector of dimension $LM$, $Z = \Theta^{\mathsf T} \otimes X\Phi^{\mathsf T}$, and $D_b = \mathrm{diag}(c_b)$ is a diagonal matrix whose elements correspond to the entries of the vector $c_b$. The minimization of the quantity above is challenging because of the non-separability of the overlap group Lasso penalty. We propose a Majorization-Minimization (MM) algorithm (Lange, 2016) to obtain the solution, although other choices are viable, e.g. the Alternating Direction Method of Multipliers (ADMM); see Boyd et al. (2011). The MM procedure works in two steps: first, a majorizing function $Q(\psi \mid \hat{\psi}^{(k)})$ based on the current estimate is determined; second, this function is minimized. Alternating between the two steps guarantees convergence to a solution of the original problem. The curious reader can find the expression of $Q(\psi \mid \hat{\psi}^{(k)})$ in the original paper; here we just point out that the function is quadratic and admits an explicit minimizer in the form of a generalized ridge regression. A further speed-up, useful when the dimension of the problem is high, exploits the Sherman-Morrison-Woodbury matrix identity, also known as the matrix inversion lemma.

When the response is scalar, the following modifications apply: the design matrix is $Z = X\Phi^{\mathsf T}$, $y$ and $\psi$ are vectors of dimension $n$ and $M$ respectively, and the diagonal matrix $D_b$ is of dimension $M \times M$. The optimization problem is otherwise unchanged and the same class of algorithms can be employed to obtain the solution.
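To fix ideas, below is a bare-bones R sketch of one possible MM iteration for the vectorized problem; it is my own illustration, not the fdaSP implementation. It uses the standard quadratic majorizer of the Euclidean norm, $\|v\|_2 \le \|v\|_2^2/(2w) + w/2$ with $w = \|D_b \hat{\psi}^{(k)}\|_2$, so that each minimization step reduces to a generalized ridge regression; it omits convergence checks and the Woodbury speed-up, and the small constant eps guarding against division by zero is an ad-hoc safeguard.

```r
# Minimal MM sketch for  (1/2) ||y - Z psi||^2 + lambda * sum_b ||D_b psi||_2.
# C is a list with all B + 1 balancing vectors c_b (so that D_b = diag(c_b)).
mm_overlap_gl <- function(Z, y, C, lambda, n_iter = 200, eps = 1e-8) {
  p   <- ncol(Z)
  ZtZ <- crossprod(Z)
  Zty <- crossprod(Z, y)
  psi <- rep(0.01, p)                       # simple non-zero starting value
  for (k in seq_len(n_iter)) {
    # Majorization: each ||D_b psi||_2 is replaced by a quadratic upper bound
    # tangent at the current estimate, with weight w_b = ||D_b psi_k||_2.
    W <- Reduce(`+`, lapply(C, function(cb) {
      wb <- max(sqrt(sum((cb * psi)^2)), eps)
      diag(cb^2 / wb)
    }))
    # Minimization: closed-form generalized ridge update.
    psi <- solve(ZtZ + lambda * W, Zty)
  }
  drop(psi)
}
```

For the scalar-response case one would take Z <- X %*% t(Phi), with Phi and Theta the M x G and L x G basis matrices defined in the text, and pass the B + 1 vectors from the previous sketch, e.g. C <- c(c_list, list(rep(1, M))); for the function-on-function case, Z can be formed as t(Theta) %x% (X %*% t(Phi)) using R's Kronecker operator, matching $Z = \Theta^{\mathsf T} \otimes X\Phi^{\mathsf T}$ above.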

Swedish mortality revisited

We apply the proposed locally sparse estimator to the Swedish mortality dataset, where the aim is to predict the log-hazard function in a given year from the same quantity in the previous year. We implement the model with $d = 4$ and $M = L = 20$ basis functions in each dimension and select the optimal value of $\lambda$ by cross-validation. A perspective plot of the estimated surface is depicted in the last figure of the post. The estimate shows a marked positive diagonal, confirming the positive influence on the log-hazard rate at age $s$ of the previous year's curve evaluated in a neighborhood of $s$. At the same time, the flat zero regions away from the diagonal suggest that there is no influence from the curve evaluated at distant ages. Our estimate is more regular than previous approaches and its qualitative interpretation is sharper and easier; compare, for example, with Figure 10.11 of Ramsay, Hooker, and Graves (2009).

Estimated regression surface for the Swedish mortality dataset.

This result demonstrates the practical relevance of the proposed approach. Indeed, the resulting estimate, while reminiscent of a concurrent model (and inheriting its ease of interpretation), gives further insights and improves the fit, representing the desired intermediate solution between the concurrent and nonconcurrent models.

R package

The method is implemented in R through the package fdaSP, available on CRAN. The function f2fSP fits the functional-response regression model, while f2sSP fits the scalar-response one. Their counterparts f2fSP_cv and f2sSP_cv select the tuning parameter by cross-validation.
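A minimal usage outline is sketched below. I am not reproducing the exact argument lists of these functions here, so the workflow is only indicative; consult the package help pages (?f2fSP and friends) for the actual interface.

```r
# install.packages("fdaSP")   # CRAN release
library(fdaSP)

# Function-on-function locally sparse regression:
#   1. select lambda on a grid by cross-validation with f2fSP_cv(...)
#   2. refit at the chosen lambda with f2fSP(...)
#   3. inspect the estimated coefficient surface for exactly-zero regions
#
# The scalar-response analogues are f2sSP() and f2sSP_cv().
```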

This post is based on

Bernardi, M., Canale, A. and Stefanucci, M. (2022). Locally Sparse Function-on-Function Regression, Journal of Computational and Graphical Statistics, 32:3, 985-999, DOI: 10.1080/10618600.2022.2130926

References

Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288.
Yuan, M., and Lin, Y. (2006), “Model Selection and Estimation in Regression with Grouped Variables,” Journal of the Royal Statistical Society, Series B, 68, 49–67.
Jacob, L., Obozinski, G., and Vert, J.-P. (2009), “Group Lasso with Overlap and Graph Lasso,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 433–440.
Jenatton, R., Audibert, J.-Y., and Bach, F. (2011), “Structured Variable Selection with Sparsity-Inducing Norms,” Journal of Machine Learning Research, 12, 2777–2824.
Lim, M., and Hastie, T. (2015), “Learning Interactions via Hierarchical Group-Lasso Regularization,” Journal of Computational and Graphical Statistics, 24, 627–654.
Lange, K. (2016), MM Optimization Algorithms, Philadelphia, PA: Society for Industrial and Applied Mathematics.
Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011), “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, 3, 1–122.
Ramsay, J. O., Hooker, G., and Graves, S. (2009), Functional Data Analysis with R and MATLAB, New York: Springer.

About the authors

Mauro Bernardi is Associate Professor at the University of Padua.

Antonio Canale is Associate Professor at the University of Padua.

Marco Stefanucci is Assistant Professor at the University of Rome Tor Vergata.
