(Exercise 3.14). It can also be shown that the sequence of PLS coefficients for m = 1, 2, ..., p represents the conjugate gradient sequence for computing the least squares solutions (Exercise 3.18).

3.6 Discussion: A Comparison of the Selection and Shrinkage Methods

There are some simple settings where we can better understand the relationship between the different methods described above. Consider an example with two correlated inputs X1 and X2, with correlation ρ. We assume that the true regression coefficients are β1 = 4 and β2 = 2. Figure 3.18 shows the coefficient profiles for the different methods, as their tuning parameters are varied. The top panel has ρ = 0.5, the bottom panel ρ = −0.5. The tuning parameters for ridge and lasso vary over a continuous range, while best subset, PLS and PCR take just two discrete steps to the least squares solution. In the top panel, starting at the origin, ridge regression shrinks the coefficients together until it finally converges to least squares. PLS and PCR show similar behavior to ridge, although they are discrete and more extreme. Best subset overshoots the solution and then backtracks. The behavior of the lasso is intermediate to the other methods. When the correlation is negative (lower panel), PLS and PCR again roughly track the ridge path, and all of the methods are more similar to one another.

It is interesting to compare the shrinkage behavior of these different methods. Recall that ridge regression shrinks all directions, but shrinks low-variance directions more. Principal components regression leaves M high-variance directions alone, and discards the rest. Interestingly, it can be shown that partial least squares also tends to shrink the low-variance directions, but can actually inflate some of the higher variance directions. This can make PLS a little unstable, and cause it to have slightly higher prediction error compared to ridge regression. A full study is given in Frank and Friedman (1993).
These authors conclude that for minimizing prediction error, ridge regression is generally preferable to variable subset selection, principal components regression and partial least squares. However, the improvement over the latter two methods was only slight.

To summarize, PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.

FIGURE 3.18. Coefficient profiles from different methods for a simple problem: two inputs with correlation ±0.5, and the true regression coefficients β = (4, 2). [Each panel plots β2 against β1, with the paths of ridge, lasso, best subset, PLS and PCR running from the origin to the least squares solution; the top panel has ρ = 0.5, the bottom ρ = −0.5.]

3.7 Multiple Outcome Shrinkage and Selection

As noted in Section 3.2.4, the least squares estimates in a multiple-output linear model are simply the individual least squares estimates for each of the outputs.

To apply selection and shrinkage methods in the multiple output case, one could apply a univariate technique individually to each outcome or simultaneously to all outcomes. With ridge regression, for example, we could apply formula (3.44) to each of the K columns of the outcome matrix Y, using possibly different parameters λ, or apply it to all columns using the same value of λ. The former strategy would allow different amounts of regularization to be applied to different outcomes, but would require estimation of K separate regularization parameters λ1, ..., λK, while the latter would permit all K outputs to be used in estimating the sole regularization parameter λ.
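The two strategies are easy to illustrate in code. The following is a minimal sketch, not taken from the text: the helper name `ridge_multi` and the toy data are hypothetical, and the ridge formula (3.44) is simply applied column by column, with either one shared λ or a separate λ per output.

```python
import numpy as np

def ridge_multi(X, Y, lam):
    """Apply the ridge formula (3.44) to each column of a multi-output Y.
    lam may be a scalar (one shared lambda for all outputs) or a length-K
    vector (a separate lambda per output).  X (N x p) and Y (N x K) are
    assumed centered, so no intercept is fit."""
    N, p = X.shape
    K = Y.shape[1]
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (K,))
    XtX = X.T @ X
    XtY = X.T @ Y
    B = np.empty((p, K))
    for k in range(K):
        # Ridge solution for output k: (X^T X + lambda_k I)^{-1} X^T y_k
        B[:, k] = np.linalg.solve(XtX + lam[k] * np.eye(p), XtY[:, k])
    return B

# Toy data (hypothetical): two outputs driven by the same inputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
B_true = rng.standard_normal((5, 2))
Y = X @ B_true + 0.1 * rng.standard_normal((100, 2))
X -= X.mean(0); Y -= Y.mean(0)

B_shared = ridge_multi(X, Y, 1.0)            # same lambda for all outputs
B_separate = ridge_multi(X, Y, [0.1, 10.0])  # a lambda per output
```

With λ = 0 both strategies reduce to the separate least squares fits noted at the start of this section.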
Other more sophisticated shrinkage and selection strategies that exploit correlations in the different responses can be helpful in the multiple output case. Suppose, for example, that among the outputs we have

    Y_k = f(X) + \varepsilon_k,   (3.65)
    Y_l = f(X) + \varepsilon_l;   (3.66)

i.e., (3.65) and (3.66) share the same structural part f(X) in their models. It is clear in this case that we should pool our observations on Y_k and Y_l to estimate the common f.

Combining responses is at the heart of canonical correlation analysis (CCA), a data reduction technique developed for the multiple output case. Similar to PCA, CCA finds a sequence of uncorrelated linear combinations Xv_m, m = 1, ..., M of the x_j, and a corresponding sequence of uncorrelated linear combinations Yu_m of the responses y_k, such that the correlations

    \mathrm{Corr}^2(Yu_m, Xv_m)   (3.67)

are successively maximized. Note that at most M = min(K, p) directions can be found. The leading canonical response variates are those linear combinations (derived responses) best predicted by the x_j; in contrast, the trailing canonical variates can be poorly predicted by the x_j, and are candidates for being dropped. The CCA solution is computed using a generalized SVD of the sample cross-covariance matrix Y^T X/N (assuming Y and X are centered; Exercise 3.20).

Reduced-rank regression (Izenman, 1975; van der Merwe and Zidek, 1980) formalizes this approach in terms of a regression model that explicitly pools information. Given an error covariance Cov(ε) = Σ, we solve the following restricted multivariate regression problem:

    \hat{B}^{rr}(m) = \arg\min_{\mathrm{rank}(B) = m} \sum_{i=1}^{N} (y_i - B^T x_i)^T \Sigma^{-1} (y_i - B^T x_i).   (3.68)

With Σ replaced by the estimate Y^T Y/N, one can show (Exercise 3.21) that the solution is given by a CCA of Y and X:

    \hat{B}^{rr}(m) = \hat{B} U_m U_m^{-},   (3.69)

where U_m is the K × m sub-matrix of U consisting of the first m columns, and U is the K × M matrix of left canonical vectors u_1, u_2, ..., u_M. U_m^{-} is its generalized inverse.
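A minimal sketch of the CCA computation may help make this concrete. The helper `cca` below is hypothetical (not from the text); it uses whitening followed by an ordinary SVD of the whitened cross-covariance, which is one standard concrete form of the generalized SVD of Y^T X/N mentioned above. It assumes X and Y are centered and their covariance matrices are nonsingular.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlation analysis via whitening plus an SVD.
    X (N x p) and Y (N x K) are assumed centered.  Returns the canonical
    correlations c and matrices V (p x M), U (K x M) whose columns give
    the canonical variates Xv_m and Yu_m, with M = min(K, p)."""
    N = X.shape[0]

    def inv_sqrt(S):
        # Inverse symmetric square root via an eigendecomposition.
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Sx, Sy, Sxy = X.T @ X / N, Y.T @ Y / N, X.T @ Y / N
    Wx, Wy = inv_sqrt(Sx), inv_sqrt(Sy)
    A, c, Bt = np.linalg.svd(Wx @ Sxy @ Wy)  # SVD of whitened cross-covariance
    M = min(X.shape[1], Y.shape[1])          # at most min(K, p) directions
    V = Wx @ A[:, :M]                        # x-side canonical vectors v_m
    U = Wy @ Bt.T[:, :M]                     # y-side canonical vectors u_m
    return c[:M], V, U

# Toy centered data (hypothetical): Y driven by two directions of X.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4)); X -= X.mean(0)
Y = X[:, :2] @ rng.standard_normal((2, 3)) + rng.standard_normal((200, 3))
Y -= Y.mean(0)
c, V, U = cca(X, Y)
```

The leading entries of c are the canonical correlations c_m, maximized in (3.67) and sorted in decreasing order, so the trailing variates (small c_m) are the candidates for being dropped.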
Writing the solution as

    \hat{B}^{rr}(m) = (X^T X)^{-1} X^T (Y U_m) U_m^{-},   (3.70)

we see that reduced-rank regression performs a linear regression on the pooled response matrix Y U_m, and then maps the coefficients (and hence the fits as well) back to the original response space. The reduced-rank fits are given by

    \hat{Y}^{rr}(m) = X (X^T X)^{-1} X^T Y U_m U_m^{-} = H Y P_m,   (3.71)

where H is the usual linear regression projection operator, and P_m is the rank-m CCA response projection operator. Although a better estimate of Σ would be (Y − X\hat{B})^T (Y − X\hat{B})/(N − pK), one can show that the solution remains the same (Exercise 3.22).

Reduced-rank regression borrows strength among responses by truncating the CCA. Breiman and Friedman (1997) explored with some success shrinkage of the canonical variates between X and Y, a smooth version of reduced-rank regression. Their proposal has the form (compare (3.69))

    \hat{B}^{c+w} = \hat{B} U \Lambda U^{-1},   (3.72)

where Λ is a diagonal shrinkage matrix (the "c+w" stands for "Curds and Whey," the name they gave to their procedure). Based on optimal prediction in the population setting, they show that Λ has diagonal entries

    \lambda_m = \frac{c_m^2}{c_m^2 + \frac{p}{N}(1 - c_m^2)},  m = 1, ..., M,   (3.73)

where c_m is the mth canonical correlation coefficient. Note that as the ratio of the number of input variables to sample size p/N gets small, the shrinkage factors approach 1. Breiman and Friedman (1997) proposed modified versions of Λ based on training data and cross-validation, but the general form is the same. Here the fitted response has the form

    \hat{Y}^{c+w} = H Y S^{c+w},   (3.74)

where S^{c+w} = U \Lambda U^{-1} is the response shrinkage operator.

Breiman and Friedman (1997) also suggested shrinking in both the Y space and X space. This leads to hybrid shrinkage models of the form

    \hat{Y}^{ridge,c+w} = A_\lambda Y S^{c+w},   (3.75)

where A_\lambda = X(X^T X + \lambda I)^{-1} X^T is the ridge regression shrinkage operator, as in (3.46) on page 66. Their paper and the discussions thereof contain many more details.
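The shrinkage factors in (3.73) are simple enough to evaluate directly. The sketch below (the helper name is hypothetical) computes the diagonal of Λ from a vector of canonical correlations: strong canonical correlations are shrunk very little, weak ones heavily, and as p/N → 0 all factors approach 1, as noted above.

```python
import numpy as np

def curds_whey_shrinkage(c, p, N):
    """Population-optimal diagonal entries of Lambda from (3.73):
    lambda_m = c_m^2 / (c_m^2 + (p/N) * (1 - c_m^2)),
    where c holds the canonical correlations c_1, ..., c_M."""
    c2 = np.asarray(c, dtype=float) ** 2
    return c2 / (c2 + (p / N) * (1.0 - c2))

# Example: strong variates are barely shrunk, weak ones nearly zeroed out.
lam = curds_whey_shrinkage([0.9, 0.5, 0.1], p=10, N=100)
```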
3.8 More on the Lasso and Related Path Algorithms

Since the publication of the LAR algorithm (Efron et al., 2004) there has been a lot of activity in developing algorithms for fitting regularization paths for a variety of different problems. In addition, L1 regularization has taken on a life of its own, leading to the development of the field of compressed sensing in the signal-processing literature (Donoho, 2006a; Candes, 2006). In this section we discuss some related proposals and other path algorithms, starting off with a precursor to the LAR algorithm.

3.8.1 Incremental Forward Stagewise Regression

Here we present another LAR-like algorithm, this time focused on forward stagewise regression. Interestingly, efforts to understand a flexible nonlinear regression procedure (boosting) led to a new algorithm for linear models (LAR). In reading the first edition of this book and the forward stagewise Algorithm 16.1 of Chapter 16 (in the first edition, this was Algorithm 10.4 in Chapter 10), our colleague Brad Efron realized that with linear models, one could explicitly construct the piecewise-linear lasso paths of Figure 3.10. This led him to propose the LAR procedure of Section 3.4.4, as well as the incremental version of forward-stagewise regression presented here.

Consider the linear-regression version of the forward-stagewise boosting algorithm 16.1 proposed in Section 16.1 (page 608). It generates a coefficient profile by repeatedly updating (by a small amount ε) the coefficient of the variable most correlated with the current residuals. Algorithm 3.4 gives the details.

Algorithm 3.4 Incremental Forward Stagewise Regression: FS_ε.

1. Start with the residual r equal to y and β_1, β_2, ..., β_p = 0. All the predictors are standardized to have mean zero and unit norm.

2. Find the predictor x_j most correlated with r.

3. Update β_j ← β_j + δ_j, where δ_j = ε · sign[⟨x_j, r⟩] and ε > 0 is a small step size, and set r ← r − δ_j x_j.

4. Repeat steps 2 and 3 many times, until the residuals are uncorrelated with all the predictors.

Figure 3.19 (left panel) shows the progress of the algorithm on the prostate data with step size ε = 0.01. If δ_j = ⟨x_j, r⟩ (the least-squares coefficient of the residual on the jth predictor), then this is exactly the usual forward stagewise procedure (FS) outlined in Section 3.3.3.

FIGURE 3.19. Coefficient profiles for the prostate data (variables lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45). The left panel shows incremental forward stagewise regression with step size ε = 0.01, plotted against the iteration number. The right panel shows the infinitesimal version FS_0 obtained by letting ε → 0, plotted against the L1 arc-length of the coefficients. This profile was fit by the modification 3.2b to the LAR Algorithm 3.2. In this example the FS_0 profiles are monotone, and hence identical to those of lasso and LAR.

Here we are mainly interested in small values of ε. Letting ε → 0 gives the right panel of Figure 3.19, which in this case is identical to the lasso path in Figure 3.10. We call this limiting procedure infinitesimal forward stagewise regression, or FS_0. This procedure plays an important role in non-linear, adaptive methods like boosting (Chapters 10 and 16) and is the version of incremental forward stagewise regression that is most amenable to theoretical analysis. Bühlmann and Hothorn (2008) refer to the same procedure as "L2boost", because of its connections to boosting.

Efron originally thought that the LAR Algorithm 3.2 was an implementation of FS_0, allowing each tied predictor a chance to update its coefficient in a balanced way, while remaining tied in correlation.
However, he then realized that the LAR least-squares fit amongst the tied predictors can result in coefficients moving in the opposite direction to their correlation, which cannot happen in Algorithm 3.4. The following modification of the LAR algorithm implements FS_0:

Algorithm 3.2b Least Angle Regression: FS_0 Modification.

4. Find the new direction by solving the constrained least squares problem

    \min_b \|r - X_{A} b\|_2^2  subject to  b_j s_j \ge 0, j \in A,

where s_j is the sign of ⟨x_j, r⟩.

The modification amounts to a non-negative least squares fit, keeping the signs of the coefficients the same as those of the correlations. One can show that this achieves the optimal balancing of infinitesimal "update turns" for the variables tied for maximal correlation (Hastie et al., 2007). Like lasso, the entire FS_0 path can be computed very efficiently via the LAR algorithm.

As a consequence of these results, if the LAR profiles are monotone non-increasing or non-decreasing, as they are in Figure 3.19, then all three methods (LAR, lasso, and FS_0) give identical profiles. If the profiles are not monotone but do not cross the zero axis, then LAR and lasso are identical.

Since FS_0 is different from the lasso, it is natural to ask if it optimizes a criterion. The answer is more complex than for lasso; the FS_0 coefficient profile is the solution to a differential equation. While the lasso makes optimal progress in terms of reducing the residual sum-of-squares per unit increase in the L1-norm of the coefficient vector β, FS_0 is optimal per unit increase in L1 arc-length traveled along the coefficient path. Hence its coefficient path is discouraged from changing directions too often.

FS_0 is more constrained than lasso, and in fact can be viewed as a monotone version of the lasso; see Figure 16.3 on page 614 for a dramatic example. FS_0 may be useful in p ≫ N situations, where its coefficient profiles are much smoother and hence have less variance than those of lasso.
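Both Algorithm 3.4 and the sign-constrained step 4 of Algorithm 3.2b are short enough to sketch directly. The code below is an illustrative sketch, not a reference implementation: the function names are hypothetical, and the constrained fit is solved by the substitution b_j = s_j c_j with c_j ≥ 0, which turns it into a standard non-negative least squares problem (consistent with the remark above that the modification amounts to an NNLS fit).

```python
import numpy as np
from scipy.optimize import nnls

def incremental_stagewise(X, y, eps=0.01, n_steps=50000, tol=1e-8):
    """Algorithm 3.4 (FS_eps).  Assumes the columns of X are standardized
    to mean zero and unit norm, and y is centered."""
    beta = np.zeros(X.shape[1])
    r = y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ r                  # inner products <x_j, r>
        j = np.argmax(np.abs(corr))     # predictor most correlated with r
        if np.abs(corr[j]) < tol:       # residual ~ uncorrelated with all x_j
            break
        delta = eps * np.sign(corr[j])  # small step delta_j
        beta[j] += delta
        r -= delta * X[:, j]            # update the residual
    return beta

def fs0_direction(XA, r):
    """Step 4 of Algorithm 3.2b: least squares of r on the active set XA,
    constrained so each coefficient keeps the sign s_j of <x_j, r>.
    Flipping the column signs reduces this to a standard NNLS problem."""
    s = np.sign(XA.T @ r)
    b_pos, _ = nnls(XA * s, r)          # solve with b_j = s_j * c_j, c_j >= 0
    return s * b_pos
```

For small ε and enough steps, FS_ε approaches the least squares fit, since the loop only stops once the residual correlations are (nearly) zero.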
More details on FS_0 are given in Section 16.2.3 and Hastie et al. (2007). Figure 3.16 includes FS_0, where its performance is very similar to that of the lasso.

3.8.2 Piecewise-Linear Path Algorithms

The least angle regression procedure exploits the piecewise linear nature of the lasso solution paths. It has led to similar "path algorithms" for other regularized problems. Suppose we solve

    \hat{\beta}(\lambda) = \arg\min_{\beta} \left[ R(\beta) + \lambda J(\beta) \right],   (3.76)

with

    R(\beta) = \sum_{i=1}^{N} L\Big(y_i, \beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j\Big),   (3.77)

where both the loss function L and the penalty function J are convex. Then the following are sufficient conditions for the solution path \hat{\beta}(\lambda) to be piecewise linear (Rosset and Zhu, 2007):

1. R is quadratic or piecewise-quadratic as a function of β, and

2. J is piecewise linear in β.

This also implies (in principle) that the solution path can be efficiently computed. Examples include squared- and absolute-error loss, "Huberized" losses, and the L1 and L∞ penalties on β. Another example is the "hinge loss" function used in the support vector machine. There the loss is piecewise linear, and the penalty is quadratic. Interestingly, this leads to a piecewise-linear path algorithm in the dual space; more details are given in Section 12.3.5.

3.8.3 The Dantzig Selector

Candes and Tao (2007) proposed the following criterion:

    \min_{\beta} \|\beta\|_1  subject to  \|X^T (y - X\beta)\|_{\infty} \le t.   (3.78)

They call the solution the Dantzig selector (DS). It can be written equivalently as

    \min_{\beta} \|X^T (y - X\beta)\|_{\infty}  subject to  \|\beta\|_1 \le t.   (3.79)

Here ||·||∞ denotes the L∞ norm, the maximum absolute value of the components of the vector. In this form it resembles the lasso, replacing squared error loss by the maximum absolute value of its gradient. Note that as t gets large, both procedures yield the least squares solution if N