The Elements of Statistical Learning: Data Mining, Inference, and Prediction


Springer Series in Statistics

Trevor Hastie, Robert Tibshirani, Jerome Friedman

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition)

During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry.

The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting—the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

To our parents:
Valerie and Patrick Hastie
Vera and Sami Tibshirani
Florence and Harry Friedman

and to our families:
Samantha, Timothy, and Lynda
Charlie, Ryan, Julie, and Cheryl
Melanie, Dora, Monika, and Ildiko

Preface to the Second Edition

In God we trust, all others bring data.
–William Edwards Deming (1900–1993)¹

We have been gratified by the popularity of the first edition of The Elements of Statistical Learning. This, along with the fast pace of research in the statistical learning field, motivated us to update our book with a second edition. We have added four new chapters and updated some of the existing chapters. Because many readers are familiar with the layout of the first edition, we have tried to change it as little as possible. Here is a summary of the main changes:
¹On the Web, this quote has been widely attributed to both Deming and Robert W. Hayden; however, Professor Hayden told us that he can claim no credit for this quote, and ironically we could find no "data" confirming that Deming actually said this.

Chapter (what's new in this edition):

1. Introduction
2. Overview of Supervised Learning
3. Linear Methods for Regression: LAR algorithm and generalizations of the lasso
4. Linear Methods for Classification: Lasso path for logistic regression
5. Basis Expansions and Regularization: Additional illustrations of RKHS
6. Kernel Smoothing Methods
7. Model Assessment and Selection: Strengths and pitfalls of cross-validation
8. Model Inference and Averaging
9. Additive Models, Trees, and Related Methods
10. Boosting and Additive Trees: New example from ecology; some material split off to Chapter 16
11. Neural Networks: Bayesian neural nets and the NIPS 2003 challenge
12. Support Vector Machines and Flexible Discriminants: Path algorithm for SVM classifier
13. Prototype Methods and Nearest-Neighbors
14. Unsupervised Learning: Spectral clustering, kernel PCA, sparse PCA, non-negative matrix factorization, archetypal analysis, nonlinear dimension reduction, Google page rank algorithm, a direct approach to ICA
15. Random Forests: New
16. Ensemble Learning: New
17. Undirected Graphical Models: New
18. High-Dimensional Problems: New

Some further notes:

• Our first edition was unfriendly to colorblind readers; in particular, we tended to favor red/green contrasts which are particularly troublesome. We have changed the color palette in this edition to a large extent, replacing the above with an orange/blue contrast.

• We have changed the name of Chapter 6 from "Kernel Methods" to "Kernel Smoothing Methods", to avoid confusion with the machine-learning kernel method that is discussed in the context of support vector machines (Chapter 12) and more generally in Chapters 5 and 14.

• In the first edition, the discussion of error-rate estimation in Chapter 7 was sloppy, as we did not clearly differentiate the notions of conditional error rates (conditional on the training set) and unconditional rates. We have fixed this in the new edition.

• Chapters 15 and 16 follow naturally from Chapter 10, and the chapters are probably best read in that order.

• In Chapter 17, we have not attempted a comprehensive treatment of graphical models, and discuss only undirected models and some new methods for their estimation. Due to a lack of space, we have specifically omitted coverage of directed graphical models.

• Chapter 18 explores the "p ≫ N" problem, which is learning in high-dimensional feature spaces. These problems arise in many areas, including genomic and proteomic studies, and document classification.

We thank the many readers who have found the (too numerous) errors in the first edition. We apologize for those and have done our best to avoid errors in this new edition. We thank Mark Segal, Bala Rajaratnam, and Larry Wasserman for comments on some of the new chapters, and many Stanford graduate and post-doctoral students who offered comments, in particular Mohammed AlQuraishi, John Boik, Holger Hoefling, Arian Maleki, Donal McMahon, Saharon Rosset, Babak Shababa, Daniela Witten, Ji Zhu and Hui Zou. We thank John Kimmel for his patience in guiding us through this new edition. RT dedicates this edition to the memory of Anna McPhee.
Trevor Hastie
Robert Tibshirani
Jerome Friedman

Stanford, California
August 2008

Preface to the First Edition

We are drowning in information and starving for knowledge.
–Rutherford D. Roger

The field of Statistics is constantly challenged by the problems that science and industry bring to its door. In the early days, these problems often came from agricultural and industrial experiments and were relatively small in scope. With the advent of computers and the information age, statistical problems have exploded both in size and complexity. Challenges in the areas of data storage, organization and searching have led to the new field of "data mining"; statistical and computational problems in biology and medicine have created "bioinformatics." Vast amounts of data are being generated in many fields, and the statistician's job is to make sense of it all: to extract important patterns and trends, and understand "what the data says." We call this learning from data.

The challenges in learning from data have led to a revolution in the statistical sciences. Since computation plays such a key role, it is not surprising that much of this new development has been done by researchers in other fields such as computer science and engineering.

The learning problems that we consider can be roughly categorized as either supervised or unsupervised. In supervised learning, the goal is to predict the value of an outcome measure based on a number of input measures; in unsupervised learning, there is no outcome measure, and the goal is to describe the associations and patterns among a set of input measures.

This book is our attempt to bring together many of the important new ideas in learning, and explain them in a statistical framework. While some mathematical details are needed, we emphasize the methods and their conceptual underpinnings rather than their theoretical properties. As a result, we hope that this book will appeal not just to statisticians but also to researchers and practitioners in a wide variety of fields.

Just as we have learned a great deal from researchers outside of the field of statistics, our statistical viewpoint may help others to better understand different aspects of learning:

There is no true interpretation of anything; interpretation is a vehicle in the service of human comprehension. The value of interpretation is in enabling others to fruitfully think about an idea.
–Andreas Buja

We would like to acknowledge the contribution of many people to the conception and completion of this book. David Andrews, Leo Breiman, Andreas Buja, John Chambers, Bradley Efron, Geoffrey Hinton, Werner Stuetzle, and John Tukey have greatly influenced our careers. Balasubramanian Narasimhan gave us advice and help on many computational problems, and maintained an excellent computing environment. Shin-Ho Bang helped in the production of a number of the figures. Lee Wilkinson gave valuable tips on color production. Ilana Belitskaya, Eva Cantoni, Maya Gupta, Michael Jordan, Shanti Gopatam, Radford Neal, Jorge Picazo, Bogdan Popescu, Olivier Renaud, Saharon Rosset, John Storey, Ji Zhu, Mu Zhu, two reviewers and many students read parts of the manuscript and offered helpful suggestions. John Kimmel was supportive, patient and helpful at every phase; MaryAnn Brickner and Frank Ganz headed a superb production team at Springer.
Trevor Hastie would like to thank the statistics department at the University of Cape Town for their hospitality during the final stages of this book. We gratefully acknowledge NSF and NIH for their support of this work. Finally, we would like to thank our families and our parents for their love and support.

Trevor Hastie
Robert Tibshirani
Jerome Friedman

Stanford, California
May 2001

The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions ....
–Ian Hacking

Contents

Preface to the Second Edition
Preface to the First Edition
1 Introduction
2 Overview of Supervised Learning
  2.1 Introduction
  2.2 Variable Types and Terminology
  2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors
    2.3.1 Linear Models and Least Squares
    2.3.2 Nearest-Neighbor Methods
    2.3.3 From Least Squares to Nearest Neighbors
  2.4 Statistical Decision Theory
  2.5 Local Methods in High Dimensions
  2.6 Statistical Models, Supervised Learning and Function Approximation
    2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)
    2.6.2 Supervised Learning
    2.6.3 Function Approximation
  2.7 Structured Regression Models
    2.7.1 Difficulty of the Problem
  2.8 Classes of Restricted Estimators
    2.8.1 Roughness Penalty and Bayesian Methods
    2.8.2 Kernel Methods and Local Regression
    2.8.3 Basis Functions and Dictionary Methods
  2.9 Model Selection and the Bias–Variance Tradeoff
  Bibliographic Notes
  Exercises
3 Linear Methods for Regression
  3.1 Introduction
  3.2 Linear Regression Models and Least Squares
    3.2.1 Example: Prostate Cancer
    3.2.2 The Gauss–Markov Theorem
    3.2.3 Multiple Regression from Simple Univariate Regression
    3.2.4 Multiple Outputs
  3.3 Subset Selection
    3.3.1 Best-Subset Selection
    3.3.2 Forward- and Backward-Stepwise Selection
    3.3.3 Forward-Stagewise Regression
    3.3.4 Prostate Cancer Data Example (Continued)
  3.4 Shrinkage Methods
    3.4.1 Ridge Regression
    3.4.2 The Lasso
    3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso
    3.4.4 Least Angle Regression
  3.5 Methods Using Derived Input Directions
    3.5.1 Principal Components Regression
    3.5.2 Partial Least Squares
  3.6 Discussion: A Comparison of the Selection and Shrinkage Methods
  3.7 Multiple Outcome Shrinkage and Selection
  3.8 More on the Lasso and Related Path Algorithms
    3.8.1 Incremental Forward Stagewise Regression
    3.8.2 Piecewise-Linear Path Algorithms
    3.8.3 The Dantzig Selector
    3.8.4 The Grouped Lasso
    3.8.5 Further Properties of the Lasso
    3.8.6 Pathwise Coordinate Optimization
  3.9 Computational Considerations
  Bibliographic Notes
  Exercises
4 Linear Methods for Classification
  4.1 Introduction
  4.2 Linear Regression of an Indicator Matrix
  4.3 Linear Discriminant Analysis
    4.3.1 Regularized Discriminant Analysis
    4.3.2 Computations for LDA
    4.3.3 Reduced-Rank Linear Discriminant Analysis
  4.4 Logistic Regression
    4.4.1 Fitting Logistic Regression Models
    4.4.2 Example: South African Heart Disease
    4.4.3 Quadratic Approximations and Inference
    4.4.4 L1 Regularized Logistic Regression
    4.4.5 Logistic Regression or LDA?
  4.5 Separating Hyperplanes
    4.5.1 Rosenblatt's Perceptron Learning Algorithm
    4.5.2 Optimal Separating Hyperplanes
  Bibliographic Notes
  Exercises
5 Basis Expansions and Regularization
  5.1 Introduction
  5.2 Piecewise Polynomials and Splines
    5.2.1 Natural Cubic Splines
    5.2.2 Example: South African Heart Disease (Continued)
    5.2.3 Example: Phoneme Recognition
  5.3 Filtering and Feature Extraction
  5.4 Smoothing Splines
    5.4.1 Degrees of Freedom and Smoother Matrices
  5.5 Automatic Selection of the Smoothing Parameters
    5.5.1 Fixing the Degrees of Freedom
    5.5.2 The Bias–Variance Tradeoff
  5.6 Nonparametric Logistic Regression
  5.7 Multidimensional Splines
  5.8 Regularization and Reproducing Kernel Hilbert Spaces
    5.8.1 Spaces of Functions Generated by Kernels
    5.8.2 Examples of RKHS
  5.9 Wavelet Smoothing
    5.9.1 Wavelet Bases and the Wavelet Transform
    5.9.2 Adaptive Wavelet Filtering
  Bibliographic Notes
  Exercises
  Appendix: Computational Considerations for Splines
    Appendix: B-splines
    Appendix: Computations for Smoothing Splines
6 Kernel Smoothing Methods
  6.1 One-Dimensional Kernel Smoothers
    6.1.1 Local Linear Regression
    6.1.2 Local Polynomial Regression
  6.2 Selecting the Width of the Kernel
  6.3 Local Regression in IR^p
  6.4 Structured Local Regression Models in IR^p
    6.4.1 Structured Kernels
    6.4.2 Structured Regression Functions
  6.5 Local Likelihood and Other Models
  6.6 Kernel Density Estimation and Classification
    6.6.1 Kernel Density Estimation
    6.6.2 Kernel Density Classification
    6.6.3 The Naive Bayes Classifier
  6.7 Radial Basis Functions and Kernels
  6.8 Mixture Models for Density Estimation and Classification
  6.9 Computational Considerations
  Bibliographic Notes
  Exercises
7 Model Assessment and Selection
  7.1 Introduction
  7.2 Bias, Variance and Model Complexity
  7.3 The Bias–Variance Decomposition
    7.3.1 Example: Bias–Variance Tradeoff
  7.4 Optimism of the Training Error Rate
  7.5 Estimates of In-Sample Prediction Error
  7.6 The Effective Number of Parameters
  7.7 The Bayesian Approach and BIC
  7.8 Minimum Description Length
  7.9 Vapnik–Chervonenkis Dimension
    7.9.1 Example (Continued)
  7.10 Cross-Validation
    7.10.1 K-Fold Cross-Validation
    7.10.2 The Wrong and Right Way to Do Cross-validation
    7.10.3 Does Cross-Validation Really Work?
  7.11 Bootstrap Methods
    7.11.1 Example (Continued)
  7.12 Conditional or Expected Test Error?
  Bibliographic Notes
  Exercises
8 Model Inference and Averaging
  8.1 Introduction
  8.2 The Bootstrap and Maximum Likelihood Methods
    8.2.1 A Smoothing Example
    8.2.2 Maximum Likelihood Inference
    8.2.3 Bootstrap versus Maximum Likelihood
  8.3 Bayesian Methods
  8.4 Relationship Between the Bootstrap and Bayesian Inference
  8.5 The EM Algorithm
    8.5.1 Two-Component Mixture Model
    8.5.2 The EM Algorithm in General
    8.5.3 EM as a Maximization–Maximization Procedure
  8.6 MCMC for Sampling from the Posterior
  8.7 Bagging
    8.7.1 Example: Trees with Simulated Data
  8.8 Model Averaging and Stacking
  8.9 Stochastic Search: Bumping
  Bibliographic Notes
  Exercises
9 Additive Models, Trees, and Related Methods
  9.1 Generalized Additive Models
    9.1.1 Fitting Additive Models
    9.1.2 Example: Additive Logistic Regression
    9.1.3 Summary
  9.2 Tree-Based Methods
    9.2.1 Background
    9.2.2 Regression Trees
    9.2.3 Classification Trees
    9.2.4 Other Issues
    9.2.5 Spam Example (Continued)
  9.3 PRIM: Bump Hunting
    9.3.1 Spam Example (Continued)
  9.4 MARS: Multivariate Adaptive Regression Splines
    9.4.1 Spam Example (Continued)
    9.4.2 Example (Simulated Data)
    9.4.3 Other Issues
  9.5 Hierarchical Mixtures of Experts
  9.6 Missing Data
  9.7 Computational Considerations
  Bibliographic Notes
  Exercises
10 Boosting and Additive Trees
  10.1 Boosting Methods
    10.1.1 Outline of This Chapter
  10.2 Boosting Fits an Additive Model
  10.3 Forward Stagewise Additive Modeling
  10.4 Exponential Loss and AdaBoost
  10.5 Why Exponential Loss?
  10.6 Loss Functions and Robustness
  10.7 "Off-the-Shelf" Procedures for Data Mining
  10.8 Example: Spam Data
  10.9 Boosting Trees
  10.10 Numerical Optimization via Gradient Boosting
    10.10.1 Steepest Descent
    10.10.2 Gradient Boosting
    10.10.3 Implementations of Gradient Boosting
  10.11 Right-Sized Trees for Boosting
  10.12 Regularization
    10.12.1 Shrinkage
    10.12.2 Subsampling
  10.13 Interpretation
    10.13.1 Relative Importance of Predictor Variables
    10.13.2 Partial Dependence Plots
  10.14 Illustrations
    10.14.1 California Housing
    10.14.2 New Zealand Fish
    10.14.3 Demographics Data
  Bibliographic Notes
  Exercises
11 Neural Networks
  11.1 Introduction
  11.2 Projection Pursuit Regression
  11.3 Neural Networks
  11.4 Fitting Neural Networks
  11.5 Some Issues in Training Neural Networks
    11.5.1 Starting Values
    11.5.2 Overfitting
    11.5.3 Scaling of the Inputs
    11.5.4 Number of Hidden Units and Layers
    11.5.5 Multiple Minima
  11.6 Example: Simulated Data
  11.7 Example: ZIP Code Data
  11.8 Discussion
  11.9 Bayesian Neural Nets and the NIPS 2003 Challenge
    11.9.1 Bayes, Boosting and Bagging
    11.9.2 Performance Comparisons
  11.10 Computational Considerations
  Bibliographic Notes
  Exercises
12 Support Vector Machines and Flexible Discriminants
  12.1 Introduction
  12.2 The Support Vector Classifier
    12.2.1 Computing the Support Vector Classifier
    12.2.2 Mixture Example (Continued)
  12.3 Support Vector Machines and Kernels
    12.3.1 Computing the SVM for Classification
    12.3.2 The SVM as a Penalization Method
    12.3.3 Function Estimation and Reproducing Kernels
    12.3.4 SVMs and the Curse of Dimensionality
    12.3.5 A Path Algorithm for the SVM Classifier
    12.3.6 Support Vector Machines for Regression
    12.3.7 Regression and Kernels
    12.3.8 Discussion
  12.4 Generalizing Linear Discriminant Analysis
  12.5 Flexible Discriminant Analysis
    12.5.1 Computing the FDA Estimates
  12.6 Penalized Discriminant Analysis
  12.7 Mixture Discriminant Analysis
    12.7.1 Example: Waveform Data
  Bibliographic Notes
  Exercises
13 Prototype Methods and Nearest-Neighbors
  13.1 Introduction
  13.2 Prototype Methods
    13.2.1 K-means Clustering
    13.2.2 Learning Vector Quantization
    13.2.3 Gaussian Mixtures
  13.3 k-Nearest-Neighbor Classifiers
    13.3.1 Example: A Comparative Study
    13.3.2 Example: k-Nearest-Neighbors and Image Scene Classification
    13.3.3 Invariant Metrics and Tangent Distance
  13.4 Adaptive Nearest-Neighbor Methods
    13.4.1 Example
    13.4.2 Global Dimension Reduction for Nearest-Neighbors
  13.5 Computational Considerations
  Bibliographic Notes
  Exercises
14 Unsupervised Learning
  14.1 Introduction
  14.2 Association Rules
    14.2.1 Market Basket Analysis
    14.2.2 The Apriori Algorithm
    14.2.3 Example: Market Basket Analysis
    14.2.4 Unsupervised as Supervised Learning
    14.2.5 Generalized Association Rules
    14.2.6 Choice of Supervised Learning Method
    14.2.7 Example: Market Basket Analysis (Continued)
  14.3 Cluster Analysis
    14.3.1 Proximity Matrices
    14.3.2 Dissimilarities Based on Attributes
    14.3.3 Object Dissimilarity
    14.3.4 Clustering Algorithms
    14.3.5 Combinatorial Algorithms
    14.3.6 K-means
    14.3.7 Gaussian Mixtures as Soft K-means Clustering
    14.3.8 Example: Human Tumor Microarray Data
    14.3.9 Vector Quantization
    14.3.10 K-medoids
    14.3.11 Practical Issues
    14.3.12 Hierarchical Clustering
  14.4 Self-Organizing Maps
  14.5 Principal Components, Curves and Surfaces
    14.5.1 Principal Components
    14.5.2 Principal Curves and Surfaces
    14.5.3 Spectral Clustering
    14.5.4 Kernel Principal Components
    14.5.5 Sparse Principal Components
  14.6 Non-negative Matrix Factorization
    14.6.1 Archetypal Analysis
  14.7 Independent Component Analysis and Exploratory Projection Pursuit
    14.7.1 Latent Variables and Factor Analysis
    14.7.2 Independent Component Analysis
    14.7.3 Exploratory Projection Pursuit
    14.7.4 A Direct Approach to ICA
  14.8 Multidimensional Scaling
  14.9 Nonlinear Dimension Reduction and Local Multidimensional Scaling
  14.10 The Google PageRank Algorithm
  Bibliographic Notes
  Exercises
15 Random Forests
  15.1 Introduction
  15.2 Definition of Random Forests
  15.3 Details of Random Forests
    15.3.1 Out of Bag Samples
    15.3.2 Variable Importance
    15.3.3 Proximity Plots
    15.3.4 Random Forests and Overfitting
  15.4 Analysis of Random Forests
    15.4.1 Variance and the De-Correlation Effect
    15.4.2 Bias
    15.4.3 Adaptive Nearest Neighbors
  Bibliographic Notes
  Exercises
16 Ensemble Learning
  16.1 Introduction
  16.2 Boosting and Regularization Paths
    16.2.1 Penalized Regression
    16.2.2 The "Bet on Sparsity" Principle
    16.2.3 Regularization Paths, Over-fitting and Margins
  16.3 Learning Ensembles
    16.3.1 Learning a Good Ensemble
    16.3.2 Rule Ensembles
  Bibliographic Notes
  Exercises
17 Undirected Graphical Models
  17.1 Introduction
  17.2 Markov Graphs and Their Properties
  17.3 Undirected Graphical Models for Continuous Variables
    17.3.1 Estimation of the Parameters when the Graph Structure is Known
    17.3.2 Estimation of the Graph Structure
  17.4 Undirected Graphical Models for Discrete Variables
    17.4.1 Estimation of the Parameters when the Graph Structure is Known
    17.4.2 Hidden Nodes
    17.4.3 Estimation of the Graph Structure
    17.4.4 Restricted Boltzmann Machines
  Exercises
18 High-Dimensional Problems: p ≫ N
  18.1 When p is Much Bigger than N
  18.2 Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids
  18.3 Linear Classifiers with Quadratic Regularization
    18.3.1 Regularized Discriminant Analysis
    18.3.2 Logistic Regression with Quadratic Regularization
    18.3.3 The Support Vector Classifier
    18.3.4 Feature Selection
    18.3.5 Computational Shortcuts When p ≫ N
  18.4 Linear Classifiers with L1 Regularization
    18.4.1 Application of Lasso to Protein Mass Spectroscopy
    18.4.2 The Fused Lasso for Functional Data
  18.5 Classification When Features are Unavailable
    18.5.1 Example: String Kernels and Protein Classification
    18.5.2 Classification and Other Models Using Inner-Product Kernels and Pairwise Distances
    18.5.3 Example: Abstracts Classification
  18.6 High-Dimensional Regression: Supervised Principal Components
    18.6.1 Connection to Latent-Variable Modeling
    18.6.2 Relationship with Partial Least Squares
    18.6.3 Pre-Conditioning for Feature Selection
  18.7 Feature Assessment and the Multiple-Testing Problem
    18.7.1 The False Discovery Rate
    18.7.2 Asymmetric Cutpoints and the SAM Procedure
    18.7.3 A Bayesian Interpretation of the FDR
  18.8 Bibliographic Notes
  Exercises
References
Author Index
Index

1 Introduction

Statistical learning plays a key role in many areas of science, finance and industry. Here are some examples of learning problems:

• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient.

• Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.

• Identify the numbers in a handwritten ZIP code, from a digitized image.

• Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person's blood.

• Identify the risk factors for prostate cancer, based on clinical and demographic variables.

The science of learning plays a key role in the fields of statistics, data mining and artificial intelligence, intersecting with areas of engineering and other disciplines.

This book is about learning from data.
In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a training set of data, in which we observe the outcome and feature measurements for a set of objects (such as people). Using this data we build a prediction model, or learner, which will enable us to predict the outcome for new unseen objects. A good learner is one that accurately predicts such an outcome.

The examples above describe what is called the supervised learning problem. It is called "supervised" because of the presence of the outcome variable to guide the learning process. In the unsupervised learning problem, we observe only the features and have no measurements of the outcome. Our task is rather to describe how the data are organized or clustered. We devote most of this book to supervised learning; the unsupervised problem is less developed in the literature, and is the focus of the last chapter.

Here are some examples of real learning problems that are discussed in this book.

Example 1: Email Spam

The data for this example consists of information from 4601 email messages, in a study to try to predict whether the email was junk email, or "spam." The objective was to design an automatic spam detector that could filter out spam before clogging the users' mailboxes. For all 4601 email messages, the true outcome (email type) email or spam is available, along with the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message. This is a supervised learning problem, with the outcome the class variable email/spam. It is also called a classification problem. Table 1.1 lists the words and characters showing the largest average difference between spam and email.

TABLE 1.1. Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.

            george   you  your    hp  free   hpl     !   our    re   edu  remove
    spam      0.00  2.26  1.38  0.02  0.52  0.01  0.51  0.51  0.13  0.01    0.28
    email     1.27  1.27  0.44  0.90  0.07  0.43  0.11  0.18  0.42  0.29    0.01

Our learning method has to decide which features to use and how: for example, we might use a rule such as

    if (%george < 0.6) & (%you > 1.5) then spam else email.

Another form of a rule might be:

    if (0.2 · %you − 0.3 · %george) > 0 then spam else email.
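To make the flavor of such rules concrete, here is a minimal sketch in R (the language used for the book's courses). Only the first rule above is implemented, and the data frame and column names are hypothetical stand-ins for the real spam data:

    # Apply the first rule above: the inputs are the percentages of words in a
    # message equal to "george" and "you" (hypothetical column names).
    classify_spam <- function(pct_george, pct_you) {
      ifelse(pct_george < 0.6 & pct_you > 1.5, "spam", "email")
    }

    msgs <- data.frame(pct_george = c(0.0, 1.3), pct_you = c(2.3, 1.3))
    classify_spam(msgs$pct_george, msgs$pct_you)  # returns "spam" "email"

The methods discussed in the book can be seen as ways of learning such decision rules, and far richer ones, automatically from the training data.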
For this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences. We discuss a number of different methods for tackling this learning problem in the book.

Example 2: Prostate Cancer

The data for this example, displayed in Figure 1.1, come from a study by Stamey et al. (1989) that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures, in 97 men who were about to receive a radical prostatectomy.¹

¹There was an error in these data in the first edition of this book. Subject 32 had a value of 6.1 for lweight, which translates to a 449 gm prostate! The correct value is 44.9 gm. We are grateful to Prof. Stephen W. Link for alerting us to this error.

The goal is to predict the log of PSA (lpsa) from a number of measurements including log cancer volume (lcavol), log prostate weight (lweight), age, log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). Figure 1.1 is a scatterplot matrix of the variables. Some correlations with lpsa are evident, but a good predictive model is difficult to construct by eye.

FIGURE 1.1. Scatterplot matrix of the prostate cancer data. The first row shows the response against each of the predictors in turn. Two of the predictors, svi and gleason, are categorical. [Plot not reproduced here.]

This is a supervised learning problem, known as a regression problem, because the outcome measurement is quantitative.

Example 3: Handwritten Digit Recognition

The data from this example come from the handwritten ZIP codes on envelopes from U.S. postal mail. Each image is a segment from a five digit ZIP code, isolating a single digit. The images are 16×16 eight-bit grayscale maps, with each pixel ranging in intensity from 0 to 255. Some sample images are shown in Figure 1.2.

FIGURE 1.2. Examples of handwritten digits from U.S. postal envelopes. [Images not reproduced here.]

The images have been normalized to have approximately the same size and orientation. The task is to predict, from the 16 × 16 matrix of pixel intensities, the identity of each image (0, 1, ..., 9) quickly and accurately. If it is accurate enough, the resulting algorithm would be used as part of an automatic sorting procedure for envelopes. This is a classification problem for which the error rate needs to be kept very low to avoid misdirection of mail.
In order to achieve this low error rate, some objects can be assigned to a “don’t know” category, and sorted instead by hand. Example 4: DNA Expression Microarrays DNA stands for deoxyribonucleic acid, and is the basic material that makes up human chromosomes. DNA microarrays measure the expression of a gene in a cell by measuring the amount of mRNA (messenger ribonucleic acid) present for that gene. Microarrays are considered a breakthrough technology in biology, facilitating the quantitative study of thousands of genes simultaneously from a single sample of cells. Here is how a DNA microarray works. The nucleotide sequences for a few thousand genes are printed on a glass slide. A target sample and a reference sample are labeled with red and green dyes, and each are hybridized with the DNA on the slide. Through fluoroscopy, the log (red/green) intensities of RNA hybridizing at each site is measured. The result is a few thousand numbers, typically ranging from say −6 to 6, measuring the expression level of each gene in the target relative to the reference sample. Positive values indicate higher expression in the target versus the reference, and vice versa for negative values. A gene expression dataset collects together the expression values from a series of DNA microarray experiments, with each column representing an experiment. There are therefore several thousand rows representing individ- ual genes, and tens of columns representing samples: in the particular ex- ample of Figure 1.3 there are 6830 genes (rows) and 64 samples (columns), although for clarity only a random sample of 100 rows are shown. The fig- ure displays the data set as a heat map, ranging from green (negative) to red (positive). The samples are 64 cancer tumors from different patients. The challenge here is to understand how the genes and samples are or- ganized. Typical questions include the following: (a) which samples are most similar to each other, in terms of their expres- sion profiles across genes? (b) which genes are most similar to each other, in terms of their expression profiles across samples? (c) do certain genes show very high (or low) expression for certain cancer samples? We could view this task as a regression problem, with two categorical predictor variables—genes and samples—with the response variable being the level of expression. However, it is probably more useful to view it as unsupervised learning problem. For example, for question (a) above, we think of the samples as points in 6830 −−dimensional space, which we want to cluster together in some way.6 1. 
FIGURE 1.3. DNA microarray data: expression matrix of 6830 genes (rows) and 64 samples (columns), for the human tumor data. Only a random sample of 100 rows are shown. The display is a heat map, ranging from bright green (negative, under expressed) to bright red (positive, over expressed). Missing values are gray. The rows and columns are displayed in a randomly chosen order. [Heat map, gene identifiers and sample labels not reproduced here.]

Who Should Read this Book

This book is designed for researchers and students in a broad variety of fields: statistics, artificial intelligence, engineering, finance and others. We expect that the reader will have had at least one elementary course in statistics, covering basic topics including linear regression.

We have not attempted to write a comprehensive catalog of learning methods, but rather to describe some of the most important techniques. Equally notable, we describe the underlying concepts and considerations by which a researcher can judge a learning method. We have tried to write this book in an intuitive fashion, emphasizing concepts rather than mathematical details.

As statisticians, our exposition will naturally reflect our backgrounds and areas of expertise. However in the past eight years we have been attending conferences in neural networks, data mining and machine learning, and our thinking has been heavily influenced by these exciting fields. This influence is evident in our current research, and in this book.

How This Book is Organized

Our view is that one must understand simple methods before trying to grasp more complex ones. Hence, after giving an overview of the supervised learning problem in Chapter 2, we discuss linear methods for regression and classification in Chapters 3 and 4. In Chapter 5 we describe splines, wavelets and regularization/penalization methods for a single predictor, while Chapter 6 covers kernel methods and local regression. Both of these sets of methods are important building blocks for high-dimensional learning techniques.
Model assessment and selection is the topic of Chapter 7, covering the concepts of bias and variance, overfitting and methods such as cross-validation for choosing models. Chapter 8 discusses model inference and averaging, including an overview of maximum likelihood, Bayesian inference and the bootstrap, the EM algorithm, Gibbs sampling and bagging. A related procedure called boosting is the focus of Chapter 10.

In Chapters 9–13 we describe a series of structured methods for supervised learning, with Chapters 9 and 11 covering regression and Chapters 12 and 13 focusing on classification. Chapter 14 describes methods for unsupervised learning. Two recently proposed techniques, random forests and ensemble learning, are discussed in Chapters 15 and 16. We describe undirected graphical models in Chapter 17 and finally we study high-dimensional problems in Chapter 18.

At the end of each chapter we discuss computational considerations important for data mining applications, including how the computations scale with the number of observations and predictors. Each chapter ends with Bibliographic Notes giving background references for the material.

We recommend that Chapters 1–4 be first read in sequence. Chapter 7 should also be considered mandatory, as it covers central concepts that pertain to all learning methods. With this in mind, the rest of the book can be read sequentially, or sampled, depending on the reader's interest. The symbol indicates a technically difficult section, one that can be skipped without interrupting the flow of the discussion.

Book Website

The website for this book is located at

    http://www-stat.stanford.edu/ElemStatLearn

It contains a number of resources, including many of the datasets used in this book.

Note for Instructors

We have successfully used the first edition of this book as the basis for a two-quarter course, and with the additional materials in this second edition, it could even be used for a three-quarter sequence. Exercises are provided at the end of each chapter. It is important for students to have access to good software tools for these topics. We used the R and S-PLUS programming languages in our courses.

2 Overview of Supervised Learning

2.1 Introduction

The first three examples described in Chapter 1 have several components in common. For each there is a set of variables that might be denoted as inputs, which are measured or preset. These have some influence on one or more outputs. For each example the goal is to use the inputs to predict the values of the outputs. This exercise is called supervised learning.

We have used the more modern language of machine learning. In the statistical literature the inputs are often called the predictors, a term we will use interchangeably with inputs, and more classically the independent variables. In the pattern recognition literature the term features is preferred, which we use as well. The outputs are called the responses, or classically the dependent variables.

2.2 Variable Types and Terminology

The outputs vary in nature among the examples. In the glucose prediction example, the output is a quantitative measurement, where some measurements are bigger than others, and measurements close in value are close in nature. In the famous Iris discrimination example due to R. A. Fisher, the output is qualitative (species of Iris) and assumes values in a finite set G = {Virginica, Setosa and Versicolor}.
In the handwritten digit example the output is one of 10 different digit classes: $\mathcal{G} = \{0, 1, \ldots, 9\}$. In both of these there is no explicit ordering in the classes, and in fact often descriptive labels rather than numbers are used to denote the classes. Qualitative variables are also referred to as categorical or discrete variables, as well as factors.

For both types of outputs it makes sense to think of using the inputs to predict the output. Given some specific atmospheric measurements today and yesterday, we want to predict the ozone level tomorrow. Given the grayscale values for the pixels of the digitized image of the handwritten digit, we want to predict its class label.

This distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs. We will see that these two tasks have a lot in common, and in particular both can be viewed as a task in function approximation.

Inputs also vary in measurement type; we can have some of each of qualitative and quantitative input variables. These have also led to distinctions in the types of methods that are used for prediction: some methods are defined most naturally for quantitative inputs, some most naturally for qualitative and some for both.

A third variable type is ordered categorical, such as small, medium and large, where there is an ordering between the values, but no metric notion is appropriate (the difference between medium and small need not be the same as that between large and medium). These are discussed further in Chapter 4.

Qualitative variables are typically represented numerically by codes. The easiest case is when there are only two classes or categories, such as "success" or "failure," "survived" or "died." These are often represented by a single binary digit or bit as 0 or 1, or else by −1 and 1. For reasons that will become apparent, such numeric codes are sometimes referred to as targets. When there are more than two categories, several alternatives are available. The most useful and commonly used coding is via dummy variables. Here a $K$-level qualitative variable is represented by a vector of $K$ binary variables or bits, only one of which is "on" at a time. Although more compact coding schemes are possible, dummy variables are symmetric in the levels of the factor.
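To make the dummy-variable coding concrete, here is a minimal sketch in Python/NumPy (an illustration of ours, not software from the book; the helper name `dummy_code` and the example labels are our own):

```python
import numpy as np

def dummy_code(g, levels):
    """Encode a K-level qualitative variable as K dummy (one-hot)
    variables: a vector of K bits, exactly one of which is 'on'."""
    levels = list(levels)
    codes = np.zeros((len(g), len(levels)), dtype=int)
    for i, label in enumerate(g):
        codes[i, levels.index(label)] = 1
    return codes

g = ["Setosa", "Virginica", "Setosa", "Versicolor"]
print(dummy_code(g, ["Virginica", "Setosa", "Versicolor"]))
# Each row contains a single 1, marking the observed level.
```

Note that the coding treats all $K$ levels symmetrically, which is why it is preferred over more compact schemes.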
We will typically denote an input variable by the symbol $X$. If $X$ is a vector, its components can be accessed by subscripts $X_j$. Quantitative outputs will be denoted by $Y$, and qualitative outputs by $G$ (for group). We use uppercase letters such as $X$, $Y$ or $G$ when referring to the generic aspects of a variable. Observed values are written in lowercase; hence the $i$th observed value of $X$ is written as $x_i$ (where $x_i$ is again a scalar or vector). Matrices are represented by bold uppercase letters; for example, a set of $N$ input $p$-vectors $x_i$, $i = 1, \ldots, N$ would be represented by the $N \times p$ matrix $\mathbf{X}$. In general, vectors will not be bold, except when they have $N$ components; this convention distinguishes a $p$-vector of inputs $x_i$ for the $i$th observation from the $N$-vector $\mathbf{x}_j$ consisting of all the observations on variable $X_j$. Since all vectors are assumed to be column vectors, the $i$th row of $\mathbf{X}$ is $x_i^T$, the vector transpose of $x_i$.

For the moment we can loosely state the learning task as follows: given the value of an input vector $X$, make a good prediction of the output $Y$, denoted by $\hat{Y}$ (pronounced "y-hat"). If $Y$ takes values in $\mathbb{R}$ then so should $\hat{Y}$; likewise for categorical outputs, $\hat{G}$ should take values in the same set $\mathcal{G}$ associated with $G$.

For a two-class $G$, one approach is to denote the binary coded target as $Y$, and then treat it as a quantitative output. The predictions $\hat{Y}$ will typically lie in $[0, 1]$, and we can assign to $\hat{G}$ the class label according to whether $\hat{y} > 0.5$. This approach generalizes to $K$-level qualitative outputs as well.

We need data to construct prediction rules, often a lot of it. We thus suppose we have available a set of measurements $(x_i, y_i)$ or $(x_i, g_i)$, $i = 1, \ldots, N$, known as the training data, with which to construct our prediction rule.

2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

In this section we develop two simple but powerful prediction methods: the linear model fit by least squares and the $k$-nearest-neighbor prediction rule. The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions. The method of $k$-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

2.3.1 Linear Models and Least Squares

The linear model has been a mainstay of statistics for the past 30 years and remains one of our most important tools. Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output $Y$ via the model

$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j. \tag{2.1}$$

The term $\hat{\beta}_0$ is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in $X$, include $\hat{\beta}_0$ in the vector of coefficients $\hat{\beta}$, and then write the linear model in vector form as an inner product

$$\hat{Y} = X^T \hat{\beta}, \tag{2.2}$$

where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients. In the $(p+1)$-dimensional input–output space, $(X, \hat{Y})$ represents a hyperplane. If the constant is included in $X$, then the hyperplane includes the origin and is a subspace; if not, it is an affine set cutting the $Y$-axis at the point $(0, \hat{\beta}_0)$. From now on we assume that the intercept is included in $\hat{\beta}$.

Viewed as a function over the $p$-dimensional input space, $f(X) = X^T \beta$ is linear, and the gradient $f'(X) = \beta$ is a vector in input space that points in the steepest uphill direction.

How do we fit the linear model to a set of training data? There are many different methods, but by far the most popular is the method of least squares. In this approach, we pick the coefficients $\beta$ to minimize the residual sum of squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2. \tag{2.3}$$

$\mathrm{RSS}(\beta)$ is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation. We can write

$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta), \tag{2.4}$$

where $\mathbf{X}$ is an $N \times p$ matrix with each row an input vector, and $\mathbf{y}$ is an $N$-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ we get the normal equations

$$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = 0. \tag{2.5}$$

If $\mathbf{X}^T \mathbf{X}$ is nonsingular, then the unique solution is given by

$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \tag{2.6}$$

and the fitted value at the $i$th input $x_i$ is $\hat{y}_i = \hat{y}(x_i) = x_i^T \hat{\beta}$. At an arbitrary input $x_0$ the prediction is $\hat{y}(x_0) = x_0^T \hat{\beta}$. The entire fitted surface is characterized by the $p$ parameters $\hat{\beta}$. Intuitively, it seems that we do not need a very large data set to fit such a model.
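Equations (2.3)–(2.6) translate directly into a few lines of linear algebra. The following is a minimal Python/NumPy sketch (our illustration, not the book's own software; the simulated data, seed, and true coefficients are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3

# Simulated training data; a constant column is prepended so the
# intercept is absorbed into the coefficient vector, as in (2.2).
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Solve the normal equations (2.5)-(2.6); lstsq is used rather than
# forming (X^T X)^{-1} explicitly, which is numerically preferable.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

x0 = np.array([1.0, 0.2, -0.3, 1.1])  # arbitrary input (with constant)
y0_hat = x0 @ beta_hat                # prediction y-hat(x0) = x0^T beta-hat
print(beta_hat, y0_hat)
```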
Let's look at an example of the linear model in a classification context. Figure 2.1 shows a scatterplot of training data on a pair of inputs $X_1$ and $X_2$. The data are simulated, and for the present the simulation model is not important. The output class variable $G$ has the values BLUE or ORANGE, and is represented as such in the scatterplot. There are 100 points in each of the two classes. The linear regression model was fit to these data, with the response $Y$ coded as 0 for BLUE and 1 for ORANGE. The fitted values $\hat{Y}$ are converted to a fitted class variable $\hat{G}$ according to the rule

$$\hat{G} = \begin{cases} \text{ORANGE} & \text{if } \hat{Y} > 0.5, \\ \text{BLUE} & \text{if } \hat{Y} \leq 0.5. \end{cases} \tag{2.7}$$

FIGURE 2.1. A classification example in two dimensions. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by $x^T\hat{\beta} = 0.5$. The orange shaded region denotes that part of input space classified as ORANGE, while the blue region is classified as BLUE.

The set of points in $\mathbb{R}^2$ classified as ORANGE corresponds to $\{x : x^T\hat{\beta} > 0.5\}$, indicated in Figure 2.1, and the two predicted classes are separated by the decision boundary $\{x : x^T\hat{\beta} = 0.5\}$, which is linear in this case. We see that for these data there are several misclassifications on both sides of the decision boundary. Perhaps our linear model is too rigid, or are such errors unavoidable? Remember that these are errors on the training data itself, and we have not said where the constructed data came from. Consider the two possible scenarios:

Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

A mixture of Gaussians is best described in terms of the generative model. One first generates a discrete variable that determines which of the component Gaussians to use, and then generates an observation from the chosen density. In the case of one Gaussian per class, we will see in Chapter 4 that a linear decision boundary is the best one can do, and that our estimate is almost optimal. The region of overlap is inevitable, and future data to be predicted will be plagued by this overlap as well.

In the case of mixtures of tightly clustered Gaussians the story is different. A linear decision boundary is unlikely to be optimal, and in fact is not. The optimal decision boundary is nonlinear and disjoint, and as such will be much more difficult to obtain.

We now look at another classification and regression procedure that is in some sense at the opposite end of the spectrum to the linear model, and far better suited to the second scenario.

2.3.2 Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set $\mathcal{T}$ closest in input space to $x$ to form $\hat{Y}$. Specifically, the $k$-nearest neighbor fit for $\hat{Y}$ is defined as follows:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i, \tag{2.8}$$

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance. So, in words, we find the $k$ observations with $x_i$ closest to $x$ in input space, and average their responses.
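A minimal sketch of the $k$-nearest-neighbor fit (2.8), assuming Euclidean distance and plain NumPy (again an illustration of ours; the simulated training data are arbitrary):

```python
import numpy as np

def knn_fit(x, X_train, y_train, k):
    """k-nearest-neighbor fit (2.8): average the responses of the
    k training points closest to x in Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighborhood = np.argsort(dists)[:k]   # indices of N_k(x)
    return y_train[neighborhood].mean()

# With a 0/1-coded response, the fit is the proportion of class-1
# points among the k neighbors; classify to class 1 if it exceeds 0.5.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + rng.normal(size=200) > 0).astype(float)
print(knn_fit(np.zeros(2), X_train, y_train, k=15))
```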
In Figure 2.2 we use the same training data as in Figure 2.1, and use 15-nearest-neighbor averaging of the binary coded response as the method of fitting. Thus $\hat{Y}$ is the proportion of ORANGEs in the neighborhood, and so assigning class ORANGE to $\hat{G}$ if $\hat{Y} > 0.5$ amounts to a majority vote in the neighborhood. The colored regions indicate all those points in input space classified as BLUE or ORANGE by such a rule, in this case found by evaluating the procedure on a fine grid in input space. We see that the decision boundaries that separate the BLUE from the ORANGE regions are far more irregular, and respond to local clusters where one class dominates.

Figure 2.3 shows the results for 1-nearest-neighbor classification: $\hat{Y}$ is assigned the value $y_\ell$ of the closest point $x_\ell$ to $x$ in the training data. In this case the regions of classification can be computed relatively easily, and correspond to a Voronoi tessellation of the training data. Each point $x_i$ has an associated tile bounding the region for which it is the closest input point. For all points $x$ in the tile, $\hat{G}(x) = g_i$. The decision boundary is even more irregular than before.

The method of $k$-nearest-neighbor averaging is defined in exactly the same way for regression of a quantitative output $Y$, although $k = 1$ would be an unlikely choice.

FIGURE 2.2. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1) and then fit by 15-nearest-neighbor averaging as in (2.8). The predicted class is hence chosen by majority vote amongst the 15 nearest neighbors.

In Figure 2.2 we see that far fewer training observations are misclassified than in Figure 2.1. This should not give us too much comfort, though, since in Figure 2.3 none of the training data are misclassified. A little thought suggests that for $k$-nearest-neighbor fits, the error on the training data should be approximately an increasing function of $k$, and will always be 0 for $k = 1$. An independent test set would give us a more satisfactory means for comparing the different methods.

It appears that $k$-nearest-neighbor fits have a single parameter, the number of neighbors $k$, compared to the $p$ parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of $k$-nearest neighbors is $N/k$ and is generally bigger than $p$, and decreases with increasing $k$. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be $N/k$ neighborhoods and we would fit one parameter (a mean) in each neighborhood.

It is also clear that we cannot use sum-of-squared errors on the training set as a criterion for picking $k$, since we would always pick $k = 1$!
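To see this concretely, the following sketch (using scikit-learn's KNeighborsClassifier as a stand-in for the book's own software; data and seed are our own) reports the training error for a few values of $k$. At $k = 1$ each point is its own nearest neighbor, so the training error is exactly 0:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 0).astype(int)

for k in [1, 5, 15, 51]:
    fit = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    train_err = 1 - fit.score(X, y)  # error on the training data itself
    print(f"k = {k:3d}  training error = {train_err:.3f}")
# Training error is 0 at k = 1 and (roughly) increases with k,
# so minimizing it would always select k = 1.
```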
It would seem that $k$-nearest-neighbor methods would be more appropriate for the mixture Scenario 2 described above, while for Gaussian data the decision boundaries of $k$-nearest neighbors would be unnecessarily noisy.

FIGURE 2.3. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then predicted by 1-nearest-neighbor classification.

2.3.3 From Least Squares to Nearest Neighbors

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. In language we will develop shortly, it has low variance and potentially high bias.

On the other hand, the $k$-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable: high variance and low bias.

Each method has its own situations for which it works best; in particular linear regression is more appropriate for Scenario 1 above, while nearest neighbors are more suitable for Scenario 2. The time has come to expose the oracle! The data in fact were simulated from a model somewhere between the two, but closer to Scenario 2. First we generated 10 means $m_k$ from a bivariate Gaussian distribution $N((1, 0)^T, \mathbf{I})$ and labeled this class BLUE. Similarly, 10 more were drawn from $N((0, 1)^T, \mathbf{I})$ and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an $m_k$ at random with probability $1/10$, and then generated an observation from $N(m_k, \mathbf{I}/5)$, thus leading to a mixture of Gaussian clusters for each class. Figure 2.4 shows the results of classifying 10,000 new observations generated from the model. We compare the results for least squares and those for $k$-nearest neighbors for a range of values of $k$.

FIGURE 2.4. Misclassification curves for the simulation example used in Figures 2.1, 2.2 and 2.3. A single training sample of size 200 was used, and a test sample of size 10,000. The orange curves are test and the blue are training error for k-nearest-neighbor classification. The results for linear regression are the bigger orange and blue squares at three degrees of freedom. The purple line is the optimal Bayes error rate.
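The generative recipe just revealed is easy to reproduce. Below is a sketch under the stated assumptions (10 means per class, observations drawn from $N(m_k, \mathbf{I}/5)$); the seed and helper names are our own, so the exact point cloud will differ from the book's figures:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(center, n_obs=100, n_means=10):
    """Draw n_means cluster centers m_k ~ N(center, I), then n_obs
    observations, each from N(m_k, I/5) for a randomly chosen m_k."""
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    picks = rng.integers(0, n_means, size=n_obs)
    return rng.multivariate_normal(np.zeros(2), np.eye(2) / 5,
                                   size=n_obs) + means[picks]

blue = sample_class(np.array([1.0, 0.0]))    # class BLUE, Y = 0
orange = sample_class(np.array([0.0, 1.0]))  # class ORANGE, Y = 1
X = np.vstack([blue, orange])
y = np.repeat([0, 1], 100)
print(X.shape, y.mean())  # 200 training points, balanced classes
```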
A large subset of the most popular techniques in use today are variants of these two simple procedures. In fact 1-nearest-neighbor, the simplest of all, captures a large percentage of the market for low-dimensional problems. The following list describes some ways in which these simple procedures have been enhanced:

• Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.

• In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.

• Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.

• Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.

• Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.

2.4 Statistical Decision Theory

In this section we develop a small amount of theory that provides a framework for developing models such as those discussed informally so far. We first consider the case of a quantitative output, and place ourselves in the world of random variables and probability spaces. Let $X \in \mathbb{R}^p$ denote a real valued random input vector, and $Y \in \mathbb{R}$ a real valued random output variable, with joint distribution $\Pr(X, Y)$. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. This theory requires a loss function $L(Y, f(X))$ for penalizing errors in prediction, and by far the most common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$. This leads us to a criterion for choosing $f$,

$$\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2 \tag{2.9}$$
$$= \int [y - f(x)]^2 \Pr(dx, dy), \tag{2.10}$$

the expected (squared) prediction error. By conditioning¹ on $X$, we can write EPE as

$$\mathrm{EPE}(f) = \mathrm{E}_X \mathrm{E}_{Y|X}\left([Y - f(X)]^2 \mid X\right) \tag{2.11}$$

¹Conditioning here amounts to factoring the joint density $\Pr(X, Y) = \Pr(Y|X)\Pr(X)$ where $\Pr(Y|X) = \Pr(Y, X)/\Pr(X)$, and splitting up the bivariate integral accordingly.

and we see that it suffices to minimize EPE pointwise:

$$f(x) = \operatorname{argmin}_c \mathrm{E}_{Y|X}\left([Y - c]^2 \mid X = x\right). \tag{2.12}$$

The solution is

$$f(x) = \mathrm{E}(Y \mid X = x), \tag{2.13}$$

the conditional expectation, also known as the regression function. Thus the best prediction of $Y$ at any point $X = x$ is the conditional mean, when best is measured by average squared error.

The nearest-neighbor methods attempt to directly implement this recipe using the training data. At each point $x$, we might ask for the average of all those $y_i$s with input $x_i = x$. Since there is typically at most one observation at any point $x$, we settle for

$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)), \tag{2.14}$$

where "Ave" denotes average, and $N_k(x)$ is the neighborhood containing the $k$ points in $\mathcal{T}$ closest to $x$. Two approximations are happening here:

• expectation is approximated by averaging over sample data;

• conditioning at a point is relaxed to conditioning on some region "close" to the target point.

For large training sample size $N$, the points in the neighborhood are likely to be close to $x$, and as $k$ gets large the average will get more stable. In fact, under mild regularity conditions on the joint probability distribution $\Pr(X, Y)$, one can show that as $N, k \to \infty$ such that $k/N \to 0$, $\hat{f}(x) \to \mathrm{E}(Y \mid X = x)$. In light of this, why look further, since it seems we have a universal approximator? We often do not have very large samples. If the linear or some more structured model is appropriate, then we can usually get a more stable estimate than $k$-nearest neighbors, although such knowledge has to be learned from the data as well.
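A quick numerical check of the two approximations behind (2.14), under assumptions of our own choosing (a known regression function and uniform inputs), shows the $k$-NN average approaching the conditional mean $\mathrm{E}(Y|X=x)$ as $N$ and $k$ grow with $k/N \to 0$:

```python
import numpy as np

rng = np.random.default_rng(3)

def knn_average(x, X, y, k):
    """Estimate f(x) = E(Y | X = x) by averaging y over the k
    nearest training inputs, as in (2.14)."""
    idx = np.argsort(np.abs(X - x))[:k]
    return y[idx].mean()

f = np.sin  # true regression function (our choice, for illustration)
for N in [100, 1000, 10000]:
    X = rng.uniform(-3, 3, size=N)
    y = f(X) + rng.normal(scale=0.5, size=N)
    k = int(np.sqrt(N))  # k grows, but k/N -> 0
    est = knn_average(1.0, X, y, k)
    print(f"N = {N:6d}, k = {k:3d}: f-hat(1.0) = {est:.3f} "
          f"(truth {f(1.0):.3f})")
```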
There are other problems though, sometimes disastrous. In Section 2.5 we see that as the dimension $p$ gets large, so does the metric size of the $k$-nearest neighborhood. So settling for nearest neighborhood as a surrogate for conditioning will fail us miserably. The convergence above still holds, but the rate of convergence decreases as the dimension increases.

How does linear regression fit into this framework? The simplest explanation is that one assumes that the regression function $f(x)$ is approximately linear in its arguments:

$$f(x) \approx x^T \beta. \tag{2.15}$$

This is a model-based approach: we specify a model for the regression function. Plugging this linear model for $f(x)$ into EPE (2.9) and differentiating, we can solve for $\beta$ theoretically:

$$\beta = [\mathrm{E}(XX^T)]^{-1} \mathrm{E}(XY). \tag{2.16}$$

Note we have not conditioned on $X$; rather we have used our knowledge of the functional relationship to pool over values of $X$. The least squares solution (2.6) amounts to replacing the expectation in (2.16) by averages over the training data.

So both $k$-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:

• Least squares assumes $f(x)$ is well approximated by a globally linear function.

• $k$-nearest neighbors assumes $f(x)$ is well approximated by a locally constant function.

Although the latter seems more palatable, we have already seen that we may pay a price for this flexibility.

Many of the more modern techniques described in this book are model based, although far more flexible than the rigid linear model. For example, additive models assume that

$$f(X) = \sum_{j=1}^{p} f_j(X_j). \tag{2.17}$$

This retains the additivity of the linear model, but each coordinate function $f_j$ is arbitrary. It turns out that the optimal estimate for the additive model uses techniques such as $k$-nearest neighbors to approximate univariate conditional expectations simultaneously for each of the coordinate functions. Thus the problems of estimating a conditional expectation in high dimensions are swept away in this case by imposing some (often unrealistic) model assumptions, in this case additivity.

Are we happy with the criterion (2.11)? What happens if we replace the $L_2$ loss function with the $L_1$: $\mathrm{E}|Y - f(X)|$? The solution in this case is the conditional median,

$$\hat{f}(x) = \mathrm{median}(Y \mid X = x), \tag{2.18}$$

which is a different measure of location, and its estimates are more robust than those for the conditional mean. $L_1$ criteria have discontinuities in their derivatives, which have hindered their widespread use. Other more resistant loss functions will be mentioned in later chapters, but squared error is analytically convenient and the most popular.
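To make concrete the link between the loss function and the measure of location in (2.13) and (2.18), this little numerical sketch (an example of our own) minimizes average squared and absolute error over a grid of constants $c$; the minimizers land at the sample mean and sample median respectively:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.exponential(scale=1.0, size=10000)  # a deliberately skewed sample

c_grid = np.linspace(0, 3, 3001)
sq_loss = [np.mean((y - c) ** 2) for c in c_grid]
abs_loss = [np.mean(np.abs(y - c)) for c in c_grid]

print("argmin squared loss :", c_grid[np.argmin(sq_loss)],
      " sample mean  :", y.mean())
print("argmin absolute loss:", c_grid[np.argmin(abs_loss)],
      " sample median:", np.median(y))
# Squared error is minimized at the mean, absolute error at the median.
```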
What do we do when the output is a categorical variable $G$? The same paradigm works here, except we need a different loss function for penalizing prediction errors. An estimate $\hat{G}$ will assume values in $\mathcal{G}$, the set of possible classes. Our loss function can be represented by a $K \times K$ matrix $\mathbf{L}$, where $K = \mathrm{card}(\mathcal{G})$. $\mathbf{L}$ will be zero on the diagonal and nonnegative elsewhere, where $L(k, \ell)$ is the price paid for classifying an observation belonging to class $\mathcal{G}_k$ as $\mathcal{G}_\ell$. Most often we use the zero–one loss function, where all misclassifications are charged a single unit. The expected prediction error is

$$\mathrm{EPE} = \mathrm{E}[L(G, \hat{G}(X))], \tag{2.19}$$

where again the expectation is taken with respect to the joint distribution $\Pr(G, X)$. Again we condition, and can write EPE as

$$\mathrm{EPE} = \mathrm{E}_X \sum_{k=1}^{K} L[\mathcal{G}_k, \hat{G}(X)] \Pr(\mathcal{G}_k \mid X) \tag{2.20}$$

and again it suffices to minimize EPE pointwise:

$$\hat{G}(x) = \operatorname{argmin}_{g \in \mathcal{G}} \sum_{k=1}^{K} L(\mathcal{G}_k, g) \Pr(\mathcal{G}_k \mid X = x). \tag{2.21}$$

With the 0–1 loss function this simplifies to

$$\hat{G}(x) = \operatorname{argmin}_{g \in \mathcal{G}} [1 - \Pr(g \mid X = x)] \tag{2.22}$$

or simply

$$\hat{G}(x) = \mathcal{G}_k \text{ if } \Pr(\mathcal{G}_k \mid X = x) = \max_{g \in \mathcal{G}} \Pr(g \mid X = x). \tag{2.23}$$

This reasonable solution is known as the Bayes classifier, and says that we classify to the most probable class, using the conditional (discrete) distribution $\Pr(G|X)$. Figure 2.5 shows the Bayes-optimal decision boundary for our simulation example. The error rate of the Bayes classifier is called the Bayes rate.

FIGURE 2.5. The optimal Bayes decision boundary for the simulation example of Figures 2.1, 2.2 and 2.3. Since the generating density is known for each class, this boundary can be calculated exactly (Exercise 2.2).

Again we see that the $k$-nearest neighbor classifier directly approximates this solution: a majority vote in a nearest neighborhood amounts to exactly this, except that conditional probability at a point is relaxed to conditional probability within a neighborhood of a point, and probabilities are estimated by training-sample proportions.

Suppose for a two-class problem we had taken the dummy-variable approach and coded $G$ via a binary $Y$, followed by squared error loss estimation. Then $\hat{f}(X) = \mathrm{E}(Y|X) = \Pr(G = \mathcal{G}_1 \mid X)$ if $\mathcal{G}_1$ corresponded to $Y = 1$. Likewise for a $K$-class problem, $\mathrm{E}(Y_k|X) = \Pr(G = \mathcal{G}_k|X)$. This shows that our dummy-variable regression procedure, followed by classification to the largest fitted value, is another way of representing the Bayes classifier. Although this theory is exact, in practice problems can occur, depending on the regression model used. For example, when linear regression is used, $\hat{f}(X)$ need not be positive, and we might be suspicious about using it as an estimate of a probability. We will discuss a variety of approaches to modeling $\Pr(G|X)$ in Chapter 4.
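Because the mixture densities in our simulation example are known, the Bayes classifier of (2.23) can be evaluated directly. Below is a sketch under the generating model of Section 2.3.3, assuming equal class priors; the means are drawn afresh here (with our own seed), so this mirrors the construction rather than reproducing the book's exact boundary:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
blue_means = rng.multivariate_normal([1, 0], np.eye(2), size=10)
orange_means = rng.multivariate_normal([0, 1], np.eye(2), size=10)

def class_density(x, means):
    """Mixture density: each of the 10 components N(m_k, I/5)
    carries weight 1/10."""
    return np.mean([multivariate_normal.pdf(x, m, np.eye(2) / 5)
                    for m in means])

def bayes_classify(x):
    """(2.23) with equal class priors: classify to the class whose
    mixture density at x is larger."""
    p_orange = class_density(x, orange_means)
    p_blue = class_density(x, blue_means)
    return "ORANGE" if p_orange > p_blue else "BLUE"

print(bayes_classify(np.array([0.0, 1.0])))  # deep in ORANGE territory
```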
2.5 Local Methods in High Dimensions

We have examined two learning techniques for prediction so far: the stable but biased linear model and the less stable but apparently less biased class of $k$-nearest-neighbor estimates. It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by $k$-nearest-neighbor averaging, since we should be able to find a fairly large neighborhood of observations close to any $x$ and average them. This approach and our intuition breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality (Bellman, 1961). There are many manifestations of this problem, and we will examine a few here.

Consider the nearest-neighbor procedure for inputs uniformly distributed in a $p$-dimensional unit hypercube, as in Figure 2.6. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction $r$ of the observations. Since this corresponds to a fraction $r$ of the unit volume, the expected edge length will be $e_p(r) = r^{1/p}$. In ten dimensions $e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.80$, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer "local." Reducing $r$ dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.

FIGURE 2.6. The curse of dimensionality is well illustrated by a subcubical neighborhood for uniform data in a unit cube. The figure on the right shows the side-length of the subcube needed to capture a fraction $r$ of the volume of the data, for different dimensions $p$. In ten dimensions we need to cover 80% of the range of each coordinate to capture 10% of the data.

Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider $N$ data points uniformly distributed in a $p$-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression

$$d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p} \tag{2.24}$$

(Exercise 2.3). A more complicated expression exists for the mean distance to the closest point. For $N = 500$, $p = 10$, $d(p, N) \approx 0.52$, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them.

Another manifestation of the curse is that the sampling density is proportional to $N^{1/p}$, where $p$ is the dimension of the input space and $N$ is the sample size. Thus, if $N_1 = 100$ represents a dense sample for a single input problem, then $N_{10} = 100^{10}$ is the sample size required for the same sampling density with 10 inputs. Thus in high dimensions all feasible training samples sparsely populate the input space.

Let us construct another uniform example. Suppose we have 1000 training examples $x_i$ generated uniformly on $[-1, 1]^p$. Assume that the true relationship between $X$ and $Y$ is

$$Y = f(X) = e^{-8\|X\|^2},$$

without any measurement error. We use the 1-nearest-neighbor rule to predict $y_0$ at the test point $x_0 = 0$. Denote the training set by $\mathcal{T}$. We can compute the expected prediction error at $x_0$ for our procedure, averaging over all such samples of size 1000. Since the problem is deterministic, this is the mean squared error (MSE) for estimating $f(0)$:

$$\begin{aligned}
\mathrm{MSE}(x_0) &= \mathrm{E}_{\mathcal{T}}[f(x_0) - \hat{y}_0]^2 \\
&= \mathrm{E}_{\mathcal{T}}[\hat{y}_0 - \mathrm{E}_{\mathcal{T}}(\hat{y}_0)]^2 + [\mathrm{E}_{\mathcal{T}}(\hat{y}_0) - f(x_0)]^2 \\
&= \mathrm{Var}_{\mathcal{T}}(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0).
\end{aligned} \tag{2.25}$$

Figure 2.7 illustrates the setup. We have broken down the MSE into two components that will become familiar as we proceed: variance and squared bias. Such a decomposition is always possible and often useful, and is known as the bias–variance decomposition.
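The decomposition (2.25) is easy to estimate by Monte Carlo. A sketch under the stated setup (1000 uniform points on $[-1,1]^p$, 1-NN at the origin; the number of replications and seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def one_nn_at_origin(p, N=1000, reps=200):
    """Monte Carlo estimate of Var and squared bias of the 1-nearest-
    neighbor estimate of f(0), as in (2.25)."""
    f0 = 1.0  # f(0) = exp(-8 * 0) = 1
    y0_hats = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1, 1, size=(N, p))
        i = np.argmin((X ** 2).sum(axis=1))          # nearest point to 0
        y0_hats[r] = np.exp(-8 * (X[i] ** 2).sum())  # its (noiseless) y
    return y0_hats.var(), (y0_hats.mean() - f0) ** 2

for p in [1, 2, 5, 10]:
    var, bias2 = one_nn_at_origin(p)
    print(f"p = {p:2d}: variance = {var:.3f}, squared bias = {bias2:.3f}, "
          f"MSE = {var + bias2:.3f}")
# The squared bias comes to dominate as p grows, as in Figure 2.7.
```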
Let us construct another uniform example. Suppose we have 1000 training examples x_i generated uniformly on [−1, 1]^p. Assume that the true relationship between X and Y is

    Y = f(X) = e^{−8||X||²},

without any measurement error. We use the 1-nearest-neighbor rule to predict y_0 at the test point x_0 = 0. Denote the training set by T. We can compute the expected prediction error at x_0 for our procedure, averaging over all such samples of size 1000. Since the problem is deterministic, this is the mean squared error (MSE) for estimating f(0):

    MSE(x_0) = E_T[f(x_0) − ŷ_0]²
             = E_T[ŷ_0 − E_T(ŷ_0)]² + [E_T(ŷ_0) − f(x_0)]²
             = Var_T(ŷ_0) + Bias²(ŷ_0).    (2.25)

Figure 2.7 illustrates the setup. We have broken down the MSE into two components that will become familiar as we proceed: variance and squared bias. Such a decomposition is always possible and often useful, and is known as the bias–variance decomposition.

Unless the nearest neighbor is at 0, ŷ_0 will be smaller than f(0) in this example, and so the average estimate will be biased downward. The variance is due to the sampling variance of the 1-nearest neighbor. In low dimensions and with N = 1000, the nearest neighbor is very close to 0, and so both the bias and variance are small. As the dimension increases, the nearest neighbor tends to stray further from the target point, and both bias and variance are incurred. By p = 10, for more than 99% of the samples the nearest neighbor is a distance greater than 0.5 from the origin. Thus as p increases, the estimate tends to be 0 more often than not, and hence the MSE levels off at 1.0, as does the bias, and the variance starts dropping (an artifact of this example).

FIGURE 2.7. A simulation example, demonstrating the curse of dimensionality and its effect on MSE, bias and variance. The input features are uniformly distributed in [−1, 1]^p for p = 1, ..., 10. The top left panel shows the target function (no noise) in IR: f(X) = e^{−8||X||²}, and demonstrates the error that 1-nearest neighbor makes in estimating f(0). The training point is indicated by the blue tick mark. The top right panel illustrates why the radius of the 1-nearest neighborhood increases with dimension p. The lower left panel shows the average radius of the 1-nearest neighborhoods. The lower right panel shows the MSE, squared bias and variance curves as a function of dimension p.

Although this is a highly contrived example, similar phenomena occur more generally. The complexity of functions of many variables can grow exponentially with the dimension, and if we wish to be able to estimate such functions with the same accuracy as functions in low dimensions, then we need the size of our training set to grow exponentially as well. In this example, the function is a complex interaction of all p variables involved. The dependence of the bias term on distance depends on the truth, and it need not always dominate with 1-nearest neighbor. For example, if the function always involves only a few dimensions as in Figure 2.8, then the variance can dominate instead.

FIGURE 2.8. A simulation example with the same setup as in Figure 2.7. Here the function is constant in all but one dimension: f(X) = (1/2)(X_1 + 1)³. The variance dominates.
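A hedged Monte Carlo sketch of the decomposition (2.25) for this example. The number of repetitions is kept modest so the run is fast; it is an illustration of the setup, not the book's simulation code.

```python
# Bias-variance decomposition (2.25) for the 1-NN estimate of f(0),
# f(X) = exp(-8||X||^2), inputs uniform on [-1, 1]^p, no noise.
import numpy as np

rng = np.random.default_rng(0)

def one_nn_at_origin(p, N=1000, reps=200):
    yhat = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1.0, 1.0, size=(N, p))
        i = np.argmin(np.sum(X**2, axis=1))       # nearest neighbor of x0 = 0
        yhat[r] = np.exp(-8.0 * np.sum(X[i]**2))  # its (noiseless) response
    f0 = 1.0                                      # f(0) = exp(0)
    bias2 = (yhat.mean() - f0) ** 2
    var = yhat.var()
    return bias2, var, bias2 + var                # MSE = Bias^2 + Var

for p in (1, 5, 10):
    print(p, one_nn_at_origin(p))   # MSE climbs toward 1.0 as p grows
```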
Suppose, on the other hand, that we know that the relationship between Y and X is linear,

    Y = X^T β + ε,    (2.26)

where ε ∼ N(0, σ²), and we fit the model by least squares to the training data. For an arbitrary test point x_0, we have ŷ_0 = x_0^T β̂, which can be written as ŷ_0 = x_0^T β + Σ_{i=1}^N ℓ_i(x_0) ε_i, where ℓ_i(x_0) is the ith element of X(X^T X)^{−1} x_0. Since under this model the least squares estimates are unbiased, we find that

    EPE(x_0) = E_{y_0|x_0} E_T (y_0 − ŷ_0)²
             = Var(y_0|x_0) + E_T[ŷ_0 − E_T ŷ_0]² + [E_T ŷ_0 − x_0^T β]²
             = Var(y_0|x_0) + Var_T(ŷ_0) + Bias²(ŷ_0)
             = σ² + E_T x_0^T (X^T X)^{−1} x_0 σ² + 0².    (2.27)

Here we have incurred an additional variance σ² in the prediction error, since our target is not deterministic. There is no bias, and the variance depends on x_0. If N is large and T were selected at random, and assuming E(X) = 0, then X^T X → N Cov(X) and

    E_{x_0} EPE(x_0) ∼ E_{x_0} x_0^T Cov(X)^{−1} x_0 σ²/N + σ²
                    = trace[Cov(X)^{−1} Cov(x_0)] σ²/N + σ²
                    = σ²(p/N) + σ².    (2.28)

Here we see that the expected EPE increases linearly as a function of p, with slope σ²/N. If N is large and/or σ² is small, this growth in variance is negligible (0 in the deterministic case). By imposing some heavy restrictions on the class of models being fitted, we have avoided the curse of dimensionality. Some of the technical details in (2.27) and (2.28) are derived in Exercise 2.5.

Figure 2.9 compares 1-nearest neighbor vs. least squares in two situations, both of which have the form Y = f(X) + ε, X uniform as before, and ε ∼ N(0, 1). The sample size is N = 500. For the orange curve, f(x) is linear in the first coordinate; for the blue curve, cubic as in Figure 2.8. Shown is the relative EPE of 1-nearest neighbor to least squares, which appears to start at around 2 for the linear case. Least squares is unbiased in this case, and as discussed above the EPE is slightly above σ² = 1. The EPE for 1-nearest neighbor is always above 2, since the variance of f̂(x_0) in this case is at least σ², and the ratio increases with dimension as the nearest neighbor strays from the target point. For the cubic case, least squares is biased, which moderates the ratio. Clearly we could manufacture examples where the bias of least squares would dominate the variance, and the 1-nearest neighbor would come out the winner.

FIGURE 2.9. The curves show the expected prediction error (at x_0 = 0) for 1-nearest neighbor relative to least squares for the model Y = f(X) + ε. For the orange curve, f(x) = x_1, while for the blue curve f(x) = (1/2)(x_1 + 1)³.

By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is substantially larger. However, if the assumptions are wrong, all bets are off and the 1-nearest neighbor may dominate. We will see that there is a whole spectrum of models between the rigid linear models and the extremely flexible 1-nearest-neighbor models, each with their own assumptions and biases, which have been proposed specifically to avoid the exponential growth in complexity of functions in high dimensions by drawing heavily on these assumptions.

Before we delve more deeply, let us elaborate a bit on the concept of statistical models and see how they fit into the prediction framework.
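A small simulation, an illustrative sketch rather than the book's, checking the approximation (2.28): under the linear model, the expected prediction error of least squares at a random test point is roughly σ²(1 + p/N).

```python
# Check EPE ~ sigma^2 * (1 + p/N) for OLS under the linear model (2.26).
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma, reps = 500, 10, 1.0, 500
beta = rng.normal(size=p)

err2 = 0.0
for _ in range(reps):
    X = rng.normal(size=(N, p))                # E(X) = 0 as assumed in the text
    y = X @ beta + sigma * rng.normal(size=N)
    bhat = np.linalg.lstsq(X, y, rcond=None)[0]
    x0 = rng.normal(size=p)                    # test point from the same density
    y0 = x0 @ beta + sigma * rng.normal()
    err2 += (y0 - x0 @ bhat) ** 2

print(err2 / reps)                  # simulated EPE
print(sigma**2 * (1 + p / N))       # theory: 1.02
```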
2.6 Statistical Models, Supervised Learning and Function Approximation

Our goal is to find a useful approximation f̂(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs. In the theoretical setting of Section 2.4, we saw that squared error loss leads us to the regression function f(x) = E(Y|X = x) for a quantitative response. The class of nearest-neighbor methods can be viewed as direct estimates of this conditional expectation, but we have seen that they can fail in at least two ways:

• if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors;
• if special structure is known to exist, this can be used to reduce both the bias and the variance of the estimates.

We anticipate using other classes of models for f(x), in many cases specifically designed to overcome the dimensionality problems, and here we discuss a framework for incorporating them into the prediction problem.

2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)

Suppose in fact that our data arose from a statistical model

    Y = f(X) + ε,    (2.29)

where the random error ε has E(ε) = 0 and is independent of X. Note that for this model, f(x) = E(Y|X = x), and in fact the conditional distribution Pr(Y|X) depends on X only through the conditional mean f(x).

The additive error model is a useful approximation to the truth. For most systems the input–output pairs (X, Y) will not have a deterministic relationship Y = f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error ε.

For some problems a deterministic relationship does hold. Many of the classification problems studied in machine learning are of this form, where the response surface can be thought of as a colored map defined in IR^p. The training data consist of colored examples from the map {x_i, g_i}, and the goal is to be able to color any point. Here the function is deterministic, and the randomness enters through the x location of the training points. For the moment we will not pursue such problems, but will see that they can be handled by techniques appropriate for the error-based models.

The assumption in (2.29) that the errors are independent and identically distributed is not strictly necessary, but seems to be at the back of our mind when we average squared errors uniformly in our EPE criterion. With such a model it becomes natural to use least squares as a data criterion for model estimation as in (2.1). Simple modifications can be made to avoid the independence assumption; for example, we can have Var(Y|X = x) = σ(x), and now both the mean and variance depend on X. In general the conditional distribution Pr(Y|X) can depend on X in complicated ways, but the additive error model precludes these.

So far we have concentrated on the quantitative response. Additive error models are typically not used for qualitative outputs G; in this case the target function p(X) is the conditional density Pr(G|X), and this is modeled directly. For example, for two-class data, it is often reasonable to assume that the data arise from independent binary trials, with the probability of one particular outcome being p(X), and the other 1 − p(X). Thus if Y is the 0–1 coded version of G, then E(Y|X = x) = p(x), but the variance depends on x as well: Var(Y|X = x) = p(x)[1 − p(x)].

2.6.2 Supervised Learning

Before we launch into more statistically oriented jargon, we present the function-fitting paradigm from a machine learning point of view. Suppose for simplicity that the errors are additive and that the model Y = f(X) + ε is a reasonable assumption.
Supervised learning attempts to learn f by example through a teacher. One observes the system under study, both the inputs and outputs, and assembles a training set of observations T = (x_i, y_i), i = 1, ..., N. The observed input values to the system x_i are also fed into an artificial system, known as a learning algorithm (usually a computer program), which also produces outputs f̂(x_i) in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship f̂ in response to differences y_i − f̂(x_i) between the original and generated outputs. This process is known as learning by example. Upon completion of the learning process the hope is that the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice.

2.6.3 Function Approximation

The learning paradigm of the previous section has been the motivation for research into the supervised learning problem in the fields of machine learning (with analogies to human reasoning) and neural networks (with biological analogies to the brain). The approach taken in applied mathematics and statistics has been from the perspective of function approximation and estimation. Here the data pairs {x_i, y_i} are viewed as points in a (p + 1)-dimensional Euclidean space. The function f(x) has domain equal to the p-dimensional input subspace, and is related to the data via a model such as y_i = f(x_i) + ε_i. For convenience in this chapter we will assume the domain is IR^p, a p-dimensional Euclidean space, although in general the inputs can be of mixed type. The goal is to obtain a useful approximation to f(x) for all x in some region of IR^p, given the representations in T. Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

Many of the approximations we will encounter have associated a set of parameters θ that can be modified to suit the data at hand. For example, the linear model f(x) = x^T β has θ = β. Another class of useful approximators can be expressed as linear basis expansions

    f_θ(x) = Σ_{k=1}^K h_k(x) θ_k,    (2.30)

where the h_k are a suitable set of functions or transformations of the input vector x. Traditional examples are polynomial and trigonometric expansions, where for example h_k might be x_1², x_1 x_2², cos(x_1) and so on. We also encounter nonlinear expansions, such as the sigmoid transformation common to neural network models,

    h_k(x) = 1 / (1 + exp(−x^T β_k)).    (2.31)

We can use least squares to estimate the parameters θ in f_θ as we did for the linear model, by minimizing the residual sum-of-squares

    RSS(θ) = Σ_{i=1}^N (y_i − f_θ(x_i))²    (2.32)

as a function of θ. This seems a reasonable criterion for an additive error model. In terms of function approximation, we imagine our parameterized function as a surface in p + 1 space, and what we observe are noisy realizations from it. This is easy to visualize when p = 2 and the vertical coordinate is the output y, as in Figure 2.10. The noise is in the output coordinate, so we find the set of parameters such that the fitted surface gets as close to the observed points as possible, where close is measured by the sum of squared vertical errors in RSS(θ).

FIGURE 2.10. Least squares fitting of a function of two inputs. The parameters of f_θ(x) are chosen so as to minimize the sum-of-squared vertical errors.
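A minimal sketch of fitting a linear basis expansion (2.30) by least squares (2.32). The particular basis below (1, x, x², cos x) is an arbitrary illustrative choice, not one prescribed by the text.

```python
# Fit f_theta(x) = sum_k h_k(x) theta_k by minimizing RSS(theta).
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)
y = np.sin(2 * x) + 0.1 * rng.normal(size=100)   # assumed true curve + noise

def basis(x):
    # Columns are the basis functions h_k evaluated at x.
    return np.column_stack([np.ones_like(x), x, x**2, np.cos(x)])

H = basis(x)
theta, *_ = np.linalg.lstsq(H, y, rcond=None)    # least squares solution
yhat = basis(np.array([0.5])) @ theta            # fitted value at x = 0.5
print(theta, yhat)
```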
For the linear model we get a simple closed-form solution to the minimization problem. This is also true for the basis function methods, if the basis functions themselves do not have any hidden parameters. Otherwise the solution requires either iterative methods or numerical optimization.

While least squares is generally very convenient, it is not the only criterion used and in some cases would not make much sense. A more general principle for estimation is maximum likelihood estimation. Suppose we have a random sample y_i, i = 1, ..., N from a density Pr_θ(y) indexed by some parameters θ. The log-probability of the observed sample is

    L(θ) = Σ_{i=1}^N log Pr_θ(y_i).    (2.33)

The principle of maximum likelihood assumes that the most reasonable values for θ are those for which the probability of the observed sample is largest. Least squares for the additive error model Y = f_θ(X) + ε, with ε ∼ N(0, σ²), is equivalent to maximum likelihood using the conditional likelihood

    Pr(Y|X, θ) = N(f_θ(X), σ²).    (2.34)

So although the additional assumption of normality seems more restrictive, the results are the same. The log-likelihood of the data is

    L(θ) = −(N/2) log(2π) − N log σ − (1/(2σ²)) Σ_{i=1}^N (y_i − f_θ(x_i))²,    (2.35)

and the only term involving θ is the last, which is RSS(θ) up to a scalar negative multiplier.

A more interesting example is the multinomial likelihood for the regression function Pr(G|X) for a qualitative output G. Suppose we have a model Pr(G = G_k|X = x) = p_{k,θ}(x), k = 1, ..., K for the conditional probability of each class given X, indexed by the parameter vector θ. Then the log-likelihood (also referred to as the cross-entropy) is

    L(θ) = Σ_{i=1}^N log p_{g_i,θ}(x_i),    (2.36)

and when maximized it delivers values of θ that best conform with the data in this likelihood sense.
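A hedged sketch of the multinomial log-likelihood (2.36) for a softmax-parameterized model p_{k,θ}(x). The softmax form is an illustrative assumption; the text only requires some model for Pr(G = G_k | X = x).

```python
# Cross-entropy / multinomial log-likelihood (2.36) under a softmax model.
import numpy as np

def softmax_probs(X, Theta):
    # Rows of X are inputs; columns of Theta parameterize each class.
    Z = X @ Theta
    Z -= Z.max(axis=1, keepdims=True)        # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(X, g, Theta):
    # Sum over observations of log p_{g_i, theta}(x_i); g holds class indices.
    P = softmax_probs(X, Theta)
    return np.sum(np.log(P[np.arange(len(g)), g]))

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))
g = np.array([0, 1, 2, 0, 1, 2])
Theta = rng.normal(size=(2, 3))
print(log_likelihood(X, g, Theta))           # maximize this over Theta
```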
2.7 Structured Regression Models

We have seen that although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data. This section introduces classes of such structured approaches. Before we proceed, though, we discuss further the need for such classes.

2.7.1 Difficulty of the Problem

Consider the RSS criterion for an arbitrary function f,

    RSS(f) = Σ_{i=1}^N (y_i − f(x_i))².    (2.37)

Minimizing (2.37) leads to infinitely many solutions: any function f̂ passing through the training points (x_i, y_i) is a solution. Any particular solution chosen might be a poor predictor at test points different from the training points. If there are multiple observation pairs x_i, y_{iℓ}, ℓ = 1, ..., N_i at each value of x_i, the risk is limited. In this case, the solutions pass through the average values of the y_{iℓ} at each x_i; see Exercise 2.6. The situation is similar to the one we have already visited in Section 2.4; indeed, (2.37) is the finite sample version of (2.11) on page 18. If the sample size N were sufficiently large such that repeats were guaranteed and densely arranged, it would seem that these solutions might all tend to the limiting conditional expectation.

In order to obtain useful results for finite N, we must restrict the eligible solutions to (2.37) to a smaller set of functions. How to decide on the nature of the restrictions is based on considerations outside of the data. These restrictions are sometimes encoded via the parametric representation of f_θ, or may be built into the learning method itself, either implicitly or explicitly. These restricted classes of solutions are the major topic of this book. One thing should be clear, though. Any restrictions imposed on f that lead to a unique solution to (2.37) do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.

In general the constraints imposed by most learning methods can be described as complexity restrictions of one kind or another. This usually means some kind of regular behavior in small neighborhoods of the input space. That is, for all input points x sufficiently close to each other in some metric, f̂ exhibits some special structure such as nearly constant, linear or low-order polynomial behavior. The estimator is then obtained by averaging or polynomial fitting in that neighborhood.

The strength of the constraint is dictated by the neighborhood size. The larger the size of the neighborhood, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint. For example, local constant fits in infinitesimally small neighborhoods is no constraint at all; local linear fits in very large neighborhoods is almost a globally linear model, and is very restrictive.

The nature of the constraint depends on the metric used. Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. The nearest-neighbor methods discussed so far are based on the assumption that locally the function is constant; close to a target input x_0, the function does not change much, and so close outputs can be averaged to produce f̂(x_0). Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior. In Section 5.4.1 we discuss the concept of an equivalent kernel (see Figure 5.8 on page 157), which describes this local dependence for any method linear in the outputs. These equivalent kernels in many cases look just like the explicitly defined weighting kernels discussed above—peaked at the target point and falling away smoothly away from it.

One fact should be clear by now. Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions—again the curse of dimensionality. And conversely, all methods that overcome the dimensionality problems have an associated—and often implicit or adaptive—metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

2.8 Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed. These classes are not distinct, and indeed some methods fall in several classes. Here we give a brief summary, since detailed descriptions are given in later chapters.
Each of the classes has associated with it one or more parameters, sometimes appropriately called smoothing parameters, that control the effective size of the local neighborhood. Here we describe three broad classes.

2.8.1 Roughness Penalty and Bayesian Methods

Here the class of functions is controlled by explicitly penalizing RSS(f) with a roughness penalty

    PRSS(f; λ) = RSS(f) + λ J(f).    (2.38)

The user-selected functional J(f) will be large for functions f that vary too rapidly over small regions of input space. For example, the popular cubic smoothing spline for one-dimensional inputs is the solution to the penalized least-squares criterion

    PRSS(f; λ) = Σ_{i=1}^N (y_i − f(x_i))² + λ ∫ [f''(x)]² dx.    (2.39)

The roughness penalty here controls large values of the second derivative of f, and the amount of penalty is dictated by λ ≥ 0. For λ = 0 no penalty is imposed, and any interpolating function will do, while for λ = ∞ only functions linear in x are permitted.

Penalty functionals J can be constructed for functions in any dimension, and special versions can be created to impose special structure. For example, additive penalties J(f) = Σ_{j=1}^p J(f_j) are used in conjunction with additive functions f(X) = Σ_{j=1}^p f_j(X_j) to create additive models with smooth coordinate functions. Similarly, projection pursuit regression models have f(X) = Σ_{m=1}^M g_m(α_m^T X) for adaptively chosen directions α_m, and the functions g_m can each have an associated roughness penalty.

Penalty function, or regularization methods, express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework. The penalty J corresponds to a log-prior, and PRSS(f; λ) the log-posterior distribution, and minimizing PRSS(f; λ) amounts to finding the posterior mode. We discuss roughness-penalty approaches in Chapter 5 and the Bayesian paradigm in Chapter 8.

2.8.2 Kernel Methods and Local Regression

These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally. The local neighborhood is specified by a kernel function K_λ(x_0, x) which assigns weights to points x in a region around x_0 (see Figure 6.1 on page 192). For example, the Gaussian kernel has a weight function based on the Gaussian density function

    K_λ(x_0, x) = (1/λ) exp(−||x − x_0||² / (2λ))    (2.40)

and assigns weights to points that die exponentially with their squared Euclidean distance from x_0. The parameter λ corresponds to the variance of the Gaussian density, and controls the width of the neighborhood. The simplest form of kernel estimate is the Nadaraya–Watson weighted average

    f̂(x_0) = Σ_{i=1}^N K_λ(x_0, x_i) y_i / Σ_{i=1}^N K_λ(x_0, x_i).    (2.41)

In general we can define a local regression estimate of f(x_0) as f_{θ̂}(x_0), where θ̂ minimizes

    RSS(f_θ, x_0) = Σ_{i=1}^N K_λ(x_0, x_i)(y_i − f_θ(x_i))²,    (2.42)

and f_θ is some parameterized function, such as a low-order polynomial. Some examples are:

• f_θ(x) = θ_0, the constant function; this results in the Nadaraya–Watson estimate in (2.41) above.
• f_θ(x) = θ_0 + θ_1 x gives the popular local linear regression model.

Nearest-neighbor methods can be thought of as kernel methods having a more data-dependent metric. Indeed, the metric for k-nearest neighbors is

    K_k(x, x_0) = I(||x − x_0|| ≤ ||x_{(k)} − x_0||),

where x_{(k)} is the training observation ranked kth in distance from x_0, and I(S) is the indicator of the set S.
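A minimal sketch of the Nadaraya–Watson weighted average (2.41) with the Gaussian kernel (2.40), for one-dimensional inputs; the data are simulated purely for illustration.

```python
# Nadaraya-Watson kernel estimate of f(x0) with a Gaussian kernel.
import numpy as np

def gaussian_kernel(x0, x, lam):
    # K_lambda(x0, x) = (1/lambda) exp(-||x - x0||^2 / (2 lambda)), eq (2.40)
    return np.exp(-(x - x0) ** 2 / (2.0 * lam)) / lam

def nadaraya_watson(x0, x, y, lam=0.1):
    w = gaussian_kernel(x0, x, lam)          # weights die off with distance
    return np.sum(w * y) / np.sum(w)         # locally weighted average, eq (2.41)

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=200)
y = np.sin(4 * x) + 0.3 * rng.normal(size=200)
print(nadaraya_watson(0.5, x, y))            # estimate of f(0.5)
```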
These methods of course need to be modified in high dimensions, to avoid the curse of dimensionality. Various adaptations are discussed in Chapter 6.

2.8.3 Basis Functions and Dictionary Methods

This class of methods includes the familiar linear and polynomial expansions, but more importantly a wide variety of more flexible models. The model for f is a linear expansion of basis functions

    f_θ(x) = Σ_{m=1}^M θ_m h_m(x),    (2.43)

where each of the h_m is a function of the input x, and the term linear here refers to the action of the parameters θ. This class covers a wide variety of methods. In some cases the sequence of basis functions is prescribed, such as a basis for polynomials in x of total degree M.

For one-dimensional x, polynomial splines of degree K can be represented by an appropriate sequence of M spline basis functions, determined in turn by M − K knots. These produce functions that are piecewise polynomials of degree K between the knots, and joined up with continuity of degree K − 1 at the knots. As an example consider linear splines, or piecewise linear functions. One intuitively satisfying basis consists of the functions b_1(x) = 1, b_2(x) = x, and b_{m+2}(x) = (x − t_m)_+, m = 1, ..., M − 2, where t_m is the mth knot, and z_+ denotes positive part. Tensor products of spline bases can be used for inputs with dimensions larger than one (see Section 5.2, and the CART and MARS models in Chapter 9). The parameter θ can be the total degree of the polynomial or the number of knots in the case of splines.

Radial basis functions are symmetric p-dimensional kernels located at particular centroids,

    f_θ(x) = Σ_{m=1}^M K_{λ_m}(μ_m, x) θ_m;    (2.44)

for example, the Gaussian kernel K_λ(μ, x) = e^{−||x−μ||²/2λ} is popular. Radial basis functions have centroids μ_m and scales λ_m that have to be determined. The spline basis functions have knots. In general we would like the data to dictate them as well. Including these as parameters changes the regression problem from a straightforward linear problem to a combinatorially hard nonlinear problem. In practice, shortcuts such as greedy algorithms or two stage processes are used. Section 6.7 describes some such approaches.

A single-layer feed-forward neural network model with linear output weights can be thought of as an adaptive basis function method. The model has the form

    f_θ(x) = Σ_{m=1}^M β_m σ(α_m^T x + b_m),    (2.45)

where σ(x) = 1/(1 + e^{−x}) is known as the activation function. Here, as in the projection pursuit model, the directions α_m and the bias terms b_m have to be determined, and their estimation is the meat of the computation. Details are given in Chapter 11.

These adaptively chosen basis function methods are also known as dictionary methods, where one has available a possibly infinite set or dictionary D of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.
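A minimal sketch of the linear (piecewise linear) spline basis described above, b_1(x) = 1, b_2(x) = x, b_{m+2}(x) = (x − t_m)_+, fit by least squares. The knot locations here are an illustrative assumption.

```python
# Piecewise-linear spline basis with fixed knots, fit by least squares.
import numpy as np

def linear_spline_basis(x, knots):
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - t, 0.0) for t in knots]   # positive parts (x - t_m)_+
    return np.column_stack(cols)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, size=150))
y = np.abs(x - 0.4) + 0.05 * rng.normal(size=150)     # kinked true function

B = linear_spline_basis(x, knots=[0.25, 0.5, 0.75])
theta, *_ = np.linalg.lstsq(B, y, rcond=None)
fhat = B @ theta                                      # fitted piecewise-linear curve
print(theta)
```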
2.9 Model Selection and the Bias–Variance Tradeoff

All the models described above and many others discussed in later chapters have a smoothing or complexity parameter that has to be determined:

• the multiplier of the penalty term;
• the width of the kernel;
• or the number of basis functions.

In the case of the smoothing spline, the parameter λ indexes models ranging from a straight line fit to the interpolating model. Similarly a local degree-m polynomial model ranges between a degree-m global polynomial when the window size is infinitely large, to an interpolating fit when the window size shrinks to zero. This means that we cannot use residual sum-of-squares on the training data to determine these parameters as well, since we would always pick those that gave interpolating fits and hence zero residuals. Such a model is unlikely to predict future data well at all.

The k-nearest-neighbor regression fit f̂_k(x_0) usefully illustrates the competing forces that affect the predictive ability of such approximations. Suppose the data arise from a model Y = f(X) + ε, with E(ε) = 0 and Var(ε) = σ². For simplicity here we assume that the values of x_i in the sample are fixed in advance (nonrandom). The expected prediction error at x_0, also known as test or generalization error, can be decomposed:

    EPE_k(x_0) = E[(Y − f̂_k(x_0))² | X = x_0]
               = σ² + [Bias²(f̂_k(x_0)) + Var_T(f̂_k(x_0))]    (2.46)
               = σ² + [f(x_0) − (1/k) Σ_{ℓ=1}^k f(x_{(ℓ)})]² + σ²/k.    (2.47)

The subscripts in parentheses (ℓ) indicate the sequence of nearest neighbors to x_0.

There are three terms in this expression. The first term σ² is the irreducible error—the variance of the new test target—and is beyond our control, even if we know the true f(x_0). The second and third terms are under our control, and make up the mean squared error of f̂_k(x_0) in estimating f(x_0), which is broken down into a bias component and a variance component. The bias term is the squared difference between the true mean f(x_0) and the expected value of the estimate—[E_T(f̂_k(x_0)) − f(x_0)]²—where the expectation averages the randomness in the training data. This term will most likely increase with k, if the true function is reasonably smooth. For small k the few closest neighbors will have values f(x_{(ℓ)}) close to f(x_0), so their average should be close to f(x_0). As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) Σ_i (y_i − ŷ_i)². Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

FIGURE 2.11. Test and training error as a function of model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f̂(x_0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization.
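A hedged numerical check of the decomposition (2.47), EPE_k(x_0) = σ² + bias² + σ²/k, for a simple one-dimensional example with a fixed (nonrandom) design, as assumed in the text. The particular f and σ below are illustrative choices.

```python
# Compare the theoretical k-NN decomposition (2.47) with a Monte Carlo EPE.
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(4 * x)
sigma, k, x0 = 0.5, 5, 0.5
x = np.linspace(0, 1, 100)                       # fixed design points

# Theoretical terms from (2.47).
nn = np.argsort(np.abs(x - x0))[:k]              # k nearest neighbors of x0
bias2 = (f(x0) - f(x[nn]).mean()) ** 2
print(sigma**2 + bias2 + sigma**2 / k)           # theoretical EPE_k(x0)

# Monte Carlo over training responses and a new test target.
err = [((f(x0) + sigma * rng.normal())           # new Y at x0
        - (f(x[nn]) + sigma * rng.normal(size=k)).mean()) ** 2
       for _ in range(20000)]
print(np.mean(err))                              # should be close to the theory
```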
In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

Bibliographic Notes

Some good general books on the learning problem are Duda et al. (2000), Bishop (1995), Bishop (2006), Ripley (1996), Cherkassky and Mulier (2007) and Vapnik (1996). Parts of this chapter are based on Friedman (1994b).

Exercises

Ex. 2.1 Suppose each of K-classes has an associated target t_k, which is a vector of all zeros, except a one in the kth position. Show that classifying to the largest element of ŷ amounts to choosing the closest target, min_k ||t_k − ŷ||, if the elements of ŷ sum to one.

Ex. 2.2 Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5.

Ex. 2.3 Derive equation (2.24).

Ex. 2.4 The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains. Consider inputs drawn from a spherical multinormal distribution X ∼ N(0, I_p). The squared distance from any sample point to the origin has a χ²_p distribution with mean p. Consider a prediction point x_0 drawn from this distribution, and let a = x_0/||x_0|| be an associated unit vector. Let z_i = a^T x_i be the projection of each of the training points on this direction. Show that the z_i are distributed N(0, 1) with expected squared distance from the origin 1, while the target point has expected squared distance p from the origin. Hence for p = 10, a randomly drawn test point is about 3.1 standard deviations from the origin, while all the training points are on average one standard deviation along direction a. So most prediction points see themselves as lying on the edge of the training set.

Ex. 2.5 (a) Derive equation (2.27). The last line makes use of (3.8) through a conditioning argument. (b) Derive equation (2.28), making use of the cyclic property of the trace operator [trace(AB) = trace(BA)], and its linearity (which allows us to interchange the order of trace and expectation).

Ex. 2.6 Consider a regression problem with inputs x_i and outputs y_i, and a parameterized model f_θ(x) to be fit by least squares. Show that if there are observations with tied or identical values of x, then the fit can be obtained from a reduced weighted least squares problem.

Ex. 2.7 Suppose we have a sample of N pairs x_i, y_i drawn i.i.d. from the distribution characterized as follows:

    x_i ∼ h(x), the design density
    y_i = f(x_i) + ε_i, f is the regression function
    ε_i ∼ (0, σ²) (mean zero, variance σ²)

We construct an estimator for f linear in the y_i,

    f̂(x_0) = Σ_{i=1}^N ℓ_i(x_0; X) y_i,

where the weights ℓ_i(x_0; X) do not depend on the y_i, but do depend on the entire training sequence of x_i, denoted here by X.

(a) Show that linear regression and k-nearest-neighbor regression are members of this class of estimators. Describe explicitly the weights ℓ_i(x_0; X) in each of these cases.
(b) Decompose the conditional mean-squared error

    E_{Y|X}(f(x_0) − f̂(x_0))²

into a conditional squared bias and a conditional variance component. Like X, Y represents the entire training sequence of y_i.
(c) Decompose the (unconditional) mean-squared error

    E_{Y,X}(f(x_0) − f̂(x_0))²

into a squared bias and a variance component.
(d) Establish a relationship between the squared biases and variances in the above two cases.

Ex. 2.8 Compare the classification performance of linear regression and k-nearest neighbor classification on the zipcode data.
In particular, consider only the 2's and 3's, and k = 1, 3, 5, 7 and 15. Show both the training and test error for each choice. The zipcode data are available from the book website www-stat.stanford.edu/ElemStatLearn.

Ex. 2.9 Consider a linear regression model with p parameters, fit by least squares to a set of training data (x_1, y_1), ..., (x_N, y_N) drawn at random from a population. Let β̂ be the least squares estimate. Suppose we have some test data (x̃_1, ỹ_1), ..., (x̃_M, ỹ_M) drawn at random from the same population as the training data. If

    R_tr(β) = (1/N) Σ_{i=1}^N (y_i − β^T x_i)²  and  R_te(β) = (1/M) Σ_{i=1}^M (ỹ_i − β^T x̃_i)²,

prove that

    E[R_tr(β̂)] ≤ E[R_te(β̂)],

where the expectations are over all that is random in each expression. [This exercise was brought to our attention by Ryan Tibshirani, from a homework assignment given by Andrew Ng.]

3 Linear Methods for Regression

3.1 Introduction

A linear regression model assumes that the regression function E(Y|X) is linear in the inputs X_1, ..., X_p. Linear models were largely developed in the precomputer age of statistics, but even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output. For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5.

In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification. On some topics we go into considerable detail, as it is our firm belief that an understanding of linear methods is essential for understanding nonlinear ones. In fact, many nonlinear techniques are direct generalizations of the linear methods discussed here.

3.2 Linear Regression Models and Least Squares

As introduced in Chapter 2, we have an input vector X^T = (X_1, X_2, ..., X_p), and want to predict a real-valued output Y. The linear regression model has the form

    f(X) = β_0 + Σ_{j=1}^p X_j β_j.    (3.1)

The linear model either assumes that the regression function E(Y|X) is linear, or that the linear model is a reasonable approximation. Here the β_j's are unknown parameters or coefficients, and the variables X_j can come from different sources:

• quantitative inputs;
• transformations of quantitative inputs, such as log, square-root or square;
• basis expansions, such as X_2 = X_1², X_3 = X_1³, leading to a polynomial representation;
• numeric or "dummy" coding of the levels of qualitative inputs. For example, if G is a five-level factor input, we might create X_j, j = 1, ..., 5, such that X_j = I(G = j). Together this group of X_j represents the effect of G by a set of level-dependent constants, since in Σ_{j=1}^5 X_j β_j, one of the X_j's is one, and the others are zero.
• interactions between variables, for example, X_3 = X_1 · X_2.

No matter the source of the X_j, the model is linear in the parameters.

Typically we have a set of training data (x_1, y_1), ..., (x_N, y_N) from which to estimate the parameters β. Each x_i = (x_i1, x_i2, ..., x_ip)^T is a vector of feature measurements for the ith case.
The most popular estimation method is least squares, in which we pick the coefficients β = (β_0, β_1, ..., β_p)^T to minimize the residual sum of squares

    RSS(β) = Σ_{i=1}^N (y_i − f(x_i))² = Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_ij β_j)².    (3.2)

From a statistical point of view, this criterion is reasonable if the training observations (x_i, y_i) represent independent random draws from their population. Even if the x_i's were not drawn randomly, the criterion is still valid if the y_i's are conditionally independent given the inputs x_i. Figure 3.1 illustrates the geometry of least-squares fitting in the IR^{p+1}-dimensional space occupied by the pairs (X, Y). Note that (3.2) makes no assumptions about the validity of model (3.1); it simply finds the best linear fit to the data. Least squares fitting is intuitively satisfying no matter how the data arise; the criterion measures the average lack of fit.

FIGURE 3.1. Linear least squares fitting with X ∈ IR². We seek the linear function of X that minimizes the sum of squared residuals from Y.

How do we minimize (3.2)? Denote by X the N × (p + 1) matrix with each row an input vector (with a 1 in the first position), and similarly let y be the N-vector of outputs in the training set. Then we can write the residual sum-of-squares as

    RSS(β) = (y − Xβ)^T (y − Xβ).    (3.3)

This is a quadratic function in the p + 1 parameters. Differentiating with respect to β we obtain

    ∂RSS/∂β = −2 X^T (y − Xβ),
    ∂²RSS/∂β ∂β^T = 2 X^T X.    (3.4)

Assuming (for the moment) that X has full column rank, and hence X^T X is positive definite, we set the first derivative to zero,

    X^T (y − Xβ) = 0,    (3.5)

to obtain the unique solution

    β̂ = (X^T X)^{−1} X^T y.    (3.6)

The predicted values at an input vector x_0 are given by f̂(x_0) = (1 : x_0)^T β̂; the fitted values at the training inputs are

    ŷ = X β̂ = X (X^T X)^{−1} X^T y,    (3.7)

where ŷ_i = f̂(x_i). The matrix H = X(X^T X)^{−1} X^T appearing in equation (3.7) is sometimes called the "hat" matrix because it puts the hat on y.

FIGURE 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x_1 and x_2. The projection ŷ represents the vector of the least squares predictions.

Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in IR^N. We denote the column vectors of X by x_0, x_1, ..., x_p, with x_0 ≡ 1. For much of what follows, this first column is treated like any other. These vectors span a subspace of IR^N, also referred to as the column space of X. We minimize RSS(β) = ||y − Xβ||² by choosing β̂ so that the residual vector y − ŷ is orthogonal to this subspace. This orthogonality is expressed in (3.5), and the resulting estimate ŷ is hence the orthogonal projection of y onto this subspace. The hat matrix H computes the orthogonal projection, and hence it is also known as a projection matrix.
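A minimal numerical sketch of (3.6) and (3.7) on simulated data. Solving the normal equations mirrors the formulas directly; in practice a QR-based solver such as np.linalg.lstsq is numerically preferable.

```python
# Least squares via the normal equations, plus the hat matrix of (3.7).
import numpy as np

rng = np.random.default_rng(7)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # 1 in first column
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + 0.3 * rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y, eq (3.6)
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix H
y_hat = H @ y                                  # fitted values, = X @ beta_hat

# The residual is orthogonal to the column space of X, as in (3.5).
print(np.allclose(X.T @ (y - y_hat), 0.0, atol=1e-8))
```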
It might happen that the columns of X are not linearly independent, so that X is not of full rank. This would occur, for example, if two of the inputs were perfectly correlated (e.g., x_2 = 3x_1). Then X^T X is singular and the least squares coefficients β̂ are not uniquely defined. However, the fitted values ŷ = Xβ̂ are still the projection of y onto the column space of X; there is just more than one way to express that projection in terms of the column vectors of X. The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in X. Most regression software packages detect these redundancies and automatically implement some strategy for removing them. Rank deficiencies can also occur in signal and image analysis, where the number of inputs p can exceed the number of training cases N. In this case, the features are typically reduced by filtering or else the fitting is controlled by regularization (Section 5.2.3 and Chapter 18).

Up to now we have made minimal assumptions about the true distribution of the data. In order to pin down the sampling properties of β̂, we now assume that the observations y_i are uncorrelated and have constant variance σ², and that the x_i are fixed (non random). The variance–covariance matrix of the least squares parameter estimates is easily derived from (3.6) and is given by

    Var(β̂) = (X^T X)^{−1} σ².    (3.8)

Typically one estimates the variance σ² by

    σ̂² = (1/(N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)².

The N − p − 1 rather than N in the denominator makes σ̂² an unbiased estimate of σ²: E(σ̂²) = σ².

To draw inferences about the parameters and the model, additional assumptions are needed. We now assume that (3.1) is the correct model for the mean; that is, the conditional expectation of Y is linear in X_1, ..., X_p. We also assume that the deviations of Y around its expectation are additive and Gaussian. Hence

    Y = E(Y|X_1, ..., X_p) + ε = β_0 + Σ_{j=1}^p X_j β_j + ε,    (3.9)

where the error ε is a Gaussian random variable with expectation zero and variance σ², written ε ∼ N(0, σ²). Under (3.9), it is easy to show that

    β̂ ∼ N(β, (X^T X)^{−1} σ²).    (3.10)

This is a multivariate normal distribution with mean vector and variance–covariance matrix as shown. Also

    (N − p − 1) σ̂² ∼ σ² χ²_{N−p−1},    (3.11)

a chi-squared distribution with N − p − 1 degrees of freedom. In addition β̂ and σ̂² are statistically independent. We use these distributional properties to form tests of hypothesis and confidence intervals for the parameters β_j.

To test the hypothesis that a particular coefficient β_j = 0, we form the standardized coefficient or Z-score

    z_j = β̂_j / (σ̂ √v_j),    (3.12)

where v_j is the jth diagonal element of (X^T X)^{−1}. Under the null hypothesis that β_j = 0, z_j is distributed as t_{N−p−1} (a t distribution with N − p − 1 degrees of freedom), and hence a large (absolute) value of z_j will lead to rejection of this null hypothesis. If σ̂ is replaced by a known value σ, then z_j would have a standard normal distribution. The difference between the tail quantiles of a t-distribution and a standard normal become negligible as the sample size increases, and so we typically use the normal quantiles (see Figure 3.3).

FIGURE 3.3. The tail probabilities Pr(|Z| > z) for three distributions, t_30, t_100 and standard normal. Shown are the appropriate quantiles for testing significance at the p = 0.05 and 0.01 levels. The difference between t and the standard normal becomes negligible for N bigger than about 100.
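A hedged sketch computing the standard errors and Z-scores (3.12) from (3.8), on the same kind of simulated design used above.

```python
# Standard errors and Z-scores for a least squares fit.
import numpy as np

rng = np.random.default_rng(8)
N, p = 100, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([0.5, 1.0, 0.0, -2.0, 0.0]) + rng.normal(size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)          # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))               # v_j in (3.12)
z = beta_hat / np.sqrt(sigma2_hat * v)            # Z-scores

for j, zj in enumerate(z):
    print(f"beta_{j}: {beta_hat[j]: .3f}  z = {zj: .2f}")
```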
Often we need to test for the significance of groups of coefficients simultaneously. For example, to test if a categorical variable with k levels can be excluded from a model, we need to test whether the coefficients of the dummy variables used to represent the levels can all be set to zero. Here we use the F statistic,

    F = [(RSS_0 − RSS_1)/(p_1 − p_0)] / [RSS_1/(N − p_1 − 1)],    (3.13)

where RSS_1 is the residual sum-of-squares for the least squares fit of the bigger model with p_1 + 1 parameters, and RSS_0 the same for the nested smaller model with p_0 + 1 parameters, having p_1 − p_0 parameters constrained to be zero. The F statistic measures the change in residual sum-of-squares per additional parameter in the bigger model, and it is normalized by an estimate of σ². Under the Gaussian assumptions, and the null hypothesis that the smaller model is correct, the F statistic will have a F_{p_1−p_0, N−p_1−1} distribution. It can be shown (Exercise 3.1) that the z_j in (3.12) are equivalent to the F statistic for dropping the single coefficient β_j from the model. For large N, the quantiles of the F_{p_1−p_0, N−p_1−1} approach those of the χ²_{p_1−p_0}.

Similarly, we can isolate β_j in (3.10) to obtain a 1 − 2α confidence interval for β_j:

    (β̂_j − z^{(1−α)} v_j^{1/2} σ̂,  β̂_j + z^{(1−α)} v_j^{1/2} σ̂).    (3.14)

Here z^{(1−α)} is the 1 − α percentile of the normal distribution: z^{(1−0.025)} = 1.96, z^{(1−0.05)} = 1.645, etc. Hence the standard practice of reporting β̂ ± 2 · se(β̂) amounts to an approximate 95% confidence interval. Even if the Gaussian error assumption does not hold, this interval will be approximately correct, with its coverage approaching 1 − 2α as the sample size N → ∞.

In a similar fashion we can obtain an approximate confidence set for the entire parameter vector β, namely

    C_β = {β | (β̂ − β)^T X^T X (β̂ − β) ≤ σ̂² χ²_{p+1}^{(1−α)}},    (3.15)

where χ²_ℓ^{(1−α)} is the 1 − α percentile of the chi-squared distribution on ℓ degrees of freedom: for example, χ²_5^{(1−0.05)} = 11.1, χ²_5^{(1−0.1)} = 9.2. This confidence set for β generates a corresponding confidence set for the true function f(x) = x^T β, namely {x^T β | β ∈ C_β} (Exercise 3.2; see also Figure 5.4 in Section 5.2.2 for examples of confidence bands for functions).

3.2.1 Example: Prostate Cancer

The data for this example come from a study by Stamey et al. (1989). They examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). The correlation matrix of the predictors given in Table 3.1 shows many strong correlations.

TABLE 3.1. Correlations of predictors in the prostate cancer data.

              lcavol  lweight    age    lbph     svi     lcp  gleason
    lweight    0.300
    age        0.286    0.317
    lbph       0.063    0.437  0.287
    svi        0.593    0.181  0.129  −0.139
    lcp        0.692    0.157  0.173  −0.089   0.671
    gleason    0.426    0.024  0.366   0.033   0.307   0.476
    pgg45      0.483    0.074  0.276  −0.030   0.481   0.663    0.757
Figure 1.1 (page 3) of Chapter 1 is a scatterplot matrix showing every pairwise plot between the variables. We see that svi is a binary variable, and gleason is an ordered categorical variable. We see, for example, that both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2. The Z-scores are defined in (3.12), and measure the effect of dropping that variable from the model. A Z-score greater than 2 in absolute value is approximately significant at the 5% level. (For our example, we have nine parameters, and the 0.025 tail quantiles of the t_{67−9} distribution are ±2.002!)

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

    Term        Coefficient   Std. Error   Z Score
    Intercept      2.46          0.09       27.60
    lcavol         0.68          0.13        5.37
    lweight        0.26          0.10        2.75
    age           −0.14          0.10       −1.40
    lbph           0.21          0.10        2.06
    svi            0.31          0.12        2.47
    lcp           −0.29          0.15       −1.87
    gleason       −0.02          0.15       −0.15
    pgg45          0.27          0.15        1.74

The predictor lcavol shows the strongest effect, with lweight and svi also strong. Notice that lcp is not significant, once lcavol is in the model (when used in a model without lcavol, lcp is strongly significant). We can also test for the exclusion of a number of terms at once, using the F-statistic (3.13). For example, we consider dropping all the non-significant terms in Table 3.2, namely age, lcp, gleason, and pgg45. We get

    F = [(32.81 − 29.43)/(9 − 5)] / [29.43/(67 − 9)] = 1.67,    (3.16)

which has a p-value of 0.17 (Pr(F_{4,58} > 1.67) = 0.17), and hence is not significant.

The mean prediction error on the test data is 0.521. In contrast, prediction using the mean training value of lpsa has a test error of 1.057, which is called the "base error rate." Hence the linear model reduces the base error rate by about 50%. We will return to this example later to compare various selection and shrinkage methods.
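A small sketch reproducing the F statistic (3.16) and its p-value from the quantities quoted in the text.

```python
# F test (3.13) for dropping age, lcp, gleason and pgg45 from the model.
from scipy.stats import f as f_dist

RSS1, RSS0 = 29.43, 32.81           # bigger (p1 = 8) and smaller (p0 = 4) models
N, p1, p0 = 67, 8, 4

F = ((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (N - p1 - 1))
pval = f_dist.sf(F, p1 - p0, N - p1 - 1)    # Pr(F_{4,58} > F)
print(round(F, 2), round(pval, 2))          # 1.67, 0.17
```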
3.2.2 The Gauss–Markov Theorem

One of the most famous results in statistics asserts that the least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates. We will make this precise here, and also make clear that the restriction to unbiased estimates is not necessarily a wise one. This observation will lead us to consider biased estimates such as ridge regression later in the chapter. We focus on estimation of any linear combination of the parameters θ = a^T β; for example, predictions f(x_0) = x_0^T β are of this form. The least squares estimate of a^T β is

    θ̂ = a^T β̂ = a^T (X^T X)^{−1} X^T y.    (3.17)

Considering X to be fixed, this is a linear function c_0^T y of the response vector y. If we assume that the linear model is correct, a^T β̂ is unbiased since

    E(a^T β̂) = E(a^T (X^T X)^{−1} X^T y) = a^T (X^T X)^{−1} X^T X β = a^T β.    (3.18)

The Gauss–Markov theorem states that if we have any other linear estimator θ̃ = c^T y that is unbiased for a^T β, that is, E(c^T y) = a^T β, then

    Var(a^T β̂) ≤ Var(c^T y).    (3.19)

The proof (Exercise 3.3) uses the triangle inequality. For simplicity we have stated the result in terms of estimation of a single parameter a^T β, but with a few more definitions one can state it in terms of the entire parameter vector β (Exercise 3.3).

Consider the mean squared error of an estimator θ̃ in estimating θ:

    MSE(θ̃) = E(θ̃ − θ)² = Var(θ̃) + [E(θ̃) − θ]².    (3.20)

The first term is the variance, while the second term is the squared bias. The Gauss–Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may well exist a biased estimator with smaller mean squared error. Such an estimator would trade a little bias for a larger reduction in variance. Biased estimates are commonly used. Any method that shrinks or sets to zero some of the least squares coefficients may result in a biased estimate. We discuss many examples, including variable subset selection and ridge regression, later in this chapter. From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance. We go into these issues in more detail in Chapter 7.

Mean squared error is intimately related to prediction accuracy, as discussed in Chapter 2. Consider the prediction of the new response at input x_0,

    Y_0 = f(x_0) + ε_0.    (3.21)

Then the expected prediction error of an estimate f̃(x_0) = x_0^T β̃ is

    E(Y_0 − f̃(x_0))² = σ² + E(x_0^T β̃ − f(x_0))² = σ² + MSE(f̃(x_0)).    (3.22)

Therefore, expected prediction error and mean squared error differ only by the constant σ², representing the variance of the new observation y_0.

3.2.3 Multiple Regression from Simple Univariate Regression

The linear model (3.1) with p > 1 inputs is called the multiple linear regression model. The least squares estimates (3.6) for this model are best understood in terms of the estimates for the univariate (p = 1) linear model, as we indicate in this section.

Suppose first that we have a univariate model with no intercept, that is,

    Y = Xβ + ε.    (3.23)

The least squares estimate and residuals are

    β̂ = Σ_{i=1}^N x_i y_i / Σ_{i=1}^N x_i²,    r_i = y_i − x_i β̂.    (3.24)

In convenient vector notation, we let y = (y_1, ..., y_N)^T, x = (x_1, ..., x_N)^T and define

    ⟨x, y⟩ = Σ_{i=1}^N x_i y_i = x^T y,    (3.25)

the inner product between x and y¹. Then we can write

    β̂ = ⟨x, y⟩ / ⟨x, x⟩,    r = y − x β̂.    (3.26)

As we will see, this simple univariate regression provides the building block for multiple linear regression. Suppose next that the inputs x_1, x_2, ..., x_p (the columns of the data matrix X) are orthogonal; that is ⟨x_j, x_k⟩ = 0 for all j ≠ k. Then it is easy to check that the multiple least squares estimates β̂_j are equal to ⟨x_j, y⟩/⟨x_j, x_j⟩—the univariate estimates. In other words, when the inputs are orthogonal, they have no effect on each other's parameter estimates in the model.

Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data. Hence we will have to orthogonalize them in order to carry this idea further. Suppose next that we have an intercept and a single input x. Then the least squares coefficient of x has the form

    β̂_1 = ⟨x − x̄1, y⟩ / ⟨x − x̄1, x − x̄1⟩,    (3.27)

where x̄ = Σ_i x_i / N, and 1 = x_0, the vector of N ones. We can view the estimate (3.27) as the result of two applications of the simple regression (3.26). The steps are:

1. regress x on 1 to produce the residual z = x − x̄1;
2. regress y on the residual z to give the coefficient β̂_1.

In this procedure, "regress b on a" means a simple univariate regression of b on a with no intercept, producing coefficient γ̂ = ⟨a, b⟩/⟨a, a⟩ and residual vector b − γ̂a. We say that b is adjusted for a, or is "orthogonalized" with respect to a.

¹The inner-product notation is suggestive of generalizations of linear regression to different metric spaces, as well as to probability spaces.
Step 1 orthogonalizes x with respect to x_0 = 1. Step 2 is just a simple univariate regression, using the orthogonal predictors 1 and z. Figure 3.4 shows this process for two general inputs x_1 and x_2. The orthogonalization does not change the subspace spanned by x_1 and x_2, it simply produces an orthogonal basis for representing it.

FIGURE 3.4. Least squares regression by orthogonalization of the inputs. The vector x_2 is regressed on the vector x_1, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x_2. Adding together the projections of y on each of x_1 and z gives the least squares fit ŷ.

This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1. Note that the inputs z_0, ..., z_{j−1} in step 2 are orthogonal, hence the simple regression coefficients computed there are in fact also the multiple regression coefficients.

Algorithm 3.1 Regression by Successive Orthogonalization.
1. Initialize z_0 = x_0 = 1.
2. For j = 1, 2, ..., p:
   Regress x_j on z_0, z_1, ..., z_{j−1} to produce coefficients γ̂_{ℓj} = ⟨z_ℓ, x_j⟩/⟨z_ℓ, z_ℓ⟩, ℓ = 0, ..., j − 1, and residual vector z_j = x_j − Σ_{ℓ=0}^{j−1} γ̂_{ℓj} z_ℓ.
3. Regress y on the residual z_p to give the estimate β̂_p.

The result of this algorithm is

    β̂_p = ⟨z_p, y⟩ / ⟨z_p, z_p⟩.    (3.28)

Re-arranging the residual in step 2, we can see that each of the x_j is a linear combination of the z_k, k ≤ j. Since the z_j are all orthogonal, they form a basis for the column space of X, and hence the least squares projection onto this subspace is ŷ. Since z_p alone involves x_p (with coefficient 1), we see that the coefficient (3.28) is indeed the multiple regression coefficient of y on x_p. This key result exposes the effect of correlated inputs in multiple regression. Note also that by rearranging the x_j, any one of them could be in the last position, and a similar result holds. Hence stated more generally, we have shown that the jth multiple regression coefficient is the univariate regression coefficient of y on x_{j·012...(j−1)(j+1)...p}, the residual after regressing x_j on x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_p:

The multiple regression coefficient β̂_j represents the additional contribution of x_j on y, after x_j has been adjusted for x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_p.

If x_p is highly correlated with some of the other x_k's, the residual vector z_p will be close to zero, and from (3.28) the coefficient β̂_p will be very unstable. This will be true for all the variables in the correlated set. In such situations, we might have all the Z-scores (as in Table 3.2) be small—any one of the set can be deleted—yet we cannot delete them all. From (3.28) we also obtain an alternate formula for the variance estimates (3.8),

    Var(β̂_p) = σ² / ⟨z_p, z_p⟩ = σ² / ||z_p||².    (3.29)

In other words, the precision with which we can estimate β̂_p depends on the length of the residual vector z_p; this represents how much of x_p is unexplained by the other x_k's.

Algorithm 3.1 is known as the Gram–Schmidt procedure for multiple regression, and is also a useful numerical strategy for computing the estimates. We can obtain from it not just β̂_p, but also the entire multiple least squares fit, as shown in Exercise 3.4.
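A hedged sketch of Algorithm 3.1, returning the last coefficient β̂_p as in (3.28); its agreement with a full least squares fit is checked on simulated data.

```python
# Regression by successive orthogonalization (Algorithm 3.1).
import numpy as np

def successive_orthogonalization(X, y):
    # X: N x (p+1) with a leading column of ones (z_0 = x_0 = 1).
    N, p1 = X.shape
    Z = np.empty_like(X, dtype=float)
    Z[:, 0] = X[:, 0]
    for j in range(1, p1):
        zj = X[:, j].astype(float)
        for l in range(j):
            gamma = (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l])
            zj = zj - gamma * Z[:, l]        # residual after regressing on z_l
        Z[:, j] = zj
    zp = Z[:, -1]
    return (zp @ y) / (zp @ zp)              # beta_p, equation (3.28)

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=100)
print(successive_orthogonalization(X, y))        # close to 2.0
print(np.linalg.lstsq(X, y, rcond=None)[0][-1])  # agrees with the full LS fit
```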
We can represent step 2 of Algorithm 3.1 in matrix form:

X = Z\Gamma,   (3.30)

where Z has as columns the z_j (in order), and \Gamma is the upper triangular matrix with entries \hat\gamma_{kj}. Introducing the diagonal matrix D with jth diagonal entry D_{jj} = \|z_j\|, we get

X = ZD^{-1}D\Gamma = QR,   (3.31)

the so-called QR decomposition of X. Here Q is an N × (p + 1) orthogonal matrix, Q^T Q = I, and R is a (p + 1) × (p + 1) upper triangular matrix.

The QR decomposition represents a convenient orthogonal basis for the column space of X. It is easy to see, for example, that the least squares solution is given by

\hat\beta = R^{-1}Q^T y,   (3.32)
\hat y = QQ^T y.   (3.33)

Equation (3.32) is easy to solve because R is upper triangular (Exercise 3.4).
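A minimal numpy sketch of the QR solution (3.32)–(3.33). Note that numpy's QR routine may flip the signs of some columns of Q (and rows of R) relative to the Gram–Schmidt construction above, but the product QR and the solution are unchanged.

```python
import numpy as np

def least_squares_qr(X, y):
    """Solve least squares via the QR decomposition (3.31)-(3.32).
    X is N x (p+1), including the constant column."""
    Q, R = np.linalg.qr(X)                 # thin QR: Q is N x (p+1)
    beta = np.linalg.solve(R, Q.T @ y)     # easy because R is upper triangular
    y_hat = Q @ (Q.T @ y)                  # fitted values (3.33)
    return beta, y_hat
```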
3.2.4 Multiple Outputs

Suppose we have multiple outputs Y_1, Y_2, ..., Y_K that we wish to predict from our inputs X_0, X_1, X_2, ..., X_p. We assume a linear model for each output

Y_k = \beta_{0k} + \sum_{j=1}^p X_j \beta_{jk} + \varepsilon_k   (3.34)
    = f_k(X) + \varepsilon_k.   (3.35)

With N training cases we can write the model in matrix notation

Y = XB + E.   (3.36)

Here Y is the N × K response matrix, with ik entry y_{ik}, X is the N × (p + 1) input matrix, B is the (p + 1) × K matrix of parameters and E is the N × K matrix of errors. A straightforward generalization of the univariate loss function (3.2) is

\mathrm{RSS}(B) = \sum_{k=1}^K \sum_{i=1}^N (y_{ik} - f_k(x_i))^2   (3.37)
              = \mathrm{tr}[(Y - XB)^T (Y - XB)].   (3.38)

The least squares estimates have exactly the same form as before

\hat B = (X^T X)^{-1} X^T Y.   (3.39)

Hence the coefficients for the kth outcome are just the least squares estimates in the regression of y_k on x_0, x_1, ..., x_p. Multiple outputs do not affect one another's least squares estimates.

If the errors \varepsilon = (\varepsilon_1, ..., \varepsilon_K) in (3.34) are correlated, then it might seem appropriate to modify (3.37) in favor of a multivariate version. Specifically, suppose Cov(\varepsilon) = \Sigma; then the multivariate weighted criterion

\mathrm{RSS}(B; \Sigma) = \sum_{i=1}^N (y_i - f(x_i))^T \Sigma^{-1} (y_i - f(x_i))   (3.40)

arises naturally from multivariate Gaussian theory. Here f(x) is the vector function (f_1(x), ..., f_K(x)), and y_i the vector of K responses for observation i. However, it can be shown that again the solution is given by (3.39); K separate regressions that ignore the correlations (Exercise 3.11). If the \Sigma_i vary among observations, then this is no longer the case, and the solution for B no longer decouples.

In Section 3.7 we pursue the multiple outcome problem, and consider situations where it does pay to combine the regressions.

3.3 Subset Selection

There are two reasons why we are often not satisfied with the least squares estimates (3.6).

• The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.

• The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the "big picture," we are willing to sacrifice some of the small details.

In this section we describe a number of approaches to variable subset selection with linear regression. In later sections we discuss shrinkage and hybrid approaches for controlling variance, as well as other dimension-reduction strategies. These all fall under the general heading model selection. Model selection is not restricted to linear models; Chapter 7 covers this topic in some detail.

With subset selection we retain only a subset of the variables, and eliminate the rest from the model. Least squares regression is used to estimate the coefficients of the inputs that are retained. There are a number of different strategies for choosing the subset.

3.3.1 Best-Subset Selection

Best subset regression finds for each k ∈ {0, 1, 2, ..., p} the subset of size k that gives smallest residual sum of squares (3.2). An efficient algorithm—the leaps and bounds procedure (Furnival and Wilson, 1974)—makes this feasible for p as large as 30 or 40. Figure 3.5 shows all the subset models for the prostate cancer example. The lower boundary represents the models that are eligible for selection by the best-subsets approach. Note that the best subset of size 2, for example, need not include the variable that was in the best subset of size 1 (for this example all the subsets are nested).

FIGURE 3.5. All possible subset models for the prostate cancer example. At each subset size is shown the residual sum-of-squares for each model of that size.

The best-subset curve (red lower boundary in Figure 3.5) is necessarily decreasing, so cannot be used to select the subset size k. The question of how to choose k involves the tradeoff between bias and variance, along with the more subjective desire for parsimony. There are a number of criteria that one may use; typically we choose the smallest model that minimizes an estimate of the expected prediction error. Many of the other approaches that we discuss in this chapter are similar, in that they use the training data to produce a sequence of models varying in complexity and indexed by a single parameter. In the next section we use cross-validation to estimate prediction error and select k; the AIC criterion is a popular alternative. We defer more detailed discussion of these and other approaches to Chapter 7.
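For small p, best-subset selection can be carried out by brute-force enumeration, as in the following sketch; the leaps and bounds procedure achieves the same result far more efficiently, and the function name is illustrative.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y):
    """For each size k, find the subset minimizing the residual sum
    of squares; an intercept is always included."""
    N, p = X.shape
    results = {}
    for k in range(p + 1):
        best_rss, best_vars = np.inf, ()
        for subset in combinations(range(p), k):
            Xs = np.column_stack([np.ones(N)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        results[k] = (best_vars, best_rss)   # lower boundary of Figure 3.5
    return results
```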
3.3.2 Forward- and Backward-Stepwise Selection

Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9). Like best-subset regression, forward stepwise produces a sequence of models indexed by k, the subset size, which must be determined.

Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models. In this sense it might seem sub-optimal compared to best-subset selection. However, there are several reasons why it might be preferred:

• Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p ≫ N).

• Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.

FIGURE 3.6. Comparison of four subset-selection techniques (best subset, forward stepwise, backward stepwise, forward stagewise) on a simulated linear regression problem Y = X^T\beta + \varepsilon. There are N = 300 observations on p = 31 standard Gaussian variables, with pairwise correlations all equal to 0.85. For 10 of the variables, the coefficients are drawn at random from a N(0, 0.4) distribution; the rest are zero. The noise \varepsilon ∼ N(0, 6.25), resulting in a signal-to-noise ratio of 0.64. Results are averaged over 50 simulations. Shown is the mean-squared error E\|\hat\beta(k) − \beta\|^2 of the estimated coefficient \hat\beta(k) at each step from the true \beta.

Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest Z-score (Exercise 3.10). Backward selection can only be used when N > p, while forward stepwise can always be used.

Figure 3.6 shows the results of a small simulation study to compare best-subset regression with the simpler alternatives forward and backward selection. Their performance is very similar, as is often the case. Included in the figure is forward stagewise regression (next section), which takes longer to reach minimum error.

On the prostate cancer example, best-subset, forward and backward selection all gave exactly the same sequence of terms.

Some software packages implement hybrid stepwise-selection strategies that consider both forward and backward moves at each step, and select the "best" of the two. For example, in R the step function uses the AIC criterion for weighing the choices, which takes proper account of the number of parameters fit; at each step an add or drop will be performed that minimizes the AIC score. Other more traditional packages base the selection on F-statistics, adding "significant" terms, and dropping "non-significant" terms. These are out of fashion, since they do not take proper account of the multiple testing issues. It is also tempting after a model search to print out a summary of the chosen model, such as in Table 3.2; however, the standard errors are not valid, since they do not account for the search process. The bootstrap (Section 8.2) can be useful in such settings.

Finally, we note that often variables come in groups (such as the dummy variables that code a multi-level categorical predictor). Smart stepwise procedures (such as step in R) will add or drop whole groups at a time, taking proper account of their degrees-of-freedom.
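A sketch of the greedy forward-stepwise search follows; a production implementation would exploit QR updating (Exercise 3.9) rather than refitting from scratch as done here, and the function name is illustrative.

```python
import numpy as np

def forward_stepwise(X, y, kmax=None):
    """Sequentially add the predictor that most reduces the residual
    sum of squares; an intercept is always included.
    Returns the entry order and the RSS after each addition."""
    N, p = X.shape
    kmax = p if kmax is None else kmax
    active, order = [], []
    while len(active) < kmax:
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in active:
                continue
            cols = [np.ones(N)] + [X[:, i] for i in active + [j]]
            Xs = np.column_stack(cols)
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)
        order.append((best_j, best_rss))
    return order
```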
3.3.3 Forward-Stagewise Regression

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to \bar y, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals—i.e., the least-squares fit when N > p.

Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added to the model. As a consequence, forward stagewise can take many more than p steps to reach the least squares fit, and historically has been dismissed as being inefficient. It turns out that this "slow fitting" can pay dividends in high-dimensional problems. We see in Section 3.8.1 that both forward stagewise and a variant which is slowed down even further are quite competitive, especially in very high-dimensional problems.

Forward-stagewise regression is included in Figure 3.6. In this example it takes over 1000 steps to get all the correlations below 10^{-4}. For subset size k, we plotted the error for the last step for which there were k nonzero coefficients. Although it catches up with the best fit, it takes longer to do so.
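The following sketch implements the forward-stagewise procedure just described, assuming standardized predictors and centered y; the function name, iteration cap and tolerance are illustrative.

```python
import numpy as np

def forward_stagewise(X, y, max_steps=10000, tol=1e-8):
    """Forward-stagewise regression (Section 3.3.3): repeatedly add the
    simple least-squares coefficient of the residual on the most
    correlated predictor to that predictor's coefficient."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y.copy()
    for _ in range(max_steps):
        c = X.T @ r                          # inner products with residual
        j = int(np.argmax(np.abs(c)))        # most correlated predictor
        if np.abs(c[j]) < tol:               # residual uncorrelated with all
            break
        delta = c[j] / (X[:, j] @ X[:, j])   # simple regression coefficient
        beta[j] += delta                     # only beta_j is adjusted
        r -= delta * X[:, j]
    return beta
```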
3.3.4 Prostate Cancer Data Example (Continued)

Table 3.3 shows the coefficients from a number of different selection and shrinkage methods. They are best-subset selection using an all-subsets search, ridge regression, the lasso, principal components regression and partial least squares. Each method has a complexity parameter, and this was chosen to minimize an estimate of prediction error based on tenfold cross-validation; full details are given in Section 7.10. Briefly, cross-validation works by dividing the training data randomly into ten equal parts. The learning method is fit—for a range of values of the complexity parameter—to nine-tenths of the data, and the prediction error is computed on the remaining one-tenth. This is done in turn for each one-tenth of the data, and the ten prediction error estimates are averaged. From this we obtain an estimated prediction error curve as a function of the complexity parameter.

Note that we have already divided these data into a training set of size 67 and a test set of size 30. Cross-validation is applied to the training set, since selecting the shrinkage parameter is part of the training process. The test set is there to judge the performance of the selected model.

The estimated prediction error curves are shown in Figure 3.7. Many of the curves are very flat over large ranges near their minimum. Included are estimated standard error bands for each estimated error rate, based on the ten error estimates computed by cross-validation. We have used the "one-standard-error" rule—we pick the most parsimonious model within one standard error of the minimum (Section 7.10, page 244). Such a rule acknowledges the fact that the tradeoff curve is estimated with error, and hence takes a conservative approach.

FIGURE 3.7. Estimated prediction error curves and their standard errors for the various selection and shrinkage methods: all subsets (versus subset size), ridge regression (versus effective degrees of freedom), the lasso (versus shrinkage factor s), and principal components regression and partial least squares (versus number of directions). Each curve is plotted as a function of the corresponding complexity parameter for that method. The horizontal axis has been chosen so that the model complexity increases as we move from left to right. The estimates of prediction error and their standard errors were obtained by tenfold cross-validation; full details are given in Section 7.10. The least complex model within one standard error of the best is chosen, indicated by the purple vertical broken lines.

Best-subset selection chose to use the two predictors lcavol and lweight. The last two lines of the table give the average prediction error (and its estimated standard error) over the test set.

TABLE 3.3. Estimated coefficients and test error results, for different subset and shrinkage methods applied to the prostate data. The blank entries correspond to variables omitted.

Term         LS       Best Subset   Ridge    Lasso    PCR      PLS
Intercept    2.465    2.477         2.452    2.468    2.497    2.452
lcavol       0.680    0.740         0.420    0.533    0.543    0.419
lweight      0.263    0.316         0.238    0.169    0.289    0.344
age         -0.141                 -0.046            -0.152   -0.026
lbph         0.210                  0.162    0.002    0.214    0.220
svi          0.305                  0.227    0.094    0.315    0.243
lcp         -0.288                  0.000            -0.051    0.079
gleason     -0.021                  0.040             0.232    0.011
pgg45        0.267                  0.133            -0.056    0.084
Test Error   0.521    0.492         0.492    0.479    0.449    0.528
Std Error    0.179    0.143         0.165    0.164    0.105    0.152

3.4 Shrinkage Methods

By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, because it is a discrete process—variables are either retained or discarded—it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.

3.4.1 Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,

\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \Big\{ \sum_{i=1}^N \big( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \Big\}.   (3.41)

Here \lambda ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of \lambda, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other). The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay (Chapter 11).

An equivalent way to write the ridge problem is

\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^N \big( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \big)^2, \quad \text{subject to } \sum_{j=1}^p \beta_j^2 \le t,   (3.42)

which makes explicit the size constraint on the parameters. There is a one-to-one correspondence between the parameters \lambda in (3.41) and t in (3.42). When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin. By imposing a size constraint on the coefficients, as in (3.42), this problem is alleviated.

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.41). In addition, notice that the intercept \beta_0 has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for Y; that is, adding a constant c to each of the targets y_i would not simply result in a shift of the predictions by the same amount c. It can be shown (Exercise 3.5) that the solution to (3.41) can be separated into two parts, after reparametrization using centered inputs: each x_{ij} gets replaced by x_{ij} − \bar x_j. We estimate \beta_0 by \bar y = \frac{1}{N}\sum_{i=1}^N y_i.
The remaining coefficients get estimated by a ridge regression without intercept, using the centered x_{ij}. Henceforth we assume that this centering has been done, so that the input matrix X has p (rather than p + 1) columns.

Writing the criterion in (3.41) in matrix form,

\mathrm{RSS}(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda\beta^T\beta,   (3.43)

the ridge regression solutions are easily seen to be

\hat\beta^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y,   (3.44)

where I is the p × p identity matrix. Notice that with the choice of quadratic penalty \beta^T\beta, the ridge regression solution is again a linear function of y. The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem nonsingular, even if X^T X is not of full rank, and was the main motivation for ridge regression when it was first introduced in statistics (Hoerl and Kennard, 1970). Traditional descriptions of ridge regression start with definition (3.44). We choose to motivate it via (3.41) and (3.42), as these provide insight into how it works.

Figure 3.8 shows the ridge coefficient estimates for the prostate cancer example, plotted as functions of df(\lambda), the effective degrees of freedom implied by the penalty \lambda (defined in (3.50) on page 68). In the case of orthonormal inputs, the ridge estimates are just a scaled version of the least squares estimates, that is, \hat\beta^{\mathrm{ridge}} = \hat\beta/(1 + \lambda).

FIGURE 3.8. Profiles of ridge coefficients for the prostate cancer example, as the tuning parameter \lambda is varied. Coefficients are plotted versus df(\lambda), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.

Ridge regression can also be derived as the mean or mode of a posterior distribution, with a suitably chosen prior distribution. In detail, suppose y_i ∼ N(\beta_0 + x_i^T\beta, \sigma^2), and the parameters \beta_j are each distributed as N(0, \tau^2), independently of one another. Then the (negative) log-posterior density of \beta, with \tau^2 and \sigma^2 assumed known, is equal to the expression in curly braces in (3.41), with \lambda = \sigma^2/\tau^2 (Exercise 3.6). Thus the ridge estimate is the mode of the posterior distribution; since the distribution is Gaussian, it is also the posterior mean.

The singular value decomposition (SVD) of the centered input matrix X gives us some additional insight into the nature of ridge regression. This decomposition is extremely useful in the analysis of many statistical methods. The SVD of the N × p matrix X has the form

X = UDV^T.   (3.45)

Here U and V are N × p and p × p orthogonal matrices, with the columns of U spanning the column space of X, and the columns of V spanning the row space. D is a p × p diagonal matrix, with diagonal entries d_1 ≥ d_2 ≥ ··· ≥ d_p ≥ 0 called the singular values of X. If one or more values d_j = 0, X is singular.

Using the singular value decomposition we can write the least squares fitted vector as

X\hat\beta^{\mathrm{ls}} = X(X^T X)^{-1} X^T y = UU^T y,   (3.46)

after some simplification. Note that U^T y are the coordinates of y with respect to the orthonormal basis U. Note also the similarity with (3.33); Q and U are generally different orthogonal bases for the column space of X (Exercise 3.8).
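A short numpy sketch of the ridge solution (3.44), assuming X has been centered (and typically standardized) and y centered, so that no intercept column appears; the intercept is estimated separately by the mean of y.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution (3.44): (X^T X + lam I)^{-1} X^T y on centered data.
    Adding lam > 0 to the diagonal makes the system nonsingular even
    when X^T X is rank deficient."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```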
Now the ridge solutions are

X\hat\beta^{\mathrm{ridge}} = X(X^T X + \lambda I)^{-1} X^T y = UD(D^2 + \lambda I)^{-1} D U^T y = \sum_{j=1}^p u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y,   (3.47)

where the u_j are the columns of U. Note that since \lambda ≥ 0, we have d_j^2/(d_j^2 + \lambda) ≤ 1. Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors d_j^2/(d_j^2 + \lambda). This means that a greater amount of shrinkage is applied to the coordinates of basis vectors with smaller d_j^2.

What does a small value of d_j^2 mean? The SVD of the centered matrix X is another way of expressing the principal components of the variables in X. The sample covariance matrix is given by S = X^T X/N, and from (3.45) we have

X^T X = VD^2V^T,   (3.48)

which is the eigen decomposition of X^T X (and of S, up to a factor N). The eigenvectors v_j (columns of V) are also called the principal component (or Karhunen–Loeve) directions of X. The first principal component direction v_1 has the property that z_1 = Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X. This sample variance is easily seen to be

\mathrm{Var}(z_1) = \mathrm{Var}(Xv_1) = \frac{d_1^2}{N},   (3.49)

and in fact z_1 = Xv_1 = u_1 d_1. The derived variable z_1 is called the first principal component of X, and hence u_1 is the normalized first principal component. Subsequent principal components z_j have maximum variance d_j^2/N, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values d_j correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

FIGURE 3.9. Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.

Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the Y-axis is sticking out of the page), the configuration of the data allows us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but it need not hold in general.

In Figure 3.7 we have plotted the estimated prediction error versus the quantity

\mathrm{df}(\lambda) = \mathrm{tr}[X(X^T X + \lambda I)^{-1} X^T] = \mathrm{tr}(H_\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}.   (3.50)

This monotone decreasing function of \lambda is the effective degrees of freedom of the ridge regression fit.
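The effective degrees of freedom (3.50) is easily computed from the singular values of the centered X, as in this sketch; inverting it numerically gives the \lambda corresponding to a target df, such as the df = 5.0 chosen by cross-validation in Figure 3.8.

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom (3.50): sum_j d_j^2 / (d_j^2 + lam),
    computed from the singular values of the centered X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))
```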
Usually in a linear-regression fit with p variables, the degrees-of-freedom of the fit is p, the number of free parameters. The idea is that although all p coefficients in a ridge fit will be non-zero, they are fit in a restricted fashion controlled by \lambda. Note that df(\lambda) = p when \lambda = 0 (no regularization) and df(\lambda) → 0 as \lambda → ∞. Of course there is always an additional one degree of freedom for the intercept, which was removed a priori. This definition is motivated in more detail in Section 3.4.4 and Sections 7.4–7.6. In Figure 3.7 the minimum occurs at df(\lambda) = 5.0. Table 3.3 shows that ridge regression reduces the test error of the full least squares estimates by a small amount.

3.4.2 The Lasso

The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by

\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \sum_{i=1}^N \big( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \big)^2 \quad \text{subject to } \sum_{j=1}^p |\beta_j| \le t.   (3.51)

Just as in ridge regression, we can re-parametrize the constant \beta_0 by standardizing the predictors; the solution for \hat\beta_0 is \bar y, and thereafter we fit a model without an intercept (Exercise 3.5). In the signal processing literature, the lasso is also known as basis pursuit (Chen et al., 1998).

We can also write the lasso problem in the equivalent Lagrangian form

\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \Big\{ \sum_{i=1}^N \big( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \big)^2 + \lambda \sum_{j=1}^p |\beta_j| \Big\}.   (3.52)

Notice the similarity to the ridge regression problem (3.42) or (3.41): the L_2 ridge penalty \sum_1^p \beta_j^2 is replaced by the L_1 lasso penalty \sum_1^p |\beta_j|. This latter constraint makes the solutions nonlinear in the y_i, and there is no closed form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem, although we see in Section 3.4.4 that efficient algorithms are available for computing the entire path of solutions as \lambda is varied, with the same computational cost as for ridge regression. Because of the nature of the constraint, making t sufficiently small will cause some of the coefficients to be exactly zero. Thus the lasso does a kind of continuous subset selection. If t is chosen larger than t_0 = \sum_1^p |\hat\beta_j| (where \hat\beta_j = \hat\beta_j^{\mathrm{ls}}, the least squares estimates), then the lasso estimates are the \hat\beta_j's. On the other hand, for t = t_0/2 say, the least squares coefficients are shrunk by about 50% on average. However, the nature of the shrinkage is not obvious, and we investigate it further in Section 3.4.4 below. Like the subset size in variable subset selection, or the penalty parameter in ridge regression, t should be adaptively chosen to minimize an estimate of expected prediction error.

In Figure 3.7, for ease of interpretation, we have plotted the lasso prediction error estimates versus the standardized parameter s = t/\sum_1^p |\hat\beta_j|. A value \hat s ≈ 0.36 was chosen by 10-fold cross-validation; this caused four coefficients to be set to zero (fifth column of Table 3.3). The resulting model has the second lowest test error, slightly lower than the full least squares model, but the standard errors of the test error estimates (last line of Table 3.3) are fairly large.

Figure 3.10 shows the lasso coefficients as the standardized tuning parameter s = t/\sum_1^p |\hat\beta_j| is varied. At s = 1.0 these are the least squares estimates; they decrease to 0 as s → 0. This decrease is not always strictly monotonic, although it is in this example. A vertical line is drawn at s = 0.36, the value chosen by cross-validation.

FIGURE 3.10. Profiles of lasso coefficients, as the tuning parameter t is varied. Coefficients are plotted versus s = t/\sum_1^p |\hat\beta_j|. A vertical line is drawn at s = 0.36, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piece-wise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.
3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso

In this section we discuss and compare the three approaches discussed so far for restricting the linear regression model: subset selection, ridge regression and the lasso.

In the case of an orthonormal input matrix X the three procedures have explicit solutions. Each method applies a simple transformation to the least squares estimate \hat\beta_j, as detailed in Table 3.4.

TABLE 3.4. Estimators of \beta_j in the case of orthonormal columns of X. M and \lambda are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x_+ denotes the "positive part" of x.

Estimator               Formula
Best subset (size M)    \hat\beta_j \cdot I[\mathrm{rank}(|\hat\beta_j|) \le M]
Ridge                   \hat\beta_j / (1 + \lambda)
Lasso                   \mathrm{sign}(\hat\beta_j)(|\hat\beta_j| - \lambda)_+

Ridge regression does a proportional shrinkage. The lasso translates each coefficient toward zero by the constant amount \lambda, truncating at zero. This is called "soft thresholding," and is used in the context of wavelet-based smoothing in Section 5.9. Best-subset selection drops all variables with coefficients smaller than the Mth largest; this is a form of "hard thresholding."

Back to the nonorthogonal case; some pictures help understand their relationship. Figure 3.11 depicts the lasso (left) and ridge regression (right) when there are only two parameters. The residual sum of squares has elliptical contours, centered at the full least squares estimate. The constraint region for ridge regression is the disk \beta_1^2 + \beta_2^2 ≤ t^2, while that for lasso is the diamond |\beta_1| + |\beta_2| ≤ t. Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter \beta_j equal to zero. When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero.

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |\beta_1| + |\beta_2| ≤ t and \beta_1^2 + \beta_2^2 ≤ t^2, respectively, while the red ellipses are the contours of the least squares error function.
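The three orthonormal-case transformations of Table 3.4 are easily expressed in numpy; the inputs are the least squares estimates, and the function names are illustrative.

```python
import numpy as np

def best_subset_orthonormal(beta_ls, M):
    # Hard thresholding: keep the M largest |beta_j|, zero the rest.
    keep = np.argsort(-np.abs(beta_ls))[:M]
    out = np.zeros_like(beta_ls)
    out[keep] = beta_ls[keep]
    return out

def ridge_orthonormal(beta_ls, lam):
    # Proportional shrinkage.
    return beta_ls / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    # Soft thresholding: translate toward zero by lam, truncate at zero.
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)
```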
We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion

\tilde\beta = \arg\min_\beta \Big\{ \sum_{i=1}^N \big( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \big)^2 + \lambda \sum_{j=1}^p |\beta_j|^q \Big\}   (3.53)

for q ≥ 0. The contours of constant value of \sum_j |\beta_j|^q are shown in Figure 3.12, for the case of two inputs.

FIGURE 3.12. Contours of constant value of \sum_j |\beta_j|^q for given values of q (q = 4, 2, 1, 0.5, 0.1).

Thinking of |\beta_j|^q as the log-prior density for \beta_j, these are also the equi-contours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q ≤ 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density (1/2\tau)\exp(-|\beta|/\tau) and \tau = 1/\lambda. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

Looking again at the criterion (3.53), we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, for q > 1, |\beta_j|^q is differentiable at 0, and so does not share the ability of lasso (q = 1) for setting coefficients exactly to zero. Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the elastic-net penalty

\lambda \sum_{j=1}^p \big( \alpha\beta_j^2 + (1 - \alpha)|\beta_j| \big),   (3.54)

a different compromise between ridge and lasso. Figure 3.13 compares the L_q penalty with q = 1.2 and the elastic-net penalty with \alpha = 0.2; it is hard to detect the difference by eye. The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the L_q penalties. We discuss the elastic-net further in Section 18.4.

FIGURE 3.13. Contours of constant value of \sum_j |\beta_j|^q for q = 1.2 (left plot), and the elastic-net penalty \sum_j (\alpha\beta_j^2 + (1 - \alpha)|\beta_j|) for \alpha = 0.2 (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the q = 1.2 penalty does not.
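A small sketch of the penalty functions themselves, useful for comparing values or plotting contours like those in Figures 3.12 and 3.13; the function names are illustrative.

```python
import numpy as np

def lq_penalty(beta, q):
    # sum_j |beta_j|^q as in (3.53); q = 1 gives lasso, q = 2 gives ridge.
    return np.sum(np.abs(beta) ** q)

def elastic_net_penalty(beta, alpha):
    # sum_j (alpha * beta_j^2 + (1 - alpha) * |beta_j|), equation (3.54).
    return np.sum(alpha * np.asarray(beta) ** 2 + (1 - alpha) * np.abs(beta))
```

Evaluating these on a grid of (\beta_1, \beta_2) values and contouring shows the sharp corners of the elastic net at the axes, which the smooth q = 1.2 penalty lacks.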
3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), and can be viewed as a kind of "democratic" version of forward stepwise regression (Section 3.3.2). As we will see, LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path as in Figure 3.10.

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves. At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value). As soon as another variable "catches up" in terms of correlation with the residual, the process is paused. The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued until all the variables are in the model, and ends at the full least-squares fit. Algorithm 3.2 provides the details. The termination condition in step 5 requires some explanation. If p > N − 1, the LAR algorithm reaches a zero residual solution after N − 1 steps (the −1 is because we have centered the data).

Algorithm 3.2 Least Angle Regression.
1. Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − \bar y, \beta_1, \beta_2, ..., \beta_p = 0.
2. Find the predictor x_j most correlated with r.
3. Move \beta_j from 0 towards its least-squares coefficient ⟨x_j, r⟩, until some other competitor x_k has as much correlation with the current residual as does x_j.
4. Move \beta_j and \beta_k in the direction defined by their joint least squares coefficient of the current residual on (x_j, x_k), until some other competitor x_l has as much correlation with the current residual.
5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.

Suppose A_k is the active set of variables at the beginning of the kth step, and let \beta_{A_k} be the coefficient vector for these variables at this step; there will be k − 1 nonzero values, and the one just entered will be zero. If r_k = y − X_{A_k}\beta_{A_k} is the current residual, then the direction for this step is

\delta_k = (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T r_k.   (3.55)

The coefficient profile then evolves as \beta_{A_k}(\alpha) = \beta_{A_k} + \alpha \cdot \delta_k. Exercise 3.23 verifies that the directions chosen in this fashion do what is claimed: keep the correlations tied and decreasing. If the fit vector at the beginning of this step is \hat f_k, then it evolves as \hat f_k(\alpha) = \hat f_k + \alpha \cdot u_k, where u_k = X_{A_k}\delta_k is the new fit direction. The name "least angle" arises from a geometrical interpretation of this process; u_k makes the smallest (and equal) angle with each of the predictors in A_k (Exercise 3.24). Figure 3.14 shows the absolute correlations decreasing and joining ranks with each step of the LAR algorithm, using simulated data.

FIGURE 3.14. Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L_1 arc length.

By construction the coefficients in LAR change in a piecewise linear fashion. Figure 3.15 (left panel) shows the LAR coefficient profile evolving as a function of its L_1 arc length.² Note that we do not need to take small steps and recheck the correlations in step 3; using knowledge of the covariance of the predictors and the piecewise linearity of the algorithm, we can work out the exact step length at the beginning of each step (Exercise 3.25).

FIGURE 3.15. Left panel shows the LAR coefficient profiles on the simulated data, as a function of the L_1 arc length. The right panel shows the Lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.

²The L_1 arc-length of a differentiable curve \beta(s) for s ∈ [0, S] is given by TV(\beta, S) = \int_0^S \|\dot\beta(s)\|_1 ds, where \dot\beta(s) = \partial\beta(s)/\partial s. For the piecewise-linear LAR coefficient profile, this amounts to summing the L_1 norms of the changes in coefficients from step to step.
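Exact LAR computes each step length in closed form; the following small-step sketch instead nudges the active coefficients a fraction \epsilon toward their joint least-squares values, which keeps the active inner products exactly tied (each step scales them by 1 − \epsilon) and adds a competitor when it catches up. It assumes standardized X and centered y; all names and tolerances are illustrative.

```python
import numpy as np

def lar_smallstep(X, y, eps=1e-3, n_steps=50000, tie_tol=1e-4):
    """Small-step approximation to Algorithm 3.2 (least angle regression)."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y.copy()
    active = [int(np.argmax(np.abs(X.T @ r)))]   # step 2: most correlated
    for _ in range(n_steps):
        A = X[:, active]
        # joint least-squares direction, equation (3.55)
        delta = np.linalg.solve(A.T @ A, A.T @ r)
        step = np.zeros(p)
        step[active] = delta
        beta += eps * step                       # move coefficients together
        r -= eps * (X @ step)                    # active correlations shrink by (1-eps)
        c = np.abs(X.T @ r)
        if len(active) < p:
            cmax = c[active].max()
            for j in range(p):                   # a competitor has caught up
                if j not in active and c[j] >= cmax - tie_tol:
                    active.append(j)
        if c.max() < 1e-8:                       # full least-squares fit reached
            break
    return beta, active
```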
The right panel of Figure 3.15 shows the lasso coefficient profiles on the same data. They are almost identical to those in the left panel, and differ for the first time when the blue coefficient passes back through zero. For the prostate data, the LAR coefficient profile turns out to be identical to the lasso profile in Figure 3.10, which never crosses zero. These observations lead to a simple modification of the LAR algorithm that gives the entire lasso path, which is also piecewise-linear.

Algorithm 3.2a Least Angle Regression: Lasso Modification.
4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction.

The LAR(lasso) algorithm is extremely efficient, requiring the same order of computation as that of a single least squares fit using the p predictors. Least angle regression always takes p steps to get to the full least squares estimates. The lasso path can have more than p steps, although the two are often quite similar. Algorithm 3.2 with the lasso modification 3.2a is an efficient way of computing the solution to any lasso problem, especially when p ≫ N. Osborne et al. (2000a) also discovered a piecewise-linear path for computing the lasso, which they called a homotopy algorithm.

We now give a heuristic argument for why these procedures are so similar. Although the LAR algorithm is stated in terms of correlations, if the input features are standardized, it is equivalent and easier to work with inner-products. Suppose A is the active set of variables at some stage in the algorithm, tied in their absolute inner-product with the current residuals y − X\beta. We can express this as

x_j^T(y - X\beta) = \gamma \cdot s_j, \quad \forall j \in \mathcal{A},   (3.56)

where s_j ∈ {−1, 1} indicates the sign of the inner-product, and \gamma is the common value. Also |x_k^T(y − X\beta)| ≤ \gamma for all k ∉ \mathcal{A}. Now consider the lasso criterion (3.52), which we write in vector form

R(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1.   (3.57)

Let \mathcal{B} be the active set of variables in the solution for a given value of \lambda. For these variables R(\beta) is differentiable, and the stationarity conditions give

x_j^T(y - X\beta) = \lambda \cdot \mathrm{sign}(\beta_j), \quad \forall j \in \mathcal{B}.   (3.58)

Comparing (3.58) with (3.56), we see that they are identical only if the sign of \beta_j matches the sign of the inner product. That is why the LAR algorithm and lasso start to differ when an active coefficient passes through zero; condition (3.58) is violated for that variable, and it is kicked out of the active set \mathcal{B}. Exercise 3.23 shows that these equations imply a piecewise-linear coefficient profile as \lambda decreases. The stationarity conditions for the non-active variables require that

|x_k^T(y - X\beta)| \le \lambda, \quad \forall k \notin \mathcal{B},   (3.59)

which again agrees with the LAR algorithm.

Figure 3.16 compares LAR and lasso to forward stepwise and stagewise regression. The setup is the same as in Figure 3.6 on page 59, except here N = 100 rather than 300, so the problem is more difficult. We see that the more aggressive forward stepwise starts to overfit quite early (well before the 10 true variables can enter the model), and ultimately performs worse than the slower forward stagewise regression.
The behavior of LAR and lasso is similar to that of forward stagewise regression. Incremental forward stagewise is similar to LAR and lasso, and is described in Section 3.8.1.

Degrees-of-Freedom Formula for LAR and Lasso

Suppose that we fit a linear model via the least angle regression procedure, stopping at some number of steps k (Exercise 3.14). It can also be shown that the sequence of PLS coefficients for m = 1, 2, ..., p represents the conjugate gradient sequence for computing the least squares solutions (Exercise 3.18).

3.6 Discussion: A Comparison of the Selection and Shrinkage Methods

There are some simple settings where we can understand better the relationship between the different methods described above. Consider an example with two correlated inputs X_1 and X_2, with correlation \rho. We assume that the true regression coefficients are \beta_1 = 4 and \beta_2 = 2. Figure 3.18 shows the coefficient profiles for the different methods, as their tuning parameters are varied. The top panel has \rho = 0.5, the bottom panel \rho = −0.5. The tuning parameters for ridge and lasso vary over a continuous range, while best subset, PLS and PCR take just two discrete steps to the least squares solution.

In the top panel, starting at the origin, ridge regression shrinks the coefficients together until it finally converges to least squares. PLS and PCR show similar behavior to ridge, although they are discrete and more extreme. Best subset overshoots the solution and then backtracks. The behavior of the lasso is intermediate to the other methods. When the correlation is negative (lower panel), again PLS and PCR roughly track the ridge path, while all of the methods are more similar to one another.

It is interesting to compare the shrinkage behavior of these different methods. Recall that ridge regression shrinks all directions, but shrinks low-variance directions more. Principal components regression leaves M high-variance directions alone, and discards the rest. Interestingly, it can be shown that partial least squares also tends to shrink the low-variance directions, but can actually inflate some of the higher variance directions. This can make PLS a little unstable, and cause it to have slightly higher prediction error compared to ridge regression. A full study is given in Frank and Friedman (1993). These authors conclude that for minimizing prediction error, ridge regression is generally preferable to variable subset selection, principal components regression and partial least squares. However the improvement over the latter two methods was only slight.

To summarize, PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.

FIGURE 3.18. Coefficient profiles from different methods for a simple problem: two inputs with correlation ±0.5, and the true regression coefficients \beta = (4, 2).

3.7 Multiple Outcome Shrinkage and Selection

As noted in Section 3.2.4, the least squares estimates in a multiple-output linear model are simply the individual least squares estimates for each of the outputs.
To apply selection and shrinkage methods in the multiple output case, one could apply a univariate technique individually to each outcome or simultaneously to all outcomes. With ridge regression, for example, we could apply formula (3.44) to each of the K columns of the outcome matrix Y, using possibly different parameters \lambda, or apply it to all columns using the same value of \lambda. The former strategy would allow different amounts of regularization to be applied to different outcomes but require estimation of K separate regularization parameters \lambda_1, ..., \lambda_K, while the latter would permit all K outputs to be used in estimating the sole regularization parameter \lambda.
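A sketch of the column-by-column strategy: the ridge formula (3.44) applied to each output, with either a shared \lambda or per-output values; the function name is illustrative.

```python
import numpy as np

def ridge_multioutput(X, Y, lams):
    """Apply the ridge solution (3.44) column by column to an N x K
    response matrix Y. `lams` may be a single shared value or a
    sequence of K per-output values. Assumes X and Y are centered."""
    p = X.shape[1]
    K = Y.shape[1]
    lams = np.broadcast_to(np.asarray(lams, dtype=float), (K,))
    XtX, XtY = X.T @ X, X.T @ Y
    B = np.empty((p, K))
    for k in range(K):
        B[:, k] = np.linalg.solve(XtX + lams[k] * np.eye(p), XtY[:, k])
    return B
```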
Other more sophisticated shrinkage and selection strategies that exploit correlations in the different responses can be helpful in the multiple output case. Suppose for example that among the outputs we have

Y_k = f(X) + \varepsilon_k;   (3.65)
Y_l = f(X) + \varepsilon_l;   (3.66)

i.e., (3.65) and (3.66) share the same structural part f(X) in their models. It is clear in this case that we should pool our observations on Y_k and Y_l to estimate the common f.

Combining responses is at the heart of canonical correlation analysis (CCA), a data reduction technique developed for the multiple output case. Similar to PCA, CCA finds a sequence of uncorrelated linear combinations Xv_m, m = 1, ..., M of the x_j, and a corresponding sequence of uncorrelated linear combinations Yu_m of the responses y_k, such that the correlations

\mathrm{Corr}^2(Yu_m, Xv_m)   (3.67)

are successively maximized. Note that at most M = min(K, p) directions can be found. The leading canonical response variates are those linear combinations (derived responses) best predicted by the x_j; in contrast, the trailing canonical variates can be poorly predicted by the x_j, and are candidates for being dropped. The CCA solution is computed using a generalized SVD of the sample cross-covariance matrix Y^T X/N (assuming Y and X are centered; Exercise 3.20).

Reduced-rank regression (Izenman, 1975; van der Merwe and Zidek, 1980) formalizes this approach in terms of a regression model that explicitly pools information. Given an error covariance Cov(\varepsilon) = \Sigma, we solve the following restricted multivariate regression problem:

\hat B^{\mathrm{rr}}(m) = \arg\min_{\mathrm{rank}(B)=m} \sum_{i=1}^N (y_i - B^T x_i)^T \Sigma^{-1} (y_i - B^T x_i).   (3.68)

With \Sigma replaced by the estimate Y^T Y/N, one can show (Exercise 3.21) that the solution is given by a CCA of Y and X:

\hat B^{\mathrm{rr}}(m) = \hat B U_m U_m^-,   (3.69)

where U_m is the K × m sub-matrix of U consisting of the first m columns, and U is the K × M matrix of left canonical vectors u_1, u_2, ..., u_M. U_m^- is its generalized inverse. Writing the solution as

\hat B^{\mathrm{rr}}(m) = (X^T X)^{-1} X^T (YU_m) U_m^-,   (3.70)

we see that reduced-rank regression performs a linear regression on the pooled response matrix YU_m, and then maps the coefficients (and hence the fits as well) back to the original response space. The reduced-rank fits are given by

\hat Y^{\mathrm{rr}}(m) = X(X^T X)^{-1} X^T YU_m U_m^- = HYP_m,   (3.71)

where H is the usual linear regression projection operator, and P_m is the rank-m CCA response projection operator. Although a better estimate of \Sigma would be (Y − X\hat B)^T(Y − X\hat B)/(N − pK), one can show that the solution remains the same (Exercise 3.22).

Reduced-rank regression borrows strength among responses by truncating the CCA. Breiman and Friedman (1997) explored with some success shrinkage of the canonical variates between X and Y, a smooth version of reduced rank regression. Their proposal has the form (compare (3.69))

\hat B^{\mathrm{c+w}} = \hat B U \Lambda U^{-1},   (3.72)

where \Lambda is a diagonal shrinkage matrix (the "c+w" stands for "Curds and Whey," the name they gave to their procedure). Based on optimal prediction in the population setting, they show that \Lambda has diagonal entries

\lambda_m = \frac{c_m^2}{c_m^2 + \frac{p}{N}(1 - c_m^2)}, \quad m = 1, ..., M,   (3.73)

where c_m is the mth canonical correlation coefficient. Note that as the ratio of the number of input variables to sample size p/N gets small, the shrinkage factors approach 1. Breiman and Friedman (1997) proposed modified versions of \Lambda based on training data and cross-validation, but the general form is the same. Here the fitted response has the form

\hat Y^{\mathrm{c+w}} = HYS^{\mathrm{c+w}},   (3.74)

where S^{c+w} = U\Lambda U^{-1} is the response shrinkage operator.

Breiman and Friedman (1997) also suggested shrinking in both the Y space and X space. This leads to hybrid shrinkage models of the form

\hat Y^{\mathrm{ridge,c+w}} = A_\lambda Y S^{\mathrm{c+w}},   (3.75)

where A_\lambda = X(X^T X + \lambda I)^{-1} X^T is the ridge regression shrinkage operator, as in (3.46) on page 66. Their paper and the discussions thereof contain many more details.
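Given the canonical correlations, the population shrinkage factors (3.73) are a one-liner; the function name is illustrative.

```python
import numpy as np

def curds_and_whey_shrinkage(canonical_corrs, p, N):
    """Diagonal entries of Lambda in (3.73): c_m^2 / (c_m^2 + (p/N)(1 - c_m^2)).
    As p/N gets small, the factors approach 1 (little shrinkage)."""
    c2 = np.asarray(canonical_corrs, dtype=float) ** 2
    return c2 / (c2 + (p / N) * (1.0 - c2))
```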
3.8 More on the Lasso and Related Path Algorithms

Since the publication of the LAR algorithm (Efron et al., 2004) there has been a lot of activity in developing algorithms for fitting regularization paths for a variety of different problems. In addition, L_1 regularization has taken on a life of its own, leading to the development of the field of compressed sensing in the signal-processing literature (Donoho, 2006a; Candes, 2006). In this section we discuss some related proposals and other path algorithms, starting off with a precursor to the LAR algorithm.

3.8.1 Incremental Forward Stagewise Regression

Here we present another LAR-like algorithm, this time focused on forward stagewise regression. Interestingly, efforts to understand a flexible nonlinear regression procedure (boosting) led to a new algorithm for linear models (LAR). In reading the first edition of this book and the forward-stagewise Algorithm 16.1 of Chapter 16,⁴ our colleague Brad Efron realized that with linear models, one could explicitly construct the piecewise-linear lasso paths of Figure 3.10. This led him to propose the LAR procedure of Section 3.4.4, as well as the incremental version of forward-stagewise regression presented here.

Consider the linear-regression version of the forward-stagewise boosting algorithm 16.1 proposed in Section 16.1 (page 608). It generates a coefficient profile by repeatedly updating (by a small amount \epsilon) the coefficient of the variable most correlated with the current residuals. Algorithm 3.4 gives the details.

Algorithm 3.4 Incremental Forward Stagewise Regression—FS_\epsilon.
1. Start with the residual r equal to y and \beta_1, \beta_2, ..., \beta_p = 0. All the predictors are standardized to have mean zero and unit norm.
2. Find the predictor x_j most correlated with r.
3. Update \beta_j ← \beta_j + \delta_j, where \delta_j = \epsilon \cdot \mathrm{sign}[⟨x_j, r⟩] and \epsilon > 0 is a small step size, and set r ← r − \delta_j x_j.
4. Repeat steps 2 and 3 many times, until the residuals are uncorrelated with all the predictors.

Figure 3.19 (left panel) shows the progress of the algorithm on the prostate data with step size \epsilon = 0.01. If \delta_j = ⟨x_j, r⟩ (the least-squares coefficient of the residual on the jth predictor), then this is exactly the usual forward stagewise procedure (FS) outlined in Section 3.3.3.

FIGURE 3.19. Coefficient profiles for the prostate data. The left panel shows incremental forward stagewise regression with step size \epsilon = 0.01, as a function of iteration number. The right panel shows the infinitesimal version FS_0, obtained by letting \epsilon → 0, as a function of the L_1 arc-length of the coefficients. This profile was fit by the modification 3.2b to the LAR Algorithm 3.2. In this example the FS_0 profiles are monotone, and hence identical to those of lasso and LAR.

Here we are mainly interested in small values of \epsilon. Letting \epsilon → 0 gives the right panel of Figure 3.19, which in this case is identical to the lasso path in Figure 3.10. We call this limiting procedure infinitesimal forward stagewise regression or FS_0. This procedure plays an important role in non-linear, adaptive methods like boosting (Chapters 10 and 16) and is the version of incremental forward stagewise regression that is most amenable to theoretical analysis. Bühlmann and Hothorn (2008) refer to the same procedure as "L2boost", because of its connections to boosting.

Efron originally thought that the LAR Algorithm 3.2 was an implementation of FS_0, allowing each tied predictor a chance to update its coefficient in a balanced way, while remaining tied in correlation. However, he then realized that the LAR least-squares fit amongst the tied predictors can result in coefficients moving in the opposite direction to their correlation, which cannot happen in Algorithm 3.4. The following modification of the LAR algorithm implements FS_0:

Algorithm 3.2b Least Angle Regression: FS_0 Modification.
4. Find the new direction by solving the constrained least squares problem

\min_b \|r - X_{\mathcal{A}} b\|_2^2 \quad \text{subject to } b_j s_j \ge 0, \; j \in \mathcal{A},

where s_j is the sign of ⟨x_j, r⟩.

The modification amounts to a non-negative least squares fit, keeping the signs of the coefficients the same as those of the correlations. One can show that this achieves the optimal balancing of infinitesimal "update turns" for the variables tied for maximal correlation (Hastie et al., 2007). Like lasso, the entire FS_0 path can be computed very efficiently via the LAR algorithm.

As a consequence of these results, if the LAR profiles are monotone non-increasing or non-decreasing, as they are in Figure 3.19, then all three methods—LAR, lasso, and FS_0—give identical profiles. If the profiles are not monotone but do not cross the zero axis, then LAR and lasso are identical.

Since FS_0 is different from the lasso, it is natural to ask if it optimizes a criterion. The answer is more complex than for lasso; the FS_0 coefficient profile is the solution to a differential equation. While the lasso makes optimal progress in terms of reducing the residual sum-of-squares per unit increase in the L_1-norm of the coefficient vector \beta, FS_0 is optimal per unit increase in L_1 arc-length traveled along the coefficient path. Hence its coefficient path is discouraged from changing directions too often.

FS_0 is more constrained than lasso, and in fact can be viewed as a monotone version of the lasso; see Figure 16.3 on page 614 for a dramatic example. FS_0 may be useful in p ≫ N situations, where its coefficient profiles are much smoother and hence have less variance than those of lasso. More details on FS_0 are given in Section 16.2.3 and Hastie et al. (2007). Figure 3.16 includes FS_0, where its performance is very similar to that of the lasso.

⁴In the first edition, this was Algorithm 10.4 in Chapter 10.
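A direct transcription of Algorithm 3.4 in numpy, assuming standardized predictors and centered y; the iteration cap and tolerance are illustrative.

```python
import numpy as np

def incremental_forward_stagewise(X, y, eps=0.01, n_iter=50000, tol=1e-8):
    """Algorithm 3.4 (FS_eps): nudge the coefficient of the predictor
    most correlated with the residual by eps * sign of the correlation."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y.copy()
    for _ in range(n_iter):
        c = X.T @ r                        # inner products with the residual
        j = int(np.argmax(np.abs(c)))      # most correlated predictor
        if np.abs(c[j]) < tol:             # residual uncorrelated with all
            break
        delta = eps * np.sign(c[j])
        beta[j] += delta
        r -= delta * X[:, j]
    return beta
```

Setting delta to the full simple regression coefficient c[j] instead of eps * sign(c[j]) recovers the ordinary forward stagewise procedure of Section 3.3.3.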
3.8.2 Piecewise-Linear Path Algorithms

The least angle regression procedure exploits the piecewise linear nature of the lasso solution paths. It has led to similar "path algorithms" for other regularized problems. Suppose we solve

\hat\beta(\lambda) = \arg\min_\beta [R(\beta) + \lambda J(\beta)],   (3.76)

with

R(\beta) = \sum_{i=1}^N L\big(y_i, \beta_0 + \sum_{j=1}^p x_{ij}\beta_j\big),   (3.77)

where both the loss function L and the penalty function J are convex. Then the following are sufficient conditions for the solution path \hat\beta(\lambda) to be piecewise linear (Rosset and Zhu, 2007):

1. R is quadratic or piecewise-quadratic as a function of \beta, and
2. J is piecewise linear in \beta.

This also implies (in principle) that the solution path can be efficiently computed. Examples include squared- and absolute-error loss, "Huberized" losses, and the L_1, L_∞ penalties on \beta. Another example is the "hinge loss" function used in the support vector machine. There the loss is piecewise linear, and the penalty is quadratic. Interestingly, this leads to a piecewise-linear path algorithm in the dual space; more details are given in Section 12.3.5.

3.8.3 The Dantzig Selector

Candes and Tao (2007) proposed the following criterion:

\min_\beta \|\beta\|_1 \quad \text{subject to } \|X^T(y - X\beta)\|_\infty \le t.   (3.78)

They call the solution the Dantzig selector (DS). It can be written equivalently as

\min_\beta \|X^T(y - X\beta)\|_\infty \quad \text{subject to } \|\beta\|_1 \le t.   (3.79)

Here \|\cdot\|_\infty denotes the L_∞ norm, the maximum absolute value of the components of the vector. In this form it resembles the lasso, replacing squared error loss by the maximum absolute value of its gradient. Note that as t gets large, both procedures yield the least squares solution if N > p.

One line of analysis considers the p ≫ N case, and the lasso solution as the bound t gets large. In the limit this gives the solution with minimum L_1 norm among all models with zero training error. It can be shown that under certain assumptions on the model matrix X, if the true model is sparse, this solution identifies the correct predictors with high probability.

Many of the results in this area assume a condition on the model matrix of the form

\|(X_S^T X_S)^{-1} X_S^T X_{S^c}\|_\infty \le (1 - \epsilon) \quad \text{for some } \epsilon \in (0, 1].   (3.81)

Here S indexes the subset of features with non-zero coefficients in the true underlying model, and X_S are the columns of X corresponding to those features. Similarly S^c are the features with true coefficients equal to zero, and X_{S^c} the corresponding columns. This says that the least squares coefficients for the columns of X_{S^c} on X_S are not too large, that is, the "good" variables S are not too highly correlated with the nuisance variables S^c.

Regarding the coefficients themselves, the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero, and in general they are not consistent.⁵ One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an unrestricted linear model to the selected set of features. This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors.

⁵Statistical consistency means that as the sample size grows, the estimates converge to the true values.
Since the variables in the second step have less "competition" from noise variables, cross-validation will tend to pick a smaller value for λ, and hence their coefficients will be shrunken less than those in the initial estimate.

Alternatively, one can modify the lasso penalty function so that larger coefficients are shrunken less severely; the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2005) replaces λ|β| by J_a(β, λ), where

    \frac{dJ_a(\beta,\lambda)}{d\beta} = \lambda \cdot \mathrm{sign}(\beta)\left[ I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{(a-1)\lambda} I(|\beta| > \lambda) \right]   (3.82)

for some a ≥ 2. The second term in square braces reduces the amount of shrinkage in the lasso for larger values of β, with ultimately no shrinkage as a → ∞. Figure 3.20 shows the SCAD penalty, along with the lasso and |β|^{1−ν}.

[FIGURE 3.20. The lasso and two alternative non-convex penalties designed to penalize large coefficients less. For SCAD we use λ = 1 and a = 4, and ν = 1/2 in the last panel.]

However this criterion is non-convex, which is a drawback since it makes the computation much more difficult. The adaptive lasso (Zou, 2006) uses a weighted penalty of the form \sum_{j=1}^p w_j|\beta_j|, where w_j = 1/|\hat\beta_j|^\nu, \hat\beta_j is the ordinary least squares estimate and ν > 0. This is a practical approximation to the |β|^q penalties (q = 1 − ν here) discussed in Section 3.4.3. The adaptive lasso yields consistent estimates of the parameters while retaining the attractive convexity property of the lasso.

3.8.6 Pathwise Coordinate Optimization

An alternative approach to the LARS algorithm for computing the lasso solution is simple coordinate descent. This idea was proposed by Fu (1998) and Daubechies et al. (2004), and later studied and generalized by Friedman et al. (2007b), Wu and Lange (2008) and others. The idea is to fix the penalty parameter λ in the Lagrangian form (3.52) and optimize successively over each parameter, holding the other parameters fixed at their current values.

Suppose the predictors are all standardized to have mean zero and unit norm. Denote by \tilde\beta_k(\lambda) the current estimate for β_k at penalty parameter λ. We can rearrange (3.52) to isolate β_j,

    R(\tilde\beta(\lambda), \beta_j) = \frac{1}{2}\sum_{i=1}^N \Big( y_i - \sum_{k \ne j} x_{ik}\tilde\beta_k(\lambda) - x_{ij}\beta_j \Big)^2 + \lambda \sum_{k \ne j} |\tilde\beta_k(\lambda)| + \lambda|\beta_j|,   (3.83)

where we have suppressed the intercept and introduced a factor 1/2 for convenience. This can be viewed as a univariate lasso problem with response variable the partial residual y_i − \tilde y_i^{(j)} = y_i − \sum_{k \ne j} x_{ik}\tilde\beta_k(\lambda). This has an explicit solution, resulting in the update

    \tilde\beta_j(\lambda) \leftarrow S\Big( \sum_{i=1}^N x_{ij}\big(y_i - \tilde y_i^{(j)}\big), \; \lambda \Big).   (3.84)

Here S(t, λ) = sign(t)(|t| − λ)_+ is the soft-thresholding operator in Table 3.4 on page 71. The first argument to S(·) is the simple least-squares coefficient of the partial residual on the standardized variable x_{ij}. Repeated iteration of (3.84), cycling through each variable in turn until convergence, yields the lasso estimate \hat\beta(\lambda).

We can also use this simple algorithm to efficiently compute the lasso solutions at a grid of values of λ. We start with the smallest value λ_max for which \hat\beta(\lambda_{max}) = 0, decrease it a little and cycle through the variables until convergence. Then λ is decreased again and the process is repeated, using the previous solution as a "warm start" for the new value of λ. This can be faster than the LARS algorithm, especially in large problems.

5 Statistical consistency means as the sample size grows, the estimates converge to the true values.
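A minimal numpy sketch of update (3.84) with warm starts over a decreasing grid; the data and function names are ours, and convergence checking is omitted for brevity.

```python
import numpy as np

def soft_threshold(t, lam):
    # S(t, lambda) = sign(t) (|t| - lambda)_+, Table 3.4
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def lasso_cd(X, y, lam, beta=None, n_sweeps=200):
    """Coordinate descent for the lasso at one value of lambda (update 3.84).
    Assumes the columns of X have mean zero and unit norm, and y is centered.
    Pass `beta` to warm-start from a previous solution."""
    p = X.shape[1]
    beta = np.zeros(p) if beta is None else beta.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            beta[j] = soft_threshold(X[:, j] @ r_j, lam)
    return beta

# solutions over a decreasing grid, warm-starting at each step
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10)); X -= X.mean(0); X /= np.linalg.norm(X, axis=0)
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(50); y -= y.mean()
lam_max = np.max(np.abs(X.T @ y))        # smallest lambda giving beta = 0
beta = np.zeros(10)
for lam in np.linspace(0.9 * lam_max, 0.1 * lam_max, 9):
    beta = lasso_cd(X, y, lam, beta)
    print(round(lam, 2), beta.round(2))
```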
A key to its speed is the fact that the quantities in (3.84) can be updated quickly as j varies, and often the update is to leave \tilde\beta_j = 0. On the other hand, it delivers solutions over a grid of λ values, rather than the entire solution path. The same kind of algorithm can be applied to the elastic net, the grouped lasso and many other models in which the penalty is a sum of functions of the individual parameters (Friedman et al., 2008a). It can also be applied, with some substantial modifications, to the fused lasso (Section 18.4.2); details are in Friedman et al. (2007b).

3.9 Computational Considerations

Least squares fitting is usually done via the Cholesky decomposition of the matrix X^TX or a QR decomposition of X. With N observations and p features, the Cholesky decomposition requires p³ + Np²/2 operations, while the QR decomposition requires Np² operations. Depending on the relative size of N and p, the Cholesky can sometimes be faster; on the other hand, it can be less numerically stable (Lawson and Hansen, 1974). Computation of the lasso via the LAR algorithm has the same order of computation as a least squares fit.

Bibliographic Notes

Linear regression is discussed in many statistics books, for example, Seber (1984), Weisberg (1980) and Mardia et al. (1979). Ridge regression was introduced by Hoerl and Kennard (1970), while the lasso was proposed by Tibshirani (1996). Around the same time, lasso-type penalties were proposed in the basis pursuit method for signal processing (Chen et al., 1998). The least angle regression procedure was proposed in Efron et al. (2004); related to this is the earlier homotopy procedure of Osborne et al. (2000a) and Osborne et al. (2000b). Their algorithm also exploits the piecewise linearity used in the LAR/lasso algorithm, but lacks its transparency. The criterion for the forward stagewise procedure is discussed in Hastie et al. (2007). Park and Hastie (2007) develop a path algorithm similar to least angle regression for generalized regression models. Partial least squares was introduced by Wold (1975). Comparisons of shrinkage methods may be found in Copas (1983) and Frank and Friedman (1993).

Exercises

Ex. 3.1 Show that the F statistic (3.13) for dropping a single coefficient from a model is equal to the square of the corresponding z-score (3.12).

Ex. 3.2 Given data on two variables X and Y, consider fitting a cubic polynomial regression model f(X) = \sum_{j=0}^3 \beta_j X^j. In addition to plotting the fitted curve, you would like a 95% confidence band about the curve. Consider the following two approaches:

1. At each point x_0, form a 95% confidence interval for the linear function a^T\beta = \sum_{j=0}^3 \beta_j x_0^j.
2. Form a 95% confidence set for β as in (3.15), which in turn generates confidence intervals for f(x_0).

How do these approaches differ? Which band is likely to be wider? Conduct a small simulation experiment to compare the two methods.

Ex. 3.3 Gauss–Markov theorem:

(a) Prove the Gauss–Markov theorem: the least squares estimate of a parameter a^Tβ has variance no bigger than that of any other linear unbiased estimate of a^Tβ (Section 3.2.2).

(b) The matrix inequality B ⪯ A holds if A − B is positive semidefinite. Show that if \hat V is the variance-covariance matrix of the least squares estimate of β and \tilde V is the variance-covariance matrix of any other linear unbiased estimate, then \hat V ⪯ \tilde V.
Ex. 3.4 Show how the vector of least squares coefficients can be obtained from a single pass of the Gram–Schmidt procedure (Algorithm 3.1). Represent your solution in terms of the QR decomposition of X.

Ex. 3.5 Consider the ridge regression problem (3.41). Show that this problem is equivalent to the problem

    \hat\beta^c = \arg\min_{\beta^c} \left\{ \sum_{i=1}^N \Big[ y_i - \beta_0^c - \sum_{j=1}^p (x_{ij} - \bar x_j)\beta_j^c \Big]^2 + \lambda \sum_{j=1}^p (\beta_j^c)^2 \right\}.   (3.85)

Give the correspondence between β^c and the original β in (3.41). Characterize the solution to this modified criterion. Show that a similar result holds for the lasso.

Ex. 3.6 Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior β ∼ N(0, τI), and Gaussian sampling model y ∼ N(Xβ, σ²I). Find the relationship between the regularization parameter λ in the ridge formula, and the variances τ and σ².

Ex. 3.7 Assume y_i ∼ N(β_0 + x_i^Tβ, σ²), i = 1, 2, ..., N, and the parameters β_j are each distributed as N(0, τ²), independently of one another. Assuming σ² and τ² are known, show that the (minus) log-posterior density of β is proportional to \sum_{i=1}^N (y_i - \beta_0 - \sum_j x_{ij}\beta_j)^2 + \lambda\sum_{j=1}^p \beta_j^2, where λ = σ²/τ².

Ex. 3.8 Consider the QR decomposition of the uncentered N × (p+1) matrix X (whose first column is all ones), and the SVD of the N × p centered matrix \tilde X. Show that Q_2 and U span the same subspace, where Q_2 is the sub-matrix of Q with the first column removed. Under what circumstances will they be the same, up to sign flips?

Ex. 3.9 Forward stepwise regression. Suppose we have the QR decomposition for the N × q matrix X_1 in a multiple regression problem with response y, and we have an additional p − q predictors in the matrix X_2. Denote the current residual by r. We wish to establish which one of these additional variables will reduce the residual sum-of-squares the most when included with those in X_1. Describe an efficient procedure for doing this.

Ex. 3.10 Backward stepwise regression. Suppose we have the multiple regression fit of y on X, along with the standard errors and Z-scores as in Table 3.2. We wish to establish which variable, when dropped, will increase the residual sum-of-squares the least. How would you do this?

Ex. 3.11 Show that the solution to the multivariate linear regression problem (3.40) is given by (3.39). What happens if the covariance matrices Σ_i are different for each observation?

Ex. 3.12 Show that the ridge regression estimates can be obtained by ordinary least squares regression on an augmented data set. We augment the centered matrix X with p additional rows √λ I, and augment y with p zeros. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients toward zero. This is related to the idea of hints due to Abu-Mostafa (1995), where model constraints are implemented by adding artificial data examples that satisfy them.

Ex. 3.13 Derive the expression (3.62), and show that \hat\beta^{pcr}(p) = \hat\beta^{ls}.

Ex. 3.14 Show that in the orthogonal case, PLS stops after m = 1 steps, because subsequent \hat\varphi_{mj} in step 2 in Algorithm 3.3 are zero.

Ex. 3.15 Verify expression (3.64), and hence show that the partial least squares directions are a compromise between the ordinary regression coefficient and the principal component directions.

Ex. 3.16 Derive the entries in Table 3.4, the explicit forms for estimators in the orthogonal case.

Ex. 3.17 Repeat the analysis of Table 3.3 on the spam data discussed in Chapter 1.
Ex. 3.18 Read about conjugate gradient algorithms (Murray et al., 1981, for example), and establish a connection between these algorithms and partial least squares.

Ex. 3.19 Show that ‖\hat\beta^{ridge}‖ increases as its tuning parameter λ → 0. Does the same property hold for the lasso and partial least squares estimates? For the latter, consider the "tuning parameter" to be the successive steps in the algorithm.

Ex. 3.20 Consider the canonical-correlation problem (3.67). Show that the leading pair of canonical variates u_1 and v_1 solve the problem

    \max_{u^T(Y^TY)u = 1, \; v^T(X^TX)v = 1} u^T(Y^TX)v,   (3.86)

a generalized SVD problem. Show that the solution is given by u_1 = (Y^TY)^{-1/2}u_1^* and v_1 = (X^TX)^{-1/2}v_1^*, where u_1^* and v_1^* are the leading left and right singular vectors in

    (Y^TY)^{-1/2}(Y^TX)(X^TX)^{-1/2} = U^*D^*V^{*T}.   (3.87)

Show that the entire sequence u_m, v_m, m = 1, ..., min(K, p) is also given by (3.87).

Ex. 3.21 Show that the solution to the reduced-rank regression problem (3.68), with Σ estimated by Y^TY/N, is given by (3.69). Hint: Transform Y to Y^* = YΣ^{-1/2}, and solve in terms of the canonical vectors u_m^*. Show that U_m = Σ^{-1/2}U_m^*, and a generalized inverse is U_m^- = U_m^{*T}Σ^{1/2}.

Ex. 3.22 Show that the solution in Exercise 3.21 does not change if Σ is estimated by the more natural quantity (Y − X\hat B)^T(Y − X\hat B)/(N − pK).

Ex. 3.23 Consider a regression problem with all variables and response having mean zero and standard deviation one. Suppose also that each variable has identical absolute correlation with the response:

    \frac{1}{N}|\langle x_j, y\rangle| = \lambda, \quad j = 1, ..., p.

Let \hat\beta be the least-squares coefficient of y on X, and let u(α) = αX\hat\beta for α ∈ [0, 1] be the vector that moves a fraction α toward the least squares fit u. Let RSS be the residual sum-of-squares from the full least squares fit.

(a) Show that

    \frac{1}{N}|\langle x_j, y - u(\alpha)\rangle| = (1 - \alpha)\lambda, \quad j = 1, ..., p,

and hence the correlations of each x_j with the residuals remain equal in magnitude as we progress toward u.

(b) Show that these correlations are all equal to

    \lambda(\alpha) = \frac{(1-\alpha)}{\sqrt{(1-\alpha)^2 + \frac{\alpha(2-\alpha)}{N}\cdot \mathrm{RSS}}} \cdot \lambda,

and hence they decrease monotonically to zero.

(c) Use these results to show that the LAR algorithm in Section 3.4.4 keeps the correlations tied and monotonically decreasing, as claimed in (3.55).

Ex. 3.24 LAR directions. Using the notation around equation (3.55) on page 74, show that the LAR direction makes an equal angle with each of the predictors in A_k.

Ex. 3.25 LAR look-ahead (Efron et al., 2004, Sec. 2). Starting at the beginning of the kth step of the LAR algorithm, derive expressions to identify the next variable to enter the active set at step k + 1, and the value of α at which this occurs (using the notation around equation (3.55) on page 74).

Ex. 3.26 Forward stepwise regression enters the variable at each step that most reduces the residual sum-of-squares. LAR adjusts variables that have the most (absolute) correlation with the current residuals. Show that these two entry criteria are not necessarily the same. [Hint: let x_{j.A} be the jth variable, linearly adjusted for all the variables currently in the model. Show that the first criterion amounts to identifying the j for which Cor(x_{j.A}, r) is largest in magnitude.]

Ex. 3.27 Lasso and LAR: Consider the lasso problem in Lagrange multiplier form: with L(β) = \sum_i (y_i - \sum_j x_{ij}\beta_j)^2, we minimize

    L(\beta) + \lambda\sum_j |\beta_j|   (3.88)

for fixed λ > 0.

(a) Setting β_j = β_j^+ − β_j^- with β_j^+, β_j^- ≥ 0, expression (3.88) becomes L(β) + λ\sum_j(β_j^+ + β_j^-).
Show that the Lagrange dual function is

    L(\beta) + \lambda\sum_j(\beta_j^+ + \beta_j^-) - \sum_j \lambda_j^+\beta_j^+ - \sum_j \lambda_j^-\beta_j^-   (3.89)

and the Karush–Kuhn–Tucker optimality conditions are

    \nabla L(\beta)_j + \lambda - \lambda_j^+ = 0
    -\nabla L(\beta)_j + \lambda - \lambda_j^- = 0
    \lambda_j^+\beta_j^+ = 0
    \lambda_j^-\beta_j^- = 0,

along with the non-negativity constraints on the parameters and all the Lagrange multipliers.

(b) Show that |\nabla L(\beta)_j| \le \lambda \; \forall j, and that the KKT conditions imply one of the following three scenarios:

    \lambda = 0 \;\Rightarrow\; \nabla L(\beta)_j = 0 \; \forall j
    \beta_j^+ > 0, \lambda > 0 \;\Rightarrow\; \lambda_j^+ = 0, \; \nabla L(\beta)_j = -\lambda < 0, \; \beta_j^- = 0
    \beta_j^- > 0, \lambda > 0 \;\Rightarrow\; \lambda_j^- = 0, \; \nabla L(\beta)_j = \lambda > 0, \; \beta_j^+ = 0.

Hence show that for any "active" predictor having \hat\beta_j \ne 0, we must have \nabla L(\beta)_j = -\lambda if \hat\beta_j > 0, and \nabla L(\beta)_j = \lambda if \hat\beta_j < 0. Assuming the predictors are standardized, relate λ to the correlation between the jth predictor and the current residuals.

(c) Suppose that the set of active predictors is unchanged for λ_0 ≥ λ ≥ λ_1. Show that there is a vector γ_0 such that

    \hat\beta(\lambda) = \hat\beta(\lambda_0) - (\lambda - \lambda_0)\gamma_0.   (3.90)

Thus the lasso solution path is linear as λ ranges from λ_0 to λ_1 (Efron et al., 2004; Rosset and Zhu, 2007).

Ex. 3.28 Suppose for a given t in (3.51), the fitted lasso coefficient for variable X_j is \hat\beta_j = a. Suppose we augment our set of variables with an identical copy X_j^* = X_j. Characterize the effect of this exact collinearity by describing the set of solutions for \hat\beta_j and \hat\beta_j^*, using the same value of t.

Ex. 3.29 Suppose we run a ridge regression with parameter λ on a single variable X, and get coefficient a. We now include an exact copy X^* = X, and refit our ridge regression. Show that both coefficients are identical, and derive their value. Show in general that if m copies of a variable X_j are included in a ridge regression, their coefficients are all the same.

Ex. 3.30 Consider the elastic-net optimization problem:

    \min_\beta \|y - X\beta\|^2 + \lambda\big[\alpha\|\beta\|_2^2 + (1-\alpha)\|\beta\|_1\big].   (3.91)

Show how one can turn this into a lasso problem, using an augmented version of X and y.

4 Linear Methods for Classification

4.1 Introduction

In this chapter we revisit the classification problem and focus on linear methods for classification. Since our predictor G(x) takes values in a discrete set G, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these decision boundaries are linear; this is what we will mean by linear methods for classification.

There are several different ways in which linear decision boundaries can be found. In Chapter 2 we fit linear regression models to the class indicator variables, and classify to the largest fit. Suppose there are K classes, for convenience labeled 1, 2, ..., K, and the fitted linear model for the kth indicator response variable is \hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^T x. The decision boundary between class k and ℓ is that set of points for which \hat f_k(x) = \hat f_\ell(x), that is, the set \{x : (\hat\beta_{k0} - \hat\beta_{\ell 0}) + (\hat\beta_k - \hat\beta_\ell)^T x = 0\}, an affine set or hyperplane.1 Since the same is true for any pair of classes, the input space is divided into regions of constant classification, with piecewise hyperplanar decision boundaries. This regression approach is a member of a class of methods that model discriminant functions δ_k(x) for each class, and then classify x to the class with the largest value for its discriminant function.
Methods that model the posterior probabilities Pr(G = k|X = x) are also in this class. Clearly, if either the δ_k(x) or Pr(G = k|X = x) are linear in x, then the decision boundaries will be linear.

Actually, all we require is that some monotone transformation of δ_k or Pr(G = k|X = x) be linear for the decision boundaries to be linear. For example, if there are two classes, a popular model for the posterior probabilities is

    Pr(G = 1|X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)},
    Pr(G = 2|X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}.   (4.1)

Here the monotone transformation is the logit transformation: log[p/(1 − p)], and in fact we see that

    \log\frac{Pr(G = 1|X = x)}{Pr(G = 2|X = x)} = \beta_0 + \beta^T x.   (4.2)

The decision boundary is the set of points for which the log-odds are zero, and this is a hyperplane defined by \{x \,|\, \beta_0 + \beta^T x = 0\}. We discuss two very popular but different methods that result in linear log-odds or logits: linear discriminant analysis and linear logistic regression. Although they differ in their derivation, the essential difference between them is in the way the linear function is fit to the training data.

A more direct approach is to explicitly model the boundaries between the classes as linear. For a two-class problem in a p-dimensional input space, this amounts to modeling the decision boundary as a hyperplane; in other words, a normal vector and a cut-point. We will look at two methods that explicitly look for "separating hyperplanes." The first is the well-known perceptron model of Rosenblatt (1958), with an algorithm that finds a separating hyperplane in the training data, if one exists. The second method, due to Vapnik (1996), finds an optimally separating hyperplane if one exists, else finds a hyperplane that minimizes some measure of overlap in the training data. We treat the separable case here, and defer treatment of the nonseparable case to Chapter 12.

While this entire chapter is devoted to linear decision boundaries, there is considerable scope for generalization. For example, we can expand our variable set X_1, ..., X_p by including their squares and cross-products X_1², X_2², ..., X_1X_2, ..., thereby adding p(p+1)/2 additional variables. Linear functions in the augmented space map down to quadratic functions in the original space; hence linear decision boundaries become quadratic decision boundaries. Figure 4.1 illustrates the idea. The data are the same: the left plot uses linear decision boundaries in the two-dimensional space shown, while the right plot uses linear decision boundaries in the augmented five-dimensional space described above.

1 Strictly speaking, a hyperplane passes through the origin, while an affine set need not. We sometimes ignore the distinction and refer in general to hyperplanes.
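As a concrete sketch of this augmentation (our own toy data, with scikit-learn's LDA standing in for any linear classifier), expanding two inputs to the five-dimensional basis (X1, X2, X1X2, X1², X2²) lets a linear method draw quadratic boundaries in the original coordinates:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# two classes separated by a circle: not linearly separable in (X1, X2)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2).astype(int)

def quad_basis(X):
    """Augment (X1, X2) to (X1, X2, X1*X2, X1^2, X2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

linear = LinearDiscriminantAnalysis().fit(X, y)
augmented = LinearDiscriminantAnalysis().fit(quad_basis(X), y)
print("linear space accuracy:   ", linear.score(X, y))
print("augmented space accuracy:", augmented.score(quad_basis(X), y))
```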
This approach can be used with any basis transformation h(X) where h : IR^p → IR^q with q > p, and will be explored in later chapters.

[FIGURE 4.1. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries. These were obtained by finding linear boundaries in the five-dimensional space X1, X2, X1X2, X1², X2². Linear inequalities in this space are quadratic inequalities in the original space.]

4.2 Linear Regression of an Indicator Matrix

Here each of the response categories is coded via an indicator variable. Thus if G has K classes, there will be K such indicators Y_k, k = 1, ..., K, with Y_k = 1 if G = k else 0. These are collected together in a vector Y = (Y_1, ..., Y_K), and the N training instances of these form an N × K indicator response matrix Y. Y is a matrix of 0's and 1's, with each row having a single 1. We fit a linear regression model to each of the columns of Y simultaneously, and the fit is given by

    \hat{\mathbf{Y}} = X(X^TX)^{-1}X^T\mathbf{Y}.   (4.3)

Chapter 3 has more details on linear regression.
Note that we have a coefficient vector for each response column y_k, and hence a (p+1) × K coefficient matrix \hat{\mathbf{B}} = (X^TX)^{-1}X^T\mathbf{Y}. Here X is the model matrix with p+1 columns corresponding to the p inputs, and a leading column of 1's for the intercept.

A new observation with input x is classified as follows:

• compute the fitted output \hat f(x) = [(1, x)\hat{\mathbf{B}}]^T, a K vector;
• identify the largest component and classify accordingly:

    \hat G(x) = \arg\max_{k \in \mathcal{G}} \hat f_k(x).   (4.4)

What is the rationale for this approach? One rather formal justification is to view the regression as an estimate of conditional expectation. For the random variable Y_k, E(Y_k|X = x) = Pr(G = k|X = x), so conditional expectation of each of the Y_k seems a sensible goal. The real issue is: how good an approximation to conditional expectation is the rather rigid linear regression model? Alternatively, are the \hat f_k(x) reasonable estimates of the posterior probabilities Pr(G = k|X = x), and more importantly, does this matter?

It is quite straightforward to verify that \sum_{k \in \mathcal{G}} \hat f_k(x) = 1 for any x, as long as there is an intercept in the model (column of 1's in X). However, the \hat f_k(x) can be negative or greater than 1, and typically some are. This is a consequence of the rigid nature of linear regression, especially if we make predictions outside the hull of the training data. These violations in themselves do not guarantee that this approach will not work, and in fact on many problems it gives similar results to more standard linear methods for classification. If we allow linear regression onto basis expansions h(X) of the inputs, this approach can lead to consistent estimates of the probabilities. As the size of the training set N grows bigger, we adaptively include more basis elements so that linear regression onto these basis functions approaches conditional expectation. We discuss such approaches in Chapter 5.

A more simplistic viewpoint is to construct targets t_k for each class, where t_k is the kth column of the K × K identity matrix. Our prediction problem is to try to reproduce the appropriate target for an observation. With the same coding as before, the response vector y_i (ith row of Y) for observation i has the value y_i = t_k if g_i = k. We might then fit the linear model by least squares:

    \min_{\mathbf{B}} \sum_{i=1}^N \|y_i - [(1, x_i)\mathbf{B}]^T\|^2.   (4.5)

The criterion is a sum-of-squared Euclidean distances of the fitted vectors from their targets. A new observation is classified by computing its fitted vector \hat f(x) and classifying to the closest target:

    \hat G(x) = \arg\min_k \|\hat f(x) - t_k\|^2.   (4.6)

This is exactly the same as the previous approach:

• The sum-of-squared-norm criterion is exactly the criterion for multiple response linear regression, just viewed slightly differently. Since a squared norm is itself a sum of squares, the components decouple and can be rearranged as a separate linear model for each element.
Note that this is only possible because there is nothing in the model that binds the different responses together.
• The closest target classification rule (4.6) is easily seen to be exactly the same as the maximum fitted component criterion (4.4), but does require that the fitted values sum to 1.

[FIGURE 4.2. The data come from three classes in IR² and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).]

There is a serious problem with the regression approach when the number of classes K ≥ 3, especially prevalent when K is large. Because of the rigid nature of the regression model, classes can be masked by others. Figure 4.2 illustrates an extreme situation when K = 3. The three classes are perfectly separated by linear decision boundaries, yet linear regression misses the middle class completely.
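A minimal sketch of the indicator-matrix approach (4.3)–(4.4) on synthetic data of our own, arranged like Figure 4.2 with three roughly collinear class centroids; the middle class is essentially never the argmax:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 200
# three classes with centroids lined up along a direction, as in Figure 4.2
centroids = np.array([[-4.0, -4.0], [0.0, 0.0], [4.0, 4.0]])
X = np.vstack([c + rng.standard_normal((n, 2)) for c in centroids])
g = np.repeat(np.arange(K), n)

Y = np.eye(K)[g]                             # N x K indicator response matrix
X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
B = np.linalg.lstsq(X1, Y, rcond=None)[0]    # (p+1) x K coefficient matrix
G_hat = np.argmax(X1 @ B, axis=1)            # classify to the largest fit

for k in range(K):
    print(f"class {k}: fraction predicted as such "
          f"{np.mean(G_hat[g == k] == k):.2f}")
# the middle class (k = 1) is masked: almost none of its points are
# assigned to it, even though the classes are linearly separable
```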
In Figure 4.3 we have projected the data onto the line joining the three centroids (there is no information in the orthogonal direction in this case), and we have included and coded the three response variables Y_1, Y_2 and Y_3. The three regression lines (left panel) are included, and we see that the line corresponding to the middle class is horizontal and its fitted values are never dominant! Thus, observations from class 2 are classified either as class 1 or class 3. The right panel uses quadratic regression rather than linear regression. For this simple example a quadratic rather than linear fit (for the middle class at least) would solve the problem. However, it can be seen that if there were four rather than three classes lined up like this, a quadratic would not come down fast enough, and a cubic would be needed as well. A loose but general rule is that if K ≥ 3 classes are lined up, polynomial terms up to degree K − 1 might be needed to resolve them.

[FIGURE 4.3. The effects of masking on linear regression in IR for a three-class problem. Panel titles: "Degree = 1; Error = 0.33" and "Degree = 2; Error = 0.04". The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the blue class, y_blue is 1 for the blue observations, and 0 for the green and orange. The fits are linear and quadratic polynomials. Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.]

Note also that these are polynomials along the derived direction passing through the centroids, which can have arbitrary orientation. So in p-dimensional input space, one would need general polynomial terms and cross-products of total degree K − 1, O(p^{K−1}) terms in all, to resolve such worst-case scenarios.
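To see the degree rule in action, here is a sketch on our own one-dimensional data in the spirit of Figure 4.3: linear indicator regressions mask the middle class, while adding a quadratic term recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# three classes along a line, class 1 in the middle
x = np.concatenate([rng.normal(m, 0.4, n) for m in (-2.0, 0.0, 2.0)])
g = np.repeat([0, 1, 2], n)
Y = np.eye(3)[g]

def indicator_fit(design):
    """Least-squares fit to the indicator matrix; classify to the largest fit."""
    B = np.linalg.lstsq(design, Y, rcond=None)[0]
    return np.argmax(design @ B, axis=1)

linear_design = np.column_stack([np.ones_like(x), x])
quad_design = np.column_stack([np.ones_like(x), x, x ** 2])

for name, D in [("degree 1", linear_design), ("degree 2", quad_design)]:
    print(f"{name}: training error {np.mean(indicator_fit(D) != g):.2f}")
# degree 1 misclassifies essentially all of the middle class (error near 1/3);
# degree 2 resolves it, mirroring the two panels of Figure 4.3
```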
The example is extreme, but for large K and small p such maskings naturally occur. As a more realistic illustration, Figure 4.4 is a projection of the training data for a vowel recognition problem onto an informative two-dimensional subspace. There are K = 11 classes in p = 10 dimensions. This is a difficult classification problem, and the best methods achieve around 40% errors on the test data. The main point here is summarized in Table 4.1; linear regression has an error rate of 67%, while a close relative, linear discriminant analysis, has an error rate of 56%. It seems that masking has hurt in this case. While all the other methods in this chapter are based on linear functions of x as well, they use them in such a way that avoids this masking problem.

[FIGURE 4.4. A two-dimensional plot of the vowel training data. There are eleven classes with X ∈ IR¹⁰, and this is the best view in terms of an LDA model (Section 4.3.3). The heavy circles are the projected mean vectors for each class. The class overlap is considerable.]

TABLE 4.1. Training and test error rates using a variety of linear techniques on the vowel data. There are eleven classes in ten dimensions, of which three account for 90% of the variance (via a principal components analysis). We see that linear regression is hurt by masking, increasing the test and training error by over 10%.

    Technique                          Training   Test
    Linear regression                    0.48     0.67
    Linear discriminant analysis         0.32     0.56
    Quadratic discriminant analysis      0.01     0.53
    Logistic regression                  0.22     0.51

4.3 Linear Discriminant Analysis

Decision theory for classification (Section 2.4) tells us that we need to know the class posteriors Pr(G|X) for optimal classification. Suppose f_k(x) is the class-conditional density of X in class G = k, and let π_k be the prior probability of class k, with \sum_{k=1}^K \pi_k = 1. A simple application of Bayes' theorem gives us

    Pr(G = k|X = x) = \frac{f_k(x)\pi_k}{\sum_{\ell=1}^K f_\ell(x)\pi_\ell}.   (4.7)

We see that in terms of ability to classify, having the f_k(x) is almost equivalent to having the quantity Pr(G = k|X = x).
Many techniques are based on models for the class densities:

• linear and quadratic discriminant analysis use Gaussian densities;
• more flexible mixtures of Gaussians allow for nonlinear decision boundaries (Section 6.8);
• general nonparametric density estimates for each class density allow the most flexibility (Section 6.6.2);
• Naive Bayes models are a variant of the previous case, and assume that each of the class densities are products of marginal densities; that is, they assume that the inputs are conditionally independent in each class (Section 6.6.3).

Suppose that we model each class density as multivariate Gaussian

    f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}.   (4.8)

Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix Σ_k = Σ ∀k. In comparing two classes k and ℓ, it is sufficient to look at the log-ratio, and we see that

    \log\frac{Pr(G = k|X = x)}{Pr(G = \ell|X = x)} = \log\frac{f_k(x)}{f_\ell(x)} + \log\frac{\pi_k}{\pi_\ell}
        = \log\frac{\pi_k}{\pi_\ell} - \frac{1}{2}(\mu_k + \mu_\ell)^T\Sigma^{-1}(\mu_k - \mu_\ell) + x^T\Sigma^{-1}(\mu_k - \mu_\ell),   (4.9)

an equation linear in x. The equal covariance matrices cause the normalization factors to cancel, as well as the quadratic part in the exponents. This linear log-odds function implies that the decision boundary between classes k and ℓ (the set where Pr(G = k|X = x) = Pr(G = ℓ|X = x)) is linear in x; in p dimensions a hyperplane. This is of course true for any pair of classes, so all the decision boundaries are linear. If we divide IR^p into regions that are classified as class 1, class 2, etc., these regions will be separated by hyperplanes. Figure 4.5 (left panel) shows an idealized example with three classes and p = 2. Here the data do arise from three Gaussian distributions with a common covariance matrix. We have included in the figure the contours corresponding to 95% highest probability density, as well as the class centroids.

[FIGURE 4.5. The left panel shows three Gaussian distributions, with the same covariance and different means. Included are the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right we see a sample of 30 drawn from each Gaussian distribution, and the fitted LDA decision boundaries.]

Notice that the decision boundaries are not the perpendicular bisectors of the line segments joining the centroids. This would be the case if the covariance Σ were spherical σ²I, and the class priors were equal. From (4.9) we see that the linear discriminant functions

    \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k   (4.10)

are an equivalent description of the decision rule, with G(x) = \arg\max_k \delta_k(x). In practice we do not know the parameters of the Gaussian distributions, and will need to estimate them using our training data:

• \hat\pi_k = N_k/N, where N_k is the number of class-k observations;
• \hat\mu_k = \sum_{g_i = k} x_i / N_k;
• \hat\Sigma = \sum_{k=1}^K \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K).

Figure 4.5 (right panel) shows the estimated decision boundaries based on a sample of size 30 each from three Gaussian distributions. Figure 4.1 on page 103 is another example, but here the classes are not Gaussian.
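A minimal numpy sketch of these plug-in estimates and the discriminant functions (4.10); the data and names are our own.

```python
import numpy as np

def lda_fit(X, g, K):
    """Plug-in LDA estimates: class priors, centroids, pooled covariance."""
    N, p = X.shape
    pi = np.array([np.mean(g == k) for k in range(K)])
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = sum((X[g == k] - mu[k]).T @ (X[g == k] - mu[k])
                for k in range(K)) / (N - K)
    return pi, mu, np.linalg.inv(Sigma)

def lda_predict(X, pi, mu, Sigma_inv):
    """delta_k(x) = x^T S^{-1} mu_k - 0.5 mu_k^T S^{-1} mu_k + log pi_k."""
    deltas = (X @ Sigma_inv @ mu.T
              - 0.5 * np.sum(mu @ Sigma_inv * mu, axis=1)
              + np.log(pi))
    return np.argmax(deltas, axis=1)

# sample of 30 per class from three Gaussians with a common covariance
rng = np.random.default_rng(0)
means = np.array([[0, 0], [3, 2], [0, 4]])
X = np.vstack([m + rng.standard_normal((30, 2)) for m in means])
g = np.repeat(np.arange(3), 30)
pi, mu, Sinv = lda_fit(X, g, K=3)
print("training accuracy:", np.mean(lda_predict(X, pi, mu, Sinv) == g))
```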
With two classes there is a simple correspondence between linear discriminant analysis and classification by linear least squares, as in (4.5). The LDA rule classifies to class 2 if

    x^T\hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) > \frac{1}{2}\hat\mu_2^T\hat\Sigma^{-1}\hat\mu_2 - \frac{1}{2}\hat\mu_1^T\hat\Sigma^{-1}\hat\mu_1 + \log(N_1/N) - \log(N_2/N)   (4.11)

and class 1 otherwise. Suppose we code the targets in the two classes as +1 and −1, respectively. It is easy to show that the coefficient vector from least squares is proportional to the LDA direction given in (4.11) (Exercise 4.2). [In fact, this correspondence occurs for any (distinct) coding of the targets; see Exercise 4.2.] However unless N_1 = N_2 the intercepts are different and hence the resulting decision rules are different.

Since this derivation of the LDA direction via least squares does not use a Gaussian assumption for the features, its applicability extends beyond the realm of Gaussian data. However the derivation of the particular intercept or cut-point given in (4.11) does require Gaussian data. Thus it makes sense to instead choose the cut-point that empirically minimizes training error for a given dataset. This is something we have found to work well in practice, but have not seen it mentioned in the literature.

With more than two classes, LDA is not the same as linear regression of the class indicator matrix, and it avoids the masking problems associated with that approach (Hastie et al., 1994). A correspondence between regression and LDA can be established through the notion of optimal scoring, discussed in Section 12.5.

Getting back to the general discriminant problem (4.8), if the Σ_k are not assumed to be equal, then the convenient cancellations in (4.9) do not occur; in particular the pieces quadratic in x remain. We then get quadratic discriminant functions (QDA),

    \delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k.   (4.12)

The decision boundary between each pair of classes k and ℓ is described by a quadratic equation \{x : \delta_k(x) = \delta_\ell(x)\}.

Figure 4.6 shows an example (from Figure 4.1 on page 103) where the three classes are Gaussian mixtures (Section 6.8) and the decision boundaries are approximated by quadratic equations in x. Here we illustrate two popular ways of fitting these quadratic boundaries. The right plot uses QDA as described here, while the left plot uses LDA in the enlarged five-dimensional quadratic polynomial space. The differences are generally small; QDA is the preferred approach, with the LDA method a convenient substitute.2

The estimates for QDA are similar to those for LDA, except that separate covariance matrices must be estimated for each class. When p is large this can mean a dramatic increase in parameters. Since the decision boundaries are functions of the parameters of the densities, counting the number of parameters must be done with care. For LDA, it seems there are (K − 1) × (p + 1) parameters, since we only need the differences δ_k(x) − δ_K(x)
between the discriminant functions, where K is some pre-chosen class (here we have chosen the last), and each difference requires p + 1 parameters.3 Likewise for QDA there will be (K − 1) × {p(p+3)/2 + 1} parameters. Both LDA and QDA perform well on an amazingly large and diverse set of classification tasks. For example, in the STATLOG project (Michie et al., 1994) LDA was among the top three classifiers for 7 of the 22 datasets, QDA among the top three for four datasets, and one of the pair were in the top three for 10 datasets. Both techniques are widely used, and entire books are devoted to LDA. It seems that whatever exotic tools are the rage of the day, we should always have available these two simple tools.

[FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X1, X2, X1X2, X1², X2²). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.]

2 For this figure and many similar figures in the book we compute the decision boundaries by an exhaustive contouring method. We compute the decision rule on a fine lattice of points, and then use contouring algorithms to compute the boundaries.
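A sketch of the QDA discriminant (4.12) with per-class covariance estimates, again on toy data of our own:

```python
import numpy as np

def qda_fit(X, g, K):
    """Per-class priors, means, and covariances for QDA."""
    pi = np.array([np.mean(g == k) for k in range(K)])
    mu = [X[g == k].mean(axis=0) for k in range(K)]
    Sigma = [np.cov(X[g == k].T) for k in range(K)]
    return pi, mu, Sigma

def qda_predict(X, pi, mu, Sigma):
    """delta_k(x) = -0.5 log|S_k| - 0.5 (x-mu_k)^T S_k^{-1} (x-mu_k) + log pi_k."""
    deltas = []
    for k in range(len(pi)):
        Sinv = np.linalg.inv(Sigma[k])
        d = X - mu[k]
        quad = np.einsum("ij,jk,ik->i", d, Sinv, d)   # (x-mu)^T S^{-1} (x-mu)
        _, logdet = np.linalg.slogdet(Sigma[k])
        deltas.append(-0.5 * logdet - 0.5 * quad + np.log(pi[k]))
    return np.argmax(np.column_stack(deltas), axis=1)

rng = np.random.default_rng(0)
# two classes with different covariances, so the Bayes boundary is quadratic
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], 100),
               rng.multivariate_normal([2, 2], [[3, 1], [1, 1]], 100)])
g = np.repeat([0, 1], 100)
print("training accuracy:", np.mean(qda_predict(X, *qda_fit(X, g, 2)) == g))
```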
The question arises why LDA and QDA have such a good track record. The reason is not likely to be that the data are approximately Gaussian, and in addition for LDA that the covariances are approximately equal. More likely a reason is that the data can only support simple decision boundaries such as linear or quadratic, and the estimates provided via the Gaussian models are stable. This is a bias-variance tradeoff: we can put up with the bias of a linear decision boundary because it can be estimated with much lower variance than more exotic alternatives. This argument is less believable for QDA, since it can have many parameters itself, although perhaps fewer than the non-parametric alternatives.

3 Although we fit the covariance matrix \hat\Sigma to compute the LDA discriminant functions, a much reduced function of it is all that is required to estimate the O(p) parameters needed to compute the decision boundaries.

[FIGURE 4.7. Test and training errors for the vowel data, using regularized discriminant analysis with a series of values of α ∈ [0, 1]. The optimum for the test data occurs around α = 0.9, close to quadratic discriminant analysis.]

4.3.1 Regularized Discriminant Analysis

Friedman (1989) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. These methods are very similar in flavor to ridge regression. The regularized covariance matrices have the form

    \hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1 - \alpha)\hat\Sigma,   (4.13)

where \hat\Sigma is the pooled covariance matrix as used in LDA. Here α ∈ [0, 1] allows a continuum of models between LDA and QDA, and needs to be specified. In practice α can be chosen based on the performance of the model on validation data, or by cross-validation.

Figure 4.7 shows the results of RDA applied to the vowel data. Both the training and test error improve with increasing α, although the test error increases sharply after α = 0.9. The large discrepancy between the training and test error is partly due to the fact that there are many repeat measurements on a small number of individuals, different in the training and test set.

Similar modifications allow \hat\Sigma itself to be shrunk toward the scalar covariance,

    \hat\Sigma(\gamma) = \gamma\hat\Sigma + (1 - \gamma)\hat\sigma^2 I   (4.14)

for γ ∈ [0, 1]. Replacing \hat\Sigma in (4.13) by \hat\Sigma(\gamma) leads to a more general family of covariances \hat\Sigma(\alpha, \gamma) indexed by a pair of parameters.

In Chapter 12, we discuss other regularized versions of LDA, which are more suitable when the data arise from digitized analog signals and images. In these situations the features are high-dimensional and correlated, and the LDA coefficients can be regularized to be smooth or sparse in the original domain of the signal. This leads to better generalization and allows for easier interpretation of the coefficients. In Chapter 18 we also deal with very high-dimensional problems, where for example the features are gene-expression measurements in microarray studies. There the methods focus on the case γ = 0 in (4.14), and other severely regularized versions of LDA.
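A small sketch of the two-parameter shrinkage family (4.13)–(4.14); the function names are ours, and the choice of \hat\sigma^2 as the average eigenvalue (trace/p) of the pooled covariance is our assumption, not specified above.

```python
import numpy as np

def rda_covariances(X, g, K, alpha, gamma):
    """Regularized class covariances Sigma_k(alpha, gamma): (4.13),
    with the pooled estimate itself shrunk toward a scalar as in (4.14)."""
    N, p = X.shape
    Sigma_k = [np.cov(X[g == k].T) for k in range(K)]
    Nk = [np.sum(g == k) for k in range(K)]
    pooled = sum((Nk[k] - 1) * Sigma_k[k] for k in range(K)) / (N - K)
    sigma2 = np.trace(pooled) / p                # assumed scalar level
    pooled_g = gamma * pooled + (1 - gamma) * sigma2 * np.eye(p)   # (4.14)
    return [alpha * Sigma_k[k] + (1 - alpha) * pooled_g            # (4.13)
            for k in range(K)]

# alpha = 0, gamma = 1 recovers LDA's pooled covariance; alpha = 1 gives QDA
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], 40),
               rng.multivariate_normal([2, 1], [[2, 0], [0, .5]], 40)])
g = np.repeat([0, 1], 40)
covs = rda_covariances(X, g, K=2, alpha=0.9, gamma=1.0)
print(covs[0].round(2))
```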
4.3.2 Computations for LDA

As a lead-in to the next topic, we briefly digress on the computations required for LDA and especially QDA. Their computations are simplified by diagonalizing \hat\Sigma or \hat\Sigma_k. For the latter, suppose we compute the eigendecomposition for each \hat\Sigma_k = U_k D_k U_k^T, where U_k is p × p orthonormal, and D_k a diagonal matrix of positive eigenvalues d_{kℓ}. Then the ingredients for δ_k(x) (4.12) are

• (x - \hat\mu_k)^T\hat\Sigma_k^{-1}(x - \hat\mu_k) = [U_k^T(x - \hat\mu_k)]^T D_k^{-1} [U_k^T(x - \hat\mu_k)];
• \log|\hat\Sigma_k| = \sum_\ell \log d_{k\ell}.

In light of the computational steps outlined above, the LDA classifier can be implemented by the following pair of steps:

• Sphere the data with respect to the common covariance estimate \hat\Sigma: X^* ← D^{-1/2}U^T X, where \hat\Sigma = UDU^T. The common covariance estimate of X^* will now be the identity.
• Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities π_k.

4.3.3 Reduced-Rank Linear Discriminant Analysis

So far we have discussed LDA as a restricted Gaussian classifier. Part of its popularity is due to an additional restriction that allows us to view informative low-dimensional projections of the data.

The K centroids in p-dimensional input space lie in an affine subspace of dimension ≤ K − 1, and if p is much larger than K, this will be a considerable drop in dimension. Moreover, in locating the closest centroid, we can ignore distances orthogonal to this subspace, since they will contribute equally to each class. Thus we might just as well project the X^* onto this centroid-spanning subspace H_{K−1}, and make distance comparisons there. Thus there is a fundamental dimension reduction in LDA, namely, that we need only consider the data in a subspace of dimension at most K − 1. If K = 3, for instance, this could allow us to view the data in a two-dimensional plot, color-coding the classes. In doing so we would not have relinquished any of the information needed for LDA classification. What if K > 3? We might then ask for an L < K − 1 dimensional subspace ...

... • if \hat\alpha_i > 0, then y_i(x_i^T\beta + \beta_0) = 1, or in other words, x_i is on the boundary of the slab;
• if y_i(x_i^T\beta + \beta_0) > 1, x_i is not on the boundary of the slab, and \hat\alpha_i = 0.

From (4.50) we see that the solution vector β is defined in terms of a linear combination of the support points x_i, those points defined to be on the boundary of the slab via \hat\alpha_i > 0. Figure 4.16 shows the optimal separating hyperplane for our toy example; there are three support points. Likewise, β_0 is obtained by solving (4.53) for any of the support points.

The optimal separating hyperplane produces a function \hat f(x) = x^T\hat\beta + \hat\beta_0 for classifying new observations:

    \hat G(x) = \mathrm{sign}\,\hat f(x).   (4.54)

[FIGURE 4.16. The same data as in Figure 4.14. The shaded region delineates the maximum margin separating the two classes. There are three support points indicated, which lie on the boundary of the margin, and the optimal separating hyperplane (blue line) bisects the slab. Included in the figure is the boundary found using logistic regression (red line), which is very close to the optimal separating hyperplane (see Section 12.3.3).]

Although none of the training observations fall in the margin (by construction), this will not necessarily be the case for test observations. The intuition is that a large margin on the training data will lead to good separation on the test data.
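The optimal separating hyperplane and its support points can be found numerically with a hard-margin SVM solver; here is a sketch using scikit-learn's SVC with a very large C to approximate the separable (hard-margin) problem, on separable toy data of our own:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two linearly separable clouds
X = np.vstack([rng.standard_normal((20, 2)) + [-3, -3],
               rng.standard_normal((20, 2)) + [3, 3]])
y = np.repeat([-1, 1], 20)

# a very large C approximates the hard-margin optimal separating hyperplane
svm = SVC(kernel="linear", C=1e6).fit(X, y)
beta, beta0 = svm.coef_[0], svm.intercept_[0]
print("support point indices:", svm.support_)   # the points with alpha_i > 0
print("margins of support points:",
      y[svm.support_] * (X[svm.support_] @ beta + beta0))  # all close to 1
# classify a new observation by G(x) = sign(x^T beta + beta0), as in (4.54)
print("new point class:", np.sign(np.array([0.5, 0.2]) @ beta + beta0))
```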
The description of the solution in terms of support points seems to suggest that the optimal hyperplane focuses more on the points that count, and is more robust to model misspecification. The LDA solution, on the other hand, depends on all of the data, even points far away from the decision boundary. Note, however, that the identification of these support points required the use of all the data. Of course, if the classes are really Gaussian, then LDA is optimal, and separating hyperplanes will pay a price for focusing on the (noisier) data at the boundaries of the classes.

Included in Figure 4.16 is the logistic regression solution to this problem, fit by maximum likelihood. Both solutions are similar in this case. When a separating hyperplane exists, logistic regression will always find it, since the log-likelihood can be driven to 0 in this case (Exercise 4.5). The logistic regression solution shares some other qualitative features with the separating hyperplane solution. The coefficient vector is defined by a weighted least squares fit of a zero-mean linearized response on the input features, and the weights are larger for points near the decision boundary than for those further away.

When the data are not separable, there will be no feasible solution to this problem, and an alternative formulation is needed. Again one can enlarge the space using basis transformations, but this can lead to artificial separation through over-fitting. In Chapter 12 we discuss a more attractive alternative known as the support vector machine, which allows for overlap, but minimizes a measure of the extent of this overlap.

Bibliographic Notes

Good general texts on classification include Duda et al. (2000), Hand (1981), McLachlan (1992) and Ripley (1996). Mardia et al. (1979) have a concise discussion of linear discriminant analysis. Michie et al. (1994) compare a large number of popular classifiers on benchmark datasets. Linear separating hyperplanes are discussed in Vapnik (1996). Our account of the perceptron learning algorithm follows Ripley (1996).

Exercises

Ex. 4.1 Show how to solve the generalized eigenvalue problem $\max a^T B a$ subject to $a^T W a = 1$ by transforming to a standard eigenvalue problem.

Ex. 4.2 Suppose we have features $x \in \mathbb{R}^p$, a two-class response, with class sizes $N_1, N_2$, and the target coded as $-N/N_1, N/N_2$.

(a) Show that the LDA rule classifies to class 2 if

$$x^T\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \tfrac{1}{2}\hat{\mu}_2^T\hat{\Sigma}^{-1}\hat{\mu}_2 - \tfrac{1}{2}\hat{\mu}_1^T\hat{\Sigma}^{-1}\hat{\mu}_1 + \log\frac{N_1}{N} - \log\frac{N_2}{N},$$

and class 1 otherwise.

(b) Consider minimization of the least squares criterion

$$\sum_{i=1}^N (y_i - \beta_0 - \beta^T x_i)^2. \qquad (4.55)$$

Show that the solution $\hat{\beta}$ satisfies

$$\Bigl[(N-2)\hat{\Sigma} + \frac{N_1 N_2}{N}\hat{\Sigma}_B\Bigr]\beta = N(\hat{\mu}_2 - \hat{\mu}_1) \qquad (4.56)$$

(after simplification), where $\hat{\Sigma}_B = (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T$.

(c) Hence show that $\hat{\Sigma}_B\beta$ is in the direction $(\hat{\mu}_2 - \hat{\mu}_1)$ and thus

$$\hat{\beta} \propto \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1). \qquad (4.57)$$

Therefore the least squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.

(d) Show that this result holds for any (distinct) coding of the two classes.

(e) Find the solution $\hat{\beta}_0$, and hence the predicted values $\hat{f} = \hat{\beta}_0 + \hat{\beta}^T x$. Consider the following rule: classify to class 2 if $\hat{y}_i > 0$ and class 1 otherwise. Show this is not the same as the LDA rule unless the classes have equal numbers of observations.

(Fisher, 1936; Ripley, 1996)

Ex. 4.3 Suppose we transform the original predictors $X$ to $\hat{Y}$ via linear regression. In detail, let $\hat{Y} = X(X^TX)^{-1}X^TY = X\hat{B}$, where $Y$ is the indicator response matrix.
Similarly for any input $x \in \mathbb{R}^p$, we get a transformed vector $\hat{y} = \hat{B}^T x \in \mathbb{R}^K$. Show that LDA using $\hat{Y}$ is identical to LDA in the original space.

Ex. 4.4 Consider the multilogit model with $K$ classes (4.17). Let $\beta$ be the $(p+1)(K-1)$-vector consisting of all the coefficients. Define a suitably enlarged version of the input vector $x$ to accommodate this vectorized coefficient matrix. Derive the Newton–Raphson algorithm for maximizing the multinomial log-likelihood, and describe how you would implement this algorithm.

Ex. 4.5 Consider a two-class logistic regression problem with $x \in \mathbb{R}$. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample $x_i$ for the two classes are separated by a point $x_0 \in \mathbb{R}$. Generalize this result to (a) $x \in \mathbb{R}^p$ (see Figure 4.16), and (b) more than two classes.

Ex. 4.6 Suppose we have $N$ points $x_i$ in $\mathbb{R}^p$ in general position, with class labels $y_i \in \{-1, 1\}$. Prove that the perceptron learning algorithm converges to a separating hyperplane in a finite number of steps:

(a) Denote a hyperplane by $f(x) = \beta_1^T x + \beta_0 = 0$, or in more compact notation $\beta^T x^* = 0$, where $x^* = (x, 1)$ and $\beta = (\beta_1, \beta_0)$. Let $z_i = x_i^*/\|x_i^*\|$. Show that separability implies the existence of a $\beta_{\mathrm{sep}}$ such that $y_i\beta_{\mathrm{sep}}^T z_i \ge 1$ for all $i$.

(b) Given a current $\beta_{\mathrm{old}}$, the perceptron algorithm identifies a point $z_i$ that is misclassified, and produces the update $\beta_{\mathrm{new}} \leftarrow \beta_{\mathrm{old}} + y_i z_i$. Show that $\|\beta_{\mathrm{new}} - \beta_{\mathrm{sep}}\|^2 \le \|\beta_{\mathrm{old}} - \beta_{\mathrm{sep}}\|^2 - 1$, and hence that the algorithm converges to a separating hyperplane in no more than $\|\beta_{\mathrm{start}} - \beta_{\mathrm{sep}}\|^2$ steps (Ripley, 1996).

Ex. 4.7 Consider the criterion

$$D^*(\beta, \beta_0) = -\sum_{i=1}^N y_i(x_i^T\beta + \beta_0), \qquad (4.58)$$

a generalization of (4.41) where we sum over all the observations. Consider minimizing $D^*$ subject to $\|\beta\| = 1$. Describe this criterion in words. Does it solve the optimal separating hyperplane problem?

Ex. 4.8 Consider the multivariate Gaussian model $X|G = k \sim N(\mu_k, \Sigma)$, with the additional restriction that $\mathrm{rank}\{\mu_k\}_1^K = L$ [...]

[...]

PRIM can also handle more than two classes simultaneously: one approach is to run PRIM separately for each class versus a baseline class.

An advantage of PRIM over CART is its patience. Because of its binary splits, CART fragments the data quite quickly. Assuming splits of equal size, with $N$ observations it can only make $\log_2(N) - 1$ splits before running out of data. If PRIM peels off a proportion $\alpha$ of training points at each stage, it can perform approximately $-\log(N)/\log(1-\alpha)$ peeling steps before running out of data. For example, if $N = 128$ and $\alpha = 0.10$, then $\log_2(N) - 1 = 6$ while $-\log(N)/\log(1-\alpha) \approx 46$. Taking into account that there must be an integer number of observations at each stage, PRIM in fact can peel only 29 times. In any case, the ability of PRIM to be more patient should help the top-down greedy algorithm find a better solution.
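These two counts are easy to reproduce; a minimal sketch follows (the exact 29-step figure additionally requires stepping through integer box sizes, whose count depends on the rounding convention used for each peel):

```python
import numpy as np

N, alpha = 128, 0.10
cart_splits = np.log2(N) - 1                 # about 6 equal-size binary splits
prim_steps = -np.log(N) / np.log(1 - alpha)  # continuous approximation, about 46
print(cart_splits, prim_steps)
```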
9.3.1 Spam Example (Continued)

We applied PRIM to the spam data, with the response coded as 1 for spam and 0 for email. The first two boxes found by PRIM are summarized below:

Rule 1:
            Global Mean   Box Mean   Box Support
Training    0.3931        0.9607     0.1413
Test        0.3958        1.0000     0.1536

Rule 1 conditions:
  ch! > 0.029
  CAPAVE > 2.331
  your > 0.705
  1999 < 0.040
  CAPTOT > 79.50
  edu < 0.070
  re < 0.535
  ch; < 0.030

Rule 2:
            Remain Mean   Box Mean   Box Support
Training    0.2998        0.9560     0.1043
Test        0.2862        0.9264     0.1061

Rule 2 conditions:
  remove > 0.010
  george < 0.110

The box support is the proportion of observations falling in the box. The first box is purely spam, and contains about 15% of the test data. The second box contains 10.6% of the test observations, 92.6% of which are spam. Together the two boxes contain 26% of the data and are about 97% spam. The next few boxes (not shown) are quite small, containing only about 3% of the data.

The predictors are listed in order of importance. Interestingly the top splitting variables in the CART tree (Figure 9.5) do not appear in PRIM's first box.

9.4 MARS: Multivariate Adaptive Regression Splines

MARS is an adaptive procedure for regression, and is well suited for high-dimensional problems (i.e., a large number of inputs). It can be viewed as a generalization of stepwise linear regression or a modification of the CART method to improve the latter's performance in the regression setting. We introduce MARS from the first point of view, and later make the connection to CART.

MARS uses expansions in piecewise linear basis functions of the form $(x-t)_+$ and $(t-x)_+$. The "+" means positive part, so

$$(x-t)_+ = \begin{cases} x-t, & \text{if } x > t, \\ 0, & \text{otherwise,} \end{cases} \qquad\text{and}\qquad (t-x)_+ = \begin{cases} t-x, & \text{if } x < t, \\ 0, & \text{otherwise.} \end{cases}$$

(A short numerical sketch of this reflected pair appears at the end of this section.)

[...]

• Replace the piecewise linear basis functions by step functions $I(x - t > 0)$ and $I(x - t \le 0)$.

• When a model term is involved in a multiplication by a candidate term, it gets replaced by the interaction, and hence is not available for further interactions.

With these changes, the MARS forward procedure is the same as the CART tree-growing algorithm. Multiplying a step function by a pair of reflected step functions is equivalent to splitting a node at the step. The second restriction implies that a node may not be split more than once, and leads to the attractive binary-tree representation of the CART model. On the other hand, it is this restriction that makes it difficult for CART to model additive structures. MARS forgoes the tree structure and gains the ability to capture additive effects.

Mixed Inputs

MARS can handle "mixed" predictors, quantitative and qualitative, in a natural way, much like CART does. MARS considers all possible binary partitions of the categories for a qualitative predictor into two groups. Each such partition generates a pair of piecewise constant basis functions: indicator functions for the two sets of categories. This basis pair is now treated as any other, and is used in forming tensor products with other basis functions already in the model.
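A minimal numerical sketch of the reflected pair (function name ours):

```python
import numpy as np

def reflected_pair(x, t):
    """The MARS basis pair (x - t)_+ and (t - x)_+ with knot t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

# In the MARS forward pass, each observed value x_ij can serve as a knot,
# generating the candidate pair (X_j - x_ij)_+ and (x_ij - X_j)_+.
x = np.linspace(0.0, 1.0, 5)
pos, neg = reflected_pair(x, t=0.5)
```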
9.5 Hierarchical Mixtures of Experts

The hierarchical mixtures of experts (HME) procedure can be viewed as a variant of tree-based methods. The main difference is that the tree splits are not hard decisions but rather soft probabilistic ones. At each node an observation goes left or right with probabilities depending on its input values. This has some computational advantages since the resulting parameter optimization problem is smooth, unlike the discrete split point search in the tree-based approach. The soft splits might also help in prediction accuracy and provide a useful alternative description of the data.

There are other differences between HMEs and the CART implementation of trees. In an HME, a linear (or logistic regression) model is fit in each terminal node, instead of a constant as in CART. The splits can be multiway, not just binary, and the splits are probabilistic functions of a linear combination of inputs, rather than a single input as in the standard use of CART. However, the relative merits of these choices are not clear, and most were discussed at the end of Section 9.2.

A simple two-level HME model is shown in Figure 9.13. It can be thought of as a tree with soft splits at each non-terminal node. However, the inventors of this methodology use a different terminology. The terminal nodes are called experts, and the non-terminal nodes are called gating networks. The idea is that each expert provides an opinion (prediction) about the response, and these are combined together by the gating networks. As we will see, the model is formally a mixture model, and the two-level model in the figure can be extended to multiple levels, hence the name hierarchical mixtures of experts.

FIGURE 9.13. A two-level hierarchical mixture of experts (HME) model.

Consider the regression or classification problem, as described earlier in the chapter. The data is $(x_i, y_i)$, $i = 1, 2, \ldots, N$, with $y_i$ either a continuous or binary-valued response, and $x_i$ a vector-valued input. For ease of notation we assume that the first element of $x_i$ is one, to account for intercepts.

Here is how an HME is defined. The top gating network has the output

$$g_j(x, \gamma_j) = \frac{e^{\gamma_j^T x}}{\sum_{k=1}^K e^{\gamma_k^T x}}, \qquad j = 1, 2, \ldots, K, \qquad (9.25)$$

where each $\gamma_j$ is a vector of unknown parameters. This represents a soft $K$-way split ($K = 2$ in Figure 9.13). Each $g_j(x, \gamma_j)$ is the probability of assigning an observation with feature vector $x$ to the $j$th branch. Notice that with $K = 2$ groups, if we take the coefficient of one of the elements of $x$ to be $+\infty$, then we get a logistic curve with infinite slope. In this case, the gating probabilities are either 0 or 1, corresponding to a hard split on that input.

At the second level, the gating networks have a similar form:

$$g_{\ell|j}(x, \gamma_{j\ell}) = \frac{e^{\gamma_{j\ell}^T x}}{\sum_{k=1}^K e^{\gamma_{jk}^T x}}, \qquad \ell = 1, 2, \ldots, K. \qquad (9.26)$$

This is the probability of assignment to the $\ell$th branch, given assignment to the $j$th branch at the level above.

At each expert (terminal node), we have a model for the response variable of the form

$$Y \sim \Pr(y|x, \theta_{j\ell}). \qquad (9.27)$$

This differs according to the problem.

Regression: The Gaussian linear regression model is used, with $\theta_{j\ell} = (\beta_{j\ell}, \sigma_{j\ell}^2)$:

$$Y = \beta_{j\ell}^T x + \varepsilon \quad\text{and}\quad \varepsilon \sim N(0, \sigma_{j\ell}^2). \qquad (9.28)$$

Classification: The linear logistic regression model is used:

$$\Pr(Y = 1|x, \theta_{j\ell}) = \frac{1}{1 + e^{-\theta_{j\ell}^T x}}. \qquad (9.29)$$

Denoting the collection of all parameters by $\Psi = \{\gamma_j, \gamma_{j\ell}, \theta_{j\ell}\}$, the total probability that $Y = y$ is

$$\Pr(y|x, \Psi) = \sum_{j=1}^K g_j(x, \gamma_j) \sum_{\ell=1}^K g_{\ell|j}(x, \gamma_{j\ell}) \Pr(y|x, \theta_{j\ell}). \qquad (9.30)$$

This is a mixture model, with the mixture probabilities determined by the gating network models.
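The quantities (9.25)–(9.30) are straightforward to evaluate numerically. Below is a minimal sketch for a two-level HME with Gaussian regression experts (9.28); all names are ours, and NumPy/SciPy are assumed:

```python
import numpy as np
from scipy.stats import norm

def softmax(v):
    """Gating probabilities (9.25)-(9.26) from the linear scores gamma' x."""
    e = np.exp(v - v.max())
    return e / e.sum()

def hme_prob(x, y, gamma, gamma_sub, beta, sigma):
    """Mixture probability Pr(y | x, Psi) of (9.30), Gaussian experts.

    x: (p,) input whose first element is 1 (intercept); y: scalar response.
    gamma: (K, p) top-level gating coefficients; gamma_sub: (K, K, p) second level.
    beta: (K, K, p) expert regression coefficients; sigma: (K, K) expert std devs.
    """
    g = softmax(gamma @ x)                    # (9.25): soft K-way split
    total = 0.0
    for j in range(len(g)):
        g_sub = softmax(gamma_sub[j] @ x)     # (9.26): split within branch j
        expert = norm.pdf(y, loc=beta[j] @ x, scale=sigma[j])   # (9.28)
        total += g[j] * np.dot(g_sub, expert)  # (9.30): weighted mixture
    return total
```

Summing $\log$ of this quantity over the training data gives the log-likelihood maximized in the EM iterations described next.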
To estimate the parameters, we maximize the log-likelihood of the data, $\sum_i \log \Pr(y_i|x_i, \Psi)$, over the parameters in $\Psi$. The most convenient method for doing this is the EM algorithm, which we describe for mixtures in Section 8.5. We define latent variables $\Delta_j$, all of which are zero except for a single one. We interpret these as the branching decisions made by the top-level gating network. Similarly we define latent variables $\Delta_{\ell|j}$ to describe the gating decisions at the second level.

In the E-step, the EM algorithm computes the expectations of the $\Delta_j$ and $\Delta_{\ell|j}$ given the current values of the parameters. These expectations are then used as observation weights in the M-step of the procedure, to estimate the parameters in the expert networks. The parameters in the internal nodes are estimated by a version of multiple logistic regression. The expectations of the $\Delta_j$ or $\Delta_{\ell|j}$ are probability profiles, and these are used as the response vectors for these logistic regressions.

The hierarchical mixtures of experts approach is a promising competitor to CART trees. By using soft splits rather than hard decision rules it can capture situations where the transition from low to high response is gradual. The log-likelihood is a smooth function of the unknown weights and hence is amenable to numerical optimization. The model is similar to CART with linear combination splits, but the latter is more difficult to optimize. On the other hand, to our knowledge there are no methods for finding a good tree topology for the HME model, as there are in CART. Typically one uses a fixed tree of some depth, possibly the output of the CART procedure. The emphasis in the research on HMEs has been on prediction rather than interpretation of the final model. A close cousin of the HME is the latent class model (Lin et al., 2000), which typically has only one layer; here the nodes or latent classes are interpreted as groups of subjects that show similar response behavior.

9.6 Missing Data

It is quite common to have observations with missing values for one or more input features. The usual approach is to impute (fill in) the missing values in some way.

However, the first issue in dealing with the problem is determining whether the missing data mechanism has distorted the observed data. Roughly speaking, data are missing at random if the mechanism resulting in its omission is independent of its (unobserved) value. A more precise definition is given in Little and Rubin (2002). Suppose $y$ is the response vector and $X$ is the $N \times p$ matrix of inputs (some of which are missing). Denote by $X_{\mathrm{obs}}$ the observed entries in $X$ and let $Z = (y, X)$, $Z_{\mathrm{obs}} = (y, X_{\mathrm{obs}})$. Finally, if $R$ is an indicator matrix with $ij$th entry 1 if $x_{ij}$ is missing and zero otherwise, then the data is said to be missing at random (MAR) if the distribution of $R$ depends on the data $Z$ only through $Z_{\mathrm{obs}}$:

$$\Pr(R|Z, \theta) = \Pr(R|Z_{\mathrm{obs}}, \theta). \qquad (9.31)$$

Here $\theta$ are any parameters in the distribution of $R$. Data are said to be missing completely at random (MCAR) if the distribution of $R$ doesn't depend on the observed or missing data:

$$\Pr(R|Z, \theta) = \Pr(R|\theta). \qquad (9.32)$$

MCAR is a stronger assumption than MAR: most imputation methods rely on MCAR for their validity.

For example, if a patient's measurement was not taken because the doctor felt he was too sick, that observation would not be MAR or MCAR. In this case the missing data mechanism causes our observed training data to give a distorted picture of the true population, and data imputation is dangerous in this instance. Often the determination of whether features are MCAR must be made from information about the data collection process. For categorical features, one way to diagnose this problem is to code "missing" as an additional class. Then we fit our model to the training data and see if class "missing" is predictive of the response.

Assuming the features are missing completely at random, there are a number of ways of proceeding:

1. Discard observations with any missing values.

2. Rely on the learning algorithm to deal with missing values in its training phase.
3. Impute all missing values before training.

Approach (1) can be used if the relative amount of missing data is small, but otherwise should be avoided. Regarding (2), CART is one learning algorithm that deals effectively with missing values, through surrogate splits (Section 9.2.4). MARS and PRIM use similar approaches. In generalized additive modeling, all observations missing for a given input feature are omitted when the partial residuals are smoothed against that feature in the backfitting algorithm, and their fitted values are set to zero. Since the fitted curves have mean zero (when the model includes an intercept), this amounts to assigning the average fitted value to the missing observations.

For most learning methods, the imputation approach (3) is necessary. The simplest tactic is to impute the missing value with the mean or median of the nonmissing values for that feature. (Note that the above procedure for generalized additive models is analogous to this.) If the features have at least some moderate degree of dependence, one can do better by estimating a predictive model for each feature given the other features and then imputing each missing value by its prediction from the model. In choosing the learning method for imputation of the features, one must remember that this choice is distinct from the method used for predicting $y$ from $X$. Thus a flexible, adaptive method will often be preferred, even for the eventual purpose of carrying out a linear regression of $y$ on $X$. In addition, if there are many missing feature values in the training set, the learning method must itself be able to deal with missing feature values. CART therefore is an ideal choice for this imputation "engine."

After imputation, missing values are typically treated as if they were actually observed. This ignores the uncertainty due to the imputation, which will itself introduce additional uncertainty into estimates and predictions from the response model. One can measure this additional uncertainty by doing multiple imputations and hence creating many different training sets. The predictive model for $y$ can be fit to each training set, and the variation across training sets can be assessed. If CART was used for the imputation engine, the multiple imputations could be done by sampling from the values in the corresponding terminal nodes.
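A minimal sketch of the simplest versions of approach (3): mean imputation, plus a crude multiple-imputation loop that resamples each missing entry from the observed values of its feature (function names are ours; NaN marks a missing entry):

```python
import numpy as np

def mean_impute(X):
    """Fill each missing entry (NaN) with its column mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def multiple_impute(X, n_imputations=5, rng=None):
    """Crude multiple imputation: draw each missing value from the observed
    values of its feature, yielding several completed training sets whose
    downstream fits can be compared to gauge imputation uncertainty."""
    rng = np.random.default_rng(rng)
    completed = []
    for _ in range(n_imputations):
        Xi = X.copy()
        for j in range(X.shape[1]):
            miss = np.isnan(Xi[:, j])
            Xi[miss, j] = rng.choice(Xi[~miss, j], size=miss.sum())
        completed.append(Xi)
    return completed
```

A model-based engine such as CART, as suggested above, would replace the marginal resampling here with draws from the fitted terminal nodes.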
9.7 Computational Considerations

With $N$ observations and $p$ predictors, additive model fitting requires some number $mp$ of applications of a one-dimensional smoother or regression method. The required number of cycles $m$ of the backfitting algorithm is usually less than 20 and often less than 10, and depends on the amount of correlation in the inputs. With cubic smoothing splines, for example, $N\log N$ operations are needed for an initial sort and $N$ operations for the spline fit. Hence the total operations for an additive model fit is $pN\log N + mpN$.

Trees require $pN\log N$ operations for an initial sort for each predictor, and typically another $pN\log N$ operations for the split computations. If the splits occurred near the edges of the predictor ranges, this number could increase to $N^2 p$.

MARS requires $Nm^2 + pmN$ operations to add a basis function to a model with $m$ terms already present, from a pool of $p$ predictors. Hence to build an $M$-term model requires $NM^3 + pM^2 N$ computations, which can be quite prohibitive if $M$ is a reasonable fraction of $N$.

Each component of an HME is typically inexpensive to fit at each M-step: $Np^2$ for the regressions, and $Np^2K^2$ for a $K$-class logistic regression. The EM algorithm, however, can take a long time to converge, and so sizable HME models are considered costly to fit.

Bibliographic Notes

The most comprehensive source for generalized additive models is the text of that name by Hastie and Tibshirani (1990). Different applications of this work in medical problems are discussed in Hastie et al. (1989) and Hastie and Herman (1990), and the software implementation in Splus is described in Chambers and Hastie (1991). Green and Silverman (1994) discuss penalization and spline models in a variety of settings. Efron and Tibshirani (1991) give an exposition of modern developments in statistics (including generalized additive models), for a nonmathematical audience. Classification and regression trees date back at least as far as Morgan and Sonquist (1963). We have followed the modern approaches of Breiman et al. (1984) and Quinlan (1993). The PRIM method is due to Friedman and Fisher (1999), while MARS is introduced in Friedman (1991), with an additive precursor in Friedman and Silverman (1989). Hierarchical mixtures of experts were proposed in Jordan and Jacobs (1994); see also Jacobs et al. (1991).

Exercises

Ex. 9.1 Show that a smoothing spline fit of $y_i$ to $x_i$ preserves the linear part of the fit. In other words, if $y_i = \hat{y}_i + r_i$, where $\hat{y}_i$ represents the linear regression fits, and $S$ is the smoothing matrix, then $Sy = \hat{y} + Sr$. Show that the same is true for local linear regression (Section 6.1.1). Hence argue that the adjustment step in the second line of (2) in Algorithm 9.1 is unnecessary.

Ex. 9.2 Let $A$ be a known $k \times k$ matrix, $b$ be a known $k$-vector, and $z$ be an unknown $k$-vector. A Gauss–Seidel algorithm for solving the linear system of equations $Az = b$ works by successively solving for element $z_j$ in the $j$th equation, fixing all other $z_j$'s at their current guesses. This process is repeated for $j = 1, 2, \ldots, k, 1, 2, \ldots, k, \ldots$, until convergence (Golub and Van Loan, 1983).

(a) Consider an additive model with $N$ observations and $p$ terms, with the $j$th term to be fit by a linear smoother $S_j$. Consider the following system of equations:

$$\begin{pmatrix} I & S_1 & S_1 & \cdots & S_1 \\ S_2 & I & S_2 & \cdots & S_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_p & S_p & S_p & \cdots & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 y \\ S_2 y \\ \vdots \\ S_p y \end{pmatrix}. \qquad (9.33)$$

Here each $f_j$ is an $N$-vector of evaluations of the $j$th function at the data points, and $y$ is an $N$-vector of the response values. Show that backfitting is a blockwise Gauss–Seidel algorithm for solving this system of equations.

(b) Let $S_1$ and $S_2$ be symmetric smoothing operators (matrices) with eigenvalues in $[0, 1)$. Consider a backfitting algorithm with response vector $y$ and smoothers $S_1$, $S_2$. Show that with any starting values, the algorithm converges and give a formula for the final iterates.

Ex. 9.3 Backfitting equations. Consider a backfitting procedure with orthogonal projections, and let $D$ be the overall regression matrix whose columns span $V = \mathcal{L}_{\mathrm{col}}(S_1) \oplus \mathcal{L}_{\mathrm{col}}(S_2) \oplus \cdots \oplus \mathcal{L}_{\mathrm{col}}(S_p)$, where $\mathcal{L}_{\mathrm{col}}(S)$ denotes the column space of a matrix $S$. Show that the estimating equations

$$\begin{pmatrix} I & S_1 & S_1 & \cdots & S_1 \\ S_2 & I & S_2 & \cdots & S_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_p & S_p & S_p & \cdots & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 y \\ S_2 y \\ \vdots \\ S_p y \end{pmatrix}$$

are equivalent to the least squares normal equations $D^T D\beta = D^T y$ where $\beta$ is the vector of coefficients.
Ex. 9.4 Suppose the same smoother $S$ is used to estimate both terms in a two-term additive model (i.e., both variables are identical). Assume that $S$ is symmetric with eigenvalues in $[0, 1)$. Show that the backfitting residual converges to $(I + S)^{-1}(I - S)y$, and that the residual sum of squares converges upward. Can the residual sum of squares converge upward in less structured situations? How does this fit compare to the fit with a single term fit by $S$? [Hint: Use the eigen-decomposition of $S$ to help with this comparison.]

Ex. 9.5 Degrees of freedom of a tree. Given data $y_i$ with mean $f(x_i)$ and variance $\sigma^2$, and a fitting operation $y \to \hat{y}$, let's define the degrees of freedom of a fit by $\sum_i \mathrm{cov}(y_i, \hat{y}_i)/\sigma^2$. Consider a fit $\hat{y}$ estimated by a regression tree, fit to a set of predictors $X_1, X_2, \ldots, X_p$.

(a) In terms of the number of terminal nodes $m$, give a rough formula for the degrees of freedom of the fit.

(b) Generate 100 observations with predictors $X_1, X_2, \ldots, X_{10}$ as independent standard Gaussian variates and fix these values.

(c) Generate response values also as standard Gaussian ($\sigma^2 = 1$), independent of the predictors. Fit regression trees to the data of fixed size 1, 5 and 10 terminal nodes and hence estimate the degrees of freedom of each fit. [Do ten simulations of the response and average the results, to get a good estimate of degrees of freedom.]

(d) Compare your estimates of degrees of freedom in (a) and (c) and discuss.

(e) If the regression tree fit were a linear operation, we could write $\hat{y} = Sy$ for some matrix $S$. Then the degrees of freedom would be $\mathrm{tr}(S)$. Suggest a way to compute an approximate $S$ matrix for a regression tree, compute it and compare the resulting degrees of freedom to those in (a) and (c).

Ex. 9.6 Consider the ozone data of Figure 6.9.

(a) Fit an additive model to the cube root of ozone concentration as a function of temperature, wind speed, and radiation. Compare your results to those obtained via the trellis display in Figure 6.9.

(b) Fit trees, MARS, and PRIM to the same data, and compare the results to those found in (a) and in Figure 6.9.

10 Boosting and Additive Trees

10.1 Boosting Methods

Boosting is one of the most powerful learning ideas introduced in the last twenty years. It was originally designed for classification problems, but as will be seen in this chapter, it can profitably be extended to regression as well. The motivation for boosting was a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee." From this perspective boosting bears a resemblance to bagging and other committee-based approaches (Section 8.8). However we shall see that the connection is at best superficial and that boosting is fundamentally different.

We begin by describing the most popular boosting algorithm due to Freund and Schapire (1997) called "AdaBoost.M1." Consider a two-class problem, with the output variable coded as $Y \in \{-1, 1\}$. Given a vector of predictor variables $X$, a classifier $G(X)$ produces a prediction taking one of the two values $\{-1, 1\}$. The error rate on the training sample is

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^N I(y_i \ne G(x_i)),$$

and the expected error rate on future predictions is $E_{XY}\, I(Y \ne G(X))$.

A weak classifier is one whose error rate is only slightly better than random guessing. The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $G_m(x)$, $m = 1, 2, \ldots, M$.
FIGURE 10.1. Schematic of AdaBoost. Classifiers are trained on weighted versions of the dataset, and then combined to produce a final prediction.

The predictions from all of them are then combined through a weighted majority vote to produce the final prediction:

$$G(x) = \mathrm{sign}\Bigl(\sum_{m=1}^M \alpha_m G_m(x)\Bigr). \qquad (10.1)$$

Here $\alpha_1, \alpha_2, \ldots, \alpha_M$ are computed by the boosting algorithm, and weight the contribution of each respective $G_m(x)$. Their effect is to give higher influence to the more accurate classifiers in the sequence. Figure 10.1 shows a schematic of the AdaBoost procedure.

The data modifications at each boosting step consist of applying weights $w_1, w_2, \ldots, w_N$ to each of the training observations $(x_i, y_i)$, $i = 1, 2, \ldots, N$. Initially all of the weights are set to $w_i = 1/N$, so that the first step simply trains the classifier on the data in the usual manner. For each successive iteration $m = 2, 3, \ldots, M$ the observation weights are individually modified and the classification algorithm is reapplied to the weighted observations. At step $m$, those observations that were misclassified by the classifier $G_{m-1}(x)$ induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly. Thus as iterations proceed, observations that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.

Algorithm 10.1 AdaBoost.M1.

1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.

2. For $m = 1$ to $M$:

   (a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.

   (b) Compute

   $$\mathrm{err}_m = \frac{\sum_{i=1}^N w_i I(y_i \ne G_m(x_i))}{\sum_{i=1}^N w_i}.$$

   (c) Compute $\alpha_m = \log((1 - \mathrm{err}_m)/\mathrm{err}_m)$.

   (d) Set $w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \ne G_m(x_i))]$, $i = 1, 2, \ldots, N$.

3. Output $G(x) = \mathrm{sign}\bigl[\sum_{m=1}^M \alpha_m G_m(x)\bigr]$.

Algorithm 10.1 shows the details of the AdaBoost.M1 algorithm. The current classifier $G_m(x)$ is induced on the weighted observations at line 2a. The resulting weighted error rate is computed at line 2b. Line 2c calculates the weight $\alpha_m$ given to $G_m(x)$ in producing the final classifier $G(x)$ (line 3). The individual weights of each of the observations are updated for the next iteration at line 2d. Observations misclassified by $G_m(x)$ have their weights scaled by a factor $\exp(\alpha_m)$, increasing their relative influence for inducing the next classifier $G_{m+1}(x)$ in the sequence.
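Algorithm 10.1 is compact enough to implement directly. The following minimal sketch uses exhaustive decision stumps as the weak classifiers (all function names are ours; the stump search is written for clarity rather than speed, and we assume $0 < \mathrm{err}_m < 1$ throughout):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted best decision stump: threshold one feature, predict +/-1."""
    N, p = X.shape
    best = (np.inf, None)                 # (weighted error, (feature, threshold, sign))
    for j in range(p):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, (j, t, s))
    return best[1]

def stump_predict(stump, X):
    j, t, s = stump
    return s * np.where(X[:, j] > t, 1, -1)

def adaboost_m1(X, y, M):
    """Algorithm 10.1 with stumps as the weak classifier G_m."""
    N = len(y)
    w = np.full(N, 1.0 / N)                           # step 1
    stumps, alphas = [], []
    for _ in range(M):                                # step 2
        G = fit_stump(X, y, w)                        # 2(a)
        pred = stump_predict(G, X)
        err = np.sum(w * (pred != y)) / np.sum(w)     # 2(b); assumed in (0, 1)
        alpha = np.log((1 - err) / err)               # 2(c)
        w = w * np.exp(alpha * (pred != y))           # 2(d)
        stumps.append(G)
        alphas.append(alpha)
    def classify(Xnew):                               # step 3: weighted majority vote
        F = sum(a * stump_predict(G, Xnew) for a, G in zip(alphas, stumps))
        return np.sign(F)
    return classify
```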
The AdaBoost.M1 algorithm is known as "Discrete AdaBoost" in Friedman et al. (2000), because the base classifier $G_m(x)$ returns a discrete class label. If the base classifier instead returns a real-valued prediction (e.g., a probability mapped to the interval $[-1, 1]$), AdaBoost can be modified appropriately (see "Real AdaBoost" in Friedman et al. (2000)).

The power of AdaBoost to dramatically increase the performance of even a very weak classifier is illustrated in Figure 10.2. The features $X_1, \ldots, X_{10}$ are standard independent Gaussian, and the deterministic target $Y$ is defined by

$$Y = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5), \\ -1 & \text{otherwise.} \end{cases} \qquad (10.2)$$

Here $\chi^2_{10}(0.5) = 9.34$ is the median of a chi-squared random variable with 10 degrees of freedom (sum of squares of 10 standard Gaussians). There are 2000 training cases, with approximately 1000 cases in each class, and 10,000 test observations. Here the weak classifier is just a "stump": a two-terminal-node classification tree. Applying this classifier alone to the training data set yields a very poor test set error rate of 45.8%, compared to 50% for random guessing.

FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rate for a single stump, and a 244-node classification tree.

However, as boosting iterations proceed the error rate steadily decreases, reaching 5.8% after 400 iterations. Thus, boosting this simple very weak classifier reduces its prediction error rate by almost a factor of four. It also outperforms a single large classification tree (error rate 24.7%).
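For readers who wish to experiment, the simulation design (10.2) is a few lines of NumPy (variable names ours; SciPy is assumed only for the chi-squared quantile):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
N, p = 2000, 10
X = rng.standard_normal((N, p))        # ten independent standard Gaussian features
threshold = chi2.ppf(0.5, df=p)        # chi-squared median, about 9.34
y = np.where((X ** 2).sum(axis=1) > threshold, 1, -1)
```

Feeding `X, y` to the `adaboost_m1` sketch above reproduces the qualitative behavior in Figure 10.2.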
Since its introduction, much has been written to explain the success of AdaBoost in producing accurate classifiers. Most of this work has centered on using classification trees as the "base learner" $G(x)$, where improvements are often most dramatic. In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the "best off-the-shelf classifier in the world" (see also Breiman (1998)). This is especially the case for data-mining applications, as discussed more fully in Section 10.7 later in this chapter.

10.1.1 Outline of This Chapter

Here is an outline of the developments in this chapter:

• We show that AdaBoost fits an additive model in a base learner, optimizing a novel exponential loss function. This loss function is very similar to the (negative) binomial log-likelihood (Sections 10.2–10.4).

• The population minimizer of the exponential loss function is shown to be the log-odds of the class probabilities (Section 10.5).

• We describe loss functions for regression and classification that are more robust than squared error or exponential loss (Section 10.6).

• It is argued that decision trees are an ideal base learner for data mining applications of boosting (Sections 10.7 and 10.9).

• We develop a class of gradient boosted models (GBMs), for boosting trees with any loss function (Section 10.10).

• The importance of "slow learning" is emphasized, and implemented by shrinkage of each new term that enters the model (Section 10.12), as well as randomization (Section 10.12.2).

• Tools for interpretation of the fitted model are described (Section 10.13).

10.2 Boosting Fits an Additive Model

The success of boosting is really not very mysterious. The key lies in expression (10.1). Boosting is a way of fitting an additive expansion in a set of elementary "basis" functions. Here the basis functions are the individual classifiers $G_m(x) \in \{-1, 1\}$. More generally, basis function expansions take the form

$$f(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m), \qquad (10.3)$$

where $\beta_m$, $m = 1, 2, \ldots, M$ are the expansion coefficients, and $b(x; \gamma) \in \mathbb{R}$ are usually simple functions of the multivariate argument $x$, characterized by a set of parameters $\gamma$. We discuss basis expansions in some detail in Chapter 5.

Additive expansions like this are at the heart of many of the learning techniques covered in this book:

• In single-hidden-layer neural networks (Chapter 11), $b(x; \gamma) = \sigma(\gamma_0 + \gamma_1^T x)$, where $\sigma(t) = 1/(1 + e^{-t})$ is the sigmoid function, and $\gamma$ parameterizes a linear combination of the input variables.

• In signal processing, wavelets (Section 5.9.1) are a popular choice with $\gamma$ parameterizing the location and scale shifts of a "mother" wavelet.

• Multivariate adaptive regression splines (Section 9.4) uses truncated-power spline basis functions where $\gamma$ parameterizes the variables and values for the knots.

• For trees, $\gamma$ parameterizes the split variables and split points at the internal nodes, and the predictions at the terminal nodes.

Typically these models are fit by minimizing a loss function averaged over the training data, such as the squared-error or a likelihood-based loss function,

$$\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^N L\Bigl(y_i, \sum_{m=1}^M \beta_m b(x_i; \gamma_m)\Bigr). \qquad (10.4)$$

For many loss functions $L(y, f(x))$ and/or basis functions $b(x; \gamma)$, this requires computationally intensive numerical optimization techniques. However, a simple alternative often can be found when it is feasible to rapidly solve the subproblem of fitting just a single basis function,

$$\min_{\beta, \gamma} \sum_{i=1}^N L(y_i, \beta b(x_i; \gamma)). \qquad (10.5)$$

10.3 Forward Stagewise Additive Modeling

Forward stagewise modeling approximates the solution to (10.4) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added. This is outlined in Algorithm 10.2. At each iteration $m$, one solves for the optimal basis function $b(x; \gamma_m)$ and corresponding coefficient $\beta_m$ to add to the current expansion $f_{m-1}(x)$. This produces $f_m(x)$, and the process is repeated. Previously added terms are not modified.

Algorithm 10.2 Forward Stagewise Additive Modeling.

1. Initialize $f_0(x) = 0$.

2. For $m = 1$ to $M$:

   (a) Compute

   $$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)).$$

   (b) Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.

For squared-error loss

$$L(y, f(x)) = (y - f(x))^2, \qquad (10.6)$$

one has

$$L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)) = (y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma))^2 = (r_{im} - \beta b(x_i; \gamma))^2, \qquad (10.7)$$

where $r_{im} = y_i - f_{m-1}(x_i)$ is simply the residual of the current model on the $i$th observation. Thus, for squared-error loss, the term $\beta_m b(x; \gamma_m)$ that best fits the current residuals is added to the expansion at each step. This idea is the basis for "least squares" regression boosting discussed in Section 10.10.2. However, as we show near the end of the next section, squared-error loss is generally not a good choice for classification; hence the need to consider other loss criteria.

10.4 Exponential Loss and AdaBoost

We now show that AdaBoost.M1 (Algorithm 10.1) is equivalent to forward stagewise additive modeling (Algorithm 10.2) using the loss function

$$L(y, f(x)) = \exp(-y f(x)). \qquad (10.8)$$

The appropriateness of this criterion is addressed in the next section.

For AdaBoost the basis functions are the individual classifiers $G_m(x) \in \{-1, 1\}$. Using the exponential loss function, one must solve

$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N \exp[-y_i(f_{m-1}(x_i) + \beta G(x_i))]$$

for the classifier $G_m$ and corresponding coefficient $\beta_m$ to be added at each step. This can be expressed as

$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N w_i^{(m)} \exp(-\beta y_i G(x_i)) \qquad (10.9)$$

with $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$. Since each $w_i^{(m)}$ depends neither on $\beta$ nor $G(x)$, it can be regarded as a weight that is applied to each observation.
This weight depends on $f_{m-1}(x_i)$, and so the individual weight values change with each iteration $m$.

The solution to (10.9) can be obtained in two steps. First, for any value of $\beta > 0$, the solution to (10.9) for $G_m(x)$ is

$$G_m = \arg\min_G \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i)), \qquad (10.10)$$

which is the classifier that minimizes the weighted error rate in predicting $y$. This can be easily seen by expressing the criterion in (10.9) as

$$e^{-\beta} \cdot \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \cdot \sum_{y_i \ne G(x_i)} w_i^{(m)},$$

which in turn can be written as

$$(e^{\beta} - e^{-\beta}) \cdot \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i)) + e^{-\beta} \cdot \sum_{i=1}^N w_i^{(m)}. \qquad (10.11)$$

Plugging this $G_m$ into (10.9) and solving for $\beta$ one obtains

$$\beta_m = \frac{1}{2}\log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}, \qquad (10.12)$$

where $\mathrm{err}_m$ is the minimized weighted error rate

$$\mathrm{err}_m = \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \ne G_m(x_i))}{\sum_{i=1}^N w_i^{(m)}}. \qquad (10.13)$$

The approximation is then updated,

$$f_m(x) = f_{m-1}(x) + \beta_m G_m(x),$$

which causes the weights for the next iteration to be

$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{-\beta_m y_i G_m(x_i)}. \qquad (10.14)$$

Using the fact that $-y_i G_m(x_i) = 2 \cdot I(y_i \ne G_m(x_i)) - 1$, (10.14) becomes

$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{\alpha_m I(y_i \ne G_m(x_i))} \cdot e^{-\beta_m}, \qquad (10.15)$$

where $\alpha_m = 2\beta_m$ is the quantity defined at line 2c of AdaBoost.M1 (Algorithm 10.1). The factor $e^{-\beta_m}$ in (10.15) multiplies all weights by the same value, so it has no effect. Thus (10.15) is equivalent to line 2(d) of Algorithm 10.1.

One can view line 2(a) of the AdaBoost.M1 algorithm as a method for approximately solving the minimization in (10.11) and hence (10.10). Hence we conclude that AdaBoost.M1 minimizes the exponential loss criterion (10.8) via a forward-stagewise additive modeling approach.

Figure 10.3 shows the training-set misclassification error rate and average exponential loss for the simulated data problem (10.2) of Figure 10.2. The training-set misclassification error decreases to zero at around 250 iterations (and remains there), but the exponential loss keeps decreasing. Notice also in Figure 10.2 that the test-set misclassification error continues to improve after iteration 250. Clearly AdaBoost is not optimizing training-set misclassification error; the exponential loss is more sensitive to changes in the estimated class probabilities.

FIGURE 10.3. Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss: $(1/N)\sum_{i=1}^N \exp(-y_i f(x_i))$. After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.

10.5 Why Exponential Loss?

The AdaBoost.M1 algorithm was originally motivated from a very different perspective than presented in the previous section. Its equivalence to forward stagewise additive modeling based on exponential loss was only discovered five years after its inception. By studying the properties of the exponential loss criterion, one can gain insight into the procedure and discover ways it might be improved.

The principal attraction of exponential loss in the context of additive modeling is computational; it leads to the simple modular reweighting AdaBoost algorithm. However, it is of interest to inquire about its statistical properties. What does it estimate and how well is it being estimated? The first question is answered by seeking its population minimizer. It is easy to show (Friedman et al., 2000) that

$$f^*(x) = \arg\min_{f(x)} E_{Y|x}(e^{-Yf(x)}) = \frac{1}{2}\log\frac{\Pr(Y=1|x)}{\Pr(Y=-1|x)}, \qquad (10.16)$$
or equivalently

$$\Pr(Y=1|x) = \frac{1}{1 + e^{-2f^*(x)}}.$$

Thus, the additive expansion produced by AdaBoost is estimating one-half the log-odds of $\Pr(Y=1|x)$. This justifies using its sign as the classification rule in (10.1).

Another loss criterion with the same population minimizer is the binomial negative log-likelihood or deviance (also known as cross-entropy), interpreting $f$ as the logit transform. Let

$$p(x) = \Pr(Y=1|x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} = \frac{1}{1 + e^{-2f(x)}} \qquad (10.17)$$

and define $Y' = (Y+1)/2 \in \{0, 1\}$. Then the binomial log-likelihood loss function is

$$l(Y', p(x)) = Y'\log p(x) + (1 - Y')\log(1 - p(x)),$$

or equivalently the deviance is

$$-l(Y, f(x)) = \log\bigl(1 + e^{-2Yf(x)}\bigr). \qquad (10.18)$$

Since the population maximizer of log-likelihood is at the true probabilities $p(x) = \Pr(Y=1|x)$, we see from (10.17) that the population minimizers of the deviance $E_{Y|x}[-l(Y, f(x))]$ and $E_{Y|x}[e^{-Yf(x)}]$ are the same. Thus, using either criterion leads to the same solution at the population level. Note that $e^{-Yf}$ itself is not a proper log-likelihood, since it is not the logarithm of any probability mass function for a binary random variable $Y \in \{-1, 1\}$.

10.6 Loss Functions and Robustness

In this section we examine the different loss functions for classification and regression more closely, and characterize them in terms of their robustness to extreme data.

Robust Loss Functions for Classification

Although both the exponential (10.8) and binomial deviance (10.18) yield the same solution when applied to the population joint distribution, the same is not true for finite data sets. Both criteria are monotone decreasing functions of the "margin" $yf(x)$. In classification (with a $-1/1$ response) the margin plays a role analogous to the residuals $y - f(x)$ in regression. The classification rule $G(x) = \mathrm{sign}[f(x)]$ implies that observations with positive margin $y_i f(x_i) > 0$ are classified correctly whereas those with negative margin $y_i f(x_i) < 0$ are misclassified. The decision boundary is defined by $f(x) = 0$. The goal of the classification algorithm is to produce positive margins as frequently as possible. Any loss criterion used for classification should penalize negative margins more heavily than positive ones since positive margin observations are already correctly classified.

FIGURE 10.4. Loss functions for two-class classification. The response is $y = \pm 1$; the prediction is $f$, with class prediction $\mathrm{sign}(f)$. The losses are misclassification: $I(\mathrm{sign}(f) \ne y)$; exponential: $\exp(-yf)$; binomial deviance: $\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).

Figure 10.4 shows both the exponential (10.8) and binomial deviance criteria as a function of the margin $y \cdot f(x)$. Also shown is misclassification loss $L(y, f(x)) = I(y \cdot f(x) < 0)$, which gives unit penalty for negative margin values, and no penalty at all for positive ones. Both the exponential and deviance loss can be viewed as monotone continuous approximations to misclassification loss. They continuously penalize increasingly negative margin values more heavily than they reward increasingly positive ones. The difference between them is in degree.
The penalty associated with binomial deviance increases linearly for large increasingly negative margin, whereas the exponential criterion increases the influence of such observations exponentially.

At any point in the training process the exponential criterion concentrates much more influence on observations with large negative margins. Binomial deviance concentrates relatively less influence on such observations, more evenly spreading the influence among all of the data. It is therefore far more robust in noisy settings where the Bayes error rate is not close to zero, and especially in situations where there is misspecification of the class labels in the training data. The performance of AdaBoost has been empirically observed to dramatically degrade in such situations.

Also shown in the figure is squared-error loss. The minimizer of the corresponding risk on the population is

$$f^*(x) = \arg\min_{f(x)} E_{Y|x}(Y - f(x))^2 = E(Y|x) = 2 \cdot \Pr(Y=1|x) - 1. \qquad (10.19)$$

As before the classification rule is $G(x) = \mathrm{sign}[f(x)]$. Squared-error loss is not a good surrogate for misclassification error. As seen in Figure 10.4, it is not a monotone decreasing function of increasing margin $yf(x)$. For margin values $y_i f(x_i) > 1$ it increases quadratically, thereby placing increasing influence (error) on observations that are correctly classified with increasing certainty, thereby reducing the relative influence of those incorrectly classified $y_i f(x_i) < 0$. Thus, if class assignment is the goal, a monotone decreasing criterion serves as a better surrogate loss function. Figure 12.4 on page 426 in Chapter 12 includes a modification of quadratic loss, the "Huberized" square hinge loss (Rosset et al., 2004b), which enjoys the favorable properties of the binomial deviance, quadratic loss and the SVM hinge loss. It has the same population minimizer as the quadratic (10.19), is zero for $y \cdot f(x) > 1$, and becomes linear for $y \cdot f(x) < -1$. Since quadratic functions are easier to compute with than exponentials, our experience suggests this to be a useful alternative to the binomial deviance.

With $K$-class classification, the response $Y$ takes values in the unordered set $\mathcal{G} = \{G_1, \ldots, G_K\}$ (see Sections 2.4 and 4.4). We now seek a classifier $G(x)$ taking values in $\mathcal{G}$. It is sufficient to know the class conditional probabilities $p_k(x) = \Pr(Y = G_k|x)$, $k = 1, 2, \ldots, K$, for then the Bayes classifier is

$$G(x) = G_k \quad\text{where}\quad k = \arg\max_\ell p_\ell(x). \qquad (10.20)$$

In principle, though, we need not learn the $p_k(x)$, but simply which one is largest. However, in data mining applications the interest is often more in the class probabilities $p_\ell(x)$, $\ell = 1, \ldots, K$ themselves, rather than in performing a class assignment. As in Section 4.4, the logistic model generalizes naturally to $K$ classes,

$$p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^K e^{f_l(x)}}, \qquad (10.21)$$

which ensures that $0 \le p_k(x) \le 1$ and that they sum to one. Note that here we have $K$ different functions, one per class. There is a redundancy in the functions $f_k(x)$, since adding an arbitrary $h(x)$ to each leaves the model unchanged. Traditionally one of them is set to zero: for example, $f_K(x) = 0$, as in (4.17). Here we prefer to retain the symmetry, and impose the constraint $\sum_{k=1}^K f_k(x) = 0$. The binomial deviance extends naturally to the $K$-class multinomial deviance loss function:

$$L(y, p(x)) = -\sum_{k=1}^K I(y = G_k)\log p_k(x) = -\sum_{k=1}^K I(y = G_k) f_k(x) + \log\Bigl(\sum_{\ell=1}^K e^{f_\ell(x)}\Bigr). \qquad (10.22)$$
As in the two-class case, the criterion (10.22) penalizes incorrect predictions only linearly in their degree of incorrectness.

Zhu et al. (2005) generalize the exponential loss for $K$-class problems. See Exercise 10.5 for details.

Robust Loss Functions for Regression

In the regression setting, analogous to the relationship between exponential loss and binomial log-likelihood is the relationship between squared-error loss $L(y, f(x)) = (y - f(x))^2$ and absolute loss $L(y, f(x)) = |y - f(x)|$. The population solutions are $f(x) = E(Y|x)$ for squared-error loss, and $f(x) = \mathrm{median}(Y|x)$ for absolute loss; for symmetric error distributions these are the same. However, on finite samples squared-error loss places much more emphasis on observations with large absolute residuals $|y_i - f(x_i)|$ during the fitting process. It is thus far less robust, and its performance severely degrades for long-tailed error distributions and especially for grossly mismeasured $y$-values ("outliers"). Other more robust criteria, such as absolute loss, perform much better in these situations. In the statistical robustness literature, a variety of regression loss criteria have been proposed that provide strong resistance (if not absolute immunity) to gross outliers while being nearly as efficient as least squares for Gaussian errors. They are often better than either for error distributions with moderately heavy tails. One such criterion is the Huber loss criterion used for M-regression (Huber, 1964),

$$L(y, f(x)) = \begin{cases} [y - f(x)]^2 & \text{for } |y - f(x)| \le \delta, \\ 2\delta|y - f(x)| - \delta^2 & \text{otherwise.} \end{cases} \qquad (10.23)$$

Figure 10.5 compares these three loss functions.

FIGURE 10.5. A comparison of three loss functions for regression, plotted as a function of the margin $y - f$. The Huber loss function combines the good properties of squared-error loss near zero and absolute error loss when $|y - f|$ is large.

These considerations suggest that when robustness is a concern, as is especially the case in data mining applications (see Section 10.7), squared-error loss for regression and exponential loss for classification are not the best criteria from a statistical perspective. However, they both lead to the elegant modular boosting algorithms in the context of forward stagewise additive modeling. For squared-error loss one simply fits the base learner to the residuals from the current model $y_i - f_{m-1}(x_i)$ at each step. For exponential loss one performs a weighted fit of the base learner to the output values $y_i$, with weights $w_i = \exp(-y_i f_{m-1}(x_i))$. Using other more robust criteria directly in their place does not give rise to such simple feasible boosting algorithms. However, in Section 10.10.2 we show how one can derive simple elegant boosting algorithms based on any differentiable loss criterion, thereby producing highly robust boosting procedures for data mining.

10.7 "Off-the-Shelf" Procedures for Data Mining

Predictive learning is an important aspect of data mining. As can be seen from this book, a wide variety of methods have been developed for predictive learning from data. For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data. We have attempted to characterize appropriate situations in our discussions of each of the respective methods.
However, it is seldom known in advance which procedure will perform best or even well for any given problem. Table 10.1 summarizes some of the characteristics of a number of learning methods.

TABLE 10.1. Some characteristics of different learning methods. Key: ▲ = good, ◆ = fair, ▼ = poor.

Characteristic                                       Neural Nets  SVM  Trees  MARS  k-NN/Kernels
Natural handling of data of "mixed" type                  ▼        ▼     ▲     ▲        ▼
Handling of missing values                                ▼        ▼     ▲     ▲        ▲
Robustness to outliers in input space                     ▼        ▼     ▲     ▼        ▲
Insensitive to monotone transformations of inputs         ▼        ▼     ▲     ▼        ▼
Computational scalability (large N)                       ▼        ▼     ▲     ▲        ▼
Ability to deal with irrelevant inputs                    ▼        ▼     ▲     ▲        ▼
Ability to extract linear combinations of features        ▲        ▲     ▼     ▼        ◆
Interpretability                                          ▼        ▼     ◆     ▲        ▼
Predictive power                                          ▲        ▲     ▼     ◆        ▲

Industrial and commercial data mining applications tend to be especially challenging in terms of the requirements placed on learning procedures. Data sets are often very large in terms of number of observations and number of variables measured on each of them. Thus, computational considerations play an important role. Also, the data are usually messy: the inputs tend to be mixtures of quantitative, binary, and categorical variables, the latter often with many levels. There are generally many missing values, complete observations being rare. Distributions of numeric predictor and response variables are often long-tailed and highly skewed. This is the case for the spam data (Section 9.1.2); when fitting a generalized additive model, we first log-transformed each of the predictors in order to get a reasonable fit. In addition they usually contain a substantial fraction of gross mis-measurements (outliers). The predictor variables are generally measured on very different scales.

In data mining applications, usually only a small fraction of the large number of predictor variables that have been included in the analysis are actually relevant to prediction. Also, unlike many applications such as pattern recognition, there is seldom reliable domain knowledge to help create especially relevant features and/or filter out the irrelevant ones, the inclusion of which dramatically degrades the performance of many methods. In addition, data mining applications generally require interpretable models. It is not enough to simply produce predictions. It is also desirable to have information providing qualitative understanding of the relationship between joint values of the input variables and the resulting predicted response value. Thus, black box methods such as neural networks, which can be quite useful in purely predictive settings such as pattern recognition, are far less useful for data mining.

These requirements of speed, interpretability and the messy nature of the data sharply limit the usefulness of most learning procedures as off-the-shelf methods for data mining. An "off-the-shelf" method is one that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing or careful tuning of the learning procedure.

Of all the well-known learning methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining. They are relatively fast to construct and they produce interpretable models (if the trees are small). As discussed in Section 9.2, they naturally incorporate mixtures of numeric and categorical predictor variables and missing values.
They are invariant under (strictly monotone) transformations of the individual predictors. As a result, scaling and/or more general transformations are not an issue, and they are immune to the effects of predictor outliers. They perform internal feature selection as an integral part of the procedure. They are thereby resistant, if not completely immune, to the inclusion of many irrelevant predictor variables. These properties of decision trees are largely the reason that they have emerged as the most popular learning method for data mining.

Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy. They seldom provide predictive accuracy comparable to the best that can be achieved with the data at hand. As seen in Section 10.1, boosting decision trees improves their accuracy, often dramatically. At the same time it maintains most of their desirable properties for data mining. Some advantages of trees that are sacrificed by boosting are speed, interpretability, and, for AdaBoost, robustness against overlapping class distributions and especially mislabeling of the training data. A gradient boosted model (GBM) is a generalization of tree boosting that attempts to mitigate these problems, so as to produce an accurate and effective off-the-shelf procedure for data mining.

10.8 Example: Spam Data

Before we go into the details of gradient boosting, we demonstrate its abilities on a two-class classification problem. The spam data are introduced in Chapter 1, and used as an example for many of the procedures in Chapter 9 (Sections 9.1.2, 9.2.5, 9.3.1 and 9.4.1).

Applying gradient boosting to these data resulted in a test error rate of 4.5%, using the same test set as was used in Section 9.1.2. By comparison, an additive logistic regression achieved 5.5%, a CART tree fully grown and pruned by cross-validation 8.7%, and MARS 5.5%. The standard error of these estimates is around 0.6%, although gradient boosting is significantly better than all of them using the McNemar test (Exercise 10.6).

In Section 10.13 below we develop a relative importance measure for each predictor, as well as a partial dependence plot describing a predictor's contribution to the fitted model. We now illustrate these for the spam data. Figure 10.6 displays the relative importance spectrum for all 57 predictor variables. Clearly some predictors are more important than others in separating spam from email. The frequencies of the character strings !, $, hp, and remove are estimated to be the four most relevant predictor variables. At the other end of the spectrum, the character strings 857, 415, table, and 3d have virtually no relevance. The quantity being modeled here is the log-odds of spam versus email,

$$f(x) = \log\frac{\Pr(\mathrm{spam}|x)}{\Pr(\mathrm{email}|x)} \qquad (10.24)$$

(see Section 10.13 below). Figure 10.7 shows the partial dependence of the log-odds on selected important predictors, two positively associated with spam (! and remove), and two negatively associated (edu and hp). These particular dependencies are seen to be essentially monotonic. There is a general agreement with the corresponding functions found by the additive logistic regression model; see Figure 9.1 on page 303.

Running a gradient boosted model on these data with $J = 2$ terminal-node trees produces a purely additive (main effects) model for the log-odds, with a corresponding error rate of 4.7%, as compared to 4.5% for the full gradient boosted model (with $J = 5$ terminal-node trees).
Although not significant, this slightly higher error rate suggests that there may be interactions among some of the important predictor variables. This can be diagnosed through two-variable partial dependence plots. Figure 10.8 shows one of the several such plots displaying strong interaction effects. One sees that for very low frequencies of hp, the log-odds of spam are greatly increased. For high frequencies of hp, the log-odds of spam tend to be much lower and roughly constant as a function of !. As the frequency of hp decreases, the functional relationship with ! strengthens.

[FIGURE 10.6. Predictor variable importance spectrum for the spam data. The variable names are written on the vertical axis.]

[FIGURE 10.7. Partial dependence of the log-odds of spam on four important predictors (!, remove, edu and hp). The red ticks at the base of the plots are deciles of the input variable.]

[FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a function of the joint frequencies of hp and the character !.]

10.9 Boosting Trees

Regression and classification trees are discussed in detail in Section 9.2. They partition the space of all joint predictor variable values into disjoint regions R_j, j = 1, 2, \ldots, J, as represented by the terminal nodes of the tree. A constant \gamma_j is assigned to each such region and the predictive rule is

    x \in R_j \Rightarrow f(x) = \gamma_j.

Thus a tree can be formally expressed as

    T(x; \Theta) = \sum_{j=1}^{J} \gamma_j I(x \in R_j),        (10.25)

with parameters \Theta = \{R_j, \gamma_j\}_1^J. J is usually treated as a meta-parameter. The parameters are found by minimizing the empirical risk

    \hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j).        (10.26)

This is a formidable combinatorial optimization problem, and we usually settle for approximate suboptimal solutions. It is useful to divide the optimization problem into two parts:

Finding \gamma_j given R_j: Given the R_j, estimating the \gamma_j is typically trivial, and often \hat{\gamma}_j = \bar{y}_j, the mean of the y_i falling in region R_j. For misclassification loss, \hat{\gamma}_j is the modal class of the observations falling in region R_j.

Finding R_j: This is the difficult part, for which approximate solutions are found. Note also that finding the R_j entails estimating the \gamma_j as well. A typical strategy is to use a greedy, top-down recursive partitioning algorithm to find the R_j. In addition, it is sometimes necessary to approximate (10.26) by a smoother and more convenient criterion for optimizing the R_j:

    \tilde{\Theta} = \arg\min_{\Theta} \sum_{i=1}^{N} \tilde{L}(y_i, T(x_i; \Theta)).        (10.27)

Then, given the \hat{R}_j = \tilde{R}_j, the \gamma_j can be estimated more precisely using the original criterion.

In Section 9.2 we described such a strategy for classification trees. The Gini index replaced misclassification loss in the growing of the tree (identifying the R_j).
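This two-part strategy is exactly what standard tree software carries out. As a minimal sketch (assuming the rpart package and a small simulated data set), the greedy partitioner finds the regions R_j, and the fitted constants \hat{\gamma}_j are the region means of the response:

    library(rpart)
    set.seed(1)
    d <- data.frame(x1 = runif(200), x2 = runif(200))
    d$y <- ifelse(d$x1 < 0.5, 1, 3) + rnorm(200, sd = 0.3)
    tree <- rpart(y ~ x1 + x2, data = d,
                  control = rpart.control(maxdepth = 2, cp = 0))
    # The fitted constants gamma_j (mean response in each terminal region):
    tree$frame[tree$frame$var == "<leaf>", "yval"]
    # Number of training points falling in each region R_j:
    table(tree$where)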
The boosted tree model is a sum of such trees,

    f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m),        (10.28)

induced in a forward stagewise manner (Algorithm 10.2). At each step in the forward stagewise procedure one must solve

    \hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\big)        (10.29)

for the region set and constants \Theta_m = \{R_{jm}, \gamma_{jm}\}_1^{J_m} of the next tree, given the current model f_{m-1}(x).

Given the regions R_{jm}, finding the optimal constants \gamma_{jm} in each region is typically straightforward:

    \hat{\gamma}_{jm} = \arg\min_{\gamma_{jm}} \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma_{jm}\big).        (10.30)

Finding the regions is difficult, and even more difficult than for a single tree. For a few special cases, the problem simplifies.

For squared-error loss, the solution to (10.29) is no harder than for a single tree. It is simply the regression tree that best predicts the current residuals y_i - f_{m-1}(x_i), and \hat{\gamma}_{jm} is the mean of these residuals in each corresponding region.

For two-class classification and exponential loss, this stagewise approach gives rise to the AdaBoost method for boosting classification trees (Algorithm 10.1). In particular, if the trees T(x; \Theta_m) are restricted to be scaled classification trees, then we showed in Section 10.4 that the solution to (10.29) is the tree that minimizes the weighted error rate \sum_{i=1}^{N} w_i^{(m)} I(y_i \neq T(x_i; \Theta_m)) with weights w_i^{(m)} = e^{-y_i f_{m-1}(x_i)}. By a scaled classification tree, we mean \beta_m T(x; \Theta_m), with the restriction that \gamma_{jm} \in \{-1, 1\}.

Without this restriction, (10.29) still simplifies for exponential loss to a weighted exponential criterion for the new tree:

    \hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} w_i^{(m)} \exp[-y_i T(x_i; \Theta_m)].        (10.31)

It is straightforward to implement a greedy recursive-partitioning algorithm using this weighted exponential loss as a splitting criterion. Given the R_{jm}, one can show (Exercise 10.7) that the solution to (10.30) is the weighted log-odds in each corresponding region

    \hat{\gamma}_{jm} = \log \frac{\sum_{x_i \in R_{jm}} w_i^{(m)} I(y_i = 1)}{\sum_{x_i \in R_{jm}} w_i^{(m)} I(y_i = -1)}.        (10.32)

This requires a specialized tree-growing algorithm; in practice, we prefer the approximation presented below that uses a weighted least squares regression tree.

Using loss criteria such as the absolute error or the Huber loss (10.23) in place of squared-error loss for regression, and the deviance (10.22) in place of exponential loss for classification, will serve to robustify boosting trees. Unfortunately, unlike their nonrobust counterparts, these robust criteria do not give rise to simple fast boosting algorithms.

For more general loss criteria the solution to (10.30), given the R_{jm}, is typically straightforward since it is a simple "location" estimate. For absolute loss it is just the median of the residuals in each respective region. For the other criteria fast iterative algorithms exist for solving (10.30), and usually their faster "single-step" approximations are adequate. The problem is tree induction. Simple fast algorithms do not exist for solving (10.29) for these more general loss criteria, and approximations like (10.27) become essential.
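For squared-error loss this forward stagewise procedure can be written down directly. The following is a minimal sketch (not the implementation used for the figures in this chapter), using rpart to fit each tree to the current residuals; the region means of the residuals then play the role of the \hat{\gamma}_{jm} in (10.30):

    library(rpart)
    boost_ls <- function(d, M = 100, depth = 2) {
      # d: data frame with numeric response `y`; returns the list of fitted trees
      f <- rep(mean(d$y), nrow(d))                 # f_0: the optimal constant
      trees <- vector("list", M)
      for (m in 1:M) {
        dm <- d
        dm$y <- d$y - f                            # current residuals y_i - f_{m-1}(x_i)
        trees[[m]] <- rpart(y ~ ., data = dm,
                            control = rpart.control(maxdepth = depth, cp = 0))
        f <- f + predict(trees[[m]], d)            # add the new tree's region means
      }
      trees
    }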
10.10 Numerical Optimization via Gradient Boosting

Fast approximate algorithms for solving (10.29) with any differentiable loss criterion can be derived by analogy to numerical optimization. The loss in using f(x) to predict y on the training data is

    L(f) = \sum_{i=1}^{N} L(y_i, f(x_i)).        (10.33)

The goal is to minimize L(f) with respect to f, where here f(x) is constrained to be a sum of trees (10.28). Ignoring this constraint, minimizing (10.33) can be viewed as a numerical optimization

    \hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}),        (10.34)

where the "parameters" \mathbf{f} \in \mathbb{R}^N are the values of the approximating function f(x_i) at each of the N data points x_i:

    \mathbf{f} = \{f(x_1), f(x_2), \ldots, f(x_N)\}.

Numerical optimization procedures solve (10.34) as a sum of component vectors

    \mathbf{f}_M = \sum_{m=0}^{M} \mathbf{h}_m, \quad \mathbf{h}_m \in \mathbb{R}^N,

where \mathbf{f}_0 = \mathbf{h}_0 is an initial guess, and each successive \mathbf{f}_m is induced based on the current parameter vector \mathbf{f}_{m-1}, which is the sum of the previously induced updates. Numerical optimization methods differ in their prescriptions for computing each increment vector \mathbf{h}_m ("step").

10.10.1 Steepest Descent

Steepest descent chooses \mathbf{h}_m = -\rho_m \mathbf{g}_m, where \rho_m is a scalar and \mathbf{g}_m \in \mathbb{R}^N is the gradient of L(\mathbf{f}) evaluated at \mathbf{f} = \mathbf{f}_{m-1}. The components of the gradient \mathbf{g}_m are

    g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}.        (10.35)

The step length \rho_m is the solution to

    \rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho \mathbf{g}_m).        (10.36)

The current solution is then updated

    \mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m

and the process repeated at the next iteration. Steepest descent can be viewed as a very greedy strategy, since -\mathbf{g}_m is the local direction in \mathbb{R}^N for which L(\mathbf{f}) is most rapidly decreasing at \mathbf{f} = \mathbf{f}_{m-1}.

10.10.2 Gradient Boosting

Forward stagewise boosting (Algorithm 10.2) is also a very greedy strategy. At each step the solution tree is the one that maximally reduces (10.29), given the current model f_{m-1} and its fits f_{m-1}(x_i). Thus, the tree predictions T(x_i; \Theta_m) are analogous to the components of the negative gradient (10.35). The principal difference between them is that the tree components \mathbf{t}_m = (T(x_1; \Theta_m), \ldots, T(x_N; \Theta_m)) are not independent. They are constrained to be the predictions of a J_m-terminal node decision tree, whereas the negative gradient is the unconstrained maximal descent direction.

The solution to (10.30) in the stagewise approach is analogous to the line search (10.36) in steepest descent. The difference is that (10.30) performs a separate line search for those components of \mathbf{t}_m that correspond to each separate terminal region \{T(x_i; \Theta_m)\}_{x_i \in R_{jm}}.

If minimizing loss on the training data (10.33) were the only goal, steepest descent would be the preferred strategy. The gradient (10.35) is trivial to calculate for any differentiable loss function L(y, f(x)), whereas solving (10.29) is difficult for the robust criteria discussed in Section 10.6. Unfortunately the gradient (10.35) is defined only at the training data points x_i, whereas the ultimate goal is to generalize f_M(x) to new data not represented in the training set.

A possible resolution to this dilemma is to induce a tree T(x; \Theta_m) at the mth iteration whose predictions \mathbf{t}_m are as close as possible to the negative gradient. Using squared error to measure closeness, this leads us to

    \tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} (-g_{im} - T(x_i; \Theta))^2.        (10.37)

That is, one fits the tree T to the negative gradient values (10.35) by least squares. As noted in Section 10.9, fast algorithms exist for least squares decision tree induction. Although the solution regions \tilde{R}_{jm} to (10.37) will not be identical to the regions R_{jm} that solve (10.29), they are generally similar enough to serve the same purpose. In any case, the forward stagewise boosting procedure, and top-down decision tree induction, are themselves approximation procedures.
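In code, one gradient step is just (10.37) with the appropriate pseudo-residuals; for the binomial deviance these are r_i = y_i - p_i (compare Table 10.2 below). A small self-contained sketch, again using rpart as the least squares tree fitter:

    library(rpart)
    set.seed(1)
    X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    y <- rbinom(200, 1, plogis(X$x1))   # 0/1 response
    f_prev <- rep(0, 200)               # current model f_{m-1}, on the log-odds scale
    p <- plogis(f_prev)                 # current fitted probabilities
    r <- y - p                          # negative gradient of the deviance (pseudo-residuals)
    grad_tree <- rpart(r ~ x1 + x2, data = cbind(r = r, X),
                       control = rpart.control(maxdepth = 4, cp = 0))
    # grad_tree's terminal regions are the R~_jm of (10.37); the constants would
    # then be recomputed within each region using the original criterion (10.30).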
TABLE 10.2. Gradients for commonly used loss functions.

    Setting          Loss Function                -\partial L(y_i, f(x_i)) / \partial f(x_i)
    Regression       \frac{1}{2}[y_i - f(x_i)]^2  y_i - f(x_i)
    Regression       |y_i - f(x_i)|               \mathrm{sign}[y_i - f(x_i)]
    Regression       Huber                        y_i - f(x_i) for |y_i - f(x_i)| \le \delta_m;
                                                  \delta_m \mathrm{sign}[y_i - f(x_i)] for |y_i - f(x_i)| > \delta_m,
                                                  where \delta_m = \alphath quantile\{|y_i - f(x_i)|\}
    Classification   Deviance                     kth component: I(y_i = G_k) - p_k(x_i)

After constructing the tree (10.37), the corresponding constants in each region are given by (10.30).

Table 10.2 summarizes the gradients for commonly used loss functions. For squared error loss, the negative gradient is just the ordinary residual -g_{im} = y_i - f_{m-1}(x_i), so that (10.37) on its own is equivalent to standard least squares boosting. With absolute error loss, the negative gradient is the sign of the residual, so at each iteration (10.37) fits the tree to the sign of the current residuals by least squares. For Huber M-regression, the negative gradient is a compromise between these two (see the table).

For classification the loss function is the multinomial deviance (10.22), and K least squares trees are constructed at each iteration. Each tree T_{km} is fit to its respective negative gradient vector \mathbf{g}_{km},

    -g_{ikm} = \frac{\partial L(y_i, f_{1m}(x_i), \ldots, f_{Km}(x_i))}{\partial f_{km}(x_i)} = I(y_i = G_k) - p_k(x_i),        (10.38)

with p_k(x) given by (10.21). Although K separate trees are built at each iteration, they are related through (10.21). For binary classification (K = 2), only one tree is needed (Exercise 10.10).

10.10.3 Implementations of Gradient Boosting

Algorithm 10.3 presents the generic gradient tree-boosting algorithm for regression. Specific algorithms are obtained by inserting different loss criteria L(y, f(x)). The first line of the algorithm initializes to the optimal constant model, which is just a single terminal node tree. The components of the negative gradient computed at line 2(a) are referred to as generalized or pseudo residuals, r. Gradients for commonly used loss functions are summarized in Table 10.2.

Algorithm 10.3 Gradient Tree Boosting Algorithm.

1. Initialize f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma).
2. For m = 1 to M:
   (a) For i = 1, 2, \ldots, N compute
       r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}.
   (b) Fit a regression tree to the targets r_{im}, giving terminal regions R_{jm}, j = 1, 2, \ldots, J_m.
   (c) For j = 1, 2, \ldots, J_m compute
       \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma).
   (d) Update f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm}).
3. Output \hat{f}(x) = f_M(x).

The algorithm for classification is similar. Lines 2(a)–(d) are repeated K times at each iteration m, once for each class, using (10.38). The result at line 3 is K different (coupled) tree expansions f_{kM}(x), k = 1, 2, \ldots, K. These produce probabilities via (10.21) or do classification as in (10.20). Details are given in Exercise 10.9. Two basic tuning parameters are the number of iterations M and the sizes of each of the constituent trees J_m, m = 1, 2, \ldots, M.

The original implementation of this algorithm was called MART for "multiple additive regression trees," and was referred to in the first edition of this book. Many of the figures in this chapter were produced by MART. Gradient boosting as described here is implemented in the R gbm package (Ridgeway, 1999, "Gradient Boosted Models"), and is freely available. The gbm package is used in Section 10.14.2, and extensively in Chapters 15 and 16. Another R implementation of boosting is mboost (Hothorn and Bühlmann, 2006).
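In gbm, the distribution argument plays the role of the loss criterion L(y, f(x)) in Algorithm 10.3. A hedged sketch on simulated long-tailed data (the argument values are illustrative):

    library(gbm)
    set.seed(1)
    d <- data.frame(x = runif(500))
    d$y <- sin(4 * d$x) + 0.3 * rt(500, df = 2)   # heavy-tailed noise
    # "gaussian" = squared error, "laplace" = absolute error,
    # "bernoulli" = binomial deviance (compare Table 10.2):
    fit <- gbm(y ~ x, data = d, distribution = "laplace",
               n.trees = 1000, interaction.depth = 4, shrinkage = 0.05)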
A commercial implementation of gradient boosting/MART called TreeNet® is available from Salford Systems, Inc.

10.11 Right-Sized Trees for Boosting

Historically, boosting was considered to be a technique for combining models, here trees. As such, the tree building algorithm was regarded as a primitive that produced models to be combined by the boosting procedure. In this scenario, the optimal size of each tree is estimated separately in the usual manner when it is built (Section 9.2). A very large (oversized) tree is first induced, and then a bottom-up procedure is employed to prune it to the estimated optimal number of terminal nodes. This approach assumes implicitly that each tree is the last one in the expansion (10.28). Except perhaps for the very last tree, this is clearly a very poor assumption. The result is that trees tend to be much too large, especially during the early iterations. This substantially degrades performance and increases computation.

The simplest strategy for avoiding this problem is to restrict all trees to be the same size, J_m = J \ \forall m. At each iteration a J-terminal node regression tree is induced. Thus J becomes a meta-parameter of the entire boosting procedure, to be adjusted to maximize estimated performance for the data at hand.

One can get an idea of useful values for J by considering the properties of the "target" function

    \eta = \arg\min_{f} \mathrm{E}_{XY} L(Y, f(X)).        (10.39)

Here the expected value is over the population joint distribution of (X, Y). The target function \eta(x) is the one with minimum prediction risk on future data. This is the function we are trying to approximate.

One relevant property of \eta(X) is the degree to which the coordinate variables X^T = (X_1, X_2, \ldots, X_p) interact with one another. This is captured by its ANOVA (analysis of variance) expansion

    \eta(X) = \sum_j \eta_j(X_j) + \sum_{jk} \eta_{jk}(X_j, X_k) + \sum_{jkl} \eta_{jkl}(X_j, X_k, X_l) + \cdots.        (10.40)

The first sum in (10.40) is over functions of only a single predictor variable X_j. The particular functions \eta_j(X_j) are those that jointly best approximate \eta(X) under the loss criterion being used. Each such \eta_j(X_j) is called the "main effect" of X_j. The second sum is over those two-variable functions that when added to the main effects best fit \eta(X). These are called the second-order interactions of each respective variable pair (X_j, X_k). The third sum represents third-order interactions, and so on. For many problems encountered in practice, low-order interaction effects tend to dominate. When this is the case, models that produce strong higher-order interaction effects, such as large decision trees, suffer in accuracy.

The interaction level of tree-based approximations is limited by the tree size J. Namely, no interaction effects of level greater than J - 1 are possible. Since boosted models are additive in the trees (10.28), this limit extends to them as well. Setting J = 2 (single split "decision stump") produces boosted models with only main effects; no interactions are permitted. With J = 3, two-variable interaction effects are also allowed, and so on.

[FIGURE 10.9. Boosting with different sized trees (stumps, 10-node, 100-node, and AdaBoost), applied to the example (10.2) used in Figure 10.2. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.]
This suggests that the value chosen for J should reflect the level of dominant interactions of \eta(x). This is of course generally unknown, but in most situations it will tend to be low. Figure 10.9 illustrates the effect of interaction order (choice of J) on the simulation example (10.2). The generative function is additive (a sum of quadratic monomials), so boosting models with J > 2 incurs unnecessary variance and hence the higher test error. Figure 10.10 compares the coordinate functions found by boosted stumps with the true functions.

Although in many applications J = 2 will be insufficient, it is unlikely that J > 10 will be required. Experience so far indicates that 4 \le J \le 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range. One can fine-tune the value for J by trying several different values and choosing the one that produces the lowest risk on a validation sample. However, this seldom provides significant improvement over using J \approx 6.

[FIGURE 10.10. Coordinate functions f_1(x_1), \ldots, f_{10}(x_{10}) estimated by boosting stumps for the simulated example used in Figure 10.9. The true quadratic functions are shown for comparison.]

10.12 Regularization

Besides the size of the constituent trees, J, the other meta-parameter of gradient boosting is the number of boosting iterations M. Each iteration usually reduces the training risk L(f_M), so that for M large enough this risk can be made arbitrarily small. However, fitting the training data too well can lead to overfitting, which degrades the risk on future predictions. Thus, there is an optimal number M^* minimizing future risk that is application dependent. A convenient way to estimate M^* is to monitor prediction risk as a function of M on a validation sample. The value of M that minimizes this risk is taken to be an estimate of M^*. This is analogous to the early stopping strategy often used with neural networks (Section 11.4).

10.12.1 Shrinkage

Controlling the value of M is not the only possible regularization strategy. As with ridge regression and neural networks, shrinkage techniques can be employed as well (see Sections 3.4.1 and 11.5). The simplest implementation of shrinkage in the context of boosting is to scale the contribution of each tree by a factor 0 < \nu < 1 when it is added to the current approximation. That is, line 2(d) of Algorithm 10.3 is replaced by

    f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}).        (10.41)

The parameter \nu can be regarded as controlling the learning rate of the boosting procedure. Smaller values of \nu (more shrinkage) result in larger training risk for the same number of iterations M. Thus, both \nu and M control prediction risk on the training data. However, these parameters do not operate independently. Smaller values of \nu lead to larger values of M for the same training risk, so that there is a tradeoff between them.

Empirically it has been found (Friedman, 2001) that smaller values of \nu favor better test error, and require correspondingly larger values of M. In fact, the best strategy appears to be to set \nu very small (\nu < 0.1) and then choose M by early stopping. This yields dramatic improvements (over no shrinkage, \nu = 1) for regression and for probability estimation. The corresponding improvements in misclassification risk via (10.20) are less, but still substantial.
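The effect of \nu can be seen directly with gbm. A hedged sketch on nested-spheres style simulated data; note that gbm's interaction.depth counts splits, so a value of 1 corresponds to stumps, and train.fraction holds out the latter part of the data as a test set:

    library(gbm)
    set.seed(1)
    n <- 2000; p <- 10
    X <- matrix(rnorm(n * p), n, p)
    y <- as.numeric(rowSums(X^2) > qchisq(0.5, p))
    d <- data.frame(y = y, X)
    fit_shrunk <- gbm(y ~ ., data = d, distribution = "bernoulli",
                      n.trees = 2000, interaction.depth = 1,
                      shrinkage = 0.05, train.fraction = 0.5)
    fit_full   <- gbm(y ~ ., data = d, distribution = "bernoulli",
                      n.trees = 2000, interaction.depth = 1,
                      shrinkage = 1.0, train.fraction = 0.5)
    gbm.perf(fit_shrunk, method = "test")   # best iteration under shrinkage
    gbm.perf(fit_full,   method = "test")   # typically a worse test deviance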
The price paid for these improvements is computational: smaller values of \nu give rise to larger values of M, and computation is proportional to the latter. However, as seen below, many iterations are generally computationally feasible even on very large data sets. This is partly due to the fact that small trees are induced at each step with no pruning.

Figure 10.11 shows test error curves for the simulated example (10.2) of Figure 10.2. A gradient boosted model (MART) was trained using binomial deviance, using either stumps or six terminal-node trees, and with or without shrinkage. The benefits of shrinkage are evident, especially when the binomial deviance is tracked. With shrinkage, each test error curve reaches a lower value, and stays there for many iterations.

[FIGURE 10.11. Test error curves for the simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial deviance, either stumps or six terminal-node trees, and with or without shrinkage. The left panels report test deviance, while the right panels show misclassification error. The beneficial effect of shrinkage can be seen in all cases, especially for deviance in the left panels.]

Section 16.2.1 draws a connection between forward stagewise shrinkage in boosting and the use of an L_1 penalty for regularizing model parameters (the "lasso"). We argue that L_1 penalties may be superior to the L_2 penalties used by methods such as the support vector machine.

10.12.2 Subsampling

We saw in Section 8.7 that bootstrap averaging (bagging) improves the performance of a noisy classifier through averaging. Chapter 15 discusses in some detail the variance-reduction mechanism of this sampling followed by averaging. We can exploit the same device in gradient boosting, both to improve performance and computational efficiency.

With stochastic gradient boosting (Friedman, 1999), at each iteration we sample a fraction \eta of the training observations (without replacement), and grow the next tree using that subsample. The rest of the algorithm is identical. A typical value for \eta can be 1/2, although for large N, \eta can be substantially smaller than 1/2. Not only does the sampling reduce the computing time by the same fraction \eta, but in many cases it actually produces a more accurate model.

Figure 10.12 illustrates the effect of subsampling using the simulated example (10.2), both as a classification and as a regression example. We see in both cases that sampling along with shrinkage slightly outperformed the rest. It appears here that subsampling without shrinkage does poorly.

[FIGURE 10.12. Test-error curves for the simulated example (10.2), showing the effect of stochasticity. For the curves labeled "Sample = 0.5", a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right-hand panel, using squared-error loss.]
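In gbm, the subsampling fraction \eta is the bag.fraction argument. Continuing the simulated example from the previous sketch:

    fit_stoch <- gbm(y ~ ., data = d, distribution = "bernoulli",
                     n.trees = 2000, interaction.depth = 1,
                     shrinkage = 0.05, bag.fraction = 0.5,   # eta = 1/2
                     train.fraction = 0.5)
    gbm.perf(fit_stoch, method = "test")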
The downside is that we now have four parameters to set: J, M, \nu and \eta. Typically some early explorations determine suitable values for J, \nu and \eta, leaving M as the primary parameter.

10.13 Interpretation

Single decision trees are highly interpretable. The entire model can be completely represented by a simple two-dimensional graphic (binary tree) that is easily visualized. Linear combinations of trees (10.28) lose this important feature, and must therefore be interpreted in a different way.

10.13.1 Relative Importance of Predictor Variables

In data mining applications the input predictor variables are seldom equally relevant. Often only a few of them have substantial influence on the response; the vast majority are irrelevant and could just as well have not been included. It is often useful to learn the relative importance or contribution of each input variable in predicting the response.

For a single decision tree T, Breiman et al. (1984) proposed

    \mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I(v(t) = \ell)        (10.42)

as a measure of relevance for each predictor variable X_\ell. The sum is over the J - 1 internal nodes of the tree. At each such node t, one of the input variables X_{v(t)} is used to partition the region associated with that node into two subregions; within each a separate constant is fit to the response values. The particular variable chosen is the one that gives maximal estimated improvement \hat{\imath}_t^2 in squared error risk over that for a constant fit over the entire region. The squared relative importance of variable X_\ell is the sum of such squared improvements over all internal nodes for which it was chosen as the splitting variable.

This importance measure is easily generalized to additive tree expansions (10.28); it is simply averaged over the trees

    \mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m).        (10.43)

Due to the stabilizing effect of averaging, this measure turns out to be more reliable than its counterpart (10.42) for a single tree. Also, because of shrinkage (Section 10.12.1) the masking of important variables by others with which they are highly correlated is much less of a problem. Note that (10.42) and (10.43) refer to squared relevance; the actual relevances are their respective square roots. Since these measures are relative, it is customary to assign the largest a value of 100 and then scale the others accordingly. Figure 10.6 shows the relative importance of the 57 inputs in predicting spam versus email.

For K-class classification, K separate models f_k(x), k = 1, 2, \ldots, K are induced, each consisting of a sum of trees

    f_k(x) = \sum_{m=1}^{M} T_{km}(x).        (10.44)

In this case (10.43) generalizes to

    \mathcal{I}_{\ell k}^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_{km}).        (10.45)

Here \mathcal{I}_{\ell k} is the relevance of X_\ell in separating the class k observations from the other classes. The overall relevance of X_\ell is obtained by averaging over all of the classes

    \mathcal{I}_\ell^2 = \frac{1}{K} \sum_{k=1}^{K} \mathcal{I}_{\ell k}^2.        (10.46)

Figures 10.23 and 10.24 illustrate the use of these averaged and separate relative importances.

10.13.2 Partial Dependence Plots

After the most relevant variables have been identified, the next step is to attempt to understand the nature of the dependence of the approximation f(X) on their joint values.
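Both the relative influences of (10.43) and the partial dependence plots developed next are available for gbm fits. Continuing the running sketch (note that gbm normalizes the relative influences to sum to 100, rather than scaling the largest to 100 as in the text):

    summary(fit_stoch, n.trees = 1000)               # relative influence of each input
    plot(fit_stoch, i.var = 1, n.trees = 1000)       # partial dependence on X1
    plot(fit_stoch, i.var = c(1, 2), n.trees = 1000) # joint dependence, for interactions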
Graphical renderings of f(X) as a function of its arguments provide a comprehensive summary of its dependence on the joint values of the input variables. Unfortunately, such visualization is limited to low-dimensional views. We can easily display functions of one or two arguments, either continuous or discrete (or mixed), in a variety of different ways; this book is filled with such displays. Functions of slightly higher dimensions can be plotted by conditioning on particular sets of values of all but one or two of the arguments, producing a trellis of plots (Becker et al., 1996).

For more than two or three variables, viewing functions of the corresponding higher-dimensional arguments is more difficult. A useful alternative can sometimes be to view a collection of plots, each one of which shows the partial dependence of the approximation f(X) on a selected small subset of the input variables. Although such a collection can seldom provide a comprehensive depiction of the approximation, it can often produce helpful clues, especially when f(x) is dominated by low-order interactions (10.40). Consider the subvector X_S of [...]

[...]

    \mathrm{E}(Y \,|\, X) = \mathrm{E}(Y \,|\, Y > 0, X) \cdot \Pr(Y > 0 \,|\, X).        (10.54)

The second term is estimated by the logistic regression, and the first term can be estimated using only the 2353 trawls with a positive catch.

For the logistic regression the authors used a gradient boosted model (GBM)⁴ with binomial deviance loss function, depth-10 trees, and a shrinkage factor \nu = 0.025. For the positive-catch regression, they modeled \log(Y) using a GBM with squared-error loss (also depth-10 trees, but \nu = 0.01), and un-logged the predictions. In both cases they used 10-fold cross-validation for selecting the number of terms, as well as the shrinkage factor.

³ The models, data, and maps shown here were kindly provided by Dr John Leathwick of the National Institute of Water and Atmospheric Research in New Zealand, and Dr Jane Elith, School of Botany, University of Melbourne. The collection of the research trawl data took place from 1979–2005, and was funded by the New Zealand Ministry of Fisheries.
⁴ Version 1.5-7 of package gbm in R, ver. 2.2.0.

[FIGURE 10.18. Map of New Zealand and its surrounding exclusive economic zone, showing the locations of 17,000 trawls (small blue dots) taken between 1979 and 2005. The red points indicate trawls for which the species Black Oreo Dory were present.]
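A hedged sketch of this two-part ("hurdle") fit, on simulated stand-in data since the trawl survey data are not distributed with R. The structure mirrors (10.54): a logistic GBM for presence, and a squared-error GBM for the log catch size among the positives.

    library(gbm)
    set.seed(1)
    d <- data.frame(x1 = runif(1000), x2 = runif(1000))
    d$y <- rbinom(1000, 1, plogis(2 * d$x1 - 1)) * exp(rnorm(1000, mean = d$x2))
    d$pres <- as.numeric(d$y > 0)
    fit_pres <- gbm(pres ~ x1 + x2, data = d, distribution = "bernoulli",
                    n.trees = 1000, interaction.depth = 3, shrinkage = 0.025)
    pos <- d[d$y > 0, ]
    pos$logy <- log(pos$y)
    fit_size <- gbm(logy ~ x1 + x2, data = pos, distribution = "gaussian",
                    n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
    # Combine as in (10.54), un-logging the positive-catch predictions:
    p_hat  <- predict(fit_pres, d, n.trees = 1000, type = "response")
    mu_hat <- exp(predict(fit_size, d, n.trees = 1000))
    e_y    <- p_hat * mu_hat        # estimated E(Y | X)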
[FIGURE 10.19. The left panel shows the mean deviance as a function of the number of trees for the GBM logistic regression model fit to the presence/absence data. Shown are 10-fold cross-validation on the training data (with 1 × s.e. bars), and test deviance on the test data. Also shown for comparison is the test deviance using a GAM model with 8 df for each term. The right panel shows ROC curves on the test data for the chosen GBM model (vertical line in left plot) and the GAM model.]

Figure 10.19 (left panel) shows the mean binomial deviance for the sequence of GBM models, both for 10-fold CV and test data. There is a modest improvement over the performance of a GAM model, fit using smoothing splines with 8 degrees-of-freedom (df) per term. The right panel shows the ROC curves (see Section 9.2.5) for both models, which measure predictive performance. From this point of view, the performance looks very similar, with GBM perhaps having a slight edge as summarized by the AUC (area under the curve). At the point of equal sensitivity/specificity, GBM achieves 91%, and GAM 90%.

Figure 10.20 summarizes the contributions of the variables in the logistic GBM fit. We see that there is a well-defined depth range over which Black Oreo are caught, with much more frequent capture in colder waters. We do not give details of the quantitative catch model; the important variables were much the same.

[FIGURE 10.20. The top-left panel shows the relative influence computed from the GBM logistic regression model. The remaining panels show the partial dependence plots for the five leading variables (TempResid, AvgDepth, SusPartMatter, SalResid and SSTGrad), all plotted on the same scale for comparison.]

All the predictors used in these models are available on a fine geographical grid; in fact they were derived from environmental atlases, satellite images and the like—see Leathwick et al. (2006) for details. This also means that predictions can be made on this grid, and imported into GIS mapping systems. Figure 10.21 shows prediction maps for both presence and catch size, with both standardized to a common set of trawl conditions; since the predictors vary in a continuous fashion with geographical location, so do the predictions.

[FIGURE 10.21. Geographical prediction maps of the presence probability (left map) and catch size (right map) obtained from the gradient boosted models.]

Because of their ability to model interactions and automatically select variables, as well as robustness to outliers and missing data, GBM models are rapidly gaining popularity in this data-rich and enthusiastic community.

10.14.3 Demographics Data

In this section we illustrate gradient boosting on a multiclass classification problem, using MART. The data come from 9243 questionnaires filled out by shopping mall customers in the San Francisco Bay Area (Impact Resources, Inc., Columbus, OH). Among the questions are 14 concerning demographics. For this illustration the goal is to predict occupation using the other 13 variables as predictors, and hence identify demographic variables that discriminate between different occupational categories. We randomly divided the data into a training set (80%) and test set (20%), and used J = 6 node trees with a learning rate \nu = 0.1.
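A hedged sketch of such a multiclass fit via gbm's multinomial deviance (the survey data are not distributed with R, so a small simulated stand-in is used; support for the multinomial option varies across gbm versions):

    library(gbm)
    set.seed(1)
    d <- data.frame(age = sample(1:7, 1000, replace = TRUE),
                    income = sample(1:9, 1000, replace = TRUE))
    d$occ <- factor(ifelse(d$age >= 6, "Retired",
                    ifelse(d$age <= 2, "Student", "Prof/Man")))
    fit <- gbm(occ ~ age + income, data = d, distribution = "multinomial",
               n.trees = 500, interaction.depth = 5, shrinkage = 0.1)
    p <- predict(fit, d, n.trees = 500, type = "response")[, , 1]  # N x K probabilities
    pred <- colnames(p)[apply(p, 1, which.max)]
    mean(pred != d$occ)                                            # training error rate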
Figure 10.22 shows the K = 9 occupation class values along with their corresponding error rates. The overall error rate is 42.5%, which can be compared to the null rate of 69% obtained by predicting the most numerous class Prof/Man (Professional/Managerial). The four best predicted classes are seen to be Retired, Student, Prof/Man, and Homemaker.

[FIGURE 10.22. Error rate for each occupation in the demographics data. Overall error rate = 0.425.]

Figure 10.23 shows the relative predictor variable importances as averaged over all classes (10.46).

[FIGURE 10.23. Relative importance of the predictors as averaged over all classes for the demographics data.]

Figure 10.24 displays the individual relative importance distributions (10.45) for each of the four best predicted classes. One sees that the most relevant predictors are generally different for each respective class. An exception is age, which is among the three most relevant for predicting Retired, Student, and Prof/Man.

[FIGURE 10.24. Predictor variable importances separately for each of the four classes with lowest error rate for the demographics data.]

Figure 10.25 shows the partial dependence of the log-odds (10.52) on age for these three classes. The abscissa values are ordered codes for respective equally spaced age intervals. One sees that after accounting for the contributions of the other variables, the odds of being retired are higher for older people, whereas the opposite is the case for being a student. The odds of being professional/managerial are highest for middle-aged people. These results are of course not surprising. They illustrate that inspecting partial dependences separately for each class can lead to sensible results.

[FIGURE 10.25. Partial dependence of the odds of three different occupations (Retired, Student, Prof/Man) on age, for the demographics data.]

Bibliographic Notes

Schapire (1990) developed the first simple boosting procedure in the PAC learning framework (Valiant, 1984; Kearns and Vazirani, 1994). Schapire showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream. A weak learner is an algorithm for producing a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin-flip. After learning an initial classifier G_1 on the first N training points,

• G_2 is learned on a new sample of N points, half of which are misclassified by G_1;
• G_3 is learned on N points for which G_1 and G_2 disagree;
• the boosted classifier is G_B = majority vote(G_1, G_2, G_3).

Schapire's "Strength of Weak Learnability" theorem proves that G_B has improved performance over G_1.

Freund (1995) proposed a "boost by majority" variation which combined many weak learners simultaneously and improved the performance of the simple boosting algorithm of Schapire. The theory supporting both of these algorithms requires the weak learner to produce a classifier with a fixed error rate. This led to the more adaptive and realistic AdaBoost (Freund and Schapire, 1996a) and its offspring, where this assumption was dropped.
Freund and Schapire (1996a) and Schapire and Singer (1999) provide some theory to support their algorithms, in the form of upper bounds on generalization error. This theory has evolved in the computational learning community, initially based on the concepts of PAC learning. Other theories attempting to explain boosting come from game theory (Freund and Schapire, 1996b; Breiman, 1999; Breiman, 1998), and VC theory (Schapire et al., 1998). The bounds and the theory associated with the AdaBoost algorithms are interesting, but tend to be too loose to be of practical importance. In practice, boosting achieves results far more impressive than the bounds would imply. Schapire (2002) and Meir and Rätsch (2003) give useful overviews more recent than the first edition of this book.

Friedman et al. (2000) and Friedman (2001) form the basis for our exposition in this chapter. Friedman et al. (2000) analyze AdaBoost statistically, derive the exponential criterion, and show that it estimates the log-odds of the class probability. They propose additive tree models, the right-sized trees and ANOVA representation of Section 10.11, and the multiclass logit formulation. Friedman (2001) developed gradient boosting and shrinkage for classification and regression, while Friedman (1999) explored stochastic variants of boosting. Mason et al. (2000) also embraced a gradient approach to boosting. As the published discussions of Friedman et al. (2000) show, there is some controversy about how and why boosting works.

Since the publication of the first edition of this book, these debates have continued, and spread into the statistical community with a series of papers on consistency of boosting (Jiang, 2004; Lugosi and Vayatis, 2004; Zhang and Yu, 2005; Bartlett and Traskin, 2007). Mease and Wyner (2008), through a series of simulation examples, challenge some of our interpretations of boosting; our response (Friedman et al., 2008b) puts most of these objections to rest. A recent survey by Bühlmann and Hothorn (2008) supports our approach to boosting.

Exercises

Ex. 10.1 Derive expression (10.12) for the update parameter in AdaBoost.

Ex. 10.2 Prove result (10.16), that is, that the minimizer of the population version of the AdaBoost criterion is one-half of the log-odds.

Ex. 10.3 Show that the marginal average (10.47) recovers additive and multiplicative functions (10.50) and (10.51), while the conditional expectation (10.49) does not.

Ex. 10.4
(a) Write a program implementing AdaBoost with trees.
(b) Redo the computations for the example of Figure 10.2. Plot the training error as well as the test error, and discuss their behavior.
(c) Investigate the number of iterations needed to make the test error finally start to rise.
(d) Change the setup of this example as follows: define two classes, with the features in Class 1 being X_1, X_2, \ldots, X_{10}, standard independent Gaussian variates. In Class 2, the features X_1, X_2, \ldots, X_{10} are also standard independent Gaussian, but conditioned on the event \sum_j X_j^2 > 12. Now the classes have significant overlap in feature space. Repeat the AdaBoost experiments as in Figure 10.2 and discuss the results.

Ex. 10.5 Multiclass exponential loss (Zhu et al., 2005). For a K-class classification problem, consider the coding Y = (Y_1, \ldots, Y_K)^T with

    Y_k = \begin{cases} 1 & \text{if } G = G_k \\ -\frac{1}{K-1} & \text{otherwise.} \end{cases}        (10.55)

Let f = (f_1, \ldots, f_K)^T with \sum_{k=1}^K f_k = 0, and define

    L(Y, f) = \exp\left( -\frac{1}{K} Y^T f \right).        (10.56)
(a) Using Lagrange multipliers, derive the population minimizer f^* of E\,L(Y, f), subject to the zero-sum constraint, and relate these to the class probabilities.
(b) Show that multiclass boosting using this loss function leads to a reweighting algorithm similar to AdaBoost, as in Section 10.4.

Ex. 10.6 McNemar test (Agresti, 1996). We report the test error rates on the spam data to be 5.5% for a generalized additive model (GAM), and 4.5% for gradient boosting (GBM), with a test sample of size 1536.

(a) Show that the standard error of these estimates is about 0.6%.

Since the same test data are used for both methods, the error rates are correlated, and we cannot perform a two-sample t-test. We can compare the methods directly on each test observation, leading to the summary

                         GBM
                   Correct   Error
    GAM  Correct      1434      18
         Error          33      51

The McNemar test focuses on the discordant errors, 33 vs. 18.

(b) Conduct a test to show that GAM makes significantly more errors than gradient boosting, with a two-sided p-value of 0.036.

Ex. 10.7 Derive expression (10.32).

Ex. 10.8 Consider a K-class problem where the targets y_{ik} are coded as 1 if observation i is in class k and zero otherwise. Suppose we have a current model f_k(x), k = 1, \ldots, K, with \sum_{k=1}^K f_k(x) = 0 (see (10.21) in Section 10.6). We wish to update the model for observations in a region R in predictor space, by adding constants f_k(x) + \gamma_k, with \gamma_K = 0.

(a) Write down the multinomial log-likelihood for this problem, and its first and second derivatives.
(b) Using only the diagonal of the Hessian matrix in (a), and starting from \gamma_k = 0 \ \forall k, show that a one-step approximate Newton update for \gamma_k is

    \gamma_k^1 = \frac{\sum_{x_i \in R} (y_{ik} - p_{ik})}{\sum_{x_i \in R} p_{ik}(1 - p_{ik})}, \quad k = 1, \ldots, K - 1,        (10.57)

where p_{ik} = \exp(f_k(x_i)) / \sum_{\ell=1}^K \exp(f_\ell(x_i)).

(c) We prefer our update to sum to zero, as the current model does. Using symmetry arguments, show that

    \hat{\gamma}_k = \frac{K-1}{K} \left( \gamma_k^1 - \frac{1}{K} \sum_{\ell=1}^K \gamma_\ell^1 \right), \quad k = 1, \ldots, K        (10.58)

is an appropriate update, where \gamma_k^1 is defined as in (10.57) for all k = 1, \ldots, K.

Ex. 10.9 Consider a K-class problem where the targets y_{ik} are coded as 1 if observation i is in class k and zero otherwise. Using the multinomial deviance loss function (10.22) and the symmetric logistic transform, use the arguments leading to the gradient boosting Algorithm 10.3 to derive Algorithm 10.4. Hint: see Exercise 10.8 for step 2(b)iii.

Ex. 10.10 Show that for K = 2 class classification, only one tree needs to be grown at each gradient-boosting iteration.

Ex. 10.11 Show how to compute the partial dependence function f_S(X_S) in (10.47) efficiently.

Ex. 10.12 Referring to (10.49), let S = \{1\} and C = \{2\}, with f(X_1, X_2) = X_1. Assume X_1 and X_2 are bivariate Gaussian, each with mean zero, variance one, and \mathrm{E}(X_1 X_2) = \rho. Show that \mathrm{E}(f(X_1, X_2) \,|\, X_2) = \rho X_2, even though f is not a function of X_2.

Algorithm 10.4 Gradient Boosting for K-class Classification.

1. Initialize f_{k0}(x) = 0, k = 1, 2, \ldots, K.
2. For m = 1 to M:
   (a) Set p_k(x) = \frac{e^{f_k(x)}}{\sum_{\ell=1}^{K} e^{f_\ell(x)}}, k = 1, 2, \ldots, K.
   (b) For k = 1 to K:
       i. Compute r_{ikm} = y_{ik} - p_k(x_i), i = 1, 2, \ldots, N.
       ii. Fit a regression tree to the targets r_{ikm}, i = 1, 2, \ldots, N, giving terminal regions R_{jkm}, j = 1, 2, \ldots, J_m.
       iii. Compute
            \gamma_{jkm} = \frac{K-1}{K} \frac{\sum_{x_i \in R_{jkm}} r_{ikm}}{\sum_{x_i \in R_{jkm}} |r_{ikm}|(1 - |r_{ikm}|)}, \quad j = 1, 2, \ldots, J_m.
       iv. Update f_{km}(x) = f_{k,m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jkm} I(x \in R_{jkm}).
3. Output \hat{f}_k(x) = f_{kM}(x), k = 1, 2, \ldots, K.
11 Neural Networks

11.1 Introduction

In this chapter we describe a class of learning methods that was developed separately in different fields—statistics and artificial intelligence—based on essentially identical models. The central idea is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features. The result is a powerful learning method, with widespread applications in many fields. We first discuss the projection pursuit model, which evolved in the domain of semiparametric statistics and smoothing. The rest of the chapter is devoted to neural network models.

11.2 Projection Pursuit Regression

As in our generic supervised learning problem, assume we have an input vector X with p components, and a target Y. Let \omega_m, m = 1, 2, \ldots, M, be unit p-vectors of unknown parameters. The projection pursuit regression (PPR) model has the form

    f(X) = \sum_{m=1}^{M} g_m(\omega_m^T X).        (11.1)

This is an additive model, but in the derived features V_m = \omega_m^T X rather than the inputs themselves. The functions g_m are unspecified and are estimated along with the directions \omega_m using some flexible smoothing method (see below).

[FIGURE 11.1. Perspective plots of two ridge functions. (Left:) g(V) = 1/[1 + \exp(-5(V - 0.5))], where V = (X_1 + X_2)/\sqrt{2}. (Right:) g(V) = (V + 0.1)\sin(1/(V/3 + 0.1)), where V = X_1.]

The function g_m(\omega_m^T X) is called a ridge function in \mathbb{R}^p. It varies only in the direction defined by the vector \omega_m. The scalar variable V_m = \omega_m^T X is the projection of X onto the unit vector \omega_m, and we seek \omega_m so that the model fits well, hence the name "projection pursuit." Figure 11.1 shows some examples of ridge functions. In the example on the left \omega = (1/\sqrt{2})(1, 1)^T, so that the function only varies in the direction X_1 + X_2. In the example on the right, \omega = (1, 0).

The PPR model (11.1) is very general, since the operation of forming nonlinear functions of linear combinations generates a surprisingly large class of models. For example, the product X_1 \cdot X_2 can be written as [(X_1 + X_2)^2 - (X_1 - X_2)^2]/4, and higher-order products can be represented similarly. In fact, if M is taken arbitrarily large, for appropriate choice of g_m the PPR model can approximate any continuous function in \mathbb{R}^p arbitrarily well. Such a class of models is called a universal approximator. However this generality comes at a price. Interpretation of the fitted model is usually difficult, because each input enters into the model in a complex and multifaceted way. As a result, the PPR model is most useful for prediction, and not very useful for producing an understandable model for the data. The M = 1 model, known as the single index model in econometrics, is an exception. It is slightly more general than the linear regression model, and offers a similar interpretation.

How do we fit a PPR model, given training data (x_i, y_i), i = 1, 2, \ldots, N? We seek the approximate minimizers of the error function

    \sum_{i=1}^{N} \left[ y_i - \sum_{m=1}^{M} g_m(\omega_m^T x_i) \right]^2        (11.2)

over functions g_m and direction vectors \omega_m, m = 1, 2, \ldots, M. As in other smoothing problems, we need either explicitly or implicitly to impose complexity constraints on the g_m, to avoid overfit solutions.

Consider just one term (M = 1, and drop the subscript). Given the direction vector \omega, we form the derived variables v_i = \omega^T x_i.
Then we have a one-dimensional smoothing problem, and we can apply any scatterplot smoother, such as a smoothing spline, to obtain an estimate of g.

On the other hand, given g, we want to minimize (11.2) over \omega. A Gauss–Newton search is convenient for this task. This is a quasi-Newton method, in which the part of the Hessian involving the second derivative of g is discarded. It can be simply derived as follows. Let \omega_{\mathrm{old}} be the current estimate for \omega. We write

    g(\omega^T x_i) \approx g(\omega_{\mathrm{old}}^T x_i) + g'(\omega_{\mathrm{old}}^T x_i)(\omega - \omega_{\mathrm{old}})^T x_i        (11.3)

to give

    \sum_{i=1}^{N} \left[ y_i - g(\omega^T x_i) \right]^2 \approx \sum_{i=1}^{N} g'(\omega_{\mathrm{old}}^T x_i)^2 \left[ \left( \omega_{\mathrm{old}}^T x_i + \frac{y_i - g(\omega_{\mathrm{old}}^T x_i)}{g'(\omega_{\mathrm{old}}^T x_i)} \right) - \omega^T x_i \right]^2.        (11.4)

To minimize the right-hand side, we carry out a least squares regression with target \omega_{\mathrm{old}}^T x_i + (y_i - g(\omega_{\mathrm{old}}^T x_i))/g'(\omega_{\mathrm{old}}^T x_i) on the input x_i, with weights g'(\omega_{\mathrm{old}}^T x_i)^2 and no intercept (bias) term. This produces the updated coefficient vector \omega_{\mathrm{new}}.

These two steps, estimation of g and \omega, are iterated until convergence. With more than one term in the PPR model, the model is built in a forward stage-wise manner, adding a pair (\omega_m, g_m) at each stage.

There are a number of implementation details.

• Although any smoothing method can in principle be used, it is convenient if the method provides derivatives. Local regression and smoothing splines are convenient.
• After each step the g_m's from previous steps can be readjusted using the backfitting procedure described in Chapter 9. While this may lead ultimately to fewer terms, it is not clear whether it improves prediction performance.
• Usually the \omega_m are not readjusted (partly to avoid excessive computation), although in principle they could be as well.
• The number of terms M is usually estimated as part of the forward stage-wise strategy. The model building stops when the next term does not appreciably improve the fit of the model. Cross-validation can also be used to determine M.

There are many other applications, such as density estimation (Friedman et al., 1984; Friedman, 1987), where the projection pursuit idea can be used. In particular, see the discussion of ICA in Section 14.7 and its relationship with exploratory projection pursuit. However the projection pursuit regression model has not been widely used in the field of statistics, perhaps because at the time of its introduction (1981), its computational demands exceeded the capabilities of most readily available computers. But it does represent an important intellectual advance, one that has blossomed in its reincarnation in the field of neural networks, the topic of the rest of this chapter.
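Base R implements this fitting procedure in stats::ppr (using Friedman's "super smoother" by default for the g_m). A brief sketch on simulated data generated from a single ridge function:

    set.seed(1)
    n <- 400
    x1 <- runif(n); x2 <- runif(n)
    y <- (x1 + x2)^2 + 0.1 * rnorm(n)   # one ridge function, direction (1,1)/sqrt(2)
    fit <- ppr(y ~ x1 + x2, nterms = 1, max.terms = 3)
    fit$alpha                            # estimated direction(s) omega_m
    plot(fit)                            # estimated ridge function(s) g_m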
11.3 Neural Networks

The term neural network has evolved to encompass a large class of models and learning methods. Here we describe the most widely used "vanilla" neural net, sometimes called the single hidden layer back-propagation network, or single layer perceptron. There has been a great deal of hype surrounding neural networks, making them seem magical and mysterious. As we make clear in this section, they are just nonlinear statistical models, much like the projection pursuit regression model discussed above.

A neural network is a two-stage regression or classification model, typically represented by a network diagram as in Figure 11.2. This network applies both to regression and classification. For regression, typically K = 1 and there is only one output unit Y_1 at the top. However, these networks can handle multiple quantitative responses in a seamless fashion, so we will deal with the general case. For K-class classification, there are K units at the top, with the kth unit modeling the probability of class k. There are K target measurements Y_k, k = 1, \ldots, K, each being coded as a 0-1 variable for the kth class.

Derived features Z_m are created from linear combinations of the inputs, and then the target Y_k is modeled as a function of linear combinations of the Z_m,

    Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M,
    T_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \ldots, K,        (11.5)
    f_k(X) = g_k(T), \quad k = 1, \ldots, K,

where Z = (Z_1, Z_2, \ldots, Z_M), and T = (T_1, T_2, \ldots, T_K).

[FIGURE 11.2. Schematic of a single hidden layer, feed-forward neural network, with inputs X_1, \ldots, X_p, hidden units Z_1, \ldots, Z_M, and outputs Y_1, \ldots, Y_K.]

The activation function \sigma(v) is usually chosen to be the sigmoid \sigma(v) = 1/(1 + e^{-v}); see Figure 11.3 for a plot. Sometimes Gaussian radial basis functions (Chapter 6) are used for the \sigma(v), producing what is known as a radial basis function network.

Neural network diagrams like Figure 11.2 are sometimes drawn with an additional bias unit feeding into every unit in the hidden and output layers. Thinking of the constant "1" as an additional input feature, this bias unit captures the intercepts \alpha_{0m} and \beta_{0k} in model (11.5).

The output function g_k(T) allows a final transformation of the vector of outputs T. For regression we typically choose the identity function g_k(T) = T_k. Early work in K-class classification also used the identity function, but this was later abandoned in favor of the softmax function

    g_k(T) = \frac{e^{T_k}}{\sum_{\ell=1}^{K} e^{T_\ell}}.        (11.6)

This is of course exactly the transformation used in the multilogit model (Section 4.4), and produces positive estimates that sum to one. In Section 4.2 we discuss other problems with linear activation functions, in particular potentially severe masking effects.

The units in the middle of the network, computing the derived features Z_m, are called hidden units because the values Z_m are not directly observed. In general there can be more than one hidden layer, as illustrated in the example at the end of this chapter. We can think of the Z_m as a basis expansion of the original inputs X; the neural network is then a standard linear model, or linear multilogit model, using these transformations as inputs. There is, however, an important enhancement over the basis-expansion techniques discussed in Chapter 5; here the parameters of the basis functions are learned from the data.

[FIGURE 11.3. Plot of the sigmoid function \sigma(v) = 1/(1 + \exp(-v)) (red curve), commonly used in the hidden layer of a neural network. Included are \sigma(sv) for s = 1/2 (blue curve) and s = 10 (purple curve). The scale parameter s controls the activation rate, and we can see that large s amounts to a hard activation at v = 0. Note that \sigma(s(v - v_0)) shifts the activation threshold from 0 to v_0.]

Notice that if \sigma is the identity function, then the entire model collapses to a linear model in the inputs. Hence a neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification. By introducing the nonlinear transformation \sigma, it greatly enlarges the class of linear models. In Figure 11.3 we see that the rate of activation of the sigmoid depends on the norm \|\alpha_m\|, and if \|\alpha_m\| is very small, the unit will indeed be operating in the linear part of its activation function.
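The full model (11.5)–(11.6) is only a few lines of matrix arithmetic. A minimal sketch of the forward computation for a single observation, with hypothetical weight matrices alpha (M × (p+1)) and beta (K × (M+1)), the first column of each holding the intercepts:

    sigmoid <- function(v) 1 / (1 + exp(-v))
    softmax <- function(T) { e <- exp(T - max(T)); e / sum(e) }  # stabilized (11.6)
    nn_forward <- function(x, alpha, beta) {
      Z <- sigmoid(alpha %*% c(1, x))   # (11.5): derived features Z_m
      T <- beta %*% c(1, Z)             # linear combinations T_k of the Z_m
      as.vector(softmax(T))             # class probabilities f_k(X)
    }
    # Example with p = 3 inputs, M = 4 hidden units, K = 2 classes:
    set.seed(1)
    alpha <- matrix(rnorm(4 * 4, sd = 0.1), 4, 4)
    beta  <- matrix(rnorm(2 * 5, sd = 0.1), 2, 5)
    nn_forward(c(0.5, -1, 2), alpha, beta)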
Notice also that the neural network model with one hidden layer has exactly the same form as the projection pursuit model described above. The difference is that the PPR model uses nonparametric functions g_m(v), while the neural network uses a far simpler function based on \sigma(v), with three free parameters in its argument. In detail, viewing the neural network model as a PPR model, we identify

    g_m(\omega_m^T X) = \beta_m \sigma(\alpha_{0m} + \alpha_m^T X) = \beta_m \sigma(\alpha_{0m} + \|\alpha_m\|(\omega_m^T X)),        (11.7)

where \omega_m = \alpha_m / \|\alpha_m\| is the mth unit vector. Since \sigma_{\beta, \alpha_0, s}(v) = \beta\sigma(\alpha_0 + sv) has lower complexity than a more general nonparametric g(v), it is not surprising that a neural network might use 20 or 100 such functions, while the PPR model typically uses fewer terms (M = 5 or 10, for example).

Finally, we note that the name "neural networks" derives from the fact that they were first developed as models for the human brain. Each unit represents a neuron, and the connections (links in Figure 11.2) represent synapses. In early models, the neurons fired when the total signal passed to that unit exceeded a certain threshold. In the model above, this corresponds to the use of a step function for \sigma(Z) and g_m(T). Later the neural network was recognized as a useful tool for nonlinear statistical modeling, and for this purpose the step function is not smooth enough for optimization. Hence the step function was replaced by a smoother threshold function, the sigmoid in Figure 11.3.

11.4 Fitting Neural Networks

The neural network model has unknown parameters, often called weights, and we seek values for them that make the model fit the training data well. We denote the complete set of weights by \theta, which consists of

    \{\alpha_{0m}, \alpha_m;\ m = 1, 2, \ldots, M\} \quad M(p+1) \text{ weights},
    \{\beta_{0k}, \beta_k;\ k = 1, 2, \ldots, K\} \quad K(M+1) \text{ weights}.        (11.8)

For regression, we use sum-of-squared errors as our measure of fit (error function)

    R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2.        (11.9)

For classification we use either squared error or cross-entropy (deviance):

    R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i),        (11.10)

and the corresponding classifier is G(x) = \arg\max_k f_k(x). With the softmax activation function and the cross-entropy error function, the neural network model is exactly a linear logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.

Typically we don't want the global minimizer of R(\theta), as this is likely to be an overfit solution. Instead some regularization is needed: this is achieved directly through a penalty term, or indirectly by early stopping. Details are given in the next section.

The generic approach to minimizing R(\theta) is by gradient descent, called back-propagation in this setting. Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation. This can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit.

Here is back-propagation in detail for squared error loss. Let z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i), from (11.5), and let z_i = (z_{1i}, z_{2i}, \ldots, z_{Mi}). Then we have

    R(\theta) \equiv \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2,        (11.11)

with derivatives

    \frac{\partial R_i}{\partial \beta_{km}} = -2(y_{ik} - f_k(x_i)) g_k'(\beta_k^T z_i) z_{mi},
    \frac{\partial R_i}{\partial \alpha_{m\ell}} = -\sum_{k=1}^{K} 2(y_{ik} - f_k(x_i)) g_k'(\beta_k^T z_i) \beta_{km} \sigma'(\alpha_m^T x_i) x_{i\ell}.        (11.12)

Given these derivatives, a gradient descent update at the (r+1)st iteration has the form

    \beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}^{(r)}},
    \alpha_{m\ell}^{(r+1)} = \alpha_{m\ell}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{m\ell}^{(r)}},        (11.13)

where \gamma_r is the learning rate, discussed below.
Now write (11.12) as

\[
\begin{aligned}
\frac{\partial R_i}{\partial \beta_{km}} &= \delta_{ki}\, z_{mi}, \\
\frac{\partial R_i}{\partial \alpha_{m\ell}} &= s_{mi}\, x_{i\ell}.
\end{aligned} \tag{11.14}
\]

The quantities δ_{ki} and s_{mi} are "errors" from the current model at the output and hidden layer units, respectively. From their definitions, these errors satisfy

\[
s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km}\, \delta_{ki}, \tag{11.15}
\]

known as the back-propagation equations. Using this, the updates in (11.13) can be implemented with a two-pass algorithm. In the forward pass, the current weights are fixed and the predicted values f̂_k(x_i) are computed from formula (11.5). In the backward pass, the errors δ_{ki} are computed, and then back-propagated via (11.15) to give the errors s_{mi}. Both sets of errors are then used to compute the gradients for the updates in (11.13), via (11.14).

This two-pass procedure is what is known as back-propagation. It has also been called the delta rule (Widrow and Hoff, 1960). The computational components for cross-entropy have the same form as those for the sum of squares error function, and are derived in Exercise 11.3.

The advantage of back-propagation is its simple, local nature. In the back-propagation algorithm, each hidden unit passes and receives information only to and from units that share a connection. Hence it can be implemented efficiently on a parallel architecture computer.

The updates in (11.13) are a kind of batch learning, with the parameter updates being a sum over all of the training cases. Learning can also be carried out online, processing each observation one at a time, updating the gradient after each training case, and cycling through the training cases many times. In this case, the sums in equations (11.13) are replaced by a single summand. A training epoch refers to one sweep through the entire training set. Online training allows the network to handle very large training sets, and also to update the weights as new observations come in.

The learning rate γ_r for batch learning is usually taken to be a constant, and can also be optimized by a line search that minimizes the error function at each update. With online learning γ_r should decrease to zero as the iteration r → ∞. This learning is a form of stochastic approximation (Robbins and Monro, 1951); results in this field ensure convergence if γ_r → 0, Σ_r γ_r = ∞, and Σ_r γ_r² < ∞ (satisfied, for example, by γ_r = 1/r).
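Here is a hedged sketch of online learning with a rate satisfying these conditions, reusing the `gradients` helper from the previous sketch; for brevity the bias terms are held fixed, which a real implementation would also update.

```python
import numpy as np

def online_train(X, Y, alpha0, alpha, beta0, beta, epochs=10):
    """Online learning: the sums in (11.13) are replaced by a single summand,
    one gradient step per training case, cycling through the data. The rate
    gamma_r = 1/r satisfies gamma_r -> 0, sum gamma_r = inf, sum gamma_r^2 < inf.
    Reuses `gradients` from the previous sketch; biases are held fixed here
    purely for brevity."""
    r = 0
    for _ in range(epochs):                  # one epoch = one sweep through the data
        for i in range(X.shape[0]):
            r += 1
            dbeta, dalpha = gradients(X[i], Y[i], alpha0, alpha, beta0, beta)
            beta = beta - (1.0 / r) * dbeta
            alpha = alpha - (1.0 / r) * dalpha
    return alpha, beta
```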
Back-propagation can be very slow, and for that reason is usually not the method of choice. Second-order techniques such as Newton's method are not attractive here, because the second derivative matrix of R (the Hessian) can be very large. Better approaches to fitting include conjugate gradients and variable metric methods. These avoid explicit computation of the second derivative matrix while still providing faster convergence.

11.5 Some Issues in Training Neural Networks

There is quite an art in training neural networks. The model is generally overparametrized, and the optimization problem is nonconvex and unstable unless certain guidelines are followed. In this section we summarize some of the important issues.

11.5.1 Starting Values

Note that if the weights are near zero, then the operative part of the sigmoid (Figure 11.3) is roughly linear, and hence the neural network collapses into an approximately linear model (Exercise 11.2). Usually starting values for weights are chosen to be random values near zero. Hence the model starts out nearly linear, and becomes nonlinear as the weights increase. Individual units localize to directions and introduce nonlinearities where needed. Use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves. Starting instead with large weights often leads to poor solutions.

11.5.2 Overfitting

Often neural networks have too many weights and will overfit the data at the global minimum of R. In early developments of neural networks, either by design or by accident, an early stopping rule was used to avoid overfitting. Here we train the model only for a while, and stop well before we approach the global minimum. Since the weights start at a highly regularized (linear) solution, this has the effect of shrinking the final model toward a linear model. A validation dataset is useful for determining when to stop, since we expect the validation error to start increasing.

A more explicit method for regularization is weight decay, which is analogous to ridge regression used for linear models (Section 3.4.1). We add a penalty to the error function R(θ) + λJ(θ), where

\[
J(\theta) = \sum_{km} \beta_{km}^2 + \sum_{m\ell} \alpha_{m\ell}^2 \tag{11.16}
\]

and λ ≥ 0 is a tuning parameter. Larger values of λ will tend to shrink the weights toward zero: typically cross-validation is used to estimate λ. The effect of the penalty is to simply add terms 2λβ_{km} and 2λα_{mℓ} to the respective gradient expressions (11.13). Other forms for the penalty have been proposed, for example,

\[
J(\theta) = \sum_{km} \frac{\beta_{km}^2}{1 + \beta_{km}^2} + \sum_{m\ell} \frac{\alpha_{m\ell}^2}{1 + \alpha_{m\ell}^2}, \tag{11.17}
\]

known as the weight elimination penalty. This has the effect of shrinking smaller weights more than (11.16) does.
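A sketch (ours) of a batch gradient step under the weight decay penalty (11.16), again building on the earlier `gradients` helper; as noted above, the penalty simply adds 2λβ and 2λα to the summed gradients.

```python
import numpy as np

def decay_step(X, Y, alpha0, alpha, beta0, beta, lam, gamma):
    """One batch gradient step on R(theta) + lam * J(theta), with J as in
    (11.16): the penalty contributes 2*lam*beta_km and 2*lam*alpha_ml to the
    summed gradients (11.13). Reuses `gradients` from the earlier sketch."""
    G_beta, G_alpha = np.zeros_like(beta), np.zeros_like(alpha)
    for i in range(X.shape[0]):
        dbeta, dalpha = gradients(X[i], Y[i], alpha0, alpha, beta0, beta)
        G_beta += dbeta
        G_alpha += dalpha
    beta = beta - gamma * (G_beta + 2.0 * lam * beta)     # ridge-like shrinkage
    alpha = alpha - gamma * (G_alpha + 2.0 * lam * alpha)
    return alpha, beta
```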
Figure 11.4 shows the result of training a neural network with ten hidden units, without weight decay (upper panel) and with weight decay (lower panel), to the mixture example of Chapter 2. Weight decay has clearly improved the prediction. Figure 11.5 shows heat maps of the estimated weights from the training (grayscale versions of these are called Hinton diagrams). We see that weight decay has dampened the weights in both layers: the resulting weights are spread fairly evenly over the ten hidden units.

[Figure 11.4: decision boundaries for the mixture example. Upper panel, "Neural Network - 10 Units, No Weight Decay": Training Error 0.100, Test Error 0.259, Bayes Error 0.210. Lower panel, "Neural Network - 10 Units, Weight Decay=0.02": Training Error 0.160, Test Error 0.223, Bayes Error 0.210.]
FIGURE 11.4. A neural network on the mixture example of Chapter 2. The upper panel uses no weight decay, and overfits the training data. The lower panel uses weight decay, and achieves close to the Bayes error rate (broken purple boundary). Both use the softmax activation function and cross-entropy error.

[Figure 11.5: first- and second-layer weight heat maps for the two networks of Figure 11.4, with and without weight decay.]

FIGURE 11.5. Heat maps of the estimated weights from the training of neural networks from Figure 11.4. The display ranges from bright green (negative) to bright red (positive).

11.5.3 Scaling of the Inputs

Since the scaling of the inputs determines the effective scaling of the weights in the bottom layer, it can have a large effect on the quality of the final solution. At the outset it is best to standardize all inputs to have mean zero and standard deviation one. This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights. With standardized inputs, it is typical to take random uniform weights over the range [−0.7, +0.7].
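The following small sketch (ours) combines this standardization with the random starting weights of Section 11.5.1.

```python
import numpy as np

def standardize(X):
    # Center each input to mean zero and scale to standard deviation one, so
    # all inputs are treated equally by the regularization.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def init_weights(p, M, K, rng=np.random.default_rng(0)):
    # With standardized inputs, draw random uniform weights on [-0.7, +0.7]:
    # near zero, so the model starts out nearly linear (Section 11.5.1).
    u = lambda *shape: rng.uniform(-0.7, 0.7, size=shape)
    return u(M), u(M, p), u(K), u(K, M)   # alpha0, alpha, beta0, beta
```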
11.5.4 Number of Hidden Units and Layers

Generally speaking it is better to have too many hidden units than too few. With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data; with too many hidden units, the extra weights can be shrunk toward zero if appropriate regularization is used. Typically the number of hidden units is somewhere in the range of 5 to 100, with the number increasing with the number of inputs and number of training cases. It is most common to put down a reasonably large number of units and train them with regularization. Some researchers use cross-validation to estimate the optimal number, but this seems unnecessary if cross-validation is used to estimate the regularization parameter. Choice of the number of hidden layers is guided by background knowledge and experimentation. Each layer extracts features of the input for regression or classification. Use of multiple hidden layers allows construction of hierarchical features at different levels of resolution. An example of the effective use of multiple layers is given in Section 11.7.

11.5.5 Multiple Minima

The error function R(θ) is nonconvex, possessing many local minima. As a result, the final solution obtained is quite dependent on the choice of starting weights. One must at least try a number of random starting configurations, and choose the solution giving lowest (penalized) error. Probably a better approach is to use the average predictions over the collection of networks as the final prediction (Ripley, 1996). This is preferable to averaging the weights, since the nonlinearity of the model implies that this averaged solution could be quite poor. Another approach is via bagging, which averages the predictions of networks trained from randomly perturbed versions of the training data. This is described in Section 8.7.

11.6 Example: Simulated Data

We generated data from two additive error models Y = f(X) + ε:

\[
\text{Sum of sigmoids: } Y = \sigma(a_1^T X) + \sigma(a_2^T X) + \varepsilon_1;
\qquad
\text{Radial: } Y = \prod_{m=1}^{10} \phi(X_m) + \varepsilon_2.
\]

Here X^T = (X_1, X_2, ..., X_p), each X_j being a standard Gaussian variate, with p = 2 in the first model, and p = 10 in the second. For the sigmoid model, a_1 = (3, 3), a_2 = (3, −3); for the radial model, φ(t) = (1/2π)^{1/2} exp(−t²/2). Both ε_1 and ε_2 are Gaussian errors, with variance chosen so that the signal-to-noise ratio

\[
\frac{\mathrm{Var}(\mathrm{E}(Y|X))}{\mathrm{Var}(Y - \mathrm{E}(Y|X))} = \frac{\mathrm{Var}(f(X))}{\mathrm{Var}(\varepsilon)} \tag{11.18}
\]

is 4 in both models.
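For concreteness, here is a sketch (not the book's code) of the two data-generating models; the noise standard deviation is calibrated by simulation so that the ratio (11.18) is 4.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
phi = lambda t: np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)  # standard Gaussian density

def sum_of_sigmoids(N):
    X = rng.normal(size=(N, 2))                   # p = 2 standard Gaussian inputs
    a1, a2 = np.array([3.0, 3.0]), np.array([3.0, -3.0])
    return X, sigmoid(X @ a1) + sigmoid(X @ a2)   # f(X), noiseless

def radial(N):
    X = rng.normal(size=(N, 10))                  # p = 10 standard Gaussian inputs
    return X, phi(X).prod(axis=1)                 # product over the ten coordinates

# Calibrate the noise s.d. by simulation so that (11.18) equals 4.
_, f = sum_of_sigmoids(100_000)
sd = np.sqrt(f.var() / 4.0)
Xtr, ftr = sum_of_sigmoids(100)                   # training sample of size 100
ytr = ftr + sd * rng.normal(size=100)             # Y = f(X) + eps
```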
We took a training sample of size 100 and a test sample of size 10,000. We fit neural networks with weight decay and various numbers of hidden units, and recorded the average test error E_Test(Y − f̂(X))² for each of 10 random starting weights. Only one training set was generated, but the results are typical for an "average" training set. The test errors are shown in Figure 11.6. Note that the zero hidden unit model refers to linear least squares regression.

[Figure 11.6: boxplots of test error vs. number of hidden units (0–10); left panel "Sum of Sigmoids", right panel "Radial" (different vertical scale).]

FIGURE 11.6. Boxplots of test error, for simulated data example, relative to the Bayes error (broken horizontal line). True function is a sum of two sigmoids on the left, and a radial function on the right. The test error is displayed for 10 different starting weights, for a single hidden layer neural network with the number of units as indicated.

The neural network is perfectly suited to the sum of sigmoids model, and the two-unit model does perform the best, achieving an error close to the Bayes rate. (Recall that the Bayes rate for regression with squared error is the error variance; in the figures, we report test error relative to the Bayes error.) Notice, however, that with more hidden units, overfitting quickly creeps in, and with some starting weights the model does worse than the linear (zero hidden unit) model. Even with two hidden units, two of the ten starting weight configurations produced results no better than the linear model, confirming the importance of multiple starting values.

A radial function is in a sense the most difficult for the neural net, as it is spherically symmetric and has no preferred directions. We see in the right panel of Figure 11.6 that it does poorly in this case, with the test error staying well above the Bayes error (note the different vertical scale from the left panel). In fact, since a constant fit (such as the sample average) achieves a relative error of 5 when the SNR is 4 (because Var(Y) = Var(f(X)) + Var(ε) = 5 Var(ε)), we see that with more hidden units the neural networks perform increasingly worse than the mean.

In this example we used a fixed weight decay parameter of 0.0005, representing a mild amount of regularization. The results in the left panel of Figure 11.6 suggest that more regularization is needed with greater numbers of hidden units.

In Figure 11.7 we repeated the experiment for the sum of sigmoids model, with no weight decay in the left panel, and stronger weight decay (λ = 0.1) in the right panel. With no weight decay, overfitting becomes even more severe for larger numbers of hidden units. The weight decay value λ = 0.1 produces good results for all numbers of hidden units, and there does not appear to be overfitting as the number of units increases. Finally, Figure 11.8 shows the test error for a ten hidden unit network, varying the weight decay parameter over a wide range. The value 0.1 is approximately optimal.

FIGURE 11.7. Boxplots of test error, for simulated data example, relative to the Bayes error. True function is a sum of two sigmoids. The test error is displayed for ten different starting weights, for a single hidden layer neural network with the number of units as indicated. The two panels represent no weight decay (left) and strong weight decay λ = 0.1 (right).

FIGURE 11.8. Boxplots of test error, for simulated data example. True function is a sum of two sigmoids. The test error is displayed for ten different starting weights, for a single hidden layer neural network with ten hidden units and weight decay parameter value as indicated.

In summary, there are two free parameters to select: the weight decay λ and the number of hidden units M. As a learning strategy, one could fix either parameter at the value corresponding to the least constrained model, to ensure that the model is rich enough, and use cross-validation to choose the other parameter. Here the least constrained values are zero weight decay and ten hidden units. Comparing the left panel of Figure 11.7 to Figure 11.8, we see that the test error is less sensitive to the value of the weight decay parameter, and hence cross-validation of this parameter would be preferred.
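As a hypothetical illustration of this strategy, the sketch below cross-validates the weight decay parameter for a ten-hidden-unit network, using scikit-learn's MLPRegressor as a stand-in (its `alpha` argument plays the role of λ here, though its fitting details differ from the plain back-propagation described above).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

grid = GridSearchCV(
    MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                 solver="lbfgs", max_iter=5000),
    param_grid={"alpha": [1e-4, 1e-3, 1e-2, 1e-1]},  # candidate decay values
    cv=5, scoring="neg_mean_squared_error")
grid.fit(Xtr, ytr)            # training data from the previous sketch
print(grid.best_params_)      # the cross-validated weight decay value
```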
11.7 Example: ZIP Code Data

This example is a character recognition task: classification of handwritten numerals. This problem captured the attention of the machine learning and neural network community for many years, and has remained a benchmark problem in the field. Figure 11.9 shows some examples of normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images shown here have been deslanted and size normalized, resulting in 16 × 16 grayscale images (Le Cun et al., 1990). These 256 pixel values are used as inputs to the neural network classifier.

FIGURE 11.9. Examples of training cases from ZIP code data. Each image is a 16 × 16 8-bit grayscale representation of a handwritten digit.

A black box neural network is not ideally suited to this pattern recognition task, partly because the pixel representation of the images lacks certain invariances (such as small rotations of the image). Consequently early attempts with neural networks yielded misclassification rates around 4.5% on various examples of the problem. In this section we show some of the pioneering efforts to handcraft the neural network to overcome some of these deficiencies (Le Cun, 1989), which ultimately led to the state of the art in neural network performance (Le Cun et al., 1998).¹

Although current digit datasets have tens of thousands of training and test examples, the sample size here is deliberately modest in order to emphasize the effects. The examples were obtained by scanning some actual hand-drawn digits, and then generating additional images by random horizontal shifts. Details may be found in Le Cun (1989). There are 320 digits in the training set, and 160 in the test set. Five different networks were fit to the data:

Net-1: No hidden layer, equivalent to multinomial logistic regression.
Net-2: One hidden layer, 12 hidden units fully connected.
Net-3: Two hidden layers locally connected.
Net-4: Two hidden layers, locally connected with weight sharing.
Net-5: Two hidden layers, locally connected, two levels of weight sharing.

¹The figures and tables in this example were recreated from Le Cun (1989).

[Figure 11.10: the five network architectures, from Net-1 (16×16 inputs, 10 outputs) through Net-5 (16×16 inputs; 8×8×2 and 4×4×4 locally connected, shared-weight layers; 10 outputs).]

FIGURE 11.10. Architecture of the five networks used in the ZIP code example.

These are depicted in Figure 11.10. Net-1 for example has 256 inputs, one each for the 16×16 input pixels, and ten output units, one for each of the digits 0–9. The predicted value f̂_k(x) represents the estimated probability that an image x has digit class k, for k = 0, 1, 2, ..., 9.

The networks all have sigmoidal output units, and were all fit with the sum-of-squares error function. The first network has no hidden layer, and hence is nearly equivalent to a linear multinomial regression model (Exercise 11.4). Net-2 is a single hidden layer network with 12 hidden units, of the kind described above.

The training set error for all of the networks was 0%, since in all cases there are more parameters than training observations. The evolution of the test error during the training epochs is shown in Figure 11.11. The linear network (Net-1) starts to overfit fairly quickly, while the test performance of the others levels off at successively superior values.

[Figure 11.11: test performance (% correct, 60–100) as a function of training epochs (0–30) for Net-1 through Net-5.]

FIGURE 11.11. Test performance curves, as a function of the number of training epochs, for the five networks of Table 11.1 applied to the ZIP code data. (Le Cun, 1989)

The other three networks have additional features which demonstrate the power and flexibility of the neural network paradigm. They introduce constraints on the network, natural for the problem at hand, which allow for more complex connectivity but fewer parameters.

Net-3 uses local connectivity: this means that each hidden unit is connected to only a small patch of units in the layer below. In the first hidden layer (an 8×8 array), each unit takes inputs from a 3×3 patch of the input layer; for units in the first hidden layer that are one unit apart, their receptive fields overlap by one row or column, and hence are two pixels apart.
In the second hidden layer, inputs are from a 5×5 patch, and again units that are one unit apart have receptive fields that are two units apart. The weights for all other connections are set to zero. Local connectivity makes each unit responsible for extracting local features from the layer below, and reduces considerably the total number of weights. With many more hidden units than Net-2, Net-3 has fewer links and hence weights (1226 vs. 3214), and achieves similar performance.

Net-4 and Net-5 have local connectivity with shared weights. All units in a local feature map perform the same operation on different parts of the image, achieved by sharing the same weights. The first hidden layer of Net-4 has two 8×8 arrays, and each unit takes input from a 3×3 patch just like in Net-3. However, each of the units in a single 8×8 feature map shares the same set of nine weights (but has its own bias parameter). This forces the extracted features in different parts of the image to be computed by the same linear functional, and consequently these networks are sometimes known as convolutional networks. The second hidden layer of Net-4 has no weight sharing, and is the same as in Net-3. The gradient of the error function R with respect to a shared weight is the sum of the gradients of R with respect to each connection controlled by the weight in question.

TABLE 11.1. Test set performance of five different neural networks on a handwritten digit classification example (Le Cun, 1989).

Network Architecture               Links   Weights   % Correct
Net-1: Single layer network         2570      2570       80.0%
Net-2: Two layer network            3214      3214       87.0%
Net-3: Locally connected            1226      1226       88.5%
Net-4: Constrained network 1        2266      1132       94.0%
Net-5: Constrained network 2        5194      1060       98.4%

Table 11.1 gives the number of links, the number of weights and the optimal test performance for each of the networks. We see that Net-4 has more links but fewer weights than Net-3, and superior test performance. Net-5 has four 4×4 feature maps in the second hidden layer, each unit connected to a 5×5 local patch in the layer below. Weights are shared in each of these feature maps. We see that Net-5 does the best, having errors of only 1.6%, compared to 13% for the "vanilla" network Net-2. The clever design of Net-5, motivated by the fact that features of handwriting style should appear in more than one part of a digit, was the result of many person-years of experimentation. This and similar networks gave better performance on ZIP code problems than any other learning method at that time (early 1990s). This example also shows that neural networks are not a fully automatic tool, as they are sometimes advertised. As with all statistical models, subject matter knowledge can and should be used to improve their performance.
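The weight-sharing idea is easy to express in code. Below is an illustrative sketch (ours, not Le Cun's) of a single shared-weight feature map of the kind used in Net-4 and Net-5; the one-pixel padding convention is our assumption, chosen so a 16×16 image yields an 8×8 map.

```python
import numpy as np

def feature_map(image, w, bias, stride=2):
    """One shared-weight feature map in the spirit of Net-4/Net-5: a single
    3x3 patch of nine weights `w` is applied at every location, so all units
    compute the same linear functional of their receptive field; each unit
    keeps its own bias. A one-pixel zero border (our choice) makes a 16x16
    image yield an 8x8 map with receptive fields two pixels apart."""
    img = np.pad(image, 1)
    H = (img.shape[0] - 3) // stride + 1
    out = np.empty((H, H))
    for i in range(H):
        for j in range(H):
            patch = img[i * stride:i * stride + 3, j * stride:j * stride + 3]
            out[i, j] = 1.0 / (1.0 + np.exp(-(np.sum(w * patch) + bias[i, j])))
    return out

rng = np.random.default_rng(0)
fmap = feature_map(rng.normal(size=(16, 16)),    # a 16 x 16 input image
                   rng.normal(size=(3, 3)),      # the nine shared weights
                   np.zeros((8, 8)))             # per-unit biases
print(fmap.shape)                                # (8, 8)
```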
This network was later outperformed by the tangent distance approach (Simard et al., 1993) described in Section 13.3.3, which explicitly incorporates natural affine invariances. At that point the digit recognition datasets became test beds for every new learning procedure, and researchers worked hard to drive down the error rates. As of this writing, the best error rates on a large database (60,000 training, 10,000 test observations), derived from standard NIST² databases, were reported to be the following (Le Cun et al., 1998):

• 1.1% for tangent distance with a 1-nearest neighbor classifier (Section 13.3.3);
• 0.8% for a degree-9 polynomial SVM (Section 12.3);
• 0.8% for LeNet-5, a more complex version of the convolutional network described here;
• 0.7% for boosted LeNet-4. Boosting is described in Chapter 8. LeNet-4 is a predecessor of LeNet-5.

Le Cun et al. (1998) report a much larger table of performance results, and it is evident that many groups have been working very hard to bring these test error rates down. They report a standard error of 0.1% on the error estimates, which is based on a binomial average with N = 10,000 and p ≈ 0.01: √(p(1 − p)/N) = √(0.01 × 0.99/10,000) ≈ 0.001. This implies that error rates within 0.1–0.2% of one another are statistically equivalent. Realistically the standard error is even higher, since the test data has been implicitly used in the tuning of the various procedures.

²The National Institute of Standards and Technology maintains large databases, including handwritten character databases; http://www.nist.gov/srd/.

11.8 Discussion

Both projection pursuit regression and neural networks take nonlinear functions of linear combinations ("derived features") of the inputs. This is a powerful and very general approach for regression and classification, and has been shown to compete well with the best learning methods on many problems.

These tools are especially effective in problems with a high signal-to-noise ratio and settings where prediction without interpretation is the goal. They are less effective for problems where the goal is to describe the physical process that generated the data and the roles of individual inputs. Each input enters into the model in many places, in a nonlinear fashion. Some authors (Hinton, 1989) plot a diagram of the estimated weights into each hidden unit, to try to understand the feature that each unit is extracting. This is limited however by the lack of identifiability of the parameter vectors α_m, m = 1, ..., M. Often there are solutions with α_m spanning the same linear space as the ones found during training, giving predicted values that are roughly the same. Some authors suggest carrying out a principal component analysis of these weights, to try to find an interpretable solution. In general, the difficulty of interpreting these models has limited their use in fields like medicine, where interpretation of the model is very important.

There has been a great deal of research on the training of neural networks. Unlike methods like CART and MARS, neural networks are smooth functions of real-valued parameters. This facilitates the development of Bayesian inference for these models. The next section discusses a successful Bayesian implementation of neural networks.

11.9 Bayesian Neural Nets and the NIPS 2003 Challenge

A classification competition was held in 2003, in which five labeled training datasets were provided to participants. It was organized for a Neural Information Processing Systems (NIPS) workshop. Each of the datasets constituted a two-class classification problem, with different sizes and from a variety of domains (see Table 11.2). Feature measurements for a validation dataset were also available.
Participants developed and applied statistical learning procedures to make predictions on the datasets, and could submit predictions to a website on the validation set for a period of 12 weeks. With this feedback, participants were then asked to submit predictions for a separate test set, and they received their results. Finally, the class labels for the validation set were released and participants had one week to train their algorithms on the combined training and validation sets, and submit their final predictions to the competition website. A total of 75 groups participated, with 20 and 16 eventually making submissions on the validation and test sets, respectively.

There was an emphasis on feature extraction in the competition. Artificial "probes" were added to the data: these are noise features with distributions resembling the real features but independent of the class labels. The percentage of probes that were added to each dataset, relative to the total set of features, is shown in Table 11.2. Thus each learning algorithm had to figure out a way of identifying the probes and downweighting or eliminating them.

A number of metrics were used to evaluate the entries, including the percentage correct on the test set, the area under the ROC curve, and a combined score that compared each pair of classifiers head-to-head. The results of the competition are very interesting and are detailed in Guyon et al. (2006). The most notable result: the entries of Neal and Zhang (2006) were the clear overall winners. In the final competition they finished first in three of the five datasets, and were 5th and 7th on the remaining two datasets.

TABLE 11.2. NIPS 2003 challenge data sets. The column labeled p is the number of features. For the Dorothea dataset the features are binary. N_tr, N_val and N_te are the number of training, validation and test cases, respectively.

Dataset    Domain                Feature Type   p         Percent Probes   N_tr   N_val   N_te
Arcene     Mass spectrometry     Dense          10,000    30                100    100     700
Dexter     Text classification   Sparse         20,000    50                300    300    2000
Dorothea   Drug discovery        Sparse         100,000   50                800    350     800
Gisette    Digit recognition     Dense          5000      30               6000   1000    6500
Madelon    Artificial            Dense          500       96               2000    600    1800

In their winning entries, Neal and Zhang (2006) used a series of pre-processing feature-selection steps, followed by Bayesian neural networks, Dirichlet diffusion trees, and combinations of these methods. Here we focus only on the Bayesian neural network approach, and try to discern which aspects of their approach were important for its success. We rerun their programs and compare the results to boosted neural networks and boosted trees, and other related methods.

11.9.1 Bayes, Boosting and Bagging

Let us first review briefly the Bayesian approach to inference and its application to neural networks. Given training data X_tr, y_tr, we assume a sampling model with parameters θ; Neal and Zhang (2006) use a two-hidden-layer neural network, with output nodes the class probabilities Pr(Y|X, θ) for the binary outcomes. Given a prior distribution Pr(θ), the posterior distribution for the parameters is

\[
\Pr(\theta \mid X_{tr}, y_{tr}) = \frac{\Pr(\theta)\Pr(y_{tr} \mid X_{tr}, \theta)}{\int \Pr(\theta)\Pr(y_{tr} \mid X_{tr}, \theta)\, d\theta}. \tag{11.19}
\]

For a test case with features X_new, the predictive distribution for the label Y_new is

\[
\Pr(Y_{new} \mid X_{new}, X_{tr}, y_{tr}) = \int \Pr(Y_{new} \mid X_{new}, \theta)\Pr(\theta \mid X_{tr}, y_{tr})\, d\theta \tag{11.20}
\]

(cf. equation 8.24).
Since the integral in (11.20) is intractable, sophisticated Markov chain Monte Carlo (MCMC) methods are used to sample from the posterior distribution Pr(θ | X_tr, y_tr). A few hundred values of θ are generated, and then a simple average over these values estimates the integral. Neal and Zhang (2006) use diffuse Gaussian priors for all of the parameters. The particular MCMC approach that was used is called hybrid Monte Carlo, and may be important for the success of the method. It includes an auxiliary momentum vector and implements Hamiltonian dynamics in which the potential function is the target density. This is done to avoid random walk behavior; the successive candidates move across the sample space in larger steps. They tend to be less correlated and hence converge to the target distribution more rapidly.

Neal and Zhang (2006) also tried different forms of pre-processing of the features:

1. univariate screening using t-tests, and
2. automatic relevance determination.

In the latter method (ARD), the weights (coefficients) for the jth feature to each of the first hidden layer units all share a common prior variance σ_j², and prior mean zero. The posterior distributions for each variance σ_j² are computed, and the features whose posterior variance concentrates on small values are discarded.

There are thus three main features of this approach that could be important for its success:

(a) the feature selection and pre-processing,
(b) the neural network model, and
(c) the Bayesian inference for the model using MCMC.

According to Neal and Zhang (2006), feature screening in (a) is carried out purely for computational efficiency; the MCMC procedure is slow with a large number of features. There is no need to use feature selection to avoid overfitting. The posterior average (11.20) takes care of this automatically.

We would like to understand the reasons for the success of the Bayesian method. In our view, the power of modern Bayesian methods does not lie in their use as a formal inference procedure; most people would not believe that the priors in a high-dimensional, complex neural network model are actually correct. Rather the Bayesian/MCMC approach gives an efficient way of sampling the relevant parts of model space, and then averaging the predictions for the high-probability models.

Bagging and boosting are non-Bayesian procedures that have some similarity to MCMC in a Bayesian model. The Bayesian approach fixes the data and perturbs the parameters, according to the current estimate of the posterior distribution. Bagging perturbs the data in an i.i.d. fashion and then re-estimates the model to give a new set of model parameters. At the end, a simple average of the model predictions from different bagged samples is computed. Boosting is similar to bagging, but fits a model that is additive in the models of each individual base learner, which are learned using non-i.i.d. samples. We can write all of these models in the form

\[
\hat{f}(x_{new}) = \sum_{\ell=1}^{L} w_\ell\, \mathrm{E}(Y_{new} \mid x_{new}, \hat{\theta}_\ell). \tag{11.21}
\]

In all cases the θ̂_ℓ are a large collection of model parameters. For the Bayesian model the w_ℓ = 1/L, and the average estimates the posterior mean (11.20) by sampling θ_ℓ from the posterior distribution. For bagging, w_ℓ = 1/L as well, and the θ̂_ℓ are the parameters refit to bootstrap resamples of the training data. For boosting, the weights are all equal to 1, but the θ̂_ℓ are typically chosen in a nonrandom sequential fashion to constantly improve the fit.
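The common form (11.21) is easily expressed in code; the sketch below (ours) is generic, with `fit` a placeholder for any fitting routine that returns a prediction function.

```python
import numpy as np

def ensemble_predict(x_new, models, weights):
    # The common form (11.21): a weighted sum of E(Y_new | x_new, theta_hat_l).
    return sum(w * m(x_new) for w, m in zip(weights, models))

def bag(X, y, fit, L=25, rng=np.random.default_rng(0)):
    """Bagging: refit to bootstrap resamples and average with weights 1/L.
    `fit` is a placeholder for any routine returning a prediction function;
    for the Bayesian average, the models would instead be posterior draws,
    and for boosting the weights would all equal 1."""
    models = []
    for _ in range(L):
        idx = rng.integers(0, len(y), size=len(y))   # i.i.d. resample of cases
        models.append(fit(X[idx], y[idx]))
    return lambda x_new: ensemble_predict(x_new, models, [1.0 / L] * L)
```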
11.9.2 Performance Comparisons

Based on the similarities above, we decided to compare Bayesian neural networks to boosted trees, boosted neural networks, random forests and bagged neural networks on the five datasets in Table 11.2. Bagging and boosting of neural networks are not methods that we have previously used in our work. We decided to try them here, because of the success of Bayesian neural networks in this competition, and the good performance of bagging and boosting with trees. We also felt that by bagging and boosting neural nets, we could assess both the choice of model as well as the model search strategy. Here are the details of the learning methods that were compared:

Bayesian neural nets. The results here are taken from Neal and Zhang (2006), using their Bayesian approach to fitting neural networks. The models had two hidden layers of 20 and 8 units. We re-ran some networks for timing purposes only.

Boosted trees. We used the gbm package (version 1.5-7) in the R language. Tree depth and shrinkage factors varied from dataset to dataset. We consistently bagged 80% of the data at each boosting iteration (the default is 50%). Shrinkage was between 0.001 and 0.1. Tree depth was between 2 and 9.

Boosted neural networks. Since boosting is typically most effective with "weak" learners, we boosted a single hidden layer neural network with two or four units, fit with the nnet package (version 7.2-36) in R.

Random forests. We used the R package randomForest (version 4.5-16) with default settings for the parameters.

Bagged neural networks. We used the same architecture as in the Bayesian neural network above (two hidden layers of 20 and 8 units), fit using both Neal's C language package "Flexible Bayesian Modeling" (2004-11-10 release), and the Matlab neural-net toolbox (version 5.1).
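As a rough, hypothetical analogue of the boosted-tree settings above (the actual analysis used R's gbm package, with values tuned per dataset), here is how they map onto scikit-learn's GradientBoostingClassifier:

```python
from sklearn.ensemble import GradientBoostingClassifier

gbm_like = GradientBoostingClassifier(
    subsample=0.8,       # bag 80% of the data at each boosting iteration
    learning_rate=0.01,  # "shrinkage", somewhere in [0.001, 0.1]
    max_depth=4,         # tree depth, somewhere in [2, 9]
    n_estimators=500)
# gbm_like.fit(X_train, y_train) would then fit the boosted-tree model.
```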
[Figure 11.12: test error (%) of the five methods on Arcene, Dexter, Dorothea, Gisette and Madelon; top panel, univariate screened features; bottom panel, ARD reduced features.]

FIGURE 11.12. Performance of different learning methods on five problems, using both univariate screening of features (top panel) and a reduced feature set from automatic relevance determination (bottom panel). The error bars at the top of each plot have width equal to one standard error of the difference between two error rates. On most of the problems several competitors are within this error bound.

This analysis was carried out by Nicholas Johnson, and full details may be found in Johnson (2008).³ The results are shown in Figure 11.12 and Table 11.3. The figure and table show Bayesian, boosted and bagged neural networks, boosted trees, and random forests, using both the screened and reduced feature sets. The error bars at the top of each plot indicate one standard error of the difference between two error rates. Bayesian neural networks again emerge as the winner, although for some datasets the differences between the test error rates are not statistically significant. Random forests performs the best among the competitors using the selected feature set, while the boosted neural networks perform best with the reduced feature set, and nearly match the Bayesian neural net.

³We also thank Isabelle Guyon for help in preparing the results of this section.

TABLE 11.3. Performance of different methods. Values are average rank of test error across the five problems (low is good), and mean computation time and standard error of the mean, in minutes.

                             Screened Features          ARD Reduced Features
Method                       Avg. Rank   Avg. Time      Avg. Rank   Avg. Time
Bayesian neural networks     1.5         384 (138)      1.6         600 (186)
Boosted trees                3.4         3.03 (2.5)     4.0         34.1 (32.4)
Boosted neural networks      3.8         9.4 (8.6)      2.2         35.6 (33.5)
Random forests               2.7         1.9 (1.7)      3.2         11.2 (9.3)
Bagged neural networks       3.6         3.5 (1.1)      4.0         6.4 (4.4)

The superiority of boosted neural networks over boosted trees suggests that the neural network model is better suited to these particular problems. Specifically, individual features might not be good predictors here, and linear combinations of features work better. However the impressive performance of random forests is at odds with this explanation, and came as a surprise to us.

Since the reduced feature sets come from the Bayesian neural network approach, only the methods that use the screened features are legitimate, self-contained procedures. However, this does suggest that better methods for internal feature selection might help the overall performance of boosted neural networks.

The table also shows the approximate training time required for each method. Here the non-Bayesian methods show a clear advantage.

Overall, the superior performance of Bayesian neural networks here may be due to the fact that (a) the neural network model is well suited to these five problems, and (b) the MCMC approach provides an efficient way of exploring the important part of the parameter space, and then averaging the resulting models according to their quality. The Bayesian approach works well for smoothly parametrized models like neural nets; it is not yet clear that it works as well for non-smooth models like trees.

11.10 Computational Considerations

With N observations, p predictors, M hidden units and L training epochs, a neural network fit typically requires O(NpML) operations. There are many packages available for fitting neural networks, probably many more than exist for mainstream statistical methods. Because the available software varies widely in quality, and the learning problem for neural networks is sensitive to issues such as input scaling, such software should be carefully chosen and tested.

Bibliographic Notes

Projection pursuit was proposed by Friedman and Tukey (1974), and specialized to regression by Friedman and Stuetzle (1981). Huber (1985) gives a scholarly overview, and Roosen and Hastie (1994) present a formulation using smoothing splines. The motivation for neural networks dates back to McCulloch and Pitts (1943), Widrow and Hoff (1960) (reprinted in Anderson and Rosenfeld (1988)) and Rosenblatt (1962). Hebb (1949) heavily influenced the development of learning algorithms. The resurgence of neural networks in the mid 1980s was due to Werbos (1974), Parker (1985) and Rumelhart et al. (1986), who proposed the back-propagation algorithm. Today there are many books written on the topic, for a broad range of audiences. For readers of this book, Hertz et al. (1991), Bishop (1995) and Ripley (1996) may be the most informative. Bayesian learning for neural networks is described in Neal (1996). The ZIP code example was taken from Le Cun (1989); see also Le Cun et al. (1990) and Le Cun et al. (1998). We do not discuss theoretical topics such as approximation properties of neural networks, such as the work of Barron (1993), Girosi et al. (1995) and Jones (1992).
Some of these results are summarized by Ripley (1996).

Exercises

Ex. 11.1 Establish the exact correspondence between the projection pursuit regression model (11.1) and the neural network (11.5). In particular, show that the single-layer regression network is equivalent to a PPR model with g_m(ω_m^T x) = β_m σ(α_{0m} + s_m(ω_m^T x)), where ω_m is the mth unit vector. Establish a similar equivalence for a classification network.

Ex. 11.2 Consider a neural network for a quantitative outcome as in (11.5), using squared-error loss and identity output function g_k(t) = t. Suppose that the weights α_m from the input to hidden layer are nearly zero. Show that the resulting model is nearly linear in the inputs.

Ex. 11.3 Derive the forward and backward propagation equations for the cross-entropy loss function.

Ex. 11.4 Consider a neural network for a K class outcome that uses cross-entropy loss. If the network has no hidden layer, show that the model is equivalent to the multinomial logistic model described in Chapter 4.

Ex. 11.5
(a) Write a program to fit a single hidden layer neural network (ten hidden units) via back-propagation and weight decay.
(b) Apply it to 100 observations from the model

\[
Y = \sigma(a_1^T X) + (a_2^T X)^2 + 0.30 \cdot Z,
\]

where σ is the sigmoid function, Z is standard normal, X^T = (X_1, X_2), each X_j being independent standard normal, and a_1 = (3, 3), a_2 = (3, −3). Generate a test sample of size 1000, and plot the training and test error curves as a function of the number of training epochs, for different values of the weight decay parameter. Discuss the overfitting behavior in each case.
(c) Vary the number of hidden units in the network, from 1 up to 10, and determine the minimum number needed to perform well for this task.

Ex. 11.6 Write a program to carry out projection pursuit regression, using cubic smoothing splines with fixed degrees of freedom. Fit it to the data from the previous exercise, for various values of the smoothing parameter and number of model terms. Find the minimum number of model terms necessary for the model to perform well and compare this to the number of hidden units from the previous exercise.

Ex. 11.7 Fit a neural network to the spam data of Section 9.1.2, and compare the results to those for the additive model given in that chapter. Compare both the classification performance and interpretability of the final model.

12 Support Vector Machines and Flexible Discriminants

12.1 Introduction

In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable. Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space. The second set of methods generalize Fisher's linear discriminant analysis (LDA). The generalizations include flexible discriminant analysis which facilitates construction of nonlinear boundaries in a manner very similar to the support vector machines, penalized discriminant analysis for problems such as signal and image classification where the large number of features are highly correlated, and mixture discriminant analysis for irregularly shaped classes.
12 Support Vector Machines and Flexible Discriminants

12.1 Introduction

In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable. Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space. The second set of methods generalizes Fisher's linear discriminant analysis (LDA). The generalizations include flexible discriminant analysis, which facilitates the construction of nonlinear boundaries in a manner very similar to support vector machines; penalized discriminant analysis, for problems such as signal and image classification where the large numbers of features are highly correlated; and mixture discriminant analysis, for irregularly shaped classes.

12.2 The Support Vector Classifier

In Chapter 4 we discussed a technique for constructing an optimal separating hyperplane between two perfectly separated classes. We review this and generalize to the nonseparable case, where the classes may not be separable by a linear boundary.

[FIGURE 12.1. Support vector classifiers. The left panel shows the separable case. The decision boundary is the solid line, while broken lines bound the shaded maximal margin of width $2M = 2/\|\beta\|$. The right panel shows the nonseparable (overlap) case. The points labeled $\xi_j^*$ are on the wrong side of their margin by an amount $\xi_j^* = M\xi_j$; points on the correct side have $\xi_j^* = 0$. The margin is maximized subject to a total budget $\sum \xi_i \le$ constant. Hence $\sum \xi_j^*$ is the total distance of points on the wrong side of their margin.]

Our training data consist of $N$ pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, with $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. Define a hyperplane by

$$\{x : f(x) = x^T\beta + \beta_0 = 0\}, \tag{12.1}$$

where $\beta$ is a unit vector: $\|\beta\| = 1$. A classification rule induced by $f(x)$ is

$$G(x) = \operatorname{sign}[x^T\beta + \beta_0]. \tag{12.2}$$

The geometry of hyperplanes is reviewed in Section 4.5, where we show that $f(x)$ in (12.1) gives the signed distance from a point $x$ to the hyperplane $f(x) = x^T\beta + \beta_0 = 0$. Since the classes are separable, we can find a function $f(x) = x^T\beta + \beta_0$ with $y_i f(x_i) > 0\ \forall i$. Hence we are able to find the hyperplane that creates the biggest margin between the training points for class 1 and $-1$ (see Figure 12.1). The optimization problem

$$\max_{\beta,\, \beta_0,\, \|\beta\|=1} M \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge M,\ i = 1, \ldots, N, \tag{12.3}$$

captures this concept. The band in the figure is $M$ units away from the hyperplane on either side, and hence $2M$ units wide. It is called the margin.

We showed that this problem can be more conveniently rephrased as

$$\min_{\beta,\, \beta_0} \|\beta\| \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge 1,\ i = 1, \ldots, N, \tag{12.4}$$

where we have dropped the norm constraint on $\beta$. Note that $M = 1/\|\beta\|$. Expression (12.4) is the usual way of writing the support vector criterion for separated data. This is a convex optimization problem (quadratic criterion, linear inequality constraints), and the solution is characterized in Section 4.5.2.
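For readers who want the rescaling step behind the equivalence of (12.3) and (12.4) spelled out, here is the Chapter 4 argument in brief (a sketch, not a new result). Dropping the constraint $\|\beta\| = 1$, the conditions in (12.3) become

$$\frac{1}{\|\beta\|}\, y_i(x_i^T\beta + \beta_0) \;\ge\; M, \qquad i = 1, \ldots, N,$$

and since any positive rescaling of $(\beta, \beta_0)$ leaves these inequalities intact, we may fix the scale by setting $\|\beta\| = 1/M$. The constraints then read $y_i(x_i^T\beta + \beta_0) \ge 1$, and maximizing $M$ is the same as minimizing $\|\beta\|$, which is (12.4).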
Suppose now that the classes overlap in feature space. One way to deal with the overlap is to still maximize $M$, but allow for some points to be on the wrong side of the margin. Define the slack variables $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$. There are two natural ways to modify the constraint in (12.3):

$$y_i(x_i^T\beta + \beta_0) \ge M - \xi_i, \tag{12.5}$$

or

$$y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i), \tag{12.6}$$

$\forall i$, with $\xi_i \ge 0$ and $\sum_{i=1}^N \xi_i \le$ constant. The two choices lead to different solutions. The first choice seems more natural, since it measures overlap in actual distance from the margin; the second choice measures the overlap in relative distance, which changes with the width of the margin $M$. However, the first choice results in a nonconvex optimization problem, while the second is convex; thus (12.6) leads to the "standard" support vector classifier, which we use from here on.

Here is the idea of the formulation. The value $\xi_i$ in the constraint $y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i)$ is the proportional amount by which the prediction $f(x_i) = x_i^T\beta + \beta_0$ is on the wrong side of its margin. Hence by bounding the sum $\sum \xi_i$, we bound the total proportional amount by which predictions fall on the wrong side of their margin. Misclassifications occur when $\xi_i > 1$, so bounding $\sum \xi_i$ at a value $K$, say, bounds the total number of training misclassifications at $K$.

As in (4.48) in Section 4.5.2, we can drop the norm constraint on $\beta$, define $M = 1/\|\beta\|$, and write (12.4) in the equivalent form

$$\min \|\beta\| \quad \text{subject to } \begin{cases} y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i\ \forall i, \\ \xi_i \ge 0,\ \sum \xi_i \le \text{constant}. \end{cases} \tag{12.7}$$

This is the usual way the support vector classifier is defined for the nonseparable case. However, we find the presence of the fixed scale "1" in the constraint $y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i$ confusing, and prefer to start with (12.6). The right panel of Figure 12.1 illustrates this overlapping case.

By the nature of the criterion (12.7), we see that points well inside their class boundary do not play a big role in shaping the boundary. This seems like an attractive property, and one that differentiates it from linear discriminant analysis (Section 4.3). In LDA, the decision boundary is determined by the covariance of the class distributions and the positions of the class centroids. We will see in Section 12.3.3 that logistic regression is more similar to the support vector classifier in this regard.

12.2.1 Computing the Support Vector Classifier

The problem (12.7) is quadratic with linear inequality constraints, hence it is a convex optimization problem. We describe a quadratic programming solution using Lagrange multipliers. Computationally it is convenient to re-express (12.7) in the equivalent form

$$\min_{\beta,\, \beta_0} \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i \quad \text{subject to } \xi_i \ge 0,\ y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i\ \forall i, \tag{12.8}$$

where the "cost" parameter $C$ replaces the constant in (12.7); the separable case corresponds to $C = \infty$. The Lagrange (primal) function is

$$L_P = \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] - \sum_{i=1}^N \mu_i\xi_i, \tag{12.9}$$

which we minimize with respect to $\beta$, $\beta_0$ and the $\xi_i$. Setting the respective derivatives to zero, we get

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \tag{12.10}$$

$$0 = \sum_{i=1}^N \alpha_i y_i, \tag{12.11}$$

$$\alpha_i = C - \mu_i,\ \forall i, \tag{12.12}$$

as well as the positivity constraints $\alpha_i, \mu_i, \xi_i \ge 0\ \forall i$. By substituting (12.10)–(12.12) into (12.9), we obtain the Lagrangian (Wolfe) dual objective function

$$L_D = \sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i=1}^N \sum_{i'=1}^N \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'}, \tag{12.13}$$

which gives a lower bound on the objective function (12.8) for any feasible point. We maximize $L_D$ subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^N \alpha_i y_i = 0$. In addition to (12.10)–(12.12), the Karush–Kuhn–Tucker conditions include the constraints

$$\alpha_i[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] = 0, \tag{12.14}$$

$$\mu_i\xi_i = 0, \tag{12.15}$$

$$y_i(x_i^T\beta + \beta_0) - (1 - \xi_i) \ge 0, \tag{12.16}$$

for $i = 1, \ldots, N$. Together these equations (12.10)–(12.16) uniquely characterize the solution to the primal and dual problems.

From (12.10) we see that the solution for $\beta$ has the form

$$\hat\beta = \sum_{i=1}^N \hat\alpha_i y_i x_i, \tag{12.17}$$

with nonzero coefficients $\hat\alpha_i$ only for those observations $i$ for which the constraints in (12.16) are exactly met (due to (12.14)). These observations are called the support vectors, since $\hat\beta$ is represented in terms of them alone. Among these support points, some will lie on the edge of the margin ($\hat\xi_i = 0$), and hence from (12.15) and (12.12) will be characterized by $0 < \hat\alpha_i < C$; the remainder ($\hat\xi_i > 0$) have $\hat\alpha_i = C$. From (12.14) we can see that any of the margin points ($0 < \hat\alpha_i$, $\hat\xi_i = 0$) can be used to solve for $\beta_0$, and we typically use an average of all the solutions for numerical stability.

Maximizing the dual (12.13) is a simpler convex quadratic programming problem than the primal (12.9), and can be solved with standard techniques (Murray et al., 1981, for example). Given the solutions $\hat\beta_0$ and $\hat\beta$, the decision function can be written as

$$\hat G(x) = \operatorname{sign}[\hat f(x)] = \operatorname{sign}[x^T\hat\beta + \hat\beta_0]. \tag{12.18}$$

The tuning parameter of this procedure is the cost parameter $C$.
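As an illustration, here is a minimal sketch that solves the dual (12.13) with a general-purpose solver (scipy's SLSQP; the function name, tolerances, and defaults are our own choices, and a production SVM would use a specialized QP or decomposition method instead):

```python
import numpy as np
from scipy.optimize import minimize

def fit_svc_dual(X, y, C=1.0):
    """Soft-margin linear support vector classifier via the Wolfe
    dual (12.13); y must be coded in {-1, +1}. A sketch only."""
    y = np.asarray(y, dtype=float)
    N = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i' x_j

    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()   # minimize -L_D
    grad = lambda a: Q @ a - np.ones(N)
    cons = {'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y}  # (12.11)
    res = minimize(neg_dual, np.zeros(N), jac=grad, method='SLSQP',
                   bounds=[(0.0, C)] * N, constraints=[cons])
    alpha = res.x
    beta = (alpha * y) @ X                      # recover beta via (12.10)
    # beta_0 from the margin points (0 < alpha_i < C), averaged for
    # numerical stability; assumes at least one such point exists
    m = (alpha > 1e-6) & (alpha < C - 1e-6)
    beta0 = np.mean(y[m] - X[m] @ beta)
    return beta, beta0, alpha
```

Classification then follows (12.18): predict with np.sign(X @ beta + beta0).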
12.2.2 Mixture Example (Continued)

Figure 12.2 shows the support vector boundary for the mixture example of Figure 2.5, with two overlapping classes, for two different values of the cost parameter $C$. The classifiers are rather similar in their performance. Points on the wrong side of the boundary are support vectors. In addition, points on the correct side of the boundary but close to it (in the margin) are also support vectors. The margin is larger for $C = 0.01$ than it is for $C = 10000$. Hence larger values of $C$ focus attention more on (correctly classified) points near the decision boundary, while smaller values involve data further away. Either way, misclassified points are given weight, no matter how far away. In this example the procedure is not very sensitive to choices of $C$, because of the rigidity of a linear boundary.

[FIGURE 12.2 appears here: the linear support vector boundaries for the mixture data, for $C = 10000$ and $C = 0.01$.]

The optimal value for $C$ can be estimated by cross-validation, as discussed in Chapter 7. Interestingly, the leave-one-out cross-validation error can be bounded above by the proportion of support points in the data. The reason is that leaving out an observation that is not a support vector will not change the solution. Hence these observations, being classified correctly by the original boundary, will be classified correctly in the cross-validation process. However, this bound tends to be too high, and is not generally useful for choosing $C$ (62% and 85%, respectively, in our examples).
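With the dual sketch above, this bound is simply the fraction of nonzero $\hat\alpha_i$. A two-line illustration, assuming X_train and y_train hold a two-class sample coded $\pm 1$ (the 1e-6 threshold is a numerical tolerance of our choosing):

```python
beta, beta0, alpha = fit_svc_dual(X_train, y_train, C=10000.0)
support_frac = np.mean(alpha > 1e-6)   # upper-bounds the LOO CV error
```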