O’Reilly-5980006 janert5980006˙fm October 28, 2010 22:1

Data Analysis with Open Source Tools

Philipp K. Janert

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Data Analysis with Open Source Tools
by Philipp K. Janert

Copyright © 2011 Philipp K. Janert. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Mike Loukides
Production Editor: Sumita Mukherji
Copyeditor: Matt Darnell
Production Services: MPS Limited, a Macmillan Company, and Newgen North America, Inc.
Indexer: Fred Brown
Cover Designer: Karen Montgomery
Interior Designer: Edie Freedman and Ron Bilodeau
Illustrator: Philipp K. Janert

Printing History: November 2010: First Edition.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analysis with Open Source Tools, the image of a common kite, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-0-596-80235-6

Furious activity is no substitute for understanding.
—H. H. Williams

CONTENTS

PREFACE

1  INTRODUCTION
   Data Analysis
   What’s in This Book
   What’s with the Workshops?
   What’s with the Math?
   What You’ll Need
   What’s Missing

PART I  Graphics: Looking at Data

2  A SINGLE VARIABLE: SHAPE AND DISTRIBUTION
   Dot and Jitter Plots
   Histograms and Kernel Density Estimates
   The Cumulative Distribution Function
   Rank-Order Plots and Lift Charts
   Only When Appropriate: Summary Statistics and Box Plots
   Workshop: NumPy
   Further Reading

3  TWO VARIABLES: ESTABLISHING RELATIONSHIPS
   Scatter Plots
   Conquering Noise: Smoothing
   Logarithmic Plots
   Banking
   Linear Regression and All That
   Showing What’s Important
   Graphical Analysis and Presentation Graphics
   Workshop: matplotlib
   Further Reading

4  TIME AS A VARIABLE: TIME-SERIES ANALYSIS
   Examples
   The Task
   Smoothing
   Don’t Overlook the Obvious!
   The Correlation Function
   Optional: Filters and Convolutions
   Workshop: scipy.signal
   Further Reading

5  MORE THAN TWO VARIABLES: GRAPHICAL MULTIVARIATE ANALYSIS
   False-Color Plots
   A Lot at a Glance: Multiplots
   Composition Problems
   Novel Plot Types
   Interactive Explorations
   Workshop: Tools for Multivariate Graphics
   Further Reading

6  INTERMEZZO: A DATA ANALYSIS SESSION
   A Data Analysis Session
   Workshop: gnuplot
   Further Reading

PART II  Analytics: Modeling Data

7  GUESSTIMATION AND THE BACK OF THE ENVELOPE
   Principles of Guesstimation
   How Good Are Those Numbers?
   Optional: A Closer Look at Perturbation Theory and Error Propagation
   Workshop: The Gnu Scientific Library (GSL)
   Further Reading

8  MODELS FROM SCALING ARGUMENTS
   Models
   Arguments from Scale
   Mean-Field Approximations
   Common Time-Evolution Scenarios
   Case Study: How Many Servers Are Best?
   Why Modeling?
   Workshop: Sage
   Further Reading

9  ARGUMENTS FROM PROBABILITY MODELS
   The Binomial Distribution and Bernoulli Trials
   The Gaussian Distribution and the Central Limit Theorem
   Power-Law Distributions and Non-Normal Statistics
   Other Distributions
   Optional: Case Study—Unique Visitors over Time
   Workshop: Power-Law Distributions
   Further Reading

10  WHAT YOU REALLY NEED TO KNOW ABOUT CLASSICAL STATISTICS
    Genesis
    Statistics Defined
    Statistics Explained
    Controlled Experiments Versus Observational Studies
    Optional: Bayesian Statistics—The Other Point of View
    Workshop: R
    Further Reading

11  INTERMEZZO: MYTHBUSTING—BIGFOOT, LEAST SQUARES, AND ALL THAT
    How to Average Averages
    The Standard Deviation
    Least Squares
    Further Reading

PART III  Computation: Mining Data

12  SIMULATIONS
    A Warm-Up Question
    Monte Carlo Simulations
    Resampling Methods
    Workshop: Discrete Event Simulations with SimPy
    Further Reading

13  FINDING CLUSTERS
    What Constitutes a Cluster?
    Distance and Similarity Measures
    Clustering Methods
    Pre- and Postprocessing
    Other Thoughts
    A Special Case: Market Basket Analysis
    A Word of Warning
    Workshop: Pycluster and the C Clustering Library
    Further Reading

14  SEEING THE FOREST FOR THE TREES: FINDING IMPORTANT ATTRIBUTES
    Principal Component Analysis
    Visual Techniques
    Kohonen Maps
    Workshop: PCA with R
    Further Reading

15  INTERMEZZO: WHEN MORE IS DIFFERENT
    A Horror Story
    Some Suggestions
    What About Map/Reduce?
    Workshop: Generating Permutations
    Further Reading

PART IV  Applications: Using Data

16  REPORTING, BUSINESS INTELLIGENCE, AND DASHBOARDS
    Business Intelligence
    Corporate Metrics and Dashboards
    Data Quality Issues
    Workshop: Berkeley DB and SQLite
    Further Reading

17  FINANCIAL CALCULATIONS AND MODELING
    The Time Value of Money
    Uncertainty in Planning and Opportunity Costs
    Cost Concepts and Depreciation
    Should You Care?
    Is This All That Matters?
    Workshop: The Newsvendor Problem
    Further Reading

18  PREDICTIVE ANALYTICS
    Introduction
    Some Classification Terminology
    Algorithms for Classification
    The Process
    The Secret Sauce
    The Nature of Statistical Learning
    Workshop: Two Do-It-Yourself Classifiers
    Further Reading

19  EPILOGUE: FACTS ARE NOT REALITY

A  PROGRAMMING ENVIRONMENTS FOR SCIENTIFIC COMPUTATION AND DATA ANALYSIS
   Software Tools
   A Catalog of Scientific Software
   Writing Your Own
   Further Reading

B  RESULTS FROM CALCULUS
   Common Functions
   Calculus
   Useful Tricks
   Notation and Basic Math
   Where to Go from Here
   Further Reading

C  WORKING WITH DATA
   Sources for Data
   Cleaning and Conditioning
   Sampling
   Data File Formats
   The Care and Feeding of Your Data Zoo
   Skills
   Terminology
   Further Reading

INDEX

Preface

This book grew out of my experience of working with data for various companies in the tech industry. It is a collection of those concepts and techniques that I have found to be the most useful, including many topics that I wish I had known earlier—but didn’t.

My degree is in physics, but I also worked as a software engineer for several years. The book reflects this dual heritage. On the one hand, it is written for programmers and others in the software field: I assume that you, like me, have the ability to write your own programs to manipulate data in any way you want. On the other hand, the way I think about data has been shaped by my background and education. As a physicist, I am not content merely to describe data or to make black-box predictions: the purpose of an analysis is always to develop an understanding of the processes or mechanisms that give rise to the data that we observe.
The instrument to express such understanding is the model: a description of the system under study (in other words, not just a description of the data!), simplified as necessary but nevertheless capturing the relevant information. A model may be crude (“Assume a spherical cow ...”), but if it helps us develop better insight into how the system works, it is a successful model nevertheless. (Additional precision can often be obtained at a later time, if it is really necessary.)

This emphasis on models and simplified descriptions is not universal: other authors and practitioners will make different choices. But it is essential to my approach and point of view.

This is a rather personal book. Although I have tried to be reasonably comprehensive, I have selected the topics that I consider relevant and useful in practice—whether they are part of the “canon” or not. Also included are several topics that you won’t find in any other book on data analysis. Although neither new nor original, they are usually not used or discussed in this particular context—but I find them indispensable.

Throughout the book, I freely offer specific, explicit advice, opinions, and assessments. These remarks are reflections of my personal interest, experience, and understanding. I do not claim that my point of view is necessarily correct: evaluate what I say for yourself and feel free to adapt it to your needs. In my view, a specific, well-argued position is of greater use than a sterile laundry list of possible algorithms—even if you later decide to disagree with me. The value is not in the opinion but rather in the arguments leading up to it. If your arguments are better than mine, or even just more agreeable to you, then I will have achieved my purpose!

Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it has a name: curiosity. There is always something else to find out and something more to learn.
This book is not the last word on the matter; it is merely a snapshot in time: things I knew about and found useful today.

“Works are of value only if they give rise to better ones.”
—Alexander von Humboldt, writing to Charles Darwin, 18 September 1839

Before We Begin

More data analysis efforts seem to go bad because of an excess of sophistication rather than a lack of it. This may come as a surprise, but it has been my experience again and again. As a consultant, I am often called in when the initial project team has already gotten stuck. Rarely (if ever) does the problem turn out to be that the team did not have the required skills. On the contrary, I usually find that they tried to do something unnecessarily complicated and are now struggling with the consequences of their own invention!

Based on what I have seen, two particular risk areas stand out:

• The use of “statistical” concepts that are only partially understood (and given the relative obscurity of most of statistics, this includes virtually all statistical concepts)

• Complicated (and expensive) black-box solutions when a simple and transparent approach would have worked at least as well or better

I strongly recommend that you make it a habit to avoid all statistical language. Keep it simple and stick to what you know for sure. There is absolutely nothing wrong with speaking of the “range over which points spread,” because this phrase means exactly what it says: the range over which points spread, and only that! Once we start talking about “standard deviations,” this clarity is gone. Are we still talking about the observed width of the distribution? Or are we talking about one specific measure for this width? (The standard deviation is only one of several that are available.) Are we already making an implicit assumption about the nature of the distribution? (The standard deviation is only suitable under certain conditions, which are often not fulfilled in practice.)
Or are we even confusing the predictions we could make if these assumptions were true with the actual data? (The moment someone talks about “95 percent anything,” we know it’s the latter!)

I’d also like to remind you not to discard simple methods until they have been proven insufficient. Simple solutions are frequently rather effective: the marginal benefit that more complicated methods can deliver is often quite small (and may be in no reasonable relation to the increased cost). More importantly, simple methods have fewer opportunities to go wrong or to obscure the obvious.

True story: a company was tracking the occurrence of defects over time. Of course, the actual number of defects varied quite a bit from one day to the next, and they were looking for a way to obtain an estimate for the typical number of expected defects. The solution proposed by their IT department involved a compute cluster running a neural network! (I am not making this up.) In fact, a one-line calculation (involving a moving average or single exponential smoothing) is all that was needed.

I think the primary reason for this tendency to make data analysis projects more complicated than they need to be is discomfort: discomfort with an unfamiliar problem space and uncertainty about how to proceed. This discomfort and uncertainty create a desire to bring in the “big guns”: fancy terminology, heavy machinery, large projects. In reality, of course, the opposite is true: the complexities of the “solution” overwhelm the original problem, and nothing gets accomplished.

Data analysis does not have to be all that hard. Although there are situations when elementary methods will no longer be sufficient, they are much less prevalent than you might expect. In the vast majority of cases, curiosity and a healthy dose of common sense will serve you well.

The attitude that I am trying to convey can be summarized in a few points:

Simple is better than complex.
Cheap is better than expensive.
Explicit is better than opaque.
Purpose is more important than process.
Insight is more important than precision.
Understanding is more important than technique.
Think more, work less.

Although I do acknowledge that the items on the right are necessary at times, I will give preference to those on the left whenever possible. It is in this spirit that I am offering the concepts and techniques that make up the rest of this book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, and email addresses

Constant width
Used to refer to language and script elements

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Analysis with Open Source Tools, by Philipp K. Janert. Copyright 2011 Philipp K. Janert, 978-0-596-80235-6.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
Safari® Books Online

Safari Books Online is an on-demand digital library that lets you easily search over 7,500 technology and creative reference books and videos to find the answers you need quickly.

With a subscription, you can read any page and watch any video from our library online. Read books on your cell phone and mobile devices. Access new titles before they are available for print, and get exclusive access to manuscripts in development and post feedback for the authors. Copy and paste code samples, organize your favorites, download chapters, bookmark key sections, create notes, print out pages, and benefit from tons of other time-saving features.

O’Reilly Media has uploaded this book to the Safari Books Online service. To have full digital access to this book and others on similar topics from O’Reilly and other publishers, sign up for free at http://my.safaribooksonline.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://oreilly.com/catalog/9780596802356

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our website at:

http://oreilly.com

Acknowledgments

It was a pleasure to work with O’Reilly on this project. In particular, O’Reilly has been most accommodating with regard to the technical challenges raised by my need to include (for an O’Reilly book) an uncommonly large amount of mathematical material in the manuscript.
Mike Loukides has accompanied this project as the editor since its beginning. I have enjoyed our conversations about life, the universe, and everything, and I appreciate his comments about the manuscript—either way.

I’d like to thank several of my friends for their help in bringing this book about:

• Elizabeth Robson, for making the connection
• Austin King, for pointing out the obvious
• Scott White, for suffering my questions gladly
• Richard Kreckel, for much-needed advice

As always, special thanks go to Paul Schrader (Bremen).

The manuscript benefited from the feedback I received from various reviewers. Michael E. Driscoll, Zachary Kessin, and Austin King read all or parts of the manuscript and provided valuable comments. I enjoyed personal correspondence with Joseph Adler, Joe Darcy, Hilary Mason, Stephen Weston, Scott White, and Brian Zimmer. All very generously provided expert advice on specific topics. Particular thanks go to Richard Kreckel, who provided uncommonly detailed and insightful feedback on most of the manuscript.

During the preparation of this book, the excellent collection at the University of Washington libraries was an especially valuable resource to me.

Authors usually thank their spouses for their “patience and support” or words to that effect. Unless one has lived through the actual experience, one cannot fully comprehend how true this is. Over the last three years, Angela has endured what must have seemed like a nearly continuous stream of whining, frustration, and desperation—punctuated by occasional outbursts of exhilaration and grandiosity—all of it against the background of the self-centered and self-absorbed attitude of a typical author. Her patience and support were unfailing. It’s her turn now.
CHAPTER ONE

Introduction

Imagine your boss comes to you and says: “Here are 50 GB of logfiles—find a way to improve our business!”

What would you do? Where would you start? And what would you do next?

It’s this kind of situation that the present book wants to help you with!

Data Analysis

Businesses sit on data, and every second that passes, they generate some more. Surely, there must be a way to make use of all this stuff. But how, exactly—that’s far from clear.

The task is difficult because it is so vague: there is no specific problem that needs to be solved. There is no specific question that needs to be answered. All you know is the overall purpose: improve the business. And all you have is “the data.” Where do you start?

You start with the only thing you have: “the data.” What is it? We don’t know! Although 50 GB sure sounds like a lot, we have no idea what it actually contains. The first thing, therefore, is to take a look. And I mean this literally: the first thing to do is to look at the data by plotting it in different ways and looking at graphs.

Looking at data, you will notice things—the way data points are distributed, or the manner in which one quantity varies with another, or the large number of outliers, or the total absence of them.... I don’t know what you will find, but there is no doubt: if you look at data, you will observe things!

These observations should lead to some reflection. “Ten percent of our customers drive ninety percent of our revenue.” “Whenever our sales volume doubles, the number of returns goes up by a factor of four.” “Every seven days we have a production run that has twice the usual defect rate, and it’s always on a Thursday.” How very interesting!

Now you’ve got something to work with: the amorphous mass of “data” has turned into ideas!
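The "take a look first" advice can be sketched in a few lines of Python. This is a minimal sketch, assuming NumPy and matplotlib are installed; the data set is a randomly generated stand-in for whatever logfiles or measurements you actually have:

```python
# A minimal "first look" at a single column of numbers:
# a histogram for the shape, and the empirical cumulative
# distribution for the detail. The data is a synthetic stand-in.
import matplotlib
matplotlib.use("Agg")           # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(47)
data = rng.lognormal(mean=3.0, sigma=0.7, size=1000)   # stand-in data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=50)         # overall shape of the distribution
ax1.set_title("Histogram")

xs = np.sort(data)              # empirical cumulative distribution
ax2.plot(xs, np.arange(1, len(xs) + 1) / len(xs))
ax2.set_title("Cumulative distribution")

fig.savefig("first_look.png")
```

Even two throwaway plots like these will already show you the skew, the outliers, and the typical range of the data before you have committed to any method at all.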
To make these ideas concrete and suitable for further work, it is often useful to capture them in a mathematical form: a model. A model (the way I use the term) is a mathematical description of the system under study. A model is more than just a description of the data—it also incorporates your understanding of the process or the system that produced the data. A model therefore has predictive power: you can predict (with some certainty) that next Thursday the defect rate will be high again.

It’s at this point that you may want to go back and alert the boss to your findings: “Next Thursday, watch out for defects!”

Sometimes, you may already be finished at this point: you found out enough to help improve the business. At other times, however, you may need to work a little harder. Some data sets do not yield easily to visual inspection—especially if you are dealing with data sets consisting of many different quantities, all of which seem equally important. In such cases, you may need to employ more-sophisticated methods to develop enough intuition before being able to formulate a relevant model. Or you may have been able to set up a model, but it is too complicated to understand its implications, so that you want to implement the model as a computer program and simulate its results. Such computationally intensive methods are occasionally useful, but they always come later in the game. You should move on to them only after having tried all the simple things first. And you will need the insights gained from those earlier investigations as input to the more elaborate approaches.

And finally, we need to come back to the initial agenda. To “improve the business,” it is necessary to feed our understanding back into the organization—for instance, in the form of a business plan, or through a “metrics dashboard” or similar program.

What’s in This Book

The program just described reflects the outline of this book.
We begin in Part I with a series of chapters on graphical techniques, starting in Chapter 2 with simple data sets consisting of only a single variable (or considering only a single variable at a time), then moving on in Chapter 3 to data sets of two variables. In Chapter 4 we treat the particularly important special case of a quantity changing over time, a so-called time series. Finally, in Chapter 5, we discuss data sets comprising more than two variables and some special techniques suitable for such data sets.

In Part II, we discuss models as a way not only to describe data but also to capture the understanding that we gained from graphical explorations. We begin in Chapter 7 with a discussion of order-of-magnitude estimation and uncertainty considerations. This may seem odd but is, in fact, crucial: all models are approximate, so we need to develop a sense for the accuracy of the approximations that we use. In Chapters 8 and 9 we introduce basic building blocks that are useful when developing models.

Chapter 10 is a detour. For too many people, “data analysis” is synonymous with “statistics,” and “statistics” is usually equated with a class in college that made no sense at all. In this chapter, I want to explain what statistics really is, what all the mysterious concepts mean and how they hang together, and what statistics can (and cannot) do for us. It is intended as a travel guide should you ever want to read a statistics book in the future.

Part III discusses several computationally intensive methods, such as simulation and clustering in Chapters 12 and 13. Chapter 14 is, mathematically, the most challenging chapter in the book: it deals with methods that can help select the most relevant variables from a multivariate data set.

In Part IV we consider some ways that data may be used in a business environment.
In Chapter 16 we talk about metrics, reporting, and dashboards—what is sometimes referred to as “business intelligence.” In Chapter 17 we introduce some of the concepts required to make financial calculations and to prepare business plans. Finally, in Chapter 18, we conclude with a survey of some methods from classification and predictive analytics.

At the end of each part of the book you will find an “Intermezzo.” These intermezzos are not really part of the course; I use them to go off on some tangents, or to explain topics that often remain a bit hazy. You should see them as an opportunity to relax!

The appendices contain some helpful material that you may want to consult at various times as you go through the text. Appendix A surveys some of the available tools and programming environments for data manipulation and analysis. In Appendix B I have collected some basic mathematical results that I expect you to have at least passing familiarity with. I assume that you have seen this material at least once before, but in this appendix, I put it together in an application-oriented context, which is more suitable for our present purposes. Appendix C discusses some of the mundane tasks that—like it or not—make up a large part of actual data analysis and also introduces some data-related terminology.

What’s with the Workshops?

Every full chapter (after this one) includes a section titled “Workshop” that contains some programming examples related to the chapter’s material. I use these Workshops for two purposes. On the one hand, I’d like to introduce a number of open source tools and libraries that may be useful for the kind of work discussed in this book. On the other hand, some concepts (such as computational complexity and power-law distributions) must be seen to be believed: the Workshops are a way to demonstrate these issues and allow you to experiment with them yourself.
Among the tools and libraries is quite a bit of Python and R. Python has become more or less the scripting language of choice for scientific applications, and R is the most popular open source package for statistical applications. This choice is neither an endorsement nor a recommendation but primarily a reflection of the current state of available software. (See Appendix A for a more detailed discussion of software for data analysis and related purposes.)

My goal with the tool-oriented Workshops is rather specific: I want to enable you to decide whether a given tool or library is worth spending time on. (I have found that evaluating open source offerings is a necessary but time-consuming task.) I try to demonstrate clearly what purpose each particular tool serves. Toward this end, I usually give one or two short, but not entirely trivial, examples and try to outline enough of the architecture of the tool or library to allow you to take it from there. (The documentation for many open source projects has a hard time making the bridge from the trivial, cut-and-paste “Hello, World” example to the reference documentation.)

What’s with the Math?

This book contains a certain amount of mathematics. Depending on your personal predilection you may find this trivial, intimidating, or exciting. The reality is that if you want to work analytically, you will need to develop some familiarity with a few mathematical concepts. There is simply no way around it. (You can work with data without any math skills—look at what any data modeler or database administrator does. But if you want to do any sort of analysis, then a little math becomes a necessity.)

I have tried to make the text accessible to readers with a minimum of previous knowledge. Some college math classes on calculus and similar topics are helpful, of course, but are by no means required.
Some sections of the book treat material that is either more abstract or will likely be unreasonably hard to understand without some previous exposure. These sections are optional (they are not needed in the sequel) and are clearly marked as such.

A somewhat different issue concerns the notation. I use mathematical notation wherever it is appropriate and it helps the presentation. I have made sure to use only a very small set of symbols; check Appendix B if something looks unfamiliar.

Couldn’t I have written all the mathematical expressions as computer code, using Python or some sort of pseudo-code? The answer is no, because quite a few essential mathematical concepts cannot be expressed in a finite, floating-point oriented machine (anything having to do with a limit process—or real numbers, in fact). But even if I could write all math as code, I don’t think I should. Although I wholeheartedly agree that mathematical notation can get out of hand, simple formulas actually provide the easiest, most succinct way to express mathematical concepts.

Just compare. I’d argue that:

    \sum_{k=0}^{n} \frac{c(k)}{(1+p)^k}

is clearer and easier to read than:

    s = 0
    for k in range(len(c)):
        s += c[k]/(1+p)**k

and certainly easier than:

    s = ( c/(1+p)**numpy.arange(len(c)) ).sum(axis=0)

But that’s only part of the story. More importantly, the first version expresses a concept, whereas the second and third are merely specific prescriptions for how to perform a certain calculation. They are recipes, not ideas.

Consider this: the formula in the first line is a description of a sum—not a specific sum, but any sum of this form: it’s the idea of this kind of sum. We can now ask how this abstract sum will behave under certain conditions—for instance, if we let the upper limit n go to infinity. What value does the sum have in this case? Is it finite? Can we determine it? You would not even be able to ask this question given the code versions.
(Remember that I am not talking about an approximation, such as letting n get “very large.” I really do mean: what happens if n goes all the way to infinity? What can we say about the sum?) Some programming environments (like Haskell, for instance) are more at ease dealing with infinite data structures—but if you look closely, you will find that they do so by being (coarse) approximations to mathematical concepts and notations. And, of course, they still won’t be able to evaluate such expressions! (All evaluations will only involve a finite number of steps.) But once you train your mind to think in those terms, you can evaluate them in your mind at will.

It may come as a surprise, but mathematics is not a method for calculating things. Mathematics is a theory of ideas, and ideas—not calculational prescriptions—are what I would like to convey in this text. (See the discussion at the end of Appendix B for more on this topic and for some suggested reading.)

If you feel uncomfortable or even repelled by the math in this book, I’d like to ask for just one thing: try! Give it a shot. Don’t immediately give up. Any frustration you may experience at first is more likely due to lack of familiarity than to the difficulty of the material. I promise that none of the content is out of your reach. But you have to let go of the conditioned knee-jerk reflex that “math is, like, yuck!”

What You’ll Need

This book is written with programmers in mind. Although previous programming experience is by no means required, I assume that you are able to take an idea and implement it in the programming language of your choice—in fact, I assume that this is your prime motivation for reading this book.

I don’t expect you to have any particular mathematical background, although some previous familiarity with calculus is certainly helpful. You will need to be able to count, though!
But the most important prerequisite is not programming experience, not math skills, and certainly not knowledge of anything having to do with "statistics." The most important prerequisite is curiosity. If you aren't curious, then this book is not for you. If you get a new data set and you are not itching to see what's in it, I won't be able to help you.

What's Missing

This is a book about data analysis and modeling with an emphasis on applications in a business setting. It was written at a beginning-to-intermediate level and for a general technical audience. Although I have tried to be reasonably comprehensive, I had to choose which subjects to include and which to leave out. I have tried to select topics that are useful and relevant in practice and that can safely be applied by a nonspecialist. A few topics were omitted because they did not fit within the book's overall structure, or because I did not feel sufficiently competent to present them.

Scientific data. This is not a book about scientific data analysis. When you are doing scientific research (however you wish to define "scientific"), you really need to have a solid background (and that probably means formal training) in the field that you are working in. A book such as this one on general data analysis cannot replace this.

Formal statistical analysis. A different form of data analysis exists in some particularly well-established fields. In these situations, the environment from which the data arises is fully understood (or at least believed to be understood), and the methods and models to be used are likewise accepted and well known. Typical examples include clinical trials as well as credit scoring. The purpose of an "analysis" in these cases is not to find out anything new, but rather to determine the model parameters with the highest degree of accuracy and precision for each newly generated set of data points. Since this is the kind of work where details matter, it should be left to specialists.
Network analysis. This is a topic of current interest about which I know nothing. (Sorry!) However, it does seem to me that its nature is quite different from most problems that are usually considered "data analysis": less statistical, more algorithmic in nature. But I don't know for sure.

Natural language processing and text mining. Natural language processing is a big topic all by itself, which has little overlap (neither in terms of techniques nor applications) with the rest of the material presented here. It deserves its own treatment—and several books on this subject are available.

Big data. Arguably the most painful omission concerns everything having to do with Big Data. Big Data is a pretty new concept—I tend to think of it as relating to data sets that not merely don't fit into main memory, but that no longer fit comfortably on a single disk, requiring compute clusters and the respective software and algorithms (in practice, map/reduce running on Hadoop).

The rise of Big Data is a remarkable phenomenon. When this book was conceived (early 2009), Big Data was certainly on the horizon but was not necessarily considered mainstream yet. As this book goes to print (late 2010), it seems that for many people in the tech field, "data" has become nearly synonymous with "Big Data." That kind of development usually indicates a fad.

The reality is that, in practice, many data sets are "small," and in particular many relevant data sets are small. (Some of the most important data sets in a commercial setting are those maintained by the finance department—and since they are kept in Excel, they must be small.)

Big Data is not necessarily "better." Applied carelessly, it can be a huge step backward. The amazing insight of classical statistics is that you don't need to examine every single member of a population to make a definitive statement about the whole: instead you can sample!
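The power of sampling is easy to demonstrate numerically. The sketch below is a hypothetical example (the population, its parameters, and the sample size are all made up for illustration): it estimates the mean of a million-element "population" from a random sample of only a thousand points.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical "population" of one million values
population = rng.normal(loc=100.0, scale=15.0, size=1_000_000)

# A random sample of only 1,000 members (0.1 percent of the population)
sample = rng.choice(population, size=1_000, replace=False)

# The sample mean is a good estimate of the population mean; the
# expected error is about 15/sqrt(1000), or roughly half a unit.
print(population.mean(), sample.mean())
```

Examining a tenth of a percent of the population already pins down the mean to within a fraction of a percent—which is exactly the classical insight at stake here.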
It is also true that a carefully selected sample may lead to better results than a large, messy data set. Big Data makes it easy to forget the basics.

It is a little early to say anything definitive about Big Data, but the current trend strikes me as being something quite different: it is not just classical data analysis on a larger scale. The approach of classical data analysis and statistics is inductive. Given a part, make statements about the whole: from a sample, estimate parameters of the population; given an observation, develop a theory for the underlying system. In contrast, Big Data (at least as it is currently being used) seems primarily concerned with individual data points. Given that this specific user liked this specific movie, what other specific movie might he like? This is a very different question than asking which movies are most liked by what people in general!

Big Data will not replace general, inductive data analysis. It is not yet clear just where Big Data will deliver the greatest bang for the buck—but once the dust settles, somebody should definitely write a book about it!

PART I  Graphics: Looking at Data

CHAPTER TWO  A Single Variable: Shape and Distribution

When dealing with univariate data, we are usually mostly concerned with the overall shape of the distribution. Some of the initial questions we may ask include:

• Where are the data points located, and how far do they spread? What are typical, as well as minimal and maximal, values?

• How are the points distributed? Are they spread out evenly or do they cluster in certain areas?

• How many points are there? Is this a large data set or a relatively small one?

• Is the distribution symmetric or asymmetric?
In other words, is the tail of the distribution much larger on one side than on the other?

• Are the tails of the distribution relatively heavy (i.e., do many data points lie far away from the central group of points), or are most of the points—with the possible exception of individual outliers—confined to a restricted region?

• If there are clusters, how many are there? Is there only one, or are there several? Approximately where are the clusters located, and how large are they—both in terms of spread and in terms of the number of data points belonging to each cluster?

• Are the clusters possibly superimposed on some form of unstructured background, or does the entire data set consist only of the clustered data points?

• Does the data set contain any significant outliers—that is, data points that seem to be different from all the others?

• And lastly, are there any other unusual or significant features in the data set—gaps, sharp cutoffs, unusual values, anything at all that we can observe?

As you can see, even a simple, single-column data set can contain a lot of different features! To make this concrete, let's look at two examples. The first concerns a relatively small data set: the number of months that the various American presidents have spent in office. The second data set is much larger and stems from an application domain that may be more familiar; we will be looking at the response times from a web server.

Dot and Jitter Plots

Suppose you are given the following data set, which shows all past American presidents and the number of months each spent in office.* Although this data set has three columns, we can treat it as univariate because we are interested only in the times spent in office—the names don't matter to us (at this point). What can we say about the typical tenure?
1 Washington 94
2 Adams 48
3 Jefferson 96
4 Madison 96
5 Monroe 96
6 Adams 48
7 Jackson 96
8 Van Buren 48
9 Harrison 1
10 Tyler 47
11 Polk 48
12 Taylor 16
13 Fillmore 32
14 Pierce 48
15 Buchanan 48
16 Lincoln 49
17 Johnson 47
18 Grant 96
19 Hayes 48
20 Garfield 7
21 Arthur 41
22 Cleveland 48
23 Harrison 48
24 Cleveland 48
25 McKinley 54
26 Roosevelt 90
27 Taft 48
28 Wilson 96
29 Harding 29
30 Coolidge 67
31 Hoover 48
32 Roosevelt 146
33 Truman 92
34 Eisenhower 96
35 Kennedy 34
36 Johnson 62
37 Nixon 67
38 Ford 29
39 Carter 48
40 Reagan 96
41 Bush 48
42 Clinton 96
43 Bush 96

*The inspiration for this example comes from a paper by Robert W. Hayden in the Journal of Statistics Education. The full text is available at http://www.amstat.org/publications/jse/v13n1/datasets.hayden.html.

This is not a large data set (just over 40 records), but it is a little too big to take in as a whole. A very simple way to gain an initial sense of the data set is to create a dot plot. In a dot plot, we plot all points on a single (typically horizontal) line, letting the value of each data point determine the position along the horizontal axis. (See the top part of Figure 2-1.)

A dot plot can be perfectly sufficient for a small data set such as this one. However, in our case it is slightly misleading because, whenever a certain tenure occurs more than once in the data set, the corresponding data points fall right on top of each other, which makes it impossible to distinguish them. This is a frequent problem, especially if the data assumes only integer values or is otherwise "coarse-grained." A common remedy is to shift each point by a small random amount from its original position; this technique is called jittering, and the resulting plot is a jitter plot. A jitter plot of this data set is shown in the bottom part of Figure 2-1. What does the jitter plot tell us about the data set?
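Jittering takes only a line or two of code. The sketch below (a minimal illustration with NumPy; the jitter amplitude of 0.05 is an arbitrary choice) shows the essential step—the data values themselves are never altered, only the plotting positions on the perpendicular axis:

```python
import numpy as np

# Months in office for the first dozen presidents (from the table above)
months = np.array([94, 48, 96, 96, 96, 48, 96, 48, 1, 47, 48, 16])

rng = np.random.default_rng(1)

# For a dot plot, every point sits on the line y = 0. For a jitter plot,
# we displace each point vertically by a small random amount, so that
# coinciding values no longer hide each other. The horizontal positions
# (the actual data) are left untouched.
y = rng.uniform(-0.05, 0.05, size=months.size)

# The pairs (months[i], y[i]) can now be handed to any scatter-plot
# routine, e.g. matplotlib's plt.scatter(months, y), using open circles.
```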
We see two values where data points seem to cluster, indicating that these values occur more frequently than others. Not surprisingly, they are located at 48 and 96 months, which correspond to one and two full four-year terms in office. What may be a little surprising, however, is the relatively large number of points that occur outside these clusters. Apparently, quite a few presidents left office at irregular intervals! Even in this simple example, a plot reveals both something expected (the clusters at 48 and 96 months) and the unexpected (the larger number of points outside those clusters).

Before moving on to our second example, let me point out a few additional technical details regarding jitter plots.

• It is important that the amount of "jitter" be small compared to the distance between points. The only purpose of the random displacements is to ensure that no two points fall exactly on top of one another. We must make sure that points are not shifted significantly from their true location.

[FIGURE 2-1. Dot and jitter plots showing the number of months U.S. presidents spent in office.]

• We can jitter points in either the horizontal or the vertical direction (or both), depending on the data set and the purpose of the graph. In Figure 2-1, points were jittered only in the vertical direction, so that their horizontal position (which in this case corresponds to the actual data—namely, the number of months in office) is not altered and therefore remains exact.

• I used open, transparent rings as symbols for the data points. This is no accident: among different symbols of equal size, open rings are most easily recognized as separate even when partially occluded by each other.
In contrast, filled symbols tend to hide any substructure when they overlap, and symbols made from straight lines (e.g., boxes and crosses) can be confusing because of the large number of parallel lines; see the top part of Figure 2-1.

Jittering is a good trick that can be used in many different contexts. We will see further examples later in the book.

Histograms and Kernel Density Estimates

Dot and jitter plots are nice because they are so simple. However, they are neither pretty nor very intuitive, and most importantly, they make it hard to read off quantitative information from the graph. In particular, if we are dealing with larger data sets, then we need a better type of graph, such as a histogram.

[FIGURE 2-2. A histogram of a server's response times.]

Histograms

To form a histogram, we divide the range of values into a set of "bins" and then count the number of points (sometimes called "events") that fall into each bin. We then plot the count of events for each bin as a function of the position of the bin.

Once again, let's look at an example. Here is the beginning of a file containing response times (in milliseconds) for queries against a web server or database. In contrast to the previous example, this data set is fairly large, containing 1,000 data points.

452.42
318.58
144.82
129.13
1216.45
991.56
1476.69
662.73
1302.85
1278.55
627.65
1030.78
215.23
44.50
...

Figure 2-2 shows a histogram of this data set. I divided the horizontal axis into 60 bins of 50 milliseconds width and then counted the number of events in each bin. What does the histogram tell us? We observe a rather sharp cutoff at a nonzero value on the left, which means that there is a minimum completion time below which no request can be completed.
Then there is a sharp rise to a maximum at the "typical" response time, and finally there is a relatively large tail on the right, corresponding to the smaller number of requests that take a long time to process. This kind of shape is rather typical for a histogram of task completion times. If the data set had contained completion times for students to finish their homework or for manufacturing workers to finish a work product, then it would look qualitatively similar except, of course, that the time scale would be different. Basically, there is some minimum time that nobody can beat, a small group of very fast champions, a large majority, and finally a longer or shorter tail of "stragglers."

It is important to realize that a data set does not determine a histogram uniquely. Instead, we have to fix two parameters to form a histogram: the bin width and the alignment of the bins.

The quality of any histogram hinges on the proper choice of bin width. If you make the width too large, then you lose too much detailed information about the data set. Make it too small and you will have few or no events in most of the bins, and the shape of the distribution does not become apparent. Unfortunately, there is no simple rule of thumb that can predict a good bin width for a given data set; typically you have to try out several different values for the bin width until you obtain a satisfactory result. (As a first guess, you can start with Scott's rule for the bin width, \( w = 3.5\sigma / \sqrt[3]{n} \), where \( \sigma \) is the standard deviation for the entire data set and n is the number of points. This rule assumes that the data follows a Gaussian distribution; otherwise, it is likely to give a bin width that is too wide. See the end of this chapter for more information on the standard deviation.)

The other parameter that we need to fix (whether we realize it or not) is the alignment of the bins on the x axis. Let's say we fixed the width of the bins at 1. Where do we now place the first bin?
We could put it flush left, so that its left edge is at 0, or we could center it at 0. In fact, we can move all bins by half a bin width in either direction. Unfortunately, this seemingly insignificant (and often overlooked) parameter can have a large influence on the appearance of the histogram. Consider this small data set:

1.4 1.7 1.8 1.9 2.1 2.2 2.3 2.6

Figure 2-3 shows two histograms of this data set. Both use the same bin width (namely, 1) but have different alignment of the bins. In the top panel, where the bin edges have been aligned to coincide with the whole numbers (1, 2, 3, ...), the data set appears to be flat. Yet in the bottom panel, where the bins have been centered on the whole numbers, the data set appears to have a rather strong central peak and symmetric wings on both sides. It should be clear that we can construct even more pathological examples than this. In the next section we shall introduce an alternative to histograms that avoids this particular problem.

[FIGURE 2-3. Histograms can look quite different, depending on the choice of anchoring point for the first bin. The figure shows two histograms of the same data set, using the same bin width. In the top panel, the bin edges are aligned on whole numbers; in the bottom panel, bins are centered on whole numbers.]

Before moving on, I'd like to point out some additional technical details and variants of histograms.

• Histograms can be either normalized or unnormalized. In an unnormalized histogram, the value plotted for each bin is the absolute count of events in that bin. In a normalized histogram, we divide each count by the total number of points in the data set, so that the value for each bin becomes the fraction of points in that bin. If we want the percentage of points per bin instead, we simply multiply the fraction by 100.
• So far I have assumed that all bins have the same width. We can relax this constraint and allow bins of differing widths—narrower where points are tightly clustered but wider in areas where there are only few points. This method can seem very appealing when the data set has outliers or areas with widely differing point density. Be warned, though, that now there is an additional source of ambiguity for your histogram: should you display the absolute number of points per bin, regardless of the width of each bin, or should you display the density of points per bin by normalizing the point count per bin by the bin width? Either method is valid, and you cannot assume that your audience will know which convention you are following.

• It is customary to show histograms with rectangular boxes that extend from the horizontal axis, the way I have drawn Figures 2-2 and 2-3. That is perfectly all right and has the advantage of explicitly displaying the bin width as well. (Of course, the boxes should be drawn in such a way that they align in the same way that the actual bins align; see Figure 2-3.) This works well if you are only displaying a histogram for a single data set. But if you want to compare two or more data sets, then the boxes start to get in the way, and you are better off drawing "frequency polygons": eliminate the boxes, and instead draw a symbol where the top of the box would have been. (The horizontal position of the symbol should be at the center of the bin.) Then connect consecutive symbols with straight lines. Now you can draw multiple data sets in the same plot without cluttering the graph or unnecessarily occluding points.

• Don't assume that the defaults of your graphics program will generate the best representation of a histogram!
I have already discussed why I consider frequency polygons to be almost always a better choice than histograms constructed from boxes. If you nevertheless choose to use boxes, it is best to avoid filling them (with a color or hatch pattern)—your histogram will probably look cleaner and be easier to read if you stick with just the box outlines. Finally, if you want to compare several data sets in the same graph, always use a frequency polygon, and stay away from stacked or clustered bar graphs, since these are particularly hard to read. (We will return to the problem of displaying composition problems in Chapter 5.)

Histograms are very common and have a nice, intuitive interpretation. They are also easy to generate: for a moderately sized data set, it can even be done by hand, if necessary. That being said, histograms have some serious problems. The most important ones are as follows.

• The binning process required by all histograms loses information (by replacing the location of individual data points with a bin of finite width). If we only have a few data points, we can ill afford to lose any information.

• Histograms are not unique. As we saw in Figure 2-3, the appearance of a histogram can be quite different. (This nonuniqueness is a direct consequence of the information loss described in the previous item.)

• On a more superficial level, histograms are ragged and not smooth. This matters little if we just want to draw a picture of them, but if we want to feed them back into a computer as input for further calculations, then a smooth curve would be easier to handle.

• Histograms do not handle outliers gracefully. A single outlier, far removed from the majority of the points, requires many empty cells in between or forces us to use bins that are too wide for the majority of points. It is the possibility of outliers that makes it difficult to find an acceptable bin width in an automated fashion.
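The alignment effect of Figure 2-3 is easy to reproduce with NumPy's histogram function, using the eight-point data set from above. With the same bin width of 1, shifting the bin edges by half a width turns a flat histogram into a sharply peaked one:

```python
import numpy as np

data = [1.4, 1.7, 1.8, 1.9, 2.1, 2.2, 2.3, 2.6]

# Bin edges aligned on whole numbers: the data set appears flat.
counts_aligned, _ = np.histogram(data, bins=[0, 1, 2, 3, 4])
print(counts_aligned)     # [0 4 4 0]

# Bins centered on whole numbers: a strong central peak appears.
counts_centered, _ = np.histogram(data, bins=[0.5, 1.5, 2.5, 3.5])
print(counts_centered)    # [1 6 1]
```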
[FIGURE 2-4. Histogram and kernel density estimate of the distribution of the time U.S. presidents have spent in office.]

Fortunately, there is an alternative to classical histograms that has none of these problems. It is called a kernel density estimate.

Kernel Density Estimates

Kernel density estimates (KDEs) are a relatively new technique. In contrast to histograms, and to many other classical methods of data analysis, they pretty much require the calculational power of a reasonably modern computer to be effective. They cannot be done "by hand" with paper and pencil, even for rather moderately sized data sets. (It is interesting to see how the accessibility of computational and graphing power enables new ways to think about data!)

To form a KDE, we place a kernel—that is, a smooth, strongly peaked function—at the position of each data point. We then add up the contributions from all kernels to obtain a smooth curve, which we can evaluate at any point along the x axis.

Figure 2-4 shows an example. This is yet another representation of the data set we have seen before in Figure 2-1. The dotted boxes are a histogram of the data set (with bin width equal to 1), and the solid curves are two KDEs of the same data set with different bandwidths (I'll explain this concept in a moment). The shape of the individual kernel functions can be seen clearly—for example, by considering the three data points below 20. You can also see how the final curve is composed out of the individual kernels, in particular when you look at the points between 30 and 40.

[FIGURE 2-5. Graphs of some frequently used kernel functions.]
We can use any smooth, strongly peaked function as a kernel provided that it integrates to 1; in other words, the area under the curve formed by a single kernel must be 1. (This is necessary to make sure that the resulting KDE is properly normalized.) Some examples of frequently used kernel functions include (see Figure 2-5):

\[ K(x) = \begin{cases} \tfrac{1}{2} & \text{if } |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad \text{box or boxcar kernel} \]

\[ K(x) = \begin{cases} \tfrac{3}{4}\left(1 - x^2\right) & \text{if } |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad \text{Epanechnikov kernel} \]

\[ K(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\tfrac{1}{2} x^2 \right) \qquad \text{Gaussian kernel} \]

The box kernel and the Epanechnikov kernel are zero outside a finite range, whereas the Gaussian kernel is nonzero everywhere but negligibly small outside a limited domain. It turns out that the curve resulting from the KDE does not depend strongly on the particular choice of kernel function, so we are free to use the kernel that is most convenient. Because it is so easy to work with, the Gaussian kernel is the most widely used. (See Appendix B for more information on the Gaussian function.)

Constructing a KDE requires two things: first, we must move the kernel to the position of each point by shifting it appropriately. For example, the function K(x − xi) will have its peak at xi, not at 0. Second, we have to choose the kernel bandwidth, which controls the spread of the kernel function. To make sure that the area under the curve stays the same as we shrink the width, we have to make the curve higher (and lower if we increase the width).

[FIGURE 2-6. The Gaussian kernel for three different bandwidths. The height of the kernel increases as the width decreases, so the total area under the curve remains constant.]

The final expression for the shifted, rescaled kernel function of bandwidth h is:

\[ \frac{1}{h} K\!\left( \frac{x - x_i}{h} \right) \]

This function has a peak at xi, its width is approximately h, and its height is such that the area under the curve is still 1.
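The construction just described—shift a kernel to each data point, rescale it by the bandwidth, and sum the contributions—can be sketched in a few lines of NumPy. This is a naive implementation for illustration only, not an optimized one:

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel: integrates to 1 over the whole real line."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h):
    """Sum of shifted, rescaled kernels (1/h) K((x - xi)/h) over all points xi."""
    x = np.asarray(x, dtype=float)
    return sum(gaussian_kernel((x - xi) / h) / h for xi in data)

# Evaluate the estimate on a grid of x values, e.g. for plotting:
grid = np.linspace(-5, 10, 601)
curve = kde(grid, data=[0.0, 1.0, 4.0], h=0.5)
```

As written, each kernel contributes area 1, so the total area under the curve equals the number of data points; divide the result by len(data) if you want an estimate normalized to area 1.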
Figure 2-6 shows some examples, using the Gaussian kernel. Keep in mind that the area under all three curves is the same. Using this expression, we can now write down a formula for the KDE with bandwidth h for any data set {x1, x2, ..., xn}. This formula can be evaluated for any point x along the x axis:

\[ D_h(x; \{x_i\}) = \sum_{i=1}^{n} \frac{1}{h} K\!\left( \frac{x - x_i}{h} \right) \]

All of this is straightforward and easy to implement in any computer language. Be aware that for large data sets (those with many thousands of points), the required number of kernel evaluations can lead to performance issues, especially if the function D(x) needs to be evaluated for many different positions (i.e., many different values of x). If this becomes a problem for you, you may want to choose a simpler kernel function or not evaluate a kernel if the distance x − xi is significantly greater than the bandwidth h.*

*Yet another strategy starts with the realization that forming a KDE amounts to a convolution of the kernel function with the data set. You can now take the Fourier transform of both kernel and data set and make use of the Fourier convolution theorem. This approach is suitable for very large data sets but is outside the scope of our discussion.

Now we can explain the wide gray line in Figure 2-4: it is a KDE with a larger bandwidth. Using such a large bandwidth makes it impossible to resolve the individual data points, but it does highlight entire periods of greater or smaller frequency. Which choice of bandwidth is right for you depends on your purpose.

A KDE constructed as just described is similar to a classical histogram, but it avoids two of the aforementioned problems. Given data set and bandwidth, a KDE is unique; a KDE is also smooth, provided we have chosen a smooth kernel function, such as the Gaussian.

Optional: Optimal Bandwidth Selection

We still have to fix the bandwidth.
This is a different kind of problem than the other two: it's not just a technical problem, which could be resolved through a better method; instead, it's a fundamental problem that relates to the data set itself. If the data follows a smooth distribution, then a wider bandwidth is appropriate, but if the data follows a very wiggly distribution, then we need a smaller bandwidth to retain all relevant detail. In other words, the optimal bandwidth is a property of the data set and tells us something about the nature of the data.

So how do we choose an optimal value for the bandwidth? Intuitively, the problem is clear: we want the bandwidth to be narrow enough to retain all relevant detail but wide enough so that the resulting curve is not too "wiggly." This is a problem that arises in every approximation problem: balancing the faithfulness of representation against the simplicity of behavior. Statisticians speak of the "bias–variance trade-off."

To make matters concrete, we have to define a specific expression for the error of our approximation, one that takes into account both bias and variance. We can then choose a value for the bandwidth that minimizes this error. For KDEs, the generally accepted measure is the "expected mean-square error" between the approximation and the true density. The problem is that we don't know the true density function that we are trying to approximate, so it seems impossible to calculate (and minimize) the error in this way. But clever methods have been developed to make progress.

These methods fall broadly into two categories. First, we could try to find explicit expressions for both bias and variance. Balancing them leads to an equation that has to be solved numerically or—if we make additional assumptions (e.g., that the distribution is Gaussian)—can even yield explicit expressions similar to Scott's rule (introduced earlier when talking about histograms).
Alternatively, we could realize that the KDE is an approximation for the probability density from which the original set of points was chosen. We can therefore choose points from this approximation (i.e., from the probability density represented by the KDE) and see how well they replicate the KDE that we started with. Now we change the bandwidth until we find that value for which the KDE is best replicated: the result is the estimate of the "true" bandwidth of the data. (This latter method is known as cross-validation.)

Although not particularly hard, the details of both methods would lead us too far afield, and so I will skip them here. If you are interested, you will have no problem picking up the details from one of the references at the end of this chapter. Keep in mind, however, that these methods find the optimal bandwidth with respect to the mean-square error, which tends to overemphasize bias over variance; as a result, they lead to rather narrow bandwidths and KDEs that appear too wiggly. If you are using KDEs to generate graphs for the purpose of obtaining intuitive visualizations of point distributions, then you might be better off with a bit of manual trial and error combined with visual inspection. In the end, there is no "right" answer, only the most suitable one for a given purpose. Also, the most suitable to develop intuitive understanding might not be the one that minimizes a particular mathematical quantity.

The Cumulative Distribution Function

The main advantage of histograms and kernel density estimates is that they have an immediate intuitive appeal: they tell us how probable it is to find a data point with a certain value. For example, from Figure 2-2 it is immediately clear that values around 250 milliseconds are very likely to occur, whereas values greater than 2,000 milliseconds are quite rare. But how rare, exactly?
That is a question that is much harder to answer by looking at the histogram in Figure 2-2. Besides wanting to know how much weight is in the tail, we might also be interested to know what fraction of requests completes in the typical band between 150 and 350 milliseconds. It's certainly the majority of events, but if we want to know exactly how many, then we need to sum up the contributions from all bins in that region.

The cumulative distribution function (CDF) does just that. The CDF at point x tells us what fraction of events has occurred "to the left" of x. In other words, the CDF is the fraction of all points xi with xi ≤ x.

Figure 2-7 shows the same data set that we have already encountered in Figure 2-2, but here the data is represented by a KDE (with bandwidth h = 30) instead of a histogram. In addition, the figure also includes the corresponding CDF. (Both KDE and CDF are normalized to 1.)

We can read off several interesting observations directly from the plot of the CDF. For instance, we can see that at t = 1,500 (which certainly puts us into the tail of the distribution) the CDF is still smaller than 0.85; this means that fully 15 percent of all requests take longer than 1,500 milliseconds. In contrast, less than a third of all requests are completed in the "typical" range of 150–500 milliseconds. (How do we know this? The CDF for t = 150 is about 0.05 and is close to 0.40 for t = 500. In other words, about 40 percent of all requests are completed in less than 500 milliseconds; of these, 5 percent are completed in less than 150 milliseconds. Hence about 35 percent of all requests have response times of between 150 and 500 milliseconds.)

[FIGURE 2-7. Kernel density estimate and cumulative distribution function of the server response times shown in Figure 2-2.]
It is worth pausing to contemplate these findings, because they demonstrate how misleading a histogram (or KDE) can be despite (or because of) their intuitive appeal! Judging from the histogram or KDE alone, it seems quite reasonable to assume that “most” of the events occur within the major peak near t = 300 and that the tail for t > 1,500 contributes relatively little. Yet the CDF tells us clearly that this is not so. (The problem is that the eye is much better at judging distances than areas; we are therefore misled by the large values of the histogram near its peak and fail to see that the area beneath the peak is nevertheless not that large compared to the total area under the curve.)

CDFs are probably the least well-known and most underappreciated tool in basic graphical analysis. They have less immediate intuitive appeal than histograms or KDEs, but they allow us to make the kind of quantitative statement that is very often required but is difficult (if not impossible) to obtain from a histogram.

Cumulative distribution functions have a number of important properties that follow directly from how they are calculated.

• Because the value of the CDF at position x is the fraction of points to the left of x, a CDF is always monotonically increasing with x.

• CDFs are less wiggly than a histogram (or KDE) but contain the same information in a representation that is inherently less noisy.

• Because CDFs do not involve any binning, they do not lose information and are therefore a more faithful representation of the data than a histogram.

• All CDFs approach 0 as x goes to negative infinity. CDFs are usually normalized so that they approach 1 (or 100 percent) as x goes to positive infinity.

• A CDF is unique for a given data set.
If you are mathematically inclined, you have probably already realized that the CDF is (an approximation to) the antiderivative of the histogram and that the histogram is the derivative of the CDF:

    cdf(x) ≈ ∫_{−∞}^{x} histo(t) dt

    histo(x) ≈ (d/dx) cdf(x)

Cumulative distribution functions have several uses. First, and most importantly, they enable us to answer questions such as those posed earlier in this section: what fraction of points falls between any two values? The answer can simply be read off from the graph. Second, CDFs also help us understand how imbalanced a distribution is—in other words, what fraction of the overall weight is carried by the tails.

Cumulative distribution functions also prove useful when we want to compare two distributions. It is notoriously difficult to compare two bell-shaped curves in a histogram against each other. Comparing the corresponding CDFs is usually much more conclusive.

One last remark, before leaving this section: in the literature, you may find the term quantile plot. A quantile plot is just the plot of a CDF in which the x and y axes have been switched. Figure 2-8 shows an example using once again the server response time data set. Plotted this way, we can easily answer questions such as, “What response time corresponds to the 10th percentile of response times?” But the information contained in this graph is of course exactly the same as in a graph of the CDF.

Optional: Comparing Distributions with Probability Plots and QQ Plots

Occasionally you might want to confirm that a given set of points is distributed according to some specific, known distribution. For example, you have a data set and would like to determine whether it can be described well by a Gaussian (or some other) distribution. You could compare a histogram or KDE of the data set directly against the theoretical density function, but it is notoriously difficult to compare distributions that way—especially out in the tails.
A better idea would be to compare the cumulative distribution functions, which are easier to handle because they are less wiggly and are always monotonically increasing. But this is still not easy. Also keep in mind that most probability distributions depend on location and scale parameters (such as mean and variance), which you would have to estimate before being able to make a meaningful comparison. Isn’t there a way to compare a set of points directly against a theoretical distribution and, in the process, read off the estimates for all the parameters required?

FIGURE 2-8. Quantile plot of the server data. A quantile plot is a graph of the CDF with the x and y axes interchanged. Compare to Figure 2-7.

FIGURE 2-9. Jitter plot, histogram, and cumulative distribution function for a Gaussian data set.

As it turns out, there is. The method is technically easy to do, but the underlying logic is a bit convoluted and tends to trip up even experienced practitioners. Here is how it works. Consider a set of points {xi} that we suspect are distributed according to the Gaussian distribution. In other words, we expect the cumulative distribution function of the set of points, yi = cdf(xi), to be the Gaussian cumulative distribution function Φ((x − μ)/σ) with mean μ and standard deviation σ:

    yi = Φ((xi − μ)/σ)    only if the data is Gaussian

FIGURE 2-10. Probability plot for the data set shown in Figure 2-9.

Here, yi is the value of the cumulative distribution function corresponding to the data point xi; in other words, yi is the quantile of the point xi. Now comes the trick.
We apply the inverse of the Gaussian distribution function to both sides of the equation:

    Φ⁻¹(yi) = (xi − μ)/σ

With a little bit of algebra, this becomes

    xi = μ + σ Φ⁻¹(yi)

In other words, if we plot the values in the data set as a function of Φ⁻¹(yi), then they should fall onto a straight line with slope σ and intercept μ. If, on the other hand, the points do not fall onto a straight line after applying the inverse transform, then we can conclude that the data is not distributed according to a Gaussian distribution.

The resulting plot is known as a probability plot. Because it is easy to spot deviation from a straight line, a probability plot provides a relatively sensitive test to determine whether a set of points behaves according to the Gaussian distribution. As an added benefit, we can read off estimates for the mean and the standard deviation directly from the graph: μ is the intercept of the curve with the y axis, and σ is given by the slope of the curve. (Figure 2-10 shows the probability plot for the Gaussian data set displayed in Figure 2-9.)

One important question concerns the units that we plot along the axes. For the vertical axis the case is clear: we use whatever units the original data was measured in. But what about the horizontal axis? We plot the data as a function of Φ⁻¹(yi), which is the inverse of the Gaussian distribution function, applied to the percentile yi for each point xi. We can therefore choose between two different ways to dissect the horizontal axis: either using the percentiles yi directly (in which case the tick marks will not be distributed uniformly), or dividing the horizontal axis uniformly. In the latter case we are using the width of the standard Gaussian distribution as a unit. You can convince yourself that this is really true by realizing that Φ⁻¹(y) is the inverse of the Gaussian distribution function Φ(x).
Now ask yourself: what units is x measured in? We use the same units for the horizontal axis of a Gaussian probability plot. These units are sometimes called probits. (Figure 2-10 shows both sets of units.) Beware of confused and confusing explanations of this point elsewhere in the literature.

There is one more technical detail that we need to discuss: to produce a probability plot, we need not only the data itself, but for each point xi we also need its quantile yi (we will discuss quantiles and percentiles in more detail later in this chapter). The simplest way to obtain the quantiles, given the data, is as follows:

1. Sort the data points in ascending order.

2. Assign to each data point its rank (basically, its line number in the sorted file), starting at 1 (not at 0).

3. The quantile yi now is the rank divided by n + 1, where n is the number of data points.

This prescription guarantees that each data point is assigned a quantile that is strictly greater than 0 and strictly less than 1. This is important because Φ⁻¹(x) is defined only for 0 < x < 1. This prescription is easy to understand and easy to remember, but you may find other, slightly more complicated prescriptions elsewhere. For all practical purposes, the differences are going to be small.

Finally, let’s look at an example where the data is clearly not Gaussian. Figure 2-11 shows the server data from Figure 2-2 plotted in a probability plot. The points don’t fall on a straight line at all—which is no surprise since we already knew from Figure 2-2 that the data is not Gaussian. But for cases that are less clear-cut, the probability plot can be a helpful tool for detecting deviations from Gaussian behavior.

A few additional comments are in order here.

• Nothing in the previous discussion requires that the distribution be Gaussian! You can use almost any other commonly used distribution function (and its inverse) to generate the respective probability plots.
In particular, many of the commonly used probability distributions depend on location and scale parameters in exactly the same way as the Gaussian distribution, so all the arguments discussed earlier go through as before.

• So far, I have always assumed that we want to compare an empirical data set against a theoretical distribution. But there may also be situations where we want to compare two empirical data sets against each other—for example, to find out whether they were drawn from the same family of distributions (without having to specify the family explicitly). The process is easiest to understand when both data sets we want to compare contain the same number of points. You sort both sets and then align the points from both data sets that have the same rank (once sorted). Now plot the resulting pairs of points in a regular scatter plot (see Chapter 3); the resulting graph is known as a QQ plot. (If the two data sets do not contain the same number of points, you will have to interpolate or truncate them so that they do.)

FIGURE 2-11. A probability plot of the server response times from Figure 2-2. The data does not follow a Gaussian distribution and thus the points do not fall on a straight line.

Probability plots are a relatively advanced, specialized technique, and you should evaluate whether you really need them. Their purpose is to determine whether a given data set stems from a specific, known distribution. Occasionally, this is of interest in itself; in other situations subsequent analysis depends on proper identification of the underlying model. For example, many statistical techniques assume that the errors or residuals are Gaussian and are not applicable if this condition is violated. Probability plots are a convenient technique for testing this assumption.
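The whole construction condenses into a short script. The following is an illustrative sketch (not the book’s code): it draws a synthetic Gaussian sample with made-up parameters, computes quantiles via the rank/(n + 1) prescription given earlier, applies Φ⁻¹ (available in the Python standard library as NormalDist().inv_cdf), and recovers μ and σ from a straight-line fit.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
mu, sigma = 1.0, 0.5                       # hypothetical "true" parameters
data = rng.normal(mu, sigma, size=2000)

# 1. Sort the data points in ascending order.
xs = np.sort(data)
n = len(xs)

# 2./3. Ranks start at 1; quantile = rank / (n + 1),
#       strictly between 0 and 1 as required by the inverse CDF.
quantiles = np.arange(1, n + 1) / (n + 1)

# Probits: the inverse Gaussian CDF applied to the quantiles.
probits = np.array([NormalDist().inv_cdf(q) for q in quantiles])

# If the data is Gaussian, (probits, xs) fall on a straight line
# with slope sigma and intercept mu.
slope, intercept = np.polyfit(probits, xs, deg=1)
```

For non-Gaussian data the points bend visibly away from the fitted line, which is exactly what Figure 2-11 shows for the server response times.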
Rank-Order Plots and Lift Charts

There is a technique related to histograms and CDFs that is worth knowing about. Consider the following scenario. A company that is selling textbooks and other curriculum materials is planning an email marketing campaign to reach out to its existing customers. For this campaign, the company wants to use personalized email messages that are tailored to the job title of each recipient (so that teachers will receive a different email than their principals). The problem is that the customer database contains about 250,000 individual customer records with over 16,000 different job titles among them! Now what?

The trick is to sort the job titles by the number of individual customer records corresponding to each job title. The first few records are shown in Table 2-1. The four columns give the job title, the number of customers for that job title, the fraction of all customers having that job title, and finally the cumulative fraction of customers. For the last column, we sum up the number of customers for the current and all previously seen job titles, then divide by the total number of customer records. This is the equivalent of the CDF we discussed earlier.

We can see immediately that just 10 different job titles account for fully two thirds of all customers. Using just the top 30 job titles gives us 75 percent coverage of customer records. That’s much more manageable than the 16,000 job titles we started with!

Let’s step back for a moment to understand how this example is different from those we have seen previously. What is important to notice here is that the independent variable has no intrinsic ordering. What does this mean? For the web-server example, we counted the number of events for each response time; hence the count of events per bin was the dependent variable, and it was determined by the independent variable—namely, the response time.
In that case, the independent variable had an inherent ordering: 100 milliseconds are always less than 400 milliseconds (and so on). But in the case of counting customer records that match a certain job title, the independent variable (the job title) has no corresponding ordering relation. It may appear otherwise, since we can sort the job titles alphabetically, but realize that this ordering is entirely arbitrary! There is nothing “fundamental” about it. If we choose a different character encoding or locale, the order will change. Contrast this with the ordering relationship on numbers—there are no two ways about it: 1 is always less than 2.

In cases like this, where the independent variable does not have an intrinsic ordering, it is often a good idea to sort entries by the dependent variable. That’s what we did in the example: rather than defining some (arbitrary) sort order on the job titles, we sorted by the number of records (i.e., by the dependent variable). Once the records have been sorted in this way, we can form a histogram and a CDF as before.

TABLE 2-1. The first 30 job titles and their relative frequencies.
    Title                        Number of    Fraction of    Cumulative
                                 customers    customers      fraction
    Teacher                          66,470       0.34047        0.340
    Principal                        22,958       0.11759        0.458
    Superintendent                   12,521       0.06413        0.522
    Director                         12,202       0.06250        0.584
    Secretary                         4,427       0.02267        0.607
    Coordinator                       3,201       0.01639        0.623
    Vice Principal                    2,771       0.01419        0.637
    Program Director                  1,926       0.00986        0.647
    Program Coordinator               1,718       0.00880        0.656
    Student                           1,596       0.00817        0.664
    Consultant                        1,440       0.00737        0.672
    Administrator                     1,169       0.00598        0.678
    President                         1,114       0.00570        0.683
    Program Manager                   1,063       0.00544        0.689
    Supervisor                        1,009       0.00516        0.694
    Professor                           961       0.00492        0.699
    Librarian                           940       0.00481        0.704
    Project Coordinator                 880       0.00450        0.708
    Project Director                    866       0.00443        0.713
    Office Manager                      839       0.00429        0.717
    Assistant Director                  773       0.00395        0.721
    Administrative Assistant            724       0.00370        0.725
    Bookkeeper                          697       0.00357        0.728
    Intern                              693       0.00354        0.732
    Program Supervisor                  602       0.00308        0.735
    Lead Teacher                        587       0.00300        0.738
    Instructor                          580       0.00297        0.741
    Head Teacher                        572       0.00292        0.744
    Program Assistant                   572       0.00292        0.747
    Assistant Teacher                   546       0.00279        0.749

This trick of sorting by the dependent variable is useful whenever the independent variable does not have a meaningful ordering relation; it is not limited to situations where we count events per bin. Figures 2-12 and 2-13 show two typical examples.

Figure 2-12 shows the sales by a certain company to different countries. Not only the sales to each country but also the cumulative sales are shown, which allows us to assess the importance of the remaining “tail” of the distribution of sales. In this example, I chose to plot the independent variable along the vertical axis. This is often a good idea when the values are strings, since they are easier to read this way. (If you plot them along the horizontal axis, it is often necessary to rotate the strings by 90 degrees to make them fit, which makes them hard to read.)

Figure 2-13 displays what in quality engineering is known as a Pareto chart.
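Building a table like Table 2-1 takes only a few lines of code. Here is a sketch using made-up job-title counts (small stand-ins, not the book’s 250,000-record data set):

```python
from collections import Counter

# Hypothetical customer records (job titles only; counts are invented)
records = (["Teacher"] * 664 + ["Principal"] * 230 + ["Superintendent"] * 125
           + ["Director"] * 122 + ["Secretary"] * 44 + ["Coordinator"] * 32)

counts = Counter(records)
total = sum(counts.values())

# Sort by the dependent variable (the count) and accumulate fractions
table = []
cumulative = 0.0
for title, n in counts.most_common():
    cumulative += n / total
    table.append((title, n, n / total, cumulative))

for title, n, frac, cum in table:
    print("%-15s %6d  %7.5f  %6.3f" % (title, n, frac, cum))
```

The last column plays the role of the CDF: the row at which it crosses, say, 0.75 tells you how many job titles are needed for 75 percent coverage.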
In quality engineering and process improvement, the goal is to reduce the number of defects in a certain product or process. You collect all known causes of defects and observe how often each one occurs. The results can be summarized conveniently in a chart like the one in Figure 2-13. Note that the causes of defects are sorted by their frequency of occurrence. From this chart we can see immediately that problems with the engine and the electrical system are much more common than problems with the air conditioning, the brakes, or the transmission. In fact, by looking at the cumulative error curve, we can tell that fixing just the first two problem areas would reduce the overall defect rate by 80 percent.

FIGURE 2-12. A rank-order plot of sales per country. The independent variable has been plotted along the vertical axis to make the text labels easier to read.

FIGURE 2-13. The Pareto chart is another example of a rank-order plot.

Two more bits of terminology: the term “Pareto chart” is not used widely outside the specific engineering disciplines mentioned in the previous paragraph. I personally prefer the expression rank-order chart for any plot generated by first sorting all entries by the dependent variable (i.e., by the rank of the entry).
The cumulative distribution curve is occasionally referred to as a lift curve, because it tells us how much “lift” we get from each entry or range of entries.

Only When Appropriate: Summary Statistics and Box Plots

You may have noticed that so far I have not spoken at all about such simple topics as mean and median, standard deviation, and percentiles. That is quite intentional. These summary statistics apply only under certain assumptions and are misleading, if not downright wrong, if those assumptions are not fulfilled. I know that these quantities are easy to understand and easy to calculate, but if there is one message I would like you to take away from this book it is this: the fact that something is convenient and popular is no reason to follow suit. For any method that you want to use, make sure you understand the underlying assumptions and always check that they are fulfilled for the specific application you have in mind!

Mean, median, and related summary statistics apply only to distributions that have a single, central peak—that is, to unimodal distributions. If this basic assumption is not fulfilled, then conclusions based on simple summary statistics will be wrong. Even worse, nothing will tip you off that they are wrong: the numbers will look quite reasonable. (We will see an example of this problem shortly.)

Summary Statistics

If a distribution has only a single peak, then it makes sense to ask about the properties of that peak: where is it located, and what is its width? We may also want to know whether the distribution is symmetric and whether any outliers are present. Mean and standard deviation are two popular measures for location and spread. The mean or average is both familiar and intuitive:

    m = (1/n) Σᵢ xᵢ

The standard deviation measures how far points spread “on average” from the mean: we take all the differences between each individual point and the mean, and then calculate the average of all these differences.
Because data points can either overshoot or undershoot the mean and we don’t want the positive and negative deviations to cancel each other, we sum the squares of the individual deviations and then take the mean of the square deviations. (The second form of the equation is very useful in practice and can be found from the first after plugging in the definition of the mean.)

    s² = (1/n) Σᵢ (xᵢ − m)² = (1/n) Σᵢ xᵢ² − m²

The quantity s² calculated in this way is known as the variance and is the more important quantity from a theoretical point of view. But as a measure of the spread of a distribution, we are better off using its square root, which is known as the standard deviation. Why take the square root? Because then both the measure for the location and the measure for the spread will have the same units, which are also the units of the actual data. (If our data set consists of the prices for a basket of goods, then the variance would be given in “square dollars,” whereas the standard deviation would be given in dollars.) For many (but certainly not all!) data sets arising in practice, one can expect about two thirds of all data points to fall within the interval [m − s, m + s] and 99 percent of all points to fall within the wider interval [m − 3s, m + 3s].

Mean and standard deviation are easy to calculate and have certain nice mathematical properties—provided the data is symmetric and does not contain crazy outliers. Unfortunately, many data sets violate at least one of these assumptions. Here is an example of the kind of trouble that one may encounter. Assume we have 10 items costing $1 each, and one item costing $20. The mean item price comes out to be $2.73, even though no item has a price anywhere near this value. The standard deviation is even worse: it comes out to $5.46, implying that most items have a price between $2.73 − $5.46 and $2.73 + $5.46.
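The arithmetic of this shopping-basket example is easy to reproduce; a quick sketch:

```python
import numpy as np

# Ten $1 items and one $20 item, as in the example above
prices = np.array([1.0] * 10 + [20.0])

m = prices.mean()        # mean price: about $2.73
s = prices.std()         # standard deviation: about $5.46

# The "expected range" [m - s, m + s] dips below zero:
low, high = m - s, m + s
```

Note that NumPy’s `std()` divides by n by default, matching the 1/n form of the variance formula given above.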
The “expected range” now includes negative prices—an obviously absurd result. Note that the data set itself is not particularly pathological: going to the grocery store and picking up a handful of candy bars and a bottle of wine will do it (pretty good wine, to be sure, but nothing outrageous).

A different set of summary statistics that is both more flexible and more robust is based on the concepts of median and quantiles or percentiles. The median is conventionally defined as the value from a data set such that half of all points in the data set are smaller and the other half greater than that value. Percentiles are the generalization of this concept to other fractions (the 10th percentile is the value such that 10 percent of all points in the data set are smaller than it, and so on). Quantiles are similar to percentiles, only that they are taken with respect to the fraction of points, not the percentage of points (in other words, the 10th percentile equals the 0.1 quantile).

Simple as it is, the percentile concept is nevertheless ambiguous, and so we need to work a little harder to make it really concrete. As an example of the problems that occur, consider the data set {1, 2, 3}. What is the median? It is not possible to break this data set into two equal parts each containing exactly half the points. The problem becomes even more uncomfortable when we are dealing with arbitrary percentile values (rather than the median only). The Internet standard laid down in RFC 2330 (“Framework for IP Performance Metrics”) gives a definition of percentiles in terms of the CDF, which is unambiguous and practical: the pth percentile is the smallest value x such that the cumulative distribution function of x is greater than or equal to p/100.

    pth percentile: smallest x for which cdf(x) ≥ p/100

This definition assumes that the CDF is normalized to 1, not to 100.
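This definition transcribes directly into code; here is a minimal sketch (with the empirical CDF normalized to 1):

```python
def percentile(data, p):
    """Smallest x in data for which cdf(x) >= p/100, per the definition above."""
    xs = sorted(data)
    n = len(xs)
    for rank, x in enumerate(xs, start=1):
        if rank / n >= p / 100.0:    # cdf(x) = fraction of points <= x
            return x
    return xs[-1]

median = percentile([1, 2, 3], 50)    # 2, since cdf(2) = 0.66... >= 0.5
```

With the same function, the inter-quartile range is simply percentile(data, 75) − percentile(data, 25).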
If it were normalized to 100, the condition would be cdf(x) ≥ p. With this definition, the median (i.e., the 50th percentile) of the data set {1, 2, 3} is 2 because cdf(1) = 0.33..., cdf(2) = 0.66..., and cdf(3) = 1.0. The median of the data set {1, 2} would be 1 because now cdf(1) = 0.5 and cdf(2) = 1.0.

The median is a measure for the location of the distribution, and we can use percentiles to construct a measure for the width of the distribution. Probably the most frequently used quantity for this purpose is the inter-quartile range (IQR), which is the distance between the 75th percentile and the 25th percentile.

When should you favor median and percentiles over mean and standard deviation? Whenever you suspect that your distribution is not symmetric or has important outliers. If a distribution is symmetric and well behaved, then mean and median will be quite close together, and there is little difference in using either. Once the distribution becomes skewed, however, the basic assumption that underlies the mean as a measure for the location of the distribution is no longer fulfilled, and so you are better off using the median. (This is why official publications usually report the median family income, not the mean; the latter would be significantly distorted by the few households with extremely high incomes.) Furthermore, the moment you have outliers, the assumptions behind the standard deviation as a measure of the width of the distribution are violated; in this case you should favor the IQR (recall our shopping basket example earlier).

If median and percentiles are so great, then why don’t we always use them? A large part of the preference for mean and variance is historical. In the days before readily available computing power, percentiles were simply not practical to calculate. Keep in mind that finding percentiles requires sorting the data set, whereas finding the mean requires only adding up all elements in any order.
The latter is an O(n) process, but the former is an O(n²) process, since humans—being nonrecursive—cannot be taught Quicksort and therefore need to resort to much less efficient sorting algorithms. A second reason is that it is much harder to prove rigorous theorems for percentiles, whereas mean and variance are mathematically very well behaved and easy to work with.

Box-and-Whisker Plots

There is an interesting graphical way to represent these quantities, together with information about potential outliers, known as a box-and-whisker plot, or box plot for short. Figure 2-15 illustrates all components of a box plot. A box plot consists of:

• A marker or symbol for the median as an indicator of the location of the distribution

• A box, spanning the inter-quartile range, as a measure of the width of the distribution

• A set of whiskers, extending from the central box to the upper and lower adjacent values, as an indicator of the tails of the distribution (where “adjacent value” is defined in the next paragraph)

• Individual symbols for all values outside the range of adjacent values, as a representation for outliers

You can see that a box plot combines a lot of information in a single graph. We have encountered almost all of these concepts before, with the exception of the upper and lower adjacent values. While the inter-quartile range is a measure for the width of the central “bulk” of the distribution, the adjacent values are one possible way to express how far its tails reach. The upper adjacent value is the largest value in the data set that is less than twice the inter-quartile range greater than the median. In other words: extend the whisker upward from the median to twice the length of the central box. Now trim the whisker down to the largest value that actually occurs in the data set; this value is the upper adjacent value.
(A similar construction holds for the lower adjacent value.) You may wonder about the reason for this peculiar construction. Why not simply extend the whiskers to, say, the 5th and 95th percentiles and be done with it? The problem with this approach is that it does not allow us to recognize true outliers! Outliers are data points that are, when compared to the width of the distribution, unusually far from the center. Such values may or may not be present. The top and bottom 5 percent, on the other hand, are always present, even for very compact distributions. To recognize outliers, we therefore cannot simply look at the most extreme values; instead, we must compare their distance from the center to the overall width of the distribution. That is what box-and-whisker plots, as described in the previous paragraph, do.

The logic behind the preceding argument is extremely important (not only in this application but more generally), so I shall reiterate the steps: first we calculated a measure for the width of the distribution; then we used this width to identify outliers as those points that are far from the center, where (and this is the crucial step) “far” is measured in units of the width of the distribution. We neither impose an arbitrary distance from the outside, nor do we simply label the most extreme x percent of the distribution as outliers—instead, we determine the width of the distribution (as the range into which points “typically” fall) and then use it to identify outliers as those points that deviate from this range. The important insight here is that the distribution itself determines a typical scale, which provides a natural unit in which to measure other properties of the distribution. This idea of using some typical property of the system to describe other parts of the system will come up again later (see Chapter 8).
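These components are straightforward to compute. The sketch below is illustrative, not the book’s code; it uses NumPy’s interpolated quartiles for the box (one of the slightly different percentile prescriptions mentioned earlier) and implements the twice-the-IQR whisker rule for the adjacent values:

```python
import numpy as np

def box_plot_stats(data):
    """Median, quartile box, adjacent values, and outliers for a box plot."""
    xs = np.sort(np.asarray(data, dtype=float))
    median = np.median(xs)
    q1, q3 = np.percentile(xs, [25, 75])
    iqr = q3 - q1
    # Whiskers reach at most twice the IQR away from the median,
    # then are trimmed to the most extreme value actually in the data.
    upper_adjacent = xs[xs <= median + 2 * iqr].max()
    lower_adjacent = xs[xs >= median - 2 * iqr].min()
    # Everything beyond the adjacent values is flagged as an outlier.
    outliers = xs[(xs > upper_adjacent) | (xs < lower_adjacent)]
    return median, (q1, q3), (lower_adjacent, upper_adjacent), outliers
```

On the hypothetical data set [1, 2, 3, 4, 5, 100], the whiskers are trimmed to 1 and 5, and the 100 is flagged as an outlier: it is far from the center when measured in units of the distribution’s own width.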
Box plots combine many different measures of a distribution into a single, compact graph. A box plot allows us to see whether the distribution is symmetric or not and how the weight is distributed between the central peak and the tails. Finally, outliers (if present) are not dropped but shown explicitly. Box plots are best when used to compare several distributions against one another—for a single distribution, the overhead of preparing and managing a graph (compared to just quoting the numbers) may often not appear justified.

Here is an example that compares different data sets against each other. Let’s say we have a data set containing the index of refraction of 121 samples of glass.* The data set is broken down by the type of glass: 70 samples of window glass, 29 from headlamps, 13 from containers of various kinds, and 9 from tableware. Figures 2-14 and 2-15 are two representations of the same data, the former as a kernel density estimate and the latter as a box plot. The box plot emphasizes the overall structure of the data sets and makes it easy to compare the data sets based on their location and width. At the same time, it also loses much information. The KDE gives a more detailed view of the data—in particular showing the occurrence of multiple peaks in the distribution functions—but makes it more difficult to quickly sort and classify the data sets. Depending on your needs, one or the other technique may be preferable at any given time.

Here are some additional notes on box plots.

• The specific way of drawing a box plot that I described here is especially useful but is far from universal. In particular, the specific definition of the adjacent values is often not properly understood. Whenever you find yourself looking at a box plot, always ask what exactly is shown, and whenever you prepare one, make sure to include an explanation.

• The box plot described here can be modified and enhanced.
For example, the width of the central box (i.e., the direction orthogonal to the whiskers) can be used to indicate the size of the underlying data set: the more points are included, the wider the box. Another possibility is to abandon the rectangular shape of the box altogether and to use the local width of the box to display the density of points at each location, which brings us almost full circle to KDEs.

*The raw data can be found in the "Glass Identification Data Set" on the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/.

FIGURE 2-14. Comparing data sets using KDEs: refractive index of different types of glass. (Compare Figure 2-15.)

Workshop: NumPy

The NumPy module provides efficient and convenient handling of large numerical arrays in Python. It is the successor to both the earlier Numeric and the alternative numarray modules. (See Appendix A for more on the history of scientific computing with Python.) The NumPy module is used by many other libraries and projects and in this sense is a "base" technology.

Let's look at some quick examples before delving a bit deeper into technical details.

NumPy in Action

NumPy objects are of type ndarray. There are different ways of creating them. We can create an ndarray by:

• Converting a Python list
• Using a factory function that returns a populated vector
• Reading data from a file directly into a NumPy object

The listing that follows shows five different ways to create NumPy objects. First we create one by converting a Python list. Then we show two different factory routines that generate equally spaced grid points. These routines differ in how they interpret the provided boundary values: one routine includes both boundary values, and the other includes one and excludes the other.
Next we create a vector filled with zeros and set each element in a loop. Finally, we read data from a text file. (I am showing only the simplest or default cases here; all these routines have many more options that can be used to influence their behavior.)

FIGURE 2-15. Comparing data sets using box plots: refractive index of different types of glass. (Compare Figure 2-14.)

    # Five different ways to create a vector...
    import numpy as np

    # From a Python list
    vec1 = np.array( [ 0., 1., 2., 3., 4. ] )

    # arange( start inclusive, stop exclusive, step size )
    vec2 = np.arange( 0, 5, 1, dtype=float )

    # linspace( start inclusive, stop inclusive, number of elements )
    vec3 = np.linspace( 0, 4, 5 )

    # zeros( n ) returns a vector filled with n zeros
    vec4 = np.zeros( 5 )
    for i in range( 5 ):
        vec4[i] = i

    # read from a text file, one number per row
    vec5 = np.loadtxt( "data" )

In the end, all five vectors contain identical data. You should observe that the values in the Python list used to initialize vec1 are floating-point values and that we specified the type desired for the vector elements explicitly when using the arange() function to create vec2. (We will come back to types in a moment.)

Now that we have created these objects, we can operate with them (see the next listing). One of the major conveniences provided by NumPy is that we can operate with NumPy objects as if they were atomic data types: we can add, subtract, and multiply them (and so forth) without the need for explicit loops. Avoiding explicit loops makes our code clearer. It also makes it faster (because the entire operation is performed in C, without overhead; see the discussion that follows).

    # ... continuation from previous listing

    # Add a vector to another
    v1 = vec1 + vec2

    # Unnecessary: adding two vectors using an explicit loop
    v2 = np.zeros( 5 )
    for i in range( 5 ):
        v2[i] = vec1[i] + vec2[i]

    # Adding a vector to another in place
    vec1 += vec2

    # Broadcasting: combining scalars and vectors
    v3 = 2*vec3
    v4 = vec4 + 3

    # Ufuncs: applying a function to a vector, element by element
    v5 = np.sin(vec5)

    # Converting to Python list object again
    lst = v5.tolist()

All operations are performed element by element: if we add two vectors, then the corresponding elements from each vector are combined to give the element in the resulting vector. In other words, the compact expression vec1 + vec2 for v1 in the listing is equivalent to the explicit loop construction used to calculate v2. This is true even for multiplication: vec1 * vec2 will result in a vector in which the corresponding elements of both operands have been multiplied element by element. (If you want a true vector or "dot" product, you must use the dot() function instead.) Obviously, this requires that all operands have the same number of elements!

Now we shall demonstrate two further convenience features, which the NumPy documentation refers to as broadcasting and ufuncs (short for "universal functions"). The term "broadcasting" in this context has nothing to do with messaging. Instead, it means that if you try to combine two arguments of different shapes, then the smaller one will be extended ("cast broader") to match the larger one. This is especially useful when combining scalars with vectors: the scalar is expanded to a vector of appropriate size, all of whose elements have the value given by the scalar; then the operation proceeds, element by element, as before. The term "ufunc" refers to a scalar function that can be applied to a NumPy object.
The function is applied, element by element, to all entries in the NumPy object, and the result is a new NumPy object with the same shape as the original one. Using these features skillfully, a function to calculate a kernel density estimate can be written as a single line of code:

    # Calculating kernel density estimates
    from numpy import *

    # z: position, w: bandwidth, xv: vector of points
    def kde( z, w, xv ):
        return sum( exp(-0.5*((z-xv)/w)**2)/sqrt(2*pi*w**2) )

    d = loadtxt( "presidents", usecols=(2,) )

    w = 2.5
    for x in linspace( min(d)-w, max(d)+w, 1000 ):
        print x, kde( x, w, d )

This program will calculate and print the data needed to generate Figure 2-4 (but it does not actually draw the graph; that will have to wait until we introduce matplotlib in the Workshop of Chapter 3). Most of the listing is boilerplate code, such as reading and writing files. All the actual work is done in the one-line function kde(z, w, xv). This function makes use of both "broadcasting" and "ufuncs" and is a good example of the style of programming typical of NumPy. Let's dissect it, inside out.

First recall what we need to do when evaluating a KDE: for each location z at which we want to evaluate the KDE, we must find its distance to all the points in the data set. For each point, we evaluate the kernel for this distance and sum up the contributions from all the individual kernels to obtain the value of the KDE at z. The expression z-xv generates a vector that contains the distances between z and all the points in xv (that's broadcasting). We then divide each element by the required bandwidth, square it, and multiply by -1/2. Finally, we apply the exponential function exp() to this vector (that's a ufunc). The result is a vector that contains the exponential function evaluated at the distances between the points in the data set and the location z.
Now we only need to sum all the elements in the vector (that's what sum() does), and we are done, having calculated the KDE at position z. If we want to plot the KDE as a curve, we have to repeat this process for each location we wish to plot; that's what the final loop in the listing is for.

NumPy in Detail

You may have noticed that none of the warm-up examples in the listings in the previous section contained any matrices or other data structures of higher dimensionality, only one-dimensional vectors. To understand how NumPy treats objects with dimensions greater than one, we need to develop at least a superficial understanding of the way NumPy is implemented.

It is misleading to think of NumPy as a "matrix package for Python" (although it's commonly used as such). I find it more helpful to think of NumPy as a wrapper and access layer for underlying C buffers. These buffers are contiguous blocks of C memory, which, by their nature, are one-dimensional data structures. All elements in those data structures must be of the same size, and we can specify almost any native C type (including C structs) as the type of the individual elements. The default type corresponds to a C double, and that is what we use in the examples that follow, but keep in mind that other choices are possible. All operations that apply to the data overall are performed in C and are therefore very fast.

To interpret the data as a matrix or other multi-dimensional data structure, the shape or layout is imposed during element access. The same 12-element data structure can therefore be interpreted as a 12-element vector, a 3 × 4 matrix, or a 2 × 2 × 3 tensor; the shape comes into play only through the way we access the individual elements. (Keep in mind that although reshaping a data structure is very easy, resizing is not.)
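The point that the shape is imposed only at access time is easy to verify directly. A minimal sketch (relying on the fact that, for a contiguous buffer, reshape() returns a view of the same memory rather than a copy):

```python
import numpy as np

buf = np.arange(12, dtype=float)   # one contiguous 12-element buffer
m = buf.reshape(3, 4)              # interpreted as a 3x4 matrix
t = buf.reshape(2, 2, 3)           # interpreted as a 2x2x3 tensor

# All three names refer to the same underlying data, so a change
# made through one of them is visible through all the others.
buf[0] = 99.0
assert m[0, 0] == 99.0
assert t[0, 0, 0] == 99.0
```

The aliasing demonstrated in the last three lines is exactly the view behavior discussed next.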
The encapsulation of the underlying C data structures is not perfect: when choosing the types of the atomic elements, we specify C data types, not Python types. Similarly, some features provided by NumPy allow us to manage memory manually, rather than have the memory be managed transparently by the Python runtime. This is an intentional design decision, because NumPy has been designed to accommodate large data structures, large enough that you might want (or need) to exercise a greater degree of control over the way memory is managed. For this reason, you have the ability to choose types that take up less space as elements in a collection (e.g., C float elements rather than the default double). For the same reason, all ufuncs accept an optional argument pointing to an (already allocated) location where the results will be placed, thereby avoiding the need to claim additional memory themselves. Finally, several access and structuring routines return a view (not a copy!) of the same underlying data. This does pose an aliasing problem that you need to watch out for.

The next listing quickly demonstrates the concepts of shape and views. Here, I assume that the commands are entered at an interactive Python prompt (shown as >>> in the listing). Output generated by Python is shown without a prompt:

    >>> import numpy as np
    >>> # Generate two vectors with 12 elements each
    >>> d1 = np.linspace( 0, 11, 12 )
    >>> d2 = np.linspace( 0, 11, 12 )
    >>> # Reshape the first vector to a 3x4 (row x col) matrix
    >>> d1.shape = ( 3, 4 )
    >>> print d1
    [[  0.   1.   2.   3.]
     [  4.   5.   6.   7.]
     [  8.   9.  10.  11.]]
    >>> # Generate a matrix VIEW to the second vector
    >>> view = d2.reshape( (3,4) )
    >>> # Now: possible to combine the matrix and the view
    >>> total = d1 + view
    >>> # Element access: [row,col] for matrix...
    >>> print d1[0,1]
    1.0
    >>> print view[0,1]
    1.0
    >>> # ... and [pos] for vector
    >>> print d2[1]
    1.0
    >>> # Shape or layout information
    >>> print d1.shape
    (3, 4)
    >>> print d2.shape
    (12,)
    >>> print view.shape
    (3, 4)
    >>> # Number of elements (both commands equivalent)
    >>> print d1.size
    12
    >>> print len(d2)
    12
    >>> # Number of dimensions (both commands equivalent)
    >>> print d1.ndim
    2
    >>> print np.rank(d2)
    1

Let's step through this. We create two vectors of 12 elements each. Then we reshape the first one into a 3 × 4 matrix. Note that the shape property is a data member, not an accessor function! For the second vector, we create a view in the form of a 3 × 4 matrix. Now d1 and the newly created view of d2 have the same shape, so we can combine them (by forming their sum, in this case). Note that even though reshape() is a member function, it does not change the shape of the instance itself but instead returns a new view object: d2 is still a one-dimensional vector. (There is also a standalone version of this function, so we could also have written view = np.reshape( d2, (3,4) ). The presence of such redundant functionality is due to the desire to maintain backward compatibility with both of NumPy's ancestors.)

We can now access individual elements of the data structures, depending on their shape. Since both d1 and view are matrices, they are indexed by a pair of indices (in the order [row,col]). However, d2 is still a one-dimensional vector and thus takes only a single index. (We will have more to say about indexing in a moment.)

Finally, we examine some diagnostics regarding the shape of the data structures, emphasizing their precise semantics. The shape is a tuple, giving the number of elements in each dimension. The size is the total number of elements and corresponds, for a one-dimensional vector, to the value returned by len().
Finally, ndim gives the number of dimensions (i.e., d.ndim == len(d.shape)) and is equivalent to the "rank" of the entire data structure. (Again, the redundant functionality exists to maintain backward compatibility.)

Finally, let's take a closer look at the ways in which we can access elements or larger subsets of an ndarray. In the previous listing we saw how to access an individual element by fully specifying an index for each dimension. We can also specify larger subarrays of a data structure using two additional techniques, known as slicing and advanced indexing. The following listing shows some representative examples. (Again, consider this an interactive Python session.)

    >>> import numpy as np
    >>> # Create a 12-element vector and reshape into 3x4 matrix
    >>> d = np.linspace( 0, 11, 12 )
    >>> d.shape = ( 3, 4 )
    >>> print d
    [[  0.   1.   2.   3.]
     [  4.   5.   6.   7.]
     [  8.   9.  10.  11.]]
    >>> # Slicing...
    >>> # First row
    >>> print d[0,:]
    [ 0.  1.  2.  3.]
    >>> # Second col
    >>> print d[:,1]
    [ 1.  5.  9.]
    >>> # Individual element: scalar
    >>> print d[0,1]
    1.0
    >>> # Subvector of shape 1
    >>> print d[0:1,1]
    [ 1.]
    >>> # Subarray of shape 1x1
    >>> print d[0:1,1:2]
    [[ 1.]]
    >>> # Indexing...
    >>> # Integer indexing: third and first column
    >>> print d[ :, [2,0] ]
    [[  2.   0.]
     [  6.   4.]
     [ 10.   8.]]
    >>> # Boolean indexing: second and third row
    >>> k = np.array( [False, True, True] )
    >>> print d[ k, : ]
    [[  4.   5.   6.   7.]
     [  8.   9.  10.  11.]]

We first create a 12-element vector and reshape it into a 3 × 4 matrix as before. Slicing uses the standard Python slicing syntax start:stop:step, where the start position is inclusive but the stopping position is exclusive. (In the listing, I use only the simplest form of slicing, selecting all available elements.) There are two potential "gotchas" with slicing. First of all, specifying an explicit subscripting index (not a slice!) reduces the corresponding dimension to a scalar.
Slicing, though, does not reduce the dimensionality of the data structure. Consider the two extreme cases: in the expression d[0,1], indices for both dimensions are fully specified, and so we are left with a scalar. In contrast, d[0:1,1:2] is sliced in both dimensions. Neither dimension is removed, and the resulting object is still a (two-dimensional) matrix, but of smaller size: it has shape 1 × 1. The second issue to watch out for is that slices return views, not copies.

Besides slicing, we can also index an ndarray with a vector of indices, using an operation called "advanced indexing." The previous listing showed two simple examples. In the first, we use a Python list object, which contains the integer indices (i.e., the positions) of the desired columns in the desired order, to select a subset of columns. In the second example, we form an ndarray of Boolean entries to select only those rows for which the Boolean evaluates to True. In contrast to slicing, advanced indexing returns copies, not views.

This completes our overview of the basic capabilities of the NumPy module. NumPy is easy and convenient to use for simple use cases but can get very confusing otherwise. (For example, check out the rules for general broadcasting when both operands are multi-dimensional, or for advanced indexing.) We will present some more straightforward applications in Chapters 3 and 4.

Further Reading

• The Elements of Graphing Data. William S. Cleveland. 2nd ed., Hobart Press. 1994.
A book-length discussion of graphical methods for data analysis such as those described in this chapter. In particular, you will find more information here on topics such as box plots and QQ plots. Cleveland's methods are particularly careful and well thought-out.

• All of Statistics: A Concise Course in Statistical Inference. Larry Wasserman. Springer. 2004.
A thoroughly modern treatment of mathematical statistics, very advanced and condensed. You will find some additional material here on the theory of "density estimation", that is, on histograms and KDEs.

• Multivariate Density Estimation. David W. Scott. 2nd ed., Wiley. 2006.
A research monograph on density estimation, written by the creator of Scott's rule.

• Kernel Smoothing. M. P. Wand and M. C. Jones. Chapman & Hall. 1995.
An accessible treatment of kernel density estimation.

CHAPTER THREE
Two Variables: Establishing Relationships

WHEN WE ARE DEALING WITH A DATA SET THAT CONSISTS OF TWO VARIABLES (THAT IS, A BIVARIATE DATA SET), we are mostly interested in seeing whether some kind of relationship exists between the two variables and, if so, what kind of relationship this is. Plotting one variable against another is pretty straightforward; therefore, most of our effort will be spent on various tools and transformations that can be applied to characterize the nature of the relationship between the two inputs.

Scatter Plots

Plotting one variable against another is simple: you just do it! In fact, this is precisely what most people mean when they speak about "plotting" something. Yet there are differences, as we shall see. Figures 3-1 and 3-2 show two examples.

The data in Figure 3-1 might come from an experiment that measures the force between two surfaces separated by a short distance. The force is clearly a complicated function of the distance; on the other hand, the data points fall on a relatively smooth curve, and we can have confidence that it represents the data accurately. (To be sure, we should ask about the accuracy of the measurements shown in this graph: are there significant error bars attached to the data points? But it doesn't matter; the data itself shows clearly that the amount of random noise in the data is small.
This does not mean that there aren't problems with the data, but only that any problems will be systematic ones, for instance with the apparatus, and statistical methods will not be helpful.)

FIGURE 3-1. Data that clearly shows that there is a relationship, albeit a complicated one, between x and y.

In contrast, Figure 3-2 shows the kind of data typical of much of statistical analysis. Here we might be showing the prevalence of skin cancer as a function of the mean income for a group of individuals, or the unemployment rate as a function of the frequency of high-school drop-outs for a number of counties, and the primary question is whether there is any relationship at all between the two quantities involved. The situation here is quite different from that shown in Figure 3-1, where it was obvious that a strong relationship existed between x and y, and therefore our main concern was to determine the precise nature of that relationship.

A figure such as Figure 3-2 is referred to as a scatter plot or xy plot. I prefer the latter term because scatter plot sounds to me too much like "splatter plot," suggesting that the data necessarily will be noisy; but we don't know that! Once we plot the data, it may turn out to be very clean and regular, as in Figure 3-1; hence I am more comfortable with the neutral term.

When we create a graph such as Figure 3-1 or Figure 3-2, we usually want to understand whether there is a relationship between x and y as well as what the nature of that relationship is. Figure 3-3 shows four different possibilities that we may find: no relationship; a strong, simple relationship; a strong, not-simple relationship; and finally a multivariate relationship (one that is not unique).

Conquering Noise: Smoothing

When data is noisy, we are more concerned with establishing whether the data exhibits a meaningful relationship than with establishing its precise character. To see this, it is
To see this, it is 48 CHAPTER THREE O’Reilly-5980006 master October 28, 2010 20:27 FIGURE 3-2.Anoisydata set. Is there any relationship between x and y? FIGURE 3-3.Fourtypesoffunctional relationships (left to right, top to bottom): no relationship; strong, simple relationship; strong, not-simple relationship; multivariate relationship. often helpful to ﬁnd a smooth curve that represents the noisy data set. Trends and structure of the data may be more easily visible from such a curve than from the cloud of points. TWO VARIABLES: ESTABLISHING RELATIONSHIPS 49 O’Reilly-5980006 master October 28, 2010 20:27 Two different methods are frequently used to provide smooth representation of noisy data sets: weighted splines and a method known as LOESS (or LOWESS), which is short for locally weighted regression. Both methods work by approximating the data in a small neighborhood (i.e., locally) by a polynomial of low order (at most cubic). The trick is to string the various local approximations together to form a single smooth curve. Both methods contain an adjustable parameter that controls the “stiffness” of the resulting curve: the stiffer the curve, the smoother it appears but the less accurately it can follow the individual data points. Striking the right balance between smoothness and accuracy is the main challenge when it comes to smoothing methods. Splines Splines are constructed from piecewise polynomial functions (typically cubic) that are joined together in a smooth fashion. In addition to the local smoothness requirements at each joint, splines must also satisfy a global smoothness condition by optimizing the functional: J[s] = α d2s dt2 2 dt + (1 − α) i wi (yi − s(xi ))2 Here s(t) is the spline curve, (xi , yi ) are the coordinates of the data points, the wi are weight factors (one for each data point), and α is a mixing factor. 
The first term controls how "wiggly" the spline is overall, because the second derivative measures the curvature of s(t) and becomes large if the curve has many wiggles. The second term captures how accurately the spline represents the data points by measuring the squared deviation of the spline from each data point; it becomes large if the spline does not pass close to the data points. Each term in the sum is multiplied by a weight factor wᵢ, which can be used to give greater weight to data points that are known with greater accuracy than others. (Put differently: we can write wᵢ as wᵢ = 1/dᵢ², where dᵢ measures how close the spline should pass by yᵢ at xᵢ.) The mixing parameter α controls how much weight we give to the first term (emphasizing overall smoothness) relative to the second term (emphasizing accuracy of representation). In a plotting program, α is usually the dial we use to tune the spline for a given data set.

To construct the spline explicitly, we form cubic interpolation polynomials for each consecutive pair of points and require that these individual polynomials have the same values, as well as the same first and second derivatives, at the points where they meet. These smoothness conditions lead to a set of linear equations for the coefficients in the polynomials, which can be solved. Once these coefficients have been found, the spline curve can be evaluated at any desired location.

LOESS

Splines have an overall smoothness goal, which means that they are less responsive to local details in the data set. The LOESS smoothing method addresses this concern. It consists of approximating the data locally through a low-order (typically linear) polynomial (regression), while weighting all the data points in such a way that points close to the location of interest contribute more strongly than data points farther away (local weighting).
Let's consider the case of first-order (linear) LOESS, so that the local approximation takes the particularly simple form a + bx. To find the "best fit" in a least-squares sense, we must minimize

    χ² = Σᵢ w(x − xᵢ; h) (a + bxᵢ − yᵢ)²

with respect to the two parameters a and b. Here, w(x) is the weight function. It should be smooth and strongly peaked; in fact, it is basically a kernel, similar to those we encountered in Figure 2-5 when we discussed kernel density estimates. The kernel most often used with LOESS is the "tri-cube" kernel

    K(x) = (1 − |x|³)³  for |x| < 1,    K(x) = 0 otherwise

but any of the other kernels will also work. The weight depends on the distance between the point x where we want to evaluate the LOESS approximation and the locations of the data points. In addition, the weight function also depends on the parameter h, which controls the bandwidth of the kernel: this is the primary control parameter for LOESS approximations. Finally, the value of the LOESS approximation at position x is given by y(x) = a + bx, where a and b minimize the expression for χ² stated earlier.

This is the basic idea behind LOESS. You can see that it is easy to generalize, for example to two or more dimensions or to higher-order approximation polynomials. (One problem, though: explicit, closed expressions for the parameters a and b can be found only if you use first-order polynomials; for quadratic or higher polynomials you will have to resort to numerical minimization techniques. Unless you have truly compelling reasons, you will want to stick to the linear case!)

LOESS is a computationally intensive method. Keep in mind that the entire calculation must be performed for every point at which we want to obtain a smoothed value. (In other words, the parameters a and b that we calculated are themselves functions of x.) This is in contrast to splines: once the spline coefficients have been calculated, the spline can be evaluated easily at any point that we wish.
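For the linear case, the minimization can be done in closed form: setting the derivatives of χ² with respect to a and b to zero yields the weighted least-squares normal equations. A minimal NumPy sketch (the function names are mine, not taken from any particular library):

```python
import numpy as np

def tricube(u):
    """Tri-cube kernel: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = np.abs(u)
    return np.where(u < 1, (1 - u**3)**3, 0.0)

def loess(x, xv, yv, h):
    """First-order LOESS estimate y(x) = a + b*x at a single location x.

    Minimizes chi^2 = sum_i w((x - x_i)/h) * (a + b*x_i - y_i)^2 by
    solving the two weighted least-squares normal equations for a, b.
    """
    w = tricube((x - xv) / h)
    sw   = w.sum()
    swx  = (w * xv).sum()
    swy  = (w * yv).sum()
    swxx = (w * xv * xv).sum()
    swxy = (w * xv * yv).sum()
    b = (sw * swxy - swx * swy) / (sw * swxx - swx**2)
    a = (swy - b * swx) / sw
    return a + b * x
```

On exactly linear data the estimate reproduces the line, as it must; on noisy data, calling loess() on a grid of x values traces out the smoothed curve (which is why the method is computationally intensive: the whole fit is redone at every evaluation point).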
In this way, splines provide a summary of, or an approximation to, the data. LOESS, however, does not lend itself easily to semi-analytical work: what you see is pretty much all you get. One final observation: if we replace the linear function a + bx in the fitting process with the constant function a, then LOESS becomes simply a weighted moving average.

FIGURE 3-4. The 1970 draft lottery: draft number versus birth date (the latter given in days since the beginning of the year). Two LOESS curves with different values for the smoothing parameter h indicate that men born later in the year tended to have lower draft numbers. This would not be easily recognizable from a plot of the data points alone.

Examples

Let's look at two examples where smoothing reveals behavior that would otherwise not be visible. The first is a famous data set that has been analyzed in many places: the 1970 draft lottery. During the Vietnam War, men in the U.S. were drafted based on their date of birth. Each possible birth date was assigned a draft number between 1 and 366 using a lottery process, and men were drafted in the order of their draft numbers. However, complaints were soon raised that the lottery was biased: that men born later in the year had a greater chance of receiving a low draft number and, consequently, a greater chance of being drafted early.*

Figure 3-4 shows all possible birth dates (as days since the beginning of the year) and their assigned draft numbers. If the lottery had been fair, these points should form a completely random pattern. Looking at the data alone, it is virtually impossible to tell whether there is any structure in the data.

*More details and a description of the lottery process can be found in The Statistical Exorcist. M. Hollander and F. Proschan. CRC Press. 1984.
However, the smoothed LOESS lines reveal a strong falling tendency of the draft number over the course of the year: later birth dates are indeed more likely to have a lower draft number! The LOESS lines have been calculated using a Gaussian kernel. For the solid line, I used a kernel bandwidth equal to 5; for the dashed line, I used a much larger bandwidth of 100. For such a large bandwidth, practically all points in the data set contribute equally to the smoothed curve, so that the LOESS operation reverts to a linear regression of the entire data set. (In other words: if we make the bandwidth very large, then LOESS amounts to a least-squares fit of a straight line to the data.)

In the draft number example, we mostly cared about a global property of the data: the presence or absence of an overall trend. Because we were looking for a global property, a stiff curve (such as a straight line) was sufficient to reveal what we were looking for. However, if we want to extract more detail, in particular if we want to extract local features, then we need a "softer" curve, which can follow the data on smaller scales.

Figure 3-5 shows an amusing example.* Displayed are the finishing times (separately for men and women) for the winners in a marathon. Also shown are the "best fit" straight-line approximations for all events up to 1990. According to this (straight-line) model, women should start finishing faster than men before the year 2000 and then continue to become faster at a dramatic rate! This expectation is not borne out by actual observations: finishing times for women (and men) have largely leveled off. This example demonstrates the danger of attempting to describe data using a model of fixed form (a "formula"), and a straight line is one of the most rigid models out there! A model that is not appropriate for the data will lead to incorrect conclusions. Moreover, it may not be obvious that the model is inappropriate.
Look again at Figure 3-5: don't the straight lines seem reasonable as a description of the data prior to 1990?

Also shown in Figure 3-5 are smoothed curves calculated using a LOESS process. Because these curves are "softer," they have a greater ability to capture features contained in the data. Indeed, the LOESS curve for the women's results does give an indication that the trend of dramatic improvements, seen since women first started competing in the mid-1960s, had already begun to level off before the year 1990. (All curves are based strictly on data prior to 1990.) This is a good example of how an adaptive smoothing curve can highlight local behavior that is present in the data but may not be obvious from merely looking at the individual data points.

*This example was inspired by Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Howard Wainer. 2nd ed., Princeton University Press. 2007.

FIGURE 3-5. Winning times (in minutes) for an annual marathon event, separately for men and women. Also shown are the straight-line and smooth-curve approximations. All approximations are based entirely on data points prior to 1990.

Residuals

Once you have obtained a smoothed approximation to the data, you will usually also want to check out the residuals, that is, the remainder when you subtract the smooth "trend" from the actual data. There are several details to look for when studying residuals.

• Residuals should be balanced: symmetrically distributed around zero.

• Residuals should be free of a trend. The presence of a trend, or of any other large-scale systematic behavior, in the residuals suggests that the model is inappropriate!
(By construction, this is never a problem if the smooth curve was obtained from an adaptive smoothing model; however, it is an important indicator if the smooth curve comes from an analytic model.)
• Residuals will necessarily straddle the zero value; they will take on both positive and negative values. Hence you may also want to plot their absolute values to evaluate whether the overall magnitude of the residuals is the same for the entire data set or not. The assumption that the magnitude of the variance around a model is constant throughout ("homoscedasticity") is often an important condition in statistical methods. If it is not satisfied, then such methods may not apply.
• Finally, you may want to use a QQ plot (see Chapter 2) to check whether the residuals are distributed according to a Gaussian distribution. This, too, is an assumption that is often important for more advanced statistical methods.

FIGURE 3-6. Residuals for the women's marathon results, both for the LOESS smoothing curve and the straight-line linear regression model. The residuals for the latter show an overall systematic trend, which suggests that the model does not appropriately describe the data.

It may also be useful to apply a smoothing routine to the residuals in order to recognize their features more clearly. Figure 3-6 shows the residuals for the women's marathon results (before 1990) both for the straight-line model and the LOESS smoothing curve. For the LOESS curve, the residuals are small overall and hardly exhibit any trend. For the straight-line model, however, there is a strong systematic trend in the residuals that is increasing in magnitude for years past 1985. This kind of systematic trend in the residuals is a clear indicator that the model is not appropriate for the data!
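These checks are straightforward to script. A minimal sketch (with synthetic data, not the marathon results; a moving average stands in for whatever smooth trend you have computed):

```python
import math

def moving_average(ys, w):
    """Centered moving average over a window of 2*w+1 points (truncated
    at the ends), used here as a stand-in for a smooth trend."""
    out = []
    for i in range(len(ys)):
        lo, hi = max(0, i - w), min(len(ys), i + w + 1)
        out.append(sum(ys[lo:hi]) / (hi - lo))
    return out

# Synthetic data: slow oscillation plus fast alternating "noise"
ys = [math.sin(0.5 * i) + 0.1 * (-1) ** i for i in range(40)]
smooth = moving_average(ys, 3)
res = [y - s for y, s in zip(ys, smooth)]

balance = sum(res) / len(res)                   # should be close to zero
half = len(res) // 2
mag1 = sum(abs(r) for r in res[:half]) / half   # average |residual|, first half
mag2 = sum(abs(r) for r in res[half:]) / (len(res) - half)   # ...second half
# A ratio mag1/mag2 far from 1 would hint at heteroscedasticity.
```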
Additional Ideas and Warnings

Here are some additional ideas that you might want to play with. As we have discussed before, you can calculate the residuals between the real data and the smoothed approximation. Here an isolated large residual is certainly odd: it suggests that the corresponding data point is somehow "different" from the other points in the neighborhood—in other words, an outlier. Now we argue as follows. If the data point is an outlier, then it should contribute less to the smoothed curve than other points. Taking this consideration into account, we now introduce an additional weight factor for each data point into the expression for J[s] or χ² given previously. The magnitude of this weight factor is chosen in such a way that data points with large residuals contribute less to the smooth curve. With this new weight factor reducing the influence of points with large residuals, we calculate a new version of the smoothed approximation. This process is iterated until the smooth curve no longer changes.

FIGURE 3-7. A "smooth tube" for the men's marathon results. The solid line is a smooth representation of the entire data set; the dashed lines are smooth representations of only those points that lie above (or below) the solid line.

Another idea is to split the original data points into two classes: those that give rise to a positive residual and those with a negative residual. Now calculate a smooth curve for each class separately. The resulting curves can be interpreted as "confidence bands" for the data set (meaning that the majority of points will lie between the upper and the lower smooth curve). We are particularly interested to see whether the width of this band varies along the curve. Figure 3-7 shows an example that uses the men's results from Figure 3-5.
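The reweighting loop described above can be sketched in a few lines. For illustration I use the simplest possible "smoother" (a weighted mean, that is, a horizontal line) and a bisquare-style weight function; the specific weight function and the factor of 6 are conventional choices of mine, not prescribed by the text:

```python
def robust_level(ys, iterations=5):
    """Iteratively reweighted estimate of the level of ys: points with
    large residuals have their weight reduced on each pass."""
    ws = [1.0] * len(ys)
    level = sum(ys) / len(ys)
    for _ in range(iterations):
        level = sum(w * y for w, y in zip(ws, ys)) / sum(ws)
        res = [y - level for y in ys]
        # Scale: 6 times the median absolute residual (guard against zero)
        scale = 6.0 * sorted(abs(r) for r in res)[len(res) // 2] or 1e-9
        ws = [(1.0 - min(1.0, (r / scale) ** 2)) ** 2 for r in res]
    return level

data = [10.1, 9.9, 10.0, 10.2, 9.8, 50.0]   # one gross outlier
# The plain mean is dragged up to about 16.7; the reweighted estimate
# converges near 10, because the outlier's weight shrinks to zero.
```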
Personally, I am a bit uncomfortable with either of these suggestions. They certainly have an unpleasant air of circular reasoning about them. There is also a deeper reason. In my opinion, smoothing methods are a quick and useful but entirely nonrigorous way to explore the structure of a data set. With some of the more sophisticated extensions (e.g., the two suggestions just discussed), we abandon the simplicity of the approach without gaining anything in rigor! If we need or want better (or deeper) results than simple graphical methods can give us, isn't it time to consider a more rigorous toolset?

This is a concern that I have with many of the more sophisticated graphical methods you will find discussed in the literature. Yes, we certainly can squeeze ever more information into a graph using lines, colors, symbols, textures, and what have you. But this does not necessarily mean that we should. The primary benefit of a graph is that it speaks to us directly—without the need for formal training or long explanations. Graphs that require training or complicated explanations to be properly understood are missing their mark no matter how "clever" they may be otherwise.

Similar considerations apply to some of the more involved ways of graph preparation. After all, a smooth curve such as a spline or LOESS approximation is only a rough approximation to the data set—and, by the way, contains a huge degree of arbitrariness in the form of the smoothing parameter (α or h, respectively). Given this situation, it is not clear to me that we need to worry about such details as the effect of individual outliers on the curve.

Focusing too much on graphical methods may also lead us to miss the essential point. For example, once we start worrying about confidence bands, we should really start thinking more deeply about the nature of the local distribution of residuals (Are the residuals normally distributed?
Are they independent? Do we have a reason to prefer one statistical model over another?)—and possibly consider a more reliable estimation method (e.g., bootstrapping; see Chapter 12)—rather than continue with hand-waving (semi-)graphical methods.

Remember: The purpose of computing is insight, not pictures! (L. N. Trefethen)

Logarithmic Plots

Logarithmic plots are a standard tool of scientists, engineers, and stock analysts everywhere. They are so popular because they have three valuable benefits:

• They rein in large variations in the data.
• They turn multiplicative variations into additive ones.
• They reveal exponential and power-law behavior.

In a logarithmic plot, we graph the logarithm of the data instead of the raw data. Most plotting programs can do this for us (so that we don't have to transform the data explicitly) and also take care of labeling the axes appropriately. There are two forms of logarithmic plots: single or semi-logarithmic plots and double logarithmic or log-log plots, depending on whether only one (usually the vertical or y axis) or both axes have been scaled logarithmically. All logarithmic plots are based on the fundamental property of the logarithm to turn products into sums and powers into products:

log(xy) = log(x) + log(y)
log(x^k) = k log(x)

Let's first consider semi-log plots. Imagine you have data generated by evaluating the function

y = C exp(αx)

where C and α are constants, on a set of x values. If you plot y as a function of x, you will see an upward- or downward-sloping curve, depending on the sign of α (see Appendix B). But if you instead plot the logarithm of y as a function of x, the points will fall on a straight line.

FIGURE 3-8. A semi-logarithmic plot.
This can be easily understood by applying the logarithm to the preceding equation:

log y = αx + log C

In other words, the logarithm of y is a linear function of x with slope α and with offset log C. In particular, by measuring the slope of the line, we can determine the scale factor α, which is often of great interest in applications.

Figure 3-8 shows an example of a semi-logarithmic plot that contains some experimental data points as well as an exponential function for comparison. I'd like to point out a few details. First, in a logarithmic plot, we plot the logarithm of the values, but the axes are usually labeled with the actual values (not their logarithms). Figure 3-8 shows both: the actual values on the left and the logarithms on the right (the logarithm of 100 to base 10 is 2, the logarithm of 1,000 is 3, and so on). We can see how, in a logarithmic plot, the logarithms are equidistant, but the actual values are not. (Observe that the distance between consecutive tick marks is constant on the right, but not on the left.)

Another aspect I want to point out is that on a semi-log plot, all relative changes have the same size no matter how large the corresponding absolute change. It is this property that makes semi-log plots popular for long-running stock charts and the like: if you lost $100, your reaction may be quite different if originally you had invested $1,000 versus $200: in the first case you lost 10 percent, but 50 percent in the second. In other words, relative change is what matters.

FIGURE 3-9. Heart rate versus body mass for a range of mammals. Compare to Figure 3-10.
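The slope-and-offset relationship for semi-log plots is easy to verify numerically. A sketch with synthetic data (the values of C and α are arbitrary choices for the demonstration):

```python
import math

C, alpha = 2.0, 0.35
xs = [0.5 * i for i in range(20)]
ys = [C * math.exp(alpha * x) for x in xs]      # y = C * exp(alpha * x)

# A semi-log plot graphs log(y) against x; a straight-line fit to
# (x, log y) therefore recovers alpha as the slope and log C as the offset.
logs = [math.log(y) for y in ys]
n = len(xs)
sx, sy = sum(xs), sum(logs)
sxx = sum(x * x for x in xs)
sxy = sum(x * t for x, t in zip(xs, logs))
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
offset = (sy - slope * sx) / n
# slope recovers alpha; exp(offset) recovers C (up to rounding error)
```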
The two scale arrows in Figure 3-8 have the same length and correspond to the same relative change, but the underlying absolute change is quite different (from 1 to 3 in one case, from 100 to 300 in the other). This is another application of the fundamental property of the logarithm: if the value before the change is y1 and if y2 = γ·y1 after the change (where γ = 3), then the change in absolute terms is:

y2 − y1 = γ·y1 − y1 = (γ − 1)·y1

which clearly depends on y1. But if we consider the change in the logarithms, we find:

log y2 − log y1 = log(γ·y1) − log y1 = log γ + log y1 − log y1 = log γ

which is independent of the underlying value and depends only on γ, the size of the relative change.

Double logarithmic plots are now easy to understand—the only difference is that we plot logarithms of both x and y. This will render all power-law relations as straight lines—that is, all functions of the form y = C·x^k or y = C/x^k, where C and k are constants. (Taking logarithms on both sides of the first equation yields log y = k log x + log C, so that now log y is a linear function of log x with a slope that depends on the exponent k.)

Figures 3-9 and 3-10 provide a stunning example of both uses of double logarithmic plots: their ability to render data spanning many orders of magnitude accessible and their ability to reveal power-law relationships by turning them into straight lines. Figure 3-9 shows the typical resting heart rate (in beats per minute) as a function of the body mass (in kilograms) for a selection of mammals from the hamster to large whales. Whales weigh in at 120 tons—nothing else even comes close! The consequence is that almost all of the data points are squished against the lefthand side of the graph, literally crushed by the whale.
FIGURE 3-10. The same data as in Figure 3-9 but now plotted on a double logarithmic plot. The data points seem to fall on a straight line, which indicates a power-law relationship between resting heart rate and body mass.

On the double logarithmic plot, the distribution of data points becomes much clearer. Moreover, we find that the data points are not randomly distributed but instead seem to fall roughly on a straight line with slope −1/4: the signature of power-law behavior. In other words, a mammal's typical heart rate is related to its mass: larger animals have slower heart beats. If we let f denote the heart rate and m the mass, we can summarize this observation as:

f ∼ m^(−1/4)

This surprising result is known as allometric scaling. It seems to hold more generally and not just for the specific animals and quantities shown in these figures. (For example, it turns out that the lifetime of an individual organism also obeys a 1/4 power-law relationship with the body mass: larger animals live longer. The surprising consequence is that the total number of heartbeats per life of an individual is approximately constant for all species!) Allometric scaling has been explained in terms of the geometric constraints of the vascular network (veins and arteries), which brings nutrients to the cells making up a biological system.
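On a log-log plot, the exponent is simply the slope, so it can be read off from any two points on the line. A sketch with synthetic data obeying f = c·m^(−1/4) exactly (the prefactor c is made up; these are not the measured values from the figure):

```python
import math

c = 200.0                                        # hypothetical prefactor (bpm at 1 kg)
masses = [0.1, 1.0, 10.0, 1000.0, 100000.0]      # kg, roughly hamster to whale
rates = [c * m ** -0.25 for m in masses]         # f = c * m**(-1/4)

# The slope between any two points in log-log coordinates is the exponent:
k = (math.log10(rates[-1]) - math.log10(rates[0])) / (
    math.log10(masses[-1]) - math.log10(masses[0]))
# k comes out as -0.25, the power-law exponent
```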
It is sufficient to assume that the network must be a space-filling fractal, that the capillaries where the actual exchange of nutrients takes place are the same size in all animals, and that the overall energy required for transport through the network is minimized, to derive the power-law relationships observed experimentally!* We'll have more to say about scaling laws and their uses in Part II.

*The original reference is "A General Model for the Origin of Allometric Scaling Laws in Biology." G. B. West, J. H. Brown, and B. J. Enquist. Science 276 (1997), p. 122. Additional references can be found on the Web.

Banking

Smoothing methods and logarithmic plots are both tools that help us recognize structure in a data set. Smoothing methods reduce noise, and logarithmic plots help with data sets spanning many orders of magnitude. Banking (or "banking to 45 degrees") is another graphical method. It is different from the preceding ones because it does not work on the data but on the plot as a whole, by changing its aspect ratio.

We can recognize change (i.e., the slopes of curves) most easily if the curves make approximately a 45 degree angle on the graph. It is much harder to see change if the curves are nearly horizontal or (even worse) nearly vertical. The idea behind banking is therefore to adjust the aspect ratio of the entire plot in such a way that most slopes are at an approximate 45 degree angle. Chances are, you have been doing this already by changing the plot ranges. Often when we "zoom" in on a graph it's not so much to see more detail as to adjust the slopes of curves to make them more easily recognizable. The purpose is even more obvious when we zoom out. Banking is a more suitable technique to achieve the same effect and opens up a way to control the appearance of a plot by actively adjusting the aspect ratio.
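One way to automate this adjustment is to pick the aspect ratio so that the median absolute slope of the line segments comes out at 45 degrees. This is a simplified sketch of that idea (Cleveland's original banking algorithm is more refined):

```python
import math

def banked_aspect_ratio(xs, ys):
    """Height/width ratio that banks the median absolute segment slope
    to 45 degrees. Slopes are measured after scaling both axes to [0, 1]."""
    xr = max(xs) - min(xs)
    yr = max(ys) - min(ys)
    slopes = []
    for i in range(len(xs) - 1):
        dx = (xs[i + 1] - xs[i]) / xr
        dy = (ys[i + 1] - ys[i]) / yr
        if dx != 0:
            slopes.append(abs(dy / dx))
    med = sorted(slopes)[len(slopes) // 2]
    return 1.0 / med if med > 0 else 1.0    # at this ratio, the median slope is 1

# A rapidly oscillating series (like the sunspot numbers below) wants
# a wide, flat plot:
xs = list(range(100))
ys = [math.sin(x) for x in xs]
ratio = banked_aspect_ratio(xs, ys)         # much smaller than 1
```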
Figures 3-11 and 3-12 show the classical example for this technique: the annual number of sunspots measured over the last 300 years.* In Figure 3-11, the oscillation is very compressed, and so it is difficult to make out much detail about the shape of the curve. In Figure 3-12, the aspect ratio of the plot has been adjusted so that most line segments are now at roughly a 45 degree angle, and we can make an interesting observation: the rising edge of each sunspot cycle is steeper than the falling edge. We would probably not have recognized this by looking at Figure 3-11.

*The discussion here is adapted from my book Gnuplot in Action. Manning Publications. 2010.

FIGURE 3-11. The annual sunspot numbers for the last 300 years. The aspect ratio of the plot makes it hard to recognize the details of each cycle.

Personally, I would probably not use a graph such as Figure 3-12: shrinking the vertical axis down to almost nothing loses too much detail. It also becomes difficult to compare the behavior on the far left and far right of the graph. Instead, I would break up the time series and plot it as a cut-and-stack plot, such as the one in Figure 3-13. Note that in this plot the aspect ratio of each subplot is such that the lines are, in fact, banked to 45 degrees.

As this example demonstrates, banking is a good technique but can be taken too literally. When the aspect ratio required to achieve proper banking is too skewed, it is usually better to rethink the entire graph. No amount of banking will make the data set in Figure 3-9 look right—you need a double logarithmic transform. There is also another issue to consider. The purpose of banking is to improve human perception of the graph (it is, after all, exactly the same data that is displayed). But graphs
with highly skewed aspect ratios violate the great affinity humans seem to have for proportions of roughly 4 by 3 (or 11 by 8.5 or √2 by 1). Witness the abundance of display formats (paper, books, screens) that adhere approximately to these proportions the world over. Whether we favor this display format because we are so used to it or (more likely, I think) it is so predominant because it works well for humans is rather irrelevant in this context. (And keep in mind that squares seem to work particularly badly—notice how squares, when used for furniture or appliances, are considered a "bold" design. Unless there is a good reason for them, such as graphing a square matrix, I recommend you avoid square displays.)

FIGURE 3-12. The same data as in Figure 3-11. The aspect ratio has been changed so that rising and falling flanks of the curve make approximately a 45 degree angle with the horizontal (banking to 45 degrees), but the figure has become so small that it is hard to recognize much detail.

FIGURE 3-13. A cut-and-stack plot of the data from Figure 3-11. By breaking the time axis into three chunks, we can bank each century to 45 degrees and still fit all the data into a standard-size plot. Note how we can now easily recognize an important feature of the data: the rising flank tends to be steeper than the falling one.

Linear Regression and All That

Linear regression is a method for finding a straight line through a two-dimensional scatter plot. It is simple to calculate and has considerable intuitive appeal—both of which together make it easily the single most-often misapplied technique in all of statistics!

There is a fundamental misconception regarding linear regression—namely that it is a good and particularly rigorous way to summarize the data in a two-dimensional scatter
plot. This misconception is often associated with the notion that linear regression provides the "best fit" to the data. This is not so. Linear regression is not a particularly good way to summarize data, and it provides a "best fit" in a much more limited sense than is generally realized.

Linear regression applies to situations where we have a set of input values (the controlled variable) and, for each of them, we measure an output value (the response variable). Now we are looking for a linear function f(x) = a + bx of the controlled variable x that reproduces the response with the least amount of error. The result of a linear regression is therefore a function that minimizes the error in the responses for a given set of inputs. This is an important understanding: the purpose of a regression procedure is not to summarize the data—the purpose is to obtain a function that allows us to predict the value of the response variable (which is affected by noise) that we expect for a certain value of the input variable (which is assumed to be known exactly).

As you can see, there is a fundamental asymmetry between the two variables: the two are not interchangeable. In fact, you will obtain a different solution when you regress x on y than when you regress y on x. Figure 3-14 demonstrates this effect: the same data set is fitted both ways: y = a + bx and x = c + dy. The resulting straight lines are quite different. This simple observation should dispel the notion that linear regression provides the best fit—after all, how could there be two different "best fits" for a single data set? Instead, linear regression provides the most faithful representation of an output in response to an input. In other words, linear regression is not so much a best fit as a best predictor.

FIGURE 3-14. The first data set from Anscombe's quartet (Table 3-1), fit both ways: y = a + bx and x = c + dy. Depending on what you consider the input and the response variable, the "best fit" turns out to be different! The thin lines indicate the errors, the squares of which are summed to give χ².

How do we find this "best predictor"? We require it to minimize the error in the responses, so that we will be able to make the most accurate predictions. But the error in the responses is simply the sum over the errors for all the individual data points. Because errors can be positive or negative (as the function over- or undershoots the real value), they may cancel each other out. To avoid this, we do not sum the errors themselves but their squares:

χ² = Σᵢ (f(xᵢ) − yᵢ)² = Σᵢ (a + bxᵢ − yᵢ)²

where (xᵢ, yᵢ) with i = 1 ... n are the data points. Using the values for the parameters a and b that minimize this quantity will yield a function that best explains y in terms of x. Because the dependence of χ² on a and b is particularly simple, we can work out expressions for the optimal choice of both parameters explicitly. The results are:

b = (n Σ xᵢyᵢ − Σ xᵢ Σ yᵢ) / (n Σ xᵢ² − (Σ xᵢ)²)
a = (1/n) (Σ yᵢ − b Σ xᵢ)

FIGURE 3-15. Anscombe's quartet: all summary statistics (in particular the regression coefficients) for all four data sets are numerically equal, yet only data set A is well represented by the linear regression function.

TABLE 3-1. Anscombe's quartet.
        A              B              C              D
    x      y       x      y       x      y       x      y
  10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
   8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
  13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
   9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
  11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
  14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
   6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
   4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
  12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
   7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
   5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

These results are simple and beautiful—and, in their simplicity, very suggestive. But they can also be highly misleading. Table 3-1 and Figure 3-15 show a famous example, Anscombe's quartet. If you calculate the regression coefficients a and b for each of the four data sets shown in Table 3-1, you will find that they are exactly the same for all four data sets! Yet when you look at the corresponding scatter plots, it is clear that only the first data set is properly described by the linear model. The second data set is not linear, the third is corrupted by an outlier, and the fourth does not contain enough independent x values to form a regression at all! Looking only at the results of the linear regression, you would never know this.

I think this example should demonstrate once and for all how dangerous it can be to rely on linear regression (or on any form of aggregate statistics) to summarize a data set. (In fact, the situation is even worse than what I have presented: with a little bit more work, you can calculate confidence intervals on the linear regression results, and even they turn out to be equal for all four members of Anscombe's quartet!)

Having seen this, here are some questions to ask before computing linear regressions.

Do you need regression? Remember that regression coefficients are not a particularly good way to summarize data. Regression only makes sense when you want to use it for prediction.
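(As an aside, both of the computational claims above are easy to check. The sketch below implements the closed-form expressions for a and b in a single pass and runs them over the table's values; it also fits the first data set both ways, to make the y-on-x versus x-on-y asymmetry of Figure 3-14 concrete.)

```python
def regress(xs, ys):
    """Least-squares fit y = a + b*x from the four sums, in a single pass."""
    n, sx, sy, sxx, sxy = len(xs), 0.0, 0.0, 0.0, 0.0
    for x, y in zip(xs, ys):
        sx += x; sy += y; sxx += x * x; sxy += x * y
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Anscombe's quartet (Table 3-1); data sets A-C share the same x values
x_abc = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y_a = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_b = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y_c = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x_d = [8.0] * 7 + [19.0] + [8.0] * 3
y_d = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

fits = [regress(x_abc, y_a), regress(x_abc, y_b),
        regress(x_abc, y_c), regress(x_d, y_d)]
# Every data set yields (to two decimals) a = 3.00 and b = 0.50.

# Regressing x on y instead gives a *different* line: if both fits
# described the same line, the product of the two slopes would be 1.
_, b_yx = regress(x_abc, y_a)
_, d_xy = regress(y_a, x_abc)
# b_yx * d_xy equals the squared correlation coefficient, well below 1 here
```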
If this is not the case, then calculating regression coefficients is not useful.

Is the linear assumption appropriate? Linear regression is appropriate only if the data can be described by a straight line. If this is obviously not the case (as with the second data set in Anscombe's quartet), then linear regression does not apply.

Is something else entirely going on? Linear regression, like all summary statistics, can be led astray by outliers or other "weird" data sets, as is demonstrated by the last two examples in Anscombe's quartet.

Historically, one of the attractions of linear regression has been that it is easy to calculate: all you need to do is calculate the four sums Σ xᵢ, Σ xᵢ², Σ yᵢ, and Σ xᵢyᵢ, which can be done in a single pass through the data set. Even with moderately sized data sets (dozens of points), this is arguably easier than plotting them using paper and pencil! However, that argument simply does not hold anymore: graphs are easy to produce on a computer and contain so much more information than a set of regression coefficients that they should be the preferred way to analyze, understand, and summarize data.

Remember: The purpose of computing is insight, not numbers! (R. W. Hamming)

Showing What's Important

Perhaps this is a good time to express what I believe to be the most important principle in graphical analysis:

Plot the pertinent quantities!

As obvious as it may appear, this principle is often overlooked in practice. For example, if you look through one of those books that show and discuss examples of poor graphics, you will find that most examples fall into one of two classes. First, there are those graphs that failed visually, with garish fonts, unhelpful symbols, and useless embellishments. (These are mostly presentation graphics gone wrong, not examples of bad graphical analysis.) The second large class of graphical failures consists of those plots that failed conceptually or, one might better say, analytically.
The problem with these is not in the technical aspects of drawing the graph but in the conceptual understanding of what the graph is trying to show. These plots displayed something, but they failed to present what was most important or relevant to the question at hand.

The problem, of course, is that usually it is not at all obvious what we want to see, and it is certainly not obvious at the beginning. It usually takes several iterations, while a mental model of the data is forming in your head, to articulate the proper question that a data set is suggesting and to come up with the best way of answering it. This typically involves some form of transformation or manipulation of the data: instead of the raw data, maybe we should show the difference between two data sets. Or the residual after subtracting a trend or after subtracting the results from a model. Or perhaps we need to normalize data sets from different sources by subtracting their means and dividing by their spreads. Or maybe we should not use the original variables to display the data but instead apply some form of transformation on them (logarithmic scales are only the simplest example of such transformations). Whatever we choose to do, it will typically involve some form of transformation of the data—it's rarely the raw data that is most interesting; but any deviation from the expected is almost always an interesting discovery.

Very roughly, I think we can identify a three-step (maybe four-step) process. It should be taken not in the sense of a prescriptive checklist but rather in the sense of a gradual process of learning and discovery.

First: The basics. Initially, we are mostly concerned with displaying what is there.
• Select proper ranges.
• Subtract a constant offset.
• Decide whether to use symbols (for scattered data), lines (for continuous data), or perhaps both (connecting individual symbols can help emphasize trends in sparse data sets).

Second: The appearance. Next, we work with aspects of the plot that influence its overall appearance.
• Log plots.
• Add a smoothed curve.
• Consider banking.

Third: Build a model. At this point, we start building a mathematical model and compare it against the raw data. The comparison often involves finding the differences between the model and the data (typically subtracting the model or forming a ratio).
• Subtract a trend.
• Form the ratio to a base value or baseline.
• Rescale a set of curves to collapse them onto each other.

Fourth (for presentation graphics only): Add embellishments. Embellishments and decorations (labels, arrows, special symbols, explanations, and so on) can make a graph much more informative and self-explanatory. However, they are intended for an audience beyond the actual creator of the graph. You will rarely need them during the analysis phase, when you are trying to find out something new about the data set, but they are an essential part when presenting your results. This step should only occur if you want to communicate your results to a wider and more general audience.

Graphical Analysis and Presentation Graphics

I have used the terms graphical analysis and presentation graphics without explaining them properly. In short:

Graphical analysis
Graphical analysis is an investigation of data using graphical methods. The purpose is the discovery of new information about the underlying data set. In graphical analysis, the proper question to ask is often not known at the outset but is discovered as part of the analysis.

Presentation graphics
Presentation graphics are concerned with the communication of information and results that are already understood.
The discovery has been made, and now it needs to be communicated clearly.

The distinction between these two activities is important, because they require different techniques and yield different work products. During the analysis process, convenience and ease of use are the predominant concerns—any amount of polishing is too much! Nothing should keep you from redrawing a graph, changing some aspect of it, zooming in or out, applying transformations, and changing styles. (When working with a data set I haven't seen before, I probably create dozens of graphs within a few minutes—basically, "looking at the data from all angles.") At this stage, any form of embellishment (labels, arrows, special symbols) is inappropriate—you know what you are showing, and creating any form of decoration on the graph will only make you more reluctant to throw the graph away and start over.

For presentation graphics, the opposite applies. Now you already know the results, but you would like to communicate them to others. Textual information therefore becomes very important: how else will people know what they are looking at? You can find plenty of advice elsewhere on how to prepare "good" presentation graphics—often strongly worded and with an unfortunate tendency to use emotional responses (ridicule or derision) in place of factual arguments. In the absence of good empirical evidence one way or the other, I will not add to the discussion. But I present a checklist below, mentioning some points that are often overlooked when preparing graphs for presentation:

• Try to make the text self-explanatory. Don't rely on a (separate) caption for basic information—it might be removed during reproduction. Place basic information on the graph itself.
• Explain what is plotted on the axes. This can be done with explicit labels on the axes or through explanatory text elsewhere. Don't forget the units!
• Make labels self-explanatory.
Be careful with nonstandard abbreviations. Ask yourself: If this is all the context provided, are you certain that the reader will be able to figure out what you mean? (In a recent book on data graphing, I found a histogram labeled Married, Nvd, Dvd, Spd, and Wdd. I could figure out most of them, because at least Married was given in long form, but I struggled with Nvd for quite a while!)
• Given how important text is on a graph, make sure to pick a suitable font. Don't automatically rely on the default provided by your plotting software. Generally, sans-serif fonts (such as Helvetica) are preferred for short labels, such as those on a graph, whereas serif fonts (such as Times) are more suitable for body text. Also pick an appropriate size—text fonts on graphics are often too large, making them look garish. (Most text fonts are used at 10-point to 12-point size; there is no need for type on graphics to be much larger.)
• If there are error bars, be sure to explain their meaning. What are they: standard deviations, inter-quartile ranges, or the limits of the experimental apparatus? Also, choose an appropriate measure of uncertainty. Don't use standard deviations for highly skewed data.
• Don't forget the basics. Choose appropriate plot ranges. Make sure that data is not unnecessarily obscured by labels.
• Proofread graphs! Common errors include: typos in textual labels, interchanged data sets or switched labels, missing units, and incorrect order-of-magnitude qualifiers (e.g., milli- versus micro-).
• Finally, choose an appropriate output format for your graph! Don't use bitmap formats (GIF, JPG, PNG) for print publication—use a scalable format such as PostScript or PDF.

One last piece of advice: creating good presentation graphics is also a matter of taste, and taste can be acquired. If you want to work with data, then you should develop an interest in graphs—not just the ones you create yourself, but all that you see.
If you notice one that seems to work (or not), take a moment to figure out what makes it so. Are the lines too thick? The labels too small? The choice of colors just right? The combination of curves helpful? Details matter.

Workshop: matplotlib

The matplotlib module is a Python module for creating two-dimensional xy plots, scatter plots, and other plots typical of scientific applications. It can be used in an interactive session (with the plots being shown immediately in a GUI window) or from within a script to create graphics files using common graphics file formats.

Let's first look at some examples to demonstrate how matplotlib can be used from within an interactive session. Afterward, we will take a closer look at the structure of the library and give some pointers for more detailed investigations.

Using matplotlib Interactively

To begin an interactive matplotlib session, start IPython (the enhanced interactive Python shell) with the -pylab option, entering the following command at the shell prompt:

ipython -pylab

This will start IPython, load matplotlib and NumPy, and import both into the global namespace. The idea is to give a Matlab-like experience of interactive graphics together with numerical and matrix operations. (It is important to use IPython here—the flow of control between the Python command interpreter and the GUI event loop for the graphics windows requires it. Other interactive shells can be used, but they may require some tinkering.)

We can now create plots right away:

In [1]: x = linspace( 0, 10, 100 )

In [2]: plot( x, sin(x) )
Out[2]: [<matplotlib.lines.Line2D object at 0x...>]

This will pop up a new window, showing a graph like the one in Figure 3-16 but decorated with some GUI buttons. (Note that the sin() function is a ufunc from the NumPy package: it takes a vector and returns a vector of the same size, having applied the sine function to each element in the input vector.
See the Workshop in Chapter 2.) We can now add additional curves and decorations to the plot. Continuing in the same session as before, we add another curve and some labels:

In [3]: plot( x, 0.5*cos(2*x) )
Out[3]: [<matplotlib.lines.Line2D object at 0x...>]

In [4]: title( "A matplotlib plot" )
Out[4]: <matplotlib.text.Text object at 0x...>

In [5]: text( 1, -0.8, "A text label" )
Out[5]: <matplotlib.text.Text object at 0x...>

In [6]: ylim( -1.1, 1.1 )
Out[6]: (-1.1000000000000001, 1.1000000000000001)

FIGURE 3-16. A simple matplotlib figure (see text).

In the last step, we increased the range of values plotted on the vertical axis. (There is also an axis() command, which allows you to specify limits for both axes at the same time. Don't confuse it with the axes() command, which creates a new coordinate system.) The plot should now look like the one in Figure 3-17, except that in an interactive terminal the different lines are distinguished by their color, not their dash pattern.

Let's pause for a moment and point out a few details. First of all, you should have noticed that the graph in the plot window was updated after every operation. That is typical for the interactive mode, but it is not how matplotlib works in a script: in general, matplotlib tries to delay the (possibly expensive) creation of an actual plot until the last possible moment. (In a script, you would use the show() command to force generation of an actual plot window.)

Furthermore, matplotlib is "stateful": a new plot command does not erase the previous figure and, instead, adds to it. This behavior can be toggled with the hold() command, and the current state can be queried using ishold(). (Decorations like the text labels are not affected by this.) You can clear a figure explicitly using clf(). This implicit state may come as a surprise: haven't we learned to make things explicit, when possible? In fact, this stateful behavior is a holdover from the way Matlab works.

Here is another example.
Start a new session and execute the following commands:

In [1]: x1 = linspace( 0, 10, 40 )

In [2]: plot( x1, sqrt(x1), 'k-' )
Out[2]: [<matplotlib.lines.Line2D object at 0x...>]

FIGURE 3-17. The plot from Figure 3-16 with an additional curve and some decorations added.

In [3]: figure(2)
Out[3]: <matplotlib.figure.Figure object at 0x...>

In [4]: x2 = linspace( 0, 10, 100 )

In [5]: plot( x1, sin(x1), 'k--', x2, 0.2*cos(3*x2), 'k:' )
Out[5]: [<matplotlib.lines.Line2D object at 0x...>, <matplotlib.lines.Line2D object at 0x...>]

In [6]: figure(1)
Out[6]: <matplotlib.figure.Figure object at 0x...>

In [7]: plot( x1, 3*exp(-x1/2), linestyle='None', color='white', marker='o',
   ...: markersize=7 )
Out[7]: [<matplotlib.lines.Line2D object at 0x...>]

In [8]: savefig( 'graph1.png' )

This snippet of code demonstrates several things. We begin as before, by creating a plot. This time, however, we pass a third argument to the plot() command that controls the appearance of the graph elements. The matplotlib library supports Matlab-style mnemonics for plot styles; the letter k stands for the color "black" and the single dash - for a solid line. (The letter b stands for "blue.")

Next we create a second figure in a new window and switch to it by using the figure(2) command. All graphics commands will now be directed to this second figure—until we switch back to the first figure using figure(1). This is another example of "silent state." Observe also that figures are counted starting from 1, not from 0.

In line 5, we see another way to use the plot command—namely, by specifying two sets of curves to be plotted together. (The formatting commands request a dashed and a dotted line, respectively.) Line 7 shows yet a different way to specify plot styles: by using named (keyword) arguments. Finally, we save the currently active plot (i.e., figure 1) to a PNG file. The savefig() function determines the desired output format from the extension of the filename given. Other formats that are supported out of the box are PostScript, PDF, and SVG.
Additional formats may be available, depending on the libraries installed on your system.

Case Study: LOESS with matplotlib

As a quick example of how to put the different aspects of matplotlib together, let's discuss the script used to generate Figure 3-4. This also gives us an opportunity to look at the LOESS method in a bit more detail.

To recap: LOESS stands for locally weighted linear regression. The difference between LOESS and regular linear regression is the introduction of a weight factor, which emphasizes those data points that are close to the location x at which we want to evaluate the smoothed curve. As explained earlier, the expression for the squared error (which we want to minimize) now becomes:

\chi^2(x) = \sum_i w(x - x_i; h) \, (a + b x_i - y_i)^2

Keep in mind that this expression now depends on x, the location at which we want to evaluate the smoothed curve! If we minimize this expression with respect to the parameters a and b, we obtain the following expressions for a and b (remember that we will have to evaluate them from scratch for every point x):

b = \frac{\sum_i w_i \sum_i w_i x_i y_i - \sum_i w_i x_i \sum_i w_i y_i}{\sum_i w_i \sum_i w_i x_i^2 - \left( \sum_i w_i x_i \right)^2}

a = \frac{\sum_i w_i y_i - b \sum_i w_i x_i}{\sum_i w_i}

This can be quite easily translated into NumPy and plotted with matplotlib. The actual LOESS calculation is contained entirely in the function loess(). (See the Workshop in Chapter 2 for a discussion of this type of programming.)
from pylab import *

# x: location; h: bandwidth; xp, yp: data points (vectors)
def loess( x, h, xp, yp ):
    w = exp( -0.5*(((x-xp)/h)**2) )/sqrt(2*pi*h**2)

    b = sum(w*xp)*sum(w*yp) - sum(w)*sum(w*xp*yp)
    b /= sum(w*xp)**2 - sum(w)*sum(w*xp**2)

    a = ( sum(w*yp) - b*sum(w*xp) )/sum(w)

    return a + b*x

d = loadtxt( "draftlottery" )

s1, s2 = [], []
for k in d[:,0]:
    s1.append( loess( k, 5, d[:,0], d[:,1] ) )
    s2.append( loess( k, 100, d[:,0], d[:,1] ) )

xlabel( "Day in Year" )
ylabel( "Draft Number" )

gca().set_aspect( 'equal' )

plot( d[:,0], d[:,1], 'o', color="white", markersize=7, linewidth=3 )
plot( d[:,0], array(s1), 'k-', d[:,0], array(s2), 'k--' )

q = 4
axis( [1-q, 366+q, 1-q, 366+q] )

savefig( "draftlottery.eps" )

We evaluate the smoothed curve at the locations of all data points, using two different values for the bandwidth, and then proceed to plot the data together with the smoothed curves. Two details require an additional word of explanation. The function gca() returns the current "set of axes" (i.e., the current coordinate system on the plot—see below for more information on this function), and we require the aspect ratio of both x and y axes to be equal (so that the plot is a square). In the last command before we save the figure to file, we adjust the plot range by using the axis() command. This function must follow the plot() commands, because the plot() command automatically adjusts the plot range depending on the data.

Managing Properties

Until now, we have ignored the values returned by the various plotting commands. If you look at the output generated by IPython, you can see that all the commands that add graph elements to the plot return a reference to the object just created. The one exception is the plot() command itself, which always returns a list of objects (because, as we have seen, it can add more than one "line" to the plot).
These references are important because it is through them that we can control the appearance of graph elements once they have been created. In a final example, let's study how we can use them:

In [1]: x = linspace( 0, 10, 100 )

In [2]: ps = plot( x, sin(x), x, cos(x) )

In [3]: t1 = text( 1, -0.5, "Hello" )

In [4]: t2 = text( 3, 0.5, "Hello again" )

In [5]: t1.set_position( [7, -0.5] )

In [6]: t2.set( position=[5, 0], text="Goodbye" )
Out[6]: [None, None]

In [7]: draw()

In [8]: setp( [t1, t2], fontsize=10 )
Out[8]: [None, None]

In [9]: t2.remove()

In [10]: Artist.remove( ps[1] )

In [11]: draw()

In the first four lines, we create a graph with two curves and two text labels, as before, but now we are holding on to the object references. This allows us to make changes to these graph elements. Lines 5, 6, and 8 demonstrate different ways to do this: for each property of a graph element, there is an explicit, named accessor function (line 5). Alternatively, we can use a generic setter with keyword arguments—this allows us to set several properties (on a single object) in a single call (line 6). Finally, we can use the standalone setp() function, which takes a list of graph elements and applies the requested property update to all of them. (It can also take a single graph element instead of a one-member list.) Notice that setp() generates a redraw event whereas individual property accessors do not; this is why we must generate an explicit redraw event in line 7. (If you are confused by the apparent duplication of functionality, read on: we will come back to this point in the next section.)

Finally, we remove one of the text labels and one of the curves by using the remove() function. The remove() function is defined for objects that are derived from the Artist class, so we can invoke it using either member syntax (as a "bound" function, line 9) or the class syntax (as an "unbound" function, line 10).
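The same property manipulations also work outside an interactive session. Here is a minimal, self-contained sketch (the Agg backend and the output file name are my own additions, chosen so the example runs without a GUI window; they are not part of the session above):

```python
import matplotlib
matplotlib.use("Agg")                       # off-screen backend (an assumption for this sketch)
from pylab import *

x = linspace(0, 10, 100)
ps = plot(x, sin(x), x, cos(x))             # plot() returns a list of Line2D objects

t1 = text(1, -0.5, "Hello")
t2 = text(3, 0.5, "Hello again")

t1.set_position([7, -0.5])                  # named accessor for a single property
t2.set(position=[5, 0], text="Goodbye")     # generic setter: several properties at once
setp([t1, t2], fontsize=10)                 # standalone setp() applies to a whole list

t2.remove()                                 # remove one text label ...
ps[1].remove()                              # ... and the second curve

savefig("properties.png")
```

In a script there is no need for draw(): savefig() triggers the rendering itself.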
Keep in mind that plot() returns a list of objects, so we need to index into the list to access the graph objects themselves.

There are some useful functions that can help us handle object properties. If you issue setp(r) with only a single argument in an interactive session, then it will print all properties that are available for object r together with information about the values that each property is allowed to take on. The getp(r) function, on the other hand, prints all properties of r together with their current values.

Suppose we did not save the references to the objects we created, or suppose we want to change the properties of an object that we did not create explicitly. In such cases we can use the functions gcf() and gca(), which return a reference to the current figure or axes object, respectively. To make use of them, we need to develop at least a passing familiarity with matplotlib's object model.

The matplotlib Object Model and Architecture

The object model for matplotlib is constructed similarly to the object model for a GUI widget set: a plot is represented by a tree of widgets, and each widget is able to render itself. Perhaps surprisingly, the object model is not flat. In other words, the plot elements (such as axes, labels, arrows, and so on) are not properties of a high-level "plot" or "figure" object. Instead, you must descend down the object tree to find the element that you want to modify and then, once you have an explicit reference to it, change the appropriate property on the element.

The top-level element (the root node of the tree) is an object of class Figure. A figure contains one or more Axes objects: this class represents a "coordinate system" on which actual graph elements can be placed. (By contrast, the actual axes that are drawn on the graph are objects of the Axis class!)
The gcf() and gca() functions therefore return a reference to the root node of the entire figure or to the root node of a single plot in a multiplot figure.

Both Figure and Axes are subclasses of Artist. This is the base class of all "widgets" that can be drawn onto a graph. Other important subclasses of Artist are Line2D (a polygonal line connecting multiple points, optionally with a symbol at each point), Text, and Patch (a geometric shape that can be placed onto the figure).

The top-level Figure instance is owned by an object of type FigureCanvas (in the matplotlib.backend_bases module). Most likely you won't have to interact with this class yourself directly, but it provides the bridge between the (logical) object tree that makes up the graph and a backend, which does the actual rendering. Depending on the backend, matplotlib creates either a file or a graph window that can be used in an interactive GUI session.

Although it is easy to get started with matplotlib from within an interactive session, it can be quite challenging to really get one's arms around the whole library. This can become painfully clear when you want to change some tiny aspect of a plot—and can't figure out how to do that. As is so often the case, it helps to investigate how things came to be.

Originally, matplotlib was conceived as a plotting library to emulate the behavior found in Matlab. Matlab traditionally uses a programming model based on functions and, being 30 years old, employs some conventions that are no longer popular (e.g., implicit state). In contrast, matplotlib was implemented using object-oriented design principles in Python, with the result that these two different paradigms clash. One consequence of having these two different paradigms side by side is redundancy.
Many operations can be performed in several different ways (using standalone functions, Python-style keyword arguments, object attributes, or a Matlab-compatible alternative syntax). We saw examples of this redundancy in the third listing when we changed object properties. This duplication of functionality matters because it drastically increases the size of the library's interface (its application programming interface or API), which makes it that much harder to develop a comprehensive understanding. What is worse, it tends to spread information around. (Where should I be looking for plot attributes—among functions, among members, among keyword attributes? Answer: everywhere!)

Another consequence is inconsistency. At least in its favored function-based interface, matplotlib uses some conventions that are rather unusual for Python programming—for instance, the way a figure is created implicitly at the beginning of every example, and how the pointer to the current figure is maintained through an invisible "state variable" that is opaquely manipulated using the figure() function. (The figure() function actually returns the figure object just created, so the invisible state variable is not even necessary.) Similar surprises can be found throughout the library.

A last problem is namespace pollution (this is another Matlab heritage—they didn't have namespaces back then). Several operations included in matplotlib's function-based interface are not actually graphics related but do generate plots as side effects. For example, hist() calculates (and plots) a histogram, acorr() calculates (and plots) an autocorrelation function, and so on. From a user's perspective, it makes more sense to adhere to a separation of tasks: perform all calculations in NumPy/SciPy, and then pass the results explicitly to matplotlib for plotting.

Odds and Ends

There are three different ways to import and use matplotlib.
The original method was to enter:

from pylab import *

This would load all of NumPy as well as matplotlib and import both APIs into the global namespace! This is no longer the preferred way to use matplotlib. Only for interactive use with IPython is it still required (using the -pylab command-line option to IPython). The recommended way to import matplotlib's function-based interface together with NumPy is by using:

import matplotlib.pyplot as plt
import numpy as np

The pyplot interface is a function-based interface that uses the same Matlab-like stateful conventions that we have seen in the examples of this section; however, it does not include the NumPy functions. Instead, NumPy must be imported separately (and into its own namespace). Finally, if all you want is the object-oriented API to matplotlib, then you can import just the explicit modules from within matplotlib that contain the class definitions you need (although it is customary to import pyplot instead and thereby obtain access to the whole collection).

Of course, there are many details that we have not discussed. Let me mention just a few:

• Many more options (to configure the axes and tick marks, or to add legends or arrows).
• Additional plot types (density or "false-color" plots, vector plots, polar plots).
• Digital image processing—matplotlib can read and manipulate PNG images and can also call into the Python Imaging Library (PIL) if it is installed.
• Matplotlib can be embedded in a GUI and can handle GUI events.

The Workshop of Chapter 4 contains another example that involves matplotlib being called from a script to generate image files.

Further Reading

In addition to the books listed below, you may check the references in Chapter 10 for additional material on linear regression.

• The Elements of Graphing Data. William S. Cleveland. 2nd ed., Hobart Press, 1994.
This is probably the definitive reference on graphical analysis (as opposed to presentation graphics). Cleveland is the inventor of both the LOESS and the banking techniques discussed in this chapter. My own thinking has been influenced strongly by Cleveland's careful approach. A companion volume by the same author, entitled Visualizing Data, is also available.
• Exploratory Data Analysis with MATLAB. Wendy L. Martinez and Angel R. Martinez. Chapman & Hall/CRC, 2004. This is an interesting book—it covers almost the same topics as the book you are reading but in opposite order, starting with dimensionality reduction and clustering techniques and ending with univariate distributions! Because it demonstrates all techniques by way of Matlab, it does not develop the conceptual background in great depth. However, I found the chapter on smoothing to be quite useful.

CHAPTER FOUR

Time As a Variable: Time-Series Analysis

If we follow the variation of some quantity over time, we are dealing with a time series. Time series are incredibly common: examples range from stock market movements to the tiny icon that constantly displays the CPU utilization of your desktop computer for the previous 10 seconds.

What makes time series so common and so important is that they allow us to see not only a single quantity by itself but at the same time give us the typical "context" for this quantity. Because we have not only a single value but a bit of history as well, we can recognize any changes from the typical behavior particularly easily.

On the face of it, time-series analysis is a bivariate problem (see Chapter 3). Nevertheless, we are dedicating a separate chapter to this topic. Time series raise a different set of issues than many other bivariate problems, and a rather specialized set of methods has been developed to deal with them.
Examples

To get started, let's look at a few different time series to develop a sense for the scope of the task. Figure 4-1 shows the concentration of carbon dioxide (CO2) in the atmosphere, as measured by the observatory on Mauna Loa on Hawaii, recorded at monthly intervals since 1959.

This data set shows two features we often find in a time-series plot: trend and seasonality. There is clearly a steady, long-term growth in the overall concentration of CO2; this is the trend. In addition, there is also a regular periodic pattern; this is the seasonality. If we look closely, we see that the period in this case is exactly 12 months, but we will use the term "seasonality" for any regularly recurring feature, regardless of the length of the period.

FIGURE 4-1. Trend and seasonality: the concentration of CO2 (in parts per million) in the atmosphere as measured by the observatory on Mauna Loa, Hawaii, at monthly intervals.

We should also note that the trend, although smooth, does appear to be nonlinear, and in itself may be changing over time.

Figure 4-2 displays the concentration of a certain gas in the exhaust of a gas furnace over time. In many ways, this example is the exact opposite of the previous example. Whereas the data in Figure 4-1 showed a lot of regularity and a strong trend, the data in Figure 4-2 shows no trend but a lot of noise.

Figure 4-3 shows the dramatic drop in the cost of a typical long-distance phone call in the U.S. over the last century. The strongly nonlinear trend is obviously the most outstanding feature of this data set. As with many growth or decay processes, we may suspect an exponential time development; in fact, in a semi-logarithmic plot (Figure 4-3, inset) the data follows almost a straight line, confirming our expectation.
Any analysis that fails to account explicitly for this behavior of the original data is likely to lead us astray. We should therefore work with the logarithms of the cost, rather than with the absolute cost.

There are some additional questions that we should ask when dealing with a long-running data set like this. What exactly is a "typical" long-distance call, and has that definition changed over the observation period? Are the costs adjusted for inflation or not? The data itself also begs closer scrutiny. For instance, the uncharacteristically low prices for a couple of years in the late 1970s make me suspicious: are they the result of a clerical error (a typo), or are they real? Did the breakup of the AT&T system have anything to do with these low prices? We will not follow up on these questions here because I am presenting this example only as an illustration of an exponential trend, but any serious analysis of this data set would have to follow up on these questions.

FIGURE 4-2. No trend but relatively smooth variation over time: concentration of a certain gas in a furnace exhaust (in arbitrary units).

FIGURE 4-3. Nonlinear trend: cost of a typical long-distance phone call in the U.S.

Figure 4-4 shows the development of the Japanese stock market as represented by the Nikkei Stock Index over the last 40 years, an example of a time series that exhibits a marked change in behavior. Clearly, whatever was true before New Year's Day 1990 was no longer true afterward.

FIGURE 4-4. Change in behavior: the Nikkei Stock Index over the last 40 years.
(In fact, by looking closely, you can make out a second change in behavior that was more subtle than the bursting of the big Japanese bubble: its beginning, sometime around 1985–1986.)

This data set should serve as a cautionary example. All time-series analysis is based on the assumption that the processes generating the data are stationary in time. If the rules of the game change, then time-series analysis is the wrong tool for the task; instead we need to investigate what caused the break in behavior. More benign examples than the bursting of the Japanese bubble can be found: a change in sales or advertising strategy may significantly alter a company's sales patterns. In such cases, it is more important to inquire about any further plans that the sales department might have, rather than to continue working with data that is no longer representative!

After these examples that have been chosen for their "textbook" properties, let's look at a "real-world" data set. Figure 4-5 shows the number of daily calls placed to a call center for a time period slightly longer than two years. In comparison to the previous examples, this data set has a lot more structure, which makes it hard to determine even basic properties. We can see some high-frequency variation, but it is not clear whether this is noise or has some form of regularity to it. It is also not clear whether there is any sort of regularity on a longer time scale. The amount of variation makes it hard to recognize any further structure. For instance, we cannot tell if there is a longer-term trend in the data. We will come back to this example later in the chapter.

FIGURE 4-5. A real-world data set: number of daily calls placed to a call center.
The data exhibits short- and long-term seasonality, noise, and possibly changes in behavior. Also shown is the result of applying a 31-point Gaussian smoothing filter.

The Task

After this tour of possible time-series scenarios, we can identify the main components of every time series:

• Trend
• Seasonality
• Noise
• Other(!)

The trend may be linear or nonlinear, and we may want to investigate its magnitude. The seasonality pattern may be either additive or multiplicative. In the first case, the seasonal change has the same absolute size no matter what the magnitude of the current baseline of the series is; in the latter case, the seasonal change has the same relative size compared with the current magnitude of the series. Noise (i.e., some form of random variation) is almost always part of a time series. Finding ways to reduce the noise in the data is usually a significant part of the analysis process. Finally, "other" includes anything else that we may observe in a time series, such as particular significant changes in overall behavior, special outliers, missing data—anything remarkable at all.

Given this list of components, we can summarize what it means to "analyze" a time series. We can distinguish three basic tasks:

• Description
• Prediction
• Control

Description attempts to identify components of a time series (such as trend and seasonality or abrupt changes in behavior). Prediction seeks to forecast future values. Control in this context means the monitoring of a process over time with the purpose of keeping it within a predefined band of values—a typical task in many manufacturing or engineering environments. We can distinguish the three tasks in terms of the time frame they address: description looks into the past, prediction looks to the future, and control concentrates on the present.
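The components just listed can be made concrete with a small synthetic series. The following sketch (the numbers are entirely invented for illustration) builds ten "years" of monthly data from a linear trend, a 12-step seasonal pattern in both additive and multiplicative form, and Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(120)                                 # ten "years" of monthly observations

trend = 100 + 0.5 * t                              # linear trend
seasonal = 10 * np.sin(2 * np.pi * t / 12)         # seasonality with a 12-step period
noise = rng.normal(0, 2, size=t.size)              # random variation

# Additive seasonality: the swing has the same absolute size everywhere.
additive = trend + seasonal + noise

# Multiplicative seasonality: the swing scales with the current level of the series.
multiplicative = trend * (1 + 0.1 * np.sin(2 * np.pi * t / 12)) + noise
```

Plotting the two series side by side shows the difference immediately: the additive series oscillates within a band of constant width around the trend, while the oscillations of the multiplicative series widen as the trend grows.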
Requirements and the Real World

Most standard methods of time-series analysis make a number of assumptions about the underlying data.

• Data points have been taken at equally spaced time steps, with no missing data points.
• The time series is sufficiently long (50 points are often considered an absolute minimum).
• The series is stationary: it has no trend, no seasonality, and the character (amplitude and frequency) of any noise does not change with time.

Unfortunately, most of these assumptions will be more or less violated by any real-world data set that you are likely to encounter. Hence you may have to perform a certain amount of data cleaning before you can apply the methods described in this chapter. If the data has been sampled at irregular time steps or if some of the data points are missing, then you can try to interpolate the data and resample it at equally spaced intervals.

Time series obtained from electrical systems or scientific experiments can be almost arbitrarily long, but most series arising in a business context will be quite short and may contain no more than two dozen data points. The exponential smoothing methods introduced in the next section are relatively robust even for relatively short series, but somewhere there is a limit. Three or four data points don't constitute a series!

Finally, most interesting series will not be stationary in the sense of the definition just given, so we may have to identify and remove trend and seasonal components explicitly (we'll discuss how to do that later). Drastic changes in the nature of the series also violate the stationarity condition. In such cases we must not continue blindly but instead deal with the break in the data—for example, by treating the data set as two different series (one before and one after the event).

Smoothing

An important aspect of most time series is the presence of noise—that is, random (or apparently random) changes in the quantity of interest.
Noise occurs in many real-world data sets, but we can often reduce the noise by improving the apparatus used to measure the data or by collecting a larger sample and averaging over it. But the particular structure of time series makes this impossible: the sales figures for the last 30 days are fixed, and they constitute all the data we have. This means that removing noise, or at least reducing its influence, is of particular importance in time-series analysis. In other words, we are looking for ways to smooth the signal.

FIGURE 4-6. Simple and Gaussian-weighted moving averages: the weighted average is less affected by sudden jumps in the data.

Running Averages

The simplest smoothing algorithm that we can devise is the running (moving, or floating) average. The idea is straightforward: for any odd number of consecutive points, replace the centermost value with the average of all the points in the window (here, the x_i are the data points and s_i is the smoothed value at position i):

    s_i = \frac{1}{2k+1} \sum_{j=-k}^{k} x_{i+j}

This naive approach has a serious problem, as you can see in Figure 4-6. The figure shows the original signal together with the 11-point moving average. Unfortunately, the signal has some sudden jumps and occasional large "spikes," and we can see how the smoothed curve is affected by these events: whenever a spike enters the smoothing window, the moving average is abruptly distorted by the single, uncommonly large value until the outlier leaves the smoothing window again—at which point the floating average equally abruptly drops again.

We can avoid this problem by using a weighted moving average, which places less weight on the points at the edge of the smoothing window.
Using such a weighted average, any new point that enters the smoothing window is only gradually added to the average and then gradually removed again:

    s_i = \sum_{j=-k}^{k} w_j x_{i+j}   where   \sum_{j=-k}^{k} w_j = 1

Here the w_j are the weighting factors. For example, for a 3-point weighted moving average, we might use (1/4, 1/2, 1/4). The particular choice of weight factors is not very important provided they are peaked at the center, drop toward the edges, and add up to 1. I like to use the Gaussian function

    f(x, σ) = \frac{1}{\sqrt{2πσ^2}} \exp\left( -\frac{1}{2} \left( \frac{x}{σ} \right)^2 \right)

to build smoothing weight factors. The parameter σ in the Gaussian controls the width of the curve, and the function is essentially zero for values of x larger than about 3.5σ. Hence f(x, 1) can be used to build a 9-point kernel by evaluating it at the positions [−4, −3, −2, −1, 0, 1, 2, 3, 4]. Setting σ = 2, we can form a 15-point kernel by evaluating the Gaussian for all integer arguments between −7 and +7. And so on.

Exponential Smoothing

All moving-average schemes have a number of problems:

• They are painful to evaluate. For each point, the calculation has to be performed from scratch. It is not possible to evaluate weighted moving averages by updating a previous result.
• Moving averages can never be extended to the true edge of the available data set, because of the finite width of the averaging window. This is especially problematic because often it is precisely the behavior at the leading edge of a data set that we are most interested in.
• Similarly, moving averages are not defined outside the range of the existing data set. As a consequence, they are of no use in forecasting.

Fortunately, there exists a very simple calculational scheme that avoids all of these problems. It is called exponential smoothing or the Holt–Winters method.
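Following this recipe, a sketch of a Gaussian-weighted moving average in NumPy (the signal below is synthetic, and the helper names are mine):

```python
import numpy as np

def gaussian_weights(k, sigma):
    """Weights from the Gaussian f(x, sigma), evaluated at -k..k and normalized to sum to 1."""
    x = np.arange(-k, k + 1)
    w = np.exp(-0.5 * (x / sigma) ** 2)
    return w / w.sum()

def weighted_moving_average(data, weights):
    """Smooth 'data' with the given symmetric weights. The result is shorter than the
    input by the filter length minus one, since the window must fit entirely inside."""
    return np.convolve(data, weights, mode='valid')

# Example: a noisy sine signal (illustrative only).
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.3 * rng.normal(size=200)

w = gaussian_weights(4, 1.0)                    # 9-point kernel with sigma = 1
smooth = weighted_moving_average(signal, w)
```

Because the weights are symmetric, it does not matter that convolution reverses the filter; for asymmetric filters it would.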
There are various forms of exponential smoothing: single exponential smoothing for series that have neither trend nor seasonality, double exponential smoothing for series exhibiting a trend but no seasonality, and triple exponential smoothing for series with both trend and seasonality. The term "Holt–Winters method" is sometimes reserved for triple exponential smoothing alone.

All exponential smoothing methods work by updating the result from the previous time step using the new information contained in the data of the current time step. They do so by "mixing" the new information with the old, and the relative weight of old and new information is controlled by an adjustable mixing parameter. The various methods differ in terms of the number of quantities they track and the corresponding number of mixing parameters.

The recurrence relation for single exponential smoothing is particularly simple:

    s_i = α x_i + (1 − α) s_{i−1}   with   0 ≤ α ≤ 1

Here s_i is the smoothed value at time step i, and x_i is the actual (unsmoothed) data at that time step. You can see how s_i is a mixture of the raw data and the previous smoothed value s_{i−1}. The mixing parameter α can be chosen anywhere between 0 and 1, and it controls the balance between new and old information: as α approaches 1, we retain only the current data point (i.e., the series is not smoothed at all); as α approaches 0, we retain only the smoothed past (i.e., the curve is totally flat).

Why is this method called "exponential" smoothing? To see this, simply expand the recurrence relation:

    s_i = α x_i + (1 − α) s_{i−1}
        = α x_i + (1 − α) [ α x_{i−1} + (1 − α) s_{i−2} ]
        = α x_i + (1 − α) { α x_{i−1} + (1 − α) [ α x_{i−2} + (1 − α) s_{i−3} ] }
        = α [ x_i + (1 − α) x_{i−1} + (1 − α)^2 x_{i−2} ] + (1 − α)^3 s_{i−3}
        = ...
        = α \sum_{j=0}^{i} (1 − α)^j x_{i−j}

What this shows is that in exponential smoothing, all previous observations contribute to the smoothed value, but their contribution is suppressed by increasing powers of the factor (1 − α).
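The recurrence translates directly into a few lines of code; a sketch, using s_0 = x_0 as the start-up value (start-up choices are discussed later in this section) and a made-up data list:

```python
def single_exponential_smoothing(x, alpha):
    """Return the smoothed series s with s[i] = alpha*x[i] + (1-alpha)*s[i-1],
    using s[0] = x[0] as the start-up value."""
    s = [x[0]]
    for value in x[1:]:
        s.append(alpha * value + (1.0 - alpha) * s[-1])
    return s

# With alpha = 1 the smoothed series reproduces the data exactly;
# with alpha close to 0 it stays nearly flat.
data = [10, 12, 11, 15, 14, 13]
smoothed = single_exponential_smoothing(data, 0.3)
```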
That observations further in the past are suppressed multiplicatively is characteristic of exponential behavior. In a way, exponential smoothing is like a floating average with infinite memory but with exponentially falling weights. (Also observe that the sum of the weights, \sum_j α (1 − α)^j, equals 1 as required, by virtue of the geometric series \sum_i q^i = 1/(1 − q) for |q| < 1. See Appendix B for information on the geometric series.)

The results of the single exponential smoothing procedure can be extended beyond the end of the data set and thereby used to make a forecast. The forecast is extremely simple:

    x_{i+h} = s_i

where s_i is the last calculated value. In other words, single exponential smoothing yields a forecast that is absolutely flat for all times.

Single exponential smoothing as just described works well for time series without an overall trend. In the presence of an overall trend, however, the smoothed values tend to lag behind the raw data unless α is chosen close to 1; but in that case the resulting curve is not sufficiently smoothed.

Double exponential smoothing corrects for this shortcoming by retaining explicit information about the trend. In other words, we maintain and update the state of two quantities: the smoothed signal and the smoothed trend. There are two equations and two mixing parameters:

    s_i = α x_i + (1 − α)(s_{i−1} + t_{i−1})
    t_i = β (s_i − s_{i−1}) + (1 − β) t_{i−1}

Let's look at the second equation first. This equation describes the smoothed trend. The current unsmoothed "value" of the trend is calculated as the difference between the current and the previous smoothed signal; in other words, the current trend tells us how much the smoothed signal changed in the last step. To form the smoothed trend, we perform a simple exponential smoothing process on the trend, using the mixing parameter β.
To obtain the smoothed signal, we perform a similar mixing as before, but now we take not only the previous smoothed signal but also the trend into account. The last term in the first equation is the best guess for the current smoothed signal—assuming we followed the previous trend for a single time step.

To turn this result into a forecast, we take the last smoothed value and, for each additional time step, keep adding the last smoothed trend to it:

    x_{i+h} = s_i + h t_i

Finally, for triple exponential smoothing we add yet a third quantity, which describes the seasonality. We have to distinguish between additive and multiplicative seasonality. For the additive case, the equations are:

    s_i = α (x_i − p_{i−k}) + (1 − α)(s_{i−1} + t_{i−1})
    t_i = β (s_i − s_{i−1}) + (1 − β) t_{i−1}
    p_i = γ (x_i − s_i) + (1 − γ) p_{i−k}
    x_{i+h} = s_i + h t_i + p_{i−k+h}

For the multiplicative case, they are:

    s_i = α \frac{x_i}{p_{i−k}} + (1 − α)(s_{i−1} + t_{i−1})
    t_i = β (s_i − s_{i−1}) + (1 − β) t_{i−1}
    p_i = γ \frac{x_i}{s_i} + (1 − γ) p_{i−k}
    x_{i+h} = (s_i + h t_i) p_{i−k+h}

Here, p_i is the "periodic" component, and k is the length of the period. I have also included the expressions for forecasts.

All exponential smoothing methods are based on recurrence relations. This means that we need to fix the start-up values in order to use them. Luckily, the specific choice for these values is not very critical: the exponential damping implies that all exponential smoothing methods have a short "memory," so that after only a few steps, any influence of the initial values is greatly diminished. Some reasonable choices for start-up values are:

    s_0 = x_0   or   s_0 = \frac{1}{n} \sum_{i=1}^{n} x_i   for some small n (say, n = 5, ..., 10)

and:

    t_0 = 0   or   t_0 = x_1 − x_0

For triple exponential smoothing we must provide one full season of values for start-up, but we can simply fill them with 1s (for the multiplicative model) or 0s (for the additive model). Only if the series is short do we need to worry seriously about finding good starting values.
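As a sketch, here are the double exponential smoothing recurrences together with the forecast x_{i+h} = s_i + h·t_i, using the start-up values s_0 = x_0 and t_0 = x_1 − x_0 (the function name and test data are mine):

```python
def double_exponential_smoothing(x, alpha, beta, h=0):
    """Trend-corrected (double) exponential smoothing, with start-up values
    s0 = x[0] and t0 = x[1] - x[0]. Returns the smoothed series plus
    h forecast values computed as s_n + j*t_n for j = 1..h."""
    s, t = x[0], x[1] - x[0]
    smoothed = [s]
    for value in x[1:]:
        s_prev = s
        s = alpha * value + (1.0 - alpha) * (s + t)   # mix data with trend-extended past
        t = beta * (s - s_prev) + (1.0 - beta) * t    # smooth the trend itself
        smoothed.append(s)
    forecast = [s + j * t for j in range(1, h + 1)]
    return smoothed, forecast

# On data with a perfectly linear trend, the smoother tracks the data
# exactly and the forecast continues the line.
data = [2.0, 4.0, 6.0, 8.0, 10.0]
smoothed, forecast = double_exponential_smoothing(data, 0.5, 0.5, h=3)
```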
The last question concerns how to choose the mixing parameters α, β, and γ. My advice is trial and error. Try a few values between 0.2 and 0.4 (very roughly), and see what results you get. Alternatively, you can define a measure for the error (between the actual data and the output of the smoothing algorithm) and then use a numerical optimization routine to minimize this error with respect to the parameters. In my experience, this is usually more trouble than it's worth, for at least two reasons. First, numerical optimization is an iterative process that is not guaranteed to converge, and you may end up spending way too much time coaxing the algorithm to convergence. Second, any such numerical optimization is a slave to the expression you have chosen for the "error" to be minimized. The problem is that the parameter values minimizing that error may not have some other property you want to see in your solution (e.g., regarding the balance between the accuracy of the approximation and the smoothness of the resulting curve), so that, in the end, the manual approach often comes out ahead. However, if you have many series to forecast, then it may make sense to expend the effort and build a system that can determine the optimal parameter values automatically, but it probably won't be easy to really make this work.

Finally, I want to present an example of the kind of results we can expect from exponential smoothing. Figure 4-7 shows a classic data set: the monthly number of international airline passengers (in thousands of passengers).* The graph shows the actual data together with a triple exponential approximation. The years 1949 through 1957 were used to "train" the algorithm, and the years 1958 through 1960 are forecasted. Note how well the forecast agrees with the actual data—especially in light of the strong seasonal pattern—for a rather long forecasting time frame (three full years!). Not bad for a method as simple as this.
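A sketch of the error-minimization idea in its simplest form: define the error as the sum of squared one-step-ahead forecast errors for single exponential smoothing, and scan a grid of α values (the series and the grid below are made up; a proper numerical optimizer could replace the scan, with the caveats noted above):

```python
import numpy as np

def sse(x, alpha):
    """Sum of squared one-step-ahead errors for single exponential smoothing."""
    s, err = x[0], 0.0
    for value in x[1:]:
        err += (value - s) ** 2                  # s is the forecast for this step
        s = alpha * value + (1.0 - alpha) * s
    return err

# Hypothetical short series; pick the grid value of alpha with the smallest error.
x = np.array([3.0, 3.4, 3.1, 3.6, 3.3, 3.8, 3.5, 4.0])
alphas = np.linspace(0.05, 0.95, 19)
best = min(alphas, key=lambda a: sse(x, a))
```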
*This data is available in the "airpass.dat" data set from R. J. Hyndman's Time Series Data Library at http://www.robjhyndman.com/TSDL.

FIGURE 4-7. Triple exponential smoothing in action: comparison between the raw data (solid line) and the smoothed curve (dashed). For the years after 1957, the dashed curve shows the forecast calculated with only the data available in 1957.

Don't Overlook the Obvious!

On a recent consulting assignment, I was discussing monthly sales numbers with the client when he made the following comment: "Oh, yes, sales for February are always somewhat lower—that's an aftereffect of the Christmas peak." Sales are always lower in February? How interesting. Sure enough, if you plotted the monthly sales numbers for the last few years, there was a rather visible dip from the overall trend every February. But there wasn't much of a Christmas spike! (The client's business was not particularly seasonal.) So why should there be a corresponding dip two months later?

By now I am sure you know the answer already: February is shorter than any of the other months. And it's not a small effect, either: with 28 days, February is about three days shorter than the other months (which have 30–31 days). That's about 10 percent—close to the size of the dip in the client's sales numbers. When monthly sales numbers were normalized by the number of days in the month, the February dip all but disappeared, and the adjusted February numbers were perfectly in line with the rest of the months. (The average number of days per month is 365/12 = 30.4.)

Whenever you are tracking aggregated numbers in a time series (such as weekly, monthly, or quarterly results), make sure that you have adjusted for possible variation in the aggregation time frame.
Besides the number of days in the month, another likely candidate for hiccups is the number of business days in a month (for months with five weekends, you can expect a 20 percent drop for most business metrics). But the problem is, of course, much more general and can occur whenever you are reporting aggregate numbers rather than rates. (If the client had been reporting average sales per day for each month, then there would never have been an anomaly.) This specific problem (i.e., nonadjusted variation in aggregation periods) is a particular concern for all business reports and dashboards. Keep an eye out for it!

The Correlation Function

The autocorrelation function is the primary diagnostic tool for time-series analysis. Whereas the smoothing methods that we have discussed so far deal with the raw data in a very direct way, the correlation function provides us with a rather different view of the same data. I will first explain how the autocorrelation function is calculated and will then discuss what it means and how it can be used.

The basic algorithm works as follows: start with two copies of the data set and subtract the overall average from all values. Align the two sets, and multiply the values at corresponding time steps with each other. Sum up the results for all time steps. The result is the (unnormalized) correlation coefficient at lag 0. Now shift the two copies against each other by a single time step. Again multiply and sum: the result is the correlation coefficient at lag 1. Proceed in this way for the entire length of the time series. The set of correlation coefficients for all lags is the autocorrelation function. Finally, divide all coefficients by the coefficient for lag 0 to normalize the correlation function, so that the coefficient at lag 0 is equal to 1.
All this can be written compactly in a single formula for c(k), the correlation function at lag k:

    c(k) = \frac{ \sum_{i=1}^{N−k} (x_i − μ)(x_{i+k} − μ) }{ \sum_{i=1}^{N} (x_i − μ)^2 }   with   μ = \frac{1}{N} \sum_{i=1}^{N} x_i

Here, N is the number of points in the data set. The formula follows the mathematical convention of starting sequence indexes at 1, rather than the programming convention of starting at 0. Notice that we have subtracted the overall average μ from all values and that the denominator is simply the numerator evaluated at lag k = 0. Figure 4-8 illustrates the process.

FIGURE 4-8. Algorithm to compute the correlation function.

The meaning of the correlation function should be clear. Initially, the two signals are perfectly aligned and the correlation is 1. Then, as we shift the signals against each other, they slowly move out of phase with each other, and the correlation drops. How quickly it drops tells us how much "memory" there is in the data. If the correlation drops quickly, we know that, after a few steps, the signal has lost all memory of its recent past. However, if the correlation drops slowly, then we know that we are dealing with a process that is relatively steady over longer periods of time.

It is also possible that the correlation function first drops and then rises again to form a second (and possibly a third, or fourth, ...) peak. This tells us that the two signals align again if we shift them far enough—in other words, that there is periodicity (i.e., seasonality) in the data set. The position of the secondary peak gives us the number of time steps per season.

Examples

Let's look at a couple of examples. Figure 4-9 shows the correlation function of the gas furnace data in Figure 4-2.
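A direct NumPy transcription of this formula (fine for short series; the Workshop at the end of this chapter shows the scipy.signal route, which is preferable for long series):

```python
import numpy as np

def acf(x, max_lag):
    """Autocorrelation function c(k) for k = 0..max_lag, normalized so that c(0) = 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # subtract the overall average mu
    denom = np.sum(x * x)                 # the numerator at lag 0
    return np.array([np.sum(x[: len(x) - k] * x[k:]) / denom
                     for k in range(max_lag + 1)])

# A pure sine wave realigns with itself after one full period (50 steps here),
# producing a pronounced secondary peak in the correlation function.
t = np.arange(200)
c = acf(np.sin(2 * np.pi * t / 50), 100)
```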
This is a fairly typical correlation function for a time series that has only short-time correlations: the correlation falls quickly, but not immediately, to zero. There is no periodicity; after the initial drop, the correlation function does not exhibit any further significant peaks.

FIGURE 4-9. The correlation function for the exhaust gas data shown in Figure 4-2. The data has only short-time correlations and no seasonality; the correlation function falls quickly (but not immediately) to zero, and there are no secondary peaks.

Figure 4-10 shows the correlation function for the call center data from Figure 4-5. This data set shows a very different behavior. First of all, the time series has a much longer "memory": it takes the correlation function almost 100 days to fall to zero, indicating that the frequency of calls to the call center changes more or less once per quarter but not more frequently. The second notable feature is the pronounced secondary peak at a lag of 365 days. In other words, the call center data is highly seasonal and repeats itself on a yearly basis. The third feature is the small but regular sawtooth structure. If we look closely, we will find that the first peak of the sawtooth is at a lag of 7 days and that all repeating ones occur at multiples of 7. This is the signature of the high-frequency component that we could see in Figure 4-5: the traffic to the call center exhibits a secondary seasonal component with 7-day periodicity. In other words, traffic is weekday dependent (which is not too surprising).

Implementation Issues

So far I have talked about the correlation function mostly from a conceptual point of view. If we want to proceed to an actual implementation, there are some fine points we need to worry about.

The autocorrelation function is intended for time series that do not exhibit a trend and have zero mean.
Therefore, if the series we want to analyze does contain a trend, then we must remove it first. There are two ways to do this: we can either subtract the trend, or we can difference the series.

FIGURE 4-10. The correlation function for the call center data shown in Figure 4-5. There is a secondary peak after exactly 365 days, as well as a smaller weekly structure to the data.

Subtracting the trend is straightforward—the only problem is that we need to determine the trend first! Sometimes we may have a "model" for the expected behavior and can use it to construct an explicit expression for the trend. For instance, the airline passenger data from the previous section describes a growth process, and so we should suspect an exponential trend (a exp(x/b)). We can now try guessing values for the two parameters and then subtract the exponential term from the data. For other data sets, we might try a linear or power-law trend, depending on the data set and our understanding of the process generating the data. Alternatively, we might first apply a smoothing algorithm to the data and then subtract the result of the smoothing process from the raw data. The result will be the trend-free "noise" component of the time series.

A different approach consists of differencing the series: instead of dealing with the raw data, we work with the changes in the data from one time step to the next. Technically, this means replacing the original series x_i with one consisting of the differences of consecutive elements: x_{i+1} − x_i. This process can be repeated if necessary, but in most cases single differencing is sufficient to remove the trend entirely.

Making sure that the time series has zero mean is easier: simply calculate the mean of the (de-trended!) series and subtract it before calculating the correlation function.
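Differencing is a one-liner in NumPy; a sketch on a synthetic series with a linear trend:

```python
import numpy as np

# A made-up series: linear trend plus a periodic component.
t = np.arange(100)
x = 0.5 * t + np.sin(2 * np.pi * t / 10)

# Single differencing: replace x_i with x_{i+1} - x_i.
# The result is one element shorter than the original.
dx = np.diff(x)

# The differenced series no longer trends upward; its mean is close to the
# constant slope, and subtracting that mean yields a zero-mean series
# suitable for the autocorrelation function.
dx_centered = dx - dx.mean()
```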
This is done explicitly in the formula for the correlation function given earlier.

Another technical wrinkle concerns how we implement the sum in the numerator. As written, this sum is slightly messy, because its upper limit depends on the lag. We can simplify the formula by padding one of the data sets with N zeros on the right and letting the sum run from i = 1 to i = N for all lags. In fact, many computational software packages assume that the data has been prepared in this way (see the Workshop section in this chapter).

FIGURE 4-11. A filter chain: each filter applied to a signal yields another signal, which itself can be filtered.

The last issue you should be aware of is that there are two different normalization conventions for the autocorrelation function, both of which are widely used. In the first variant, the numerator and denominator are not normalized separately—this is the scheme used in the previous formula. In the second variant, the numerator and denominator are each normalized by the number of nonzero terms in their respective sums. With this convention, the formula becomes:

    c(k) = \frac{ \frac{1}{N−k} \sum_{i=1}^{N−k} (x_i − μ)(x_{i+k} − μ) }{ \frac{1}{N} \sum_{i=1}^{N} (x_i − μ)^2 }   with   μ = \frac{1}{N} \sum_{i=1}^{N} x_i

Both conventions are fine, but if you want to compare results from different sources or different software packages, then you will have to make sure you know which convention each of them is following!

Optional: Filters and Convolutions

Until now we have always spoken of time series in a direct fashion, but there is also a way to describe them (and the operations performed on them) at a much higher level of abstraction. For this, we borrow some concepts and terminology from electrical engineering, specifically from the field of digital signal processing (DSP). In the lingo of DSP, we deal with signals (time series) and filters (operations).
Applying a filter to a signal produces a new (filtered) signal. Since filters can be applied to any signal, we can apply another filter to the output of the first and in this way chain filters together (see Figure 4-11). Signals can also be combined with and subtracted from each other. As it turns out, many of the operations we have seen so far (smoothing, differencing) can be expressed as filters. We can therefore use the convenient high-level language of DSP when referring to the processes of time-series analysis.

To make this concrete, we need to understand how a filter is represented and what it means to "apply" a filter to a signal. Each digital filter is represented by a set of coefficients or weights. To apply the filter, we multiply the coefficients with a subset of the signal. The sum of the products is the value of the resulting (filtered) signal:

    y_t = \sum_{i=−k}^{k} w_i x_{t+i}

This should look familiar! We used a similar expression when talking about moving averages earlier in the chapter. A moving average is simply a time series run through an n-point filter in which every coefficient is equal to 1/n. A weighted moving-average filter similarly consists of the weights used in the expression for the average.

The filter concept is not limited to smoothing operations. The differencing step discussed in the previous section can be viewed as the application of the filter [1, −1]. We can even shift an entire time series forward in time by using the filter [0, 1].

The last piece of terminology that we will need concerns the peculiar sum of products that we have encountered several times by now. It's called a convolution. A convolution is a way to combine two sequences to yield a third sequence, which you can think of as the "overlap" between the original sequences.
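A quick sketch of these filters in NumPy (the signal is made up). One subtlety: convolution reverses the filter, so convolving with [1, −1] produces the forward differences x_{t+1} − x_t:

```python
import numpy as np

x = np.array([3.0, 5.0, 4.0, 8.0, 7.0])

# Differencing as a filter: convolving with [1, -1] (which convolution
# reverses) yields exactly the forward differences np.diff computes.
d = np.convolve(x, [1.0, -1.0], mode='valid')

# A 3-point moving average: a filter with every coefficient equal to 1/3.
avg = np.convolve(x, np.ones(3) / 3.0, mode='valid')
```

With mode='valid', only positions where the filter fully overlaps the signal are kept, so each output is shorter than the input by the filter length minus one.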
The convolution operation is usually defined as follows:

    y_t = \sum_{i=−∞}^{∞} w_i x_{t−i}

Symbolically, the convolution operation is often expressed through an asterisk: y = w ∗ x, where y, w, and x are sequences. Of course, if one or both of the sequences have only a finite number of elements, then the sum also contains only a finite number of terms and therefore poses no difficulties.

You should be able to convince yourself that every application of a filter to a time series that we have done was in fact a convolution of the signal with the filter. This is true in general: applying a filter to a signal means forming the convolution of the two. You will find that many numerical software packages provide a convolution operation as a built-in function, making filter operations particularly convenient to use.

I must warn you, however, that the entire machinery of digital signal processing is geared toward signals of infinite (or almost infinite) length, which makes good sense for typical electrical signals (such as the output from a microphone or a radio receiver). But for the rather short time series that we are likely to deal with, we need to pay close attention to a variety of edge effects. For example, if we apply a smoothing or differencing filter, then the resulting series will be shorter, by half the filter length at each end, than the original series. If we now want to subtract the smoothed from the original signal, the operation will fail because the two signals are not of equal length. We therefore must either pad the smoothed signal or truncate the original one. The constant need to worry about padding and proper alignment detracts significantly from the conceptual beauty of the signal-theoretic approach when used with time series of relatively short duration.

Workshop: scipy.signal

The scipy.signal package provides functions and operations for digital signal processing that we can use to good effect to perform calculations for time-series analysis.
The scipy.signal package makes use of the signal-processing terminology introduced in the previous section. The listing that follows shows all the commands used to create graphs like Figures 4-5 and 4-10, including the commands required to write the results to file. The code is heavily commented and should be easy to understand.

from scipy import *
from scipy.signal import *
from matplotlib.pyplot import *

filename = 'callcenter'

# Read data from a text file, retaining only the third column.
# (Column indexes start at 0.)
# The default delimiter is any whitespace.
data = loadtxt( filename, comments='#', delimiter=None, usecols=(2,) )

# The number of points in the time series. We will need it later.
n = data.shape[0]

# Finding a smoothed version of the time series:
# 1) Construct a 31-point Gaussian filter with standard deviation = 4
filt = gaussian( 31, 4 )
# 2) Normalize the filter by dividing by the sum of its elements
filt /= sum( filt )
# 3) Pad the data on both sides, for half the filter length each, with
#    copies of the boundary value.
#    (The function ones(k) returns a vector of length k, with all elements 1.)
padded = concatenate( (data[0]*ones(31//2), data, data[n-1]*ones(31//2)) )
# 4) Convolve the data with the filter. See text for the meaning of "mode".
smooth = convolve( padded, filt, mode='valid' )

# Plot the raw data together with the smoothed data:
# 1) Create a figure, sized to 7x5 inches
figure( 1, figsize=(7, 5) )
# 2) Plot the raw data in red
plot( data, 'r' )
# 3) Plot the smoothed data in blue
plot( smooth, 'b' )
# 4) Save the figure to file
savefig( filename + "_smooth.png" )
# 5) Clear the figure
clf()

# Calculate the autocorrelation function:
# 1) Subtract the mean
tmp = data - mean(data)
# 2) Pad one copy of the data on the right with zeros, then form the
#    correlation function.
#    (The function zeros_like(v) creates a vector with the same dimensions
#    as the input vector v but with all elements zero.)
corr = correlate( tmp, concatenate( (tmp, zeros_like(tmp)) ), mode='valid' )
# 3) Retain only some of the elements
corr = corr[:500]
# 4) Normalize by dividing by the first element
corr /= corr[0]

# Plot the correlation function:
figure( 2, figsize=(7, 5) )
plot( corr )
savefig( filename + "_corr.png" )
clf()

The package provides the Gaussian filter as well as many others. The filters are not normalized, but this is easy enough to accomplish. More attention needs to be paid to appropriate padding and truncating. For example, when forming the smoothed version of the data, I pad the data on both sides by half the filter length to ensure that the smoothed data has the same length as the original set.

The mode argument to the convolve() and correlate() functions determines which pieces of the resulting vector to retain. Several modes are possible. With mode='same', the returned vector has as many elements as the largest input vector (in our case, the padded data vector), but the elements closest to the ends would be corrupted by the padded values. In the listing, I therefore use mode='valid', which retains only those elements that have full overlap between the data and the filter—in effect, removing the elements added in the padding step.

Notice how the signal-processing machinery leads in this application to very compact code. Once you strip out the comments and plotting commands, there are only about 10 lines of code that perform actual operations and calculations. However, we had to pad all data carefully and ensure that we kept only those pieces of the result that were least contaminated by the padding.

Further Reading

• The Analysis of Time Series. Chris Chatfield. 6th ed., Chapman & Hall. 2003.
  This is my preferred text on time-series analysis. It combines a thoroughly practical approach with mathematical depth and a healthy preference for the simple over the obscure.
  Highly recommended.

CHAPTER FIVE
More Than Two Variables: Graphical Multivariate Analysis

As soon as we are dealing with more than two variables simultaneously, things become much more complicated—in particular, graphical methods quickly become impractical. In this chapter, I'll introduce a number of graphical methods that can be applied to multivariate problems. All of them work best if the number of variables is not too large (less than 15–25).

The borderline case of three variables can be handled through false-color plots, which we will discuss first. If the number of variables is greater (but not much greater) than three, then we can construct multiplots from a collection of individual bivariate plots by scanning through the various parameters in a systematic way. This gives rise to scatter-plot matrices and co-plots.

Depicting how an overall entity is composed of its constituent parts can be a rather nasty problem, especially if the composition changes over time. Because this task is so common, I'll treat it separately in its own section.

Multidimensional visualization continues to be a research topic, and in the later sections of the chapter we look at some of the more recent ideas in this field. One recurring theme in this chapter is the need for adequate tools: most multidimensional visualization techniques are either not practical with paper and pencil or are outright impossible without a computer (in particular when it comes to animated techniques). Moreover, as the number of variables increases, so does the need to look at a data set from different angles; this leads to the idea of using interactive graphics for exploration. In the last section, we look at some ideas in this area.
FIGURE 5-1. A simple but effective way to show three variables: treat one as parameter and draw a separate curve for several parameter values.

False-Color Plots

There are different ways to display information in three variables (typically, two independent variables and one dependent variable). Keep in mind that simple is sometimes best! Figure 5-1 shows the function f(x, a) = x⁴/2 + ax² − x/2 + a/4 for various values of the parameter a in a simple, two-dimensional xy plot. The shape of the function and the way it changes with a are perfectly clear in this graph. It is very difficult to display this function in any other way with comparable clarity.

Another way to represent such trivariate data is in the form of a surface plot, such as the one shown in Figure 5-2. As a rule, surface plots are visually stunning but are of very limited practical utility. Unless the data set is very smooth and allows for a viewpoint such that we can look down onto the surface, they simply don't work! For example, it is pretty much impossible to develop a good sense for the behavior of the function plotted in Figure 5-1 from a surface plot. (Try it!) Surface plots can help build intuition for the overall structure of the data, but it is notoriously difficult to read off quantitative information from them. In my opinion, surface plots have only two uses:

1. To get an intuitive impression of the "lay of the land" for a complicated data set
2. To dazzle the boss (not that this isn't important at times)

FIGURE 5-2. Surface plots are often visually impressive but generally don't represent quantitative information very well.

FIGURE 5-3. Grayscale version of a false-color plot of the function shown as a surface plot in Figure 5-2. Here white corresponds to positive values of the function, and black corresponds to negative values.
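A parameter study in the style of Figure 5-1 takes only a few lines in matplotlib. The sketch below uses the function from the text; the particular parameter values, plot range, and file name are my own choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                 # render off-screen; omit for interactive use
import matplotlib.pyplot as plt

def f(x, a):
    # the trivariate example function from the text
    return x**4 / 2 + a * x**2 - x / 2 + a / 4

x = np.linspace(-2, 2, 201)
for a in (-2, -1, 0, 1, 2):           # treat a as the parameter: one curve per value
    plt.plot(x, f(x, a), label="a = %g" % a)
plt.xlabel("x")
plt.legend()
plt.savefig("parameter_study.png")
```

Each curve is labeled with its parameter value, so the legend doubles as the "key" for the third variable.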
Another approach is to project the function into the base plane below the surface in Figure 5-2. There are two ways in which we can represent values: either by showing contours of constant elevation in a contour plot or by mapping the numerical values to a palette of colors in a false-color plot. Contour plots are familiar from topographic maps—they can work quite well, in particular if the data is relatively smooth and if one is primarily interested in local properties. The false-color plot is an alternative and quite versatile technique that can be used for different tasks and on a wide variety of data sets.

To create a false-color plot, all values of the dependent variable z are mapped to a palette of colors. Each data point is then plotted as a region of the appropriate color. Figure 5-3 gives an example (where the color has been replaced by grayscale shading).

I like false-color plots because one can represent a lot of information in them in a way that retains quantitative information. However, false-color plots depend crucially on the quality of the palette—that is, the mapping that has been used to associate colors with numeric values.

Let's quickly recap some information on color and computer graphics. Colors for computer graphics are usually specified by a triple of numbers giving the intensity of their red, green, and blue (RGB) components. Although RGB triples make good sense technically, they are not particularly intuitive. Instead, we tend to think of color in terms of its hue, saturation, and value (i.e., luminance or lightness). Conventionally, hue runs through all the colors of the rainbow (from red to yellow, green, blue, and magenta). Curiously, the spectrum of hues seems to circle back onto itself, since magenta smoothly transforms back to red.
(The reason for this behavior is that the hues in the rainbow spectrum are arranged in order of their dominant electromagnetic frequency. For violet/magenta, no frequency dominates; instead, violet is a mixture of low-frequency reds and high-frequency blues.) Most computer graphics programs will be able to generate color graphics using a hue–saturation–value (HSV) triple.

It is surprisingly hard to find reliable recommendations on good palette design, which is even more unfortunate given that convenience and what seems like common sense often lead to particularly bad palettes. Here are some ideas and suggestions that you may wish to consider:

Keep it simple
Very simple palettes using red, white, and blue often work surprisingly well. For continuous color changes you could use a blue-white-red palette; for segmentation tasks you could use a white-blue-red-white palette with a sharp blue–red transition at the segmentation threshold.

Distinguish between segmentation tasks and the display of smooth changes
Segmentation tasks (e.g., finding all points that exceed a certain threshold, finding the locations where the data crosses zero) call for palettes with sharp color transitions at the respective thresholds, whereas representing smooth changes in a data set calls for continuous color gradients. Of course, both aspects can be combined in a single palette: gradients for part of the palette and sharp transitions elsewhere.

Try to maintain an intuitive sense of ordering
Map low values to "cold" colors and higher values to "hot" colors to provide an intuitive sense of ordering in your palette. Examples include the simple blue-red palette and the "heat scale" (black-red-yellow-white—I'll discuss in a moment why I don't recommend the heat scale). Other palettes that convey a sense of ordering (if only by convention) are the "improved rainbow" (blue-cyan-green-yellow-orange-red-magenta) and the "geo-scale" familiar from topographic maps (blue-cyan-green-brown-tan-white).
Place strong visual gradients in regions with important changes
Suppose that you have a data set with values that span the range from −100 to +100 but that all the really interesting or important change occurs in the range −10 to +10. If you use a standard palette (such as the improved rainbow) for such a data set, then the actual region of interest will appear to be all of the same color, and the rest of the spectrum will be "wasted" on parts of the data range that are not that interesting. To avoid this outcome, you have to compress the rainbow so that it maps only to the region of interest. You might want to consider mapping the extreme values (from −100 to −10 and from 10 to 100) to some unobtrusive colors (possibly even to a grayscale) and reserving the majority of hue changes for the most relevant part of the data range.

Favor subtle changes
This is possibly the most surprising recommendation. When creating palettes, there is a natural tendency to "crank it up full" by using fully saturated colors at maximal brightness throughout. That's not necessarily a good idea, because the resulting effect can be so harsh that details are easily lost. Instead, you might want to consider using soft, pastel colors or even experimenting with mixed hues in place of the pure primaries of the standard rainbow. (Recent versions of Microsoft Excel provide an interesting and easily accessible demonstration of this idea: all default colors offered for shading the background of cells are soft, mixed pastels—to good effect.) Furthermore, the eye is quite good at detecting even subtle variations. In particular, when working with luminance-based palettes, small changes are often all that is required.

Avoid changes that are hard to detect
Some changes are especially hard to perceive visually.
For example, it is practically impossible to distinguish between different shades of yellow, and the transition from yellow to white is even worse! (This is why I don't recommend the heat scale, despite its nice ordering property: the bottom third consists of hard-to-distinguish dark reds, and the entire upper third consists of very hard-to-distinguish shades of light yellow.)

Use hue- and luminance-based palettes for different purposes
In particular, consider using a luminance-based palette to emphasize fine detail and using hue- or saturation-based palettes for smooth, large-scale changes. There is some empirical evidence that luminance-based palettes are better suited for images that contain a lot of fine detail and that hue-based palettes are better suited for bringing out smooth, global changes. A pretty striking demonstration of this observation can be found when looking at medical images (surely an application where details matter!): a simple grayscale representation, which is pure luminance, often seems much clearer than a multicolored representation using a hue-based rainbow palette. This rule is more relevant to image processing of photographs or similar images (such as that in our medical example) than to visualization of the sort of abstract information that we consider here, but it is worth keeping in mind.

Don't forget to provide a color box
No matter how intuitive you think your palette is, nobody will know for sure what you are showing unless you provide a color box (or color key) that shows the values and the colors they are mapped to. Always, always provide one.

One big problem not properly addressed by these recommendations concerns visual uniformity. For example, consider palettes based on the "improved rainbow," which is created by distributing the six primaries in the order blue-cyan-green-yellow-red-magenta across the palette.
If you place these primaries at equal distances from each other and interpolate linearly between them in color space, then the fraction of the palette occupied by green appears to be much larger than the fraction occupied by either yellow or cyan. Another example: when placing a fully saturated yellow next to a fully saturated blue, the blue region will appear to be more intense (i.e., saturated) than the yellow. Similarly, the browns that occur in a geo-scale easily appear darker than the other colors in the palette. This is a problem with our perception of color: simple interpolations in color space do not necessarily result in visually uniform gradients!

There is a variation of the HSV color space, called the HCL (hue–chroma–luminance) space, that takes visual perception into account to generate visually uniform color maps and gradients. The HCL color model is more complicated to use than the HSV model, because not all combinations of hue, chroma, and luminance values exist. For instance, a fully saturated yellow appears lighter than a fully saturated blue, so a palette at full chroma and with high luminance will include the fully saturated yellow but not the blue. As a result, HCL-based palettes that span the entire rainbow of hues tend naturally toward soft, pastel colors. A disadvantage of palettes in the HCL space is that they often degrade particularly poorly when reproduced in black and white.*

A special case of false-color plots are geographic maps, and cartographers have significant experience developing color schemes for various purposes. Their needs are a little different, and not all of their recommendations may work for general data analysis purposes, but it is worthwhile to become familiar with what they have learned.†

Finally, I'd like to point out two additional problems with all plots that depend on color to convey critical information.

• Color does not reproduce well.
Once photocopied or printed on a black-and-white laser printer, a false-color plot will become useless!

• Also keep in mind that about 10 percent of all men are at least partially color blind; these individuals won't be able to make much sense of most images that rely heavily or exclusively on color.

Either one of these problems is potentially serious enough that you might want to reconsider before relying entirely on color for the display of information.

*An implementation of the transformations between HCL and RGB is available in R and C in the "colorspace" package, available from CRAN.
†An interesting starting point is Cynthia Brewer's online ColorBrewer at http://colorbrewer2.org/.

In my experience, preparing good false-color plots is often a tedious and time-consuming task. This is one area where better tools would be highly desirable—an interactive tool that could be used to manipulate palettes directly and in real time would be very nice to have. The same is true for a publicly available set of well-tested palettes.

A Lot at a Glance: Multiplots

The primary concern in all multivariate visualizations is finding better ways to put more "stuff" on a graph. In addition to color (see the previous section), there are basically two ways we can go about this. We can make the graph elements themselves richer, so that they convey additional information beyond their position on the graph; or we can put several similar graphs next to each other and vary the variables that are not explicitly displayed in a systematic fashion from one subgraph to the next. The first idea leads to glyphs, which we will introduce later in this chapter, whereas the latter leads to scatter-plot matrices and co-plots.
The Scatter-Plot Matrix

For a scatter-plot matrix (occasionally abbreviated SPLOM), we construct all possible two-dimensional scatter plots from a multivariate data set and then plot them together in a matrix format (Figure 5-4). We can now scan all of the graphs for interesting behavior, such as a marked correlation between any two variables.

The data set shown in Figure 5-4 consists of seven different properties of a sample of 250 wines.* It is not at all clear how these properties should relate to each other, but by studying the scatter-plot matrix, we can make a few interesting observations. For example, we can see that sugar content and density are positively correlated: if the sugar content goes up, so does the density. The opposite is true for alcohol content and density: as the alcohol content goes up, density goes down. Neither of these observations should come as a surprise (sugar syrup has a higher density than water, and alcohol a lower one). What may be more interesting is that the wine quality seems to increase with increasing alcohol content: apparently, more potent wines are considered to be better!

*The data can be found in the "Wine Quality" data set, available at the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/.

FIGURE 5-4. In a scatter-plot matrix (SPLOM), a separate scatter plot is shown for each pair of variables. All scatter plots in a given row or column have the same plot range, so that we can compare them easily. (Variables: Acidity, Sugar, Chlorides, Sulfur Dioxide, Density, Alcohol, Quality.)

One important detail that is easy to overlook is that all graphs in each row or column show the same plot range; in other words, they use shared scales. This makes it possible to compare graphs across the entire matrix.
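With the right tool, a scatter-plot matrix is essentially a one-liner. Here is a minimal sketch using pandas; since I cannot reproduce the wine data here, the sketch generates synthetic columns that merely mimic the correlations described above:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")                 # render off-screen
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(42)
n = 250
sugar = rng.uniform(1, 20, n)
alcohol = rng.uniform(8, 14, n)
# density rises with sugar and falls with alcohol, plus some noise
density = 0.99 + 4e-4 * sugar - 8e-4 * alcohol + rng.normal(0, 5e-4, n)
df = pd.DataFrame({"Sugar": sugar, "Alcohol": alcohol, "Density": density})

# one scatter plot per pair of columns; histograms on the diagonal
axes = scatter_matrix(df, figsize=(7, 7), diagonal="hist")
plt.savefig("splom.png")
```

The shared scales in each row and column come for free with scatter_matrix(), which is exactly the detail pointed out above.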
The scatter-plot matrix is symmetric across the diagonal: the subplots in the lower left are equal to the ones in the upper right but rotated by 90 degrees. It is nevertheless customary to plot both versions, because this makes it possible to scan a single row or column in its entirety to investigate how one quantity relates to each of the other quantities.

Scatter-plot matrices are easy to prepare and easy to understand. This makes them very popular, but I think they can be overused. Once we have more than about half a dozen variables, the individual subplots become too small to recognize anything useful, in particular if the number of points is large (a few hundred points or more). Nevertheless, scatter-plot matrices are a convenient way to obtain a quick overview and to find viewpoints (variable pairings) that deserve a closer look.

The Co-Plot

In contrast to scatter-plot matrices, which always show all data points but project them onto different surfaces of the parameter space, co-plots (short for "conditional plots") show various slices through the parameter space such that each slice contains only a subset of the data points. The slices are taken in a systematic manner, and we can form an image of the entire parameter space by mentally gluing the slices back together again (the salami principle). Because of the regular layout of the subplots, this technique is also known as a trellis plot.

Figure 5-5 shows a trivariate data set projected onto the two-dimensional xy plane. Although there is clearly structure in the data, no definite pattern emerges. In particular, the dependence on the third parameter is entirely obscured! Figure 5-6 shows a co-plot of the same data set, sliced or conditioned on the third parameter a. The bottom part of the graph shows six slices through the data corresponding to different ranges of a.
(The slice for the smallest values of a is in the lower left, and the one for the largest values of a is in the upper righthand corner.) As we look at the slices, the structure in the data stands out clearly, and we can easily follow the dependence on the third parameter a.

FIGURE 5-5. Projection of a trivariate data set onto the xy plane. How does the data vary with the third variable?

The top part of Figure 5-6 shows the range of values that a takes on for each of the slices. If you look closely, you will find that there are some subtle issues hidden in (or rather revealed by) this panel, because it provides information on the details of the slicing operation. Two decisions need to be made with regard to the slicing:

1. By what method should the overall parameter range be cut into slices?
2. Should slices overlap or not?

In many ways, the most "natural" answer to these questions would be to cut the entire parameter range into a set of adjacent intervals of equal width. It is interesting to observe (by looking at the top panel in Figure 5-6) that in the example graph, a different decision was made in regard to both questions! The slices are not of equal width in the range of parameter values that they span; instead, they have been made in such a way that each slice contains the same number of points. Furthermore, the slices are not adjacent but partially overlap each other.

The first decision (to have each slice contain the same number of points, instead of spanning the same range of values) is particularly interesting because it provides additional information on how the values of the parameter a are distributed. For instance, we can see that large values of a (larger than about a = −1) are relatively rare, whereas values of a between −4 and −2 are much more frequent.
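Equal-count, partially overlapping intervals of this kind are straightforward to compute yourself if your plotting tool does not provide them (R's co.intervals() function does). The following is a simplified sketch of the idea; the bookkeeping for the half-overlap is my own:

```python
import numpy as np

def equal_count_slices(values, number=6, overlap=0.5):
    """Return (lo, hi) value ranges for `number` slices that each contain
    roughly the same number of points, with adjacent slices sharing about
    `overlap` of their points."""
    v = np.sort(np.asarray(values))
    n = len(v)
    # points per slice, chosen so the overlapping slices jointly cover all data
    per = int(np.ceil(n / (number * (1 - overlap) + overlap)))
    step = max(1, int(round(per * (1 - overlap))))
    ranges = []
    for i in range(number):
        lo = min(i * step, n - per)       # clamp the last slice to the top end
        ranges.append((v[lo], v[min(lo + per - 1, n - 1)]))
    return ranges

rng = np.random.default_rng(7)
a = rng.normal(-2.5, 1.2, 300)            # a stand-in for the parameter a
slices = equal_count_slices(a, number=6, overlap=0.5)
```

Each returned range can then be used to select the subset of points drawn in the corresponding panel of the co-plot.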
This kind of behavior would be much harder to recognize precisely if we had chopped the interval for a into six slices of equal width. The other decision (to make the slices overlap partially) is more important for small data sets, where otherwise each slice contains so few points that the structure becomes hard to see. Having the slices overlap makes the data "go farther" than if the slices were entirely disjunct.

FIGURE 5-6. A co-plot of the same data as in Figure 5-5. Each scatter plot includes the data points for only a certain range of a values; the corresponding values of a are shown in the top panel. (The scatter plot for the smallest value of a is in the lower left corner, and that for the largest value of a is in the upper right.)

Co-plots are especially useful if some of the variables in a data set are clearly "control" variables, because co-plots provide a systematic way to study the dependence of the remaining ("response") variables on the controls.

Variations

The ideas behind scatter-plot matrices and co-plots are pretty generally applicable, and you can develop different variants depending on your needs and tastes. Here are some ideas:

• In the standard scatter-plot matrix, half of the individual graphs are redundant. You can remove the individual graphs from half of the overall matrix and replace them with something different—for example, the numerical value of the appropriate correlation coefficient. However, you will then lose the ability to visually scan a full row or column to see how the corresponding quantity correlates with all other variables.

• Similarly, you can place a histogram showing the distribution of values for the quantity in question on the diagonal of the scatter-plot matrix.

• The slicing technique used in co-plots can be used with other graphs besides scatter plots.
For instance, you might want to use slicing with rank-order plots (see Chapter 2), where the conditioning "parameter" is some quantity not explicitly shown in the rank-order plot itself. Another option is to use it with histograms, making each subplot a histogram of a subset of the data, where the subset is determined by the values of the conditioning variable.

• Finally, co-plots can be extended to two conditioning variables, leading to a matrix of individual slices.

By their very nature, all multiplots consist of many individual plot elements, sometimes with nontrivial interactions (such as the overlapped slicing in certain co-plots). Without a good tool that handles most of these issues automatically, these plot types lose most of their appeal. For the plots in this section, I used R (the statistical package), which provides support for both scatter-plot matrices and co-plots as built-in functionality.

Composition Problems

Many data sets describe a composition problem; in other words, they describe how some overall quantity is composed out of its parts. Composition problems pose some special challenges, because often we want to visualize two different aspects of the data simultaneously: on the one hand, we are interested in the relative magnitude of the different components; on the other, we also care about their absolute size.

For one-dimensional problems, this is not too difficult (see Chapter 2). We can use a histogram or a similar graph to display the absolute size for all components, and we can use a cumulative distribution plot (or even the much-maligned pie chart) to visualize the relative contribution that each component makes to the total. But once we add additional variables into the mix, things can get ugly.
Two problems stand out: how to visualize changes to the composition over time, and how to depict the breakdown of an overall quantity along multiple axes at the same time.

Changes in Composition

To understand the difficulties in tracking compositional problems over time, imagine a company that makes five products, labeled A, B, C, D, and E. As we track the daily production numbers over time, there are two different questions that we are likely to be interested in: on the one hand, we'd like to know how many items are produced overall; on the other hand, we would like to understand how the item mix is changing over time.

Figures 5-7, 5-8, and 5-9 show three attempts to plot this kind of data. Figure 5-7 simply shows the absolute numbers produced per day for each of the five product lines. That's not ideal—the graph looks messy because some of the lines obscure each other. Moreover, it is not possible to understand from this graph how the total number of items changes over time. Test yourself: does the total number of items go up over time, does it go down, or does it stay about even?

Figure 5-8 is a stacked plot of the same data. The daily numbers for each product are added to the numbers for the products that appear lower down in the diagram—in other words, the line labeled B gives the number of items produced in product lines A and B. The topmost line in this diagram shows the total number of items produced per day (and answers the question posed in the previous paragraph: the total number of items does not change appreciably over the long run—a possibly surprising observation, given the appearance of Figure 5-7).

Stacked plots can be compelling because they have intuitive appeal and appear to be clear and uncluttered.
In reality, however, they tend to hide the details in the development of the individual components, because the changing baseline makes comparison difficult if not impossible. For example, from Figure 5-7 it is pretty clear that production of item D increased for a while but then dropped rapidly over the last 5 to 10 days. We would never guess this fact from Figure 5-8, where the strong growth of product line A masks the smaller changes in the other product lines. (This is why you should order the components in a stacked graph in ascending order of variation—which was intentionally not done in Figure 5-8.)

FIGURE 5-7. Absolute number of items produced per product line and day.

FIGURE 5-8. Stacked graph of the number of items produced per product line and day.

FIGURE 5-9. Stacked graph of the relative contribution that each product line makes to the total.

Figure 5-9 shows still another attempt to visualize this data. This figure is also a stacked graph, but now we are looking not at the absolute numbers of items produced but instead at the relative fraction that each product line contributes to the daily total. Because the change in the total number of items produced has been eliminated, this graph can help us understand how the item mix varies over time (although we still have the changing-baseline problem common to all stacked graphs). However, information about the total number of items produced has been lost. All things considered, I don't think any one of these graphs succeeds very well.
No single graph can satisfy both of our conflicting goals—to monitor both absolute numbers and relative contributions—and be clear and visually attractive at the same time. I think an acceptable solution for this sort of problem will always involve a combination of graphs—for example, one for the total number of items produced and another for the relative item mix. Furthermore, despite their aesthetic appeal, stacked graphs should be avoided, because they make it too difficult to recognize relevant information in the graph. A plot such as Figure 5-7 may seem messy, but at least it can be read accurately and reliably.

Multidimensional Composition: Tree and Mosaic Plots

Composition problems are generally difficult even when we do not worry about changes over time. Look at the following data:

Male   BS   NYC  Engineering
Male   MS   SFO  Engineering
Male   PhD  NYC  Engineering
Male   BS   LAX  Engineering
Male   MS   NYC  Finance
Male   PhD  SFO  Finance
Female PhD  NYC  Engineering
Female MS   LAX  Finance
Female BS   NYC  Finance
Female PhD  SFO  Finance

The data set shows information about ten employees of some company. For each employee, we have four pieces of information: gender, highest degree obtained, office where they are located (given by airport code—NYC: New York, SFO: San Francisco, LAX: Los Angeles), and department. Keep in mind that each line corresponds to a single person.

The usual way to summarize such data is in the form of a contingency table. Table 5-1 summarizes what we know about the relationship between an employee's gender and his or her department. Contingency tables are used to determine whether there is a correlation between categorical variables: in this case, we notice that men tend to work in engineering and women in finance. (We may want to divide by the total number of records to get the fraction of employees in each cell of the table.)
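Should you want to compute such tables programmatically rather than by hand, pandas does it in one line; the records below are the ten employees from the text:

```python
import pandas as pd

records = [
    ("Male",   "BS",  "NYC", "Engineering"),
    ("Male",   "MS",  "SFO", "Engineering"),
    ("Male",   "PhD", "NYC", "Engineering"),
    ("Male",   "BS",  "LAX", "Engineering"),
    ("Male",   "MS",  "NYC", "Finance"),
    ("Male",   "PhD", "SFO", "Finance"),
    ("Female", "PhD", "NYC", "Engineering"),
    ("Female", "MS",  "LAX", "Finance"),
    ("Female", "BS",  "NYC", "Finance"),
    ("Female", "PhD", "SFO", "Finance"),
]
df = pd.DataFrame(records, columns=["Gender", "Degree", "Office", "Dept"])

# gender vs. department, with the row and column totals of Table 5-1
table = pd.crosstab(df["Dept"], df["Gender"], margins=True)
print(table)
```

Passing normalize="all" to crosstab() yields the fractions mentioned in the parenthetical remark above instead of the raw counts.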
The problem is that contingency tables only work for two dimensions at a time. If we also want to include the breakdown by degree or location, we have no choice but to repeat the basic structure from Table 5-1 several times: once for each office or once for each degree.

TABLE 5-1. A contingency table: breakdown of male and female employees across two departments

              Male   Female   Total
Engineering      4        1       5
Finance          2        3       5
Total            6        4      10

A mosaic plot is an attempt to find a graphical representation for this kind of data. The construction of a mosaic plot is essentially recursive and proceeds as follows (see Figure 5-10):

1. Start with a square.
2. Select a dimension, and then divide the square proportionally according to the counts for this dimension.
3. Pick a second dimension, and then divide each subarea according to the counts along the second dimension, separately for each subarea.
4. Repeat for all dimensions, interchanging horizontal and vertical subdivisions for each new dimension.

FIGURE 5-10. Mosaic plots. In the top row, we start by dividing by gender, then also by department. In the bottom row, we have divided by gender, department, and location, with doctorate degrees shaded. The graph on the left uses the same sort order of dimensions as the graphs in the top row, whereas the graph on the bottom right uses a different sort order. Notice how the sort order changes the appearance of the graph!

In the lower left panel of Figure 5-10, location is shown as a secondary vertical subdivision in addition to gender (from left to right: LAX, NYC, SFO). In addition, the degree is shown through shading (shaded sections correspond to employees with a Ph.D.). Having seen this, we should ask how much mosaic plots actually help us understand this data set.
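The recursive construction translates directly into a short computation. The sketch below, entirely my own construction, computes only the tile rectangles (x, y, width, height) inside the unit square; rendering them (and the customary gaps between tiles) is left to whatever plotting library you prefer:

```python
from collections import Counter

def mosaic_rects(rows, dims, rect=(0.0, 0.0, 1.0, 1.0), depth=0):
    """Recursively split `rect` according to the counts of rows[dim] for
    each dimension in `dims`, alternating horizontal/vertical splits."""
    if not dims:
        return {(): rect}
    dim, rest = dims[0], dims[1:]
    counts = Counter(r[dim] for r in rows)
    total = sum(counts.values())
    x, y, w, h = rect
    tiles, offset = {}, 0.0
    for level in sorted(counts):
        frac = counts[level] / total
        if depth % 2 == 0:                # even depth: split horizontally
            sub = (x + offset * w, y, frac * w, h)
        else:                             # odd depth: split vertically
            sub = (x, y + offset * h, w, frac * h)
        offset += frac
        subset = [r for r in rows if r[dim] == level]
        for key, rc in mosaic_rects(subset, rest, sub, depth + 1).items():
            tiles[(level,) + key] = rc
    return tiles

# gender x department for the ten employees of the text
rows = [{"g": "M", "d": "Eng"}] * 4 + [{"g": "M", "d": "Fin"}] * 2 \
     + [{"g": "F", "d": "Eng"}] * 1 + [{"g": "F", "d": "Fin"}] * 3
tiles = mosaic_rects(rows, ["g", "d"])
```

Note that the tile for (M, Eng) ends up with area 4/10 of the unit square, exactly its share of the records; empty categories produce no tile at all, matching the "invisible" categories discussed below.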
Obviously, Figure 5-10 is difficult to read and has to be studied carefully. Keep in mind that the information about the number of data points within each category is represented by the area—recursively at all levels. Also note that some categories are empty and therefore invisible (for instance, there are no female employees in either the Los Angeles or San Francisco engineering departments).

FIGURE 5-11. A treemap (left) and the corresponding tree (right). The numbers give the weight of each node and, if applicable, also the weight of the entire subtree.

I appreciate mosaic plots because they represent a new idea for how data can be displayed graphically, but I have not found them to be useful. In my own experience, it is easier to understand a data set by poring over a set of contingency tables than by drawing mosaic plots. Several problems stand out.

• The order in which the dimensions are applied matters greatly for the appearance of the plot. The lower right panel in Figure 5-10 shows the same data set yet again, but this time the data was split along the location dimension first and along the gender dimension last. Shading again indicates employees with a Ph.D. Is it obvious that this is the same data set? Is one representation more helpful than the other?

• Changing the sort order changes more than just the appearance; it also influences what we are likely to recognize in the graph. Yet even with an interactive tool, I find it thoroughly confusing to view a large number of mosaic plots with changing layouts.

• It seems that once we have more than about four or five dimensions, mosaic plots become too cluttered to be useful. This is not a huge advance over the two dimensions presented in basic contingency tables!
• Finally, there is a problem common to all visualization methods that rely on area to indicate magnitude: human perception is not that good at comparing areas, especially areas of different shape. In the lower right panel of Figure 5-10, for example, it is not obvious that the sizes of the two shaded areas for engineering in NYC are the same. (Human perception works by comparing visual objects to each other, and the easiest to compare are lengths, not areas or angles. This is also why you should favor histograms over pie charts!)

In passing, let's quickly consider a different but related concept: tree maps. Tree maps are area-based representations of hierarchical tree structures. As shown in Figure 5-11, the area of each parent node in the tree is divided according to the weight of its children.

Tree maps are something of a media phenomenon. Originally developed for the purpose of finding large files in a directory hierarchy, they seem to be more talked about than used. They share the problems of all area-based visualizations already discussed, and even their inventors report that people find them hard to read—especially as the number of levels in the hierarchy increases. Tree maps do lend themselves well to interactive explorations (where you can "zoom in" to deeper levels of the hierarchy). My greatest concern is that tree maps have abandoned the primary advantage of graphical methods, namely intuition, without gaining sufficiently in power: looking at a tree map does not conjure up the image of, well, a tree! (I also think that the focus on treelike hierarchies is driven more by the interests of computer science than by the needs of data analysis—no wonder, given that the archetypal application was browsing a file system!)
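The division of each parent's area according to the weights of its children can be sketched with the classic "slice-and-dice" layout, which alternates the split direction at each level. The sketch below uses a made-up tree (the helpers `treemap` and `weight` are hypothetical names, not from any library):

```python
# Minimal slice-and-dice treemap layout: split the parent's rectangle
# among its children in proportion to their subtree weights, alternating
# horizontal and vertical splits at each level of the hierarchy.
def weight(node):
    return node.get("weight", 0) + sum(weight(c) for c in node.get("children", []))

def treemap(node, x, y, w, h, horiz=True, out=None):
    out = [] if out is None else out
    children = node.get("children")
    if not children:                       # leaf: emit its rectangle
        out.append((node["name"], (x, y, w, h)))
        return out
    total = sum(weight(c) for c in children)
    for c in children:
        frac = weight(c) / total
        if horiz:                          # split this level left-to-right
            treemap(c, x, y, w * frac, h, False, out)
            x += w * frac
        else:                              # next level splits top-to-bottom
            treemap(c, x, y, w, h * frac, True, out)
            y += h * frac
    return out

tree = {"name": "root", "children": [
    {"name": "a", "weight": 6},
    {"name": "b", "children": [{"name": "c", "weight": 2},
                               {"name": "d", "weight": 2}]}]}
for name, (x, y, w, h) in treemap(tree, 0, 0, 1, 1):
    print(name, round(w * h, 2))           # leaf area = leaf weight / total
```

Each leaf ends up with an area equal to its weight divided by the total weight of the tree, which is all a treemap really promises.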
Novel Plot Types

Most of the graph types I have described so far (with the exception of mosaic plots) can be described as "classical": they have been around for years. In this section, we will discuss a few techniques that are much more recent—or, at least, that have only recently received greater attention.

Glyphs

We can include additional information in any simple plot (such as a scatter plot) if we replace the simple symbols used for individual data points with glyphs: more complicated symbols that can express additional bits of information by themselves. An almost trivial application of this idea occurs if we put two data sets on a single scatter plot and use different symbols (such as squares and crosses) to mark the data points from each data set. Here the symbols themselves carry meaning, but only a simple, categorical one—namely, whether the point belongs to the first or the second data set. But if we make the symbols more complicated, then they can express more information. Textual labels (letters and digits) are often surprisingly effective when it comes to conveying more information—although distinctly low-tech, this is a technique to keep in mind! The next step up in sophistication is arrows, which can represent both a direction and a magnitude (see Figure 5-12), but we need not stop there. Each symbol can be a fully formed graph (such as a pie chart or a histogram) all by itself. And even that is not the end—probably the craziest idea in this realm is "Chernoff faces," where different quantities are encoded as facial features (e.g., size of the mouth, distance between the eyes), and the faces are used as symbols on a plot!

FIGURE 5-12. Simple glyphs: using arrows to indicate both direction and magnitude of a field. Notice that the variation in the data is smooth and that the data itself has been recorded on a regular grid.
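A plot in the spirit of Figure 5-12 is easy to produce with matplotlib's quiver() function. The sketch below makes up a smooth rotational vector field on a regular grid, so that each grid point carries a direction and a magnitude (the field itself is invented for illustration; it is not the data behind the figure):

```python
# A field-of-arrows glyph plot: one arrow per grid point, encoding
# both direction and magnitude. The vector field here is a made-up
# smooth rotational field, rendered off-screen with the Agg backend.
import numpy as np
import matplotlib
matplotlib.use("Agg")                  # no display needed
import matplotlib.pyplot as plt

x, y = np.meshgrid(np.linspace(-2, 2, 20), np.linspace(-2, 2, 20))
u, v = -y, x                           # smooth "rotation" around the origin

fig, ax = plt.subplots()
ax.quiver(x, y, u, v)                  # one arrow glyph per grid point
ax.set_aspect("equal")
fig.savefig("arrows.png")
```

Because the underlying field is smooth and sampled on a regular grid, the overall structure (a rotation) is immediately visible, which is exactly the situation in which glyph plots work well.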
As you can see, the problem lies not so much in putting more information on a graph as in being able to interpret the result in a useful manner. And that seems to depend mostly on the data, in particular on the presence of large-scale, regular structure in it. If such structure is missing, then plots using glyphs can be very hard to decode and quite possibly useless. Figures 5-12 and 5-13 show two extreme examples. In Figure 5-12, we visualize a four-dimensional data set using arrows (each point of the two-dimensional plot area has both a direction and a magnitude, so the total number of dimensions is four). You can think of the system as flow in a liquid, as electrical or magnetic field lines, or as deformations in an elastic medium—it does not matter; the overall nature of the data becomes quite clear. But Figure 5-13 is an entirely different matter! Here we are dealing with a data set in seven dimensions: the first two are given by the position of the symbol on the plot, and the remaining five are represented via distortions of a five-edged polygon. Although we can make out some regularities (e.g., the shapes of the symbols in the lower lefthand corner are all quite similar and different from the shapes elsewhere), this graph is hard to read and does not reveal the overall structure of the data very well. Also keep in mind that the appearance of the graph will change if we map a different pair of variables to the main axes of the plot, or even if we change the order of variables in the polygons.

FIGURE 5-13. Complex glyphs: each polygon encodes five different variables, and its position on the plot adds another two.

Parallel Coordinate Plots

As we have seen, a scatter plot can show two variables. If we use glyphs, we can show more, but not all variables are treated equally (some are encoded in the glyphs, some are
encoded by the position of the symbol on the plot). By using parallel coordinate plots, we can show all the variables of a multivariate data set on an equal footing. The price we pay is that we end up with a graph that is neither pretty nor particularly intuitive, but one that can be useful for exploratory work nonetheless. In a regular scatter plot in two (or even three) dimensions, the coordinate axes are at right angles to each other. In a parallel coordinate plot, the coordinate axes are instead parallel to each other. For every data point, its value for each of the variables is marked on the corresponding axis, and then all these points are connected with lines. Because the axes are parallel to each other, we don't run out of spatial dimensions and therefore can have as many of them as we need. Figure 5-14 shows what a single record looks like in such a plot, and Figure 5-15 shows the entire data set. Each record consists of nine different quantities (labeled A through J). The main use of parallel coordinate plots is to find clusters in high-dimensional data sets. For example, in Figure 5-15, we can see that the data forms two clusters for the quantity labeled B: one around 0.8 and one around 0. Furthermore, we can see that most records for which B is 0 tend to have higher values of C than those that have a B near 0.8. And so on. A few technical points should be noted about parallel coordinate plots:
• You will usually want to rescale the values in each coordinate to the unit interval via the linear transformation (also see Appendix B):

    x_scaled = (x - x_min) / (x_max - x_min)

FIGURE 5-14. A single record (i.e., a single data point) from a multivariate data set shown in a parallel coordinate plot.

FIGURE 5-15. All records from the data set shown in a parallel coordinate plot. The record shown in Figure 5-14 is highlighted.
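The rescaling, and a bare-bones version of the plot itself, can be sketched with NumPy and matplotlib (the data here is random and merely illustrative; a serious parallel coordinate plot wants an interactive tool):

```python
# Rescale each column to the unit interval via (x - min)/(max - min),
# then draw one polyline per record across the parallel axes.
import numpy as np
import matplotlib
matplotlib.use("Agg")                  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Three made-up quantities on wildly different scales:
data = rng.normal(loc=[0, 5, 50], scale=[1, 2, 10], size=(30, 3))

lo, hi = data.min(axis=0), data.max(axis=0)
scaled = (data - lo) / (hi - lo)       # every column now spans [0, 1]

fig, ax = plt.subplots()
for row in scaled:
    # Semi-transparent lines: a simple form of the "alpha blending"
    # used to fight overplotting in larger data sets.
    ax.plot(range(3), row, color="steelblue", alpha=0.4)
ax.set_xticks(range(3))
ax.set_xticklabels(["A", "B", "C"])
fig.savefig("parcoord.png")
```

Without the rescaling step, the column with the largest numerical range would dominate the vertical extent of the plot and flatten the others into near-horizontal lines.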
This is not mandatory, however. There may be situations where you care about the absolute positions of the points along the coordinate axis or about scaling to a different interval.
• The appearance of parallel coordinate plots depends strongly on the order in which the coordinate lines are drawn: rearranging them can hide or reveal structure. Ideally, you have access to a tool that lets you reshuffle the coordinate axes interactively.
• Especially for larger data sets (several hundred points or more), overplotting of lines becomes a problem. One way to deal with this is through "alpha blending": lines are shown as semi-transparent, and their visual effects are combined where they overlap each other.
• Similarly, it is often highly desirable to be able to select a set of lines and highlight them throughout the entire graph—for example, to see how data points that are clustered in one dimension are distributed in the other dimensions.
• Instead of combining points on adjacent coordinate axes with straight lines that have sharp kinks at the axes, one can use smooth lines that pass through the coordinate axes without kinks.
All of these issues really are tool issues, and in fact parallel coordinate plots don't make sense without a tool that supports them natively and includes good implementations of the features just described. This implies that parallel coordinate plots serve less as finished, static graphs than as an interactive tool for exploring a data set. Parallel coordinate plots still seem pretty novel. The idea itself has been around for about 25 years, but even today, tools that support parallel coordinate plots well are far from commonplace. What is not yet clear is how useful parallel coordinate plots really are. On the one hand, the concept seems straightforward and easy enough to use.
On the other hand, I have found the experience of actually trying to apply them frustrating and not very fruitful. It is easy to get bogged down in the technicalities of the plot (ordering and scaling of coordinate axes) with little real, concrete insight resulting in the end. The erratic tool situation of course does not help. I wonder whether more computationally intensive methods (e.g., principal component analysis—see Chapter 14) do not give a better return on investment overall. But the jury is still out.

Interactive Explorations

All the graphs that we have discussed so far (in this and the preceding chapters) were by nature static. We prepared graphs so that we could then study them, but this was the extent of our interaction. If we wanted to see something different, we had to prepare a new graph. In this section, I shall describe some ideas for interactive graphics: graphs that we can change directly in some way without having to re-create them. Interactive graphics cannot be produced with paper and pencil, not even in principle: they require a computer. Conversely, what we can do in this area is even more strongly limited by the tools or programs that are available to us than it is for other types of graphs. In this sense, then, this section is more about possibilities than about realities, because the tool support for interactive graphical exploration seems (at the time of this writing) rather poor.

Querying and Zooming

Interaction with a graph does not have to be complicated. A very simple form of interaction consists of the ability to select a point (or possibly a group of points) and have the tool display additional information about it. In the simplest case, we hover the mouse pointer over a data point and see the coordinates (and possibly additional details) in a tool tip or a separate window. We can refer to this activity as querying.
Another simple form of interaction would allow us to change aspects of the graph directly using the mouse. Changing the plot range (i.e., zooming) is probably the most common application, but I could also imagine adjusting the aspect ratio, the color palette, or smoothing parameters in this way. (Selecting and highlighting a subset of points in a parallel coordinate plot, as described earlier, would be another application.) Observe that neither of these activities is inherently "interactive": they would all also be possible if we used paper and pencil. The interactive aspect consists of our ability to invoke them in real time and by using a graphical input device (the mouse).

Linking and Brushing

The ability to interact directly with graphs becomes much more interesting once we are dealing with multiple graphs at the same time! For example, consider a scatter-plot matrix like the one in Figure 5-4. Now imagine we use the mouse to select and highlight a group of points in one of the subplots. If the graphs are linked, then the symbols corresponding to the data points selected in one subplot will be highlighted in all the other subplots as well. Usually, selecting some points and then highlighting their corresponding symbols in the linked subgraphs requires two separate steps (or mouse clicks). A real-time version of the same idea is called brushing: any points currently under the mouse pointer are selected and highlighted in all of the linked subplots. Of course, linking and brushing are not limited to scatter-plot matrices; they are applicable to any group of graphs that show different aspects of the same data set. Suppose we are working with a set of histograms of a multivariate data set, each histogram showing only one of the quantities. Now I could imagine a tool that allows us to select a bin in one of the histograms and then highlights the contribution from the points in that bin in all the other histograms.
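The computation behind such a linked-histogram tool is simple, even if the GUI is not. The following sketch (NumPy only, no interactivity, with made-up correlated data) "selects" the records falling into one bin of quantity A and computes their contribution to each bin of quantity B's histogram:

```python
# Linked histograms, minus the GUI: select the records in one bin of A
# and tally how those same records distribute over the bins of B.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=1000)
B = 0.5 * A + rng.normal(scale=0.5, size=1000)   # correlated with A

counts, edges = np.histogram(B, bins=10)          # the full histogram of B
selected = (A >= 0.0) & (A < 0.5)                 # "click" on one bin of A
# The highlighted portion of each B bin, using the same bin edges:
highlight, _ = np.histogram(B[selected], bins=edges)
print(highlight)                                  # per-bin contribution
```

A plotting tool would draw `highlight` as a shaded sub-bar inside each bar of `counts`; the brushing version simply recomputes `selected` as the mouse moves.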
Grand Tours and Projection Pursuits

Although linking and brushing allow us to interact with the data, they leave the graph itself static. This changes when we come to Grand Tours and Projection Pursuits. Now we are talking about truly animated graphics! Grand Tours and Projection Pursuits are attempts to enhance our understanding of a data set by presenting many closely related projections in the form of an animated "movie." The concept is straightforward: we begin with some projection and then continuously move the viewpoint around the data set. (For a three-dimensional data set, you can imagine the viewpoint moving on a sphere that encloses the data.) In a Grand Tour, the viewpoint is allowed to perform an essentially random walk around the data set. In a Projection Pursuit, the viewpoint is moved so as to improve the value of an index that measures how "interesting" a specific projection will appear. Most indices currently suggested measure properties such as deviation from Gaussian behavior. At each step of a Pursuit, the program evaluates several possible projections and then selects the one that most improves the chosen index. Eventually, a Pursuit will reach a local maximum for the index, at which time it needs to be restarted from a different starting point. Obviously, Tours and Pursuits require specialized tools that can perform the required projections—and do so in real time. They are also exclusively exploratory techniques and not suitable for preserving results or presenting them to a general audience. Although the approach is interesting, I have not found Tours to be especially useful in practice. It can be confusing to watch a movie of essentially random patterns and frustrating to interact with projections when attempting to explore the neighborhood of an interesting viewpoint.
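A single, crude Pursuit step can be sketched without any graphics at all: generate several random two-dimensional projections of the data and keep the one that scores highest on an "interestingness" index. The index below (absolute excess kurtosis, one common way to measure deviation from Gaussian behavior) and the data are both made up for illustration:

```python
# One crude projection-pursuit step: among random orthonormal 2D
# projections of 5-dimensional data, keep the one whose projected
# coordinates deviate most from Gaussian behavior.
import numpy as np

def interestingness(z):
    """Sum of absolute excess kurtosis over the projected coordinates."""
    z = (z - z.mean(axis=0)) / z.std(axis=0)
    return np.abs((z**4).mean(axis=0) - 3.0).sum()

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 5))
data[:250, 0] += 4.0                   # hide a two-cluster structure in dim 0

best_score, best_proj = -np.inf, None
for _ in range(50):                    # candidate viewpoints
    # Random orthonormal frame via QR: a uniformly random 2D "viewpoint".
    q, _ = np.linalg.qr(rng.normal(size=(5, 2)))
    score = interestingness(data @ q)
    if score > best_score:
        best_score, best_proj = score, q
print(round(best_score, 2))
```

A real Pursuit would move the viewpoint continuously toward better scores rather than sampling at random, and would animate the projected scatter plot along the way; this sketch only captures the scoring-and-selection idea.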
Tools

All interactive visualization techniques require suitable tools and computer programs; they cannot be done using paper-and-pencil methods. This places considerable weight on the quality of the available tools. Two issues stand out.
• It seems difficult to develop tools that support interactive features and are sufficiently general at the same time. For example, if we expect the plotting program to show additional detail on any data point that we select with the mouse, then the input (data) file will have to contain this information—possibly as metadata. But now we are talking about relatively complicated data sets, which require more complicated, structured file formats that will be specific to each tool. So before we can do anything with the data, we will have to transform it into the required format. This is a significant burden, and it may make these methods infeasible in practice. (Several of the more experimental programs mentioned in the Workshop section of this chapter are nearly unusable on actual data sets for exactly this reason.)
• A second problem concerns performance. Brushing, for instance, makes sense only if it truly occurs in real time—without any discernible delay as the mouse pointer moves. For a large data set and a scatter-plot matrix of a dozen attributes, this means updating a few thousand points in real time. Although by no means infeasible, such responsiveness does require that the tool be written with an eye toward performance and using appropriate technologies. (Several of the tools mentioned in the Workshop exhibit serious performance issues on real-world data sets.)
A final concern involves the overall design of the user interface. It should be easy to learn and easy to use, and it should support the activities that are actually required. Of course, this concern is not specific to data visualization tools but is common to all programs with a graphical user interface.
Workshop: Tools for Multivariate Graphics

Multivariate graphs tend to be complicated and therefore require good tool support even more strongly than do other forms of graphs. In addition, some multivariate graphics are highly specialized (e.g., mosaic plots) and cannot be easily prepared with a general-purpose plotting tool. That being said, the tool situation is questionable at best. Here are three different starting points for exploration—each with its own set of difficulties.

R

R is not a plotting tool per se; it is a statistical analysis package and a full development environment as well. However, R has always included pretty extensive graphing capabilities. R is particularly strong at "scientific" graphs: straightforward but highly accurate line diagrams. Because R is not simply a plotting tool, but instead a full data manipulation and programming environment, its learning curve is rather steep; you need to know a lot of different things before you can do anything. But once you are up and running, the large number of advanced functions that are already built in can make working with R very productive. For example, the scatter-plot matrix in Figure 5-4 was generated using just these three commands:

    d <- read.delim( "wines", header=T )
    pairs(d)
    dev.copy2eps( file="splom.eps" )

(The R command pairs() generates a plot of all pairs—i.e., a scatter-plot matrix.) The scatter plot in Figure 5-5 and the co-plot in Figure 5-6 were generated using:

    d <- read.delim( "data", header=F )
    names( d ) <- c( "x", "a", "y" )

    plot( y ~ x, data=d )
    dev.copy2eps( file='coplot1.eps' )

    coplot( y ~ x | a, data=d )
    dev.copy2eps( file='coplot2.eps' )

Note that these are the entire command sequences, which include reading the data from file and writing the graph back to disk! We'll have more to say about R in the Workshop sections of Chapters 10 and 14.
R has a strong culture of user-contributed add-on packages. For multiplots consisting of subplots arranged on a regular grid (in particular, for generalized co-plots), you should consider the lattice package, which extends or even replaces the functionality of the basic R graphics system. This package is part of the standard R distribution.

Experimental Tools

If you want to explore some of the more novel graphing ideas, such as parallel coordinate plots and mosaic plots, or if you want to try out interactive ideas such as brushing and Grand Tours, then there are several options open to you. All of them are academic research projects, and all are highly experimental. (In a way, this is a reflection of the state of the field: I don't think any of these novel plot types have been refined to a point where they are clearly useful.)
• The ggobi project (http://www.ggobi.org) allows brushing in scatter-plot matrices and parallel coordinate plots and includes support for animated tours and pursuits.
• Mondrian (http://www.rosuda.org/mondrian) is a Java application that can produce mosaic plots (as well as some other multivariate graphs).
Again, both tools are academic research projects—and it shows. They are technology demonstrators intended to try out and experiment with new graph ideas, but neither is anywhere near production strength. Both are rather fussy about the required data input format, their graphical user interfaces are clumsy, and neither includes a proper way to export graphs to file (if you want to save a plot, you have to take a screenshot). The interactive brushing features in ggobi are slow, which makes them nearly unusable for realistically sized data sets. There are some lessons here (besides the intended ones) to be learned about the design of tools for statistical graphics.
(For instance, GUI widget sets do not seem suitable for interactive visualizations: they are too slow. You have to use a lower-level graphics library instead.) Other open source tools you may want to check out are Tulip (http://tulip.labri.fr) and ManyEyes (http://manyeyes.alphaworks.ibm.com/manyeyes). The latter project is a web-based tool and community that allows you to upload your data set and generate plots of it online. A throwback to a different era is OpenDX (http://www.research.ibm.com/dx). Originally designed by IBM in 1991, it was donated to the open source community in 1999. It certainly feels overly complicated and dated, but it does include a selection of features not found elsewhere.

Python Chaco Library

The Chaco library (http://code.enthought.com/projects/chaco/) is a Python library for two-dimensional plotting. In addition to the usual line and symbol drawing capabilities, it includes easy support for color and color manipulation as well as—more importantly—for real-time user interaction. Chaco is an exciting toolbox if you plan to experiment with writing your own programs to visualize data and interact with it. However, be prepared to do some research: the best available documentation seems to be the set of demos that ship with it. Chaco is part of the Enthought Tool Suite, which is developed by Enthought, Inc., and is available under a BSD-style license.

Further Reading

• Graphics of Large Datasets: Visualizing a Million. Antony Unwin, Martin Theus, and Heike Hofmann. Springer. 2006. This is a modern book that in many ways describes the state of the art in statistical data visualization. Mosaic plots, glyph plots, parallel coordinate plots, Grand Tours—all are discussed here. Unfortunately, the basics are neglected: standard tools like logarithmic plots are never even mentioned, and simple things like labels are frequently messed up.
This book is nevertheless interesting as a survey of some of the state of the art.
• The Elements of Graphing Data. William S. Cleveland. 2nd ed., Hobart Press. 1994. This book provides an interesting counterpoint to the book by Unwin and colleagues. Cleveland's graphs often look pedestrian, but he thinks more deeply than almost anyone else about ways to incorporate more (and more quantitative) information in a graph. What stands out in his works is that he explicitly takes human perception into account as a guiding principle when developing new graphs. My discussion of scatter-plot matrices and co-plots is heavily influenced by his careful treatment.
• Gnuplot in Action: Understanding Data with Graphs. Philipp K. Janert. Manning Publications. 2010. Chapter 9 of this book contains additional details on and examples for the use of color to prepare false-color plots, including explicit recipes to create them using gnuplot. But the principles are valid more generally, even if you use different tools.
• Why Should Engineers and Scientists Be Worried About Color? B. E. Rogowitz and L. A. Treinish. http://www.research.ibm.com/people/l/lloydt/color/color.HTM. 1995. This paper contains important lessons for false-color plots, including the distinction between segmentation and smooth variation as well as the difference between hue- and luminance-based palettes. The examples were prepared using IBM's (now open source) OpenDX graphical Data Explorer.
• Escaping RGBland: Selecting Colors for Statistical Graphics. A. Zeileis, K. Hornik, and P. Murrell. http://statmath.wu.ac.at/~zeileis/papers/Zeileis+Hornik+Murrell-2009.pdf. 2009. This is a more recent paper on the use of color in graphics. It emphasizes the importance of perception-based color spaces, such as the HCL model.
CHAPTER SIX

Intermezzo: A Data Analysis Session

Occasionally I get the question: "How do you actually work?" or "How do you come up with this stuff?" As an answer, I want to take you on a tour through a new data set. I will use gnuplot, which is my preferred tool for this kind of interactive data analysis—you will see why. And I will share my observations and thoughts as we go along.

A Data Analysis Session

The data set is a classic: the CO2 measurements above Mauna Loa on Hawaii. The inspiration for this section comes from Cleveland's Elements of Graphical Analysis,* but the approach is entirely mine. First question: what's in the data set? I see that the first column represents the date (month and year) while the second contains the measured CO2 concentration in parts per million. Here are the first few lines:

    Jan-1959 315.42
    Feb-1959 316.32
    Mar-1959 316.49
    Apr-1959 317.56
    ...

The measurements are regularly spaced (in fact, monthly), so I don't need to parse the date in the first column; I simply plot the second column by itself. (In the figure, I have added tick labels on the horizontal axis for clarity, but I am omitting the commands required here—they are not essential.)

    plot "data" u 1 w l

The plot shows a rather regular short-term variation overlaid on a nonlinear upward trend. (See Figure 6-1.)

FIGURE 6-1. The first look at the data: plot "data" u 1 w l

*The Elements of Graphing Data. William S. Cleveland. Hobart Press. 1994. The data itself (in a slightly different format) is available from StatLib: http://lib.stat.cmu.edu/datasets/visualizing.data.zip and from many other places around the Web.
The coordinate system is not convenient for mathematical modeling: the x axis is not numeric, and for modeling purposes it is usually helpful if the graph goes through the origin. So, let's make it do so by subtracting the vertical offset from the data and expressing the horizontal position as the number of months since the first measurement. (This corresponds to the line number in the data file, which is accessible in a gnuplot session through the pseudo-column with column number 0.)

    plot "data" u 0:($2-315) w l

A brief note on the command: the specification after the u (short for using) gives the columns to be used for the x and y coordinates, separated by a colon. Here we use the line number (which is in the pseudo-column 0) for the x coordinate. Also, we subtract the constant offset 315 from the values in the second column and use the result as the y value. Finally, we plot the result with lines (abbreviated w l) instead of using points or other symbols. See Figure 6-2. The most predominant feature is the trend. What can we say about it? First of all, the trend is nonlinear: if we ignore the short-term variation, the curve is convex downward. This suggests a power law with an as-yet-unknown exponent: x^k. All power-law functions go through the origin (0, 0) and also through the point (1, 1). We already made sure that the data passes through the origin, but to fix the upper-right corner, we need to rescale both axes: if x^k goes through (1, 1), then b*(x/a)**k goes through (a, b).

FIGURE 6-2. Making the x values numeric and subtracting the constant vertical offset: plot "data" u 0:($2-315) w l

FIGURE 6-3. Adding a function: plot "data" u 0:($2-315) w l, 35*(x/350)**2

What's the value for the exponent k?
All I know about it right now is that it must be greater than 1 (because the function is convex). Let's try k = 2. (See Figure 6-3.)

    plot "data" u 0:($2-315) w l, 35*(x/350)**2

Not bad at all! The exponent is a bit too large—some fiddling suggests that k = 1.35 would be a good value (see Figure 6-4).

    plot "data" u 0:($2-315) w l, 35*(x/350)**1.35

To verify this, let's plot the residual; that is, we subtract the trend from the data and plot what's left. If our guess for the trend is correct, then the residual should not exhibit any trend itself—it should just straddle y = 0 in a balanced fashion (see Figure 6-5).

    plot "data" u 0:($2-315 - 35*($0/350)**1.35) w l

FIGURE 6-4. Getting the exponent right: f(x) = 35*(x/350)**1.35

FIGURE 6-5. The residual, after subtracting the function from the data.

It might be hard to see the longer-term trend in this data, so we may want to approximate it by a smoother curve. We can use the weighted-spline approximation built into gnuplot for that purpose. It takes a third parameter, which is a measure for the smoothness: the smaller the third parameter, the smoother the resulting curve; the larger the third parameter, the more closely the spline follows the original data (see Figure 6-6).

    plot "data" u 0:($2-315 - 35*($0/350)**1.35) w l, \
         "" u 0:($2-315 - 35*($0/350)**1.35):(0.001) s acs w l

At this point, the expression for the function that we use to approximate the data has become unwieldy.
Thus it now makes sense to define it as a separate function:

    f(x) = 315 + 35*(x/350)**1.35
    plot "data" u 0:($2-f($0)) w l, "" u 0:($2-f($0)):(0.001) s acs w l

FIGURE 6-6. Plotting a smoothed version of the residual together with the unsmoothed residual to test whether there is any systematic trend remaining in the residual.

From the smoothed line we can see that the overall residual is pretty much flat and straddles zero. Apparently, we have captured the overall trend quite well: there is little evidence of a systematic drift remaining in the residuals. With the trend taken care of, the next feature to tackle is the seasonality. The seasonality seems to consist of rather regular oscillations, so we should try some combination of sines and cosines. The data pretty much starts out at y = 0 for x = 0, so we can try a sine by itself. To make a guess for its wavelength, we recall that the data is meteorological and has been taken on a monthly basis—perhaps there is a year-over-year periodicity. This would imply that the data repeats every 12 data points. If so, then a full period of the sine, which corresponds to 2π, should equal a horizontal distance of 12 points. For the amplitude, the graph suggests a value close to 3 (see Figure 6-7).

    plot "data" u 0:($2-f($0)) w l, 3*sin(2*pi*x/12) w l

Right on! In particular, our guess for the wavelength worked out really well. This makes sense, given the origin of the data. Let's take residuals again, employing splines to see the bigger picture as well (see Figure 6-8):

    f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12)
    plot "data" u 0:($2-f($0)) w l, "" u 0:($2-f($0)):(0.001) s acs w l

The result is pretty good but not good enough. There is clearly some regularity remaining in the data, although at a higher frequency than the main seasonality.
Let's zoom in on a smaller interval of the data to take a closer look. The data in the interval [60:120] appears particularly regular, so let's look there (see Figure 6-9):

plot [60:120] "data" u 0:($2-f($0)) w lp, "" u 0:($2-f($0)):(0.001) s acs w l

FIGURE 6-7. Fitting the seasonality with a sine wave: 3 sin(2πx/12)

FIGURE 6-8. Residuals after subtracting both trend and seasonality.

I have indicated the individual data points using gnuplot's linespoints (lp) style. We can now count the number of data points between the main valleys in the data: 12 points. This is the main seasonality. But it seems that between any two primary valleys there is exactly one secondary valley. Of course: higher harmonics! The original seasonality had a period of exactly 12 months, but its shape was not entirely symmetric: its rising flank comprised 7 months but the falling flank only 5 (as you can see by zooming in on the original data with only the trend removed). This kind of asymmetry implies that the seasonality cannot be represented by a simple sine wave alone but that we have to take into account higher harmonics; that is, sine functions with frequencies that are integer multiples of the primary seasonality. So let's try the first higher harmonic, again punting a little on the amplitude (see Figure 6-10):

f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12) - 0.75*sin(2*pi*x/6)
plot "data" u 0:($2-f($0)) w l, "" u 0:($2-f($0)):(0.001) s acs w l

FIGURE 6-9. Zooming in for a closer look. Individual data points are marked by symbols.
FIGURE 6-10. Residual after removing trend and the first and second harmonic of the seasonality.

Now we are really pretty close. Look at the residual, in particular for values of x greater than about 150. The data starts to look quite "random," although there is some systematic behavior for x in the range [0:70] that we don't really capture. Let's add some constant reference lines to the plot for comparison (see Figure 6-11):

plot "data" u 0:($2-f($0)) w l, "" u 0:($2-f($0)):(0.001) s acs w l, 0, 1, -1

It looks as if the residual is skewed toward positive values, so let's adjust the vertical offset by 0.1 (see Figure 6-12):

f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12) - 0.75*sin(2*pi*x/6) + 0.1
plot "data" u 0:($2-f($0)) w l, "" u 0:($2-f($0)):(0.001) s acs w l, 0, 1, -1

FIGURE 6-11. Adding some grid lines for comparison.

FIGURE 6-12. The final residual.

That's now really close. You should notice how small the last adjustment was: we started out with data ranging from 300 to 350, and now we are making adjustments to the parameters on the order of 0.1. Also note how small the residual has become: mostly in the range from −0.7 to 0.7. That's only about 3 percent of the total variation in the data. Finally, let's look at the original data again, this time together with our analytical model (see Figure 6-13):

f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12) - 0.75*sin(2*pi*x/6) + 0.1
plot "data" u 0:2 w l, f(x)

All in all, pretty good.

FIGURE 6-13. The raw data with the final fit.

So what is the point here?
The point is that we started out with nothing—no idea at all of what the data looked like. And then, layer by layer, we peeled off components of the data, until only random noise remained. We ended up with an explicit, analytical formula that describes the data remarkably well. But there is something more. We did so entirely “manually”: by plotting the data, trying out some approximations, and wiggling the numbers until they agreed reasonably well with the data. At no point did we resort to a black-box ﬁtting routine—because we didn’t have to! We did just ﬁne. (In fact, after everything was ﬁnished, I tried to perform a nonlinear ﬁt using the functional form of the analytical model as we have worked it out—only to have it explode terribly! The model depends on seven parameters, which means that convergence of a nonlinear ﬁt can be a bit precarious. In fact, it took me longer to try to make the ﬁt work than it took me to work the parameters out manually as just demonstrated.) I’d go even further. We learned more by doing this work manually than if we had used a ﬁtting routine. Some of the observations (such as the idea to include higher harmonics) arose only through direct interaction with the data. And it’s not even true that the parameters would be more accurate if they had been calculated by a ﬁtting routine. Sure, they would contain 16 digits but not more information. Our manual wiggling of the parameters enabled us to see quickly and directly the point at which changes to the parameters are so small that they no longer inﬂuence the agreement between the data and the model. That’s when we have extracted all the information from the data—any further “precision” in the parameters is just insigniﬁcant noise. You might want to try your hand at this yourself and also experiment with some variations of your own. For example, you may question the choice of the power-law behavior for the long-term trend. 
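If you would like to experiment with variations of your own outside gnuplot, it helps to have the model in script form. Here is a minimal Python sketch; since the data file itself is not reproduced here, it substitutes a synthetic stand-in for it (the model plus a small deterministic wiggle), which is purely an assumption for illustration:

```python
import math

# The final model from the session; x is the index of the monthly observation.
def f(x):
    return (315 + 35 * (x / 350) ** 1.35            # long-term trend
            + 3 * math.sin(2 * math.pi * x / 12)    # 12-month seasonality
            - 0.75 * math.sin(2 * math.pi * x / 6)  # first higher harmonic
            + 0.1)                                  # vertical offset

# Stand-in for the "data" file: the model plus a small deterministic wiggle,
# so that the example is self-contained and repeatable.
data = [f(x) + 0.3 * math.sin(2.7 * x) for x in range(400)]

# Peel the model off and inspect the residual, as in the gnuplot session.
residual = [y - f(x) for x, y in enumerate(data)]
spread = max(residual) - min(residual)
print("residual spread:", spread)  # small compared to the full range of the data
```

With the real monthly series loaded in place of the synthetic list, the same residual check reproduces the layer-by-layer analysis of the session.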
Does an exponential function (like exp(x)) give a better fit? It is not easy to tell from the data, but it makes a huge difference if we want to project our findings significantly (10 years or more) into the future. You might also take a closer look at the seasonality. Because it is so regular, and especially since its period is known exactly, you should be able to isolate just the periodic part of the data in a separate model by averaging corresponding months for all years. Finally, there is 20 years' worth of additional data available beyond the "classic" data set used in my original exploration.* Figure 6-14 shows all the available data together with the model that we have developed. Does the fit continue to work well for the years past 1990?

FIGURE 6-14. The extended data set up to early 2010 together with the model (up to 1990).

Workshop: gnuplot

The example commands in this chapter should have given you a good idea what working with gnuplot is like, but let's take a quick look at some of the basics. Gnuplot (http://www.gnuplot.info) is command-line oriented: when you start gnuplot, it presents you with a text prompt at which to enter commands; the resulting graphs are shown in a separate window. Creating plots is simple. The command:

plot sin(x) with lines, cos(x) with linespoints

will generate a plot of (you guessed it) a sine and a cosine. The sine will be drawn with plain lines, and the cosine will be drawn with symbols ("points") connected by lines.

*You can obtain the data from the observatory's official website at http://www.esrl.noaa.gov/gmd/ccgg/trends/. Also check out the narrative (with photos of the apparatus!) at http://celebrating200years.noaa.gov/datasets/mauna/welcome.html.
(Many gnuplot keywords can be abbreviated: instead of with lines I usually type w l, or w lp instead of with linespoints. These short forms are a major convenience, although rather cryptic in the beginning. In this short introductory section, I will make sure to use only the full forms of all commands.) To plot data from a file, you also use the plot command; for instance:

plot "data" using 1:2 with lines

When plotting data from a file, we use the using keyword to specify which columns from the file we want to plot; in the command just given, we use entries from the first column as x values and entries from the second column as y values. One of the nice features of gnuplot is that you can apply arbitrary transformations to the data as it is being plotted. To do so, you put parentheses around each entry in the column specification that you want to apply a transform to. Within these parentheses you can use any mathematical expression. The data from each column is available by prefixing the column index with the dollar sign. An example will make this clearer:

plot "data" using (1/$1):($2+$3) with lines

This command plots the sum of the second and third columns (that is, $2+$3) as a function of one over the value in the first column (1/$1). It is also possible to mix data and functions in a single plot command (as we have seen in the examples in this chapter):

plot "data" using 1:2 with lines, cos(x) with lines

This is different from the Matlab style of plotting, where a function must be explicitly evaluated for a set of points before the resulting set of values can be plotted. We can now proceed to add decorations (such as labels and arrows) to the plot. All kinds of options are available to customize virtually every aspect of the plot's appearance: tick marks, the legend, the aspect ratio; you name it.
When we are done with a plot, we can save all the commands used to create it (including all decorations) via the save command:

save "plot.gp"

Now we can use load "plot.gp" to re-create the graph. As you can see, gnuplot is extremely straightforward to use. The one area that is often regarded as somewhat clumsy is the creation of graphs in common graphics file formats. The reason for this is historical: the first version of gnuplot was written in 1985, a time when one could not expect every computer to be connected to a graphics-capable terminal and when many of our current file formats did not even exist! The gnuplot designers dealt with this situation by creating the so-called "terminal" abstraction. All hardware-specific capabilities were encapsulated by this abstraction so that the rest of gnuplot could be as portable as possible. Over time, this "terminal" came to include different graphics file formats as well (not just graphics hardware terminals), and this usage continues to this day. Exporting a graph to a common file format (such as GIF, PNG, PostScript, or PDF) requires a five-step process:

set terminal png
set output "plot.png"
replot
set terminal wxt
set output

In the first step, we choose the output device or "terminal": here, a PNG file. In the second step, we choose the file name. In the third step, we explicitly request that the graph be regenerated for this newly chosen device. The remaining commands restore the interactive session by selecting the interactive wxt terminal (built on top of the wxWidgets widget set) and redirecting output back to the interactive terminal. If you find this process clumsy and error-prone, then you are not alone, but rest assured: gnuplot allows you to write macros, which can reduce these five steps to one!
I should mention one further aspect of gnuplot: because it has been around for 25 years, it is extremely mature and robust when it comes to dealing with typical day-to-day problems. For example, gnuplot is refreshingly unpicky when it comes to parsing input files. Many other data analysis or plotting programs that I have seen are pretty rigid in this regard and will bail when encountering unexpected data in an input file. This is the right thing to do in theory, but in practice, data files are often not clean, with ad hoc formats and missing or corrupted data points. Having your plotting program balk over whitespace instead of tabs is a major nuisance when doing real work. In contrast, gnuplot usually does an amazingly good job of making sense of almost any input file you might throw at it, and that is indeed a great help. Similarly, gnuplot recognizes undefined mathematical expressions (such as 1/0, log(0), and so on) and discards them. This is also very helpful, because it means that you don't have to worry about the domains over which functions are properly defined while you are in the thick of things. Because the output is graphical, there is usually very little risk that this silent discarding of undefined values will lead you to miss essential behavior. (Things are different in a computer program, where silently ignoring error conditions usually only compounds the problem.)

Further Reading

• Gnuplot in Action: Understanding Data with Graphs. Philipp K. Janert. Manning Publications. 2010. If you want to know more about gnuplot, then you may find this book interesting. It includes not only explanations of all sorts of advanced options but also helpful hints for working with gnuplot.
PART II

Analytics: Modeling Data

CHAPTER SEVEN

Guesstimation and the Back of the Envelope

LOOK AROUND THE ROOM YOU ARE SITTING IN AS YOU READ THIS. NOW ANSWER THE FOLLOWING QUESTION: how many Ping-Pong balls would it take to fill this room? Yes, I know it's lame to make the reader do jot'em-dot'em exercises, and the question is old anyway, but please make the effort to come up with a number. I am trying to make a point here. Done? Good; then, tell me, what is the margin of error in your result? How many balls, plus or minus, do you think the room might accommodate as well? Again, numbers, please! Look at the margin of error: can you justify it, or did you just pull some numbers out of thin air to get me off your back? And if you found an argument to base your estimate on: does the result seem right to you? Too large, too small? Finally, can you state the assumptions you made when answering the first two questions? What did or did you not take into account? Did you take the furniture out or not? Did you look up the size of a Ping-Pong ball, or did you guess it? Did you take into account different ways to pack spheres? Which of these assumptions has the largest effect on the result? Continue on a second sheet of paper if you need more space for your answer. The game we just played is sometimes called guesstimation and is a close relative of the back-of-the-envelope calculation. The difference is minor: the way I see it, in guesstimation we worry primarily about finding suitable input values, whereas in a typical back-of-the-envelope calculation, the inputs are reasonably well known and the challenge is to simplify the actual calculation to the point that it can be done on the back of the proverbial envelope. (Some people seem to prefer napkins to envelopes; that's the more sociable crowd.)
Let me be clear about this: I consider proficiency at guesstimation and similar techniques the absolute hallmark of the practical data analyst, the person who goes out and solves real problems in the real world. It is so powerful because it connects a conceptual understanding (no matter how rough) with the concrete reality of the problem domain; it leaves no place to hide. Guesstimation also generates numbers (not theories or models), with their wonderful ability to cut through vague generalities and opinion-based discussions. For all these reasons, guesstimation is a crucial skill. It is where the rubber meets the road. The whole point of guesstimation is to come up with an approximate answer quickly and easily. The flip side of this is that it forces us to think about the accuracy of the result: first how to estimate the accuracy and then how to communicate it. That will be the program for this chapter.

Principles of Guesstimation

Let's step through our introductory Ping-Pong ball example together. This will give me an opportunity to point out a few techniques that are generally useful. First consider the room. It is basically rectangular in shape. I have bookshelves along several walls; this helps me estimate the length of each wall, since I know that shelves are 90 cm (3 ft) wide, a pretty universal standard. I also know that I am 1.80 m (6 ft) tall, which helps me estimate the height of the room. All told, this comes to 5 m by 3.5 m by 2.5 m, or about 50 m³. Now, the Ping-Pong ball. I haven't had one in my hands for a long time, but I seem to remember that they are about 2.5 cm (1 in) in diameter. That means I can line up 40 of them in a meter, which means I have 40³ in a cubic meter. The way I calculate this is: 40³ = 4³ · 10³ = 2⁶ · 1,000 = 64,000. That's the number of Ping-Pong balls that fit into a cubic meter.
Taking things together, I can fit 50 · 64,000 or approximately 3,000,000 Ping-Pong balls into this room. That's a large number. If each ball costs me a dollar at a sporting goods store, then the value of all the balls required to fill this room would be many times greater than the value of the entire house! Next, the margins of error. The uncertainty in each dimension is at least 10 percent. Relative errors are added to each other in a multiplication (we will discuss error propagation later in this chapter), so the total error turns out to be 3 · 10 percent = 30 percent! That's pretty large: the number of balls required might be as low as two million or as high as four million. It is uncomfortable to see how the rather harmless-looking 10 percent error in each individual dimension has compounded to lead to a 30 percent uncertainty. The same problem applies to the diameter of the Ping-Pong balls. Maybe 2.5 cm is a bit low; perhaps 3 cm is more like it. Now, that's a 20 percent increase, which means that the number of balls fitting into one cubic meter is reduced by 60 percent (3 times the relative error, again): now we can fit only about 30,000 of them into a cubic meter. The same goes for the overall estimate: a decrease by half if balls are 5 mm larger than initially assumed. Now the range is something between one and two million. Finally, the assumptions. Yes, I took the furniture out. Given the uncertainty in the total volume of the room, the space taken up by the furniture does not matter much. I also assumed that balls would stack like cubes, when in reality they pack tighter if we arrange them in the way oranges (or cannonballs) are stacked. It's a slightly nontrivial exercise in geometry to work out the factor, but it comes to about 15 percent more balls in the same space. So, what can we now say with certainty?
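The arithmetic of this estimate, including the error propagation, can be sketched in a few lines of Python (all inputs are the guesses from the text):

```python
# Room: 5 m x 3.5 m x 2.5 m, each dimension uncertain by about 10 percent.
dims = [5.0, 3.5, 2.5]   # meters
rel_err = 0.10

volume = dims[0] * dims[1] * dims[2]         # ~44 m^3 ("about 50")

# Balls per cubic meter for 2.5 cm balls stacked like cubes: 40^3.
balls_per_m3 = int(round((1 / 0.025) ** 3))  # 64,000

estimate = volume * balls_per_m3             # a few million

# For products, relative errors add: three 10% factors give ~30% total.
total_rel_err = 3 * rel_err
low = estimate * (1 - total_rel_err)
high = estimate * (1 + total_rel_err)
print(f"{estimate:,.0f} balls, roughly {low:,.0f} to {high:,.0f}")
```

Changing the assumed ball diameter to 3 or 4 cm reproduces the downward corrections discussed in the text.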
We will need a few million Ping-Pong balls; probably not less than one million and certainly not more than five million. The biggest uncertainty is the size of the balls themselves; if we need a more accurate estimate than the one we've obtained so far, then we can look up their exact dimensions and adjust the result accordingly. (After I wrote this paragraph, I finally looked up the size of a regulation Ping-Pong ball: 38–40 mm. Oops. This means that only about 15,000 balls fit into a cubic meter, and so I must adjust all my estimates down by a factor of 4.) This example demonstrates all important aspects of guesstimation:

• Estimate sizes of things by comparing them to something you know.
• Establish functional relationships by using simplifying assumptions.
• Originally innocuous errors can compound dramatically, so tracking the accuracy of an estimate is crucial.
• And finally, a few bad guesses on things that are not very familiar can have a devastating effect (I really haven't played Ping-Pong in a long time), but they can be corrected easily when better input is available.

Still, we did find the order of magnitude, one way or the other: a few million.

Estimating Sizes

The best way to estimate the size of an object is to compare it to something you know. The shelves played this role in the previous example, although sometimes you have to work a little harder to find a familiar object to use as reference in any given situation. Obviously, this is easier to do the more you know, and it can be very frustrating to find yourself in a situation where you don't know anything you could use as a reference. That being said, it is usually possible to go quite far with just a few data points to use as reference values.
(There are stories from the Middle Ages of how soldiers would count how many rows of stone blocks were used in the walls of a fortress before mounting an attack, the better to estimate the height of the walls. Obtaining an accurate value was necessary to prepare scaling ladders of the appropriate length: if the ladders were too short, then the top of the wall could not be reached; if they were too long, the defenders could grab the overhanging tops and topple the ladders back over. Bottom line: you've got to find your reference objects where you can.) Knowing the sizes of things is therefore the first order of business. The more you know, the easier it is to form an estimate; but also the more you know, the more you develop a feeling for the correct answer. That is an important step when operating with guesstimates: to perform an independent "sanity check" at the end to ensure we did not make some horrible mistake along the way. (In fact, the general advice is that "two (independent) estimates are better than one"; this is certainly true but not always possible; at least I can't think of an independent way to work out the Ping-Pong ball example we started with.) Knowing the sizes of things can be learned. All it takes is a healthy interest in the world around you; please don't go through the dictionary, memorizing data points in alphabetical order. This is not about beating your buddies at a game of Trivial Pursuit! Instead, this is about becoming familiar (I'd almost say intimate) with the world you live in. Feynman once wrote about Hans A. Bethe that "every number was near something he knew." That is the ideal. The next step is to look things up. In situations where one frequently needs relatively good approximations to problems coming from a comparably small problem domain, special-purpose lookup tables can be a great help.
I vividly remember a situation in a senior physics lab where we were working on an experiment (I believe, to measure the muon lifetime), when the instructor came by and asked us some guesstimation problem; I forget what it was, but it was nontrivial. None of us had a clue, so he whipped out from his back pocket a small booklet the size of a playing card that listed the physical properties of all kinds of subnuclear particles. For almost any situation that could arise in the lab, he had an approximate answer right there. Specialized lookup tables exist in all kinds of disciplines, and you might want to make your own as necessary for whatever it is you are working on. The funniest I have seen gave typical sizes (and costs) for all elements of a manufacturing plant or warehouse: so many square feet for the office of the general manager, so many square feet for his assistant (half the size of the boss's), down to the number of square feet per toilet stall, and, not to forget, how many toilets to budget for every 20 workers per 8-hour shift. Finally, if we don't know anything close and we can't look anything up, then we can try to estimate "from the ground up": starting just with what we know and then piling up arguments to arrive at an estimate. The problem with this approach is that the result may be way off. We have seen earlier how errors compound, and the more steps we have in our line of arguments, the larger the final error is likely to be, possibly becoming so large that the result will be useless. If that's the case, we can still try to find a cleverer argument that makes do with fewer argument steps. But I have to acknowledge that occasionally we will find ourselves simply stuck: unable to make an adequate estimate with the information we have. The trick is to make sure this happens only rarely.
Establishing Relationships

Establishing relationships that get us from what we know to what we want to find is usually not that hard. This is true in particular under common business scenarios, where the questions often revolve around rather simple relationships (how something fits into something else, how many items of a kind there are, and the like). In scientific applications, this type of argument can be harder. But for most situations that we are likely to encounter outside the science lab, simple geometric and counting arguments will suffice. In the next chapter, we will discuss in more detail the kinds of arguments you can use to establish relationships. For now, just one recommendation: make it simple! Not: keep it simple because, more likely than not, initially the problem is not simple; hence you have to make it so in order to make it tractable. Simplifying assumptions let you cut through the fog and get to the essentials of a situation. You may incur an error as you simplify the problem, and you will want to estimate its effect, but at least you are moving toward a result. An anecdote illustrates what I mean. When working for Amazon.com, I had a discussion with a rather sophisticated mathematician about how many packages Amazon can typically fit onto a tractor-trailer truck, and he started to work out the different ways you can stack rectangular boxes into the back of the truck! This is entirely missing the point because, for a rough calculation, we can make the simplifying assumption that the packages can take any shape at all (i.e., they behave like a liquid) and simply divide the total volume of the truck by the typical volume of a package. Since the individual package is tiny compared to the size of the truck, the specific shapes and arrangements of individual packages are irrelevant: their effect is much smaller than the errors in our estimates for the size of the truck, for instance.
(We'll discuss this in more detail in Chapter 8, where we discuss the mean-field approximation.) The point of back-of-the-envelope estimates is to retain only the core of the problem, stripping away as much nonessential detail as possible. Be careful that your sophistication does not get in the way of finding simple answers.

Working with Numbers

When working with numbers, don't automatically reach for a calculator! I know that I am now running the risk of sounding ridiculous, praising the virtues of old-fashioned reading, 'riting, and 'rithmetic. But that's not my point. My point is that it is all right to work with numbers. There is no reason to avoid them. I have seen the following scenario occur countless times: a discussion is under way, everyone is involved, ideas are flying, concentration is intense, when all of a sudden we need a few numbers to proceed. Immediately, everything comes to a screeching halt while several people grope for their calculators and others fire up their computers, followed by hasty attempts to get the required answer, which invariably (given the haste) leads to numerous keying errors and false starts, followed by arguments about the best calculator software to use. In any case, the whole creative process just died. It's a shame. Besides forcing you to switch context, calculators remove you one step further from the nature of the problem. When working out a problem in your head, you get a feeling for the significant digits in the result: for which digits does the result change as the inputs take on any value from their permissible range? The surest sign that somebody has no clue is when they quote the results from a calculation based on order-of-magnitude inputs to 16 digits! The whole point here is not to be religious about it, either way.
If it actually becomes more complicated to work out a numerical approximation in your head, then by all means use a calculator. But the compulsive habit of avoiding work with numbers at all cost should be restrained. There are a few good techniques that help with the kinds of calculations required for back-of-the-envelope estimates and that are simple enough that they still (even today) hold their own against uncritical calculator use. Only the first is a must-have; the other two are optional.

Powers of ten

The most important technique for deriving order-of-magnitude estimates is to work with orders of magnitude directly; that is, with powers of ten. It quickly gets confusing to multiply 9,000 by 17 and then to divide by 400, and so on. Instead of trying to work with the numbers directly, split each number into the most significant digit (or digits) and the respective power of ten. The multiplications now take place among the digits only, while the powers of ten are summed up separately. In the example I just gave, we split 9,000 = 9 · 1,000, 17 = 1.7 · 10 ≈ 2 · 10, and 400 = 4 · 100. From the leading digits we have 9 times 2 divided by 4 equals 4.5, and from the powers of ten we have 3 plus 1 minus 2 equals 2; so then 4.5 · 10² = 450. That wasn't so hard, was it? (I have replaced 17 with 2 · 10 in this approximation, so the result is a bit on the high side, by about 15 percent. I might want to correct for that in the end; a better approximation would be closer to 390. The exact value is 382.5.) More systematically, any number can be split into a decimal fraction and a power of ten.
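The mantissa-and-exponent bookkeeping can be written out mechanically; here is a Python sketch of the 9,000 · 17 / 400 example (the split helper is a throwaway written for illustration, not a library function):

```python
import math

def split(x):
    """Split a positive number into a mantissa in [1, 10) and a power of ten.
    (Good enough for a sketch; exact powers of ten can round awkwardly.)"""
    exp = math.floor(math.log10(x))
    return x / 10 ** exp, exp

m1, e1 = split(9000)   # (9.0, 3)
m2, e2 = split(17)     # (1.7, 1); round the mantissa to 2 for mental arithmetic
m3, e3 = split(400)    # (4.0, 2)

# Digits and exponents are handled separately: 9 * 2 / 4 = 4.5, 3 + 1 - 2 = 2.
mental = (9 * 2 / 4) * 10 ** (e1 + e2 - e3)  # 4.5 * 10^2 = 450
exact = 9000 * 17 / 400                      # 382.5
print(mental, exact)
```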
It will be most convenient to require the fraction to have exactly one digit before the decimal point, like so:

123.45 = 1.2345 · 10²
1,000,000 = 1.0 · 10⁶
0.00321 = 3.21 · 10⁻³

The fraction is commonly known as the mantissa (or the significand in more recent usage), whereas the power of ten is always referred to as the exponent. This notation significantly simplifies multiplication and division between numbers of very different magnitude: the mantissas multiply (involving only single-digit multiplications, if we restrict ourselves to the most significant digit), and the exponents add. The biggest challenge is to keep the two different tallies simultaneously in one's head.

Small perturbations

The techniques in this section are part of a much larger family of methods known as perturbation theory, methods that play a huge role in applied mathematics and related fields. The idea is always the same: we split the original problem into two parts, one that is easy to solve and one that is somehow "small" compared to the first. If we do it right, the effect of the latter part is only a "small perturbation" to the first, easy part of the problem. (You may want to review Appendix B if some of this material is unfamiliar to you.) The easiest application of this idea is in the calculation of simple powers, such as 12³. Here is how we would proceed:

12³ = (10 + 2)³ = 10³ + 3 · 10² · 2 + 3 · 10 · 2² + 2³ = 1,000 + 600 + ··· = 1,600 + ···

In the first step, we split 12 into 10 + 2: here 10 is the easy part (because we know how to raise 10 to an integer power) and 2 is the perturbation (because 2 ≪ 10). In the next step, we make use of the binomial formula (see Appendix B), ignoring everything except the linear term in the "perturbation." The final result is pretty close to the exact value. The same principle can be applied to many other situations.
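A quick numerical check of the 12³ example, evaluating the terms of the binomial expansion separately:

```python
# 12^3 = (10 + 2)^3: keep the easy part and the linear "perturbation" term.
easy = 10 ** 3                      # 1,000
linear = 3 * 10 ** 2 * 2            # 600
approx = easy + linear              # 1,600

# The dropped quadratic and cubic terms are genuinely small:
dropped = 3 * 10 * 2 ** 2 + 2 ** 3  # 120 + 8 = 128
exact = 12 ** 3                     # 1,728
print(approx, exact, exact - approx)
```

The approximation misses the exact value by exactly the dropped terms, which are less than a tenth of the result.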
In the context of this chapter, I am interested in this concept because it gives us a way to estimate and correct for the error introduced by ignoring all but the first digit in powers-of-ten calculations. Let's look at another example: 32 · 430. Using only the most significant digits, this is (3 · 10¹) · (4 · 10²) = (3 · 4) · 10¹⁺² = 12,000. But this is clearly not correct, because we dropped some digits from the factors. We can consider the nonleading digits as small perturbations to the result and treat them separately. In other words, the calculation becomes:

(3 + 0.2) · (4 + 0.3) · 10³ ≈ 3(1 + 0.1···) · 4(1 + 0.1···) · 10³

where I have factored out the largest factor in each term. On the righthand side I did not write out the correction terms in full; for our purposes, it's enough to know that they are about 0.1. Now we can make use of the binomial formula:

(1 + ε)² = 1 + 2ε + ε²

We drop the last term (since it will be very small compared to the other two), but the second term gives us the size of the correction: +2ε. In our case, this amounts to about 20 percent, since ε is one tenth. I will admit that this technique seems somewhat out of place today, although I do use it for real calculations when I don't have a calculator on me. But the true value of this method is that it enables me to estimate and reason about the effect that changes to my input variables will have on the overall outcome. In other words, this method is a first step toward sensitivity analysis.

Logarithms

This is the method by which generations before us performed numerical calculations. The crucial insight is that we can use logarithms for products (and exponentiation) by making use of the functional equation for logarithms:

log(xy) = log(x) + log(y)

In other words, instead of multiplying two numbers, we can add their logarithms. The slide rule was a mechanical calculator based on this idea.
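Both tricks of this section can be verified side by side; here is a short sketch for the 32 · 430 example, including the slide-rule-style route of adding logarithms:

```python
import math

# Leading digits only: (3 * 10) * (4 * 100) = 12,000.
crude = (3 * 10) * (4 * 100)

# Small-perturbation correction: each factor was truncated by roughly
# 10 percent, and (1 + eps)^2 ~ 1 + 2*eps, so add about 20 percent.
corrected = crude * 1.2             # 14,400

# Slide-rule style: multiply by adding logarithms, then exponentiate.
via_logs = math.exp(math.log(32) + math.log(430))

exact = 32 * 430                    # 13,760
print(crude, corrected, round(via_logs), exact)
```

The corrected estimate brackets the exact product from above, exactly as the 2ε argument predicts.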
Amazingly, using logarithms for multiplication is still relevant, but in a slightly different context. For many statistical applications (in particular when using Bayesian methods), we need to multiply the probabilities of individual events in order to arrive at the probability for the combination of these events. Since probabilities are by construction less than 1, the product of any two probabilities is always smaller than either individual factor. It does not take many probability factors to underflow the floating-point precision of almost any standard computer. Logarithms to the rescue! Instead of multiplying the probabilities, take logarithms of the individual probabilities and then add the logarithms. (The logarithm of a number that is less than 1 is negative, so one usually works with −log(p).) The resulting numbers, although mathematically equivalent, have much better numerical properties. Finally, since in many applications we mostly care which of a selection of different events has the maximum probability, we don't even need to convert back to probabilities: the event with the maximum probability will also be the one with the largest logarithm (and hence the smallest −log(p)).

More Examples

We have all seen this scene in many a Hollywood movie: the gangster comes in to pay off the hitman (or pay for the drug deal, or whatever it is). Invariably, he hands over an elegant briefcase with the money: cash, obviously. Question: how much is in the case? Well, a briefcase is usually sized to hold two letter-size sheets of paper next to each other; hence it is about 17 by 11 inches wide, and maybe 3 inches tall (or 40 by 30 by 7 centimeters). A banknote is about 6 inches wide and 3 inches tall, which means that we can fit about six per sheet of paper. Finally, a 500-page ream of printer paper is about 2 inches thick, so a 3-inch stack holds about 750 notes. All told, we end up with 2 · 6 · 750 = 9,000 banknotes.
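The briefcase count is just three factors; replaying it in a short sketch (numbers as estimated in the text):

```python
sheets_side_by_side = 2      # a briefcase holds two letter-size sheets
notes_per_sheet = 6          # a banknote is about 6 x 3 inches
notes_per_stack = 750        # 3-inch stack; a 500-page ream is ~2 inches

banknotes = sheets_side_by_side * notes_per_sheet * notes_per_stack
print(banknotes)             # 9000
```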
The highest dollar denomination in general circulation is the $100 bill,* so the maximum value of that payoff was about $1 million, and certainly not more than $5 million. Conclusion: for the really big jobs, you need to pay by check. Or use direct transfer.

For a completely different example, consider the following question: what's the typical takeoff weight of a large, intercontinental jet airplane? It turns out that you can come up with an approximate answer even if you don't know anything about planes. A plane is basically an aluminum tube with wings. Ignore the wings for now; let's concentrate on the tube. How big is it? One way to find out is to check your boarding pass: it will display your row number. Unless you are much classier than your author, chances are that it shows a row number in the range of 40–50. You can estimate that the distance between seats is a bit over 50 cm, although it feels closer. (When you stand in the aisle, facing sideways, you can place both hands comfortably on the tops of two consecutive seats; your shoulders are about 30 cm apart, so the distance between seats must be a tad greater than that.) Thus we have the length: 50 · 0.5 m = 25 m. We double this to make up for first and business class, and to account for cockpit and tail. Therefore, the length of the tube is about 50 m. How about its diameter? Back in economy, rows are about 9 seats abreast, plus two aisles. Each seat being just a bit wider than your shoulders (hopefully), we end up with a diameter of about 5 m. Hence we are dealing with a tube that is 50 m long and 5 m in diameter. As you walked through the door, you might have noticed the strength or thickness of the tube: it's about 5 mm. Let's make that 10 mm (1 cm) to account for "stuff": wiring, seats, and all kinds of other hardware that's in the plane.
*Larger denominations exist but, although legal tender, are not officially in circulation and apparently fetch far more than their face value among collectors.

TABLE 7-1. Approximate measurements for some common intercontinental jets

          Length   Width   Diameter   Weight (empty)   Weight (full)   Passengers
  B767    50 m     50 m    5 m        90 t             150 t           200
  B747    70 m     60 m    6.5 m      175 t            350 t           400
  A380    75 m     80 m    7 m        275 t            550 t           500

Imagining now that you unroll the entire plane (the way you unroll aluminum foil), the result is a sheet that is 50 m long, π · 5 m wide, and 0.01 m thick: a volume of 50 · π · 5 · 0.01 m³. The density of aluminum is a little higher than that of water (if you have ever been to a country that uses aluminum coins, you know that you can barely make them float), so let's say it's 3 g/cm³. It is at this point that we need to employ the proverbial back of the envelope (or the cocktail napkin they gave you with the peanuts) to work out the numbers. It will help to realize that there are 100³ = 10⁶ cubic centimeters in a cubic meter and that the density of aluminum can therefore be written as 3 tons per cubic meter. The final mass of the "tube" comes out to about 25 tons. Let's double this to take into account the wings (wings are about as long as the fuselage is wide; if you look at the silhouette of a plane in the sky, it forms an approximate square); this yields 50 tons just for the "shell" of the airplane. It does not take into account the engines and most of the other equipment inside the plane.

Now let's compare this number with the load. We have 50 rows, half of them with 9 passengers and the other half with 5; this gives us an average of 7 passengers per row, or a total of 350 passengers per plane. Assuming that each passenger contributes 100 kg (body weight and baggage), the load amounts to 35 tons: comparable to the weight of the plane itself. (This weight-to-load ratio is actually not that different from that of a car fully occupied by four people.
Of course, if you are driving alone, then the ratio for the car is much worse.) How well are we doing? Actually, not bad at all: Table 7-1 lists typical values for three planes that are common on transatlantic routes: the mid-size Boeing 767, the large Boeing 747 (the "Jumbo"), and the extra-large Airbus A380. That's enough to check our calculations. We are not far off. (What we totally missed is that planes don't fly on air and in-flight peanuts alone: in fact, the greatest single contribution to the weight of a fully loaded and fueled airplane is the weight of the fuel. You can estimate its weight as well, but to do so, you will need one additional bit of information: the fuel consumption of a modern jet airplane, per passenger and mile traveled, is less than that of a typical compact car carrying only a single passenger.)

That was a long and involved estimation, and I won't blame you if you skipped some of the intermediate steps. In case you are just joining us again, I'd like to emphasize one point: we came up with a reasonable estimate without having to resort to any "seat of the pants" guesses, even though we had no prior knowledge! Everything that we used, we could either observe directly (such as the number of rows in the plane or the thickness of the fuselage walls) or could relate to something that was familiar to us (such as the distance between seats). That's an important takeaway!

But not all calculations have to be complicated. Sometimes, all you have to do is "put two and two together." A friend told me recently that his company had to cut its budget by a million dollars. We knew that the overall budget for this company was about five million dollars annually. I also knew that, since it was mostly a service company, almost all of its budget went to payroll (there was no inventory or rent to speak of).
I could therefore tell my friend that layoffs were around the corner: even with a salary reduction program, the company would have to cut at least 15 percent of its staff. The response was: "Oh, no, our management would never do that." Two weeks later, the company eliminated one third of all positions.

Things I Know

Table 7-2 is a collection of things that I know and frequently use to make estimates. Of course, this list may seem a bit whimsical, but it is actually pretty serious. For instance, note the range of areas from which these items are drawn! What domains can you reason about, given the information in this table? Also notice the absence of systematic "scales." That is no accident. I don't need to memorize the weights of a mouse, a cat, and a horse, because I know (or can guess) that a mouse is 1,000 times lighter than a human, a cat 10 times lighter, and a horse 10 times heavier. The items in this table are not intended to be comprehensive; in fact, they are the bare minimum. Knowing how things relate to each other lets me take it from there. Of course, this table reflects my personal history and interests. Yours will be different.

How Good Are Those Numbers?

Remember the Ping-Pong ball question that started this chapter? I once posed that question as a homework problem in a class, and one student's answer was something like 1,020,408.16327. (Did you catch both mistakes? Not only does the result of this rough estimate pretend to be accurate to within a single ball; the answer also includes a fractional part, which is meaningless, given the context.) This type of confusion is incredibly common: we focus so much on the calculation (any calculation) that we forget to interpret the result! This story serves as a reminder that there are two questions we should ask before any calculation, as well as one afterward. The two questions to ask before we begin are:

• What level of correctness do I need?
• What level of correctness can I afford?
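The student's mistake is easy to avoid mechanically: round an estimate to no more digits than its uncertainty supports. A small Python sketch (the helper is my own, not from the text):

```python
import math

def round_to_sig(x, sig=1):
    """Round x to `sig` significant digits."""
    if x == 0:
        return 0
    place = int(math.floor(math.log10(abs(x)))) - (sig - 1)
    return round(x / 10**place) * 10**place

# A rough estimate should be quoted as "about a million", nothing more
print(round_to_sig(1_020_408.16327, 1))   # 1000000
```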
TABLE 7-2. Reference points for guesstimations

  Size of an atomic data type              10 bytes
  A page of text                           55 lines of 80 characters, or about 4,500 characters total
  A record (of anything)                   100–1,000 bytes
  A car                                    4 m long, 1 ton weight
  A person                                 2 m tall, 100 kg weight
  A shelf                                  1 m wide, 2 m tall
  Swimming pool (not Olympic)              25 × 12.5 meters
  A story in a commercial building         4 m high
  Passengers on a large airplane           350
  Speed of a jetliner                      1,000 km/hr
  Flight time from NY                      6 hr (to the West Coast or Europe)
  Human, walking                           1 m/s (5 km/hr)
  Human, maximum power output              200 W (not sustainable)
  Power consumption of a water kettle      2 kW
  Electricity grid                         100 V (U.S.), 220 V (Europe)
  Household fuse                           16 A
  3 · 3                                    10 (minus 10%)
  π                                        3
  Large city                               1 million
  Population, Germany or Japan             100 million
  Population, USA                          300 million
  Population, China or India               1 billion
  Population, Earth                        7 billion
  U.S. median annual income                $60,000
  U.S. federal income tax rate             25% (but also as low as 0% and as high as 40%)
  Minimum hourly wage                      $10 per hour
  Billable hours in a year                 2,000 (50 weeks at 40 hours per week)
  Low annual inflation                     2%
  High annual inflation                    8%
  Price of a B-2 bomber                    $2 billion
  American Civil War; Franco-Prussian War  1860s; 1870s
  French Revolution                        1789
  Reformation                              1517
  Charlemagne                              800
  Great Pyramids                           3000 B.C.E.
  Hot day                                  35 Celsius
  Very hot kitchen oven                    250 Celsius
  Steel melts                              1200 Celsius
  Density of water                         1 g/cm³
  Density of aluminum                      3 g/cm³
  Density of lead                          13 g/cm³
  Density of gold                          20 g/cm³
  Ionization energy of hydrogen            13.6 eV
  Atomic diameter (Bohr radius)            10⁻¹⁰ m
  Energy of X-ray radiation                keV
  Nuclear binding energy per particle      MeV
  Wavelength of the sodium doublet         590 nm

The question to ask afterward is:

• What level of correctness did I achieve?

I use the term "correctness" here a bit loosely to refer to the quality of the result.
There are actually two different concepts involved: accuracy and precision.

Accuracy
    Accuracy expresses how close the result of a calculation or measurement comes to the "true" value. Low accuracy is due to systematic error.

Precision
    Precision refers to the "margin of error" in the calculation or the experiment. In experimental situations, precision tells us how far the results will stray when the experiment is repeated several times. Low precision is due to random noise.

Said another way: accuracy is a measure of the correctness of the result, and precision is a measure of the result's uncertainty.

Before You Get Started: Feasibility and Cost

The first question (what level of correctness is needed) will define the overall approach: if I only need an order-of-magnitude approximation, then the proverbial back of the envelope will do; if I need better results, I may have to work harder. The second question is the necessary corollary: it asks whether I will be able to achieve my goal given the available resources. In other words, these two questions pose a classic engineering trade-off (i.e., they require a regular cost–benefit analysis). This obviously does not matter much for a throwaway calculation, but it matters a lot for bigger projects. I once witnessed a huge project (involving a dozen developers for over a year) to build a computation engine that had failed to answer either question clearly until it was too late. The project was eventually canceled when it turned out that achieving the required accuracy would cost more than the project was supposed to gain the company in increased revenue! (Don't laugh, it could happen to you. Or at least in your company.) This story points to an important fact: correctness is usually expensive, and high correctness is often disproportionately more expensive.
In other words, a 20 percent approximation can be done on the back of an envelope, a 5 percent solution may take a couple of months, but the cost of a 1 percent solution may be astronomical. It is also not uncommon that there is no middle ground (e.g., an affordable 10 percent solution). I have also seen the opposite problem: projects chasing correctness that is not really necessary, or not achievable because the required input data is not available or is of poor quality. This is a particular risk if the project involves the opportunity to play with some attractive new technology.

Finding out the true cost or benefit of higher-quality results can often be tricky. I was working on a project to forecast the daily number of visitors viewing the company's website when I was told that "we must have absolute forecast accuracy; nothing else matters." I suggested that if this were so, then we should take the entire site down, since doing so would guarantee a perfect forecast (zero page views). Yet because this would also imply zero revenue from display advertising, my suggestion focused the client's mind wonderfully on defining more clearly what "else" mattered.

After You Finish: Quoting and Displaying Numbers

It is obviously pointless to report or quote results to more digits than are warranted. In fact, doing so is misleading, or at the very least unhelpful, because it fails to communicate to the reader another important aspect of the result: namely, its reliability! A good rule (sometimes known as Ehrenberg's rule) is to quote all digits up to and including the first two variable digits. Starting from the left, you keep all digits that do not change from one data point to the next; then you also keep the first two digits that vary over the entire range from 0 to 9 as you scan over all data points. An example will make this clear.
Consider the following data set:

121.733    122.129    121.492    119.782    120.890    123.129

Here, the first digit (from the left) is always 1, and the second digit takes on only two values (1 and 2), so we retain them both. All further digits can take on any value between 0 and 9, and we retain the first two of them, meaning that we retain a total of four digits from the left. The two rightmost digits therefore carry no significance, and we can drop them when quoting results. The mean (for instance) should be reported as 121.5; displaying further digits is of no value.

This rule, to retain the first two digits that vary over the entire range of values and all digits to the left of them, works well with the methods described in this chapter. If you work with numbers as I suggested earlier, then you also develop a sense for the digits that are largely unaffected by reasonable variations in the input parameters, as well as for the position in the result after which uncertainties in the input parameters corrupt the outcome.

Finally, a word of warning. The number of digits to be displayed should be established from the outset, because reducing it later will trigger resistance. I once encountered a system that reported projected sales numbers (which were typically in the hundreds of thousands) to six "significant" digits (e.g., as 324,592 or so). But because these were forecasts that were at best accurate to within 30 percent, all digits beyond the first were absolute junk! (Note that 30 percent of 300,000 is 100,000, which means that the confidence band for this result was 200,000–400,000.) However, a later release of the same software, which reported only the actually significant digits, was met by violent opposition from the user community because it was "so much less precise"!
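The rule is mechanical enough to script. One way to do so (my own reformulation, not from the text: the "first two varying digits" correspond to the first two significant digits of the data's spread) looks like this:

```python
import math, statistics

data = [121.733, 122.129, 121.492, 119.782, 120.890, 123.129]

# Keep digits down to the position of the second significant digit
# of the spread (max - min) of the data.
spread = max(data) - min(data)                   # ~3.3
last_place = math.floor(math.log10(spread)) - 1  # -1, i.e. the tenths digit

mean = statistics.mean(data)                     # 121.5258...
reported = round(mean, -last_place)
print(reported)   # 121.5
```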
Optional: A Closer Look at Perturbation Theory and Error Propagation

I already mentioned the notion of "small perturbations." It is one of the great ideas of applied mathematics, so it is worth a closer look. Whenever we can split a problem into an "easy" part and a part that is "small," the problem lends itself to a perturbative solution. The "easy" part we can solve directly (that's what we mean by "easy"), and the part that is "small" we solve in an approximate fashion. By far the most common source of approximations in this area is the observation that every smooth function (every curve) is approximately linear (a straight line) in a sufficiently small neighborhood: we can therefore replace the full problem by its linear approximation when dealing with the "small" part, and linear problems are always solvable.

As a simple example, let's calculate √17. Can we split this into a "simple" and a "small" problem? Well, we know that 16 = 4², and so √16 = 4. That's the simple part; and since 17 = 16 + 1 with 1 ≪ 16, there's the "small" part of the problem. We can now rewrite our problem as follows:

√17 = √(16 + 1) = √(16 (1 + ε)) = √16 · √(1 + ε) = 4 √(1 + ε)    with ε = 1/16

It is often convenient to factor out everything so that we are left with "1 + small stuff," as in the middle of this chain. At this point, we also replaced the small part with the symbol ε (we will put the numeric value back in at the end). So far everything has been exact, but to make progress we need to make an approximation. In this case, we replace the square root by a local approximation around 1. (Remember: ε is small, and √1 is easy.) Every smooth function can be replaced by a straight line locally, and if we don't go too far, then that approximation turns out to be quite good (see Figure 7-1). These approximations can be derived in a systematic fashion by a process known as Taylor expansion.
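How good the local replacement of the square root is can be checked in a few lines of Python (my own sketch, not from the text), comparing the first- and second-order Taylor approximations of √(1 + x) against the exact value:

```python
import math

eps = 1 / 16   # the "small" part of 17 = 16 * (1 + eps)

# sqrt(1 + x) ~ 1 + x/2 - x**2/8 near x = 0
linear    = 4 * (1 + eps / 2)
quadratic = 4 * (1 + eps / 2 - eps**2 / 8)
exact     = math.sqrt(17)

print(linear)      # 4.125
print(quadratic)   # 4.123046875
print(exact)       # 4.123105625617661
```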
The figure shows both the simplest approximation, which is just a straight line, and the next-higher (second-order) approximation, which is even better. Taylor expansions are so fundamental that they are almost considered a fifth basic operation (after addition, subtraction, multiplication, and division). See Appendix B for a little more information on them.

FIGURE 7-1. The square-root function √(1 + x) and the first two approximations around x = 0.

With the linear approximation √(1 + ε) ≈ 1 + ε/2 in place, our problem has now become quite tractable:

√17 ≈ 4 (1 + ε/2 + ···) = 4 + 2ε

We can now plug the numeric value ε = 1/16 back in: √17 ≈ 4 + 2/16 = 4.125. The exact value is √17 = 4.12310..., so our approximation is pretty good.

Error Propagation

Error propagation considers situations where we have some quantity x and an associated uncertainty δx. We write x ± δx to indicate that we expect the true value to lie anywhere in the range from x − δx to x + δx. In other words, we have not just a single value for the quantity x but instead a whole range of possible values. Now suppose we have several quantities, each with its own error term, and we need to combine them in some fashion. We probably know how to work with the quantities themselves, but what about the uncertainties? For example, we know both the height and the width of a rectangle to within some range: h ± δh and w ± δw. We also know that the area is A = hw (from basic geometry). But what can we say about the uncertainty in the area? This kind of scenario is ideal for the perturbative methods discussed earlier: the uncertainties are "small," so we can use simplifying approximations to deduce their behavior.
Let's work through the area example:

A = (h ± δh) (w ± δw)
  = hw (1 ± δh/h) (1 ± δw/w)
  = hw (1 ± δh/h ± δw/w + (δh/h)(δw/w))

Here again we have factored the primary terms out, to end up with terms of the form "1 + small stuff," because that makes life easier. This also means that, instead of expressing the uncertainty through the absolute errors δh or δw, we express it through the relative errors δh/h or δw/w. (Observe that if δh ≪ h, then δh/h ≪ 1.) So far, everything has been exact. Now comes the approximation: the error terms are small (in fact, smaller than 1); hence their product is extra-small, and we can therefore drop it. Our final result is thus

A = hw (1 ± (δh/h + δw/w))

or, in words: "When multiplying two quantities, their relative errors add." So if I know both the width and the height to within 10 percent each, then my uncertainty in the area will be 20 percent.

Here are a few more results of this form, which are useful whenever you work with quantities that have associated uncertainties (you might want to try deriving some of these yourself):

Sum:          (x ± δx) + (y ± δy) = (x + y) ± (δx + δy)
Product:      (x ± δx) · (y ± δy) = xy (1 ± (δx/x + δy/y))
Fraction:     (x ± δx) / (y ± δy) = (x/y) (1 ± (δx/x + δy/y))
Square root:  √(x + δx) = √x · √(1 + δx/x) ≈ √x (1 + δx/(2x))
Logarithm:    log(x + δx) = log x + log(1 + δx/x) ≈ log x + δx/x

The most important ones are the first two: when adding (or subtracting) two quantities, their absolute errors add; and when multiplying (or dividing) two quantities, their relative errors add. This implies that, if one of two quantities has a significantly larger error than the other, then the larger error dominates the final uncertainty. Finally, you may have seen a different way to calculate errors that gives slightly tighter bounds, but it is only appropriate if the errors have been determined by calculating the variances of repeated measurements of the same quantity.
Only in that case are the statistical assumptions valid upon which this alternative calculation is based. For guesstimation, the simple (albeit more pessimistic) approach described here is more appropriate.

Workshop: The GNU Scientific Library (GSL)

What do you do when a calculation becomes too involved to do in your head or even on the back of an envelope? In particular, what can you do if you need more precision than a simple order-of-magnitude estimation (as practiced in this chapter) will provide? Obviously, you reach for a numerical library!

The GNU Scientific Library, or GSL (http://www.gnu.org/software/gsl/), is the best currently available open source library for numerical and scientific calculations that I am aware of. The list of included features is comprehensive, and the implementations are of high quality. Thanks to some unifying conventions, the API, though forbidding at first, is actually quite easy to learn and comfortable to use. Most importantly, the library is mature, well documented, and reliable.

Let's use it to solve two rather different problems; this will give us an opportunity to highlight some of the design choices incorporated into the GSL. The first example involves matrix and vector handling: we will calculate the singular value decomposition (SVD) of a matrix. The second example will demonstrate how the GSL handles nonlinear, iterative problems in numerical analysis as we find the minimum of a nonlinear function.

The listing that follows should give you a flavor of what vector and matrix operations look like when using the GSL. First, we allocate a couple of (two-dimensional) vectors and assign values to their elements. We then perform some basic vector operations: adding one vector to another and forming a dot product. (The result of a dot product is a scalar, not another vector.)
Finally, we allocate and initialize a matrix and calculate its SVD. (See Chapter 14 for more information on vector and matrix operations.)

/* Basic Linear Algebra using the GSL */

#include <stdio.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>
#include <gsl/gsl_linalg.h>

int main() {
  double r;
  gsl_vector *a, *b, *s, *t;
  gsl_matrix *m, *v;

  /* --- Vectors --- */
  a = gsl_vector_alloc( 2 );   /* two dimensions */
  b = gsl_vector_alloc( 2 );

  /* a = [ 1.0, 2.0 ] */
  gsl_vector_set( a, 0, 1.0 );
  gsl_vector_set( a, 1, 2.0 );

  /* b = [ 3.0, 6.0 ] */
  gsl_vector_set( b, 0, 3.0 );
  gsl_vector_set( b, 1, 6.0 );

  /* a += b (so that now a = [ 4.0, 8.0 ]) */
  gsl_vector_add( a, b );
  gsl_vector_fprintf( stdout, a, "%f" );

  /* r = a . b (dot product) */
  gsl_blas_ddot( a, b, &r );
  fprintf( stdout, "%f\n", r );

  /* --- Matrices --- */
  s = gsl_vector_alloc( 2 );
  t = gsl_vector_alloc( 2 );
  m = gsl_matrix_alloc( 2, 2 );
  v = gsl_matrix_alloc( 2, 2 );

  /* m = [ [ 1, 2 ], [ 0, 3 ] ] */
  gsl_matrix_set( m, 0, 0, 1.0 );
  gsl_matrix_set( m, 0, 1, 2.0 );
  gsl_matrix_set( m, 1, 0, 0.0 );
  gsl_matrix_set( m, 1, 1, 3.0 );

  /* m = U s V^T (SVD: singular values are in vector s) */
  gsl_linalg_SV_decomp( m, v, s, t );
  gsl_vector_fprintf( stdout, s, "%f" );

  /* --- Cleanup --- */
  gsl_vector_free( a );
  gsl_vector_free( b );
  gsl_vector_free( s );
  gsl_vector_free( t );
  gsl_matrix_free( m );
  gsl_matrix_free( v );

  return 0;
}

It becomes immediately (and a little painfully) clear that we are dealing with plain C, not C++ or any other more modern, object-oriented language! There is no operator overloading; we must use regular functions to access individual vector and matrix elements. There are no namespaces, so function names tend to be lengthy. And of course there is no garbage collection! What is not so obvious is that element access is actually bounds checked: if you try to access a vector element that does not exist (e.g., gsl_vector_set( a, 4, 1.0 )), then the GSL's internal error handler will be invoked.
By default, it will halt the program and print a message to the screen. This is quite generally true: if the library detects an error, including bad inputs, failure to converge numerically, or an out-of-memory situation, it will invoke its error handler to notify you. You can provide your own error handler to respond to errors in a more flexible fashion. For a fully tested program, you can also turn range checking of vector and matrix elements off completely, to achieve the best possible runtime performance.

Two more implementation details before leaving the linear algebra example: although the matrix and vector elements are of type double in this example, versions of all routines exist for integer and complex data types as well. Furthermore, the GSL will use an optimized implementation of the BLAS (Basic Linear Algebra Subprograms) API if one is available; if not, the GSL comes with its own basic implementation.

Now let's take a look at the second example. Here we use the GSL to find the minimum of a one-dimensional function. The function to minimize is defined at the top of the listing: x² log(x). In general, nonlinear problems such as this must be solved iteratively: we start with a guess, then calculate a new trial solution based on that guess, and so on until the result meets whatever stopping criteria we care to define. At least that's what the introductory textbooks tell you.

In the main part of the program, we instantiate a "minimizer," which is an encapsulation of a specific minimization algorithm (in this case, Golden Section Search; others are available, too), and initialize it with the function to minimize as well as our initial guess for the interval containing the minimum. Now comes the surprising part: an explicit loop!
In this loop, the "minimizer" takes a single step in the iteration (i.e., calculates a new, tighter interval bounding the minimum) but then essentially hands control back to us. Why so complicated? Why can't we just specify the desired accuracy of the interval and let the library handle the entire iteration for us? The reason is that real problems more often than not don't converge as obediently as the textbooks suggest! Instead they can (and do) fail in a variety of ways: they converge to the wrong solution, they attempt to access values for which the function is not defined, they attempt to make steps that (for reasons of the larger system of which the routine is only a small part) are either too large or too small, or they diverge entirely. Based on my experience, I have come to the conclusion that every nonlinear problem is different (whereas every linear problem is the same), and therefore generic black-box routines don't work! This brings us back to the way this minimization routine is implemented: the required iteration is not a black box; instead, it is open and accessible to us. We can simply monitor its progress (as we do in this example, by printing every iteration step to the screen), but we could also interfere with it, for instance to enforce some invariant that is specific to our problem. The "minimizer" does as much as it can by calculating and proposing a new interval; ultimately, however, we are in control of how the iteration progresses. (For the textbook example used here, this doesn't matter, but it makes all the difference when you are doing serious numerical analysis on real problems!)
/* Minimizing a function with the GSL */

#include <stdio.h>
#include <math.h>
#include <gsl/gsl_math.h>
#include <gsl/gsl_min.h>

double fct( double x, void *params ) {
  return x*x*log(x);
}

int main() {
  double a = 0.1, b = 1.0;  /* interval that bounds the minimum */
  gsl_function f;           /* the function to minimize */
  gsl_min_fminimizer *s;    /* pointer to the minimizer instance */

  f.function = &fct;        /* the function to minimize */
  f.params = NULL;          /* no additional parameters needed */

  /* allocate the minimizer, choosing a particular algorithm */
  s = gsl_min_fminimizer_alloc( gsl_min_fminimizer_goldensection );

  /* initialize the minimizer with a function and an initial interval */
  gsl_min_fminimizer_set( s, &f, (a+b)/2.0, a, b );

  while ( b-a > 1.e-6 ) {
    /* perform one minimization step */
    gsl_min_fminimizer_iterate( s );

    /* obtain the new bounding interval */
    a = gsl_min_fminimizer_x_lower( s );
    b = gsl_min_fminimizer_x_upper( s );

    printf( "%f\t%f\n", a, b );
  }

  printf( "Minimum Position: %f\tValue: %f\n",
          gsl_min_fminimizer_x_minimum(s), gsl_min_fminimizer_f_minimum(s) );

  gsl_min_fminimizer_free( s );
  return 0;
}

Obviously, we have only touched on the GSL. My primary intention in this section was to give you a sense of the way the GSL is designed and of the kinds of considerations it incorporates. The list of features is extensive; consult the documentation for more information.

Further Reading

• Guesstimation: Solving the World's Problems on the Back of a Cocktail Napkin. Lawrence Weinstein and John A. Adam. Princeton University Press. 2008.
This little book contains about a hundred guesstimation problems (with solutions!) from all walks of life. If you are looking for ideas to get you started, look no further.

• Programming Pearls. Jon Bentley. 2nd ed., Addison-Wesley. 1999; also, More Programming Pearls: Confessions of a Coder. Jon Bentley. Addison-Wesley. 1989.
These two volumes of reprinted magazine columns are delightful to read, although (or because) they breathe the somewhat dated atmosphere of the old Bell Labs. Both volumes contain chapters on guesstimation problems in a programming context.

• Back-of-the-Envelope Physics. Clifford E. Swartz. Johns Hopkins University Press. 2003. Physicists regard themselves as the inventors of back-of-the-envelope calculations. This book contains a set of examples from introductory physics (with solutions).

• The Flying Circus of Physics. Jearl Walker. 2nd ed., Wiley. 2006. If you'd like some hints on how to take an interest in the world around you, try this book. It contains hundreds of everyday observations and challenges you to provide an explanation for each. Why are dried coffee stains always darker around the rim? Why are shower curtains pulled inward? Remarkably, many of these observations are still not fully understood! (You might also want to check out the rather different and more challenging first edition.)

• Pocket Ref. Thomas J. Glover. 3rd ed., Sequoia Publishing. 2009. This small book is an extreme example of the "lookup" model. It seems to contain almost everything: strength of wood beams, electrical wiring charts, properties of materials, planetary data, first aid, military insignia, and sizing charts for clothing. It also shows the limitations of an overcomplete collection of trivia: I simply don't find it all that useful, but it is interesting for the breadth of topics covered.

CHAPTER EIGHT

Models from Scaling Arguments

After familiarizing yourself with the data through plots and graphs, the next step is to start building a model for the data. The meaning of the word "model" is quite hazy, and I don't want to spend much time and effort attempting to define this concept in an abstract way.
For our purposes, a model is a mathematical description of the data—ideally one guided by our understanding of the system under consideration—that relates the various variables of the system to each other: a "formula."

Models

Models like this are incredibly important. It is at this point that we go from the merely descriptive (plots and graphs) to the prescriptive: having a model allows us to predict what the system will do under a certain set of conditions. Furthermore, a good or truly useful model—because it helps us to understand how the system works—allows us to do so without resorting to the model itself or having to evaluate any particular formula explicitly. A good model ties the different variables that control the system together in such a way that we can see how varying any one of them will influence the outcome. It is this use of models—as an aid to or expression of our understanding—that is the most important one. (Of course, we must still evaluate the model formulas explicitly in order to obtain actual numbers for a specific prediction.)

I should point out that this view of models and what they can do is not universal, and you will find the term used quite differently elsewhere. For instance, statistical models (and this includes machine-learning models) are much more descriptive: they do not purport to explain the observed behavior in the way just described. Instead, their purpose is to predict expected outcomes with the greatest level of accuracy possible (numbers in, numbers out). In contrast, my training is in theoretical physics, where the development of conceptual understanding of the observed behavior is the ultimate goal. I will use all available information about the system and how it works (or how I suspect it works!) wherever I can; I don't restrict myself to using only the information contained in the data itself.
(This is a practice that statisticians traditionally frown upon, because it constitutes a form of "pollution" of the data. They may very well be right, but my purpose is different: I don't want to understand the data, I want to understand the system!) At the same time, I don't consider the absolute accuracy of a model paramount: a model that yields only order-of-magnitude accuracy but helps me understand the system's behavior (so that I can, for instance, make informed trade-off decisions) is much more valuable to me than a model that yields results with 1 percent accuracy but is otherwise a black box.

To be clear: there are situations when achieving the best possible accuracy is all that matters and conceptual understanding is of little interest. (Often these cases involve repeatable processes in well-understood systems.) If this describes your situation, then you need to use different methods that are appropriate to your problem scenario.

Modeling

As should be clear from the preceding description, building models is basically a creative process. As such, it is difficult (if not impossible) to teach: there are no established techniques or processes for arriving at a useful model in any given scenario. One common approach to teaching this material is to present a large number of case studies, describing the problem situations and attempts at modeling them. I have not found this style to be very effective. First of all, every (nontrivial) problem is different, and tricks and fortuitous insights that work well for one example rarely carry over to a different problem. Second, building effective models often requires fairly deep insight into the particulars of the problem space, so you may end up describing lots of tedious details of the problem when actually you wanted to talk about the model (or the modeling). In this chapter, we will take a different approach.
Effective modeling is often an exercise in determining "what to leave out": good models should be simple (so that they are workable) yet retain the essential features of the system—certainly those that we are interested in. As it turns out, there are a few essential arguments and approximations that prove helpful again and again in making a complex problem tractable and identifying the dominant behavior. That's what I want to talk about.

Using and Misusing Models

Just a reminder: models are not reality. They are descriptions or approximations of reality—often quite coarse ones! We need to ensure that we place only as much confidence in a model as is warranted.

How much confidence is warranted? That depends on how well-tested the model is. If a model is based on a good theory, agrees well with a wide range of data sets, and has been shown to predict observations correctly, then our confidence may be quite strong. At the other extreme are what one might call "pie in the sky" models: ad hoc models involving half a dozen (or so) parameters—all of which have been estimated independently and not verified against real data. The reliability of such a model is highly dubious: each of the parameters introduces a certain degree of uncertainty, which in combination can make the results of the model meaningless. Recall the discussion in Chapter 7: three parameters known to within 10 percent produce an uncertainty in the final result of 30 percent—and that assumes the parameters are actually known to within 10 percent! With four to six parameters that are possibly known much less precisely than that, the situation is correspondingly worse. (Many business models fall into this category.)

Also keep in mind that virtually all models have only a limited region of validity.
If you try to apply an existing model to a drastically different situation, or use input values that are very different from those that you used to build the model, then you may well find that the model makes poor predictions. Be sure to check that the assumptions on which the model is based are actually fulfilled for each application that you have in mind!

Arguments from Scale

Next to the local stadium there is a large, open parking lot. During game days, the parking lot is filled with cars, and—for obvious reasons—a line of portable toilets is set up all along one of the edges of the parking lot. This poses an interesting balancing problem: will this particular arrangement work for all situations, no matter how large the parking lot in question? The answer is no. The number of people in the parking lot grows with the area of the parking lot, which grows with the square of the edge length (i.e., it "scales as" L²); but the number of toilets is proportional to the edge length itself (so it scales as L). Therefore, as we make the parking lot bigger and bigger, there comes a point where the number of people overwhelms the number of available facilities. Guaranteed.

Scaling Arguments

This kind of reasoning is an example of a scaling argument. Scaling arguments try to capture how some quantity of interest depends on a control parameter; in particular, a scaling argument describes how the output quantity will change as the control parameter changes. Scaling arguments are a particularly fruitful way to arrive at symbolic expressions for phenomena ("formulas") that can be manipulated analytically.

You should have observed that the expressions I gave in the introductory example were not "dimensionally consistent." We had people expressed as the square of a length and toilets expressed as a length—what is going on here?
Nothing, I merely omitted some detail that was not relevant for the argument I tried to make. A car takes up some amount of space on a parking lot; hence, given the size of the parking lot (its area), we can figure out how many cars it can accommodate. Each car seats on average two people (on a game day), so we can figure out the number of people as well. Each person has a certain probability of using a bathroom during the course of the game and will spend a certain number of minutes there. Given all these parameters, we can figure out the required "toilet availability minutes." We can make a similar argument to find the "availability minutes" provided by the installed facilities. Observe that none of these parameters depend on the size of the parking lot: they are constants. Therefore, we don't need to worry about them if all we want to determine is whether this particular arrangement (with toilets all along one edge, but nowhere else) will work for parking lots of any size. (It is a widely followed convention to use the tilde, as in A ∼ B, to express that A "scales as" B, where A and B do not necessarily have the same dimensions.) On the other hand, if we actually want to know the exact number of toilets required for a specific parking lot size, then we do need to worry about these factors and try to obtain the best possible estimates for them.

It is because scaling arguments free us from having to think about pesky numerical factors that they provide such a convenient and powerful way to begin the modeling process. At the beginning, when things are most uncertain and our understanding of the system is least developed, they free us from having to worry about low-level details (e.g., how long does the average person spend in the bathroom?) and instead help us concentrate on the system's overall behavior. Once the big picture has become clearer (and if the model still seems worth pursuing), we may want to derive some actual numbers from it as well.
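The bookkeeping just described is easy to put into code. In the following sketch every per-car, per-person, and per-toilet constant is invented purely for illustration; the point is only that demand scales as L² while supply scales as L, so demand must eventually win:

```python
# All constants are hypothetical -- and none of them depend on the lot size L.
AREA_PER_CAR   = 15.0    # square meters of lot per parked car
PEOPLE_PER_CAR = 2.0     # average occupancy on a game day
MIN_PER_PERSON = 3.0     # bathroom minutes each person needs per game
TOILET_SPACING = 2.0     # meters of edge per portable toilet
MIN_PER_TOILET = 240.0   # availability minutes each toilet provides

def balance(L):
    """Demand vs. supply of 'toilet availability minutes' for a square lot of edge L meters."""
    people = (L * L / AREA_PER_CAR) * PEOPLE_PER_CAR   # scales as L^2
    demand = people * MIN_PER_PERSON
    supply = (L / TOILET_SPACING) * MIN_PER_TOILET     # scales as L
    return demand, supply

for L in (50, 100, 200, 400):
    demand, supply = balance(L)
    print("L = %3d m: demand %8.0f, supply %8.0f" % (L, demand, supply))
```

With these made-up numbers the supply covers the demand for small lots but is overwhelmed somewhere between L = 200 and L = 400; changing the constants moves the crossover point but cannot remove it.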
Only at this point do we need to concern ourselves with numerical constants, which we must either estimate or derive from available data.

A recurring challenge with scaling models is to find the correct scales. For example, we implicitly assumed that the parking lot was square (or at least nearly so) and would remain that shape as it grew. But if the parking lot were growing in one direction only (i.e., becoming longer and longer while staying the same width), then its area would no longer scale as L² but instead as L, where L is now the "long" side of the lot. This changes the argument, for if the portable toilets were located along the long side of the lot, then the balance between people and available facilities would be the same no matter how large the lot became! On the other hand, if the facilities were set up along the short side, then their number would remain constant while the long side grew, resulting again in an imbalanced situation. Finding the correct scales is a bit of an experience issue—the important point here is that it is not as simple as saying: "It's an area, therefore it must scale as length squared." It depends on the shape of the area and on which of its lengths controls the size.

[Figure 8-1. Heights and weights of a group of middle-school students. Axes: Height [cm] versus Mass [kg]; shown are the data and the model m = 0.84x − 84.]

The parking lot example demonstrates one typical application of high-level scaling arguments: what I call a "no-go argument." Even without any specific numbers, the scaling behavior alone was enough to determine that this particular arrangement of toilets to visitors will break down at some point.

Example: A Dimensional Argument

Figure 8-1 shows the heights and weights of a class of female middle-school students.* Also displayed is the function m = 0.84h − 84.0, where m stands for the mass (or weight) and h for the height.
The fit seems to be quite close—but is this a good model? The answer is no, because the model makes unreasonable predictions. Look at it: the model suggests that students have no weight unless they are at least 100 centimeters (a little over 3 feet) tall; if they were shorter, their weight would be negative. Clearly, this model is no good (although it does describe the data over the range shown quite well). We expect that people who have no height also have no weight, and our model should reflect that. Rather than a model of the form ax + b, we might instead try ax^b, because this is the simplest function that gives the expected result for x = 0.

*A description of this data set can be found in A Handbook of Small Data Sets. David J. Hand, Fergus Daly, K. McConway, D. Lunn, and E. Ostrowski. Chapman & Hall/CRC. 1993.

[Figure 8-2. A double logarithmic plot of the data from Figure 8-1. The cubic function m = ah³ seems to describe the data much better than the linear function m = ah.]

Figure 8-2 shows the same data but on a double logarithmic plot. Also indicated are functions of the form y = ax and y = ax³. The cubic function ax³ seems to represent the data quite well—certainly better than the linear function. And this makes perfect sense! The weight of a body is proportional to its volume—that is, to height times width times depth, or h · w · d. Since body proportions are pretty much the same for all humans (i.e., a person who is twice as tall as another will have shoulders that are twice as wide, too), it follows that the volume of a person's body (and hence its mass) scales as the third power of the height: mass ∼ height³. Figure 8-3 shows the data one more time, together with the model m = 1.25 · 10⁻⁵ h³.
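A quick check, using only the two formulas quoted in this example, shows how the straight line of Figure 8-1 relates to the cubic model: it is the tangent line to the cubic at h = 150 cm.

```python
A = 1.25e-5                 # prefactor of the cubic model m = A * h^3

def cubic(h):
    return A * h**3         # mass in kg for height h in cm

h0 = 150.0                  # center of the measured height range
slope = 3.0 * A * h0**2     # derivative dm/dh evaluated at h0
intercept = cubic(h0) - slope * h0

print("tangent at h0: m = %.3f * h %+.1f" % (slope, intercept))
```

The slope comes out as 0.84 and the intercept as −84 (to within rounding), reproducing the linear description m = 0.84h − 84.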
Notice that the model makes reasonable predictions even for values outside the range of available data points, as you can see by comparing the model predictions with the average body measurements for some different age groups. (The figure also shows the possible limitations of a model that is built using less than perfectly representative data: the model underestimates adult weights, because middle-school students are relatively light for their size. In contrast, two-year-olds are notoriously "chubby.") Nevertheless, this is a very successful model. On the one hand, although based on very little data, the model successfully predicts the weight to within 20 percent accuracy over a range of almost two orders of magnitude in height. On the other hand, and arguably more importantly, it captures the general relationship between body height and weight—a relationship that makes sense but that we might not necessarily have guessed without looking at the data.

[Figure 8-3. The data from Figure 8-1, together with the cubic model m = 1.25 · 10⁻⁵ x³ and the linear approximation to this model around h = 150 cm; also marked are average measurements for a newborn, a two-year-old, an adult woman, and an adult man. Note that the approximation is good over the range of the actual data set but is wildly off farther away from it.]

The last question you may ask is why the initial description, m = 0.84x − 84, in Figure 8-1 seemed so good. The answer is that this is exactly the linear approximation to the correct model, m = 1.25 · 10⁻⁵ h³, near h = 150 cm. (See Appendix B.) As with all linear approximations, it works well in a small region but fails for values farther away.

Example: An Optimization Problem

Another application of scaling arguments is to cast a question as an optimization problem. Consider a group of people scheduled to perform some task (say, a programming team).
The amount of work that this group can perform in a fixed amount of time (its "throughput") is obviously proportional to the number n of people on the team: ∼ n. However, the members of the team will have to coordinate with each other. Let's assume that each member of the team needs to talk to every other member of the team at least once a day. This implies a communication overhead that scales as the square of the number of people: ∼ −n². (The minus sign indicates that the communication overhead results in a loss in throughput.) This argument alone is enough to show that for this task there is an optimal number of people for which the realized productivity will be highest. (Also see Figure 8-4.)

[Figure 8-4. The work achievable by a team as a function of its size: the raw amount of work that can be accomplished grows with the team size, but the communication overhead grows even faster, which leads to an optimal team size.]

To find the optimal staffing level, we want to maximize the productivity P with respect to the number of workers on the team n:

    P(n) = cn − dn²

where c is the number of minutes each person can contribute during a regular workday, and d is the effective number of minutes consumed by each communication event. (I'll return to the cautious "effective" modifier shortly.) To find the maximum, we take the derivative of P(n) with respect to n, set it equal to 0, and solve for n (see Appendix B). The result is:

    n_optimal = c / (2d)

Clearly, as the time consumed by each communication event d grows larger, the optimal team size shrinks. If we now wish to find an actual number for the optimal staffing level, then we need to worry about the numerical factors, and this is where the "effective" comes in.
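The analytic optimum c/(2d) is easy to confirm by brute force. The values of c and d below are hypothetical (c in productive minutes per person per day, d in effective minutes lost per communication event):

```python
c, d = 400.0, 5.0                     # hypothetical parameters

def productivity(n):
    return c * n - d * n * n          # P(n) = c*n - d*n^2

# Search all team sizes up to 200 for the one with the highest productivity:
n_best = max(range(1, 200), key=productivity)
print("discrete optimum:", n_best, " analytic c/(2d):", c / (2.0 * d))
```

For these parameters both approaches give a team of 40; halving d to 2.5 would double the optimal team size.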
The total amount of time each person can put in during a regular workday is easy to estimate (8 hours at 60 minutes each, less time for diversions), but the amount of time spent in a single communication event is more difficult to determine. There are also additional effects that I would lump into the "effective" parameter: for example, not everybody on the team needs to talk to everybody else. Adjustments like this can be absorbed into the parameter d, which increasingly turns it into a synthetic parameter rather than one that can be measured directly.

Example: A Cost Model

Models don't have to be particularly complicated to provide important insights. I remember a situation where we were trying to improve the operation of a manufacturing environment. One particular job was performed on a special machine that had to be retooled for each different type of item to be produced. First the machine would be set up (which took about 5 to 10 minutes), and then a worker would operate the machine to produce a batch of 150 to 200 identical items. The whole cycle took a bit longer than an hour and a half to complete a batch, and then the machine was retooled for the next batch.

The retooling part of the cycle was a constant source of management frustration: for 10 minutes (while the machine was being set up), nothing seemed to be happening. Wasted time! (In manufacturing, productivity—defined as "units per hour"—is the most closely watched metric.) Consequently, there had been a long string of process improvement projects dedicated to making the retooling part more efficient and thereby faster. By the time I arrived, it had been streamlined very well. Nevertheless, there were constant efforts underway to reduce the time it took—after all, the sight of the machine sitting idle for 10 minutes seemed to be all the proof that was needed.

It is interesting to set up a minimal cost model for this process.
The relevant quantity to study is "minutes per unit." This is essentially the inverse of the productivity, but I find it easier to think in terms of the time it takes to produce a single unit than the other way around. Also note that "time per unit" equates to "cost per unit" once we take the hourly wage into account. Thus, the time per unit is the time T it takes to produce an entire batch, divided by the number of items n in the batch. The total processing time consists of the setup time T1 plus n times the amount of time t required to produce a single item:

    T/n = (T1 + nt)/n = T1/n + t

The first term on the right-hand side is the amount of the setup time that can be attributed to a single item; the second term, of course, is the time it takes to actually produce the item. The larger the batch size, the smaller the contribution of the setup time to the cost of each item, as the setup time is "amortized" over more units.

This is one of those situations where the numerical factors actually matter. We know that T1 is in the range of 300–600 seconds and that n is between 150 and 200, so the setup time per item, T1/n, is between 1.5 and 4 seconds. We can also find the time t required to actually produce a single item if we recall that the cycle time for the entire batch was about 90 minutes; therefore t = 90 · 60/n, which is about 30 seconds per item. In other words, the setup time that caused management so much grief actually accounted for only around 10 percent of the total time to produce an item!

But we aren't finished yet. Let's assume that, through some strenuous effort, we are able to reduce the setup time by 10 percent. (Not very likely, given that this part of the process had already received a lot of attention, but let's assume—best case!) This would mean that we can reduce the setup time per item to between about 1.4 and 3.6 seconds. However, this means that the total time per item is reduced by only 1 or 2 percent!
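The arithmetic above can be spelled out explicitly, using the numbers from the text (setup time 300–600 seconds, batches of 150–200 items, about 30 seconds of actual work per item):

```python
def time_per_unit(T1, t, n):
    """Loaded time per item: setup time T1 amortized over n items, plus per-item work t."""
    return T1 / n + t

T1, t, n = 600.0, 30.0, 150   # worst case for the setup contribution (seconds, items)

base         = time_per_unit(T1, t, n)
faster_setup = time_per_unit(0.9 * T1, t, n)   # 10% off the setup time
faster_work  = time_per_unit(T1, 0.9 * t, n)   # 10% off the per-item work

print("overall gain from setup cut: %.1f%%" % (100.0 * (1.0 - faster_setup / base)))
print("overall gain from work cut:  %.1f%%" % (100.0 * (1.0 - faster_work / base)))
```

Even in this worst case for the setup time, cutting it by 10 percent buys only about a 1 percent overall gain, whereas the same relative cut in t buys almost 9 percent.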
This is the kind of efficiency gain that makes sense only in very, very controlled situations where everything else is completely optimized. In contrast, a 10 percent reduction in t, the actual work time per item, would result in (almost) a 10 percent improvement in overall productivity (because the amount of time that it takes to produce an item is so much greater than the fraction of the setup time attributable to a single item).

[Figure 8-5. Total time required to process a unit, as a function of the batch size, for setup times of 300 and 600 seconds and a single-item time of 30 seconds.]

We can see this in Figure 8-5, which shows the "loaded" time per unit (including the setup time) for two typical values of the setup time, as a function of the number of items produced in a single batch. Although the setup time contributes significantly to the per-item time when there are fewer than about 50 items per batch, its effect is very small for batch sizes of 150 or more. For batches of this size, the time it takes to actually make an item dominates the time to retool the machine.

The story is still not finished. We eventually launched a project to look at ways to reduce t for a change, but it was never strongly supported and was shut down at the earliest possible moment by plant management—in favor of a project to look at, you guessed it, the setup time! The sight of the machine sitting idle for 10 minutes was more than any self-respecting plant manager could bear.

Optional: Scaling Arguments Versus Dimensional Analysis

Scaling arguments may seem similar to another concept you may have heard of: dimensional analysis. Although the two are related, they are really quite different.
Scaling concepts, as introduced here, are based on our intuition of how the system behaves and are a way to capture this intuition in a mathematical expression. Dimensional analysis, in contrast, applies to physical systems, which are described by a number of quantities that have different physical dimensions, such as length, mass, time, or temperature. Because equations describing a physical system must be dimensionally consistent, we can try to deduce the form of these equations by forming dimensionally consistent combinations of the relevant variables.

Let's look at an example. Everybody is familiar with the phenomenon of air resistance, or drag: there is a force F that acts to slow a moving body down. It seems reasonable to assume that this force depends on the cross-sectional area of the body A and its speed (or velocity) v. But it must also depend on some property of the medium (air, in this case) through which the body moves. The most basic property is the density ρ, which is the mass (in grams or kilograms) per volume (in cubic centimeters or meters):

    F = f(A, v, ρ)

Here, f(x, y, z) is an as-yet-unknown function. Force has units of mass · length/time², area has units of length², velocity of length/time, and density of mass/length³. We can now try to combine A, v, and ρ to form a combination that has the same dimensions as force. A little experimentation leads us to:

    F = c ρ A v²

where c is a pure (dimensionless) number. This equation expresses the well-known result that air resistance increases with the square of the speed. Note that we arrived at it using purely dimensional arguments, without any insight into the physical mechanisms at work. This form of reasoning has a certain kind of magic to it: why did we choose these specific quantities? Why did we not include the viscosity of air, the ambient air pressure, the temperature, or the length of the body?
The answer is (mostly) physical intuition. The viscosity of air is small (viscosity measures the resistance to shear stress, which is the force transmitted by a fluid captured between parallel plates moving parallel to each other but in opposite directions—clearly not a large effect for air at macroscopic length scales). The pressure enters indirectly through the density (at constant temperature, according to the ideal gas law). And the length of the body is hidden in the numerical factor c, which depends on the shape of the body and therefore on the ratio of the cross-sectional radius √A to the length. In summary: it is impressive how far we came using only very simple arguments, but it is hard to overcome a certain level of discomfort entirely.

Methods of dimensional analysis appear less arbitrary when the governing equations are known. If this is the case, then we can use dimensional arguments to reduce the number of independently variable quantities. For example: assume that we already know the drag force is described by F = cρAv². Suppose further that we want to perform experiments to determine c for various bodies by measuring the drag force on them under various conditions. Naively, it might appear as if we had to map out the full three-dimensional parameter space by making measurements for all combinations of (ρ, A, v). But these three parameters occur only in the combination γ = ρAv², so it is sufficient to run a single series of tests that varies γ over the range of values that we are interested in. This constitutes a significant simplification!

Dimensional analysis relies on dimensional consistency and therefore works best for physical and engineering systems, which are described by independently measurable, dimensional quantities.
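Dimensional bookkeeping of this kind is mechanical enough to automate. In the sketch below, each quantity is represented by its exponents of (mass, length, time); multiplying quantities corresponds to adding the exponent vectors, and we verify that ρAv² indeed has the dimensions of a force:

```python
# Each quantity as its (mass, length, time) exponents:
FORCE   = (1, 1, -2)    # mass * length / time^2
AREA    = (0, 2, 0)     # length^2
SPEED   = (0, 1, -1)    # length / time
DENSITY = (1, -3, 0)    # mass / length^3

def combine(*quantities):
    """Multiply quantities by adding their dimension exponents componentwise."""
    return tuple(sum(exps) for exps in zip(*quantities))

drag = combine(DENSITY, AREA, SPEED, SPEED)   # rho * A * v^2
print("rho*A*v^2 has dimensions", drag, "- force?", drag == FORCE)
```

The same representation makes it easy to test other candidate combinations; any that fail to add up to (1, 1, −2) can be discarded immediately.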
It is particularly prevalent in areas such as fluid dynamics, where the number of variables is especially high and the physical laws are complicated and often not well understood. It is much less applicable in economic or social settings, where there are fewer (if any) rigorously established, dimensionally consistent relationships.

Other Arguments

There are other arguments that can be useful when attempting to formulate models. They come from the physical sciences, and (like dimensional analysis) they may not work as well in social and economic settings, which are not governed by strict physical laws.

Conservation laws

Conservation laws tell us that some quantity does not change over time. The best-known example is the law of conservation of energy. Conservation laws can be very powerful (in particular when they are exact, as opposed to only approximate) but may not be available: after all, the entire idea of economic growth and (up to a point) manufacturing itself rest on the assumption that more comes out than is being put in!

Symmetries

Symmetries, too, can be helpful in reducing complexity. For example, if an apparently two-dimensional system exhibits the symmetry of a circle, then I know that I'm dealing with a one-dimensional problem: any variation can occur only in the radial direction, since a circle looks the same in all directions. When looking for symmetries, don't restrict yourself to geometric considerations—for instance, items entering and leaving a buffer at the same rate exhibit a form of symmetry. In this case, you might need to solve only one of the two processes explicitly while treating the other as a mirror image of the first.

Extreme-value considerations

How does the system behave at the extremes? If there are no customers, messages, orders, or items? If there are infinitely many? What if the items are extremely large or vanishingly small, or if we wait an infinite amount of time?
Such considerations can help to "sanity check" an existing model, but they can also provide inspiration when first establishing a model. Limiting cases are often easier to treat, because only one effect dominates; this eliminates the complexities arising out of the interplay of different factors.

Mean-Field Approximations

The term mean-field approximation comes from statistical physics, but I use it only as a convenient and intuitive expression for a much more general approximation scheme. Statistical physics deals with large systems of interacting particles, such as gas molecules in a piston or atoms on a crystal lattice. These systems are extraordinarily complicated, because every particle interacts with every other particle. If you move one of the particles, then this will affect all the other particles, and so they will move, too; but their movement will, in turn, influence the first particle that we started with! Finding exact solutions for such large, coupled systems is often impossible. To make progress, we ignore the individual interactions between explicit pairs of particles. Instead, we assume that a test particle experiences a field—the "mean field"—that captures the average effect of all the other particles.

For example, consider N gas atoms in a bottle of volume V. We may be interested in understanding how often two gas atoms collide with each other. To calculate that number exactly, we would have to follow every single atom over time to see whether it bumps into any of the other atoms. This is obviously very difficult, and it certainly seems as if we would need to keep track of a whole lot of detail that should be unnecessary if we are only interested in macroscopic properties.
Realizing this, we can consider this gas in a mean-field approximation: the probability that our test particle collides with another particle should be proportional to the average density of particles in the bottle, ρ = N/V. Since there are N particles in the bottle, we expect that the number of collisions (over some time frame) will be proportional to Nρ. This is good enough to start making some predictions—for example, note that this expression is proportional to N^2. Doubling the number of particles in the bottle therefore means that the number of collisions will grow by a factor of 4. In contrast, reducing the volume of the container by half will increase the number of collisions only by a factor of 2.

You will have noticed that in the previous argument, I omitted lots of detail—for example, any reference to the time frame over which I intend to count collisions. There is also a constant of proportionality missing: Nρ is not really the number of collisions but is merely proportional to it. But if all I care about is understanding how the number of collisions depends on the two variables I consider explicitly (i.e., on N and V), then I don't need to worry about any of these details. The argument so far is sufficient to work out how the number of collisions scales with both N and V. You can see how mean-field approximations and scaling arguments enhance and support each other. Let's step back and look at the concept behind mean-field approximations more closely.

TABLE 8-1. Mean-field approximations replace an average over functions with functions of averages.

    Exact:       E[F(x)]    = Σ_{all outcomes x} F(x) p(x)
    Mean-field:  E_MF[F(x)] = F( Σ_{all outcomes x} x p(x) )

Background and Further Examples

If mean-field approximations were limited to systems of interacting particles, they would not be of much interest in this book. However, the concept behind them is much more general and is very widely applicable.
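In code, the replacement summarized in Table 8-1 looks like the following sketch; the toy distribution and the choice F(x) = x² are invented purely for illustration:

```python
# Exact average of a function vs. the mean-field shortcut (Table 8-1),
# for F(x) = x**2 over a small invented distribution.
outcomes = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

def F(x):
    return x * x

# Exact: average the function over all outcomes
exact = sum(F(x) * p for x, p in zip(outcomes, probs))

# Mean-field: evaluate the function once, at the average outcome
mean_field = F(sum(x * p for x, p in zip(outcomes, probs)))

print(exact, mean_field)   # 10.0 vs. 9.0: close, but not identical
```

The two numbers differ because F is nonlinear; the size of that gap is exactly what a mean-field argument sweeps under the rug.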
Whenever we want to calculate with a quantity whose values are distributed according to some probability distribution, we face the challenge that this quantity does not have a single, fixed value. Instead, it has a whole spectrum of possible values, each more or less likely according to the probability distribution. Operating with such a quantity is difficult because, at least in principle, we have to perform all calculations for each possible outcome and then weight the result of our calculation by the appropriate probability. At the very end of the calculation, we eventually form the average (properly weighted according to the probability factors) to arrive at a unique numerical value. Given the combinatorial explosion of possible outcomes, attempting to perform such a calculation exactly invariably starts to feel like wading in a quagmire—and that assumes that the calculation can be carried out exactly at all!

The mean-field approach cuts through this difficulty by performing the average before embarking on the actual calculation. Rather than working with all possible outcomes (and averaging them at the end), we determine the average outcome first and then work only with that single value. Table 8-1 summarizes the differences.

This may sound formidable, but it is actually something we do all the time. Do you ever try to estimate how high the bill is going to be while you are waiting in line at the supermarket? You can do this explicitly—by going through all the items individually and adding up their prices (approximately) in your head—or you can apply a mean-field approximation by realizing that the items in your cart represent a sample, drawn "at random," from the selection of goods available. In the mean-field approximation, you would estimate the average single-item price for goods from that store (probably about $5–$7) and then multiply that value by the number of items in your cart.
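The supermarket estimate takes two lines of Python; the cart prices and the $6 average price are invented for illustration:

```python
# Invented cart prices for illustration
cart = [3.49, 6.99, 4.25, 7.50, 5.10, 6.40, 2.99, 8.20]

exact = sum(cart)              # add up every item individually
mean_field = 6.0 * len(cart)   # assumed $6 average price times item count

print(round(exact, 2), mean_field)   # 44.92 48.0
```

The shortcut gets within a few dollars while requiring only a count, not a running sum.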
Note that it should be much easier to count the items in your cart than to add up their individual prices explicitly. This example also highlights the potential pitfall of mean-field arguments: the estimate will be reliable only if the average item price is a good estimator. If your cart contains two bottles of champagne and a rib roast for a party of eight, then an estimate based on a typical item price of $7 is going to be way off.

To get a grip on the expected accuracy of a mean-field approximation, we can try to find a measure for the width of the original distribution (e.g., its standard deviation or interquartile range) and then repeat our calculations after adding (and subtracting) the width from the mean value. (We may also treat the width as a small perturbation to the average value and use the perturbation methods discussed in Chapter 7.)

Another example: how many packages does UPS (or any comparable freight carrier) fit onto a truck (to be clear: I don't mean a delivery truck, but one of those 53-foot tractor-trailers used for long hauls)? Well, we can estimate the "typical" size of a package as about a cubic foot (roughly (0.3 m)³ ≈ 0.03 m³), but it might also be as small as half that or as large as twice that size. To find an estimate for the number of packages that will fit, we divide the volume of the truck (17 m long, 2 m wide, 2.5 m high—we can estimate height and width if we realize that a person can stand upright in these things) by the typical size of a package: (17 · 2 · 2.5)/0.03 ≈ 3,000 packages. Because the volume (not the length!) of each package might vary by as much as a factor of 2, we end up with lower and upper bounds of (respectively) 1,500 and 6,000 packages. This calculation makes use of the mean-field idea twice. First, we work with the "average" package size. Second, we don't worry about the actual spatial packing of boxes inside the truck; instead, we pretend that we can reshape them like putty.
(This also is a form of "mean-field" approximation.) I hope you appreciate how the mean-field idea has turned this problem from almost impossibly difficult to trivial. And I don't just mean with regard to the actual computation and the eventual numerical result but, more importantly, with regard to the way we thought about it. Rather than getting stuck in the enormous technical difficulties of working out different stacking orders for packages of different sizes, the mean-field notion reduced the problem to its most fundamental question: into how many small pieces can we divide a large volume? (And if you think that all of this is rather trivial, I fully agree with you—but the "trivial" can easily be overlooked when one is presented with a complex problem in all of its ugly detail. Trying to find mean-field descriptions helps strip away nonessential detail and reveal the fundamental questions at stake.)

One common feature of mean-field solutions is that they frequently violate some of the system's properties. For example, at Amazon, we would often consider the typical order to contain 1.7 items, of which 0.9 were books, 0.3 were CDs, and the remaining 0.5 items were other stuff (or whatever the numbers were). This is obviously nonsense, but don't let this disturb you! Just carry on as if nothing happened, and work out the correct breakdown of things at the end. This approach doesn't always work: you'll still have to assign a whole person to a job, even if it requires only one-tenth of a full-time worker. However, this kind of argument is often sufficient to work out the general behavior of things.

There is a story involving Richard Feynman working on the Connection Machine, one of the earliest massively parallel supercomputers.
All the other people on the team were computer scientists, and when a certain problem came up, they tried to solve it using discrete methods and exact enumerations—and got stuck with it. In contrast, Feynman worked with quantities such as "the average number of 1 bits in a message address" (clearly a mean-field approach). This allowed him to cast the problem in terms of partial differential equations, which were easier to solve.*

Common Time-Evolution Scenarios

Sometimes we can propose a model based on the way the system under consideration evolves. The "proper" way to do this is to write down a differential equation that describes the system (in fact, this is exactly what the term "modeling" often means) and then proceed to solve it, but that would take us too far afield. (Differential equations relate the change in some quantity, expressed through its derivative, to the quantity itself. These equations can be solved to yield the quantity for all times.) However, there are a few scenarios so fundamental and so common that we can go ahead and simply write down the solution in its final form. (I'll give a few notes on the derivation as well, but it's the solutions to these differential equations that should be committed to memory.)

Unconstrained Growth and Decay Phenomena

The simplest case concerns pure growth (or death) processes. If the relative rate of change of some quantity (the rate of change per unit of the quantity itself) is constant in time, then the quantity will follow exponential growth (or decay). Consider a cell culture. At every time step, a certain fraction of all cells in existence at that time step will split (i.e., generate offspring). Here the fraction of cells that participate in the population growth at every time step is constant in time; however, because the population itself grows, the total number of new cells at each time step is larger than at the previous time step.
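A three-line simulation makes the point; the splitting fraction f = 0.05 and the starting population are invented for illustration. Iterating "a constant fraction splits per step" reproduces the closed form N0·(1 + f)^t:

```python
f, N = 0.05, 100.0   # splitting fraction per step, initial population (invented)

history = [N]
for _ in range(50):
    N += f * N       # each step, a fraction f of the existing cells split
    history.append(N)

print(history[-1])          # more than elevenfold growth after 50 steps
print(100.0 * 1.05 ** 50)   # agrees with the closed form N0*(1+f)^t
```

Each step adds more new cells than the previous one, even though the fraction f never changes: that is exponential growth.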
Many pure growth processes exhibit this behavior—compound interest on a monetary amount is another example (see Chapter 17). Pure death processes work similarly, only in this case a constant fraction of the population dies or disappears at each time step. Radioactive decay is probably the best-known example; another is the attenuation of light in a transparent medium (such as water). For every unit of length that light penetrates into the medium, its intensity is reduced by a constant fraction, which gives rise to the same exponential behavior. In this case, the independent variable is space, not time, but the argument is exactly the same.

Mathematically, we can express the behavior of a cell culture as follows: if N(t) is the number of cells alive at time t and if a fraction f of these cells split into new cells, then the number of cells at the next time step t + 1 will be:

    N(t + 1) = N(t) + f N(t)

The first term on the righthand side comes from the cells that were already alive at time t, whereas the second term comes from the "new" cells created at t. We can now rewrite this equation as follows:

    N(t + 1) − N(t) = f N(t)

This is a difference equation. If we can assume that the time "step" is very small, we can replace the lefthand side with the derivative of N (this process is not always quite as simple as in this example—you may want to check Appendix B for more details on difference and differential quotients):

    dN/dt = (1/T) N(t)

This equation describes growth processes; for pure death processes we instead have an additional minus sign on the righthand side.

*This story is reported in "Richard Feynman and the Connection Machine." Daniel Hillis. Physics Today 42 (February 1989), p. 78. The paper can also be found on the Web.
These equations can be solved (or integrated) explicitly, and their solutions are:

    N(t) = N0 e^(t/T)     Pure birth process
    N(t) = N0 e^(−t/T)    Pure death process

Instead of the "fraction" f of new or dying cells that we used in the difference equation, here we employ a characteristic time scale T, which is the time over which the number of cells changes by a factor of e or 1/e, where e = 2.71828.... The value of this time scale will depend on the actual system: for cells that multiply rapidly, T will be smaller than for another species that grows more slowly. Notice that such a scale factor must be there to make the argument of the exponential function dimensionally consistent! Furthermore, the parameter N0 is the number of cells in existence at the beginning, t = 0.

Exponential processes (either birth or death) are very important, but they never last very long. In a pure death process, the population very quickly dwindles to practically nothing. At t = 3T, only 5 percent of the original population is left; at t = 10T, fewer than 1 in 10,000 of the original cells have survived; at t = 20T, we are down to about one in a billion. In other words, after a time that is a small multiple of T, the population will have all but disappeared. Pure birth processes face the opposite problem: the population grows so quickly that, after a very short while, it will exceed the capacity of its environment. This is so generally true that it is worth emphasizing: exponential growth is not sustainable over extended time periods. A process may start out as exponential, but before long it must and will saturate. That brings us to the next scenario.

Constrained Growth: The Logistic Equation

Pure birth processes never continue for very long: the population quickly grows to a size that is unsustainable, and then the growth slows.
A common model that takes this behavior into account assumes that the members of the population start to "crowd" each other, possibly competing for some shared resource such as food or territory. Mathematically, this can be expressed as follows:

    dN/dt = λ N (K − N)        λ, K > 0 fixed

The first term on the righthand side (which equals λKN) is the same as in the exponential growth equation. By itself, it would lead to an exponentially growing population N(t) = C exp(λKt). But the second term (−λN^2) counteracts this: it is negative, so its effect is to reduce the population; and it is proportional to N^2, so it grows more strongly as N becomes large. (You can motivate the form of this term by observing that it measures the number of collisions between members of the population and therefore expresses the "crowding" effect.) This equation is known as the logistic differential equation, and its solution is the logistic function:

    N(t) = K / (1 + (K/N0 − 1) e^(−λKt))

This is a complicated function that depends on three parameters:

    λ    The characteristic growth rate
    K    The carrying capacity: K = N(t → ∞)
    N0   The initial number of cells: N0 = N(t = 0)

Compared to a pure (exponential) growth process, the appearance of the parameter K is new. It stands for the system's "carrying capacity"—that is, the maximum number of cells that the environment can support. You should convince yourself that the logistic function indeed tends to K as t becomes large. (You will find different forms of this function elsewhere, with different parameters, but the form given here is the most useful one.) Figure 8-6 shows the logistic function for a selection of parameter values.

I should point out that determining values for the three parameters from data can be extraordinarily difficult, especially when the only data points available are those to the left of the inflection point (the point with maximum slope, about halfway between N0 and K).
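A minimal implementation of the logistic function makes the identifiability problem concrete. In the sketch below, two parameter sets with very different carrying capacities (K = 10 versus K = 40; all values invented for illustration) are nearly indistinguishable on early-stage data:

```python
import math

def logistic(t, lam, K, N0):
    """Solution of the logistic equation dN/dt = lam*N*(K - N)."""
    return K / (1.0 + (K / N0 - 1.0) * math.exp(-lam * K * t))

early_times = range(4)   # observations to the left of the inflection point only
a = [logistic(t, 0.050, 10.0, 1.0) for t in early_times]
b = [logistic(t, 0.011, 40.0, 1.0) for t in early_times]

for x, y in zip(a, b):
    print(round(x, 2), round(y, 2))   # the two curves differ by only a few percent
```

With noisy measurements, data like this cannot distinguish K = 10 from K = 40, which is why an independent estimate of the carrying capacity is so valuable.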
Many different combinations of λ, K, and N0 may seem to fit the data about equally well. In particular, it is difficult to assess K from early-stage data alone. You may want to try to obtain an independent estimate (even a very rough one) for the carrying capacity and use it when determining the remaining parameters from the data.

FIGURE 8-6. Logistic growth for different values of the growth rate λ (0.50, 0.10, 0.05). The initial population N0 = 2 and the overall carrying capacity K = 10 are the same in all cases.

The logistic function is the most common model for all growth processes that exhibit some form of saturation. For example, infection rates for contagious diseases can be modeled using the logistic equation, as can the approach to equilibrium for cache hit rates.

Oscillations

The last of the common dynamical behaviors occurs in systems that have an equilibrium value for some quantity and respond to excursions from that equilibrium position with a restoring effect, which drives the system back to the equilibrium position. If the system does not come to rest in the equilibrium position but instead overshoots, then the process will continue, going back and forth across the neutral position—in other words, the system undergoes oscillation. Oscillations occur in many physical systems (from tides to grandfather clocks to molecular bonds), but the "restore and overshoot" phenomenon is much more general. In fact, oscillations can be found almost everywhere: the pendulum that has "swung the other way" is proverbial, from the political scene to personal relationships. Oscillations are periodic: the system undergoes the same motion again and again.
The simplest functions that exhibit this kind of behavior are the trigonometric functions sin(x) and cos(x) (see also Appendix B); therefore, we can express any periodic behavior, at least approximately, in terms of sines or cosines. Sine and cosine are periodic with period 2π. To express an oscillation with period D, we therefore need to rescale x by 2π/D. It may also be necessary to shift x by a phase factor φ: an expression like sin(2π(x − φ)/D) will at least approximately describe any periodic data set.

FIGURE 8-7. The sawtooth function can be composed out of sine functions and their higher harmonics (shown: partial sums with 1 term, 2 terms, 3 terms, and 25 terms, together with the sawtooth itself).

But it gets better: a powerful theorem states that every periodic function, no matter how crazy, can be written as a (possibly infinite) combination of trigonometric functions called a Fourier series. A Fourier series looks like this:

    f(x) = Σ_{n=1}^{∞} a_n sin(2πn x / D)

where I have assumed that φ = 0. The important point is that only integer multiples of 2π/D are being used in the argument of the sine—the so-called "higher harmonics" of sin(2πx/D). We need to adjust the coefficients a_n to describe a data set. Although the series is in principle infinite, we can usually get reasonably good results by truncating it after only a few terms. (We saw an example of this in Chapter 6, where we used the first two terms to describe the variation in CO2 concentration over Mauna Loa in Hawaii.) If the function is known exactly, then the coefficients a_n can be worked out. For the sawtooth function (see Figure 8-7), the coefficients are simply 1, 1/2, 1/3, 1/4, ... with alternating signs:

    f(x) = sin(x)/1 − sin(2x)/2 + sin(3x)/3 ∓ ···

You can see that the series converges quite rapidly—even for such a crazy, discontinuous function as the sawtooth.

Case Study: How Many Servers Are Best?
To close out this chapter, let's discuss an additional simple case study in model building.

FIGURE 8-8. Costs associated with provisioning a data center, as a function of the number of servers (curves shown: fixed cost, expected loss, total cost, and the total cost for an alternative vendor).

Imagine you are deciding how many servers to purchase to power your ecommerce site. Each server costs you a fixed amount E per day—this includes both the operational cost for power and colocation as well as the amortized acquisition cost (i.e., the purchase price divided by the number of days until the server is obsolete and will be replaced). The total cost for n servers is therefore nE. Given the expected traffic, one server should be sufficient to handle the load. However, each server has a finite probability p of failing on any given day. If your site goes down, you expect to lose B in profit before a new server can be provisioned and brought back online. Therefore, the expected loss when using a single server is pB. Of course, you can improve the reliability of your site by using multiple servers. If you have n servers, then your site will be down only if all of them fail simultaneously. The probability of this event is p^n. (Note that p^n < p, since p is a probability and therefore p < 1.) The total daily cost C that you incur can now be written as the combination of the fixed cost nE and the expected loss due to server downtime p^n B (see also Figure 8-8):

    C = p^n B + n E

Given p, B, and E, you would like to minimize this cost with respect to the number of servers n. We can do this either analytically (by taking the derivative of C with respect to n) or numerically. But wait, there's more! Suppose we also have an alternative proposal to provision our data center with servers from a different vendor.
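Both the minimization and the comparison of competing proposals are easy to carry out numerically. In the sketch below, q and F stand for a second vendor's daily failure probability and per-server cost; all the parameter values are invented for illustration:

```python
def total_cost(n, fail_prob, B, unit_cost):
    """Daily cost C = fail_prob**n * B + n * unit_cost."""
    return fail_prob ** n * B + n * unit_cost

B = 1000.0            # profit lost while the site is down (invented)
p, E = 0.1, 2.0       # vendor 1: reliable but expensive (invented)
q, F = 0.5, 0.5       # vendor 2: fails more often, much cheaper (invented)

def best_n(fail_prob, unit_cost):
    return min(range(1, 40), key=lambda n: total_cost(n, fail_prob, B, unit_cost))

n1, n2 = best_n(p, E), best_n(q, F)
c1, c2 = total_cost(n1, p, B, E), total_cost(n2, q, B, F)
print(n1, c1)   # vendor 1: few servers
print(n2, c2)   # vendor 2: more servers, yet a lower total daily cost
```

With these (invented) numbers, the cheap-but-flaky vendor wins despite needing several times as many machines; what matters is the value of the cost function at its minimum, not the location of the minimum.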
We know that their reliability is worse: their servers' daily failure probability q is higher (q > p). But their price F is significantly lower (F ≪ E). How does this variant compare to the previous one? The answer depends on the values of the various parameters. To make a decision, we must evaluate not only the location of the minimum in the total cost (i.e., the number of servers required) but also the actual value of the total cost at the minimum position. Figure 8-8 includes the total cost for the alternative proposal that uses less reliable but much cheaper servers. Although we need more servers under this proposal, the total cost is nevertheless lower than in the first one. (We can go even further: how about a mix of different servers? This scenario, too, we can model in a similar fashion and evaluate against its alternatives.)

Why Modeling?

Why worry about modeling in a book on data analysis? It seems we have rarely touched any actual data in the examples of this chapter. It all depends on your goals when working with data. If all you want to do is describe the data, extract some features, or even decompose it fully into its constituent parts, then the "analytic" methods of graphical and data analysis will suffice. However, if you intend to use the data to develop an understanding of the system that produced it, then looking at the data itself will be only the first (although important) step. I consider conceptual modeling to be extremely important, because it is here that we go from the descriptive to the prescriptive. A conceptual model by itself may well be the most valuable outcome of an analysis.
But even if not, it will at the very least enhance the purely analytical part of our work, because a conceptual model will lead us to additional hypotheses and thereby suggest additional ways to look at and study the data in an iterative process—in other words, even a purely conceptual model will point us back to the data, but with added insight. The methods described in this chapter and the next are the techniques that I have found to be the most practically useful when thinking about data and the processes that generated it. Whenever looking at data, I always try to understand the system behind it, and I always use some (if not all) of the methods from these two chapters.

Workshop: Sage

Most of the tools introduced in this book work with numbers, which makes sense given that we are mostly interested in understanding data. However, there is a different kind of tool that works with formulas instead: computer algebra systems. The big (commercial) brand names for such systems have been Maple and Mathematica; in the open source world, the Sage project (http://www.sagemath.org) has become somewhat of a front runner.

Sage is an "umbrella" project that attempts to combine several existing open source projects (SymPy, Maxima, and others), together with some added functionality, into a single, coherent, Python-like environment. Sage places heavy emphasis on features for number theory and abstract algebra (not exactly everyone's cup of tea) and also includes support for numerical calculations and graphics, but in this section we will limit ourselves to basic calculus and a little linear algebra. (A word of warning: if you are not really comfortable with calculus, then you probably want to skip the rest of this section. Don't worry—it won't be needed in the rest of the book.)

Once you start Sage, it drops you into a text-based command interpreter (a REPL, or read-eval-print loop).
Sage makes it easy to perform some simple calculations. For example, let's define a function and take its derivative:

    sage: a, x = var( 'a x' )
    sage: f(x) = cos(a*x)
    sage: diff( f, x )
    x |--> -a*sin(a*x)

In the first line we declare a and x as symbolic variables—so that we can refer to them later and Sage knows how to handle them. We then define a function using the "mathematical" notation f(x) = .... Only functions defined in this way can be used in symbolic calculations. (It is also possible to define Python functions using regular Python syntax, as in def f(x, a): return cos(a*x), but such functions can only be evaluated numerically.) Finally, we calculate the first derivative of the function just defined.

All the standard calculus operations are available. We can combine functions to obtain more complex ones, we can find integrals (both definite and indefinite), and we can even evaluate limits:

    sage: # Indefinite integral:
    sage: integrate( f(x,a) + a*x^2, x )
    1/3*a*x^3 + sin(a*x)/a
    sage:
    sage: # Definite integral on [0,1]:
    sage: integrate( f(x,a) + a*x^2, x, 0, 1 )
    1/3*(a^2 + 3*sin(a))/a
    sage:
    sage: # Definite integral on [0,pi], assigned to function:
    sage: g(x,a) = integrate( f(x,a) + a*x^2, x, 0, pi )
    sage:
    sage: # Evaluate g(x,a) for different a:
    sage: g(x,1)
    1/3*pi^3
    sage: g(x,1/2)
    1/6*pi^3 + 2
    sage: g(x,0)
    ----------------------------------------------------------
    RuntimeError                  (some output omitted...)
    RuntimeError: power::eval(): division by zero
    sage: limit( g(x,a), a=0 )
    pi

In the next-to-last command, we tried to evaluate an expression that is mathematically not well defined: the function g(x,a) includes a term of the form sin(πa)/a, which we can't evaluate for a = 0 because we can't divide by zero. However, the limit lim_{a→0} sin(πa)/a = π exists and is found by the limit() function.
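For a quick plain-Python sanity check (no Sage required), evaluating sin(πa)/a for shrinking a indeed approaches π:

```python
import math

# lim_{a -> 0} sin(pi*a)/a = pi, approached numerically
for a in (0.1, 0.01, 0.001):
    print(a, math.sin(math.pi * a) / a)
```

The printed values converge toward 3.14159... as a shrinks, in agreement with the symbolic limit.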
As a final example from calculus, let's evaluate some Taylor series (the arguments are: the function to expand, the variable to expand in, the point around which to expand, and the degree of the desired expansion):

    sage: taylor( f(x,a), x, 0, 5 )
    1/24*a^4*x^4 - 1/2*a^2*x^2 + 1
    sage: taylor( sqrt(1+x), x, 0, 3 )
    1/16*x^3 - 1/8*x^2 + 1/2*x + 1

So much for basic calculus. Let's also visit an example from linear algebra. Suppose we have the linear system of equations:

    a x + b y = 1
    2 x + a y + 3 z = 2
    b^2 − z = a

and that we would like to find those values of (x, y, z) that solve this system. If all the coefficients were numbers, then we could use a numeric routine to obtain the solution; but in this case, some coefficients are known only symbolically (as a and b), and we would like to express the solution in terms of these variables. Sage can do this for us quite easily:

    sage: a, b, c, x, y, z = var( 'a b c x y z' )
    sage:
    sage: eq1 = a*x + b*y == 1
    sage: eq2 = 2*x + a*y + 3*z == 2
    sage: eq3 = b^2 - z == a
    sage:
    sage: solve( [eq1,eq2,eq3], x,y,z )
    [[x == (3*b^3 - (3*a + 2)*b + a)/(a^2 - 2*b),
      y == -(3*a*b^2 - 3*a^2 - 2*a + 2)/(a^2 - 2*b),
      z == b^2 - a]]

As a last example, let's demonstrate how to calculate the eigenvalues of the following matrix (the symbolic variable c was declared together with the others above):

    M = | a  b  a |
        | b  c  b |
        | a  b  0 |

Again, if the matrix were given numerically, then we could use a numeric algorithm, but here we would like to obtain a symbolic solution.
Again, Sage can do this easily:

    sage: m = matrix( [[a,b,a],[b,c,b],[a,b,0]] )
    sage: m.eigenvalues()
    [-1/18*(-I*sqrt(3) + 1)*(4*a^2 - a*c + 6*b^2 + c^2)/(11/54*a^3 - 7/18*a^2*c
    + 1/3*b^2*c + 1/27*c^3 + 1/18*(15*b^2 - c^2)*a + 1/18*sqrt(-5*a^6 - 6*a^4*b^2
    + 11*a^2*b^4 - 5*a^2*c^4 - 32*b^6 + 2*(5*a^3 + 4*a*b^2)*c^3 + (5*a^4
    - 62*a^2*b^2 - 4*b^4)*c^2 - 2*(5*a^5 + 17*a^3*b^2 - 38*a*b^4)*c)*sqrt(3))^(1/3)
    - 1/2*(I*sqrt(3) + 1)*(11/54*a^3 - 7/18*a^2*c + 1/3*b^2*c + 1/27*c^3
    + 1/18*(15*b^2 - c^2)*a + 1/18*sqrt(-5*a^6 - 6*a^4*b^2 + 11*a^2*b^4
    - 5*a^2*c^4 - 32*b^6 + 2*(5*a^3 + 4*a*b^2)*c^3 + (5*a^4 - 62*a^2*b^2
    - 4*b^4)*c^2 - 2*(5*a^5 + 17*a^3*b^2 - 38*a*b^4)*c)*sqrt(3))^(1/3)
    + 1/3*a + 1/3*c,
    -1/18*(I*sqrt(3) + 1)*(4*a^2 - a*c + 6*b^2 + c^2)/(11/54*a^3 - 7/18*a^2*c
    + 1/3*b^2*c + 1/27*c^3 + 1/18*(15*b^2 - c^2)*a + 1/18*sqrt(-5*a^6 - 6*a^4*b^2
    + 11*a^2*b^4 - 5*a^2*c^4 - 32*b^6 + 2*(5*a^3 + 4*a*b^2)*c^3 + (5*a^4
    - 62*a^2*b^2 - 4*b^4)*c^2 - 2*(5*a^5 + 17*a^3*b^2 - 38*a*b^4)*c)*sqrt(3))^(1/3)
    - 1/2*(-I*sqrt(3) + 1)*(11/54*a^3 - 7/18*a^2*c + 1/3*b^2*c + 1/27*c^3
    + 1/18*(15*b^2 - c^2)*a + 1/18*sqrt(-5*a^6 - 6*a^4*b^2 + 11*a^2*b^4
    - 5*a^2*c^4 - 32*b^6 + 2*(5*a^3 + 4*a*b^2)*c^3 + (5*a^4 - 62*a^2*b^2
    - 4*b^4)*c^2 - 2*(5*a^5 + 17*a^3*b^2 - 38*a*b^4)*c)*sqrt(3))^(1/3)
    + 1/3*a + 1/3*c,
    1/3*a + 1/3*c + 1/9*(4*a^2 - a*c + 6*b^2 + c^2)/(11/54*a^3 - 7/18*a^2*c
    + 1/3*b^2*c + 1/27*c^3 + 1/18*(15*b^2 - c^2)*a + 1/18*sqrt(-5*a^6 - 6*a^4*b^2
    + 11*a^2*b^4 - 5*a^2*c^4 - 32*b^6 + 2*(5*a^3 + 4*a*b^2)*c^3 + (5*a^4
    - 62*a^2*b^2 - 4*b^4)*c^2 - 2*(5*a^5 + 17*a^3*b^2 - 38*a*b^4)*c)*sqrt(3))^(1/3)
    + (11/54*a^3 - 7/18*a^2*c + 1/3*b^2*c + 1/27*c^3 + 1/18*(15*b^2 - c^2)*a
    + 1/18*sqrt(-5*a^6 - 6*a^4*b^2 + 11*a^2*b^4 - 5*a^2*c^4 - 32*b^6 + 2*(5*a^3
    + 4*a*b^2)*c^3 + (5*a^4 - 62*a^2*b^2 - 4*b^4)*c^2 - 2*(5*a^5 + 17*a^3*b^2
    - 38*a*b^4)*c)*sqrt(3))^(1/3)]

Whether these results are useful to us is a different question!
This last example demonstrates something I have found to be quite generally true when working with computer algebra systems: it can be difficult to find the right kind of problem for them. Initially, computer algebra systems seem like pure magic, so effortlessly do they perform tasks that took us years to learn (and that we still get wrong). But as we move from trivial to more realistic problems, it is often difficult to obtain results that are actually useful. All too often we end up with a result like the one in the eigenvalue example, which—although "correct"—simply does not shed much light on the problem we tried to solve! And before we try manually to simplify an expression like the one for the eigenvalues, we might be better off solving the entire problem with paper and pencil, because with paper and pencil we can introduce new variables for frequently occurring terms or even make useful approximations as we go along.

I think computer algebra systems are most useful in scenarios that require the generation of a very large number of terms (e.g., combinatorial problems), which in the end are evaluated (numerically or otherwise) entirely by the computer to yield the final result, without providing a "symbolic" solution in the classical sense at all. When these conditions are fulfilled, computer algebra systems enable you to tackle problems that would simply not be feasible with paper and pencil. At the same time, you can maintain a greater level of accuracy, because numerical (finite-precision) methods, although still required to obtain a useful result, are employed only in the final stages of the calculation (rather than from the outset). Neither of these conditions is fulfilled for relatively straightforward ad hoc symbolic manipulations. Despite their immediate "magic" appeal, computer algebra systems are most useful as specialized tools for specialized tasks!
One final word about the Sage project. As an open source project, it leaves a strange impression. You first become aware of this when you attempt to download the binary distribution: it consists of a 500 MB bundle, which unpacks to 2 GB on your disk! When you investigate what is contained in this huge package, the answer turns out to be everything. Sage ships with all of its dependencies. It ships with its own copy of all the libraries it requires. It ships with its own copy of R. It ships with its own copy of Python! In short, it ships with its own copy of everything. This bundling is partially due to the well-known difficulties of making deeply numerical software portable, but it is also an expression of the fact that Sage is an umbrella project that tries to combine a wide range of otherwise independent projects. Although I sincerely appreciate the straightforward pragmatism of this solution, it also feels heavy-handed and ultimately unsustainable. Personally, it makes me doubt the wisdom of the entire "all under one roof" approach that is the whole purpose of Sage: if this is what it takes, then we are probably on the wrong track. In other words, if it is not feasible to integrate different projects in a more organic way, then perhaps those projects should remain independent, with the user free to choose which to use.

Further Reading

There are two or three dozen books out there specifically on the topic of modeling, but I have been disappointed by most of them. Some of the more useful (from the elementary to the quite advanced) include the following.

• How to Model It: Problem Solving for the Computer Age. A. M. Starfield, K. A. Smith, and A. L. Bleloch. Interaction Book Company. 1994. Probably the best elementary introduction to modeling that I am aware of. Ten (fictitious) case studies are presented and discussed, each demonstrating a different modeling method. (Out of print, but available used.)

• An Introduction to Mathematical Modeling. Edward A. Bender.
Dover Publications. 2000. Short and idiosyncratic. A bit dated but still insightful.

• Concepts of Mathematical Modeling. Walter J. Meyer. Dover Publications. 2004. This book is a general introduction to many of the topics required for mathematical modeling at an advanced beginner level. It feels more dated than it is, and the presentation is a bit pedestrian; nevertheless, it contains a lot of accessible, and most of all practical, material.

• Introduction to the Foundations of Applied Mathematics. Mark H. Holmes. Springer. 2009. This is one of the few books on modeling that places recurring mathematical techniques, rather than case studies, at the center of its discussion. Much of the material is advanced, but the first few chapters contain a careful discussion of dimensional analysis and nice introductions to perturbation expansions and time-evolution scenarios.

• Modeling Complex Systems. Nino Boccara. 2nd ed., Springer. 2010. This is a book by a physicist (not a mathematician, applied or otherwise), and it demonstrates how a physicist thinks about building models. The examples are rich but mostly of theoretical interest. Conceptually advanced, mathematically not too difficult.

• Practical Applied Mathematics. Sam Howison. Cambridge University Press. 2005. This is a very advanced book on applied mathematics with a heavy emphasis on partial differential equations. However, the introductory chapters, though short, provide one of the most insightful (and witty) discussions of models, modeling, scaling arguments, and related topics that I have seen.

The following two books are not about the process of modeling. Instead, they provide examples of modeling in action (with a particular emphasis on scaling arguments):

• The Simple Science of Flight. Henk Tennekes. 2nd ed., MIT Press. 2009.
This is a short yet fascinating book about the physics and engineering of flying, written at the “popular science” level. The author makes heavy use of scaling laws throughout. If you are interested in aviation, then you will be interested in this book.

• Scaling Concepts in Polymer Physics. Pierre-Gilles de Gennes. Cornell University Press. 1979. This is a research monograph on polymer physics and probably not suitable for a general audience. But the treatment, which relies almost exclusively on a variety of scaling arguments, is almost elementary. Written by the master of scaling models.

CHAPTER NINE

Arguments from Probability Models

When modeling systems that exhibit some form of randomness, the challenge in the modeling process is to find a way to handle the resulting uncertainty. We don’t know for sure what the system will do—there is a range of outcomes, each of which is more or less likely, according to some probability distribution. Occasionally, it is possible to work out the exact probabilities for all possible events; however, this quickly becomes very difficult, if not impossible, as we go from simple (and possibly idealized) systems to real applications. We need to find ways to simplify life!

In this chapter, I want to take a look at some of the “standard” probability models that occur frequently in practical problems. I shall also describe some of their properties that make it possible to reason about them without having to perform explicit calculations for all possible outcomes. We will see that we can reduce the behavior of many random systems to their “typical” outcome and a narrow range around that. This is true for many situations but not for all!
Systems characterized by power-law distribution functions cannot be summarized by a narrow regime around a single value, and you will obtain highly misleading (if not outright wrong) results if you try to handle such scenarios with standard methods. It is therefore important to recognize this kind of behavior and to choose appropriate techniques.

The Binomial Distribution and Bernoulli Trials

Bernoulli trials are random trials that can have only two outcomes, commonly called Success and Failure. Success occurs with probability p, and Failure occurs with probability 1 − p. We further assume that successive trials are independent and that the probability parameter p stays constant throughout. Although this description may sound unreasonably limiting, in fact many different processes can be expressed in terms of Bernoulli trials. We just have to be sufficiently creative when defining the class of events that we consider “Successes.” A few examples:

• Define Heads as Success in n successive tosses of a fair coin. In this case, p = 1/2.

• Using a fair die, we can define getting an “ace” as Success and all other outcomes as Failure. In this case, p = 1/6.

• We could just as well define not getting an “ace” as Success. In this case, p = 5/6.

• Consider an urn that contains b black tokens and r red tokens. If we define drawing a red token as Success, then repeated drawings (with replacement!) from the urn constitute Bernoulli trials with p = r/(r + b).

• Toss two identical coins and define obtaining two Heads as Success. Each toss of the two coins together constitutes a Bernoulli trial with p = 1/4.

As you can see, the restriction to a binary outcome is not really limiting: even a process that naturally has more than two possible outcomes (such as throwing dice) can be cast in terms of Bernoulli trials if we restrict the definition of Success appropriately.
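The urn example is easy to check by simulation. A minimal sketch in Python (the token counts and the seed are arbitrary choices of mine, not from the text):

```python
import random

def urn_success_rate(r, b, n, seed=17):
    """Simulate n draws with replacement from an urn with r red and b black
    tokens, counting a red token as a Success; return the Success frequency."""
    rng = random.Random(seed)
    p = r / (r + b)  # probability of drawing a red token on any single draw
    return sum(1 for _ in range(n) if rng.random() < p) / n

# With r = 3 red and b = 7 black tokens, theory predicts p = 3/10 = 0.3.
print(urn_success_rate(r=3, b=7, n=100_000))  # close to 0.3
```

Because the tokens are replaced after each draw, the trials are independent with constant p, which is exactly the Bernoulli setup.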
Furthermore, as the last example shows, even combinations of events (such as tossing two coins or, equivalently, two successive tosses of a single coin) can be expressed in terms of Bernoulli trials. The restricted nature of Bernoulli trials makes it possible to derive some exact results (we’ll see some in a moment). More importantly, though, the abstraction forced on us by the limitations of Bernoulli trials can help to develop simplified conceptual models of a random process.

Exact Results

The central formula for Bernoulli trials gives the probability of observing k Successes in N trials with Success probability p, and it is also known as the Binomial distribution (see Figure 9-1):

P(k, N; p) = \binom{N}{k} p^k (1 - p)^{N - k}

This should make good sense: we need to obtain k Successes, each occurring with probability p, and N − k Failures, each occurring with probability 1 − p. The term

\binom{N}{k} = \frac{N!}{k! (N - k)!}

is a binomial coefficient and is combinatorial in nature: it gives the number of distinct arrangements for k Successes and N − k Failures. (This is easy to see. There are N! ways to arrange N distinguishable items: you have N choices for the first item, N − 1 choices for the second, and so on. However, the k Successes are indistinguishable from each other, and the same is true for the N − k Failures. Hence the total number of arrangements is reduced by the number of ways in which the Successes can be rearranged, since all these rearrangements are identical to each other. With k Successes, this means that k! rearrangements are indistinguishable, and similarly for the N − k Failures.) Notice that the combinatorial factor does not depend on p.

FIGURE 9-1. The Binomial distribution: the probability of obtaining k Successes in N trials with Success probability p. (Curves shown for p = 1/6 and N = 10, 30, and 60 trials.)
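The Binomial formula is straightforward to evaluate directly. A short Python sketch, using the standard library’s math.comb for the binomial coefficient:

```python
from math import comb

def binom_pmf(k, N, p):
    """P(k, N; p): probability of exactly k Successes in N Bernoulli trials."""
    return comb(N, k) * p**k * (1 - p)**(N - k)

# Sanity check: the probabilities over all outcomes k = 0..N must sum to 1.
N, p = 10, 1 / 6
print(sum(binom_pmf(k, N, p) for k in range(N + 1)))  # 1.0 (up to rounding)

# Five Heads in ten tosses of a fair coin:
print(binom_pmf(5, 10, 0.5))  # 252/1024, about 0.246
```

Note that even for a fair coin, exactly five Heads in ten tosses occurs only about a quarter of the time; the rest of the weight sits in the neighboring outcomes.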
This formula gives the probability of obtaining a specific number k of Successes. To find the expected number of Successes μ in N Bernoulli trials, we need to average over all possible outcomes:

\mu = \sum_{k=0}^{N} k \, P(k, N; p) = Np

This result should come as no surprise. We use it intuitively whenever we say that we expect “about five Heads in ten tosses of a fair coin” (N = 10, p = 1/2) or that we expect to obtain “about ten aces in sixty tosses of a fair die” (N = 60, p = 1/6). Another result that can be worked out exactly is the standard deviation:

\sigma = \sqrt{Np(1 - p)}

The standard deviation gives us the range over which we expect the outcomes to vary. (For example, assume that we perform m experiments, each consisting of N tosses of a fair coin. The expected number of Successes in each experiment is Np, but of course we won’t obtain exactly this number in each experiment. However, over the course of the m experiments, we expect the number of Successes in the majority of them to lie between Np − √(Np(1 − p)) and Np + √(Np(1 − p)).) Notice that σ grows more slowly with the number of trials than does μ (σ ∼ √N versus μ ∼ N). The relative width of the outcome distribution therefore shrinks as we conduct more trials.

Using Bernoulli Trials to Develop Mean-Field Models

The primary reason why I place so much emphasis on the concept of Bernoulli trials is that it lends itself naturally to the development of mean-field models (see Chapter 8). Suppose we try to develop a model to predict the staffing level required for a call center to deal with customer complaints. We know from experience that about one in every thousand orders will lead to a complaint (hence p = 1/1000). If we shipped a million orders a day, we could use the Binomial distribution to work out the probability of receiving 1, 2, 3, ..., 999,999, 1,000,000 complaints a day and then work out the required staffing levels accordingly—a daunting task!
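Before turning to the mean-field shortcut, the two exact results above are easy to verify numerically from the distribution itself (a small sketch that restates the pmf so it stands alone):

```python
from math import comb, sqrt

def binom_pmf(k, N, p):
    return comb(N, k) * p**k * (1 - p)**(N - k)

# Sixty tosses of a fair die, with Success = rolling an ace.
N, p = 60, 1 / 6
mu = sum(k * binom_pmf(k, N, p) for k in range(N + 1))
var = sum((k - mu) ** 2 * binom_pmf(k, N, p) for k in range(N + 1))
print(mu)         # N*p = 10 expected aces
print(sqrt(var))  # sqrt(N*p*(1-p)), about 2.89
```

The brute-force sums over all outcomes reproduce the closed forms μ = Np and σ = √(Np(1 − p)) to machine precision.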
But in the spirit of mean-field theories, we can cut through the complexity by realizing that we will receive “about Np = 1,000” complaints a day. So rather than working with each possible outcome (and its associated probability), we limit our attention to a single expected outcome. (And we can now proceed to determine how many calls a single person can handle per day to find the required number of customer service people.) We can even go a step further and incorporate the uncertainty in the number of complaints by considering the standard deviation, which in this example comes out to √(Np(1 − p)) ≈ √1000 ≈ 30. (Here I made use of the fact that 1 − p is very close to 1 for the current value of p.) The spread is small compared to the expected number of calls, lending credibility to our initial approximation of replacing the full distribution with only its expected outcome. (This demonstrates the observation we made earlier that the width of the resulting distribution grows much more slowly with N than does the expected value itself. As N gets larger, this effect becomes more drastic, which means that mean-field theory gets better and more reliable the more urgently we need it! The tough cases are situations where N is of moderate size—say, in the range of 10, ..., 100. This size is too large to work out all outcomes exactly but not large enough to be safe working only with the expected values.)

Having seen this, we can apply similar reasoning to more general situations. For example, notice that the number of orders shipped each day will probably not equal exactly one million—instead, it will be a random quantity itself. So, by using N = 1,000,000 we have employed the mean-field idea already. It should be easy to generalize to other situations from here.

FIGURE 9-2. The Gaussian probability density.
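The arithmetic of the call-center estimate fits in a few lines. The per-agent capacity below is a hypothetical number of my own, used only to complete the illustration:

```python
from math import sqrt

N, p = 1_000_000, 1 / 1000
mu = N * p                     # expected complaints per day: 1000
sigma = sqrt(N * p * (1 - p))  # spread around that expectation: about 30
print(mu, sigma)

calls_per_agent = 50  # hypothetical: calls one person can handle per day
# Staff for the expected load plus a two-sigma safety margin.
print((mu + 2 * sigma) / calls_per_agent)  # about 21 agents
```

Because σ is only a few percent of μ here, adding a small safety margin on top of the mean-field estimate covers the bulk of the day-to-day fluctuation.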
The Gaussian Distribution and the Central Limit Theorem

Probably the most ubiquitous formula in all of probability theory and statistics is:

p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}

This is the formula for the Gaussian (or Normal) probability density. This is the proverbial “Bell Curve.” (See Figure 9-2 and Appendix B for additional details.) Two factors contribute to the elevated importance of the Gaussian distribution: on the foundational side, the Central Limit Theorem guarantees that the Gaussian distribution will arise naturally whenever we take averages (of almost anything). On the sheerly practical side, the fact that we can actually explicitly work out most integrals involving the Gaussian means that such expressions make good building blocks for more complicated theories.

The Central Limit Theorem

Imagine you have a source of data points that are distributed according to some common distribution. The data could be numbers drawn from a uniform random-number generator, prices of items in a store, or the body heights of a large group of people. Now assume that you repeatedly take a sample of n elements from the source (n random numbers, n items from the store, or measurements for n people) and form the total sum of the values. You can also divide by n to get the average. Notice that these sums (or averages) are random quantities themselves: since the points are drawn from a random distribution, their sums will also be random numbers. Note that we don’t necessarily know the distributions from which the original points come, so it may seem it would be impossible to say anything about the distribution of their sums. Surprisingly, the opposite is true: we can make very precise statements about the form of the distribution according to which the sums are distributed. This is the content of the Central Limit Theorem.
The Central Limit Theorem states that the sums of a bunch of random quantities will be distributed according to a Gaussian distribution. This statement is not strictly true; it is only an approximation, with the quality of the approximation improving as more points are included in each sample (as n gets larger, the approximation gets better). In practice, though, the approximation is excellent even for quite moderate values of n.

This is an amazing statement, given that we made no assumptions whatsoever about the original distributions (I will qualify this in a moment): it seems as if we got something for nothing! After a moment’s thought, however, this result should not be so surprising: if we take a single point from the original distribution, it may be large or it may be small—we don’t know. But if we take many such points, then the highs and the lows will balance each other out “on average.” Hence we should not be too surprised that the distribution of the sums is a smooth distribution with a central peak. It is, however, not obvious that this distribution should turn out to be the Gaussian specifically.

We can now state the Central Limit Theorem formally. Let {x_i} be a sample of size n, having the following properties:

1. All x_i are mutually independent.

2. All x_i are drawn from a common distribution.

3. The mean μ and the standard deviation σ for the distribution of the individual data points x_i are finite.

Then the sample average \frac{1}{n} \sum_{i=1}^{n} x_i is distributed according to a Gaussian with mean μ and standard deviation σ/√n. The approximation improves as the sample size n increases. In other words, the probability of finding the value x for the sample mean becomes Gaussian as n gets large:

P\left( \frac{1}{n} \sum_{i=1}^{n} x_i = x \right) \to \frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}} \exp\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma/\sqrt{n}} \right)^2 \right)

Notice that, as for the Binomial distribution, the width of the resulting distribution of the average is smaller than the width of the original distribution of the individual data points.
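The theorem is easy to watch in action with the standard library alone. The sketch below draws repeated samples from a uniform random-number generator on [0, 1), for which μ = 1/2 and σ = √(1/12) ≈ 0.289, and checks that the sample means scatter with width σ/√n (the sample size, the number of repeats, and the seed are my arbitrary choices):

```python
import random
import statistics
from math import sqrt

rng = random.Random(42)
n, repeats = 25, 2_000
# Each entry of 'means' is the average of one sample of n uniform draws.
means = [statistics.fmean(rng.random() for _ in range(n)) for _ in range(repeats)]

print(statistics.fmean(means))  # close to mu = 0.5
print(statistics.stdev(means))  # close to sigma/sqrt(n) = 0.289/5, about 0.058
```

A histogram of `means` (see Chapter 2) would show the familiar bell shape, even though the underlying uniform distribution looks nothing like a Gaussian.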
This aspect of the Central Limit Theorem is the formal justification for the common practice of “averaging out the noise”: no matter how widely the individual data points scatter, their averages will scatter less. On the other hand, the reduction in width is not as fast as one might want: it is not reduced linearly with the number n of points in the sample but only with √n. This means that if we take 10 times as many points, the scatter is reduced to only 1/√10 ≈ 30 percent of its original value. To reduce it to 10 percent, we would need to increase the sample size by a factor of 100. That’s a lot!

Finally, let’s take a look at the Central Limit Theorem in action. Suppose we draw samples from a uniform distribution that takes on the values 1, 2, ..., 6 with equal probability—in other words, throws of a fair die. This distribution has mean μ = 3.5 (that’s pretty obvious) and standard deviation σ = √((6² − 1)/12) ≈ 1.71 (not as obvious but not terribly hard to work out, or you can look it up). We now throw the die a certain number of times and evaluate the average of the values that we observe. According to the Central Limit Theorem, these averages should be distributed according to a Gaussian distribution that becomes narrower as we increase the number of throws used to obtain an average. To see the distribution of values, we generate a histogram (see Chapter 2). I use 1,000 “repeats” to have enough data for a histogram. (Make sure you understand what is going on here: we throw the die a certain number of times and calculate an average based on those throws; and this entire process is repeated 1,000 times.)

The results are shown in Figure 9-3. In the upper-left corner we have thrown the die only once and thus form the “average” over only a single throw. You can see that all of the possible values are about equally likely: the distribution is uniform.
In the upper-right corner, we throw the die twice every time and form the average over both throws. Already a central tendency in the distribution of the averaged values can be observed! We then continue to make longer and longer averaging runs. (Also shown is the Gaussian distribution with the appropriately adjusted width: σ/√n, where n is the number of throws over which we form the average.)

I’d like to emphasize two observations in particular. First, note how quickly the central tendency becomes apparent—it only takes averaging over two or three throws for a central peak to become established. Second, note how well the properly scaled Gaussian distribution fits the observed histograms. This is the Central Limit Theorem in action.

FIGURE 9-3. The Central Limit Theorem in action. Distribution of the average number of points when throwing a fair die several times (panels for 1, 2, 3, 5, 10, and 50 throws). The boxes show the histogram of the values obtained; the line shows the distribution according to the Central Limit Theorem.

The Central Term and the Tails

The predominant feature of the Gaussian density function is the speed with which it falls to zero as |x| (the absolute value of x—see Appendix B) becomes large. It is worth looking at some numbers to understand just how quickly it does decay. For x = 2, the standard Gaussian with zero mean and unit variance is approximately p(2; 0, 1) ≈ 0.05. For x = 5, it is already on the order of 10^-6; for x = 10 it’s about 10^-22; and not much further out, at x = 15, we find p(15; 0, 1) ≈ 10^-50. One needs to keep this in perspective: the age of the universe is currently estimated to be about 15 billion years, which is about 4 · 10^17 seconds. So, even if we had made a thousand trials per second since the beginning of time, we would still not have found a value as large as or larger than x = 10!
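These tail values are easy to reproduce. A quick check of the standard Gaussian density at the points quoted above:

```python
from math import exp, pi, sqrt

def std_gauss(x):
    """Standard Gaussian density: zero mean, unit variance."""
    return exp(-x * x / 2) / sqrt(2 * pi)

for x in (2, 5, 10, 15):
    print(x, std_gauss(x))
# x = 2 gives about 0.054; x = 5 is of order 1e-6; x = 10 of order 1e-22;
# x = 15 of order 1e-50: the decay is staggeringly fast.
```

Doubling x does not halve the density; it squares the (already tiny) exponential suppression, which is why the tails collapse so quickly.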
Although the Gaussian is defined for all x, its weight is so strongly concentrated within a finite, and actually quite small, interval (about [−5, 5]) that values outside this range will not occur. It is not just that only one in a million events will deviate from the mean by more than 5 standard deviations: the decline continues, so that fewer than one in 10^22 events will deviate by more than 10 standard deviations. Large outliers are not just rare—they don’t happen! This is both the strength and the limitation of the Gaussian model: if the Gaussian model applies, then we know that all variation in the data will be relatively small and therefore “benign.” At the same time, we know that for some systems, large outliers do occur in practice. This means that, for such systems, the Gaussian model and theories based on it will not apply, resulting in bad guidance or outright wrong results. (We will return to this problem shortly.)

Why Is the Gaussian so Useful?

It is the combination of two properties that makes the Gaussian probability distribution so common and useful: because of the Central Limit Theorem, the Gaussian distribution will occur whenever we are dealing with averages; and because so much of the Gaussian’s weight is concentrated in the central region, almost any expression can be approximated by concentrating only on the central region, while largely disregarding the tails.

As we will discuss in Chapter 10 in more detail, the first of these two arguments has been put to good use by the creators of classical statistics: although we may not know anything about the distribution of the actual data points, the Central Limit Theorem enables us to make statements about their averages. Hence, if we concentrate on estimating the sample average of any quantity, then we are on much firmer ground, theoretically.
And it is impressive to see how classical statistics is able to make rigorous statements about the extent of confidence intervals for parameter estimates while using almost no information beyond the data points themselves! I’d like to emphasize these two points again: through clever application of the Central Limit Theorem, classical statistics is able to give rigorous (not just intuitive) bounds on estimates—and it can do so without requiring detailed knowledge of (or making additional assumptions about) the system under investigation. This is a remarkable achievement! The price we pay for this rigor is that we lose much of the richness of the original data set: the distribution of points has been boiled down to a single number—the average.

The second argument is not so relevant from a conceptual point of view, but it is, of course, of primary practical importance: we can actually do many integrals involving Gaussians, either exactly or in very good approximation. In fact, the Gaussian is so convenient in this regard that it is often the first choice when an integration kernel is needed (we have already seen examples of this in Chapter 2, in the context of kernel density estimates, and in Chapter 4, when we discussed the smoothing of a time series).

Optional: Gaussian Integrals

The basic idea goes like this: we want to evaluate an integral of the form

\int f(x) \, e^{-x^2/2} \, dx

We know that the Gaussian is peaked around x = 0, so that only nearby points will contribute significantly to the value of the integral. We can therefore expand f(x) in a power series for small x. Even if this expansion is no good for large x, the result will not be affected significantly, because those points are suppressed by the Gaussian. We end up with a series of integrals of the form

a_n \int x^n \, e^{-x^2/2} \, dx

which can be performed exactly. (Here, a_n is the expansion coefficient from the power-series expansion of f(x).)
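The expansion trick is easy to try numerically. In the sketch below (my own illustration; the choice f(x) = cos x is not from the text), the direct integral is compared with the truncated series, using the even Gaussian moments ∫ x^(2m) e^(−x²/2) dx = (2m − 1)!! √(2π):

```python
from math import cos, exp, pi, sqrt

def trapezoid(h, a, b, n=20_000):
    """Plain trapezoid rule for a smooth integrand h on [a, b]."""
    dx = (b - a) / n
    return dx * (0.5 * (h(a) + h(b)) + sum(h(a + i * dx) for i in range(1, n)))

# Direct numerical evaluation of the integral of cos(x) * exp(-x^2/2).
direct = trapezoid(lambda x: cos(x) * exp(-x * x / 2), -10.0, 10.0)

# Series: cos(x) = sum over m of (-1)^m x^(2m)/(2m)!, so each term contributes
# (-1)^m * (2m-1)!! * sqrt(2*pi) / (2m)! to the integral.
series, dbl_fact, fact = 0.0, 1.0, 1.0  # (2m-1)!! and (2m)! for m = 0
for m in range(6):
    series += (-1) ** m * dbl_fact * sqrt(2 * pi) / fact
    dbl_fact *= 2 * m + 1              # advance to (2(m+1) - 1)!!
    fact *= (2 * m + 1) * (2 * m + 2)  # advance to (2(m+1))!

print(direct, series)  # both close to sqrt(2*pi)*exp(-1/2), about 1.5203
```

Six terms of the series already agree with the direct integral to within about 10^-4, because the Gaussian suppresses exactly the region where the truncated expansion of f goes bad.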
We can push this idea even further. Assume that the kernel is not exactly Gaussian but is still strongly peaked:

\int f(x) \, e^{-g(x)} \, dx

where the function g(x) has a minimum at some location (otherwise, the kernel would not have a peak at all). We can now expand g(x) into a Taylor series around its minimum (let’s assume it is at x = 0), retaining only the first two terms:

g(x) \approx g(0) + g''(0) \, x^2/2 + \cdots

The linear term vanishes because the first derivative g'(0) must be zero at a minimum. Keeping in mind that the first term in this expansion is a constant not depending on x, we have transformed the original integral into one of Gaussian type:

e^{-g(0)} \int f(x) \, e^{-g''(0) \, x^2/2} \, dx

which we already know how to solve. This technique goes by the name of Laplace’s method (not to be confused with “Gaussian integration,” which is something else entirely).

Beware: The World Is Not Normal!

Given that the Central Limit Theorem is a rigorously proven theorem, what could possibly go wrong? After all, the Gaussian distribution guarantees the absence of outliers, doesn’t it? Yet we all know that unexpected events do occur. There are two things that can go wrong with the discussion so far:

• The Central Limit Theorem applies only to sums or averages of random quantities but not necessarily to the random quantities themselves. The distribution of individual data points may be quite different from a Gaussian, so if we want to reason about individual events (rather than about an aggregate such as their average), then we may need different methods. For example, although the average number of items in a shipment may be Gaussian distributed around a typical value of three items per shipment, there is no guarantee that the actual distribution of items per shipment will follow the same distribution.
In fact, the distribution will probably be geometric, with shipments containing only a single item being much more common than any other shipment size.

• More importantly, the Central Limit Theorem may not apply. Remember the three conditions listed as requirements for the Central Limit Theorem to hold? Individual events must be independent, must follow the same distribution, and must have a finite mean and standard deviation. As it turns out, the first and second of these conditions can be weakened (meaning that individual events can be somewhat correlated and drawn from slightly different distributions), but the third condition cannot be weakened: individual events must be drawn from a distribution of finite width.

Now this may seem like a minor matter: surely, all distributions occurring in practice are of finite width, aren’t they? As it turns out, the answer is no! Apparently “pathological” distributions of this kind are much more common in real life than one might expect. Such distributions follow power-law behavior, and they are the topic of the next section.

Power-Law Distributions and Non-Normal Statistics

Let’s start with an example. Figure 9-4 shows a histogram of the number of visits per person that a sample of visitors made to a certain website over one month. Two things stand out: the huge number of people who made only a handful of visits (fewer than 5 or 6) and, at the other extreme, the huge number of visits that a few people made. (The heaviest user made 41,661 visits: that’s about one per minute over the course of the month—probably a bot or monitor of some sort.) This distribution looks nothing like the “benign” case in Figure 9-2. The distribution in Figure 9-4 is not merely skewed—it would be no exaggeration to say that it consists entirely of outliers!
Ironically, the “average” number of visits per person—calculated naively, by summing the visits and dividing by the number of unique visitors—equals 26 visits per person. This number is clearly not representative of anything: it describes neither the huge majority of light users on the lefthand side of the graph (who made one or two visits) nor the small group of heavy users on the right. (The standard deviation is ±437, which clearly suggests that something is not right, given that the mean is 26 and the number of visits must be positive.)

This kind of behavior is typical for distributions with so-called fat or heavy tails. In contrast to systems ruled by a Gaussian distribution or another distribution with short tails, data values are not effectively limited to a narrow domain. Instead, we can find a nonnegligible fraction of data points very far away from the majority of points. Mathematically speaking, a distribution is heavy-tailed if it falls to zero much more slowly than an exponential function. Power laws (i.e., functions that behave as ~ 1/x^β for some exponent β > 0) are usually used to describe such behavior. In Chapter 3, we discussed how to recognize power laws: data points falling onto a straight line on a double logarithmic plot. A double logarithmic plot of the data from Figure 9-4 is shown in Figure 9-5, and we see that eventually (i.e., for more than five visits per person), the data indeed follows a power law (approximately ~ x^-1.9). On the lefthand side of Figure 9-5 (i.e., for few visits per person), the behavior is different. (We will come back to this point later.)

Power-law distributions like the one describing the data set in Figures 9-4 and 9-5 are surprisingly common.
They have been observed in a number of different (and often colorful) areas: the frequency with which words are used in texts, the magnitude of earthquakes, the size of files, the copies of books sold, the intensity of wars, the sizes of sand particles and solar flares, the population of cities, and the distribution of wealth. Power-law distributions go by different names in different contexts—you will find them referred to as “Zipf” or “Pareto” distributions, but the mathematical structure is always the same. The term “power-law distribution” is probably the most widely accepted, general term for this kind of heavy-tailed distribution.

FIGURE 9-4. A histogram of the number of visitors who made x number of visits to a certain website. Note the extreme skewness of the distribution: most visitors made one or two visits, but a few made tens of thousands of visits.

Whenever they were found, power-law distributions were met with surprise and (usually) consternation. The reason is that they possess some unexpected and counterintuitive properties:

• Observations span a wide range of values, often many orders of magnitude.

• There is no typical scale or value that could be used to summarize the distribution of points.

• The distribution is extremely skewed, with many data points at the low end and few (but not negligibly few) data points at very high values.

• Expectation values often depend on the sample size. Taking the average over a sample of n points may yield a significantly smaller value than taking the average over 2n or 10n data points. (This is in marked contrast to most other distributions, where the quality of the average improves when it is based on more points. Not so for power-law distributions!)
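The sample-size dependence of the average is easy to demonstrate with the standard library’s Pareto generator. For a tail exponent α < 1 the mean does not exist, and the running average never settles down (the exponent, sample size, and seed here are my own choices, not from the text):

```python
import random
from statistics import fmean

rng = random.Random(7)
# Pareto-distributed draws with tail exponent alpha = 0.9: the mean is infinite.
draws = [rng.paretovariate(0.9) for _ in range(200_000)]

for n in (1_000, 10_000, 100_000, 200_000):
    print(n, fmean(draws[:n]))
# The running averages drift (typically upward) instead of converging.

# A single extreme draw carries a noticeable share of the entire sum:
print(max(draws) / sum(draws))
```

Compare this with a short-tailed distribution, where the running average would already be stable to a few percent after the first thousand points.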
FIGURE 9-5. The data from Figure 9-4 but on double logarithmic scales. The righthand side of this curve is well described by the power law 1/x^1.9. (The plot also marks the naive mean of 26 visits.)

It is the last item that is the most disturbing. After all, didn’t the Central Limit Theorem tell us that the scatter of the average was always reduced by a factor of 1/√n as the sample size increases? Yes, but remember the caveat at the end of the last section: the Central Limit Theorem applies only to those distributions that have a finite mean and standard deviation. For power-law distributions, this condition is not necessarily fulfilled, and hence the Central Limit Theorem does not apply. The importance of this fact cannot be overstated. Not only does much of our intuition go out the window, but most of statistical theory does, too! For the most part, distributions without expectations are simply not treated by standard probability theory and statistics.*

Working with Power-Law Distributions

So what should you do when you encounter a situation described by a power-law distribution? The most important thing is to stop using classical methods. In particular, the mean-field approach (replacing the distribution by its mean) is no longer applicable and will give misleading or incorrect results.
From a practical point of view, you can try segmenting the data (and, by implication, the system) into different groups: the majority of data points at small values (on the lefthand side in Figure 9-5), the set of data points in the tail of the distribution (at relatively large values), and possibly even a group of data points making up the intermediate regime. Each such group is now more homogeneous, so that standard methods may apply. You will need insight into the business domain of the data, and you should exercise discretion when determining where to make those cuts, because the data itself will not yield a natural "scale" or other quantity that could be used for this purpose.

*The comment on page 48 (out of 440) of Larry Wasserman's excellent All of Statistics is typical: "From now on, whenever we discuss expectations, we implicitly assume that they exist."

There is one more practical point that you should be aware of when working with power-law distributions: the form ∼ 1/x^β is valid only "asymptotically," for large values of x. For small x, this rule must be supplemented, since it obviously cannot hold for x → 0 (we can't divide by zero). There are several ways to augment the original form near x = 0. We can impose a minimum value x_min of x and consider the distribution only for values larger than this. That is often a reasonable approach, because such a minimum value may exist naturally. For example, there is an obvious "minimum" number of pages (i.e., one page) that a website visitor can view and still be considered a "visitor." Similar considerations hold for the population of a city and the number of copies of books sold—all are limited on the left by x_min = 1. Alternatively, the behavior of the observed distribution may simply be different for small values. Look again at Figure 9-5: for values less than about 5, the curve deviates from the power-law behavior that we find elsewhere.
Depending on the shape that we require near zero, we can modify the original rule in different ways. Two examples stand out: if we want a flat peak at x = 0, then we can try a form like ∼ 1/(a + x^β) for some a > 0; and if we require a peak at a nonzero location, we can use a distribution like ∼ exp(−c/x)/x^β (see Figure 9-6). For specific values of β, two distributions of this kind have special names:

  p(x) = (1/π) · 1/(1 + x^2)                      Cauchy distribution
  p(x) = √(c/(2π)) · e^(−c/(2x)) / x^(3/2)        Lévy distribution

Optional: Distributions with Infinite Expectation Values

The expectation value E(f) of a function f(x), which in turn depends on some random quantity x, is nothing but the weighted average of that function, in which we use the probability density p(x) of x as the weight function:

  E(f) = ∫ f(x) p(x) dx

Of particular importance are the expectation values of simple powers of the variable x, the so-called moments of the distribution:

  E(1)   = ∫ p(x) dx         (must always equal 1)
  E(x)   = ∫ x p(x) dx       Mean or first moment
  E(x^2) = ∫ x^2 p(x) dx     Second moment

FIGURE 9-6. The Lévy distribution for several values of the parameter c (c = 1, 2, 5).

The first expression must always equal 1, because we expect p(x) to be properly normalized. The second is the familiar mean, the weighted average of x. The last expression is used in the definition of the standard deviation:

  σ = √(E(x^2) − E(x)^2)

For power-law distributions, which behave as ∼ 1/x^β with β > 1 for large x, some of these integrals may not converge—in this case, the corresponding moment "does not exist." Consider the kth moment (C is the normalization constant that ensures E(1) = ∫ p(x) dx = 1):

  E(x^k) = C ∫^∞ x^k · (1/x^β) dx = C ∫^∞ dx / x^(β−k)

Unless β − k > 1, this integral does not converge at the upper limit of integration.
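The convergence condition β − k > 1 is easy to verify numerically. Here is a minimal sketch (the value β = 2.5 and the cutoff values are arbitrary choices for illustration): truncate the integral at a finite upper limit X, evaluate it in closed form, and watch whether the result settles down as X grows.

```python
from math import log

def truncated_moment(k, beta, upper):
    """Closed-form integral of x**(k - beta) from 1 to upper."""
    e = k - beta + 1.0
    if e == 0.0:
        return log(upper)
    return (upper ** e - 1.0) / e

# With beta = 2.5, the mean (k = 1) settles down as the cutoff grows,
# but the second moment (k = 2) keeps growing without bound.
convergent = [truncated_moment(1, 2.5, X) for X in (10, 100, 1_000, 10_000)]
divergent  = [truncated_moment(2, 2.5, X) for X in (10, 100, 1_000, 10_000)]
```

For k = 1 and β = 2.5 the truncated integral approaches a finite limit; for k = 2 it grows roughly like √X.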
(I assume that the integral is proper at the lower limit of integration, through a lower cutoff x_min or one of the other methods discussed previously.) In particular, if β < 2, then the mean and all higher moments do not exist; if β < 3, then the standard deviation does not exist.

We need to understand that this is an analytical result—it tells us that the distribution is ill behaved and that, for instance, the Central Limit Theorem does not apply in this case. Of course, for any finite sample of n data points drawn from such a distribution, the mean (or other moment) will be perfectly finite. But these analytical results warn us that, if we continue to draw additional data points from the distribution, then their average (or other moment) will not settle down: it will grow as the number of data points in the sample grows. Any summary statistic calculated from a finite sample of points will therefore not be a good estimator for the true (in this case: infinite) value of that statistic. This poses an obvious problem because, of course, all practical samples contain only a finite number of points.

Power-law distributions have no parameters that could (or need to) be estimated—except for the exponent, which we know how to obtain from a double logarithmic plot. There is also a maximum likelihood estimator for the exponent:

  β = 1 + n / Σ log(x_i / x_0)

where the sum runs over all n data points and x_0 is the smallest value of x for which the asymptotic power-law behavior holds.

Where to Go from Here

If you want to dig deeper into the theory of heavy-tail phenomena, you will find that it is a mess. There are two reasons for this. On the one hand, the material is technically hard (since one must make do without two standard tools: expectation values and the Central Limit Theorem), so few simple, substantial, powerful results have been obtained—a fact that is often covered up by excessive formalism.
On the other hand, the "colorful" and multidisciplinary context in which power-law distributions are found has led to much confusion. Similar results are being discovered and rediscovered in various fields, with each field imposing its own terminology and methodology, thereby obscuring the mathematical commonalities. The unexpected and often almost paradoxical consequences of power-law behavior also seem to demand an explanation for why such distributions occur in practice and whether they might all be expressions of some common mechanism. Quite a few theories have been proposed toward this end, but none has found widespread acceptance or proved particularly useful in predicting new phenomena—occasionally grandiose claims to the contrary notwithstanding. At this point, I think it is fair to say that we don't understand heavy-tail phenomena: not when and why they occur, nor how to handle them if they do.

Other Distributions

There are some other distributions, describing common scenarios, that you should be aware of. Some of the most important (or most frequently used) ones are described in this section.

FIGURE 9-7. The geometric distribution p(k, p) = p(1 − p)^(k−1) for p = 0.2, 0.5, and 0.8.

Geometric Distribution

The geometric distribution (see Figure 9-7):

  p(k, p) = p(1 − p)^(k−1)    with k = 1, 2, 3, ...

is a special case of the binomial distribution. It can be viewed as the probability of obtaining the first Success at the kth trial (i.e., after observing k − 1 failures). Note that there is only a single arrangement of events for this outcome; hence the combinatorial factor is equal to one. The geometric distribution has mean μ = 1/p and standard deviation σ = √(1 − p)/p.

Poisson Distribution

The binomial distribution gives us the probability of observing exactly k events in n distinct trials.
In contrast, the Poisson distribution describes the probability of finding k events during a continuous observation interval of known length. Rather than being characterized by a probability parameter and a number of trials (as the binomial distribution is), the Poisson distribution is characterized by a rate λ and an interval length t. The Poisson distribution p(k, t, λ) gives the probability of observing exactly k events during an interval of length t when the rate at which events occur is λ (see Figure 9-8):

  p(k, t, λ) = ((λt)^k / k!) e^(−λt)

FIGURE 9-8. The Poisson distribution p(k, t, λ) = ((λt)^k / k!) e^(−λt) for λ = 1, 3, and 10.

Because t and λ occur only together, this expression is often written in a two-parameter form as p(k, ν) = e^(−ν) ν^k / k!. Also note that the term e^(−λt) does not depend on k at all—it is merely there as a normalization factor. All the action is in the fractional part of the equation.

Let's look at an example. Assume that phone calls arrive at a call center at a rate of 15 calls per hour (so that λ = 0.25 calls/minute). Then the Poisson distribution p(k, 1, 0.25) gives us the probability that k = 0, 1, 2, ... calls will arrive in any given minute. But we can also use it to calculate the probability that k calls will arrive during any 5-minute time period: p(k, 5, 0.25). Note that in this context, it makes no sense to speak of independent trials: time passes continuously, and the expected number of events depends on the length of the observation interval.

We can collect a few results. The mean μ and standard deviation σ for the Poisson distribution are given by:

  μ = λt    σ = √(λt)

Notice that a single parameter (λt) controls both the location and the width of the distribution. For large λt, the Poisson distribution approaches a Gaussian distribution with μ = λt and σ = √(λt).
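The call-center probabilities, and the Gaussian limit, can be evaluated directly. A sketch (the comparison value λt = 100 is an arbitrary choice for illustration):

```python
from math import exp, factorial, sqrt, pi

def poisson(k, t, lam):
    """p(k, t, lam): probability of exactly k events in an interval of
    length t when events occur at rate lam."""
    nu = lam * t
    return nu ** k * exp(-nu) / factorial(k)

lam = 0.25                          # 15 calls/hour = 0.25 calls/minute
p_one_call = poisson(1, 1, lam)     # exactly one call in a given minute
p_no_calls = poisson(0, 5, lam)     # no calls during a 5-minute period

# For large lam*t, the distribution is close to a Gaussian with
# mu = lam*t and sigma = sqrt(lam*t); compare the peak heights at nu = 100:
nu = 100
poisson_peak = poisson(nu, 1, nu)   # pmf at its peak, k = nu
gauss_peak = 1 / sqrt(2 * pi * nu)  # Gaussian density at its mean
```

At λt = 100 the two peak heights agree to better than one part in a thousand.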
Only for small values of λt (say, λt < 20) are the differences notable.

Conversely, to estimate the parameter λ from observations, we divide the number k of events observed by the length t of the observation period: λ = k/t.

Keep in mind that when evaluating the formula for the Poisson distribution, the rate λ and the length t of the interval of interest must be in compatible units. To find the probability of k calls over 6 minutes in our call-center example above, we can use either t = 6 minutes and λ = 0.25 calls per minute or t = 0.1 hours and λ = 15 calls per hour, but we cannot mix them. (Also note that 6 · 0.25 = 0.1 · 15 = 1.5, as it should be.)

The Poisson distribution is appropriate for processes in which discrete events occur independently and at a constant rate: calls to a call center, misprints in a manuscript, traffic accidents, and so on. However, you have to be careful: it applies only if you can identify a rate at which events occur and if you are interested specifically in the number of events that occur during intervals of varying length. (You cannot expect every histogram to follow a Poisson distribution just because "we are counting events.")

Log-Normal Distribution

Some quantities are inherently asymmetric. Consider, for example, the time it takes people to complete a certain task: because everyone is different, we expect a distribution of values. However, all values are necessarily positive (since times cannot be negative). Moreover, we can expect a particular shape for the distribution: there will be some minimum time that nobody can beat, then a small group of very fast champions, a peak at the most typical completion time, and finally a long tail of stragglers. Clearly, such a distribution will not be well described by a Gaussian, which is defined for both positive and negative values of x, is symmetric, and has short tails!
The log-normal distribution is an example of an asymmetric distribution that is suitable for such cases. It is related to the Gaussian: a quantity follows the log-normal distribution if its logarithm is distributed according to a Gaussian. The probability density for the log-normal distribution looks like this:

  p(x; μ, σ) = (1/(√(2π) σ x)) exp( −(1/2) (log(x/μ)/σ)^2 )

(The additional factor of x in the denominator stems from the Jacobian in the change of variables from x to log x.) You may often find the log-normal distribution written slightly differently:

  p(x; μ̃, σ) = (1/(√(2π) σ x)) exp( −(1/2) ((log(x) − μ̃)/σ)^2 )

This is the same once you realize that log(x/μ) = log(x) − log(μ) and make the identification μ̃ = log(μ). The first form is much better, because it expresses clearly that μ is the typical scale of the problem. It also ensures that the argument of the logarithm is dimensionless (as it must be).

FIGURE 9-9. The log-normal distribution for μ = 1 and σ = 2, 1, 1/2, and 1/4.

Figure 9-9 shows the log-normal distribution for a few different values of σ. The parameter σ controls the overall "shape" of the curve, whereas the parameter μ controls its "scale." In general, it can be difficult to predict what the curve will look like for different values of the parameters, but here are some results (the mode is the position of the peak):

  Mode:                μ e^(−σ^2)
  Mean:                μ e^(σ^2/2)
  Standard deviation:  μ e^(σ^2/2) √(e^(σ^2) − 1)

Values for the parameters can be estimated from a data set as follows:

  μ = exp( (1/n) Σ log x_i )
  σ = √( (1/n) Σ (log(x_i/μ))^2 )

The log-normal distribution is important as an example of a standard statistical distribution that provides an alternative to the Gaussian model for situations that require an asymmetric distribution. That being said, the log-normal distribution can be fickle to use in practice.
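These estimators translate directly into code. A sketch, using synthetic data: since a quantity is log-normal exactly when its logarithm is Gaussian, random.gauss plus exp generates test data with known μ and σ (the values μ = 1, σ = 0.5 are arbitrary choices).

```python
import random
from math import exp, log, sqrt

# Synthetic log-normal data with known parameters: mu = exp(0) = 1, sigma = 0.5
random.seed(5)
data = [exp(random.gauss(0.0, 0.5)) for _ in range(200_000)]

# The two estimators from the text:
n = len(data)
mu_hat = exp(sum(log(x) for x in data) / n)
sigma_hat = sqrt(sum(log(x / mu_hat) ** 2 for x in data) / n)
```

With 200,000 points, both estimates land within about a percent of the true values.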
Not all asymmetric point distributions are described well by a log-normal distribution, and you may not be able to obtain a good fit for your data using one. For truly heavy-tail phenomena in particular, you will need a power-law distribution after all. Also keep in mind that the log-normal distribution approaches the Gaussian as σ becomes small compared to μ (i.e., σ/μ ≪ 1), at which point it becomes easier to work with the familiar Gaussian directly.

Special-Purpose Distributions

Many additional distributions have been defined and studied. Some, such as the gamma distribution, are mostly of theoretical importance, whereas others—such as the chi-square, t, and F distributions—are at the core of classical, frequentist statistics (we will encounter them again in Chapter 10). Still others have been developed to model specific scenarios occurring in practical applications—especially in reliability engineering, where the objective is to make predictions about likely failure rates and survival times. I just want to mention in passing a few terms that you may encounter.

The Weibull distribution is used to express the probability that a device will fail after a certain time. Like the log-normal distribution, it depends on both a shape and a scale parameter. Depending on the value of the shape parameter, the Weibull distribution can be used to model different failure modes. These include "infant mortality" scenarios, where devices are more likely to fail early but the failure rate declines over time as defective items disappear from the population, and "fatigue death" scenarios, where the failure rate rises over time as items age.

Yet another set of distributions goes by the name of extreme-value or Gumbel distributions. They can be used to obtain the probability that the smallest (or largest) value of some random quantity will be of a certain size.
In other words, they answer the question: what is the probability that the largest element in a set of random numbers is precisely x?

Quite intentionally, I don't give formulas for these distributions here. They are rather advanced and specialized tools, and if you want to use them, you will need to consult the appropriate references. However, the important point to take away is that, for many typical scenarios involving random quantities, people have developed explicit models and studied their properties; hence a little research may well turn up a solution to whatever your current problem is.

Optional: Case Study—Unique Visitors over Time

To put some of the ideas introduced in the last two chapters into practice, let's look at an example that is a bit more involved. We begin with a probabilistic argument and use it to develop a mean-field model, which in turn will lead to a differential equation that we proceed to solve for our final answer. This example demonstrates how all the different ideas we have been introducing in the last few chapters can fit together to tackle more complicated problems.

Imagine you are running a website. Users visit this website every day of the month at a rate that is roughly constant. We can also assume that we are able to track the identity of these users (through a cookie or something like that). By studying those cookies, we can see that some users visit the site only once in any given month while others visit it several times. We are interested in the number of unique users for the month and, in particular, in how this number develops over the course of the month. (The number of unique visitors is a key metric in Internet advertising, for instance.) The essential difficulty is that some users visit several times during the month, and so the number of unique visitors is smaller than the total number of visitors.
Furthermore, we will observe a "saturation effect": on the first day, almost every user is new; but on the last day of the month, we can expect to have seen many of the visitors earlier in the month already. We would like to develop some understanding of the number of unique visitors that can be expected for each day of the month (e.g., to monitor whether we are on track to meet some monthly goal for the number of unique visitors).

To make progress, we need to develop a model. To see more clearly, we use the following idealization, which is equivalent to the original problem. Consider an urn that contains N identical tokens (the total number of potential visitors). At each turn (every day), we draw k tokens randomly from the urn (the average number of visitors per day). We mark all of the drawn tokens to indicate that we have "seen" them and then place them back into the urn. This cycle is repeated for every day of the month.

Because at each turn we mark all unmarked tokens in the random sample drawn at that turn, the number of marked tokens in the urn will increase over time. And because each token is marked at most once, the number of marked tokens in the urn at the end of the month is the number of unique visitors that have visited during that time period.

Phrased this way, the process can be modeled as a sequence of Bernoulli trials. We define drawing an already marked token as Success. Because the number of marked tokens in the urn is increasing, the success probability p will change over time. The relevant variables are:

  N               Total number of tokens in the urn
  k               Number of tokens drawn at each turn
  m(t)            Number of already-marked tokens drawn at turn t
  n(t)            Total number of marked tokens in the urn at time t
  p(t) = n(t)/N   Probability of drawing an already-marked token at turn t

Each day consists of a new Bernoulli trial in which k tokens are drawn from the urn. However, because the number of marked tokens in the urn increases every day, the probability p(t) is different every day.
On day t, we have n(t) marked tokens in the urn. We now draw k tokens, of which we expect m(t) = k p(t) to be marked (Successes). This is simply an application of the basic result for the expectation value of Bernoulli trials, using the current value of the probability. (Working with the expectation value in this way constitutes a mean-field approximation.) The number of unmarked tokens in the current drawing is:

  k − m(t) = k − k p(t) = k (1 − p(t))

We now mark these tokens and place them back into the urn, which means that the number of marked tokens in the urn grows by k(1 − p(t)):

  n(t + 1) = n(t) + k (1 − p(t))

This equation simply expresses the fact that the new number of marked tokens n(t + 1) consists of the previous number of marked tokens n(t) plus the newly marked tokens k(1 − p(t)). We can now divide both sides by N (the total number of tokens). Recalling that p(t) = n(t)/N, we write:

  p(t + 1) = p(t) + f (1 − p(t))    with f = k/N

This is a recurrence relation for p(t), which can be rewritten as:

  p(t + 1) − p(t) = f (1 − p(t))

In the continuum limit, we replace the difference between the "new" and the "old" values by the derivative at time t, which turns the recurrence relation into a more convenient differential equation:

  dp(t)/dt = f (1 − p(t))

with initial condition p(t = 0) = 0 (because initially there are no marked tokens in the urn). This differential equation has the solution:

  p(t) = 1 − e^(−ft)

Figure 9-10 shows p(t) for various values of the parameter f. (The parameter f has an obvious interpretation as the size of each drawing expressed as a fraction of the total number of tokens in the urn.) This is the result that we have been looking for. Remember that p(t) = n(t)/N; hence the probability is directly proportional to the number of unique visitors seen so far.
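The mean-field solution can be checked against a direct simulation of the urn process. A sketch, with N, k, and the month length chosen arbitrarily for illustration:

```python
import random
from math import exp

random.seed(42)
N, k, days = 10_000, 500, 30    # illustrative values: f = k/N = 0.05
f = k / N

marked = set()                  # tokens we have "seen" so far
simulated = []
for t in range(1, days + 1):
    marked.update(random.sample(range(N), k))   # today's k distinct tokens
    simulated.append(len(marked) / N)           # fraction marked after day t

# Mean-field prediction p(t) = 1 - exp(-f*t) for the same days:
predicted = [1 - exp(-f * t) for t in range(1, days + 1)]
max_gap = max(abs(s - p) for s, p in zip(simulated, predicted))
```

For these parameters, simulation and mean-field curve stay within a few percent of each other over the whole month.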
We can rewrite it more explicitly as:

  n(t) = N (1 − e^(−(k/N) t))

FIGURE 9-10. Fraction of unique visitors seen by day t, for f = 1/4, 1/10, and 1/20. The parameter f is the number of daily users expressed as a fraction of all potential users.

In this form, the equation gives us, for each day of the month, the number of unique visitors for the month up to that date. There is only one unknown parameter: N, the total number of potential visitors. (We know k, the average number of total visitors per day, because this number is immediately available from the web-server logs.) We can now fit one or two months' worth of data to this formula to obtain a value for N. Once we have determined N, the formula predicts the expected number of unique visitors for each day of the month. We can use this information to track whether the actual number of unique visitors for the current month is above or below expectations.

The steps we took in this little example are typical of a lot of modeling. We start with a real problem in a specific situation. To make headway, we recast it in an idealized form that tries to retain only the most relevant information. (In this example: mapping the original problem to an idealized urn model.) Expressing things in terms of an idealized model helps us recognize the problem as one we know how to solve. (Urn models have been studied extensively; in this example, we could identify the process with Bernoulli trials, which we know how to handle.) Finding a solution often requires that we make actual approximations in addition to the abstraction from the problem domain to an idealized model. (Working with the expectation value was one such approximation to make the problem tractable; replacing the recurrence relation with a differential equation was another.)
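The fitting step for the single unknown parameter N can be sketched with a plain grid search (rather than a library fitter). The daily counts below are hypothetical illustration data, roughly consistent with N = 10,000 and k = 500; in practice you would use your own server logs.

```python
from math import exp

k = 500   # average daily visitors, known from the server logs (assumed value)
# Hypothetical cumulative unique-visitor counts for days 1 through 5:
observed = [490, 950, 1390, 1810, 2210]

def predicted(N, t):
    """Model prediction n(t) = N * (1 - exp(-k*t/N))."""
    return N * (1 - exp(-k * t / N))

def sq_error(N):
    """Sum of squared deviations between model and observations."""
    return sum((predicted(N, t) - y) ** 2
               for t, y in enumerate(observed, start=1))

# Brute-force search over candidate values of N:
best_N = min(range(1_000, 50_001, 100), key=sq_error)
```

With this synthetic data, the search recovers a value of N close to 10,000, the value the data was constructed from.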
Finally, we end up with a "model" that involves some unknown parameters. If we are mostly interested in developing conceptual understanding, then we don't need to go any further, since we can read off the model's behavior directly from the formula. However, if we actually want to make numerical predictions, then we'll need to find numerical values for those parameters, which is usually done by fitting the model to some already available data. (We should also try to validate the model to see whether it gives a good "fit"; refer to the discussion in Chapter 3 on examining residuals, for instance.)

Finally, I should point out that the model in this example is simplified—as models usually are. The most critical simplification (which would most likely not be correct in a real application) is that every token in the urn has the same probability of being drawn at each turn. In contrast, if we look at the behavior of actual visitors, we will find that some are much more likely to visit frequently while others are less likely to visit. Another simplification is that we assumed the total number of potential visitors to be constant. But if we have a website that sees significant growth from one month to the next, this assumption may not be correct, either. You may want to try to build an improved model that takes these (and perhaps other) considerations into account. (The first one in particular is not easy—in fact, if you succeed, then let me know how you did it!)

Workshop: Power-Law Distributions

The crazy effects of power-law distributions have to be seen to be believed. In this workshop, we shall generate (random) data points distributed according to a power-law distribution and begin to study their properties. First question: how does one actually generate nonuniformly distributed random numbers on a computer?
A random generator that produces uniformly distributed numbers is available in almost all programming environments, but generating random numbers distributed according to some other distribution requires a little more work. There are different ways of going about it; some are specific to certain distributions, whereas others are designed for speed in particular applications. We'll discuss a simple method that works for distributions that are known analytically.

The starting point is the cumulative distribution function of the distribution in question. By construction, the cumulative distribution function is strictly monotonic and takes on values in the interval [0, 1]. If we now generate uniformly distributed numbers between 0 and 1, then we can find the locations at which the cumulative distribution function assumes these values. These points will be distributed according to the desired distribution (see Figure 9-11).

(A good way to think about this is as follows. Imagine you distribute n points uniformly on the interval [0, 1] and find the corresponding locations at which the cumulative distribution function assumes these values. These locations are spaced according to the distribution in question—after all, by construction, the probability grows by the same amount between successive locations. Now use points that are randomly distributed, rather than uniformly spaced, and you end up with random points distributed according to the desired distribution.)

FIGURE 9-11. Generating random numbers from the Gaussian distribution (here with mean μ = 2 and standard deviation σ = 3): generate uniformly distributed numbers between 0 and 1, then find the locations at which the Gaussian distribution function assumes these values. The locations follow a Gaussian distribution.
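For the Gaussian of Figure 9-11, the construction can be written down directly. A sketch using the inverse cumulative distribution function that Python's standard statistics.NormalDist provides (so that we don't have to invert the distribution function ourselves):

```python
import random
from statistics import NormalDist

random.seed(1)
dist = NormalDist(mu=2.0, sigma=3.0)   # the Gaussian of Figure 9-11

# Uniform numbers between 0 and 1, pushed through the inverse CDF:
sample = [dist.inv_cdf(random.random()) for _ in range(100_000)]

# The resulting points should follow the Gaussian we started with:
sample_mean = sum(sample) / len(sample)
sample_sd = (sum((x - sample_mean) ** 2 for x in sample) / len(sample)) ** 0.5
```

The sample mean and standard deviation come out close to μ = 2 and σ = 3, as they should.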
For power-law distributions, we can easily work out the cumulative distribution function and its inverse. Let the probability density p(x) be:

  p(x) = α / x^(α+1)    for x ≥ 1, α > 0

This is known as the "standard" form of the Pareto distribution. It is valid for values of x greater than or equal to 1. (Values of x < 1 have zero probability of occurring.) The parameter α is the "shape parameter" and must be greater than zero, because otherwise the probability is not normalizable. (This is a different convention than the one we used earlier: β = 1 + α.)

We can work out the cumulative distribution function P(x):

  P(x) = y = ∫_1^x p(t) dt = 1 − 1/x^α

This expression can be inverted to give:

  x = 1 / (1 − y)^(1/α)

If we now use uniformly distributed random values for y, then the values for x will be distributed according to the Pareto distribution that we started with. (For other distributions, such as the Gaussian, inverting the expression for the cumulative distribution function is often harder, and you may have to find a numerical library that includes the inverse of the distribution function explicitly.)

Now remember what we said earlier. If the exponent in the denominator is less than or equal to 2 (i.e., if β ≤ 2 or α ≤ 1), then the "mean does not exist." In practice, we can evaluate the mean for any sample of points, and for any finite sample the mean will, of course, also be finite. But as we take more and more points, the mean does not settle down—instead it keeps on growing. On the other hand, if the exponent in the denominator is strictly greater than 2 (i.e., if β > 2 or α > 1), then the mean does exist, and its value does not depend on the sample size.

I would like to emphasize again how counterintuitive the behavior for α ≤ 1 is. We usually expect that larger samples will give us better results with less noise. But in this particular scenario, the opposite is true! We can explore behavior of this type using the simple program shown below.
All it does is generate 10 million random numbers distributed according to a Pareto distribution. I generate those numbers using the method described at the beginning of this section; alternatively, I could have used the paretovariate() function in the standard random module. We maintain a running total of all values (so that we can form the mean) and also keep track of the largest value seen so far. The results for two runs, with α = 0.5 and α = 1.2, are shown in Figures 9-12 and 9-13, respectively.

import sys, random

def pareto( alpha ):
    # Inverse transform sampling: y is uniform on [0, 1),
    # so 1/(1-y)**(1/alpha) is Pareto distributed.
    y = random.random()
    return 1.0/pow( 1-y, 1.0/alpha )

alpha = float( sys.argv[1] )

n, ttl, mx = 0, 0, 0
while n < 1e7:
    n += 1
    v = pareto( alpha )
    ttl += v              # running total, so that ttl/n is the mean
    mx = max( mx, v )     # largest value seen so far

    if n % 50000 == 0:
        print( n, ttl/n, mx )

FIGURE 9-12. Sampling from the Pareto distribution p(x) = 1/(2 x^(3/2)) (i.e., α = 0.5). Both the mean and the maximum value grow without bound.

The typical behavior for situations with α ≤ 1 versus α > 1 is immediately apparent: whereas in Figure 9-13 the mean settles down pretty quickly to a finite value, the mean in Figure 9-12 continues to grow. We can also recognize clearly what drives this behavior. For α ≤ 1, very large values occur relatively frequently. Each such occurrence leads to an upward jump in the total sum of values seen, which is reflected in a concomitant jump in the mean. Over time, as more trials are conducted, the denominator in the mean grows, and hence the value of the mean begins to fall. However (and this is what is different for α ≤ 1 versus α > 1), before the mean has fallen back to its previous value, a further extraordinarily large value occurs, driving the sum (and hence the mean) up again, with the consequence that the numerator of the expression ttl/n in the example program grows faster than the denominator.
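As a further check, one can recover the exponent of a generated sample with the maximum likelihood estimator given earlier in the chapter. A sketch, using random.paretovariate (which draws from the density α/x^(α+1), so β = 1 + α); the sample size and the value α = 1.2 are arbitrary choices for illustration:

```python
import random
import math

def beta_mle(xs, x0=1.0):
    """Maximum likelihood estimate of the exponent beta in p(x) ~ 1/x**beta,
    for a sample xs with all values >= x0."""
    return 1.0 + len(xs) / sum(math.log(x / x0) for x in xs)

random.seed(17)
# paretovariate(alpha) draws from alpha/x**(alpha+1), i.e. beta = 1 + alpha
sample = [random.paretovariate(1.2) for _ in range(100_000)]
beta_hat = beta_mle(sample)   # should come out close to 1 + 1.2 = 2.2
```

With 100,000 points, the estimate typically lands within a few hundredths of the true exponent.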
You may want to experiment with this kind of system yourself. The behavior at the borderline value α = 1 is particularly interesting. You may also want to investigate how quickly ttl/n grows for different values of α. Finally, don't restrict yourself to the mean: similar considerations hold for the standard deviation (see our discussion regarding this point earlier in the chapter).

FIGURE 9-13. Sampling from the Pareto distribution p(x) = 1.2/x^(2.2) (i.e., α = 1.2). Both the mean and the maximum reach a finite value and retain it as we continue to make further drawings.

Further Reading

• An Introduction to Probability Theory and Its Applications, vol. 1. William Feller. 3rd ed., Wiley. 1968. Every introductory book on probability theory covers most of the material in this chapter. This classic is my personal favorite for its deep yet accessible treatment and for its large selection of interesting or amusing examples.

• An Introduction to Mathematical Statistics and Its Applications. Richard J. Larsen and Morris L. Marx. 4th ed., Prentice Hall. 2005. This is my favorite book on theoretical statistics. The first third contains a good, practical introduction to many of this chapter's topics.

• NIST/SEMATECH e-Handbook of Statistical Methods. NIST. http://www.itl.nist.gov/div898/handbook/. 2010. This free ebook is made available by the National Institute of Standards and Technology (NIST). There is a wealth of reliable, high-quality information here.

• Statistical Distributions. Merran Evans, Nicholas Hastings, and Brian Peacock. 3rd ed., Wiley. 2000. This short and accessible reference includes basic information on 40 of the most useful or important probability distributions. If you want to know what distributions exist and what their properties are, this is a good place to start.
• "Power Laws, Pareto Distributions and Zipf's Law." M. E. J. Newman. Contemporary Physics 46 (2005), p. 323. This review paper provides a knowledgeable yet very readable introduction to the field of power laws and heavy-tail phenomena. Highly recommended. (Versions of the document can be found on the Web.)

• Modeling Complex Systems. Nino Boccara. 2nd ed., Springer. 2010. Chapter 8 of this book provides a succinct and level-headed overview of the current state of research into power-law phenomena.

CHAPTER TEN
What You Really Need to Know About Classical Statistics

Basic classical statistics has always been somewhat of a mystery to me: a topic full of obscure notions, such as t-tests and p-values, and confusing statements like "we fail to reject the null hypothesis"—which I can read several times and still not know if it is saying yes, no, or maybe.* To top it all off, all this formidable machinery is then used to draw conclusions that don't seem to be all that interesting—it's usually something about whether the means of two data sets are the same or different. Why would I care?

Eventually I figured it out, and I also figured out why the field seemed so obscure initially. In this chapter, I want to explain what classical statistics does, why it is the way it is, and what it is good for. This chapter does not attempt to teach you how to perform any of the typical statistical methods: this would require a separate book. (I will make some recommendations for further reading on this topic at the end of this chapter.) Instead, in this chapter I will tell you what all these other books omit.

Let me take you on a trip. I hope you know where your towel is.

Genesis

To understand classical statistics, it is necessary to realize how it came about.
The basic statistical methods that we know today were developed in the late 19th and early 20th centuries, mostly in Great Britain, by a very small group of people. Of those, one worked for the Guinness brewing company and another—the most influential one of them—worked at an agricultural research lab (trying to increase crop yields and the like). This bit of historical context tells us something about their working conditions and primary challenges.

No computational capabilities
All computations had to be performed with paper and pencil.

No graphing capabilities, either
All graphs had to be generated with pencil, paper, and a ruler. (And complicated graphs—such as those requiring prior transformations or calculations using the data—were especially cumbersome.)

Very small and very expensive data sets
Data sets were small (often not more than four to five points) and could be obtained only with great difficulty. (When it always takes a full growing season to generate a new data set, you try very hard to make do with the data you already have!)

In other words, their situation was almost entirely the opposite of our situation today:

• Computational power that is essentially free (within reason)
• Interactive graphing and visualization capabilities on every desktop
• Often huge amounts of data

It should therefore come as no surprise that the methods developed by those early researchers seem so out of place to us: they spent a great amount of effort and ingenuity solving problems we simply no longer have! This realization goes a long way toward explaining why classical statistics is the way it is and why it often seems so strange to us today.

By contrast, modern statistics is very different.

*I am not alone—even professional statisticians have the same experience. See, for example, the preface of Bayesian Statistics. Peter M. Lee. Hodder & Arnold. 2004.
It places greater emphasis on nonparametric methods and Bayesian reasoning, and it leverages current computational capabilities through simulation and resampling methods. The book by Larry Wasserman (see the recommended reading at the end of this chapter) provides an overview of a more contemporary point of view.

However, almost all introductory statistics books—that is, those books one is likely to pick up as a beginner—continue to limit themselves to the same selection of slightly stale topics. Why is that? I believe it is a combination of institutional inertia together with the expectations of the "end-user" community. Statistics has always been a support science for other fields: originally agriculture but also medicine, psychology, sociology, and others. And these fields, which merely apply statistics but are not engaged in actively developing it themselves, continue to operate largely using classical methods. However, the machine-learning community—with its roots in computer science but great demand for statistical methods—provides a welcome push for the widespread adoption of more modern methods.

Keep this historical perspective in mind as we take a closer look at statistics in the rest of this chapter.

Statistics Defined

All of statistics deals with the following scenario: we have a population—that is, the set of all possible outcomes. Typically, this set is large: all male U.S. citizens, for example, or all possible web-server response times. Rather than dealing with the total population (which might be impossible, infeasible, or merely inconvenient), we instead work with a sample. A sample is a subset of the total population that is chosen so as to be representative of the overall population. Now we may ask: what conclusions about the overall population can we draw given one specific sample?
It is this particular question that classical statistics answers via a process known as statistical inference: properties of the population are inferred from properties of a sample.

Intuitively, we do this kind of thing all the time. For example, given the heights of five men (let's say 178 cm, 180 cm, 179 cm, 178 cm, and 180 cm), we are immediately comfortable calculating the average (which is 179 cm) and concluding that the "typical" body size for all men in the population (not just the five in the sample!) is 179 cm, "more or less." This is where formal classical statistics comes in: it provides us with a way of making the vague "more or less" statement precise and quantitative. Given the sample, statistical reasoning allows us to make specific statements about the population, such as, "We expect x percent of men to be between y and z cm tall," or, "We expect fewer than x percent of all men to be taller than y cm," and so on.

Classical statistics is mostly concerned with two procedures: parameter estimation (or "estimation" for short) and hypothesis testing.

Parameter estimation works as follows. We assume that the population is described by some distribution—for example, the Gaussian:

    N(x; μ, σ) = 1/(√(2π) σ) exp( -(1/2) ((x - μ)/σ)² )

and we seek to estimate values for the parameters (μ and σ in this case) from a sample. Note that once we have estimates for the parameters, the distribution describing the population is fully determined, and we can (at least in principle) calculate any desired property of the population directly from that distribution. Parameter estimation comes in two flavors: point estimation and interval estimation. The first just gives us a specific value for the parameter, whereas the second gives us a range of values that is supposed to contain the true value.

Compared with parameter estimation, hypothesis testing is the weirder of the two procedures.
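The heights example makes point estimation concrete. The following sketch (my own illustration, not the book's code) estimates μ by the sample mean and σ by the sample standard deviation:

```python
# Point estimation for the Gaussian parameters, using the five-man
# height sample from the text. A minimal sketch in plain Python.
import math

heights = [178, 180, 179, 178, 180]   # sample, in cm

n = len(heights)
mu_hat = sum(heights) / n             # point estimate of mu: the sample mean
var_hat = sum((x - mu_hat) ** 2 for x in heights) / (n - 1)
sigma_hat = math.sqrt(var_hat)        # point estimate of sigma

print(mu_hat, sigma_hat)              # 179.0 1.0
```

Dividing by n − 1 rather than n gives the usual unbiased estimate of the variance; for a sample this small the distinction matters.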
It does not attempt to quantify the size of an effect; it merely tries to determine whether there is any effect at all. Note well that this is a largely theoretical argument; from a practical point of view, the existence of an effect cannot be separated entirely from its size. We will come back to this point later, but first let's understand how hypothesis testing works.

Suppose we have developed a new fertilizer but don't know yet whether it actually works. Now we run an experiment: we divide a plot of land in two and treat the crops on half of the plot with the new fertilizer. Finally, we compare the yields: are they different? The specific amounts of the yield will almost surely differ, but is this difference due to the treatment or is it merely a chance fluctuation? Hypothesis testing helps us decide how large the difference needs to be in order to be statistically significant.

Formal hypothesis testing now proceeds as follows. First we set up the two hypotheses between which we want to decide: the null hypothesis (no effect; that is, there is no difference between the two experiments) and the alternate hypothesis (there is an effect, so that the two experiments have significantly different outcomes). If the difference between the outcomes of the two experiments is statistically significant, then we have sufficient evidence to "reject the null hypothesis"; otherwise we "fail to reject the null hypothesis." In other words: if the outcomes are not sufficiently different, then we retain the null hypothesis that there is no effect.

This convoluted, indirect line of reasoning is required because, strictly speaking, no hypothesis can ever be proved correct by empirical means. If we find evidence against a hypothesis, then we can surely reject it.
But if we don't find evidence against the hypothesis, then we retain the hypothesis—at least until we do find evidence against it (which may possibly never happen, in which case we retain the hypothesis indefinitely). This, then, is the process by which hypothesis testing proceeds: because we can never prove that a treatment was successful, we instead invent a contradicting statement that we can prove to be false. The price we pay for this double negative ("it's not true that there is no effect") is that the test results mean exactly the opposite of what they seem to be saying: "retaining the null hypothesis," which sounds like a success, means that the treatment had no effect; whereas "rejecting the null hypothesis" means that the treatment did work.

This is the first problem with hypothesis testing: it involves a convoluted, indirect line of reasoning and a terminology that seems to be saying the exact opposite of what it means. But there is another problem with hypothesis testing: it makes a statement that has almost no practical meaning! In reducing the outcome of an experiment to the Boolean choice between "significant" and "not significant," it creates an artificial dichotomy that is not an appropriate view of reality. Experimental outcomes are not either strictly significant or strictly nonsignificant: they form a continuum. In order to judge the results of an experiment, we need to know where along the continuum the experimental outcome falls and how robust the estimate is. If we have this information, we can decide how to interpret the experimental result and what importance to attach to it.

Classical hypothesis testing exhibits two well-known traps.
The first is that an experimental outcome that is marginally outside the statistical significance level abruptly changes the interpretation of the experiment from "significant" to "not significant"—a discontinuity in interpretation that is not borne out by the minimal change in the actual outcome of the experiment. The other problem is that almost any effect, no matter how small, can be made "significant" by increasing the sample size. This can lead to "statistically significant" results that nevertheless are too small to be of any practical importance.

All of this is compounded by the arbitrariness of the chosen "significance level" (typically 5 percent). Why not 4.99 percent? Or 1 percent, or 0.1 percent? This seems to render the whole hypothesis testing machinery (at least as generally practiced) fundamentally inconsistent: on the one hand, we introduce an absolutely sharp cutoff into our interpretation of reality; and on the other hand, we choose the position of this cutoff in an arbitrary manner. This does not seem right.

(There is a third trap: at the 5 percent significance level, you can expect 1 out of 20 tests to give the wrong result. This means that if you run enough tests, you will always find one that supports whatever conclusion you want to draw. This practice is known as data dredging and is strongly frowned upon.)

Moreover, in any practical situation, the actual size of the effect is so much more important than its sheer existence. For this reason, hypothesis testing often simply misses the point. A project I recently worked on provides an example of this. The question arose as to whether two events were statistically independent (this is a form of hypothesis testing). But, for the decision that was ultimately made, it did not matter whether the events truly were independent (they were not) but that treating them as independent made no measurable difference to the company's balance sheet.
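The "1 out of 20" trap is easy to demonstrate by simulation. The sketch below (the parameters—100 flips per test, 2,000 tests—are my own illustrative choices) repeatedly tests a perfectly fair coin at the 5 percent level; a fraction of the tests comes out "significant" by chance alone:

```python
# Testing a *fair* coin many times at the 5 percent level: some tests
# reject the (true) null hypothesis purely by chance.
import random
from math import comb

random.seed(1)

N = 100   # flips per test

def p_value(k):
    """Two-sided p-value for k heads in N flips of a fair coin."""
    extreme = abs(k - N / 2)
    return sum(comb(N, i) for i in range(N + 1)
               if abs(i - N / 2) >= extreme) / 2 ** N

tests, rejections = 2000, 0
for _ in range(tests):
    heads = sum(random.random() < 0.5 for _ in range(N))
    if p_value(heads) < 0.05:
        rejections += 1

# A few percent of "significant" results, although nothing is going on.
# (Somewhat below 5 percent here, because the binomial is discrete.)
print(rejections / tests)
```

Run enough of these tests and some "discovery" is guaranteed—which is exactly why data dredging is frowned upon.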
Hypothesis testing has its place but typically in rather abstract or theoretical situations where the mere existence of an effect constitutes an important discovery ("Is this coin loaded?" "Are people more likely to die a few days after their birthdays than before?"). If this describes your situation, then you will quite naturally employ hypothesis tests. However, if the size of an effect is of interest to you, then you should feel free to ignore tests altogether and instead work out an estimate of the effect—including its confidence interval. This will give you the information that you need. You are not "doing it wrong" just because you haven't performed a significance test somewhere along the way.

Finally, I'd like to point out that the statistics community itself has become uneasy with the emphasis that is placed on tests in some fields (notably medicine but also the social sciences). Historically, hypothesis testing was invented to deal with sample sizes so small (possibly containing only four or five events) that drawing any conclusion at all was a challenge. In such cases, the broad distinction between "effect" and "no effect" was about the best one could do. If interval estimates are available, there is no reason to use statistical tests. The Wikipedia entry on p-values (explained below) provides some starting points to the controversy.

I have devoted quite a bit of space to a topic that may not seem especially relevant. However, hypothesis tests loom so large in introductory statistics books and courses and, at the same time, are so obscure and counterintuitive, that I found it important to provide some background. In the next section, we will take a more detailed look at some of the concepts and terminology that you are likely to find in introductory (or not-so-introductory) statistics books and courses.
Statistics Explained

In Chapter 9, we already encountered several well-known probability distributions, including the binomial (used for trials resulting in Success or Failure), the Poisson (applicable in situations where events are evenly distributed according to some density), and the ubiquitous Normal, or Gaussian, distribution. All of these distributions describe real-world, observable phenomena.

In addition, classical statistics uses several distributions that describe the distribution of certain quantities that are not observed but calculated. These distributions are not (or not usually) used to describe events in the real world. Instead, they describe how the outcomes of specific typical calculations involving random quantities will be distributed. There are four of these distributions, and they are known as sampling distributions.

The first of these (and the only one having much use outside of theoretical statistics) is the Gaussian distribution. As a sampling distribution, it is of interest because we already know that it describes the distribution of a sum of independent, identically distributed random variables. In other words, if X1, X2, ..., Xn are random variables, then Z = X1 + X2 + ··· + Xn will be normally distributed, and (because we can divide by a constant) the average m = (X1 + X2 + ··· + Xn)/n will also follow a Gaussian. It is this last property that makes the Gaussian important as a sampling distribution: it describes the distribution of averages. One caveat: to arrive at a closed formula for the Gaussian, we need to know the variance (i.e., the width) of the distribution from which the individual Xi are drawn. For most practical situations this is not a realistic requirement, and in a moment we will discuss what to do if the variance is not known.

The second sampling distribution is the chi-square (χ²) distribution. It describes the distribution of the sum of squares of independent, identically distributed Gaussian random variables.
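A quick simulation (my own illustration, not from the book) shows the Gaussian at work as a sampling distribution: averages of uniform random numbers cluster around the population mean 0.5, with a spread close to the theoretical σ/√n = √(1/12)/√n.

```python
# The distribution of averages: 5,000 averages of n = 100 uniform draws.
# The uniform population has mean 0.5 and variance 1/12, so the averages
# should spread out by about sqrt(1/12)/sqrt(100), roughly 0.0289.
import math
import random

random.seed(42)

n, trials = 100, 5000
averages = [sum(random.random() for _ in range(n)) / n
            for _ in range(trials)]

mean_avg = sum(averages) / trials
spread = math.sqrt(sum((a - mean_avg) ** 2 for a in averages) / trials)

print(mean_avg)   # close to 0.5
print(spread)     # close to 0.0289
```

A histogram of `averages` would show the familiar bell shape, even though the underlying draws are uniform, not Gaussian.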
Thus, if X1, X2, ..., Xn are Gaussian random variables with unit variance, then

    U = X1² + X2² + ··· + Xn²

will follow a chi-square distribution. Why should we care? Because we form this kind of sum every time we calculate the variance. (Recall that the variance is defined as (1/n) Σ (xi − m)².) Hence, the chi-square distribution is used to describe the distribution of variances. The number n of elements in the sum is referred to as the number of degrees of freedom of the chi-square distribution, and it is an additional parameter we need to know to evaluate the distribution numerically.

The third sampling distribution describes the behavior of the ratio T of a normally (Gaussian) distributed random variable Z and a chi-square-distributed random variable U. This distribution is the famous Student t distribution. Specifically, let Z be distributed according to the standard Gaussian distribution and U according to the chi-square distribution with n degrees of freedom. Then

    T = Z / √(U/n)

is distributed according to the t distribution with n degrees of freedom. As it turns out, this is the correct formula to use for the distribution of the average if the variance is not known but has to be determined from the sample together with the average.

The t distribution is a symmetric, bell-shaped curve like the Gaussian but with fatter tails. How fat the tails are depends on the number of degrees of freedom (i.e., on the number of data points in the sample). As the number of degrees of freedom increases, the t distribution becomes more and more like the Gaussian. In fact, for n larger than about 30, the differences between them are negligible. This is an important point to keep in mind: the distinction between the t distribution and the Gaussian matters only for small samples—that is, samples containing fewer than approximately 30 data points.
For larger samples, it is all right to use the Gaussian instead of the t distribution.

The last of the four sampling distributions is Fisher's F distribution, which describes the behavior of the ratio of two chi-square random variables. We care about this when we want to compare two variances against each other (e.g., to test whether they are equal or not).

These are the four sampling distributions of classical statistics. I will neither trouble you with the formulas for these distributions, nor show you their graphs—you can find them in every statistics book. What is important here is to understand what they are describing and why they are important. In short, if you have n independent but identically distributed measurements, then the sampling distributions describe how the average, the variance, and their ratios will be distributed. The sampling distributions therefore allow us to reason about averages and variances. That's why they are important and why statistics books spend so much time on them.

One way to use the sampling distribution is to construct confidence intervals for an estimate. Here is how it works. Suppose we have n observations. We can find the average and variance of these measurements as well as the ratio of the two. Finally, we know that the ratio is distributed according to the t distribution. Hence we can find the interval that has a 95 percent probability of containing the true value (see Figure 10-1). The boundaries of this range are the 95 percent confidence interval; that is, we expect the true value to fall outside this confidence range in only 1 out of 20 cases.
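In code, the recipe looks like this. A sketch of my own, reusing the five-height sample from earlier in the chapter; the critical value t = 2.776 (97.5th percentile for 4 degrees of freedom) is taken from a standard t table:

```python
# A 95 percent confidence interval for the mean of a small sample,
# using the t distribution as described above.
import math

sample = [178, 180, 179, 178, 180]    # heights in cm

n = len(sample)
m = sum(sample) / n                                         # sample mean
s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))  # sample std dev

t_crit = 2.776                 # t distribution, 97.5th percentile, 4 dof
half_width = t_crit * s / math.sqrt(n)

print(m - half_width, m + half_width)   # roughly 177.76 to 180.24
```

Note how the t critical value (2.776) is noticeably larger than the Gaussian one (1.96): with only five points, the interval must be wider to account for the uncertainty in s itself.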
A similar concept can be applied to hypothesis testing, where sampling distributions are often used to calculate so-called p-values. A p-value is an attempt to express the strength of the evidence in a hypothesis test and, in so doing, to soften the sharp binary distinction between significant and not significant outcomes mentioned earlier.

FIGURE 10-1. The shaded area contains 95 percent of the area under the curve; the boundaries of the shaded region are the bounds on the 95 percent confidence interval.

A p-value is the probability of obtaining a value as (or more) extreme than the one actually observed under the assumption that the null hypothesis is true (see Figure 10-2). In other words, if the null hypothesis is that there is no effect, and if the observed effect size is x, then the p-value is the probability of observing an effect at least as large as x. Obviously, a large effect is improbable (small p-value) if the null hypothesis (zero effect) is true; hence a small p-value is considered strong evidence against the null hypothesis.

However, a p-value is not "the probability that the null hypothesis is true"—such an interpretation (although appealing!) is incorrect. The p-value is the probability of obtaining an effect as large or larger than the observed one if the null hypothesis is true. (Classical statistics does not make probability statements about the truth of hypotheses. Doing so would put us into the realm of Bayesian statistics, a topic we will discuss toward the end of this chapter.)

By the way, if you are thinking that this approach to hypothesis testing—with its sliding p-values—is quite different from the cut-and-dried significant–not significant approach discussed earlier, then you are right.
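For a Gaussian null distribution, the tail area in Figure 10-2 can be computed directly with the complementary error function. A small sketch of my own (not from the book):

```python
# One-sided p-value under a Gaussian null hypothesis: the probability
# of observing a value at least z standard deviations above the mean.
import math

def p_value(z):
    """P(Z >= z) for a standard Gaussian, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(round(p_value(1.0), 3))   # 0.159 -- unremarkable
print(round(p_value(2.0), 3))   # 0.023 -- "significant" at the 5 percent level
```

The function returns a sliding measure of evidence, not a yes/no verdict—which is precisely the point of reporting p-values.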
Historically, two competing theories of significance tests have been developed and have generated quite a bit of controversy; even today they sit a little awkwardly next to each other. (The approach based on sliding p-values that need to be interpreted by the researcher is due to Fisher; the decision-rule approach was developed by Pearson and Neyman.) But enough, already. You can consult any statistics book if you want to know more details.

FIGURE 10-2. The p-value is the probability of observing a value as large or larger than the one actually observed if the null hypothesis is true.

Example: Formal Tests Versus Graphical Methods

Historically, classical statistics evolved as it did because working with actual data was hard. The early statisticians therefore made a number of simplifying assumptions (mostly that data would be normally distributed) and then proceeded to develop mathematical tools (such as the sampling distributions introduced earlier in the chapter) that allowed them to reason about data sets in a general way and required only the knowledge of a few, easily calculated summary statistics (such as the mean). The ingenuity of it all is amazing, but it has led to an emphasis on formal technicalities as opposed to direct insight into the data. Today our situation is different, and we should take full advantage of that. An example will demonstrate what I mean.

The listing below shows two data sets, one per column. Are they the same, or are they different (in the sense that their means are the same or different)?*

    0.209    0.225
    0.205    0.262
    0.196    0.217
    0.210    0.240
    0.202    0.230
    0.207    0.229
    0.224    0.235
    0.223    0.217
    0.220
    0.201

*This is a famous data set with a history that is colorful but not really relevant here. A Web search for "Quintus Curtius Snodgrass" will turn up plenty of references.
In case study 9.2.1 of their book, Larsen and Marx (see the recommended reading at the end of this chapter) labor for several pages and finally conclude that the data sets are different at the 99 percent level of significance. Figure 10-3 shows a box plot for each of the data sets. Case closed.

FIGURE 10-3. Box-and-whisker plots of the two Quintus Curtius Snodgrass data sets. There is almost no overlap between the two.

(In fairness, the formal test does something that a graphical method cannot do: it gives us a quantitative criterion by which to make a decision. I hope that the discussion in this chapter has convinced you that this is not always an advantage, because it can lead to blind faith in "the number." Graphical methods require you to interpret the results and take responsibility for the conclusions. Which is why I like them: they keep you honest!)

Controlled Experiments Versus Observational Studies

Besides the machinery of formal statistical inference (using the sampling distributions just discussed), the early statistics pioneers also developed a general theory of how best to undertake statistical studies. This conceptual framework is sometimes known as Design of Experiment and is worth knowing about—not least because so much of typical data mining activity does not make use of it.

The most important distinction formalized by the Design of Experiment theory is the one between an observational study and a controlled experiment. As the name implies, a controlled experiment allows us to control many aspects of the experimental setup and procedure; in particular, we control which treatment is applied to which experimental unit (we will define these terms shortly).
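A quick numerical look supports the graphical argument. In this sketch of mine, the listed values are grouped into the two Quintus Curtius Snodgrass samples (following the published form of the data set), and we print a minimal summary of each:

```python
# Median and range of the two Quintus Curtius Snodgrass samples:
# the medians sit far apart relative to each sample's spread,
# which is exactly what the box plot shows at a glance.
from statistics import median

set_a = [0.209, 0.205, 0.196, 0.210, 0.202, 0.207,
         0.224, 0.223, 0.220, 0.201]
set_b = [0.225, 0.262, 0.217, 0.240, 0.230, 0.229, 0.235, 0.217]

for name, data in (("first", set_a), ("second", set_b)):
    print(name, min(data), round(median(data), 4), max(data))
```

The medians (0.208 versus 0.2295) are separated by more than each sample's interquartile spread—the same conclusion Larsen and Marx reach after several pages of formal testing.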
For example, in an agricultural experiment, we would treat some (but not all) of the plots with a new fertilizer and then later compare the yields from the two treatment groups. In contrast, with an observational study, we merely collect data as it becomes (or already is) available. In particular, retrospective studies are always observational (not controlled).

In a controlled experiment, we are able to control the "input" of an experiment (namely, the application of a treatment) and therefore can draw much more powerful conclusions from the output. In contrast to observational studies, a properly conducted controlled experiment can provide strong support for cause-and-effect relationships between two observations and can be used to rule out hidden (or confounding) causes. Observational studies can merely suggest the existence of a relationship between two observations; however, they can neither prove that one observation is caused by the other nor rule out that additional (unobserved) factors have played a role.

The following (intentionally whimsical) example will serve to make the point. Let's say we have data that suggests that cities with many lawyers also have many espresso stands and that cities with few lawyers have few espresso stands. In other words, there is strong correlation between the two quantities. But what conclusions can we draw about the causal relationship between the two? Are lawyers particularly high consumers of expensive coffee? Or does caffeine make people more litigious? In short, there is no way for us to determine what is cause and what is effect in this example. In contrast, if the fertilized yields in the controlled agricultural experiment are higher than the yields from the untreated control plots, we have strong reason to conclude that this effect is due to the fertilizer treatment.
In addition to the desire to establish that the treatment indeed causes the effect, we also want to rule out the possibility of additional, unobserved factors that might account for the observed effect. Such factors, which influence the outcome of a study but are not themselves part of it, are known as confounding (or "hidden" or "lurking") variables. In our agricultural example, differences in soil quality might have a significant influence on the yield—perhaps a greater influence than the fertilizer. The spurious correlation between the number of lawyers and espresso stands is almost certainly due to confounding: larger cities have more of everything! (Even if we account for this effect and consider the per capita density of lawyers and espresso stands, there is still a plausible confounding factor: the income generated per head in the city.) In the next section, we will discuss how randomization can help to remove the effect of confounding variables.

The distinction between controlled experiments and observational studies is most critical. Many of the most controversial scientific or statistical issues involve observational studies. In particular, reports in the mass media often concern studies that (inappropriately) draw causal inferences from observational studies (about topics such as the relationship between gun laws and homicide rates, for example). Sometimes controlled experiments are not possible, with the result that it becomes almost impossible to settle certain questions once and for all. (The controversy around the connection between smoking and lung cancer is a good example.) In any case, make sure you understand clearly the difference between controlled and observational studies, as well as the fundamental limitations of the latter!
Design of Experiments

In a controlled experiment, we divide the experimental units that constitute our sample into two or more groups and then apply different treatments or treatment levels to the units in each group. In our agricultural example, the plots correspond to the experimental units, fertilization is the treatment, and the options "fertilizer" and "no fertilizer" are the treatment levels.

Experimental design involves several techniques to improve the quality and reliability of any conclusions drawn from a controlled experiment.

Randomization
Randomization means that treatments (or treatment levels) are assigned to experimental units in a random fashion. Proper randomization suppresses systematic errors. (If we assign fertilizer treatment randomly to plots, then we remove the systematic influence of soil quality, which might otherwise be a confounding factor, because high-quality and low-quality plots are now equally likely to receive the fertilizer treatment.) Achieving true randomization is not as easy as it looks—I'll come back to this point shortly.

Replication
Replication means that the same treatment is applied to more than one experimental unit. Replication serves to reduce the variability of the results by averaging over a larger sample. Replicates should be independent of each other, since nothing is gained by repeating the same experiment on the same unit multiple times.

Blocking
We sometimes know (or at least strongly suspect) that not all experimental units are equal. In this case, it may make sense to group equivalent experimental units into "blocks" and then to treat each such block as a separate sample. For example, if we know that plots A and C have poor soil quality and that B and D have better soil, then we would form two blocks—consisting of (A, C) and (B, D), respectively—before proceeding to make a randomized assignment of treatments for each block separately.
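Randomization within blocks is straightforward to do properly in code. A sketch of my own, using the plot names and blocks from the example above:

```python
# Blocked randomization: within each block of comparable plots
# ((A, C) with poor soil, (B, D) with better soil), assign the two
# treatment levels at random. The fixed seed is only for reproducibility.
import random

random.seed(7)

blocks = [["A", "C"], ["B", "D"]]
assignment = {}
for block in blocks:
    levels = ["fertilizer", "no fertilizer"]
    random.shuffle(levels)          # the random step happens here
    assignment.update(zip(block, levels))

print(assignment)
# Each block ends up with exactly one fertilized and one untreated plot.
```

The point is that the random number generator, not a human "picking at random," decides the assignment—and it decides it ahead of time.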
Similarly, if we know that web traffic is drastically different in the morning and the afternoon, we should collect and analyze data for both time periods separately. This also is a form of blocking.

Factorization
The last of these techniques applies only to experiments involving several treatments (e.g., irrigation and fertilization, to stay within our agricultural framework). The simplest experimental design would make only a single change at any given time, so that we would observe yields with and without irrigation as well as with and without fertilizer. But this approach misses the possibility that there are interactions between the two treatments—for example, the effect of the fertilizer may be significantly higher when coupled with improved irrigation. Therefore, in a factorial experiment all possible combinations of treatment levels are tried. Even if a fully factorial experiment is not possible (the number of combinations goes up quickly as the number of different treatments grows), there are rules for how best to select combinations of treatment levels for drawing optimal conclusions from the study.

Another term you may come across in this context is ANOVA (analysis of variance), which is a standard way of summarizing results from controlled experiments. It emphasizes the variations within each treatment group for easy comparison with the variances between the treatments, so that we can determine whether the differences between treatments are significant compared to the variation within each treatment group. ANOVA is a clever bookkeeping technique, but it does not introduce particularly noteworthy new statistical concepts.

A word of warning: when conducting a controlled experiment, make sure that you apply the techniques properly; in particular, beware of pseudo-randomization and pseudo-replication.
Pseudo-randomization occurs if the assignment of treatments to experimental units is not truly random. This can happen quite easily, even if the assignment seems to be random. For example, if you would like to try out two different drugs on lab rats, it is not sufficient to "pick a rat at random" from the cage to administer the treatment. What does "at random" mean? It might very well mean picking the most active rat first, because it comes to the cage door. Or maybe the least aggressive-looking one. In either case, there is a systematic bias!

Here is another example, perhaps closer to home: the web lab. Two different site designs are to be presented to viewers, and the objective is to measure conversion rate or click-throughs or some other metric. There are multiple servers, so we dedicate one of them (chosen "at random") to serve the pages with the new design. What's wrong with that? Everything! Do you have any indication that web requests are assigned to servers in a random fashion? Or might servers have, for example, a strong geographic bias? Let's assume the servers are behind some "big-IP" box that routes requests to the servers. How is the routing conducted—randomly, round-robin, or based on traffic intensity? Is the routing smart, so that servers with slower response times get fewer hits? What about sticky sessions, and what about the relationship between sticky sessions and slower response times? Is the router reordering the incoming requests in some way? That's a lot of questions—questions that randomization is intended to avoid. In fact, you are not running a controlled experiment at all: you are conducting an observational study!

The only way that I know to run a controlled experiment is by deciding ahead of time which experimental unit will receive which treatment.
In the lab rat example, the rats should have been labeled and then treatments assigned to the labels using a (reliable) random number generator or random table. In the web-server example it is harder to achieve true randomization, because the experimental units are not known ahead of time. A simple rule (e.g., show the new design to every nth request) won't work, because there may be significant correlation between subsequent requests. It's not so easy.

Pseudo-replication occurs when experimental units are not truly independent. Injecting the same rat five times with the same drug does not reduce variability! Similarly, running the same query against a database could be misleading because of changing cache utilization. And so on. In my experience, pseudo-replication is easier to spot and hence tends to be less of a problem than pseudo-randomization.

Finally, I should mention one other concept that often comes up in the context of proper experimental process: blind and double-blind experiments. In a blind experiment, the experimental unit should not know which treatment it receives; in a double-blind experiment, the investigator—at the time of the experiment—does not know either. The purpose of blind and double-blind experiments is to prevent knowledge of the treatment level from becoming a confounding factor. If people know that they have been given a new drug, then this knowledge itself may contribute to their well-being. An investigator who knows which field is receiving the fertilizer might weed that particular field more vigorously and thereby introduce some invisible and unwanted bias. Blind experiments play a huge role in the medical field but can also be important in other contexts. However, I would like to emphasize that the question of "blindness" (which concerns the experimental procedure) is a different issue than the Design of Experiments prescriptions (which are intended to reduce statistical uncertainty).
Perspective

It is important to maintain an appropriate perspective on these matters. In practice, many studies are observational, not controlled. Occasionally, this is a painful loss, due only to the inability to conduct a proper controlled experiment (smoking and lung cancer, again!). Nevertheless, observational studies can be of great value: one reason is that they may be exploratory and discover new and previously unknown behavior. In contrast, controlled experiments are always confirmatory, deciding between the effectiveness or ineffectiveness of a specific "treatment."

Observational studies can be used to derive predictive models even while setting aside the question of causation. The machine-learning community, for instance, attempts to develop classification algorithms that use descriptive attributes or features of a unit to predict whether the unit belongs to a given class. They work entirely without controlled experiments and have developed methods for quantifying the accuracy of their results. (We will describe some in Chapter 18.)

That being said, it is important to understand the limitations of observational studies—in particular, their inability to support strong conclusions regarding cause-and-effect relationships and their inability to rule out confounding factors. In the end, the power of controlled experiments can be their limitation, because such experiments require a level of control that limits their application.

Optional: Bayesian Statistics—The Other Point of View

There is an alternative approach to statistics that is based on a different interpretation of the concept of probability itself. This may come as a surprise, since probability seems to be such a basic concept. The problem is that, although we have a very strong intuitive sense of what we mean by the word "probability," it is not so easy to give it a rigorous meaning that can be used to develop a mathematical theory.
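The intuitive notion of probability as a long-run frequency is easy to demonstrate by simulation (a minimal Python sketch, using nothing beyond the standard library; the seed is my choice, for reproducibility):

```python
import random

rng = random.Random(42)            # seeded so the run is reproducible
tosses = 100_000
# Count "Heads" in many simulated tosses of a fair coin.
heads = sum(rng.random() < 0.5 for _ in range(tosses))
frequency = heads / tosses         # settles near 1/2 as tosses grows
```

With 100,000 tosses, the observed frequency typically lies within a few tenths of a percent of 1/2, which is exactly the "limiting frequency" picture discussed next.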
The interpretation of probability used by classical statistics (and, to some degree, by abstract probability theory) treats probability as a limiting frequency: if you toss a fair coin "a large number of times," then you will obtain Heads about half of the time; hence the probability of Heads is 1/2. Arguments and theories starting from this interpretation are often referred to as "frequentist."

An alternative interpretation of probability views it as the degree of our ignorance about an outcome: since we don't know which side will be on top in the next toss of a fair coin, we assign each possible outcome the same probability—namely 1/2. We can therefore make statements about the probabilities associated with individual events without having to invoke the notion of a large number of repeated trials. Because this approach to probability and statistics makes use of Bayes' theorem at a central step in its reasoning, it is usually called Bayesian statistics; it has become increasingly popular in recent years. Let's compare the two interpretations in a bit more detail.

The Frequentist Interpretation of Probability

In the frequentist interpretation, probability is viewed as the limiting frequency of each outcome of an experiment that is repeated a large number of times. This interpretation is the reason for some of the peculiarities of classical statistics. For example, in classical statistics it is incorrect to say that a 95 percent confidence interval for some parameter has a 95 percent chance of containing the true value—after all, the true value is either contained in the interval or not; period. The only statement we can make is that, if we perform the experiment to measure this parameter many times, then in about 95 percent of all cases the resulting 95 percent confidence interval will contain the true value of the parameter. This type of reasoning has a number of drawbacks.
• It is awkward and clumsy, and liable to (possibly even unconscious) misinterpretation.

• The constant appeal to a "large number of trials" is artificial even in situations where such a sequence of trials would—at least in principle—be possible (such as tossing a coin). But it becomes wholly fictitious in situations where the trial cannot possibly be repeated. The weather report may state: "There is an 80 percent chance of rain tomorrow." What is that supposed to mean? It is either going to rain tomorrow or not! Hence we must again invoke the unlimited sequence of trials and say that in 8 out of 10 cases where we observe the current meteorological conditions, we expect rain on the following day. But even this argument is illusory, because we will never observe these precise conditions ever again: that's what we have been learning from chaos theory and related fields.

• We would frequently like to make statements such as the one about the chance of rain, or similar ones—for example, "The patient has a 60 percent survival probability," and "I am 25 percent certain that the contract will be approved." In all such cases the actual outcome is not of a probabilistic nature: it will rain or it will not; the patient will survive or not; the contract will be approved or not. Even so, we'd like to express a degree of certainty about the expected outcome, even if appealing to an unlimited sequence of trials is neither practical nor even meaningful.

From a strictly frequentist point of view, a statement like "There is an 80 percent chance of rain tomorrow" is nonsensical. Nevertheless, it seems to make so much intuitive sense. In what way can this intuition be made more rigorous? This question leads us to Bayesian statistics or Bayesian reasoning.
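The frequentist statement about 95 percent confidence intervals (about 95 percent of repeated experiments produce an interval that covers the true value) can itself be checked by simulation. Here is a sketch under assumptions of my own choosing: normal samples with known standard deviation, so each interval is simply the sample mean plus or minus 1.96 standard errors.

```python
import math
import random

def ci_coverage(true_mu=0.0, sigma=1.0, n=50, trials=2000, seed=1):
    """Fraction of repeated experiments whose 95 percent confidence
    interval (the known-sigma z-interval) contains the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # One "experiment": draw n points and compute the sample mean.
        xbar = sum(rng.gauss(true_mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= true_mu <= xbar + half_width:
            hits += 1
    return hits / trials

coverage = ci_coverage()   # close to 0.95
```

Note what the simulation does and does not say: it is the procedure that succeeds in about 95 percent of repetitions; no probability attaches to any single interval.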
The Bayesian Interpretation of Probability

To understand the Bayesian point of view, we first need to review the concept of conditional probability. The conditional probability P(A|B) gives us the probability for the event A, given (or assuming) that event B has occurred. You can easily convince yourself that the following is true:

P(A|B) = P(A ∩ B) / P(B)

where P(A ∩ B) is the joint probability of finding both event A and event B.

For example, it is well known that men are much more likely than women to be color-blind: about 10 percent of men are color-blind but fewer than 1 percent of women are. These are conditional probabilities—that is, probabilities of being color-blind given the gender:

P(color-blind|male) = 0.1
P(color-blind|female) = 0.01

In contrast, if we "randomly" pick a person off the street, then we are dealing with the joint probability that this person is color-blind and male. The person has a 50 percent chance of being male and a 10 percent conditional probability of being color-blind, given that the person is male. Hence the joint probability for a random person to be color-blind and male is 0.5 × 0.1 = 5 percent, in agreement with the definition of conditional probability given previously.

One can now rigorously prove the following equality, which is known as Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)

In words: the probability of finding A given B is equal to the probability of finding B given A, multiplied by the probability of finding A and divided by the probability of finding B.

Now, let's return to statistics and data analysis. Assume there is some parameter that we attempt to determine through an experiment (say, the mass of the proton or the survival rate after surgery). We are now dealing with two "events": event B is the occurrence of the specific set of measurements that we have observed, and the parameter taking some specific value constitutes event A.
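As a quick numerical check of the color-blindness example (the probabilities are the illustrative ones from the text; equal numbers of men and women is an assumption):

```python
# Conditional probabilities from the example.
p_cb_given_m = 0.10     # P(color-blind | male)
p_cb_given_f = 0.01     # P(color-blind | female)
p_m = 0.50              # P(male), assumed

# Joint probability: color-blind AND male.
p_cb_and_m = p_cb_given_m * p_m                        # 0.05

# Total probability of color-blindness, then Bayes' theorem
# for the reversed conditional P(male | color-blind).
p_cb = p_cb_given_m * p_m + p_cb_given_f * (1 - p_m)   # 0.055
p_m_given_cb = p_cb_given_m * p_m / p_cb
```

Note how Bayes' theorem reverses the conditioning: a randomly chosen color-blind person is male with probability 0.05/0.055, about 91 percent, even though the joint probability of being color-blind and male is only 5 percent.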
We can now rewrite Bayes’ theorem as follows: P(parameter|data) ∝ P(data|parameter)P(parameter) (I have dropped the denominator, which I can do because the denominator is simply a constant that does not depend on the parameter we wish to determine. The left- and righthand sides are now no longer equal, so I have replaced the equality sign with ∝ to indicate that the two sides of the expression are merely proportional: equal to within a numerical constant.) Let’s look at this equation term by term. On the lefthand side, we have the probability of ﬁnding a certain value for the parameter, given the data. That’s pretty exciting, because this is an expression that makes an explicit statement about the probability of an event (in this case, that the parameter has a certain value), given the data. This probability is called the posterior probability, or simply the posterior, and is deﬁned solely through Bayes’ theorem without reference to any unlimited sequence of trials. Instead, it is a measure of our “belief” or “certainty” about the outcome (i.e., the value of the parameter) given the data. The ﬁrst term on the righthand side, P(data|parameter), is known as the likelihood function. This is a mathematical expression that links the parameter to the probability of obtaining speciﬁc data points in an actual experiment. The likelihood function constitutes our “model” for the system under consideration: it tells us what data we can expect to observe, given a particular value of the parameter. (The example in the next section will help to clarify the meaning of this term.) Finally, the term P(parameter) is known as the prior probability, or simply the prior, and captures our “prior” (prior to the experiment) belief of ﬁnding a certain outcome—speciﬁcally our prior belief that the parameter has a certain value. 
It is the existence of this prior that makes the Bayesian approach so controversial, because it seems to introduce an inappropriately subjective element into the analysis. In reality, however, the influence of the prior on the final result of the analysis is typically small, in particular when there is plenty of data. One can also find so-called "noninformative" priors that express our complete ignorance about the possible outcomes. But the prior is there, and it forces us to think about our assumptions regarding the experiment and to state some of these assumptions explicitly (in the form of the prior distribution function).

Bayesian Data Analysis: A Worked Example

All of this will become much clearer once we demonstrate these concepts in an actual example. The example is very simple, so as not to distract from the concepts. Assume we have a coin that has been tossed 10 times, producing the following set of outcomes (H for Heads, T for Tails):

THHHHTTHHH

If you count the outcomes, you will find that we obtained 7 Heads and 3 Tails in 10 tosses of the coin. Given this data, we would like to determine whether the coin is fair. Specifically, we would like to determine the probability p that a toss of this coin will come up Heads. (This is the "parameter" we would like to estimate.) If the coin is fair, then p should be close to 1/2.

Let's write down Bayes' equation, adapted to this system:

P(p|{THHHHTTHHH}) ∝ P({THHHHTTHHH}|p) P(p)

Notice that at this point the problem has become parametric. All that is left to do is to determine the value of the parameter p or, more precisely, the posterior probability distribution for all values of p. To make progress, we need to supply the likelihood function and the prior. Given this system, the likelihood function is particularly simple: P(H|p) = p and P(T|p) = 1 − p.
You should convince yourself that this choice of likelihood function gives us exactly what we want: the probability of obtaining Heads or Tails, given p. We also assume that the tosses are independent, which implies that only the total number of Heads or Tails matters, not the order in which they occurred. Hence we don't need to find the combined likelihood for the specific sequence of 10 tosses; instead, the likelihood of the set of events is simply the product of the likelihoods of the 10 individual tosses. (The likelihood "factors" for independent events—this argument occurs frequently in Bayesian analysis.)

FIGURE 10-4. The (unnormalized) posterior probability of obtaining 7 Heads in 10 tosses of a coin, as a function of p.

Finally, we know nothing about this coin. In particular, we have no reason to believe that any value of p is more likely than any other, so we choose as prior probability distribution the "flat" distribution P(p) = 1 for all p. Collecting everything, we end up with the following expression (where I have dropped some combinatorial factors that do not depend on p):

P(p|{7 Heads, 3 Tails}) ∝ p^7 (1 − p)^3

This is the posterior probability distribution for the parameter p, based on the experimental data (see Figure 10-4). We can see that it has a peak near p = 0.7, which is the most probable value for p. Note the absence of tick marks on the y axis in Figure 10-4: the denominator, which we dropped earlier, is still undetermined, and therefore the overall scale of the function is not yet fixed. If we are interested only in the location of the maximum, this does not matter. But we are not restricted to a single (point) estimate for p—the entire distribution function is available to us! We can now use it to construct confidence intervals for p.
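This posterior is simple enough to evaluate numerically on a grid (a minimal sketch; the grid resolution and function names are my own):

```python
def post(p, heads=7, tails=3):
    # Unnormalized posterior with a flat prior: p^heads * (1 - p)^tails.
    return p**heads * (1 - p)**tails

grid = [i / 1000 for i in range(1001)]   # p from 0 to 1 in steps of 0.001
mode = max(grid, key=post)               # most probable value of p
weights = [post(p) for p in grid]
# Normalizing "after the fact": divide by the sum of the weights.
mean = sum(p * w for p, w in zip(grid, weights)) / sum(weights)
```

The mode comes out at exactly p = 7/10, matching the peak in Figure 10-4; the normalized mean is slightly lower, about 2/3, because the distribution is skewed.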
And because we are now talking about Bayesian probabilities, it is legitimate to state that "the confidence interval has a 95 percent chance of containing the true value of p."

FIGURE 10-5. The (unnormalized) posterior probability of obtaining 70 percent Heads in 10 and in 30 tosses of a coin. The more data there is, the more strongly peaked the posterior distribution becomes.

We can also evaluate any function that depends on p by integrating it against the posterior distribution for p. As a particularly simple example, we could calculate the expectation value of p to obtain the single "best" estimate of p (rather than use the most probable value as we did before):

E[p] = ∫ p P(p|{7 Heads, 3 Tails}) dp / ∫ P(p|{7 Heads, 3 Tails}) dp

Here we finally need to worry about all the factors that we dropped along the way, and the denominator in the formula is our way of fixing the normalization "after the fact." To ensure that the probability distribution is properly normalized, we divide explicitly by the integral over the whole range of values, thereby guaranteeing that the total probability equals 1 (as it must).

It is interesting to look at the roles played by the likelihood and the prior in the result. In Bayesian analysis, the posterior "interpolates" between the prior and the data-based likelihood function. If there is only very little data, then the likelihood function will be relatively flat, and therefore the posterior will be more influenced by the prior. But as we collect more data (i.e., as the empirical evidence becomes stronger), the likelihood function becomes more and more narrowly peaked at the most likely value of p, regardless of the choice of prior. Figure 10-5 demonstrates this effect.
It shows the posterior for a total of 10 trials and for a total of 30 trials (while keeping the same ratio of Heads to Tails): as we gather more data, the uncertainty in the resulting posterior shrinks.

FIGURE 10-6. The effect of a nonflat prior: posterior probabilities for data sets of different sizes, calculated using a Gaussian prior.

Finally, Figure 10-6 demonstrates the effect of the prior. Whereas the posterior distributions shown in Figure 10-5 were calculated using a flat prior, those in Figure 10-6 were calculated using a Gaussian prior—which expresses a rather strong belief that the value of p lies between 0.35 and 0.65. The influence of this prior belief is rather significant for the smaller data set, but as we take more and more data points, its influence is increasingly diminished.

Bayesian Inference: Summary and Discussion

Let's summarize what we have learned about Bayesian data analysis or Bayesian inference and discuss what it can do for us—and what it can't. First of all, the Bayesian (as opposed to the frequentist) approach to inference allows us to compute a true probability distribution for any parameter in question. This has great intuitive appeal, because it allows us to make statements such as "There is a 90 percent chance of rain tomorrow" without having to appeal to the notion of extended trials of identical experiments. The posterior probability distribution arises as the product of the likelihood function and the prior. The likelihood links experimental results to values of the parameter, and the prior expresses our previous knowledge or belief about the parameter.

The Bayesian approach has a number of appealing features.
Of course, there is the intuitive nature of the results obtained using Bayesian arguments: real probabilities, and 95 percent confidence intervals that have exactly the kind of interpretation one would expect! Moreover, we obtain the posterior probability distribution in full generality, without having to make limiting assumptions (e.g., having to assume that the data is normally distributed). Additionally, the likelihood function enters the calculation in a way that allows for great flexibility in how we build "models." Under the Bayesian approach, it is very easy to deal with missing data, with data that becomes available over time, or with heterogeneous data sets (i.e., data sets in which different attributes are known about each data point). Because the result of Bayesian inference is a probability distribution itself, it can be used as input for a new model that builds on the previous one (hierarchical models). Moreover, we can use the prior to incorporate previous (domain) knowledge that we may have about the problem under consideration.

On the other hand, Bayesian inference has some problems, too—even when we concentrate on practical applications only, leaving the entire philosophical debate about priors and subjectivity aside. First of all, Bayesian inference is always parametric; it is never just exploratory or descriptive. Because Bayesian methods force us to supply a likelihood function explicitly, they force us to be specific about our choice of model assumptions: we must already have a likelihood function in mind, for otherwise we can't even get started (hence such analysis can never be exploratory). Furthermore, the result of a Bayesian analysis is always a posterior distribution—that is, a conditional probability of something, given the data.
Here, that "something" is some form of hypothesis that we have, and the posterior gives us the probability that this hypothesis is true. To make this prescription operational (and, in particular, expressible through a likelihood function), we pretty much have to parameterize the hypothesis. The inference then consists of finding the best value for this parameter, given the data—which is a parametric problem, given a specific choice for the model (i.e., the likelihood function). (There are so-called "nonparametric" Bayesian methods, but in reality they boil down to parametric models with very large numbers of parameters.)

Additionally, actual Bayesian calculations are often difficult. Recall that Bayesian inference gives us the full, explicit posterior distribution function. If we want to summarize this function, we need either to find its maximum or to integrate it to obtain an expectation value. Both of these problems are hard, especially when the likelihood function is complicated and there is more than one parameter to estimate. Instead of explicitly integrating the posterior, one can sample it—that is, draw random points that are distributed according to the posterior distribution—in order to evaluate expectation values. This is clearly an expensive process that requires computer time and specialized software (and the associated know-how). There can also be additional problems. For example, if the parameter space is very high-dimensional, then evaluating the likelihood function (and hence the posterior) may be difficult.

In contrast, frequentist methods tend to make more assumptions up front and rely more strongly on general analytic results and approximations. With frequentist methods, the hard work has typically already been done (analytically), leading to an asymptotic or approximate formula that you only need to plug in.
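For a posterior as simple as the coin example, sampling can be sketched in a few lines. This is a toy rejection sampler for p^7 (1 − p)^3 (the names and the choice of method are mine; real applications typically use Markov chain Monte Carlo software instead):

```python
import random

def sample_posterior(n, heads=7, tails=3, seed=2):
    """Draw n points distributed according to the unnormalized
    posterior p**heads * (1 - p)**tails, via rejection sampling."""
    rng = random.Random(seed)
    p_mode = heads / (heads + tails)
    peak = p_mode**heads * (1 - p_mode)**tails   # maximum of the density
    draws = []
    while len(draws) < n:
        p = rng.random()                         # proposal: uniform on [0, 1]
        # Accept p with probability density(p) / peak.
        if rng.random() * peak <= p**heads * (1 - p)**tails:
            draws.append(p)
    return draws

draws = sample_posterior(5000)
estimate = sum(draws) / len(draws)   # Monte Carlo estimate of E[p]
```

With 7 Heads and 3 Tails and a flat prior, the exact expectation value is 8/12, about 0.667, and the Monte Carlo estimate lands close to it; any function of p can be averaged over the draws in the same way.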
Bayesian methods give you the full, nonapproximate result but leave it up to you to evaluate it. The disadvantage of the plug-in approach, of course, is that you might be plugging into an inappropriate formula—because some of the assumptions or approximations that were used to derive it do not apply to your system or data set.

To bring this discussion to a close, I'd like to end with a cautionary note. Bayesian methods are very appealing and even exciting—something that is rarely said about classical frequentist statistics. On the other hand, they are probably not very suitable for casual use.

• Bayesian methods are parametric and specific; they are never exploratory or descriptive. If we already know what specific question to ask, then Bayesian methods may be the best way of obtaining an answer. But if we don't yet know the proper questions to ask, then Bayesian methods are not applicable.

• Bayesian methods are difficult and require a fair degree of sophistication, both in setting up the actual model (likelihood function and prior) and in performing the required calculations.

As far as results are concerned, there is not much difference between frequentist and Bayesian analysis. When there is sufficient data (so that the influence of the prior is small), the end results are typically very similar, whether they were obtained using frequentist or Bayesian methods.

Finally, you may encounter some other terms and concepts in the literature that also bear the "Bayesian" moniker: Bayesian classifier, Bayesian network, Bayesian risk, and more. Often, these have nothing to do with Bayesian (as opposed to frequentist) inference as explained in this chapter. Typically, these methods involve conditional probabilities and therefore appeal at some point to Bayes' theorem. A Bayesian classifier, for instance, is the conditional probability that an object belongs to a certain class, given what we know about it.
A Bayesian network is a particular way of organizing the causal relationships that exist among events that depend on many interrelated conditions. And so on.

Workshop: R

R is an environment for data manipulation and numerical calculations, specifically statistical applications. Although it can be used in a more general fashion for programming or computation, its real strength is the large number of built-in (or user-contributed) statistical functions.

R is an open source clone of the S programming language, which was originally developed at Bell Labs in the 1970s. It was one of the first environments to combine the capabilities that today we expect from a scripting language (e.g., memory management, proper strings, dynamic typing, easy file handling) with integrated graphics, intended for an interactive usage pattern.

I tend to stress the word environment when referring to R, because the way it integrates its various components is essential to R. It is misleading to think of R as a programming language that also has an interactive shell (like Python or Groovy). Instead, you might consider it a shell, but for handling data instead of files. Alternatively, you might want to view R as a text-based spreadsheet on steroids. The "shell" metaphor in particular is helpful in motivating some of the design choices made by R.

The essential data structure offered by R is the so-called data frame. A data frame encapsulates a data set and is the central abstraction that R is built on. Practically all operations involve the handling and manipulation of frames in one way or the other. Possibly the best way to think of a data frame is as comparable to a relational database table. Each data frame is a rectangular data structure consisting of rows and columns. Each column has a designated data type, and all entries in that column must be of that type.
Consequently, each row will in general contain entries of different types (as defined by the types of the columns), but all rows must be of the same form. All this should be familiar from relational databases. The similarities continue: operations on frames can either project out a subset of columns or filter out a subset of rows; either operation results in a new data frame. There is even a command (merge) that can perform a join of two data frames on a common column. In addition (and in contrast to databases), we will frequently add columns to an existing frame—for example, to hold the results of an intermediate calculation.

We can refer to columns by name. The names are either read from the first line of the input file, or (if not provided) R will substitute synthetic names of the form V1, V2, .... In contrast, we filter out a set of rows through various forms of "indexing magic." Let's look at some examples. Consider the following input file:

Name    Height  Weight  Gender
Joe     6.2     192.2   0
Jane    5.5     155.4   1
Mary    5.7     164.3   1
Jill    5.6     166.4   1
Bill    5.8     185.8   0
Pete    6.1     201.7   0
Jack    6.0     195.2   0

Let's investigate this data set using R, placing particular emphasis on how to handle and manipulate data with R—the full session transcript is included below. The commands entered at the command prompt are prefixed by the prompt >, while R output is shown without the prompt:

> d <- read.csv( "data", header = TRUE, sep = "\t" )
> str(d)
'data.frame': 7 obs. of 4 variables:
 $ Name  : Factor w/ 7 levels "Bill","Jack",..: 5 3 6 4 1 7 2
 $ Height: num 6.2 5.5 5.7 5.6 5.8 6.1 6
 $ Weight: num 192 155 164 166 186 ...
 $ Gender: int 0 1 1 1 0 0 0
>
> mean( d$Weight )
[1] 180.1429
> mean( d[,3] )
[1] 180.1429
>
> mean( d$Weight[ d$Gender == 1 ] )
[1] 162.0333
> mean( d$Weight[ 2:4 ] )
[1] 162.0333
>
> d$Diff <- d$Height - mean( d$Height )
> print(d)
  Name Height Weight Gender        Diff
1  Joe    6.2  192.2      0  0.35714286
2 Jane    5.5  155.4      1 -0.34285714
3 Mary    5.7  164.3      1 -0.14285714
4 Jill    5.6  166.4      1 -0.24285714
5 Bill    5.8  185.8      0 -0.04285714
6 Pete    6.1  201.7      0  0.25714286
7 Jack    6.0  195.2      0  0.15714286
> summary(d)
   Name       Height          Weight          Gender            Diff
 Bill:1   Min.   :5.500   Min.   :155.4   Min.   :0.0000   Min.   :-3.429e-01
 Jack:1   1st Qu.:5.650   1st Qu.:165.3   1st Qu.:0.0000   1st Qu.:-1.929e-01
 Jane:1   Median :5.800   Median :185.8   Median :0.0000   Median :-4.286e-02
 Jill:1   Mean   :5.843   Mean   :180.1   Mean   :0.4286   Mean   : 2.538e-16
 Joe :1   3rd Qu.:6.050   3rd Qu.:193.7   3rd Qu.:1.0000   3rd Qu.: 2.071e-01
 Mary:1   Max.   :6.200   Max.   :201.7   Max.   :1.0000   Max.   : 3.571e-01
 Pete:1
>
> d$Gender <- factor( d$Gender, labels = c("M", "F") )
> summary(d)
   Name       Height          Weight      Gender        Diff
 Bill:1   Min.   :5.500   Min.   :155.4   M:4    Min.   :-3.429e-01
 Jack:1   1st Qu.:5.650   1st Qu.:165.3   F:3    1st Qu.:-1.929e-01
 Jane:1   Median :5.800   Median :185.8          Median :-4.286e-02
 Jill:1   Mean   :5.843   Mean   :180.1          Mean   : 2.538e-16
 Joe :1   3rd Qu.:6.050   3rd Qu.:193.7          3rd Qu.: 2.071e-01
 Mary:1   Max.   :6.200   Max.   :201.7          Max.   : 3.571e-01
 Pete:1
>
> plot( d$Height ~ d$Gender )
> plot( d$Height ~ d$Weight, xlab="Weight", ylab="Height" )
> m <- lm( d$Height ~ d$Weight )
> print(m)

Call:
lm(formula = d$Height ~ d$Weight)

Coefficients:
(Intercept)     d$Weight
    3.39918      0.01357

> abline(m)
> abline( mean(d$Height), 0, lty=2 )

Let's step through this session in some detail and explain what is going on. First, we read the file in and assign it to the variable d, which is a data frame as discussed previously. The function str(d) shows us a string representation of the data frame.
We can see that the frame consists of four named columns, and we can also see some typical values for each column. Notice that R has assigned a data type to each column: height and weight have been recognized as floating-point values; the names are considered a "factor," which is R's way of indicating a categorical variable; and finally the gender flag is interpreted as an integer. This is not ideal; we will come back to that.

    > d <- read.csv( "data", header = TRUE, sep = "\t" )
    > str(d)
    'data.frame': 7 obs. of 4 variables:
     $ Name  : Factor w/ 7 levels "Bill","Jack",..: 5 3 6 4 1 7 2
     $ Height: num 6.2 5.5 5.7 5.6 5.8 6.1 6
     $ Weight: num 192 155 164 166 186 ...
     $ Gender: int 0 1 1 1 0 0 0

Let's calculate the mean of the weight column to demonstrate some typical ways in which we can select rows and columns. The most convenient way to specify a column is by name: d$Weight. The use of the dollar sign ($) to access members of a data structure is one of R's quirks that one learns to live with. Think of a column as a shell variable! (By contrast, the dot (.) is not an operator and can be part of a variable or function name, in the same way that an underscore (_) is used in other languages. Here again the shell metaphor is useful: recall that shells allow the dot as part of filenames!)

    > mean( d$Weight )
    [1] 180.1429
    > mean( d[,3] )
    [1] 180.1429

Although its name is often the most convenient way to specify a column, we can also use its numeric index. Each element in a data frame can be accessed using its row and column index via the familiar bracket notation: d[row,col]. Keep in mind that the vertical (row) index comes first, followed by the horizontal (column) index. Omitting one of them selects all possible values, as we do in the listing above: d[,3] selects all rows from the third column. Also note that indices in R start at 1 (mathematical convention), not at 0 (programming convention).
Now that we know how to select a column, let's see how to select rows. In R, this is usually done through various forms of "indexing magic," two examples of which are shown next in the listing. We want to find the mean weight of only the women in the sample. To do so, we take the weight column but now index it with a logical expression. This kind of operation takes some getting used to: inside the brackets, we seem to compare a column (d$Gender) with a scalar, and then use the result to index another column. What is going on here? Several things: first, the scalar on the righthand side of the comparison is expanded into a vector of the same length as the vector on the lefthand side. The result of the equality operator is then a Boolean vector of the same length as d$Gender or d$Weight. A Boolean vector of the appropriate length can be used as an index and selects only those rows for which it evaluates to TRUE, which it does in this case only for the women in the sample. The second line of code is much more conventional: the colon operator (:) creates a range of numbers, which are used to index into the d$Weight column. (Remember that indices start at 1, not at 0!)

    > mean( d$Weight[ d$Gender == 1 ] )
    [1] 162.0333
    > mean( d$Weight[ 2:4 ] )
    [1] 162.0333

These kinds of operations are very common in R: using some form of creative indexing to filter out a subset of rows (there are more ways to do this, which I don't show) and mixing vectors and scalars in expressions. Here is another example:

    > d$Diff <- d$Height - mean( d$Height )

Here we create an additional column, called d$Diff, as the residual that remains when the mean height is subtracted from each individual's height. Observe how we mix a column with a scalar expression to obtain another vector.

    > summary(d)

Next, we calculate the summary of the entire data frame with the new column added.
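For readers more at home in Python, the same mask-then-filter idea can be sketched with plain lists. This is only an illustrative analogy, not part of the R session; the numbers are copied from the sample file:

```python
# Columns from the sample data file, as parallel lists
weight = [192.2, 155.4, 164.3, 166.4, 185.8, 201.7, 195.2]
gender = [0, 1, 1, 1, 0, 0, 0]

# Boolean mask: compare each gender entry against the scalar 1,
# then keep only the weights where the mask is true
mask = [g == 1 for g in gender]
women = [w for w, m in zip(weight, mask) if m]
mean_women = sum(women) / len(women)

# Positional selection, the analogue of d$Weight[2:4]
# (R's one-based 2:4 is the zero-based slice [1:4] in Python)
rows_2_to_4 = weight[1:4]
mean_rows = sum(rows_2_to_4) / len(rows_2_to_4)

print(round(mean_women, 4))  # 162.0333, matching the R session
print(round(mean_rows, 4))   # 162.0333
```

In R, of course, the mask construction and the filtering happen in a single indexing expression, which is exactly the "indexing magic" described above.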
Take a look at the gender column: because R interpreted the gender flag as an integer, it went ahead and calculated its "mean" and other quantities. This is meaningless, of course; the values in this column should be treated as categorical. This can be achieved using the factor() function, which also allows us to replace the uninformative numeric labels with more convenient string labels.

    > d$Gender <- factor( d$Gender, labels = c("M", "F") )

As you can see when we run summary(d) again, R treats categorical variables differently: it counts how often each value occurs in the data set.

Finally, let's take a look at R's plotting capabilities. First, we plot the height "as a function of" the gender. (R uses the tilde (~) to separate control and response variables; the response variable is always on the left.)

    > plot( d$Height ~ d$Gender )

FIGURE 10-7. A box plot, showing the distribution of heights by gender.

This gives us a box plot, which is shown in Figure 10-7. On the other hand, if we plot the height as a function of the weight, then we obtain a scatter plot (see Figure 10-8, without the lines; we will add them in a moment).

    > plot( d$Height ~ d$Weight, xlab="Weight", ylab="Height" )

Given the shape of the data, we might want to fit a linear model to it. This is trivially easy to do in R; it's a single line of code:

    > m <- lm( d$Height ~ d$Weight )

Notice once again the tilde notation used to indicate control and response variable. We may also want to add the linear model to the scatter plot with the data. This can be done using the abline() function, which plots a line given its offset ("a") and slope ("b"). We can either specify both parameters explicitly or simply supply the result m of the fitting procedure; the abline function can use either. (The parameter lty selects the line type.)
    > abline(m)
    > abline( mean(d$Height), 0, lty=2 )

This short example should have given you an idea of what working with R is like. R can be difficult to learn: it uses some unfamiliar idioms (such as creative indexing) as well as some obscure function and parameter names. But the greatest challenge to the newcomer (in my opinion) is its indiscriminate use of function overloading. The same function can behave quite differently depending on the (usually opaque) type of inputs it is given. If the default choices made by R are good, then this can be very convenient, but it can be hellish if you want to exercise greater, manual control.

FIGURE 10-8. A scatter plot with a linear fit.

Look at our example again: the same plot() command generates entirely different plot types depending on whether the control variable is categorical or numeric (box plot in the first case, scatter plot in the latter). For the experienced user, this kind of implicit behavior is of course convenient, but for the beginner, the apparent unpredictability can be very confusing. (In Chapter 14, we will see another example, where the same plot() command generates yet a different type of plot.) These kinds of issues do not matter much if you use R interactively, because you see the results immediately or, in the worst case, get an error message so that you can try something else. However, they can be unnerving if you approach R with the mindset of a contemporary programmer who prefers operations to be explicit. It can also be difficult to find out which operations are available in a given situation. For instance, it is not at all obvious that the (opaque) return type of the lm() function is admissible input to the abline() function; it certainly doesn't look like the explicit set of parameters used in the second call to abline().
Issues of this sort make it hard to predict what R will do at any point, to develop a comprehensive understanding of its capabilities, or to work out how to achieve a desired effect in a specific situation.

Further Reading

The number of introductory statistics texts seems almost infinite, which makes it that much harder to find good ones. Below are some texts that I have found useful:

• An Introduction to Mathematical Statistics and Its Applications. Richard J. Larsen and Morris L. Marx. 4th ed., Prentice Hall. 2005. This is my preferred introductory text for the mathematical background of classical statistics: how it all works. This is a math book; you won't learn how to do practical statistical fieldwork from it. (It contains a large number of uncommonly interesting examples; however, on close inspection many of them exhibit serious flaws in their experimental design, at least as described in this book.) But as a mathematical treatment, it very neatly blends accessibility with sufficient depth.

• Statistics for Technology: A Course in Applied Statistics. Chris Chatfield. 3rd ed., Chapman & Hall/CRC. 1983. This book is a good companion to the book by Larsen and Marx. It eschews most mathematical development and instead concentrates on the pragmatics, with an emphasis on engineering applications.

• The Statistical Sleuth: A Course in Methods of Data Analysis. Fred Ramsey and Daniel Schafer. 2nd ed., Duxbury Press. 2001. This advanced undergraduate textbook emphasizes the distinction between observational studies and controlled experiments more strongly than any other book I am aware of. After working through some of their examples, you will not be able to look at the description of a statistical study without immediately classifying it as observational or controlled (and questioning the conclusions if it was merely observational).
Unfortunately, the development of the general theory gets a little lost in the detailed description of application concerns.

• The Practice of Business Statistics. David S. Moore, George P. McCabe, William M. Duckworth, and Layth Alwan. 2nd ed., Freeman. 2008. This is a "for business" version of a popular beginning undergraduate textbook. The coverage of topics is comprehensive, and the presentation is particularly easy to follow. This book can serve as a first course but will probably not provide sufficient depth to develop proper understanding.

• Problem Solving: A Statistician's Guide. Chris Chatfield. 2nd ed., Chapman & Hall/CRC. 1995; and Statistical Rules of Thumb. Gerald van Belle. 2nd ed., Wiley. 2008. Two nice books with lots of practical advice on statistical fieldwork. Chatfield's book is more general; van Belle's contains much material specific to epidemiology and related applications.

• All of Statistics: A Concise Course in Statistical Inference. Larry Wasserman. Springer. 2004. A thoroughly modern treatment of mathematical statistics, this book presents all kinds of fascinating and powerful topics that are sorely missing from the standard introductory curriculum. The treatment is advanced and very condensed, requiring previous knowledge of basic statistics and a solid grounding in mathematical methods.

• Bayesian Methods for Data Analysis. Bradley P. Carlin and Thomas A. Louis. 3rd ed., Chapman & Hall. 2008. This is a book on Bayesian methods applied to data analysis problems (as opposed to Bayesian theory only). It is a thick book, and some of the topics are fairly advanced. However, the early chapters provide the best introduction to Bayesian methods that I am aware of.

• "Sifting the Evidence—What's Wrong with Significance Tests?" Jonathan A. C. Sterne and George Davey Smith. British Medical Journal 322 (2001), p. 226.
This paper provides a penetrating and nonpartisan overview of the problems associated with classical hypothesis tests, with an emphasis on applications in medicine (although the conclusions are much more generally valid). The full text is freely available on the Web; a search will turn up multiple locations.

CHAPTER ELEVEN
Intermezzo: Mythbusting—Bigfoot, Least Squares, and All That

Everybody has heard of Bigfoot, the mystical figure that lives in the woods, but nobody has ever actually seen him. Similarly, there are some concepts from basic statistics that everybody has heard of but that, like Bigfoot, always remain a little shrouded in mystery. Here, we take a look at three of them: the average of averages, the mystical standard deviation, and the ever-popular least squares.

How to Average Averages

Recently, someone approached me with the following question: given the numbers in Table 11-1, what number should be entered in the lower-right corner? Just adding up the individual defect rates per item and dividing by 3 (in effect, averaging them) did not seem right, if only because it would come out to about 0.5, which is pretty high when one considers that most of the units produced (100 out of 103) are not actually defective. The specific question asked was: "Should I weight the individual rates somehow?"

This situation comes up frequently but is not always recognized: we have a set of rates (or averages) and would like to summarize them into an overall rate (or overall average).

TABLE 11-1. Defect rates: what value should go into the lower-right corner?

    Item type   Units produced   Defective units   Defect rate
    A             2                1                 0.5
    B             1                1                 1.0
    C           100                1                 0.01
    Total defect rate                                ???
The problem is that the naive way of doing so (namely, to add up the individual rates and then to divide by the number of rates) will give an incorrect result. However, this is rarely noticed unless the numbers involved are as extreme as in the present example.

The correct way to approach this task is to start from scratch. What is the "defect rate," anyway? It is the number of defective items divided by the number of items produced. Hence, the total defect rate is the total number of defective items divided by the total number of items produced: 3/103 ≈ 0.03. There should be no question about that.

Can we arrive at this result in a different way by starting with the individual defect rates? Absolutely, provided we weight them appropriately. Each individual defect rate should contribute to the overall defect rate in the same way that the corresponding item type contributes to the total item count. In other words, the weight for item type A is 2/103, for B it is 1/103, and for C it is 100/103. Pulling all this together, we have:

    0.5 · 2/103 + 1.0 · 1/103 + 0.01 · 100/103 = (1 + 1 + 1)/103 = 3/103

as before. To show that this agreement is not accidental, let's write things out in greater generality:

    n_k                    number of items of type k
    d_k                    number of defective items of type k
    r_k = d_k / n_k        defect rate for type k
    f_k = n_k / Σ_i n_i    contribution of type k to total production

Now look at what it means to weight each individual defect rate:

    f_k r_k = ( n_k / Σ_i n_i ) · ( d_k / n_k ) = d_k / Σ_i n_i

In other words, weighting the individual defect rate r_k by the appropriate weight factor f_k has the effect of turning the defect rate back into the defect count d_k (normalized by the total number of items). In this example, each item could get only one of two "grades," namely 1 (for defective) or 0 (for not defective), and so the "defect rate" was a measure of the "average defectiveness" of a single item.
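To make the bookkeeping concrete, here is a small Python sketch of the same calculation, using the numbers from Table 11-1 (my own illustration, not code from the book):

```python
# Units produced and defective units per item type, from Table 11-1
produced  = {"A": 2, "B": 1, "C": 100}
defective = {"A": 1, "B": 1, "C": 1}

total = sum(produced.values())                       # 103

# Naive (wrong): average the per-type rates directly
rates = {k: defective[k] / produced[k] for k in produced}
naive = sum(rates.values()) / len(rates)             # about 0.5

# Correct: weight each rate by its type's share of production,
# which collapses back to total defects / total production
weights = {k: produced[k] / total for k in produced}
weighted = sum(weights[k] * rates[k] for k in produced)

print(weighted)                          # 3/103, about 0.029
print(sum(defective.values()) / total)   # the same number
```

The weighted sum reproduces the direct calculation 3/103 exactly, while the naive average overstates the overall defect rate by more than an order of magnitude.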
The same logic as just demonstrated applies if you have a greater (or different) range of values. (You can make up your own example: give items grades from 1 to 5, and then calculate the overall "average grade" to see how it works.)

Simpson's Paradox

Since we are talking about mystical figures that can sometimes be found in tables, we should also mention Simpson's paradox. Look at Table 11-2, which shows applications and admissions to a fictional college in terms of the applicants' gender and department.

TABLE 11-2. Simpson's paradox: applications and admissions by gender of applicant.

                    Male            Female           Overall
    Department A    80/100 = 0.8    9/10   = 0.9     89/110 = 0.81
    Department B    5/10   = 0.5    60/100 = 0.6     65/110 = 0.59
    Total           85/110 = 0.77   69/110 = 0.63

If you look only at the bottom line with the totals, then it might appear that the college is discriminating against women, since the acceptance rate for male applicants is higher than that for female applicants (0.77 versus 0.63).* But when you look at the rates for each individual department within the college, it turns out that women have higher acceptance rates than men for every department. How can that be?

The short and intuitive answer is that many more women apply to department B, which has a lower overall admission rate than department A (0.59 versus 0.81), and this drags down their (gender-specific) acceptance rate. The more general explanation speaks of a "reversal of association due to a confounding factor." When considering only the totals, it may seem as if there is an association between gender and admission rates, with male applicants being accepted more frequently. However, this view ignores the presence of a hidden but important factor: the choice of department. In fact, the choice of department has a greater influence on the acceptance rate than the original explanatory variable (the gender).
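The reversal is easy to reproduce programmatically. A short Python check of the numbers in Table 11-2 (my own sketch):

```python
# Admissions data from Table 11-2: (admitted, applied) per cell
data = {
    ("A", "male"):   (80, 100),
    ("A", "female"): (9, 10),
    ("B", "male"):   (5, 10),
    ("B", "female"): (60, 100),
}

def rate(cells):
    """Pooled acceptance rate over a list of (admitted, applied) pairs."""
    admitted = sum(a for a, n in cells)
    applied = sum(n for a, n in cells)
    return admitted / applied

# Per department, women have the higher acceptance rate...
for dept in ("A", "B"):
    assert rate([data[(dept, "female")]]) > rate([data[(dept, "male")]])

# ...yet pooling over departments reverses the association
male_total = rate([v for k, v in data.items() if k[1] == "male"])
female_total = rate([v for k, v in data.items() if k[1] == "female"])
print(round(male_total, 2), round(female_total, 2))  # 0.77 0.63
```

Note that the pooled rates are computed exactly as in the previous section: total admissions divided by total applications, not an unweighted average of the per-department rates.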
By lumping the observations for the different departments into a single number, we have in fact masked the influence of this factor, with the consequence that the association between acceptance rate (which favors women for each department) and gender was reversed.

The important insight here is that such "reversal of association" due to a confounding factor is always possible. For it to happen, though, two conditions must both hold: the confounding factor must be sufficiently strong (in our case, the acceptance rates for departments A and B were sufficiently different), and the assignment of experimental units to the levels of this factor must be sufficiently imbalanced (in our case, many more women applied to department B than to department A).

As opposed to Bigfoot, Simpson's paradox is known to occur in the real world. The example in this section, for instance, was based on a well-publicized case involving the University of California (Berkeley) in the early 1970s. A quick Internet search will turn up additional examples.

*You should check that the entries in the bottom row have been calculated properly, per the discussion in the previous section!

The Standard Deviation

The fabled standard deviation is another close relative of Bigfoot. Everybody (it seems) has heard of it, everybody knows how to calculate it, and, most importantly, everybody knows that 68 percent of all data points fall within 1 standard deviation, 95 percent within 2, and virtually all (that is, 99.7 percent) within 3. Problem is: this is utter nonsense.

It is true that the standard deviation is a measure for the spread (or width) of a distribution. It is also true that, for a given set of points, the standard deviation can always be calculated.
But that does not mean that the standard deviation is always a good or appropriate measure for the width of a distribution; in fact, it can be quite misleading if applied indiscriminately to an unsuitable data set. Furthermore, we must be careful in how we interpret it: the whole 68 percent business applies only if the data set satisfies some very specific requirements. In my experience, the standard deviation is probably the most misunderstood and misapplied quantity in all of statistics.

Let me tell you a true story (some identifying details have been changed to protect the guilty). The story is a bit involved, but this is no accident: in the same way that Bigfoot sightings never occur in a suburban front yard on a sunny Sunday morning, severe misunderstandings of mathematical or statistical methods usually don't reveal themselves as long as the applications are as clean and simple as the homework problems in a textbook. But once people try to apply these same methods in situations that are a bit less standard, anything can happen. This is what happened in this particular company.

I was looking over a bit of code used to identify outliers in the response times from a certain database server. The purpose of this program was to detect and report on uncommonly slow responses. The piece of code in question processed log files containing the response times and reported a threshold value: responses that took longer than this threshold were considered "outliers." An existing service-level agreement defined an outlier as any value "outside of 3 standard deviations."

So what did this piece of code do? It sorted the response times to identify the top 0.3 percent of data points and used those to determine the threshold. (In other words, if there were 1,000 data points in the log file, it reported the response time of the third slowest as the threshold.) After all, 99.7 percent of data points fall within 3 standard deviations. Right?
After reading Chapter 2, I hope you can immediately tell where the original programmer went wrong: the threshold that the program reported had nothing at all to do with standard deviations; instead, it reported the 99.7th percentile (the cutoff for the top 0.3 percent of values). In other words, the program completely failed to do what it was supposed to do. Also, keep in mind that it is incorrect to blindly consider the top x percent of any distribution as outliers (review the discussion of box plots in Chapter 2 if you need a reminder).

But the story continues. This was a database server whose typical response time was less than a few seconds. It was clear that anything that took longer than one or two minutes had to be considered "slow," that is, an outlier. But when the program was run, the threshold value it reported (the 99.7th percentile) was on the order of hours. Clearly, this threshold value made no sense. In what must have been a growing sense of desperation, the original programmer now made a number of changes: from selecting the top 0.3 percent, to the top 1 percent, then the top 5 percent, and finally the top 10 percent. (I could tell, because each such change had dutifully been checked into source control!) Finally, the programmer had simply hard-coded some seemingly "reasonable" value (such as 47 seconds or something) into the program, and that's what was reported as "3 standard deviations" regardless of the input. It was the only case of outright technical fraud that I have ever witnessed: a technical work product that, with the original author's full knowledge, in no way did what it claimed to do.

What went wrong here? Several things. First, there was a fundamental misunderstanding about the definition of the standard deviation, how it is calculated, and some of the properties that in practice it often (but not always) has. The second mistake was applying the standard deviation to a situation where it is not a suitable measure.
Let's recap some basics: we often want to characterize a point distribution by a typical value (its location) and its spread around this location. A convenient measure for the location is the mean:

    μ = (1/n) Σ_i x_i

Why is the mean so convenient? Because it is easy to calculate: just sum all the values and divide by n. To find the width of the distribution, we would like to see how far points "typically" stray from the mean. In other words, we would like to find the mean of the deviations x_i − μ. But since the deviations can be positive and negative, they would simply cancel, so instead we calculate the mean of the squared deviations:

    σ² = (1/n) Σ_i (x_i − μ)²

This quantity is called the variance, and its square root is the standard deviation. Why do we bother with the square root? Because it has the same units as the mean, whereas in the variance the units are raised to the second power.

Now, if and only if the point distribution is well behaved (which in practice means: it is Gaussian), then it is true that about 68 percent of points will fall within the interval [μ − σ, μ + σ] and that 95 percent fall within the interval [μ − 2σ, μ + 2σ], and so on. The converse is not true: you cannot take an interval that contains 68 percent of the points and conclude that its half-width is "a standard deviation" (this is where the programmer in our story made the first mistake). If the point distribution is not Gaussian, then there are no particular patterns by which fractions of points will fall within 1, 2, or any number of standard deviations from the mean. However, keep in mind that the definitions of the mean and the standard deviation (as given by the previous equations) both retain their meaning: you can calculate them for any distribution and any data set.
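The coverage figures for the Gaussian case are easy to check empirically. A quick simulation sketch in Python (the sample size and seed are arbitrary choices of mine):

```python
import random

random.seed(42)
n = 100_000
# Draw from a standard normal: mu = 0, sigma = 1
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

# Fraction of points within 1, 2, and 3 standard deviations of the mean
within = [sum(abs(x) <= k for x in xs) / n for k in (1, 2, 3)]
print(within)   # roughly [0.68, 0.95, 0.997]
```

For a non-Gaussian sample (try an exponential or a power-law draw instead of random.gauss), these fractions come out quite different, which is exactly the point being made here.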
However (and this is the second mistake that was made), if the distribution is strongly asymmetrical, then mean and standard deviation are no longer good measures of location and spread, respectively. You can still calculate them, but their values will just not be very informative. In particular, if the distribution has a fat tail, then both mean and standard deviation will be influenced heavily by extreme values in the tail. In this case, the situation was even worse: the distribution of response times was a power-law distribution, which is extremely poorly summarized by quantities such as mean and standard deviation. This explains why the top 0.3 percent of response times were on the order of hours: with power-law distributions, all values, even extreme ones, can (and do!) occur; whereas for Gaussian or exponential distributions, the range of values that occur in practice is pretty well limited. (See Chapter 9 for more information on power-law distributions.)

To summarize, the standard deviation, defined as

    σ = sqrt( (1/n) Σ_i (x_i − μ)² )

is a measure of the width of a distribution (or a sample). It is a good measure for the width only if the distribution of points is well behaved (i.e., symmetric and without fat tails). Points that are far away from the center (compared to the width of the distribution) can be considered outliers. For distributions that are less well behaved, you will have to use other measures for the width (e.g., the inter-quartile range); however, you can usually still identify outliers as points that fall outside the typical range of values. (For power-law distributions, which do not have a "typical" scale, it doesn't make sense to define outliers by statistical means; you will have to justify them differently, for instance by appealing to requirements from the business domain.)

How to Calculate

Here is a good trick for calculating the standard deviation efficiently.
At first, it seems you need to make two passes over the data in order to calculate both mean and standard deviation. In the first pass you calculate the mean, but then you need to make a second pass to calculate the deviations from that mean:

    σ² = (1/n) Σ (x_i − μ)²

It appears as if you can't find the deviations until the mean μ is known. However, it turns out that you can calculate both quantities in a single pass through the data. All you need to do is to maintain both the sum of the values (Σ x_i) and the sum of the squares of the values (Σ x_i²), because you can write the preceding equation for σ² in a form that depends only on those two sums:

    σ² = (1/n) Σ (x_i − μ)²
       = (1/n) Σ (x_i² − 2 x_i μ + μ²)
       = (1/n) Σ x_i² − 2μ (1/n) Σ x_i + μ²
       = (1/n) Σ x_i² − 2μ · μ + μ²
       = (1/n) Σ x_i² − μ²
       = (1/n) Σ x_i² − ( (1/n) Σ x_i )²

This is a good trick that is apparently too little known. Keep it in mind; similar situations crop up in different contexts from time to time. (To be sure, the floating-point properties of both methods are different, but if you care enough to worry about the difference, then you should be using a library anyway.)

Optional: One over What?

You may occasionally see the standard deviation defined with an n in the denominator and sometimes with a factor of n − 1 instead:

    sqrt( (1/n) Σ_i (x_i − μ)² )   or   sqrt( (1/(n − 1)) Σ_i (x_i − μ)² )

What really is the difference, and which expression should you use? The factor 1/n applies only if you know the exact value of the mean μ ahead of time. This is usually not the case; instead, you will usually have to calculate the mean from the data. This adds a bit of uncertainty, which leads to a widening of the proper estimate for the standard deviation. A theoretical argument then leads to the use of the factor 1/(n − 1) instead of 1/n. In short, if you calculated the mean from the data (as is usually the case), then you should really be using the 1/(n − 1) factor.
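Both the single-pass trick and the 1/n versus 1/(n − 1) choice can be sketched in a few lines of Python (my own illustration, not code from the book):

```python
def variance_two_pass(xs, ddof=0):
    """Textbook definition: compute the mean first, then the mean
    squared deviation. ddof=1 gives the 1/(n-1) version."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / (n - ddof)

def variance_one_pass(xs, ddof=0):
    """Single pass: maintain sum(x) and sum(x*x), then combine."""
    n, s, s2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        s += x
        s2 += x * x
    # sigma^2 = (1/n) sum(x^2) - ((1/n) sum(x))^2, rescaled for ddof
    return (s2 - s * s / n) / (n - ddof)

data = [6.2, 5.5, 5.7, 5.6, 5.8, 6.1, 6.0]   # heights from the R session
assert abs(variance_one_pass(data) - variance_two_pass(data)) < 1e-12
print(variance_two_pass(data, ddof=0), variance_two_pass(data, ddof=1))
```

The last line shows the 1/n and 1/(n − 1) estimates side by side for a small sample, where the gap between them is at its largest.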
The difference is going to be small unless you are dealing with very small data sets.

Optional: The Standard Error

While we are on the topic of obscure sources of confusion, let's talk about the standard error.

FIGURE 11-1. Fitting for statistical parameter estimation: data affected by random noise. What is the slope of the straight line?

The standard error is the standard deviation of an estimated quantity. Let's say we estimate some quantity (e.g., the mean). If we repeatedly take samples, then the means calculated from those samples will scatter around a little, according to some distribution. The standard deviation of this distribution is the "standard error" of the estimated quantity (the mean, in this example).

The following observation will make this clearer. Take a sample of size n from a normally distributed population with standard deviation σ. Then 68 percent of the members of the sample will be within ±σ of the estimated mean (i.e., the sample mean). However, the mean itself is normally distributed (because of the Central Limit Theorem, since the mean is a sum of random variables) with standard deviation σ/√n (again because of the Central Limit Theorem). So if we take several samples, each of size n, then we can expect 68 percent of the estimated means to lie within ±σ/√n of the true mean (i.e., the mean of the overall population). In this situation, the quantity σ/√n is therefore the standard error of the mean.

Least Squares

Everyone loves least squares. In the confusing and uncertain world of data and statistics, they provide a sense of security, something to rely on! They give you, after all, the "best" fit. Doesn't that say it all?

Problem is, I have never (not once!) seen least squares applied appropriately, and I have come to doubt that it should ever be considered a suitable technique.
In fact, when today I see someone doing anything involving "least-squares fitting," I am pretty certain this person is at wit's end, and probably does not even know it!

FIGURE 11-2. Fitting a function to approximate a curve known only at discrete locations. Is the fit a good representation of the data?

There are two problems with least squares. The first is that it is used for two very different purposes that are commonly confused. The second problem is that least-squares fitting is usually not the best (or even a suitable) method for either purpose. Alternative techniques should be used, depending on the overall purpose (see first problem) and on what, in the end, we want to do with the result. Let's try to unravel these issues.

Why do we ever want to "fit" a function to data to begin with? There are two different reasons.

Statistical Parameter Estimation
Data is corrupted by random noise, and we want to extract parameters from it.

Smooth Interpolation or Approximation
Data is given as individual points, and we would like either to find a smooth interpolation to arbitrary positions between those points or to determine an analytical "formula" describing the data.

These two scenarios are conceptually depicted in Figures 11-1 and 11-2.

Statistical Parameter Estimation

Statistical parameter estimation is the more legitimate of the two purposes. In this case, we have a control variable x and an outcome y. We set the former and measure the latter, resulting in a data set of pairs: {(x1, y1), (x2, y2), ...}. Furthermore, we assume that the outcome is related to the control variable through some function f(x; {a, b, c, ...}) of known form that depends on the control variable x and also on a set of (initially unknown) parameters {a, b, c, ...}.
However, in practice, the actual measurements are affected by some random noise ε, so that the measured values yᵢ are a combination of the "true" value and a noise term:

    yᵢ = f(xᵢ; {a, b, c, ...}) + εᵢ

We now ask: how should we choose values for the parameters {a, b, c, ...} such that the function f(x; {a, b, c, ...}) reproduces the measured values of y most faithfully? The usual answer is that we want to choose the parameters such that the total mean-square error E² (sometimes called the residual sum of squares)

    E² = Σᵢ ( f(xᵢ; {a, b, c, ...}) − yᵢ )²

is minimized. As long as the distribution of errors is reasonably well behaved (not too asymmetric and without heavy tails), the results are adequate. If, in addition, the noise is Gaussian, then we can even invoke other parts of statistics and show that the estimates for the parameters obtained by the least-squares procedure agree with the "maximum likelihood estimate." Thus the least-squares results are consistent with alternative ways of calculation.

But there is another important aspect to least-squares estimation that is frequently lost: we can obtain not only point estimates for the parameters {a, b, c, ...} but also confidence intervals, through a self-consistent argument that links the distribution of the parameters to the distribution of the measured values. I cannot stress this enough: a point estimate by itself is of limited use. After all, what good is knowing that the point estimate for a is 5.17 if I have no idea whether this means a = 5.17 ± 0.01 or a = 5.17 ± 250? We must have some way of judging the range over which we expect our estimate to vary, which is the same as finding a confidence interval for it. Least squares works, when applied in a probabilistic context like this, because it gives us not only an estimate for the parameters but also for their confidence intervals.

One last point: in statistical applications, it is rarely necessary to perform the minimization of E² by numerical means.
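For instance, for a straight-line model f(x; a, b) = a + b·x, setting the derivatives of E² with respect to a and b to zero yields explicit formulas for both parameters. Here is a minimal sketch in pure Python (the function and variable names are mine, not from the book's listings):

```python
def linear_least_squares(xs, ys):
    """Closed-form least-squares fit of y = a + b*x.

    Minimizes E^2 = sum_i (a + b*x_i - y_i)^2 by solving the
    normal equations directly; no numerical minimizer is needed.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of squares and of cross products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    return a, b

# Noise-free check: points on y = 2 + 3x are recovered exactly
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 3.0 * x for x in xs]
a, b = linear_least_squares(xs, ys)
print(a, b)  # → 2.0 3.0
```

With noisy data, the same centered sums also appear in the textbook formula for the standard error of the slope, which is how the closed form delivers confidence intervals along with the point estimates.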
For most of the functions f(x; {a, b, c, ...}) that are commonly used in statistics, the conditions that minimize E² can be worked out explicitly. (See Chapter 3 for the results when the function is linear.) In general, you should be reluctant to resort to numerical minimization procedures—there might be better ways of obtaining the result.

Function Approximation

In practice, however, least-squares fitting is often used for a different purpose. Consider the situation in Figure 11-2, where we have a set of individual data points. These points clearly seem to fall on a smooth curve. It would be convenient to have an explicit formula to summarize these data points rather than having to work with the collection of points directly. So, can we "fit" a formula to them?

Observe that, in this second application of least-squares fitting, there is no random noise. In fact, there is no random component at all! This is an important insight, because it implies that statistical methods and arguments don't apply. This becomes relevant when we want to determine the degree of confidence in the results of a fit. Let's say we have performed a least-squares routine and obtained some values for the parameters. What confidence intervals should we associate with the parameters, and how good is the overall fit? Whatever errors we may incur in the fitting process, they will not be of a random nature, and we therefore cannot make probabilistic arguments about them.

The scenario in Figure 11-2 is typical: the plot shows the data together with the best fit for a function of the form f(x; a, b) = a/(1 + x)^b, with a = 1.08 and b = 1.77. Is this a good fit? And what uncertainty do we have in the parameters? The answer depends on what you want to do with the results—but be aware that the deviations between the fit and the data are not at all "random" and hence that statistical "goodness of fit" measures are inappropriate.
We have to find other ways to answer our questions. (For instance, we may find the largest of the residuals between the data points and our fitted function and report that the fit "represents the data with a maximum deviation of ....")

This situation is typical in yet another way: given how smooth the curve is that the data points seem to fall on, our "best fit" seems really bad. In particular, the fit exhibits a systematic error: for 0 < x < 1.5, the curve is always smaller than the data, and for x > 1.5, it is always greater. Is this really the best we can do? The answer is yes, for functions of the form a/(1 + x)^b. However, a different choice of function might give much better results. The problem here is that the least-squares approach forces us to specify the functional form of the function we are attempting to fit, and if we get it wrong, then the results won't be any good. For this reason, we should use less constraining approaches (such as nonparametric or local approximations) unless we have good reasons to favor a particular functional form.

In other words, what we really have here is a problem of function interpolation or approximation: we know the function on a discrete set of points, and we would like to extend it smoothly to all values. How we should do this depends on what we want to do with the results. Here is some advice for common scenarios:

• To find a "smooth curve" for plotting purposes, you should use one of the smoothing routines discussed in Chapter 3, such as splines or LOESS. These nonparametric methods have the advantage that they do not impose a particular functional form on the data (in contrast to the situation in Figure 11-2).

• If you want to be able to evaluate the function easily at an arbitrary location, then you should use a local interpolation method. Such methods build a local approximation by using the three or four data points closest to the desired location. It is not necessary to find a global expression in this case: the local approximation will suffice.

• Sometimes you may want to summarize the behavior of the data set in just a few "representative" values (e.g., so you can more easily compare one data set against another). This is tricky—it is probably a better idea to compare data sets directly against each other using similarity metrics such as those discussed in Chapter 13. If you still need to do this, consider a basis function expansion using Fourier, Hermite, or wavelet functions. (These are special sets of functions that enable you to extract greater and greater amounts of detail from a data set. Expansion in basis functions also allows you to evaluate and improve the quality of the approximation in a systematic fashion.)

• At times you might be interested in some particular feature of the data: for example, you suspect that the data follows a power law x^b and you would like to extract the exponent, or the data is periodic and you need to know the length of one period. In such cases, it is usually a better idea to transform the data in such a way that you can obtain that particular feature directly, rather than fitting a global function. (To extract exponents, you should consider a logarithmic transform. To obtain the length of an oscillatory period, measure the peak-to-peak or, better still, the zero-to-zero distance.)

• Use specialized methods if available and applicable. Time series, for instance, should be treated with the techniques discussed in Chapter 4.

You may have noticed that none of these suggestions involve least squares!

Further Reading

Every introductory statistics book covers the standard deviation and least squares (see the book recommendations in Chapter 10). For the alternatives to least squares, consult a book on numerical analysis, such as the one listed here.
• Numerical Methods That (Usually) Work. Forman S. Acton. 2nd ed., Mathematical Association of America. 1997.
  Although originally published in 1970, this book does not feel the least bit dated—it is still one of the best introductions to the art of numerical analysis. Neither a cookbook nor a theoretical treatise, it stresses practicality and understanding first and foremost. It includes an inimitable chapter on "What Not to Compute."

PART III
Computation: Mining Data

CHAPTER TWELVE
Simulations

In this chapter, we look at simulations as a way to understand data. It may seem strange to find simulations included in a book on data analysis: don't simulations just generate even more data that needs to be analyzed? Not necessarily—as we will see, simulations in the form of resampling methods provide a family of techniques for extracting information from data. In addition, simulations can be useful when developing and validating models, and in this way, they facilitate our understanding of data. Finally, in the context of this chapter we can take a brief look at a few other relevant topics, such as discrete event simulations and queueing theory.

A technical comment: I assume that your programming environment includes a random-number generator—not only for uniformly distributed random numbers but also for other distributions (this is a pretty safe bet). I also assume that this random-number generator produces random numbers of sufficiently high quality. This is probably a reasonable assumption, but there's no guarantee: although the theory of random-number generators is well understood, broken implementations apparently continue to ship. Most books on simulation methods will contain information on random-number generators—look there if you feel that you need more detail.
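One crude way to reassure yourself about a generator is to bin a large number of uniform variates and compare the bin counts with their expected values, chi-square style. A minimal sketch (the bin count and the threshold mentioned in the comment are illustrative choices of mine, not a rigorous test):

```python
import random

random.seed(17)  # fixed seed so the check is reproducible

draws, bins = 100_000, 10
counts = [0] * bins
for _ in range(draws):
    counts[int(random.random() * bins)] += 1  # random() is in [0, 1)

# Chi-square statistic against the uniform expectation
expected = draws / bins
chi2 = sum((c - expected) ** 2 / expected for c in counts)

# With 9 degrees of freedom, values far above ~20-30 would be suspicious
print(round(chi2, 2))
```

A broken generator tends to fail even such a simple check spectacularly; a passing result is of course no proof of quality.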
A Warm-Up Question

As a warm-up to demonstrate how simulations can help us analyze data, consider the following example. We are given a data set with the results of eight tosses of a coin: six Heads and two Tails. Given this data, would we say the coin is biased?

FIGURE 12-1. The likelihood function p⁶(1 − p)² of observing six Heads and two Tails in eight tosses of a coin, as a function of the coin's "balance parameter" p.

The problem is that the data set is small—if there had been 80,000 tosses of which 60,000 came out Heads, then we would have no doubt that the coin was biased. But with just eight tosses, it seems plausible that the imbalance in the results might be due to chance alone—even with a fair coin.

It was for precisely this kind of question that formal statistical methods were developed. We could invoke a classical frequentist point of view and calculate the probability of obtaining six or more Heads in eight tosses of a fair coin (i.e., six or more successes in eight Bernoulli trials with p = 0.5). The probability comes out to 37/256 ≈ 0.14, which is not enough to "reject the null hypothesis (that the coin is fair) at the 5 percent level." Alternatively, we could adopt a Bayesian viewpoint and evaluate the appropriate likelihood function for the given data set with a noninformative prior (see Figure 12-1). The graph suggests that the coin is not balanced.

But what if we have forgotten how to evaluate either quantity, or (more likely!) if we are dealing with a problem more intricate than the one in this example, so that we know neither the appropriate model to choose nor the form of the likelihood function? Can we find a quick way to make progress on the question we started with?

Given the topic of this chapter, the answer is easy. We can simulate tosses of a coin, for various degrees of imbalance, and then compare the simulation results to our data set.
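(Incidentally, the exact figure of 37/256 quoted above is easy to confirm by summing the binomial probabilities for six, seven, and eight Heads:

```python
from math import comb

# P(six or more Heads in eight tosses of a fair coin):
# C(8,6) + C(8,7) + C(8,8) = 28 + 8 + 1 = 37 outcomes out of 2^8
p = sum(comb(8, k) for k in range(6, 9)) / 2**8
print(p)  # → 0.14453125, i.e., 37/256
```

This uses `math.comb`, available in Python 3.8 and later.)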
import random

repeats, tosses = 60, 8

def heads( tosses, p ):
    h = 0
    for x in range( 0, tosses ):
        if random.random() < p:
            h += 1
    return h

p = 0
while p < 1.01:
    for t in range( 0, repeats ):
        print p, "\t", heads( tosses, p )
    p += 0.05

FIGURE 12-2. Results of 60 simulation runs, each consisting of eight tosses of a coin, for different values of the coin's "balance parameter" p. Shown are the number of Heads observed in each run. Although a slight balance toward Heads (p ≈ 0.7) seems most probable, note that as many as six Heads can occasionally be observed even with a coin that is balanced toward Tails.

The program is trivial to write, and the results, in the form of a jitter plot, are shown in Figure 12-2. (For each value of the parameter p, which controls the imbalance of the coin, we have performed 60 repeats of 8 tosses each and counted the number of Heads in each repeat.) The figure is quite clear: for p = 0.5 (i.e., a balanced coin), it is pretty unlikely to obtain six or more Heads, although not at all impossible. On the other hand, given that we have observed six Heads, we would expect the parameter to fall into the range p = 0.6, ..., 0.7. We have thus not only answered the question we started with but also given it some context. The simulation therefore not only helped us understand the actual data set but also allowed us to explore the system that produced it. Not bad for 15 lines of code.

Monte Carlo Simulations

The term Monte Carlo simulation is frequently used to describe any method that involves the generation of random points as input for subsequent operations. Monte Carlo techniques are a major topic all by themselves. Here, I only want to sketch two applications that are particularly relevant in the context of data analysis and modeling.
First, simulations allow us to verify analytical work and to experiment with it further; second, simulations are a way of obtaining results from models for which analytical solutions are not available.

Combinatorial Problems

Many basic combinatorial problems can be solved exactly—but obtaining a solution is often difficult. Even when one is able to find a solution, it is surprisingly easy to arrive at incorrect conclusions, missing factors like 1/2 or 1/n! and so on. And lastly, it takes only innocuous-looking changes to a problem formulation to render the problem intractable. In contrast, simulations for typical combinatorial problems are often trivially easy to write. Hence they are a great way to validate theoretical results, and they can be extended to explore problems that are not tractable otherwise. Here are some examples of questions that can be answered easily in this way:

• If we place n balls into n boxes, what is the probability that no more than two boxes contain two or more balls? What if I told you that exactly m boxes are empty? What if at most m boxes are empty?

• If we try keys from a key chain containing n different keys, how many keys will we have to try before finding the one that fits the lock? How is the answer different if we try keys randomly (with replacement) as opposed to in order (without replacement)?

• Suppose an urn contains 2n tokens consisting of n pairs of items. (Each item is marked in such a way that we can tell to which pair it belongs.) Repeatedly select a single token from the urn and put it aside. Whenever the most recently selected token is the second item from a pair, take both items (i.e., the entire pair) and return them to the urn. How many "broken pairs" will you have set aside on average? How does the answer change if we care about triples instead of pairs? What fluctuations can we expect around the average value?
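The key-chain question from the list above shows how short such a simulation can be. A sketch (the function names are mine): trying the keys in a random order without repeats needs (n + 1)/2 attempts on average, whereas picking a random key each time, with replacement, needs n.

```python
import random

def tries_without_replacement(n):
    """Shuffle the keys and try them in order until the right one turns up."""
    keys = list(range(n))
    random.shuffle(keys)
    return keys.index(0) + 1        # key 0 is the one that fits

def tries_with_replacement(n):
    """Pick a random key each time, possibly repeating earlier attempts."""
    count = 1
    while random.randrange(n) != 0:
        count += 1
    return count

random.seed(1)
n, trials = 10, 20_000
avg_without = sum(tries_without_replacement(n) for _ in range(trials)) / trials
avg_with = sum(tries_with_replacement(n) for _ in range(trials)) / trials
print(avg_without, avg_with)  # expect roughly 5.5 and 10 for n = 10
```

Changing the rules of the game (say, discarding keys that almost fit) is a one-line edit here, whereas it might wreck a combinatorial derivation.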
The last problem is a good example of the kind of problem for which the simple case (the average number of broken pairs) is fairly easy to solve but that becomes rapidly more complicated as we make seemingly small modifications to the original problem (e.g., going from pairs to triples). In a simulation, however, such changes do not pose any special difficulties.

Another way that simulations can be helpful concerns situations that appear unfamiliar or even paradoxical. Simulations allow us to see how the system behaves and thereby to develop intuition for it. We already encountered an example in the Workshop section of Chapter 9, where we studied probability distributions without expectation values. Let's look at another example.

Suppose we are presented with a choice of three closed envelopes. One envelope contains a prize; the other two are empty. After we have selected an envelope, it is revealed that one of the envelopes that we had not selected is empty. We are now permitted to choose again. What should we do? Stick with our initial selection? Randomly choose between the two remaining envelopes? Or pick the remaining envelope—that is, not the one that we selected initially and not the one that has been opened?

This is a famous problem, sometimes known as the "Monty Hall Problem" (after the host of a game show that featured a similar game). As it turns out, the last strategy (always switch to the remaining envelope) is the most beneficial. The problem appears paradoxical because the additional information that is revealed (that an envelope we did not select is empty) does not seem to be useful in any way. How can this information affect the probability that our initial guess was correct?

The argument goes as follows. Our initial selection is correct with probability p = 1/3 (because one envelope among the original three contains the prize).
If we stick with our original choice, then we should therefore have a 33 percent chance of winning. On the other hand, if in our second choice we choose randomly from the remaining options (meaning that we are as likely to pick the initially chosen envelope as the remaining one), then we will select the correct envelope with probability p = 1/2 (because now one out of two envelopes contains the prize). A random choice is therefore better than staying put! But this is still not the best strategy. Remember that our initial choice had only a p = 1/3 probability of being correct—in other words, it has probability q = 2/3 of being wrong. The additional information (the opening of an empty envelope) does not change this probability, but it removes all alternatives. Since our original choice is wrong with probability q = 2/3 and since now there is only one other envelope remaining, switching to this remaining envelope should lead to a win with 66 percent probability!

I don't know about you, but this is one of those cases where I had to "see it to believe it." Although the argument above seems compelling, I still find it hard to accept. The program in the following listing helped me do exactly that.

import sys
import random as rnd

strategy = sys.argv[1]   # must be 'stick', 'choose', or 'switch'

wins = 0
for trial in range( 1000 ):
    # The prize is always in envelope 0 ... but we don't know that!
    envelopes = [0, 1, 2]
    first_choice = rnd.choice( envelopes )

    if first_choice == 0:
        envelopes = [0, rnd.choice( [1, 2] )]  # Randomly retain 1 or 2
    else:
        envelopes = [0, first_choice]          # Retain winner and first choice

    if strategy == 'stick':
        second_choice = first_choice
    elif strategy == 'choose':
        second_choice = rnd.choice( envelopes )
    elif strategy == 'switch':
        envelopes.remove( first_choice )
        second_choice = envelopes[0]

    # Remember that the prize is in envelope 0
    if second_choice == 0:
        wins += 1

print wins

The program reads our strategy from the command line: the possible choices are stick, choose, and switch. It then performs a thousand trials of the game. The "prize" is always in envelope 0, but we don't know that. Only if our second choice equals envelope 0 do we count the game as a win. The results from running this program are consistent with the argument given previously: stick wins in one third of all trials, choose wins half the time, but switch amazingly wins in two thirds of all cases.

Obtaining Outcome Distributions

Simulations can be helpful for verifying combinatorial results, but the primary reason for using simulations is that they allow us to obtain results that are not available analytically. To arrive at an analytical solution for a model, we usually have to make simplifying assumptions. One particularly common one is to replace all random quantities with their most probable values (the mean-field approximation; see Chapter 8). This allows us to solve the model, but we lose information about the distribution of outcomes. Simulations are a way of retaining the effects of randomness when determining the consequences of a model.

Let's return to the case study discussed at the end of Chapter 9. We had a visitor population making visits to a certain website. Because individual visitors can make repeat visits, the number of unique visitors grows more slowly than the number of total visitors.
We found an expression for the number of unique visitors over time but had to make some approximations in order to make progress. In particular, we assumed that the number of total visitors per day would be the same every day and equal to the average number of visitors per day. (We also assumed that the fraction of actual repeat visitors on any given day would equal the fraction of repeat visitors in the total population.) Both of these assumptions are of precisely the nature discussed earlier: we replaced what in reality is a random quantity with its most probable value. These approximations made the problem tractable, but we lost all sense of the accuracy of the result. Let's see how simulations can help provide additional insight into this situation.

The solution that we found in Chapter 9 was a model: an analytical (mean-field) model. The short program that follows is another model of the same system, but this time it is a simulation model. It is a model in the sense that again everything that is not absolutely essential has been stripped away: there is no website, no actual visits, no browsing behavior. But the model retains two aspects that are important and that were missing from the mean-field model. First, the number of visitors per day is no longer fixed; instead, it is distributed according to a Gaussian distribution. Second, we have a notion of individual visitors (as elements of the list has_visited), and on every "day" we make a random selection from this set of visitors to determine who does visit on this day and who does not.
import random as rnd

n = 1000  # total visitors
k = 100   # avg visitors per day
s = 50    # daily variation

def trial():
    visitors_for_day = [0]   # No visitors on day 0
    has_visited = [0]*n      # A flag for each visitor

    for day in range( 31 ):
        visitors_today = max( 0, int(rnd.gauss( k, s )) )

        # Pick the individuals who visited today and mark them
        for i in rnd.sample( range( n ), visitors_today ):
            has_visited[i] = 1

        # Find the total number of unique visitors so far
        visitors_for_day.append( sum(has_visited) )

    return visitors_for_day

for t in range( 25 ):
    r = trial()
    for i in range( len(r) ):
        print i, r[i]
    print
    print

FIGURE 12-3. Unique visitors as a function of time: results from the simulation run, together with predictions from the analytical model. All data points are jittered horizontally to minimize overplotting. The solid line is the most probable number of visitors according to the model; the dashed lines indicate a confidence band.

The program performs 25 trials, where each trial consists of a full, 31-day month of visits. For each day, we find the number of visitors for that day (which must be a nonnegative integer) and then randomly select the same number of "visitors" from our list of visitors, setting a flag to indicate that they have visited. Finally, we count the number of visitors that have the flag set and print this number (which is the number of unique visitors so far) for each day. The results are shown in Figure 12-3.

Figure 12-3 also includes results from the analytical model. In Chapter 9, we found that the number of unique visitors on day t was given by:

    n(t) = N ( 1 − e^(−kt/N) )

where N is the total number of visitors (N = 1,000 in the simulation) and k is the average number of visitors per day (k = 100 in the simulation). Accordingly, the solid line in Figure 12-3 is given by n(t) = 1000 ( 1 − exp(−100 t / 1000) ).
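As a point of reference, this mean-field prediction is easy to evaluate directly; with N = 1000 and k = 100, it expects roughly 955 of the 1,000 possible visitors to have appeared by the end of the 31-day month:

```python
from math import exp

def unique_visitors(t, N=1000, k=100):
    """Mean-field prediction n(t) = N * (1 - exp(-k*t/N))."""
    return N * (1.0 - exp(-k * t / N))

print(round(unique_visitors(31)))  # → 955
```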
The simulation includes a parameter that was not part of the analytical model—namely, the width s of the daily fluctuations in visitors. I have chosen the value s = 50 for the simulation runs. The dashed lines in Figure 12-3 show the analytical model with values of k ± s/2 (i.e., k = 75 and k = 125) to provide a sense of the predicted spread, according to the mean-field model.

First of all, we should note that the analytical model agrees very well with the data from the simulation run: that's a nice confirmation of our previous result! But we should also note the differences; in particular, the simulation results are consistently higher than the theoretical predictions. If we think about this for a moment, it makes sense. If on any day there are unusually many visitors, then this irrevocably bumps the number of unique visitors up: the number of unique visitors can never shrink, so any outlier above the average can never be neutralized (in contrast to an outlier below the average, which can be compensated for by any subsequent high-traffic day).

We can further analyze the data from the simulation run, depending on our needs. For instance, we can calculate the most probable value for each day, and we can estimate proper confidence intervals around it. (We will need more than 25 trials to obtain a good estimate of the latter.) What is more interesting about the simulation model developed here is that we can use it to obtain additional information that would be difficult or impossible to calculate from the analytical formula. For example, we may ask for the distribution of visits per user (i.e., how many users have visited once, twice, three times, and so on). The answer to this question is just a snap of the fingers away! We can also extend the model and ask for the number of unique visitors who have paid two or more visits (not just one).
(For two visits per person, this question can be answered within the framework of the original analytical model, but the calculations rapidly become more tedious as we ask for higher visit counts per person.) Finally, we can extend the simulation to include features not included in the analytical model at all. For instance, for a real website, not all possible visitors are equally likely to visit: some individuals will have a higher probability of visiting the website than others. It would be very difficult to incorporate this kind of generalization into the approach taken in Chapter 9, because it contradicts the basic assumption that the fraction of actual repeat visitors equals the fraction of repeat visitors in the total population. But it is not at all difficult to model this behavior in a simulation model!

Pro and Con

Basic simulations of the kind discussed in this section are often easy to program—certainly as compared with the effort required to develop nontrivial combinatorial arguments! Moreover, when we start writing a simulation project, we can be fairly certain of being successful in the end, whereas there is no guarantee that an attempt to find an exact answer to a combinatorial problem will lead anywhere.

On the other hand, we should not forget that a simulation produces numbers, not insight! A simulation is always only one step in a larger process, which must include a proper analysis of the results from the simulation run and, ideally, also involves an attempt to incorporate the simulation data into a larger conceptual model. I always get a little uncomfortable when presented with a bunch of simulation results that have not been fit into a larger context. Simulations cannot replace analytical modeling. In particular, simulations do not yield the kind of insight into the mechanisms driving certain developments that a good analytical model affords.
For instance, recall the case study near the end of Chapter 8, in which we tried to determine the optimal number of servers. One important insight from that model was that the probability pₙ of a total failure dropped extremely rapidly as the number n of servers increased: the exponential decay (with n) is much more important than the reliability p of each individual server. (In other words, redundant commodity hardware beats expensive supercomputers—at least for situations in which this simplified cost model holds!) This is the kind of insight that would be difficult to gain simply by looking at results from simulation runs.

Simulations can be valuable for verifying analytical work and for extending it by incorporating details that would be difficult or impossible to treat in an analytical model. At the same time, the benefit that we can derive from simulations is enhanced by the insight gained from the analytical, conceptual modeling of the mechanisms driving a system. The two methods are complementary—although I will give primacy to analytical work. Analytical models without simulation may be crude but will still yield insight, whereas simulations without analysis produce only numbers, not insight.

Resampling Methods

Imagine you have taken a sample of n points from some population. It is now a trivial exercise to calculate the mean from this sample. But how reliable is this mean? If we repeatedly took new samples (of the same size) from the population and calculated their means, how much would the various values for the mean jump around?

This question is important. A point estimate (such as the mean by itself) is not very powerful: what we really want is an interval estimate, which also gives us a sense of the reliability of the answer. If we could go back and draw additional samples, then we could obtain the distribution of the mean directly as a histogram of the observed means.
But that is not an option: all we have are the n data points of the original sample. Much of classical statistics deals with precisely this question: how can we make statements about the reliability of an estimate based only on a set of observations? To make progress, we need to make some assumptions about the way values are distributed. This is where the sampling distributions of classical statistics come in: all those Normal, t, and chi-square distributions (see Chapter 10). Once we have a theoretical model for the way points are distributed, we can use this model to establish confidence intervals. Being able to make such statements is one of the outstanding achievements of classical statistics, but at the same time, the difficulties in getting there are a major factor in making classical statistics seem so obscure. Two problems stand out:

• Our assumptions about the shape of those distributions may not be correct, or we may not be able to formulate those distributions at all—in particular, if we are interested in more complicated quantities than just the sample mean or if we are dealing with populations that are ill behaved (i.e., not even remotely Gaussian).

• Even if we know the sampling distribution, determining confidence limits from it may be tedious, opaque, and error-prone.

The Bootstrap

The bootstrap is an alternative approach for finding confidence intervals and similar quantities directly from the data. Instead of making assumptions about the distribution of values and then employing theoretical arguments, the bootstrap goes back to the original idea: what if we could draw additional samples from the population? We can't go back to the original population, but the sample that we already have should be a fairly good approximation to the overall population. We can therefore create additional samples (also of size n) by sampling with replacement from the original sample.
For each of these “synthetic” samples, we can calculate the mean (or any other quantity, of course) and then use this set of values for the mean to determine a measure of the spread of its distribution via any standard method (e.g., we might calculate its inter-quartile range; see Chapter 2). Let’s look at an example—one that is simple enough that we can work out the analytical answer and compare it directly to the bootstrap results. We draw n = 25 points from a standard Gaussian distribution (with mean μ = 0 and standard deviation σ = 1). We then ask about the (observed) sample mean and, more importantly, about its standard error. In this case, the answer is simple: we know that the error of the mean is σ/√n (see Chapter 11), which amounts to 1/5 here. This is the analytical result. To find the bootstrap estimate for the standard error, we draw 100 samples, each containing n = 25 points, from our original sample of 25 points. Points are drawn randomly with replacement (so that each point can be selected multiple times). For each of these bootstrap samples, we calculate the mean. Now we ask: what is the spread of the distribution of these 100 bootstrap means? The data is plotted in Figure 12-4. At the bottom, we see the 25 points of the original data sample; above that, we see the means calculated from the 100 bootstrap samples. (All points are jittered vertically to minimize overplotting.) In addition, the figure shows kernel density estimates (see Chapter 2) of the original sample and also of the bootstrap means. The latter is the answer to our original question: if we repeatedly took samples from the original distribution, the sample means would be distributed similarly to the bootstrap means.
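The whole procedure can be sketched in a few lines of standard-library Python (the seed is an arbitrary choice for reproducibility; this is an illustration of the idea, not the book's own listing):

```python
import random
import statistics

random.seed(42)  # any seed will do; fixed here for reproducibility

n = 25
sample = [random.gauss(0.0, 1.0) for _ in range(n)]  # the "original" sample

# draw 100 bootstrap samples (with replacement) and record their means
boot_means = []
for _ in range(100):
    resample = [random.choice(sample) for _ in range(n)]
    boot_means.append(statistics.mean(resample))

# the spread of the bootstrap means estimates the standard error of the mean
print(statistics.stdev(boot_means))  # should be in the vicinity of 1/5
```

The printed value will not equal 0.2 exactly: it estimates σ̂/√n from the observed sample, and both the sample and the resampling are random.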
(Because in this case we happen to know the original distribution, we can also plot both it and the theoretical distribution of the mean, which happens to be Gaussian as well but with a reduced standard deviation of σ/√n. As we would expect, the theoretical distributions agree reasonably well with the kernel density estimates calculated from the data.)

FIGURE 12-4. The bootstrap. The points in the original sample are shown at the bottom; the means calculated from the bootstrap samples are shown above. Also displayed are the original distribution and the distribution of the sample means, both using the theoretical result and a kernel density estimate from the corresponding samples.

Of course, in this example the bootstrap procedure was not necessary. It should be clear, however, that the bootstrap provides a simple method for obtaining confidence intervals even in situations where theoretical results are not available. For instance, if the original distribution had been highly skewed, then the Gaussian assumption would have been violated. Similarly, if we had wanted to calculate a more complicated quantity than the mean, analytical results might have been hard to obtain. Let me repeat this, because it’s important: bootstrapping is a method to estimate the spread of some quantity. It is not a method to obtain “better” estimates of the original quantity itself—for that, it is necessary to obtain a larger sample by making additional drawings from the original population. The bootstrap is not a way to give the appearance of a larger sample size by reusing points!

When Does Bootstrapping Work?
As we have seen, the bootstrap is a simple, practical, and relatively transparent method to obtain confidence intervals for estimated quantities. This raises the question: when does it work? The following two conditions must be fulfilled.

1. The original sample must provide a good representation of the entire population.
2. The estimated quantity must depend “smoothly” on the data points.

The first condition requires the original sample to be sufficiently large and relatively clean. If the sample size is too small, then the original estimate for the actual quantity in question (the mean, in our example) won’t be very good. (Bootstrapping in a way exacerbates this problem because data points have a greater chance of being reused repeatedly in the bootstrap samples.) In other words, the original sample has to be large enough to allow meaningful estimation of the primary quantity. Use common sense and insight into your specific application area to establish the required sample size for your situation. Additionally, the sample has to be relatively clean: crazy outliers, for instance, can be a problem. Unless the sample size is very large, outliers have a significant chance of being reused in a bootstrap sample, distorting the results. Another problem exists in situations involving power-law distributions. As we saw in Chapter 9, estimated values for such distributions may not be unique but depend on the sample size. Of course, the same considerations apply to bootstrap samples drawn from such distributions. The second condition suggests that bootstrapping does not work well for quantities that depend critically on only a few data points. For example, we may want to estimate the maximum value of some distribution. Such an estimate depends critically on the largest observed value—that is, on a single data point. For such applications, the bootstrap is not suitable.
(In contrast, the mean depends on all data points, with equal weight.) Another question concerns the number of bootstrap samples to take. The short answer is: as many as you need to obtain a sufficiently good estimate for the spread you are calculating. If the number of points in the original sample is very small, then creating too many bootstrap samples is counterproductive because you will be regenerating the same bootstrap samples over and over again. However, for reasonably sized samples, this is not much of a problem, since the number of possible bootstrap samples grows very quickly with the number of data points n in the original sample. Therefore, it is highly unlikely that the same bootstrap sample is generated more than once—even if we generate thousands of bootstrap samples. The following argument will help to develop a sense for the orders of magnitude involved. The problem of choosing n data points with replacement from the original n-point sample is equivalent to assigning n elements to n cells. It is a classical problem in occupancy theory to show that there are

(2n − 1 choose n) = (2n − 1)! / ( n! (n − 1)! )

ways of doing this. This number grows extremely quickly: for n = 5 it is 126, for n = 10 we have 92,378, but for n = 20 it already exceeds 10^10. (The usual proof proceeds by observing that assigning r indistinguishable objects to n bins is equivalent to aligning r objects and n − 1 bin dividers. There are r + n − 1 spots in total, which can be occupied by either an object or a divider, and the assignment amounts to choosing r of these spots for the r objects. The number of ways one can choose r elements out of n + r − 1 is given by the binomial coefficient (r + n − 1 choose r). Since in our case r = n, we find that the number of different bootstrap samples is given by the expression above.)

Bootstrap Variants

There are a few variants of the basic bootstrap idea.
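These counts are easy to verify with the standard library (a quick illustration, not from the book's listings):

```python
from math import comb

def n_bootstrap_samples(n):
    # number of distinct bootstrap samples, i.e., multisets of size n
    # drawn from n points: C(2n - 1, n)
    return comb(2 * n - 1, n)

print(n_bootstrap_samples(5))             # 126
print(n_bootstrap_samples(10))            # 92378
print(n_bootstrap_samples(20) > 10**10)   # True
```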
The method so far—in which points are drawn directly from the original sample—is known as the nonparametric bootstrap. An alternative is the parametric bootstrap: in this case, we assume that the original population follows some particular probability distribution (such as the Gaussian), and we estimate its parameters (mean and standard deviation, in this case) from the original sample. The bootstrap samples are then drawn from this distribution rather than from the original sample. The advantage of the parametric bootstrap is that the bootstrap values do not have to coincide exactly with the known data points. In a similar spirit, we may use the original sample to compute a kernel density estimate (as an approximation to the population distribution) and then draw bootstrap samples from it. This method combines aspects of both parametric and nonparametric approaches: it is nonparametric (because it makes no assumptions about the form of the underlying population distribution), yet the bootstrap samples are not restricted to the values occurring in the original sample. In practice, neither of these variants seems to provide much of an advantage over the original idea (in part because the number of possible bootstrap samples grows so quickly with the number of points in the sample that choosing the bootstrap samples from only those points is not much of a restriction). Another idea (which historically predates the bootstrap) is the so-called jackknife. In the jackknife, we don’t draw random samples. Instead, given an original sample consisting of n data points, we calculate n estimates of the quantity of interest by successively omitting one of the data points from the sample. We can now use these n values in a similar way to the values calculated from bootstrap samples. Since the jackknife does not contain any random element, it is an entirely deterministic procedure.
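A minimal jackknife sketch for the mean (the data values are made up for illustration; for the mean, the jackknife standard error happens to coincide with the familiar s/√n):

```python
import statistics

data = [2.1, 2.4, 1.9, 2.6, 2.2, 2.8, 2.0, 2.5]  # illustrative values
n = len(data)

# the n leave-one-out estimates of the mean
jack = [statistics.mean(data[:i] + data[i + 1:]) for i in range(n)]

# jackknife estimate of the standard error:
# sqrt( (n-1)/n * sum of squared deviations of the leave-one-out estimates )
center = statistics.mean(jack)
se = ((n - 1) / n * sum((j - center) ** 2 for j in jack)) ** 0.5

print(se)  # matches statistics.stdev(data) / n**0.5 (up to rounding)
```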
Workshop: Discrete Event Simulations with SimPy

All the simulation examples that we considered so far were either static (coin tosses, Monty Hall problem) or extremely stripped down and conceptual (unique visitors). But if we are dealing with the behavior and time development of more complex systems—consisting of many different particles or actors that interact with each other in complicated ways—then we want a simulation that expresses all these entities in a manner that closely resembles the problem domain. In fact, this is probably exactly what most of us think of when we hear the term “simulation.” There are basically two different ways that we can set up such a simulation. In a continuous time simulation, time progresses in “infinitesimally” small increments. At each time step, all simulation objects are advanced while taking possible interactions or status changes into account. We would typically choose such an approach to simulate the behavior of particles moving in a fluid or a similar system. But in other cases, this model seems wasteful. For instance, consider customers arriving at a bank: in such a situation, we only care about the events that change the state of the system (e.g., customer arrives, customer leaves)—we don’t actually care what the customers do while waiting in line! For such systems we can use a different simulation method, known as discrete event simulation. In this type of simulation, time does not pass continuously; instead, we determine when the next event is scheduled to occur and then jump ahead to exactly that moment in time. Discrete event simulations are applicable to a wide variety of problems involving multiple users competing for access to a shared server. It will often be convenient to phrase the description in terms of the proverbial “customers arriving at a bank,” but exactly the same considerations apply, for instance, to messages on a computer network.
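The “jump to the next event” idea can be sketched with nothing more than a priority queue of (time, event) pairs (a toy illustration of the mechanism, not part of SimPy):

```python
import heapq

# pending events as (time, description) pairs, ordered by time stamp
events = [(100.0, "customer leaves"), (0.0, "customer arrives")]
heapq.heapify(events)

clock, log = 0.0, []
while events:
    clock, what = heapq.heappop(events)  # jump straight to the next event
    log.append((clock, what))

print(log)  # [(0.0, 'customer arrives'), (100.0, 'customer leaves')]
```

A real framework like SimPy adds the bookkeeping that lets handlers schedule further events while the loop runs, but the core loop is exactly this.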
Introducing SimPy

The SimPy package (http://simpy.sourceforge.net/) is a Python project to build discrete event simulation models. The framework handles all the event scheduling and messaging “under the covers” so that the programmer can concentrate on describing the behavior of the actors in the simulation. All actors in a SimPy simulation must be subclasses of the class Process. Congestion points where queues form are modeled by instances of the Resource class or its subclasses. Here is a short example, which describes a customer visiting a bank:

    from SimPy.Simulation import *

    class Customer( Process ):
        def doit( self ):
            print "Arriving"
            yield request, self, bank

            print "Being served"
            yield hold, self, 100.0

            print "Leaving"
            yield release, self, bank

    # Beginning of main simulation program
    initialize()

    bank = Resource()
    cust = Customer()
    cust.start( cust.doit() )

    simulate( until=1000 )

Let’s skip the class definition of the Customer object for now and concentrate on the rest of the program. The first function to call in any SimPy program is the initialize() method, which sets up the simulation run and sets the “simulation clock” to zero. We then proceed to create a Resource object (which models the bank) and a single Customer object. After creating the Customer, we need to activate it via the start() member function. The start() function takes as argument the function that will be called to advance the Customer through its life cycle (we’ll come back to that). Finally, we kick off the actual simulation, requiring it to stop after 1,000 time steps on the simulation clock have passed. The Customer subclasses Process; therefore its instances are active agents, which will be scheduled by the framework to receive events. Each agent must define a process execution method (PEM), which defines its behavior and which will be invoked by the framework whenever an event occurs.
For the Customer class, the PEM is the doit() function. (There are no restrictions on its name—it can be called anything.) The PEM describes the customer’s behavior: after the customer arrives, the customer requests a resource instance (the bank in this case). If the resource is not available (because it is busy, serving other customers), then the framework will add the customer to the waiting list (the queue) for the requested resource. Once the resource becomes available, the customer is served. In this simple example, the service time is a fixed value of 100 time units, during which the customer instance is holding—just waiting until the time has passed. When service is complete, the customer releases the resource instance. Since no additional actions are listed in the PEM, the customer is not scheduled for future events and will disappear from the simulation. Notice that the Customer interacts with the simulation environment through Python yield statements, using special yield expressions of the form shown in the example. Yielding control back to the framework in this way ensures that the Customer retains its state and its current spot in the life cycle between invocations. Although there are no restrictions on the name and argument list permissible for a PEM, each PEM must contain at least one of these special yield statements. (But of course not necessarily all three, as in this case; we are free to define the behavior of the agents in our simulations at will.)

The Simplest Queueing Process

Of course, the previous example, which involved only a single customer entering and leaving the bank, is not very exciting—we hardly needed a simulation for that! Things change when we have more than one customer in the system at the same time. The listing that follows is very similar to the previous example, except that now there is an infinite stream of customers arriving at the bank and requesting service.
To generate this infinite sequence of customers, the listing makes use of an idiom that’s often used in SimPy programs: a “source” (the CustomerGenerator instance).

    from SimPy.Simulation import *
    import random as rnd

    interarrival_time = 10.0
    service_time = 8.0

    class CustomerGenerator( Process ):
        def produce( self, b ):
            while True:
                c = Customer( b )
                c.start( c.doit() )
                yield hold, self, rnd.expovariate(1.0/interarrival_time)

    class Customer( Process ):
        def __init__( self, resource ):
            Process.__init__( self )
            self.bank = resource

        def doit( self ):
            yield request, self, self.bank
            yield hold, self, self.bank.servicetime()
            yield release, self, self.bank

    class Bank( Resource ):
        def servicetime( self ):
            return rnd.expovariate(1.0/service_time)

    initialize()

    bank = Bank( capacity=1, monitored=True, monitorType=Monitor )

    src = CustomerGenerator()
    activate( src, src.produce( bank ) )

    simulate( until=500 )

    print bank.waitMon.mean()
    print
    for evt in bank.waitMon:
        print evt[0], evt[1]

The CustomerGenerator is itself a subclass of Process and defines a PEM (produce()). Whenever it is triggered, it generates a new Customer and then goes back to sleep for a random amount of time. (The time is distributed according to an exponential distribution—we will discuss this particular choice in a moment.) Notice that we don’t need to keep track of the Customer instances explicitly: once they have been activated using the start() member function, the framework ensures that they will receive scheduled events.

There are two changes to the Customer class. First of all, we explicitly inject the resource to request (the bank) as an additional argument to the constructor. By contrast, the Customer in the previous example found the bank reference via lookup in the global namespace.
That’s fine for small programs but becomes problematic for larger ones—especially if there is more than one resource that may be requested. The second change is that the Customer now asks the bank for the service time. This is in the spirit of problem domain modeling—it’s usually the server (in this case, the bank) that controls the time it takes to complete a transaction. Accordingly, we have introduced Bank as a subclass of Resource in order to accommodate this additional functionality. (The service time is also exponentially distributed but with a different wait time than that used for the CustomerGenerator.) Subtypes of the Process class are used to model actors in a SimPy simulation. Besides these active simulation objects, the next most important abstraction describes congestion points, modeled by the Resource class and its subclasses. Each Resource instance models a shared resource that actors may request, but its more important function is to manage the queue of actors currently waiting for access. Each Resource instance consists of a single queue and one or more actual “server units” that can fulfill client requests. Think of the typical queueing discipline followed in banks and post offices (in the U.S.—other countries have different conventions!): a single line but multiple teller windows, with the person at the head of the line moving to the next available window. That is the model represented by each Resource instance. The number of server units is controlled through the keyword argument capacity to the Resource constructor. Note that all server units in a single Resource instance are identical. Server units are also “passive”: they have no behavior themselves. They only exist so that a Process object can acquire them, hold them for a period of time, and then release them (like a mutex). Although a Resource instance may have multiple server units, it can contain only a single queue.
If you want to model a supermarket checkout situation, where each server unit has its own queue, you therefore need to set up multiple Resource instances, each with capacity=1: one for each checkout stand and each managing its own queue of customers. For each Resource instance, we can monitor the length of the queue and the events that change it (arrivals and departures) by registering an observer object with the Resource. There are two types of such observers in SimPy: a Monitor records the time stamp and new queue length for every event that affects the queue, whereas a Tally only keeps enough information to calculate summary information (such as the average queue length). Here we have registered a Monitor object with the Bank. (We’ll later see an example of a Tally.) As before, we run the simulation until the internal simulation clock reaches 500. The CustomerGenerator produces an infinite stream of Customer objects, each requesting service from the Bank, while the Monitor records all changes to the queue.

FIGURE 12-5. Number of customers in queue over time.

After the simulation has run to completion, we retrieve the Monitor object from the Bank: if an observer had been registered with a Resource, then it is available in the waitMon member variable. We print out the average queue length over the course of the simulation as well as the full time series of events. (The Monitor class is a list subclass, so we can iterate over it directly.) The time evolution of the queue is shown in Figure 12-5. One last implementation detail: if you look closely, you will notice that the CustomerGenerator is activated using the standalone function activate(). This function is an alternative to the start() member function of all Process objects and is entirely equivalent to it.
Optional: Queueing Theory

Now that we have seen some of these concepts in action already, it is a good time to step back and fill in some theory. A queue is a specific example of a stochastic process. In general, the term “stochastic process” refers to a sequence of random events occurring in time. In the queueing example, customers are joining or leaving the queue at random times, which makes the queue grow and shrink accordingly. Other examples of stochastic processes include random walks, the movement of stock prices, and the inventory levels in a store. (In the latter case, purchases by customers and possibly even deliveries by suppliers constitute the random events.)

In a queueing problem, we are concerned only about arrivals and departures. A particularly important special case assumes that the rate at which customers arrive is constant over time and that arrivals at different times are independent of each other. (Notice that these are reasonable assumptions in many cases.) These two conditions imply that the number of arrivals during a certain time period t follows a Poisson distribution, since the Poisson distribution

p(k, t, λ) = ((λt)^k / k!) e^(−λt)

gives the probability of observing k Successes (arrivals, in our case) during an interval of length t if the “rate” of Successes is λ (see Chapter 9). Another consequence is that the times between arrivals are distributed according to an exponential distribution:

p(t, λ) = λ e^(−λt)

The mean of the exponential distribution can be calculated without difficulty and equals 1/λ. It will often be useful to work with its inverse ta = 1/λ, the average interarrival time. (It’s not hard to show that interarrival times are distributed according to the exponential distribution when the number of arrivals per time interval follows a Poisson distribution. Assume that an arrival occurred at t = 0.
Now we ask for the probability that no arrival has occurred by t = T; in other words,

p(0, T, λ) = e^(−λT)

because x^0 = 1 and 0! = 1. Conversely, the probability that the next arrival will have occurred sometime between t = 0 and t = T is 1 − p(0, T, λ). This is the cumulative distribution function for the interarrival time, and from it, we find the probability density for an arrival to occur at t as

d/dt (1 − p(0, t, λ)) = λ e^(−λt) .)

The appearance of the exponential distribution as the distribution of interarrival times deserves some comment. At first glance, it may seem surprising because this distribution is greatest for small interarrival times, seemingly favoring very short intervals. However, this observation has to be balanced against the infinity of possible interarrival times, all of which may occur! What is more important is that the exponential distribution is in a sense the most “random” way that interarrival times can be distributed: no matter how long we have waited since the last arrival, the probability that the next visitor will arrive after t more minutes is always the same: p(t, λ) = λ e^(−λt). This property is often referred to as the lack of memory of the exponential distribution. Contrast this with a distribution of interarrival times that has a peak for some nonzero time: such a distribution describes a situation of scheduled arrivals, as we would expect to occur at a bus stop. In this scenario, the probability for an arrival to occur within the next t minutes will change with time. Because the exponential distribution arises naturally from the assumption of a constant arrival rate (and from the independence of different arrivals), we have used it as the distribution of interarrival times in the CustomerGenerator in the previous example. It is less of a natural choice for the distribution of service times (but it makes some theoretical arguments simpler).
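The relation between the rate λ and the average interarrival time 1/λ is easy to check numerically (the rate, seed, and time horizon below are arbitrary illustrative choices):

```python
import random

random.seed(1)  # reproducible run

lam = 0.1  # arrival rate: on average one arrival per 10 time units

# build a Poisson arrival stream by summing exponential interarrival gaps
t, arrivals = 0.0, []
while t < 1_000_000:
    t += random.expovariate(lam)
    arrivals.append(t)

gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
print(sum(gaps) / len(gaps))  # close to the average interarrival time 1/lam = 10
```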
The central question in all queueing problems concerns the expected length of the queue—not only how large it is but also whether it will settle down to a finite value at all, or whether it will “explode,” growing beyond all bounds. In the simple memoryless, single-server–single-queue scenario that we have been investigating, the only two control parameters are the arrival rate λa and the service or exit rate λe; or rather their ratio

u = λa / λe

which is the fraction of time the server is busy. The quantity u is the server’s utilization. It is intuitively clear that if the arrival rate is greater than the exit rate (i.e., if customers are arriving at a faster rate than the server can process them), then the queue length will explode. However, it turns out that even if the arrival rate equals the service rate (so that u = 1), the queue length still grows beyond all bounds. Only if the arrival rate is strictly lower than the service rate will we end up with a finite queue. Let’s see how this surprising result can be derived. Let pn be the probability of finding exactly n customers waiting in the queue. The rate at which the queue grows is λa, but the rate at which the queue grows from exactly n to exactly n + 1 is λa pn, since we must take into account the probability of the queue having exactly n members. Similarly, the probability of the queue shrinking from n + 1 to n members is λe pn+1. In the steady state (which is the requirement for a finite queue length), these two rates must be equal:

λa pn = λe pn+1

which we can rewrite as:

pn+1 = (λa/λe) pn = u pn

This relationship must hold for all n, and therefore we can repeat this argument and write pn = u pn−1 and so on. This leads to an expression for pn in terms of p0:

pn = u^n p0

The probability p0 is the probability of finding no customer in the queue—in other words, it is the probability that the server is idle.
Since the utilization is the probability for the server to be busy, the probability p0 for the server to be idle must be p0 = 1 − u. We can now ask about the expected length L of the queue. We already know that the queue has length n with probability pn = u^n p0. Finding the expected queue length L requires that we sum over all possible queue lengths, each one weighted by the appropriate probability:

L = Σ_{n=0}^∞ n pn = p0 Σ_{n=0}^∞ n u^n

Now we employ a trick that is often useful for sums of this form: observe that d/du u^n = n u^(n−1) and hence that n u^n = u (d/du) u^n. Using this expression in the sum for L leads to:

L = p0 Σ_{n=0}^∞ u (d/du) u^n
  = p0 u (d/du) Σ_{n=0}^∞ u^n
  = p0 u (d/du) 1/(1 − u)    (geometric series)
  = p0 u/(1 − u)^2
  = u/(1 − u)

where we have used the sum of the geometric series (see Appendix B) and the expression p0 = 1 − u. We can rewrite this expression directly in terms of the arrival and exit rates as:

L = u/(1 − u) = λa/(λe − λa)

This is a central result. It gives us the expected length of the queue in terms of the utilization (or in terms of the arrival and exit rates). For low utilization (i.e., an arrival rate that is much lower than the service rate or, equivalently, an interarrival time that is much larger than the service time), the queue is very short on average. (In fact, whenever the server is idle, then the queue length equals 0, which drags down the average queue length.) But as the arrival rate approaches the service rate, the queue grows in length and becomes infinite when the arrival rate equals the service rate. (An intuitive argument for why the queue length will explode when the arrival rate equals the service rate is that, in this case, the server never has the opportunity to “catch up.” If the queue becomes longer due to a chance fluctuation in arrivals, then this backlog will persist forever, since overall the server is only capable of keeping up with arrivals.
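The closed form can be sanity-checked against the defining sum Σ n pn with pn = (1 − u) u^n (a quick numerical check, not from the book):

```python
def L_formula(u):
    # closed-form expected queue length
    return u / (1.0 - u)

def L_sum(u, terms=5000):
    # truncate the infinite sum; u**n decays fast enough for u < 1
    return sum(n * (1 - u) * u**n for n in range(terms))

for u in (0.1, 0.5, 0.9):
    print(u, L_formula(u), L_sum(u))
```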
The cumulative effect of such chance fluctuations will eventually make the queue length diverge.)

Running SimPy Simulations

In this section, we will try to confirm the previous result regarding the expected queue length by simulation. In the process, we will discuss a few practical points of using SimPy to understand queueing systems.

First of all, we must realize that each simulation run is only a particular realization of the sequence of events. To draw conclusions about the system in general, we therefore always need to perform several simulation runs and average their results. In the previous listing, the simulation framework maintained its state in the global environment. Hence, in order to rerun the simulation, you had to restart the entire program! The program in the next listing uses an alternative interface that encapsulates the entire environment for each simulation run in an instance of class Simulation. The global functions initialize(), activate(), and simulate() are now member functions of this Simulation object. Each instance of the Simulation class provides a separate, isolated simulation environment. A completely new simulation run now requires only that we create a new instance of this class. The Simulation class is provided by SimPy.
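The result can also be cross-checked with a small, framework-free event-driven simulation (a sketch; the rates, seed, and run length are arbitrary choices for illustration):

```python
import random

random.seed(3)  # reproducible run

def mm1_mean_length(lam, mu, t_end):
    # event-driven single-server queue with exponential interarrival and
    # service times: time-average the number of customers in the system
    t, n, area = 0.0, 0, 0.0
    next_arr = random.expovariate(lam)
    next_dep = float("inf")            # no one in service yet
    while True:
        t_next = min(next_arr, next_dep, t_end)
        area += n * (t_next - t)       # accumulate queue-length "area"
        t = t_next
        if t == t_end:
            return area / t_end
        if next_arr <= next_dep:       # arrival
            n += 1
            next_arr = t + random.expovariate(lam)
            if n == 1:                 # server was idle: start service
                next_dep = t + random.expovariate(mu)
        else:                          # departure
            n -= 1
            next_dep = t + random.expovariate(mu) if n else float("inf")

# u = 0.05/0.1 = 0.5, so the theory predicts L = u/(1-u) = 1
print(mm1_mean_length(lam=0.05, mu=0.1, t_end=1_000_000))
```

The printed value fluctuates around 1 from seed to seed, in line with L = u/(1 − u).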
Using it does not require any changes to the previous program, except that the current instance of the Simulation class must be passed explicitly to all simulation objects (i.e., instances of Process and Resource and their subclasses):

    from SimPy.Simulation import *
    import random as rnd

    interarrival_time = 10.0

    class CustomerGenerator( Process ):
        def produce( self, bank ):
            while True:
                c = Customer( bank, sim=self.sim )
                c.start( c.doit() )
                yield hold, self, rnd.expovariate(1.0/interarrival_time)

    class Customer( Process ):
        def __init__( self, resource, sim=None ):
            Process.__init__( self, sim=sim )
            self.bank = resource

        def doit( self ):
            yield request, self, self.bank
            yield hold, self, self.bank.servicetime()
            yield release, self, self.bank

    class Bank( Resource ):
        def setServicetime( self, s ):
            self.service_time = s

        def servicetime( self ):
            return rnd.expovariate(1.0/self.service_time)

    def run_simulation( t, steps, runs ):
        for r in range( runs ):
            sim = Simulation()
            sim.initialize()

            bank = Bank( monitored=True, monitorType=Tally, sim=sim )
            bank.setServicetime( t )

            src = CustomerGenerator( sim=sim )
            sim.activate( src, src.produce( bank ) )

            sim.startCollection( when=steps//2 )
            sim.simulate( until=steps )

            print t, bank.waitMon.mean()

    t = 0
    while t <= 11.0:
        t += 0.5
        run_simulation( t, 100000, 10 )

Another important change is that we don’t start recording until half of the simulation time steps have passed (that’s what the startCollection() method is for). Remember that we are interested in the queue length in the steady state—for that reason, we don’t want to start recording until the system has settled down and any transient behavior has disappeared. To record the queue length, we now use a Tally object instead of a Monitor. The Tally will not allow us to replay the entire sequence of events, but since we are only interested in the average queue length, it is sufficient for our current purposes.
Finally, remember that as the utilization approaches u = 1 (i.e., as the service time approaches the interarrival time), we expect the queue length to become infinite. Of course, in any finite simulation it is impossible for the queue to grow to infinite length: the length of the queue is limited by the finite duration of the simulation run. The consequence of this observation is that, for utilizations near or above 1, the queue length that we will observe depends on the number of steps that we allow in the simulation. If we terminate the simulation too quickly, then the system will not have had time to truly reach its fully developed steady state, and so our results will be misleading.

Figure 12-6 shows the results obtained when running the example program with 1,000 and 100,000 simulation steps. For low utilization (i.e., short queue lengths), the results from both data sets agree with each other (and with the theoretical prediction). However, as the service time approaches the interarrival time, the short simulation run does not last long enough for the steady state to form, and so the observed queue lengths are too short.

FIGURE 12-6. Average queue length as a function of the service time for a fixed interarrival time of ta = 10. (Curves: 1k simulation steps, 100k simulation steps, and theory.)

Summary

This concludes our tour of discrete event simulation with SimPy. Of course, there is more to SimPy than mentioned here—in particular, there are two additional forms of resources: the Store and Level abstractions. Both of them not only encapsulate a queue but also maintain an inventory (of individual items for Store and of an undifferentiated amount for Level). This inventory can be consumed or replenished by simulation objects, allowing us to model inventory systems of various forms.
Other SimPy facilities to explore include asynchronous events, which can be received by simulation objects as they are waiting in a queue, as well as additional recording and tracing functionality. The project documentation provides further details.

Further Reading

• A First Course in Monte Carlo. George S. Fishman. Duxbury Press. 2005.
This book is a nice introduction to Monte Carlo simulations and includes many topics that we did not cover. Requires familiarity with calculus.

• Bootstrap Methods and Their Application. A. C. Davison and D. V. Hinkley. Cambridge University Press. 1997.
The bootstrap is actually a fairly simple and practical concept, but most books on it are very theoretical and difficult, including this one. But it is comprehensive and relatively recent.

• Applied Probability Models. Do Le Paul Minh. Duxbury Press. 2000.
The theory of random processes is difficult, and the results often don't seem commensurate with the amount of effort required to obtain them. This book (although possibly hard to find) is one of the more accessible ones.

• Introduction to Stochastic Processes. Gregory F. Lawler. Chapman & Hall/CRC. 2006.
This short book is much more advanced and theoretical than the previous one. The treatment is concise and to the point.

• Introduction to Operations Research. Frederick S. Hillier and Gerald J. Lieberman. 9th ed., McGraw-Hill. 2009.
The field of operations research encompasses a set of mathematical methods that are relevant for many problems arising in a business or industrial setting, including queueing theory. This text is a standard introduction.

• Fundamentals of Queueing Theory. Donald Gross, John F. Shortle, James M. Thompson, and Carl M. Harris. 4th ed., Wiley. 2008.
The standard textbook on queueing theory. Not for the faint of heart.
CHAPTER THIRTEEN

Finding Clusters

The term clustering refers to the process of finding groups of points within a data set that are in some way "lumped together." It is also called unsupervised learning—unsupervised because we don't know ahead of time where the clusters are located or what they look like. (This is in contrast to supervised learning or classification, where we attempt to assign data points to preexisting classes; see Chapter 18.)

I regard clustering as an exploratory method: a computer-assisted (or even computationally driven) approach to discovering structure in a data set. As an exploratory technique, it usually needs to be followed by a confirmatory analysis that validates the findings and makes them more precise.

Clustering is a lot of fun. It is a rich topic with a wide variety of different problems, as we will see in the next section, where we discuss the different kinds of clusters one may encounter. The topic also has a lot of intuitive appeal, and most clustering methods are rather straightforward. This allows for all sorts of ad hoc modifications and enhancements to accommodate the specific problem one is working on.

What Constitutes a Cluster?

Clustering is not a very rigorous field: there are precious few established results, rigorous theorems, or algorithmic guarantees. In fact, the whole notion of a "cluster" is not particularly well defined. Descriptions such as "groups of points that are similar" or "close to each other" are insufficient, because clusters must also be well separated from each other. Look at Figure 13-1: some points are certainly closer to each other than to other points, yet there are no discernible clusters. (In fact, it is an interesting exercise to define what constitutes the absence of clusters.)

FIGURE 13-1. A uniform point distribution. Any "clusters" that we may recognize are entirely spurious.
This leads to one possible definition of clusters: contiguous regions of high data point density separated by regions of lower point density. Although not particularly rigorous either, this description does seem to capture the essential elements of typical clusters. (For a different point of view, see the next section.)

The definition just proposed allows for very different kinds of clusters. Figures 13-2 and 13-3 show two very different types. Of course, Figure 13-2 is the "happy" case, showing a data set consisting of well-defined and clearly separated regions of high data point density. The clusters in Figure 13-3 are of a different type, one that is more easily thought of by means of nearest-neighbor (graph) relationships than by point density. Yet in this case as well, there are higher-density regions separated by lower-density regions—although we might want to exploit the nearest-neighbor relationship instead of the higher density when developing a practical algorithm for this case.

Clustering is not limited to points in space. Figures 13-4 and 13-5 show two rather different cases for which it nevertheless makes sense to speak of clusters. Figure 13-4 shows a bunch of street addresses. No two of them are exactly the same, but if we look closely, we will easily recognize that all of them can be grouped into just a few neighborhoods. Figure 13-5 shows a bunch of different time series: again, some of them are more alike than others. The challenge in both of these examples is finding a way to express the "similarity" among these nonnumeric, nongeometric objects!

Finally, we should keep in mind that clusters may have complicated shapes. Figure 13-6 shows two very well-behaved clusters as distinct regions of high point density. However, the complicated and intertwined shapes of the regions will challenge many commonly used clustering algorithms.
FIGURE 13-2. The "happy" case: three well-separated, globular clusters.

FIGURE 13-3. Examples of non-globular clusters in a smiley face. Some of the clusters are nested, meaning that they are entirely contained within other clusters.

A bit of terminology can help to distinguish different cluster shapes. If the line connecting any two points of a cluster lies entirely within the cluster itself (as in Figure 13-2), then the cluster is convex. This is the easiest shape to handle. Sometimes this is not the case, but we can still find at least one point (the center) such that the connecting line from the center to any other point lies entirely within the cluster: such a cluster is called star convex. Notice that the clusters in Figure 13-6 are neither convex nor star convex. Sometimes one cluster is entirely surrounded by another cluster without actually being part of it: in this case we speak of a nested cluster. Nested clusters can be particularly challenging (see Figure 13-3).

FIGURE 13-4. Clustering strings (street addresses such as "First Avenue 35," "Furst Avenue 33," and "Main Boulevrd 1"). Although none of these strings are identical, we can make out several groups of strings that are similar to each other.

FIGURE 13-5. Six time series (labeled A through F). We can recognize groups of time series that seem more similar to each other than to others.

A Different Point of View

In the absence of a precise (mathematical) definition, a cluster can be whatever we consider as one.
That is important because our minds have a different, alternative way of grouping ("clustering") objects: not by proximity or density but rather by the way objects fit into a larger structure. Figures 13-7 and 13-8 show two examples. Intuitively, we have no problem grouping the points in Figure 13-7 into two overlapping clusters. Yet the density-based definition of a cluster we proposed earlier will not support such a conclusion. Similar considerations apply to the set of points in Figure 13-8. The distance between any two adjacent points is the same, but we perceive the larger structures of the vertical and horizontal arrangements and assign points to clusters based on them.

FIGURE 13-6. Two clusters that are well separated but not globular. Some algorithms (e.g., the k-means algorithm) will not be able to handle such clusters.

FIGURE 13-7. An impossible situation for most clustering algorithms: although we seem to recognize two crossed clusters, no strictly local algorithm will be able to separate them.

This notion of a cluster does not hinge on the similarity or proximity of any pair of points to each other but instead on the similarity between a point and a property of the entire cluster. For any algorithm that considers a single point (or a single pair of points) at a time, this leads to a problem: to determine cluster membership, we need the property of the whole cluster; but to determine the properties of the cluster, we must first assign points to clusters.

FIGURE 13-8. The two clusters are distinguished not by a local property between pairs of points but rather by a global property of the entire cluster.
To handle such situations, we would need to perform some kind of global structure analysis—a task our minds are incredibly good at (which is why we tend to think of clusters this way) but that we have a hard time teaching computers to do. For problems in two dimensions, digital image processing has developed methods to recognize and extract certain features (such as edge detection). But general clustering methods, such as those described in the rest of this chapter, deal only with local properties and therefore can't handle problems such as those in Figures 13-7 and 13-8.

Distance and Similarity Measures

Given how strongly our intuition about clustering is shaped by geometric problems such as those in Figures 13-2 and 13-3, it is an interesting and perhaps surprising observation that clustering does not actually require data points to be embedded into a geometric space: all that is required is a distance or (equivalently) a similarity measure for any pair of points. This makes it possible to perform clustering on a set of strings, such as those in Figure 13-4, which do not map to points in space. However, if the data points have the properties of a vector space (see Appendix C), then we can develop more efficient algorithms that exploit those properties.

A distance is any function d(x, y) that takes two points and returns a scalar value that is a measure of how different these points are: the more different, the larger the distance. Depending on the problem domain, it may make more sense to express the same information in terms of a similarity function s(x, y), which returns a scalar that tells us how similar two points are: the more different they are, the smaller the similarity. Any distance can be transformed into a similarity and vice versa. For example, if we know that our similarity measure s can take on values only in the range [0, 1], then we can form an equivalent distance by setting d = 1 − s.
In other situations, we might decide to use d = 1/s, or s = exp(−d), and so on; the choice will depend on the problem we are working on. In what follows, I will express problems in terms of either distances or similarities, whichever seems more natural. Just keep in mind that you can always transform between the two.

How we define a distance function is largely up to us, and we can express different semantics about the data set through the appropriate choice of distance. For some problems, a particular distance measure will present itself naturally (if the data points are points in space, then we will most likely employ the Euclidean distance or a measure similar to it), but for other problems, we have more freedom to define our own metric. We will see several examples shortly.

There are certain properties that a distance (or similarity) function should have. Mathematicians have developed a set of properties that a function must possess to be considered a metric (or distance) in a mathematical sense. These properties can provide valuable guidance, but don't take them too seriously: for our purposes, different properties might be more important. The four axioms of a mathematical metric are:

d(x, y) ≥ 0
d(x, y) = 0 if and only if x = y
d(x, y) = d(y, x)
d(x, y) + d(y, z) ≥ d(x, z)

The first two axioms state that a distance is never negative and that it is zero only if the two points are equal. The third property ("symmetry") states that the distance between x and y is the same as the distance between y and x—no matter which way we consider the pair. The final property is the so-called triangle inequality, which states that to get from x to z, it is never shorter to take a detour through a third point y instead of going directly (see Figure 13-9). This all seems rather uncontroversial, but these conditions are not necessarily fulfilled in practice.
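As a quick sanity check, the following sketch verifies the four axioms for the Euclidean distance on a batch of random two-dimensional points (a brute-force test on samples, not a proof):

```python
import itertools
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(1)
points = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(25)]

for x, y in itertools.product(points, repeat=2):
    assert euclidean(x, y) >= 0                              # nonnegativity
    assert (euclidean(x, y) == 0) == (x == y)                # identity
    assert abs(euclidean(x, y) - euclidean(y, x)) < 1e-12    # symmetry

for x, y, z in itertools.product(points, repeat=3):
    # Triangle inequality (with a small tolerance for rounding)
    assert euclidean(x, y) + euclidean(y, z) >= euclidean(x, z) - 1e-12

print("all four axioms hold on this sample")
```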
A funny example of an asymmetric distance occurs if you ask everyone in a group of people how much they like every other member of the group and then use the responses to construct a distance measure: it is not at all guaranteed that the feelings of person A for person B are requited by B. (Using the same example, it is also possible to construct scenarios that violate the triangle inequality.)

For technical reasons, the symmetry property is usually highly desirable. You can always construct a symmetric distance function from an asymmetric one:

dS(x, y) = ( d(x, y) + d(y, x) ) / 2

is always symmetric.

FIGURE 13-9. The triangle inequality: the direct path from x to z is never longer than a path that goes through an intermediate point y, so that d(x, z) ≤ d(x, y) + d(y, z).

One property of great practical importance but not included among the distance axioms is smoothness. For example, we could define a rather simple-minded distance function that is 0 if and only if both points are equal to each other and that is 1 if the two points are not equal:

d(x, y) = 0 if x = y, and 1 otherwise

You can convince yourself that this distance fulfills all four of the distance axioms. However, this is not a very informative distance measure, because it gives us no information about how different two nonidentical points are! Most clustering algorithms require this information. A certain kind of tree-based algorithm, for example, works by successively considering the pairs of points with the smallest distance between them. When using this binary distance, the algorithm will make only limited progress before having exhausted all the information available to it.

The practical upshot of this discussion is that a good distance function for clustering should change smoothly as its inputs become more or less similar. (For classification tasks, a binary one as in the example just discussed might be fine.)
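The symmetrization trick takes only a few lines of code. A small sketch, using hypothetical, made-up "dislike" scores between group members (the names and scores are purely illustrative):

```python
def symmetrized(d):
    """Build a symmetric distance d_S(x, y) = (d(x, y) + d(y, x)) / 2
    from a possibly asymmetric distance function d."""
    return lambda x, y: 0.5 * (d(x, y) + d(y, x))

# Hypothetical asymmetric scores: how much x dislikes y.
dislike = {("A", "B"): 1.0, ("B", "A"): 4.0,
           ("A", "C"): 2.0, ("C", "A"): 2.0}

def d(x, y):
    return 0.0 if x == y else dislike[(x, y)]

d_s = symmetrized(d)

assert d("A", "B") != d("B", "A")               # the raw scores are asymmetric
assert d_s("A", "B") == d_s("B", "A") == 2.5    # the symmetrized version is not
```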
Common Distance and Similarity Measures

Depending on the data set and the purpose of our analysis, there are different distance and similarity measures available. First, let's clarify some terminology. We are looking for ways to measure the distance between any two data points. Very often, we will find that a point has a number of dimensions or features. (The first usage is more common for numerical data, the latter for categorical data.) In other words, each point is a collection of individual values: x = {x_1, x_2, ..., x_d}, where d is the number of dimensions (or features). For example, the data point {0, 1} has two dimensions and describes a point in space; whereas the tuple [ 'male', 'retired', 'Florida' ], which describes a person, has three features.

TABLE 13-1. Commonly used distance and similarity measures for numeric data

Manhattan:    d(x, y) = Σ_i |x_i − y_i|
Euclidean:    d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )
Maximum:      d(x, y) = max_i |x_i − y_i|
Minkowski:    d(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)
Dot product:  x · y = Σ_i x_i y_i / ( sqrt(Σ_i x_i^2) sqrt(Σ_i y_i^2) )
Correlation coefficient:
              corr(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_i (x_i − x̄)^2) sqrt(Σ_i (y_i − ȳ)^2) )
              where x̄ = (1/d) Σ_i x_i and ȳ = (1/d) Σ_i y_i

For any given data set containing n elements, we can form n^2 pairs of points. The set of all distances for all possible pairs of points can be arranged in a quadratic table known as the distance matrix. The distance matrix embodies all information about the mutual relationships between all points in the data set. If the distance function is symmetric, as is usually the case, then the matrix is also symmetric. Furthermore, the entries along the main diagonal typically are all 0, since d(x, x) = 0 for most well-behaved distance functions.
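The distance measures in Table 13-1 are each only a line or two of code. A sketch, including a check that the Minkowski distance approaches the maximum distance for large p:

```python
def minkowski(x, y, p):
    """Minkowski (p-)distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    return minkowski(x, y, 1)

def euclidean(x, y):
    return minkowski(x, y, 2)

def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0, 5.0), (4.0, 6.0, 5.0)
assert manhattan(x, y) == 7.0    # |1-4| + |2-6| + |5-5|
assert euclidean(x, y) == 5.0    # sqrt(9 + 16 + 0)
assert maximum(x, y) == 4.0
# For large p, the Minkowski distance approaches the maximum distance:
assert abs(minkowski(x, y, 64) - maximum(x, y)) < 1e-6
```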
Numerical data

If the data is numerical and also "mixable" or vector-like (in the sense of Appendix C), then the data points bear a strong resemblance to points in space; hence we can use a metric such as the familiar Euclidean distance. The Euclidean distance is the most commonly used member of a large family of related distance measures, which also contains the so-called Manhattan (or taxicab) distance and the maximum (or supremum) distance. All of these are in fact special cases of a more general Minkowski or p-distance.* Table 13-1 shows some examples. (The Manhattan distance is so named because it measures distances the way a New York taxicab moves: at right angles, along the city blocks. The Euclidean distance measures distances "as the crow flies." Finally, it is an amusing exercise to show that the maximum distance corresponds to the Minkowski p-distance as p → ∞.)

All these distance measures have very similar properties, and the differences between them usually do not matter much. The Euclidean distance is by far the most commonly used. I list the others here mostly to give you a sense of the kind of leeway that exists in defining a suitable distance measure—without significantly affecting the results!

*The Minkowski distance defined here should not be confused with the Minkowski metric, which defines the metric of the four-dimensional space-time in special relativity.

If the data is numeric but not mixable (so that it does not make sense to add a random fraction of one data set to a random fraction of a different data set), then these distance measures are not appropriate. Instead, you may want to consider a metric based on the correlation between two data points. Correlation-based measures are measures of similarity: they are large when objects are similar and small when the objects are dissimilar.
There are two related measures: the dot product and the correlation coefficient, which are also defined in Table 13-1. The only difference is that when calculating the correlation coefficient, we first center both data points by subtracting their respective means. In both measures, we multiply entries for the same "dimension" and sum the results; then we divide by the correlation of each data point with itself. Doing so provides a normalization and ensures that the correlation of any point with itself is always 1. This normalization step makes correlation-based distance measures suitable for data sets containing data points with widely different numeric values.

By construction, the value of the dot product falls in the interval [0, 1] (for data points whose entries are all nonnegative), and the correlation coefficient always falls in the interval [−1, 1]. You can therefore transform either one into a distance measure if need be (e.g., if d is the dot product, then 1 − d is a proper distance).

I should point out that the dot product has a geometric meaning. If we regard the data points as vectors in some suitable space, then the dot product of two points is the cosine of the angle that the two vectors make with each other. If they are perfectly aligned (i.e., they fall onto each other), then the angle is 0 and the cosine (and the correlation) is 1. If they are at right angles to each other, the cosine is 0. Correlation-based distance measures are suitable whenever numeric data is not readily mixable—for instance, when evaluating the similarity of the time series in Figure 13-5.

Categorical data

If the data is categorical, then we can count the number of features that do not agree in both data points (i.e., the number of mismatched features); this is the Hamming distance. (We might want to divide by the total number of features to obtain a number between 0 and 1, which is the fraction of mismatched features.)
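The measures just discussed (the normalized dot product, the correlation coefficient, and the Hamming count) are each only a few lines of code. A sketch; note how the correlation coefficient is simply the dot product applied to the centered points:

```python
import math

def dot_similarity(x, y):
    """Normalized dot product: the cosine of the angle between x and y."""
    num = sum(a * b for a, b in zip(x, y))
    return num / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def correlation(x, y):
    """Correlation coefficient: center both points, then take the cosine."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return dot_similarity([a - mx for a in x], [b - my for b in y])

def hamming(x, y):
    """Number of mismatched features (for equal-length feature vectors)."""
    return sum(1 for a, b in zip(x, y) if a != b)

a = [1.0, 2.0, 3.0, 4.0]
assert abs(dot_similarity(a, [10.0, 20.0, 30.0, 40.0]) - 1.0) < 1e-12   # aligned
assert abs(correlation(a, [101.0, 102.0, 103.0, 104.0]) - 1.0) < 1e-12  # offset ignored
assert hamming(['male', 'retired', 'Florida'],
               ['male', 'working', 'Florida']) == 1
```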
In certain data mining problems, the number of features is large, but only relatively few of them will be present for each data point. Moreover, the features may be binary: we care only whether or not they are present, but their values don't matter. (As an example, imagine a patient's health record: each possible medical condition constitutes a feature, and we want to know whether the patient has ever suffered from it.) In such situations, where features are not merely categorical but binary and sparse (meaning that just a few of the features are On), we may be more interested in matches between features that are On than in matches between features that are Off. This leads us to the Jaccard coefficient sJ, which is the number of matches between features that are On for both points, divided by the number of features that are On in at least one of the data points. The Jaccard coefficient is a similarity measure; the corresponding distance function is the Jaccard distance dJ = 1 − sJ. Using the counts

n00   features that are Off in both points
n10   features that are On in the first point, and Off in the second point
n01   features that are Off in the first point, and On in the second point
n11   features that are On in both points

we have:

sJ = n11 / (n10 + n01 + n11)
dJ = (n10 + n01) / (n10 + n01 + n11)

There are many other measures of similarity or dissimilarity for categorical data, but the principles are always the same. You calculate some fraction of matches, possibly emphasizing one aspect (e.g., the presence or absence of certain values) more than others. Feel free to invent your own—as far as I can see, none of these measures has achieved universal acceptance or is fundamentally better than any other.

String data

If the data consists of strings, then we can use a form of Hamming distance and count the number of mismatches.
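For binary feature vectors, the Jaccard coefficient can be sketched directly from the counts above:

```python
def jaccard_similarity(x, y):
    """Matches among On features, divided by features On in either point."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    return n11 / (n10 + n01 + n11)

# Sparse binary records: the many shared Off features (zeros) do not
# inflate the similarity.
x = [1, 1, 0, 0, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0, 0, 0]
assert jaccard_similarity(x, y) == 1 / 3                    # n11=1, n10=1, n01=1
assert abs((1 - jaccard_similarity(x, y)) - 2 / 3) < 1e-12  # the Jaccard distance
```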
If the strings in the data set are not all of equal length, we can pad the shorter string and count the number of characters added as mismatches. If we are dealing with many strings that are rather similar to each other (distorted through typos, for instance), then we can use a more detailed measure of the difference between them—namely, the edit or Levenshtein distance. The Levenshtein distance is the minimum number of single-character operations (insertions, deletions, and substitutions) required to transform one string into the other. (A quick Internet search will give many references to the actual algorithm and available implementations.) Another approach is to find the length of the longest common subsequence. This metric is often used for gene sequence analysis in computational biology.

This may be a good place to make a more general point: the best distance measure to use does not follow automatically from the data type; rather, it depends on the semantics of the data—or, more precisely, on the semantics that you care about for your current analysis! In some cases, a simple metric that only calculates the difference in string length may be perfectly sufficient. In another case, you might want to use the Hamming distance. If you really care about the details of otherwise similar strings, the Levenshtein distance is most appropriate. You might even want to calculate how often each letter appears in a string and then base your comparison on that. It all depends on what the data means and on what aspect of it you are interested in at the moment (which may also change as the analysis progresses). Similar considerations apply everywhere—there are no "cookbook" rules.

Special-purpose metrics

A more abstract measure for the similarity of two points is based on the number of neighbors that the two points have in common; this metric is known as the shared nearest neighbor (SNN) similarity.
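The Levenshtein distance itself fits in a dozen lines. A compact dynamic-programming sketch (the common two-row formulation, one of several standard variants):

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string s into string t."""
    prev = list(range(len(t) + 1))      # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        cur = [i]                       # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

assert levenshtein("First Avenue 35", "Furst Avenue 33") == 2
assert levenshtein("kitten", "sitting") == 3
```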
To calculate the SNN for two points x and y, you find the k nearest neighbors (using any suitable distance function) for both x and y. The number of neighbors shared by both points is their mutual SNN. The same concept can be extended to cases in which there is some property that the two points may have in common. For example, in a social network we could define the "closeness" of two people by the number of friends they share, by the number of movies they have both seen, and so on. (This application is equivalent to the Hamming distance.) Nearest-neighbor-based metrics are particularly suitable for high-dimensional data, where other distance measures can give spuriously small results.

Finally, let me remind you that sometimes the solution does not consist of inventing a new metric. Instead, the trick is to map the problem to a different space that already has a predefined, suitable metric. As an example, consider the problem of measuring the degree of similarity between different text documents (here we assume that these documents are long—hundreds or thousands of words). The standard approach to this problem is to count how often each word appears in each document. The resulting data structure is referred to as the document vector. You can now form a dot product between two document vectors as a measure of their correspondence. Technically speaking, we have mapped each document to a point in a (high-dimensional) vector space. Each distinct word that occurs in any of the documents spans a new dimension, and the frequency with which each word appears in a document provides the position of that document along this axis. This is very interesting, because we have transformed highly structured data (text) into numerical, even vector-like data and can therefore now manipulate it much more easily. (Of course, the benefit comes at a price: in doing so we have lost all information about the sequence in which words appeared in the text.
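A brute-force sketch of the SNN similarity, using a plain Euclidean distance to find each point's k nearest neighbors (the toy coordinates below are made up to form two tight groups):

```python
def knn(i, points, k, dist):
    """Indices of the k nearest neighbors of points[i] (excluding itself)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return set(order[:k])

def snn_similarity(a, b, points, k, dist):
    """Shared-nearest-neighbor similarity of points[a] and points[b]."""
    return len(knn(a, points, k, dist) & knn(b, points, k, dist))

def euclid(p, q):
    return sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5

# Members of the same group share most of their nearest neighbors;
# members of different groups share none.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
assert snn_similarity(0, 3, pts, k=3, dist=euclid) == 2
assert snn_similarity(0, 4, pts, k=3, dist=euclid) == 0
```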
It is a separate consideration whether this is relevant for our purpose.)

One last comment: one can overdo it when defining distance and similarity measures. Complicated or sophisticated definitions are usually not necessary as long as you capture the fundamental semantics. The Hamming distance and the document vector correlation are two good examples of simplified metrics that intentionally discard a lot of information yet still turn out to be highly successful in practice.

Clustering Methods

In this section, we will discuss three very different clustering algorithms. As you will see, the basic ideas behind all three algorithms are rather simple, and it is straightforward to come up with perfectly adequate implementations of them yourself. These algorithms are also important as starting points for more sophisticated clustering routines, which usually augment them with various heuristics or combine ideas from different algorithms.

Different algorithms are suitable for different kinds of problems—depending, for example, on the shape and structure of the clusters. Some require vector-like data, whereas others require only a distance function. Different algorithms tend to be misled by different kinds of pitfalls, and they all have different performance (i.e., computational complexity) characteristics. It is therefore important to have a variety of different algorithms at your disposal so that you can choose the one most appropriate for your problem and for the kind of solution you seek! (Remember: it is pretty much the choice of algorithm that defines what constitutes a "cluster" in the end.)

Center Seekers

One of the most popular clustering methods is the k-means algorithm. The k-means algorithm requires the number of expected clusters k as input. (We will later discuss how to determine this number.) The k-means algorithm is an iterative scheme.
The main idea is to calculate the position of each cluster's center (or centroid) from the positions of the points belonging to the cluster and then to assign points to their nearest centroid. This process is repeated until sufficient convergence is achieved. The basic algorithm can be summarized as follows:

choose initial positions for the cluster centroids
repeat:
    for each point:
        calculate its distance from each cluster centroid
        assign the point to the nearest cluster
    recalculate the positions of the cluster centroids

The k-means algorithm is nondeterministic: a different choice of starting values may result in a different assignment of points to clusters. For this reason, it is customary to run the k-means algorithm several times and then compare the results. If you have previous knowledge of likely positions for the cluster centers, you can use it to precondition the algorithm. Otherwise, choose random data points as initial values.

What makes this algorithm efficient is that you don't have to search the existing data points to find one that would make a good centroid—instead you are free to construct a new centroid position. This is usually done by calculating the cluster's center of mass. In two dimensions, we would have:

x_c = (1/n) Σ_i x_i
y_c = (1/n) Σ_i y_i

where each sum is over all n points in the cluster. (Generalizations to higher dimensions are straightforward.)

You can only do this for vector-like data, however, because only such data allows us to form arbitrary "mixtures" in this way. For strictly categorical data (such as the strings in Figure 13-4), the k-means algorithm cannot be used (because it is not possible to "mix" different points to construct a new centroid). Instead, we have to use the k-medoids algorithm.
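The whole scheme fits comfortably in a page of code. A bare-bones sketch for two-dimensional points (no convergence test: it simply runs a fixed number of iterations and uses randomly chosen data points as the initial centroids):

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Basic k-means for 2-D points; returns (centroids, labels)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)   # start from k random data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
        # Update step: move each centroid to its cluster's center of mass.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(pts, k=2)
assert labels[0] == labels[1] == labels[2]   # first group together
assert labels[3] == labels[4] == labels[5]   # second group together
assert labels[0] != labels[3]                # and in different clusters
```

In practice you would rerun this with several seeds and compare the results, exactly as the text recommends.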
The k-medoids algorithm works in the same way as the k-means algorithm except that, instead of calculating the new centroid, we search through all points in the cluster to find the data point (the medoid) that has the smallest average distance to all other points in the cluster.

The k-means algorithm is surprisingly modest in its resource consumption. On each iteration, the algorithm evaluates the distance function once for each cluster and each point; hence the computational complexity per iteration is O(k · n), where k is the number of clusters and n is the number of points in the data set. This is remarkable because it means that the algorithm is linear in the number of points. The number of iterations is usually pretty small: 10–50 iterations are typical. The k-medoids algorithm is more costly because the search to find the medoid of each cluster is an O(n²) process. For very large data sets this might be prohibitive, but you can try running the k-medoids algorithm on random samples of all data points. The results from these runs can then be used as starting points for a run using the full data set.

Despite its cheap-and-cheerful appearance, the k-means algorithm works surprisingly well. It is pretty fast and relatively robust. Convergence is usually quick. Because the algorithm is simple and highly intuitive, it is easy to augment or extend it—for example, to incorporate points with different weights. You might also want to experiment with different ways to calculate the centroid, possibly using the median position rather than the mean, and so on.

That being said, the k-means algorithm can fail—annoyingly, in situations that exhibit especially strong clustering! Because of its iterative nature, the algorithm works best in situations that involve gradual density changes.
If your data set consists of very dense and widely separated clusters, then the k-means algorithm can get "stuck" if initially two centroids are assigned to the same cluster: moving one centroid to a different cluster would require a large move, which is not likely to be found by the mostly local steps taken by the k-means algorithm.

Among variants, a particularly important one is fuzzy clustering. In fuzzy clustering, we don't assign each point to a single cluster; instead, for each point and each cluster, we determine the probability that the point belongs to that cluster. Each point therefore acquires a set of k probabilities or weights (one for each cluster; the probabilities must sum to 1 for each point). We then use these probabilities as weights when calculating the centroid positions. The probabilities also make it possible to declare certain points as "noise" (having low probability of belonging to any cluster) and thus can help with data sets that contain unclustered "noise" points and with ambiguous situations such as the one shown in Figure 13-7.

To summarize:

• The k-means algorithm and its variants work best for globular (at least star-convex) clusters. The results will be meaningless for clusters with complicated shapes and for nested clusters (Figures 13-6 and 13-3, respectively).

• The expected number of clusters is required as an input. If this number is not known, it will be necessary to repeat the algorithm with different values and compare the results.

• The algorithm is iterative and nondeterministic; the specific outcome may depend on the choice of starting values.

• The k-means algorithm requires vector data; use the k-medoids algorithm for categorical data.

• The algorithm can be misled if there are clusters of highly different size or different density.
• The k-means algorithm is linear in the number of data points; the k-medoids algorithm is quadratic in the number of points.

Tree Builders

Another way to find clusters is by successively combining clusters that are "close" to each other into a larger cluster until only a single cluster remains. This approach is known as agglomerative hierarchical clustering, and it leads to a treelike hierarchy of clusters. Clusters that are close to each other are joined early (near the leaves of the tree) and more distant clusters are joined late (near the root of the tree). (One can also go in the opposite direction, continually splitting the set of points into smaller and smaller clusters. When applied to classification problems, this leads to a decision tree—see Chapter 18.)

The basic algorithm proceeds exactly as just outlined:

1. Examine all pairs of clusters.
2. Combine the two clusters that are closest to each other into a single cluster.
3. Repeat.

What do we mean by the distance between clusters? The distance measures that we have defined are valid only between points! To apply them, we need to select (or construct) a single "representative" point from each cluster. Depending on this choice, hierarchical clustering will lead to different results. The most important alternatives are as follows.

Minimum or single link
We define the distance between two clusters as the distance between the two points (one from each cluster) that are closest to each other. This choice leads to extended, thinly connected clusters. This approach can therefore handle clusters of complicated shapes, such as those in Figure 13-6, but it can be sensitive to noise points.

Maximum or complete link
The distance between two clusters is defined as the distance between the two points (one from each cluster) that are farthest away from each other.
With this choice, two clusters are not joined until all points within each cluster are connected to each other—favoring compact, globular clusters.

Average
In this case, we form the average over the distances between all pairs of points (one from each cluster). This choice has characteristics of both the single- and complete-link approaches.

Centroid
For each cluster, we calculate the position of a centroid (as in k-means clustering) and define the distance between clusters as the distance between centroids.

Ward's method
Ward's method measures the distance between two clusters in terms of the decrease in coherence that occurs when the two clusters are combined: if we combine clusters that are closer together, the resulting cluster should be more coherent than if we combine clusters that are farther apart. We can measure coherence as the average distance of all points in the cluster from a centroid, or as their average distance from each other. (We'll come back to cohesion and other cluster properties later.)

The result of hierarchical clustering is not actually a set of clusters. Instead, we obtain a treelike structure that contains the individual data points at the leaf nodes. This structure can be represented graphically in a dendrogram (see Figure 13-10). To extract actual clusters from it, we need to walk the tree, evaluate the cluster properties for each subtree, and then cut the tree to obtain clusters.

Tree builders are expensive: we need at least the full distance matrix for all pairs of points (requiring O(n²) operations to evaluate). Building the complete tree takes O(n) iterations: there are n clusters (initially, points) to start with, and at each iteration, the number of clusters is reduced by one because two clusters are combined. For each iteration, we need to search the distance matrix for the closest pair of clusters—naively implemented, this is an O(n²) operation that leads to a total complexity of O(n³) operations.
However, this can be reduced to O(n² log n) by using indexed lookup.

One outstanding feature of hierarchical clustering is that it does more than produce a flat list of clusters; it also shows their relationships in an explicit way. You need to decide whether this information is relevant for your needs, but keep in mind that the choice of measure for the cluster distance (single- or complete-link, and so on) can have a significant influence on the appearance of the resulting tree structure.

[Figure 13-10. A typical dendrogram for data like the data in Figure 13-5. Individual data points are at the leaf nodes. The vertical distance between the tree nodes represents the dissimilarity between the nodes.]

Neighborhood Growers

A third kind of clustering algorithm could be dubbed "neighborhood growers." They work by connecting points that are "sufficiently close" to each other to form a cluster and then keep doing so until all points have been classified. This approach makes the most direct use of the definition of a cluster as a region of high density, and it makes no assumptions about the overall shape of the cluster. Therefore, such methods can handle clusters of complicated shapes (as in Figure 13-6), interwoven clusters, or even nested clusters (as in Figure 13-3).

In general, neighborhood-based clustering algorithms are more of a special-purpose tool: either for cases that other algorithms don't handle well (such as the ones just mentioned) or for polishing, in a second pass, the features of a cluster found by a general-purpose clustering algorithm such as k-means. The DBSCAN algorithm, which we will introduce in this section, is one such algorithm, and it demonstrates some typical concepts. It requires two parameters.
One is the minimum density that we expect to prevail inside of a cluster—points that are less densely packed will not be considered part of any cluster. The other parameter is the size of the region over which we expect this density to be maintained: it should be larger than the average distance between neighboring points but smaller than the entire cluster. The choice of parameters is rather subtle and clearly requires an appropriate balance.

In a practical implementation, it is easier to work with two slightly different parameters: the neighborhood radius r and the minimum number of points n that we expect to find within the neighborhood of each point in a cluster. The DBSCAN algorithm distinguishes between three types of points: noise, edge, and core points. A noise point is a point that has fewer than n points in its neighborhood of radius r; such a point does not belong to any cluster. A core point of a cluster has more than n neighbors. An edge point is a point that has fewer neighbors than required for a core point but that is itself the neighbor of a core point.

The algorithm discards noise points and concentrates on core points. Whenever it finds a core point, the algorithm assigns a cluster label to that point and then continues to add all its neighbors, and their neighbors recursively, to the cluster, until all points have been classified. This description is simple enough, but actually deriving a concrete implementation that is both correct and efficient is less than straightforward. The pseudo-code in the original paper* appears needlessly clumsy; on the other hand, I am not convinced that the streamlined version that can be found (for example) on Wikipedia is necessarily correct. Finally, the basic algorithm lends itself to elegant recursive implementations, but keep in mind that the recursion will not unwind until the current cluster is complete.
This means that, in the worst case (of a single connected cluster), you will end up putting the entire data set onto the stack!

As pointed out earlier, the main advantage of the DBSCAN algorithm is that it handles clusters of complicated shapes and nested clusters gracefully. However, it does depend sensitively on the appropriate choice of values for its two control parameters, and it provides little help in finding them. If a data set contains several clusters with widely varying densities, then a single set of parameters may not be sufficient to classify all of the clusters. These problems can be ameliorated by coupling the DBSCAN algorithm with the k-means algorithm: in a first pass, the k-means algorithm is used to identify candidates for clusters. Then statistics on these subsets of points (such as range and density) can be used as input to the DBSCAN algorithm.

The DBSCAN algorithm is dominated by the calculations required to find the neighboring points. For each point in the data set, all other points have to be checked; this leads to a complexity of O(n²). In principle, algorithms and data structures exist to find candidates for neighboring points more efficiently (e.g., kd-trees and global grids), but their implementations are subtle and carry their own costs (grids can be very memory intensive). Coupling the DBSCAN algorithm with a more efficient first-pass algorithm (such as k-means) may therefore be a better strategy.

*"A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96). 1996.

Pre- and Postprocessing

The core algorithm for grouping data points into clusters is usually only part (though the most important one) of the whole strategy.
Some data sets may require some cleanup or normalization before they are suitable for clustering: that's the first topic in this section. Furthermore, we need to inspect the results of every clustering algorithm in order to validate and characterize the clusters that have been found. We will discuss some concepts and quantities used to describe clusters and to measure the clustering quality. Finally, several cluster algorithms require certain input parameters (such as the number of clusters to find), and we need to confirm that the values we provided are consistent with the outcome of the clustering process. That will be our last topic in this section.

Scale Normalization

Look at Figures 13-11 and 13-12. Wouldn't you agree that the data set in Figure 13-11 exhibits two reasonably clearly defined and well-separated clusters while the data set in Figure 13-12 does not? Yet both figures show the same data set—only drawn to different scales! In Figure 13-12, I used identical units for both the x axis and the y axis, whereas Figure 13-11 was drawn to maintain a suitable aspect ratio for this data set.

This example demonstrates that clustering is not independent of the units in which the data is measured. In fact, for the data set shown in Figures 13-11 and 13-12, points in two different clusters may be closer to each other than to other points in the same cluster! This is clearly a problem. If, as in this example, your data spans very different ranges along different dimensions, you need to normalize the data before starting a clustering algorithm. An easy way to achieve this is to divide the data, dimension for dimension, by the range of the data along that dimension. Alternatively, you might want to divide by the standard deviation along that dimension. This process is sometimes called whitening or prewhitening, particularly in signal-theoretic literature.
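A one-line NumPy sketch of the standard-deviation variant (SciPy ships a ready-made equivalent as scipy.cluster.vq.whiten):

```python
import numpy as np

def whiten(points):
    """Divide each dimension (column) of an (n, d) data set by its standard
    deviation, so every dimension has unit variance before clustering.
    Assumes each column has nonzero spread."""
    return points / points.std(axis=0)
```

After this step, distances along the different dimensions contribute on an equal footing to the Euclidean distance.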
You only need to worry about this problem if you are working with vector-like data and are using a distance measure like the Euclidean distance. It does not affect correlation-based similarity measures. In fact, there is a special variant of the Euclidean distance that performs the appropriate rescaling for each dimension on the fly: the Mahalanobis distance.

[Figure 13-11. It is easy to argue that there are two clusters in this graph. (Compare Figure 13-12.)]

Cluster Properties and Evaluation

It is easiest to think about cluster properties in the context of vector-like data and a straightforward clustering algorithm such as k-means. The algorithm already gives us the coordinates of the cluster centroids directly, hence we have the cluster location. Two additional quantities are the mass of the cluster (i.e., the number of points in the cluster) and its radius. The radius is simply the average deviation of all points from the cluster center—basically the standard deviation, when using the Euclidean distance:

    r^2 = \frac{1}{n} \sum_i \left[ (x_c - x_i)^2 + (y_c - y_i)^2 \right]

in two dimensions (equivalently in higher dimensions). Here x_c and y_c are the coordinates of the center of the cluster, and the sum runs over all points i in the cluster. Dividing the mass by the radius gives us the density of the cluster. (These values can be used to construct input values for the DBSCAN algorithm.)

We can apply the same principles to develop a measure for the overall quality of the clustering. The key concepts are cohesion within a cluster and separation between clusters. The average distance for all points within one cluster is a measure of the cohesion, and the average distance between all points in one cluster from all points in another cluster is a measure of the separation between the two clusters.
(If we know the centroids of the clusters, we can use the distance between the centroids as a measure for the separation.) We can go further and form the average (weighted by the cluster mass) of the cohesion for all clusters as a measure for the overall quality. If a data set can be cleanly grouped into clusters, then we expect the distance between the clusters to be large compared to the radii of the clusters. In other words, we expect the ratio:

    separation / cohesion

to be large.

[Figure 13-12. It is difficult to recognize two well-separated clusters in this figure. Yet the data is the same as in Figure 13-11 but drawn to a different scale! (Compare the horizontal and vertical scales in both graphs.)]

A particular measure based on this concept is the silhouette coefficient S. The silhouette coefficient is defined for individual points as follows. Let a_i be the average distance (the cohesion) that point i has from all other points in the cluster to which it belongs. Evaluate the average distance that point i has from all points in any cluster to which it does not belong, and let b_i be the smallest such value (i.e., b_i is the separation from the "closest" other cluster). Then the silhouette coefficient of point i is defined as:

    S_i = \frac{b_i - a_i}{\max(a_i, b_i)}

The numerator is a measure for the "empty space" between clusters (i.e., it measures the amount of distance between clusters that is not occupied by the original cluster). The denominator is the greater of the two length scales in the problem—namely the cluster radius and the distance between clusters. By construction, the silhouette coefficient ranges from −1 to 1. Negative values indicate that the cluster radius is greater than the distance between clusters, so that clusters overlap; this suggests poor clustering. Large values of S suggest good clustering.
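The definition translates directly into code. Here is a small, unoptimized O(n²) NumPy sketch of my own, assuming at least two clusters and at least two points per cluster:

```python
import numpy as np

def silhouette(points, labels):
    """Silhouette coefficient S_i = (b_i - a_i) / max(a_i, b_i) per point."""
    n = len(points)
    # Full matrix of pairwise Euclidean distances.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    result = np.empty(n)
    for i in range(n):
        own = labels[i]
        # a_i: average distance to the other members of i's own cluster.
        mask = (labels == own) & (np.arange(n) != i)
        a = dists[i, mask].mean()
        # b_i: smallest average distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != own)
        result[i] = (b - a) / max(a, b)
    return result
```

Averaging these values per cluster, or over the whole data set, gives the aggregate quality measures described in the text.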
We can form the average of the silhouette coefficients for all points belonging to a single cluster and thereby develop a measure for the quality of the entire cluster. We can further define the average over the silhouette coefficients for all individual points as the overall silhouette coefficient for the entire data set; this would be a measure for the quality of the clustering result.

The overall silhouette coefficient can be useful to determine the number of clusters present in the data set. If we run the k-means algorithm several times for different values of the expected number of clusters and calculate the overall silhouette coefficient each time, then it should exhibit a peak near the optimal number of clusters.

[Figure 13-13. How many clusters are in this data set?]

Let's work through an example to see how the silhouette coefficient performs in practice. Figure 13-13 shows the points of a two-dimensional data set. This is an interesting data set because, even though it exhibits clear clustering, it is not at all obvious how many distinct clusters there really are—any number between six and eight seems plausible. The total silhouette coefficient (averaged over all points in the data set) for this data set (see Figure 13-14) confirms this expectation, clearly leaning toward the lower end of this range. (It is interesting to note that the data set was generated, using a random-number generator, to include 10 distinct clusters, but some of those clusters are overlapping so strongly that it is not possible to distinguish them.) This example also serves as a cautionary reminder that it may not always be so easy to determine what actually constitutes a cluster!

Another interesting question concerns distinguishing legitimate clusters from a random (unclustered) background.
Of the algorithms that we have seen, only the DBSCAN algorithm explicitly labels some points as background; the k-means and the tree-building algorithms perform what is known as complete clustering by assigning every point to a cluster. We may want to relax this behavior by trimming those points from each cluster that exceed the average cohesion within the cluster by some amount. This is easiest for fuzzy clustering algorithms, but it can be done for other algorithms as well.

Other Thoughts

The three types of clustering algorithms introduced in this chapter are probably the most popular and widely used, but they certainly don't exhaust the range of possibilities.

[Figure 13-14. The silhouette coefficient for the data in Figure 13-13. According to this measure, six or seven clusters give optimal results for this data set.]

Here is a brief list of other ideas that can be (and have been) used to develop clustering algorithms.

• We can impose a specific topology, such as a grid, on the data points. Each data point will fall into a single grid cell, and we can use this information to find cells containing unusually many points and so guide clustering. Cell-based methods will perform poorly in many dimensions, because most cells will be empty and have few occupied neighbors (the "curse of dimensionality").

• Among grid-based approaches, Kohonen maps (which we will discuss in Chapter 14) have a lot of intuitive appeal.

• Some special methods have been suggested to address the challenges posed by high-dimensional feature spaces. In subspace clustering, for example, clustering is performed on only a subset of all available features. These results are then successively extended by including features ignored in previous iterations.

• Remember kernel density estimates (KDEs) from Chapter 2?
If the dimensionality is not too high, then we can generate a KDE for the data set. The KDE provides a smooth approximation to the local point density. We can then identify clusters by finding the maxima of this density directly, using standard methods from numerical analysis.

• The QT ("quality threshold") algorithm is a center-seeking algorithm that does not require the number of clusters as input; instead, we have to fix a maximum radius. The QT algorithm treats every point in the cluster as a potential centroid and adds neighboring points (in the order of increasing distance from the centroid) until the maximum radius is exceeded. Once all candidate clusters have been completed in this way, the cluster with the greatest number of points is removed from the data set, and then the process starts again with the remaining points.

• There is a well-known correspondence between graphs and distance matrices. Given a set of points, a graph tells us which points are directly connected to each other—but so does a distance matrix! We can exploit this equivalence by treating a distance matrix as the adjacency matrix of a graph. The distance matrix is pruned (by removing connections that are too long) to obtain a sparse graph, which can be interpreted as the backbone of a cluster.

• Finally, spectral clustering uses powerful but abstract methods from linear algebra (similar to those used for principal component analysis; see Chapter 14) to structure and simplify the distance matrix.

Obviously, much depends on our prior knowledge about the data set: if we expect clusters to be simple and convex, then the k-means algorithm suggests itself. On the other hand, if we have a sense for the typical radius of the clusters that we expect to find, then QT clustering would be a more natural approach. If we expect clusters of complicated shapes or nested clusters, then an algorithm like DBSCAN will be required.
Of course, it might be difficult to develop this kind of intuition—especially for problems that have significantly more than two or three dimensions!

Besides thinking of different ways to combine points into clusters, we can also think of different ways to define clusters to begin with. All methods discussed so far have relied (directly or indirectly) on the information contained in the distance between any two points. We can extend this concept and begin to think about three-point (or higher) distance functions. For example, it is possible to determine the angle between any three consecutive points and use this information as the measure of the similarity between points. Such an approach might help with cases like the one shown in Figure 13-8. Yet another idea is to measure not the similarity between points but instead the similarity between a point and a property of the cluster. For example, there is a straightforward generalization of the k-means algorithm in which the centroids are no longer pointlike but are straight lines, representing the "axis" of an elongated cluster. Rather than measuring the distance for each point from the centroid, this algorithm calculates the distance from this axis when assigning points to clusters. This algorithm would be suitable for cases like that shown in Figure 13-7. I don't think any of these ideas that try to generalize beyond pairwise distances have been explored in detail yet.

A Special Case: Market Basket Analysis

Which items are frequently bought together? This and similar questions arise in market basket analysis or—more generally—in association analysis. Because association analysis is looking for items that occur together, it is in some ways related to clustering. However, the specific nature of the problem is different enough to require a separate toolset.
The starting point for association analysis is usually a data set consisting of transactions—that is, items that have been purchased together (we will often stay with the market basket metaphor when illustrating these concepts). Each transaction corresponds to a single "data point" in regular clustering. For each transaction, we keep track of all items that have occurred together but typically ignore whether or not any particular item was purchased multiple times: all attributes are Boolean and indicate only the presence or absence of a certain item. Each item spans a new dimension: if the store sells N different items, then each transaction can have up to N different (Boolean) attributes, although each transaction typically contains only a tiny subset of the entire selection. (Note that we do not necessarily need to know the dimensionality N ahead of time: if we don't know it, we can infer an approximation from the number of different items that actually occur in the data set.)

From this description, you can already see how association analysis differs from regular clustering: data points in association analysis are typically very high-dimensional but also very sparse. It also differs from clustering (as we have discussed it so far) in that we are not necessarily interested in grouping entire "points" (i.e., transactions) but would like to identify those dimensions that frequently occur together.

A group of zero or more items occurring together is known as an item set (or itemset). Each transaction consists of an item set, but every one of its subsets is also an item set. We can construct arbitrary item sets from the selection of available items. For each such item set, its support count is the number of actual transactions that contain the candidate item set as a subset.
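If each transaction is represented as a set, the support count becomes a one-line subset test. A tiny sketch, with made-up transactions:

```python
def support_count(itemset, transactions):
    """Number of transactions that contain the candidate item set as a subset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)

# A toy "market basket" data set (invented for illustration).
transactions = [frozenset(t) for t in ({"bread", "milk"},
                                       {"bread", "butter", "milk"},
                                       {"beer", "chips"},
                                       {"bread", "butter"})]

support_count({"bread", "milk"}, transactions)   # → 2
```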
Besides simply identifying frequent item sets, we can also try to derive association rules—that is, rules of the form "if items A and B are bought, then item C is also likely to be bought." Two measures are important when evaluating the strength of an association rule: its support s and its confidence c. The support of a rule is the fraction of transactions in the entire data set that contain the combined item set (i.e., the fraction of transactions that contain all three items A, B, and C). A rule with low support is not very useful because it is rarely applicable. The confidence is a measure for the reliability of an association rule. It is defined as the number of transactions in which the rule is correct, divided by the number of transactions in which it is applicable. In our example, it would be the number of times A, B, and C occur together divided by the number of times A and B occur together.

How do we go about finding frequent item sets (and association rules)? Rather than performing an open-ended search for the "best" association rule, it is customary to set thresholds for the minimum support (such as 10 percent) and confidence (such as 80 percent) required of a rule and then to generate all rules that meet these conditions.

To identify rules, we generate candidate item sets and then evaluate them against the set of transactions to determine whether they exceed the required thresholds. However, the naive approach—to create and evaluate all possible item sets of k elements—is not feasible because of the huge number (2^k) of candidate item sets that could be generated, most of which will not be frequent! We must find a way to generate candidate item sets more efficiently. The crucial observation is that an item set can occur frequently only if all of its subsets occur frequently.
This insight is the basis for the so-called apriori algorithm, which is the most fundamental algorithm for association analysis. The apriori algorithm is a two-step algorithm: in the first step, we identify frequent item sets; in the second step, we extract association rules. The first part of the algorithm is the more computationally expensive one. It can be summarized as follows:

    find all 1-item item sets that meet the minimum support threshold
    repeat:
        from the current list of k-item item sets, construct (k+1)-item item sets
        eliminate those item sets that do not meet the minimum support threshold
        stop when no (k+1)-item item set meets the minimum support threshold

The list of frequent item sets may be all that we require, or we may postprocess the list to extract explicit association rules. To find association rules, we split each frequent item set into two sets and evaluate the confidence associated with this pair. From a practical point of view, rules that have a 1-item item set on the "righthand side" are the easiest to generate and the most important. (In other words, rules of the form "people who bought A and B also bought C," rather than rules of the form "people who bought A and B also bought C and D.")

This basic description leaves out many technical details, which are important in actual implementations. For example: how exactly do we create a (k + 1)-item item set from the list of k-item item sets? We might take every single item that occurs among the k-item item sets and add it, in turn, to every one of the k-item item sets; however, this would generate a large number of duplicate item sets that need to be pruned again. Alternatively, we might combine two k-item item sets only if they agree on all but one of their items. Clearly, appropriate data structures are essential for obtaining an efficient implementation.
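One step of this growth process (combining two k-item sets that agree on all but one item, then pruning candidates whose subsets are not all frequent) can be sketched in Python; the function name and the frozenset representation are my own choices, not from any particular library:

```python
from itertools import combinations

def grow(frequent, min_support, transactions):
    """One apriori step: from a collection of frequent k-item sets (frozensets),
    build the frequent (k+1)-item sets over a list of transaction sets."""
    k = len(next(iter(frequent)))
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) != k + 1:
            continue                      # a and b must agree on all but one item
        # Apriori pruning: every k-item subset must itself be frequent.
        if all(frozenset(s) in frequent for s in combinations(union, k)):
            candidates.add(union)
    # Keep only candidates whose support count meets the threshold.
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_support}
```

Starting from the frequent 1-item sets and calling this repeatedly until it returns an empty set reproduces the first phase of the algorithm summarized above.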
(Similar considerations apply when determining the support count of a candidate item set, and so on.)*

Although the apriori algorithm is probably the most popular algorithm for association analysis, there are also very different approaches. For example, the FP-Growth Algorithm (where FP stands for “Frequent Pattern”) identifies frequent item sets using something like a string-matching algorithm. Items in transactions are sorted by their support count, and a treelike data structure is built up by exploiting data sets that agree in the first k items. This tree structure is then searched for frequently occurring item sets.

Association analysis is a relatively complicated problem that involves many technical (as opposed to conceptual) challenges as well. The discussion in this section could only introduce the topic and attempt to give a sense of the kinds of approaches that are available. We will see some additional problems of a similar nature in Chapter 18.

*An open source implementation of the apriori algorithm (and many other algorithms for frequent pattern identification), together with notes on efficient implementation, can be found at http://borgelt.net/apriori.html. The arules package for R is an alternative; it can be found on CRAN.

A Word of Warning

Clustering can lead you astray, and when done carelessly it can become a huge waste of time. There are at least two reasons for this. First, although the algorithms are deceptively simple, it can be surprisingly difficult to obtain useful results from them. Many of them depend quite sensitively on several heuristic parameters, and you can spend hours fiddling with the various knobs. Moreover, because the algorithms are simple and the field has so much intuitive appeal, it can be a lot of fun to play with implementations and to develop all kinds of modifications and variations. And that assumes there actually are any clusters present!
(This is the second reason.) In the absence of rigorous, independent results, you will actually spend more time on data sets that are totally worthless—perpetually hunting for those clusters that “the stupid algorithm just won’t find.” Perversely, additional domain knowledge does not necessarily make the task any easier: knowing that there should be exactly 10 clusters present in Figure 13-13 is of no help in finding the clusters that actually can be identified!

Another important question concerns the value that you ultimately derive from clustering (assuming now that at least one of the algorithms has returned something apparently meaningful). It can be difficult to distinguish spurious results from real ones: like clustering algorithms, cluster evaluation methods are not particularly rigorous or unequivocal either (Figure 13-14 does not exactly inspire confidence). And we still have not answered the question of what you will actually do with the results—assuming that they turn out to be significant. I have found that understanding the actual question that needs to be answered, developing some pertinent hypotheses and models around it, and then verifying them on the data through specific, focused analysis is usually a far better use of time than to go off on a wild-goose clustering search.

Finally, I should emphasize that, in keeping with the spirit of this book, the algorithms in this chapter are suitable for moderately sized data sets (a few thousand data points and a dozen dimensions, or so) and for problems that are not too pathological. Highly developed algorithms (e.g., CURE and BIRCH) exist for very large or very high-dimensional problems; these algorithms usually combine several different cluster-finding approaches together with a set of heuristics. You need to evaluate whether such specialized algorithms make sense for your situation.
Workshop: Pycluster and the C Clustering Library

The C Clustering Library (http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm) is a mature and relatively efficient clustering library, originally developed to find clusters among gene expressions in microarray experiments. It contains implementations of the k-means and k-medoids algorithms, tree clustering, and even self-organized (Kohonen) maps. It comes with its own GUI frontend as well as excellent Perl and Python bindings. It is easy to use and very well documented. In this Workshop, we use Python to demonstrate the library’s center-seeker algorithms.

    import Pycluster as pc
    import numpy as np
    import sys

    # Read data filename and desired number of clusters from command line
    filename, n = sys.argv[1], int( sys.argv[2] )

    # x and y coordinates, whitespace-separated
    data = np.loadtxt( filename, usecols=(0,1) )

    # Perform clustering and find centroids
    clustermap = pc.kcluster( data, nclusters=n, npass=50 )[0]
    centroids = pc.clustercentroids( data, clusterid=clustermap )[0]

    # Obtain distance matrix
    m = pc.distancematrix( data )

    # Find the masses of all clusters
    mass = np.zeros( n )
    for c in clustermap:
        mass[c] += 1

    # Create a matrix for individual silhouette coefficients
    sil = np.zeros( n*len(data) )
    sil.shape = ( len(data), n )

    # Evaluate the distance for all pairs of points
    for i in range( 0, len(data) ):
        for j in range( i+1, len(data) ):
            d = m[j][i]
            sil[i, clustermap[j] ] += d
            sil[j, clustermap[i] ] += d

    # Normalize by cluster size (that is: form average over cluster)
    for i in range( 0, len(data) ):
        sil[i,:] /= mass

    # Evaluate the silhouette coefficient
    s = 0
    for i in range( 0, len(data) ):
        c = clustermap[i]
        a = sil[i,c]
        b = min( sil[i, range(0,c)+range(c+1,n) ] )
        si = (b-a)/max(b,a)   # This is the silhouette coeff of point i
        s += si

    # Print overall silhouette coefficient
    print n, s/len(data)

The listing shows the code used to generate Figure 13-14,
showing how the silhouette coefficient depends on the number of clusters. Let’s step through it.

We import both the Pycluster library itself and the NumPy package. We will use some of the vector manipulation abilities of the latter. The point coordinates are read from the file specified on the command line. (The file is assumed to contain the x and y coordinates of each point, separated by whitespace; one point per line.)

The point coordinates are then passed to the kcluster() function, which performs the actual k-means algorithm. This function takes a number of optional arguments: nclusters is the desired number of clusters, and npass holds the number of trials that should be performed with different starting values. (Remember that k-means clustering is nondeterministic with regard to the initial guesses for the positions of the cluster centroids.) The kcluster() function will make npass different trials and report on the best one.

The function returns three values. The first return value is an array that, for each point in the original data set, holds the index of the cluster to which it has been assigned. The second and third return values provide information about the quality of the clustering (which we ignore in this example). This function signature is a reflection of the underlying C API, where you pass in an array of the same length as the data array and then the cluster assignments of each point are communicated via this additional array. This frees the kcluster() function from having to do its own resource management, which makes sense in C (and possibly also for extremely large data sets).

All information about the result of the clustering procedure is contained in the clustermap data structure. The Pycluster library provides several functions to extract this information; here we demonstrate just one: we can pass the clustermap to the clustercentroids() function to obtain the coordinates of the cluster centroids.
(However, we won’t actually use these coordinates in the rest of the program.)

You may have noticed that we did not specify the distance function to use in the listing. The C Clustering Library does not give us the option of a user-defined distance function with k-means. It does include several standard distance measures (Euclidean, Manhattan, correlation, and several others), which can be selected through a keyword argument to kcluster() (the default is to use the Euclidean distance). Distance calculations can be a rather expensive part of the algorithm, and having them implemented in C makes the overall program faster. (If we want to define our own distance function, then we have to use the kmedoids() function, which we will discuss in a moment.)

FIGURE 13-15. The result of running the k-means algorithm on the data from Figure 13-13, finding six clusters (k = 6). Different clusters are shown in black and gray, and the cluster centroids are indicated by filled dots.

To evaluate the silhouette coefficient, we need the point-to-point distances, and so we obtain the distance matrix from the Pycluster library. We will also need the number of points in each cluster (the cluster’s “mass”) later.

Next, we calculate the individual silhouette coefficients for all data points. Recall that the silhouette coefficient involves both the average distance to all points in the same cluster and the average distance to all points in the nearest cluster. Since we don’t know ahead of time which cluster will be the nearest to each point, we simply go ahead and calculate the average distance to all clusters. The results are stored in the matrix sil. (In the implementation, we make use of some of the vector manipulation features of NumPy: in the expression sil[i,:] /= mass, each entry in row i is divided componentwise by the corresponding entry in mass.
Further down, we make use of “advanced indexing” when looking for the minimum distance between the point i and a cluster to which it does not belong: in the expression b = min( sil[i, range(0,c)+range(c+1,n) ] ), we construct an indexing vector that includes indices for all clusters except the one that the point i belongs to. See the Workshop in Chapter 2 for more details.)

Finally, we form the average over all single-point silhouette coefficients and print the results. Figure 13-14 shows them as a graph.

FIGURE 13-16. Similar to Figure 13-15 but for k = 10. Ten seems too high a number of clusters for this data set, which agrees with the results from calculating the silhouette coefficient in Figure 13-14.

Figures 13-15 and 13-16 show how the program assigned points to clusters in two runs, finding 6 and 10 clusters, respectively. These results agree with Figure 13-14: k = 6 is close to the optimal number of clusters, whereas k = 10 seems to split some clusters artificially.

The next listing demonstrates the kmedoids() function, which we have to use if we want to provide our own distance function. As implemented by the Pycluster library, the k-medoids algorithm does not require the data at all—all it needs is the distance matrix!
    import Pycluster as pc
    import numpy as np
    import sys

    # Our own distance function: maximum norm
    def dist( a, b ):
        return max( abs( a-b ) )

    # Read data filename and desired number of clusters from command line
    filename, n = sys.argv[1], int( sys.argv[2] )

    # x and y coordinates, whitespace-separated
    data = np.loadtxt( filename, usecols=(0,1) )
    k = len(data)

    # Calculate the distance matrix
    m = np.zeros( k*k )
    m.shape = ( k, k )
    for i in range( 0, k ):
        for j in range( i, k ):
            d = dist( data[i], data[j] )
            m[i][j] = d
            m[j][i] = d

    # Perform the actual clustering
    clustermap = pc.kmedoids( m, n, npass=20 )[0]

    # Find the indices of the points used as medoids, and the cluster masses
    medoids = {}
    for i in clustermap:
        medoids[i] = medoids.get(i,0) + 1

    # Print points, grouped by cluster
    for i in medoids.keys():
        print "Cluster=", i, " Mass=", medoids[i], " Centroid: ", data[i]
        for j in range( 0, len(data) ):
            if clustermap[j] == i:
                print "\t", data[j]

In the listing, we calculate the distance matrix using the maximum norm (which is not supplied by Pycluster) as the distance function. Obviously, we could use any other function here—such as the Levenshtein distance if we wanted to cluster the strings in Figure 13-4. We then call the kmedoids() function, which returns a clustermap data structure similar to the one returned by kcluster(). For the kmedoids() function, the data structure contains—for each data point—the index of the data point that is the medoid of the assigned cluster. Finally, we calculate the masses of the clusters and print the coordinates of the cluster medoids as well as the coordinates of all points assigned to each cluster.

The C Clustering Library is small and relatively easy to use. You might also want to explore its tree-clustering implementation. The library also includes routines for Kohonen maps and principal component analysis, which we will discuss in Chapter 14.
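As an aside, the Levenshtein distance mentioned above is easy to sketch. This is our own illustration (a standard dynamic-programming formulation, not part of the C Clustering Library); the resulting function could be used in place of dist() when the data points are strings:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```

For example, levenshtein("kitten", "sitting") evaluates to 3 (two substitutions and one insertion).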
Further Reading

• Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Addison-Wesley. 2005.
This is my favorite book on data mining. The presentation is compact and more technical than in most other books on this topic. The section on clustering is particularly strong.

• Data Clustering: Theory, Algorithms, and Applications. Guojun Gan, Chaoqun Ma, and Jianhong Wu. SIAM. 2007.
This book is a recent survey of results from clustering research. The presentation is too terse to be useful, but it provides a good source of concepts and keywords for further investigation.

• Algorithms for Clustering Data. Anil K. Jain and Richard C. Dubes. Prentice Hall. 1988.
An older book on clustering that is freely available at http://www.cse.msu.edu/~jain/Clustering Jain Dubes.pdf.

• Metric Spaces: Iteration and Application. Victor Bryant. Cambridge University Press. 1985.
If you are interested in thinking about distance measures in arbitrary spaces in a more abstract way, then this short (100-page) book is a wonderful introduction. It requires no more than some passing familiarity with real analysis, but it does a remarkable job of demonstrating the power of purely abstract reasoning—both from a conceptual point of view and with an eye to real applications.

CHAPTER FOURTEEN

Seeing the Forest for the Trees: Finding Important Attributes

What do you do when you don’t know where to start? When you are dealing with a data set that offers no structure that would suggest an angle of attack? For example, I remember looking through a company’s contracts with its suppliers for a certain consumable.
These contracts all differed with regard to the supplier, the number of units ordered, the duration of the contract and the lead time, the destination location that the items were supposed to be shipped to, the actual shipping date, and the procurement agent who had authorized the contract—and, of course, the unit price. What I tried to figure out was which of these quantities had the greatest influence on the unit price.

This kind of problem can be very difficult: there are so many different variables, none of which seems, at first glance, to be predominant. Furthermore, I have no assurance that the variables are all independent; many of them may be expressing related information. (In this case, the supplier and the shipping destination may be related, since suppliers are chosen to be near the place where the items are required.)

Because all variables arise on more or less equal footing, we can’t identify a few as the obvious “control” or independent variables and then track the behavior of all the other variables in response to these independent variables. We can try to look at all possible pairings—for example, using graphical techniques such as scatter-plot matrices (Chapter 5)—but that may not really reveal much either, particularly if the number of variables is truly large. We need some form of computational guidance.

In this chapter, we will introduce a number of different techniques for exactly this purpose. All of them help us select the most important variables or features from a multivariate data set in which all variables appear to arise on equal footing. In doing so, we reduce the dimension of the data set from the original number of variables (or features) to a smaller set, which (hopefully) captures most of the “interesting” behavior of the data. These methods are therefore also known as feature selection or dimensionality reduction techniques.
A word of warning: the material in this chapter is probably the most advanced and least obvious in the whole book, both conceptually and also with respect to actual implementations. In particular, the following section (on principal component analysis) is very abstract, and it may not make much sense if you haven’t had some previous exposure to matrices and linear algebra (including eigentheory). Other sections are more accessible. I include these techniques here nevertheless, not only because they are of considerable practical importance, but also to give you a sense of the kinds of (more advanced) techniques that are available, and as a possible pointer for further study.

Principal Component Analysis

Principal component analysis (PCA) is the primary tool for dimensionality reduction in multivariate problems. It is a foundational technique that finds applications as part of many other, more advanced procedures.

Motivation

To understand what PCA can do for us, let’s consider a simple example. Let’s go back to the contract example given earlier and now assume that there are only two variables for each contract: its lead time and the number of units to be delivered. What can we say about them?

Well, we can draw histograms for each to understand the distribution of values and to see whether there are “typical” values for either of these quantities. The histograms (in the form of kernel density estimates—see Chapter 2) are shown in Figure 14-1 and don’t reveal anything of interest. Because there are only two variables in this case, we can also plot one variable against the other in a scatter plot. The resulting graph is shown in Figure 14-2 and is very revealing: the lead time of the contract grows with its size.

So far, so good. But we can also look at Figure 14-2 in a different way.
Recall that the contract data depends on two variables (lead time and number of items), so we would expect the points to fill the two-dimensional space spanned by the two corresponding axes. But in reality, all the points fall very close to a straight line. A straight line, however, is only one-dimensional, and this means that we need only a single variable to describe the position of each point: the distance along the straight line. In other words, although it appears to depend on two variables, the contract data mostly depends on a single variable that lies halfway between the original ones. In this sense, the data is of lower dimensionality than it originally appeared.

FIGURE 14-1. Contract data: distribution of points for the lead time and the number of units per order. The distributions do not reveal anything in particular about the data.

FIGURE 14-2. Contract data: individual contracts in a scatter plot spanned by the two original variables. All the points fall close to a straight line that is not parallel to either of the original coordinate axes.

Of course, the data still depends on two variables—as it did originally. But most of the variation in the data occurs along only one direction. If we were to measure the data only along this direction, we would still capture most of what is “interesting” about the data. In Figure 14-3, we see another kernel density estimate of the same data, but this time not taken along the original variables but instead showing the distribution of data points along the two “new” directions indicated by the arrows in the scatter plot of Figure 14-2. In contrast to the variation occurring along the “long” component, the “short” component is basically irrelevant.

FIGURE 14-3. Contract data: distribution of points along the principal directions. Most of the variation is along the “long” direction, whereas there is almost no variation perpendicular to it. (The vertical scales have been adjusted to make the curves comparable.)

For this simple example, which had only two variables to begin with, it was easy enough to find the lower-dimensional representation just by looking at it. But that won’t work when there are significantly more than two variables involved. If there aren’t too many variables, then we can generate a scatter-plot matrix (see Chapter 5) containing all possible pairs of variables, but even this becomes impractical once there are more than seven or eight variables. Moreover, scatter-plot matrices can never show us more than the combination of any two of the original variables. What if the data in a three-dimensional problem falls onto a straight line that runs along the space diagonal of the original three-dimensional data cube? We will not find this by plotting the data against any (two-dimensional!) pair of the original variables. Fortunately, there is a calculational scheme that—given a set of points—will give us the principal directions (in essence, the arrows in Figure 14-2) as a combination of the original variables. That is the topic of the next section.

Optional: Theory

We can make progress by using a technique that works for many multi-dimensional problems. If we can summarize the available information regarding the multi-dimensional system in matrix form, then we can invoke a large and powerful body of results from linear algebra to transform this matrix into a form that reveals any underlying structure (such as the structure visible in Figure 14-2).
In what follows, I will often appeal to the two-dimensional example of Figure 14-2, but the real purpose here is to develop a procedure that will be applicable to any number of dimensions. These techniques become necessary when the number of dimensions exceeds two or three so that simple visualizations like the ones discussed so far will no longer work.

To express what we know about the system, we first need to ask ourselves how best to summarize the way any two variables relate to each other. Looking at Figure 14-2, the correlation coefficient suggests itself. In Chapter 13, we introduced the correlation coefficient as a measure for the similarity between two multi-dimensional data points x and y. Here, we use the same concept to express the similarity between two dimensions in a multivariate data set. Let x and y be two different dimensions (“variables”) in such a data set; then the correlation coefficient is defined by:

    corr(x, y) = \frac{\frac{1}{N} \sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sigma(x)\,\sigma(y)}

where the sum is over all data points, \bar{x} and \bar{y} are the means of the x_i and the y_i, respectively, and

    \sigma(x) = \sqrt{\frac{1}{N} \sum_i (x_i - \bar{x})^2}

is the standard deviation of x (and equivalently for y). The denominator in the expression of the correlation coefficient amounts to a rescaling of the values of both variables to a standard interval. If that is not what we want, then we can instead use the covariance between the x_i and the y_i:

    cov(x, y) = \frac{1}{N} \sum_i (x_i - \bar{x})(y_i - \bar{y})

All of these quantities can be defined for any two variables (just supply values for, say, x_i and z_i). For a p-dimensional problem, we can find all the p(p − 1)/2 different combinations (remember that these coefficients are symmetric: cov(x, y) = cov(y, x)). It is now convenient to group the values in a matrix, which is typically called \Sigma (not to be confused with the summation sign!):

    \Sigma = \begin{pmatrix}
        \mathrm{cov}(x, x) & \mathrm{cov}(x, y) & \cdots \\
        \mathrm{cov}(y, x) & \mathrm{cov}(y, y) & \cdots \\
        \vdots             & \vdots             & \ddots
    \end{pmatrix}

and similarly for the correlation matrix.
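In NumPy, both matrices are one function call away. A minimal sketch with invented numbers; note that np.cov defaults to the N − 1 normalization, so we pass bias=True to match the 1/N convention used above:

```python
import numpy as np

# Toy data: two strongly correlated variables (invented numbers).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

cov = np.cov(x, y, bias=True)   # 2x2 covariance matrix, 1/N normalization
corr = np.corrcoef(x, y)        # 2x2 correlation matrix

# Both matrices are symmetric; the diagonal of corr is identically 1,
# and the off-diagonal entry is close to 1 for nearly linear data.
```

Because y grows almost linearly with x here, the off-diagonal correlation comes out close to 1 (about 0.996).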
Because the covariance (or correlation) itself is symmetric under an interchange of its arguments, the matrix \Sigma is also symmetric (so that it equals its transpose).

We can now invoke an extremely important result from linear algebra, known as the spectral decomposition theorem, as follows. For any real, symmetric N × N matrix A, there exists an orthogonal matrix U such that

    B = \begin{pmatrix}
        \lambda_1 &           &        &           \\
                  & \lambda_2 &        &           \\
                  &           & \ddots &           \\
                  &           &        & \lambda_N
    \end{pmatrix} = U^{-1} A U

is a diagonal matrix. Let’s explain some of the terminology. A matrix is diagonal if its only nonzero entries are along the main diagonal from the top left to the bottom right. A matrix is orthogonal if its transpose equals its inverse: U^T = U^{-1}, or U^T U = U U^T = 1. The entries \lambda_i in the diagonal matrix are called the eigenvalues of matrix A, and the column vectors of U are the eigenvectors. The spectral theorem also implies that all eigenvectors are mutually orthogonal. Finally, the ith column vector in U is the eigenvector “associated” with the eigenvalue \lambda_i; each eigenvalue has an associated eigenvector.

What does all of this mean? In a nutshell, it means that we can perform a change of variables that turns any symmetric matrix A into a diagonal matrix B. Although it may not be obvious, the matrix B contains the same information as A—it’s just packaged differently. The change of variables required for this transformation consists of a rotation of the original coordinate system into a new coordinate system in which the correlation matrix has a particularly convenient (diagonal) shape. (Notice how in Figure 14-2, the new directions are rotated with respect to the original horizontal and vertical axes.) When expressed in the original coordinate system (i.e., the original variables that the problem was initially expressed in), the matrix \Sigma is a complicated object with off-diagonal entries that are nonzero.
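NumPy’s np.linalg.eigh returns exactly these ingredients for a symmetric matrix: the eigenvalues \lambda_i and the orthogonal matrix U whose columns are the eigenvectors. A quick numerical check of the theorem, on an invented 2 × 2 matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # a real, symmetric matrix

lam, U = np.linalg.eigh(A)   # eigenvalues (ascending) and eigenvectors

# U is orthogonal: U^T U equals the identity...
assert np.allclose(U.T @ U, np.eye(2))

# ...and U^-1 A U (here equal to U^T A U) is diagonal,
# with the eigenvalues on the diagonal.
B = U.T @ A @ U
assert np.allclose(B, np.diag(lam))
```

For this particular matrix the eigenvalues are 1 and 3, with eigenvectors along the diagonals (1, −1) and (1, 1), up to normalization.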
However, the eigenvectors span a new coordinate system that is rotated with respect to the old one. In this new coordinate system, the matrix \Sigma takes on a simple, diagonal form in which all entries that are not on the diagonal vanish. The arrows in Figure 14-2 show the directions of the new coordinate axes, and the histogram in Figure 14-3 measures the distribution of points along these new directions. The purpose of performing a matrix diagonalization is to find the directions of this new coordinate system, which is more suitable for describing the data than was the original coordinate system.

Because the new coordinate system is merely rotated relative to the original one, we can express its coordinate axes as linear combinations of the original ones. In Figure 14-2, for instance, to make a step in the new direction (along the diagonal), you take a step along the (old) x axis, followed by a step along the (old) y axis. We can therefore express the new direction (call it x̂) in terms of the old ones: x̂ = (x + y)/√2 (the factor √2 is just a normalization factor).

Interpretation

The spectral decomposition theorem applies to any symmetric matrix. For any such matrix, we can find a new coordinate system, in which the matrix is diagonal. But the interpretation of the results (what do the eigenvalues and eigenvectors mean?) depends on the specific application. In our case, we apply the spectral theorem to the covariance or correlation matrix of a set of points, and the results of the decomposition will give us the principal axes of the distribution of points (hence the name of the technique).

Look again at Figure 14-2. Points are distributed in a region shaped like an extremely stretched ellipse.
If we calculate the eigenvalues and eigenvectors of the correlation matrix of this point distribution, we find that the eigenvectors lie in the directions of the principal axes of the ellipse while the eigenvalues give the relative length of the corresponding principal axes. Put another way, the eigenvectors point along the directions of greatest variance: the data is most stretched out if we measure it along the principal directions. Moreover, the eigenvalue corresponding to each eigenvector is a measure of the width of the distribution along this direction. (In fact, the eigenvalue is the square of the standard deviation along that direction; remember that the diagonal entries of the covariance matrix are σ²(x) = (1/N) Σᵢ (xᵢ − x̄)². Once we diagonalize \Sigma, the entries along the diagonal—that is, the eigenvalues—are the variances along the “new” directions.)

You should also observe that the variables measured along the principal directions are uncorrelated with each other. (By construction, their correlation matrix is diagonal, which means that the correlation between any two different variables is zero.)

This, then, is what the principal component analysis does for us: if the data points are distributed as a globular cloud in the space spanned by all the original variables (which may be more than two!), then the eigenvectors will give us the directions of the principal axes of the ellipsoidal cloud of data points and the eigenvalues will give us the length of the cloud along each of these directions. The eigenvectors and eigenvalues therefore describe the shape of the point distribution. This becomes especially useful if the data set has more than just two dimensions, so that a simple plot (as in Figure 14-2) is no longer feasible. (There are special varieties of PCA, such as “Kernel PCA” or “ISOMAP,” that work even with point distributions that do not form globular ellipsoids but have more complicated, contorted shapes.)
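Putting the pieces together on a synthetic two-dimensional cloud, stretched along the diagonal much like Figure 14-2 (the data is generated here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cloud: large spread along the diagonal ("long" direction),
# small perpendicular scatter ("short" direction).
t = rng.normal(0.0, 3.0, size=1000)       # position along the diagonal
noise = rng.normal(0.0, 0.3, size=1000)   # perpendicular scatter
pts = np.column_stack([t + noise, t - noise]) / np.sqrt(2)

# Diagonalize the covariance matrix of the point cloud.
cov = np.cov(pts, rowvar=False, bias=True)
lam, U = np.linalg.eigh(cov)    # eigenvalues in ascending order

# The largest eigenvalue is the variance along the principal axis;
# the corresponding eigenvector points (up to sign) along the diagonal.
principal = U[:, -1]
```

By construction the variance is about 9 along the diagonal and about 0.09 perpendicular to it, so the two eigenvalues should come out near those values, and the leading eigenvector should be close to (1, 1)/√2.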
The description of the shape of the point distribution provided by the PCA is already helpful. But it gets even better, because we may suspect that not all of the original variables are really needed. Some of them may be redundant (expressing more or less the same thing), and others may be irrelevant (carrying little information).

An indication that variables may be redundant (i.e., express the “same thing”) is that they are correlated. (That’s pretty much the definition of correlation: if we change one variable, there will be a corresponding change in the other.) The PCA uses the information contained in the mutual correlations between variables to identify those that are redundant. By construction, the principal coordinates are uncorrelated (i.e., not redundant), which means that the information contained in the original (redundant) set of variables has been concentrated in only a few of the new variables while the remaining variables have become irrelevant. The irrelevant variables are those corresponding to small eigenvalues: the point distribution will have only little spread in the corresponding directions (which means that these variables are almost constants and can therefore be ignored). The price we have to pay for the reduction in dimensions is that the new directions will not, in general, map neatly to the original variables. Instead, the new directions will correspond to combinations of the original variables.

There is an important consequence of the preceding discussion: the principal component analysis works with the correlation between variables. If the original variables are uncorrelated, then there is no point in carrying out a PCA!
For instance, if the data points in Figure 14-2 had shown no structure but had filled the entire two-dimensional parameter space randomly, then we would not have been able to simplify the problem by reducing it to a one-dimensional one consisting of the new direction along the main diagonal.

Computation

The theory just described would be of only limited interest if there weren’t practical algorithms for calculating both eigenvalues and eigenvectors. These calculations are always numerical. You may have encountered algebraic matrix diagonalization methods in school, but they are impractical for matrices larger than 2 × 2 and infeasible for matrices larger than about 4 × 4. However, there are several elegant numerical algorithms to invert and diagonalize matrices, and they tend to form the foundational part of any numerical library. They are not trivial to understand, and developing high-quality implementations (that avoid, say, round-off error) is a specialized skill. There are no good reasons to write your own, so you should always use an established library. (Every numerical library or package will include the required functionality.)

Matrix operations are relatively expensive, and run time performance can be a serious concern for large matrices. Matrix operations tend to be of O(N³) complexity, which means that doubling the size of the matrix will increase the time to perform an operation by a factor of 2³ = 8. In other words, doubling the problem size will result in nearly a tenfold increase in runtime! This is not an issue for small matrices (up to 100 × 100 or so), but you will hit a brick wall at a certain size (somewhere between 5,000 × 5,000 and 50,000 × 50,000). Such large matrices do occur in practice but usually not in the context of the topic of this chapter.
For even larger matrices there are alternative algorithms—which, however, calculate only the most important of the eigenvalues and eigenvectors. I will not go into details about different algorithms, but I want to mention one explicitly because it is of particular importance in this context. If you read about principal component analysis (PCA), then you will likely encounter the term singular value decomposition (SVD); in fact, many books treat PCA and SVD as equivalent expressions for the same thing. That is not correct; they are really quite different. PCA is the application of spectral methods to covariance or correlation matrices; it is a conceptual technique, not an algorithm. In contrast, the SVD is a specific algorithm that can be applied to many different problems, one of which is the PCA. The reason that the SVD features so prominently in discussions of the PCA is that the SVD combines two required steps into one. In our discussion of the PCA, we assumed that you first calculate the covariance or correlation matrix explicitly from the set of data points and then diagonalize it. The SVD performs these two steps in one fell swoop: you pass the set of data points directly to the SVD, and it calculates the eigenvalues and eigenvectors of the correlation matrix directly from those data points. The SVD is a very interesting and versatile algorithm, which is unfortunately rarely included in introductory classes on linear algebra.

Practical Points

As you can see, principal component analysis is an involved technique—although with the appropriate tools it becomes almost ridiculously easy to perform (see the Workshop in this chapter). But convenient implementations don't make the conceptual difficulties go away or ensure that the method is applied appropriately.
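The two routes described above (explicit covariance matrix followed by diagonalization, versus SVD applied directly to the centered data) can be compared in a short NumPy sketch; the random data here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated columns
centered = data - data.mean(axis=0)
n = centered.shape[0]

# Route 1: form the covariance matrix explicitly, then diagonalize it.
cov = np.cov(centered, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # sorted in descending order

# Route 2: SVD of the centered data matrix; no covariance matrix is ever formed.
# The squared singular values, divided by (n - 1), are the same eigenvalues.
singular_values = np.linalg.svd(centered, compute_uv=False)
eigvals_from_svd = singular_values**2 / (n - 1)

print(np.allclose(eigvals, eigvals_from_svd))  # True
```

Besides skipping a step, the SVD route tends to be numerically gentler, because squaring the data to form the covariance matrix amplifies round-off error.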
First, I'd like to emphasize that the mathematical operations underlying principal component analysis (namely, the diagonalization of a matrix) are very general: they consist of a set of formal transformations that apply to any symmetric matrix. (Transformations of this sort are used for many different purposes in literally all fields of science and engineering.) In particular, there is nothing specific to data analysis about these techniques. The PCA thus does not involve any of the concepts that we usually deal with in statistics or analysis: there is no mention of populations, samples, distributions, or models. Instead, principal component analysis is a set of formal transformations, which are applied to the covariance matrix of a data set. As such, it can be either exploratory or preparatory.

As an exploratory technique, we may inspect its results (the eigenvalues and eigenvectors) for anything that helps us develop an understanding of the data set. For example, we may look at the contributions to the first few principal components to see whether we can find an intuitive interpretation of them (we will see an example of this in the Workshop section). Biplots (discussed in the following section) are a graphical technique that can be useful in this context. But we should keep in mind that this kind of investigation is exploratory in nature: there is no guarantee that the results of a principal component analysis will turn up anything useful. In particular, we should not expect the principal components to have an intuitive interpretation in general. On the other hand, PCA may also be used as a preparatory technique. Keep in mind that, by construction, the principal components are uncorrelated. We can therefore transform any multivariate data set into an equivalent form, in which all variables are mutually uncorrelated, before performing any subsequent analysis.
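This preparatory use can be sketched in a few lines (with made-up correlated data): projecting onto the principal directions yields new variables whose mutual correlations vanish.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=400)
# Three mutually correlated variables built from shared ingredients.
data = np.column_stack([x,
                        x + 0.5 * rng.normal(size=400),
                        2 * x + rng.normal(size=400)])

centered = data - data.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))

# Transform to principal coordinates: the new variables are uncorrelated,
# so their covariance matrix is diagonal.
transformed = centered @ eigvecs
cov_new = np.cov(transformed, rowvar=False)
off_diagonal = cov_new - np.diag(np.diag(cov_new))
print(np.allclose(off_diagonal, 0))  # True: no correlations remain
```

The `transformed` array can now be fed into any subsequent analysis that prefers (or assumes) uncorrelated inputs.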
Identifying a subset of principal components that captures most of the variability in the data set—for the purpose of reducing the dimensionality of the problem, as we discussed earlier—is another preparatory use of principal component analysis. As a preparatory technique, principal component analysis is always applicable but may not always be useful. For instance, if the original variables are already uncorrelated, then the PCA cannot do anything for us. Similarly, if none of the eigenvalues is significantly smaller than the others (so that the corresponding principal components could be dropped), then again we gain nothing from the PCA. Finally, let me reiterate that PCA is just a mathematical transformation that can be applied to any symmetric matrix. This means that its results are not uniquely determined by the data set but instead are sensitive to the way the inputs are prepared. In particular, the results of a PCA depend on the actual numerical values of the data points and therefore on the units in which the measurements have been recorded. If the numerical values for one of the original variables are consistently larger than the values of the other variables, then the variable with the large values will unduly dominate the spectrum of eigenvalues. (We will see an example of this problem in the Workshop.) To avoid this kind of problem, all variables should be of comparable scale. A systematic way to achieve this is to work with the correlation matrix (in which all entries are normalized by the respective standard deviations) instead of the covariance matrix.

Biplots

Biplots are an interesting way to visualize the results of a principal component analysis. In a biplot, we plot the data points in a coordinate system spanned by the first two principal components (i.e., those two of the new variables corresponding to the largest eigenvalues).
In addition, we also plot a representation of the original variables but now projected into the space of the new variables. The data points are represented by symbols, whereas the directions of the original variables are represented by arrows. (See Figure 14-5 in the Workshop section.) In a biplot, we can immediately see the distribution of points when represented through the new variables (and can also look for clusters, outliers, or other interesting features). Moreover, we can see how the original variables relate to the first two principal components and to each other: if any of the original variables are approximately aligned with the horizontal (or vertical) axis, then they are approximately aligned with the first (or second) principal component (because in a biplot, the horizontal and vertical axes coincide with the first and second principal components). We can thus see which of the original variables contribute strongly to the first principal components, which might help us develop an intuitive interpretation for those components. Furthermore, any of the original variables that are roughly redundant will show up as more or less parallel to each other in a biplot—which can likewise help us identify such combinations of variables in the original problem. Biplots may or may not be helpful. There is a whole complicated set of techniques for interpreting biplots and reading off various quantities from them, but these techniques seem rarely used, and I have not found them to be very practical. If I do a PCA, I will routinely also draw a biplot: if it tells me something worthwhile, that's great; but if not, then I'm not going to spend much time on it.

Visual Techniques

Principal component analysis is a rigorous prescription and an example of a "data-centric" technique: it transforms the original data in a precisely prescribed way, without ambiguity and without making further assumptions.
The results are an expression of properties of the data set. It is up to us to interpret them, but the results are true regardless of whether we find them useful or not. In contrast, the methods described in this section are convenience methods that attempt to make multi-dimensional data sets more "palatable" for human consumption. These methods do not calculate any rigorous properties inherent in the data set; instead, they try to transform the data in such a way that it can be plotted while at the same time trying to be as faithful to the data as possible. We will not discuss any of these methods in depth, since personally, I do not find them worth the effort: on the one hand, they are (merely) exploratory in nature; on the other hand, they require rather heavy numerical computations and some nontrivial theory. Their primary results are projections (i.e., graphs) of data sets, which can be difficult to interpret if the number of data points or their dimensionality becomes large—which is exactly when I expect a computationally intensive method to be helpful! Nevertheless, there are situations where you might find these methods useful, and they do provide some interesting concepts for how to think about data. This last reason is the most important to me, which is why this section emphasizes concepts while skipping most of the technical details. The methods described in this section try to calculate specific "views" or projections of the data into a lower number of dimensions. Instead of selecting a specific projection, we can also try to display many of them in sequence, leaving it to the human observer to choose those that are "interesting." That is the method we introduced in Chapter 5, when we discussed Grand Tours and Projection Pursuits—they provide yet another approach to the problem of dimensionality reduction for multivariate data sets.
Multidimensional Scaling

Given a set of data points (i.e., the coordinates of each data point), we can easily find the distance between any pair of points (see Chapter 13 for a discussion of distance measures). Multidimensional scaling (MDS) attempts to answer the opposite question: given a distance matrix, can we recover the explicit coordinates of the points? This question has a certain intellectual appeal in its own right, but of course, it is relevant in situations where our information about a certain system is limited to the differences between data points. For example, in usability studies or surveys we may ask respondents to list which of a set of cars (or whiskeys, or pop singers) they find the most or the least alike; in fact, the entire method was first developed for use in psychological studies. The question is: given such a matrix of relative preferences or distances, can we come up with a set of absolute positions for each entry? First, we must choose the desired number of dimensions of our points. The dimension D = 2 is used often, so that the results can be plotted easily, but other values for D are also possible. If the distance measure is Euclidean—that is, if the distance between two points is given by:

$$d(x, y) = \sqrt{\sum_{i=1}^{D} (x_i - y_i)^2}$$

where the sum is running over all dimensions—then it turns out that we can invert this relationship explicitly and find expressions for the coordinates in terms of the distances. (The only additional assumption we need to make is that the center of mass of the entire data set lies at the origin, but this amounts to no more than an arbitrary translation of all points.) This technique is known as classical or metric scaling. The situation is more complicated if we cannot assume that the distance measure is Euclidean. Now we can no longer invert the relationship exactly and must resort instead to iterative approximation schemes.
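For the Euclidean case, the classical-scaling reconstruction can be sketched directly (a NumPy sketch with made-up points; the double-centering step below is the standard way of turning squared distances back into inner products):

```python
import numpy as np

rng = np.random.default_rng(4)
points = rng.normal(size=(20, 2))            # the "unknown" true coordinates

# All we are given: the matrix of pairwise Euclidean distances.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))

# Classical (metric) scaling: double-center the squared distances ...
n = dist.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ (dist**2) @ J

# ... and read the coordinates off its eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]          # keep the D = 2 largest eigenvalues
recovered = eigvecs[:, top] * np.sqrt(eigvals[top])

# The recovered configuration reproduces the original distances
# (up to rotation, reflection, and translation).
rdiff = recovered[:, None, :] - recovered[None, :, :]
print(np.allclose(np.sqrt((rdiff**2).sum(axis=-1)), dist))  # True
```

The centering step implements the assumption mentioned above, namely that the center of mass of the reconstructed configuration lies at the origin.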
Because the resulting coordinates may not replicate the original distances exactly, we include an additional constraint: the distance matrix calculated from the new positions must obey the same rank order as the original distance matrix: if the original distances between any three points obeyed the relationship d(x, y) < d(x, z) < d(y, z), then the reconstructed distances must obey the same ordering.

[FIGURE 18-5. A very simple decision tree: internal nodes test the conditions A > 10?, B < 5?, and A > 0?, with Yes/No branches leading to the leaves Class 0 and Class 1.]

Another important quantity related to decision trees is the gain ratio from a parent node to its children. This quantity measures the gain in purity from parent to children, weighted by the relative size of the subsets:

$$\Delta = I(\text{parent}) - \sum_{j \in \text{children}} \frac{N_j}{N}\, I(\text{child}_j)$$

where I is the purity (or impurity) of a node, N_j is the number of elements assigned to child node j, and N is the total number of elements at the parent node. We want to find a splitting that maximizes this gain ratio. What I have described so far is the outline of the basic algorithm. As with all greedy algorithms, there is no guarantee that it will find the optimal solution, and therefore various heuristics play a large role to ensure that the solution is as good as possible. Hence the various published (and proprietary) algorithms for decision trees (you may find references to CART, C4.5, and ID3) differ in details such as the following:

• What choice of purity/impurity measure is used?
• At what level of purity does the splitting procedure stop? (Continuing to split a training set until all leaf nodes are entirely pure usually results in overfitting.)
• Is the tree binary, or can a node have more than two children?
• How should noncategorical attributes be treated? (For attributes that take on a continuum of values, we need to define the optimal splitting point.)
• Is the tree postprocessed?
(To reduce overfitting, some algorithms employ a pruning step that attempts to eliminate leaf nodes having too few elements.) Decision trees are popular and combine several attractive features: with good algorithms, decision trees are relatively cheap to build and are always very fast to evaluate. They are also rather robust in the presence of noise. It can even be instructive to examine the decision points of a decision tree, because they frequently reveal interesting information about the distribution of class labels (such as when 80 percent of the class information is contained in the topmost node). However, algorithms for building decision trees are almost entirely black-box and do not lend themselves to ad hoc modifications or extensions. There is an equivalence between decision trees and rule-based classifiers. The latter consist of a set of rules (i.e., logical conditions on attribute values) that, when taken in aggregate, determine the class label of a test instance. There are two ways to build a rule-based classifier. We can build a decision tree first and then transform each complete path through the decision tree into a single rule. Alternatively, we can build rule-based classifiers directly from a training set by finding a subset of instances that can be described by a simple rule. These instances are then removed from the training set, and the process is repeated. (This amounts to a bottom-up approach, whereas using a variant of Hunt's algorithm to build a decision tree follows a top-down approach.)

Other Classifiers

In addition to the classifiers discussed so far, you will find others mentioned in the literature. I'll name just two—mostly because of their historical importance. Fisher's linear discriminant analysis (LDA) was one of the first classifiers developed. It is similar to principal component analysis (see Chapter 14).
Whereas in PCA, we introduce a new coordinate system to maximize the spread along the new coordinate axes, in LDA we introduce new coordinates to maximize the separation between the two classes that we try to distinguish. The positions of the means, calculated separately for each class, are taken as the locations of the classes. Artificial neural networks were conceived as extremely simplified models for biological brains. The idea was to have a network of nodes; each node receives input from several other nodes, forms a weighted average of its input, and then sends it on to the next layer of nodes. During the learning stage, the weights used in the weighted average are adjusted to minimize training error. Neural networks were very popular for a while but have recently fallen out of favor somewhat. One reason is that the calculations required are more complicated than for other classifiers; another is that the whole concept is very ad hoc and lacks a solid theoretical grounding.

The Process

In addition to the primary algorithms for classification, various techniques are important for dealing with practical problems. In this section, we look at some standard methods commonly used to enhance accuracy—especially for the important case when the most "interesting" type of class occurs much less frequently than the other types.

Ensemble Methods: Bagging and Boosting

The term ensemble methods refers to a set of techniques for improving accuracy by combining the results of individual or "base" classifiers. The rationale is the same as when performing some experiment or measurement multiple times and then averaging the results: as long as the experimental runs are independent, we can expect that errors will cancel and that the average will be more accurate than any individual trial.
The same logic applies to classification techniques: as long as the individual base classifiers are independent, combining their results will lead to cancellation of errors and the end result will have greater accuracy than the individual contributions. To generate a set of independent classifiers, we have to introduce some randomness into the process by which they are built. We can manipulate virtually any aspect of the overall system: we can play with the selection of training instances (as in bagging and boosting), with the selection of features (often in conjunction with random forests), or with parameters that are specific to the type of classifier used. Bagging is an application of the bootstrap idea (see Chapter 12) to classification. We generate additional training sets by sampling with replacement from the original training set. Each of these training sets is then used to train a separate classifier instance. During production, we let each of these instances provide a separate assessment for each item we want to classify. The final class label is then assigned based on a majority vote or similar technique. Boosting is another technique to generate additional training sets using a bootstrap approach. In contrast to bagging, boosting is an iterative process that assigns higher weights to instances misclassified in previous rounds. As the iteration progresses, higher emphasis is placed on training instances that have proven hard to classify correctly. The final result consists of the aggregate result of all base classifiers generated during the iteration. A popular variant of this technique is known as "AdaBoost." Random forests apply specifically to decision trees. In this technique, randomness is introduced not by sampling from the training set but by randomly choosing what features to use when building the decision tree.
Instead of examining all features at every node to find the feature that gives the greatest gain ratio, only a subset of features is evaluated for each tree.

Estimating Prediction Error

Earlier, we already talked about the difference between the training and the generalization error: the training error is the final error rate that the classifier achieves on the training set. It is usually not a good measure for the accuracy of the classifier on new data (i.e., on data that was not used to train the classifier). For this reason, we hold some of the data back during training, and use it later as a test set. The error that the classifier achieves on this test set is a much better measure for the generalization error that we can expect when using the classifier on entirely new data. If the original data set is very large, there is no problem in splitting it into a training and a test set. In reality, however, available data sets are always "too small," so that we need to make sure we use the available data most efficiently, using a process known as cross-validation. The basic idea is that we randomly divide the original data set into k equally sized chunks. We then perform k training and test runs. In each run, we omit one of the chunks from the training set and instead use it as the test set. Finally, we average the generalization errors from all k runs to obtain the overall expected generalization error. A value of k = 10 is typical, but you can also use a value like k = 3. Setting k = n, where n is the number of available data points, is special: in this so-called "leave-one-out" cross-validation, we train the classifier on all data points except one and then try to predict the omitted data point—this procedure is then repeated for all data points. (This prescription is similar to the jackknife process that was mentioned briefly in Chapter 12.)
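The k-fold procedure just described can be sketched in a few lines; the toy data set and the trivial majority-class "classifier" below are made up purely to exercise the splitting logic:

```python
import numpy as np

def cross_validate(x, y, train_and_test, k=10, seed=0):
    """Average the test error over k train/test splits of the data."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    chunks = np.array_split(indices, k)           # k equally sized chunks
    errors = []
    for i in range(k):
        test_idx = chunks[i]                      # this chunk is held out ...
        train_idx = np.concatenate([chunks[j] for j in range(k) if j != i])
        errors.append(train_and_test(x[train_idx], y[train_idx],
                                     x[test_idx], y[test_idx]))
    return float(np.mean(errors))

# A deliberately trivial stand-in classifier: always predict the majority class.
def majority_classifier(x_tr, y_tr, x_te, y_te):
    prediction = int(y_tr.mean() >= 0.5)          # majority class in this fold
    return float(np.mean(y_te != prediction))     # error rate on the test fold

x = np.arange(100, dtype=float)
y = (np.arange(100) < 30).astype(int)             # 30% positives
err = cross_validate(x, y, majority_classifier, k=10)
print(err)  # 0.3: the majority classifier misses exactly the positives
```

Any real classifier can be dropped in for `majority_classifier`, as long as it follows the same train-then-score calling convention.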
Yet another method uses the idea of random sampling with replacement, which is characteristic of bootstrap techniques (see Chapter 12). Instead of dividing the available data into k nonoverlapping chunks, we generate a bootstrap sample by drawing n data points with replacement from the original n data points. This bootstrap sample will contain some of the data points more than once, and some not at all: overall, the fraction of the unique data points included in the bootstrap sample will be about 1 − e^{−1} ≈ 0.632 of the available data points—for this reason, the method is often known as the 0.632 bootstrap. The bootstrap sample is used for training, and the data points not included in the bootstrap sample become the test set. This process can be repeated several times, and the results averaged as for cross-validation, to obtain the final estimate for the generalization error. (By the way, this is basically the "unique visitor" problem that we discussed in Chapters 9 and 12—after n days (draws) with one random visitor each day (one data point selected per draw), we will have seen a fraction 1 − (1 − 1/n)^n ≈ 1 − e^{−1} of the unique visitors (unique data points).)

TABLE 18-2. Terminology for the confusion matrix in the case of class imbalance (i.e., "bad" outcomes are much less frequent than "good" outcomes)

                 Predicted: Bad                   Predicted: Good
Actually: Bad    True positive: "Hit"             False negative: "Miss"
Actually: Good   False positive: "False alarm"    True negative: "Correct rejection"

Class Imbalance Problems

A special case of particular importance concerns situations where one of the classes occurs much less frequently than any of the other classes in the data set—and, as luck would have it, that's usually the class we are interested in! Consider credit card fraud detection, for instance: only one of every hundred credit card transactions may be fraudulent, but those are exactly the ones we are interested in.
Screening lab results for patients with elevated heart attack risk or inspecting manufactured items for defects falls into the same camp: the "interesting" cases are rare, perhaps extremely rare, but those are precisely the cases that we want to identify. For cases like this, there is some additional terminology as well as some special techniques for overcoming the technical difficulties. Because there is one particular class that is of greater interest, we refer to an instance belonging to this class as a positive event and the class itself as the positive class. With this terminology, entries in the confusion matrix (see Table 18-1) are often referred to as true (or false) positives (or negatives). I have always found this terminology very confusing, in part because what is called "positive" is usually something bad: a fraudulent transaction, a defective item, a bad heart. Table 18-2 shows a confusion matrix employing the special terminology for problems with a class imbalance—and also an alternative terminology that may be more intuitive. The two different types of errors may have very different costs associated with them. From the point of view of a merchant accepting credit cards as payment, a false negative (i.e., a fraudulent transaction incorrectly classified as "not fraudulent"—a "miss") results in the total loss of the item purchased, whereas a false positive (a valid transaction incorrectly classified as "not valid"—a "false alarm") results only in the loss of the profit margin on that item. The usual metrics by which we evaluate a classifier (such as accuracy and error rate) may not be very meaningful in situations with pronounced class imbalances: keep in mind that the trivial classifier that labels every credit card transaction as "valid" is 99 percent accurate—and entirely useless! Two metrics that provide better insight into the ability of a classifier to detect instances belonging to the positive class are recall and precision.
The precision is the fraction of correct classifications among all instances labeled positive; the recall is the fraction of all actual positive instances that are correctly labeled positive:

$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad \text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$

You can see that we will need to strike a balance. On the one hand, we can build a classifier that is very aggressive, labeling many transactions as "bad," but it will have a high false-positive rate, and therefore low precision. On the other hand, we can build a classifier that is highly selective, marking only those instances that are blatantly fraudulent as "bad," but it will have a high rate of false negatives and therefore low recall. These two competing goals (to have few false positives and few false negatives) can be summarized in a graph known as a receiver operating characteristic (ROC) curve. (The concept originated in signal processing, where it was used to describe the ability of a receiver to distinguish a true signal from a spurious one in the presence of noise, hence the name.) Figure 18-6 shows an example of a ROC curve.

[FIGURE 18-6. A ROC (receiver operating characteristic) curve: the trade-off between true positives ("hits") and false positives ("false alarms"), for three different classifier implementations. The false positive rate is plotted along the horizontal axis and the true positive rate along the vertical axis; the optimum and a random classifier are marked for comparison.]

Along the horizontal axis, we plot the false positive rate (good events that were labeled as bad—"false alarms") and along the vertical axis we plot the true positive rate (bad events labeled as bad—"hits"). The lower-left corner corresponds to a maximally conservative classifier, which labels every instance as good; the upper-right corner corresponds to a maximally aggressive classifier, which labels everything as bad.
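The threshold sweep that traces out such a curve can be sketched directly (the classifier "scores" below are made-up Gaussian data; a real classifier would supply its own scores):

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical classifier scores: positives ("bad" events) tend to receive
# higher scores than negatives ("good" events).
scores_pos = rng.normal(1.0, 1.0, size=500)
scores_neg = rng.normal(0.0, 1.0, size=500)

# Sweep the decision threshold to map out the ROC curve point by point.
thresholds = np.linspace(-4.0, 5.0, 200)
tpr = np.array([(scores_pos > t).mean() for t in thresholds])  # "hits"
fpr = np.array([(scores_neg > t).mean() for t in thresholds])  # "false alarms"

# The extreme thresholds reproduce the two corners of the ROC plot:
# a very low threshold labels everything "bad" (upper right, near (1, 1)),
# a very high threshold labels everything "good" (lower left, near (0, 0)).
print(fpr[0], tpr[0])
print(fpr[-1], tpr[-1])
```

Plotting `fpr` against `tpr` (for instance with matplotlib) yields the curve of Figure 18-6; because the positive scores are shifted upward, the curve bows above the diagonal of a random classifier.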
We can now imagine tuning the parameters and thresholds of our classifier to shift the balance between "misses" and "false alarms" and thereby map out the characteristic curve for our classifier. The curve for a random classifier (which assigns a positive class label with fixed probability p, irrespective of attribute values) will be close to the diagonal: it is equally likely to classify a good instance as good as it is to classify a bad one as good, hence its false positive rate equals its true positive rate. In contrast, the ideal classifier would have a true positive rate equal to 1 throughout. We want to tune our classifier so that it approximates the ideal classifier as nearly as possible. Class imbalances pose some technical issues during the training phase: if positive instances are extremely rare, then we want to make sure to retain as much of their information as possible in the training set. One way to achieve this is by oversampling (i.e., resampling) from the positive class instances—and undersampling from the negative class instances—when generating a training set.

The Secret Sauce

All this detail about different algorithms and processes can easily leave the impression that that's all there is to classification. That would be unfortunate, because it leaves out what can be the most important but also the most difficult part of the puzzle: finding the right attributes! The choice of attributes matters for successful classification—arguably more so than the choice of classification algorithm. Here is an interesting case study. Paul Graham has written two essays on using Bayesian classifiers for spam filtering.* In the second one, he describes how using the information contained in the email headers is critical to obtaining good classification results, whereas using only information in the body is not enough.
The punch line here is clear: in practice, it matters a lot which features or attributes you choose to include. Unfortunately, when compared with the extremely detailed information available on different classifier algorithms and their theoretical properties, it is much more difficult to find good guidance regarding how best to choose, prepare, and encode features for classification. (None of the current books on classification discuss this topic at all.) I think there are several reasons for this relative lack of easily available information—despite the importance of the topic. One of them is lack of rigor: whereas one can prove rigorous theorems on classification algorithms, most recommendations for feature preparation and encoding would necessarily be empirical and heuristic. Furthermore, every problem domain is different, which makes it difficult to come up with recommendations that would be applicable more generally. The implication is that factors such as experience, familiarity with the problem domain, and lots of time-consuming trial and error are essential when choosing attributes for classification. (A last reason for the relative lack of available information on this topic may be that some prefer to keep their cards a little closer to their chest: they may tell you how it works "in theory," but they won't reveal all the tricks of the trade necessary to fully replicate the results.)

The difficulty of developing some recommendations that work in general and for a broad range of application domains may also explain one particular observation regarding classification: the apparent scarcity of spectacular, well-publicized successes. Spam filtering seems to be about the only application that clearly works and affects many people directly.

*"A Plan for Spam" (http://www.paulgraham.com/spam.html) and "Better Bayesian Filtering" (http://www.paulgraham.com/better.html).
Credit card fraud detection and credit scoring are two other widely used (if less directly visible) applications. But beyond those two, I see only a host of smaller, specialized applications. This suggests again that every successful classifier implementation depends strongly on the details of the particular problem—probably more so than on the choice of algorithm.

The Nature of Statistical Learning

Now that we have seen some of the most commonly used algorithms for classification as well as some of the related practical techniques, it's easy to feel a bit overwhelmed—there seem to be so many different approaches (each nontrivial in its own way) that it can be hard to see the commonalities among them: the "big picture" is easily lost. So let's step back for a moment and reflect on the specific challenges posed by classification problems and on the overall strategy by which the various algorithms overcome these challenges. The crucial problem is that from the outset, we don't have good insight into which features are the most relevant in predicting the class—in fact, we may have no idea at all about the processes (if any!) that link observable features to the resulting class. Because we don't know ahead of time which features are likely to be most important, we need to retain them all and perhaps even expand the feature set in an attempt to include any possible clue we can get. In this way, the problem quickly becomes very multi-dimensional. That's the first challenge. But now we run into a problem: multi-dimensional data sets are invariably sparse data sets. Think of a histogram with (say) 5 bins per dimension. In one dimension, we have 5 bins total. If we want on average at least 5 items per bin, we can make do with 25 items total. Now consider the same data set in two dimensions. If we still require 5 bins per dimension, we have a total of 25 bins, so that each bin contains on average only a single element.
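These bin counts are easy to check numerically; a quick sketch that extends the same 25-item, 5-bins-per-dimension setup through three dimensions:

```python
import numpy as np

rng = np.random.default_rng(9)
n_points, bins_per_dim = 25, 5

for d in (1, 2, 3):
    data = rng.uniform(0, 1, size=(n_points, d))
    # Assign each point to a bin: 5 bins per dimension, 5**d bins in total.
    cells = (data * bins_per_dim).astype(int)
    occupied = len({tuple(c) for c in cells})
    total = bins_per_dim**d
    print(d, total, occupied / total)  # occupancy drops sharply with d
```

With only 25 points, at most 25 of the 125 three-dimensional bins can ever be occupied, so at least 80 percent of them are necessarily empty.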
But it is in three dimensions that the situation becomes truly dramatic: now there are 125 bins, so we can be sure that the majority of bins will contain no element at all! It gets even worse in higher dimensions. (Mathematically speaking, the problem is that the number of bins grows exponentially with the number of dimensions: N^d, where d is the number of dimensions and N is the number of bins per dimension. No matter what you do, the number of cells is going to grow faster than you can obtain data. This problem is known as the curse of dimensionality.) That's the second challenge.

It is this combinatorial explosion that drives the need for larger and larger data sets. We have just seen that the number of possible attribute value combinations grows exponentially; therefore, if we want to have a reasonable chance of finding at least one instance of each possible combination in our training data, we need very large data sets indeed. Yet despite our best efforts, we will frequently end up with a sparse data set (as discussed above). At the same time, we will often have to deal with inconveniently large data sets. That's the third challenge.

Basically all classification algorithms deal with these challenges by using some form of interpolation between points in the sparse data set. In other words, they attempt to smoothly fill the gaps left in the high-dimensional feature space, supported only by a (necessarily sparse) set of points (i.e., the training instances). Different algorithms do this in different ways: nearest-neighbor methods and naive Bayesian classifiers explicitly "smear out" the training instances to fill the gaps locally, whereas regression and support vector classifiers construct global structures to form a smooth decision boundary from the sparse set of supporting points.
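The local "smearing" strategy is easy to visualize in one dimension. A minimal sketch (the points and labels below are invented for illustration, not data from this chapter): a nearest-neighbor rule assigns to any query position the label of the closest training point, so even a very sparse training set yields a prediction everywhere, with the induced decision boundary falling midway between the two clusters.

```python
import numpy as np

# Invented sparse 1-D "training set": two labeled clusters with a wide gap
x_train = np.array([1.0, 2.0, 8.0, 9.0])
y_train = ['A', 'A', 'B', 'B']

def nn_predict(x):
    """Label of the training point closest to x: purely local gap-filling."""
    return y_train[int(np.argmin(np.abs(x_train - x)))]

# Queries inside the gap still get labels; the boundary sits near x = 5
grid = [0.0, 3.0, 4.9, 5.1, 7.0, 10.0]
print([nn_predict(x) for x in grid])   # → ['A', 'A', 'A', 'B', 'B', 'B']
```

A global method would instead fit a single threshold (here, any cut between 2 and 8) to the whole data set at once; in one dimension the two strategies largely agree, but in a sparse, high-dimensional space their behavior inside the gaps can differ substantially.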
Decision trees are similar to nearest-neighbor methods in this regard but provide a particularly fast and efficient lookup of the most relevant neighbors. Their differences aside, all algorithms aim to fill the gaps between the existing data points in some smooth, consistent way.

This brings us to the question of what can actually be predicted in this fashion. Obviously, class labels must depend on attribute values, and they should do so in some smooth, predictable fashion. If the relationship between attribute values and class labels is too erratic, no classifier will be very useful. Furthermore, the distribution of attribute values must differ between classes, for otherwise no classifier will be able to distinguish the classes by examining the attribute values.

Unfortunately, there is—to my knowledge—no independent, rigorous way of determining whether the information contained in a data set is sufficient to allow the data to be classified. To find out, we must build an actual classifier. If it works, then obviously there is enough information in the data set for classification. But if it does not work, we have learned nothing, because it is always possible that a different or more sophisticated classifier would succeed. Without an independent test, we can spend an infinite amount of time building and refining classifiers on data sets that contain no useful information.

We already encountered this kind of difficulty in Chapter 13, in the context of clustering algorithms, but it strikes me as even more of a problem here. The reason is that classification is by nature predictive (or at least should be), whereas uncertainty of this sort seems more acceptable in an exploratory technique such as clustering. To make this clearer, suppose we have a large, rich data set: many records with many features. We then arbitrarily assign class labels A and B to the records in the data set.
Now, by construction, it is clear that there is no way to predict the labels from the "data"—they are, after all, purely random! However, there is no unambiguous test that will clearly say so. We can calculate the correlation coefficients between each feature (or combination of features) and the class label, we can look at the distribution of feature values and see whether they differ from class to class, and so eventually convince ourselves that we won't be able to build a good classifier from this data set. But there is no clear test or diagnostic that would give us, for instance, an upper bound on the quality of any classifier that could be built based on this data set. If we are not careful, we may spend a lot of time vainly attempting to build a classifier capable of extracting useful information from this data set. This kind of problem is a trap to be aware of!

Workshop: Two Do-It-Yourself Classifiers

With classification especially, it is really easy to end up with a black-box solution: a tool or library that provides an implementation of a classification algorithm—but one that we would not be able to write ourselves if we had to. This kind of situation always makes me a bit uncomfortable, especially if the algorithm requires any parameter tuning to work properly. In order to adjust such parameters intelligently, I need to understand the algorithm well enough that I could at least provide a rough-cut version myself (much as I am happy to rely on the library designer for the high-performance version).

In this spirit, instead of discussing an existing classification library, I want to show you how to write straightforward (you might say "toy version") implementations of two simple classifiers: a nearest-neighbor lazy learner and a naive Bayesian classifier. (I'll give some pointers to other libraries near the end of the section.)
We will test our implementations on the classic data set in all of classification: Fisher's Iris data set.* The data set contains measurements of four different parts of an iris flower (sepal length and width, petal length and width). There are 150 records in the data set, distributed equally among three species of iris (Iris setosa, versicolor, and virginica). The task is to predict the species from a given set of measurements.

First of all, let's take a quick look at the distributions of the four quantities, to see whether it seems feasible to distinguish the three classes this way. Figure 18-7 shows histograms (actually, kernel density estimates) for all four quantities, separately for the three classes. One of the features (sepal width) does not seem very promising, but the distributions of the other three features seem sufficiently separated that it should be possible to obtain good classification results.

*First published in 1936. The data set is available from many sources, for example in the "Iris" data set on the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/.

FIGURE 18-7. The distribution of the four attributes (sepal length, sepal width, petal length, petal width) in the Iris data set, displayed separately for the three classes.

As preparation, I split the original data set into two parts: a training set (in the file iris.trn) and a test set (in the file iris.tst). I randomly selected five records from each class for the test set; the remaining records were used for training. The test set is shown in full below: the columns are (in order) sepal length, sepal width, petal length, petal width, and the class label.
(All measurements are in centimeters and to millimeter precision.)

    5.0,3.6,1.4,0.2,Iris-setosa
    4.8,3.0,1.4,0.1,Iris-setosa
    5.2,3.5,1.5,0.2,Iris-setosa
    5.1,3.8,1.6,0.2,Iris-setosa
    5.3,3.7,1.5,0.2,Iris-setosa
    5.7,2.8,4.5,1.3,Iris-versicolor
    5.2,2.7,3.9,1.4,Iris-versicolor
    6.1,2.9,4.7,1.4,Iris-versicolor
    6.1,2.8,4.7,1.2,Iris-versicolor
    6.0,3.4,4.5,1.6,Iris-versicolor
    6.3,2.9,5.6,1.8,Iris-virginica
    6.2,2.8,4.8,1.8,Iris-virginica
    7.9,3.8,6.4,2.0,Iris-virginica
    5.8,2.7,5.1,1.9,Iris-virginica
    6.5,3.0,5.2,2.0,Iris-virginica

Our implementation of the nearest-neighbor classifier is shown in the next listing. The implementation is exceedingly simple—especially once you realize that about two thirds of the listing deal with file input and output. The actual "classification" is a matter of three lines in the middle:

    # A Nearest-Neighbor Classifier
    from numpy import *

    train = loadtxt( "iris.trn", delimiter=',', usecols=(0,1,2,3) )
    trainlabel = loadtxt( "iris.trn", delimiter=',', usecols=(4,), dtype=str )

    test = loadtxt( "iris.tst", delimiter=',', usecols=(0,1,2,3) )
    testlabel = loadtxt( "iris.tst", delimiter=',', usecols=(4,), dtype=str )

    hit, miss = 0, 0
    for i in range( test.shape[0] ):
        dist = sqrt( sum( (test[i] - train)**2, axis=1 ) )
        k = argmin( dist )

        if trainlabel[k] == testlabel[i]:
            flag = '+'
            hit += 1
        else:
            flag = '-'
            miss += 1

        print flag, "\t Predicted: ", trainlabel[k], "\t True: ", testlabel[i]

    print
    print hit, "out of", hit+miss, "correct - Accuracy: ", hit/(hit+miss+0.0)

The algorithm loads both the training and the test data set into two-dimensional NumPy arrays. Because all elements in a NumPy array must be of the same type, we store the class labels (which are strings, not numbers) in separate vectors. Now follows the actual classification step: for each element of the test set, we calculate the Euclidean distance to each element in the training set.
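That distance computation is the one idiom in the listing worth isolating. A self-contained sketch with a few invented records (not the actual contents of iris.trn): one test record is subtracted from the entire training array at once, and axis=1 collapses the four per-feature differences of each row into a single distance.

```python
import numpy as np

# Three invented training records, four features each
train = np.array([[5.1, 3.5, 1.4, 0.2],
                  [6.7, 3.1, 4.4, 1.4],
                  [6.3, 3.3, 6.0, 2.5]])

# One test record; broadcasting stretches it across all three rows
x = np.array([5.0, 3.6, 1.4, 0.2])

# axis=1 sums the squared differences over the features (columns),
# leaving one Euclidean distance per training record
dist = np.sqrt(np.sum((x - train)**2, axis=1))

print(int(np.argmin(dist)))   # → 0 (the first record is closest)
```

Without axis=1, NumPy would sum over all twelve squared differences and return a single number; with it, we get a vector of three distances, one per training record, ready for argmin().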
We make use of NumPy "broadcasting" (see the Workshop in Chapter 2) to calculate the distance of the test instance test[i] from all training instances in one fell swoop. (The argument axis=1 is necessary to tell NumPy that the sum in the Euclidean distance should be taken over the inner (horizontal) dimension of the two-dimensional array.) Next, we use the argmin() function to obtain the index of the training record that has the smallest distance to the current test record: the label of that record is our predicted class label. (Notice that we base our result on only a single record—namely, the closest training instance.)

Simple as it is, the classifier works very well (on this data set). For the test set shown, all records in the test set are classified correctly!

The naive Bayesian classifier implementation is next. A naive Bayesian classifier needs an estimate of the probability distribution P(feature x | class C), which we find from a histogram of attribute values, separately for each class. In this case, we need a total of 12 histograms (3 classes × 4 features). I maintain this data in a triply nested data structure: histo[label][feature][value]. The first index is the class label, the second index specifies the feature, and the third contains the values of the feature that occur in the histogram.
The value stored in the histogram is the number of times that each value has been observed:

    # A Naive Bayesian Classifier

    total = {}    # Training instances per class label
    histo = {}    # Histogram

    # Read the training set and build up a histogram
    train = open( "iris.trn" )
    for line in train:
        # seplen, sepwid, petlen, petwid, label
        f = line.rstrip().split( ',' )
        label = f.pop()

        if not total.has_key( label ):
            total[ label ] = 0
            histo[ label ] = [ {}, {}, {}, {} ]

        # Count training instances for the current label
        total[label] += 1

        # Iterate over features
        for i in range( 4 ):
            histo[label][i][f[i]] = 1 + histo[label][i].get( f[i], 0.0 )
    train.close()

    # Read the test set and evaluate the probabilities
    hit, miss = 0, 0
    test = open( "iris.tst" )
    for line in test:
        f = line.rstrip().split( ',' )
        true = f.pop()

        p = {}    # Probability for class label, given the test features
        for label in total.keys():
            p[label] = 1
            for i in range( 4 ):
                p[label] *= histo[label][i].get(f[i],0.0)/total[label]

        # Find the label with the largest probability
        mx, predicted = 0, -1
        for k in p.keys():
            if p[k] >= mx:
                mx, predicted = p[k], k

        if true == predicted:
            flag = '+'
            hit += 1
        else:
            flag = '-'
            miss += 1

        print flag, "\t", true, "\t", predicted, "\t",
        for label in p.keys():
            print label, ":", p[label], "\t",
        print

    print
    print hit, "out of", hit+miss, "correct - Accuracy: ", hit/(hit+miss+0.0)
    test.close()

I'd like to point out two implementation details. The first is that the second index is an integer, which I use instead of the feature names; this simplifies some of the loops in the program. The second detail is more important: I know that the feature values are given in centimeters, with exactly one digit after the decimal point. In other words, the values are already discretized, and so I don't need to "bin" them any further—in effect, each bin in the histogram is one millimeter wide.
Because I never need to operate on the feature values, I don't even convert them to numbers: I read them as strings from the file and use them (as strings) as keys in the histogram. Of course, if we wanted to use a different bin width, then we would have to convert them into numerical values so that we could operate on them.

In the evaluation part, the program reads data points from the test set and then, for each of the three class labels, evaluates the probability that the record belongs to that class. We then pick the class label that has the highest probability. (Notice that we don't need an explicit factor for the prior probability, since we know that each class is equally likely.) On the test set shown earlier, the Bayesian classifier does a little worse than the nearest-neighbor classifier: it correctly classifies 12 of 15 instances, for a total accuracy of 80 percent.

If you look at the results of the classifier more closely, you will immediately notice a couple of problems that are common with Bayesian classifiers. First of all, the posterior probabilities are small. This should come as no surprise: each Bayes factor is smaller than 1 (because it's a probability), so their product becomes very small very quickly. To avoid underflow, it's usually a good idea to add the logarithms of the probabilities instead of multiplying the probabilities directly. In fact, with a greater number of features, this becomes a necessity.

The second problem is that many of the posterior probabilities come out as exactly zero: this occurs whenever no entry in the histogram can be found for at least one of the feature values in the test record; in this case the histogram evaluates to zero, and so the entire product of probabilities is identically zero.
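Both problems have a classic remedy that the listing above deliberately omits. Additive ("Laplace") smoothing acts as if every histogram bin had been seen once more often than it actually was, so no conditional probability is ever exactly zero; and summing logarithms replaces the rapidly shrinking product with a well-behaved sum. A hedged sketch (the counts below are invented for illustration and do not come from the Iris training file):

```python
from math import log

# Invented histogram slice: how often one particular feature value was
# seen for each class, and how many training records each class has
counts = {'Iris-setosa': 30, 'Iris-versicolor': 0, 'Iris-virginica': 2}
totals = {'Iris-setosa': 45, 'Iris-versicolor': 45, 'Iris-virginica': 45}
n_bins = 10   # assumed number of distinct values this feature can take

def smoothed_log_p(label):
    # Add-one smoothing: the numerator can never be zero,
    # so the logarithm is always defined
    p = (counts[label] + 1.0) / (totals[label] + n_bins)
    return log(p)

for label in sorted(counts):
    print(label, round(smoothed_log_p(label), 3))
```

In the full classifier, one would add such per-feature log probabilities for each class and pick the label with the largest sum. Replacing the product with a log-sum leaves the ranking of the labels unchanged while eliminating underflow; the smoothing additionally removes the hard zeros caused by unseen feature values.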
There are different ways of dealing with this problem—in our case, you might want to experiment with replacing the histogram of discrete feature values with a kernel density estimate (similar to those in Figure 18-7), which, by construction, is nonzero everywhere. Keep in mind that you will need to determine a suitable bandwidth for each density estimate!

Let me be clear: the implementations of both classifiers are extremely simpleminded. My intention here is to demonstrate the basic ideas behind these algorithms in as few lines of code as possible—and also to show that there is nothing mystical about writing a simple classifier. Because the implementations are so simple, it is easy to continue experimenting with them: can we do better if we use a larger number of neighbors in our nearest-neighbor classifier? How about a different distance function? In the naive Bayesian classifier, we can experiment with different bin widths in the histogram or, better yet, replace the histogram of discrete bins with a kernel density estimate. In either case, we need to start thinking about runtime efficiency: for a data set of only 150 elements this does not matter much, but evaluating a kernel density estimate of a few thousand points can be quite expensive!

If you want to use an established tool or library, there are several choices in the open source world. Three projects have put together entire data analysis and mining "toolboxes," complete with graphical user interfaces, plotting capabilities, and various plug-ins: RapidMiner (http://rapid-i.com/) and WEKA (http://www.cs.waikato.ac.nz/ml/weka/), which are both written in Java, as well as Orange (http://www.ailab.si/orange/), which is written in Python. WEKA has been around for a long time and is very well established; RapidMiner is part of a more comprehensive tool suite (and includes WEKA as a plug-in).
All three of these projects use a "pipeline" metaphor: you select different processing steps (discretizers, smoothers, principal component analysis, regression, classifiers) from a toolbox and string them together to build up the whole analysis workflow entirely within the tool. Give it a shot—the idea has a lot of appeal, but I must confess that I have never succeeded in doing anything nontrivial with any of them!

There are some additional libraries worth checking out that have Python interfaces: libSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and Shogun (http://www.shogun-toolbox.org/) provide implementations of support vector machines, while the Modular toolkit for Data Processing (http://mdp-toolkit.sourceforge.net/) is more general. (The latter also adheres to the "pipeline" metaphor.)

Finally, all classification algorithms are also available as R packages. I'll mention just three: the class package for nearest-neighbor classifiers and the rpart package for decision trees (both part of the R standard distribution) as well as the e1071 package (which can be found on CRAN) for support vector machines and nai