THIRD EDITION

Mathematical Statistics and Data Analysis

John A. Rice
University of California, Berkeley

Australia • Brazil • Canada • Mexico • Singapore • Spain • United Kingdom • United States

Mathematical Statistics and Data Analysis, Third Edition
John A. Rice

Acquisitions Editor: Carolyn Crockett
Assistant Editor: Ann Day
Editorial Assistant: Elizabeth Gershman
Technology Project Manager: Fiona Chong
Marketing Manager: Joe Rogove
Marketing Assistant: Brian Smith
Marketing Communications Manager: Darlene Amidon-Brent
Project Manager, Editorial Production: Kelsey McGee
Creative Director: Rob Hugel
Art Director: Lee Friedman
Print Buyer: Karen Hunt
Permissions Editor: Bob Kauser
Production Service: Interactive Composition Corporation
Text Designer: Roy Neuhaus
Copy Editor: Victoria Thurman
Illustrator: Interactive Composition Corporation
Cover Designer: Denise Davidson
Cover Printer: Coral Graphic Services
Compositor: Interactive Composition Corporation
Printer: R.R. Donnelley/Crawfordsville

© 2007 Duxbury, an imprint of Thomson Brooks/Cole, a part of The Thomson Corporation. Thomson, the Star logo, and Brooks/Cole are trademarks used herein under license.

ALL RIGHTS RESERVED. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, Web distribution, information storage and retrieval systems, or in any other manner—without the written permission of the publisher.

Printed in the United States of America
1 2 3 4 5 6 7   10 09 08 07 06

For more information about our products, contact us at: Thomson Learning Academic Resource Center, 1-800-423-0563. For permission to use material from this text or product, submit a request online at http://www.thomsonrights.com. Any additional questions about permissions can be submitted by e-mail to thomsonrights@thomson.com.

Thomson Higher Education
10 Davis Drive
Belmont, CA 94002-3098
USA

Library of Congress Control Number: 2005938314
Student Edition: ISBN 0-534-39942-8

We must be careful not to confuse data with the abstractions we use to analyze them.
—WILLIAM JAMES (1842–1910)

Contents

Preface

1 Probability
  1.1 Introduction
  1.2 Sample Spaces
  1.3 Probability Measures
  1.4 Computing Probabilities: Counting Methods
    1.4.1 The Multiplication Principle
    1.4.2 Permutations and Combinations
  1.5 Conditional Probability
  1.6 Independence
  1.7 Concluding Remarks
  1.8 Problems

2 Random Variables
  2.1 Discrete Random Variables
    2.1.1 Bernoulli Random Variables
    2.1.2 The Binomial Distribution
    2.1.3 The Geometric and Negative Binomial Distributions
    2.1.4 The Hypergeometric Distribution
    2.1.5 The Poisson Distribution
  2.2 Continuous Random Variables
    2.2.1 The Exponential Density
    2.2.2 The Gamma Density
    2.2.3 The Normal Distribution
    2.2.4 The Beta Density
  2.3 Functions of a Random Variable
  2.4 Concluding Remarks
  2.5 Problems

3 Joint Distributions
  3.1 Introduction
  3.2 Discrete Random Variables
  3.3 Continuous Random Variables
  3.4 Independent Random Variables
  3.5 Conditional Distributions
    3.5.1 The Discrete Case
    3.5.2 The Continuous Case
  3.6 Functions of Jointly Distributed Random Variables
    3.6.1 Sums and Quotients
    3.6.2 The General Case
  3.7 Extrema and Order Statistics
  3.8 Problems

4 Expected Values
  4.1 The Expected Value of a Random Variable
    4.1.1 Expectations of Functions of Random Variables
    4.1.2 Expectations of Linear Combinations of Random Variables
  4.2 Variance and Standard Deviation
    4.2.1 A Model for Measurement Error
  4.3 Covariance and Correlation
  4.4 Conditional Expectation and Prediction
    4.4.1 Definitions and Examples
    4.4.2 Prediction
  4.5 The Moment-Generating Function
  4.6 Approximate Methods
  4.7 Problems

5 Limit Theorems
  5.1 Introduction
  5.2 The Law of Large Numbers
  5.3 Convergence in Distribution and the Central Limit Theorem
  5.4 Problems

6 Distributions Derived from the Normal Distribution
  6.1 Introduction
  6.2 χ², t, and F Distributions
  6.3 The Sample Mean and the Sample Variance
  6.4 Problems

7 Survey Sampling
  7.1 Introduction
  7.2 Population Parameters
  7.3 Simple Random Sampling
    7.3.1 The Expectation and Variance of the Sample Mean
    7.3.2 Estimation of the Population Variance
    7.3.3 The Normal Approximation to the Sampling Distribution of X̄
  7.4 Estimation of a Ratio
  7.5 Stratified Random Sampling
    7.5.1 Introduction and Notation
    7.5.2 Properties of Stratified Estimates
    7.5.3 Methods of Allocation
  7.6 Concluding Remarks
  7.7 Problems

8 Estimation of Parameters and Fitting of Probability Distributions
  8.1 Introduction
  8.2 Fitting the Poisson Distribution to Emissions of Alpha Particles
  8.3 Parameter Estimation
  8.4 The Method of Moments
  8.5 The Method of Maximum Likelihood
    8.5.1 Maximum Likelihood Estimates of Multinomial Cell Probabilities
    8.5.2 Large Sample Theory for Maximum Likelihood Estimates
    8.5.3 Confidence Intervals from Maximum Likelihood Estimates
  8.6 The Bayesian Approach to Parameter Estimation
    8.6.1 Further Remarks on Priors
    8.6.2 Large Sample Normal Approximation to the Posterior
    8.6.3 Computational Aspects
  8.7 Efficiency and the Cramér-Rao Lower Bound
    8.7.1 An Example: The Negative Binomial Distribution
  8.8 Sufficiency
    8.8.1 A Factorization Theorem
    8.8.2 The Rao-Blackwell Theorem
  8.9 Concluding Remarks
  8.10 Problems

9 Testing Hypotheses and Assessing Goodness of Fit
  9.1 Introduction
  9.2 The Neyman-Pearson Paradigm
    9.2.1 Specification of the Significance Level and the Concept of a p-value
    9.2.2 The Null Hypothesis
    9.2.3 Uniformly Most Powerful Tests
  9.3 The Duality of Confidence Intervals and Hypothesis Tests
  9.4 Generalized Likelihood Ratio Tests
  9.5 Likelihood Ratio Tests for the Multinomial Distribution
  9.6 The Poisson Dispersion Test
  9.7 Hanging Rootograms
  9.8 Probability Plots
  9.9 Tests for Normality
  9.10 Concluding Remarks
  9.11 Problems

10 Summarizing Data
  10.1 Introduction
  10.2 Methods Based on the Cumulative Distribution Function
    10.2.1 The Empirical Cumulative Distribution Function
    10.2.2 The Survival Function
    10.2.3 Quantile-Quantile Plots
  10.3 Histograms, Density Curves, and Stem-and-Leaf Plots
  10.4 Measures of Location
    10.4.1 The Arithmetic Mean
    10.4.2 The Median
    10.4.3 The Trimmed Mean
    10.4.4 M Estimates
    10.4.5 Comparison of Location Estimates
    10.4.6 Estimating Variability of Location Estimates by the Bootstrap
  10.5 Measures of Dispersion
  10.6 Boxplots
  10.7 Exploring Relationships with Scatterplots
  10.8 Concluding Remarks
  10.9 Problems

11 Comparing Two Samples
  11.1 Introduction
  11.2 Comparing Two Independent Samples
    11.2.1 Methods Based on the Normal Distribution
    11.2.2 Power
    11.2.3 A Nonparametric Method—The Mann-Whitney Test
    11.2.4 Bayesian Approach
  11.3 Comparing Paired Samples
    11.3.1 Methods Based on the Normal Distribution
    11.3.2 A Nonparametric Method—The Signed Rank Test
    11.3.3 An Example—Measuring Mercury Levels in Fish
  11.4 Experimental Design
    11.4.1 Mammary Artery Ligation
    11.4.2 The Placebo Effect
    11.4.3 The Lanarkshire Milk Experiment
    11.4.4 The Portacaval Shunt
    11.4.5 FD&C Red No. 40
    11.4.6 Further Remarks on Randomization
    11.4.7 Observational Studies, Confounding, and Bias in Graduate Admissions
    11.4.8 Fishing Expeditions
  11.5 Concluding Remarks
  11.6 Problems

12 The Analysis of Variance
  12.1 Introduction
  12.2 The One-Way Layout
    12.2.1 Normal Theory; the F Test
    12.2.2 The Problem of Multiple Comparisons
    12.2.3 A Nonparametric Method—The Kruskal-Wallis Test
  12.3 The Two-Way Layout
    12.3.1 Additive Parametrization
    12.3.2 Normal Theory for the Two-Way Layout
    12.3.3 Randomized Block Designs
    12.3.4 A Nonparametric Method—Friedman's Test
  12.4 Concluding Remarks
  12.5 Problems

13 The Analysis of Categorical Data
  13.1 Introduction
  13.2 Fisher's Exact Test
  13.3 The Chi-Square Test of Homogeneity
  13.4 The Chi-Square Test of Independence
  13.5 Matched-Pairs Designs
  13.6 Odds Ratios
  13.7 Concluding Remarks
  13.8 Problems

14 Linear Least Squares
  14.1 Introduction
  14.2 Simple Linear Regression
    14.2.1 Statistical Properties of the Estimated Slope and Intercept
    14.2.2 Assessing the Fit
    14.2.3 Correlation and Regression
  14.3 The Matrix Approach to Linear Least Squares
  14.4 Statistical Properties of Least Squares Estimates
    14.4.1 Vector-Valued Random Variables
    14.4.2 Mean and Covariance of Least Squares Estimates
    14.4.3 Estimation of σ²
    14.4.4 Residuals and Standardized Residuals
    14.4.5 Inference about β
  14.5 Multiple Linear Regression—An Example
  14.6 Conditional Inference, Unconditional Inference, and the Bootstrap
  14.7 Local Linear Smoothing
  14.8 Concluding Remarks
  14.9 Problems

Appendix A Common Distributions
Appendix B Tables
Bibliography
Answers to Selected Problems
Author Index
Applications Index
Subject Index

Preface

Intended Audience

This text is intended for juniors, seniors, or beginning graduate students in statistics, mathematics, natural sciences, and engineering as well as for adequately prepared students in the social sciences and economics. A year of calculus, including Taylor series and multivariable calculus, and an introductory course in linear algebra are prerequisites.

This Book's Objectives

This book reflects my view of what a first, and for many students a last, course in statistics should be. Such a course should include some traditional topics in mathematical statistics (such as methods based on likelihood), topics in descriptive statistics and data analysis with special attention to graphical displays, aspects of experimental design, and realistic applications of some complexity. It should also reflect the integral role played by computers in statistics. These themes, properly interwoven, can give students a view of the nature of modern statistics. The alternative of teaching two separate courses, one on theory and one on data analysis, seems to me artificial. Furthermore, many students take only one course in statistics and do not have time for two or more.

Analysis of Data and the Practice of Statistics

In order to draw the above themes together, I have endeavored to write a book closely tied to the practice of statistics. It is in the analysis of real data that one sees the roles played by both formal theory and informal data analytic methods. I have organized this book around various kinds of problems that entail the use of statistical methods and have included many real examples to motivate and introduce the theory.
Among the advantages of such an approach are that theoretical constructs are presented in meaningful contexts, that they are gradually supplemented and reinforced, and that they are integrated with more informal methods. This is, I think, a fitting approach to statistics, the historical development of which has been spurred on primarily by practical needs rather than by abstract or aesthetic considerations. At the same time, I have not shied away from using the mathematics that the students are supposed to know.

The Third Edition

Eighteen years have passed since the first edition of this book was published and eleven years since the second. Although the basic intent and structure of the book have not changed, the new editions reflect developments in the discipline of statistics, primarily the computational revolution.

The most significant change in this edition is the treatment of Bayesian inference. I moved the material from the last chapter, which many instructors never reached, and integrated it into earlier chapters. Bayesian inference is now first previewed in Chapter 3, in the context of conditional distributions. It is then placed side-by-side with frequentist methods in Chapter 8, where it complements the material on maximum likelihood estimation very naturally. The introductory section on hypothesis testing in Chapter 9 now begins with a Bayesian formulation before moving on to the Neyman-Pearson paradigm. One advantage of this is that the fundamental importance of the likelihood ratio is now much more apparent. In applications, I stress uninformative priors and show the similarity of the qualitative conclusions that would be reached by frequentist and Bayesian methods.

Other new material includes the use of examples from genomics and financial statistics in the probability chapters. Besides being topically relevant, this material naturally reinforces basic concepts. For example, the material on copulas underscores the relationships of marginal and joint distributions. Other changes include the introduction of scatterplots and correlation coefficients within the context of exploratory data analysis in Chapter 10 and a brief introduction to nonparametric smoothing via local linear least squares in Chapter 14. There are nearly 100 new problems, mainly in Chapters 7–14, including several new data sets. Some of the data sets are sufficiently substantial to be the basis for computer lab assignments. I also elucidated many passages that were obscure in earlier editions.

Brief Outline

A complete outline can be found, of course, in the Table of Contents. Here I will just highlight some points and indicate various curricular options for the instructor. The first six chapters contain an introduction to probability theory, particularly those aspects most relevant to statistics. Chapter 1 introduces the basic ingredients of probability theory and elementary combinatorial methods from a non-measure-theoretic point of view. In this and the other probability chapters, I tried to use real-world examples rather than balls and urns whenever possible.

The concept of a random variable is introduced in Chapter 2. I chose to discuss discrete and continuous random variables together, instead of putting off the continuous case until later. Several common distributions are introduced. An advantage of this approach is that it provides something to work with and develop in later chapters.
Chapter 3 continues the treatment of random variables by going into joint distributions. The instructor may wish to skip lightly over Jacobians; this can be done with little loss of continuity, since they are rarely used in the rest of the book. The material in Section 3.7 on extrema and order statistics can be omitted if the instructor is willing to do a little backtracking later.

Expectation, variance, covariance, conditional expectation, and moment-generating functions are taken up in Chapter 4. The instructor may wish to pass lightly over conditional expectation and prediction, especially if he or she does not plan to cover sufficiency later. The last section of this chapter introduces the δ method, or the method of propagation of error. This method is used several times in the statistics chapters.

The law of large numbers and the central limit theorem are proved in Chapter 5 under fairly strong assumptions.

Chapter 6 is a compendium of the common distributions related to the normal and sampling distributions of statistics computed from the usual normal random sample. I don't spend a lot of time on this material here but do develop the necessary facts as they are needed in the statistics chapters. It is useful for students to have these distributions collected in one place.

Chapter 7 is on survey sampling, an unconventional, but in some ways natural, beginning to the study of statistics. Survey sampling is an area of statistics with which most students have some vague familiarity, and a set of fairly specific, concrete statistical problems can be naturally posed. It is a context in which, historically, many important statistical concepts have developed, and it can be used as a vehicle for introducing concepts and techniques that are developed further in later chapters, for example:

• The idea of an estimate as a random variable with an associated sampling distribution
• The concepts of bias, standard error, and mean squared error
• Confidence intervals and the application of the central limit theorem
• An exposure to notions of experimental design via the study of stratified estimates and the concept of relative efficiency
• Calculation of expectations, variances, and covariances

One of the unattractive aspects of survey sampling is that the calculations are rather grubby. However, there is a certain virtue in this grubbiness, and students are given practice in such calculations. The instructor has quite a lot of flexibility as to how deeply to cover the concepts in this chapter. The sections on ratio estimation and stratification are optional and can be skipped entirely or returned to at a later time without loss of continuity.

Chapter 8 is concerned with parameter estimation, a subject that is motivated and illustrated by the problem of fitting probability laws to data. The method of moments, the method of maximum likelihood, and Bayesian inference are developed. The concept of efficiency is introduced, and the Cramér-Rao Inequality is proved. Section 8.8 introduces the concept of sufficiency and some of its ramifications. The material on the Cramér-Rao lower bound and on sufficiency can be skipped; to my mind, the importance of sufficiency is usually overstated. Section 8.7.1 (the negative binomial distribution) can also be skipped.

Chapter 9 is an introduction to hypothesis testing with particular application to testing for goodness of fit, which ties in with Chapter 8. (This subject is further developed in Chapter 11.)
Informal, graphical methods are presented here as well. Several of the last sections of this chapter can be skipped if the instructor is pressed for time. These include Section 9.6 (the Poisson dispersion test), Section 9.7 (hanging rootograms), and Section 9.9 (tests for normality).

A variety of descriptive methods are introduced in Chapter 10. Many of these techniques are used in later chapters. The importance of graphical procedures is stressed, and notions of robustness are introduced. The placement of a chapter on descriptive methods this late in a book may seem strange. I chose to do so because descriptive procedures usually have a stochastic side and, having been through the three chapters preceding this one, students are by now better equipped to study the statistical behavior of various summary statistics (for example, a confidence interval for the median). When I teach the course, I introduce some of this material earlier. For example, I have students make boxplots and histograms from samples drawn in labs on survey sampling. If the instructor wishes, the material on survival and hazard functions can be skipped.

Classical and nonparametric methods for two-sample problems are introduced in Chapter 11. The concepts of hypothesis testing, first introduced in Chapter 9, are further developed. The chapter concludes with some discussion of experimental design and the interpretation of observational studies.

The first eleven chapters are the heart of an introductory course; the theoretical constructs of estimation and hypothesis testing have been developed, graphical and descriptive methods have been introduced, and aspects of experimental design have been discussed. The instructor has much more freedom in selecting material from Chapters 12 through 14. In particular, it is not necessary to proceed through these chapters in the order in which they are presented.

Chapter 12 treats the one-way and two-way layouts via analysis of variance and nonparametric techniques. The problem of multiple comparisons, first introduced at the end of Chapter 11, is discussed.

Chapter 13 is a rather brief treatment of the analysis of categorical data. Likelihood ratio tests are developed for homogeneity and independence. McNemar's test is presented, and finally, estimation of the odds ratio is motivated by a discussion of prospective and retrospective studies.

Chapter 14 concerns linear least squares. Simple linear regression is developed first and is followed by a more general treatment using linear algebra. I chose to employ matrix algebra but keep the level of the discussion as simple and concrete as possible, not going beyond concepts typically taught in an introductory one-quarter course. In particular, I did not develop a geometric analysis of the general linear model or make any attempt to unify regression and analysis of variance. Throughout this chapter, theoretical results are balanced by more qualitative data analytic procedures based on analysis of residuals. At the end of the chapter, I introduce nonparametric regression via local linear least squares.

Computer Use and Problem Solving

Computation is an integral part of contemporary statistics. It is essential for data analysis and can be an aid to clarifying basic concepts. My students use the open-source package R, which they can install on their own computers. Other packages could be used as well but I do not discuss any particular programs in the text. The data in the text are available on the CD that is bound in the U.S. edition or can be downloaded from www.thomsonedu.com/statistics.
This book contains a large number of problems, ranging from routine reinforcement of basic concepts to some that students will find quite difficult. I think that problem solving, especially of nonroutine problems, is very important.

Acknowledgments

I am indebted to a large number of people who contributed directly and indirectly to the first edition. Earlier versions were used in courses taught by Richard Olshen, Yosi Rinott, Donald Ylvisaker, Len Haff, and David Lane, who made many helpful comments. Students in their classes and in my own had many constructive comments. Teaching assistants, especially Joan Staniswalis, Roger Johnson, Terri Bittner, and Peter Kim, worked through many of the problems and found numerous errors. Many reviewers provided useful suggestions: Rollin Brant, University of Toronto; George Casella, Cornell University; Howard B. Christensen, Brigham Young University; David Fairley, Ohio State University; Peter Guttorp, University of Washington; Hari Iyer, Colorado State University; Douglas G. Kelly, University of North Carolina; Thomas Leonard, University of Wisconsin; Albert S. Paulson, Rensselaer Polytechnic Institute; Charles Peters, University of Houston, University Park; Andrew Rukhin, University of Massachusetts, Amherst; Robert Schaefer, Miami University; and Ruth Williams, University of California, San Diego. Richard Royall and W. G. Cumberland kindly provided the data sets used in Chapter 7 on survey sampling. Several other data sets were brought to my attention by statisticians at the National Bureau of Standards, where I was fortunate to spend a year while on sabbatical. I deeply appreciate the patience, persistence, and faith of my editor, John Kimmel, in bringing this project to fruition.

The candid comments of many students and faculty who used the first edition of the book were influential in the creation of the second edition. In particular I would like to thank Ian Abramson, Edward Bedrick, Jon Frank, Richard Gill, Roger Johnson, Torgny Lindvall, Michael Martin, Deb Nolan, Roger Pinkham, Yosi Rinott, Philip Stark, and Bin Yu; I apologize to any individuals who have inadvertently been left off this list. Finally, I would like to thank Alex Kugushev for his encouragement and support in carrying out the revision and the work done by Terri Bittner in carefully reading the manuscript for accuracy and in the solutions of the new problems.

Many people contributed to the third edition. I would like to thank the reviewers of this edition: Marten Wegkamp, Yale University; Aparna Huzurbazar, University of New Mexico; Laura Bernhofen, Clark University; Joe Glaz, University of Connecticut; and Michael Minnotte, Utah State University. I deeply appreciate the many readers who generously took the time to point out errors and make suggestions on improving the exposition. In particular, Roger Pinkham sent many helpful email messages and Nick Cox provided a very long list of grammatical errors. Alice Hsiaw made detailed comments on Chapters 7–14. I also wish to thank Ani Adhikari, Paulo Berata, Patrick Brewer, Sang-Hoon Cho, Gier Eide, John Einmahl, David Freedman, Roger Johnson, Paul van der Laan, Patrick Lee, Yi Lin, Jim Linnemann, Rasaan Moshesh, Eugene Schuster, Dylan Small, Luis Tenorio, Richard De Veaux, and Ping Zhang.
Bob Stine contributed financial data, Diane Cook provided data on Italian olive oils, and Jim Albert provided a baseball data set that nicely illustrates regression toward the mean. Rainer Sachs provided the lovely data on chromatin separations. I thank my editor, Carolyn Crockett, for her graceful persistence and patience in bringing about this revision, and also the energetic production team. I apologize to any others whose names have inadvertently been left off this list.

CHAPTER 1

Probability

1.1 Introduction

The idea of probability, chance, or randomness is quite old, whereas its rigorous axiomatization in mathematical terms occurred relatively recently. Many of the ideas of probability theory originated in the study of games of chance. In this century, the mathematical theory of probability has been applied to a wide variety of phenomena; the following are some representative examples:

• Probability theory has been used in genetics as a model for mutations and ensuing natural variability, and plays a central role in bioinformatics.
• The kinetic theory of gases has an important probabilistic component.
• In designing and analyzing computer operating systems, the lengths of various queues in the system are modeled as random phenomena.
• There are highly developed theories that treat noise in electrical devices and communication systems as random processes.
• Many models of atmospheric turbulence use concepts of probability theory.
• In operations research, the demands on inventories of goods are often modeled as random.
• Actuarial science, which is used by insurance companies, relies heavily on the tools of probability theory.
• Probability theory is used to study complex systems and improve their reliability, such as in modern commercial or military aircraft.
• Probability theory is a cornerstone of the theory of finance.

The list could go on and on. This book develops the basic ideas of probability and statistics. The first part explores the theory of probability as a mathematical model for chance phenomena. The second part of the book is about statistics, which is essentially concerned with procedures for analyzing data, especially data that in some vague sense have a random character. To comprehend the theory of statistics, you must have a sound background in probability.

1.2 Sample Spaces

Probability theory is concerned with situations in which the outcomes occur randomly. Generically, such situations are called experiments, and the set of all possible outcomes is the sample space corresponding to an experiment. The sample space is denoted by Ω, and an element of Ω is denoted by ω. The following are some examples.

EXAMPLE A
Driving to work, a commuter passes through a sequence of three intersections with traffic lights. At each light, she either stops, s, or continues, c. The sample space is the set of all possible outcomes:

Ω = {ccc, ccs, css, csc, sss, ssc, scc, scs}

where csc, for example, denotes the outcome that the commuter continues through the first light, stops at the second light, and continues through the third light. ■
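Small finite sample spaces like this one are easy to enumerate by machine. Below is a minimal sketch in R (the package mentioned in the Preface); it is not part of the text's development, and all object names are arbitrary:

```r
# Enumerate the sample space of Example A: three lights, each "c" or "s"
grid <- expand.grid(first = c("c", "s"), second = c("c", "s"),
                    third = c("c", "s"), stringsAsFactors = FALSE)
omega <- apply(grid, 1, paste, collapse = "")
omega           # "ccc" "scc" "csc" "ssc" "ccs" "scs" "css" "sss"
length(omega)   # 8 outcomes in all

# The event that the commuter stops at the first light
A <- omega[substr(omega, 1, 1) == "s"]
A               # "scc" "ssc" "scs" "sss"
```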
EXAMPLE B
The number of jobs in a print queue of a mainframe computer may be modeled as random. Here the sample space can be taken as

Ω = {0, 1, 2, 3, ...}

that is, all the nonnegative integers. In practice, there is probably an upper limit, N, on how large the print queue can be, so instead the sample space might be defined as

Ω = {0, 1, 2, ..., N} ■

EXAMPLE C
Earthquakes exhibit very erratic behavior, which is sometimes modeled as random. For example, the length of time between successive earthquakes in a particular region that are greater in magnitude than a given threshold may be regarded as an experiment. Here Ω is the set of all nonnegative real numbers:

Ω = {t | t ≥ 0} ■

We are often interested in particular subsets of Ω, which in probability language are called events. In Example A, the event that the commuter stops at the first light is the subset of Ω denoted by

A = {sss, ssc, scc, scs}

(Events, or subsets, are usually denoted by italic uppercase letters.) In Example B, the event that there are fewer than five jobs in the print queue can be denoted by

A = {0, 1, 2, 3, 4}

The algebra of set theory carries over directly into probability theory. The union of two events, A and B, is the event C that either A occurs or B occurs or both occur: C = A ∪ B. For example, if A is the event that the commuter stops at the first light (listed before), and if B is the event that she stops at the third light,

B = {sss, scs, ccs, css}

then C is the event that she stops at the first light or stops at the third light and consists of the outcomes that are in A or in B or in both:

C = {sss, ssc, scc, scs, ccs, css}

The intersection of two events, C = A ∩ B, is the event that both A and B occur. If A and B are as given previously, then C is the event that the commuter stops at the first light and stops at the third light and thus consists of those outcomes that are common to both A and B:

C = {sss, scs}

The complement of an event, A^c, is the event that A does not occur and thus consists of all those elements in the sample space that are not in A. The complement of the event that the commuter stops at the first light is the event that she continues at the first light:

A^c = {ccc, ccs, css, csc}

You may recall from previous exposure to set theory a rather mysterious set called the empty set, usually denoted by ∅. The empty set is the set with no elements; it is the event with no outcomes. For example, if A is the event that the commuter stops at the first light and C is the event that she continues through all three lights, C = {ccc}, then A and C have no outcomes in common, and we can write

A ∩ C = ∅

In such cases, A and C are said to be disjoint.

Venn diagrams, such as those in Figure 1.1, are often a useful tool for visualizing set operations. The following are some laws of set theory.

Commutative Laws:
A ∪ B = B ∪ A
A ∩ B = B ∩ A

Associative Laws:
(A ∪ B) ∪ C = A ∪ (B ∪ C)
(A ∩ B) ∩ C = A ∩ (B ∩ C)

Distributive Laws:
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)

Of these, the distributive laws are the least intuitive, and you may find it instructive to illustrate them with Venn diagrams.

[FIGURE 1.1 Venn diagrams of A ∪ B and A ∩ B.]
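Besides Venn diagrams, the laws can be checked mechanically on any small example. A quick sketch in R, reusing the commuter events (event names as above; the particular choice of C is arbitrary):

```r
omega <- c("ccc", "ccs", "css", "csc", "sss", "ssc", "scc", "scs")
A <- c("sss", "ssc", "scc", "scs")   # stops at the first light
B <- c("sss", "scs", "ccs", "css")   # stops at the third light

union(A, B)         # A ∪ B: stops at the first or the third light
intersect(A, B)     # A ∩ B: "sss" "scs"
setdiff(omega, A)   # A^c:   "ccc" "ccs" "css" "csc"

# One distributive law, (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C), with C = A^c
C <- setdiff(omega, A)
setequal(intersect(union(A, B), C),
         union(intersect(A, C), intersect(B, C)))   # TRUE
```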
1.3 Probability Measures

A probability measure on Ω is a function P from subsets of Ω to the real numbers that satisfies the following axioms:

1. P(Ω) = 1.
2. If A ⊂ Ω, then P(A) ≥ 0.
3. If A1 and A2 are disjoint, then P(A1 ∪ A2) = P(A1) + P(A2). More generally, if A1, A2, ..., An, ... are mutually disjoint, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

The first two axioms are obviously desirable. Since Ω consists of all possible outcomes, P(Ω) = 1. The second axiom simply states that a probability is nonnegative. The third axiom states that if A and B are disjoint—that is, have no outcomes in common—then P(A ∪ B) = P(A) + P(B), and also that this property extends to limits. For example, the probability that the print queue contains either one or three jobs is equal to the probability that it contains one plus the probability that it contains three.

The following properties of probability measures are consequences of the axioms.

Property A. P(A^c) = 1 − P(A). This property follows since A and A^c are disjoint with A ∪ A^c = Ω and thus, by the first and third axioms, P(A) + P(A^c) = 1. In words, this property says that the probability that an event does not occur equals one minus the probability that it does occur.

Property B. P(∅) = 0. This property follows from Property A since ∅ = Ω^c. In words, this says that the probability that there is no outcome at all is zero.

Property C. If A ⊂ B, then P(A) ≤ P(B). This property states that if B occurs whenever A occurs, then P(A) ≤ P(B). For example, if whenever it rains (A) it is cloudy (B), then the probability that it rains is less than or equal to the probability that it is cloudy. Formally, it can be proved as follows: B can be expressed as the union of two disjoint sets:

B = A ∪ (B ∩ A^c)

Then, from the third axiom,

P(B) = P(A) + P(B ∩ A^c)

and thus

P(A) = P(B) − P(B ∩ A^c) ≤ P(B)

Property D (Addition Law). P(A ∪ B) = P(A) + P(B) − P(A ∩ B). This property is easy to see from the Venn diagram in Figure 1.2. If P(A) and P(B) are added together, P(A ∩ B) is counted twice. To prove it, we decompose A ∪ B into three disjoint subsets, as shown in Figure 1.2:

C = A ∩ B^c
D = A ∩ B
E = A^c ∩ B

[FIGURE 1.2 Venn diagram illustrating the addition law.]

We then have, from the third axiom,

P(A ∪ B) = P(C) + P(D) + P(E)

Also, A = C ∪ D, and C and D are disjoint; so P(A) = P(C) + P(D). Similarly, P(B) = P(D) + P(E). Putting these results together, we see that

P(A) + P(B) = P(C) + P(E) + 2P(D) = P(A ∪ B) + P(D)

or

P(A ∪ B) = P(A) + P(B) − P(D)

EXAMPLE A
Suppose that a fair coin is thrown twice. Let A denote the event of heads on the first toss, and let B denote the event of heads on the second toss. The sample space is

Ω = {hh, ht, th, tt}

We assume that each elementary outcome in Ω is equally likely and has probability 1/4. C = A ∪ B is the event that heads comes up on the first toss or on the second toss. Note that P(C) ≠ P(A) + P(B) = 1; rather, since A ∩ B is the event that heads comes up on the first toss and on the second toss,

P(C) = P(A) + P(B) − P(A ∩ B) = .5 + .5 − .25 = .75 ■
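The addition law in Example A can also be checked by simulation. A minimal sketch in R (the sample size of 100,000 is an arbitrary choice):

```r
set.seed(1)
n <- 100000
first  <- sample(c("h", "t"), n, replace = TRUE)
second <- sample(c("h", "t"), n, replace = TRUE)

mean(first == "h" | second == "h")    # estimates P(A ∪ B); close to .75
mean(first == "h") + mean(second == "h") -
  mean(first == "h" & second == "h")  # the addition law gives the same estimate
```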
EXAMPLE B
An article in the Los Angeles Times (August 24, 1987) discussed the statistical risks of AIDS infection:

Several studies of sexual partners of people infected with the virus show that a single act of unprotected vaginal intercourse has a surprisingly low risk of infecting the uninfected partner—perhaps one in 100 to one in 1000. For an average, consider the risk to be one in 500. If there are 100 acts of intercourse with an infected partner, the odds of infection increase to one in five. Statistically, 500 acts of intercourse with one infected partner or 100 acts with five partners lead to a 100% probability of infection (statistically, not necessarily in reality).

Following this reasoning, 1000 acts of intercourse with one infected partner would lead to a probability of infection equal to 2 (statistically, but not necessarily in reality). To see the flaw in the reasoning that leads to this conclusion, consider two acts of intercourse. Let A1 denote the event that infection occurs on the first act and let A2 denote the event that infection occurs on the second act. Then the event that infection occurs is B = A1 ∪ A2 and

P(B) = P(A1) + P(A2) − P(A1 ∩ A2) ≤ P(A1) + P(A2) = 2/500 ■

1.4 Computing Probabilities: Counting Methods

Probabilities are especially easy to compute for finite sample spaces. Suppose that

Ω = {ω1, ω2, ..., ωN}

and that P(ωi) = pi. To find the probability of an event A, we simply add the probabilities of the ωi that constitute A.

EXAMPLE A
Suppose that a fair coin is thrown twice and the sequence of heads and tails is recorded. The sample space is

Ω = {hh, ht, th, tt}

As in Example A of the previous section, we assume that each outcome in Ω has probability .25. Let A denote the event that at least one head is thrown. Then A = {hh, ht, th}, and P(A) = .75. ■

This is a simple example of a fairly common situation. The elements of Ω all have equal probability; so if there are N elements in Ω, each of them has probability 1/N. If A can occur in any of n mutually exclusive ways, then P(A) = n/N, or

P(A) = (number of ways A can occur) / (total number of outcomes)

Note that this formula holds only if all the outcomes are equally likely. In Example A, if only the number of heads were recorded, then Ω would be {0, 1, 2}. These outcomes are not equally likely, and P(A) is not 2/3. ■

EXAMPLE B  Simpson's Paradox
A black urn contains 5 red and 6 green balls, and a white urn contains 3 red and 4 green balls. You are allowed to choose an urn and then choose a ball at random from the urn. If you choose a red ball, you get a prize. Which urn should you choose to draw from? If you draw from the black urn, the probability of choosing a red ball is 5/11 = .455 (the number of ways you can draw a red ball divided by the total number of outcomes). If you choose to draw from the white urn, the probability of choosing a red ball is 3/7 = .429, so you should choose to draw from the black urn.

Now consider another game in which a second black urn has 6 red and 3 green balls, and a second white urn has 9 red and 5 green balls. If you draw from the black urn, the probability of a red ball is 6/9 = .667, whereas if you choose to draw from the white urn, the probability is 9/14 = .643. So, again you should choose to draw from the black urn.

In the final game, the contents of the second black urn are added to the first black urn, and the contents of the second white urn are added to the first white urn. Again, you can choose which urn to draw from. Which should you choose? Intuition says choose the black urn, but let's calculate the probabilities. The black urn now contains 11 red and 9 green balls, so the probability of drawing a red ball from it is 11/20 = .55. The white urn now contains 12 red and 9 green balls, so the probability of drawing a red ball from it is 12/21 = .571. So, you should choose the white urn.

This counterintuitive result is an example of Simpson's paradox. For an example that occurred in real life, see Section 11.4.7. For more amusing examples, see Gardner (1976). ■
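A quick arithmetic check of the three games in R (just the ratios from the example, organized for comparison):

```r
# Game 1: black urn (5 red, 6 green) vs. white urn (3 red, 4 green)
5 / 11               # .455: black urn wins
3 / 7                # .429

# Game 2: black urn (6 red, 3 green) vs. white urn (9 red, 5 green)
6 / 9                # .667: black urn wins again
9 / 14               # .643

# Final game: pool the urns; the comparison reverses
(5 + 6) / (11 + 9)   # .550 for the combined black urn
(3 + 9) / (7 + 14)   # .571 for the combined white urn
```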
In the preceding examples, it was easy to count the number of outcomes and calculate probabilities. To compute probabilities for more complex situations, we must develop systematic ways of counting outcomes, which are the subject of the next two sections.

1.4.1 The Multiplication Principle

The following is a statement of the very useful multiplication principle.

MULTIPLICATION PRINCIPLE
If one experiment has m outcomes and another experiment has n outcomes, then there are mn possible outcomes for the two experiments.

Proof
Denote the outcomes of the first experiment by a1, ..., am and the outcomes of the second experiment by b1, ..., bn. The outcomes for the two experiments are the ordered pairs (ai, bj). These pairs can be exhibited as the entries of an m × n rectangular array, in which the pair (ai, bj) is in the ith row and the jth column. There are mn entries in this array. ■

EXAMPLE A
Playing cards have 13 face values and 4 suits. There are thus 4 × 13 = 52 face-value/suit combinations. ■

EXAMPLE B
A class has 12 boys and 18 girls. The teacher selects 1 boy and 1 girl to act as representatives to the student government. She can do this in any of 12 × 18 = 216 different ways. ■

EXTENDED MULTIPLICATION PRINCIPLE
If there are p experiments and the first has n1 possible outcomes, the second n2, ..., and the pth np possible outcomes, then there are a total of n1 × n2 × ··· × np possible outcomes for the p experiments.

Proof
This principle can be proved from the multiplication principle by induction. We saw that it is true for p = 2. Assume that it is true for p = q—that is, that there are n1 × n2 × ··· × nq possible outcomes for the first q experiments. To complete the proof by induction, we must show that it follows that the property holds for p = q + 1. We apply the multiplication principle, regarding the first q experiments as a single experiment with n1 × ··· × nq outcomes, and conclude that there are (n1 × ··· × nq) × nq+1 outcomes for the q + 1 experiments. ■

EXAMPLE C
An 8-bit binary word is a sequence of 8 digits, of which each may be either a 0 or a 1. How many different 8-bit words are there? There are two choices for the first bit, two for the second, etc., and thus there are

2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 2^8 = 256

such words. ■

EXAMPLE D
A DNA molecule is a sequence of four types of nucleotides, denoted by A, G, C, and T. The molecule can be millions of units long and can thus encode an enormous amount of information. For example, for a molecule 1 million (10^6) units long, there are 4^(10^6) different possible sequences. This is a staggeringly large number having nearly a million digits. An amino acid is coded for by a sequence of three nucleotides; there are 4^3 = 64 different codes, but there are only 20 amino acids since some of them can be coded for in several ways. A protein molecule is composed of as many as hundreds of amino acid units, and thus there are an incredibly large number of possible proteins. For example, there are 20^100 different sequences of 100 amino acids. ■
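The smaller counts in Examples C and D can be verified by brute-force enumeration; a sketch in R:

```r
# All 8-bit words: two choices for each of eight positions
words <- expand.grid(rep(list(0:1), 8))
nrow(words)    # 256, matching 2^8

# All codons: sequences of three nucleotides
codons <- expand.grid(rep(list(c("A", "G", "C", "T")), 3))
nrow(codons)   # 64, matching 4^3
```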
1.4.2 Permutations and Combinations

A permutation is an ordered arrangement of objects. Suppose that from the set C = {c1, c2, ..., cn} we choose r elements and list them in order. How many ways can we do this? The answer depends on whether we are allowed to duplicate items in the list. If no duplication is allowed, we are sampling without replacement. If duplication is allowed, we are sampling with replacement. We can think of the problem as that of taking labeled balls from an urn. In the first type of sampling, we are not allowed to put a ball back before choosing the next one, but in the second, we are. In either case, when we are done choosing, we have a list of r balls ordered in the sequence in which they were drawn.

The extended multiplication principle can be used to count the number of different ordered samples possible for a set of n elements. First, suppose that sampling is done with replacement. The first ball can be chosen in any of n ways, the second in any of n ways, etc., so that there are n × n × ··· × n = n^r samples. Next, suppose that sampling is done without replacement. There are n choices for the first ball, n − 1 choices for the second ball, n − 2 for the third, ..., and n − r + 1 for the rth. We have just proved the following proposition.

PROPOSITION A
For a set of size n and a sample of size r, there are n^r different ordered samples with replacement and n(n − 1)(n − 2) ··· (n − r + 1) different ordered samples without replacement. ■

COROLLARY A
The number of orderings of n elements is n(n − 1)(n − 2) ··· 1 = n!. ■

EXAMPLE A
How many ways can five children be lined up? This corresponds to sampling without replacement. According to Corollary A, there are 5! = 5 × 4 × 3 × 2 × 1 = 120 different lines. ■

EXAMPLE B
Suppose that from ten children, five are to be chosen and lined up. How many different lines are possible? From Proposition A, there are 10 × 9 × 8 × 7 × 6 = 30,240 different lines. ■

EXAMPLE C
In some states, license plates have six characters: three letters followed by three numbers. How many distinct such plates are possible? This corresponds to sampling with replacement. There are 26^3 = 17,576 different ways to choose the letters and 10^3 = 1000 ways to choose the numbers. Using the multiplication principle again, we find there are 17,576 × 1000 = 17,576,000 different plates. ■

EXAMPLE D
If all sequences of six characters are equally likely, what is the probability that the license plate for a new car will contain no duplicate letters or numbers? Call the desired event A; Ω consists of all 17,576,000 possible sequences. Since these are all equally likely, the probability of A is the ratio of the number of ways that A can occur to the total number of possible outcomes. There are 26 choices for the first letter, 25 for the second, 24 for the third, and hence 26 × 25 × 24 = 15,600 ways to choose the letters without duplication (doing so corresponds to sampling without replacement), and 10 × 9 × 8 = 720 ways to choose the numbers without duplication. From the multiplication principle, there are 15,600 × 720 = 11,232,000 nonrepeating sequences. The probability of A is thus

P(A) = 11,232,000/17,576,000 = .64 ■
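The license-plate calculation of Example D, done in R (variable names arbitrary):

```r
total        <- 26^3 * 10^3                # all six-character plates
no_duplicate <- prod(26:24) * prod(10:8)   # letters, then digits, no repetition
no_duplicate / total                       # 0.639..., about .64
```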
EXAMPLE E  Birthday Problem
Suppose that a room contains n people. What is the probability that at least two of them have a common birthday? This is a famous problem with a counterintuitive answer. Assume that every day of the year is equally likely to be a birthday, disregard leap years, and denote by A the event that at least two people have a common birthday. As is sometimes the case, finding P(A^c) is easier than finding P(A). This is because A can happen in many ways, whereas A^c is much simpler. There are 365^n possible outcomes, and A^c can happen in 365 × 364 × ··· × (365 − n + 1) ways. Thus,

P(A^c) = [365 × 364 × ··· × (365 − n + 1)] / 365^n

and

P(A) = 1 − [365 × 364 × ··· × (365 − n + 1)] / 365^n

The following table exhibits the latter probabilities for various values of n:

 n     P(A)
 4     .016
16     .284
23     .507
32     .753
40     .891
56     .988

From the table, we see that if there are only 23 people, the probability of at least one match exceeds .5. The probabilities in the table are larger than one might intuitively guess, showing that the coincidence is not unlikely. Try it in your class. ■

EXAMPLE F
How many people must you ask to have a 50:50 chance of finding someone who shares your birthday? Suppose that you ask n people; let A denote the event that someone's birthday is the same as yours. Again, working with A^c is easier. The total number of outcomes is 365^n, and the total number of ways that A^c can happen is 364^n. Thus,

P(A^c) = 364^n / 365^n

and

P(A) = 1 − 364^n / 365^n

For the latter probability to be .5, n should be 253, which may seem counterintuitive. ■
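Both birthday calculations are one-liners in R; a sketch (base R also offers pbirthday for the first):

```r
# P(at least two of n people share a birthday)
p_match <- function(n) 1 - prod(((365 - n + 1):365) / 365)
sapply(c(4, 16, 23, 32, 40, 56), p_match)   # .016 .284 .507 .753 .891 .988

# Smallest n giving a 50:50 chance that someone shares *your* birthday
ceiling(log(0.5) / log(364 / 365))          # 253
```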
We now shift our attention from counting permutations to counting combinations. Here we are no longer interested in ordered samples, but in the constituents of the samples regardless of the order in which they were obtained. In particular, we ask the following question: If r objects are taken from a set of n objects without replacement and disregarding order, how many different samples are possible? From the multiplication principle, the number of ordered samples equals the number of unordered samples multiplied by the number of ways to order each sample. Since the number of ordered samples is n(n − 1) ··· (n − r + 1), and since a sample of size r can be ordered in r! ways (Corollary A), the number of unordered samples is

$$\frac{n(n-1)\cdots(n-r+1)}{r!} = \frac{n!}{(n-r)!\,r!}$$

This number is also denoted as $\binom{n}{r}$. We have proved the following proposition.

PROPOSITION B
The number of unordered samples of r objects selected from n objects without replacement is $\binom{n}{r}$.

The numbers $\binom{n}{k}$, called the binomial coefficients, occur in the expansion

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$$

In particular,

$$2^n = \sum_{k=0}^{n} \binom{n}{k}$$

This latter result can be interpreted as the number of subsets of a set of n objects. We just add the number of subsets of size 0 (with the usual convention that 0! = 1), and the number of subsets of size 1, and the number of subsets of size 2, etc. ■

EXAMPLE G
Up until 1991, a player of the California state lottery could win the jackpot prize by choosing the 6 numbers from 1 to 49 that were subsequently chosen at random by the lottery officials. There are $\binom{49}{6}$ = 13,983,816 possible ways to choose 6 numbers from 49, and so the probability of winning was about 1 in 14 million. If there were no winners, the funds thus accumulated were rolled over (carried over) into the next round of play, producing a bigger jackpot. In 1991, the rules were changed so that a winner had to correctly select 6 numbers from 1 to 53. Since $\binom{53}{6}$ = 22,957,480, the probability of winning decreased to about 1 in 23 million. Because of the ensuing rollover, the jackpot accumulated to a record of about $120 million. This produced a fever of play—people were buying tickets at the rate of between 1 and 2 million per hour and state revenues burgeoned. ■

EXAMPLE H
In the practice of quality control, only a fraction of the output of a manufacturing process is sampled and examined, since it may be too time-consuming and expensive to examine each item, or because sometimes the testing is destructive. Suppose that n items are in a lot and a sample of size r is taken. There are $\binom{n}{r}$ such samples. Now suppose that the lot contains k defective items. What is the probability that the sample contains exactly m defectives? Clearly, this question is relevant to the efficacy of the sampling scheme, and the most desirable sample size can be determined by computing such probabilities for various values of r. Call the event in question A. The probability of A is the number of ways A can occur divided by the total number of outcomes. To find the number of ways A can occur, we use the multiplication principle. There are $\binom{k}{m}$ ways to choose the m defective items in the sample from the k defectives in the lot, and there are $\binom{n-k}{r-m}$ ways to choose the r − m nondefective items in the sample from the n − k nondefectives in the lot. Therefore, A can occur in $\binom{k}{m}\binom{n-k}{r-m}$ ways. Thus, P(A) is the ratio of the number of ways A can occur to the total number of outcomes, or

$$P(A) = \frac{\binom{k}{m}\binom{n-k}{r-m}}{\binom{n}{r}}$$ ■

EXAMPLE I  Capture/Recapture Method
The so-called capture/recapture method is sometimes used to estimate the size of a wildlife population. Suppose that 10 animals are captured, tagged, and released. On a later occasion, 20 animals are captured, and it is found that 4 of them are tagged. How large is the population? We assume that there are n animals in the population, of which 10 are tagged. If the 20 animals captured later are taken in such a way that all $\binom{n}{20}$ possible groups are equally likely (this is a big assumption), then the probability that 4 of them are tagged is (using the technique of the previous example)

$$\frac{\binom{10}{4}\binom{n-10}{16}}{\binom{n}{20}}$$

Clearly, n cannot be precisely determined from the information at hand, but it can be estimated. One method of estimation, called maximum likelihood, is to choose that value of n that makes the observed outcome most probable. (The method of maximum likelihood is one of the main subjects of a later chapter in this text.) The probability of the observed outcome as a function of n is called the likelihood. Figure 1.3 shows the likelihood as a function of n; the likelihood is maximized at n = 50.

[FIGURE 1.3 Likelihood for Example I.]

To find the maximum likelihood estimate, suppose that, in general, t animals are tagged. Then, of a second sample of size m, r tagged animals are recaptured. We estimate n by the maximizer of the likelihood

$$L_n = \frac{\binom{t}{r}\binom{n-t}{m-r}}{\binom{n}{m}}$$

To find the value of n that maximizes L_n, consider the ratio of successive terms, which after some algebra is found to be

$$\frac{L_n}{L_{n-1}} = \frac{(n-t)(n-m)}{n(n-t-m+r)}$$

This ratio is greater than 1, i.e., L_n is increasing, if

(n − t)(n − m) > n(n − t − m + r)
n² − nm − nt + mt > n² − nt − nm + nr
mt > nr
mt/r > n

Thus, L_n increases for n < mt/r and decreases for n > mt/r; so the value of n that maximizes L_n is the greatest integer not exceeding mt/r. Applying this result to the data given previously, we see that the maximum likelihood estimate of n is mt/r = (20 × 10)/4 = 50. This estimate has some intuitive appeal, as it equates the proportion of tagged animals in the second sample to the proportion in the population:

4/20 = 10/n ■
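The probabilities in Examples H and I are hypergeometric, so R's dhyper computes them directly. A sketch of the capture/recapture likelihood (the search range 20:100 is an arbitrary choice):

```r
# Likelihood of seeing 4 tagged among 20 recaptured when 10 of N are tagged:
# choose(10, 4) * choose(N - 10, 16) / choose(N, 20)
N   <- 20:100
lik <- dhyper(4, m = 10, n = N - 10, k = 20)

N[lik == max(lik)]   # 49 and 50: since mt/r = (20*10)/4 = 50 is an integer,
                     # the likelihood ties exactly at 49 and 50
```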
Proposition B has the following extension.

PROPOSITION C
The number of ways that n objects can be grouped into r classes with ni in the ith class, i = 1, ..., r, and $\sum_{i=1}^{r} n_i = n$ is

$$\binom{n}{n_1\, n_2 \cdots n_r} = \frac{n!}{n_1!\, n_2! \cdots n_r!}$$

Proof
This can be seen by using Proposition B and the multiplication principle. (Note that Proposition B is the special case for which r = 2.) There are $\binom{n}{n_1}$ ways to choose the objects for the first class. Having done that, there are $\binom{n-n_1}{n_2}$ ways of choosing the objects for the second class. Continuing in this manner, there are

$$\frac{n!}{n_1!(n-n_1)!} \times \frac{(n-n_1)!}{n_2!(n-n_1-n_2)!} \times \cdots \times \frac{(n-n_1-n_2-\cdots-n_{r-1})!}{0!\, n_r!}$$

choices in all. After cancellation, this yields the desired result. ■

EXAMPLE J
A committee of seven members is to be divided into three subcommittees of size three, two, and two. This can be done in

$$\binom{7}{3\,2\,2} = \frac{7!}{3!\,2!\,2!} = 210$$

ways. ■

EXAMPLE K
In how many ways can the set of nucleotides {A, A, G, G, G, G, C, C, C} be arranged in a sequence of nine letters? Proposition C can be applied by realizing that this problem can be cast as determining the number of ways that the nine positions in the sequence can be divided into subgroups of sizes two, four, and three (the locations of the letters A, G, and C):

$$\binom{9}{2\,4\,3} = \frac{9!}{2!\,4!\,3!} = 1260$$ ■

EXAMPLE L
In how many ways can n = 2m people be paired and assigned to m courts for the first round of a tennis tournament? In this problem, ni = 2, i = 1, ..., m, and, according to Proposition C, there are

$$\frac{(2m)!}{2^m}$$

assignments. One has to be careful with problems such as this one. Suppose we were asked how many ways 2m people could be arranged in pairs without assigning the pairs to courts. Since there are m! ways to assign the m pairs to m courts, the preceding result should be divided by m!, giving

$$\frac{(2m)!}{m!\,2^m}$$

pairs in all. ■

The numbers $\binom{n}{n_1 n_2 \cdots n_r}$ are called multinomial coefficients. They occur in the expansion

$$(x_1 + x_2 + \cdots + x_r)^n = \sum \binom{n}{n_1\, n_2 \cdots n_r} x_1^{n_1} x_2^{n_2} \cdots x_r^{n_r}$$

where the sum is over all nonnegative integers n1, n2, ..., nr such that n1 + n2 + ··· + nr = n.
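The multinomial coefficients of Examples J and K can be checked two ways in R, echoing the sequential-choice argument in the proof of Proposition C:

```r
factorial(7) / (factorial(3) * factorial(2) * factorial(2))   # 210
choose(7, 3) * choose(4, 2) * choose(2, 2)                    # 210, choosing classes in turn

factorial(9) / (factorial(2) * factorial(4) * factorial(3))   # 1260
```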
1.5 Conditional Probability

We introduce the definition and use of conditional probability with an example. Digitalis therapy is often beneficial to patients who have suffered congestive heart failure, but there is the risk of digitalis intoxication, a serious side effect that is difficult to diagnose. To improve the chances of a correct diagnosis, the concentration of digitalis in the blood can be measured. Bellar et al. (1971) conducted a study of the relation of the concentration of digitalis in the blood to digitalis intoxication in 135 patients. Their results are simplified slightly in the following table, where this notation is used:

T+ = high blood concentration (positive test)
T− = low blood concentration (negative test)
D+ = toxicity (disease present)
D− = no toxicity (disease absent)

         D+    D−    Total
T+       25    14     39
T−       18    78     96
Total    43    92    135

Thus, for example, 25 of the 135 patients had a high blood concentration of digitalis and suffered toxicity. Assume that the relative frequencies in the study roughly hold in some larger population of patients. (Making inferences about the frequencies in a large population from those observed in a small sample is a statistical problem, which will be taken up in a later chapter of this book.) Converting the frequencies in the preceding table to proportions (relative to 135), which we will regard as probabilities, we obtain the following table:

         D+     D−     Total
T+      .185   .104    .289
T−      .133   .578    .711
Total   .318   .682   1.000

From the table, P(T+) = .289 and P(D+) = .318, for example. Now if a doctor knows that the test was positive (that there was a high blood concentration), what is the probability of disease (toxicity) given this knowledge? We can restrict our attention to the first row of the table, and we see that of the 39 patients who had positive tests, 25 suffered from toxicity. We denote the probability that a patient shows toxicity given that the test is positive by P(D+ | T+), which is called the conditional probability of D+ given T+:

P(D+ | T+) = 25/39 = .640

Equivalently, we can calculate this probability as

P(D+ | T+) = P(D+ ∩ T+) / P(T+) = .185/.289 = .640

In summary, we see that the unconditional probability of D+ is .318, whereas the conditional probability of D+ given T+ is .640. Therefore, knowing that the test is positive makes toxicity more than twice as likely. What if the test is negative?

P(D− | T−) = .578/.711 = .813

For comparison, P(D−) = .682. Two other conditional probabilities from this example are of interest: The probability of a false positive is P(D− | T+) = .360, and the probability of a false negative is P(D+ | T−) = .187.
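All of these conditional probabilities follow directly from the table of counts; a quick sketch in R (the matrix layout mirrors the table above):

```r
counts <- matrix(c(25, 14,
                   18, 78),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("T+", "T-"), c("D+", "D-")))

counts["T+", "D+"] / sum(counts["T+", ])   # P(D+ | T+) = 25/39, about .64
counts["T-", "D-"] / sum(counts["T-", ])   # P(D- | T-) = 78/96 = .813
counts["T-", "D+"] / sum(counts["T-", ])   # false negative: 18/96 = .187
```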
Another useful tool for computing probabilities is provided by the following law.

LAW OF TOTAL PROBABILITY
Let $B_1, B_2, \ldots, B_n$ be such that $\bigcup_{i=1}^n B_i = \Omega$ and $B_i \cap B_j = \emptyset$ for $i \ne j$, with $P(B_i) > 0$ for all i. Then, for any event A,
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)P(B_i)$$

Proof
Before going through a formal proof, it is helpful to state the result in words. The $B_i$ are mutually disjoint events whose union is Ω. To find the probability of an event A, we sum the conditional probabilities of A given $B_i$, weighted by $P(B_i)$. Now, for the proof, we first observe that
$$P(A) = P(A \cap \Omega) = P\left(A \cap \bigcup_{i=1}^{n} B_i\right) = P\left(\bigcup_{i=1}^{n} (A \cap B_i)\right)$$
Since the events $A \cap B_i$ are disjoint,
$$P\left(\bigcup_{i=1}^{n} (A \cap B_i)\right) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A \mid B_i)P(B_i) \qquad \blacksquare$$

The law of total probability is useful in situations where it is not obvious how to calculate P(A) directly but in which $P(A \mid B_i)$ and $P(B_i)$ are more straightforward, such as in the following example.

EXAMPLE C
Referring to Example A, what is the probability that a red ball is selected on the second draw? The answer may or may not be intuitively obvious; that depends on your intuition. On the one hand, you could argue that it is "clear from symmetry" that $P(R_2) = P(R_1) = 3/4$. On the other hand, you could say that it is obvious that a red ball is likely to be selected on the first draw, leaving fewer red balls for the second draw, so that $P(R_2) < P(R_1)$. The law of total probability settles the matter. Let $B_1$ denote the event that the first ball drawn is blue; $R_1$ and $B_1$ are disjoint, and their union is the whole sample space, so
$$P(R_2) = P(R_2 \mid R_1)P(R_1) + P(R_2 \mid B_1)P(B_1) = \frac{2}{3} \times \frac{3}{4} + 1 \times \frac{1}{4} = \frac{3}{4}$$
The symmetry argument gives the correct answer. ■

In many situations we know the conditional probabilities $P(A \mid B_i)$ and the probabilities $P(B_i)$, and we want the "inverse" conditional probabilities $P(B_j \mid A)$. Bayes' rule shows how to compute them.

BAYES' RULE
Let $B_1, \ldots, B_n$ be mutually disjoint events with $\bigcup_{i=1}^n B_i = \Omega$ and $P(B_i) > 0$ for all i. Then, for any event A with P(A) > 0,
$$P(B_j \mid A) = \frac{P(A \mid B_j)P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)P(B_i)}$$
The proof of Bayes' rule is short: by the definition of conditional probability and the multiplication law,
$$P(B_j \mid A) = \frac{P(A \cap B_j)}{P(A)} = \frac{P(A \mid B_j)P(B_j)}{P(A)}$$
and the law of total probability expresses the denominator P(A) as $\sum_{i=1}^{n} P(A \mid B_i)P(B_i)$. ■

EXAMPLE E
Diamond and Forrester (1979) applied Bayes' rule to the diagnosis of coronary artery disease. A procedure called cardiac fluoroscopy is used to determine whether there is calcification of the coronary arteries and thereby to diagnose coronary artery disease. From the test, it can be determined whether 0, 1, 2, or 3 coronary arteries are calcified; let T0, T1, T2, T3 denote these events. Let D+ or D− denote the event that disease is present or absent, respectively. Diamond and Forrester presented the following table, based on medical studies:

i    P(Ti | D+)   P(Ti | D−)
0       .42          .96
1       .24          .02
2       .20          .02
3       .15          .00

According to Bayes' rule,
$$P(D+ \mid T_i) = \frac{P(T_i \mid D+)P(D+)}{P(T_i \mid D+)P(D+) + P(T_i \mid D-)P(D-)}$$
Thus, if the initial probabilities P(D+) and P(D−) are known, the probability that a patient has coronary artery disease can be calculated. Let us consider two specific cases. For the first, suppose that a male between the ages of 30 and 39 suffers from nonanginal chest pain. For such a patient, it is known from medical statistics that P(D+) ≈ .05. Suppose that the test shows that no arteries are calcified. From the preceding equation,
$$P(D+ \mid T_0) = \frac{.42 \times .05}{.42 \times .05 + .96 \times .95} = .02$$
It is unlikely that the patient has coronary artery disease. On the other hand, suppose that the test shows that one artery is calcified. Then
$$P(D+ \mid T_1) = \frac{.24 \times .05}{.24 \times .05 + .02 \times .95} = .39$$
Now it is more likely that this patient has coronary artery disease, but by no means certain. As a second case, suppose that the patient is a male between ages 50 and 59 who suffers typical angina. For such a patient, P(D+) = .92. For him, we find that
$$P(D+ \mid T_0) = \frac{.42 \times .92}{.42 \times .92 + .96 \times .08} = .83$$
$$P(D+ \mid T_1) = \frac{.24 \times .92}{.24 \times .92 + .02 \times .08} = .99$$
Comparing the two patients, we see the strong influence of the prior probability, P(D+). ■
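Assuming only the table above and the two stated priors, Example E's posterior probabilities can be reproduced in a few lines of Python (the helper name posterior is ours, not notation from the text):

```python
def posterior(prior, like_pos, like_neg):
    """P(D+ | T_i) by Bayes' rule, given P(D+) and the two likelihoods."""
    num = like_pos * prior
    return num / (num + like_neg * (1 - prior))

# Likelihoods P(T_i | D+) and P(T_i | D-) from Diamond and Forrester's table.
like = {0: (.42, .96), 1: (.24, .02), 2: (.20, .02), 3: (.15, .00)}

for prior in (.05, .92):             # the low-risk and the high-risk patient
    for i, (lp, ln) in like.items():
        print(prior, i, round(posterior(prior, lp, ln), 2))
```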
EXAMPLE F
Polygraph tests (lie-detector tests) are often routinely administered to employees or prospective employees in sensitive positions. Let + denote the event that the polygraph reading is positive, indicating that the subject is lying; let T denote the event that the subject is telling the truth; and let L denote the event that the subject is lying. According to studies of polygraph reliability (Gastwirth 1987), P(+ | L) = .88, from which it follows that P(− | L) = .12; also, P(− | T) = .86, from which it follows that P(+ | T) = .14. In words, if a person is lying, the probability that this is detected by the polygraph is .88, whereas if he is telling the truth, the polygraph indicates that he is telling the truth with probability .86.

Now suppose that polygraphs are routinely administered to screen employees for security reasons, and that on a particular question the vast majority of subjects have no reason to lie, so that P(T) = .99, whereas P(L) = .01. A subject produces a positive response on the polygraph. What is the probability that the polygraph is incorrect and that she is in fact telling the truth? We can evaluate this probability with Bayes' rule:
$$P(T \mid +) = \frac{P(+ \mid T)P(T)}{P(+ \mid T)P(T) + P(+ \mid L)P(L)} = \frac{(.14)(.99)}{(.14)(.99) + (.88)(.01)} = .94$$
Thus, in screening this population of largely innocent people, 94% of the positive polygraph readings will be in error. Most of those placed under suspicion because of the polygraph result will, in fact, be innocent. This example illustrates some of the dangers in using screening procedures on large populations. ■

Bayes' rule is the fundamental mathematical ingredient of a subjective, or "Bayesian," approach to epistemology, theories of evidence, and theories of learning. According to this point of view, an individual's beliefs about the world can be coded in probabilities. For example, an individual's belief that it will hail tomorrow can be represented by a probability P(H). This probability varies from individual to individual. In principle, each individual's probability can be ascertained, or elicited, by offering him or her a series of bets at different odds.

According to Bayesian theory, our beliefs are modified as we are confronted with evidence. If, initially, my probability for a hypothesis is P(H), then after seeing evidence E (e.g., a weather forecast), my probability becomes P(H | E). P(E | H) is often easier to evaluate than P(H | E); in this case, the application of Bayes' rule gives
$$P(H \mid E) = \frac{P(E \mid H)P(H)}{P(E \mid H)P(H) + P(E \mid \bar H)P(\bar H)}$$
where $\bar H$ is the event that H does not hold. This point can be illustrated by the preceding polygraph example. Suppose an investigator is questioning a particular suspect and that the investigator's prior opinion that the suspect is telling the truth is P(T). Then, upon observing a positive polygraph reading, his opinion becomes P(T | +). Note that different investigators will have different prior probabilities P(T) for different suspects, and thus different posterior probabilities.

As appealing as this formulation might be, a long line of research has demonstrated that humans are actually not very good at doing probability calculations in evaluating evidence. For example, Tversky and Kahneman (1974) presented subjects with the following question: "If Linda is a 31-year-old single woman who is outspoken on social issues such as disarmament and equal rights, which of the following statements is more likely to be true?
• Linda is a bank teller.
• Linda is a bank teller and active in the feminist movement."
More than 80% of those questioned chose the second statement, despite Property C of Section 1.3.

Even highly trained professionals are not good at doing probability calculations, as illustrated by the following example of Eddy (1982), regarding the interpretation of results from mammogram screening. One hundred physicians were presented with the following information:
• In the absence of any special information, the probability that a woman (of the age and health status of this patient) has breast cancer is 1%.
• If the patient has breast cancer, the probability that the radiologist will correctly diagnose it is 80%.
• If the patient has a benign lesion (no breast cancer), the probability that the radiologist will incorrectly diagnose it as cancer is 10%.
They were then asked, "What is the probability that a patient with a positive mammogram actually has breast cancer?" Ninety-five of the 100 physicians estimated the probability to be about 75%. The correct probability, as given by Bayes' rule, is 7.5%. (You should check this; the short computation below does so numerically.)
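A minimal numerical check of the 7.5% figure, using only the three probabilities quoted above:

```python
# Bayes check of the mammogram example: prior 1%, sensitivity 80%,
# false-positive rate 10% for benign lesions.
prior = 0.01
p_pos_given_cancer = 0.80
p_pos_given_benign = 0.10

p_pos = p_pos_given_cancer * prior + p_pos_given_benign * (1 - prior)
p_cancer_given_pos = p_pos_given_cancer * prior / p_pos
print(round(p_cancer_given_pos, 3))  # 0.075, i.e., 7.5%
```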
So even experts radically overestimate the strength of the evidence provided by a positive outcome on the screening test. Thus the Bayesian probability calculus does not describe the way people actually assimilate evidence. Advocates of Bayesian learning theory might assert that the theory describes the way people "should" think. A softer point of view is that Bayesian learning theory is a model for learning, and it has the merit of being a simple model that can be programmed on computers. Probability theory in general, and Bayesian learning theory in particular, are part of the core of artificial intelligence.

1.6 Independence

Intuitively, we would say that two events, A and B, are independent if knowing that one had occurred gave us no information about whether the other had occurred; that is, P(A | B) = P(A) and P(B | A) = P(B). Now, if
$$P(A) = P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
then
$$P(A \cap B) = P(A)P(B)$$
We will use this last relation as the definition of independence. Note that it is symmetric in A and B and does not require the existence of a conditional probability; that is, P(B) can be 0.

DEFINITION
A and B are said to be independent events if P(A ∩ B) = P(A)P(B). ■

EXAMPLE A
A card is selected randomly from a deck. Let A denote the event that it is an ace and D the event that it is a diamond. Knowing that the card is an ace gives no information about its suit. Checking formally that the events are independent, we have P(A) = 4/52 = 1/13 and P(D) = 1/4. Also, A ∩ D is the event that the card is the ace of diamonds, so P(A ∩ D) = 1/52. Since P(A)P(D) = (1/4) × (1/13) = 1/52, the events are in fact independent. ■

EXAMPLE B
A system is designed so that it fails only if a unit and a backup unit both fail. Assuming that these failures are independent and that each unit fails with probability p, the system fails with probability p². If, for example, the probability that any unit fails during a given year is .1, then the probability that the system fails is .01, which represents a considerable improvement in reliability. ■

Things become more complicated when we consider more than two events. For example, suppose we know that events A, B, and C are pairwise independent (any two are independent). We would like to be able to say that they are all independent based on the assumption that knowing something about two of the events does not tell us anything about the third; for example, P(C | A ∩ B) = P(C). But as the following example shows, pairwise independence does not guarantee mutual independence.

EXAMPLE C
A fair coin is tossed twice. Let A denote the event of heads on the first toss, B the event of heads on the second toss, and C the event that exactly one head is thrown. A and B are clearly independent, and P(A) = P(B) = P(C) = .5. To see that A and C are independent, we observe that P(C | A) = .5. But
$$P(A \cap B \cap C) = 0 \ne P(A)P(B)P(C) \qquad \blacksquare$$

To encompass situations such as that in Example C, we define a collection of events, $A_1, A_2, \ldots, A_n$, to be mutually independent if for any subcollection $A_{i_1}, \ldots, A_{i_m}$,
$$P(A_{i_1} \cap \cdots \cap A_{i_m}) = P(A_{i_1}) \cdots P(A_{i_m})$$
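Example C can be verified by enumerating the four equally likely outcomes; the event encodings below are our own:

```python
from itertools import product

outcomes = list(product("HT", repeat=2))   # four equally likely pairs of tosses
p = lambda event: sum(event(o) for o in outcomes) / len(outcomes)

A = lambda o: o[0] == "H"                  # heads on first toss
B = lambda o: o[1] == "H"                  # heads on second toss
C = lambda o: o.count("H") == 1            # exactly one head

# Pairwise independence holds ...
print(p(lambda o: A(o) and C(o)), p(A) * p(C))                  # 0.25 0.25
# ... but mutual independence fails:
print(p(lambda o: A(o) and B(o) and C(o)), p(A) * p(B) * p(C))  # 0.0 0.125
```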
EXAMPLE D
We return to Example B of Section 1.3 (infectivity of AIDS). Suppose that virus transmissions in 500 acts of intercourse are mutually independent events and that the probability of transmission in any one act is 1/500. Under this model, what is the probability of infection? It is easier to first find the probability of the complement of this event. Let $C_1, C_2, \ldots, C_{500}$ denote the events that virus transmission does not occur during encounters 1, 2, ..., 500. Then the probability of no infection is
$$P(C_1 \cap C_2 \cap \cdots \cap C_{500}) = \left(1 - \frac{1}{500}\right)^{500} = .37$$
so the probability of infection is 1 − .37 = .63, not 1, which is the answer produced by incorrectly adding probabilities. ■

EXAMPLE E
Consider a circuit with three relays (Figure 1.4). Let $A_i$ denote the event that the ith relay works, and assume that $P(A_i) = p$ and that the relays are mutually independent. If F denotes the event that current flows through the circuit, then $F = A_3 \cup (A_1 \cap A_2)$ and, from the addition law and the assumption of independence,
$$P(F) = P(A_3) + P(A_1 \cap A_2) - P(A_1 \cap A_2 \cap A_3) = p + p^2 - p^3 \qquad \blacksquare$$

FIGURE 1.4 Circuit with three relays.

EXAMPLE F
Suppose that a system consists of components connected in a series, so the system fails if any one component fails. If there are n mutually independent components and each fails with probability p, what is the probability that the system will fail? It is easier to find the probability of the complement of this event; the system works if and only if all the components work, and this situation has probability $(1 - p)^n$. The probability that the system fails is then $1 - (1 - p)^n$. For example, if n = 10 and p = .05, the probability that the system works is only $.95^{10} = .60$, and the probability that the system fails is .40. Suppose, instead, that the components are connected in parallel, so the system fails only when all components fail. In this case, the probability that the system fails is only $.05^{10} = 9.8 \times 10^{-14}$. ■

Calculations like those in Example F are made in reliability studies for systems consisting of quite complicated networks of components. The absolutely crucial assumption is that the components are independent of one another. Theoretical studies of the reliability of nuclear power plants have been criticized on the grounds that they incorrectly assume independence of the components.

EXAMPLE G Matching DNA Fragments
Fragments of DNA are often compared for similarity, for example, across species. A simple way to make a comparison is to count the number of locations, or sites, at which these fragments agree. For example, consider these two sequences, which agree at three sites: fragment 1: AGATCAGT; fragment 2: TGGATACT. Many such comparisons are made, and to sort the wheat from the chaff, a probability model is often used. A comparison is deemed interesting if the number of matches is much larger than would be expected by chance alone. This requires a chance model; a simple one stipulates that the nucleotide at each site of fragment 1 occurs randomly with probabilities $p_{A1}, p_{G1}, p_{C1}, p_{T1}$, and that the second fragment is similarly composed with probabilities $p_{A2}, \ldots, p_{T2}$. What is the chance that the fragments match at a particular site if in fact the identity of the nucleotide on fragment 1 is independent of that on fragment 2? The match probability can be calculated using the law of total probability:
$$P(\text{match}) = P(\text{match} \mid \text{A on fragment 1})P(\text{A on fragment 1}) + \cdots + P(\text{match} \mid \text{T on fragment 1})P(\text{T on fragment 1}) = p_{A2}p_{A1} + p_{G2}p_{G1} + p_{C2}p_{C1} + p_{T2}p_{T1}$$
The problem of determining the probability that the fragments match at k out of a total of n sites is discussed later. ■

1.7 Concluding Remarks

This chapter provides a simple axiomatic development of the mathematical theory of probability. Some subtle issues that arise in a careful analysis of infinite sample spaces have been neglected.
Such issues are typically addressed in graduate-level courses in measure theory and probability theory. Certain philosophical questions have also been avoided. One might ask what is meant by the statement “The probability that this coin will land heads up is 1 2 .” Two commonly advocated views are the frequen- tist approach and the Bayesian approach. According to the frequentist approach, the statement means that if the experiment were repeated many times, the long-run average number of heads would tend to 1 2 . According to the Bayesian approach, the statement is a quantification of the speaker’s uncertainty about the outcome of the experiment and thus is a personal or subjective notion; the probability that the coin will land heads up may be different for different speakers, depending on their ex- perience and knowledge of the situation. There has been vigorous and occasionally acrimonious debate among proponents of various versions of these points of view. In this and ensuing chapters, there are many examples of the use of probability as a model for various phenomena. In any such modeling endeavor, an idealized mathematical theory is hoped to provide an adequate match to characteristics of the phenomenon under study. The standard of adequacy is relative to the field of study and the modeler’s goals. 1.8 Problems 1. A coin is tossed three times and the sequence of heads and tails is recorded. a. List the sample space. b. List the elements that make up the following events: (1) A = at least two heads, (2) B = the first two tosses are heads, (3) C = the last toss is a tail. c. List the elements of the following events: (1) Ac, (2) A ∩ B, (3) A ∪ C. 1.8 Problems 27 2. Two six-sided dice are thrown sequentially, and the face values that come up are recorded. a. List the sample space. b. List the elements that make up the following events: (1) A = the sum of the two values is at least 5, (2) B = the value of the first die is higher than the value of the second, (3) C = the first value is 4. c. List the elements of the following events: (1) A∩C, (2) B∪C, (3) A∩(B∪C). 3. An urn contains three red balls, two green balls, and one white ball. Three balls are drawn without replacement from the urn, and the colors are noted in sequence. List the sample space. Define events A, B, and C as you wish and find their unions and intersections. 4. Draw Venn diagrams to illustrate De Morgan’s laws: (A ∪ B)c = Ac ∩ Bc (A ∩ B)c = Ac ∪ Bc 5. Let A and B be arbitrary events. Let C be the event that either A occurs or B occurs, but not both. Express C in terms of A and B using any of the basic operations of union, intersection, and complement. 6. Verify the following extension of the addition rule (a) by an appropriate Venn diagram and (b) by a formal argument using the axioms of probability and the propositions in this chapter. P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C) 7. Prove Bonferroni’s inequality: P(A ∩ B) ≥ P(A) + P(B) − 1 8. Prove that P n i=1 Ai ≤ n i=1 P(Ai ) 9. The weather forecaster says that the probability of rain on Saturday is 25% and that the probability of rain on Sunday is 25%. Is the probability of rain during the weekend 50%? Why or why not? 10. Make up another example of Simpson’s paradox by changing the numbers in Example B of Section 1.4. 11. The first three digits of a university telephone exchange are 452. 
If all the se- quences of the remaining four digits are equally likely, what is the probability that a randomly selected university phone number contains seven distinct digits? 12. In a game of poker, five players are each dealt 5 cards from a 52-card deck. How many ways are there to deal the cards? 28 Chapter 1 Probability 13. In a game of poker, what is the probability that a five-card hand will contain (a) a straight (five cards in unbroken numerical sequence), (b) four of a kind, and (c) a full house (three cards of one value and two cards of another value)? 14. The four players in a bridge game are each dealt 13 cards. How many ways are there to do this? 15. How many different meals can be made from four kinds of meat, six vegetables, and three starches if a meal consists of one selection from each group? 16. How many different letter arrangements can be obtained from the letters of the word statistically, using all the letters? 17. In acceptance sampling, a purchaser samples 4 items from a lot of 100 and rejects the lot if 1 or more are defective. Graph the probability that the lot is accepted as a function of the percentage of defective items in the lot. 18. A lot of n items contains k defectives, and m are selected randomly and inspected. How should the value of m be chosen so that the probability that at least one defective item turns up is .90? Apply your answer to (a) n = 1000, k = 10, and (b) n = 10,000, k = 100. 19. A committee consists of five Chicanos, two Asians, three African Americans, and two Caucasians. a. A subcommittee of four is chosen at random. What is the probability that all the ethnic groups are represented on the subcommittee? b. Answer the question for part (a) if a subcommittee of five is chosen. 20. A deck of 52 cards is shuffled thoroughly. What is the probability that the four aces are all next to each other? 21. A fair coin is tossed five times. What is the probability of getting a sequence of three heads? 22. A standard deck of 52 cards is shuffled thoroughly, and n cards are turned up. What is the probability that a face card turns up? For what value of n is this probability about .5? 23. How many ways are there to place n indistinguishable balls in n urns so that exactly one urn is empty? 24. If n balls are distributed randomly into k urns, what is the probability that the last urn contains j balls? 25. A woman getting dressed up for a night out is asked by her significant other to wear a red dress, high-heeled sneakers, and a wig. In how many orders can she put on these objects? 26. The game of Mastermind starts in the following way: One player selects four pegs, each peg having six possible colors, and places them in a line. The sec- ond player then tries to guess the sequence of colors. What is the probability of guessing correctly? 1.8 Problems 29 27. If a five-letter word is formed at random (meaning that all sequences of five letters are equally likely), what is the probability that no letter occurs more than once? 28. How many ways are there to encode the 26-letter English alphabet into 8-bit binary words (sequences of eight 0s and 1s)? 29. A poker player is dealt three spades and two hearts. He discards the two hearts and draws two more cards. What is the probability that he draws two more spades? 30. A group of 60 second graders is to be randomly assigned to two classes of 30 each. (The random assignment is ordered by the school district to ensure against any bias.) 
Five of the second graders, Marcelle, Sarah, Michelle, Katy, and Camerin, are close friends. What is the probability that they will all be in the same class? What is the probability that exactly four of them will be? What is the probability that Marcelle will be in one class and her friends in the other? 31. Six male and six female dancers perform the Virginia reel. This dance requires that they form a line consisting of six male/female pairs. How many such ar- rangements are there? 32. A wine taster claims that she can distinguish four vintages of a particular Caber- net. What is the probability that she can do this by merely guessing? (She is confronted with four unlabeled glasses.) 33. An elevator containing five people can stop at any of seven floors. What is the probability that no two people get off at the same floor? Assume that the occupants act independently and that all floors are equally likely for each occupant. 34. Prove the following identity: n k=0 n k m − n n − k = m n (Hint: How can each of the summands be interpreted?) 35. Prove the following two identities both algebraically and by interpreting their meaning combinatorially. a. n r = n n−r b. n r = n−1 r−1 + n−1 r 36. What is the coefficient of x3 y4 in the expansion of (x + y)7? 37. What is the coefficient of x2 y2z3 in the expansion of (x + y + z)7? 38. A child has six blocks, three of which are red and three of which are green. How many patterns can she make by placing them all in a line? If she is given three white blocks, how many total patterns can she make by placing all nine blocks in a line? 39. A monkey at a typewriter types each of the 26 letters of the alphabet exactly once, the order being random. a. What is the probability that the word Hamlet appears somewhere in the string of letters? 30 Chapter 1 Probability b. How many independent monkey typists would you need in order that the probability that the word appears is at least .90? 40. In how many ways can two octopi shake hands? (There are a number of ways to interpret this question—choose one.) 41. A drawer of socks contains seven black socks, eight blue socks, and nine green socks. Two socks are chosen in the dark. a. What is the probability that they match? b. What is the probability that a black pair is chosen? 42. How many ways can 11 boys on a soccer team be grouped into 4 forwards, 3 midfielders, 3 defenders, and 1 goalie? 43. A software development company has three jobs to do. Two of the jobs require three programmers, and the other requires four. If the company employs ten programmers, how many different ways are there to assign them to the jobs? 44. In how many ways can 12 people be divided into three groups of 4 for an evening of bridge? In how many ways can this be done if the 12 consist of six pairs of partners? 45. Show that if the conditional probabilities exist, then P(A1 ∩ A2 ∩···∩ An) = P(A1)P(A2 | A1)P(A3 | A1 ∩ A2) ···P(An | A1 ∩ A2 ∩···∩ An−1) 46. Urn A has three red balls and two white balls, and urn B has two red balls and five white balls. A fair coin is tossed. If it lands heads up, a ball is drawn from urn A; otherwise, a ball is drawn from urn B. a. What is the probability that a red ball is drawn? b. If a red ball is drawn, what is the probability that the coin landed heads up? 47. Urn A has four red, three blue, and two green balls. Urn B has two red, three blue, and four green balls. A ball is drawn from urn A and put into urn B, and then a ball is drawn from urn B. a. 
What is the probability that a red ball is drawn from urn B? b. If a red ball is drawn from urn B, what is the probability that a red ball was drawn from urn A? 48. An urn contains three red and two white balls. A ball is drawn, and then it and another ball of the same color are placed back in the urn. Finally, a second ball is drawn. a. What is the probability that the second ball drawn is white? b. If the second ball drawn is white, what is the probability that the first ball drawn was red? 49. A fair coin is tossed three times. a. What is the probability of two or more heads given that there was at least one head? b. What is the probability given that there was at least one tail? 1.8 Problems 31 50. Two dice are rolled, and the sum of the face values is six. What is the probability that at least one of the dice came up a three? 51. Answer Problem 50 again given that the sum is less than six. 52. Suppose that 5 cards are dealt from a 52-card deck and the first one is a king. What is the probability of at least one more king? 53. A fire insurance company has high-risk, medium-risk, and low-risk clients, who have, respectively, probabilities .02, .01, and .0025 of filing claims within a given year. The proportions of the numbers of clients in the three categories are .10, .20, and .70, respectively. What proportion of the claims filed each year come from high-risk clients? 54. This problem introduces a simple meteorological model, more complicated versions of which have been proposed in the meteorological literature. Consider a sequence of days and let Ri denote the event that it rains on day i. Suppose that P(Ri | Ri−1) = α and P(Rc i | Rc i−1) = β. Suppose further that only today’s weather is relevant to predicting tomorrow’s; that is, P(Ri | Ri−1 ∩ Ri−2 ∩···∩ R0) = P(Ri | Ri−1). a. If the probability of rain today is p, what is the probability of rain tomorrow? b. What is the probability of rain the day after tomorrow? c. What is the probability of rainn days from now? What happens asn approaches infinity? 55. This problem continues Example D of Section 1.5 and concerns occupational mobility. a. Find P(M1 | M2) and P(L1 | L2). b. Find the proportions that will be in the three occupational levels in the third generation. To do this, assume that a son’s occupational status depends on his father’s status, but that given his father’s status, it does not depend on his grandfather’s. 56. A couple has two children. What is the probability that both are girls given that the oldest is a girl? What is the probability that both are girls given that one of them is a girl? 57. There are three cabinets, A, B, and C, each of which has two drawers. Each drawer contains one coin; A has two gold coins, B has two silver coins, and C has one gold and one silver coin. A cabinet is chosen at random, one drawer is opened, and a silver coin is found. What is the probability that the other drawer in that cabinet contains a silver coin? 58. A teacher tells three boys, Drew, Chris, and Jason, that two of them will have to stay after school to help her clean erasers and that one of them will be able to leave. She further says that she has made the decision as to who will leave and who will stay at random by rolling a special three-sided Dungeons and Dragons die. Drew wants to leave to play soccer and has a clever idea about how to increase his chances of doing so. He figures that one of Jason and Chris will certainly stay and asks the teacher to tell him the name of one of the two who will stay. 
Drew’s idea 32 Chapter 1 Probability is that if, for example, Jason is named, then he and Chris are left and they each have a probability .5 of leaving; similarly, if Chris is named, Drew’s probability of leaving is still .5. Thus, by merely asking the teacher a question, Drew will increase his probability of leaving from 1 3 to 1 2 . What do you think of this scheme? 59. A box has three coins. One has two heads, one has two tails, and the other is a fair coin with one head and one tail. A coin is chosen at random, is flipped, and comes up heads. a. What is the probability that the coin chosen is the two-headed coin? b. What is the probability that if it is thrown another time it will come up heads? c. Answer part (a) again, supposing that the coin is thrown a second time and comes up heads again. 60. A factory runs three shifts. In a given day, 1% of the items produced by the first shift are defective, 2% of the second shift’s items are defective, and 5% of the third shift’s items are defective. If the shifts all have the same productivity, what percentage of the items produced in a day are defective? If an item is defective, what is the probability that it was produced by the third shift? 61. Suppose that chips for an integrated circuit are tested and that the probability that they are detected if they are defective is .95, and the probability that they are declared sound if in fact they are sound is .97. If .5% of the chips are faulty, what is the probability that a chip that is declared faulty is sound? 62. Show that if P(A | E) ≥ P(B | E) and P(A | Ec) ≥ P(B | Ec), then P(A) ≥ P(B). 63. Suppose that the probability of living to be older than 70 is .6 and the probability of living to be older than 80 is .2. If a person reaches her 70th birthday, what is the probability that she will celebrate her 80th? 64. If B is an event, with P(B)>0, show that the set function Q(A) = P(A | B) satisfies the axioms for a probability measure. Thus, for example, P(A ∪ C | B) = P(A | B) + P(C | B) − P(A ∩ C | B) 65. Show that if A and B are independent, then A and Bc as well as Ac and Bc are independent. 66. Show that ∅ is independent of A for any A. 67. Show that if A and B are independent, then P(A ∪ B) = P(A) + P(B) − P(A)P(B) 68. If A is independent of B and B is independent of C, then A is independent of C. Prove this statement or give a counterexample if it is false. 69. If A and B are disjoint, can they be independent? 70. If A ⊂ B, can A and B be independent? 71. Show that if A, B, and C are mutually independent, then A ∩ B and C are independent and A ∪ B and C are independent. 1.8 Problems 33 72. Suppose that n components are connected in series. For each unit, there is a backup unit, and the system fails if and only if both a unit and its backup fail. Assuming that all the units are independent and fail with probability p, what is the probability that the system works? For n = 10 and p = .05, compare these results with those of Example F in Section 1.6. 73. A system has n independent units, each of which fails with probability p. The system fails only if k or more of the units fail. What is the probability that the system fails? 74. What is the probability that the following system works if each unit fails inde- pendently with probability p (see Figure 1.5)? FIGURE 1.5 75. This problem deals with an elementary aspect of a simple branching process. A population starts with one member; at time t = 1, it either divides with prob- ability p or dies with probability 1 − p. 
If it divides, then both of its children behave independently with the same two alternatives at time t = 2. What is the probability that there are no members in the third generation? For what value of p is this probability equal to .5? 76. Here is a simple model of a queue. The queue runs in discrete time (t = 0, 1, 2,...), and at each unit of time the first person in the queue is served with probability p and, independently, a new person arrives with probability q.At time t = 0, there is one person in the queue. Find the probabilities that there are 0, 1, 2, 3 people in line at time t = 2. 77. A player throws darts at a target. On each trial, independently of the other trials, he hits the bull’s-eye with probability .05. How many times should he throw so that his probability of hitting the bull’s-eye at least once is .5? 78. This problem introduces some aspects of a simple genetic model. Assume that genes in an organism occur in pairs and that each member of the pair can be either of the types a or A. The possible genotypes of an organism are then AA, Aa, and aa (Aa and aAare equivalent). When two organisms mate, each independently contributes one of its two genes; either one of the pair is transmitted with prob- ability .5. a. Suppose that the genotypes of the parents are AA and Aa. Find the possible genotypes of their offspring and the corresponding probabilities. 34 Chapter 1 Probability b. Suppose that the probabilities of the genotypes AA, Aa, and aa are p, 2q, and r, respectively, in the first generation. Find the probabilities in the second and third generations, and show that these are the same. This result is called the Hardy-Weinberg Law. c. Compute the probabilities for the second and third generations as in part (b) but under the additional assumption that the probabilities that an individual of type AA, Aa, or aa survives to mate are u,v,and w, respectively. 79. Many human diseases are genetically transmitted (for example, hemophilia or Tay-Sachs disease). Here is a simple model for such a disease. The genotype aa is diseased and dies before it mates. The genotype Aa is a carrier but is not diseased. The genotype AA is not a carrier and is not diseased. a. If two carriers mate, what are the probabilities that their offspring are of each of the three genotypes? b. If the male offspring of two carriers is not diseased, what is the probability that he is a carrier? c. Suppose that the nondiseased offspring of part (b) mates with a member of the population for whom no family history is available and who is thus assumed to have probability p of being a carrier ( p is a very small number). What are the probabilities that their first offspring has the genotypes AA, Aa, and aa? d. Suppose that the first offspring of part (c) is not diseased. What is the proba- bility that the father is a carrier in light of this evidence? 80. If a parent has genotype Aa, he transmits either A or a to an offspring (each with a 1 2 chance). The gene he transmits to one offspring is independent of the one he transmits to another. Consider a parent with three children and the following events: A ={children 1 and 2 have the same gene}, B ={children 1 and 3 have the same gene}, C ={children 2 and 3 have the same gene}. Show that these events are pairwise independent but not mutually independent. CHAPTER 2 Random Variables 2.1 Discrete Random Variables A random variable is essentially a random number. As motivation for a definition, let us consider an example. 
A coin is thrown three times, and the sequence of heads and tails is observed; thus,
$$\Omega = \{hhh, hht, htt, hth, ttt, tth, thh, tht\}$$
Examples of random variables defined on Ω are (1) the total number of heads, (2) the total number of tails, and (3) the number of heads minus the number of tails. Each of these is a real-valued function defined on Ω; that is, each is a rule that assigns a real number to every point ω ∈ Ω. Since the outcome in Ω is random, the corresponding number is random as well.

In general, a random variable is a function from Ω to the real numbers. Because the outcome of the experiment with sample space Ω is random, the number produced by the function is random as well. It is conventional to denote random variables by italic uppercase letters from the end of the alphabet. For example, we might define X to be the total number of heads in the experiment described above.

A discrete random variable is a random variable that can take on only a finite or at most a countably infinite number of values. The random variable X just defined is a discrete random variable since it can take on only the values 0, 1, 2, and 3. For an example of a random variable that can take on a countably infinite number of values, consider an experiment that consists of tossing a coin until a head turns up and defining Y to be the total number of tosses. The possible values of Y are 1, 2, 3, .... In general, a countably infinite set is one that can be put into one-to-one correspondence with the integers.

If the coin is fair, then each of the outcomes in Ω above has probability 1/8, from which the probabilities that X takes on the values 0, 1, 2, and 3 can be easily computed:
$$P(X = 0) = \tfrac{1}{8}, \quad P(X = 1) = \tfrac{3}{8}, \quad P(X = 2) = \tfrac{3}{8}, \quad P(X = 3) = \tfrac{1}{8}$$
Generally, the probability measure on the sample space determines the probabilities of the various values of X; if those values are denoted by $x_1, x_2, \ldots$, then there is a function p such that $p(x_i) = P(X = x_i)$ and $\sum_i p(x_i) = 1$. This function is called the probability mass function, or the frequency function, of the random variable X. Figure 2.1 shows a graph of p(x) for the coin tossing experiment. The frequency function describes completely the probability properties of the random variable.

FIGURE 2.1 A probability mass function.

In addition to the frequency function, it is sometimes convenient to use the cumulative distribution function (cdf) of a random variable, which is defined to be
$$F(x) = P(X \le x), \qquad -\infty < x < \infty$$
Cumulative distribution functions are usually denoted by uppercase letters and frequency functions by lowercase letters. Figure 2.2 is a graph of the cumulative distribution function of the random variable X of the preceding paragraph. Note that the cdf jumps wherever p(x) > 0 and that the jump at $x_i$ is $p(x_i)$. For example, if 0 ≤ x < 1, then F(x) = 1/8; at x = 1, F(x) jumps to F(1) = 4/8 = 1/2. The jump at x = 1 is p(1) = 3/8. The cumulative distribution function is nondecreasing and satisfies $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.

FIGURE 2.2 The cumulative distribution function corresponding to Figure 2.1.

Chapter 3 will cover in detail the joint frequency functions of several random variables defined on the same sample space, but it is useful to define here the concept of independence of random variables.
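The frequency function and cdf of this experiment are simple enough to compute by enumeration; a short Python sketch using exact arithmetic:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("ht", repeat=3))   # the 8 equally likely points of Omega
# X = number of heads; pmf values should be 1/8, 3/8, 3/8, 1/8.
pmf = {k: Fraction(sum(o.count("h") == k for o in outcomes), 8) for k in range(4)}
for k in range(4):
    print(k, pmf[k])

def F(x):
    """cdf: P(X <= x), a sum of pmf jumps at the values k <= x."""
    return sum(p for k, p in pmf.items() if k <= x)

print(F(0.5), F(1), F(2.7))                # 1/8, 1/2, 7/8
```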
In the case of two discrete random variables X and Y, taking on possible values $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, X and Y are said to be independent if, for all i and j,
$$P(X = x_i \text{ and } Y = y_j) = P(X = x_i)P(Y = y_j)$$
The definition is extended to collections of more than two discrete random variables in the obvious way; for example, X, Y, and Z are said to be mutually independent if, for all i, j, and k,
$$P(X = x_i, Y = y_j, Z = z_k) = P(X = x_i)P(Y = y_j)P(Z = z_k)$$
We next discuss some common discrete distributions that arise in applications.

2.1.1 Bernoulli Random Variables

A Bernoulli random variable takes on only two values, 0 and 1, with probabilities 1 − p and p, respectively. Its frequency function is thus
$$p(1) = p, \qquad p(0) = 1 - p, \qquad p(x) = 0 \text{ if } x \ne 0 \text{ and } x \ne 1$$
An alternative and sometimes useful representation of this function is
$$p(x) = \begin{cases} p^x (1-p)^{1-x}, & x = 0 \text{ or } x = 1 \\ 0, & \text{otherwise} \end{cases}$$
If A is an event, then the indicator random variable, $I_A$, takes on the value 1 if A occurs and the value 0 if A does not occur:
$$I_A(\omega) = \begin{cases} 1, & \omega \in A \\ 0, & \text{otherwise} \end{cases}$$
$I_A$ is a Bernoulli random variable. In applications, Bernoulli random variables often occur as indicators. A Bernoulli random variable might take on the value 1 or 0 according to whether a guess was a success or a failure.

2.1.2 The Binomial Distribution

Suppose that n independent experiments, or trials, are performed, where n is a fixed number, and that each experiment results in a "success" with probability p and a "failure" with probability 1 − p. The total number of successes, X, is a binomial random variable with parameters n and p. For example, a coin is tossed 10 times and the total number of heads is counted ("head" is identified with "success").

The probability that X = k, or p(k), can be found in the following way: Any particular sequence of k successes occurs with probability $p^k(1-p)^{n-k}$, from the multiplication principle. The total number of such sequences is $\binom{n}{k}$, since there are $\binom{n}{k}$ ways to assign k successes to n trials. P(X = k) is thus the probability of any particular sequence times the number of such sequences:
$$p(k) = \binom{n}{k} p^k (1-p)^{n-k}$$
Two binomial frequency functions are shown in Figure 2.3. Note how the shape varies as a function of p.

FIGURE 2.3 Binomial frequency functions, (a) n = 10 and p = .1 and (b) n = 10 and p = .5.

EXAMPLE A
Tay-Sachs disease is a rare but fatal disease of genetic origin occurring chiefly in infants and children, especially those of Jewish or eastern European extraction. If a couple are both carriers of Tay-Sachs disease, a child of theirs has probability .25 of being born with the disease. If such a couple has four children, what is the frequency function for the number of children who will have the disease? We assume that the four outcomes are independent of each other, so, if X denotes the number of children with the disease, its frequency function is
$$p(k) = \binom{4}{k} .25^k \times .75^{4-k}, \qquad k = 0, 1, 2, 3, 4$$
These probabilities are given in the following table:

k    p(k)
0    .316
1    .422
2    .211
3    .047
4    .004
■
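A sketch that reproduces the Tay-Sachs table directly from the binomial frequency function (binom_pmf is our helper name):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial frequency function p(k) = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Reproduce the Tay-Sachs table: n = 4 children, p = .25 per child.
for k in range(5):
    print(k, round(binom_pmf(k, 4, 0.25), 3))
# 0 0.316, 1 0.422, 2 0.211, 3 0.047, 4 0.004
```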
EXAMPLE B
If a single bit (0 or 1) is transmitted over a noisy communications channel, it has probability p of being incorrectly transmitted. To improve the reliability of the transmission, the bit is transmitted n times, where n is odd. A decoder at the receiving end, called a majority decoder, decides that the correct message is that carried by a majority of the received bits. Under a simple noise model, each bit is independently subject to being corrupted with the same probability p. The number of bits in error, X, is thus a binomial random variable with n trials and probability p of success on each trial (in this case, and frequently elsewhere, the word success is used in a generic sense; here a success is an error). Suppose, for example, that n = 5 and p = .1. The probability that the message is correctly received is the probability of two or fewer errors, which is
$$\sum_{k=0}^{2} \binom{n}{k} p^k (1-p)^{n-k} = (1-p)^5 + 5p(1-p)^4 + 10p^2(1-p)^3 = .9914$$
The result is a considerable improvement in reliability. ■

EXAMPLE C DNA Matching
We continue Example G of Section 1.6. There we derived the probability p that two fragments agree at a particular site under the assumption that the nucleotide probabilities were the same at every site and the identities on fragment 1 were independent of those on fragment 2. To find the probability of the total number of matches, further assumptions must be made. Suppose that the fragments are each of length n and that the nucleotide identities are independent from site to site as well as between fragments. Thus, the identity of the nucleotide at site 1 of fragment 1 is independent of the identity at site 2, etc. We did not make this assumption in Example G of Section 1.6; in that case, the identity at site 2 could have depended on the identity at site 1, for example. Now, under the current assumption, the two fragments agree at each site with probability p as calculated in Example G of Section 1.6, and agreement is independent from site to site. So, the total number of agreements is a binomial random variable with n trials and probability p of success. ■

A random variable with a binomial distribution can be expressed in terms of independent Bernoulli random variables, a fact that will be quite useful for analyzing some properties of binomial random variables in later chapters of this book. Specifically, let $X_1, X_2, \ldots, X_n$ be independent Bernoulli random variables with $P(X_i = 1) = p$. Then $Y = X_1 + X_2 + \cdots + X_n$ is a binomial random variable.

2.1.3 The Geometric and Negative Binomial Distributions

The geometric distribution is also constructed from independent Bernoulli trials, but from an infinite sequence. On each trial, a success occurs with probability p, and X is the total number of trials up to and including the first success. For X to equal k, there must be k − 1 failures followed by a success. From the independence of the trials, this occurs with probability
$$p(k) = P(X = k) = (1-p)^{k-1} p, \qquad k = 1, 2, 3, \ldots$$
Note that these probabilities sum to 1:
$$\sum_{k=1}^{\infty} (1-p)^{k-1} p = p \sum_{j=0}^{\infty} (1-p)^j = 1$$

EXAMPLE A
The probability of winning in a certain state lottery is said to be about 1/9. If it is exactly 1/9, the distribution of the number of tickets a person must purchase up to and including the first winning ticket is geometric with p = 1/9. Figure 2.4 shows the frequency function. ■

FIGURE 2.4 The probability mass function of a geometric random variable with p = 1/9.
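Two quick checks on the geometric distribution of Example A, assuming p is exactly 1/9: the probabilities sum to 1, and the probability of needing more than 20 tickets is the probability that the first 20 all lose.

```python
p = 1 / 9

def geom_pmf(k, p):
    """P(X = k): k - 1 failures followed by the first success."""
    return (1 - p) ** (k - 1) * p

print(sum(geom_pmf(k, p) for k in range(1, 10_000)))  # ~1.0: the pmf sums to 1
print((1 - p) ** 20)   # P(X > 20) = (8/9)^20, about .095
```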
The negative binomial distribution arises as a generalization of the geometric distribution. Suppose that a sequence of independent trials, each with probability of success p, is performed until there are r successes in all; let X denote the total number of trials. To find P(X = k), we can argue in the following way: Any particular such sequence has probability $p^r(1-p)^{k-r}$, from the independence assumption. The last trial is a success, and the remaining r − 1 successes can be assigned to the remaining k − 1 trials in $\binom{k-1}{r-1}$ ways. Thus,
$$P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}$$
It is sometimes helpful in analyzing properties of the negative binomial distribution to note that a negative binomial random variable can be expressed as the sum of r independent geometric random variables: the number of trials up to and including the first success, plus the number of trials after the first success up to and including the second success, ..., plus the number of trials from the (r − 1)st success up to and including the rth success.

EXAMPLE B
Continuing Example A, the distribution of the number of tickets purchased up to and including the second winning ticket is negative binomial:
$$p(k) = (k-1) p^2 (1-p)^{k-2}$$
This frequency function is shown in Figure 2.5. ■

FIGURE 2.5 The probability mass function of a negative binomial random variable with p = 1/9 and r = 2.

The definitions of the geometric and negative binomial distributions vary slightly from one textbook to another; for example, instead of X being the total number of trials in the definition of the geometric distribution, X is sometimes defined as the total number of failures.

2.1.4 The Hypergeometric Distribution

The hypergeometric distribution was introduced in Chapter 1 but was not named there. Suppose that an urn contains n balls, of which r are black and n − r are white. Let X denote the number of black balls drawn when taking m balls without replacement. Following the line of reasoning of Examples H and I of Section 1.4.2,
$$P(X = k) = \frac{\binom{r}{k}\binom{n-r}{m-k}}{\binom{n}{m}}$$
X is a hypergeometric random variable with parameters r, n, and m.

EXAMPLE A
As explained in Example G of Section 1.4.2, a player in the California lottery chooses 6 numbers from 53, and the lottery officials later choose 6 numbers at random. Let X equal the number of matches. Then
$$P(X = k) = \frac{\binom{6}{k}\binom{47}{6-k}}{\binom{53}{6}}$$
The probability mass function of X is displayed in the following table:

k       0      1      2      3      4            5            6
p(k)   .468   .401   .117   .014   7.06×10⁻⁴   1.22×10⁻⁵   4.36×10⁻⁸
■
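The lottery table can be reproduced directly from the hypergeometric formula (hypergeom_pmf is our helper name):

```python
from math import comb

def hypergeom_pmf(k, r, n, m):
    """P(X = k): k black balls among m drawn without replacement (r black of n)."""
    return comb(r, k) * comb(n - r, m - k) / comb(n, m)

# California lottery: the player's 6 numbers play the role of the black balls.
for k in range(7):
    print(k, hypergeom_pmf(k, r=6, n=53, m=6))
```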
2.1.5 The Poisson Distribution

The Poisson frequency function with parameter λ (λ > 0) is
$$P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \qquad k = 0, 1, 2, \ldots$$
Since $e^{\lambda} = \sum_{k=0}^{\infty} \lambda^k/k!$, the frequency function sums to 1. Figure 2.6 shows four Poisson frequency functions. Note how the shape varies as a function of λ.

FIGURE 2.6 Poisson frequency functions, (a) λ = .1, (b) λ = 1, (c) λ = 5, (d) λ = 10.

The Poisson distribution can be derived as the limit of a binomial distribution as the number of trials, n, approaches infinity and the probability of success on each trial, p, approaches zero in such a way that np = λ. The binomial frequency function is
$$p(k) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}$$
Setting np = λ, this expression becomes
$$p(k) = \frac{n!}{k!(n-k)!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k}{k!}\,\frac{n!}{(n-k)!\,n^k} \left(1 - \frac{\lambda}{n}\right)^n \left(1 - \frac{\lambda}{n}\right)^{-k}$$
As $n \to \infty$,
$$\frac{\lambda}{n} \to 0, \qquad \frac{n!}{(n-k)!\,n^k} \to 1, \qquad \left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda}, \qquad \left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$$
We thus have
$$p(k) \to \frac{\lambda^k e^{-\lambda}}{k!}$$
which is the Poisson frequency function.

EXAMPLE A
Two dice are rolled 100 times, and the number of double sixes, X, is counted. The distribution of X is binomial with n = 100 and p = 1/36 = .0278. Since n is large and p is small, we can approximate the binomial probabilities by Poisson probabilities with λ = np = 2.78. The exact binomial probabilities and the Poisson approximations are shown in the following table:

k    Binomial Probability   Poisson Approximation
0         .0596                  .0620
1         .1705                  .1725
2         .2414                  .2397
3         .2255                  .2221
4         .1564                  .1544
5         .0858                  .0858
6         .0389                  .0398
7         .0149                  .0158
8         .0050                  .0055
9         .0015                  .0017
10        .0004                  .0005
11        .0001                  .0001

The approximation is quite good. ■

The Poisson frequency function can be used to approximate binomial probabilities for large n and small p. This suggests how Poisson distributions can arise in practice. Suppose that X is a random variable that equals the number of times some event occurs in a given interval of time. Heuristically, let us think of dividing the interval into a very large number of small subintervals of equal length, and let us assume that the subintervals are so small that the probability of more than one event in a subinterval is negligible relative to the probability of one event, which is itself very small. Let us also assume that the probability of an event is the same in each subinterval and that whether an event occurs in one subinterval is independent of what happens in the other subintervals. The random variable X is then nearly a binomial random variable, with the subintervals constituting the trials, and, from the limiting result above, X has nearly a Poisson distribution.

The preceding argument is not formal, of course, but merely suggestive. But, in fact, it can be made rigorous. The important assumptions underlying it are (1) what happens in one subinterval is independent of what happens in any other subinterval, (2) the probability of an event is the same in each subinterval, and (3) events do not happen simultaneously. The same kind of argument can be made if we are concerned with an area or a volume of space rather than with an interval on the real line.

The Poisson distribution is of fundamental theoretical and practical importance. It has been used in many areas, including the following:
• The Poisson distribution has been used in the analysis of telephone systems. The number of calls coming into an exchange during a unit of time might be modeled as a Poisson variable if the exchange services a large number of customers who act more or less independently.
• One of the earliest uses of the Poisson distribution was to model the number of alpha particles emitted from a radioactive source during a given period of time.
• The Poisson distribution has been used as a model by insurance companies. For example, the number of freak accidents, such as falls in the shower, for a large population of people in a given time period might be modeled as a Poisson distribution, because the accidents would presumably be rare and independent (provided there was only one person in the shower).
• The Poisson distribution has been used by traffic engineers as a model for light traffic. The number of vehicles that pass a marker on a roadway during a unit of time can be counted. If traffic is light, the individual vehicles act independently of each other. In heavy traffic, however, one vehicle's movement may influence another's, so the approximation might not be good.
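The limit can be watched numerically: holding np = λ = 2.78 fixed and letting n grow, the binomial probabilities approach the Poisson values, and n = 100 reproduces the table of Example A. A minimal sketch:

```python
from math import comb, exp, factorial

lam = 2.78                           # matches Example A (two dice, 100 rolls)

def binom(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

for k in range(5):
    print(k,
          round(binom(k, 100, lam / 100), 4),        # n = 100, as in Example A
          round(binom(k, 10_000, lam / 10_000), 4),  # a much finer subdivision
          round(lam**k * exp(-lam) / factorial(k), 4))  # the Poisson limit
```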
EXAMPLE B
This amusing classical example is from von Bortkiewicz (1898). The number of fatalities that resulted from being kicked by a horse was recorded for 10 corps of Prussian cavalry over a period of 20 years, giving 200 corps-years worth of data. These data and the probabilities from a Poisson model with λ = .61 are displayed in the following table. The first column of the table gives the number of deaths per year, ranging from 0 to 4. The second column lists how many times that number of deaths was observed. Thus, for example, in 65 of the 200 corps-years, there was one death. In the third column of the table, the observed numbers are converted to relative frequencies by dividing them by 200. The fourth column of the table gives Poisson probabilities with the parameter λ = .61. In Chapters 8 and 9, we discuss how to choose a parameter value to fit a theoretical probability model to observed frequencies and methods for testing goodness of fit. For now, we will just remark that the value λ = .61 was chosen to match the average number of deaths per year.

Number of Deaths per Year   Observed   Relative Frequency   Poisson Probability
0                             109         .545                 .543
1                              65         .325                 .331
2                              22         .110                 .101
3                               3         .015                 .021
4                               1         .005                 .003
■

The Poisson distribution often arises from a model called a Poisson process for the distribution of random events in a set S, which is typically one-, two-, or three-dimensional, corresponding to time, a plane, or a volume of space. Basically, this model states that if $S_1, S_2, \ldots, S_n$ are disjoint subsets of S, then the numbers of events in these subsets, $N_1, N_2, \ldots, N_n$, are independent random variables that follow Poisson distributions with parameters $\lambda|S_1|, \lambda|S_2|, \ldots, \lambda|S_n|$, where $|S_i|$ denotes the measure of $S_i$ (length, area, or volume, for example). The crucial assumptions here are that events in disjoint subsets are independent of each other and that the Poisson parameter for a subset is proportional to the subset's size. Later, we will see that this latter assumption implies that the average number of events in a subset is proportional to its size.

EXAMPLE C
Suppose that an office receives telephone calls as a Poisson process with λ = .5 per minute. The number of calls in a 5-minute interval then follows a Poisson distribution with parameter ω = 5λ = 2.5. Thus, the probability of no calls in a 5-minute interval is $e^{-2.5} = .082$. The probability of exactly one call is $2.5e^{-2.5} = .205$. ■

EXAMPLE D
Figure 2.7 shows four realizations of a Poisson process with λ = 25 in the unit square, 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. It is interesting that the eye tends to perceive patterns, such as clusters of points and large blank spaces. But by the nature of a Poisson process, the locations of the points have no relationship to one another, and these patterns are entirely a result of chance. ■

FIGURE 2.7 Four realizations of a Poisson process with λ = 25.
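A simulation consistent with Example C: the times between events of a Poisson process are exponential (a fact developed in Section 2.2.1), so a count over an interval can be generated by accumulating exponential gaps. The helper name poisson_draw is ours:

```python
import random
from math import exp

lam = 0.5 * 5                      # Poisson parameter for a 5-minute interval
print(exp(-lam), lam * exp(-lam))  # P(no calls) ~ .082, P(exactly one call) ~ .205

def poisson_draw(rate):
    """Count events in one unit of time by accumulating exponential gaps."""
    t, k = random.expovariate(rate), 0
    while t <= 1.0:
        k += 1
        t += random.expovariate(rate)
    return k

draws = [poisson_draw(2.5) for _ in range(100_000)]
print(sum(d == 0 for d in draws) / len(draws))   # ~ .082, agreeing with e^(-2.5)
```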
2.2 Continuous Random Variables

In applications, we are often interested in random variables that can take on a continuum of values rather than a finite or countably infinite number. For example, a model for the lifetime of an electronic component might be that it is random and can be any positive real number. For a continuous random variable, the role of the frequency function is taken by a density function, f(x), which has the properties that f(x) ≥ 0, f is piecewise continuous, and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. If X is a random variable with density function f, then for any a < b, the probability that X falls in the interval (a, b) is the area under the density function between a and b:
$$P(a < X < b) = \int_a^b f(x)\,dx$$

EXAMPLE A
A uniform random variable on the interval [0, 1] is a model for what we mean when we say "choose a number at random between 0 and 1." Any real number in the interval is a possible outcome, and the probability model should have the property that the probability that X is in any subinterval of length h is equal to h. The following density function does the job:
$$f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & x < 0 \text{ or } x > 1 \end{cases}$$
This is called the uniform density on [0, 1]. The uniform density on a general interval [a, b] is
$$f(x) = \begin{cases} 1/(b-a), & a \le x \le b \\ 0, & x < a \text{ or } x > b \end{cases} \qquad \blacksquare$$

One consequence of this definition is that the probability that a continuous random variable X takes on any particular value is 0:
$$P(X = c) = \int_c^c f(x)\,dx = 0$$
Although this may seem strange initially, it is really quite natural. If the uniform random variable of Example A had a positive probability of being any particular number, it should have the same probability for any number in [0, 1], in which case the sum of the probabilities of any countably infinite subset of [0, 1] (for example, the rational numbers) would be infinite. Also, if X is a continuous random variable, then
$$P(a < X < b) = P(a \le X < b) = P(a < X \le b)$$
Note that this is not true for a discrete random variable.

For small δ, if f is continuous at x,
$$P\left(x - \frac{\delta}{2} \le X \le x + \frac{\delta}{2}\right) = \int_{x-\delta/2}^{x+\delta/2} f(u)\,du \approx \delta f(x)$$
Therefore, the probability of a small interval around x is proportional to f(x). It is sometimes useful to employ differential notation: P(x ≤ X ≤ x + dx) = f(x) dx.

The cumulative distribution function of a continuous random variable X is defined in the same way as for a discrete random variable:
$$F(x) = P(X \le x)$$
F(x) can be expressed in terms of the density function:
$$F(x) = \int_{-\infty}^{x} f(u)\,du$$
From the fundamental theorem of calculus, if f is continuous at x, then $f(x) = F'(x)$. The cdf can be used to evaluate the probability that X falls in an interval:
$$P(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$$

EXAMPLE B
From this definition, we see that the cdf of a uniform random variable on [0, 1] (Example A) is
$$F(x) = \begin{cases} 0, & x \le 0 \\ x, & 0 \le x \le 1 \\ 1, & x \ge 1 \end{cases} \qquad \blacksquare$$

Suppose that F is the cdf of a continuous random variable and is strictly increasing on some interval I, that F = 0 to the left of I, and that F = 1 to the right of I; I may be unbounded. Under this assumption, the inverse function $F^{-1}$ is well defined: $x = F^{-1}(y)$ if $y = F(x)$. The pth quantile of the distribution F is defined to be that value $x_p$ such that $F(x_p) = p$, or $P(X \le x_p) = p$. Under the preceding assumption, $x_p$ is uniquely defined as $x_p = F^{-1}(p)$; see Figure 2.8. Special cases are p = 1/2, which corresponds to the median of F, and p = 1/4 and p = 3/4, which correspond to the lower and upper quartiles of F.

FIGURE 2.8 A cdf F and F⁻¹.

EXAMPLE C
Suppose that F(x) = x² for 0 ≤ x ≤ 1. This statement is shorthand for the more explicit statement
$$F(x) = \begin{cases} 0, & x \le 0 \\ x^2, & 0 \le x \le 1 \\ 1, & x \ge 1 \end{cases}$$
To find $F^{-1}$, we solve $y = F(x) = x^2$ for x, obtaining $x = F^{-1}(y) = \sqrt{y}$. The median is $F^{-1}(.5) = .707$, the lower quartile is $F^{-1}(.25) = .50$, and the upper quartile is $F^{-1}(.75) = .866$. ■
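A short sketch of Example C's quantile calculation; the last line illustrates the standard inverse-cdf method of simulating a draw from F, an aside of ours rather than a claim from the text:

```python
import random
from math import sqrt

# Example C: invert y = F(x) = x^2 on [0, 1] to get F^{-1}(y) = sqrt(y).
def F_inv(y):
    return sqrt(y)

for p in (0.25, 0.5, 0.75):
    print(p, round(F_inv(p), 3))   # lower quartile .5, median .707, upper quartile .866

# Feeding a uniform random number through F^{-1} simulates a draw with cdf F.
print(F_inv(random.random()))
```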
It involves two parameters: a time horizon and a level of confidence. For example, if the VaR of an institution is $10 million with a one-day horizon and a level of confidence of 95%, the interpretation is that there is a 5% chance of losses exceeding $10 million. Such a loss should be anticipated about once in 20 days. To see how VaR is computed, suppose the current value of the investment is V0 and the future value is V1. The return on the investment is R = (V1 − V0)/V0, which is modeled as a continuous random variable with cdf FR(r). Let the desired level of confidence be denoted by 1 − α. We want to find v*, the VaR. Then

$$\alpha = P(V_0 - V_1 \ge v^*) = P\left(\frac{V_1 - V_0}{V_0} \le -\frac{v^*}{V_0}\right) = F_R\left(-\frac{v^*}{V_0}\right)$$

Thus, −v*/V0 is the α quantile, rα; and v* = −V0 rα. The VaR is minus the current value times the α quantile of the return distribution. ■

We next discuss some density functions that commonly arise in practice.

2.2.1 The Exponential Density

The exponential density function is

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

Like the Poisson distribution, the exponential density depends on a single parameter, λ > 0, and it would therefore be more accurate to refer to it as the family of exponential densities. Several exponential densities are shown in Figure 2.9. Note that as λ becomes larger, the density drops off more rapidly.

FIGURE 2.9 Exponential densities with λ = .5 (solid), λ = 1 (dotted), and λ = 2 (dashed).

The cumulative distribution function is easily found:

$$F(x) = \int_{-\infty}^{x} f(u)\,du = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

The median of an exponential distribution, η, say, is readily found from the cdf. We solve F(η) = 1/2:

$$1 - e^{-\lambda \eta} = \frac{1}{2}$$

from which we have

$$\eta = \frac{\log 2}{\lambda}$$

The exponential distribution is often used to model lifetimes or waiting times, in which context it is conventional to replace x by t. Suppose that we consider modeling the lifetime of an electronic component as an exponential random variable, that the component has lasted a length of time s, and that we wish to calculate the probability that it will last at least t more time units; that is, we wish to find P(T > t + s | T > s):

$$P(T > t + s \mid T > s) = \frac{P(T > t + s \text{ and } T > s)}{P(T > s)} = \frac{P(T > t + s)}{P(T > s)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t}$$

We see that the probability that the unit will last t more time units does not depend on s. The exponential distribution is consequently said to be memoryless; it is clearly not a good model for human lifetimes, since the probability that a 16-year-old will live at least 10 more years is not the same as the probability that an 80-year-old will live at least 10 more years. It can be shown that the exponential distribution is characterized by this memoryless property—that is, the memorylessness implies that the distribution is exponential. It may be somewhat surprising that a qualitative characterization, the property of memorylessness, actually determines the form of this density function.

The memoryless character of the exponential distribution follows directly from its relation to a Poisson process. Suppose that events occur in time as a Poisson process with parameter λ and that an event occurs at time t0. Let T denote the length of time until the next event.
The density of T can be found as follows:

$$P(T > t) = P(\text{no events in } (t_0, t_0 + t))$$

Since the number of events in the interval (t0, t0 + t), which is of length t, follows a Poisson distribution with parameter λt, this probability is $e^{-\lambda t}$, and thus T follows an exponential distribution with parameter λ. We can continue in this fashion. Suppose that the next event occurs at time t1; the distribution of time until the third event is again exponential by the same analysis and, from the independence property of the Poisson process, is independent of the length of time between the first two events. Generally, the times between events of a Poisson process are independent, identically distributed, exponential random variables.

Proteins and other biologically important molecules are regulated in various ways. Some undergo aging and are thus more likely to degrade when they are old than when they are young. If a molecule was not subject to aging, but its chance of degradation was the same at any age, its lifetime would follow an exponential distribution.

EXAMPLE A Muscle and nerve cell membranes contain large numbers of channels through which selected ions can pass when the channels are open. Using sophisticated experimental techniques, neurophysiologists can measure the resulting current that flows through a single channel, and experimental records often indicate that a channel opens and closes at seemingly random times. In some cases, simple kinetic models predict that the duration of the open time should be exponentially distributed.

FIGURE 2.10 Histograms of open times at varying concentrations of suxamethonium (panels A–F: 10, 20, 50, 100, 200, and 500 μM-Sux, with fitted mean open times decreasing from 1.18 ms to 38 μs) and fitted exponential densities.

Marshall et al. (1990) studied the action of a channel-blocking agent (suxamethonium) on a channel (the nicotinic receptor of frog muscle). Figure 2.10 displays histograms of open times and fitted exponential distributions at a range of concentrations of suxamethonium. In this example, the exponential distribution is parametrized as f(t) = (1/τ) exp(−t/τ); τ is thus in units of time, whereas λ is in units of the reciprocal of time. From the figure, we see that the intervals become shorter and that the parameter τ decreases with increasing concentrations of the blocker. It can also be seen, especially at higher concentrations, that very short intervals are not recorded because of limitations of the instrumentation. ■

2.2.2 The Gamma Density

The gamma density function depends on two parameters, α and λ:

$$g(t) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, t^{\alpha-1} e^{-\lambda t}, \qquad t \ge 0$$

For t < 0, g(t) = 0. So that the density be well defined and integrate to 1, α > 0 and λ > 0. The gamma function, Γ(x), is defined as

$$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\,du, \qquad x > 0$$

Some properties of the gamma function are developed in the problems at the end of this chapter. Note that if α = 1, the gamma density coincides with the exponential density.
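This reduction is easy to confirm in software. The following is a minimal sketch, assuming numpy and scipy; note that scipy.stats parametrizes the gamma family by the shape a = α and a scale equal to 1/λ:

import numpy as np
from scipy.stats import expon, gamma

lam = 2.0
t = np.linspace(0.1, 5.0, 50)

# scipy's gamma takes the shape a (our alpha) and scale (our 1/lambda).
g = gamma.pdf(t, a=1.0, scale=1 / lam)
e = expon.pdf(t, scale=1 / lam)  # lambda * exp(-lambda * t)
print(np.allclose(g, e))  # True: alpha = 1 recovers the exponential density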
The parameter α is called a shape parameter for the gamma density, and λ is called a scale parameter. Varying α changes the shape of the density, whereas varying λ corresponds to changing the units of measurement (say, from seconds to minutes) and does not affect the shape of the density. Figure 2.11 shows several gamma densities. Gamma densities provide a fairly flexible class for modeling nonnegative random variables. .5 .5 t g ( t ) 00 1.0 1.5 2.0 1.0 1.5 2.0 2.5 (a) .05 5 t g ( t ) 001015 20 .10 .15 .20 (b) FIGURE 2.11 Gamma densities, (a) α = .5 (solid) and α = 1 (dotted) and (b) α = 5 (solid) and α = 10 (dotted); λ = 1 in all cases. EXAMPLE A The patterns of occurrence of earthquakes in terms of time, space, and magnitude are very erratic, and attempts are sometimes made to construct probabilistic models for these events. The models may be used in a purely descriptive manner or, more ambitiously, for purposes of predicting future occurrences and consequent damage. Figure 2.12 shows the fit of a gamma density and an exponential density to the observed times separating a sequence of small earthquakes (Udias and Rice, 1975). The gamma density clearly gives a better fit (α = .509 and λ = .00115). Note that an 54 Chapter 2 Random Variables 100 5 Hours Frequency 0 01015 20 200 300 400 500 25 30 35 40 45 50 600 1369 (1331) 19681971 Time intervals N 4764 FIGURE 2.12 Fit of gamma density (triangles) and of exponential density (circles) to times between microearthquakes. exponential model for interoccurrence times would be memoryless; that is, knowing that an earthquake had not occurred in the last t time units would tell us nothing about the probability of occurrence during the next s time units. The gamma model does not have this property. In fact, although we will not show this, the gamma model with these parameter values has the character that there is a large likelihood that the next earthquake will immediately follow any given one and this likelihood decreases monotonically with time. ■ 2.2.3 The Normal Distribution The normal distribution plays a central role in probability and statistics, for reasons that will become apparent in later chapters of this book. This distribution is also called the Gaussian distribution after Carl Friedrich Gauss, who proposed it as a model for measurement errors. The central limit theorem, which will be discussed in Chapter 6, justifies the use of the normal distribution in many applications. Roughly, the central limit theorem says that if a random variable is the sum of a large number of independent random variables, it is approximately normally distributed. The normal distribution has been used as a model for such diverse phenomena as a person’s height, the distribu- tion of IQ scores, and the velocity of a gas molecule. The density function of the normal distribution depends on two parameters, μ and σ (where −∞ <μ<∞,σ >0): f (x) = 1 σ √ 2π e−(x−μ)2/2σ 2 , −∞ < x < ∞ 2.2 Continuous Random Variables 55 The parameters μ and σ are called the mean and standard deviation of the normal density. The cdf cannot be evaluated in closed form from this density function (the integral that defines the cdf cannot be evaluated by an explicit formula but must be found numerically). A problem at the end of this chapter asks you to show that the normal density just given integrates to one. As shorthand for the statement “X follows a normal distribution with parameters μ and σ,” it is convenient to use X ∼ N(μ, σ 2). 
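That the density integrates to one is to be shown analytically in Problem 51 at the end of this chapter; purely as a numerical sanity check, and not a proof, the density can also be integrated directly. A minimal sketch, assuming numpy and scipy, with arbitrarily chosen illustrative values of μ and σ:

import numpy as np
from scipy.integrate import quad

mu, sigma = 3.0, 2.0  # arbitrary illustrative parameter values

def normal_pdf(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# The tails beyond 12 standard deviations contribute negligibly.
area, _ = quad(normal_pdf, mu - 12 * sigma, mu + 12 * sigma)
print(area)  # ~1.0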
From the form of the density function, we see that the density is symmetric about μ, f (μ − x) = f (μ + x), where it has a maximum, and that the rate at which it falls off is determined by σ. Figure 2.13 shows several normal densities. Normal densities are sometimes referred to as bell-shaped curves. The special case for which μ = 0 and σ = 1 is called the standard normal density. Its cdf is denoted by  and its density by φ (not to be confused with the empty set). The relationship between the general normal density and the standard normal density will be developed in the next section. .2 4 x f ( x ) 0 6 2 0 2 .4 .6 .8 4 6 FIGURE 2.13 Normal densities, μ = 0 and σ = .5 (solid), μ = 0 and σ = 1 (dotted), and μ = 0 and σ = 2 (dashed). EXAMPLE A Acoustic recordings made in the ocean contain substantial background noise. To de- tect sonar signals of interest, it is useful to characterize this noise as accurately as possible. In the Arctic, much of the background noise is produced by the cracking and straining of ice. Veitch and Wilks (1985) studied recordings of Arctic undersea noise and characterized the noise as a mixture of a Gaussian component and occa- sional large-amplitude bursts. Figure 2.14 is a trace of one recording that includes a burst. Figure 2.15 shows a Gaussian distribution fit to observations from a “quiet” (nonbursty) period of this noise. ■ 56 Chapter 2 Random Variables 8 20 Time (milliseconds) Standardized Amplitude 04060 80 6 4 2 100 0 2 FIGURE 2.14 A record of undersea noise containing a large burst. .1 40 2 0 2 .2 .3 4 .4 .5 FIGURE 2.15 A histogram from a “quiet” period of undersea noise with a fitted normal density. EXAMPLE B Turbulent air flow is sometimes modeled as a random process. Since the velocity of the flow at any point is subject to the influence of a large number of random eddies in the neighborhood of that point, one might expect from the central limit theorem that the velocity would be normally distributed. Van Atta and Chen (1968) analyzed data gathered in a wind tunnel. Figure 2.16, taken from their paper, shows a normal distribution fit to 409,600 observations of one component of the velocity; the fit is remarkably good. ■ EXAMPLE C S&P 500 The Standard and Poors 500 is an index of important U.S. stocks; each stock’s weight in the index is proportional to its market value. Individuals can invest in mutual funds that track the index. The top panel of Figure 2.17 shows the sequential values of the 2.2 Continuous Random Variables 57 ␴ p ( u/ ␴ ) .05 2 u/␴ 03 1 0 1 .10 .15 .20 2 3 .25 .30 .35 .40 FIGURE 2.16 A normal density (solid line) fit to 409,600 measurements of one component of the velocity of a turbulent wind flow. The dots show the values from a histogram. 0 50 100 150 200 250 0.03 0 0.03 Return Density 0.04 0.02 0 0.02 0.04 40 30 20 10 0 Time Returns FIGURE 2.17 Returns on the S&P 500 during 2003 (top panel) and a normal curve fitted to their histogram (bottom panel). The region area to the left of the 0.05 quantile is shaded. 58 Chapter 2 Random Variables returns during 2003. The average return during this period was 0.1% per day, and we can see from the figure that daily fluctuations were as large as 3% or 4%. The lower panel of the figure shows a histogram of the returns and a fitted normal density with μ = 0.001 and σ = 0.01. A financial company could use the fitted normal density in calculating its Value at Risk (see Example D of Section 2.2). 
Using a time horizon of one day and a confidence level of 95%, the VaR is the current investment in the index, V0, multiplied by the negative of the 0.05 quantile of the distribution of returns. In this case, the quantile can be calculated to be −0.0165, so the VaR is .0165V0. Thus, if V0 is $10 million, the VaR is $165,000. The company can have 95% “confidence” that its losses will not exceed that amount on a given day. However, it should not be surprised if that amount is exceeded about once in every 20 trading days. ■ 2.2.4 The Beta Density The beta density is useful for modeling random variables that are restricted to the interval [0, 1]: f (u) = (a + b) (a) (b)ua−1(1 − u)b−1, 0 ≤ u ≤ 1 Figure 2.18 shows beta densities for various values of a and b. Note that the case a = b = 1 is the uniform distribution. The beta distribution is important in Bayesian statistics, as you will see later. 2.3 Functions of a Random Variable Suppose that a random variable X has a density function f (x). We often need to find the density function of Y = g(X) for some given function g. For example, X might be the velocity of a particle of mass m, and we might be interested in the probability density function of the particle’s kinetic energy, Y = 1 2 mX2. Often, the density and cdf of X are denoted by fX and FX ; and those of Y,by fY and FY . To illustrate techniques for solving such a problem, we first develop some useful facts about the normal distribution. Suppose that X ∼ N(μ, σ 2) and that Y = aX+b, where a > 0. The cumulative distribution function of Y is FY (y) = P(Y ≤ y) = P(aX + b ≤ y) = P X ≤ y − b a = FX y − b a 2.3 Functions of a Random Variable 59 0.4 0 0.8 1.2 1.6 0.2 p Beta density (a) .4 .6 .8 1.0 1.0 0 2.0 3.0 0.2 p Beta density (b) .4 .6 .8 1.0 1.0 0 2.0 3.0 0.2 p Beta density (c) .4 .6 .8 1.0 4 0 8 12 0.2 p Beta density (d) .4 .6 .8 1.0 10 6 2 FIGURE 2.18 Beta density functions for various values of a and b: (a) a = 2, b = 2; (b) a = 6, b = 2; (c) a = 6, b = 6; and (d) a = .5, b = 4. Thus, fY (y) = d dyFX y − b a = 1 a fX y − b a Up to this point, we have not used the assumption of normality at all, so this result holds for a general continuous random variable, provided that FX is appropriately differentiable. If fX is a normal density function with parameters μ and σ, we find that, after substitution, fY (y) = 1 aσ √ 2π exp −1 2 y − b − aμ aσ 2 From this, we see that Y follows a normal distribution with parameters aμ + b and aσ. The case for which a < 0 can be analyzed similarly (see Problem 57 in the end-of-chapter problems), yielding the following proposition. PROPOSITION A If X ∼ N(μ, σ 2) and Y = aX + b, then Y ∼ N(aμ + b, a2σ 2). ■ This proposition is quite useful for finding probabilities from the normal dis- tribution. Suppose that X ∼ N(μ, σ 2) and we wish to find P(x0 < X < x1) for 60 Chapter 2 Random Variables some numbers x0 and x1. Consider the random variable Z = X − μ σ = X σ − μ σ Applying Proposition A with a = 1/σ and b =−μ/σ, we see that Z ∼ N(0, 1), that is, Z follows a standard normal distribution. Therefore, FX (x) = P(X ≤ x) = P X − μ σ ≤ x − μ σ = P Z ≤ x − μ σ =  x − μ σ We thus have P(x0 < X < x1) = FX (x1) − FX (x0) =  x1 − μ σ −  x0 − μ σ Thus, probabilities for general normal random variables can be evaluated in terms of probabilities for standard normal random variables. This is quite useful, since tables need to be made up only for the standard normal distribution rather than separately for every μ and σ. 
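In software, the standard normal cdf Φ and its inverse play the role of the table. The following sketch (assuming scipy, with illustrative interval endpoints) computes an interval probability by standardization and then recomputes the 0.05 quantile of the fitted N(0.001, 0.01²) return distribution from the S&P 500 example; with these rounded parameter values the quantile comes out near −0.0154, so the −0.0165 quoted above presumably reflects the unrounded sample estimates:

from scipy.stats import norm

mu, sigma = 0.001, 0.01  # rounded S&P 500 return fit from Example C

# P(x0 < X < x1) = Phi((x1 - mu)/sigma) - Phi((x0 - mu)/sigma)
x0, x1 = -0.01, 0.02
p = norm.cdf((x1 - mu) / sigma) - norm.cdf((x0 - mu) / sigma)
print(p)

# The alpha = 0.05 quantile of the return distribution, as used for VaR:
r_alpha = mu + sigma * norm.ppf(0.05)
print(r_alpha)  # about -0.0154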
EXAMPLE A Scores on a certain standardized test, IQ scores, are approximately normally distributed with mean μ = 100 and standard deviation σ = 15. Here we are referring to the distribution of scores over a very large population, and we approximate that discrete cumulative distribution function by a normal continuous cumulative distribution function. An individual is selected at random. What is the probability that his score X satisfies 120 < X < 130? We can calculate this probability by using the standard normal distribution as follows:

$$P(120 < X < 130) = P\left(\frac{120 - 100}{15} < \frac{X - 100}{15} < \frac{130 - 100}{15}\right) = P(1.33 < Z < 2)$$

where Z follows a standard normal distribution. Using a table of the standard normal distribution (Table 2 of Appendix B), this probability is

$$P(1.33 < Z < 2) = \Phi(2) - \Phi(1.33) = .9772 - .9082 = .069$$

Thus, approximately 7% of the population will have scores in this range. ■

EXAMPLE B Let X ∼ N(μ, σ²), and find the probability that X is less than σ away from μ; that is, find P(|X − μ| < σ). This probability is

$$P(-\sigma < X - \mu < \sigma) = P(-1 < Z < 1) = \Phi(1) - \Phi(-1) \approx .68 \quad \blacksquare$$

2.4 Concluding Remarks

... For example, the lifetime of a component could be modeled by a random variable X with P(X = 0) > 0 if it does not function at all; if it does function, the lifetime could be modeled as a continuous random variable.

2.5 Problems

1. Suppose that X is a discrete random variable with P(X = 0) = .25, P(X = 1) = .125, P(X = 2) = .125, and P(X = 3) = .5. Graph the frequency function and the cumulative distribution function of X.

2. An experiment consists of throwing a fair coin four times. Find the frequency function and the cumulative distribution function of the following random variables: (a) the number of heads before the first tail, (b) the number of heads following the first tail, (c) the number of heads minus the number of tails, and (d) the number of tails times the number of heads.

3. The following table shows the cumulative distribution function of a discrete random variable. Find the frequency function.

k    F(k)
0    0
1    .1
2    .3
3    .7
4    .8
5    1.0

4. If X is an integer-valued random variable, show that the frequency function is related to the cdf by p(k) = F(k) − F(k − 1).

5. Show that P(u < X ≤ v) = F(v) − F(u) for any u and v in the cases that (a) X is a discrete random variable and (b) X is a continuous random variable.

6. Let A and B be events, and let IA and IB be the associated indicator random variables. Show that $I_{A \cap B} = I_A I_B = \min(I_A, I_B)$ and $I_{A \cup B} = \max(I_A, I_B)$.

7. Find the cdf of a Bernoulli random variable.

8. Show that the binomial probabilities sum to 1.

9. For what values of p is a two-out-of-three majority decoder better than transmission of the message once?

10. Appending three extra bits to a 4-bit word in a particular way (a Hamming code) allows detection and correction of up to one error in any of the bits. If each bit has probability .05 of being changed during communication, and the bits are changed independently of each other, what is the probability that the word is correctly received (that is, 0 or 1 bit is in error)? How does this probability compare to the probability that the word will be transmitted correctly with no check bits, in which case all four bits would have to be transmitted correctly for the word to be correct?

11. Consider the binomial distribution with n trials and probability p of success on each trial. For what value of k is P(X = k) maximized? This value is called the mode of the distribution. (Hint: Consider the ratio of successive terms.)

12. Which is more likely: 9 heads in 10 tosses of a fair coin or 18 heads in 20 tosses?

13.
A multiple-choice test consists of 20 items, each with four choices. A student is able to eliminate one of the choices on each question as incorrect and chooses randomly from the remaining three choices. A passing grade is 12 items or more correct. a. What is the probability that the student passes? b. Answer the question in part (a) again, assuming that the student can eliminate two of the choices on each question. 14. Two boys play basketball in the following way. They take turns shooting and stop when a basket is made. Player A goes first and has probability p1 of mak- ing a basket on any throw. Player B, who shoots second, has probability p2 of making a basket. The outcomes of the successive trials are assumed to be inde- pendent. a. Find the frequency function for the total number of attempts. b. What is the probability that player A wins? 15. Two teams, A and B, play a series of games. If team A has probability .4 of winning each game, is it to its advantage to play the best three out of five games or the best four out of seven? Assume the outcomes of successive games are independent. 16. Show that if n approaches ∞ and r/n approaches p and m is fixed, the hyper- geometric frequency function tends to the binomial frequency function. Give a heuristic argument for why this is true. 17. Suppose that in a sequence of independent Bernoulli trials, each with probability of success p, the number of failures up to the first success is counted. What is the frequency function for this random variable? 18. Continuing with Problem 17, find the frequency function for the number of failures up to the rth success. 66 Chapter 2 Random Variables 19. Find an expression for the cumulative distribution function of a geometric random variable. 20. If X is a geometric random variable with p = .5, for what value of k is P(X ≤ k) ≈ .99? 21. If X is a geometric random variable, show that P(X > n + k − 1|X > n − 1) = P(X > k) In light of the construction of a geometric distribution from a sequence of inde- pendent Bernoulli trials, how can this be interpreted so that it is “obvious”? 22. Three identical fair coins are thrown simultaneously until all three show the same face. What is the probability that they are thrown more than three times? 23. In a sequence of independent trials with probability p of success, what is the probability that there are r successes before the kth failure? 24. (Banach Match Problem) A pipe smoker carries one box of matches in his left pocket and one box in his right. Initially, each box contains n matches. If he needs a match, the smoker is equally likely to choose either pocket. What is the frequency function for the number of matches in the other box when he first discovers that one box is empty? 25. The probability of being dealt a royal straight flush (ace, king, queen, jack, and ten of the same suit) in poker is about 1.3 × 10−8. Suppose that an avid poker player sees 100 hands a week, 52 weeks a year, for 20 years. a. What is the probability that she is never dealt a royal straight flush dealt? b. What is the probability that she is dealt exactly two royal straight flushes? 26. The university administration assures a mathematician that he has only 1 chance in 10,000 of being trapped in a much-maligned elevator in the mathematics building. If he goes to work 5 days a week, 52 weeks a year, for 10 years, and always rides the elevator up to his office when he first arrives, what is the probability that he will never be trapped? That he will be trapped once? Twice? 
Assume that the outcomes on all the days are mutually independent (a dubious assumption in practice). 27. Suppose that a rare disease has an incidence of 1 in 1000. Assuming that members of the population are affected independently, find the probability of k cases in a population of 100,000 for k = 0, 1, 2. 28. Let p0, p1,...,pn denote the probability mass function of the binomial distribu- tion with parameters n and p. Let q = 1− p. Show that the binomial probabilities can be computed recursively by p0 = qn and pk = (n − k + 1)p kq pk−1, k = 1, 2,...,n Use this relation to find P(X ≤ 4) for n = 9000 and p = .0005. 2.5 Problems 67 29. Show that the Poisson probabilities p0, p1,...can be computed recursively by p0 = exp(−λ) and pk = λ k pk−1, k = 1, 2,... Use this scheme to find P(X ≤ 4) for λ = 4.5 and compare to the results of Problem 28. 30. Suppose that in a city, the number of suicides can be approximated by a Poisson process with λ = .33 per month. a. Find the probability of k suicides in a year for k = 0, 1, 2,.... What is the most probable number of suicides? b. What is the probability of two suicides in one week? 31. Phone calls are received at a certain residence as a Poisson process with parameter λ = 2 per hour. a. If Diane takes a 10-min. shower, what is the probability that the phone rings during that time? b. How long can her shower be if she wishes the probability of receiving no phone calls to be at most .5? 32. For what value of k is the Poisson frequency function with parameter λ maxi- mized? (Hint: Consider the ratio of consecutive terms.) 33. Let F(x) = 1 − exp(−αxβ) for x ≥ 0,α >0,β >0, and F(x) = 0 for x < 0. Show that F is a cdf, and find the corresponding density. 34. Let f (x) = (1 + αx)/2 for −1 ≤ x ≤ 1 and f (x) = 0 otherwise, where −1 ≤ α ≤ 1. Show that f is a density, and find the corresponding cdf. Find the quartiles and the median of the distribution in terms of α. 35. Sketch the pdf and cdf of a random variable that is uniform on [−1, 1]. 36. If U is a uniform random variable on [0, 1], what is the distribution of the random variable X = [nU], where [t] denotes the greatest integer less than or equal to t? 37. A line segment of length 1 is cut once at random. What is the probability that the longer piece is more than twice the length of the shorter piece? 38. If f and g are densities, show that αf + (1 − α)g is a density, where 0 ≤ α ≤ 1. 39. The Cauchy cumulative distribution function is F(x) = 1 2 + 1 π tan−1(x), −∞ < x < ∞ a. Show that this is a cdf. b. Find the density function. c. Find x such that P(X > x) = .1. 40. Suppose that X has the density function f (x) = cx2 for 0 ≤ x ≤ 1 and f (x) = 0 otherwise. a. Find c. b. Find the cdf. c. What is P(.1 ≤ X <.5)? 68 Chapter 2 Random Variables 41. Find the upper and lower quartiles of the exponential distribution. 42. Find the probability density for the distance from an event to its nearest neighbor for a Poisson process in the plane. 43. Find the probability density for the distance from an event to its nearest neighbor for a Poisson process in three-dimensional space. 44. Let T be an exponential random variable with parameter λ. Let X be a discrete random variable defined as X = k if k ≤ T < k + 1, k = 0, 1,.... Find the frequency function of X. 45. Suppose that the lifetime of an electronic component follows an exponential distribution with λ = .1. a. Find the probability that the lifetime is less than 10. b. Find the probability that the lifetime is between 5 and 15. c. 
Find t such that the probability that the lifetime is greater than t is .01. 46. T is an exponential random variable, and P(T < 1) = .05. What is λ? 47. If α>1, show that the gamma density has a maximum at (α − 1)/λ. 48. Show that the gamma density integrates to 1. 49. The gamma function is a generalized factorial function. a. Show that (1) = 1. b. Show that (x + 1) = x (x).(Hint: Use integration by parts.) c. Conclude that (n) = (n − 1)!, for n = 1, 2, 3,.... d. Use the fact that (1 2 ) = √π to show that, if n is an odd integer, n 2 = √π(n − 1)! 2n−1 n−1 2 ! 50. Show by a change of variables that (x) = 2 ∞ 0 t2x−1e−t2 dt = ∞ −∞ exte−et dt 51. Show that the normal density integrates to 1. (Hint: First make a change of variables to reduce the integral to that for the standard normal. The problem is then to show that ∞ −∞ exp(−x2/2) dx = √ 2π. Square both sides and reexpress the problem as that of showing ∞ −∞ exp(−x2/2) dx ∞ −∞ exp(−y2/2) dy = 2π Finally, write the product of integrals as a double integral and change to polar coordinates.) 2.5 Problems 69 52. Suppose that in a certain population, individuals’ heights are approximately nor- mally distributed with parameters μ = 70 and σ = 3 in. a. What proportion of the population is over 6 ft. tall? b. What is the distribution of heights if they are expressed in centimeters? In meters? 53. Let X be a normal random variable with μ = 5 and σ = 10. Find (a) P(X > 10), (b) P(−20 < X < 15), and (c) the value of x such that P(X > x) = .05. 54. If X ∼ N(μ, σ 2), show that P(|X − μ|≤.675σ)= .5. 55. X ∼ N(μ, σ 2), find the value of c in terms of σ such that P(μ − c ≤ X ≤ μ + c) = .95. 56. If X ∼ N(0,σ2), find the density of Y =|X|. 57. X ∼ N(μ, σ 2) and Y = aX+b, where a < 0, show that Y ∼ N(aμ+b, a2σ 2). 58. If U is uniform on [0, 1], find the density function of √ U. 59. If U is uniform on [−1, 1], find the density function of U 2. 60. Find the density function of Y = eZ , where Z ∼ N(μ, σ 2). This is called the lognormal density, since log Y is normally distributed. 61. Find the density of cX when X follows a gamma distribution. Show that only λ is affected by such a transformation, which justifies calling λ a scale parameter. 62. Show that if X has a density function fX and Y = aX + b, then fY (y) = 1 |a| fX y − b a 63. Suppose that  follows a uniform distribution on the interval [−π/2,π/2]. Find the cdf and density of tan . 64. A particle of mass m has a random velocity, V , which is normally distributed with parameters μ = 0 and σ. Find the density function of the kinetic energy, E = 1 2 mV2. 65. How could random variables with the following density function be generated from a uniform random number generator? f (x) = 1 + αx 2 , −1 ≤ x ≤ 1, −1 ≤ α ≤ 1 66. Let f (x) = αx−α−1 for x ≥ 1 and f (x) = 0 otherwise, where α is a positive parameter. Show how to generate random variables from this density from a uniform random number generator. 67. The Weibull cumulative distribution function is F(x) = 1 − e−(x/α)β , x ≥ 0,α>0,β>0 a. Find the density function. 70 Chapter 2 Random Variables b. Show that if W follows a Weibull distribution, then X = (W/α)β follows an exponential distribution. c. How could Weibull random variables be generated from a uniform random number generator? 68. If the radius of a circle is an exponential random variable, find the density function of the area. 69. If the radius of a sphere is an exponential random variable, find the density function of the volume. 70. Let U be a uniform random variable. 
Find the density function of V = U −α, α>0. Compare the rates of decrease of the tails of the densities as a function of α. Does the comparison make sense intuitively? 71. This problem shows one way to generate discrete random variables from a uni- form random number generator. Suppose that F is the cdf of an integer-valued random variable; let U be uniform on [0, 1]. Define a random variable Y = k if F(k − 1) Y) can be found by integrating f over the set {(x, y)|0 ≤ y ≤ x ≤ 1}: P(X > Y) = 12 7 1 0 x 0 (x2 + xy) dy dx = 9 14 ■ 76 Chapter 3 Joint Distributions 1 0.8 0.6 0.2 0.4 01 0.8 0.6 0.4 0.2 0 0 1 2 3 x y f (x, y) FIGURE 3.4 The density function f (x, y) = 12 7 (x2 + xy), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. The marginal cdf of X,orFX ,is FX (x) = P(X ≤ x) = limy→∞ F(x, y) = x −∞ ∞ −∞ f (u, y) dy du From this, it follows that the density function of X alone, known as the marginal density of X,is fX (x) = F X (x) = ∞ −∞ f (x, y) dy In the discrete case, the marginal frequency function was found by summing the joint frequency function over the other variable; in the continuous case, it is found by integration. EXAMPLE B Continuing Example A, the marginal density of X is fX (x) = 12 7 1 0 (x2 + xy) dy = 12 7 x2 + x 2 A similar calculation shows that the marginal density of Y is fY (y) = 12 7 ( 1 3 + y/2). ■ For several jointly continuous random variables, we can make the obvious gen- eralizations. The joint density function is a function of several variables, and the marginal density functions are found by integration. There are marginal density 3.3 Continuous Random Variables 77 functions of various dimensions. Suppose that X, Y, and Z are jointly continuous random variables with density function f (x, y, z). The one-dimensional marginal distribution of X is fX (x) = ∞ −∞ ∞ −∞ f (x, y, z) dy dz and the two-dimensional marginal distribution of X and Y is fXY(x, y) = ∞ −∞ f (x, y, z) dz EXAMPLE C Farlie-Morgenstern Family If F(x) and G(y) are one-dimensional cdfs, it can be shown that, for any α for which |α|≤1, H(x, y) = F(x)G(y){1 + α[1 − F(x)][1 − G(y)]} is a bivariate cumulative distribution function. Because limx→∞ F(x) = limy→∞ F(y) = 1, the marginal distributions are H(x, ∞) = F(x) H(∞, y) = G(y) In this way, an infinite number of different bivariate distributions with given marginals can be constructed. As an example, we will construct bivariate distributions with marginals that are uniform on [0, 1] [F(x) = x, 0 ≤ x ≤ 1, and G(y) = y, 0 ≤ y ≤ 1]. First, with α =−1, we have H(x, y) = xy[1 − (1 − x)(1 − y)] = x2 y + y2x − x2 y2, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 The bivariate density is h(x, y) = ∂2 ∂x∂y H(x, y) = 2x + 2y − 4xy, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 The density is shown in Figure 3.5. Perhaps you can imagine integrating over y (pushing all the mass onto the x axis) to produce a marginal uniform density for x. Next, if α = 1, H(x, y) = xy[1 + (1 − x)(1 − y)] = 2xy − x2 y − y2x + x2 y2, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 The density is h(x, y) = 2 − 2x − 2y + 4xy, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 This density is shown in Figure 3.6. We just constructed two different bivariate distributions, both of which have uniform marginals. ■ 78 Chapter 3 Joint Distributions 1 0.8 0.6 0.4 0.2 0 2 1.5 1 0.5 0 h(x, y) 0 0.2 0.4 0.6 0.8 1x y FIGURE 3.5 The joint density h(x, y) = 2x + 2y − 4xy, where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, which has uniform marginal densities. 1 0.8 0.6 0.4 0.2 02 1.5 1 0.5 0 h(x, y) 0 0.2 0.4 0.6 0.8 1x y FIGURE 3.6 The joint density h(x, y) = 2 − 2x − 2y + 4xy, where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, which has uniform marginal densities. 
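These computations are easy to verify symbolically. A minimal sketch, assuming the sympy library and taking the α = −1 member of the family:

import sympy as sp

x, y = sp.symbols("x y")
h = 2 * x + 2 * y - 4 * x * y  # the density constructed above with alpha = -1

# Integrating out either variable gives the uniform density on [0, 1]:
print(sp.integrate(h, (y, 0, 1)))  # 1
print(sp.integrate(h, (x, 0, 1)))  # 1

# And h is the mixed second partial of H(x, y) = x*y*(1 - (1 - x)*(1 - y)):
H = x * y * (1 - (1 - x) * (1 - y))
print(sp.expand(sp.diff(H, x, y)))  # 2*x + 2*y - 4*x*y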
A copula is a joint cumulative distribution function of random variables that have uniform marginal distributions. The functions H(x, y) in the preceding example are copulas. Note that a copula C(u,v) is nondecreasing in each variable, because it is a cdf. Also, P(U ≤ u) = C(u, 1) = u and C(1,v) = v, since the marginal distributions are uniform. We will restrict ourselves to copulas that have densities, in which case the density is c(u,v)= ∂2 ∂u∂vC(u,v)≥ 0 3.3 Continuous Random Variables 79 Now, suppose that X and Y are continuous random variables with cdfs FX (x) and FY (y). Then U = FX (x) and V = FY (y) are uniform random variables (Propo- sition 2.3C). For a copula C(u,v), consider the joint distribution defined by FXY(x, y) = C(FX (x), FY (y)) Since C(FX (x), 1) = FX (x), the marginal cdfs corresponding to FXY are FX (x) and FY (y). Using the chain rule, the corresponding density is fXY(x, y) = c(FX (x), FY (y)) fX (x) fY (y) This construction points out that from the ingredients of two marginal distributions and any copula, a joint distribution with those marginals can be constructed. It is thus clear that the marginal distributions do not determine the joint distribution. The dependence between the random variables is captured in the copula. Copulas are not just academic curiousities—they have been extensively used in financial statistics in recent years to model dependencies in the returns of financial instruments. EXAMPLE D Consider the following joint density: f (x, y) = λ2e−λy, 0 ≤ x ≤ y,λ>0 0, elsewhere This joint density is plotted in Figure 3.7. To find the marginal densities, it is helpful to draw a picture showing where the density is nonzero to aid in determining the limits of integration (see Figure 3.8). 3 2 1 03 2 1 0 x y 1 .75 .5 .25 0 f (x, y) FIGURE 3.7 The joint density of Example D. 80 Chapter 3 Joint Distributions x y y x FIGURE 3.8 The joint density of Example D is nonzero over the shaded region of the plane. First consider the marginal density fX (x) = ∞ −∞ fXY(x, y)dy. Since f (x, y) = 0 for x ≥ y, fX (x) = ∞ x λ2e−λydy = λe−λx , x ≥ 0 and we see that the marginal distribution of X is exponential. Next, because fXY(x, y) = 0 for x ≤ 0 and x > y, fY (y) = y 0 λ2e−λydx = λ2 ye−λy, y ≥ 0 The marginal distribution of Y is a gamma distribution. ■ In some applications, it is useful to analyze distributions that are uniform over some region of space. For example, in the plane, the random point (X, Y) is uniform over a region, R, if for any A ⊂ R, P((X, Y) ∈ A) = |A| |R| where ||denotes area. EXAMPLE E A point is chosen randomly in a disk of radius 1. Since the area of the disk is π, f (x, y) = 1 π , if x2 + y2 ≤ 1 0, otherwise 3.3 Continuous Random Variables 81 We can calculate the distribution of R, the distance of the point from the origin. R ≤ r if the point lies in a disk of radius r. Since this disk has area πr 2, FR(r) = P(R ≤ r) = πr 2 π = r 2 The density function of R is thus fR(r) = 2r, 0 ≤ r ≤ 1. Let us now find the marginal density of the x coordinate of the random point: fX (x) = ∞ −∞ f (x, y) dy = 1 π √ 1−x2 − √ 1−x2 dy = 2 π 1 − x2, −1 ≤ x ≤ 1 Note that we chose the limits of integration carefully; outside these limits the joint density is zero. (Draw a picture of the region over which f (x, y)>0 and indicate the preceding limits of integration.) 
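Since the integrand is constant in y along the chord of the disk at a fixed x, this calculation amounts to getting the limits of integration right, which can be checked numerically. A minimal sketch, assuming numpy and scipy:

import numpy as np
from scipy.integrate import quad

# Uniform density on the unit disk: f(x, y) = 1/pi for x^2 + y^2 <= 1.
def f(x, y):
    return 1 / np.pi if x**2 + y**2 <= 1 else 0.0

# Integrate out y at a few fixed x and compare with (2/pi) * sqrt(1 - x^2).
for x0 in (-0.5, 0.0, 0.8):
    lim = np.sqrt(1 - x0**2)
    fx, _ = quad(lambda y: f(x0, y), -lim, lim)
    print(fx, 2 / np.pi * np.sqrt(1 - x0**2))  # the two values agree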
By symmetry, the marginal density of Y is fY (y) = 2 π 1 − y2, −1 ≤ y ≤ 1 ■ EXAMPLE F Bivariate Normal Density The bivariate normal density is given by the complicated expression f (x, y) = 1 2πσX σY 1 − ρ2 exp − 1 2(1 − ρ2) (x − μX )2 σ 2 X + (y − μY )2 σ 2 Y −2ρ(x − μX )(y − μY ) σX σY One of the earliest uses of this bivariate density was as a model for the joint distribution of the heights of fathers and sons. The density depends on five parameters: −∞ <μX < ∞−∞<μY < ∞ σX > 0 σY > 0 −1 <ρ<1 The contour lines of the density are the lines in the xyplane on which the joint density is constant. From the preceding equation, we see that f (x, y) is constant if (x − μX )2 σ 2 X + (y − μY )2 σ 2 Y − 2ρ(x − μX )(y − μY ) σX σY = constant The locus of such points is an ellipse centered at (μX ,μY ).Ifρ = 0, the axes of the ellipse are parallel to the x and y axes, and if ρ = 0, they are tilted. Figure 3.9 shows several bivariate normal densities, and Figure 3.10 shows the corresponding elliptical contours. 82 Chapter 3 Joint Distributions .3 .2 .1 0 .4 2 0 2 2 0 2 (c) .3 .2 .1 0 .4 2 0 2 2 0 2 (b) .3 .2 .1 0 .4 2 0 2 2 0 2 (d) .3 .2 .1 0 .4 2 0 2 2 0 2 (a) FIGURE 3.9 Bivariate normal densities with μX = μY = 0 and σX = σY = 1 and (a) ρ = 0, (b) ρ = .3, (c) ρ = .6, (d) ρ = .9. The marginal distributions of X and Y are N(μX ,σ2 X ) and N(μY ,σ2 Y ), respec- tively, as we will now demonstrate. The marginal density of X is fX (x) = ∞ −∞ fXY(x, y) dy Making the changes of variables u = (x − μX )/σX and v = (y − μY )/σY gives us fX (x) = 1 2πσX 1 − ρ2 ∞ −∞ exp − 1 2(1 − ρ2)(u2 + v2 − 2ρuv) dv 3.3 Continuous Random Variables 83 2 1 0 1 2012 (a) 2 2 1 0 1 2 10 12 (b) 2 2 1 0 1 2 10 12 (c) 2 2 1 0 1 2 10 12 (d) 2 1 FIGURE 3.10 The elliptical contours of the bivariate normal densities of Figure 3.9. To evaluate this integral, we use the technique of completing the square. Using the identity u2 + v2 − 2ρuv = (v − ρu)2 + u2(1 − ρ2) we have fX (x) = 1 2πσX 1 − ρ2 e−u2/2 ∞ −∞ exp − 1 2(1 − ρ2)(v − ρu)2 dv Finally, recognizing the integral as that of a normal density with mean ρu and variance (1 − ρ2), we obtain fX (x) = 1 σX √ 2π e−(1/2) (x−μX )2/σ 2 X which is a normal density, as was to be shown. Thus, for example, the marginal distributions of x and y in Figure 3.9 are all standard normal, even though the joint distributions of (a)–(d) are quite different from each other. ■ We saw in our discussion of copulas earlier in this section that marginal densities do not determine joint densities. For example, we can take both marginal densities to be normal with parameters μ = 0 and σ = 1 and use the Farlie-Morgenstern 84 Chapter 3 Joint Distributions 0 3 32 2 1 12 1 0 1 2 3 x y 0 0.05 0.1 0.15 0.2 f (x, y) FIGURE 3.11 A bivariate density that has normal marginals but is not bivariate normal. The contours of the density shown in the xy plane are not elliptical. copula with density c(u,v)= 2 − 2u − 2v + 4uv. Denoting the normal density and cumulative distribution functions by φ(x) and (x), the bivariate density is f (x, y) = (2 − 2(x) − 2(y) + 4(x)(y))φ(x)φ(y) This density and its contours are shown in Figure 3.11. Note that the contours are not elliptical. This bivariate density has normal marginals, but it is not a bivariate normal density. 3.4 Independent Random Variables DEFINITION Random variables X1, X2,...,Xn are said to be independent if their joint cdf factors into the product of their marginal cdf’s: F(x1, x2,...,xn) = FX1 (x1)FX2 (x2) ···FXn (xn) for all x1, x2,...,xn. 
■ The definition holds for both continuous and discrete random variables. For discrete random variables, it is equivalent to state that their joint frequency function factors; for continuous random variables, it is equivalent to state that their joint density function factors. To see why this is true, consider the case of two jointly continuous random variables, X and Y. If they are independent, then F(x, y) = FX (x)FY (y) 3.4 Independent Random Variables 85 and taking the second mixed partial derivative makes it clear that the density function factors. On the other hand, if the density function factors, then the joint cdf can be expressed as a product: F(x, y) = x −∞ y −∞ fX (u) fY (v) dv du = x −∞ fX (u) du y −∞ fY (v) dv = FX (x)FY (y) It can be shown that the definition implies that if X and Y are independent, then P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) It can also be shown that if g and h are functions, then Z = g(X) and W = h(Y) are independent as well. A sketch of an argument goes like this (the details are beyond the level of this course): We wish to find P(Z ≤ z, W ≤ w). Let A(z) be the set of x such that g(x) ≤ z, and let B(w) be the set of y such that h(y) ≤ w. Then P(Z ≤ z, W ≤ w) = P(X ∈ A(z), Y ∈ B(w)) = P(X ∈ A(z))P(Y ∈ B(w)) = P(Z ≤ z)P(W ≤ w) EXAMPLE A Suppose that the point (X, Y) is uniformly distributed on the square S ={(x, y) | −1/2 ≤ x ≤ 1/2, −1/2 ≤ y ≤ 1/2}: fXY(x, y) = 1 for (x, y) in S and 0 elsewhere. Make a sketch of this square. You can visualize that the marginal distributions of X and Y are uniform on [−1/2, 1/2]. For example, the marginal density at a point x, −1/2 ≤ x ≤ 1/2 is found by integrating (summing) the joint density over the vertical line that meets the horizontal axis at x. Thus, fX (x) = 1, −1/2 ≤ x ≤ 1/2 and fY (y) = 1, and − 1/2 ≤ y ≤ 1/2. The joint density is equal to the product of the marginal densities, so X and Y are independent. You should be able to see from our sketch that knowing the value of X gives no information about the possible values of Y. ■ EXAMPLE B Now consider rotating the square of the previous example by 90◦ to form a diamond. Sketch this diamond. From the sketch, you can see that the marginal density of X is nonnegative for −1/2 ≤ x ≤ 1/2 as before, but it is not uniform, and similarly for the marginal density of Y. Thus, for example, fX (.9)>0 and fY (.9)>0. But from the sketch you can also see that fXY(.9,.9) = 0. Thus, X and Y are not independent. Finally, the sketch shows you that knowing the value of X— for example, X = .9— constrains the possible values of Y. ■ 86 Chapter 3 Joint Distributions EXAMPLE C Farlie-Morgenstern Family From Example C in Section 3.3, we see that X and Y are independent only if α = 0, since only in this case does the joint cdf H factor into the product of the marginals F and G. ■ EXAMPLE D If X and Y follow a bivariate normal distribution (Example F from Section 3.3) and ρ = 0, their joint density factors into the product of two normal densities, and therefore X and Y are independent. ■ EXAMPLE E Suppose that a node in a communications network has the property that if two packets of information arrive within time τ of each other, they “collide” and then have to be retransmitted. If the times of arrival of the two packets are independent and uniform on [0, T ], what is the probability that they collide? 
The times of arrival of two packets, T1 and T2, are independent and uniform on [0, T ], so their joint density is the product of the marginals, or f (t1, t2) = 1 T 2 for t1 and t2 in the square with sides [0, T ]. Therefore, (T1, T2) is uniformly distributed over the square. The probability that the two packets collide is proportional to the area of the shaded strip in Figure 3.12. Each of the unshaded triangles of the figure has area (T −τ)2/2, and thus the area of the shaded area is T 2 −(T −τ)2. Integrating f (t1, t2) over this area gives the desired probability: 1 − (1 − τ/T )2. ■ t2 T T t1 FIGURE 3.12 The probability that the two packets collide is proportional to the area of the shaded region |t1 − t2| <τ 3.5 Conditional Distributions 87 3.5 Conditional Distributions 3.5.1 The Discrete Case If X and Y are jointly distributed discrete random variables, the conditional probability that X = xi given that Y = y j is, if pY (y j )>0, P(X = xi |Y = y j ) = P(X = xi , Y = y j ) P(Y = y j ) = pXY(xi , y j ) pY (y j ) This probability is defined to be zero if pY (y j ) = 0. We will denote this conditional probability by pX|Y (x|y). Note that this function of x is a genuine frequency function since it is nonnegative and sums to 1 and that pY|X (y|x) = pY (y) if X and Y are independent. EXAMPLE A We return to the simple discrete distribution considered in Section 3.2, reproducing the table of values for convenience here: y x 0123 0 1 8 2 8 1 8 0 1 0 1 8 2 8 1 8 The conditional frequency function of X given Y = 1is pX|Y (0|1) = 2 8 3 8 = 2 3 pX|Y (1|1) = 1 8 3 8 = 1 3 ■ The definition of the conditional frequency function just given can be reexpressed as pXY(x, y) = pX|Y (x|y)pY (y) (the multiplication law of Chapter 1). This useful equation gives a relationship between the joint and conditional frequency functions. Summing both sides over all values of y, we have an extremely useful application of the law of total probability: pX (x) = y pX|Y (x|y)pY (y) 88 Chapter 3 Joint Distributions EXAMPLE B Suppose that a particle counter is imperfect and independently detects each incoming particle with probability p. If the distribution of the number of incoming particles in a unit of time is a Poisson distribution with parameter λ, what is the distribution of the number of counted particles? Let N denote the true number of particles and X the counted number. From the statement of the problem, the conditional distribution of X given N = n is binomial, with n trials and probability p of success. By the law of total probability, P(X = k) = ∞ n=0 P(N = n)P(X = k|N = n) = ∞ n=k λne−λ n! n k pk(1 − p)n−k = (λp)k k! e−λ ∞ n=k λn−k (1 − p)n−k (n − k)! = (λp)k k! e−λ ∞ j=0 λ j (1 − p) j j! = (λp)k k! e−λeλ(1−p) = (λp)k k! e−λp We see that the distribution of X is a Poisson distribution with parameter λp. This model arises in other applications as well. For example, N might denote the number of traffic accidents in a given time period, with each accident being fatal or nonfatal; X would then be the number of fatal accidents. ■ 3.5.2 The Continuous Case In analogy with the definition in the preceding section, if X and Y are jointly contin- uous random variables, the conditional density of Y given X is defined to be fY|X (y|x) = fXY(x, y) fX (x) if 0 < fX (x)<∞, and 0 otherwise. This definition is in accord with the result to which a differential argument would lead. 
We would define fY|X (y|x) dy as P(y ≤ Y ≤ y + dy|x ≤ X ≤ x + dx) and calculate P(y ≤ Y ≤ y + dy|x ≤ X ≤ x + dx) = fXY(x, y) dx dy fX (x) dx = fXY(x, y) fX (x) dy Note that the rightmost expression is interpreted as a function of y, x being fixed. The numerator is the joint density fXY(x, y), viewed as a function of y for fixed x: you can visualize it as the curve formed by slicing through the joint density function 3.5 Conditional Distributions 89 perpendicular to the x axis. The denominator normalizes that curve to have unit area. The joint density can be expressed in terms of the marginal and conditional densities as follows: fXY(x, y) = fY|X (y|x) fX (x) Integrating both sides over x allows the marginal density of Y to be expressed as fY (y) = ∞ −∞ fY|X (y|x) fX (x) dx which is the law of total probability for the continuous case. EXAMPLE A In Example D in Section 3.3, we saw that fXY(x, y) = λ2e−λy, 0 ≤ x ≤ y fX (x) = λe−λx , x ≥ 0 fY (y) = λ2 ye−λy, y ≥ 0 Let us find the conditional densities. Before doing the formal calculations, it is in- formative to examine the joint density for x and y, respectively, held constant. If x is constant, the joint density decays exponentially in y for y ≥ x;ify is constant, the joint density is constant for 0 ≤ x ≤ y. (See Figure 3.7.) Now let us find the conditional densities according to the preceding definition. First, fY|X (y|x) = λ2e−λy λe−λx = λe−λ(y−x), y ≥ x The conditional density of Y given X = x is exponential on the interval [x, ∞). Expressing the joint density as fXY(x, y) = fY|X (y|x) fX (x) we see that we could generate X and Y according to fXY in the following way: First, generate X as an exponential random variable ( fX ), and then generate Y as another exponential random variable ( fY|X ) on the interval [x, ∞). From this representation, we see that Y may be interpreted as the sum of two independent exponential random variables and that the distribution of this sum is gamma, a fact that we will derive later by a different method. Now, fX|Y (x|y) = λ2e−λy λ2 ye−λy = 1 y , 0 ≤ x ≤ y The conditional density of X given Y = y is uniform on the interval [0, y]. Finally, expressing the joint density as fXY(x, y) = fX|Y (x|y) fY (y) 90 Chapter 3 Joint Distributions we see that alternatively we could generate X and Y according to the density fXY by first generating Y from a gamma density and then generating X uniformly on [0, y]. Another interpretation of this result is that, conditional on the sum of two independent exponential random variables, the first is uniformly distributed. ■ EXAMPLE B Stereology In metallography and other applications of quantitative microscopy, aspects of a three- dimensional structure are deduced from studying two-dimensional cross sections. Concepts of probability and statistics play an important role (DeHoff and Rhines 1968). In particular, the following problem arises. Spherical particles are dispersed in a medium (grains in a metal, for example); the density function of the radii of the spheres can be denoted as fR(r). When the medium is sliced, two-dimensional, circular cross sections of the spheres are observed; let the density function of the radii of these circles be denoted by fX (x). How are these density functions related? x y H x r FIGURE 3.13 A plane slices a sphere of radius r at a distance H from its center, producing a circle of radius x. To derive the relationship, we assume that the cross-sectioning plane is chosen at random, fix R = r, and find the conditional density fX|R(x|r). 
As shown in Figure 3.13, let H denote the distance from the center of the sphere to the planar cross section. By our assumption, H is uniformly distributed on [0,r], and X = √ r 2 − H 2. We can thus find the conditional distribution of X given R = r: FX|R(x|r) = P(X ≤ x) = P( r 2 − H 2 ≤ x) = P(H ≥ r 2 − x2) = 1 − √ r 2 − x2 r , 0 ≤ x ≤ r 3.5 Conditional Distributions 91 Differentiating, we find fX|R(x|r) = x r √ r 2 − x2 , 0 ≤ x ≤ r The marginal density of X is, from the law of total probability, fX (x) = ∞ −∞ fX|R(x|r) fR(r) dr = ∞ x x r √ r 2 − x2 fR(r) dr [The limits of integration are x and ∞ since for r ≤ x, fX|R(x|r) = 0.] This equation is called Abel’s equation. In practice, the marginal density fX can be approximated by making measurements of the radii of cross-sectional circles. Then the problem becomes that of trying to solve for an approximation to fR, since it is the distribution of spherical radii that is of real interest. ■ EXAMPLE C Bivariate Normal Density The conditional density of Y given X is the ratio of the bivariate normal density to a univariate normal density. After some messy algebra, this ratio simplifies to fY|X (y|x) = 1 σY 2π(1 − ρ2) exp ⎛ ⎜⎜⎜⎝−1 2 y − μY − ρ σY σX (x − μX ) 2 σ 2 Y (1 − ρ2) ⎞ ⎟⎟⎟⎠ This is a normal density with mean μY + ρ(x − μX )σY /σX and variance σ 2 Y (1 − ρ2). The conditional distribution of Y given X is a univariate normal distribution. In Example B in Section 2.2.3, the distribution of the velocity of a turbulent wind flow was shown to be approximately normally distributed. Van Atta and Chen (1968) also measured the joint distribution of the velocity at a point at two different times, t and t + τ. Figure 3.14 shows the measured conditional density of the ve- locity, v2, at time t + τ, given various values of v1. There is a systematic departure from the normal distribution. Therefore, it appears that, even though the velocity is normally distributed, the joint distribution of v1 and v2 is not bivariate normal. This should not be totally unexpected, since the relation of v1 and v2 must con- form to equations of motion and continuity, which may not permit a joint normal distribution. ■ Example C illustrates that even when two random variables are marginally nor- mally distributed, they need not be jointly normally distributed. 92 Chapter 3 Joint Distributions .05 2 2 p ( 1, 2 ) 0 .10 .15 .20 .25 .01 3 2 0 2 1 0 0 .04 1 2 3 1 2.4611 2.514 1 1.9811 2.032 1 1.501 1.551 1 1.0181 1.069 1 0.055 .05 0 .10 FIGURE 3.14 The conditional densities of v2 given v1 for selected values of v1, where v1 and v2 are components of the velocity of a turbulent wind flow at different times. The solid lines are the conditional densities according to a normal fit, and the triangles and squares are empirical values determined from 409,600 observations. EXAMPLE D Rejection Method The rejection method is commonly used to generate random variables from a density function, especially when the inverse of the cdf cannot be found in closed form and therefore the inverse cdf method, Proposition D in Section 2.3, cannot be used. Suppose that f is a density function that is nonzero on an interval [a, b] and zero outside the interval (a and b may be infinite). Let M(x) be a function such that M(x) ≥ f (x) on [a, b], and let m(x) = M(x) b a M(x) dx 3.5 Conditional Distributions 93 be a probability density function. 
As we will see, the idea is to choose M so that it is easy to generate random variables from m.If[a, b] is finite, m can be chosen to be the uniform distribution on [a, b]. The algorithm is as follows: Step 1: Generate T with the density m. Step 2: Generate U, uniform on [0, 1] and independent of T .IfM(T )×U ≤ f (T ), then let X = T (accept T ). Otherwise, go to Step 1 (reject T ). See Figure 3.15. From the figure, we can see that a geometrical interpretation of this algorithm is as follows: Throw a dart that lands uniformly in the rectangular region of the figure. If the dart lands below the curve f (x), record its x coordinate; otherwise, reject it. x a y bT M f accept reject FIGURE 3.15 Illustration of the rejection method. We must check that the density function of the random variable X thus obtained is in fact f : P(x ≤ X ≤ x + dx) = P(x ≤ T ≤ x + dx | accept) = P(x ≤ T ≤ x + dx and accept) P(accept) = P(accept|x ≤ T ≤ x + dx)P(x ≤ T ≤ x + dx) P(accept) First consider the numerator of this expression. We have P(accept|x ≤ T ≤ x + dx) = P(U ≤ f (x)/M(x)) = f (x) M(x) so that the numerator is m(x) dx f(x) M(x) = f (x) dx b a M(x) dx From the law of total probability, the denominator is P(accept) = P(U ≤ f (T )/M(T )) = b a f (t) M(t)m(t) dt = 1 b a M(t) dt where the last two steps follow from the definition of m and since f integrates to 1. Finally, we see that the numerator over the denominator is f (x) dx. ■ 94 Chapter 3 Joint Distributions In order for the rejection method to be computationally efficient, the algorithm should lead to acceptance with high probability; otherwise, many rejection steps may have to be looped through for each acceptance. EXAMPLE E Bayesian Inference A freshly minted coin has a certain probability of coming up heads if it is spun on its edge, but that probability is not necessarily equal to 1 2 . Now suppose it is spun n times and comes up heads X times. What has been learned about the chance the coin comes up heads? We will go through a Bayesian treatment of this problem. Let  denote the probability that the coin will come up heads. We represent our knowledge about  before gathering any data by a probability density on [0, 1], called the prior density. If we are totally ignorant about , we might represent our state of knowledge by a uniform density on [0, 1]: f(θ) = 1, 0 ≤ θ ≤ 1. We will see how observing X changes our knowledge about , transforming the prior distribution into a “posterior” distribution. Given a value θ, X follows a binomial distribution with n trials and probability of success θ: fX|(x|θ) = n x θ x (1 − θ)n−x , x = 0, 1,...,n Now  is continuous and X is discrete, and they have a joint probability distribution: f,X (θ, x) = fX|(x|θ)f(θ) = n x θ x (1 − θ)n−x , x = 0, 1,...,n, 0 ≤ θ ≤ 1 This is a density function in θ and a probability mass function in x, an object of a kind we have not seen before. We can calculate the marginal density X by integrating the joint over θ: fX (x) = 1 0 n x θ x (1 − θ)n−x dθ We can calculate this formidable looking integral by a trick. First write n x = n! x!(n − x)! = (n + 1) (x + 1) (n − x + 1) (If k is an integer, (k) = (k − 1)!; see Problem 49 in Chapter 2). 
EXAMPLE E Bayesian Inference
A freshly minted coin has a certain probability of coming up heads if it is spun on its edge, but that probability is not necessarily equal to 1/2. Suppose the coin is spun n times and comes up heads X times. What has been learned about the chance the coin comes up heads? We will go through a Bayesian treatment of this problem.

Let $\Theta$ denote the probability that the coin will come up heads. We represent our knowledge about $\Theta$ before gathering any data by a probability density on [0, 1], called the prior density. If we are totally ignorant about $\Theta$, we might represent our state of knowledge by a uniform density on [0, 1]: $f_\Theta(\theta) = 1$, $0 \le \theta \le 1$. We will see how observing X changes our knowledge about $\Theta$, transforming the prior distribution into a "posterior" distribution.

Given a value θ, X follows a binomial distribution with n trials and probability of success θ:

$$f_{X|\Theta}(x \mid \theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}, \quad x = 0, 1, \ldots, n$$

Now $\Theta$ is continuous and X is discrete, and they have a joint probability distribution:

$$f_{\Theta,X}(\theta, x) = f_{X|\Theta}(x \mid \theta) f_\Theta(\theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}, \quad x = 0, 1, \ldots, n, \quad 0 \le \theta \le 1$$

This is a density function in θ and a probability mass function in x, an object of a kind we have not seen before. We can calculate the marginal density of X by integrating the joint over θ:

$$f_X(x) = \int_0^1 \binom{n}{x} \theta^x (1-\theta)^{n-x}\, d\theta$$

We can calculate this formidable-looking integral by a trick. First write

$$\binom{n}{x} = \frac{n!}{x!(n-x)!} = \frac{\Gamma(n+1)}{\Gamma(x+1)\,\Gamma(n-x+1)}$$

(If k is an integer, $\Gamma(k) = (k-1)!$; see Problem 49 in Chapter 2.) Recall the beta density (Section 2.2.4)

$$g(u) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, u^{a-1}(1-u)^{b-1}, \quad 0 \le u \le 1$$

The fact that this density integrates to 1 tells us that

$$\int_0^1 u^{a-1}(1-u)^{b-1}\, du = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}$$

Thus, identifying u with θ, a − 1 with x, and b − 1 with n − x,

$$f_X(x) = \frac{\Gamma(n+1)}{\Gamma(x+1)\,\Gamma(n-x+1)} \int_0^1 \theta^x (1-\theta)^{n-x}\, d\theta = \frac{\Gamma(n+1)}{\Gamma(x+1)\,\Gamma(n-x+1)} \cdot \frac{\Gamma(x+1)\,\Gamma(n-x+1)}{\Gamma(n+2)} = \frac{1}{n+1}, \quad x = 0, 1, \ldots, n$$

Thus, if our prior on θ is uniform, each outcome of X is a priori equally likely.

Our knowledge about $\Theta$ having observed X = x is quantified in the conditional density of $\Theta$ given X = x:

$$f_{\Theta|X}(\theta \mid x) = \frac{f_{\Theta,X}(\theta, x)}{f_X(x)} = (n+1)\binom{n}{x} \theta^x (1-\theta)^{n-x} = \frac{(n+1)\,\Gamma(n+1)}{\Gamma(x+1)\,\Gamma(n-x+1)}\, \theta^x (1-\theta)^{n-x} = \frac{\Gamma(n+2)}{\Gamma(x+1)\,\Gamma(n-x+1)}\, \theta^x (1-\theta)^{n-x}$$

The relationship $x\Gamma(x) = \Gamma(x+1)$ has been used in the second step (see Problem 49, Chapter 2). Bear in mind that for each fixed x, this is a function of θ, the posterior density of θ given x, which quantifies our opinion about $\Theta$ having observed x heads in n spins. The posterior density is a beta density with parameters a = x + 1, b = n − x + 1.

A one-Euro coin has the number 1 on one face and a bird on the other face. I spun such a coin 20 times; the 1 came up 13 of the 20 times. Using the prior $\Theta \sim U[0, 1]$, the posterior is beta with a = x + 1 = 14 and b = n − x + 1 = 8. Figure 3.16 shows this posterior, which represents my opinion if I was initially totally ignorant of θ and then observed thirteen 1s in 20 spins. From the figure, it is extremely unlikely that θ < 0.25, for example. My probability, or belief, that θ is greater than 1/2 is the area under the density to the right of 1/2, which can be calculated to be 0.91. I can be 91% certain that θ is greater than 1/2.

We need to distinguish between the steps of the preceding probability calculations, which are mathematically straightforward, and the interpretation of the results, which goes beyond the mathematics and requires a model that belief can be expressed in terms of probability and revised using the laws of probability. See Figure 3.16. ■

[FIGURE 3.16 Beta density with parameters a = 14 and b = 8.]
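The posterior probabilities quoted above can be reproduced directly from the beta cdf; a short sketch assuming SciPy is available:

```python
from scipy.stats import beta

# Posterior from the Euro-coin example: uniform prior, 13 heads in 20 spins,
# giving a beta posterior with a = x + 1 = 14 and b = n - x + 1 = 8.
a, b = 14, 8
posterior = beta(a, b)

# Probability that theta > 1/2: area under the posterior right of 1/2.
print(1 - posterior.cdf(0.5))        # ~ 0.91, as in the text
# The event theta < 0.25 is extremely unlikely under this posterior.
print(posterior.cdf(0.25))           # very small
```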
3.6 Functions of Jointly Distributed Random Variables

The distribution of a function of a single random variable was developed in Section 2.3. In this section, that development is extended to several random variables, but first some important special cases are considered.

3.6.1 Sums and Quotients

Suppose that X and Y are discrete random variables taking values on the integers and having the joint frequency function p(x, y), and let Z = X + Y. To find the frequency function of Z, we note that Z = z whenever X = x and Y = z − x, where x is an integer. The probability that Z = z is thus the sum over all x of these joint probabilities, or

$$p_Z(z) = \sum_{x=-\infty}^{\infty} p(x, z-x)$$

If X and Y are independent, so that $p(x, y) = p_X(x) p_Y(y)$, then

$$p_Z(z) = \sum_{x=-\infty}^{\infty} p_X(x)\, p_Y(z-x)$$

This sum is called the convolution of the sequences $p_X$ and $p_Y$.

The continuous case is very similar. Supposing that X and Y are continuous random variables, we first find the cdf of Z and then differentiate to find the density. Since Z ≤ z whenever the point (X, Y) is in the shaded region $R_z$ shown in Figure 3.17, we have

$$F_Z(z) = \iint_{R_z} f(x, y)\, dx\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{z-x} f(x, y)\, dy\, dx$$

[FIGURE 3.17 X + Y ≤ z whenever (X, Y) is in the shaded region $R_z$, the half-plane below the line x + y = z.]

In the inner integral, we make the change of variables y = v − x to obtain

$$F_Z(z) = \int_{-\infty}^{\infty} \int_{-\infty}^{z} f(x, v-x)\, dv\, dx = \int_{-\infty}^{z} \int_{-\infty}^{\infty} f(x, v-x)\, dx\, dv$$

Differentiating, we have, if $\int_{-\infty}^{\infty} f(x, z-x)\, dx$ is continuous at z,

$$f_Z(z) = \int_{-\infty}^{\infty} f(x, z-x)\, dx$$

which is the obvious analogue of the result for the discrete case. If X and Y are independent,

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z-x)\, dx$$

This integral is called the convolution of the functions $f_X$ and $f_Y$.

EXAMPLE A
Suppose that the lifetime of a component is exponentially distributed and that an identical and independent backup component is available. The system operates as long as one of the components is functional; therefore, the distribution of the life of the system is that of the sum of two independent exponential random variables. Let T1 and T2 be independent exponentials with parameter λ, and let S = T1 + T2. Then

$$f_S(s) = \int_0^s \lambda e^{-\lambda t}\, \lambda e^{-\lambda(s-t)}\, dt$$

It is important to note the limits of integration: beyond these limits, one of the two component densities is zero. When dealing with densities that are nonzero only on some subset of the real line, we must always be careful. Continuing, we have

$$f_S(s) = \lambda^2 \int_0^s e^{-\lambda s}\, dt = \lambda^2 s e^{-\lambda s}$$

This is a gamma distribution with parameters 2 and λ (compare with Example A in Section 3.5.2). ■

Let us next consider the quotient of two continuous random variables. The derivation is very similar to that for the sum of such variables, given previously: We first find the cdf and then differentiate to find the density. Suppose that X and Y are continuous with joint density function f and that Z = Y/X. Then $F_Z(z) = P(Z \le z)$ is the probability of the set of (x, y) such that y/x ≤ z. If x > 0, this is the set y ≤ xz; if x < 0, it is the set y ≥ xz. Thus,

$$F_Z(z) = \int_{-\infty}^{0} \int_{xz}^{\infty} f(x, y)\, dy\, dx + \int_0^{\infty} \int_{-\infty}^{xz} f(x, y)\, dy\, dx$$

To remove the dependence of the inner integrals on x, we make the change of variables y = xv in the inner integrals and obtain

$$F_Z(z) = \int_{-\infty}^{0} \int_{z}^{-\infty} x f(x, xv)\, dv\, dx + \int_0^{\infty} \int_{-\infty}^{z} x f(x, xv)\, dv\, dx = \int_{-\infty}^{0} \int_{-\infty}^{z} (-x) f(x, xv)\, dv\, dx + \int_0^{\infty} \int_{-\infty}^{z} x f(x, xv)\, dv\, dx = \int_{-\infty}^{z} \int_{-\infty}^{\infty} |x|\, f(x, xv)\, dx\, dv$$

Finally, differentiating (again under an assumption of continuity), we find

$$f_Z(z) = \int_{-\infty}^{\infty} |x|\, f(x, xz)\, dx$$

In particular, if X and Y are independent,

$$f_Z(z) = \int_{-\infty}^{\infty} |x|\, f_X(x)\, f_Y(xz)\, dx$$

EXAMPLE B
Suppose that X and Y are independent standard normal random variables and that Z = Y/X. We then have

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{|x|}{2\pi}\, e^{-x^2/2} e^{-x^2 z^2/2}\, dx$$

From the symmetry of the integrand about zero,

$$f_Z(z) = \frac{1}{\pi} \int_0^{\infty} x e^{-x^2 (z^2+1)/2}\, dx$$

To simplify this, we make the change of variables $u = x^2$ to obtain

$$f_Z(z) = \frac{1}{2\pi} \int_0^{\infty} e^{-u(z^2+1)/2}\, du$$

Next, using the fact that $\int_0^\infty \lambda e^{-\lambda x}\, dx = 1$ with $\lambda = (z^2+1)/2$, we get

$$f_Z(z) = \frac{1}{\pi(z^2+1)}, \quad -\infty < z < \infty$$

This density is called the Cauchy density. Like the standard normal density, the Cauchy density is symmetric about zero and bell-shaped, but the tails of the Cauchy tend to zero very slowly compared to the tails of the normal. This can be interpreted as being because of a substantial probability that X in the quotient Y/X is near zero. ■

Example B indicates one method of generating Cauchy random variables: we can generate independent standard normal random variables and form their quotient. The next section shows how to generate standard normals.
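Example A lends itself to a quick numerical check: the simulated sum of two independent exponentials should match the gamma(2, λ) distribution. A sketch assuming NumPy and SciPy, with λ = 0.5 chosen only for illustration:

```python
import numpy as np
from scipy.stats import gamma, kstest

rng = np.random.default_rng(2)
lam = 0.5                      # illustrative rate parameter
t1 = rng.exponential(1 / lam, 200_000)
t2 = rng.exponential(1 / lam, 200_000)
s = t1 + t2

# S should be gamma with shape 2 and rate lam (scale 1/lam for scipy).
print(s.mean(), 2 / lam)       # both ~ 4
print(kstest(s, gamma(a=2, scale=1 / lam).cdf).statistic)  # ~ 0
```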
3.6.2 The General Case

The following example illustrates the concepts that are important to the general case of functions of several random variables and is also interesting in its own right.

EXAMPLE A
Suppose that X and Y are independent standard normal random variables, which means that their joint distribution is the standard bivariate normal distribution:

$$f_{XY}(x, y) = \frac{1}{2\pi}\, e^{-(x^2/2) - (y^2/2)}$$

We change to polar coordinates and then re-express the density in this new coordinate system ($R \ge 0$, $0 \le \Theta \le 2\pi$):

$$R = \sqrt{X^2 + Y^2}$$

$$\Theta = \begin{cases} \tan^{-1}(Y/X), & \text{if } X > 0 \\ \tan^{-1}(Y/X) + \pi, & \text{if } X < 0 \\ \frac{\pi}{2}\,\mathrm{sgn}(Y), & \text{if } X = 0,\ Y \ne 0 \\ 0, & \text{if } X = 0,\ Y = 0 \end{cases}$$

(The range of the inverse tangent function is taken to be $-\frac{\pi}{2} < \theta < \frac{\pi}{2}$.) The inverse transformation is

$$X = R\cos\Theta, \qquad Y = R\sin\Theta$$

The joint density of R and $\Theta$ is

$$f_{R\Theta}(r, \theta)\, dr\, d\theta = P(r \le R \le r + dr,\ \theta \le \Theta \le \theta + d\theta)$$

This probability is equal to the area of the shaded patch in Figure 3.18 times $f_{XY}[x(r,\theta), y(r,\theta)]$. The area in question is clearly $r\, dr\, d\theta$, so

$$P(r \le R \le r + dr,\ \theta \le \Theta \le \theta + d\theta) = f_{XY}(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta$$

and

$$f_{R\Theta}(r, \theta) = r f_{XY}(r\cos\theta, r\sin\theta)$$

[FIGURE 3.18 The area of the shaded patch is r dr dθ.]

Thus,

$$f_{R\Theta}(r, \theta) = \frac{r}{2\pi}\, e^{-(r^2\cos^2\theta)/2 - (r^2\sin^2\theta)/2} = \frac{1}{2\pi}\, r e^{-r^2/2}$$

From this, we see that the joint density factors, implying that R and $\Theta$ are independent random variables, that $\Theta$ is uniform on [0, 2π], and that R has the density

$$f_R(r) = r e^{-r^2/2}, \quad r \ge 0$$

which is called the Rayleigh density.

An interesting relationship can be found by changing variables again, letting $T = R^2$. Using the standard techniques for finding the density of a function of a single random variable, we obtain

$$f_T(t) = \frac{1}{2}\, e^{-t/2}, \quad t \ge 0$$

This is an exponential distribution with parameter 1/2. Because R and $\Theta$ are independent, so are T and $\Theta$, and the joint density of the latter pair is

$$f_{T\Theta}(t, \theta) = \frac{1}{2\pi} \cdot \frac{1}{2}\, e^{-t/2}$$

We have thus arrived at a characterization of the standard bivariate normal distribution: $\Theta$ is uniform on [0, 2π], and $R^2$ is exponential with parameter 1/2. (Also, from Example B in Section 3.6.1, $\tan\Theta$ follows a Cauchy distribution.)

These relationships can be used to construct an algorithm for generating standard normal random variables, which is quite useful since $\Phi$, the cdf, and $\Phi^{-1}$ cannot be expressed in closed form. First, generate $U_1$ and $U_2$, which are independent and uniform on [0, 1]. Then $-2\log U_1$ is exponential with parameter 1/2, and $2\pi U_2$ is uniform on [0, 2π]. It follows that

$$X = \sqrt{-2\log U_1}\, \cos(2\pi U_2) \quad \text{and} \quad Y = \sqrt{-2\log U_1}\, \sin(2\pi U_2)$$

are independent standard normal random variables. This method of generating normally distributed random variables is sometimes called the polar method. ■

For the general case, suppose that X and Y are jointly distributed continuous random variables, that X and Y are mapped onto U and V by the transformation

$$u = g_1(x, y), \qquad v = g_2(x, y)$$

and that the transformation can be inverted to obtain

$$x = h_1(u, v), \qquad y = h_2(u, v)$$

Assume that $g_1$ and $g_2$ have continuous partial derivatives and that the Jacobian

$$J(x, y) = \det \begin{bmatrix} \dfrac{\partial g_1}{\partial x} & \dfrac{\partial g_1}{\partial y} \\[2mm] \dfrac{\partial g_2}{\partial x} & \dfrac{\partial g_2}{\partial y} \end{bmatrix} = \frac{\partial g_1}{\partial x}\frac{\partial g_2}{\partial y} - \frac{\partial g_2}{\partial x}\frac{\partial g_1}{\partial y} \ne 0$$

for all x and y. This leads directly to the following result.

PROPOSITION A
Under the assumptions just stated, the joint density of U and V is

$$f_{UV}(u, v) = f_{XY}\big(h_1(u, v), h_2(u, v)\big)\, \big|J^{-1}\big(h_1(u, v), h_2(u, v)\big)\big|$$

for (u, v) such that $u = g_1(x, y)$ and $v = g_2(x, y)$ for some (x, y), and 0 elsewhere. ■

We will not prove Proposition A here.
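The polar method translates directly into code; a minimal sketch with NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)

def polar_normals(n):
    """Generate n pairs of independent standard normals by the polar method."""
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    r = np.sqrt(-2 * np.log(u1))      # R, since R^2 is exponential(1/2)
    theta = 2 * np.pi * u2            # Theta, uniform on [0, 2*pi]
    return r * np.cos(theta), r * np.sin(theta)

x, y = polar_normals(500_000)
print(x.mean(), x.std())              # ~ 0, ~ 1
print(np.corrcoef(x, y)[0, 1])        # ~ 0: X and Y are uncorrelated
```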
It follows from the formula established in advanced calculus for a change of variables in multiple integrals. The essential elements of the proof follow the discussion in Example A.

EXAMPLE B
To illustrate the formalism, let us redo Example A. The roles of u and v are played by r and θ:

$$r = \sqrt{x^2 + y^2}, \qquad \theta = \tan^{-1}\frac{y}{x}$$

The inverse transformation is $x = r\cos\theta$, $y = r\sin\theta$. After some algebra, we obtain the partial derivatives:

$$\frac{\partial r}{\partial x} = \frac{x}{\sqrt{x^2+y^2}}, \quad \frac{\partial r}{\partial y} = \frac{y}{\sqrt{x^2+y^2}}, \quad \frac{\partial \theta}{\partial x} = \frac{-y}{x^2+y^2}, \quad \frac{\partial \theta}{\partial y} = \frac{x}{x^2+y^2}$$

The Jacobian is the determinant of the matrix of these expressions, or

$$J(x, y) = \frac{1}{\sqrt{x^2+y^2}} = \frac{1}{r}$$

Proposition A therefore says that

$$f_{R\Theta}(r, \theta) = r f_{XY}(r\cos\theta, r\sin\theta)$$

for $r \ge 0$, $0 \le \theta \le 2\pi$, and 0 elsewhere, which is the same as the result we obtained by a direct argument in Example A. ■

Proposition A extends readily to transformations of more than two random variables. If $X_1, \ldots, X_n$ have the joint density function $f_{X_1\cdots X_n}$ and

$$Y_i = g_i(X_1, \ldots, X_n), \qquad X_i = h_i(Y_1, \ldots, Y_n), \qquad i = 1, \ldots, n$$

and if $J(x_1, \ldots, x_n)$ is the determinant of the matrix with ij entry $\partial g_i/\partial x_j$, then the joint density of $Y_1, \ldots, Y_n$ is

$$f_{Y_1\cdots Y_n}(y_1, \ldots, y_n) = f_{X_1\cdots X_n}(x_1, \ldots, x_n)\, |J^{-1}(x_1, \ldots, x_n)|$$

wherein each $x_i$ is expressed in terms of the y's: $x_i = h_i(y_1, \ldots, y_n)$.

EXAMPLE C
Suppose that $X_1$ and $X_2$ are independent standard normal random variables and that

$$Y_1 = X_1, \qquad Y_2 = X_1 + X_2$$

We will show that the joint distribution of $Y_1$ and $Y_2$ is bivariate normal. The Jacobian of the transformation is simply

$$J(x_1, x_2) = \det \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} = 1$$

Since the inverse transformation is $x_1 = y_1$ and $x_2 = y_2 - y_1$, from Proposition A the joint density of $Y_1$ and $Y_2$ is

$$f_{Y_1 Y_2}(y_1, y_2) = \frac{1}{2\pi} \exp\left[-\frac{1}{2}\left(y_1^2 + (y_2 - y_1)^2\right)\right] = \frac{1}{2\pi} \exp\left[-\frac{1}{2}\left(2y_1^2 + y_2^2 - 2y_1 y_2\right)\right]$$

This can be recognized to be a bivariate normal density, the parameters of which can be identified by comparing the constants in this expression with the general form of the bivariate normal (see Example F of Section 3.3). First, since the exponential contains only quadratic terms in $y_1$ and $y_2$, we have $\mu_{Y_1} = \mu_{Y_2} = 0$. (If $\mu_{Y_1}$ were nonzero, for example, examination of the equation for the bivariate density in Example F of Section 3.3 shows that there would be a term $y_1 \mu_{Y_1}$.) Next, from the constant that occurs in front of the exponential, we have

$$\sigma_{Y_1} \sigma_{Y_2} \sqrt{1 - \rho^2} = 1$$

From the coefficient of $y_1^2$ we have

$$\sigma_{Y_1}^2 (1 - \rho^2) = \frac{1}{2}$$

Dividing the second relationship into the square of the first gives $\sigma_{Y_2}^2 = 2$. From the coefficient of $y_2^2$, we have

$$\sigma_{Y_2}^2 (1 - \rho^2) = 1$$

from which it follows that $\rho^2 = \frac{1}{2}$. From the sign of the cross-product term, we see that $\rho = 1/\sqrt{2}$. Finally, we have $\sigma_{Y_1}^2 = 1$.

We thus see that this linear transformation of two independent standard normal random variables follows a bivariate normal distribution. This is a special case of a more general result: A nonsingular linear transformation of two random variables whose joint distribution is bivariate normal yields two random variables whose joint distribution is still bivariate normal, although with different parameters. (See Problem 58.) ■
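The parameters derived in Example C (variances 1 and 2, $\rho = 1/\sqrt{2} \approx 0.7071$) can be confirmed by simulating the transformation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y1, y2 = x1, x1 + x2               # the transformation of Example C

# Derived parameters: var(Y1) = 1, var(Y2) = 2, rho = 1/sqrt(2).
print(y1.var(), y2.var())          # ~ 1, ~ 2
print(np.corrcoef(y1, y2)[0, 1])   # ~ 0.7071
```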
3.7 Extrema and Order Statistics

This section is concerned with ordering a collection of independent continuous random variables. In particular, let us assume that $X_1, X_2, \ldots, X_n$ are independent random variables with the common cdf F and density f. Let U denote the maximum of the $X_i$ and V the minimum. The cdfs of U and V, and therefore their densities, can be found by a simple trick. First, we note that U ≤ u if and only if $X_i \le u$ for all i. Thus,

$$F_U(u) = P(U \le u) = P(X_1 \le u)\, P(X_2 \le u) \cdots P(X_n \le u) = [F(u)]^n$$

Differentiating, we find the density

$$f_U(u) = n f(u)\, [F(u)]^{n-1}$$

Similarly, V ≥ v if and only if $X_i \ge v$ for all i. Thus,

$$1 - F_V(v) = [1 - F(v)]^n \qquad \text{and} \qquad F_V(v) = 1 - [1 - F(v)]^n$$

The density function of V is therefore

$$f_V(v) = n f(v)\, [1 - F(v)]^{n-1}$$

EXAMPLE A
Suppose that n system components are connected in series, which means that the system fails if any one of them fails, and that the lifetimes of the components, $T_1, \ldots, T_n$, are independent random variables that are exponentially distributed with parameter λ: $F(t) = 1 - e^{-\lambda t}$. The random variable that represents the length of time the system operates is V, the minimum of the $T_i$, which by the preceding result has the density

$$f_V(v) = n\lambda e^{-\lambda v} (e^{-\lambda v})^{n-1} = n\lambda e^{-n\lambda v}$$

We see that V is exponentially distributed with parameter nλ. ■

EXAMPLE B
Suppose that a system has components as described in Example A but connected in parallel, which means that the system fails only when they all fail. The system's lifetime is thus the maximum of n exponential random variables and has the density

$$f_U(u) = n\lambda e^{-\lambda u} (1 - e^{-\lambda u})^{n-1}$$

By expanding the last term using the binomial theorem, we see that this density is a weighted sum of exponential terms rather than a simple exponential density. ■

We will now derive the preceding results once more, by the differential technique, and generalize them. To find $f_U(u)$, we observe that u ≤ U ≤ u + du if one of the n $X_i$ falls in the interval (u, u + du) and the other (n − 1) $X_i$ fall to the left of u. The probability of any particular such arrangement is $[F(u)]^{n-1} f(u)\, du$, and because there are n such arrangements,

$$f_U(u) = n [F(u)]^{n-1} f(u)$$

Now we again assume that $X_1, \ldots, X_n$ are independent continuous random variables with density f(x). We sort the $X_i$ and denote by

$$X_{(1)} < X_{(2)} < \cdots < X_{(n)}$$

the order statistics. Note that $X_1$ is not necessarily equal to $X_{(1)}$. (In fact, this equality holds with probability $n^{-1}$.) Thus, $X_{(n)}$ is the maximum, and $X_{(1)}$ is the minimum. If n is odd, say n = 2m + 1, then $X_{(m+1)}$ is called the median of the $X_i$.

THEOREM A
The density of $X_{(k)}$, the kth-order statistic, is

$$f_k(x) = \frac{n!}{(k-1)!\,(n-k)!}\, f(x)\, F^{k-1}(x)\, [1 - F(x)]^{n-k}$$

Proof
We will use a differential argument to derive this result heuristically. (The alternative approach of first deriving the cdf and then differentiating is developed in Problem 76 at the end of this chapter.) The event $x \le X_{(k)} \le x + dx$ occurs if k − 1 observations are less than x, one observation is in the interval [x, x + dx], and n − k observations are greater than x + dx. The probability of any particular arrangement of this type is $f(x) F^{k-1}(x) [1 - F(x)]^{n-k}\, dx$, and, by the multinomial theorem, there are $n!/[(k-1)!\,1!\,(n-k)!]$ such arrangements, which completes the argument. ■

EXAMPLE C
For the case where the $X_i$ are uniform on [0, 1], the density of the kth-order statistic reduces to

$$f_k(x) = \frac{n!}{(k-1)!\,(n-k)!}\, x^{k-1} (1-x)^{n-k}, \quad 0 \le x \le 1$$

This is a beta density. An interesting by-product of this result is that since the density integrates to 1,

$$\int_0^1 x^{k-1}(1-x)^{n-k}\, dx = \frac{(k-1)!\,(n-k)!}{n!}$$

■
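Theorem A and Example C can be checked empirically: the kth order statistic of n uniforms should follow a beta(k, n − k + 1) density. A sketch assuming NumPy and SciPy (the values n = 10, k = 3 are illustrative):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(5)
n, k = 10, 3                      # illustrative sample size and order
samples = rng.uniform(size=(100_000, n))
kth = np.sort(samples, axis=1)[:, k - 1]   # k-th order statistic of each row

# Example C: the k-th order statistic of n uniforms is beta(k, n - k + 1).
print(kth.mean(), beta(k, n - k + 1).mean())   # both ~ k/(n+1) = 0.2727
print(kth.var(), beta(k, n - k + 1).var())
```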
Joint distributions of order statistics can also be worked out. For example, to find the joint density of the minimum and maximum, we note that $x \le X_{(1)} \le x + dx$ and $y \le X_{(n)} \le y + dy$ if one $X_i$ falls in [x, x + dx], one falls in [y, y + dy], and n − 2 fall in [x, y]. There are n(n − 1) ways to choose the minimum and maximum, and thus $V = X_{(1)}$ and $U = X_{(n)}$ have the joint density

$$f(u, v) = n(n-1)\, f(v)\, f(u)\, [F(u) - F(v)]^{n-2}, \quad u \ge v$$

For example, for the uniform case,

$$f(u, v) = n(n-1)(u - v)^{n-2}, \quad 1 \ge u \ge v \ge 0$$

The range of $X_{(1)}, \ldots, X_{(n)}$ is $R = X_{(n)} - X_{(1)}$. Using the same kind of analysis we used in Section 3.6.1 to derive the distribution of a sum, we find

$$f_R(r) = \int_{-\infty}^{\infty} f(v + r, v)\, dv$$

EXAMPLE D
Find the distribution of the range, U − V, for the uniform [0, 1] case. The integrand is $f(v + r, v) = n(n-1) r^{n-2}$ for $0 \le v \le v + r \le 1$ or, equivalently, $0 \le v \le 1 - r$. Thus,

$$f_R(r) = \int_0^{1-r} n(n-1) r^{n-2}\, dv = n(n-1) r^{n-2} (1 - r), \quad 0 \le r \le 1$$

The corresponding cdf is

$$F_R(r) = n r^{n-1} - (n-1) r^n, \quad 0 \le r \le 1$$

■

EXAMPLE E Tolerance Interval
If a large number of independent random variables having the common density function f are observed, it seems intuitively likely that most of the probability mass of the density f(x) is contained in the interval $(X_{(1)}, X_{(n)})$ and unlikely that a future observation will lie outside this interval. In fact, very precise statements can be made. For example, the amount of the probability mass in the interval is $F(X_{(n)}) - F(X_{(1)})$, a random variable that we will denote by Q. From Proposition C of Section 2.3, the distribution of $F(X_i)$ is uniform; therefore, the distribution of Q is the distribution of $U_{(n)} - U_{(1)}$, which is the range of n independent uniform random variables. Thus P(Q > α), the probability that more than 100α% of the probability mass is contained in the range, is, from Example D,

$$P(Q > \alpha) = 1 - n\alpha^{n-1} + (n-1)\alpha^n$$

For example, if n = 100 and α = .95, this probability is .96. In words, this means that the probability is .96 that the range of 100 independent random variables covers 95% or more of the probability mass; or, with probability .96, 95% of all further observations from the same distribution will fall between the minimum and maximum. This statement does not depend on the actual form of the distribution. ■
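The tolerance-interval formula is easy to evaluate and to verify by Monte Carlo; a sketch reproducing the n = 100, α = .95 calculation:

```python
import numpy as np

# P(Q > alpha) = 1 - n*alpha**(n-1) + (n-1)*alpha**n  (Example E)
def coverage_prob(n, alpha):
    return 1 - n * alpha**(n - 1) + (n - 1) * alpha**n

print(coverage_prob(100, 0.95))     # ~ 0.96, as in the text

# Monte Carlo check: fraction of samples whose uniform range exceeds 0.95.
rng = np.random.default_rng(6)
u = rng.uniform(size=(100_000, 100))
q = u.max(axis=1) - u.min(axis=1)
print((q > 0.95).mean())            # ~ 0.96
```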
3.8 Problems

1. The joint frequency function of two discrete random variables, X and Y, is given in the following table:

   x\y    1     2     3     4
    1    .10   .05   .02   .02
    2    .05   .20   .05   .02
    3    .02   .05   .20   .04
    4    .02   .02   .04   .10

a. Find the marginal frequency functions of X and Y.
b. Find the conditional frequency function of X given Y = 1 and of Y given X = 1.

2. An urn contains p black balls, q white balls, and r red balls; and n balls are chosen without replacement.
a. Find the joint distribution of the numbers of black, white, and red balls in the sample.
b. Find the joint distribution of the numbers of black and white balls in the sample.
c. Find the marginal distribution of the number of white balls in the sample.

3. Three players play 10 independent rounds of a game, and each player has probability 1/3 of winning each round. Find the joint distribution of the numbers of games won by each of the three players.

4. A sieve is made of a square mesh of wires. Each wire has diameter d, and the holes in the mesh are squares whose side length is w. A spherical particle of radius r is dropped on the mesh. What is the probability that it passes through? What is the probability that it fails to pass through if it is dropped n times? (Calculations such as these are relevant to the theory of sieving for analyzing the size distribution of particulate matter.)

5. (Buffon's Needle Problem) A needle of length L is dropped randomly on a plane ruled with parallel lines that are a distance D apart, where D ≥ L. Show that the probability that the needle comes to rest crossing a line is 2L/(πD). Explain how this gives a mechanical means of estimating the value of π.

6. A point is chosen randomly in the interior of an ellipse:
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$$
Find the marginal densities of the x and y coordinates of the point.

7. Find the joint and marginal densities corresponding to the cdf
$$F(x, y) = (1 - e^{-\alpha x})(1 - e^{-\beta y}), \quad x \ge 0,\ y \ge 0,\ \alpha > 0,\ \beta > 0$$

8. Let X and Y have the joint density
$$f(x, y) = \frac{6}{7}(x + y)^2, \quad 0 \le x \le 1,\ 0 \le y \le 1$$
a. By integrating over the appropriate regions, find (i) P(X > Y), (ii) P(X + Y ≤ 1), (iii) P(X ≤ 1/2).
b. Find the marginal densities of X and Y.
c. Find the two conditional densities.

9. Suppose that (X, Y) is uniformly distributed over the region defined by $0 \le y \le 1 - x^2$ and $-1 \le x \le 1$.
a. Find the marginal densities of X and Y.
b. Find the two conditional densities.

10. A point is uniformly distributed in a unit sphere in three dimensions.
a. Find the marginal densities of the x, y, and z coordinates.
b. Find the joint density of the x and y coordinates.
c. Find the density of the x and y coordinates conditional on Z = 0.

11. Let $U_1$, $U_2$, and $U_3$ be independent random variables uniform on [0, 1]. Find the probability that the roots of the quadratic $U_1 x^2 + U_2 x + U_3$ are real.

12. Let
$$f(x, y) = c(x^2 - y^2)e^{-x}, \quad 0 \le x < \infty,\ -x \le y < x$$
a. Find c.
b. Find the marginal densities.
c. Find the conditional densities.

13. A fair coin is thrown once; if it lands heads up, it is thrown a second time. Find the frequency function of the total number of heads.

14. Suppose that
$$f(x, y) = xe^{-x(y+1)}, \quad 0 \le x < \infty,\ 0 \le y < \infty$$
a. Find the marginal densities of X and Y. Are X and Y independent?
b. Find the conditional densities of X and Y.

15. Suppose that X and Y have the joint density function
$$f(x, y) = c\sqrt{1 - x^2 - y^2}, \quad x^2 + y^2 \le 1$$
a. Find c.
b. Sketch the joint density.
c. Find $P(X^2 + Y^2 \le \frac{1}{2})$.
d. Find the marginal densities of X and Y. Are X and Y independent random variables?
e. Find the conditional densities.

16. What is the probability density of the time between the arrival of the two packets of Example E in Section 3.4?

17. Let (X, Y) be a random point chosen uniformly on the region $R = \{(x, y) : |x| + |y| \le 1\}$.
a. Sketch R.
b. Find the marginal densities of X and Y using your sketch. Be careful of the range of integration.
c. Find the conditional density of Y given X.

18. Let X and Y have the joint density function
$$f(x, y) = k(x - y), \quad 0 \le y \le x \le 1$$
and 0 elsewhere.
a. Sketch the region over which the density is positive and use it in determining limits of integration to answer the following questions.
b. Find k.
c. Find the marginal densities of X and Y.
d. Find the conditional densities of Y given X and X given Y.

19. Suppose that two components have independent exponentially distributed lifetimes, $T_1$ and $T_2$, with parameters α and β, respectively. Find (a) $P(T_1 > T_2)$ and (b) $P(T_1 > 2T_2)$.

20. If $X_1$ is uniform on [0, 1], and, conditional on $X_1$, $X_2$ is uniform on [0, $X_1$], find the joint and marginal distributions of $X_1$ and $X_2$.
21. An instrument is used to measure very small concentrations, X, of a certain chemical in soil samples. Suppose that the values of X in those soils in which the chemical is present are modeled as a random variable with density function f(x). The assay of a soil reports a concentration only if the chemical is first determined to be present. At very low concentrations, however, the chemical may fail to be detected even if it is present. This phenomenon is modeled by assuming that if the concentration is x, the chemical is detected with probability R(x). Let Y denote the concentration of a chemical in a soil in which it has been determined to be present. Show that the density function of Y is
$$g(y) = \frac{R(y)\, f(y)}{\int_0^\infty R(x)\, f(x)\, dx}$$

22. Consider a Poisson process on the real line, and denote by $N(t_1, t_2)$ the number of events in the interval $(t_1, t_2)$. If $t_0 < t_1 < t_2$, find the conditional distribution of $N(t_0, t_1)$ given that $N(t_0, t_2) = n$. (Hint: Use the fact that the numbers of events in disjoint subsets are independent.)

23. Suppose that, conditional on N, X has a binomial distribution with N trials and probability p of success, and that N is a binomial random variable with m trials and probability r of success. Find the unconditional distribution of X.

24. Let P have a uniform distribution on [0, 1], and, conditional on P = p, let X have a Bernoulli distribution with parameter p. Find the conditional distribution of P given X.

25. Let X have the density function f, and let Y = X with probability 1/2 and Y = −X with probability 1/2. Show that the density of Y is symmetric about zero; that is, $f_Y(y) = f_Y(-y)$.

26. Spherical particles whose radii have the density function $f_R(r)$ are dropped on a mesh as in Problem 4. Find an expression for the density function of the particles that pass through.

27. Prove that X and Y are independent if and only if $f_{X|Y}(x|y) = f_X(x)$ for all x and y.

28. Show that C(u, v) = uv is a copula. Why is it called "the independence copula"?

29. Use the Farlie-Morgenstern copula to construct a bivariate density whose marginal densities are exponential. Find an expression for the joint density.

30. For 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1, show that $C(u, v) = \min(u^{1-\alpha} v,\ u v^{1-\beta})$ is a copula (the Marshall-Olkin copula). What is the joint density?

31. Suppose that (X, Y) is uniform on the disk of radius 1 as in Example E of Section 3.3. Without doing any calculations, argue that X and Y are not independent.

32. Continuing Example E of Section 3.5.2, suppose you had to guess a value of θ. One plausible guess would be the value of θ that maximizes the posterior density. Find that value. Does the result make intuitive sense?

33. Suppose that, as in Example E of Section 3.5.2, your prior opinion that the coin will land with heads up is represented by a uniform density on [0, 1]. You now spin the coin repeatedly and record the number of times, N, until a heads comes up. So if heads comes up on the first spin, N = 1, etc.
a. Find the posterior density of $\Theta$ given N.
b. Do this with a newly minted penny and graph the posterior density.

34. This problem continues Example E of Section 3.5.2. In that example, the prior opinion for the value of $\Theta$ was represented by the uniform density. Suppose that the prior density had been a beta density with parameters a = b = 3, reflecting a stronger prior belief that the chance of a 1 was near 1/2. Graph this prior density. Following the reasoning of the example, find the posterior density, plot it, and compare it to the posterior density shown in the example.

35. Find a newly minted penny. Place it on its edge and spin it 20 times. Following Example E of Section 3.5.2, calculate and graph the posterior distribution. Spin another 20 times, and calculate and graph the posterior based on all 40 spins. What happens as you increase the number of spins?
36. Let f(x) = (1 + αx)/2, for −1 ≤ x ≤ 1 and −1 ≤ α ≤ 1.
a. Describe an algorithm to generate random variables from this density using the rejection method.
b. Write a computer program to do so, and test it out.

37. Let $f(x) = 6x^2(1 - x)^2$, for −1 ≤ x ≤ 1.
a. Describe an algorithm to generate random variables from this density using the rejection method. In what proportion of the trials will the acceptance step be taken?
b. Write a computer program to do so, and test it out.

38. Show that the number of iterations necessary to generate a random variable using the rejection method is a geometric random variable, and evaluate the parameter of the geometric frequency function. Show that in order to keep the number of iterations small, M(x) should be chosen to be close to f(x).

39. Show that the following method of generating discrete random variables works (D. R. Fredkin). Suppose, for concreteness, that X takes on values 0, 1, 2, ... with probabilities $p_0, p_1, p_2, \ldots$. Let U be a uniform random variable. If $U < p_0$, return X = 0. If not, replace U by $U - p_0$, and if the new U is less than $p_1$, return X = 1. If not, decrement U by $p_1$, compare U to $p_2$, etc.

40. Suppose that X and Y are discrete random variables with a joint probability mass function $p_{XY}(x, y)$. Show that the following procedure generates a random variable $X \sim p_{X|Y}(x|y)$.
a. Generate $X \sim p_X(x)$.
b. Accept X with probability p(y|X).
c. If X is accepted, terminate and return X. Otherwise go to Step a.
Now suppose that X is uniformly distributed on the integers 1, 2, ..., 100 and that, given X = x, Y is uniform on the integers 1, 2, ..., x. You observe Y = 44. What does this tell you about X? Simulate the distribution of X given Y = 44, 1000 times, and make a histogram of the values obtained. How would you estimate E(X|Y = 44)?

41. How could you extend the procedure of the previous problem to the case in which X and Y are continuous random variables?

42. a. Let T be an exponential random variable with parameter λ; let W be a random variable independent of T, which is ±1 with probability 1/2 each; and let X = WT. Show that the density of X is
$$f_X(x) = \frac{\lambda}{2}\, e^{-\lambda |x|}$$
which is called the double exponential density.
b. Show that for some constant c,
$$\frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \le c e^{-|x|}$$
Use this result and that of part (a) to show how to use the rejection method to generate random variables from a standard normal density.

43. Let $U_1$ and $U_2$ be independent and uniform on [0, 1]. Find and sketch the density function of $S = U_1 + U_2$.

44. Let $N_1$ and $N_2$ be independent random variables following Poisson distributions with parameters $\lambda_1$ and $\lambda_2$. Show that the distribution of $N = N_1 + N_2$ is Poisson with parameter $\lambda_1 + \lambda_2$.

45. For a Poisson distribution, suppose that events are independently labeled A and B with probabilities $p_A + p_B = 1$. If the parameter of the Poisson distribution is λ, show that the number of events labeled A follows a Poisson distribution with parameter $p_A \lambda$.

46. Let X and Y be jointly continuous random variables. Find an expression for the density of Z = X − Y.

47. Let X and Y be independent standard normal random variables. Find the density of Z = X + Y, and show that Z is normally distributed as well. (Hint: Use the technique of completing the square to help in evaluating the integral.)

48. Let $T_1$ and $T_2$ be independent exponentials with parameters $\lambda_1$ and $\lambda_2$. Find the density function of $T_1 + T_2$.

49. Find the density function of X + Y, where X and Y have a joint density as given in Example D in Section 3.3.
50. Suppose that X and Y are independent discrete random variables and each assumes the values 0, 1, and 2 with probability 1/3 each. Find the frequency function of X + Y.

51. Let X and Y have the joint density function f(x, y), and let Z = XY. Show that the density function of Z is
$$f_Z(z) = \int_{-\infty}^{\infty} f\left(y, \frac{z}{y}\right) \frac{1}{|y|}\, dy$$

52. Find the density of the quotient of two independent uniform random variables.

53. Consider forming a random rectangle in two ways. Let $U_1$, $U_2$, and $U_3$ be independent random variables uniform on [0, 1]. One rectangle has sides $U_1$ and $U_2$, and the other is a square with sides $U_3$. Find the probability that the area of the square is greater than the area of the other rectangle.

54. Let X, Y, and Z be independent $N(0, \sigma^2)$. Let $\Phi$, $\Theta$, and R be the corresponding random variables that are the spherical coordinates of (X, Y, Z):
$$x = r \sin\phi \cos\theta, \quad y = r \sin\phi \sin\theta, \quad z = r \cos\phi, \qquad 0 \le \phi \le \pi,\ 0 \le \theta \le 2\pi$$
Find the joint and marginal densities of $\Phi$, $\Theta$, and R. (Hint: $dx\, dy\, dz = r^2 \sin\phi\, dr\, d\theta\, d\phi$.)

55. A point is generated on a unit disk in the following way: The radius, R, is uniform on [0, 1], and the angle $\Theta$ is uniform on [0, 2π] and is independent of R.
a. Find the joint density of $X = R\cos\Theta$ and $Y = R\sin\Theta$.
b. Find the marginal densities of X and Y.
c. Is the density uniform over the disk? If not, modify the method to produce a uniform density.

56. If X and Y are independent exponential random variables, find the joint density of the polar coordinates R and $\Theta$ of the point (X, Y). Are R and $\Theta$ independent?

57. Suppose that $Y_1$ and $Y_2$ follow a bivariate normal distribution with parameters $\mu_{Y_1} = \mu_{Y_2} = 0$, $\sigma^2_{Y_1} = 1$, $\sigma^2_{Y_2} = 2$, and $\rho = 1/\sqrt{2}$. Find a linear transformation $x_1 = a_{11} y_1 + a_{12} y_2$, $x_2 = a_{21} y_1 + a_{22} y_2$ such that $x_1$ and $x_2$ are independent standard normal random variables. (Hint: See Example C of Section 3.6.2.)

58. Show that if the joint distribution of $X_1$ and $X_2$ is bivariate normal, then the joint distribution of $Y_1 = a_1 X_1 + b_1$ and $Y_2 = a_2 X_2 + b_2$ is bivariate normal.

59. Let $X_1$ and $X_2$ be independent standard normal random variables. Show that the joint distribution of
$$Y_1 = a_{11} X_1 + a_{12} X_2 + b_1, \qquad Y_2 = a_{21} X_1 + a_{22} X_2 + b_2$$
is bivariate normal.

60. Using the results of the previous problem, describe a method for generating pseudorandom variables that have a bivariate normal distribution from independent pseudorandom uniform variables.

61. Let X and Y be jointly continuous random variables. Find an expression for the joint density of U = a + bX and V = c + dY.

62. If X and Y are independent standard normal random variables, find $P(X^2 + Y^2 \le 1)$.

63. Let X and Y be jointly continuous random variables.
a. Develop an expression for the joint density of X + Y and X − Y.
b. Develop an expression for the joint density of XY and Y/X.
c. Specialize the expressions from parts (a) and (b) to the case where X and Y are independent.

64. Find the joint density of X + Y and X/Y, where X and Y are independent exponential random variables with parameter λ. Show that X + Y and X/Y are independent.

65. Suppose that a system's components are connected in series and have lifetimes that are independent exponential random variables with parameters $\lambda_i$. Show that the lifetime of the system is exponential with parameter $\sum \lambda_i$.

66. Each component of the system shown in Figure 3.19 has an independent exponentially distributed lifetime with parameter λ. Find the cdf and the density of the system's lifetime.

[FIGURE 3.19]
67. A card contains n chips and has an error-correcting mechanism such that the card still functions if a single chip fails but does not function if two or more chips fail. If each chip has a lifetime that is an independent exponential with parameter λ, find the density function of the card's lifetime.

68. Suppose that a queue has n servers and that the length of time to complete a job is an exponential random variable. If a job is at the top of the queue and will be handled by the next available server, what is the distribution of the waiting time until service? What is the distribution of the waiting time until service of the next job in the queue?

69. Find the density of the minimum of n independent Weibull random variables, each of which has the density
$$f(t) = \beta \alpha^{-\beta} t^{\beta-1} e^{-(t/\alpha)^\beta}, \quad t \ge 0$$

70. If five numbers are chosen at random in the interval [0, 1], what is the probability that they all lie in the middle half of the interval?

71. Let $X_1, \ldots, X_n$ be independent random variables, each with the density function f. Find an expression for the probability that the interval $(-\infty, X_{(n)}]$ encompasses at least 100ν% of the probability mass of f.

72. Let $X_1, X_2, \ldots, X_n$ be independent continuous random variables, each with cumulative distribution function F. Show that the joint cdf of $X_{(1)}$ and $X_{(n)}$ is
$$F(x, y) = F^n(y) - [F(y) - F(x)]^n, \quad x \le y$$

73. If $X_1, \ldots, X_n$ are independent random variables, each with the density function f, show that the joint density of $X_{(1)}, \ldots, X_{(n)}$ is
$$n!\, f(x_1) f(x_2) \cdots f(x_n), \quad x_1 < x_2 < \cdots < x_n$$

74. Let $U_1$, $U_2$, and $U_3$ be independent uniform random variables.
a. Find the joint density of $U_{(1)}$, $U_{(2)}$, and $U_{(3)}$.
b. The locations of three gas stations are independently and randomly placed along a mile of highway. What is the probability that no two gas stations are less than 1/3 mile apart?

75. Use the differential method to find the joint density of $X_{(i)}$ and $X_{(j)}$, where i < j.

76. Prove Theorem A of Section 3.7 by finding the cdf of $X_{(k)}$ and differentiating. (Hint: $X_{(k)} \le x$ if and only if k or more of the $X_i$ are less than or equal to x. The number of $X_i$ less than or equal to x is a binomial random variable.)

77. Find the density of $U_{(k)} - U_{(k-1)}$ if the $U_i$, i = 1, ..., n, are independent uniform random variables. This is the density of the spacing between adjacent points chosen uniformly in the interval [0, 1].

78. Show that
$$\int_0^1 \int_0^y (y - x)^n\, dx\, dy = \frac{1}{(n+1)(n+2)}$$

79. If $T_1$ and $T_2$ are independent exponential random variables, find the density function of $R = T_{(2)} - T_{(1)}$.

80. Let $U_1, \ldots, U_n$ be independent uniform random variables, and let V be uniform and independent of the $U_i$.
a. Find $P(V \le U_{(n)})$.
b. Find $P(U_{(1)} < V < U_{(n)})$.

81. Do both parts of Problem 80 again, assuming that the $U_i$ and V have the density function f and the cdf F, with $F^{-1}$ uniquely defined. (Hint: $F(U_i)$ has a uniform distribution.)

CHAPTER 4 Expected Values

4.1 The Expected Value of a Random Variable

The concept of the expected value of a random variable parallels the notion of a weighted average. The possible values of the random variable are weighted by their probabilities, as specified in the following definition.

DEFINITION
If X is a discrete random variable with frequency function p(x), the expected value of X, denoted by E(X), is

$$E(X) = \sum_i x_i\, p(x_i)$$

provided that $\sum_i |x_i|\, p(x_i) < \infty$. If the sum diverges, the expectation is undefined. ■

E(X) is also referred to as the mean of X and is often denoted by μ or $\mu_X$.
It might be helpful to think of the expected value of X as the center of mass of the frequency function. Imagine placing the masses $p(x_i)$ at the points $x_i$ on a beam; the balance point of the beam is the expected value of X.

EXAMPLE A Roulette
A roulette wheel has the numbers 1 through 36, as well as 0 and 00. If you bet $1 that an odd number comes up, you win or lose $1 according to whether that event occurs. If X denotes your net gain, X = 1 with probability 18/38 and X = −1 with probability 20/38. The expected value of X is

$$E(X) = 1 \times \frac{18}{38} + (-1) \times \frac{20}{38} = -\frac{1}{19}$$

Thus, your expected loss is about $.05. In Chapter 5, it will be shown that this coincides in the limit with the actual average loss per game if you play a long sequence of independent games. ■

EXAMPLE B Expectation of a Geometric Random Variable
Suppose that items produced in a plant are independently defective with probability p. Items are inspected one by one until a defective item is found. On the average, how many items must be inspected? The number of items inspected, X, is a geometric random variable, with $P(X = k) = q^{k-1} p$, where q = 1 − p. Therefore,

$$E(X) = \sum_{k=1}^{\infty} k p q^{k-1} = p \sum_{k=1}^{\infty} k q^{k-1}$$

We use a trick to calculate the sum. Since $k q^{k-1} = \frac{d}{dq} q^k$, we interchange the operations of summation and differentiation to obtain

$$E(X) = p \frac{d}{dq} \sum_{k=1}^{\infty} q^k = p \frac{d}{dq} \left( \frac{q}{1-q} \right) = \frac{p}{(1-q)^2} = \frac{1}{p}$$

It can be shown that the interchange of differentiation and summation is justified. Thus, for example, if 10% of the items are defective, an average of 10 items must be examined to find one that is defective, as might have been guessed. ■

EXAMPLE C Poisson Distribution
The expected value of a Poisson random variable is

$$E(X) = \sum_{k=0}^{\infty} k \frac{\lambda^k}{k!} e^{-\lambda} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!}$$

Since $\sum_{j=0}^{\infty} \lambda^j / j! = e^{\lambda}$, we have E(X) = λ. The parameter λ of the Poisson distribution can thus be interpreted as the average count. ■

EXAMPLE D St. Petersburg Paradox
A gambler has the following strategy for playing a sequence of games: He starts off betting $1; if he loses, he doubles his bet; and he continues to double his bet until he finally wins. To analyze this scheme, suppose that the game is fair and that he wins or loses the amount he bets. At trial 0, he bets $1; if he loses, he bets $2 at trial 1; and if he has not won by the kth trial, he bets $2^k$. When he finally wins, he will be $1 ahead, which can be checked by going through the scheme for the first few values of k. This seems like a foolproof way to win $1. What could be wrong with it?

Let X denote the amount of money bet on the very last game (the game he wins). Because the probability that k losses are followed by one win is $2^{-(k+1)}$,

$$P(X = 2^k) = \frac{1}{2^{k+1}}, \quad k = 0, 1, 2, \ldots$$

and

$$E(X) = \sum_n n\, P(X = n) = \sum_{k=0}^{\infty} 2^k \frac{1}{2^{k+1}} = \infty$$

Formally, E(X) is not defined. Practically, the analysis shows a flaw in this scheme, which is that it does not take into account the enormous amount of capital required. ■

The definition of expectation for a continuous random variable is a fairly obvious extension of the discrete case: summation is replaced by integration.

DEFINITION
If X is a continuous random variable with density f(x), then

$$E(X) = \int_{-\infty}^{\infty} x f(x)\, dx$$

provided that $\int |x| f(x)\, dx < \infty$. If the integral diverges, the expectation is undefined. ■

Again, E(X) can be regarded as the center of mass of the density.
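The geometric mean 1/p in Example B can be checked by simulation; a minimal sketch with NumPy, using p = 0.1 as in the example:

```python
import numpy as np

rng = np.random.default_rng(7)
p = 0.1                                  # defect probability from Example B

# Number of inspections until the first defective item: geometric with
# success probability p, so E(X) should be 1/p = 10.
x = rng.geometric(p, size=1_000_000)
print(x.mean())                          # ~ 10
```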
We next consider some examples.

EXAMPLE E Gamma Density
If X follows a gamma density with parameters α and λ,

$$E(X) = \int_0^{\infty} \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^\alpha e^{-\lambda x}\, dx$$

This integral is easy to evaluate once we realize that $\lambda^{\alpha+1} x^\alpha e^{-\lambda x} / \Gamma(\alpha+1)$ is a gamma density and therefore integrates to 1. We thus have

$$\int_0^{\infty} x^\alpha e^{-\lambda x}\, dx = \frac{\Gamma(\alpha+1)}{\lambda^{\alpha+1}}$$

from which it follows that

$$E(X) = \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+1)}{\lambda^{\alpha+1}}$$

Finally, using the relation $\Gamma(\alpha+1) = \alpha\Gamma(\alpha)$, we find

$$E(X) = \frac{\alpha}{\lambda}$$

For the exponential density, α = 1, so E(X) = 1/λ. This may be contrasted to the median of the exponential density, which was found in Section 2.2.1 to be $(\log 2)/\lambda$. The mean and the median can both be interpreted as "typical" values of X, but they measure different attributes of the probability distribution. ■

EXAMPLE F Normal Distribution
From the definition of the expectation, we have

$$E(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x\, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\, dx$$

Making the change of variables z = x − μ changes this equation to

$$E(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} z\, e^{-z^2/2\sigma^2}\, dz + \frac{\mu}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2\sigma^2}\, dz$$

The first integral is 0, since the contributions from z < 0 cancel those from z > 0, and the second term is μ because the normal density integrates to 1. Thus,

$$E(X) = \mu$$

The parameter μ of the normal density is the expectation, or mean value. We could have made the derivation much shorter by claiming that it was "obvious" that since the center of symmetry of the density is μ, the expectation must be μ. ■

EXAMPLE G Cauchy Density
Recall that the Cauchy density is

$$f(x) = \frac{1}{\pi} \cdot \frac{1}{1 + x^2}, \quad -\infty < x < \infty$$

The density is symmetric about zero, so it would seem that E(X) = 0. However,

$$\int_{-\infty}^{\infty} \frac{|x|}{1 + x^2}\, dx = \infty$$

Therefore, the expectation does not exist. The reason that it fails to exist is that the density decreases so slowly that very large values of X can occur with substantial probability. ■

The expected value can be interpreted as a long-run average. In Chapter 5, it will be shown that if E(X) exists, if $X_1, X_2, \ldots$ is a sequence of independent random variables with the same distribution as X, and if $S_n = \sum_{i=1}^n X_i$, then, as $n \to \infty$,

$$\frac{S_n}{n} \to E(X)$$

This statement will be made more precise in Chapter 5. For now, a simple empirical demonstration will be sufficient.

EXAMPLE H
Using a pseudorandom number generator, a sequence $X_1, X_2, \ldots$ of independent standard normal random variables was generated, as well as a sequence $Y_1, Y_2, \ldots$ of independent Cauchy random variables. Figure 4.1 shows the graphs of

$$G(n) = \frac{1}{n} \sum_{i=1}^n X_i \quad \text{and} \quad C(n) = \frac{1}{n} \sum_{i=1}^n Y_i, \qquad n = 1, 2, \ldots, 500$$

Note how G(n) appears to be tending to a limit, whereas C(n) does not. ■

[FIGURE 4.1 The average of n independent random variables as a function of n for (a) normal random variables and (b) Cauchy random variables.]

We conclude this section with a simple result that is of great utility in probability theory.

THEOREM A Markov's Inequality
If X is a random variable with P(X ≥ 0) = 1 and for which E(X) exists, then P(X ≥ t) ≤ E(X)/t.

Proof
We will prove this for the discrete case; the continuous case is entirely analogous.

$$E(X) = \sum_x x\, p(x) = \sum_{x < t} x\, p(x) + \sum_{x \ge t} x\, p(x) \ge \sum_{x \ge t} x\, p(x) \ge t \sum_{x \ge t} p(x) = t\, P(X \ge t)$$

■

Taking t = kE(X) shows, for example, that $P(X > kE(X)) \le k^{-1}$: a nonnegative random variable exceeds k times its mean with probability at most 1/k. This holds for any nonnegative random variable, regardless of its probability distribution.
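The demonstration of Example H is easy to reproduce; a sketch computing the running averages G(n) and C(n) with NumPy:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
x = rng.standard_normal(n)          # E(X) exists, so averages settle down
y = rng.standard_cauchy(n)          # E(Y) does not exist

g = np.cumsum(x) / np.arange(1, n + 1)   # G(n), running averages
c = np.cumsum(y) / np.arange(1, n + 1)   # C(n)

print(g[-1])     # close to 0
print(c[-1])     # erratic; rerunning gives wildly different values
```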
4.1.1 Expectations of Functions of Random Variables

We often need to find E[g(X)], where X is a random variable and g is a fixed function. For example, according to the kinetic theory of gases, the magnitude of the velocity of a gas molecule is random, and its probability density is given by

$$f_X(x) = \frac{\sqrt{2/\pi}}{\sigma^3}\, x^2 e^{-\frac{1}{2}\frac{x^2}{\sigma^2}}, \quad x \ge 0$$

(This is Maxwell's distribution; the parameter σ depends on the temperature of the gas.) From this density, we can find the average velocity, but suppose that we are interested in finding the average kinetic energy, $Y = \frac{1}{2} m X^2$, where m is the mass of the molecule. The straightforward way to do this would seem to be the following: Let Y = g(X); find the density or frequency function of Y, say $f_Y$; and then compute E(Y) from the definition. It turns out, fortunately, that the process is much simpler than that.

THEOREM A
Suppose that Y = g(X).
a. If X is discrete with frequency function p(x), then

$$E(Y) = \sum_x g(x)\, p(x)$$

provided that $\sum_x |g(x)|\, p(x) < \infty$.
b. If X is continuous with density function f(x), then

$$E(Y) = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx$$

provided that $\int |g(x)|\, f(x)\, dx < \infty$.

Proof
We will prove this result for the discrete case. The basic argument is the same for the continuous case, but making that proof rigorous requires some advanced theory of integration. By definition,

$$E(Y) = \sum_i y_i\, p_Y(y_i)$$

Let $A_i$ denote the set of x's mapped to $y_i$ by g; that is, $x \in A_i$ if $g(x) = y_i$. Then

$$p_Y(y_i) = \sum_{x \in A_i} p(x)$$

and

$$E(Y) = \sum_i y_i \sum_{x \in A_i} p(x) = \sum_i \sum_{x \in A_i} y_i\, p(x) = \sum_i \sum_{x \in A_i} g(x)\, p(x) = \sum_x g(x)\, p(x)$$

The last step follows because the $A_i$ are disjoint and every x belongs to some $A_i$. ■

It is worth pointing out explicitly that, in general, E[g(X)] ≠ g[E(X)]; that is, the average value of the function is not equal to the function of the average value. Suppose, for example, that X takes on values 1 and 2, each with probability 1/2, so E(X) = 3/2. Let Y = 1/X. Then E(Y) is clearly 1 × .5 + .5 × .5 = .75, but 1/E(X) = 2/3.

EXAMPLE A
Let us now apply Theorem A to find the average kinetic energy of a gas molecule:

$$E(Y) = \int_0^{\infty} \frac{1}{2} m x^2 f_X(x)\, dx = \frac{m}{2}\, \frac{\sqrt{2/\pi}}{\sigma^3} \int_0^{\infty} x^4 e^{-\frac{1}{2}\frac{x^2}{\sigma^2}}\, dx$$

To evaluate the integral, we make the change of variables $u = x^2/2\sigma^2$ to reduce it to

$$\frac{2m\sigma^2}{\sqrt{\pi}} \int_0^{\infty} u^{3/2} e^{-u}\, du = \frac{2m\sigma^2}{\sqrt{\pi}}\, \Gamma\left(\frac{5}{2}\right)$$

Finally, using the facts that $\Gamma(\frac{1}{2}) = \sqrt{\pi}$ and $\Gamma(\alpha+1) = \alpha\Gamma(\alpha)$, we have

$$E(Y) = \frac{3}{2} m \sigma^2$$

■

Now suppose that $Y = g(X_1, \ldots, X_n)$, where the $X_i$ have a joint distribution, and that we want to find E(Y). We do not have to find the density or frequency function of Y, which again could be a formidable task.

THEOREM B
Suppose that $X_1, \ldots, X_n$ are jointly distributed random variables and $Y = g(X_1, \ldots, X_n)$.
a. If the $X_i$ are discrete with frequency function $p(x_1, \ldots, x_n)$, then

$$E(Y) = \sum_{x_1, \ldots, x_n} g(x_1, \ldots, x_n)\, p(x_1, \ldots, x_n)$$

provided that $\sum_{x_1, \ldots, x_n} |g(x_1, \ldots, x_n)|\, p(x_1, \ldots, x_n) < \infty$.
b. If the $X_i$ are continuous with joint density function $f(x_1, \ldots, x_n)$, then

$$E(Y) = \int \cdots \int g(x_1, \ldots, x_n)\, f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$$

provided that the integral with |g| in place of g converges.

Proof
The proof is similar to that of Theorem A. ■
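Example A can be checked numerically. The Maxwell density above is the density of the norm of a vector of three independent N(0, σ²) components, so Maxwell speeds can be sampled directly; a sketch with NumPy (the values of m and σ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
m, sigma = 1.0, 2.0                 # illustrative mass and scale

# Maxwell speeds: the norm of three independent N(0, sigma^2) components.
v = np.linalg.norm(rng.normal(0, sigma, size=(1_000_000, 3)), axis=1)

kinetic = 0.5 * m * v**2
print(kinetic.mean(), 1.5 * m * sigma**2)   # both ~ 6.0
```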
EXAMPLE B
A stick of unit length is broken randomly in two places. What is the average length of the middle piece? We interpret this question to mean that the locations of the two break points are independent uniform random variables $U_1$ and $U_2$, so we need to compute $E|U_1 - U_2|$. Theorem B tells us that we do not need to find the density function of $|U_1 - U_2|$; we can just integrate $|u_1 - u_2|$ against the joint density of $U_1$ and $U_2$, $f(u_1, u_2) = 1$ for $0 \le u_1 \le 1$, $0 \le u_2 \le 1$. Thus,

$$E|U_1 - U_2| = \int_0^1 \int_0^1 |u_1 - u_2|\, du_1\, du_2 = \int_0^1 \int_0^{u_1} (u_1 - u_2)\, du_2\, du_1 + \int_0^1 \int_{u_1}^1 (u_2 - u_1)\, du_2\, du_1$$

With some care, we find the expectation to be 1/3. This is in accord with the intuitive argument that the smaller of $U_1$ and $U_2$ should be 1/3 on the average and the larger should be 2/3 on the average, which means that the average difference should be 1/3. ■

We note the following immediate consequence of Theorem B.

COROLLARY A
If X and Y are independent random variables and g and h are fixed functions, then E[g(X)h(Y)] = {E[g(X)]}{E[h(Y)]}, provided that the expectations on the right-hand side exist. ■

In particular, if X and Y are independent, E(XY) = E(X)E(Y). The proof of this corollary is left to Problem 29 of the end-of-chapter problems.
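A quick Monte Carlo check of Example B (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(10)
u1 = rng.uniform(size=1_000_000)
u2 = rng.uniform(size=1_000_000)

# Average length of the middle piece: E|U1 - U2| should be 1/3.
print(np.abs(u1 - u2).mean())       # ~ 0.3333
```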
4.1.2 Expectations of Linear Combinations of Random Variables

One of the most useful properties of the expectation is that it is a linear operation. Suppose that you were told that the average temperature on July 1 in a certain location was 70°F and you were asked what the average temperature in degrees Celsius was. You would simply convert to degrees Celsius and obtain $\frac{5}{9} \times 70 - 17.7 \approx 21.2^\circ$C. The notion of the average value of a random variable, which we have defined as the expected value of a random variable, behaves in the same fashion: if Y = aX + b, then E(Y) = aE(X) + b. More generally, this property extends to linear combinations of random variables.

THEOREM A
If $X_1, \ldots, X_n$ are jointly distributed random variables with expectations $E(X_i)$ and Y is a linear function of the $X_i$, $Y = a + \sum_{i=1}^n b_i X_i$, then

$$E(Y) = a + \sum_{i=1}^n b_i E(X_i)$$

Proof
We will prove this for the continuous case. The proof in the discrete case is parallel and is left to Problem 24 at the end of this chapter. For notational simplicity, we take n = 2. From Theorem B of Section 4.1.1, we have

$$E(Y) = \iint (a + b_1 x_1 + b_2 x_2) f(x_1, x_2)\, dx_1\, dx_2 = a \iint f(x_1, x_2)\, dx_1\, dx_2 + b_1 \iint x_1 f(x_1, x_2)\, dx_1\, dx_2 + b_2 \iint x_2 f(x_1, x_2)\, dx_1\, dx_2$$

The first double integral of the last expression is merely the integral of the bivariate density, which is equal to 1. The second double integral can be evaluated as follows:

$$\iint x_1 f(x_1, x_2)\, dx_1\, dx_2 = \int x_1 \left[ \int f(x_1, x_2)\, dx_2 \right] dx_1 = \int x_1 f_{X_1}(x_1)\, dx_1 = E(X_1)$$

A similar evaluation of the third double integral brings us to

$$E(Y) = a + b_1 E(X_1) + b_2 E(X_2)$$

This proves the theorem once we check that the expectation is well defined, or that

$$\iint |a + b_1 x_1 + b_2 x_2|\, f(x_1, x_2)\, dx_1\, dx_2 < \infty$$

This can be verified using the inequality $|a + b_1 x_1 + b_2 x_2| \le |a| + |b_1||x_1| + |b_2||x_2|$ and the assumption that the $E(X_i)$ exist. ■

Theorem A is extremely useful. We will illustrate its utility with several examples.

EXAMPLE A
Suppose that we wish to find the expectation of a binomial random variable Y. From the binomial frequency function,

$$E(Y) = \sum_{k=0}^n k \binom{n}{k} p^k (1-p)^{n-k}$$

It is not immediately obvious how to evaluate this sum. We can, however, represent Y as the sum of Bernoulli random variables $X_i$, which equal 1 or 0 depending on whether there is a success or failure on the ith trial:

$$Y = \sum_{i=1}^n X_i$$

Because $E(X_i) = 0 \times (1 - p) + 1 \times p = p$, it follows immediately that E(Y) = np.

An application of the binomial distribution and its expectation occurs in "shotgun sequencing" in genomics, a method of trying to figure out the sequence of letters that make up a long segment of DNA. It is technically too difficult to sequence the entire segment at once if it is very long. The basic idea of shotgun sequencing is to chop the DNA randomly into many small fragments, sequence each fragment, and then somehow assemble the fragments into one long "contig." The hope is that if there are many fragments, their overlaps can be used to assemble the contig. Suppose, then, that the length of the DNA sequence is G and that there are N fragments, each of length L. G might be at least 100,000 and L about 500. Assume that the left end of each fragment is equally likely to be at positions 1, 2, ..., G − L + 1. What is the probability that a particular location x ∈ {L, L + 1, ..., G} is covered by at least one fragment? How many fragments are expected to cover a particular location? (The positions {1, 2, ..., L − 1} are not included in this discussion because the boundary effect makes them a little different; for example, the only fragment that covers position 1 has its left end at position 1.)

To answer these questions, first consider a single fragment. The chance that it covers x equals the chance that its left end is in one of the L locations {x − L + 1, x − L + 2, ..., x}, and because the location of the left end is uniform, this probability is

$$p = \frac{L}{G - L + 1} \approx \frac{L}{G}$$

where the approximation holds because $G \gg L$. Thus, the distribution of W, the number of fragments that cover a particular location, is binomial with parameters N and p. From the binomial probability formula, the chance of coverage is

$$P(W > 0) = 1 - P(W = 0) = 1 - \left(1 - \frac{L}{G}\right)^N$$

Since N is large and p is small, the distribution of W is nearly Poisson with parameter λ = Np = NL/G. From the Poisson probability formula, $P(W = 0) \approx e^{-NL/G}$, so the probability that a particular location is covered is approximately $1 - e^{-NL/G}$. Observe that NL is the total length of all the fragments; the ratio NL/G is called the coverage. Calculations of this kind are thus useful in deciding how many fragments to use. If the coverage is 8, for example, the chance that a site is covered is .9997.

Overlap of fragments is important when trying to assemble them. Since W is a binomial random variable, the expected number of fragments that cover a given site is Np = NL/G, precisely the coverage.

We can also now answer this closely related question: How many sites do we expect to be entirely missed? We will calculate this using indicator random variables: let $I_x$ equal 1 if site x is missed and 0 otherwise. Then

$$E(I_x) = 1 \times P(I_x = 1) + 0 \times P(I_x = 0) \approx e^{-NL/G}$$

The number of sites that are not covered is

$$V = \sum_{x=L}^{G} I_x$$

and from the linearity of expectation,

$$E(V) = \sum_{x=L}^{G} E(I_x) \approx G e^{-NL/G}$$

The length of the human genome is approximately $G = 3 \times 10^9$, so with eight times coverage, we would expect about a million sites to be missed. ■
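The coverage calculations in Example A reduce to a few lines; a sketch using the quantities quoted in the text (G, L, and coverage 8):

```python
import math

G = 3e9        # genome length (from the text)
L = 500        # fragment length
coverage = 8   # NL/G
N = coverage * G / L

# Exact binomial vs. Poisson approximation for P(site covered).
exact = 1 - (1 - L / G) ** N
approx = 1 - math.exp(-coverage)
print(exact, approx)                 # both ~ 0.9997
print(G * math.exp(-coverage))       # expected missed sites, ~ 1e6
```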
EXAMPLE B Coupon Collection
Suppose that you collect coupons, that there are n distinct types of coupons, and that on each trial you are equally likely to get a coupon of any of the types. How many trials would you expect to go through until you had a complete set of coupons? (This might be a model for collecting baseball cards or for certain grocery store promotions.) The solution of this problem is greatly simplified by representing the number of trials as a sum. Let $X_1$ be the number of trials up to and including the trial on which the first coupon is collected: $X_1 = 1$. Let $X_2$ be the number of trials from that point up to and including the trial on which the next coupon different from the first is obtained; let $X_3$ be the number of trials from that point up to and including the trial on which the third distinct coupon is collected; and so on, up to $X_n$. Then the total number of trials, X, is the sum of the $X_i$, i = 1, 2, ..., n.

We now find the distribution of $X_r$. At that point, r − 1 of the n coupons have been collected, so on each trial the probability of success is (n − r + 1)/n. Therefore, $X_r$ is a geometric random variable, with $E(X_r) = n/(n - r + 1)$. (See Example B of Section 4.1.) Thus,

$$E(X) = \sum_{r=1}^n E(X_r) = \frac{n}{n} + \frac{n}{n-1} + \frac{n}{n-2} + \cdots + \frac{n}{1} = n \sum_{r=1}^n \frac{1}{r}$$

For example, if there are 10 types of coupons, the expected number of trials necessary to obtain at least one of each kind is 29.3.

Finally, we note the following famous approximation:

$$\sum_{r=1}^n \frac{1}{r} = \log n + \gamma + \varepsilon_n$$

where log is the natural log, or $\log_e$ (unless otherwise specified, log means natural log throughout this text), γ is Euler's constant, γ = .57..., and $\varepsilon_n$ approaches zero as n approaches infinity. Using this approximation for n = 10, we find that the approximate expected number of trials is 28.8. Generally, we see that the expected number of trials grows at the rate n log n, or slightly faster than n. ■

EXAMPLE C Group Testing
Suppose that a large number, n, of blood samples are to be screened for a relatively rare disease. If each sample is assayed individually, n tests will be required. On the other hand, if each sample is divided in half and one of the halves is put into a pool with all the other halves, the pooled lot can be tested. Then, provided that the test method is sensitive enough, if this test is negative, no further assays are necessary and only one test has to be performed. If the test on the pooled blood is positive, each reserved half-sample can be tested individually. In this case, a total of n + 1 tests will be required. It is therefore plausible, assuming that the disease is rare, that some savings can be achieved through this pooling procedure.

To analyze this more quantitatively, let us first generalize the scheme and suppose that the n samples are first grouped into m groups of k samples each, or n = mk. Each group is then tested; if a group tests positively, each individual in the group is tested. If $X_i$ is the number of tests run on the ith group, the total number of tests run is $N = \sum_{i=1}^m X_i$, and the expected total number of tests is

$$E(N) = \sum_{i=1}^m E(X_i)$$

Let us find $E(X_i)$. If the probability of a negative on any individual sample is p, then $X_i$ takes on the value 1 with probability $p^k$ or the value k + 1 with probability $1 - p^k$. Thus,

$$E(X_i) = p^k + (k+1)(1 - p^k) = k + 1 - kp^k$$

We now have

$$E(N) = m(k+1) - mkp^k = n\left(1 + \frac{1}{k} - p^k\right)$$

Recalling that n tests are necessary with no pooling, we see that the factor $(1 + 1/k - p^k)$ is the average number of samples tested using group testing as a proportion of n. Figure 4.2 shows this proportion as a function of k for p = .99. From the figure, we see that for group testing with a group size of about 10, only 20% of the number of tests used with the straightforward method are needed on the average. ■

[FIGURE 4.2 The proportion of n in the average number of samples tested using group testing, as a function of the group size k (for p = .99).]
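The proportion $(1 + 1/k - p^k)$ from Example C can be tabulated to find a good group size; a sketch for p = .99:

```python
# Expected number of tests under group testing, as a proportion of n
# (Example C): 1 + 1/k - p**k, where p is the per-sample negative rate.
p = 0.99

def proportion(k):
    return 1 + 1 / k - p**k

for k in (2, 5, 10, 20):
    print(k, round(proportion(k), 3))
# k = 10 gives ~ 0.196: about 20% of the tests needed without pooling.

# The best group size minimizes the proportion.
print(min(range(2, 101), key=proportion))   # ~ 10 or 11 for p = 0.99
```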
EXAMPLE D Counting Word Occurrences in DNA Sequences
Here we consider another example from genomics, and one that again illustrates the power of using indicator random variables. In searching for patterns in DNA sequences, there might be reason to expect that a "word" such as TATA would occur more frequently than expected in a random sequence. Or suppose we want to identify regions of a DNA sequence in which the occurrence of the word is unusually large. To quantify these ideas, we need to specify the meaning of random. In this example, we will take it to mean that the sequence is randomly composed of the letters A, C, G, and T in the sense that the letters at sites are independent and, at every site, each letter has probability $\frac{1}{4}$.

We also need to be careful to specify how we count. Consider the following sequence:

ACTATATAGATATA

We will count overlaps, so in the preceding sequence, TATA occurs three times.

Now suppose that the sequence is of length $N$ and that the word is of length $q$. Let $I_n$ be an indicator random variable taking on the value 1 if the word begins at position $n$ and 0 otherwise:

$$P(I_n = 1) = \left(\frac{1}{4}\right)^q$$

from the assumption of independence, and $E(I_n) = P(I_n = 1)$. Now the total number of times the word occurs is

$$W = \sum_{n=1}^{N-q+1} I_n$$

and

$$E(W) = \sum_{n=1}^{N-q+1} E(I_n) = (N - q + 1)\left(\frac{1}{4}\right)^q$$

Note that the $I_n$ are not independent—for example, in the case of the word TATA, if $I_1 = 1$, then $I_2 = 0$. Thus $W$ is not a binomial random variable. But despite the lack of independence, we can find $E(W)$ by expressing $W$ as a linear combination of indicator variables. ■
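Both the overlapping count and its expectation are easy to check by simulation; a minimal sketch (the sliding-window count matches the counting convention above):

```python
import numpy as np

rng = np.random.default_rng(0)

def count_overlapping(seq, word):
    # slide one position at a time so overlapping occurrences are counted
    return sum(seq[i:i + len(word)] == word
               for i in range(len(seq) - len(word) + 1))

print(count_overlapping("ACTATATAGATATA", "TATA"))   # 3, as in the text

word, N, trials = "TATA", 1000, 2000
counts = [count_overlapping("".join(rng.choice(list("ACGT"), size=N)), word)
          for _ in range(trials)]
print("simulated E(W):", np.mean(counts))                      # near 3.9
print("theory (N-q+1)/4^q:", (N - len(word) + 1) / 4 ** len(word))
```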
EXAMPLE E Investment Portfolios
An investor plans to apportion an amount of capital, $C_0$, between two investments, placing a fraction $\pi$, $0 \le \pi \le 1$, in one investment and a fraction $1 - \pi$ in the other for a fixed period of time. Denoting the returns (final value divided by initial value) on the investments by $R_1$ and $R_2$, her capital at the end of the period will be $C_1 = \pi C_0 R_1 + (1 - \pi)C_0 R_2$. Her return will then be

$$R = \frac{C_1}{C_0} = \pi R_1 + (1 - \pi)R_2$$

Suppose that the returns are unknown, as would be the case if they were stocks, for example, and that they are hence modeled as random variables, with expected values $E(R_1)$ and $E(R_2)$. Then her expected return is

$$E(R) = \pi E(R_1) + (1 - \pi)E(R_2)$$

How should she choose $\pi$? A simple solution would apparently be to choose $\pi = 1$ if $E(R_1) > E(R_2)$ and $\pi = 0$ otherwise. But there is more to the story, as we will see later. ■

4.2 Variance and Standard Deviation

The expected value of a random variable is its average value and can be viewed as an indication of the central value of the density or frequency function. The expected value is therefore sometimes referred to as a location parameter. The median of a distribution is also a location parameter, one that does not necessarily equal the mean. This section introduces another parameter, the standard deviation of a random variable, which is an indication of how dispersed the probability distribution is about its center, of how spread out on the average the values of the random variable are about its expectation. We first define the variance of a random variable and then define the standard deviation in terms of the variance.

DEFINITION
If $X$ is a random variable with expected value $E(X)$, the variance of $X$ is
$$\operatorname{Var}(X) = E\{[X - E(X)]^2\}$$
provided that the expectation exists. The standard deviation of $X$ is the square root of the variance. ■

If $X$ is a discrete random variable with frequency function $p(x)$ and expected value $\mu = E(X)$, then according to the definition and Theorem A of Section 4.1.1,

$$\operatorname{Var}(X) = \sum_i (x_i - \mu)^2 p(x_i)$$

whereas if $X$ is a continuous random variable with density function $f(x)$ and $E(X) = \mu$,

$$\operatorname{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx$$

The variance is often denoted by $\sigma^2$ and the standard deviation by $\sigma$.

From the preceding definition, the variance of $X$ is the average value of the squared deviation of $X$ from its mean. If $X$ has units of meters, for example, the variance has units of meters squared, and the standard deviation has units of meters. Although we are often interested ultimately in the standard deviation rather than the variance, it is usually easier to find the variance first and then take the square root.

The variance of a random variable changes in a simple way under linear transformations.

THEOREM A
If $\operatorname{Var}(X)$ exists and $Y = a + bX$, then $\operatorname{Var}(Y) = b^2\operatorname{Var}(X)$.

Proof
Since $E(Y) = a + bE(X)$,
$$E[(Y - E(Y))^2] = E\{[a + bX - a - bE(X)]^2\} = E\{b^2[X - E(X)]^2\} = b^2 E\{[X - E(X)]^2\} = b^2\operatorname{Var}(X)$$ ■

This result seems reasonable once you realize that the addition of a constant does not affect the variance, since the variance is a measure of the spread around a center and the center has merely been shifted. The standard deviation transforms in a natural way: $\sigma_Y = |b|\sigma_X$. Thus, if the units of measurement are changed from meters to centimeters, for example, the standard deviation is simply multiplied by 100.

EXAMPLE A Bernoulli Distribution
If $X$ has a Bernoulli distribution—that is, $X$ takes on values 0 and 1 with probability $1 - p$ and $p$, respectively—then we have seen (Example A of Section 4.1.2) that $E(X) = p$. By the definition of variance,

$$\operatorname{Var}(X) = (0 - p)^2(1 - p) + (1 - p)^2 p = p^2 - p^3 + p - 2p^2 + p^3 = p(1 - p)$$

Note that the expression $p(1 - p)$ is a quadratic with a maximum at $p = \frac{1}{2}$. If $p$ is 0 or 1, the variance is 0, which makes sense since the probability distribution is concentrated at a single point and the random variable is not variable at all. The distribution is most dispersed when $p = \frac{1}{2}$. ■

EXAMPLE B Normal Distribution
We have seen that $E(X) = \mu$. Then

$$\operatorname{Var}(X) = E[(X - \mu)^2] = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty}(x - \mu)^2 e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx$$

Making the change of variables $z = (x - \mu)/\sigma$ changes the right-hand side to

$$\frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\, dz$$

Finally, making the change of variables $u = z^2/2$ reduces the integral to a gamma function, and we find that $\operatorname{Var}(X) = \sigma^2$. ■

The following theorem gives an alternative way of calculating the variance.

THEOREM B
The variance of $X$, if it exists, may also be calculated as follows:
$$\operatorname{Var}(X) = E(X^2) - [E(X)]^2$$

Proof
Denote $E(X)$ by $\mu$. Then
$$\operatorname{Var}(X) = E[(X - \mu)^2] = E(X^2 - 2\mu X + \mu^2)$$
By the linearity of the expectation, this becomes
$$\operatorname{Var}(X) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2$$
as was to be shown. ■

According to Theorem B, the variance of $X$ can be found in two steps: First find $E(X)$, and then find $E(X^2)$.

EXAMPLE C Uniform Distribution
Let us apply Theorem B to find the variance of a random variable that is uniform on $[0, 1]$. We know that $E(X) = \frac{1}{2}$; next we need to find $E(X^2)$:

$$E(X^2) = \int_0^1 x^2\, dx = \frac{1}{3}$$

We thus have

$$\operatorname{Var}(X) = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{12}$$ ■
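Theorem B's two-step recipe is easy to check for the uniform case by simulation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1_000_000)

EX, EX2 = x.mean(), (x ** 2).mean()
print("E(X) ~", EX)                                   # near 1/2
print("Var(X) = E(X^2) - E(X)^2 ~", EX2 - EX ** 2)    # near 1/12 = 0.0833
```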
It was stated earlier that the variance or standard deviation of a random variable gives an indication as to how spread out its possible values are. A famous inequality, Chebyshev's inequality, lends a quantitative aspect to this indication.

THEOREM C Chebyshev's Inequality
Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. Then, for any $t > 0$,
$$P(|X - \mu| > t) \le \frac{\sigma^2}{t^2}$$

Proof
Let $Y = (X - \mu)^2$. Then $E(Y) = \sigma^2$, and the result follows by applying Markov's inequality to $Y$. ■

Theorem C says that if $\sigma^2$ is very small, there is a high probability that $X$ will not deviate much from $\mu$. For another interpretation, we can set $t = k\sigma$, so that the inequality becomes

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

For example, the probability that $X$ is more than $4\sigma$ away from $\mu$ is less than or equal to $\frac{1}{16}$. These results hold for any random variable with any distribution, provided the variance exists. In particular cases, the bounds are often much narrower. For example, if $X$ is normally distributed, we find from tables of the normal distribution that $P(|X - \mu| > 2\sigma) = .05$ (compared to $\frac{1}{4}$ obtained from Chebyshev's inequality).

Chebyshev's inequality has the following consequence.

COROLLARY A
If $\operatorname{Var}(X) = 0$, then $P(X = \mu) = 1$.

Proof
We will give a proof by contradiction. Suppose that $P(X = \mu) < 1$. Then, for some $\varepsilon > 0$, $P(|X - \mu| \ge \varepsilon) > 0$. However, by Chebyshev's inequality, for any $\varepsilon > 0$,
$$P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2} = 0$$
which is a contradiction. ■

EXAMPLE D Investment Portfolios
We continue Example E in Section 4.1.2. Suppose that one of the two investments is risky and the other is risk free. The first might be a stock and the other an insured savings account. The stock has a return $R_1$, which is modeled as a random variable with expectation $\mu_1 = 0.10$ and standard deviation $\sigma_1 = 0.075$. The standard deviation is a measure of risk—a large standard deviation means that the returns fluctuate a lot, so the investor might be lucky and get a large return, but might also be unlucky and lose a lot. Suppose that the savings account has a certain return $R_2 = 0.03$. The expected value of this return is $\mu_2 = 0.03$ and its standard deviation is 0—it is risk free.

If the investor places a fraction $\pi_1$ in the stock and a fraction $\pi_2 = 1 - \pi_1$ in the savings account, her return is $R = \pi_1 R_1 + (1 - \pi_1)R_2$ and her expected return is

$$E(R) = \pi_1\mu_1 + (1 - \pi_1)\mu_2$$

Since $\mu_1 > \mu_2$, her expected return is maximized by $\pi_1 = 1$, putting all her money in the stock. However, this point of view is too narrow; it does not take into account the risk of the stock. By Theorem A,

$$\operatorname{Var}(R) = \pi_1^2\sigma_1^2$$

and the standard deviation of the return is $\sigma_R = \pi_1\sigma_1$. The larger $\pi_1$, the larger the expected return, but also the larger the risk. In choosing $\pi_1$, the investor has to balance the risk she is willing to take against the expected gain; the desired balance will be different for different investors. If she is risk averse, she will choose a small value of $\pi_1$, being leery of volatile investments. By tracing out the expected return and the standard deviation as functions of $\pi_1$, she can strike a balance with which she is comfortable. ■
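Tracing out that risk-return tradeoff takes only a few lines; a sketch using the numbers from the example:

```python
import numpy as np

mu1, sigma1 = 0.10, 0.075   # risky stock
mu2 = 0.03                  # risk-free savings account

for pi1 in np.linspace(0, 1, 6):
    ER = pi1 * mu1 + (1 - pi1) * mu2   # expected return
    sd = pi1 * sigma1                  # risk (standard deviation)
    print(f"pi1 = {pi1:.1f}   E(R) = {ER:.3f}   sd(R) = {sd:.3f}")
```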
4.2.1 A Model for Measurement Error

Values of physical constants are not precisely known but must be determined by experimental procedures. Such seemingly simple operations as weighing an object, determining a voltage, or measuring an interval of time are actually quite complicated when all the details and possible sources of error are taken into account. The National Institute of Standards and Technology (NIST) in the United States and similar agencies in other countries are charged with developing and maintaining measurement standards. Such agencies employ probabilists and statisticians as well as physical scientists in this endeavor.

A distinction is usually made between random and systematic measurement errors. A sequence of repeated independent measurements made with no deliberate change in the apparatus or experimental procedure may not yield identical values, and the uncontrollable fluctuations are often modeled as random. At the same time, there may be errors that have the same effect on every measurement; equipment may be slightly out of calibration, for example, or there may be errors associated with the theory underlying the method of measurement. If the true value of the quantity being measured is denoted by $x_0$, the measurement, $X$, is modeled as

$$X = x_0 + \beta + \varepsilon$$

where $\beta$ is the constant, or systematic, error and $\varepsilon$ is the random component of the error; $\varepsilon$ is a random variable with $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. We then have

$$E(X) = x_0 + \beta \qquad \operatorname{Var}(X) = \sigma^2$$

$\beta$ is often called the bias of the measurement procedure. The two factors affecting the size of the error are the bias and the size of the variance, $\sigma^2$. A perfect measurement would have $\beta = 0$ and $\sigma^2 = 0$.

EXAMPLE A Measurement of the Gravity Constant
This and the next example are taken from an interesting and readable paper by Youden (1972), a statistician at NIST. Measurement of the acceleration due to gravity at Ottawa was done 32 times with each of two different methods (Preston-Thomas et al. 1960). The results are displayed as histograms in Figure 4.3. There is clearly some systematic difference between the two methods as well as some variation within each method. It appears that the two biases are unequal. The results from Rule 1 are more scattered than those of Rule 2, and their standard deviation is larger.

FIGURE 4.3 Histograms of two sets of measurements of the acceleration due to gravity (32 drops with Rule No. 1 in Aug 1958 and 32 drops with Rule No. 2 in Dec 1959; the reported means are 980.6124 and 980.6139 cm/sec², with standard deviations of 0.6 and 0.9 mgal). ■

An overall measure of the size of the measurement error that is often used is the mean squared error, which is defined as

$$\text{MSE} = E[(X - x_0)^2]$$

The mean squared error, which is the expected squared deviation of $X$ from $x_0$, can be decomposed into contributions from the bias and the variance.

THEOREM A
$\text{MSE} = \beta^2 + \sigma^2$.

Proof
From Theorem B of Section 4.2,
$$E[(X - x_0)^2] = \operatorname{Var}(X - x_0) + [E(X - x_0)]^2 = \operatorname{Var}(X) + \beta^2 = \sigma^2 + \beta^2$$ ■

Measurements are often reported in the form 102 ± 1.6, for example. Although it is not always clear what precisely is meant by such notation, 102 is the experimentally determined value and 1.6 is some measure of the error. It is often claimed or hoped that $\beta$ is negligible relative to $\sigma$, and in that case 1.6 represents $\sigma$ or some multiple of $\sigma$. In the graphical presentation of experimentally obtained data, error bars, usually of width $\sigma$ or some multiple of $\sigma$, are placed around measured values. In some cases, efforts are made to bound the magnitude of $\beta$, and the bound is incorporated into the error bars in some fashion.
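The bias-variance decomposition of Theorem A is easy to confirm by simulation. A minimal sketch, assuming for illustration a normally distributed random error (the theorem itself requires no distributional assumption, and the numbers are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x0, beta, sigma = 100.0, 1.2, 1.6   # true value, bias, sd of random error

X = x0 + beta + rng.normal(0, sigma, size=1_000_000)  # simulated measurements
mse = np.mean((X - x0) ** 2)

print("simulated MSE:", mse)                     # near beta^2 + sigma^2
print("beta^2 + sigma^2:", beta**2 + sigma**2)   # 1.44 + 2.56 = 4.00
```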
EXAMPLE B Measurement of the Speed of Light
Figure 4.4, taken from McNish (1962) and discussed by Youden (1972), shows 24 independent determinations of the speed of light, $c$, with error bars. The right column of the figure contains codes for the experimental methods used to obtain the measurements; for example, G denotes a method called the geodimeter method. The range of values for $c$ is about 3.5 km/sec, and many of the errors are less than .5 km/sec. Examination of the figure makes it clear that the error bars are too small and that the spread of values cannot be accounted for by different experimental techniques alone—the geodimeter method produced both the smallest and the next to largest value for $c$. Youden remarks, "Surely the evidence suggests that individual investigators are unable to set realistic limits of error to their reported values." He goes on to suggest that the differences are largely a result of calibration errors for equipment.

FIGURE 4.4 A plot of 24 independent determinations of the speed of light with the reported error bars. The investigator or country is listed in the left column, and the experimental method is coded in the right column. ■

4.3 Covariance and Correlation

The variance of a random variable is a measure of its variability, and the covariance of two random variables is a measure of their joint variability, or their degree of association. After defining covariance, we will develop some of its properties and discuss a measure of association called correlation, which is defined in terms of covariance. You may find this material somewhat formal and abstract at first, but as you use them, covariance, correlation, and their properties will begin to seem natural and familiar.

DEFINITION
If $X$ and $Y$ are jointly distributed random variables with expectations $\mu_X$ and $\mu_Y$, respectively, the covariance of $X$ and $Y$ is
$$\operatorname{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$
provided that the expectation exists. ■

The covariance is the average value of the product of the deviation of $X$ from its mean and the deviation of $Y$ from its mean. If the random variables are positively associated—that is, when $X$ is larger than its mean, $Y$ tends to be larger than its mean as well—the covariance will be positive. If the association is negative—that is, when $X$ is larger than its mean, $Y$ tends to be smaller than its mean—the covariance is negative. These statements will be expanded in the discussion of correlation.

By expanding the product and using the linearity of the expectation, we obtain an alternative expression for the covariance, paralleling Theorem B of Section 4.2:

$$\operatorname{Cov}(X, Y) = E(XY - X\mu_Y - Y\mu_X + \mu_X\mu_Y) = E(XY) - E(X)\mu_Y - E(Y)\mu_X + \mu_X\mu_Y = E(XY) - E(X)E(Y)$$

In particular, if $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$ and $\operatorname{Cov}(X, Y) = 0$ (but the converse is not true). See Problems 59 and 60 at the end of this chapter for examples.

EXAMPLE A
Let us return to the bivariate uniform distributions of Example C in Section 3.3. First, note that since the marginal distributions are uniform, $E(X) = E(Y) = \frac{1}{2}$. For the case $\alpha = -1$, the joint density of $X$ and $Y$ is $f(x, y) = 2x + 2y - 4xy$, $0 \le x \le 1$, $0 \le y \le 1$, so

$$E(XY) = \iint xy\, f(x, y)\, dx\, dy = \int_0^1\int_0^1 xy(2x + 2y - 4xy)\, dx\, dy = \frac{2}{9}$$

Thus,

$$\operatorname{Cov}(X, Y) = \frac{2}{9} - \frac{1}{2}\cdot\frac{1}{2} = -\frac{1}{36}$$

The covariance is negative, indicating a negative relationship between $X$ and $Y$. In fact, from Figure 3.5, we see that if $X$ is less than its mean, $\frac{1}{2}$, then $Y$ tends to be larger than its mean, and vice versa. A similar analysis shows that when $\alpha = 1$, $\operatorname{Cov}(X, Y) = \frac{1}{36}$. ■
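This covariance can be checked numerically by integrating the density on a grid; a sketch (the midpoint rule is accurate enough here):

```python
import numpy as np

n = 1000
u = (np.arange(n) + 0.5) / n          # midpoints of a grid on [0, 1]
x, y = np.meshgrid(u, u)

f = 2 * x + 2 * y - 4 * x * y         # joint density for alpha = -1
EXY = np.mean(x * y * f)              # midpoint-rule value of E(XY)

print("E(XY) ~", EXY)                 # near 2/9 = 0.2222
print("Cov(X,Y) ~", EXY - 0.25)       # near -1/36 = -0.0278
```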
We will now develop an expression for the covariance of linear combinations of random variables, proceeding in a number of small steps. First, since $E(a + X) = a + E(X)$,

$$\operatorname{Cov}(a + X, Y) = E\{[a + X - E(a + X)][Y - E(Y)]\} = E\{[X - E(X)][Y - E(Y)]\} = \operatorname{Cov}(X, Y)$$

Next, since $E(aX) = aE(X)$,

$$\operatorname{Cov}(aX, bY) = E\{[aX - aE(X)][bY - bE(Y)]\} = abE\{[X - E(X)][Y - E(Y)]\} = ab\operatorname{Cov}(X, Y)$$

Next, we consider $\operatorname{Cov}(X, Y + Z)$:

$$\operatorname{Cov}(X, Y + Z) = E([X - E(X)]\{[Y - E(Y)] + [Z - E(Z)]\}) = E\{[X - E(X)][Y - E(Y)]\} + E\{[X - E(X)][Z - E(Z)]\} = \operatorname{Cov}(X, Y) + \operatorname{Cov}(X, Z)$$

We can now put these results together to find $\operatorname{Cov}(aW + bX, cY + dZ)$:

$$\operatorname{Cov}(aW + bX, cY + dZ) = \operatorname{Cov}(aW + bX, cY) + \operatorname{Cov}(aW + bX, dZ) = ac\operatorname{Cov}(W, Y) + bc\operatorname{Cov}(X, Y) + ad\operatorname{Cov}(W, Z) + bd\operatorname{Cov}(X, Z)$$

In general, the same kind of argument gives the following important bilinear property of covariance.

THEOREM A
Suppose that $U = a + \sum_{i=1}^{n} b_i X_i$ and $V = c + \sum_{j=1}^{m} d_j Y_j$. Then
$$\operatorname{Cov}(U, V) = \sum_{i=1}^{n}\sum_{j=1}^{m} b_i d_j \operatorname{Cov}(X_i, Y_j)$$ ■

This theorem has many applications. In particular, since $\operatorname{Var}(X) = \operatorname{Cov}(X, X)$,

$$\operatorname{Var}(X + Y) = \operatorname{Cov}(X + Y, X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$$

More generally, we have the following result for the variance of a linear combination of random variables.

COROLLARY A
$$\operatorname{Var}\left(a + \sum_{i=1}^{n} b_i X_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} b_i b_j \operatorname{Cov}(X_i, X_j)$$ ■

If the $X_i$ are independent, then $\operatorname{Cov}(X_i, X_j) = 0$ for $i \ne j$, and we have another corollary.

COROLLARY B
$$\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\operatorname{Var}(X_i), \quad \text{if the } X_i \text{ are independent.}$$ ■

Corollary B is very useful. Note that $E(\sum X_i) = \sum E(X_i)$ whether or not the $X_i$ are independent, but it is generally not the case that $\operatorname{Var}(\sum X_i) = \sum\operatorname{Var}(X_i)$.

EXAMPLE B
Finding the variance of a binomial random variable from the definition of variance and the frequency function of the binomial distribution is not easy (try it). But expressing a binomial random variable as a sum of independent Bernoulli random variables makes the computation of the variance trivial. Specifically, if $Y$ is a binomial random variable, it can be expressed as $Y = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are independent Bernoulli random variables with $P(X_i = 1) = p$. We saw earlier (Example A in Section 4.2) that $\operatorname{Var}(X_i) = p(1 - p)$, from which it follows from Corollary B that $\operatorname{Var}(Y) = np(1 - p)$. ■

EXAMPLE C Random Walk
A drunken walker starts out at a point $x_0$ on the real line. He takes a step of length $X_1$, which is a random variable with expected value $\mu$ and variance $\sigma^2$, and his position at that time is $S(1) = x_0 + X_1$. He then takes another step of length $X_2$, which is independent of $X_1$ with the same mean and standard deviation. His position after $n$ such steps is $S(n) = x_0 + \sum_{i=1}^{n} X_i$. Then

$$E(S(n)) = x_0 + E\left(\sum_{i=1}^{n} X_i\right) = x_0 + n\mu$$

$$\operatorname{Var}(S(n)) = \operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = n\sigma^2$$

He thus expects to be at the position $x_0 + n\mu$, with an uncertainty, as measured by the standard deviation, of $\sqrt{n}\,\sigma$. Note that if $\mu > 0$, for example, for large values of $n$ he will be to the right of the point $x_0$ with very high probability (using Chebyshev's inequality).

Random walks have found applications in many areas of science. Brownian motion is a continuous time version of a random walk with the steps being normally distributed random variables. The name derives from observations of the biologist Robert Brown in 1827 of the apparently spontaneous motion of pollen grains suspended in water. This was later explained by Einstein to be due to the collisions of the grains with randomly moving water molecules. The theory of Brownian motion was developed by Louis Bachelier in 1900 in his PhD thesis "The theory of speculation," which related random walks to the evolution of stock prices. If the value of a stock evolves through time as a random walk, its short-term behavior is unpredictable. The efficient market hypothesis states that stock prices already reflect all known information, so that the future price is random and unknowable. The solid line in Figure 4.5 shows the value of the S&P 500 during 2003. The average of the increments (steps) was 0.81 and the standard deviation was 9.82. The dashed lines are simulations of random walks with the same initial value and increments that were normally distributed random variables with $\mu = 0.81$ and $\sigma = 9.82$. Notice the long stretches of upturns and downturns that occurred in the random walks as the markets reacted in ways that would have been explained ex post facto by analysts. See Malkiel (2004) for a popular exposition of the implications of random walk theory for stock market investors.

FIGURE 4.5 The solid line is the value of the S&P 500 during 2003. The dashed lines are simulations of random walks. ■
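Simulations like the dashed paths in Figure 4.5 take only a few lines; a sketch using the increment mean and standard deviation quoted above (the starting value 880 is an assumption for illustration, not a figure from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, x0 = 0.81, 9.82, 250, 880.0   # roughly 250 trading days

steps = rng.normal(mu, sigma, size=n)
walk = x0 + np.cumsum(steps)                # S(1), ..., S(n)

print("final value:", walk[-1])
print("theory: E(S(n)) =", x0 + n * mu,     # 1082.5
      " sd =", sigma * np.sqrt(n))          # about 155
```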
The correlation coefficient is defined in terms of the covariance.

DEFINITION
If $X$ and $Y$ are jointly distributed random variables and the variances and covariances of both $X$ and $Y$ exist and the variances are nonzero, then the correlation of $X$ and $Y$, denoted by $\rho$, is
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$$ ■

Note that because of the way the ratio is formed, the correlation is a dimensionless quantity (it has no units, such as inches, since the units in the numerator and denominator cancel). From the properties of the variance and covariance that we have established, it follows easily that if $X$ and $Y$ are both subjected to linear transformations (such as changing their units from inches to meters), the correlation coefficient does not change. Since it does not depend on the units of measurement, $\rho$ is in many cases a more useful measure of association than is the covariance.

EXAMPLE D
Let us return to the bivariate uniform distribution of Example A. Because $X$ and $Y$ are marginally uniform, $\operatorname{Var}(X) = \operatorname{Var}(Y) = \frac{1}{12}$. In the one case ($\alpha = -1$), we found $\operatorname{Cov}(X, Y) = -\frac{1}{36}$, so

$$\rho = -\frac{1}{36}\times 12 = -\frac{1}{3}$$

In the other case ($\alpha = 1$), the covariance was $\frac{1}{36}$, so the correlation is $\frac{1}{3}$. ■

The following notation and relationship are often useful. The standard deviations of $X$ and $Y$ are denoted by $\sigma_X$ and $\sigma_Y$ and their covariance by $\sigma_{XY}$. We thus have

$$\rho = \frac{\sigma_{XY}}{\sigma_X\sigma_Y} \qquad \text{and} \qquad \sigma_{XY} = \rho\sigma_X\sigma_Y$$

The following theorem states some further properties of $\rho$.

THEOREM B
$-1 \le \rho \le 1$. Furthermore, $\rho = \pm 1$ if and only if $P(Y = a + bX) = 1$ for some constants $a$ and $b$.

Proof
Since the variance of a random variable is nonnegative,
$$0 \le \operatorname{Var}\left(\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right) = \frac{\operatorname{Var}(X)}{\sigma_X^2} + \frac{\operatorname{Var}(Y)}{\sigma_Y^2} + \frac{2\operatorname{Cov}(X, Y)}{\sigma_X\sigma_Y} = 2(1 + \rho)$$
From this, we see that $\rho \ge -1$. Similarly,
$$0 \le \operatorname{Var}\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) = 2(1 - \rho)$$
implies that $\rho \le 1$. Suppose that $\rho = 1$. Then
$$\operatorname{Var}\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) = 0$$
which by Corollary A of Section 4.2 implies that
$$P\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y} = c\right) = 1$$
for some constant $c$. This is equivalent to $P(Y = a + bX) = 1$ for some $a$ and $b$. A similar argument holds for $\rho = -1$. ■
EXAMPLE E Investment Portfolio
We are now in a position to further develop the investment theory discussed in Section 4.1.2, Example E, and Section 4.2, Example D. Please review those examples before continuing. We first consider the simple example of two securities, assuming that they have the same expected returns, $\mu_1 = \mu_2 = \mu$, and that their returns are uncorrelated: $\sigma_{ij} = \operatorname{Cov}(R_i, R_j) = 0$. For a portfolio $(\pi, 1 - \pi)$, the expected return is

$$E(R(\pi)) = \pi\mu + (1 - \pi)\mu = \mu$$

so that when considering expected return only, the choice of $\pi$ makes no difference. However, taking risk into account,

$$\operatorname{Var}(R(\pi)) = \pi^2\sigma_1^2 + (1 - \pi)^2\sigma_2^2$$

Minimizing this with respect to $\pi$ gives the optimal portfolio

$$\pi_{\text{opt}} = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$

For example, if the investments are equally risky, $\sigma_1 = \sigma_2 = \sigma$, then $\pi = 1/2$, so the best strategy is to split her total investment equally between the two securities. If she does so, the variance of her return is, by Theorem A,

$$\operatorname{Var}\left(R\left(\tfrac{1}{2}\right)\right) = \frac{\sigma^2}{2}$$

whereas if she put all her money in one security, the variance of her return would be $\sigma^2$. The expected return is the same in both cases. This is a particularly simple example of the value of diversification of investments.

Suppose now that the two securities do not have the same expected returns, $\mu_1 < \mu_2$. Let the standard deviations of the returns be $\sigma_1$ and $\sigma_2$; usually less risky investments have lower expected returns, so $\sigma_1 < \sigma_2$. Furthermore, the two returns may be correlated: $\operatorname{Cov}(R_1, R_2) = \rho\sigma_1\sigma_2$. Corresponding to the portfolio $(\pi, 1 - \pi)$, we have expected return

$$E(R(\pi)) = \pi\mu_1 + (1 - \pi)\mu_2$$

and the variance of the return is

$$\operatorname{Var}(R(\pi)) = \pi^2\sigma_1^2 + 2\pi(1 - \pi)\rho\sigma_1\sigma_2 + (1 - \pi)^2\sigma_2^2$$

Comparing this to the result when the returns were independent, we see that the risk is lower when the returns are independent than when they are positively correlated. It would thus be better to invest in two unrelated or weakly related market sectors than to make two investments in the same sector. In deciding on the portfolio vector, the investor can study how the risk (the standard deviation of $R(\pi)$) changes as the expected return increases, and balance expected return against risk.

In actual investment decisions, many more than two possible investments are involved, but the basic idea remains the same. Suppose there are $n$ possible investments. Let the portfolio weights be denoted by the vector $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$. Let $E(R_i) = \mu_i$ and $\operatorname{Cov}(R_i, R_j) = \sigma_{ij}$ (so, in particular, $\operatorname{Var}(R_i)$ is denoted by $\sigma_{ii}$); then

$$E(R(\pi)) = \sum_i \pi_i\mu_i \qquad \operatorname{Var}(R(\pi)) = \sum_{i=1}^{n}\sum_{j=1}^{n}\pi_i\pi_j\sigma_{ij}$$

The investment decision, the choice of the portfolio vector $\pi$, is often couched as that of maximizing expected return subject to the risk being less than some value the individual investor is willing to tolerate. Some investors are more risk averse than others, so the portfolio vectors will differ from investor to investor. Equivalently, the decision may be phrased as that of finding the portfolio vector with the minimum risk subject to a desired return; there may well be many portfolio choices that give the same expected return, and the wise investor would choose the one among them that had the lowest risk.
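A short numerical check of the two-security case, using hypothetical volatilities (the numbers are ours, chosen only for illustration):

```python
import numpy as np

sigma1, sigma2, rho = 0.20, 0.30, 0.0   # hypothetical volatilities, uncorrelated

pis = np.linspace(0, 1, 101)
var = (pis**2 * sigma1**2 + 2*pis*(1 - pis)*rho*sigma1*sigma2
       + (1 - pis)**2 * sigma2**2)

print("grid minimum:", pis[np.argmin(var)])                # about 0.69
print("closed form:", sigma2**2 / (sigma1**2 + sigma2**2)) # 0.09/0.13 = 0.692
```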
As a general rule, risk is reduced by diversification and can be decreased with only a small sacrifice of returns. Figure 4.6, from Bernstein (1996, p. 254), illustrates this point empirically.

FIGURE 4.6 The benefit of diversification. The monthly average return from January 1992 to June 1994 of 13 stock markets (Philippines, Thailand, Brazil, Taiwan, Argentina, Turkey, Greece, Portugal, Korea, Mexico, Indonesia, Chile, and Malaysia), plotted against their standard deviations. The performance of the Standard and Poor's 500 index of U.S. stocks is plotted for comparison.

The point labeled "Index" shows the monthly average versus standard deviation for an investment that was equally weighted across all the markets. A reasonably high return with relatively little risk would thus have been obtained by spreading investments equally over the 13 stock markets. In fact, the risk is less than that of any of the individual markets. Note that these emerging markets were riskier than the U.S. market, but that they were more profitable. ■

EXAMPLE F Bivariate Normal Distribution
We will show that the covariance of $X$ and $Y$ when they follow a bivariate normal distribution is $\rho\sigma_X\sigma_Y$, which means that $\rho$ is the correlation coefficient. The covariance is

$$\operatorname{Cov}(X, Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x - \mu_X)(y - \mu_Y)f(x, y)\, dx\, dy$$

Making the changes of variables $u = (x - \mu_X)/\sigma_X$ and $v = (y - \mu_Y)/\sigma_Y$ changes the right-hand side to

$$\frac{\sigma_X\sigma_Y}{2\pi\sqrt{1 - \rho^2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} uv\exp\left(-\frac{u^2 + v^2 - 2\rho uv}{2(1 - \rho^2)}\right)du\, dv$$

As in Example F in Section 3.3, we use the technique of completing the square to rewrite this expression as

$$\frac{\sigma_X\sigma_Y}{2\pi\sqrt{1 - \rho^2}}\int_{-\infty}^{\infty} v\exp(-v^2/2)\int_{-\infty}^{\infty} u\exp\left(-\frac{(u - \rho v)^2}{2(1 - \rho^2)}\right)du\, dv$$

The inner integral is the mean of an $N[\rho v, (1 - \rho^2)]$ random variable, lacking only the normalizing constant $[2\pi(1 - \rho^2)]^{-1/2}$, and we thus have

$$\operatorname{Cov}(X, Y) = \frac{\rho\sigma_X\sigma_Y}{\sqrt{2\pi}}\int_{-\infty}^{\infty} v^2 e^{-v^2/2}\, dv = \rho\sigma_X\sigma_Y$$

as was to be shown. ■

The correlation coefficient $\rho$ measures the strength of the linear relationship between $X$ and $Y$ (compare with Figure 3.9). Correlation also affects the appearance of a scatterplot, which is constructed by generating $n$ independent pairs $(X_i, Y_i)$, where $i = 1, \ldots, n$, and plotting the points. Figure 4.7 shows scatterplots of 100 pairs of pseudorandom bivariate normal random variables for various values of $\rho$. Note that the clouds of points are roughly elliptical in shape.

FIGURE 4.7 Scatterplots of 100 independent pairs of bivariate normal random variables: (a) ρ = 0, (b) ρ = .3, (c) ρ = .6, (d) ρ = .9.
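Scatterplots like those in Figure 4.7 can be generated, and the result of Example F checked empirically, with a few lines; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
for rho in [0.0, 0.3, 0.6, 0.9]:
    cov = [[1.0, rho], [rho, 1.0]]    # unit variances, correlation rho
    xy = rng.multivariate_normal([0, 0], cov, size=100_000)
    sample_rho = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    print(f"rho = {rho:.1f}   sample correlation = {sample_rho:.3f}")
```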
4.4 Conditional Expectation and Prediction

4.4.1 Definitions and Examples

In Section 3.5, conditional frequency functions and density functions were defined. We noted that these had the properties of ordinary frequency and density functions. In particular, associated with a conditional distribution is a conditional mean. Suppose that $Y$ and $X$ are discrete random variables and that the conditional frequency function of $Y$ given $x$ is $p_{Y|X}(y|x)$. The conditional expectation of $Y$ given $X = x$ is

$$E(Y|X = x) = \sum_y y\, p_{Y|X}(y|x)$$

For the continuous case, we have

$$E(Y|X = x) = \int y\, f_{Y|X}(y|x)\, dy$$

More generally, the conditional expectation of a function $h(Y)$ is

$$E[h(Y)|X = x] = \int h(y)\, f_{Y|X}(y|x)\, dy$$

in the continuous case. A similar equation holds in the discrete case.

EXAMPLE A
Consider a Poisson process on $[0, 1]$ with mean $\lambda$, and let $N$ be the number of points in $[0, 1]$. For $p < 1$, let $X$ be the number of points in $[0, p]$. Find the conditional distribution and conditional mean of $X$ given $N = n$.

We first find the joint distribution $P(X = x, N = n)$, which is the probability of $x$ events in $[0, p]$ and $n - x$ events in $[p, 1]$. From the assumption of a Poisson process, the counts in the two intervals are independent Poisson random variables with parameters $p\lambda$ and $(1 - p)\lambda$, so

$$p_{XN}(x, n) = \frac{(p\lambda)^x e^{-p\lambda}}{x!}\cdot\frac{[(1 - p)\lambda]^{n-x} e^{-(1-p)\lambda}}{(n - x)!}$$

The marginal distribution of $N$ is Poisson, so the conditional frequency function of $X$ is, after some algebra,

$$p_{X|N}(x|n) = \frac{p_{XN}(x, n)}{p_N(n)} = \frac{n!}{x!(n - x)!}p^x(1 - p)^{n-x}$$

This is the binomial distribution with parameters $n$ and $p$. The conditional expectation is thus, by Example A of Section 4.1.2, $np$. ■
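A simulation makes the conclusion concrete: condition on the realized value of $N$ and look at the distribution of $X$. A sketch (it uses the standard Poisson-process property that, given $N$, the point locations are independent uniforms; the parameter values are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, n, trials = 20.0, 0.3, 20, 100_000

N = rng.poisson(lam, size=trials)
# for each realization with N == n, scatter n uniform points and count [0, p]
X_given_n = [np.sum(rng.uniform(size=m) < p) for m in N if m == n]

print("E(X | N = 20):", np.mean(X_given_n))    # near np = 6
print("Var(X | N = 20):", np.var(X_given_n))   # near np(1-p) = 4.2
```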
EXAMPLE B Bivariate Normal Distribution
From Example C in Section 3.5.2, if $Y$ and $X$ follow a bivariate normal distribution, the conditional density of $Y$ given $X$ is

$$f_{Y|X}(y|x) = \frac{1}{\sigma_Y\sqrt{2\pi(1 - \rho^2)}}\exp\left\{-\frac{\left[y - \mu_Y - \rho(\sigma_Y/\sigma_X)(x - \mu_X)\right]^2}{2\sigma_Y^2(1 - \rho^2)}\right\}$$

This is a normal density with mean $\mu_Y + \rho(x - \mu_X)\sigma_Y/\sigma_X$ and variance $\sigma_Y^2(1 - \rho^2)$. The former is the conditional mean and the latter the conditional variance of $Y$ given $X = x$. Note that the conditional mean is a linear function of $x$ and that as $|\rho|$ increases, the conditional variance decreases; both of these facts are suggested by the elliptical contours of the joint density. To see this more exactly, consider the case in which $\sigma_X = \sigma_Y = 1$ and $\mu_X = \mu_Y = 0$. The contours then are ellipses satisfying

$$x^2 - 2\rho xy + y^2 = \text{constant}$$

The major and minor axes of such an ellipse are at 45° and 135°. The conditional expectation of $Y$ given $X = x$ is the line $y = \rho x$; note that this line does not lie along the major axis of the ellipse. Figure 4.8 shows such a bivariate normal distribution with $\rho = 0.5$. The curved lines of the bivariate density correspond to the conditional density of $Y$ given various values of $x$, but they are not normalized to integrate to 1. The contours of the bivariate normal are the ellipses shown in the $xy$ plane as dashed curves, with the major axis shown by the straight dashed line. The conditional expectation of $Y$ given $X = x$ is shown as a function of $x$ by the solid line in the plane. Note that it is not the major axis of the ellipse.

FIGURE 4.8 Bivariate normal density with correlation ρ = 0.5. The conditional expectation of Y given X = x is shown as the solid line in the xy plane.

This phenomenon was noted by Sir Francis Galton (1822–1911), who studied the relationship of the heights of sons to those of their fathers. He observed that sons of very tall fathers were shorter on average than their fathers and that sons of very short fathers were on average taller. The empirical relationship is shown in Figure 14.19. ■

Assuming that the conditional expectation of $Y$ given $X = x$ exists for every $x$ in the range of $X$, it is a well-defined function of $X$ and hence is a random variable, which we write as $E(Y|X)$. For instance, in Example A we found that $E(X|N = n) = np$; thus, $E(X|N) = Np$ is a random variable that is a function of $N$. Provided that the appropriate sums or integrals converge, this random variable has an expectation and a variance. Its expectation is $E[E(Y|X)]$; for this expression, note that since $E(Y|X)$ is a random variable that is a function of $X$, the outer expectation can be taken with respect to the distribution of $X$ (Theorem A of Section 4.1.1). The following theorem says that the average (expected) value of $Y$ can be found by first conditioning on $X$, finding $E(Y|X)$, and then averaging this quantity with respect to $X$.

THEOREM A
$E(Y) = E[E(Y|X)]$.

Proof
We will prove this for the discrete case. The continuous case is proved similarly. Using Theorem A of Section 4.1.1, we need to show that
$$E(Y) = \sum_x E(Y|X = x)\, p_X(x)$$
where
$$E(Y|X = x) = \sum_y y\, p_{Y|X}(y|x)$$
Interchanging the order of summation gives us
$$\sum_x E(Y|X = x)\, p_X(x) = \sum_y y\sum_x p_{Y|X}(y|x)\, p_X(x)$$
(It can be shown that this interchange can be made.) From the law of total probability, we have
$$p_Y(y) = \sum_x p_{Y|X}(y|x)\, p_X(x)$$
Therefore,
$$\sum_y y\sum_x p_{Y|X}(y|x)\, p_X(x) = \sum_y y\, p_Y(y) = E(Y)$$ ■

Theorem A gives what might be called a law of total expectation: The expectation of a random variable $Y$ can be calculated by weighting the conditional expectations appropriately and summing or integrating.

EXAMPLE C
Suppose that in a system, a component and a backup unit both have mean lifetimes equal to $\mu$. If the component fails, the system automatically substitutes the backup unit, but there is probability $p$ that something will go wrong and it will fail to do so. Let $T$ be the total lifetime, and let $X = 1$ if the substitution of the backup takes place successfully and $X = 0$ if it does not. Thus, the total lifetime is the lifetime of the first component alone if the substitution fails and is the sum of the lifetimes of the original and the backup units if the substitution is successfully made. Then

$$E(T|X = 1) = 2\mu \qquad E(T|X = 0) = \mu$$

Thus,

$$E(T) = E(T|X = 1)P(X = 1) + E(T|X = 0)P(X = 0) = 2\mu(1 - p) + \mu p = \mu(2 - p)$$ ■

EXAMPLE D Random Sums
This example introduces sums of the type

$$T = \sum_{i=1}^{N} X_i$$

where $N$ is a random variable with a finite expectation and the $X_i$ are random variables that are independent of $N$ and have the common mean $E(X)$. Such sums arise in a variety of applications. An insurance company might receive $N$ claims in a given period of time, and the amounts of the individual claims might be modeled as random variables $X_1, X_2, \ldots$. The random variable $N$ could denote the number of customers entering a store and $X_i$ the expenditure of the $i$th customer, or $N$ could denote the number of jobs in a single-server queue and $X_i$ the service time for the $i$th job. For this last case, $T$ is the time to serve all the jobs in the queue.

According to Theorem A,

$$E(T) = E[E(T|N)]$$

Since $E(T|N = n) = nE(X)$, we have $E(T|N) = NE(X)$, and thus

$$E(T) = E[NE(X)] = E(N)E(X)$$

This agrees with the intuitive guess that the average time to complete $N$ jobs, where $N$ is random, is the average value of $N$ times the average amount of time to complete a job. ■
We have seen that the expectation of the random variable $E(Y|X)$ is $E(Y)$. We now find its variance.

THEOREM B
$\operatorname{Var}(Y) = \operatorname{Var}[E(Y|X)] + E[\operatorname{Var}(Y|X)]$.

Proof
We will explain what is meant by the notation in the course of the proof. First,
$$\operatorname{Var}(Y|X = x) = E(Y^2|X = x) - [E(Y|X = x)]^2$$
which is defined for all values of $x$. Thus, just as we defined $E(Y|X)$ to be a random variable by letting $X$ be random, we can define $\operatorname{Var}(Y|X)$ as a random variable. In particular, $\operatorname{Var}(Y|X)$ has the expectation $E[\operatorname{Var}(Y|X)]$. Since
$$\operatorname{Var}(Y|X) = E(Y^2|X) - [E(Y|X)]^2$$
we have
$$E[\operatorname{Var}(Y|X)] = E[E(Y^2|X)] - E\{[E(Y|X)]^2\}$$
Also,
$$\operatorname{Var}[E(Y|X)] = E\{[E(Y|X)]^2\} - \{E[E(Y|X)]\}^2$$
The final piece that we need is
$$\operatorname{Var}(Y) = E(Y^2) - [E(Y)]^2 = E[E(Y^2|X)] - \{E[E(Y|X)]\}^2$$
by the law of total expectation. Now we can put all the pieces together:
$$\operatorname{Var}(Y) = E[E(Y^2|X)] - E\{[E(Y|X)]^2\} + E\{[E(Y|X)]^2\} - \{E[E(Y|X)]\}^2 = E[\operatorname{Var}(Y|X)] + \operatorname{Var}[E(Y|X)]$$ ■

EXAMPLE E Random Sums
Let us continue Example D, but with the additional assumptions that the $X_i$ are independent random variables with the same mean, $E(X)$, and the same variance, $\operatorname{Var}(X)$, and that $\operatorname{Var}(N) < \infty$. According to Theorem B,

$$\operatorname{Var}(T) = E[\operatorname{Var}(T|N)] + \operatorname{Var}[E(T|N)]$$

Because $E(T|N) = NE(X)$,

$$\operatorname{Var}[E(T|N)] = [E(X)]^2\operatorname{Var}(N)$$

Also, since $\operatorname{Var}(T|N = n) = \operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = n\operatorname{Var}(X)$, we have $\operatorname{Var}(T|N) = N\operatorname{Var}(X)$ and

$$E[\operatorname{Var}(T|N)] = E(N)\operatorname{Var}(X)$$

We thus have

$$\operatorname{Var}(T) = [E(X)]^2\operatorname{Var}(N) + E(N)\operatorname{Var}(X)$$

If $N$ is fixed, say, $N = n$, then $\operatorname{Var}(T) = n\operatorname{Var}(X)$. Thus, we see from the preceding equation that extra variability occurs in $T$ because $N$ is random.

As a concrete example, suppose that the number of insurance claims in a certain time period has expected value equal to 900 and standard deviation equal to 30, as would be the case if the number were a Poisson random variable with expected value 900. Suppose that the average claim value is $1000 and the standard deviation is $500. Then the expected value of the total, $T$, of the claims is $E(T) =$ $900,000, and the variance of $T$ is

$$\operatorname{Var}(T) = 1000^2\times 900 + 900\times 500^2 = 1.125\times 10^9$$

The standard deviation of $T$ is the square root of the variance, $33,541. The insurance company could then plan on total claims of $900,000 plus or minus a few standard deviations (by Chebyshev's inequality). Observe that if the total number of claims were not variable but were fixed at $N = 900$, the variance of the total claims would be given by $E(N)\operatorname{Var}(X)$ in the preceding expression. The result would be a standard deviation equal to $15,000. The variability in the number of claims thus contributes substantially to the uncertainty in the total. ■
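The claims example is easy to verify by simulation. A sketch, assuming Poisson claim counts as in the text and, purely for illustration, gamma-distributed claim amounts with mean $1000 and standard deviation $500 (the formula does not depend on this choice):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 20_000

N = rng.poisson(900, size=trials)          # number of claims per period
# gamma with mean 1000 and sd 500: shape (1000/500)^2 = 4, scale 250
T = np.array([rng.gamma(4.0, 250.0, size=n).sum() for n in N])

print("mean of T:", T.mean())   # near 900,000
print("sd of T:  ", T.std())    # near 33,541
```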
4.4.2 Prediction

This section treats the problem of predicting the value of one random variable from another. We might wish, for example, to measure the value of some physical quantity, such as pressure, using an instrument. The actual pressures to be measured are unknown and variable, so we might model them as values of a random variable, $Y$. Assume that measurements are to be taken by some instrument that produces a response, $X$, related to $Y$ in some fashion but corrupted by random noise as well; $X$ might represent current flow, for example. $Y$ and $X$ have some joint distribution, and we wish to predict the actual pressure, $Y$, from the instrument response, $X$. As another example, in forestry, the volume of a tree is sometimes estimated from its diameter, which is easily measured. For a whole forest, it is reasonable to model diameter ($X$) and volume ($Y$) as random variables with some joint distribution, and then attempt to predict $Y$ from $X$.

Let us first consider a relatively trivial situation: the problem of predicting $Y$ by means of a constant value, $c$. If we wish to choose the "best" value of $c$, we need some measure of the effectiveness of a prediction. One that is amenable to mathematical analysis and that is widely used is the mean squared error:

$$\text{MSE} = E[(Y - c)^2]$$

This is the average squared error of prediction, the averaging being done with respect to the distribution of $Y$. The problem then becomes finding the value of $c$ that minimizes the mean squared error. To solve this problem, we denote $E(Y)$ by $\mu$ and observe that (see Theorem A of Section 4.2.1)

$$E[(Y - c)^2] = \operatorname{Var}(Y - c) + [E(Y - c)]^2 = \operatorname{Var}(Y) + (\mu - c)^2$$

The first term of the last expression does not depend on $c$, and the second term is minimized for $c = \mu$, which is the optimal choice of $c$.

Now let us consider predicting $Y$ by some function $h(X)$ in order to minimize $\text{MSE} = E\{[Y - h(X)]^2\}$. From Theorem A of Section 4.4.1, the right-hand side can be expressed as

$$E\{[Y - h(X)]^2\} = E(E\{[Y - h(X)]^2\,|\,X\})$$

The outer expectation is with respect to $X$. For every $x$, the inner expectation is minimized by setting $h(x)$ equal to the constant $E(Y|X = x)$, from the result of the preceding paragraph. We thus have that the minimizing function $h(X)$ is

$$h(X) = E(Y|X)$$

EXAMPLE A
For the bivariate normal distribution, we found that

$$E(Y|X) = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)$$

This linear function of $X$ is thus the minimum mean squared error predictor of $Y$ from $X$. ■

A practical limitation of the optimal prediction scheme is that its implementation depends on knowing the joint distribution of $Y$ and $X$ in order to find $E(Y|X)$, and often this information is not available, not even approximately. For this reason, we can try to attain the more modest goal of finding the optimal linear predictor of $Y$. (In Example A, it turned out that the best predictor was linear, but this is not generally the case.) That is, rather than finding the best function $h$ among all functions, we try to find the best function of the form $h(x) = \alpha + \beta x$. This merely requires optimizing over the two parameters $\alpha$ and $\beta$. Now,

$$E[(Y - \alpha - \beta X)^2] = \operatorname{Var}(Y - \alpha - \beta X) + [E(Y - \alpha - \beta X)]^2 = \operatorname{Var}(Y - \beta X) + [E(Y - \alpha - \beta X)]^2$$

The first term of the last expression does not depend on $\alpha$, so $\alpha$ can be chosen so as to minimize the second term. To do this, note that

$$E(Y - \alpha - \beta X) = \mu_Y - \alpha - \beta\mu_X$$

and that the right-hand side is zero, and hence its square is minimized, if

$$\alpha = \mu_Y - \beta\mu_X$$

As for the first term,

$$\operatorname{Var}(Y - \beta X) = \sigma_Y^2 + \beta^2\sigma_X^2 - 2\beta\sigma_{XY}$$

where $\sigma_{XY} = \operatorname{Cov}(X, Y)$. This is a quadratic function of $\beta$, and the minimum is found by setting the derivative with respect to $\beta$ equal to zero, which yields

$$\beta = \frac{\sigma_{XY}}{\sigma_X^2} = \rho\frac{\sigma_Y}{\sigma_X}$$

where $\rho$ is the correlation coefficient. Substituting in these values of $\alpha$ and $\beta$, we find that the minimum mean squared error predictor, which we denote by $\hat{Y}$, is

$$\hat{Y} = \alpha + \beta X = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$$

The mean squared prediction error is then

$$\operatorname{Var}(Y - \beta X) = \sigma_Y^2 + \frac{\sigma_{XY}^2}{\sigma_X^4}\sigma_X^2 - 2\frac{\sigma_{XY}}{\sigma_X^2}\sigma_{XY} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2} = \sigma_Y^2 - \rho^2\sigma_Y^2 = \sigma_Y^2(1 - \rho^2)$$

Note that the optimal linear predictor depends on the joint distribution of $X$ and $Y$ only through their means, variances, and covariance. Thus, in practice, it is generally easier to construct the optimal linear predictor or an approximation to it than to construct the general optimal predictor $E(Y|X)$. Second, note that the form of the optimal linear predictor is the same as that of $E(Y|X)$ for the bivariate normal distribution. Third, note that the mean squared prediction error depends only on $\sigma_Y$ and $\rho$ and that it is small if $\rho$ is close to $+1$ or $-1$. Here we see again, from a different point of view, that the correlation coefficient is a measure of the strength of the linear relationship between $X$ and $Y$.
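The optimal linear predictor is straightforward to estimate from data by plugging sample moments into the formulas above. A minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.6
x = rng.normal(0, 1, n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)  # Corr(x, y) = rho

beta = np.cov(x, y)[0, 1] / np.var(x)     # sigma_XY / sigma_X^2
alpha = y.mean() - beta * x.mean()
yhat = alpha + beta * x

print("beta ~", beta)                       # near rho = 0.6
print("MSE ~", np.mean((y - yhat)**2))      # near 1 - rho^2 = 0.64
```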
EXAMPLE B
Suppose that two examinations are given in a course. As a probability model, we regard the scores of a student on the first and second examinations as jointly distributed random variables $X$ and $Y$. Suppose for simplicity that the exams are scaled to have the same means $\mu = \mu_X = \mu_Y$ and standard deviations $\sigma = \sigma_X = \sigma_Y$. Then the correlation between $X$ and $Y$ is $\rho = \sigma_{XY}/\sigma^2$, and the best linear predictor is $\hat{Y} = \mu + \rho(X - \mu)$, so

$$\hat{Y} - \mu = \rho(X - \mu)$$

Notice that by this equation we predict the student's score on the second examination to differ from the overall mean $\mu$ by less than did the score on the first examination. If the correlation $\rho$ is positive, this is encouraging for a student who scores below the mean on the first exam, since our best prediction is that his score on the next exam will be closer to the mean. On the other hand, it's bad news for the student who scored above the mean on the first exam, since our best prediction is that she will score closer to the mean on the next exam. This phenomenon is often referred to as regression to the mean. ■
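A quick simulation shows regression to the mean in action; the score scale (mean 70, standard deviation 10, ρ = 0.6) is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, rho, n = 70.0, 10.0, 0.6, 200_000

x = rng.normal(mu, sigma, n)
y = mu + rho * (x - mu) + sigma * np.sqrt(1 - rho**2) * rng.normal(0, 1, n)

top = x > mu + sigma     # students at least one sd above the mean on exam 1
print("mean exam-1 score of this group:", x[top].mean())   # about 85
print("their mean exam-2 score:        ", y[top].mean())   # pulled toward 70
```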
4.5 The Moment-Generating Function

This section develops and applies some of the properties of the moment-generating function. It turns out, despite its unlikely appearance, to be a very useful tool that can dramatically simplify certain calculations.

The moment-generating function (mgf) of a random variable $X$ is

$$M(t) = E(e^{tX})$$

if the expectation is defined. In the discrete case,

$$M(t) = \sum_x e^{tx} p(x)$$

and in the continuous case,

$$M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\, dx$$

The expectation, and hence the moment-generating function, may or may not exist for any particular value of $t$. In the continuous case, the existence of the expectation depends on how rapidly the tails of the density decrease; for example, because the tails of the Cauchy density die down at the rate $x^{-2}$, the expectation does not exist for any $t \ne 0$ and the moment-generating function is undefined. The tails of the normal density die down at the rate $e^{-x^2}$, so the integral converges for all $t$.

PROPERTY A
If the moment-generating function exists for $t$ in an open interval containing zero, it uniquely determines the probability distribution. ■

We cannot prove this important property here—its proof depends on properties of the Laplace transform. Note that Property A says that if two random variables have the same mgf in an open interval containing zero, they have the same distribution. For some problems, we can find the mgf and then deduce the unique probability distribution corresponding to it.

The $r$th moment of a random variable is $E(X^r)$ if the expectation exists. We have already encountered the first and second moments earlier in this chapter, that is, $E(X)$ and $E(X^2)$. Central moments rather than ordinary moments are often used: The $r$th central moment is $E\{[X - E(X)]^r\}$. The variance is the second central moment and is a measure of dispersion about the mean. The third central moment, called the skewness, is used as a measure of the asymmetry of a density or a frequency function about its mean; if a density is symmetric about its mean, the skewness is zero (see Problem 78 at the end of this chapter).

As its name implies, the moment-generating function has something to do with moments. To see this, consider the continuous case:

$$M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\, dx$$

The derivative of $M(t)$ is

$$M'(t) = \frac{d}{dt}\int_{-\infty}^{\infty} e^{tx} f(x)\, dx$$

It can be shown that differentiation and integration can be interchanged, so that

$$M'(t) = \int_{-\infty}^{\infty} x e^{tx} f(x)\, dx$$

and

$$M'(0) = \int_{-\infty}^{\infty} x f(x)\, dx = E(X)$$

Differentiating $r$ times, we find

$$M^{(r)}(0) = E(X^r)$$

It can further be argued that if the moment-generating function exists in an interval containing zero, then so do all the moments. We thus have the following property.

PROPERTY B
If the moment-generating function exists in an open interval containing zero, then $M^{(r)}(0) = E(X^r)$. ■

To find the moments of a random variable from the definition of expectation, we must sum a series or carry out an integration. The utility of Property B is that, if the mgf can be found, the process of integration or summation, which may be difficult, can be replaced by the process of differentiation, which is mechanical. We now illustrate these concepts using some familiar distributions.

EXAMPLE A Poisson Distribution
By definition,

$$M(t) = \sum_{k=0}^{\infty} e^{tk}\frac{\lambda^k}{k!}e^{-\lambda} = e^{-\lambda}\sum_{k=0}^{\infty}\frac{(\lambda e^t)^k}{k!} = e^{-\lambda}e^{\lambda e^t} = e^{\lambda(e^t - 1)}$$

The sum converges for all $t$. Differentiating, we have

$$M'(t) = \lambda e^t e^{\lambda(e^t - 1)}$$
$$M''(t) = \lambda e^t e^{\lambda(e^t - 1)} + \lambda^2 e^{2t} e^{\lambda(e^t - 1)}$$

Evaluating these derivatives at $t = 0$, we find

$$E(X) = \lambda \qquad E(X^2) = \lambda^2 + \lambda$$

from which it follows that

$$\operatorname{Var}(X) = E(X^2) - [E(X)]^2 = \lambda$$

We have found that the mean and the variance of a Poisson distribution are equal. ■

EXAMPLE B Gamma Distribution
The mgf of a gamma distribution is

$$M(t) = \int_0^{\infty} e^{tx}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}x^{\alpha - 1}e^{-\lambda x}\, dx = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\int_0^{\infty} x^{\alpha - 1}e^{x(t - \lambda)}\, dx$$

The latter integral converges for $t < \lambda$ and can be evaluated by relating it to the gamma density having parameters $\alpha$ and $\lambda - t$. We thus obtain

$$M(t) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\cdot\frac{\Gamma(\alpha)}{(\lambda - t)^{\alpha}} = \left(\frac{\lambda}{\lambda - t}\right)^{\alpha}$$

Differentiating, we find

$$M'(0) = E(X) = \frac{\alpha}{\lambda} \qquad M''(0) = E(X^2) = \frac{\alpha(\alpha + 1)}{\lambda^2}$$

From these equations, we find that

$$\operatorname{Var}(X) = E(X^2) - [E(X)]^2 = \frac{\alpha(\alpha + 1)}{\lambda^2} - \frac{\alpha^2}{\lambda^2} = \frac{\alpha}{\lambda^2}$$ ■

EXAMPLE C Standard Normal Distribution
For the standard normal distribution, we have

$$M(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{tx}e^{-x^2/2}\, dx$$

The integral converges for all $t$ and can be evaluated using the technique of completing the square. Since

$$\frac{x^2}{2} - tx = \frac{1}{2}(x^2 - 2tx + t^2) - \frac{t^2}{2} = \frac{1}{2}(x - t)^2 - \frac{t^2}{2}$$

we have

$$M(t) = \frac{e^{t^2/2}}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(x - t)^2/2}\, dx$$

Making the change of variables $u = x - t$ and using the fact that the standard normal density integrates to 1, we find that

$$M(t) = e^{t^2/2}$$

From this result, we easily see that $E(X) = 0$ and $\operatorname{Var}(X) = 1$. ■

Let us continue with the development of the properties of the moment-generating function.

PROPERTY C
If $X$ has the mgf $M_X(t)$ and $Y = a + bX$, then $Y$ has the mgf $M_Y(t) = e^{at}M_X(bt)$.

Proof
$$M_Y(t) = E(e^{tY}) = E(e^{at + btX}) = E(e^{at}e^{btX}) = e^{at}E(e^{btX}) = e^{at}M_X(bt)$$ ■

EXAMPLE D General Normal Distribution
If $Y$ follows a general normal distribution with parameters $\mu$ and $\sigma$, the distribution of $Y$ is the same as that of $\mu + \sigma X$, where $X$ follows a standard normal distribution. Thus, from Example C and Property C,

$$M_Y(t) = e^{\mu t}M_X(\sigma t) = e^{\mu t}e^{\sigma^2 t^2/2}$$ ■
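Property B can be exercised mechanically with a symbolic algebra package; a sketch using SymPy to recover the Poisson mean and variance from the mgf of Example A:

```python
import sympy as sp

t, lam = sp.symbols("t lam", positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))        # Poisson mgf

EX = sp.diff(M, t).subs(t, 0)            # M'(0)
EX2 = sp.diff(M, t, 2).subs(t, 0)        # M''(0)

print(sp.simplify(EX))                   # lam
print(sp.simplify(EX2 - EX**2))          # lam, the variance
```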
PROPERTY D
If $X$ and $Y$ are independent random variables with mgf's $M_X$ and $M_Y$ and $Z = X + Y$, then $M_Z(t) = M_X(t)M_Y(t)$ on the common interval where both mgf's exist.

Proof
$$M_Z(t) = E(e^{tZ}) = E(e^{tX + tY}) = E(e^{tX}e^{tY})$$
From the assumption of independence,
$$M_Z(t) = E(e^{tX})E(e^{tY}) = M_X(t)M_Y(t)$$ ■

By induction, Property D can be extended to sums of several independent random variables. This is one of the most useful properties of the moment-generating function. The next three examples show how it can be used to easily derive results that would take a lot more work to achieve without recourse to the mgf.

EXAMPLE E
The sum of independent Poisson random variables is a Poisson random variable: If $X$ is Poisson with parameter $\lambda$ and $Y$ is Poisson with parameter $\mu$, then $X + Y$ is Poisson with parameter $\lambda + \mu$, since

$$e^{\lambda(e^t - 1)}e^{\mu(e^t - 1)} = e^{(\lambda + \mu)(e^t - 1)}$$ ■

EXAMPLE F
If $X$ follows a gamma distribution with parameters $\alpha_1$ and $\lambda$ and $Y$ follows a gamma distribution with parameters $\alpha_2$ and $\lambda$, the mgf of $X + Y$ is

$$\left(\frac{\lambda}{\lambda - t}\right)^{\alpha_1}\left(\frac{\lambda}{\lambda - t}\right)^{\alpha_2} = \left(\frac{\lambda}{\lambda - t}\right)^{\alpha_1 + \alpha_2}$$

where $t < \lambda$. The right-hand expression is the mgf of a gamma distribution with parameters $\lambda$ and $\alpha_1 + \alpha_2$. It follows from this that the sum of $n$ independent exponential random variables with parameter $\lambda$ follows a gamma distribution with parameters $n$ and $\lambda$. Thus, the time between $n$ consecutive events of a Poisson process in time follows a gamma distribution. Assuming that the service times in a queue are independent exponential random variables, the length of time to serve $n$ customers follows a gamma distribution. ■

EXAMPLE G
If $X \sim N(\mu, \sigma^2)$ and, independent of $X$, $Y \sim N(\nu, \tau^2)$, then the mgf of $X + Y$ is

$$e^{\mu t}e^{t^2\sigma^2/2}e^{\nu t}e^{t^2\tau^2/2} = e^{(\mu + \nu)t}e^{t^2(\sigma^2 + \tau^2)/2}$$

which is the mgf of a normal distribution with mean $\mu + \nu$ and variance $\sigma^2 + \tau^2$. The sum of independent normal random variables is thus normal. ■

The preceding three examples are atypical. In general, if two independent random variables follow some type of distribution, it is not necessarily true that their sum follows the same type of distribution. For example, the sum of two gamma random variables having different values for the parameter $\lambda$ does not follow a gamma distribution, as can be easily seen from the mgf.

We now apply moment-generating functions to random sums of the type introduced in Section 4.4.1. Suppose that

$$S = \sum_{i=1}^{N} X_i$$

where the $X_i$ are independent and have the same mgf, $M_X$, and where $N$ has the mgf $M_N$ and is independent of the $X_i$. By conditioning, we have

$$M_S(t) = E[E(e^{tS}|N)]$$

Given $N = n$, $E(e^{tS}|N = n) = [M_X(t)]^n$ from Property D. We thus have

$$M_S(t) = E[M_X(t)^N] = E(e^{N\log M_X(t)}) = M_N[\log M_X(t)]$$

(We must carefully note the values of $t$ for which this is defined.)

EXAMPLE H Compound Poisson Distribution
This example presents a model that occurs for certain chain reactions, or "cascade" processes. When a single primary electron, having been accelerated in an electrical field, hits a plate, several secondary electrons are produced. In a multistage multiplying tube, each of these secondary electrons hits another plate and thereby produces a number of tertiary electrons. The process can continue through several stages in this manner. Woodward (1948) considered models of this type in which the number of electrons produced by the impact of a single electron on the plate is random and, in particular, in which the number of secondary electrons follows a Poisson distribution. The number of electrons produced at the third stage is described by a random sum of the type just described, where $N$ is the number of secondary electrons and $X_i$ is the number of electrons produced by the $i$th secondary electron. Suppose that the $X_i$ are independent Poisson random variables with parameter $\lambda$ and that $N$ is a Poisson random variable with parameter $\mu$. According to the preceding result, the mgf of $S$, the total number of particles, is

$$M_S(t) = \exp[\mu(e^{\lambda(e^t - 1)} - 1)]$$ ■
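A simulation of this cascade confirms, for instance, the moments implied by the mgf (equivalently, by Examples D and E of Section 4.4.1: E(S) = μλ and Var(S) = μλ(1 + λ)). A sketch with hypothetical μ = 3 and λ = 4:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, lam, trials = 3.0, 4.0, 100_000

N = rng.poisson(mu, size=trials)                           # secondary electrons
S = np.array([rng.poisson(lam, size=n).sum() for n in N])  # tertiary totals

print("E(S):", S.mean())       # near mu * lam = 12
print("Var(S):", S.var())      # near mu * lam * (1 + lam) = 60
```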
By differentiating the mgf, we can find the moments of the probability mass function (see Problem 98 at the end of this chapter). If X and Y have a joint distribution, their joint moment-generating function is defined as MXY(s, t) = E(esX+tY) which is a function of two variables, s and t. If the joint mgf is defined on an open set containing the origin, it uniquely determines the joint distribution. The mgf of the marginal distribution of X alone is MX (s) = MXY(s, 0) and similarly for Y. It can be shown that X and Y are independent if and only if their joint mgf factors into the product of the mgf’s of the marginal distributions. E(XY) and other higher-order joint moments can be obtained from the joint mgf by differentiation. Analogous properties hold for the joint mgf of several random variables. The major limitation of the mgf is that it may not exist. The characteristic function of a random variable X is defined to be φ(t) = E(eitX) where i = √−1. In the continuous case, φ(t) = ∞ −∞ eitx f (x) dx This integral converges for all values of t, since |eitx|≤1. The characteristic func- tion is thus defined for all distributions. Its properties are similar to those of the mgf: Moments can be obtained by differentiation, the characteristic function changes simply under linear transformations, and the characteristic function of a sum of inde- pendent random variables is the product of their characteristic functions. But using the characteristic function requires some familiarity with the techniques of complex variables. 4.6 Approximate Methods In many applications, only the first two moments of a random variable, and not the entire probability distribution, are known, and even these may be known only approximately. We will see in Chapter 5 that repeated independent observations of a random variable allow reliable estimates to be made of its mean and variance. Suppose that we know the expectation and the variance of a random variable X but not the 162 Chapter 4 Expected Values entire distribution, and that we are interested in the mean and variance of Y = g(X) for some fixed function g. For example, we might be able to measure X and determine its mean and variance, but really be interested in Y, which is related to X in a known way. We might want to know Var(Y), at least approximately, in order to assess the accuracy of the indirect measurement process. From the results given in this chapter, we cannot in general find E(Y) = μY and Var(Y) = σ 2 Y from E(X) = μX and Var(X) = σ 2 X , unless the function g is linear. However, if g is nearly linear in a range in which X has high probability, it can be approximated by a linear function and approximate moments of Y can be found. In proceeding as just described, we follow a tack often taken in applied math- ematics: When confronted with a nonlinear problem that we cannot solve, we lin- earize. In probability and statistics, this method is called propagation of error, or the δ method. Linearization is carried out through a Taylor series expansion of g about μX . To the first order, Y = g(X) ≈ g(μX ) + (X − μX )g (μX ) We have expressed Y as approximately equal to a linear function of X. Recalling that if U = a + bV, then E(U) = a + bE(V ) and Var(U) = b2Var(V ),wefind μY ≈ g(μX ) σ 2 Y ≈ σ 2 X [g (μX )]2 We know that in general E(Y) = g(E(X)), as given by the approximation. 
In fact, we can carry out the Taylor series expansion to the second order to get an improved approximation of $\mu_Y$:
$$Y = g(X) \approx g(\mu_X) + (X - \mu_X) g'(\mu_X) + \tfrac{1}{2}(X - \mu_X)^2 g''(\mu_X)$$
Taking the expectation of the right-hand side, we have, since $E(X - \mu_X) = 0$,
$$E(Y) \approx g(\mu_X) + \tfrac{1}{2}\sigma_X^2 g''(\mu_X)$$
How good such approximations are depends on how nonlinear $g$ is in a neighborhood of $\mu_X$ and on the size of $\sigma_X$. From Chebyshev's inequality, we know that $X$ is unlikely to be many standard deviations away from $\mu_X$; if $g$ can be reasonably well approximated in this range by a linear function, the approximations for the moments will be reasonable as well.

EXAMPLE A
The relation of voltage, current, and resistance is $V = IR$. Suppose that the voltage is held constant at a value $V_0$ across a medium whose resistance fluctuates randomly as a result, say, of random fluctuations at the molecular level. The current therefore also varies randomly. Suppose that it can be determined experimentally to have mean $\mu_I \ne 0$ and variance $\sigma_I^2$. We wish to find the mean and variance of the resistance, $R$, and since we do not know the distribution of $I$, we must resort to an approximation. We have
$$R = g(I) = \frac{V_0}{I}, \qquad g'(\mu_I) = -\frac{V_0}{\mu_I^2}, \qquad g''(\mu_I) = \frac{2V_0}{\mu_I^3}$$
Thus,
$$\mu_R \approx \frac{V_0}{\mu_I} + \frac{V_0}{\mu_I^3}\sigma_I^2$$
$$\sigma_R^2 \approx \frac{V_0^2}{\mu_I^4}\sigma_I^2$$
We see that the variability of $R$ depends on both the mean level of $I$ and the variance of $I$. This makes sense, since if $I$ is quite small, small variations in $I$ will result in large variations in $R = V_0/I$, whereas if $I$ is large, small variations will not affect $R$ as much. The second-order correction factor for $\mu_R$ also depends on $\mu_I$ and is large if $\mu_I$ is small. In fact, when $I$ is near zero, the function $g(I) = V_0/I$ is quite nonlinear, and the linearization is not a good approximation. ■

EXAMPLE B
This example examines the accuracy of the approximations using a simple test case. We choose the function $g(x) = \sqrt{x}$ and consider two cases: $X$ uniform on $[0, 1]$, and $X$ uniform on $[1, 2]$. The graph of $g(x)$ in Figure 4.9 shows that $g$ is more nearly linear in the latter case, so we would expect the approximations to work better there.

FIGURE 4.9 The function $g(x) = \sqrt{x}$ is more nearly linear over the interval $[1, 2]$ than over the interval $[0, 1]$.

Let $Y = \sqrt{X}$; because $X$ is uniform on $[0, 1]$,
$$E(Y) = \int_0^1 \sqrt{x}\,dx = \frac{2}{3}$$
and
$$E(Y^2) = \int_0^1 x\,dx = \frac{1}{2}$$
so
$$\mathrm{Var}(Y) = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}$$
and $\sigma_Y = .236$. These results are exact. Using the approximation method, we first calculate
$$g'(x) = \frac{1}{2}x^{-1/2}, \qquad g''(x) = -\frac{1}{4}x^{-3/2}$$
Since $X$ is uniform on $[0, 1]$, $\mu_X = \frac{1}{2}$, and evaluating the derivatives at this value gives us
$$g'(\mu_X) = \frac{\sqrt{2}}{2}, \qquad g''(\mu_X) = -\frac{\sqrt{2}}{2}$$
We know that $\mathrm{Var}(X) = \frac{1}{12}$ for a random variable uniform on $[0, 1]$, so the approximations are
$$E(Y) \approx \sqrt{\frac{1}{2}} - \frac{1}{2} \cdot \frac{1}{12} \cdot \frac{\sqrt{2}}{2} = .678$$
$$\mathrm{Var}(Y) \approx \frac{1}{12} \times \frac{1}{2} = .042$$
$$\sigma_Y \approx .204$$
The approximation to the mean is .678; compared to the actual value of .667, it is off by about 1.6%. The approximation to the standard deviation is .204; compared to the actual value of .236, it is off by 13%.

Now let us consider the case in which $X$ is uniform on $[1, 2]$. Proceeding as before, we find that $Y = \sqrt{X}$ has mean 1.219. The variance and standard deviation are .0142 and .119, respectively. To compare these to the approximations, we note that $\mu_X = \frac{3}{2}$ and $\mathrm{Var}(X) = \frac{1}{12}$ (the random variable uniform on $[1, 2]$ can be obtained by adding the constant 1 to a random variable uniform on $[0, 1]$; compare with Theorem A in Section 4.2). We find
$$g'(\mu_X) = .408, \qquad g''(\mu_X) = -.136$$
so the approximations are
$$E(Y) \approx \sqrt{\frac{3}{2}} - \frac{1}{2} \cdot \frac{1}{12} \cdot .136 = 1.219$$
$$\mathrm{Var}(Y) \approx \frac{.408^2}{12} = .0138$$
$$\sigma_Y \approx .118$$
These values are much closer to the actual values than are the approximations for the first case. ■
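Both cases of Example B can be checked numerically. The sketch below, in Python with numpy (an illustration, not part of the text), computes the second-order mean and first-order variance approximations and compares them with simulated exact moments:

```python
import numpy as np

def approx_moments(g, dg, d2g, mu_x, var_x):
    """Second-order mean and first-order variance from the delta method."""
    mean2 = g(mu_x) + 0.5 * var_x * d2g(mu_x)   # second-order mean approximation
    var1 = var_x * dg(mu_x) ** 2                # first-order variance approximation
    return mean2, var1

g = np.sqrt
dg = lambda x: 0.5 * x ** -0.5
d2g = lambda x: -0.25 * x ** -1.5

rng = np.random.default_rng(1)
for a, b in [(0.0, 1.0), (1.0, 2.0)]:
    mu_x, var_x = (a + b) / 2, (b - a) ** 2 / 12   # moments of a uniform [a, b] variable
    m_approx, v_approx = approx_moments(g, dg, d2g, mu_x, var_x)
    y = g(rng.uniform(a, b, size=1_000_000))
    print(f"[{a},{b}]: mean {m_approx:.3f} vs {y.mean():.3f}, "
          f"sd {np.sqrt(v_approx):.3f} vs {y.std():.3f}")
# Output close to the text: .678 vs .667 and .204 vs .236 on [0, 1];
# 1.219 vs 1.219 and .118 vs .119 on [1, 2].
```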
Suppose that we have $Z = g(X, Y)$, a function of two variables. We can again carry out Taylor series expansions to approximate the mean and variance of $Z$. To the first order, letting $\mu$ denote the point $(\mu_X, \mu_Y)$,
$$Z = g(X, Y) \approx g(\mu) + (X - \mu_X)\frac{\partial g(\mu)}{\partial x} + (Y - \mu_Y)\frac{\partial g(\mu)}{\partial y}$$
The notation $\partial g(\mu)/\partial x$ means that the derivative is evaluated at the point $\mu$. Here $Z$ is expressed as approximately equal to a linear function of $X$ and $Y$, and the mean and variance of this linear function are easily calculated to be
$$E(Z) \approx g(\mu)$$
and
$$\mathrm{Var}(Z) \approx \sigma_X^2 \left[\frac{\partial g(\mu)}{\partial x}\right]^2 + \sigma_Y^2 \left[\frac{\partial g(\mu)}{\partial y}\right]^2 + 2\sigma_{XY}\frac{\partial g(\mu)}{\partial x}\frac{\partial g(\mu)}{\partial y}$$
(For the latter calculation, see Corollary A in Section 4.3.) As is the case with a single variable, a second-order expansion can be used to obtain an improved estimate of $E(Z)$:
$$Z = g(X, Y) \approx g(\mu) + (X - \mu_X)\frac{\partial g(\mu)}{\partial x} + (Y - \mu_Y)\frac{\partial g(\mu)}{\partial y} + \frac{1}{2}(X - \mu_X)^2\frac{\partial^2 g(\mu)}{\partial x^2} + \frac{1}{2}(Y - \mu_Y)^2\frac{\partial^2 g(\mu)}{\partial y^2} + (X - \mu_X)(Y - \mu_Y)\frac{\partial^2 g(\mu)}{\partial x\,\partial y}$$
Taking expectations term by term on the right-hand side yields
$$E(Z) \approx g(\mu) + \frac{1}{2}\sigma_X^2\frac{\partial^2 g(\mu)}{\partial x^2} + \frac{1}{2}\sigma_Y^2\frac{\partial^2 g(\mu)}{\partial y^2} + \sigma_{XY}\frac{\partial^2 g(\mu)}{\partial x\,\partial y}$$
The general case of a function of $n$ variables can be worked out similarly; the basic concepts are illustrated by the two-variable case.

EXAMPLE C Expectation and Variance of a Ratio
Let us consider the case where $Z = Y/X$, which arises frequently in practice. For example, a chemist might measure the concentrations of two substances, both with some measurement error that is indicated by their standard deviations, and then report the relative concentrations in the form of a ratio. What is the approximate standard deviation of the ratio, $Z$? Using the method of propagation of error derived above, for $g(x, y) = y/x$, we have
$$\frac{\partial g}{\partial x} = -\frac{y}{x^2} \qquad \frac{\partial g}{\partial y} = \frac{1}{x} \qquad \frac{\partial^2 g}{\partial x^2} = \frac{2y}{x^3} \qquad \frac{\partial^2 g}{\partial y^2} = 0 \qquad \frac{\partial^2 g}{\partial x\,\partial y} = -\frac{1}{x^2}$$
Evaluating these derivatives at $(\mu_X, \mu_Y)$ and using the preceding result, we find, if $\mu_X \ne 0$,
$$E(Z) \approx \frac{\mu_Y}{\mu_X} + \sigma_X^2\frac{\mu_Y}{\mu_X^3} - \frac{\sigma_{XY}}{\mu_X^2} = \frac{\mu_Y}{\mu_X} + \frac{1}{\mu_X^2}\left(\sigma_X^2\frac{\mu_Y}{\mu_X} - \rho\sigma_X\sigma_Y\right)$$
From this equation, we see that the difference between $E(Z)$ and $\mu_Y/\mu_X$ depends on several factors. If $\sigma_X$ and $\sigma_Y$ are small—that is, if the two concentrations are measured quite accurately—the difference is small. If $\mu_X$ is small, the difference is relatively large. Finally, correlation between $X$ and $Y$ affects the difference. We now consider the variance. Again using the preceding result and evaluating the partial derivatives at $(\mu_X, \mu_Y)$, we find
$$\mathrm{Var}(Z) \approx \sigma_X^2\frac{\mu_Y^2}{\mu_X^4} + \frac{\sigma_Y^2}{\mu_X^2} - 2\sigma_{XY}\frac{\mu_Y}{\mu_X^3} = \frac{1}{\mu_X^2}\left(\sigma_X^2\frac{\mu_Y^2}{\mu_X^2} + \sigma_Y^2 - 2\rho\sigma_X\sigma_Y\frac{\mu_Y}{\mu_X}\right)$$
From this equation, we see that the ratio is quite variable when $\mu_X$ is small, paralleling the results in Example A, and that correlation between $X$ and $Y$, if of the same sign as $\mu_Y/\mu_X$, decreases $\mathrm{Var}(Z)$. ■

4.7 Problems

1. Show that if a random variable is bounded—that is, $|X| < M < \infty$—then $E(X)$ exists.

2. If $X$ is a discrete uniform random variable—that is, $P(X = k) = 1/n$ for $k = 1, 2, \ldots, n$—find $E(X)$ and $\mathrm{Var}(X)$.

3. Find $E(X)$ and $\mathrm{Var}(X)$ for Problem 3 in Chapter 2.

4. Let $X$ have the cdf $F(x) = 1 - x^{-\alpha}$, $x \ge 1$.
a. Find $E(X)$ for those values of $\alpha$ for which $E(X)$ exists.
b. Find $\mathrm{Var}(X)$ for those values of $\alpha$ for which it exists.

5. Let $X$ have the density
$$f(x) = \frac{1 + \alpha x}{2}, \qquad -1 \le x \le 1, \quad -1 \le \alpha \le 1$$
Find $E(X)$ and $\mathrm{Var}(X)$.

6.
Let X be a continuous random variable with probability density function f (x) = 2x,0≤ x ≤ 1. a. Find E(X). b. Let Y = X 2. Find the probability mass function of Y and use it to find E(Y). c. Use Theorem A in Section 4.1.1 to find E(X 2) and compare to your answer in part (b). d. Find Var(X) according to the definition of variance given in Section 4.2. Also find Var(X) by using Theorem B of Section 4.2. 7. Let X be a discrete random variable that takes on values 0, 1, 2 with probabilities 1 2 , 3 8 , 1 8 , respectively. a. Find E(X). b. Let Y = X 2. Find the probability mass function of Y and use it to find E(Y). c. Use Theorem A of Section 4.1.1 to find E(X 2) and compare to your answer in part (b). d. Find Var(X) according to the definition of variance given in Section 4.2. Also find Var(X) by using Theorem B in Section 4.2. 8. Show that if X is a discrete random variable, taking values on the positive integers, then E(X) = ∞ k=1 P(X ≥ k). Apply this result to find the expected value of a geometric random variable. 9. This is a simplified inventory problem. Suppose that it costs c dollars to stock an item and that the item sells for s dollars. Suppose that the number of items that will be asked for by customers is a random variable with the frequency function p(k). Find a rule for the number of items that should be stocked in order to maximize the expected income. (Hint: Consider the difference of successive terms.) 10. A list of n items is arranged in random order; to find a requested item, they are searched sequentially until the desired item is found. What is the expected number of items that must be searched through, assuming that each item is equally likely to be the one requested? (Questions of this nature arise in the design of computer algorithms.) 11. Referring to Problem 10, suppose that the items are not equally likely to be requested but have known probabilities p1, p2, ..., pn. Suggest an alternative searching procedure that will decrease the average number of items that must be searched through, and show that in fact it does so. 12. If X is a continuous random variable with a density that is symmetric about some point, ξ, show that E(X) = ξ, provided that E(X) exists. 13. If X is a nonnegative continuous random variable, show that E(X) = ∞ 0 [1 − F(x)] dx Apply this result to find the mean of the exponential distribution. 168 Chapter 4 Expected Values 14. Let X be a continuous random variable with the density function f (x) = 2x, 0 ≤ x ≤ 1 a. Find E(X). b. Find E(X 2) and Var(X). 15. Suppose that two lotteries each have n possible numbers and the same payoff. In terms of expected gain, is it better to buy two tickets from one of the lotteries or one from each? 16. Suppose that E(X) = μ and Var(X) = σ 2. Let Z = (X − μ)/σ. Show that E(Z) = 0 and Var(Z) = 1. 17. Find (a) the expectation and (b) the variance of the kth-order statistic of a sample of n independent random variables uniform on [0, 1]. The density function is given in Example C in Section 3.7. 18. If U1, ..., Un are independent uniform random variables, find E(U(n) −U(1)). 19. Find E(U(k) − U(k−1)), where the U(i) are as in Problem 18. 20. A stick of unit length is broken into two pieces. Find the expected ratio of the length of the longer piece to the length of the shorter piece. 21. A random square has a side length that is a uniform [0, 1] random variable. Find the expected area of the square. 22. A random rectangle has sides the lengths of which are independent uniform random variables. 
Find the expected area of the rectangle, and compare this result to that of Problem 21. 23. Repeat Problems 21 and 22 assuming that the distribution of the lengths is exponential. 24. Prove Theorem A of Section 4.1.2 for the discrete case. 25. If X1 and X2 are independent random variables following a gamma distribution with parameters α and λ, find E(R2), where R2 = X 2 1 + X 2 2. 26. Referring to Example B in Section 4.1.2, what is the expected number of coupons needed to collect r different types, where r < n? 27. If n men throw their hats into a pile and each man takes a hat at random, what is the expected number of matches? (Hint: Express the number as a sum.) 28. Suppose that n enemy aircraft are shot at simultaneously by m gunners, that each gunner selects an aircraft to shoot at independently of the other gunners, and that each gunner hits the selected aircraft with probability p. Find the expected number of aircraft hit by the gunners. 29. Prove Corollary A of Section 4.1.1. 30. Find E[1/(X + 1)], where X is a Poisson random variable. 4.7 Problems 169 31. Let X be uniformly distributed on the interval [1, 2]. Find E(1/X).IsE(1/X) = 1/E(X)? 32. Let X have a gamma distribution with parameters α and λ. For those values of α and λ for which it is defined, find E(1/X). 33. Prove Chebyshev’s inequality in the discrete case. 34. Let X be uniform on [0, 1], and let Y = √ X. Find E(Y) by (a) finding the density of Y and then finding the expectation and (b) using Theorem A of Section 4.1.1. 35. Find the mean of a negative binomial random variable. (Hint: Express the random variable as a sum.) 36. Consider the following scheme for group testing. The original lot of samples is divided into two groups, and each of the subgroups is tested as a whole. If either subgroup tests positive, it is divided in two, and the procedure is repeated. If any of the groups thus obtained tests positive, test every member of that group. Find the expected number of tests performed, and compare it to the number performed with no grouping and with the scheme described in Example C in Section 4.1.2. 37. For what values of p is the group testing of Example C in Section 4.1.2 inferior to testing every individual? 38. This problem continues Example A of Section 4.1.2. a. What is the probability that a fragment is the leftmost member of a contig? b. What is the expected number of fragments that are leftmost members of contigs? c. What is the expected number of contigs? 39. Suppose that a segment of DNA of length 1,000,000 is to be shotgun sequenced with fragments of length 1000. a. How many fragment would be needed so that the chance of an individual site being covered is greater than 0.99? b. With this choice, how many sites would you expect to be missed? 40. A child types the letters Q, W, E, R, T, Y, randomly producing 1000 letters in all. What is the expected number of times that the sequence QQQQ appears, counting overlaps? 41. Continuing with the previous problem, how many times would we expect the word “TRY” to appear? Would we be surprised if it occurred 100 times? (Hint: Consider Markov’s inequality.) 42. Let X be an exponential random variable with standard deviation σ. Find P(|X − E(X)| > kσ)for k = 2, 3, 4, and compare the results to the bounds from Chebyshev’s inequality. 43. Show that Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y). 170 Chapter 4 Expected Values 44. If X and Y are independent random variables with equal variances, find Cov(X + Y, X − Y). 45. 
Find the covariance and the correlation of Ni and N j , where N1, N2, ..., Nr are multinomial random variables. (Hint: Express them as sums.) 46. If U = a + bX and V = c + dY, show that |ρUV|=|ρXY|. 47. If X and Y are independent random variables and Z = Y − X, find expressions for the covariance and the correlation of X and Z in terms of the variances of X and Y. 48. Let U and V be independent random variables with means μ and variances σ 2. Let Z = αU + V √ 1 − α2. Find E(Z) and ρUZ. 49. Two independent measurements, X and Y, are taken of a quantity μ. E(X) = E(Y) = μ,butσX and σY are unequal. The two measurements are combined by means of a weighted average to give Z = αX + (1 − α)Y where α is a scalar and 0 ≤ α ≤ 1. a. Show that E(Z) = μ. b. Find α in terms of σX and σY to minimize Var(Z). c. Under what circumstances is it better to use the average (X + Y)/2 than either X or Y alone? 50. Suppose that Xi , where i = 1, ..., n, are independent random variables with E(Xi ) = μ and Var(Xi ) = σ 2. Let X = n−1 n i=1 Xi . Show that E(X) = μ and Var(X) = σ 2/n. 51. Continuing Example E in Section 4.3, suppose there are n securities, each with the same expected return, that all the returns have the same standard deviations, and that the returns are uncorrelated. What is the optimal portfolio vector? Plot the risk of the optimal portfolio versus n. How does this risk compare to that incurred by putting all your money in one security? 52. Consider two securities, the first having μ1 = 1 and σ1 = 0.1, and the second having μ2 = 0.8 and σ2 = 0.12. Suppose that they are negatively correlated, with ρ =−0.8. a. If you could only invest in one security, which one would you choose, and why? b. Suppose you invest 50% of your money in each of the two. What is your expected return and what is your risk? c. If you invest 80% of your money in security 1 and 20% in security 2, what is your expected return and your risk? d. Denote the expected return and its standard deviation as functions of π by μ(π) and σ(π). The pair (μ(π), σ (π)) trace out a curve in the plane as π varies from 0 to 1. Plot this curve. e. Repeat b–d if the correlation is ρ = 0.1. 53. Show that Cov(X, Y) ≤ √ Var(X)Var(Y). 4.7 Problems 171 54. Let X, Y, and Z be uncorrelated random variables with variances σ 2 X , σ 2 Y , and σ 2 Z , respectively. Let U = Z + X V = Z + Y Find Cov(U, V ) and ρUV. 55. Let T = n k=1 kXk, where the Xk are independent random variables with means μ and variances σ 2. Find E(T ) and Var(T ). 56. Let S = n k=1 Xk, where the Xk are as in Problem 55. Find the covariance and the correlation of S and T . 57. If X and Y are independent random variables, find Var(XY) in terms of the means and variances of X and Y. 58. A function is measured at two points with some error (for example, the position of an object is measured at two times). Let X1 = f (x) + ε1 X2 = f (x + h) + ε2 where ε1 and ε2 are independent random variables with mean zero and variance σ 2. Since the derivative of f is lim h→0 f (x + h) − f (x) h it is estimated by Z = X2 − X1 h a. Find E(Z) and Var(Z). What is the effect of choosing a value of h that is very small, as is suggested by the definition of the derivative? b. Find an approximation to the mean squared error of Z as an estimate of f (x) using a Taylor series expansion. Can you find the value of h that minimizes the mean squared error? c. Suppose that f is measured at three points with some error. 
How could you construct an estimate of the second derivative of f , and what are the mean and the variance of your estimate? 59. Let (X, Y) be a random point uniformly distributed on a unit disk. Show that Cov(X, Y) = 0, but that X and Y are not independent. 60. Let Y have a density that is symmetric about zero, and let X = SY, where S is an independent random variable taking on the values +1 and −1 with probability 1 2 each. Show that Cov(X, Y) = 0, but that X and Y are not independent. 61. In Section 3.7, the joint density of the minimum and maximum of n independent uniform random variables was found. In the case n = 2, this amounts to X and Y, the minimum and maximum, respectively, of two independent random 172 Chapter 4 Expected Values variables uniform on [0, 1], having the joint density f (x, y) = 2, 0 ≤ x ≤ y a. Find the covariance and the correlation of X and Y. Does the sign of the correlation make sense intuitively? b. Find E(X|Y = y)and E(Y|X = x). Do these results make sense intuitively? c. Find the probability density functions of the random variables E(X|Y) and E(Y|X). d. What is the linear predictor of Y in terms of X (denoted by ˆY = a + bX) that has minimal mean squared error? What is the mean square prediction error? e. What is the predictor of Y in terms of X [ ˆY = h(X)] that has minimal mean squared error? What is the mean square prediction error? 62. Let X and Y have the joint distribution given in Problem 1 of Chapter 3. a. Find the covariance and correlation of X and Y. b. Find E(Y|X = x) for x = 1, 2, 3, 4. Find the probability mass function of the random variable E(Y|X). 63. Let X and Y have the joint distribution given in Problem 8 of Chapter 3. a. Find the covariance and correlation of X and Y. b. Find E(Y|X = x) for 0 ≤ x ≤ 1. 64. Let X and Y be jointly distributed random variables with correlation ρXY; define the standardized random variables ˜X and ˜Y as ˜X = (X − E(X))/√ Var(X) and ˜Y = (Y − E(Y))/√ Var(Y). Show that Cov( ˜X, ˜Y) = ρXY. 65. How has the assumption that N and the Xi are independent been used in Example D of Section 4.4.1? 66. A building contains two elevators, one fast and one slow. The average waiting time for the slow elevator is 3 min. and the average waiting time of the fast elevator is 1 min. If a passenger chooses the fast elevator with probability 2 3 and the slow elevator with probability 1 3 , what is the expected waiting time? (Use the law of total expectation, Theorem A of Section 4.4.1, defining appropriate random variables X and Y.) 67. A random rectangle is formed in the following way: The base, X, is chosen to be a uniform [0, 1] random variable and after having generated the base, the height is chosen to be uniform on [0, X]. Use the law of total expectation, Theorem A of Section 4.4.1, to find the expected circumference and area of the rectangle. 68. Show that E[Var(Y|X)] ≤ Var(Y). 69. Suppose that a bivariate normal distribution has μX = μY = 0 and σX = σY = 1. Sketch the contours of the density and the lines E(Y|X = x) and E(X|Y = y) for ρ = 0, .5, and .9. 4.7 Problems 173 70. If X and Y are independent, show that E(X|Y = y) = E(X). 71. Let X be a binomial random variable representing the number of successes in n independent Bernoulli trials. Let Y be the number of successes in the first m trials, where m < n. Find the conditional frequency function of Y given X = x and the conditional mean. 72. An item is present in a list of n items with probability p; if it is present, its posi- tion in the list is uniformly distributed. 
A computer program searches through the list sequentially. Find the expected number of items searched through before the program terminates. 73. A fair coin is tossed n times, and the number of heads, N, is counted. The coin is then tossed N more times. Find the expected total number of heads generated by this process. 74. The number of offspring of an organism is a discrete random variable with mean μ and variance σ 2. Each of its offspring reproduces in the same man- ner. Find the expected number of offspring in the third generation and its variance. 75. Let T be an exponential random variable, and conditional on T , letU be uniform on [0, T ]. Find the unconditional mean and variance of U. 76. Let the point (X, Y) be uniformly distributed over the half disk x2 + y2 ≤ 1, where y ≥ 0. If you observe X, what is the best prediction for Y? If you observe Y, what is the best prediction for X? For both questions, “best” means having the minimum mean squared error. 77. Let X and Y have the joint density f (x, y) = e−y, 0 ≤ x ≤ y a. Find Cov(X, Y) and the correlation of X and Y. b. Find E(X|Y = y) and E(Y|X = x). c. Find the density functions of the random variables E(X|Y) and E(Y|X). 78. Show that if a density is symmetric about zero, its skewness is zero. 79. Let X be a discrete random variable that takes on values 0, 1, 2 with probabilities 1 2 , 3 8 , 1 8 , respectively. Find the moment-generating function of X, M(t), and verify that E(X) = M (0) and that E(X 2) = M (0). 80. Let X be a continuous random variable with density function f (x) = 2x, 0 ≤ x ≤ 1. Find the moment-generating function of X, M(t), and verify that E(X) = M (0) and that E(X 2) = M (0). 81. Find the moment-generating function of a Bernoulli random variable, and use it to find the mean, variance, and third moment. 82. Use the result of Problem 81 to find the mgf of a binomial random variable and its mean and variance. 174 Chapter 4 Expected Values 83. Show that if Xi follows a binomial distribution with ni trials and probability of success pi = p, where i = 1, ..., n and the Xi are independent, then n i=1 Xi follows a binomial distribution. 84. Referring to Problem 83, show that if the pi are unequal, the sum does not follow a binomial distribution. 85. Find the mgf of a geometric random variable, and use it to find the mean and the variance. 86. Use the result of Problem 85 to find the mgf of a negative binomial random variable and its mean and variance. 87. Under what conditions is the sum of independent negative binomial random variables also negative binomial? 88. Let X and Y be independent random variables, and let α and β be scalars. Find an expression for the mgf of Z = αX + βY in terms of the mgf’s of X and Y. 89. Let X1, X2, ..., Xn be independent normal random variables with means μi and variances σ 2 i . Show that Y = n i=1 αi Xi , where the αi are scalars, is normally distributed, and find its mean and variance. (Hint: Use moment- generating functions.) 90. Assuming that X ∼ N(0,σ2), use the mgf to show that the odd moments are zero and the even moments are μ2n = (2n)!σ 2n 2n(n!) 91. Use the mgf to show that if X follows an exponential distribution, cX (c > 0) does also. 92. Suppose that  is a random variable that follows a gamma distribution with pa- rameters λ and α, where α is an integer, and suppose that, conditional on , X follows a Poisson distribution with parameter . Find the uncondi- tional distribution of α + X.(Hint: Find the mgf by using iterated conditional expectations.) 93. 
Find the distribution of a geometric sum of exponential random variables by using moment-generating functions. 94. If X is a nonnegative integer-valued random variable, the probability- generating function of X is defined to be G(s) = ∞ k=0 sk pk where pk = P(X = k). a. Show that pk = 1 k! dk dsk G(s) s=0 4.7 Problems 175 b. Show that dG ds s=1 = E(X) d2G ds2 s=1 = E[X(X − 1)] c. Express the probability-generating function in terms of moment-generating function. d. Find the probability-generating function of the Poisson distribution. 95. Show that if X and Y are independent, their joint moment-generating function factors. 96. Show how to find E(XY) from the joint moment-generating function of X and Y. 97. Use moment-generating functions to show that if X and Y are independent, then Var(aX + bY) = a2Var(X) + b2Var(Y) 98. Find the mean and variance of the compound Poisson distribution (Example H in Section 4.5). 99. Find expressions for the approximate mean and variance of Y = g(X) for (a) g(x) = √ x, (b) g(x) = log x, and (c) g(x) = sin−1 x. 100. If X is uniform on [10, 20], find the approximate and exact mean and variance of Y = 1/X, and compare them. 101. Find the approximate mean and variance of Y = √ X, where X is a random variable following a Poisson distribution. 102. Two sides, x0 and y0, of a right triangle are independently measured as X and Y, where E(X) = x0 and E(Y) = y0 and Var(X) = Var(Y) = σ 2. The angle between the two sides is then determined as  = tan−1 Y X Find the approximate mean and variance of . 103. The volume of a bubble is estimated by measuring its diameter and using the relationship V = π 6 D3 Suppose that the true diameter is 2 mm and that the standard deviation of the measurement of the diameter is .01 mm. What is the approximate standard deviation of the estimated volume? 104. The position of an aircraft relative to an observer on the ground is estimated by measuring its distance r from the observer and the angle θ that the line of 176 Chapter 4 Expected Values sight from the observer to the aircraft makes with the horizontal. Suppose that the measurements, denoted by R and , are subject to random errors and are independent of each other. The altitude of the aircraft is then estimated to be Y = R sin . a. Find an approximate expression for the variance of Y. b. For given r, at what value of θ is the estimated altitude most variable? CHAPTER 5 Limit Theorems 5.1 Introduction This chapter is principally concerned with the limiting behavior of the sum of inde- pendent random variables as the number of summands becomes large. The results presented here are both intrinsically interesting and useful in statistics, since many commonly computed statistical quantities, such as averages, can be represented as sums. 5.2 The Law of Large Numbers It is commonly believed that if a fair coin is tossed many times and the proportion of heads is calculated, that proportion will be close to 1 2 . John Kerrich, a South African mathematician, tested this belief empirically while detained as a prisoner during World War II. He tossed a coin 10,000 times and observed 5067 heads. The law of large numbers is a mathematical formulation of this belief. The successive tosses of the coin are modeled as independent random trials. 
The random variable $X_i$ takes on the value 0 or 1 according to whether the $i$th trial results in a tail or a head, and the proportion of heads in $n$ trials is
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$
The law of large numbers states that $\bar{X}_n$ approaches $\frac{1}{2}$ in a sense that is specified by the following theorem.

THEOREM A Law of Large Numbers
Let $X_1, X_2, \ldots, X_i, \ldots$ be a sequence of independent random variables with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then, for any $\varepsilon > 0$,
$$P(|\bar{X}_n - \mu| > \varepsilon) \to 0 \quad \text{as } n \to \infty$$

Proof
We first find $E(\bar{X}_n)$ and $\mathrm{Var}(\bar{X}_n)$:
$$E(\bar{X}_n) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \mu$$
Since the $X_i$ are independent,
$$\mathrm{Var}(\bar{X}_n) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{\sigma^2}{n}$$
The desired result now follows immediately from Chebyshev's inequality, which states that
$$P(|\bar{X}_n - \mu| > \varepsilon) \le \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0, \quad \text{as } n \to \infty$$ ■

In the case of a fair coin toss, the $X_i$ are Bernoulli random variables with $p = 1/2$, $E(X_i) = 1/2$, and $\mathrm{Var}(X_i) = 1/4$. If the coin is tossed 10,000 times, $\mathrm{Var}(\bar{X}_{10{,}000}) = 2.5 \times 10^{-5}$, and the standard deviation of the average is the square root of the variance, 0.005. The proportion observed by Kerrich, 0.5067, is thus a little more than one standard deviation away from its expected value of 0.5, consistent with Chebyshev's inequality. (Recall from Section 4.2 that Chebyshev's inequality can be written in the form $P(|\bar{X}_n - \mu| > k\sigma) \le 1/k^2$.)

If a sequence of random variables, $\{Z_n\}$, is such that $P(|Z_n - \alpha| > \varepsilon)$ approaches zero as $n$ approaches infinity, for any $\varepsilon > 0$ and where $\alpha$ is some scalar, then $Z_n$ is said to converge in probability to $\alpha$. There is another mode of convergence, called strong convergence or almost sure convergence, which asserts more. $Z_n$ is said to converge almost surely to $\alpha$ if, for every $\varepsilon > 0$, $|Z_n - \alpha| > \varepsilon$ only a finite number of times with probability 1; that is, beyond some point in the sequence, the difference is always less than $\varepsilon$, but where that point is, is random. The version of the law of large numbers stated and proved earlier asserts that $\bar{X}_n$ converges to $\mu$ in probability; this version is usually called the weak law of large numbers. Under the same assumptions, a strong law of large numbers, which asserts that $\bar{X}_n$ converges almost surely to $\mu$, can also be proved, but we will not do so.

We now consider some examples that illustrate the utility of the law of large numbers.

EXAMPLE A Monte Carlo Integration
Suppose that we wish to calculate
$$I(f) = \int_0^1 f(x)\,dx$$
where the integration cannot be done by elementary means or evaluated using tables of integrals. The most common approach is to use a numerical method in which the integral is approximated by a sum; various schemes and computer packages exist for doing this. Another method, called the Monte Carlo method, works in the following way. Generate independent uniform random variables on $[0, 1]$—that is, $X_1, X_2, \ldots, X_n$—and compute
$$\hat{I}(f) = \frac{1}{n}\sum_{i=1}^n f(X_i)$$
By the law of large numbers, this should be close to $E[f(X)]$, which is simply
$$E[f(X)] = \int_0^1 f(x)\,dx = I(f)$$
This simple scheme can be easily modified in order to change the range of integration and in other ways. Compared to the standard numerical methods, it is not especially efficient in one dimension, but it becomes increasingly efficient as the dimensionality of the integral grows. As a concrete example, let us consider the evaluation of
$$I(f) = \frac{1}{\sqrt{2\pi}}\int_0^1 e^{-x^2/2}\,dx$$
The integral is that of the standard normal density, which cannot be evaluated in closed form. From the table of the normal distribution (Table 2 in Appendix B), an accurate numerical approximation is $I(f) = .3413$. If 1000 points, $X_1, \ldots, X_{1000}$, uniformly distributed over the interval $0 \le x \le 1$, are generated using a pseudorandom number generator, the integral is then approximated by
$$\hat{I}(f) = \frac{1}{1000} \cdot \frac{1}{\sqrt{2\pi}}\sum_{i=1}^{1000} e^{-X_i^2/2}$$
which produced for one realization of the $X_i$ the value .3417. ■
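Readers who wish to experiment can reproduce this computation with a short program. The following sketch is in Python with numpy (an arbitrary choice of tools; the seed and sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_integral(f, n):
    """Monte Carlo estimate of the integral of f over [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    return f(x).mean()

f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
print(mc_integral(f, 1000))       # one realization; should land near .3413
print(mc_integral(f, 1_000_000))  # a larger n gives a closer estimate, by the LLN
```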
EXAMPLE B Repeated Measurements
Suppose that repeated independent unbiased measurements, $X_1, \ldots, X_n$, of a quantity are made. If $n$ is large, the law of large numbers says that $\bar{X}$ will be close to the true value, $\mu$, of the quantity, but how close $\bar{X}$ is depends not only on $n$ but on the variance of the measurement error, $\sigma^2$, as can be seen in the proof of Theorem A. Fortunately, $\sigma^2$ can be estimated, and therefore
$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
can be estimated from the data to assess the precision of $\bar{X}$. First, note that $n^{-1}\sum_{i=1}^n X_i^2$ converges to $E(X^2)$, from the law of large numbers. Second, it can be shown that if $Z_n$ converges to $\alpha$ in probability and $g$ is a continuous function, then $g(Z_n) \to g(\alpha)$, which implies that $\bar{X}^2 \to [E(X)]^2$. Finally, since $n^{-1}\sum_{i=1}^n X_i^2$ converges to $E(X^2)$ and $\bar{X}^2$ converges to $[E(X)]^2$, with a little additional argument it can be shown that
$$\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 \to E(X^2) - [E(X)]^2 = \mathrm{Var}(X)$$
More generally, it follows from the law of large numbers that the sample moments, $n^{-1}\sum_{i=1}^n X_i^r$, converge in probability to the moments of $X$, $E(X^r)$. ■

EXAMPLE C
A muscle or nerve cell membrane contains a very large number of channels; when open, these channels allow ions to pass through. Individual channels seem to open and close randomly, and it is often assumed that in an equilibrium situation the channels open and close independently of each other and that only a very small fraction are open at any one time. Suppose then that the probability that a channel is open is $p$, a very small number; that there are $m$ channels in all; and that the amount of current flowing through an individual channel is $c$. The number of channels open at any particular time is $N$, a binomial random variable with $m$ trials and probability $p$ of success on each trial. The total amount of current is $S = cN$ and can be measured. We then have
$$E(S) = cE(N) = cmp$$
$$\mathrm{Var}(S) = c^2 mp(1 - p)$$
and
$$\frac{\mathrm{Var}(S)}{E(S)} = c(1 - p) \approx c$$
since $p$ is small. Thus, through independent measurements, $S_1, \ldots, S_n$, we can estimate $E(S)$ and $\mathrm{Var}(S)$ and therefore $c$, the amount of current flowing through a single channel, without knowing how many channels there are. ■

5.3 Convergence in Distribution and the Central Limit Theorem

In applications, we often want to find $P(a < X < b)$ when we do not know the cdf of $X$ precisely; it is sometimes possible to do this by approximating $F_X$. The approximation is often arrived at by some sort of limiting argument. The most famous limit theorem in probability theory is the central limit theorem, which is the main topic of this section. Before discussing the central limit theorem, we develop some introductory terminology, theory, and examples.

DEFINITION
Let $X_1, X_2, \ldots$ be a sequence of random variables with cumulative distribution functions $F_1, F_2, \ldots$, and let $X$ be a random variable with distribution function $F$. We say that $X_n$ converges in distribution to $X$ if $\lim_{n\to\infty} F_n(x) = F(x)$ at every point at which $F$ is continuous. ■

Moment-generating functions are often useful for establishing the convergence of distribution functions.
We know from Property A of Section 4.5 that a distribution function $F_n$ is uniquely determined by its mgf, $M_n$. The following theorem, which we give without proof, states that this unique determination holds for limits as well.

THEOREM A Continuity Theorem
Let $F_n$ be a sequence of cumulative distribution functions with the corresponding moment-generating functions $M_n$. Let $F$ be a cumulative distribution function with the moment-generating function $M$. If $M_n(t) \to M(t)$ for all $t$ in an open interval containing zero, then $F_n(x) \to F(x)$ at all continuity points of $F$. ■

EXAMPLE A
We will show that the Poisson distribution can be approximated by the normal distribution for large values of $\lambda$. This is suggested by examining Figure 2.6, which shows that as $\lambda$ increases, the probability mass function of the Poisson distribution becomes more symmetric and bell-shaped. Let $\lambda_1, \lambda_2, \ldots$ be an increasing sequence with $\lambda_n \to \infty$, and let $\{X_n\}$ be a sequence of Poisson random variables with the corresponding parameters. We know that $E(X_n) = \mathrm{Var}(X_n) = \lambda_n$.

If we wish to approximate the Poisson distribution function by a normal distribution function, the normal must have the same mean and variance as the Poisson does. In addition, if we wish to prove a limiting result, we run into the difficulty that the mean and variance are tending to infinity. This difficulty is dealt with by standardizing the random variables—that is, by letting
$$Z_n = \frac{X_n - E(X_n)}{\sqrt{\mathrm{Var}(X_n)}} = \frac{X_n - \lambda_n}{\sqrt{\lambda_n}}$$
We then have $E(Z_n) = 0$ and $\mathrm{Var}(Z_n) = 1$, and we will show that the mgf of $Z_n$ converges to the mgf of the standard normal distribution. The mgf of $X_n$ is
$$M_{X_n}(t) = e^{\lambda_n(e^t - 1)}$$
By Property C of Section 4.5, the mgf of $Z_n$ is
$$M_{Z_n}(t) = e^{-t\sqrt{\lambda_n}}\, M_{X_n}\!\left(\frac{t}{\sqrt{\lambda_n}}\right) = e^{-t\sqrt{\lambda_n}}\, e^{\lambda_n(e^{t/\sqrt{\lambda_n}} - 1)}$$
It will be easier to work with the log of this expression,
$$\log M_{Z_n}(t) = -t\sqrt{\lambda_n} + \lambda_n(e^{t/\sqrt{\lambda_n}} - 1)$$
Using the power series expansion $e^x = \sum_{k=0}^{\infty} x^k/k!$, we see that
$$\lim_{n\to\infty} \log M_{Z_n}(t) = \frac{t^2}{2}$$
or
$$\lim_{n\to\infty} M_{Z_n}(t) = e^{t^2/2}$$
The last expression is the mgf of the standard normal distribution. We have shown that a standardized Poisson random variable converges in distribution to a standard normal variable as $\lambda$ approaches infinity. Practically, we wish to use this limiting result as a basis for an approximation for large but finite values of $\lambda_n$. How adequate the approximation is for $\lambda = 100$, say, is a matter for theoretical and/or empirical investigation. It turns out that the approximation is increasingly good for large values of $\lambda$ and that $\lambda$ does not have to be all that large. (See Problem 8 at the end of this chapter.) ■

The next example shows how the approximation of the Poisson distribution can be applied in a specific case.

EXAMPLE B
A certain type of particle is emitted at a rate of 900 per hour. What is the probability that more than 950 particles will be emitted in a given hour if the counts form a Poisson process? Let $X$ be a Poisson random variable with mean 900. We find $P(X > 950)$ by standardizing:
$$P(X > 950) = P\!\left(\frac{X - 900}{\sqrt{900}} > \frac{950 - 900}{\sqrt{900}}\right) \approx 1 - \Phi\!\left(\frac{5}{3}\right) = .04779$$
where $\Phi$ is the standard normal cdf. For comparison, the exact probability is .04712. ■
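Example B is easy to reproduce numerically. A sketch using scipy's distribution objects (an assumption about tooling; any library with Poisson and normal cdf's would serve equally well):

```python
from scipy.stats import norm, poisson

lam = 900
# Normal approximation with matching mean and variance (both equal to lam).
approx = 1 - norm.cdf((950 - lam) / lam ** 0.5)
# Exact Poisson tail probability: sf(950) = P(X > 950) = 1 - P(X <= 950).
exact = poisson.sf(950, lam)
print(approx, exact)   # near 0.04779 and 0.04712, matching the text
```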
We now turn to the central limit theorem, which is concerned with a limiting property of sums of random variables. If $X_1, X_2, \ldots$ is a sequence of independent random variables with mean $\mu$ and variance $\sigma^2$, and if
$$S_n = \sum_{i=1}^n X_i$$
we know from the law of large numbers that $S_n/n$ converges to $\mu$ in probability. This followed from the fact that
$$\mathrm{Var}\!\left(\frac{S_n}{n}\right) = \frac{1}{n^2}\mathrm{Var}(S_n) = \frac{\sigma^2}{n} \to 0$$
The central limit theorem is concerned not with the fact that the ratio $S_n/n$ converges to $\mu$ but with how it fluctuates around $\mu$. To analyze these fluctuations, we standardize:
$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$
You should verify that $Z_n$ has mean 0 and variance 1. The central limit theorem states that the distribution of $Z_n$ converges to the standard normal distribution.

THEOREM B Central Limit Theorem
Let $X_1, X_2, \ldots$ be a sequence of independent random variables having mean 0 and variance $\sigma^2$ and the common distribution function $F$ and moment-generating function $M$ defined in a neighborhood of zero. Let
$$S_n = \sum_{i=1}^n X_i$$
Then
$$\lim_{n\to\infty} P\!\left(\frac{S_n}{\sigma\sqrt{n}} \le x\right) = \Phi(x), \qquad -\infty < x < \infty$$

Proof
Let $Z_n = S_n/(\sigma\sqrt{n})$. We will show that the mgf of $Z_n$ tends to the mgf of the standard normal distribution. Since $S_n$ is a sum of independent random variables,
$$M_{S_n}(t) = [M(t)]^n$$
and
$$M_{Z_n}(t) = \left[M\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$$
$M(s)$ has a Taylor series expansion about zero:
$$M(s) = M(0) + sM'(0) + \frac{1}{2}s^2 M''(0) + \varepsilon_s$$
where $\varepsilon_s/s^2 \to 0$ as $s \to 0$. Since $E(X) = 0$, $M'(0) = 0$ and $M''(0) = \sigma^2$. As $n \to \infty$, $t/(\sigma\sqrt{n}) \to 0$, and
$$M\!\left(\frac{t}{\sigma\sqrt{n}}\right) = 1 + \frac{1}{2}\sigma^2\left(\frac{t}{\sigma\sqrt{n}}\right)^2 + \varepsilon_n$$
where $\varepsilon_n/(t^2/(n\sigma^2)) \to 0$ as $n \to \infty$. We thus have
$$M_{Z_n}(t) = \left(1 + \frac{t^2}{2n} + \varepsilon_n\right)^n$$
It can be shown that if $a_n \to a$, then
$$\lim_{n\to\infty}\left(1 + \frac{a_n}{n}\right)^n = e^a$$
From this result, it follows that
$$M_{Z_n}(t) \to e^{t^2/2} \quad \text{as } n \to \infty$$
where $\exp(t^2/2)$ is the mgf of the standard normal distribution, as was to be shown. ■

Theorem B is one of the simplest versions of the central limit theorem; there are many central limit theorems of various degrees of abstraction and generality. We have proved Theorem B under the assumption that the moment-generating functions exist, which is a rather strong assumption. By using characteristic functions instead, we could modify the proof so that it would only be necessary for the first and second moments to exist. Further generalizations weaken the assumption that the $X_i$ have the same distribution and apply to linear combinations of independent random variables. Central limit theorems have also been proved that weaken the independence assumption and allow the $X_i$ to be dependent but not "too" dependent.

For practical purposes, especially for statistics, the limiting result in itself is not of primary interest. Statisticians are more interested in its use as an approximation with finite values of $n$. It is impossible to give a concise and definitive statement of how good the approximation is, but some general guidelines are available, and examining special cases can give insight. How fast the approximation becomes good depends on the distribution of the summands, the $X_i$. If the distribution is fairly symmetric and has tails that die off rapidly, the approximation becomes good for relatively small values of $n$. If the distribution is very skewed or if the tails die down very slowly, a larger value of $n$ is needed for a good approximation. The following examples deal with two special cases.

EXAMPLE C
Because the uniform distribution on $[0, 1]$ has mean $\frac{1}{2}$ and variance $\frac{1}{12}$, the sum of 12 uniform random variables, minus 6, has mean 0 and variance 1. The distribution of this sum is quite close to normal; in fact, before better algorithms were developed, it was commonly used in computers for generating normal random variables from uniform ones. It is possible to compare the real and approximate distributions analytically, but we will content ourselves with a simple demonstration.
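The demonstration takes only a few lines of code. The sketch below, in Python with numpy (illustrative choices of tools and seed), generates 1000 such sums, as in Figure 5.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 values, each the sum of 12 uniform [0, 1] variables minus 6,
# so each value has mean 0 and variance 12 * (1/12) = 1.
sums = rng.uniform(0.0, 1.0, size=(1000, 12)).sum(axis=1) - 6.0

print(sums.mean(), sums.std())   # both should be close to 0 and 1
# Compare a tail frequency with the standard normal value Phi(-2) = .0228:
print((sums < -2).mean())
```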
Figure 5.1 shows a histogram of 1000 such sums with a superimposed normal density function. The fit is surprisingly good, especially considering that 12 is not usually regarded as a large value of $n$. ■

FIGURE 5.1 A histogram of 1000 values, each of which is the sum of 12 uniform $[-\frac{1}{2}, \frac{1}{2}]$ pseudorandom variables, with an approximating standard normal density.

EXAMPLE D
The sum of $n$ independent exponential random variables with parameter $\lambda = 1$ follows a gamma distribution with $\lambda = 1$ and $\alpha = n$ (Example F in Section 4.5). The exponential density is quite skewed; therefore, a good approximation of a standardized gamma by a standardized normal would not be expected for small $n$. Figure 5.2 shows the cdf's of the standard normal and standardized gamma distributions for increasing values of $n$. Note how the approximation improves as $n$ increases. ■

FIGURE 5.2 The standard normal cdf (solid line) and the cdf's of standardized gamma distributions with $\alpha = 5$ (long dashes), $\alpha = 10$ (short dashes), and $\alpha = 30$ (dots).

Let us now consider some applications of the central limit theorem.

EXAMPLE E Measurement Error
Suppose that $X_1, \ldots, X_n$ are repeated, independent measurements of a quantity, $\mu$, and that $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. The average of the measurements, $\bar{X}$, is used as an estimate of $\mu$. The law of large numbers tells us that $\bar{X}$ converges to $\mu$ in probability, so we can hope that $\bar{X}$ is close to $\mu$ if $n$ is large. Chebyshev's inequality allows us to bound the probability of an error of a given size, but the central limit theorem gives a much sharper approximation to the actual error. Suppose that we wish to find $P(|\bar{X} - \mu| < c)$ for some constant $c$. To use the central limit theorem to approximate this probability, we first standardize, using $E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \sigma^2/n$:
$$P(|\bar{X} - \mu| < c) = P(-c < \bar{X} - \mu < c) = P\!\left(-\frac{c\sqrt{n}}{\sigma} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{c\sqrt{n}}{\sigma}\right) \approx \Phi\!\left(\frac{c\sqrt{n}}{\sigma}\right) - \Phi\!\left(-\frac{c\sqrt{n}}{\sigma}\right) = 2\Phi\!\left(\frac{c\sqrt{n}}{\sigma}\right) - 1$$ ■

EXAMPLE F Normal Approximation to the Binomial Distribution
Because a binomial random variable is the sum of independent Bernoulli random variables, its distribution can be approximated by a normal distribution when the number of trials is large; the approximation is generally considered adequate when $np > 5$ and $n(1 - p) > 5$. The approximation is especially useful for large values of $n$, for which tables are not readily available.

Suppose that a coin is tossed 100 times and lands heads up 60 times. Should we be surprised and doubt that the coin is fair? To answer this question, we note that if the coin is fair, the number of heads, $X$, is a binomial random variable with $n = 100$ trials and probability of success $p = \frac{1}{2}$, so that $E(X) = np = 50$ (see Example A of Section 4.1) and $\mathrm{Var}(X) = np(1 - p) = 25$ (see Example B of Section 4.3). We could calculate $P(X = 60)$, which would be a small number. But because there are so many possible outcomes, $P(X = 50)$ is also a small number, so this calculation would not really answer the question. Instead, we calculate the probability of a deviation as extreme as or more extreme than 60 if the coin is fair; that is, we calculate $P(X \ge 60)$. To approximate this probability from the normal distribution, we standardize:
$$P(X \ge 60) = P\!\left(\frac{X - 50}{5} \ge \frac{60 - 50}{5}\right) \approx 1 - \Phi(2) = .0228$$
The probability is rather small, so the fairness of the coin is called into question. ■

EXAMPLE G Particle Size Distribution
The distribution of the sizes of grains of particulate matter is often found to be quite skewed, with a slowly decreasing right tail. A distribution called the lognormal is sometimes fit to such a distribution; $X$ is said to follow a lognormal distribution if $\log X$ has a normal distribution. The central limit theorem gives a theoretical rationale for the use of the lognormal distribution in some situations.
Suppose that a particle of initial size y0 is subjected to repeated impacts, that on each impact a proportion, Xi , of the particle remains, and that the Xi are modeled as independent random variables having the same distribution. After the first impact, the 188 Chapter 5 Limit Theorems size of the particle is Y1 = X1 y0; after the second impact, the size is Y2 = X2 X1 y0; and after the nth impact, the size is Yn = Xn Xn−1 ···X2 X1 y0 Then log Yn = log y0 + n i=1 log Xi and the central limit theorem applies to log Yn. ■ A similar construction is relevant to the theory of finance. Suppose that an initial investment of value v0 is made and that returns occur in discrete time, for example, daily. If the return on the first day is R1, then the value becomes V1 = R1v0. After day two the value is V2 = R2 R1v0, and after day n the value is Vn = Rn Rn−1 ···R1v0 The log value is thus log Vn = log v0 + n i=1 log Ri If the returns are independent random variables with the same distribution, then the distribution of log Vn is approximately normally distributed. 5.4 Problems 1. Let X1, X2,...be a sequence of independent random variables with E(Xi ) = μ and Var(Xi ) = σ 2 i . Show that if n−2 n i=1 σ 2 i → 0, then X → μ in probability. 2. Let Xi be as in Problem 1 but with E(Xi ) = μi and n−1 n i=1 μi → μ. Show that X → μ in probability. 3. Suppose that the number of insurance claims, N, filed in a year is Poisson distributed with E(N) = 10,000. Use the normal approximation to the Poisson to approximate P(N > 10,200). 4. Suppose that the number of traffic accidents, N, in a given period of time is dis- tributed as a Poisson random variable with E(N) = 100. Use the normal approx- imation to the Poisson to find  such that P(100 −  2, E(W) exists and equals n/(n − 2). From the definitions of the t and F distributions, it follows that the square of a tn random variable follows an F1,n distribution (see Problem 6 at the end of this chapter). 6.3 The Sample Mean and the Sample Variance 195 6.3 The Sample Mean and the Sample Variance Let X1,...,Xn be independent N(μ, σ 2) random variables; we sometimes refer to them as a sample from a normal distribution. In this section, we will find the joint and marginal distributions of X = 1 n n i=1 Xi S2 = 1 n − 1 n i=1 (Xi − X)2 These are called the sample mean and the sample variance, respectively. First note that because X is a linear combination of independent normal random variables, it is normally distributed with E(X) = μ Var(X) = σ 2 n As a preliminary to showing that X and S2 are independently distributed, we establish the following theorem. THEOREM A The random variable X and the vector of random variables (X1 − X, X2 − X,...,Xn − X) are independent. Proof At the level of this course, it is difficult to give a proof that provides sufficient insight into why this result is true; a rigorous proof essentially depends on geo- metric properties of the multivariate normal distribution, which this book does not cover. We present a proof based on moment-generating functions; in particular, we will show that the joint moment-generating function M(s, t1,...,tn) = E{exp[sX + t1(X1 − X) +···+tn(Xn − X)]} factors into the product of two moment-generating functions—one of X and the other of (X1 − X),...,(Xn − X). The factoring implies (Section 4.5) that the random variables are independent of each other and is accomplished through some algebraic trickery. 
First we observe that since n i=1 ti (Xi − X) = n i=1 ti Xi − nX ¯t 196 Chapter 6 Distributions Derived from the Normal Distribution then sX + n i=1 ti (Xi − X) = n i=1 s n + (ti − ¯t ) Xi = n i=1 ai Xi where ai = s n + (ti − ¯t ) Furthermore, we observe that n i=1 ai = s n i=1 a2 i = s2 n + n i=1 (ti − ¯t )2 Nowwehave M(s, t1,...,tn) = MX1···Xn (a1,...,an) and since the Xi are independent normal random variables, we have M(s, t1,...,tn) = n i=1 MXi (ai ) = n i=1 exp μai + σ 2 2 a2 i = exp μ n i=1 ai + σ 2 2 n i=1 a2 i = exp μs + σ 2 2 s2 n + σ 2 2 n i=1 (ti − ¯t )2 = exp μs + σ 2 2n s2 exp σ 2 2 n i=1 (ti − ¯t )2 The first factor is the mgf of X. Since the mgf of the vector (X1 − X,...,Xn − X) can be obtained by setting s = 0inM, the second factor is this mgf. ■ 6.3 The Sample Mean and the Sample Variance 197 COROLLARY A X and S2 are independently distributed. Proof This follows immediately since S2 is a function of the vector (X1 − X,..., Xn − X), which is independent of X. ■ The next theorem gives the marginal distribution of S2. THEOREM B The distribution of (n −1)S2/σ 2 is the chi-square distribution with n −1 degrees of freedom. Proof We first note that 1 σ 2 n i=1 (Xi − μ)2 = n i=1 Xi − μ σ 2 ∼ χ2 n Also, 1 σ 2 n i=1 (Xi − μ)2 = 1 σ 2 n i=1 [(Xi − X) + (X − μ)]2 Expanding the square and using the fact that n i=1(Xi − X) = 0, we obtain 1 σ 2 n i=1 (Xi − μ)2 = 1 σ 2 n i=1 (Xi − X)2 + X − μ σ/√ n 2 This is a relation of the form W = U + V . Since U and V are independent by Corollary A, MW (t) = MU (t)MV (t). W and V both follow chi-square distri- butions, so MU (t) = MW (t) MV (t) = (1 − 2t)−n/2 (1 − 2t)−1/2 = (1 − 2t)−(n−1)/2 The last expression is the mgf of a random variable with a χ2 n−1 distribution. ■ One final result concludes this chapter’s collection. 198 Chapter 6 Distributions Derived from the Normal Distribution COROLLARY B Let X and S2 be as given at the beginning of this section. Then X − μ S/√ n ∼ tn−1 Proof We simply express the given ratio in a different form: X − μ S/√ n = X − μ σ/√ n S2/σ 2 The latter is the ratio of an N(0, 1) random variable to the square root of an independent random variable with a χ2 n−1 distribution divided by its degrees of freedom. Thus, from the definition in Section 6.2, the ratio follows a t distribution with n − 1 degrees of freedom. ■ 6.4 Problems 1. Prove Proposition A of Section 6.2. 2. Prove Proposition B of Section 6.2. 3. Let X be the average of a sample of 16 independent normal random variables with mean 0 and variance 1. Determine c such that P(|X| < c) = .5 4. If T follows a t7 distribution, find t0 such that (a) P(|T | < t0) = .9 and (b) P(T > t0) = .05. 5. Show that if X ∼ Fn,m, then X −1 ∼ Fm,n. 6. Show that if T ∼ tn, then T 2 ∼ F1,n. 7. Show that the Cauchy distribution and the t distribution with 1 degree of free- dom are the same. 8. Show that if X and Y are independent exponential random variables with λ = 1, then X/Y follows an F distribution. Also, identify the degrees of freedom. 9. Find the mean and variance of S2, where S2 is as in Section 6.3. 10. Show how to use the chi-square distribution to calculate P(a < S2/σ 2 < b). 11. Let X1,...,Xn be a sample from an N(μX ,σ2) distribution and Y1,...,Ym be an independent sample from an N(μY ,σ2) distribution. Show how to use the F distribution to find P(S2 X /S2 Y > c). 
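Before turning to survey sampling, the distributional results of Section 6.3 (the independence of $\bar{X}$ and $S^2$, the $\chi^2_{n-1}$ law of $(n-1)S^2/\sigma^2$, and the $t_{n-1}$ ratio of Corollary B) can be checked empirically. A sketch in Python with numpy and scipy (an arbitrary choice of tools; the parameter values, sample size, and seed are illustrative assumptions):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 5, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)               # sample variance with divisor n - 1

# Corollary A: X_bar and S^2 are independent, so their correlation is near 0
# (a necessary consequence of independence, not a full test of it).
print(np.corrcoef(xbar, s2)[0, 1])

# Theorem B: (n - 1) S^2 / sigma^2 is chi-square with n - 1 df, whose mean is n - 1.
print(((n - 1) * s2 / sigma**2).mean(), n - 1)

# Corollary B: (X_bar - mu) / (S / sqrt(n)) follows a t distribution with n - 1 df,
# so it should exceed the .975 quantile in absolute value about 5% of the time.
ratio = (xbar - mu) / np.sqrt(s2 / n)
print((np.abs(ratio) > t.ppf(0.975, df=n - 1)).mean())   # near 0.05
```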
CHAPTER 7 Survey Sampling 7.1 Introduction Resting on the probabilistic foundations of the preceding chapters, this chapter marks the beginning of our study of statistics by introducing the subject of survey sampling. As well as being of considerable intrinsic interest and practical utility, the development of the elementary theory of survey sampling serves to introduce several concepts and techniques that will recur and be amplified in later chapters. Sample surveys are used to obtain information about a large population by exam- ining only a small fraction of that population. Sampling techniques have been used in many fields, such as the following: • Governments survey human populations; for example, the U.S. government con- ducts health surveys and census surveys. • Sampling techniques have been extensively employed in agriculture to estimate such quantities as the total acreage of wheat in a state by surveying a sample of farms. • The Interstate Commerce Commission has carried out sampling studies of rail and highway traffic. In one such study, records of shipments of household goods by motor carriers were sampled to evaluate the accuracy of preshipment estimates of charges, claims for damages, and other variables. • In the practice of quality control, the output of a manufacturing process may be sampled in order to examine the items for defects. • During audits of the financial records of large companies, sampling techniques may be used when examination of the entire set of records is impractical. The sampling techniques discussed here are probabilistic in nature—each mem- ber of the population has a specified probability of being included in the sample, and the actual composition of the sample is random. Such techniques differ markedly from 199 200 Chapter 7 Survey Sampling the type of sampling scheme in which particular population members are included in the sample because the investigator thinks they are typical in some way. Such a scheme may be effective in some situations, but there is no way mathematically to guarantee its unbiasedness (a term that will be precisely defined later) or to estimate the magnitude of any error committed, such as that arising from estimating the popu- lation mean by the sample mean. We will see that using a random sampling technique has a consequence that estimates can be guaranteed to be unbiased and probabilistic bounds on errors can be calculated. Among the advantages of using random sampling are the following: • The selection of sample units at random is a guard against investigator biases, even unconscious ones. • A small sample costs far less and is much faster to survey than a complete enumer- ation. • The results from a small sample may actually be more accurate than those from a complete enumeration. The quality of the data in a small sample can be more easily monitored and controlled, and a complete enumeration may require a much larger, and therefore perhaps more poorly trained, staff. • Random sampling techniques make possible the calculation of an estimate of the error due to sampling. • In designing a sample, it is frequently possible to determine the sample size neces- sary to obtain a prescribed error level. Peck et al. (2005) contains several interesting papers about applications of sampling. 7.2 Population Parameters This section defines those numerical characteristics, or parameters, of the population that we will estimate from a sample. 
We will assume that the population is of size $N$ and that associated with each member of the population is a numerical value of interest. These numerical values will be denoted by $x_1, x_2, \ldots, x_N$. The variable $x_i$ may be a numerical variable such as age or weight, or it may take on the value 1 or 0 to denote the presence or absence of some characteristic such as gender. We will refer to the latter situation as the dichotomous case.

EXAMPLE A
This is the first of many examples in this chapter in which we will illustrate ideas by using a study by Herkson (1976). The population consists of $N = 393$ short-stay hospitals. We will let $x_i$ denote the number of patients discharged from the $i$th hospital during January 1968. A histogram of the population values is shown in Figure 7.1. The histogram was constructed in the following way: The numbers of hospitals that discharged 0–200, 201–400, \ldots, 2801–3000 patients were graphed as horizontal lines above the respective intervals. For example, the figure indicates that about 40 hospitals discharged from 601 to 800 patients. The histogram is a convenient graphical representation of the distribution of the values in the population, being more quickly assimilated than would be a list of 393 values. ■

FIGURE 7.1 Histogram of the numbers of patients discharged during January 1968 from 393 short-stay hospitals.

We will be particularly interested in the population mean, or average,
$$\mu = \frac{1}{N}\sum_{i=1}^N x_i$$
For the population of 393 hospitals, the mean number of discharges is 814.6. Note the location of this value in Figure 7.1. In the dichotomous case, where the presence or absence of a characteristic is to be determined, $\mu$ equals the proportion, $p$, of individuals in the population having the particular characteristic, because in the sum above each $x_i$ is either 0 or 1. The sum thus reduces to the number of 1s and, when divided by $N$, gives the proportion, $p$.

The population total is
$$\tau = \sum_{i=1}^N x_i = N\mu$$
The total number of people discharged from the population of hospitals is $\tau = 320{,}138$. In the dichotomous case, the population total is the total number of members of the population possessing the characteristic of interest.

We will also need to consider the population variance,
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2$$
A useful identity can be obtained by expanding the square in this equation:
$$\sigma^2 = \frac{1}{N}\left(\sum_{i=1}^N x_i^2 - 2\mu\sum_{i=1}^N x_i + N\mu^2\right) = \frac{1}{N}\left(\sum_{i=1}^N x_i^2 - 2N\mu^2 + N\mu^2\right) = \frac{1}{N}\sum_{i=1}^N x_i^2 - \mu^2$$
In the dichotomous case, the population variance reduces to $p(1 - p)$:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N x_i^2 - \mu^2 = p - p^2 = p(1 - p)$$
Here we used the fact that because each $x_i$ is 0 or 1, each $x_i^2$ is also 0 or 1.

The population standard deviation is the square root of the population variance and is used as a measure of how spread out, dispersed, or scattered the individual values are. The standard deviation is given in the same units (for example, inches) as are the population values, whereas the variance is given in those units squared. The variance of the discharges is 347,766, and the standard deviation is 589.7; examination of the histogram in Figure 7.1 makes it clear that the latter number is the more reasonable description of the spread of the population values.
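These definitions translate directly into code. A minimal sketch in Python with numpy (the toy population below is invented for illustration; it is not the hospital data):

```python
import numpy as np

# A toy population of N = 8 values (invented for illustration).
x = np.array([3.0, 7.0, 1.0, 4.0, 4.0, 6.0, 2.0, 5.0])
N = len(x)

mu = x.sum() / N                      # population mean
tau = x.sum()                         # population total, tau = N * mu
sigma2 = ((x - mu) ** 2).sum() / N    # population variance (divisor N, not N - 1)
# The identity sigma^2 = (1/N) sum(x_i^2) - mu^2:
assert np.isclose(sigma2, (x ** 2).sum() / N - mu ** 2)

# Dichotomous case: 0-1 values, so mu = p and sigma^2 = p(1 - p).
d = np.array([1, 0, 1, 1, 0, 0, 1, 1])
p = d.mean()
assert np.isclose(d.var(), p * (1 - p))
print(mu, tau, sigma2, p)
```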
7.3 Simple Random Sampling

The most elementary form of sampling is simple random sampling (s.r.s.): Each particular sample of size n has the same probability of occurrence; that is, each of the \binom{N}{n} possible samples of size n taken without replacement has the same probability. We assume that sampling is done without replacement so that each member of the population will appear in the sample at most once. The actual composition of the sample is usually determined by using a table of random numbers or a random number generator on a computer. Conceptually, we can regard the population members as balls in an urn, a specified number of which are selected for inclusion in the sample at random and without replacement.

Because the composition of the sample is random, the sample mean is random. An analysis of the accuracy with which the sample mean approximates the population mean must therefore be probabilistic in nature. In this section, we will derive some statistical properties of the sample mean.

7.3.1 The Expectation and Variance of the Sample Mean

We will denote the sample size by n (n is less than N) and the values of the sample members by X_1, X_2, \ldots, X_n. It is important to realize that each X_i is a random variable. In particular, X_i is not the same as x_i: X_i is the value of the ith member of the sample, which is random, and x_i is that of the ith member of the population, which is fixed.

We will consider the sample mean,

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

as an estimate of the population mean. As an estimate of the population total, we will consider

T = N\bar{X}

Properties of T will follow readily from those of \bar{X}. Since each X_i is a random variable, so is the sample mean; its probability distribution is called its sampling distribution. In general, any numerical value, or statistic, computed from a random sample is a random variable and has an associated sampling distribution. The sampling distribution of \bar{X} determines how accurately \bar{X} estimates \mu; roughly speaking, the more tightly the sampling distribution is centered on \mu, the better the estimate.

EXAMPLE A
To illustrate the concept of a sampling distribution, let us look again at the population of 393 hospitals. In practice, of course, the population would not be known, and only one sample would be drawn. For pedagogical purposes here, we can consider the sampling distribution of the sample mean from this known population. Suppose, for example, that we want to find the sampling distribution of the mean of a sample of size 16. In principle, we could form all \binom{393}{16} samples and compute the mean of each one—this would give the sampling distribution. But because the number of such samples is of the order 10^{33}, this is clearly not practical. We will thus employ a technique known as simulation. We can estimate the sampling distribution of the mean of a sample of size n by drawing many samples of size n, computing the mean of each sample, and then forming a histogram of the collection of sample means. Figure 7.2 shows the results of such a simulation for sample sizes of 8, 16, 32, and 64 with 500 replications for each sample size.

Three features of Figure 7.2 are noteworthy:
1. All the histograms are centered about the population mean, 814.6.
2. As the sample size increases, the histograms become less spread out.
3. Although the shape of the histogram of population values (Figure 7.1) is not symmetric about the mean, the histograms in Figure 7.2 are more nearly so.

These features will be explained quantitatively. ■
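The simulation described in Example A is easy to carry out. The sketch below stands in a synthetic right-skewed population of size 393, since the actual hospital values are not reproduced here; only the shape of the experiment matters:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# A made-up, right-skewed population of size N = 393 (an assumption for
# illustration only; the real hospital data are not listed in the text).
population = rng.gamma(shape=2.0, scale=400.0, size=393)

def sample_means(pop, n, replications=500):
    """Means of `replications` simple random samples of size n,
    drawn without replacement."""
    return np.array([rng.choice(pop, size=n, replace=False).mean()
                     for _ in range(replications)])

fig, axes = plt.subplots(2, 2)
for ax, n in zip(axes.ravel(), [8, 16, 32, 64]):
    ax.hist(sample_means(population, n), bins=20)
    ax.set_title(f"n = {n}")
plt.show()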
As we have said, \bar{X} is a random variable whose distribution is determined by that of the X_i. We thus examine the distribution of a single sample element, X_i. It should be noted that the following lemma holds whether sampling is with or without replacement.

FIGURE 7.2 Histograms of the values of the mean number of discharges in 500 simple random samples from the population of 393 hospitals. Sample sizes: (a) n = 8, (b) n = 16, (c) n = 32, (d) n = 64.

We need to be careful about the values that the random variable X_i can assume. The ith sample member is equally likely to be any of the N population members. If all the population values were distinct, we would then have P(X_i = x_j) = 1/N. But the population values may not be distinct (for example, in the dichotomous case there are only two values, 0 and 1). If k members of the population have the same value \zeta, then P(X_i = \zeta) = k/N. We use this construction in proving the following lemma.

LEMMA A
Denote the distinct values assumed by the population members by \zeta_1, \zeta_2, \ldots, \zeta_m, and denote the number of population members that have the value \zeta_j by n_j, j = 1, 2, \ldots, m. Then X_i is a discrete random variable with probability mass function

P(X_i = \zeta_j) = \frac{n_j}{N}

Also,

E(X_i) = \mu
Var(X_i) = \sigma^2

Proof
The only possible values that X_i can assume are \zeta_1, \zeta_2, \ldots, \zeta_m. Since each member of the population is equally likely to be the ith member of the sample, the probability that X_i assumes the value \zeta_j is thus n_j/N. The expected value of the random variable X_i is then

E(X_i) = \sum_{j=1}^{m} \zeta_j P(X_i = \zeta_j) = \frac{1}{N} \sum_{j=1}^{m} n_j \zeta_j = \mu

The last equation follows because n_j population members have the value \zeta_j, and the sum is thus equal to the sum of the values of all the population members. Finally,

Var(X_i) = E(X_i^2) - [E(X_i)]^2 = \frac{1}{N} \sum_{j=1}^{m} n_j \zeta_j^2 - \mu^2 = \sigma^2

Here we have used the fact that \sum_{i=1}^{N} x_i^2 = \sum_{j=1}^{m} n_j \zeta_j^2 and the identity for the population variance derived in Section 7.2. ■

As a measure of the center of the sampling distribution, we will use E(\bar{X}). As a measure of the dispersion of the sampling distribution about this center, we will use the standard deviation of \bar{X}. The key results that will be obtained shortly are that the sampling distribution is centered at \mu and that its spread is inversely proportional to the square root of the sample size, n. We first show that the sampling distribution is centered at \mu.

THEOREM A
With simple random sampling, E(\bar{X}) = \mu.

Proof
Since, from Lemma A, E(X_i) = \mu, it follows from Theorem A in Section 4.1.2 that

E(\bar{X}) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \mu  ■

From Theorem A, we have the following corollary.

COROLLARY A
With simple random sampling, E(T) = \tau.

Proof
E(T) = E(N\bar{X}) = N E(\bar{X}) = N\mu = \tau  ■

In the dichotomous case, \mu = p, and \bar{X} is the proportion of the sample that possesses the characteristic of interest. In this case, \bar{X} will be denoted by \hat{p}. We have shown that E(\hat{p}) = p. It is important to keep in mind that \bar{X} is random. The result E(\bar{X}) = \mu can be interpreted to mean that "on the average" \bar{X} = \mu.
In general, if we wish to estimate a population parameter, \theta say, by a function \hat{\theta} of the sample, X_1, X_2, \ldots, X_n, and E(\hat{\theta}) = \theta, whatever the value of \theta may be, we say that \hat{\theta} is unbiased. Thus, \bar{X} and T are unbiased estimates of \mu and \tau. On average they are correct. We next investigate how variable they are, by deriving their variances and standard deviations.

Section 4.2.1 introduced the concepts of bias and variance in the context of a model of measurement error, and these concepts are also relevant in this new context. In Chapter 4, it was shown that

Mean squared error = variance + bias^2

Since \bar{X} and T are unbiased, their mean squared errors are equal to their variances.

We next find Var(\bar{X}). Since \bar{X} = n^{-1} \sum_{i=1}^{n} X_i, it follows from Corollary A of Section 4.3 that

Var(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} Cov(X_i, X_j)

Suppose that sampling were done with replacement. Then the X_i would be independent, and for i \neq j we would have Cov(X_i, X_j) = 0, whereas Cov(X_i, X_i) = Var(X_i) = \sigma^2. It would then follow that

Var(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{\sigma^2}{n}

and that the standard deviation of \bar{X}, also called its standard error, would be

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}

Sampling without replacement induces dependence among the X_i, which complicates this simple result. However, we will see that if the sample size n is small relative to the population size N, the dependence is weak and this simple result holds to a good approximation. To find the variance of the sample mean in sampling without replacement, we need to find Cov(X_i, X_j) for i \neq j.

LEMMA B
For simple random sampling without replacement,

Cov(X_i, X_j) = -\frac{\sigma^2}{N-1}, if i \neq j

Proof
Using the identity for covariance established at the beginning of Section 4.3,

Cov(X_i, X_j) = E(X_i X_j) - E(X_i)E(X_j)

and

E(X_i X_j) = \sum_{k=1}^{m} \sum_{l=1}^{m} \zeta_k \zeta_l P(X_i = \zeta_k \text{ and } X_j = \zeta_l)
           = \sum_{k=1}^{m} \zeta_k P(X_i = \zeta_k) \sum_{l=1}^{m} \zeta_l P(X_j = \zeta_l \mid X_i = \zeta_k)

from the multiplication law for conditional probability. Now,

P(X_j = \zeta_l \mid X_i = \zeta_k) = n_l/(N-1), if k \neq l
P(X_j = \zeta_l \mid X_i = \zeta_k) = (n_l - 1)/(N-1), if k = l

Now if we express

\sum_{l=1}^{m} \zeta_l P(X_j = \zeta_l \mid X_i = \zeta_k) = \sum_{l \neq k} \zeta_l \frac{n_l}{N-1} + \zeta_k \frac{n_k - 1}{N-1} = \sum_{l=1}^{m} \zeta_l \frac{n_l}{N-1} - \frac{\zeta_k}{N-1}

the expression for E(X_i X_j) becomes

E(X_i X_j) = \sum_{k=1}^{m} \zeta_k \frac{n_k}{N} \left( \sum_{l=1}^{m} \zeta_l \frac{n_l}{N-1} - \frac{\zeta_k}{N-1} \right)
           = \frac{1}{N(N-1)} \left( \tau^2 - \sum_{k=1}^{m} \zeta_k^2 n_k \right)
           = \frac{\tau^2}{N(N-1)} - \frac{1}{N(N-1)} \sum_{k=1}^{m} \zeta_k^2 n_k
           = \frac{N\mu^2}{N-1} - \frac{1}{N-1} (\mu^2 + \sigma^2)
           = \mu^2 - \frac{\sigma^2}{N-1}

Finally, subtracting E(X_i)E(X_j) = \mu^2 from the last equation, we have

Cov(X_i, X_j) = -\frac{\sigma^2}{N-1}

for i \neq j. ■

(Alternative proofs of Lemma B are outlined in Problems 25 and 26 at the end of this chapter.) This lemma shows that X_i and X_j are not independent of each other for i \neq j, but that the covariance is very small for large values of N. We are now able to derive the following theorem.

THEOREM B
With simple random sampling,

Var(\bar{X}) = \frac{\sigma^2}{n} \frac{N-n}{N-1} = \frac{\sigma^2}{n} \left( 1 - \frac{n-1}{N-1} \right)

Proof
From Corollary A of Section 4.3,

Var(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} Cov(X_i, X_j)
             = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j \neq i} Cov(X_i, X_j)
             = \frac{\sigma^2}{n} - \frac{1}{n^2} n(n-1) \frac{\sigma^2}{N-1}

After some algebra, this gives the desired result. ■

Notice that the variance of the sample mean in sampling without replacement differs from that in sampling with replacement by the factor

1 - \frac{n-1}{N-1}

which is called the finite population correction. The ratio n/N is called the sampling fraction.
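Theorem B can be checked by brute force on a tiny population by enumerating every possible sample. The sketch below uses the five-value population of Problem 1 at the end of this chapter:

import itertools
import numpy as np

# Enumerate every simple random sample from a toy population and verify
# E(Xbar) = mu and Var(Xbar) = (sigma^2 / n) * (N - n) / (N - 1).
x = np.array([1.0, 2.0, 2.0, 4.0, 8.0])   # the population of Problem 1
N, n = len(x), 2

mu = x.mean()
sigma2 = x.var()   # numpy's default ddof=0 matches the population variance

# All C(N, n) equally likely samples and their means.
means = np.array([np.mean(s) for s in itertools.combinations(x, n)])

assert np.isclose(means.mean(), mu)                              # unbiasedness
assert np.isclose(means.var(), sigma2 / n * (N - n) / (N - 1))   # Theorem B
print(means.mean(), means.var())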
Frequently, the sampling fraction is very small, in which case the standard error (standard deviation) of \bar{X} is

\sigma_{\bar{X}} \approx \frac{\sigma}{\sqrt{n}}

We see that, apart from the usually small finite population correction, the spread of the sampling distribution and therefore the precision of \bar{X} are determined by the sample size (n) and not by the population size (N). As will be made more explicit later, the appropriate measure of the precision of the sample mean is its standard error, which is inversely proportional to the square root of the sample size. Thus, in order to double the accuracy, the sample size must be quadrupled. (You might examine Figure 7.2 with this in mind.) The other factor that determines the accuracy of the sample mean is the population standard deviation, \sigma. If \sigma is small, the population values are not very dispersed and a small sample will be fairly accurate. But if the values are widely dispersed, a much larger sample will be required in order to attain the same accuracy.

EXAMPLE B
If the population of hospitals is sampled without replacement and the sample size is n = 32,

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{1 - \frac{n-1}{N-1}} = \frac{589.7}{\sqrt{32}} \sqrt{1 - \frac{31}{392}} = 104.2 \times .96 = 100.0

Notice that because the sampling fraction is small, the finite population correction makes little difference. To see that \sigma_{\bar{X}} = 100.0 is a reasonable measure of accuracy, examine part (b) of Figure 7.2 and observe that the vast majority of sample means differed from the population mean (814) by less than two standard errors; i.e., the vast majority of sample means were in the interval (614, 1014). ■

EXAMPLE C
Let us apply this result to the problem of estimating a proportion. In the population of hospitals, a proportion p = .654 had fewer than 1000 discharges. If this proportion were estimated from a sample as the sample proportion \hat{p}, the standard error of \hat{p} could be found by applying Theorem B to this dichotomous case:

\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n} \left( 1 - \frac{n-1}{N-1} \right)}

For example, for n = 32, the standard error of \hat{p} is

\sigma_{\hat{p}} = \sqrt{\frac{.654 \times .346}{32} \left( 1 - \frac{31}{392} \right)} = .08  ■

The precision of the estimate of the population total does depend on the population size, N.

COROLLARY B
With simple random sampling,

Var(T) = N^2 \frac{\sigma^2}{n} \frac{N-n}{N-1}

Proof
Since T = N\bar{X},

Var(T) = N^2 Var(\bar{X})  ■

7.3.2 Estimation of the Population Variance

A sample survey is used to estimate population parameters, and it is desirable also to assess and quantify the variability of the estimates. In the previous section, we saw how the standard error of an estimate may be determined from the sample size and the population variance. In practice, however, the population variance will not be known, but as we will show in this section, it can be estimated from the sample.

Since the population variance is the average squared deviation from the population mean, estimating it by the average squared deviation from the sample mean seems natural:

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2

The following theorem shows that this estimate is biased.

THEOREM A
With simple random sampling,

E(\hat{\sigma}^2) = \sigma^2 \frac{n-1}{n} \frac{N}{N-1}

Proof
Expanding the square and proceeding as in the identity for the population variance in Section 7.2, we find

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2

Thus,

E(\hat{\sigma}^2) = \frac{1}{n} \sum_{i=1}^{n} E(X_i^2) - E(\bar{X}^2)

Now, we know that

E(X_i^2) = Var(X_i) + [E(X_i)]^2 = \sigma^2 + \mu^2

Similarly, from Theorems A and B of Section 7.3.1,

E(\bar{X}^2) = Var(\bar{X}) + [E(\bar{X})]^2 = \frac{\sigma^2}{n} \left( 1 - \frac{n-1}{N-1} \right) + \mu^2

Substituting these expressions for E(X_i^2) and E(\bar{X}^2) in the preceding equation for E(\hat{\sigma}^2) gives the desired result. ■
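The bias identified in Theorem A can likewise be verified by exhaustive enumeration. A sketch, again using the small population of Problem 1:

import itertools
import numpy as np

# Average sigma-hat^2 over all simple random samples and compare with
# sigma^2 * ((n - 1)/n) * (N / (N - 1)), as Theorem A asserts.
x = np.array([1.0, 2.0, 2.0, 4.0, 8.0])
N, n = len(x), 3

sigma2 = x.var()
# np.var with default ddof=0 is exactly the divisor-n estimate sigma-hat^2.
sig_hat2 = np.array([np.var(s) for s in itertools.combinations(x, n)])

assert np.isclose(sig_hat2.mean(), sigma2 * (n - 1) / n * N / (N - 1))
print(sig_hat2.mean(), sigma2 * (n - 1) / n * N / (N - 1))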
Because N > n, it follows with a little algebra that

\frac{n-1}{n} \frac{N}{N-1} < 1

so that E(\hat{\sigma}^2) < \sigma^2; \hat{\sigma}^2 thus tends to underestimate \sigma^2. From Theorem A, we see that an unbiased estimate of \sigma^2 may be obtained by multiplying \hat{\sigma}^2 by the factor n(N-1)/[(n-1)N]. Thus, an unbiased estimate of \sigma^2 is

\frac{1}{n-1} \left( 1 - \frac{1}{N} \right) \sum_{i=1}^{n} (X_i - \bar{X})^2

We also have the following corollary.

COROLLARY A
An unbiased estimate of Var(\bar{X}) is

s_{\bar{X}}^2 = \frac{\hat{\sigma}^2}{n} \frac{n}{n-1} \frac{N-1}{N} \frac{N-n}{N-1} = \frac{s^2}{n} \left( 1 - \frac{n}{N} \right)

where

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

Proof
Since

Var(\bar{X}) = \frac{\sigma^2}{n} \frac{N-n}{N-1}

an unbiased estimate of Var(\bar{X}) may be obtained by substituting in an unbiased estimate of \sigma^2. Algebra then yields the desired result. ■

Similarly, an unbiased estimate of the variance of T, the estimator of the population total, is

s_T^2 = N^2 s_{\bar{X}}^2

For the dichotomous case, in which each X_i is 0 or 1, note that

\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \hat{p}(1 - \hat{p})

Therefore,

s^2 = \frac{n}{n-1} \hat{p}(1 - \hat{p})

Thus, as a special case of Corollary A, we have the following corollary.

COROLLARY B
An unbiased estimate of Var(\hat{p}) is

s_{\hat{p}}^2 = \frac{\hat{p}(1 - \hat{p})}{n-1} \left( 1 - \frac{n}{N} \right)  ■

In many cases, the sampling fraction, n/N, is small and may be neglected. Furthermore, it often makes little difference whether n - 1 or n is used as the divisor.

The quantities s_{\bar{X}}, s_T, and s_{\hat{p}} are called estimated standard errors. If they were known, the actual standard errors, \sigma_{\bar{X}}, \sigma_T, and \sigma_{\hat{p}}, would be used to gauge the accuracy of the estimates \bar{X}, T, and \hat{p}. Since they are typically not known, the estimated standard errors are used in their place.

EXAMPLE A
A simple random sample of 50 of the 393 hospitals was taken. From this sample, \bar{X} = 938.5 (recall that, in fact, \mu = 814.6) and s = 614.53 (\sigma = 590). An estimate of the variance of \bar{X} is

s_{\bar{X}}^2 = \frac{s^2}{n} \left( 1 - \frac{n}{N} \right) = 6592

The estimated standard error of \bar{X} is s_{\bar{X}} = 81.19. (Note that the true value is

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{50}} \sqrt{1 - \frac{49}{392}} = 78.)

This estimated standard error gives a rough idea of how accurate the value of \bar{X} is; in this case, we see that the magnitude of the error is of the order 80, as opposed to 8 or 800, say. In fact, the error was 123.9, or about 1.5 s_{\bar{X}}. ■

EXAMPLE B
From the same sample, the estimate of the total number of discharges in the population of hospitals is

T = N\bar{X} = 368,831

Recall that the true value of the population total is 320,138. The estimated standard error of T is

s_T = N s_{\bar{X}} = 31,908

Again, this estimated standard error can be used as a rough gauge of the estimation error. ■

EXAMPLE C
Let p be the proportion of hospitals that had fewer than 1000 discharges—that is, p = .654. In the sample of Example A, 26 of 50 hospitals had fewer than 1000 discharges, so

\hat{p} = \frac{26}{50} = .52

The variance of \hat{p} is estimated by

s_{\hat{p}}^2 = \frac{\hat{p}(1 - \hat{p})}{n-1} \left( 1 - \frac{n}{N} \right) = .0045

Thus, the estimated standard error of \hat{p} is s_{\hat{p}} = .067. Crudely, this tells us that the error of \hat{p} is in the second or first decimal place—that we are probably not so fortunate as to have an error only in the third decimal place. In fact, the error was .134, or about 2 \times s_{\hat{p}}. ■

These examples show how, in simple random sampling, we can not only form estimates of unknown population parameters, but can also gauge the likely size of the errors of the estimates, by estimating their standard errors from the data in the sample. We have covered a lot of ground, and the presence of the finite population correction complicates the expressions we have derived.
It is thus useful to summarize our results in the following table:

Population Parameter   Estimate                                    Variance of Estimate                                         Estimated Variance
\mu                    \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i    \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} \frac{N-n}{N-1}      s_{\bar{X}}^2 = \frac{s^2}{n} \left( 1 - \frac{n}{N} \right)
p                      \hat{p} = sample proportion                 \sigma_{\hat{p}}^2 = \frac{p(1-p)}{n} \frac{N-n}{N-1}        s_{\hat{p}}^2 = \frac{\hat{p}(1-\hat{p})}{n-1} \left( 1 - \frac{n}{N} \right)
\tau                   T = N\bar{X}                                \sigma_T^2 = N^2 \sigma_{\bar{X}}^2                          s_T^2 = N^2 s_{\bar{X}}^2
\sigma^2               \left( 1 - \frac{1}{N} \right) s^2

where s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2. The square roots of the entries in the third column are called standard errors, and the square roots of the entries in the fourth column are called estimated standard errors. The former depend on unknown population parameters, so the latter are used to gauge the accuracy of the parameter estimates. When the population is large relative to the sample size, the finite population correction can be ignored, simplifying the preceding expressions.

7.3.3 The Normal Approximation to the Sampling Distribution of \bar{X}

We have found the mean and the standard deviation of the sampling distribution of \bar{X}. Ideally, we would like to know the sampling distribution, since it would tell us everything we could hope to know about the accuracy of the estimate. Without knowledge of the population itself, however, we cannot determine the sampling distribution. In this section, we will use the central limit theorem to deduce an approximation to the sampling distribution—the normal, or Gaussian, distribution. This approximation will be used to find probabilistic bounds for the estimation error.

In Section 5.3, we considered a sequence of independent and identically distributed (i.i.d.) random variables, X_1, X_2, \ldots having the common mean and variance \mu and \sigma^2. The sample mean of X_1, X_2, \ldots, X_n is

\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

This sample mean has the properties

E(\bar{X}_n) = \mu

and

Var(\bar{X}_n) = \frac{\sigma^2}{n}

The central limit theorem says that, for a fixed number z,

P\left( \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \leq z \right) \to \Phi(z), as n \to \infty

where \Phi is the cumulative distribution function of the standard normal distribution. Using a more compact and suggestive notation, we have

P\left( \frac{\bar{X}_n - \mu}{\sigma_{\bar{X}_n}} \leq z \right) \to \Phi(z)

The context of survey sampling is not exactly like that of the central limit theorem as stated above—as we have seen, in sampling without replacement, the X_i are not independent of each other, and it makes no sense to have n tend to infinity while N remains fixed. But other central limit theorems have been proved that are appropriate to the sampling context. These show that if n is large, but still small relative to N, then \bar{X}_n, the mean of a simple random sample, is approximately normally distributed.

To demonstrate the use of the central limit theorem, we will apply it to approximate P(|\bar{X} - \mu| \leq \delta), the probability that the error made in estimating \mu by \bar{X} is less than some constant \delta:

P(|\bar{X} - \mu| \leq \delta) = P(-\delta \leq \bar{X} - \mu \leq \delta)
  = P\left( -\frac{\delta}{\sigma_{\bar{X}}} \leq \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \leq \frac{\delta}{\sigma_{\bar{X}}} \right)
  \approx \Phi\left( \frac{\delta}{\sigma_{\bar{X}}} \right) - \Phi\left( -\frac{\delta}{\sigma_{\bar{X}}} \right)
  = 2\Phi\left( \frac{\delta}{\sigma_{\bar{X}}} \right) - 1

since \Phi(-z) = 1 - \Phi(z), from the symmetry of the standard normal distribution about zero.

EXAMPLE A
Let us again consider the population of 393 hospitals. The standard deviation of the mean of a sample of size n = 64 is, using the finite population correction,

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{1 - \frac{n-1}{N-1}} = \frac{589.7}{8} \sqrt{1 - \frac{63}{392}} = 67.5

We can use the central limit theorem to approximate the probability that the sample mean differs from the population mean by more than 100 in absolute value; i.e., P(|\bar{X} - \mu| > 100).
First, from the symmetry of the normal distribution,

P(|\bar{X} - \mu| > 100) \approx 2P(\bar{X} - \mu > 100)

and

P(\bar{X} - \mu > 100) = 1 - P(\bar{X} - \mu < 100) = 1 - P\left( \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} < \frac{100}{\sigma_{\bar{X}}} \right) \approx 1 - \Phi\left( \frac{100}{67.5} \right) = .069

Thus the probability that the sample mean differs from the population mean by more than 100 is approximately .14. In fact, among the 500 samples of size 64 in Example A in Section 7.3.1, 82, or 16.4%, differed by more than 100 from the population mean. Similarly, the central limit theorem approximation gives .026 as the probability of deviations of more than 150 from the population mean. In the simulation in Example A in Section 7.3.1, 11 of 500, or 2.2%, differed by more than 150. If we are not too finicky, the central limit theorem gives us reasonable and useful approximations. ■

EXAMPLE B
For a sample of size 50, the standard error of the sample mean number of discharges is \sigma_{\bar{X}} = 78. For the particular sample of size 50 discussed in Example A in Section 7.3.2, we found \bar{X} = 938.5, so \bar{X} - \mu = 123.9. We now calculate an approximation of the probability of an error this large or larger:

P(|\bar{X} - \mu| \geq 123.9) = 1 - P(|\bar{X} - \mu| < 123.9)
  \approx 1 - \left[ 2\Phi\left( \frac{123.9}{78} \right) - 1 \right]
  = 2 - 2\Phi(1.59) = .11

Thus, we can expect an error this large or larger to occur about 11% of the time. ■

EXAMPLE C
In Example C in Section 7.3.2, we found from the sample of size 50 an estimate \hat{p} = .52 of the proportion of hospitals that discharged fewer than 1000 patients; in fact, the actual proportion in the population is .65. Thus, |\hat{p} - p| = .13. What is the probability that an estimate will be off by an amount this large or larger? We have

\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n} \left( 1 - \frac{n-1}{N-1} \right)} = .068 \times .94 = .064

We can therefore calculate

P(|p - \hat{p}| > .13) = 1 - P(|p - \hat{p}| \leq .13)
  = 1 - P\left( \frac{|p - \hat{p}|}{\sigma_{\hat{p}}} \leq \frac{.13}{\sigma_{\hat{p}}} \right)
  \approx 2[1 - \Phi(2.03)] = .04

We see that the sample was rather "unlucky"—an error this large or larger would occur only about 4% of the time. ■

We can now derive a confidence interval for the population mean, \mu. A confidence interval for a population parameter, \theta, is a random interval, calculated from the sample, that contains \theta with some specified probability. For example, a 95% confidence interval for \mu is a random interval that contains \mu with probability .95; if we were to take many random samples and form a confidence interval from each one, about 95% of these intervals would contain \mu. If the coverage probability is 1 - \alpha, the interval is called a 100(1 - \alpha)% confidence interval. Confidence intervals are frequently used in conjunction with point estimates to convey information about the uncertainty of the estimates.

For 0 \leq \alpha \leq 1, let z(\alpha) be that number such that the area under the standard normal density function to the right of z(\alpha) is \alpha (Figure 7.3). Note that the symmetry of the standard normal density function about zero implies that z(1 - \alpha) = -z(\alpha). If Z follows a standard normal distribution, then, by definition of z(\alpha),

P(-z(\alpha/2) \leq Z \leq z(\alpha/2)) = 1 - \alpha

From the central limit theorem, (\bar{X} - \mu)/\sigma_{\bar{X}} has approximately a standard normal distribution, so

P\left( -z(\alpha/2) \leq \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \leq z(\alpha/2) \right) \approx 1 - \alpha

FIGURE 7.3 A standard normal density showing \alpha and z(\alpha).

Elementary manipulation of the inequalities gives

P(\bar{X} - z(\alpha/2)\sigma_{\bar{X}} \leq \mu \leq \bar{X} + z(\alpha/2)\sigma_{\bar{X}}) \approx 1 - \alpha

That is, the probability that \mu lies in the interval \bar{X} \pm z(\alpha/2)\sigma_{\bar{X}} is approximately 1 - \alpha. The interval is thus called a 100(1 - \alpha)% confidence interval.
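The normal-approximation probabilities in Examples A and B above can be reproduced with a few lines, using the standard errors quoted there; scipy's standard normal cdf plays the role of \Phi:

from scipy.stats import norm

se_64, se_50 = 67.5, 78.0   # standard errors quoted in Examples A and B

# Example A: P(|Xbar - mu| > 100) and P(|Xbar - mu| > 150) with n = 64.
p_100 = 2 * (1 - norm.cdf(100 / se_64))    # approximately .14
p_150 = 2 * (1 - norm.cdf(150 / se_64))    # approximately .026

# Example B: P(|Xbar - mu| >= 123.9) with n = 50.
p_err = 2 * (1 - norm.cdf(123.9 / se_50))  # approximately .11

print(round(p_100, 3), round(p_150, 3), round(p_err, 2))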
It is important to understand that this interval is random and that the preceding equation states that the probability that this random interval covers \mu is 1 - \alpha. In practice, \alpha is assigned a small value, such as .1, .05, or .01, so that the probability that the interval covers \mu will be large. Also, since the population variance is typically not known, s_{\bar{X}} is substituted for \sigma_{\bar{X}}. For large samples, it can be shown that the effect of this substitution is practically negligible. It is impossible to give a precise answer to the question "How large is large?" As a rule of thumb, a value of n greater than 25 or 30 is usually adequate.

To illustrate the concept of a confidence interval, 20 samples each of size n = 25 were drawn from the population of hospital discharges. From each of these 20 samples, an approximate 95% confidence interval for \mu, the mean number of discharges, was computed. These 20 confidence intervals are displayed as vertical lines in Figure 7.4; the dashed line in the figure is drawn at the true value, \mu = 814.6.

FIGURE 7.4 Vertical lines are 20 approximate 95% confidence intervals for \mu. The horizontal line is the true value of \mu.

Notice that it so happened that all the confidence intervals included \mu; since these are 95% intervals, on the average 5%, or 1 out of 20, would not include \mu. The following example illustrates the procedure for calculating confidence intervals.

EXAMPLE D
A particular area contains 8000 condominium units. In a survey of the occupants, a simple random sample of size 100 yields the information that the average number of motor vehicles per unit is 1.6, with a sample standard deviation of .8. The estimated standard error of \bar{X} is thus

s_{\bar{X}} = \frac{s}{\sqrt{n}} \sqrt{1 - \frac{n}{N}} = \frac{.8}{10} \sqrt{1 - \frac{100}{8000}} = .08

Note that the finite population correction makes almost no difference. Since z(.025) = 1.96, a 95% confidence interval for the population average is \bar{X} \pm 1.96 s_{\bar{X}}, or (1.44, 1.76).

An estimate of the total number of motor vehicles is T = 8000 \times 1.6 = 12,800. The estimated standard error of T is

s_T = N s_{\bar{X}} = 640

A 95% confidence interval for the total number of motor vehicles is T \pm 1.96 s_T, or (11,546, 14,054).

In the same survey, 12% of the respondents said they planned to sell their condos within the next year; \hat{p} = .12 is an estimate of the population proportion p. The estimated standard error is

s_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n-1} \left( 1 - \frac{100}{8000} \right)} = .03

A 95% confidence interval for p is \hat{p} \pm 1.96 s_{\hat{p}}, or (.06, .18). The total number of owners planning to sell is estimated as T = N\hat{p} = 960. The estimated standard error of T is s_T = N s_{\hat{p}} = 240. A 95% confidence interval for the number in the population planning to sell is T \pm 1.96 s_T, or (490, 1430).

The proper interpretation of this interval, (490, 1430), is a little subtle. We cannot state that the probability is 0.95 that the number of owners planning to sell is between 490 and 1430, because that number is either in this interval or not. What is true is that 95% of intervals formed in this way will contain the true number in the long run. This interval is like one of those shown in Figure 7.4; in the long run, 95% of those intervals will contain the true number, but any particular interval either does or doesn't contain it. ■
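Example D's intervals are a direct computation. A sketch, using only the numbers given in the example:

import math

N, n = 8000, 100
xbar, s = 1.6, 0.8
p_hat = 0.12
z = 1.96   # z(.025) for a 95% confidence interval

s_xbar = s / math.sqrt(n) * math.sqrt(1 - n / N)            # about .08
ci_mean = (xbar - z * s_xbar, xbar + z * s_xbar)            # about (1.44, 1.76)

s_T = N * s_xbar
ci_total = (N * xbar - z * s_T, N * xbar + z * s_T)         # about (11546, 14054)

s_phat = math.sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))   # about .03
ci_sellers = (N * p_hat - z * N * s_phat, N * p_hat + z * N * s_phat)

print(ci_mean, ci_total, ci_sellers)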
The width of a confidence interval is determined by the sample size n and the population standard deviation \sigma. If \sigma is known approximately, perhaps from earlier samples of the population, n can be chosen so as to obtain a confidence interval close to some desired length. Such analysis is usually an important aspect of planning the design of a sample survey.

EXAMPLE E
The interval for the total number of owners planning to sell in Example D might be considered too wide for practical purposes; reducing its width would require a larger sample size. Suppose that an interval with a half-width of 200 is desired. Neglecting the finite population correction, the half-width is

1.96 s_T = 1.96 N \sqrt{\frac{\hat{p}(1 - \hat{p})}{n-1}} = \frac{5095}{\sqrt{n-1}}

Setting the last expression equal to 200 and solving for n yields n = 650 as the necessary sample size. ■

Let us summarize: The fundamental result of this section is that the sampling distribution of the sample mean is approximately Gaussian. This approximation can be used to quantify the error committed in estimating the population mean by the sample mean, thus giving us a good understanding of the accuracy of estimates produced by a simple random sample. We next introduced the idea of a confidence interval, a random interval that contains a population parameter with a specified probability and thus provides an assessment of the accuracy of the corresponding estimate of that parameter. We have seen in our examples that the width of the confidence interval is a multiple of the estimated standard deviation of the estimate; for example, a confidence interval for \mu is \bar{X} \pm k s_{\bar{X}}, where the constant k depends on the coverage probability of the interval.

7.4 Estimation of a Ratio

The foundations of the theory of survey sampling have been laid in the preceding sections on simple random sampling. This and the next section build on that foundation, developing some advanced topics in survey sampling.

In this section, we consider the estimation of a ratio. Suppose that for each member of a population, two values, x and y, may be measured. The ratio of interest is

r = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i} = \frac{\mu_y}{\mu_x}

Ratios arise frequently in sample surveys; for example, if households are sampled, the following ratios might be calculated:

• If y is the number of unemployed males aged 20–30 in a household and x is the number of males aged 20–30 in a household, then r is the proportion of unemployed males aged 20–30.
• If y is weekly food expenditure and x is number of inhabitants, then r is weekly food cost per inhabitant.
• If y is the number of motor vehicles and x is the number of inhabitants of driving age, then r is the number of motor vehicles per inhabitant of driving age.

In a survey of farms, y might be the acres of wheat planted and x the total acreage. In an inventory audit, y might be the audited value of an item and x the book value. In this section, we first consider directly the problem of estimating a ratio. Later, we will use the estimation of a ratio as a technique for estimating \mu_y. We will produce a new estimate, the ratio estimate, which we will compare to the ordinary estimate, \bar{Y}.

Before continuing, we note the elementary but sometimes overlooked fact that

r \neq \frac{1}{N} \sum_{i=1}^{N} \frac{y_i}{x_i}

that is, the ratio of the means is not in general the same as the mean of the ratios.

Suppose that a sample is drawn consisting of the pairs (X_i, Y_i); the natural estimate of r is R = \bar{Y}/\bar{X}. We wish to derive expressions for E(R) and Var(R), but since R is a nonlinear function of the random variables \bar{X} and \bar{Y}, we cannot do this in closed form.
We will therefore employ the approximate methods of Section 4.6. In order to calculate the approximate variance of R, we need to know Var(\bar{X}), Var(\bar{Y}), and Cov(\bar{X}, \bar{Y}). The first two quantities we know from Theorem B of Section 7.3.1. For the last quantity, we define the population covariance of x and y to be

\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)

It can then be shown, in a manner entirely analogous to the proof of Theorem B in Section 7.3.1, that

Cov(\bar{X}, \bar{Y}) = \frac{\sigma_{xy}}{n} \left( 1 - \frac{n-1}{N-1} \right)

From Example C in Section 4.6, we have the following theorem.

THEOREM A
With simple random sampling, the approximate variance of R = \bar{Y}/\bar{X} is

Var(R) \approx \frac{1}{\mu_x^2} \left( r^2 \sigma_{\bar{X}}^2 + \sigma_{\bar{Y}}^2 - 2r\sigma_{\bar{X}\bar{Y}} \right)
       = \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \frac{1}{\mu_x^2} \left( r^2 \sigma_x^2 + \sigma_y^2 - 2r\sigma_{xy} \right)  ■

The population correlation coefficient is defined as

\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}

and is used as a measure of the strength of the linear relationship between the x and y values in the population. It can be shown that -1 \leq \rho \leq 1; large values of \rho indicate a strong positive relationship between x and y, and small values indicate a strong negative relationship. (See Figure 4.7 for some illustrations of correlation.) The equation in Theorem A can be expressed in terms of the population correlation coefficient as follows:

Var(R) \approx \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \frac{1}{\mu_x^2} \left( r^2 \sigma_x^2 + \sigma_y^2 - 2r\rho\sigma_x\sigma_y \right)

From this expression, we see that strong correlation of the same sign as r decreases the variance. We also note that the variance is affected by the size of \mu_x—if \mu_x is small, the variance is large, essentially because small values of \bar{X} in the ratio R = \bar{Y}/\bar{X} cause R to fluctuate wildly.

We now consider the approximate expectation of R. From Example C in Section 4.6 and the preceding calculations, we have the following theorem.

THEOREM B
With simple random sampling, the expectation of R is given approximately by

E(R) \approx r + \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \frac{1}{\mu_x^2} \left( r\sigma_x^2 - \rho\sigma_x\sigma_y \right)  ■

From the equation in Theorem B, we see that strong correlation of the same sign as r decreases the bias and that the bias is large if \mu_x is small. Furthermore, note that the bias is of the order 1/n, so its contribution to the mean squared error is of the order 1/n^2. In comparison, the contribution of the variance is of the order 1/n. Therefore, for large samples, the bias is negligible compared to the standard error of the estimate.

For large samples, truncating the Taylor series after the linear term provides a good approximation, since the deviations \bar{X} - \mu_x and \bar{Y} - \mu_y are likely to be small. To this order of approximation, R is expressed as a linear combination of \bar{X} and \bar{Y}, and an argument based on the central limit theorem can be used to show that R is approximately normally distributed. Approximate confidence intervals can thus be formed for r by using the normal distribution.

In order to estimate the standard error of R, we substitute R for r in the formula of Theorem A. The x and y population variances are estimated by s_x^2 and s_y^2. The population covariance is estimated by

s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} \right)

(as can be seen by expanding the product), and the population correlation is estimated by

\hat{\rho} = \frac{s_{xy}}{s_x s_y}

The estimated variance of R is thus

s_R^2 = \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \frac{1}{\bar{X}^2} \left( R^2 s_x^2 + s_y^2 - 2Rs_{xy} \right)

An approximate 100(1 - \alpha)% confidence interval for r is R \pm z(\alpha/2)s_R.

EXAMPLE A
Suppose that 100 people who recently bought houses are surveyed, and the monthly mortgage payment and gross income of each buyer are determined.
Let y denote the mortgage payment and x the gross income. Suppose that

\bar{X} = \$3100    \bar{Y} = \$868
s_x = \$1200        s_y = \$250
\hat{\rho} = .85    R = .28

Neglecting the finite population correction, the estimated standard error of R is

s_R = \frac{1}{10} \cdot \frac{1}{3100} \sqrt{.28^2 \times 1200^2 + 250^2 - 2 \times .28 \times .85 \times 250 \times 1200} = .006

An approximate 95% confidence interval for r is .28 \pm (1.96) \times (.006), or .28 \pm .012. Note that the high correlation between x and y causes the standard error of R to be small.

We can use the observed values for the variances, covariances, and means to gauge the order of magnitude of the bias by substituting them in place of the population parameters in the formula of Theorem B. Doing so, and again neglecting the finite population correction, gives the value .00015 for the bias, which is negligible relative to s_R. Note that the large value of \bar{X} and the large positive correlation coefficient cause the bias to be small. ■

Ratios may also be used as tools for estimating population means and totals. To illustrate the concept, we return to the example of hospital discharges. For this population, the number of beds in each hospital is also known; let us denote the number of beds in the ith hospital by x_i and the number of discharges by y_i. Suppose that all the x_i are known, perhaps from an earlier enumeration, before a sample has been taken to estimate the number of discharges, and that we would like to take advantage of this information. One way to do this is to form a ratio estimate of \mu_y:

\bar{Y}_R = \mu_x \frac{\bar{Y}}{\bar{X}} = \mu_x R

where \bar{X} is the average number of beds and \bar{Y} is the average number of discharges in the sample. The idea is fairly simple: We expect x_i and y_i to be closely related in the population, since a hospital with a large number of beds should tend to have a large number of discharges. This is borne out by Figure 7.5, a scatterplot of the number of discharges versus the number of beds. If \bar{X} < \mu_x, the sample underestimates the number of beds and probably the number of discharges as well; multiplying \bar{Y} by \mu_x/\bar{X} increases \bar{Y} to \bar{Y}_R.

FIGURE 7.5 Scatterplot of the number of discharges versus the number of beds for the 393 hospitals.

To see how this ratio estimate works in practice, it was simulated from 500 samples of size 64 from the population of hospitals. The histogram of the results is shown in Figure 7.6 along with the histogram of the means of 500 simple random samples of size 64. The comparison shows dramatically how effective the ratio estimate is at reducing variability.

FIGURE 7.6 (a) A histogram of the means of 500 simple random samples of size 64 from the population of discharges; (b) a histogram of the values of 500 ratio estimates of the mean number of discharges from samples of size 64.

Two more examples will illustrate the scope of the ratio estimation method.

EXAMPLE B
Suppose that we want to estimate the total number of unemployed males aged 20–30 from a sample of households and that we know \tau_x, the total number of males aged 20–30, from census data. The ratio estimate is

T_R = \tau_x \frac{\bar{Y}}{\bar{X}}

where \bar{Y} is the average number of unemployed males aged 20–30 per household in the sample, and \bar{X} is the sample average number of males aged 20–30 per household. ■
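Returning to Example A (the mortgage survey), the estimated standard error and the approximate bias of R can be computed directly from the reported summary statistics. A sketch, neglecting the finite population correction as in that example:

import math

n = 100
xbar, ybar = 3100.0, 868.0
sx, sy, rho_hat = 1200.0, 250.0, 0.85
R = 0.28

# Estimated standard error of R, with s_xy estimated as rho_hat * sx * sy.
s_R = (1 / math.sqrt(n)) * (1 / xbar) * math.sqrt(
    R**2 * sx**2 + sy**2 - 2 * R * rho_hat * sx * sy)       # about .006

# Order of magnitude of the bias, from the formula of Theorem B.
bias = (1 / n) * (1 / xbar**2) * (R * sx**2 - rho_hat * sx * sy)  # about .00015

ci = (R - 1.96 * s_R, R + 1.96 * s_R)
print(round(s_R, 4), round(bias, 5), ci)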
EXAMPLE C
A sample of items in an inventory is taken to estimate the total value of the inventory. Let Y_i be the audited value of the ith sample item, and let X_i be its book value. We assume that \tau_x, the total book value of the inventory, is known, and we estimate the total audited value by

T_R = \tau_x \frac{\bar{Y}}{\bar{X}}  ■

We will now analyze the observed success of the ratio estimate. Since \bar{Y}_R = \mu_x R, Var(\bar{Y}_R) = \mu_x^2 Var(R). From Theorem A, we thus have the following.

COROLLARY A
The approximate variance of the ratio estimate of \mu_y is

Var(\bar{Y}_R) \approx \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \left( r^2 \sigma_x^2 + \sigma_y^2 - 2r\rho\sigma_x\sigma_y \right)  ■

Similarly, from Theorem B, we have another corollary.

COROLLARY B
The approximate bias of the ratio estimate of \mu_y is

E(\bar{Y}_R) - \mu_y \approx \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \frac{1}{\mu_x} \left( r\sigma_x^2 - \rho\sigma_x\sigma_y \right)  ■

When will the ratio estimate \bar{Y}_R be better than the ordinary estimate \bar{Y}? In the following, the finite population correction is neglected for simplicity. Since the variance of the ordinary estimate \bar{Y} is

Var(\bar{Y}) = \frac{\sigma_y^2}{n}

the ratio estimate has a smaller variance if

r^2 \sigma_x^2 - 2r\rho\sigma_x\sigma_y < 0

or (provided r > 0, for example)

2\rho\sigma_y > r\sigma_x

Letting C_x = \sigma_x/\mu_x and C_y = \sigma_y/\mu_y, this last inequality is equivalent to

\rho > \frac{1}{2} \frac{C_x}{C_y}

C_x and C_y are called coefficients of variation and give the standard deviation as a proportion of the mean. (Coefficients of variation are often more meaningful than standard deviations. For example, a standard deviation of 10 means one thing if the true value of the quantity being measured is 100 and something entirely different if the true value is 10,000.)

In order to assess the accuracy of \bar{Y}_R, Var(\bar{Y}_R) can be estimated from the sample.

COROLLARY C
The variance of \bar{Y}_R can be estimated by

s_{\bar{Y}_R}^2 = \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \left( R^2 s_x^2 + s_y^2 - 2Rs_{xy} \right)

and an approximate 100(1 - \alpha)% confidence interval for \mu_y is \bar{Y}_R \pm z(\alpha/2) s_{\bar{Y}_R}.  ■

EXAMPLE D
For the population of 393 hospitals, we have

\mu_x = 274.8    \sigma_x = 213.2
\mu_y = 814.6    \sigma_y = 589.7
r = 2.96         \rho = .91

Thus,

Var(\bar{Y}_R) \approx \frac{1}{n} (2.96^2 \times 213.2^2 + 589.7^2 - 2 \times 2.96 \times .91 \times 213.2 \times 589.7) = \frac{68,697.4}{n}

and

\sigma_{\bar{Y}_R} \approx \frac{262.1}{\sqrt{n}}

Including the finite population correction, the linearized approximation predicts that, with n = 64,

\sigma_{\bar{Y}_R} = \frac{1}{8} (262.1) \sqrt{1 - \frac{63}{392}} = 30.0

The actual standard deviation of the 500 sample values displayed in Figure 7.6 is 29.9, which is remarkably close. The mean of the 500 values is 816.2, compared to the population mean of 814.6; the slight apparent bias is consistent with Corollary B. In contrast, the standard deviation of \bar{Y} from a simple random sample of size n = 64 is

\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \sqrt{1 - \frac{n-1}{N-1}} = \frac{589.7}{8} \sqrt{1 - \frac{63}{392}} = 67.5

The comparison of \sigma_{\bar{Y}} to \sigma_{\bar{Y}_R} is consistent with the substantial reduction in variability accomplished by using a ratio estimate of \mu_y shown in Figure 7.6.

The following is another way of interpreting this comparison. If a simple random sample of size n_1 is taken, the variance of the estimate is Var(\bar{Y}) = 589.7^2/n_1. A ratio estimate from a sample of size n_2 will have the same variance if

\frac{262.1^2}{n_2} = \frac{589.7^2}{n_1}

or

n_2 = n_1 \left( \frac{262.1}{589.7} \right)^2 = .1975 n_1

Thus, in this case, we can obtain the same precision from a ratio estimate using a sample about 80% smaller than the simple random sample. Note that this comparison neglects the bias of the ratio estimate, which is justifiable in this case because the bias is quite small. Here is a case in which a biased estimate performs substantially better than an unbiased estimate, the bias being quite small and the reduction in variance being quite large. ■
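Example D's variance comparison can be reproduced from the population figures quoted there. A sketch:

import math

N, n = 393, 64
sigma_x, sigma_y = 213.2, 589.7
r, rho = 2.96, 0.91

# n * Var(Ybar_R), from Corollary A without the finite population correction.
per_n = r**2 * sigma_x**2 + sigma_y**2 - 2 * r * rho * sigma_x * sigma_y
# per_n is about 68,697.4, so sigma for the ratio estimate is 262.1 / sqrt(n).

fpc = math.sqrt(1 - (n - 1) / (N - 1))
se_ratio = math.sqrt(per_n / n) * fpc      # about 30.0
se_srs = sigma_y / math.sqrt(n) * fpc      # about 67.5

# Sample-size ratio at which a ratio estimate matches an s.r.s. in variance:
print(se_ratio, se_srs, per_n / sigma_y**2)   # last value about .1975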
7.5 Stratified Random Sampling

7.5.1 Introduction and Notation

In stratified random sampling, the population is partitioned into subpopulations, or strata, which are then independently sampled. The results from the strata are then combined to estimate population parameters, such as the mean.

Following are some examples that suggest the range of situations in which stratification is natural:

• In auditing financial transactions, the transactions may be grouped into strata on the basis of their nominal values. For example, high-value, medium-value, and low-value strata might be formed.
• In samples of human populations, geographical areas often form natural strata.
• In a study of records of shipments of household goods by motor carriers, the carriers were grouped into three strata: large carriers, medium carriers, and small carriers.

Stratified samples are used for a variety of reasons. We are often interested in obtaining information about each of a number of natural subpopulations in addition to information about the population as a whole. The subpopulations might be defined by geographical areas or age groups. In an industrial application in which the population consists of items produced by a manufacturing process, relevant subpopulations might consist of items produced during different shifts or from different lots of raw material. The use of a stratified random sample guarantees a prescribed number of observations from each subpopulation, whereas the use of a simple random sample can result in underrepresentation of some subpopulations. A second reason for using stratification is that, as will be shown below, the stratified sample mean can be considerably more precise than the mean of a simple random sample, especially if the population members within each stratum are relatively homogeneous and if there is considerable variation between strata.

In the next section, properties of the stratified sample mean are derived. Since a simple random sample is taken within each stratum, the results will follow easily from the derivations of earlier sections. The section after that takes up the problem of how to allocate the total number of observations, n, among the various strata. Comparisons will be made of the efficiencies of different allocation schemes and also of the precisions of these allocation schemes relative to that of a simple random sample of the same total size.

7.5.2 Properties of Stratified Estimates

Suppose there are L strata in all. Let the number of population elements in stratum 1 be denoted by N_1, the number in stratum 2 by N_2, etc. The total population size is N = N_1 + N_2 + \cdots + N_L. The population mean and variance of the lth stratum are denoted by \mu_l and \sigma_l^2. The overall population mean can be expressed in terms of the \mu_l as follows. Let x_{il} denote the ith population value in the lth stratum, and let W_l = N_l/N denote the fraction of the population in the lth stratum. Then

\mu = \frac{1}{N} \sum_{l=1}^{L} \sum_{i=1}^{N_l} x_{il} = \frac{1}{N} \sum_{l=1}^{L} N_l \mu_l = \sum_{l=1}^{L} W_l \mu_l

Within each stratum, a simple random sample of size n_l is taken. The sample mean in stratum l is denoted by

\bar{X}_l = \frac{1}{n_l} \sum_{i=1}^{n_l} X_{il}

Here X_{il} denotes the ith sample value in the lth stratum. Note that \bar{X}_l is the mean of a simple random sample from the population consisting of the lth stratum, so from Theorem A of Section 7.3.1, E(\bar{X}_l) = \mu_l.
By analogy with the preceding relationship between the overall population mean and the population means of the various strata, the obvious estimate of \mu is

\bar{X}_s = \sum_{l=1}^{L} \frac{N_l \bar{X}_l}{N} = \sum_{l=1}^{L} W_l \bar{X}_l

THEOREM A
The stratified estimate, \bar{X}_s, of the population mean is unbiased.

Proof
E(\bar{X}_s) = \sum_{l=1}^{L} W_l E(\bar{X}_l) = \frac{1}{N} \sum_{l=1}^{L} N_l \mu_l = \mu  ■

Since we assume that the samples from different strata are independent of one another and that within each stratum a simple random sample is taken, the variance of \bar{X}_s can be easily calculated.

THEOREM B
The variance of the stratified sample mean is given by

Var(\bar{X}_s) = \sum_{l=1}^{L} W_l^2 \frac{1}{n_l} \left( 1 - \frac{n_l - 1}{N_l - 1} \right) \sigma_l^2

Proof
Since the \bar{X}_l are independent,

Var(\bar{X}_s) = \sum_{l=1}^{L} W_l^2 Var(\bar{X}_l)

From Theorem B of Section 7.3.1, we have

Var(\bar{X}_l) = \frac{1}{n_l} \left( 1 - \frac{n_l - 1}{N_l - 1} \right) \sigma_l^2

Therefore, the desired result follows. ■

If the sampling fractions within all strata are small,

Var(\bar{X}_s) \approx \sum_{l=1}^{L} W_l^2 \frac{\sigma_l^2}{n_l}

EXAMPLE A
We again consider the population of hospitals. As we did in the discussion of ratio estimates, we assume that the number of beds in each hospital is known but that the number of discharges is not. We will try to make use of this knowledge by stratifying the hospitals according to the number of beds. Let stratum A consist of the 98 smallest hospitals, stratum B of the 98 next larger, stratum C of the 98 next larger, and stratum D of the 99 largest. The following table shows the results of this stratification of hospitals by size:

Stratum    N_l    W_l     \mu_l      \sigma_l
A          98     .249    182.9      103.4
B          98     .249    526.5      204.8
C          98     .249    956.3      243.5
D          99     .251    1591.2     419.2

Suppose that we use a sample of total size n and let

n_1 = n_2 = n_3 = n_4 = \frac{n}{4}

so that we have equal sample sizes in each stratum. Then, from Theorem B, neglecting the finite population corrections and using the numerical values in the preceding table, we have

Var(\bar{X}_s) = \sum_{l=1}^{4} W_l^2 \frac{\sigma_l^2}{n_1} = \frac{4}{n} \sum_{l=1}^{4} W_l^2 \sigma_l^2 = \frac{72,042.6}{n}

and

\sigma_{\bar{X}_s} = \frac{268.4}{\sqrt{n}}

The standard deviation of the mean of a simple random sample is

\sigma_{\bar{X}} = \frac{589.7}{\sqrt{n}}

Comparing the two standard deviations, we see that a tremendous gain in precision has resulted from the stratification. The ratio of the variances is .20; thus a stratified estimate based on a total sample size of n/5 is as precise as a simple random sample of size n. The reduction in variance due to stratification is comparable to that achieved by using a ratio estimate (Example D in Section 7.4). In later parts of this section, we will look more analytically at why the stratification done here produced such dramatic improvement. ■

Let us next consider the stratified estimate of the population total, T_s = N\bar{X}_s. From Theorem B, we have the following corollary.

COROLLARY A
The expectation and variance of the stratified estimate of the population total are

E(T_s) = \tau

and

Var(T_s) = N^2 Var(\bar{X}_s) = \sum_{l=1}^{L} N_l^2 \frac{1}{n_l} \left( 1 - \frac{n_l - 1}{N_l - 1} \right) \sigma_l^2  ■

In order to estimate the standard errors of \bar{X}_s and T_s, the variances of the individual strata must be separately estimated and substituted into the preceding formulae. The estimate of \sigma_l^2 is given by

s_l^2 = \frac{1}{n_l - 1} \sum_{i=1}^{n_l} (X_{il} - \bar{X}_l)^2

Var(\bar{X}_s) is estimated by

s_{\bar{X}_s}^2 = \sum_{l=1}^{L} W_l^2 \frac{1}{n_l} \left( 1 - \frac{n_l}{N_l} \right) s_l^2

The next example illustrates how this variance estimate can be used to find approximate confidence intervals for \mu based on \bar{X}_s.
EXAMPLE B
A sample of size 10 was drawn from each of the four strata of hospitals described in Example A, yielding the following:

\bar{X}_1 = 240.6     s_1^2 = 6827.6
\bar{X}_2 = 507.4     s_2^2 = 23,790.7
\bar{X}_3 = 865.1     s_3^2 = 42,573.0
\bar{X}_4 = 1716.5    s_4^2 = 152,099.6

Therefore, \bar{X}_s = 832.5. The variance of the stratified sample mean is estimated by

s_{\bar{X}_s}^2 = \frac{1}{10} \sum_{l=1}^{4} W_l^2 \left( 1 - \frac{n_l - 1}{N_l - 1} \right) s_l^2 = 1282.0

Thus,

s_{\bar{X}_s} = 35.8

An approximate 95% confidence interval for the population mean number of discharges is \bar{X}_s \pm 1.96 s_{\bar{X}_s}, or (762.4, 902.7). The total number of discharges is estimated by T_s = 393 \bar{X}_s = 327,172. The standard error of T_s is estimated by s_{T_s} = 393 s_{\bar{X}_s} = 14,069. An approximate 95% confidence interval for the population total is T_s \pm 1.96 s_{T_s}, or (299,596, 354,748). ■

7.5.3 Methods of Allocation

In Section 7.5.2, it was shown that, neglecting the finite population correction,

Var(\bar{X}_s) = \sum_{l=1}^{L} W_l^2 \frac{\sigma_l^2}{n_l}

If the resources of a survey allow only a total of n units to be sampled, the question arises of how to choose n_1, \ldots, n_L to minimize Var(\bar{X}_s) subject to the constraint n_1 + \cdots + n_L = n. For the sake of simplicity, the calculations in this section ignore the finite population correction within each stratum. The analysis may be extended to include these corrections, but at the cost of some additional algebra. More complete results are contained in Cochran (1977).

THEOREM A
The sample sizes n_1, \ldots, n_L that minimize Var(\bar{X}_s) subject to the constraint n_1 + \cdots + n_L = n are given by

n_l = n \frac{W_l \sigma_l}{\sum_{k=1}^{L} W_k \sigma_k}

where l = 1, \ldots, L.

Proof
We introduce a Lagrange multiplier, and we must then minimize

L(n_1, \ldots, n_L, \lambda) = \sum_{l=1}^{L} W_l^2 \frac{\sigma_l^2}{n_l} + \lambda \left( \sum_{l=1}^{L} n_l - n \right)

For l = 1, \ldots, L, we have

\frac{\partial L}{\partial n_l} = -\frac{W_l^2 \sigma_l^2}{n_l^2} + \lambda

Setting these partial derivatives equal to zero, we have the system of equations

n_l = \frac{W_l \sigma_l}{\sqrt{\lambda}}

for l = 1, \ldots, L. To determine \lambda, we first sum these equations over l:

n = \frac{1}{\sqrt{\lambda}} \sum_{l=1}^{L} W_l \sigma_l

Thus,

\frac{1}{\sqrt{\lambda}} = \frac{n}{\sum_{l=1}^{L} W_l \sigma_l}

and

n_l = n \frac{W_l \sigma_l}{\sum_{l=1}^{L} W_l \sigma_l}

which proves the theorem. ■

This theorem shows that those strata for which W_l \sigma_l is large should be sampled heavily. This makes sense intuitively. If W_l is large, the stratum contains a large fraction of the population; if \sigma_l is large, the population values in the stratum are quite variable, and in order to obtain a good determination of the stratum's mean, a relatively large sample size must be used.

This optimal allocation scheme is called Neyman allocation. Substituting the optimal values of n_l as given in Theorem A into the equation for Var(\bar{X}_s) given in Theorem B in Section 7.5.2 gives us the following corollary.

COROLLARY A
Denoting by \bar{X}_{so} the stratified estimate using the optimal allocations as given in Theorem A and neglecting the finite population correction,

Var(\bar{X}_{so}) = \frac{\left( \sum_{l=1}^{L} W_l \sigma_l \right)^2}{n}  ■

EXAMPLE A
For the population of hospitals, the weights for optimal allocation, W_l \sigma_l / \sum_k W_k \sigma_k, are, from the table of Example A of Section 7.5.2,

Stratum    A      B      C      D
Weight     .106   .210   .250   .434

Note that, because of its larger standard deviation, stratum D is sampled more than four times as heavily as stratum A. ■

The optimal allocations depend on the individual variances of the strata, which generally will not be known. Furthermore, if a survey measures several attributes for each population member, it is usually impossible to find an allocation that is simultaneously optimal for all of those variables.
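The two Example A calculations above (the equal-allocation variance of Section 7.5.2 and the Neyman weights of Theorem A) both follow directly from the strata table. A sketch:

import numpy as np

W = np.array([0.249, 0.249, 0.249, 0.251])       # stratum fractions W_l
sigma = np.array([103.4, 204.8, 243.5, 419.2])   # stratum std devs sigma_l

# Equal allocation, n_l = n/4: n * Var(Xbar_s) = 4 * sum(W_l^2 * sigma_l^2),
# about 72,042.6, so the standard error is about 268.4 / sqrt(n).
print(4 * np.sum(W**2 * sigma**2))

# Neyman allocation: n_l proportional to W_l * sigma_l, giving weights
# of about .106, .210, .250, .434.
print(W * sigma / np.sum(W * sigma))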
A simple and popular alternative method of allocation is to use the same sampling fraction in each stratum,

\frac{n_1}{N_1} = \frac{n_2}{N_2} = \cdots = \frac{n_L}{N_L}

which holds if

n_l = n \frac{N_l}{N} = n W_l

for l = 1, \ldots, L. This method is called proportional allocation. The estimate of the population mean based on proportional allocation is

\bar{X}_{sp} = \sum_{l=1}^{L} W_l \bar{X}_l = \sum_{l=1}^{L} W_l \frac{1}{n_l} \sum_{i=1}^{n_l} X_{il} = \frac{1}{n} \sum_{l=1}^{L} \sum_{i=1}^{n_l} X_{il}

since W_l/n_l = 1/n. This estimate is simply the unweighted mean of the sample values.

THEOREM B
With stratified sampling based on proportional allocation, ignoring the finite population correction,

Var(\bar{X}_{sp}) = \frac{1}{n} \sum_{l=1}^{L} W_l \sigma_l^2

Proof
From Theorem B of Section 7.5.2, we have

Var(\bar{X}_{sp}) = \sum_{l=1}^{L} W_l^2 Var(\bar{X}_l) = \sum_{l=1}^{L} W_l^2 \frac{\sigma_l^2}{n_l}

Using n_l = nW_l, the result follows. ■

We now compare Var(\bar{X}_{sp}) and Var(\bar{X}_{so}) in order to discover the circumstances under which optimal allocation is substantially better than proportional allocation.

THEOREM C
With stratified random sampling, the difference between the variance of the estimate of the population mean based on proportional allocation and the variance of that estimate based on optimal allocation is, ignoring the finite population correction,

Var(\bar{X}_{sp}) - Var(\bar{X}_{so}) = \frac{1}{n} \sum_{l=1}^{L} W_l (\sigma_l - \bar{\sigma})^2

where

\bar{\sigma} = \sum_{l=1}^{L} W_l \sigma_l

Proof

Var(\bar{X}_{sp}) - Var(\bar{X}_{so}) = \frac{1}{n} \left[ \sum_{l=1}^{L} W_l \sigma_l^2 - \left( \sum_{l=1}^{L} W_l \sigma_l \right)^2 \right]

The term within the large brackets equals \sum_{l=1}^{L} W_l (\sigma_l - \bar{\sigma})^2, which may be verified by expanding the square and collecting terms. ■

According to Theorem C, if the variances of the strata are all the same, proportional allocation yields the same results as optimal allocation. The more variable these variances are, the better it is to use optimal allocation.

EXAMPLE B
Let us calculate how much better optimal allocation is than proportional allocation for the population of hospitals. From Theorem C and Corollary A, we have

Var(\bar{X}_{sp}) = Var(\bar{X}_{so}) + \frac{1}{n} \sum W_l (\sigma_l - \bar{\sigma})^2

Therefore,

\frac{Var(\bar{X}_{sp})}{Var(\bar{X}_{so})} = 1 + \frac{\frac{1}{n} \sum W_l (\sigma_l - \bar{\sigma})^2}{Var(\bar{X}_{so})} = 1 + \frac{\sum W_l (\sigma_l - \bar{\sigma})^2}{\left( \sum W_l \sigma_l \right)^2} = 1 + .218

Thus, under proportional allocation, the variance of the mean is about 20% larger than it is under optimal allocation. ■

We can also compare the variance under simple random sampling with the variance under proportional allocation. The variance under simple random sampling is, neglecting the finite population correction,

Var(\bar{X}) = \frac{\sigma^2}{n}

In order to compare this equation with that for the variance under proportional allocation, we need a relationship between the overall population variance, \sigma^2, and the strata variances, \sigma_l^2. The overall population variance may be expressed as

\sigma^2 = \frac{1}{N} \sum_{l=1}^{L} \sum_{i=1}^{N_l} (x_{il} - \mu)^2

Also,

(x_{il} - \mu)^2 = [(x_{il} - \mu_l) + (\mu_l - \mu)]^2 = (x_{il} - \mu_l)^2 + 2(x_{il} - \mu_l)(\mu_l - \mu) + (\mu_l - \mu)^2

When both sides of this last equation are summed over i, the middle term on the right-hand side becomes zero, since N_l \mu_l = \sum_{i=1}^{N_l} x_{il}, so we have

\sum_{i=1}^{N_l} (x_{il} - \mu)^2 = \sum_{i=1}^{N_l} (x_{il} - \mu_l)^2 + N_l (\mu_l - \mu)^2 = N_l \sigma_l^2 + N_l (\mu_l - \mu)^2

Dividing both sides by N and summing over l, we have

\sigma^2 = \sum_{l=1}^{L} W_l \sigma_l^2 + \sum_{l=1}^{L} W_l (\mu_l - \mu)^2

Substituting this expression for \sigma^2 into Var(\bar{X}) = \sigma^2/n and using the formula for Var(\bar{X}_{sp}) given in Theorem B completes a proof of the following theorem.
THEOREM D
The difference between the variance of the mean of a simple random sample and the variance of the mean of a stratified random sample based on proportional allocation is, neglecting the finite population correction,

Var(\bar{X}) - Var(\bar{X}_{sp}) = \frac{1}{n} \sum_{l=1}^{L} W_l (\mu_l - \mu)^2  ■

Thus, stratified random sampling with proportional allocation always gives a smaller variance than does simple random sampling, provided that the finite population correction is ignored. Comparing the equations for the variances under simple random sampling, proportional allocation, and optimal allocation, we see that stratification with proportional allocation is better than simple random sampling if the strata means are quite variable and that stratification with optimal allocation is even better than stratification with proportional allocation if the strata standard deviations are variable.

EXAMPLE C
We calculate the improvement that would result from using stratification with proportional allocation rather than simple random sampling for the population of hospitals. From Theorems B and D, we have

\frac{Var(\bar{X}_{srs})}{Var(\bar{X}_{sp})} = 1 + \frac{\sum W_l (\mu_l - \mu)^2}{\sum W_l \sigma_l^2} = 1 + 3.83

As is frequently the case, the gain from using stratification with proportional allocation rather than simple random sampling is much greater than the gain from using optimal allocation rather than proportional allocation. Furthermore, proportional allocation requires knowledge only of the sizes of the strata, whereas optimal allocation requires knowledge of the standard deviations of the strata, and such knowledge is usually unavailable. ■

Typically, stratified random sampling can result in substantial increases in precision for populations containing values that vary greatly in size. For example, a population of transactions, a sample of which is to be audited for errors, might contain transactions in the hundreds of thousands of dollars and transactions in the hundreds of dollars. If such a population were divided into several strata according to the dollar amounts of the transactions, there might well be considerable variation in the mean transaction errors between the strata, since there may be rather large errors on large transactions and small errors on small transactions. The variability of the errors might be larger in the former strata as well.

We have not addressed the question of how many strata to form and how to define the strata. In order to construct the optimal number of strata, the population values themselves, which are of course unknown, would have to be used. Stratification must therefore be done on the basis of some related variable that is known (such as transaction amount in the preceding paragraph) or on the results of earlier samples. In practice, it usually turns out that such relationships are not strong enough to make it worthwhile constructing more than a few strata.

7.6 Concluding Remarks

This chapter introduced survey sampling. It first covered the most elementary method of probability sampling—simple random sampling. The theory of this method underlies the theory of more complex sampling techniques. Stratified sampling was also introduced and shown to increase the precision of estimates substantially in many cases.
Several concepts and techniques introduced here recur throughout statistics: the concept of a random estimate of a population parameter, such as the population mean; bias; the standard error of an estimate; confidence intervals based on the central limit theorem; and linearization, or propagation of error.

The theory and technique of survey sampling go far beyond the material in this introduction. One method that deserves mention because of its widespread use is systematic sampling. The population members are given in a list. If, say, a 10% sample is desired, every tenth member of the list is sampled, starting from some random point among the first ten. If the list is in totally random order, this method is similar to simple random sampling. If, however, there is some correlation or relationship between successive members, the method is more similar to stratified sampling. The clear danger of this method is that there may be some periodic structure in the list, in which case bias can ensue.

Another commonly used method is cluster sampling. In sampling residential households, a survey might choose blocks randomly and then either sample every dwelling on each chosen block or further subsample the dwellings. Because one would expect dwellings within a single block to be relatively homogeneous, this method is typically less precise than a simple random sample of the same size.

We have developed a mathematical model for survey sampling and have deduced consequences of that model, including probabilistic error bounds for the estimates. As is always the case, reality never quite matches the mathematical model. The basic assumptions of the model are (1) that every population member appears in the sample with a specified probability and (2) that an exact measurement or response is obtained from every sample member. In practice, neither assumption will hold precisely. Converse and Traugott (1986) provide an interesting discussion of the practical difficulties of polls and surveys and the consequences for the variability of the estimates.

The first assumption may fail because of the difficulty of obtaining an exact enumeration of the population or because of imprecision in its definition. For example, political surveys can be putatively based on all adults, all registered voters, or all "likely" voters. However, the most serious problem with respect to the first assumption is that of nonresponse. Response levels of only 60% to 70% are common in surveys of human populations. The possibility of substantial bias clearly arises if potential answers to survey questions are related to the propensity to respond to those questions. For example, adults living in families are easier to contact by a telephone survey than those living alone, and the opinions of these two groups may well differ on certain issues. It is important to realize that the standard errors of estimates developed earlier in this chapter account only for random variability in sample composition, not for systematic biases.

The Literary Digest poll of 1936, which predicted a 57% to 43% victory for Republican Alfred Landon over incumbent president Franklin Roosevelt, is one of the most famous of flawed surveys. Questionnaires were mailed to about 10 million voters, who were selected from lists such as telephone books and club memberships, and approximately 2.4 million of the questionnaires were returned.
There were two intrinsic problems: (1) nonresponse—those who did not respond may have voted differently from those who did—and (2) selection bias—even if all 10 million voters had responded, they would not have constituted a random sample; those in lower socioeconomic classes (who were more likely to vote for Roosevelt) were less likely to have telephone service or belong to clubs and thus less likely to be included in the sample than were wealthier voters.

The assumption that an exact measurement is obtained from every member of the sample may also be in error. In surveys conducted by interviewers, the interviewer's approach and personality may affect the response. In surveys that use questionnaires, the wording of the questions and the context within which they are lodged can have an effect. An interesting example is a poll conducted by Stanley Presser (New Yorker, Oct. 18, 2004). Half of the sample was asked, "Do you think the United States should allow public speeches against democracy?" The other half was asked, "Do you think the United States should forbid public speeches against democracy?" 56% said no to the first question, and 39% said yes to the second. The interesting paper by Hansen in Tanur et al. (1972) reports on efforts of the U.S. Bureau of the Census to investigate these sorts of problems.

7.7 Problems

1. Consider a population consisting of five values—1, 2, 2, 4, and 8. Find the population mean and variance. Calculate the sampling distribution of the mean of a sample of size 2 by generating all possible such samples. From them, find the mean and variance of the sampling distribution, and compare the results to Theorems A and B in Section 7.3.1.

2. Suppose that a sample of size $n = 2$ is drawn from the population of the preceding problem and that the proportion of the sample values that are greater than 3 is recorded. Find the sampling distribution of this statistic by listing all possible such samples. Find the mean and variance of the sampling distribution.

3. Which of the following is a random variable?
a. The population mean
b. The population size, $N$
c. The sample size, $n$
d. The sample mean
e. The variance of the sample mean
f. The largest value in the sample
g. The population variance
h. The estimated variance of the sample mean

4. Two populations are surveyed with simple random samples. A sample of size $n_1$ is used for population I, which has a population standard deviation $\sigma_1$; a sample of size $n_2 = 2n_1$ is used for population II, which has a population standard deviation $\sigma_2 = 2\sigma_1$. Ignoring finite population corrections, in which of the two samples would you expect the estimate of the population mean to be more accurate?

5. How would you respond to a friend who asks you, "How can we say that the sample mean is a random variable when it is just a number, like the population mean? For example, in Example A of Section 7.3.2, a simple random sample of size 50 produced $\bar{x} = 938.5$; how can the number 938.5 be a random variable?"

6. Suppose that two populations have equal population variances but are of different sizes: $N_1 = 100{,}000$ and $N_2 = 10{,}000{,}000$. Compare the variances of the sample means for a sample of size $n = 25$. Is it substantially easier to estimate the mean of the smaller population?

7. Suppose that a simple random sample is used to estimate the proportion of families in a certain area that are living below the poverty level.
If this proportion is roughly .15, what sample size is necessary so that the standard error of the estimate is .02?

8. A sample of size $n = 100$ is taken from a population that has a proportion $p = 1/5$.
a. Find $\delta$ such that $P(|\hat{p} - p| \ge \delta) = .025$.
b. If, in the sample, $\hat{p} = .25$, will the 95% confidence interval for $p$ contain the true value of $p$?

9. In a simple random sample of 1,500 voters, 55% said they planned to vote for a particular proposition, and 45% said they planned to vote against it. The estimated margin of victory for the proposition is thus 10%. What is the standard error of this estimated margin? What is an approximate 95% confidence interval for the margin?

10. True or false (and state why): If a sample from a population is large, a histogram of the values in the sample will be approximately normal, even if the population is not normal.

11. Consider a population of size four, the members of which have values $x_1, x_2, x_3, x_4$.
a. If simple random sampling were used, how many samples of size two are there?
b. Suppose that rather than simple random sampling, the following sampling scheme is used. The possible samples of size two are
$$\{x_1, x_2\}, \quad \{x_2, x_3\}, \quad \{x_3, x_4\}, \quad \{x_1, x_4\}$$
and the sampling is done in such a way that each of these four possible samples is equally likely. Is the sample mean unbiased?

12. Consider simple random sampling with replacement.
a. Show that
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
is an unbiased estimate of $\sigma^2$.
b. Is $s$ an unbiased estimate of $\sigma$?
c. Show that $n^{-1}s^2$ is an unbiased estimate of $\sigma_{\bar{X}}^2$.
d. Show that $n^{-1}N^2 s^2$ is an unbiased estimate of $\sigma_T^2$.
e. Show that $\hat{p}(1 - \hat{p})/(n - 1)$ is an unbiased estimate of $\sigma_{\hat{p}}^2$.

13. Suppose that the total number of discharges, $\tau$, in Example A of Section 7.2 is estimated from a simple random sample of size 50. Denoting the estimate by $T$, use the central limit theorem to sketch the approximate probability density of the error $T - \tau$.

14. The proportion of hospitals in Example A of Section 7.2 that had fewer than 1000 discharges is $p = .654$. Suppose that the total number of hospitals having fewer than 1000 discharges is estimated from a simple random sample of size 25. Use the central limit theorem to sketch the approximate sampling distribution of the estimate.

15. Consider estimating the mean of the population of hospital discharges (Example A of Section 7.2) from a simple random sample of size $n$. Use the normal approximation to the distribution of $\bar{X}$ in answering the following:
a. Sketch $P(|\bar{X} - \mu| > 200)$ as a function of $n$ for $20 \le n \le 100$.
b. For $n = 20$, 40, and 80, find $\Delta$ such that $P(|\bar{X} - \mu| > \Delta) \approx .10$. Similarly, find $\Delta$ such that $P(|\bar{X} - \mu| > \Delta) \approx .50$.

16. True or false?
a. The center of a 95% confidence interval for the population mean is a random variable.
b. A 95% confidence interval for $\mu$ contains the sample mean with probability .95.
c. A 95% confidence interval contains 95% of the population.
d. Out of one hundred 95% confidence intervals for $\mu$, 95 will contain $\mu$.

17. A 90% confidence interval for the average number of children per household based on a simple random sample is found to be (.7, 2.1). Can we conclude that 90% of households have between .7 and 2.1 children?

18. From independent surveys of two populations, 90% confidence intervals for the population means are constructed. What is the probability that neither interval contains the respective population mean? That both do?

19. This problem introduces the concept of a one-sided confidence interval.
Using the central limit theorem, how should the constant $k$ be chosen so that the interval $(-\infty, \bar{X} + ks_{\bar{X}})$ is a 90% confidence interval for $\mu$—i.e., so that $P(\mu \le \bar{X} + ks_{\bar{X}}) = .9$? This is called a one-sided confidence interval. How should $k$ be chosen so that $(\bar{X} - ks_{\bar{X}}, \infty)$ is a 95% one-sided confidence interval?

20. In Example D of Section 7.3.3, a 95% confidence interval for $\mu$ was found to be (1.44, 1.76). Because $\mu$ is some fixed number, it either lies in this interval or it doesn't, so it doesn't make any sense to claim that $P(1.44 \le \mu \le 1.76) = .95$. What do we mean, then, by saying this is a "95% confidence interval"?

21. In order to halve the width of a 95% confidence interval for a mean, by what factor should the sample size be increased? Ignore the finite population correction.

22. An investigator quantifies her uncertainty about the estimate of a population mean by reporting $\bar{X} \pm s_{\bar{X}}$. What size confidence interval is this?

23. a. Show that the standard error of an estimated proportion is largest when $p = 1/2$.
b. Use this result and Corollary B of Section 7.3.2 to conclude that the quantity
$$\frac{1}{2}\sqrt{\frac{N-n}{N(n-1)}}$$
is a conservative estimate of the standard error of $\hat{p}$ no matter what the value of $p$ may be.
c. Use the central limit theorem to conclude that the interval
$$\hat{p} \pm \sqrt{\frac{N-n}{N(n-1)}}$$
contains $p$ with probability at least .95.

24. For a random sample of size $n$ from a population of size $N$, consider the following as an estimate of $\mu$:
$$\bar{X}_c = \sum_{i=1}^{n} c_i X_i$$
where the $c_i$ are fixed numbers and $X_1, \ldots, X_n$ is the sample.
a. Find a condition on the $c_i$ such that the estimate is unbiased.
b. Show that the choice of $c_i$ that minimizes the variance of the estimate subject to this condition is $c_i = 1/n$, for $i = 1, \ldots, n$.

25. Here is an alternative proof of Lemma B in Section 7.3.1. Consider a random permutation $Y_1, Y_2, \ldots, Y_N$ of $x_1, x_2, \ldots, x_N$. Argue that the joint distribution of any subcollection, $Y_{i_1}, \ldots, Y_{i_n}$, of the $Y_i$ is the same as that of a simple random sample, $X_1, \ldots, X_n$. In particular,
$$\operatorname{Var}(Y_i) = \operatorname{Var}(X_k) = \sigma^2$$
and
$$\operatorname{Cov}(Y_i, Y_j) = \operatorname{Cov}(X_k, X_l) = \gamma$$
if $i \ne j$ and $k \ne l$. Since $Y_1 + Y_2 + \cdots + Y_N = \tau$,
$$\operatorname{Var}\left(\sum_{i=1}^{N} Y_i\right) = 0$$
(Why?) Express $\operatorname{Var}(\sum_{i=1}^{N} Y_i)$ in terms of $\sigma^2$ and the unknown covariance, $\gamma$. Solve for $\gamma$, and conclude that
$$\gamma = -\frac{\sigma^2}{N-1} \quad \text{for } i \ne j$$

26. This is another proof of Lemma B in Section 7.3.1. Let $U_i$ be a random variable with $U_i = 1$ if the $i$th population member is in the sample and $U_i = 0$ otherwise.
a. Show that the sample mean $\bar{X} = n^{-1}\sum_{i=1}^{N} U_i x_i$.
b. Show that $P(U_i = 1) = n/N$. Find $E(U_i)$, using the fact that $U_i$ is a Bernoulli random variable.
c. What is the variance of the Bernoulli random variable $U_i$?
d. Noting that $U_i U_j$ is a Bernoulli random variable, find $E(U_i U_j)$ for $i \ne j$. (Be careful to take into account that the sample is drawn without replacement.)
e. Find $\operatorname{Cov}(U_i, U_j)$ for $i \ne j$.
f. Using the representation of $\bar{X}$ above, find $\operatorname{Var}(\bar{X})$.

27. Suppose that the population size $N$ is not known, but it is known that $n \le N$. Show that the following procedure will generate a simple random sample of size $n$. Imagine that the population is arranged in a long list that you can read sequentially.
a. Let the sample initially consist of the first $n$ elements in the list.
b. For $k = 1, 2, \ldots$, as long as the end of the list has not been encountered:
i. Read the $(n + k)$th element in the list.
ii. Place it in the sample with probability $n/(n + k)$ and, if it is placed in the sample, randomly drop one of the existing sample members.

28.
In surveys, it is difficult to obtain accurate answers to sensitive questions such as "Have you ever used heroin?" or "Have you ever cheated on an exam?" Warner (1965) introduced the method of randomized response to deal with such situations. A respondent spins an arrow on a wheel or draws a ball from an urn containing balls of two colors to determine which of two statements to respond to: (1) "I have characteristic A," or (2) "I do not have characteristic A." The interviewer does not know which statement is being responded to but merely records a yes or a no. The hope is that an interviewee is more likely to answer truthfully if he or she realizes that the interviewer does not know which statement is being responded to. Let $R$ be the proportion of a sample answering yes. Let $p$ be the probability that statement 1 is responded to ($p$ is known from the structure of the randomizing device), and let $q$ be the proportion of the population that has characteristic A. Let $r$ be the probability that a respondent answers yes.
a. Show that $r = (2p - 1)q + (1 - p)$. [Hint: $P(\text{yes}) = P(\text{yes given question 1}) \times P(\text{question 1}) + P(\text{yes given question 2}) \times P(\text{question 2})$.]
b. If $r$ were known, how could $q$ be determined?
c. Show that $E(R) = r$, and propose an estimate, $Q$, for $q$. Show that the estimate is unbiased.
d. Ignoring the finite population correction, show that
$$\operatorname{Var}(R) = \frac{r(1-r)}{n}$$
where $n$ is the sample size.
e. Find an expression for $\operatorname{Var}(Q)$.

29. A variation of the method described in Problem 28 has been proposed. Instead of responding to statement 2, the respondent answers an unrelated question for which the probability of a "yes" response is known, for example, "Were you born in June?"
a. Propose an estimate of $q$ for this method.
b. Show that the estimate is unbiased.
c. Obtain an expression for the variance of the estimate.

30. Compare the accuracies of the methods of Problems 28 and 29 by comparing their standard deviations. You may do this by substituting some plausible numerical values for $p$ and $q$.

31. Referring to Example D in Section 7.3.3, how large should the sample be in order that the 95% confidence interval for the total number of owners planning to sell will have a width of 500?

32. Referring again to Example D in Section 7.3.3, suppose that a survey is done of another condominium project of 12,000 units. The sample size is 200, and the proportion planning to sell in this sample is .18.
a. What is the standard error of this estimate? Give a 90% confidence interval.
b. Suppose we use the notation $\hat{p}_1 = .12$ and $\hat{p}_2 = .18$ to refer to the proportions in the two samples. Let $\hat{d} = \hat{p}_1 - \hat{p}_2$ be an estimate of the difference, $d$, of the two population proportions $p_1$ and $p_2$. Using the fact that $\hat{p}_1$ and $\hat{p}_2$ are independent random variables, find expressions for the variance and standard error of $\hat{d}$.
c. Because $\hat{p}_1$ and $\hat{p}_2$ are approximately normally distributed, so is $\hat{d}$. Use this fact to construct 99%, 95%, and 90% confidence intervals for $d$. Is there clear evidence that $p_1$ is really different from $p_2$?

33. Two populations are independently surveyed using simple random samples of size $n$, and two proportions, $p_1$ and $p_2$, are estimated. It is expected that both population proportions are close to .5. What should the sample size be so that the standard error of the difference, $\hat{p}_1 - \hat{p}_2$, will be less than .02?

34. In a survey of a very large population, the incidences of two health problems are to be estimated from the same sample.
It is expected that the first problem will affect about 3% of the population and the second about 40%. Ignore the finite population correction in answering the following questions.
a. How large should the sample be in order for the standard errors of both estimates to be less than .01? What are the actual standard errors for this sample size?
b. Suppose that instead of imposing the same limit on both standard errors, the investigator wants the standard error to be less than 10% of the true value in each case. What should the sample size be?

35. A simple random sample of a population of size 2000 yields the following 25 values:

104 109 111 109  87
 86  80 119  88 122
 91 103  99 108  96
104  98  98  83 107
 79  87  94  92  97

a. Calculate an unbiased estimate of the population mean.
b. Calculate unbiased estimates of the population variance and $\operatorname{Var}(\bar{X})$.
c. Give approximate 95% confidence intervals for the population mean and total.

36. With simple random sampling, is $\bar{X}^2$ an unbiased estimate of $\mu^2$? If not, what is the bias?

37. Two surveys were independently conducted to estimate a population mean, $\mu$. Denote the estimates and their standard errors by $\bar{X}_1$ and $\bar{X}_2$ and $\sigma_{\bar{X}_1}$ and $\sigma_{\bar{X}_2}$. Assume that $\bar{X}_1$ and $\bar{X}_2$ are unbiased. For some $\alpha$ and $\beta$, the two estimates can be combined to give a better estimator:
$$\bar{X} = \alpha\bar{X}_1 + \beta\bar{X}_2$$
a. Find the conditions on $\alpha$ and $\beta$ that make the combined estimate unbiased.
b. What choice of $\alpha$ and $\beta$ minimizes the variance, subject to the condition of unbiasedness?

38. Let $X_1, \ldots, X_n$ be a simple random sample. Show that
$$\frac{1}{n}\sum_{i=1}^{n} X_i^3$$
is an unbiased estimate of
$$\frac{1}{N}\sum_{i=1}^{N} x_i^3$$

39. Suppose that of a population of $N$ items, $k$ are defective in some way. For example, the items might be documents, a small proportion of which are fraudulent. How large should a sample be so that with a specified probability it will contain at least one of the defective items? For example, if $N = 10{,}000$, $k = 50$, and $p = .95$, what should the sample size be? Such calculations are useful in planning sample sizes for acceptance sampling.

40. This problem presents an algorithm for drawing a simple random sample from a population in a sequential manner. The members of the population are considered for inclusion in the sample one at a time in some prespecified order (for example, the order in which they are listed). The $i$th member of the population is included in the sample with probability
$$\frac{n - n_i}{N - i + 1}$$
where $n_i$ is the number of population members already in the sample before the $i$th member is examined. Show that the sample selected in this way is in fact a simple random sample; that is, show that every possible sample occurs with probability
$$\binom{N}{n}^{-1}$$

41. In accounting and auditing, the following sampling method is sometimes used to estimate a population total. In estimating the value of an inventory, suppose that a book value exists for each item and is readily accessible. For each item in the sample, the difference $D$, audited value minus book value, is determined. The inventory value is estimated by the sum of the book values of the population plus $N\bar{D}$, where $N$ is the population size.
a. Show that the estimate is unbiased.
b. Find an expression for the variance of the estimate.
c. Compare the expression obtained in part (b) to the variance of the usual estimate, which is the product of $N$ and the average audited value. Under what circumstances would the proposed method be more accurate?
d. How could a ratio estimate be employed in this situation?
Would there be any advantage or disadvantage to using a ratio estimate rather than the proposed method?

42. Show that the population correlation coefficient is less than or equal to 1 in absolute value.

43. Suppose that for Example D in Section 7.3.3, the average number of occupants per condominium unit in the sample is 2.2 with a sample standard deviation of .7, and the sample correlation coefficient between the number of occupants and the number of motor vehicles is .85. Estimate the population ratio of the number of motor vehicles per occupant and its standard error. Find an approximate 95% confidence interval for the estimate.

44. Show that
$$\frac{\operatorname{Var}(\bar{Y}_R)}{\operatorname{Var}(\bar{Y})} \approx 1 + \frac{C_x}{C_y}\left(\frac{C_x}{C_y} - 2\rho\right)$$
Sketch the graph of this ratio as a function of $C_x/C_y$.

45. In the population of hospitals, the correlation of the number of beds and the number of discharges is $\rho = .91$ (Example D of Section 7.4). To see how $\operatorname{Var}(\bar{Y}_R)$ would be different if the correlation were different, plot $\operatorname{Var}(\bar{Y}_R)$ for $n = 64$ as a function of $\rho$ for $-1 < \rho < 1$.

46. Use the central limit theorem to sketch the approximate sampling distribution of $\bar{Y}_R$ for $n = 64$ for the population of hospitals. Compare to the approximate sampling distribution of $\bar{Y}$.

47. For the population of hospitals and a sample size of $n = 64$, find the approximate bias of $\bar{Y}_R$ by applying Corollary B of Section 7.4 and compare it to the approximate standard deviation of the estimate. Repeat for $n = 128$.

48. A simple random sample of 100 households located in a city recorded the number of people living in the household, $X$, and the weekly expenditure for food, $Y$. It is known that there are 100,000 households in the city. In the sample,
$$\sum X_i = 320 \qquad \sum Y_i = 10{,}000 \qquad \sum X_i^2 = 1250 \qquad \sum Y_i^2 = 1{,}100{,}000 \qquad \sum X_i Y_i = 36{,}000$$
Neglect the finite population correction in answering the following.
a. Estimate the ratio $r = \mu_y/\mu_x$.
b. Form an approximate 95% confidence interval for $\mu_y/\mu_x$.
c. Using only the data on $Y$, estimate the total weekly food expenditure, $\tau$, for households in the city and form a 90% confidence interval.

49. In a wildlife survey, an area of desert land was divided into 1000 squares, or "quadrats," a simple random sample of 50 of which were surveyed. In each surveyed quadrat, the number of birds, $Y$, and the area covered by vegetation, $X$, were determined. It was found that
$$\sum X_i = 3000 \qquad \sum Y_i = 150 \qquad \sum X_i^2 = 225{,}000 \qquad \sum Y_i^2 = 650 \qquad \sum X_i Y_i = 11{,}000$$
a. Estimate the ratio of the average number of birds per quadrat to the average vegetation cover per quadrat.
b. Estimate the standard error of your estimate and find an approximate 90% confidence interval for the population average.
c. Estimate the total number of birds and find an approximate 95% confidence interval for the population total.
d. Suppose that from an aerial survey, the total area covered by vegetation could easily be determined. How could this information be used to provide another estimate of the number of birds? Would you expect this estimate to be better than or worse than that found in part (c)?

50. Hartley and Ross (1954) derived the following exact bound on the relative size of the bias and standard error of a ratio estimate:
$$\frac{|E(R) - r|}{\sigma_R} \le \frac{\sigma_{\bar{X}}}{\mu_x} = \frac{\sigma_x}{\mu_x}\sqrt{\frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)}$$
a. Derive this bound from the relation
$$\operatorname{Cov}(R, \bar{X}) = E\left(\frac{\bar{Y}}{\bar{X}}\,\bar{X}\right) - E\left(\frac{\bar{Y}}{\bar{X}}\right)E(\bar{X})$$
b. Apply the bound to Problem 43, using sample estimates in place of the given population parameters.

51. This problem introduces a technique called the "jackknife," originally proposed by Quenouille (1956) for reducing bias.
Many nonlinear estimates, including the ratio estimator, have the property that
$$E(\hat{\theta}) = \theta + \frac{b_1}{n} + \frac{b_2}{n^2} + \cdots$$
where $\hat{\theta}$ is an estimate of $\theta$. The jackknife forms an estimate $\hat{\theta}_J$ that has a leading bias term of the order $n^{-2}$ rather than $n^{-1}$. Thus, for sufficiently large $n$, the bias of $\hat{\theta}_J$ is substantially smaller than that of $\hat{\theta}$. The technique involves splitting the sample into several subsamples, computing the estimate for each subsample, and then combining the several estimates. The sample is split into $p$ groups of size $m$, where $n = mp$. For $j = 1, \ldots, p$, the estimate $\hat{\theta}_j$ is calculated from the $m(p-1)$ observations left after the $j$th group has been deleted. From the preceding expression,
$$E(\hat{\theta}_j) = \theta + \frac{b_1}{m(p-1)} + \frac{b_2}{[m(p-1)]^2} + \cdots$$
Now, $p$ "pseudovalues" are defined:
$$V_j = p\hat{\theta} - (p-1)\hat{\theta}_j$$
The jackknife estimate, $\hat{\theta}_J$, is defined as the average of the pseudovalues:
$$\hat{\theta}_J = \frac{1}{p}\sum_{j=1}^{p} V_j$$
Show that the bias of $\hat{\theta}_J$ is of the order $n^{-2}$.

52. A population consists of three strata with $N_1 = N_2 = 1000$ and $N_3 = 500$. A stratified random sample with 10 observations in each stratum yields the following data:

Stratum 1:  94  99 106 106 101 102 122 104  97  97
Stratum 2: 183 183 179 211 178 179 192 192 201 177
Stratum 3: 343 302 286 317 289 284 357 288 314 276

Estimate the population mean and total and give a 90% confidence interval.

53. The following table (Cochran 1977) shows the stratification of all farms in a county by farm size and the mean and standard deviation of the number of acres of corn in each stratum.

Farm Size   N_l   mu_l   sigma_l
0–40        394    5.4    8.3
41–80       461   16.3   13.3
81–120      391   24.3   15.1
121–160     334   34.5   19.8
161–200     169   42.1   24.5
201–240     113   50.1   26.0
241+        148   63.8   35.2

a. For a sample size of 100 farms, compute the sample sizes from each stratum for proportional and optimal allocation, and compare them.
b. Calculate the variances of the sample mean for each allocation and compare them to each other and to the variance of an estimate formed from simple random sampling.
c. What are the population mean and variance?
d. Suppose that ten farms are sampled per stratum. What is $\operatorname{Var}(\bar{X}_s)$? How large a simple random sample would have to be taken to attain the same variance? Ignore the finite population correction.
e. Repeat part (d) using proportional allocation of the 70 samples.

54. a. Suppose that the cost of a survey is $C = C_0 + C_1 n$, where $C_0$ is a startup cost and $C_1$ is the cost per observation. For a given cost $C$, find the allocation $n_1, \ldots, n_L$ to $L$ strata that is optimal in the sense that it minimizes the variance of the estimate of the population mean subject to the cost constraint.
b. Suppose that the cost of an observation varies from stratum to stratum—in some strata the observations might be relatively cheap and in others relatively expensive. The cost of a survey with an allocation $n_1, \ldots, n_L$ is
$$C = C_0 + \sum_{l=1}^{L} C_l n_l$$
For a fixed total cost $C$, what choice of $n_1, \ldots, n_L$ minimizes the variance?
c. Assuming that the cost function is as given in part (b), for a fixed variance, find $n_l$ to minimize cost.

55. The designer of a sample survey stratifies a population into two strata, H and L. H contains 100,000 people, and L contains 500,000. He decides to allocate 100 samples to stratum H and 200 to stratum L, taking a simple random sample in each stratum.
a. How should the designer estimate the population mean?
b. Suppose that the population standard deviation in stratum H is 20 and the standard deviation in stratum L is 10.
What will be the standard error of his estimate?
c. Would it be better to allocate 200 samples to stratum H and 100 to stratum L?
d. Would it be better to use proportional allocation?

56. How might stratification be used in each of the following sampling problems?
a. A survey of household expenditures in a city.
b. A survey to examine the lead concentration in the soil in a large plot of land.
c. A survey to estimate the number of people who use elevators in a large building with a single bank of elevators.
d. A survey of programs on a television station, taken to estimate the proportion of time taken up by advertising on Monday through Friday from 6 P.M. until 10 P.M. Assume that 52 weeks of recorded broadcasts are available for analysis.

57. Consider stratifying the population of Problem 1 into two strata: (1, 2, 2) and (4, 8). Assuming that one observation is taken from each stratum, find the sampling distribution of the estimate of the population mean and the mean and standard deviation of the sampling distribution. Compare to Theorems A and B in Section 7.5.2 and the results of Problem 1.

58. (Computer Exercise) Construct a population consisting of the integers from 1 to 100. Simulate the sampling distribution of the sample mean of a sample of size 12 by drawing 100 samples of size 12 and making a histogram of the results.

59. (Computer Exercise) Continuing with Problem 58, divide the population into two strata of equal size, allocate six observations per stratum, and simulate the distribution of the stratified estimate of the population mean. Do the same thing with four strata. Compare the results to each other and to the results of Problem 58.

60. A population consists of two strata, H and L, of sizes 100,000 and 500,000 and standard deviations 20 and 12, respectively. A stratified sample of size 100 is to be taken.
a. Find the optimal allocation for estimating the population mean.
b. Find the optimal allocation for estimating the difference of the means of the strata, $\mu_H - \mu_L$.

61. The value of a population mean increases linearly through time:
$$\mu(t) = \alpha + \beta t$$
while the variance remains constant. Independent simple random samples of size $n$ are taken at times $t = 1$, 2, and 3.
a. Find conditions on $w_1$, $w_2$, and $w_3$ such that
$$\hat{\beta} = w_1\bar{X}_1 + w_2\bar{X}_2 + w_3\bar{X}_3$$
is an unbiased estimate of the rate of change, $\beta$. Here $\bar{X}_i$ denotes the sample mean at time $t_i$.
b. What values of the $w_i$ minimize the variance subject to the constraint that the estimate is unbiased?

62. In Example B of Section 7.5.2, the standard error of $\bar{X}_s$ was estimated to be $s_{\bar{X}_s} = 35.8$. How good is this estimate—what is the actual standard error of $\bar{X}_s$?

63. (Open-ended) Monte Carlo evaluation of an integral was introduced in Example A of Section 5.2. Refer to that example for the following notation. Try to interpret that method from the point of view of survey sampling by considering an "infinite population" of numbers in the interval [0, 1], each population member $x$ having a value $f(x)$. Interpret $\hat{I}(f)$ as the mean of a simple random sample. What is the standard error of $\hat{I}(f)$? How could it be estimated? How could a confidence interval for $I(f)$ be formed? Do you think that anything could be gained by stratifying the "population"? For example, the strata could be the intervals [0, .5) and [.5, 1]. You might find it helpful to consider some examples.

64. The value of an inventory is to be estimated by sampling.
The items are stratified by book value in the following way:

Stratum      N_l     mu_l   sigma_l
$1000+        70     3000    1250
$200–1000    500      500     100
$1–200    10,000       90      30

a. What should the relative sampling fraction in each stratum be for proportional and for optimal allocation? Ignore the finite population correction.
b. How do the variances under each type of allocation compare to each other and to the variance under simple random sampling?

65. The disk file cancer contains values for breast cancer mortality from 1950 to 1960 ($y$) and the adult white female population in 1960 ($x$) for 301 counties in North Carolina, South Carolina, and Georgia.
a. Make a histogram of the population values for cancer mortality.
b. What are the population mean and total cancer mortality? What are the population variance and standard deviation?
c. Simulate the sampling distribution of the mean of a sample of 25 observations of cancer mortality.
d. Draw a simple random sample of size 25 and use it to estimate the mean and total cancer mortality.
e. Estimate the population variance and standard deviation from the sample of part (d).
f. Form 95% confidence intervals for the population mean and total from the sample of part (d). Do the intervals cover the population values?
g. Repeat parts (d) through (f) for a sample of size 100.
h. Suppose that the size of the total population of each county is known and that this information is used to improve the cancer mortality estimates by forming a ratio estimator. Do you think this will be effective? Why or why not?
i. Simulate the sampling distribution of ratio estimators of mean cancer mortality based on a simple random sample of size 25. Compare this result to that of part (c).
j. Draw a simple random sample of size 25 and estimate the population mean and total cancer mortality by calculating ratio estimates. How do these estimates compare to those formed in the usual way in part (d) from the same data?
k. Form confidence intervals about the estimates obtained in part (j).
l. Stratify the counties into four strata by population size. Randomly sample six observations from each stratum and form estimates of the population mean and total mortality.
m. Stratify the counties into four strata by population size. What are the sampling fractions for proportional allocation and optimal allocation? Compare the variances of the estimates of the population mean obtained using simple random sampling, proportional allocation, and optimal allocation.
n. How much better than those in part (m) will the estimates of the population mean be if 8, 16, 32, or 64 strata are used instead?

66. A photograph of a large crowd on a beach is taken from a helicopter. The photo is of such high resolution that when sections are magnified, individual people can be identified, but to count the entire crowd in this way would be very time-consuming. Devise a plan to estimate the number of people on the beach by using a sampling procedure.

67. The data set families contains information about 43,886 families living in the city of Cyberville. The city has four regions: the Northern region has 10,149 families, the Eastern region has 10,390 families, the Southern region has 13,457 families, and the Western region has 9,890. For each family, the following information is recorded:
1. Family type
   1: Husband-wife family
   2: Male-head family
   3: Female-head family
2. Number of persons in family
3. Number of children in family
4. Family income
5. Region
   1: North
   2: East
   3: South
   4: West
6.
Education level of head of household
   31: Less than 1st grade
   32: 1st, 2nd, 3rd, or 4th grade
   33: 5th or 6th grade
   34: 7th or 8th grade
   35: 9th grade
   36: 10th grade
   37: 11th grade
   38: 12th grade, no diploma
   39: High school graduate, high school diploma, or equivalent
   40: Some college but no degree
   41: Associate degree in college (occupation/vocation program)
   42: Associate degree in college (academic program)
   43: Bachelor's degree (e.g., B.S., B.A., A.B.)
   44: Master's degree (e.g., M.S., M.A., M.B.A.)
   45: Professional school degree (e.g., M.D., D.D.S., D.V.M., LL.B., J.D.)
   46: Doctoral degree (e.g., Ph.D., Ed.D.)

In these exercises, you will try to learn about the families of Cyberville by using sampling.
a. Take a simple random sample of 500 families. Estimate the following population parameters, calculate the estimated standard errors of these estimates, and form 95% confidence intervals:
i. The proportion of female-headed families
ii. The average number of children per family
iii. The proportion of heads of households who did not receive a high school diploma
iv. The average family income
Repeat for five different simple random samples of size 500 and compare the results.
b. Take 100 samples of size 400.
i. For each sample, find the average family income.
ii. Find the average and standard deviation of these 100 estimates and make a histogram of the estimates.
iii. Superimpose a plot of a normal density with the mean and standard deviation of the histogram and comment on how well it appears to fit.
iv. Plot the empirical cumulative distribution function (see Section 10.2). On this plot, superimpose the normal cumulative distribution function with mean and standard deviation as earlier. Comment on the fit.
v. Another method for examining a normal approximation is via a normal probability plot (Section 9.9). Make such a plot and comment on what it shows about the approximation.
vi. For each of the 100 samples, find a 95% confidence interval for the population average income. How many of those intervals actually contain the population target?
vii. Take 100 samples of size 100. Compare the averages, standard deviations, and histograms to those obtained for a sample of size 400 and explain how the theory of simple random sampling relates to the comparisons.
c. For a simple random sample of 500, compare the incomes of the three family types by comparing histograms and boxplots (see Section 10.6).
d. Take simple random samples of size 400 from each of the four regions.
i. Compare the incomes by region by making parallel boxplots.
ii. Does it appear that some regions have larger families than others?
iii. Are there differences in education level among the four regions?
e. Formulate a question of your choice and attempt to answer it with a simple random sample of size 400.
f. Does stratification help in estimating the average family income? From a simple random sample of size 400, estimate the average income and also the standard error of your estimate. Form a 95% confidence interval. Next, allocate the 400 observations proportionally to the four regions and estimate the average income from the stratified sample. Estimate the standard error and form a 95% confidence interval. Compare your results to the results of the simple random sample.

CHAPTER 8
Estimation of Parameters and Fitting of Probability Distributions

8.1 Introduction

In this chapter, we discuss fitting probability laws to data.
Many families of probability laws depend on a small number of parameters; for example, the Poisson family depends on the parameter $\lambda$ (the mean number of counts), and the Gaussian family depends on two parameters, $\mu$ and $\sigma$. Unless the values of the parameters are known in advance, they must be estimated from data in order to fit the probability law. After parameter values have been chosen, the model should be compared to the actual data to see if the fit is reasonable; Chapter 9 is concerned with measures and tests of goodness of fit.

In order to introduce and illustrate some of the ideas and to provide a concrete basis for later theoretical discussions, we will first consider a classical example—the fitting of a Poisson distribution to radioactive decay. The concepts introduced in this example will be elaborated in this and the next chapter.

8.2 Fitting the Poisson Distribution to Emissions of Alpha Particles

Records of emissions of alpha particles from radioactive sources show that the number of emissions per unit of time is not constant but fluctuates in a seemingly random fashion. If the underlying rate of emission is constant over the period of observation (which will be the case if the half-life is much longer than the time period of observation) and if the particles come from a very large number of independent sources (atoms), the Poisson model seems appropriate. For this reason, the Poisson distribution is frequently used as a model for radioactive decay. You should recall that the Poisson distribution as a model for random counts in space or time rests on three assumptions: (1) the underlying rate at which the events occur is constant in space or time, (2) events in disjoint intervals of space or time occur independently, and (3) there are no multiple events.

Berkson (1966) conducted a careful analysis of data obtained from the National Bureau of Standards. The source of the alpha particles was americium-241. The experimenters recorded 10,220 times between successive emissions. The observed mean emission rate (total number of emissions divided by total time) was .8392 emissions per second. The clock used to record the times was accurate to .0002 second. The first two columns of the following table display the counts, $n$, that were observed in 1207 intervals, each of length 10 seconds. In 18 of the 1207 intervals, there were 0, 1, or 2 counts; in 28 of the intervals there were 3 counts; etc.

n      Observed  Expected
0–2      18       12.2
3        28       27.0
4        56       56.5
5       105       94.9
6       126      132.7
7       146      159.1
8       164      166.9
9       161      155.6
10      123      130.6
11      101       99.7
12       74       69.7
13       53       45.0
14       23       27.0
15       15       15.1
16        9        7.9
17+       5        7.1
Total  1207     1207

In fitting a Poisson distribution to the counts shown in the table, we view the 1207 counts as 1207 independent realizations of Poisson random variables, each of which has the probability mass function
$$\pi_k = P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
In order to fit the Poisson distribution, we must estimate a value for $\lambda$ from the observed data. Since the average count in a 10-second interval was 8.392, we take this as an estimate of $\lambda$ (recall that $E(X) = \lambda$) and denote it by $\hat{\lambda}$.

Before continuing, we want to mention some issues that will be explored in depth in subsequent sections of this chapter.
First, observe that if the experiment were to be repeated, the counts would be different and the estimate of $\lambda$ would be different; it is thus appropriate to regard the estimate of $\lambda$ as a random variable which has a probability distribution referred to as its sampling distribution. The situation is entirely analogous to tossing a coin 10 times and regarding the number of heads as a binomially distributed random variable. Doing so and observing 6 heads generates one realization of this random variable; in the same sense, 8.392 is a realization of a random variable. The question thus arises: what is the sampling distribution? This is of some practical interest, since the spread of the sampling distribution reflects the variability of the estimate. We could ask, crudely: to what decimal place is the estimate 8.392 accurate? Second, later in this chapter we will discuss the rationale for choosing to estimate $\lambda$ as we have done. Although estimating $\lambda$ as the observed mean count is quite reasonable on its face, in principle there might be better procedures.

We now turn to assessing goodness of fit, a subject that will be taken up in depth in the next chapter. Consider the 16 cells into which the counts are grouped. Under the hypothesized model, the probability that a random count falls in any one of the cells may be calculated from the Poisson probability law. The probability that an observation falls in the first cell (0, 1, or 2 counts) is
$$p_1 = \pi_0 + \pi_1 + \pi_2$$
The probability that an observation falls in the second cell is $p_2 = \pi_3$. The probability that an observation falls in the 16th cell is
$$p_{16} = \sum_{k=17}^{\infty} \pi_k$$
Under the assumption that $X_1, \ldots, X_{1207}$ are independent Poisson random variables, the number of observations out of 1207 falling in a given cell follows a binomial distribution with a mean, or expected value, of $1207 p_k$, and the joint distribution of the counts in all the cells is multinomial with $n = 1207$ and probabilities $p_1, p_2, \ldots, p_{16}$. The third column of the preceding table gives the expected number of counts in each cell; for example, because $p_4 = .0786$, the expected count in the corresponding cell is $1207 \times .0786 = 94.9$. Qualitatively, there is good agreement between the expected and observed counts. Quantitative measures will be presented in Chapter 9.

8.3 Parameter Estimation

As was illustrated in the example of alpha particle emissions, in order to fit a probability law to data, one typically has to estimate parameters associated with the probability law from the data. The following examples further illustrate this point.

EXAMPLE A Normal Distribution
The normal, or Gaussian, distribution involves two parameters, $\mu$ and $\sigma$, where $\mu$ is the mean of the distribution and $\sigma^2$ is the variance:
$$f(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$

[FIGURE 8.1 Gaussian fit of current flow across a cell membrane to a frequency polygon.]

The use of the normal distribution as a model is usually justified using some version of the central limit theorem, which says that the sum of a large number of independent random variables is approximately normally distributed. For example, Bevan, Kullberg, and Rice (1979) studied random fluctuations of current across a muscle cell membrane. The cell membrane contained a large number of channels, which opened and closed at random and were assumed to operate independently.
The net current resulted from ions flowing through open channels and was therefore the sum of a large number of roughly independent currents. As the channels opened and closed, the net current fluctuated randomly. Figure 8.1 shows a smoothed histogram of values obtained from 49,152 observations of the net current and an approximating Gaussian curve. The fit of the Gaussian distribution is quite good, although the smoothed histogram seems to show a slight skewness. In this application, information about the characteristics of the individual channels, such as conductance, was extracted from the estimated parameters $\mu$ and $\sigma^2$. ■

EXAMPLE B Gamma Distribution
The gamma distribution depends on two parameters, $\alpha$ and $\lambda$:
$$f(x|\alpha, \lambda) = \frac{1}{\Gamma(\alpha)}\lambda^\alpha x^{\alpha-1} e^{-\lambda x}, \quad 0 \le x < \infty$$
The family of gamma distributions provides a flexible set of densities for nonnegative random variables.

Figure 8.2 shows how the gamma distribution fits the amounts of rainfall from different storms (Le Cam and Neyman 1967). Gamma distributions were fit to rainfall amounts from storms that were seeded and unseeded in an experiment to determine the effects, if any, of seeding. Differences in the distributions between the seeded and unseeded conditions should be reflected in differences in the parameters $\alpha$ and $\lambda$. ■

[FIGURE 8.2 Fit of gamma densities to amounts of rainfall for (a) seeded and (b) unseeded storms.]

As these examples illustrate, there are a variety of reasons for fitting probability laws to data. A scientific theory may suggest the form of a probability distribution, and the parameters of that distribution may be of direct interest to the scientific investigation; the examples of alpha particle emission and Example A are of this character. Example B is typical of situations in which a model is fit for essentially descriptive purposes as a method of data summary or compression. A probability model may also play a role in a complex modeling situation; for example, utility companies interested in projecting patterns of consumer demand find it useful to model daily temperatures as random variables from a distribution of a particular form. This distribution may then be used in simulations of the effects of various pricing and generation schemes. In a similar way, hydrologists planning uses of water resources use stochastic models of rainfall in simulations.

We will take the following basic approach to the study of parameter estimation. The observed data will be regarded as realizations of random variables $X_1, X_2, \ldots, X_n$, whose joint distribution depends on an unknown parameter $\theta$. Note that $\theta$ may be a vector, such as $(\alpha, \lambda)$ in Example B. Usually the $X_i$ will be modeled as independent random variables all having the same distribution $f(x|\theta)$, in which case their joint distribution is $f(x_1|\theta)f(x_2|\theta)\cdots f(x_n|\theta)$. We will refer to such $X_i$ as independent and identically distributed, or i.i.d. An estimate of $\theta$ will be a function of $X_1, X_2, \ldots, X_n$ and will hence be a random variable with a probability distribution called its sampling distribution. We will use approximations to the sampling distribution to assess the variability of the estimate, most frequently through its standard deviation, which is commonly called its standard error.
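Since the sampling distribution is the organizing idea of everything that follows, it may help to see one simulated directly. The sketch below is not from the text: it treats the sample mean of n i.i.d. gamma observations as the estimate, with arbitrary illustration values for the parameters, and approximates the standard error by the standard deviation of many simulated estimates.

```python
# Sketch: simulating the sampling distribution of an estimate.
# Here the "estimate" is the sample mean of n i.i.d. gamma observations;
# the shape/rate values are arbitrary, chosen only for illustration.
import random
import statistics

def sample_estimate(n, alpha, lam, rng):
    """Draw n i.i.d. Gamma(alpha, rate=lam) values and return their mean."""
    xs = [rng.gammavariate(alpha, 1.0 / lam) for _ in range(n)]
    return statistics.mean(xs)

rng = random.Random(0)
estimates = [sample_estimate(n=50, alpha=2.0, lam=1.5, rng=rng)
             for _ in range(2000)]

# The spread of these 2000 realizations approximates the sampling distribution;
# its standard deviation approximates the standard error of the estimate.
print("mean of estimates:", statistics.mean(estimates))
print("simulated standard error:", statistics.stdev(estimates))
```

Since a Gamma($\alpha$, $\lambda$) variable has variance $\alpha/\lambda^2$, the true standard error of the mean here is $\sqrt{\alpha/(n\lambda^2)} \approx .13$, which the simulation should approximately reproduce.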
It is desirable to have general procedures for forming estimates so that each new problem does not have to be approached ab initio. We will develop two such procedures, the method of moments and the method of maximum likelihood, concentrating primarily on the latter, because it is the more generally useful. The advanced theory of statistics is heavily concerned with "optimal estimation," and we will touch lightly on this topic. The essential idea is that given a choice of many different estimation procedures, we would like to use the estimate whose sampling distribution is most concentrated around the true parameter value.

Before going on to the method of moments, let us note that there are strong similarities between the subject matter of this and the previous chapter. In Chapter 7, we were concerned with estimating population parameters, such as the mean and total, and the process of random sampling created random variables whose probability distributions depended on those parameters. We were concerned with the sampling distributions of the estimates and with assessing variability via standard errors and confidence intervals. In this chapter, we consider models in which the data are generated from a probability distribution. This distribution usually has a more hypothetical status than that of Chapter 7, where the distribution was induced by deliberate randomization. In this chapter, we will also be concerned with sampling distributions and with assessing variability through standard errors and confidence intervals.

8.4 The Method of Moments

The kth moment of a probability law is defined as
$$\mu_k = E(X^k)$$
where $X$ is a random variable following that probability law (of course, this is defined only if the expectation exists). If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables from that distribution, the kth sample moment is defined as
$$\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k$$
We can view $\hat{\mu}_k$ as an estimate of $\mu_k$. The method of moments estimates parameters by finding expressions for them in terms of the lowest possible order moments and then substituting sample moments into the expressions.

Suppose, for example, that we wish to estimate two parameters, $\theta_1$ and $\theta_2$. If $\theta_1$ and $\theta_2$ can be expressed in terms of the first two moments as
$$\theta_1 = f_1(\mu_1, \mu_2) \qquad \theta_2 = f_2(\mu_1, \mu_2)$$
then the method of moments estimates are
$$\hat{\theta}_1 = f_1(\hat{\mu}_1, \hat{\mu}_2) \qquad \hat{\theta}_2 = f_2(\hat{\mu}_1, \hat{\mu}_2)$$

The construction of a method of moments estimate involves three basic steps:
1. Calculate low order moments, finding expressions for the moments in terms of the parameters. Typically, the number of low order moments needed will be the same as the number of parameters.
2. Invert the expressions found in the preceding step, finding new expressions for the parameters in terms of the moments.
3. Insert the sample moments into the expressions obtained in the second step, thus obtaining estimates of the parameters in terms of the sample moments.

To illustrate this procedure, we consider some examples.

EXAMPLE A Poisson Distribution
The first moment for the Poisson distribution is the parameter $\lambda = E(X)$. The first sample moment is
$$\hat{\mu}_1 = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
which is, therefore, the method of moments estimate of $\lambda$: $\hat{\lambda} = \bar{X}$.

As a concrete example, let us consider a study done at the National Institute of Standards and Technology (Steel et al. 1980). Asbestos fibers on filters were counted as part of a project to develop measurement standards for asbestos concentration.
Asbestos dissolved in water was spread on a filter, and 3-mm diameter punches were taken from the filter and mounted on a transmission electron microscope. An operator counted the number of fibers in each of 23 grid squares, yielding the following counts:

31 29 19 18 31 28 34 27 34 30 16 18
26 27 27 18 24 22 28 24 21 17 24

The Poisson distribution would be a plausible model for describing the variability from grid square to grid square in this situation and could be used to characterize the inherent variability in future measurements. The method of moments estimate of $\lambda$ is simply the arithmetic mean of the counts listed above, or $\hat{\lambda} = 24.9$.

If the experiment were to be repeated, the counts—and therefore the estimate—would not be exactly the same. It is thus natural to ask how stable this estimate is. A standard statistical technique for addressing this question is to derive the sampling distribution of the estimate or an approximation to that distribution. The statistical model stipulates that the individual counts $X_i$ are independent Poisson random variables with parameter $\lambda_0$. Letting $S = \sum X_i$, the parameter estimate $\hat{\lambda} = S/n$ is a random variable, the distribution of which is called its sampling distribution. Now from Example E in Section 4.5, the sum of independent Poisson random variables is Poisson distributed, so the distribution of $S$ is Poisson($n\lambda_0$). Thus the probability mass function of $\hat{\lambda}$ is
$$P(\hat{\lambda} = v) = P(S = nv) = \frac{(n\lambda_0)^{nv} e^{-n\lambda_0}}{(nv)!}$$
for $v$ such that $nv$ is a nonnegative integer. Since $S$ is Poisson, its mean and variance are both $n\lambda_0$, so
$$E(\hat{\lambda}) = \frac{1}{n}E(S) = \lambda_0$$
$$\operatorname{Var}(\hat{\lambda}) = \frac{1}{n^2}\operatorname{Var}(S) = \frac{\lambda_0}{n}$$
From Example A in Section 5.3, if $n\lambda_0$ is large, the distribution of $S$ is approximately normal; hence, that of $\hat{\lambda}$ is approximately normal as well, with mean and variance given above. Because $E(\hat{\lambda}) = \lambda_0$, we say that the estimate is unbiased: the sampling distribution is centered at $\lambda_0$. The second equation shows that the sampling distribution becomes more concentrated about $\lambda_0$ as $n$ increases. The standard deviation of this distribution is called the standard error of $\hat{\lambda}$ and is
$$\sigma_{\hat{\lambda}} = \sqrt{\frac{\lambda_0}{n}}$$
Of course, we can't know the sampling distribution or the standard error of $\hat{\lambda}$, because they depend on $\lambda_0$, which is unknown. However, we can derive an approximation by substituting $\hat{\lambda}$ for $\lambda_0$ and use it to assess the variability of our estimate. In particular, we can calculate the estimated standard error of $\hat{\lambda}$ as
$$s_{\hat{\lambda}} = \sqrt{\frac{\hat{\lambda}}{n}}$$
For this example, we find
$$s_{\hat{\lambda}} = \sqrt{\frac{24.9}{23}} = 1.04$$
At the end of this section, we will present a justification for using $\hat{\lambda}$ in place of $\lambda_0$.

In summary, we have found that the sampling distribution of $\hat{\lambda}$ is approximately normal, centered at the true value $\lambda_0$, with standard deviation 1.04. This gives us a reasonable assessment of the variability of our parameter estimate. For example, because a normally distributed random variable is unlikely to be more than two standard deviations away from its mean, the error in our estimate of $\lambda$ is unlikely to be more than 2.08. We thus have not only an estimate of $\lambda_0$, but also an understanding of the inherent variability of that estimate.

In Chapter 9, we will address the question of whether the Poisson distribution really fits these data. Clearly, we could calculate the average of any batch of numbers, whether or not they were well fit by the Poisson distribution. ■
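These calculations are simple enough to check directly. The sketch below reproduces the estimate and its estimated standard error from the 23 fiber counts given above; only the data are from the text, and the variable names are our own.

```python
# Sketch: method of moments estimate of lambda and its estimated standard
# error for the asbestos fiber counts (data from the text).
import math

counts = [31, 29, 19, 18, 31, 28, 34, 27, 34, 30, 16, 18,
          26, 27, 27, 18, 24, 22, 28, 24, 21, 17, 24]

n = len(counts)                   # 23 grid squares
lam_hat = sum(counts) / n         # method of moments estimate: the sample mean
se_hat = math.sqrt(lam_hat / n)   # estimated standard error, sqrt(lambda_hat / n)

print(f"lambda_hat = {lam_hat:.1f}")    # 24.9
print(f"estimated SE = {se_hat:.2f}")   # 1.04
```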
EXAMPLE B Normal Distribution
The first and second moments for the normal distribution are
$$\mu_1 = E(X) = \mu$$
$$\mu_2 = E(X^2) = \mu^2 + \sigma^2$$
Therefore,
$$\mu = \mu_1 \qquad \sigma^2 = \mu_2 - \mu_1^2$$
The corresponding estimates of $\mu$ and $\sigma^2$ from the sample moments are
$$\hat{\mu} = \bar{X}$$
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
From Section 6.3, the sampling distribution of $\bar{X}$ is $N(\mu, \sigma^2/n)$ and $n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-1}$. Furthermore, $\bar{X}$ and $\hat{\sigma}^2$ are independently distributed. We will return to these sampling distributions later in the chapter. ■

EXAMPLE C Gamma Distribution
The first two moments of the gamma distribution are
$$\mu_1 = \frac{\alpha}{\lambda} \qquad \mu_2 = \frac{\alpha(\alpha+1)}{\lambda^2}$$
(see Example B in Section 4.5). To apply the method of moments, we must express $\alpha$ and $\lambda$ in terms of $\mu_1$ and $\mu_2$. From the second equation,
$$\mu_2 = \mu_1^2 + \frac{\mu_1}{\lambda}$$
or
$$\lambda = \frac{\mu_1}{\mu_2 - \mu_1^2}$$
Also, from the equation for the first moment given here,
$$\alpha = \lambda\mu_1 = \frac{\mu_1^2}{\mu_2 - \mu_1^2}$$
The method of moments estimates are, since $\hat{\sigma}^2 = \hat{\mu}_2 - \hat{\mu}_1^2$,
$$\hat{\lambda} = \frac{\bar{X}}{\hat{\sigma}^2}$$
and
$$\hat{\alpha} = \frac{\bar{X}^2}{\hat{\sigma}^2}$$

[FIGURE 8.3 Gamma densities fit by the method of moments and by the method of maximum likelihood to amounts of precipitation; the solid line shows the method of moments estimate and the dotted line the maximum likelihood estimate.]

As a concrete example, let us consider the fit of the amounts of precipitation during 227 storms in Illinois from 1960 to 1964 to a gamma distribution (Le Cam and Neyman 1967). The data, listed in Problem 42 at the end of Chapter 10, were gathered and analyzed in an attempt to characterize the natural variability in precipitation from storm to storm. A histogram shows that the distribution is quite skewed, so a gamma distribution is a natural candidate for a model. For these data, $\bar{X} = .224$ and $\hat{\sigma}^2 = .1338$, and therefore $\hat{\alpha} = .375$ and $\hat{\lambda} = 1.674$. The histogram with the fitted density is shown in Figure 8.3. Note that, in order to make visual comparison easy, the density was normalized to have a total area equal to the total area under the histogram, which is the number of observations times the bin width of the histogram, or $227 \times .2 = 45.4$. Alternatively, the histogram could have been normalized to have a total area of 1. Qualitatively, the fit in Figure 8.3 looks reasonable; we will examine it in more detail in Example C in Section 9.9. ■

We now turn to a discussion of the sampling distributions of $\hat{\alpha}$ and $\hat{\lambda}$. In the previous two examples, we were able to use known theoretical results in deriving sampling distributions, but it appears that it would be difficult to derive the exact forms of the sampling distributions of $\hat{\lambda}$ and $\hat{\alpha}$, because they are each rather complicated functions of the sample values $X_1, X_2, \ldots, X_n$. However, the problem can be approached by simulation. Imagine for the moment that we knew the true values $\lambda_0$ and $\alpha_0$. We could generate many, many samples of size $n = 227$ from the gamma distribution with these parameter values, and from each of these samples we could calculate estimates of $\lambda$ and $\alpha$. A histogram of the values of the estimates of $\lambda$, for example, should then give us a good idea of the sampling distribution of $\hat{\lambda}$. The only problem with this idea is that it requires knowing the true parameter values. (Notice that we faced a problem very much like this in Example A.)
We now turn to a discussion of the sampling distributions of $\hat\alpha$ and $\hat\lambda$. In the previous two examples, we were able to use known theoretical results in deriving sampling distributions, but it appears that it would be difficult to derive the exact forms of the sampling distributions of $\hat\lambda$ and $\hat\alpha$, because they are each rather complicated functions of the sample values $X_1, X_2, \ldots, X_n$. However, the problem can be approached by simulation. Imagine for the moment that we knew the true values $\lambda_0$ and $\alpha_0$. We could generate many, many samples of size $n = 227$ from the gamma distribution with these parameter values, and from each of these samples we could calculate estimates of $\lambda$ and $\alpha$. A histogram of the values of the estimates of $\lambda$, for example, should then give us a good idea of the sampling distribution of $\hat\lambda$. The only problem with this idea is that it requires knowing the true parameter values. (Notice that we faced a problem very much like this in Example A.)

So we substitute our estimates of $\lambda$ and $\alpha$ for the true values; that is, we draw many, many samples of size $n = 227$ from a gamma distribution with parameters $\alpha = .375$ and $\lambda = 1.674$. The results of drawing 1000 such samples of size $n = 227$ are displayed in Figure 8.4. Figure 8.4(a) is a histogram of the 1000 estimates of $\alpha$ so obtained, and Figure 8.4(b) shows the corresponding histogram for $\lambda$. These histograms indicate the variability that is inherent in estimating the parameters from a sample of this size. For example, we see that if the true value of $\alpha$ is .375, then it would not be very unusual for the estimate to be in error by .1 or more. Notice that the shapes of the histograms suggest that they might be approximated by normal densities.

The variability shown by the histograms can be summarized by calculating the standard deviations of the 1000 estimates, thus providing estimated standard errors of $\hat\alpha$ and $\hat\lambda$. To be precise, if the 1000 estimates of $\alpha$ are denoted by $\alpha_i^*$, $i = 1, 2, \ldots, 1000$, the standard error of $\hat\alpha$ is estimated as

$$s_{\hat\alpha} = \sqrt{\frac{1}{1000}\sum_{i=1}^{1000}(\alpha_i^* - \bar\alpha^*)^2}$$

where $\bar\alpha^*$ is the mean of the 1000 values. The results of this calculation and the corresponding one for $\hat\lambda$ are $s_{\hat\alpha} = .06$ and $s_{\hat\lambda} = .34$. These standard errors are concise quantifications of the amount of variability of the estimates $\hat\alpha = .375$ and $\hat\lambda = 1.674$ displayed in Figure 8.4.

FIGURE 8.4 Histograms of 1000 simulated method of moments estimates of (a) α and (b) λ.

Our use of simulation (or Monte Carlo) here is an example of what in statistics is called the bootstrap. We will see more examples of this versatile method later.
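The bootstrap just described is a few lines of code. The sketch below (Python with NumPy; the seed and variable names are arbitrary choices of ours) draws 1000 gamma samples of size 227 and summarizes the spread of the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_hat, lam_hat, n, B = 0.375, 1.674, 227, 1000

estimates = np.empty((B, 2))
for b in range(B):
    # NumPy parametrizes the gamma by shape and scale = 1/lambda.
    x = rng.gamma(shape=alpha_hat, scale=1.0 / lam_hat, size=n)
    s2 = x.var()
    estimates[b] = (x.mean()**2 / s2, x.mean() / s2)  # (alpha*, lambda*)

s_alpha, s_lam = estimates.std(axis=0)
print(s_alpha, s_lam)   # roughly .06 and .34, as reported in the text
```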
EXAMPLE D An Angular Distribution
The angle $\theta$ at which electrons are emitted in muon decay has a distribution with the density

$$f(x\mid\alpha) = \frac{1 + \alpha x}{2}, \qquad -1 \le x \le 1 \text{ and } -1 \le \alpha \le 1$$

where $x = \cos\theta$. The parameter $\alpha$ is related to polarization. Physical considerations dictate that $|\alpha| \le \frac{1}{3}$, but we note that $f(x\mid\alpha)$ is a probability density for $|\alpha| \le 1$. The method of moments may be applied to estimate $\alpha$ from a sample of experimental measurements, $X_1, \ldots, X_n$. The mean of the density is

$$\mu = \int_{-1}^{1} x\,\frac{1 + \alpha x}{2}\,dx = \frac{\alpha}{3}$$

Thus, the method of moments estimate of $\alpha$ is $\hat\alpha = 3\bar X$. Consideration of the sampling distribution of $\hat\alpha$ is left as an exercise (Problem 13). ■

Under reasonable conditions, method of moments estimates have the desirable property of consistency. An estimate, $\hat\theta$, is said to be a consistent estimate of a parameter, $\theta$, if $\hat\theta$ approaches $\theta$ as the sample size approaches infinity. The following definition states this more precisely.

DEFINITION
Let $\hat\theta_n$ be an estimate of a parameter $\theta$ based on a sample of size $n$. Then $\hat\theta_n$ is said to be consistent in probability if $\hat\theta_n$ converges in probability to $\theta$ as $n$ approaches infinity; that is, for any $\epsilon > 0$,

$$P(|\hat\theta_n - \theta| > \epsilon) \to 0 \quad \text{as } n \to \infty$$
■

The weak law of large numbers implies that the sample moments converge in probability to the population moments. If the functions relating the estimates to the sample moments are continuous, the estimates will converge to the parameters as the sample moments converge to the population moments.

The consistency of method of moments estimates can be used to provide a justification for a procedure that we used in estimating standard errors in the previous examples. We were interested in the variance (or its square root, the standard error) of a parameter estimate $\hat\theta$. Denoting the true parameter by $\theta_0$, we had a relationship of the form

$$\sigma_{\hat\theta} = \frac{1}{\sqrt n}\,\sigma(\theta_0)$$

(In Example A, $\sigma_{\hat\lambda} = \sqrt{\lambda_0/n}$, so that $\sigma(\lambda) = \sqrt\lambda$.) We approximated this by the estimated standard error

$$s_{\hat\theta} = \frac{1}{\sqrt n}\,\sigma(\hat\theta)$$

We now claim that the consistency of $\hat\theta$ implies that $s_{\hat\theta} \approx \sigma_{\hat\theta}$. More precisely,

$$\lim_{n\to\infty} \frac{s_{\hat\theta}}{\sigma_{\hat\theta}} = 1$$

provided that the function $\sigma(\theta)$ is continuous in $\theta$. The result follows since if $\hat\theta \to \theta_0$, then $\sigma(\hat\theta) \to \sigma(\theta_0)$. Of course, this is just a limiting result and we always have a finite value of $n$ in practice, but it does provide some hope that the ratio will be close to 1 and that the estimated standard error will be a reasonable indication of variability.

Let us summarize the results of this section. We have shown how the method of moments can provide estimates of the parameters of a probability distribution based on a "sample" (an i.i.d. collection) of random variables from that distribution. We addressed the question of variability or reliability of the estimates by observing that if the sample is random, the parameter estimates are random variables having distributions that are referred to as their sampling distributions. The standard deviation of the sampling distribution is called the standard error of the estimate. We then faced the problem of how to ascertain the variability of an estimate from the sample itself. In some cases the sampling distribution was of an explicit form depending upon the unknown parameters (Examples A and B); in these cases we could substitute our estimates for the unknown parameters in order to approximate the sampling distribution. In other cases the form of the sampling distribution was not so obvious, but we realized that even if we didn't know it explicitly, we could simulate it. By using the bootstrap, we avoided doing perhaps difficult analytic calculations by sitting back and instructing a computer to generate random numbers.

8.5 The Method of Maximum Likelihood

As well as being a useful tool for parameter estimation in our current context, the method of maximum likelihood can be applied to a great variety of other statistical problems, such as curve fitting. This general utility is one of the major reasons for the importance of likelihood methods in statistics. We will later see that maximum likelihood estimates have nice theoretical properties as well.

Suppose that random variables $X_1, \ldots, X_n$ have a joint density or frequency function $f(x_1, x_2, \ldots, x_n\mid\theta)$. Given observed values $X_i = x_i$, where $i = 1, \ldots, n$, the likelihood of $\theta$ as a function of $x_1, x_2, \ldots, x_n$ is defined as

$$\operatorname{lik}(\theta) = f(x_1, x_2, \ldots, x_n\mid\theta)$$

Note that we consider the joint density as a function of $\theta$ rather than as a function of the $x_i$. If the distribution is discrete, so that $f$ is a frequency function, the likelihood function gives the probability of observing the given data as a function of the parameter $\theta$. The maximum likelihood estimate (mle) of $\theta$ is that value of $\theta$ that maximizes the likelihood; that is, it makes the observed data "most probable" or "most likely."

If the $X_i$ are assumed to be i.i.d., their joint density is the product of the marginal densities, and the likelihood is

$$\operatorname{lik}(\theta) = \prod_{i=1}^n f(X_i\mid\theta)$$

Rather than maximizing the likelihood itself, it is usually easier to maximize its natural logarithm (which is equivalent, since the logarithm is a monotonic function). For an i.i.d. sample, the log likelihood is

$$l(\theta) = \sum_{i=1}^n \log[f(X_i\mid\theta)]$$

(In this text, "log" will always mean the natural logarithm.)
Let us find the maximum likelihood estimates for the examples first considered in Section 8.4.

EXAMPLE A Poisson Distribution
If $X$ follows a Poisson distribution with parameter $\lambda$, then

$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

If $X_1, \ldots, X_n$ are i.i.d. and Poisson, their joint frequency function is the product of the marginal frequency functions. The log likelihood is thus

$$l(\lambda) = \sum_{i=1}^n (X_i \log\lambda - \lambda - \log X_i!) = \log\lambda \sum_{i=1}^n X_i - n\lambda - \sum_{i=1}^n \log X_i!$$

FIGURE 8.5 Plot of the log likelihood function of λ for the asbestos data.

Figure 8.5 is a graph of $l(\lambda)$ for the asbestos counts of Example A in Section 8.4. Setting the first derivative of the log likelihood equal to zero, we find

$$l'(\lambda) = \frac{1}{\lambda}\sum_{i=1}^n X_i - n = 0$$

The mle is then

$$\hat\lambda = \bar X$$

We can check that this is indeed a maximum (in fact, $l(\lambda)$ is a concave function of $\lambda$; see Figure 8.5). The maximum likelihood estimate agrees with the method of moments estimate for this case and thus has the same sampling distribution. ■

EXAMPLE B Normal Distribution
If $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, their joint density is the product of their marginal densities:

$$f(x_1, x_2, \ldots, x_n\mid\mu, \sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right]$$

Regarded as a function of $\mu$ and $\sigma$, this is the likelihood function. The log likelihood is thus

$$l(\mu, \sigma) = -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2$$

The partials with respect to $\mu$ and $\sigma$ are

$$\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)$$

$$\frac{\partial l}{\partial\sigma} = -\frac{n}{\sigma} + \sigma^{-3}\sum_{i=1}^n (X_i - \mu)^2$$

Setting the first partial equal to zero and solving for the mle, we obtain

$$\hat\mu = \bar X$$

Setting the second partial equal to zero and substituting the mle for $\mu$, we find that the mle for $\sigma$ is

$$\hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2}$$

Again, these estimates and their sampling distributions are the same as those obtained by the method of moments. ■

EXAMPLE C Gamma Distribution
Since the density function of a gamma distribution is

$$f(x\mid\alpha, \lambda) = \frac{1}{\Gamma(\alpha)}\lambda^\alpha x^{\alpha-1} e^{-\lambda x}, \qquad 0 \le x < \infty$$

the log likelihood of an i.i.d. sample, $X_1, \ldots, X_n$, is

$$l(\alpha, \lambda) = \sum_{i=1}^n [\alpha\log\lambda + (\alpha - 1)\log X_i - \lambda X_i - \log\Gamma(\alpha)]$$
$$= n\alpha\log\lambda + (\alpha - 1)\sum_{i=1}^n \log X_i - \lambda\sum_{i=1}^n X_i - n\log\Gamma(\alpha)$$

The partial derivatives are

$$\frac{\partial l}{\partial\alpha} = n\log\lambda + \sum_{i=1}^n \log X_i - n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$$

$$\frac{\partial l}{\partial\lambda} = \frac{n\alpha}{\lambda} - \sum_{i=1}^n X_i$$

Setting the second partial equal to zero, we find

$$\hat\lambda = \frac{n\hat\alpha}{\sum_{i=1}^n X_i} = \frac{\hat\alpha}{\bar X}$$

But when this solution is substituted into the equation for the first partial, we obtain a nonlinear equation for the mle of $\alpha$:

$$n\log\hat\alpha - n\log\bar X + \sum_{i=1}^n \log X_i - n\frac{\Gamma'(\hat\alpha)}{\Gamma(\hat\alpha)} = 0$$

This equation cannot be solved in closed form; an iterative method for finding the roots has to be employed. To start the iterative procedure, we could use the initial value obtained by the method of moments. For this example, the two methods do not give the same estimates.

The mle's were computed from the precipitation data of Example C in Section 8.4 by an iterative procedure (a combination of the secant method and the method of bisection) using the method of moments estimates as starting values. The resulting estimates are $\hat\alpha = .441$ and $\hat\lambda = 1.96$. In Example C in Section 8.4, the method of moments estimates were found to be $\hat\alpha = .375$ and $\hat\lambda = 1.674$. Figure 8.3 shows fitted densities from both types of estimates of $\alpha$ and $\lambda$.
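Solving the nonlinear equation for $\hat\alpha$ is straightforward with a numerical root finder. A sketch follows (Python with SciPy; `digamma` computes $\Gamma'(\alpha)/\Gamma(\alpha)$, and the bracketing interval around the method of moments starting value is an assumption of ours, not something the text specifies):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def gamma_mle(x):
    """Maximum likelihood estimates (alpha_hat, lambda_hat) for a gamma sample.

    Solves n log(alpha) - n log(xbar) + sum(log x_i) - n digamma(alpha) = 0,
    then sets lambda_hat = alpha_hat / xbar.
    """
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    logsum = np.log(x).sum()

    def score(a):
        return n * np.log(a) - n * np.log(xbar) + logsum - n * digamma(a)

    a0 = xbar**2 / x.var()                    # method of moments starting value
    a_hat = brentq(score, a0 / 10, a0 * 10)   # assumes the root lies in this bracket
    return a_hat, a_hat / xbar
```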
There is clearly little practical difference, especially if we keep in mind that the gamma distribution is only a possible model and should not be taken as being literally true.

Because the maximum likelihood estimates are not given in closed form, obtaining their exact sampling distribution would appear to be intractable. We thus use the bootstrap to approximate these distributions, just as we did to approximate the sampling distributions of the method of moments estimates. The underlying rationale is the same: If we knew the "true" values, $\alpha_0$ and $\lambda_0$, say, we could approximate the sampling distribution of their maximum likelihood estimates by generating many, many samples of size $n = 227$ from a gamma distribution with parameters $\alpha_0$ and $\lambda_0$, forming the maximum likelihood estimates from each sample, and displaying the results in histograms. Since, of course, we don't know the true values, we let our maximum likelihood estimates play their role: We generated 1000 samples each of size $n = 227$ of gamma distributed random variables with $\alpha = .471$ and $\lambda = 1.97$. For each of these samples, the maximum likelihood estimates of $\alpha$ and $\lambda$ were calculated. Histograms of these 1000 estimates are shown in Figure 8.6; we regard these histograms as approximations to the sampling distributions of the maximum likelihood estimates $\hat\alpha$ and $\hat\lambda$.

FIGURE 8.6 Histograms of 1000 simulated maximum likelihood estimates of (a) α and (b) λ.

Comparison of Figures 8.6 and 8.4 is interesting. We see that the sampling distributions of the maximum likelihood estimates are substantially less dispersed than those of the method of moments estimates, which indicates that in this situation, the method of maximum likelihood is more precise than the method of moments. The standard deviations of the values displayed in the histograms are the estimated standard errors of the maximum likelihood estimates; we find $s_{\hat\alpha} = .03$ and $s_{\hat\lambda} = .26$. Recall that in Example C of Section 8.4 the corresponding estimated standard errors for the method of moments estimates were found to be .06 and .34. ■

EXAMPLE D Muon Decay
From the form of the density given in Example D in Section 8.4, the log likelihood is

$$l(\alpha) = \sum_{i=1}^n \log(1 + \alpha X_i) - n\log 2$$

Setting the derivative equal to zero, we see that the mle of $\alpha$ satisfies the following nonlinear equation:

$$\sum_{i=1}^n \frac{X_i}{1 + \hat\alpha X_i} = 0$$

Again, we would have to use an iterative technique to solve for $\hat\alpha$. The method of moments estimate could be used as a starting value. ■

In Examples C and D, in order to find the maximum likelihood estimate, we would have to solve a nonlinear equation. In general, in some problems involving several parameters, systems of nonlinear equations must be solved to find the mle's. We will not discuss numerical methods here; a good discussion is found in Chapter 6 of Dahlquist and Bjorck (1974).
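For Example D, the likelihood equation can be solved the same way. A minimal sketch (Python with SciPy; the function name is ours), assuming the sample contains both positive and negative values of $x$ so that the score function changes sign on $(-1, 1)$:

```python
import numpy as np
from scipy.optimize import brentq

def muon_alpha_mle(x):
    """Solve sum(x_i / (1 + alpha x_i)) = 0 for the mle of alpha.

    The score is strictly decreasing in alpha, so any sign change on the
    bracket below yields the unique root."""
    x = np.asarray(x, dtype=float)

    def score(a):
        return np.sum(x / (1.0 + a * x))

    return brentq(score, -0.999, 0.999)
```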
8.5.1 Maximum Likelihood Estimates of Multinomial Cell Probabilities
The method of maximum likelihood is often applied to problems involving multinomial cell probabilities. Suppose that $X_1, \ldots, X_m$, the counts in cells $1, \ldots, m$, follow a multinomial distribution with a total count of $n$ and cell probabilities $p_1, \ldots, p_m$. We wish to estimate the $p$'s from the $x$'s. The joint frequency function of $X_1, \ldots, X_m$ is

$$f(x_1, \ldots, x_m\mid p_1, \ldots, p_m) = \frac{n!}{\prod_{i=1}^m x_i!}\prod_{i=1}^m p_i^{x_i}$$

Note that the marginal distribution of each $X_i$ is binomial$(n, p_i)$, and that since the $X_i$ are not independent (they are constrained to sum to $n$), their joint frequency function is not the product of the marginal frequency functions, as it was in the examples considered in the preceding section. We can, however, still use the method of maximum likelihood since we can write an expression for the joint distribution. We assume $n$ is given, and we wish to estimate $p_1, \ldots, p_m$ with the constraint that the $p_i$ sum to 1. From the joint frequency function just given, the log likelihood is

$$l(p_1, \ldots, p_m) = \log n! - \sum_{i=1}^m \log x_i! + \sum_{i=1}^m x_i\log p_i$$

To maximize this likelihood subject to the constraint, we introduce a Lagrange multiplier and maximize

$$L(p_1, \ldots, p_m, \lambda) = \log n! - \sum_{i=1}^m \log x_i! + \sum_{i=1}^m x_i\log p_i + \lambda\left(\sum_{i=1}^m p_i - 1\right)$$

Setting the partial derivatives equal to zero, we have the following system of equations:

$$\hat p_j = -\frac{x_j}{\lambda}, \qquad j = 1, \ldots, m$$

Summing both sides of this equation, we have

$$1 = \frac{-n}{\lambda}$$

or $\lambda = -n$. Therefore,

$$\hat p_j = \frac{x_j}{n}$$

which is an obvious set of estimates. The sampling distribution of $\hat p_j$ is determined by the distribution of $x_j$, which is binomial.

In some situations, such as frequently occur in the study of genetics, the multinomial cell probabilities are functions of other unknown parameters $\theta$; that is, $p_i = p_i(\theta)$. In such cases, the log likelihood of $\theta$ is

$$l(\theta) = \log n! - \sum_{i=1}^m \log x_i! + \sum_{i=1}^m x_i\log p_i(\theta)$$

EXAMPLE A Hardy-Weinberg Equilibrium
If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur in a population with frequencies $(1-\theta)^2$, $2\theta(1-\theta)$, and $\theta^2$, according to the Hardy-Weinberg law. In a sample from the Chinese population of Hong Kong in 1937, blood types occurred with the following frequencies, where M and N are erythrocyte antigens:

Blood Type    M     MN    N     Total
Frequency     342   500   187   1029

There are several possible ways to estimate $\theta$ from the observed frequencies. For example, if we equate $\theta^2$ with $187/1029$, we obtain .4263 as an estimate of $\theta$. Intuitively, however, it seems that this procedure ignores some of the information in the other cells. If we let $X_1$, $X_2$, and $X_3$ denote the counts in the three cells and let $n = 1029$, the log likelihood of $\theta$ is (you should check this)

$$l(\theta) = \log n! - \sum_{i=1}^3 \log X_i! + X_1\log(1-\theta)^2 + X_2\log 2\theta(1-\theta) + X_3\log\theta^2$$
$$= \log n! - \sum_{i=1}^3 \log X_i! + (2X_1 + X_2)\log(1-\theta) + (2X_3 + X_2)\log\theta + X_2\log 2$$

In maximizing $l(\theta)$, we do not need to explicitly incorporate the constraint that the cell probabilities sum to 1, since the functional form of $p_i(\theta)$ is such that $\sum_{i=1}^3 p_i(\theta) = 1$. Setting the derivative equal to zero, we have

$$-\frac{2X_1 + X_2}{1 - \theta} + \frac{2X_3 + X_2}{\theta} = 0$$

Solving this, we obtain the mle:

$$\hat\theta = \frac{2X_3 + X_2}{2X_1 + 2X_2 + 2X_3} = \frac{2X_3 + X_2}{2n} = \frac{2\times 187 + 500}{2\times 1029} = .4247$$

How precise is this estimate? Do we have faith in the accuracy of the first, second, third, or fourth decimal place? We will address these questions by using the bootstrap to estimate the sampling distribution and the standard error of $\hat\theta$. The bootstrap logic is as follows: If $\theta$ were known, then the three multinomial cell probabilities, $(1-\theta)^2$, $2\theta(1-\theta)$, and $\theta^2$, would be known. To find the sampling distribution of $\hat\theta$, we could simulate many multinomial random variables with these probabilities and $n = 1029$, and for each we could form an estimate of $\theta$. A histogram of these estimates would be an approximation to the sampling distribution.
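The estimate and the proposed simulation are easy to carry out. A sketch (Python with NumPy; the seed and names are ours), anticipating the substitution of $\hat\theta$ for $\theta$ described next:

```python
import numpy as np

x = np.array([342, 500, 187])            # counts for blood types M, MN, N
n = x.sum()                              # 1029
theta_hat = (2 * x[2] + x[1]) / (2 * n)  # 0.4247

# Simulate multinomial counts from the fitted cell probabilities and
# re-estimate theta from each simulated table.
rng = np.random.default_rng(0)
p = [(1 - theta_hat)**2, 2 * theta_hat * (1 - theta_hat), theta_hat**2]
tables = rng.multinomial(n, p, size=1000)
thetas = (2 * tables[:, 2] + tables[:, 1]) / (2 * n)
print(thetas.std())                      # roughly .011, as found below
```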
Since, of course, we don't know the actual value of $\theta$ to use in such a simulation, the bootstrap principle tells us to use $\hat\theta = .4247$ in its place. With this estimated value of $\theta$, the three cell probabilities (M, MN, N) are .331, .489, and .180. One thousand multinomial random counts, each with total count 1029, were simulated with these probabilities (see Problem 35 at the end of the chapter for the method of generating these random counts). From each of these 1000 computer "experiments," a value $\theta^*$ was determined. A histogram of the estimates (Figure 8.7) can be regarded as an estimate of the sampling distribution of $\hat\theta$. The estimated standard error of $\hat\theta$ is the standard deviation of these 1000 values: $s_{\hat\theta} = .011$. ■

8.5.2 Large Sample Theory for Maximum Likelihood Estimates
In this section we develop approximations to the sampling distribution of maximum likelihood estimates by using limiting arguments as the sample size increases. The theory we shall sketch shows that under reasonable conditions, maximum likelihood estimates are consistent. We also develop a useful and important approximation for the variance of a maximum likelihood estimate and argue that for large sample sizes, the sampling distribution is approximately normal. The rigorous development of this large sample theory is quite technical; we will simply state some results and give very rough, heuristic arguments for the case of an i.i.d. sample and a one-dimensional parameter. (The arguments for Theorems A and B may be skipped without loss of continuity. Rigorous proofs may be found in Cramér (1946).)

FIGURE 8.7 Histogram of 1000 simulated maximum likelihood estimates of θ described in Example A.

For an i.i.d. sample of size $n$, the log likelihood is

$$l(\theta) = \sum_{i=1}^n \log f(x_i\mid\theta)$$

We denote the true value of $\theta$ by $\theta_0$. It can be shown that under reasonable conditions $\hat\theta$ is a consistent estimate of $\theta_0$; that is, $\hat\theta$ converges to $\theta_0$ in probability as $n$ approaches infinity.

THEOREM A
Under appropriate smoothness conditions on $f$, the mle from an i.i.d. sample is consistent.

Proof
The following is merely a sketch of the proof. Consider maximizing

$$\frac{1}{n}l(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i\mid\theta)$$

As $n$ tends to infinity, the law of large numbers implies that

$$\frac{1}{n}l(\theta) \to E\log f(X\mid\theta) = \int \log f(x\mid\theta)\, f(x\mid\theta_0)\,dx$$

It is thus plausible that for large $n$, the $\theta$ that maximizes $l(\theta)$ should be close to the $\theta$ that maximizes $E\log f(X\mid\theta)$. (An involved argument is necessary to establish this.) To maximize $E\log f(X\mid\theta)$, we consider its derivative:

$$\frac{\partial}{\partial\theta}\int \log f(x\mid\theta)\, f(x\mid\theta_0)\,dx = \int \frac{\frac{\partial}{\partial\theta} f(x\mid\theta)}{f(x\mid\theta)}\, f(x\mid\theta_0)\,dx$$

If $\theta = \theta_0$, this equation becomes

$$\int \frac{\partial}{\partial\theta} f(x\mid\theta_0)\,dx = \frac{\partial}{\partial\theta}\int f(x\mid\theta_0)\,dx = \frac{\partial}{\partial\theta}(1) = 0$$

which shows that $\theta_0$ is a stationary point and hopefully a maximum. Note that we have interchanged differentiation and integration and that the assumption of smoothness on $f$ must be strong enough to justify this. ■

We will now derive a useful intermediate result.
LEMMA A
Define $I(\theta)$ by

$$I(\theta) = E\left[\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right]^2$$

Under appropriate smoothness conditions on $f$, $I(\theta)$ may also be expressed as

$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\theta)\right]$$

Proof
First, we observe that since $\int f(x\mid\theta)\,dx = 1$,

$$\frac{\partial}{\partial\theta}\int f(x\mid\theta)\,dx = 0$$

Combining this with the identity

$$\frac{\partial}{\partial\theta} f(x\mid\theta) = \left[\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right] f(x\mid\theta)$$

we have

$$0 = \frac{\partial}{\partial\theta}\int f(x\mid\theta)\,dx = \int \left[\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right] f(x\mid\theta)\,dx$$

where we have interchanged differentiation and integration (some assumptions must be made in order to do this). Taking second derivatives of the preceding expressions, we have

$$0 = \frac{\partial}{\partial\theta}\int \left[\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right] f(x\mid\theta)\,dx$$
$$= \int \left[\frac{\partial^2}{\partial\theta^2}\log f(x\mid\theta)\right] f(x\mid\theta)\,dx + \int \left[\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right]^2 f(x\mid\theta)\,dx$$

From this, the desired result follows. ■

The large sample distribution of a maximum likelihood estimate is approximately normal with mean $\theta_0$ and variance $1/[nI(\theta_0)]$. Since this is merely a limiting result, which holds as the sample size tends to infinity, we say that the mle is asymptotically unbiased and refer to the variance of the limiting normal distribution as the asymptotic variance of the mle.

THEOREM B
Under smoothness conditions on $f$, the probability distribution of $\sqrt{nI(\theta_0)}\,(\hat\theta - \theta_0)$ tends to a standard normal distribution.

Proof
The following is merely a sketch of the proof; the details of the argument are beyond the scope of this book. From a Taylor series expansion,

$$0 = l'(\hat\theta) \approx l'(\theta_0) + (\hat\theta - \theta_0)l''(\theta_0)$$
$$(\hat\theta - \theta_0) \approx \frac{-l'(\theta_0)}{l''(\theta_0)}$$
$$n^{1/2}(\hat\theta - \theta_0) \approx \frac{-n^{-1/2}l'(\theta_0)}{n^{-1}l''(\theta_0)}$$

First, we consider the numerator of this last expression. Its expectation is

$$E[n^{-1/2}l'(\theta_0)] = n^{-1/2}\sum_{i=1}^n E\left[\frac{\partial}{\partial\theta}\log f(X_i\mid\theta_0)\right] = 0$$

as in Theorem A. Its variance is

$$\operatorname{Var}[n^{-1/2}l'(\theta_0)] = \frac{1}{n}\sum_{i=1}^n E\left[\frac{\partial}{\partial\theta}\log f(X_i\mid\theta_0)\right]^2 = I(\theta_0)$$

Next, we consider the denominator:

$$\frac{1}{n}l''(\theta_0) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(x_i\mid\theta_0)$$

By the law of large numbers, the latter expression converges to

$$E\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\theta_0)\right] = -I(\theta_0)$$

from Lemma A. We thus have

$$n^{1/2}(\hat\theta - \theta_0) \approx \frac{n^{-1/2}l'(\theta_0)}{I(\theta_0)}$$

Therefore,

$$E[n^{1/2}(\hat\theta - \theta_0)] \approx 0$$

Furthermore,

$$\operatorname{Var}[n^{1/2}(\hat\theta - \theta_0)] \approx \frac{I(\theta_0)}{I^2(\theta_0)} = \frac{1}{I(\theta_0)}$$

and thus

$$\operatorname{Var}(\hat\theta - \theta_0) \approx \frac{1}{nI(\theta_0)}$$

The central limit theorem may be applied to $l'(\theta_0)$, which is a sum of i.i.d. random variables:

$$l'(\theta_0) = \sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i\mid\theta_0) \qquad \blacksquare$$

Another interpretation of the result of Theorem B is as follows. For an i.i.d. sample, the maximum likelihood estimate is the maximizer of the log likelihood function,

$$l(\theta) = \sum_{i=1}^n \log f(X_i\mid\theta)$$

The asymptotic variance is

$$\frac{1}{nI(\theta_0)} = -\frac{1}{E\,l''(\theta_0)}$$

When $|E\,l''(\theta_0)|$ is large, $l(\theta)$ is, on average, changing very rapidly in a vicinity of $\theta_0$, and the variance of the maximizer is small.

A corresponding result can be proved for the multidimensional case. The vector of maximum likelihood estimates is asymptotically normally distributed. The mean of the asymptotic distribution is the vector of true parameters, $\theta_0$. The covariance of the estimates $\hat\theta_i$ and $\hat\theta_j$ is given by the $ij$ entry of the matrix $n^{-1}I^{-1}(\theta_0)$, where $I(\theta)$ is the matrix with $ij$ component

$$E\left[\frac{\partial}{\partial\theta_i}\log f(X\mid\theta)\,\frac{\partial}{\partial\theta_j}\log f(X\mid\theta)\right] = -E\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f(X\mid\theta)\right]$$

Since we do not wish to delve deeply into technical details, we do not specify the conditions under which the results obtained in this section hold. It is worth mentioning, however, that the true parameter value, $\theta_0$, is required to be an interior point of the set of all parameter values. Thus the results would not be expected to apply in Example D of Section 8.5 if $\alpha_0 = 1$, for example. It is also required that the support of the density or frequency function $f(x\mid\theta)$ [the set of values for which $f(x\mid\theta) > 0$] does not depend on $\theta$. Thus, for example, the results would not be expected to apply to estimating $\theta$ from a sample of random variables that were uniformly distributed on the interval $[0, \theta]$. The following sections will apply these results in several examples.
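For the Poisson distribution, Example A of Section 8.4 showed directly that $\operatorname{Var}(\hat\lambda) = \lambda_0/n$, and we will see shortly that $I(\lambda) = 1/\lambda$, so Theorem B gives the same answer. A quick simulation check of this agreement (Python with NumPy, our choice; the values mirror the asbestos example):

```python
import numpy as np

rng = np.random.default_rng(0)
lam0, n = 24.9, 23

# 10,000 replicate samples; the mle of lambda is the sample mean.
mles = rng.poisson(lam0, size=(10_000, n)).mean(axis=1)
print(mles.var())    # close to 1 / (n I(lam0)) = lam0 / n = 1.083
print(lam0 / n)
```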
8.5.3 Confidence Intervals from Maximum Likelihood Estimates
In Chapter 7, confidence intervals for the population mean $\mu$ were introduced. Recall that the confidence interval for $\mu$ was a random interval that contained $\mu$ with some specified probability. In the current context, we are interested in estimating the parameter $\theta$ of a probability distribution. We will develop confidence intervals for $\theta$ based on $\hat\theta$; these intervals serve essentially the same function as they did in Chapter 7 in that they express in a fairly direct way the degree of uncertainty in the estimate $\hat\theta$. A confidence interval for $\theta$ is an interval based on the sample values used to estimate $\theta$. Since these sample values are random, the interval is random and the probability that it contains $\theta$ is called the coverage probability of the interval. Thus, for example, a 90% confidence interval for $\theta$ is a random interval that contains $\theta$ with probability .9. A confidence interval quantifies the uncertainty of a parameter estimate.

We will discuss three methods for forming confidence intervals for maximum likelihood estimates: exact methods, approximations based on the large sample properties of maximum likelihood estimates, and bootstrap confidence intervals. The construction of confidence intervals for parameters of a normal distribution illustrates the use of exact methods.

EXAMPLE A
We found in Example B of Section 8.5 that the maximum likelihood estimates of $\mu$ and $\sigma^2$ from an i.i.d. normal sample are

$$\hat\mu = \bar X, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$$

A confidence interval for $\mu$ is based on the fact that

$$\frac{\sqrt n(\bar X - \mu)}{S} \sim t_{n-1}$$

where $t_{n-1}$ denotes the $t$ distribution with $n - 1$ degrees of freedom and

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$$

(see Section 6.3). Let $t_{n-1}(\alpha/2)$ denote that point beyond which the $t$ distribution with $n - 1$ degrees of freedom has probability $\alpha/2$. Since the $t$ distribution is symmetric about 0, the probability to the left of $-t_{n-1}(\alpha/2)$ is also $\alpha/2$. Then, by definition,

$$P\left(-t_{n-1}(\alpha/2) \le \frac{\sqrt n(\bar X - \mu)}{S} \le t_{n-1}(\alpha/2)\right) = 1 - \alpha$$

The inequality can be manipulated to yield

$$P\left(\bar X - \frac{S}{\sqrt n}\,t_{n-1}(\alpha/2) \le \mu \le \bar X + \frac{S}{\sqrt n}\,t_{n-1}(\alpha/2)\right) = 1 - \alpha$$

According to this equation, the probability that $\mu$ lies in the interval $\bar X \pm S\,t_{n-1}(\alpha/2)/\sqrt n$ is $1 - \alpha$. Note that this interval is random: The center is at the random point $\bar X$ and the width is proportional to $S$, which is also random.

Now let us turn to a confidence interval for $\sigma^2$. From Section 6.3,

$$\frac{n\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-1}$$

where $\chi^2_{n-1}$ denotes the chi-squared distribution with $n - 1$ degrees of freedom. Let $\chi^2_m(\alpha)$ denote the point beyond which the chi-square distribution with $m$ degrees of freedom has probability $\alpha$. It then follows by definition that

$$P\left(\chi^2_{n-1}(1 - \alpha/2) \le \frac{n\hat\sigma^2}{\sigma^2} \le \chi^2_{n-1}(\alpha/2)\right) = 1 - \alpha$$

Manipulation of the inequalities yields

$$P\left(\frac{n\hat\sigma^2}{\chi^2_{n-1}(\alpha/2)} \le \sigma^2 \le \frac{n\hat\sigma^2}{\chi^2_{n-1}(1 - \alpha/2)}\right) = 1 - \alpha$$

Therefore, a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$ is

$$\left(\frac{n\hat\sigma^2}{\chi^2_{n-1}(\alpha/2)},\ \frac{n\hat\sigma^2}{\chi^2_{n-1}(1 - \alpha/2)}\right)$$

Note that this interval is not symmetric about $\hat\sigma^2$; unlike the previous interval, it is not of the form $\hat\sigma^2 \pm c$.
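Both exact intervals are one-liners given the quantile functions. A sketch (Python with SciPy; the function name and defaults are ours):

```python
import numpy as np
from scipy import stats

def normal_exact_cis(x, alpha=0.10):
    """Exact 100(1-alpha)% confidence intervals for mu and sigma^2
    from an i.i.d. normal sample, following Example A."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)   # S uses the 1/(n-1) divisor
    sig2_hat = x.var(ddof=0)            # mle of sigma^2, 1/n divisor

    t = stats.t.ppf(1 - alpha / 2, df=n - 1)
    mu_ci = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

    # chi2_{n-1}(alpha/2) is the upper-tail point, i.e., ppf(1 - alpha/2).
    hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    lo = stats.chi2.ppf(alpha / 2, df=n - 1)
    var_ci = (n * sig2_hat / hi, n * sig2_hat / lo)
    return mu_ci, var_ci
```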
A simulation illustrates these ideas: The following experiment was done on a computer 20 times. A random sample of size $n = 11$ from a normal distribution with mean $\mu = 10$ and variance $\sigma^2 = 9$ was generated. From the sample, $\bar X$ and $\hat\sigma^2$ were calculated, and 90% confidence intervals for $\mu$ and $\sigma^2$ were constructed, as described before. Thus at the end there were 20 intervals for $\mu$ and 20 intervals for $\sigma^2$. The 20 intervals for $\mu$ are shown as vertical lines in the left panel of Figure 8.8 and the 20 intervals for $\sigma^2$ are shown in the right panel.

FIGURE 8.8 20 confidence intervals for μ (left panel) and for σ² (right panel) as described in Example A. Horizontal lines indicate the true values.

Horizontal lines are drawn at the true values $\mu = 10$ and $\sigma^2 = 9$. Since these are 90% confidence intervals, we expect the true parameter values to fall outside the intervals 10% of the time; thus on the average we would expect 2 of 20 intervals to fail to cover the true parameter value. From the figure, we see that all the intervals for $\mu$ actually cover $\mu$, whereas four of the intervals for $\sigma^2$ failed to contain $\sigma^2$. ■
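The experiment is simple to repeat. A sketch of the coverage check for the intervals for $\mu$ (Python with NumPy and SciPy; the seed and counts are ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, alpha = 10.0, 3.0, 11, 0.10
t = stats.t.ppf(1 - alpha / 2, df=n - 1)

covered = 0
for _ in range(20):
    x = rng.normal(mu, sigma, size=n)
    half = t * x.std(ddof=1) / np.sqrt(n)
    covered += x.mean() - half <= mu <= x.mean() + half
print(covered, "of 20 intervals for mu cover the true value")
```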
Exact methods such as that illustrated in the previous example are the exception rather than the rule in practice. To construct an exact interval requires detailed knowledge of the sampling distribution as well as some cleverness.

A second method of constructing confidence intervals is based on the large sample theory of the previous section. According to the results of that section, the distribution of $\sqrt{nI(\theta_0)}(\hat\theta - \theta_0)$ is approximately the standard normal distribution. Since $\theta_0$ is unknown, we will use $I(\hat\theta)$ in place of $I(\theta_0)$; we have employed similar substitutions a number of times before, for example, in finding an approximate standard error in Example A of Section 8.4. It can be further argued that the distribution of $\sqrt{nI(\hat\theta)}(\hat\theta - \theta_0)$ is also approximately standard normal. Since the standard normal distribution is symmetric about 0,

$$P\left(-z(\alpha/2) \le \sqrt{nI(\hat\theta)}\,(\hat\theta - \theta_0) \le z(\alpha/2)\right) \approx 1 - \alpha$$

Manipulation of the inequalities yields

$$\hat\theta \pm z(\alpha/2)\sqrt{\frac{1}{nI(\hat\theta)}}$$

as an approximate $100(1 - \alpha)\%$ confidence interval. We now illustrate this procedure with an example.

EXAMPLE B Poisson Distribution
The mle of $\lambda$ from a sample of size $n$ from a Poisson distribution is $\hat\lambda = \bar X$. Since the sum of independent Poisson random variables follows a Poisson distribution, the parameter of which is the sum of the parameters of the individual summands,

$$n\hat\lambda = \sum_{i=1}^n X_i$$

follows a Poisson distribution with mean $n\lambda$. Also, the sampling distribution of $\hat\lambda$ is known, although it depends on the true value of $\lambda$, which is unknown. Exact confidence intervals for $\lambda$ may be obtained by using this fact, and special tables are available (Pearson and Hartley 1966).

For large samples, confidence intervals may be derived as follows. First, we need to calculate $I(\lambda)$. Let $f(x\mid\lambda)$ denote the probability mass function of a Poisson random variable with parameter $\lambda$. There are two ways to do this. We may use the definition

$$I(\lambda) = E\left[\frac{\partial}{\partial\lambda}\log f(X\mid\lambda)\right]^2$$

We know that

$$\log f(x\mid\lambda) = x\log\lambda - \lambda - \log x!$$

and thus

$$I(\lambda) = E\left(\frac{X}{\lambda} - 1\right)^2$$

Rather than evaluate this quantity, we may use the alternative expression for $I(\lambda)$ given by Lemma A of Section 8.5.2:

$$I(\lambda) = -E\left[\frac{\partial^2}{\partial\lambda^2}\log f(X\mid\lambda)\right]$$

Since

$$\frac{\partial^2}{\partial\lambda^2}\log f(X\mid\lambda) = -\frac{X}{\lambda^2}$$

$I(\lambda)$ is simply

$$\frac{E(X)}{\lambda^2} = \frac{1}{\lambda}$$

Thus, an approximate $100(1 - \alpha)\%$ confidence interval for $\lambda$ is

$$\bar X \pm z(\alpha/2)\sqrt{\frac{\bar X}{n}}$$

Note that the asymptotic variance is in fact the exact variance in this case. The confidence interval, however, is only approximate, since the sampling distribution of $\bar X$ is only approximately normal.

As a concrete example, let us return to the study that involved counting asbestos fibers on filters, discussed earlier. In Example A in Section 8.4, we found $\hat\lambda = 24.9$. The estimated standard error of $\hat\lambda$ is thus ($n = 23$)

$$s_{\hat\lambda} = \sqrt{\frac{\hat\lambda}{n}} = 1.04$$

An approximate 90% confidence interval for $\lambda$ is $\hat\lambda \pm 1.65\, s_{\hat\lambda}$, or (23.2, 26.6). This interval gives a good indication of the uncertainty inherent in the determination of the average asbestos level, using the model that the counts in the grid squares are independent Poisson random variables. ■

In a similar way, approximate confidence intervals can be obtained for parameters estimated from random multinomial counts. The counts are not i.i.d., so the variance of the parameter estimate is not of the form $1/[nI(\theta)]$. However, it can be shown that

$$\operatorname{Var}(\hat\theta) \approx \frac{1}{E[l'(\theta_0)^2]} = -\frac{1}{E[l''(\theta_0)]}$$

and the maximum likelihood estimate is approximately normally distributed. Example C illustrates this concept.

EXAMPLE C Hardy-Weinberg Equilibrium
Let us return to the example of Hardy-Weinberg equilibrium discussed in Example A in Section 8.5.1. There we found $\hat\theta = .4247$. Now,

$$l'(\theta) = -\frac{2X_1 + X_2}{1 - \theta} + \frac{2X_3 + X_2}{\theta}$$

In order to calculate $E[l'(\theta)^2]$, we would have to deal with the variances and covariances of the $X_i$. This does not look too inviting; it turns out to be easier to calculate $E[l''(\theta)]$:

$$l''(\theta) = -\frac{2X_1 + X_2}{(1 - \theta)^2} - \frac{2X_3 + X_2}{\theta^2}$$

Since the $X_i$ are binomially distributed, we have

$$E(X_1) = n(1 - \theta)^2, \qquad E(X_2) = 2n\theta(1 - \theta), \qquad E(X_3) = n\theta^2$$

We find, after some algebra, that

$$E[l''(\theta)] = -\frac{2n}{\theta(1 - \theta)}$$

Since $\theta$ is unknown, we substitute $\hat\theta$ in its place and obtain the estimated standard error of $\hat\theta$:

$$s_{\hat\theta} = \frac{1}{\sqrt{-E[l''(\hat\theta)]}} = \sqrt{\frac{\hat\theta(1 - \hat\theta)}{2n}} = .011$$

An approximate 95% confidence interval for $\theta$ is $\hat\theta \pm 1.96\, s_{\hat\theta}$, or (.403, .447). (Note that this estimated standard error of $\hat\theta$ agrees with that obtained by the bootstrap in Example A of Section 8.5.1.) ■
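The numbers in Example C follow from two lines of arithmetic. A sketch (Python with NumPy):

```python
import numpy as np

theta_hat, n = 0.4247, 1029
# -E[l''(theta)] = 2n / (theta (1 - theta)), so the estimated standard
# error is sqrt(theta_hat (1 - theta_hat) / (2n)).
s = np.sqrt(theta_hat * (1 - theta_hat) / (2 * n))
print(s)                                            # about .011
print(theta_hat - 1.96 * s, theta_hat + 1.96 * s)   # close to (.403, .447)
```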
Finally, we describe the use of the bootstrap for finding approximate confidence intervals. Suppose that $\hat\theta$ is an estimate of a parameter $\theta$, the true, unknown value of which is $\theta_0$, and suppose for the moment that the distribution of $\Delta = \hat\theta - \theta_0$ is known. Denote the $\alpha/2$ and $1 - \alpha/2$ quantiles of this distribution by $\underline\delta$ and $\bar\delta$; i.e.,

$$P(\hat\theta - \theta_0 \le \underline\delta) = \frac{\alpha}{2}, \qquad P(\hat\theta - \theta_0 \le \bar\delta) = 1 - \frac{\alpha}{2}$$

Then

$$P(\underline\delta \le \hat\theta - \theta_0 \le \bar\delta) = 1 - \alpha$$

and from manipulation of the inequalities,

$$P(\hat\theta - \bar\delta \le \theta_0 \le \hat\theta - \underline\delta) = 1 - \alpha$$

The preceding assumed that the distribution of $\hat\theta - \theta_0$ was known, which is typically not the case. If $\theta_0$ were known, this distribution could be approximated arbitrarily well by simulation: Many, many samples of observations could be randomly generated on a computer with the true value $\theta_0$; for each sample, the difference $\hat\theta - \theta_0$ could be recorded; and the two quantiles $\underline\delta$ and $\bar\delta$ could, consequently, be determined as accurately as desired. Since $\theta_0$ is not known, the bootstrap principle suggests using $\hat\theta$ in its place: Generate many, many samples (say, $B$ in all) from a distribution with value $\hat\theta$, and for each sample construct an estimate of $\theta$, say $\theta_j^*$, $j = 1, 2, \ldots, B$. The distribution of $\hat\theta - \theta_0$ is then approximated by that of $\theta^* - \hat\theta$, the quantiles of which are used to form an approximate confidence interval. Examples may make this clearer.

EXAMPLE D
We first apply this technique to the Hardy-Weinberg equilibrium problem; we will find an approximate 95% confidence interval based on the bootstrap and compare the result to the interval obtained in Example C, where large-sample theory for maximum likelihood estimates was used. The 1000 bootstrap estimates of $\theta$ of Example A of Section 8.5.1 provide an estimate of the distribution of $\theta^*$; in particular, the 25th largest is .403 and the 975th largest is .446, which are our estimates of the .025 and .975 quantiles of the distribution. The distribution of $\theta^* - \hat\theta$ is approximated by subtracting $\hat\theta = .425$ from each $\theta_i^*$, so the .025 and .975 quantiles of this distribution are estimated as

$$\underline\delta = .403 - .425 = -.022, \qquad \bar\delta = .446 - .425 = .021$$

Thus our approximate 95% confidence interval is

$$(\hat\theta - \bar\delta,\ \hat\theta - \underline\delta) = (.404, .447)$$

Since the uncertainty in $\hat\theta$ is in the second decimal place, this interval and that found in Example C are identical for all practical purposes. ■

EXAMPLE E
Finally, we apply the bootstrap to find approximate confidence intervals for the parameters of the gamma distribution fit in Example C of Section 8.5. Recall that the estimates were $\hat\alpha = .471$ and $\hat\lambda = 1.97$. Of the 1000 bootstrap values of $\alpha^*$: $\alpha_1^*, \alpha_2^*, \ldots, \alpha_{1000}^*$, the 50th largest was .419 and the 950th largest was .538; the .05 and .95 quantiles of the distribution of $\alpha^* - \hat\alpha$ are approximated by subtracting $\hat\alpha$ from these values, giving

$$\underline\delta = .419 - .471 = -.052, \qquad \bar\delta = .538 - .471 = .067$$

Our approximate 90% confidence interval for $\alpha_0$ is thus

$$(\hat\alpha - \bar\delta,\ \hat\alpha - \underline\delta) = (.404, .523)$$

The 50th and 950th largest values of $\lambda^*$ were 1.619 and 2.478, and the corresponding approximate 90% confidence interval for $\lambda_0$ is (1.462, 2.321). ■

We caution the reader that there are a number of different methods of using the bootstrap to find approximate confidence intervals. We have chosen to present the preceding method largely because the reasoning leading to its development is fairly direct. Another popular method, the bootstrap percentile method, uses the quantiles of the bootstrap distribution of $\hat\theta$ directly. Using this method in the previous example, the confidence interval for $\alpha$ would be (.419, .538). Although this direct equation of quantiles of the bootstrap sampling distribution with confidence limits may seem initially appealing, its rationale is somewhat obscure. If the bootstrap distribution is symmetric, the two methods are equivalent (see Problem 38).
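The interval construction of Examples D and E can be packaged as a small helper. A sketch (Python with NumPy; the function name is ours), taking the original estimate and the bootstrap replicates:

```python
import numpy as np

def bootstrap_pivotal_ci(theta_hat, theta_star, alpha=0.10):
    """Approximate 100(1-alpha)% interval from bootstrap replicates theta_star,
    using quantiles of theta* - theta_hat to approximate those of
    theta_hat - theta_0, as described above."""
    lo, hi = np.quantile(np.asarray(theta_star) - theta_hat,
                         [alpha / 2, 1 - alpha / 2])
    return theta_hat - hi, theta_hat - lo

# For instance, applied to the 1000 bootstrap values alpha*_j of Example E
# with theta_hat = .471, this reproduces roughly (.404, .523).
```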
8.6 The Bayesian Approach to Parameter Estimation

A preview of the Bayesian approach was given in Example E of Section 3.5.2, which should be reviewed before continuing.

In the Bayesian approach, the unknown parameter $\theta$ is treated as a random variable, with "prior distribution" $f_\Theta(\theta)$ representing what we know about the parameter before observing data, $X$. In the following, we assume $\Theta$ is a continuous random variable; the discrete case is entirely analogous. This model is in contrast to the approaches described in the previous sections, in which $\theta$ was treated as an unknown constant. For a given value, $\Theta = \theta$, the data have the probability distribution (density or probability mass function) $f_{X\mid\Theta}(x\mid\theta)$. The joint distribution of $X$ and $\Theta$ is thus

$$f_{X,\Theta}(x, \theta) = f_{X\mid\Theta}(x\mid\theta)\, f_\Theta(\theta)$$

and the marginal distribution of $X$ is

$$f_X(x) = \int f_{X,\Theta}(x, \theta)\,d\theta = \int f_{X\mid\Theta}(x\mid\theta)\, f_\Theta(\theta)\,d\theta$$

The distribution of $\Theta$ given the data $X$ is thus

$$f_{\Theta\mid X}(\theta\mid x) = \frac{f_{X,\Theta}(x, \theta)}{f_X(x)} = \frac{f_{X\mid\Theta}(x\mid\theta)\, f_\Theta(\theta)}{\int f_{X\mid\Theta}(x\mid\theta)\, f_\Theta(\theta)\,d\theta}$$

This is called the posterior distribution; it represents what is known about $\Theta$ having observed data $X$. Note that the likelihood is $f_{X\mid\Theta}(x\mid\theta)$, viewed as a function of $\theta$, and we may usefully summarize the preceding result as

$$f_{\Theta\mid X}(\theta\mid x) \propto f_{X\mid\Theta}(x\mid\theta) \times f_\Theta(\theta)$$

$$\text{Posterior density} \propto \text{Likelihood} \times \text{Prior density}$$

The Bayes paradigm has an appealing formal simplicity, as it involves elementary probability operations. We will now see what it amounts to for examples we considered earlier.

EXAMPLE A Fitting a Poisson Distribution
Here the unknown parameter is $\lambda$, which has a prior distribution $f_\Lambda(\lambda)$, and the data are $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$, which for a given value $\lambda$ are Poisson random variables with

$$f_{X_i\mid\Lambda}(x_i\mid\lambda) = \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}, \qquad x_i = 0, 1, 2, \ldots$$

Their joint distribution given $\lambda$ is (from independence) the product of their marginal distributions given $\lambda$:

$$f_{X\mid\Lambda}(x\mid\lambda) = \frac{\lambda^{\sum_{i=1}^n x_i}\, e^{-n\lambda}}{\prod_{i=1}^n x_i!}$$

where $X$ denotes $(X_1, X_2, \ldots, X_n)$. The posterior distribution of $\Lambda$ given $X$ is then

$$f_{\Lambda\mid X}(\lambda\mid x) = \frac{\lambda^{\sum x_i}\, e^{-n\lambda}\, f_\Lambda(\lambda)}{\int \lambda^{\sum x_i}\, e^{-n\lambda}\, f_\Lambda(\lambda)\,d\lambda}$$

(the term $\prod_{i=1}^n x_i!$ has cancelled out). Thus, to evaluate the posterior distribution, we have to do two things: specify the prior distribution $f_\Lambda(\lambda)$ and carry out the integration in the denominator of the preceding expression. For illustration, we consider the data of Examples 8.4A and 8.5A.

We will consider two approaches to specifying the prior distribution. The first is that of an orthodox Bayesian who takes very seriously the model that the prior distribution specifies his prior opinion. Note that this specification should be done before seeing the data, $X$, and he is required to provide the probability density $f_\Lambda(\lambda)$ through introspection. This is not an easy task to carry out, and even the orthodox often compromise for convenience. He thus decides to quantify his opinion by specifying a prior mean $\mu_1 = 15$ and standard deviation $\sigma = 5$ and to use, because the math works out nicely as we will see, a gamma density with that mean and standard deviation. This choice could be aided by plotting gamma densities for various parameter values. The prior density is shown in Figure 8.9. Using the relationships developed in Example C in Section 8.4, the second moment is $\mu_2 = \mu_1^2 + \sigma^2 = 250$, and the parameters of the gamma density are

$$\nu = \frac{\mu_1}{\mu_2 - \mu_1^2} = 0.6, \qquad \alpha = \nu\mu_1 = 9$$

FIGURE 8.9 First statistician's prior (solid) and posterior (dashed). Second statistician's posterior (dotted).

(We denote the parameter by $\nu$ rather than by the usual $\lambda$, since $\lambda$ has already been used for the parameter of the Poisson distribution.) The prior distribution for $\Lambda$ is then

$$f_\Lambda(\lambda) = \frac{\nu^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1} e^{-\nu\lambda}$$

After some cancellation, the posterior density is

$$f_{\Lambda\mid X}(\lambda\mid x) = \frac{\lambda^{\sum x_i + \alpha - 1}\, e^{-(n+\nu)\lambda}}{\int_0^\infty \lambda^{\sum x_i + \alpha - 1}\, e^{-(n+\nu)\lambda}\,d\lambda}$$

Now, consider this an important trick that is used time and again in Bayesian calculations: the denominator is a constant that makes the expression integrate to 1.
We can deduce from the form of the numerator that the posterior must be a gamma density with parameters

$$\alpha_{\text{post}} = \sum x_i + \alpha = 582, \qquad \nu_{\text{post}} = n + \nu = 23.6$$

This standard trick allows the statistician to avoid having to do any explicit integration. (Make sure you understand it, because it will occur again several times.) The posterior density is shown in Figure 8.9. Compare it to the prior distribution to observe how observation of the data, $X$, has drastically changed his state of knowledge about $\Lambda$. Notice that the posterior density is much more symmetric and looks like a normal density (that this is no accident will be shown later).

According to the Bayesian paradigm, all the information about $\Lambda$ is contained in the posterior distribution. The mean of this distribution (the posterior mean) is

$$\mu_{\text{post}} = \frac{\alpha_{\text{post}}}{\nu_{\text{post}}} = 24.7$$

The most probable value of $\Lambda$, the posterior mode, is 24.6. (Verify that the gamma density is maximized at $(\alpha - 1)/\nu$.) Either of these two values could be used as a point estimate of the unknown mean of the Poisson distribution, if a single number is required. The variance of the posterior distribution is

$$\sigma^2_{\text{post}} = \frac{\alpha_{\text{post}}}{\nu_{\text{post}}^2} = 1.04$$

and the posterior standard deviation is $\sigma_{\text{post}} = 1.02$, which is a simple measure of variability: the posterior distribution of $\Lambda$ has mean 24.7 and standard deviation 1.02. A Bayesian analogue of a 90% confidence interval is the interval from the 5th percentile to the 95th percentile of the posterior, which can be found numerically to be [23.02, 26.34]. A common alternative to this interval is a high posterior density (HPD) interval, formed as follows: Imagine placing a horizontal line at the high point of the posterior density and moving it downward until the interval of $\lambda$ formed below where the line cuts the density contains 90% probability. If the posterior density is symmetric and unimodal, as is nearly the case in Figure 8.9, the HPD interval will coincide with the interval between the percentiles.

The second statistician takes a more utilitarian, noncommittal approach. She believes that it is implausible that the mean count $\lambda$ could be larger than 100, and uses a simple prior that is uniform on [0, 100], without trying to quantify her opinion more precisely. The posterior density is thus

$$f_{\Lambda\mid X}(\lambda\mid x) = \frac{\frac{1}{100}\lambda^{\sum x_i}\, e^{-n\lambda}}{\frac{1}{100}\int_0^{100} \lambda^{\sum x_i}\, e^{-n\lambda}\,d\lambda}, \qquad 0 \le \lambda \le 100$$

The denominator has to be integrated numerically, but this is easy to do for such a smooth function. The resulting posterior density is shown in Figure 8.9. Using numerical evaluations, she finds that the posterior mode is 24.9, the posterior mean is 25.0, and the posterior standard deviation is 1.04. The interval from the 5th to the 95th percentile is [23.3, 26.7].

We now compare these two results to each other and to the results of maximum likelihood analysis.

Estimate              Bayes 1    Bayes 2    Maximum Likelihood
mode                  24.6       24.9       24.9
mean                  24.7       25.0       -
standard deviation    1.02       1.04       1.04
upper limit           26.3       26.7       26.6
lower limit           23.0       23.3       23.2

Comparing the results of the second Bayesian to those of maximum likelihood, it is important to realize that her posterior density is directly proportional to the likelihood for $0 \le \lambda \le 100$, because the prior is flat over this range and the posterior is proportional to the prior times the likelihood. Thus, her posterior mode and the maximum likelihood estimate are identical. There is no such guarantee that her posterior standard deviation and the approximate standard error of the maximum likelihood estimate are identical, but they turn out to be, to the number of significant figures displayed in the table. The two 90% intervals are very close.
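The first statistician's posterior summaries can be verified directly from the Gamma(582, 23.6) density. A sketch (Python with SciPy; note that SciPy's gamma takes a scale, the reciprocal of the rate $\nu_{\text{post}}$):

```python
from scipy import stats

post = stats.gamma(a=582, scale=1 / 23.6)   # Gamma(alpha_post, nu_post)
print(post.mean())              # 24.7  (posterior mean, alpha_post / nu_post)
print((582 - 1) / 23.6)         # 24.6  (posterior mode, (alpha_post - 1) / nu_post)
print(post.std())               # 1.02  (posterior standard deviation)
print(post.ppf([0.05, 0.95]))   # approximately [23.02, 26.34]
```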
Now compare the results of the first and second Bayesians. Observe that although his prior opinion was not in accord with the data, the data strongly modified the prior, to produce a posterior that is close to hers. Even though they start with quite different assumptions, the data force them to very similar conclusions. His prior opinion has indeed influenced the results: his posterior mean and mode are less than hers, but the influence has been mild. (If there had been less data or if his prior opinions had been much more biased toward low values, the results would have been in greater conflict.) The fundamental result that the posterior is proportional to the prior times the likelihood helps us to understand the difference: the likelihood is substantial only in the region approximately between $\lambda = 22$ and $\lambda = 28$. (This can be seen in the figure, because the second statistician's posterior is proportional to the likelihood. See Figure 8.5, also.) In this region, his prior decreases slowly, so the posterior is proportional to a weighted version of the likelihood, with slowly decreasing weight. The first Bayesian's posterior thus differs from the second by being pushed up slightly on the left and pulled down on the right.

Although they are very similar numerically, there is an important difference between the Bayesian and frequentist interpretations of the confidence intervals. In the Bayesian framework, $\Lambda$ is a random variable and it makes perfect sense to say, "Given the observations, the probability that $\Lambda$ is in the interval [23.3, 26.7] is 0.90." Under the frequentist framework, such a statement makes no sense, because $\lambda$ is a constant, albeit unknown, and it either lies in the interval [23.3, 26.7] or doesn't; no probability is involved. Before the data are observed, the interval is random, and it makes sense to state that the probability that the interval contains the true parameter value is 0.90, but after the data are observed, nothing is random anymore. One way to understand the difference of interpretation is to realize that in the Bayesian analysis the interval refers to the state of knowledge about $\lambda$ and not to $\lambda$ itself.

Finally, we note that an alternative for the second statistician would have been to use a gamma prior because of its analytical convenience, but to make the prior very flat. This can be accomplished by setting $\alpha$ and $\nu$ to be very small. ■

EXAMPLE B Normal Distribution
It is convenient to reparametrize the normal distribution, replacing $\sigma^2$ by $\xi = 1/\sigma^2$; $\xi$ is called the precision. We will also use $\theta$ in place of $\mu$. The density is then

$$f(x\mid\theta, \xi) = \left(\frac{\xi}{2\pi}\right)^{1/2} \exp\left[-\frac{1}{2}\xi(x - \theta)^2\right]$$

The normal distribution has two parameters, and we will consider cases of Bayesian analysis depending on which of them are known and unknown.

Case of Unknown Mean and Known Variance
We first consider the case in which the precision is known, $\xi = \xi_0$, and the mean, $\theta$, is unknown. In the Bayesian treatment, the mean is a random variable, $\Theta$. It is mathematically convenient to use a prior distribution for $\Theta$ which is $N(\theta_0, \xi_{\text{prior}}^{-1})$. This prior is very flat, or uninformative, when $\xi_{\text{prior}}$ is very small, i.e., when the prior variance is very large.
Thus, if $X = (X_1, X_2, \ldots, X_n)$ are independent given $\theta$,

$$f_{\Theta\mid X}(\theta\mid x) \propto f_{X\mid\Theta}(x\mid\theta) \times f_\Theta(\theta)$$
$$= \left(\frac{\xi_0}{2\pi}\right)^{n/2} \prod_{i=1}^n \exp\left[-\frac{\xi_0}{2}(x_i - \theta)^2\right] \times \left(\frac{\xi_{\text{prior}}}{2\pi}\right)^{1/2} \exp\left[-\frac{\xi_{\text{prior}}}{2}(\theta - \theta_0)^2\right]$$
$$\propto \exp\left\{-\frac{1}{2}\left[\xi_0\sum_{i=1}^n (x_i - \theta)^2 + \xi_{\text{prior}}(\theta - \theta_0)^2\right]\right\}$$

Here we have exhibited only the terms in the posterior density that depend upon $\theta$; the last expression above shows the shape of the posterior density as a function of $\theta$. The posterior density itself is proportional to this expression, with a proportionality constant that is determined by the requirement that the posterior density integrates to 1.

We will now manipulate the expression to cast it in a form in which we can recognize that the posterior density is normal. Expressing $\sum(x_i - \theta)^2 = \sum(x_i - \bar x)^2 + n(\theta - \bar x)^2$, and absorbing more terms that do not depend on $\theta$ into the constant of proportionality (a typical move in Bayesian calculations), we find

$$f_{\Theta\mid X}(\theta\mid x) \propto \exp\left\{-\frac{1}{2}[n\xi_0(\theta - \bar x)^2 + \xi_{\text{prior}}(\theta - \theta_0)^2]\right\}$$

Now, observe that this is of the form $\exp(-\frac{1}{2}Q(\theta))$, where $Q(\theta)$ is a quadratic polynomial. We can find expressions $\xi_{\text{post}}$ and $\theta_{\text{post}}$, and write

$$Q(\theta) = \xi_{\text{post}}(\theta - \theta_{\text{post}})^2 + \text{terms that do not depend on } \theta$$

and conclude that the posterior density is normal with posterior mean $\theta_{\text{post}}$ and posterior precision $\xi_{\text{post}}$. Again, terms that do not depend on $\theta$ do not affect the shape of the posterior density and are absorbed in the normalization constant that makes the posterior density integrate to 1. Thus we expand $Q(\theta)$ and identify the coefficient of $\theta^2$ as the posterior precision and the coefficient of $-\theta$ as twice the posterior mean times the posterior precision. Doing so, we find

$$\xi_{\text{post}} = n\xi_0 + \xi_{\text{prior}}$$

$$\theta_{\text{post}} = \frac{n\xi_0\bar x + \theta_0\xi_{\text{prior}}}{n\xi_0 + \xi_{\text{prior}}} = \bar x\,\frac{n\xi_0}{n\xi_0 + \xi_{\text{prior}}} + \theta_0\,\frac{\xi_{\text{prior}}}{n\xi_0 + \xi_{\text{prior}}}$$

The posterior density of $\theta$ is thus normal with this mean and precision. Note that the precision has increased and that the posterior mean is a weighted combination of the sample mean and the prior mean. To interpret these results, consider what happens when $\xi_{\text{prior}} \ll n\xi_0$, which would be the case if $n$ were sufficiently large or if $\xi_{\text{prior}}$ were small (as for a very flat prior). Then the posterior mean would be

$$\theta_{\text{post}} \approx \bar x$$

which is the maximum likelihood estimate, and

$$\xi_{\text{post}} \approx n\xi_0$$

This last equation can be written as $\sigma^2_{\text{post}} = \sigma_0^2/n$, which is just the variance of $\bar X$ in the non-Bayesian setting. In summary, if a flat prior with very small $\xi_{\text{prior}}$ is used, the posterior density of $\theta$ is very close to normal with mean $\bar x$ and variance $\sigma_0^2/n$. ■

Case of Known Mean and Unknown Variance
In this case, the precision is unknown and is treated as a random variable $\Xi$, with prior distribution $f_\Xi(\xi)$. Given $\xi$, the $X_i$ are independent $N(\theta_0, \xi^{-1})$. Let $X = (X_1, X_2, \ldots, X_n)$. Then

$$f_{\Xi\mid X}(\xi\mid x) \propto f_{X\mid\Xi}(x\mid\xi)\, f_\Xi(\xi) \propto \xi^{n/2}\exp\left[-\frac{1}{2}\xi\sum(x_i - \theta_0)^2\right] f_\Xi(\xi)$$

Observing how the density depends on $\xi$, we realize that it is analytically convenient to specify the prior to be a gamma density: $\Xi \sim \Gamma(\alpha, \lambda)$. Then

$$f_{\Xi\mid X}(\xi\mid x) \propto \xi^{n/2}\exp\left[-\frac{1}{2}\xi\sum(x_i - \theta_0)^2\right]\xi^{\alpha-1} e^{-\lambda\xi}$$

which is a gamma density with parameters

$$\alpha_{\text{post}} = \alpha + \frac{n}{2}, \qquad \lambda_{\text{post}} = \lambda + \frac{1}{2}\sum(x_i - \theta_0)^2$$

In the case of a flat prior (small $\alpha$ and $\lambda$), the reciprocals of the posterior mean and the posterior mode of $\Xi$ give the corresponding estimates of $\sigma^2$:

$$\frac{1}{E(\Xi\mid x)} \approx \frac{1}{n}\sum(x_i - \theta_0)^2, \qquad \frac{1}{\text{mode}} \approx \frac{1}{n-2}\sum(x_i - \theta_0)^2$$

The former is the maximum likelihood estimate of $\sigma^2$. In the limit $\lambda \to 0$, $\alpha \to 0$,

$$f_{\Xi\mid X}(\xi\mid x) \propto \xi^{n/2-1}\exp\left[-\frac{1}{2}\xi\sum(x_i - \theta_0)^2\right] \qquad \blacksquare$$
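Both conjugate updates amount to a few arithmetic operations. A sketch (Python with NumPy; the function names are ours):

```python
import numpy as np

def update_mean_known_precision(x, xi0, theta0, xi_prior):
    """Posterior (mean, precision) for a normal mean with known precision xi0
    and a N(theta0, 1/xi_prior) prior, per the formulas above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xi_post = n * xi0 + xi_prior
    theta_post = (n * xi0 * x.mean() + xi_prior * theta0) / xi_post
    return theta_post, xi_post

def update_precision_known_mean(x, theta0, alpha, lam):
    """Posterior Gamma(alpha_post, lambda_post) for the precision when the
    mean theta0 is known and the prior is Gamma(alpha, lambda)."""
    x = np.asarray(x, dtype=float)
    return alpha + len(x) / 2, lam + 0.5 * np.sum((x - theta0) ** 2)
```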
Case of Unknown Mean and Unknown Variance
In this case, there are two unknown parameters, and a Bayesian approach requires the specification of a joint two-dimensional prior distribution. We follow a path of mathematical convenience and take the priors to be independent: $\Theta \sim N(\theta_0, \xi_{\text{prior}}^{-1})$ and $\Xi \sim \Gamma(\alpha, \lambda)$. We then have

$$f_{\Theta,\Xi\mid X}(\theta, \xi\mid x) \propto f_{X\mid\Theta,\Xi}(x\mid\theta, \xi)\, f_\Theta(\theta)\, f_\Xi(\xi)$$
$$\propto \xi^{n/2}\exp\left[-\frac{\xi}{2}\sum(x_i - \theta)^2\right] \times \exp\left[-\frac{\xi_{\text{prior}}}{2}(\theta - \theta_0)^2\right]\xi^{\alpha-1}\exp(-\lambda\xi)$$

From the manner in which $\theta$ and $\xi$ occur in the first exponential, it appears that the two variables are not independent in the posterior even though they were in the prior. To evaluate this joint posterior density, we would have to find the constant of proportionality that makes it integrate to 1, the normalization constant. Two-dimensional numerical integration could be used. Often the primary interest is in the mean, $\theta$, and one useful aspect of Bayesian analysis is that information about $\theta$ can be "marginalized" by integrating out $\xi$:

$$f_{\Theta\mid X}(\theta\mid x) = \int_0^\infty f_{\Theta,\Xi\mid X}(\theta, \xi\mid x)\,d\xi$$

Examining the preceding expression for $f_{\Theta,\Xi\mid X}(\theta, \xi\mid x)$ as a function of $\xi$, we see that it is of the form of a gamma density, with parameters $\tilde\alpha = \alpha + n/2$ and $\tilde\lambda = \lambda + \frac{1}{2}\sum(x_i - \theta)^2$, so we can evaluate the integral. We thus find

$$f_{\Theta\mid X}(\theta\mid x) \propto \exp\left[-\frac{\xi_{\text{prior}}}{2}(\theta - \theta_0)^2\right]\frac{\Gamma(\alpha + n/2)}{\left[\lambda + \frac{1}{2}\sum(x_i - \theta)^2\right]^{\alpha + n/2}}$$

This is not a density that we recognize, but it could be evaluated numerically. Doing so would again entail finding the normalizing constant, which could be done by numerical integration. Some simplifications occur when $n$ is large or when the prior is quite flat ($\alpha$, $\lambda$, $\xi_{\text{prior}}$ are small). Then

$$f_{\Theta\mid X}(\theta\mid x) \propto \left[\sum(x_i - \theta)^2\right]^{-n/2}$$

This posterior is maximized when $\sum(x_i - \theta)^2$ is minimized, which occurs at $\theta = \bar x$. We can relate this to the result we found for maximum likelihood analysis by expressing

$$\sum(x_i - \theta)^2 = \sum(x_i - \bar x)^2 + n(\theta - \bar x)^2 = (n-1)s^2 + n(\theta - \bar x)^2 = (n-1)s^2\left[1 + \frac{n(\theta - \bar x)^2}{(n-1)s^2}\right]$$

Substituting this above and absorbing terms that do not depend on $\theta$ into the proportionality constant, we find

$$f_{\Theta\mid X}(\theta\mid x) \propto \left[1 + \frac{1}{n-1}\,\frac{n(\theta - \bar x)^2}{s^2}\right]^{-n/2}$$

Now comparing this to the definition of the $t$ distribution (Section 6.2), we see that

$$\frac{\sqrt n(\Theta - \bar x)}{s} \sim t_{n-1}$$

corresponding to the result from maximum likelihood analysis. The interval $\bar x \pm t_{n-1}(\alpha/2)s/\sqrt n$ was earlier derived as a $100(1-\alpha)\%$ confidence interval centered about the maximum likelihood estimate, and here it has reappeared in the Bayesian analysis as an interval with posterior probability $1 - \alpha$. There are differences of interpretation, however, just as there were for the earlier Poisson case. The Bayesian interval is a probability statement referring to the state of knowledge about $\theta$ given the observed data, regarding $\theta$ as a random variable. The frequentist confidence interval is based on a probability statement about the possible values of the observations, regarding $\theta$ as a constant, albeit unknown. ■
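Under the flat-prior approximation, the posterior interval for $\theta$ thus coincides numerically with the frequentist $t$ interval, so it can be computed the same way. A sketch (Python with SciPy; the function name is ours):

```python
import numpy as np
from scipy import stats

def posterior_interval_flat_prior(x, level=0.95):
    """Interval with posterior probability `level` for theta under the
    flat-prior result sqrt(n)(Theta - xbar)/s ~ t_{n-1}."""
    x = np.asarray(x, dtype=float)
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)
    t = stats.t.ppf(0.5 + level / 2, df=n - 1)
    return xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n)
```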
EXAMPLE C Hardy-Weinberg Equilibrium
We now turn to a Bayesian treatment of Example A in Section 8.5.1. We use the multinomial likelihood function and a prior for $\theta$ which is uniform on [0, 1]. The posterior density is thus proportional to the likelihood, and is shown in Figure 8.10. Note that it looks very much like a normal density, a phenomenon that will be explored in a later section. Since $f_{X\mid\Theta}(x\mid\theta)$ is a polynomial in $\theta$ (of high degree), the normalization constant can in principle be computed analytically. (Alternatively, all the computations can be done numerically.)

FIGURE 8.10 Posterior distribution of Θ.

Because the prior is flat, the posterior is directly proportional to the likelihood and the maximum of the posterior density is the maximum likelihood estimate, $\hat\theta = 0.4247$. The 0.025 percentile of the density is 0.404, and the 0.975 percentile is 0.446. These results agree with the approximate confidence interval found for the maximum likelihood estimate in Example C in Section 8.5.3. ■

8.6.1 Further Remarks on Priors
In the previous section, we saw that if the prior for a Poisson parameter is chosen to be a gamma density, then the posterior is also a gamma density. Similarly, when the prior for a normal mean with known variance is chosen to be normal, then the posterior is normal as well. Earlier, in Example E in Section 3.5.2, a beta prior was used for a binomial parameter, and the posterior turned out to be beta as well. These are examples of conjugate priors: if the prior distribution belongs to a family $G$ and, conditional on the parameters of $G$, the data have a distribution $H$, then $G$ is said to be conjugate to $H$ if the posterior is in the family $G$. Other conjugate priors will be the subject of problems at the end of the chapter. Conjugate priors are used for mathematical convenience (required integrations can be done in closed form) and because they can assume a variety of shapes as the parameters of the prior are varied.

In scientific applications, it is usually desirable to use a flat, or "uninformative," prior so that the data can speak for themselves. Even if a scientific investigator actually had a strong prior opinion, he or she might want to present an "objective" analysis. This is accomplished by using a flat prior so that the conclusions, as summarized in the posterior density, are those of one who is initially unopinionated or unprejudiced. If an informative prior were used, it would have to be justified to the larger scientific community. The objective prior thus has a hypothetical, or "what if," status: if one was initially indifferent to parameter values in the range in which the likelihood is large, then one's opinion after observing the data would be expressed as a posterior proportional to the likelihood.

Attempts have been made to formalize more precisely what the notion of an uninformative prior means. One problem that is addressed is caused by reparametrization. For example, suppose that the prior density of the precision $\xi$ is taken to be uniform on an interval $[a, b]$, which might seem to be a reasonable way to quantify the notion of being uninformative. However, if the variance $\sigma^2 = 1/\xi$, rather than the precision, was used, the prior density of $\sigma^2$ would not be uniform on $[b^{-1}, a^{-1}]$. We will not delve further into these issues here, except to note that the parametrization $\theta$ or $g(\theta)$ would make a difference only if the difference in the shapes of the priors was substantial in the region in which the likelihood was large. We saw in the Poisson example that if $\alpha$ and $\nu$ are very small, the gamma prior is quite flat and the posterior is proportional to the likelihood function.
8.6.1 Further Remarks on Priors

In the previous section, we saw that if the prior for a Poisson parameter is chosen to be a gamma density, then the posterior is also a gamma density. Similarly, when the prior for a normal mean with known variance is chosen to be normal, the posterior is normal as well. Earlier, in Example E in Section 3.5.2, a beta prior was used for a binomial parameter, and the posterior turned out to be beta as well. These are examples of conjugate priors: if the prior distribution belongs to a family G and, conditional on the parameters of G, the data have a distribution H, then G is said to be conjugate to H if the posterior is in the family G. Other conjugate priors will be the subject of problems at the end of the chapter. Conjugate priors are used for mathematical convenience (required integrations can be done in closed form) and because they can assume a variety of shapes as the parameters of the prior are varied.

In scientific applications, it is usually desirable to use a flat, or "uninformative," prior so that the data can speak for themselves. Even if a scientific investigator actually had a strong prior opinion, he or she might want to present an "objective" analysis. This is accomplished by using a flat prior, so that the conclusions, as summarized in the posterior density, are those of one who is initially unopinionated or unprejudiced. If an informative prior were used, it would have to be justified to the larger scientific community. The objective prior thus has a hypothetical, or "what if," status: if one were initially indifferent to parameter values in the range in which the likelihood is large, then one's opinion after observing the data would be expressed as a posterior proportional to the likelihood.

Attempts have been made to formalize more precisely what the notion of an uninformative prior means. One problem that is addressed is caused by reparametrization. For example, suppose that the prior density of the precision ξ is taken to be uniform on an interval [a, b], which might seem to be a reasonable way to quantify the notion of being uninformative. However, if the variance $\sigma^2 = 1/\xi$, rather than the precision, were used, the prior density of $\sigma^2$ would not be uniform on $[b^{-1}, a^{-1}]$. We will not delve further into these issues here, except to note that the parametrization θ or g(θ) would make a difference only if the difference in the shapes of the priors was substantial in the region in which the likelihood is large.

We saw in the Poisson example that if α and ν are very small, the gamma prior is quite flat and the posterior is proportional to the likelihood function. Formally, if α and ν are set equal to zero, the prior is

$$f_\Lambda(\lambda) = \lambda^{-1}, \qquad 0 \le \lambda < \infty$$

But this function does not integrate to 1; it is not a probability density. A similar phenomenon occurs in the normal case with unknown mean and known precision if the prior precision is set equal to 0. The prior is then

$$f_\Theta(\theta) \propto 1, \qquad -\infty < \theta < \infty$$

which is not a probability density either. Such priors are called improper priors (priors that lack propriety). In general, if an improper prior is formally used, the posterior may not be a density either, because the denominator of the expression for the posterior density,

$$\int f_{X|\Theta}(x|\theta)\,f_\Theta(\theta)\,d\theta$$

may not converge. (Note that the integration is with respect to θ, not x.) This has not been the case in our examples. For the Poisson example, if $f_\Lambda(\lambda) \propto \lambda^{-1}$, then the denominator is

$$\int_0^\infty \lambda^{\sum x_i - 1} e^{-n\lambda}\,d\lambda < \infty$$

In the normal case, too, the integral is defined, and thus there is a well-defined posterior density.

Let us revisit some examples using the device of an improper prior. In the Poisson example, using the improper prior $f_\Lambda(\lambda) = \lambda^{-1}$ results in a (proper) posterior

$$f_{\Lambda|X}(\lambda|x) \propto \lambda^{\sum x_i - 1} e^{-n\lambda}$$

which can be recognized as a gamma density. In the normal example with unknown mean and variance, we can take θ and ξ to be independent with improper priors $f_\Theta(\theta) = 1$ and $f_\Xi(\xi) = \xi^{-1}$. The joint posterior of θ and ξ is then

$$f_{\Theta,\Xi|X}(\theta,\xi|x) \propto \xi^{n/2-1}\exp\left[-\frac{\xi}{2}\sum(x_i-\theta)^2\right]$$

Expressing $\sum_{i=1}^n(x_i-\theta)^2 = (n-1)s^2 + n(\theta-\bar{x})^2$, we have

$$f_{\Theta,\Xi|X}(\theta,\xi|x) \propto \xi^{n/2-1}\exp\left[-\frac{\xi}{2}(n-1)s^2\right]\exp\left[-\frac{n\xi}{2}(\theta-\bar{x})^2\right]$$

For fixed ξ, this expression is proportional to the conditional density of θ given ξ. (Why?) From the form of the dependence on θ, we see that conditional on ξ, θ is normal with mean $\bar{x}$ and precision nξ. By integrating out ξ, we can find the marginal distribution of θ and relate it to the t distribution, as was done earlier.

Since improper priors are not actually probability densities, they are difficult to interpret literally. However, the resulting posteriors can be viewed as approximations to those that would have arisen from proper priors with extreme values of their parameters. The priors corresponding to such extreme values are very flat, so the posterior is dominated by the likelihood. It is only in the range in which the likelihood is large that the prior makes any practical difference: truncating the improper prior well outside this range to produce a proper prior will not appreciably change the posterior.
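The last point, that a truncated version of an improper prior yields essentially the same posterior, can be verified directly. The following sketch (Python; the simulated Poisson counts are an assumption for illustration) compares the exact gamma posterior under the improper prior $\lambda^{-1}$ with the numerically normalized posterior obtained after truncating the prior's support.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=25)         # simulated counts (assumption)
n, s = len(x), x.sum()

# Improper prior 1/lambda gives a proper Gamma(shape = s, rate = n) posterior:
posterior = gamma(a=s, scale=1/n)
print("posterior mean:", posterior.mean(), " mle:", s/n)

# Truncating the prior to a wide finite interval barely changes anything:
lam = np.linspace(0.01, 12, 2000)          # truncated support
unnorm = lam**(s - 1) * np.exp(-n*lam)     # same kernel, renormalized below
unnorm /= np.trapz(unnorm, lam)
print("max density difference:", np.max(np.abs(unnorm - posterior.pdf(lam))))
```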
8.6.2 Large Sample Normal Approximation to the Posterior

We have seen in several examples that the posterior distribution is nearly normal, with mean equal to the maximum likelihood estimate and posterior standard deviation close to the asymptotic standard deviation of the maximum likelihood estimate. The two methods thus often give quite comparable results. We will not give a formal proof here, but rather will sketch an argument that the posterior distribution is approximately normal with mean equal to the maximum likelihood estimate, $\hat\theta$, and variance approximately equal to $-[l''(\hat\theta)]^{-1}$. Denoting the observations generically by x, the posterior distribution is

$$f_{\Theta|X}(\theta|x) \propto f_\Theta(\theta)\,f_{X|\Theta}(x|\theta) = \exp[\log f_\Theta(\theta)]\exp[\log f_{X|\Theta}(x|\theta)] = \exp[\log f_\Theta(\theta)]\exp[l(\theta)]$$

Now, if the sample is large, the posterior is dominated by the likelihood, and in the region where the likelihood is large, the prior is nearly constant. Thus, expanding $l(\theta)$ in a second-order Taylor series about $\hat\theta$, to an approximation,

$$f_{\Theta|X}(\theta|x) \propto \exp\left[l(\hat\theta) + (\theta-\hat\theta)l'(\hat\theta) + \tfrac{1}{2}(\theta-\hat\theta)^2 l''(\hat\theta)\right] \propto \exp\left[\tfrac{1}{2}(\theta-\hat\theta)^2 l''(\hat\theta)\right]$$

In the last step, we used the fact that $l'(\hat\theta) = 0$, since $\hat\theta$ is the maximum likelihood estimate. The term $l(\hat\theta)$ was absorbed into a proportionality constant, since we are evaluating the posterior as a function of θ. Finally, observe that the last expression is proportional to a normal density with mean $\hat\theta$ and variance $-[l''(\hat\theta)]^{-1}$.

8.6.3 Computational Aspects

Contemporary computational resources have had an enormous impact on Bayesian inference. As we have seen in several examples, the computationally difficult part of Bayesian inference is the calculation of the normalizing constant that makes the posterior density integrate to 1. Traditionally, such calculations were performed analytically, often using conjugate priors so that the integrations could be done explicitly. The numerical integration of a well-behaved function of a small number of variables is now trivial. Difficulties do arise in high-dimensional problems, however, and the integrations are often done by sophisticated Monte Carlo methods. We will not go into these sorts of methods in this book, but will hint at their nature in the following example of a method called Gibbs sampling.

Consider, as a simple example, inference for a normal distribution with unknown mean and variance. From Example B in Section 8.6,

$$f_{\Theta,\Xi|X}(\theta,\xi|x) \propto \xi^{n/2}\exp\left[-\frac{\xi}{2}\sum(x_i-\theta)^2\right] \times \exp\left[-\frac{\xi_{\mathrm{prior}}}{2}(\theta-\theta_0)^2\right]\xi^{\alpha-1}\exp(-\lambda\xi)$$

For simplicity, suppose that an improper prior is used: $\xi_{\mathrm{prior}} \to 0$, $\alpha \to 0$, $\lambda \to 0$. Then

$$f_{\Theta,\Xi|X}(\theta,\xi|x) \propto \xi^{n/2-1}\exp\left[-\frac{\xi}{2}\sum(x_i-\theta)^2\right] = \xi^{n/2-1}\exp\left[-\frac{\xi}{2}(n-1)s^2\right]\exp\left[-\frac{n\xi}{2}(\theta-\bar{x})^2\right]$$

where we expressed $\sum(x_i-\theta)^2 = (n-1)s^2 + n(\theta-\bar{x})^2$. (Note that the factor $\exp[-\xi(n-1)s^2/2]$ involves ξ and so cannot be absorbed into the constant of proportionality.)

To study the posterior distribution of ξ and θ by Monte Carlo, we would draw many pairs $(\xi_k, \theta_k)$ from this joint density; the problem is how to actually do this. Gibbs sampling accomplishes it in the following way. Observe that the expression for $f_{\Theta,\Xi|X}(\theta,\xi|x)$ shows that for given ξ, θ is normally distributed with mean $\bar{x}$ and precision nξ. (Fix ξ in the expression and recognize a normal density in θ.) Also, if θ is fixed, the density of ξ is a gamma density, with shape parameter n/2 and rate parameter $\frac{1}{2}\sum(x_i-\theta)^2$. Gibbs sampling alternates back and forth between these two conditional distributions:

1. Choose an initial value $\theta_0$; $\bar{x}$ would be a natural choice.
2. Generate $\xi_0$ from the gamma conditional density determined by $\theta_0$.
3. Generate $\theta_1$ from the normal conditional distribution determined by $\xi_0$.
4. Generate $\xi_1$ from the gamma conditional density determined by $\theta_1$.
5. Continue in this fashion.

The analysis of the algorithm and why it works is beyond the scope of this book. A "burn-in" period is required, so we might run this scheme for a few hundred steps before beginning to record pairs $(\xi_k, \theta_k)$, $k = 1, \ldots, N$, which would be regarded as simulated pairs from the posterior. A further complication is that these pairs are not independent of one another. Nonetheless, a histogram of the collection of $\theta_k$ could be used as an estimate of the marginal posterior distribution of Θ, and the posterior mean of Θ can be estimated as

$$E(\Theta|X) \approx \frac{1}{N}\sum_{k=1}^N \theta_k$$
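The five-step scheme above is short enough to write out in full. The following Python sketch (the data are simulated; burn-in length, number of draws, and seed are arbitrary assumptions) alternates between the two conditional distributions just derived: $\theta\,|\,\xi \sim N(\bar{x}, (n\xi)^{-1})$ and $\xi\,|\,\theta \sim \Gamma(n/2,\ \tfrac{1}{2}\sum(x_i-\theta)^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=30)  # simulated data (assumption)
n, xbar = len(x), x.mean()

burn_in, N = 500, 5000
theta, draws = xbar, []
for k in range(burn_in + N):
    # xi | theta ~ Gamma(shape = n/2, rate = 0.5 * sum((x - theta)^2))
    rate = 0.5 * np.sum((x - theta)**2)
    xi = rng.gamma(shape=n/2, scale=1/rate)
    # theta | xi ~ Normal(xbar, 1/(n*xi))
    theta = rng.normal(xbar, np.sqrt(1/(n*xi)))
    if k >= burn_in:                           # record only after burn-in
        draws.append((theta, xi))

thetas = np.array([d[0] for d in draws])
print("estimated posterior mean of theta:", thetas.mean(), " xbar:", xbar)
```

A histogram of `thetas` estimates the marginal posterior of Θ, as described in the text.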
8.7 Efficiency and the Cramér-Rao Lower Bound

In most statistical estimation problems, there are a variety of possible parameter estimates. For example, in Chapter 7 we considered both the sample mean and a ratio estimate, and in this chapter we have considered the method of moments and the method of maximum likelihood. Given a variety of possible estimates, how would we choose which one to use? Qualitatively, it would be sensible to choose the estimate whose sampling distribution is most highly concentrated about the true parameter value. To make this aim operational, we need a quantitative measure of such concentration. Mean squared error is the most commonly used measure, largely because of its analytic simplicity. The mean squared error of $\hat\theta$ as an estimate of $\theta_0$ is

$$MSE(\hat\theta) = E(\hat\theta-\theta_0)^2 = \mathrm{Var}(\hat\theta) + [E(\hat\theta)-\theta_0]^2$$

(See Theorem A of Section 4.2.1.) If the estimate $\hat\theta$ is unbiased, that is, $E(\hat\theta) = \theta_0$, then $MSE(\hat\theta) = \mathrm{Var}(\hat\theta)$. When the estimates under consideration are unbiased, comparison of their mean squared errors reduces to comparison of their variances or, equivalently, their standard errors.

Given two estimates, $\hat\theta$ and $\tilde\theta$, of a parameter θ, the efficiency of $\hat\theta$ relative to $\tilde\theta$ is defined to be

$$\mathrm{eff}(\hat\theta, \tilde\theta) = \frac{\mathrm{Var}(\tilde\theta)}{\mathrm{Var}(\hat\theta)}$$

Thus, if the efficiency is smaller than 1, $\hat\theta$ has a larger variance than $\tilde\theta$ has. This comparison is most meaningful when both $\hat\theta$ and $\tilde\theta$ are unbiased or when both have the same bias. Frequently, the variances of $\hat\theta$ and $\tilde\theta$ are of the form

$$\mathrm{Var}(\hat\theta) = \frac{c_1}{n}, \qquad \mathrm{Var}(\tilde\theta) = \frac{c_2}{n}$$

where n is the sample size. If this is the case, the efficiency can be interpreted as the ratio of the sample sizes necessary to obtain the same variance for both estimates. (In Chapter 7, we compared the efficiencies of estimates of a population mean from a simple random sample, a stratified random sample with proportional allocation, and a stratified random sample with optimal allocation.)

EXAMPLE A Muon Decay
Two estimates have been derived for α in the problem of muon decay. The method of moments estimate is $\tilde\alpha = 3\bar{X}$, and the maximum likelihood estimate is the solution of the nonlinear equation

$$\sum_{i=1}^n \frac{X_i}{1+\hat\alpha X_i} = 0$$

We need to find the variances of these two estimates. Since the variance of a sample mean is $\sigma^2/n$, we compute $\sigma^2$:

$$\sigma^2 = E(X^2) - [E(X)]^2 = \int_{-1}^1 x^2\,\frac{1+\alpha x}{2}\,dx - \frac{\alpha^2}{9} = \frac{1}{3} - \frac{\alpha^2}{9}$$

Thus, the variance of the method of moments estimate is

$$\mathrm{Var}(\tilde\alpha) = 9\,\mathrm{Var}(\bar{X}) = \frac{3-\alpha^2}{n}$$

The exact variance of the mle, $\hat\alpha$, cannot be computed in closed form, so we approximate it by the asymptotic variance,

$$\mathrm{Var}(\hat\alpha) \approx \frac{1}{nI(\alpha)}$$

and then compare this asymptotic variance to the variance of $\tilde\alpha$. The ratio of the former to the latter is called the asymptotic relative efficiency. By definition,

$$I(\alpha) = E\left[\frac{\partial}{\partial\alpha}\log f(X|\alpha)\right]^2 = \int_{-1}^1 \left(\frac{x}{1+\alpha x}\right)^2\frac{1+\alpha x}{2}\,dx = \begin{cases}\dfrac{\log\dfrac{1+\alpha}{1-\alpha} - 2\alpha}{2\alpha^3}, & -1 < \alpha < 1,\ \alpha \ne 0\\[2ex] \dfrac{1}{3}, & \alpha = 0\end{cases}$$

The asymptotic relative efficiency is thus (for $\alpha \ne 0$)

$$\frac{\mathrm{Var}(\hat\alpha)}{\mathrm{Var}(\tilde\alpha)} = \frac{2\alpha^3}{(3-\alpha^2)\left[\log\dfrac{1+\alpha}{1-\alpha} - 2\alpha\right]}$$

The following table gives this efficiency for various values of α between 0 and 1; symmetry would yield the values between −1 and 0.

α      Efficiency
0.0    1.0
.1     .997
.2     .989
.3     .975
.4     .953
.5     .931
.6     .878
.7     .817
.8     .727
.9     .582
.95    .464

As α tends to 1, the efficiency tends to 0. Thus, the mle is not much better than the method of moments estimate for α close to 0 but does increasingly better as α tends to 1. It must be kept in mind that we used the asymptotic variance of the mle, so we calculated an asymptotic relative efficiency, viewing this as an approximation to the actual relative efficiency. To gain more precise information for a given sample size, a simulation of the sampling distribution of the mle could be conducted. This might be especially interesting for α = 1, a case for which the formula for the asymptotic variance given above does not appear to make much sense. With a simulation study, the behavior of the bias as n and α vary could be analyzed (we showed that the mle is asymptotically unbiased, but there may be bias for a finite sample size), and the actual distribution could be compared to the approximating normal. ■
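The entries of the efficiency table follow directly from the closed-form expression for $I(\alpha)$, as this short Python sketch shows (the grid of α values mirrors the table; entries can be checked against it up to rounding):

```python
import numpy as np

def fisher_info(a):
    # I(alpha) for the muon-decay density f(x|alpha) = (1 + alpha*x)/2 on [-1, 1]
    if abs(a) < 1e-8:
        return 1/3
    return (np.log((1 + a)/(1 - a)) - 2*a) / (2*a**3)

for a in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]:
    var_mom = 3 - a**2           # n * Var(method of moments estimate)
    var_mle = 1/fisher_info(a)   # n * asymptotic variance of the mle
    print(f"alpha = {a:4.2f}   efficiency = {var_mle/var_mom:.3f}")
```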
In searching for an optimal estimate, we might ask whether there is a lower bound for the MSE of any estimate. If such a lower bound existed, it would function as a benchmark against which estimates could be compared: if an estimate achieved it, we would know that the estimate could not be improved upon. In the case in which the estimate is unbiased, the Cramér-Rao inequality provides such a lower bound. We now state and prove it.

THEOREM A Cramér-Rao Inequality
Let $X_1, \ldots, X_n$ be i.i.d. with density function $f(x|\theta)$, and let $T = t(X_1, \ldots, X_n)$ be an unbiased estimate of θ. Then, under smoothness assumptions on $f(x|\theta)$,

$$\mathrm{Var}(T) \ge \frac{1}{nI(\theta)}$$

Proof
Let

$$Z = \sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i|\theta) = \sum_{i=1}^n \frac{\partial f(X_i|\theta)/\partial\theta}{f(X_i|\theta)}$$

In Section 8.5.2, we showed that $E(Z) = 0$. Because the correlation coefficient of Z and T is less than or equal to 1 in absolute value,

$$\mathrm{Cov}^2(Z, T) \le \mathrm{Var}(Z)\,\mathrm{Var}(T)$$

It was also shown in Section 8.5.2 that

$$\mathrm{Var}\left[\frac{\partial}{\partial\theta}\log f(X|\theta)\right] = I(\theta)$$

and therefore $\mathrm{Var}(Z) = nI(\theta)$. The proof will be completed by showing that $\mathrm{Cov}(Z, T) = 1$, for then $1 \le nI(\theta)\,\mathrm{Var}(T)$. Since Z has mean 0,

$$\mathrm{Cov}(Z, T) = E(ZT) = \int\cdots\int t(x_1,\ldots,x_n)\left[\sum_{i=1}^n \frac{\partial f(x_i|\theta)/\partial\theta}{f(x_i|\theta)}\right]\prod_{j=1}^n f(x_j|\theta)\,dx_1\cdots dx_n$$

Noting that

$$\left[\sum_{i=1}^n \frac{\partial f(x_i|\theta)/\partial\theta}{f(x_i|\theta)}\right]\prod_{j=1}^n f(x_j|\theta) = \frac{\partial}{\partial\theta}\prod_{i=1}^n f(x_i|\theta)$$

we rewrite the expression for the covariance of Z and T as

$$\mathrm{Cov}(Z, T) = \int\cdots\int t(x_1,\ldots,x_n)\,\frac{\partial}{\partial\theta}\prod_{i=1}^n f(x_i|\theta)\,dx_1\cdots dx_n = \frac{\partial}{\partial\theta}\int\cdots\int t(x_1,\ldots,x_n)\prod_{i=1}^n f(x_i|\theta)\,dx_1\cdots dx_n = \frac{\partial}{\partial\theta}E(T) = \frac{\partial}{\partial\theta}\,\theta = 1$$

where the final steps use the unbiasedness of T. This proves the inequality. [Note the interchange of differentiation and integration, which must be justified by the smoothness assumptions on $f(x|\theta)$.] ■

Theorem A gives a lower bound on the variance of any unbiased estimate. An unbiased estimate whose variance achieves this lower bound is said to be efficient. Since the asymptotic variance of a maximum likelihood estimate is equal to the lower bound, maximum likelihood estimates are said to be asymptotically efficient. For a finite sample size, however, a maximum likelihood estimate may not be efficient, and maximum likelihood estimates are not the only asymptotically efficient estimates.

EXAMPLE B Poisson Distribution
In Example B in Section 8.5.3, we found that for the Poisson distribution, $I(\lambda) = 1/\lambda$. Therefore, by Theorem A, for any unbiased estimate T of λ based on a sample of independent Poisson random variables $X_1, \ldots, X_n$,

$$\mathrm{Var}(T) \ge \frac{\lambda}{n}$$

The mle of λ was found to be $\bar{X} = S/n$, where $S = X_1 + \cdots + X_n$. Since S follows a Poisson distribution with parameter nλ, $\mathrm{Var}(S) = n\lambda$ and $\mathrm{Var}(\bar{X}) = \lambda/n$. Therefore, $\bar{X}$ attains the Cramér-Rao lower bound, and we know that no unbiased estimate of λ can have a smaller variance. In this sense, $\bar{X}$ is optimal for the Poisson distribution. But note that the theorem does not preclude the possibility that there is a biased estimate of λ with a smaller mean squared error than that of $\bar{X}$. ■
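That $\bar{X}$ attains the bound is easy to confirm by simulation. A minimal sketch in Python (λ, n, and the number of replications are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 4.0, 25, 20000

# Sampling variance of the mle (the sample mean) over many simulated samples
xbars = rng.poisson(lam, size=(reps, n)).mean(axis=1)
print("simulated Var(xbar):       ", xbars.var())
print("Cramer-Rao bound lambda/n: ", lam/n)
```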
8.7.1 An Example: The Negative Binomial Distribution

The Poisson distribution is often the first model considered for random counts; it has the property that the mean of the distribution is equal to the variance. When the variance of the counts is found to be substantially larger than the mean, the negative binomial distribution is sometimes considered as a model instead. We consider a reparametrization and generalization of the negative binomial distribution introduced in Section 2.1.3, a discrete distribution on the nonnegative integers with a frequency function depending on the parameters m and k:

$$f(x|m, k) = \left(1+\frac{m}{k}\right)^{-k}\frac{\Gamma(k+x)}{x!\,\Gamma(k)}\left(\frac{m}{m+k}\right)^x$$

The mean and variance of the negative binomial distribution can be shown to be

$$\mu = m, \qquad \sigma^2 = m + \frac{m^2}{k}$$

It is apparent that this distribution is overdispersed ($\sigma^2 > \mu$) relative to the Poisson. We will not derive the mean and variance. (They are most easily obtained by using moment-generating functions.)

The negative binomial distribution can be used as a model in several cases:

• If k is an integer, the distribution of the number of successes up to the kth failure in a sequence of independent Bernoulli trials with probability of success $p = m/(m+k)$ is negative binomial.
• Suppose that Λ is a random variable following a gamma distribution and that, for a given value λ of Λ, X follows a Poisson distribution with mean λ. It can be shown that the unconditional distribution of X is negative binomial. Thus, for situations in which the rate varies randomly over time or space, the negative binomial distribution might tentatively be considered as a model.
• The negative binomial distribution also arises with a particular type of clustering. Suppose that counts of colonies, or clusters, follow a Poisson distribution and that each colony has a random number of individuals. If the probability distribution of the number of individuals per colony is of a particular form (the logarithmic series distribution), it can be shown that the distribution of counts of individuals is negative binomial. The negative binomial distribution might thus be a plausible model for the distribution of insect counts if the insects hatch from depositions, or clumps, of larvae.
• The negative binomial distribution can be applied to model population size in a certain birth/death process, the assumptions being that the birth rate and death rate per individual are constant and that there is a constant rate of immigration.

Anscombe (1950) discusses estimation of the parameters m and k and compares the efficiencies of several methods of estimation. The simplest method is the method of moments; from the relations of m and k to μ and $\sigma^2$ given previously, the method of moments estimates are

$$\hat{m} = \bar{X}, \qquad \hat{k} = \frac{\bar{X}^2}{\hat\sigma^2 - \bar{X}}$$

Another relatively simple method of estimation is based on the number of zeros. The probability that a count is zero is

$$p_0 = \left(1+\frac{m}{k}\right)^{-k}$$

If m is estimated by the sample mean and there are $n_0$ zeros in a sample of size n, then k is estimated by the solution $\hat{k}$ of

$$\frac{n_0}{n} = \left(1+\frac{\bar{X}}{\hat{k}}\right)^{-\hat{k}}$$

Although the solution cannot be obtained in closed form, it is not difficult to find by iteration, as sketched below.
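One way to carry out that iteration is to hand the equation to a bracketing root-finder rather than iterating by hand; the logic is the same. In the Python sketch below, the sample summary values (n, number of zeros, and sample mean) are assumptions invented for illustration, not data from the text.

```python
from scipy.optimize import brentq

# Hypothetical count-data summary (assumed values for illustration only)
n, n0, xbar = 150, 70, 1.15   # sample size, number of zeros, sample mean

# Solve n0/n = (1 + xbar/k)^(-k) for k; the left side lies strictly between
# exp(-xbar) and 1, so a root is bracketed on a wide interval.
g = lambda k: (1 + xbar/k)**(-k) - n0/n
k_hat = brentq(g, 1e-3, 1e3)
print("zero-frequency estimate of k:", k_hat)
```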
Figure 8.11, from Anscombe (1950), shows the asymptotic efficiencies of the two methods of estimation of the negative binomial parameters relative to maximum likelihood; in the figure, the method of moments is method 1 and the method based on the number of zeros is method 2. Method 2 is quite efficient when the mean is small, that is, when there are a large number of zeros; method 1 becomes more efficient as k increases.

[Figure 8.11: Asymptotic efficiencies of estimates of negative binomial parameters. Efficiency contours (50%, 75%, 90%, and 98%) for methods 1 and 2 are plotted against the mean m (about 0.04 to 400) and the exponent k (about 0.1 to 100), both on logarithmic scales.]

The maximum likelihood estimate is asymptotically efficient but is somewhat more difficult to compute. The equations will not be written out here; Bliss and Fisher (1953) discuss computational methods and give several examples. The maximum likelihood estimate of m is the sample mean, but that of k is the solution of a nonlinear equation.

EXAMPLE A Insect Counts
Let us consider an example from Bliss and Fisher (1953). From each of 6 apple trees in an orchard that was sprayed, 25 leaves were selected, and on each of the leaves the number of adult female red mites was counted. Intuitively, we might suspect that this situation is too heterogeneous for a Poisson model to fit: the rates of infestation might be different on different trees and at different locations on the same tree. The following table shows the observed counts and the expected counts from fitting Poisson and negative binomial distributions. The mle's were $\hat{k} = 1.025$ and $\hat{m} = 1.146$.

Number      Observed    Poisson         Negative Binomial
per Leaf    Count       Distribution    Distribution
0           70          47.7            69.5
1           38          54.6            37.6
2           17          31.3            20.1
3           10          12.0            10.7
4            9           3.4             5.7
5            3            .75            3.0
6            2            .15            1.6
7            1            .03             .85
8+           0            .00             .95

Casual inspection of this table makes it clear that the Poisson does not fit: many more small and large counts are observed than would be expected under a Poisson distribution. ■

A recursive relation is useful in fitting the negative binomial distribution:

$$p_0 = \left(1+\frac{m}{k}\right)^{-k}, \qquad p_n = \frac{k+n-1}{n}\,\frac{m}{k+m}\,p_{n-1}$$
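The recursion makes it easy to reproduce the negative binomial column of the table. A Python sketch using the mle's and sample size from the example:

```python
import numpy as np

k, m, n = 1.025, 1.146, 150      # mle's and sample size from the example
probs = [(1 + m/k)**(-k)]        # p0 from the closed form
for j in range(1, 8):            # p1 through p7 via the recursion
    probs.append((k + j - 1)/j * m/(k + m) * probs[-1])
probs.append(1 - sum(probs))     # lump the remaining upper tail into "8+"

expected = n * np.array(probs)
print(np.round(expected, 2))     # ~ [69.5, 37.6, 20.1, 10.7, 5.7, 3.0, 1.6, 0.85, 0.95]
```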
8.8 Sufficiency

This section introduces the concept of sufficiency and some of its theoretical implications. Suppose that $X_1, \ldots, X_n$ is a sample from a probability distribution with density or frequency function $f(x|\theta)$. The concept of sufficiency arises as an attempt to answer the following question: Is there a statistic, a function $T(X_1, \ldots, X_n)$, that contains all the information in the sample about θ? If so, a reduction of the original data to this statistic without loss of information is possible. For example, consider a sequence of independent Bernoulli trials with unknown probability of success, θ. We may have the intuitive feeling that the total number of successes contains all the information about θ that there is in the sample and that the order in which the successes occurred, for example, gives no additional information. The following definition formalizes this idea.

DEFINITION
A statistic $T(X_1, \ldots, X_n)$ is said to be sufficient for θ if the conditional distribution of $X_1, \ldots, X_n$, given $T = t$, does not depend on θ for any value of t. ■

In other words, given the value of T, which is called a sufficient statistic, we can gain no more knowledge about θ from knowing more about the probability distribution of $X_1, \ldots, X_n$. (Formally, we could envision keeping only T and throwing away all the $X_i$ without any loss of information. Informally, and more realistically, this would make no sense at all: the values of the $X_i$ might indicate that the model did not fit or that something was fishy about the data. What would you think, for example, if you saw 50 ones followed by 50 zeros in a sequence of supposedly independent Bernoulli trials?)

EXAMPLE A
Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. We will verify that $T = \sum_{i=1}^n X_i$ is sufficient for θ. We have

$$P(X_1 = x_1, \ldots, X_n = x_n \mid T = t) = \frac{P(X_1 = x_1, \ldots, X_n = x_n,\ T = t)}{P(T = t)}$$

Bearing in mind that the $X_i$ can take on only the values 0 and 1, the probability in the numerator is the probability that some particular set of t of the $X_i$ are equal to 1 and the other $n-t$ are equal to 0. Since the $X_i$ are independent, this probability is the product of the marginal probabilities, $\theta^t(1-\theta)^{n-t}$. To find the denominator, note that the distribution of T, the total number of ones, is binomial with n trials and probability of success θ. The ratio in question is thus

$$\frac{\theta^t(1-\theta)^{n-t}}{\dbinom{n}{t}\theta^t(1-\theta)^{n-t}} = \frac{1}{\dbinom{n}{t}}$$

The conditional distribution thus does not involve θ at all. Given the total number of ones, the probability that they occur on any particular set of t trials is the same for any value of θ, so that set of trials contains no additional information about θ. ■

8.8.1 A Factorization Theorem

The preceding definition of sufficiency is hard to work with: it does not indicate how to go about finding a sufficient statistic, and, given a candidate statistic T, it would typically be very hard to check whether it is sufficient, because of the difficulty of evaluating the conditional distribution. The following factorization theorem provides a convenient means of identifying sufficient statistics.

THEOREM A
A necessary and sufficient condition for $T(X_1, \ldots, X_n)$ to be sufficient for a parameter θ is that the joint probability function (density function or frequency function) factors in the form

$$f(x_1, \ldots, x_n|\theta) = g[T(x_1, \ldots, x_n), \theta]\,h(x_1, \ldots, x_n)$$

Proof
We give a proof for the discrete case. (The proof for the general case is more subtle and requires regularity conditions, but the basic ideas are the same.) First, suppose that the frequency function factors as given in the theorem. To simplify notation, we let X denote $(X_1, \ldots, X_n)$ and x denote $(x_1, \ldots, x_n)$. We have

$$P(T = t) = \sum_{T(x)=t} P(X = x) = g(t, \theta)\sum_{T(x)=t} h(x)$$

where the notation indicates that the sum is over all x such that $T(x) = t$. We then have

$$P(X = x \mid T = t) = \frac{P(X = x,\ T = t)}{P(T = t)} = \frac{h(x)}{\sum_{T(x)=t} h(x)}$$

This conditional distribution does not depend on θ, as was to be shown. Conversely, suppose that the conditional distribution of X given T is independent of θ. Let

$$g(t, \theta) = P(T = t \mid \theta), \qquad h(x) = P(X = x \mid T = t)$$

We then have

$$P(X = x \mid \theta) = P(T = t \mid \theta)\,P(X = x \mid T = t) = g(t, \theta)\,h(x)$$

as was to be shown. ■

We can demonstrate the utility of Theorem A by applying it to some examples; more examples are included in the problems at the end of this chapter.

EXAMPLE A
Consider a sequence of independent Bernoulli random variables $X_1, \ldots, X_n$ with

$$P(X_i = x) = \theta^x(1-\theta)^{1-x}, \qquad x = 0 \text{ or } x = 1$$

Then

$$f(x|\theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i} = \left(\frac{\theta}{1-\theta}\right)^{\sum x_i}(1-\theta)^n$$

We see that $f(x|\theta)$ depends on $x_1, \ldots, x_n$ only through $t = \sum_{i=1}^n x_i$ and that it is of the form $g(\sum x_i, \theta)\,h(x)$, where $h(x) = 1$ and

$$g(t, \theta) = \left(\frac{\theta}{1-\theta}\right)^t(1-\theta)^n$$

so $T = \sum X_i$ is sufficient. ■
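A concrete consequence of the factorization in Example A is that two Bernoulli samples with the same total produce identical likelihood functions, whatever the ordering of the observations. A minimal Python check (the two invented samples below are arbitrary illustrations):

```python
import numpy as np

def bernoulli_likelihood(x, theta):
    # By the factorization, the likelihood depends on x only through t = sum(x)
    t = sum(x)
    return theta**t * (1 - theta)**(len(x) - t)

theta = np.linspace(0.01, 0.99, 99)
x1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # t = 3
x2 = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # different ordering, same t = 3
print(np.allclose(bernoulli_likelihood(x1, theta),
                  bernoulli_likelihood(x2, theta)))   # True: identical curves
```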
EXAMPLE B
Consider a random sample from a normal distribution with unknown mean and variance. We have

$$f(x|\mu, \sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^2}(x_i-\mu)^2\right] = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right] = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2\right)\right]$$

This expression depends on the data only through $\sum_{i=1}^n x_i$ and $\sum_{i=1}^n x_i^2$, which are therefore sufficient statistics. In this example we have a two-dimensional sufficient statistic; although Theorem A was stated explicitly for a one-dimensional sufficient statistic, the multidimensional analogue holds as well. ■

Because the likelihood factors as

$$f(x_1, \ldots, x_n; \theta) = g[T(x_1, \ldots, x_n), \theta]\,h(x_1, \ldots, x_n)$$

it depends on the data only through $T(x_1, \ldots, x_n)$, and the maximum likelihood estimate is found by maximizing $g[T(x_1, \ldots, x_n), \theta]$. In Example A, the likelihood is a function of $t = \sum x_i$, and the maximum likelihood estimate is $\hat\theta = t/n$. Similarly, in a Bayesian framework, the posterior distribution of θ is proportional to the product of the prior distribution of θ and the likelihood. As a function of θ, the posterior distribution thus depends on the data only through $g[T(x_1, \ldots, x_n), \theta]$: the posterior probability of θ is the same for all sets of observations $\{x_1, \ldots, x_n\}$ that have a common value of $T(x_1, \ldots, x_n)$. The sufficient statistic carries all the information about θ that is contained in the data $x_1, x_2, \ldots, x_n$.

A study of the properties of probability distributions that have sufficient statistics of the same dimension as the parameter space, regardless of sample size, led to the development of what is called the exponential family of probability distributions. Many common distributions, including the normal, the binomial, the Poisson, and the gamma, are members of this family. One-parameter members of the exponential family have density or frequency functions of the form

$$f(x|\theta) = \begin{cases}\exp[c(\theta)T(x) + d(\theta) + S(x)], & x \in A\\ 0, & x \notin A\end{cases}$$

where the set A does not depend on θ. Suppose that $X_1, \ldots, X_n$ is a sample from a member of the exponential family. The joint probability function is

$$f(x|\theta) = \prod_{i=1}^n \exp[c(\theta)T(x_i) + d(\theta) + S(x_i)] = \exp\left[c(\theta)\sum_{i=1}^n T(x_i) + nd(\theta)\right]\exp\left[\sum_{i=1}^n S(x_i)\right]$$

From this result, it is apparent from the factorization theorem that $\sum_{i=1}^n T(X_i)$ is a sufficient statistic.

EXAMPLE C
The frequency function of the Bernoulli distribution is

$$P(X = x) = \theta^x(1-\theta)^{1-x} = \exp\left[x\log\frac{\theta}{1-\theta} + \log(1-\theta)\right], \qquad x = 0 \text{ or } x = 1$$

This is a member of the exponential family with $T(x) = x$, and we have already seen that $\sum_{i=1}^n X_i$ is a sufficient statistic for a sample from the Bernoulli distribution. ■

A k-parameter member of the exponential family has a density or frequency function of the form

$$f(x|\boldsymbol\theta) = \begin{cases}\exp\left[\displaystyle\sum_{i=1}^k c_i(\boldsymbol\theta)T_i(x) + d(\boldsymbol\theta) + S(x)\right], & x \in A\\ 0, & x \notin A\end{cases}$$

where the set A does not depend on θ. The normal distribution is of this form. A great deal of theoretical work has centered on the exponential family; further discussion can be found in Bickel and Doksum (2001).

We conclude this section with the following corollary of Theorem A.

COROLLARY A
If T is sufficient for θ, the maximum likelihood estimate is a function of T.

Proof
From Theorem A, the likelihood is $g(T, \theta)h(x)$, which depends on θ only through T. To maximize this quantity, we need only maximize $g(T, \theta)$. ■

Corollary A and the Rao-Blackwell theorem of the next section may be interpreted as giving some theoretical support to the use of maximum likelihood estimates.

8.8.2 The Rao-Blackwell Theorem

In the preceding section, we argued for the importance of sufficient statistics on essentially qualitative grounds.
The Rao-Blackwell theorem gives a quantitative rationale for basing an estimator of a parameter θ on a sufficient statistic if one exists.

THEOREM A Rao-Blackwell Theorem
Let $\hat\theta$ be an estimator of θ with $E(\hat\theta^2) < \infty$ for all θ. Suppose that T is sufficient for θ, and let $\tilde\theta = E(\hat\theta|T)$. Then, for all θ,

$$E(\tilde\theta - \theta)^2 \le E(\hat\theta - \theta)^2$$

The inequality is strict unless $\hat\theta = \tilde\theta$.

Proof
(Note that because T is sufficient, the conditional expectation $E(\hat\theta|T)$ does not depend on θ, so $\tilde\theta$ is a genuine estimator.) We first note that, from the property of iterated conditional expectation (Theorem A of Section 4.4.1),

$$E(\tilde\theta) = E[E(\hat\theta|T)] = E(\hat\theta)$$

Therefore, to compare the mean squared errors of the two estimators, we need only compare their variances. From Theorem B of Section 4.4.1,

$$\mathrm{Var}(\hat\theta) = \mathrm{Var}[E(\hat\theta|T)] + E[\mathrm{Var}(\hat\theta|T)] = \mathrm{Var}(\tilde\theta) + E[\mathrm{Var}(\hat\theta|T)]$$

Thus, $\mathrm{Var}(\hat\theta) > \mathrm{Var}(\tilde\theta)$ unless $\mathrm{Var}(\hat\theta|T) = 0$, which is the case only if $\hat\theta$ is itself a function of T, which would imply $\hat\theta = \tilde\theta$. ■

Since $E(\hat\theta|T)$ is a function of the sufficient statistic T, the Rao-Blackwell theorem gives a strong rationale for basing estimators on sufficient statistics if they exist: if an estimator is not a function of a sufficient statistic, it can be improved. Suppose that there are two estimates, $\hat\theta_1$ and $\hat\theta_2$, having the same expectation. Assuming that a sufficient statistic T exists, we may construct two other estimates, $\tilde\theta_1$ and $\tilde\theta_2$, by conditioning on T. The theory we have developed so far gives no clues as to which of these two is better. If the probability distribution of T has the property called completeness, $\tilde\theta_1$ and $\tilde\theta_2$ are identical, by a theorem of Lehmann and Scheffé. We will not define completeness or pursue this topic further; Lehmann and Casella (1998) and Bickel and Doksum (2001) discuss this concept.
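The variance reduction promised by the theorem is easy to see in the Bernoulli setting. Start with the unbiased but wasteful estimator $\hat\theta = X_1$; conditioning on the sufficient statistic $T = \sum X_i$ gives $E(X_1|T) = T/n$. The following Python sketch (parameter values and replication count are arbitrary assumptions) compares the two by simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 20, 50000

x = rng.binomial(1, theta, size=(reps, n))
crude = x[:, 0]              # unbiased but ignores most of the sample
t = x.sum(axis=1)
rao_black = t / n            # E(X1 | T) = T/n, a function of the sufficient statistic

print("Var(crude):        ", crude.var())      # ~ theta*(1-theta)      = 0.21
print("Var(Rao-Blackwell):", rao_black.var())  # ~ theta*(1-theta)/n    = 0.0105
```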
8.9 Concluding Remarks

Certain key ideas first introduced in the context of survey sampling in Chapter 7 have recurred in this chapter. We have viewed an estimate as a random variable having a probability distribution called its sampling distribution. In Chapter 7, the estimate was of a parameter, such as the mean, of a finite population; in this chapter, the estimate was of a parameter of a probability distribution. In both cases, characteristics of the sampling distribution, such as the bias, the variance, and the approximate large sample form, have been of interest. In both chapters, we studied confidence intervals for the true value of the unknown parameter, and the method of propagation of error, or linearization, was a useful tool. These key ideas will be important in other contexts in later chapters as well.

Important concepts and techniques in estimation theory were introduced in this chapter. We discussed two general methods of estimation, the method of moments and the method of maximum likelihood; the latter especially has great general utility in statistics. We developed and applied some approximate distribution theory for maximum likelihood estimates. Other theoretical developments included the concept of efficiency, the Cramér-Rao lower bound, and the concept of sufficiency and some of its consequences.

Bayesian inference was also introduced in this chapter. Its point of view contrasts rather sharply with that of frequentist inference in that the Bayesian formalism allows uncertainty statements about parameter values to be probabilistic, for example, "After seeing the data, the probability is 95% that $1.8 \le \theta \le 6.3$." In frequentist inference, θ is not a random variable, and a statement like this would literally make no sense; it would be replaced by "A 95% confidence interval for θ is [1.8, 6.3]," perhaps followed by a long, convoluted explication of the meaning of a confidence interval. Despite this apparently sharp philosophical difference, Bayesian and frequentist procedures have a great deal in common and typically lead to similar conclusions; despite the distinction between the two statements above, they may well mean essentially the same thing operationally to a practitioner who has analyzed the data. The likelihood function is fundamental to both frequentist and Bayesian inference. In an application, the choice of a model, that is, the choice of a likelihood function, will typically be much more important than whether one subsequently multiplies it by a prior or just maximizes it. This is especially true if flat priors are used; in fact, one might regard a flat prior as a device that allows the likelihood to be treated as a probability density.

In this chapter, we also introduced the bootstrap method for assessing the variability of an estimate. Such uses of simulation have become increasingly widespread as computers have become faster and cheaper; the bootstrap as a general method has been developed only quite recently and has rapidly become one of the most important statistical tools. We will see other situations in which the bootstrap is useful in later chapters. Efron and Tibshirani (1993) give an excellent introduction to the theory and applications of the bootstrap. The context in which we have introduced the bootstrap is often referred to as the parametric bootstrap; the nonparametric bootstrap will be introduced in Chapter 10.

The parametric bootstrap can be thought about somewhat abstractly in the following way. We have data x that we regard as being generated by a probability distribution $F(x|\theta)$, which depends on a parameter θ. We wish to know $E\,h(X, \theta)$ for some function h. For example, if θ itself is estimated from the data as $\hat\theta(x)$ and $h(X, \theta) = [\hat\theta(X) - \theta]^2$, then $E\,h(X, \theta)$ is the mean squared error of the estimate. As another example, if

$$h(X, \theta) = \begin{cases}1, & \text{if } |\hat\theta(X) - \theta| > \Delta\\ 0, & \text{otherwise}\end{cases}$$

then $E\,h(X, \theta)$ is the probability that $|\hat\theta(X) - \theta| > \Delta$. We realize that if θ were known, we could use the computer to generate independent random variables $X_1, X_2, \ldots, X_B$ from $F(x|\theta)$ and then appeal to the law of large numbers:

$$E\,h(X, \theta) \approx \frac{1}{B}\sum_{i=1}^B h(X_i, \theta)$$

This approximation can be made arbitrarily precise by choosing B sufficiently large. The parametric bootstrap principle is to perform this Monte Carlo simulation using $\hat\theta$ in place of the unknown θ, that is, using $F(x|\hat\theta)$ to generate the $X_i$. It is difficult to give a concise answer to the natural question of how much error is introduced by using $\hat\theta$ in place of θ. The answer depends on the continuity of $E\,h(X, \theta)$ as a function of θ: if small changes in θ can give rise to large changes in $E\,h(X, \theta)$, the parametric bootstrap will not work well.
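The abstract description translates directly into a few lines of code. The Python sketch below carries out the parametric bootstrap for a Poisson mle (the observed data are simulated here as an assumption; in practice x would be the actual sample), estimating both the mean squared error and a tail probability of the form just described.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.poisson(4.0, size=23)       # "observed" data (simulated; assumption)
n, lam_hat = len(x), x.mean()       # the mle of a Poisson rate is the sample mean

B = 10000
# Parametric bootstrap: generate X_1, ..., X_B from F(.|theta_hat)
boot = rng.poisson(lam_hat, size=(B, n)).mean(axis=1)

# h(X, theta) = (theta_hat(X) - theta)^2, evaluated at theta = lam_hat:
print("bootstrap MSE estimate:      ", np.mean((boot - lam_hat)**2))
print("theoretical Var at lam_hat:  ", lam_hat/n)

# h as an indicator gives P(|theta_hat(X) - theta| > Delta), e.g. Delta = 0.5:
print("bootstrap P(|error| > 0.5):  ", np.mean(np.abs(boot - lam_hat) > 0.5))
```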
8.10 Problems

1. The following table gives the observed counts in 1-second intervals for Berkson's data (Section 8.2). What are the expected counts from a Poisson distribution? Do they match the observed counts?

n      Observed
0      5267
1      4436
2      1800
3      534
4      111
5+     21

2. The Poisson distribution has been used by traffic engineers as a model for light traffic, based on the rationale that if the rate is approximately constant and the traffic is light (so the individual cars move independently of each other), the distribution of counts of cars in a given time interval or space area should be nearly Poisson (Gerlough and Schuhl 1955). The following table shows the number of right turns during 300 3-min intervals at a specific intersection. Fit a Poisson distribution and comment on the fit by comparing observed and expected counts. It is useful to know that the 300 intervals were distributed over various hours of the day and various days of the week.

n      Frequency
0      14
1      30
2      36
3      68
4      43
5      43
6      30
7      14
8      10
9      6
10     4
11     1
12     1
13+    0

3. One of the earliest applications of the Poisson distribution was made by Student (1907) in studying errors made in counting yeast cells or blood corpuscles with a haemacytometer. In this study, yeast cells were killed and mixed with water and gelatin; the mixture was then spread on a glass and allowed to cool. Four different concentrations were used. Counts were made on 400 squares, and the data are summarized in the following table:

Number     Concentration  Concentration  Concentration  Concentration
of Cells   1              2              3              4
0          213            103             75              0
1          128            143            103             20
2           37             98            121             43
3           18             42             54             53
4            3              8             30             86
5            1              4             13             70
6            0              2              2             54
7            0              0              1             37
8            0              0              0             18
9            0              0              1             10
10           0              0              0              5
11           0              0              0              2
12           0              0              0              2

a. Estimate the parameter λ for each of the four sets of data.
b. Find an approximate 95% confidence interval for each estimate.
c. Compare observed and expected counts.

4. Suppose that X is a discrete random variable with

$$P(X = 0) = \tfrac{2}{3}\theta$$
$$P(X = 1) = \tfrac{1}{3}\theta, \qquad P(X = 2) = \tfrac{2}{3}(1-\theta), \qquad P(X = 3) = \tfrac{1}{3}(1-\theta)$$

where $0 \le \theta \le 1$ is a parameter. The following 10 independent observations were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1).

a. Find the method of moments estimate of θ.
b. Find an approximate standard error for your estimate.
c. What is the maximum likelihood estimate of θ?
d. What is an approximate standard error of the maximum likelihood estimate?
e. If the prior distribution of Θ is uniform on [0, 1], what is the posterior density? Plot it. What is the mode of the posterior?

5. Suppose that X is a discrete random variable with $P(X = 1) = \theta$ and $P(X = 2) = 1-\theta$. Three independent observations of X are made: $x_1 = 1$, $x_2 = 2$, $x_3 = 2$.

a. Find the method of moments estimate of θ.
b. What is the likelihood function?
c. What is the maximum likelihood estimate of θ?
d. If Θ has a prior distribution that is uniform on [0, 1], what is its posterior density?

6. Suppose that $X \sim \mathrm{bin}(n, p)$.

a. Show that the mle of p is $\hat{p} = X/n$.
b. Show that the mle of part (a) attains the Cramér-Rao lower bound.
c. If n = 10 and X = 5, plot the log likelihood function.

7. Suppose that X follows a geometric distribution,

$$P(X = k) = p(1-p)^{k-1}$$

and assume an i.i.d. sample of size n.

a. Find the method of moments estimate of p.
b. Find the mle of p.
c. Find the asymptotic variance of the mle.
d. Let p have a uniform prior distribution on [0, 1]. What is the posterior distribution of p? What is the posterior mean?

8. In an ecological study of the feeding behavior of birds, the number of hops between flights was counted for several birds.
For the following data, (a) fit a geometric distribution, (b) find an approximate 95% confidence interval for p, and (c) examine goodness of fit. (d) If a uniform prior is used for p, what is the posterior distribution, and what are the posterior mean and standard deviation?

Number of Hops    Frequency
1                 48
2                 31
3                 20
4                 9
5                 6
6                 5
7                 4
8                 2
9                 1
10                1
11                2
12                1

9. How would you respond to the following argument? "This talk of sampling distributions is ridiculous! Consider Example A of Section 8.4. The experimenter found the mean number of fibers to be 24.9. How can this be a 'random variable' with an associated 'probability distribution' when it's just a number? The author of this book is guilty of deliberate mystification!"

10. Use the normal approximation to the Poisson distribution to sketch the approximate sampling distribution of $\hat\lambda$ of Example A of Section 8.4. According to this approximation, what is $P(|\lambda_0 - \hat\lambda| > \delta)$ for $\delta = .5, 1, 1.5, 2,$ and $2.5$, where $\lambda_0$ denotes the true value of λ?

11. In Example A of Section 8.4, we used knowledge of the exact form of the sampling distribution of $\hat\lambda$ to estimate its standard error by

$$s_{\hat\lambda} = \sqrt{\frac{\hat\lambda}{n}}$$

This was arrived at by realizing that $\sum X_i$ follows a Poisson distribution with parameter $n\lambda_0$. Now suppose we hadn't realized this but had used the bootstrap, letting the computer do our work for us by generating B samples of size n = 23 of Poisson random variables with parameter λ = 24.9, forming the mle of λ from each sample, and then computing the standard deviation of the resulting collection of estimates and taking this as an estimate of the standard error of $\hat\lambda$. Argue that as $B \to \infty$, the standard error estimated in this way will tend to $s_{\hat\lambda}$.

12. Suppose that you had to choose either the method of moments estimates or the maximum likelihood estimates in Example C of Section 8.4 and Example C of Section 8.5. Which would you choose, and why?

13. In Example D of Section 8.4, the method of moments estimate was found to be $\hat\alpha = 3\bar{X}$. In this problem, you will consider the sampling distribution of $\hat\alpha$.

a. Show that $E(\hat\alpha) = \alpha$, that is, that the estimate is unbiased.
b. Show that $\mathrm{Var}(\hat\alpha) = (3-\alpha^2)/n$. [Hint: What is $\mathrm{Var}(\bar{X})$?]
c. Use the central limit theorem to deduce a normal approximation to the sampling distribution of $\hat\alpha$. According to this approximation, if n = 25 and α = 0, what is $P(|\hat\alpha| > .5)$?

14. In Example C of Section 8.5, how could you use the bootstrap to estimate the following measures of the accuracy of $\hat\alpha$: (a) $P(|\hat\alpha - \alpha_0| > .05)$, (b) $E(|\hat\alpha - \alpha_0|)$, (c) that number Δ such that $P(|\hat\alpha - \alpha_0| > \Delta) = .5$?

15. The upper quartile of a distribution with cumulative distribution function F is the point $q_{.25}$ such that $F(q_{.25}) = .75$. For a gamma distribution, the upper quartile depends on α and λ, so denote it by $q(\alpha, \lambda)$. If a gamma distribution is fit to data as in Example C of Section 8.5 and the parameters α and λ are estimated by $\hat\alpha$ and $\hat\lambda$, the upper quartile can be estimated by $\hat{q} = q(\hat\alpha, \hat\lambda)$. Explain how to use the bootstrap to estimate the standard error of $\hat{q}$.

16. Consider an i.i.d. sample of random variables with density function

$$f(x|\sigma) = \frac{1}{2\sigma}\exp\left(-\frac{|x|}{\sigma}\right)$$

a. Find the method of moments estimate of σ.
b. Find the maximum likelihood estimate of σ.
c. Find the asymptotic variance of the mle.
d. Find a sufficient statistic for σ.

17. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. random variables on the interval [0, 1] with the density function

$$f(x|\alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2}[x(1-x)]^{\alpha-1}$$

where $\alpha > 0$ is a parameter to be estimated from the sample. It can be shown that

$$E(X) = \frac{1}{2}, \qquad \mathrm{Var}(X) = \frac{1}{4(2\alpha+1)}$$

a. How does the shape of the density depend on α?
b. How can the method of moments be used to estimate α?
c. What equation does the mle of α satisfy?
d. What is the asymptotic variance of the mle?
e. Find a sufficient statistic for α.

18. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. random variables on the interval [0, 1] with the density function

$$f(x|\alpha) = \frac{\Gamma(3\alpha)}{\Gamma(\alpha)\Gamma(2\alpha)}x^{\alpha-1}(1-x)^{2\alpha-1}$$

where $\alpha > 0$ is a parameter to be estimated from the sample. It can be shown that

$$E(X) = \frac{1}{3}, \qquad \mathrm{Var}(X) = \frac{2}{9(3\alpha+1)}$$

a. How could the method of moments be used to estimate α?
b. What equation does the mle of α satisfy?
c. What is the asymptotic variance of the mle?
d. Find a sufficient statistic for α.

19. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$.

a. If μ is known, what is the mle of σ?
b. If σ is known, what is the mle of μ?
c. In the case above (σ known), does any other unbiased estimate of μ have smaller variance?

20. Suppose that $X_1, X_2, \ldots, X_{25}$ are i.i.d. $N(\mu, \sigma^2)$, where μ = 0 and σ = 10. Plot the sampling distributions of $\bar{X}$ and $\hat\sigma^2$.

21. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. with density function

$$f(x|\theta) = e^{-(x-\theta)}, \qquad x \ge \theta$$

and $f(x|\theta) = 0$ otherwise.

a. Find the method of moments estimate of θ.
b. Find the mle of θ. (Hint: Be careful, and don't differentiate before thinking. For what values of θ is the likelihood positive?)
c. Find a sufficient statistic for θ.

22. The Weibull distribution was defined in Problem 67 of Chapter 2. This distribution is sometimes fit to lifetimes. Describe how to fit this distribution to data and how to find approximate standard errors of the parameter estimates.

23. A company has manufactured certain objects and has printed a serial number on each. The serial numbers start at 1 and end at N, where N is the number of objects that have been manufactured. One of these objects is selected at random, and its serial number is 888. What is the method of moments estimate of N? What is the mle of N?

24. Find a very new shiny penny. Hold it on its edge and spin it. Do this 20 times and count the number of times it comes to rest heads up. Letting π denote the probability of a head, graph the log likelihood of π. Next, repeat the experiment in a slightly different way: this time, spin the coin until 10 heads come up. Again, graph the log likelihood of π.

25. If a thumbtack is tossed in the air, it can come to rest on the ground either with the point up or with the point touching the ground. Find a thumbtack. Before doing any experiment, what do you think π, the probability of it landing point up, is? Next, toss the thumbtack 20 times and graph the log likelihood of π. Then do another experiment: toss the thumbtack until it lands point up 5 times, and graph the log likelihood of π based on this experiment. Find and graph the posterior distribution arising from a uniform prior on π. Find the posterior mean and standard deviation, and compare the posterior with a normal distribution having that mean and standard deviation. Finally, toss the thumbtack 20 more times and compare the posterior distribution based on all 40 tosses to that based on the first 20.

26. In an effort to determine the size of an animal population, 100 animals are captured and tagged.
Some time later, another 50 animals are captured, and it is found that 20 of them are tagged. How would you estimate the population size? What assumptions about the capture/recapture process do you need to make? (See Example I of Section 1.4.2.)

27. Suppose that certain electronic components have lifetimes that are exponentially distributed: $f(t|\tau) = (1/\tau)\exp(-t/\tau)$, $t \ge 0$. Five new components are put on test, the first one fails at 100 days, and no further observations are recorded.

a. What is the likelihood function of τ?
b. What is the mle of τ?
c. What is the sampling distribution of the mle?
d. What is the standard error of the mle? (Hint: See Example A of Section 3.7.)

28. Why do the intervals in the left panel of Figure 8.8 have different centers? Why do they have different lengths?

29. Are the estimates of $\sigma^2$ at the centers of the confidence intervals shown in the right panel of Figure 8.8? Why are some intervals so short and others so long? For which of the samples that produced these confidence intervals was $\hat\sigma^2$ smallest?

30. The exponential distribution is $f(x; \lambda) = \lambda e^{-\lambda x}$, with $E(X) = \lambda^{-1}$ and cumulative distribution function $F(x) = P(X \le x) = 1 - e^{-\lambda x}$. Three observations are made by an instrument that reports $x_1 = 5$ and $x_2 = 3$, but $x_3$ is too large for the instrument to measure, and it reports only that $x_3 > 10$. (The largest value the instrument can measure is 10.0.)

a. What is the likelihood function?
b. What is the mle of λ?

31. George spins a coin three times and observes no heads. He then gives the coin to Hilary, who spins it until the first head occurs, spinning it four times in total. Let θ denote the probability that the coin comes up heads.

a. What is the likelihood of θ?
b. What is the mle of θ?

32. The following 16 numbers came from a normal random number generator on a computer:

5.3299   4.2537   3.1502   3.7032
1.6070   6.3923   3.1181   6.5941
3.5281   4.7433   0.1077   1.5977
5.4920   1.7220   4.1547   2.2799

a. What would you guess the mean and variance (μ and $\sigma^2$) of the generating normal distribution were?
b. Give 90%, 95%, and 99% confidence intervals for μ and $\sigma^2$.
c. Give 90%, 95%, and 99% confidence intervals for σ.
d. How much larger a sample do you think you would need to halve the length of the confidence interval for μ?

33. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, where μ and σ are unknown. How should the constant c be chosen so that the interval $(-\infty, \bar{X} + c)$ is a 95% confidence interval for μ; that is, how should c be chosen so that $P(-\infty < \mu \le \bar{X} + c) = .95$?

34. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu_0, \sigma_0^2)$ and that μ and $\sigma^2$ are estimated by the method of maximum likelihood, with resulting estimates $\hat\mu$ and $\hat\sigma^2$. Suppose the bootstrap is used to estimate the sampling distribution of $\hat\mu$.

a. Explain why the bootstrap estimate of the distribution of $\hat\mu$ is $N(\hat\mu, \hat\sigma^2/n)$.
b. Explain why the bootstrap estimate of the distribution of $\hat\mu - \mu_0$ is $N(0, \hat\sigma^2/n)$.
c. According to the result of the previous part, what is the form of the bootstrap confidence interval for μ, and how does it compare to the exact confidence interval based on the t distribution?

35. (Bootstrap in Example A of Section 8.5.1) Let $U_1, U_2, \ldots, U_{1029}$ be independent uniformly distributed random variables. Let $X_1$ equal the number of $U_i$ less than .331, $X_2$ the number between .331 and .820, and $X_3$ the number greater than .820. Why is the joint distribution of $X_1$, $X_2$, and $X_3$ multinomial with probabilities .331, .489, and .180 and n = 1029?
36. How do the approximate 90% confidence intervals in Example E of Section 8.5.3 compare to those that would be obtained by approximating the sampling distributions of $\hat\alpha$ and $\hat\lambda$ by normal distributions with standard deviations $s_{\hat\alpha}$ and $s_{\hat\lambda}$, as in Example C of Section 8.5?

37. Using the notation of Section 8.5.3, suppose that $\underline\theta$ and $\overline\theta$ are lower and upper quantiles of the distribution of $\theta^*$. Show that the bootstrap confidence interval for θ can be written as $(2\hat\theta - \overline\theta,\ 2\hat\theta - \underline\theta)$.

38. Continuing Problem 37, show that if the sampling distribution of $\theta^*$ is symmetric about $\hat\theta$, then the bootstrap confidence interval is $(\underline\theta, \overline\theta)$.

39. In Section 8.5.3, the bootstrap confidence interval was derived from consideration of the sampling distribution of $\hat\theta - \theta_0$. Suppose that we had started instead by considering the distribution of $\hat\theta/\theta_0$. How would the argument have proceeded, and would the bootstrap interval finally arrived at have been different?

40. In Example A of Section 8.5.1, how could you use the bootstrap to estimate the following measures of the accuracy of $\hat\theta$: (a) $P(|\hat\theta - \theta_0| > .01)$, (b) $E(|\hat\theta - \theta_0|)$, (c) that number Δ such that $P(|\hat\theta - \theta_0| > \Delta) = .5$?

41. What are the relative efficiencies of the method of moments and maximum likelihood estimates of α and λ in Example C of Section 8.4 and Example C of Section 8.5?

42. The file gamma-ray contains a small quantity of data collected from the Compton Gamma Ray Observatory, a satellite launched by NASA in 1991 (http://cossc.gsfc.nasa.gov/). For each of 100 sequential time intervals of variable lengths (given in seconds), the number of gamma rays originating in a particular area of the sky was recorded. Assuming a model in which the arrival times form a Poisson process with constant emission rate (λ = events per second), estimate λ. What is the estimated standard error? How might you informally check the assumption that the emission rate is constant? What is the posterior distribution of Λ if an improper gamma prior is used?

43. The file gamma-arrivals contains another set of gamma-ray data, this one consisting of the times between arrivals (interarrival times) of 3,935 photons (units are seconds).

a. Make a histogram of the interarrival times. Does it appear that a gamma distribution would be a plausible model?
b. Fit the parameters by the method of moments and by maximum likelihood. How do the estimates compare?
c. Plot the two fitted gamma densities on top of the histogram. Do the fits look reasonable?
d. For both maximum likelihood and the method of moments, use the bootstrap to estimate the standard errors of the parameter estimates. How do the estimated standard errors of the two methods compare?
e. For both maximum likelihood and the method of moments, use the bootstrap to form approximate confidence intervals for the parameters. How do the confidence intervals for the two methods compare?
f. Is the interarrival time distribution consistent with a Poisson process model for the arrival times?

44. The file bodytemp contains normal body temperature readings (degrees Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and 65 females (coded by 2) from Shoemaker (1996). Assuming that the population distributions are normal (an assumption that will be investigated in a later chapter), estimate the means and standard deviations of the males and females. Form 95% confidence intervals for the means.
Standard folklore is that the average body temperature is 98.6 degrees Fahrenheit. Does this appear to be the case?

45. A Random Walk Model for Chromatin. A human chromosome is a very large molecule, about 2 or 3 centimeters long, containing 100 million base pairs (Mbp). The cell nucleus, where the chromosome is contained, is in contrast only about a thousandth of a centimeter in diameter. The chromosome is packed in a series of coils, called chromatin, in association with special proteins (histones), forming a string of microscopic beads. It is a mixture of DNA and protein. In the G0/G1 phase of the cell cycle, between mitosis and the onset of DNA replication, the mitotic chromosomes diffuse into the interphase nucleus. At this stage, a number of important processes related to chromosome function take place. For example, DNA is made accessible for transcription and is duplicated, and repairs are made of DNA strand breaks. By the time of the next mitosis, the chromosomes have been duplicated. The complexity of these and other processes raises many questions about the large-scale spatial organization of chromosomes and how this organization relates to cell function. Fundamentally, it is puzzling how these processes can unfold in such a spatially restricted environment.

At a scale of about $10^{-3}$ Mbp, the DNA forms a chromatin fiber about 30 nm in diameter; at a scale of about $10^{-1}$ Mbp, the chromatin may form loops. Very little is known about the spatial organization beyond this scale. Various models have been proposed, ranging from highly random to highly organized, including irregularly folded fibers, giant loops, radial loop structures, systematic organization to make the chromatin readily accessible to transcription and replication machinery, and stochastic configurations based on random walk models for polymers.

A series of experiments (Sachs et al., 1995; Yokota et al., 1995) were conducted to learn more about spatial organization on larger scales. Pairs of small DNA sequences (size about 40 kbp) at specified locations on human chromosome 4 were fluorescently labeled in a large number of cells. The distances between the members of these pairs were then determined by fluorescence microscopy. (The distances measured were actually two-dimensional distances between the projections of the paired locations onto a plane.) The empirical distribution of these distances provides information about the nature of large-scale organization.

There has long been a tradition in chemistry of modeling the configurations of polymers by the theory of random walks. As a consequence of such a model, the two-dimensional distance should follow a Rayleigh distribution,

$$f(r|\theta) = \frac{r}{\theta^2}\exp\left(-\frac{r^2}{2\theta^2}\right)$$

Basically, the reason is as follows: the random walk model implies that the joint distribution of the locations of the pair in $R^3$ is multivariate Gaussian; by properties of the multivariate Gaussian, the joint distribution of the projections of the locations onto a plane is bivariate Gaussian; and, as in Example A of Section 3.6.2 of the text, it can be shown that the distance between the projected points then follows a Rayleigh distribution.

In this exercise, you will fit the Rayleigh distribution to some of the experimental results and examine the goodness of fit. The entire data set comprises 36 experiments in which the separation between the pairs of fluorescently tagged locations ranged from 10 Mbp to 192 Mbp.
In each such experimental condition, about 100–200 measurements of two-dimensional distances were determined. This exercise is concerned only with the data from three experiments (short, medium, and long separation); the measurements from these experiments are contained in the files Chromatin/short, Chromatin/medium, and Chromatin/long.

a. What is the maximum likelihood estimate of θ for a sample from a Rayleigh distribution?
b. What is the method of moments estimate?
c. What are the approximate variances of the mle and the method of moments estimate?
d. For each of the three experiments, plot the likelihood functions and find the mle's and their approximate variances.
e. Find the method of moments estimates and their approximate variances.
f. For each experiment, make a histogram (with unit area) of the measurements and plot the fitted densities on top. Do the fits look reasonable? Is there any appreciable difference between the maximum likelihood fits and the method of moments fits?
g. Does there appear to be any relationship between your estimates and the genomic separation of the points?
h. For one of the experiments, compare the asymptotic variances to the results obtained from a parametric bootstrap. In order to do this, you will have to generate random variables from a Rayleigh distribution with parameter θ. Show that if X follows a Rayleigh distribution with θ = 1, then $Y = \theta X$ follows a Rayleigh distribution with parameter θ; thus, it is sufficient to figure out how to generate random variables that are Rayleigh with θ = 1. Show how Proposition D of Section 2.3 of the text can be applied to accomplish this. B = 100 bootstrap samples should suffice for this problem. Make a histogram of the values of the $\theta^*$. Does the distribution appear roughly normal? Do you think that large sample theory can be reasonably applied here? Compare the standard deviation calculated from the bootstrap to the standard errors you found previously.
i. For one of the experiments, use the bootstrap to construct an approximate 95% confidence interval for θ using B = 1000 bootstrap samples. Compare this interval to that obtained using large sample theory.

46. The data of this exercise were gathered as part of a study to estimate the population size of the bowhead whale (Raftery and Zeh 1993). The statistical procedures for estimating the population size, along with an assessment of the variability of the estimate, were quite involved, and this problem deals with only one aspect of the problem: a study of the distribution of whale swimming speeds. Pairs of sightings and corresponding locations that could be reliably attributed to the same whale were collected, thus providing an estimate of velocity for each whale. The velocities, $v_1, v_2, \ldots, v_{210}$ (km/h), were converted into times $t_1, t_2, \ldots, t_{210}$ to swim 1 km, $t_i = 1/v_i$. The distribution of the $t_i$ was then fit by a gamma distribution. The times are contained in the file whales.

a. Make a histogram of the 210 values of $t_i$. Does it appear that a gamma distribution would be a plausible model to fit?
b. Fit the parameters of the gamma distribution by the method of moments.
c. Fit the parameters of the gamma distribution by maximum likelihood. How do these values compare to those found before?
d. Plot the two gamma densities on top of the histogram. Do the fits look reasonable?
46. The data of this exercise were gathered as part of a study to estimate the population size of the bowhead whale (Raftery and Zeh 1993). The statistical procedures for estimating the population size along with an assessment of the variability of the estimate were quite involved, and this problem deals with only one aspect of the problem—a study of the distribution of whale swimming speeds. Pairs of sightings and corresponding locations that could be reliably attributed to the same whale were collected, thus providing an estimate of velocity for each whale. The velocities, $v_1, v_2, \ldots, v_{210}$ (km/h), were converted into times $t_1, t_2, \ldots, t_{210}$ to swim 1 km: $t_i = 1/v_i$. The distribution of the $t_i$ was then fit by a gamma distribution. The times are contained in the file whales.

a. Make a histogram of the 210 values of $t_i$. Does it appear that a gamma distribution would be a plausible model to fit?
b. Fit the parameters of the gamma distribution by the method of moments.
c. Fit the parameters of the gamma distribution by maximum likelihood. How do these values compare to those found before?
d. Plot the two gamma densities on top of the histogram. Do the fits look reasonable?
e. Estimate the sampling distributions and the standard errors of the parameters fit by the method of moments by using the bootstrap.
f. Estimate the sampling distributions and the standard errors of the parameters fit by maximum likelihood by using the bootstrap. How do they compare to the results found previously?
g. Find approximate confidence intervals for the parameters estimated by maximum likelihood.

47. The Pareto distribution has been used in economics as a model for a density function with a slowly decaying tail:

$$f(x \mid x_0, \theta) = \theta x_0^\theta x^{-\theta - 1}, \qquad x \ge x_0, \quad \theta > 1$$

Assume that $x_0 > 0$ is given and that $X_1, X_2, \ldots, X_n$ is an i.i.d. sample.
a. Find the method of moments estimate of θ.
b. Find the mle of θ.
c. Find the asymptotic variance of the mle.
d. Find a sufficient statistic for θ.

48. Consider the following method of estimating λ for a Poisson distribution. Observe that

$$p_0 = P(X = 0) = e^{-\lambda}$$

Letting Y denote the number of zeros from an i.i.d. sample of size n, λ might be estimated by

$$\tilde\lambda = -\log\frac{Y}{n}$$

Use the method of propagation of error to obtain approximate expressions for the variance and the bias of this estimate. Compare the variance of this estimate to the variance of the mle, computing relative efficiencies for various values of λ. Note that $Y \sim \mathrm{bin}(n, p_0)$.

49. For the example on muon decay in Section 8.4, suppose that instead of recording $x = \cos\theta$, only whether the electron goes backward (x < 0) or forward (x > 0) is recorded.
a. How could α be estimated from n independent observations of this type? (Hint: Use the binomial distribution.)
b. What is the variance of this estimate and its efficiency relative to the method of moments estimate and the mle for α = 0, .1, .2, .3, .4, .5, .6, .7, .8, .9?

50. Let $X_1, \ldots, X_n$ be an i.i.d. sample from a Rayleigh distribution with parameter θ > 0:

$$f(x \mid \theta) = \frac{x}{\theta^2} e^{-x^2/(2\theta^2)}, \qquad x \ge 0$$

(This is an alternative parametrization of that of Example A in Section 3.6.2.)
a. Find the method of moments estimate of θ.
b. Find the mle of θ.
c. Find the asymptotic variance of the mle.

51. The double exponential distribution is

$$f(x \mid \theta) = \frac{1}{2} e^{-|x - \theta|}, \qquad -\infty < x < \infty$$

For an i.i.d. sample of size n = 2m + 1, show that the mle of θ is the median of the sample. (The observation such that half of the rest of the observations are smaller and half are larger.) [Hint: The function g(x) = |x| is not differentiable. Draw a picture for a small value of n to try to understand what is going on.]

52. Let $X_1, \ldots, X_n$ be i.i.d. random variables with the density function

$$f(x \mid \theta) = (\theta + 1)x^\theta, \qquad 0 \le x \le 1$$

a. Find the method of moments estimate of θ.
b. Find the mle of θ.
c. Find the asymptotic variance of the mle.
d. Find a sufficient statistic for θ.

53. Let $X_1, \ldots, X_n$ be i.i.d. uniform on [0, θ].
a. Find the method of moments estimate of θ and its mean and variance.
b. Find the mle of θ.
c. Find the probability density of the mle, and calculate its mean and variance. Compare the variance, the bias, and the mean squared error to those of the method of moments estimate.
d. Find a modification of the mle that renders it unbiased.

54. Suppose that an i.i.d. sample of size 15 from a normal distribution gives $\bar X = 10$ and $s^2 = 25$. Find 90% confidence intervals for μ and $\sigma^2$.
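For Problem 54, the standard normal-theory intervals are $\bar X \pm t_{n-1}(\alpha/2)\, s/\sqrt{n}$ for μ and $\big((n-1)s^2/\chi^2_{n-1}(\alpha/2),\ (n-1)s^2/\chi^2_{n-1}(1-\alpha/2)\big)$ for σ². A sketch of the computation, assuming SciPy:

```python
import numpy as np
from scipy import stats

n, xbar, s2, alpha = 15, 10.0, 25.0, 0.10   # summary statistics from Problem 54

# t interval for the mean
half = stats.t.ppf(1 - alpha / 2, df=n - 1) * np.sqrt(s2 / n)
ci_mu = (xbar - half, xbar + half)

# chi-square interval for the variance
ci_var = ((n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1),
          (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1))
print(ci_mu, ci_var)
```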
55. For two factors—starchy or sugary, and green base leaf or white base leaf—the following counts for the progeny of self-fertilized heterozygotes were observed (Fisher 1958):

Type            Count
Starchy green    1997
Starchy white     906
Sugary green      904
Sugary white       32

According to genetic theory, the cell probabilities are $.25(2 + \theta)$, $.25(1 - \theta)$, $.25(1 - \theta)$, and $.25\theta$, where θ (0 < θ < 1) is a parameter related to the linkage of the factors.
a. Find the mle of θ and its asymptotic variance.
b. Form an approximate 95% confidence interval for θ based on part (a).
c. Use the bootstrap to find the approximate standard deviation of the mle and compare to the result of part (a).
d. Use the bootstrap to find an approximate 95% confidence interval and compare to part (b).

56. Referring to Problem 55, consider two other estimates of θ. (1) The expected number of counts in the first cell is $n(2 + \theta)/4$; if this expected number is equated to the count $X_1$, the following estimate is obtained:

$$\tilde\theta_1 = \frac{4X_1}{n} - 2$$

(2) The same procedure done for the last cell yields

$$\tilde\theta_2 = \frac{4X_4}{n}$$

Compute these estimates. Using the fact that $X_1$ and $X_4$ are binomial random variables, show that these estimates are unbiased, and obtain expressions for their variances. Evaluate the estimated standard errors and compare them to the estimated standard error of the mle.

57. This problem is concerned with the estimation of the variance of a normal distribution with unknown mean from a sample $X_1, \ldots, X_n$ of i.i.d. normal random variables. In answering the following questions, use the fact that (from Theorem B of Section 6.3)

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

and that the mean and variance of a chi-square random variable with r df are r and 2r, respectively.
a. Which of the following estimates is unbiased?

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2 \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$$

b. Which of the estimates given in part (a) has the smaller MSE?
c. For what value of ρ does $\rho \sum_{i=1}^n (X_i - \bar X)^2$ have the minimal MSE?

58. If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur with probabilities $(1 - \theta)^2$, $2\theta(1 - \theta)$, and $\theta^2$, respectively. Plato et al. (1964) published the following data on haptoglobin type in a sample of 190 people:

Haptoglobin Type   Hp1-1   Hp1-2   Hp2-2
Count                 10      68     112

a. Find the mle of θ.
b. Find the asymptotic variance of the mle.
c. Find an approximate 99% confidence interval for θ.
d. Use the bootstrap to find the approximate standard deviation of the mle and compare to the result of part (b).
e. Use the bootstrap to find an approximate 99% confidence interval and compare to part (c).

59. Suppose that in the population of twins, males (M) and females (F) are equally likely to occur and that the probability that twins are identical is α. If twins are not identical, their genes are independent.
a. Show that

$$P(MM) = P(FF) = \frac{1 + \alpha}{4} \qquad P(MF) = \frac{1 - \alpha}{2}$$

b. Suppose that n twins are sampled. It is found that $n_1$ are MM, $n_2$ are FF, and $n_3$ are MF, but it is not known which twins are identical. Find the mle of α and its variance.
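In problems such as 55 and 58, the mle maximizes a multinomial log likelihood in a single parameter; when the likelihood equation is awkward, a numerical sketch like the following can serve as a check on hand calculation (assuming SciPy; the difference-quotient Hessian is just one convenient way to approximate the observed information):

```python
import numpy as np
from scipy.optimize import minimize_scalar

counts = np.array([1997, 906, 904, 32])   # Problem 55 cell counts

def neg_log_lik(theta):
    # Cell probabilities .25(2+theta), .25(1-theta), .25(1-theta), .25*theta
    p = np.array([2 + theta, 1 - theta, 1 - theta, theta]) / 4
    return -np.sum(counts * np.log(p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_hat = res.x

# Observed information via a symmetric difference quotient; the usual
# large-sample standard error is then 1/sqrt(information).
h = 1e-5
info = (neg_log_lik(theta_hat + h) - 2 * neg_log_lik(theta_hat)
        + neg_log_lik(theta_hat - h)) / h**2
print(theta_hat, 1 / np.sqrt(info))
```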
60. Let $X_1, \ldots, X_n$ be an i.i.d. sample from an exponential distribution with the density function

$$f(x \mid \tau) = \frac{1}{\tau} e^{-x/\tau}, \qquad 0 \le x < \infty$$

a. Find the mle of τ.
b. What is the exact sampling distribution of the mle?
c. Use the central limit theorem to find a normal approximation to the sampling distribution.
d. Show that the mle is unbiased, and find its exact variance. (Hint: The sum of the $X_i$ follows a gamma distribution.)
e. Is there any other unbiased estimate with smaller variance?
f. Find the form of an approximate confidence interval for τ.
g. Find the form of an exact confidence interval for τ.

61. Laplace's rule of succession. Laplace claimed that when an event happens n times in a row and never fails to happen, the probability that the event will occur the next time is (n + 1)/(n + 2). Can you suggest a rationale for this claim?

62. Show that the gamma distribution is a conjugate prior for the exponential distribution. Suppose that the waiting time in a queue is modeled as an exponential random variable with unknown parameter λ, and that the average time to serve a random sample of 20 customers is 5.1 minutes. A gamma distribution is used as a prior. Consider two cases: (1) the mean of the gamma is 0.5 and the standard deviation is 1, and (2) the mean is 10 and the standard deviation is 20. Plot the two posterior distributions and compare them. Find the two posterior means and compare them. Explain the differences.

63. Suppose that 100 items are sampled from a manufacturing process and 3 are found to be defective. A beta prior is used for the unknown proportion θ of defective items. Consider two cases: (1) a = b = 1, and (2) a = 0.5 and b = 5. Plot the two posterior distributions and compare them. Find the two posterior means and compare them. Explain the differences.

64. This is a continuation of the previous problem. Let X = 0 or 1 according to whether an item is defective. For each choice of the prior, what is the marginal distribution of X before the sample is taken? What are the marginal distributions after the sample is taken? (Hint: For the second question, use the posterior distribution of θ.)

65. Suppose that a random sample of size 20 is taken from a normal distribution with unknown mean and known variance equal to 1, and the mean is found to be $\bar x = 10$. A normal distribution was used as the prior for the mean, and it was found that the posterior mean was 15 and the posterior standard deviation was 0.1. What were the mean and standard deviation of the prior?

66. Let the unknown probability that a basketball player makes a shot successfully be θ. Suppose your prior on θ is uniform on [0, 1] and that she then makes two shots in a row. Assume that the outcomes of the two shots are independent.
a. What is the posterior density of θ?
b. What would you estimate the probability that she makes a third shot to be?

67. Evans (1953) considered fitting the negative binomial distribution and other distributions to a number of data sets that arose in ecological studies. Two of these sets will be used in this problem. The first data set gives frequency counts of Glaux maritima made in 500 contiguous 20-cm² quadrats. For the second data set, a plot of potato plants 48 rows wide and 96 ft long was examined. The area was split into 2304 sampling units consisting of 2-ft lengths of row, and in each unit the number of potato beetles was counted. Fit Poisson and negative binomial distributions, and comment on the goodness of fit. For these data, the method of moments should be fairly efficient.

Count   Glaux maritima   Potato Beetles
  0            1               190
  1           15               264
  2           27               304
  3           42               260
  4           77               294
  5           77               219
  6           89               183
  7           57               150
  8           48               104
  9           24                90
 10           14                60
 11           16                46
 12            9                29
 13            3                36
 14            1                19
 15                             12
 16                             11
 17                              6
 18                             10
 19                              2
 20                              4
 21                              1
 22                              3
 23                              4
 24                              1
 25                              1
 26                              0
 27                              0
 28                              1
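For Problem 67, the method of moments fits can be computed directly from the frequency table. A sketch for the potato beetle column, assuming SciPy (there the negative binomial is parametrized by a size r and success probability p, with moment matches p = m/v and r = m²/(v − m) when the variance v exceeds the mean m):

```python
import numpy as np
from scipy import stats

k = np.arange(29)
freq = np.array([190, 264, 304, 260, 294, 219, 183, 150, 104, 90, 60, 46,
                 29, 36, 19, 12, 11, 6, 10, 2, 4, 1, 3, 4, 1, 1, 0, 0, 1])

n = freq.sum()
m = np.sum(k * freq) / n                 # sample mean
v = np.sum(k**2 * freq) / n - m**2       # sample variance

p = m / v                                # negative binomial moment estimates
r = m**2 / (v - m)

expected_poisson = n * stats.poisson.pmf(k, m)
expected_negbin = n * stats.nbinom.pmf(k, r, p)
```

Comparing the two columns of expected counts with the observed frequencies (after grouping sparse cells) is then a chi-square exercise like those in Chapter 9.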
68. Let $X_1, \ldots, X_n$ be an i.i.d. sample from a Poisson distribution with mean λ, and let $T = \sum_{i=1}^n X_i$.
a. Show that the distribution of $X_1, \ldots, X_n$ given T is independent of λ, and conclude that T is sufficient for λ.
b. Show that $X_1$ is not sufficient.
c. Use Theorem A of Section 8.8.1 to show that T is sufficient. Identify the functions g and h of that theorem.

69. Use the factorization theorem (Theorem A in Section 8.8.1) to conclude that $T = \sum_{i=1}^n X_i$ is a sufficient statistic when the $X_i$ are an i.i.d. sample from a geometric distribution.

70. Use the factorization theorem to find a sufficient statistic for the exponential distribution.

71. Let $X_1, \ldots, X_n$ be an i.i.d. sample from a distribution with the density function

$$f(x \mid \theta) = \frac{\theta}{(1 + x)^{\theta + 1}}, \qquad 0 < \theta < \infty, \quad 0 \le x < \infty$$

Find a sufficient statistic for θ.

72. Show that $\sum_{i=1}^n X_i$ and $\prod_{i=1}^n X_i$ are sufficient statistics for the gamma distribution.

73. Find a sufficient statistic for the Rayleigh density,

$$f(x \mid \theta) = \frac{x}{\theta^2} e^{-x^2/(2\theta^2)}, \qquad x \ge 0$$

74. Show that the binomial distribution belongs to the exponential family.

75. Show that the gamma distribution belongs to the exponential family.

CHAPTER 9
Testing Hypotheses and Assessing Goodness of Fit

9.1 Introduction

We will introduce some of the basic concepts of this chapter by means of a simple artificial example; it is important that you read the example carefully. Suppose that I have two coins: coin 0 has probability of heads equal to 0.5, and coin 1 has probability of heads equal to 0.7. I choose one of the coins, toss it 10 times, and tell you the number of heads, but I do not tell you whether it was coin 0 or coin 1. On the basis of the number of heads, your task is to decide which coin it was. What should your decision rule be? Let X denote the number of heads. Figure 9.1 gives p(x) for each of the coins.

x        0      1      2      3      4      5      6      7      8      9      10
coin 0  .0010  .0098  .0439  .1172  .2051  .2461  .2051  .1172  .0439  .0098  .0010
coin 1  .0000  .0001  .0014  .0090  .0368  .1029  .2001  .2668  .2335  .1211  .0282

FIGURE 9.1

Suppose that you observed two heads. Then $P_0(2)/P_1(2)$ is about 30, which we will call the likelihood ratio—coin 0 was about 30 times more likely to produce this result than was coin 1. This result would favor coin 0. On the other hand, if there were 8 heads, the likelihood ratio would be .0439/.2335 = 0.19, which would favor coin 1. The likelihood ratio will play a central role in the procedures we develop.

We specify two hypotheses, $H_0$ and $H_1$, according to whether coin 0 or coin 1 was tossed. We first develop a Bayesian methodology for assessing the evidence for each of the hypotheses. This approach requires the specification of prior probabilities $P(H_0)$ and $P(H_1)$ for each of the hypotheses before observing any data. If you believed that I had no reason to choose coin 0 over coin 1, you would take $P(H_0) = P(H_1) = 1/2$. After observing the number of heads, your posterior probabilities would be $P(H_0 \mid x)$ and $P(H_1 \mid x)$. The former would be

$$P(H_0 \mid x) = \frac{P(H_0, x)}{P(x)} = \frac{P(x \mid H_0)P(H_0)}{P(x)}$$

The ratio

$$\frac{P(H_0 \mid x)}{P(H_1 \mid x)} = \frac{P(H_0)}{P(H_1)} \cdot \frac{P(x \mid H_0)}{P(x \mid H_1)}$$

is the product of the ratio of prior probabilities and the likelihood ratio. Thus, the evidence provided by the data is contained in the likelihood ratio, which is multiplied by the ratio of prior probabilities to produce the ratio of posterior probabilities. The likelihood ratio is evaluated in Figure 9.2.

x                    0      1      2      3      4      5      6      7       8       9       10
P(x|H0)/P(x|H1)   165.4  70.88  30.38  13.02  5.579  2.391  1.025  0.4392  0.1882  0.0807  0.0346

FIGURE 9.2
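The entries of Figures 9.1 and 9.2 are straightforward to reproduce; a quick sketch, assuming SciPy:

```python
import numpy as np
from scipy.stats import binom

x = np.arange(11)
p0 = binom.pmf(x, 10, 0.5)   # coin 0
p1 = binom.pmf(x, 10, 0.7)   # coin 1
for xi, ratio in zip(x, p0 / p1):
    print(f"{xi:2d}  {ratio:8.4f}")   # likelihood ratios as in Figure 9.2
```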
(The numbers in Figure 9.2 do not precisely agree with the ratios of the numbers in Figure 9.1, because the probabilities in Figure 9.1 are rounded to four decimal places.) The evidence is increasingly favorable to $H_0$ as x decreases; i.e., the likelihood ratio is monotonic in x. If one's prior probabilities were equal, then for zero to six heads $H_0$ would be more probable, and for seven to ten heads $H_1$ would be more probable. If the prior probabilities change, the breakpoint changes.

If you were asked to choose $H_0$ or $H_1$ on the basis of the data x, it seems reasonable that you would choose the hypothesis with the larger posterior probability. You would choose $H_0$ if

$$\frac{P(H_0 \mid x)}{P(H_1 \mid x)} = \frac{P(H_0)}{P(H_1)} \cdot \frac{P(x \mid H_0)}{P(x \mid H_1)} > 1$$

or, equivalently, if

$$\frac{P(x \mid H_0)}{P(x \mid H_1)} > c$$

where the critical value c depends on your prior probabilities. Your decision would be based on the likelihood ratio: you accept $H_0$ if the likelihood ratio is greater than c, and you reject $H_0$ if the likelihood ratio is less than c.

Let us now further examine the consequences of a particular decision rule, i.e., a particular specification of the constant c. First suppose that c = 1; then $H_0$ is accepted as long as X ≤ 6 and is rejected in favor of $H_1$ if X > 6. We can make two possible errors: reject $H_0$ when it is true, or accept $H_0$ when it is false. The probabilities of these two possible errors can be evaluated as follows:

$$P(\text{reject } H_0 \mid H_0) = P(X > 6 \mid H_0) = 0.17$$

Here we used the binomial probabilities above corresponding to $H_0$. Similarly, the probability of the other kind of error is

$$P(\text{accept } H_0 \mid H_1) = P(X \le 6 \mid H_1) = 0.35$$

Now suppose that c = 0.1, which corresponds to $P(H_0)/P(H_1) = 10$. Then, from Figure 9.2, $H_0$ is accepted when X ≤ 8. Compared to equal odds, more extreme evidence is required to reject $H_0$, because the prior probabilities greatly favor $H_0$. The probabilities of the two types of errors are then

$$P(\text{reject } H_0 \mid H_0) = P(X > 8 \mid H_0) = 0.01$$
$$P(\text{accept } H_0 \mid H_1) = P(X \le 8 \mid H_1) = 0.85$$

In this way, we see that there is a correspondence between the prior probabilities and the probabilities of the two types of errors. The constant c controls the tradeoff between the probabilities of the two types of errors.

9.2 The Neyman-Pearson Paradigm

Rather than using a Bayesian approach, Neyman and Pearson formulated their theory of hypothesis testing by casting it as a decision problem and making the probabilities of the two types of errors central, thus bypassing the necessity of specifying prior probabilities. In doing so, this approach introduced an asymmetry: one hypothesis is singled out as the null hypothesis and the other as the alternative hypothesis, the former usually denoted by $H_0$ and the latter by $H_1$ or $H_A$. We will see later through examples how this specification is naturally made, but for now we will continue with the example of the previous section and arbitrarily declare $H_0$ to be the null hypothesis. The following terminology is standard:

• Rejecting $H_0$ when it is true is called a type I error.
• The probability of a type I error is called the significance level of the test and is usually denoted by α.
• Accepting the null hypothesis when it is false is called a type II error, and its probability is usually denoted by β.
• The probability that the null hypothesis is rejected when it is false is called the power of the test and equals 1 − β.
• We have seen in this example how rejecting $H_0$ when the likelihood ratio is less than a constant c is equivalent to rejecting when the number of heads is greater than some value $x_0$. The likelihood ratio, or equivalently the number of heads, is called the test statistic.
• The set of values of the test statistic that leads to rejection of the null hypothesis is called the rejection region, and the set of values that leads to acceptance is called the acceptance region.
• The probability distribution of the test statistic when the null hypothesis is true is called the null distribution.
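The two error probabilities for any cutoff of the form "accept $H_0$ when X ≤ x0" can be read off the binomial distributions; a sketch for the cutoffs x0 = 6 (c = 1) and x0 = 8 (c = 0.1), again assuming SciPy:

```python
from scipy.stats import binom

for x0 in (6, 8):
    alpha = binom.sf(x0, 10, 0.5)    # P(X > x0 | H0), type I error probability
    beta = binom.cdf(x0, 10, 0.7)    # P(X <= x0 | H1), type II error probability
    print(x0, round(alpha, 4), round(beta, 4))
```

This prints approximately (6, .1719, .3504) and (8, .0107, .8507), the values rounded in the text.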
In the example in the introduction to this chapter, the null and alternative hypotheses each completely specify the probability distribution of the number of heads, as binomial(10, 0.5) or binomial(10, 0.7), respectively. Such hypotheses are called simple hypotheses. The Neyman-Pearson Lemma shows that basing the test on the likelihood ratio as we did is optimal:

NEYMAN-PEARSON LEMMA
Suppose that $H_0$ and $H_1$ are simple hypotheses and that the test that rejects $H_0$ whenever the likelihood ratio is less than c has significance level α. Then any other test for which the significance level is less than or equal to α has power less than or equal to that of the likelihood ratio test. ■

The point is that there are many possible tests: any test that rejects only when the observations fall in a set having probability less than or equal to α under the null hypothesis has significance level less than or equal to α by construction. Among all such tests, the one based on the likelihood ratio maximizes the power.

Proof
Let f(x) denote the probability density function or frequency function of the observations. A test of $H_0$: f(x) = $f_0$(x) versus $H_1$: f(x) = $f_A$(x) amounts to using a decision function d(x), where d(x) = 0 if $H_0$ is accepted and d(x) = 1 if $H_0$ is rejected. Since d(X) is a Bernoulli random variable, E(d(X)) = P(d(X) = 1). The significance level of the test is thus $\alpha = P_0(d(X) = 1) = E_0(d(X))$, and the power is $P_A(d(X) = 1) = E_A(d(X))$. Here $E_0$ denotes expectation under the probability law specified by $H_0$, etc.

Let d(x) correspond to the likelihood ratio test: d(x) = 1 if $f_0(x) < c f_A(x)$ and d(x) = 0 otherwise. Let $d^*(x)$ be the decision function of any other test with significance level less than or equal to α, so that $E_0(d^*(X)) \le \alpha = E_0(d(X))$. The key observation is that

$$d^*(x)[c f_A(x) - f_0(x)] \le d(x)[c f_A(x) - f_0(x)]$$

for every x: if d(x) = 1, then $c f_A(x) - f_0(x) > 0$ and $d^*(x) \le 1$; and if d(x) = 0, then $c f_A(x) - f_0(x) \le 0$ and $d^*(x) \ge 0$. Now integrating (or summing) both sides of the inequality above with respect to x gives

$$c E_A(d^*(X)) - E_0(d^*(X)) \le c E_A(d(X)) - E_0(d(X))$$

and thus

$$E_0(d(X)) - E_0(d^*(X)) \le c[E_A(d(X)) - E_A(d^*(X))]$$

The conclusion follows since the left-hand side of this inequality is nonnegative by assumption. ■

EXAMPLE A
Let $X_1, \ldots, X_n$ be a random sample from a normal distribution having known variance $\sigma^2$. Consider two simple hypotheses:

$H_0$: μ = μ0
$H_A$: μ = μ1

where μ0 and μ1 are given constants. Let the significance level α be prescribed. The Neyman-Pearson Lemma states that among all tests with significance level α, the test that rejects for small values of the likelihood ratio is most powerful. We thus calculate the likelihood ratio, which is

$$\frac{f_0(\mathbf{X})}{f_1(\mathbf{X})} = \frac{\exp\left[-\dfrac{1}{2\sigma^2}\displaystyle\sum_{i=1}^n (X_i - \mu_0)^2\right]}{\exp\left[-\dfrac{1}{2\sigma^2}\displaystyle\sum_{i=1}^n (X_i - \mu_1)^2\right]}$$

since the multipliers of the exponentials cancel. Small values of this statistic correspond to small values of $\sum_{i=1}^n (X_i - \mu_1)^2 - \sum_{i=1}^n (X_i - \mu_0)^2$.
Expanding the squares, we see that the latter expression reduces to

$$2n\bar X(\mu_0 - \mu_1) + n\mu_1^2 - n\mu_0^2$$

Now, if $\mu_0 - \mu_1 > 0$, the likelihood ratio is small if $\bar X$ is small; if $\mu_0 - \mu_1 < 0$, the likelihood ratio is small if $\bar X$ is large. To be concrete, let us assume the latter case. We then know that the likelihood ratio is a function of $\bar X$ and is small when $\bar X$ is large. The Neyman-Pearson Lemma thus tells us that the most powerful test rejects for $\bar X > x_0$ for some $x_0$, and we choose $x_0$ so as to give the test the desired level α. That is, $x_0$ is chosen so that $P(\bar X > x_0) = \alpha$ if $H_0$ is true. Under $H_0$ in this example, the null distribution of $\bar X$ is a normal distribution with mean μ0 and variance $\sigma^2/n$, so $x_0$ can be chosen from tables of the standard normal distribution. Since

$$P(\bar X > x_0) = P\left(\frac{\bar X - \mu_0}{\sigma/\sqrt{n}} > \frac{x_0 - \mu_0}{\sigma/\sqrt{n}}\right)$$

we can solve

$$\frac{x_0 - \mu_0}{\sigma/\sqrt{n}} = z(\alpha)$$

for $x_0$ in order to find the rejection region for a level α test. Here, as usual, z(α) denotes the upper α point of the standard normal distribution; that is, if Z is a standard normal random variable, then P(Z > z(α)) = α. ■

This example is typical of the way the Neyman-Pearson Lemma is used. We write down the likelihood ratio and observe that small values of it correspond in a one-to-one manner with extreme values of a test statistic, in this case $\bar X$. Knowing the null distribution of the test statistic makes it possible to choose a critical level that produces a desired significance level α.

Unfortunately, the Neyman-Pearson Lemma is of little direct utility in most practical problems, because the case of testing a simple null hypothesis versus a simple alternative is rarely encountered. If a hypothesis does not completely specify the probability distribution, it is called a composite hypothesis. Here are some examples:

EXAMPLE B Goodness-of-Fit Test
Let $X_1, X_2, \ldots, X_n$ be a sample from a discrete probability distribution. The null hypothesis could be that the distribution is Poisson with some unspecified mean, and the alternative could be that the distribution is not Poisson. For example, we might want to test whether a Poisson model is reasonable for the data of Example A in Section 8.4. Since the null hypothesis does not completely specify the distribution of the $X_i$'s, it is composite. If the null hypothesis were refined to state that the distribution was Poisson with some specified mean, it would be simple. The alternative hypothesis does not completely specify the distribution, so it is composite. We will take up the subject of testing for goodness of fit later in this chapter. ■

EXAMPLE C Testing for ESP
Consider a hypothetical experiment in which a subject is asked to identify, without looking, the suits of 20 cards drawn randomly with replacement from a 52-card deck. Let T be the number of correct identifications. The null hypothesis states that the person is purely guessing, and the alternative states that the person has extrasensory ability. The null hypothesis is simple, because under it T is binomial(20, 0.25). The alternative does not completely specify the distribution of T, so it is composite; note that it does not even specify that the distribution is binomial. ■

This example is useful for further illustrating two other issues that arise in hypothesis testing: the specification of the significance level and the choice of the null hypothesis.
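Example A reduces to an explicit recipe: reject when $\bar X > \mu_0 + z(\alpha)\sigma/\sqrt{n}$. A sketch with illustrative numbers (the values of μ0, μ1, σ, n, and α below are arbitrary choices, not from the text):

```python
import numpy as np
from scipy.stats import norm

mu0, mu1, sigma, n, alpha = 0.0, 0.5, 1.0, 25, 0.05   # illustrative values

x0 = mu0 + norm.ppf(1 - alpha) * sigma / np.sqrt(n)   # critical value
power = norm.sf((x0 - mu1) / (sigma / np.sqrt(n)))    # P(Xbar > x0 | mu = mu1)
print(x0, power)
```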
9.2.1 Specification of the Significance Level and the Concept of a p-value

One of the strengths of the Neyman-Pearson approach is that only the distribution under the null hypothesis is needed in order to construct a test. In Example C above, it would be conventional and convenient to take the null hypothesis to be that of pure guessing; we discuss this further in the next section. In this case, the null distribution of T is binomial(20, 0.25). Because large values of T tend to lend credence to the alternative, the rejection region would be of the form {T ≥ t0}, where t0 is chosen so that $P(T \ge t_0 \mid H_0) = \alpha$, the desired significance level of the test. For example, from calculating binomial probabilities, we find P(T ≥ 12) = .0009, so for this choice of the critical region, the null hypothesis of no ESP ability would be falsely rejected only with probability about one in a thousand. Note that we did not need to specify the form of the probability distribution under the alternative, but used only the notion that if the alternative were true, the subject would be expected to correctly identify more suits than if purely guessing. In comparison, a fully Bayesian treatment would have to specify the distribution under the alternative as well as prior probabilities.

The theory requires the specification of the significance level α in advance of analyzing the data, but gives no guidance about how to make this choice. In practice, the choice of α is almost always essentially arbitrary, but heavily influenced by custom. Small values, such as 0.01 and 0.05, are commonly used. Another criticism of the paradigm is that it is built on the assumption that one must either reject or not reject a hypothesis, when typically no such decision is actually required. The theory is thus often applied in a hypothetical manner. For example, suppose that the subject above guessed the suit correctly nine times. Since $P(T \ge 9 \mid H_0) = .041$, the null hypothesis would have been "rejected" at the significance level α = .05, if one were actually "rejecting" or "not rejecting." Thus, the evidence is often summarized as a p-value, which is defined to be the smallest significance level at which the null hypothesis would be rejected. If nine suits were identified correctly, the p-value would be .041. If ten suits were identified, it would be .014, since $P(T \ge 10 \mid H_0) = .014$, and so on.

The use of a p-value to summarize the evidence against the null hypothesis was advocated by the eminent statistician Sir Ronald Fisher. But rather than casting it within a hypothetical framework of "rejection," he thought of the p-value as the probability, under the null hypothesis, of a result as or more extreme than that actually observed. So, for example, in the case that ten suits are identified, the p-value is the chance of someone getting at least that many correct by purely guessing. The smaller the p-value, the stronger the evidence against the null hypothesis.

The Bayesian paradigm summarizes the evidence for and against the null hypothesis as a posterior probability. Its application depends on specifying probability models under both the null and the alternative and on assigning meaningful prior probabilities. It is important to understand that a p-value is not the posterior probability that the null hypothesis is true. To reiterate, the p-value is the probability of a result as or more extreme than that actually observed if the null hypothesis were true. This is a probability, but it is not the posterior probability that the null hypothesis is true; the latter depends on the specification of prior probabilities. Consider the example of Section 9.1. If x = 8 heads are observed, the p-value is .0439 + .0098 + .0010 = .0546, or about 5%. Suppose that the prior probabilities were equal. The likelihood ratio then equals the posterior odds, $P(H_0 \mid x)/(1 - P(H_0 \mid x)) = 0.1882$, from which it follows that $P(H_0 \mid x) = 0.1584$, or about 16%.
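The contrast between the two summaries can be made concrete; a sketch for x = 8 heads in the coin example, assuming SciPy:

```python
from scipy.stats import binom

p_value = binom.sf(7, 10, 0.5)        # P(X >= 8 | H0), about .0547
lr = binom.pmf(8, 10, 0.5) / binom.pmf(8, 10, 0.7)
posterior_h0 = lr / (1 + lr)          # equal priors: posterior odds = likelihood ratio
print(round(p_value, 4), round(posterior_h0, 4))   # about .05 versus about .16
```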
9.2.2 The Null Hypothesis

As should be clear by now, there is an asymmetry in the Neyman-Pearson paradigm between the null and alternative hypotheses. The decision as to which is the null and which is the alternative hypothesis is not a mathematical one; it depends on scientific context, custom, and convenience. This will gradually become clearer as we see more real examples in this and later chapters; for now, we will make only the following remarks:

• In Example B of Section 9.2, we chose as the null hypothesis the hypothesis that the distribution was Poisson and as the alternative the hypothesis that the distribution was not Poisson. In this case, the null hypothesis is simpler than the alternative, which in a sense contains more distributions than does the null. It is conventional to choose the simpler of two hypotheses as the null.
• The consequences of incorrectly rejecting one hypothesis may be graver than those of incorrectly rejecting the other. In such a case, the former should be chosen as the null hypothesis, because the probability of falsely rejecting it can be controlled by choosing α. Examples of this kind arise in screening new drugs: frequently, it must be documented rather conclusively that a new drug has a positive effect before it is accepted for general use.
• In scientific investigations, the null hypothesis is often a simple explanation that must be discredited in order to demonstrate the presence of some physical phenomenon or effect. The hypothetical ESP experiment referred to earlier falls in this category; the null hypothesis states that the subject is merely guessing, that there is no ESP. The validity of the null hypothesis would not be cast in doubt unless the results were extremely unlikely under the null. We will see other examples of this type beginning in Chapter 11.

9.2.3 Uniformly Most Powerful Tests

The optimality result of the Neyman-Pearson Lemma requires that both hypotheses be simple. In some cases, the theory can be extended to include composite hypotheses. If the alternative $H_1$ is composite, a test that is most powerful for every simple alternative in $H_1$ is said to be uniformly most powerful.

EXAMPLE A
Continuing with Example A of Section 9.2, consider testing $H_0$: μ = μ0 versus $H_1$: μ > μ0. In Example A, we saw that for a particular alternative μ1 > μ0, the most powerful test rejects for $\bar X > x_0$, where $x_0$ depends on μ0, σ, and n, but not on μ1. Because this test is most powerful and is the same for every alternative, it is uniformly most powerful. ■

It can also be argued that the test is uniformly most powerful for testing $H_0$: μ ≤ μ0 versus $H_1$: μ > μ0. But it is not uniformly most powerful for testing $H_0$: μ = μ0 versus $H_1$: μ ≠ μ0. This follows from further examination of the example, which shows that the test that is most powerful against the alternative μ > μ0 rejects for large values of $\bar X - \mu_0$, whereas the test that is most powerful against the alternative μ < μ0 rejects for small values of $\bar X - \mu_0$.
The most powerful test is thus not the same for every alternative. In typical composite situations, there is no uniformly most powerful test. The alternatives $H_1$: μ < μ0 and $H_1$: μ > μ0 are called one-sided alternatives; the alternative $H_1$: μ ≠ μ0 is a two-sided alternative.

9.3 The Duality of Confidence Intervals and Hypothesis Tests

There is a duality between confidence intervals (more generally, confidence sets) and hypothesis tests. In this section, we will show that a confidence set can be obtained by "inverting" a hypothesis test, and vice versa. Before presenting the general structure, we consider an example.

EXAMPLE A
Let $X_1, \ldots, X_n$ be a random sample from a normal distribution having unknown mean μ and known variance $\sigma^2$. We consider testing the following hypotheses:

$H_0$: μ = μ0
$H_A$: μ ≠ μ0

Consider a test at a specific level α that rejects for $|\bar X - \mu_0| > x_0$, where $x_0$ is determined so that $P(|\bar X - \mu_0| > x_0) = \alpha$ if $H_0$ is true: $x_0 = \sigma_{\bar X}\, z(\alpha/2)$. Here the standard deviation of $\bar X$ is denoted by $\sigma_{\bar X} = \sigma/\sqrt{n}$. The test thus accepts when

$$|\bar X - \mu_0| < \sigma_{\bar X}\, z(\alpha/2)$$

or

$$-\sigma_{\bar X}\, z(\alpha/2) < \bar X - \mu_0 < \sigma_{\bar X}\, z(\alpha/2)$$

For the generalized likelihood ratio test of these same hypotheses, it can be shown that the test rejects when

$$-2 \log \Lambda = \frac{n(\bar X - \mu_0)^2}{\sigma^2} > \chi^2_1(\alpha)$$

Again using the fact that a chi-square random variable with 1 degree of freedom is the square of a standard normal random variable, we can rewrite this relation to show that the rejection region for the test is

$$|\bar X - \mu_0| \ge \frac{\sigma}{\sqrt{n}}\, z(\alpha/2)$$ ■

The preceding derivation has been rather formal, but upon examination the result looks perfectly reasonable, or perhaps even so obvious as to make us doubt the value of the formal exercise: The test of $H_0$: μ = μ0 versus $H_1$: μ ≠ μ0 rejects when $|\bar X - \mu_0|$ is large. The test does not reject when

$$-\sigma z(\alpha/2)/\sqrt{n} \le \bar X - \mu_0 \le \sigma z(\alpha/2)/\sqrt{n}$$

or, equivalently, when

$$\bar X - \sigma z(\alpha/2)/\sqrt{n} \le \mu_0 \le \bar X + \sigma z(\alpha/2)/\sqrt{n}$$

That is, the test does not reject when μ0 lies in a 100(1 − α)% confidence interval for μ. Compare to Example A of Section 9.3.

In order for the likelihood ratio test to have significance level α, λ0 must be chosen so that $P(\Lambda \le \lambda_0) = \alpha$ if $H_0$ is true. If the sampling distribution of Λ under $H_0$ is known, we can determine λ0. Generally, the sampling distribution is not of a simple form, but in many situations the following theorem provides the basis for an approximation to the null distribution.

THEOREM A
Under smoothness conditions on the probability density or frequency functions involved, the null distribution of −2 log Λ tends to a chi-square distribution with degrees of freedom equal to dim Ω − dim ω0 as the sample size tends to infinity. ■

The proof, which is beyond the scope of this book, is based on a second-order Taylor series expansion. In the statement of Theorem A, dim Ω and dim ω0 are the numbers of free parameters under Ω and ω0, respectively. In Example A, the null hypothesis completely specifies μ and σ²; there are no free parameters under ω0, so dim ω0 = 0. Under Ω, σ is fixed but μ is free, so dim Ω = 1. For this example, the null distribution of −2 log Λ is exactly $\chi^2_1$.
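That the null distribution is exactly $\chi^2_1$ here is easy to confirm by simulation; a sketch (all settings illustrative):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
mu0, sigma, n = 0.0, 1.0, 30

# Simulate -2 log(lambda) = n (Xbar - mu0)^2 / sigma^2 under H0
xbar = rng.normal(mu0, sigma, size=(10000, n)).mean(axis=1)
stat = n * (xbar - mu0) ** 2 / sigma**2

print(np.quantile(stat, [0.50, 0.90, 0.95]))   # empirical quantiles
print(chi2.ppf([0.50, 0.90, 0.95], df=1))      # chi-square(1) quantiles
```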
9.5 Likelihood Ratio Tests for the Multinomial Distribution

In this section we will develop a generalized likelihood ratio test of the goodness of fit of a model for multinomial cell probabilities. Under the model, the vector of cell probabilities p is described by a hypothesis $H_0$, which specifies that p = p(θ), θ ∈ ω0, where θ is a parameter that may be unknown. For example, in Section 8.2 we considered fitting Poisson probabilities that depended on an unknown parameter (there called λ, which played the role of θ) to the cell counts in a table. We want to judge the plausibility of the model relative to a model $H_1$ in which the cell probabilities are free except for the constraints that they are nonnegative and sum to 1. If there are m cells, Ω is thus the set consisting of m nonnegative numbers that sum to 1.

The numerator of the likelihood ratio is

$$\max_{p \in \omega_0} \frac{n!}{x_1! \cdots x_m!}\, p_1(\theta)^{x_1} \cdots p_m(\theta)^{x_m}$$

where the $x_i$ are the observed counts in the m cells. By the definition of the maximum likelihood estimate, this likelihood is maximized when θ is the maximum likelihood estimate $\hat\theta$. The corresponding probabilities will be denoted by $p_i(\hat\theta)$. Since the probabilities are unrestricted under Ω, the denominator is maximized by the unrestricted mle's,

$$\hat p_i = \frac{x_i}{n}$$

The likelihood ratio is, therefore,

$$\Lambda = \frac{\dfrac{n!}{x_1! \cdots x_m!}\, p_1(\hat\theta)^{x_1} \cdots p_m(\hat\theta)^{x_m}}{\dfrac{n!}{x_1! \cdots x_m!}\, \hat p_1^{x_1} \cdots \hat p_m^{x_m}} = \prod_{i=1}^m \left[\frac{p_i(\hat\theta)}{\hat p_i}\right]^{x_i}$$

Also, since $x_i = n\hat p_i$,

$$-2\log\Lambda = -2n\sum_{i=1}^m \hat p_i \log\frac{p_i(\hat\theta)}{\hat p_i} = 2\sum_{i=1}^m O_i \log\frac{O_i}{E_i}$$

where $O_i = n\hat p_i$ and $E_i = n p_i(\hat\theta)$ denote the observed and expected counts, respectively.

Under Ω, the cell probabilities are allowed to be free, subject to the constraint that they sum to 1, so dim Ω = m − 1. If, under $H_0$, the probabilities $p_i(\hat\theta)$ depend on a k-dimensional parameter θ that has been estimated from the data, dim ω0 = k. The large sample distribution of −2 log Λ is thus a chi-square distribution with m − k − 1 degrees of freedom (the number of cells minus the number of estimated parameters minus 1).

Pearson's chi-square statistic is commonly used to test for goodness of fit:

$$X^2 = \sum_{i=1}^m \frac{[x_i - np_i(\hat\theta)]^2}{np_i(\hat\theta)}$$

Pearson's statistic and the likelihood ratio are asymptotically equivalent under $H_0$. To indicate heuristically why this is so, we will go through a Taylor series argument. To begin,

$$-2\log\Lambda = 2n\sum_{i=1}^m \hat p_i \log\frac{\hat p_i}{p_i(\hat\theta)}$$

If $H_0$ is true and n is large, $\hat p_i \approx p_i(\hat\theta)$. The Taylor series expansion of the function $f(x) = x\log(x/x_0)$ about $x_0$ is

$$f(x) = (x - x_0) + \frac{1}{2}(x - x_0)^2\,\frac{1}{x_0} + \cdots$$

Thus,

$$-2\log\Lambda \approx 2n\sum_{i=1}^m [\hat p_i - p_i(\hat\theta)] + n\sum_{i=1}^m \frac{[\hat p_i - p_i(\hat\theta)]^2}{p_i(\hat\theta)}$$

The first term on the right-hand side is equal to 0, since the probabilities sum to 1, and the second term on the right-hand side may be expressed as

$$\sum_{i=1}^m \frac{[x_i - np_i(\hat\theta)]^2}{np_i(\hat\theta)}$$

since $x_i$, the observed count, equals $n\hat p_i$ for i = 1, ..., m. We have argued for the approximate equivalence of the two test statistics. Pearson's test has been more commonly used than the likelihood ratio test, because it is somewhat easier to calculate without the use of a computer. Let us consider some examples.
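Both statistics are one-liners once the observed and expected counts are in hand; a small helper, assuming SciPy (cells with zero observed counts would need to be grouped before the logarithm is taken):

```python
import numpy as np
from scipy.stats import chi2

def multinomial_gof(observed, expected, k_estimated):
    # Likelihood ratio statistic 2 * sum O log(O/E), Pearson's X^2, and the
    # approximate p-value on m - k - 1 degrees of freedom.
    o = np.asarray(observed, dtype=float)
    e = np.asarray(expected, dtype=float)
    lr = 2 * np.sum(o * np.log(o / e))
    x2 = np.sum((o - e) ** 2 / e)
    df = len(o) - k_estimated - 1
    return lr, x2, chi2.sf(x2, df)
```

For instance, `multinomial_gof([342, 500, 187], [340.6, 502.8, 185.6], 1)` reproduces the X² ≈ .032 of the Hardy-Weinberg example below.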
EXAMPLE A Hardy-Weinberg Equilibrium
Hardy-Weinberg equilibrium was first introduced in Example A in Section 8.5.1. We will now test whether this model fits the observed data. Recall that the Hardy-Weinberg equilibrium model says that the cell probabilities are $(1 - \theta)^2$, $2\theta(1 - \theta)$, and $\theta^2$. Using the maximum likelihood estimate for θ, $\hat\theta = .4247$, and multiplying the resulting probabilities by the sample size n = 1029, we calculate expected counts, which are compared with observed counts in the following table:

Blood Type      M      MN      N
Observed       342     500    187
Expected      340.6   502.8  185.6

The null hypothesis will be that the multinomial distribution is as specified by the Hardy-Weinberg equilibrium frequencies, with unknown parameter θ. The alternative hypothesis will be that the multinomial distribution does not have probabilities of that specified form. We first choose a value for α, the significance level for the test (recall that the significance level is the probability of falsely rejecting the hypothesis that the multinomial cell probabilities are as specified by genetic theory). In this application, there is no compelling reason to choose any particular value of α, so we will follow convention and let α = .05. This means that our decision rule will falsely reject $H_0$ in only 5% of cases. We will use Pearson's chi-square test, and therefore $X^2$ as our test statistic. The null distribution of $X^2$ is approximately chi-square with 1 degree of freedom. (There are two independent cells, and one parameter has been estimated from the data.) Since, from Table 3 in Appendix B, the point defining the upper 5% of the chi-square distribution with 1 degree of freedom is 3.84, the test rejects if $X^2 > 3.84$. We next calculate $X^2$:

$$X^2 = \sum \frac{(O - E)^2}{E} = .0319$$

Thus, the null hypothesis is not rejected.

There is a certain unnecessary rigidity in this procedure, because it is not clear that such a decision (to reject or not) has to be made at all. There is also a certain arbitrariness: There was no strong reason to let α = .05, but that choice essentially determined our decision. If we had let α = .01, the decision would have been the same, since $\chi^2(.01) > \chi^2(.05)$; but what if we had let α = .10, or .20? It is here that the concept of the p-value becomes useful. Recall that the p-value is the smallest significance level at which the null hypothesis would be rejected. From a computer calculation of the chi-square distribution (or from a table of the normal distribution, since a chi-square distribution with 1 degree of freedom is the square of a standard normal random variable), the probability that a chi-square random variable with 1 degree of freedom is greater than or equal to .0319 is .86, which is the p-value. Another interpretation of this p-value is that, if the model were correct, deviations this large or larger would occur 86% of the time. Thus, the data give us no reason to doubt the model.

In comparison, the likelihood ratio test statistic is

$$-2\log\Lambda = 2\sum_{i=1}^3 O_i \log\frac{O_i}{E_i} = .0319$$

The two tests lead to the same conclusion. Finally, we note that the actual maximized likelihood ratio is $\Lambda = \exp(-.0319/2) = .98$. Thus, the Hardy-Weinberg model is almost as likely as the most general possible model. ■

EXAMPLE B Bacterial Clumps
In testing milk for bacterial contamination, 0.01 ml of milk is spread over an area of 1 cm² on a slide. The slide is mounted on a microscope, and counts of bacterial clumps within grid squares are made. The Poisson model appears quite reasonable for the distribution of the clumps at first glance: The clumps are presumably mixed uniformly throughout the milk, and there is no reason to suspect that the clumps bunch together. However, on closer examination, two possible problems are noted.
First, bacteria held by surface tension on the lower surface of the drop may adhere to the glass slide on contact, producing increased concentrations in that area of the film. Second, the film is not of uniform thickness, being thicker in the center and thinner at the edges, giving rise to nonuniform concentrations of bacteria. The following table, taken from Bliss and Fisher (1953), summarizes the counts of clumps on 400 grid squares:

Number per Square   0    1    2    3    4    5   6   7   8   9   10   19
Frequency          56  104   80   62   42   27   9   9   5   3    2    1

To fit a Poisson distribution to these data, we compute the mle, $\hat\lambda$, which is the mean of the 400 counts:

$$\hat\lambda = \frac{0 \times 56 + 1 \times 104 + 2 \times 80 + \cdots + 19 \times 1}{400} = 2.44$$

The following table shows the observed and expected counts and the components of the chi-square test statistic. (The last several cells were grouped together so that the minimum expected count would be nearly 5.)

Observed            56    104    80     62    42    27     9     20
Expected           34.9   85.1  103.8  84.4  51.5  25.1  10.2   5.0
Component of X²    12.8    4.2    5.5   5.9   1.8   .14   .14  45.0

The chi-square statistic is $X^2 = 75.4$. With 6 degrees of freedom (there are eight cells, and one parameter has been estimated from the data), the null hypothesis is conclusively rejected [$\chi^2_6(.005) = 18.55$, so the p-value is less than .005].

When a goodness-of-fit test rejects, it is instructive to find out why: where does the model fail to fit? This can be seen by looking at the cells that make large contributions to $X^2$ and at the signs of the observed minus expected counts for those cells. We see here that the greatest contributions to $X^2$ come from the first and last cells of the table—there are too many small counts and too many large counts relative to what is expected for a Poisson distribution. ■

EXAMPLE C Fisher's Reexamination of Mendel's Data
In one of his famous experiments, Mendel crossed 556 smooth, yellow male peas with wrinkled, green female peas. According to now established genetic theory, the relative frequencies of the progeny should be as given in the following table:

Type               Frequency
Smooth yellow        9/16
Smooth green         3/16
Wrinkled yellow      3/16
Wrinkled green       1/16

The counts that Mendel recorded and the expected counts are given in the following table:

Type              Observed Count   Expected Count
Smooth yellow          315             312.75
Smooth green           108             104.25
Wrinkled yellow        102             104.25
Wrinkled green          31              34.75

Calculating the likelihood ratio test statistic, we obtain

$$-2\log\Lambda = 2\sum_{i=1}^4 O_i \log\frac{O_i}{E_i} = .618$$

Comparing this value with the chi-square distribution with 3 degrees of freedom (three independent parameters are estimated under Ω and none under ω0), we have a p-value of slightly less than .9. Pearson's chi-square statistic is .604, which is quite close to the value from the likelihood ratio test. We interpret the p-value as meaning that, even if the model were correct, discrepancies this large or larger would be expected to occur on the basis of chance about 90% of the time. There is thus no reason to reject the hypothesis that the counts come from a multinomial distribution with the prescribed probabilities. We would tend to doubt this hypothesis only for small p-values.

The p-value can also be interpreted to mean that, on the basis of chance, we would expect agreement this close or closer only about 10% of the time. There is some validity to the suggestion that the data agree with the model too well; if the p-value had been .999, for example, we would definitely be suspicious.
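Mendel's table makes a compact worked example of both statistics; a sketch, assuming SciPy:

```python
import numpy as np
from scipy.stats import chisquare, chi2

observed = np.array([315, 108, 102, 31])
expected = observed.sum() * np.array([9, 3, 3, 1]) / 16

x2, p_pearson = chisquare(observed, expected)       # X^2 = .604 on df = 3
lr = 2 * np.sum(observed * np.log(observed / expected))
print(x2, p_pearson, lr, chi2.sf(lr, df=3))         # p-values near .9
```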
Fisher pooled the results of all of Mendel's experiments in the following way. Suppose that two independent experiments give chi-square statistics with p and r degrees of freedom, respectively. Then, under the null hypothesis that the models are correct, the sum of the two test statistics follows a chi-square distribution with p + r degrees of freedom. Fisher added the chi-square statistics for all the independent experiments and compared the result with the chi-square distribution with degrees of freedom equal to the sum of all the degrees of freedom. The resulting p-value was .99996. Such close agreement would occur only 4 times out of 100,000 on the basis of chance! What happened? Did Mendel deliberately or unconsciously fudge the data? Did he have an overzealous lab technician who was hoping for a recommendation to medical school? Was there divine intervention? Perhaps the best explanation is that Mendel continued experimenting until the results looked good and then stopped. The statistical analysis here assumes that the sample size is fixed before the data are collected. ■

Mendel is not the only scientist whose data are too good to be true. Cyril Burt was an English psychologist whose work had a great impact on the debate concerning the genetic basis for intelligence. His many papers and extensive data argue for such a basis. In 1946, Burt became the first psychologist to be knighted; however, during the 1970s, his work came under increasing attack, and he was accused of actually fabricating data. One of his most famous studies was of the intelligence and occupational status of 40,000 fathers and sons. Dorfman (1978) studied the goodness of fit of these intelligence scores to a normal distribution, using Pearson's chi-square test. The p-values for fathers and sons were greater than $1 - 10^{-7}$ and $1 - 10^{-6}$, respectively! Dorfman concluded that "it may well be that Burt's frequency distributions are the most normally distributed in the history of anthropometric measurement."

9.6 The Poisson Dispersion Test

The likelihood ratio test and Pearson's chi-square test are carried out with respect to the general alternative hypothesis that the cell probabilities are completely free. If one has a specific alternative hypothesis in mind, better power can usually be obtained by testing against that alternative rather than against a more general one. Such a test is developed in this section for the hypothesis that a distribution is Poisson. The test is quite useful, and its construction affords another illustration of a generalized likelihood ratio test.

The two key assumptions underlying the Poisson distribution are that the rate is constant and that the counts in one interval of time or space are independent of the counts in disjoint intervals. These assumptions are often not met. For example, suppose that insects are counted on leaves of plants. The leaves are of different sizes and occur at various locations on different plants; the rate of infestation may well not be constant over the different locations. Furthermore, if the insects hatched from eggs that were deposited in groups, there might be clustering of the insects and the independence assumption might fail. If counts occurring over time are being recorded, the underlying rate of the phenomenon being studied might not be constant. Motor vehicle counts for traffic studies, for example, typically vary cyclically over time.
Given counts $x_1, \ldots, x_n$, we consider testing the null hypothesis that the counts are Poisson with a common parameter λ versus the alternative hypothesis that they are Poisson but have different rates, $\lambda_1, \ldots, \lambda_n$. Under Ω, there are n different rates; $\omega_0 \subset \Omega$ is the special case in which they are all equal. Under ω0, the maximum likelihood estimate of λ is $\hat\lambda = \bar X$. Under Ω, the maximum likelihood estimates of the $\lambda_i$ are $x_1, \ldots, x_n$; we denote these estimates by $\tilde\lambda_i$. The likelihood ratio is thus

$$\Lambda = \frac{\prod_{i=1}^n \hat\lambda^{x_i} e^{-\hat\lambda}/x_i!}{\prod_{i=1}^n \tilde\lambda_i^{x_i} e^{-\tilde\lambda_i}/x_i!} = \prod_{i=1}^n \left(\frac{\bar x}{x_i}\right)^{x_i} e^{x_i - \bar x}$$

The likelihood ratio test statistic is

$$-2\log\Lambda = -2\sum_{i=1}^n \left[x_i \log\frac{\bar x}{x_i} + (x_i - \bar x)\right] = 2\sum_{i=1}^n x_i \log\frac{x_i}{\bar x}$$

A nearly equivalent form of this statistic is produced by the Taylor series argument given in Section 9.5:

$$-2\log\Lambda \approx \frac{1}{\bar x}\sum_{i=1}^n (x_i - \bar x)^2$$

Under Ω, there are n independent parameters, $\lambda_1, \ldots, \lambda_n$, so dim Ω = n. Under ω0, there is only one parameter, λ, so dim ω0 = 1, and the degrees of freedom are n − 1.

The last equation given for the test statistic may be interpreted as the ratio of n times the estimated variance to the estimated mean. For the Poisson distribution, the variance equals the mean; for the types of alternatives discussed above, the variance is typically greater than the mean. For this reason, the test is often called the Poisson dispersion test. It is sensitive to—that is, has high power against—alternatives that are overdispersed relative to the Poisson, such as the negative binomial distribution. The ratio $\hat\sigma^2/\bar x$ is sometimes used as a measure of clustering.

EXAMPLE A Asbestos Fibers
In Example A in Section 8.4, we considered whether counts of asbestos fibers on grid squares could be modeled as a Poisson distribution. Applying the Poisson dispersion test, we find that

$$\frac{1}{\bar x}\sum (x_i - \bar x)^2 = 26.56$$

or, if the likelihood ratio statistic is used,

$$2\sum x_i \log\frac{x_i}{\bar x} = 27.11$$

Since there are 23 observations, there are 22 degrees of freedom. From a computer calculation, the p-value for the likelihood ratio statistic is .21. The evidence against the null hypothesis is not persuasive; however, the sample size is small and the test may have low power. ■

EXAMPLE B Bacterial Clumps
In Example B in Section 9.5, we applied Pearson's chi-square test to test whether counts of bacterial clumps in milk were fit by a Poisson distribution. There we found $\bar x = 2.44$. The sample variance is

$$\hat\sigma^2 = \frac{0^2 \times 56 + 1^2 \times 104 + \cdots + 19^2 \times 1}{400} - \bar x^2 = 4.59$$

The ratio of the variance to the mean is 1.88 rather than 1; the test statistic is

$$T = \frac{n\hat\sigma^2}{\bar x} = \frac{400 \times 4.59}{2.44} = 752.7$$

Under the null hypothesis, the statistic approximately follows a chi-square distribution with 399 degrees of freedom. Since a chi-square random variable with m degrees of freedom is the sum of the squares of m independent N(0, 1) random variables, the central limit theorem implies that for large values of m the chi-square distribution with m degrees of freedom is approximately normal. For a chi-square distribution, the mean equals the number of degrees of freedom and the variance equals twice the number of degrees of freedom. The p-value can thus be found by standardizing the statistic and using tables of the standard normal distribution:

$$P(T \ge 752.7) = P\left(\frac{T - 399}{\sqrt{2 \times 399}} \ge \frac{752.7 - 399}{\sqrt{2 \times 399}}\right) \approx 1 - \Phi(12.5) \approx 0$$

Thus, there is almost no doubt that the Poisson distribution fails to fit the data. ■
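When the data arrive as a frequency table, as here, the dispersion statistic takes only a few lines; a sketch for the bacterial clump counts, assuming SciPy:

```python
import numpy as np
from scipy.stats import norm

k = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 19])
freq = np.array([56, 104, 80, 62, 42, 27, 9, 9, 5, 3, 2, 1])

n = freq.sum()
xbar = np.sum(k * freq) / n
var = np.sum(k**2 * freq) / n - xbar**2

T = n * var / xbar                          # approx chi-square, n - 1 df, under H0
z = (T - (n - 1)) / np.sqrt(2 * (n - 1))    # normal approximation for large df
print(xbar, var, T, norm.sf(z))             # ~2.44, ~4.59, ~752.7, ~0
```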
9.7 Hanging Rootograms

In this and the next section, we develop additional informal techniques for assessing goodness of fit. The first of these is the hanging rootogram. Hanging rootograms are a graphical display of the differences between observed and fitted values in histograms. To illustrate the construction and interpretation of hanging rootograms, we will use a set of data from the field of clinical chemistry (Martin, Gudzinowicz, and Fanger 1975). The following table gives the empirical distribution of 152 serum potassium levels. In clinical chemistry, distributions such as this are often tabulated to establish a range of "normal" values against which the level of the chemical found in a patient can be compared to determine whether it is abnormal. The tabulated distributions are often fit to parametric forms such as the normal distribution.

Serum Potassium Levels
Interval Midpoint   Frequency
      3.2               2
      3.3               1
      3.4               3
      3.5               2
      3.6               7
      3.7               8
      3.8               8
      3.9              14
      4.0              14
      4.1              18
      4.2              16
      4.3              15
      4.4              10
      4.5               8
      4.6               8
      4.7               6
      4.8               4
      4.9               1
      5.0               1
      5.1               1
      5.2               4
      5.3               1

Figure 9.3(a) is a histogram of the frequencies. The plot looks roughly bell-shaped, but the normal distribution is not the only bell-shaped distribution. In order to evaluate the distribution more exactly, we must compare the observed frequencies to frequencies fit by the normal distribution. This can be done in the following way. Suppose that the parameters μ and σ of the normal distribution are estimated from the data by $\bar x$ and $\hat\sigma$. If the jth interval has the left boundary $x_{j-1}$ and the right boundary $x_j$, then according to the model, the probability that an observation falls in that interval is

$$\hat p_j = \Phi\left(\frac{x_j - \bar x}{\hat\sigma}\right) - \Phi\left(\frac{x_{j-1} - \bar x}{\hat\sigma}\right)$$

If the sample is of size n, the predicted, or fitted, count in the jth interval is $\hat n_j = n\hat p_j$, which may be compared to the observed count, $n_j$.

FIGURE 9.3 (a) Histogram, (b) hanging histogram, (c) hanging rootogram, and (d) hanging chi-gram for the normal fit to the serum potassium data.

Figure 9.3(b) is a hanging histogram of the differences: observed count ($n_j$) minus expected count ($\hat n_j$). These differences are difficult to interpret, since the variability is not constant from cell to cell. If we neglect the variability in the estimated expected counts, we have

$$\mathrm{Var}(n_j - \hat n_j) = \mathrm{Var}(n_j) = np_j(1 - p_j) = np_j - np_j^2$$

In this case, the $p_j$ are small, so

$$\mathrm{Var}(n_j - \hat n_j) \approx np_j$$

Thus, cells with large values of $p_j$ (equivalent to large values of $n_j$ if the model is at all close) have more variable differences, $n_j - \hat n_j$. In a hanging histogram, we expect larger fluctuations in the center than in the tails. This unequal variability makes it difficult to assess and compare the fluctuations, since a large deviation may indicate real misfit of the model or may be merely caused by large random variability.

To put the differences between observed and expected values on a scale on which they all have equal variability, a variance-stabilizing transformation may be used. (Such transformations will be used in later chapters as well.) Suppose that a random variable X has mean μ and variance $\sigma^2(\mu)$, which depends on μ. If Y = f(X), the method of propagation of error (Section 4.6) shows that

$$\mathrm{Var}(Y) \approx \sigma^2(\mu)[f'(\mu)]^2$$

Thus, if f is chosen so that $\sigma^2(\mu)[f'(\mu)]^2$ is constant, the variance of Y will not depend on μ. The transformation f that accomplishes this is called a variance-stabilizing transformation.
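Computing the fitted counts $\hat n_j$ and the rootogram differences takes only a few lines; a sketch, assuming SciPy (the square-root transformation used at the end is the variance-stabilizing choice derived just below):

```python
import numpy as np
from scipy.stats import norm

def hanging_rootogram(boundaries, counts):
    # Normal fit to binned data: estimate mu and sigma from bin midpoints,
    # compute fitted counts n * p_j, and return sqrt(n_j) - sqrt(nhat_j).
    b = np.asarray(boundaries, dtype=float)
    c = np.asarray(counts, dtype=float)
    mids = (b[:-1] + b[1:]) / 2
    n = c.sum()
    mean = np.sum(mids * c) / n
    sd = np.sqrt(np.sum((mids - mean) ** 2 * c) / n)
    p = norm.cdf(b[1:], mean, sd) - norm.cdf(b[:-1], mean, sd)
    return np.sqrt(c) - np.sqrt(n * p)
```

Deviations beyond roughly ±1.0 to ±1.5 (2 to 3 standard deviations, since the variance of each root count is about 1/4) would be flagged as large, matching the rule of thumb given below.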
Let us apply this idea to the case we have been considering:

$$E(n_j) = np_j = \mu \qquad \mathrm{Var}(n_j) \approx np_j = \sigma^2(\mu)$$

That is, $\sigma^2(\mu) = \mu$, so f will be a variance-stabilizing transformation if $\mu[f'(\mu)]^2$ is constant. The function $f(x) = \sqrt{x}$ does the job, and

$$E(\sqrt{n_j}) \approx \sqrt{np_j} \qquad \mathrm{Var}(\sqrt{n_j}) \approx \frac{1}{4}$$

if the model is correct. Figure 9.3(c) shows a hanging rootogram, a display showing

$$\sqrt{n_j} - \sqrt{\hat n_j}$$

The advantage of the hanging rootogram is that the deviations from cell to cell have approximately the same statistical variability. To assess the deviations, we may use the rough rule of thumb that a deviation of more than 2 or 3 standard deviations (more than 1.0 or 1.5 in this case) is "large." The most striking feature of the hanging rootogram in Figure 9.3(c) is the large deviation in the right tail. Generally, deviations in the center have been down-weighted and those in the tails emphasized by the transformation. Also, it is noteworthy that although the deviations other than the one in the right tail are not especially large, they have a certain systematic character: Note the run of positive deviations followed by the run of negative deviations and then the large positive deviation in the extreme right tail. This may indicate some asymmetry in the distribution.

A possible alternative to the rootogram is what can be called a hanging chi-gram, a plot of the components of Pearson's chi-square statistic:

$$\frac{n_j - \hat n_j}{\sqrt{\hat n_j}}$$

Neglecting the variability in the expected counts, as before, $\mathrm{Var}(n_j - \hat n_j) \approx np_j = \hat n_j$, so

$$\mathrm{Var}\left(\frac{n_j - \hat n_j}{\sqrt{\hat n_j}}\right) \approx 1$$

so this technique also stabilizes the variance. Figure 9.3(d) is a hanging chi-gram for the case we have been considering; it is quite similar in overall character to the hanging rootogram, but the deviation in the right tail is emphasized even more.

9.8 Probability Plots

Probability plots are an extremely useful graphical tool for qualitatively assessing the fit of data to a theoretical distribution. Consider a sample of size n from a uniform distribution on [0, 1]. Denote the ordered sample values by $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$. These values are called order statistics. It can be shown (see Problem 17 at the end of Chapter 4) that

$$E(X_{(j)}) = \frac{j}{n + 1}$$

This suggests plotting the ordered observations $X_{(1)}, \ldots, X_{(n)}$ against their expected values $1/(n+1), \ldots, n/(n+1)$. If the underlying distribution is uniform, the plot should look roughly linear. Figure 9.4 is such a plot for a sample of size 100 from a uniform distribution.

FIGURE 9.4 Uniform-uniform probability plot.

Now suppose that a sample $Y_1, \ldots, Y_{100}$ is generated in which each Y is half the sum of two independent uniform random variables. The distribution of Y is no longer uniform but triangular:

$$f(y) = \begin{cases} 4y, & 0 \le y \le \frac{1}{2} \\ 4 - 4y, & \frac{1}{2} \le y \le 1 \end{cases}$$

The ordered observations $Y_{(1)}, \ldots, Y_{(n)}$ are plotted against the points $1/(n+1), \ldots, n/(n+1)$. The graph in Figure 9.5 shows a clear deviation from linearity and enables us to describe qualitatively the deviation of the distribution of the Y's from the uniform distribution.
In Figure 9.5, note that in the left tail of the plotted distribution (near 0), the order statistics are larger than expected for a uniform distribution, and in the right tail (near 1), they are smaller, indicating that the tails of the distribution of the $Y$'s decrease more quickly (are "lighter") than the tails of the uniform distribution.

The technique can be extended to other continuous probability laws by means of Proposition C of Section 2.3, which states that if $X$ is a continuous random variable with a strictly increasing cumulative distribution function $F_X$, and if $Y = F_X(X)$, then $Y$ has a uniform distribution on $[0, 1]$. The transformation $Y = F_X(X)$ is known as the probability integral transform.

The following procedure is suggested by this proposition. Suppose it is hypothesized that $X$ follows a certain distribution $F$. Given a sample $X_1, \ldots, X_n$, we plot
$$F(X_{(k)}) \quad \text{vs.} \quad \frac{k}{n+1}$$
or, equivalently,
$$X_{(k)} \quad \text{vs.} \quad F^{-1}\!\left(\frac{k}{n+1}\right)$$
In some cases, $F$ is of the form
$$F(x) = G\!\left(\frac{x - \mu}{\sigma}\right)$$
where $\mu$ and $\sigma$ are called location and scale parameters, respectively. The normal distribution is of this form. We could plot
$$\frac{X_{(k)} - \mu}{\sigma} \quad \text{vs.} \quad G^{-1}\!\left(\frac{k}{n+1}\right)$$
or, if we plotted $X_{(k)}$ vs. $G^{-1}[k/(n+1)]$, the result would be approximately a straight line if the model were correct:
$$X_{(k)} \approx \sigma G^{-1}\!\left(\frac{k}{n+1}\right) + \mu$$

Slight modifications of this procedure are sometimes used. For example, rather than $G^{-1}[k/(n+1)]$, the expected value $E(X_{(k)})$ of the $k$th smallest observation can be used. But it can be argued that
$$E(X_{(k)}) \approx F^{-1}\!\left(\frac{k}{n+1}\right) = \sigma G^{-1}\!\left(\frac{k}{n+1}\right) + \mu$$
so this modification yields results very similar to those of the original procedure.

The procedure can also be viewed from another perspective. Recall from Section 2.2 that $F^{-1}[k/(n+1)]$ is the $k/(n+1)$ quantile of the distribution $F$; that is, it is the point such that the probability that a random variable with distribution function $F$ is less than it is $k/(n+1)$. We are thus plotting the ordered observations (which may be viewed as the observed, or empirical, quantiles) versus the quantiles of the theoretical distribution.

EXAMPLE A
We illustrate the procedure just described using a set of 100 observations: Michelson's determinations of the velocity of light, made from June 5, 1879, to July 2, 1879; 299,000 has been subtracted from the determinations to give the values listed [data from Stigler (1977)]:

    850  960  880  890  890  740  940  880  810  840
    900  960  880  810  780 1070  940  860  820  810
    930  880  720  800  760  850  800  720  770  810
    950  850  620  760  790  980  880  860  740  810
    980  900  970  750  820  880  840  950  760  850
   1000  830  880  910  870  980  790  910  920  870
    930  810  850  890  810  650  880  870  860  740
    760  880  840  880  810  810  830  840  720  940
   1000  800  850  840  950 1000  790  840  850  800
    960  760  840  850  810  960  800  840  780  870

Figure 9.6 shows the normal probability plot. The plot looks straight, showing that the normal distribution gives a reasonable fit. A word of caution is in order here: probability plots are by nature monotone increasing, and they all tend to look fairly straight. Some experience is necessary in gauging "straightness." Simulations, which are easily done, are very helpful in sharpening one's judgment. Some find it useful to hold the plot so that they are looking down the plotted line as if it were a roadway; this often makes curvature much more apparent. ■
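The plotting procedure itself is only a few lines. The following sketch uses scipy's `norm.ppf` as $G^{-1}$ for the standard normal; the function name and the array name `michelson` are illustrative, and the $k/(n+1)$ plotting positions follow the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def normal_probability_plot(x):
    """Plot the ordered observations X_(k) against G^{-1}(k/(n+1)),
    where G is the standard normal cdf."""
    n = len(x)
    q = norm.ppf(np.arange(1, n + 1) / (n + 1))   # theoretical quantiles
    plt.plot(q, np.sort(x), "o", markersize=3)
    plt.xlabel("Normal quantiles")
    plt.ylabel("Ordered observations")
    # Under the model N(mu, sigma^2), the points fall near the straight line
    # x = sigma * q + mu, so departures from a line indicate lack of fit.

# e.g., with the Michelson values above stored in an array `michelson`:
# normal_probability_plot(michelson); plt.show()
```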
[FIGURE 9.6 Normal probability plot of Michelson's data.]

EXAMPLE B
To interpret probability plots, it is useful to see how they are shaped for samples from nonnormal distributions. Figure 9.7 is a normal probability plot of 500 pseudorandom variables from a double exponential distribution:
$$f(x) = \tfrac{1}{2} e^{-|x|}, \qquad -\infty < x < \infty$$

[FIGURE 9.7 Normal probability plot of 500 pseudorandom variables from a double exponential distribution.]

This density is symmetric about zero, but its tails die off at the rate $e^{-|x|}$, which is slower than the rate $e^{-x^2/2}$ at which the tails of the normal density decay. Note how the plot in Figure 9.7 bends down at the left and up at the right, indicating that the observations in the left tail were more negative, and those in the right tail more positive, than would be expected for a normal distribution. In other words, the extreme observations were larger in magnitude than extreme observations from a normal distribution would be. This effect results because the tails of the double exponential are "heavier" than those of a normal distribution.

Figure 9.8 is a normal probability plot of 500 pseudorandom numbers from a gamma distribution with shape parameter $\alpha = 5$ and scale parameter $\lambda = 1$. As can be seen in Figure 2.11, the gamma density with $\alpha = 5$ is nonsymmetric, or skewed, and this is reflected in the bowlike appearance of the probability plot. ■

[FIGURE 9.8 Normal probability plot of 500 pseudorandom variables from a gamma distribution with shape parameter α = 5.]

EXAMPLE C
As an example for a nonnormal distribution, Figure 9.9 is a gamma probability plot of the precipitation amounts of Example C in Section 8.4. The parameter $\lambda$ of a gamma distribution is a scale parameter and so, as we have seen before, affects only the slope of a probability plot, not its straightness. Thus, in constructing the plot we can take $\lambda = 1$ without loss of generality. A computer was used to find the quantiles of a gamma distribution with $\alpha = .471$ and $\lambda = 1$, and Figure 9.9 was produced by plotting the ordered observed precipitation values versus those quantiles. Qualitatively, the fit looks reasonable, because there is no gross systematic deviation from a straight line. ■

[FIGURE 9.9 Gamma probability plot of the rainfall distribution: ordered observations versus quantiles of the gamma distribution with α = .471.]

Probability plots can also be constructed for grouped data, such as the data on serum potassium levels in Section 9.7. Because the individual ordered observations are not available in such a case, the procedure must be modified. Suppose that the grouping gives the points $x_1, x_2, \ldots, x_{m+1}$ as the histogram's bin boundaries and that the interval $[x_i, x_{i+1})$ contains $n_i$ counts, $i = 1, \ldots, m$. Denote the cumulative frequencies by
$$N_j = \sum_{i=1}^{j} n_i$$
Then $N_1 < N_2 < \cdots < N_m$ and $N_m = n$, the total sample size. We then plot
$$x_{j+1} \quad \text{vs.} \quad G^{-1}\!\left(\frac{N_j}{n+1}\right), \qquad j = 1, \ldots, m$$

EXAMPLE D
Figure 9.10 shows a normal probability plot for the serum potassium data of Section 9.7; the cumulative frequencies are found by summing the frequencies in the bins. (A sketch of the computation follows.)
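For the grouped serum potassium data, the computation might look as follows. The right bin boundaries (midpoint + 0.05) are an assumption, since the table in Section 9.7 lists only midpoints.

```python
import numpy as np
from scipy.stats import norm

# Counts n_i in the serum potassium bins and the cumulative frequencies N_j.
counts = np.array([2, 1, 3, 2, 7, 8, 8, 14, 14, 18, 16, 15,
                   10, 8, 8, 6, 4, 1, 1, 1, 4, 1])
n = counts.sum()
N = np.cumsum(counts)

# Right bin boundaries x_{j+1}; midpoint + 0.05 is an assumption, since the
# table lists only midpoints.
right = 3.25 + 0.1 * np.arange(22)

# Plot x_{j+1} versus G^{-1}(N_j / (n + 1)) for the standard normal G.
q = norm.ppf(N / (n + 1))
# e.g., plt.plot(q, right, "o")
```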
In Figure 9.10, the deviations in the right tail are immediately apparent. ■

9.9 Tests for Normality

A wide variety of tests are available for testing goodness of fit to the normal distribution. We discuss some of them in this section; more discussion may be found in the works referred to.

[FIGURE 9.10 Normal probability plot of serum potassium data.]

If the data are grouped into bins, with several counts in each bin, Pearson's chi-square test for goodness of fit may be applied. But if the parameters are estimated from ungrouped data and the expected counts in each bin are calculated using the estimated parameters, the limiting distribution of the test statistic is no longer chi-square. In order for the limiting distribution to be chi-square, the parameters must be estimated from the grouped data. This was pointed out by Chernoff and Lehmann (1954) and is further discussed by Dahiya and Gurland (1972). Generally speaking, it seems rather artificial and wasteful of information to group continuous data.

Departures from normality often take the form of asymmetry, or skewness. For a normal distribution, the third central moment, $\int_{-\infty}^{\infty} (x - \mu)^3 \varphi(x)\,dx$, equals 0, since the density is symmetric about $\mu$. Suppose that we wish to test the null hypothesis that $X_1, \ldots, X_n$ are independent normally distributed random variables with the same mean and variance. A goodness-of-fit test can be based on the coefficient of skewness of the sample,
$$b_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^3}{s^3}$$
The test rejects for large values of $|b_1|$.

Symmetric distributions can depart from normality by being heavy-tailed or light-tailed, or too peaked or too flat in the center. These forms of departure may be detected by the coefficient of kurtosis of the sample,
$$b_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^4}{s^4}$$
If either of these measures is to be used as a test statistic, its sampling distribution under a normal data-generating distribution must be determined. The test rejects normality when the observed value of the statistic is in the tails of that sampling distribution. These sampling distributions are difficult to evaluate in closed form but can be approximated by simulation.

A goodness-of-fit test may also be based on the linearity of the probability plot, as measured by the correlation coefficient, $r$, of the $x$ and $y$ coordinates of the points of the plot. The test rejects for small values of $r$. The sampling distribution of $r$ under normality has been approximated by simulation and is tabled in Filliben (1975). Ryan and Joiner (unpublished) give a short table of the null sampling distribution of $r$ from normal probability plots, with critical values of the correlation coefficient corresponding to significance levels .1, .05, and .01:

    n      .1      .05     .01
    4      .8951   .8734   .8318
    5      .9033   .8804   .8320
    10     .9347   .9180   .8804
    15     .9506   .9383   .9110
    20     .9600   .9503   .9290
    25     .9662   .9582   .9408
    30     .9707   .9639   .9490
    40     .9767   .9715   .9597
    50     .9807   .9764   .9664
    60     .9836   .9799   .9710
    75     .9865   .9835   .9752

They also report the results of some simulations of the power of $r$ as the test statistic against certain alternative distributions. For example, the power against a uniform distribution at significance level .1 is .13 for n = 10 and .20 for n = 20. (A sketch of these computations follows.)
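All three statistics, together with a simulated approximation to the null distribution of $r$, can be computed in a few lines. A sketch, assuming the divisor-$n$ standard deviation in $b_1$ and $b_2$ (with divisor $n - 1$ the values change only slightly); the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def normality_statistics(x):
    """Return (b1, b2, r): skewness, kurtosis, and the correlation
    coefficient of the normal probability plot."""
    n = len(x)
    z = (x - x.mean()) / x.std()     # divisor-n standard deviation
    b1 = np.mean(z ** 3)
    b2 = np.mean(z ** 4)
    q = norm.ppf(np.arange(1, n + 1) / (n + 1))
    r = np.corrcoef(np.sort(x), q)[0, 1]
    return b1, b2, r

# Null sampling distribution of r by simulation, here for n = 20; the 5th
# percentile should come out close to the tabled critical value .9503.
rng = np.random.default_rng(1)
null_r = np.array([normality_statistics(rng.normal(size=20))[2]
                   for _ in range(10000)])
crit_05 = np.quantile(null_r, 0.05)
```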
These power results are somewhat discouraging: the test rejects only 13% of the time and 20% of the time at those sample sizes when the true underlying distribution is uniform. The moral is that it may be quite difficult to detect departures from normality in small samples. On a more positive note, the power of $r$ against an exponential distribution is 53% for n = 10 and 89% for n = 20. Pearson, D'Agostino, and Bowman (1977) report the results of quite extensive simulations of the power against several alternative distributions and give further references.

For Michelson's data (see Example A in Section 9.8), the correlation coefficient is .995. From the tables in Filliben (1975), this falls between the 50th and 75th percentiles of the null sampling distribution, giving no reason to reject the hypothesis of normality. It may not be very realistic, however, to model the 100 observations of the velocity of light as a sample of 100 independent random variables from some probability distribution and to use this model to test goodness of fit. We have little information about how these data were collected and processed. For example, since the observations were made sequentially, it is quite possible that the measurement process drifted in time or that successive errors were correlated. It is also possible that Michelson discarded some obviously bad data.

9.10 Concluding Remarks

Two very important concepts, estimation and hypothesis testing, have been introduced in this chapter and the last. They have been introduced here in the context of fitting probability distributions but will recur throughout the rest of this book in various other contexts. Generally, observations are taken from a probability law that depends on a parameter, θ. Estimation theory is concerned with estimating θ from the data; the theory of hypothesis testing is concerned with testing hypotheses about the value of θ.

Methods based on likelihood, namely maximum likelihood estimation and likelihood ratio tests, have also been introduced. These methods are much more generally useful than has been demonstrated by the specific purposes to which they have been put in these chapters. The likelihood and the likelihood ratio are key concepts of statistics, from both Bayesian and frequentist perspectives.

The fundamental concepts and techniques of hypothesis testing have been introduced in this chapter. We have seen how to test a null hypothesis by choosing a test statistic and a rejection region such that, under the null hypothesis, the probability that the test statistic falls in the rejection region is α, the significance level of the test. The choice of this region is determined by knowing, at least approximately, the null distribution of the test statistic. The test statistic is frequently, but not always, a likelihood ratio; when the exact distribution of the likelihood ratio cannot be found, we can use the chi-square distribution as a large-sample approximation. We have also explored the relation of the p-value of the test statistic to the significance level. In some situations, the p-value is a less rigid summary of the evidence than is a decision whether to reject the null hypothesis.

With the increasing availability of flexible computer programs and inexpensive computers, graphical methods are being used more and more in statistics. The last part of this chapter introduced two graphical techniques: hanging rootograms and probability plots. Other graphical techniques will be introduced in Chapter 10.
Such informal techniques are often of more practical use than more formal techniques such as hypothesis testing. Literally testing for goodness of fit is often rather artificial: a parametric distribution is usually entertained only as a model for the distribution of data values, and it is clear a priori that the data do not really come from that distribution. If enough data were available, the goodness-of-fit test would certainly reject. Rather than test a hypothesis that no one literally believes could hold, it is usually more useful to ascertain qualitatively where the model fits and where and how it fails to fit.

Some concepts introduced in Chapters 7 and 8 have been elaborated on in this chapter. In Chapter 7, we introduced confidence intervals for the parameters of finite populations; in Chapter 8, we considered confidence intervals for parameters of probability distributions. In this chapter, we have introduced hypothesis testing and developed a relation between hypothesis tests and confidence intervals. The method of propagation of error, used in Chapter 7 as a tool for analyzing the statistical behavior of ratio estimates, has been used in this chapter in connection with variance-stabilizing transformations.

9.11 Problems

1. A coin is thrown independently 10 times to test the hypothesis that the probability of heads is 1/2 versus the alternative that the probability is not 1/2. The test rejects if either 0 or 10 heads are observed.
   a. What is the significance level of the test?
   b. If in fact the probability of heads is .1, what is the power of the test?

2. Which of the following hypotheses are simple, and which are composite?
   a. X follows a uniform distribution on [0, 1].
   b. A die is unbiased.
   c. X follows a normal distribution with mean 0 and variance σ² > 10.
   d. X follows a normal distribution with mean μ = 0.

3. Suppose that X ~ bin(100, p). Consider the test that rejects H0: p = .5 in favor of HA: p ≠ .5 when |X − 50| > 10. Use the normal approximation to the binomial distribution to answer the following:
   a. What is α?
   b. Graph the power as a function of p.

4. Let X have one of the following distributions:

       x     P(x | H0)   P(x | HA)
       x1    .2          .1
       x2    .3          .4
       x3    .3          .1
       x4    .2          .4

   a. Compute the likelihood ratio, Λ, for each possible value of X and order the xi according to Λ.
   b. What is the likelihood ratio test of H0 versus HA at level α = .2? What is the test at level α = .5?
   c. If the prior probabilities are P(H0) = P(HA), which outcomes favor H0?
   d. What prior probabilities correspond to the decision rules with α = .2 and α = .5?

5. True or false, and state why:
   a. The significance level of a statistical test is equal to the probability that the null hypothesis is true.
   b. If the significance level of a test is decreased, the power would be expected to increase.
   c. If a test is rejected at significance level α, the probability that the null hypothesis is true equals α.
   d. The probability that the null hypothesis is falsely rejected is equal to the power of the test.
   e. A type I error occurs when the test statistic falls in the rejection region of the test.
   f. A type II error is more serious than a type I error.
   g. The power of a test is determined by the null distribution of the test statistic.
   h. The likelihood ratio is a random variable.

6. Consider the coin tossing example of Section 9.1.
Suppose that instead of tossing the coin 10 times, the coin is tossed until a head comes up and the total number of tosses, X, is recorded.
   a. If the prior probabilities are equal, which outcomes favor H0 and which favor H1?
   b. Suppose P(H0)/P(H1) = 10. What outcomes favor H0?
   c. What is the significance level of a test that rejects H0 if X ≥ 8?
   d. What is the power of this test?

7. Let X1, ..., Xn be a sample from a Poisson distribution. Find the likelihood ratio for testing H0: λ = λ0 versus HA: λ = λ1, where λ1 > λ0. Use the fact that the sum of independent Poisson random variables follows a Poisson distribution to explain how to determine a rejection region for a test at level α.

8. Show that the test of Problem 7 is uniformly most powerful for testing H0: λ = λ0 versus HA: λ > λ0.

9. Let X1, ..., X25 be a sample from a normal distribution having a variance of 100. Find the rejection region for a test at level α = .10 of H0: μ = 0 versus HA: μ = 1.5. What is the power of the test? Repeat for α = .01.

10. Suppose that X1, ..., Xn form a random sample from a density function f(x|θ) for which T is a sufficient statistic for θ. Show that the likelihood ratio test of H0: θ = θ0 versus HA: θ = θ1 is a function of T. Explain how, if the distribution of T under H0 is known, the rejection region of the test may be chosen so that the test has level α.

11. Suppose that X1, ..., X25 form a random sample from a normal distribution having a variance of 100. Graph the power of the likelihood ratio test of H0: μ = 0 versus HA: μ ≠ 0 as a function of μ, at significance levels .10 and .05. Do the same for a sample size of 100. Compare the graphs and explain what you see.

12. Let X1, ..., Xn be a random sample from an exponential distribution with density function f(x|θ) = θ exp(−θx). Derive a likelihood ratio test of H0: θ = θ0 versus HA: θ ≠ θ0, and show that the rejection region is of the form {X̄ exp(−θ0X̄) ≤ c}.

13. Suppose, to be specific, that in Problem 12, θ0 = 1, n = 10, and α = .05. In order to use the test, we must find the appropriate value of c.
   a. Show that the rejection region is of the form {X̄ ≤ x0} ∪ {X̄ ≥ x1}, where x0 and x1 are determined by c.
   b. Explain why c should be chosen so that P(X̄ exp(−X̄) ≤ c) = .05 when θ0 = 1.
   c. Explain why X1 + ··· + X10, and hence X̄, follow gamma distributions when θ0 = 1. How could this knowledge be used to choose c?
   d. Suppose that you hadn't thought of the preceding fact. Explain how you could determine a good approximation to c by generating random numbers on a computer (simulation).

14. Suppose that under H0, a measurement X is N(0, σ²); that under H1, X is N(1, σ²); and that the prior probability P(H0) = 2 × P(H1). As in Section 9.1, the hypothesis H0 will be chosen if P(H0|x) > P(H1|x). For σ² = 0.1, 0.5, 1.0, 5.0:
   a. For what values of X will H0 be chosen?
   b. In the long run, what proportion of the time will H0 be chosen if H0 is true 2/3 of the time?

15. Suppose that under H0, a measurement X is N(0, σ²); that under H1, X is N(1, σ²); and that the prior probability P(H0) = P(H1). For σ = 1 and x ∈ [0, 3], plot and compare (1) the p-value for the test of H0 and (2) P(H0|x). Can the p-value be interpreted as the probability that H0 is true? Choose another value of σ and repeat.

16. In the previous problem, with σ = 1, what is the probability that the p-value is less than 0.05 if H0 is true? What is the probability if H1 is true?

17.
Let X ~ N(0, σ²), and consider testing H0: σ = σ0 versus HA: σ = σ1, where σ1 > σ0. The values σ0 and σ1 are fixed.
   a. What is the likelihood ratio as a function of x? What values favor H0? What is the rejection region of a level α test?
   b. For a sample X1, X2, ..., Xn distributed as above, repeat the previous question.
   c. Is the test in the previous question uniformly most powerful for testing H0: σ = σ0 versus H1: σ > σ0?

18. Let X1, X2, ..., Xn be i.i.d. random variables from a double exponential distribution with density f(x) = (λ/2) exp(−λ|x|). Derive a likelihood ratio test of the hypothesis H0: λ = λ0 versus H1: λ = λ1, where λ0 and λ1 > λ0 are specified numbers. Is the test uniformly most powerful against the alternative H1: λ > λ0?

19. Under H0, a random variable has the cumulative distribution function F0(x) = x², 0 ≤ x ≤ 1; under H1, it has the cumulative distribution function F1(x) = x³, 0 ≤ x ≤ 1.
   a. If the two hypotheses have equal prior probability, for what values of x is the posterior probability of H0 greater than that of H1?
   b. What is the form of the likelihood ratio test of H0 versus H1?
   c. What is the rejection region of a level α test?
   d. What is the power of the test?

20. Consider two probability density functions on [0, 1]: f0(x) = 1 and f1(x) = 2x. Among all tests of the null hypothesis H0: X ~ f0(x) versus the alternative X ~ f1(x) with significance level α = 0.10, how large can the power possibly be?

21. Suppose that a single observation X is taken from a uniform density on [0, θ], and consider testing H0: θ = 1 versus H1: θ = 2.
   a. Find a test that has significance level α = 0. What is its power?
   b. For 0 < α < 1, consider the test that rejects when X ∈ [0, α]. What is its significance level and power?
   c. What is the significance level and power of the test that rejects when X ∈ [1 − α, 1]?
   d. Find another test that has the same significance level and power as the previous one.
   e. Does the likelihood ratio test determine a unique rejection region?
   f. What happens if the null and alternative hypotheses are interchanged: H0: θ = 2 versus H1: θ = 1?

22. In Example A of Section 8.5.3, a confidence interval for the variance of a normal distribution was derived. Use Theorem B of Section 9.3 to derive an acceptance region for testing the hypothesis H0: σ² = σ0² at significance level α based on a sample X1, X2, ..., Xn. Describe precisely the rejection region if σ0 = 1, n = 15, and α = .05.

23. Suppose that a 99% confidence interval for the mean μ of a normal distribution is found to be (−2.0, 3.0). Would a test of H0: μ = −3 versus HA: μ ≠ −3 be rejected at the .01 significance level?

24. Let X be a binomial random variable with n trials and probability p of success.
   a. What is the generalized likelihood ratio for testing H0: p = .5 versus HA: p ≠ .5?
   b. Show that the test rejects for large values of |X − n/2|.
   c. Using the null distribution of X, show how the significance level corresponding to a rejection region |X − n/2| > k can be determined.
   d. If n = 10 and k = 2, what is the significance level of the test?
   e. Use the normal approximation to the binomial distribution to find the significance level if n = 100 and k = 10.
   This analysis is the basis of the sign test, a typical application of which would be something like this: An experimental drug is to be evaluated on laboratory rats. In n pairs of litter mates, one animal is given the drug and the other is given a placebo.
A physiological measure of benefit is made after some time has passed. Let X be the number of pairs for which the animal receiving the drug benefited more than its litter mate. A simple model for the distribution of X if there is no drug effect is binomial with p = .5. This is then the null hypothesis that must be made untenable by the data before one could conclude that the drug had an effect.

25. Calculate the likelihood ratio for Example B of Section 9.5 and compare the results of a test based on the likelihood ratio to those of one based on Pearson's chi-square statistic.

26. True or false:
   a. The generalized likelihood ratio statistic Λ is always less than or equal to 1.
   b. If the p-value is .03, the corresponding test will reject at significance level .02.
   c. If a test rejects at significance level .06, then the p-value is less than or equal to .06.
   d. The p-value of a test is the probability that the null hypothesis is correct.
   e. In testing a simple versus simple hypothesis via the likelihood ratio, the p-value equals the likelihood ratio.
   f. If a chi-square test statistic with 4 degrees of freedom has a value of 8.5, the p-value is less than .05.

27. What values of a chi-square test statistic with 7 degrees of freedom yield a p-value less than or equal to .10?

28. Suppose that a test statistic T has a standard normal null distribution.
   a. If the test rejects for large values of |T|, what is the p-value corresponding to T = 1.50?
   b. Answer the same question if the test rejects for large T.

29. Suppose that a level α test based on a test statistic T rejects if T > t0. Suppose that g is a monotone-increasing function and let S = g(T). Is the test that rejects if S > g(t0) a level α test?

30. Suppose that the null hypothesis is true, that the distribution of the test statistic T is continuous with cdf F, and that the test rejects for large values of T. Let V denote the p-value of the test.
   a. Show that V = 1 − F(T).
   b. Conclude that the null distribution of V is uniform. (Hint: See Proposition C of Section 2.3.)
   c. If the null hypothesis is true, what is the probability that the p-value is greater than .1?
   d. Show that the test that rejects if V < α has significance level α.

31. What values of the generalized likelihood ratio Λ are necessary to reject the null hypothesis at significance level α = .1 if the degrees of freedom are 1, 5, 10, and 20?

32. The intensity of light reflected by an object is measured. Suppose there are two types of possible objects, A and B. If the object is of type A, the measurement is normally distributed with mean 100 and standard deviation 25; if it is of type B, the measurement is normally distributed with mean 125 and standard deviation 25. A single measurement is taken, with the value X = 120.
   a. What is the likelihood ratio?
   b. If the prior probabilities of A and B are equal (1/2 each), what is the posterior probability that the item is of type B?
   c. Suppose that a decision rule has been formulated that declares the object to be of type B if X > 125. What is the significance level associated with this rule?
   d. What is the power of this test?
   e. What is the p-value when X = 120?

33. It has been suggested that dying people may be able to postpone their death until after an important occasion, such as a wedding or birthday.
Phillips and King (1988) studied the patterns of death surrounding Passover, an important Jewish holiday, in California during the years 1966–1984. They compared the number of deaths during the week before Passover to the number of deaths during the week after Passover for 1919 people who had Jewish surnames. Of these deaths, 922 occurred in the week before and 997 in the week after Passover. The significance of this discrepancy can be assessed by statistical calculations. We can think of the counts before and after as constituting a table with two cells. If there is no holiday effect, then a death has probability 1/2 of falling in each cell. Thus, in order to show that there is a holiday effect, it is necessary to show that this simple model does not fit the data. Test the goodness of fit of the model by Pearson's X² test or by a likelihood ratio test. Repeat this analysis for a group of males of Chinese and Japanese ancestry, of whom 418 died in the week before Passover and 434 died in the week after. What is the relevance of this latter analysis to the former?

34. Test the goodness of fit of the data to the genetic model given in Problem 55 of Chapter 8.

35. Test the goodness of fit of the data to the genetic model given in Problem 58 of Chapter 8.

36. The National Center for Health Statistics (1970) gives the following data on the distribution of suicides in the United States by month in 1970. Is there any evidence that the suicide rate varies seasonally, or are the data consistent with the hypothesis that the rate is constant? (Hint: Under the latter hypothesis, model the number of suicides in each month as a multinomial random variable with the appropriate probabilities and conduct a goodness-of-fit test. Look at the signs of the deviations, Oi − Ei, and see if there is a pattern.)

    Month   Number of Suicides   Days/Month
    Jan.    1867                 31
    Feb.    1789                 28
    Mar.    1944                 31
    Apr.    2094                 30
    May     2097                 31
    June    1981                 30
    July    1887                 31
    Aug.    2024                 31
    Sept.   1928                 30
    Oct.    2032                 31
    Nov.    1978                 30
    Dec.    1859                 31

37. The following table gives the number of deaths due to accidental falls for each month during 1970. Is there any evidence for a departure from uniformity in the rate over time? That is, is there a seasonal pattern to this death rate? If so, describe its pattern and speculate as to causes.

    Month   Number of Deaths
    Jan.    1668
    Feb.    1407
    Mar.    1370
    Apr.    1309
    May     1341
    June    1338
    July    1406
    Aug.    1446
    Sept.   1332
    Oct.    1363
    Nov.    1410
    Dec.    1526

38. Yip et al. (2000) studied seasonal variations in suicide rates in England and Wales during 1982–1996, collecting the counts shown in the following table:

    Month    Jan   Feb   Mar   Apr   May   June  July  Aug   Sept  Oct   Nov   Dec
    Male     3755  3251  3777  3706  3717  3660  3669  3626  3481  3590  3605  3392
    Female   1362  1244  1496  1452  1448  1376  1370  1301  1337  1351  1416  1226

    Do either the male or the female data show seasonality?

39. There is a great deal of folklore about the effects of the full moon on humans and other animals. Do animals bite humans more during a full moon? In an attempt to study this question, Bhattacharjee et al. (2000) collected data on admissions to a medical facility for treatment of bites by animals: cats, rats, horses, and dogs; 95% of the bites were by man's best friend, the dog. The lunar cycle was divided into 10 periods, and the number of bites in each period is shown in the following table. Day 29 is the full moon. Is there a temporal trend in the incidence of bites?
    Lunar Day         16,17,18  19,20,21  22,23,24  25,26,27  28,29,1  2,3,4  5,6,7  8,9,10  11,12,13  14,15
    Number of Bites   137       150       163       201       269      155    142    146     148       110

40. Consider testing goodness of fit for a multinomial distribution with two cells. Denote the numbers of observations in the cells by X1 and X2, and let the hypothesized probabilities be p1 and p2. Pearson's chi-square statistic is
$$\sum_{i=1}^{2} \frac{(X_i - np_i)^2}{np_i}$$
Show that this may be expressed as
$$\frac{(X_1 - np_1)^2}{np_1(1 - p_1)}$$
Because X1 is binomially distributed, the following holds approximately under the null hypothesis:
$$\frac{X_1 - np_1}{\sqrt{np_1(1 - p_1)}} \sim N(0, 1)$$
Thus, the square of the quantity on the left-hand side is approximately distributed as a chi-square random variable with 1 degree of freedom.

41. Let Xi ~ bin(ni, pi), for i = 1, ..., m, be independent. Derive a likelihood ratio test for the hypothesis
H0: p1 = p2 = ··· = pm
against the alternative hypothesis that the pi are not all equal. What is the large-sample distribution of the test statistic?

42. Nylon bars were tested for brittleness (Bennett and Franklin 1954). Each of 280 bars was molded under similar conditions and was tested in five places. Assuming that each bar has uniform composition, the number of breaks on a given bar should be binomially distributed with five trials and an unknown probability p of failure. If the bars are all of the same uniform strength, p should be the same for all of them; if they are of different strengths, p should vary from bar to bar. Thus, the null hypothesis is that the p's are all equal. The following table summarizes the outcome of the experiment:

    Breaks/Bar   Frequency
    0            157
    1            69
    2            35
    3            17
    4            1
    5            1

   a. Under the given assumption, the data in the table consist of 280 observations of independent binomial random variables. Find the mle of p.
   b. Pooling the last three cells, test the agreement of the observed frequency distribution with the binomial distribution using Pearson's chi-square test.
   c. Apply the test procedure derived in the previous problem.

43. a. In 1965, a newspaper carried a story about a high school student who reported getting 9207 heads and 8743 tails in 17,950 coin tosses. Is this a significant discrepancy from the null hypothesis H0: p = 1/2?
   b. Jack Youden, a statistician at the National Bureau of Standards, contacted the student and asked him exactly how he had performed the experiment (Youden 1974). To save time, the student had tossed groups of five coins at a time, and a younger brother had recorded the results, shown in the following table:

    Number of Heads   Frequency
    0                 100
    1                 524
    2                 1080
    3                 1126
    4                 655
    5                 105

   Are the data consistent with the hypothesis that all the coins were fair (p = 1/2)?
   c. Are the data consistent with the hypothesis that all five coins had the same probability of heads but that this probability was not necessarily 1/2? (Hint: Use the binomial distribution.)

44. Derive and carry out a likelihood ratio test of the hypothesis H0: θ = 1/2 versus H1: θ ≠ 1/2 for Problem 58 of Chapter 8.

45. In a classic genetics study, Geissler (1889) studied hospital records in Saxony and compiled data on the gender ratio. The following table shows the number of male children in 6115 families with 12 children. If the genders of successive children are independent and the probabilities remain constant over time, the number of males born to a particular family of 12 children should be a binomial random variable with 12 trials and an unknown probability p of success.
If the probability of a male child is the same for each family, the table represents the occurrence of 6115 binomial random variables. Test whether the data agree with this model. Why might the model fail?

    Number of Males   Frequency
    0                 7
    1                 45
    2                 181
    3                 478
    4                 829
    5                 1112
    6                 1343
    7                 1033
    8                 670
    9                 286
    10                104
    11                24
    12                3

46. Show that the transformation Y = sin⁻¹(√p̂) is variance-stabilizing if p̂ = X/n, where X ~ bin(n, p).

47. Let X follow a Poisson distribution with mean λ. Show that the transformation Y = √X is variance-stabilizing.

48. Suppose that E(X) = μ and Var(X) = cμ², where c is a constant. Find a variance-stabilizing transformation.

49. An English naturalist collected data on the lengths of cuckoo eggs, measured to the nearest .5 mm. Examine the normality of this distribution by (a) constructing a histogram and superposing a normal density, (b) plotting on normal probability paper, and (c) constructing a hanging rootogram.

    Length   Frequency
    18.5     0
    19.0     1
    19.5     3
    20.0     33
    20.5     39
    21.0     156
    21.5     152
    22.0     392
    22.5     288
    23.0     286
    23.5     100
    24.0     86
    24.5     21
    25.0     12
    25.5     2
    26.0     0
    26.5     1

50. Burr (1974) gives the following data on the percentage of manganese in iron made in a blast furnace. For 24 days, a single analysis was made on each of five casts (the five casts of each day form a column). Examine the normality of this distribution by making a normal probability plot and a hanging rootogram. (As a prelude to topics taken up in later chapters, you might also informally examine whether the percentage of manganese is roughly constant from one day to the next or whether there are significant trends over time.)

    Day:   1     2     3     4     5     6     7     8     9     10    11    12
          1.40  1.40  1.80  1.54  1.52  1.62  1.58  1.62  1.60  1.38  1.34  1.50
          1.28  1.34  1.44  1.50  1.46  1.58  1.64  1.46  1.44  1.34  1.28  1.46
          1.36  1.54  1.46  1.48  1.42  1.62  1.62  1.38  1.46  1.36  1.08  1.28
          1.38  1.44  1.50  1.52  1.58  1.76  1.72  1.42  1.38  1.58  1.08  1.18
          1.44  1.46  1.38  1.58  1.70  1.68  1.60  1.38  1.34  1.38  1.36  1.28

    Day:   13    14    15    16    17    18    19    20    21    22    23    24
          1.26  1.52  1.50  1.42  1.32  1.16  1.24  1.30  1.30  1.48  1.32  1.44
          1.50  1.50  1.42  1.32  1.40  1.34  1.22  1.48  1.52  1.46  1.22  1.28
          1.52  1.46  1.38  1.48  1.40  1.40  1.20  1.28  1.76  1.48  1.72  1.10
          1.38  1.34  1.36  1.36  1.26  1.16  1.30  1.18  1.16  1.42  1.18  1.06
          1.50  1.40  1.38  1.38  1.26  1.54  1.36  1.28  1.28  1.36  1.36  1.10

51. Examine the probability plot in Figure 9.6 and explain why there are several sets of horizontal bands of points.

52. The following table gives values of two abundance ratios for different isotopes of potassium from several samples of minerals (H. Ku, private communication). Examine whether each of the ratios appears normally distributed by first making histograms and superposing normal densities and then making probability plots.
    39K/41K  41K/40K    39K/41K  41K/40K    39K/41K  41K/40K
    13.8645  576.369    13.8689  578.277    13.8724  576.017
    13.8695  578.012    13.8593  574.708    13.8665  574.881
    13.8659  575.597    13.8742  573.630    13.8566  578.508
    13.8622  575.244    13.8703  576.069    13.8555  576.796
    13.8696  575.567    13.8472  575.637    13.8534  580.394
    13.8604  576.836    13.8555  575.971    13.8685  576.772
    13.8672  576.236    13.8439  576.403    13.8694  576.501
    13.8598  575.291    13.8646  576.179    13.8599  574.950
    13.8641  576.478    13.8702  575.129    13.8605  577.614
    13.8673  576.992    13.8606  577.084    13.8619  574.506
    13.8597  578.335    13.8622  576.749    13.9641  576.317
    13.8604  576.767    13.8588  576.669    13.8597  575.665
    13.8591  576.571    13.8547  575.869    13.8617  575.815
    13.8472  576.617    13.8597  577.793    13.861   576.109
    13.863   575.885    13.8663  577.770    13.8615  576.144
    13.8566  576.651    13.8597  577.697    13.8469  576.820
    13.8503  575.974    13.8604  576.299    13.8582  576.672
    13.8553  577.255    13.8634  575.903    13.8645  576.169
    13.8642  574.664    13.8658  574.773    13.8713  575.390
    13.8613  576.405    13.8547  577.391    13.8593  575.108
    13.8706  574.306    13.8519  577.057    13.8522  576.663
    13.8601  577.095    13.863   577.286    13.8489  578.358
    13.866   576.957    13.8581  575.510    13.8609  575.371
    13.8655  576.434    13.8644  576.509    13.857   575.851
    13.8612  575.211    13.8665  574.300    13.8566  575.644
    13.8598  576.630    13.8648  575.846    13.864   574.462

53. Hoaglin (1980) suggested a "Poissonness plot," a simple visual method for assessing goodness of fit. The expected frequencies for a sample of size n from a Poisson distribution are
$$E_k = nP(X = k) = n e^{-\lambda} \frac{\lambda^k}{k!}$$
or
$$\log E_k = \log n - \lambda + k \log \lambda - \log k!$$
Thus, a plot of log(O_k) + log k! versus k should yield nearly a straight line with slope log λ and intercept log n − λ. Construct such plots for the data of Problems 1, 2, and 3 of Chapter 8. Comment on how straight they are.

54. A random variable X is said to follow a lognormal distribution if Y = log(X) follows a normal distribution. The lognormal is sometimes used as a model for heavy-tailed, skewed distributions.
   a. Calculate the density function of the lognormal distribution.
   b. Examine whether the lognormal roughly fits the following data (Robson 1929), which are the dorsal lengths in millimeters of taxonomically distinct octopods.

    110  15  60  54  19  115  73  190  57  43  44  18  37  43  55  19  23  82  175  50
    80   65  63  36  16  10   17  52   43  70  22  95  20  41  17  15  12  11  29   29
    61   22  40  17  26  30   16  116  28  32  33  29  27  16  55  8   11  49  82   85
    20   67  27  44  16  6    35  17   26  32  76  150 21  5   6   51  75  23  29   64
    22   47  9   10  28  18   84  52   130 50  45  12  21  73

55. a. Generate samples of size 25, 50, and 100 from a normal distribution. Construct probability plots. Do this several times to get an idea of how probability plots behave when the underlying distribution is really normal.
   b. Repeat part (a) for a chi-square distribution with 10 df.
   c. Repeat part (a) for Y = Z/U, where Z ~ N(0, 1), U ~ U[0, 1], and Z and U are independent.
   d. Repeat part (a) for a uniform distribution.
   e. Repeat part (a) for an exponential distribution.
   f. Can you distinguish between the normal distribution of part (a) and the subsequent nonnormal distributions?

56. Suppose that a sample is taken from a symmetric distribution whose tails decrease more slowly than those of the normal distribution. What would be the qualitative shape of a normal probability plot of this sample?

57.
The Cauchy distribution has the probability density function
$$f(x) = \frac{1}{\pi} \cdot \frac{1}{1 + x^2}, \qquad -\infty < x < \infty$$
What would be the qualitative shape of a normal probability plot of a sample from this distribution?

58. Show how probability plots for the exponential distribution, F(x) = 1 − e^(−λx), may be constructed. Berkson (1966) recorded times between events and fit them to an exponential distribution. (The times between events in a Poisson process are exponentially distributed.) The following table comes from Berkson's paper. Make an exponential probability plot, and evaluate its "straightness."

    Time Interval (sec)   Observed Frequency
    0–60                  115
    60–120                104
    120–181                99
    181–243               106
    243–306               113
    306–369               104
    369–432               101
    432–497               106
    497–562               104
    562–628                96
    628–698               512
    698–1130              524
    1130–1714             468
    1714–2125             531
    2125–2567             461
    2567–3044             526
    3044–3562             506
    3562–4130             509
    4130–4758             520
    4758–5460             540
    5460–6255             542
    6255–7174             499
    7174–8260             494
    8260–9590             500
    9590–11,304           550
    11,304–13,719         465
    13,719–14,347         104
    14,347–15,049          97
    15,049–15,845         101
    15,845–16,763         104
    16,763–17,849          92
    17,849–19,179         102
    19,179–20,893         103
    20,893–23,309         110
    23,309–27,439         112
    27,439+               100

59. Construct a hanging rootogram from the data of the previous problem in order to compare the observed distribution to an exponential distribution.

60. The exponential distribution is widely used in studies of reliability as a model for lifetimes, largely because of its mathematical simplicity. Barlow, Toland, and Freeman (1984) analyzed data on the strength of Kevlar 49/epoxy, a material used in the space shuttle. The times to failure (in hours) of 101 strands tested at a stress level of 90% are given in the following table.

    Times to Failure at 90% Stress Level
    .01   .01   .02   .02   .02   .03   .03   .04   .05   .06
    .07   .07   .08   .09   .09   .10   .10   .11   .11   .12
    .13   .18   .19   .20   .23   .24   .24   .29   .34   .35
    .36   .38   .40   .42   .43   .52   .54   .56   .60   .60
    .63   .65   .67   .68   .72   .72   .72   .73   .79   .79
    .80   .80   .83   .85   .90   .92   .95   .99   1.00  1.01
    1.02  1.03  1.05  1.10  1.10  1.11  1.15  1.18  1.20  1.29
    1.31  1.33  1.34  1.40  1.43  1.45  1.50  1.51  1.52  1.53
    1.54  1.54  1.55  1.58  1.60  1.63  1.64  1.80  1.80  1.81
    2.02  2.05  2.14  2.17  2.33  3.03  3.03  3.24  4.20  4.69
    7.89

   a. Construct a probability plot of the data against the quantiles of an exponential distribution to assess qualitatively whether the exponential is a reasonable model. Can you explain the peculiar appearance of the plot?
   b. Compare the data to the exponential distribution by means of a hanging rootogram.

61. The files haliburton and macdonalds give the monthly returns on the stocks of these two companies from 1975 through 1999.
   a. Make histograms of the returns and superimpose fitted normal densities. Comment on the quality of the fit. Which stock is more volatile?
   b. Make normal probability plots and again comment on the quality of the fit.

62. Apply the Poisson dispersion test to the data on gamma-ray counts (Problem 42 of Chapter 8). You will have to modify the development of the likelihood ratio test in Section 9.5 to take account of the time intervals being of different lengths.

63. Construct a gamma probability plot for the data of Problem 46 of Chapter 8.

64. The file bodytemp contains normal body temperature readings (degrees Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and 65 females (coded by 2) from Shoemaker (1996).
   a.
Assess the normality of the male and female body temperatures by making normal probability plots. In order to judge the inherent variability of these plots, simulate several samples from normal distributions with matching means and standard deviations, and make normal probability plots. What do you conclude?
   b. Repeat the preceding problem for heart rates.
   c. For the males, test the null hypothesis that the mean body temperature is 98.6° versus the alternative that the mean is not equal to 98.6°. Do the same for the females. What do you conclude?

65. This problem continues the analysis of the chromatin data from Problem 45 of Chapter 8 and is concerned with further examining goodness of fit.
   a. Goodness of fit can also be examined via probability plots, in which the quantiles of a theoretical distribution are plotted against those of the empirical distribution. Following the discussion in Section 9.8, show that it is sufficient to plot the observed order statistics, X(k), versus the quantiles of the Rayleigh distribution with θ = 1. Construct three such probability plots and comment on any systematic lack of fit that you observe. To get an idea of what sort of variability could be expected due to chance, simulate several sets of data from a Rayleigh distribution and make the corresponding probability plots.
   b. Formally test goodness of fit by performing a chi-square goodness-of-fit test, comparing histogram counts to those predicted by the Rayleigh model. You may need to combine cells of the histograms so that the expected count in each cell is at least 5.

CHAPTER 10
Summarizing Data

10.1 Introduction

This chapter deals with methods of describing and summarizing data that are in the form of one or more samples, or batches. These procedures, many of which generate graphical displays, are useful in revealing the structure of data that are initially in the form of numbers printed in columns on a page or recorded on a tape or disk as a computer file. In the absence of a stochastic model, the methods are useful for purely descriptive purposes. If it is appropriate to entertain a stochastic model, the implications of that model for the methods are of interest. For example, the arithmetic mean x̄ is often used as a summary of a collection of numbers x1, x2, ..., xn; it indicates a "typical value." (We discuss some of its strengths and weaknesses in this regard in Section 10.4.) In some situations, it may be useful to model the collection of numbers as a realization of n independent random variables X1, X2, ..., Xn with common mean μ and variance σ². The question of the variability of x̄ can be addressed with such a model: the mean x̄ is regarded as an estimate of μ, and we know from previous work that the stochastic model implies E(X̄) = μ and Var(X̄) = σ²/n.

We will first discuss methods that are data analogues of the cumulative distribution function of a random variable. These methods are useful in displaying the distribution of data values. Next, we will discuss the histogram and related graphical displays that play the role for data that the probability density or frequency function plays for a random variable, giving a different view of the distribution of data values than that provided by the cumulative distribution function. We then discuss simpler numerical summaries of data: numbers that indicate a typical or central value of the data and a quantification of the spread. Such statistics provide a more condensed summary than do the cumulative distribution function and the histogram.
We will pay particular attention to the effect of extreme data points on these measures. Next, we will introduce boxplots, graphical summaries that combine in a simple form information about the central values, spread, and shape of a distribution. Finally, scatterplots are introduced as a method for displaying information about relationships between variables.

10.2 Methods Based on the Cumulative Distribution Function

10.2.1 The Empirical Cumulative Distribution Function

Suppose that x1, ..., xn is a batch of numbers. (The word sample is often used in the case that the xi are independently and identically distributed with some distribution function; the word batch implies no such commitment to a stochastic model.) The empirical cumulative distribution function (ecdf) is defined as
$$F_n(x) = \frac{1}{n}\,(\#\,x_i \le x)$$
(With this definition, Fn is right-continuous; in the former Soviet Union and Eastern Europe, the ecdf is usually defined to be left-continuous.)

Denote the ordered batch of numbers by x(1) ≤ x(2) ≤ ··· ≤ x(n). Then if x < x(1), Fn(x) = 0; if x(1) ≤ x < x(2), Fn(x) = 1/n; if x(k) ≤ x < x(k+1), Fn(x) = k/n; and so on. If there is a single observation with value x, Fn has a jump of height 1/n at x; if there are r observations with the same value x, Fn has a jump of height r/n there. The ecdf is the data analogue of the cumulative distribution function of a random variable: F(x) gives the probability that X ≤ x, and Fn(x) gives the proportion of the collection of numbers less than or equal to x.

EXAMPLE A
As an example of the use of the ecdf, let us consider data taken from a study by White, Riethof, and Kushnir (1960) of the chemical properties of beeswax. The aim of the study was to investigate chemical methods for detecting the presence of synthetic waxes that had been added to beeswax. For example, the addition of microcrystalline wax raises the melting point of beeswax. If all pure beeswax had the same melting point, its determination would be a reasonable way to detect dilutions. The melting point and other chemical properties of beeswax, however, vary from one beehive to another. The authors obtained samples of pure beeswax from 59 sources, measured several chemical properties, and examined the variability of the measurements. The 59 melting points (in °C) are listed here. As a summary of these measurements, the ecdf is plotted in Figure 10.1.

    63.78  63.45  63.58  63.08  63.40  64.42  63.27  63.10  63.34  63.50
    63.83  63.63  63.27  63.30  63.83  63.50  63.36  63.86  63.34  63.92
    63.88  63.36  63.36  63.51  63.51  63.84  64.27  63.50  63.56  63.39
    63.78  63.92  63.92  63.56  63.43  64.21  64.24  64.12  63.92  63.53
    63.50  63.30  63.86  63.93  63.43  64.40  63.61  63.03  63.68  63.13
    63.41  63.60  63.13  63.69  63.05  62.85  63.31  63.66  63.60

[FIGURE 10.1 The empirical cumulative distribution function of the melting points of beeswax.]

Figure 10.1 conveniently summarizes the natural variability in melting points. For example, we can see from the graph that about 90% of the samples had melting points less than 64.2°C and that about 12% had melting points less than 63.2°C. White, Riethof, and Kushnir showed that the addition of 5% microcrystalline wax raised the melting point of beeswax by .85°C and the addition of 10% raised it by 2.22°C.
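Readings like these can also be taken off the ecdf programmatically. A minimal sketch, assuming the melting points are stored in a numpy array named `melting` (a hypothetical name):

```python
import numpy as np

def ecdf(data):
    """Return a function evaluating F_n(x) = (# data values <= x) / n."""
    sorted_data = np.sort(data)
    n = len(sorted_data)
    def F_n(x):
        # side="right" counts the sorted values that are <= x
        return np.searchsorted(sorted_data, x, side="right") / n
    return F_n

# e.g., with the melting points above stored in an array `melting`:
# F_n = ecdf(melting)
# F_n(64.2)   # roughly 0.90, as read from Figure 10.1
```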
From Figure 10.1, we can see that an addition of 5% microcrystalline wax might well be difficult to detect, especially if it were made to beeswax that had a low melting point, but that an addition of 10% would be detectable. In further calculations, the investigators modeled the distribution of melting points as Gaussian. How reasonable does this model appear to be? ■

Let us briefly consider some of the elementary statistical properties of the ecdf in the case in which X1, ..., Xn is a random sample from a continuous distribution function F. For purposes of analysis, it is convenient to express Fn in the following way:
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty,\,x]}(X_i)$$
where
$$I_{(-\infty,\,x]}(X_i) = \begin{cases} 1, & \text{if } X_i \le x \\ 0, & \text{if } X_i > x \end{cases}$$
The random variables $I_{(-\infty,\,x]}(X_i)$ are independent Bernoulli random variables:
$$I_{(-\infty,\,x]}(X_i) = \begin{cases} 1, & \text{with probability } F(x) \\ 0, & \text{with probability } 1 - F(x) \end{cases}$$
Thus, nFn(x) is a binomial random variable (n trials, probability F(x) of success), and so
$$E[F_n(x)] = F(x), \qquad \mathrm{Var}[F_n(x)] = \frac{1}{n} F(x)[1 - F(x)]$$
As an estimate of F(x), Fn(x) is unbiased and has maximum variance at the value of x for which F(x) = .5, that is, at the median. As x becomes very large or very small, the variance tends to zero.

In the preceding paragraph, we considered Fn(x) for fixed x; the results can be applied to form a confidence interval for F(x) for any given value of x. Much deeper analysis focuses on the stochastic behavior of Fn as a random function; that is, all values of x are considered simultaneously. It turns out, somewhat surprisingly, that the distribution of
$$\max_{-\infty < x < \infty} |F_n(x) - F(x)|$$
does not depend on F when F is continuous.

10.2.2 The Survival Function

The survival function is defined as
$$S(t) = P(T > t) = 1 - F(t)$$
where T is a random variable with cdf F. In applications where the data consist of times until failure or death and are thus nonnegative, it is often customary to work with the survival function rather than the cumulative distribution function, although the two give equivalent information. Data of this type occur in medical and reliability studies; in these cases, S(t) is simply the probability that the lifetime will be longer than t. We will be concerned with the sample analogue of S,
$$S_n(t) = 1 - F_n(t)$$
which gives the proportion of the data greater than t.

EXAMPLE A
As an example, let us consider the use of the survival function in a study of the lifetimes of guinea pigs infected with varying doses of tubercle bacilli (Bjerkedal 1960). In one study, five groups of 72 animals each were inoculated with the bacilli at increasing dosages, and a control group of 107 animals was used. We denote the inoculated groups by I, II, III, IV, and V, in order of increasing dose. The animals were observed over a 2-year period, and their times of death (in days) were recorded. The data are given here. Note that not all the animals in the lower-dosage regimens died.
Control Lifetimes
    18  36  50  52  86  87  89  91 102 105 114 114 115 118 119 120 149 160 165 166
   167 167 173 178 189 209 212 216 273 278 279 292 341 355 367 380 382 421 421 432
   446 455 463 474 506 515 546 559 576 590 603 607 608 621 634 634 637 638 641 650
   663 665 688 725 735

Dose I Lifetimes
    76  93  97 107 108 113 114 119 136 137 138 139 152 154 154 160 164 164 166 168
   178 179 181 181 183 185 194 198 212 213 216 220 225 225 244 253 256 259 265 268
   268 270 283 289 291 311 315 326 326 361 373 373 376 397 398 406 452 466 592 598

Dose II Lifetimes
    72  72  78  83  85  99  99 110 113 113 114 114 118 119 123 124 131 133 135 137
   140 142 144 145 154 156 157 162 162 164 165 167 171 176 177 181 182 187 192 196
   211 214 216 216 218 228 238 242 248 256 257 262 264 267 267 270 286 303 309 324
   326 334 335 358 409 473 550

Dose III Lifetimes
    10  33  44  56  59  72  74  77  92  93  96 100 100 102 105 107 107 108 108 108
   109 112 113 115 116 120 121 122 122 124 130 134 136 139 144 146 153 159 160 163
   163 168 171 172 176 183 195 196 197 202 213 215 216 222 230 231 240 245 251 253
   254 254 278 293 327 342 347 361 402 432 458 555

Dose IV Lifetimes
    43  45  53  56  56  57  58  66  67  73  74  79  80  80  81  81  81  82  83  83
    84  88  89  91  91  92  92  97  99  99 100 100 101 102 102 102 103 104 107 108
   109 113 114 118 121 123 126 128 137 138 139 144 145 147 156 162 174 178 179 184
   191 198 211 214 243 249 329 380 403 511 522 598

Dose V Lifetimes
    12  15  22  24  24  32  32  33  34  38  38  43  44  48  52  53  54  54  55  56
    57  58  58  59  60  60  60  60  61  62  63  65  65  67  68  70  70  72  73  75
    76  76  81  83  84  85  87  91  95  96  98  99 109 110 121 127 129 131 143 146
   146 175 175 211 233 258 258 263 297 341 341 376

A plot (Figure 10.2) of the empirical survival functions provides a convenient summary of the data. The proportions surviving beyond given times are plotted; it is not necessary to know the actual lifetimes of the animals that survived beyond the termination of the study. The graph is a much more effective presentation of the data than the tabular listings.

[FIGURE 10.2 Survival functions for guinea pig lifetimes. For visual clarity, the points have been joined by lines: the solid line corresponds to the control group, the dotted line to group I, the short-dash line to group II, the long-dash line to group III, the dot-and-long-dash line to group IV, and the short-and-long-dash line to group V.]

One of Bjerkedal's primary interests was comparing the effect increased exposure had on guinea pigs that had different levels of resistance. Comparing groups III and V, for example, we see that the difference in lifetimes of the weakest guinea pigs (say, the 10% weakest) in the two groups was about 50 days, whereas the difference in lifetimes for stronger animals increases to about 100 days. ■

Survival plots may also be used for informal examination of the hazard function, which may be interpreted as the instantaneous death rate for individuals who have survived up to a given time.
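Before taking up the hazard function, note that plots like Figure 10.2 require little more than sorting. A minimal sketch, assuming the lifetimes of each group are stored in arrays with hypothetical names such as `control` and `dose_v`:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_survival(times, **kwargs):
    """Step plot of the empirical survival function S_n(t) = 1 - F_n(t)."""
    t = np.sort(times)
    n = len(t)
    S = 1.0 - np.arange(1, n + 1) / n   # value of S_n just after each death
    plt.step(t, S, where="post", **kwargs)
    plt.xlabel("Days elapsed")
    plt.ylabel("Proportion of live animals")

# e.g., plot_survival(control); plot_survival(dose_v); plt.show()
```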
If an individual is alive at time $t$, the probability that that individual will die in the time interval $(t, t + \delta)$ is, assuming that the density function $f$ is continuous at $t$,

$$P(t \le T \le t + \delta \mid T \ge t) = \frac{P(t \le T \le t + \delta)}{P(T \ge t)} = \frac{F(t + \delta) - F(t)}{1 - F(t)} \approx \frac{\delta f(t)}{1 - F(t)}$$

The hazard function is defined as

$$h(t) = \frac{f(t)}{1 - F(t)}$$

and may be thought of as the instantaneous rate of mortality for an individual alive at time $t$. If $T$ is the lifetime of a manufactured component, it may be natural to think of $h(t)$ as the instantaneous or age-specific failure rate. It may also be expressed as

$$h(t) = -\frac{d}{dt} \log[1 - F(t)] = -\frac{d}{dt} \log S(t)$$

which reveals that it is the negative of the slope of the log of the survival function.

Consider, for example, the exponential distribution:

$$F(t) = 1 - e^{-\lambda t}, \qquad S(t) = e^{-\lambda t}, \qquad f(t) = \lambda e^{-\lambda t}, \qquad h(t) = \lambda$$

The instantaneous mortality rate is constant. If the exponential distribution were used as a model for the time until failure of a component, it would imply that the probability of the component failing did not depend on its age. This is a consequence of the "memoryless" property of the exponential distribution (Section 2.2.1). An alternative model might have a hazard function that is U-shaped, the rate of failure being high for very new components because of flaws in the manufacturing process that show up very quickly, declining for components of intermediate age, and then increasing for older components as they wear out.

The empirical survival function and its logarithm can be expressed in terms of the ordered observations. For simplicity, suppose that there are no ties and that the ordered failure times are $T_{(1)} < T_{(2)} < \cdots < T_{(n)}$. Then if $t = T_{(i)}$, $F_n(t) = i/n$ and $S_n(t) = 1 - i/n$. Since $\log S_n(t)$ is then undefined for $t \ge T_{(n)}$, it is often defined as $S_n(t) = 1 - i/(n+1)$ for $T_{(i)} \le t < T_{(i+1)}$.

EXAMPLE B
For the data of Example A, Figure 10.3 is a plot of the log of the empirical survival functions. We plotted $\log[1 - i/(n+1)]$ versus the ordered survival times $T_{(i)}$. From the slopes of these curves, we see that the hazard functions are initially fairly small. As the dosage level increases, the instantaneous mortality rates both increase more quickly and reach higher levels. The increased mortality rate sets in at an earlier age for the high-dosage group and seems greater (the slope is greater). (To see this, hold the figure at an angle so that you are "looking down" the curves.) ■

When interpreting plots such as that presented in Figure 10.3, we will find it useful to keep in mind the variability of the empirical log survival function. Using the method of propagation of error (Section 4.6), we have

$$\operatorname{Var}\{\log[1 - F_n(t)]\} \approx \frac{\operatorname{Var}[1 - F_n(t)]}{[1 - F(t)]^2} = \frac{\frac{1}{n} F(t)[1 - F(t)]}{[1 - F(t)]^2} = \frac{1}{n} \, \frac{F(t)}{1 - F(t)}$$

FIGURE 10.3 Log survival functions for guinea pig lifetimes. For purposes of visual clarity, the points have been joined by lines: The solid line corresponds to the control group, the dotted line to group I, the short-dash line to group II, the long-dash line to group III, the dot-and-long-dash line to group IV, and the short-and-long-dash line to group V.

From this expression, we see that for large values of $t$, the empirical log survival function is extremely unreliable, because $1 - F(t)$ is then very small. Thus, in practice, the last few data points are disregarded.
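The empirical log survival function and its approximate standard deviation can be computed together. Here is a hedged sketch, using the $1 - i/(n+1)$ convention just described and substituting $F_n$ for the unknown $F$ in the variance formula; the function name and the sample values (a few dose V lifetimes) are illustrative choices of ours.

```python
import numpy as np

def log_survival_with_se(times):
    """Empirical log survival log[1 - i/(n+1)] at each ordered time, together with
    the propagation-of-error standard deviation sqrt(F(t) / (n * (1 - F(t)))),
    using the plug-in estimate F_n(t) = i/(n+1) in place of the unknown F."""
    t = np.sort(np.asarray(times, dtype=float))
    n = len(t)
    F = np.arange(1, n + 1) / (n + 1)   # the i/(n+1) convention keeps the log finite
    log_s = np.log(1.0 - F)
    se = np.sqrt(F / (n * (1.0 - F)))
    return t, log_s, se

t, log_s, se = log_survival_with_se([12, 15, 22, 24, 32, 54, 60, 76, 98, 146])
for row in zip(t, log_s, se):
    print("t = %6.1f   log S = %7.3f   approx SE = %.3f" % row)
```

Notice that the printed standard errors grow rapidly for the largest times, which is exactly the unreliability the formula predicts.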
(Note the large fluctuations of the log survival functions in Figure 10.3 for large times.)

10.2.3 Quantile-Quantile Plots

Quantile-quantile (Q-Q) plots are useful for comparing distribution functions. If $X$ is a continuous random variable with a strictly increasing distribution function, $F$, the $p$th quantile of the distribution was defined in Section 2.2 to be that value of $x$ such that $F(x) = p$, or

$$x_p = F^{-1}(p)$$

In a Q-Q plot, the quantiles of one distribution are plotted against those of another. Suppose, for purposes of discussion, that one cdf ($F$) is a model for observations of a control group and another ($G$) is a model for observations of a group that has received some treatment. Let the observations of the control group be denoted by $x$ with cdf $F$, and let the observations of the treatment group be denoted by $y$ with cdf $G$. The simplest effect that the treatment could have would be to increase the expected response of every member of the treatment group by the same amount, say $h$ units. That is, both the weakest and the strongest individuals would have their responses changed by $h$. Then $y_p = x_p + h$, and the Q-Q plot would be a straight line with slope 1 and intercept $h$. We will now show that this relationship between the quantiles implies that the cumulative distribution functions have the relationship $G(y) = F(y - h)$. This follows because, for every $0 \le p \le 1$,

$$p = G(y_p) = F(x_p) = F(y_p - h)$$

as in Figure 10.4.

FIGURE 10.4 An additive treatment effect. The solid line is F(y), and the dotted line is G(y) = F(y − h).

Another possible effect of a treatment would be multiplicative: The response (such as lifetime or strength) is multiplied by a constant, $c$. The quantiles would then be related as $y_p = c x_p$, and the Q-Q plot would be a straight line with slope $c$ and intercept 0. The cdf's would be related as $G(y) = F(y/c)$ (see Figure 10.5). A simple summary of a treatment effect for the additive model would be of the form "the treatment increases lifetime by 2 mo." For the multiplicative model, one might say something like "the treatment increases lifetime by 25%."

FIGURE 10.5 A multiplicative treatment effect. The solid line is F(y), and the dotted line is G(y) = F(y/c).

The effect of a treatment can, of course, be much more complicated than either of these two simple models. For example, a treatment could benefit weaker individuals and be to the detriment of stronger individuals. An educational program that places very heavy emphasis on elementary or basic skills might be expected to have this sort of effect relative to a regular program.

Given a batch of numbers, or a sample from a probability distribution, quantiles are constructed from the order statistics. Given $n$ observations and the order statistics $X_{(1)}, \ldots, X_{(n)}$, the $k/(n+1)$ quantile of the data is assigned to $X_{(k)}$. (This convention is not unique; sometimes, for example, the quantile assigned to $X_{(k)}$ is defined as $(k - .5)/n$. For descriptive purposes, it makes little difference which definition we use.) In constructing probability plots in Chapter 9, we plotted sample quantiles defined as just described versus the quantiles of a theoretical distribution, such as the normal, and used these plots to informally assess goodness of fit.
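The $k/(n+1)$ convention is easy to apply in code. Below is a minimal sketch of the kind of probability plot just mentioned: sample quantiles at $k/(n+1)$ against standard normal quantiles. The synthetic data and the seed are our own illustrative assumptions; scipy's `norm.ppf` supplies $F^{-1}$ for the normal distribution.

```python
import numpy as np
from scipy.stats import norm

def sample_quantiles(data):
    """Assign the k/(n+1) quantile to the k-th order statistic X_(k)."""
    x = np.sort(np.asarray(data, dtype=float))
    p = np.arange(1, len(x) + 1) / (len(x) + 1)
    return p, x

# Probability plot of a batch against the standard normal: X_(k) versus z_{k/(n+1)}.
rng = np.random.default_rng(0)
data = rng.normal(loc=64.0, scale=0.5, size=59)   # synthetic stand-in, not the beeswax data
p, x = sample_quantiles(data)
theoretical = norm.ppf(p)
# An approximately straight line supports the normal model; its slope and intercept
# estimate the standard deviation and the mean.
for pk, zk, xk in list(zip(p, theoretical, x))[::10]:
    print("p = %.3f   normal quantile = %6.2f   sample quantile = %6.2f" % (pk, zk, xk))
```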
To compare two batches of $n$ numbers with order statistics $X_{(1)}, \ldots, X_{(n)}$ and $Y_{(1)}, \ldots, Y_{(n)}$, a Q-Q plot is simply constructed by plotting the points $(X_{(i)}, Y_{(i)})$. If the batches are of unequal size, an interpolation process can be used. A procedure for interpolating intermediate quantiles is described in the end-of-chapter problems.

EXAMPLE A
Cleveland et al. (1974) used Q-Q plots in a study of air pollution. They plotted the quantiles of distributions of the values of various variables on Sunday against the quantiles for weekdays (Figure 10.6). The Q-Q plot of the ozone maxima shows that the very highest quantiles occur on weekdays but that all the other quantiles are larger on Sundays. For carbon monoxide, nitrogen oxide, and aerosols, the differences in the quantiles increase with increasing concentration. The very high and very low quantiles of solar radiation are about the same on Sundays and weekdays (presumably corresponding to very clear days and days with heavy cloud cover), but for intermediate quantiles, the Sunday quantiles are larger. ■

FIGURE 10.6 Q-Q plots of air pollution variables: (a) ozone maxima (ppm), (b) carbon monoxide concentration (ppm), (c) nitrogen oxide concentration (ppm), (d) nonmethane hydrocarbons (ppm), (e) solar radiation (langleys), (f) aerosols (ruds).

EXAMPLE B
Figure 10.7 is a Q-Q plot for groups III and V of Bjerkedal (see Example A in Section 10.2.2). It shows that the difference in the quantiles increases for the larger quantiles; this is consistent with the observations we made earlier. From his analysis of the data, Bjerkedal concluded that the increases were proportionally the same for animals with little, average, or great resistance—that is, that the treatment effect is multiplicative in the sense defined earlier. If this were the case, the Q-Q plot would be a straight line. For times up to about 200 days, the animals in group III live approximately twice as long as those in group V, but beyond that point the difference is roughly constant. The Q-Q plot thus provides a simple and effective means of comparing the lifetimes in the two groups. ■

Further discussion and examples of Q-Q plots can be found in Wilk and Gnanadesikan (1968).

FIGURE 10.7 Q-Q plot of groups III and V from Bjerkedal (1960). For reference, the line y = x has been added.

10.3 Histograms, Density Curves, and Stem-and-Leaf Plots

The histogram, a time-honored method of displaying data, has already been introduced. It displays the shape of the distribution of data values in the same sense that a density function displays probabilities. The range of the data is divided into intervals, or bins, and the number or proportion of the observations falling in each bin is plotted. If the bins are not of equal size, the resulting histogram can be misleading.
A procedure that is often recommended is to plot the proportion of observations falling in the bin divided by the bin width; if this procedure is used, the area under the histogram is 1.

Figure 10.8 shows three histograms of the melting points of beeswax from Example A in Section 10.2.1 with increasingly larger bin widths. If the bin width is too small, the histogram is too ragged; if the bin width is too large, the shape is oversmoothed and obscured. The choice of bin width is usually made subjectively in an attempt to strike a balance between a histogram that is too ragged and one that oversmooths. Rudemo (1982) discusses automatic methods for choosing the bin width.

FIGURE 10.8 Histograms of melting points of beeswax: (a) bin width = .1, (b) bin width = .2, (c) bin width = .5.

Histograms are frequently used to display data for which there is no assumption of any stochastic model—for example, populations of U.S. cities. If the data are modeled as a random sample from some continuous distribution, the histogram may be viewed as an estimate of the probability density. Regarded in this light, it suffers from not being smooth. A smooth probability density estimate can be constructed in the following way. Let $w(x)$ be a nonnegative, symmetric weight function, centered at zero and integrating to 1. For example, $w(x)$ can be the standard normal density. The function

$$w_h(x) = \frac{1}{h} w\left(\frac{x}{h}\right)$$

is a rescaled version of $w$. As $h$ approaches zero, $w_h$ becomes more concentrated and peaked about zero. As $h$ approaches infinity, $w_h$ becomes more spread out and flatter. If $w(x)$ is the standard normal density, then $w_h(x)$ is the normal density with standard deviation $h$. If $X_1, \ldots, X_n$ is a sample from a probability density function, $f$, an estimate of $f$ is

$$f_h(x) = \frac{1}{n} \sum_{i=1}^{n} w_h(x - X_i)$$

This estimate, called a kernel probability density estimate, consists of the superposition of "hills" centered on the observations. In the case where $w(x)$ is the standard normal density, $w_h(x - X_i)$ is the normal density with mean $X_i$ and standard deviation $h$.

The parameter $h$, the bandwidth of the estimating function, controls its smoothness and corresponds to the bin width of the histogram. If $h$ is too small, the estimate is too rough; if it is too large, the shape of $f$ is smeared out too much. Figure 10.9 shows estimates of the probability density of the melting points of beeswax (from Example A in Section 10.2.1) for various values of $h$.

FIGURE 10.9 Probability density estimates from melting point data. The kernel w is the standard normal density with standard deviation (a) .025, (b) .125, and (c) 1.25. Note that the vertical scales are different.

Making a reasonable choice of the bandwidth is important, just as is choosing the bin width for a histogram. From Figure 10.9, we see that too small a bandwidth yields a ragged curve and too large a bandwidth obscures the shape and spreads the probability mass out too much.
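The kernel estimate $f_h$ is only a few lines of code. A minimal sketch follows, assuming a standard normal kernel; the handful of data values is made up for illustration and is not the beeswax data set.

```python
import numpy as np

def kernel_density_estimate(data, h, grid):
    """f_h(x) = (1/n) * sum_i w_h(x - X_i), with w the standard normal density,
    so that w_h is the normal density with standard deviation h (the bandwidth)."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = len(data)
    # One row per grid point, one column per observation.
    z = (grid[:, None] - data[None, :]) / h
    w = np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * h)
    return w.sum(axis=1) / n

# Grid fine enough to resolve even the narrowest bandwidth below.
x = np.linspace(58, 70, 4801)
obs = [62.85, 63.03, 63.05, 63.08, 63.50, 64.12, 64.40]   # illustrative values
for h in (0.025, 0.125, 1.25):   # the three bandwidths used in Figure 10.9
    f = kernel_density_estimate(obs, h, x)
    # A kernel estimate is itself a density, so its integral should be about 1.
    print("h = %5.3f: estimate integrates to about %.3f" % (h, f.sum() * (x[1] - x[0])))
```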
Scott (1992) contains extensive discussion of probability density estimation, including methods for automatic, data-driven bandwidth choice and estimation of densities in more than one dimension.

One disadvantage of a histogram or a probability density estimate is that information is lost; neither allows the reconstruction of the original data. Furthermore, a histogram does not allow one to calculate a statistic such as a median; one can tell from a histogram only in which bin the median lies and not the median's actual value. Stem-and-leaf plots (Tukey 1977) convey information about shape while retaining the numerical information. It is easiest to define this type of plot by an example, a stem-and-leaf plot of the beeswax melting-point data (the decimal point is one place to the left of the colon):

STEM LEAF
 1   1  628: 5
 1   0  629:
 4   3  630: 358
 7   3  631: 033
 9   2  632: 77
18   9  633: 001446669
23   5  634: 01335
    10  635: 0000113668
26   7  636: 0013689
19   2  637: 88
17   6  638: 334668
11   5  639: 22223
 6   0  640:
 6   1  641: 2
 5   3  642: 147
 2   0  643:
 2   2  644: 02

The first three digits of the melting points have been selected to form the stem and are listed in the third column. The leaves on each stem are the fourth digit of all numbers with that stem. For example, the first stem is 628, and its leaf indicates the presence of the number 62.85 in the data. The third stem is 630, and its leaves indicate the presence of the numbers 63.03, 63.05, and 63.08. This stem-and-leaf plot was constructed by a computer, but such plots are very easy to make by hand. The second column of numbers gives the number of leaves on each stem. The first column of numbers facilitates finding order statistics, such as quartiles and the median; starting at the top of the plot and continuing down to the stem containing the median, the cumulative numbers of observations out to the smallest observation are listed. The numbering process is then extended symmetrically from the stem containing the median to the largest observation of the data.

Straightforward stem-and-leaf plots do not work well for data that range over several orders of magnitude. In such a situation, it is better to make a stem-and-leaf plot of the logarithms of the data.

10.4 Measures of Location

Sections 10.2 and 10.3 were concerned with data analogues of the cumulative distribution and density functions and with related curves, which convey visual information about the shape of the distribution of the data. Here and in Section 10.5, we discuss simple numerical summaries of data that are useful when there is not enough data to justify constructing a histogram or an ecdf, or when a more concise summary is desired.

A measure of location is a measure of the center of a batch of numbers. If the numbers result from different measurements of the same quantity, a measure of location is often used in the hope that it is more accurate than any single measurement. In other situations, a measure of location is used as a simple summary of the numbers—for example, "the average grade on the exam was 72." In this section, we will discuss several common measures of location and their relative advantages and disadvantages.

10.4.1 The Arithmetic Mean

The most commonly used measure of location is the arithmetic mean,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

For illustration, we consider a set of 26 measurements of the heat of sublimation of platinum from an experiment done by Hampson and Walker (1961).
The data are listed here:

Heats of Sublimation of Platinum (kcal/mol)
136.3  136.6  135.8  135.4  134.7  135.0  134.1  143.3  147.8
148.8  134.8  135.2  134.9  146.5  141.2  135.4  134.8  135.8
135.0  133.7  134.4  134.9  134.8  134.5  134.3  135.2

The 26 measurements are all attempts to measure the "true" heat of sublimation, and we see that there is variability among them. Intuitively, it may seem that a measure of location or center for this batch of numbers would give a more accurate estimate of the heat of sublimation than any one of the numbers alone.

A common statistical model for the variability of a measurement process is the following:

$$X_i = \mu + \beta + \varepsilon_i$$

(See Section 4.2.1.) Here, $X_i$ is the value of the $i$th measurement, $\mu$ is the true value of the heat of sublimation, $\beta$ represents bias in the measurement procedure, and $\varepsilon_i$ is the random error. The $\varepsilon_i$ are usually assumed to be independent and identically distributed random variables with mean 0 and variance $\sigma^2$. The efficacy of measures of location is often judged by comparing their performances (mean squared error, for example) with this model. Note that with this model the data alone tell us nothing about $\beta$, the bias in the measurement procedure, which in some cases may be as important as, or more important than, the random variability.

The observations are listed across rows in the order in which the experiments were done. When observations are acquired sequentially, it is often informative to plot them in order, as in Figure 10.10. From this plot, we see that the first few observations were somewhat high. The most striking aspect of the plot is the presence of five extreme observations that occurred in groups of three and two. Such observations, which are quite far from the bulk of the data, are called outliers. Outliers occur all too frequently, even in carefully conducted studies.

FIGURE 10.10 Plot showing time sequence of measurements of heat of sublimation of platinum.

The outliers in this case might have been caused by improperly calibrated equipment, for example. Outliers can also be caused by recording and transcription errors or by equipment malfunctions. It is important to detect outliers, since they may have an undue influence on subsequent calculations. Graphical presentation is an effective means of detection. Careful reexamination of the data and the circumstances under which they were obtained can sometimes uncover the causes behind the outliers. Although outliers are often unexplainable aberrations, an examination of them and their causes can sometimes deepen an investigator's understanding of the phenomenon under study.

Figure 10.10 also makes us doubt that the model for measurement error given above is appropriate for this set of data. The fact that the outliers occur in groups of two and three, rather than being randomly scattered, makes the independence model somewhat implausible.

A stem-and-leaf plot provides another summary of this data (the decimal point is at the colon):

 1   1  133: 7
 4   3  134: 134
11   7  134: 5788899
     6  135: 002244
 9   2  135: 88
 7   1  136: 3
 6   1  136: 6
High: 141.2 143.3 146.5 147.8 148.8

On this stem-and-leaf plot, the outlying observations have been isolated and flagged as high. In their analysis, Hampson and Walker set aside the seven largest observations and the smallest observation and found the average of the remaining observations to be 134.9.
Calculated from all the observations, the arithmetic mean is 137.05. Note from the stem-and-leaf plot and from Figure 10.10 that this number is larger than the bulk of the data and is clearly not a good descriptive measure of the "center" of this batch of numbers. We would not be satisfied with it as an estimate of the true heat of sublimation.

If the data are modeled as a sample from a probability law, as with the measurement error model described above, an approximate $100(1 - \alpha)\%$ confidence interval for the population mean can be obtained from the central limit theorem as in Chapter 7. The interval is of the form

$$\bar{x} \pm z(\alpha/2) s_{\bar{x}}$$

Blindly applying this formula to the platinum data, with $\alpha = .05$, we obtain the interval $137.05 \pm 1.71$, or $(135.3, 138.8)$. Note where this interval falls on the stem-and-leaf plot! Although the example presented here may be somewhat extreme, it illustrates the sensitivity of the sample mean to outlying observations. In fact, by changing a single number, the arithmetic mean of a batch of numbers can be made arbitrarily large or small. Thus, if used blindly, without careful attention to the data, the arithmetic mean can produce misleading results. When the data are automatically acquired, stored as files on disks or tapes, and not visually examined, this danger increases. For this reason, measures of location that are robust, or insensitive to outliers, are important.

10.4.2 The Median

If the sample size is an odd number, the median is defined to be the middle value of the ordered observations; if the sample size is even, the median is the average of the two middle values. Clearly, moving the extreme observations does not affect the sample median at all, so the median is quite robust. The median of the platinum data is 135.1, which, as can be seen from the stem-and-leaf plot, is more reasonable than the mean as a measure of the center.

When the data are a sample from a continuous probability law, the sample median can be viewed as an estimate of the population median, $\eta$, for which a simple confidence interval can be formed. We will now demonstrate that this interval is of the form

$$(X_{(k)}, X_{(n-k+1)})$$

The coverage probability of this interval is

$$P(X_{(k)} \le \eta \le X_{(n-k+1)}) = 1 - P(\eta < X_{(k)} \text{ or } \eta > X_{(n-k+1)}) = 1 - P(\eta < X_{(k)}) - P(\eta > X_{(n-k+1)})$$

since the events are mutually exclusive. To evaluate these terms, we first note that

$$P(\eta > X_{(n-k+1)}) = \sum_{j=0}^{k-1} P(j \text{ observations are greater than } \eta)$$

$$P(\eta < X_{(k)}) = \sum_{j=0}^{k-1} P(j \text{ observations are less than } \eta)$$

Since, by definition, the median satisfies

$$P(X_i > \eta) = P(X_i < \eta) = \tfrac{1}{2}$$

and since the $n$ observations $X_1, \ldots, X_n$ are independent and identically distributed, the distribution of the number of observations greater than the median is binomial with $n$ trials and probability $\tfrac{1}{2}$ of success on each trial. Thus,

$$P(\text{exactly } j \text{ observations are greater than } \eta) = \frac{1}{2^n} \binom{n}{j}$$

and

$$P(\eta > X_{(n-k+1)}) = \frac{1}{2^n} \sum_{j=0}^{k-1} \binom{n}{j}$$

From symmetry, we then have that the coverage probability of the interval in question is

$$1 - \frac{1}{2^{n-1}} \sum_{j=0}^{k-1} \binom{n}{j}$$

These probabilities can be found from tables of the cumulative binomial distribution since

$$\frac{1}{2^n} \sum_{j=0}^{k-1} \binom{n}{j} = P(Y \le k - 1)$$

where $Y$ is a binomial random variable with $n$ trials and probability of success equal to $\tfrac{1}{2}$.

EXAMPLE A
As a concrete example, with $n = 26$, we have the following cumulative binomial probabilities:

k    P(Y ≤ k)
5    .0012
6    .0047
7    .0145
8    .0378
9    .0843

If we choose $k = 8$, then $P(Y \le k - 1) = P(Y \le 7) = .0145$, and since $P(Y \le k - 1) = P(Y \ge n - k + 1)$, we also have $P(Y \ge 19) = .0145$.
Since $2 \times .0145 = .029$, the interval $(X_{(8)}, X_{(19)})$ is a 97% confidence interval. Note that this confidence interval is exact, not approximate, and does not depend on the form of the underlying cdf but only on the assumption that the cdf is continuous and that the observations are independent. For the platinum data, this confidence interval is $(134.8, 135.8)$. Compare this interval to the interval based on the sample mean. (But as we noted, there is reason to doubt the independence assumption for the platinum data, so these calculations should be viewed as an illustrative numerical exercise.) ■

10.4.3 The Trimmed Mean

Another simple and robust measure of location is the trimmed mean. The $100\alpha\%$ trimmed mean is easy to calculate: Order the data, discard the lowest $100\alpha\%$ and the highest $100\alpha\%$, and take the arithmetic mean of the remaining data. It is generally recommended that the value chosen for $\alpha$ be from .1 to .2. Formally, we may write the trimmed mean as

$$\bar{x}_\alpha = \frac{x_{([n\alpha]+1)} + \cdots + x_{(n-[n\alpha])}}{n - 2[n\alpha]}$$

where $[n\alpha]$ denotes the greatest integer less than or equal to $n\alpha$. Note that the median can be regarded as a 50% trimmed mean. The 20% trimmed mean for the platinum data listed in Section 10.4.1 is formed by discarding the highest and lowest five observations ($.2 \times 26 = 5.2$) and averaging the rest. The result is 135.29; for the same data, the median was 135.1 and the mean was 137.05.

10.4.4 M Estimates

The sample mean is the mle of $\mu$, the location parameter, when the underlying distribution is normal. Equivalently, the sample mean minimizes the negative log likelihood, or

$$\sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2$$

This is the simplest case of a least squares estimate. (We will discuss least squares estimates in more detail in the context of curve fitting.) Outliers have a great effect on this estimate, since the deviation of $\mu$ from $X_i$ is measured by the square of their difference. In contrast, the median is the minimizer of (see Problem 34 of the end-of-chapter problems)

$$\sum_{i=1}^{n} \left| \frac{X_i - \mu}{\sigma} \right|$$

Here, large deviations are not weighted as heavily, and it is this property that causes the median to be robust. Huber (1981) proposed a class of estimates, M estimates, which are the minimizers of

$$\sum_{i=1}^{n} \Psi\left( \frac{X_i - \mu}{\sigma} \right)$$

where the weight function $\Psi$ is a compromise between the weight functions for the mean and the median. A wide variety of weight functions have been proposed. Huber discusses weight functions that are quadratic near zero and are linear beyond a cutoff point, $k$. Thus, $k = \infty$ corresponds to the mean and $k = 0$ to the median. A common choice is $k = 1.5$. With this choice, the influence of observations more than $1.5\sigma$ away from the center is reduced. In practice, a robust estimate of $\sigma$, such as those discussed in Section 10.5, must be used. The computation of an M estimate is a nonlinear minimization problem and must be done iteratively (using the Newton-Raphson method, for example). If $\Psi$ is a convex function, the minimizer will be unique. Fairly simple computer programs that do this are common in statistical packages. The M estimate ($k = 1.5$) for the platinum data we have been considering is 135.38, close to the median (135.1) and the trimmed mean (135.29) but quite different from the mean (137.05).

10.4.5 Comparison of Location Estimates

We introduced several location estimates (and there are many others). Which one is best? There is no simple answer to this question. It is always important to bear in mind what is being estimated by the location estimates and to what purpose the estimate is being put.
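All of the estimates under comparison are easy to compute. Here is a minimal Python sketch applied to the platinum measurements listed in Section 10.4.1; the Huber estimate is computed by iteratively reweighted averaging with a MAD-based scale estimate (one common scheme, not the only one), and the function names are ours.

```python
import numpy as np

def trimmed_mean(data, alpha):
    """Discard the lowest and highest [n*alpha] observations and average the rest."""
    x = np.sort(np.asarray(data, dtype=float))
    g = int(len(x) * alpha)                      # greatest integer <= n*alpha
    return x[g:len(x) - g].mean()

def huber_m_estimate(data, k=1.5, tol=1e-8):
    """Huber location M estimate: quadratic loss within k robust standard
    deviations of the center, linear beyond.  Computed by iteratively
    reweighted averaging; the scale is fixed at the MAD/.675 estimate."""
    x = np.asarray(data, dtype=float)
    mu = np.median(x)
    sigma = np.median(np.abs(x - mu)) / 0.675
    while True:
        z = (x - mu) / sigma
        w = np.ones_like(z)
        far = np.abs(z) > k
        w[far] = k / np.abs(z[far])              # downweight distant observations
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new

platinum = [136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3, 147.8,
            148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4, 134.8, 135.8,
            135.0, 133.7, 134.4, 134.9, 134.8, 134.5, 134.3, 135.2]
print("mean             ", np.mean(platinum))
print("median           ", np.median(platinum))
print("20% trimmed mean ", trimmed_mean(platinum, 0.20))
print("Huber M (k = 1.5)", huber_m_estimate(platinum))
```

The printed values should land close to the figures quoted in the text (135.29 for the trimmed mean, roughly 135.4 for the M estimate), though the M estimate depends somewhat on the scale estimate and iteration details.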
If the underlying distribution is symmetric, the trimmed mean, the sample mean, the sample median, and an M estimate all estimate the center of symmetry. If the underlying distribution is not symmetric, however, the four statistics estimate four different population parameters: the population mean, the population median, the population trimmed mean, and a functional of the cdf determined by the weight function $\Psi$. Moreover, there is no single estimate that is best for all symmetric distributions. Life isn't that simple.

Simulations have been done to compare estimates for a variety of distributions. Andrews et al. (1972) report the results of a large number of simulations from symmetric distributions. Their results show that the 10% or 20% trimmed mean is overall quite an effective estimate: Its variance is never much larger than the variance of the ordinary mean (even in the Gaussian case, for which the mean is optimal) and can be quite a lot smaller when the underlying distribution is heavy-tailed relative to the Gaussian. The median, although quite robust, has a substantially larger variance in the Gaussian case than does the trimmed mean. The trimmed mean and the median have a certain appealing simplicity and are easy to explain to someone who has little formal statistical training. M estimates performed quite well in the simulations of the Andrews et al. study, and they do generalize more naturally to other problems such as curve fitting. But they are somewhat more difficult to compute and have less immediate intuitive appeal. For the purpose of simply summarizing data, it is often useful to compute more than one measure of location and compare the results.

10.4.6 Estimating Variability of Location Estimates by the Bootstrap

If we view the observations $x_1, x_2, \ldots, x_n$ as realizations of independent random variables with common distribution function $F$, it is appropriate to investigate the variability and sampling distribution of a location estimate calculated from a sample of size $n$. Suppose we denote the location estimate as $\hat{\theta}$; it is important to keep in mind that $\hat{\theta}$ is a function of the random variables $X_1, X_2, \ldots, X_n$ and hence has a probability distribution, its sampling distribution, which is determined by $n$ and $F$. We would like to know this sampling distribution, but we are faced with two problems: (1) we don't know $F$, and (2) even if we knew $F$, $\hat{\theta}$ may be such a complicated function of $X_1, X_2, \ldots, X_n$ that finding its distribution would exceed our analytic abilities.

First, we address the second problem. Suppose, then, for the moment, that we knew $F$. How could we find the probability distribution of $\hat{\theta}$ without going through incredibly complicated analytic calculations? The computer comes to our rescue—we can do it by simulation. We generate many, many samples, say $B$ in number, of size $n$ from $F$; from each sample we calculate the value of $\hat{\theta}$. The empirical distribution of the resulting values $\theta_1^*, \theta_2^*, \ldots, \theta_B^*$ is an approximation to the distribution function of $\hat{\theta}$, which is good if $B$ is very large. If we want to know the standard deviation of $\hat{\theta}$, we can find a good approximation to it by calculating the standard deviation of the collection of values $\theta_1^*, \theta_2^*, \ldots, \theta_B^*$. We can make these approximations arbitrarily accurate by taking $B$ to be arbitrarily large. All this would be well and good if we knew $F$, but we don't. So what do we do? The bootstrap solution is to view the empirical cdf $F_n$ as an approximation to $F$ and sample from $F_n$.
That is, $F_n$ would be used in place of $F$ in the previous paragraph. How do we go about sampling from $F_n$? $F_n$ is a discrete probability distribution that gives probability $1/n$ to each observed value $x_1, x_2, \ldots, x_n$. A sample of size $n$ from $F_n$ is thus a sample of size $n$ drawn with replacement from the collection $x_1, x_2, \ldots, x_n$. We thus draw $B$ samples of size $n$ with replacement from the observed data, producing $\theta_1^*, \theta_2^*, \ldots, \theta_B^*$. The standard deviation of $\hat{\theta}$ is then estimated by

$$s_{\hat{\theta}} = \sqrt{\frac{1}{B} \sum_{i=1}^{B} (\theta_i^* - \bar{\theta}^*)^2}$$

where $\bar{\theta}^*$ is the mean of $\theta_1^*, \theta_2^*, \ldots, \theta_B^*$.

EXAMPLE A
We illustrate this idea on the platinum data by using the bootstrap to approximate the sampling distribution of the 20% trimmed mean and its standard error. To this end, 1000 samples of size $n = 26$ were drawn randomly with replacement from the collection of 26 values. A histogram of the 1000 trimmed means is displayed in Figure 10.11. The standard deviation of the 1000 values was .64, which is the estimated standard error of the 20% trimmed mean. The histogram is interesting—note the skewed tail to the right. We see that some of the trimmed means were far from the bulk of the data. This happened because some of the samples drawn with replacement included several replicates of the five outliers (see Figure 10.10). The computer calculation is telling us that if we sample from $F_n$, the 20% trimmed mean is not as robust as we might like; this is an extremely heavy-tailed distribution, and a sample of 26 may contain a large number of outliers.

FIGURE 10.11 Histogram of 1000 bootstrap 20% trimmed means.

As in Chapter 8, we can use the bootstrap distribution to form an approximate 90% confidence interval. We proceed as in Examples D and E of Section 8.5.3, which you may want to review at this time. Denote the trimmed mean of the sample by $\hat{\theta} = 135.29$, and denote the 1000 ordered bootstrap trimmed means by $\theta_{(1)}^* \le \theta_{(2)}^* \le \cdots \le \theta_{(1000)}^*$. Then the .05 quantile of the bootstrap distribution is $\underline{\theta} = \theta_{(50)}^* = 135.00$, and the .95 quantile is $\bar{\theta} = \theta_{(950)}^* = 136.93$. Following the notation of the examples of Section 8.5.3, the approximate 90% confidence interval is $(\hat{\theta} - \bar{\delta}, \hat{\theta} - \underline{\delta})$, where

$$\hat{\theta} - \bar{\delta} = \hat{\theta} - (\bar{\theta} - \hat{\theta}) = 2\hat{\theta} - \bar{\theta} = 133.65$$

and

$$\hat{\theta} - \underline{\delta} = \hat{\theta} - (\underline{\theta} - \hat{\theta}) = 2\hat{\theta} - \underline{\theta} = 135.58$$

Figure 10.12 is a histogram of 1000 bootstrapped medians. It is less dispersed than the histogram of trimmed means; the standard deviation of the medians is .24, considerably less than that of the trimmed means. The bootstrap simulation is telling us that when sampling from a distribution like this, the median is more robust than the 20% trimmed mean. ■

FIGURE 10.12 Histogram of 1000 bootstrap medians.

How accurate are these bootstrap estimates? It is difficult to answer this question in a useful, explicit manner. Essentially, the accuracy depends on two factors: (1) the accuracy of $F_n$ as an estimate of $F$, and (2) the dependence of the distribution of the statistic $\hat{\theta}$ on $F$. For example, if the distribution of $\hat{\theta}$ changes little as $F$ changes, then $F_n$ need not be a very good estimate of $F$, whereas if the distribution of $\hat{\theta}$ is extremely sensitive to $F$, then $F_n$ will have to be a good estimate of $F$, and hence the sample size will have to be large, in order for the bootstrap approximation to be accurate.
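This recipe translates directly into code. Below is a minimal sketch using the platinum data and the 20% trimmed mean; $B$, the seed, and the helper names are our choices, and the interval follows the $(2\hat{\theta} - \bar{\theta},\; 2\hat{\theta} - \underline{\theta})$ construction of the example.

```python
import numpy as np

def trimmed_mean_20(x):
    x = np.sort(np.asarray(x, dtype=float))
    g = int(len(x) * 0.20)
    return x[g:len(x) - g].mean()

def bootstrap(data, estimator, B=1000, seed=0):
    """Draw B samples of size n with replacement from the data (i.e., from F_n)
    and return the B resulting values of the estimator."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    return np.array([estimator(rng.choice(data, size=len(data), replace=True))
                     for _ in range(B)])

platinum = [136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3, 147.8,
            148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4, 134.8, 135.8,
            135.0, 133.7, 134.4, 134.9, 134.8, 134.5, 134.3, 135.2]
theta_star = bootstrap(platinum, trimmed_mean_20)
theta_hat = trimmed_mean_20(platinum)
se = theta_star.std()                             # divisor B, matching the formula above
lo, hi = np.quantile(theta_star, [0.05, 0.95])    # bootstrap .05 and .95 quantiles
print("bootstrap SE:", round(se, 3))
print("approx 90%% CI: (%.2f, %.2f)" % (2 * theta_hat - hi, 2 * theta_hat - lo))
```

With a different seed or a different $B$ the numbers wander a little, but the standard error should be roughly comparable to the .64 reported in the example.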
10.5 Measures of Dispersion

A measure of dispersion, or scale, gives a numerical indication of the "scatteredness" of a batch of numbers. Simple summaries of data often consist of a measure of location and a measure of dispersion. The most commonly used measure is the sample standard deviation, $s$, which is the square root of the sample variance,

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

Using $n - 1$ as the divisor rather than the more obvious divisor $n$ is based on the rationale that $s^2$ is an unbiased estimate of the population variance if the observations are independent and identically distributed with variance $\sigma^2$. (But $s$ is not an unbiased estimate of $\sigma$ because the square root is a nonlinear function.) If $n$ is of moderate to large size, it makes little difference whether $n$ or $n - 1$ is used. If the observations are a sample from a normal distribution with variance $\sigma^2$,

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

This distributional result may be used to construct confidence intervals for $\sigma^2$ in the normal case (compare with Example A in Section 8.5.3), but the result is not robust against deviations from normality.

Like the sample mean, the sample standard deviation is sensitive to outlying observations. Two simple robust measures of dispersion are the interquartile range (IQR), which is the difference between the two sample quartiles (the 25th and 75th percentiles), and the median absolute deviation from the median (MAD). If the data are $x_1, \ldots, x_n$ with median $\tilde{x}$, the MAD is defined to be the median of the numbers $|x_i - \tilde{x}|$. These two measures of dispersion, the IQR and the MAD, can be converted into estimates of $\sigma$ for a normal distribution by dividing them by 1.35 and .675, respectively. David (1981) discusses a method for finding a confidence interval for the population interquartile range, using reasoning similar to that used in Section 10.4.2 for developing a confidence interval for the population median.

Let us compare all three measures of dispersion for the platinum data:

$$s = 4.45, \qquad \frac{\text{IQR}}{1.35} = 1.26, \qquad \frac{\text{MAD}}{.675} = .934$$

The two robust estimates are similar. From the stem-and-leaf plot of the platinum values presented earlier, we can see that both the IQR and the MAD give measures of the spread of the central portion of the data, whereas the standard deviation is heavily influenced by the outliers.

10.6 Boxplots

A boxplot is a graphical display invented by Tukey that shows a measure of location (the median), a measure of dispersion (the interquartile range), and the presence of possible outliers, and also gives an indication of the symmetry or skewness of the distribution. Figure 10.13 is a boxplot of the platinum data.

FIGURE 10.13 Boxplot of the platinum data, showing the median, the upper and lower quartiles, and the outlying observations.

We outline the construction of a boxplot:

1. Horizontal lines are drawn at the median and at the upper and lower quartiles and are joined by vertical lines to produce the box.
2. A vertical line is drawn up from the upper quartile to the most extreme data point that is within a distance of 1.5 IQR of the upper quartile. A similarly defined vertical line is drawn down from the lower quartile. Short horizontal lines are added to mark the ends of these vertical lines.
3. Each data point beyond the ends of the vertical lines is marked with an asterisk or dot (* or ·).

Boxplots are not uniformly standardized, but the basic structure is as outlined above, perhaps with additional embellishments or small variations.
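The dispersion measures, together with the quantities needed for steps 1 through 3 of the boxplot construction above, can be computed as follows. This is a minimal sketch on the platinum data; note that different software quartile conventions will move the IQR, and hence the whisker ends, slightly.

```python
import numpy as np

def dispersion_and_boxplot_stats(data):
    """Sample standard deviation, IQR/1.35 and MAD/.675 (robust sigma estimates for
    a normal), plus the boxplot ingredients: quartiles, whisker ends, outliers."""
    x = np.asarray(data, dtype=float)
    s = x.std(ddof=1)                              # divisor n - 1
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    mad = np.median(np.abs(x - np.median(x)))
    # Whiskers extend to the most extreme points within 1.5 IQR of the quartiles.
    upper = x[x <= q3 + 1.5 * iqr].max()
    lower = x[x >= q1 - 1.5 * iqr].min()
    outliers = x[(x > q3 + 1.5 * iqr) | (x < q1 - 1.5 * iqr)]
    return s, iqr / 1.35, mad / 0.675, (q1, med, q3), (lower, upper), outliers

platinum = [136.3, 136.6, 135.8, 135.4, 134.7, 135.0, 134.1, 143.3, 147.8,
            148.8, 134.8, 135.2, 134.9, 146.5, 141.2, 135.4, 134.8, 135.8,
            135.0, 133.7, 134.4, 134.9, 134.8, 134.5, 134.3, 135.2]
s, iqr_sigma, mad_sigma, quartiles, whiskers, outliers = \
    dispersion_and_boxplot_stats(platinum)
print("s = %.2f, IQR/1.35 = %.2f, MAD/.675 = %.3f" % (s, iqr_sigma, mad_sigma))
print("quartiles:", quartiles, " whisker ends:", whiskers, " outliers:", outliers)
```

On the platinum data this flags the same five high observations seen in the stem-and-leaf plot; the IQR-based figure differs a bit from the text's 1.26 because of the quartile convention.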
A boxplot thus gives an indication of the center of the data (the median), the spread of the data (the interquartile range), and the presence of outliers, and indicates the symmetry or asymmetry of the distribution of data values (the location of the median relative to the quartiles). In Figure 10.13, the five outliers of the platinum data are clearly displayed, and we see an indication that the central part of the distribution is somewhat skewed toward high values.

EXAMPLE A
Figure 10.14 is taken from Chambers et al. (1983). The data plotted are daily maximum concentrations in parts per billion of sulfur dioxide in Bayonne, N.J., from November 1969 to October 1972, grouped by month. There are thus 36 batches, each of size about 30. The investigators concluded:

The boxplots . . . show many properties of the data rather strikingly. There is a general reduction in sulphur dioxide concentration through time due to the gradual conversion to low sulphur fuels in the region. The decline is most dramatic for the highest quantiles. Also, there are higher concentrations during the winter months due to the use of heating oil. In addition, the boxplots show that the distributions are skewed toward high values and that the spread of the distributions . . . is larger when the general level of concentration is higher.

FIGURE 10.14 Boxplots of daily maximum concentrations of sulfur dioxide.

The boxplot is clearly a very effective method of presenting and summarizing these data. As they are in this example, boxplots are generally useful for comparing batches of numbers, a purpose to which they will be put in the next two chapters. ■

10.7 Exploring Relationships with Scatterplots

Many interesting questions in statistics involve trying to understand the relationships among variables. The scatterplot is a basic method for displaying the empirical relationship between two variables based on a collection of pairs $(x_i, y_i)$: one merely plots the points in the $xy$ plane. This basic display can be augmented in various ways, as we will illustrate with some examples.

EXAMPLE A
Allison and Cicchetti (1976) examined the relationships of possible correlates of sleep behavior in mammals. Figure 10.15 is a scatterplot of total sleep versus brain weight. Other than that two mammals with very large brains slept very little, no relationships are apparent in the plot. There is in fact a relationship, but it is obscured in the plot because brain weights vary over orders of magnitude—the brain of the lesser short-tailed shrew weighs 0.14 grams, and at the other extreme the brain of the African elephant weighs 5,712 grams. It is thus much more informative to plot sleep versus the logarithm of brain weight, and annotating the plot helps further—as shown in Figure 10.16. It is now clear that mammals with heavier brains tend to sleep less.

FIGURE 10.15 Sleep versus brain weight for a collection of mammals.
FIGURE 10.16 Sleep versus logarithm of brain weight, with each point labeled by species (from the African elephant to the water opossum).

Data on these and other variables (how much do elephants dream?) can be found at http://lib.stat.cmu.edu/datasets/sleep. ■

Correlation coefficients are often used as a simple numerical summary of the strength of a relationship. The Pearson correlation coefficient corresponding to the pairs $(x_i, y_i)$ is

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

This statistic measures the strength of a linear relationship. The correlation of brain weight and sleep is −0.36, and the correlation between the logarithm of brain weight and sleep is −0.56. These are different because a nonlinear transformation has been applied and the correlation coefficient measures the strength of a linear relationship. An alternative to the Pearson correlation coefficient is the rank correlation coefficient: the brain weights are replaced by their ordered ranks (1, 2, . . .), the sleeping times are replaced by their ranks, and then the Pearson correlation coefficient of the pairs of ranks is computed. The rank correlation turns out to be −0.39 in our example. Some advantages of the rank correlation coefficient are that it is insensitive to outliers and is invariant under any monotone transformation (thus the rank correlation does not depend on whether brain weight or log brain weight is used).

Arrays of scatterplots are useful for examining the relationships among more than two variables, as illustrated in the following example.

EXAMPLE B
Inductive loop detectors are wire loops embedded in the pavement of roadways. They operate by detecting the change in inductance caused by the metal in vehicles that pass over them. During successive intervals of time, a detector reports the number of passing vehicles and the percentage of time that it was covered by a vehicle. The number of vehicles is called the flow; the percentage of coverage is called the occupancy. Such detectors are widely used to measure freeway traffic but are subject to various kinds of malfunction. Faulty detectors must be identified by traffic management centers. One key to detecting malfunction is knowing that measurements in the several freeway lanes at a particular location should be highly related—the increases and decreases of traffic flow in one lane should tend to be mirrored in other lanes. Figure 10.17 shows an array of scatterplots of occupancy measured by detectors in four lanes at a particular location (Bickel et al. 2004). The detectors in lanes three and four were closely related to each other at all times and were correlated with measurements in lanes one and two some, but not all, of the time. Apparently the detectors in lanes one and two malfunctioned some of the time while this set of measurements was taken. ■

FIGURE 10.17 Occupancy measurements by adjacent loops in four lanes.
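Both correlation coefficients defined earlier in this section are short computations. In the sketch below the functions are ours, and the six brain-weight/sleep pairs are made-up illustrative values (only the shrew and elephant brain weights are taken from the example), so the printed correlations will not reproduce the −0.36, −0.56, and −0.39 quoted above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of cross-products over the root product of sums of squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

def rank_correlation(x, y):
    """Replace each variable by its ranks, then take the Pearson correlation of the
    ranks; this is invariant under monotone transformations of either variable."""
    ranks = lambda v: np.argsort(np.argsort(np.asarray(v))) + 1   # ranks 1..n (no ties)
    return pearson_r(ranks(x), ranks(y))

# Hypothetical brain-weight (grams) and total-sleep (hours) pairs.
brain = np.array([0.14, 4.2, 57.0, 180.0, 1320.0, 5712.0])
sleep = np.array([14.9, 12.5, 9.8, 8.4, 8.0, 3.3])
print("Pearson r (raw weights):", round(pearson_r(brain, sleep), 3))
print("Pearson r (log weights):", round(pearson_r(np.log(brain), sleep), 3))
print("rank correlation       :", round(rank_correlation(brain, sleep), 3))
```

Note that the rank correlation comes out identical for raw and logged brain weights, since the logarithm is monotone.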
10.8 Concluding Remarks

This chapter introduced several tools for summarizing data, some of which are graphical in nature. Under the assumption of a stochastic model for the data, some aspects of the sampling distributions of these summaries have been discussed. Summaries are very important in practice; an intelligent summary of data is often sufficient to fulfill the purposes for which the data were gathered, and more formal techniques such as confidence intervals or hypothesis tests sometimes add little to an investigator's understanding. Effective summaries can also point to "bad" data or to unexpected aspects of data that might have gone unnoticed if the data had been blindly crunched by a computer.

We saw the bootstrap appear again as a method for approximating a sampling distribution and functionals of it such as its standard deviation. The bootstrap, a relatively recent development in statistical methodology, relies on the availability of powerful and inexpensive computing resources. Our development of approximate confidence intervals based on the bootstrap followed that of Chapter 8, where we motivated the construction by using the bootstrap distribution of $\theta^* - \hat{\theta}$ to approximate the distribution of $\hat{\theta} - \theta_0$. We note that another popular method, known as the bootstrap percentile method, gives the interval $(\underline{\theta}, \bar{\theta})$ (see Example A of Section 8.5.3 for the definition of the notation). The rationale for this is harder to understand. More accurate methods for constructing bootstrap confidence intervals have been proposed and are under study, but we will not pursue these developments.

10.9 Problems

1. Plot the ecdf of this batch of numbers: 1, 14, 10, 9, 11, 9.

2. Suppose that $X_1, X_2, \ldots, X_n$ are independent $U[0, 1]$ random variables.
   a. Sketch $F(x)$ and the standard deviation of $F_n(x)$.
   b. Generate many samples of size 16 on a computer; for each sample, plot $F_n(x)$ and $F_n(x) - F(x)$. Relate what you see to your answer to (a).

3. From Figure 10.1, roughly what are the upper and lower quartiles and the median of the distribution of melting points?

4. In Section 10.2.1, it was claimed that the random variables $I_{(-\infty,x]}(X_i)$ are independent. Why is this so?

5. Let $X_1, \ldots, X_n$ be a sample (i.i.d.) from a distribution function, $F$, and let $F_n$ denote the ecdf. Show that

$$\operatorname{Cov}[F_n(u), F_n(v)] = \frac{1}{n}[F(m) - F(u)F(v)]$$

where $m = \min(u, v)$. Conclude that $F_n(u)$ and $F_n(v)$ are positively correlated: If $F_n(u)$ overshoots $F(u)$, then $F_n(v)$ will tend to overshoot $F(v)$.

6. Various chemical tests were conducted on beeswax by White, Riethof, and Kushnir (1960). In particular, the percentage of hydrocarbons in each sample of wax was determined.
   a. Plot the ecdf, a histogram, and a normal probability plot of the percentages of hydrocarbons given in the following table. Find the .90, .75, .50, .25, and .10 quantiles. Does the distribution appear Gaussian?
14.27 14.80 12.28 17.09 15.10 12.92 15.56 15.38 15.15 13.98
14.90 15.91 14.52 15.63 13.83 13.66 13.98 14.47 14.65 14.73
15.18 14.49 14.56 15.03 15.40 14.68 13.33 14.41 14.19 15.21
14.75 14.41 14.04 13.68 15.31 14.32 13.64 14.77 14.30 14.62
14.10 15.47 13.73 13.65 15.02 14.01 14.92 15.47 13.75 14.87
15.28 14.43 13.96 14.57 15.49 15.13 14.23 14.44 14.57

   b. The average percentage of hydrocarbons in microcrystalline wax (a synthetic commercial wax) is 85%. Suppose that beeswax was diluted with 1% microcrystalline wax. Could this be detected? What about a 3% or a 5% dilution? (Such questions were one of the main concerns of the beeswax study.)

7. Compare group I to group V in Figure 10.2. Roughly, what are the differences in lifetimes for the animals that are the 10% weakest, median, and 10% strongest?

8. Consider a sample of size 100 from an exponential distribution with parameter $\lambda = 1$.
   a. Sketch the approximate standard deviation of the empirical log survival function, $\log S_n(t)$, as a function of $t$.
   b. Generate several such samples of size 100 on a computer and for each sample plot the empirical log survival function. Relate the plots to your answer to (a).

9. Use the method of propagation of error to derive an approximation to the bias of the log survival function. Where is this bias large, and what is its sign?

10. Let $X_1, \ldots, X_n$ be a sample from cdf $F$ and denote the order statistics by $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. We will assume that $F$ is continuous, with density function $f$. From Theorem A in Section 3.7, the density function of $X_{(k)}$ is

$$f_k(x) = n \binom{n-1}{k-1} [F(x)]^{k-1} [1 - F(x)]^{n-k} f(x)$$

   a. Find the mean and variance of $X_{(k)}$ from a uniform distribution on $[0, 1]$. You will need to use the fact that the density of $X_{(k)}$ integrates to 1. Show that

$$\text{Mean} = \frac{k}{n+1}, \qquad \text{Variance} = \frac{1}{n+2} \cdot \frac{k}{n+1} \left(1 - \frac{k}{n+1}\right)$$

   b. Find the approximate mean and variance of $Y_{(k)}$, the $k$th-order statistic of a sample of size $n$ from $F$. To do this, let $X_i = F(Y_i)$, or $Y_i = F^{-1}(X_i)$. The $X_i$ are a sample from a $U[0, 1]$ distribution (why?). Use the propagation of error formula,

$$Y_{(k)} = F^{-1}(X_{(k)}) \approx F^{-1}\left(\frac{k}{n+1}\right) + \left(X_{(k)} - \frac{k}{n+1}\right) \left. \frac{d}{dx} F^{-1}(x) \right|_{x = k/(n+1)}$$

and argue that

$$E[Y_{(k)}] \approx F^{-1}\left(\frac{k}{n+1}\right), \qquad \operatorname{Var}(Y_{(k)}) \approx \frac{k}{n+1}\left(1 - \frac{k}{n+1}\right) \frac{1}{(f\{F^{-1}[k/(n+1)]\})^2} \cdot \frac{1}{n+2}$$

   c. Use the results of parts (a) and (b) to show that the variance of the $p$th sample quantile is approximately

$$\frac{1}{n f^2(x_p)} \, p(1 - p)$$

where $x_p$ is the $p$th quantile.

   d. Use the result of part (c) to find the approximate variance of the median of a sample of size $n$ from a $N(\mu, \sigma^2)$ distribution. Compare to the variance of the sample mean.

11. Calculate the hazard function for

$$F(t) = 1 - e^{-\alpha t^\beta}, \qquad t \ge 0$$

12. Let $f$ denote the density function and $h$ the hazard function of a nonnegative random variable. Show that

$$f(t) = h(t) e^{-\int_0^t h(s)\,ds}$$

that is, that the hazard function uniquely determines the density.

13. Give an example of a probability distribution with increasing failure rate.

14. Give an example of a probability distribution with decreasing failure rate.

15. A prisoner is told that he will be released at a time chosen uniformly at random within the next 24 hours. Let $T$ denote the time that he is released. What is the hazard function for $T$? For what values of $t$ is it smallest and largest? If he has been waiting for 5 hours, is it more likely that he will be released in the next few minutes than if he has been waiting for 1 hour?

16. Suppose that $F$ is $N(0, 1)$ and $G$ is $N(1, 1)$. Sketch a Q-Q plot. Repeat for $G$ being $N(1, 4)$.

17.
Suppose that $F$ is an exponential distribution with parameter $\lambda = 1$ and that $G$ is exponential with $\lambda = 2$. Sketch a Q-Q plot.

18. A certain chemotherapy treatment for cancer tends to lengthen the lifetimes of very seriously ill patients and decrease the lifetimes of the least ill patients. Suppose that an experiment is done that compares this treatment to a placebo. Draw a sketch showing the qualitative behavior of a Q-Q plot.

19. Consider the two cdfs:

$$F(x) = x, \quad 0 \le x \le 1 \qquad G(x) = x^2, \quad 0 \le x \le 1$$

Sketch a Q-Q plot of $F$ versus $G$.

20. Sketch what you would expect the qualitative shape of the hazard function of human mortality to look like.

21. Make Q-Q plots for other pairs of treatment groups from Bjerkedal's data (see Example A in Section 10.2.2). Does the model of a multiplicative effect appear reasonable?

22. By examining the survival function of group V of Bjerkedal's data (see Example A in Section 10.2.2), make a rough sketch of the qualitative shape of a histogram. Then make a histogram, and compare it to your guess.

23. In the examples of Q-Q plots in the text, we only discussed the case in which quantiles of equal-size batches are compared. From two batches of size $n$, the $k/(n+1)$ quantiles are estimated as $X_{(k)}$ and $Y_{(k)}$, so one merely has to plot $X_{(k)}$ versus $Y_{(k)}$. Write down a linear interpolation formula for the $p$th quantile, where $k/(n+1) \le p \le (k+1)/(n+1)$. Now suppose that the batch sizes are not the same, being $m$ and $n$, $m < n$ say. A Q-Q plot may be constructed by fixing the quantiles $k/(m+1)$ of the smaller data set and interpolating these quantiles for the larger data set. Interpolate to find the upper and lower quartiles of the following batch of numbers: 1, 2, 3, 4, 5, 6.

24. Show that the probability plots discussed in Section 9.9 are Q-Q plots of the empirical distribution $F_n$ versus a theoretical distribution $F$.

25. In Section 10.2.3, it was claimed that if $y_p = c x_p$, then $G(y) = F(y/c)$. Justify this claim.

26. Hampson and Walker also made measurements of the heats of sublimation of rhodium and iridium. Do the following calculations for each of the two given sets of data:
   a. Make a histogram.
   b. Make a stem-and-leaf plot.
   c. Make a boxplot.
   d. Plot the observations in the order of the experiment.
   e. Does the statistical model of independent and identically distributed measurement errors seem reasonable?
   f. Find the mean, 10% and 20% trimmed means, and median and compare them.
   g. Find the standard error of the sample mean and a corresponding approximate 90% confidence interval.
   h. Find a confidence interval based on the median that has as close to 90% coverage as possible.
   i. Use the bootstrap to approximate the sampling distributions of the 10% and 20% trimmed means and their standard errors and compare.
   j. Use the bootstrap to approximate the sampling distribution of the median and its standard error. Compare to the corresponding results for trimmed means above.
   k. Find approximate 90% confidence intervals based on the trimmed means and compare to the intervals for the mean and median found previously.
Iridium (kcal/mol)
136.6 145.2 151.5 162.7 159.1 159.8 160.8 173.9 160.1 160.4
161.1 160.6 160.2 159.5 160.3 159.2 159.3 159.6 160.0 160.2
160.1 160.0 159.7 159.5 159.5 159.6 159.5

Rhodium (kcal/mol)
126.4 135.7 132.9 131.5 131.1 131.1 131.9 132.7 133.3 132.5
133.0 133.0 132.4 131.6 132.6 132.2 131.3 131.2 132.1 131.1
131.4 131.2 131.1 131.1 134.2 133.8 133.3 133.5 133.4 133.5
133.0 132.8 132.6 133.3 133.5 133.5 132.3 132.7 132.9 134.1

27. Demographers often refer to the hazard function as the "age-specific mortality rate," or death rate. Until recently, most researchers in the field of gerontology thought that a death rate increasing with age was a universal fact in the biological world. There has been heavy debate over whether there is a genetically programmed upper limit to lifespan. Using a facility in which sterilized medflies are bred to be released to fight medfly infestations in California, James Carey and coworkers (Carey et al. 1992) bred more than a million medflies and recorded their pattern of mortality. The data file medflies contains the number of medflies alive, from an initial population of 1,203,646, as a function of age in days. Using these data, estimate and plot the age-specific mortality rate. Does it increase with age?

28. For a sample of size $n = 3$ from a continuous probability distribution, what is $P(X_{(1)} < \eta < X_{(3)})$, where $\eta$ is the median of the distribution?

11.2 Comparing Two Independent Samples

To test the null hypothesis $H_0: \mu_X = \mu_Y$, three alternative hypotheses may be considered:

$$H_1: \mu_X \ne \mu_Y \qquad H_2: \mu_X > \mu_Y \qquad H_3: \mu_X < \mu_Y$$

The first of these is a two-sided alternative, and the other two are one-sided alternatives. The first hypothesis is appropriate if deviations could in principle go in either direction, and one of the latter two is appropriate if it is believed that any deviation must be in one direction or the other. In practice, such a priori information is not usually available, and it is more prudent to conduct two-sided tests, as in Example A.

The test statistic that will be used to make a decision whether or not to reject the null hypothesis is

$$t = \frac{\bar{X} - \bar{Y}}{s_{\bar{X} - \bar{Y}}}$$

The $t$ statistic equals the multiple of its estimated standard deviation by which $\bar{X} - \bar{Y}$ differs from zero. It plays the same role in the comparison of two samples as is played by the chi-square statistic in testing goodness of fit. Just as we rejected for large values of the chi-square statistic, we will reject in this case for extreme values of $t$. The distribution of $t$ under $H_0$, its null distribution, is, from Theorem A, the $t$ distribution with $m + n - 2$ degrees of freedom. Knowing this null distribution allows us to determine a rejection region for a test at level $\alpha$, just as knowing that the null distribution of the chi-square statistic was chi-square with the appropriate degrees of freedom allowed the determination of a rejection region for testing goodness of fit. The rejection regions for the three alternatives just listed are

For $H_1$: $\quad |t| > t_{n+m-2}(\alpha/2)$
For $H_2$: $\quad t > t_{n+m-2}(\alpha)$
For $H_3$: $\quad t < -t_{n+m-2}(\alpha)$

Note how the rejection regions are tailored to the particular alternatives and how knowing the null distribution of $t$ allows us to determine the rejection region for any value of $\alpha$.

EXAMPLE B
Let us continue Example A. To test $H_0: \mu_A = \mu_B$ versus a two-sided alternative, we form and calculate the following test statistic:

$$t = \frac{\bar{X}_A - \bar{X}_B}{s_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} = 3.33$$

From Table 4 in Appendix B, $t_{19}(.005) = 2.861 < 3.33$. The two-sided test would thus reject at the level $\alpha = .01$.
EXAMPLE B
Let us continue Example A. To test $H_0: \mu_A = \mu_B$ versus a two-sided alternative, we form and calculate the following test statistic:
$$t = \frac{\bar X_A - \bar X_B}{s_p\sqrt{\frac{1}{n} + \frac{1}{m}}} = 3.33$$
From Table 4 in Appendix B, $t_{19}(.005) = 2.861 < 3.33$. The two-sided test would thus reject at the level α = .01. If there were no difference in the two conditions, differences as large as or larger than that observed would occur only with probability less than .01—that is, the p-value is less than .01. There is little doubt that there is a difference between the two methods. ■

In Chapter 9, we developed a general duality between hypothesis tests and confidence intervals. In the case of the testing and confidence interval methods considered in this section, the t test rejects if and only if the confidence interval does not include zero (see Problem 10 at the end of this chapter).

We will now demonstrate that the test of $H_0$ versus $H_1$ is equivalent to a likelihood ratio test. (The rather long argument is sketched here and should be read with paper and pencil in hand.) $\Omega$ is the set of all possible parameter values:
$$\Omega = \{-\infty < \mu_X < \infty,\ -\infty < \mu_Y < \infty,\ 0 < \sigma < \infty\}$$
The unknown parameters are $\theta = (\mu_X, \mu_Y, \sigma)$. Under $H_0$, $\theta \in \omega_0$, where $\omega_0 = \{\mu_X = \mu_Y,\ 0 < \sigma < \infty\}$. The likelihood of the two samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ is
$$\mathrm{lik}(\mu_X, \mu_Y, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(X_i-\mu_X)^2/(2\sigma^2)} \prod_{j=1}^m \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(Y_j-\mu_Y)^2/(2\sigma^2)}$$
and the log likelihood is
$$l(\mu_X, \mu_Y, \sigma^2) = -\frac{m+n}{2}\log 2\pi - \frac{m+n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\left[\sum_{i=1}^n (X_i-\mu_X)^2 + \sum_{j=1}^m (Y_j-\mu_Y)^2\right]$$
We must maximize the likelihood under $\omega_0$ and under $\Omega$ and then calculate the ratio of the two maximized likelihoods, or the difference of their logarithms.

Under $\omega_0$, we have a sample of size $m+n$ from a normal distribution with unknown mean $\mu_0$ and unknown variance $\sigma_0^2$. The mle's of $\mu_0$ and $\sigma_0^2$ are thus
$$\hat\mu_0 = \frac{1}{m+n}\left(\sum_{i=1}^n X_i + \sum_{j=1}^m Y_j\right)$$
$$\hat\sigma_0^2 = \frac{1}{m+n}\left[\sum_{i=1}^n (X_i-\hat\mu_0)^2 + \sum_{j=1}^m (Y_j-\hat\mu_0)^2\right]$$
The corresponding value of the maximized log likelihood is, after some cancellation,
$$l(\hat\mu_0, \hat\sigma_0^2) = -\frac{m+n}{2}\log 2\pi - \frac{m+n}{2}\log\hat\sigma_0^2 - \frac{m+n}{2}$$
To find the mle's $\hat\mu_X$, $\hat\mu_Y$, and $\hat\sigma_1^2$ under $\Omega$, we first differentiate the log likelihood and obtain the equations
$$\sum_{i=1}^n (X_i - \hat\mu_X) = 0$$
$$\sum_{j=1}^m (Y_j - \hat\mu_Y) = 0$$
$$-\frac{m+n}{2\hat\sigma_1^2} + \frac{1}{2\hat\sigma_1^4}\left[\sum_{i=1}^n (X_i-\hat\mu_X)^2 + \sum_{j=1}^m (Y_j-\hat\mu_Y)^2\right] = 0$$
The mle's are, therefore,
$$\hat\mu_X = \bar X \qquad \hat\mu_Y = \bar Y \qquad \hat\sigma_1^2 = \frac{1}{m+n}\left[\sum_{i=1}^n (X_i-\bar X)^2 + \sum_{j=1}^m (Y_j-\bar Y)^2\right]$$
When these are substituted into the log likelihood, we obtain
$$l(\hat\mu_X, \hat\mu_Y, \hat\sigma_1^2) = -\frac{m+n}{2}\log 2\pi - \frac{m+n}{2}\log\hat\sigma_1^2 - \frac{m+n}{2}$$
The log of the likelihood ratio is thus
$$\frac{m+n}{2}\log\frac{\hat\sigma_1^2}{\hat\sigma_0^2}$$
and the likelihood ratio test rejects for large values of
$$\frac{\hat\sigma_0^2}{\hat\sigma_1^2} = \frac{\sum_{i=1}^n (X_i-\hat\mu_0)^2 + \sum_{j=1}^m (Y_j-\hat\mu_0)^2}{\sum_{i=1}^n (X_i-\bar X)^2 + \sum_{j=1}^m (Y_j-\bar Y)^2}$$
We now find an alternative expression for the numerator of this ratio, by using the identities
$$\sum_{i=1}^n (X_i-\hat\mu_0)^2 = \sum_{i=1}^n (X_i-\bar X)^2 + n(\bar X-\hat\mu_0)^2$$
$$\sum_{j=1}^m (Y_j-\hat\mu_0)^2 = \sum_{j=1}^m (Y_j-\bar Y)^2 + m(\bar Y-\hat\mu_0)^2$$
We obtain
$$\hat\mu_0 = \frac{1}{m+n}(n\bar X + m\bar Y) = \frac{n}{m+n}\bar X + \frac{m}{m+n}\bar Y$$
Therefore,
$$\bar X - \hat\mu_0 = \frac{m(\bar X - \bar Y)}{m+n} \qquad \bar Y - \hat\mu_0 = \frac{n(\bar Y - \bar X)}{m+n}$$
The alternative expression for the numerator of the ratio is thus
$$\sum_{i=1}^n (X_i-\bar X)^2 + \sum_{j=1}^m (Y_j-\bar Y)^2 + \frac{mn}{m+n}(\bar X-\bar Y)^2$$
and the test rejects for large values of
$$1 + \frac{mn}{m+n}\left(\frac{(\bar X-\bar Y)^2}{\sum_{i=1}^n (X_i-\bar X)^2 + \sum_{j=1}^m (Y_j-\bar Y)^2}\right)$$
or, equivalently, for large values of
$$\frac{|\bar X - \bar Y|}{\sqrt{\sum_{i=1}^n (X_i-\bar X)^2 + \sum_{j=1}^m (Y_j-\bar Y)^2}}$$
which is the t statistic apart from constants that do not depend on the data. Thus, the likelihood ratio test is equivalent to the t test, as claimed. We have used the assumption that the two populations have the same variance.
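The algebra above implies the identity $\hat\sigma_0^2/\hat\sigma_1^2 = 1 + t^2/(m+n-2)$, a monotone function of $t^2$. The following sketch (Python; the simulated data and variable names are ours) checks this numerically:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=13)   # placeholder data
y = rng.normal(0.5, 1.0, size=8)
n, m = len(x), len(y)

# Variance estimates maximized under the null (common mean) and
# under the full model, as in the derivation above.
mu0 = (x.sum() + y.sum()) / (n + m)
ss0 = ((x - mu0) ** 2).sum() + ((y - mu0) ** 2).sum()
ss1 = ((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()

# Pooled t statistic.
sp2 = ss1 / (n + m - 2)
t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n + 1 / m))

# sigma0_hat^2 / sigma1_hat^2 = 1 + t^2 / (n + m - 2):
print(ss0 / ss1, 1 + t ** 2 / (n + m - 2))  # agree up to rounding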
If the two variances are not assumed to be equal, a natural estimate of $\mathrm{Var}(\bar X - \bar Y)$ is
$$\frac{s_X^2}{n} + \frac{s_Y^2}{m}$$
If this estimate is used in the denominator of the t statistic, the distribution of that statistic is no longer the t distribution. But it has been shown that its distribution can be closely approximated by the t distribution with degrees of freedom calculated in the following way and then rounded to the nearest integer:
$$\mathrm{df} = \frac{\left[(s_X^2/n) + (s_Y^2/m)\right]^2}{\dfrac{(s_X^2/n)^2}{n-1} + \dfrac{(s_Y^2/m)^2}{m-1}}$$

EXAMPLE C
Let us rework Example B, but without the assumption that the variances are equal. Using the preceding formula, we find the degrees of freedom to be 12 rather than 19. The t statistic is 3.12. Since the .995 quantile of the t distribution with 12 df is 3.055 (Table 4 of Appendix B), the test still rejects at level α = .01. ■

If the underlying distributions are not normal and the sample sizes are large, the use of the t distribution or the normal distribution is justified by the central limit theorem, and the probability levels of confidence intervals and hypothesis tests are approximately valid. In such a case, however, there is little difference between the t and normal distributions. If the sample sizes are small, however, and the distributions are not normal, conclusions based on the assumption of normality may not be valid. Unfortunately, if the sample sizes are small, the assumption of normality cannot be tested effectively unless the deviation is quite gross, as we saw in Chapter 9.

11.2.1.1 An Example—A Study of Iron Retention
An experiment was performed to determine whether two forms of iron (Fe2+ and Fe3+) are retained differently. (If one form of iron were retained especially well, it would be the better dietary supplement.) The investigators divided 108 mice randomly into 6 groups of 18 each; 3 groups were given Fe2+ in three different concentrations, 10.2, 1.2, and .3 millimolar, and 3 groups were given Fe3+ at the same three concentrations. The mice were given the iron orally; the iron was radioactively labeled so that a counter could be used to measure the initial amount given. At a later time, another count was taken for each mouse, and the percentage of iron retained was calculated. The data for the two forms of iron are listed in the following table. We will look at the data for the concentration 1.2 millimolar. (In Chapter 12, we will discuss methods for analyzing all the groups simultaneously.)

              Fe3+                        Fe2+
   10.2     1.2      .3       10.2     1.2      .3
    .71    2.20    2.25       2.20    4.04    2.71
   1.66    2.93    3.93       2.69    4.16    5.43
   2.01    3.08    5.08       3.54    4.42    6.38
   2.16    3.49    5.82       3.75    4.93    6.38
   2.42    4.11    5.84       3.83    5.49    8.32
   2.42    4.95    6.89       4.08    5.77    9.04
   2.56    5.16    8.50       4.27    5.86    9.56
   2.60    5.54    8.56       4.53    6.28   10.01
   3.31    5.68    9.44       5.32    6.97   10.08
   3.64    6.25   10.52       6.18    7.06   10.62
   3.74    7.25   13.46       6.22    7.78   13.80
   3.74    7.90   13.57       6.33    9.23   15.99
   4.39    8.85   14.76       6.97    9.34   17.90
   4.50   11.96   16.41       6.97    9.91   18.25
   5.07   15.54   16.96       7.52   13.46   19.32
   5.26   15.89   17.56       8.36   18.4    19.87
   8.15   18.3    22.82      11.65   23.89   21.60
   8.24   18.59   29.13      12.45   26.39   22.25

As a summary of the data, boxplots (Figure 11.2) show that the data are quite skewed to the right. This is not uncommon with percentages or other variables that are bounded below by zero. Three observations from the Fe2+ group are flagged as possible outliers. The median of the Fe2+ group is slightly larger than the median of the Fe3+ group, but the two distributions overlap substantially.
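The unequal-variance calculation just described (and applied to these data below) is short to code. A minimal sketch (Python with NumPy and SciPy; the function name is ours); scipy.stats.ttest_ind(x, y, equal_var=False) gives the same test:

import numpy as np
from scipy import stats

def welch_t(x, y):
    # Two-sample t without assuming equal variances; df from the
    # approximation displayed above.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    vx, vy = x.var(ddof=1) / n, y.var(ddof=1) / m
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df, 2 * stats.t.sf(abs(t), df)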
Another view of these data is provided by normal probability plots (Figure 11.3). These plots also indicate the skewness of the distributions. We should obviously doubt the validity of using normal distribution theory (for example, the t test) for this problem even though the combined sample size is fairly large (36).

The mean and standard deviation of the Fe2+ group are 9.63 and 6.69; for the Fe3+ group, the mean is 8.20 and the standard deviation is 5.45. To test the hypothesis that the two means are equal, we can use a t test without assuming that the population standard deviations are equal. The approximate degrees of freedom, calculated as described at the end of Section 11.2.1, are 32. The t statistic is .702, which corresponds to a p-value of .49 for a two-sided test; if the two populations had the same mean, values of the t statistic this large or larger would occur 49% of the time. There is thus insufficient evidence to reject the null hypothesis. A 95% confidence interval for the difference of the two population means is (−2.7, 5.6). But the t test assumes that the underlying populations are normally distributed, and we have seen there is reason to doubt this assumption.

[FIGURE 11.2 Boxplots of the percentages of iron retained for the two forms.]
[FIGURE 11.3 Normal probability plots of iron retention data.]

It is sometimes advocated that skewed data be transformed to a more symmetric shape before normal theory is applied. Transformations such as taking the log or the square root can be effective in symmetrizing skewed distributions because they spread out small values and compress large ones. Figures 11.4 and 11.5 show boxplots and normal probability plots for the natural logs of the iron retention data we have been considering. The transformation was fairly successful in symmetrizing these distributions, and the probability plots are more linear than those in Figure 11.3, although some curvature is still evident.

[FIGURE 11.4 Boxplots of natural logs of percentages of iron retained.]
[FIGURE 11.5 Normal probability plots of natural logs of iron retention data.]

The following model is natural for the log transformation:
$$X_i = \mu_X(1 + \varepsilon_i), \quad i = 1, \ldots, n$$
$$Y_j = \mu_Y(1 + \delta_j), \quad j = 1, \ldots, m$$
$$\log X_i = \log\mu_X + \log(1 + \varepsilon_i)$$
$$\log Y_j = \log\mu_Y + \log(1 + \delta_j)$$
Here the $\varepsilon_i$ and $\delta_j$ are independent random variables with mean zero. This model implies that if the variances of the errors are $\sigma^2$, then
$$E(X_i) = \mu_X \qquad E(Y_j) = \mu_Y$$
$$\sigma_X = \mu_X\sigma \qquad \sigma_Y = \mu_Y\sigma$$
or that
$$\frac{\sigma_X}{\mu_X} = \frac{\sigma_Y}{\mu_Y}$$
If the $\varepsilon_i$ and $\delta_j$ have the same distribution, $\mathrm{Var}(\log X) = \mathrm{Var}(\log Y)$. The ratio of the standard deviation of a distribution to the mean is called the coefficient of variation (CV); it expresses the standard deviation as a fraction of the mean. Coefficients of variation are sometimes expressed as percentages. For the iron retention data we have been considering, the CV's are .69 and .67 for the Fe2+ and Fe3+ groups; these values are quite close. These data are quite "noisy"—the standard deviation is nearly 70% of the mean for both groups.
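Under the multiplicative-error model above, a t interval computed on the log scale back-transforms to an interval for the ratio $\mu_X/\mu_Y$, as worked out for these data below. A minimal sketch (Python with NumPy and SciPy; the function names are ours, and the pooled-variance interval assumes equal variances on the log scale):

import numpy as np
from scipy import stats

def cv(x):
    # Coefficient of variation: standard deviation as a fraction of the mean.
    x = np.asarray(x, float)
    return x.std(ddof=1) / x.mean()

def ratio_interval(x, y, conf=0.95):
    # t interval for log(mu_X) - log(mu_Y) on the log scale,
    # exponentiated to an interval for mu_X / mu_Y.
    lx, ly = np.log(np.asarray(x, float)), np.log(np.asarray(y, float))
    n, m = len(lx), len(ly)
    sp2 = ((n - 1) * lx.var(ddof=1) + (m - 1) * ly.var(ddof=1)) / (n + m - 2)
    half = stats.t.ppf(1 - (1 - conf) / 2, n + m - 2) * np.sqrt(sp2 * (1/n + 1/m))
    d = lx.mean() - ly.mean()
    return np.exp(d - half), np.exp(d + half)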
For the transformed iron retention data, the means and standard deviations are given in the following table:

                      Fe2+     Fe3+
Mean                  2.09     1.90
Standard Deviation    .659     .574

For the transformed data, the t statistic is .917, which gives a p-value of .37. Again, there is no reason to reject the null hypothesis. A 95% confidence interval is (−.61, .23). Using the preceding model, this is a confidence interval for
$$\log\mu_X - \log\mu_Y = \log\frac{\mu_X}{\mu_Y}$$
The interval is
$$-.61 \le \log\frac{\mu_X}{\mu_Y} \le .23$$
or
$$.54 \le \frac{\mu_X}{\mu_Y} \le 1.26$$
Other transformations, such as raising all values to some power, are sometimes used. Attitudes toward the use of transformations vary: Some view them as a very useful tool in statistics and data analysis, and others regard them as questionable manipulation of the data.

11.2.2 Power
Calculations of power are an important part of planning experiments in order to determine how large sample sizes should be. The power of a test is the probability of rejecting the null hypothesis when it is false. The power of the two-sample t test depends on four factors:

1. The real difference, $\Delta = |\mu_X - \mu_Y|$. The larger this difference, the greater the power.
2. The significance level α at which the test is done. The larger the significance level, the more powerful the test.
3. The population standard deviation σ, which is the amplitude of the "noise" that hides the "signal." The smaller the standard deviation, the larger the power.
4. The sample sizes n and m. The larger the sample sizes, the greater the power.

Before continuing, you should try to understand intuitively why these statements are true. We will express them quantitatively below. The necessary sample sizes can be determined from the significance level of the test, the standard deviation, and the desired power against an alternative hypothesis,
$$H_1: \mu_X - \mu_Y = \Delta$$
To calculate the power of a t test exactly, special tables of the noncentral t distribution are required. But if the sample sizes are reasonably large, one can perform approximate power calculations based on the normal distribution, as we will now demonstrate.

Suppose that σ, α, and Δ are given and that the samples are both of size n. Then
$$\mathrm{Var}(\bar X - \bar Y) = \sigma^2\left(\frac{1}{n} + \frac{1}{n}\right) = \frac{2\sigma^2}{n}$$
The test at level α of $H_0: \mu_X = \mu_Y$ against the alternative $H_1: \mu_X \ne \mu_Y$ is based on the test statistic
$$Z = \frac{\bar X - \bar Y}{\sigma\sqrt{2/n}}$$
The rejection region for this test is $|Z| > z(\alpha/2)$, or
$$|\bar X - \bar Y| > z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}$$
The power of the test if $\mu_X - \mu_Y = \Delta$ is the probability that the test statistic falls in the rejection region, or
$$P\left(|\bar X - \bar Y| > z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}\right) = P\left(\bar X - \bar Y > z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}\right) + P\left(\bar X - \bar Y < -z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}\right)$$
since the two events are mutually exclusive. Both probabilities on the right-hand side are calculated by standardizing. For the first one, we have
$$P\left(\bar X - \bar Y > z(\alpha/2)\,\sigma\sqrt{\frac{2}{n}}\right) = P\left(\frac{(\bar X - \bar Y) - \Delta}{\sigma\sqrt{2/n}} > z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right) = 1 - \Phi\left(z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right)$$
where Φ is the standard normal cdf. Similarly, the second probability is
$$\Phi\left(-z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right)$$
Thus, the probability that the test statistic falls in the rejection region is equal to
$$1 - \Phi\left(z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right) + \Phi\left(-z(\alpha/2) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right)$$
Typically, as Δ moves away from zero, one of these terms will be negligible with respect to the other. For example, if Δ is greater than zero, the first term will be dominant. For fixed n, this expression can be evaluated as a function of Δ; or for fixed Δ, it can be evaluated as a function of n.
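The power expression lends itself to direct computation. A small sketch (Python with SciPy; the function name is ours) of the approximate power of the two-sided test with equal group sizes n; plugging in n = 18 and σ = 5 reproduces the magnitudes discussed in Example A below:

import numpy as np
from scipy.stats import norm

def power_two_sample(delta, sigma, n, alpha=0.05):
    # Approximate power of the two-sided level-alpha test with equal
    # group sizes n, from the expression above.
    z = norm.ppf(1 - alpha / 2)
    shift = (delta / sigma) * np.sqrt(n / 2)
    return 1 - norm.cdf(z - shift) + norm.cdf(-z - shift)

print(power_two_sample(1, 5, 18))  # roughly .09
print(power_two_sample(5, 5, 18))  # roughly .85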
EXAMPLE A
As an example, let us consider a situation similar to an idealized form of the iron retention experiment. Assume that we have samples of size 18 from two normal distributions whose standard deviations are both 5, and we calculate the power for various values of Δ when the null hypothesis is tested at a significance level of .05. The results of the calculations are displayed in Figure 11.6. We see from the plot that if the mean difference in retention is only 1%, the probability of rejecting the null hypothesis is quite small, only 9%. A mean difference of 5% in retention rate gives a more satisfactory power of 85%.

[FIGURE 11.6 Plot of power versus Δ.]

Suppose that we wanted to be able to detect a difference of Δ = 1 with probability .9. What sample size would be necessary? Using only the dominant term in the expression for the power, the sample size should be such that
$$\Phi\left(1.96 - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}}\right) = .1$$
From the tables for the normal distribution, $.1 = \Phi(-1.28)$, so
$$1.96 - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}} = -1.28$$
Solving for n, we find that the necessary sample size would be 525! This is clearly unfeasible; if in fact the experimenters wanted to detect such a difference, some modification of the experimental technique to reduce σ would be necessary. ■

11.2.3 A Nonparametric Method—The Mann-Whitney Test
Nonparametric methods do not assume that the data follow any particular distributional form. Many of them are based on replacement of the data by ranks. With this replacement, the results are invariant under any monotonic transformation; in comparison, we saw that the p-value of a t test may change if the log of the measurements is analyzed rather than the measurements on the original scale. Replacing the data by ranks also has the effect of moderating the influence of outliers.

For purposes of discussion, we will develop the Mann-Whitney test (also sometimes called the Wilcoxon rank sum test) in a specific context. Suppose that we have m + n experimental units to assign to a treatment group and a control group. The assignment is made at random: n units are randomly chosen and assigned to the control, and the remaining m units are assigned to the treatment. We are interested in testing the null hypothesis that the treatment has no effect. If the null hypothesis is true, then any difference in the outcomes under the two conditions is due to the randomization.

A test statistic is calculated in the following way. First, we group all m + n observations together and rank them in order of increasing size (we will assume for simplicity that there are no ties, although the argument holds even in the presence of ties). We next calculate the sum of the ranks of those observations that came from the control group. If this sum is too small or too large, we will reject the null hypothesis.

It is easiest to see how the procedure works by considering a very small example. Suppose that a treatment and a control are to be compared: Of four subjects, two are randomly assigned to the treatment and the other two to the control, and the following responses are observed (the ranks of the observations are shown in parentheses):

Treatment    Control
1 (1)        6 (4)
3 (2)        4 (3)

The sum of the ranks of the control group is R = 7, and the sum of the ranks of the treatment group is 3. Does this discrepancy provide convincing evidence of a systematic difference between treatment and control, or could it be just due to chance?
To answer this question, we calculate the probability of such a discrepancy if the treatment had no effect at all, so that the difference was entirely due to the particular randomization—this is the null hypothesis. The key idea of the Mann-Whitney test is that we can explicitly calculate the distribution of R under the null hypothesis, since under this hypothesis every assignment of ranks to observations is equally likely and we can enumerate all 4! = 24 such assignments. In particular, each of the $\binom{4}{2} = 6$ assignments of ranks to the control group shown in the following table is equally likely:

Ranks      R
{1, 2}     3
{1, 3}     4
{1, 4}     5
{2, 3}     5
{2, 4}     6
{3, 4}     7

From this table, we see that under the null hypothesis, the distribution of R (its null distribution) is:

r           3     4     5     6     7
P(R = r)   1/6   1/6   1/3   1/6   1/6

In particular, $P(R = 7) = 1/6$, so this discrepancy would occur one time out of six purely on the basis of chance.

The small example of the previous paragraph has been laid out for pedagogical reasons, the point being that we could in principle go through similar calculations for any sample sizes m and n. Suppose that there are m observations in the treatment group and n in the control group. If the null hypothesis holds, every assignment of ranks to the m + n observations is equally likely, and hence each of the $\binom{m+n}{n}$ possible assignments of ranks to the control group is equally likely. For each of these assignments, we can calculate the sum of the ranks and thus determine the null distribution of the test statistic—the sum of the ranks of the control group.

It is important to note that we have not made any assumption that the observations from the control and treatment groups are samples from a probability distribution. Probability has entered in only as a result of the random assignment of experimental units to treatment and control groups (this is similar to the way that probability enters into survey sampling). We should also note that, although we chose the sum of control ranks as the test statistic, any other test statistic could have been used and its null distribution computed in the same fashion. The rank sum is easy to compute and is sensitive to a treatment effect that tends to make responses larger or smaller. Also, its null distribution has to be computed only once and tabled; if we worked with the actual numerical values, the null distribution would depend on those particular values.

Tables of the null distribution of the rank sum are widely available and vary in format. Note that because the sum of the two rank sums is the sum of the integers from 1 to m + n, which is $(m+n)(m+n+1)/2$, knowing one rank sum tells us the other. Some tables are given in terms of the rank sum of the smaller of the two groups, and some are in terms of the smaller of the two rank sums (the advantage of the latter scheme is that only one tail of the distribution has to be tabled). Table 8 of Appendix B makes use of additional symmetries. Let $n_1$ be the smaller sample size and let R be the sum of the ranks from that sample. Let $R' = n_1(m + n + 1) - R$ and $R^* = \min(R, R')$. The table gives critical values for $R^*$. (Fortunately, such fussy tables are largely obsolete with the increasing use of computers.)
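Indeed, the enumeration is easy to carry out by machine for any small m and n. A sketch (Python standard library only) that reproduces the null distribution of R tabulated above:

from itertools import combinations
from collections import Counter

# All C(4, 2) = 6 possible sets of control ranks are equally likely
# under the null hypothesis; tabulate the rank sum R over them.
counts = Counter(sum(c) for c in combinations(range(1, 5), 2))
total = sum(counts.values())
for r in sorted(counts):
    print(r, counts[r] / total)   # e.g. P(R = 7) = 1/6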
When it is more appropriate to model the control values, $X_1, \ldots, X_n$, as a sample from some probability distribution F and the experimental values, $Y_1, \ldots, Y_m$, as a sample from some distribution G, the Mann-Whitney test is a test of the null hypothesis $H_0: F = G$. The reasoning is exactly the same: Under $H_0$, any assignment of ranks to the pooled m + n observations is equally likely, etc. We have assumed here that there are no ties among the observations. If there are only a small number of ties, tied observations are assigned average ranks (the average of the ranks for which they are tied); the significance levels are not greatly affected.

EXAMPLE A
Let us illustrate the Mann-Whitney test by referring to the data on latent heats of fusion of ice considered earlier (Example A in Section 11.2.1). The sample sizes are fairly small (13 and 8), so in the absence of any prior knowledge concerning the adequacy of the assumption of a normal distribution, it would seem safer to use a nonparametric method. The following table exhibits the ranks given to the measurements for each method (refer to Example A in Section 11.2.1 for the original data):

Method A    Method B
  7.5         11.5
 19.0          1.0
 11.5          7.5
 19.0          4.5
 15.5          4.5
 15.5         15.5
 19.0          2.0
  4.5          4.5
 21.0
 15.5
 11.5
  9.0
 11.5

Note how the ties were handled. For example, the four observations with the value 79.97 tied for ranks 3, 4, 5, and 6 were each assigned the rank of 4.5 = (3 + 4 + 5 + 6)/4.

Table 8 of Appendix B is used as follows. The sum of the ranks of the smaller sample is R = 51.
$$R' = 8(8 + 13 + 1) - R = 125$$
Thus, $R^* = 51$. From the table, 53 is the critical value for a two-tailed test with α = .01, and 60 is the critical value for α = .05. The Mann-Whitney test thus rejects at the .01 significance level. ■

Let $T_Y$ denote the sum of the ranks of $Y_1, Y_2, \ldots, Y_m$. Using results from Chapter 7, we can easily find $E(T_Y)$ and $\mathrm{Var}(T_Y)$ under the null hypothesis F = G.

THEOREM A
If F = G,
$$E(T_Y) = \frac{m(m+n+1)}{2}$$
$$\mathrm{Var}(T_Y) = \frac{mn(m+n+1)}{12}$$

Proof
Under the null hypothesis, $T_Y$ is the sum of a random sample of size m drawn without replacement from a population consisting of the integers $\{1, 2, \ldots, m+n\}$. $T_Y$ thus equals m times the average of such a sample. From Theorems A and B of Section 7.3.1,
$$E(T_Y) = m\mu$$
$$\mathrm{Var}(T_Y) = m\sigma^2\left(\frac{N-m}{N-1}\right)$$
where N = m + n is the size of the population, and μ and σ² are the population mean and variance. Now, using the identities
$$\sum_{k=1}^N k = \frac{N(N+1)}{2} \qquad \sum_{k=1}^N k^2 = \frac{N(N+1)(2N+1)}{6}$$
we find that for the population $\{1, 2, \ldots, m+n\}$,
$$\mu = \frac{N+1}{2} \qquad \sigma^2 = \frac{N^2-1}{12}$$
The result then follows after algebraic simplification. ■

Unlike the t test, the Mann-Whitney test does not depend on an assumption of normality. Since the actual numerical values are replaced by their ranks, the test is insensitive to outliers, whereas the t test is sensitive. It has been shown that even when the assumption of normality holds, the Mann-Whitney test is nearly as powerful as the t test and it is thus generally preferable, especially for small sample sizes.
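Theorem A can be checked by simulating the without-replacement sampling used in its proof. A sketch (Python with NumPy; the sample sizes of Example A are assumed):

import numpy as np

rng = np.random.default_rng(1)
n, m = 13, 8            # sample sizes from Example A
N = n + m

# T_Y is the sum of m ranks drawn without replacement from {1, ..., N}.
sims = np.array([rng.choice(np.arange(1, N + 1), size=m, replace=False).sum()
                 for _ in range(50_000)])

print(sims.mean(), m * (N + 1) / 2)        # both close to 88
print(sims.var(), m * n * (N + 1) / 12)    # both close to 190.7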
The Mann-Whitney test can also be derived starting from a different point of view. Suppose that the X's are a sample from F and the Y's a sample from G, and consider estimating, as a measure of the effect of the treatment,
$$\pi = P(X < Y)$$
where X and Y are independently distributed with distribution functions F and G, respectively. The value π is the probability that an observation from the distribution F is smaller than an independent observation from the distribution G. If, for example, F and G represent lifetimes of components that have been manufactured according to two different conditions, π is the probability that a component of one type will last longer than a component of the other type.

An estimate of π can be obtained by comparing all n values of X to all m values of Y and calculating the proportion of the comparisons for which X was less than Y:
$$\hat\pi = \frac{1}{mn}\sum_{i=1}^n\sum_{j=1}^m Z_{ij}$$
where
$$Z_{ij} = \begin{cases} 1, & \text{if } X_i < Y_j \\ 0, & \text{otherwise} \end{cases}$$
To see the relationship of $\hat\pi$ to the rank sum introduced earlier, we will find it convenient to work with
$$V_{ij} = \begin{cases} 1, & \text{if } X_{(i)} < Y_{(j)} \\ 0, & \text{otherwise} \end{cases}$$
Clearly,
$$\sum_{i=1}^n\sum_{j=1}^m Z_{ij} = \sum_{i=1}^n\sum_{j=1}^m V_{ij}$$
since the $V_{ij}$ are just a reordering of the $Z_{ij}$. Also,
$$\sum_{i=1}^n\sum_{j=1}^m V_{ij} = (\text{number of } X\text{'s less than } Y_{(1)}) + (\text{number of } X\text{'s less than } Y_{(2)}) + \cdots + (\text{number of } X\text{'s less than } Y_{(m)})$$
If the rank of $Y_{(k)}$ in the combined sample is denoted by $R_{yk}$, then the number of X's less than $Y_{(1)}$ is $R_{y1} - 1$, the number of X's less than $Y_{(2)}$ is $R_{y2} - 2$, etc. Therefore,
$$\sum_{i=1}^n\sum_{j=1}^m V_{ij} = (R_{y1} - 1) + (R_{y2} - 2) + \cdots + (R_{ym} - m) = \sum_{i=1}^m R_{yi} - \frac{m(m+1)}{2} = T_Y - \frac{m(m+1)}{2}$$
Thus, $\hat\pi$ may be expressed in terms of the rank sum of the Y's (or in terms of the rank sum of the X's, since the two rank sums add up to a constant). Defining $U_Y = T_Y - m(m+1)/2$, so that $\hat\pi = U_Y/(mn)$, we have from Theorem A

COROLLARY A
Under the null hypothesis $H_0: F = G$,
$$E(U_Y) = \frac{mn}{2}$$
$$\mathrm{Var}(U_Y) = \frac{mn(m+n+1)}{12}$$ ■

For m and n both greater than 10, the null distribution of $U_Y$ is quite well approximated by a normal distribution,
$$\frac{U_Y - E(U_Y)}{\sqrt{\mathrm{Var}(U_Y)}} \sim N(0, 1)$$
(Note that this does not follow immediately from the ordinary central limit theorem; although $U_Y$ is a sum of random variables, they are not independent.) Similarly, the distribution of the rank sum of the X's or Y's may be approximated by a normal distribution, since these rank sums differ from $U_Y$ only by constants.

EXAMPLE B
Referring to Example A, let us use a normal approximation to the distribution of the rank sum from method B. For n = 13 and m = 8, we have from Corollary A that under the null hypothesis,
$$E(T) = \frac{8(8 + 13 + 1)}{2} = 88$$
$$\sigma_T = \sqrt{\frac{8 \times 13(8 + 13 + 1)}{12}} = 13.8$$
T is the sum of the ranks from method B, or 51, and the normalized test statistic is
$$\frac{T - E(T)}{\sigma_T} = -2.68$$
From the tables of the normal distribution, this corresponds to a p-value of .007 for a two-sided test, so the null hypothesis is rejected at level α = .01, just as it was when we used the exact distribution. For this set of data, we have seen that the t test with the assumption of equal variances, the t test without that assumption, the exact Mann-Whitney test, and the approximate Mann-Whitney test all reject at level α = .01. ■
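A sketch of the whole computation (Python with SciPy; the function name is ours): $\hat\pi$, the rank sum $T_Y$, and the two-sided normal-approximation p-value from Corollary A. scipy.stats.rankdata assigns average ranks to ties, as in the text:

import numpy as np
from scipy import stats

def rank_sum_test(x, y):
    # pi-hat, the rank sum T_Y, and the two-sided p-value from the
    # normal approximation to the null distribution.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    ranks = stats.rankdata(np.concatenate([x, y]))
    ty = ranks[n:].sum()                  # ranks of the y's
    uy = ty - m * (m + 1) / 2
    pihat = uy / (m * n)
    z = (ty - m * (n + m + 1) / 2) / np.sqrt(m * n * (n + m + 1) / 12)
    return pihat, ty, 2 * stats.norm.sf(abs(z))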
The Mann-Whitney test can be inverted to form confidence intervals. Let us consider a "shift" model: $G(x) = F(x - \Delta)$. This model says that the effect of the treatment (the Y's) is to add a constant Δ to what the response would have been with no treatment (the X's). (This is a very simple model, and we have already seen cases for which it is not appropriate.) We now derive a confidence interval for Δ. To test $H_0: F = G$, we used the statistic $U_Y$ equal to the number of the $X_i - Y_j$ that are less than zero. To test the hypothesis that the shift parameter is Δ, we can similarly use
$$U_Y(\Delta) = \#[X_i - (Y_j - \Delta) < 0] = \#(Y_j - X_i > \Delta)$$
It can be shown that the null distribution of $U_Y(\Delta)$ is symmetric about mn/2:
$$P\left(U_Y(\Delta) = \frac{mn}{2} + k\right) = P\left(U_Y(\Delta) = \frac{mn}{2} - k\right)$$
for all integers k. Suppose that $k = k(\alpha)$ is such that $P(k \le U_Y(\Delta) \le mn - k) = 1 - \alpha$; the level α test then accepts for such $U_Y(\Delta)$. By the duality of confidence intervals and hypothesis tests, a $100(1-\alpha)\%$ confidence interval for Δ is thus
$$C = \{\Delta \mid k \le U_Y(\Delta) \le mn - k\}$$
C consists of the set of values Δ for which the null hypothesis would not be rejected. We can find an explicit form for this confidence interval. Let $D_{(1)}, D_{(2)}, \ldots, D_{(mn)}$ denote the ordered mn differences $Y_j - X_i$. We will show that
$$C = [D_{(k)}, D_{(mn-k+1)})$$
To see this, first suppose that $\Delta = D_{(k)}$. Then
$$U_Y(\Delta) = \#(X_i - Y_j + \Delta < 0) = \#(Y_j - X_i > \Delta) = mn - k$$
Similarly, if $\Delta = D_{(mn-k+1)}$,
$$U_Y(\Delta) = \#(Y_j - X_i > \Delta) = k - 1$$
which falls outside the acceptance region, so the right endpoint is excluded. (You might find it helpful to consider the case m = 3, n = 2, k = 2.)

EXAMPLE C
We return to the data on iron retention (Section 11.2.1.1). The earlier analysis using the t test rested on the assumption that the populations were normally distributed, which, in fact, seemed rather dubious. The Mann-Whitney test does not make this assumption. The sum of the ranks of the Fe2+ group is used as a test statistic (we could have as easily used the U statistic). The rank sum is 362. Using the normal approximation to the null distribution of the rank sum, we get a p-value of .36. Again, there is insufficient evidence to reject the null hypothesis that there is no differential retention. The 95% confidence interval for the shift between the two distributions is (−1.6, 3.7), which overlaps zero substantially. Note that this interval is shorter than the interval based on the t distribution; the latter was inflated by the contributions of the large observations to the sample variance. ■

We close this section with an illustration of the use of the bootstrap in a two-sample problem. As before, suppose that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are two independent samples from distributions F and G, respectively, and that $\pi = P(X < Y)$ is estimated by $\hat\pi$. How can the standard error of $\hat\pi$ be estimated and how can an approximate confidence interval for π be constructed? (Note that the calculations of Theorem A are not directly relevant, since they are done under the assumption that F = G.)

The problem can be approached in the following way: First suppose for the moment that F and G were known. Then the sampling distribution of $\hat\pi$ and its standard error could be estimated by simulation. A sample of size n would be generated from F, an independent sample of size m would be generated from G, and the resulting value of $\hat\pi$ would be computed. This procedure would be repeated many times, say B times, producing $\hat\pi_1, \hat\pi_2, \ldots, \hat\pi_B$. A histogram of these values would be an indication of the sampling distribution of $\hat\pi$, and their standard deviation would be an estimate of the standard error of $\hat\pi$. Of course, this procedure cannot be implemented, because F and G are not known. But as in the previous chapter, an approximation can be obtained by using the empirical distributions $F_n$ and $G_m$ in their places. This means that a bootstrap value of $\hat\pi$ is generated by randomly selecting n values from $X_1, X_2, \ldots, X_n$ with replacement, m values from $Y_1, Y_2, \ldots, Y_m$ with replacement, and calculating the resulting value of $\hat\pi$. In this way, a bootstrap sample $\hat\pi_1, \hat\pi_2, \ldots, \hat\pi_B$ is generated.
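The bootstrap procedure just described takes only a few lines. A sketch (Python with NumPy; B and the seed are arbitrary choices of ours):

import numpy as np

def pihat(x, y):
    # Proportion of the mn comparisons with x_i < y_j.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.mean(x[:, None] < y[None, :])

def bootstrap_se_pihat(x, y, B=1000, seed=0):
    # Resample each sample with replacement B times, recompute pi-hat,
    # and report the standard deviation of the replicates as the
    # estimated standard error.
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    reps = np.array([pihat(rng.choice(x, len(x)), rng.choice(y, len(y)))
                     for _ in range(B)])
    return reps.std(ddof=1), reps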
11.2.4 Bayesian Approach
We consider a Bayesian approach to the model, which stipulates that the $X_i$ are i.i.d. normal with mean $\mu_X$ and precision ξ, and the $Y_j$ are i.i.d. normal with mean $\mu_Y$, precision ξ, and independent of the $X_i$. In general, a prior joint distribution assigned to $(\mu_X, \mu_Y, \xi)$ would be multiplied by the likelihood and normalized to integrate to 1 to produce a three-dimensional joint posterior distribution for $(\mu_X, \mu_Y, \xi)$. The marginal joint distribution of $(\mu_X, \mu_Y)$ could be obtained by integrating out ξ. The marginal distribution of $\mu_X - \mu_Y$ could then be obtained by another integration as in Section 3.6.1. Several integrations would thus have to be done, either analytically or numerically. Special Monte Carlo methods have been devised for high-dimensional Bayesian problems, but we will not consider them here.

An approximate result can be obtained using improper priors. We take $(\mu_X, \mu_Y, \xi)$ to be independent. The means $\mu_X$ and $\mu_Y$ are given improper priors that are constant on $(-\infty, \infty)$, and ξ is given the improper prior $f(\xi) = \xi^{-1}$. The posterior is thus proportional to the likelihood multiplied by $\xi^{-1}$:
$$f_{\mathrm{post}}(\mu_X, \mu_Y, \xi) \propto \xi^{\frac{n+m}{2}-1}\exp\left\{-\frac{\xi}{2}\left[\sum_{i=1}^n (x_i - \mu_X)^2 + \sum_{j=1}^m (y_j - \mu_Y)^2\right]\right\}$$
Next, using $\sum_{i=1}^n (x_i - \mu_X)^2 = (n-1)s_x^2 + n(\mu_X - \bar x)^2$ and the analogous expression for the $y_j$, we have
$$f_{\mathrm{post}}(\mu_X, \mu_Y, \xi) \propto \xi^{\frac{n+m}{2}-1}\exp\left\{-\frac{\xi}{2}\left[(n-1)s_x^2 + (m-1)s_y^2\right]\right\} \times \exp\left\{-\frac{n\xi}{2}(\mu_X - \bar x)^2\right\}\exp\left\{-\frac{m\xi}{2}(\mu_Y - \bar y)^2\right\}$$
From the form of this expression as a function of $\mu_X$ and $\mu_Y$, we see that for fixed ξ, $\mu_X$ and $\mu_Y$ are independent and normally distributed with means $\bar x$ and $\bar y$ and precisions nξ and mξ. Their difference, $\mu_X - \mu_Y$, is thus normally distributed with mean $\bar x - \bar y$ and variance $\xi^{-1}(n^{-1} + m^{-1})$. With further analysis similar to that of Section 8.6, it can be shown that the marginal posterior distribution of $\Delta = \mu_X - \mu_Y$ can be related to the t distribution:
$$\frac{\Delta - (\bar x - \bar y)}{s_p\sqrt{n^{-1} + m^{-1}}} \sim t_{n+m-2}$$
Although formally similar to Theorem A of Section 11.2.1, the interpretation is different: $\bar x - \bar y$ and $s_p$ are random in Theorem A but are fixed here, and $\Delta = \mu_X - \mu_Y$ is random here but fixed in Theorem A. The Bayesian formalism makes probability statements about Δ given the observed data.

The posterior probability that Δ > 0 can thus be found using the t distribution. Let T denote a random variable with a $t_{m+n-2}$ distribution. Then, denoting the observations by X and Y,
$$P(\Delta > 0 \mid X, Y) = P\left(\frac{\Delta - (\bar x - \bar y)}{s_p\sqrt{n^{-1} + m^{-1}}} \ge \frac{-(\bar x - \bar y)}{s_p\sqrt{n^{-1} + m^{-1}}} \,\Big|\, X, Y\right) = P\left(T \ge \frac{\bar y - \bar x}{s_p\sqrt{n^{-1} + m^{-1}}}\right)$$
Letting X denote the measurements of method A, and Y denote the measurements of method B in Example A of Section 11.2.1, we find that for that example,
$$P(\Delta > 0 \mid X, Y) = P(T \ge -3.33) = .998$$
where T has a $t_{19}$ distribution. This posterior probability is very close to 1.0, and there is thus little doubt that the mean of method A is larger than the mean of method B. The confidence interval calculated in Section 11.2.1 is formally similar but has a different interpretation under the Bayesian model, which concludes that
$$P(.015 \le \Delta \le .065 \mid X, Y) = .95$$
by integration of the posterior t distribution over a region containing 95% of the probability.
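Under the improper priors above, the posterior of Δ is a scaled and shifted $t_{n+m-2}$, so posterior probabilities and credible intervals come straight from the t distribution. A sketch (Python with SciPy; the function name is ours):

import numpy as np
from scipy import stats

def posterior_delta(x, y, level=0.95):
    # Posterior of delta = mu_X - mu_Y: a t with n+m-2 df, centered at
    # xbar - ybar, with scale s_p * sqrt(1/n + 1/m).
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    sp = np.sqrt(((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2))
    post = stats.t(df=n + m - 2, loc=x.mean() - y.mean(),
                   scale=sp * np.sqrt(1 / n + 1 / m))
    return post.sf(0.0), post.interval(level)  # P(delta > 0 | data), credible interval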
11.3 Comparing Paired Samples
In Section 11.2, we considered the problem of analyzing two independent samples. In many experiments, the samples are paired. In a medical experiment, for example, subjects might be matched by age or weight or severity of condition, and then one member of each pair randomly assigned to the treatment group and the other to the control group. In a biological experiment, the paired subjects might be littermates. In some applications, the pair consists of a "before" and an "after" measurement on the same object. Since pairing causes the samples to be dependent, the analysis of Section 11.2 does not apply.

Pairing can be an effective experimental technique, as we will now demonstrate by comparing a paired design and an unpaired design. First, we consider the paired design. Let us denote the pairs as $(X_i, Y_i)$, where $i = 1, \ldots, n$, and assume the X's and Y's have means $\mu_X$ and $\mu_Y$ and variances $\sigma_X^2$ and $\sigma_Y^2$. We will assume that different pairs are independently distributed and that $\mathrm{Cov}(X_i, Y_i) = \sigma_{XY}$. We will work with the differences $D_i = X_i - Y_i$, which are independent with
$$E(D_i) = \mu_X - \mu_Y$$
$$\mathrm{Var}(D_i) = \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY} = \sigma_X^2 + \sigma_Y^2 - 2\rho\sigma_X\sigma_Y$$
where ρ is the correlation of members of a pair. A natural estimate of $\mu_X - \mu_Y$ is $\bar D = \bar X - \bar Y$, the average difference. From the properties of the $D_i$, it follows that
$$E(\bar D) = \mu_X - \mu_Y$$
$$\mathrm{Var}(\bar D) = \frac{1}{n}\left(\sigma_X^2 + \sigma_Y^2 - 2\rho\sigma_X\sigma_Y\right)$$
Suppose, on the other hand, that an experiment had been done by taking a sample of n X's and an independent sample of n Y's. Then $\mu_X - \mu_Y$ would be estimated by $\bar X - \bar Y$, and
$$E(\bar X - \bar Y) = \mu_X - \mu_Y$$
$$\mathrm{Var}(\bar X - \bar Y) = \frac{1}{n}\left(\sigma_X^2 + \sigma_Y^2\right)$$
Comparing the variances of the two estimates, we see that the variance of $\bar D$ is smaller if the correlation is positive—that is, if the X's and Y's are positively correlated. In this circumstance, pairing is the more effective experimental design. In the simple case in which $\sigma_X = \sigma_Y = \sigma$, the two variances may be more simply expressed as
$$\mathrm{Var}(\bar D) = \frac{2\sigma^2(1-\rho)}{n}$$
in the paired case and as
$$\mathrm{Var}(\bar X - \bar Y) = \frac{2\sigma^2}{n}$$
in the unpaired case, and the relative efficiency is
$$\frac{\mathrm{Var}(\bar D)}{\mathrm{Var}(\bar X - \bar Y)} = 1 - \rho$$
If the correlation coefficient is .5, for example, a paired design with n pairs of subjects yields the same precision as an unpaired design with 2n subjects per treatment. This additional precision results in shorter confidence intervals and more powerful tests if the degrees of freedom for estimating σ² are sufficiently large.

We next present methods based on the normal distribution for analyzing data from paired designs and then a nonparametric, rank-based method.

11.3.1 Methods Based on the Normal Distribution
In this section, we assume that the differences are a sample from a normal distribution with
$$E(D_i) = \mu_X - \mu_Y = \mu_D$$
$$\mathrm{Var}(D_i) = \sigma_D^2$$
Generally, $\sigma_D$ will be unknown, and inferences will be based on
$$t = \frac{\bar D - \mu_D}{s_{\bar D}}$$
which follows a t distribution with n − 1 degrees of freedom. Following familiar reasoning, a $100(1-\alpha)\%$ confidence interval for $\mu_D$ is
$$\bar D \pm t_{n-1}(\alpha/2)\,s_{\bar D}$$
A two-sided test of the null hypothesis $H_0: \mu_D = 0$ (the natural null hypothesis for testing no treatment effect) at level α has the rejection region
$$|\bar D| > t_{n-1}(\alpha/2)\,s_{\bar D}$$
If the sample size n is large, the approximate validity of the confidence interval and hypothesis test follows from the central limit theorem. If the sample size is small and the true distribution of the differences is far from normal, the stated probability levels may be considerably in error.
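A sketch of the paired analysis of this section (Python with NumPy and SciPy; the function name is ours). scipy.stats.ttest_rel(after, before) gives the same t statistic and two-sided p-value:

import numpy as np
from scipy import stats

def paired_t(before, after, conf=0.90):
    # t analysis of the paired differences d_i = after_i - before_i.
    d = np.asarray(after, float) - np.asarray(before, float)
    n = len(d)
    se = d.std(ddof=1) / np.sqrt(n)       # estimated standard error of dbar
    t = d.mean() / se
    half = stats.t.ppf(1 - (1 - conf) / 2, n - 1) * se
    return t, 2 * stats.t.sf(abs(t), n - 1), (d.mean() - half, d.mean() + half)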
EXAMPLE A
To study the effect of cigarette smoking on platelet aggregation, Levine (1973) drew blood samples from 11 individuals before and after they smoked a cigarette and measured the extent to which the blood platelets aggregated. Platelets are involved in the formation of blood clots, and it is known that smokers suffer more often from disorders involving blood clots than do nonsmokers. The data are shown in the following table, which gives the maximum percentage of all the platelets that aggregated after being exposed to a stimulus.

Before    After    Difference
  25       27          2
  25       29          4
  27       37         10
  44       56         12
  30       46         16
  67       82         15
  53       57          4
  53       80         27
  52       61          9
  60       59         −1
  28       43         15

From the column of differences, $\bar D = 10.27$ and $s_{\bar D} = 2.40$. The uncertainty in $\bar D$ is quantified in $s_{\bar D}$ or in a confidence interval. Since $t_{10}(.05) = 1.812$, a 90% confidence interval is $\bar D \pm 1.812 s_{\bar D}$, or (5.9, 14.6). We can also formally test the null hypothesis that means before and after are the same. The t statistic is 10.27/2.40 = 4.28, and since $t_{10}(.005) = 3.169$, the p-value of a two-sided test is less than .01. There is little doubt that smoking increases platelet aggregation.

The experiment was actually more complex than we have indicated. Some subjects also smoked cigarettes made of lettuce leaves and "smoked" unlit cigarettes. (You should reflect on why these additional experiments were done.)

Figure 11.7 is a plot of the after values versus the before values. They are correlated, with a correlation coefficient of .90. Pairing was a natural and effective experimental design in this case. ■

[FIGURE 11.7 Plot of platelet aggregation after smoking versus aggregation before smoking.]

11.3.2 A Nonparametric Method—The Signed Rank Test
A nonparametric test based on ranks can be constructed for paired samples. We illustrate the calculation with a very small example. Suppose there are four pairs, corresponding to "before" and "after" measurements listed in the following table:

Before    After    Difference    |Difference|    Rank    Signed Rank
  25       27           2             2            2           2
  29       25          −4             4            3          −3
  60       59          −1             1            1          −1
  27       37          10            10            4           4

The test statistic is calculated by the following steps:
1. Calculate the differences, $D_i$, and the absolute values of the differences and rank the latter.
2. Restore the signs of the differences to the ranks, obtaining signed ranks.
3. Calculate $W_+$, the sum of those ranks that have positive signs.

For the table, this sum is $W_+ = 2 + 4 = 6$. The idea behind the signed rank test (sometimes called the Wilcoxon signed rank test) is intuitively simple. If there is no difference between the two paired conditions, we expect about half the $D_i$ to be positive and half negative, and $W_+$ will not be too small or too large. If one condition tends to produce larger values than the other, $W_+$ will tend to be more extreme. We therefore can use $W_+$ as a test statistic and reject for extreme values.
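The three steps above translate directly into code. A sketch (Python with SciPy; the function name is ours) that computes $W_+$, discarding zero differences and averaging tied ranks as the text recommends below:

import numpy as np
from scipy import stats

def w_plus(d):
    # Signed rank statistic: discard zero differences, rank |d| with
    # average ranks for ties, sum the ranks of the positive d's.
    d = np.asarray(d, float)
    d = d[d != 0]
    ranks = stats.rankdata(np.abs(d))
    return ranks[d > 0].sum()

print(w_plus([2, -4, -1, 10]))  # 6.0, matching the four-pair example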
Before continuing, we need to specify more precisely the null hypothesis we are testing with the signed rank test: $H_0$ states that the distribution of the $D_i$ is symmetric about zero. This will be true if the members of pairs of experimental units are assigned randomly to treatment and control conditions, and the treatment has no effect at all.

As usual, in order to define a rejection region for a test at level α, we need to know the sampling distribution of $W_+$ if the null hypothesis is true. The rejection region will be located in the tails of this null distribution in such a way that the test has level α. The null distribution may be calculated in the following way. If $H_0$ is true, it makes no difference which member of the pair corresponds to treatment and which to control. The difference $X_i - Y_i = D_i$ has the same distribution as the difference $Y_i - X_i = -D_i$, so the distribution of $D_i$ is symmetric about zero. The difference whose absolute value is kth largest is thus equally likely to be positive or negative, and any particular assignment of signs to the integers $1, \ldots, n$ (the ranks) is equally likely. There are $2^n$ such assignments, and for each we can calculate $W_+$. We obtain a list of $2^n$ values (not all distinct) of $W_+$, each of which occurs with probability $1/2^n$. The probability of each distinct value of $W_+$ may thus be calculated, giving the desired null distribution.

The preceding argument has assumed that the $D_i$ are a sample from some continuous probability distribution. If we do not wish to regard the $X_i$ and $Y_i$ as random variables and if the assignments to treatment and control have been made at random, the hypothesis that there is no treatment effect may be tested in exactly the same manner, except that inferences are based on the distribution induced by the randomization, as was done for the Mann-Whitney test.

The null distribution of $W_+$ is calculated by many computer packages, and tables are also available. The signed rank test is a nonparametric version of the paired sample t test. Unlike the t test, it does not depend on an assumption of normality. Since differences are replaced by ranks, it is insensitive to outliers, whereas the t test is sensitive. It has been shown that even when the assumption of normality holds, the signed rank test is nearly as powerful as the t test. The nonparametric method is thus generally preferable, especially for small sample sizes.

EXAMPLE A
The signed rank test can be applied to the data on platelet aggregation considered previously (Example A in Section 11.3.1). In this case, it is easier to work with $W_-$ rather than $W_+$, since $W_-$ is clearly 1. From Table 9 of Appendix B, the two-sided test is significant at α = .01. ■

If the sample size is greater than 20, a normal approximation to the null distribution can be used. To find this, we calculate the mean and variance of $W_+$.

THEOREM A
Under the null hypothesis that the $D_i$ are independent and symmetrically distributed about zero,
$$E(W_+) = \frac{n(n+1)}{4}$$
$$\mathrm{Var}(W_+) = \frac{n(n+1)(2n+1)}{24}$$

Proof
To facilitate the calculation, we represent $W_+$ in the following way:
$$W_+ = \sum_{k=1}^n k I_k$$
where
$$I_k = \begin{cases} 1, & \text{if the } k\text{th largest } |D_i| \text{ has } D_i > 0 \\ 0, & \text{otherwise} \end{cases}$$
Under $H_0$, the $I_k$ are independent Bernoulli random variables with $p = \frac{1}{2}$, so
$$E(I_k) = \frac{1}{2} \qquad \mathrm{Var}(I_k) = \frac{1}{4}$$
We thus have
$$E(W_+) = \frac{1}{2}\sum_{k=1}^n k = \frac{n(n+1)}{4}$$
$$\mathrm{Var}(W_+) = \frac{1}{4}\sum_{k=1}^n k^2 = \frac{n(n+1)(2n+1)}{24}$$
as was to be shown. ■

If some of the differences are equal to zero, the most common technique is to discard those observations. If there are ties, each $|D_i|$ is assigned the average value of the ranks for which it is tied. If there are not too many ties, the significance level of the test is not greatly affected. If there are a large number of ties, modifications must be made. For further information on these matters, see Hollander and Wolfe (1973) or Lehmann (1975).
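For small n, the $2^n$ enumeration described above is immediate by machine, and Theorem A can be checked against the exact distribution. A sketch (Python with NumPy; the function name is ours):

import numpy as np
from itertools import product

def w_plus_null(n):
    # Exact null distribution of W+: enumerate all 2^n equally likely
    # sign assignments to the ranks 1..n.
    ranks = np.arange(1, n + 1)
    values = [ranks[list(signs)].sum() for signs in product([False, True], repeat=n)]
    vals, counts = np.unique(values, return_counts=True)
    return vals, counts / 2.0 ** n

vals, probs = w_plus_null(10)
print((vals * probs).sum(), 10 * 11 / 4)                      # mean: 27.5
print(((vals - 27.5) ** 2 * probs).sum(), 10 * 11 * 21 / 24)  # variance: 96.25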
11.3.3 An Example—Measuring Mercury Levels in Fish
Kacprzak and Chvojka (1976) compared two methods of measuring mercury levels in fish. A new method, which they called "selective reduction," was compared to an established method, referred to as "the permanganate method." One advantage of selective reduction is that it allows simultaneous measurement of both inorganic mercury and methyl mercury. The mercury in each of 25 juvenile black marlin was measured by both techniques. The 25 measurements for each method (in ppm of mercury) and the differences are given in the following table.

Fish    Selective Reduction    Permanganate    Difference    Signed Rank
  1            .32                 .39             .07          +15.5
  2            .40                 .47             .07          +15.5
  3            .11                 .11             .00
  4            .47                 .43            −.04          −11
  5            .32                 .42             .10          +19
  6            .35                 .30            −.05          −13.5
  7            .32                 .43             .11          +20
  8            .63                 .98             .35          +23
  9            .50                 .86             .36          +24
 10            .60                 .79             .19          +22
 11            .38                 .33            −.05          −13.5
 12            .46                 .45            −.01          −2.5
 13            .20                 .22             .02          +6.5
 14            .31                 .30            −.01          −2.5
 15            .62                 .60            −.02          −6.5
 16            .52                 .53             .01          +2.5
 17            .77                 .85             .08          +17.5
 18            .23                 .21            −.02          −6.5
 19            .30                 .33             .03          +9.0
 20            .70                 .57            −.13          −21
 21            .41                 .43             .02          +6.5
 22            .53                 .49            −.04          −11
 23            .19                 .20             .01          +2.5
 24            .31                 .35             .04          +11
 25            .48                 .40            −.08          −17.5

In analyzing such data, it is often informative to check whether the differences depend in some way on the level or size of the quantity being measured. The differences versus the permanganate values are plotted in Figure 11.8. This plot is quite interesting. It appears that the differences are small for low permanganate values and larger for higher permanganate values. It is striking that the differences are all positive and large for the highest four values. The investigators do not comment on these phenomena. It is not uncommon for the size of fluctuations to increase as the value being measured increases; the percent error may remain nearly constant but the actual error does not. For this reason, data of this nature are often analyzed on a log scale.

[FIGURE 11.8 Plot of differences versus permanganate values.]

Because the observations are paired (two measurements on each fish), we will use the paired t test for a parametric test. The sample size is large enough that the test should be robust against nonnormality. The mean difference is .04, and the standard deviation of the differences is .116. The t statistic is 1.724; with 24 degrees of freedom, this corresponds to a p-value of .094 for a two-sided test. Although this p-value is fairly small, the evidence against $H_0: \mu_D = 0$ is not overwhelming. The test does not reject at the significance level .05.

The signed ranks are shown in the last column of the table above. Note that the single zero difference was set aside, and also note how the tied ranks were handled. The test statistic $W_+$ is 194.5. Under $H_0$, its mean and variance are
$$E(W_+) = \frac{24 \times 25}{4} = 150$$
$$\mathrm{Var}(W_+) = \frac{24 \times 25 \times 49}{24} = 1225$$
Since n is greater than 20, we use the normalized test statistic, or
$$Z = \frac{W_+ - E(W_+)}{\sqrt{\mathrm{Var}(W_+)}} = 1.27$$
The p-value for a two-sided test from the normal approximation is .20, which is not strong evidence against the null hypothesis. It is possible to correct for the presence of ties, but in this case the correction only amounts to changing the standard deviation of $W_+$ from 35 to 34.95. Neither the parametric nor the nonparametric test gives conclusive evidence that there is any systematic difference between the two methods of measurement. The informal graphical analysis does suggest, however, that there may be a difference for high concentrations of mercury.

11.4 Experimental Design
This section covers some basic principles of the interpretation and design of experimental studies and illustrates them with case studies.
11.4.1 Mammary Artery Ligation
A person with coronary artery disease suffers from chest pain during exercise because the constricted arteries cannot deliver enough oxygen to the heart. The treatment of ligating the mammary arteries enjoyed a brief vogue; the basic idea was that ligating these arteries forced more blood to flow into the heart. This procedure had the advantage of being quite simple surgically, and it was widely publicized in an article in Reader's Digest (Ratcliffe 1957). Two years later, the results of a more careful study (Cobb et al. 1959) were published. In this study, a control group and an experimental group were established in the following way. When a prospective patient entered surgery, the surgeon made the necessary preliminary incisions prior to tying off the mammary artery. At that point, the surgeon opened a sealed envelope that contained instructions about whether to complete the operation by tying off the artery. Neither the patient nor his attending physician knew whether the operation had actually been carried out. The study showed essentially no difference after the operation between the control group (no ligation) and the experimental group (ligation), although there was some suggestion that the control group had done better.

The Ratcliffe and Cobb studies differ in that in the earlier one there was no control group and thus no benchmark by which to gauge improvement. The reported improvement of the patients in this earlier study could have been due to the placebo effect, which we discuss next. The design of the later study protected against possible unconscious biases by randomly assigning the control and experimental groups and by concealing from the patients and their physicians the actual nature of the treatment. Such a design is called a double-blind, randomized controlled experiment.

11.4.2 The Placebo Effect
The placebo effect refers to the effect produced by any treatment, including dummy pills (placebos), when the subject believes that he or she has been given an effective treatment. The possibility of a placebo effect makes the use of a blind design necessary in many experimental investigations.

The placebo effect may not be due entirely to psychological factors, as was shown in an interesting experiment by Levine, Gordon, and Fields (1978). A group of subjects had teeth extracted. During the extraction, they were given nitrous oxide and local anesthesia. In the recovery room, they rated the amount of pain they were experiencing on a numerical scale. Two hours after surgery, the subjects were given a placebo and were again asked to rate their pain. An hour later, some of the subjects were given a placebo and some were given naloxone, a morphine antagonist. It is known that there are specific receptors to morphine in the brain and that the body can also release endorphins that bind to these sites. Naloxone blocks the morphine receptors. In the study, it was found that when those subjects who responded positively to the placebo received naloxone, they experienced an increase in pain that made their pain levels comparable to those of the patients who did not respond to the placebo. The implication is that those who responded to the placebo had produced endorphins, the actions of which were subsequently blocked by the naloxone.

An instance of the placebo effect was demonstrated by a psychologist, Claude Steele (2002), who gave a math exam to a group of male and female undergraduates at Stanford University.
One group (treatment) was told that the exam was gender-neutral, and the other group (controls) was not so informed. The men outperformed the women in the control group. In the treatment group, men and women performed equally well. Men in the treatment group did worse than men in the control group (Economist, Feb. 21, 2002).

11.4.3 The Lanarkshire Milk Experiment
The importance of the randomized assignment of individuals (or other experimental units) to treatment and control groups is illustrated by a famous study known as the Lanarkshire milk experiment. In the spring of 1930, an experiment was carried out in Lanarkshire, Scotland, to determine the effect of providing free milk to schoolchildren. In each participating school, some children (treatment group) were given free milk and others (controls) were not. The assignment of children to control or treatment was initially done at random; however, teachers were allowed to use their judgment in switching children between treatment and control to obtain a better balance of undernourished and well-nourished individuals in the groups.

A paper by Gosset (1931), who published under the name Student (as in Student's t test), is a very interesting critique of the experiment. An examination of the data revealed that at the start of the experiment the controls were heavier and taller. Student conjectured that the teachers, perhaps unconsciously, had adjusted the initial randomization in a manner that placed more of the undernourished children in the treatment group. A further complication was caused by weighing the children with their clothes on. The experimental data were weight gains measured in late spring relative to early spring or late winter. The more well-to-do children probably tended to be better nourished and may have had heavier winter clothing than the poor children. Thus, the well-to-do children's weight gains were vitiated as a result of differences in clothing, which may have influenced comparisons between the treatment and control groups.

11.4.4 The Portacaval Shunt
Cirrhosis of the liver, to which alcoholics are prone, is a condition in which resistance to blood flow causes blood pressure in the liver to build up to dangerously high levels. Vessels may rupture, which may cause death. Surgeons have attempted to relieve this condition by connecting the portal artery, which feeds the liver, to the vena cava, one of the main veins returning to the heart, thus reducing blood flow through the liver. This procedure, called the portacaval shunt, had been used for more than 20 years when Grace, Muench, and Chalmers (1966) published an examination of 51 studies of the method. They examined the design of each study (presence or absence of a control group and presence or absence of randomization) and the investigators' conclusions (categorized as markedly enthusiastic, moderately enthusiastic, or not enthusiastic). The results are summarized in the following table, which speaks for itself:

                                 Enthusiasm
Design                    Marked    Moderate    None
No controls                 24          7         1
Nonrandomized controls      10          3         2
Randomized controls          0          1         3

The difference between the experiments that used controls and those that did not is not entirely surprising, because the placebo effect was probably operating. The importance of randomized assignment to treatment and control groups is illustrated by comparing the conclusions for the randomized and nonrandomized controlled experiments.
Randomization can help to ensure against subtle unconscious biases that may creep into an experiment. For example, a physician might tend to recommend surgery for patients who are somewhat more robust than the average. Articulate patients might be more likely to have an influence on the decision as to which group they are assigned to.

11.4.5 FD&C Red No. 40
This discussion follows Lagakos and Mosteller (1981). During the middle and late 1970s, experiments were conducted to determine possible carcinogenic effects of a widely used food coloring, FD&C Red No. 40. One of the experiments involved 500 male and 500 female mice. Both genders were divided into five groups: two control groups, a low-dose group, a medium-dose group, and a high-dose group. The mice were bred in the following way: Males and females were paired and before and during mating were given their prescribed dose of Red No. 40. The regime was continued during gestation and weaning of the young. From litters that had at least three pups of each sex, three of each sex were selected randomly and continued on their parents' dosage throughout their lives. After 109–111 weeks, all the mice still living were killed. The presence or absence of reticuloendothelial tumors was of particular interest.

Although there were significant differences between some of the treatment groups, the results were rather confusing. For example, there was a significant difference between the incidence rates for the two male control groups, and among the males the medium-dose group had the lowest incidence. Several experts were asked to examine the results of this and other experiments. Among them were Lagakos and Mosteller, who requested information on how the cages that housed the mice were arranged. There were three racks of cages, each containing five rows of seven cages in the front and five rows of seven cages in the back. Five mice were housed in each cage. The mice were assigned to the cages in a systematic way: The first male control group was in the top of the front of rack 1; the first female control group was in the bottom of the front of rack 1; and so on, ending with the high-dose females in the bottom of the back of rack 3 (Figure 11.9). Lagakos and Mosteller showed that there were effects due to cage position that could not be explained by gender or by dosage group. A random assignment of cage positions would have eliminated this confounding. Lagakos and Mosteller also suggested some experimental designs to systematically control for cage position.

[FIGURE 11.9 Location of mice cages in racks.]

It was also possible that a litter effect might be complicating the analysis, since littermates received the same treatment and littermates of the same sex were housed in the same or contiguous cages. In the presence of a litter effect, mice from the same litter might show less variability than that present among mice from different litters. This reduces the effective sample size—in the extreme case in which littermates react identically, the effective sample size is the number of litters, not the total number of mice. One way around this problem would have been to use only one mouse from each litter. The presence of a possible selection bias is another problem.
Because mice were included in the experiment only if they came from a litter with at least three males and three females, offspring of possibly less healthy parents were excluded. This could be a serious problem, since exposure to Red No. 40 might affect the parents' health and the birth process. If, for example, among the high-dose mice only the most hardy produced large enough litters, their offspring might be hardier than the controls' offspring.

11.4.6 Further Remarks on Randomization

As well as guarding against possible biases on the part of the experimenter, the process of randomization tends to balance any factors that may be influential but are not explicitly controlled in the experiment. Time is often such a factor; background variables such as temperature, equipment calibration, line voltage, and chemical composition can change slowly with time. In experiments that are run over some period of time, therefore, it is important to randomize the assignments to treatment and control over time.

Time is not the only factor that should be randomized, however. In agricultural experiments, the positions of test plots in a field are often randomly assigned. In biological experiments with test animals, the locations of the animals' cages may have an effect, as illustrated in the preceding section.

Although they are rarer than in other areas, randomized experiments have been carried out in the social sciences as well (Economist Feb 28, 2002). Randomized trials have been used to evaluate programs such as driver training and reduced class size, as well as programs in the criminal justice system. In evaluations of "whole-language" approaches to reading (in which children are taught to read by evaluating contextual clues rather than breaking down words), 52 randomized studies reviewed by the National Reading Panel in 2000 showed that effective reading instruction requires phonics. Randomized studies of "scared straight" programs, in which juvenile delinquents are introduced to prison inmates, suggested that the likelihood of subsequent arrests is actually increased by such programs.

Generally, if it is anticipated that a variable will have a significant effect, that variable should be included as one of the controlled factors in the experimental design. The matched-pairs design of this chapter can be used to control for a single factor. To control for more than one factor, factorial designs, which are briefly introduced in the next chapter, may be used.
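The mechanics of randomizing assignments over time are simple. The following sketch (Python, standard library only; the number of units and the seed are arbitrary choices for illustration) assigns units, indexed in the order in which they will be run, to treatment and control by drawing a random half, rather than by alternating, so that slowly drifting background conditions are balanced between the groups on average.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Experimental units indexed 0..19 in the time order they will be run.
units = list(range(20))

# Draw a random half for treatment; the remainder serve as controls.
treatment = set(random.sample(units, k=len(units) // 2))
assignment = ["T" if u in treatment else "C" for u in units]

# A random arrangement of ten T's and ten C's over time,
# rather than a systematic pattern such as TCTCTC...
print("".join(assignment))
```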
11.4.7 Observational Studies, Confounding, and Bias in Graduate Admissions

It is not always possible to conduct controlled experiments or use randomization. In evaluating some medical therapies, for example, a randomized, controlled experiment would be unethical if one therapy was strongly believed to be superior. For many problems of psychological interest (effects of parental modes of discipline, for example), it is impossible to conduct controlled experiments. In such situations, recourse is often made to observational studies. Hospital records may be examined to compare the outcomes of different therapies, or psychological records of children raised in different ways may be analyzed. Although such studies may be valuable, the results are seldom unequivocal. Because there is no randomization, it is always possible that the groups under comparison differ in respects other than their "treatments."

As an example, let us consider a study of gender bias in admissions to graduate school at the University of California at Berkeley (Bickel and O'Connell 1975). In the fall of 1973, 8442 men applied for admission to graduate studies at Berkeley, and 44% were admitted; 4321 women applied, and 35% were admitted. If the men and women had been similar in every respect other than sex, this would be strong evidence of sex bias. This was not a controlled, randomized experiment, however; sex was not randomly assigned to the applicants. As will be seen, the male and female applicants differed in other respects, which influenced admission. The following table shows admission rates for the six most popular majors on the Berkeley campus:

                  Men                       Women
        Number of    Percentage    Number of    Percentage
Major   Applicants   Admitted      Applicants   Admitted
A          825          62            108          82
B          560          63             25          68
C          325          37            593          34
D          417          33            375          35
E          191          28            393          34
F          373           6            341           7

If the percentages admitted are compared, women do not seem to be unfavorably treated. But when the combined admission rates for all six majors are calculated, it is found that 44% of the men and only 30% of the women were admitted, which seems paradoxical. The resolution of the paradox lies in the observation that the women tended to apply to majors that had low admission rates (C through F) and the men to majors that had relatively high admission rates (A and B). Because the study was observational, this factor was not controlled for; it was "confounded" with the factor of interest, sex. Randomization, had it been possible, would have tended to balance out the confounding factor.

Confounding also plays an important role in studies of the effect of coffee drinking. Several studies have claimed to show a significant association of coffee consumption with coronary disease. Clearly, randomized, controlled trials are not possible here; a randomly selected individual cannot be told that he or she is in the treatment group and must drink 10 cups of coffee a day for the next five years. Also, it is known that heavy coffee drinkers tend to smoke more than average, so smoking is confounded with coffee drinking. Hennekens et al. (1976) review several studies in this area.
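The arithmetic behind the reversal is easy to reproduce. The following sketch (Python; the variable and function names are ours, and the admitted counts are recovered from the rounded percentages in the table, so the pooled rates are approximate) recomputes the combined admission rates.

```python
# (number of applicants, percent admitted) for majors A-F,
# taken from the table above
men   = [(825, 62), (560, 63), (325, 37), (417, 33), (191, 28), (373, 6)]
women = [(108, 82), (25, 68), (593, 34), (375, 35), (393, 34), (341, 7)]

def pooled_rate(groups):
    """Combined admission rate over all majors, in percent."""
    admitted = sum(n * pct / 100 for n, pct in groups)
    applicants = sum(n for n, _ in groups)
    return 100 * admitted / applicants

# Within most majors women are admitted at comparable or higher rates,
# yet the combined rates reverse, because women applied mainly to the
# low-admission majors C-F.
print(pooled_rate(men))    # about 44.5 (44% in the text)
print(pooled_rate(women))  # about 32.5 (30% in the text, from unrounded counts)
```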
11.4.8 Fishing Expeditions

Another problem that sometimes flaws observational studies, and controlled experiments as well, is the "fishing expedition." For example, consider a hypothetical study of the effects of birth control pills. In such a case, it would be impossible to assign women to a treatment or a placebo at random, but a nonrandomized study might be conducted by carefully matching controls to treatments on such factors as age and medical history. The two groups might be followed for some time, with many variables being recorded for each subject, such as blood pressure, psychological measures, and incidences of various medical problems. After termination of the study, the two groups might be compared on each of these variables, and it might be found, say, that there was a "significant difference" in the incidence of melanoma.

The problem with this "significant finding" is the following. Suppose that 100 independent two-sample t tests are conducted at the .05 level and that, in fact, all the null hypotheses are true. We would expect that five of the tests would produce a "significant" result. Although each of the tests has probability .05 of type I error, as a collection they do not simultaneously have α = .05. The combined significance level is the probability that at least one of the null hypotheses is rejected:

α = P{at least one H0 rejected}
  = 1 − P{no H0 rejected}
  = 1 − .95¹⁰⁰
  = .994

Thus, with very high probability, at least one "significant" result will be found, even if all the null hypotheses are true.

There are no simple cures for this problem. One possibility is to regard the results of a fishing expedition as merely providing suggestions for further experiments. Alternatively, and in the same spirit, the data could be split randomly into two halves, one half for fishing in and the other half to be locked safely away, unexamined. "Significant" results from the first half could then be tested on the second half. A third alternative is to conduct each individual hypothesis test at a small significance level. To see how this works, suppose that all null hypotheses are true and that each of n null hypotheses is tested at level α. Let Ri denote the event that the ith null hypothesis is rejected, and let α* denote the overall probability of a type I error. Then

α* = P{R1 or R2 or ··· or Rn} ≤ P{R1} + P{R2} + ··· + P{Rn} = nα

Thus, if each of the n null hypotheses is tested at level α/n, the overall significance level is less than or equal to α. This is often called the Bonferroni method.
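These calculations are easy to check by simulation. The sketch below (Python, assuming NumPy and SciPy are available) runs 100 independent two-sample t tests with every null hypothesis true and estimates how often at least one test rejects, with and without the α/n Bonferroni adjustment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_tests, alpha = 1000, 100, 0.05
any_reject_raw = any_reject_bonf = 0

for _ in range(n_trials):
    # 100 independent two-sample t tests; every null hypothesis is true
    x = rng.standard_normal((n_tests, 25))
    y = rng.standard_normal((n_tests, 25))
    p = stats.ttest_ind(x, y, axis=1).pvalue
    any_reject_raw += (p < alpha).any()
    any_reject_bonf += (p < alpha / n_tests).any()

print(any_reject_raw / n_trials)   # near 1 - .95**100 = .994
print(any_reject_bonf / n_trials)  # at most about alpha = .05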
11.5 Concluding Remarks

This chapter was concerned with the problem of comparing two samples. Within this context, the fundamental statistical concepts of estimation and hypothesis testing, which were introduced in earlier chapters, were extended and utilized. The chapter also showed how informal descriptive and data analytic techniques are used in supplementing more formal analysis of data. Chapter 12 will extend the techniques of this chapter to deal with multisample problems. Chapter 13 is concerned with similar problems that arise in the analysis of qualitative data.

We considered two types of experiments, those with two independent samples and those with matched pairs. For the case of independent samples, we developed the t test, based on an assumption of normality, as well as a modification of the t test that takes into account possibly unequal variances. The Mann-Whitney test, based on ranks, was presented as a nonparametric method, that is, a method that is not based on an assumption of a particular distribution. Similarly, for the matched-pairs design, we developed a parametric t test and a nonparametric test, the signed rank test.

We discussed methods based on an assumption of normality and rank methods, which do not make this assumption. It turns out, rather surprisingly, that even if the normality assumption holds, the rank methods are quite powerful relative to the t test. Lehmann (1975) shows that the efficiency of the rank tests relative to the t test (that is, the ratio of sample sizes required to attain the same power) is typically around .95 if the distributions are normal. Thus, a rank test using a sample of size 100 is as powerful as a t test based on 95 observations. Collecting the extra 5 pieces of data is a small price to pay for a safeguard against nonnormality.

The bootstrap appeared again in this chapter; this recently developed technique is finding applications in a great variety of statistical problems. In contrast with earlier chapters, where bootstrap samples were generated from one distribution, here we have bootstrapped from two empirical distributions.

The chapter concluded with a discussion of experimental design, which emphasized the importance of incorporating controls and randomization in investigations. Possible problems associated with observational studies were discussed. Finally, the difficulties encountered in making many comparisons from a single data set were pointed out; such problems of multiplicity will come up again in Chapter 12.

11.6 Problems

1. A computer was used to generate four random numbers from a normal distribution with a set mean and variance: 1.1650, .6268, .0751, .3516. Five more random normal numbers with the same variance but perhaps a different mean were then generated (the mean may or may not actually be different): .3035, 2.6961, 1.0591, 2.7971, 1.2641.
a. What do you think the means of the random normal number generators were? What do you think the difference of the means was?
b. What do you think the variance of the random number generator was?
c. What is the estimated standard error of your estimate of the difference of the means?
d. Form a 90% confidence interval for the difference of the means of the random number generators.
e. In this situation, is it more appropriate to use a one-sided test or a two-sided test of the equality of the means?
f. What is the p-value of a two-sided test of the null hypothesis of equal means?
g. Would the hypothesis that the means were the same versus a two-sided alternative be rejected at the significance level α = .1?
h. Suppose you know that the variance of the normal distribution was σ² = 1. How would your answers to the preceding questions change?

2. The difference of the means of two normal distributions with equal variance is to be estimated by sampling an equal number of observations from each distribution. If it were possible, would it be better to halve the standard deviations of the populations or to double the sample sizes?

3. In Section 11.2.1, we considered two methods of estimating Var(X̄ − Ȳ). Under the assumption that the two population variances were equal, we estimated this quantity by sp²(1/n + 1/m), and without this assumption by sX²/n + sY²/m. Show that these two estimates are identical if m = n.

4. Respond to the following: "Using the t distribution is absolutely ridiculous, another example of deliberate mystification! It's valid when the populations are normal and have equal variance. If the sample sizes were so small that the t distribution were practically different from the normal distribution, you would be unable to check these assumptions."

5. Respond to the following: "Here is another example of deliberate mystification: the idea of formulating and testing a null hypothesis. Let's take Example A of Section 11.2.1. It seems to me that it is inconceivable that the expected values of any two methods of measurement could be exactly equal. It is certain that there will be subtle differences at the very least. What is the sense, then, in testing H0: μX = μY?"

6. Respond to the following: "I have two batches of numbers and I have a corresponding x̄ and ȳ. Why should I test whether they are equal when I can just see whether they are or not?"

7. In the development of Section 11.2.1, where are the following assumptions used? (1) X1, X2, ..., Xn are independent random variables; (2) Y1, Y2, ..., Yn are independent random variables; (3) the X's and Y's are independent.
8. An experiment to determine the efficacy of a drug for reducing high blood pressure is performed using four subjects in the following way: two of the subjects are chosen at random for the control group and two for the treatment group. During the course of treatment with the drug, the blood pressure of each of the subjects in the treatment group is measured for ten consecutive days, as is the blood pressure of each of the subjects in the control group.
a. In order to test whether the treatment has an effect, do you think it is appropriate to use the two-sample t test with n = m = 20?
b. Do you think it is appropriate to use the Mann-Whitney test with n = m = 20?

9. Referring to the data in Section 11.2.1.1, compare iron retention at concentrations of 10.2 and .3 millimolar using graphical procedures and parametric and nonparametric tests. Write a brief summary of your conclusions.

10. Verify that the two-sample t test at level α of H0: μX = μY versus HA: μX ≠ μY rejects if and only if the confidence interval for μX − μY does not contain zero.

11. Explain how to modify the t test of Section 11.2.1 to test H0: μX = μY + Δ versus HA: μX ≠ μY + Δ, where Δ is specified.

12. An equivalence between hypothesis tests and confidence intervals was demonstrated in Chapter 9. In Chapter 10, a nonparametric confidence interval for the median, η, was derived. Explain how to use this confidence interval to test the hypothesis H0: η = η0. In the case where η0 = 0, show that using this approach on a sample of differences from a paired experiment is equivalent to the sign test. The sign test counts the number of positive differences and uses the fact that, when the null hypothesis is true, the number of positive differences follows a binomial distribution with parameters n and .5. Apply the sign test to the data from the measurement of mercury levels, listed in Section 11.3.3.

13. Let X1, ..., X25 be i.i.d. N(.3, 1). Consider testing the null hypothesis H0: μ = 0 versus HA: μ > 0 at significance level α = .05. Compare the power of the sign test and the power of the test based on normal theory, assuming that σ is known.

14. Suppose that X1, ..., Xn are i.i.d. N(μ, σ²). To test the null hypothesis H0: μ = μ0, the t test is often used: t = (X̄ − μ0)/sX̄. Under H0, t follows a t distribution with n − 1 df. Show that the likelihood ratio test of this H0 is equivalent to the t test.

15. Suppose that n measurements are to be taken under a treatment condition and another n measurements are to be taken independently under a control condition. It is thought that the standard deviation of a single observation is about 10 under both conditions. How large should n be so that a 95% confidence interval for μX − μY has a width of 2? Use the normal distribution rather than the t distribution, since n will turn out to be rather large.

16. Referring to Problem 15, how large should n be so that the test of H0: μX = μY against the one-sided alternative HA: μX > μY has a power of .5 if μX − μY = 2 and α = .10?

17. Consider conducting a two-sided test of the null hypothesis H0: μX = μY as described in Problem 16. Sketch power curves for (a) α = .05, n = 20; (b) α = .10, n = 20; (c) α = .05, n = 40; (d) α = .10, n = 40. Compare the curves.

18. Two independent samples are to be compared to see if there is a difference in the population means.
If a total of m subjects are available for the experiment, how should this total be allocated between the two samples in order to (a) provide the shortest confidence interval for μX − μY and (b) make the test of H0: μX = μY as powerful as possible? Assume that the observations in the two samples are normally distributed with the same variance.

19. An experiment is planned to compare the mean of a control group to the mean of an independent sample of a group given a treatment. Suppose that there are to be 25 observations in each group. Suppose that the observations are approximately normally distributed and that the standard deviation of a single measurement in either group is σ = 5.
a. What will the standard error of Ȳ − X̄ be?
b. With a significance level α = .05, what is the rejection region of the test of the null hypothesis H0: μY = μX versus the alternative HA: μY > μX?
c. What is the power of the test if μY = μX + 1?
d. Suppose that the p-value of the test turns out to be .07. Would the test reject at significance level α = .10?
e. What is the rejection region if the alternative is HA: μY ≠ μX? What is the power if μY = μX + 1?

20. Consider Example A of Section 11.3.1 using a Bayesian model. As in the example, use a normal model for the differences, and also use an improper prior for the expected difference and the precision (as in the case of unknown mean and variance in Section 8.6). Find the posterior probability that the expected difference is positive. Find a 90% posterior credibility interval for the expected difference.

21. A study was done to compare the performances of engine bearings made of different compounds (McCool 1979). Ten bearings of each type were tested. The following table gives the times until failure (in units of millions of cycles):

Type I    Type II
 3.03       3.19
 5.53       4.26
 5.60       4.47
 9.30       4.53
 9.92       4.67
12.51       4.69
12.95       5.78
15.21       6.79
16.04       9.37
16.84      12.75

a. Use normal theory to test the hypothesis that there is no difference between the two types of bearings.
b. Test the same hypothesis using a nonparametric method.
c. Which of the methods, that of part (a) or that of part (b), do you think is better in this case?
d. Estimate π, the probability that a type I bearing will outlast a type II bearing.
e. Use the bootstrap to estimate the sampling distribution of π̂ and its standard error.
f. Use the bootstrap to find an approximate 90% confidence interval for π.

22. An experiment was done to compare two methods of measuring the calcium content of animal feeds. The standard method uses calcium oxalate precipitation followed by titration and is quite time-consuming. A new method using flame photometry is faster. Measurements of the percent calcium content of 118 routine feed samples made by each method (Heckman 1960) are contained in the file calcium. Analyze the data to see if there is any systematic difference between the two methods. Use both parametric and nonparametric tests and graphical methods.

23. Let X1, ..., Xn be i.i.d. with cdf F, and let Y1, ..., Ym be i.i.d. with cdf G. The hypothesis to be tested is that F = G. Suppose for simplicity that m + n is even, so that in the combined sample of X's and Y's, (m + n)/2 observations are less than the median and (m + n)/2 are greater.
a. As a test statistic, consider T, the number of X's less than the median of the combined sample.
Show that T follows a hypergeometric distribution under the null hypothesis:

P(T = t) = C((m + n)/2, t) C((m + n)/2, n − t) / C(m + n, n)

where C(a, b) denotes the number of ways of choosing b objects from a. Explain how to form a rejection region for this test.
b. Show how to find a confidence interval for the difference between the median of F and the median of G under the shift model G(x) = F(x − Δ). (Hint: Use the order statistics.)
c. Apply the results of (a) and (b) to the data of Problem 21.

24. Find the exact null distribution of the Mann-Whitney statistic, UY, in the case where m = 3 and n = 2.

25. Referring to Example A in Section 11.2.1: (a) if the smallest observation for method B (79.94) is made arbitrarily small, will the t test still reject? (b) If the largest observation for method B (80.03) is made arbitrarily large, will the t test still reject? (c) Answer the same questions for the Mann-Whitney test.

26. Let X1, ..., Xn be a sample from an N(0, 1) distribution and let Y1, ..., Yn be an independent sample from an N(1, 1) distribution.
a. Determine the expected rank sum of the X's.
b. Determine the variance of the rank sum of the X's.

27. Find the exact null distribution of W+ in the case where n = 4.

28. For n = 10, 20, and 30, find the .05 and .01 critical values for a two-sided signed rank test from the tables and then by using the normal approximation. Compare the values.

29. (Permutation Test for Means) Here is another view on hypothesis testing that we will illustrate with Example A of Section 11.2.1. We ask whether the measurements produced by methods A and B are identical or exchangeable in the following sense. There are 13 + 8 = 21 measurements in all, and there are C(21, 8), or about 2 × 10⁵, ways that 8 of these could be assigned to method B. Is the particular assignment we have observed unusual among these in the sense that the means of the two samples are unusually different?
a. It's not inconceivable, but it may be asking too much for you to generate all C(21, 8) partitions. So just choose a random sample of these partitions, say of size 1000, and make a histogram of the resulting values of X̄A − X̄B. Where on this distribution does the value of X̄A − X̄B that was actually observed fall? Compare to the result of Example B of Section 11.2.1.
b. In what way is this procedure similar to the Mann-Whitney test?
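As an aside, the mechanics of the random-partition procedure in Problem 29 can be sketched in a few lines of code (Python with NumPy; the measurements below are simulated placeholders, since the data of Example A are not reproduced here — substitute the actual 13 method-A and 8 method-B values).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data standing in for the 13 method-A and 8 method-B
# measurements of Example A of Section 11.2.1.
a = rng.normal(80.02, 0.03, size=13)
b = rng.normal(79.98, 0.03, size=8)

combined = np.concatenate([a, b])
observed = a.mean() - b.mean()

# Sample 1000 of the C(21, 8) partitions at random: each random
# permutation assigns the first 8 values to "method B".
diffs = np.empty(1000)
for i in range(1000):
    perm = rng.permutation(combined)
    diffs[i] = perm[8:].mean() - perm[:8].mean()

# Proportion of partitions at least as extreme as what was observed.
print(np.mean(np.abs(diffs) >= abs(observed)))
```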
30. Use the bootstrap to estimate the standard error of, and a confidence interval for, X̄A − X̄B, and compare to the result of Example A of Section 11.2.1.

31. In Section 11.2.3, if F = G, what are E(π̂) and Var(π̂)? Would there be any advantage in using equal sample sizes m = n in estimating π, or does it make no difference?

32. If X ∼ N(μX, σX²) and Y is independent N(μY, σY²), what is π = P(X < Y) in terms of μX, μY, σX, and σY?

33. To compare two variances in the normal case, let X1, ..., Xn be i.i.d. N(μX, σX²), and let Y1, ..., Ym be i.i.d. N(μY, σY²), where the X's and Y's are independent samples. Argue that under H0: σX = σY,

sX²/sY² ∼ Fn−1, m−1

a. Construct rejection regions for one- and two-sided tests of H0.
b. Construct a confidence interval for the ratio σX²/σY².
c. Apply the results of parts (a) and (b) to Example A in Section 11.2.1. (Caution: This test and confidence interval are not robust against violations of the assumption of normality.)

34. This problem contrasts the power functions of paired and unpaired designs. Graph and compare the power curves for testing H0: μX = μY for the following two designs.
a. Paired: Cov(Xi, Yi) = 50, σX = σY = 10, i = 1, ..., 25.
b. Unpaired: X1, ..., X25 and Y1, ..., Y25 are independent with variances as in part (a).

35. An experiment was done to measure the effects of ozone, a component of smog. A group of 22 seventy-day-old rats were kept in an environment containing ozone for 7 days, and their weight gains were recorded. Another group of 23 rats of a similar age were kept in an ozone-free environment for a similar time, and their weight gains were recorded. The data (in grams) are given below. Analyze the data to determine the effect of ozone. Write a summary of your conclusions. [This problem is from Doksum and Sievers (1976), who provide an interesting analysis.]

Controls: 41.0, 38.4, 24.9, 25.9, 21.9, 18.3, 13.1, 27.3, 28.5, −16.9, 17.4, 21.8, 15.4, 27.4, 19.2, 22.4, 17.7, 26.0, 29.4, 21.4, 22.7, 26.0, 26.6

Ozone: 10.1, 6.1, 20.4, 7.3, 14.3, 15.5, −9.9, 6.8, 28.2, 17.9, −12.9, 14.0, 6.6, 12.1, 15.7, 39.9, −15.9, 54.6, −14.7, 44.1, −9.0, −9.0

36. Lin, Sutton, and Qurashi (1979) compared microbiological and hydroxylamine methods for the analysis of ampicillin dosages. In one series of experiments, pairs of tablets were analyzed by the two methods. The data in the following table give the percentages of claimed amount of ampicillin found by the two methods in several pairs of tablets. What are X̄ − Ȳ and sX̄−Ȳ? If the pairing had been erroneously ignored and it had been assumed that the two samples were independent, what would have been the estimate of the standard deviation of X̄ − Ȳ? Analyze the data to determine if there is a systematic difference between the two methods.

Microbiological   Hydroxylamine
   Method            Method
    97.2              97.2
   105.8              97.8
    99.5              96.2
   100.0             101.8
    93.8              88.0
    79.2              74.0
    72.0              75.0
    72.0              67.5
    69.5              65.8
    20.5              21.2
    95.2              94.8
    90.8              95.8
    96.2              98.0
    96.2              99.0
    91.0             100.2

37. Stanley and Walton (1961) ran a controlled clinical trial to investigate the effect of the drug stelazine on chronic schizophrenics. The trials were conducted on chronic schizophrenics in two closed wards. In each of the wards, the patients were divided into two groups matched for age, length of time in the hospital, and score on a behavior rating sheet. One member of each pair was given stelazine, and the other a placebo. Only the hospital pharmacist knew which member of each pair received the actual drug. The following table gives the behavioral rating scores for the patients at the beginning of the trial and after 3 months. High scores are good.

Ward A
   Stelazine           Placebo
Before   After     Before   After
 2.3      3.1       2.4      2.0
 2.0      2.1       2.2      2.6
 1.9      2.45      2.1      2.0
 3.1      3.7       2.9      2.0
 2.2      2.54      2.2      2.4
 2.3      3.72      2.4      3.18
 2.8      4.54      2.7      3.0
 1.9      1.61      1.9      2.54
 1.1      1.63      1.3      1.72

Ward B
   Stelazine           Placebo
Before   After     Before   After
 1.9      1.45      1.9      1.91
 2.3      2.45      2.4      2.54
 2.0      1.81      2.0      1.45
 1.6      1.72      1.5      1.45
 1.6      1.63      1.5      1.54
 2.6      2.45      2.7      1.54
 1.7      2.18      1.7      1.54

a. For each of the wards, test whether stelazine is associated with improvement in the patients' scores.
b. Test if there is any difference in improvement between the wards. [These data are also presented in Lehmann (1975), who discusses methods of combining the data from the wards.]

38. Bailey, Cox, and Springer (1978) used high-pressure liquid chromatography to measure the amounts of various intermediates and by-products in food dyes. The following table gives the percentages added and found for two substances in the dye FD&C Yellow No. 5. Is there any evidence that the amounts found differ systematically from the amounts added?
      Sulfanilic Acid             Pyrazolone-T
Percentage   Percentage    Percentage   Percentage
  Added        Found         Added        Found
  .048         .060          .035         .031
  .096         .091          .087         .084
  .20          .16           .19          .16
  .19          .16           .19          .17
  .096         .091          .16          .15
  .18          .19           .032         .040
  .080         .070          .060         .076
  .24          .23           .13          .11
  0            0             .080         .082
  .040         .042          0            0
  .060         .056

39. An experiment was done to test a method for reducing faults on telephone lines (Welch 1987). Fourteen matched pairs of areas were used. The following table shows the fault rates for the test areas and for the control areas:

Test   Control
 676       88
 206      570
 230      605
 256      617
 280      653
 433     2913
 337      924
 466      286
 497     1098
 512      982
 794     2346
 428      321
 452      615
 512      519

a. Plot the differences versus the control rate and summarize what you see.
b. Calculate the mean difference, its standard deviation, and a confidence interval.
c. Calculate the median difference and a confidence interval and compare to the previous result.
d. Do you think it is more appropriate to use a t test or a nonparametric method to test whether the apparent difference between test and control could be due to chance? Why? Carry out both tests and compare.

40. Biological effects of magnetic fields are a matter of current concern and research. In an early study of the effects of a strong magnetic field on the development of mice (Barnothy 1964), 10 cages, each containing three 30-day-old albino female mice, were subjected for a period of 12 days to a field with an average strength of 80 Oe/cm. Thirty other mice housed in 10 similar cages were not placed in a magnetic field and served as controls. The following table shows the weight gains, in grams, for each of the cages.

Field Present   Field Absent
    22.8            23.5
    10.2            31.0
    20.8            19.5
    27.0            26.2
    19.2            26.5
     9.0            25.2
    14.2            24.5
    19.8            23.8
    14.5            27.8
    14.8            22.0

a. Display the data graphically with parallel dotplots. (Draw two parallel number lines, and put dots on one at points corresponding to the weight gains of the controls and on the other at points corresponding to the gains of the treatment group.)
b. Find a 95% confidence interval for the difference of the mean weight gains.
c. Use a t test to assess the statistical significance of the observed difference. What is the p-value of the test?
d. Repeat using a nonparametric test.
e. What is the difference of the median weight gains?
f. Use the bootstrap to estimate the standard error of the difference of median weight gains.
g. Form a confidence interval for the difference of median weight gains based on the bootstrap approximation to the sampling distribution.

41. The Hodges-Lehmann shift estimate is defined to be Δ̂ = median(Xi − Yj), where X1, X2, ..., Xn are independent observations from a distribution F and Y1, Y2, ..., Ym are independent observations from a distribution G and are independent of the Xi.
a. Show that if F and G are normal distributions, then E(Δ̂) = μX − μY.
b. Why is Δ̂ robust to outliers?
c. What is Δ̂ for the previous problem, and how does it compare to the differences of the means and of the medians?
d. Use the bootstrap to approximate the sampling distribution and the standard error of Δ̂.
e. From the bootstrap approximation to the sampling distribution, form an approximate 90% confidence interval for Δ.

42. Use the data of Problem 40 of Chapter 10.
a. Estimate π, the probability that more rain will fall from a randomly selected seeded cloud than from a randomly selected unseeded cloud.
b. Use the bootstrap to estimate the standard error of π̂.
c. Use the bootstrap to form an approximate confidence interval for π.

43. Suppose that X1, X2, ..., Xn and Y1, Y2, ..., Ym are two independent samples. As a measure of the difference in location of the two samples, the difference of the 20% trimmed means is used. Explain how the bootstrap could be used to estimate the standard error of this difference.

44. Interest in the role of vitamin C in mental illness in general, and schizophrenia in particular, was spurred by a paper of Linus Pauling in 1968. This exercise takes its data from a study of plasma levels and urinary vitamin C excretion in schizophrenic patients (Subotičanec et al. 1986). Twenty schizophrenic patients and 15 controls with a diagnosis of neurosis of different origin, all of whom had been patients at the same hospital for a minimum of 2 months, were selected for the study. Before the experiment, all the subjects were on the same basic hospital diet. A sample of 2 ml of venous blood for vitamin C determination was drawn from each subject before breakfast and after the subjects had emptied their bladders. Each subject was then given 1 g of ascorbic acid dissolved in water. No foods containing ascorbic acid were available during the test. For the next 6 h, all urine was collected from the subjects for assay of vitamin C. A second blood sample was also drawn 2 h after the dose of vitamin C. The following table shows the plasma concentrations (mg/dl):

  Schizophrenics     Nonschizophrenics
  0 h       2 h        0 h       2 h
  .55      1.22       1.27      2.00
  .60      1.54        .09       .41
  .21       .97       1.64      2.37
  .09       .45        .23       .41
 1.01      1.54        .18       .79
  .24       .75        .12       .94
  .37      1.12        .85      1.72
 1.01      1.31        .69      1.75
  .26       .92        .78      1.60
  .30      1.27        .63      1.80
  .26      1.08        .50      2.08
  .10      1.19        .62      1.58
  .42       .64        .19       .86
  .11       .30        .66      1.92
  .14       .24        .91      1.54
  .20       .89
  .09       .24
  .32      1.68
  .24       .99
  .25       .67

a. Graphically compare the two groups at the two times and for the difference in concentration at the two times.
b. Use the t test to assess the strength of the evidence for differences between the two groups at 0 h, at 2 h, and for the difference 2 h − 0 h.
c. Use the Mann-Whitney test to test the hypotheses of (b).

The following table shows the amounts of urinary vitamin C, both total and in milligrams per kilogram of body weight, for the two groups:

  Schizophrenics     Nonschizophrenics
 Total    mg/kg       Total    mg/kg
  16.6     .19        289.4    3.96
  33.3     .44          0.0    0.00
  34.1     .39        620.4    7.95
   0.0     .00          0.0    0.00
 119.8    1.75          8.5     .10
    .1     .01          5.5     .09
  25.3     .27         43.2     .91
 359.3    5.99         91.7    1.00
   6.6     .10        200.9    3.46
    .4     .01        113.8    2.01
  62.8     .68        102.2    1.50
    .2     .01        108.2    1.98
  13.0     .15         36.9     .49
   0.0    0.00        122.0    1.72
   0.0    0.00        101.9    1.52
   5.9     .10
    .1     .01
   6.0     .07
  32.1     .42
   0.0    0.00

d. Use descriptive statistics and graphical presentations to compare the two groups with respect to total excretion and mg/kg of body weight. Do the data look normally distributed?
e. Use a t test to compare the two groups on both variables. Is the normality assumption reasonable?
f. Use the Mann-Whitney test to compare the two groups. How do the results compare with those obtained in part (e)?

The lower levels of plasma vitamin C in the schizophrenics before administration of ascorbic acid could be attributed to several factors. Interindividual differences in the intake of meals cannot be excluded, despite the fact that all patients were offered the same food. A more interesting possibility is that the differences are the result of poorer resorption or of higher ascorbic acid utilization in schizophrenics.
In order to answer this question, another experiment was run on 15 schizophrenics and 15 controls. All subjects were given 70 mg of ascorbic acid daily for 4 weeks before the ascorbic acid loading test. The following table shows the concentration of plasma vitamin C (mg/dl) and the 6-h urinary excretion (mg) after administration of 1 g of ascorbic acid.

  Schizophrenics          Controls
Plasma    Urine       Plasma    Urine
  .72      86.20       1.02     190.14
 1.11      21.55        .86     149.76
  .96     182.07        .78     285.27
 1.23      88.28       1.38     244.93
  .76      76.58        .95     184.45
  .75      18.81       1.00     135.34
 1.26      50.02        .47     157.74
  .64     107.74        .60     125.65
  .67        .09       1.15     164.98
 1.05     113.23        .86      99.65
 1.28      34.38        .61      86.29
  .54       8.44       1.01     142.23
  .77     109.03        .77     144.60
 1.11     144.44        .77     265.40
  .51     172.09        .94      28.26

g. Use graphical methods and descriptive statistics to compare the two groups with respect to plasma concentrations and urinary excretion.
h. Use the t test to compare the two groups on the two variables. Does the normality assumption look reasonable?
i. Compare the two groups using the Mann-Whitney test.

45. This and the next two problems are based on discussions and data in Le Cam and Neyman (1967), which is devoted to the analysis of weather modification experiments. The examples illustrate some ways in which principles of experimental design have been used in this field. During the summers of 1957 through 1960, a series of randomized cloud-seeding experiments were carried out in the mountains of Arizona. Of each pair of successive days, one day was randomly selected for seeding. The seeding was done during a two-hour to four-hour period starting at midday, and rainfall during the afternoon was measured by a network of 29 gauges. The data for the four years are given in the following table (in inches); observations are listed in chronological order.
a. Analyze the data for each year and for the years pooled together to see if there appears to be any effect due to seeding. You should use graphical descriptive methods to get a qualitative impression of the results and hypothesis tests to assess the significance of the results.
b. Why should the day on which seeding is to be done be chosen at random rather than just alternating seeded and unseeded days? Why should the days be paired at all, rather than just deciding randomly which days to seed?

        1957               1958               1959               1960
Seeded  Unseeded    Seeded  Unseeded    Seeded  Unseeded    Seeded  Unseeded
 0       .154        .152    .013        .015    0           0       .010
 .154    0           0       0           0       0           0       0
 .003    .008        0       .445        0       .086        .042    .057
 .084    .033        .002    0           .021    .006        0       0
 .002    .035        .007    .079        0       .115        0       .093
 .157    .007        .013    .006        .004    .090        0       .183
 .010    .140        .161    .008        .010    0           .152    0
 0       .022        0       .001        0       0           0       0
 .002    0           .274    .001        .055    0           0       0
 .078    .074        .001    .025        .004    .076        0       0
 .101    .002        .122    .046        .053    .090        0       0
 .169    .318        .101    .007        0       0           0       0
 .139    .096        .012    .019        0       .078        .008    0
 .172    0           .002    0           .090    .121        .040    .060
 0       0           .066    0           .028   1.027        .003    .102
 0       .050        .040    .012        0       .104        .011    .041
 .032    .023        .133    .172        .083    .002        0       0

46. The National Weather Bureau's ACN cloud-seeding project was carried out in the states of Oregon and Washington. Cloud seeding was accomplished by dispersing dry ice from an aircraft; only clouds that were deemed "ripe" for seeding were candidates for seeding. On each occasion, a decision was made at random whether to seed, the probability of seeding being 2/3. This resulted in 22 seeded and 13 control cases. Three types of targets were considered, two of which are dealt with in this problem.
Type I targets were large geographical areas downwind from the seeding; type II targets were sections of type I targets located so as to have, theoretically, the greatest sensitivity to cloud seeding. The following table gives the average target rainfalls (in inches) for the seeded and control cases, listed in chronological order. Is there evidence that seeding has an effect on either type of target? In what ways is the design of this experiment different from that of the one in Problem 45?

   Control Cases          Seeded Cases
Type I    Type II      Type I    Type II
.0080     .0000        .1218     .0200
.0046     .0000        .0403     .0163
.0549     .0053        .1166     .1560
.1313     .0920        .2375     .2885
.0587     .0220        .1256     .1483
.1723     .1133        .1400     .1019
.3812     .2880        .2439     .1867
.1720     .0000        .0072     .0233
.1182     .1058        .0707     .1067
.1383     .2050        .1036     .1011
.0106     .0100        .1632     .2407
.2126     .2450        .0788     .0666
.1435     .1529        .0365     .0133
                       .2409     .2897
                       .0408     .0425
                       .2204     .2191
                       .1847     .0789
                       .3332     .3570
                       .0676     .0760
                       .1097     .0913
                       .0952     .0400
                       .2095     .1467

47. During 1963 and 1964, an experiment was carried out in France; its design differed somewhat from those of the previous two problems. A 1500-km² target area was selected, and an adjacent area of about the same size was designated as the control area; 33 ground generators were used to produce silver iodide to seed the target area. Precipitation was measured by a network of gauges for each suitable "rainy period," which was defined as a sequence of periods of continuous precipitation between dry spells of a specified length. When a forecaster determined that the situation was favorable for seeding, he telephoned an order to a service agent, who then opened a sealed envelope that contained an order to actually seed or not. The envelopes had been prepared in advance, using a table of random numbers. The following table gives precipitation (in inches) in the target and control areas for the seeded and unseeded periods.
a. Analyze the data, which are listed in chronological order, to see if there is an effect of seeding.
b. The analysis done by the French investigators used the square root transformation in order to make normal theory more applicable. Do you think that taking the square root was an effective transformation for this purpose?
c. Reflect on the nature of this design. In particular, what advantage is there to using the control area? Why not just compare seeded and unseeded periods on the target area?

      Seeded                 Unseeded
Target    Control       Target    Control
  1.6        1.0          1.1        2.2
 28.1       27.0          3.5        5.2
  7.8         .3          2.6        0.0
  4.0        6.0          2.6        2.0
  9.6       12.6          9.8        4.9
  0.2        0.5          5.6        8.5
 18.7        8.7           .1        3.5
 16.5       21.5          0.0        1.1
  4.6       13.9         17.7       11.0
  9.3        6.7         19.4       19.8
  3.5        4.5          8.9        5.3
  0.1        0.7         10.6        8.9
 11.5        8.7         10.2        4.5
  0.0        0.0         16.0       13.0
  9.3       10.7          9.7       21.1
  5.5        4.7         21.4       15.9
 70.2       29.1          6.1       19.5
  0.7        1.9         24.3       16.3
 38.6       34.7         20.9        6.3
 11.3       10.2         60.2       47.0
  3.3        2.7         15.2       10.8
  8.9        2.8          2.7        4.8
 11.1        4.3          0.3        0.0
 64.3       38.7         12.2        5.7
 16.6       11.1          2.2        5.1
  7.3        6.5         23.3       30.6
  3.2        3.0          9.9        3.7
 23.9       13.6          0.6        0.1

48. Proteinuria, the presence of excess protein in urine, is a symptom of renal (kidney) distress among diabetics. Taguma et al. (1985) studied the effects of captopril for treating proteinuria in diabetics. Urinary protein was measured for 12 patients before and after eight weeks of captopril therapy. The amounts of urinary protein (in g/24 hrs) before and after therapy are shown in the following table. What can you conclude about the effect of captopril?
Consider using parametric or nonparametric methods, and analyzing the data on the original scale or on a log scale.

Before   After
 24.6     10.1
 17.0      5.7
 16.0      5.6
 10.4      3.4
  8.2      6.5
  7.9      0.7
  8.2      6.5
  7.9      0.7
  5.8      6.1
  5.4      4.7
  5.1      2.0
  4.7      2.9

49. Egyptian researchers, Kamal et al. (1991), took a sample of 126 police officers subject to inhalation of vehicle exhaust in downtown Cairo and found an average blood lead concentration of 29.2 μg/dl with a standard deviation of 7.5 μg/dl. A sample of 50 policemen from a suburb, Abbasia, had an average concentration of 18.2 μg/dl and a standard deviation of 5.8 μg/dl. Form a confidence interval for the population difference and test the null hypothesis that there is no difference in the populations.

50. The file bodytemp contains normal body temperature readings (degrees Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and 65 females (coded by 2) from Shoemaker (1996).
a. Using normal theory, form a 95% confidence interval for the difference of mean body temperatures between males and females. Is the use of the normal approximation reasonable?
b. Using normal theory, form a 95% confidence interval for the difference of mean heart rates between males and females. Is the use of the normal approximation reasonable?
c. Use both parametric and nonparametric tests to compare the body temperatures and heart rates. What do you conclude?

51. A common symptom of otitis media (inflammation of the middle ear) in young children is the prolonged presence of fluid in the middle ear, called middle-ear effusion. It is hypothesized that breast-fed babies tend to have less prolonged effusions than do bottle-fed babies. Rosner (2006) presents the results of a study of 24 pairs of infants who were matched according to sex, socioeconomic status, and type of medication taken. One member of each pair was bottle-fed, and the other was breast-fed. The file ears gives the durations (in days) of middle-ear effusions after the first episode of otitis media.
a. Examine the data using graphical methods and summarize your conclusions.
b. In order to test the hypothesis of no difference, do you think it is more appropriate to use a parametric or a nonparametric test? Carry out a test. What do you conclude?

52. The media often present short reports of the results of experiments. To the critical reader or listener, such reports often raise more questions than they answer. Comment on possible pitfalls in the interpretation of each of the following.
a. It is reported that patients whose hospital rooms have a window recover faster than those whose rooms do not.
b. Nonsmoking wives whose husbands smoke have a cancer rate twice that of wives whose husbands do not smoke.
c. A 2-year study in North Carolina found that 75% of all industrial accidents in the state happened to workers who had skipped breakfast.
d. A school integration program involved busing children from minority schools to majority (primarily white) schools. Participation in the program was voluntary. It was found that the students who were bused scored lower on standardized tests than did their peers who chose not to be bused.
e. When a group of students were asked to match pictures of newborns with pictures of their mothers, they were correct 36% of the time.
f. A survey found that those who drank a moderate amount of beer were healthier than those who totally abstained from alcohol.
g. A 15-year study of more than 45,000 Swedish soldiers revealed that heavy users of marijuana were six times more likely than nonusers to develop schizophrenia.
h. A University of Wisconsin study showed that within 10 years of the wedding, 38% of those who had lived together before marriage had split up, compared to 27% of those who had married without a "trial period."
i. A study of nearly 4,000 elderly North Carolinians has found that those who attended religious services every week were 46% less likely to die over a six-year period than people who attended less often or not at all, according to researchers at Duke University Medical Center.

53. Explain why, in Levine's experiment (Example A in Section 11.3.1), subjects also smoked cigarettes made of lettuce leaves and unlit cigarettes.

54. This example is taken from an interesting article by Joiner (1981) and from data in Ryan, Joiner, and Ryan (1976). The National Institute of Standards and Technology supplies standard materials of many varieties to manufacturers and other parties, who use these materials to calibrate their own testing equipment. Great pains are taken to make these reference materials as homogeneous as possible. In an experiment, a long homogeneous steel rod was cut into 4-inch lengths, 20 of which were randomly selected and tested for oxygen content. Two measurements were made on each piece. The 40 measurements were made over a period of 5 days, with eight measurements per day. In order to avoid possible bias from time-related trends, the sequence of measurements was randomized. The file steelrods contains the measurements. There is an unexpected systematic source of variability in these data. Can you find it by making an appropriate plot? Would this effect have been detectable if the measurements had not been randomized over time?

CHAPTER 12 The Analysis of Variance

12.1 Introduction

Chapter 11 was concerned with the analysis of data arising from experimental designs with two samples. Experiments frequently involve more than two samples; they may compare several treatments, such as different drugs, and perhaps other factors, such as sex, at the same time. This chapter is an introduction to the statistical analysis of such experiments. The methods we will discuss are called the analysis of variance. Contrary to what this phrase seems to imply, we will be primarily concerned with comparing the means of the data, not their variances. We will consider the two most elementary multisample designs: the one-way and two-way layouts. Methods based on the normal distribution and nonparametric methods will be developed.

12.2 The One-Way Layout

A one-way layout is an experimental design in which independent measurements are made under each of several treatments. The techniques we will introduce are thus generalizations of the techniques for comparing two independent samples that were covered in Chapter 11. In this section, we will use as an example data from Kirchhoefer (1979), who studied the measurement of chlorpheniramine maleate in tablets. Measurements of composites that had nominal dosages equal to 4 mg were made by seven laboratories, each laboratory making 10 measurements. The data are shown in the following table. There are two possible sources of variability in the data: variability within labs and variability between labs.
Lab 1   Lab 2   Lab 3   Lab 4   Lab 5   Lab 6   Lab 7
4.13    3.86    4.00    3.88    4.02    4.02    4.00
4.07    3.85    4.02    3.88    3.95    3.86    4.02
4.04    4.08    4.01    3.91    4.02    3.96    4.03
4.07    4.11    4.01    3.95    3.89    3.97    4.04
4.05    4.08    4.04    3.92    3.91    4.00    4.10
4.04    4.01    3.99    3.97    4.01    3.82    3.81
4.02    4.02    4.03    3.92    3.89    3.98    3.91
4.06    4.04    3.97    3.90    3.89    3.99    3.96
4.10    3.97    3.98    3.97    3.99    4.02    4.05
4.04    3.95    3.98    3.90    4.00    3.93    4.06

Figure 12.1, a boxplot of these data, shows some variation in the medians among the seven labs, as well as some variation in the interquartile ranges. It appears from the figure that there may be some systematic differences between the labs and that there is less variability in some labs than in others. We will discuss the following question: Are the differences in the means of the measurements from the various labs significant, or might they be due to chance?

FIGURE 12.1 Boxplots of determinations of amounts of chlorpheniramine maleate in tablets by seven laboratories.

12.2.1 Normal Theory; the F Test

We first discuss the analysis of variance and the F test in the case of I groups, each containing J samples. The I groups will be referred to generically as treatments, or levels. (In the preceding example, I = 7 and J = 10. We will discuss the case of unequal sample sizes later.)

We first define some notation and introduce the basic model. Let

Yij = the jth observation of the ith treatment

Our model is that the observations are corrupted by random errors and that the error in one observation is independent of the errors in the other observations. The statistical model is

Yij = μ + αi + εij

Here μ is the overall mean level, αi is the differential effect of the ith treatment, and εij is the random error in the jth observation under the ith treatment. The errors are assumed to be independent and normally distributed with mean zero and variance σ². The αi are normalized:

Σi αi = 0

where, here and below, sums over i run from 1 to I and sums over j run from 1 to J. The expected response to the ith treatment is E(Yij) = μ + αi. Thus, if αi = 0 for i = 1, ..., I, all treatments have the same expected response, and, in general, αi − αj is the difference between the expected responses under treatments i and j. We will derive a test of the null hypothesis that all the means are equal.

The analysis of variance is based on the following identity:

Σi Σj (Yij − Ȳ..)² = Σi Σj (Yij − Ȳi.)² + J Σi (Ȳi. − Ȳ..)²

where

Ȳi. = (1/J) Σj Yij

is the average of the observations under the ith treatment and

Ȳ.. = (1/IJ) Σi Σj Yij

is the overall average. The terms appearing in this identity are called sums of squares, and the identity may be expressed symbolically as

SSTOT = SSW + SSB

In words, the total sum of squares equals the sum of squares within groups plus the sum of squares between groups. The terminology reflects that SSW is a measure of the variation of the data within the treatment groups and that SSB is a measure of the variation of the treatment means between groups. To establish the identity, we expand the left-hand side:

Σi Σj (Yij − Ȳ..)² = Σi Σj [(Yij − Ȳi.) + (Ȳi. − Ȳ..)]²
                  = Σi Σj (Yij − Ȳi.)² + Σi Σj (Ȳi. − Ȳ..)² + 2 Σi (Ȳi. − Ȳ..) Σj (Yij − Ȳi.)

The last term vanishes because the sum of deviations from a mean is zero, and the middle term equals J Σi (Ȳi. − Ȳ..)² since the summand does not depend on j. As we will see, the basic idea underlying the analysis of variance is the comparison of the sizes of various sums of squares.
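It may be reassuring to verify the identity numerically. A minimal sketch (Python with NumPy; the data are simulated, with I, J, and the error standard deviation chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 7, 10                                  # treatments, observations each
y = 4.0 + rng.normal(0.0, 0.06, size=(I, J))  # simulated Y_ij

grand_mean = y.mean()            # Y-bar..
group_means = y.mean(axis=1)     # Y-bar_i.

ss_tot = ((y - grand_mean) ** 2).sum()
ss_w = ((y - group_means[:, None]) ** 2).sum()
ss_b = J * ((group_means - grand_mean) ** 2).sum()

# SSTOT = SSW + SSB, up to floating-point rounding
assert np.isclose(ss_tot, ss_w + ss_b)
```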
We can calculate the expected values of these sums of squares using the following lemma.

LEMMA A
Let Xi, i = 1, ..., n, be independent random variables with E(Xi) = μi and Var(Xi) = σ². Then

E[(Xi − X̄)²] = (μi − μ̄)² + ((n − 1)/n) σ²

where μ̄ = (1/n) Σi μi.

Proof
We use the fact that E(U²) = [E(U)]² + Var(U) for any random variable U with finite variance. The first term on the right-hand side of the equation in the lemma follows immediately. For the second term, we have to calculate Var(Xi − X̄):

Var(Xi − X̄) = Var(Xi) + Var(X̄) − 2 Cov(Xi, X̄)

and

Var(Xi) = σ²
Var(X̄) = σ²/n
Cov(Xi, X̄) = Cov(Xi, (1/n) Σj Xj) = σ²/n

(Here we have used Cov(Xi, Xj) = 0 if i ≠ j, since the X's are independent.) Putting these results together proves the lemma. ■

Lemma A may be applied to the sums of squares discussed before, yielding the following theorem.

THEOREM A
Under the assumptions for the model stated at the beginning of this section,

E(SSW) = Σi Σj E[(Yij − Ȳi.)²] = Σi Σj ((J − 1)/J) σ² = I(J − 1) σ²

Here we have used Lemma A, with the role of Xi being played by Yij and that of X̄ being played by Ȳi.; the second equality follows since E(Yij) = E(Ȳi.) = μ + αi. To find E(SSB), we again use the lemma, with Ȳi. and Ȳ.. in place of Xi and X̄:

E(SSB) = J Σi E[(Ȳi. − Ȳ..)²] = J Σi [αi² + ((I − 1)/(IJ)) σ²] = J Σi αi² + (I − 1) σ² ■

SSW may be used to estimate σ²; the estimate is

sp² = SSW / [I(J − 1)]

which is unbiased. The subscript p stands for pooled: estimates of σ² from the I treatments are pooled together, since SSW can be written as

SSW = Σi (J − 1) si²

where si² is the sample variance in the ith group.

If all the αi are equal to zero, then the expectation of SSB/(I − 1) is also σ². Thus, in this case, SSW/[I(J − 1)] and SSB/(I − 1) should be about equal. If some of the αi are nonzero, SSB will be inflated. We next develop a method of comparing the two sums of squares to obtain a test statistic for the null hypothesis that all the αi are zero. Under the assumption that the errors are normally distributed, the probability distributions of the sums of squares can be calculated.

THEOREM B
If the errors are independent and normally distributed with means 0 and variances σ², then SSW/σ² follows a chi-square distribution with I(J − 1) degrees of freedom. If, additionally, the αi are all equal to zero, then SSB/σ² follows a chi-square distribution with I − 1 degrees of freedom and is independent of SSW.

Proof
We first consider SSW. From Theorem B of Section 6.3,

(1/σ²) Σj (Yij − Ȳi.)²

follows a chi-square distribution with J − 1 degrees of freedom. There are I such sums in SSW, and they are independent of each other since the observations are independent. The sum of I independent chi-square random variables, each with J − 1 degrees of freedom, follows a chi-square distribution with I(J − 1) degrees of freedom. Theorem B of Section 6.3 can also be applied to SSB, noting that Var(Ȳi.) = σ²/J. We next prove that the two sums of squares are independent of each other. SSW is a function of the vector U, which has elements Yij − Ȳi., for i = 1, ..., I and j = 1, ..., J.
SSB is a function of the vector V, whose elements are the Ȳi., i = 1, ..., I, since Ȳ.. can be obtained from the Ȳi.. It is thus sufficient to show that these two vectors are independent of each other. First, if i ≠ i′, Yij − Ȳi. and Ȳi′. are independent, since they are functions of different observations. Second, Yij − Ȳi. and Ȳi. are independent by Theorem A of Section 6.3. This completes the proof of the theorem. ■

The statistic

F = [SSB/(I − 1)] / [SSW/(I(J − 1))]

is used to test the following null hypothesis:

H0: α1 = α2 = ··· = αI = 0

By Theorem A, the denominator of the F statistic has expected value σ², and the expectation of the numerator is J(I − 1)⁻¹ Σi αi² + σ². Thus, if the null hypothesis is true, the F statistic should be close to 1, whereas if it is false, the statistic should be larger. If the null hypothesis is false, the numerator reflects variation between the different groups as well as variation within groups, whereas the denominator reflects only variation within groups. The hypothesis is thus rejected for large values of F. As usual, in order to apply this test, we must know the null distribution of the test statistic.

THEOREM C
Under the assumption that the errors are normally distributed, the null distribution of F is the F distribution with I − 1 and I(J − 1) degrees of freedom.

Proof
The theorem follows from Theorem B and from the definition of the F distribution (Section 6.2), since, under H0, F is the ratio of two independent chi-square random variables, each divided by its degrees of freedom. ■

Percentage points of the F distribution are widely tabled. It can be shown that, under the normality assumption, the F test is equivalent to the likelihood ratio test.

EXAMPLE A
We can illustrate the use of the F statistic by applying it to the tablet data from Section 12.2. In doing so, we adopt an explicit statistical model for the variability seen in Figure 12.1. According to this model, there is an unknown mean level associated with each laboratory, and the deviations of the 10 measurements within a laboratory from this mean level are independent, normally distributed random variables. With the aid of this model, we will see whether it is plausible that the unknown laboratory means are all equal, so that the variability between labs displayed in Figure 12.1 is entirely due to chance. The sums of squares defined previously are calculated and presented in a table called the analysis of variance table:

Source   df     SS      MS       F
Labs      6    .125    .021    5.66
Error    63    .231    .0037
Total    69    .356

In the table, SSW is the sum of squares due to error, and SSB is the sum of squares due to labs. MS stands for mean square and equals the sum of squares divided by the degrees of freedom. The column headed F gives the F statistic for testing the null hypothesis that there is no systematic difference among the seven labs. The F statistic has 6 and 63 df and a value of 5.66. This particular combination of degrees of freedom is not included in Table 5 of Appendix B, but upon examining the entries for 6 and 60 df, it is clear that the p-value is less than .01. We may thus conclude that the means of the measurements from the various labs are significantly different.

Figure 12.2 is a normal probability plot of the residuals from the analysis of variance model (the residuals are formed by simply subtracting from the measurements of each lab the mean value for that lab). There is some indication of deviation from normality in the lower tail of the distribution, but the data do not appear grossly nonnormal. ■

FIGURE 12.2 Normal probability plot of residuals from one-way analysis of variance of tablet data.
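Computations like those in the table above are easily mechanized. The following sketch (Python with NumPy and SciPy; the function and variable names are ours) forms the F statistic and its p-value from the sums of squares; for real data, scipy.stats.f_oneway performs the same test directly.

```python
import numpy as np
from scipy import stats

def one_way_anova(groups):
    """F test of equal group means from a list of 1-D arrays."""
    all_y = np.concatenate(groups)
    grand_mean = all_y.mean()
    ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b = len(groups) - 1               # I - 1
    df_w = len(all_y) - len(groups)      # sum of (J_i - 1)
    f = (ss_b / df_b) / (ss_w / df_w)
    p = stats.f.sf(f, df_b, df_w)        # upper-tail probability
    return f, p

# Illustration with simulated data: seven groups of ten observations
# sharing a common mean, so the null hypothesis holds.
rng = np.random.default_rng(0)
groups = [rng.normal(4.0, 0.06, size=10) for _ in range(7)]
print(one_way_anova(groups))
```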
We now outline the procedure for the case in which the numbers of observations under the various treatments are not necessarily equal. The only difficulties with this case are algebraic; conceptually, the analysis is the same as for the case of equal sample sizes. Suppose that there are $J_i$ observations under treatment $i$, for $i = 1, \ldots, I$. The basic identity still holds; that is, we have
$$\sum_{i=1}^{I}\sum_{j=1}^{J_i}(Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{I}\sum_{j=1}^{J_i}(Y_{ij} - \bar{Y}_{i\cdot})^2 + \sum_{i=1}^{I}J_i(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$$
By reasoning similar to that used for the simple case, it can be shown that
$$E(SS_W) = \sigma^2\sum_{i=1}^{I}(J_i - 1)$$
$$E(SS_B) = (I-1)\sigma^2 + \sum_{i=1}^{I}J_i\alpha_i^2$$
The degrees of freedom for these sums of squares are $\sum_{i=1}^{I}J_i - I$ and $I-1$, respectively. It may be argued, as in the proof of Theorem B, that the normalized sums of squares follow chi-square distributions and that the ratio of mean squares follows an $F$ distribution under the null hypothesis of no treatment differences.

To conclude this section, let us review the basic assumptions of the model and comment on their importance. The model is
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
We assume the following:
1. The $\varepsilon_{ij}$ are normally distributed. The $F$ test, like the $t$ test, remains approximately valid for moderate to large samples from moderately nonnormal distributions.
2. The error variance, $\sigma^2$, is constant. In many applications, the error variances may be different in different groups. For example, Figure 12.1 suggests that some labs may be more precise in their measurements than others. Fortunately, if there are an equal number of observations in each group, the $F$ test is not strongly affected.
3. The $\varepsilon_{ij}$ are independent. This assumption is very important, both for normal theory and for the nonparametric analysis we will present later.

12.2.2 The Problem of Multiple Comparisons

The application of the $F$ test in Example A in Section 12.2.1 has an anticlimactic character. We concluded that the means of measurements from different labs are not all equal, but the test gives no information about how they differ, in particular about which pairs are significantly different. In many applications, the null hypothesis is a "straw man" that is not seriously entertained. Real interest may be focused on comparing pairs or groups of treatments and estimating the treatment means and their differences. A naive approach would be to compare all pairs of treatment means using $t$ tests. The difficulty with such a procedure was pointed out in the section on experimental design in Chapter 11: Although each individual comparison would have a type I error rate of $\alpha$, the collection of all comparisons considered simultaneously would not. In this section, we discuss two solutions to this problem—Tukey's method and the Bonferroni method. More discussion can be found in Miller (1981).

12.2.2.1 Tukey's Method
Tukey's method is used to construct confidence intervals for the differences of all pairs of means in such a way that the intervals simultaneously have a set coverage probability. The duality of confidence intervals and tests can then be used to determine which particular pairs are significantly different.
If the sample sizes are all equal and the errors are normally distributed with a constant variance, the centered sample means, $\bar{Y}_{i\cdot} - \mu_i$, are independent and normally distributed with means 0 and variances $\sigma^2/J$, which may be estimated by $s_p^2/J$. Tukey's method is based on the probability distribution of the random variable
$$\max_{i_1, i_2}\frac{|(\bar{Y}_{i_1\cdot} - \mu_{i_1}) - (\bar{Y}_{i_2\cdot} - \mu_{i_2})|}{s_p/\sqrt{J}}$$
where the maximum is taken over all pairs $i_1$ and $i_2$. This distribution is called the studentized range distribution with parameters $I$ (the number of samples being compared) and $I(J-1)$ (the degrees of freedom in $s_p$). The upper $100\alpha$ percentage point of the distribution is denoted by $q_{I,\,I(J-1)}(\alpha)$. Now,
$$P\left(|(\bar{Y}_{i_1\cdot} - \mu_{i_1}) - (\bar{Y}_{i_2\cdot} - \mu_{i_2})| \le q_{I,\,I(J-1)}(\alpha)\frac{s_p}{\sqrt{J}}, \text{ for all } i_1 \text{ and } i_2\right)$$
$$= P\left(\max_{i_1, i_2}|(\bar{Y}_{i_1\cdot} - \mu_{i_1}) - (\bar{Y}_{i_2\cdot} - \mu_{i_2})| \le q_{I,\,I(J-1)}(\alpha)\frac{s_p}{\sqrt{J}}\right)$$
By definition, this latter probability equals $1 - \alpha$. The idea is that all the differences are less than some number if and only if the largest difference is. The above probability statement can be converted directly into a set of confidence intervals that hold simultaneously for all differences $\mu_{i_1} - \mu_{i_2}$ with confidence level $100(1-\alpha)\%$. The intervals are
$$(\bar{Y}_{i_1\cdot} - \bar{Y}_{i_2\cdot}) \pm q_{I,\,I(J-1)}(\alpha)\frac{s_p}{\sqrt{J}}$$
By the duality of confidence intervals and hypothesis tests, if the $100(1-\alpha)\%$ confidence interval for $\mu_{i_1} - \mu_{i_2}$ does not include zero—that is, if
$$|\bar{Y}_{i_1\cdot} - \bar{Y}_{i_2\cdot}| > q_{I,\,I(J-1)}(\alpha)\frac{s_p}{\sqrt{J}}$$
the null hypothesis that there is no difference between $\mu_{i_1}$ and $\mu_{i_2}$ may be rejected at level $\alpha$. Also, all such hypothesis tests considered collectively have level $\alpha$.

EXAMPLE A
We can illustrate Tukey's method by applying it to the tablet data of Section 12.2. We list the labs in decreasing order of the mean of their measurements:

Lab    Mean
1      4.062
3      4.003
7      3.998
2      3.997
5      3.957
6      3.955
4      3.920

$s_p$ is the square root of the mean square for error in the analysis of variance table of Example A of Section 12.2.1: $s_p = .06$. The appropriate studentized range distribution has parameters 7 and 63. Using 7 and 60 df in Table 6 of Appendix B as an approximation, $q_{7,60}(.05) = 4.31$, two of the means in the preceding table are significantly different at the .05 level if they differ by more than
$$q_{7,63}(.05)\frac{s_p}{\sqrt{J}} = .082$$
The mean from lab 1 is thus significantly different from those from labs 4, 5, and 6; the mean of lab 3 is significantly greater than that of lab 4. No other comparisons are significant at the .05 level. At the 95% confidence level, the other differences in mean level that are seen in Figure 12.1 cannot be judged to be significantly different from zero. Although differences between these labs must certainly exist, we cannot reliably establish the signs of the differences. It is interesting to note that a price is paid here for performing multiple comparisons simultaneously. If separate $t$ tests had been conducted using the pooled sample variance, labs would have been declared significantly different if their means had differed by more than
$$t_{63}(.025)\,s_p\sqrt{\frac{2}{J}} = .053 \qquad ■$$
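Recent versions of SciPy expose the studentized range distribution, so the half-widths in this example can be computed without the 60-df table approximation. A sketch (Python; $s_p$ and the layout dimensions are taken from the example above):

```python
import numpy as np
from scipy.stats import studentized_range, t

I, J = 7, 10
df = I * (J - 1)                  # 63 degrees of freedom in s_p
sp = 0.06                         # pooled standard deviation

# Upper 5% point of the studentized range with parameters I and I(J-1)
q = studentized_range.ppf(0.95, I, df)   # about 4.3 (tabled value 4.31 at 60 df)
print(q * sp / np.sqrt(J))               # Tukey half-width, about .082

# For contrast: a per-comparison t interval, ignoring multiplicity
print(t.ppf(0.975, df) * sp * np.sqrt(2 / J))   # about .053
```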
12.2.2.2 The Bonferroni Method
The Bonferroni method was briefly introduced in Section 11.4.8. The idea is very simple. If $k$ null hypotheses are to be tested, a desired overall type I error rate of at most $\alpha$ can be guaranteed by testing each null hypothesis at level $\alpha/k$. Equivalently, if $k$ confidence intervals are each formed to have confidence level $100(1 - \alpha/k)\%$, they all hold simultaneously with confidence level at least $100(1-\alpha)\%$. The method is simple and versatile and, although crude, gives surprisingly good results if $k$ is not too large.

EXAMPLE A
To apply the Bonferroni method to the data on tablets, we note that there are $k = \binom{7}{2} = 21$ pairwise comparisons among the seven labs. A set of simultaneous 95% confidence intervals for the pairwise comparisons is
$$(\bar{Y}_{i_1\cdot} - \bar{Y}_{i_2\cdot}) \pm \frac{s_p\,t_{63}(.025/21)}{\sqrt{5}}$$
Special tables for such values of the $t$ distribution have been prepared; from Table 7 of Appendix B, we find
$$t_{60}\left(\frac{.025}{20}\right) = 3.16$$
which we will use as an approximation to $t_{63}(.025/21)$, giving confidence intervals
$$(\bar{Y}_{i_1\cdot} - \bar{Y}_{i_2\cdot}) \pm .085$$
Given the crude nature of the Bonferroni method, these are surprisingly close to the intervals produced by Tukey's method, which have a half-width of .082. Here, too, we conclude that lab 1 produced significantly higher measurements than those of labs 4, 5, and 6. ■

A significant advantage of the Bonferroni method over Tukey's method is that it does not require equal sample sizes in each treatment.

12.2.3 A Nonparametric Method—The Kruskal-Wallis Test

The Kruskal-Wallis test is a generalization of the Mann-Whitney test that is conceptually quite simple. The observations are assumed to be independent, but no particular distributional form, such as the normal, is assumed. The observations are pooled together and ranked. Let
$$R_{ij} = \text{the rank of } Y_{ij} \text{ in the combined sample}$$
Let
$$\bar{R}_{i\cdot} = \frac{1}{J_i}\sum_{j=1}^{J_i}R_{ij}$$
be the average rank in the $i$th group. Let
$$\bar{R}_{\cdot\cdot} = \frac{1}{N}\sum_{i=1}^{I}\sum_{j=1}^{J_i}R_{ij} = \frac{N+1}{2}$$
where $N$ is the total number of observations. As in the analysis of variance, let
$$SS_B = \sum_{i=1}^{I}J_i(\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot})^2$$
be a measure of the dispersion of the $\bar{R}_{i\cdot}$. $SS_B$ may be used to test the null hypothesis that the probability distributions generating the observations under the various treatments are identical. The larger $SS_B$ is, the stronger is the evidence against the null hypothesis. The exact null distribution of this statistic for various combinations of $I$ and $J_i$ can be enumerated, as for the Mann-Whitney test. The null distribution is commonly available in computer packages. Tables are given in Lehmann (1975) and in references therein. For $I = 3$ and $J_i \ge 5$, or $I > 3$ and $J_i \ge 4$, a chi-square approximation to a normalized version of $SS_B$ is fairly accurate. Under the null hypothesis that the probability distributions of the $I$ groups are identical, the statistic
$$K = \frac{12}{N(N+1)}SS_B$$
is approximately distributed as a chi-square random variable with $I-1$ degrees of freedom. The value of $K$ can be found by running the ranks through an analysis of variance program and multiplying $SS_B$ by $12/[N(N+1)]$. It can be shown that $K$ can also be expressed as
$$K = \frac{12}{N(N+1)}\sum_{i=1}^{I}J_i\bar{R}_{i\cdot}^2 - 3(N+1)$$
which is easier to compute by hand.

EXAMPLE A
For the data on the tablets, $K = 29.51$. Referring to Table 3 of Appendix B with 6 df, we see that the p-value is less than .005. The nonparametric analysis, too, indicates that there is a systematic difference among the labs. ■

Multiple comparison procedures for nonparametric methods are discussed in detail in Miller (1981). The Bonferroni method requires no special discussion; it can be applied to all comparisons tested by Mann-Whitney tests. Like the Mann-Whitney test, the Kruskal-Wallis test makes no assumption of normality and thus has a wider range of applicability than does the $F$ test. It is especially useful in small-sample situations. Also, because data are replaced by their ranks, outliers will have less influence on this nonparametric test than on the $F$ test. In some applications, the data consist of ranks—for example, in a wine tasting, judges usually rank the wines—which makes the use of the Kruskal-Wallis test natural.
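scipy.stats computes the Kruskal-Wallis statistic directly; the sketch below uses three small invented samples, since the individual tablet measurements are not reproduced here, and also checks the hand formula for $K$ given above.

```python
import numpy as np
from scipy.stats import kruskal, chi2

# Invented samples standing in for three treatment groups (J_i = 5 each)
g1 = [4.06, 4.05, 4.07, 4.04, 4.08]
g2 = [3.99, 4.00, 3.98, 4.01, 4.02]
g3 = [3.92, 3.94, 3.93, 3.95, 3.96]

K, p = kruskal(g1, g2, g3)
print(K, p)                             # chi-square approximation with I-1 df

# Hand computation: K = 12/(N(N+1)) * sum J_i * Rbar_i^2 - 3(N+1)
data = np.concatenate([g1, g2, g3])
ranks = data.argsort().argsort() + 1.0  # ranks 1..N (no ties in these data)
rbars = ranks.reshape(3, 5).mean(axis=1)
N = data.size
K_hand = 12 / (N * (N + 1)) * (5 * rbars**2).sum() - 3 * (N + 1)
print(K_hand, chi2.sf(K_hand, df=2))    # agrees with kruskal when there are no ties
```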
12.3 The Two-Way Layout

A two-way layout is an experimental design involving two factors, each at two or more levels. The levels of one factor might be various drugs, for example, and the levels of the other factor might be genders. If there are $I$ levels of one factor and $J$ of the other, there are $I \times J$ combinations. We will assume that $K$ independent observations are taken for each of these combinations. (The last section of this chapter will outline the advantages of such an experimental design.) The next section defines the parameters that we might want to estimate from a two-way layout. Later sections present statistical methods based on normal theory and nonparametric methods.

12.3.1 Additive Parametrization

To develop and illustrate the ideas in this section, we will use a portion of the data contained in a study of electric range energy consumption (Fechter and Porter 1978). The following table shows the mean number of kilowatt-hours used by three electric ranges in cooking on each of three menu days (means are over several cooks).

Menu Day    Range 1    Range 2    Range 3
1             3.97       4.24       4.44
2             2.39       2.61       2.82
3             2.76       2.75       3.01

We wish to describe the variation in the numbers in this table in terms of the effects of different ranges and different menu days. Denoting the number in the $i$th row and $j$th column by $Y_{ij}$, we first calculate a grand average
$$\hat{\mu} = \bar{Y}_{\cdot\cdot} = \frac{1}{9}\sum_{i=1}^{3}\sum_{j=1}^{3}Y_{ij} = 3.22$$
This gives a measure of typical energy consumption per menu day. The menu day means, averaged over the ranges, are
$$\bar{Y}_{1\cdot} = 4.22 \qquad \bar{Y}_{2\cdot} = 2.61 \qquad \bar{Y}_{3\cdot} = 2.84$$
We will define the differential effect of a menu day as the difference between the mean for that day and the overall mean; we will denote these differential effects by $\hat{\alpha}_i$, where $i = 1, 2,$ or 3:
$$\hat{\alpha}_1 = \bar{Y}_{1\cdot} - \bar{Y}_{\cdot\cdot} = 1.00$$
$$\hat{\alpha}_2 = \bar{Y}_{2\cdot} - \bar{Y}_{\cdot\cdot} = -.61$$
$$\hat{\alpha}_3 = \bar{Y}_{3\cdot} - \bar{Y}_{\cdot\cdot} = -.38$$
(Note that, except for rounding error, the $\hat{\alpha}_i$ would sum to zero.) In words, on menu day 1, 1 kWh more than the average is consumed, and so on. The range means, averaged over the menu days, are
$$\bar{Y}_{\cdot 1} = 3.04 \qquad \bar{Y}_{\cdot 2} = 3.20 \qquad \bar{Y}_{\cdot 3} = 3.42$$
The differential effects of the ranges are
$$\hat{\beta}_1 = \bar{Y}_{\cdot 1} - \bar{Y}_{\cdot\cdot} = -.18$$
$$\hat{\beta}_2 = \bar{Y}_{\cdot 2} - \bar{Y}_{\cdot\cdot} = -.02$$
$$\hat{\beta}_3 = \bar{Y}_{\cdot 3} - \bar{Y}_{\cdot\cdot} = .20$$
The effects of the ranges are smaller than the effects of the menu days. The preceding description of the values in the table incorporates an overall average level plus differential effects of ranges and menu days. This is a simple additive model:
$$\hat{Y}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j$$
Here we use $\hat{Y}_{ij}$ to denote the fitted or predicted values of $Y_{ij}$ from the additive model. According to this additive model, the differences between the three ranges are the same on all menu days. For example, for $i = 1, 2, 3$,
$$\hat{Y}_{i1} - \hat{Y}_{i2} = (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_1) - (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_2) = \hat{\beta}_1 - \hat{\beta}_2$$
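The whole decomposition is a few lines of array arithmetic. A sketch (Python with NumPy, using the energy-consumption table above; delta_hat anticipates the interaction terms discussed next):

```python
import numpy as np

# Rows are menu days 1-3, columns are ranges 1-3 (kWh)
Y = np.array([[3.97, 4.24, 4.44],
              [2.39, 2.61, 2.82],
              [2.76, 2.75, 3.01]])

mu_hat = Y.mean()                      # grand average, about 3.22
alpha_hat = Y.mean(axis=1) - mu_hat    # menu-day effects: about 1.00, -.61, -.38
beta_hat = Y.mean(axis=0) - mu_hat     # range effects: about -.18, -.02, .20

fitted = mu_hat + alpha_hat[:, None] + beta_hat[None, :]
delta_hat = Y - fitted                 # residuals from the additive model

print(alpha_hat, beta_hat)
print(delta_hat.round(2))              # rows and columns sum to zero exactly here
```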
Figure 12.3 shows that this is not quite the case. If the differences were exactly the same on all menu days, the three lines would be exactly parallel. The differences between menu days 1 and 2 appear nearly the same—the lines are nearly parallel. But on menu day 3, the difference between ranges 2 and 3 increased and the difference between ranges 1 and 2 decreased.

[Figure 12.3: Plot of energy consumption versus menu day for three electric ranges. The dashed line corresponds to range 3, the dotted line to range 2, and the solid line to range 1.]

This phenomenon is called an interaction between menu days and ranges—it is as if there were something about menu day 3 that especially adversely affected the energy consumption of range 1 relative to range 2. The differences of the observed values and the fitted values, $Y_{ij} - \hat{Y}_{ij}$, are the residuals from the additive model and are shown in the following table:

Menu Day    Range 1    Range 2    Range 3
1            -.07        .04        .02
2            -.04        .02        .01
3             .10       -.07       -.03

The residuals are small relative to the main effects, with the possible exception of those for menu day 3. Interactions can be incorporated into the model to make it fit the data exactly. The residual in cell $ij$ is
$$Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j = Y_{ij} - \bar{Y}_{\cdot\cdot} - (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) - (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) = Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot} = \hat{\delta}_{ij}$$
Note that
$$\sum_{i=1}^{3}\hat{\delta}_{ij} = \sum_{j=1}^{3}\hat{\delta}_{ij} = 0$$
For example,
$$\sum_{i=1}^{3}\hat{\delta}_{ij} = \sum_{i=1}^{3}(Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}) = 3\bar{Y}_{\cdot j} - 3\bar{Y}_{\cdot\cdot} - 3\bar{Y}_{\cdot j} + 3\bar{Y}_{\cdot\cdot} = 0$$
In the preceding table of residuals, the row and column sums are not exactly zero because of rounding errors. The model
$$Y_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \hat{\delta}_{ij}$$
thus fits the data exactly; it is merely another way of expressing the numbers listed in the table. An additive model is simple and easy to interpret, especially in the absence of interactions. Transformations of the data are sometimes used to improve the adequacy of an additive model. The logarithmic transformation, for example, converts a multiplicative model into an additive one. Transformations are also used to stabilize the variance (to make the variance independent of the mean) and to make normal theory more applicable. There is no guarantee, of course, that a given transformation will accomplish all these aims. The discussion in this section has centered on the parametrization and interpretation of the additive model as used in the analysis of variance. We have not taken into account the possibility of random errors and their effects on the inferences about the parameters, but will do so in the next section.

12.3.2 Normal Theory for the Two-Way Layout

In this section, we will assume that there are $K > 1$ observations per cell in a two-way layout. A design with an equal number of observations per cell is called balanced. Let $Y_{ijk}$ denote the $k$th observation in cell $ij$; the statistical model is
$$Y_{ijk} = \mu + \alpha_i + \beta_j + \delta_{ij} + \varepsilon_{ijk}$$
We will assume that the random errors, $\varepsilon_{ijk}$, are independent and normally distributed with mean zero and common variance $\sigma^2$. Thus, $E(Y_{ijk}) = \mu + \alpha_i + \beta_j + \delta_{ij}$. The parameters satisfy the following constraints:
$$\sum_{i=1}^{I}\alpha_i = 0 \qquad \sum_{j=1}^{J}\beta_j = 0 \qquad \sum_{i=1}^{I}\delta_{ij} = \sum_{j=1}^{J}\delta_{ij} = 0$$
We now find the mle's of the unknown parameters. Since the observations in cell $ij$ are normally distributed with mean $\mu + \alpha_i + \beta_j + \delta_{ij}$ and variance $\sigma^2$, and since all the observations are independent, the log likelihood is
$$l = -\frac{IJK}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}(Y_{ijk} - \mu - \alpha_i - \beta_j - \delta_{ij})^2$$
Maximizing the likelihood subject to the constraints given above yields the following estimates (see Problem 17 at the end of this chapter):
$$\hat{\mu} = \bar{Y}_{\cdot\cdot\cdot}$$
$$\hat{\alpha}_i = \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}, \quad i = 1, \ldots, I$$
$$\hat{\beta}_j = \bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}, \quad j = 1, \ldots, J$$
$$\hat{\delta}_{ij} = \bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot}$$
as is expected from the discussion in Section 12.3.1. Like one-way analysis of variance, two-way analysis of variance is conducted by comparing various sums of squares. The sums of squares are as follows:
$$SS_A = JK\sum_{i=1}^{I}(\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2$$
$$SS_B = IK\sum_{j=1}^{J}(\bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2$$
$$SS_{AB} = K\sum_{i=1}^{I}\sum_{j=1}^{J}(\bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot})^2$$
$$SS_E = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}(Y_{ijk} - \bar{Y}_{ij\cdot})^2$$
$$SS_{TOT} = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}(Y_{ijk} - \bar{Y}_{\cdot\cdot\cdot})^2$$
The sums of squares satisfy this algebraic identity:
$$SS_{TOT} = SS_A + SS_B + SS_{AB} + SS_E$$
This identity may be proved by writing
$$Y_{ijk} - \bar{Y}_{\cdot\cdot\cdot} = (Y_{ijk} - \bar{Y}_{ij\cdot}) + (\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}) + (\bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}) + (\bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot})$$
and then squaring both sides, summing, and verifying that the cross products vanish. The following theorem gives the expectations of these sums of squares.

THEOREM A
Under the assumption that the errors are independent with mean zero and variance $\sigma^2$,
$$E(SS_A) = (I-1)\sigma^2 + JK\sum_{i=1}^{I}\alpha_i^2$$
$$E(SS_B) = (J-1)\sigma^2 + IK\sum_{j=1}^{J}\beta_j^2$$
$$E(SS_{AB}) = (I-1)(J-1)\sigma^2 + K\sum_{i=1}^{I}\sum_{j=1}^{J}\delta_{ij}^2$$
$$E(SS_E) = IJ(K-1)\sigma^2$$

Proof
The results for $SS_A$, $SS_B$, and $SS_E$ follow from Lemma A of Section 12.2.1. Applying the lemma to $SS_{TOT}$, we have
$$E(SS_{TOT}) = E\left[\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}(Y_{ijk} - \bar{Y}_{\cdot\cdot\cdot})^2\right] = (IJK - 1)\sigma^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}(\alpha_i + \beta_j + \delta_{ij})^2$$
$$= (IJK - 1)\sigma^2 + JK\sum_{i=1}^{I}\alpha_i^2 + IK\sum_{j=1}^{J}\beta_j^2 + K\sum_{i=1}^{I}\sum_{j=1}^{J}\delta_{ij}^2$$
In the last step, we used the constraints on the parameters. For example, the cross product involving $\alpha_i$ and $\beta_j$ is
$$\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\alpha_i\beta_j = K\sum_{i=1}^{I}\alpha_i\sum_{j=1}^{J}\beta_j = 0$$
The desired expression for $E(SS_{AB})$ now follows, since
$$E(SS_{TOT}) = E(SS_A) + E(SS_B) + E(SS_{AB}) + E(SS_E) \qquad ■$$

The distributions of these sums of squares are given by the following theorem.

THEOREM B
Assume that the errors are independent and normally distributed with means zero and variances $\sigma^2$. Then
a. $SS_E/\sigma^2$ follows a chi-square distribution with $IJ(K-1)$ degrees of freedom.
b. Under the null hypothesis $H_A$: $\alpha_i = 0$, $i = 1, \ldots, I$, $SS_A/\sigma^2$ follows a chi-square distribution with $I-1$ degrees of freedom.
c. Under the null hypothesis $H_B$: $\beta_j = 0$, $j = 1, \ldots, J$, $SS_B/\sigma^2$ follows a chi-square distribution with $J-1$ degrees of freedom.
d. Under the null hypothesis $H_{AB}$: $\delta_{ij} = 0$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, $SS_{AB}/\sigma^2$ follows a chi-square distribution with $(I-1)(J-1)$ degrees of freedom.
e. The sums of squares are independently distributed.

Proof
We will not give a full proof of this theorem. The results for $SS_A$, $SS_B$, and $SS_E$ follow from arguments similar to those used in proving Theorem B of Section 12.2.1. The result for $SS_{AB}$ requires some additional argument. ■

$F$ tests of the various null hypotheses are conducted by comparing the appropriate sums of squares to the sum of squares for error, as was done for the simpler case of the one-way layout. The mean squares are the sums of squares divided by their degrees of freedom, and the $F$ statistics are ratios of mean squares. When such a ratio is substantially larger than 1, the presence of an effect is suggested. Note, for example, that from Theorem A, $E(MS_A) = \sigma^2 + (JK/(I-1))\sum_i\alpha_i^2$ and that $E(MS_E) = \sigma^2$. So if the ratio $MS_A/MS_E$ is large, it suggests that some of the $\alpha_i$ are nonzero. The null distribution of this $F$ statistic is the $F$ distribution with $(I-1)$ and $IJ(K-1)$ degrees of freedom, and knowing this null distribution allows us to assess the significance of the ratio.
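The five sums of squares translate directly into array operations. A minimal sketch (Python with NumPy) for a hypothetical response array Y of shape (I, J, K), with a check of the algebraic identity:

```python
import numpy as np

def twoway_ss(Y):
    """Sums of squares for a balanced two-way layout; Y has shape (I, J, K)."""
    I, J, K = Y.shape
    cell = Y.mean(axis=2)                # Ybar_ij.
    row = Y.mean(axis=(1, 2))            # Ybar_i..
    col = Y.mean(axis=(0, 2))            # Ybar_.j.
    grand = Y.mean()                     # Ybar_...
    ssa = J * K * ((row - grand) ** 2).sum()
    ssb = I * K * ((col - grand) ** 2).sum()
    ssab = K * ((cell - row[:, None] - col[None, :] + grand) ** 2).sum()
    sse = ((Y - cell[:, :, None]) ** 2).sum()
    sstot = ((Y - grand) ** 2).sum()
    assert np.isclose(sstot, ssa + ssb + ssab + sse)   # the identity above
    return ssa, ssb, ssab, sse

# Pure-noise example: each mean square should then be near sigma^2 = 1
rng = np.random.default_rng(1)
print(twoway_ss(rng.standard_normal((2, 3, 18))))
```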
EXAMPLE A
As an example, we return to the experiment on iron retention discussed in Section 11.2.1.1. In the complete experiment, there were $I = 2$ forms of iron, $J = 3$ dosage levels, and $K = 18$ observations per cell. In Section 11.2.1.1, we discussed a logarithmic transformation of the data to make it more nearly normal and to stabilize the variance. Figure 12.4 shows boxplots of the data on the original scale; boxplots of the log data are given in Figure 12.5. The distribution of the log data is more symmetrical, and the interquartile ranges are less variable.

[Figure 12.4: Boxplots of iron retention for two forms of iron at three dosage levels.]
[Figure 12.5: Boxplots of log data on iron retention.]

Figure 12.6 is a plot of cell standard deviations versus cell means for the untransformed data; it shows that the error variance increases with the mean. Figure 12.7 is a plot of cell standard deviations versus means for the log data; it shows that the transformation is successful in stabilizing the variance. Note that one of the assumptions of Theorem B is that the errors have equal variance.

[Figure 12.6: Plot of cell standard deviations versus cell means for iron retention data.]
[Figure 12.7: Plot of cell standard deviations versus cell means for log data on iron retention.]

Figure 12.8 is a plot of the cell means of the transformed data versus the dosage levels for the two forms of iron. It suggests that Fe²⁺ may be retained more than Fe³⁺. If there is no interaction, the two curves should be parallel except for random variation. This appears to be roughly the case, although there is a hint that the difference in retention of the two forms of iron increases with dosage level. To check this, we will perform a quantitative test for interaction.

[Figure 12.8: Plot of cell means of log data versus dosage level. The dashed line corresponds to Fe²⁺ and the solid line to Fe³⁺.]

In the following analysis of variance table, $SS_A$ is the sum of squares due to the form of iron, $SS_B$ is the sum of squares due to dosage, and $SS_{AB}$ is the sum of squares due to interaction. The $F$ statistics were found by dividing the appropriate mean square by the mean square for error.

Analysis of Variance Table
Source         df      SS       MS       F
Iron form       1     2.074    2.074    5.99
Dosage          2    15.588    7.794   22.53
Interaction     2      .810     .405    1.17
Error         102    35.296     .346
Total         107    53.768

To test the effect of the form of iron, we test
$$H_A: \alpha_1 = \alpha_2 = 0$$
using the statistic
$$F = \frac{SS_{IRON}/1}{SS_E/102} = 5.99$$
From computer evaluation of the $F$ distribution with 1 and 102 df, the p-value is less than .025. There is an effect due to the form of iron. An estimate of the difference $\alpha_1 - \alpha_2$ is
$$\bar{Y}_{1\cdot\cdot} - \bar{Y}_{2\cdot\cdot} = .28$$
and a confidence interval for the difference may be obtained by noting that $\bar{Y}_{1\cdot\cdot}$ and $\bar{Y}_{2\cdot\cdot}$ are uncorrelated, since they are averages over different observations, and that
$$\mathrm{Var}(\bar{Y}_{1\cdot\cdot}) = \mathrm{Var}(\bar{Y}_{2\cdot\cdot}) = \frac{\sigma^2}{JK}$$
Thus,
$$\mathrm{Var}(\bar{Y}_{1\cdot\cdot} - \bar{Y}_{2\cdot\cdot}) = \frac{2\sigma^2}{JK}$$
Estimating $\sigma^2$ by the mean square for error, $\mathrm{Var}(\bar{Y}_{1\cdot\cdot} - \bar{Y}_{2\cdot\cdot})$ is estimated by
$$s^2_{\bar{Y}_{1\cdot\cdot} - \bar{Y}_{2\cdot\cdot}} = \frac{2 \times .346}{54} = .0128$$
A confidence interval can be constructed using the $t$ distribution with $IJ(K-1)$ degrees of freedom. The interval is of the form
$$(\bar{Y}_{1\cdot\cdot} - \bar{Y}_{2\cdot\cdot}) \pm t_{IJ(K-1)}(\alpha/2)\,s_{\bar{Y}_{1\cdot\cdot} - \bar{Y}_{2\cdot\cdot}}$$
There are 102 df; to form a 95% confidence interval we use $t_{120}(.025) = 1.98$ from Table 4 of Appendix B as an approximation, producing the interval $.28 \pm 1.98\sqrt{.0128}$, or (.06, .50). Recall that we are working on a log scale. The additive effect of .28 on the log scale corresponds to a multiplicative effect of $e^{.28} = 1.32$ on a linear scale, and the interval (.06, .50) transforms to $(e^{.06}, e^{.50})$, or (1.06, 1.65). Thus, we estimate that Fe²⁺ increases retention by a factor of 1.32, and the uncertainty in this factor is expressed in the confidence interval (1.06, 1.65). The $F$ statistic for testing the effect of dosage is significant, but this effect is expected and is not of major interest. To test the hypothesis $H_{AB}$, which states that there is no interaction, we consider the following $F$ statistic:
$$F = \frac{SS_{AB}/[(I-1)(J-1)]}{SS_E/[IJ(K-1)]} = 1.17$$
From computer evaluation of the $F$ distribution with 2 and 102 df, the p-value is .31, so there is insufficient evidence to reject this hypothesis. Thus, the deviation of the lines of Figure 12.8 from parallelism could easily be due to chance. In conclusion, it appears that the ratio of percentage retained for the two forms of iron lies between 1.06 and 1.65 (a difference of 6–65%) and that there is little evidence that this difference depends on dosage. ■
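The interval arithmetic, including the back-transformation to the original scale, is easily reproduced (Python with SciPy; the numbers are those of the example above):

```python
import numpy as np
from scipy.stats import t

diff = 0.28                      # Ybar_1.. - Ybar_2.. on the log scale
mse, J, K = 0.346, 3, 18
se = np.sqrt(2 * mse / (J * K))  # sqrt(.0128), about .113

half = t.ppf(0.975, 102) * se    # exact 102-df quantile instead of the 120-df table value
lo, hi = diff - half, diff + half
print(lo, hi)                    # about (.06, .50) on the log scale
print(np.exp(diff), np.exp(lo), np.exp(hi))   # about 1.32 and (1.06, 1.65)
```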
12.3.3 Randomized Block Designs

Randomized block designs originated in agricultural experiments. To compare the effects of $I$ different fertilizers, $J$ relatively homogeneous plots of land, or blocks, are selected, and each is divided into $I$ plots. Within each block, the assignment of fertilizers to plots is made at random. By comparing fertilizers within blocks, the variability between blocks, which would otherwise contribute "noise" to the results, is controlled. This design is a multisample generalization of a matched-pairs design. A randomized block design might be used by a nutritionist who wants to compare the effects of three different diets on experimental animals. To control for genetic variation in the animals, the nutritionist might select three animals from each of several litters and randomly determine their assignments to the diets. Randomized block designs are used in many areas. If an experiment is to be carried out over a substantial period of time, the blocks may consist of stretches of time. In industrial experiments, the blocks are often batches of raw material. Randomization helps ensure against unintentional bias and can form a basis for inference. In principle, the null distribution of a test statistic can be derived by permutation arguments, just as we derived the null distribution of the Mann-Whitney test statistic in Section 11.2.3. Parametric procedures often give a good approximation to the permutation distribution. As a model for the responses in the randomized block design, we will use
$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$
where $\alpha_i$ is the differential effect of the $i$th treatment, $\beta_j$ is the differential effect of the $j$th block, and the $\varepsilon_{ij}$ are independent random errors. This is the model of Section 12.3.2 but with the additional assumption of no interactions between blocks and treatments. Interest is focused on the $\alpha_i$. From Theorem A of Section 12.3.2, if there is no interaction,
$$E(MS_A) = \sigma^2 + \frac{J}{I-1}\sum_{i=1}^{I}\alpha_i^2$$
$$E(MS_B) = \sigma^2 + \frac{I}{J-1}\sum_{j=1}^{J}\beta_j^2$$
$$E(MS_{AB}) = \sigma^2$$
Thus, $\sigma^2$ can be estimated from $MS_{AB}$. Also, since these mean squares are independently distributed, $F$ tests can be performed to test $H_A$ or $H_B$. For example, to test
$$H_A: \alpha_i = 0, \quad i = 1, \ldots, I$$
this statistic can be used:
$$F = \frac{MS_A}{MS_{AB}}$$
From Theorem B in Section 12.3.2, under $H_A$, the statistic follows an $F$ distribution with $I-1$ and $(I-1)(J-1)$ degrees of freedom. $H_B$ may be tested similarly but is not usually of interest. Note that if, contrary to the assumption, there is an interaction, then
$$E(MS_{AB}) = \sigma^2 + \frac{1}{(I-1)(J-1)}\sum_{i=1}^{I}\sum_{j=1}^{J}\delta_{ij}^2$$
and $MS_{AB}$ will tend to overestimate $\sigma^2$. This will cause the $F$ statistic to be smaller than it should be and will result in a test that is conservative; that is, the actual probability of type I error will be smaller than desired.

EXAMPLE A
Let us consider an experimental study of drugs to relieve itching (Beecher 1959). Five drugs were compared to a placebo and no drug with 10 volunteer male subjects aged 20–30. (Note that this set of subjects limits the scope of inference; from a statistical point of view, one cannot extrapolate the results of the experiment to older women, for example. Any such extrapolation could be justified only on grounds of medical judgment.) Each volunteer underwent one treatment per day, and the time order was randomized. Thus, individuals were "blocks." The subjects were given a drug (or placebo) intravenously, and then itching was induced on their forearms with cowage, an effective itch stimulus. The subjects recorded the duration of the itching. More details are in Beecher (1959). The following table gives the durations of the itching (in seconds):

Subject    No Drug    Placebo    Papaverine    Morphine    Aminophylline    Pentobarbital    Tripelennamine
BG           174        263         105           199           141              108              141
JF           224        213         103           143           168              341              184
BS           260        231         145           113            78              159              125
SI           255        291         103           225           164              135              227
BW           165        168         144           176           127              239              194
TS           237        121          94           144           114              136              155
GM           191        137          35            87            96              140              121
SS           100        102         133           120           222              134              129
MU           115         89          83           100           165              185               79
OS           189        433         237           173           168              188              317
Average    191.0      204.8       118.2         148.0         144.3            176.5            167.2

Figure 12.9 shows boxplots of the responses to the six treatments and to the control (no drug). Although the boxplot is probably not the ideal visual display of these data, since it takes no account of the blocking, Figure 12.9 does show some interesting aspects of the data. There is a suggestion that all the drugs had some effect and that papaverine was the most effective. There is a lot of scatter relative to the differences between the medians, and there are some outliers. It is interesting that the placebo responses have the greatest spread; this might be because some subjects responded to the placebo and some did not.

[Figure 12.9: Boxplots of durations of itching under seven treatments.]

We next construct an analysis of variance table for this experiment:

Source        df     SS        MS      F
Drugs          6     53013     8835   2.85
Subjects       9    103280    11476   3.71
Interaction   54    167130     3095
Total         69    323422

The $F$ statistic for testing differences between drugs is 2.85 with 6 and 54 df, corresponding to a p-value less than .025. The null hypothesis that there is no difference between subjects is not experimentally interesting.
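The test arithmetic is the same mean-square comparison as before, with $MS_{AB}$ standing in for the error mean square. A sketch using the table above (Python with SciPy):

```python
from scipy.stats import f

ms_drugs, df_drugs = 8835, 6     # MSA for drugs
ms_inter, df_inter = 3095, 54    # MSAB, the error estimate in this design

F = ms_drugs / ms_inter          # about 2.85
print(F, f.sf(F, df_drugs, df_inter))   # p-value, consistent with p < .025 above
```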
Figure 12.10 is a probability plot of the residuals from the two-way analysis of variance model. The residual in cell $ij$ is
$$r_{ij} = Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j = Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}$$
There is a slightly bowed character to the probability plot, indicating some skewness in the distribution of the residuals. But because the $F$ test is robust against moderate deviations from normality, we should not be overly concerned.

[Figure 12.10: Normal probability plot of residuals from two-way analysis of variance of data on duration of itching.]

Tukey's method may be applied to make multiple comparisons. Suppose that we want to compare the drug means, $\bar{Y}_{1\cdot}, \ldots, \bar{Y}_{7\cdot}$ ($I = 7$). These have expectations $\mu + \alpha_i$, where $i = 1, \ldots, I$, and each is an average over $J = 10$ independent observations. The error variance is estimated by $MS_{AB}$ with 54 df. Simultaneous 95% confidence intervals for all differences between drug means have half-widths of
$$q_{7,54}(.05)\frac{s}{\sqrt{J}} = 4.31\sqrt{\frac{3095}{10}} = 75.8$$
[Here we have used $q_{7,60}(.05)$ from Table 6 of Appendix B as an approximation to $q_{7,54}(.05)$.] Examining the table of means, we see that, at the 95% confidence level, we can conclude only that papaverine achieves a reduction of itching over the effect of a placebo. ■

12.3.4 A Nonparametric Method—Friedman's Test

This section presents a nonparametric method for the randomized block design. Like other nonparametric methods we have discussed, Friedman's test relies on ranks and does not make an assumption of normality. The test is very simple. Within each of the $J$ blocks, the observations are ranked. To test the hypothesis that there is no effect due to the factor corresponding to treatments ($I$), the following statistic is calculated:
$$SS_A = J\sum_{i=1}^{I}(\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot})^2$$
just as in the ordinary analysis of variance. Under the null hypothesis that there is no treatment effect and that the only effect is due to the randomization within blocks, the permutation distribution of the statistic can, in principle, be calculated. For sample sizes such as that of the itching experiment, a chi-square approximation to this distribution is perfectly adequate. The null distribution of
$$Q = \frac{12J}{I(I+1)}\sum_{i=1}^{I}(\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot})^2$$
is approximately chi-square with $I-1$ degrees of freedom.
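scipy.stats also implements Friedman's test. For brevity, the sketch below uses only the first three treatments and five subjects from the itching table; each treatment is passed as one argument, with one value per block.

```python
from scipy.stats import friedmanchisquare

# First five subjects: no drug, placebo, papaverine (seconds of itching)
no_drug    = [174, 224, 260, 255, 165]
placebo    = [263, 213, 231, 291, 168]
papaverine = [105, 103, 145, 103, 144]

Q, p = friedmanchisquare(no_drug, placebo, papaverine)
print(Q, p)    # Q is referred to a chi-square distribution with I - 1 = 2 df
```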
EXAMPLE A
To carry out Friedman's test on the data from the experiment on itching, we first construct the following table by ranking the durations of itching for each subject:

Subject    No Drug    Placebo    Papaverine    Morphine    Aminophylline    Pentobarbital    Tripelennamine
BG             5          7           1             6            3.5               2                3.5
JF             6          5           1             2            3                 7                4
BS             7          6           4             2            1                 5                3
SI             6          7           1             4            3                 2                5
BW             3          4           2             5            1                 7                6
TS             7          3           1             5            2                 4                6
GM             7          5           1             2            3                 6                4
SS             1          2           5             3            7                 6                4
MU             5          3           2             4            6                 7                1
OS             4          7           5             2            1                 3                6
Average     5.10       4.90        2.30          3.50          3.05             4.90             4.25

Note that we have handled ties in the usual way by assigning average ranks. From the preceding table, no drug, placebo, and pentobarbital have the highest average ranks. From these average ranks, we find $\bar{R}_{\cdot\cdot} = 4$, $\sum(\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot})^2 = 6.935$, and $Q = 14.86$. From Table 3 of Appendix B with 6 df, the p-value is less than .025. The nonparametric analysis also rejects the hypothesis that there is no drug effect. ■

Procedures for using Friedman's test for multiple comparisons are discussed by Miller (1981). When these methods are applied to the data from the experiment on itching, the conclusions reached are identical to those reached by the parametric analysis.

12.4 Concluding Remarks

The most complicated experimental design considered in this chapter was the two-way layout; more generally, a factorial design incorporates several factors with one or more observations per cell. With such a design, the concept of interaction becomes more complicated—there are interactions of various orders. For instance, in a three-factor experiment, there are two-factor and three-factor interactions. It is both interesting and useful that the two-factor interactions in a three-way layout can be estimated using only one observation per cell. To gain some insight into why factorial designs are effective, we can begin by considering a two-way layout, with each factor at five levels, no interaction, and one observation per cell. With this design, comparisons of two levels of any factor are based on 10 observations. A traditional alternative to this design is to do first an experiment comparing the levels of factor A and then another experiment comparing the levels of factor B. To obtain the same precision as is achieved by the two-way layout in this case, 25 observations in each experiment, or a total of 50 observations, would be needed. The factorial design achieves its economy by using the same observations to compare the levels of factor A as are used to compare the levels of factor B. The advantages of factorial designs become greater as the number of factors increases. For example, in an experiment with four factors, with each factor at two levels (which might be the presence or absence of some chemical, for example) and one observation per cell, there are 16 observations that may be used to compare the levels of each factor. Furthermore, it can be shown that two- and three-factor interactions can be estimated. By comparison, if each of the four factors were investigated in a separate experiment, 64 observations would be required to attain the same precision. As the number of factors increases, the number of observations necessary for a factorial experiment with only one observation per cell grows very rapidly. To decrease the cost of an experiment, certain cells, designated in a systematic way, can be left empty, and the main effects and some interactions can still be estimated. Such arrangements are called fractional factorial designs. Similarly, with a randomized block design, the individual blocks may not be able to accommodate all the treatments. For example, in a chemical experiment that compares a large number of treatments, the blocks of the experiment, batches of raw material of uniform quality, may not be large enough. In such situations, incomplete block designs may be used to retain the advantages of blocking.

The basic theoretical assumptions underlying the analysis of variance are that the errors are independent and normally distributed with constant variance. Because we cannot fully check the validity of these assumptions in practice and can probably detect only gross violations, it is natural to ask how robust the procedures are with respect to violations of the assumptions. It is impossible to give a complete and conclusive answer to this question. Generally speaking, the independence assumption is probably the most important (and this is true for nonparametric procedures as well). The $F$ test is robust against moderate departures from normality; if the design is balanced, the $F$ test is also robust against unequal error variance. For further reading, Box, Hunter, and Hunter (1978) is recommended.

12.5 Problems
1. Simulate observations like those of Figure 12.1 under the null hypothesis of no treatment effects. That is, simulate seven batches of ten normally distributed random numbers with mean 4 and variance .0037. Make parallel boxplots of these seven batches like those of Figure 12.1. Do this several times. Your figures display the kind of variability that random fluctuations can cause; do you see any pairs of labs that appear quite different in either mean level or dispersion?

2. Verify that if $I = 2$, the estimate $s_p^2$ of Theorem A of Section 11.2.1 is the $s_p^2$ given in Section 12.2.1.

3. For a one-way analysis of variance with $I = 2$ treatment groups, show that the $F$ statistic is $t^2$, where $t$ is the usual $t$ statistic for a two-sample case.

4. Prove the analogues of Theorems A and B in Section 12.2.1 for the case of unequal numbers of observations in the cells of a one-way layout.

5. Derive the likelihood ratio test for the null hypothesis of the one-way layout, and show that it is equivalent to the $F$ test.

6. Prove this version of the Bonferroni inequality:
$$P\left(\bigcap_{i=1}^{n}A_i\right) \ge 1 - \sum_{i=1}^{n}P(A_i^c)$$
(Use Venn diagrams if you wish.) In the context of simultaneous confidence intervals, what is $A_i$ and what is $A_i^c$?

7. Show that, as claimed in Theorem B of Section 12.2.1, $SS_B/\sigma^2 \sim \chi^2_{I-1}$.

8. Form simultaneous confidence intervals for the difference of the mean of lab 1 and those of labs 4, 5, and 6 in Example A of Section 12.2.2.1.

9. Compare the tables of the $t$ distribution and the studentized range in Appendix B. For example, consider the column corresponding to $t_{.95}$; multiply the numbers in that column by $\sqrt{2}$ and observe that you get the numbers in the column $t = 2$ of the table of $q_{.90}$. Why is this?

10. Suppose that in a one-way layout there are 10 treatments and seven observations under each treatment. What is the ratio of the length of a simultaneous confidence interval for the difference of two means formed by Tukey's method to that of one formed by the Bonferroni method? How do both of these compare in length to an interval based on the $t$ distribution that does not take account of multiple comparisons?

11. Consider a hypothetical two-way layout with four factors (A, B, C, D) each at three levels (I, II, III). Construct a table of cell means for which there is no interaction.

12. Consider a hypothetical two-way layout with three factors (A, B, C) each at two levels (I, II). Is it possible for there to be interactions but no main effects?

13. Show that for comparing two groups the Kruskal-Wallis test is equivalent to the Mann-Whitney test.

14. Show that for comparing two groups Friedman's test is equivalent to the sign test.

15. Show the equality of the two forms of $K$ given in Section 12.2.3:
$$K = \frac{12}{N(N+1)}\sum_{i=1}^{I}J_i(\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot})^2 = \frac{12}{N(N+1)}\sum_{i=1}^{I}J_i\bar{R}_{i\cdot}^2 - 3(N+1)$$

16. Prove the sums of squares identity for the two-way layout:
$$SS_{TOT} = SS_A + SS_B + SS_{AB} + SS_E$$

17. Find the mle's of the parameters $\alpha_i$, $\beta_j$, $\delta_{ij}$, and $\mu$ of the model for the two-way layout.

18. The table below gives the energy use of five gas ranges for seven menu days. (The units are equivalent kilowatt-hours; .239 kWh = 1 ft³ of natural gas.) Estimate main effects and discuss interaction, paralleling the discussion of Section 12.3.

Menu Day    Range 1    Range 2    Range 3    Range 4    Range 5
1             8.25       8.26       6.55       8.21       6.69
2             5.12       4.81       3.87       4.81       3.99
3             5.32       4.37       3.76       4.67       4.37
4             8.00       6.50       5.38       6.51       5.60
5             6.97       6.26       5.03       6.40       5.60
6             7.65       5.84       5.23       6.24       5.73
7             7.86       7.31       5.87       6.64       6.03
19. Develop a parametrization for a balanced three-way layout. Define main effects and two-factor and three-factor interactions, and discuss their interpretation. What linear constraints do the parameters satisfy?

20. This problem introduces a random effects model for the one-way layout. Consider a balanced one-way layout in which the $I$ groups being compared are regarded as being a sample from some larger population. The random effects model is
$$Y_{ij} = \mu + A_i + \varepsilon_{ij}$$
where the $A_i$ are random and independent of each other with $E(A_i) = 0$ and $\mathrm{Var}(A_i) = \sigma_A^2$. The $\varepsilon_{ij}$ are independent of the $A_i$ and of each other, and $E(\varepsilon_{ij}) = 0$ and $\mathrm{Var}(\varepsilon_{ij}) = \sigma_\varepsilon^2$.

To fix these ideas, we can consider an example from Davies (1960). The variation of the strength (coloring power) of a dyestuff from one manufactured batch to another was studied. Strength was measured by dyeing a square of cloth with a standard concentration of dyestuff under carefully controlled conditions and visually comparing the result with a standard. The result was numerically scored by a technician. Large samples were taken from six batches of a dyestuff; each sample was well mixed, and from each six subsamples were taken. These 36 subsamples were submitted to the laboratory in random order over a period of several weeks for testing as described. The percentage strengths of the dyestuff are given in the following table.

Batch    Subsample 1    Subsample 2    Subsample 3    Subsample 4    Subsample 5    Subsample 6
I            94.5           93.0           91.0           89.0           96.5           88.0
II           89.0           90.0           92.5           88.5           91.5           91.5
III          88.5           93.5           93.5           88.0           92.5           91.5
IV          100.0           99.0          100.0           98.0           95.0           97.5
V            91.5           93.0           90.0           92.5           89.0           91.0
VI           98.5          100.0           98.0          100.0           96.5           98.0

There are two sources of variability in these numbers: batch-to-batch variability and measurement variability. It is hoped that variability between subsamples has been eliminated by the mixing. We will consider the random effects model,
$$Y_{ij} = \mu + A_i + \varepsilon_{ij}$$
Here, $\mu$ is the overall mean level, $A_i$ is the random effect of the $i$th batch, and $\varepsilon_{ij}$ is the measurement error on the $j$th subsample from the $i$th batch. We assume that the $A_i$ are independent of each other and of the measurement errors, with $E(A_i) = 0$ and $\mathrm{Var}(A_i) = \sigma_A^2$. The $\varepsilon_{ij}$ are assumed to be independent of each other and to have mean 0 and variance $\sigma_\varepsilon^2$. Thus,
$$\mathrm{Var}(Y_{ij}) = \sigma_A^2 + \sigma_\varepsilon^2$$
Large variability in the $Y_{ij}$ could be caused by large variability among batches, large measurement error, or both. The former could be decreased by changing the manufacturing process to make the batches more homogeneous, and the latter by controlling the scoring process more carefully.
a. Show that for this model
$$E(MS_W) = \sigma_\varepsilon^2 \qquad E(MS_B) = \sigma_\varepsilon^2 + J\sigma_A^2$$
and that therefore $\sigma_\varepsilon^2$ and $\sigma_A^2$ can be estimated from the data. Calculate these estimates.
b. Suppose that the samples had not been mixed, but that duplicate measurements had been made on each subsample. Formulate a model that also incorporates variability between subsamples. How could the parameters of this model be estimated?

21. During each of four experiments on the use of carbon tetrachloride as a worm killer, ten rats were infested with larvae (Armitage 1983). Eight days later, five rats were treated with carbon tetrachloride; the other five were kept as controls. After two more days, all the rats were killed and the numbers of worms were counted. The table below gives the counts of worms for the four control groups.
Significant differences, although not expected, might be attributable to changes in experimental conditions. A finding of significant differences could result in more carefully controlled experimentation and thus greater precision in later work. Use both graphical techniques and the $F$ test to test whether there are significant differences among the four groups. Use a nonparametric technique as well.

Group I    Group II    Group III    Group IV
  279        378          172          381
  338        275          335          346
  334        412          335          340
  198        265          282          471
  303        286          250          318

22. Referring to Section 12.2, the file tablets gives the measurements on chlorpheniramine maleate tablets from another manufacturer. Are there systematic differences between the labs? If so, which pairs differ significantly? How do these data compare to those given for the other manufacturer in Section 12.2?

23. For a study of the release of luteinizing hormone (LH), male and female rats kept in constant light were compared to male and female rats in a regime of 14 h of light and 10 h of darkness. Various dosages of luteinizing releasing factor (LRF) were given: control (saline), 10, 50, 250, and 1250 ng. Levels of LH (in nanograms per milliliter of serum) were measured in blood samples at a later time. Analyze the data given in the files LHfemale and LHmale to determine the effects of light regime and LRF on release of LH for both males and females. Use both graphical techniques and more formal analyses.

24. A collaborative study was conducted to study the precision and homogeneity of a method of determining the amount of niacin in cereal products (Campbell and Pelletier 1962). Homogenized samples of bread and bran flakes were enriched with 0, 2, 4, or 8 mg of niacin per 100 g. Portions of the samples were sent to 12 labs, which were asked to carry out the specified procedures on each of three separate days. The data (in milligrams per 100 g) are given in the file niacin. Conduct two-way analyses of variance for both the bread and bran data and discuss the results. (Two data points are missing. Substitute for them the corresponding cell means.)

25. This problem deals with an example from Youden (1962). An ingot of magnesium alloy was drawn into a square rod about 100 m long with a cross section of about 4.5 cm on a side. The rod was then cut into 100 bars, each a meter long. Five of these were selected at random, and a test piece 1.2 cm thick was cut from each. From each of these five specimens, 10 test points were selected in a particular geometric pattern. Two determinations of the magnesium content were made at each test point (the analyst ran all 50 points once and then made a set of repeat measurements). The overall purpose of the experiment was to test for homogeneity of magnesium content in the different bars and different locations. Analyze the data in the file magnesium (giving percentage of magnesium times 1000) to determine if there is significant variability between bars and between locations. There are a couple of unexpected aspects of these data—can you find them?

26. The concentrations (in nanograms per milliliter) of plasma epinephrine were measured for 10 dogs under isofluorane, halothane, and cyclopropane anesthesia; the measurements are given in the following table (Perry et al. 1974). Is there a difference in treatment effects? Use a parametric and a nonparametric analysis.
                Dog 1    Dog 2    Dog 3    Dog 4    Dog 5    Dog 6    Dog 7    Dog 8    Dog 9    Dog 10
Isofluorane      .28      .51     1.00      .39      .29      .36      .32      .69      .17      .33
Halothane        .30      .39      .63      .68      .38      .21      .88      .39      .51      .32
Cyclopropane    1.07     1.35      .69      .28     1.24     1.53      .49      .56     1.02      .30

27. Three species of mice were tested for "aggressiveness." The species were A/J, C57, and F2 (a cross of the first two species). A mouse was placed in a 1-m² box, which was marked off into 49 equal squares. The mouse was let go on the center square, and the number of squares traversed in a 5-min period was counted. Analyze the data in the files C57, AJ, and F2, using the Bonferroni method, to determine if there is a significant difference among species.

28. Samples of each of three types of stopwatches were tested. The following table gives thousands of cycles (on-of